期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An improved parallel hybrid bi-conjugate gradient method suitable for distributed parallel computing

Tong-Xiang Gu Xian-Yu Zuo Xing-Ping Liu Pei-Lu Li 《Journal of Computational and Applied Mathematics》2009

An improved parallel hybrid bi-conjugate gradient method (IBiCGSTAB(2) method, in brief) for solving large sparse linear systems with nonsymmetric coefficient matrices is proposed for distributed parallel environments. The method reduces five global synchronization points to two by reconstructing the BiCGSTAB(2) method in [G.L.G. Sleijpen, H.A. van der Vorst, Hybrid bi-conjugate gradient methods for CFD problems, in: M. Hafez, K. Oshima (Eds.), Computational Fluid Dynamics Review 1995, John Wiley & Sons Ltd, Chichester, 1995, pp. 457–476] and the communication time required for the inner product can be efficiently overlapped with useful computation. The cost is only slightly increased computation time, which can be ignored, compared with the reduction of communication time. Performance and isoefficiency analysis shows that the IBiCGSTAB(2) method has better parallelism and scalability than the BiCGSTAB(2) method. Numerical experiments show that the scalability can be improved by a factor greater than 2.5 and the improvement in parallel communication performance approaches 60%. 相似文献

2.

Non-uniform data distribution for communication-efficient parallel clustering

《Journal of computational science》2013,4(6):489-495

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature of the centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements. 相似文献

3.

A preconditioning technique for Schur complement systems arising in stochastic optimization

Cosmin G. Petra Mihai Anitescu 《Computational Optimization and Applications》2012,52(2):315-344

Deterministic sample average approximations of stochastic programming problems with recourse are suitable for a scenario-based parallelization. In this paper the parallelization is obtained by using an interior-point method and a Schur complement mechanism for the interior-point linear systems. However, the direct linear solves involving the dense Schur complement matrix are expensive, and adversely affect the scalability of this approach. We address this issue by proposing a stochastic preconditioner for the Schur complement matrix and by using Krylov iterative methods for the solution of the dense linear systems. The stochastic preconditioner is built based on a subset of existing scenarios and can be assembled and factorized on a separate process before the computation of the Schur complement matrix finishes on the remaining processes. The expensive factorization of the Schur complement is removed from the parallel execution flow and the scaling of the optimization solver is considerably improved with this approach. The spectral analysis indicates an exponentially fast convergence in probability to 1 of the eigenvalues of the preconditioned matrix with the number of scenarios incorporated in the preconditioner. Numerical experiments performed on the relaxation of a unit commitment problem show good performance, in terms of both the accuracy of the solution and the execution time. 相似文献

4.

An architecture independent study of parallel segment trees

Alexandros V. Gerbessiotis 《Journal of Discrete Algorithms》2006,4(1):1-24

We study the computation, communication and synchronization requirements related to the construction and search of parallel segment trees in an architecture independent way. Our proposed parallel algorithms are optimal in space and time compared to the corresponding sequential algorithms utilized to solve the introduced problems and are described in the context of the bulk-synchronous parallel (BSP) model of computation. Our methods are more scalable and can thus be made to work for larger values of processor size p relative to problem size n than other segment tree related algorithms that have been described on other realistic distributed-memory parallel models and also provide a natural way to approach searching problems on latency-tolerant models of computation that maintains a balanced query load among the processors. 相似文献

5.

LOCAL PRESSURE CORRECTION FOR THE STOKES SYSTEM

Malte Braack Utku Kaya 《计算数学(英文版)》2020,38(1):125-141

Pressure correction methods constitute the most widely used solvers for the timedependent Navier-Stokes equations.There are several different pressure correction methods,where each time step usually consists in a predictor step for a non-divergence-free velocity,followed by a Poisson problem for the pressure(or pressure update),and a final velocity correction to obtain a divergence-free vector field.In some situations,the equations for the velocities are solved explicitly,so that the numerical most expensive step is the elliptic pressure problem.We here propose to solve this Poisson problem by a domain decomposition method which does not need any communication between the sub-regions.Hence,this system is perfectly adapted for parallel computation.We show under certain assumptions that this new scheme has the same order of convergence as the original pressure correction scheme(with global projection).Numerical examples for the Stokes system show the effectivity of this new pressure correction method.The convergence order O(k^2)for resulting velocity fields can be observed in the norm l^2(0,T;L^2(Ω)). 相似文献

6.

Balanced Truncation Model Reduction of Large-Scale Dense Systems on Parallel Computers

PETER BENNER ENRIQUE S. QUINTANA-ORTí GREGORIO QUINTANA-ORTí 《Mathematical and Computer Modelling of Dynamical Systems: Methods, Tools and Applications in Engineering and Related Sciences》2013,19(4):383-405

Model reduction is an area of fundamental importance in many modeling and control applications. In this paper we analyze the use of parallel computing in model reduction methods based on balanced truncation of large-scale dense systems. The methods require the computation of the Gramians of a linear-time invariant system. Using a sign function-based solver for computing full-rank factors of the Gramians yields some favorable computational aspects in the subsequent computation of the reduced-order model, particularly for non-minimal systems. As sign function-based computations only require efficient implementations of basic linear algebra operations readily available, e.g., in the BLAS, LAPACK, and ScaLAPACK, good performance of the resulting algorithms on parallel computers is to be expected. Our experimental results on a PC cluster show the performance and scalability of the parallel implementation. 相似文献

7.

Constructing disjoint paths on expander graphs

D. Peleg E. Upfal 《Combinatorica》1989,9(3):289-313

In a typical parallel or distributed computation model processors are connected by a spars interconnection network. To establish open-line communication between pairs of processors that wish to communicate interactively, a set of disjoint paths has to be constructed on the network. Since communication needs vary in time, paths have to be dynamically constructed and destroyed.We study the complexity of constructing disjoint paths between given pairs of vertices on expander interconnection graphs. These graphs have been shown before to possess desirable properties for other communication tasks.We present a sufficient condition for the existence ofKn ^Q edge-disjoint paths connecting any set ofK pairs of vertices on an expander graph, wheren is the number of vertices and<1 is some constant. We then show that the computational problem of constructing these paths lies in the classes Deterministic-P and Random-P C.Furthermore, we show that the set of paths can be constructed in probabilistic polylog time in the parallel-distributed model of computation, in which then participating processors reside in the nodes of the communication graph and all communication is done through edges of the graph. Thus, the disjoint paths are constructed in the very computation model that uses them.Finally, we show how to apply variants of our parallel algorithms to find sets ofvertex-disjoint paths when certain conditions are satisfied.Supported in part by a Weizmann fellowship and by contract ONR N00014-85-C-0731. 相似文献

8.

Bulk-Synchronous Parallel Gaussian Elimination

A. Tiskin 《Journal of Mathematical Sciences》2002,108(6):977-991

The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. We study the BSP complexity of Gaussian elimination and related problems. First, we analyze the Gaussian elimination without pivoting, which can be applied to the LU decomposition of symmetric positive-definite or diagonally dominant real matrices. Then we analyze the Gaussian elimination with Schönhage's recursive local pivoting suitable for the LU decomposition of matrices over a finite field, and for the QR decomposition of real matrices by the Givens rotations. Both versions of Gaussian elimination can be performed with an optimal amount of local computation, but optimal communication and synchronization costs cannot be achieved simultaneously. The algorithms presented in the paper allow one to trade off communication and synchronization costs in a certain range, achieving optimal or near-optimal cost values at the extremes. Bibliography: 19 titles. 相似文献

9.

Algebraic interface‐based coarsening AMG preconditioner for multi‐scale sparse matrices with applications to radiation hydrodynamics computation

下载免费PDF全文

Xiaowen Xu Zeyao Mo 《Numerical Linear Algebra with Applications》2017,24(2)

Coarsening is a crucial component of algebraic multigrid (AMG) methods for iteratively solving sparse linear systems arising from scientific and engineering applications. Its application largely determines the complexity of the AMG iteration operator. Usually, high operator complexities lead to fast convergence of the AMG method; however, they require additional memory and as such do not scale as well in parallel computation. In contrast, although low operator complexities improve parallel scalability, they often lead to deterioration in convergence. This study introduces a new type of coarsening strategy called algebraic interface‐based coarsening that yields a better balance between convergence and complexity for a class of multi‐scale sparse matrices. Numerical results for various model‐type problems and a radiation hydrodynamics practical application are provided to show the effectiveness of the proposed AMG solver. 相似文献

10.

Minimizing the total weighted late work in scheduling of identical parallel processors with communication delays

《Applied Mathematical Modelling》2014,38(15-16):3975-3986

This paper addresses a certain type of scheduling problem that arises when a parallel computation is to be executed on a set of identical parallel processors. It is assumed that if two precedence-related tasks are processed on two different processors, due to the information transferring, there will be a task-dependent communication delay between them. For each task, a processing time, a due date and a weight is given while the goal is to minimize the total weighted late work. An integer linear mathematical programming model and a branch-and-bound algorithm have been developed for the proposed problem. Comparing the results obtained by the proposed branch-and-bound algorithm with those obtained by CPLEX, indicates the effectiveness of the method. 相似文献