Similar Documents: 20 results found.
1.
Recent developments in high-performance computer architecture have a significant effect on all fields of scientific computing. Linear algebra, and especially the solution of linear systems of equations, lies at the heart of many applications in scientific computing. This paper describes and analyzes three parallel versions of dense direct methods, namely Gaussian elimination and the LU form of Gaussian elimination, used for linear system solving on a multicore CPU with OpenMP. More specifically, we present two naive parallel algorithms based on row-block and row-cyclic data distribution, and we put special emphasis on a third parallel algorithm based on the pipeline technique. Further, we propose an implementation of the pipelining technique in OpenMP. Experimental results on a multicore CPU show that the proposed OpenMP pipeline implementation achieves good overall performance compared to the two naive parallel methods. Finally, we propose a simple, fast, and reasonably accurate analytical model to predict the performance of the direct methods with the pipelining technique.
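For orientation, the row-cyclic variant mentioned above can be sketched in a few lines of OpenMP C. This is a minimal illustration written for this listing, not the authors' code: pivoting is omitted (the test matrix is symmetric positive definite, so none is needed), and the pipelined variant would replace the per-step barrier with point-to-point signalling between threads.

/* Row-cyclic parallel LU factorization (no pivoting): a minimal sketch of
 * one of the naive variants described above, not the paper's code. */
#include <omp.h>
#include <stdio.h>

/* In-place LU of a dense n x n row-major matrix a; rows are distributed
 * cyclically over the OpenMP threads, multipliers are stored in place of L. */
void lu_row_cyclic(double *a, int n)
{
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int p = omp_get_num_threads();
        for (int k = 0; k < n - 1; k++) {
            /* each thread eliminates the rows it owns below pivot row k */
            for (int i = k + 1; i < n; i++) {
                if (i % p != t) continue;          /* cyclic ownership test */
                double m = a[i*n + k] / a[k*n + k];
                a[i*n + k] = m;                    /* store multiplier (L) */
                for (int j = k + 1; j < n; j++)
                    a[i*n + j] -= m * a[k*n + j];
            }
            #pragma omp barrier    /* row k+1 must be final before step k+1 */
        }
    }
}

int main(void)
{
    int n = 4;   /* symmetric positive definite, so no pivoting is needed */
    double a[16] = { 4,3,2,1,  3,4,3,2,  2,3,4,3,  1,2,3,4 };
    lu_row_cyclic(a, n);
    for (int i = 0; i < n; i++, puts(""))
        for (int j = 0; j < n; j++) printf("%8.4f ", a[i*n + j]);
    return 0;
}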

2.
The solution of linear systems continues to play an important role in scientific computing. The problems to be solved are often very large, so that solving them requires substantial computer resources: supercomputers with large shared memory or massively parallel computer systems with distributed memory are needed.

This paper gives a survey of research on parallel implementations of various direct methods for solving dense linear systems. In particular, the following are considered: Gaussian elimination, Gauss-Jordan elimination and a variant due to Huard (1979), and an algorithm due to Enright (1978) designed for solving (stiff) ODEs, in which the stepsize and other method parameters can easily be varied.

Some theoretical results are mentioned, including a new result on error analysis of Huard's algorithm. Moreover, practical considerations and results of experiments on supercomputers and on a distributed-memory computer system are presented.
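As a point of reference for the survey, the sequential baseline behind all of these parallel variants is plain Gaussian elimination with partial pivoting; the compact sketch below is an editorial illustration, not code from the survey.

/* Gaussian elimination with partial pivoting plus back substitution:
 * solves A x = b in place; A is n x n, row-major. */
#include <math.h>
#include <stdio.h>

void gauss_solve(double *A, double *b, double *x, int n)
{
    for (int k = 0; k < n - 1; k++) {
        /* partial pivoting: move the largest |entry| of column k to row k */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[piv*n + k])) piv = i;
        if (piv != k) {
            for (int j = 0; j < n; j++) {
                double t = A[k*n + j]; A[k*n + j] = A[piv*n + j]; A[piv*n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        /* eliminate column k below the diagonal */
        for (int i = k + 1; i < n; i++) {
            double m = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; j++) A[i*n + j] -= m * A[k*n + j];
            b[i] -= m * b[k];
        }
    }
    /* back substitution */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
}

int main(void)
{
    double A[9] = { 2,1,1,  1,3,2,  1,0,0 }, b[3] = { 4, 5, 6 }, x[3];
    gauss_solve(A, b, x, 3);
    printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* expect 6 15 -23 */
    return 0;
}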


3.
This paper presents a parallel algorithm for computing the inverse of a dense matrix based on Gauss-Jordan elimination. The algorithm is designed for implementation on a linear processor array operating in a pipeline fashion. Two types of architecture are considered: one using serial data transfer (AP/S) and one using parallel data transfer (AP/P) between neighboring processors. The speedups of AP/S and AP/P are O(n/2) and O(4n/5), respectively.
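The AP/S and AP/P mappings are hardware-specific, but the computation being pipelined is ordinary Gauss-Jordan inversion. A sequential sketch follows (an editorial illustration with partial pivoting added; the processor-array mapping itself is not reproduced):

/* Sequential Gauss-Jordan inversion with partial pivoting. Builds the
 * augmented matrix [A | I] and reduces A to the identity, which turns the
 * right half into A^{-1}. Clearing column k both ABOVE and below the pivot
 * is what distinguishes Gauss-Jordan from plain Gaussian elimination. */
#include <math.h>
#include <stdio.h>
#include <string.h>

int gauss_jordan_inverse(const double *a, double *inv, int n)
{
    double aug[n][2*n];                 /* C99 VLA; fine for sketch-sized n */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < 2*n; j++)
            aug[i][j] = (j < n) ? a[i*n + j] : (j - n == i);

    for (int k = 0; k < n; k++) {
        int piv = k;                    /* partial pivoting in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(aug[i][k]) > fabs(aug[piv][k])) piv = i;
        if (aug[piv][k] == 0.0) return -1;            /* singular matrix */
        if (piv != k)
            for (int j = 0; j < 2*n; j++) {
                double t = aug[k][j]; aug[k][j] = aug[piv][j]; aug[piv][j] = t;
            }
        double d = aug[k][k];           /* normalize the pivot row ... */
        for (int j = 0; j < 2*n; j++) aug[k][j] /= d;
        for (int i = 0; i < n; i++) {   /* ... and clear column k elsewhere */
            if (i == k) continue;
            double m = aug[i][k];
            for (int j = 0; j < 2*n; j++) aug[i][j] -= m * aug[k][j];
        }
    }
    for (int i = 0; i < n; i++)
        memcpy(&inv[i*n], &aug[i][n], n * sizeof(double));
    return 0;
}

int main(void)
{
    double a[4] = { 4, 7, 2, 6 }, inv[4];
    if (gauss_jordan_inverse(a, inv, 2) == 0)   /* expect 0.6 -0.7 -0.2 0.4 */
        printf("%g %g\n%g %g\n", inv[0], inv[1], inv[2], inv[3]);
    return 0;
}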

4.
Parallel computing is now an essential paradigm for high performance scientific computing. Most existing hardware and software solutions are expensive or difficult to use. We developed Playdoh, a Python library for distributing computations across the free computing units available in a small network of multicore computers. Playdoh supports independent and loosely coupled parallel problems such as global optimisations, Monte Carlo simulations and numerical integration of partial differential equations. It is designed to be lightweight and easy to use and should be of interest to scientists wanting to turn their lab computers into a small cluster at no cost.

5.
Model reduction is an area of fundamental importance in many modeling and control applications. In this paper we analyze the use of parallel computing in model reduction methods based on balanced truncation of large-scale dense systems. The methods require the computation of the Gramians of a linear time-invariant system. Using a sign-function-based solver to compute full-rank factors of the Gramians yields favorable computational properties in the subsequent computation of the reduced-order model, particularly for non-minimal systems. As sign-function-based computations only require efficient implementations of basic linear algebra operations readily available in, e.g., BLAS, LAPACK, and ScaLAPACK, good performance of the resulting algorithms on parallel computers is to be expected. Our experimental results on a PC cluster show the performance and scalability of the parallel implementation.
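In its classical form the sign-function approach is Roberts' Newton iteration applied to the Lyapunov equation defining each Gramian. A sketch of the idea for the controllability Gramian P of a stable pair (A, B), i.e. the solution of A P + P A^T + B B^T = 0 (the authors compute full-rank factors of P rather than P itself, which this sketch does not show):

    A_0 = A, \quad Q_0 = B B^{T},
    A_{k+1} = \tfrac{1}{2}\,( A_k + A_k^{-1} ),
    Q_{k+1} = \tfrac{1}{2}\,( Q_k + A_k^{-1} Q_k A_k^{-T} ),
    P = \tfrac{1}{2}\,\lim_{k \to \infty} Q_k .

Each step uses only matrix inversion, multiplication, and addition, precisely the BLAS/LAPACK-friendly building blocks the abstract alludes to.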

6.
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. We study the BSP complexity of Gaussian elimination and related problems. First, we analyze Gaussian elimination without pivoting, which can be applied to the LU decomposition of symmetric positive-definite or diagonally dominant real matrices. Then we analyze Gaussian elimination with Schönhage's recursive local pivoting, suitable for the LU decomposition of matrices over a finite field and for the QR decomposition of real matrices by Givens rotations. Both versions of Gaussian elimination can be performed with an optimal amount of local computation, but optimal communication and synchronization costs cannot be achieved simultaneously. The algorithms presented in the paper allow one to trade off communication and synchronization costs within a certain range, achieving optimal or near-optimal cost values at the extremes. Bibliography: 19 titles.
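For readers new to the model: in standard BSP cost accounting, an algorithm that runs in S supersteps, performing at most w_s local operations and an h_s-relation of communication in superstep s, has total cost

    T_{\mathrm{BSP}} = \sum_{s=1}^{S} ( w_s + g\,h_s + l ) = W + g\,H + l\,S ,

where g is the machine's communication gap and l its synchronization latency. This is the generic model, not the paper's specific bounds; the trade-off described above is between the communication term gH and the synchronization term lS.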

7.
Software pipelining achieves effective fine-grained scheduling by overlapping successive iterations of a loop. Its performance, however, may be limited by a shortage of parallel operations within the loop or by resource requirements. The "unroll before scheduling" technique unrolls a loop prior to software pipelining, exposing more parallel operations and making fuller use of critical resources. This paper studies how loop unrolling affects the performance and resource utilization of software pipelining, and further proposes how to choose an optimized unroll factor.
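To make the transformation concrete: unrolling a loop four times before scheduling looks as follows (a generic illustration, not the paper's compiler output). The unrolled body hands the software pipeliner four independent operations per iteration to overlap.

/* Manual 4-way unrolling of a saxpy loop: the "unroll before scheduling"
 * transformation discussed above. */
void saxpy_unrolled(float *y, const float *x, float a, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body: 4 independent updates */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)             /* remainder loop for leftover iterations */
        y[i] += a * x[i];
}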

8.
This article is devoted to the development and study of an algorithm for solving large systems of linear algebraic equations with a sparse stiffness matrix on a supercomputer using the preconditioned conjugate gradient (PCG) method. An efficient preconditioner is constructed on the basis of the domain decomposition method (the additive Schwarz method), which makes it possible to run the algorithm on several computing nodes. We describe the parallel algorithm for applying the stiffness matrix and the preconditioner to a vector. In addition, to increase computational efficiency we make use of routines from Intel® MKL: the direct solver (PARDISO) and matrix-vector multiplication for sparse matrices (Sparse BLAS). We also study the efficiency of using OpenMP directives on each computational node and compare it with pure MPI parallelization. The corresponding performance and scalability charts are presented.
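The outer PCG iteration is standard and is sketched below in serial form. Here csr_matvec stands in for the parallel stiffness-matrix action and apply_precond for the additive Schwarz preconditioner (whose subdomain solves the paper delegates to PARDISO); both names are illustrative placeholders, not the paper's code.

/* Preconditioned conjugate gradient for S x = b with S symmetric positive
 * definite in CSR format. */
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int n, *rowptr, *col; double *val; } CSR;

static void csr_matvec(const CSR *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}

static double dot(const double *a, const double *b, int n)
{ double s = 0.0; for (int i = 0; i < n; i++) s += a[i] * b[i]; return s; }

/* Placeholder preconditioner (identity); an additive Schwarz preconditioner
 * would solve local subdomain problems here, e.g. with a direct solver. */
static void apply_precond(const double *r, double *z, int n)
{ memcpy(z, r, n * sizeof(double)); }

int pcg(const CSR *A, const double *b, double *x, int maxit, double tol)
{
    int n = A->n, iters = -1;
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

    csr_matvec(A, x, r);                       /* r = b - A x0 */
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
    apply_precond(r, z, n);
    memcpy(p, z, n * sizeof *p);
    double rz = dot(r, z, n);

    for (int it = 0; it < maxit; it++) {
        csr_matvec(A, p, q);
        double alpha = rz / dot(p, q, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        if (sqrt(dot(r, r, n)) < tol) { iters = it + 1; break; }
        apply_precond(r, z, n);
        double rz_new = dot(r, z, n);
        for (int i = 0; i < n; i++) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    free(r); free(z); free(p); free(q);
    return iters;              /* iterations used, or -1 if no convergence */
}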

9.
Parallel iterative methods are powerful tools for solving large systems of linear equations (LEs). Existing research on parallel computing focuses mainly on sparse systems or systems with particular structure, and most of it is based on parallel implementations of classical relaxation methods such as the Gauss-Seidel, SOR, and AOR methods, which can be carried out efficiently on multiprocessor systems. In this paper, we propose a novel parallel splitting operator method in which we divide the coefficient matrix into two or three parts. We then convert the original problem (LEs) into a monotone (linear) variational inequality problem (VI) with separable structure. Finally, an inexact parallel splitting augmented Lagrangian method is proposed to solve the VI. To avoid dealing with a matrix inverse operator, we introduce suitable inexact terms into the subproblems so that the complexity of each iteration of the proposed method is O(n²). In addition, the proposed method does not require any special structure in the system of LEs under consideration. Convergence of the proposed methods, for two and three separable operators respectively, is proved. Numerical computations are provided to show the applicability and robustness of the proposed methods.

10.
We are concerned with the efficient implementation of symplectic implicit Runge-Kutta (IRK) methods applied to systems of Hamiltonian ordinary differential equations by means of Newton-like iterations. We pay particular attention to time-symmetric symplectic IRK schemes (such as collocation methods with Gaussian nodes). For an s-stage IRK scheme used to integrate a d-dimensional system of ordinary differential equations, the application of simplified versions of Newton iterations requires solving several linear systems at each step (one per iteration) with the same sd × sd real coefficient matrix. We propose a technique that takes advantage of the symplecticity of the IRK scheme to reduce the cost of methods based on diagonalization of the IRK coefficient matrix. This is achieved by rewriting one step of the method centered at the midpoint of the integration subinterval and observing that the resulting coefficient matrix becomes similar to a skew-symmetric matrix. In addition, we propose a C implementation (based on Newton-like iterations) of Runge-Kutta collocation methods with Gaussian nodes that makes use of such a rewriting of the linear system and takes special care to reduce the effect of round-off errors. We report numerical experiments that demonstrate the reduced round-off error propagation of our implementation.
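The simplest member of the family discussed above is the 1-stage Gauss collocation scheme, i.e. the implicit midpoint rule, which is both symplectic and time-symmetric. A toy C illustration on the harmonic oscillator, solving the stage equation by fixed-point iteration rather than by the paper's transformed Newton-like iterations:

/* Implicit midpoint rule (1-stage Gauss collocation, symplectic and
 * time-symmetric) for the harmonic oscillator H(q,p) = (q^2 + p^2)/2.
 * The stage value Y = (y_n + y_{n+1})/2 solves Y = y_n + (h/2) f(Y). */
#include <math.h>
#include <stdio.h>

static void f(double q, double p, double *dq, double *dp)
{ *dq = p; *dp = -q; }                    /* Hamiltonian vector field */

static void midpoint_step(double *q, double *p, double h)
{
    double Qs = *q, Ps = *p;              /* fixed-point iterate for Y */
    for (int it = 0; it < 100; it++) {
        double dq, dp;
        f(Qs, Ps, &dq, &dp);
        double Qn = *q + 0.5 * h * dq, Pn = *p + 0.5 * h * dp;
        double err = fabs(Qn - Qs) + fabs(Pn - Ps);
        Qs = Qn; Ps = Pn;
        if (err < 1e-14) break;           /* stage equation solved */
    }
    *q = 2.0 * Qs - *q;                   /* y_{n+1} = 2 Y - y_n */
    *p = 2.0 * Ps - *p;
}

int main(void)
{
    double q = 1.0, p = 0.0, h = 0.1;
    for (int n = 0; n < 1000; n++) midpoint_step(&q, &p, h);
    /* the midpoint rule conserves this quadratic energy: expect H = 0.5 */
    printf("t = %g  H = %.15f\n", 1000 * h, 0.5 * (q*q + p*p));
    return 0;
}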

11.
A parallelizable and vectorizable algorithm for solving linear algebraic systems arising from two-point boundary value ODEs is described. The method is equivalent to Gaussian elimination, with row partial pivoting, applied to a certain column-reordered version of the usual almost-block-diagonal coefficient matrix. We present analytical and numerical evidence to show that the algorithm is stable in most circumstances. Results from implementation on a shared-memory multiprocessor and a vector processor are given. The approach can be extended to handle problems with multipoint and integral conditions or with algebraic parameters.

12.
A two-level OpenMP + MPI parallel implementation is used to numerically solve a model kinetic equation for problems with complex three-dimensional geometry. The scalability and robustness of the method are demonstrated by computing the classical gas flow through a circular pipe of finite length and the flow past a reentry vehicle model. It is shown that the two-level model significantly speeds up the computations and improves the scalability of the method.
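The two-level structure can be shown with a minimal MPI + OpenMP skeleton (a generic sketch on a toy reduction, not the kinetic solver itself): MPI distributes work across nodes, while OpenMP threads share each rank's slice.

/* Two-level MPI + OpenMP parallelism: one MPI rank per node, OpenMP threads
 * inside each rank. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* FUNNELED: only the master thread of each rank makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 10000000;
    long chunk = N / size, lo = rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* level 2: threads */
    for (long i = lo; i < hi; i++)
        local += 1.0 / ((double)i + 1.0);

    double global = 0.0;                          /* level 1: MPI ranks */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("harmonic sum H_N = %.12f\n", global);

    MPI_Finalize();
    return 0;
}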

13.
Andrews et al. [Automatic method for hiding latency in high bandwidth networks, in: Proceedings of the ACM Symposium on Theory of Computing, 1996, pp. 257-265; Improved methods for hiding latency in high bandwidth networks, in: Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 52-61] introduced a number of techniques for automatically hiding latency when performing simulations of networks with unit delay links on networks with arbitrary unequal delay links. In their work, they assume that processors of the host network are identical in computational power to those of the guest network being simulated. They further assume that the links of the host are able to pipeline messages, i.e., they are able to deliver P packets in time O(P+d), where d is the delay on the link.

In this paper we examine the effect of eliminating one or both of these assumptions. In particular, we provide an efficient simulation of a linear array of homogeneous processors connected by unit-delay links on a linear array of heterogeneous processors connected by links with arbitrary delay. We show that the slowdown achieved by our simulation is optimal. We then consider the case of simulating cliques by cliques; i.e., a clique of heterogeneous processors with arbitrary delay links is used to simulate a clique of homogeneous processors with unit delay links. We reduce the slowdown from the obvious bound of the maximum link delay to the average of the link delays. In the case of the linear array we consider links both with and without pipelining. For the clique simulation the links are not assumed to support pipelining.

The main motivation of our results (as was the case with Andrews et al.) is to mitigate the degradation of performance when executing parallel programs designed for different architectures on a network of workstations (NOW). In such a setting it is unlikely that the links provided by the NOW will support pipelining, and it is quite probable that the processors will be heterogeneous. Combining our result on clique simulation with well-known techniques for simulating shared-memory PRAMs on distributed-memory machines provides an effective automatic compilation of a PRAM algorithm on a NOW.

14.
The frontal method is a variant of Gaussian elimination that has been widely used since the mid-1970s. In the innermost loop of the computation the method exploits dense linear algebra kernels, which are straightforward to vectorize and parallelize. This makes the method attractive for modern computer architectures. However, unless the matrix can be ordered so that the front is never very large, frontal methods can require many more floating-point operations for factorization than other approaches. We are interested in matrices that have a highly asymmetric structure. We use the idea of a row graph of an unsymmetric matrix combined with a variant of Sloan's profile reduction algorithm to reorder the rows. We also look at applying the spectral method to the row graph. Numerical experiments performed on a range of practical problems illustrate that our proposed MSRO and hybrid MSRO row ordering algorithms yield substantial reductions in the front sizes and, when used with a frontal solver, significantly enhance its performance both in terms of the factorization time and storage requirements.

15.
An ascent exchange algorithm for computing the strict Chebyshev solution of general systems of linear equations is presented. It uses generalized exchange rules to ensure convergence and splits the entire system into subsystems by means of a canonical decomposition of a matrix obtained by Gaussian elimination. All updating procedures are developed, and several numerical examples illustrate the efficiency of the algorithm.

16.
Iain S. Duff, PAMM, 2007, 7(1): 1140501-1140502
Current problems in scientific computing often require the solution of sets of sparse linear equations of very large order. Particularly for three-dimensional problems, it may be impractical to solve them using a direct method, but equally, iterative methods may not converge, and standard preconditioning techniques often do not work. We thus propose using a hybrid method, by which we mean that either a nearby problem or a subproblem is solved by a direct method, with the overall problem being solved by an iterative method. We discuss two applications of this approach at CERFACS in the solution of large-scale problems from industry.

17.
We present recurrence formulas for computing approximate inverse factors of tridiagonal and pentadiagonal matrices using the bordering technique. The resulting algorithms are used to approximate the inverses of the pivot blocks needed for constructing block ILU preconditioners for solving block tridiagonal linear systems arising from the discretization of partial differential equations. The resulting preconditioners are suitable for parallel implementation. Comparisons with other methods are also included.
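The paper's bordering recurrences are not reproduced here. For orientation, the exact (non-approximate) LU factors of a tridiagonal matrix already satisfy a simple recurrence, the Thomas factorization shown below; the paper's formulas play the analogous role for approximate inverse factors.

/* Thomas recurrence: exact LU factorization of a tridiagonal matrix with
 * subdiagonal a (a[0] unused), diagonal d, and superdiagonal c. Produces a
 * unit lower bidiagonal factor l and upper bidiagonal pivots u. Shown for
 * context only; it is NOT the paper's approximate-inverse-factor recurrence. */
void tridiag_lu(const double *a, const double *d, const double *c,
                double *l, double *u, int n)
{
    u[0] = d[0];
    for (int i = 1; i < n; i++) {
        l[i] = a[i] / u[i-1];
        u[i] = d[i] - l[i] * c[i-1];
    }
}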

18.
The paper is devoted to developing an original, cost-efficient algorithm for solving the inverse problem of finding a variable magnetization in a rectangular parallelepiped. The problem is ill-posed and is described by a Fredholm integral equation. It is shown that, after discretization of the domain and approximation of the integral operator, the problem reduces to solving a system of linear algebraic equations with a Toeplitz-block-Toeplitz matrix. We have constructed a memory-efficient variant of the stabilized biconjugate gradient method, BiCGSTABmem. This optimized algorithm exploits the special structure of the matrix to reduce the memory requirements and computing time. Efficient implementations are developed for multicore CPUs and GPUs. A series of model problems with synthetic and real magnetic data is solved, and the efficiency and speedup of the parallel algorithm are investigated.
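The memory optimizations of BiCGSTABmem are specific to the Toeplitz-block-Toeplitz structure and are not shown here. The underlying stabilized biconjugate gradient iteration (van der Vorst's BiCGSTAB), in textbook unpreconditioned form with the matrix-vector product left abstract, is:

/* Textbook BiCGSTAB for A x = b; matvec computes y = A x and receives an
 * opaque context, so a structure-exploiting (e.g. Toeplitz/FFT) product can
 * be plugged in. */
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*matvec_fn)(const double *x, double *y, int n, void *ctx);

static double dot(const double *a, const double *b, int n)
{ double s = 0.0; for (int i = 0; i < n; i++) s += a[i] * b[i]; return s; }

int bicgstab(matvec_fn matvec, void *ctx, const double *b, double *x,
             int n, int maxit, double tol)
{
    double *r = calloc(n, sizeof *r), *rh = calloc(n, sizeof *rh);
    double *p = calloc(n, sizeof *p), *v  = calloc(n, sizeof *v);
    double *s = calloc(n, sizeof *s), *t  = calloc(n, sizeof *t);
    int iters = -1;

    matvec(x, r, n, ctx);                          /* r = b - A x0 */
    for (int i = 0; i < n; i++) r[i] = b[i] - r[i];
    memcpy(rh, r, n * sizeof *r);                  /* shadow residual */
    double rho = 1.0, alpha = 1.0, omega = 1.0;

    for (int k = 0; k < maxit; k++) {
        double rho_new = dot(rh, r, n);
        double beta = (rho_new / rho) * (alpha / omega);
        rho = rho_new;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * (p[i] - omega * v[i]);
        matvec(p, v, n, ctx);
        alpha = rho / dot(rh, v, n);
        for (int i = 0; i < n; i++) s[i] = r[i] - alpha * v[i];
        matvec(s, t, n, ctx);
        omega = dot(t, s, n) / dot(t, t, n);
        for (int i = 0; i < n; i++) x[i] += alpha * p[i] + omega * s[i];
        for (int i = 0; i < n; i++) r[i] = s[i] - omega * t[i];
        if (sqrt(dot(r, r, n)) < tol) { iters = k + 1; break; }
    }
    free(r); free(rh); free(p); free(v); free(s); free(t);
    return iters;              /* iterations used, or -1 if no convergence */
}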

19.
We present algorithms to determine the number of nonzeros in each row and column of the factors of a sparse matrix, for both the QR factorization and the LU factorization with partial pivoting. The algorithms use only the nonzero structure of the input matrix, and run in time nearly linear in the number of nonzeros in that matrix. They may be used to set up data structures or schedule parallel operations in advance of the numerical factorization.

The row and column counts we compute are upper bounds on the actual counts. If the input matrix is strong Hall and there is no coincidental numerical cancellation, the counts are exact for QR factorization and are the tightest bounds possible for LU factorization.

These algorithms are based on our earlier work on computing row and column counts for sparse Cholesky factorization, plus an efficient method to compute the column elimination tree of a sparse matrix without explicitly forming the product of the matrix and its transpose.

20.
Based on the principles of Gaussian elimination, this paper proposes mechanized methods for solving equations, deriving transfer matrices, and computing matrix pseudo-inverses in the study of linear DEDS (discrete event dynamic systems), providing a new approach to the modeling, analysis, and computation of DEDS.
