Similar Articles
Found 20 similar articles (search time: 31 ms)
1.
It has long been recognized that many direct parallel tridiagonal solvers are efficient only for solving a single tridiagonal system of large size, and they become inefficient when naively used in a three-dimensional ADI solver. To improve the parallel efficiency of an ADI solver built on a direct parallel solver, we implement the single parallel partition (SPP) algorithm in conjunction with message vectorization, which aggregates several communication messages into one to reduce communication costs. The measured performance shows that the longest allowable message vector length (MVL) is not necessarily the best choice. To understand this observation and optimize performance, we propose an improved model that takes the cache effect into consideration. The optimal MVL for achieving the best performance is shown to depend on the number of processors and the grid size. A similar dependence of the optimal MVL is also found for the popular block pipelined method.
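The trade-off behind this observation can be illustrated with a toy latency/bandwidth model (all constants here are hypothetical, not taken from the paper): each message costs a fixed latency alpha plus a per-byte cost beta, and we assume the effective per-byte cost grows once a message no longer fits in cache.

```python
# Toy communication-cost model for message vectorization (hypothetical
# parameters): alpha = per-message latency, beta = per-byte transfer cost.
# The effective per-byte cost is assumed to rise once a message exceeds
# the cache size, a crude stand-in for the paper's cache effect.
def comm_time(total_bytes, mvl, alpha=1e-5, beta=1e-9, cache=256 * 1024):
    n_msgs = -(-total_bytes // mvl)                 # ceiling division
    eff_beta = beta if mvl <= cache else 4 * beta   # assumed cache penalty
    return n_msgs * (alpha + eff_beta * mvl)

# Aggregating 16 MiB of boundary data: a mid-sized MVL beats both many
# tiny messages (latency-bound) and one huge message (cache-penalty-bound).
times = {mvl: comm_time(16 * 1024 * 1024, mvl)
         for mvl in (4 * 1024, 64 * 1024, 256 * 1024, 4 * 1024 * 1024)}
best_mvl = min(times, key=times.get)                # not the largest MVL
```

Under these made-up constants the optimum is the largest MVL that still fits in cache; the paper's actual model additionally accounts for the number of processors and the grid size.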

2.
For the solution of large-scale simulations in structural mechanics, iterative solution methods are mandatory. The efficiency of such methods can depend crucially on several factors: the choice of material parameters, the quality of the underlying computational mesh and the number of processors in a parallel computing system. We distinguish between three aspects of ‘efficiency’: processor efficiency (the degree to which the solving algorithm exploits the processor's computational power), parallel efficiency (the ratio between computation and communication times) and numerical efficiency (convergence behaviour). With the new FEM software package Feast we aim to develop a solver mechanism that achieves high efficiency in all three aspects simultaneously, while trying to minimise the dependencies mentioned above. (© 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)

3.
For the parallel integration of nonstiff initial value problems (IVPs), three main approaches can be distinguished: approaches based on “parallelism across the problem”, on “parallelism across the method” and on “parallelism across the steps”. The first type of parallelism does not require special integration methods and can be exploited within any available IVP solver. The method-parallelism approach has received much attention, particularly within the class of explicit Runge-Kutta methods originating from fixed-point iteration of implicit Runge-Kutta methods of Gaussian type. The construction and implementation of such methods on a parallel machine is extremely simple. Since the computational work per processor is modest relative to the amount of data to be exchanged between the processors, this type of parallelism is most suitable for shared-memory systems. The required number of processors is roughly half the order of the generating Runge-Kutta method, and the speed-up with respect to a good sequential IVP solver is about a factor of 2. The third type of parallelism (step-parallelism) can be achieved in any IVP solver based on predictor-corrector iteration and requires the processors to communicate after each full iteration. If the iterations have sufficient computational volume, the step-parallel approach may be suitable for implementation on distributed-memory systems. Most step-parallel methods proposed so far employ a large number of processors but lack robustness, due to poor convergence behaviour in the iteration process; hence, the effective speed-up is rather poor. The dynamic step-parallel iteration process proposed in the present paper is less massively parallel, but turns out to be sufficiently robust to achieve speed-up factors of up to 15.

4.
Banded linear systems occur frequently in mathematics and physics. However, direct solvers for large systems cannot run in parallel without communication. The aim of this paper is to develop a general asymmetric banded solver with a direct approach that scales efficiently across many processors. The key mechanism behind this is that the solver does not require reduction to row-echelon form. The method requires more floating-point calculations than a standard solver such as LU decomposition, but by leveraging multiple processors the overall solution time is reduced. We present a solver using a superposition approach that decomposes the original linear system into q subsystems, where q is the number of superdiagonals. The method shows optimal computational cost when q processors are available, because each subsystem can be solved in parallel asynchronously. This is followed by a q×q dense constraint matrix problem that is solved before a final vectorized superposition is performed. Because row-echelon reduction is not required, the method also avoids fill-in. The algorithm is first developed for tridiagonal systems and then extended to arbitrary banded systems. Accuracy and performance are compared with existing solvers, and software is provided in the supplementary material.
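As a baseline for what a parallel banded solver competes against, the standard sequential tridiagonal solve (the Thomas algorithm, not the paper's superposition method) can be sketched as:

```python
def thomas(a, b, c, d):
    """Sequential Thomas algorithm for a tridiagonal system:
    a = subdiagonal (a[0] unused), b = diagonal, c = superdiagonal
    (c[-1] unused), d = right-hand side. O(n) forward elimination
    followed by back substitution; assumes no pivoting is needed."""
    n = len(b)
    cp = [0.0] * n  # modified superdiagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep's data dependence (each step needs cp[i-1] and dp[i-1]) is exactly what prevents naive parallelization and motivates decomposition approaches such as the one described above.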

5.
We present the parallelization of a linear programming solver using a primal-dual interior point method on a heterogeneous processor, the Cell BE. We focus on Cholesky factorization, as it is the most computationally expensive kernel in interior point methods. To ease development and porting to other heterogeneous systems, we propose a two-phase implementation procedure: we first develop a shared-memory multithreaded application that executes only on the main processor, and then offload the compute-intensive tasks to the synergistic processors (Cell accelerator cores). We used parent–child supernode amalgamation to increase block sizes, but noticed that the presence of many small blocks causes significant performance degradation. To reduce the overhead of small blocks, we extend the block fan-out algorithm so that small blocks are aggregated into large blocks without adding extra zeros. We also use another type of amalgamation that can merge any two consecutive supernodes and use it to avoid very small blocks in a composed block. The suggested block aggregation method speeds up the whole LP solver by a factor of up to 2.5 compared to using parent–child supernode amalgamation alone.

6.
As a synchronous parallel framework, the parallel variable transformation (PVT) algorithm is effective for solving unconstrained optimization problems. In this paper, based on the idea that a constrained optimization problem can be made equivalent to a differentiable unconstrained optimization problem by introducing the Fischer function, we propose an asynchronous PVT algorithm for solving large-scale linearly constrained convex minimization problems. The new algorithm can terminate as soon as some processor satisfies the termination condition, without waiting for the other processors, which enhances practical efficiency for large-scale optimization problems. Global convergence of the new algorithm is established under suitable assumptions. In particular, the linear rate of convergence does not depend on the number of processors.
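For reference, the Fischer (Fischer-Burmeister) function mentioned above is the standard tool for recasting complementarity conditions as smooth equations; the sketch below shows its defining property, not necessarily the exact construction used in the paper.

```python
import math

def fischer(a, b):
    """Fischer-Burmeister function phi(a, b) = sqrt(a^2 + b^2) - a - b.
    phi(a, b) = 0 holds exactly when a >= 0, b >= 0 and a * b = 0, so
    complementarity (KKT) conditions can be rewritten as smooth equations
    whose residual an unconstrained method can drive to zero."""
    return math.sqrt(a * a + b * b) - a - b
```

For example, fischer(3, 0) and fischer(0, 5) vanish (complementary pairs), while fischer(1, 1) and fischer(-1, 0) do not.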

7.
This paper presents a detailed analysis of the scalability and parallelization of Local Search algorithms for constraint-based and SAT (Boolean satisfiability) solvers. We propose a framework to estimate the parallel performance of a given algorithm by analyzing the runtime behavior of its sequential version. By approximating the runtime distribution of the sequential process with statistical methods, the runtime behavior of the parallel process can be predicted by a model based on order statistics. We apply this approach to study the parallel performance of a constraint-based Local Search solver (Adaptive Search), two SAT Local Search solvers (Sparrow and CCASAT), and a propagation-based constraint solver (Gecode, with a random labeling heuristic). We compare the performance predicted by our model to actual parallel implementations of those methods using up to 384 processes. We show that the model is accurate and predicts performance close to the empirical data. Moreover, across different types of problems, we observe that the solvers tested exhibit different behaviors and that their runtime distributions can be approximated by two types of distributions: exponential (shifted and non-shifted) and lognormal. Our results show that the proposed framework estimates the runtime of the parallel algorithm with an average discrepancy of 21% with respect to the empirical data, across all experiments with the maximum allowed number of processors for each technique.
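The core order-statistics idea, namely that the runtime of p independent parallel copies is the minimum of p draws from the sequential runtime distribution, can be checked with a small Monte Carlo sketch (the exponential distribution and its parameters below are assumptions for illustration, not the paper's data):

```python
import random

def expected_parallel_runtime(p, mean_seq=100.0, trials=20000, seed=1):
    """Monte Carlo estimate of E[min of p i.i.d. sequential runtimes],
    assuming an exponential runtime distribution with the given mean.
    For a (non-shifted) exponential the exact value is mean_seq / p,
    i.e. ideal linear speedup for independent parallel restarts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(1.0 / mean_seq) for _ in range(p))
    return total / trials

t1 = expected_parallel_runtime(1)
t16 = expected_parallel_runtime(16)
speedup = t1 / t16  # close to 16 for the non-shifted exponential case
```

For shifted exponential or lognormal runtime distributions (the other cases the paper observes), the minimum no longer scales as 1/p and the achievable speedup saturates.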

8.
We present a parallel implementation of the optimal quantization method on a computing grid. Its purpose is to price multidimensional American options instantaneously. Numerical tests are performed with varying numbers of processors, from 4 to 128. Finally, a spatial Richardson–Romberg extrapolation is introduced to speed up the convergence rate and stabilize the results.

9.
We investigate thalamo-cortical systems that are modeled by nonlinear Volterra integro-differential equations of convolution type. We divide the systems into smaller subsystems in such a way that each of them is solved by a processor working independently of the others, with results shared only once during the computation. We solve the subsystems concurrently in a parallel computing environment and present results of numerical experiments, which show savings in run time and thus the efficiency of our approach. In our numerical simulations, we apply different numbers np of processors, and in each case the run time decreases as np increases. The optimal speed-up is obtained with np = N, where N is the (moderate) number of equations in the thalamo-cortical model.

10.
The parallel solution of multiple systems of initial-value problems (IVPs) in ordinary differential equations is challenging because the amount of computation involved in solving a given IVP is generally not well correlated with that of solving another. In this paper, we describe how to efficiently solve multiple systems of stiff IVPs in parallel within a single-instruction, multiple-data (SIMD) implementation on the Cell Broadband Engine (CBE) of the RODAS solver for stiff IVPs. We solve two systems of stiff IVPs simultaneously on each of the eight synergistic processing elements per CBE chip for a total of 16 systems of IVPs. We demonstrate a speedup of 1.89 (a parallel efficiency of over 94%) over the corresponding serial code on a realistic example involving the operation of a chemical reactor. The techniques described apply to other multi-core processors besides the CBE and can be expected to increase in importance as computer architectures evolve to feature larger word sizes.

11.
A parallel hybrid linear solver based on the Schur complement method has the potential to balance the robustness of direct solvers with the efficiency of preconditioned iterative solvers. However, when solving large-scale highly indefinite linear systems, this hybrid solver often suffers from either slow convergence or large memory requirements in solving the Schur complement systems. To overcome this challenge, in this paper we discuss techniques to preprocess the Schur complement systems in parallel. Numerical results for large-scale highly indefinite linear systems from various applications demonstrate that these techniques improve the reliability and performance of the hybrid solver and enable efficient solutions of these linear systems on hundreds of processors, which was previously infeasible with existing state-of-the-art solvers.

12.
This paper deals with solving stiff systems of differential equations by implicit multistep Runge-Kutta (MRK) methods. For this type of method, nonlinear systems of dimension sd arise, where s is the number of Runge-Kutta stages and d the dimension of the problem. Applying a Newton process leads to linear systems of the same dimension, which can be very expensive to solve in practice. With a parallel iterative linear system solver designed especially for MRK methods, we approximate these linear systems by s systems of dimension d, which can be solved in parallel on a computer with s processors. In terms of Jacobian evaluations and LU decompositions, the k-step, s-stage MRK method applied with this technique on s processors is as expensive as the widely used k-step Backward Differentiation Formula (BDF) on one processor, whereas its stability properties are better than those of BDF. A simple implementation of both methods shows that, for the same number of Newton iterations, the accuracy delivered by the new method is higher than that of BDF.

13.
Given m semi-identical processors, which are parallel processors all working at the same speed but in different time intervals of availability, and n independent tasks with deadlines, we examine the problem of constructing a feasible preemptive schedule. We present an O(nm log n) time algorithm to construct such a schedule whenever one exists. We show that the number of induced preemptions is proportional to the total number of processing intervals and deadlines.

14.
We integrate tabu search, simulated annealing, genetic algorithms, and random restarting. In addition, while implicitly simulating the original Markov chain (defined on a state space tailored either to stand-alone simulated annealing or to the hybrid scheme) with the original cooling schedule, we speed up both stand-alone simulated annealing and the combination by a factor going to infinity as the number of transitions generated goes to infinity. Beyond this, a speedup nearly linear in the number of independent parallel processors can often be expected. This research was (partially) supported by the Air Force Office of Scientific Research and the Office of Naval Research Contract #F49620-90-C-0033.

15.
This paper studies a model of preemptive scheduling of jobs with release times and deadlines on speed-changeable (uniform) parallel machines. We give a necessary and sufficient condition for the existence of a feasible preemptive schedule of the jobs on such machines, and present an algorithm with time bound O(m^(7/3)n^3 + b log b) for constructing a feasible preemptive schedule.

16.
Extreme-scale simulation requires fast and scalable algorithms, such as multigrid methods. To achieve asymptotically optimal complexity, it is essential to employ a hierarchy of grids. The cost of solving the coarsest grid system can often be neglected in sequential computations, but cannot be ignored in massively parallel executions. In this case, the coarsest grid can be large and its efficient solution becomes a challenging task. We propose solving the coarse grid system with modern, approximate sparse direct methods and investigate the expected gains compared with traditional iterative methods. Since the coarse grid system only requires an approximate solution, we show that we can leverage block low-rank techniques, combined with the use of single-precision arithmetic, to significantly reduce the computational requirements of the direct solver. At extreme scale, the coarse grid system is too large for a sequential solution but too small to permit massively parallel efficiency. We show that agglomerating the coarse grid system onto a subset of processors is necessary for the sparse direct solver to achieve performance. We demonstrate the efficiency of the proposed method on a Stokes-type saddle point system solved with a monolithic Uzawa multigrid method. In particular, we show that using an approximate sparse direct solver for the coarse grid system can outperform a preconditioned minimal residual iterative method. This is demonstrated for the multigrid solution of systems of order up to 10^11 degrees of freedom on a petascale supercomputer using 43,200 processes.

17.
In general, solving Global Optimization (GO) problems by Branch-and-Bound (B&B) requires huge computational capacity, and parallel execution is used to speed up the computing time. Since in this type of algorithm the anticipated computational workload (the number of nodes in the B&B tree) changes dynamically during execution, load balancing and deciding on additional processors are complicated. We use the term left-over for the number of nodes that still have to be evaluated at a given moment during execution. In this work, we study new methods to estimate the left-over value based on the observed amount of pruning. This provides information about the remaining running time of the algorithm and the required computational resources. We focus on their use in interval B&B GO algorithms.

18.
A new decomposition method for multistage stochastic linear programming problems is proposed. A multistage stochastic problem is represented in tree form, and with each node of the decision tree a certain linear or quadratic subproblem is associated. The subproblems generate proposals for their successors and backward information for their predecessors. The subproblems can be solved in parallel, exchanging information asynchronously through special buffers. After a finite time the method either finds an optimal solution to the problem or detects its inconsistency. An analytical illustrative example shows that parallelization can speed up computation over every sequential method. Computational experiments indicate that for large problems we can obtain substantial gains in efficiency with moderate numbers of processors. This work was partly supported by the International Institute for Applied Systems Analysis, Laxenburg, Austria.

19.
In this paper we propose and describe a parallel implementation of a block preconditioner for the solution of saddle point linear systems arising from Finite Element (FE) discretization of 3D coupled consolidation problems. The Mixed Constraint Preconditioner developed in [L. Bergamaschi, M. Ferronato, G. Gambolati, Mixed constraint preconditioners for the solution to FE coupled consolidation equations, J. Comput. Phys., 227(23) (2008), 9885-9897] is combined with the parallel FSAI preconditioner, which is used here to approximate the inverses of both the structural (1,1) block and an appropriate Schur complement matrix. The resulting preconditioner proves effective in accelerating the BiCGSTAB iterative solver. Numerical results on test cases of size up to 2×10^6 unknowns and 1.2×10^8 nonzeros show perfect scalability of the overall code up to 256 processors.

