20 similar documents found; search time: 46 ms
1.
2.
Roger B. Sidje 《Numerical Linear Algebra with Applications》1997,4(4):305-331
Numerical methods related to Krylov subspaces are widely used in large sparse numerical linear algebra. Vectors in these subspaces are manipulated through their representation in orthonormal bases. Nowadays, on serial computers, the Arnoldi method is considered a reliable technique for constructing such bases. However, although easily parallelizable, its communication requirements prevent it from scaling as well as expected. In this work we examine alternative methods aimed at overcoming this drawback. Since they retrieve upon completion the same information as Arnoldi's algorithm, they enable us to design a wide family of stable and scalable Krylov approximation methods for various parallel environments. We present timing results obtained from their implementation on two distributed-memory multiprocessor supercomputers: the Intel Paragon and the IBM Scalable POWERparallel SP2. © 1997 John Wiley & Sons, Ltd.
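For reference, the Arnoldi process that this entry builds on can be sketched as follows. This is a generic modified Gram-Schmidt version in Python/NumPy, not the authors' parallel variant; the function name and breakdown tolerance are illustrative choices:

```python
import numpy as np

def arnoldi(A, v0, m):
    """Build an orthonormal basis V of the Krylov subspace
    span{v0, A v0, ..., A^(m-1) v0}, together with the (m+1) x m
    upper Hessenberg matrix H satisfying A V[:, :m] = V H."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        # Modified Gram-Schmidt: orthogonalize w against previous basis vectors
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:  # "happy breakdown": invariant subspace found
            return V[:, :j + 1], H[:j + 1, :j]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

The inner Gram-Schmidt loop is the communication-heavy step (inner products require global reductions), which is precisely the scalability issue the paper addresses.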
3.
Iterative regularization algorithms for constrained image deblurring on graphics processors
Valeria Ruggiero Thomas Serafini Riccardo Zanella Luca Zanni 《Journal of Global Optimization》2010,48(1):145-157
The ability of modern graphics processors to operate on large matrices in parallel can be exploited to solve constrained image deblurring problems in a short time. In particular, in this paper we propose parallel implementations of two iterative regularization methods: the well-known expectation maximization algorithm and a recent scaled gradient projection method. The main differences between the considered approaches and their impact on the parallel implementations are discussed. The effectiveness of the parallel schemes and the speedups over standard CPU implementations are evaluated on test problems arising from astronomical images.
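A minimal sketch of the gradient projection idea behind such methods, applied to a toy nonnegativity-constrained least-squares model. The paper's scaled gradient projection method additionally uses a diagonal scaling matrix and adaptive step lengths; the function name and the fixed step size below are illustrative assumptions:

```python
import numpy as np

def gradient_projection_nnls(A, b, steps=200, alpha=None):
    """Projected gradient descent for min ||A x - b||^2 s.t. x >= 0,
    a toy stand-in for constrained deblurring."""
    if alpha is None:
        # Fixed step length below 1/L, where L = ||A||_2^2 bounds the
        # Lipschitz constant of the gradient
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - b)
        x = np.maximum(x - alpha * grad, 0.0)  # project onto x >= 0
    return x
```

Both the matrix-vector products and the elementwise projection map naturally onto GPU kernels, which is what makes this family of methods attractive for the hardware discussed above.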
4.
We present a method for compressing binary images via monochromatic pattern substitution. The method suffers no appreciable loss of compression effectiveness if the image is partitioned into up to roughly a thousand blocks and each block is compressed independently; it can therefore be implemented on a distributed system with no interprocessor communication. In the theoretical context of unbounded parallelism, interprocessor communication is needed. Compression effectiveness then has a bell-shaped behaviour, which is again competitive with the sequential performance when the highest degree of parallelism is reached. Finally, the method achieves a speed-up when applied sequentially to an image partitioned into up to 256 blocks, so this speed-up carries over to a parallel implementation on a small-scale system.
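The embarrassingly parallel block structure described above can be illustrated as follows. Run-length encoding stands in for the paper's monochromatic pattern substitution, and the function names are hypothetical; the point is that each block is compressed with no reference to any other block:

```python
import numpy as np

def rle_compress(block):
    """Run-length encode a flattened binary block (a simple stand-in
    for monochromatic pattern substitution)."""
    flat = block.ravel()
    runs, count, bit = [], 1, flat[0]
    for b in flat[1:]:
        if b == bit:
            count += 1
        else:
            runs.append((int(bit), count))
            bit, count = b, 1
    runs.append((int(bit), count))
    return runs

def compress_blocks(image, blocks_per_side):
    """Split the image into a grid of blocks and compress each one
    independently; each block could be handled by a separate processor
    with no interprocessor communication."""
    h, w = image.shape
    bh, bw = h // blocks_per_side, w // blocks_per_side
    return [rle_compress(image[i:i + bh, j:j + bw])
            for i in range(0, h, bh) for j in range(0, w, bw)]
```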
5.
We describe an O((log n)²) time parallel algorithm, using n processors, for finding the lexicographically first depth-first search tree in the random graph G_{n,p} with p fixed. The problem itself is P-complete, and so is unlikely to be efficiently parallelizable in general.
6.
Alfredo Buttari Markus Huber Philippe Leleux Theo Mary Ulrich Rüde Barbara Wohlmuth 《Numerical Linear Algebra with Applications》2022,29(1):e2407
Extreme-scale simulation requires fast and scalable algorithms, such as multigrid methods. To achieve asymptotically optimal complexity, it is essential to employ a hierarchy of grids. The cost of solving the coarsest-grid system can often be neglected in sequential computing, but not in massively parallel executions, where the coarsest grid can be large and its efficient solution becomes a challenging task. We propose solving the coarse-grid system using modern, approximate sparse direct methods and investigate the expected gains compared with traditional iterative methods. Since the coarse-grid system only requires an approximate solution, we show that we can leverage block low-rank techniques, combined with single precision arithmetic, to significantly reduce the computational requirements of the direct solver. At extreme scale, the coarse-grid system is too large for a sequential solution, yet too small to permit massively parallel efficiency. We show that agglomerating the coarse-grid system onto a subset of processors is necessary for the sparse direct solver to perform well. We demonstrate the efficiency of the proposed method on a Stokes-type saddle point system solved with a monolithic Uzawa multigrid method. In particular, we show that an approximate sparse direct solver for the coarse-grid system can outperform a preconditioned minimal residual iterative method. This is demonstrated for the multigrid solution of systems of order up to degrees of freedom on a petascale supercomputer using 43,200 processes.
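As a rough illustration of pairing a grid hierarchy with a direct coarse-grid solve, here is a two-grid cycle for the 1D Poisson problem with an exact dense solve standing in for the paper's approximate sparse direct solver. All names, the weighted-Jacobi smoother, and the parameter choices are illustrative, not the authors' setup:

```python
import numpy as np

def two_grid_poisson(n=31, sweeps=2, cycles=10):
    """Two-grid cycle for -u'' = f on (0,1) with u(0)=u(1)=0,
    manufactured so the exact solution is sin(pi x).
    The coarse system is solved *directly*, mirroring the paper's
    theme of a direct coarse-grid solver inside multigrid."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    f = np.pi**2 * np.sin(np.pi * x)
    A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    nc = (n - 1) // 2                       # coarse grid size (n odd)
    P = np.zeros((n, nc))                   # linear interpolation
    for j in range(nc):
        P[2 * j, j], P[2 * j + 1, j], P[2 * j + 2, j] = 0.5, 1.0, 0.5
    R = 0.5 * P.T                           # full-weighting restriction
    Ac = R @ A @ P                          # Galerkin coarse operator
    u, omega, D = np.zeros(n), 2.0 / 3.0, np.diag(A)
    for _ in range(cycles):
        for _ in range(sweeps):             # pre-smoothing (weighted Jacobi)
            u += omega * (f - A @ u) / D
        ec = np.linalg.solve(Ac, R @ (f - A @ u))  # direct coarse solve
        u += P @ ec                         # coarse-grid correction
        for _ in range(sweeps):             # post-smoothing
            u += omega * (f - A @ u) / D
    return x, u
```

In the paper's setting the dense `np.linalg.solve` would be replaced by a block low-rank sparse direct factorization in single precision, run on an agglomerated subset of processes.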
7.
《Mathematical and Computer Modelling》2006,43(9-10):966-975
The inherent structure of cellular automata is trivially parallelizable and can directly benefit from massively parallel machines on computationally intensive problems. This paper presents both block-synchronous and block-pipeline (with asynchronous message passing) parallel implementations of cellular automata on distributed-memory (message-passing) architectures. A structural design problem is considered to study the performance of the various cellular automata implementations. The synchronous parallel implementation is a mixture of Jacobi- and Gauss–Seidel-style iteration that becomes more Jacobi-like as the number of processors increases; it therefore diverges for the structural problem at larger processor counts, owing to the mathematical characteristics of the Jacobi iteration matrix. The proposed pipeline implementation preserves convergence by simulating a pure Gauss–Seidel-style row-wise iteration. Numerical results for the analysis and design of a cantilever plate made of composite material show that the pipeline update scheme is convergent and successfully generates optimal designs.
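The contrast between the two update styles can be sketched on a generic linear system: a Jacobi sweep uses only the previous iterate, so all cells update in parallel, while a Gauss–Seidel sweep uses the freshest values and is sequential within a sweep (the behaviour the paper's pipeline scheme preserves). The function names are illustrative:

```python
import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi sweep: every update reads only the previous
    iterate, so all components can be updated simultaneously."""
    D = np.diag(A)
    return (b - (A - np.diag(D)) @ x) / D

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: each update reads the freshest
    values, which is inherently sequential row by row."""
    x = x.copy()
    for i in range(len(b)):
        s = A[i] @ x - A[i, i] * x[i]
        x[i] = (b[i] - s) / A[i, i]
    return x
```

For the structural problem in the paper, the Jacobi iteration matrix has spectral radius above one, which is why the synchronous scheme diverges as it drifts toward pure Jacobi with more processors.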
8.
Lee A Yau C Giles MB Doucet A Holmes CC 《Journal of computational and graphical statistics》2010,19(4):769-789
We present a case study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
9.
《Journal of computational and graphical statistics》2013,22(4):769-789
We present a case study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multicore processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modeling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. This article has supplementary material online.
10.
A. A. Gaganov 《Journal of Mathematical Sciences》1991,55(2):1541-1552
An algorithm is proposed for the restructuring of arithmetic expressions involving unary functions to a form convenient for parallel computation. Upper bounds are presented for the restructuring time on a serial processor, the parallel computing time, and the required number of processors. Translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V. A. Steklova Akad. Nauk SSSR, Vol. 175, pp. 37–52, 1988.
11.
Jacek Błażewicz Maciej Machowiak Jan Węglarz Mikhail Y. Kovalyov Denis Trystram 《Annals of Operations Research》2004,129(1-4):65-80
The problem of optimally scheduling n tasks in a parallel processor system is studied. The tasks are malleable, i.e., a task may be executed by several processors simultaneously, and the processing speed of a task is a nonlinear function of the number of processors allocated to it. The total number of processors is m, an upper bound on the number of processors that can be used by all tasks simultaneously. It is assumed that the number of processors is sufficient to process all tasks simultaneously, i.e., n ≤ m. The objective is to find a task schedule and a processor allocation that minimize the overall task completion time, i.e., the makespan. The problem is motivated by real-life applications of parallel computer systems in scientific computing of highly parallelizable tasks. An O(n) algorithm is presented to solve this problem when all processing speed functions are convex. If these functions are all concave and the number of tasks is constant, the problem can be solved in polynomial time. A relaxed problem, in which the number of processors allocated to each task need not be integer, can be solved in O(n max{m, n log₂ m}) time. It is proved that the minimum makespan values for the original and relaxed problems coincide. For n = 2 or n = 3, an optimal solution for the relaxed problem can be converted into an optimal solution for the original problem in constant time.
12.
Parallel algorithms for evaluating arithmetic expressions generally assume the computation-tree form to be at hand. The computation-tree form can be generated within the same resource bounds as the parenthesis-matching problem can be solved. We provide a new cost-optimal parallel algorithm for the latter problem, which runs in time O(log n) using O(n/log n) processors on an . We also prove that the algorithm is the fastest possible independently of the number of processors available.
13.
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLPs). Job balance, speedup and scalability are of primary interest in evaluating the efficiency of the new algorithm. The scalability of a parallel algorithm is a measure of its capacity to increase performance with respect to the number of processors used. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLPs, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm is shown to be scalable and gives better results for large problems. Motivation and justification for solving large MOLPs are also included.
14.
Alexandros V. Gerbessiotis 《Journal of Discrete Algorithms》2006,4(1):1-24
We study the computation, communication and synchronization requirements of constructing and searching parallel segment trees in an architecture-independent way. Our proposed parallel algorithms are optimal in space and time with respect to the corresponding sequential algorithms for the problems introduced, and are described in the context of the bulk-synchronous parallel (BSP) model of computation. Our methods are more scalable, and can thus be made to work for larger ratios of the number of processors p to the problem size n, than other segment-tree algorithms described for other realistic distributed-memory parallel models; they also provide a natural way to approach searching problems on latency-tolerant models of computation while maintaining a balanced query load among the processors.
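For orientation, a minimal sequential segment tree looks like this. The range-minimum variant below is a simple stand-in; the paper treats the computational-geometry segment trees that store intervals for stabbing queries, and parallelizes both construction and batched search under BSP:

```python
class SegmentTree:
    """Iterative segment tree over a fixed array, answering
    range-minimum queries in O(log n) time after O(n) construction."""

    def __init__(self, data):
        self.n = len(data)
        self.tree = [0] * (2 * self.n)
        self.tree[self.n:] = data
        # Build internal nodes bottom-up: each parent holds the
        # minimum of its two children.
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Minimum of data[lo:hi] (half-open interval)."""
        res = float('inf')
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                res = min(res, self.tree[lo]); lo += 1
            if hi & 1:
                hi -= 1; res = min(res, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return res
```

In a BSP setting the leaves would be distributed across the p processors and batches of queries exchanged in supersteps, which is where the balanced query load mentioned above becomes the design constraint.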
15.
Locally constructed parallel preconditioners for discretized elliptic equations, I: Basic method
Constructing efficient iterative algorithms for the large sparse systems of elliptic linear algebraic equations Au = f (1) that arise from finite element or finite difference discretization is currently a very active direction in computational methods.
16.
Inpainting is an image interpolation problem with broad applications in image and vision analysis. Described in the current expository paper are our recent efforts in developing universal inpainting models based on the Bayesian and variational principles. Discussed in detail are several variational inpainting models built upon geometric image models, the associated Euler‐Lagrange PDEs and their geometric and dynamic interpretations, as well as effective computational approaches. Novel efforts are then made to further extend this systematic variational framework to the inpainting of oscillatory textures, interpolation of missing wavelet coefficients as in the wireless transmission of JPEG2000 images, as well as light‐adapted inpainting schemes motivated by Weber's law in visual perception. All these efforts lead to the conclusion that unlike many familiar image processors such as denoising, segmentation, and compression, the performance of a variational/Bayesian inpainting scheme much more crucially depends on whether the image prior model well resolves the spatial coupling (or geometric correlation) of image features. As a highlight, we show that the Besov image models appear to be less interesting for image inpainting in the wavelet domain, highly contrary to their significant roles in thresholding‐based denoising and compression. Thus geometry is the single most important keyword throughout this paper. © 2005 Wiley Periodicals, Inc.
17.
A family of constrained pressure residual preconditioners for parallel reservoir simulations
Large‐scale reservoir simulations are extremely time‐consuming because of the solution of large‐scale linear systems arising from the Newton or Newton–Raphson iterations. The problem becomes even worse when highly heterogeneous geological models are employed. This paper introduces a family of multi‐stage preconditioners for parallel black oil simulations, which are based on the famous constrained pressure residual preconditioner. Numerical experiments demonstrate that our preconditioners are robust, efficient, and scalable. Copyright © 2015 John Wiley & Sons, Ltd.
18.
We develop the theory of information-based complexity from a parallel point of view. For a model of computation with p processors, each being capable of arithmetic operations and decisions, we analyze the complexity of function approximation, numerical integration, and solution of Fredholm integral equations. We obtain tight bounds on the complexity, considered as a function of three variables simultaneously: the number of processors, the required precision, and (in the case of approximation and integral equations) the number of points, in which the approximate solution is to be determined.
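The interplay between point count and processor count studied in this entry can be made concrete with a toy quadrature rule whose evaluation points are split among p simulated processors. The function name and the round-robin split are illustrative choices, not the paper's model:

```python
def integrate_parallel(f, a, b, n, p):
    """Composite midpoint rule on [a, b] with n points, with the work
    split into p chunks, one per (simulated) processor. The total work
    is n function evaluations regardless of p; p only controls how the
    evaluations are partitioned."""
    h = (b - a) / n

    def partial(chunk):
        # Each processor sums the midpoint contributions of its chunk.
        return sum(f(a + (i + 0.5) * h) for i in chunk) * h

    chunks = [range(k, n, p) for k in range(p)]  # round-robin split
    # A final reduction combines the p partial sums.
    return sum(partial(c) for c in chunks)
```

Up to floating-point reordering, the result is independent of p; what changes with p is the wall-clock cost, which is the quantity the complexity bounds above are stated in.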
19.
《Journal of Algorithms in Cognition, Informatics and Logic》1988,9(1):92-113
In this paper we define the binary tree algebraic computation (BTAC) problem and develop an efficient parallel algorithm for solving it. A variety of graph problems (minimum covering set, minimum r-dominating set, maximum matching set, etc.) for trees and two-terminal series-parallel (TTSP) graphs can be converted to instances of the BTAC problem, so efficient parallel algorithms for these problems are obtained systematically via the BTAC algorithm. The parallel computation model is an exclusive-read exclusive-write PRAM. The algorithms for tree problems run in O(log n) time with O(n) processors; those for TTSP graph problems run in O(log m) time with O(m) processors, where n (m) is the number of vertices (edges) in the input graph. These algorithms are within an O(log n) factor of optimal.
20.
This paper presents two integer linear programming (ILP) models for cyclic scheduling of tasks with unit/general processing time. Our work is motivated by digital signal processing (DSP) applications on FPGAs (Field-Programmable Gate Arrays), hardware architectures hosting several sets of identical arithmetic units. These hardware units can be formalized as dedicated sets of parallel identical processors. We propose a method to find an optimal periodic schedule of DSP algorithms on such architectures, where the number of available arithmetic units must be determined by the scheduling algorithm with respect to the capacity of the FPGA circuit. The emphasis is put on the efficiency of the ILP models. We show the advantages of our models in comparison with common ILP models on benchmarks and randomly generated instances.