Similar Articles
20 similar articles found (search time: 650 ms)
1.
Janus J. Eriksen, Molecular Physics, 2017, 115(17-18): 2086-2101
ABSTRACT

It is demonstrated how the non-proprietary OpenACC standard of compiler directives may be used to compactly and efficiently accelerate the rate-determining steps of two of the most routinely applied many-body methods of electronic structure theory, namely the second-order Møller-Plesset (MP2) model in its resolution-of-the-identity approximated form and the (T) triples correction to the coupled cluster singles and doubles model (CCSD(T)). By means of compute directives as well as optimised device math libraries, the operations involved in the energy kernels have been ported to graphics processing unit (GPU) accelerators, and the associated data transfers correspondingly optimised to such a degree that the final implementations (using double- and/or single-precision arithmetic) are capable of scaling to systems as large as the capacity of the host central processing unit (CPU) main memory allows. The performance of the hybrid CPU/GPU implementations is assessed through calculations on test systems of alanine amino acid chains using one-electron basis sets of increasing size (ranging from double- to pentuple-ζ quality). For all but the smallest problem sizes of the present study, the optimised accelerated codes (using a single multi-core CPU host node in conjunction with six GPUs) are found to reduce the total time-to-solution by at least an order of magnitude relative to optimised, OpenMP-threaded CPU-only reference implementations.
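As a rough illustration of the "optimised device math libraries" ingredient (a minimal sketch, not the author's actual code; matrix names and sizes are hypothetical), the snippet below offloads a single double-precision matrix-matrix contraction, the basic building block of both the RI-MP2 and (T) energy kernels, to the GPU via cuBLAS.

```cuda
// Minimal sketch: one DGEMM evaluated on the device with the vendor-tuned BLAS.
// Assumes column-major storage; error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void gemm_on_gpu(int m, int n, int k,
                 const std::vector<double>& A,   // m x k
                 const std::vector<double>& B,   // k x n
                 std::vector<double>& C)         // m x n
{
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * m * k);
    cudaMalloc(&dB, sizeof(double) * k * n);
    cudaMalloc(&dC, sizeof(double) * m * n);
    cudaMemcpy(dA, A.data(), sizeof(double) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(double) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // C = A * B, computed entirely on the GPU.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(C.data(), dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```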

2.
In this study, the application of the two-dimensional direct simulation Monte Carlo (DSMC) method using an MPI-CUDA parallelization paradigm on clusters of Graphics Processing Units (GPUs) is presented. An all-device (i.e. GPU) computational approach is adopted, in which the entire computation is performed on the GPU device, leaving the CPU idle during all stages of the computation, including particle moving, indexing, particle collisions and state sampling. Communication between the GPU and host is performed only to enable multiple-GPU computation. Results show that the computational expense can be reduced by factors of 15 and 185 when using a single GPU and 16 GPUs, respectively, compared to a single core of an Intel Xeon X5670 CPU. The demonstrated parallel efficiency is 75% when using 16 GPUs, as compared to a single GPU, for simulations using 30 million simulated particles. Finally, several very large-scale simulations in the near-continuum regime are employed to demonstrate the excellent capability of the current parallel DSMC method.
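A hypothetical sketch of the "all-device" idea (illustrative only, not the paper's code): particle state stays resident in GPU memory, a kernel such as the one below advances it, and the CPU's only role is to launch kernels and drive the inter-GPU (MPI) exchanges.

```cuda
// Illustrative only: free-flight move phase of a DSMC step, executed entirely
// on the device. Collision, indexing and sampling would be further kernels
// operating on the same device-resident arrays; boundary handling is omitted.
__global__ void move_particles(float3* pos, const float3* vel,
                               int n_particles, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_particles) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Host side: data never leaves the GPU between phases, e.g.
// move_particles<<<(n + 255) / 256, 256>>>(d_pos, d_vel, n, dt);
```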

3.
Solvent-mediated hydrodynamic interactions between colloidal particles can significantly alter their dynamics. We discuss the implementation of Stokesian dynamics in leading approximation for streaming processors as provided by the compute unified device architecture (CUDA) of recent graphics processors (GPUs). Thereby, the simulation of explicit solvent particles is avoided and hydrodynamic interactions can easily be accounted for in already available, highly accelerated molecular dynamics simulations. Special emphasis is put on efficient memory access and numerical stability. The algorithm is applied to the periodic sedimentation of a cluster of four suspended particles. Finally, we investigate the runtime performance of generic memory access patterns of complexity O(N²) for various GPU algorithms relying on either hardware cache or shared memory.
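The generic O(N²) access pattern in question can be pictured with the classic tiling idiom sketched below (a hedged, simplified example, not the paper's kernels): each block stages a tile of particle data in shared memory so that every pair interaction is read from on-chip storage rather than global memory; the actual mobility tensor is replaced here by a placeholder pair term.

```cuda
// Hedged sketch of an O(N^2) pairwise sum staged through shared memory.
// Launch with TILE threads per block; the "interaction" is a dummy placeholder.
#define TILE 128

__global__ void pairwise_sum(const float4* __restrict__ pos,
                             float4* __restrict__ out, int n)
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();                              // tile now cached on-chip
        for (int k = 0; k < TILE && base + k < n; ++k) {
            if (i < n && base + k != i) {
                float dx = tile[k].x - pos[i].x;      // placeholder pair term
                acc.x += dx;                          // (real code: mobility tensor)
            }
        }
        __syncthreads();
    }
    if (i < n) out[i] = acc;
}
```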

4.
A multilevel Cartesian non-uniform grid time domain algorithm (CNGTDA) is introduced to rapidly compute transient wave fields radiated by time-dependent three-dimensional source constellations. CNGTDA leverages the observation that transient wave fields generated by temporally bandlimited and spatially confined source constellations can be recovered via interpolation from appropriately delay- and amplitude-compensated field samples. This property is used in conjunction with a multilevel scheme, in which the computational domain is hierarchically decomposed into subdomains with sparse non-uniform grids used to obtain the fields. For both surface and volumetric source distributions, the computational cost of CNGTDA to compute the transient field at N_s observation locations from N_s collocated sources for N_t discrete time instances scales as O(N_t N_s log N_s) and O(N_t N_s log² N_s) in the low- and high-frequency regimes, respectively. Coupled with marching-on-in-time (MOT) time domain integral equations, CNGTDA can facilitate efficient analysis of large-scale time domain electromagnetic and acoustic problems.

5.
In this paper, both the fast Fourier transform (FFT) and a preconditioned CG technique are introduced into the method of lines (MOL) to further enhance the computational efficiency of this semi-analytic method. Electromagnetic wave scattering by an infinite plane metallic grating is used as an example to describe the implementation. For an arbitrary incident wave, the Helmholtz equation and boundary condition are first transformed into new ones so that the impedance matrix elements can be calculated by the FFT technique. As a result, the Toeplitz impedance matrix requires only O(N) memory storage, and the conjugate gradient FFT (CG-FFT) method solves for the current distribution with computational complexity O(N log N). Our numerical results show that a circulant matrix preconditioner makes the CG-FFT method converge in much less CPU time than the banded matrix preconditioner.
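The O(N log N) cost per CG iteration rests on the standard circulant-embedding identity (a textbook fact stated here for context, not specific to this paper): the N×N Toeplitz matrix T is embedded in a 2N×2N circulant matrix C, which the FFT diagonalises.

```latex
% Toeplitz matrix-vector product via circulant embedding (standard identity)
\[
T x \;=\; \Big[\, C \begin{pmatrix} x \\ 0 \end{pmatrix} \Big]_{1:N},
\qquad
C y \;=\; F^{-1}\,\mathrm{diag}(F c)\, F y ,
\]
```

Here c is the first column of C and F denotes the discrete Fourier transform, so each matrix-vector product, and hence each CG iteration, costs O(N log N) instead of O(N²).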

6.
Fractional diffusion equations model phenomena exhibiting anomalous diffusion that cannot be modeled accurately by second-order diffusion equations. Because of the nonlocal property of fractional differential operators, the numerical methods give rise to full coefficient matrices, which require O(N²) storage and O(N³) computational cost, where N is the number of grid points.

7.
Fractional diffusion equations model phenomena exhibiting anomalous diffusion that cannot be modeled accurately by second-order diffusion equations. Because of the nonlocal property of fractional differential operators, the numerical methods for fractional diffusion equations often generate dense or even full coefficient matrices. Consequently, the numerical solution with these methods often requires O(N³) computational work per time step and O(N²) memory, where N is the number of grid points.

8.
In this short review we present the developments over the last five decades that have led to the use of Graphics Processing Units (GPUs) for astrophysical simulations. Since the introduction of NVIDIA's Compute Unified Device Architecture (CUDA) in 2007, the GPU has become a valuable tool for N-body simulations and is now so popular that almost all papers on high-precision N-body simulations use methods that are accelerated by GPUs. With GPU hardware becoming more advanced and being used for more sophisticated algorithms such as gravitational tree-codes, we see a bright future for GPU-like hardware in computational astrophysics.

9.
We present the implementation and performance of a new gravitational N-body tree-code that is specifically designed for the graphics processing unit (GPU). All parts of the tree-code algorithm are executed on the GPU. We present algorithms for parallel construction and traversal of sparse octrees. These algorithms are implemented in CUDA and tested on NVIDIA GPUs, but they are portable to OpenCL and can easily be used on many-core devices from other manufacturers. This portability is achieved by using general parallel-scan and sort methods. The gravitational tree-code outperforms tuned CPU code during the tree construction and shows an overall performance improvement of more than a factor of 20, resulting in a processing rate of more than 2.8 million particles per second.
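The "general parallel-scan and sort methods" are the kind of off-the-shelf primitives sketched below (a hypothetical Thrust-based example, not the paper's code): particles are ordered along a space-filling curve on the device before the octree is built.

```cuda
// Hypothetical sketch: sort particle indices by precomputed Morton keys on the
// GPU, a typical preparation step for parallel octree construction.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sort_by_morton_key(thrust::device_vector<unsigned long long>& keys,
                        thrust::device_vector<int>& particle_ids)
{
    thrust::sequence(particle_ids.begin(), particle_ids.end());   // 0,1,2,...
    thrust::sort_by_key(keys.begin(), keys.end(), particle_ids.begin());
    // particle_ids now lists the particles in Morton (Z-curve) order.
}
```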

10.
Graphics processing units (GPUs) are increasingly being used for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the algorithms involved to the specific architecture under consideration allows one to harvest the computational power of GPU systems. A number of examples are discussed, ranging from Metropolis simulations of ferromagnetic Ising models, through continuous Heisenberg and disordered spin-glass systems, to parallel-tempering simulations. Significant speed-ups by factors of up to 1000 compared to serial CPU code, as well as to previous GPU implementations, are observed.
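One standard example of such architecture-specific tailoring (a generic sketch, not necessarily the author's exact scheme) is the checkerboard decomposition of the Metropolis update for the 2D Ising model: all sites of one sub-lattice can be updated independently and hence in parallel. The uniform random numbers are assumed to be pre-generated on the device (e.g. with cuRAND), and the lattice size L is assumed even.

```cuda
// Hedged sketch: Metropolis sweep over one colour of a checkerboard-decomposed
// L x L Ising lattice with periodic boundaries (J = 1, h = 0, L even).
// spins[] holds +1/-1; rnd[] holds L*L/2 uniform(0,1) numbers.
__global__ void metropolis_colour(int* spins, const float* rnd,
                                  int L, float beta, int colour)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= L * L / 2) return;

    // Map the compact index to a lattice site of the requested colour.
    int row = (2 * idx) / L;
    int col = (2 * idx) % L + ((row + colour) & 1);
    int s   = row * L + col;

    int up    = ((row + L - 1) % L) * L + col;
    int down  = ((row + 1) % L) * L + col;
    int left  = row * L + (col + L - 1) % L;
    int right = row * L + (col + 1) % L;

    int   nn_sum = spins[up] + spins[down] + spins[left] + spins[right];
    float dE     = 2.0f * spins[s] * nn_sum;      // energy change of a flip
    if (dE <= 0.0f || rnd[idx] < expf(-beta * dE))
        spins[s] = -spins[s];
}
```

One sweep consists of launching this kernel once with colour = 0 and once with colour = 1.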

11.
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed in addition, are less frequently discussed. In this paper, we present an analysis of several optimizations applied to both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and its GPU counterpart, explicit memory coalescing, are found to be critical to achieving good performance of this algorithm in both environments. The fully optimized CPU version achieves a 9× to 12× speedup over the original CPU version, in addition to the speedup from multi-threading. This is 2× faster than the fully optimized GPU version, indicating the importance of optimizing CPU implementations.
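To illustrate what "explicit memory coalescing" means in practice (a generic sketch, not taken from the paper), compare an array-of-structures layout, where consecutive threads issue strided loads, with a structure-of-arrays layout, where they read contiguous words that the hardware coalesces into full-width transactions.

```cuda
// Array of structures (AoS): the loads of p[i].x across a warp are strided by
// sizeof(ParticleAoS) bytes and therefore poorly coalesced.
struct ParticleAoS { float x, y, z, vx, vy, vz, mass, pad; };

__global__ void scale_aos(ParticleAoS* p, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x *= a; p[i].y *= a; p[i].z *= a; }
}

// Structure of arrays (SoA): consecutive threads read consecutive floats,
// which the memory controller merges into a small number of transactions.
struct ParticlesSoA { float *x, *y, *z; };

__global__ void scale_soa(ParticlesSoA p, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] *= a; p.y[i] *= a; p.z[i] *= a; }
}
```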

12.
A treecode algorithm is presented for evaluating electrostatic potentials in a charged particle system undergoing screened Coulomb interactions in 3D. The method uses a far-field Taylor expansion in Cartesian coordinates to compute particle–cluster interactions. The Taylor coefficients are evaluated using new recurrence relations which permit efficient computation of high-order approximations. Two types of clusters are considered, uniform cubes and adapted rectangular boxes. The treecode error, CPU time and memory usage are reported and compared with direct summation for randomly distributed particles inside a cube, on the surface of a sphere and on an 8-sphere configuration. For a given order of Taylor approximation, the treecode CPU time scales as O(N log N) and the memory usage scales as O(N), where N is the number of particles. Results show that the treecode is well suited for non-homogeneous particle distributions, as in the sphere and 8-sphere test cases.
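For orientation, the particle-cluster approximation for the screened Coulomb (Yukawa) kernel can be written in standard Cartesian treecode notation (multi-index k, cluster C with centre y_c, expansion order p); this is the generic form, not reproduced from the paper:

```latex
% Particle-cluster Taylor approximation for the screened Coulomb kernel
\[
\phi(x_i) \;=\; \sum_{y_j \in C} q_j\, \frac{e^{-\kappa |x_i - y_j|}}{|x_i - y_j|}
\;\approx\; \sum_{\|k\| \le p} a_k(x_i, y_c)\, m_k^{C},
\qquad
a_k = \frac{1}{k!}\, \partial_y^{\,k} G(x_i, y)\Big|_{y = y_c},
\qquad
m_k^{C} = \sum_{y_j \in C} q_j\, (y_j - y_c)^{k} ,
\]
```

The recurrence relations of the paper are what make the coefficients a_k cheap to evaluate at high order.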

13.
Satellite-observed radiance is a nonlinear functional of surface properties and of atmospheric temperature and absorbing-gas profiles, as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes increasingly time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of hyperspectral sounder channels we can use to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is well suited to GPU implementation, taking advantage of the hardware's efficiency and parallelism, since the radiances of many channels can be calculated in parallel on GPUs.

In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI), launched in 2006 onboard METOP-A, the first European meteorological polar-orbiting satellite. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module, which computes the regression predictors, takes less than 0.004% of the CPU time, while the second module (transmittance computation) and the third module (radiance computation) take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs totalling 960 compute cores and delivering nearly 4 TFlops of theoretical peak performance. By massively parallelizing the second and third modules, we reached a 364× speedup with 1 GPU and a 1455× speedup with all 4 GPUs, both with respect to the original CPU-based single-threaded Fortran code compiled with -O2 optimization. The 1455× speedup on a computer with four GPUs means that the proposed GPU-based high-performance forward model is able to compute one day's worth of 1,296,000 IASI spectra within roughly 10 minutes, whereas the original single-CPU version would take more than 10 days, which is impractical. The model runs at over 80% of the theoretical memory bandwidth with asynchronous data transfer. A novel CPU–GPU pipeline implementation of the IASI radiative transfer model is proposed. The GPU-based high-performance IASI radiative transfer model is suitable for the assimilation of IASI radiance observations into operational numerical weather forecast models.
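The "CPU-GPU pipeline with asynchronous data transfer" can be pictured with a generic double-buffering pattern like the hedged sketch below (batch handling, names and the trivial kernel body are illustrative, not the paper's implementation): while one batch of spectra is being processed on the device, the inputs for the next batch are copied in on a second stream.

```cuda
// Hedged sketch of a two-stream, double-buffered pipeline: H2D copies of batch
// b overlap with compute/copy work queued for batch b-1 on the other stream.
// Host buffers must be pinned (cudaMallocHost) for the async copies to overlap.
#include <cuda_runtime.h>

__global__ void process_batch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // stand-in for the real transmittance/radiance work
}

void pipeline(const float* h_in, float* h_out, int n_batches, int batch_elems)
{
    cudaStream_t stream[2];
    float *d_in[2], *d_out[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_in[s],  batch_elems * sizeof(float));
        cudaMalloc(&d_out[s], batch_elems * sizeof(float));
    }
    for (int b = 0; b < n_batches; ++b) {
        int s = b & 1;                               // ping-pong between buffers
        cudaMemcpyAsync(d_in[s], h_in + (size_t)b * batch_elems,
                        batch_elems * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process_batch<<<(batch_elems + 255) / 256, 256, 0, stream[s]>>>(
            d_in[s], d_out[s], batch_elems);
        cudaMemcpyAsync(h_out + (size_t)b * batch_elems, d_out[s],
                        batch_elems * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_in[s]); cudaFree(d_out[s]);
    }
}
```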

14.
15.
The renormalization group transformation for the hierarchical O(N) spin model in four dimensions is studied by means of characteristic functions of single-site measures, and convergence of the critical trajectory to the Gaussian fixed point is shown for sufficiently large N. In the strong coupling regime, the trajectory is controlled with the help of the exactly solved O(∞) trajectory, while in the weak coupling regime, convergence to the Gaussian fixed point is shown via power decay of the effective coupling constant.

16.
The Gram-Schmidt method is a classical method for determining QR decompositions, which is commonly used in many applications in computational physics, such as the orthogonalization of quantum mechanical operators or Lyapunov stability analysis. In this paper, we discuss how well the Gram-Schmidt method performs on different hardware architectures, including both state-of-the-art GPUs and CPUs. We explain in detail how a smart interplay between hardware and software can be used to speed up these rather compute-intensive applications, as well as the benefits and disadvantages of several approaches. In addition, we compare some highly optimized standard routines of the BLAS libraries against our own optimized routines on both processor types. Particular attention was paid to the strongly hierarchical memory of modern GPUs and CPUs, which requires cache-aware blocking techniques for optimal performance. Our investigations show that the performance depends strongly on the employed algorithm and compiler, and a little less on the employed hardware. Remarkably, the performance of the NVIDIA CUDA BLAS routines improved significantly from CUDA 3.2 to CUDA 4.0. Still, BLAS routines tend to be slightly slower than manually optimized code on GPUs, while we were not able to outperform the BLAS routines on CPUs. Comparing optimized implementations on different hardware architectures, we find that an NVIDIA GeForce GTX580 GPU is about 50% faster than a corresponding Intel X5650 Westmere hexa-core CPU. The self-written codes are included as supplementary material.
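For reference, a plain, unoptimised formulation of modified Gram-Schmidt is shown below (it only fixes the algorithm and assumes a full-rank, column-major m×n matrix; the paper's contribution lies in how such dot products and updates are blocked and vectorised for the memory hierarchy):

```cuda
// Unoptimised modified Gram-Schmidt on the host; A is column-major, m >= n,
// and is assumed to have full column rank. On return the columns of A are
// the orthonormal Q factor of the QR decomposition.
#include <cmath>
#include <vector>

void modified_gram_schmidt(std::vector<double>& A, int m, int n)
{
    auto col = [&](int j) { return A.data() + (size_t)j * m; };
    for (int j = 0; j < n; ++j) {
        double* qj = col(j);
        double norm = 0.0;
        for (int i = 0; i < m; ++i) norm += qj[i] * qj[i];
        norm = std::sqrt(norm);
        for (int i = 0; i < m; ++i) qj[i] /= norm;        // normalise column j
        for (int k = j + 1; k < n; ++k) {                 // remove its component
            double* qk = col(k);                          // from later columns
            double r = 0.0;
            for (int i = 0; i < m; ++i) r += qj[i] * qk[i];
            for (int i = 0; i < m; ++i) qk[i] -= r * qj[i];
        }
    }
}
```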

17.
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching, with different starting locations and some gaps and errors between the two data sequences. Perhaps the best-known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith–Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith–Waterman algorithm is constrained by memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith–Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed-up on up to four GPUs.
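For context, the textbook Smith–Waterman recurrence (not the paper's reformulation) builds the local-alignment score matrix H for sequences a and b with substitution score s and linear gap penalty g as follows; a common parallelisation strategy exploits the fact that entries on the same anti-diagonal are mutually independent.

```latex
% Smith-Waterman local-alignment recurrence (linear gap penalty g)
\[
H_{i,0} = H_{0,j} = 0, \qquad
H_{i,j} = \max\Big\{\, 0,\;
          H_{i-1,\,j-1} + s(a_i, b_j),\;
          H_{i-1,\,j} - g,\;
          H_{i,\,j-1} - g \,\Big\} ,
\]
```

The optimal local alignment score is the maximum of H over all (i, j).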

18.
We present a GPU-accelerated solver for simulations of bluff-body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL to perform efficient particle-grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations, and their relative advantages and drawbacks over CPU-based computations, are reported for simulations of flows past an impulsively started circular cylinder at Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speed-up of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Reynolds number of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations, but the agreement deteriorates (albeit remaining within 5% in drag calculations) for higher Re, as the single precision of the GPU adversely affects the accuracy of the simulations.

19.
Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large-scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited to the fullest. We describe how it is possible to achieve excellent efficiency for applications in statistical mechanics, particle dynamics and network analysis by using suitable memory access patterns and mechanisms such as CUDA streams, profiling tools, etc. Similar concepts and techniques may also be applied to other problems, such as the solution of partial differential equations.
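One typical way CUDA streams improve multi-GPU efficiency (a hedged, generic sketch, not taken from the paper) is to overlap the export of boundary data needed by a neighbouring GPU with the update of the interior, which needs no remote data; the data layout and the trivial kernel body below are illustrative.

```cuda
// Hedged sketch: the interior (first n_interior cells of d_field) is updated on
// one stream while the halo region (stored after it) is copied out on another,
// hiding the communication cost behind the bulk computation.
#include <cuda_runtime.h>

__global__ void update_interior(float* field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 1.0f;              // stand-in for the real stencil update
}

void step(float* d_field, float* h_halo_pinned, int n_interior, size_t halo_bytes)
{
    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Outgoing halo -> pinned host buffer, ready for the MPI exchange ...
    cudaMemcpyAsync(h_halo_pinned, d_field + n_interior, halo_bytes,
                    cudaMemcpyDeviceToHost, comm);
    // ... overlapped with the interior update on a different stream.
    update_interior<<<(n_interior + 255) / 256, 256, 0, compute>>>(d_field, n_interior);

    cudaStreamSynchronize(comm);      // halo on host: MPI_Isend/MPI_Irecv would go here
    cudaStreamSynchronize(compute);   // boundary cells are then updated separately

    cudaStreamDestroy(compute);
    cudaStreamDestroy(comm);
}
```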

20.
The Dyson Brownian Motion (DBM) describes the stochastic evolution of N points on the line driven by an applied potential, a Coulombic repulsion and identical, independent Brownian forcing at each point. We use an explicit tamed Euler scheme to numerically solve the Dyson Brownian motion and sample the equilibrium measure for non-quadratic potentials. The Coulomb repulsion is too singular for the SDE to satisfy the hypotheses of rigorous convergence proofs for tamed Euler schemes (Hutzenthaler et al. in Ann. Appl. Probab. 22(4):1611–1641, 2012). Nevertheless, in practice the scheme is observed to be stable for time steps of O(1/N²) and to relax exponentially fast to the equilibrium measure with a rate constant of O(1) independent of N. Further, this convergence rate appears to improve with N in accordance with O(1/N) relaxation of local statistics of the Dyson Brownian motion. This allows us to use the Dyson Brownian motion to sample N×N Hermitian matrices from the invariant ensembles. The computational cost of generating M independent samples is O(MN⁴) with a naive scheme, and O(MN³ log N) when a fast multipole method is used to evaluate the Coulomb interaction.
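For concreteness (one common normalisation of the equations and a tamed Euler step of the Hutzenthaler et al. type; the exact constants in the paper may differ), the Dyson Brownian motion for eigenvalues λ_1,…,λ_N with potential V and inverse temperature β, and its discretisation, read:

```latex
% Dyson Brownian motion (one common normalisation) and a tamed Euler step
\[
d\lambda_i = \Big( -V'(\lambda_i) + \frac{1}{N}\sum_{j \neq i} \frac{1}{\lambda_i - \lambda_j} \Big)\, dt
             + \sqrt{\tfrac{2}{\beta N}}\, dB_i , \qquad i = 1,\dots,N,
\]
\[
\lambda_i^{(n+1)} = \lambda_i^{(n)}
  + \frac{\Delta t\, b_i\big(\lambda^{(n)}\big)}{1 + \Delta t\, \big|b_i\big(\lambda^{(n)}\big)\big|}
  + \sqrt{\tfrac{2\,\Delta t}{\beta N}}\; \xi_i^{(n)},
  \qquad \xi_i^{(n)} \sim \mathcal{N}(0,1),
\]
```

where b_i denotes the bracketed drift; the taming factor keeps the singular Coulomb drift from producing unstable large moves near eigenvalue collisions.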


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号