Similar Documents
20 similar documents found (search time: 31 ms)
1.
We present a GPU-accelerated solver for simulations of bluff-body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle–grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL to perform efficient particle–grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations, and their relative advantages and drawbacks compared with CPU-based computations, are reported for simulations of flows past an impulsively started circular cylinder at Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speed-up of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Re number of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations, but this agreement deteriorates (albeit remaining within 5% in drag calculations) for higher Re numbers, as the single precision of the GPU adversely affects the accuracy of the simulations.
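As a rough illustration of an FFT-based Poisson solve of the kind this abstract relies on, the following is a minimal CUDA/cuFFT sketch for a periodic 2D domain. It is not the paper's solver (unbounded boundary conditions require additional machinery not shown here), and the grid dimensions, domain lengths, and the spectral-scaling kernel are illustrative assumptions.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Scale each Fourier mode by -1/k^2 (and undo the FFT normalization) so that
// the inverse transform yields the solution u of  laplacian(u) = f  on a
// periodic nx-by-ny grid; the zero mode is set to zero by convention.
__global__ void solve_in_spectral_space(cufftComplex* f_hat, int nx, int ny,
                                        float lx, float ly)
{
    const float PI = 3.14159265358979f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    // Wavenumbers with the usual FFT ordering (0..n/2, then negative frequencies).
    float kx = 2.0f * PI / lx * (i <= nx / 2 ? i : i - nx);
    float ky = 2.0f * PI / ly * (j <= ny / 2 ? j : j - ny);
    float k2 = kx * kx + ky * ky;

    int idx = j * nx + i;
    float scale = (k2 > 0.0f) ? -1.0f / (k2 * nx * ny) : 0.0f;
    f_hat[idx].x *= scale;
    f_hat[idx].y *= scale;
}

// Solve the Poisson equation in place on the device array d_field (row-major,
// index j*nx + i), assuming periodic boundary conditions.
void poisson_fft(cufftComplex* d_field, int nx, int ny, float lx, float ly)
{
    cufftHandle plan;
    cufftPlan2d(&plan, ny, nx, CUFFT_C2C);   // cuFFT expects (rows, columns)
    cufftExecC2C(plan, d_field, d_field, CUFFT_FORWARD);

    dim3 block(16, 16);
    dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    solve_in_spectral_space<<<grid, block>>>(d_field, nx, ny, lx, ly);

    cufftExecC2C(plan, d_field, d_field, CUFFT_INVERSE);
    cufftDestroy(plan);
}
```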

2.
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching, with different starting locations and some gaps and errors between the two data sequences. Perhaps the best-known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith–Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith–Waterman algorithm is constrained by memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith–Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed-up on up to four GPUs.
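The standard route to parallelizing the Smith–Waterman recurrence on a GPU is that all cells on the same anti-diagonal of the scoring matrix are independent. The sketch below is a generic illustration of that idea, not the SSCA#1 code of the paper; the match/mismatch/gap scores are assumed values, and traceback is omitted.

```cuda
#include <cuda_runtime.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -1

// H is the (m+1) x (n+1) scoring matrix, row-major, zero-initialized.
// diag indexes the anti-diagonal i + j = diag; every cell on one anti-diagonal
// depends only on earlier anti-diagonals, so all its cells fill in parallel.
__global__ void sw_antidiagonal(int* H, const char* a, const char* b,
                                int m, int n, int diag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;   // row, 1..m
    int j = diag - i;                                     // column, 1..n
    if (i > m || j < 1 || j > n) return;

    int sub     = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
    int up_left = H[(i - 1) * (n + 1) + (j - 1)] + sub;
    int up      = H[(i - 1) * (n + 1) + j] + GAP;
    int left    = H[i * (n + 1) + (j - 1)] + GAP;

    int h = 0;                       // local alignment never goes below zero
    if (up_left > h) h = up_left;
    if (up      > h) h = up;
    if (left    > h) h = left;
    H[i * (n + 1) + j] = h;
}

// Host-side sweep over all anti-diagonals.
void smith_waterman(int* d_H, const char* d_a, const char* d_b, int m, int n)
{
    for (int diag = 2; diag <= m + n; ++diag) {
        int threads = 256;
        int blocks  = (m + threads - 1) / threads;   // m is an upper bound on cells per diagonal
        sw_antidiagonal<<<blocks, threads>>>(d_H, d_a, d_b, m, n, diag);
    }
    cudaDeviceSynchronize();
}
```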

3.
As the resolution of the spatial light modulators used for computation grows, the computational cost of dynamic three-dimensional holographic display increases accordingly, placing new demands on the speed of hologram computation. We use GPU parallel computing to accelerate layer-based (tomographic) hologram generation: the GPU's parallel multithreading and the two-dimensional Fourier transforms required by the layer-based method are exploited to accelerate the Fresnel diffraction calculation, while direct use of low-level GPU resources and CUDA stream processing effectively reduces intermediate waiting latency. A comparison of computation speeds shows a substantial improvement over CPU execution: the GPU-based parallel method is roughly 10 times faster than the CPU-based method.

4.
Graphics processing units (GPUs) are being used to an increasing degree for general computational purposes. This development is motivated by their theoretical peak performance, which significantly exceeds that of broadly available CPUs. For practical purposes, however, it is far from clear how much of this theoretical performance can be realized in actual scientific applications. As is discussed here for the case of studying classical spin models of statistical mechanics by Monte Carlo simulations, only an explicit tailoring of the involved algorithms to the specific architecture under consideration makes it possible to harvest the computational power of GPU systems. A number of examples are discussed, ranging from Metropolis simulations of ferromagnetic Ising models, over continuous Heisenberg and disordered spin-glass systems, to parallel-tempering simulations. Significant speed-ups, by factors of up to 1000 compared to serial CPU code as well as to previous GPU implementations, are observed.
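For a flavor of the architecture-aware tailoring discussed above, here is a minimal sketch of a checkerboard Metropolis update for a 2D Ising model: spins of one sub-lattice only interact with the other sub-lattice, so they can all be updated concurrently. The lattice layout, coupling, and use of per-site cuRAND states are illustrative assumptions, not the paper's implementation.

```cuda
#include <curand_kernel.h>

// One Metropolis sweep over one checkerboard sub-lattice (color = 0 or 1) of an
// L x L Ising model (J = 1) with periodic boundaries.
__global__ void metropolis_sublattice(int* spin, curandState* states,
                                      int L, float beta, int color)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L) return;
    if (((x + y) & 1) != color) return;

    int idx = y * L + x;
    int xm = (x + L - 1) % L, xp = (x + 1) % L;
    int ym = (y + L - 1) % L, yp = (y + 1) % L;

    int nn = spin[y * L + xm] + spin[y * L + xp]
           + spin[ym * L + x] + spin[yp * L + x];
    float dE = 2.0f * spin[idx] * nn;            // energy change of flipping this spin

    curandState local = states[idx];
    if (dE <= 0.0f || curand_uniform(&local) < expf(-beta * dE))
        spin[idx] = -spin[idx];
    states[idx] = local;                          // store the advanced RNG state
}

// Per-site RNG initialization (one cuRAND state per lattice site).
__global__ void init_rng(curandState* states, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}
```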

5.
To study the effect of ion-thruster plumes on spacecraft, the charge-exchange ions in an ion-thruster plume were simulated with a particle-in-cell Monte Carlo collision (PIC-MCC) method. Using the Compute Unified Device Architecture (CUDA), a GPU-based parallel particle simulation code was developed. Random numbers are generated with a parallel MT19937 pseudo-random number generator, and the electric field equation is solved with an algebraic multigrid method in the full approximation storage scheme. In r–z axisymmetric coordinates, the mean current density obtained at z = 0 m is 4.5×10⁻⁵ A/m², and the GPU results agree with the CPU simulation. On an NVIDIA GeForce 9400 GT graphics card with 16 CUDA cores, speed-ups of 4.5 to 10.0 over an Intel Core 2 E6300 CPU were obtained.
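As a rough illustration of GPU-side parallel random-number generation of the kind described above, the sketch below uses cuRAND's host API with the MTGP32 generator (a GPU-oriented Mersenne-Twister variant, not necessarily the parallel MT19937 scheme of this paper) to fill a device array with uniform random numbers that a collision-sampling kernel could consume without leaving the GPU. The sample count and seed are arbitrary.

```cuda
#include <curand.h>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;          // number of random samples (illustrative)
    float* d_uniform = nullptr;
    cudaMalloc(&d_uniform, n * sizeof(float));

    // Create a Mersenne-Twister-family generator that runs on the GPU.
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fill the device array with uniform (0,1] numbers in one call.
    curandGenerateUniform(gen, d_uniform, n);
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(d_uniform);
    return 0;
}
```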

6.
Satellite-observed radiance is a nonlinear functional of surface properties and of atmospheric temperature and absorbing-gas profiles, as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The performance of radiative transfer models in operational numerical weather prediction systems still limits the number of hyperspectral sounder channels that can be used to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is well suited to GPU implementation, taking advantage of the hardware's efficiency and parallelism, since the radiances of many channels can be calculated in parallel on the GPU. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI), launched in 2006 onboard METOP-A, the first of the European meteorological polar-orbiting satellites. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module, computing the regression predictors, takes less than 0.004% of the CPU time, while the second module (transmittance computation) and the third module (radiance computation) take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model runs on a low-cost personal supercomputer with four GPUs totaling 960 compute cores and delivering nearly 4 TFlops of theoretical peak performance. By massively parallelizing the second and third modules, we reached a 364× speedup on 1 GPU and a 1455× speedup on all 4 GPUs, both with respect to the original CPU-based single-threaded Fortran code compiled with -O2 optimization. The 1455× speedup on a computer with four GPUs means that the proposed GPU-based high-performance forward model can compute one day's worth of 1,296,000 IASI spectra in roughly 10 minutes, whereas the original single-CPU version would take an impractical 10-plus days. The model runs at over 80% of the theoretical memory bandwidth using asynchronous data transfer. A novel CPU–GPU pipeline implementation of the IASI radiative transfer model is proposed. The GPU-based high-performance IASI radiative transfer model is suitable for assimilating IASI radiance observations into operational numerical weather forecast models.
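The asynchronous data transfer and CPU–GPU pipelining mentioned in this abstract rest on CUDA streams overlapping copies with kernel execution. The sketch below is a generic chunked pipeline, not the IASI code itself; the kernel name, chunk size, and stream count are hypothetical. Each batch is copied in, processed, and copied back in its own stream, so transfers for one batch overlap computation on another.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the per-channel radiance computation.
__global__ void compute_radiances(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // hypothetical: a real code evaluates the RT model here
}

void pipelined_batches(const float* h_in, float* h_out, int n_total, int chunk)
{
    const int n_streams = 4;
    cudaStream_t streams[n_streams];
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n_total * sizeof(float));
    cudaMalloc(&d_out, n_total * sizeof(float));
    for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

    for (int off = 0, s = 0; off < n_total; off += chunk, s = (s + 1) % n_streams) {
        int n = (n_total - off < chunk) ? n_total - off : chunk;
        // Host buffers should be page-locked (cudaMallocHost) for truly asynchronous copies.
        cudaMemcpyAsync(d_in + off, h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        compute_radiances<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, n);
        cudaMemcpyAsync(h_out + off, d_out + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < n_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in); cudaFree(d_out);
}
```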

7.
For the dynamics of macromolecules in solution, hydrodynamic interactions mediated by the solvent molecules often play an important role, even though one is not interested in the dynamics of the solvent itself. In computer simulations one can therefore save a large amount of computer time by replacing the solvent with a lattice fluid. The macromolecules are propagated by Molecular Dynamics (MD), while the fluid is governed by the fluctuating Lattice-Boltzmann (LB) equation. We present a fluctuating LB implementation for a single graphics card (GPU) coupled to an MD simulation running on conventional processors (CPUs). Particular emphasis lies on the optimization of the combined code. In our implementation, the LB update is performed in parallel with the force calculation on the CPU, which often completely hides the additional computational cost of the LB. Compared to our parallel LB implementation on a conventional quad-core CPU, the GPU LB is 50 times faster, and we show that a whole commodity cluster with InfiniBand interconnect cannot outperform a single GPU in strong scaling. The presented code is part of the open-source simulation package ESPResSo.

8.
This paper presents a parallel algorithm implemented on graphics processing units (GPUs) for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. NGIM reduces the computational cost of the direct field evaluation at N observers due to N co-located sources from O(N²) to O(N) in the static and low-frequency regimes, to O(N log N) in the high-frequency regime, and between these costs in the mixed-frequency regime. Memory requirements scale as O(N) in all frequency regimes. Several important differences between the CPU and GPU implementations of the NGIM are required to obtain optimal performance on the respective platforms. In particular, in the CPU implementations all operations that can be pre-computed are stored in memory in a preprocessing stage; this reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where memory handling is often a critical bottleneck, several special memory handling techniques are used to accelerate the computations. The significant latency of GPU global memory access is hidden by implementing coalesced reading, which requires arranging many array elements in contiguous parts of memory. Contrary to the CPU version, most of the steps in the GPU implementations are executed on the fly and only the necessary arrays are kept in memory. This results in significantly reduced memory consumption, an increased problem size N that can be handled, and reduced computational time on GPUs. The obtained GPU–CPU speed-up ratios range from 150 to 400 depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.
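Coalesced reading, as used above to hide global-memory latency, simply means that the 32 threads of a warp touch adjacent addresses so their loads merge into a few memory transactions. A minimal, generic illustration (not the NGIM code; the Source struct and kernels are assumptions) contrasting a strided array-of-structures access with a coalesced structure-of-arrays access:

```cuda
#include <cuda_runtime.h>

struct Source { float x, y, z, amplitude; };   // array-of-structures layout

// Strided access: consecutive threads read addresses 16 bytes apart, so each
// warp load is split across several memory transactions.
__global__ void scale_aos(Source* src, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) src[i].amplitude *= factor;
}

// Coalesced access: the amplitudes live in one contiguous array, so the 32
// threads of a warp read 32 consecutive floats in a single transaction.
__global__ void scale_soa(float* amplitude, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) amplitude[i] *= factor;
}
```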

9.
In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, the NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (the G200 and Fermi architectures) are also presented. In order to obtain high performance, the code must be optimized for the GPU architecture, i.e., the implementation must exploit the memory hierarchy of the CUDA programming model.

10.
GPU acceleration of numerical simulations of shock–flame interaction (total citations: 1; self-citations: 0; citations by others: 1)
蒋华  董刚  陈霄 《计算物理》2016,33(1):23-29
To assess the computational capability of graphics processing units (GPUs) in computational fluid dynamics, a CPU/GPU heterogeneous parallel approach was used to simulate a typical compressible reacting flow, the interaction of a shock wave with a flame interface. The parallel scheme was optimized, and the influence of grid resolution on the results and on the acceleration was examined. Compared with conventional message-passing MPI computation on 8 threads, the GPU parallel simulation yields the same results as the MPI simulation; the computation time of both methods grows linearly with the number of grid cells, but the GPU time is markedly lower than the MPI time. For a small grid (1.6×10⁴ cells), the GPU speed-up of the average time per time step is 8.6; as the grid grows the speed-up decreases, but for a large grid (4.2×10⁶ cells) it still reaches 5.9. The GPU-based heterogeneous parallel algorithm thus offers a good route to high-resolution, large-scale computation of compressible reacting flows.

11.
Parallel magnetic resonance imaging (pMRI) uses multiple receiver coils to reduce the MRI scan time. To accelerate the data acquisition process, less data is acquired from the scanner, which leads to artifacts in the reconstructed images. SENSitivity Encoding (SENSE) is a pMRI reconstruction algorithm that removes aliasing artifacts from undersampled multi-coil data and recovers fully sampled images. The main limitation of SENSE is computing the inverse of the encoding matrix. This work proposes inverting the encoding matrix with a Jacobi singular value decomposition (SVD) algorithm for image reconstruction on GPUs to accelerate the reconstruction process. The performance of the Jacobi SVD is compared with the Gauss–Jordan algorithm. Simulations are performed on two datasets (brain and cardiac) with acceleration factors 2, 4, 6 and 8. The results show that the graphics processing unit (GPU) provides a speed-up of up to 21.6 times compared to CPU reconstruction. The Jacobi SVD algorithm performs better than the Gauss–Jordan method in terms of acceleration of the reconstruction on GPUs. The proposed algorithm is suitable for any number of coils and acceleration factors for SENSE reconstruction on real-time processing systems.

12.
In this short review we present the developments over the last five decades that have led to the use of Graphics Processing Units (GPUs) for astrophysical simulations. Since the introduction of NVIDIA's Compute Unified Device Architecture (CUDA) in 2007, the GPU has become a valuable tool for N-body simulations and is now so popular that almost all papers on high-precision N-body simulations use methods that are accelerated by GPUs. With GPU hardware becoming more advanced and being used for more advanced algorithms such as gravitational tree-codes, we see a bright future for GPU-like hardware in computational astrophysics.
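For reference, the flavor of GPU N-body kernel this review refers to is the classic all-pairs gravitational force evaluation, with particle positions staged through shared memory tile by tile so every global load is reused by the whole thread block. This is the textbook pattern, not a specific code from the papers reviewed; the tile size and softening parameter are assumed values.

```cuda
#include <cuda_runtime.h>

#define TILE 256          // must equal blockDim.x at launch
#define EPS2 1.0e-6f      // softening squared (assumed value)

// All-pairs gravitational acceleration: each thread accumulates the acceleration
// on one body while the block cooperatively stages TILE source bodies at a time
// into shared memory.  pos[i] = (x, y, z, mass).
// Launch as: nbody_forces<<<(n + TILE - 1) / TILE, TILE>>>(pos, acc, n);
__global__ void nbody_forces(const float4* pos, float4* acc, int n)
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float ax = 0.f, ay = 0.f, az = 0.f;

    for (int start = 0; start < n; start += TILE) {
        int j = start + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        int count = min(TILE, n - start);
        for (int k = 0; k < count; ++k) {
            float dx = tile[k].x - pi.x;
            float dy = tile[k].y - pi.y;
            float dz = tile[k].z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + EPS2;
            float inv_r = rsqrtf(r2);
            float s = tile[k].w * inv_r * inv_r * inv_r;   // m_j / r^3
            ax += s * dx;  ay += s * dy;  az += s * dz;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = make_float4(ax, ay, az, 0.f);
}
```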

13.
Modern graphics processing units (GPUs) have recently become a pervasive technology able to rapidly solve large parallel problems that previously required runs on clusters or supercomputers. In this paper we propose an effective strategy to parallelize the T-matrix method on GPUs in order to speed up light-scattering simulations. We have tackled two of the most computationally intensive scattering problems of interest in nano-optics: scattering from an isolated non-axisymmetric particle and from an agglomerate of arbitrarily shaped particles. We show that by fully exploiting the GPU's potential we can achieve more than 20 times (20×) acceleration over sequential execution in the investigated scenarios, opening exciting prospects for the analysis and design of optical nanostructures.

14.
The convergence rate of a new direct simulation Monte Carlo (DSMC) method, termed “sophisticated DSMC”, is investigated for one-dimensional Fourier flow. An argon-like hard-sphere gas at 273.15 K and 266.644 Pa is confined between two parallel, fully accommodating walls 1 mm apart that have unequal temperatures. The simulations are performed using a one-dimensional implementation of the sophisticated DSMC algorithm. In harmony with previous work, the primary convergence metric studied is the ratio of the DSMC-calculated thermal conductivity to its corresponding infinite-approximation Chapman–Enskog theoretical value. As discretization errors are reduced, the sophisticated DSMC algorithm is shown to approach the theoretical values to high precision. The convergence behavior of sophisticated DSMC is compared to that of original DSMC. The convergence of the new algorithm in a three-dimensional implementation is also characterized. Implementations using transient adaptive sub-cells and virtual sub-cells are compared. The new algorithm is shown to significantly reduce the computational resources required for a DSMC simulation to achieve a particular level of accuracy, thus improving the efficiency of the method by a factor of 2.

15.
We present and compare different approaches for using multiple Graphics Processing Units in the simulation of physical systems. As benchmarks we consider the time required to update a single spin of the 3D Heisenberg spin-glass model, using both the over-relaxation and the heat-bath algorithms, and the solution of a Poisson equation by a finite-difference method. The results show that a suitable combination of techniques makes it possible to hide the communication overhead almost completely by using the CPU as a communication coprocessor of the GPU. Large-scale simulations on clusters of GPUs can be carried out efficiently by following the same approach for other applications where a clear separation exists between bulk and boundary data.
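The "CPU as communication coprocessor" idea mentioned above typically takes the following shape. This is a generic sketch under assumed names, not the paper's code: the placeholder boundary/bulk kernels and the CPU-side exchange routine (which would be MPI in a real cluster code) are hypothetical. The boundary is updated and shipped off the GPU asynchronously, the CPU exchanges it with neighbors while a second stream updates the bulk, and the received halo is copied back before the next step.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Placeholder kernels standing in for the real boundary / bulk lattice updates.
__global__ void update_boundary(float* halo, int halo_n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < halo_n) halo[i] += 1.0f;            // hypothetical boundary update
}
__global__ void update_bulk(float* field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 1.0f;                // hypothetical bulk update
}

// Placeholder for the CPU-side halo exchange (e.g. MPI_Sendrecv in a real code).
static void exchange_halos_on_cpu(float* h_send, float* h_recv, int halo_n)
{
    std::memcpy(h_recv, h_send, halo_n * sizeof(float));
}

// One time step in which the CPU acts as a communication coprocessor:
// the boundary is updated and transferred while the bulk update keeps the GPU busy.
void step_with_overlap(float* d_field, float* d_halo,
                       float* h_send, float* h_recv,
                       int n, int halo_n,
                       cudaStream_t s_boundary, cudaStream_t s_bulk)
{
    // 1. Update the boundary sites and start copying them to the host.
    update_boundary<<<(halo_n + 255) / 256, 256, 0, s_boundary>>>(d_halo, halo_n);
    cudaMemcpyAsync(h_send, d_halo, halo_n * sizeof(float),
                    cudaMemcpyDeviceToHost, s_boundary);

    // 2. Meanwhile, the bulk update runs in a second stream.
    update_bulk<<<(n + 255) / 256, 256, 0, s_bulk>>>(d_field, n);

    // 3. As soon as the halo has arrived, the CPU performs the exchange.
    cudaStreamSynchronize(s_boundary);
    exchange_halos_on_cpu(h_send, h_recv, halo_n);

    // 4. Copy the received halo back; wait for everything before the next step.
    cudaMemcpyAsync(d_halo, h_recv, halo_n * sizeof(float),
                    cudaMemcpyHostToDevice, s_boundary);
    cudaDeviceSynchronize();
}
```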

16.
Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large-scale applications require multiple GPUs to work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve excellent efficiency for applications in statistical mechanics, particle dynamics and network analysis by using suitable memory access patterns and mechanisms like CUDA streams, profiling tools, etc. Similar concepts and techniques may also be applied to other problems, such as the solution of partial differential equations.

17.
Combining non-Cartesian trajectories with parallel MRI permits unmatched acceleration rates compared to traditional Cartesian MRI during real-time imaging. However, the computationally demanding reconstructions of such imaging techniques, such as k-space-domain radial generalized auto-calibrating partially parallel acquisitions (radial GRAPPA) and image-domain conjugate-gradient sensitivity encoding (CG-SENSE), lead to long reconstruction times and unacceptable latency for online real-time MRI on conventional computational hardware. Though CG-SENSE has been shown to work with low latency using a general-purpose graphics processing unit (GPU), to the best of our knowledge no such effort has been made for radial GRAPPA. Radial GRAPPA reconstruction, which is robust even with highly undersampled acquisitions, is not iterative and requires significant computation only during the initial calibration, while achieving good image quality for low-latency imaging applications. In this work, we present a very fast, low-latency reconstruction framework based on a heterogeneous system using multi-core CPUs and GPUs. We demonstrate an implementation of radial GRAPPA that permits reconstruction times on par with or faster than the acquisition of highly accelerated datasets in both cardiac and dynamic musculoskeletal imaging scenarios. Acquisition and reconstruction times are reported.

18.
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution of the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete-ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip with no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

19.
The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as input, using a Hodgkin–Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.

20.
祁美玲  杨琼  王苍龙  田园  杨磊 《计算物理》2017,34(4):461-467
A molecular dynamics code for radiation damage in structural materials was parallelized on a single GPU using NVIDIA's CUDA architecture, and the factors affecting its efficiency were analyzed and tested. After a series of optimizations, for two million particles the optimized GPU code achieves a speed-up of up to 112× in double precision and about 300× in single precision relative to the single-CPU execution time, laying the foundation for a subsequent multi-GPU extension of the molecular dynamics code for radiation damage in structural materials.
