1.

Importance of explicit vectorization for CPU and GPU software performance

Neil G. Dickson Kamran Karimi Firas Hamze 《Journal of computational physics》2011,230(13):5383-5398

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9× to 12× speedup over the original CPU version, in addition to speedup from multi-threading. This is 2× faster than the fully-optimized GPU version, indicating the importance of optimizing CPU implementations.
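
The link between data layout and coalescing mentioned in this abstract can be illustrated with a small address-counting sketch. It is not taken from the paper: the group size of 32 threads, the 128-byte memory segment, and the four-float record are assumptions chosen to resemble common GPU hardware, and `segments_touched` is a hypothetical helper. The sketch counts how many aligned segments one group of threads touches when each thread reads a single 4-byte value.

```python
def segments_touched(n_threads, stride_bytes, offset_bytes=0, segment_bytes=128):
    """Count distinct aligned memory segments hit when thread t reads the
    4-byte word starting at offset_bytes + t * stride_bytes."""
    word = 4
    segments = set()
    for t in range(n_threads):
        addr = offset_bytes + t * stride_bytes
        # A word may straddle a segment boundary, so record both ends.
        segments.add(addr // segment_bytes)
        segments.add((addr + word - 1) // segment_bytes)
    return len(segments)


WARP = 32              # assumed number of threads issuing one memory request
FLOATS_PER_RECORD = 4  # hypothetical record (x, y, z, w) of 4-byte floats

# Array-of-structures: thread t reads field x of record t (stride 16 bytes).
aos = segments_touched(WARP, stride_bytes=4 * FLOATS_PER_RECORD)
# Structure-of-arrays: all x values are contiguous (stride 4 bytes).
soa = segments_touched(WARP, stride_bytes=4)

print("AoS segments:", aos)
print("SoA segments:", soa)
```

Under these assumptions the contiguous layout needs a single segment per request while the interleaved layout needs four. The same contiguity is what allows a CPU vector load to fill all of its lanes with one instruction.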

2.

Fast evaluation of Helmholtz potential on graphics processing units (GPUs)

Shaojing Li Boris Livshitz Vitaliy Lomakin 《Journal of computational physics》2010,229(22):8463-8483

This paper presents a parallel algorithm implemented on graphics processing units (GPUs) for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. NGIM reduces the computational time cost of the direct field evaluation at N observers due to N co-located sources from O(N²) to O(N) in the static and low-frequency regimes, to O(N log N) in the high-frequency regime, and between these costs in the mixed-frequency regime. Memory requirements scale as O(N) in all frequency regimes. Several important differences between the CPU and GPU implementations of the NGIM are required for optimal performance on the respective platforms. In particular, in the CPU implementations all operations, where possible, are pre-computed and stored in memory in a preprocessing stage. This reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where memory handling is often a critical bottleneck, several special memory handling techniques are used to accelerate the computations. The significant latency of GPU global memory access is hidden by implementing coalesced reading, which requires arranging many array elements in contiguous parts of memory. In contrast to the CPU version, most of the steps in the GPU implementations are executed on the fly and only the necessary arrays are kept in memory. This results in significantly reduced memory consumption, a larger problem size N that can be handled, and reduced computational time on GPUs. The obtained GPU–CPU speed-up ratios range from 150 to 400 depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.
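
For reference, the O(N²) direct evaluation that NGIM is said to accelerate can be sketched as a plain double loop. This is a minimal illustration, not the authors' code: the function name `helmholtz_direct` is hypothetical, and the kernel normalization exp(ikR)/(4πR) with a positive sign in the exponent is one common convention assumed here. Because observers and sources are co-located, the self term is skipped.

```python
import cmath
import math


def helmholtz_direct(points, amplitudes, k):
    """Direct O(N^2) sum of exp(i*k*R) / (4*pi*R) over all source-observer
    pairs; points are (x, y, z) tuples shared by sources and observers."""
    n = len(points)
    field = [0j] * n
    for i in range(n):
        xi, yi, zi = points[i]
        acc = 0j
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            xj, yj, zj = points[j]
            r = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            acc += amplitudes[j] * cmath.exp(1j * k * r) / (4.0 * math.pi * r)
        field[i] = acc
    return field


# Two sources a distance 2 apart, evaluated in the static limit k = 0.
pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
amps = [1.0, 3.0]
static = helmholtz_direct(pts, amps, k=0.0)
print(static)
```

The sketch assumes that no two distinct points coincide; a production code would need to guard against that case. Every observer visits every source, which is the quadratic cost quoted in the abstract.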

3.

GPU parallelization and optimization of a molecular dynamics program for irradiation damage in structural materials

祁美玲 杨琼 王苍龙 田园 杨磊 《计算物理》 (Chinese Journal of Computational Physics) 2017,34(4):461-467

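A molecular dynamics program for irradiation damage in structural materials is parallelized on a single GPU using NVIDIA's CUDA architecture, and the factors that affect its runtime efficiency are analyzed and tested. After a series of optimizations, for a system of two million particles, the optimized GPU program achieves a speedup of 112× in double precision and 300× in single precision relative to the execution time on a single CPU. This lays the groundwork for a later multi-GPU extension of the program.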

4.

GPU accelerated simulations of bluff body flows using vortex particle methods

Diego Rossinelli Michael Bergdorf Georges-Henri Cottet Petros Koumoutsakos 《Journal of computational physics》2010,229(9):3316-3333

We present a GPU accelerated solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL so as to perform efficient particle-grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations and their relative advantages/drawbacks over CPU based computations are reported in simulations of flows past an impulsively started circular cylinder at Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speedup of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Reynolds number (Re) of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations, but this agreement deteriorates (albeit remaining to within 5% in drag calculations) for higher Re as the single precision of the GPU adversely affects the accuracy of the simulations.
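
The particle-grid interpolation at the heart of remeshing can be sketched in one dimension. The abstract does not say which kernel the authors use; the M'4 kernel below is an assumption, chosen because it is widely used in remeshed vortex methods, and `remesh_1d` is a hypothetical helper rather than the paper's implementation. Each particle spreads its strength over the four nearest grid nodes.

```python
import math


def m4prime(x):
    """M'4 interpolation kernel; x is a distance in units of grid spacing."""
    r = abs(x)
    if r <= 1.0:
        return 1.0 - 2.5 * r * r + 1.5 * r ** 3
    if r <= 2.0:
        return 0.5 * (2.0 - r) ** 2 * (1.0 - r)
    return 0.0


def remesh_1d(positions, strengths, h, n_nodes):
    """Scatter particle strengths onto a uniform grid with spacing h.
    Node j sits at x = j * h; contributions outside the grid are dropped."""
    grid = [0.0] * n_nodes
    for xp, gp in zip(positions, strengths):
        base = math.floor(xp / h)
        # The kernel support covers the four nodes base-1 .. base+2.
        for j in range(base - 1, base + 3):
            if 0 <= j < n_nodes:
                grid[j] += gp * m4prime(xp / h - j)
    return grid


positions = [2.3, 5.75]   # illustrative particle positions
strengths = [1.0, -0.5]   # illustrative particle strengths
grid = remesh_1d(positions, strengths, h=1.0, n_nodes=10)
total = sum(grid)
first_moment = sum(j * 1.0 * g for j, g in enumerate(grid))
print(total, first_moment)
```

With particles kept away from the grid edges, the scatter preserves the total strength and its first moment up to rounding. The loop is written as a scatter for clarity; on a GPU the concurrent updates to `grid` would need atomic operations or a gather formulation.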

5.

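GPU acceleration of numerical simulations of shock wave and flame interface interaction (total citations: 1, self-citations: 0, citations by others: 1)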

蒋华 董刚 陈霄 《计算物理》 (Chinese Journal of Computational Physics) 2016,33(1):23-29

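To assess the computational capability of graphics processing units (GPUs) in computational fluid dynamics, a typical compressible reacting flow, the interaction of a shock wave with a flame interface, is simulated numerically using a heterogeneous CPU/GPU parallel approach. The parallel scheme is optimized, and the influence of grid resolution on the computed results and on the speedup is examined. The results show that the GPU simulation reproduces the results of a conventional message-passing (MPI) computation run on 8 threads. For both methods the computation time grows linearly with the number of grid cells, but the GPU time is markedly lower than the MPI time. For a small grid (1.6×10⁴ cells), the GPU speedup in average time per time step is 8.6. The speedup decreases as the grid grows, but still reaches 5.9 for a large grid (4.2×10⁶ cells). The GPU-based heterogeneous parallel algorithm offers a good route to high-resolution, large-scale computation of compressible reacting flows.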

6.

GPU parallel computing for particle simulation of ion thrusters