Similar Literature
 20 similar articles found (search time: 15 ms)
1.
We investigate third-generation GPU high-performance computing based on the NVIDIA CUDA (Compute Unified Device Architecture). The basic data-processing algorithms of pulse-compression radar, namely pulse compression and coherent integration, were implemented on an NVIDIA GTX470 GPU with 448 processing cores. Following the GPU's parallel processing architecture, both algorithms were redesigned as optimized parallel algorithms and mapped efficiently onto the 448 cores of the GTX470, completing a GPU implementation of the basic pulse-compression radar processing chain; the quality and real-time performance of the results were then evaluated.

2.
The Gram-Schmidt method is a classical method for determining QR decompositions, which is commonly used in many applications in computational physics, such as orthogonalization of quantum mechanical operators or Lyapunov stability analysis. In this paper, we discuss how well the Gram-Schmidt method performs on different hardware architectures, including both state-of-the-art GPUs and CPUs. We explain, in detail, how a smart interplay between hardware and software can be used to speed up those rather compute-intensive applications, as well as the benefits and disadvantages of several approaches. In addition, we compare some highly optimized standard routines of the BLAS libraries against our own optimized routines on both processor types. Particular attention was paid to the strongly hierarchical memory of modern GPUs and CPUs, which requires cache-aware blocking techniques for optimal performance. Our investigations show that the performance strongly depends on the employed algorithm and compiler, and somewhat less on the employed hardware. Remarkably, the performance of the NVIDIA CUDA BLAS routines improved significantly from CUDA 3.2 to CUDA 4.0. Still, BLAS routines tend to be slightly slower than manually optimized code on GPUs, while we were not able to outperform the BLAS routines on CPUs. Comparing optimized implementations on different hardware architectures, we find that an NVIDIA GeForce GTX580 GPU is about 50% faster than a corresponding Intel X5650 Westmere hexacore CPU. The self-written codes are included as supplementary material.
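For reference, the algorithm being benchmarked above can be sketched in a few lines of NumPy. This is a generic CPU-side illustration of the modified Gram-Schmidt QR decomposition (the function name is our own; the paper's GPU/BLAS-optimized variants are not reproduced here):

```python
import numpy as np

def modified_gram_schmidt(A):
    """QR decomposition via modified Gram-Schmidt.

    Orthonormalizes the columns of A one at a time, subtracting each
    projection immediately (the numerically stabler 'modified' variant).
    Returns Q (orthonormal columns) and upper-triangular R with A = Q @ R.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ v       # coefficient against basis vector i
            v -= R[i, j] * Q[:, i]      # remove that component right away
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R
```

The inner loop is exactly the kind of memory-bound, level-1 BLAS work the paper reorganizes with cache-aware blocking.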

3.
We present the implementation and performance of a new gravitational N-body tree-code that is specifically designed for the graphics processing unit (GPU). All parts of the tree-code algorithm are executed on the GPU. We present algorithms for parallel construction and traversing of sparse octrees. These algorithms are implemented in CUDA and tested on NVIDIA GPUs, but they are portable to OpenCL and can easily be used on many-core devices from other manufacturers. This portability is achieved by using general parallel-scan and sort methods. The gravitational tree-code outperforms tuned CPU code during the tree construction and shows an overall performance improvement of more than a factor of 20, resulting in a processing rate of more than 2.8 million particles per second.

4.
To study the effect of ion-thruster plumes on spacecraft, the charge-exchange ions in an ion-thruster plume were simulated with the particle-in-cell Monte Carlo collision (PIC-MCC) method. Using CUDA, a GPU-based parallel particle simulation code was developed. Random numbers are generated with a parallel MT19937 pseudo-random number generator, and the electric field equation is solved with an algebraic multigrid method using a full approximation storage scheme. In r-z axisymmetric coordinates, the mean current density obtained at z = 0 m is 4.5×10⁻⁵ A/m², and the GPU results agree with the CPU simulation. On a 16-core NVIDIA GeForce 9400 GT graphics card, a speedup of 4.5 to 10.0 over an Intel Core 2 E6300 CPU is achieved.

5.
A sub-pixel digital image correlation (DIC) method with a path-independent displacement tracking strategy has been implemented on NVIDIA compute unified device architecture (CUDA) for graphics processing unit (GPU) devices. Powered by parallel computing technology, this parallel DIC (paDIC) method, combining an inverse compositional Gauss–Newton (IC-GN) algorithm for sub-pixel registration with a fast Fourier transform-based cross-correlation (FFT-CC) algorithm for integer-pixel initial guess estimation, achieves a computation efficiency far superior to that of a DIC method running purely on a CPU. In experiments using simulated and real speckle images, the paDIC reaches computation speeds of 1.66×10⁵ POI/s (points of interest per second) and 1.13×10⁵ POI/s respectively, 57–76 times faster than its sequential counterpart, without sacrificing accuracy or precision. To the best of our knowledge, this is the fastest computation speed reported for a sub-pixel DIC method to date.
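The FFT-CC step used above for the integer-pixel initial guess can be sketched with NumPy. This is a simplified whole-image version under periodic-shift assumptions (the paper's per-subset GPU kernels are not reproduced; the function name is our own):

```python
import numpy as np

def fft_cc_shift(ref, target):
    """Estimate the integer-pixel shift of `target` relative to `ref`
    via FFT-based cross-correlation: the correlation peak location
    gives the displacement (wrapped shifts are mapped to signed values)."""
    cc = np.fft.ifft2(np.fft.fft2(target) * np.conj(np.fft.fft2(ref))).real
    peak = np.unravel_index(np.argmax(cc), cc.shape)
    # indices past the midpoint correspond to negative displacements
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, cc.shape))
```

In a full DIC pipeline this estimate seeds the IC-GN iterations, which then refine the displacement to sub-pixel accuracy.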

6.
We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing very optimized implementation in C language and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.

7.
We use the graphics processing unit (GPU) for fast calculations of helicity amplitudes of physics processes. As our first attempt, we compute $u\overline{u}\rightarrow n\gamma$ (n=2 to 8) processes in pp collisions at $\sqrt{s}=14$  TeV by transferring the MadGraph generated HELAS amplitudes (FORTRAN) into newly developed HEGET (HELAS Evaluation with GPU Enhanced Technology) codes written in CUDA, a C-platform developed by NVIDIA for general purpose computing on the GPU. Compared with the usual CPU programs, we obtain a 40–150 times better performance on the GPU.

8.
黄磊  张李超  鄢然 《应用光学》2015,36(5):762-767
Digital speckle correlation offers advantages such as simple measurement conditions and full-field, non-contact measurement, but algorithm efficiency has long been one of the bottlenecks limiting its development. GPUs are inherently parallel, and high-performance GPU computing can bring large efficiency gains to image processing. Using CUDA, the traditional point-by-point search algorithm, the cross search algorithm, and the genetic algorithm for digital speckle correlation were parallelized on the GPU and compared with their traditional counterparts. Experimental results show that for 150×150-pixel speckle images the three methods are accelerated by factors of 20, 8, and 31, respectively; for 500×500-pixel images by factors of 183, 33, and 44; and for 1000×1000-pixel images by factors of 424, 116, and 44.
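The search algorithms compared above all evaluate a correlation criterion at candidate displacements; a common choice in speckle correlation is the zero-normalized cross-correlation (ZNCC). A minimal NumPy version of that criterion (an illustrative helper of our own, not the paper's code):

```python
import numpy as np

def zncc(f, g):
    """Zero-normalized cross-correlation between two equal-size subsets.

    Subtracting the means and normalizing by the standard deviations makes
    the score invariant to affine intensity changes; the result lies in
    [-1, 1], with 1 meaning a perfect match.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    f = f - f.mean()
    g = g - g.mean()
    return float((f * g).sum() / np.sqrt((f * f).sum() * (g * g).sum()))
```

A point-by-point search simply evaluates this score for every candidate subset position and keeps the maximum, which is why the method parallelizes so naturally on a GPU.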

9.
In this paper, we focus on graphical processing unit (GPU) and discuss how its architecture affects the choice of algorithm and implementation of fully-implicit petroleum reservoir simulation. In order to obtain satisfactory performance on new many-core architectures such as GPUs, the simulator developers must know a great deal on the specific hardware and spend a lot of time on fine tuning the code. Porting a large petroleum reservoir simulator to emerging hardware architectures is expensive and risky. We analyze major components of an in-house reservoir simulator and investigate how to port them to GPUs in a cost-effective way. Preliminary numerical experiments show that our GPU-based simulator is robust and effective. More importantly, these numerical results clearly identify the main bottlenecks to obtain ideal speedup on GPUs and possibly other many-core architectures.

10.
We present the GPU calculation with the Compute Unified Device Architecture (CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an algorithm for quasi-block synchronization, we realize the Wolff single-cluster Monte Carlo simulation with CUDA. We perform parallel computations for the newly added spins in the growing cluster. As a result, the GPU calculation speed for the two-dimensional Ising model at the critical temperature with linear size L = 4096 is 5.60 times as fast as the calculation speed on a current CPU core. For the three-dimensional Ising model with linear size L = 256, the GPU calculation speed is 7.90 times as fast as the CPU calculation speed. The idea of quasi-block synchronization can be used not only in the cluster algorithm but also in many fields where the synchronization of all threads is required.
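For reference, the serial Wolff single-cluster update being parallelized above can be sketched in plain Python (a dict-of-spins lattice with periodic boundaries; the paper's CUDA quasi-block synchronization scheme is not reproduced):

```python
import math
import random

def wolff_step(spins, L, beta, rng=random):
    """One Wolff single-cluster update of a 2D Ising lattice (J = 1).

    spins: dict mapping (i, j) -> +1/-1 on an L x L periodic lattice.
    Aligned neighbors are added to the cluster with the Wolff bond
    probability p = 1 - exp(-2*beta); the whole cluster is then flipped.
    Returns the cluster size.
    """
    p_add = 1.0 - math.exp(-2.0 * beta)
    seed = (rng.randrange(L), rng.randrange(L))
    s0 = spins[seed]
    cluster = {seed}
    stack = [seed]
    while stack:
        i, j = stack.pop()
        for nb in (((i + 1) % L, j), ((i - 1) % L, j),
                   (i, (j + 1) % L), (i, (j - 1) % L)):
            if nb not in cluster and spins[nb] == s0 and rng.random() < p_add:
                cluster.add(nb)
                stack.append(nb)
    for site in cluster:        # flip the entire cluster at once
        spins[site] = -s0
    return len(cluster)
```

The growth loop is the serial bottleneck: the GPU version instead examines all newly added spins of a generation in parallel, which is where thread synchronization becomes the central issue.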

11.
The present work assesses the impact, in terms of time to solution, throughput analysis, and hardware scalability, of transferring computationally intensive tasks found in compressible reacting flow solvers to the GPU. Attention is focused on outlining the workflow and data transfer penalties associated with “plugging in” a recently developed GPU-based chemistry library into (a) a purely CPU-based solver and (b) a GPU-based solver, where, except for the chemistry, all other variables are computed on the GPU. This comparison allows quantification of host-to-device (and vice versa) data transfer penalties on the overall solver speedup as a function of mesh and reaction mechanism size. To this end, a recently developed GPU-based chemistry library known as UMChemGPU is employed to treat the kinetics in the flow solver KARFS. UMChemGPU replaces conventional CPU-based Cantera routines using a matrix-based formulation. The impact of i) data transfer times, ii) chemistry acceleration, and iii) the hardware architecture is studied in detail in the context of GPU saturation limits. Hydrogen and dimethyl ether (DME) reaction mechanisms are used to assess the impact of the number of species/reactions on overall/chemistry-only speedup. It was found that offloading the source term computation to UMChemGPU results in up to 7X reduction in overall time to solution and four orders of magnitude faster source term computation compared to conventional CPU-based methods. Furthermore, the metrics for achieving maximum performance gain using GPU chemistry with an MPI + CUDA solver are explained using the Roofline model. Integrating UMChemGPU with an MPI + OpenMP solver does not improve the overall performance due to the associated data copy time between the device (GPU) and host (CPU) memory spaces.
The performance portability was demonstrated using three different GPU architectures, and the findings are expected to translate to a wide variety of high-performance codes in the combustion community.

12.
Zhang Rui, Wen Lihua, Xiao Jinyou. Chinese Journal of Computational Physics, 2015, 32(3): 299-309
We propose an efficient, high-accuracy GPU-parallel method for large-scale acoustic boundary element analysis. Based on the Burton-Miller boundary integral equation, a parallel scheme suited to GPUs is derived and a GPU-accelerated version of the conventional boundary element method is implemented. To improve the efficiency of the prototype algorithm, GPU data-caching optimizations are studied. Because GPUs have comparatively low double-precision throughput, a double-single precision algorithm built on single-precision arithmetic is studied to reduce numerical error. Numerical examples show that the improved algorithm reaches a GPU utilization of up to 89.8% with accuracy comparable to direct double-precision computation, while requiring only 1/28 of the runtime and half the GPU memory. On an ordinary PC (8 GB RAM, NVIDIA GeForce 660 Ti graphics card), the method can quickly complete large-scale acoustic boundary element analyses with more than three million degrees of freedom, outperforming the fast boundary element method in both speed and memory consumption.
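The double-single idea mentioned above, representing one higher-precision value as an unevaluated pair of single-precision floats, can be illustrated with Knuth's error-free two-sum emulated in NumPy float32. This is a generic sketch of the technique, not the paper's CUDA kernels:

```python
import numpy as np

def two_sum(a, b):
    """Knuth two-sum in float32: returns (s, e) with s + e == a + b exactly,
    where s is the rounded float32 sum and e the rounding error."""
    a, b = np.float32(a), np.float32(b)
    s = np.float32(a + b)
    v = np.float32(s - a)
    e = np.float32(np.float32(a - np.float32(s - v)) + np.float32(b - v))
    return s, e

def ds_add(x, y):
    """Add two double-single numbers, each a (hi, lo) pair of float32.

    The high parts are combined error-free with two_sum, the low parts are
    folded into the error term, and the result is renormalized so that
    |lo| is at most half an ulp of hi.
    """
    s, e = two_sum(x[0], y[0])
    e = np.float32(e + np.float32(x[1] + y[1]))
    return two_sum(s, e)
```

On GPUs with weak double-precision units, this trick recovers most of the accuracy of doubles while using only single-precision hardware operations.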

13.
The compute unified device architecture (CUDA) is a programming approach for performing scientific calculations on a graphics processing unit (GPU) as a data-parallel computing device. The programming interface allows one to implement algorithms using extensions to standard C language. With a continuously increasing number of cores in combination with high memory bandwidth, a recent GPU offers remarkable resources for general purpose computing. First, we apply this new technology to Monte Carlo simulations of the two-dimensional ferromagnetic square lattice Ising model. By implementing a variant of the checkerboard algorithm, results are obtained up to 60 times faster on the GPU than on a current CPU core. An implementation of the three-dimensional ferromagnetic cubic lattice Ising model on a GPU is able to generate results up to 35 times faster than on a current CPU core. As a proof of concept we calculate the critical temperature of the 2D and 3D Ising models using finite size scaling techniques. Theoretical results for the 2D Ising model and previous simulation results for the 3D Ising model can be reproduced.
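The checkerboard idea above (update one sublattice at a time so that no two simultaneously updated spins are neighbors) can be emulated on the CPU with vectorized NumPy. A sketch under the usual 2D Ising conventions (J = 1, periodic boundaries), not the authors' CUDA code:

```python
import numpy as np

def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of a 2D Ising lattice (J = 1, periodic boundaries).

    The even and odd sublattices are updated in turn; within one sublattice
    all neighbors lie on the other sublattice, so every site of the same
    color can be updated simultaneously, exactly as in GPU checkerboard
    implementations.
    """
    ii, jj = np.indices(spins.shape)
    for parity in (0, 1):
        nn = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
              np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        dE = 2.0 * spins * nn                        # energy cost of flipping
        accept = rng.random(spins.shape) < np.exp(-beta * dE)
        mask = ((ii + jj) % 2 == parity) & accept    # touch only one color
        spins[mask] *= -1
    return spins
```

On a GPU, each thread owns one (or a few) sites of the current color, which is what makes this variant of Metropolis updating embarrassingly parallel.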

14.
Continuing our previous studies on QED and QCD processes, we use the graphics processing unit (GPU) for fast calculations of helicity amplitudes for general Standard Model (SM) processes. Additional HEGET codes to handle all SM interactions are introduced, as well as the program MG2CUDA that converts arbitrary MadGraph generated HELAS amplitudes (FORTRAN) into HEGET codes in CUDA. We test all the codes by comparing amplitudes and cross sections for multi-jet processes at the LHC associated with production of single and double weak bosons, a top-quark pair, Higgs boson plus a weak boson or a top-quark pair, and multiple Higgs bosons via weak-boson fusion, where all the heavy particles are allowed to decay into light quarks and leptons with full spin correlations. All the helicity amplitudes computed by HEGET are found to agree with those computed by HELAS within the expected numerical accuracy, and the cross sections obtained by gBASES, a GPU version of the Monte Carlo integration program, agree with those obtained by BASES (FORTRAN), as well as those obtained by MadGraph. The performance of GPU was over a factor of 10 faster than CPU for all processes except those with the highest number of jets.

15.
As the resolution of spatial light modulators grows, the computational load of dynamic 3D holographic display increases rapidly, placing new demands on hologram computation speed. GPU parallel computing is used to accelerate the layer-based hologram calculation: the Fresnel diffraction transform is accelerated by exploiting GPU multi-threading and the 2D Fourier transforms of the layer method, while calls to low-level GPU resources and CUDA stream processing effectively reduce intermediate waiting delays. Timing comparisons show a substantial speedup over the CPU: the GPU-parallel method is roughly 10 times faster than the CPU-based computation.

16.
Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great facility in solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU accelerated simulation of one energy group time-independent deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with the vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort, and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

17.
An increasing number of massively-parallel supercomputers are based on heterogeneous node architectures combining traditional, powerful multicore CPUs with energy-efficient GPU accelerators. Such systems offer high computational performance with modest power consumption. As the industry trend of closer integration of CPU and GPU silicon continues, these architectures are a possible template for future exascale systems. Given the longevity of large-scale parallel HPC applications, it is important that there is a mechanism for easy migration to such hybrid systems. The OpenACC programming model offers a directive-based method for porting existing codes to run on hybrid architectures. In this paper, we describe our experiences in porting the Himeno benchmark to run on the Cray XK6 hybrid supercomputer. We describe the OpenACC programming model and the changes needed in the code, both to port the functionality and to tune the performance. Despite the additional PCIe-related overheads when transferring data from one GPU to another over the Cray Gemini interconnect, we find the application gives very good performance and scales well. Of particular interest is the facility to launch OpenACC kernels and data transfers asynchronously, which speeds the Himeno benchmark by 5%–10%. Comparing performance with an optimised code on a similar CPU-based system (using 32 threads per node), we find the OpenACC GPU version to be just under twice the speed in a node-for-node comparison. This speed-up is limited by the computational simplicity of the Himeno benchmark and is likely to be greater for more complicated applications.

18.
Solvent-mediated hydrodynamic interactions between colloidal particles can significantly alter their dynamics. We discuss the implementation of Stokesian dynamics in leading approximation for streaming processors as provided by the compute unified device architecture (CUDA) of recent graphics processors (GPUs). Thereby, the simulation of explicit solvent particles is avoided and hydrodynamic interactions can easily be accounted for in already available, highly accelerated molecular dynamics simulations. Special emphasis is put on efficient memory access and numerical stability. The algorithm is applied to the periodic sedimentation of a cluster of four suspended particles. Finally, we investigate the runtime performance of generic memory access patterns of complexity O(N 2) for various GPU algorithms relying on either hardware cache or shared memory.

19.
In the present study, a GPU accelerated 1D space–time CESE method is developed and applied to shock tube problems with and without condensation. We have demonstrated how to implement the CESE algorithm to solve 1D shock tube problems using an older generation GPU (the NVIDIA 9800 GT) with relatively limited memory. To optimize the code performance, we used Shared Memory and solved the inter-Block boundary problem in two ways, namely the branch scheme and the overlapping scheme. The implementations of these schemes are discussed in detail and their performances are compared for the Sod shock tube problems. For the Sod problem without condensation, the speedup over an Intel CPU E7300 is 23 for the branch scheme and 41 for the overlapping scheme, respectively. While for problems with condensation, both schemes achieve higher acceleration ratios, 53 and 71, respectively. The higher speedup of the condensation case can be ascribed to the source term calculation which has a local dependence on the mesh point and the SOURCE kernel has a higher acceleration ratio.

20.
In this short review we present the developments over the last 5 decades that have led to the use of Graphics Processing Units (GPUs) for astrophysical simulations. Since the introduction of NVIDIA’s Compute Unified Device Architecture (CUDA) in 2007 the GPU has become a valuable tool for N-body simulations and is so popular these days that almost all papers about high precision N-body simulations use methods that are accelerated by GPUs. With GPU hardware becoming more advanced and being used for more advanced algorithms like gravitational tree-codes, we see a bright future for GPU-like hardware in computational astrophysics.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号