Similar Documents
20 similar documents found; search time: 62 ms
1.
2.
In this study, the application of the two-dimensional direct simulation Monte Carlo (DSMC) method using an MPI-CUDA parallelization paradigm on Graphics Processing Unit (GPU) clusters is presented. An all-device (i.e. GPU) computational approach is adopted in which the entire computation is performed on the GPU device, leaving the CPU idle during all stages of the computation, including particle moving, indexing, particle collisions and state sampling. Communication between the GPU and host is performed only to enable multiple-GPU computation. Results show that the computational expense can be reduced by factors of 15 and 185 when using a single GPU and 16 GPUs, respectively, compared with a single core of an Intel Xeon X5670 CPU. A parallel efficiency of 75% is demonstrated on 16 GPUs, relative to a single GPU, for simulations using 30 million simulated particles. Finally, several very large-scale simulations in the near-continuum regime are employed to demonstrate the capability of the current parallel DSMC method.
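A minimal sketch of the all-device pattern described above, in which a CUDA kernel performs the free-flight move entirely on the GPU and the host's only role is the MPI exchange of migrating particles. All names (Particle, move_particles, exit_flags) are illustrative, not taken from the paper; initialization and the pack/exchange/unpack steps are omitted.

    #include <cuda_runtime.h>
    #include <mpi.h>

    struct Particle { float x, y, vx, vy; };

    __global__ void move_particles(Particle* p, int n, float dt,
                                   float xmax, int* exit_flags) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        p[i].x += p[i].vx * dt;          // ballistic free flight
        p[i].y += p[i].vy * dt;
        exit_flags[i] = (p[i].x > xmax); // mark particles leaving this rank's subdomain
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        const int n = 1 << 20;
        Particle* d_p; int* d_exit;
        cudaMalloc((void**)&d_p, n * sizeof(Particle));
        cudaMalloc((void**)&d_exit, n * sizeof(int));
        move_particles<<<(n + 255) / 256, 256>>>(d_p, n, 1e-6f, 1.0f, d_exit);
        cudaDeviceSynchronize();
        // Host involvement ends here: pack flagged particles on the device,
        // copy the send buffer, MPI_Sendrecv with the neighbouring rank, unpack.
        MPI_Finalize();
        return 0;
    }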

3.
An increasing number of massively-parallel supercomputers are based on heterogeneous node architectures combining traditional, powerful multicore CPUs with energy-efficient GPU accelerators. Such systems offer high computational performance with modest power consumption. As the industry trend of closer integration of CPU and GPU silicon continues, these architectures are a possible template for future exascale systems. Given the longevity of large-scale parallel HPC applications, an easy migration path to such hybrid systems is important. The OpenACC programming model offers a directive-based method for porting existing codes to run on hybrid architectures. In this paper, we describe our experiences in porting the Himeno benchmark to the Cray XK6 hybrid supercomputer. We describe the OpenACC programming model and the changes needed in the code, both to port the functionality and to tune the performance. Despite the additional PCIe-related overheads incurred when transferring data from one GPU to another over the Cray Gemini interconnect, we find the application performs and scales very well. Of particular interest is the facility to launch OpenACC kernels and data transfers asynchronously, which speeds up the Himeno benchmark by 5%–10%. Comparing performance with an optimised code on a similar CPU-based system (using 32 threads per node), we find the OpenACC GPU version to be just under twice as fast in a node-for-node comparison. This speed-up is limited by the computational simplicity of the Himeno benchmark and is likely to be greater for more complicated applications.
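The asynchronous kernel launches and data transfers highlighted above map, in CUDA terms, onto work issued on separate streams. A hedged sketch of the overlap idea (the Himeno port itself uses OpenACC async clauses; the kernel and variable names here are illustrative):

    #include <cuda_runtime.h>

    __global__ void stencil_interior(float* p, int n) {
        // placeholder for the Himeno 19-point stencil on interior cells
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 1.0f;
    }

    int main() {
        const int nx = 512;
        float *d_p, *h_halo;
        cudaMalloc((void**)&d_p, nx * nx * sizeof(float));
        cudaMallocHost((void**)&h_halo, nx * sizeof(float)); // pinned, required for async copy
        cudaStream_t compute, transfer;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&transfer);
        // The halo row travels to the host (for an MPI exchange) while the
        // interior update runs concurrently on the other stream.
        cudaMemcpyAsync(h_halo, d_p, nx * sizeof(float),
                        cudaMemcpyDeviceToHost, transfer);
        stencil_interior<<<(nx * nx + 255) / 256, 256, 0, compute>>>(d_p, nx * nx);
        cudaStreamSynchronize(transfer); // halo ready for the MPI exchange here
        cudaStreamSynchronize(compute);
        cudaStreamDestroy(compute); cudaStreamDestroy(transfer);
        cudaFree(d_p); cudaFreeHost(h_halo);
        return 0;
    }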

4.
GPU acceleration of numerical simulations of the interaction between a shock wave and a flame interface
蒋华  董刚  陈霄 《计算物理》2016,33(1):23-29
To assess the computational capability of the graphics processing unit (GPU) in computational fluid dynamics, a CPU/GPU heterogeneous parallel approach is used to simulate a canonical compressible reacting flow, the interaction between a shock wave and a flame interface. The parallel scheme is optimized, and the influence of grid resolution on the computed results and on the acceleration performance is examined. The results show that the GPU parallel simulations agree with those of a conventional message-passing (MPI) computation using 8 threads. The computation time of both methods grows linearly with the number of grid cells, but the GPU time is markedly lower than the MPI time. For a small grid (1.6×10^4 cells), the GPU achieves a speed-up of 8.6 in the average time per time step; the speed-up decreases somewhat as the grid grows, yet for a large grid (4.2×10^6 cells) it still reaches 5.9. GPU-based heterogeneous parallel acceleration thus offers a practical route to high-resolution, large-scale computation of compressible reacting flows.

5.
陈骏  文豪华  鲁兰原  范俊 《中国物理 B》2016,25(1):18707-018707
Membrane curvature is no longer thought of as a passive property of the membrane; rather, it is considered an active, regulated state that serves various purposes in the cell, such as communication between cells and organelle definition. Such transport is usually mediated by tiny membrane bubbles known as vesicles or membrane tubules, and this communication requires complex interplay between the lipid bilayers and cytosolic proteins such as members of the Bin/Amphiphysin/Rvs (BAR) superfamily. With rapid developments in novel experimental techniques, membrane remodeling has become a rapidly emerging field in recent years. Molecular dynamics (MD) simulations are important tools for obtaining atomistic information regarding the structural and dynamic aspects of biological systems and for understanding the underlying physics. The availability of more sophisticated experimental data poses challenges to the theoretical community to develop novel theoretical and computational techniques that can better interpret the experimental results and yield further functional insights. In this review, we summarize the general mechanisms underlying membrane remodeling controlled or mediated by proteins. Studies combining experiments with molecular dynamics simulations both confirm existing mechanistic models and extend the known roles of different BAR domain proteins during membrane remodeling processes. We review these recent findings, focusing on how multiscale molecular dynamics simulations aid in understanding the physical basis of BAR domain proteins, as representatives of membrane-remodeling proteins.

6.
We present a GPU-accelerated solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL to perform efficient particle-grid operations, and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations, and their relative advantages and drawbacks over CPU-based computations, are reported for simulations of flows past an impulsively started circular cylinder at Reynolds numbers between 40 and 9500. The results indicate up to two orders of magnitude speed-up of the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Reynolds number of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations, but this agreement deteriorates (albeit remaining within 5% for drag calculations) at higher Re as the single precision of the GPU adversely affects the accuracy of the simulations.
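The particle-grid interpolation at the heart of the remeshing step can also be written with CUDA atomics instead of the OpenGL path used in the paper. A minimal 1D scatter sketch with a linear (tent) kernel; a production remesher would typically use a higher-order kernel such as M4', and all names here are illustrative:

    #include <cuda_runtime.h>

    __global__ void particles_to_grid(const float* xp, const float* wp, int np,
                                      float* grid, int nx, float dx) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= np) return;
        float s = xp[i] / dx;      // particle position in cell units
        int j = (int)floorf(s);
        float f = s - (float)j;    // fractional offset within cell j
        if (j >= 0 && j + 1 < nx) {
            atomicAdd(&grid[j],     wp[i] * (1.0f - f)); // tent-kernel weights
            atomicAdd(&grid[j + 1], wp[i] * f);          // atomics resolve write conflicts
        }
    }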

7.
In this short review we present the developments over the last five decades that have led to the use of Graphics Processing Units (GPUs) for astrophysical simulations. Since the introduction of NVIDIA's Compute Unified Device Architecture (CUDA) in 2007, the GPU has become a valuable tool for N-body simulations and is now so popular that almost all papers about high-precision N-body simulations use methods that are accelerated by GPUs. With GPU hardware becoming more advanced and being used for more advanced algorithms, such as gravitational tree-codes, we see a bright future for GPU-like hardware in computational astrophysics.

8.
In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA's Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi) are also presented. To obtain high performance, the code must be optimized for the GPU architecture, i.e., the implementation must exploit the memory hierarchy of the CUDA programming model.
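One standard memory-hierarchy optimization for SU(2) is to store each link as four reals, U = a0·I + i a·σ, so that a link product needs only eight loads and follows the quaternion-like rule (a0·b0 − a·b, a0·b + b0·a − a×b). A hedged sketch with an illustrative float4 layout (w holds the scalar part a0); this is not the paper's code:

    #include <cuda_runtime.h>

    __global__ void su2_multiply(const float4* a, const float4* b, float4* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float4 p = a[i], q = b[i];  // (x,y,z) = vector part, w = scalar part a0
        float4 r;
        r.w = p.w * q.w - p.x * q.x - p.y * q.y - p.z * q.z;    // a0*b0 - a.b
        r.x = p.w * q.x + q.w * p.x - (p.y * q.z - p.z * q.y);  // a0*b + b0*a - a x b
        r.y = p.w * q.y + q.w * p.y - (p.z * q.x - p.x * q.z);
        r.z = p.w * q.z + q.w * p.z - (p.x * q.y - p.y * q.x);
        c[i] = r;
    }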

9.
Simplified spherical harmonics approximation (SPN) equations are widely used in modeling light propagation in biological tissues. However, as the order N increases, the computational burden grows severely. We propose a graphics processing unit (GPU) accelerated framework for the SPN equations. Compared with a conventional central processing unit implementation, the GPU framework shows increasing performance gains as the mesh size grows, with a best speed-up ratio of 25 among the studied cases. The influence of thread distribution on the performance of the GPU framework is also investigated.

10.
The present theoretical understanding of various properties of superionic conductors is reviewed. Emphasis is put on their treatment as classical many-particle systems and on the analysis of their dynamic behaviour. Different kinds of approaches pertaining to the low-frequency dynamics are considered in detail. They include stochastic models, such as hopping or Fokker-Planck models, as well as a hydrodynamic theory. The high-frequency (phonon) dynamics and the information obtained from computer simulations are also analysed. As far as possible, the relevance of the different approaches to experiments on specific materials is discussed. Possible directions for future investigations are outlined.

11.
We review the recent literature on lattice simulations for few- and many-body systems. We focus on methods that combine the framework of effective field theory with computational lattice methods. Lattice effective field theory is discussed for cold atoms as well as low-energy nucleons with and without pions. A number of different lattice formulations and computational algorithms are considered, and an effort is made to show common themes in studies of cold atoms and low-energy nuclear physics as well as common themes in work by different collaborations.

12.
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now offers great capability for scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. Its numerical solution involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for vacuum boundary conditions. The relative advantages and disadvantages of the GPU implementation, simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip with no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.
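The sweep itself has a wavefront structure: within one octant, cells on the diagonal plane i+j+k = p depend only on planes before p, so each plane can be processed by one kernel launch. A hedged sketch using simple step differencing (Sweep3D itself uses diamond differencing with the flux fixup mentioned above; all names are illustrative):

    #include <cuda_runtime.h>

    __global__ void sweep_plane(float* psi, const float* q, int nx, int ny, int nz,
                                int p, float mu_dx, float eta_dy, float xi_dz,
                                float sigma_t) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = p - i - j;                       // cells on the plane i+j+k = p
        if (i >= nx || j >= ny || k < 0 || k >= nz) return;
        int c = (k * ny + j) * nx + i;
        // incoming fluxes from the upwind cells; zero at a vacuum boundary
        float ix = (i > 0) ? psi[c - 1]       : 0.0f;
        float iy = (j > 0) ? psi[c - nx]      : 0.0f;
        float iz = (k > 0) ? psi[c - nx * ny] : 0.0f;
        psi[c] = (q[c] + mu_dx * ix + eta_dy * iy + xi_dz * iz)
               / (sigma_t + mu_dx + eta_dy + xi_dz);
    }

    // Host side: for (int p = 0; p < nx + ny + nz - 2; ++p) launch sweep_plane for plane p.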

13.
Recently, an implicit, nonlinearly consistent, energy- and charge-conserving one-dimensional (1D) particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen et al., J. Comput. Phys. 230 (18) (2011)]. The method employs a Jacobian-free Newton–Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes a very efficient, mixed-precision hybrid CPU–GPU implementation of the 1D implicit PIC algorithm exploiting this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on an NVIDIA GeForce GTX 580. This corresponds to 25% absolute GPU efficiency against the peak theoretical performance, and is about 100 times faster than an equivalent compiler-optimized single-core CPU (Intel Xeon X5460) execution. For the test case chosen, the mixed-precision hybrid CPU–GPU solver is shown to outperform the DP CPU-only serial version by a factor of ~100, without apparent loss of robustness or accuracy, in a challenging long-timescale ion acoustic wave simulation.
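A hedged illustration of the division of labour described above: the mover runs in single precision on the GPU while the double-precision JFNK field solve stays on the CPU. The simple explicit push below is only a stand-in for the paper's implicit, charge-conserving, adaptive mover; all names are illustrative, and E is assumed to carry nx+1 nodes so the gather never reads out of bounds:

    #include <cuda_runtime.h>

    __global__ void push_particles_sp(float* x, float* v, int np,
                                      const float* E, int nx, float dx,
                                      float qm, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= np) return;
        int j = (int)(x[i] / dx);                    // cell containing the particle
        float f = x[i] / dx - (float)j;
        float Ei = (1.0f - f) * E[j] + f * E[j + 1]; // linear field gather
        v[i] += qm * Ei * dt;                        // accelerate
        x[i] += v[i] * dt;                           // advance position
        float L = nx * dx;                           // periodic wrap
        if (x[i] < 0.0f) x[i] += L;
        if (x[i] >= L)   x[i] -= L;
    }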

14.
Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles move freely within a given space, so on a distributed-memory system load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. Shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) pairing of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by the label of the cell to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the latter problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, a scalar supercomputer, a vector supercomputer, and a graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.
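A hedged sketch of the pre-conditioning step described above (here via Thrust, which the paper does not necessarily use): particle indices are sorted by the label of the cell containing them, so contact candidates can later be paired from contiguous ranges without write conflicts.

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    void sort_particles_by_cell(thrust::device_vector<int>& cell_of_particle,
                                thrust::device_vector<int>& particle_index) {
        // particle_index starts as 0,1,2,... and ends up grouped by cell label;
        // cell_of_particle is reordered into sorted (key) order alongside it.
        thrust::sequence(particle_index.begin(), particle_index.end());
        thrust::sort_by_key(cell_of_particle.begin(), cell_of_particle.end(),
                            particle_index.begin());
    }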

15.
The lattice Boltzmann method (LBM) can gain substantial performance from graphics processing unit (GPU) computing, and GPU- or multi-GPU-based LBM is therefore a promising candidate for the study of large-scale fluid flows. However, the multi-GPU lattice Boltzmann algorithm has not been studied extensively, especially for simulations of flow in complex geometries. In this paper, through coupling with the message passing interface (MPI) technique, we present an implementation of multi-GPU based LBM for fluid flow through porous media, together with optimization strategies based on the data structure and layout which markedly reduce memory accesses and completely hide the communication overhead. The performance of the algorithm is then tested on a one-node cluster equipped with four Tesla C1060 GPU cards, where up to 1732 MFLUPS is achieved for Poiseuille flow and a nearly linear speedup with the number of GPUs is observed.
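The data-layout optimization mentioned above typically means a structure-of-arrays: storing the distributions as f[q][cell] rather than f[cell][q], so that neighbouring threads touch consecutive addresses. A hedged sketch of the BGK relaxation step under that layout (illustrative names, not the paper's code):

    #include <cuda_runtime.h>

    #define Q 9  // D2Q9 lattice

    __global__ void bgk_collide(float* f, const float* feq, int ncells, float omega) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ncells) return;
        for (int q = 0; q < Q; ++q) {
            int idx = q * ncells + i;               // SoA index: coalesced across threads
            f[idx] -= omega * (f[idx] - feq[idx]);  // relax toward local equilibrium
        }
    }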

16.
Spatial computing is an emerging field that recognizes the importance of explicitly handling spatial relationships at three levels: computer architectures, programming languages and applications. In this context, we present MGS, an experimental programming language where data structures are fields on abstract spaces. In MGS, fields are transformed using rules. We show that this approach is able to unify, at least for programming purposes, several computational models like Lindenmayer systems and cellular automata. The MGS notions of topological collection and transformation are formalized using concepts developed in algebraic topology. We propose to use transformations in order to implement a discrete version of some differential operators. These transformations satisfy a Stokes-like theorem. This result constitutes a geometric view of programming where data are handled like fields in physics. The relevance of this approach for the design of autonomic software systems is discussed in the conclusion.

17.
詹飞  马晓川  杨力 《声学学报》2018,43(4):445-452
To address the sharp growth in computational load and data-storage demands brought by new target detection schemes such as wideband coded pulses and multiple-input multiple-output (MIMO) sonar, and exploiting the data-level parallelism of the phase-coded pulse echo detection algorithm for underwater vehicles, this work proposes using the many-core architecture of the Graphics Processing Unit (GPU). A parallel implementation framework for the algorithm on the GPU is designed, considering the task allocation strategy, the data processing flow, GPU hardware utilization and memory access. Using lake-trial data, a desktop GPU platform, an embedded GPU platform and a conventional vehicle signal-processing platform based on multi-core Digital Signal Processors (DSPs) are benchmarked; compared with the multi-core DSP platform, the embedded GPU platform has advantages in power consumption and computational performance. The results show that an embedded GPU platform can greatly improve performance per watt and simplify system design, meeting the large-data-volume, low-power and real-time requirements of new vehicle detection systems.
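Detection of a phase-coded pulse is a matched filter, which on a GPU is naturally computed in the frequency domain: FFT the echo, multiply by the conjugate spectrum of the replica, inverse FFT. A hedged cuFFT sketch (illustrative names; not the paper's implementation):

    #include <cuda_runtime.h>
    #include <cufft.h>

    __global__ void conj_multiply(cufftComplex* x, const cufftComplex* rep, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        cufftComplex a = x[i], b = rep[i];
        x[i].x = a.x * b.x + a.y * b.y;  // x[i] *= conj(rep[i])
        x[i].y = a.y * b.x - a.x * b.y;
    }

    void matched_filter(cufftComplex* d_echo, const cufftComplex* d_replica_spec, int n) {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);
        cufftExecC2C(plan, d_echo, d_echo, CUFFT_FORWARD);    // echo -> spectrum
        conj_multiply<<<(n + 255) / 256, 256>>>(d_echo, d_replica_spec, n);
        cufftExecC2C(plan, d_echo, d_echo, CUFFT_INVERSE);    // back to delay domain
        cufftDestroy(plan);  // note: cuFFT's inverse is unnormalized (scale by 1/n if needed)
    }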

18.
High-performance streams of (pseudo) random numbers are crucial for the efficient implementation of countless stochastic algorithms, most importantly Monte Carlo simulations and molecular dynamics simulations with stochastic thermostats. A number of implementations of random number generators have been discussed for GPU platforms before, and some generators are even included in the CUDA supporting libraries. Nevertheless, not all of these generators are well suited for highly parallel applications where each thread requires its own generator instance. For this situation, encountered for instance in simulations of lattice models, most of the high-quality generators with large states, such as the Mersenne twister, cannot be used efficiently without substantial changes. We provide a broad review of existing CUDA variants of random-number generators and present the CUDA implementation of a new massively parallel, high-quality, high-performance generator with a small memory load overhead.
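For the one-generator-per-thread situation described above, CUDA's cuRAND device API with a small-state, counter-based generator (Philox) is a natural baseline, since its state fits easily in registers. A minimal sketch:

    #include <curand_kernel.h>

    __global__ void sample_uniform(float* out, int n, unsigned long long seed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandStatePhilox4_32_10_t state;  // small per-thread state
        curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);
        out[i] = curand_uniform(&state);   // one independent stream per thread
    }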

19.
For the dynamics of macromolecules in solution, hydrodynamic interactions mediated by the solvent molecules often play an important role, although one is not interested in the dynamics of the solvent itself. In computer simulations one can therefore save a large amount of computer time by replacing the solvent with a lattice fluid. The macromolecules are propagated by Molecular Dynamics (MD), while the fluid is governed by the fluctuating Lattice-Boltzmann (LB) equation. We present a fluctuating LB implementation for a single graphics card (GPU) coupled to an MD simulation running on conventional processors (CPUs). Particular emphasis lies on the optimization of the combined code. In our implementation, the LB update is performed in parallel with the force calculation on the CPU, which often completely hides the additional computational cost of the LB. Compared to our parallel LB implementation on a conventional quad-core CPU, the GPU LB is 50 times faster, and we show that a whole commodity cluster with InfiniBand interconnect cannot outperform a single GPU in strong scaling. The presented code is part of the open source simulation package ESPResSo.
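The overlap described above works because kernel launches are asynchronous with respect to the host: the LB update is issued, the CPU computes the MD forces in the meantime, and only then does the host wait. A hedged skeleton (illustrative names, not the ESPResSo code):

    #include <cuda_runtime.h>

    __global__ void lb_collide_stream(float* f, int ncells) {
        /* fluctuating LB collide-and-stream update (omitted) */
    }

    void md_compute_forces_on_cpu() { /* conventional CPU force loop */ }

    void coupled_step(float* d_f, int ncells) {
        lb_collide_stream<<<(ncells + 255) / 256, 256>>>(d_f, ncells); // returns immediately
        md_compute_forces_on_cpu();  // runs on the CPU while the GPU works
        cudaDeviceSynchronize();     // the LB cost is hidden if the CPU part takes longer
    }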

20.
李大禹  胡立发  穆全全  宣丽 《光子学报》2008,37(8):1643-1647
GPU-accelerated computation of wavefront reconstruction for liquid-crystal adaptive optics is presented. The Zernike modal wavefront reconstruction algorithm for liquid-crystal adaptive optics is introduced, the general-purpose architecture of the GPU and the GPU implementation of wavefront reconstruction are discussed in detail, and experimental comparisons between the GPU and the CPU are given. The results show that GPU-based wavefront reconstruction not only computes the gray-level distribution of the liquid-crystal wavefront corrector accurately, but is also tens of times faster than the conventional CPU computation.
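Zernike modal reconstruction reduces to matrix-vector products (measured slopes to modal coefficients, coefficients to the phase map on the corrector), which maps directly onto cuBLAS. A hedged sketch of the first step; all names are illustrative, and R is assumed to be a precomputed nmodes-by-nslopes reconstruction matrix stored column-major, as cuBLAS expects:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // coeffs = R * slopes
    void reconstruct(cublasHandle_t h, const float* d_R, const float* d_slopes,
                     float* d_coeffs, int nmodes, int nslopes) {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemv(h, CUBLAS_OP_N, nmodes, nslopes,
                    &one, d_R, nmodes, d_slopes, 1, &zero, d_coeffs, 1);
    }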
