Similar Literature (20 results)
1.
The implementation of an edge-based three-dimensional Reynolds-Averaged Navier–Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processing unit (CPU) have been used to enhance code scalability. The code is written in a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passing Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver's parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded by the loss of computing efficiency of the GPU as the case size decreases. However, for large enough grid sizes, the scalability improves markedly. A cluster featuring commodity GPUs and a high-bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster of equivalent computational power.
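A minimal CUDA sketch of the kind of edge loop described above (the solver itself is written in OpenCL; the kernel name, array layout, and the atomic scatter to the two end nodes are illustrative assumptions, not the authors' code):

```cuda
// Illustrative sketch of an edge-based flux accumulation loop, one thread per edge.
#include <cuda_runtime.h>

__global__ void edgeFluxKernel(int numEdges,
                               const int2*   __restrict__ edgeNodes,    // the two nodes of each edge
                               const double* __restrict__ edgeFlux,     // precomputed flux per edge
                               double*       __restrict__ nodeResidual) // accumulated residual per node
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;

    int2   nodes = edgeNodes[e];
    double flux  = edgeFlux[e];

    // Scatter the edge flux to both end nodes; atomics avoid races when several
    // edges sharing a node are processed concurrently.
    // Note: double-precision atomicAdd requires compute capability 6.0 or later.
    atomicAdd(&nodeResidual[nodes.x], +flux);
    atomicAdd(&nodeResidual[nodes.y], -flux);
}

// Host-side launch, e.g.:
// int threads = 256, blocks = (numEdges + threads - 1) / threads;
// edgeFluxKernel<<<blocks, threads>>>(numEdges, d_edgeNodes, d_edgeFlux, d_nodeResidual);
```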

2.
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.

3.
An efficient computing framework, named PFlows, for fully resolved direct numerical simulation of particle-laden flows was accelerated on NVIDIA graphics processing units (GPUs) and GPU-like accelerator (DCU) cards. The framework couples the lattice Boltzmann method for fluid flow with the immersed boundary method for fluid–particle interaction and the discrete element method for particle collisions, using two fixed Eulerian meshes and one moving Lagrangian point mesh, respectively. All parts are accelerated with a fine-grained parallelism technique using CUDA on GPUs, and further using HIP on DCU cards; that is, the calculation for each fluid grid cell, each immersed boundary point, each particle motion, and each particle-pair collision is handled by one thread. Coalesced memory accesses to the LBM distribution functions, stored with a Structure-of-Arrays data layout, are used to maximize utilization of hardware bandwidth. A parallel reduction in shared memory over the data of immersed boundary points is adopted to reduce global-memory accesses when integrating the particle hydrodynamic force. MPI is further used for computing on heterogeneous architectures with multiple CPUs and GPUs/DCUs, and communications between adjacent processors are hidden by overlapping them with calculations. Two benchmark cases were conducted for code validation: a pure fluid flow and a particle-laden flow. Performance on a single accelerator shows that a GPU V100 achieves a 7.1–11.1 times speed-up, and a single DCU a 5.6–8.8 times speed-up, compared with a single Xeon CPU chip (32 cores). On multiple accelerators, the parallel efficiency is 0.5–0.8 for weak scaling and 0.68–0.9 for strong scaling on up to 64 DCU cards, even for dense flow (φ = 20%). The peak performance reaches 179 giga lattice updates per second (GLUPS) on 256 DCU cards using 1 billion grid cells and 1 million particles. Finally, a large-scale simulation of a gas–solid flow with 1.6 billion grid cells and 1.6 million particles was conducted using only 32 DCU cards. This demonstrates that the present framework is promising for simulations of large-scale particle-laden flows in the upcoming exascale computing era.
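The shared-memory reduction mentioned above for integrating the hydrodynamic force over a particle's immersed-boundary points could look roughly like the following CUDA sketch (one block per particle; only the x-component is shown, and all names and the block-size choice are assumptions, not the authors' code):

```cuda
// Illustrative sketch: reduce per-point hydrodynamic forces to one force per particle.
// One block per particle; shared memory avoids repeated global-memory traffic.
// blockDim.x is assumed to be a power of two.
__global__ void reduceParticleForce(const double* __restrict__ pointForceX,   // force x at each IB point
                                    const int*    __restrict__ pointOffset,   // first IB point of each particle
                                    double*       __restrict__ particleForceX)
{
    extern __shared__ double sdata[];
    int p     = blockIdx.x;          // particle index
    int first = pointOffset[p];
    int last  = pointOffset[p + 1];

    // Each thread sums a strided subset of this particle's boundary points.
    double sum = 0.0;
    for (int i = first + threadIdx.x; i < last; i += blockDim.x)
        sum += pointForceX[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) particleForceX[p] = sdata[0];
}
// Launch (x-component only): reduceParticleForce<<<numParticles, 256, 256 * sizeof(double)>>>(...);
```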

4.
This paper presents a Navier–Stokes solver for steady and unsteady turbulent flows on unstructured/hybrid grids, with triangular and quadrilateral elements, which was implemented to run on Graphics Processing Units (GPUs). The paper focuses on programming issues for efficiently porting the CPU code to the GPU, using the CUDA language. Compared with cell-centered schemes, the use of a vertex-centered finite volume scheme on unstructured grids increases the programming complexity, since the number of nodes connected by an edge to any given node can vary considerably. Thus, careful GPU memory handling is necessary in order to maximize the speed-up of the GPU implementation with respect to the Fortran code running on a single CPU core. The developed GPU-enabled code is used to numerically study steady and unsteady flows around the supercritical airfoil OAT15A, with emphasis on the transonic buffet phenomenon. The computations were carried out on NVIDIA GeForce GTX 285 graphics cards, and speed-ups of up to ~46× (on a single GPU, with double-precision arithmetic) are reported. Copyright © 2010 John Wiley & Sons, Ltd.
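To illustrate the variable node connectivity issue raised above, here is a hedged CUDA sketch of a gather-style vertex-centered residual loop using compressed adjacency arrays (the names, the CSR-like storage, and the simplistic flux expression are assumptions for illustration, not the authors' scheme):

```cuda
// Illustrative sketch: vertex-centered residual gather on an unstructured grid.
// The number of edges meeting at a node varies, so adjacency is stored in CSR form:
//   nbrOffset[n] .. nbrOffset[n+1]-1 index into nbrNode[] and edgeWeight[].
__global__ void vertexGatherKernel(int numNodes,
                                   const int*    __restrict__ nbrOffset,
                                   const int*    __restrict__ nbrNode,
                                   const double* __restrict__ edgeWeight,
                                   const double* __restrict__ nodeValue,
                                   double*       __restrict__ nodeResidual)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= numNodes) return;

    double res = 0.0;
    for (int k = nbrOffset[n]; k < nbrOffset[n + 1]; ++k) {
        int m = nbrNode[k];
        // Simplistic illustrative "flux": difference to the neighbour scaled by an edge weight.
        res += edgeWeight[k] * (nodeValue[m] - nodeValue[n]);
    }
    nodeResidual[n] = res;   // one write per node; no atomics needed in a gather formulation
}
```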

5.
A typical large-scale CFD code based on adaptive, edge-based finite-element formulations for the solution of compressible and incompressible flow is taken as a test bed to port such codes to graphics hardware (graphics processing units, GPUs) using semi-automatic techniques. In previous work, a GPU version of this code was presented, in which, for many run configurations, all mesh-sized loops required throughout time stepping were ported. This approach simultaneously achieves the fine-grained parallelism required to fully exploit the capabilities of many-core GPUs, completely avoids the crippling bottleneck of GPU–CPU data transfer, and uses a transposed memory layout to meet the distinct memory access requirements posed by GPUs. The present work describes the next step of this porting effort, namely to integrate GPU-based, fine-grained parallelism with Message-Passing-Interface-based, coarse-grained parallelism, in order to achieve a code capable of running on multi-GPU clusters. This is carried out in a semi-automated fashion: the existing Fortran–Message Passing Interface code is preserved, with the translator inserting data transfer calls as required. Performance benchmarks indicate up to a factor of 2 performance advantage of the NVIDIA Tesla M2050 GPU (Santa Clara, CA, USA) over the six-core Intel Xeon X5670 CPU (Santa Clara, CA, USA), for certain run configurations. In addition, good scalability is observed when running across multiple GPUs. The approach should be of general interest, as how best to run on GPUs is presently being considered for many so-called legacy codes. Copyright © 2011 John Wiley & Sons, Ltd.

6.
The two-dimensional shallow water equations describing water flow are solved with the finite volume method to simulate the propagation of flood waves, and the program is accelerated with GPU parallel computing, yielding an efficient simulation method for shallow water flow. Numerical results show that, with the proposed GPU parallelization strategy and the Compute Unified Device Architecture (CUDA), speed-ups of up to 112 times over a single CPU core are achieved, providing effective support for fast flood prediction and disaster-mitigation decision making on a single workstation. In addition, the accuracy of the GPU-based shallow water simulation is verified, and the parallel performance optimizations are analyzed. The model is used to simulate the propagation of a dam-break flood among three-dimensional obstacles.
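A minimal CUDA sketch of the per-cell explicit finite-volume update that such a shallow-water model performs each time step (only the continuity/depth update from precomputed face fluxes is shown; the structured-grid layout and all names are illustrative assumptions, not the paper's code):

```cuda
// Illustrative sketch: explicit finite-volume update of water depth h on an nx*ny grid,
// using precomputed mass fluxes on the east/west and north/south faces of each cell.
__global__ void updateDepth(int nx, int ny, double dt, double dx, double dy,
                            const double* __restrict__ fluxX,  // (nx+1)*ny x-face fluxes
                            const double* __restrict__ fluxY,  // nx*(ny+1) y-face fluxes
                            const double* __restrict__ hOld,
                            double*       __restrict__ hNew)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    double fe = fluxX[(i + 1) + j * (nx + 1)];   // east face
    double fw = fluxX[ i      + j * (nx + 1)];   // west face
    double fn = fluxY[ i      + (j + 1) * nx];   // north face
    double fs = fluxY[ i      +  j      * nx];   // south face

    int c = i + j * nx;
    hNew[c] = hOld[c] - dt / dx * (fe - fw) - dt / dy * (fn - fs);
}
```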

7.
Modern graphics processing units (GPUs) offer strong parallel numerical computing capability. This paper briefly introduces GPU hardware architecture, the data structures and implementation methods for general-purpose computation on GPUs, and the OpenGL Shading Language used to write fragment programs. It reviews progress in applying GPUs to large-scale mechanics problems, briefly covering: GPU simulation of natural fluid phenomena, which in essence solves the Navier–Stokes equations with finite difference methods; GPU finite element computations, in which the finite element systems are solved with a GPU-based conjugate gradient method; GPU molecular dynamics, where short-range interatomic forces are computed and neighbour-atom lists are generated on the GPU; GPU quantum Monte Carlo computations; and GPU n-body gravitational interaction, with the positions, masses, velocities, and accelerations of the n bodies stored in GPU textures. Comparing GPU-based and CPU-based computations, the following GPU computations have been completed: Gaussian elimination and conjugate gradient solvers for linear systems, applied to large-scale finite element computations; acceleration of meshless methods; acceleration of linear and nonlinear molecular structural mechanics computations; and analysis of the mechanical properties of carbon nanotubes. Research directions for GPUs in large-scale mechanics computations are indicated.

8.
A molecular structural mechanics approach to carbon nanotubes on graphics processing units (GPUs) is reported. As a powerful parallel and relatively low-cost processor, the GPU is used to accelerate the computations of the molecular structural mechanics approach. The data structures, matrix-vector multiplication algorithm, texture reduction algorithm, and ICCG method on the GPU are presented. Computations of the Young's moduli of carbon nanotubes by the molecular structural mechanics approach on the GPU demonstrate its accuracy. The running times for carbon nanotubes with a large number of degrees of freedom (DOF), larger than 100,000, on the GPU are compared against those on the CPU, showing that the GPU can accelerate the computations of the molecular structural mechanics approach to carbon nanotubes.

9.
In pursuit of obtaining high-fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs), which were originally intended for gaming applications, are currently being used to accelerate computational fluid dynamics (CFD) codes. With a high peak throughput of about 1 TFLOPS on a PC, GPUs seem to be favourable for many high-resolution computations. One such computation that involves a lot of number crunching is computing time-accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver built from several GPU kernels such as gradient, Laplacian, and reduction. Some of the simple arithmetic vector calculations are implemented using the CU++ approach (An Object Oriented Framework for Computational Fluid Dynamics Applications using Graphics Processing Units, Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9), where GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.
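As a hedged illustration of the matrix-free kernels mentioned above, here is a CUDA sketch of a Laplacian "apply" of the kind a Krylov solver calls repeatedly (shown on a uniform Cartesian grid for brevity, whereas the paper works on unstructured/hybrid/overset grids; all names are assumptions):

```cuda
// Illustrative sketch: matrix-free 7-point Laplacian apply, y = A*x, on an nx*ny*nz grid.
// A Krylov solver only needs this operator action plus dot products and vector updates.
__global__ void applyLaplacian(int nx, int ny, int nz, double invH2,
                               const double* __restrict__ x,
                               double*       __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    // Interior points only; boundary treatment is omitted in this sketch.
    if (i <= 0 || j <= 0 || k <= 0 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    int id = i + nx * (j + ny * k);
    y[id] = invH2 * (x[id - 1]       + x[id + 1]
                   + x[id - nx]      + x[id + nx]
                   + x[id - nx * ny] + x[id + nx * ny]
                   - 6.0 * x[id]);
}
```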

10.
Nowadays, high-performance computing (HPC) systems are experiencing a disruptive moment, with a variety of novel architectures and frameworks and no clarity as to which one will prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an algebraic operational approach for direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows using unstructured hybrid meshes. The proposed strategy consists of representing the whole time-integration algorithm using only three basic algebraic operations: the sparse matrix–vector product, the linear combination of vectors, and the dot product. The main idea is based on decomposing the nonlinear operators into a concatenation of two SpMV operations. This provides high modularity and portability. An exhaustive analysis of the proposed implementation for hybrid CPU/GPU supercomputers has been conducted, with tests using up to 128 GPUs. The main objective is to understand the challenges of implementing CFD codes on new architectures.
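Two of the three algebraic building blocks named above can be sketched in CUDA as follows, with the dot product typically delegated to a library such as cuBLAS (the CSR storage and all names are assumptions for illustration, not the authors' implementation):

```cuda
// Illustrative sketch of two algebraic kernels: SpMV in CSR format (one thread per row)
// and an axpy-style linear combination of vectors.
__global__ void spmvCsr(int nRows,
                        const int*    __restrict__ rowPtr,
                        const int*    __restrict__ colIdx,
                        const double* __restrict__ val,
                        const double* __restrict__ x,
                        double*       __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];
    y[row] = sum;
}

__global__ void axpy(int n, double a, const double* __restrict__ x, double* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// As the abstract describes, the nonlinear operator is then evaluated as a
// concatenation of two such SpMV calls rather than as a dedicated kernel.
```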

11.
We implement and evaluate a massively parallel and scalable algorithm based on a multigrid-preconditioned Defect Correction method for the simulation of fully nonlinear free surface flows. The simulations are based on a potential model that describes wave propagation over uneven bottoms in three space dimensions and is useful for fast analysis and prediction purposes in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel on an affordable, modern, special-purpose graphics processing unit (GPU). The model is based on a low-storage, flexible-order accurate finite difference method that is known to be efficient and scalable on a CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a new trend in high-performance computing where many-core GPUs are utilized as high-throughput co-processors to the CPU. We describe and demonstrate how this approach makes it possible to do fast desktop computations for large nonlinear wave problems in numerical wave tanks (NWTs) with close to 50/100 million total grid points in double/single precision with 4 GB of global device memory available. A new code base has been developed in C++ and Compute Unified Device Architecture (CUDA) C and is found to improve the runtime by more than an order of magnitude in double-precision arithmetic, for the same accuracy, over an existing CPU (single-thread) Fortran 90 code when executed on a single modern GPU. These significant improvements are achieved by carefully implementing the algorithm to minimize data transfer and take advantage of the massive multi-threading capability of the GPU device. Copyright © 2011 John Wiley & Sons, Ltd.

12.
We present a novel implementation of the modal DG method for hyperbolic conservation laws in two dimensions on graphics processing units (GPUs) using NVIDIA's Compute Unified Device Architecture. Both flexible and highly accurate, DG methods accommodate parallel architectures well, as their discontinuous nature produces element-local approximations. GPUs suit high-performance scientific computing well, as these powerful, massively parallel, cost-effective devices have recently added support for double-precision floating-point numbers. Computed examples for the Euler equations over unstructured triangle meshes demonstrate the effectiveness of our implementation on an NVIDIA GTX 580 device. Profiling of our method reveals performance comparable with an existing nodal DG-GPU implementation for linear problems. Copyright © 2014 John Wiley & Sons, Ltd.

13.
14.
With the increasing heterogeneity and on-node parallelism of high-performance computing hardware, a major challenge is to develop portable and efficient algorithms and software. In this work, we present our implementation of a portable code to perform surface reconstruction using NVIDIA's Thrust library. Surface reconstruction is a technique commonly used in volume tracking methods for simulations of multimaterial flow with interfaces. We have designed a 3D mesh data structure that is easily mapped to the 1D vectors used by Thrust and, at the same time, is simple to use and employs familiar data structure terminology (such as cells, faces, vertices, and edges). With this new data structure in place, we have implemented a piecewise linear interface reconstruction algorithm in three dimensions that effectively exploits the symmetry present in a uniform rectilinear computational cell. Finally, we report performance results, which show that a single implementation of these algorithms can be compiled to multiple backends (specifically, multi-core CPUs, NVIDIA GPUs, and Intel Xeon Phi processors), making efficient use of the available parallelism on each. We also compare the performance of our implementation to a legacy FORTRAN implementation in Message Passing Interface (MPI) and show performance parity on single- and multi-core CPUs and good parallel speed-ups on GPUs. Our research demonstrates the advantage of the performance portability of the underlying data-parallel programming model.
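A small hedged Thrust sketch in the spirit of the 1D-vector mesh mapping described above: per-cell interface-normal estimates computed from a volume-fraction field stored in flat device_vectors (the field names and the finite-difference normal estimate are illustrative assumptions, not the authors' reconstruction code):

```cuda
// Illustrative Thrust sketch: a 3D cell field stored as a flat 1D device_vector,
// with a per-cell functor estimating the interface normal from volume-fraction differences.
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

struct NormalFromFractions {
    const double* f;          // volume fraction, flattened as i + nx*(j + ny*k)
    double *nxv, *nyv, *nzv;  // output normal components (structure of arrays)
    int nx, ny, nz;

    __host__ __device__ void operator()(int c) const {
        int i = c % nx, j = (c / nx) % ny, k = c / (nx * ny);
        if (i < 1 || j < 1 || k < 1 || i > nx - 2 || j > ny - 2 || k > nz - 2) return;
        // Central differences of the volume fraction approximate the interface normal.
        nxv[c] = 0.5 * (f[c + 1]       - f[c - 1]);
        nyv[c] = 0.5 * (f[c + nx]      - f[c - nx]);
        nzv[c] = 0.5 * (f[c + nx * ny] - f[c - nx * ny]);
    }
};

void computeNormals(const thrust::device_vector<double>& frac,
                    thrust::device_vector<double>& nxv,
                    thrust::device_vector<double>& nyv,
                    thrust::device_vector<double>& nzv,
                    int nx, int ny, int nz)
{
    NormalFromFractions op{thrust::raw_pointer_cast(frac.data()),
                           thrust::raw_pointer_cast(nxv.data()),
                           thrust::raw_pointer_cast(nyv.data()),
                           thrust::raw_pointer_cast(nzv.data()), nx, ny, nz};
    thrust::for_each(thrust::device,
                     thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(nx * ny * nz), op);
}
```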

15.
In recent years, rover-based planetary exploration missions have posed new challenges related to both the speed and the fidelity of rover simulation. This paper introduces ROSTDyn (rover simulation based on terramechanics and dynamics), a good-fidelity (for linear motion without side forces and related torques), real-time (on an Intel Core2 CPU and ATI Radeon HD 4650 GPU) simulation platform for planetary rovers, developed in C++ on the basis of the Vortex physics engine. The inherent trade-off between high fidelity and high speed is overcome by using an improved and simplified terramechanics model together with Vortex. This paper presents the key technologies and algorithms constituting ROSTDyn, including the creation of the rover and terrain models, computation of contact-area parameters, computation of the interaction force/torque model, and ROSTDyn's implementation. Speed tests confirm that ROSTDyn can perform a real-time simulation when the display frequency is below 45 Hz and the computation frequency is below 450 Hz. A comparison of simulation and experiment results for an example involving a six-wheel rover climbing a series of slopes confirms the good fidelity of ROSTDyn.

16.
To overcome the long run time of serial GPS acquisition algorithms on a computer's central processing unit, two parallel acquisition algorithms, suited to signals of different carrier-to-noise ratios, were designed and implemented on a graphics processing unit with strong parallel processing capability in order to increase acquisition speed. The proposed algorithms follow the design philosophy of the Compute Unified Device Architecture (CUDA) and adopt a parallel code-phase search acquisition strategy. Strong signals are acquired by parallel search over multiple channels and frequency bins for the 32 satellites of the GPS constellation, while weak signals are handled by non-coherent integration, searching a single satellite over multiple time segments and frequency bins in parallel and then processing the channels serially. Simulation results show that both parallel acquisition algorithms are ten times faster than the serial implementation, and that non-coherent integration improves weak-signal acquisition capability: for 10 ms of intermediate-frequency data with a carrier-to-noise ratio of 40 dB, correct acquisition is still achieved while maintaining acquisition speed.
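For illustration only, here is a brute-force CUDA acquisition sketch in which one thread evaluates one (Doppler bin, code phase) cell of the search grid; note that this time-domain correlation is a simplification, not the FFT-based parallel code-phase search used in the paper, and all names are assumptions:

```cuda
// Illustrative sketch: brute-force acquisition grid, one thread per (Doppler bin, code phase).
// Phase accumulation is kept deliberately simple for illustration.
__global__ void acquireGrid(int nSamples, int nPhases, int nDoppler,
                            double fs, double fIF, double dopplerStep,
                            const float* __restrict__ signal,   // sampled IF signal
                            const float* __restrict__ code,     // local C/A code replica, same sampling
                            float*       __restrict__ power)    // nDoppler * nPhases output
{
    int phase = blockIdx.x * blockDim.x + threadIdx.x;
    int bin   = blockIdx.y * blockDim.y + threadIdx.y;
    if (phase >= nPhases || bin >= nDoppler) return;

    double freq = fIF + (bin - nDoppler / 2) * dopplerStep;
    float I = 0.f, Q = 0.f;
    for (int n = 0; n < nSamples; ++n) {
        float s   = signal[n] * code[(n + phase) % nSamples];   // code-phase-shifted replica
        float arg = 2.0f * 3.14159265f * (float)(freq * n / fs);
        I += s * cosf(arg);
        Q += s * sinf(arg);
    }
    power[bin * nPhases + phase] = I * I + Q * Q;   // the maximum is picked on the host afterwards
}
```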

17.
In this article, we discuss how the fast multipole method (FMM) can be implemented on modern parallel computers, ranging from computer clusters to multicore processors and graphics cards (GPUs). The FMM is a somewhat difficult application for parallel computing because of its tree structure and the fact that it requires many complex operations that are not regularly structured. Computational linear algebra with dense matrices, for example, allows many optimizations that leverage the regular computation pattern. The FMM can be similarly optimized, but we will see that the complexity of the optimization steps is greater. The discussion starts with a general presentation of FMMs. We briefly discuss parallel methods for the FMM, such as building the FMM tree in parallel and reducing communication during the FMM procedure. Finally, we focus on porting and optimizing the FMM on GPUs.

18.
While new power-efficient computer architectures exhibit spectacular theoretical peak performance, they require specific conditions to operate efficiently, which makes porting complex algorithms a challenge. Here, we report results of the semi-implicit method for pressure-linked equations (SIMPLE) and the pressure-implicit with splitting of operators (PISO) methods implemented on the graphics processing unit (GPU). We examine the advantages and disadvantages of fully porting these algorithms, run on unstructured meshes, versus partially accelerating them. We found that the full-port strategy requires adjusting the internal data structures to the new hardware, and we propose a convenient format for storing internal data structures on GPUs. Our implementation is validated on standard steady and unsteady problems, and its computational efficiency is checked by comparing its results and run times with those of standard software (OpenFOAM) run on a central processing unit (CPU). The results show that a server-class GPU outperforms a server-class dual-socket multi-core CPU system running essentially the same algorithm by up to a factor of 4.
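One commonly used GPU-friendly way to store unstructured-mesh coefficient matrices in fully ported solvers is an ELLPACK-style layout with column-major padding, sketched below; this particular format and all names are assumptions for illustration and not necessarily the format proposed in the paper:

```cuda
// Illustrative sketch: ELLPACK-style SpMV. Each row is padded to the same length and the
// entries are stored column-major, so consecutive threads read consecutive memory addresses.
__global__ void spmvEll(int nRows, int rowWidth,
                        const int*    __restrict__ cols,   // nRows * rowWidth, column-major, -1 = padding
                        const double* __restrict__ vals,   // nRows * rowWidth, column-major
                        const double* __restrict__ x,
                        double*       __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int k = 0; k < rowWidth; ++k) {
        int idx = row + k * nRows;       // coalesced: stride-1 across threads for fixed k
        int c   = cols[idx];
        if (c >= 0) sum += vals[idx] * x[c];
    }
    y[row] = sum;
}
```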

19.
Building a structural finite element model of an aerospace vehicle is the basis for accurate flight simulation and for structural fault monitoring and diagnosis during the on-orbit flight phase. Using a simplified beam model of a slender vehicle, new CUDA (Compute Unified Device Architecture)-based algorithms for generating element stiffness matrices and assembling the global stiffness matrix are proposed. Exploiting the symmetry of the beam element matrices and the GPU hardware architecture, a parallel generation algorithm is proposed and then improved. To reduce assembly time, a coloring algorithm is used during assembly, and a nonzero-entry assembly strategy based on GPU (Graphics Processing Unit) shared memory is proposed; comparisons of test cases on different computing platforms verify the speed of the new algorithms. Numerical examples show that the proposed algorithms are efficient and, for models within a certain problem size, can meet the real-time requirements of fast computation and diagnosis.
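A hedged CUDA sketch of coloring-based assembly: elements of one color share no nodes, so a kernel launched per color can add element contributions into the global matrix without atomic operations (the tiny 2×2 scalar element, the precomputed index map, and all names are illustrative assumptions, not the paper's beam-element code):

```cuda
// Illustrative sketch: assemble 2x2 element matrices of one color into a global CSR matrix.
// Because same-colored elements share no nodes, no two threads write the same global entry.
__global__ void assembleColor(int nElemsInColor,
                              const int*    __restrict__ elemOfColor,  // element ids of this color
                              const double* __restrict__ elemK,        // 4 entries per element (row-major 2x2)
                              const int*    __restrict__ elemToCsr,    // 4 CSR value slots per element
                              double*       __restrict__ csrVal)       // global matrix values
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nElemsInColor) return;

    int e = elemOfColor[t];
    for (int k = 0; k < 4; ++k)
        csrVal[elemToCsr[4 * e + k]] += elemK[4 * e + k];   // race-free within one color
}

// Host side: for (int c = 0; c < nColors; ++c) launch assembleColor for the elements of color c.
```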

20.
Large-scale three-dimensional finite difference grid generation is the basis of three-dimensional finite difference computation, and generation efficiency is an active research topic. Traditional staircase finite difference grids are generated mainly by the ray-penetration method or the slicing method. Building on the traditional serial ray-penetration method, this paper proposes a parallel staircase finite difference grid generation algorithm based on GPU (graphics processing unit) parallel computing. The parallel algorithm uses a batched data-transfer strategy, so that the problem size it can handle does not depend on the GPU memory size, balancing data-transfer efficiency against grid-generation scale. To reduce the amount of data transferred, the proposed algorithm generates the ray origin coordinates independently inside the GPU threads, further improving execution efficiency and the degree of parallelism. Numerical comparisons show that the parallel algorithm is far more efficient than the traditional ray-penetration method. Finally, finite difference computation examples confirm that the parallel algorithm meets the needs of large-scale numerical simulation of complex models.
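A hedged CUDA sketch of the ray-penetration idea: one thread per (i, j) grid column generates its own vertical ray origin inside the kernel, as described above, and marks a cell as interior when an odd number of surface triangles lie above it (the brute-force triangle loop and all names are simplifying assumptions, not the paper's batched implementation):

```cuda
// Illustrative sketch: staircase grid generation by vertical ray penetration.
// One thread per (i, j) column; a cell is "inside" if an odd number of triangles lie above it.
__global__ void rayPenetration(int nx, int ny, int nz,
                               double x0, double y0, double z0, double dx, double dy, double dz,
                               int nTri, const double3* __restrict__ v0,
                               const double3* __restrict__ v1, const double3* __restrict__ v2,
                               unsigned char* __restrict__ inside)   // nx*ny*nz flags
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    // Ray origin generated inside the thread, as in the paper, instead of being transferred.
    double rx = x0 + (i + 0.5) * dx;
    double ry = y0 + (j + 0.5) * dy;

    for (int k = 0; k < nz; ++k) {
        double rz = z0 + (k + 0.5) * dz;
        int crossings = 0;
        for (int t = 0; t < nTri; ++t) {
            // 2D point-in-triangle test in the xy-plane (vertical ray), then compare heights.
            double ax = v0[t].x - rx, ay = v0[t].y - ry;
            double bx = v1[t].x - rx, by = v1[t].y - ry;
            double cx = v2[t].x - rx, cy = v2[t].y - ry;
            double s0 = ax * by - ay * bx;
            double s1 = bx * cy - by * cx;
            double s2 = cx * ay - cy * ax;
            if ((s0 >= 0 && s1 >= 0 && s2 >= 0) || (s0 <= 0 && s1 <= 0 && s2 <= 0)) {
                double area = s0 + s1 + s2;
                if (area != 0.0) {
                    // Height of the triangle surface at (rx, ry) via barycentric interpolation.
                    double zi = (s1 * v0[t].z + s2 * v1[t].z + s0 * v2[t].z) / area;
                    if (zi > rz) ++crossings;
                }
            }
        }
        inside[(k * ny + j) * nx + i] = (unsigned char)(crossings & 1);
    }
}
```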
