Similar Articles
 20 similar articles found (search time: 31 ms)
1.
The implementation of an edge-based three-dimensional Reynolds-Averaged Navier–Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processing unit (CPU) are used to enhance code scalability. The code is written in a mixture of C++ and OpenCL, allowing the source code to execute on GPUs. The Message Passing Interface (MPI) library enables parallel execution of the solver on multiple GPUs. A comparative study of the solver's parallel performance is carried out on a cluster of CPUs and another of GPUs. A single GPU is shown to be up to 64 times faster than a single CPU core. The parallel scalability of the solver is degraded mainly by the GPU's loss of computing efficiency as the case size decreases; for sufficiently large grids, however, scalability improves markedly. A cluster of commodity GPUs with a high-bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster of equivalent computational power.
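A minimal sketch of the overlap pattern described above, assuming a CUDA-style runtime (the paper itself uses OpenCL) and pinned host buffers; the kernels and all names are illustrative stand-ins for the real edge loops:

```cpp
// Sketch: overlap interior GPU work with a non-blocking MPI halo exchange.
// Illustrative only: the paper uses OpenCL; CUDA is shown here. h_send and
// h_recv are assumed pinned (cudaHostAlloc) so the copies run asynchronously.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void interior_edge_loop(double* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] += 1.0;                 // placeholder flux accumulation
}
__global__ void boundary_edge_loop(double* f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] *= 0.5;                 // placeholder boundary update
}

void step(double* d_field, double* h_send, double* h_recv,
          int n, int n_halo, int neighbor, cudaStream_t s) {
    MPI_Request reqs[2];
    // Stage halo values on the host.
    cudaMemcpyAsync(h_send, d_field, n_halo * sizeof(double),
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    // Messages travel while the GPU updates the interior edges.
    MPI_Isend(h_send, n_halo, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv, n_halo, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);
    interior_edge_loop<<<(n + 255) / 256, 256, 0, s>>>(d_field, n);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // Upload the received halo, then finish the edges that needed it.
    cudaMemcpyAsync(d_field, h_recv, n_halo * sizeof(double),
                    cudaMemcpyHostToDevice, s);
    boundary_edge_loop<<<(n_halo + 255) / 256, 256, 0, s>>>(d_field, n_halo);
}
```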

2.
With the increasing heterogeneity and on-node parallelism of high-performance computing hardware, a major challenge is to develop portable and efficient algorithms and software. In this work, we present our implementation of a portable surface-reconstruction code using NVIDIA's Thrust library. Surface reconstruction is a technique commonly used in volume-tracking methods for simulations of multimaterial flow with interfaces. We have designed a 3D mesh data structure that maps easily to the 1D vectors used by Thrust while remaining simple to use, with familiar data-structure terminology (cells, faces, vertices, and edges). With this data structure in place, we have implemented a piecewise linear interface reconstruction algorithm in three dimensions that effectively exploits the symmetry of a uniform rectilinear computational cell. Finally, we report performance results, which show that a single implementation of these algorithms can be compiled to multiple backends (multi-core CPUs, NVIDIA GPUs, and Intel Xeon Phi processors), making efficient use of the available parallelism on each. We also compare our implementation against a legacy Fortran implementation using the Message Passing Interface (MPI), showing performance parity on single- and multi-core CPUs and good parallel speed-ups on the GPU. Our research demonstrates the performance-portability advantage of the underlying data-parallel programming model.
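A minimal sketch of the data-layout idea, with hypothetical field and functor names: mesh quantities live in flat 1D thrust::device_vector containers indexed by cell ID, so one source file compiles to whichever backend Thrust targets:

```cpp
// Sketch: a 3D rectilinear mesh whose per-cell data live in flat 1D
// thrust::device_vector's, so a single source compiles to the CUDA,
// OpenMP, or TBB backends. Field and functor names are hypothetical.
#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct Mesh {
    int nx, ny, nz;
    thrust::device_vector<double> volume_fraction;   // one entry per cell
    int cell_id(int i, int j, int k) const {         // 3D -> 1D index map
        return i + nx * (j + ny * k);
    }
};

struct Clamp01 {                                     // per-cell device functor
    __host__ __device__ double operator()(double f) const {
        return f < 0.0 ? 0.0 : (f > 1.0 ? 1.0 : f);
    }
};

int main() {
    Mesh m{16, 16, 16, thrust::device_vector<double>(16 * 16 * 16, 0.5)};
    // One algorithm call runs on whichever backend was chosen at compile time.
    thrust::transform(m.volume_fraction.begin(), m.volume_fraction.end(),
                      m.volume_fraction.begin(), Clamp01());
    return 0;
}
```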

3.
This paper presents a Navier–Stokes solver for steady and unsteady turbulent flows on unstructured/hybrid grids, with triangular and quadrilateral elements, implemented to run on graphics processing units (GPUs). The paper focuses on the programming issues involved in efficiently porting the CPU code to the GPU using the CUDA language. Compared with cell-centered schemes, a vertex-centered finite volume scheme on unstructured grids increases programming complexity, since the number of nodes connected by edge to any given node can vary widely. Careful GPU memory handling is therefore essential to maximize the speed-up of the GPU implementation over the Fortran code running on a single CPU core. The GPU-enabled code is used to study steady and unsteady flows around the supercritical airfoil OAT15A, with emphasis on the transonic buffet phenomenon. The computations were carried out on NVIDIA GeForce GTX 285 graphics cards, and speed-ups of up to ~46× (on a single GPU, with double-precision arithmetic) are reported. Copyright © 2010 John Wiley & Sons, Ltd.
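A sketch of the memory-handling issue mentioned above, under the common assumption that the variable-degree vertex adjacency is stored in compressed (CSR-style) form; the flux expression is a placeholder:

```cpp
// Sketch of CSR-style adjacency for a vertex-centred scheme: node i owns
// entries offsets[i] .. offsets[i+1]-1 of the flat neighbour list, so the
// varying node degree costs only a variable-length inner loop.
#include <cuda_runtime.h>

__global__ void accumulate(const int* __restrict__ offsets,  // size n + 1
                           const int* __restrict__ nbr,      // flat neighbours
                           const double* __restrict__ u,
                           double* __restrict__ res, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double acc = 0.0;
    for (int e = offsets[i]; e < offsets[i + 1]; ++e)
        acc += u[nbr[e]] - u[i];          // placeholder edge flux
    res[i] = acc;
}
```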

4.
5.
We implement and evaluate a massively parallel and scalable algorithm, based on a multigrid-preconditioned defect correction method, for the simulation of fully nonlinear free-surface flows. The simulations are based on a potential model that describes wave propagation over uneven bottoms in three space dimensions and is useful for fast analysis and prediction in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel on affordable, modern, special-purpose graphics processing units (GPUs). The model is based on a low-storage, flexible-order finite difference method known to be efficient and scalable on a single CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a trend in high-performance computing in which many-core GPUs serve as high-throughput co-processors to the CPU. We describe and demonstrate how this approach enables fast desktop computation of large nonlinear wave problems in numerical wave tanks (NWTs) with close to 50/100 million total grid points in double/single precision within 4 GB of global device memory. A new code base, developed in C++ and CUDA C, improves the runtime by more than an order of magnitude in double-precision arithmetic, at the same accuracy, over an existing single-threaded Fortran 90 CPU code when executed on a single modern GPU. These improvements are achieved by carefully implementing the algorithm to minimize data transfer and to exploit the massive multi-threading capability of the GPU device. Copyright © 2011 John Wiley & Sons, Ltd.
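A schematic of a preconditioned defect correction iteration of the kind the abstract builds on (not the paper's exact formulation); the operator callbacks, including the multigrid preconditioner, are placeholders:

```cpp
// Schematic defect-correction loop: a high-order residual drives a
// correction from a cheap preconditioner, here an opaque multigrid callback.
#include <vector>
#include <functional>
#include <cmath>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;

void defect_correction(const Op& A_high,  // high-order discrete operator
                       const Op& M_inv,   // multigrid cycle on a low-order operator
                       const Vec& b, Vec& x, int max_it, double tol) {
    Vec r(x.size()), dx(x.size());
    for (int k = 0; k < max_it; ++k) {
        A_high(x, r);                                     // r = A_h x
        double nrm = 0.0;
        for (size_t i = 0; i < r.size(); ++i) {
            r[i] = b[i] - r[i];                           // the defect
            nrm += r[i] * r[i];
        }
        if (std::sqrt(nrm) < tol) break;
        M_inv(r, dx);                                     // approximate correction
        for (size_t i = 0; i < x.size(); ++i) x[i] += dx[i];
    }
}
```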

6.
We describe the performance of Chicoma, a 3D unstructured-mesh compressible flow solver, on graphics processing unit (GPU) hardware. The approach used to deploy the solver on GPU architectures derives from the threaded multicore execution model used in Chicoma and attempts to improve memory performance via graph-theory techniques. The result is a scheme that can be deployed on the GPU with high-level programming constructs such as compiler directives, rather than low-level programming extensions. With an NVIDIA Fermi-class GPU (NVIDIA Corp., Santa Clara, CA, USA) and double-precision floating-point arithmetic, we observe performance gains of 4–5× on problem sizes of 10^6–10^7 tetrahedra. We also compare GPU performance to threaded multicore performance with OpenMP and demonstrate hybrid multicore-GPU calculations with adaptive mesh refinement. Published 2012. This article is a US Government work and is in the public domain in the USA.
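An illustrative directive-based edge loop in the spirit of the high-level porting approach described above, assuming the edges have been grouped into colours by the graph techniques mentioned (OpenACC is shown as one possible directive set; all names are hypothetical):

```cpp
// Illustrative directive-based edge loop: edges are pre-grouped into
// colours so no two edges of one colour share a node, making the scatter
// race-free under a plain parallel-loop directive. Device residency of
// the arrays is assumed handled by an enclosing data region.
void edge_fluxes(int ncolors, const int* color_start,
                 const int* n1, const int* n2,
                 const double* u, double* res) {
    for (int c = 0; c < ncolors; ++c) {
        #pragma acc parallel loop
        for (int e = color_start[c]; e < color_start[c + 1]; ++e) {
            double flux = u[n2[e]] - u[n1[e]];   // placeholder flux
            res[n1[e]] += flux;                  // safe: colour shares no nodes
            res[n2[e]] -= flux;
        }
    }
}
```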

7.
While new power-efficient computer architectures exhibit spectacular theoretical peak performance, they require specific conditions to operate efficiently, which makes porting complex algorithms a challenge. Here, we report results for the semi-implicit method for pressure-linked equations (SIMPLE) and the pressure-implicit with operator splitting (PISO) method implemented on the graphics processing unit (GPU). We examine the advantages and disadvantages of a full port over a partial acceleration of these algorithms on unstructured meshes. We find that the full-port strategy requires adjusting the internal data structures to the new hardware, and we propose a convenient format for storing them on GPUs. Our implementation is validated on standard steady and unsteady problems, and its computational efficiency is checked by comparing its results and run times with those of standard software (OpenFOAM) run on a central processing unit (CPU). The results show that a server-class GPU outperforms a server-class dual-socket multi-core CPU system running essentially the same algorithm by up to a factor of 4.
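A sketch of what a GPU-friendly "convenient format" might look like, assuming a structure-of-arrays face/cell layout (the paper's actual format is not reproduced here; field names are hypothetical):

```cpp
// Hypothetical structure-of-arrays layout for unstructured face/cell data:
// each field is one contiguous array, so GPU threads reading consecutive
// faces or cells access memory coalesced. Not OpenFOAM's actual format.
#include <vector>

struct FaceSoA {
    std::vector<int>    owner, neighbour;   // cells on either side of a face
    std::vector<double> area, flux;         // per-face geometry / mass flux
};

struct CellSoA {
    std::vector<double> p, u, v, w;         // one contiguous array per field
    std::vector<double> diag;               // matrix diagonal used by SIMPLE/PISO
};
```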

8.
We optimized the Arbitrary-accuracy DErivatives Riemann problem (ADER) discontinuous Galerkin (DG) numerical method using the CUDA C language to run on a graphics processing unit (GPU). We focus on solving linear hyperbolic partial differential equations, for which the method can be expressed as a combination of precomputed matrix multiplications, making it a good candidate for GPU hardware. Moreover, the method is arbitrarily high order and involves intensive work on local data, a property that also benefits the target hardware. We compare our GPU implementation against CPU versions of the same method, observing similar convergence properties up to a threshold at which the error remains fixed. This behavior agrees with the CPU version, but the threshold is slightly larger than in the CPU case. We also observe a large difference between single and double precision: in single precision the threshold error is significantly larger. Finally, we observe a speed-up factor in computational time that depends on the order of the method and the size of the problem; in the best case, our GPU implementation runs 23 times faster than the CPU version. We tested the code on three partial differential equations: the linear advection equation, the seismic wave equation, and the linear shallow-water equation, all with variable coefficients. Copyright © 2015 John Wiley & Sons, Ltd.
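A sketch of the property the abstract exploits: with precomputed matrices, each element update reduces to small dense matrix products. Sizes, layout, and the single update matrix K are illustrative simplifications of the full ADER-DG scheme:

```cpp
// Sketch: one block per element, one thread per row of the small product.
// Launch: ader_dg_update<<<nelem, NDOF>>>(K, dof_old, dof_new, nelem);
#include <cuda_runtime.h>

#define NDOF 10   // basis functions per element (order-dependent)
#define NVAR 3    // e.g. the linear shallow-water variables

__global__ void ader_dg_update(const double* __restrict__ K,       // NDOF x NDOF
                               const double* __restrict__ dof_old,
                               double* __restrict__ dof_new, int nelem) {
    int elem = blockIdx.x, i = threadIdx.x;
    if (elem >= nelem || i >= NDOF) return;
    const double* q = dof_old + (size_t)elem * NDOF * NVAR;
    double*       o = dof_new + (size_t)elem * NDOF * NVAR;
    for (int v = 0; v < NVAR; ++v) {
        double acc = 0.0;
        for (int j = 0; j < NDOF; ++j)
            acc += K[i * NDOF + j] * q[j * NVAR + v];  // row i of K * q
        o[i * NVAR + v] = acc;
    }
}
```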

9.
Intel's latest Xeon Phi processor, Knights Landing (KNL), has the potential to deliver over 2.6 TFLOPS. However, obtaining maximum performance on the KNL still requires significant refactoring and optimization of application codes to exploit its key architectural innovations: wide vector units, a many-core node design, and a deep memory hierarchy. This paper describes the experience and insights gained in porting and running FEFLO (a typical edge-based finite element code for the solution of compressible and incompressible flows) on the KNL platform. In particular, optimizations used to extract on-node parallelism via vectorization and multithreading and to improve internode communication are considered. These optimizations yielded a 2.3× performance gain on 16-node runs of FEFLO, with the potential for larger gains as the code is scaled beyond 16 nodes. The impact of the different configurations of KNL's on-package MCDRAM (Multi-Channel DRAM) memory on FEFLO's performance is also explored. Finally, the performance of the optimized versions of FEFLO on KNL and Haswell (Intel Xeon) is compared.
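A minimal illustration of the two on-node levels targeted on KNL, with a stand-in loop body: OpenMP threads spread work across the many cores while `#pragma omp simd` fills the 512-bit vector units:

```cpp
// Illustration only: the loop body stands in for FEFLO's real edge work,
// and the aligned clause assumes x and y were allocated 64-byte aligned.
void scaled_add(int n, double a,
                const double* __restrict__ x, double* __restrict__ y) {
    #pragma omp parallel for simd aligned(x, y : 64)
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];   // unit-stride, aligned access vectorizes cleanly
}
```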

10.
This paper describes the implementation and performance of a parallel solver for the direct numerical simulation of the three-dimensional, time-dependent Navier–Stokes equations on distributed-memory, massively parallel computers. The feasibility of this approach for studying Marangoni flow instability in half-zone liquid bridges is examined. The results indicate that the incompressible, nonlinear Navier–Stokes problem governing Marangoni flow behavior can be parallelized effectively on a distributed-memory parallel machine by remapping the distributed data structure. The numerical code is based on a three-dimensional Simplified Marker and Cell (SMAC) primitive-variable method applied to a staggered finite difference grid. With this method, the problem is split into two subproblems: one parabolic and the other elliptic. A parallel algorithm, explicit in time, is used to solve the parabolic equations. A parallel multisplitting kernel is introduced for the solution of the pseudo-pressure elliptic equation, the most time-consuming part of the algorithm. A grid-partition strategy is used in the parallel implementations of both the parabolic equations and the multisplitting elliptic kernel. Message Passing Interface (MPI) routines handle the boundary conditions; this protocol is portable to any system supporting MPI for interprocessor communication. Numerical experiments show good numerical properties and parallel efficiency. In particular, good scalability on a large number of processors can be achieved as long as the granularity of the parallel application is not too small; as the number of processors increases, however, the speed-up falls increasingly short of the ideal linear speed-up. The communication timings indicate that complex practical calculations, such as solving the Navier–Stokes equations to simulate the instability of Marangoni flows, can be expected to run on a massively parallel machine with good efficiency. Copyright © 1999 John Wiley & Sons, Ltd.
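A schematic of the SMAC splitting described above, with all discrete operators left as placeholders for the staggered finite-difference versions; the elliptic callback stands in for the parallel multisplitting kernel:

```cpp
// Schematic SMAC step: an explicit (parabolic) update gives a provisional
// velocity, an elliptic pseudo-pressure solve enforces incompressibility,
// and the velocity is projected.
#include <vector>
#include <functional>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;

void smac_step(Vec& u, Vec& phi, double dt,
               const Op& explicit_rhs,    // convective + viscous terms
               const Op& poisson_solve,   // stands in for the multisplitting kernel
               const Op& grad) {          // gradient of the pseudo-pressure
    Vec rhs(u.size()), g(u.size());
    explicit_rhs(u, rhs);                                      // parabolic part
    for (size_t i = 0; i < u.size(); ++i) u[i] += dt * rhs[i]; // provisional u*
    poisson_solve(u, phi);             // pseudo-pressure from div(u*)
    grad(phi, g);
    for (size_t i = 0; i < u.size(); ++i) u[i] -= dt * g[i];   // projection
}
```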

11.
In both bubbly and porous-media flow, jumps in the coefficients may yield an ill-conditioned linear system. The solution of this system by an iterative technique such as the conjugate gradient (CG) method is delayed by the presence of small eigenvalues in the spectrum of the coefficient matrix. To accelerate convergence, we use two levels of preconditioning. For the first level, we choose between an out-of-the-box incomplete LU decomposition, a sparse approximate inverse, and a truncated Neumann-series-based preconditioner. For the second level, we use deflation. Our experiments show that a computationally fast solver can be achieved on a graphics processing unit: the preconditioners discussed in this work exhibit fine-grained parallelism, and the GPU version of the two-level preconditioned CG can be up to two times faster than a dual quad-core CPU implementation. Copyright © John Wiley & Sons, Ltd.
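A minimal sketch of one of the first-level options named above, the truncated Neumann-series preconditioner: assuming the system has been scaled so that A = I - N, the approximation A^{-1} ≈ I + N + N^2 needs only sparse products, which is what makes it fine-grained parallel. `apply_N` is a placeholder:

```cpp
// Truncated Neumann-series preconditioner sketch: z ≈ (I + N + ... + N^k) r.
// Each term costs one SpMV, so the whole apply is GPU-friendly.
#include <vector>
#include <functional>

using Vec = std::vector<double>;

void neumann_precondition(const std::function<void(const Vec&, Vec&)>& apply_N,
                          const Vec& r, Vec& z, int terms /* e.g. 2 */) {
    Vec t = r;                  // running power N^k r
    z = r;                      // k = 0 term
    Vec tmp(r.size());
    for (int k = 1; k <= terms; ++k) {
        apply_N(t, tmp);                                  // tmp = N * t
        for (size_t i = 0; i < z.size(); ++i) z[i] += tmp[i];
        t.swap(tmp);
    }
}
```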

12.
A molecular structural mechanics approach to carbon nanotubes on graphics processing units (GPUs) is reported. As a powerful, relatively low-cost parallel processor, the GPU is used to accelerate the computations of the molecular structural mechanics approach. The data structures, the matrix-vector multiplication algorithm, the texture reduction algorithm, and the ICCG method on the GPU are presented. Computations of the Young's moduli of carbon nanotubes by the molecular structural mechanics approach on the GPU confirm its accuracy. Running times on the GPU for carbon nanotubes with large numbers of degrees of freedom (DOFs; more than 100,000) are compared against those on the CPU, showing that the GPU can accelerate the computations of the molecular structural mechanics approach to carbon nanotubes.
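A sketch of a block-level parallel reduction of the kind used for the dot products inside ICCG; modern CUDA favours shared memory over the texture path the paper used, so shared memory is shown here:

```cpp
// Launch as: dot_partial<<<blocks, 256, 256 * sizeof(double)>>>(x, y, p, n);
// (block size must be a power of two); a second pass sums the partials.
#include <cuda_runtime.h>

__global__ void dot_partial(const double* x, const double* y,
                            double* block_sums, int n) {
    extern __shared__ double s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? x[i] * y[i] : 0.0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];  // tree reduction
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];      // one partial per block
}
```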

13.
The computational efficiency of existing hydrocodes is expected to suffer as computer architectures advance beyond the traditional parallel central processing unit (CPU) model [1]. On new computer architectures, sources of relative performance degradation might include reduced memory bandwidth per core, increased resource contention due to concurrency, increased single instruction, multiple data (SIMD) length, and increasingly complex memory hierarchies. In existing codes, any performance degradation will be compounded by a lack of attention to performance in their design and implementation. This work reports on considerations for improving computational performance in preparation for current and expected changes in computer architecture. The algorithms studied include increasingly complex prototypes for radiation hydrodynamics codes, such as gradient routines and diffusion matrix assembly (e.g., [1–6]), on structured and unstructured meshes. The considerations applied for performance improvement are meant to be general with respect to architecture (not specific to graphics processing units (GPUs) or multi-core machines, for example) and include techniques for vectorization, threading, tiling, and cache blocking. From a survey of optimization techniques applied to diffusion and hydrodynamics, we make general recommendations with a view toward making these techniques conceptually accessible to the applications code developer. Published 2015. This article is a U.S. Government work and is in the public domain in the USA.
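An illustrative cache-blocked (tiled) loop of the kind surveyed above, applied to a diffusion-like 2D stencil; tile sizes are machine-dependent guesses, not recommendations from the paper:

```cpp
// Tiled sweep: the i/j space is visited in BI x BJ blocks so the working
// set stays cache-resident between reuses of neighbouring rows.
#include <algorithm>

void diffuse_tiled(int ni, int nj, const double* u, double* un, double c) {
    const int BI = 64, BJ = 64;                        // tune per cache level
    for (int ii = 1; ii < ni - 1; ii += BI)
        for (int jj = 1; jj < nj - 1; jj += BJ)
            for (int i = ii; i < std::min(ii + BI, ni - 1); ++i)
                for (int j = jj; j < std::min(jj + BJ, nj - 1); ++j)
                    un[i * nj + j] = u[i * nj + j]
                        + c * (u[(i + 1) * nj + j] + u[(i - 1) * nj + j]
                             + u[i * nj + j + 1]   + u[i * nj + j - 1]
                             - 4.0 * u[i * nj + j]);
}
```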

14.
A parallel large eddy simulation code adopting a domain decomposition method has been developed for large-scale computation of turbulent flows around an arbitrarily shaped body. For the temporal integration of the unsteady incompressible Navier–Stokes equations, a fractional four-step splitting algorithm is adopted, and for the modelling of small eddies in turbulent flows, the Smagorinsky model is used. For parallelization, the METIS and Message Passing Interface (MPI) libraries are used to partition the computational domain and to communicate data between processors, respectively. To validate the parallel architecture and estimate its performance, a three-dimensional laminar driven-cavity flow inside a cubical enclosure is solved. To validate the turbulence calculation, turbulent channel flows at Reτ = 180 and 1050 are simulated and compared with previous results. A backward-facing step flow is then solved and compared with a DNS result for overall code validation. Finally, the turbulent flow around the MIRA model at Re = 2.6 × 10^6 is simulated using approximately 6.7 million nodes. The scalability curve obtained from this simulation shows good scaling. The calculated drag coefficient agrees better with the experimental result than those previously obtained with two-equation turbulence models. Copyright © 2007 John Wiley & Sons, Ltd.
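A minimal sketch of the Smagorinsky closure named above: the subgrid eddy viscosity is nu_t = (Cs Δ)^2 |S|. Computing the strain-rate magnitude |S| from the velocity gradients is assumed done elsewhere; the constant is a typical literature value, not necessarily the paper's:

```cpp
// Smagorinsky subgrid model: nu_t = (Cs * Delta)^2 * |S|,
// with |S| = sqrt(2 S_ij S_ij) supplied by the caller.
double smagorinsky_nu_t(double strain_mag,   // |S|
                        double delta,        // filter width (local cell size)
                        double cs = 0.1) {   // Smagorinsky constant (typical value)
    double l = cs * delta;
    return l * l * strain_mag;               // subgrid eddy viscosity
}
```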

15.
Several next-generation high-performance computing platforms are, or will be, based on so-called many-core architectures, which represent a significant departure from commodity multi-core architectures. A key issue in transitioning large-scale simulation codes from multi-core to many-core systems is closing the serial performance gap, that is, overcoming the large difference in single-core performance between multi-core and many-core systems. In this paper, we discuss how this problem was addressed for a 3D unstructured-mesh hydrodynamics code, describe how Amdahl's law can be used to estimate performance targets and guide optimization efforts, and present timing studies performed on multi-core and many-core platforms. Published 2014. This article is a U.S. Government work and is in the public domain in the USA.
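A worked Amdahl's-law estimate of the kind described above (numbers are illustrative, not from the paper): with serial fraction f and speedup s on the parallelizable part, the overall speedup is 1 / (f + (1 - f)/s):

```cpp
// Amdahl's-law target estimation: even infinite parallel speedup is
// capped at 1/f by the serial fraction, which motivates closing the
// serial performance gap first.
#include <cstdio>

double amdahl(double f, double s) { return 1.0 / (f + (1.0 - f) / s); }

int main() {
    // Example: 10% serial code caps the gain at 1/0.1 = 10x as s -> infinity.
    std::printf("s=16:  %.2fx\n", amdahl(0.10, 16.0));   // ~6.4x
    std::printf("s=inf: %.2fx\n", 1.0 / 0.10);           // 10x ceiling
    return 0;
}
```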

16.
Real-time simulation of industrial equipment is a major challenge today. The high performance and fine-grained parallel computing provided by graphics processing units (GPUs) bring us closer to this goal. In this article, an industrial-scale rotating drum is simulated using a simplified discrete element method (DEM) that neglects the tangential components of the contact force and particle rotation. A single GPU is first used to simulate a small model system of about 8,000 particles in real time, and the simulation is then scaled up to industrial scale using more than 200 GPUs in a 1D domain-decomposition parallelization mode, reaching an overall speed of about 1/11 of real time. Optimizing the communication part of the parallel GPU code could speed up the simulation further, indicating that such real-time simulations have not only methodological but also industrial implications in the near future.
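A sketch of the simplified contact model described above, keeping only the normal spring-dashpot force; stiffness and damping values are illustrative, and on the GPU one thread would evaluate this per contacting pair:

```cpp
// Simplified DEM contact: normal force only, no tangential force or
// rotation. vij is the velocity of particle j relative to particle i.
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 normal_force(const Vec3& xi, const Vec3& xj, const Vec3& vij,
                  double ri, double rj, double kn = 1e5, double cn = 5.0) {
    Vec3 d{xj.x - xi.x, xj.y - xi.y, xj.z - xi.z};
    double dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    double overlap = ri + rj - dist;
    if (dist == 0.0 || overlap <= 0.0) return {0.0, 0.0, 0.0};  // no contact
    Vec3 n{d.x / dist, d.y / dist, d.z / dist};   // unit normal, i -> j
    double vn = vij.x * n.x + vij.y * n.y + vij.z * n.z;
    double f = kn * overlap - cn * vn;            // spring + dashpot magnitude
    return {-f * n.x, -f * n.y, -f * n.z};        // repulsive force on i
}
```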

17.
This paper describes a nonlinear, three-dimensional spectral collocation method for the simulation of the incompressible Navier–Stokes equations under the Boussinesq approximation, motivated by geophysical and environmental flows. These flows are driven by the interaction of stratified fluid with topography, which the model accounts for accurately by using a mapped coordinate system. The spectral collocation method is implemented with both a Fourier trigonometric expansion and Chebyshev polynomials, as appropriate for the domain boundary conditions. The coordinate mapping prohibits the use of existing fast solution methods that rely on separation of variables, so a preconditioner based on the approximate solution of a corresponding finite-difference problem with geometric multigrid is used. The model is parallelized with the Message Passing Interface library and runs effectively on shared- and distributed-memory systems. Copyright © 2013 John Wiley & Sons, Ltd.
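A sketch of the Chebyshev side of the collocation machinery, using the standard Gauss-Lobatto points and dense differentiation matrix (Trefethen's formulas, with the negative-row-sum trick for the diagonal); this is textbook material, not the paper's code:

```cpp
// Chebyshev collocation: nodes x_k = cos(pi k / N) and the (N+1)x(N+1)
// matrix D such that (D u) approximates u' at the nodes.
#include <vector>
#include <cmath>

std::vector<double> cheb_points(int N) {
    const double PI = std::acos(-1.0);
    std::vector<double> x(N + 1);
    for (int k = 0; k <= N; ++k) x[k] = std::cos(PI * k / N);
    return x;
}

std::vector<double> cheb_diff_matrix(int N) {          // row-major storage
    std::vector<double> x = cheb_points(N), D((N + 1) * (N + 1), 0.0);
    auto c = [N](int i) { return (i == 0 || i == N) ? 2.0 : 1.0; };
    for (int i = 0; i <= N; ++i) {
        double row_sum = 0.0;
        for (int j = 0; j <= N; ++j)
            if (i != j) {
                double d = c(i) / c(j) * ((i + j) % 2 ? -1.0 : 1.0)
                         / (x[i] - x[j]);
                D[i * (N + 1) + j] = d;
                row_sum += d;
            }
        D[i * (N + 1) + i] = -row_sum;   // diagonal via negative row sum
    }
    return D;
}
```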

18.
In pursuit of high-fidelity solutions to the fluid flow equations in a short span of time, graphics processing units (GPUs), originally intended for gaming applications, are now being used to accelerate computational fluid dynamics (CFD) codes. With a peak throughput of about 1 TFLOPS on a PC, GPUs are attractive for many high-resolution computations, one of which is computing time-accurate flow solutions past moving bodies. The aim of the present paper is thus to discuss the development of a flow solver on unstructured and overset grids and its implementation on GPUs. In its present form, the flow solver solves the incompressible fluid flow equations on unstructured/hybrid/overset grids using a fully implicit projection method. The resulting discretised equations are solved using a matrix-free Krylov solver built from several GPU kernels, such as gradient, Laplacian, and reduction kernels. Some of the simple arithmetic vector calculations are implemented using the CU++ approach (an object-oriented framework for CFD applications using GPUs; Journal of Supercomputing, 2013, doi:10.1007/s11227-013-0985-9), in which GPU kernels are automatically generated at compile time. Results are presented for two- and three-dimensional computations on static and moving grids.
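A minimal sketch of the matrix-free Krylov idea: the solver needs only the action v -> Av, supplied as a callback that, in the paper's setting, would chain the gradient/Laplacian GPU kernels. Plain unpreconditioned CG is shown for brevity:

```cpp
// Matrix-free conjugate gradient: A is never assembled, only applied.
#include <vector>
#include <functional>
#include <cmath>

using Vec = std::vector<double>;
using ApplyA = std::function<void(const Vec&, Vec&)>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void cg(const ApplyA& A, const Vec& b, Vec& x, int max_it, double tol) {
    Vec r(b), p, Ap(b.size());
    A(x, Ap);
    for (size_t i = 0; i < r.size(); ++i) r[i] -= Ap[i];   // r = b - A x
    p = r;
    double rr = dot(r, r);
    for (int k = 0; k < max_it && std::sqrt(rr) > tol; ++k) {
        A(p, Ap);
        double alpha = rr / dot(p, Ap);
        for (size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        rr = rr_new;
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    }
}
```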

19.
Gas-kinetic-method-based flow solvers have become popular in recent years owing to their robustness in simulating high-Mach-number compressible flows. We evaluate the performance of the newly developed analytical gas kinetic method (AGKM) of Xuan et al. in performing direct numerical simulation of canonical compressible turbulent flow on graphics processing units (GPUs). We find that, over a range of turbulent Mach numbers, AGKM results show excellent agreement with high-order-accurate results obtained with traditional Navier–Stokes solvers in terms of key turbulence statistics. Further, AGKM is more efficient than the traditional gas kinetic method for GPU implementation. We present a brief overview of the optimizations performed on an NVIDIA K20 GPU and show that GPU optimizations boost the speed-up to as much as 40× compared with single-core CPU computations. Hence, AGKM can serve as an efficient method for fast and accurate direct numerical simulation of compressible turbulent flows on simple GPU-based workstations. Copyright © 2016 John Wiley & Sons, Ltd.
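One standard K20-era optimization of the general kind alluded to above (the paper's specific optimizations are not reproduced here) is the grid-stride loop, which lets a fixed-size grid saturate all SMs for any array length:

```cpp
// Grid-stride loop: each thread strides through the array, so the launch
// size can be tuned for occupancy independently of the problem size.
#include <cuda_runtime.h>

__global__ void scale(double* f, double a, size_t n) {
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        f[i] *= a;
}
```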

20.
In this article, we apply Davis's second-order predictor-corrector Godunov-type method to the numerical solution of the Savage–Hutter equations for modeling granular avalanche flows. The method uses monotone upstream-centered schemes for conservation laws (MUSCL) reconstruction for the conservative variables and the Harten–Lax–van Leer contact (HLLC) scheme for the numerical fluxes. Static resistance conditions and stopping criteria are incorporated into the algorithm. The computation is implemented on a graphics processing unit (GPU) using the compute unified device architecture (CUDA) programming model. A practical scheme for allocating memory for two-dimensional arrays on the GPU is given, and the computational efficiency of two-dimensional memory allocation is compared with that of one-dimensional allocation. The effectiveness of the simulation model is verified through several typical numerical examples. Numerical tests show that the GPU program achieves significant speed-ups over the CPU serial version, and that Davis's method in conjunction with the MUSCL and HLLC schemes is accurate and robust for simulating granular avalanche flows with shock waves. As an application example, a case with a teardrop-shaped hydraulic jump in Johnson and Gray's granular jet experiment is reproduced using the specific friction coefficients given in the literature. Copyright © 2014 John Wiley & Sons, Ltd.
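A minimal sketch of the two allocation styles being compared, using the CUDA runtime's pitched allocator versus a flat 1D allocation; array sizes are illustrative:

```cpp
// cudaMallocPitch pads each row so every row starts on an aligned address,
// which favours coalesced row-wise access; cudaMalloc gives a flat array
// with manual index arithmetic.
#include <cuda_runtime.h>

void allocate_2d(int nx, int ny) {
    double* d_pitched = nullptr;
    size_t pitch = 0;                        // padded row size in bytes
    cudaMallocPitch(reinterpret_cast<void**>(&d_pitched), &pitch,
                    nx * sizeof(double), ny);
    // Element (i, j): *((double*)((char*)d_pitched + j * pitch) + i)

    double* d_flat = nullptr;                // same logical array, 1D
    cudaMalloc(reinterpret_cast<void**>(&d_flat),
               (size_t)nx * ny * sizeof(double));
    // Element (i, j): d_flat[j * nx + i]

    cudaFree(d_pitched);
    cudaFree(d_flat);
}
```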
