Similar Documents
1.
This paper presents a Navier–Stokes solver for steady and unsteady turbulent flows on unstructured/hybrid grids with triangular and quadrilateral elements, implemented to run on graphics processing units (GPUs). The paper focuses on programming issues in efficiently porting the CPU code to the GPU using the CUDA language. Compared with cell-centered schemes, a vertex-centered finite volume scheme on unstructured grids increases programming complexity, because the number of nodes connected by an edge to any given node can vary considerably. Careful GPU memory handling is therefore essential to maximize the speed-up of the GPU implementation relative to the Fortran code running on a single CPU core. The GPU-enabled code is used to study numerically steady and unsteady flows around the supercritical OAT15A airfoil, with emphasis on the transonic buffet phenomenon. The computations were carried out on NVIDIA GeForce GTX 285 graphics cards, and speed-ups of up to ~46× (on a single GPU, in double-precision arithmetic) are reported. Copyright © 2010 John Wiley & Sons, Ltd.
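The variable node degree mentioned above is commonly handled with a CSR-style adjacency (row pointers plus a flat neighbour list), so that one GPU thread per node gathers its edge contributions and writes only its own result. The following is a minimal Python sketch under that assumption; the paper's actual data layout is not given here, and the edge flux `u[j] - u[i]` is a hypothetical placeholder for the real convective and viscous fluxes.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build a CSR adjacency (row_ptr, col_idx) from an undirected edge list."""
    degree = [0] * num_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    row_ptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        row_ptr[i + 1] = row_ptr[i] + degree[i]   # variable-length slice per node
    col_idx = [0] * row_ptr[-1]
    fill = list(row_ptr[:-1])                     # next free slot for each node
    for a, b in edges:
        col_idx[fill[a]] = b
        fill[a] += 1
        col_idx[fill[b]] = a
        fill[b] += 1
    return row_ptr, col_idx


def node_residuals(u: List[float], row_ptr: List[int], col_idx: List[int]) -> List[float]:
    """One 'thread' per node sums placeholder edge fluxes over its neighbours.
    Each node writes only res[i], so no atomic updates are needed."""
    res = [0.0] * len(u)
    for i in range(len(u)):                       # on a GPU: i = global thread index
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            acc += u[j] - u[i]                    # hypothetical flux along edge (i, j)
        res[i] = acc
    return res
```

Because each edge contributes equal and opposite amounts to its two end nodes, the residuals sum to zero. This is a quick sanity check that the gather formulation stays conservative.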

2.
We optimized the Arbitrary accuracy DErivatives Riemann problem (ADER)-Discontinuous Galerkin (DG) numerical method using the CUDA-C language to run the code on a graphics processing unit (GPU). We focus on solving linear hyperbolic partial differential equations, for which the method can be expressed as a combination of precomputed matrix multiplications, making it a good candidate for GPU hardware. Moreover, the method is arbitrarily high order and involves intensive work on local data, a property that also benefits the target hardware. We compare our GPU implementation against CPU versions of the same method and observe similar convergence properties up to a threshold beyond which the error remains fixed. This behavior agrees with the CPU version, but the threshold is slightly larger than in the CPU case. We also observe a large difference between single and double precision: in single precision the threshold error is significantly larger. Finally, we observe a speed-up in computational time that depends on the order of the method and the size of the problem; in the best case, our novel GPU implementation runs 23 times faster than the CPU version. We tested the code on three partial differential equations, namely the linear advection equation, the seismic wave equation, and the linear shallow water equation, all with variable coefficients. Copyright © 2015 John Wiley & Sons, Ltd.
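For a linear equation the ADER predictor reduces to a fixed matrix. Take 1D linear advection u_t + a u_x = 0. Repeated substitution gives d^k u/dt^k = (-a)^k d^k u/dx^k, so the time-Taylor expansion is u(t) = sum_k (-a t)^k / k! D^k u, where D is the differentiation matrix of the local basis. The sketch below assumes a monomial basis on a single element. This is an illustrative choice; the paper's basis and equations are not reproduced. It shows that the same precomputed matrix can be applied to every element's coefficients, which is what makes the method GPU-friendly.

```python
from math import factorial
from typing import List

Matrix = List[List[float]]


def diff_matrix(n: int) -> Matrix:
    """D maps monomial coefficients c_0..c_{n-1} of u(x) to those of du/dx."""
    d = [[0.0] * n for _ in range(n)]
    for k in range(n - 1):
        d[k][k + 1] = float(k + 1)   # d/dx (c_{k+1} x^{k+1}) = (k+1) c_{k+1} x^k
    return d


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)] for i in range(n)]


def predictor_matrix(n: int, a: float, t: float) -> Matrix:
    """P = sum_k (-a t)^k / k! D^k. Exact for degree < n because D is nilpotent."""
    d = diff_matrix(n)
    p = [[float(i == j) for j in range(n)] for i in range(n)]   # k = 0 term: identity
    dk = [row[:] for row in p]
    for k in range(1, n):
        dk = matmul(dk, d)                                      # D^k
        coef = (-a * t) ** k / factorial(k)
        p = [[p[i][j] + coef * dk[i][j] for j in range(n)] for i in range(n)]
    return p


def apply_to_elements(p: Matrix, coeffs: List[List[float]]) -> List[List[float]]:
    """Apply the same precomputed P to every element (one GPU thread or block per element)."""
    return [[sum(p[i][j] * c[j] for j in range(len(c))) for i in range(len(p))] for c in coeffs]
```

For example, with a = 2 and t = 0.5, the coefficients [0, 0, 1] of u = x^2 become [1, -2, 1], which are the coefficients of the exact solution (x - 1)^2.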

3.
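Serial GPS acquisition algorithms implemented on a computer's central processing unit are slow. To address this, we designed and implemented two parallel acquisition algorithms on a graphics processing unit, exploiting its strong parallel processing capability to increase acquisition speed. Each algorithm is suited to signals with a different carrier-to-noise ratio. Both algorithms follow the design approach of the Compute Unified Device Architecture (CUDA) and adopt a parallel code-phase search acquisition strategy. For strong signals, acquisition is achieved by searching all 32 satellites of the GPS constellation in parallel across multiple channels and multiple frequency bins. For weak signals, non-coherent integration is used: a single satellite is searched in parallel over multiple time segments and multiple frequency bins, and the channels are then processed serially. Simulation results show that both parallel acquisition algorithms are 10 times faster than the serially implemented acquisition algorithm. Non-coherent integration also improves weak-signal acquisition: for 10 ms of intermediate-frequency data with a carrier-to-noise ratio of 40 dB, correct acquisition is still achieved reliably while acquisition speed is maintained.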
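Parallel code-phase search evaluates all code delays at once through a circular correlation, usually computed with FFTs, after wiping off each candidate Doppler frequency. On a GPU, the (satellite, frequency bin) or (time segment, frequency bin) combinations become independent work items. The stdlib-only Python sketch below rests on several assumptions. It uses a toy random ±1 code of length 64 and expresses Doppler in whole cycles per code period. Real C/A codes (1023 chips), sampling, and non-coherent accumulation are omitted, and a naive DFT stands in for an FFT.

```python
import cmath
from typing import List, Tuple


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(N^2) DFT for clarity; a real receiver would use an FFT (e.g. cuFFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def acquire(signal: List[complex], code: List[int],
            doppler_bins: List[int]) -> Tuple[int, int, float]:
    """Return (doppler_bin, code_delay, peak) maximizing |circular correlation|.
    Doppler is in cycles per code period (a toy unit assumed for this sketch)."""
    n = len(code)
    code_fft_conj = [v.conjugate() for v in dft([complex(c) for c in code])]
    best = (0, 0, -1.0)
    for f in doppler_bins:                        # GPU: one work item per bin
        wiped = [signal[m] * cmath.exp(-2j * cmath.pi * f * m / n) for m in range(n)]
        spec = dft(wiped)
        # IDFT(S * conj(C))[d] = sum_m s[m] * c[m - d]: every code phase at once
        corr = dft([s * c for s, c in zip(spec, code_fft_conj)], inverse=True)
        for delay, v in enumerate(corr):
            if abs(v) > best[2]:
                best = (f, delay, abs(v))
    return best
```

At the correct Doppler bin and delay, the wiped signal matches the shifted code exactly, so the correlation peak equals the code length.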

4.
GPU acceleration of the FTM algorithm   Total citations: 1, self-citations: 1, citations by others: 0
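The FTM (Front Tracking Method) algorithm takes a long time to compute. To address this, CUDA (Compute Unified Device Architecture) is used to compute the FTM algorithm in parallel on the GPU. The implementation draws on the characteristics of both the GPU parallel computing architecture and the FTM algorithm. It covers both the grid computations and the interface marker-point processing of FTM, and relies on the following techniques: introducing shared memory, partitioning the thread blocks, including the boundary elements of each thread block's shared-memory tile, improving the iterative method, and transforming the storage structure during iteration. Finally, a single bubble rising freely in quiescent liquid is simulated, which verifies that FTM can be computed on the GPU and that computational efficiency improves.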
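The shared-memory scheme described above can be sketched as follows. Each thread block loads its tile plus the boundary (halo) elements it needs from neighbouring tiles. This Python emulation assumes a 5-point Jacobi stencil as a stand-in for the FTM grid computations, which are not specified here. A local list plays the role of shared memory.

```python
from typing import List

Grid = List[List[float]]


def jacobi_global(u: Grid) -> Grid:
    """Reference: one Jacobi sweep of the 5-point Laplace stencil, boundary held fixed."""
    ny, nx = len(u), len(u[0])
    v = [row[:] for row in u]
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return v


def jacobi_tiled(u: Grid, bs: int) -> Grid:
    """Same sweep, organised like a CUDA kernel with bs x bs thread blocks."""
    ny, nx = len(u), len(u[0])
    v = [row[:] for row in u]
    for bi in range(0, ny, bs):                      # blockIdx.y
        for bj in range(0, nx, bs):                  # blockIdx.x
            # Load the tile plus a one-cell halo ("shared memory"); halo cells outside
            # the grid are filled with 0.0 but never read by the update below.
            tile = [[u[i][j] if 0 <= i < ny and 0 <= j < nx else 0.0
                     for j in range(bj - 1, bj + bs + 1)]
                    for i in range(bi - 1, bi + bs + 1)]
            for ti in range(1, bs + 1):              # threadIdx.y + 1
                for tj in range(1, bs + 1):          # threadIdx.x + 1
                    i, j = bi + ti - 1, bj + tj - 1
                    if 1 <= i < ny - 1 and 1 <= j < nx - 1:
                        v[i][j] = 0.25 * (tile[ti - 1][tj] + tile[ti + 1][tj]
                                          + tile[ti][tj - 1] + tile[ti][tj + 1])
    return v
```

The tiled version reads neighbours only from the local tile, yet it reproduces the global sweep exactly. This holds even when the block size does not divide the grid size, because the halo supplies the values that belong to adjacent tiles.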

5.
The implementation of an edge-based three-dimensional Reynolds-Averaged Navier–Stokes solver for unstructured grids that can run on multiple graphics processing units (GPUs) is presented. Loops over edges, the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes, and between the GPU and the central processing unit (CPU), are used to enhance code scalability. The code is written in a mixture of C++ and OpenCL, which allows the source code to execute on GPUs. The Message Passing Interface (MPI) library enables parallel execution of the solver on multiple GPUs. A comparative study of the solver's parallel performance is carried out on a cluster of CPUs and on a cluster of GPUs. A single GPU is shown to be up to 64 times faster than a single CPU core. Parallel scalability is degraded mainly by the loss of GPU computing efficiency as the case size decreases; for large enough grids, however, scalability improves strongly. A cluster featuring commodity GPUs and a high-bandwidth network costs ten times less and consumes 33% less energy than a CPU-based cluster with equivalent computational power.

6.
In this article, we apply Davis's second-order predictor-corrector Godunov-type method to the numerical solution of the Savage–Hutter equations for modeling granular avalanche flows. The method uses monotone upstream-centered schemes for conservation laws (MUSCL) reconstruction of the conservative variables and the Harten–Lax–van Leer contact (HLLC) scheme for the numerical fluxes. Static resistance conditions and stopping criteria are incorporated into the algorithm. The computation is implemented on a graphics processing unit (GPU) using the Compute Unified Device Architecture (CUDA) programming model. A practical approach to allocating memory for two-dimensional arrays on the GPU is given, and the computational efficiency of two-dimensional memory allocation is compared with that of one-dimensional allocation. The effectiveness of the simulation model is verified through several typical numerical examples. Numerical tests show that the GPU program achieves significant speed-ups over the serial CPU version. They also show that Davis's method combined with the MUSCL and HLLC schemes is accurate and robust for simulating granular avalanche flows with shock waves. As an application example, the teardrop-shaped hydraulic jump in Johnson and Gray's granular jet experiment is reproduced using specific friction coefficients given in the literature. Copyright © 2014 John Wiley & Sons, Ltd.
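MUSCL reconstruction builds a limited linear profile in each cell and evaluates it at the cell faces. This produces the left and right states that a Riemann solver such as HLLC then consumes. Below is a minimal 1D scalar sketch, assuming the minmod limiter; the abstract does not say which limiter the authors use.

```python
from typing import List, Tuple


def minmod(a: float, b: float) -> float:
    """Smaller-magnitude slope if both slopes share a sign, otherwise 0 (flat)."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def muscl_interface_states(q: List[float]) -> List[Tuple[float, float]]:
    """For each interior interface between cells i and i+1 (i = 1..n-3), return (q_L, q_R)
    from piecewise-linear reconstruction with a minmod-limited slope in each cell."""
    n = len(q)
    slope = [0.0] * n
    for i in range(1, n - 1):
        slope[i] = minmod(q[i] - q[i - 1], q[i + 1] - q[i])
    states = []
    for i in range(1, n - 2):
        q_left = q[i] + 0.5 * slope[i]            # right face of cell i
        q_right = q[i + 1] - 0.5 * slope[i + 1]   # left face of cell i+1
        states.append((q_left, q_right))
    return states
```

On linear data the reconstruction is exact and the two face states coincide. At a discontinuity the limiter flattens the slopes, so no new extrema appear next to the jump.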

7.
A typical large-scale CFD code based on adaptive, edge-based finite element formulations for compressible and incompressible flow is taken as a test bed for porting such codes to graphics hardware (graphics processing units, GPUs) using semi-automatic techniques. In previous work, a GPU version of this code was presented in which, for many run configurations, all mesh-sized loops required throughout time stepping were ported. This approach achieves three things at once. It provides the fine-grained parallelism required to exploit many-core GPUs fully. It avoids entirely the crippling bottleneck of GPU–CPU data transfer. It also uses a transposed memory layout to meet the distinct memory access requirements of GPUs. The present work describes the next step of this porting effort: integrating GPU-based, fine-grained parallelism with Message Passing Interface (MPI)-based, coarse-grained parallelism to obtain a code capable of running on multi-GPU clusters. This is done semi-automatically: the existing Fortran–MPI code is preserved, and the translator inserts data transfer calls as required. For certain run configurations, performance benchmarks show that the NVIDIA Tesla M2050 GPU (Santa Clara, CA, USA) outperforms the six-core Intel Xeon X5670 CPU (Santa Clara, CA, USA) by up to a factor of 2. In addition, good scalability is observed when running across multiple GPUs. The approach should be of general interest, as the question of how best to run on GPUs is currently being considered for many so-called legacy codes. Copyright © 2011 John Wiley & Sons, Ltd.
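A transposed (structure-of-arrays) layout matters because the threads in a warp access memory together. Suppose thread t reads field f of node t. With a structure-of-arrays layout those reads are contiguous, whereas an array-of-structures layout strides them by the number of fields. The index arithmetic is sketched below in Python; the field count and node count are hypothetical.

```python
from typing import List


def aos_index(node: int, field: int, n_fields: int) -> int:
    """Array-of-structures: all fields of one node are contiguous."""
    return node * n_fields + field


def soa_index(node: int, field: int, n_nodes: int) -> int:
    """Structure-of-arrays ('transposed'): one field of all nodes is contiguous."""
    return field * n_nodes + node


def warp_addresses(layout: str, field: int, first_node: int, warp: int,
                   n_nodes: int, n_fields: int) -> List[int]:
    """Element offsets touched when threads first_node..first_node+warp-1
    each read `field` of their own node."""
    nodes = range(first_node, first_node + warp)
    if layout == "aos":
        return [aos_index(t, field, n_fields) for t in nodes]
    return [soa_index(t, field, n_nodes) for t in nodes]


def transpose_aos_to_soa(data: List[float], n_nodes: int, n_fields: int) -> List[float]:
    """Reorder an AoS buffer into SoA order."""
    out = [0.0] * (n_nodes * n_fields)
    for node in range(n_nodes):
        for f in range(n_fields):
            out[soa_index(node, f, n_nodes)] = data[aos_index(node, f, n_fields)]
    return out
```

With 5 fields per node, four consecutive threads reading field 1 touch offsets 1, 6, 11, 16 in AoS order but 100, 101, 102, 103 in SoA order (for 100 nodes). The second pattern is the one GPU memory controllers can coalesce.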

8.
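Large-scale three-dimensional finite-difference mesh generation is the foundation of 3D finite-difference computation, and the efficiency of mesh generation is an active research topic. Traditional staircase finite-difference mesh generation methods are mainly the ray-penetration method and the slicing method. Building on the traditional serial ray-penetration method, this paper proposes a parallel staircase finite-difference mesh generation algorithm based on GPU (graphics processing unit) parallel computing. The parallel algorithm uses a batch-based data transfer strategy, so the size of the data it can handle does not depend on the GPU memory size. This balances data transfer efficiency against the achievable mesh size. To reduce the amount of data transferred, each GPU thread independently generates its own ray origin coordinates, which further improves the execution efficiency and degree of parallelism of the algorithm. Comparative numerical tests show that the parallel algorithm is far more efficient than the traditional ray-penetration method. Finally, finite-difference computation examples confirm that the parallel algorithm meets the requirements of large-scale numerical simulation of complex models.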
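The ray-penetration idea is to cast one ray per grid column, find where it enters and leaves the model, and mark the cells in between. Computing each ray origin from the thread index removes the need to transfer origin coordinates. The sketch below assumes an analytic sphere in place of a general model surface and emulates the batches with an outer loop. Names and parameters are hypothetical.

```python
import math
from typing import List, Tuple


def voxelize_sphere_rays(nx: int, ny: int, nz: int, center: Tuple[float, float, float],
                         radius: float, batch: int) -> List[List[List[bool]]]:
    """Staircase voxelization by ray penetration: one ray per (i, j) column along +z.
    Each 'thread' derives its ray origin from its own index, so no origin array
    is transferred; columns are processed in batches of `batch` rays."""
    cx, cy, cz = center
    inside = [[[False] * nz for _ in range(ny)] for _ in range(nx)]
    total = nx * ny
    for start in range(0, total, batch):                     # one host-device batch
        for tid in range(start, min(start + batch, total)):  # GPU threads in the batch
            i, j = tid % nx, tid // nx
            x, y = i + 0.5, j + 0.5                          # ray origin (x, y, 0)
            h2 = radius * radius - (x - cx) ** 2 - (y - cy) ** 2
            if h2 < 0.0:
                continue                                     # ray misses the model
            h = math.sqrt(h2)
            z_in, z_out = cz - h, cz + h                     # entry and exit points
            for k in range(nz):
                if z_in <= k + 0.5 <= z_out:                 # cell centre lies inside
                    inside[i][j][k] = True
    return inside
```

The marking agrees with a direct point-in-sphere test of every cell centre. It is also independent of the batch size, which only controls how much work is sent to the device at once.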

9.
This paper describes a pressure correction method for single- and multilayer open flow models. The method does not require complex procedures to solve the discretized Poisson equation and is highly efficient computationally. The algorithm can easily be adapted to irregular meshes and parallelized. Parabolic interpolation of the pressure profile is used for the free surface. The discretized Poisson equation is written in matrix form, which also allows its use with a basis-function expansion of the pressure profile over depth. The paper presents verification of the algorithm against experimental data that are sensitive to the numerical dissipation of the model. Iterations converge quickly, including for problems with dry-bed flooding. The complete pressure correction technique described here is implemented in OpenCL on the GPU, and computation times for a test problem solved on the CPU and on the GPU are compared.

10.
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Multi-core CPUs and GPUs have recently attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM that uses the available memory effectively and accelerates the computation; this study shows that the algorithm drastically reduces memory usage. To show the practical use of DEM in industry, a large-scale powder system with a complicated drive unit is simulated. We compare simulation performance on the latest GPU and CPU processors, using programs optimized for each. The results show no substantial difference in performance between GPUs and CPUs when the multi-thread parallel algorithm is used. In addition, the DEM algorithm is shown to scale well in multi-thread parallel computation on a CPU.

11.
With the increasing heterogeneity and on-node parallelism of high-performance computing hardware, a major challenge is to develop portable and efficient algorithms and software. In this work, we present our implementation of a portable code that performs surface reconstruction using NVIDIA's Thrust library. Surface reconstruction is a technique commonly used in volume tracking methods for simulations of multimaterial flow with interfaces. We have designed a 3D mesh data structure that maps easily to the 1D vectors used by Thrust while remaining simple to use and employing familiar terminology (cells, faces, vertices, and edges). With this data structure in place, we have implemented a piecewise linear interface reconstruction algorithm in three dimensions that effectively exploits the symmetry of a uniform rectilinear computational cell. Finally, we report performance results showing that a single implementation of these algorithms can be compiled for multiple backends (specifically, multi-core CPUs, NVIDIA GPUs, and Intel Xeon Phi processors) and makes efficient use of the available parallelism on each. We also compare our implementation with a legacy FORTRAN implementation using the Message Passing Interface (MPI). Our implementation achieves performance parity on single- and multi-core CPUs and good parallel speed-ups on the GPU. Our research demonstrates the performance-portability advantage of the underlying data-parallel programming model.
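One way to map a 3D structured mesh onto the flat 1D vectors that Thrust works with is to flatten cell and vertex indices with the x index varying fastest. A cell's corners are then recovered arithmetically rather than from stored connectivity. The sketch below makes that assumption, since the authors' actual data structure is not described in detail, and uses hypothetical function names.

```python
from typing import List


def cell_id(i: int, j: int, k: int, nx: int, ny: int) -> int:
    """Flatten cell (i, j, k) of an nx * ny * nz block into one 1D index (x fastest)."""
    return i + nx * (j + ny * k)


def vertex_id(i: int, j: int, k: int, nx: int, ny: int) -> int:
    """Vertices form an (nx+1) * (ny+1) * (nz+1) lattice, flattened the same way."""
    return i + (nx + 1) * (j + (ny + 1) * k)


def cell_vertices(c: int, nx: int, ny: int) -> List[int]:
    """Eight corner vertices of flat cell c, recovered from its 1D index alone."""
    i = c % nx
    j = (c // nx) % ny
    k = c // (nx * ny)
    return [vertex_id(i + di, j + dj, k + dk, nx, ny)
            for dk in (0, 1) for dj in (0, 1) for di in (0, 1)]
```

Because connectivity is computed rather than stored, a per-cell kernel (for example a `thrust::transform` over cell indices) only needs the flat field arrays plus nx and ny.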

12.
Gas kinetic method-based flow solvers have become popular in recent years owing to their robustness in simulating high-Mach-number compressible flows. We evaluate the performance of the recently developed analytical gas kinetic method (AGKM) of Xuan et al. for direct numerical simulation of canonical compressible turbulent flow on graphics processing units (GPUs). We find that, over a range of turbulent Mach numbers, AGKM results show excellent agreement in key turbulence statistics with high-order accurate results from traditional Navier–Stokes solvers. Further, AGKM is found to be more efficient than the traditional gas kinetic method for GPU implementation. We briefly describe the optimizations performed on an NVIDIA K20 GPU and show that they boost the speed-up to as much as 40× relative to single-core CPU computations. Hence, AGKM can be used as an efficient method for fast and accurate direct numerical simulation of compressible turbulent flows on simple GPU-based workstations. Copyright © 2016 John Wiley & Sons, Ltd.

13.
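This work studies the application of GPU (graphics processing unit) computing to three parts of the finite element method: computing and assembling the global stiffness matrix, sparse matrix-vector multiplication, and solving the linear system. Programs were implemented and tested on a GTX 295 GPU using the CUDA (Compute Unified Device Architecture) platform. The global stiffness matrix is stored in GPU memory in CSR (Compressed Sparse Row) format. Element coloring is used to compute and assemble the global stiffness matrix in parallel, and the conjugate gradient method solves the large-scale linear systems. For space truss and plane problems with up to 3 million degrees of freedom, the GPU finite element computation achieves maximum speed-ups of 9.5× and 6.5×, respectively. The speed-up grows approximately linearly with the number of degrees of freedom, and peak GFLOP/s also increases nearly tenfold.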
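Element coloring makes parallel assembly race-free. Elements that share no node can add their element matrices into the global CSR matrix at the same time, so the assembly kernel is launched once per color. Below is a minimal greedy coloring in Python, assuming elements are given as lists of node indices; the assembly kernel itself is omitted.

```python
from typing import Dict, List, Set


def color_elements(elements: List[List[int]]) -> List[int]:
    """Greedy coloring: elements sharing a node get different colors, so all elements
    of one color can scatter into the global stiffness matrix concurrently."""
    node_colors: Dict[int, Set[int]] = {}   # colors already used around each node
    colors: List[int] = []
    for nodes in elements:
        used: Set[int] = set()
        for n in nodes:
            used |= node_colors.get(n, set())
        c = 0
        while c in used:                    # smallest color not used by any neighbour
            c += 1
        colors.append(c)
        for n in nodes:
            node_colors.setdefault(n, set()).add(c)
    return colors
```

On a chain of bar elements, this alternates two colors. On a patch of four quads sharing a central node, it needs four colors, one per element.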

14.
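The element-free Galerkin (EFG) method is computationally expensive. To address this, we propose a GPU-accelerated parallel algorithm for the EFG method in which essential boundary conditions are imposed by the penalty method. The algorithm assembles the stiffness matrix node pair by node pair and solves the sparse linear system, stored in CSR format, with the conjugate gradient method. A unified form of the stiffness matrix and the penalty stiffness matrix is given, together with a flowchart of the GPU-accelerated parallel algorithm. A GPU program based on the CUDA platform was written. The algorithm's performance was tested, analyzed, and compared through numerical examples on an NVIDIA GeForce GTX 660 graphics card, and the factors affecting the speed-up are discussed. The examples confirm that the algorithm is feasible and, while meeting accuracy requirements, reaches speed-ups of up to 17×. The solution of the linear system has a decisive influence on the speed-up.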
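The penalty treatment of essential boundary conditions and the CSR-based conjugate gradient solve can be illustrated on a much simpler problem. The sketch below assumes a 1D bar discretized with linear finite elements rather than EFG shape functions. For brevity it assembles densely and converts to CSR, rather than using the node-pair assembly of the paper. It imposes u = g at the ends by adding a penalty alpha to the diagonal and alpha * g to the right-hand side.

```python
from typing import List, Tuple

CSR = Tuple[List[float], List[int], List[int]]   # (values, col_idx, row_ptr)


def dense_to_csr(a: List[List[float]]) -> CSR:
    vals: List[float] = []
    cols: List[int] = []
    ptr = [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0.0:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))
    return vals, cols, ptr


def csr_matvec(m: CSR, x: List[float]) -> List[float]:
    vals, cols, ptr = m
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]        # one GPU thread per row


def cg(m: CSR, b: List[float], tol: float = 1e-12, max_iter: int = 1000) -> List[float]:
    """Unpreconditioned conjugate gradient for a symmetric positive-definite CSR matrix."""
    x = [0.0] * len(b)
    r = b[:]
    p = r[:]
    rr = sum(v * v for v in r)
    b_norm2 = rr
    for _ in range(max_iter):
        if rr <= tol * tol * b_norm2:            # relative residual small enough
            break
        ap = csr_matvec(m, p)
        alpha = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def bar_with_penalty(n: int, alpha: float, u_left: float,
                     u_right: float) -> Tuple[CSR, List[float]]:
    """Bar of n nodes, unit element stiffness; u_0 and u_{n-1} imposed by penalty."""
    k = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):                       # element e joins nodes e and e+1
        for r, s, v in ((e, e, 1.0), (e, e + 1, -1.0), (e + 1, e, -1.0), (e + 1, e + 1, 1.0)):
            k[r][s] += v
    f = [0.0] * n
    k[0][0] += alpha                             # penalty stiffness
    f[0] += alpha * u_left                       # penalty load
    k[n - 1][n - 1] += alpha
    f[n - 1] += alpha * u_right
    return dense_to_csr(k), f
```

With alpha = 1e4, the computed end values differ from the prescribed ones by an amount of order 1/alpha. Larger penalties enforce the condition more tightly but worsen the conditioning that CG has to cope with.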

15.
Implementations of the Boussinesq wave model for computing free-surface wave evolution in large basins are, in general, computationally very expensive, requiring large amounts of CPU time and memory. For large-scale problems, running on a single PC is neither affordable nor practical. To enable such extensive computations, a parallel Boussinesq wave model is developed using domain decomposition in conjunction with the Message Passing Interface (MPI). The parallel model uses the same published and well-tested numerical scheme as the serial model, a high-order finite difference method. Parallelizing the tridiagonal matrix systems in the serial scheme is the most challenging aspect of the work; it is accomplished with a parallel matrix solver combined with an efficient data transfer scheme. Numerical tests on a distributed-memory supercomputer show that the parallel model performs very satisfactorily in simulating wave evolution. Speed-up increases linearly with the number of processors, and the CPU time efficiency of the model is about 75–90%. Copyright © 2005 John Wiley & Sons, Ltd.
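The abstract does not name its parallel tridiagonal solver. One well-known option is parallel cyclic reduction (PCR). It eliminates the couplings at distance 1, 2, 4, and so on, so that every row can be updated independently at each of about log2(n) steps. A Python sketch follows; the loop over i is the part a parallel implementation would distribute.

```python
from typing import List


def pcr_solve(a: List[float], b: List[float], c: List[float], d: List[float]) -> List[float]:
    """Parallel cyclic reduction for a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i],
    with a[0] = c[-1] = 0. Rows are updated independently within each step."""
    n = len(b)
    a, b, c, d = a[:], b[:], c[:], d[:]
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):                      # parallel over rows
            lo, hi = i - stride, i + stride
            # Multipliers that eliminate x[i-stride] and x[i+stride] from row i.
            alpha = -a[i] / b[lo] if lo >= 0 else 0.0
            gamma = -c[i] / b[hi] if hi < n else 0.0
            na[i] = alpha * a[lo] if lo >= 0 else 0.0   # new coupling to x[i-2*stride]
            nc[i] = gamma * c[hi] if hi < n else 0.0    # new coupling to x[i+2*stride]
            nb[i] = (b[i] + (alpha * c[lo] if lo >= 0 else 0.0)
                     + (gamma * a[hi] if hi < n else 0.0))
            nd[i] = (d[i] + (alpha * d[lo] if lo >= 0 else 0.0)
                     + (gamma * d[hi] if hi < n else 0.0))
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]       # rows are now fully decoupled
```

PCR does more arithmetic than the serial Thomas algorithm (O(n log n) versus O(n)), which is the usual price for removing the sequential dependency between rows.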

16.
We present a novel implementation of the modal DG method for hyperbolic conservation laws in two dimensions on graphics processing units (GPUs) using NVIDIA's Compute Unified Device Architecture (CUDA). Both flexible and highly accurate, DG methods accommodate parallel architectures well, as their discontinuous nature produces element-local approximations. GPUs suit high-performance scientific computing well, as these powerful, massively parallel, cost-effective devices have recently gained support for double-precision floating-point numbers. Computed examples for the Euler equations on unstructured triangle meshes demonstrate the effectiveness of our implementation on an NVIDIA GTX 580 device. Profiling of our method shows performance comparable to that of an existing nodal DG-GPU implementation for linear problems. Copyright © 2014 John Wiley & Sons, Ltd.

18.
Nowadays, high-performance computing (HPC) systems are going through a disruptive period, with a variety of novel architectures and frameworks and no clarity about which will prevail. In this context, the portability of codes across different architectures is of major importance. This paper presents a portable implementation model based on an algebraic operational approach for direct numerical simulation (DNS) and large eddy simulation (LES) of incompressible turbulent flows on unstructured hybrid meshes. The proposed strategy consists of representing the whole time-integration algorithm with only three basic algebraic operations: sparse matrix-vector product (SpMV), linear combination of vectors, and dot product. The main idea is to decompose the nonlinear operators into a concatenation of two SpMV operations, which provides high modularity and portability. An exhaustive analysis of the proposed implementation on hybrid CPU/GPU supercomputers has been conducted, with tests using up to 128 GPUs. The main objective is to understand the challenges of implementing CFD codes on new architectures.
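The three kernels, together with one plausible way of writing a convective operator as two SpMVs, can be sketched on a 1D periodic mesh. Here the nonlinear operator is assumed to take the form C(u) phi = D (F P) phi. In this form, P interpolates cell values to faces and F is the diagonal of face mass fluxes, so F P is P with its rows rescaled. D is the face-to-cell divergence. The paper's unstructured-mesh operators are not reproduced, and all names are hypothetical.

```python
from typing import List, Tuple

CSR = Tuple[List[float], List[int], List[int]]   # (values, col_idx, row_ptr)


def spmv(m: CSR, x: List[float]) -> List[float]:
    vals, cols, ptr = m
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]


def axpy(alpha: float, x: List[float], y: List[float]) -> List[float]:
    """Linear combination alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(x: List[float], y: List[float]) -> float:
    return sum(xi * yi for xi, yi in zip(x, y))


def periodic_operators(n: int) -> Tuple[CSR, CSR]:
    """1D periodic mesh; face i lies between cells i and i+1.
    interp: cell -> face average. div: face -> cell difference (right face minus left)."""
    ptr = list(range(0, 2 * n + 1, 2))
    interp = ([0.5, 0.5] * n, [v for i in range(n) for v in (i, (i + 1) % n)], ptr)
    div = ([1.0, -1.0] * n, [v for i in range(n) for v in (i, (i - 1) % n)], ptr[:])
    return interp, div


def convective_term(face_flux: List[float], phi: List[float], interp: CSR, div: CSR) -> List[float]:
    """C(u) phi = div (diag(face_flux) interp) phi: rescale the rows of interp by the
    face fluxes (a coefficient update), then apply two SpMVs."""
    vals, cols, ptr = interp
    scaled = [vals[k] * face_flux[i] for i in range(len(ptr) - 1)
              for k in range(ptr[i], ptr[i + 1])]
    return spmv(div, spmv((scaled, cols, ptr), phi))
```

Because D telescopes over the periodic faces, the entries of the convective term sum to zero, which is a discrete form of conservation. When the velocity changes, only the scaled coefficients need updating; the sparsity pattern is reused.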

20.
Robust computational procedures for the solution of non-hydrostatic, irrotational and inviscid free-surface water waves in three space dimensions can be based on iterative preconditioned defect correction (PDC) methods. Such methods can be made efficient and scalable. This enables prediction of free-surface wave transformation and accurate wave kinematics in both deep and shallow waters over large marine areas, as well as prediction of the outcome of experiments in large numerical wave tanks. We revisit the classical governing equations, namely the fully nonlinear and dispersive potential flow equations, and present a new, detailed fundamental analysis of iterative solvers using finite-amplitude wave solutions. We demonstrate that the PDC method combined with a high-order discretization enables efficient and scalable solution of the linear system of equations arising in potential flow models. Our study is particularly relevant for fast and efficient simulation of non-breaking, fully nonlinear water waves over varying bottom topography, where simulations may be limited by computational resources or requirements. Different PDC strategies call for different choices of discretization parameters. To gain insight into these choices and the underlying algorithmic properties, we systematically study the limits of accuracy, convergence rate, algorithmic and numerical efficiency, and scalability of the most efficient known PDC methods. These strategies are of interest because they enable the generalization of geometric multigrid methods to high-order accurate discretizations and significantly improve numerical efficiency while incurring minimal storage requirements. We demonstrate the robustness of such PDC methods over practical ranges of interest for coastal and maritime engineering, that is, from shallow to deep water, and report details of numerical experiments that can be used for benchmarking purposes. Copyright © 2014 John Wiley & Sons, Ltd.
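A PDC iteration repeatedly computes the defect of the accurate (high-order) operator A and corrects it with a cheaper operator M: x <- x + M^-1 (b - A x). The sketch below uses stand-ins for the paper's operators: a 1D fourth-order stencil replaces the 3D potential-flow discretization as A, and the second-order stencil, solved directly by the Thomas algorithm, replaces the multigrid-based preconditioner as M. For sine modes, the ratio of the two stencils' eigenvalues is 1 + s/3 with s = sin^2(theta/2). The error is therefore multiplied by at most 1/3 per iteration in this model problem.

```python
from typing import List


def apply_high_order(x: List[float]) -> List[float]:
    """Fourth-order -d2/dx2 stencil (1, -16, 30, -16, 1)/12, scaled by h^2, with
    homogeneous Dirichlet ends: boundary values are 0 and outer ghosts use odd reflection."""
    n = len(x)

    def val(i: int) -> float:
        if i == -1 or i == n:
            return 0.0            # boundary points
        if i == -2:
            return -x[0]          # odd reflection about the left boundary
        if i == n + 1:
            return -x[n - 1]      # odd reflection about the right boundary
        return x[i]

    return [(val(i - 2) - 16 * val(i - 1) + 30 * x[i] - 16 * val(i + 1) + val(i + 2)) / 12.0
            for i in range(n)]


def solve_low_order(r: List[float]) -> List[float]:
    """Thomas algorithm for the second-order stencil (-1, 2, -1): the cheap preconditioner."""
    n = len(r)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, r[0] / 2.0
    for i in range(1, n):
        m = 2.0 + cp[i - 1]                  # b_i - a_i * cp_{i-1} with a_i = -1
        cp[i] = -1.0 / m
        dp[i] = (r[i] + dp[i - 1]) / m       # (d_i - a_i * dp_{i-1}) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]      # back substitution
    return x


def pdc(b: List[float], iters: int) -> List[float]:
    """Preconditioned defect correction: x <- x + M^-1 (b - A x)."""
    x = [0.0] * len(b)
    for _ in range(iters):
        defect = [bi - ai for bi, ai in zip(b, apply_high_order(x))]
        x = [xi + ci for xi, ci in zip(x, solve_low_order(defect))]
    return x
```

Only the high-order operator ever needs to be applied, never inverted. This is what lets a solver built for the low-order operator, such as multigrid, drive a high-order discretization.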

