Similar Literature
18 similar records found (search time: 171 ms)
1.
Smoothed particle hydrodynamics (SPH) is a meshless Lagrangian numerical method widely used in computational fluid dynamics to simulate complex free-surface flows. Its main drawback is the heavy computational cost, which GPU-based parallel computing can effectively alleviate. This paper applies a GPU-based parallel SPH method to the water-entry slamming of a two-dimensional wedge. The numerical results agree well with the corresponding analytical solutions in the literature, verifying the accuracy and reliability of the GPU-based SPH method. The simulations also show that GPU parallelization significantly accelerates the SPH computation.
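As an illustration of the per-particle parallelism described above, the following is a minimal CUDA sketch (not the paper's code) of an SPH density summation: one thread per particle, a brute-force neighbour loop, and a 2D cubic-spline kernel. The function names, smoothing length, and test data are hypothetical.

```cuda
// Minimal sketch of a GPU-parallel SPH density summation (illustrative only).
// Assumes a brute-force O(N^2) neighbour loop and a 2D cubic-spline kernel.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float cubicSpline2D(float r, float h) {
    // 2D cubic-spline kernel, normalisation 10 / (7 * pi * h^2)
    const float sigma = 10.0f / (7.0f * 3.14159265f * h * h);
    float q = r / h;
    if (q < 1.0f)      return sigma * (1.0f - 1.5f * q * q + 0.75f * q * q * q);
    else if (q < 2.0f) { float t = 2.0f - q; return sigma * 0.25f * t * t * t; }
    return 0.0f;
}

// One thread per particle: sum kernel-weighted masses of all neighbours.
__global__ void sphDensity(const float2* pos, const float* mass,
                           float* rho, int n, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 pi = pos[i];
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {                 // brute-force neighbour loop
        float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y;
        sum += mass[j] * cubicSpline2D(sqrtf(dx * dx + dy * dy), h);
    }
    rho[i] = sum;
}

int main() {
    const int n = 1024; const float h = 0.02f;
    float2* pos; float *mass, *rho;
    cudaMallocManaged(&pos, n * sizeof(float2));
    cudaMallocManaged(&mass, n * sizeof(float));
    cudaMallocManaged(&rho, n * sizeof(float));
    for (int i = 0; i < n; ++i) {                 // simple lattice of unit-mass particles
        pos[i] = make_float2(0.01f * (i % 32), 0.01f * (i / 32));
        mass[i] = 1.0f;
    }
    sphDensity<<<(n + 255) / 256, 256>>>(pos, mass, rho, n, h);
    cudaDeviceSynchronize();
    printf("rho[0] = %f\n", rho[0]);
    cudaFree(pos); cudaFree(mass); cudaFree(rho);
    return 0;
}
```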

2.
This work studies the application of GPU (Graphics Processing Units) computing to the finite element method, covering global stiffness matrix computation and assembly, sparse matrix-vector multiplication, and the solution of linear systems; the implementation and tests were carried out on a GTX 295 GPU using the CUDA (Compute Unified Device Architecture) platform. The global stiffness matrix is stored in GPU memory in CSR (Compressed Sparse Row) format, element coloring is used for parallel assembly, and the conjugate gradient method is used to solve the large linear systems. For space-truss and plane problems with up to three million degrees of freedom, the GPU finite element computation achieves speedups of up to 9.5x and 6.5x respectively; the speedup grows approximately linearly with the number of degrees of freedom, and the peak GFLOP/s rate increases by nearly a factor of ten.
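The core GPU operation mentioned above, the CSR sparse matrix-vector product inside a conjugate-gradient solve, can be illustrated with a minimal CUDA sketch. This is a generic one-thread-per-row CSR SpMV, not the paper's implementation, and the 3x3 test matrix is made up.

```cuda
// Minimal sketch of a CSR sparse matrix-vector product, y = A*x, with one
// thread per matrix row. This kernel is the workhorse of a GPU CG solver.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void csrSpMV(int nRows, const int* rowPtr, const int* colIdx,
                        const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    double sum = 0.0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)   // nonzeros of this row
        sum += val[k] * x[colIdx[k]];
    y[row] = sum;
}

int main() {
    // 3x3 example: A = [4 1 0; 1 4 1; 0 1 4], x = [1 1 1]
    int h_rowPtr[] = {0, 2, 5, 7};
    int h_colIdx[] = {0, 1, 0, 1, 2, 1, 2};
    double h_val[] = {4, 1, 1, 4, 1, 1, 4};
    double h_x[] = {1, 1, 1};
    int *rowPtr, *colIdx; double *val, *x, *y;
    cudaMallocManaged(&rowPtr, sizeof(h_rowPtr)); cudaMallocManaged(&colIdx, sizeof(h_colIdx));
    cudaMallocManaged(&val, sizeof(h_val));       cudaMallocManaged(&x, sizeof(h_x));
    cudaMallocManaged(&y, 3 * sizeof(double));
    memcpy(rowPtr, h_rowPtr, sizeof(h_rowPtr)); memcpy(colIdx, h_colIdx, sizeof(h_colIdx));
    memcpy(val, h_val, sizeof(h_val));          memcpy(x, h_x, sizeof(h_x));
    csrSpMV<<<1, 32>>>(3, rowPtr, colIdx, val, x, y);
    cudaDeviceSynchronize();
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   // expect [5 6 5]
    return 0;
}
```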

3.
The two-dimensional shallow-water equations describing the flow are solved with the finite volume method to simulate flood-wave propagation, and the program is accelerated with GPU parallel computing, yielding an efficient shallow-water simulation method. Numerical results show that, with the proposed GPU parallel strategy and the CUDA architecture, speedups of up to 112x over a single CPU core can be achieved, providing effective support for rapid flood prediction and disaster-mitigation decision making on a single machine. In addition, the accuracy of the GPU-parallel shallow-water simulation is verified and the parallel performance optimization is analyzed. The model is used to simulate dam-break flood propagation among three-dimensional obstacles.
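To illustrate the "one thread per finite-volume cell" pattern used in GPU shallow-water solvers such as the one described above, here is a deliberately simplified CUDA sketch: a single explicit Lax-Friedrichs step for the 1D shallow-water equations with a dam-break initial state. It is not the paper's 2D scheme; the grid size, time step, and boundary handling are arbitrary choices for the example.

```cuda
// Illustrative sketch: one explicit Lax-Friedrichs step for the 1D
// shallow-water equations, one GPU thread per finite-volume cell.
#include <cstdio>
#include <utility>
#include <cuda_runtime.h>

__device__ void flux(float h, float hu, float g, float* f0, float* f1) {
    float u = hu / h;
    *f0 = hu;
    *f1 = hu * u + 0.5f * g * h * h;
}

__global__ void lfStep(const float* h, const float* hu,
                       float* hNew, float* huNew,
                       int n, float dt, float dx, float g) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;             // keep boundary cells fixed
    float fl0, fl1, fr0, fr1;
    flux(h[i - 1], hu[i - 1], g, &fl0, &fl1);
    flux(h[i + 1], hu[i + 1], g, &fr0, &fr1);
    float lam = dt / (2.0f * dx);
    hNew[i]  = 0.5f * (h[i - 1] + h[i + 1])   - lam * (fr0 - fl0);
    huNew[i] = 0.5f * (hu[i - 1] + hu[i + 1]) - lam * (fr1 - fl1);
}

int main() {
    const int n = 256; const float dx = 1.0f / n, dt = 0.1f * dx, g = 9.81f;
    float *h, *hu, *hN, *huN;
    cudaMallocManaged(&h, n * sizeof(float));  cudaMallocManaged(&hu, n * sizeof(float));
    cudaMallocManaged(&hN, n * sizeof(float)); cudaMallocManaged(&huN, n * sizeof(float));
    for (int i = 0; i < n; ++i) {                 // dam-break style initial condition
        h[i] = (i < n / 2) ? 2.0f : 1.0f; hu[i] = 0.0f;
        hN[i] = h[i]; huN[i] = 0.0f;
    }
    for (int step = 0; step < 100; ++step) {
        lfStep<<<(n + 127) / 128, 128>>>(h, hu, hN, huN, n, dt, dx, g);
        cudaDeviceSynchronize();
        std::swap(h, hN); std::swap(hu, huN);     // ping-pong the state arrays
    }
    printf("h at mid-domain: %f\n", h[n / 2]);
    cudaFree(h); cudaFree(hu); cudaFree(hN); cudaFree(huN);
    return 0;
}
```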

4.
于要杰  刘锋  高超  冯毅 《力学学报》2021,53(6):1586-1598
Recently, the high-order flux reconstruction (FR) scheme on unstructured grids has attracted increasing attention because of its simple construction and generality. However, applying FR to large-scale simulations of complex flows still suffers from high computational cost and long solution times, so efficient implicit solvers and matching parallel computing techniques are urgently needed. This paper presents a single-GPU implicit time-marching method based on block Jacobi iteration for solving the steady two-dimensional Euler equations with a high-order FR scheme. Directly solving the global linear system arising from the FR spatial discretization and implicit temporal discretization is inefficient and memory-intensive. Block Jacobi iteration changes the character of the left-hand-side matrix of the global system and removes the neighbor-cell coupling that hinders parallelism, so that only the diagonal block matrices need to be stored and computed. Solving the global linear system is thereby reduced to solving a series of local element-wise linear systems, and these small local systems can be solved in parallel on the GPU with LU decomposition. Two numerical experiments, inviscid flow over a two-dimensional bump and inviscid flow around a NACA0012 airfoil, show that the implicit method requires far fewer iterations and much less computing time to converge than an explicit Runge-Kutta scheme accelerated by multigrid, with at least an order-of-magnitude improvement in computational efficiency.
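The key point above, that block Jacobi reduces the implicit update to many independent small element-local linear systems, can be sketched in CUDA as follows: each thread factorizes and solves its own dense K-by-K block with an in-place, non-pivoting LU. This is only an illustration under an assumed block size and data layout, not the paper's solver.

```cuda
// Illustrative sketch: one thread per element solves its own small dense
// K-by-K local system by in-place Doolittle LU (no pivoting) plus
// forward/back substitution, as in a block-Jacobi sweep.
#include <cstdio>
#include <cuda_runtime.h>

#define K 4   // size of each local block (hypothetical)

__global__ void blockLUSolve(double* A, double* b, int nBlocks) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // element index
    if (e >= nBlocks) return;
    double* a = A + (size_t)e * K * K;               // this element's matrix
    double* x = b + (size_t)e * K;                   // this element's RHS / solution
    // In-place LU factorisation (Doolittle, no pivoting).
    for (int k = 0; k < K; ++k)
        for (int i = k + 1; i < K; ++i) {
            a[i * K + k] /= a[k * K + k];
            for (int j = k + 1; j < K; ++j)
                a[i * K + j] -= a[i * K + k] * a[k * K + j];
        }
    // Forward substitution (L has unit diagonal), then back substitution.
    for (int i = 1; i < K; ++i)
        for (int j = 0; j < i; ++j) x[i] -= a[i * K + j] * x[j];
    for (int i = K - 1; i >= 0; --i) {
        for (int j = i + 1; j < K; ++j) x[i] -= a[i * K + j] * x[j];
        x[i] /= a[i * K + i];
    }
}

int main() {
    const int nBlocks = 2;
    double *A, *b;
    cudaMallocManaged(&A, nBlocks * K * K * sizeof(double));
    cudaMallocManaged(&b, nBlocks * K * sizeof(double));
    for (int e = 0; e < nBlocks; ++e)
        for (int i = 0; i < K; ++i) {                // diagonally dominant test blocks
            for (int j = 0; j < K; ++j) A[e * K * K + i * K + j] = (i == j) ? 4.0 : 1.0;
            b[e * K + i] = 1.0;
        }
    blockLUSolve<<<1, 32>>>(A, b, nBlocks);
    cudaDeviceSynchronize();
    printf("block 0 solution: %f %f %f %f\n", b[0], b[1], b[2], b[3]);  // each 1/7
    cudaFree(A); cudaFree(b);
    return 0;
}
```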

5.
Large-scale three-dimensional finite difference grid generation underlies three-dimensional finite difference computation, and its efficiency is an active research topic. Traditional stair-step finite difference grid generation methods include the ray-penetration method and the slicing method. Building on the traditional serial ray-penetration method, this paper proposes a parallel stair-step finite difference grid generation algorithm based on GPU (graphic processing unit) parallel computing. The parallel algorithm adopts a batched data-transfer strategy so that the problem size it can handle does not depend on the GPU memory size, balancing data-transfer efficiency against grid-generation scale. To reduce the volume of transferred data, the ray origin coordinates are generated independently inside the GPU threads, further improving execution efficiency and the degree of parallelism. Numerical comparisons show that the parallel algorithm is far more efficient than the traditional ray-penetration method. Finally, finite difference computation examples confirm that the parallel algorithm meets the needs of large-scale numerical simulation of complex models.
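The idea of generating each ray's origin inside the GPU thread (rather than transferring origin arrays from the host) can be illustrated with a small CUDA sketch in which each thread owns one (i, j) grid column and casts a z-direction ray against an analytic sphere to flag interior cells. The real algorithm casts rays against general model geometry; the sphere, grid size, and data layout here are assumptions made only for the example.

```cuda
// Illustrative sketch of ray-casting grid generation: each thread derives its
// own ray origin from its (i, j) index, intersects the ray with an analytic
// sphere, and marks the grid cells that lie inside the model.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void rayCastGrid(unsigned char* grid, int nx, int ny, int nz,
                            float dx, float cx, float cy, float cz, float R) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;
    // Ray origin generated in-thread from the grid indices (cell centres).
    float x = (i + 0.5f) * dx, y = (j + 0.5f) * dx;
    float d2 = (x - cx) * (x - cx) + (y - cy) * (y - cy);
    float zmin = -1.0f, zmax = -2.0f;                 // empty interval by default
    if (d2 <= R * R) {                                // the ray hits the sphere
        float half = sqrtf(R * R - d2);
        zmin = cz - half; zmax = cz + half;
    }
    for (int k = 0; k < nz; ++k) {                    // mark cells inside the model
        float z = (k + 0.5f) * dx;
        grid[(size_t)k * nx * ny + (size_t)j * nx + i] = (z >= zmin && z <= zmax);
    }
}

int main() {
    const int nx = 64, ny = 64, nz = 64; const float dx = 1.0f / nx;
    unsigned char* grid;
    cudaMallocManaged(&grid, (size_t)nx * ny * nz);
    dim3 block(16, 16), launch((nx + 15) / 16, (ny + 15) / 16);
    rayCastGrid<<<launch, block>>>(grid, nx, ny, nz, dx, 0.5f, 0.5f, 0.5f, 0.3f);
    cudaDeviceSynchronize();
    size_t solid = 0;
    for (size_t c = 0; c < (size_t)nx * ny * nz; ++c) solid += grid[c];
    printf("solid cells: %zu of %d\n", solid, nx * ny * nz);
    cudaFree(grid);
    return 0;
}
```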

6.
To address the slow convergence of iterative solvers for the linear systems arising in the element-free Galerkin (EFG) method, a parallel EFG algorithm coupling GPU computing with a preconditioned conjugate gradient (PCG) method is proposed. The diagonal preconditioning matrix is obtained during the parallel joint assembly of the global stiffness matrix and the global penalty stiffness matrix, which effectively saves GPU memory and computing time; a tetrahedral background integration mesh improves the algorithm's adaptability to problems with complex three-dimensional geometries. Two three-dimensional examples verify the feasibility of the algorithm: compared with the plain conjugate gradient method, PCG reduces the iteration count by a factor of up to 1686 and the iteration time by a factor of up to 1003. The relationship between speedup, thread count, and node count is also examined: the speedup peaks at 64 threads, the speedup of PCG is up to 4.5 times that of CG, and the maximum PCG speedup reaches 88.5x.
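The diagonal (Jacobi) preconditioning step at the heart of the PCG iteration described above amounts to an element-wise division z = D^{-1} r, which maps naturally to one GPU thread per unknown. The following minimal CUDA sketch shows only this step; the full PCG driver, assembly, and penalty treatment are omitted, and the test data are made up.

```cuda
// Minimal sketch of applying a diagonal (Jacobi) preconditioner, z = D^{-1} r,
// inside a preconditioned conjugate-gradient iteration: one thread per unknown.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void jacobiApply(const double* diag, const double* r, double* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = r[i] / diag[i];          // element-wise division
}

int main() {
    const int n = 8;
    double *diag, *r, *z;
    cudaMallocManaged(&diag, n * sizeof(double));
    cudaMallocManaged(&r, n * sizeof(double));
    cudaMallocManaged(&z, n * sizeof(double));
    // The diagonal would normally be collected while the global stiffness and
    // penalty stiffness matrices are assembled; here it is just test data.
    for (int i = 0; i < n; ++i) { diag[i] = 2.0 + i; r[i] = 1.0; }
    jacobiApply<<<1, 32>>>(diag, r, z, n);
    cudaDeviceSynchronize();
    printf("z[0]=%g z[7]=%g\n", z[0], z[7]);   // expect 0.5 and 1/9
    cudaFree(diag); cudaFree(r); cudaFree(z);
    return 0;
}
```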

7.
Building a structural finite element model of an aerospace vehicle is the basis for accurate flight simulation and for monitoring and diagnosing structural faults during the on-orbit flight phase. Using a simplified beam model of a slender-body vehicle, new CUDA (Compute Unified Device Architecture)-based algorithms are proposed for generating the element stiffness matrices and assembling the global stiffness matrix. Exploiting the symmetry of the beam element matrices, a parallel generation algorithm is designed and refined for the GPU hardware architecture. To reduce assembly time, a coloring algorithm is used during assembly and a nonzero-entry assembly strategy based on GPU (Graphics Processing Unit) shared memory is proposed. Comparisons across different computing platforms verify the speed of the new algorithms. Numerical examples show that the algorithm is efficient and, for models within a certain size range, satisfies the real-time requirements of rapid computation and diagnosis.
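A minimal CUDA sketch of coloring-based assembly follows: 1D two-node elements are split into two colors so that same-color elements share no nodes and can be scattered into the global matrix in parallel without atomics, one kernel launch per color. A dense global matrix is used for brevity; the paper's beam elements, sparse storage, and shared-memory strategy are not reproduced here.

```cuda
// Illustrative sketch of colouring-based stiffness assembly: elements of the
// same colour share no nodes, so each colour is assembled in parallel with no
// write conflicts. 1D bar elements and a dense global matrix keep it short.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void assembleColor(double* K, int nNodes, int nElems, int color, double k) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElems || (e & 1) != color) return;   // only elements of this colour
    int a = e, b = e + 1;                          // the element's two node numbers
    // Element stiffness k*[1 -1; -1 1] scattered into the dense global matrix.
    K[a * nNodes + a] += k;  K[a * nNodes + b] -= k;
    K[b * nNodes + a] -= k;  K[b * nNodes + b] += k;
}

int main() {
    const int nElems = 8, nNodes = nElems + 1; const double k = 1.0;
    double* K;
    cudaMallocManaged(&K, nNodes * nNodes * sizeof(double));
    cudaMemset(K, 0, nNodes * nNodes * sizeof(double));
    for (int color = 0; color < 2; ++color) {      // one kernel launch per colour
        assembleColor<<<1, 64>>>(K, nNodes, nElems, color, k);
        cudaDeviceSynchronize();                   // colours must not overlap in time
    }
    printf("K[1][1]=%g (expect 2), K[0][0]=%g (expect 1)\n",
           K[1 * nNodes + 1], K[0 * nNodes + 0]);
    cudaFree(K);
    return 0;
}
```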

8.
Modern graphics processing units (GPUs) offer strong parallel numerical computing capability. This paper briefly introduces the GPU hardware architecture, the data structures and implementation techniques of general-purpose GPU computing, and the OpenGL shading language used to write fragment programs. It then reviews progress in applying GPUs to large-scale mechanics problems, including: simulating natural fluid phenomena, which in essence solves the Navier-Stokes equations with the finite difference method; finite element computation, where a GPU-based conjugate gradient method solves the finite element equations; molecular dynamics, where the GPU computes short-range interatomic forces and builds neighbor lists; quantum Monte Carlo computation; and the gravitational n-body problem, where GPU textures store the positions, masses, velocities, and accelerations of the n bodies. Comparing GPU-based with CPU-based computation, the following GPU implementations have been completed: Gaussian elimination and the conjugate gradient method for linear systems, applied to large-scale finite element analysis; acceleration of meshless methods; acceleration of linear and nonlinear molecular structural mechanics methods; and analysis of the mechanical properties of carbon nanotubes. Research directions for GPUs in large-scale mechanics computation are indicated.

10.
As the limiting detectable star magnitude of star sensors improves, the number of faint navigation stars and the size of the identification feature database grow sharply, making star map identification slow and prone to mismatches. After reviewing the principle and shortcomings of the traditional triangle star-identification algorithm, and drawing on the idea of primary-star identification, this paper proposes a fast autonomous all-sky identification algorithm for faint stars based on the triangle perimeter, using the primary-star pair angular distance and the perimeter as identification features. Because constructing a 9.0 Mv all-sky navigation feature database is cumbersome, NVIDIA GPU parallel computing is adopted to speed it up; the CUDA implementation is more than 20 times faster than the CPU. A hash function over the perimeter partitions the feature database into segments stored as sub-blocks. During identification, a hash lookup on the observed triangle's perimeter quickly locates the relevant sub-block, and angular-distance matching is performed only within that sub-block, which increases both identification speed and success rate. Identification algorithms based on angular distance and on the perimeter feature were both verified with real star images: both can identify 9 Mv stars, with average identification times of 37.7123 s and 2.0422 s respectively. The latter has a clear advantage and can greatly increase the data update rate of the star sensor.

11.
We implement and evaluate a massively parallel and scalable algorithm based on a multigrid preconditioned Defect Correction method for the simulation of fully nonlinear free surface flows. The simulations are based on a potential model that describes wave propagation over uneven bottoms in three space dimensions and is useful for fast analysis and prediction purposes in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel by utilizing affordable modern special-purpose graphics processing units (GPUs). The model is based on a low-storage flexible-order accurate finite difference method that is known to be efficient and scalable on a CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a new trend in high-performance computing where many-core GPUs are utilized as high-throughput co-processors to the CPU. We describe and demonstrate how this approach makes it possible to do fast desktop computations for large nonlinear wave problems in numerical wave tanks (NWTs) with close to 50/100 million total grid points in double/single precision with 4 GB of global device memory available. A new code base has been developed in C++ and CUDA C and is found to improve the runtime by more than an order of magnitude in double-precision arithmetic for the same accuracy over an existing single-threaded CPU Fortran 90 code when executed on a single modern GPU. These significant improvements are achieved by carefully implementing the algorithm to minimize data transfer and to take advantage of the massive multi-threading capability of the GPU device. Copyright © 2011 John Wiley & Sons, Ltd.

12.
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.
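As a toy illustration of a DEM force pass in the one-thread-per-particle style discussed above, the following CUDA sketch computes linear-spring normal contact forces with a brute-force O(N^2) overlap check. Neighbor/cell lists, damping, and tangential friction, which any practical DEM code (including the one in the paper) would use, are deliberately omitted.

```cuda
// Illustrative DEM sketch: one thread per particle, brute-force overlap test,
// linear-spring normal contact force only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void demNormalForce(const float3* pos, float3* force,
                               int n, float radius, float kn) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 f = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
        float dist = sqrtf(dx * dx + dy * dy + dz * dz);
        float overlap = 2.0f * radius - dist;
        if (overlap > 0.0f && dist > 1e-7f) {      // particles are in contact
            float s = kn * overlap / dist;         // linear spring along the normal
            f.x += s * dx; f.y += s * dy; f.z += s * dz;
        }
    }
    force[i] = f;
}

int main() {
    const int n = 2; const float radius = 0.5f, kn = 100.0f;
    float3 *pos, *force;
    cudaMallocManaged(&pos, n * sizeof(float3));
    cudaMallocManaged(&force, n * sizeof(float3));
    pos[0] = make_float3(0.f, 0.f, 0.f);
    pos[1] = make_float3(0.9f, 0.f, 0.f);          // overlap of 0.1
    demNormalForce<<<1, 32>>>(pos, force, n, radius, kn);
    cudaDeviceSynchronize();
    printf("force on particle 0: %f (expect about -10)\n", force[0].x);
    cudaFree(pos); cudaFree(force);
    return 0;
}
```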

13.
14.
The Kalman filter is a sequential estimation scheme that combines predicted and observed data to reduce the uncertainty of the next prediction. Because of its sequential nature, the algorithm cannot be efficiently implemented on modern parallel compute hardware, nor can it be practically applied to large-scale dynamical systems because of memory issues. In this paper, we attempt to address the pitfalls of an earlier low-memory approach and extend it for parallel implementation. First, we describe a low-memory method that packs the covariance matrix data employed by the Kalman filter into a low-memory form by means of a certain quasi-Newton approximation. Second, we derive a parallel formulation of the filtering task, which allows several filter iterations to be computed independently. Furthermore, this leads to an improvement of estimation quality, as the method takes into account the cross-correlations between consequent system states. We experimentally demonstrate this improvement by comparing the suggested algorithm with other data assimilation methods that can benefit from parallel implementation. Copyright © 2016 John Wiley & Sons, Ltd.

15.
A parallel adaptive refinement algorithm for three-dimensional unstructured grids is presented. The algorithm is based on a hierarchical h-refinement/derefinement scheme for tetrahedral elements. The algorithm has been fully parallelized for shared-memory platforms via a domain decomposition of the mesh at the algebraic level. The effectiveness of the procedure is demonstrated with applications involving unsteady compressible fluid flow. A parallel speedup study of the algorithm is also included. Published in 2004 by John Wiley & Sons, Ltd.

16.
A molecular structural mechanics approach to carbon nanotubes on graphics processing units (GPUs) is reported. As a powerful parallel and relatively low-cost processor, the GPU is used to accelerate the computations of the molecular structural mechanics approach. The data structures, matrix-vector multiplication algorithm, texture reduction algorithm, and ICCG method on the GPU are presented. Computations of the Young's moduli of carbon nanotubes by the molecular structural mechanics approach on the GPU demonstrate its accuracy. The running times for carbon nanotubes with large numbers of degrees of freedom (DOF), larger than 100,000, on the GPU are compared against those on the CPU, showing that the GPU can accelerate the computations of the molecular structural mechanics approach to carbon nanotubes.
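The reduction step mentioned above (the paper uses a texture-based reduction) is commonly written today as a shared-memory tree reduction; the following CUDA sketch shows that generic pattern, with the final accumulation of per-block partial sums done on the host. It is an illustrative equivalent, not the paper's algorithm.

```cuda
// Minimal sketch of a block-level parallel sum reduction in shared memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* blockSums, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;               // load one element per thread
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();                           // tree reduction in shared memory
    }
    if (tid == 0) blockSums[blockIdx.x] = s[0];    // one partial sum per block
}

int main() {
    const int n = 1 << 16, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(in, partial, n);
    cudaDeviceSynchronize();
    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += partial[b];   // finish on the host
    printf("sum = %.0f (expect %d)\n", total, n);
    cudaFree(in); cudaFree(partial);
    return 0;
}
```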

17.
To reduce the computational cost of the element-free Galerkin (EFG) method, a GPU-accelerated parallel EFG algorithm is proposed in which essential boundary conditions are imposed with the penalty method, the stiffness matrix is assembled node pair by node pair, and the sparse linear system stored in CSR format is solved with the conjugate gradient method. A unified format for the stiffness matrix and the penalty stiffness matrix is given, together with a flowchart of the GPU parallel algorithm. A GPU program based on the CUDA architecture was written and its performance was tested and analyzed on an NVIDIA GeForce GTX 660 card through numerical examples, including an examination of the factors affecting the speedup. The results verify the feasibility of the algorithm: with the required accuracy maintained, the speedup reaches up to 17x, and the solution of the linear system is the decisive factor in the overall speedup.

18.
An efficient computing framework, namely PFlows, for fully resolved direct numerical simulations of particle-laden flows was accelerated on NVIDIA Graphics Processing Units (GPUs) and GPU-like accelerator (DCU) cards. The framework couples the lattice Boltzmann method for fluid flow with the immersed boundary method for fluid-particle interaction and the discrete element method for particle collisions, using two fixed Eulerian meshes and one moving Lagrangian point mesh, respectively. All parts are accelerated by a fine-grained parallelism technique using CUDA on GPUs and HIP on DCU cards; that is, the calculation for each fluid grid point, each immersed boundary point, each particle motion, and each pair-wise particle collision is handled by one compute thread. Coalesced memory accesses to the LBM distribution functions, stored with a Structure-of-Arrays data layout, are used to maximize utilization of hardware bandwidth. Parallel reduction in shared memory is adopted for the immersed-boundary-point data to reduce global memory accesses when integrating the particle hydrodynamic force. MPI is further used for computing on heterogeneous architectures with multiple CPUs and GPUs/DCUs, and communications between adjacent processors are hidden by overlapping them with computation. Two benchmark cases, a pure fluid flow and a particle-laden flow, were conducted for code validation. Performance on a single accelerator shows that a GPU V100 achieves a 7.1-11.1x speedup and a single DCU a 5.6-8.8x speedup over a single Xeon CPU chip (32 cores). On multiple accelerators, the parallel efficiency is 0.5-0.8 for weak scaling and 0.68-0.9 for strong scaling on up to 64 DCU cards, even for a dense flow (φ = 20%). The peak performance reaches 179 giga lattice updates per second (GLUPS) on 256 DCU cards with 1 billion grid points and 1 million particles. Finally, a large-scale simulation of a gas-solid flow with 1.6 billion grid points and 1.6 million particles was conducted using only 32 DCU cards. These results show that the present framework is promising for simulations of large-scale particle-laden flows in the upcoming exascale computing era.
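The Structure-of-Arrays layout credited above with enabling coalesced access to the LBM distribution functions can be sketched as follows: f is stored as Q contiguous planes of nx*ny cells, so consecutive threads touch consecutive addresses. The CUDA kernel below performs only a periodic D2Q9 pull-streaming step; collision, the immersed boundary coupling, DEM, and the MPI layer of PFlows are not shown, and all sizes are arbitrary.

```cuda
// Illustrative sketch of the SoA layout for LBM distributions and a periodic
// D2Q9 "pull" streaming step, one thread per lattice cell.
#include <cstdio>
#include <cuda_runtime.h>

#define Q 9
__constant__ int cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

__global__ void streamPull(const float* fSrc, float* fDst, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int cell = y * nx + x, nCells = nx * ny;
    for (int q = 0; q < Q; ++q) {
        int xs = (x - cx[q] + nx) % nx;            // periodic upstream neighbour
        int ys = (y - cy[q] + ny) % ny;
        // SoA indexing: plane q, then cell index -> coalesced across threads.
        fDst[q * nCells + cell] = fSrc[q * nCells + ys * nx + xs];
    }
}

int main() {
    const int nx = 128, ny = 128, nCells = nx * ny;
    float *fA, *fB;
    cudaMallocManaged(&fA, Q * nCells * sizeof(float));
    cudaMallocManaged(&fB, Q * nCells * sizeof(float));
    for (int i = 0; i < Q * nCells; ++i) fA[i] = 1.0f / Q;   // uniform initial state
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    streamPull<<<grid, block>>>(fA, fB, nx, ny);
    cudaDeviceSynchronize();
    printf("f(q=1, cell 0) after streaming = %f\n", fB[nCells + 0]);
    cudaFree(fA); cudaFree(fB);
    return 0;
}
```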
