Similar articles
1.
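Zhan Fei  Ma Xiaochuan  Yang Li 《Acta Acustica (声学学报)》2018,43(4):445-452
New target-detection schemes such as wideband coded pulses and multiple-input multiple-output (MIMO) sharply increase computation and data-storage requirements. Exploiting the data-level parallelism of the phase-coded pulse echo detection algorithm for underwater vehicles, this paper proposes running the algorithm on the many-core architecture of a graphics processing unit (GPU), and designs a parallel implementation framework for the GPU with regard to task allocation strategy, data processing flow, GPU hardware resource utilization, and memory access. Lake-trial data were used to test the performance of a desktop GPU platform, an embedded GPU platform, and a traditional vehicle signal-processing platform based on multi-core digital signal processors (DSPs). Compared with the multi-core DSP platform, the embedded GPU platform has advantages in power consumption and computing performance. The results show that an embedded GPU platform can greatly improve performance per watt and simplify system design, and can meet the large data volume, low power, and real-time requirements of new vehicle detection systems.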

2.
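GPU acceleration of numerical simulation of shock wave and flame interface interaction   Cited by: 1 (self-citations: 0, citations by others: 1)
Jiang Hua  Dong Gang  Chen Xiao 《Chinese Journal of Computational Physics (计算物理)》2016,33(1):23-29
To examine the computing capability of graphics processing units (GPUs) in computational fluid dynamics, a typical compressible reacting flow, the interaction between a shock wave and a flame interface, is simulated numerically with a CPU/GPU heterogeneous parallel approach. The parallel scheme is optimized, and the effect of grid resolution on the computed results and on the speedup is examined. Compared with conventional message-passing parallel computation using MPI with 8 threads, the GPU parallel simulation gives the same results as the MPI parallel simulation. The computing time of both methods grows linearly with the number of grid cells, but the GPU computing time is markedly lower than that of MPI. For a small grid (1.6×10^4 cells), the GPU speedup in average time per time step is 8.6; the speedup drops somewhat as the grid grows, but still reaches 5.9 for a larger grid (4.2×10^6 cells). GPU-based heterogeneous parallel acceleration offers a good route to high-resolution, large-scale computation of compressible reacting flows.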

3.
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9× to 12× speedup over the original CPU version, in addition to speedup from multi-threading. This is 2× faster than the fully-optimized GPU version, indicating the importance of optimizing CPU implementations.

4.
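Empirical potential structure refinement (EPSR) is software developed in the 1980s by the amorphous materials group at the UK spallation neutron source for analyzing neutron scattering data. Its goal is to reconstruct the three-dimensional atomic structure of a sample from neutron scattering data. Over the past decades, EPSR has been widely used to analyze neutron scattering experiments and has given experimental users reliable results. However, EPSR is a Fortran program based on shared-memory parallel computing (OpenMP); it supports neither cross-node parallelism on server clusters nor GPU acceleration, which limits its analysis speed. With server clusters now widely deployed and GPU acceleration in common use, it is necessary to rewrite the EPSR program to increase its speed. This paper presents NeuDATool, an open-source package written in object-oriented C++ that implements the EPSR algorithm; it uses MPI and CUDA C for cross-node cluster parallelism and GPU acceleration. The software was tested with neutron scattering data from liquid water and glassy silica. The tests show that it correctly reconstructs the three-dimensional atomic structure of the samples, and that for simulated systems of more than 100,000 atoms, GPU acceleration is more than 400 times faster than the serial CPU algorithm. NeuDATool offers a new option for neutron experiment users, especially experimental scientists who are familiar with C++ programming and want to define their own analysis algorithms.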

5.
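GPU parallel computing for dissipative particle dynamics   Cited by: 1 (self-citations: 0, citations by others: 1)
Lin Chensen  Chen Shuo  Li Qiliang  Yang Zhigang 《Acta Physica Sinica (物理学报)》2014,63(10):104702-104702
This paper studies the implementation of dissipative particle dynamics on graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). The algorithm mapping model, parallel updating of the Cell-List arrays, random number generation, memory access optimization, and load balancing are discussed in detail. Poiseuille flow and flow through a sudden expansion and contraction are then simulated to verify the correctness of the GPU results. The results show that, relative to serial computation on a central processing unit, GPU parallel computing in dissipative particle dynamics yields a speedup of about 20.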
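
The Cell-List bookkeeping named in this abstract can be illustrated with a minimal serial sketch. It is not the authors' CUDA implementation: the function names `build_cell_list` and `neighbor_pairs`, the two-dimensional periodic box, and the assumption that all coordinates lie in [0, box) are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from itertools import product


def build_cell_list(positions, box, cutoff):
    """Bin particles into square cells with edge length >= cutoff.

    Assumes (for this sketch) a square periodic box and 0 <= x, y < box.
    """
    n_cells = max(1, int(box // cutoff))  # number of cells along each edge
    cell_size = box / n_cells
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cx = int(x / cell_size) % n_cells
        cy = int(y / cell_size) % n_cells
        cells[(cx, cy)].append(idx)
    return cells, n_cells


def neighbor_pairs(positions, box, cutoff):
    """Return all index pairs (i, j), i < j, closer than cutoff."""
    cells, n_cells = build_cell_list(positions, box, cutoff)
    pairs = set()
    for (cx, cy), members in cells.items():
        # Only the home cell and its 8 periodic neighbours can hold partners.
        for dx, dy in product((-1, 0, 1), repeat=2):
            key = ((cx + dx) % n_cells, (cy + dy) % n_cells)
            for i in members:
                for j in cells.get(key, []):
                    if i >= j:
                        continue
                    ddx = positions[i][0] - positions[j][0]
                    ddy = positions[i][1] - positions[j][1]
                    # Minimum-image convention for the periodic box.
                    ddx -= box * round(ddx / box)
                    ddy -= box * round(ddy / box)
                    if ddx * ddx + ddy * ddy < cutoff * cutoff:
                        pairs.add((i, j))
    return pairs
```

In a GPU version, the per-particle binning and the per-cell neighbour loops are the parts that would typically be distributed over threads; the sketch keeps them serial so that the data structure itself is easy to follow.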

6.
A graphics processing unit (GPU) based fast calculation method for the computer generated spherical hologram (CGSH) of a real-existing object is proposed. A three-dimensional (3D) point cloud is constructed by capturing a real-existing object from multiple directions using a depth camera. GPU based calculation is used in both the hologram generation part and the numerical reconstruction part of the CGSH. The improved calculation efficiency is verified by comparing the computation speed of the central processing unit (CPU) based and GPU based implementations.

7.
The lattice Boltzmann method (LBM) can gain substantial performance benefits from graphics processing unit (GPU) computing, and thus GPU or multi-GPU based LBM can be considered a promising and competent candidate for the study of large-scale fluid flows. However, the multi-GPU based lattice Boltzmann algorithm has not been studied extensively, especially for simulations of flow in complex geometries. In this paper, through coupling with the message passing interface (MPI) technique, we present an implementation of multi-GPU based LBM for fluid flow through porous media, as well as some optimization strategies based on the data structure and layout, which can markedly reduce memory access and completely hide the communication time. The performance of the algorithm is then tested on a one-node cluster equipped with four Tesla C1060 GPU cards, where up to 1732 MFLUPS is achieved for the Poiseuille flow and a nearly linear speedup with the number of GPUs is also observed.

8.
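Huang Lei  Zhang Lichao  Yan Ran 《Journal of Applied Optics (应用光学)》2015,36(5):762-767
The digital speckle correlation method has advantages such as a simple measurement environment and full-field, non-contact measurement, but algorithm efficiency has long been one of the bottlenecks limiting its development. GPUs are inherently parallel, and high-performance GPU computing can bring large efficiency gains to computer graphics processing. Using the CUDA platform, the traditional point-by-point search algorithm, cross search algorithm, and genetic algorithm for digital speckle correlation are parallelized on the GPU and compared with the traditional implementations. The experimental results show that for speckle images of 150×150 pixels, the efficiency of the three methods improves by factors of 20, 8, and 31, respectively; for images of 500×500 pixels, by factors of 183, 33, and 44; and for images of 1000×1000 pixels, by factors of 424, 116, and 44.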
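
As a rough illustration of what a point-by-point search does, the sketch below scans every integer displacement in a window and keeps the one with the highest correlation. The abstract does not state which correlation criterion the authors use, so the zero-normalized cross-correlation (ZNCC) here is an assumption, as are the function names `zncc`, `subset`, and `point_by_point_search` and the list-of-lists image layout. It is a serial sketch, not the CUDA implementation described in the paper.

```python
import math


def zncc(a, b):
    """Zero-normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def subset(img, top, left, size):
    """Flatten the size x size window whose upper-left corner is (top, left)."""
    return [img[r][c]
            for r in range(top, top + size)
            for c in range(left, left + size)]


def point_by_point_search(ref, cur, top, left, size, max_shift):
    """Exhaustive integer-pixel search; returns (score, du, dv).

    du is the column shift and dv the row shift of the best-matching
    window in `cur` relative to the reference window in `ref`.
    """
    template = subset(ref, top, left, size)
    rows, cols = len(cur), len(cur[0])
    best = (-2.0, 0, 0)  # ZNCC is never below -1, so any candidate wins
    for dv in range(-max_shift, max_shift + 1):
        for du in range(-max_shift, max_shift + 1):
            t, l = top + dv, left + du
            if t < 0 or l < 0 or t + size > rows or l + size > cols:
                continue  # candidate window falls outside the image
            score = zncc(template, subset(cur, t, l, size))
            if score > best[0]:
                best = (score, du, dv)
    return best
```

Each candidate displacement is scored independently of the others, which is why this kind of search maps naturally onto one GPU thread (or thread block) per candidate or per measurement point.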

9.
Wang L  Zhao J  Di J  Jiang H 《Optics letters》2011,36(9):1620-1622
We present a simple and effective method for reconstructing extended focused images in digital holography using a graphics processing unit (GPU). The Fresnel transform method is simplified by an algorithm named fast Fourier transform pruning with frequency shift. Then the pixel size consistency problem is solved by coordinate transformation and combining the subpixel resampling and the fast Fourier transform pruning with frequency shift. With the assistance of the GPU, we implemented an improved parallel version of this method, which obtained about a 300-500-fold speedup compared with central processing unit codes.

10.
The auditory mismatch negativity (MMN) has been considered a preattentive index of auditory processing and/or a signature of prediction error computation. This study tries to demonstrate the presence of an MMN to deviant trials included in complex auditory stimuli sequences, and its possible relationship to predictive coding. Additionally, the transfer of information between trials is expected to be represented by stimulus-preceding negativity (SPN), which would possibly fit the predictive coding framework. To accomplish these objectives, the EEG of 31 subjects was recorded during an auditory paradigm in which trials composed of stimulus sequences with increasing or decreasing frequencies were intermingled with deviant trials presenting an unexpected ending. Our results showed the presence of an MMN in response to deviant trials. An SPN appeared during the intertrial interval and its amplitude was reduced in response to deviant trials. The presence of an MMN in complex sequences of sounds and the generation of an SPN component, with different amplitudes in deviant and standard trials, would support the predictive coding framework.

11.
Simulation time is one of the bottlenecks of the finite-difference time-domain (FDTD) method. There are several ways of reducing the simulation time, one of which is the use of a graphics processing unit (GPU). In this paper we therefore present a comparison between two free FDTD software packages, one based on the central processing unit and the other on the GPU. The 3D test structures we analyzed were a metallic rectangular cavity resonator and a microring resonator based refractive index sensor. The two FDTD packages are compared with regard to simulation time and numerical accuracy. It is shown that both packages agree in their numerical results and that the GPU based FDTD implementation performs the same simulation up to 18 times faster.

12.
The answers to data assimilation questions can be expressed as path integrals over all possible state and parameter histories. We show how these path integrals can be evaluated numerically using a Markov Chain Monte Carlo method designed to run in parallel on a graphics processing unit (GPU). We demonstrate the application of the method to an example with a transmembrane voltage time series of a simulated neuron as an input, and using a Hodgkin–Huxley neuron model. By taking advantage of GPU computing, we gain a parallel speedup factor of up to about 300, compared to an equivalent serial computation on a CPU, with performance increasing as the length of the observation time used for data assimilation increases.

13.
We present the implementation and performance of a new gravitational N-body tree-code that is specifically designed for the graphics processing unit (GPU). All parts of the tree-code algorithm are executed on the GPU. We present algorithms for parallel construction and traversing of sparse octrees. These algorithms are implemented in CUDA and tested on NVIDIA GPUs, but they are portable to OpenCL and can easily be used on many-core devices from other manufacturers. This portability is achieved by using general parallel-scan and sort methods. The gravitational tree-code outperforms tuned CPU code during the tree-construction and shows a performance improvement of more than a factor 20 overall, resulting in a processing rate of more than 2.8 million particles per second.

14.
Traditional real-time radiometric rendering of infrared (IR) scenes accepts larger calculation errors in order to achieve real-time performance. To address this problem, this article studies a comprehensive optimization method for IR rendering that secures both real-time performance and calculation accuracy. First, based on the effective average value principle, the spectrally coupled thermal emission and reflected radiation terms in the spectral radiometric equation are decomposed into physical quantities, and the equation is reduced to a simpler calculation between “primer” radiance terms and effective average factors. Second, a parameter processing method is proposed for the case in which the index parameters of the effective average factors exceed the maximum dimensionality of a graphics processing unit (GPU) look-up table (LUT), and a pre-calculation method is applied to improve the real-time evaluation efficiency of the physical quantities in the radiometric equation. Finally, the radiometric equation is computed concurrently in GPU based IR scene generation software, achieving precise, real-time rendering of three-dimensional IR scenes.

15.
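As the resolution of spatial light modulators grows, the computational load of three-dimensional dynamic hologram display grows with it, placing new demands on the speed of hologram computation. Fast layer-based computation of holograms is implemented with GPU parallel processing. The method uses GPU multithreading and the two-dimensional image Fourier transforms of the layer-based approach to accelerate the Fresnel diffraction transform; in addition, by calling low-level GPU resources and using stream processing in CUDA, it effectively reduces intermediate latency. A comparison of computing speeds shows a large improvement over CPU computation: the GPU parallel method is about 10 times faster than the CPU based method.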

16.
Chen Fuzhou  Cheng Chen  Luo Honggang 《Acta Physica Sinica (物理学报)》2019,68(12):120202-120202
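The density matrix renormalization group (DMRG) method achieves high accuracy for the ground state of one-dimensional strongly correlated lattice models; when applied to two-dimensional or quasi-two-dimensional problems, reaching similar accuracy usually requires much more computation and storage. This paper proposes a new heterogeneous parallel strategy for DMRG that uses the computing power of the central processing unit (CPU) and the graphics processing unit (GPU) at the same time. For the most time-consuming part, the diagonalization of the Hamiltonian, the data are stored in a distributed manner, and a load-balancing strategy between CPU and GPU is given. Taking the fermionic Hubbard model as an example, the heterogeneous parallel program is tested with different numbers of DMRG kept states, and the corresponding performance benchmarks are given. When applied to a four-leg ladder, the charge density stripes commonly seen in high-temperature superconductors are observed; here the number of kept states reaches 10^4 and the GPU memory used is less than 12 GB.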
