Similar literature
 20 similar documents found (search time: 15 ms)
1.
We present the implementation and performance of a new gravitational N-body tree-code that is specifically designed for the graphics processing unit (GPU). All parts of the tree-code algorithm are executed on the GPU. We present algorithms for parallel construction and traversal of sparse octrees. These algorithms are implemented in CUDA and tested on NVIDIA GPUs, but they are portable to OpenCL and can easily be used on many-core devices from other manufacturers. This portability is achieved by using general parallel-scan and sort methods. The gravitational tree-code outperforms tuned CPU code during the tree construction and shows a performance improvement of more than a factor of 20 overall, resulting in a processing rate of more than 2.8 million particles per second.
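
The abstract above names parallel scan and sort as the portable building blocks of the tree-code. As a rough illustration only (this is not the authors' CUDA code, and the function names `exclusive_scan` and `compact` are hypothetical), the sketch below shows in sequential Python how an exclusive prefix sum turns a list of keep/discard flags into output positions, a pattern commonly used to compact sparse data such as non-empty octree cells.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def compact(items, keep_flags):
    """Keep only flagged items, using scan results as write offsets.

    On a GPU each loop iteration could be one thread, because every
    kept item already knows its unique output slot from the scan.
    """
    if not items:
        return []
    offsets = exclusive_scan(keep_flags)
    total = offsets[-1] + keep_flags[-1]
    out = [None] * total
    for i, item in enumerate(items):
        if keep_flags[i]:
            out[offsets[i]] = item
    return out


flags = [1, 0, 1, 1, 0]
print(exclusive_scan(flags))
print(compact(list("abcde"), flags))
```

The scatter step has no dependencies between iterations, which is why this formulation maps well to many-core devices once a parallel scan primitive is available.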

2.
Graphics processing units (GPUs) are currently used as a cost-effective platform for computer simulations and big-data processing. Large-scale applications require that multiple GPUs work together, but the efficiency obtained with clusters of GPUs is, at times, sub-optimal because the GPU features are not exploited at their best. We describe how it is possible to achieve excellent efficiency for applications in statistical mechanics, particle dynamics and network analysis by using suitable memory access patterns and mechanisms such as CUDA streams, profiling tools, etc. Similar concepts and techniques may also be applied to other problems, such as the solution of partial differential equations.

3.
Solvent-mediated hydrodynamic interactions between colloidal particles can significantly alter their dynamics. We discuss the implementation of Stokesian dynamics in leading approximation for streaming processors as provided by the compute unified device architecture (CUDA) of recent graphics processors (GPUs). Thereby, the simulation of explicit solvent particles is avoided and hydrodynamic interactions can easily be accounted for in already available, highly accelerated molecular dynamics simulations. Special emphasis is put on efficient memory access and numerical stability. The algorithm is applied to the periodic sedimentation of a cluster of four suspended particles. Finally, we investigate the runtime performance of generic memory access patterns of complexity O(N^2) for various GPU algorithms relying on either hardware cache or shared memory.

4.
In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi architectures) are also presented. In order to obtain high performance, the code must be optimized for the GPU architecture, i.e., an implementation that exploits the memory hierarchy of the CUDA programming model.

5.
The Graphics Processing Unit (GPU), originally developed for real-time, high-definition 3D graphics in computer games, now provides great capability for solving scientific applications. The basis of particle transport simulation is the time-dependent, multi-group, inhomogeneous Boltzmann transport equation. The numerical solution to the Boltzmann equation involves the discrete ordinates (Sn) method and the procedure of source iteration. In this paper, we present a GPU-accelerated simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The performance of the GPU simulations is reported for simulations with a vacuum boundary condition. The relative advantages and disadvantages of the GPU implementation, the simulation on multiple GPUs, the programming effort and code portability are also discussed. The results show that the overall performance speedup of one NVIDIA Tesla M2050 GPU ranges from 2.56 compared with one Intel Xeon X5670 chip to 8.14 compared with one Intel Core Q6600 chip for no flux fixup. The simulation with flux fixup on one M2050 is 1.23 times faster than on one X5670.

6.
In this study, the application of the two-dimensional direct simulation Monte Carlo (DSMC) method using an MPI-CUDA parallelization paradigm on Graphics Processing Unit (GPU) clusters is presented. An all-device (i.e. GPU) computational approach is adopted where the entire computation is performed on the GPU device, leaving the CPU idle during all stages of the computation, including particle moving, indexing, particle collisions and state sampling. Communication between the GPU and host is only performed to enable multiple-GPU computation. Results show that the computational expense can be reduced by 15 and 185 times when using a single GPU and 16 GPUs respectively when compared to a single core of an Intel Xeon X5670 CPU. The demonstrated parallel efficiency is 75% when using 16 GPUs as compared to a single GPU for simulations using 30 million simulated particles. Finally, several very large-scale simulations in the near-continuum regime are employed to demonstrate the excellent capability of the current parallel DSMC method.

7.
This paper presents a parallel algorithm implemented on graphics processing units (GPUs) for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. NGIM reduces the computational time cost of the direct field evaluation at N observers due to N co-located sources from O(N^2) to O(N) in the static and low-frequency regimes, to O(N log N) in the high-frequency regime, and between these costs in the mixed-frequency regime. Memory requirements scale as O(N) in all frequency regimes. Several important differences between CPU and GPU implementations of the NGIM are required to achieve optimal performance on the respective platforms. In particular, in the CPU implementations all operations, where possible, are pre-computed and stored in memory in a preprocessing stage. This reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where memory handling is often a critical bottleneck, several special memory handling techniques are used to accelerate the computations. A significant latency of the GPU global memory access is hidden by implementing coalesced reading, which requires arranging many array elements in contiguous parts of memory. Contrary to the CPU version, most of the steps in the GPU implementations are executed on the fly and only necessary arrays are kept in memory. This results in significantly reduced memory consumption, increased problem size N that can be handled, and reduced computational time on GPUs. The obtained GPU–CPU speed-up ratios are from 150 to 400 depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.
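
For context on the O(N^2) baseline that this abstract says NGIM improves on, the following is a minimal sketch of a direct field evaluation, written in plain Python rather than the authors' CPU or GPU code. It assumes a kernel of the form exp(-jkR)/R (sign convention and the omitted 4π factor are assumptions), skips the self term, and uses the hypothetical name `direct_field`.

```python
import cmath
import math


def direct_field(points, amplitudes, k):
    """Direct O(N^2) sum of exp(-1j*k*R)/R over all source/observer pairs.

    Observers and sources are co-located, so the i == j term is skipped.
    Assumes no two distinct points coincide (R would be zero).
    """
    n = len(points)
    field = [0j] * n
    for i in range(n):          # observer index
        for j in range(n):      # source index
            if i == j:
                continue
            r = math.dist(points[i], points[j])
            field[i] += amplitudes[j] * cmath.exp(-1j * k * r) / r
    return field


# Two unit sources one length unit apart, static case (k = 0).
print(direct_field([(0, 0, 0), (1, 0, 0)], [1.0, 1.0], 0.0))
```

The double loop visits N(N - 1) pairs, which is the quadratic cost that interpolation-based schemes aim to avoid.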

8.
Satellite-observed radiance is a nonlinear functional of surface properties and atmospheric temperature and absorbing gas profiles as described by the radiative transfer equation (RTE). In the era of hyperspectral sounders with thousands of high-resolution channels, the computation of the radiative transfer model becomes more time-consuming. The radiative transfer model performance in operational numerical weather prediction systems still limits the number of channels we can use in hyperspectral sounders to only a few hundred. To take full advantage of such high-resolution infrared observations, a computationally efficient radiative transfer model is needed to facilitate satellite data assimilation. In recent years the programmable commodity graphics processing unit (GPU) has evolved into a highly parallel, multi-threaded, many-core processor with tremendous computational speed and very high memory bandwidth. The radiative transfer model is very suitable for GPU implementation to take advantage of the hardware’s efficiency and parallelism, where radiances of many channels can be calculated in parallel on GPUs. In this paper, we develop a GPU-based high-performance radiative transfer model for the Infrared Atmospheric Sounding Interferometer (IASI) launched in 2006 onboard the first European meteorological polar-orbiting satellite, METOP-A. Each IASI spectrum has 8461 spectral channels. The IASI radiative transfer model consists of three modules. The first module for computing the regression predictors takes less than 0.004% of CPU time, while the second module for transmittance computation and the third module for radiance computation take approximately 92.5% and 7.5%, respectively. Our GPU-based IASI radiative transfer model is developed to run on a low-cost personal supercomputer with four GPUs with a total of 960 compute cores, delivering nearly 4 TFlops theoretical peak performance.
By massively parallelizing the second and third modules, we reached 364× speedup for 1 GPU and 1455× speedup for all 4 GPUs, both with respect to the original CPU-based single-threaded Fortran code with the -O2 compiler optimization. The significant 1455× speedup using a computer with four GPUs means that the proposed GPU-based high-performance forward model is able to compute one day’s amount of 1,296,000 IASI spectra within nearly 10 min, whereas the original single CPU-based version would impractically take more than 10 days. This model runs at over 80% of the theoretical memory bandwidth with asynchronous data transfer. A novel CPU–GPU pipeline implementation of the IASI radiative transfer model is proposed. The GPU-based high-performance IASI radiative transfer model is suitable for the assimilation of the IASI radiance observations into the operational numerical weather forecast model.
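
The figures quoted in this abstract can be cross-checked with a little arithmetic. The short Python sketch below uses only the numbers stated above (364×, 1455×, 4 GPUs, 1,296,000 spectra, about 10 minutes); the variable names are illustrative and the "10 min" value is taken as exact for the purpose of the estimate.

```python
speedup_1gpu = 364.0          # speedup on one GPU vs. single-threaded CPU
speedup_4gpu = 1455.0         # speedup on four GPUs vs. the same CPU code
n_gpus = 4
spectra_per_day = 1_296_000   # one day's worth of IASI spectra
gpu_minutes = 10.0            # "within nearly 10 min" on four GPUs

# Multi-GPU scaling efficiency relative to ideal linear scaling.
scaling_efficiency = speedup_4gpu / (n_gpus * speedup_1gpu)

# Implied single-CPU run time for the same workload, in days.
cpu_days = gpu_minutes * speedup_4gpu / (60 * 24)

# Implied throughput on the four-GPU machine.
spectra_per_second = spectra_per_day / (gpu_minutes * 60)

print(f"scaling efficiency: {scaling_efficiency:.4f}")
print(f"implied CPU time:   {cpu_days:.2f} days")
print(f"throughput:         {spectra_per_second:.0f} spectra/s")
```

Under these assumptions the four-GPU result is very close to linear scaling, and the implied CPU time is consistent with the "more than 10 days" stated in the abstract.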

9.
We present and compare different approaches for using multiple Graphics Processing Units in the simulation of physical systems. As benchmarks we consider the time required to update a single spin of the 3D Heisenberg spin glass model, by using both the Over-relaxation and the Heat Bath algorithms, and the solution of a Poisson equation by using a finite-difference method. The results show that a suitable combination of techniques makes it possible to hide the communication overhead almost completely by using the CPU as a communication coprocessor of the GPU. Large-scale simulations on clusters of GPUs can be efficiently carried out by following the same approach for other applications where a clear cut exists between bulk and boundary data.

10.
Modern graphical processing units (GPUs) have recently become a pervasive technology able to rapidly solve large parallel problems which previously required runs on clusters or supercomputers. In this paper we propose an effective strategy to parallelize the T-matrix method on GPUs in order to speed up light scattering simulations. We have tackled two of the most computationally intensive scattering problems that are of interest in nano-optics: the scattering from an isolated non-axisymmetric particle and from an agglomerate of arbitrarily shaped particles. We show that by fully exploiting the GPU potential we can achieve more than 20 times (20×) acceleration over sequential execution in the investigated scenarios, opening exciting prospects in the analysis and design of optical nanostructures.

11.
12.
In this paper, we focus on the graphical processing unit (GPU) and discuss how its architecture affects the choice of algorithm and implementation of fully-implicit petroleum reservoir simulation. In order to obtain satisfactory performance on new many-core architectures such as GPUs, simulator developers must know a great deal about the specific hardware and spend a lot of time fine-tuning the code. Porting a large petroleum reservoir simulator to emerging hardware architectures is expensive and risky. We analyze major components of an in-house reservoir simulator and investigate how to port them to GPUs in a cost-effective way. Preliminary numerical experiments show that our GPU-based simulator is robust and effective. More importantly, these numerical results clearly identify the main bottlenecks to obtaining ideal speedup on GPUs and possibly other many-core architectures.

13.
J S Bagla, T Padmanabhan. Pramana, 1997, 49(2): 161-192
In this review we discuss cosmological N-body codes with a special emphasis on particle mesh codes. We present the mathematical model for each component of N-body codes. We compare alternative methods for computing each quantity by calculating errors for each of the components. We suggest an optimum set of components that can be combined to reduce the overall errors in N-body codes.

14.
We consider the time evolution of a system of N identical bosons whose interaction potential is rescaled by N^(-1). We choose the initial wave function to describe a condensate in which all particles are in the same one-particle state. It is well known that in the mean-field limit N → ∞ the quantum N-body dynamics is governed by the nonlinear Hartree equation. Using a nonperturbative method, we extend previous results on the mean-field limit in two directions. First, we allow a large class of singular interaction potentials as well as strong, possibly time-dependent external potentials. Second, we derive bounds on the rate of convergence of the quantum N-body dynamics to the Hartree dynamics.

15.
16.
Recent N-body simulations favor the presence of a co-rotating Dark Disk that might contribute significantly (10%–50%) to the local Dark Matter density. Such substructure could have a dramatic effect on directional detection. Indeed, in the case of a null lag velocity, one expects an isotropic WIMP velocity distribution arising from the Dark Disk contribution, which might weaken the strong angular signature expected in directional detection. For a wide range of Dark Disk parameters, we evaluate in this Letter the effect of such a dark component on the discovery potential of upcoming directional detectors. As a conclusion of our study, using only the angular distribution of nuclear recoils, we show that Dark Disk models as suggested by recent N-body simulations will not significantly affect the Dark Matter reach of directional detection, even in extreme configurations.

17.
The Hall-Post inequalities relating N-body to (N − 1)-body energies of quantum bound states are applied to delimit, in the space of coupling constants, the domain of Borromean binding where a composite system is bound while the smaller subsystems are unbound.

18.
We present large-scale molecular dynamics (MD) simulations in bcc iron containing a relatively long Griffith crack loaded in mode I at temperatures of 0 K and 300 K. We use N-body potentials of Finnis-Sinclair type. The paper also includes a stress analysis performed in the framework of anisotropic fracture mechanics as well as on the atomic level. It enables us to understand why brittle fracture is detected in MD at 0 K, while at 300 K ductile behavior is observed at the crack front, starting from the free sample surface.

19.
We study a three-dimensional continuous model of gravitating matter rotating at constant angular velocity. In the rotating reference frame, by a finite-dimensional reduction, we prove the existence of non-radial stationary solutions whose supports are made of an arbitrarily large number of disjoint compact sets, in the low angular velocity and large scale limit. At first order, the solutions behave like point particles, thus making the link with the relative equilibria in N-body dynamics.

20.
Several N-body problems in ordinary (3-dimensional) space are introduced which are characterized by Newtonian equations of motion (“acceleration equal force;” in most cases, the forces are velocity-dependent) and are amenable to exact treatment (“solvable” and/or “integrable” and/or “linearizable”). These equations of motion are always rotation-invariant, and sometimes translation-invariant as well. In many cases they are Hamiltonian, but the discussion of this aspect is postponed to a subsequent paper. We consider “few-body problems” (with, say, N = 1, 2, 3, 4, 6, 8, 12, 16, ...) as well as “many-body problems” (N an arbitrary positive integer). The main focus of this paper is on various techniques to uncover such N-body problems. We do not discuss the detailed behavior of the solutions of all these problems, but we do identify several models whose motions are completely periodic or multiply periodic, and we exhibit in rather explicit form the solutions in some cases.
