Similar articles
Found 20 similar articles (search time: 31 ms)
1.
The approach used to calculate the two‐electron integrals in many electronic structure packages, including the Generalized Atomic and Molecular Electronic Structure System‐UK (GAMESS‐UK), was designed for CPU‐based compute units. We redesigned the two‐electron compute algorithm for acceleration on a graphical processing unit (GPU). We report the acceleration strategy and illustrate it on the (ss|ss) type integrals. This strategy is general for Fortran‐based codes and uses the Accelerator compiler from Portland Group International and GPU‐based accelerators from Nvidia. The evaluation of (ss|ss) type integrals within calculations using Hartree Fock ab initio methods and density functional theory is accelerated by single and quad GPU hardware systems by factors of 43 and 153, respectively. The overall speedup for a single self‐consistent field cycle is at least a factor of eight on a single GPU compared with a single CPU. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011

2.
A new parallel algorithm and its implementation for the RI‐MP2 energy calculation utilizing peta‐flop‐class many‐core supercomputers are presented. Several improvements over the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been made: (1) a dual‐level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi‐node and multi‐GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. A peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi‐node and multi‐GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.

3.
During the past few years, graphics processing units (GPUs) have become extremely popular in the high performance computing community. In this study, we present an implementation of an acceleration engine for the solvent–solvent interaction evaluation of molecular dynamics simulations. With careful optimization of the algorithm, speed‐ups of up to a factor of 54 (single‐precision GPU vs. double‐precision CPU) were achieved. The accuracy of the single‐precision GPU implementation is carefully investigated and does not influence structural, thermodynamic, and dynamic quantities. Therefore, the implementation enables users of the GROMOS software for biomolecular simulation to run the solvent–solvent interaction evaluation on a GPU, and thus to speed up their simulations by a factor of 6–9. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010

4.
We present a way to improve the performance of the electronic structure Vienna Ab initio Simulation Package (VASP) program. We show that high-performance computers equipped with graphics processing units (GPUs) as accelerators may drastically reduce the computation time when the time-consuming sections of the code are offloaded to the graphics chips. The procedure consists of (i) profiling the performance of the code to isolate the time-consuming parts, (ii) rewriting these so that the algorithms become better suited to the chosen graphics accelerator, and (iii) optimizing memory traffic between the host computer and the GPU accelerator. We chose to accelerate VASP with NVIDIA GPUs using CUDA. We compare the GPU and original versions of VASP by evaluating the Davidson and RMM-DIIS algorithms on chemical systems of up to 1100 atoms. In these tests, the total time is reduced by a factor between 3 and 8 when running on n (CPU core + GPU) compared to n CPU cores only, without any accuracy loss. © 2012 Wiley Periodicals, Inc.

5.
We investigated the performance of heterogeneous computing with graphics processing units (GPUs) and many integrated core (MIC) coprocessors combined with 20 CPU cores (20×CPU). As a practical example toward large scale electronic structure calculations using grid‐based methods, we evaluated the Hartree potentials of silver nanoparticles with various sizes (3.1, 3.7, 4.9, 6.1, and 6.9 nm) via a direct integral method supported by the sinc basis set. The so‐called work stealing scheduler was used for efficient heterogeneous computing via the balanced dynamic distribution of workloads between all processors on a given architecture without any prior information on their individual performances. 20×CPU + 1GPU was up to ~1.5 and ~3.1 times faster than 1GPU and 20×CPU, respectively. 20×CPU + 2GPU was ~4.3 times faster than 20×CPU. The performance enhancement by CPU + MIC was considerably lower than expected because of the large initialization overhead of MIC, although its theoretical performance is similar to that of CPU + GPU. © 2016 Wiley Periodicals, Inc.
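
The work stealing idea mentioned in this abstract can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions, not the scheduler used in the paper: the function name `simulate_work_stealing`, the "steal from the fullest queue" policy, and the unit-sized tasks are all hypothetical choices made for illustration.

```python
from collections import deque
import heapq


def simulate_work_stealing(n_tasks, speeds):
    """Simulate work stealing among workers with different speeds.

    Tasks are unit-sized and are split evenly up front, so no knowledge
    of worker speed is used. Returns (makespan, tasks_done_per_worker).
    """
    n = len(speeds)
    queues = [deque() for _ in range(n)]
    for task in range(n_tasks):
        queues[task % n].append(task)

    done = [0] * n
    # Event queue of (time at which the worker becomes idle, worker id).
    events = [(0.0, w) for w in range(n)]
    heapq.heapify(events)
    makespan = 0.0

    while events:
        now, w = heapq.heappop(events)
        if queues[w]:
            queues[w].popleft()          # take work from the own queue
        else:
            # Assumed policy: steal from the back of the fullest queue.
            victim = max(range(n), key=lambda v: len(queues[v]))
            if not queues[victim]:
                continue                 # no work left anywhere: retire
            queues[victim].pop()
        done[w] += 1
        finish = now + 1.0 / speeds[w]   # time needed for one unit task
        makespan = max(makespan, finish)
        heapq.heappush(events, (finish, w))

    return makespan, done
```

With `speeds=[1.0, 4.0]` and 100 tasks, a static even split would take 50 time units because the slow worker holds half the work; in the simulation the fast worker steals most of the slow worker's backlog and the makespan drops well below that.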

6.
We present the first graphical processing unit (GPU) coprocessor‐enabled version of the Order‐N Electronic Total Energy Package (ONETEP) code for linear‐scaling first principles quantum mechanical calculations on materials. This work focuses on porting to the GPU the parts of the code that involve atom‐localized fast Fourier transform (FFT) operations. These are among the most computationally intensive parts of the code and are used in core algorithms such as the calculation of the charge density, the local potential integrals, the kinetic energy integrals, and the nonorthogonal generalized Wannier function gradient. We have found that direct porting of the isolated FFT operations did not provide any benefit. Instead, it was necessary to tailor the port to each of the aforementioned algorithms to optimize data transfer to and from the GPU. A detailed discussion of the methods used and tests of the resulting performance are presented, which show that individual steps in the relevant algorithms are accelerated by a significant amount. However, the transfer of data between the GPU and host machine is a significant bottleneck in the reported version of the code. In addition, an initial investigation into a dynamic precision scheme for the ONETEP energy calculation has been performed to take advantage of the enhanced single precision capabilities of GPUs. The methods used here result in no disruption to the existing code base. Furthermore, as the developments reported here concern the core algorithms, they will benefit the full range of ONETEP functionality. Our use of a directive‐based programming model ensures portability to other forms of coprocessors and will allow this work to form the basis of future developments to the code designed to support emerging high‐performance computing platforms. Copyright © 2013 Wiley Periodicals, Inc.

7.
We introduce a complete implementation of a viscoelastic model for numerical simulations of the phase separation kinetics in dynamically asymmetric systems such as polymer blends and polymer solutions on a graphics processing unit (GPU) using CUDA, and discuss the algorithms and optimizations in detail. From studies of a polymer solution, we show that the GPU-based implementation correctly reproduces the accepted results and provides a speedup of about 190 times over a single central processing unit (CPU). Further accuracy analysis demonstrates that both the single and the double precision calculations on the GPU are sufficient to produce high-quality results in numerical simulations of the viscoelastic model. Therefore, the GPU-based viscoelastic model is very promising for studying many phase separation processes of experimental and theoretical interest that often take place on large length and time scales and are not easily addressed by a conventional implementation running on a single CPU.

8.
Space charge effects play important roles in the performance of various types of mass analyzers. Simulation of space charge effects is often limited by the available computing capability. In this study, we evaluate the use of a graphics processing unit (GPU) to accelerate ion trajectory simulation. Simulation using a GPU has been compared with simulation on a multi-core central processing unit (CPU), and an acceleration of about 390 times has been obtained using a single computer for simulation of up to 10⁵ ions in quadrupole ion traps. Characteristics of trapped ions can be investigated at detailed levels within a reasonable simulation time. Space charge effects on the trapping capacities of linear and 3D ion traps, ion cloud shapes, ion motion frequency shift, and mass spectrum peak coalescence effects between two ion clouds of close m/z are studied with the ion trajectory simulation using a GPU.

9.
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well‐exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)‐based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off‐loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance‐to‐price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer‐class GPUs this improvement is equally reflected in the performance‐to‐price ratio. Although memory issues in consumer‐class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost‐efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well‐balanced ratio of CPU and consumer‐class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
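
The cost argument in this abstract (hardware price plus lifetime energy and cooling) can be made concrete with a short calculation. The sketch below is an illustration only: the function names, the `cooling_factor` parameter, and every number in the usage note are hypothetical and are not taken from the paper.

```python
HOURS_PER_YEAR = 365 * 24


def lifetime_cost(hardware_price, power_watts, years, price_per_kwh,
                  cooling_factor=1.0):
    """Hardware price plus energy cost over the node's lifetime.

    cooling_factor >= 1 scales the energy bill to account for cooling
    (an assumed, simplified model).
    """
    energy_kwh = power_watts / 1000.0 * HOURS_PER_YEAR * years
    return hardware_price + energy_kwh * price_per_kwh * cooling_factor


def trajectory_per_cost(ns_per_day, hardware_price, power_watts, years,
                        price_per_kwh, cooling_factor=1.0):
    """Nanoseconds of trajectory produced per unit of total lifetime cost."""
    total_ns = ns_per_day * 365 * years
    total_cost = lifetime_cost(hardware_price, power_watts, years,
                               price_per_kwh, cooling_factor)
    return total_ns / total_cost
```

For a hypothetical node that costs 1000, draws 500 W, and pays 0.2 per kWh, one year of operation already adds about 876 in energy cost, which shows how power and cooling can overtake the hardware price within a few years.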

10.
We identify hardware that is optimal for producing molecular dynamics (MD) trajectories on Linux compute clusters with the GROMACS 2018 simulation package. To this end, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the costs of the nodes, which may include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance to price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU to GPU processing power balance has shifted even more toward the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance to price ratio than nodes optimized for older GROMACS versions. Moreover, the shift toward GPU processing makes it possible to cheaply upgrade old nodes with recent GPUs, yielding essentially the same performance as comparable brand-new hardware. © 2019 Wiley Periodicals, Inc.

11.
In this work, we present a tentative step toward the efficient implementation of polarizable molecular mechanics force fields with GPU acceleration. The computational bottleneck of such applications is found in the treatment of electrostatics, where higher-order multipoles and a self-consistent treatment of polarization effects are needed. We have implemented a GPU accelerated code, based on the Tinker program suite, for the computation of induced dipoles. The largest test system used shows a speedup factor of over 20 for a single precision GPU implementation compared with the serial CPU version. A discussion of the optimization and parametrization steps is included. A comparison between different graphics cards and CPU-GPU embedding is also given. The current work demonstrates the potential usefulness of GPU programming in accelerating this field of applications.

12.
We describe a complete implementation of all‐atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2009

13.
We have accelerated an ab initio quantum Monte Carlo electronic structure calculation using general purpose computing on graphical processing units (GPGPU). The part of the code causing the bottleneck for extended systems is replaced by Compute Unified Device Architecture‐GPGPU subroutine kernels which build up spline basis set expansions of electronic orbital functions at each Monte Carlo step. We have achieved a speedup of a factor of 30 for the bottleneck for a simulation of solid TiO2 with 1536 electrons. To improve the performance with GPGPU we propose a new updating scheme for Monte Carlo sampling, quasi‐simultaneous updating, which is intermediate between configuration‐by‐configuration updating and the widely used particle‐by‐particle updating. The error in the energy due to the single precision treatment and the new updating scheme is found to be within the required accuracy of ~10⁻³ hartree per primitive cell. © 2012 Wiley Periodicals, Inc.

14.
The computation of electron repulsion integrals (ERIs) is the most time‐consuming process in density functional calculations using a Gaussian basis set. Many temporary ERIs are calculated, and most are stored on slower storage, such as cache or memory, because of the shortage of registers, which are the fastest storage in a central processing unit (CPU). Moreover, the heavy register usage makes it difficult to launch many concurrent threads on a graphics processing unit (GPU) to hide latency. Hence, we propose to optimize the calculation order of one‐center ERIs to minimize the number of registers used, and to calculate each ERI with three or six co‐operating threads. The performance of this method is measured on a recent CPU and a GPU. The proposed approach is found to be efficient for high angular basis functions with a GPU. When combined with a recent GPU, it accelerates the computation almost 4‐fold. © 2014 Wiley Periodicals, Inc.

15.
Excited-state calculations are implemented in a development version of the GPU-based TeraChem software package using the configuration interaction singles (CIS) and adiabatic linear response Tamm-Dancoff time-dependent density functional theory (TDA-TDDFT) methods. The speedup of the CIS and TDDFT methods using GPU-based electron repulsion integrals and density functional quadrature integration allows full ab initio excited-state calculations on molecules of unprecedented size. CIS/6-31G and TD-BLYP/6-31G benchmark timings are presented for a range of systems, including four generations of oligothiophene dendrimers, photoactive yellow protein (PYP), and the PYP chromophore solvated with 900 quantum mechanical water molecules. The effects of double and single precision integration are discussed, and mixed precision GPU integration is shown to give extremely good numerical accuracy for both CIS and TDDFT excitation energies (excitation energies within 0.0005 eV of extended double precision CPU results).

16.
A new hardware‐agnostic contraction algorithm for tensors of arbitrary symmetry and sparsity is presented. The algorithm is implemented as a stand‐alone open‐source code, libxm. This code is also integrated with the general tensor library libtensor and with the Q‐Chem quantum‐chemistry package. An overview of the algorithm, its implementation, and benchmarks are presented. Similarly to other tensor software, the algorithm exploits efficient matrix multiplication libraries and assumes that tensors are stored in a block‐tensor form. The distinguishing features of the algorithm are: (i) efficient repackaging of the individual blocks into large matrices and back, which affords efficient graphics processing unit (GPU)‐enabled calculations without modifications of higher‐level codes; (ii) fully asynchronous data transfer between disk storage and fast memory. The algorithm enables canonical all‐electron coupled‐cluster and equation‐of‐motion coupled‐cluster calculations with single and double substitutions (CCSD and EOM‐CCSD) with over 1000 basis functions on a single quad‐GPU machine. We show that the algorithm exhibits predicted theoretical scaling for canonical CCSD calculations, O(N⁶), irrespective of the data size on disk. © 2017 Wiley Periodicals, Inc.

17.
We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
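
The difference between the SPSP and SPDP precision models described in this abstract can be seen in a toy accumulation. The sketch below is not AMBER code: it only emulates single precision in pure Python by rounding through `struct`, and the function names `to_f32`, `sum_spsp`, and `sum_spdp` are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_spsp(values):
    """Single-precision terms accumulated in a single-precision sum."""
    acc = 0.0
    for v in values:
        # Round after every addition to emulate a float32 accumulator.
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_spdp(values):
    """Single-precision terms accumulated in a double-precision sum."""
    acc = 0.0
    for v in values:
        acc += to_f32(v)  # terms are float32, the accumulator is a double
    return acc
```

Summing 100,000 copies of 0.1 with both functions shows the effect: the SPDP result stays close to 10,000, limited mainly by the rounding of each term, while the SPSP result drifts visibly because every addition is rounded again once the accumulator has grown large.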

18.
This article describes an extension of the quantum supercharger library (QSL) to perform quantum mechanical (QM) gradient and optimization calculations as well as hybrid QM and molecular mechanical (QM/MM) molecular dynamics simulations. The integral derivatives are, after the two‐electron integrals, the most computationally expensive part of the aforementioned calculations/simulations. Algorithms are presented for accelerating the one‐ and two‐electron integral derivatives on a graphical processing unit (GPU). It is shown that a Hartree–Fock ab initio gradient calculation is up to 9.3× faster on a single GPU compared with a single central processing unit running an optimized serial version of GAMESS‐UK, which uses the efficient Schlegel method for ‐ and ‐orbitals. Benchmark QM and QM/MM molecular dynamics simulations are performed on cellobiose in vacuo and in a 39 Å water sphere (45 QM atoms and 24843 point charges, respectively) using the 6‐31G basis set. The QSL can perform 9.7 ps/day of ab initio QM dynamics and 6.4 ps/day of QM/MM dynamics on a single GPU in full double precision. © 2015 Wiley Periodicals, Inc.

19.
A series of well‐defined double hydrophilic graft copolymers containing a poly(poly(ethylene glycol) methyl ether acrylate) (PPEGMEA) backbone and poly(2‐vinylpyridine) (P2VP) side chains were synthesized by successive single electron transfer living radical polymerization (SET‐LRP) and atom transfer radical polymerization (ATRP). The backbone was first prepared by SET‐LRP of poly(ethylene glycol) methyl ether acrylate (PEGMEA) macromonomer using CuBr/tris(2‐(dimethylamino)ethyl)amine as the catalytic system. The obtained homopolymer was then reacted with lithium diisopropylamide and 2‐chloropropionyl chloride at −78 °C to afford the PPEGMEA‐Cl macroinitiator. Poly(poly(ethylene glycol) methyl ether acrylate)‐g‐poly(2‐vinylpyridine) double hydrophilic graft copolymers were finally synthesized by ATRP of 2‐vinylpyridine initiated by the PPEGMEA‐Cl macroinitiator at 25 °C using CuCl/hexamethyldiethylenetriamine as the catalytic system via the grafting‐from strategy. The molecular weights of both the backbone and the side chains were controllable and the molecular weight distributions remained relatively narrow (Mw/Mn ≤ 1.40). pH‐Responsive micellization behavior was investigated by ¹H NMR, dynamic light scattering, and transmission electron microscopy; this kind of double hydrophilic graft copolymer aggregated to form micelles with a P2VP core when the pH of the aqueous solution was above 5.0. © 2011 Wiley Periodicals, Inc. J Polym Sci Part A: Polym Chem, 2011

20.
The influence of the total number of cores, the number of cores dedicated to Particle mesh Ewald (PME) calculation, and the choice of single vs. double precision on the performance of molecular dynamics (MD) simulations of systems ranging from 70,000 to 1.7 million atoms was analyzed on three different high‐performance computing facilities employing GROMACS 4 by running about 6000 benchmark simulations. Small and medium sized systems scaled linearly up to 64 and 128 cores, respectively. Systems with half a million to 1.2 million atoms scaled linearly up to 256 cores. The best performance was achieved by dedicating 25% of the total number of cores to PME calculation. Double precision calculations lowered the performance by 30–50%. A database for collecting information about MD simulations and the achieved performance was created; it is freely available online and allows fast estimation of the performance that can be expected in similar environments. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2011
