Similar Documents
20 similar documents found.
1.
We accelerated an ab initio molecular QMC calculation by using GPGPU. Only the bottleneck part of the calculation is replaced by a CUDA subroutine and performed on the GPU. The performance of a single CPU core plus GPU is compared with that of a single CPU core in double precision, giving 23.6 (11.0) times faster calculations for single (double) precision treatments on the GPU. The energy deviation caused by the single-precision treatment was found to be within the accuracy required in the calculation, ~10^-5 hartree. The accelerated GPU-equipped computational nodes are combined to form a hybrid MPI cluster, on which we confirmed that the performance scales linearly with the number of nodes. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011
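A minimal sketch of the kind of precision check quoted above, assuming a simple array of per-walker local energies (all names and numbers are illustrative, not the paper's code): the bottleneck sum is done on the GPU in single precision and compared against a double-precision CPU reference.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Block-wise single-precision reduction of per-walker local energies.
__global__ void sum_energies_sp(const float* e, int n, float* out) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        acc += e[i];
    buf[tid] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, buf[0]);
}

int main() {
    const int n = 1 << 20;                                       // number of walkers
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = -0.5f + 1e-6f * (i % 7);  // fake local energies

    float *d_e, *d_sum, gpu_sum = 0.0f;
    cudaMalloc((void**)&d_e, n * sizeof(float));
    cudaMalloc((void**)&d_sum, sizeof(float));
    cudaMemcpy(d_e, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, &gpu_sum, sizeof(float), cudaMemcpyHostToDevice);
    sum_energies_sp<<<128, 256>>>(d_e, n, d_sum);
    cudaMemcpy(&gpu_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);

    double cpu_sum = 0.0;                                        // double-precision reference
    for (float x : h) cpu_sum += x;

    std::printf("SP-GPU mean = %.8f, DP-CPU mean = %.8f, deviation = %.2e hartree\n",
                gpu_sum / n, cpu_sum / n, std::fabs(gpu_sum / n - cpu_sum / n));
    cudaFree(d_e); cudaFree(d_sum);
    return 0;
}
```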

2.
We present here a set of algorithms that completely rewrites the Hartree–Fock (HF) computations common to many legacy electronic structure packages (such as GAMESS-US, GAMESS-UK, and NWChem) into a massively parallel compute scheme that takes advantage of hardware accelerators such as graphics processing units (GPUs). The HF compute algorithm is the core of a library of routines that we name the Quantum Supercharger Library (QSL). We briefly evaluate the QSL's performance and report that it accelerates an HF 6-31G Self-Consistent Field (SCF) computation by up to 20 times for medium-sized molecules (such as a buckyball) when compared with mature Central Processing Unit algorithms available in the legacy codes in regular use by researchers. It achieves this acceleration by massive parallelization of the one- and two-electron integrals and optimization of the SCF and Direct Inversion in the Iterative Subspace routines through the use of GPU linear algebra libraries. © 2015 Wiley Periodicals, Inc.
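The abstract mentions offloading the SCF/DIIS linear algebra to GPU libraries. Below is a hedged sketch (not QSL's actual code) of what such an offload can look like: building the DIIS error matrix FDS − SDF with cuBLAS level-3 calls. The function name and buffer layout are illustrative, and F, D, S are assumed to be symmetric N × N matrices already resident on the device in column-major order.

```cuda
#include <cublas_v2.h>

// Call from host code with a valid cuBLAS handle and device buffers of size N*N.
void diis_error_matrix(cublasHandle_t h, int N,
                       const double* d_F, const double* d_D, const double* d_S,
                       double* d_tmp, double* d_fds, double* d_err) {
    const double one = 1.0, zero = 0.0, minus_one = -1.0;
    // tmp = F * D
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &one, d_F, N, d_D, N, &zero, d_tmp, N);
    // fds = tmp * S = F * D * S
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &one, d_tmp, N, d_S, N, &zero, d_fds, N);
    // err = fds - fds^T  (equals FDS - SDF because F, D, S are symmetric)
    cublasDgeam(h, CUBLAS_OP_N, CUBLAS_OP_T, N, N,
                &one, d_fds, N, &minus_one, d_fds, N, d_err, N);
}
```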

3.
The computation of electron repulsion integrals (ERIs) is the most time-consuming step in density functional calculations using Gaussian basis sets. Many temporary ERIs are calculated, and most are stored in slower storage, such as cache or memory, because of the shortage of registers, which are the fastest storage in a central processing unit (CPU). Moreover, the heavy register usage makes it difficult to launch many concurrent threads on a graphics processing unit (GPU) to hide latency. Hence, we propose to optimize the calculation order of one-center ERIs so as to minimize the number of registers used, and to calculate each ERI with three or six co-operating threads. The performance of this method is measured on a recent CPU and a GPU. The proposed approach is found to be efficient for high-angular-momentum basis functions on a GPU. When combined with a recent GPU, it accelerates the computation almost 4-fold. © 2014 Wiley Periodicals, Inc.
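A hedged sketch (not the paper's algorithm) of the co-operating-thread idea: three consecutive threads share one integral, each evaluates one partial term with few registers, and the partials are combined through shared memory. The term() function is a placeholder; the kernel assumes blockDim.x = 192 so that no triplet straddles a block boundary.

```cuda
#include <cuda_runtime.h>

__device__ float term(int eri_id, int part) {
    return 0.1f * (eri_id + 1) * (part + 1);   // stand-in for one partial ERI term
}

// Launch with blockDim.x = 192 (a multiple of 3) and enough blocks for 3*n_eri threads.
__global__ void eri_three_threads(float* out, int n_eri) {
    __shared__ float partials[192];
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int eri  = tid / 3;                        // integral shared by this triplet
    int part = tid % 3;                        // which partial this thread owns
    float p = (eri < n_eri) ? term(eri, part) : 0.0f;
    partials[threadIdx.x] = p;
    __syncthreads();
    if (part == 0 && eri < n_eri)
        out[eri] = partials[threadIdx.x] + partials[threadIdx.x + 1]
                 + partials[threadIdx.x + 2];
}
```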

4.
Several efficient algorithms are developed for the accurate and fast calculation of the molecular incomplete gamma function Fm(z) with a complex argument z. The complex incomplete gamma function arises in molecular integrals over gauge-including atomic orbitals. Two kinds of algorithms are recommended: (1) a high-precision version and (2) a fast version. The high-precision version guarantees 15 significant figures (10^-15 in the relative error) and the fast version guarantees 12 significant figures (10^-12 in the relative error), at worst, within double-precision arithmetic. The fast version is about 5-20 times faster than the high-precision version. For most molecular calculations, the fast version will give satisfactory precision.
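For context, the molecular incomplete gamma (Boys) function referred to above is conventionally defined as shown below; the complex-argument case simply takes z off the real axis, and the standard upward recursion relates successive orders. This is the textbook definition, not necessarily the article's working form.

$$
F_m(z) \;=\; \int_0^1 t^{2m}\, e^{-z t^2}\, \mathrm{d}t,
\qquad
F_{m+1}(z) \;=\; \frac{(2m+1)\,F_m(z) - e^{-z}}{2z}.
$$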

5.
We report porting of the Divide-Expand-Consolidate Resolution-of-the-Identity second-order Møller–Plesset perturbation (DEC-RI-MP2) method to graphics processing units (GPUs) using OpenACC compiler directives. It is shown that the OpenACC implementation efficiently accelerates the rate-determining step of the DEC-RI-MP2 method with minor implementation effort. Moreover, the GPU acceleration results in a better load balance and thus in an overall scaling improvement of the DEC algorithm. The resulting cross-platform hybrid MPI/OpenMP/OpenACC implementation has scalable and portable performance on heterogeneous HPC architectures. The GPU-enabled code was benchmarked using a reduced version of the S12L test set of Stefan Grimme (Grimme, Chem. Eur. J. 2012, 18, 9955), consisting of supramolecular complexes of up to 158 atoms and 4292 contracted basis functions (cc-pVTZ). The test set results demonstrate the general applicability of the DEC-RI-MP2 method, showing results consistent with the DEC-RI-MP2 introductory paper (Baudin et al., J. Chem. Phys. 2016, 144, 054102) for molecules with complicated electronic structure. © 2016 Wiley Periodicals, Inc.

6.
In this paper, the SHARK integral generation and digestion engine is described. In essence, SHARK is based on a reformulation of the popular McMurchie/Davidson approach to molecular integrals. This reformulation leads to an efficient algorithm that is driven by BLAS level 3 operations. The algorithm is particularly efficient for high angular momentum basis functions (up to L = 7 is available by default, but the algorithm is programmed for arbitrary angular momenta). SHARK features a significant number of specific programming constructs that are designed to greatly simplify the workflow in quantum chemical program development and to avoid undesirable code duplication to the largest possible extent. SHARK can handle segmented, generally, and partially generally contracted basis sets. It can be used to generate a host of one- and two-electron integrals over various kernels, including two-, three-, and four-index repulsion integrals, integrals over Gauge Including Atomic Orbitals (GIAOs), relativistic integrals, and integrals featuring a finite nucleus model. SHARK provides routines to evaluate Fock-like matrices, generate integral transformations, and perform related tasks. SHARK is the essential engine inside the ORCA package that drives essentially all tasks related to integrals over basis functions in version ORCA 5.0 and higher. Since the core of SHARK is based on low-level Basic Linear Algebra Subprograms (BLAS) operations, it is expected to perform well not only on present-day but also on future hardware, provided that the hardware manufacturer supplies a properly optimized BLAS library for matrix and vector operations. Representative timings and comparisons to the Libint library used by ORCA are reported for Intel i9 and Apple M1 Max processors.
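As background for the reformulation mentioned above, the McMurchie–Davidson scheme expands Cartesian Gaussian pairs in Hermite Gaussians, so a two-electron repulsion integral becomes a double sum over Hermite Coulomb integrals; grouping shell pairs then turns these sums into matrix–matrix products, which is what makes a BLAS level 3 (GEMM-driven) formulation possible. The notation below is the standard textbook form, not SHARK's internal one.

$$
(ab\,|\,cd) \;=\; \frac{2\pi^{5/2}}{pq\sqrt{p+q}}
\sum_{tuv} E^{ab}_{tuv} \sum_{\tau\nu\phi} (-1)^{\tau+\nu+\phi}\, E^{cd}_{\tau\nu\phi}\;
R_{t+\tau,\,u+\nu,\,v+\phi}\!\left(\alpha,\,\mathbf{R}_{PQ}\right),
\qquad \alpha = \frac{pq}{p+q},
$$

where the E are Hermite expansion coefficients of the bra and ket charge distributions and the R are Hermite Coulomb integrals built from Boys functions.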

7.
We present a way to improve the performance of the electronic structure program VASP (Vienna Ab initio Simulation Package). We show that high-performance computers equipped with graphics processing units (GPUs) as accelerators may drastically reduce the computation time when the time-consuming sections are offloaded to the graphics chips. The procedure consists of (i) profiling the performance of the code to isolate the time-consuming parts, (ii) rewriting these so that the algorithms become better suited to the chosen graphics accelerator, and (iii) optimizing memory traffic between the host computer and the GPU accelerator. We chose to accelerate VASP with NVIDIA GPUs using CUDA. We compare the GPU and original versions of VASP by evaluating the Davidson and RMM-DIIS algorithms on chemical systems of up to 1100 atoms. In these tests, the total time is reduced by a factor of 3 to 8 when running on n (CPU core + GPU) units compared with n CPU cores only, without any loss of accuracy. © 2012 Wiley Periodicals, Inc.
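A hedged illustration of step (iii), minimizing host–GPU memory traffic: keep the working data resident on the device across iterations and use pinned host memory with asynchronous copies on a stream. The kernel and buffer names are placeholders, not VASP code.

```cuda
#include <cuda_runtime.h>

__global__ void apply_preconditioner(float* psi, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) psi[i] *= 0.5f;                       // placeholder for real work
}

int main() {
    const int n = 1 << 22;
    float *h_psi, *d_psi;
    cudaMallocHost((void**)&h_psi, n * sizeof(float)); // pinned host buffer -> fast DMA
    cudaMalloc((void**)&d_psi, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_psi[i] = 1.0f;

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Upload once, then iterate on the device without a round trip per step.
    cudaMemcpyAsync(d_psi, h_psi, n * sizeof(float), cudaMemcpyHostToDevice, s);
    for (int iter = 0; iter < 10; ++iter)
        apply_preconditioner<<<(n + 255) / 256, 256, 0, s>>>(d_psi, n);
    cudaMemcpyAsync(h_psi, d_psi, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_psi);
    cudaFreeHost(h_psi);
    return 0;
}
```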

8.
In this work, we present a tentative step toward the efficient implementation of polarizable molecular mechanics force fields with GPU acceleration. The computational bottleneck of such applications is the treatment of electrostatics, where higher-order multipoles and a self-consistent treatment of polarization effects are needed. We have implemented a GPU-accelerated code, based on the Tinker program suite, for the computation of induced dipoles. The largest test system shows a speedup factor of over 20 for a single-precision GPU implementation compared with the serial CPU version. A discussion of the optimization and parametrization steps is included. A comparison between different graphics cards and of CPU–GPU embedding is also given. The current work demonstrates the potential usefulness of GPU programming in accelerating this field of applications.
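A minimal sketch of an induced-dipole step of the kind described above, assuming isotropic polarizabilities, fixed point charges, no damping, and no periodic boundary conditions (none of this is Tinker's actual code): one thread per polarizable site computes the field by brute force and sets mu_i = alpha_i * E_i.

```cuda
#include <cuda_runtime.h>

// One Jacobi-style induced-dipole update over n sites with an O(N^2) field sum.
__global__ void induce_step(const float3* pos, const float* q, const float* alpha,
                            float3* mu, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 ri = pos[i];
    float ex = 0.f, ey = 0.f, ez = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = ri.x - pos[j].x, dy = ri.y - pos[j].y, dz = ri.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float inv_r3 = rsqrtf(r2) / r2;            // 1 / r^3
        ex += q[j] * dx * inv_r3;                  // field of charge j at site i
        ey += q[j] * dy * inv_r3;
        ez += q[j] * dz * inv_r3;
    }
    mu[i] = make_float3(alpha[i] * ex, alpha[i] * ey, alpha[i] * ez);
}
```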

9.
The NCI approach is a modern tool to reveal noncovalent interactions in chemistry. It is particularly attractive for describing ligand–protein binding. A custom implementation of NCI using the promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three code versions is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version gives a 39-fold improvement for the biggest instance compared with the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
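To make the grid parallelism concrete, here is a hedged sketch of an NCI-style kernel (not the published implementation): one thread per grid point builds a promolecular density from single-exponential atomic densities ρ_A(r) = c_A exp(−|r − R_A|/ζ_A) and evaluates the reduced density gradient s = |∇ρ| / [2(3π²)^{1/3} ρ^{4/3}]. The atomic parameters c and z are assumed inputs.

```cuda
#include <cuda_runtime.h>

__global__ void nci_rdg(const float3* grid, int npts,
                        const float3* atoms, const float* c, const float* z,
                        int natoms, float* s_out) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= npts) return;
    float rho = 0.f, gx = 0.f, gy = 0.f, gz = 0.f;
    for (int a = 0; a < natoms; ++a) {
        float dx = grid[g].x - atoms[a].x;
        float dy = grid[g].y - atoms[a].y;
        float dz = grid[g].z - atoms[a].z;
        float r  = sqrtf(dx * dx + dy * dy + dz * dz) + 1e-12f;
        float rho_a = c[a] * expf(-r / z[a]);       // exponential atomic density
        rho += rho_a;
        float d = -rho_a / (z[a] * r);              // chain-rule factor of grad(rho_a)
        gx += d * dx; gy += d * dy; gz += d * dz;
    }
    float grad = sqrtf(gx * gx + gy * gy + gz * gz);
    const float k = 2.0f * powf(3.0f * 3.14159265f * 3.14159265f, 1.0f / 3.0f);
    s_out[g] = grad / (k * powf(rho, 4.0f / 3.0f)); // reduced density gradient
}
```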

10.
Three different algorithms for the calculation of many-center electron-repulsion integrals are discussed, all of which are considered to be economic in terms of the number of arithmetic operations. The common features of the algorithms are as follows: Cartesian Gaussian functions are used, integrals are calculated in blocks (a block being defined as the set of integrals obtainable from four given exponents on four given centers), and functions may be adapted to R(3). Adaptation to molecular point-group symmetry is not considered. Tables are given showing the minimum number of operations for a selection of block types, allowing one to identify the theoretically most economic algorithm and its salient features. Comments concerning the computer implementations on both scalar and vector processors are also given. In particular, the Cyber 205 is considered, a vector processor on which we have implemented what we believe to be the most efficient algorithm.

11.
Molecular integral formulas and corresponding computational algorithms are developed for the relativistic spin-orbit and core potential operators that are obtained from atomic relativistic calculations by means of the effective core potential procedure. Much use is made of earlier work on core potential integrals by McMurchie and Davidson. The resulting computer code has been made part of the ARGOS (Argonne, Ohio State) program from the COLUMBUS suite of programs, which computes the needed integrals over symmetry-adapted combinations of generally contracted Gaussian atomic orbitals.

12.
The approach used to calculate the two-electron integrals in many electronic structure packages, including the Generalized Atomic and Molecular Electronic Structure System-UK (GAMESS-UK), was designed for CPU-based compute units. We redesigned the two-electron integral algorithm for acceleration on a graphics processing unit (GPU). We report the acceleration strategy and illustrate it on the (ss|ss)-type integrals. The strategy is general for Fortran-based codes and uses the Accelerator compiler from Portland Group International and GPU-based accelerators from Nvidia. The evaluation of (ss|ss)-type integrals within Hartree–Fock ab initio and density functional theory calculations is accelerated by single- and quad-GPU hardware systems by factors of 43 and 153, respectively. The overall speedup for a single self-consistent-field cycle is at least a factor of eight on a single GPU compared with a single CPU. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011
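For reference, the primitive (ss|ss) integral has a simple closed form, which is what makes it a natural first target for GPU acceleration. The sketch below is written in CUDA rather than the Fortran/PGI-Accelerator approach of the paper, uses illustrative type and function names, and assigns one thread per quartet of unnormalized s-type primitives using the textbook expression (ss|ss) = 2π^{5/2}/(pq√(p+q)) · K_AB · K_CD · F_0(T).

```cuda
#include <cuda_runtime.h>

#define PI_D 3.14159265358979323846

// Boys function of order zero: F0(T) = 0.5*sqrt(pi/T)*erf(sqrt(T)), with F0(0) = 1.
__device__ double boys_f0(double t) {
    return (t < 1e-12) ? 1.0 : 0.5 * sqrt(PI_D / t) * erf(sqrt(t));
}

struct Prim { double e; double3 r; };            // primitive exponent and center

__device__ double dist2(double3 u, double3 v) {
    double dx = u.x - v.x, dy = u.y - v.y, dz = u.z - v.z;
    return dx * dx + dy * dy + dz * dz;
}

__device__ double3 product_center(double ea, double3 A, double eb, double3 B) {
    double p = ea + eb;                          // Gaussian product theorem
    return make_double3((ea * A.x + eb * B.x) / p,
                        (ea * A.y + eb * B.y) / p,
                        (ea * A.z + eb * B.z) / p);
}

// One thread per quartet of s-type primitives; launch with enough threads for n.
__global__ void ssss(const Prim* a, const Prim* b, const Prim* c, const Prim* d,
                     double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double p = a[i].e + b[i].e, q = c[i].e + d[i].e;
    double3 P = product_center(a[i].e, a[i].r, b[i].e, b[i].r);
    double3 Q = product_center(c[i].e, c[i].r, d[i].e, d[i].r);
    double Kab = exp(-a[i].e * b[i].e / p * dist2(a[i].r, b[i].r));
    double Kcd = exp(-c[i].e * d[i].e / q * dist2(c[i].r, d[i].r));
    double T   = p * q / (p + q) * dist2(P, Q);
    out[i] = 2.0 * pow(PI_D, 2.5) / (p * q * sqrt(p + q)) * Kab * Kcd * boys_f0(T);
}
```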

13.
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing petaflop-class many-core supercomputers are presented. Several improvements over the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been made: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. A peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.
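A hedged sketch of what a dual-level MPI hierarchy with per-rank GPU binding can look like (illustrative only; the paper's actual scheme is not reproduced here): the ranks on each node form a lower-level communicator, each rank is pinned to one of its node's GPUs, and the per-node leaders form the upper-level communicator used for coarse-grained work distribution.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Lower level: one communicator per shared-memory node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Bind each rank on a node to a distinct GPU.
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 0) cudaSetDevice(local_rank % ngpu);

    // Upper level: the rank-0 process of every node joins the inter-node layer.
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* ... distribute work batches over leader_comm, fan out inside node_comm ... */

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```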

14.
The Gauss transform of Slater-type orbitals is used to express several types of molecular integrals involving these functions in terms of simple auxiliary functions. After reviewing this transform and the way it can be combined with the shift-operator technique, a master formula for overlap integrals is derived and used to obtain multipolar moments associated with fragments of two-center distributions and overlaps of derivatives of Slater functions. Moreover, it is proved that integrals involving two-center distributions and irregular harmonics placed at arbitrary points (which determine the electrostatic potential, field, and field gradient, as well as higher-order derivatives of the potential) can be expressed in terms of auxiliary functions of the same type as those appearing in the overlap. The recurrence relations and series expansions of these functions are thoroughly studied, and algorithms for their calculation are presented. The usefulness and efficiency of this procedure are tested by developing two independent codes: one for the derivatives of the overlap integrals with respect to the centers of the functions, and another for derivatives of the potential (electrostatic field, field gradient, and so forth) at arbitrary points. © 2007 Wiley Periodicals, Inc. Int J Quantum Chem, 2008
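For orientation, the Gauss transform referred to above is the standard integral representation that writes a Slater exponential as a continuous superposition of Gaussians (shown here for a 1s-type factor; the article's master formula builds on this kind of representation):

$$
e^{-\zeta r} \;=\; \frac{\zeta}{2\sqrt{\pi}} \int_0^{\infty} s^{-3/2}\,
e^{-\zeta^{2}/(4s)}\; e^{-s r^{2}}\, \mathrm{d}s .
$$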

15.
We present new algorithms to improve the performance of the ENUF method (F. Hedman, A. Laaksonen, Chem. Phys. Lett. 2006, 425, 142), which is essentially Ewald summation using the non-uniform FFT (NFFT) technique. A NearDistance algorithm is developed to greatly reduce the neighbor-list size in the real-space computation. In the reciprocal-space computation, a new algorithm is developed for NFFT evaluation of the electrostatic interaction energies and forces. Both real-space and reciprocal-space computations are further accelerated by using graphics processing units (GPUs) with CUDA technology. In particular, the use of CUNFFT (NFFT based on CUDA) greatly reduces the reciprocal-space computation time. In order to reach the best performance of this method, we propose a procedure for the selection of optimal parameters with controlled accuracies. With a suitable choice of parameters, we show that our method is a good alternative to the standard Ewald method, with the same computational precision but dramatically higher computational efficiency. © 2015 Wiley Periodicals, Inc.
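As an illustration of parameter selection against a controlled accuracy target, the generic Ewald estimates below choose the splitting parameter and reciprocal-space cutoff from a requested tolerance delta and a real-space cutoff r_c. This is the textbook heuristic, not the ENUF/NFFT-specific procedure (which also fixes the NFFT oversampling and window parameters), and the numbers are illustrative.

```cuda
#include <cmath>
#include <cstdio>

int main() {
    const double delta = 1.0e-6;    // target relative accuracy
    const double r_c   = 10.0;      // real-space cutoff (Angstrom)

    // exp(-(alpha*r_c)^2) ~ delta  =>  alpha = sqrt(-ln delta) / r_c
    double alpha = std::sqrt(-std::log(delta)) / r_c;
    // exp(-k^2 / (4 alpha^2)) ~ delta  =>  k_max = 2 alpha sqrt(-ln delta)
    double k_max = 2.0 * alpha * std::sqrt(-std::log(delta));

    std::printf("alpha = %.4f 1/A, k_max = %.4f 1/A\n", alpha, k_max);
    return 0;
}
```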

16.
Excited-state calculations are implemented in a development version of the GPU-based TeraChem software package using the configuration interaction singles (CIS) and adiabatic linear-response Tamm–Dancoff time-dependent density functional theory (TDA-TDDFT) methods. The speedup of the CIS and TDDFT methods using GPU-based electron repulsion integrals and density functional quadrature integration allows full ab initio excited-state calculations on molecules of unprecedented size. CIS/6-31G and TD-BLYP/6-31G benchmark timings are presented for a range of systems, including four generations of oligothiophene dendrimers, photoactive yellow protein (PYP), and the PYP chromophore solvated with 900 quantum mechanical water molecules. The effects of double- and single-precision integration are discussed, and mixed-precision GPU integration is shown to give extremely good numerical accuracy for both CIS and TDDFT excitation energies (excitation energies within 0.0005 eV of extended double-precision CPU results).

17.
In this paper, we report our massively parallel implementation of grid techniques for the solution of the time-dependent Schrödinger equation in three spatial dimensions on the Connection Machine, which is a Single Instruction Multiple Data (SIMD) computer. Most of the operations involved in this calculation may be executed independently for each grid point. The few operations that cannot be executed independently are implemented using parallel communication algorithms. In addition, we report a simple modification of the multidimensional FFT, which provides an estimated 15% reduction in computational complexity relative to the standard 2-D FFT. It is suggested that this modification may be very well suited to hypercube communication topologies.

18.
Alchemical free energy (AFE) calculations based on molecular dynamics (MD) simulations are key tools both for improving our understanding of a wide variety of biological processes and for accelerating the design and optimization of therapeutics for numerous diseases. Computing power and theory have, however, long been insufficient to enable AFE calculations to be routinely applied in early-stage drug discovery. One of the major difficulties in performing AFE calculations is the length of time required for calculations to converge to an ensemble average. CPU implementations of MD-based free energy algorithms can effectively only reach tens of nanoseconds per day for systems on the order of 50,000 atoms, even running on massively parallel supercomputers. Therefore, converged free energy calculations on large numbers of potential lead compounds are often untenable, preventing researchers from gaining crucial insight into molecular recognition, potential druggability, and other crucial areas of interest. Graphics Processing Units (GPUs) can help address this. We present here a seamless GPU implementation, within the PMEMD module of the AMBER molecular dynamics package, of thermodynamic integration (TI) capable of reaching speeds of >140 ns/day for a 44,907-atom system, with accuracy equivalent to the existing CPU implementation in AMBER. The implementation described here is currently part of the AMBER 18 beta code and will be an integral part of the upcoming version 18 release of AMBER. © 2018 Wiley Periodicals, Inc.
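For readers unfamiliar with the method, thermodynamic integration estimates the free energy difference by averaging the λ-derivative of the Hamiltonian at a set of fixed coupling values and integrating numerically; the quadrature weights w_k depend on the chosen scheme. This is the general TI expression, not a statement about AMBER's specific λ schedule:

$$
\Delta A \;=\; \int_0^1 \left\langle \frac{\partial H(\lambda)}{\partial \lambda} \right\rangle_{\lambda}\,\mathrm{d}\lambda
\;\approx\; \sum_{k} w_k \left\langle \frac{\partial H}{\partial \lambda} \right\rangle_{\lambda_k}.
$$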

19.
20.
This article describes an extension of the quantum supercharger library (QSL) to perform quantum mechanical (QM) gradient and optimization calculations as well as hybrid QM and molecular mechanical (QM/MM) molecular dynamics simulations. The integral derivatives are, after the two-electron integrals, the most computationally expensive part of the aforementioned calculations/simulations. Algorithms are presented for accelerating the one- and two-electron integral derivatives on a graphical processing unit (GPU). It is shown that a Hartree–Fock ab initio gradient calculation is up to 9.3X faster on a single GPU compared with a single central processing unit running an optimized serial version of GAMESS-UK, which uses the efficient Schlegel method for ‐ and ‐orbitals. Benchmark QM and QM/MM molecular dynamics simulations are performed on cellobiose in vacuo and in a 39 Å water sphere (45 QM atoms and 24843 point charges, respectively) using the 6-31G basis set. The QSL can perform 9.7 ps/day of ab initio QM dynamics and 6.4 ps/day of QM/MM dynamics on a single GPU in full double precision. © 2015 Wiley Periodicals, Inc.
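To situate the integral-derivative work, the standard closed-shell Hartree–Fock gradient with respect to a nuclear coordinate X collects derivative one- and two-electron integrals, the overlap-derivative (Pulay) term with the energy-weighted density matrix W, and the nuclear repulsion term. This is the textbook expression, not QSL's specific formulation:

$$
\frac{\partial E}{\partial X} \;=\;
\sum_{\mu\nu} P_{\mu\nu}\,\frac{\partial h_{\mu\nu}}{\partial X}
\;+\; \frac{1}{2}\sum_{\mu\nu\lambda\sigma} P_{\mu\nu}P_{\lambda\sigma}
\left[\frac{\partial(\mu\nu|\lambda\sigma)}{\partial X}
- \frac{1}{2}\frac{\partial(\mu\lambda|\nu\sigma)}{\partial X}\right]
\;-\; \sum_{\mu\nu} W_{\mu\nu}\,\frac{\partial S_{\mu\nu}}{\partial X}
\;+\; \frac{\partial V_{NN}}{\partial X}.
$$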
