Similar Literature
A total of 20 similar documents were retrieved (search time: 15 ms).
1.
Using a grid-based method to search for critical points in the electron density, we show how to accelerate such a method with graphics processing units (GPUs). When the GPU implementation is contrasted with its central processing unit (CPU) counterpart, we find a large difference in elapsed time, with the GPU implementation being the fastest. We tested two GPUs, one marketed for video games and one intended for high-performance computing (HPC). On the CPU side, two last-generation processors were tested, one found in common personal computers and one used for HPC. Although our parallel algorithm scales quite well on CPUs, the same implementation on GPUs runs around 10× faster than 16 CPUs, for any combination of the tested GPUs and CPUs. We found that a GPU marketed for video games can be used without any problem for our application and delivers remarkable performance; in fact, it competes with the HPC GPU, in particular when single precision is used. © 2014 Wiley Periodicals, Inc.
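To make the idea concrete, the following is a minimal sketch (not the authors' code) of the grid-based screening step on a GPU: each thread evaluates a central-difference gradient of the density at one grid node and flags near-zero-gradient nodes as critical-point candidates. The array layout, grid spacing `h`, and tolerance `tol` are illustrative assumptions; a production search would refine candidates with Newton steps.

```cuda
// Illustrative sketch only: flag grid nodes with |grad rho| below a tolerance
// as candidate critical points. Launch with a 3D grid of 3D blocks, e.g.
// dim3 block(8, 8, 8); dim3 grid((nx+7)/8, (ny+7)/8, (nz+7)/8);
__device__ inline size_t idx3(int i, int j, int k, int ny, int nz)
{
    return ((size_t)i * ny + j) * nz + k;
}

__global__ void flag_critical_candidates(const double* rho, int nx, int ny, int nz,
                                         double h, double tol, unsigned char* flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1) return;

    // Central-difference gradient of the electron density at node (i, j, k).
    double gx = (rho[idx3(i + 1, j, k, ny, nz)] - rho[idx3(i - 1, j, k, ny, nz)]) / (2.0 * h);
    double gy = (rho[idx3(i, j + 1, k, ny, nz)] - rho[idx3(i, j - 1, k, ny, nz)]) / (2.0 * h);
    double gz = (rho[idx3(i, j, k + 1, ny, nz)] - rho[idx3(i, j, k - 1, ny, nz)]) / (2.0 * h);

    // A (near-)vanishing gradient marks a candidate critical point.
    flag[idx3(i, j, k, ny, nz)] = (gx * gx + gy * gy + gz * gz < tol * tol) ? 1 : 0;
}
```

Because every node is treated independently, this screening step is embarrassingly parallel, which is the main reason the GPU version outruns the CPU implementation.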

2.
The computation of electron repulsion integrals (ERIs) is the most time-consuming step in density functional calculations using Gaussian basis sets. Many intermediate ERIs are calculated, and most are kept in slower storage, such as cache or main memory, because of the shortage of registers, the fastest storage in a central processing unit (CPU). Moreover, the heavy register usage makes it difficult to launch enough concurrent threads on a graphics processing unit (GPU) to hide latency. Hence, we propose to optimize the calculation order of one-center ERIs to minimize the number of registers used, and to calculate each ERI with three or six co-operating threads. The performance of this method is measured on a recent CPU and a GPU. The proposed approach is found to be efficient for high-angular-momentum basis functions on a GPU. When combined with a recent GPU, it accelerates the computation almost 4-fold. © 2014 Wiley Periodicals, Inc.

3.
The NCI approach is a modern tool to reveal noncovalent interactions in chemistry, and it is particularly attractive for describing ligand–protein binding. A custom implementation of NCI using the promolecular density is presented. It is designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three code versions is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time. On a single compute node, the dual-GPU version yields a 39-fold improvement for the largest instance compared with the optimal OpenMP parallel run (C code, icc compiler) on 16 CPU cores. Energy consumption measurements carried out on both the CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc.
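As an illustration of why NCI maps so well onto GPUs, the sketch below computes the promolecular density and the reduced density gradient (RDG) independently at every grid point, one thread per point. It is not the paper's code: each atom is represented here by a single exponential with placeholder coefficients `c` and `zeta`, whereas real NCI promolecular densities use piecewise-exponential atomic fits.

```cuda
// Illustrative sketch: promolecular density and reduced density gradient on a
// grid, one thread per grid point. Each atom contributes c[a]*exp(-r/zeta[a]).
__global__ void nci_rdg(const double3* grid_xyz, int npts,
                        const double3* atom_xyz, const double* c, const double* zeta,
                        int natoms, double* rho_out, double* rdg_out)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= npts) return;

    double3 r = grid_xyz[p];
    double rho = 0.0, gx = 0.0, gy = 0.0, gz = 0.0;

    for (int a = 0; a < natoms; ++a) {
        double dx = r.x - atom_xyz[a].x;
        double dy = r.y - atom_xyz[a].y;
        double dz = r.z - atom_xyz[a].z;
        double d  = sqrt(dx * dx + dy * dy + dz * dz) + 1e-12;  // avoid 0/0 at a nucleus
        double rho_a  = c[a] * exp(-d / zeta[a]);
        double drho_a = -rho_a / zeta[a];      // radial derivative of the atomic density
        rho += rho_a;
        gx  += drho_a * dx / d;                // chain rule: d|r - R_a|/dx = dx/d
        gy  += drho_a * dy / d;
        gz  += drho_a * dz / d;
    }

    // Reduced density gradient s = |grad rho| / (2 (3 pi^2)^(1/3) rho^(4/3)).
    const double pi = 3.14159265358979323846;
    const double kf = 2.0 * pow(3.0 * pi * pi, 1.0 / 3.0);
    rho_out[p] = rho;
    rdg_out[p] = sqrt(gx * gx + gy * gy + gz * gz) / (kf * pow(rho, 4.0 / 3.0));
}
```

Each grid point depends only on the atomic coordinates, so there is no inter-thread communication at all; the workload is bounded by memory bandwidth rather than synchronization.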

4.
The approach used to calculate two-electron integrals in many electronic structure packages, including the General Atomic and Molecular Electronic Structure System-UK (GAMESS-UK), was designed for CPU-based compute units. We redesigned the two-electron integral algorithm for acceleration on a graphical processing unit (GPU). We report the acceleration strategy and illustrate it on (ss|ss)-type integrals. The strategy is general for Fortran-based codes and uses the Accelerator compiler from Portland Group International together with GPU-based accelerators from Nvidia. The evaluation of (ss|ss)-type integrals within Hartree–Fock ab initio and density functional theory calculations is accelerated by factors of 43 and 153 on single- and quad-GPU hardware systems, respectively. The overall speedup for a single self-consistent field cycle is at least a factor of eight on a single GPU compared with a single CPU. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011
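For orientation, the closed form usually taken as the starting point for (ss|ss) integrals over four unnormalized primitive s-type Gaussians is the standard textbook result below (not a detail taken from this paper):

```latex
% (ss|ss) ERI over four unnormalized primitive s-type Gaussians with exponents
% \alpha, \beta, \gamma, \delta centred at A, B, C, D.
\[
\begin{aligned}
p &= \alpha+\beta, \qquad q = \gamma+\delta, \qquad
\mathbf P = \frac{\alpha\mathbf A+\beta\mathbf B}{p}, \qquad
\mathbf Q = \frac{\gamma\mathbf C+\delta\mathbf D}{q},\\[4pt]
(ss|ss) &= \frac{2\pi^{5/2}}{pq\sqrt{p+q}}
\exp\!\left(-\frac{\alpha\beta}{p}\,|\mathbf A-\mathbf B|^{2}
            -\frac{\gamma\delta}{q}\,|\mathbf C-\mathbf D|^{2}\right)
F_{0}\!\left(\frac{pq}{p+q}\,|\mathbf P-\mathbf Q|^{2}\right),
\end{aligned}
\]
% where F_0(x) = \int_0^1 e^{-x t^2} dt is the Boys function.
```

On a GPU, one thread (or a small group of cooperating threads) typically evaluates one such primitive quartet, with contracted integrals assembled by summation.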

5.
We present an algorithm to efficiently compute accurate volumes and surface areas of macromolecules on graphical processing unit (GPU) devices using an analytic model which represents atomic volumes by continuous Gaussian densities. The volume of the molecule is expressed by means of the inclusion–exclusion formula, which is based on the summation of overlap integrals among multiple atomic densities. The surface area of the molecule is obtained by differentiation of the molecular volume with respect to the atomic radii. The many-body nature of the model makes a port to GPU devices challenging. To our knowledge, this is the first reported full implementation of this model on GPU hardware. To accomplish this, we have used recursive strategies to construct the tree of overlaps and to accumulate volumes and their gradients on the tree data structures so as to minimize memory contention. The algorithm is used in the formulation of a surface-area-based nonpolar implicit solvent model implemented as an open-source plug-in (named GaussVol) for the popular OpenMM library for molecular mechanics modeling. GaussVol is 50 to 100 times faster than our best optimized CPU implementation, achieving speeds in excess of 100 ns/day with a 1 fs time step for protein-sized systems on commodity GPUs. © 2017 Wiley Periodicals, Inc.
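The analytic building block behind this model is standard Gaussian overlap algebra; a sketch of the two-body term and the inclusion–exclusion series is shown below (generic form, not necessarily GaussVol's exact prefactor or normalization conventions):

```latex
% Atom i is represented by the Gaussian density p_i exp(-a_i |r - R_i|^2).
\[
\begin{aligned}
V_{ij} &= \int p_i\,p_j\,
e^{-a_i|\mathbf r-\mathbf R_i|^{2}-a_j|\mathbf r-\mathbf R_j|^{2}}\,\mathrm d^{3}r
= p_i\,p_j \left(\frac{\pi}{a_i+a_j}\right)^{3/2}
\exp\!\left(-\frac{a_i a_j}{a_i+a_j}\,|\mathbf R_i-\mathbf R_j|^{2}\right),\\[4pt]
V &\approx \sum_i V_i-\sum_{i<j} V_{ij}+\sum_{i<j<k} V_{ijk}-\cdots,
\qquad A_i = \frac{\partial V}{\partial R_i}.
\end{aligned}
\]
```

Higher-order overlaps follow by applying the Gaussian product rule recursively, which is why the implementation organizes them on a tree; as the abstract states, the surface area is obtained by differentiating V with respect to the atomic radii.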

6.
Principal component analysis (PCA) and other multivariate analysis methods have been used increasingly to analyse and understand depth-profiles in XPS, AES and SIMS. For large images or three-dimensional (3D) imaging depth-profiles, PCA has been difficult to apply until now simply because of the size of the data matrices involved. In a recent paper, we described two algorithms, random vector 1 (RV1) and random vector 2 (RV2), that improve the speed of PCA and allow datasets of unlimited size, respectively. In this paper, we now apply the RV2 algorithm to perform PCA on full 3D time-of-flight SIMS data for the first time without subsampling. The dataset we process in this way is a 128 × 128 pixel depth-profile of 120 layers, with each voxel having a 70,439-value mass spectrum associated with it. This forms over a terabyte of data when uncompressed and took 27 h to process with the RV2 algorithm on a conventional Windows desktop personal computer (PC). While full PCA (e.g. using RV2) is to be preferred for final reports or publications, a much more rapid method is needed during analysis sessions to inform decisions on the next analytical step. We have therefore implemented the RV1 algorithm on a PC with a graphical processor unit (GPU) card containing 2880 individual processor cores. This increases the speed of calculation by a factor of around 4.1 compared with what is possible using a fast, commercially available desktop PC with central processing units alone, and full PCA is performed in less than 7 s. The size of the dataset that can be processed in this way is limited by the size of the memory on the GPU card, which is typically sufficient for two-dimensional images but not for 3D depth-profiles without subsampling. We have therefore examined efficient sampling schemes that allow a good approximate solution to the PCA problem for large 3D datasets. We find that low-discrepancy series such as Sobol sampling give more rapid convergence than random sampling, and we recommend such methods for routine use. Using the GPU and low-discrepancy series together, we anticipate that any time-of-flight SIMS dataset, of whatever size, can be efficiently and accurately processed into PCA components in a maximum of around 10 s using a commercial PC with a widely available GPU card, although the longer RV2 approach is still to be preferred for the presentation of final results, such as in published papers. Copyright © 2016 The Authors. Surface and Interface Analysis published by John Wiley & Sons Ltd
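For context, a generic random-projection PCA works roughly as follows; this is a standard randomized range-finder sketch offered for orientation only, and the published RV1/RV2 algorithms may differ in detail:

```latex
% X is the data matrix (voxels x mass channels, n x m), k the number of components.
\[
\begin{aligned}
&\Omega \in \mathbb{R}^{m\times k}\ \text{(random or low-discrepancy test vectors)},
\qquad Y = X\,\Omega,\\
&Y = QR\ \text{(thin QR, so } Q\ \text{spans an approximate range of } X\text{)},
\qquad B = Q^{\mathsf T}X,\\
&B = \tilde U\,\Sigma\,P^{\mathsf T}\ \text{(small SVD)}
\;\Longrightarrow\; X \approx (Q\tilde U)\,\Sigma\,P^{\mathsf T} = T\,P^{\mathsf T},
\end{aligned}
\]
```

where T are the scores and P the loadings. Only products with X are required, so the data can be streamed through the GPU in chunks, which is why GPU memory, rather than host memory, becomes the limiting resource.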

7.
Motivation: A cluster of strongly interacting ionizable groups in a protein molecule with irregular ionization behavior is suggestive of a specific structure–function relationship. However, the computational treatment of such clusters is unconventional (e.g., the naive self-consistent iterative algorithm fails to converge). A rigorous treatment requires evaluation of Boltzmann-averaged statistical mechanics sums and an electrostatic energy estimate for each microstate. Summary: irGPU: Irregular strong interactions in proteins—a GPU solver is a novel solution to a versatile problem in protein biophysics: the atypical protonation behavior of coupled groups. The computational severity of the problem is alleviated by parallelization (via GPU kernels), which is applied to the electrostatic interaction evaluation (including explicit electrostatics via the fast multipole method) as well as to the estimation of statistical mechanics sums (partition functions). Special attention is given to ease of use and to encapsulation of theoretical details without sacrificing the rigor of the computational procedures. irGPU is not just a solution-in-principle but a promising practical application with the potential to entice the community into a deeper understanding of the principles governing biomolecular mechanisms. © 2015 Wiley Periodicals, Inc.

8.
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing petaflops-class many-core supercomputers are presented. Two improvements over the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been made: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. The peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5. © 2016 Wiley Periodicals, Inc.

9.
Graphical processing units (GPUs) are emerging in computational chemistry for Hartree–Fock (HF) methods and electron-correlation theories. However, ab initio calculations of large molecules face technical difficulties, such as slow memory transfer between the central processing unit and the GPU and the limited amount of GPU memory. The divide-and-conquer (DC) method, a linear-scaling scheme that divides a total system into several fragments, can avoid these bottlenecks by separately solving local equations in individual fragments. In addition, the resolution-of-the-identity (RI) approximation enables an effective reduction in computational cost with respect to GPU memory. The present study implemented a DC-RI-HF code on GPUs using math libraries, which guarantees compatibility with future developments of the GPU architecture. Numerical applications confirmed that the present code using GPUs significantly accelerates HF calculations while maintaining accuracy. © 2014 Wiley Periodicals, Inc.

10.
During the past few years, graphics processing units (GPUs) have become extremely popular in the high-performance computing community. In this study, we present an implementation of an acceleration engine for the solvent–solvent interaction evaluation in molecular dynamics simulations. By careful optimization of the algorithm, speed-ups of up to a factor of 54 (single-precision GPU vs. double-precision CPU) could be achieved. The accuracy of the single-precision GPU implementation is carefully investigated and does not influence structural, thermodynamic, or dynamic quantities. The implementation therefore enables users of the GROMOS software for biomolecular simulation to run the solvent–solvent interaction evaluation on a GPU and thus to speed up their simulations by a factor of 6–9. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010
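The kind of kernel involved can be sketched as follows. This is an illustrative single-precision pairwise kernel, not the GROMOS engine itself: cutoffs, exclusions and the reaction-field correction are omitted, charges are assumed to absorb the Coulomb prefactor, and `c6`/`c12` stand for uniform solvent Lennard-Jones parameters.

```cuda
// Illustrative tiled O(N^2) solvent-solvent force kernel (LJ + Coulomb, single
// precision). Launch with blockDim.x == TILE; shared-memory staging of particle
// data is the standard trick to keep the GPU busy.
#define TILE 128

__global__ void solvent_forces(const float4* posq,   // xyz position + charge
                               int n, float c6, float c12, float3* force)
{
    __shared__ float4 cache[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? posq[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 f = make_float3(0.f, 0.f, 0.f);

    for (int start = 0; start < n; start += TILE) {
        int j = start + threadIdx.x;
        cache[threadIdx.x] = (j < n) ? posq[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        for (int t = 0; t < TILE && start + t < n; ++t) {
            if (i >= n || start + t == i) continue;
            float4 pj = cache[t];
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            float r2   = dx * dx + dy * dy + dz * dz;
            float inv2 = 1.0f / r2;
            float inv6 = inv2 * inv2 * inv2;
            float coul = pi.w * pj.w * sqrtf(inv2);   // q_i q_j / r (prefactor absorbed)
            // Force magnitude over r: (12 c12/r^12 - 6 c6/r^6 + q_i q_j/r) / r^2.
            float fs = (12.f * c12 * inv6 * inv6 - 6.f * c6 * inv6 + coul) * inv2;
            f.x += fs * dx;  f.y += fs * dy;  f.z += fs * dz;
        }
        __syncthreads();
    }
    if (i < n) force[i] = f;
}
```

Running such a kernel in single precision is what makes the careful accuracy checks reported in the abstract necessary.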

11.
The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware, from commodity workstations to high-performance computing clusters. Hardware features are well exploited with a combination of single-instruction-multiple-data, multithreading, and message passing interface (MPI)-based single-program-multiple-data/multiple-program-multiple-data parallelism, while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs, this improvement is equally reflected in the performance-to-price ratio. Although memory errors in consumer-class GPUs could pass unnoticed because these cards lack error-checking-and-correction memory, unreliable GPUs can be sorted out with memory-checking tools. Apart from the obvious determinants of cost-efficiency, such as hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime of a few years until replacement, the costs for electrical power and cooling can become larger than the cost of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime. © 2015 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
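The cost argument can be made explicit with a back-of-the-envelope model, where R_traj is the node's trajectory production rate (ns/day), T its service life, P_node its average power draw, and c_energy the electricity-plus-cooling price. The numbers in the example are purely illustrative assumptions, not values taken from the paper:

```latex
\[
C_{\text{total}} = C_{\text{hardware}} + P_{\text{node}}\,T\,c_{\text{energy}},
\qquad
\text{cost efficiency} = \frac{R_{\text{traj}}\cdot T}{C_{\text{total}}}.
\]
% Illustrative example: a hypothetical 500 W node run for 5 years (about 43,800 h)
% at 0.25 EUR/kWh consumes 0.5 kW x 43,800 h x 0.25 EUR/kWh, roughly 5,500 EUR in
% electricity alone, which can rival the purchase price of the node.
```

This is the sense in which a well-balanced CPU/consumer-GPU node maximizes trajectory per unit of total cost rather than per unit of hardware price.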

12.
We present the first graphical processing unit (GPU) coprocessor-enabled version of the Order-N Electronic Total Energy Package (ONETEP) code for linear-scaling first principles quantum mechanical calculations on materials. This work focuses on porting to the GPU the parts of the code that involve atom-localized fast Fourier transform (FFT) operations. These are among the most computationally intensive parts of the code and are used in core algorithms such as the calculation of the charge density, the local potential integrals, the kinetic energy integrals, and the nonorthogonal generalized Wannier function gradient. We have found that direct porting of the isolated FFT operations did not provide any benefit. Instead, it was necessary to tailor the port to each of the aforementioned algorithms to optimize data transfer to and from the GPU. A detailed discussion of the methods used and tests of the resulting performance are presented, which show that individual steps in the relevant algorithms are accelerated by a significant amount. However, the transfer of data between the GPU and host machine is a significant bottleneck in the reported version of the code. In addition, an initial investigation into a dynamic precision scheme for the ONETEP energy calculation has been performed to take advantage of the enhanced single precision capabilities of GPUs. The methods used here result in no disruption to the existing code base. Furthermore, as the developments reported here concern the core algorithms, they will benefit the full range of ONETEP functionality. Our use of a directive-based programming model ensures portability to other forms of coprocessors and will allow this work to form the basis of future developments to the code designed to support emerging high-performance computing platforms. Copyright © 2013 Wiley Periodicals, Inc.

13.
A new program, PHI, with the ability to calculate the magnetic properties of large spin systems and complex orbitally degenerate systems, such as clusters of d-block and f-block ions, is presented. The program can intuitively fit experimental data from multiple sources, such as magnetic and spectroscopic data, simultaneously. PHI is extensively parallelized and can operate under the symmetric multiprocessing, single-process-multiple-data, or GPU paradigms using a threaded, MPI, or GPU model, respectively. For a given problem, PHI has been shown to be almost 12 times faster than the well-known program MAGPACK, limited only by the available hardware. © 2013 Wiley Periodicals, Inc.

14.
15.
We accelerated an ab initio molecular quantum Monte Carlo (QMC) calculation by using GPGPU. Only the bottleneck part of the calculation is replaced by a CUDA subroutine and performed on the GPU. The performance of a single-core CPU + GPU is compared with that of a single-core CPU in double precision, yielding calculations that are 23.6 (11.0) times faster for single (double) precision treatments on the GPU. The energy deviation caused by the single-precision treatment was found to be within the accuracy required in the calculation, ~10^-5 hartree. The accelerated computational nodes mounting GPUs were combined to form a hybrid MPI cluster, on which we confirmed that the performance scales linearly with the number of nodes. © 2011 Wiley Periodicals, Inc. J Comput Chem, 2011

16.
Coarse-grained (CG) molecular models have been proposed to simulate complex systems with lower computational overheads and longer timescales with respect to atomistic-level models. However, their acceleration on parallel architectures such as graphics processing units (GPUs) presents original challenges that must be carefully evaluated. The objective of this work is to characterize the impact of CG model features on parallel simulation performance. To achieve this, we implemented a GPU-accelerated version of a CG molecular dynamics simulator, to which we applied specific optimizations for CG models, such as dedicated data structures to handle different bead-type interactions, obtaining a maximum speed-up of 14× on the NVIDIA GTX480 GPU with Fermi architecture. We provide a complete characterization and evaluation of the algorithmic and simulated-system features of CG models that impact the achievable speed-up and accuracy of results, using three different GPU architectures as case studies. © 2012 Wiley Periodicals, Inc.
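One such dedicated data structure can be sketched as a flattened pair-parameter table indexed by bead type. This is an illustrative sketch only, with hypothetical names such as `pair_param` and `MAX_TYPES`; it is not the simulator described in the paper, and a real code would use neighbor lists rather than the plain O(N^2) loop shown here.

```cuda
// Illustrative sketch: per-bead-type Lennard-Jones parameters kept in a flattened
// constant-memory table; each entry stores (epsilon, sigma) for one ordered pair
// of bead types.
#define MAX_TYPES 8
__constant__ float2 pair_param[MAX_TYPES * MAX_TYPES];

__global__ void cg_energy(const float3* pos, const int* type, int n,
                          float cutoff2, float* energy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float e = 0.0f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > cutoff2) continue;

        // One table lookup resolves the interaction parameters for this type pair.
        float2 p = pair_param[type[i] * MAX_TYPES + type[j]];
        float s2 = p.y * p.y / r2;               // (sigma/r)^2
        float s6 = s2 * s2 * s2;
        e += 4.0f * p.x * (s6 * s6 - s6);        // 12-6 Lennard-Jones pair energy
    }
    energy[i] = 0.5f * e;                        // halve to avoid double counting pairs
}
```

Because CG beads are few in type but heterogeneous in interaction, keeping the whole table in constant memory avoids divergent global-memory lookups, which is exactly the kind of model-specific optimization the abstract refers to.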

17.
A new hardware-agnostic contraction algorithm for tensors of arbitrary symmetry and sparsity is presented. The algorithm is implemented as a stand-alone open-source code, libxm. This code is also integrated with the general tensor library libtensor and with the Q-Chem quantum-chemistry package. An overview of the algorithm, its implementation, and benchmarks are presented. Similarly to other tensor software, the algorithm exploits efficient matrix multiplication libraries and assumes that tensors are stored in a block-tensor form. The distinguishing features of the algorithm are: (i) efficient repackaging of the individual blocks into large matrices and back, which affords efficient graphics processing unit (GPU)-enabled calculations without modifications of higher-level codes; (ii) fully asynchronous data transfer between disk storage and fast memory. The algorithm enables canonical all-electron coupled-cluster and equation-of-motion coupled-cluster calculations with single and double substitutions (CCSD and EOM-CCSD) with over 1000 basis functions on a single quad-GPU machine. We show that the algorithm exhibits the predicted theoretical scaling for canonical CCSD calculations, O(N^6), irrespective of the data size on disk. © 2017 Wiley Periodicals, Inc.
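The "repackage blocks, then call GEMM" idea can be illustrated with a hedged host-side sketch. This is not libxm code: block gathering is assumed to have already produced column-major matrices, error checking is omitted, and a real implementation would overlap transfers with computation using streams.

```cuda
// Illustrative sketch: contract C(m,n) += A(m,k) * B(k,n) on the GPU after the
// tensor blocks have been flattened into column-major host matrices (cuBLAS
// expects column-major storage).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void contract_packed(const std::vector<double>& A, const std::vector<double>& B,
                     std::vector<double>& C, int m, int n, int k)
{
    double *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(double));
    cudaMalloc(&dB, B.size() * sizeof(double));
    cudaMalloc(&dC, C.size() * sizeof(double));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 1.0;          // accumulate into C
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    cublasDestroy(h);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Because all the GPU-specific work is confined to a GEMM on repacked matrices, the higher-level coupled-cluster code needs no modification, which is the first distinguishing feature the abstract lists; asynchronous disk-to-memory staging is the second.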

18.
We investigated the performance of heterogeneous computing with graphics processing units (GPUs) and Many Integrated Core (MIC) coprocessors together with 20 CPU cores (20×CPU). As a practical example toward large-scale electronic structure calculations using grid-based methods, we evaluated the Hartree potentials of silver nanoparticles of various sizes (3.1, 3.7, 4.9, 6.1, and 6.9 nm) via a direct integral method supported by the sinc basis set. A so-called work-stealing scheduler was used for efficient heterogeneous computing via the balanced dynamic distribution of workloads between all processors of a given architecture, without any prior information on their individual performance. 20×CPU + 1GPU was up to ~1.5 and ~3.1 times faster than 1GPU and 20×CPU, respectively, and 20×CPU + 2GPU was ~4.3 times faster than 20×CPU. The performance enhancement from CPU + MIC was considerably lower than expected because of the large initialization overhead of the MIC, although its theoretical performance is similar to that of CPU + GPU. © 2016 Wiley Periodicals, Inc.
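The essence of such dynamic load balancing can be sketched with a shared atomic chunk counter; this is a deliberately simplified stand-in for a full work-stealing scheduler, and `process_on_gpu` / `process_on_cpu` are hypothetical application-supplied callbacks, not functions from the paper.

```cuda
// Hedged sketch: a GPU worker thread and a CPU worker thread pull grid chunks
// from a shared atomic counter until the work runs out.
#include <atomic>
#include <functional>
#include <thread>

void run_heterogeneous(int nchunks,
                       const std::function<void(int)>& process_on_gpu,
                       const std::function<void(int)>& process_on_cpu)
{
    std::atomic<int> next{0};  // index of the next unprocessed chunk

    // Each worker grabs chunks as fast as it finishes them, so the faster device
    // automatically ends up with the larger share of the work; no prior knowledge
    // of per-device performance is needed.
    std::thread gpu_worker([&] {
        for (int c = next.fetch_add(1); c < nchunks; c = next.fetch_add(1))
            process_on_gpu(c);
    });
    std::thread cpu_worker([&] {
        for (int c = next.fetch_add(1); c < nchunks; c = next.fetch_add(1))
            process_on_cpu(c);
    });
    gpu_worker.join();
    cpu_worker.join();
}
```

A fixed-per-device-cost such as the MIC initialization overhead mentioned in the abstract shows up in this picture as chunks that the slow-starting device never gets a chance to claim.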

19.
In the arsenal of tools that a computational modeller can bring to bear on the study of cardiac arrhythmias, the most widely used and arguably the most successful is the excitable medium, a special case of a reaction-diffusion model. These models are used to simulate the internal chemical reactions of a cardiac cell and the diffusion of their membrane voltages. Via a number of different methodologies it has previously been shown that reaction-diffusion systems are Turing complete at multiple levels; that is, they are capable of computation in the same manner as a universal Turing machine. However, all such computational systems are subject to a limitation known as the Halting problem. By constructing a universal logic gate using a cardiac cell model, we highlight how the Halting problem could therefore limit what it is possible to predict about cardiac tissue, arrhythmias, and re-entry. All simulations for this work were carried out on the GPU of an XBox 360 development console, and we also highlight the great gains in computational power and efficiency produced by such general-purpose processing on a GPU for cardiac simulations.
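To give a flavour of such a simulation on a GPU, here is a minimal sketch of one explicit time step of a generic Barkley-type excitable medium, one thread per tissue node. It is not the cardiac cell model or the XBox code used in the paper; `u` plays the role of the membrane-voltage-like excitation variable, `v` the recovery variable, and `a`, `b`, `eps`, `D` are model parameters.

```cuda
// Illustrative sketch: explicit Euler step of the Barkley excitable-medium model
// on a 2D grid. Boundary nodes are skipped (effectively clamped) for brevity.
__global__ void barkley_step(const float* u, const float* v, float* u_new, float* v_new,
                             int nx, int ny, float dt, float h,
                             float a, float b, float eps, float D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;

    int id = j * nx + i;
    float uc = u[id], vc = v[id];

    // Five-point Laplacian models diffusion of the membrane voltage between cells.
    float lap = (u[id - 1] + u[id + 1] + u[id - nx] + u[id + nx] - 4.0f * uc) / (h * h);

    // Local (cell-level) reaction kinetics.
    float du = (1.0f / eps) * uc * (1.0f - uc) * (uc - (vc + b) / a) + D * lap;
    float dv = uc - vc;

    u_new[id] = uc + dt * du;
    v_new[id] = vc + dt * dv;
}
```

Every node updates independently from the previous time level, which is why even the modest GPU of a games console handles such tissue-scale simulations efficiently.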

20.
GALAMOST [graphics processing unit (GPU)-accelerated large-scale molecular simulation toolkit] is a molecular simulation package designed to exploit the computational power of GPUs. Besides the common features of molecular dynamics (MD) packages, it is developed specifically for studies of self-assembly, phase transitions, and other properties of polymeric systems at the mesoscopic scale, using several recently developed simulation techniques. To accelerate the simulations, GALAMOST contains a hybrid particle-field MD technique in which particle–particle interactions are replaced by interactions of particles with density fields. Moreover, numerical potentials obtained by bottom-up coarse-graining methods can be used in GALAMOST simulations. By combining these force fields and the particle-density coupling method, simulations of polymers can be performed with very large system sizes over long simulation times. In addition, GALAMOST encompasses two specific models, a soft anisotropic particle model and a chain-growth polymerization model, with which the hierarchical self-assembly of soft anisotropic particles and problems related to polymerization can be studied, respectively. The optimized algorithms implemented on the GPU, the package characteristics, and benchmarks of GALAMOST are reported in detail. © 2013 Wiley Periodicals, Inc.
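The particle-field coupling that replaces pair interactions starts from a density deposition step, sketched below. This is an illustrative nearest-grid-point deposition with hypothetical names, not GALAMOST's implementation, which may use higher-order assignment schemes; `box` is the edge length of a cubic periodic box and `ngrid` the number of cells per dimension.

```cuda
// Illustrative sketch: deposit particle number density onto a uniform grid with
// nearest-grid-point assignment; atomicAdd resolves write conflicts when several
// particles map to the same cell. The resulting field mediates the interactions
// in place of explicit particle-particle pairs.
__global__ void deposit_density(const float3* pos, int nparticles,
                                float box, int ngrid, float* density)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nparticles) return;

    float cell = box / ngrid;
    // Map the (periodically wrapped) particle position to a grid cell index.
    int ix = ((int)floorf(pos[p].x / cell) % ngrid + ngrid) % ngrid;
    int iy = ((int)floorf(pos[p].y / cell) % ngrid + ngrid) % ngrid;
    int iz = ((int)floorf(pos[p].z / cell) % ngrid + ngrid) % ngrid;

    atomicAdd(&density[(ix * ngrid + iy) * ngrid + iz], 1.0f / (cell * cell * cell));
}
```

Because the field is updated far less often than pair lists would be traversed, this coupling is what lets very large polymeric systems reach long simulation times on a single GPU.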
