Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
We investigate and test an algorithm suitable for the parallel calculation of the potential energy of a protein, or its spatial gradient, when the protein atoms interact via pair potentials. This algorithm is similar to one previously proposed, but it is more efficient, having half the interprocessor communication costs. For a given protein, we show that there is an optimal number of processors that gives a maximum speedup of the potential-energy calculation compared to a sequential machine (using more than the optimal number of processors actually increases the computation time). With the optimal number, the computation time is proportional to the protein size N. This is a considerable improvement over sequential machines, where the computation time is proportional to N². We also show that the dependence of the maximum speedup on the message latency time is relatively weak.
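The qualitative behavior described here, an optimal processor count beyond which adding processors hurts, and linear-in-N time at the optimum, falls out of a simple cost model. The sketch below is our illustration with assumed coefficients `a` and `b`, not the paper's analysis:

```python
import math

# Toy cost model: pairwise compute work a*N^2/P plus a communication
# overhead that grows as b*P. Minimizing over P gives P* = N*sqrt(a/b),
# and the time at the optimum, 2*N*sqrt(a*b), is linear in N.

def total_time(N, P, a=1.0, b=100.0):
    """Modeled wall time: compute term + communication term."""
    return a * N * N / P + b * P

def optimal_processors(N, a=1.0, b=100.0):
    """Stationary point of total_time: -a*N^2/P^2 + b = 0."""
    return N * math.sqrt(a / b)
```

With these assumed constants, `total_time(1000, P)` is minimized near P = 100 and increases again for larger P, matching the abstract's observation.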

2.
We present the parallelization of a quantum-chemical tree code for linear-scaling computation of the Coulomb matrix. Equal-time partition, a measurement-based algorithm for domain decomposition, is used to load-balance this computation; it exploits the small variation of the density between self-consistent-field cycles. The efficiency of equal-time partition is illustrated by several tests involving both finite and periodic systems. It is found that equal-time partition is able to deliver 91%–98% efficiency with 128 processors in the most time-consuming part of the Coulomb matrix calculation. The current parallel quantum-chemical tree code is able to deliver 63%–81% overall efficiency on 128 processors with fine-grained parallelism (fewer than two heavy atoms per processor).
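The core idea of a measurement-based partition can be sketched in a few lines: per-task costs measured in the previous SCF cycle are used to cut the task list into contiguous chunks of roughly equal total time. This is an illustrative sketch with assumed names, not the authors' code:

```python
# Measurement-based "equal time" partitioning sketch: because the density
# (and hence per-task cost) changes little between SCF cycles, last cycle's
# timings predict this cycle's load well enough to balance it.

def equal_time_partition(costs, nproc):
    """Split tasks into nproc contiguous chunks of roughly equal total cost.

    costs -- per-task times measured in the previous cycle
    nproc -- number of processors
    Returns a list of (start, end) index ranges, one per processor.
    """
    total = sum(costs)
    target = total / nproc  # ideal per-processor workload
    ranges, start, acc = [], 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        # close a chunk once it reaches the target, keeping at least one
        # task for each remaining processor
        if (acc >= target and len(ranges) < nproc - 1
                and i + 1 <= len(costs) - (nproc - 1 - len(ranges))):
            ranges.append((start, i + 1))
            start, acc = i + 1, 0.0
    ranges.append((start, len(costs)))
    return ranges
```

A greedy prefix cut like this is not optimal in general, but when costs vary slowly between cycles it keeps the slowest processor close to the mean load.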

3.
We present ONETEP (order-N electronic total energy package), a density functional program for parallel computers whose computational cost scales linearly with the number of atoms and the number of processors. ONETEP is based on our reformulation of the plane wave pseudopotential method which exploits the electronic localization that is inherent in systems with a nonvanishing band gap. We summarize the theoretical developments that enable the direct optimization of strictly localized quantities expressed in terms of a delocalized plane wave basis. These same localized quantities lead us to a physical way of dividing the computational effort among many processors to allow calculations to be performed efficiently on parallel supercomputers. We show with examples that ONETEP achieves excellent speedups with increasing numbers of processors and confirm that the time taken by ONETEP as a function of increasing number of atoms for a given number of processors is indeed linear. What distinguishes our approach is that the localization is achieved in a controlled and mathematically consistent manner so that ONETEP obtains the same accuracy as conventional cubic-scaling plane wave approaches and offers fast and stable convergence. We expect that calculations with ONETEP have the potential to provide quantitative theoretical predictions for problems involving thousands of atoms such as those often encountered in nanoscience and biophysics.

4.
Short-range molecular dynamics simulations of molecular systems are commonly parallelized by replicated-data methods, in which each processor stores a copy of all atom positions. This enables computation of bonded 2-, 3-, and 4-body forces within the molecular topology to be partitioned among processors straightforwardly. A drawback to such methods is that the interprocessor communication scales as N (the number of atoms) independent of P (the number of processors). Thus, their parallel efficiency falls off rapidly when large numbers of processors are used. In this article a new parallel method for simulating macromolecular or small-molecule systems is presented, called force-decomposition. Its memory and communication costs scale as N/√P, allowing larger problems to be run faster on greater numbers of processors. Like replicated-data techniques, and in contrast to spatial-decomposition approaches, the new method can be simply load balanced and performs well even for irregular simulation geometries. The implementation of the algorithm in a prototypical macromolecular simulation code ParBond is also discussed. On a 1024-processor Intel Paragon, ParBond runs a standard benchmark simulation of solvated myoglobin with a parallel efficiency of 61% and at 40 times the speed of a vectorized version of CHARMM running on a single Cray Y-MP processor. © 1996 by John Wiley & Sons, Inc.
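The N/√P scaling comes from tiling the N×N force matrix on a √P×√P processor grid: each processor evaluates one block and therefore needs only a row slice and a column slice of atom positions, each of size ~N/√P. A minimal layout sketch (assumed setup, not the ParBond code):

```python
import math

# Force-decomposition layout sketch: processor `rank` on a sqrt(P) x sqrt(P)
# grid owns the force-matrix block (r, c) and needs only the corresponding
# row and column slices of atom positions (~N/sqrt(P) each), instead of all
# N positions as in a replicated-data scheme.

def fd_slices(N, P, rank):
    """Return the (row, column) atom index ranges needed by `rank`."""
    q = int(math.isqrt(P))
    assert q * q == P, "force decomposition assumes a square processor grid"
    r, c = divmod(rank, q)
    block = N // q
    # last row/column of the grid absorbs the remainder atoms
    rows = range(r * block, (r + 1) * block if r < q - 1 else N)
    cols = range(c * block, (c + 1) * block if c < q - 1 else N)
    return rows, cols
```

Communication per processor is then two slices of ~N/√P positions per step, which is why the method sits between replicated-data (O(N)) and spatial decomposition in communication cost while remaining trivially load balanced.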

5.
Parallel computing seems to be the solution for molecular dynamics of large atomic systems, such as proteins in water environments, but the simulation time critically depends on the processor-allocation strategy. A study of optimal processor allocation based on a space-decomposition algorithm for single-instruction, multiple-data (SIMD) mesh computers is presented. A particular effort has been made to identify the best criterion according to which the atoms can be allocated to the processors using a spatial-decomposition approach. The computing time depends on the granularity of the space decomposition among processing elements and on the ratio between the computational power of the processing elements and the communication speed of the interprocessor network. © 1996 by John Wiley & Sons, Inc.

6.
We discuss issues in developing scalable parallel algorithms and focus on the distribution, as opposed to the replication, of key data structures. Replication of large data structures limits the maximum calculation size by imposing a low ratio of processors to memory. Only applications which distribute both data and computation across processors are truly scalable. The use of shared data structures that may be independently accessed by each process even in a distributed memory environment greatly simplifies development and provides a significant performance enhancement. We describe tools we have developed to support this programming paradigm. These tools are used to develop a highly efficient and scalable algorithm to perform self-consistent field calculations on molecular systems. A simple and classical strip-mining algorithm suffices to achieve an efficient and scalable Fock matrix construction in which all matrices are fully distributed. By strip mining over atoms, we also exploit all available sparsity and pave the way to adopting more sophisticated methods for summation of the Coulomb and exchange interactions. © 1996 by John Wiley & Sons, Inc.

7.
Evaluation of long-range Coulombic interactions still represents a bottleneck in the molecular dynamics (MD) simulations of biological macromolecules. Despite the advent of sophisticated fast algorithms, such as the fast multipole method (FMM), accurate simulations still demand a great amount of computation time due to the accuracy/speed trade-off inherent in these algorithms. Unless higher-order multipole expansions, which are extremely expensive to evaluate, are employed, a large amount of the execution time is still spent directly calculating particle-particle interactions within the nearby region of each particle. To reduce this execution time for pair interactions, we developed a computation unit (board), called MD-Engine II, that calculates nonbonded pairwise interactions using specially designed hardware. Four custom arithmetic processors and a processor for memory manipulation ("particle processor") are mounted on the computation board. The arithmetic processors are responsible for calculation of the pair interactions. The particle processor plays a central role in realizing efficient cooperation with the FMM. The results of a series of 50-ps MD simulations of a protein-water system (50,764 atoms) indicated that a more stringent setting of accuracy in the FMM computation, compared with those previously reported, was required for accurate simulations over long time periods. Such a level of accuracy was efficiently achieved using the cooperative calculations of the FMM and MD-Engine II. On an Alpha 21264 PC, the FMM computation at a moderate but tolerable level of accuracy was accelerated by a factor of 16.0 using three boards. At a high level of accuracy, the cooperative calculation achieved a 22.7-fold acceleration over the corresponding conventional FMM calculation. In the cooperative calculations of the FMM and MD-Engine II, it was possible to achieve more accurate computation at a comparable execution time by incorporating larger nearby regions.

8.
The computation of root mean square deviations (RMSD) is an important step in many bioinformatics applications. If approached naively, each RMSD computation takes time linear in the number of atoms. In addition, a careful implementation is required to achieve numerical stability, which further increases runtimes. In practice, the structural variations under consideration are often induced by rigid transformations of the protein, or are at least dominated by a rigid component. In this work, we show how RMSD values resulting from rigid transformations can be computed in constant time from the protein's covariance matrix, which can be precomputed in linear time. As a typical application scenario is protein clustering, we also show how the Ward distance, which is popular in this field, can be reduced to RMSD evaluations, yielding a constant-time approach for its computation. © 2014 Wiley Periodicals, Inc.
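The constant-time trick can be reproduced from first principles: for centered coordinates with second-moment matrix C = (1/n)Σ xᵢxᵢᵀ and a rigid map x → Rx + t, expanding (1/n)Σ|Rxᵢ + t − xᵢ|² makes the cross terms vanish (the centroid is at the origin) and leaves RMSD² = 2·tr((I − R)C) + |t|². The sketch below (our illustration with assumed function names, not the paper's code) uses this identity:

```python
import math

# Constant-time rigid-body RMSD from a precomputed second-moment matrix.
# second_moment() is O(n) and done once; rigid_rmsd() is O(1) per transform.

def second_moment(coords):
    """3x3 matrix C = (1/n) * sum_i x_i x_i^T for centered coordinates."""
    n = len(coords)
    C = [[0.0] * 3 for _ in range(3)]
    for x in coords:
        for a in range(3):
            for b in range(3):
                C[a][b] += x[a] * x[b] / n
    return C

def rigid_rmsd(C, R, t):
    """RMSD of x -> R x + t: sqrt(2 tr((I - R) C) + |t|^2), valid for
    orthogonal R and coordinates centered at the origin."""
    tr = sum(C[a][a] - sum(R[a][b] * C[b][a] for b in range(3))
             for a in range(3))
    return math.sqrt(2.0 * tr + sum(v * v for v in t))
```

For clustering workloads this turns each of the many RMSD evaluations into a handful of 3×3 operations, independent of the number of atoms.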

9.
The quantum hydrodynamic equations associated with the de Broglie-Bohm formulation of quantum mechanics are solved using a new methodology which gives an accurate, unitary, and stable propagation of a time dependent quantum wave packet [B. K. Kendrick, J. Chem. Phys. 119, 5805 (2003)]. The methodology is applied to an N-dimensional model chemical reaction with an activation barrier. A parallel version of the methodology is presented which is designed to run on massively parallel supercomputers. The computational scaling properties of the parallel code are investigated both as a function of the number of processors and the dimension N. A decoupling scheme is introduced which decouples the multidimensional quantum hydrodynamic equations into a set of uncoupled one-dimensional problems. The decoupling scheme dramatically reduces the computation time and is highly parallelizable. Furthermore, the computation time is shown to scale linearly with respect to the dimension N=2,...,100.

10.
We present a comparison between two different approaches to parallelizing the grand canonical Monte Carlo simulation technique (GCMC) for classical fluids: a spatial decomposition and a time decomposition. The spatial decomposition relies on the fact that for short-ranged fluids, such as the cut and shifted Lennard-Jones potential used in this work, atoms separated by a greater distance than the reach of the potential act independently, and thus different processors can work concurrently in regions of the same system which are sufficiently far apart. The time decomposition is an exactly parallel approach which employs simultaneous GCMC simulations, one per processor, identical in every respect except the initial random number seed, with the thermodynamic output variables averaged across all processors. While scaling characteristics for the spatial decomposition are presented for 8–1024 processor systems, the comparison between the two decompositions is limited to the 8–128 processor range due to the warm-up time and memory limitations of the time decomposition. Using a combination of speed and statistical efficiency, the two algorithms are compared at two different state points. While the time decomposition reaches a given value of standard error in the system's potential energy more quickly than the spatial decomposition for both densities, the warm-up time demands of the time decomposition quickly become insurmountable as the system size increases. © 1996 by John Wiley & Sons, Inc.
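The time decomposition is embarrassingly parallel: p replicas differ only in their random seed, and observables are averaged over replicas. The toy sketch below illustrates only the averaging structure; the per-replica "simulation" is a stand-in observable, not an actual GCMC run:

```python
import random
import statistics

# Time-decomposition skeleton: independent replicas seeded differently,
# with the observable averaged across replicas and the standard error
# estimated from the replica-to-replica scatter.

def replica(seed, nsteps):
    """Stand-in per-processor simulation: mean of nsteps random draws."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(nsteps)) / nsteps

def time_decomposition(p, nsteps):
    """Run p 'replicas' (one per processor) and pool their results."""
    values = [replica(seed, nsteps) for seed in range(p)]
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / p ** 0.5
    return mean, stderr
```

The structure also makes the abstract's caveat concrete: every replica must pay the full warm-up (equilibration) cost before contributing to the average, so the warm-up fraction of the run is not reduced by adding processors.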

11.
We describe a kernel energy method (KEM) for applying quantum crystallography to large molecules, with an emphasis on the calculation of the molecular energy of peptides. The computational difficulty of representing the system increases only modestly with the number of atoms. The calculations are carried out on modern parallel supercomputers. By adopting the approximation that a full biological molecule can be represented by smaller “kernels” of atoms, the calculations are greatly simplified. Moreover, collections of kernels are, from a computational point of view, well suited for parallel computation. The result is a modest increase in computational time as the number of atoms increases, while retaining the ab initio character of the calculations. We describe a test of our method, and establish its accuracy using 15 different peptides of biological interest. © 2005 Wiley Periodicals, Inc. Int J Quantum Chem, 2005

12.
Classical molecular dynamics simulations of biological macromolecules in explicitly modeled solvent typically require the evaluation of interactions between all pairs of atoms separated by no more than some distance R, with more distant interactions handled using some less expensive method. Performing such simulations for periods on the order of a millisecond is likely to require the use of massive parallelism. The extent to which such simulations can be efficiently parallelized, however, has historically been limited by the time required for interprocessor communication. This article introduces a new method for the parallel evaluation of distance-limited pairwise particle interactions that significantly reduces the amount of data transferred between processors by comparison with traditional methods. Specifically, the amount of data transferred into and out of a given processor scales as O(R^(3/2) p^(-1/2)), where p is the number of processors, and with constant factors that should yield a substantial performance advantage in practice.

13.
The atomistic molecular dynamics program YASP has been parallelized for shared-memory computer architectures. Parallelization was restricted to the most CPU-time-consuming parts: neighbor-list construction, calculation of nonbonded, angle and dihedral forces, and constraints. Most of the sequential FORTRAN code was kept; parallel constructs were inserted as compiler directives using the OpenMP standard. Only in the case of the neighbor list did the data structure have to be changed. The parallel code achieves a useful speedup over the sequential version for systems of several thousand atoms and above. On an IBM Regatta p690+, the throughput increases with the number of processors up to a maximum of 12-16 processors depending on the characteristics of the simulated systems. On dual-processor Xeon systems, the speedup is about 1.7.

14.
In this paper we present an efficient parallelization of the ONX algorithm for linear computation of the Hartree-Fock exchange matrix [J. Chem. Phys. 106, 9708 (1997)]. The method used is based on the equal time (ET) partitioning recently introduced [J. Chem. Phys. 118, 9128 (2003)] and [J. Chem. Phys. 121, 6608 (2004)]. ET exploits the slow variation of the density matrix between self-consistent-field iterations to achieve load balance. The method is presented and some benchmark calculations are discussed for gas phase and periodic systems with up to 128 processors. The current parallel ONX code is able to deliver up to 77% overall efficiency for a cluster of 50 water molecules on 128 processors (2.56 processors per heavy atom) and up to 87% for a box of 64 water molecules (two processors per heavy atom) with periodic boundary conditions.

15.
We describe an efficient algorithm for carrying out a “divide-and-conquer” fit of a molecule's electronic density on massively parallel computers. Near linear speedups are achieved with up to 48 processors on a Cray T3E, and our results indicate that similar efficiencies could be attained on an even greater number of processors. To achieve optimum efficiency, the algorithm combines coarse and fine-grain parallelization and adapts itself to the existing ratio of processors to subsystems. The subsystems employed in our divide-and-conquer approach can also be made smaller or bigger, depending on the number of processors available. This allows us to further reduce the wallclock time and improve the method's overall efficiency. The strategies implemented in this paper can be extended to any other divide-and-conquer method used within an ab initio, density functional, or semi-empirical quantum mechanical program. Received: 15 September 1997 / Accepted: 21 January 1998

16.
Generation of the list of near-neighbor pairs of atoms not bonded to each other is a key feature of many programs for calculating the energy and energy derivatives for large molecules. Because this step can take a significant amount of CPU time, more efficient nonbonded list generation can speed up the energy calculations. In this article, a novel nonbonded list generation algorithm, BYCC, is introduced. It combines certain features of other algorithms and achieves more rapid nonbonded list generation; a factor of approximately 2.5 for a molecule of 5000 atoms with a cutoff in the 10 Å range is obtained on Hewlett-Packard (HP) and Alpha processors, without greatly increasing memory requirements.
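Cell-grid ("linked cell") binning is the standard family of techniques that algorithms like BYCC build on: atoms are binned into cubic cells of edge ≥ cutoff, so each atom's neighbors can only lie in its own or the 26 adjacent cells. A minimal sketch (illustrative, not the BYCC algorithm itself):

```python
from collections import defaultdict
from itertools import product

# Cell-grid nonbonded list sketch: binning is O(n), and each pair is
# tested only against atoms in the 27 surrounding cells, giving near-
# linear list generation for roughly uniform densities.

def nonbonded_pairs(coords, cutoff):
    """Return all pairs (i, j), i < j, with distance <= cutoff."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(i)
    pairs = []
    cut2 = cutoff * cutoff
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    # i < j ensures each unordered pair is emitted once
                    if i < j:
                        xi, yi, zi = coords[i]
                        xj, yj, zj = coords[j]
                        d2 = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
                        if d2 <= cut2:
                            pairs.append((i, j))
    return sorted(pairs)
```

In a real force-field code the list would exclude bonded (1-2, 1-3) pairs and use a skin distance so the list survives several steps; those refinements are omitted here.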

17.
Fully ab initio treatment of complex solid systems needs computational software which is able to efficiently take advantage of the growing power of high performance computing (HPC) architectures. Recent improvements in CRYSTAL, a periodic ab initio code that uses a Gaussian basis set, allows treatment of very large unit cells for crystalline systems on HPC architectures with high parallel efficiency in terms of running time and memory requirements. The latter is a crucial point, due to the trend toward architectures relying on a very high number of cores with associated relatively low memory availability. An exhaustive performance analysis shows that density functional calculations, based on a hybrid functional, of low‐symmetry systems containing up to 100,000 atomic orbitals and 8000 atoms are feasible on the most advanced HPC architectures available to European researchers today, using thousands of processors. © 2012 Wiley Periodicals, Inc.

18.
We consider the problem of predicting the mode of binding of a small molecule to a receptor site on a protein. One plausible approach, given a rigid molecule and its geometry, is to search directly for the orientation in space that maximizes the degree of contact. The computation time required for such a naive procedure is proportional to n^3m^3, where n is the number of points in the site where binding can occur, and m is the number of atoms in the ligand. We give an alternative, combinatorial approach, in which only “contact–no-contact” criteria are considered. We relate this problem to the well-known combinatorial problem of finding cliques in a graph and show that we can use a solution to the clique problem not only to solve our original problem, but also the problem of avoiding energetically unfavorable matches. Our experience with this method indicates that the computation time required is proportional to nm^2.8, with a lower constant of proportionality than that of the naive procedure.
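The clique reduction works on a correspondence graph: each vertex is a candidate (ligand atom, site point) pairing, an edge joins two pairings that are geometrically compatible, and a maximal clique is a mutually consistent set of pairings, i.e. a candidate binding mode. A minimal Bron-Kerbosch sketch of the clique-enumeration step (illustrative, not the paper's implementation):

```python
# Maximal-clique enumeration via the classic Bron-Kerbosch recursion:
# R = current clique, P = candidates that extend R, X = already-explored
# vertices that would make R non-maximal.

def bron_kerbosch(R, P, X, adj, cliques):
    if not P and not X:
        cliques.append(R)  # R cannot be extended: it is a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}
        X = X | {v}

def maximal_cliques(vertices, edges):
    """All maximal cliques of an undirected graph, as a list of sets."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(vertices), set(), adj, cliques)
    return cliques
```

Clique enumeration is exponential in the worst case, but on sparse correspondence graphs, where "contact-no-contact" compatibility prunes most edges, it is far cheaper than the exhaustive n^3m^3 orientation search.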

19.
In this paper, we present the implementation of efficient approximations to time-dependent density functional theory (TDDFT) within the Tamm-Dancoff approximation (TDA) for hybrid density functionals. For the calculation of the TDDFT/TDA excitation energies and analytical gradients, we combine the resolution of identity (RI-J) algorithm for the computation of the Coulomb terms and the recently introduced "chain of spheres exchange" (COSX) algorithm for the calculation of the exchange terms. It is shown that for extended basis sets, the RIJCOSX approximation leads to speedups of up to 2 orders of magnitude compared to traditional methods, as demonstrated for hydrocarbon chains. The accuracy of the adiabatic transition energies, excited state structures, and vibrational frequencies is assessed on a set of 27 excited states for 25 molecules with the configuration interaction singles and hybrid TDDFT/TDA methods using various basis sets. Compared to the canonical values, the typical error in transition energies is of the order of 0.01 eV. Similar to the ground-state results, excited state equilibrium geometries differ by less than 0.3 pm in the bond distances and 0.5° in the bond angles from the canonical values. The typical error in the calculated excited state normal coordinate displacements is of the order of 0.01, and the relative error in the calculated excited state vibrational frequencies is less than 1%. The errors introduced by the RIJCOSX approximation are, thus, insignificant compared to the errors related to the approximate nature of the TDDFT methods and basis set truncation. For TDDFT/TDA energy and gradient calculations on Ag-TB2-helicate (156 atoms, 2732 basis functions), it is demonstrated that the COSX algorithm parallelizes almost perfectly (speedup ~26-29 for 30 processors). The exchange-correlation terms also parallelize well (speedup ~27-29 for 30 processors). The solution of the Z-vector equations shows a speedup of ~24 on 30 processors. The parallelization efficiency for the Coulomb terms can be somewhat smaller (speedup ~15-25 for 30 processors), but their contribution to the total calculation time is small. Thus, the parallel program completes a Becke3-Lee-Yang-Parr energy and gradient calculation on the Ag-TB2-helicate in less than 4 h on 30 processors. We also present the necessary extension of the Lagrangian formalism, which enables the calculation of the TDDFT excited state properties in the frozen-core approximation. The algorithms described in this work are implemented into the ORCA electronic structure system.

20.
Two algorithms are presented for parallel direct computation of energies with second-order perturbation theory. Closed-shell MP2 theory as well as the open-shell perturbation theories OPT2(2) and ZAPT2 have been implemented. The algorithms are designed for distributed memory parallel computers. The first algorithm exhibits an excellent load balance and scales well when relatively few processors are used, but a large communication overhead reduces the efficiency for larger numbers of processors. The other algorithm employs very little interprocessor communication and scales well for large systems. In both implementations the memory requirement has been reduced by allowing the two-electron integral transformation to be performed in multiple passes and by distributing the (partially) transformed integrals between processors. Results are presented for systems with up to 327 basis functions. © 1995 John Wiley & Sons, Inc.
