Similar articles (20 found)
1.
Summary Eigensolving (diagonalizing) small dense matrices threatens to become a bottleneck in the application of massively parallel computers to electronic structure methods. Because the computational cost of electronic structure methods typically scales as O(N^3) or worse, even teraflop computer systems with thousands of processors will often confront problems with N on the order of 10,000. At present, diagonalizing an N×N matrix on P processors is not efficient when P is large compared to N. The loss of efficiency can make diagonalization a bottleneck on a massively parallel computer, even though it is typically a minor operation on conventional serial machines. This situation motivates a search for both improved methods and identification of the computer characteristics that would be most productive to improve. In this paper, we compare the performance of several parallel and serial methods for solving dense real symmetric eigensystems on a distributed-memory message-passing parallel computer. We focus on matrices of size N = 200 and processor counts P = 1 to P = 512, with execution on the Intel Touchstone DELTA computer. The best eigensolver method is found to depend on the number of available processors. Of the methods tested, a recently developed Blocked Factored Jacobi (BFJ) method is the slowest for small P but the fastest for large P. Its speed is a complicated, non-monotonic function of the number of processors used. A detailed performance analysis of the BFJ method shows that: (1) the factor most responsible for limited speedup is communication startup cost; (2) with current communication costs, the maximum achievable parallel speedup is modest (one order of magnitude) compared to the best serial method; and (3) the fastest solution is often achieved by using fewer than the maximum number of available processors. Pacific Northwest Laboratory is operated for the U.S. Department of Energy (DOE) by Battelle Memorial Institute under contract DE-AC06-76RLO 1830.
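The BFJ implementation itself is not given in this listing, but a minimal serial cyclic-Jacobi sweep (NumPy, illustrative sizes and tolerances) shows the rotation kernel that blocked and factored Jacobi variants parallelize: disjoint pivot pairs (p, q) can be rotated concurrently, which is what makes the approach attractive at large processor counts despite its slower serial speed.

```python
# Minimal serial cyclic-Jacobi eigensolver sketch (illustrative, not the BFJ code).
import numpy as np

def jacobi_eigensolve(A, tol=1e-12, max_sweeps=50):
    """Diagonalize a dense real symmetric matrix by cyclic Jacobi rotations."""
    A = A.copy()
    n = A.shape[0]
    V = np.eye(n)                          # accumulates eigenvectors
    for _ in range(max_sweeps):
        off = np.sqrt(np.sum(np.tril(A, -1) ** 2))
        if off < tol:                      # off-diagonal norm small enough
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < tol:
                    continue
                # rotation angle that zeroes A[p, q]
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J            # parallel Jacobi methods apply rotations
                V = V @ J                  # for disjoint (p, q) pairs concurrently
    return np.diag(A), V

A = np.random.rand(6, 6); A = 0.5 * (A + A.T)
w, V = jacobi_eigensolve(A)
print(np.allclose(np.sort(w), np.linalg.eigvalsh(A)))
```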

2.
3.
A parallel implementation of the conventionally used NDDO (MNDO, AM1, PM3, CLUSTER‐Z1) and modified NDDO‐WF (CLUSTER‐Z2) techniques for semiempirical quantum chemical calculations of large molecular systems in the sp‐ and spd‐basis, respectively, is described. The atom‐pair distribution of data over processors forms the basis of the parallelization. The technological aspects of designing scalable parallel calculations on supercomputers (using the ScaLAPACK and MPI libraries) are discussed. The scaling of the individual algorithms and of the entire package was examined for model systems with 894, 1920, and 2014 atomic orbitals. The speed‐up provided by different multiprocessor systems, including a cluster of Intel PIII processors, the Alpha‐21264‐based machine MBC‐1000M, and a Cray T3E, is analyzed, and the effect of computer characteristics on package performance is discussed. © 2002 Wiley Periodicals, Inc. Int J Quantum Chem, 2002

4.
Communication algorithms, tailored for molecular dynamics simulation on d-meshes, are evaluated in terms of communication efficiency. It has been shown elsewhere that d-meshes are better than other regular topologies, e.g., hypercubes and standard toroidal 4-meshes, when compared in their diameter and average distance among nodes. Collective communication is needed in molecular dynamics simulation for the distribution of coordinates and calculation and distribution of new energies. We show that both collective communication patterns used in molecular dynamics can be efficiently solved with congestion-free algorithms for all-to-all communication based on store-and-forward routing and routing tables. Our results indicate that d-meshes compete with hypercubes in parallel computers. Therefore d-meshes can also be used as a communication upgrade of existing molecular dynamics simulation platforms and can be successfully applied to perform fast molecular dynamics simulation.
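As a simple illustration of the congestion-free collective pattern (not the paper's d-mesh routing-table algorithm), the sketch below simulates a store-and-forward ring all-gather: in each step every node forwards the block it received in the previous step, and after P−1 steps every node holds all coordinate blocks.

```python
# Illustrative store-and-forward ring all-gather, simulated serially.
import numpy as np

def ring_allgather(local_blocks):
    """Simulate P nodes on a ring; after P-1 forwarding steps every node
    holds all blocks (e.g., all atomic coordinates)."""
    P = len(local_blocks)
    # gathered[r]: blocks node r currently holds, keyed by originating node
    gathered = [{r: local_blocks[r]} for r in range(P)]
    send_buf = list(local_blocks)              # block each node forwards next
    for step in range(P - 1):
        recv_buf = [None] * P
        for r in range(P):                     # node r sends to its right neighbour
            recv_buf[(r + 1) % P] = ((r - step) % P, send_buf[r])
        for r in range(P):
            origin, block = recv_buf[r]
            gathered[r][origin] = block        # store the received block ...
            send_buf[r] = block                # ... and forward it next step
    return gathered

P = 4
coords = [np.random.rand(3, 3) for _ in range(P)]   # 3 atoms per node
full = ring_allgather(coords)
print(all(len(full[r]) == P for r in range(P)))     # every node has every block
```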

5.
Algorithms to enhance parallel performance of molecular dynamics simulations on parallel computers by dynamic load balancing are described. Load balancing is achieved by redistribution of work based on either a history of time spent computing per processor or on the number of pair interactions computed per processor. The two algorithms we detail are designed to yield optimal load balancing on both workstation clusters and parallel supercomputers. We illustrate these methods using a small molecular dynamics kernel developed for the simulation of rigid molecular solvents. In addition, we discuss our observations regarding global communications performance on workstation clusters with a fiber distributed data interface (FDDI) using a high-speed point-to-point switch (Gigaswitch) and the k-ary 3-cube of the Cray T3D. © 1995 by John Wiley & Sons, Inc.
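A minimal sketch of the timing-history idea, under the assumption that work units (e.g., atom groups) can be moved freely between processors; the function names and numbers are illustrative, not taken from the paper.

```python
# History-based dynamic load balancing: give each processor a share of work
# inversely proportional to its measured cost per work unit.
import numpy as np

def rebalance(counts, times, total_work=None):
    """Given the units each processor handled and the wall time it spent,
    return a new distribution that equalizes the predicted times."""
    counts = np.asarray(counts, dtype=float)
    times = np.asarray(times, dtype=float)
    if total_work is None:
        total_work = counts.sum()
    cost_per_unit = times / counts                # observed cost on each processor
    speed = 1.0 / cost_per_unit                   # units per second
    share = speed / speed.sum()                   # faster processors get more work
    new_counts = np.floor(share * total_work).astype(int)
    new_counts[0] += int(total_work) - new_counts.sum()   # fix rounding remainder
    return new_counts

# Example: 4 processors with equal work but unequal observed times
counts = [250, 250, 250, 250]
times = [1.0, 1.2, 2.0, 0.8]
print(rebalance(counts, times))   # the slow processor gets fewer units next step
```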

6.
With strict detailed balance, parallel Monte Carlo simulation through domain decomposition cannot be validated with conventional Markov chain theory, which describes an intrinsically serial stochastic process. In this work, the parallel version of Markov chain theory and its role in accelerating Monte Carlo simulations via cluster computing is explored. It is shown that sequential updating is the key to improving efficiency in parallel simulations through domain decomposition. A parallel scheme is proposed to reduce interprocessor communication or synchronization, which slows down parallel simulation with increasing number of processors. Parallel simulation results for the two-dimensional lattice gas model show substantial reduction of simulation time for systems of moderate and large size.
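The sequential-updating idea can be illustrated on a toy two-dimensional lattice gas: the lattice is split into two interpenetrating sublattices (a checkerboard), and because sites of one colour never interact with each other, all domains could update that colour concurrently without two interacting sites ever being updated at once. This is a serial emulation under assumed parameters, not the paper's scheme.

```python
# Checkerboard ("sequential updating") Metropolis sweeps for a 2-D lattice gas.
import numpy as np

rng = np.random.default_rng(0)
L, eps, mu, T = 32, 1.0, -2.0, 1.0          # lattice size, coupling, chem. potential, temperature
occ = rng.integers(0, 2, size=(L, L))       # 0/1 occupation numbers

def local_energy(occ, i, j, s):
    """Energy of site (i, j) with occupation s, given its current neighbours."""
    nb = occ[(i+1) % L, j] + occ[(i-1) % L, j] + occ[i, (j+1) % L] + occ[i, (j-1) % L]
    return -eps * s * nb - mu * s

def sweep(occ):
    for colour in (0, 1):                   # black sub-sweep, then white sub-sweep
        for i in range(L):
            for j in range(L):
                if (i + j) % 2 != colour:
                    continue
                s_old, s_new = occ[i, j], 1 - occ[i, j]
                dE = local_energy(occ, i, j, s_new) - local_energy(occ, i, j, s_old)
                if dE <= 0 or rng.random() < np.exp(-dE / T):
                    occ[i, j] = s_new       # accept the occupation flip
    return occ

for _ in range(100):
    sweep(occ)
print("mean coverage:", occ.mean())
```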

7.
Two algorithms are presented for parallel direct computation of energies with second-order perturbation theory. Closed-shell MP2 theory as well as the open-shell perturbation theories OPT2(2) and ZAPT2 have been implemented. The algorithms are designed for distributed memory parallel computers. The first algorithm exhibits an excellent load balance and scales well when relatively few processors are used, but a large communication overhead reduces the efficiency for larger numbers of processors. The other algorithm employs very little interprocessor communication and scales well for large systems. In both implementations the memory requirement has been reduced by allowing the two-electron integral transformation to be performed in multiple passes and by distributing the (partially) transformed integrals between processors. Results are presented for systems with up to 327 basis functions. © 1995 John Wiley & Sons, Inc.
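For orientation, the closed-shell MP2 energy sum can be partitioned over occupied-orbital pairs (i, j), which is the kind of granularity exploited when distributing work and (partially) transformed integrals. The sketch below uses synthetic integrals and a round-robin partition; it is illustrative only and is not either of the paper's algorithms.

```python
# Closed-shell MP2 energy partitioned over occupied pairs (synthetic integrals).
import numpy as np

rng = np.random.default_rng(1)
n_occ, n_vir = 4, 8
eps_occ = np.sort(rng.uniform(-2.0, -0.5, n_occ))
eps_vir = np.sort(rng.uniform(0.2, 2.0, n_vir))
# (ia|jb) in the MO basis, with the required (ia|jb) = (jb|ia) symmetry
g = rng.standard_normal((n_occ, n_vir, n_occ, n_vir))
g = 0.5 * (g + g.transpose(2, 3, 0, 1))

def mp2_pair_energy(i, j):
    """MP2 contribution from occupied pair (i, j)."""
    e = 0.0
    for a in range(n_vir):
        for b in range(n_vir):
            denom = eps_occ[i] + eps_occ[j] - eps_vir[a] - eps_vir[b]
            e += g[i, a, j, b] * (2.0 * g[i, a, j, b] - g[i, b, j, a]) / denom
    return e

# "Parallel" partition: deal the (i, j) pairs round-robin to n_proc workers
n_proc = 3
pairs = [(i, j) for i in range(n_occ) for j in range(n_occ)]
partial = [sum(mp2_pair_energy(i, j) for k, (i, j) in enumerate(pairs) if k % n_proc == p)
           for p in range(n_proc)]
print("E_MP2 =", sum(partial))   # matches the serial sum over all pairs
```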

8.
A coarse-grain parallel implementation of the free energy perturbation (FEP) module of the AMBER molecular dynamics program is described and then demonstrated using five different molecular systems. The difference in the free energy of (aqueous) solvation is calculated for two monovalent cations, ΔΔGaq(Li+ → Cs+), and for the zero-sum ethane-to-ethane′ perturbation ΔΔGaq(CH3–CH3 → CH3–X), where X is a ghost methyl. The difference in binding free energy between a docked HIV-1 protease inhibitor and its ethylene mimetic is examined by mutating its fifth peptide bond, ΔG(CO–NH → CH=CH). A potassium ion (K+) is driven outward from the center of mass of the ionophore salinomycin (SAL−) in a potential of mean force calculation, ΔGMeOH(SAL−·K+), carried out in methanol solvent. The parallel speedup obtained is linearly proportional to the number of parallel processors applied. Finally, the difference in free energy of solvation of phenol versus benzene, ΔΔGoct(phenol → benzene), is determined in water-saturated octanol and then expressed in terms of relative partition coefficients, Δlog(Po/w). Because no interprocessor communication is required, this approach is scalable and applicable in general for any parallel architecture or network of machines. FEP calculations run on the nCUBE/2 using 50 or 100 parallel processors were completed in clock times equivalent to or twice as fast as a Cray Y-MP. The difficulty of ensuring adequate system equilibration when a gradual configurational reorientation follows the mutation of the Hamiltonian is discussed and analyzed. The results of a successful protocol for overcoming this equilibration problem are presented. The types of molecular perturbations for which this method is expected to perform most efficiently are described. © 1994 by John Wiley & Sons, Inc.
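Because each FEP window is an independent simulation, the parallel pattern is essentially "scatter windows, gather free-energy increments." The toy sketch below (a 1-D harmonic oscillator with a perturbed force constant, a Zwanzig exponential-average estimator, and Python multiprocessing) is a hedged illustration of that pattern, not AMBER's implementation.

```python
# Embarrassingly parallel FEP windows: each window runs independently and the
# free-energy increments are summed at the end.
import numpy as np
from multiprocessing import Pool

kT = 1.0

def window_delta_G(args):
    """One window: sample state lam0 exactly, apply the Zwanzig estimator to lam1."""
    lam0, lam1, seed = args
    rng = np.random.default_rng(seed)
    k0, k1 = 1.0 + lam0, 1.0 + lam1                       # perturbed force constants
    x = rng.normal(0.0, np.sqrt(kT / k0), size=200_000)   # exact sampling of state 0
    dU = 0.5 * (k1 - k0) * x**2
    return -kT * np.log(np.mean(np.exp(-dU / kT)))

if __name__ == "__main__":
    lams = np.linspace(0.0, 1.0, 11)
    tasks = [(lams[i], lams[i + 1], i) for i in range(len(lams) - 1)]
    with Pool(4) as pool:                                 # windows need no communication
        dGs = pool.map(window_delta_G, tasks)
    print("Delta G =", sum(dGs))
    # analytic check for this toy system: 0.5*kT*ln(k1/k0) = 0.5*ln 2 ≈ 0.3466
    print("analytic =", 0.5 * kT * np.log(2.0))
```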

9.
Parallel computing seems to be the solution for molecular dynamics of large atomic systems, such as proteins in water environments, but the simulation time critically depends on the processor allocation strategy. A study of the optimal processor allocation based on a space decomposition algorithm for single instruction multiple data flow mesh computers is presented. A particular effort has been made to identify the best criterion according to which the atoms can be allocated to the processors using a spatial decomposition approach. The computing time depends on the granularity of the space decomposition among processing elements and on the ratio between the computation power of processing elements and the communication speed of the interprocessor network. © 1996 by John Wiley & Sons, Inc.
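A minimal sketch of the spatial-decomposition bookkeeping the study is concerned with: atoms are binned into a mesh of cells, cells map to processing elements, and the load imbalance depends on the granularity of the decomposition. The uniformly random "solvated system" and the cell counts are assumptions for illustration.

```python
# Spatial decomposition: bin atoms into cells and measure load imbalance
# as a function of granularity.
import numpy as np

rng = np.random.default_rng(2)
n_atoms, box = 20_000, 50.0
pos = rng.uniform(0.0, box, size=(n_atoms, 3))

def decompose(pos, cells_per_side):
    """Assign each atom to a cell of a cells_per_side^3 mesh and report balance."""
    idx = np.floor(pos / box * cells_per_side).astype(int)
    idx = np.clip(idx, 0, cells_per_side - 1)
    flat = (idx[:, 0] * cells_per_side + idx[:, 1]) * cells_per_side + idx[:, 2]
    counts = np.bincount(flat, minlength=cells_per_side**3)
    return counts.max() / counts.mean()     # load imbalance: max / mean atoms per cell

for g in (2, 4, 8):                          # coarser vs. finer granularity
    print(f"{g}^3 cells: imbalance = {decompose(pos, g):.2f}")
```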

10.
In this article a procedure is derived to obtain a performance gain for molecular dynamics (MD) simulations on existing parallel clusters. Parallel clusters connect multiple processors through a wide array of interconnection technologies that often operate at different speeds, such as the links inside multiprocessor computers and the network between them. It is demonstrated how to configure existing MD simulation programs to handle collective communication efficiently on parallel clusters whose processor interconnections have different speeds.

11.
This article describes an extension to previously developed constraint techniques. These enhanced constraint methods will enable the study of large computational chemistry problems that cannot be easily handled with current constrained molecular dynamics (MD) methods. These methods are based on an O(N) solution to the constrained equations of motion. The benefits of this approach are that (1) the system constraints are solved exactly at each time step, (2) the solution algorithm is noniterative, (3) the algorithm is recursive and scales as O(N), (4) the algorithm is numerically stable, (5) the algorithm is highly amenable to parallel processing, and (6) potentially greater integration step sizes are possible. It is anticipated that application of this methodology will provide a 10- to 100-fold improvement in the speed of a large molecular trajectory as compared with the time required to run a conventional atomistic unconstrained simulation. It is, therefore, anticipated that this methodology will provide an enabling capacity for pursuing the drug discovery process for large molecular systems. © 1995 John Wiley & Sons, Inc.

12.
We investigate and test an algorithm suitable for the parallel calculation of the potential energy of a protein, or its spatial gradient, when the protein atoms interact via pair potentials. This algorithm is similar to one previously proposed, but it is more efficient, having half the interprocessor communications costs. For a given protein, we show that there is an optimal number of processors that gives a maximum speedup of the potential energy calculation compared to a sequential machine. (Using more than the optimum number of processors actually increases the computation time.) With the optimum number the computation time is proportional to the protein size N. This is a considerable improvement in performance compared to sequential machines, where the computation time is proportional to N^2. We also show that the dependence of the maximum speedup on the message latency time is relatively weak.
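A back-of-the-envelope performance model (all constants below are assumptions, not the paper's measurements) reproduces the qualitative conclusion: splitting the N(N−1)/2 pair interactions over p processors shrinks compute time as 1/p, while a latency-dominated reduction of the partial results grows with p, so the total time has an interior minimum at some optimal processor count.

```python
# Toy performance model: pair work divided by p, plus a log2(p)-step reduction
# of the partial forces whose cost includes a per-message latency.
import math

def run_time(N, p, t_pair=1e-7, latency=5e-5, per_byte=1e-8):
    compute = 0.5 * N * (N - 1) * t_pair / p             # this processor's pairs
    steps = math.ceil(math.log2(p)) if p > 1 else 0
    comm = steps * (latency + per_byte * 3 * 8 * N)       # reduce N 3-D force vectors
    return compute + comm

N = 5000
times = {p: run_time(N, p) for p in (1, 4, 16, 64, 256, 1024, 4096)}
best = min(times, key=times.get)
print("speedup vs 1 processor:", {p: round(times[1] / t, 1) for p, t in times.items()})
print("optimal processor count:", best)   # beyond this, communication dominates
```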

13.
A parallel algorithm for efficient calculation of the second derivatives (Hessian) of the conformational energy in internal coordinates is proposed. This parallel algorithm is based on the master/slave model. A master processor distributes the calculations of components of the Hessian to one or more slave processors that, after finishing their calculations, send the results to the master processor that assembles all the components of the Hessian. Our previously developed molecular analysis system for conformational energy optimization, normal mode analysis, and Monte Carlo simulation for internal coordinates is extended to use this parallel algorithm for Hessian calculation on a massively parallel computer. The implementation of our algorithm uses the message passing interface and works effectively on both distributed-memory parallel computers and shared-memory parallel computers. We applied this system to the Newton–Raphson energy optimization of the structures of glutaminyl transfer RNA (Gln-tRNA) with 74 nucleotides and glutaminyl-tRNA synthetase (GlnRS) with 540 residues to analyze the performance of our system. The parallel speedups for the Hessian calculation were 6.8 for Gln-tRNA with 24 processors and 11.2 for GlnRS with 54 processors. The parallel speedups for the Newton–Raphson optimization were 6.3 for Gln-tRNA with 30 processors and 12.0 for GlnRS with 62 processors. © 1998 John Wiley & Sons, Inc. J Comput Chem 19: 1716–1723, 1998
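A hedged sketch of the master/slave pattern, using a toy energy function and finite differences in place of analytic internal-coordinate second derivatives: the master deals out Hessian columns, slave processes compute them, and the master assembles the full matrix.

```python
# Master/slave Hessian assembly with a toy energy and finite differences.
import numpy as np
from multiprocessing import Pool

def energy(x):
    """Toy conformational energy (a coupled quartic); stands in for the real one."""
    return np.sum(0.5 * x**2) + 0.1 * np.sum(x[:-1] * x[1:]) + 0.01 * np.sum(x**4)

def grad(x, h=1e-5):
    """Central-difference gradient of the toy energy."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (energy(x + e) - energy(x - e)) / (2 * h)
    return g

def hessian_column(args):
    """Slave task: second derivatives d2E/dxi dxj for one column j."""
    x, j, h = args
    ej = np.zeros_like(x); ej[j] = h
    return j, (grad(x + ej) - grad(x - ej)) / (2 * h)

if __name__ == "__main__":
    x0 = np.random.default_rng(3).standard_normal(12)
    tasks = [(x0, j, 1e-4) for j in range(len(x0))]      # one task per column
    with Pool(4) as pool:                                # the "slave" processes
        cols = pool.map(hessian_column, tasks)
    H = np.zeros((len(x0), len(x0)))
    for j, col in cols:                                  # master assembles the matrix
        H[:, j] = col
    print("symmetric:", np.allclose(H, H.T, atol=1e-4))
```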

14.
Classical molecular dynamics simulations of biological macromolecules in explicitly modeled solvent typically require the evaluation of interactions between all pairs of atoms separated by no more than some distance R, with more distant interactions handled using some less expensive method. Performing such simulations for periods on the order of a millisecond is likely to require the use of massive parallelism. The extent to which such simulations can be efficiently parallelized, however, has historically been limited by the time required for interprocessor communication. This article introduces a new method for the parallel evaluation of distance-limited pairwise particle interactions that significantly reduces the amount of data transferred between processors by comparison with traditional methods. Specifically, the amount of data transferred into and out of a given processor scales as O(R^(3/2) p^(−1/2)), where p is the number of processors, and with constant factors that should yield a substantial performance advantage in practice.

15.
Many systems of great importance in material science, chemistry, solid-state physics, and biophysics require forces generated from an electronic structure calculation, as opposed to an empirically derived force law, to describe their properties adequately. The use of such forces as input to Newton's equations of motion forms the basis of the ab initio molecular dynamics method, which is able to treat the dynamics of chemical bond-breaking and -forming events. However, a very large number of electronic structure calculations must be performed to compute an ab initio molecular dynamics trajectory, making the efficiency as well as the accuracy of the electronic structure representation critical issues. One efficient and accurate electronic structure method is the generalized gradient approximation to the Kohn-Sham density functional theory implemented using a plane-wave basis set and atomic pseudopotentials. The marriage of the gradient-corrected density functional approach with molecular dynamics, as pioneered by Car and Parrinello (R. Car and M. Parrinello, Phys Rev Lett 1985, 55, 2471), has been demonstrated to be capable of elucidating the atomic scale structure and dynamics underlying many complex systems at finite temperature. However, despite the relative efficiency of this approach, it has not been possible to obtain parallel scaling of the technique beyond several hundred processors on moderately sized systems using standard approaches. Consequently, the time scales that can be accessed and the degree of phase space sampling are severely limited. To take advantage of next generation computer platforms with thousands of processors such as IBM's BlueGene, a novel scalable parallelization strategy for Car-Parrinello molecular dynamics is developed using the concept of processor virtualization as embodied by the Charm++ parallel programming system. Charm++ allows the diverse elements of a Car-Parrinello molecular dynamics calculation to be interleaved with low latency such that unprecedented scaling is achieved. As a benchmark, a system of 32 water molecules, a common system size employed in the study of the aqueous solvation and chemistry of small molecules, is shown to scale on more than 1500 processors, which is impossible to achieve using standard approaches. This degree of parallel scaling is expected to open new opportunities for scientific inquiry.

16.
We present a method of parallelizing flat histogram Monte Carlo simulations, which give the free energy of a molecular system as an output. In the serial version, a constant probability distribution, as a function of any system parameter, is obtained by updating an external potential that is added to the system Hamiltonian. This external potential is related to the free energy. In the parallel implementation, the simulation is distributed onto different processors. At regular intervals the modifying potential is summed over all processors and distributed back to every processor, thus spreading information about which parts of parameter space have been explored. This implementation is shown to decrease the execution time linearly with the number of processors added.
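The parallel flat-histogram idea can be illustrated on a toy one-dimensional system (this is not the paper's code, and all parameters are assumed): several walkers build up a modifying potential that flattens their sampled histogram, and at regular intervals their locally accumulated updates are summed and redistributed so each walker benefits from what the others explored. The converged bias tracks the reduced potential along the chosen parameter up to a constant, which is how the free energy is read off.

```python
# Parallel flat-histogram sketch: walkers share a modifying potential eta(x)
# that is periodically summed over "processors" and redistributed.
import numpy as np

rng = np.random.default_rng(4)
n_bins, beta, f = 20, 1.0, 0.01
U = 2.0 * np.cos(2 * np.pi * np.arange(n_bins) / n_bins)   # toy potential over bins

n_walkers, sync_interval, n_blocks = 4, 200, 400
eta_global = np.zeros(n_bins)                 # shared modifying potential
pos = rng.integers(0, n_bins, n_walkers)

for block in range(n_blocks):
    eta_local = np.zeros((n_walkers, n_bins)) # updates accumulated since last sync
    for step in range(sync_interval):
        for w in range(n_walkers):
            eta = eta_global + eta_local[w]   # this walker's current bias
            trial = (pos[w] + rng.choice([-1, 1])) % n_bins
            dW = beta * (U[trial] - U[pos[w]]) + (eta[trial] - eta[pos[w]])
            if dW <= 0 or rng.random() < np.exp(-dW):
                pos[w] = trial
            eta_local[w, pos[w]] += f         # flat-histogram update at current bin
    eta_global += eta_local.sum(axis=0)       # sum over processors and redistribute

est = -(eta_global - eta_global.mean())       # recovered free-energy profile
print(np.corrcoef(est, beta * (U - U.mean()))[0, 1])   # close to 1: bias mirrors beta*U
```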

17.
We present a comparison between two different approaches to parallelizing the grand canonical Monte Carlo simulation technique (GCMC) for classical fluids: a spatial decomposition and a time decomposition. The spatial decomposition relies on the fact that for short-ranged fluids, such as the cut and shifted Lennard-Jones potential used in this work, atoms separated by a greater distance than the reach of the potential act independently, and thus different processors can work concurrently in regions of the same system which are sufficiently far apart. The time decomposition is an exactly parallel approach which employs simultaneous GCMC simulations, one per processor, identical in every respect except the initial random number seed, with the thermodynamic output variables averaged across all processors. While scaling characteristics for the spatial decomposition are presented for 8–1024 processor systems, the comparison between the two decompositions is limited to the 8–128 processor range due to the warm-up time and memory limitations of the time decomposition. Using a combination of speed and statistical efficiency, the two algorithms are compared at two different state points. While the time decomposition reaches a given value of standard error in the system's potential energy more quickly than the spatial decomposition for both densities, the warm-up time demands of the time decomposition quickly become insurmountable as the system size increases. © 1996 by John Wiley & Sons, Inc.
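The time decomposition is the simpler of the two to sketch: P replicas of the same simulation, differing only in random seed, run with no communication, and their outputs are averaged. The toy Metropolis example below (not GCMC, with made-up parameters) shows the pattern and why the per-replica warm-up cost becomes the limiting factor as P grows.

```python
# Time decomposition: P independent replicas, averaged thermodynamic output.
import numpy as np

def replica_average(seed, n_warmup=2_000, n_sample=8_000):
    """One replica: Metropolis sampling of U(x) = (x^2 - 1)^2, returns <x^2>."""
    rng = np.random.default_rng(seed)
    U = lambda y: (y * y - 1.0) ** 2
    x, samples = 0.0, []
    for step in range(n_warmup + n_sample):       # every replica repeats the warm-up
        trial = x + rng.normal(0.0, 0.5)
        if rng.random() < np.exp(-(U(trial) - U(x))):
            x = trial
        if step >= n_warmup:
            samples.append(x * x)
    return np.mean(samples)

for P in (2, 8, 32):                              # one replica per "processor"
    vals = [replica_average(seed) for seed in range(P)]
    print(P, "replicas:", np.mean(vals), "+/-", np.std(vals) / np.sqrt(P))
```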

18.
Based on the molecular dynamics software package CovalentMD 2.0, the fastest molecular dynamics simulation of covalent crystalline silicon with bond-order potentials has been implemented on the Sunway TaihuLight supercomputer, the third-fastest machine in the world before June 2019. The simulation reached 16.0 Pflops (10^15 floating-point operations per second) in double precision for crystalline silicon, a record for rigorous atomistic simulation of covalent materials. The simulations used up to 160,768 64-core processors, nearly 10.3 million cores in total, to simulate more than 137 billion silicon atoms, with a parallel efficiency of over 80% on the whole machine. The running performance on a single processor reached 15.1% of its theoretical peak at best. The longitudinal dimension of the simulated system is far beyond the range with scale-dependent properties, while the lateral dimension significantly exceeds the experimentally measurable range. Our simulation enables virtual experiments on real-world nanostructured materials and devices, predicting macroscale properties and behaviors directly from microscale structures, and opens new possibilities in nanotechnology, information technology, electronics, renewable energy, and related fields. © 2019 Wiley Periodicals, Inc.

19.
The optimization of atomic and molecular clusters with a large number of atoms is a very challenging topic. This article proposes a parallel differential evolution (DE) optimization scheme for large‐scale clusters. It combines a modified DE algorithm with improved genetic operators and a parallel strategy with a migration operator to address the problems of numerous local optima and heavy computational demands. Results for Lennard–Jones (LJ) clusters and Gupta‐potential Co clusters show that the algorithm surpasses previous approaches in success rate, convergence speed, and global searching ability. The overall performance for large or challenging LJ clusters is enhanced significantly. The average number of local minimizations per hit of the global minimum for Co clusters is only about 3–4% of that in previous methods. Some global optima for Co are also updated. We then apply the algorithm to optimize Pt clusters with the Gupta potential from size 3 to 130 and analyze their electronic properties by density functional theory calculations. The clusters with 13, 38, 54, 75, 108, and 125 atoms are extremely stable and can be taken as the magic numbers for Pt systems. Interestingly, the more stable structures, especially the magic‐number ones, tend to have a larger energy gap between the highest occupied molecular orbital and the lowest unoccupied molecular orbital. It is also found that the clusters gradually approach the bulk metal for sizes N > 80, and Pt38 is expected to be more active than Pt75 in catalytic reactions. © 2013 Wiley Periodicals, Inc.
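For readers unfamiliar with DE, the sketch below runs a plain DE/rand/1/bin loop on a small Lennard-Jones cluster (LJ7). The paper's modified operators, migration between parallel populations, and local minimizations are not reproduced; the parameters are common DE defaults.

```python
# Basic differential evolution on a 7-atom Lennard-Jones cluster.
import numpy as np

rng = np.random.default_rng(5)
n_atoms, dim = 7, 21                  # LJ7; dim = 3 * n_atoms
pop_size, F, CR, n_gen = 40, 0.5, 0.9, 2000

def lj_energy(x):
    """Lennard-Jones energy of a cluster given flattened coordinates."""
    r = x.reshape(n_atoms, 3)
    d = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    iu = np.triu_indices(n_atoms, 1)
    d = np.clip(d[iu], 0.5, None)     # avoid overflow for overlapping trial atoms
    return float(np.sum(4.0 * (d**-12 - d**-6)))

pop = rng.uniform(-1.5, 1.5, size=(pop_size, dim))
fit = np.array([lj_energy(x) for x in pop])

for gen in range(n_gen):
    for i in range(pop_size):
        a, b, c = rng.choice([k for k in range(pop_size) if k != i], 3, replace=False)
        mutant = pop[a] + F * (pop[b] - pop[c])            # DE/rand/1 mutation
        cross = rng.random(dim) < CR
        cross[rng.integers(dim)] = True                    # at least one gene crosses over
        trial = np.where(cross, mutant, pop[i])
        e = lj_energy(trial)
        if e < fit[i]:                                     # greedy selection
            pop[i], fit[i] = trial, e

print("best LJ7 energy found:", fit.min(), "(global minimum is -16.505)")
```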

20.
Dynamics simulations of molecular systems are notoriously computationally intensive. Using parallel computers for these simulations is important for reducing their turnaround time. In this article we describe a parallelization of the simulation program CHARMM for the Intel iPSC/860, a distributed memory multiprocessor. In the parallelization, the computational work is partitioned among the processors for core calculations including the calculation of forces, the integration of equations of motion, the correction of atomic coordinates by constraint, and the generation and update of data structures used to compute nonbonded interactions. Processors coordinate their activity using synchronous communication to exchange data values. Key data structures used are partitioned among the processors in nearly equal pieces, reducing the memory requirement per node and making it possible to simulate larger molecular systems. We examine the effectiveness of the parallelization in the context of a case study of a realistic molecular system. While effective speedup was achieved for many of the dynamics calculations, other calculations fared less well due to growing communication costs for exchanging data among processors. The strategies we used are applicable to parallelization of similar molecular mechanics and dynamics programs for distributed memory multiprocessors. © 1992 by John Wiley & Sons, Inc.

