Similar literature
1.
In this paper we study divisible load scheduling in systems with limited memory. Divisible loads are parallel computations which can be divided into independent parts processed in parallel on remote computers, and the part sizes may be arbitrary. The distributed system is a heterogeneous single level tree. The total size of processor memories is too small to accommodate the whole load at any moment of time. Therefore, the load is distributed in many rounds. Memory reservations have a block nature. The problem consists in distributing the load taking into account communication time, computation time, and limited memory buffers so that the whole processing finishes as early as possible. This problem is both combinatorial and algebraic in nature. Therefore, hybrid algorithms are given to solve it. Two algorithms are proposed to solve the combinatorial component. A branch-and-bound algorithm is nearly unusable due to its complexity. Then, a genetic algorithm is proposed with more tractable execution times. For a given solution of the combinatorial part we formulate the solution of the algebraic part as a linear programming problem. An extensive computational study is performed to analyze the impact of various system parameters on the quality of the solutions. From this we were able to draw inferences about the nature of the scheduling problem.

2.
Parallel computation offers a challenging opportunity to speed up the time consuming enumerative procedures that are necessary to solve hard combinatorial problems. Theoretical analysis of such a parallel branch and bound algorithm is very hard and empirical analysis is not straightforward because the performance of a parallel algorithm cannot be evaluated simply by executing the algorithm on a few parallel systems. Among the difficulties encountered are the noise produced by other users on the system, the limited variation in parallelism (the number of processors in the system is strictly bounded) and the waste of resources involved: most of the time, the outcomes of all computations are already known and the only issue of interest is when these outcomes are produced. We will describe a way to simulate the execution of parallel branch and bound algorithms on arbitrary parallel systems in such a way that the memory and CPU requirements are very reasonable. The use of simulation has only minor consequences for the formulation of the algorithm.

3.
The parallel shop and the open shop are two machine environments that have received much attention in the literature of scheduling theory. A common generalization, the open shop with parallel machines, is considered in this paper. Polynomial-time algorithms are presented for obtaining minimum-length preemptive schedules for three cases. Open shops with single-operation machines of equal speed are scheduled with essentially no more difficulty than an ordinary open shop. Open shops with multiple-operation machines of equal speed are scheduled with the aid of a sequence of network flow computations. The general open shop problem with parallel machines of arbitrary speeds can be solved by linear programming, in much the same way as an optimal preemptive schedule can be found for unrelated parallel machines.

4.
In this paper, a processing element (PE) is characterized by its computation bandwidth, I/O bandwidth, and the size of its local memory. In carrying out a computation, a PE is said to be balanced if the computing time equals the I/O time. Consider a balanced PE for some computation. Suppose that the computation bandwidth of the PE is increased by a factor of α relative to its I/O bandwidth. Then when carrying out the same computation the PE will be imbalanced; i.e., it will have to wait for I/O. A standard method of avoiding this I/O bottleneck is to reduce the overall I/O requirement of the PE by increasing the size of its local memory. This paper addresses the question of by how much the PE's local memory must be enlarged in order to restore balance. The following results are shown: For matrix computations such as matrix multiplication and Gaussian elimination, the size of the local memory must be increased by a factor of α^2. For computations such as relaxation on a k-dimensional grid, the local memory must be enlarged by a factor of α^k. For some other computations such as the FFT and sorting, the increase is exponential; i.e., the size of the new memory must be the size of the original memory to the αth power. All these results indicate that to design a balanced PE, the size of its local memory must be increased much more rapidly than its computation bandwidth. This phenomenon seems to be common for many computations where an output may depend on a large subset of the inputs. Implications of these results for some parallel computer architectures are also discussed. One particular result is that to balance an array of p linearly connected PEs for performing matrix computations such as matrix multiplication and matrix triangularization, the size of each PE's local memory must grow linearly with p. Thus, the larger the array is, the larger each PE's local memory must be.
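The three scaling laws in this abstract can be motivated by a back-of-the-envelope argument. The block below is only a sketch under assumptions not stated in the abstract: M denotes the local memory size, r(M) is the number of operations the PE can perform per word of I/O, and all constants are dropped. Balance is restored once r grows by the same factor α as the computation bandwidth.

```latex
% Sketch only. Assumed notation: M = local memory size,
% r(M) = operations per word of I/O achievable with memory M.
% Balance is restored when r(M_new) = \alpha \, r(M_old).
\begin{align*}
\text{matrix multiplication (blocked):} \quad & r(M) = \Theta(\sqrt{M})
  && \Rightarrow\ M_{\text{new}} = \alpha^{2} M_{\text{old}} \\
\text{relaxation on a $k$-dimensional grid:} \quad & r(M) = \Theta(M^{1/k})
  && \Rightarrow\ M_{\text{new}} = \alpha^{k} M_{\text{old}} \\
\text{FFT, sorting:} \quad & r(M) = \Theta(\log M)
  && \Rightarrow\ M_{\text{new}} = M_{\text{old}}^{\alpha}
\end{align*}
```

In the last row, the condition log M_new = α log M_old is what turns a linear gain in computation bandwidth into memory raised to the αth power.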

5.
This paper describes a collection of parallel optimal control algorithms which are suitable for implementation on an advanced computer with the facility for large-scale parallel processing. Specifically, a parallel nongradient algorithm and a parallel variable-metric algorithm are used to search for the initial costate vector that defines the solution to the optimal control problem. To avoid the computational problems sometimes associated with simultaneous forward integration of both the state and costate equations, a parallel shooting procedure based upon partitioning of the integration interval is considered. To further speed computations, parallel integration methods are proposed. Application of this all-parallel procedure to a forced Van der Pol system indicates that convergence time is significantly less than that required by highly efficient serial procedures. This research was supported in part by the Air Force Office of Scientific Research, Air Force Systems Command, USAF, under Grant No. AFOSR-77-3418.

6.
Calcium waves are modeled by parabolic partial differential equations, whose simulation codes contain Krylov subspace methods as computational kernels. This paper presents GPU-based parallel computations for the conjugate gradient method applied to the finite difference discretization of a Poisson equation as prototype problem for the computational kernel. The CUDA algorithm tests the three memory systems of global memory, texture memory, and shared memory of a CUDA-enabled GPU. Due to the caching mechanism and coalesced read/write operations, the CUDA algorithm using global memory and single precision floating point numbers outperforms algorithms accessing texture memory and the shared memory. (© 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim)
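To make the computational kernel in this abstract concrete, here is a minimal serial sketch of the conjugate gradient method on a finite difference Poisson problem. It is not the paper's CUDA code: the one-dimensional grid, the zero Dirichlet boundaries, the right-hand side, and all function names are assumptions chosen to keep the example short.

```python
import math

def apply_poisson_1d(u, h):
    """Matrix-free product A @ u for the stencil (-1, 2, -1) / h^2
    with zero Dirichlet boundary values (assumed setup)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite operator apply_a."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the start vector x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if math.sqrt(rs_old) < tol:
            return x, iteration
        ap = apply_a(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter

def solve_demo(n=31):
    """Solve -u'' = pi^2 sin(pi x) on (0, 1); the exact solution is sin(pi x)."""
    h = 1.0 / (n + 1)
    grid = [(i + 1) * h for i in range(n)]
    f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in grid]
    u, iters = conjugate_gradient(lambda v: apply_poisson_1d(v, h), f)
    err = max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, grid))
    return u, iters, err
```

The operator is applied matrix-free, so the only large objects are a handful of vectors. On a GPU, the stencil application and the dot products are the parts whose memory access pattern matters.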

7.
The neural networks of the human brain act as very efficient parallel processing computers coordinating memory-related responses to a multitude of input signals from sensory organs. Information storage, update and appropriate retrieval are controlled at the molecular level by the neuronal cytoskeleton which serves as the internal communication network within neurons. Information flow in the highly ordered parallel networks of the filamentous protein polymers which make up the cytoskeleton may be compared to atmospheric flows which exhibit long-range spatiotemporal correlations, i.e., long-term memory. Such long-range spatiotemporal correlations are ubiquitous in real-world dynamical systems and have recently been identified as a signature of self-organized criticality or chaos. The signatures of self-organized criticality, i.e., long-range temporal correlations, have recently been identified in the electrical activity of the brain. The physics of self-organized criticality or chaos is not yet identified. A recently developed non-deterministic cell dynamical system model for atmospheric flows predicts the observed long-range spatiotemporal correlations as intrinsic to quantum-like mechanics governing flow dynamics. The model visualises large scale circulations to form as the result of spatial integration of enclosed small scale perturbations with intrinsic two-way ordered energy flow between the scales. Such a concept may be applied for the collection and integration of a multitude of signals at the cytoskeletal level and manifested in activation of neurons in the macroscale. The cytoskeleton networks inside neurons may be the elementary units of a unified dynamic memory circulation network with intrinsic global response to local stimuli. A cell dynamical system model for a human memory circulation network analogous to the atmospheric circulation network is presented in this paper. The model, like the analysis of Koruga et al., makes use of certain connections to the concept of Cantorian-Fractal spacetime.

8.
An automated general-purpose method is introduced for computing a rigorous estimate of a bounded region in ℝ^n whose points satisfy a given property. The method is based on calculations conducted in interval arithmetic and the constructed approximation is built of rectangular boxes of variable sizes. An efficient strategy is proposed, which makes use of parallel computations on multiple machines and refines the estimate gradually. It is proved that under certain assumptions the result of computations converges to the exact result as the precision of calculations increases. The time complexity of the algorithm is analyzed, and the effectiveness of this approach is illustrated by constructing a lower bound on the set of parameters for which an overcompensatory nonlinear Leslie population model exhibits more than one attractor, which is of interest from the biological point of view. This paper is accompanied by efficient and flexible software written in C++ whose source code is freely available at .
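The idea of covering a region with boxes of variable sizes, each classified by interval arithmetic, can be illustrated with a toy example. The sketch below is not the paper's software: the property (x^2 + y^2 <= 1), the starting box, and every function name are assumptions, and plain Python floats do not perform the outward rounding that a truly rigorous implementation needs. The parallel distribution of boxes over machines is left out.

```python
def square_interval(lo, hi):
    """Enclosure of {t * t : lo <= t <= hi}."""
    if lo >= 0.0:
        return lo * lo, hi * hi
    if hi <= 0.0:
        return hi * hi, lo * lo
    return 0.0, max(lo * lo, hi * hi)

def classify(box):
    """Decide whether every point, no point, or possibly some point
    of the box satisfies x^2 + y^2 <= 1 (hypothetical test property)."""
    (x_lo, x_hi), (y_lo, y_hi) = box
    sx = square_interval(x_lo, x_hi)
    sy = square_interval(y_lo, y_hi)
    lo, hi = sx[0] + sy[0], sx[1] + sy[1]
    if hi <= 1.0:
        return "inside"
    if lo > 1.0:
        return "outside"
    return "unknown"

def split(box):
    """Bisect the box in both coordinates, giving four children."""
    (x_lo, x_hi), (y_lo, y_hi) = box
    x_mid = 0.5 * (x_lo + x_hi)
    y_mid = 0.5 * (y_lo + y_hi)
    return [
        ((x_lo, x_mid), (y_lo, y_mid)),
        ((x_lo, x_mid), (y_mid, y_hi)),
        ((x_mid, x_hi), (y_lo, y_mid)),
        ((x_mid, x_hi), (y_mid, y_hi)),
    ]

def refine(box, max_depth):
    """Return (boxes proven inside, boxes still undecided at max_depth)."""
    inside, unknown = [], []
    stack = [(box, 0)]
    while stack:
        current, depth = stack.pop()
        verdict = classify(current)
        if verdict == "inside":
            inside.append(current)
        elif verdict == "unknown":
            if depth == max_depth:
                unknown.append(current)
            else:
                stack.extend((child, depth + 1) for child in split(current))
    return inside, unknown

def area(boxes):
    return sum((xh - xl) * (yh - yl) for (xl, xh), (yl, yh) in boxes)
```

Calling `refine(((-2.0, 2.0), (-2.0, 2.0)), 6)` gives an inner set whose area is a lower bound on the area of the unit disk, while the inner and undecided boxes together give an upper bound. Raising `max_depth` tightens both, which mirrors the gradual refinement the abstract describes.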

9.
《Journal of Complexity》1988,4(2):87-105
Under the systolic communication model, each cell (or processor) in a parallel processing system can operate directly on data residing at the cell's input queues and move computed results directly to the cell's output queues. Incoming and outgoing data need not be stored in the cell's local memory, if not required by the computation. By avoiding these local memory accesses, systolic communication can achieve high efficiency when executing many systolic algorithms. Though efficient, systolic communication may lead to deadlocks at run time if data arriving at a cell's input queues are improperly ordered. This paper describes the nature of this deadlock problem, gives an abstract formulation of the problem, and provides a deadlock avoidance strategy.

10.
In this study we introduce strategies for a load-balanced parallelization of sparse matrix computations on a cluster of PCs with minimum communication overhead. Based on these strategies a parallel sparse Conjugate Gradient Algorithm for CFD computations is evolved. The proposed parallel algorithm is implemented on Anu-cluster, a cluster of eight PCs, under the ANULIB message-passing environment. The parallel sparse code is tested on both linear and non-linear problems and is found to give good performance. Results are compared with those from dense matrix computations.

11.
The design of efficient algorithms for large-scale gas dynamics computations with hybrid (heterogeneous) computing systems whose high performance relies on massively parallel accelerators is addressed. A high-order accurate finite volume algorithm with polynomial reconstruction on unstructured hybrid meshes is used to compute compressible gas flows in domains of complex geometry. The basic operations of the algorithm are implemented in detail for massively parallel accelerators, including AMD and NVIDIA graphics processing units (GPUs). Major optimization approaches and a computation transfer technique are covered. The underlying programming tool is the Open Computing Language (OpenCL) standard, which performs on accelerators of various architectures, both existing and emerging.

12.
We present parallel lightweight algorithms to construct wavelet trees, rank and select structures, and suffix arrays in a shared-memory setting. The work and depth of our first parallel wavelet tree algorithm match those of the best existing parallel algorithm while requiring asymptotically less memory and our second algorithm achieves the same asymptotic bounds for small alphabet sizes. Our experiments show that they are both faster and more memory-efficient than existing parallel algorithms. We also present an experimental evaluation of the parallel construction of rank and select structures, which are used in wavelet trees. Next, we design the first parallel suffix array algorithm based on induced copying. Our induced copying requires linear work and polylogarithmic depth for constant alphabet sizes. When combined with a parallel prefix doubling algorithm, it is more efficient in practice both in terms of running time and memory usage compared to existing parallel implementations. As an application, we combine our algorithms to build the FM-index in parallel.
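For readers unfamiliar with prefix doubling, the following is a small sequential sketch of the general technique. It is not the authors' parallel algorithm and does not include induced copying; the function name and the use of Python's built-in sort are assumptions made for illustration.

```python
def suffix_array_prefix_doubling(text):
    """Build the suffix array of text by sorting suffixes on their first
    k characters and doubling k each round (sequential sketch)."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # round 0: rank by first character
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Pair of ranks: first k characters, then the next k characters.
            # A suffix shorter than k characters sorts before any longer one.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: order is final
            return sa
        k *= 2
```

Each round sorts on pairs of ranks from the previous round, so the number of rounds is logarithmic in the text length. Parallel variants typically parallelize the sorting step within a round.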

13.
The paper presents a parallel direct solver for multi-physics problems. The solver is dedicated to solving problems resulting from adaptive finite element method computations. The concept of finite element is actually replaced by the concept of the node. The computational mesh consists of several nodes, related to element vertices, edges, faces and interiors. The ordering of unknowns in the solver is performed on the level of nodes. The concept of the node can be efficiently utilized in order to recognize unknowns that can be eliminated at a given node of the elimination tree. The solver is tested on an exemplary three-dimensional multi-physics problem involving the computations of linear acoustics coupled with linear elasticity. The three-dimensional tetrahedral mesh generation and the solver algorithm are modeled by using graph grammar formalism. The execution time and the memory usage of the solver are compared with the MUMPS solver.

14.
Data partitioning and load balancing are important components of parallel computations. Many different partitioning strategies have been developed, with great effectiveness in parallel applications. But the load-balancing problem is not yet solved completely; new applications and architectures require new partitioning features. Existing algorithms must be enhanced to support more complex applications. New models are needed for non-square, non-symmetric, and highly connected systems arising from applications in biology, circuits, and materials simulations. Increased use of heterogeneous computing architectures requires partitioners that account for non-uniform computing, network, and memory resources. And, for greatest impact, these new capabilities must be delivered in toolkits that are robust, easy-to-use, and applicable to a wide range of applications. In this paper, we discuss our approaches to addressing these issues within the Zoltan Parallel Data Services toolkit.

15.
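Parallel-batch scheduling originated in semiconductor chip manufacturing. In parallel-batch scheduling, jobs can be processed in batches: a batch-processing machine can process at most B jobs simultaneously, and the processing time of a batch equals the longest processing time among the jobs in the batch. We first classify and review existing results on parallel-batch scheduling according to the traditional machine environments and objective functions, mainly the single-machine and parallel-machine environments and the objectives of minimizing the makespan, the total completion time, the maximum lateness, the number of tardy jobs, the total tardiness, and the maximum tardiness. We then survey 16 classes of new parallel-batch scheduling models with novel features derived from the basic problems, including jobs with non-identical sizes, multiple objectives, restrictions on job processing times or processing order, cost considerations, and special mechanisms. Finally, we outline directions for future research.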
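The batching model in this abstract (at most B jobs per batch, batch time equal to the longest job in it) is easy to state in code. The sketch below is an illustration, not taken from the survey: it applies the full-batch longest-processing-time rule, which is commonly cited for minimizing the makespan on a single batch machine when all jobs are available at time zero and have unit size. The function names and the sample data are hypothetical.

```python
def fblpt_batches(processing_times, capacity):
    """Sort jobs by non-increasing processing time and fill batches of
    size `capacity` (B in the abstract) from the front of that order."""
    ordered = sorted(processing_times, reverse=True)
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]

def makespan(batches):
    """On a single machine, batches run one after another and each batch
    takes as long as its longest job."""
    return sum(max(batch) for batch in batches)
```

For example, `fblpt_batches([3, 7, 2, 9, 4], 2)` groups the jobs as `[[9, 7], [4, 3], [2]]`, and `makespan` of that grouping is 15. Grouping long jobs together keeps each long job from inflating the duration of a batch of short ones.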

16.
This paper extends the authors' parallel nested dissection algorithm of [13] originally devised for solving sparse linear systems. We present a class of new applications of the nested dissection method, this time to path algebra computations (in both cases of single source and all pair paths), where the path algebra problem is defined by a symmetric matrix A whose associated graph G with n vertices is planar. We substantially improve the known algorithms for path algebra problems of that general class; this has further applications to maximum flow and minimum cut problems in an undirected planar network and to the feasibility testing of a multicommodity flow in a planar network.

17.
We describe a distributed memory parallel Delaunay refinement algorithm for simple polyhedral domains whose constituent bounding edges and surfaces are separated by angles between 90° and 270° inclusive. With these constraints, our algorithm can generate meshes containing tetrahedra with circumradius-to-shortest-edge ratio less than 2, and can tolerate more than 80% of the communication latency caused by unpredictable and variable remote gather operations.

Our experiments show that the algorithm is efficient in practice, even for certain domains whose boundaries do not conform to the theoretical limits imposed by the algorithm. The algorithm we describe is the first step in the development of much more sophisticated guaranteed-quality parallel mesh generation algorithms.


18.
Good performance of parallel finite element computations on unstructured meshes requires high-quality mesh partitioning. Such a decomposition task is normally done by a graph-based partitioning approach. However, the main shortcoming of graph partitioning algorithms is that minimizing the so-called edge cut is not entirely the same as minimizing the communication overhead. This paper thus proposes a unified framework of multi-objective cost functions, which take into account several factors that are not captured by the graph-based partitioning approach. Freely adjustable weighting parameters in the framework also promote a flexible treatment of different optimization objectives. A greedy-style post-improvement procedure is designed to use these cost functions to improve the quality of subdomain meshes arising from the graph-based partitioning approach. Both serial and parallel implementations of the post-improvement procedure have been developed. Numerical experiments show that communication overhead can indeed be reduced by this improvement procedure, thereby increasing the performance of parallel finite element computations.

19.
We present a new variant of the suffix tree called a distributed suffix tree (DST) which allows for much larger databases of strings to be handled efficiently. The method is based on a new linear time construction algorithm for subtrees of a suffix tree. The new data structure tackles the memory bottleneck problem by constructing these subtrees independently and in parallel. It is designed for distributed memory parallel computing environments (e.g., Beowulf clusters). The central advantage is that standard operations of biological importance on suffix trees are shown to be easily translatable to this new data structure. While none of these operations on the DST require inter-process communication, many have optimal expected parallel running times.

20.
Tiling optimization for the solution of the Dirichlet problem for the two-dimensional heat equation on computers with distributed memory is investigated. Estimates of the amount of communications and computations are obtained. The tiling optimization problem is reduced to the minimization of a function that explicitly expresses the dependence of the execution time on the tile size and the parameters of the target supercomputer: the dimension and size of the computing environment, processor performance, initialization time, and capacity of the communication channels.
