期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

On the two-phase method for preemptive scheduling

《European Journal of Operational Research》1988,37(2):227-235

Preemptive scheduling problems on parallel processors may in some cases be solved by a two-phase method: forst solve a LP problem which will give the minimum total completion time T and the processing times of the jobs on the various processors; second construct a feasible schedule using T time units. Some extensions of this procedure are discussed. A general model for preemptive scheduling is described with a two-phase method for solving problems of this type. 相似文献

2.

Parallel algorithms for solving linear systems with block-tridiagonal matrices on multi-core CPU with GPU

Elena N. Akimova Dmitry V. Belousov 《Journal of computational science》2012,3(6):445-449

For solving systems of linear algebraic equations with block-tridiagonal matrices arising in geoelectrics problems, the parallel matrix sweep algorithm, conjugate gradient method with preconditioner, and square root method are proposed and implemented numerically on multi-core CPU Intel with graphics processors NVIDIA. Investigation of efficiency and optimization of parallel algorithms for solving the problem with quasi-model data are performed. 相似文献

3.

Simulating Flows of Viscous Incompressible Fluid on Graphics Processors Using the Splitting Scheme and Multigrid Method

Volkov K. N. Emel’yanov V. N. Karpenko A. G. Teterina I. V. 《Computational Mathematics and Mathematical Physics》2019,59(1):136-149

Computational Mathematics and Mathematical Physics - The use of general purpose graphics processors for the numerical solution of problems in dynamics of viscous incompressible fluid is discussed.... 相似文献

4.

An asynchronous direct solver for banded linear systems

Michael A. Jandron Anthony A. Ruffa James Baglama 《Numerical Algorithms》2017,76(1):211-235

Banded linear systems occur frequently in mathematics and physics. However, direct solvers for large systems cannot be performed in parallel without communication. The aim of this paper is to develop a general asymmetric banded solver with a direct approach that scales across many processors efficiently. The key mechanism behind this is that reduction to a row-echelon form is not required by the solver. The method requires more floating point calculations than a standard solver such as LU decomposition, but by leveraging multiple processors the overall solution time is reduced. We present a solver using a superposition approach that decomposes the original linear system into q subsystems, where q is the number of superdiagonals. These methods show optimal computational cost when q processors are available because each system can be solved in parallel asynchronously. This is followed by a q×q dense constraint matrix problem that is solved before a final vectorized superposition is performed. Reduction to row echelon form is not required by the solver, and hence the method avoids fill-in. The algorithm is first developed for tridiagonal systems followed by an extension to arbitrary banded systems. Accuracy and performance is compared with existing solvers and software is provided in the supplementary material. 相似文献

5.

On diagnosability of large multiprocessor networks

R. Ahlswede H. Aydinian 《Discrete Applied Mathematics》2008,156(18):3464-3474

We consider problems of fault diagnosis in multiprocessor systems. Preparata, Metze and Chien [F.P. Preparata, G. Metze, R.T. Chien, On the connection assignment problem of diagnosable systems, IEEE Trans. Comput. EC 16 (12) (1967) 848-854] introduced a graph theoretical model for system-level diagnosis, in which processors perform tests on one another via links in the system. Fault-free processors correctly identify the status of tested processors, while the faulty processors can give arbitrary test results. The goal is to identify faulty processors based on the test results. A system is said to be t-diagnosable if faulty units can be identified, provided the number of faulty units present does not exceed t. We explore here diagnosis problems for n-cube systems and give bounds for diagnosability of the n-cube. We also describe a simple diagnosis algorithm A which is linear in time and which can be used for sequential diagnosis as well as for incomplete diagnosis in one step. In particular, the algorithm applied to arbitrary topology based interconnection systems G with N processors improves previously known ones. It has sequential diagnosability , which is optimal in the worst case. 相似文献

6.

First fit decreasing scheduling on uniform multiprocessors

Manfred Kunde Horst Steppat 《Discrete Applied Mathematics》1985,10(2):165-177

Independent tasks are nonpreemptively scheduled on m≥2 processors which are assumed to have different speeds. The purpose of this paper is to show that the worst case ratio of the multifit algorithm MF, which is based on the bin-packing method FFD (first fit decreasing), depends on the order of the processors and that the MF has a better worst case behavior than the well-known LPT algorithm for certain processor configurations. 相似文献

7.

Efficient automatic simulation of parallel computation on networks of workstations

Christos Kaklamanis Danny Krizanc Giuseppe Persiano 《Discrete Applied Mathematics》2006,154(10):1500-1509

Andrews et al. [Automatic method for hiding latency in high bandwidth networks, in: Proceedings of the ACM Symposium on Theory of Computing, 1996, pp. 257-265; Improved methods for hiding latency in high bandwidth networks, in: Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 52-61] introduced a number of techniques for automatically hiding latency when performing simulations of networks with unit delay links on networks with arbitrary unequal delay links. In their work, they assume that processors of the host network are identical in computational power to those of the guest network being simulated. They further assume that the links of the host are able to pipeline messages, i.e., they are able to deliver P packets in time O(P+d) where d is the delay on the link.In this paper we examine the effect of eliminating one or both of these assumptions. In particular, we provide an efficient simulation of a linear array of homogeneous processors connected by unit-delay links on a linear array of heterogeneous processors connected by links with arbitrary delay. We show that the slowdown achieved by our simulation is optimal. We then consider the case of simulating cliques by cliques; i.e., a clique of heterogeneous processors with arbitrary delay links is used to simulate a clique of homogeneous processors with unit delay links. We reduce the slowdown from the obvious bound of the maximum delay link to the average of the link delays. In the case of the linear array we consider both links with and without pipelining. For the clique simulation the links are not assumed to support pipelining.The main motivation of our results (as was the case with Andrews et al.) is to mitigate the degradation of performance when executing parallel programs designed for different architectures on a network of workstations (NOW). In such a setting it is unlikely that the links provided by the NOW will support pipelining and it is quite probable the processors will be heterogeneous. Combining our result on clique simulation with well-known techniques for simulating shared memory PRAMs on distributed memory machines provides an effective automatic compilation of a PRAM algorithm on a NOW. 相似文献

8.

Scheduling Malleable Tasks on Parallel Processors to Minimize the Makespan

Jacek Błażewicz Maciej Machowiak Jan Węglarz Mikhail Y. Kovalyov Denis Trystram 《Annals of Operations Research》2004,129(1-4):65-80

The problem of optimal scheduling n tasks in a parallel processor system is studied. The tasks are malleable, i.e., a task may be executed by several processors simultaneously and the processing speed of a task is a nonlinear function of the number of processors allocated to it. The total number of processors is m and it is an upper bound on the number of processors that can be used by all the tasks simultaneously. It is assumed that the number of processors is sufficient to process all the tasks simultaneously, i.e. n≤m. The objective is to find a task schedule and a processor allocation such that the overall task completion time, i.e. the makespan, is minimized. The problem is motivated by real-life applications of parallel computer systems in scientific computing of highly parallelizable tasks. An O(n) algorithm is presented to solve this problem when all the processing speed functions are convex. If these functions are all concave and the number of tasks is a constant, the problem can be solved in polynomial time. A relaxed problem, in which the number of processors allocated to each task is not required to be integer, can be solved in O(nmax {m,nlog ² m}) time. It is proved that the minimum makespan values for the original and relaxed problems coincide. For n=2 or n=3, an optimal solution for the relaxed problem can be converted into an optimal solution for the original problem in a constant time. 相似文献

9.

Iterative regularization algorithms for constrained image deblurring on graphics processors 总被引：1，自引：0，他引：1

Valeria Ruggiero Thomas Serafini Riccardo Zanella Luca Zanni 《Journal of Global Optimization》2010,48(1):145-157

The ability of the modern graphics processors to operate on large matrices in parallel can be exploited for solving constrained image deblurring problems in a short time. In particular, in this paper we propose the parallel implementation of two iterative regularization methods: the well known expectation maximization algorithm and a recent scaled gradient projection method. The main differences between the considered approaches and their impact on the parallel implementations are discussed. The effectiveness of the parallel schemes and the speedups over standard CPU implementations are evaluated on test problems arising from astronomical images. 相似文献

10.

Implementation of simultaneous memory address access in models that forbid it

Uzi Vishkin 《Journal of Algorithms in Cognition, Informatics and Logic》1983,4(1):45-50

Models of synchronized parallel computation in which all the processors have access to a common memory are considered. We focus on algorithms in models that allow simultaneous access to the same memory location, for both read and write instructions. Assume that such an algorithm uses p processors, d time units, and s memory space. We present a universal algorithm that implements this algorithm in models that forbid simultaneous access to the same memory location, using p processors, O(dlog²p) time units, and O (s + p) memory space his implementation algorithm is shown to compare favorably with its conventional naive counterpart, as the extra memory space it requires is independent of the implemented algorithm. 相似文献

11.

Retrieving topology of interconnections in computational cluster based on results of MPI benchmarks

P. S. Bannikov A. N. Salnikov 《Moscow University Computational Mathematics and Cybernetics》2014,38(2):73-82

Modern computing clusters have complex interconnections between processors. The complete specification of a communication network is often insufficiently detailed. Methods for automatically refining specification are considered, which use algorithms for reconstructing the topology of connections between processors from results of benchmarking of the cluster communication network. Methods for 3D visualizing the cluster topology and for visually comparing the constructed topology with the template topology provided by the specification are also considered. The approach was tested on systems with the BlueGene P architecture and on the Lomonosov supercomputer. A greedy algorithm for retrieving the topology and an adaptation of the multidimensional scaling method for visualization are designed. 相似文献

12.

On-line scheduling of multi-core processor tasks with virtualization

Deshi Ye Guochuan Zhang 《Operations Research Letters》2010,38(4):307-1572

We consider an on-line list scheduling problem of multi-core processor tasks with virtualization to minimize makespan. The competitive ratio of an on-line algorithm is shown for every specific m, where m is the number of processors. Better on-line algorithms are presented for a small number of processors. 相似文献

13.

Faster deterministic wakeup in multiple access channels

Gianluca De Marco Marco Pellegrini 《Discrete Applied Mathematics》2007,155(8):898-903

We consider the fundamental problem of waking up n processors sharing a multiple access channel. We assume the weakest model of synchronization, the locally synchronous model, in which no global clock is available: processors have local clocks ticking at the same rate, but each clock starts counting the rounds in the round in which the correspondent processor wakes up. Moreover, the number n of processors is not known to the processors. We propose a new deterministic algorithm for this problem in time O(n³log³n), which improves on the currently best upper bound of O(n⁴log⁵n). 相似文献

14.

A study of average-case speedup and scalability of parallel computations on static networks

《Mathematical and Computer Modelling》1999,29(9):83-94

A parallel system consists of a parallel algorithm and a parallel machine that supports the implementation of the algorithm. The scalability of a parallel system is a measure of its capability to increase speedup in proportion to the number of processors, or its capability to keep a constant efficiency as the number of processors increases. The present paper is devoted to the investigation of the average-case scalability of parallel algorithms executing on multicomputers with symmetric static networks, including the completely connected network, ring, hypercube, and torus. In particular, we characterize the communication overhead such that the expected efficiency can be kept at certain constant level, and that the number of tasks grows at the rate Θ(P log P). 相似文献

15.

Parallel iterative linear solvers for multistep Runge-Kutta methods

《Journal of Computational and Applied Mathematics》1997,85(1):145-167

This paper deals with solving stiff systems of differential equations by implicit Multistep Runge-Kutta (MRK) methods. For this type of methods, nonlinear systems of dimension sd arise, where s is the number of Runge-Kutta stages and d the dimension of the problem. Applying a Newton process leads to linear systems of the same dimension, which can be very expensive to solve in practice. With a parallel iterative linear system solver, especially designed for MRK methods, we approximate these linear systems by s systems of dimension d, which can be solved in parallel on a computer with s processors. In terms of Jacobian evaluations and LU-decompositions, the k-steps-stage MRK applied with this technique is on s processors equally expensive as the widely used k-step Backward Differentiation Formula on 1 processor, whereas the stability properties are better than that of BDF. A simple implementation of both methods shows that, for the same number of Newton iterations, the accuracy delivered by the new method is higher than that of BDF. 相似文献

16.

Assignment and scheduling in parallel matrix factorization

《Linear algebra and its applications》1986

We consider the problem of factoring a dense n×n matrix on a network consisting of P MIMD processors, with no shared memory, when the network is smaller than the number of elements in the matrix (P<n²). The specific example analyzed is a computational network that arises in computing the LU, QR, or Cholesky factorizations. We prove that if the nodes of the network are evenly distributed among processors and if computations are scheduled by a round-robin or a least-recently-executed scheduling algorithm, then optimal order of speedup is achieved. However, such speedup is not necessarily achieved for other scheduling algorithms or if the computation for the nodes is inappropriately split across processors, and we give examples of these phenomena. Lower bounds on execution time for the algorithm are established for two important node-assignment strategies. 相似文献

17.

Optimum Allocation of Processes in a Distributed Environment: A Process-to-Process Approach

Stella Sofianopoulou 《The Journal of the Operational Research Society》1990,41(4):329-337

The process allocation problem (PAP) concerned with the assignment of a number of communicating processes to a certain number (not known a priori) of identical processors in a telecommunications environment is examined. The objective is to minimize the total message-passing between processes residing on different processors subject to constraints on the processing power and storage capacity (code-, data-storage and occupancy) of the processors. Constraints imposed on the co-location of certain processes on the same processor are also included. The problem is formulated as a 0-1 linear maximization problem, taking into account only the number of processes involved, while the number of processors required is produced automatically with the optimum solution. An implicit enumeration algorithm is developed which produces an optimum message-passing allocation. Computational results of a set of random problems which have similar characteristics to a real-world application in telecommunications are also presented. 相似文献

18.

Scheduling precedence graphs of bounded height

Danny Dolev Manfred K Warmuth 《Journal of Algorithms in Cognition, Informatics and Logic》1984,5(1):48-59

The existence of a schedule for a partially ordered set of unit length tasks on m identical processors is known to be NP-complete (J. D. Ullman, NP-complete scheduling problems, J. Comput. System Sci., 10 (1975), 384–393). The problem remains NP-complete even if we restrict the precedence graph to be of height bounded by a constant. (J. K. Lenkstra and A. H. G. Rinnooy Kan, Complexity of scheduling under precedence constraints, Operations Res., 26 (1978), 22–35; D. Dolev and M. K. Warmuth, “Scheduling Flat Graphs,” IBM Research Report RJ 3398, 1982). In these NP-completeness proofs the upper bound on the number of available processors varies with the problem instance. We present a polynomial algorithm for the case where the upper bound on the number of available processors and the height of the precedence graph are both constants. 相似文献

19.

Torus data flow for parallel computation of missized matrix problems

《Linear algebra and its applications》1986

A data-flow approach is used to solve dense symmetric systems of equations on a torus-connected 2-D mesh of processors. A torus mapping of the matrix onto this processor array allows the Cholesky decomposition to be completed in 3n − 2 time steps using only n²/4 processors (less than half the number needed in previously reported results). New definitions for missized problems and parallel algorithm performance are given along with various time-step, efficiency, and processor utilization plots. 相似文献

20.

The b and h functions for integral equations with displacement kernels: A computational method and an application to radiative transfer

《Applied mathematics and computation》1987,23(2):93-101

Many problems in the theory of radiative transfer reduce to the solution of Fredholm integral equations with displacement kernels. Frequently, we are interested in the solutions of the Fredholm integral equations as well as certain functionals on the solution (reflection and transmission coefficients, etc.). Earlier it was shown that these functionals can be expressed algebraically in terms of the basic functions b and h. Normally, these functions are computed as solutions of an initial-value problem. Since they represent internal interactions due to isotropic illuminations, they are also solutions of a linear two-point boundary-value problem, which, unfortunately, s unstable. The purpose of this paper is to show that this unstable problem can be solved using a Gram-Schmidt orthogonalization scheme. This is demonstrated by making comparisons against earlier calculations using the initial-value method. In addition, the process is ideally suited to take advantage of multitasking on parallel processors. 相似文献