|
1. |
Performance of a particle‐in‐cell plasma simulation code on the BBN TC2000 |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page 1-18
Judy E. Sturtevant,
Phil M. Campbell,
Arthur B. Maccabe,
|
Abstract:
The BBN TC2000 is a multiple instruction, multiple data (MIMD) machine that combines a physically distributed memory with a logically shared memory programming environment using the unique Butterfly switch. Particle‐in‐cell (PIC) plasma simulations model the interaction of charged particles with electric and magnetic fields. This paper describes the implementation of both a 1‐D electrostatic and a 2 1/2‐D electromagnetic PIC plasma simulation code on a BBN TC2000. Performance is compared with an implementation of the same code on the shared memory Sequent Ba
ISSN:1040-3108
DOI:10.1002/cpe.4330040102
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
2. |
Numerical simulation of three‐dimensional thermal convection on the array processor DAP 510 |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page 19-35
W. Erhard,
M. Schäfer,
|
Abstract:
In this paper we deal with the numerical simulation of time‐dependent three‐dimensional thermal convection on the array processor DAP 510. Applying finite differences in combination with a pressure correction method to the underlying non‐linear system of partial differential equations, we reduce the numerical solution of the problem to the solution of a sequence of sparse linear systems. Using polynomial preconditioned conjugate gradient methods for the solution of these systems results in a highly parallel algorithm for the simulation of the considered flows on the DAP 510. Using this parallel algorithm, data can be mapped in different ways onto the processor array. Depending on the number of grid points, several methods are shown. Numerical experiments illustrate the capabilities of the proposed algori
ISSN:1040-3108
DOI:10.1002/cpe.4330040103
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
3. |
Fault‐tolerant parallel programming in Argus |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page 37-55
Henri E. Bal,
|
Abstract:
Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high‐performance programs resilient to processor crashes is often too high, given the low probability of such events. If parallel systems become larger in scale, however, processor failures will become likely, so they should be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault‐tolerant transparently. It can log messages, make checkpoints, and so on. Second, the programmer can write explicit code for handling failures in an application‐specific way. The latter approach is potentially more efficient, but also requires more work from the programmer. In this paper, we intend to get some initial insight into how hard and efficient explicit fault‐tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language supporting parallelism as well as fault tolerance. Our experiences indicate that the extra effort needed for fault tolerance varies considerably between different applications. Also, trade‐offs can frequently be made between programming effort and efficiency. One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. As another result, the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be hi
ISSN:1040-3108
DOI:10.1002/cpe.4330040104
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
4. |
Computation and data movement on RP3 |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page 57-78
Luigi Brochard,
Alex Freau,
|
Abstract:
In this paper we present a study of computation and communication costs on RP3 and of some issues in algorithm design on a three‐level memory hierarchy multi‐processor. Using very simple algorithms (vector‐add, vector‐sum, saxpy, …), we compare different implementations which differ in data localization (global or local) and data cacheability (cacheable or non‐cacheable). This comparison is done using a performance monitoring system (VPMC) that records instructions, data movement, cache requests and misses. The output of the VPMC was then used as input to an analytical performance model which we used to compute the elemental computation and communication times of every basic algorithm. Regarding cacheability (marking the data cacheable instead of non‐cacheable), we found it worthwhile as long as data are blocked adequately. For our simple 1‐D data structures, a block size equal to a multiple of the cache line size gives the best results. However, considering possible load imbalance, a block size equal to the cache line seems optimal. Regarding localization (copying data from global to local, working on local data instead of global and copying data back), we found it ineffective, at least with the RP3 local and global communication speed r
ISSN:1040-3108
DOI:10.1002/cpe.4330040105
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
5. |
Designing Algorithms on RP3 |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page 79-106
Luigi Brochard,
Alex Freau,
|
Abstract:
We study here the behavior of two numerical algorithms (matrix multiplication and finite difference method) on a three‐level memory hierarchy multi‐processor, RP3. Using different versions of these algorithms, which differ in data placement (global, local, global and cacheable, local and cacheable) and in data access (blocked or non‐blocked), we study the impact of these parameters on the performance of the program. This performance analysis is done using a very accurate monitoring system (VPMC) which records instructions, memory requests, cache requests and misses. We also perform a theoretical performance analysis of these programs using a model of computation and communication. Good agreement is found between theoretical and experimental results. As a conclusion we discuss the use of local memory on such a machine and show that it is ineffective with RP3 cache, local and global memory communication speed ratios. We also discuss optimal use of cache and show that the optima can only be realized under some cache properties (private store‐in cache with user control of write‐back) and show that blocked optimal algorithms are to be used to find it. Comparing programming of shared and distributed memory multi‐processors, we remark that optimized algorithms for shared memory systems utilize the same blocking techniques used for programming distributed memory systems, leading to a common programmi
ISSN:1040-3108
DOI:10.1002/cpe.4330040106
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
6. |
Masthead |
|
Concurrency: Practice and Experience,
Volume 4,
Issue 1,
1992,
Page -
|
ISSN:1040-3108
DOI:10.1002/cpe.4330040101
Publisher: John Wiley & Sons, Ltd
Year: 1992
Source: WILEY
|
|