|
1. |
Pumma: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers |
|
Concurrency: Practice and Experience,
Volume 6,
Issue 7,
1994,
Pages 543-570
Jaeyoung Choi,
David W. Walker,
Jack J. Dongarra
Abstract:
The paper describes Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non‐transposed matrix multiplication routine C = A ⋅ B, but also transposed multiplication routines C = A^T ⋅ B, C = A ⋅ B^T, and C = A^T ⋅ B^T, for a block cyclic data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. The PUMMA together provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta
ISSN:1040-3108
DOI:10.1002/cpe.4330060702
Publisher: John Wiley & Sons, Ltd
Year: 1994
Source: WILEY
|
2. |
Matrix multiplication on the Intel Touchstone Delta |
|
Concurrency: Practice and Experience,
Volume 6,
Issue 7,
1994,
Pages 571-594
Steven Huss‐Lederman,
Elaine M. Jacobson,
Anna Tsao,
Guodong Zhang
Abstract:
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message‐passing architecture with a two‐dimensional mesh topology. We analyze and compare three algorithms and obtain an implementation, BiMMeR, that uses communication primitives highly suited to the Delta and exploits the single node assembly‐coded matrix multiplication. Our algorithm is completely general, i.e. able to deal with various data layouts as well as arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86%, with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe BiMMeR's design and implementation and present performance results that demonstrate scalability and robust behavior over varying mesh topo
ISSN:1040-3108
DOI:10.1002/cpe.4330060703
Publisher: John Wiley & Sons, Ltd
Year: 1994
Source: WILEY
|
3. |
Determining update latency bounds in Galactica Net |
|
Concurrency: Practice and Experience,
Volume 6,
Issue 7,
1994,
Pages 595-611
S. Clayton,
A. Wilson,
R. J. Duckworth,
W. Michalson
Abstract:
The paper looks at the problem of ensuring the performance of real‐time applications hosted on Galactica Net, a mesh‐based distributed cache coherent shared memory multiprocessing system. A method for determining strict upper bounds on worst case latencies in wormhole routed networks of known or unknown communication patterns is presented. From this, a tool for determining upper bounds for shared memory update latencies is developed, and it is shown that the update latency of Galactica Net is deterministic. The analytical bounds are then compared with maximum latencies observed in simulations of GNet, with which they compare favorably. Finally, it is shown that the tool for determining update latency bounds is useful for comparing differing GNet system configurations in order to minimize update latency bou
ISSN:1040-3108
DOI:10.1002/cpe.4330060704
Publisher: John Wiley & Sons, Ltd
Year: 1994
Source: WILEY
|
4. |
Masthead |
|
Concurrency: Practice and Experience,
Volume 6,
Issue 7,
1994,
Page -
ISSN:1040-3108
DOI:10.1002/cpe.4330060701
Publisher: John Wiley & Sons, Ltd
Year: 1994
Source: WILEY
|
|