Last update: 95/09/1 - Author: Corinne ANCOURT
This phase was designed to automate the generation of Fortran 77 programs for distributed memory machines using an Emulated Shared Memory scheme, and to exploit the universal message passing capability provided by the INMOS T9000 processor and C104 hardware router. One half of the processors perform computations and the other half emulate memory banks. This gives the compiler a better understood target machine: a multiprocessor with a fast local memory managed as a software cache and a slow shared memory. When the T9000 and C104 are used, the fast context switching times and intelligent on-chip channel processors make it possible to overlap computations and communications.
This work was partially funded by ESPRIT project 2701 (PUMA - WorkPackage 6.5) and by DRET.
This phase takes as input a sequential Fortran77 program meeting the following conditions:
Task generation is based on control partitioning. The data dependence graph between program instructions is used to build parallel tasks. Loop transformations such as tiling and loop distribution are applied to nested loops in order to define blocks of loop iterations that can be computed in parallel.
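As an illustration of how tiling cuts a loop nest into blocks of iterations, here is a minimal Python sketch for a two-dimensional iteration space. It is not code produced by the WP65 phase; the function names and the rectangular, fixed-size tiles are assumptions made for the example.

```python
def tiles(n, m, tile):
    """Yield half-open bounds (i_lo, i_hi, j_lo, j_hi) of the tiles
    covering an n x m iteration space (hypothetical helper)."""
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # Clamp the upper bounds so partial tiles at the edges stay inside
            # the original iteration space.
            yield i0, min(i0 + tile, n), j0, min(j0 + tile, m)


def tiled_iterations(n, m, tile):
    """Return the iterations (i, j) in the order a tiled loop nest visits them."""
    order = []
    for i_lo, i_hi, j_lo, j_hi in tiles(n, m, tile):   # outer loops: which tile
        for i in range(i_lo, i_hi):                    # inner loops: tile body
            for j in range(j_lo, j_hi):
                order.append((i, j))
    return order
```

Every iteration of the original nest is visited exactly once; only the order changes, which is why the legality of the transformation depends on the dependences.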
The dependence graph is used to decide whether a given tiling is legal. The current implementation does not estimate the tile size automatically; a default size is used.
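One common sufficient condition for the legality of rectangular tiling can be sketched as follows: if every dependence distance vector has only non-negative components, no dependence points backwards across a tile boundary, so the tiles can be executed atomically in lexicographic order. This is a textbook test given for illustration; it is not necessarily the test implemented in this phase, and the constant `DEFAULT_TILE_SIZE` and its value are hypothetical.

```python
DEFAULT_TILE_SIZE = 4  # hypothetical value; the text only says a default is used


def rectangular_tiling_is_legal(distance_vectors):
    """Sufficient legality test for axis-aligned rectangular tiling.

    distance_vectors: iterable of tuples, one dependence distance vector
    per dependence, with one component per loop of the nest.
    """
    return all(all(d >= 0 for d in vec) for vec in distance_vectors)
```

For example, the distance vectors `(1, 0)` and `(0, 1)` pass the test, while `(1, -1)` fails it: in that case the nest would have to be transformed (for instance skewed) before rectangular tiles could be used.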
Each tile is seen as a logically independent task. Each task is made of three parts: a prologue to read the input data from the emulated shared memory, a computational part and a final part to store the results. Ideally several tasks should be executed by the same physical processor to overlap communications and computations.
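The three-part structure of a task can be sketched in Python as follows. This is only an illustration under stated assumptions: a dictionary stands in for the emulated shared memory, the arrays `A` and `B` and the computation `B(i) = 2*A(i)` are made up for the example, and the tasks run one after the other rather than overlapping.

```python
def run_tile(shared, lo, hi):
    """Execute one tile [lo, hi) as a three-part task (hypothetical example)."""
    # Prologue: read the input data from the emulated shared memory
    # into local storage.
    local_a = [shared["A"][i] for i in range(lo, hi)]

    # Computational part: works on local data only.
    local_b = [2 * x for x in local_a]

    # Final part: store the results back into the emulated shared memory.
    for offset, value in enumerate(local_b):
        shared["B"][lo + offset] = value


def run_all(n, tile):
    """Run every tile of a 1-D loop of n iterations and return array B."""
    shared = {"A": list(range(n)), "B": [0] * n}
    for lo in range(0, n, tile):
        run_tile(shared, lo, min(lo + tile, n))
    return shared["B"]
```

Because each task touches shared memory only in its prologue and final part, a processor holding several tasks could in principle compute one tile while the transfers of another are in flight.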
A 2-PMD distributed Fortran77 program containing calls to the runtime communication library PVM is generated. The input program is transformed into two subroutines:
COMPUTE(PROC_ID) contains the computational part of the code and receives a (logical) processor number as parameter;
BANK(BANK_ID) contains the shared memory emulator part of the code and receives a (logical) bank number as parameter.
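The interplay between the two subroutines can be sketched in Python, with threads and queues standing in for processors and PVM messages. This is not the generated Fortran: the single bank, the single compute process, the array names and the computation `B(i) = A(i) + 1` are all assumptions made for the example. The point it shows is that both routines iterate over the same tiles and that every send in one is matched by a receive in the other.

```python
import queue
import threading


def bank(bank_id, memory, to_compute, from_compute, tiles):
    """Memory-bank side: serves tile data and stores results."""
    for lo, hi in tiles:
        to_compute.put(memory["A"][lo:hi])        # send, matched by a receive in compute
        memory["B"][lo:hi] = from_compute.get()   # receive, matched by a send in compute


def compute(proc_id, to_compute, from_compute, tiles):
    """Compute side: same tile loop, mirrored communications."""
    for lo, hi in tiles:
        local_a = to_compute.get()                # receive the tile's input data
        local_b = [x + 1 for x in local_a]        # computation on local data only
        from_compute.put(local_b)                 # send the tile's results


def run(n, tile):
    tiles = [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]
    memory = {"A": list(range(n)), "B": [0] * n}
    to_compute, from_compute = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=bank, args=(0, memory, to_compute, from_compute, tiles)),
        threading.Thread(target=compute, args=(0, to_compute, from_compute, tiles)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return memory["B"]
```

If the two tile loops ever disagreed on the number or order of messages, one side would block forever on a receive, which is why the two routines are derived from the same loop structure.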
The general structures of these two subroutines are very close, since each send (receive) must be met by a corresponding receive (send). Like the input program, they are sequences of loop nests. The outermost loop nest defines which tile is being executed. Each tile body is made of two or three sections:
The potential advantages of this approach are:
The obvious disadvantage is that a full software cache cannot be fully statically compiled. However, regular code can exploit the underlying INMOS hardware very efficiently.
A full description of the approach and examples are given in .
To run the WP65 phase with wpips, ask for the distributed view.
e-mail: ancourt-at-cri.mines-paristech.fr and pips-support-at-cri.mines-paristech.fr