HPFC: a Prototype HPF Compiler
Fabien COELHO
Hpfc is an High Performance Fortran Compiler which is being developed at CRI. This
project aims at building a prototype compiler to be able to test new
optimization techniques for compiling HPF. It has been supported by
PARADIGME [1]. The compiler is implemented on top of
PIPS [5, 7, 6] (Scientific Programs
Interprocedural Parallelizer) which provides many program analyses such
as effects,
regions,
etc. Realistic codes (hundreds of lines, heavy I/Os, not fully
optimizable) are compiled by the prototype and have run on a network of
workstations and on a CM5. Some experiments were also performed on an
alpha farm.
The prototype compiler is available at the source level and as SunOS4 and
Solaris2 binaries with the
PIPS distribution (CRI/ENSMP).
Note that this implementation is a working research prototype! This means
that despite it has shown to work on real HPF codes, and sometimes to
perform better than commercial compilers, it remains a prototypes with
bugs, core dumps, limitations and small documentation and small support.
The compiler does not take as input full HPF, but includes both
simplification and extensions (FC directives).
The base language is mainly FORTRAN 77, plus some extensions toward HPF.
- The input language syntax is restricted to FORTRAN 77, and more
precisely to its PIPS FORTRAN subset. Few features (computed
GOTO and substrings) are missing, but it matches
the standard closely.
- HPF directives start on column 1 with
[Cc!\*][Hh][Pp][Ff]$
.
- FC (Fabien COELHO) directives with
[Cc!\*][Ff][Cc][Dd]$
.
- Very simple forall instructions without continuation are
handled. They must be independent.
!hpf$ independent
forall(i=1:n) a(i)=a(i)+1
- ! style FORTRAN 90 comments are managed in directives.
!hpf$ dynamic A ! here is a comment
!hpf$ template T(n) ! and another...
- 6th column and
&
style continuations are handled in
directives.
!hpf$ processors
!hpf$x P(hpf_number_of_processors())
!hpf$ distribute T(block) &
!hpf$ onto P
- The directive analysis does handle FORTRAN lexical issues in
directives. However, I do not recommand to use this feature which is
a bad Fortran custom.
The HPF mapping directives are both simplified and extended on a
syntactic point of view:
- Arrays may be used as templates.
real A(n,n), B(n,n), C(n)
!hpf$ align B(i,j) with A(i,j)
*hpf$ align C(i) with A(*,i)
- No alignment trees (A is aligned with B which is aligned with C).
- Dummy variable-based or implicit syntax for alignments
(the implicit syntax is not strictly conformant but definitely convenient)
!hpf$ align A(i,j) with T(j,i)
!hpf$ align D with T2
- list of distributee and alignee are allowed:
!hpf$ align A(i,j), B(j,k) with T(i,k)
!hpf$ distribute T(block,*), D(*,cyclic) onto P
- The default distribution is block on each dimension of
the processor arrangement.
!hpf$ template T(n,n,n)
!hpf$ processors P4(2,2)
!hpf$ distribute T onto P4 ! == T(block,block,*)
- The default processors arrangement is a vector of 8 processors
named hpfcX.
!hpf$ distribute T(cyclic)
!
! same as
!
!hpf$ processors hpfcX(8)
!hpf$ distribute T(cyclic) onto hpfcX
- Static HPF mapping directives are allowed for variables which are
local or in commons. This is not strictly speaking HPF conformant.
They cannot be remapped. Commons with distributed arrays should be
declared exactly the same in the different subroutines and
functions. Overlaps are managed thanks to the interprocedural
compilation.
integer n
parameter(n=1000)
common /shared/ V(n,n), U(n,n), vmax
real V, U, vmax
!hpf$ align with U :: V
!hpf$ distribute U(block,block) onto P
- All sizes of HPF objects (distributed arrays, templates, processors)
must be statically known.
hpf_number_of_processors()
is given a
default value.
integer n,m
parameter (n=10,m=10000)
real A(m,n)
!hpf$ template T(m,n), T2(100,100,100)
!hpf$ template D(100,m)
!hpf$ processors P(hpf_number_of_processors())
!hpf$ processors p2(4,4), proc(n)
- Some simple so called free style is handled:
!hpf$ align with T :: A, B
!hpf$ distribute onto P :: T(block)
!
!hpf$ re align with T(i) :: C(i)
!hpf$ redistribute onto P :: T(cyclic)
!
!hpf$ distribute (block) onto P :: X, Y
!hpf$ realign (i,j) with D(j,i) :: V, W
The mapping of an array may be modified dynamically within the program or
at subroutine boundaries. Dynamic directives are as constrained as static
directives.
- Dynamic HPF mapping directives (realign,
redistribute) are only allowed on local arrays, and
so that there is only one reaching distribution for each array reference
in an instruction.
!hpf$ distribute A(block,block)
if (a(1,1).lt.1.0) then
!hpf$ redistribute A(block,cyclic)
!hpf$ independent(i,j)
A = A ...
endif
!hpf$ redistribute A(block,block) ! back...
print *, A
- Prescriptive and descriptive mappings at subroutine
interface are allowed. Thus full arrays can be passed as subroutine
arguments.
subroutine predes(X,Y)
real X(100), Y(100)
!hpf$ distribute X * (block) ! descriptive
!hpf$ distribute Y(block) ! prescriptive
- Local sections of distributed arrays may be passed to a pure
function within a parallel loop, if properly aligned...
subroutine fft1(signal,spectrum)
!fcd$ pure ! equivalent to the f90 attribute...
complex signal(255), spectrum(255)
...
end
subroutine process
complex signal(255,255), spectrum(255,255)
!hpf$ align with signal:: spectrum
!hpf$ distribute signal(*,block)
!hpf$ independent
do i=1, 255
call fft1(signal(1,i),spectrum(1,i)) ! signal(:,i)
enddo
- Directive inherit is handled as a default prescriptive
block distribution, thus both following directives are strictly
equivalent for Hpfc:
!hpf$ inherit X
!hpf$ distribute X
Parallelism can be expressed:
- Directives independent and new are taken into
account. They are managed orthogonaly and
homogeneously.
Moreover independent can be specified the parallel loop
indexes. Thus one can write
!hpf$ new(x),
!hpf$. independent(j,i)
do j=1, n
do i=1, n
x = A(i,j)*A(i,j)+B(i,j)
C(i,j) = x*x+x
enddo
enddo
- Reduction functions (sum, prod, min, max) are recognized by name and
with a special FORTRAN 77 syntax to please the parser.
real A(n)
x = redmin1(A, 5, n-3) ! x = MIN(A(5:n-3))
- The reduction directive is also implemented for +, *,
min and max:
s = 2.1
!hpf$ independent,
!hpf$x reduction(s)
do i=2, n-1
s = s + a(i)
enddo
print *, 'sum is ', s
- There is a pure directive (instead of an attribute to a
function or subroutine, because the compiler deal with FORTRAN 77
only). They should only deal with scalar data. Actually it
declares elemental functions.
program foo
external myop
!fcd$ pure myop ! the external real myop function is pure
real A(100)
!hpf$ distribute A
!hpf$ independent
do i=1, n
A(i) = myop(A(i))
enddo
end
real function myop(x)
!fcd$ pure ! the current function is pure
myop = x*x+sqrt(sqrt(x))
end
Hpfc incorporates many extensions of HPF, the need of which was
encountered in real applications.
- There is an io directive (functions declared io
are just called by the host after a gather of the needed data).
program blah
!fcd$ io output ! output subroutine is io...
...
call output(A)
end
subroutine output(x)
!fcd$ io ! current function is io
print *, 'here is the output', x
end
- Code sections can be managed by the host only, without fine grain
update of the values on the nodes, with host and end
host. All user's functions called within such section should be
either pure or io. Example:
c m will be updated on the node only once
c a correct value is obtained from the user...
cfcd$ host
10 print *, 'm = ?'
read *, m
if (m.lt.0.or.m.gt.100) then
print *, 'invalid value, retry'
goto 10
endif
cfcd$ end host
- Simple sections of codes can be localized with local and
end local. The point is to tell the compiler about a variables
private to a section of code. For instance:
cfcd$ local, new(c1,w)
c1 = 4/((1/v(np-1,np))+(1/v(np-1,np-1)) +
. (1/v(np-2,np))+(1/v(np-2,np-1)) )
w = c1*deltat
be(np-1,1) = 2*w/(4*w+3*h)
be(np-1,2) = 3*h/(4*w+3*h)
v(np-1,np) = c1
cfcd$ end local
- Synchronization (synchro) and timing
(time/end time) functions can be requested for
performance measurements. Timing functions can be nested (up to
ten). They must apply on a section of code. The message attached to the
end is displayed together with the time.
These directives cannot appear within a parallel loop.
Such an approach with special directives should have been chosen by
vendors instead of propriatary calls (for instance the DEC compiler
handles call hpf_synchro() to synchronize HPF programs on
user request, what cannot be considered as portable).
!fcd$ synchro
print *, 'let us go'
!fcd$ time
do i=1, 10
!fcd$ time
call foo
!fcd$ end time('foo call')
enddo
!fcd$ end time('full loop')
!fcd$ synchro
print *, 'that is all'
time on and time off are obsolescent directives for
time and end time.
- The
tell
directive allows to tell about the progression of
a run onto stderr, and can be used at development time.
program foo
!fcd$ tell('foo -> bla')
call bla
!fcd$ tell('foo -> arg')
call arg
end
- Arrays the values of which are dead at a point can be marked as such
with the kill directive. For remappings on dead arrays, no
communication will occur at runtime.
!hpf$ template T(n)
!hpf$ align with T :: A
!hpf$ processors P(4)
!hpf$ distribute T(block) onto P
!
! A is used
!
!fcd$ kill A ! I know the value will not be reused...
!hpf$ redistribute T(cyclic) onto P
!
! A is used again, but defined before being referenced.
!
- For differentiating a sequential and parallel version of a same
computation (the best way to compute something can be different
if you take the machine characteristics into account), the user may rely
on HPFC to run cpp over its code with the
__HPFC__
macro defined.
The idea is not to impose the less performant
solution to be applied on the code for sequential execution because it
is a mandatory way of doing things in parallel.
#if (defined(__HPFC__))
mm=0.
!hpf$ independent, &
!hpf$ reduction(mm)
do i=1, n
mm = max(abs(a(i)),mm)
end do
if (mm.ge.eps) goto 10
#else
do i=1, n
if (abs(a(i)).ge.eps) goto 10
end do
#endif
10 continue
- The set directive (with int and bool
variants) allows to switch compiler options from the source code.
They are declaration directives. The options are those reported in the
properties of PIPS. To be manipulated with great care. Introduced for
validation purposes.
!fcd$ set bool('HPFC_USE_BUFFERS',1) ! set to TRUE
!fcd$ set int ('HPFC_BUFFER_SIZE',500000)
!fcd$ set bool('HPFC_LAZY_MESSAGES',0) ! set to FALSE
!fcd$ setbool ('HPFC_IGNORE_FCD_SYNCHRO',1) ! no SYNCHRO
- The fake directive tells the compiler that the source code
provided for a given function is a fake one which mimics its
effects. Thus the compiler will not attempt to generate any code for
this function. It is simply understood as ``do not compile me''.
This feature is used for interfacing Hpfc with XPOMP.
A fake function must be either
pure
or io
.
subroutine xpomp_get_depth(d)
integer d
!fcd$ io
!fcd$ fake
read *, d
end
- Fake
io
and pure
function will not be compiled by the
Hpfc compiler. However, these function will have to be provided
for linking the final executables. The special ld io
and
ld pure
directives allow this information to be provided
as additional shell-interpreted linker options:
! Tells where to find xpomp functions:
!fcd$ ld io -L$XPOMP_RUNTIME -lxpomp
The io
version is linked to the host and initial
(sequential) executables. The pure
applies to all executables.
- When using external library function in programs submitted PIPS, it
is required to provide stub functions that mimics the effects of the
library functions. Files of stubs to be included can be specified
directly within the source code with the special
stubs
directive.
The file is searched in the current directory and in the
$PIPS_ROOT/Share
directory, with the .direct
(preprocessed) then the .f suffix.
! Tells the name of the stub file.
!fcd$ stubs xpomp_stubs.f
The compiler generates code for the SPMD message passing programming
model. It includes many optimizations. The generated code is based on a
runtime support library. The communications are handled by PVM.
The whole compilation process is driven and automated by shell scripts.
Hpfc's target programming model is a host-node distributed memory
message passing model. The host is in charge of I/Os and of the process
management. The nodes perform the computations. Two programs are generated
by the compiler. One for the host, and an SPMD code for the nodes, which
is parametrized by the node id. The generated codes rely on a runtime
library which provides many services. Lower level issues in the process
management (spawning of tasks, etc.) and basic communications are handled
by PVM.
The basic compilation technique is known as the runtime resolution of
accesses to distributed arrays. Each processor executes the whole
control flow of the original program, and each reference to distributed
data is resolved dynamically, and elementary messages are exchanged,
following the Owner Computes Rule. This is of course very
inefficient, and optimizations are needed if the program is expected to
run quicker on a parallel machine.
The following optimizations are implemented within Hpfc:
- Distributed array declarations are reduced on the nodes.
- Reductions are compiled thru calls to dedicated runtime functions.
- Overlap analysis.
Conditions:
- Independent loop nest, maybe non perfectly nested.
- Rectangular iteration space, maybe not statically known.
- Multidimensional block distributions.
- Arrays should be nearly aligned one with the other.
- No alignement stride, no replication of written arrays.
- flip-flops are allowed.
Output:
- New loop nest on local data for the nodes.
- Communications are vectorized outside of the loop nest.
- Array declarations are extended to store data from neighbour
processors on overlap areas.
- I/O communications (collect to/update from host) are specially
managed with a polyhedron-based technique (abstract [3]).
- 1D shift on rectangular areas of block-distributed stride-1 aligned
arrays are recognized and compiled thru calls to appropriate runtime
support functions. It is not far from a hack, but it allows to handle
wraparound computations. Example:
!hpf$ distribute V(block,block) onto P
!hpf$ independent
do i=2, n-1
V(1,i) = V(n,i)
enddo
- loop nests that fully copy an array into another, both being
aligned, are especially handled. One more hack.
- Remappings (realign and redistribute) are taken into
account (abstract [4]).
- And more linear algebra techniques to come
(abstract [2])
- Dynamic arrays with similar but distinct mappings (that is the
mapping directives are different but results in the very same actual
mapping on the processors) are merged. Such instances typically appear
if a realign for replication is requested, although not all dimensions
are actually distributed (replicating a non-distributed dimension as no
effect on the actual data mapping). The realign depends on the
algorithms, and the distribution choice may change for various
processors arrangements considered for the program.
!hpf$ align A(i,j) with T(i,j)
!hpf$ distribute T(block,*)
!
! is similar to
!
!hpf$ align A(i,j) with T(j,i)
!hpf$ distribute T(*,block)
The Hpfc generated code relies on a runtime library and on runtime
datastrutures. These functions are written in FORTRAN 77 on top of
PVM3. Many functions are generated automatically from m4 templates,
depending on the manipulated data types and arities. Parts of the
library are ported to use the CM5 communication library instead of
PVM3. The runtime library was successfully compiled on the following
PVM architectures: SUN4 SUNMP RS6K ALPHA CM5.
Runtime data structures:
- HPF objects description
- Arity and object bounds
- Alignements and Distributions
- Distributed arrays new declarations
- PVM processes management and communication synchronization
- PVM process identifiers for the host and the nodes
- PVM identity of the process
- Number of messages exchanged with each other processes
- Number of multicasts
- HPF processors/PVM processes mapping
- Current status (library internal data)
Runtime functions:
- Runtime data structures initialization
- Processes initializations
- Ownership computations (for the runtime resolution)
- Some coherency checkings
- High level packing functions
- The data must be block-distributed and stride-1 aligned
- A local rectangular area is packed
- Any type and arity
- Reduction functions for distributed data
- The data must be block-distributed and stride-1 aligned
- The accessed area must be rectangular
- Implemented: MIN, MAX, SUM, PROD
- Any type and arity (when it makes sense)
- Subarray shift functions
- The data must be block-distributed and stride-1 aligned
- The shifted area must be rectangular
- The shift must be on one dimension
Hpfc is integrated in PIPS, a prototype parallelizer. There are many
utilities used by Hpfc or that can be used to manipulate Hpfc.
- hpfc, the compiler driver: Allows to call the Hpfc many
scripts.
- interactive sessions (thanks to GNU readline library)
- compile (HPF to F77 with pips)
- make (F77 to PVM executables)
- run (the original sequential or new parallel codes)
- delete (hpfc directories)
- clean (pips directories)
- pvm (start, halt and so on)
- ...
- shells used by the compiler:
- hpfc_directives: manage HPF directives.
- hpfc_add_includes: adds file inclusions.
- hpfc_add_warning: adds a warning to automatically
generated files.
- hpfc_generate_init: generates an init file
- hpfc_generate_h: generates a header file from a FORTRAN
file.
- hpfc_install: installation of the needed and generated
files in a directory.
- Options:
- for inlining or not global to local runtime resolution
computations.
- for managing especially buffers over PVM.
- for ignoring or not timing and synchronization directives.
- for applying the dynamic liveness analysis of array copies.
- Implementation overview
- hpfc is the compiler driver shell.
- HPF directives are translated into fake FORTRAN calls
to reuse the PIPS parser using a
shell and sed script.
- These calls are filtered out of the parsed code and analyzed to set
internal data structures which describe array mappings and so.
- New declarations are computed for the distributed arrays.
- The input code is analyzed by PIPS (array regions, preconditions).
- Then 2 codes are generated for each subroutine. One for the
host and an SPMD code for the nodes. Implemented as a double
rewriting of the AST.
- Initialization codes for the runtime data structures are also
generated.
- Finally common declarations are dealt with, plus global
initializations.
- The generated codes are compiled and linked to PVM and the Hpfc
Runtime library. Two executables are built (host and nodes).
The host spawns the nodes.
Figure 1: HPFC overview
Figure 1 shows an overview of Hpfc within PIPS,
decomposed into several interacting phases within the compiler.
Hpfc is a prototype HPF compiler in which some of our techniques
have been implemented. It serves as a realistic testbed for testing new
optimization techniques. The compiler includes 18,000 lines of ANSI C
code as a PIPS module, plus 1,300 of shell and other sed scripts for the
drivers, and finally 3,500 lines of M4/FORTRAN 77 for the runtime
support library (expanded to +10,000).
Prototype compilers that generate code from FORTRAN (more or less
HPF) or C for distributed-memory machines: See the excellent
survey
maintained by Jean-Louis Pazat at IRISA.
References
- 1
-
Luc Bougé.
ParaDigme - Le parallélisme de données comme modèle
fondamental pour le calcul massivement parallèle.
Rapport final, ENS-Lyon, December 1994.
Thème 10 : Langages, modèles d'exécution et architectres pour le
parallélisme massif, appel d'offres du Ministère de la Recherche et de
l'Espace.
- 2
-
Fabien Coelho.
Étude de la Compilation du High Performance Fortran.
Master's thesis, Université Paris VI, September 1993.
Rapport de DEA Systèmes Informatiques. TR EMP E/178/CRI.
- 3
-
Fabien Coelho.
Compilation of I/O Communications for HPF.
In 5th Symposium on the Frontiers of Massively Parallel
Computation, pages 102-109, February 1995.
- 4
-
Fabien Coelho and Corinne Ancourt.
Optimal Compilation of HPF Remappings.
CRI TR A 277, CRI, École des mines de Paris, October 1995.
To appear in JPDC, 1996.
- 5
-
François Irigoin, Pierre Jouvelot, and Rémi Triolet.
Semantical interprocedural parallelization: An overview of the PIPS
project.
In ACM International Conference on Supercomputing, pages
144-151, June 1991.
- 6
-
Ronan Keryell.
WPips and EPips User Manual (Paralléliseur Interprocédural de
Programmes Scientifiques) -- Linear Algebra based Automatic Parallelizer and
Program Transformer.
CRI, École des mines de Paris, April 1996.
Also report A-288.
- 7
-
Ronan Keryell, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet,
François Irigoin, and Pierre Jouvelot.
PIPS: A Framework for Building Interprocedural Compilers,
Parallelizers and Optimizers.
Technical Report 289, CRI, École des mines de Paris, April 1996.
This document URL is:
http://www.cri.ensmp.fr/pips/hpfc.html
Fabien COELHO
Wed Mar 26 13:24:56 MET 1997