HPFC: a Prototype HPF Compiler

Hpfc is an High Performance Fortran Compiler which is being developed at CRI. This project aims at building a prototype compiler to be able to test new optimization techniques for compiling HPF. It has been supported by PARADIGME [1]. The compiler is implemented on top of PIPS [5, 7, 6] (Scientific Programs Interprocedural Parallelizer) which provides many program analyses such as effects, regions, etc. Realistic codes (hundreds of lines, heavy I/Os, not fully optimizable) are compiled by the prototype and have run on a network of workstations and on a CM5. Some experiments were also performed on an alpha farm.

NEW The prototype compiler is available at the source level and as SunOS4 and Solaris2 binaries with the PIPS distribution (CRI/ENSMP). Note that this implementation is a working research prototype! This means that despite it has shown to work on real HPF codes, and sometimes to perform better than commercial compilers, it remains a prototypes with bugs, core dumps, limitations and small documentation and small support.

Compiler input language

The compiler does not take as input full HPF, but includes both simplification and extensions (FC directives).

Base language

The base language is mainly FORTRAN 77, plus some extensions toward HPF.

The input language syntax is restricted to FORTRAN 77, and more precisely to its PIPS FORTRAN subset. Few features (computed GOTO and substrings) are missing, but it matches the standard closely.
HPF directives start on column 1 with [Cc!\*][Hh][Pp][Ff]$.
FC (Fabien COELHO) directives with [Cc!\*][Ff][Cc][Dd]$.
Very simple forall instructions without continuation are handled. They must be independent.
```
!hpf$ independent
      forall(i=1:n) a(i)=a(i)+1
```

! style FORTRAN 90 comments are managed in directives.

!hpf$ dynamic A     ! here is a comment
!hpf$ template T(n) ! and another...

6th column and & style continuations are handled in directives.

!hpf$ processors
!hpf$x   P(hpf_number_of_processors())
!hpf$ distribute T(block) &
!hpf$    onto P

The directive analysis does handle FORTRAN lexical issues in directives. However, I do not recommand to use this feature which is a bad Fortran custom.

Static mapping directives

The HPF mapping directives are both simplified and extended on a syntactic point of view:

Arrays may be used as templates.

      real A(n,n), B(n,n), C(n)
!hpf$ align B(i,j) with A(i,j)
*hpf$ align C(i) with A(*,i)

No alignment trees (A is aligned with B which is aligned with C).
Dummy variable-based or implicit syntax for alignments (the implicit syntax is not strictly conformant but definitely convenient)
```
!hpf$ align A(i,j) with T(j,i)
!hpf$ align D with T2
```

list of distributee and alignee are allowed:

!hpf$ align A(i,j), B(j,k) with T(i,k)
!hpf$ distribute T(block,*), D(*,cyclic) onto P

The default distribution is block on each dimension of the processor arrangement.

!hpf$ template T(n,n,n)
!hpf$ processors P4(2,2)
!hpf$ distribute T onto P4 ! == T(block,block,*)

The default processors arrangement is a vector of 8 processors named hpfcX.

!hpf$ distribute T(cyclic)
!
! same as
!
!hpf$ processors hpfcX(8)
!hpf$ distribute T(cyclic) onto hpfcX

Static HPF mapping directives are allowed for variables which are local or in commons. This is not strictly speaking HPF conformant. They cannot be remapped. Commons with distributed arrays should be declared exactly the same in the different subroutines and functions. Overlaps are managed thanks to the interprocedural compilation.
```
      integer n
      parameter(n=1000)
      common /shared/ V(n,n), U(n,n), vmax
      real V, U, vmax
!hpf$ align with U :: V
!hpf$ distribute U(block,block) onto P
```

All sizes of HPF objects (distributed arrays, templates, processors) must be statically known. hpf_number_of_processors() is given a default value.

      integer n,m
      parameter (n=10,m=10000)
      real A(m,n)
!hpf$ template T(m,n), T2(100,100,100)
!hpf$ template D(100,m)
!hpf$ processors P(hpf_number_of_processors())
!hpf$ processors p2(4,4), proc(n)

Some simple so called free style is handled:

!hpf$ align with T :: A, B
!hpf$ distribute onto P :: T(block)
! 
!hpf$ re align with T(i) :: C(i)
!hpf$ redistribute onto P :: T(cyclic)
!
!hpf$ distribute (block) onto P :: X, Y
!hpf$ realign (i,j) with D(j,i) :: V, W

Dynamic mapping directives

The mapping of an array may be modified dynamically within the program or at subroutine boundaries. Dynamic directives are as constrained as static directives.

Dynamic HPF mapping directives (realign, redistribute) are only allowed on local arrays, and so that there is only one reaching distribution for each array reference in an instruction.

!hpf$ distribute A(block,block)
      if (a(1,1).lt.1.0) then
!hpf$   redistribute A(block,cyclic)
!hpf$   independent(i,j)
        A = A ... 
      endif
!hpf$ redistribute A(block,block) ! back...
      print *, A

Prescriptive and descriptive mappings at subroutine interface are allowed. Thus full arrays can be passed as subroutine arguments.

      subroutine predes(X,Y)
      real X(100), Y(100)
!hpf$ distribute X * (block) ! descriptive
!hpf$ distribute Y(block)    ! prescriptive

Local sections of distributed arrays may be passed to a pure function within a parallel loop, if properly aligned...

      subroutine fft1(signal,spectrum)
!fcd$ pure ! equivalent to the f90 attribute...
      complex signal(255), spectrum(255)
      ...
      end

      subroutine process
      complex signal(255,255), spectrum(255,255)
!hpf$ align with signal:: spectrum
!hpf$ distribute signal(*,block)
!hpf$ independent
      do i=1, 255
        call fft1(signal(1,i),spectrum(1,i)) ! signal(:,i)
      enddo

Directive inherit is handled as a default prescriptive block distribution, thus both following directives are strictly equivalent for Hpfc:
```
!hpf$ inherit X
!hpf$ distribute X
```

HPF Parallelism directives

Parallelism can be expressed:

Directives independent and new are taken into account. They are managed orthogonaly and homogeneously. Moreover independent can be specified the parallel loop indexes. Thus one can write
```
!hpf$ new(x),
!hpf$.   independent(j,i)
      do j=1, n
        do i=1, n
          x = A(i,j)*A(i,j)+B(i,j)
          C(i,j) = x*x+x
        enddo
      enddo
```
Reduction functions (sum, prod, min, max) are recognized by name and with a special FORTRAN 77 syntax to please the parser.
```
      real A(n)
      x = redmin1(A, 5, n-3) ! x = MIN(A(5:n-3))
```

The reduction directive is also implemented for +, *, min and max:

      s = 2.1
!hpf$ independent, 
!hpf$x    reduction(s)
      do i=2, n-1 
        s = s + a(i)
      enddo
      print *, 'sum is ', s

There is a pure directive (instead of an attribute to a function or subroutine, because the compiler deal with FORTRAN 77 only). They should only deal with scalar data. Actually it declares elemental functions.

      program foo
      external myop
!fcd$ pure myop ! the external real myop function is pure
      real A(100)
!hpf$ distribute A
!hpf$ independent
      do i=1, n
         A(i) = myop(A(i))
      enddo
      end
      real function myop(x)
!fcd$ pure ! the current function is pure
      myop = x*x+sqrt(sqrt(x))
      end

Extensions to HPF

Hpfc incorporates many extensions of HPF, the need of which was encountered in real applications.

There is an io directive (functions declared io are just called by the host after a gather of the needed data).

      program blah
!fcd$ io output ! output subroutine is io...
      ...
      call output(A)
      end
      subroutine output(x)
!fcd$ io ! current function is io
      print *, 'here is the output', x
      end

Code sections can be managed by the host only, without fine grain update of the values on the nodes, with host and end host. All user's functions called within such section should be either pure or io. Example:

c m will be updated on the node only once
c   a correct value is obtained from the user...
cfcd$ host
 10   print *, 'm = ?'
      read *, m
      if (m.lt.0.or.m.gt.100) then
         print *, 'invalid value, retry'
         goto 10
      endif
cfcd$ end host

Simple sections of codes can be localized with local and end local. The point is to tell the compiler about a variables private to a section of code. For instance:

cfcd$ local, new(c1,w)
      c1 = 4/((1/v(np-1,np))+(1/v(np-1,np-1)) +
     .  (1/v(np-2,np))+(1/v(np-2,np-1)) )
      w = c1*deltat
      be(np-1,1) = 2*w/(4*w+3*h)
      be(np-1,2) = 3*h/(4*w+3*h)
      v(np-1,np)  = c1
cfcd$ end local

Synchronization (synchro) and timing (time/end time) functions can be requested for performance measurements. Timing functions can be nested (up to ten). They must apply on a section of code. The message attached to the end is displayed together with the time. These directives cannot appear within a parallel loop. Such an approach with special directives should have been chosen by vendors instead of propriatary calls (for instance the DEC compiler handles call hpf_synchro() to synchronize HPF programs on user request, what cannot be considered as portable).
```
!fcd$ synchro
      print *, 'let us go'
!fcd$ time
      do i=1, 10
!fcd$   time
        call foo
!fcd$   end time('foo call')
      enddo
!fcd$ end time('full loop')
!fcd$ synchro
      print *, 'that is all'
```
time on and time off are obsolescent directives for time and end time.
The tell directive allows to tell about the progression of a run onto stderr, and can be used at development time.
```
      program foo
!fcd$ tell('foo -> bla')
      call bla
!fcd$ tell('foo -> arg')
      call arg
      end
```

Arrays the values of which are dead at a point can be marked as such with the kill directive. For remappings on dead arrays, no communication will occur at runtime.

!hpf$ template T(n)
!hpf$ align with T :: A
!hpf$ processors P(4)
!hpf$ distribute T(block) onto P
!
! A is used
!
!fcd$ kill A ! I know the value will not be reused...
!hpf$ redistribute T(cyclic) onto P
!
! A is used again, but defined before being referenced.
!

For differentiating a sequential and parallel version of a same computation (the best way to compute something can be different if you take the machine characteristics into account), the user may rely on HPFC to run cpp over its code with the __HPFC__ macro defined. The idea is not to impose the less performant solution to be applied on the code for sequential execution because it is a mandatory way of doing things in parallel.
```
#if (defined(__HPFC__))
      mm=0.
!hpf$ independent, &
!hpf$   reduction(mm)
      do i=1, n
        mm = max(abs(a(i)),mm)
      end do
      if (mm.ge.eps) goto 10
#else
      do i=1, n
        if (abs(a(i)).ge.eps) goto 10
      end do
#endif

  10  continue
```
The set directive (with int and bool variants) allows to switch compiler options from the source code. They are declaration directives. The options are those reported in the properties of PIPS. To be manipulated with great care. Introduced for validation purposes.
```
!fcd$ set bool('HPFC_USE_BUFFERS',1)   ! set to TRUE
!fcd$ set int ('HPFC_BUFFER_SIZE',500000) 
!fcd$ set bool('HPFC_LAZY_MESSAGES',0) ! set to FALSE
!fcd$ setbool ('HPFC_IGNORE_FCD_SYNCHRO',1) ! no SYNCHRO
```
The fake directive tells the compiler that the source code provided for a given function is a fake one which mimics its effects. Thus the compiler will not attempt to generate any code for this function. It is simply understood as ``do not compile me''. This feature is used for interfacing Hpfc with XPOMP. A fake function must be either pure or io.
```
      subroutine xpomp_get_depth(d)
      integer d
!fcd$ io
!fcd$ fake
      read *, d
      end
```
Fake io and pure function will not be compiled by the Hpfc compiler. However, these function will have to be provided for linking the final executables. The special ld io and ld pure directives allow this information to be provided as additional shell-interpreted linker options:
```
! Tells where to find xpomp functions:
!fcd$ ld io -L$XPOMP_RUNTIME -lxpomp
```
The io version is linked to the host and initial (sequential) executables. The pure applies to all executables.
When using external library function in programs submitted PIPS, it is required to provide stub functions that mimics the effects of the library functions. Files of stubs to be included can be specified directly within the source code with the special stubs directive. The file is searched in the current directory and in the $PIPS_ROOT/Share directory, with the .direct (preprocessed) then the .f suffix.
```
! Tells the name of the stub file.
!fcd$ stubs xpomp_stubs.f
```

Compiler output

The compiler generates code for the SPMD message passing programming model. It includes many optimizations. The generated code is based on a runtime support library. The communications are handled by PVM. The whole compilation process is driven and automated by shell scripts.

Target programming model

Hpfc's target programming model is a host-node distributed memory message passing model. The host is in charge of I/Os and of the process management. The nodes perform the computations. Two programs are generated by the compiler. One for the host, and an SPMD code for the nodes, which is parametrized by the node id. The generated codes rely on a runtime library which provides many services. Lower level issues in the process management (spawning of tasks, etc.) and basic communications are handled by PVM.

Compiler optimizations

The basic compilation technique is known as the runtime resolution of accesses to distributed arrays. Each processor executes the whole control flow of the original program, and each reference to distributed data is resolved dynamically, and elementary messages are exchanged, following the Owner Computes Rule. This is of course very inefficient, and optimizations are needed if the program is expected to run quicker on a parallel machine.

The following optimizations are implemented within Hpfc:

Distributed array declarations are reduced on the nodes.
Reductions are compiled thru calls to dedicated runtime functions.
Overlap analysis.
Conditions:
- Independent loop nest, maybe non perfectly nested.
- Rectangular iteration space, maybe not statically known.
- Multidimensional block distributions.
- Arrays should be nearly aligned one with the other.
- No alignement stride, no replication of written arrays.
- flip-flops are allowed.
Output:
- New loop nest on local data for the nodes.
- Communications are vectorized outside of the loop nest.
- Array declarations are extended to store data from neighbour processors on overlap areas.
I/O communications (collect to/update from host) are specially managed with a polyhedron-based technique (abstract [3]).
1D shift on rectangular areas of block-distributed stride-1 aligned arrays are recognized and compiled thru calls to appropriate runtime support functions. It is not far from a hack, but it allows to handle wraparound computations. Example:
```
!hpf$ distribute V(block,block) onto P
!hpf$ independent
      do i=2, n-1
         V(1,i) = V(n,i)
      enddo
```
loop nests that fully copy an array into another, both being aligned, are especially handled. One more hack.
Remappings (realign and redistribute) are taken into account (abstract [4]).
And more linear algebra techniques to come (abstract [2])
Dynamic arrays with similar but distinct mappings (that is the mapping directives are different but results in the very same actual mapping on the processors) are merged. Such instances typically appear if a realign for replication is requested, although not all dimensions are actually distributed (replicating a non-distributed dimension as no effect on the actual data mapping). The realign depends on the algorithms, and the distribution choice may change for various processors arrangements considered for the program.
```
!hpf$ align A(i,j) with T(i,j)
!hpf$ distribute T(block,*)
!
! is similar to
!
!hpf$ align A(i,j) with T(j,i)
!hpf$ distribute T(*,block)
```

Runtime library

The Hpfc generated code relies on a runtime library and on runtime datastrutures. These functions are written in FORTRAN 77 on top of PVM3. Many functions are generated automatically from m4 templates, depending on the manipulated data types and arities. Parts of the library are ported to use the CM5 communication library instead of PVM3. The runtime library was successfully compiled on the following PVM architectures: SUN4 SUNMP RS6K ALPHA CM5.

Runtime data structures:

HPF objects description
- Arity and object bounds
- Alignements and Distributions
- Distributed arrays new declarations
PVM processes management and communication synchronization
- PVM process identifiers for the host and the nodes
- PVM identity of the process
- Number of messages exchanged with each other processes
- Number of multicasts
- HPF processors/PVM processes mapping
Current status (library internal data)

Runtime functions:

Runtime data structures initialization
Processes initializations
Ownership computations (for the runtime resolution)
Some coherency checkings
High level packing functions
- The data must be block-distributed and stride-1 aligned
- A local rectangular area is packed
- Any type and arity
Reduction functions for distributed data
- The data must be block-distributed and stride-1 aligned
- The accessed area must be rectangular
- Implemented: MIN, MAX, SUM, PROD
- Any type and arity (when it makes sense)
Subarray shift functions
- The data must be block-distributed and stride-1 aligned
- The shifted area must be rectangular
- The shift must be on one dimension

Driver and utilities

Hpfc is integrated in PIPS, a prototype parallelizer. There are many utilities used by Hpfc or that can be used to manipulate Hpfc.

hpfc, the compiler driver: Allows to call the Hpfc many scripts.
- interactive sessions (thanks to GNU readline library)
- compile (HPF to F77 with pips)
- make (F77 to PVM executables)
- run (the original sequential or new parallel codes)
- delete (hpfc directories)
- clean (pips directories)
- pvm (start, halt and so on)
- ...
shells used by the compiler:
- hpfc_directives: manage HPF directives.
- hpfc_add_includes: adds file inclusions.
- hpfc_add_warning: adds a warning to automatically generated files.
- hpfc_generate_init: generates an init file
- hpfc_generate_h: generates a header file from a FORTRAN file.
- hpfc_install: installation of the needed and generated files in a directory.
Options:
- for inlining or not global to local runtime resolution computations.
- for managing especially buffers over PVM.
- for ignoring or not timing and synchronization directives.
- for applying the dynamic liveness analysis of array copies.
Implementation overview
- hpfc is the compiler driver shell.
- HPF directives are translated into fake FORTRAN calls to reuse the PIPS parser using a shell and sed script.
- These calls are filtered out of the parsed code and analyzed to set internal data structures which describe array mappings and so.
- New declarations are computed for the distributed arrays.
- The input code is analyzed by PIPS (array regions, preconditions).
- Then 2 codes are generated for each subroutine. One for the host and an SPMD code for the nodes. Implemented as a double rewriting of the AST.
- Initialization codes for the runtime data structures are also generated.
- Finally common declarations are dealt with, plus global initializations.
- The generated codes are compiled and linked to PVM and the Hpfc Runtime library. Two executables are built (host and nodes). The host spawns the nodes.

Figure 1: HPFC overview

Figure 1 shows an overview of Hpfc within PIPS, decomposed into several interacting phases within the compiler.

Conclusion

Hpfc is a prototype HPF compiler in which some of our techniques have been implemented. It serves as a realistic testbed for testing new optimization techniques. The compiler includes 18,000 lines of ANSI C code as a PIPS module, plus 1,300 of shell and other sed scripts for the drivers, and finally 3,500 lines of M4/FORTRAN 77 for the runtime support library (expanded to +10,000).

Related work

Prototype compilers that generate code from FORTRAN (more or less HPF) or C for distributed-memory machines: See the excellent survey maintained by Jean-Louis Pazat at IRISA.

Adaptor (GMD)
Annai (CSCS-NEC)
DEC f90 (Digital)
D System (Rice, Syracuse)
EPPP (CRI Montr�al)
Fx (Carnegie Mellon University)
HPFC (�cole des mines de Paris)
HPF Mapper (NA Software)
Pandore (IRISA)
Paradigm (CRHPC, University of Illinois at Urbana-Champaign)
pghpf (Portland Group Inc.)
PPPE (Esprit Project)
Prepare (Esprit Project)
sHPF Translator (University of Southampton)
vastHPF (Pacific-Sierra Research)
VFCS (University of Vienna)
xHPF (Applied Parallel Research)
xlhpf (IBM)

References

1: Luc Bougé. ParaDigme - Le parallélisme de données comme modèle fondamental pour le calcul massivement parallèle. Rapport final, ENS-Lyon, December 1994. Th�me 10 : Langages, modèles d'exécution et architectres pour le parallélisme massif, appel d'offres du Ministère de la Recherche et de l'Espace.
2: Fabien Coelho. Étude de la Compilation du High Performance Fortran. Master's thesis, Université Paris VI, September 1993. Rapport de DEA Systèmes Informatiques. TR EMP E/178/CRI.
3: Fabien Coelho. Compilation of I/O Communications for HPF. In 5th Symposium on the Frontiers of Massively Parallel Computation, pages 102-109, February 1995.
4: Fabien Coelho and Corinne Ancourt. Optimal Compilation of HPF Remappings. CRI TR A 277, CRI, École des mines de Paris, October 1995. To appear in JPDC, 1996.
5: François Irigoin, Pierre Jouvelot, and Rémi Triolet. Semantical interprocedural parallelization: An overview of the PIPS project. In ACM International Conference on Supercomputing, pages 144-151, June 1991.
6: Ronan Keryell. WPips and EPips User Manual (Parall�liseur Interproc�dural de Programmes Scientifiques) -- Linear Algebra based Automatic Parallelizer and Program Transformer. CRI, École des mines de Paris, April 1996. Also report A-288.
7: Ronan Keryell, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, François Irigoin, and Pierre Jouvelot. PIPS: A Framework for Building Interprocedural Compilers, Parallelizers and Optimizers. Technical Report 289, CRI, École des mines de Paris, April 1996.

This document URL is: http://www.cri.ensmp.fr/pips/hpfc.html

Fabien COELHO
Wed Mar 26 13:24:56 MET 1997