Cell library for the LQCD

Claude TADONKI
LAL/IN2P3/CNRS - University of Orsay

Overview and basic assumptions

CellQCD is a set of CELL-accelerated routines for basic LQCD calculations. Each routine is highly optimized using state-of-the-art CELL acceleration techniques, and an additional optimization applies to repetitive calls, as in a global iterative process. Execution is managed from the PPU in a way that is seamless for the programmer, and all available SPEs are requested and held permanently.

Data structures

typedef struct{
  double re, im;
} complex;

typedef struct {
  complex c00, c01, c02, c10, c11, c12, c20, c21, c22;
} su3;

typedef struct{
  complex c0, c1, c2;
} su3_vector;

typedef struct{
  su3_vector s0, s1, s2, s3;
} spinor;

Equivalent data types should also work, provided the arguments are explicitly cast. The following declarations are equivalent:

complex c;	double c[2];
su3 u;	double u[18];
su3_vector v;	double v[6];
spinor s;	double s[24];

If a function is declared as f(spinor *s), it can be called on either representation: with spinor s; the call is f(&s), and with double s[24]; the call is f((spinor *)s). For an array of spinors (spinor s[N];), f(s) is valid as written.
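The equivalence above relies on the structures containing nothing but doubles, with no padding. The following is a minimal sketch that checks this at compile time and shows the cast from a flat buffer; the helpers spinor_doubles and as_spinors are hypothetical names introduced here for illustration and are not part of the library.

```c
#include <stddef.h>

/* Types copied from the Data structures section. Note that the name
   `complex` clashes with the macro of <complex.h>, so that header must
   not be included alongside these definitions. */
typedef struct { double re, im; } complex;
typedef struct { complex c00, c01, c02, c10, c11, c12, c20, c21, c22; } su3;
typedef struct { complex c0, c1, c2; } su3_vector;
typedef struct { su3_vector s0, s1, s2, s3; } spinor;

/* Compile-time layout checks (C11): each struct is a packed run of doubles. */
_Static_assert(sizeof(complex)    ==  2 * sizeof(double), "complex = 2 doubles");
_Static_assert(sizeof(su3)        == 18 * sizeof(double), "su3 = 18 doubles");
_Static_assert(sizeof(su3_vector) ==  6 * sizeof(double), "su3_vector = 6 doubles");
_Static_assert(sizeof(spinor)     == 24 * sizeof(double), "spinor = 24 doubles");

/* Hypothetical helper: number of doubles occupied by n spinors. */
size_t spinor_doubles(size_t n)
{
    return n * (sizeof(spinor) / sizeof(double));
}

/* Hypothetical helper: view a flat buffer of doubles as spinors.
   The buffer must hold a multiple of 24 doubles. */
spinor *as_spinors(double *flat)
{
    return (spinor *)flat;
}
```

With double flat[24]; filled as flat[i] = i, the pointer returned by as_spinors(flat) sees s0.c0.re at index 0, s1.c0.re at index 6, and s3.c2.im at index 23.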

Download and global instructions

QS20
1.	cell_lqcd.h
2.	ppe_lqcd.o
3.	lib_spu_lqcd.a

QS22
1.	cell_lqcd.h
2.	ppe_lqcd.o
3.	lib_spu_lqcd.a

Once the files are downloaded (the three files for your hardware version), the steps to use the library are as follows:

Coding instructions

1.	Include the header file cell_lqcd.h wherever the library is called (i.e. #include "cell_lqcd.h").
2.	Call the initialization routine CELL_QCD_INIT();
3.	Call the routines of the library as needed (see the list below).
4.	Call the finalization routine CELL_QCD_FINALIZE();
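The call sequence above can be sketched as follows. This is an illustration under stated assumptions, not the library's own example: the macro HAVE_CELL_LQCD and the function norm_of_constant_field are hypothetical names, it is assumed that cell_lqcd.h provides the spinor type, and the stand-in body of CELL_square_norm (sum of the squares of all components, following the tmLQCD naming) is an assumed semantic used only so that the sketch compiles without the library.

```c
#include <stdlib.h>

#ifdef HAVE_CELL_LQCD   /* hypothetical macro: define it when linking the real library */
#include "cell_lqcd.h"  /* assumed to provide the spinor type and the prototypes */
#else
/* Stand-ins so that the sketch builds on any machine. These are NOT the
   library's implementations; the semantics of CELL_square_norm is assumed. */
typedef struct { double re, im; } complex;
typedef struct { complex c0, c1, c2; } su3_vector;
typedef struct { su3_vector s0, s1, s2, s3; } spinor;

void CELL_QCD_INIT(void) { }
void CELL_QCD_FINALIZE(void) { }

double CELL_square_norm(spinor *S, int N)
{
    const double *d = (const double *)S;
    double sum = 0.0;
    for (long i = 0; i < 24L * N; i++)
        sum += d[i] * d[i];
    return sum;
}
#endif

/* Hypothetical driver: initialize, call one routine, finalize.
   Returns the squared norm of N spinors whose components all equal `value`,
   or -1.0 if the allocation fails. */
double norm_of_constant_field(int N, double value)
{
    spinor *S = malloc((size_t)N * sizeof *S);
    if (S == NULL)
        return -1.0;

    double *d = (double *)S;
    for (long i = 0; i < 24L * N; i++)
        d[i] = value;

    CELL_QCD_INIT();                    /* step 2 */
    double n2 = CELL_square_norm(S, N); /* step 3 */
    CELL_QCD_FINALIZE();                /* step 4 */

    free(S);
    return n2;
}
```

In a real program the initialization and finalization would normally be called once, around the whole sequence of library calls, rather than around each call. The library may also impose alignment requirements on the buffers that are not stated here.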

Running instructions

1.	Compile your code with a command similar to the following: cc mycode.c -o mycode.exe ppe_lqcd.o lib_spu_lqcd.a -L/path_to_cell_sdk_lib -lpthread -lspe2 -lmisc
2.	If you are using a makefile, add the following: LIBS = -L/path_to_cell_sdk_lib -lpthread -lspe2 -lmisc
3.	Depending on your system, you may need to point your LD_LIBRARY_PATH variable to the lib directory of the CELL SDK (e.g. setenv LD_LIBRARY_PATH /opt/cell/sdk/usr/lib).
4.	Run your code as usual.

List of available routines

Each of the following routines is already implemented. Other routines will be added, and the whole library will be improved over time. In any case, the current files should always be regarded, and used, as an update whenever needed. The nomenclature follows that of the tmLQCD package.

List of routines currently implemented

1.	void CELL_QCD_INIT();
2.	void CELL_QCD_FINALIZE();
3.	double CELL_scalar_prod(spinor *S, spinor *R, int N);
4.	double CELL_scalar_prod_r(spinor *S, spinor *R, int N);
5.	double CELL_square_norm(spinor *S, int N);
6.	double CELL_square_norm_assign(spinor *S, spinor *R, int N);
7.	void CELL_assign_diff_mul(spinor *S, spinor *R, complex c, int N);
8.	void CELL_mul_r(spinor *S, spinor *R, double c, int N);
9.	void CELL_assign(spinor *S, spinor *R, int N);
10.	void CELL_assign_mul_add_r(spinor *S, spinor *R, double c, int N);
11.	void CELL_assign_diff_mul_serie(spinor *S, spinor **R, complex *c, unsigned int length, int N);
12.	void build_dependence_indices(unsigned int *dep_indices, int N, int *g_eo2lexic, int *g_lexic2eosub, int *iup, int *idn); or
	void build_spinor_indices(unsigned int i_start, unsigned int i_end, unsigned int *dep_indices);
13.	void CELL_Hopping_Matrix(spinor *l, spinor *k, int i_length, int i_base, complex ka0, complex ka1, complex ka2, complex ka3, unsigned int *dep_indices, su3 *U0);
14.	void CELL_H_eo_tm_inv_psi(spinor *l, spinor *k, int i_length, int i_base, complex ka0, complex ka1, complex ka2, complex ka3, unsigned int *dep_indices, su3 *U0, double sign, double g_mu);
15.	void CELL_diff(spinor *Q, spinor *S, spinor *R, int N);
16.	void CELL_mul_one_pm_imu_sub_mul_gamma5(spinor *l, spinor *k, spinor *j, double _sign, double g_mu, int N);
17.	void CELL_mul_one_pm_imu_inv(spinor *l, double _sign, double g_mu, int N);
18.	double now();
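When integrating the routines into an existing code, it can help to cross-check their results against plain C versions. The sketch below gives portable reference versions of two of the linear-algebra routines. Everything in it is an assumption made for illustration: the names ref_scalar_prod_r, ref_diff and agrees are hypothetical, and the semantics (real part of the scalar product; Q = S - R for the difference) is inferred from the tmLQCD naming and from the argument order of the prototypes, not confirmed by the library.

```c
#include <stddef.h>

/* Types copied from the Data structures section. */
typedef struct { double re, im; } complex;
typedef struct { complex c0, c1, c2; } su3_vector;
typedef struct { su3_vector s0, s1, s2, s3; } spinor;

/* Hypothetical reference for CELL_scalar_prod_r (ASSUMED semantics):
   Re <S, R> = sum over components of (Re S * Re R + Im S * Im R),
   which is a plain dot product over the 24*N underlying doubles. */
double ref_scalar_prod_r(const spinor *S, const spinor *R, int N)
{
    const double *s = (const double *)S;
    const double *r = (const double *)R;
    double sum = 0.0;
    for (size_t i = 0; i < (size_t)N * 24; i++)
        sum += s[i] * r[i];
    return sum;
}

/* Hypothetical reference for CELL_diff (ASSUMED operand order): Q = S - R. */
void ref_diff(spinor *Q, const spinor *S, const spinor *R, int N)
{
    double *q = (double *)Q;
    const double *s = (const double *)S;
    const double *r = (const double *)R;
    for (size_t i = 0; i < (size_t)N * 24; i++)
        q[i] = s[i] - r[i];
}

/* Relative comparison. A parallel reduction sums in a different order than
   the serial loop above, so the two results can differ by rounding and
   should not be compared with ==. */
int agrees(double a, double b, double tol)
{
    double abs_a = a < 0.0 ? -a : a;
    double abs_b = b < 0.0 ? -b : b;
    double scale = abs_a > abs_b ? abs_a : abs_b;
    double delta = a > b ? a - b : b - a;
    return delta <= tol * (scale > 0.0 ? scale : 1.0);
}
```

A check would then take the form agrees(CELL_scalar_prod_r(S, R, N), ref_scalar_prod_r(S, R, N), 1e-12), where the tolerance is a choice to be adapted to the lattice size.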

Benchmark

Using our library within the tmLQCD package, we got the following results.

Illustrative results

(see the library)

A 32×16³ configuration solved using the CGR algorithm on the PPU in 138 seconds

The same configuration and algorithm on the PPU + 8 SPEs, in double precision, in 4.58 seconds

Algorithm	Iterations	QS20	QS22	Intel 2.83 GHz
CGR	57	4.58 s	3.68 s	28.34 s
CG	685	51.70 s	38.80 s	362.45 s

Compared to the PPU alone, a speedup of about 30 was obtained (this also holds per iteration).
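The speedup figure follows directly from the timings quoted above. As a small worked example, the two helpers below (hypothetical names, introduced only for this illustration) compute the ratio of two run times and the average time per iteration.

```c
/* Speedup of an accelerated run relative to a reference run:
   reference time divided by accelerated time (same units). */
double speedup(double t_ref, double t_acc)
{
    return t_ref / t_acc;
}

/* Average wall-clock time of one solver iteration. */
double time_per_iteration(double t_total, int iterations)
{
    return t_total / (double)iterations;
}
```

With the CGR figures, speedup(138.0, 4.58) is about 30.1, and time_per_iteration(4.58, 57) is about 0.080 s on the QS20. Because both runs solve the same configuration with the same algorithm, the per-iteration ratio equals the ratio of the total times when the iteration count is the same.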

Details about this specific achievement are here.