**Automatic Data Allocation, Buffer Management and Data** movement for Multi-GPU **Machines** 

Thejas Ramashekar MSc Engg (Thesis Defence) Advisor: Dr. Uday Bondhugula Indian Institute of Science

### **A Typical HPC Setup**





### **Multi-GPU Machine**



### **Multi-GPU Setup - Key properties**

• Distributed memory architecture

### **Multi-GPU Setup - Key properties**

- Distributed memory architecture
- Limited GPU memory (512 MB to 6 GB)

### **Multi-GPU Setup - Key properties**

- Distributed memory architecture
- Limited GPU memory (512 MB to 6 GB)
- Limited PCIex bandwidth (Max 8 GB/s)

#### **Affine loop nests**

 Loop nests which have affine bounds and the array access functions in the computation statements are affine functions of outer loop iterators and program parameters

#### **Affine loop nests**

- Loop nests which have affine bounds and the array access functions in the computation statements are affine functions of outer loop iterators and program parameters
- eg: stencils, linear-algebra kernels, dynamic programming codes, data mining applications

#### **Affine loop nests**

- Loop nests which have affine bounds and the array access functions in the computation statements are affine functions of outer loop iterators and program parameters
- eg: stencils, linear-algebra kernels, dynamic programming codes, data mining applications
- eg: Floyd-Warshall

affine bounds affine access function for (k=0; k=1) / \* outer serial loop \*/ for (i=0; i < N; i+) /\* outer most parallel loop for (i=0; i < N; i++)path[i][j] = ((path[i][k]+path[k][j]) < path[i][j])? path[i][k]+path[k][j]: path[i][j];

## Running an affine loop nest on multi-GPU machine



### Structure of an affine loop nest for multi-GPU machine



# The need for a multi-GPU memory manager

• Manual programming of multi-GPU systems is tedious, error-prone and time consuming

# The need for a multi-GPU memory manager

- Manual programming of multi-GPU systems is tedious, error-prone and time consuming
- Existing works are either:
  - Manual application specific techniques or
  - Have inefficiencies in terms of data allocation sizes, reuse exploitation, inter-GPU coherency etc

• The desired abilities for a multi-GPU memory manager are:

• The desired abilities for a multi-GPU memory manager are:

• To identify and minimize data allocation sizes

- The desired abilities for a multi-GPU memory manager are:
  - To identify and minimize data allocation sizes
  - To reuse data already present on the GPU

- The desired abilities for a multi-GPU memory manager are:
  - To identify and minimize data allocation sizes
  - To reuse data already present on the GPU
  - To keep data transfers minimal and efficient

- The desired abilities for a multi-GPU memory manager are:
  - To identify and minimize data allocation sizes
  - To reuse data already present on the GPU
  - To keep data transfers minimal and efficient
  - To achieve all the above with minimal overhead

```
for (k=0; k < N; k++) /* outer serial loop */
for (i=0; i < N; i++) /* outer most parallel loop */
  for (j=0; j < N; j++)
    path[i][j] = ((path[i][k]+path[k][j]) < path[i][j])? path[i][k]+path[k][j]: path[i][j];
               Computation Tile
                                    Exact Accessed Region 🔲 Bounding Box
              tile size = 4 \times 8
                                                          path[i][j] path[i][k]
                                   path[i][j] path[i][k]
                                  0 0 0 0 0 0 0 0
                                                                00000
                                    0000000
                                    0000000
                                  000000 path
                                  0 0 0 0 0 0 0 0 [k] [j]
                                                          0000000 [k] [j]
                                                       (c) bounding boxes
         (a) Iteration space of
                                (b) Exact array regions
                                  accessed by the tile
         a tile (size: 4 \ge 8, k=7)
```







### Key insights on bounding boxes

• Two key insights:

### Key insights on bounding boxes

- Two key insights:
  - Bounding boxes can be subjected to standard set operations at runtime with negligible overhead

### Key insights on bounding boxes

- Two key insights:
  - Bounding boxes can be subjected to standard set operations at runtime with negligible overhead
  - GPUs have architectural support for fast rectangular copies

























#### Negligible runtime overhead

## Architectural support for rectangular transfers

- Architectural support for rectangular transfers on GPU
- Support from programming models such as OpenCL and CUDA
   eg: clEnqueueReadBufferRect()
   and clEnqueueWriteBufferRect()



# The Bounding Box based memory manager (BBMM)

• Compiler-assisted runtime scheme

# The Bounding Box based memory manager (BBMM)

- Compiler-assisted runtime scheme
- Compile-time uses static analysis to identify regions of data accessed by a loop nest in terms of bounding boxes

# The Bounding Box based memory manager (BBMM)

- Compiler-assisted runtime scheme
- Compile-time uses static analysis to identify regions of data accessed by a loop nest in terms of bounding boxes
- Runtime refines these initial bounding boxes into a set of disjoint bounding boxes

# The Bounding Box based memory manager (BBMM)

- Compiler-assisted runtime scheme
- Compile-time uses static analysis to identify regions of data accessed by a loop nest in terms of bounding boxes
- Runtime refines these initial bounding boxes into a set of disjoint bounding boxes
- All data transfers are done in terms of bounding boxes

#### **Overview of BBMM**



#### **Data allocation scheme**



# **Buffer Management**

- Two lists per GPU
   inuse list
  - unused list
- Each bounding box has an associated usage count
- Flags to indicate readonly/read-write etc





# Important features of the Buffer Manager

- Inter-tile data reuse
  - Reuse data already present on the GPU
- Box-in/box-out
  - Ability to make space on the GPU when it runs out of memory

• Based on our previous work:

• Based on our previous work:

Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, and Uday Bondhugula. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed Memory. In ACM PACT 2013.

• Identify the data to be communicated from a source tile due to flow (RAW) dependences called the *Flow-out* set

• Based on our previous work:

- Identify the data to be communicated from a source tile due to flow (RAW) dependences called the *Flow-out* set
- Further refine the Flow-out set using a technique called source-distinct-partitioning

• Based on our previous work:

- Identify the data to be communicated from a source tile due to flow (RAW) dependences called the *Flow-out* set
- Further refine the Flow-out set using a technique called source-distinct-partitioning
- Eliminates both unnecessary and duplicate data transfers

• Based on our previous work:

- Identify the data to be communicated from a source tile due to flow (RAW) dependences called the *Flow-out* set
- Further refine the Flow-out set using a technique called source-distinct-partitioning
- Eliminates both unnecessary and duplicate data transfers
- The scheme has been demonstrated to work well on both distributed memory and heterogeneous systems

## Inter-GPU coherency (cont)

 BBMM extracts the flow-out sets as flow-out bounding boxes



# Inter-GPU coherency (cont)

- BBMM extracts the flow-out sets as flow-out bounding boxes
- The flow-out bounding box of a tile is copied out from the source GPU onto the host CPU



# Inter-GPU coherency (cont)

- BBMM extracts the flow-out sets as flow-out bounding boxes
- The flow-out bounding box of a tile is copied out from the source GPU onto the host CPU
- If any other GPU contains the same bounding box, it is updated with a flow-in transfer
- If no GPU currently has that bounding box, the updated data is retained on the CPU



• The compile-time component integrated into polyhedral source-to-source transformer - Pluto

- The compile-time component integrated into polyhedral source-to-source transformer Pluto
- The input to the compile-time is the sequential C code containing a set of affine loop nests

- The compile-time component integrated into polyhedral source-to-source transformer Pluto
- The input to the compile-time is the sequential C code containing a set of affine loop nests
- Pluto creates a tiled and parallelized version of the input code

- The compile-time component integrated into polyhedral source-to-source transformer Pluto
- The input to the compile-time is the sequential C code containing a set of affine loop nests
- Pluto creates a tiled and parallelized version of the input code
- BBMM's compile-time component takes this tiled and parallelized code as input and generates the following:
  - A set of initial and flow-out bounding boxes
  - The code similar to the host code structure shown in Algorithm 4.

- The compile-time component integrated into polyhedral source-to-source transformer Pluto
- The input to the compile-time is the sequential C code containing a set of affine loop nests
- Pluto creates a tiled and parallelized version of the input code
- BBMM's compile-time component takes this tiled and parallelized code as input and generates the following:
  - A set of initial and flow-out bounding boxes
  - The code similar to the host code structure shown earlier
- The runtime component is implemented as stand-alone C library.

#### **Evaluation and Results**

## **Evaluation and Results**

- Setup
  - A multi-GPU machine consisting of 3 NVIDIA Tesla
     c2050 (fermi) GPUs and 1 NVIDIA Tesla K20 (Kepler) with 2.5 GB of memory each
  - A 12-core CPU system as the host

## **Evaluation and Results**

#### • Setup

- A multi-GPU machine consisting of 3 NVIDIA Tesla
   c2050 (fermi) GPUs and 1 NVIDIA Tesla K20 (Kepler) with 2.5 GB of memory each
- A 12-core CPU system as the host

#### • Benchmarks

| Program  | Source    | Dep pattern   | A | В | Data size on 1 GPU |           | D | Е   | F     |
|----------|-----------|---------------|---|---|--------------------|-----------|---|-----|-------|
|          |           |               |   |   | Array sizes        | Size (GB) |   |     |       |
| floyd    | Polybench | non-uniform   | 1 | 2 | 16384 x 16384      | 2.0       | 2 | yes | 0.05% |
| heat2d   | Pochoir   | uniform       | 2 | 2 | 12288 x 12288      | 2.25      | 4 | yes | 0.10% |
| fdtd2d   | Polybench | uniform       | 3 | 2 | 10240 x 10240      | 2.4       | 2 | yes | 0.06% |
| heat3d   | Pochoir   | uniform       | 2 | 3 | 512x512x512        | 2.0       | 4 | yes | 0.04% |
| lu       | Polybench | non-uniform   | 1 | 2 | 16384 x 16384      | 2.0       | 3 | yes | 0.07% |
| adi      | Polybench | uniform       | 3 | 2 | 8192 x 8192        | 1.5       | 2 | yes | 0.01% |
| mvt      | Polybench | $\mathbf{EP}$ | 3 | 2 | 20480 x 10240      | 1.5       | 1 | no  | 0.01% |
| bscholes | NVIDIA    | EP            | 3 | 1 | 67,108,864         | 1.5       | 1 | no  | 0.01% |

A: number of arrays. B: maximum dimensionality of arrays C: bounding box type chosen by our algorithm D: maximum number of bounding boxes for any array E: subsumed bounding boxes present? F: BBMM runtime overhead as a percentage of overall execution time

• Overhead of the runtime library

- Overhead of the runtime library
- Comparison of data allocation sizes

- Overhead of the runtime library
- Comparison of data allocation sizes
- Performance with data scaling

- Overhead of the runtime library
- Comparison of data allocation sizes
- Performance with data scaling
- Comparison with manually written code

- Overhead of the runtime library
- Comparison of data allocation sizes
- Performance with data scaling
- Comparison with manually written code
- Performance with box-in/box-out

- Overhead of the runtime library
- Comparison of data allocation sizes
- Performance with data scaling
- Comparison with manually written code
- Performance with box-in/box-out
- Benefits of inter-tile data reuse

- Overhead of the runtime library
- Comparison of data allocation sizes
- Performance with data scaling
- Comparison with manually written code
- Performance with box-in/box-out
- Benefits of inter-tile data reuse
- Performance with access function split

## **Overhead of runtime library**

overhead\_percentage = (memory\_mgmt\_time / total\_execution\_time) \* 100

 For all programs, the runtime overhead was less than 0.1% of the total execution time of the program (hence insignificant)

#### **Comparison of data allocation sizes**

- Up to 75% reduction on a 4-GPU machine compared to convex bounding box scheme
- Equal to the exact data sizes required (manually computed) for all cases



### **Performance with data scaling**

- Data scaling similar to weak scaling but with emphasis on data size (memory utilization) rather than on problem size (computation)
- Hence we consider the periteration speedup
- The per-iteration time includes all overhead: data allocation time, compute time, flow-out time, flow-in time and write-out time
- BBMM affects all the above except compute time
- Mean speedup of 0.94 indicating near-ideal speedup



# Comparison with manually written code

- Manual code has following optimizations:
  - Optimized to have theoretically minimum data allocation sizes and coherency volume
  - Reuse exploitation was theoretical maximum
- BBMM at least 88% as efficient as manual OpenCL code
- Outperforms the manual OpenACC code



#### **Benefit of box-in/box-out**

- Significant performance improvements with tiles that have sufficient compute-to-copy ratio
- Without it, significant performance degradation
- With right tiling strategy, the feature can allow applications to work with data sizes significantly larger than available GPU memory



# **Compute-Copy Overlap**



- Hide the data movement overhead within computation time
- Split the computation allocated to a GPU into multiple tiles
- Register a callback to be called at the completion of each tile
- In the callback perform the CopyOut() and CopyIn()
- CopyIn() does not conflict because we work on a distributed parallel loop

# **Maximizing Compute-Copy Overlap**



- Sort the tiles based on size of the CopyOut data
- Schedule them in the sorted order (largest copyout size first)



Data Movement Overhead Comparison Chart

Tile Sizes

#### **Related Work**

| Framework                                                                                                                                                                                                                           | Allocation granular-<br>ity                                                                                                                                                     | Memory mgmt scheme                                                                                                                                                                                            | Manual / Auto                                                                                                            | #devices                                                                                     |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| [Kim et al. 2011]<br>[Augonnet et al. 2009]<br>[Jablin et al. 2011]<br>[Jablin et al. 2012]<br>[Lee and Eigenmann 2010]<br>[Pai et al. 2012]<br>[Baskaran et al. 2010]<br>[Verdoolaege et al. 2013]<br>[OpenACC 2012]<br>BBMM (our) | convex bounding box<br>user-provided<br>entire array<br>entire array<br>entire array<br>x10CUDA Rail<br>entire array<br>entire array<br>entire array<br>disjoint bounding boxes | virtual CPU buffer<br>MSI-based coherency<br>modified runtime libraries<br>modified runtime libraries<br>live variable analysis<br>compiler inserted checks<br>none<br>none<br>none<br>Runtime memory manager | automatic<br>manual<br>automatic<br>automatic<br>user-annotated<br>automatic<br>automatic<br>user-annotated<br>Automatic | multiple<br>multiple<br>single<br>single<br>single<br>single<br>single<br>single<br>multiple |

- We presented a fully automatic data allocation and memory management framework for affine loop nests on multi-GPU machines
- Data allocation, buffer management, inter-GPU coherency were all done at the granularity of bounding boxes

- We presented a fully automatic data allocation and memory management framework for affine loop nests on multi-GPU machines
- Data allocation, buffer management, inter-GPU coherency were all done at the granularity of bounding boxes
- On a 4-GPU machine our scheme was able to:
  - Achieve allocation size reductions of 75% compared existing schemes
  - Comparison to manual OpenCL and OpenACC code showed:
    - Our code yielded a performance of at least 88% of manual OpenCL code
    - Outperformed OpenACC code in all the cases
  - Achieve excellent data scaling

- We presented a fully automatic data allocation and memory management framework for affine loop nests on multi-GPU machines
- Data allocation, buffer management, inter-GPU coherency were all done at the granularity of bounding boxes
- On a 4-GPU machine our scheme was able to:
  - Achieve allocation size reductions of 75% compared existing schemes
  - Comparison to manual OpenCL and OpenACC code showed:
    - Our code yielded a performance of at least 88% of manual OpenCL code
    - Outperformed OpenACC code in all the cases
  - Achieve excellent data scaling
- All the above achieved with an insignificant runtime overhead of 0.1%

- We presented a fully automatic data allocation and memory management framework for affine loop nests on multi-GPU machines
- Data allocation, buffer management, inter-GPU coherency were all done at the granularity of bounding boxes
- On a 4-GPU machine our scheme was able to:
  - Achieve allocation size reductions of 75% compared existing schemes
  - Comparison to manual OpenCL and OpenACC code showed:
    - Our code yielded a performance of at least 88% of manual OpenCL code
    - Outperformed OpenACC code in all the cases
  - Achieve excellent data scaling
- All the above achieved with an insignificant runtime overhead of 0.1%
- Our work is suited to any compiler/runtime system targeting GPUs
- Can bridge the data allocation gap that exists in programming these systems

#### **Publications based on this work**

1. Automatic Data Allocation and Buffer Management for Multi-GPU Machines

Thejas Ramashekar, Uday Bondhugula, In the ACM Transactions on Architecture and Code Optimization, Vol. 10, No. 4, Article 60, Publication date: December 2013 . Selected for presentation at HiPEAC '14, Jan 2014, Vienna, Austria.

2. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, and Uday Bondhugula, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2013.

#### References

- Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan. 2010. Automatic C-to-CUDA code generation for affine programs. In CC 2010.
- Cédric Bastoul. 2005. Clan: The Chunky Loop Analyzer. (2005).
- Uday Bondhugula. 2013. Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures. In ACM/IEEE Supercomputing (SC '13). ACM, Denver, Colorado, USA.
- Uday Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN PLDI.
- Daniel Chavarría-Miranda and John Mellor-Crummey. 2005. Effective communication coalescing for dataparallel applications. In ACM SIGPLAN PPoPP.
- M. Classen and M. Griebl. 2006. Automatic code generation for distributed memory architectures in the polytope model. In *IEEE IPDPS*.
- CUDA 2011. NVIDIA CUDA. (2011). http://developer.nvidia.com/object/cuda.html
- Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, and Uday Bondhugula. 2013. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed Memory. In *PACT 2013*.
- Armin Größlinger. 2009. Precise Management of Scratchpad Memories for Localising Array Accesses in Scientific Codes. In Compiler Construction. 236–250.
- ISL 2012. Integer Set Library. (2012). Sven Verdoolaege, An Integer Set Library for Program Analysis.
- Thomas B. Jablin, James A. Jablin, Prakash Prabhu, Feng Liu, and David I. August. 2012. Dynamically Managed Data for CPU-GPU Architectures. In CGO.
- Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. 2011. Automatic CPU-GPU Communication Management and Optimization. In ACM PLDI.
- Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. 2011. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs. In ACM SIGPLAN PPoPP.
- Okwan Kwon, Fahed Jubair, Rudolf Eigenmann, and Samuel Midkiff. 2012. A Hybrid Approach of OpenMP for Clusters. In *PPoPP*. http://engineering.purdue.edu/paramnt/publications/ppopp12.pdf
- Seyong Lee and Rudolf Eigenmann. 2010. OpenMPC: Extended OpenMP Programming and Tuning for GPUs. In SC 2010.
- Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. 2009. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In ACM SIGPLAN PPoPP.
- NVIDIA GPU Computing SDK 2010. NVIDIA GPU Computing SDK. (2010). https://developer.nvidia.com/ gpu-computing-sdk

OpenACC 2012. OpenACC Application Programming Interface. (2012). http://www.openacc-standard.org/ OpenCL 2011. OpenCL. (2011). http://www.khronos.org/opencl/

Sreepathi Pai, R. Govindarajan, and Matthew J. Thazhuthaveetil. 2012. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. In ACM PACT.

#### References

- Vikram S. Adve and John M. Mellor-Crummey. 1998. Using Integer Sets for Data-Parallel Program Analysis and Optimization. In ACM SIGPLAN PLDI. 186–198.
- Saman P. Amarasinghe and Monica S. Lam. 1993. Communication optimization and code generation for distributed memory machines. In PLDI. 126-138.
- C. Augonnet, S. Thibault, R. Namyst, and P.A. Wacrenier. 2009. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In *Concurrency and Computation: Practice and Experience*.
- M. Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. 2008. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories. In ACM SIGPLAN PPoPP.

### **Backup Slides**

# **Data Allocation Scheme Algorithms**

Algorithm 1: extract\_initial\_bounding\_boxes()

```
Input: Computation tile \vec{t}, Array a
```

```
1 S_a^{i\bar{n}it} = \phi
```

- **2** for each read or write access function  $f_a^i$  do
- 3  $dp_a^i = \text{get_data_polyhedron}(f_a^i)$
- 4  $bb_a^i = \text{get\_bounding\_box}(dp_a^i)$
- 5 add  $bb_a^i$  to  $S_a^{init}$
- 6 **Output**:  $S_a^{init}$ , the set of initial bounding boxes

#### Algorithm 2: get\_disjoint\_bounding\_boxes()

```
Input: S_a^{init} - Set of initial bounding boxes for tile \vec{t} and array a1S_a^{disjoint} = \phi2for each bounding box bb_a^{init} in S_a^{init} do3bb_a^{rem} = bb_a^{init}4for each bounding box bb_a^{disj} in S_a^{disjoint} do5\begin{bmatrix} bb_a^{intersect} = bb\_intersection(bb_a^{rem}, bb_a^{disj}) \\ bb_a^{rem} = bb\_subtract(bb_a^{rem}, bb_a^{intersect}) \end{bmatrix}6\begin{bmatrix} bb_a^{rem} = bb\_subtract(bb_a^{rem}, bb_a^{intersect}) \\ bb_a^{rem} = bb\_subtract(bb_a^{rem}, bb_a^{intersect}) \end{bmatrix}7\begin{bmatrix} add bb_a^{rem} \text{ to } S_a^{disjoint} \end{bmatrix}8Output: S_a^{disjoint}, the set of disjoint bounding boxes for array a
```

#### Structure of the generated host code

Algorithm 4: Structure of generated host code for a single affine loop nest **1** for each iteration of the outer serial loop  $i_s$  do distribute the parallel tiles of  $i_s$  among the GPUs 2 /\* below code is executed in the context of a host worker thread that manages the GPU for each parallel tile  $\vec{t}$  of  $i_s$  allocated to GPU dev do 3  $S = \phi$ 4 **"for each** array a accessed in  $\vec{t}$  do 5 6  $S_a = \text{get_disjoint_bounding_boxes}(\vec{t}, a)$ for each bounding box bb in  $S_a$  do 7 8 if !bb\_present(dev, a, bb) then bb\_alloc(dev, a, bb) 9 10 bb\_readin(dev, a, bb) 11 increment\_usage\_count(bb)  $S = S \cup S_a$ 12  $compute(\vec{t}, dev, S)$ 13  $gpu_to_cpu_flowout(\vec{t}, S)$ 14  $cpu_to_gpu_flowin(\vec{t}, S)$ 15 gpu\_to\_cpu\_writeout( $\vec{t}$ , S) 16 for each bounding box bb in S do 17 decrement\_usage\_count(bb) 18  $bb_cleanup(dev, i_s)$ 19

# Structure of the generated kernel code

| 1        | <pre>void ComputeKernel0(int split0, DATA_TYPE * buf0, int buf0_lb0, int buf0_ub0, int buf0_lb1, int buf0_ub1,<br/>int split1, DATA_TYPE * buf1, int buf1_lb0, int buf1_ub0, int buf1_lb1, int buf1_ub1,)</pre> |
|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>2</b> | {                                                                                                                                                                                                               |
| 3        | <pre>DATA_TYPE * var_wacc_0 = KERNEL0_var_WACC(split0, buf0, buf0_lb0, buf0_ub0, buf0_lb1, buf0_ub1, idx0,<br/>idx1);</pre>                                                                                     |
| 4        | <pre>DATA_TYPE var_racc_0 = KERNEL0_var_RACC(split1, buf1, buf1_lb0, buf1_ub0, buf1_lb1, buf1_ub1, idx0,<br/>idx1);</pre>                                                                                       |
| <b>5</b> |                                                                                                                                                                                                                 |
| 6        | // do the computation using values obtained above.                                                                                                                                                              |
| 7        | $*var_wacc_0 = var_racc_0 + \dots$                                                                                                                                                                              |
| 8        | }                                                                                                                                                                                                               |

#### **Performance with inter-tile reuse**

- compared to performance of the same code without reuse
- mean speedup of
   5.4x with maximum
   speedup of upto
   85x



# Performance with access function split

- compared to performance of code without splits
- stencils did not undergo performance degradation
- floyd in the worst case, suffered 40%
   performance loss. But still much better compared to CPU execution times



# **Table of contents**

- HPC Setup
- Multi-GPU Machines
- Running a program on multi-GPU machines
- Role of Data allocation and memory management
- Need for an automatic memory manager
- Design goals
- Bounding boxes
- Overview of BBMM
- Data allocation scheme
- Buffer Management

- Inter-GPU coherency
- Structure of the generated code
- Experimental setup
- Evaluation and Results
- Related Work
- Conclusion and Future work

# **Distributed memory paradigm**

