Lorenzo La Corte - S4784539 - 2023/2024 - Università degli Studi di Genova


heat_cuda.cu accelerates heat.c, a program that computes the 2D heat conduction formula.
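heat.c presumably implements the standard explicit finite-difference update of the 2D heat equation; the following is a reconstruction, not copied from the source:

```latex
U_{i,j}^{t+1} = U_{i,j}^{t} + \alpha \left( U_{i-1,j}^{t} + U_{i+1,j}^{t} + U_{i,j-1}^{t} + U_{i,j+1}^{t} - 4\, U_{i,j}^{t} \right)
```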

Setup

First, I have to set up the environment in order to use the nvc++ compiler:

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.7/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.7/compilers/bin:$PATH; export PATH

As a starting point, I can check the hardware characteristics of the workstation on which I run my experiments (a workstation from the 210 laboratory):

$ nvaccelinfo

CUDA Driver Version:           11040
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  470.199.02  Thu May 11 11:46:56 UTC 2023

Device Number:                 0
Device Name:                   NVIDIA T400
Device Revision Number:        7.5
Global Memory Size:            1967259648
Number of Multiprocessors:     6
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1425 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5001 MHz
Memory Bus Width:              64 bits
L2 Cache Size:                 524288 bytes
Max Threads Per SMP:           1024
Async Engines:                 3
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
Default Target:                cc75

$ nvidia-smi    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400         On   | 00000000:01:00.0 Off |                  N/A |
| 38%   35C    P8    N/A /  31W |      5MiB /  1876MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     29717      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

The warp size is 32 and the maximum number of threads per block is 1024, i.e. at most 32 warps of 32 threads each per block.

The available GPU memory is 1876 MiB (mebibytes), i.e. approximately 2 GB.
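Given these limits, a natural CUDA launch configuration for an N×N domain uses 32×32 = 1024-thread blocks, with the grid size obtained by ceiling division. A minimal sketch (the block shape and the example domain size are assumptions, not read from the source):

```c
/* Ceiling division: how many blocks of `block` threads are needed to
 * cover `n` grid points in one dimension. With the T400's limit of
 * 1024 threads per block, a 32x32 thread block is a natural choice.
 * (Illustrative helper, not taken from heat_cuda.cu.) */
unsigned int blocks_for(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}
```

For example, a 1000×1000 domain with 32×32 blocks needs a 32×32 grid of blocks, since 1000/32 rounds up to 32.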

Analysis - Understand Performance Figures

To understand the performance figures, I benchmark the program in its initial state (heat.c) using different configurations in the compilation phase:

* **CUDA**
$ nvc++ heat.c -o heat && ./heat
CPU Time spent: 429.61499023 ms

* **CUDA - O3**
$ nvc++ -O3 heat.c -o heat && ./heat
CPU Time spent: 33.86600113 ms

* **OpenACC - GPU**
$ nvc++ -acc heat.c -o heat && ./heat
CPU Time spent: 161.67900085 ms

* **OpenACC - Multicore**
$ nvc++ -acc=multicore heat.c -o heat && ./heat
CPU Time spent: 160.17900085 ms

* **icc**
$ icc heat.c -diag-disable=10441 -o heat && ./heat
CPU Time spent: 159.77799988 ms

* **icc - fast**
$ icc heat.c -fast -diag-disable=10441 -o heat && ./heat 
CPU Time spent: 34.83600235 ms
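For the OpenACC builds above to offload anything, the stencil loop in heat.c must carry directives. A hypothetical annotation of the heat stencil (the pragma and its data clauses are my sketch, not code from heat.c):

```c
/* Hypothetical OpenACC version of the 2D heat stencil: one explicit
 * (Jacobi) step on an n x n grid, offloaded with a parallel loop
 * directive. Grid layout, alpha and the data clauses are assumptions,
 * not taken from heat.c. Compiled without -acc, the pragma is ignored
 * and the function runs sequentially. */
void heat_step_acc(const double *restrict u, double *restrict u_new,
                   int n, double alpha)
{
    #pragma acc parallel loop collapse(2) copyin(u[0:n*n]) copy(u_new[0:n*n])
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            u_new[i * n + j] = u[i * n + j]
                + alpha * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                         + u[i * n + (j - 1)] + u[i * n + (j + 1)]
                         - 4.0 * u[i * n + j]);
}
```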

The best results are obtained using nvc++ with the option -O3 and icc with -fast:

| Compiler/Options | Compilation Command | CPU Time (ms) |
|---|---|---|
| CUDA | nvc++ heat.c -o heat | 429.61 |
| CUDA - O3 | nvc++ -O3 heat.c -o heat | 33.87 |
| OpenACC - GPU | nvc++ -acc heat.c -o heat | 161.68 |
| OpenACC - Multicore | nvc++ -acc=multicore heat.c -o heat | 160.18 |
| icc | icc heat.c -diag-disable=10441 -o heat | 159.78 |
| icc - fast | icc heat.c -fast -diag-disable=10441 -o heat | 34.84 |
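All of the figures above come from the program's own "CPU Time spent" message. heat.c's timing code is not reproduced in this report; a typical wall-clock timer based on clock_gettime (an assumption, not the actual source) looks like:

```c
#include <time.h>

/* Elapsed wall-clock time in milliseconds between two timespecs.
 * Assumes clock_gettime(CLOCK_MONOTONIC, ...) snapshots were taken
 * around the computation; heat.c's actual timing code may differ. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1e3
         + (end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Typical usage around the compute loop:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   ... heat computation ...
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   printf("CPU Time spent: %.8f ms\n", elapsed_ms(t0, t1));
 */
```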

Parallelisation with CUDA - heat_cuda_baseline.cu

I can also benchmark the available optimization levels:

$ nvc++ -O0 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 918.99 ms.

$ nvc++ -O1 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 428.00 ms.

$ nvc++ -O2 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 35.10 ms.

$ nvc++ -O3 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 35.59 ms.

$ nvc++ -O4 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 37.21 ms.

It is evident that optimizing beyond -O3 is useless: