Lorenzo La Corte - S4784539 - 2023/2024 - Università degli Studi di Genova
`heat_cuda.c` accelerates `heat.c`, a program that computes the 2D heat conduction formula.
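Assuming the standard explicit finite-difference scheme (the exact discretization used in `heat.c` is not reproduced here), the equation and its update step are:

$$
\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right),
\qquad
u^{n+1}_{i,j} = u^{n}_{i,j} + \alpha\,\Delta t \left( \frac{u^{n}_{i+1,j} - 2u^{n}_{i,j} + u^{n}_{i-1,j}}{\Delta x^{2}} + \frac{u^{n}_{i,j+1} - 2u^{n}_{i,j} + u^{n}_{i,j-1}}{\Delta y^{2}} \right)
$$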
First, I have to set up the environment, in order to use the nvc++
compiler:
NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.7/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.7/compilers/bin:$PATH; export PATH
As a starting point, I can check the hardware characteristics of the workstation on which I am running my experiments (a workstation from the 210 laboratory):
$ nvaccelinfo
CUDA Driver Version: 11040
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.199.02 Thu May 11 11:46:56 UTC 2023
Device Number: 0
Device Name: NVIDIA T400
Device Revision Number: 7.5
Global Memory Size: 1967259648
Number of Multiprocessors: 6
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1425 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5001 MHz
Memory Bus Width: 64 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 1024
Async Engines: 3
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc75
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA T400 On | 00000000:01:00.0 Off | N/A |
| 38% 35C P8 N/A / 31W | 5MiB / 1876MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 29717 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
The warp size is 32 threads and the maximum number of threads per block is 1024, so a block can contain at most 32 warps of 32 threads.
The available GPU memory is 1876 MiB (mebibytes), which is approximately 2 GB.
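A quick check of these figures against the `nvaccelinfo` output:

$$
\frac{1024\ \text{threads/block}}{32\ \text{threads/warp}} = 32\ \text{warps/block},
\qquad
\frac{1\,967\,259\,648\ \text{B}}{2^{20}\ \text{B/MiB}} \approx 1876\ \text{MiB} \approx 1.97\ \text{GB}
$$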
To understand the baseline performance figures, I benchmark the program in its initial state (`heat.c`) using different compiler configurations:
* **CUDA**
$ nvc++ heat.c -o heat && ./heat
CPU Time spent: 429.61499023 ms
* **CUDA - O3**
$ nvc++ -O3 heat.c -o heat && ./heat
CPU Time spent: 33.86600113 ms
* **OpenACC - GPU**
$ nvc++ -acc heat.c -o heat && ./heat
CPU Time spent: 161.67900085 ms
* **OpenACC - Multicore**
$ nvc++ -acc=multicore heat.c -o heat && ./heat
CPU Time spent: 160.17900085 ms
* **icc**
$ icc heat.c -diag-disable=10441 -o heat && ./heat
CPU Time spent: 159.77799988 ms
* **icc - fast**
$ icc heat.c -fast -diag-disable=10441 -o heat && ./heat
CPU Time spent: 34.83600235 ms
The best results are obtained using `nvc++` with the option `-O3` and `icc` with `-fast`:
| Compiler/Options | Compilation Command | CPU Time (ms) |
|---|---|---|
| CUDA | `nvc++ heat.c -o heat` | 429.61 |
| CUDA - O3 | `nvc++ -O3 heat.c -o heat` | 33.87 |
| OpenACC - GPU | `nvc++ -acc heat.c -o heat` | 161.68 |
| OpenACC - Multicore | `nvc++ -acc=multicore heat.c -o heat` | 160.18 |
| icc | `icc heat.c -diag-disable=10441 -o heat` | 159.78 |
| icc -fast | `icc heat.c -fast -diag-disable=10441 -o heat` | 34.84 |
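The `heat.c` source is not reproduced in this report; the runs above time a loop of roughly the following shape (a minimal sketch: the grid size, constants, and timing call are my own assumptions, not the actual file):

```c
#include <stdio.h>
#include <time.h>

#define N     2000   /* hypothetical grid size            */
#define STEPS 100    /* hypothetical number of time steps */

int main(void)
{
    static float u[N][N], u_new[N][N];          /* zero-initialized fields   */
    const float alpha = 0.1f, dt = 0.01f, dx = 1.0f, dy = 1.0f;
    u[N/2][N/2] = 100.0f;                       /* a hot spot in the middle  */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int s = 0; s < STEPS; s++) {
        /* if heat.c carries OpenACC directives, a line such as
           "#pragma acc parallel loop collapse(2)" here is what the
           -acc builds above would parallelize (an assumption) */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u_new[i][j] = u[i][j] + alpha * dt *
                    ((u[i+1][j] - 2.0f*u[i][j] + u[i-1][j]) / (dx*dx) +
                     (u[i][j+1] - 2.0f*u[i][j] + u[i][j-1]) / (dy*dy));

        /* copy the updated field back for the next step
           (a pointer swap would avoid this copy) */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u[i][j] = u_new[i][j];
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("CPU Time spent: %.8f ms\n", ms);
    return 0;
}
```

Since every point only reads its four neighbours, this is the kind of memory-bound stencil that benefits strongly from compiler vectorization, which is consistent with the `-O3` and `-fast` results above.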
heat_cuda_baseline.cu
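The file itself is not reproduced here; a minimal sketch of what a baseline CUDA port of the stencil update could look like (the kernel name, indexing, and launch configuration are my own assumptions, not necessarily what `heat_cuda_baseline.cu` contains):

```cuda
/* Hypothetical baseline CUDA kernel for the explicit heat update. */
__global__ void heat_step(const float *u, float *u_new, int n,
                          float alpha, float dt, float dx, float dy)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row index    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */

    if (i > 0 && i < n - 1 && j > 0 && j < n - 1) {
        int c = i * n + j;                           /* row-major center */
        u_new[c] = u[c] + alpha * dt *
            ((u[c + n] - 2.0f*u[c] + u[c - n]) / (dx*dx) +
             (u[c + 1] - 2.0f*u[c] + u[c - 1]) / (dy*dy));
    }
}

/* Example launch: one thread per grid point, 32x32 threads per block,
   which matches the maximum of 1024 threads per block reported above. */
void launch_heat_step(const float *d_u, float *d_u_new, int n,
                      float alpha, float dt, float dx, float dy)
{
    dim3 block(32, 32);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    heat_step<<<grid, block>>>(d_u, d_u_new, n, alpha, dt, dx, dy);
}
```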
I can also benchmark the different optimization levels available:
$ nvc++ -O0 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 918.99 ms.
$ nvc++ -O1 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 428.00 ms.
$ nvc++ -O2 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 35.10 ms.
$ nvc++ -O3 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 35.59 ms.
$ nvc++ -O4 heat_cuda_baseline.cu -o heat_cuda && ./heat_cuda
CPU Time spent: 37.21 ms.
It is evident that, beyond `-O3`, further optimization brings no benefit: