Lorenzo La Corte - S4784539 - 2023/2024 - Università degli Studi di Genova


The goal of this project is to parallelize and accelerate the mandelbrot program using OpenMP and CUDA. This report focuses on the iterative process followed to achieve both optimizations. In particular, it discusses the following points:

  1. Vectorization and possible vectorization issues,
  2. OpenMP:
    1. hotspot identification,
    2. parallelization through OpenMP,
    3. scalability analysis using an appropriate number of threads on the laboratory workstation, focusing on how speedup and efficiency relate to the number of threads and the size of the problem.
  3. CUDA:
    1. implementation of a CUDA version of the program,
    2. analysis of the results obtained by running mandelbrot.cu, with a comparison between different possible configurations and also against the OpenMP version.
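
For reference, the speedup and efficiency mentioned in the outline are usually defined as follows. The notation is an assumption introduced here, not taken from the project: T(p, n) denotes the wall-clock time measured with p threads on a problem of size n.

```latex
% Assumed notation: T(p, n) = wall-clock time with p threads on problem size n
\[
  S(p, n) = \frac{T(1, n)}{T(p, n)},
  \qquad
  E(p, n) = \frac{S(p, n)}{p}
\]
```

Under these definitions, ideal scaling corresponds to S(p, n) = p, that is, E(p, n) = 1.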

Setup

Before starting, I set up the environment that provides icc on the laboratory machines by opening a shell and running:

source /opt/intel/oneapi/setvars.sh

For the third part, I also need to set up the environment for the NVIDIA HPC SDK (Software Development Kit), which provides the nvc++ compiler, with the commands:

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.7/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/23.7/compilers/bin:$PATH; export PATH

mandelbrot.cpp Metrics and Performance

First, I benchmark the running time of the program without passing any optimization flag. To gather further insights, I then build it with -O3 -g and profile it with Intel Advisor:

$ icc mandelbrot.cpp -diag-disable=10441 -o mandelbrot && ./mandelbrot out.txt
Time elapsed: 13 seconds.
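
A "Time elapsed" figure like the one above can be obtained by timing the region of interest with a monotonic clock. The following is only a minimal sketch: the helper name elapsed_seconds is hypothetical, and the way mandelbrot.cpp actually measures its time is not shown in this report.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from mandelbrot.cpp): runs `work` once and
// returns the wall-clock time it took, in seconds.
template <typename F>
double elapsed_seconds(F&& work) {
    // steady_clock is monotonic, so it is not affected by system clock changes.
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For example, wrapping the main computation in a lambda and passing it to elapsed_seconds would yield a fractional number of seconds rather than a value rounded to whole seconds.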

$ icc mandelbrot.cpp -O3 -g -o mandelbrot_profiling && ./mandelbrot_profiling
icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
Time elapsed: 13 seconds.
Please specify the output file as a parameter.

$ advix-gui

The analysis is conducted on the sequential version of the program:

Program Elapsed Time: 13.10s
Vector Instruction Set: SSE2, SSE
Number of CPU Threads: 1

The report already shows that the compiler did not vectorize any loop automatically:

Metrics | Total
CPU Time | 13.09s (100%)
Time in scalar code | 13.09s (100%)
Vectorization Gain/Efficiency | Not Available (No vectorized loops found or not enough data)

Function Call Sites and Loops | Total Time % | Total Time | Self Time | Why No Vectorization?
[loop in main at mandelbrot.cpp:33] | 100% | 13.090s | 13.090s | outer loop was not auto-vectorized: consider using SIMD directive

This is confirmed by the fact that 100% of the CPU time is spent in scalar code and that no vectorized loops were found. Advisor suggests using a SIMD directive (for example, OpenMP's #pragma omp simd), which explicitly asks the compiler to vectorize a loop, to improve the efficiency of the code.
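
To illustrate what following this suggestion could look like, here is a minimal sketch that applies #pragma omp simd to a loop over the pixels of one image row. It is not the code of mandelbrot.cpp: the function name mandelbrot_row and all of its parameters are hypothetical. It also swaps the usual early exit for a fixed number of iterations with a masked update, an assumption made here because a data-dependent break tends to hinder vectorization.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (names are illustrative, not from mandelbrot.cpp):
// escape-time counts for the pixels of one row with imaginary part `y`.
std::vector<int> mandelbrot_row(double y, double x_min, double x_max,
                                std::size_t width, int max_iter) {
    std::vector<int> counts(width, 0);
    int* out = counts.data();
    const double dx = (x_max - x_min) / static_cast<double>(width);

    // Ask the compiler to vectorize across pixels. The pragma is simply
    // ignored when OpenMP SIMD support is not enabled at compile time.
    #pragma omp simd
    for (std::size_t col = 0; col < width; ++col) {
        const double cr = x_min + static_cast<double>(col) * dx;
        double zr = 0.0, zi = 0.0;
        int n = 0;
        // Fixed trip count instead of an early break: every SIMD lane runs
        // the same number of iterations, and escaped points stop updating.
        for (int it = 0; it < max_iter; ++it) {
            const double zr2 = zr * zr;
            const double zi2 = zi * zi;
            const bool inside = (zr2 + zi2 <= 4.0);
            const double next_zr = zr2 - zi2 + cr;
            const double next_zi = 2.0 * zr * zi + y;
            zr = inside ? next_zr : zr;  // freeze z once the point has escaped
            zi = inside ? next_zi : zi;
            n += inside ? 1 : 0;         // count only iterations spent inside
        }
        out[col] = n;
    }
    return counts;
}
```

The directive takes effect only when OpenMP SIMD support is enabled, for example with -qopenmp-simd on the Intel compilers or -fopenmp-simd on GCC and Clang. The trade-off of this variant is that points escaping early still pay for all max_iter iterations, so whether it is faster than the scalar loop has to be measured.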