Mandelbrot

Lorenzo La Corte - S4784539 - 2023/2024 - Università degli Studi di Genova

The goal of this project is to parallelize and accelerate the mandelbrot program using OpenMP and CUDA. This report focuses on the iterative process followed to achieve both optimizations. In particular, it discusses the following points:

Vectorization and possible vectorization issues,
OpenMP:
1. hotspot identification,
2. parallelization through OpenMP,
3. scalability analysis using a suitable number of threads on the laboratory workstation, focusing on how speedup (and efficiency) relates to the number of threads and the size of the problem.
CUDA:
1. implementation of a CUDA version of the program,
2. analysis of the results obtained by running mandelbrot.cu, with a comparison between different possible configurations and also against the OpenMP version.
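
For reference, the scalability analysis relies on the usual definitions: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of threads. A minimal sketch of these two quantities follows; the function names `speedup` and `efficiency` are hypothetical helpers introduced here for illustration, not part of the mandelbrot program.

```cpp
#include <cassert>

// Hypothetical helper: speedup S(p) = T_1 / T_p,
// where T_1 is the sequential time and T_p the time with p threads.
double speedup(double t_sequential, double t_parallel)
{
    assert(t_sequential > 0.0 && t_parallel > 0.0);
    return t_sequential / t_parallel;
}

// Hypothetical helper: efficiency E(p) = S(p) / p.
// A value of 1.0 corresponds to ideal (linear) scaling.
double efficiency(double t_sequential, double t_parallel, int threads)
{
    assert(threads > 0);
    return speedup(t_sequential, t_parallel) / threads;
}
```

Both helpers take measured wall-clock times in the same unit; efficiency below 1.0 indicates that adding threads is not paying off proportionally.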

Setup

Before starting, I set up the environment that provides icc on the laboratory machines by opening a shell and running the command:

source /opt/intel/oneapi/setvars.sh

For the third part (CUDA), I also need to set up the environment for the NVIDIA HPC SDK (Software Development Kit), which contains the nvc++ compiler, through the commands:

NVARCH=`uname -s`_`uname -m`; export NVARCH; NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS; MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/23.7/compilers/man; export MANPATH; PATH=$NVCOMPILERS/$NVARCH/23.7/compilers/bin:$PATH; export PATH;

`mandelbrot.cpp` Metrics and Performances

First, I benchmark the time spent by the program before applying any optimization. To gather further insights, I also use Intel Advisor:

$ icc mandelbrot.cpp -diag-disable=10441 -o mandelbrot && ./mandelbrot out.txt
Time elapsed: 13 seconds.

$ icc mandelbrot.cpp -O3 -g -o mandelbrot_profiling && ./mandelbrot_profiling
icc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message.
Time elapsed: 13 seconds.
Please specify the output file as a parameter.

$ advix-gui

The analysis is conducted on the sequential version of the program:

Program Elapsed Time	13.10s
Vector Instruction Set	SSE2, SSE
Number of CPU Threads	1

The summary already makes clear that no loop is vectorized by default:

Metrics	Total
CPU Time	13.09s (100%)
Time in scalar code	13.09s (100%)
Vectorization Gain/Efficiency	Not Available (No vectorized loops found or not enough data)

Function Call Sites and Loops	Total Time %	Total Time	Self Time	Why No Vectorization?
[loop in main at mandelbrot.cpp:33]	100%	13.090s	13.090s	outer loop was not auto-vectorized: consider using SIMD directive

This is shown by 100% of the CPU time being spent in scalar code and by the absence of vectorized loops. Advisor suggests using SIMD directives, that is, compiler pragmas that explicitly request the vectorization of a loop, to improve the efficiency of the code.
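
To illustrate what such a directive looks like, here is a minimal sketch of a loop over the pixels of one image row annotated with `#pragma omp simd`. It is not the code of mandelbrot.cpp: the function `mandelbrot_row`, its parameters and the masked inner loop (used instead of an early `break`, which tends to hinder vectorization) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from mandelbrot.cpp: for each pixel of one
// image row, counts how many iterations of z = z*z + c keep |z| <= 2.
void mandelbrot_row(float c_im, float re_min, float re_step,
                    int max_iter, std::vector<int>& out)
{
    assert(max_iter >= 0);
    const std::size_t width = out.size();
    int* counts = out.data();

    // Request vectorization across pixels: each SIMD lane handles one column.
    #pragma omp simd
    for (std::size_t x = 0; x < width; ++x) {
        const float c_re = re_min + re_step * static_cast<float>(x);
        float z_re = 0.0f, z_im = 0.0f;
        int n = 0;
        // Fixed trip count with a mask: lanes that escaped stop updating.
        for (int i = 0; i < max_iter; ++i) {
            const float re2 = z_re * z_re;
            const float im2 = z_im * z_im;
            const bool inside = (re2 + im2) <= 4.0f;
            const float new_re = re2 - im2 + c_re;
            const float new_im = 2.0f * z_re * z_im + c_im;
            z_re = inside ? new_re : z_re;
            z_im = inside ? new_im : z_im;
            n += inside ? 1 : 0;
        }
        counts[x] = n;
    }
}
```

The pragma is only honored when OpenMP SIMD support is enabled at compile time (for example `-qopenmp-simd` with the Intel compilers or `-fopenmp-simd` with GCC); otherwise it is ignored and the loop runs as ordinary scalar code with the same results.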
