Intel® Advisor Help
You can model your MPI application performance on a target graphics processing unit (GPU) device to determine whether you can get a performance speedup from offloading the application to the GPU.
The Offload Modeling perspective of Intel® Advisor includes the following stages: collecting baseline metrics on the CPU with the Survey and Trip Counts analyses (and, optionally, the Dependencies analysis), and then modeling the application performance on the target device with the Performance Modeling analysis.
Prerequisite: Set up environment variables to enable Intel Advisor CLI.
Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.
This example shows how to run Offload Modeling to model performance for rank 1 of an MPI application. It uses the -gtool option of the Intel MPI Library to collect performance data on a baseline CPU. For other collection options, see Analyze MPI Applications. First, generate pre-configured analysis command lines with the --dry-run option:
advisor --collect=offload --dry-run --project-dir=./advi_results -- ./mpi_sample
After you run it, a list of analysis commands for running Offload Modeling at the specified accuracy level is printed to the terminal or command prompt. For the command above, the commands are printed for the default medium accuracy:
advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results -- ./mpi_sample
advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results -- ./mpi_sample
advisor --collect=projection --no-assume-dependencies --config=xehpg_512xve --project-dir=./advi_results
You need to modify the printed commands to use an MPI launcher syntax. See Analyze MPI Applications for syntax details.
mpirun -gtool "advisor --collect=survey --auto-finalize --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
mpirun -gtool "advisor --collect=tripcounts --flop --stacks --auto-finalize --enable-cache-simulation --data-transfer=light --target-device=xehpg_512xve --project-dir=./advi_results:1" -n 4 ./mpi_sample
advisor --collect=projection --config=xehpg_512xve --mpi-rank=1 --project-dir=./advi_results
You can only model performance for one rank at a time. The results are generated for the specified rank in the corresponding ./advi_results/rank.1 directory.
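For example, to model a different rank, such as rank 3, change the rank suffix in the -gtool command strings and the --mpi-rank value accordingly. A minimal sketch of the projection step for this hypothetical rank 3 case:
advisor --collect=projection --config=xehpg_512xve --mpi-rank=3 --project-dir=./advi_results
The results would then be generated in the ./advi_results/rank.3 directory.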
By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, you can apply additional configuration and settings to adjust the performance model for your target hardware or application. You can adjust the number of MPI ranks to run per GPU tile and/or exclude MPI time from the report.
Note: In the commands below, make sure to replace the application path and name before executing a command. If your application requires additional command line options, add them after the executable name.
Prerequisite: Set up environment variables to enable Intel Advisor CLI.
By default, Offload Modeling assumes that one MPI process, or rank, is mapped to one GPU tile. You can configure the performance model to adjust the number of MPI ranks to run per GPU tile to match your target device configuration.
To do this, set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or a TOML configuration file. If you want to model performance for the Intel® Arc™ graphics code-named Alchemist, which is the XeHPG 256 or XeHPG 512 configuration in the Offload Modeling targets, use the Stack_per_process parameter instead. The parameter sets the fraction of a GPU tile that runs a single MPI process. For example, if you want to offload an MPI application with 8 processes to a target GPU device with 4 tiles, you need to adjust the performance model to run 2 MPI processes per tile, that is, to use 0.5 tile per process.
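For instance, a dry-run command that generates analysis commands scaled for this 8-process, 4-tile case might look like the following sketch, which mirrors the 0.25-tile example shown later in this section (mpi_sample is a placeholder for your application, and $APM points to the Offload Modeling scripts directory set up with the Intel Advisor environment):
advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.5 --dry-run -- ./mpi_sample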
The number of tiles per process you set automatically scales the other modeled parameters of the target device accordingly.
The parameter accepts values from 0.01 to 12.0. Consider the following value examples:
| Tiles_per_process/Stack_per_process Value | Number of MPI Ranks per Tile |
|---|---|
| 1.0 (default) | 1 |
| 12.0 (maximum) | 1/12 |
| 0.25 | 4 |
| 0.125 | 8 |
To run Offload Modeling with a custom tiles-per-process value, scale the parameter when you run the analyses. This is a one-time change that is applied only to the analyses you run it with. The commands below use the Tiles_per_process parameter for scaling. Replace it with Stack_per_process if needed.
For example, to generate commands for the ./advi_results project and model performance with 0.25 tiles per process, which corresponds to four MPI ranks per tile:
advisor-python $APM/collect.py ./advi_results --set-parameter scale.Tiles_per_process=0.25 --dry-run -- ./mpi_sample
After you run it, a list of analysis commands for running Offload Modeling at the specified accuracy level is printed to the terminal or command prompt, similar to the following:
advisor --collect=survey --project-dir=./advi_results --static-instruction-mix -- ./mpi_sample
advisor --collect=tripcounts --project-dir=./advi_results --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m -- ./mpi_sample
python $APM/collect.py ./advi_results -m generic
advisor --collect=dependencies --project-dir=./advi_results --filter-reductions --loop-call-count-limit=16 --ignore-checksums -- ./mpi_sample
Modify the printed commands to use an MPI launcher. See Analyze MPI Applications for syntax details.
mpirun -gtool "advisor --collect=survey --static-instruction-mix --project-dir=./advi_results:1" -n 4 ./mpi_sample
mpirun -gtool "advisor --collect=tripcounts --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:64w:4k/1:192w:768k/1:4w:2m --project-dir=./advi_results:1" -n 4 ./mpi_sample
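If you also need the Dependencies data, you can wrap the printed dependencies command with the MPI launcher in the same way. A sketch that follows the same -gtool pattern as above (verify the options against the commands printed for your project):
mpirun -gtool "advisor --collect=dependencies --filter-reductions --loop-call-count-limit=16 --ignore-checksums --project-dir=./advi_results:1" -n 4 ./mpi_sample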
advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=1
The results are generated for the specified rank in the corresponding ./advi_results/rank.1 directory. You can transfer them to the development system, if needed, and view the results.
When you open the result in the Intel Advisor GUI or an interactive HTML report, you should see the tiles-per-process or stack-per-process parameter in the Modeling Parameters pane with the value you set. The parameter is read-only. Notice that the tiles-per-process or stack-per-process parameter shows the value per process, while other parameters in the pane show values per device.
Prerequisite: Set up environment variables to enable Intel Advisor CLI.
For multi-rank MPI workloads, time spent in MPI runtime can differ from rank to rank, which may cause significant performance imbalance. Because of this, the whole application time and Offload Modeling results may be different from rank to rank. If MPI time is large and differs between ranks, and the MPI code does not include many computations, you can exclude time spent in MPI routines from the analysis so that it does not affect modeling results.
advisor --collect=projection --project-dir=./advi_results --ignore=MPI --mpi-rank=1
The results are generated in a ./advi_results/rank.1 directory. You can transfer them to the development system and view the results.
In the generated report, all per-application performance modeling metrics are calculated based on the application self-time, with time spent in MPI calls excluded from the analysis. This should make the modeling results more consistent across ranks.
Intel Advisor saves collection results into subdirectories for each rank analyzed under the project directory specified with --project-dir. The modeling results are available only for the ranks that you ran the Performance Modeling for, for example, as specified with the --mpi-rank option.
To view the performance or dependency results collected for a specific rank, you can do one of the following.
View Results in GUI
From the Intel Advisor GUI, open a result project file *.advixeproj that resides in the <project-dir>/rank.<n> directory.
You can also open the GUI from the command line:
advisor-gui ./advi_results/rank.1
View Results in Command Line
After you run the Performance Modeling analysis, a summary of the modeling results is printed to the terminal or command prompt. Examine the data to learn the estimated speedup and the top five offloaded regions.
View Results in an Interactive HTML Report
Open the interactive advisor-report HTML report generated in the respective rank directory at <project-dir>/rank.<n>/e<NNN>/report, and the set of CSV reports at <project-dir>/rank.<n>/p<NNN>/data.0.
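Putting these locations together, a per-rank result directory has a layout similar to the following sketch (the exact e<NNN> and p<NNN> numbers and file names depend on your collections):
<project-dir>/
    rank.<n>/
        *.advixeproj    - result project file to open in the Intel Advisor GUI
        e<NNN>/report/  - interactive HTML report
        p<NNN>/data.0/  - CSV reports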