OpenMP* Code Analysis Method

This recipe introduces a flow to analyze CPU utilization of your OpenMP* or hybrid OpenMP-MPI application and identify causes of possible inefficiencies.

Content expert: Dmitry Prohorov

OpenMP uses a fork-join parallel model: an OpenMP program starts running on a single master thread that executes serial code. When a parallel region is encountered, that thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for work coordination. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code.
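As a minimal sketch of the fork-join pattern described above, the following C function runs serial code, forks into one parallel region, and returns to serial code after the join. The function name scaled_sum and its parameters are hypothetical and chosen only for illustration. The code assumes it is built with OpenMP enabled (for example, -fopenmp for GCC/Clang or -qopenmp for Intel compilers); without that option the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper: serial code, one parallel region, serial code. */
double scaled_sum(const double *x, size_t n, double scale)
{
    /* Serial code: only the master thread runs here. */
    double sum = 0.0;

    /* Fork: loop iterations are divided among the threads of the team.
       The reduction clause gives each thread a private partial sum and
       combines the partial sums when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];
    /* Join: the threads meet at the implicit barrier that ends the
       region, and only the master thread continues. */

    return sum * scale; /* serial code again */
}
```

In a profile, the time spent before and after the pragma would be attributed to serial execution, and the loop to a parallel region.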

Ideally, parallelized applications have working threads doing useful work from the beginning to the end of execution, utilizing 100% of available CPU core processing time. In practice, useful CPU utilization is usually lower because working threads spend time waiting, either actively spinning (for performance, when the wait is expected to be short) or waiting passively, without consuming CPU. There are several major reasons why working threads wait instead of doing useful work:

VTune Profiler together with Intel Composer XE 2013 Update 2 or higher help you understand how an application utilizes available CPUs and identify causes of CPU underutilization.

To analyze an OpenMP application with VTune Profiler:

  1. Compile your code with recommended options.

  2. Configure OpenMP regions analysis.

  3. Explore application-level OpenMP metrics.

  4. Identify serial code.

  5. Estimate potential gain.

  6. Understand limitations.

Compile Your Code with Recommended Options

To enable parallel regions and source analysis during compilation, do the following:

Configure OpenMP Analysis

To enable OpenMP analysis for your target:

  1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (in the standalone GUI or the Visual Studio IDE).

    The Configure Analysis window opens.

  2. In the HOW pane, click the Browse button and select an analysis type that supports OpenMP analysis: Threading, HPC Performance Characterization, Memory Access, or any Custom Analysis type.

  3. Select the Analyze OpenMP regions option, if it is not pre-selected (see the Details section to confirm).

  4. Click the Start button to run the analysis.

The OpenMP runtime library in the Intel Composer provides special markers for applications running under profiling that can be used by the VTune Profiler to decipher the statistics of OpenMP parallel regions and distinguish serial parts of the application code.

Explore Application-Level OpenMP Metrics

Start your analysis by understanding the CPU utilization of your analysis target. If you are using the HPC Performance Characterization viewpoint, focus on the Effective Physical Core Utilization section of the Summary window, which shows the number of used logical and physical cores and estimates the efficiency (in percent) of this CPU utilization. Poor core utilization is flagged as a performance issue.

Other viewpoints provide the CPU Utilization Histogram that displays the Elapsed time of your application, broken down by CPU utilization levels. The histogram shows only useful utilization, so CPU cycles that the application spent burning CPU in spin loops (active wait) are not counted. You can adjust the sliders from the default levels if you intentionally use fewer OpenMP working threads than the number of available hardware threads.

CPU Usage Histogram

If the bars are close to Ideal utilization, you might need to look deeper, at algorithm or microarchitecture tuning opportunities, to find performance improvements. If not, explore the OpenMP Analysis section of the Summary window for inefficiencies in parallelization of the application:

OpenMP Analysis. Collection Time

This section of the Summary window shows the Collection Time as well as the duration of serial (outside of any parallel region) and parallel portions of the program. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
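To make the Amdahl's Law remark concrete, here is a small sketch that computes the upper bound on speedup for a given serial fraction and thread count. The function name amdahl_speedup is hypothetical; the formula is the standard statement of the law, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction of single-threaded run time and N is the number of threads.

```c
/* Upper bound on speedup when serial_fraction (0.0 to 1.0) of the
   single-threaded run time cannot be parallelized. */
double amdahl_speedup(double serial_fraction, int threads)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}
```

For example, with a serial fraction of 0.05 the bound is about 15.4x on 64 threads and can never exceed 20x, no matter how many threads are added.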

Use the OpenMP Region Duration histogram in the Summary window to analyze instances of an OpenMP region, explore the time distribution of instance durations and identify Fast/Good/Slow region instances. Initial distribution of region instances by Fast/Good/Slow categories is done as a ratio of 20/40/20 between min and max region time values. Adjust the thresholds as needed.

OpenMP Region Duration Histogram

Use this data for further detailed analysis in the grid views with OpenMP Region/OpenMP Region Duration Type/... grouping levels.

Identify Serial Code

To analyze the serially executed code, expand the Serial Time (outside parallel regions) section of the Summary window and review the Top Serial Hotspots (outside parallel regions). You can click a function name to be taken to that function in the Bottom-up window for more detail.

Estimate Potential Gain

To estimate the efficiency of CPU utilization in the parallel part of the code, use the Potential Gain metric. This metric estimates the difference in the Elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution.

The Summary window provides a detailed table listing the top five parallel regions with the highest Potential Gain metric values. For each parallel region defined by #pragma omp parallel, this metric is the sum of the potential gains of all instances of the parallel region.
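The definition above can be illustrated with a short sketch. It assumes, purely for illustration, that the idealized time of a region instance is its useful CPU time divided evenly across the threads; the region_instance type and both function names are hypothetical and do not describe how VTune Profiler computes the metric internally.

```c
#include <stddef.h>

/* One execution (instance) of a parallel region. Hypothetical layout. */
typedef struct {
    double elapsed;    /* measured wall-clock time of the instance, s */
    double useful_cpu; /* useful CPU time summed over all threads, s  */
} region_instance;

/* Gain for one instance: measured time minus the idealized time with
   perfectly balanced threads and zero runtime overhead. */
double instance_potential_gain(region_instance r, int threads)
{
    double ideal = r.useful_cpu / threads;
    double gain = r.elapsed - ideal;
    return gain > 0.0 ? gain : 0.0;
}

/* Gain for a region: the sum over all of its instances. */
double region_potential_gain(const region_instance *inst, size_t count,
                             int threads)
{
    double total = 0.0;
    for (size_t i = 0; i < count; ++i)
        total += instance_potential_gain(inst[i], threads);
    return total;
}
```

Under these assumptions, an instance that took 2 s of elapsed time while doing 4 s of useful work on 4 threads has a potential gain of 1 s.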

Top OpenMP Regions by Potential Gain

If Potential Gain for a region is significant, you can go deeper and select the link on a region name to navigate to the Bottom-up window employing the /OpenMP Region/OpenMP Barrier-to-Barrier Segment/.. dominant grouping that provides detailed analysis of inefficiency metrics like Imbalance by barriers.

The Intel OpenMP runtime from Intel Parallel Studio instruments barriers for VTune Profiler. VTune Profiler introduces the notion of a barrier-to-barrier OpenMP region segment, which spans from the region fork point or the previous barrier to the barrier that defines the segment.

For example, a region that contains a user barrier, a single construct, and a parallel loop has four barrier-to-barrier segments, defined by the user barrier, the implicit single barrier, the implicit omp for loop barrier, and the region join barrier.
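A region with those four segments might look like the following sketch. The function four_segments and its contents are hypothetical; only the sequence of barriers matters. Without OpenMP enabled at compile time, the pragmas are ignored and the function runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical region with four barrier-to-barrier segments. */
void four_segments(double *data, size_t n)
{
    double scale = 1.0; /* declared outside the region, so shared */

    #pragma omp parallel
    {
        /* Segment 1: from the region fork point to the user barrier.
           Per-thread setup work would go here. */
        #pragma omp barrier

        /* Segment 2: ends at the implicit barrier of the single
           construct. One thread runs the block, the others wait. */
        #pragma omp single
        {
            scale = 2.0;
        }

        /* Segment 3: ends at the implicit barrier of the omp for loop. */
        #pragma omp for
        for (long i = 0; i < (long)n; ++i)
            data[i] *= scale;

        /* Segment 4: any trailing work, ended by the region join
           barrier at the closing brace. */
    }
}
```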

When an OpenMP region contains multiple barriers, either implicit ones from parallel loops or #pragma omp single sections, or explicit user barriers, analyze the impact of a particular construct or barrier on the inefficiency metrics.

The barrier type is embedded in the segment name, for example: loop, single, reduction, and others. For parallel loops with implicit barriers, additional information is provided, such as loop scheduling, chunk size, and the min/max/average loop iteration counts, which helps you understand the nature of imbalance or scheduling overhead. The loop iteration count information also helps identify underutilization of worker threads caused by a small number of iterations, which can result from outer loop parallelization. In this case, consider inner loop parallelization or the collapse clause to saturate the working threads.
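As a sketch of the collapse suggestion, consider a loop nest whose outer loop has only a few iterations. The function scale_matrix and its parameters are hypothetical; the collapse clause itself is standard OpenMP and requires the loops to be perfectly nested.

```c
#include <stddef.h>

/* Hypothetical case: rows is small (say 4), cols is large. */
void scale_matrix(double *m, int rows, int cols, double factor)
{
    /* Parallelizing only the outer loop would distribute just `rows`
       iterations, so at most `rows` threads would get work.
       collapse(2) merges both loops into a single iteration space of
       rows * cols iterations that is divided among the threads. */
    #pragma omp parallel for collapse(2)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            m[(size_t)r * cols + c] *= factor;
}
```

Whether collapse or inner loop parallelization is the better choice depends on the iteration counts and the memory access pattern, so it is worth comparing the reported loop iteration counts before and after the change.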

Analyze the Potential Gain column data, which breaks down the Potential Gain in the region by showing the cost (in elapsed time) of each inefficiency, normalized by the number of OpenMP threads. The elapsed time cost helps you decide whether addressing a particular type of inefficiency is worth the investment. VTune Profiler can recognize the following types of inefficiencies:

If the Potential Gain column is not expandable for earlier versions of Intel OpenMP runtime, analyze the corresponding CPU Time metric breakdown.
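To show what normalization by the number of OpenMP threads means, the sketch below converts the CPU time that threads spend waiting at a barrier into an elapsed-time cost. It assumes, for illustration only, that each thread waits from the moment it reaches the barrier until the last thread arrives; the function name and the input layout are hypothetical.

```c
#include <stddef.h>

/* finish[i] is the time (in seconds) at which thread i reached the
   barrier. Returns the wait time summed over all threads, divided by
   the number of threads. */
double imbalance_elapsed_cost(const double *finish, size_t threads)
{
    if (threads == 0)
        return 0.0;

    /* The barrier releases when the slowest thread arrives. */
    double last = finish[0];
    for (size_t i = 1; i < threads; ++i)
        if (finish[i] > last)
            last = finish[i];

    /* CPU time lost to waiting, summed over all threads. */
    double wait_cpu = 0.0;
    for (size_t i = 0; i < threads; ++i)
        wait_cpu += last - finish[i];

    /* Normalize by the thread count to express it as elapsed time. */
    return wait_cpu / (double)threads;
}
```

With four threads arriving at 1, 2, 3, and 4 seconds, the threads wait 6 seconds of CPU time in total, which corresponds to an elapsed-time cost of 1.5 seconds.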

To analyze the source of a performance-critical OpenMP parallel region, double-click the region identifier in the grid, sorted by the OpenMP Region/.. grouping level. VTune Profiler opens the source view at the beginning of the selected OpenMP region in the pseudo function created by the Intel compiler.

Note

By default, the Intel compiler does not add a source file name to region names, so the unknown string shows up in the OpenMP parallel region name. To get the source file name in the region name, use the -parallel-source-info=2 option during compilation.

Limitations

VTune Profiler supports the analysis of parallel OpenMP regions with the following limitations:
