Follow this recipe to understand how you can use Flame Graphs to detect hotspots and hot code paths in Java workloads.
Content experts: Dmitry Kolosov, Roman Khatko, and Elena Nuzhnova
A flame graph is a visual representation of the stacks and stack frames in your application. The graph plots all of the functions in your application on the X-axis and displays the stack depth on the Y-axis. Functions are stacked in order of ancestry, with parent functions directly below child functions. The width of a function displayed in the graph is an indication of the amount of time it engaged the CPU. Therefore, the hottest functions in your application occupy the widest portions on the flame graph.
You can use flame graphs when you run the hotspots analysis with stacks on any of these workloads:
This recipe uses a Java application as an example. Typically, a poor selection of parameters (either sub-optimal or incorrect) for the Java Virtual Machine (JVM) can result in slow application performance. The cause of the slowdown is not always easy to analyze or explain. When you visualize the application stacks in a flame graph, you may find it easier to identify hot paths for the application and its mixed stacks (Java and built-in).
Here are the hardware and software tools we use in this recipe:
Performance Analysis Tools: Hotspots Analysis in Intel® VTune™ Profiler (version 2021.7 or newer)
Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
Get the latest version of Intel® VTune™ Profiler:
Download it from the Intel® VTune™ Profiler product page.
Download the latest standalone package from the Intel® oneAPI standalone components page.
specjbb.input.number_customers=1
specjbb.input.number_products=1
java -XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -jar specjbb2015.jar -m COMPOSITE
VTune Profiler profiles the Java application and collects data. Once this process completes, VTune Profiler finalizes the collected results and resolves symbol information.
Start your analysis in the Summary window, where you can see high level statistics on the execution of your application. Focus on the Elapsed Time and Top Hotspots sections.
In this example, we see that the elapsed time for SPECjbb2015 was around 375 seconds.
The top five hotspots in the summary are in JVM functions. No Java/Application functions appear in this list.
Next, look at the Bottom-up window to continue searching for hotspots.
Although the Bottom-up window displays more hotspots in the JVM, we need a deeper analysis to explain the slowdown of the Java application. In this window, that would mean expanding many parent functions for every hotspot in the table.
Let us now look at the flame graph for this data, where we can observe all application stacks at once and possibly identify hot code paths.
Switch to the Flame Graph window.
A Flame Graph is a visual representation of the stacks and stack frames in your application. Every box in the graph represents a stack frame with the complete function name. The horizontal axis shows the stack profile population, sorted alphabetically. The vertical axis shows the stack depth, starting from zero at the bottom. The flame graph does not display data over time. The width of each box in the graph indicates the total CPU time of the function as a percentage of the total CPU time. The total function time includes the processing times of the function and all of its children (callees).
The Flame Graph window contains a Call Stacks view, which displays the hottest stack for the function selected in the flame graph. You can also observe other stacks by selecting a function, or drill down to its source code.
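The width calculation described above can be illustrated with a minimal Java sketch. It is not how VTune Profiler is implemented; the class `FlameGraphWidths`, its methods, and the sample stacks are all hypothetical, and each sample is assumed to stand for an equal slice of CPU time. The sketch counts every sample toward the frame on top of the stack and toward all of its ancestors, which is the inclusive (function plus callees) time that determines the width of a box.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of VTune Profiler: folds sampled call stacks
// into per-frame inclusive sample counts, the quantity a flame graph draws as width.
class FlameGraphWidths {

    // Each stack is listed root-first, for example ["main", "work", "allocate"].
    // A frame is keyed by its full path from the root, so the same function
    // reached through different callers gets a separate box.
    static Map<String, Integer> inclusiveSamples(List<List<String>> stacks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> stack : stacks) {
            StringBuilder path = new StringBuilder();
            for (String frame : stack) {
                if (path.length() > 0) {
                    path.append(';');
                }
                path.append(frame);
                // The sample counts toward this frame and, through the earlier
                // iterations of this loop, toward every ancestor below it.
                counts.merge(path.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Width of a box as a percentage of all samples.
    static double widthPercent(int frameSamples, int totalSamples) {
        return 100.0 * frameSamples / totalSamples;
    }

    public static void main(String[] args) {
        // Made-up samples for illustration only.
        List<List<String>> stacks = List.of(
            List.of("main", "work", "allocate"),
            List.of("main", "work", "allocate"),
            List.of("main", "work", "compute"),
            List.of("main", "gc"));
        Map<String, Integer> counts = inclusiveSamples(stacks);
        counts.forEach((path, n) ->
            System.out.printf("%-22s %5.1f%%%n", path, widthPercent(n, stacks.size())));
    }
}
```

Because the root frame appears in every sample, it spans the full width of the graph, and a child can never be wider than its parent.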
The flame graph uses a color scheme to display these types of functions:
Function Type | Description |
---|---|
User | A function from the application module of the user. |
System | A function from the System or Kernel module. |
Synchronization | A synchronization function from the Threading Library (like OpenMP Barrier). |
Overhead | An overhead function from the Threading Library (like OpenMP Fork or OpenMP Dispatcher). |
Follow these techniques as you examine the information displayed in the flame graph.
Therefore, a lot of CPU Time was spent in the Java Garbage Collector.
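As a rough cross-check of this observation from inside the JVM, the standard `java.lang.management` API reports the accumulated time each collector has spent. The sketch below is an illustration and not part of the recipe; the class `GcTimeReport` is a hypothetical name. Note the assumption: `getCollectionTime()` returns approximate elapsed time in milliseconds, not CPU time, so the share it yields is not directly comparable to the CPU Time percentages that VTune Profiler reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration: reports how much of the JVM uptime
// was spent in garbage collection, according to the JVM's own counters.
class GcTimeReport {

    // Share of totalMillis taken by partMillis, as a percentage.
    static double sharePercent(long partMillis, long totalMillis) {
        return totalMillis <= 0 ? 0.0 : 100.0 * partMillis / totalMillis;
    }

    // Sum of the accumulated collection time over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcMillis = totalGcMillis();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                + " collections, " + gc.getCollectionTime() + " ms");
        }
        System.out.printf("GC elapsed time: %d ms of %d ms uptime (%.1f%%)%n",
            gcMillis, uptimeMillis, sharePercent(gcMillis, uptimeMillis));
    }
}
```

To be meaningful, the reporting code in `main` would have to run inside the application being measured, for example at shutdown. Run on its own, it only reports on its own short-lived JVM.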
The -XX:-UseAdaptiveSizePolicy JVM option disables adaptive sizing, so the JVM may not adjust the heap to the needs of the application. The default values used for the run may also be insufficient. Let us now change the size of the JVM heap to decrease the execution time of the Garbage Collector (GC).
The -Xms and -Xmx options set the initial and maximum heap size, which together define the range within which the JVM can resize the heap. If the two values are the same, the heap size remains constant. It is good practice to refer to the JVM logs before you set values for these options.
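One way to confirm which heap range a JVM actually received is to query it at run time. The following sketch uses only standard `Runtime` and `java.lang.management` calls; the class `HeapRangeReport` is a hypothetical name introduced for illustration and is not part of the recipe.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration: prints the JVM options passed on the
// command line and the heap sizes the running JVM reports.
class HeapRangeReport {

    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Options such as -Xms and -Xmx appear here when they were supplied.
        System.out.println("JVM options: "
            + ManagementFactory.getRuntimeMXBean().getInputArguments());
        // Heap currently committed by the JVM.
        System.out.println("Committed heap (MiB): " + toMiB(rt.totalMemory()));
        // Upper bound the JVM will attempt to use.
        System.out.println("Maximum heap (MiB):   " + toMiB(rt.maxMemory()));
    }
}
```

With Java 11 or newer, this can be launched directly from source, for example as `java -Xms2g -Xmx4g HeapRangeReport.java`. The reported maximum may differ slightly from the -Xmx value, depending on the garbage collector in use.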
Let us change the -Xms and -Xmx JVM options for the application to 2GB and 4GB respectively. We will then collect a new profile:
Once the data collection completes, check the Elapsed Time and Top Hotspots in the Summary window.
Switch to the Flame Graph window to identify new hot code paths.
The flame graph shows a hot code path that includes the JVM GCTaskThread.
However, this hot code path uses only 30.6% of CPU Time compared to 93.3% on the previous run.