Intel® Advisor Help
All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
Align Data
One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.
Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:
float *array;
array = (float *)_mm_malloc(ARRAY_SIZE * sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);
Align static data using a 64-byte boundary:
__declspec(align(64)) float array[ARRAY_SIZE];
See also:
Parallelize The Loop with Both Threads and SIMD Instructions
The loop is threaded and auto-vectorized; however, the trip count is not a multiple of the vector length. To fix: Parallelize the loop with both threads and SIMD instructions by adding the simd modifier and a schedule(simd:static) clause to the omp parallel for directive, as shown in the revised code sample below.
Original code sample:
void f(int a[], int b[], int c[])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
Revised code sample:
void f(int a[], int b[], int c[])
{
    #pragma omp parallel for simd schedule(simd:static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
See also:
Force Scalar Remainder Generation
The compiler generated a masked vectorized remainder loop that contains too few iterations for efficient vector processing. A scalar loop may be more beneficial. To fix: Force scalar remainder generation using a directive: #pragma vector novecremainder.
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    // Force the compiler to not vectorize the remainder loop
    #pragma vector novecremainder
    for (i = 0; i < n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}
See also:
Force Vectorized Remainder
The compiler did not vectorize the remainder loop, even though doing so could improve performance. To fix: Force vectorization using a directive: #pragma vector vecremainder.
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    // Force the compiler to vectorize the remainder loop
    #pragma vector vecremainder
    for (i = 0; i < n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}
See also:
Specify The Expected Loop Trip Count
The compiler cannot detect the trip count statically. To fix: Specify the expected number of iterations using a directive: #pragma loop_count.
#include <stdio.h>

int mysum(int start, int end, int a)
{
    int iret = 0;
    // Iterate through the loop a minimum of three, maximum of ten, and average of five times
    #pragma loop_count min(3), max(10), avg(5)
    for (int i = start; i <= end; i++)
        iret += a;
    return iret;
}

int main()
{
    int t;
    t = mysum(1, 10, 3);
    printf("t1=%d\r\n", t);
    t = mysum(2, 6, 2);
    printf("t2=%d\r\n", t);
    t = mysum(5, 12, 1);
    printf("t3=%d\r\n", t);
}
See also:
Change The Chunk Size
The loop is threaded and vectorized using the #pragma omp parallel for simd directive, which parallelizes the loop with both threads and SIMD instructions. Specifically, the directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions. In this case, the chunk size (number of iterations per chunk) is not a multiple of vector length. To fix: Add a schedule (simd: [kind]) modifier to the #pragma omp parallel for simd directive.
void f(int a[], int b[], int c[])
{
    // Guarantee a multiple of vector length.
    #pragma omp parallel for simd schedule(simd: static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
See also:
Add Data Padding
The trip count is not a multiple of the vector length. To fix: Do one of the following:
See also:
Collect Trip Counts Data
The Survey Report lacks trip counts data that might generate more precise recommendations. To fix: Run a Trip Counts analysis.
Disable Unrolling
The trip count after loop unrolling is too small compared to the vector length. To fix: Prevent loop unrolling or decrease the unroll factor using a directive: #pragma nounroll or #pragma unroll(n).
void nounroll(int a[], int b[], int c[], int d[])
{
    // Disable automatic loop unrolling
    #pragma nounroll
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}
See also:
Use A Smaller Vector Length
The compiler chose a vector length, but the trip count might be smaller than that vector length. To fix: Specify a smaller vector length using a directive: #pragma omp simd simdlen(n).
void f(int a[], int b[], int c[], int d[])
{
    // Specify vector length using simdlen
    #pragma omp simd simdlen(4)
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}
In Intel Compiler version 19.0 and higher, there is a new vector length clause that allows the compiler to choose the best vector length based on cost: #pragma vector vectorlength(vl1, vl2, ..., vln) where vl is an integer power of 2.
void f(int a[], int b[], int c[], int d[])
{
    // Specify a list of vector lengths
    #pragma vector vectorlength(2, 4, 16)
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}
See also:
Disable Dynamic Alignment
The compiler automatically peeled iterations from the vector loop into a scalar loop to align the vector loop with a particular memory reference; however, this optimization may not be ideal. To possibly achieve better performance, disable automatic peel generation using the directive: #pragma vector nodynamic_align.
void f(float *a, float *b, float *c, int len)
{
    #pragma vector nodynamic_align
    for (int i = 0; i < len; i++)
    {
        a[i] = b[i] * c[i];
    }
}
See also:
User-defined functions in the loop body are not vectorized.
Enable Inline Expansion
Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 (enable inlining only when an inline keyword or attribute is specified) or 2 (enable inlining of any function at the compiler's discretion), as shown below.
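For illustration, assuming the Intel C/C++ compiler and a source file named main.c, the corresponding command lines might look like:

icl /O2 /Ob2 main.c              (Windows: allow inlining of any function at the compiler's discretion)
icc -O2 -inline-level=2 main.c   (Linux: equivalent setting; use -inline-level=1 to inline only marked functions)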
See also:
Vectorize Serialized Function(s) Inside Loop
To fix: Vectorize the loop and generate a vector variant of the called function using OpenMP directives:

#pragma omp declare simd
int f(int x)
{
    return x + 1;
}

#pragma omp simd
for (int k = 0; k < N; k++)
{
    a[k] = f(k);
}
See also:
Math functions in the loop body are preventing the compiler from effectively vectorizing the loop. Improve performance by enabling vectorized math call(s).
Enable Inline Expansion
Inlining is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 (enable inlining only when an inline keyword or attribute is specified) or 2 (enable inlining of any function at the compiler's discretion).
Alternatively, use the #include <mathimf.h> header instead of the standard #include <math.h> header to call highly optimized and accurate mathematical functions commonly used in applications that rely heavily on floating-point computations.
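A minimal sketch of the header swap (assumes compilation with the Intel compiler, which provides mathimf.h; the function and array names are illustrative):

// Assumes the Intel compiler; mathimf.h provides optimized versions of the
// standard math functions such as cosf/sinf/expf.
#include <mathimf.h>

void compute(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++)
    {
        out[i] = cosf(in[i]);   // resolved to the optimized Intel math library
    }
}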
See also:
Vectorize Math Function Calls Inside Loops
Your application calls serialized versions of math functions when you use the precise floating point model. To fix: Do one of the following:
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}
See also:
Change The Floating Point Model
Your application calls serialized versions of math functions when you use the strict floating point model. To fix: Do one of the following:
icc program.c -O2 -fopenmp -fp-model precise -fast-transcendentals

#pragma omp simd
for (i = 0; i < N; i++)
{
    a[i] = b[i] * c[i];
}

#pragma omp simd collapse(2)
for (i = 0; i < N; i++)
{
    for (j = 0; j < N; j++)
    {
        d[i][j] = e[i][j] * f[i][j];
    }
}
See also:
Use a Glibc Library with Vectorized SVML Functions
Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:
gcc program.c -O2 -fopenmp -ffast-math -lrt -lm -mavx2 -I/opt/glibc-2.22/include -L/opt/glibc-2.22/lib -Wl,--dynamic-linker=/opt/glibc-2.22/lib/ld-linux-x86-64.so.2

#include "math.h"
#include "stdio.h"
#include "stdlib.h"
#define N 100000

int main()
{
    double angles[N], results[N];
    int i;
    srand(86456);

    for (i = 0; i < N; i++)
    {
        angles[i] = rand();
    }

    #pragma omp simd
    for (i = 0; i < N; i++)
    {
        results[i] = cos(angles[i]);
    }

    return 0;
}
See also:
Use The Intel Short Vector Math Library for Vector Intrinsics
Your application calls scalar instead of vectorized versions of math functions. To fix: Do all of the following:
gcc program.c -O2 -ftree-vectorize -funsafe-math-optimizations -mveclibabi=svml -L/opt/intel/lib/intel64 -lm -lsvml -Wl,-rpath=/opt/intel/lib/intel64

#include "math.h"
#include "stdio.h"
#include "stdlib.h"
#define N 100000

int main()
{
    double angles[N], results[N];
    int i;
    srand(86456);

    for (i = 0; i < N; i++)
    {
        angles[i] = rand();
    }

    // the loop will be auto-vectorized
    for (i = 0; i < N; i++)
    {
        results[i] = cos(angles[i]);
    }

    return 0;
}
See also:
The compiler assumes indirect or irregular stride access to data used for vector operations. Improve memory access by alerting the compiler to detected regular stride access patterns, such as unit (contiguous) or constant stride.
Refactor code with detected regular stride access patterns
The Memory Access Patterns Report shows the following regular stride access(es):
See details in the Memory Access Patterns Report Source Details view.
To improve memory access: Refactor your code to alert the compiler to a regular stride access. Sometimes, it might be beneficial to use the ipo/Qipo compiler option to enable interprocedural optimization (IPO) between files.
An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). Detected constant stride might be the result of AoS implementation. While this organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
However, the cost of rewriting code to organize data using SoA instead of AoS may outweigh the benefit. To fix: Use Intel SIMD Data Layout Templates (Intel SDLT), introduced in version 16.1 of the Intel compiler, to mitigate the cost. Intel SDLT is a C++11 template library that may reduce code rewrites to just a few lines.
Example: Refactor code with a detected vertical invariant access pattern.
// main.cpp
int a[8] = {1, 0, 5, 7, 4, 2, 6, 3};

// gather.cpp
void test_gather(int* a, int* b, int* c, int* d)
{
    int i, k;

    // inefficient access
    #pragma omp simd
    for (i = 0; i < INNER_COUNT; i++)
        d[i] = b[a[i % 8]] + c[i];

    int b_alt[8];
    for (k = 0; k < 8; ++k)
        b_alt[k] = b[a[k]];

    // more effective version
    for (i = 0; i < INNER_COUNT/8; i++)
    {
        #pragma omp simd
        for (k = 0; k < 8; ++k)
            d[i*8 + k] = b_alt[k] + c[i*8 + k];
    }
}

Also make sure vector function clauses match arguments in the calls within the loop (if any).
Compare function calls with their declarations.
// functions.cpp
#pragma omp declare simd
int foo1(int* arr, int idx) { return 2 * arr[idx]; }

#pragma omp declare simd uniform(arr) linear(idx)
int foo2(int* arr, int idx) { return 2 * arr[idx]; }

#pragma omp declare simd linear(arr) uniform(idx)
int foo3(int* arr, int idx) { return 2 * arr[idx]; }

// gather.cpp
void test_gather(int* a, int* b, int* c)
{
    int i, k;

    // Loop will be vectorized; for complex access patterns gathers could be used for the function call.
    #pragma omp simd
    for (i = 0; i < INNER_COUNT; i++)
        a[i] = b[i] + foo1(c, i);

    // Loop will be vectorized with a vectorized call
    #pragma omp simd
    for (i = 0; i < INNER_COUNT; i++)
        a[i] = b[i] + foo2(c, i);

    // Loop will be vectorized with a serialized function call
    #pragma omp simd
    for (i = 0; i < INNER_COUNT; i++)
        a[i] = b[i] + foo3(c, i);
}
See also:
Possible register spilling was detected and all vector registers are in use. This may negatively impact performance, because the spilled variable must be stored to and reloaded from memory. Improve performance by decreasing vector register pressure.
Decrease Unroll Factor
The current directive unroll factor increases vector register pressure. To fix: Disable unrolling or decrease the unroll factor using a directive: #pragma nounroll or #pragma unroll(n).
void nounroll(int a[], int b[], int c[], int d[])
{
    #pragma nounroll
    for (int i = 1; i < 100; i++)
    {
        b[i] = a[i] + 1;
        d[i] = c[i] + 1;
    }
}
See also:
Split Loop into Smaller Loops
Possible register spilling along with high vector register pressure is preventing effective vectorization. To fix: Use the directive #pragma distribute_point or rewrite your code to distribute the source loop. This can decrease register pressure as well as enable software pipelining and improve both instruction and data cache use.
#define NUM 1024

void loop_distribution_pragma2(
    double a[NUM], double b[NUM], double c[NUM],
    double x[NUM], double y[NUM], double z[NUM])
{
    int i;

    // After distribution or splitting the loop.
    for (i = 0; i < NUM; i++)
    {
        a[i] = a[i] + i;
        b[i] = b[i] + i;
        c[i] = c[i] + i;
        #pragma distribute_point
        x[i] = x[i] + i;
        y[i] = y[i] + i;
        z[i] = z[i] + i;
    }
}
See also:
The compiler assumed there is an anti-dependency (Write after read - WAR) or a true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.
Confirm Dependency Is Real
There is no confirmation that a real (proven) dependency is present in the loop. To confirm: Run a Dependencies analysis.
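For reference, the Dependencies analysis can also be run from the Intel Advisor command line; a sketch with placeholder project directory and application names:

advixe-cl -collect dependencies -project-dir ./advi_results -- ./myApplication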
Enable Vectorization
The Dependencies analysis shows there is no real dependency in the loop for the given workload. Tell the compiler it is safe to vectorize using the restrict keyword or a directive:
#pragma ivdep
for (i = 0; i < n - 4; i += 4)
{
    // ivdep: assumed vector dependencies in this loop are ignored by the compiler
    a[i + 4] = a[i] * c;
}
See also:
The compiler assumed there is an anti-dependency (Write after read - WAR) or true dependency (Read after write - RAW) in the loop. Improve performance by investigating the assumption and handling accordingly.
Resolve Dependency
The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix: Do one of the following:
#pragma omp simd safelen(4)
for (i = 0; i < n - 4; i += 4)
{
    a[i + 4] = a[i] * c;
}
#pragma omp simd reduction(+:sumx)
for (k = 0; k < size2; k++)
{
    sumx += x[k] * b[k];
}
See also:
There are multiple data types within loops. Utilize hardware vectorization support more effectively by avoiding data type conversion.
Use The Smallest Data Type
The source loop contains data types of different widths. To fix: Use the smallest data type that gives the needed precision to use the entire vector register width.
Example: If only 16 bits are needed, using a short rather than an int can make the difference between eight-way and four-way SIMD parallelism, respectively.
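A sketch of the idea, assuming 128-bit vector registers; the array and function names are illustrative:

#define N 1024

short a16[N], b16[N];
int   a32[N], b32[N];

void sum16(void)
{
    for (int i = 0; i < N; i++)
        a16[i] = a16[i] + b16[i];   // eight 16-bit lanes per 128-bit register
}

void sum32(void)
{
    for (int i = 0; i < N; i++)
        a32[i] = a32[i] + b32[i];   // four 32-bit lanes per 128-bit register
}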
User-defined functions in the loop body are preventing the compiler from vectorizing the loop.
Enable Inline Expansion
Inlining of user-defined functions is disabled by a compiler option. To fix: When using the Ob or inline-level compiler option to control inline expansion, replace the 0 argument with 1 (enable inlining only when an inline keyword or attribute is specified) or 2 (enable inlining of any function at the compiler's discretion).
See also:
Vectorize User Function(s) Inside Loop
These user-defined function(s) are not vectorized or inlined by the compiler: my_calc(). To fix: Do one of the following:
Target | Directive |
---|---|
Source loop | #pragma omp simd |
Inner function definition or declaration | #pragma omp declare simd |
#pragma omp declare simd
int f(int x)
{
    return x + 1;
}

#pragma omp simd
for (int k = 0; k < N; k++)
{
    a[k] = f(k);
}
See also:
Cause: You are using a non-Intel compiler or an outdated Intel compiler. Nevertheless, it appears there are no issues preventing vectorization and vectorization may be profitable.
Explore Vectorization Opportunities
You compiled with auto-vectorization enabled; however, the compiler did not vectorize the code. Explore vectorization opportunities:
See also:
Enable Auto-Vectorization
You compiled with auto-vectorization disabled; enable auto-vectorization:
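For example (assuming the Intel compiler and GCC respectively; auto-vectorization is enabled at the listed optimization levels):

icc -O2 program.c                    (Intel compiler: auto-vectorization is on at -O2 and above; remove any -no-vec option)
gcc -O2 -ftree-vectorize program.c   (GCC: enable the auto-vectorizer; it is also on by default at -O3)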
See also:
System function call(s) in the loop body are preventing the compiler from vectorizing the loop.
Remove System Function Call(s) Inside Loop
Typically system function or subroutine calls cannot be vectorized; even a print statement is sufficient to prevent vectorization. To fix: Avoid using system function calls in loops.
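A minimal sketch of the idea with a hypothetical loop: the printf call is moved after the loop so the loop body remains vectorizable.

#include <stdio.h>

void scale(float *a, const float *b, int n)
{
    float last = 0.0f;

    // No system calls inside: the loop body can be vectorized
    for (int i = 0; i < n; i++)
    {
        a[i] = 2.0f * b[i];
        last = a[i];
    }

    // Report results outside the loop
    printf("last value = %f\n", last);
}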
OpenMP* function call(s) in the loop body are preventing the compiler from effectively vectorizing the loop.
Move OpenMP Call(s) Outside The Loop Body
OpenMP calls prevent automatic vectorization when the compiler cannot move the calls outside the loop body, such as when OpenMP calls are not invariant. To fix: Move the OpenMP calls outside the loop body, as shown in the revised code example below.
Original code example:
#pragma omp parallel for private(tid, nthreads)
for (int k = 0; k < N; k++)
{
    tid = omp_get_thread_num();        // this call inside the loop prevents vectorization
    nthreads = omp_get_num_threads();  // this call inside the loop prevents vectorization
    ...
}
Revised code example:
#pragma omp parallel private(tid, nthreads)
{
    // Move OpenMP calls here
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();

    #pragma omp for nowait
    for (int k = 0; k < N; k++)
    {
        ...
    }
}
See also:
Remove OpenMP Lock Functions
Locking objects slows loop execution. To fix: Rewrite the code without OpenMP lock functions.
Allocating a separate container for each thread and merging them after the parallel region may improve speed (but consumes more memory).
Original code example:
int A[n];
list<int> L;
...
omp_lock_t lock_obj;
omp_init_lock(&lock_obj);

#pragma omp parallel for shared(L, A, lock_obj) default(none)
for (int i = 0; i < n; ++i)
{
    // A[i] calculation
    ...
    if (A[i] < 1.0)
    {
        omp_set_lock(&(lock_obj));
        L.insert(L.begin(), A[i]);
        omp_unset_lock(&(lock_obj));
    }
}

omp_destroy_lock(&lock_obj);
Revised code example:
int A[n];
list<int> L;
omp_set_num_threads(nthreads_all);
...
vector<list<int>> L_by_thread(nthreads_all);  // separate list for each thread

#pragma omp parallel shared(L, L_by_thread, A) default(none)
{
    int k = omp_get_thread_num();

    #pragma omp for nowait
    for (int i = 0; i < n; ++i)
    {
        // A[i] calculation
        ...
        if (A[i] < 1.0)
        {
            L_by_thread[k].insert(L_by_thread[k].begin(), A[i]);
        }
    }
}

// merge data into a single list
for (int k = 0; k < L_by_thread.size(); k++)
{
    L.splice(L.end(), L_by_thread[k]);
}
See also:
Inefficient memory access patterns may result in significant vector code execution slowdown or block automatic vectorization by the compiler. Improve performance by investigating.
Confirm Inefficient Memory Access Patterns
There is no confirmation inefficient memory access patterns are present. To fix: Run a Memory Access Patterns analysis.
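For reference, the Memory Access Patterns analysis can also be run from the Intel Advisor command line; a sketch with placeholder project directory and application names:

advixe-cl -collect map -project-dir ./advi_results -- ./myApplication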
There is a high percentage of memory instructions with irregular (variable or random) stride accesses. Improve performance by investigating and handling accordingly.
Reorder Loops
This loop has less efficient memory access patterns than a nearby outer loop. To fix: Reorder the loops if possible.
Original code example:
void matmul(float *a[], float *b[], float *c[], int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

Revised code example:
void matmul(float *a[], float *b[], float *c[], int N)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

Interchanging is not always possible because of dependencies, which can lead to different results.
Use Intel SDLT
The cost of rewriting code to organize data using SoA instead of AoS may outweigh the benefit. To fix: Use Intel SIMD Data Layout Templates (Intel SDLT), introduced in version 16.1 of the Intel compiler, to mitigate the cost. Intel SDLT is a C++11 template library that may reduce code rewrites to just a few lines.
Using SDLT instead of STL containers may improve the memory access pattern for more efficient vector processing.
Original code example:
struct kValues
{
    float Kx;
    float Ky;
    float Kz;
    float PhiMag;
};

std::vector<kValues> dataset(count);

// Initialization step
for (int i = 0; i < count; ++i)
{
    dataset[i].Kx = kx[i];
    dataset[i].Ky = ky[i];
    dataset[i].Kz = kz[i];
    dataset[i].PhiMag = phiMag[i];
}

// Calculation step
for (indexK = 0; indexK < numK; indexK++)
{
    expArg = PIx2 * (dataset[indexK].Kx * x[indexX] +
                     dataset[indexK].Ky * y[indexX] +
                     dataset[indexK].Kz * z[indexX]);
    cosArg = cosf(expArg);
    sinArg = sinf(expArg);
    float phi = dataset[indexK].PhiMag;
    QrSum += phi * cosArg;
    QiSum += phi * sinArg;
}
Revised code example:
#include <sdlt/sdlt.h>

struct kValues
{
    float Kx;
    float Ky;
    float Kz;
    float PhiMag;
};
SDLT_PRIMITIVE(kValues, Kx, Ky, Kz, PhiMag)

sdlt::soa1d_container<kValues> dataset(count);

// Initialization step
auto kAccess = dataset.access();
for (k = 0; k < numK; k++)
{
    kAccess[k].Kx() = kx[k];
    kAccess[k].Ky() = ky[k];
    kAccess[k].Kz() = kz[k];
    kAccess[k].PhiMag() = phiMag[k];
}

// Calculation step
auto kVals = dataset.const_access();
#pragma omp simd private(expArg, cosArg, sinArg) reduction(+:QrSum, QiSum)
for (indexK = 0; indexK < numK; indexK++)
{
    expArg = PIx2 * (kVals[indexK].Kx() * x[indexX] +
                     kVals[indexK].Ky() * y[indexX] +
                     kVals[indexK].Kz() * z[indexX]);
    cosArg = cosf(expArg);
    sinArg = sinf(expArg);
    float phi = kVals[indexK].PhiMag();
    QrSum += phi * cosArg;
    QiSum += phi * sinArg;
}
See also:
Use SoA Instead of AoS
An array is the most common type of data structure containing a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation, it can hinder effective vector processing. To fix: Rewrite code to organize data using SoA instead of AoS.
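A minimal sketch of the transformation with hypothetical point data (type, field, and function names are illustrative):

#define N 1024

// AoS: fields of one element are adjacent in memory, so reading x across
// iterations has a constant (non-unit) stride.
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];

// SoA: each field is a contiguous array, giving unit-stride access that
// vectorizes efficiently.
struct PointsSoA { float x[N]; float y[N]; float z[N]; };
struct PointsSoA points_soa;

void scale_x(void)
{
    for (int i = 0; i < N; i++)
        points_soa.x[i] *= 2.0f;   // unit stride over a contiguous array
}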
See also:
Your current hardware supports the AVX2 instruction set architecture (ISA), which enables the use of fused multiply-add (FMA) instructions. Improve performance by utilizing FMA instructions.
Force Vectorization If Possible
The loop contains FMA instructions (so vectorization could be beneficial), but is not vectorized. To fix, review:
See also:
Explicitly Enable FMA Generation When Using The Strict Floating-Point Model
Static analysis presumes the loop may benefit from FMA instructions available with the AVX2 ISA, but the strict floating-point model disables FMA instruction generation by default. To fix: Override this behavior using the fma compiler option.
Windows OS | Linux OS |
---|---|
/Qfma | -fma |
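For example, command lines that keep the strict floating-point model but explicitly enable FMA generation might look like this (a sketch, assuming the Intel compiler):

icc -O2 -fp-model strict -fma program.c    (Linux)
icl /O2 /fp:strict /Qfma program.c         (Windows)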
See also:
Target The Higher ISA
Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 or higher ISA, no FMA instructions executed for this loop. To fix: Use the following compiler options:
See also:
Target A Specific ISA Instead of Using The xHost Option
Although static analysis presumes the loop may benefit from FMA instructions available with the AVX2 or higher ISA, no FMA instructions executed for this loop. To fix: Instead of using the xHost compiler option, which limits optimization opportunities by the host ISA, use the following compiler options:
Windows OS | Linux OS |
---|---|
/QxCORE-AVX2 or /QaxCORE-AVX2 | -xCORE-AVX2 or -axCORE-AVX2 |
/QxCOMMON-AVX512 or /QaxCOMMON-AVX512 | -xCOMMON-AVX512 or -axCOMMON-AVX512 |
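For instance, a Linux command line targeting AVX2 directly might look like this (a sketch, assuming the Intel compiler):

icc -O2 -xCORE-AVX2 program.c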
See also:
Indirect function call(s) in the loop body are preventing the compiler from vectorizing the loop. Indirect calls, sometimes called indirect jumps, get the callee address from a register or memory; direct calls get the callee address from an argument. Even if you force loop vectorization, indirect calls remain serialized.
Improve Branch Prediction
For 64-bit applications, branch prediction performance can be negatively impacted when the branch target is more than 4 GB away from the branch. This is more likely to happen when the application is split into shared libraries. To fix: Do the following:
See also:
Remove Indirect Call(s) Inside The Loop
Indirect function or subroutine calls cannot be vectorized. To fix: Avoid using indirect calls in loops.
Replace Calls to Virtual Methods with Direct Calls
Calls to virtual methods are always indirect because the function address is calculated at runtime. To fix, do the following:
Original code example:
struct A
{
    virtual double foo(double x) { return x + 1; }
};

struct B : public A
{
    double foo(double x) override { return x - 1; }
};
...
A* obj = new B();
double sum = 0.0;

#pragma omp simd reduction(+:sum)
for (int k = 0; k < N; ++k)
{
    // virtual indirect call
    sum += obj->foo(a[k]);
}
...
Revised code example:
struct A
{
    // Intel Compiler 17.x or higher could vectorize the call to a virtual method
    #pragma omp declare simd
    virtual double foo(double x) { return x + 1; }
};
...
sum = 0.0;

#pragma omp simd reduction(+:sum)
for (int k = 0; k < N; ++k)
{
    // step for Intel Compiler 16.x or lower:
    // if you know the method to be called,
    // replace the virtual call with a direct one
    sum += ((B*)obj)->B::foo(a[k]);
}
...
See also:
Vectorize Calls to Virtual Method
Force vectorization of the source loop using SIMD instructions and/or generate vector variants of the function(s) using a directive:
Original code example:
struct A
{
    virtual double foo(double x) { return x + 1; }
};

struct B : public A
{
    double foo(double x) override { return x - 1; }
};
...
A* obj = new B();
double sum = 0.0;

#pragma omp simd reduction(+:sum)
for (int k = 0; k < N; ++k)
{
    // indirect call to virtual method
    sum += obj->foo(a[k]);
}
...
Revised code example:
struct A
{
    #pragma omp declare simd
    virtual double foo(double x) { return x + 1; }
};
...
See also:
Vector declaration defaults for your SIMD-enabled functions may result in extra computations or ineffective memory access patterns. Improve performance by overriding defaults.
Target a Specific Processor Type(s)
The default instruction set architecture (ISA) for SIMD-enabled functions is inefficient for your host processor because it could result in extra memory operations between registers. To fix: Add one of the following to tell the compiler to generate an extended set of vector functions.
Windows OS | Linux OS |
---|---|
processor(cpuid) to #pragma omp declare simd | processor(cpuid) to #pragma omp declare simd |
processor(cpuid) to __declspec(vector()) | processor(cpuid) to __attribute__((vector())) |
/Qvecabi:cmdtarget Note: Vector variants are created for targets specified by compiler options /Qx or /Qax | -vecabi=cmdtarget Note: Vector variants are created for targets specified by compiler options -x or -ax |
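As a sketch, the processor clause on a SIMD-enabled function might look like the following; core_4th_gen_avx is one example cpuid value, and the function is hypothetical:

// Generate vector variants tuned for 4th-generation Intel Core (AVX2) processors
#pragma omp declare simd processor(core_4th_gen_avx)
float my_simd_func(float x)
{
    return x * x + 1.0f;
}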
See also:
Enforce the Compiler to Ignore Assumed Vector Dependencies
No real dependencies were detected, so there is no need for conflict-detection instructions. To fix: Tell the compiler it is safe to vectorize using a directive #pragma ivdep.
#pragma ivdep
for (i = 0; i < n; i++)
{
    a[index[i]] = b[i] * c;
}
See also:
This is an outer (non-innermost) loop. Normally, outer loops are not targeted by auto-vectorization. Outer loop vectorization is also possible and sometimes more profitable, but it requires explicit vectorization using the OpenMP* API or Intel® Cilk™ Plus.
Collect Trip Counts Data
The Survey Report lacks trip counts data that might prove profitability for outer loop vectorization. To fix: Run a Trip Counts analysis.
Check Dependencies for Outer Loop
It is not safe to force vectorization without knowing that there are no dependencies. Disable inner loop vectorization before checking for dependencies. To check: Run a Dependencies analysis.
Check Memory Access Patterns for Outer Loop
To ensure that the outer loop has optimal memory access patterns, run a Memory Access Patterns analysis.
Consider Outer Loop Vectorization
The compiler never targets loops other than innermost ones, so it vectorized the inner loop but did not vectorize the outer loop. However, outer loop vectorization could be more profitable because of a better memory access pattern, higher trip counts, or a better dependencies profile.
To enforce outer loop vectorization:
Target | Directive |
---|---|
Outer Loop | #pragma omp simd |
Inner Loop | #pragma novector |
#pragma omp simd
for (i = 0; i < N; i++)
{
    #pragma novector
    for (j = 0; j < N; j++)
    {
        sum += A[i] * A[j];
    }
}
See also:
Consider Outer Loop Vectorization.
The compiler did not vectorize the loop because the code exceeds the compiler's complexity criteria. You might get higher performance if you force loop vectorization. Use a directive right before your loop block in the source code.
ICL/ICC/ICPC Directive |
---|
#pragma omp simd |
See also:
Consider Outer Loop Vectorization
The compiler did not vectorize the inner loop due to potential dependencies detected. You might vectorize the outer loop if it has no dependencies. Use a directive right before your loop block in the source code.
ICL/ICC/ICPC Directive |
---|
#pragma omp simd |
See also:
STL algorithms are algorithmically optimized. Improve performance with algorithms that are both algorithmically and programmatically optimized by using Parallel STL. Parallel STL is an implementation of C++ standard library algorithms for the next version of the C++ standard, commonly called C++17, that supports execution policies and is specifically optimized for Intel® processors. Pass one of the following values as the first parameter in an algorithm call to specify the desired execution policy.
Execution Policy | Meaning |
---|---|
seq | Execute sequentially. |
unseq | Use SIMD. (Requires SIMD-safe functions.) |
par | Use multithreading. (Requires thread-safe functions.) |
par_unseq | Use SIMD and multithreading. (Requires SIMD-safe and thread-safe functions.) |
Parallel STL supports SIMD and multithreading execution policies for a subset of algorithms if random access iterators are provided. Execution remains sequential for all other algorithms.
Use Parallel STL Alternative to std::any_of
The std::any_of algorithm runs sequentially. To run in parallel, use the Parallel STL alternative with the following execution policy: std::execution::unseq
#include "pstl/execution" #include "pstl/algorithm" void foo(float* a, int n) { std::any_of(std::execution::unseq, a, a+n, [](float elem) { return elem > 100.f; }); }
See also:
Use Parallel STL Alternative to std::copy_if
The std::copy_if algorithm runs sequentially. To run in parallel, use the Parallel STL alternative with one of the following execution policies:
#include "pstl/execution" #include "pstl/algorithm" void foo(float* a, float* b, int n) { std::copy_if(std::execution::par_unseq, a, a+n, b, [](float elem) { return elem > 10.f; }); }
See also:
Use Parallel STL Alternative to std::for_each
The std::for_each algorithm runs sequentially. To run in parallel, use the Parallel STL alternative with one of the following execution policies:
#include "pstl/execution" #include "pstl/algorithm" void foo(float* a, int n) { std::for_each(std::execution::par_unseq, a, a+n, [](float elem) { ... }); }
See also:
Use Parallel STL Alternative to std::sort
The std::sort algorithm runs sequentially. To run in parallel, use the Parallel STL alternative with the following execution policy: std::execution::par
#include "pstl/execution" #include "pstl/algorithm" void foo(float* a, int n) { std::sort(std::execution::par, a, a+n); }
See also:
Your current hardware supports Advanced Vector Extensions 512 (AVX-512) instructions that enable the use of approximate reciprocal and reciprocal square root instructions both for single- and double-precision floating-point calculations. Improve performance by utilizing these instructions.
Force Vectorization If Possible
The loop contains SQRT/DIV instructions (so vectorization could be beneficial), but is not vectorized. To fix, review:
See also:
Target the AVX-512 ISA
Static analysis presumes the loop may benefit from AVX-512 approximate reciprocal instructions, but these instructions were not used. To fix: Use one of the following compiler options:
Windows OS | Linux OS |
---|---|
/QxCOMMON-AVX512 or /QaxCOMMON-AVX512 | -xCOMMON-AVX512 or -axCOMMON-AVX512 |
See also:
Target the AVX-512 Exponential and Reciprocal Instructions ISA
Static analysis presumes the loop may benefit from AVX-512 Exponential and Reciprocal (AVX-512ER) instructions currently supported only on Intel® Xeon Phi™ processors, but these instructions were not used. To fix: Use one of the following compiler options:
Windows OS | Linux OS |
---|---|
/QxMIC-AVX512 or /QaxMIC-AVX512 | -xMIC-AVX512 or -axMIC-AVX512 |
See also:
Enable the Use of Approximate Reciprocal Instructions by Fine-Tuning Precision and Floating-Point Model Compiler Options
Static analysis presumes the loop may benefit from using approximate reciprocal instructions, but the precision and floating-point model settings may prevent the compiler from using these instructions. To fix: Fine-tune your usage of the following compiler options:
Windows OS | Linux OS | Comment |
---|---|---|
/fp | -fp-model | -fp-model=precise prevents the use of approximate reciprocal instructions. |
/Qimf-precision | -fimf-precision | Consider using -fimf-precision=medium or -fimf-precision=low. |
/Qimf-accuracy-bits | -fimf-accuracy-bits | Consider decreasing this setting. |
/Qimf-max-error | -fimf-max-error | Consider increasing this setting. There is a similar option: -fimf-absolute-error. Avoid using both options at the same time or tune them together. |
/Qimf-absolute-error | -fimf-absolute-error | Consider using -fimf-max-error instead and set -fimf-absolute-error=0 (default) or increase this setting together with -fimf-max-error. |
/Qimf-domain-exclusion | -fimf-domain-exclusion | Consider increasing this setting. More excluded classes enable more optimized code. USE WITH CAUTION. This option may cause incorrect behavior if your calculations involve excluded domains. |
/Qimf-arch-consistency | -fimf-arch-consistency | -fimf-arch-consistency=true may prevent the use of approximate reciprocal instructions. |
/Qprec-div | -prec-div | -prec-div prevents the use of approximate reciprocal instructions. |
/Qprec-sqrt | -prec-sqrt | -prec-sqrt prevents the use of approximate reciprocal instructions. |
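For example, a Linux command line that relaxes math precision enough for the compiler to consider approximate reciprocal instructions might look like this (a sketch, assuming the Intel compiler and an AVX-512 target; tune the settings to your accuracy requirements):

icc -O2 -xCOMMON-AVX512 -fp-model fast -fimf-precision=medium program.c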
See also:
Stores with indirect addressing caused the compiler to assume a potential dependency.
This resulted in the use of conflict-detection instructions during SIMD processing, such as the AVX-512 vpconflict instruction, which detects duplicate values within a vector and creates conflict-free subsets. Improve performance by removing the need for conflict-detection instructions.
Enforce the Compiler to Ignore Assumed Vector Dependencies
No real dependencies were detected, so there is no need for conflict-detection instructions. To fix: Tell the compiler it is safe to vectorize using a directive #pragma ivdep.
#pragma ivdep
for (i = 0; i < n; i++)
{
    a[index[i]] = b[i] * c;
}
See also:
Improve performance by enabling instructions for approximate operations.
Enable the Use of Approximate Division Instructions
Static analysis presumes the loop may benefit from using approximate calculations. Independent divisors will be pre-calculated and replaced with multiplication by their reciprocals. To fix: Fine-tune your usage of the following compiler option:
Windows OS | Linux OS | Comment |
---|---|---|
/Qprec-div- | -no-prec-div | -no-prec-div enables the use of approximate division optimizations. |
See also:
Enable the Use of Approximate sqrt Instructions
Static analysis presumes the loop may benefit from using approximate sqrt instructions, but the precision and floating-point model settings may prevent the compiler from using these instructions. To fix: Fine-tune your usage of the following compiler option:
Windows OS | Linux OS | Comment |
---|---|---|
/Qprec-sqrt- | -no-prec-sqrt | -no-prec-sqrt enables the use of approximate sqrt optimizations. |
See also:
Enable Non-Temporal Store
Enable non-temporal store using #pragma vector nontemporal. The nontemporal clause instructs the compiler to use non-temporal (that is, streaming) stores on systems based on all supported architectures, unless specified otherwise; optionally takes a comma-separated list of variables.
When this pragma is specified, it is your responsibility to also insert any fences as required to ensure correct memory ordering within a thread or across threads. One typical way to do this is to insert a _mm_sfence intrinsic call just after the loops (such as the initialization loop) where the compiler may insert streaming store instructions.
Streaming stores may provide significant performance improvements over non-streaming stores for large amounts of data on certain processors. However, the misuse of streaming stores can significantly degrade performance.
float a[1000];

void foo(int N)
{
    int i;
    #pragma vector nontemporal
    for (i = 0; i < N; i++)
    {
        a[i] = 1;
    }
}
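Following the fencing guidance above, a sketch of inserting a store fence right after a loop that may use streaming stores (assumes the _mm_sfence intrinsic from xmmintrin.h):

#include <xmmintrin.h>   // _mm_sfence

float a[1000];

void init(int N)
{
    int i;
    #pragma vector nontemporal
    for (i = 0; i < N; i++)
    {
        a[i] = 1;
    }
    // Ensure the streaming stores are globally visible before other
    // threads read the data.
    _mm_sfence();
}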
See also:
Current placement of the loop in memory may result in inefficient use of the CPU front-end. Improve performance by aligning loop code.
Force the Compiler to Align Loop Code
Static analysis shows the loop may benefit from code alignment. To fix: Force the compiler to align the loop to a power-of-two byte boundary using a compiler directive for finer-grained control: #pragma code_align (n)
Align inner loop to 32-byte boundary:
for (i = 0; i < n; i++)
{
    #pragma code_align 32
    for (j = 0; j < m; j++)
    {
        a[i] *= b[i] + c[j];
    }
}
You may also need the following compiler option:
Windows OS | Linux OS and Mac OS |
---|---|
/Qalign-loops[:n] | -falign-loops[=n] |
where n is a power of 2 between 1 and 4096, such as 1, 2, 4, 8, 16, 32, etc. n = 1 performs no alignment. If n is not present, the compiler uses an alignment of 16 bytes. Suggestion: Try 16 and 32 first.
/Qalign-loops- and -fno-align-loops, the default compiler options, disable special loop alignment.