OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use Intel® oneAPI Math Kernel Library (oneMKL) and OpenMP* offload to run standard oneMKL computations on Intel GPUs. You can find the list of oneMKL features that support OpenMP offload in the mkl_omp_offload.h header file, which includes:

All Level 1, 2, and 3 BLAS functions through the CBLAS and BLAS interfaces, supporting both synchronous and asynchronous execution
BLAS-like extension: cblas_?axpby, cblas_?axpy_batch{_strided}, cblas_?copy_batch{_strided}, cblas_?gemv_batch{_strided}, cblas_?dgmmm_batch{_strided}, cblas_?gemm_batch{_strided}, cblas_?trsm_batch{_strided}, cblas?gemmt functionality through the CBLAS and BLAS interfaces as well as mkl_?omatcopy_batch_strided, mkl_?imatcopy_batch_strided, and mkl_?omatadd_batch_strided supporting both synchronous and asynchronous execution
LAPACK, supporting both synchronous and asynchronous execution
- {s,d,c,z}getrf
- {s,d,c,z}getri
- {s,d,c,z}getrs
- {s,d,c,z}potrf
- {s,d,c,z}potri
- {s,d,c,z}potrs
- {s,d}orgqr, {c,z}ungqr
- {s,d}ormqr, {c,z}unmqr
- {s,d,c,z}trtrs
- {s,d,c,z}steqr
- {s,d,c,z}geqrf
- {s,d,c,z}gesvd
- {s,d,c,z}gebrd
- {s,d}syev, {c,z}heev
- {s,d}syevd, {c,z}heevd
- {s,d}syevx, {c,z}heevx
- {s,d}sygvd, {c,z}hegvd
- {s,d}sygvx, {c,z}hegvx
- {s,d}sytrd, {c,z}hetrd
LAPACK-like extension: {s,d,c,z}getrf_batch_strided, {s,d,c,z}getrs_batch_strided, {s,d,c,z}getrfnp_batch_strided, {s,d,c,z}getrsnp_batch_strided
Vector Statistics
- Distributions: v?RngUniform, v?RngGaussian, v?RngExponential, v?RngLognormal, v?RngCauchy, v?RngGumbel, v?RngLaplace, v?RngRayleigh, v?RngWeibull, viRngBernoulli, viRngGeometric, viRngPoisson, viRngUniformBits, viRngUniformBits32/64
- Basic Random Number Generators: VSL_BRNG_MCG31, VSL_BRNG_MCG59, VSL_BRNG_PHILOX4X32X10, VSL_BRNG_MRG32K3A
FFTs through both DFTI and FFTW3 interfaces in one, two, and three dimensions.
- Complex-to-complex (C2C) transforms are supported through both interfaces.
- Real-to-complex (R2C) and complex-to-real (C2R) transforms are supported only through DFTI.
- Out-of-place R2C and C2C transforms offloaded to the GPU currently require INPUT_DISTANCE and OUTPUT_DISTANCE to be equal. The use of different INPUT_DISTANCE and OUTPUT_DISTANCE values is not currently supported.
- Both synchronous and asynchronous computations are supported.
- For R2C/C2R transforms on the GPU, only DFTI_CONJUGATE_EVEN_STORAGE=DFTI_COMPLEX_COMPLEX is supported (implying DFTI_PACKED_FORMAT=DFTI_CCE_FORMAT).
- Padded strides are not currently supported on the GPU. When INPUT_STRIDE is greater than INPUT_DISTANCE, it is assumed that INPUT_STRIDE = INPUT_DISTANCE*NUMBER_OF_TRANSFORMS. As an example, if you want to compute three FFTs on arrays x, y, and z, then a dense data layout like [x1, y1, z1, x2, y2, z2, x3,…] (INPUT_DISTANCE=1, NUMBER_OF_TRANSFORMS=3, INPUT_STRIDE=3) is supported. A padded data layout like [x1, y1, z1, pad, x2, y2, z2, pad, x3,…] (INPUT_DISTANCE=1, NUMBER_OF_TRANSFORMS=3, INPUT_STRIDE=4) is not currently supported.
- FFTs on embedded arrays through the FFTW3 interface are not currently supported on the GPU.
  Note
  INCONSISTENT_CONFIGURATION errors at compute time indicate an invalid descriptor or invalid data pointer. Double check your data mapping if you encounter such errors.
Sparse BLAS
- mkl_sparse_{s, d}_create_csr
- mkl_sparse_{s, d}_export_csr
- mkl_sparse_destroy
- mkl_sparse_set_sv_hint
- mkl_sparse_optimize
  - Currently only optimized for mkl_sparse_?_trsv functionality based on added hints for sv.
  - Both synchronous and asynchronous executions are supported. Note that you can run the mkl_sparse_optimize offload function asynchronously, but you are responsible for the data dependency between the optimization routine and the execution routines.
- mkl_sparse_{s, d}_mv:
  - currently only supported with SPARSE_MATRIX_TYPE_GENERAL and SPARSE_OPERATION_NON_TRANSPOSE
  - both synchronous and asynchronous computations are supported
- mkl_sparse_{s, d}_mm:
  - currently supported only with SPARSE_MATRIX_TYPE_GENERAL and SPARSE_OPERATION_NON_TRANSPOSE
  - both SPARSE_LAYOUT_ROW_MAJOR and SPARSE_LAYOUT_COLUMN_MAJOR are supported
  - both synchronous and asynchronous computations are supported
- mkl_sparse_{s, d}_trsv
  - Currently only supported with SPARSE_MATRIX_TYPE_TRIANGULAR, SPARSE_OPERATION_NON_TRANSPOSE, and alpha == 1.0
  - Both synchronous and asynchronous computations are supported
- mkl_sparse_sp2m
  - Currently only supported with SPARSE_MATRIX_TYPE_GENERAL.
  - Both synchronous and asynchronous computations are supported with Level Zero backend, and currently only synchronous computations are supported with OpenCL backend.
  - Note that you can run the mkl_sparse_sp2m offload function asynchronously, but you are responsible for the data dependency between the first stage and the second stage of mkl_sparse_sp2m.
- In Sparse BLAS, the usage model consists of the creation stage, the inspection stage, the execution stage, and the destruction stage. For Sparse BLAS with C OpenMP Offload, only the inspection and execution stages can be asynchronously executed, provided any data dependencies are already respected. If any inspection and execution stages were executed asynchronously, #pragma omp taskwait should be called before the mkl_sparse_destroy() call.

The OpenMP offload feature from Intel® oneAPI Math Kernel Library allows you to run oneMKL computations on Intel GPUs through the standard oneMKL APIs within an omp target variant dispatch section. For example, the standard CBLAS API for single precision real data type matrix multiply is:

void cblas_sgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const MKL_INT M, const MKL_INT N,
    const MKL_INT K, const float alpha, const float *A, const MKL_INT lda,
    const float *B, const MKL_INT ldb, const float beta, float *C,
    const MKL_INT ldc);

If the oneMKL function (for example, cblas_sgemm) is called outside of an omp target variant dispatch section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the oneMKL function within a dispatch variant construct is synchronous. OpenMP offload computations may be done asynchronously by adding the nowait clause to the target variant dispatch construct. This ensures that the host thread encountering the task region generated by this target construct will not be blocked by the oneMKL call. Rather, the host thread is returned to the caller for further use. To finish the asynchronous (nowait) computations and ensure memory and execution model consistency (for example, that the results of a computation will be ready in memory to map), the last such nowait computation is followed by the stand-alone construct #pragma omp taskwait.

From the OpenMP Application Programming Interface version 5.0 specification: "The taskwait region binds to the current task region [i.e., in this case, the last nowait computation]. The current task region is suspended at an implicit task scheduling point associated with the construct. The current task region remains suspended until all child tasks that it generated before the taskwait region complete execution [currently, depend clause is not supported]."

Example

Examples for using the OpenMP offload for oneMKL are located in the Intel® oneAPI Math Kernel Library installation directory, under:

examples/c_offload

The following code snippet shows how to use OpenMP offload for single-call oneMKL features such as most dense linear algebra functionality.

#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h" // MKL header file for OpenMP offload
int dnum = 0; 
int main() {
    float *a, *b, *c, alpha = 1.0, beta = 1.0;
    MKL_INT m = 150, n = 200, k = 128, lda = m, ldb = k, ldc = m;
    MKL_INT sizea = lda * k, sizeb = ldb * n, sizec = ldc * n;
    // allocate matrices and check pointers
    a = (float *)mkl_malloc(sizea * sizeof(float), 64);
    ...
    // initialize matrices
#pragma omp target map(c[0:sizec])
    {
        for (i = 0; i < sizec; i++) {
            c[i] = 42;
        }
        ...
    }
    // run gemm on host, use standard MKL interface
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, 
lda, b, ldb, beta, c, ldc);
    // map the a, b and c matrices on the device memory
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) 
device(dnum)
    {
        // run gemm on gpu, use standard MKL interface within a variant dispatch 
construct
        // if offload is not possible, default to cpu 
        // use the use_device_ptr clause to specify that a, b and c are device 
memory
#pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
        {
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, 
alpha, a, lda, b, ldb, beta, c, ldc);
        }
    }
    // Free matrices
    mkl_free(a);
        …
}

Some of the oneMKL functionality requires to call a set of functions to perform the corresponding computation. This is the case, for example, for the Discrete Fourier Transform which for a typical computation involves calling the functions.

DFTI_EXTERN MKL_LONG DftiCreateDescriptor(DFTI_DESCRIPTOR_HANDLE*,
                              enum DFTI_CONFIG_VALUE,
                              enum DFTI_CONFIG_VALUE,
                              MKL_LONG, ...);
DFTI_EXTERN MKL_LONG DftiCommitDescriptor(DFTI_DESCRIPTOR_HANDLE);
DFTI_EXTERN MKL_LONG DftiComputeForward(DFTI_DESCRIPTOR_HANDLE, void*, ...);
DFTI_EXTERN MKL_LONG DftiComputeBackward(DFTI_DESCRIPTOR_HANDLE, void*, ...);
DFTI_EXTERN MKL_LONG DftiFreeDescriptor(DFTI_DESCRIPTOR_HANDLE*);

In that case, only a subset of the calls must be wrapped in an omp target variant dispatch construct as shown in the following code snippet for DFTI.

#include <omp.h>
#include "mkl.h"
#include "mkl_omp_offload.h"
int main(void)
{
    const int devNum = 0;
    const MKL_LONG N = 64;      // Size of 1D transform
    MKL_LONG status    = 0;
    MKL_LONG statusGPU = 0;
    DFTI_DESCRIPTOR_HANDLE descHandle    = NULL;
    DFTI_DESCRIPTOR_HANDLE descHandleGPU = NULL;
    MKL_Complex8 *x    = NULL;
    MKL_Complex8 *xGPU = NULL;
    printf("Create DFTI descriptor\n");
    status = DftiCreateDescriptor(&descHandle, DFTI_SINGLE, DFTI_COMPLEX, 1, N);
    printf("Create GPU DFTI descriptor\n");
    statusGPU = DftiCreateDescriptor(&descHandleGPU, DFTI_SINGLE, DFTI_COMPLEX, 
1, N);
    printf("Commit DFTI descriptor\n");
    status = DftiCommitDescriptor(descHandle);
    printf("Commit GPU DFTI descriptor\n");
#pragma omp target variant dispatch device(devNum)
    {
        statusGPU = DftiCommitDescriptor(descHandleGPU);
    }
    printf("Allocate memory for input array\n");
    x = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64);
    printf("Allocate memory for GPU input array\n");
    xGPU = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64);
    printf("Initialize input for forward FFT\n");
    // init x and xGPU ...
    printf("Compute forward FFT in-place\n");
    status = DftiComputeForward(descHandle, x);
    printf("Compute GPU forward FFT in-place\n");
#pragma omp target data map(tofrom:xGPU[0:N]) device(devNum)
    {
#pragma omp target variant dispatch use_device_ptr(xGPU) device(devNum)
        {
            statusGPU = DftiComputeForward(descHandleGPU, xGPU);
        }
    }
    // use results now in x and xGPU ...
 cleanup:
    DftiFreeDescriptor(&descHandle);
    DftiFreeDescriptor(&descHandleGPU);
    mkl_free(x);
    mkl_free(xGPU);
}

For asynchronous execution of multi-call oneMKL computation, the nowait clause needs to be used only on the call to the function performing the actual computation (for example, DftiCompute{Forward,Backward}). For instance, the following snippet shows how the DFTI example above could be changed to have two, back-to-back, asynchronous (nowait) computations dispatched, with a taskwait at the end of the second to ensure the completion of both computations before their results are accessed:

printf("Compute Intel GPU forward FFT 1 in-place\n");
#pragma omp target data map(tofrom:x1GPU[0:N1], x2GPU[0:N2]) device(devNum)
    {
#pragma omp target variant dispatch use_device_ptr(x1GPU) device(devNum) nowait
        {
            status1GPU = DftiComputeForward(descHandle1GPU, x1GPU);
        }
    printf("Compute Intel GPU forward FFT 2 in-place\n");
#pragma omp target variant dispatch use_device_ptr(x2GPU) device(devNum) nowait
        {
            status2GPU = DftiComputeForward(descHandle2GPU, x2GPU);
        }
#pragma omp taskwait
    }
    if (status1GPU != DFTI_NO_ERROR) goto failed;
    if (status2GPU != DFTI_NO_ERROR) goto failed;

For sparse BLAS computations, the workflow ‘create a CSR matrix handle’ -> ‘compute’ -> ‘destroy the CSR matrix handle’ should be done in a single target data region. For instance, the following snippet shows how the Sparse BLAS OpenMP Offload example for mkl_sparse_s_mv() could be run, where N is the number of rows, M is the number of columns, and NNZ is the number of non-zero entries of the sparse matrix csrA_gpu, x is the input vector, and the output is stored in the z array:

#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M]) map(tofrom:z[0:N]) device(devNum)
{                                                                            
#pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a)
        {                                                                   
            status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a);                            
        }                                                                            
#pragma omp target variant dispatch device(devNum) use_device_ptr(x, z)
        {                                                              
            status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z);                                            
        }                                                                            
#pragma omp target variant dispatch device(devNum)
        {                                         
            status_gpu3 = mkl_sparse_destroy(csrA_gpu);
        }                                              
    }

For asynchronous execution of multi-call oneMKL Sparse BLAS computation, the nowait clause can be added to the call of the function performing the actual computation (for example, calls to the mkl_sparse_{s,d}_mv() function). Adding the nowait clause to the mkl_sparse_s_create_csr() or mkl_sparse_destroy() calls may result in incorrect behavior. As an example, the following snippet shows how the Sparse BLAS example above could be changed to have two asynchronous (nowait) computations using the same matrix handle, csrA_gpu, but unrelated vector data so there is no read/write dependency between them. Add a taskwait at the end of the second execution to ensure the completion of both computations before the mkl_sparse_destroy() function is called:

#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M],w[0:M]) map(tofrom:y[0:N],z[0:N]) device(devNum)
    {

#pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a)
        {
            status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a);
        }

#pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) nowait
        {
            status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z);
        }

#pragma omp target variant dispatch device(devNum) use_device_ptr(w, y) nowait
        {
            status_gpu3 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, w, beta, y);
        }
#pragma omp taskwait
       
#pragma omp target variant dispatch device(devNum)
        {
            status_gpu4 = mkl_sparse_destroy(csrA_gpu);
        }
    }

OpenMP* Offload for Intel® oneAPI Math Kernel Library

Note

Example