Intel® oneAPI Math Kernel Library Developer Reference - C
You can use Intel® oneAPI Math Kernel Library (oneMKL) and OpenMP* offload to run standard oneMKL computations on Intel GPUs. You can find the list of oneMKL features that support OpenMP offload in the mkl_omp_offload.h header file, which includes:
The OpenMP offload feature from Intel® oneAPI Math Kernel Library allows you to run oneMKL computations on Intel GPUs through the standard oneMKL APIs within an omp target variant dispatch section. For example, the standard CBLAS API for single precision real data type matrix multiply is:
void cblas_sgemm(const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA, const CBLAS_TRANSPOSE TransB, const MKL_INT M, const MKL_INT N, const MKL_INT K, const float alpha, const float *A, const MKL_INT lda, const float *B, const MKL_INT ldb, const float beta, float *C, const MKL_INT ldc);
If the oneMKL function (for example, cblas_sgemm) is called outside of an omp target variant dispatch section or if offload is disabled, then the CPU implementation is dispatched. If the same function is called within an omp target variant dispatch section and offload is possible then the GPU implementation is dispatched. By default the execution of the oneMKL function within a dispatch variant construct is synchronous. OpenMP offload computations may be done asynchronously by adding the nowait clause to the target variant dispatch construct. This ensures that the host thread encountering the task region generated by this target construct will not be blocked by the oneMKL call. Rather, the host thread is returned to the caller for further use. To finish the asynchronous (nowait) computations and ensure memory and execution model consistency (for example, that the results of a computation will be ready in memory to map), the last such nowait computation is followed by the stand-alone construct #pragma omp taskwait.
From the OpenMP Application Programming Interface version 5.0 specification: "The taskwait region binds to the current task region [i.e., in this case, the last nowait computation]. The current task region is suspended at an implicit task scheduling point associated with the construct. The current task region remains suspended until all child tasks that it generated before the taskwait region complete execution [currently, depend clause is not supported]."
Examples for using the OpenMP offload for oneMKL are located in the Intel® oneAPI Math Kernel Library installation directory, under:
examples/c_offload
The following code snippet shows how to use OpenMP offload for single-call oneMKL features such as most dense linear algebra functionality.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" // MKL header file for OpenMP offload int dnum = 0; int main() { float *a, *b, *c, alpha = 1.0, beta = 1.0; MKL_INT m = 150, n = 200, k = 128, lda = m, ldb = k, ldc = m; MKL_INT sizea = lda * k, sizeb = ldb * n, sizec = ldc * n; // allocate matrices and check pointers a = (float *)mkl_malloc(sizea * sizeof(float), 64); ... // initialize matrices #pragma omp target map(c[0:sizec]) { for (i = 0; i < sizec; i++) { c[i] = 42; } ... } // run gemm on host, use standard MKL interface cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); // map the a, b and c matrices on the device memory #pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum) { // run gemm on gpu, use standard MKL interface within a variant dispatch construct // if offload is not possible, default to cpu // use the use_device_ptr clause to specify that a, b and c are device memory #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c) { cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc); } } // Free matrices mkl_free(a); … }
Some of the oneMKL functionality requires to call a set of functions to perform the corresponding computation. This is the case, for example, for the Discrete Fourier Transform which for a typical computation involves calling the functions.
DFTI_EXTERN MKL_LONG DftiCreateDescriptor(DFTI_DESCRIPTOR_HANDLE*, enum DFTI_CONFIG_VALUE, enum DFTI_CONFIG_VALUE, MKL_LONG, ...); DFTI_EXTERN MKL_LONG DftiCommitDescriptor(DFTI_DESCRIPTOR_HANDLE); DFTI_EXTERN MKL_LONG DftiComputeForward(DFTI_DESCRIPTOR_HANDLE, void*, ...); DFTI_EXTERN MKL_LONG DftiComputeBackward(DFTI_DESCRIPTOR_HANDLE, void*, ...); DFTI_EXTERN MKL_LONG DftiFreeDescriptor(DFTI_DESCRIPTOR_HANDLE*);
In that case, only a subset of the calls must be wrapped in an omp target variant dispatch construct as shown in the following code snippet for DFTI.
#include <omp.h> #include "mkl.h" #include "mkl_omp_offload.h" int main(void) { const int devNum = 0; const MKL_LONG N = 64; // Size of 1D transform MKL_LONG status = 0; MKL_LONG statusGPU = 0; DFTI_DESCRIPTOR_HANDLE descHandle = NULL; DFTI_DESCRIPTOR_HANDLE descHandleGPU = NULL; MKL_Complex8 *x = NULL; MKL_Complex8 *xGPU = NULL; printf("Create DFTI descriptor\n"); status = DftiCreateDescriptor(&descHandle, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Create GPU DFTI descriptor\n"); statusGPU = DftiCreateDescriptor(&descHandleGPU, DFTI_SINGLE, DFTI_COMPLEX, 1, N); printf("Commit DFTI descriptor\n"); status = DftiCommitDescriptor(descHandle); printf("Commit GPU DFTI descriptor\n"); #pragma omp target variant dispatch device(devNum) { statusGPU = DftiCommitDescriptor(descHandleGPU); } printf("Allocate memory for input array\n"); x = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Allocate memory for GPU input array\n"); xGPU = (MKL_Complex8 *)mkl_malloc(N*sizeof(MKL_Complex8), 64); printf("Initialize input for forward FFT\n"); // init x and xGPU ... printf("Compute forward FFT in-place\n"); status = DftiComputeForward(descHandle, x); printf("Compute GPU forward FFT in-place\n"); #pragma omp target data map(tofrom:xGPU[0:N]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(xGPU) device(devNum) { statusGPU = DftiComputeForward(descHandleGPU, xGPU); } } // use results now in x and xGPU ... cleanup: DftiFreeDescriptor(&descHandle); DftiFreeDescriptor(&descHandleGPU); mkl_free(x); mkl_free(xGPU); }
For asynchronous execution of multi-call oneMKL computation, the nowait clause needs to be used only on the call to the function performing the actual computation (for example, DftiCompute{Forward,Backward}). For instance, the following snippet shows how the DFTI example above could be changed to have two, back-to-back, asynchronous (nowait) computations dispatched, with a taskwait at the end of the second to ensure the completion of both computations before their results are accessed:
printf("Compute Intel GPU forward FFT 1 in-place\n"); #pragma omp target data map(tofrom:x1GPU[0:N1], x2GPU[0:N2]) device(devNum) { #pragma omp target variant dispatch use_device_ptr(x1GPU) device(devNum) nowait { status1GPU = DftiComputeForward(descHandle1GPU, x1GPU); } printf("Compute Intel GPU forward FFT 2 in-place\n"); #pragma omp target variant dispatch use_device_ptr(x2GPU) device(devNum) nowait { status2GPU = DftiComputeForward(descHandle2GPU, x2GPU); } #pragma omp taskwait } if (status1GPU != DFTI_NO_ERROR) goto failed; if (status2GPU != DFTI_NO_ERROR) goto failed;
For sparse BLAS computations, the workflow ‘create a CSR matrix handle’ -> ‘compute’ -> ‘destroy the CSR matrix handle’ should be done in a single target data region. For instance, the following snippet shows how the Sparse BLAS OpenMP Offload example for mkl_sparse_s_mv() could be run, where N is the number of rows, M is the number of columns, and NNZ is the number of non-zero entries of the sparse matrix csrA_gpu, x is the input vector, and the output is stored in the z array:
#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M]) map(tofrom:z[0:N]) device(devNum) { #pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a) { status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a); } #pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) { status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z); } #pragma omp target variant dispatch device(devNum) { status_gpu3 = mkl_sparse_destroy(csrA_gpu); } }
For asynchronous execution of multi-call oneMKL Sparse BLAS computation, the nowait clause can be added to the call of the function performing the actual computation (for example, calls to the mkl_sparse_{s,d}_mv() function). Adding the nowait clause to the mkl_sparse_s_create_csr() or mkl_sparse_destroy() calls may result in incorrect behavior. As an example, the following snippet shows how the Sparse BLAS example above could be changed to have two asynchronous (nowait) computations using the same matrix handle, csrA_gpu, but unrelated vector data so there is no read/write dependency between them. Add a taskwait at the end of the second execution to ensure the completion of both computations before the mkl_sparse_destroy() function is called:
#pragma omp target data map(to:ia[0:N+1],ja[0:NNZ],a[0:NNZ],x[0:M],w[0:M]) map(tofrom:y[0:N],z[0:N]) device(devNum) { #pragma omp target variant dispatch device(devNum) use_device_ptr(ia, ja, a) { status_gpu1 = mkl_sparse_s_create_csr(&csrA_gpu, SPARSE_INDEX_BASE_ZERO, N, M, ia, ia + 1, ja, a); } #pragma omp target variant dispatch device(devNum) use_device_ptr(x, z) nowait { status_gpu2 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, x, beta, z); } #pragma omp target variant dispatch device(devNum) use_device_ptr(w, y) nowait { status_gpu3 = mkl_sparse_s_mv(SPARSE_OPERATION_NON_TRANSPOSE, alpha, csrA_gpu, descrA, w, beta, y); } #pragma omp taskwait #pragma omp target variant dispatch device(devNum) { status_gpu4 = mkl_sparse_destroy(csrA_gpu); } }