CUDAFunctionLoad
This feature is not supported on the Wolfram Cloud.

CUDAFunctionLoad["src",fun,argtypes,blockdim]
compiles the string src and makes fun available in the Wolfram Language as a CUDAFunction.

CUDAFunctionLoad[File[srcfile],fun,argtypes,blockdim]
compiles the source code file srcfile and then loads fun as a CUDAFunction.

CUDAFunctionLoad[File[libfile],fun,argtypes,blockdim]
loads fun as a CUDAFunction from the previously compiled library libfile.

Details

  • The CUDALink application must be loaded using Needs["CUDALink`"].
  • Possible argument and return types, and their corresponding CUDA type, include:
  • _Integer                      mint              Wolfram Language integer
    "Integer32"                   int               32-bit integer
    "Integer64"                   long/long long    64-bit integer
    _Real                         Real_t            GPU real type
    "Double"                      double            machine double
    "Float"                       float             machine float
    {base, rank, io}              CUDAMemory        memory of specified base type, rank, and input/output option
    "Local" | "Shared"            mint              local or shared memory parameter
    {"Local" | "Shared", type}    mint              local or shared memory parameter
  • In the specification {base, rank, io}, valid settings of io are "Input", "Output", and "InputOutput".
  • The argument specification {base} is equivalent to {base,_,"InputOutput"}, and {base,rank} is equivalent to {base,rank,"InputOutput"}.
  • The rank can be omitted by using {base,_,io} or {base,io}.
  • Possible base types are:
  • _Integer              _Real                 _Complex
    "Byte"                "Bit16"               "Integer32"
    "Byte[2]"             "Bit16[2]"            "Integer32[2]"
    "Byte[3]"             "Bit16[3]"            "Integer32[3]"
    "Byte[4]"             "Bit16[4]"            "Integer32[4]"
    "UnsignedByte"        "UnsignedBit16"       "UnsignedInteger"
    "UnsignedByte[2]"     "UnsignedBit16[2]"    "UnsignedInteger[2]"
    "UnsignedByte[3]"     "UnsignedBit16[3]"    "UnsignedInteger[3]"
    "UnsignedByte[4]"     "UnsignedBit16[4]"    "UnsignedInteger[4]"
    "Double"              "Float"               "Integer64"
    "Double[2]"           "Float[2]"            "Integer64[2]"
    "Double[3]"           "Float[3]"            "Integer64[3]"
    "Double[4]"           "Float[4]"            "Integer64[4]"
  • CUDAFunctionLoad can be called more than once with different arguments.
  • Functions loaded by CUDAFunctionLoad run in the same process as the Wolfram Language kernel.
  • Functions loaded by CUDAFunctionLoad are unloaded when the Wolfram Language kernel exits.
  • Block dimensions can be either a list or an integer denoting how many threads per block to launch.
  • If libfile is a dynamic library, then the dynamic library function fun is loaded.
  • libfile can be a CUDA PTX, CUDA CUBIN, or a library file.
  • The maximum size of block dimensions is returned by the "Maximum Block Dimensions" property of CUDAInformation.
  • On launch, if the number of threads is not specified (as an extra argument to the CUDAFunction) then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
  • On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
  • The following options can be given:
  • "CleanIntermediate"Automaticwhether temporary files should be deleted
    "CompileOptions"{}compile options passed directly to the NVCC compiler
    "CompilerInstallation"Automaticlocation of the CUDA Toolkit installation
    "CreateCUBIN"Truewhether to compile code to a CUDA binary
    "CreatePTX"Falsewhether to compile code to CUDA bytecode
    "CUDAArchitecture"Automaticarchitecture for which to compile CUDA code
    "Defines"{}defines passed to the NVCC preprocessor
    "Device"$CUDADeviceCUDA device used in computation
    "IncludeDirectories"{}directories to include in the compilation
    "ShellCommandFunction"Nonefunction to call with the shell commands used for compilation
    "ShellOutputFunction"Nonefunction to call with the shell output of running the compilation commands
    "SystemDefines"Automaticsystem defines passed to the NVCC preprocessor
    "TargetDirectory"Automaticthe directory in which CUDA files should be generated
    "TargetPrecision"Automaticprecision used in computation
    "WorkingDirectory"Automaticthe directory in which temporary files will be generated
    "XCompilerInstallation"Automaticthe directory where NVCC will find the C compiler is installed

Examples

Basic Examples  (5)

First, load the CUDALink application:

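As noted in the Details section, the package is loaded with:

    Needs["CUDALink`"]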

This code adds 2 to a given vector:

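A minimal kernel of this kind might look like the following; the name addTwo and its exact signature are illustrative assumptions. The type mint is supplied by CUDALink when the source is compiled with CUDAFunctionLoad.

    code = "
      __global__ void addTwo(mint * arry, mint len) {
          int index = threadIdx.x + blockIdx.x*blockDim.x;
          if (index < len)
              arry[index] += 2;   /* add 2 to each element */
      }";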

This compiles the CUDA code defined above and loads it as a CUDAFunction:

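A corresponding load call might be as follows; the type specification matches the sketch above, and the block dimension 256 is an arbitrary choice:

    cudaFun = CUDAFunctionLoad[code, "addTwo",
      {{_Integer, _, "InputOutput"}, _Integer}, 256]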

This defines the length of the output list:

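For example (the value is an arbitrary choice):

    length = 2^16;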

The following defines the input and output vectors. These are regular Wolfram Language lists that have the same type as defined in the CUDA kernel code's signature:

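One possibility, continuing the addTwo sketch; with an "InputOutput" specification a single integer list supplies the input and receives the output:

    vec = Range[length];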

This runs the function with the specified input:

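Assuming the names used in the sketches above:

    res = cudaFun[vec, length];  (* a CUDAFunction returns a list of its output arguments *)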

This prints the first 20 values of the result:

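Continuing the sketch, the first output argument is the updated vector:

    Take[First[res], 20]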

CUDA files can be passed in. This gets the path to the CUDA function file:

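The file used on this page is not shown; as an illustration, assume a file addTwo.cu in the current directory that contains a kernel like the one sketched earlier:

    srcFile = FileNameJoin[{Directory[], "addTwo.cu"}]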

File names are enclosed in lists:

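Assuming the file defines the addTwo kernel, a load call might look like this:

    cudaFun = CUDAFunctionLoad[{srcFile}, "addTwo",
      {{_Integer, _, "InputOutput"}, _Integer}, 256]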

This defines the input parameters:

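For instance:

    vec = Range[20];
    len = Length[vec]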

This calls the function:

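Continuing the sketch:

    cudaFun[vec, len]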

An extra argument can be given when calling the CUDAFunction. The argument denotes the number of threads to launch (or grid dimension times block dimension). This gets the source file containing the CUDA implementation:

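The file used on this page is not shown; as an illustration, assume a file vectorAdd.cu that defines a kernel vectorAdd adding two real vectors into an output vector:

    srcFile = FileNameJoin[{Directory[], "vectorAdd.cu"}]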

This loads the CUDA function from the file:

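A load call consistent with that assumption; the block dimension 32 is chosen so that launching 32 threads fills exactly one block:

    vecAdd = CUDAFunctionLoad[File[srcFile], "vectorAdd",
      {{_Real, _, "Input"}, {_Real, _, "Input"}, {_Real, _, "Output"}, _Integer}, 32]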

This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:

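Continuing the sketch, the trailing 32 is the extra argument giving the number of threads to launch, so only the first 32 entries of the output are computed:

    len = 128;
    a = RandomReal[1, len];
    b = RandomReal[1, len];
    out = ConstantArray[0., len];
    vecAdd[a, b, out, len, 32]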

For floating-point support, the type Real_t is defined based on the hardware and the "TargetPrecision" option:

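A kernel written against Real_t might look like this; the name scaleVec is illustrative:

    code = "
      __global__ void scaleVec(Real_t * arry, mint len) {
          int index = threadIdx.x + blockIdx.x*blockDim.x;
          if (index < len)
              arry[index] *= (Real_t)2.0;   /* double each element */
      }";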

With no options, "TargetPrecision" uses the highest floating-point precision available on the device. In this case, it is double precision:

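One way to inspect the macros passed to NVCC is to print the compilation command; the use of "ShellCommandFunction" -> Print here is an assumption about how the definitions mentioned below were observed:

    CUDAFunctionLoad[code, "scaleVec", {{_Real, _, "InputOutput"}, _Integer}, 256,
      "ShellCommandFunction" -> Print]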

Notice how the macros Real_t=double and CUDALINK_USING_DOUBLE_PRECISIONQ=1 are defined. To bypass the automatic detection, you can pass "Double" or "Single" as the value of "TargetPrecision". This is equivalent to the above:

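A sketch of the equivalent explicit form, keeping the same assumed kernel:

    CUDAFunctionLoad[code, "scaleVec", {{_Real, _, "InputOutput"}, _Integer}, 256,
      "TargetPrecision" -> "Double", "ShellCommandFunction" -> Print]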

To force the use of single precision, pass the "Single" value to "TargetPrecision":

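Continuing the sketch:

    CUDAFunctionLoad[code, "scaleVec", {{_Real, _, "InputOutput"}, _Integer}, 256,
      "TargetPrecision" -> "Single", "ShellCommandFunction" -> Print]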

The type _Real is detected based on the target precision. To force the use of a specific type, pass either "Float" or "Double" as the type:

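For example, fixing the argument type to machine doubles; since the assumed kernel uses Real_t, the explicit type should agree with the precision the code is compiled for:

    CUDAFunctionLoad[code, "scaleVec", {{"Double", _, "InputOutput"}, _Integer}, 256,
      "TargetPrecision" -> "Double"]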

The "ShellOutputFunction" can be used to give information on compile failures. This source code has a syntax error:

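A sketch of such code, reusing the addTwo kernel from the basic example with its index variable misspelled:

    errCode = "
      __global__ void addTwo(mint * arry, mint len) {
          int index = threadIdx.x + blockIdx.x*blockDim.x;
          if (index < len)
              arry[indx] += 2;   /* 'indx' is a misspelling of 'index' */
      }";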

This attempts to load the function, which fails:

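Assuming the definitions above:

    CUDAFunctionLoad[errCode, "addTwo", {{_Integer, _, "InputOutput"}, _Integer}, 256]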

Setting "ShellOutputFunction"->Print gives the build log:

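Continuing the sketch:

    CUDAFunctionLoad[errCode, "addTwo", {{_Integer, _, "InputOutput"}, _Integer}, 256,
      "ShellOutputFunction" -> Print]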

In this case, the variable index was misspelled.