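CUDAFunctionLoad[src, fun, argtypes, blockdim] compiles the string src and makes fun available in the Wolfram Language as a CUDAFunction.


CUDAFunctionLoad[{srcfile}, fun, argtypes, blockdim] compiles the source code file srcfile and then loads fun as a CUDAFunction.


CUDAFunctionLoad[{libfile}, fun, argtypes, blockdim] loads fun as a CUDAFunction from the previously compiled library libfile.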


  • The CUDALink application must be loaded using Needs["CUDALink`"].
  • Possible argument and return types, and their corresponding CUDA type, include:
    Wolfram Language type         CUDA type         Description
    _Integer                      mint              Wolfram Language integer
    "Integer32"                   int               32-bit integer
    "Integer64"                   long/long long    64-bit integer
    _Real                         Real_t            GPU real type
    "Double"                      double            machine double
    "Float"                       float             machine float
    {base, rank, io}              CUDAMemory        memory of specified base type, rank, and input/output option
    "Local" | "Shared"            mint              local or shared memory parameter
    {"Local" | "Shared", type}    mint              local or shared memory parameter
  • In the specification {base, rank, io}, valid settings of io are "Input", "Output", and "InputOutput".
  • The argument specification {base} is equivalent to {base,_,"InputOutput"}, and {base,rank} is equivalent to {base,rank,"InputOutput"}.
  • The rank can be omitted by using {base,_,io} or {base,io}.
  • Possible base types are:
    _Integer    _Real    _Complex
  • CUDAFunctionLoad can be called more than once with different arguments.
  • Functions loaded by CUDAFunctionLoad run in the same process as the Wolfram Language kernel.
  • Functions loaded by CUDAFunctionLoad are unloaded when the Wolfram Language kernel exits.
  • Block dimensions can be either a list or an integer denoting how many threads per block to launch.
  • If libfile is a dynamic library, then the dynamic library function fun is loaded.
  • libfile can be a CUDA PTX, CUDA CUBIN, or a library file.
  • The maximum size of block dimensions is returned by the "Maximum Block Dimensions" property of CUDAInformation.
  • On launch, if the number of threads is not specified (as an extra argument to the CUDAFunction) then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
  • On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
  • The following options can be given:
    Option                    Default        Description
    "CleanIntermediate"       Automatic      whether temporary files should be deleted
    "CompileOptions"          {}             compile options passed directly to the NVCC compiler
    "CompilerInstallation"    Automatic      location of the CUDA Toolkit installation
    "CreateCUBIN"             True           whether to compile code to a CUDA binary
    "CreatePTX"               False          whether to compile code to CUDA bytecode
    "CUDAArchitecture"        Automatic      architecture for which to compile CUDA code
    "Defines"                 {}             defines passed to the NVCC preprocessor
    "Device"                  $CUDADevice    CUDA device used in computation
    "IncludeDirectories"      {}             directories to include in the compilation
    "ShellCommandFunction"    None           function to call with the shell commands used for compilation
    "ShellOutputFunction"     None           function to call with the shell output of running the compilation commands
    "SystemDefines"           Automatic      system defines passed to the NVCC preprocessor
    "TargetDirectory"         Automatic      the directory in which CUDA files should be generated
    "TargetPrecision"         Automatic      precision used in computation
    "WorkingDirectory"        Automatic      the directory in which temporary files will be generated
    "XCompilerInstallation"   Automatic      the directory in which NVCC will find the installed C compiler

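The launch rule in the notes above (a thread count that is not a multiple of the block dimension is rounded up to one) can be sketched as follows. This is an illustration only: the helper names roundUpToBlock and gridSize and the kernel guardedCopy are hypothetical and are not part of CUDALink. The int parameters correspond to the "Integer32" type in the table above.

```cuda
// Hypothetical helper: round a requested thread count up to a multiple of the block size.
__host__ __device__ inline int roundUpToBlock(int threads, int blockSize) {
    return ((threads + blockSize - 1) / blockSize) * blockSize;
}

// Hypothetical helper: number of blocks needed to cover the requested threads.
__host__ __device__ inline int gridSize(int threads, int blockSize) {
    return (threads + blockSize - 1) / blockSize;
}

// Because rounding can launch more threads than there are elements,
// a kernel should check its index before touching memory.
__global__ void guardedCopy(const int* in, int* out, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        out[index] = in[index];
    }
}
```

For example, 1000 requested threads with a block dimension of 256 become 1024 launched threads in 4 blocks, and the index guard keeps the last 24 threads idle.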


Basic Examples  (5)

First, load the CUDALink application:

This code adds 2 to a given vector:
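A minimal sketch of such a kernel is shown here; it is not necessarily the code used on this page. It assumes the buffers are loaded with the "Integer32" type (plain int); the helper addTwoValue is a hypothetical name introduced only so the arithmetic can be checked on the host.

```cuda
// Hypothetical helper holding the per-element operation.
__host__ __device__ inline int addTwoValue(int x) {
    return x + 2;
}

// One thread per element; the guard protects against extra launched threads.
__global__ void addTwo(const int* in, int* out, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        out[index] = addTwoValue(in[index]);
    }
}
```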

This compiles and runs the CUDA code defined above:

This defines the length of the output list:

The following defines the input and output vectors. These are regular Wolfram Language lists that have the same type as defined in the CUDA kernel code's signature:

This runs the function with the specified input:

This prints the first 20 values of the result:

CUDA files can be passed in. This gets the path to the CUDA function file:

File names are enclosed as lists:

This defines the input parameters:

This calls the function:

An extra argument can be given when calling the CUDAFunction. The argument denotes the number of threads to launch (or grid dimension times block dimension). This gets the source files containing the CUDA implementation:

This loads the CUDA function from the file:

This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:

For floating-point support, Real_t is defined based on the hardware and the "TargetPrecision" option:

With no options, "TargetPrecision" uses the highest floating-point precision available on the device. In this case, it is double precision:

Notice how the macros Real_t=double and CUDALINK_USING_DOUBLE_PRECISIONQ=1 are defined. To bypass automatic detection, set "TargetPrecision" to "Double" or "Single". This is equivalent to the above:

To force the use of single precision, pass the "Single" value to "TargetPrecision":

The type _Real is detected based on the target precision. To force the use of a specific type, pass either "Float" or "Double" as type:

The "ShellOutputFunction" can be used to give information on compile failures. This source code has a syntax error:

This loads the function:

Setting "ShellOutputFunction"->Print gives the build log:

In this case, the variable index was misspelled.

Scope  (4)

Load C and CUDA Files  (1)

CUDAFunctionLoad ignores the C portion of code. This allows you to write code that can be compiled by itself as a binary, but can also be loaded in as a CUDAFunction. The following loads a CUDA source file (mixed with C code) into the Wolfram Language:

This calls the function:

CUDAFunctionLoad can specify the shared (local) memory size of the function at runtime. The following code uses shared memory to store global memory for gradient computation:

This specifies the input arguments, with the last argument being "Shared" for shared memory. The block size is set to 256:

This computes the flattened length of a grayscale image:

This invokes the function. The shared memory size is set to (blockSize+2)*sizeof(mint), and the number of launch threads is set to the flattened length of the image:
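The (blockSize+2) sizing can be illustrated with a one-dimensional central-difference kernel: each block caches its own elements plus one halo cell on each side. This is a sketch under assumptions, not the gradient code used on this page. It uses int ("Integer32") data, clamps at the array ends, and assumes length is at least 1. The names sharedBytes, clampIndex, and gradient1D are hypothetical, and how CUDALink forwards a "Shared" argument to the kernel is not shown here.

```cuda
#include <cstddef>

// Hypothetical helper: bytes of dynamic shared memory for one block plus two halo cells.
__host__ __device__ inline size_t sharedBytes(int blockSize) {
    return (static_cast<size_t>(blockSize) + 2) * sizeof(int);
}

// Hypothetical helper: clamp an index into [0, length - 1].
__host__ __device__ inline int clampIndex(int i, int length) {
    return i < 0 ? 0 : (i >= length ? length - 1 : i);
}

__global__ void gradient1D(const int* in, int* out, int length) {
    extern __shared__ int tile[];            // size is supplied at launch time
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    int lid = threadIdx.x + 1;               // slots 0 and blockDim.x + 1 are halos

    tile[lid] = in[clampIndex(gid, length)];
    if (threadIdx.x == 0) {
        tile[0] = in[clampIndex(gid - 1, length)];
    }
    if (threadIdx.x == blockDim.x - 1) {
        tile[lid + 1] = in[clampIndex(gid + 1, length)];
    }
    __syncthreads();                         // every thread reaches this barrier

    if (gid < length) {
        out[gid] = (tile[lid + 1] - tile[lid - 1]) / 2;
    }
}
```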

A nicer way of specifying the shared memory size is using types:

Using shared memory types, you need not pass in the size of the type:

Templated Functions  (1)

Templated functions can be called. The only limitations are that the templated functions must be instantiated as device functions and that the "UnmangleCode" option must be set to False. This compiles a templated function into PTX bytecode, using a dispatch function to determine which device function to call:

This loads the function for integers:

This runs the function. Here, the 3 specifies that the input type is integer:

Generic Types Using Macros  (1)

Templated functions can be simulated using macros. This source code has Generic_t as an undefined macro:
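A sketch of what such source code might look like follows. The kernel and helper names are hypothetical. The #ifndef fallback is an assumption added only so the sketch compiles on its own; in the workflow described here the macro is left undefined in the source and supplied at load time, presumably through the "Defines" option listed above.

```cuda
// Assumption: fallback definition so this sketch compiles standalone.
#ifndef Generic_t
#define Generic_t int
#endif

// Hypothetical helper: the same source works for any arithmetic Generic_t.
__host__ __device__ inline Generic_t addTwoGeneric(Generic_t x) {
    return x + static_cast<Generic_t>(2);
}

__global__ void addTwoGenericKernel(Generic_t* data, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        data[index] = addTwoGeneric(data[index]);
    }
}
```

Because the type is substituted by the preprocessor, the kernel keeps a single unmangled name regardless of the element type.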

This sets the macro Generic_t to mint:

This sets the macro Generic_t to unsigned char:

Using the above method, you can simulate template types without the need to find the mangled name.

Three-Dimensional Block Sizes  (1)

Three-dimensional block sizes are supported. This inverts a volumetric dataset:
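A three-dimensional launch gives each thread an (x, y, z) coordinate. The sketch below assumes float voxel values normalized to the range 0 to 1 and a layout in which x varies fastest; both are assumptions, and the names flatIndex and invertVolume are hypothetical rather than taken from this page.

```cuda
// Hypothetical helper: linear offset of voxel (x, y, z), with x varying fastest.
__host__ __device__ inline int flatIndex(int x, int y, int z, int width, int height) {
    return x + width * (y + height * z);
}

__global__ void invertVolume(const float* in, float* out,
                             int width, int height, int depth) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    if (x < width && y < height && z < depth) {
        int i = flatIndex(x, y, z, width, height);
        out[i] = 1.0f - in[i];   // invert a normalized intensity
    }
}
```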

This loads the function, passing a three-dimensional block size:

This loads data from a file:

This performs the computation:

This renders the result:

Applications  (8)

Perlin Noise  (1)

Perlin noise is a common algorithm used to generate procedural textures. This is a textbook implementation of the noise function:

This loads the "classicPerlin" function:

This generates the permutation table used in the noise algorithm:

This sets the width and height. It also allocates the output memory:

This defines the parameters to the Perlin noise:

This calls the Perlin noise function. The output is a CUDAMemory handle:

The memory is retrieved from the GPU and displayed as an image:

Putting the result in Manipulate, you can see the output as parameters change:

With Perlin noise, you can create procedural landscapes. Define the width and height and allocate memory for the landscape:

The parameters used define the landscape:

The data is retrieved and some image processing functions are used to smooth the elevation map:

The result is similar to a mountain range:

This deallocates the memory:

Varying the parameters of the noise results in different patterns. Here, a wood texture is created:

The following are known parameters for wood:

This defines a helper function that recolors the grayscale image:

Here, the wood texture is generated:

As before, the result can be plotted onto a surface:

The original source code defines more noise functions. This loads all functions:

Here, Manipulate is used to showcase the different noise functions:

This deallocates the memory:

Histogram Algorithm  (1)

The histogram algorithm places elements in a list in separate bins, depending on their values. The following implements a histogram that places values between 0 and 255 in separate bins:
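For orientation, a simpler variant of the same idea is sketched here. It is not the two-kernel implementation described on this page (sub-histograms followed by a merge); instead, every thread updates one shared set of 256 bins with atomicAdd. It assumes int data, bins that have been zeroed before launch, and out-of-range values clamped to the nearest bin. The names binIndex and histogram256 are hypothetical.

```cuda
// Hypothetical helper: map a value to a bin in [0, 255], clamping out-of-range input.
__host__ __device__ inline int binIndex(int value) {
    return value < 0 ? 0 : (value > 255 ? 255 : value);
}

// bins must contain 256 zeroed entries before the launch.
__global__ void histogram256(const int* data, int* bins, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        atomicAdd(&bins[binIndex(data[index])], 1);
    }
}
```

The atomic version is shorter but serializes updates to popular bins, which is one reason an implementation may prefer per-block sub-histograms.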

This loads the two CUDA kernel functions:

This gets sample data. An image is chosen in this case, and the ImageData is flattened:

The algorithm requires some temporary data that would be used as intermediate histograms:

This computes sub-histograms and places them in the intermediate list generated before:

This merges the temporary histograms:

This gets the output histogram:

This unloads the temporary memory. Failing to do so results in a memory leak:

This plots the output histogram:

Prefix Sum Algorithm  (1)

The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The CUDA implementation is found in the following location:

This loads the three kernels used in computation:

This generates random input data:

This allocates the output buffer:

This computes the block and grid dimensions:

A temporary buffer is needed in computation:

This performs the scan operation:

This retrieves the output buffer:

Minus the first term, the result agrees with FoldList:

This deallocates the CUDAMemory elements:

Reduce  (1)

The reduction kernel is similar to Fold in the Wolfram Language because it reduces a list, given a binary operation. Whereas scan kept the previous elements in the computation, reduce discards them. This loads the reduction CUDAFunction:

This sets the input and output buffers:

This performs the computation:

Each block reduces 512 elements of the list; therefore, you need multiple calls to reduce lists larger than 512 elements. This list is small, so no loop is necessary. This gets the output memory from the previous step, assigns the memory in out to in, and frees out:
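The multi-pass structure can be sketched as follows. This is an illustrative sum reduction, not the kernel loaded on this page: it assumes int data and a block dimension that is a power of two no larger than 512. The names passesNeeded and reduceSum are hypothetical.

```cuda
// Hypothetical helper: kernel launches needed until a single value remains,
// when each block collapses up to blockSize elements into one.
__host__ __device__ inline int passesNeeded(int length, int blockSize) {
    int passes = 0;
    while (length > 1) {
        length = (length + blockSize - 1) / blockSize;
        ++passes;
    }
    return passes;
}

// Each block writes one partial sum to out[blockIdx.x].
__global__ void reduceSum(const int* in, int* out, int length) {
    __shared__ int cache[512];
    int tid = threadIdx.x;
    int index = tid + blockIdx.x * blockDim.x;
    cache[tid] = (index < length) ? in[index] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = cache[0];
    }
}
```

With 512 elements per block, any list of up to 512 × 512 elements needs at most two passes, which matches the two reduction calls in this example.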

This allocates a new output buffer:

This performs a second reduction:

The output is retrieved and the output buffer is unloaded:

The result agrees with the Wolfram Language:

RGB-to-HSB Converter  (1)

The following implements a color converter, converting from RGB color space to HSB. The CUDA implementation is in the file:
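The per-pixel conversion can be sketched as below. This is a standard RGB-to-HSB formula written for illustration and is not the contents of the file referenced here. It assumes float channels in the range 0 to 1, interleaved pixel storage (r, g, b, r, g, b, ...), and a hue scaled to the range 0 to 1. The names rgbToHsb and rgb2hsbSketch are hypothetical.

```cuda
#include <math.h>

// Hypothetical helper: convert one pixel; hue, saturation, and brightness are in [0, 1].
__host__ __device__ inline void rgbToHsb(float r, float g, float b,
                                         float* h, float* s, float* v) {
    float maxc = fmaxf(r, fmaxf(g, b));
    float minc = fminf(r, fminf(g, b));
    float delta = maxc - minc;
    float hue = 0.0f;
    if (delta > 0.0f) {
        if (maxc == r) {
            hue = (g - b) / delta;
        } else if (maxc == g) {
            hue = 2.0f + (b - r) / delta;
        } else {
            hue = 4.0f + (r - g) / delta;
        }
        hue /= 6.0f;
        if (hue < 0.0f) {
            hue += 1.0f;
        }
    }
    *h = hue;
    *s = (maxc > 0.0f) ? delta / maxc : 0.0f;
    *v = maxc;
}

// One thread per pixel; in and out hold 3 * pixelCount interleaved floats.
__global__ void rgb2hsbSketch(const float* in, float* out, int pixelCount) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < pixelCount) {
        rgbToHsb(in[3 * index], in[3 * index + 1], in[3 * index + 2],
                 &out[3 * index], &out[3 * index + 1], &out[3 * index + 2]);
    }
}
```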

This loads the "rgb2hsb" function from the source file:

This sets the input image along with the input parameters:

This allocates memory for the output:

This converts the image to HSB space:

By default, Image views the data in RGB space. This results in wrong output:

Use ColorSpace->"HSB" to get proper output:

Caesar Cipher  (1)

The following code implements the Caesar cipher. The Caesar cipher is a simple cipher that adds the value 3 to each character in the text. Here is the CUDA implementation:
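A minimal sketch of such a kernel follows, assuming the text has been converted to a list of integer character codes stored as int ("Integer32"). It simply adds 3 with no wraparound, as the description above states. The names caesarShift and caesarSketch are hypothetical.

```cuda
// Hypothetical helper: shift one character code by 3.
__host__ __device__ inline int caesarShift(int code) {
    return code + 3;
}

__global__ void caesarSketch(const int* in, int* out, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        out[index] = caesarShift(in[index]);
    }
}
```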

This loads some example text; the Declaration of Independence is loaded in this case:

Here, the function is loaded from the code string:

This calls the CUDA function and displays only the first 100 characters of the output:

Moving Average  (1)

The following implements a moving average:

This loads the CUDAFunction defining the macro "BLOCK_DIM" as 256:

This defines the input parameters and allocates memory for the output:

This calls the CUDAFunction:

This gets the output memory:

Memory is unloaded:

Black-Scholes Formula  (1)

The Black-Scholes formula is commonly used in financial computation. CUDALink provides CUDAFinancialDerivative, which can compute financial options. To demonstrate how such a function is written, implement a simple version:
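For reference, the closed-form price of a European call can be sketched in single precision as below. This is the textbook formula written for illustration, not the source used on this page. The names normalCdf, blackScholesCall, and blackScholesSketch, and the parameter layout (per-option spot and strike arrays with shared rate, volatility, and expiry), are assumptions.

```cuda
#include <math.h>

// Hypothetical helper: standard normal cumulative distribution function.
__host__ __device__ inline float normalCdf(float x) {
    return 0.5f * erfcf(-x * 0.70710678f);   // 0.70710678 is approximately 1/sqrt(2)
}

// Hypothetical helper: European call price for spot S, strike K, rate r,
// volatility sigma, and time to expiry T (in years).
__host__ __device__ inline float blackScholesCall(float S, float K, float r,
                                                  float sigma, float T) {
    float sqrtT = sqrtf(T);
    float d1 = (logf(S / K) + (r + 0.5f * sigma * sigma) * T) / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    return S * normalCdf(d1) - K * expf(-r * T) * normalCdf(d2);
}

// One thread per option.
__global__ void blackScholesSketch(float* call, const float* spot, const float* strike,
                                   float r, float sigma, float T, int length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length) {
        call[index] = blackScholesCall(spot[index], strike[index], r, sigma, T);
    }
}
```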

This loads the CUDAFunction. Set the "TargetPrecision" to "Single", which means that _Real is interpreted as "Float":

This assigns the input parameters:

This calls the function:

This gets the output memory:

This unloads allocated memory:

Possible Issues  (4)

The maximum block dimension is returned by CUDAInformation:

Errors in the function call can place CUDALink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both CUDALink and the video driver unstable.

In the extreme case, this may crash the display driver, but usually it just makes further evaluation of CUDA code return invalid results.

The types used in a precompiled kernel must agree with the types used when loading it. Kernels compiled with Real_t defined as float will return the wrong result if used when "TargetPrecision" is set to "Double".

Exporting C++ constructs is not supported when "UnmangleCode" is set to True.

Interactive Examples  (4)

Conway's Game of Life  (1)

Conway's Game of Life is a cellular automaton that evolves a cell, based on the state of the surroundings. This loads the CUDALink function:
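The update rule itself can be sketched as follows. This is an illustrative kernel, not the function loaded on this page: it assumes an int grid of 0s and 1s in row-major order and edges that wrap around. The names nextState and lifeStep are hypothetical.

```cuda
// Hypothetical helper: Conway's rule for one cell.
__host__ __device__ inline int nextState(int alive, int neighbors) {
    return (neighbors == 3 || (alive && neighbors == 2)) ? 1 : 0;
}

// One thread per cell; reads from in and writes the next generation to out.
__global__ void lifeStep(const int* in, int* out, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x >= width || y >= height) {
        return;
    }
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) {
                continue;
            }
            int nx = (x + dx + width) % width;    // assumption: wrap at the edges
            int ny = (y + dy + height) % height;
            count += in[ny * width + nx];
        }
    }
    out[y * width + x] = nextState(in[y * width + x], count);
}
```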

Initialize the grid randomly, but not too sparsely, or else all the cells will die:

This displays the function using Dynamic and ArrayPlot. Notice that it is slightly slow:

This displays the function using Dynamic and Image. Notice that it is slightly faster:

Using CUDAMemory, you can speed up the rendering:

Ball Bounce  (1)

The following physical simulation shows how to utilize a CUDAFunction to perform computation while delegating the rest to the Wolfram Language. This function loads the CUDAFunction and calls the BallBounceEffect function:

The following draws the particles, updating to create the effect:

This invokes the function:

Julia Set  (1)

The Julia set is a generalization of the Mandelbrot set. This implements the CUDA kernel:
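A common escape-time formulation is sketched here for illustration; it is not necessarily the kernel used on this page. It iterates z → z² + c from each pixel's starting point and records how many steps pass before |z| exceeds 2. The viewing window (−2 to 2 on both axes), the int output, and the names juliaIterations and juliaSketch are assumptions.

```cuda
// Hypothetical helper: number of iterations of z -> z^2 + c before |z| > 2.
__host__ __device__ inline int juliaIterations(float zx, float zy,
                                               float cx, float cy, int maxIter) {
    int n = 0;
    while (n < maxIter && zx * zx + zy * zy <= 4.0f) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++n;
    }
    return n;
}

// One thread per pixel; the pixel grid is mapped onto [-2, 2) x [-2, 2).
__global__ void juliaSketch(int* out, int width, int height,
                            float cx, float cy, int maxIter) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x >= width || y >= height) {
        return;
    }
    float zx = -2.0f + 4.0f * x / width;
    float zy = -2.0f + 4.0f * y / height;
    out[y * width + x] = juliaIterations(zx, zy, cx, cy, maxIter);
}
```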

The width and height are set. Since the set is computed, the memory need not be set; that is, only memory allocation is needed:

This loads the CUDAFunction. Macros are used to allow the compiler to optimize the code, for example by unrolling loops:

This computes the set and views it using ReliefImage:

This creates an interface using Manipulate and ReliefPlot, where the user can adjust the value of c:

The user can replace ReliefPlot with Image to make the visualization even faster:

Mandelbrot Set  (1)

The Mandelbrot set is defined by the iteration $z_{n+1} = z_n^2 + c$. This code showcases how to define your own type in a kernel file, demonstrating this by considering sets of the form $z_{n+1} = z_n^p + c$, where the power $p$ is a user-defined parameter. The kernel is defined here:

This sets the width and height variables:

The output set's memory is allocated using CUDAMemoryAllocate:

This loads the function from the above file:

This computes and displays the set when the power is 2:

This computes and displays the set when the power varies:

This generates a set of 200 frames:

This renders each frame as a texture on a polygon:

Neat Examples  (4)

SymbolicC Code Generation  (1)

Using CUDALink's symbolic capabilities, you can write CUDA code using Wolfram Language expressions and transform them to CUDA code. The following implements a simple 1D discrete Haar wavelet transform, using symbolic code:

The symbolic code can then be converted to CUDA code, using ToCCodeString from SymbolicC:

The code can then be loaded using CUDAFunctionLoad:

This creates some input data:

This calls the CUDAFunction with maxStage=1:

This gets the resulting CUDAMemory:

One of the interesting aspects of symbolic code generation is being able to manipulate the syntax tree. In this case, change the function arguments from CPointerType["mint"] to CPointerType["float"]:

Again, you generate the CUDA code from the result:

The other interesting aspect of code generation is that the CUDA symbolic functions are mirrored by OpenCL symbolic functions. So, taking the above symbolic code, you can generate OpenCL code by changing only the CUDA symbolic function. First, load OpenCLLink:

This implements the OpenCL 1D discrete Haar wavelet transform:

Note that only two words are changed for this translation: SymbolicCUDAFunction became SymbolicOpenCLFunction, and SymbolicCUDADeclareIndexBlock became SymbolicOpenCLDeclareIndexBlock.

Mandelbrot Set  (1)

The following code calculates the Mandelbrot set:

Rule 30 Cellular Automaton  (1)

The rule 30 cellular automaton does not gain much from CUDA until the column count becomes very large, since each row depends on the previous one. Nonetheless, you can write a simple rule 30 cellular automaton as a CUDA function:
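One step of the automaton can be sketched as below. Rule 30 sets each new cell to left XOR (center OR right). This is an illustration rather than the code used on this page: it assumes int cells holding 0 or 1 and treats cells beyond the row ends as 0. The names rule30Cell and rule30Step are hypothetical.

```cuda
// Hypothetical helper: rule 30 for one neighborhood of 0/1 cells.
__host__ __device__ inline int rule30Cell(int left, int center, int right) {
    return left ^ (center | right);
}

// One thread per column; computes the next row from the previous row.
__global__ void rule30Step(const int* prev, int* next, int width) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x >= width) {
        return;
    }
    int left = (x > 0) ? prev[x - 1] : 0;            // assumption: zero boundary
    int right = (x < width - 1) ? prev[x + 1] : 0;
    next[x] = rule30Cell(left, prev[x], right);
}
```

Each launch produces one row, so generating many rows means launching the kernel repeatedly, which is why the speedup depends on the row width.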

This creates the memory buffers:

This applies rule30 128 times:

This plots the result using ArrayPlot:

Mandelbulb Set  (1)

The following implements the triplex Mandelbrot set ("Mandelbulb"), an analog of the Mandelbrot set in 3D. Triplex numbers extend the polar form of complex exponentiation to spherical coordinates in three dimensions. Addition is simply vector addition. This specifies the Mandelbulb parameters (width, height, camera position, and light position):

This allocates the output memory:

The implementation is loaded from the source:

This loads the CUDAFunction:

This runs the CUDAFunction:

This gets the CUDAMemory into the Wolfram Language:

This displays the result as an image:

The result can be placed into a Manipulate:

This creates an interface that allows the user to adjust the camera position: