CUDAFunctionLoad
CUDAFunctionLoad["src",fun,argtypes,blockdim]
compiles the string src and makes fun available in the Wolfram Language as a CUDAFunction.
CUDAFunctionLoad[File[srcfile],fun,argtypes,blockdim]
compiles the source code file srcfile and then loads fun as a CUDAFunction.
CUDAFunctionLoad[File[libfile],fun,argtypes,blockdim]
loads fun as a CUDAFunction from the previously compiled library libfile.
Details
 The CUDALink application must be loaded using Needs["CUDALink`"].
 Possible argument and return types, and their corresponding CUDA type, include:

_Integer                      mint              Wolfram Language integer
"Integer32"                   int               32-bit integer
"Integer64"                   long/long long    64-bit integer
_Real                         Real_t            GPU real type
"Double"                      double            machine double
"Float"                       float             machine float
{base, rank, io}              CUDAMemory        memory of specified base type, rank, and input/output option
"Local" | "Shared"            mint              local or shared memory parameter
{"Local" | "Shared", type}    mint              local or shared memory parameter

 In the specification {base, rank, io}, valid settings of io are "Input", "Output", and "InputOutput".
 The argument specification {base} is equivalent to {base,_,"InputOutput"}, and {base,rank} is equivalent to {base,rank,"InputOutput"}.
 The rank can be omitted by using {base,_,io} or {base,io}.
 Possible base types are:

_Integer             _Real                 _Complex
"Byte"               "Bit16"               "Integer32"
"Byte[2]"            "Bit16[2]"            "Integer32[2]"
"Byte[3]"            "Bit16[3]"            "Integer32[3]"
"Byte[4]"            "Bit16[4]"            "Integer32[4]"
"UnsignedByte"       "UnsignedBit16"       "UnsignedInteger"
"UnsignedByte[2]"    "UnsignedBit16[2]"    "UnsignedInteger[2]"
"UnsignedByte[3]"    "UnsignedBit16[3]"    "UnsignedInteger[3]"
"UnsignedByte[4]"    "UnsignedBit16[4]"    "UnsignedInteger[4]"
"Double"             "Float"               "Integer64"
"Double[2]"          "Float[2]"            "Integer64[2]"
"Double[3]"          "Float[3]"            "Integer64[3]"
"Double[4]"          "Float[4]"            "Integer64[4]"

 CUDAFunctionLoad can be called more than once with different arguments.
 Functions loaded by CUDAFunctionLoad run in the same process as the Wolfram Language kernel.
 Functions loaded by CUDAFunctionLoad are unloaded when the Wolfram Language kernel exits.
 Block dimensions can be either a list or an integer denoting how many threads per block to launch.
 If libfile is a dynamic library, then the dynamic library function fun is loaded.
 libfile can be a CUDA PTX, CUDA CUBIN, or a library file.
 The maximum size of block dimensions is returned by the "Maximum Block Dimensions" property of CUDAInformation.
 On launch, if the number of threads is not specified (as an extra argument to the CUDAFunction) then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
 On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
 The following options can be given:

"CleanIntermediate"        Automatic      whether temporary files should be deleted
"CompileOptions"           {}             compile options passed directly to the NVCC compiler
"CompilerInstallation"     Automatic      location of the CUDA Toolkit installation
"CreateCUBIN"              True           whether to compile code to a CUDA binary
"CreatePTX"                False          whether to compile code to CUDA bytecode
"CUDAArchitecture"         Automatic      architecture for which to compile CUDA code
"Defines"                  {}             defines passed to the NVCC preprocessor
"Device"                   $CUDADevice    CUDA device used in computation
"IncludeDirectories"       {}             directories to include in the compilation
"ShellCommandFunction"     None           function to call with the shell commands used for compilation
"ShellOutputFunction"      None           function to call with the shell output of running the compilation commands
"SystemDefines"            Automatic      system defines passed to the NVCC preprocessor
"TargetDirectory"          Automatic      the directory in which CUDA files should be generated
"TargetPrecision"          Automatic      precision used in computation
"WorkingDirectory"         Automatic      the directory in which temporary files will be generated
"XCompilerInstallation"    Automatic      the directory in which NVCC will find the installed C compiler
Examples
Basic Examples (5)
First, load the CUDALink application:
This code adds 2 to a given vector:
This compiles and runs the CUDA code defined above:
This defines the length of the output list:
The following defines the input and output vectors. These are regular Wolfram Language lists that have the same type as defined in the CUDA kernel code's signature:
This runs the function with the specified input:
This prints the first 20 values of the result:
CUDA files can be passed in. This gets the path to the CUDA function file:
File names are enclosed as lists:
This defines the input parameters:
An extra argument can be given when calling the CUDAFunction. The argument denotes the number of threads to launch (or grid dimension times block dimension). This gets the source files containing the CUDA implementation:
This loads the CUDA function from the file:
This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:
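The relationship between the requested thread count, the block dimension, and the grid dimension can be sketched on the host. The following Python sketch is illustrative only and is not part of CUDALink; the function name launch_config is hypothetical. It assumes the rounding rule stated in the details: a thread count that is not a multiple of the block dimension is incremented to the next multiple.

```python
def launch_config(requested_threads, block_dim):
    """Return (grid_dim, launched_threads) for a 1D launch.

    Hypothetical helper: models rounding the requested thread count
    up to the next multiple of the block dimension.
    """
    if requested_threads <= 0 or block_dim <= 0:
        raise ValueError("thread count and block dimension must be positive")
    grid_dim = (requested_threads + block_dim - 1) // block_dim  # ceiling division
    return grid_dim, grid_dim * block_dim


print(launch_config(1000, 256))  # (4, 1024)
print(launch_config(32, 16))     # (2, 32)
```

Because the rounded launch can contain more threads than there are data elements, kernels usually guard each memory access with a check such as index < length.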
For floating-point precision support, Real_t is defined based on hardware and "TargetPrecision":
With no options, "TargetPrecision" uses the highest floating-point precision available on the device. In this case, it is double precision:
Notice how the macros Real_t=double and CUDALINK_USING_DOUBLE_PRECISIONQ=1 are defined. To skip automatic detection, pass "Double" or "Single" as the value of "TargetPrecision". This is equivalent to the above:
To force the use of single precision, pass the "Single" value to "TargetPrecision":
The type _Real is detected based on the target precision. To force the use of a specific type, pass either "Float" or "Double" as type:
The "ShellOutputFunction" can be used to give information on compile failures. This source code has a syntax error:
Setting "ShellOutputFunction"->Print gives the build log:
Scope (4)
Load C and CUDA Files (1)
CUDAFunctionLoad ignores the C portion of code. This allows you to write code that can be compiled by itself as a binary, but can also be loaded in as a CUDAFunction. The following loads a CUDA source file (mixed with C code) into the Wolfram Language:
CUDAFunctionLoad can specify the shared (local) memory size of the function at runtime. The following code uses shared memory to store global memory for gradient computation:
This specifies the input arguments, with the last argument being "Shared" for shared memory. The block size is set to 256:
This computes the flattened length of a grayscale image:
This invokes the function. The shared memory size is set to (blockSize+2)*sizeof(mint), and the number of launch threads is set to the flattened length of the image:
A nicer way of specifying the shared memory size is using types:
Using shared memory types, you need not pass in the size of the type:
Templated Functions (1)
Templated functions can be called. The only limitation is that the templated functions must be instantiated as device functions, and the "UnmangleCode" option must be set to False. This compiles a templated function into PTX byte code, using a dispatch function to determine which device function to call:
This loads the function for integers:
This runs the function. Here, the 3 specifies that the input type is integer:
Generic Types Using Macros (1)
Applications (8)
Perlin Noise (1)
Perlin noise is a common algorithm used to generate procedural textures. This is a textbook implementation of the noise function:
This loads the "classicPerlin" function:
This generates the permutation table used in the noise algorithm:
This sets the width and height. It also allocates the output memory:
This defines the parameters to the Perlin noise:
This calls the Perlin noise function. The output is a CUDAMemory handle:
The memory is retrieved from the GPU and displayed as an image:
Putting the result in Manipulate, you can see the output as parameters change:
With Perlin noise, you can create procedural landscapes. Define the width and height and allocate memory for the landscape:
The parameters used define the landscape:
The data is retrieved and some image processing functions are used to smooth the elevation map:
The result is similar to a mountain range:
Varying the parameters to the noise results in different patterns. Here, a wood texture is created:
The following are known parameters for wood:
This defines a helper function that recolors the grayscale image:
Here, the wood texture is generated:
As before, the result can be plotted onto a surface:
The original source code defines more noise functions. This loads all functions:
Here, Manipulate is used to showcase the different noise functions:
Histogram Algorithm (1)
The histogram algorithm places elements in a list in separate bins, depending on their values. The following implements a histogram that places values between 0 and 255 in separate bins:
This loads the two CUDA kernel functions:
This gets sample data. An image is chosen in this case, and the ImageData is flattened:
The algorithm requires some temporary data that would be used as intermediate histograms:
This computes subhistograms and places them in the intermediate list generated before:
This merges the temporary histograms:
This gets the output histogram:
This unloads the temporary memory. Failing to do so results in a memory leak:
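The two-stage structure described above (sub-histograms first, then a merge) can be modeled on the CPU. The following Python sketch is a reference model under stated assumptions, not the CUDA kernels themselves: the function names and the chunk size are hypothetical, and each chunk stands in for the portion of the data handled by one group of threads.

```python
from collections import Counter

NUM_BINS = 256  # values are assumed to lie between 0 and 255


def sub_histograms(data, chunk_size):
    """Stage 1: build one 256-bin histogram per chunk of the input."""
    partials = []
    for start in range(0, len(data), chunk_size):
        bins = [0] * NUM_BINS
        for value in data[start:start + chunk_size]:
            bins[value] += 1
        partials.append(bins)
    return partials


def merge_histograms(partials):
    """Stage 2: sum the intermediate histograms bin by bin."""
    merged = [0] * NUM_BINS
    for bins in partials:
        for i, count in enumerate(bins):
            merged[i] += count
    return merged


# Deterministic sample data in the range 0..255 (stands in for flattened image data).
data = [(7 * i * i + 3 * i) % NUM_BINS for i in range(10_000)]
histogram = merge_histograms(sub_histograms(data, chunk_size=1024))

# Cross-check against a direct count.
counts = Counter(data)
assert histogram == [counts[v] for v in range(NUM_BINS)]
```

A model like this is useful as a correctness check for the GPU output, since the merged histogram must not depend on how the data was split.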
Prefix Sum Algorithm (1)
The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The CUDA implementation is found in the following location:
This loads the three kernels used in computation:
This generates random input data:
This allocates the output buffer:
This computes the block and grid dimensions:
A temporary buffer is needed in computation:
This performs the scan operation:
This retrieves the output buffer:
Minus the first term, the result agrees with FoldList:
This deallocates the CUDAMemory elements:
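The comparison with FoldList can be made concrete with a host-side model. The Python sketch below assumes an inclusive scan with addition, and it assumes a common three-phase blocked structure (scan each block, scan the block totals, add the offsets); the source does not state that the three kernels are organized exactly this way, so treat the structure and all function names as illustrative.

```python
from itertools import accumulate


def fold_list_plus(values, initial=0):
    """Model of FoldList[Plus, initial, values]."""
    return list(accumulate(values, initial=initial))


def blocked_inclusive_scan(values, block_size):
    """Inclusive prefix sum computed block by block (assumed structure)."""
    # Phase 1: scan every block independently.
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    scanned = [list(accumulate(block)) for block in blocks]
    # Phase 2: exclusive scan of the block totals gives each block's offset.
    totals = [block[-1] for block in scanned]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Phase 3: add each block's offset to its elements.
    return [x + offset for block, offset in zip(scanned, offsets) for x in block]


values = list(range(1, 11))
result = blocked_inclusive_scan(values, block_size=4)
print(result)                   # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
print(fold_list_plus(values))   # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

Dropping the first term of the FoldList model gives the same list as the inclusive scan, which is the agreement noted above.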
Reduce (1)
The reduction kernel is similar to Fold in the Wolfram Language because it reduces a list, given a binary operation. Whereas scan kept the previous elements in the computation, reduce discards them. This loads the reduction CUDAFunction:
This sets the input and output buffers:
This performs the computation:
Each block reduces 512 elements of the list; therefore, you need multiple calls to reduce lists larger than 512 elements. This list is small, so no loop is necessary. This gets the output memory from the previous step, assigns the memory in out to in, and frees out:
This allocates a new output buffer:
This performs a second reduction:
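The need for repeated passes can be illustrated with a CPU model. In the Python sketch below, each pass reduces groups of up to 512 elements to one value, mirroring the statement that each block reduces 512 elements; the function names are hypothetical, and the sketch says nothing about how the kernel arranges the work inside a block.

```python
from functools import reduce
import operator

BLOCK_SIZE = 512  # elements reduced per block, as stated above


def reduce_pass(values, op, block_size=BLOCK_SIZE):
    """One pass: every block of up to block_size elements becomes one value."""
    return [reduce(op, values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def reduce_all(values, op, block_size=BLOCK_SIZE):
    """Repeat passes until one value remains; return (result, number of passes)."""
    if not values:
        raise ValueError("cannot reduce an empty list")
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    passes = 0
    while len(values) > 1:
        values = reduce_pass(values, op, block_size)
        passes += 1
    return values[0], passes


total, passes = reduce_all(list(range(1000)), operator.add)
print(total, passes)  # 499500 2
```

A list of 1000 elements needs two passes (1000 values, then 2 partial results, then 1), which matches the two reductions performed in this example. Lists longer than 512 × 512 elements would need a third pass.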
RGBtoHSB Converter (1)
The following implements a color converter, converting from RGB color space to HSB. The CUDA implementation is in the file:
This loads the "rgb2hsb" function from the source file:
This sets the input image along with the input parameters:
This allocates memory for the output:
This converts the image to HSB space:
By default, Image views the data in RGB space. This results in wrong output:
Use ColorSpace->"HSB" to get proper output:
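For checking converted pixel values on the host, Python's standard colorsys module provides an RGB to HSV conversion (HSV and HSB name the same color model). The sketch below is a reference model only; it assumes channel values in the range 0 to 1 and a hue scaled to 0 to 1, which may differ from the scaling used in the CUDA source file. The helper name is hypothetical.

```python
import colorsys


def rgb_pixels_to_hsb(pixels):
    """Convert a list of (r, g, b) triples in [0, 1] to (h, s, b) triples in [0, 1]."""
    return [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]


pixels = [
    (1.0, 0.0, 0.0),  # red
    (0.0, 1.0, 0.0),  # green
    (0.0, 0.0, 1.0),  # blue
    (0.5, 0.5, 0.5),  # gray
]
for rgb, hsb in zip(pixels, rgb_pixels_to_hsb(pixels)):
    print(rgb, "->", hsb)
```

Each pixel is converted independently of its neighbors, which is why this conversion maps naturally onto one GPU thread per pixel.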
Caesar Cipher (1)
The following code implements the Caesar cipher, a simple cipher that adds the value 3 to each character in the text. Here is the CUDA implementation:
This loads some example text; the Declaration of Independence is loaded in this case:
Here, the function is loaded from the code string:
This calls the CUDA function and displays only the first 100 characters of the output:
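The per-character operation can be modeled on the host in a few lines. This Python sketch follows the description above literally: it adds 3 to every character code, with no wraparound within the alphabet (the classical cipher wraps from Z back to A, so this is an assumption about the simplified variant). The function names are hypothetical.

```python
SHIFT = 3  # the value added to each character, as described above


def caesar_shift(text, shift=SHIFT):
    """Add `shift` to every character code; each character is independent."""
    return "".join(chr(ord(ch) + shift) for ch in text)


def caesar_unshift(text, shift=SHIFT):
    """Inverse operation: subtract `shift` from every character code."""
    return "".join(chr(ord(ch) - shift) for ch in text)


encoded = caesar_shift("When in the Course")
print(encoded)                  # Zkhq#lq#wkh#Frxuvh
print(caesar_unshift(encoded))  # When in the Course
```

Note that spaces are shifted as well (a space becomes #), since the shift is applied to every character rather than only to letters.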
Moving Average (1)
The following implements a moving average:
This loads the CUDAFunction defining the macro "BLOCK_DIM" as 256:
This defines the input parameters and allocates memory for the output:
This calls the CUDAFunction:
Black–Scholes Formula (1)
The Black–Scholes formula is commonly used in financial computation. CUDALink provides CUDAFinancialDerivative, which can compute financial options. To demonstrate how such a function is written, a simple version is implemented here:
This loads the CUDAFunction. Set the "TargetPrecision" to "Single", which means that _Real is interpreted as "Float":
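A host-side reference implementation is useful for validating the GPU results. The Python sketch below uses the standard closed-form Black–Scholes prices for European call and put options without dividends; the parameter names are illustrative and need not match those of the CUDA kernel. Python computes in double precision, so results from a kernel loaded with "TargetPrecision" set to "Single" should be expected to agree only to single-precision accuracy.

```python
import math


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, volatility, expiry):
    """Return (call, put) prices of a European option without dividends."""
    sqrt_t = math.sqrt(expiry)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * expiry) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    discount = math.exp(-rate * expiry)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05,
                          volatility=0.2, expiry=1.0)
print(round(call, 4), round(put, 4))  # 10.4506 5.5735
```

Put-call parity (call minus put equals spot minus discounted strike) gives a quick sanity check that is independent of the normal distribution function.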
Possible Issues (4)
The maximum block dimension is returned by CUDAInformation:
Errors in the function call can place CUDALink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both CUDALink and the video driver unstable.
In the extreme case, this may crash the display driver, but usually it just makes further evaluation of CUDA code return invalid results.
The types used in precompiled kernels must agree with the settings used at load time. Kernels with Real_t defined as a float will return the wrong result if used when "TargetPrecision" is set to "Double".
Exporting C++ constructs is not supported when "UnmangleCode" is set to True.
Interactive Examples (4)
Conway's Game of Life (1)
Conway's Game of Life is a cellular automaton that evolves a cell, based on the state of the surroundings. This loads the CUDALink function:
For the initialization, you need to initialize randomly, but cannot have the initial state too sparse, or else all the cells will die:
This displays the function using Dynamic and ArrayPlot. Notice that it is slightly slow:
This displays the function using Dynamic and Image. Notice that it is slightly faster:
Using CUDAMemory, you can speed up the rendering:
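The update rule that the kernel applies to every cell can be written as a small CPU reference. The Python sketch below uses the standard Game of Life rules (a dead cell with exactly three live neighbors becomes alive; a live cell with two or three live neighbors survives). It assumes wraparound at the edges; the boundary handling of the CUDALink function is not stated in the source, so that choice is an assumption.

```python
def life_step(grid):
    """Advance a grid of 0/1 cells by one generation (toroidal boundary assumed)."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbors, wrapping around the edges.
            neighbors = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            if neighbors == 3 or (grid[r][c] == 1 and neighbors == 2):
                new[r][c] = 1
    return new


# A "blinker": three cells in a row that oscillate with period 2.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
for row in after_one:
    print("".join("#" if cell else "." for cell in row))
```

Every cell's new state depends only on the previous generation, so all cells can be updated in parallel, which is what makes the automaton a good fit for a GPU kernel.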
Ball Bounce (1)
The following physical simulation shows how to utilize a CUDAFunction to perform computation while delegating the rest to the Wolfram Language. This function loads the CUDAFunction and calls the BallBounceEffect function:
The following draws the particles, updating to create the effect:
Julia Set (1)
The Julia set is a generalization of the Mandelbrot set. This implements the CUDA kernel:
The width and height are set. Since the set is computed, the memory need not be set—that is, only memory allocation is needed:
This loads the CUDAFunction. Macros are used to allow the compiler to optimize the code—doing things like loop unrolling:
This computes the set and views it using ReliefImage:
This creates an interface using Manipulate and ReliefPlot, where the user can adjust the parameter of the set:
The user can replace ReliefPlot with Image to make the visualization even faster:
Mandelbrot Set (1)
The Mandelbrot set is defined by the iteration z -> z^2 + c. This code showcases how to define your own type in a kernel file, demonstrating this by considering sets of the form z -> z^p + c, where the power p is a user-defined parameter. The kernel is defined here:
This sets the width and height variables:
The output set's memory is allocated using CUDAMemoryAllocate:
This loads the function from the above file:
This computes and displays the set when the power is 2:
This computes and displays the set when the power varies:
Neat Examples (4)
SymbolicC Code Generation (1)
Using CUDALink's symbolic capabilities, you can write CUDA code using Wolfram Language expressions and transform them to CUDA code. The following implements a simple 1D discrete Haar wavelet transform, using symbolic code:
The symbolic code can then be converted to CUDA code, using ToCCodeString from SymbolicC:
The code can then be loaded using CUDAFunctionLoad:
This calls the CUDAFunction with maxStage=1:
This gets the resulting CUDAMemory:
One of the interesting aspects of symbolic code generation is being able to manipulate the syntax tree. In this case, change the function arguments from CPointerType["mint"] to CPointerType["float"]:
Again, you generate the CUDA code from the result:
The other interesting aspect of code generation is that the CUDA symbolic functions are mirrored by OpenCL symbolic functions. So, taking the above symbolic code, you can generate OpenCL code by changing only the CUDA symbolic function. First, load OpenCLLink:
This implements the OpenCL 1D discrete Haar wavelet transform:
Note that only two words are changed for this translation: SymbolicCUDAFunction became SymbolicOpenCLFunction, and SymbolicCUDADeclareIndexBlock became SymbolicOpenCLDeclareIndexBlock.
Rule 30 Cellular Automaton (1)
The rule 30 cellular automaton does not gain much from CUDA until the column count becomes very large, since each row depends on the previous one. Nonetheless, you can write a simple rule 30 cellular automaton as a CUDA function:
This creates the memory buffers:
This applies rule30 128 times:
This plots the result using ArrayPlot:
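The row update can be checked against a small CPU model. Rule 30 sets each new cell to left XOR (center OR right). The Python sketch below applies that rule with cells outside the row treated as 0; the boundary handling and the single-cell initial condition are assumptions, since the source does not show them, and the function names are hypothetical.

```python
def rule30_step(row):
    """Compute the next row; cells beyond either edge are treated as 0."""
    n = len(row)
    new = []
    for i in range(n):
        left = row[i - 1] if i > 0 else 0
        center = row[i]
        right = row[i + 1] if i < n - 1 else 0
        new.append(left ^ (center | right))
    return new


def rule30_rows(width, steps):
    """Start from a single live cell in the middle and apply the rule `steps` times."""
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = rule30_step(row)
        rows.append(row)
    return rows


for row in rule30_rows(width=7, steps=2):
    print("".join(str(cell) for cell in row))
# 0001000
# 0011100
# 0110010
```

Within one row, every cell can be computed independently, but successive rows must be computed in order, which is the dependency that limits the benefit from CUDA for narrow rows.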
Mandelbulb Set (1)
The following implements the triplex Mandelbrot set ("Mandelbulb"), an analog of the Mandelbrot set in 3D. Triplex numbers extend the polar form of complex exponentiation to spherical coordinates in three dimensions. Addition is simply vector addition. This specifies the Mandelbulb parameters (width, height, camera position, and light position):
This allocates the output memory:
The implementation is loaded from the source:
This loads the CUDAFunction:
This runs the CUDAFunction:
This gets the CUDAMemory into the Wolfram Language:
This displays the result as an image:
The result can be placed into a Manipulate:
This creates an interface that allows the user to adjust the camera position: