This is Mathematica 8 documentation; its content is based on an earlier version of the Wolfram Language.
View the latest documentation (Version 11.2)

CUDAFunctionLoad


CUDAFunctionLoad[src, fun, args, blockdim] loads CUDAFunction from the source string src and makes fun available in Mathematica.

CUDAFunctionLoad[{srcfile}, fun, args, blockdim] loads CUDAFunction from the source file srcfile and makes fun available in Mathematica.

CUDAFunctionLoad[{libfile}, fun, args, blockdim] loads CUDAFunction from the library file libfile and makes fun available in Mathematica.
  • The CUDALink application must be loaded using Needs["CUDALink`"].
  • Possible argument and return types, and their corresponding CUDA type, include:
_Integer                      mint          Mathematica integer
"Integer32"                   int           32-bit integer
_Real                         Real_t        GPU real type
"Double"                      double        machine double
"Float"                       float         machine float
{base, rank, io}              CUDAMemory    memory of specified base type, rank, and input/output option
"Local" | "Shared"            mint          local or shared memory parameter
{"Local" | "Shared", type}    mint          local or shared memory parameter
  • Valid io values are "Input", "Output", and "InputOutput".
  • If no io is given, then "InputOutput" is used by default.
  • The rank can be omitted by using {base, io} or {base}.
  • Possible base types are:
_Integer             _Real                _Complex
"Byte"               "Bit16"              "Integer32"
"Byte[2]"            "Bit16[2]"           "Integer32[2]"
"Byte[3]"            "Bit16[3]"           "Integer32[3]"
"Byte[4]"            "Bit16[4]"           "Integer32[4]"
"UnsignedByte"       "UnsignedBit16"      "Float"
"UnsignedByte[2]"    "UnsignedBit16[2]"   "Float[2]"
"UnsignedByte[3]"    "UnsignedBit16[3]"   "Float[3]"
"UnsignedByte[4]"    "UnsignedBit16[4]"   "Float[4]"
"Double"             "Double[2]"          "Double[3]"
"Double[4]"
  • CUDAFunctionLoad can be called more than once with different arguments.
  • Functions loaded by CUDAFunctionLoad run in the same process as the Mathematica kernel.
  • Functions loaded by CUDAFunctionLoad are unloaded when the Mathematica kernel exits.
  • Block dimensions can be either a list or an integer denoting how many threads per block to launch.
  • If libfile is a dynamic library, then the dynamic library function fun is loaded.
  • libfile can be a CUDA PTX, CUDA CUBIN, or a library file.
  • The maximum size of block dimensions is returned by the "Maximum Block Dimensions" property of CUDAInformation.
  • On launch, if the number of threads is not specified (as an extra argument to the CUDAFunction) then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
  • On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
  • The following options can be given:
"CleanIntermediate"       Automatic      whether temporary files should be deleted
"CompileOptions"          {}             compile options passed directly to the NVCC compiler
"CompilerInstallation"    Automatic      location of the CUDA Toolkit installation
"CreateCUBIN"             True           whether to compile code to a CUDA binary
"CreatePTX"               False          whether to compile code to CUDA bytecode
"CUDAArchitecture"        Automatic      architecture for which to compile CUDA code
"Defines"                 {}             defines passed to the NVCC preprocessor
"Device"                  $CUDADevice    CUDA device used in computation
"IncludeDirectories"      {}             directories to include in the compilation
"ShellCommandFunction"    None           function to call with the shell commands used for compilation
"ShellOutputFunction"     None           function to call with the shell output of running the compilation commands
"SystemDefines"           Automatic      system defines passed to the NVCC preprocessor
"TargetDirectory"         Automatic      directory in which CUDA files should be generated
"TargetPrecision"         Automatic      precision used in computation
"WorkingDirectory"        Automatic      directory in which temporary files will be generated
"XCompilerInstallation"   Automatic      directory where the C compiler used by NVCC is installed
First, load the CUDALink application:
This code adds 2 to a given vector:
This compiles and runs the CUDA code defined above:
This defines the length of the output list:
The following defines the input and output vectors. These are regular Mathematica lists that have the same type as defined in the CUDA kernel code's signature:
This runs the function with the specified input:
This prints the first 20 values of the result:
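Since the original input cells are not reproduced here, the steps above can be sketched as follows. The kernel name addTwo and the surrounding details are illustrative assumptions, not the documentation's original listing:

```mathematica
Needs["CUDALink`"]

(* CUDA kernel source: adds 2 to each element of a vector *)
src = "
__global__ void addTwo(mint *in, mint *out, mint length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)
        out[index] = in[index] + 2;
}";

(* load the kernel as a CUDAFunction; 256 threads per block *)
addTwo = CUDAFunctionLoad[src, "addTwo",
   {{_Integer, "Input"}, {_Integer, "Output"}, _Integer}, 256];

length = 2^20;
input = Range[length];
output = ConstantArray[0, length];

(* run the function; the output arguments are returned in a list *)
result = addTwo[input, output, length];
Take[First[result], 20]
```

Note that CUDALink defines the mint macro inside kernels, so the kernel's integer type matches Mathematica's machine integer.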
CUDA files can be passed in. This gets the path to the CUDA function file:
File names are enclosed in lists:
This defines the input parameters:
This calls the function:
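A sketch of the file-based variant follows; the file name addTwo.cu and its location under $CUDALinkPath are assumptions made for illustration:

```mathematica
Needs["CUDALink`"]

(* hypothetical source file; wrapping the name in a list marks it as a file *)
srcFile = FileNameJoin[{$CUDALinkPath, "SupportFiles", "addTwo.cu"}];
addTwo = CUDAFunctionLoad[{srcFile}, "addTwo",
   {{_Integer, "Input"}, {_Integer, "Output"}, _Integer}, 256];

addTwo[Range[16], ConstantArray[0, 16], 16]
```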
An extra argument can be given when calling the CUDAFunction. The argument denotes the number of threads to launch (or grid dimension times block dimension). This gets the source files containing the CUDA implementation:
This loads the CUDA function from the file:
This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:
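Assuming a vector-add kernel vectorAdd[in1, in2, out, length] has been loaded as described, the thread count is passed as an extra final argument after the kernel's own arguments:

```mathematica
(* the extra final argument sets the total number of launch threads;
   here only the first 32 elements of the vector add are computed *)
vectorAdd[a, b, c, length, 32]
```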
CUDA binaries can be passed in. This compiles a CUDA function to a binary using NVCCCompiler. The "UnmangleCode" option must be set to True if the user wishes to use the unmangled name:
For floating-point precision support, Real_t is defined based on the hardware and "TargetPrecision":
With no options, CUDAFunctionLoad uses the highest floating-point precision available on the device. In this case, it is double precision:
Notice how the precision macros are defined. To avoid automatic detection, you can pass the "Defines" or "TargetPrecision" options explicitly. This is equivalent to the above:
To force the use of single precision, pass the value "Single" to "TargetPrecision":
The type is detected based on the target precision. To force the use of a specific type, pass either "Float" or "Double" as the type:
CUDALink libraries can be loaded. This gets the path to an example CUDA library:
This makes sure that the file exists, since the precompiled library extension is operating system dependent:
This loads the library using CUDAFunctionLoad:
The function adds two to an input list:
The source code for this example is bundled with CUDALink:
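The library-loading steps above can be sketched as follows; the library base name is hypothetical, and the extension check mirrors the OS-dependence noted above:

```mathematica
Needs["CUDALink`"]

(* hypothetical library base name; the extension depends on the OS *)
libFile = "cudaFunctionExample." <> Switch[$OperatingSystem,
    "Windows", "dll",
    "MacOSX", "dylib",
    _, "so"];

(* load the exported function only if the precompiled library exists *)
If[FileExistsQ[libFile],
 addTwo = CUDAFunctionLoad[{libFile}, "addTwo",
    {{_Integer, "Input"}, {_Integer, "Output"}, _Integer}, 256]]
```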
The "ShellOutputFunction" option can be used to give information on compile failures. This source code has a syntax error:
This loads the function:
Setting "ShellOutputFunction" -> Print gives the build log:
In this case, the variable was misspelled.
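A sketch of this debugging pattern, with an intentionally misspelled variable in an illustrative kernel:

```mathematica
Needs["CUDALink`"]

(* "lenght" is misspelled, so NVCC will fail to compile this kernel *)
badSrc = "
__global__ void addTwo(mint *in, mint *out, mint length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < lenght)
        out[index] = in[index] + 2;
}";

(* \"ShellOutputFunction\" -> Print echoes the NVCC build log,
   which points at the undefined identifier *)
CUDAFunctionLoad[badSrc, "addTwo",
  {{_Integer, "Input"}, {_Integer, "Output"}, _Integer}, 256,
  "ShellOutputFunction" -> Print]
```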
CUDAFunctionLoad ignores the C portion of the code. This allows you to write code that can be compiled by itself as a binary, but can also be loaded as a CUDAFunction. The following loads a CUDA source file (mixed with C code) into Mathematica:
This calls the function:
CUDAFunctionLoad can specify the shared (local) memory size of the function at runtime. The following code uses shared memory to cache global memory for the gradient computation:
This specifies the input arguments, with the last argument being for shared memory. The block size is set to 256:
This computes the flattened length of a grayscale image:
This invokes the function. The shared memory size is set, and the number of launch threads is set to the flattened length of the image:
A nicer way of specifying the shared memory size is using types:
Using shared memory types, you need not pass in the size of the type:
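A minimal sketch contrasting the two ways of declaring the shared-memory argument; the kernel name gradient and the source variable src are assumptions carried over from the narration above:

```mathematica
(* untyped form: the caller supplies the shared memory size in bytes *)
grad1 = CUDAFunctionLoad[src, "gradient",
   {{"Byte", "Input"}, {"Byte", "Output"}, _Integer, "Shared"}, 256];

(* typed form: the element size is implied by the type, so the caller
   passes a number of elements rather than a byte count *)
grad2 = CUDAFunctionLoad[src, "gradient",
   {{"Byte", "Input"}, {"Byte", "Output"}, _Integer, {"Shared", "Byte"}}, 256];
```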
Templated functions can be called. The only limitation is that the templated functions must be instantiated as device functions and "UnmangleCode" must be set to False. This compiles a templated function into PTX byte code, using a dispatch function to determine which device function to call:
This loads the function for integers:
This runs the function. Here, the mangled name specifies that the input type is integer:
Templated functions can be simulated using macros. This source code contains an undefined macro:
This loads the function with the macro set to one type:
This loads the function with the macro set to a second type:
Using the above method, you can simulate template types without the need to find the mangled name.
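The macro technique can be sketched as follows; the macro name TYPE, the kernel name scale, and the source are illustrative:

```mathematica
Needs["CUDALink`"]

(* TYPE is deliberately left undefined in the source and is fixed
   at load time via the "Defines" option *)
templateSrc = "
__global__ void scale(TYPE *vec, mint length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)
        vec[index] *= 2;
}";

(* one load per concrete type; no mangled names are needed *)
scaleInt = CUDAFunctionLoad[templateSrc, "scale",
   {{_Integer}, _Integer}, 256, "Defines" -> {"TYPE" -> "mint"}];
scaleReal = CUDAFunctionLoad[templateSrc, "scale",
   {{_Real}, _Integer}, 256, "Defines" -> {"TYPE" -> "Real_t"}];
```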
Three-dimensional block sizes are supported. This inverts a volumetric dataset:
This loads the function, passing a three-dimensional block size:
This loads data from a file:
This performs the computation:
This renders the result:
Perlin noise is a common algorithm used to generate procedural textures. This is a textbook implementation of the noise function:
This loads the function:
This generates the permutation table used in the noise algorithm:
This sets the width and height. It also allocates the output memory:
This defines the parameters to the Perlin noise:
This calls the Perlin noise function. The output is a CUDAMemory handle:
The memory is retrieved from the GPU and displayed as an image:
Putting the result in Manipulate, you can see the output as parameters change:
With Perlin noise, you can create procedural landscapes. Define the width, height, and allocate memory for the landscape:
The parameters used define the landscape:
The data is retrieved and some image processing functions are used to smooth the elevation map:
The result is similar to a mountain range:
This deallocates the memory:
Varying parameters to the noise results in different patterns. Here, a wood texture is created:
The following are known parameters for wood:
This defines a helper function that recolors the grayscale image:
Here, the wood texture is generated:
As before, the result can be plotted onto a surface:
The original source code defines more noise functions. This loads all functions:
Here, Manipulate is used to showcase the different noise functions:
This deallocates the memory:
The histogram algorithm places elements in a list in separate bins depending on their values. The following implements a histogram that places values between 0 and 255 in separate bins:
This loads the two CUDA kernel functions:
This gets sample data. An image is chosen in this case, and the ImageData is flattened:
The algorithm requires some temporary data that would be used as intermediate histograms:
This computes sub-histograms and places them in the intermediate list generated before:
This merges the temporary histograms:
This gets the output histogram:
This unloads the temporary memory. Failing to do so results in a memory leak:
This plots the output histogram:
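The driver steps above can be sketched as follows. The kernel handles histKernel and mergeKernel stand for the two functions loaded earlier, and the sub-histogram count is an illustrative assumption:

```mathematica
Needs["CUDALink`"]

(* sample data: a grayscale test image, flattened to a byte list *)
img = ColorConvert[ExampleData[{"TestImage", "Lena"}], "Grayscale"];
data = Flatten[ImageData[img, "Byte"]];

numSubHists = 64;
tmp = CUDAMemoryAllocate[_Integer, {numSubHists, 256}];  (* intermediate histograms *)
out = CUDAMemoryAllocate[_Integer, 256];                 (* final histogram *)

histKernel[data, tmp, Length[data]];    (* per-block sub-histograms *)
mergeKernel[tmp, out, numSubHists];     (* merge into the final histogram *)

res = CUDAMemoryGet[out];
CUDAMemoryUnload[tmp]; CUDAMemoryUnload[out];  (* avoid a memory leak *)
BarChart[res]
```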
The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The CUDA implementation is found in the following location:
This loads the three kernels used in computation:
This generates random input data:
This allocates the output buffer:
This computes the block and grid dimensions:
A temporary buffer is needed in computation:
This performs the scan operation:
This retrieves the output buffer:
Except for the first term, the result agrees with FoldList:
This deallocates the CUDAMemory elements:
The reduction kernel is similar to Fold in Mathematica because it reduces a list given a binary operation. Whereas scan kept the previous elements in the computation, reduce discards them. This loads the reduction CUDAFunction:
This sets the input and output buffers:
This performs the computation:
Each block reduces 512 elements of the list; therefore, multiple calls are needed to reduce lists larger than 512 elements. This list is small, so no loop is necessary. This gets the output memory from the previous step, reassigns it as the input memory, and frees the old input:
This allocates a new output buffer:
This performs a second reduction:
The output is retrieved and the output buffer is unloaded:
The result agrees with Mathematica:
The following implements a color converter, converting from RGB color space to HSB. The CUDA implementation is in the file:
This loads the function from the source file:
This sets the input image along with the input parameters:
This allocates memory for the output:
This converts the image to HSB space:
By default, Image views the data in RGB space. This results in wrong output:
Use the ColorSpace option to get proper output:
The following code implements the Caesar cipher, a simple cipher that adds the value 3 to each character in the text. Here is the CUDA implementation:
This loads some example text; the Declaration of Independence is loaded in this case:
Here, the function is loaded from the code string:
This calls the CUDA function and displays only the first 100 characters of the output:
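The example can be sketched end to end as follows; the kernel name caesar and its exact source are illustrative, and the ExampleData item is assumed to be available:

```mathematica
Needs["CUDALink`"]

(* illustrative kernel: shift each character code by 3 *)
caesarSrc = "
__global__ void caesar(mint *chars, mint length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)
        chars[index] = chars[index] + 3;
}";
caesar = CUDAFunctionLoad[caesarSrc, "caesar",
   {{_Integer}, _Integer}, 128];

(* encrypt the Declaration of Independence as character codes *)
text = ExampleData[{"Text", "DeclarationOfIndependence"}];
codes = ToCharacterCode[text];
encrypted = First[caesar[codes, Length[codes]]];

(* show only the first 100 characters of the ciphertext *)
FromCharacterCode[Take[encrypted, 100]]
```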
The following implements a moving average:
This loads the CUDAFunction, defining the required macro:
This defines the input parameters and allocates memory for the output:
This calls the CUDAFunction:
This gets the output memory:
Memory is unloaded:
The Black-Scholes formula is commonly used in financial computation. CUDALink provides CUDAFinancialDerivative, which can compute financial options. To demonstrate how it is written, implement a simple version:
This loads the CUDAFunction. Set "TargetPrecision" to "Double", which means that Real_t is interpreted as double:
This assigns the input parameters:
This calls the function:
This gets the output memory:
This unloads allocated memory:
The maximum block dimension is returned by CUDAInformation:
Errors in the function call can place CUDALink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both CUDALink and the video driver unstable.
In the extreme case, this may crash the display driver, but usually it just makes further evaluation of CUDA code return invalid results.
Precompiled kernel types must agree. Kernels with Real_t defined as a float will return wrong results if used when "TargetPrecision" is set to "Double".
Exporting C++ constructs is not supported when "UnmangleCode" is set to True.
Conway's Game of Life is a cellular automaton that evolves a cell based on the state of the surroundings. This loads the CUDALink function:
For the initialization, you need to seed the state randomly, but the initial state cannot be too sparse, or else all the cells will die:
This displays the function using Dynamic and ArrayPlot. Notice that it is slightly slow:
This displays the function using Dynamic and Image. Notice that it is slightly faster:
Using CUDAMemory, you can speed up the rendering:
The following physical simulation shows how to utilize a CUDAFunction to perform computation while delegating the rest to Mathematica. This function loads the CUDAFunction and calls the BallBounceEffect function:
The following draws the particles, updating to create the effect:
This invokes the function:
The Julia set is a generalization of the Mandelbrot set. This implements the CUDA kernel:
The width and height are set. Since the set is computed on the GPU, the memory contents need not be initialized; only memory allocation is needed:
This loads the CUDAFunction. Macros are used to allow the compiler to optimize the code, by doing things like loop unrolling:
This computes the set and views it using ReliefImage:
This creates an interface using Manipulate and ReliefPlot where the user can adjust the Julia set constant:
The user can substitute Image for ReliefPlot to make the visualization even faster:
The Mandelbrot set is defined by iterating z -> z^2 + c. This code showcases how to define your own type in a kernel file, demonstrating this by considering sets of the form z -> z^n + c, where the power n is a user-defined parameter. The kernel is defined here:
This sets the width and height variables:
The output set's memory is allocated using CUDAMemoryAllocate:
This loads the function from the above file:
This computes and displays the set for a fixed power:
This computes and displays the set when the power varies:
This generates a set of 200 frames:
This renders each frame as a texture on a polygon:
Using CUDALink's symbolic capabilities, you can write CUDA code using Mathematica expressions and transform them to CUDA code. The following implements a simple 1D discrete Haar wavelet transform using symbolic code:
The symbolic code can then be converted to CUDA code using ToCCodeString from SymbolicC:
The code can then be loaded using CUDAFunctionLoad:
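Assuming the symbolic expression built above is stored in a variable code, the generate-then-load step can be sketched as follows; the kernel name haarWavelet and the argument types are assumptions:

```mathematica
Needs["SymbolicC`"]
Needs["CUDALink`"]

(* turn the SymbolicC expression tree into a CUDA source string *)
cudaSrc = ToCCodeString[code];

(* load the generated kernel as a CUDAFunction *)
haar = CUDAFunctionLoad[cudaSrc, "haarWavelet",
   {{_Real, "InputOutput"}, _Integer}, 256]
```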
This creates some input data:
This calls the CUDAFunction:
This gets the resulting CUDAMemory:
One of the interesting aspects of symbolic code generation is being able to manipulate the syntax tree. In this case, change the function arguments from one CPointerType to another:
Again, you generate the CUDA code from the result:
The other interesting aspect of code generation is that the CUDA symbolic functions are mirrored by OpenCL symbolic functions. So, taking the above symbolic code, you can generate OpenCL code by changing only the CUDA symbolic function. First, load OpenCLLink:
This implements the OpenCL 1D discrete Haar wavelet transform:
Note that only two words are changed for this translation: SymbolicCUDAFunction became SymbolicOpenCLFunction and SymbolicCUDADeclareIndexBlock became SymbolicOpenCLDeclareIndexBlock.
The following code calculates the Mandelbrot set:
Rule 30 cellular automaton does not gain much from CUDA until the column count becomes very large, since the next row is dependent on the previous. Nonetheless, you can write a simple rule 30 cellular automaton as a CUDA function:
This creates the memory buffers:
This applies 128 times:
This plots the result using ArrayPlot:
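The rule 30 example can be sketched as follows; the kernel name rule30 and the buffer sizes are illustrative:

```mathematica
Needs["CUDALink`"]

(* rule 30: new cell = left XOR (center OR right) *)
rule30Src = "
__global__ void rule30(mint *prev, mint *next, mint width) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index > 0 && index < width - 1) {
        int l = prev[index - 1], c = prev[index], r = prev[index + 1];
        next[index] = l ^ (c | r);
    }
}";
rule30 = CUDAFunctionLoad[rule30Src, "rule30",
   {{_Integer, "Input"}, {_Integer, "Output"}, _Integer}, 256];

width = 1024;
(* start from a single live cell in the middle *)
row = ReplacePart[ConstantArray[0, width], width/2 -> 1];

(* apply the function 128 times, collecting each generation *)
rows = NestList[First[rule30[#, ConstantArray[0, width], width]] &, row, 128];
ArrayPlot[rows]
```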
In cases where a CUDAFunction is called iteratively, it might be more efficient to write the CUDA function in a C file and load it as a library. The cellular automaton library file is found here:
The user may need to modify the file depending on the CUDA version and whether double support is available on the hardware. This checks if the file exists:
This loads the library function using CUDAFunctionLoad:
The library function takes a list to store the output in, the number of steps, and the cellular automaton rule. This allocates output memory of the specified width and height using CUDAMemoryAllocate:
This calls the library function, storing the output in the allocated memory:
This plots the data using ArrayPlot:
Since CUDAFunction behaves like a Mathematica function, you can use it with other functions like Manipulate:
The following implements the triplex Mandelbrot set ("Mandelbulb"), an analog to the Mandelbrot set in 3D. Triplex numbers extend the polar form of complex exponentiation to spherical coordinates in three dimensions; addition is simply vector addition. This specifies the Mandelbulb parameters (width, height, camera position, and light position):
This allocates the output memory:
The implementation is loaded from the source:
This loads the CUDAFunction:
This runs the CUDAFunction:
This gets the CUDAMemory into Mathematica:
This displays the result as an image:
The result can be placed into a Manipulate:
This creates an interface that allows the user to adjust the camera position: