OpenCLFunctionLoad
OpenCLFunctionLoad["src",fun,argtypes,blockdim]
compiles the string src and makes fun available in the Wolfram Language as an OpenCLFunction.
OpenCLFunctionLoad[File[srcfile],fun,argtypes,blockdim]
compiles the source code file srcfile and then loads fun as an OpenCLFunction.
OpenCLFunctionLoad[File[libfile],fun,argtypes,blockdim]
loads fun as an OpenCLFunction from the previously compiled library libfile.
Details
 The OpenCLLink application must be loaded using Needs["OpenCLLink`"].
 If libfile is a dynamic library, then the dynamic library function fun is loaded.
 Possible argument and return types, and their corresponding OpenCL type, include:

_Integer                        mint              Wolfram Language integer
"Integer32"                     int               32-bit integer
"Integer64"                     long/long long    64-bit integer
_Real                           Real_t            GPU real type
"Double"                        double            machine double
"Float"                         float             machine float
{base, rank, io}                OpenCLMemory      memory of specified base type, rank, and input/output option
"Local" | "Shared"              mint              local or shared memory parameter
{"Local" | "Shared", type}      mint              local or shared memory parameter

 In the specification {base, rank, io}, valid settings of io are "Input", "Output", and "InputOutput".
 The argument specification {base} is equivalent to {base,_,"InputOutput"}, and {base,rank} is equivalent to {base,rank,"InputOutput"}.
 The rank can be omitted by using {base,_,io} or {base,io}.
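As a minimal sketch of how these specifications are written, the following loads a small kernel whose first argument uses the shorthand {_Integer}. The kernel source and the names src and addTwo are illustrative assumptions written for this sketch; they are not copied from the bundled examples.

```wolfram
Needs["OpenCLLink`"]

(* Illustrative kernel (an assumption): adds 2 to each element of arry *)
src = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0);
    if (index < len)
        arry[index] += 2;
}";

(* {_Integer} is shorthand for {_Integer, _, "InputOutput"};
   the plain _Integer is the scalar length, seen as mint by the kernel *)
addTwo = OpenCLFunctionLoad[src, "addTwo", {{_Integer}, _Integer}, 16];

(* Call with the data and its length *)
addTwo[Range[10], 10]
```

Writing the first argument as {_Integer, 1, "InputOutput"} instead would additionally restrict it to rank-1 input.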
 Possible base types are:

_Integer               _Real                   _Complex
"Byte"                 "Bit16"                 "Integer32"
"Byte[2]"              "Bit16[2]"              "Integer32[2]"
"Byte[4]"              "Bit16[4]"              "Integer32[4]"
"Byte[8]"              "Bit16[8]"              "Integer32[8]"
"Byte[16]"             "Bit16[16]"             "Integer32[16]"
"UnsignedByte"         "UnsignedBit16"         "UnsignedInteger"
"UnsignedByte[2]"      "UnsignedBit16[2]"      "UnsignedInteger[2]"
"UnsignedByte[4]"      "UnsignedBit16[4]"      "UnsignedInteger[4]"
"UnsignedByte[8]"      "UnsignedBit16[8]"      "UnsignedInteger[8]"
"UnsignedByte[16]"     "UnsignedBit16[16]"     "UnsignedInteger[16]"
"Double"               "Float"                 "Integer64"
"Double[2]"            "Float[2]"              "Integer64[2]"
"Double[4]"            "Float[4]"              "Integer64[4]"
"Double[8]"            "Float[8]"              "Integer64[8]"
"Double[16]"           "Float[16]"             "Integer64[16]"

 OpenCLFunctionLoad can be called more than once with different arguments.
 Functions loaded by OpenCLFunctionLoad run in the same process as the Wolfram Language kernel.
 Functions loaded by OpenCLFunctionLoad are unloaded when the Wolfram Language kernel exits.
 Block dimensions can be either a list or an integer denoting how many threads per block to launch.
 The maximum size of block dimensions is returned by the "Maximum Work Group Size" property of OpenCLInformation.
 On launch, if the number of threads is not specified (as an extra argument to OpenCLFunction), then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
 On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
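The launch rules above can be sketched with ordinary Wolfram Language arithmetic. The values of blockSize and requested are made up for illustration, and querying platform 1, device 1 assumes that such a device is present on the system.

```wolfram
Needs["OpenCLLink`"]

(* Upper bound for the block dimension; platform 1 and device 1 are assumptions *)
maxBlock = OpenCLInformation[1, 1, "Maximum Work Group Size"];

(* A requested thread count that is not a multiple of the block size
   is rounded up to the next multiple *)
blockSize = 256;
requested = 1000;
launched = Ceiling[requested, blockSize]   (* 1024 *)
```

Because more threads may run than there are elements (24 extra in this sketch), kernels typically guard their memory accesses with a check such as index < len.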
 The following options can be given:

"CompileOptions"          {}                 compile options passed directly to the OpenCL compiler
"Defines"                 Automatic          defines passed to the OpenCL preprocessor
"Device"                  $OpenCLDevice      OpenCL device used in computation
"IncludeDirectories"      {}                 directories to include in the compilation
"Platform"                $OpenCLPlatform    OpenCL platform used in computation
"ShellCommandFunction"    None               function to call with the shell commands used for compilation
"ShellOutputFunction"     None               function to call with the shell output of running the compilation commands
"TargetPrecision"         Automatic          precision used in computation
"WorkingDirectory"        Automatic          the directory in which temporary files will be generated
Examples
Basic Examples (5)
First, load the OpenCLLink application:
Define the OpenCL source code to load:
Call the function with the arguments:
Plot the result using ArrayPlot:
Define the path to the OpenCL source file "SupportFiles/vectorAdd.cl":
Compile and load the OpenCL function from the file:
Locate the example OpenCLLink library "addTwo_Double":
Load the library using OpenCLFunctionLoad:
The function adds two to an input list:
The source code for this example is bundled with OpenCLLink:
An extra argument can be given when calling OpenCLFunction. The argument denotes the number of threads to launch (or the global work group size). Using the previous example:
This loads the OpenCL function from the file:
This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:
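A minimal sketch of this kind of call is given below. The kernel string is a stand-in written for this sketch (the bundled vectorAdd.cl may differ), and the names vecSrc, n, and the argument layout are assumptions.

```wolfram
Needs["OpenCLLink`"]

(* Stand-in vector add kernel (an assumption, not the bundled file) *)
vecSrc = "
__kernel void vectorAdd(__global mint * a, __global mint * b,
                        __global mint * c, mint n) {
    int i = get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}";

vectorAdd = OpenCLFunctionLoad[vecSrc, "vectorAdd",
   {{_Integer, _, "Input"}, {_Integer, _, "Input"},
    {_Integer, _, "Output"}, _Integer}, 16];

n = 64;

(* The trailing 32 is the extra argument: the number of threads to launch.
   Only the first 32 elements of the output are computed. *)
vectorAdd[Range[n], Range[n], ConstantArray[0, n], n, 32]
```

Omitting the trailing 32 lets OpenCLLink choose the thread count from the dimensions of the arguments.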
If code contains syntax errors, then a "compilation failed" error is returned:
The "ShellOutputFunction" option can be used to print the build log:
The build log above points to a typo in the code: a stray z follows the 0:
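A sketch of how such a failure might be reproduced is shown here. The kernel is an illustrative assumption that deliberately contains a stray z after the 0; the exact wording of the build log depends on the OpenCL implementation.

```wolfram
Needs["OpenCLLink`"]

(* Deliberately broken kernel (illustrative): note the 0z *)
badSrc = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0z);
    if (index < len)
        arry[index] += 2;
}";

(* Print is called with the compiler output, so the build log becomes visible *)
OpenCLFunctionLoad[badSrc, "addTwo", {{_Integer}, _Integer}, 16,
  "ShellOutputFunction" -> Print]
```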
Scope (2)
Templated Function (1)
Shared or Local Memory (1)
OpenCLFunctionLoad can be used to specify "Local" or "Shared" memory on launch. The following code uses shared memory to store global memory for gradient computation:
This specifies the input arguments, with the last argument being "Shared" for shared memory. The block size is set to 256:
This computes the flattened length of a grayscale image:
This invokes the function. The shared memory size is set to (blockSize+2)*sizeof(int) and the number of launch threads is set to the flattened length of the image:
A nicer way of specifying the shared memory size is using types:
Using shared memory types, you need not pass in the size of the type:
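The typed form can be sketched as follows. This is not the gradient kernel of the example; it is a smaller hypothetical kernel (localDiff, with made-up names src, data, len) that stages one element per work item in local memory and takes differences within a block.

```wolfram
Needs["OpenCLLink`"]

(* Hypothetical kernel: buf is local memory shared by one work group *)
src = "
__kernel void localDiff(__global int * in, __global int * out,
                        mint len, __local int * buf) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    buf[lid] = (gid < len) ? in[gid] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (gid < len) {
        int next = (lid + 1) % get_local_size(0);
        out[gid] = buf[next] - buf[lid];
    }
}";

blockSize = 256;

(* {"Local", "Integer32"} declares typed local memory *)
localDiff = OpenCLFunctionLoad[src, "localDiff",
   {{"Integer32", _, "Input"}, {"Integer32", _, "Output"},
    _Integer, {"Local", "Integer32"}}, blockSize];

len = 1024;
data = RandomInteger[100, len];

(* With a typed local argument the size is given in elements (blockSize);
   with plain "Local" it would be given in bytes (blockSize*4 for int) *)
localDiff[data, ConstantArray[0, len], len, blockSize, len]
```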
Applications (10)
Image Input (1)
The input can be images; here you write code that performs linear interpolation between images (this can be done using ImageCompose):
This loads OpenCLFunction from the source code above:
This sets the height, width, and channel values. It also allocates memory for the output:
This calls the function with {width,height} threads:
This gets the memory and displays it as an image:
You can take the above and make a function OpenCLImageLinearCombine:
The function now has similar syntax to ImageCompose:
A Manipulate can be used to play with the interpolation coefficients:
Effects can be created; this example shows a smooth animation:
Uniform Random Number Generation (1)
Uniform random number generators are a common starting point for many applications. This implements uniform random number generators in OpenCL:
This loads the source as an OpenCLFunction. This algorithm uses an image to provide an upper bound to the random number:
This calls OpenCLFunction; note that you can pass images directly into an OpenCLFunction so long as it can be interpreted using the appropriate specified type:
Notice that this is not a regular Lena image; it is a 4-channel image with alpha channel set to 1 (using SetAlphaChannel):
The random output can be used to detect important edges in an image:
Random Number Generation Using the Mersenne Twister (1)
The Mersenne Twister is another uniform random number generator algorithm (more sophisticated than the one mentioned above). The implementation is located here:
This loads OpenCLFunction; the type _Real is specified, which means that the real type used depends on the capabilities of the OpenCL device (whether or not it supports double precision):
This sets up the Mersenne Twister's input and output parameters (for more information, refer to the algorithm description):
This invokes OpenCLFunction:
Prefix Sum Algorithm (1)
The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The OpenCL implementation is found in:
This loads the three kernels used in computation:
This generates random input data:
This allocates the output buffer:
This computes the block and grid dimensions:
A temporary buffer is needed in computation:
This performs the scan operation:
This retrieves the output buffer:
This deallocates the OpenCLMemory elements:
Matrix Operations (1)
Matrix transpose is a fundamental algorithm in many applications. This specifies the inputs:
This loads OpenCLFunction:
This calls OpenCLFunction:
This shows the MatrixForm of the result:
Matrix Multiplication (1)
Matrix multiplication is implemented here:
This loads OpenCLFunction; note it is specified that the input must be rank 2:
This creates random input and allocates the output:
This calls OpenCLFunction:
This gets the output memory using OpenCLMemoryGet:
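The four steps above can be sketched with a naive kernel. The kernel below is an assumption written for illustration (one thread per output element, single precision, square n×n matrices); it is not the implementation referenced in this example.

```wolfram
Needs["OpenCLLink`"]

(* Naive matrix multiplication (illustrative, not the bundled kernel) *)
src = "
__kernel void matMul(__global float * a, __global float * b,
                     __global float * c, mint n) {
    int col = get_global_id(0);
    int row = get_global_id(1);
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[row*n + k] * b[k*n + col];
        c[row*n + col] = sum;
    }
}";

(* Rank 2 is required for all three matrices; blocks are 8 x 8 threads *)
matMul = OpenCLFunctionLoad[src, "matMul",
   {{"Float", 2, "Input"}, {"Float", 2, "Input"},
    {"Float", 2, "Output"}, _Integer}, {8, 8}];

n = 16;
a = RandomReal[1, {n, n}];
b = RandomReal[1, {n, n}];
c = OpenCLMemoryAllocate["Float", {n, n}];   (* output buffer *)

matMul[a, b, c, n, {n, n}];                  (* launch n x n threads *)

result = OpenCLMemoryGet[c];
Max[Abs[result - a . b]]                     (* small, limited by single precision *)

OpenCLMemoryUnload[c];
```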
Fast Fourier Transform (1)
The one-dimensional discrete fast Fourier transform can be implemented using OpenCLLink; this implementation assumes that the length of the input is a power of 2:
This loads OpenCLFunction using OpenCLFunctionLoad:
This creates input and output lists:
This retrieves the output memory and assembles a complex list, displaying only the first 50 elements:
The result agrees with Fourier:
Financial Derivative (1)
The Black–Scholes model is used to value financial derivatives; here it is implemented in OpenCL:
This loads OpenCLFunction:
This assigns the input parameters:
This invokes OpenCLFunction:
The result agrees with the output of FinancialDerivative:
For timing, the number of options to be valuated is increased:
On the C2050, it takes 1/100 of a second to valuate 2048 options:
On a Core i7 950, FinancialDerivative takes 1.13 seconds. This is a speedup of 280×. The speedup grows as the number of options increases:
Gaussian Filter (1)
Recursive Gaussian is used to approximate the Gaussian filter. The Gaussian matrix is separable:
It can be written as the outer product of two 1D Gaussians:
Locate the implementation of the recursive Gaussian:
Load two functions using OpenCLFunctionLoad:
Specify the value in the Gaussian:
Calculate the normal distribution:
The Wolfram Language can plot the distribution:
Calculate the recursive Gaussian parameters:
Allocate OpenCLMemory for the input, output, and temporary storage:
Perform the Gaussian horizontally, then transpose, then perform the Gaussian vertically, and finally transpose to get the full Gaussian:
Sorting (1)
Bitonic sort sorts a given set of integers. It is similar in principle to merge sort. The OpenCL implementation only works on lists whose length is a power of 2 and can be found here:
This sets the length of the input and loads it. The direction denotes whether to sort from highest to lowest or lowest to highest. In this case, you sort from lowest to highest:
This calls bitonic sort; as with merge sort, multiple passes are needed for a full sort:
Possible Issues (5)
The maximum work item sizes (block dimensions) are returned by OpenCLInformation:
On some systems, this can be limited to 1.
To use double-precision operations in the OpenCL code, the user must place the following pragmas in the code header:
#ifdef USING_DOUBLE_PRECISIONQ
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif /* USING_DOUBLE_PRECISIONQ */
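As a sketch of where this header goes, the following places it at the top of a kernel string that uses Real_t. The kernel scale and its arguments are hypothetical; whether Real_t resolves to double or float depends on the device.

```wolfram
Needs["OpenCLLink`"]

(* Hypothetical kernel: multiplies each element of x by a *)
src = "
#ifdef USING_DOUBLE_PRECISIONQ
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif /* USING_DOUBLE_PRECISIONQ */

__kernel void scale(__global Real_t * x, Real_t a, mint len) {
    int i = get_global_id(0);
    if (i < len)
        x[i] = a * x[i];
}";

(* _Real corresponds to Real_t in the kernel *)
scale = OpenCLFunctionLoad[src, "scale", {{_Real}, _Real, _Integer}, 16];

scale[N[Range[5]], 2., 5]
```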
Errors in the function call can place OpenCLLink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both OpenCLLink and the video driver unstable. In an extreme case, this may crash the display driver, but usually it just makes further evaluation of OpenCL code return invalid results.
Bugs in some OpenCL implementations may cause the kernel to crash if one of the "IncludeDirectories" contains a space.
Use of memory modifiers such as __constant is not supported by OpenCLLink. Memory passed into an OpenCLFunction must be __global.
Interactive Examples (5)
Mandelbrot Set (1)
Julia Set (1)
The Mandelbrot set is a restricted form of the Julia set; here is the code for the Julia set:
This defines the input memory and parameters:
This loads OpenCLFunction:
This computes the Julia set and plots it using ReliefPlot:
This computes the Julia set and displays it as a grayscale image:
Image Adjustment (1)
ImageAdjust rescales the image to input high and low values. Gamma correction is also considered. The following defines a simplified version of ImageAdjust in OpenCL:
This loads OpenCLFunction:
This defines a simple Wolfram Language wrapper function to make the OpenCL function have similar syntax to ImageAdjust:
This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:
This adjusts the image by rescaling the values using Manipulate:
This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:
Bouncing Ball (1)
N-Body Simulation (1)
The N-body simulation is a classic Newtonian problem. This implements it in OpenCL:
This loads OpenCLFunction:
The number of particles, time step, and epsilon distance are chosen:
This sets the input and output memories:
This calls the NBody function:
This shows the result as a Dynamic:
Neat Examples (1)
SymbolicC (1)
OpenCLLink can use SymbolicC's code generation capabilities. To use SymbolicC, the user needs to load it:
Here a function toSymbolicC is defined that takes a Wolfram Language statement and translates it to a SymbolicC expression (it cannot translate all Wolfram Language commands, but more can be added by the user):
Wolfram Language expressions can be transformed:
To translate to C, the user uses ToCCodeString:
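A minimal sketch of this step, independent of the toSymbolicC helper above: a SymbolicC expression is built by hand and converted to a string. The variable names y and x are made up for illustration.

```wolfram
Needs["SymbolicC`"]

(* Symbolic form of the C assignment y = x + 2 *)
expr = CAssign["y", COperator[Plus, {"x", 2}]];

(* Convert the symbolic expression to a string of C code *)
ToCCodeString[expr]
```

The resulting string can be spliced into a kernel body before the source is passed to OpenCLFunctionLoad.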
You can tie this with OpenCLLink's symbolic code generation capabilities to create an OpenCLMapSource function:
OpenCLMapSource can work with pure Wolfram Language functions:
You can also use the code to work with predefined Wolfram Language functions:
The above code can then be loaded using OpenCLFunctionLoad:
The function can be evaluated:
To make this general, you can implement an OpenCLMap function:
The function can be evaluated. Here, the addTwo function is implemented:
Here, the BitNot operator is used: