Wolfram Language & System Documentation Center

OpenCLFunctionLoad

OpenCLFunctionLoad["src",fun,argtypes,blockdim]

compiles the string src and makes fun available in the Wolfram Language as an OpenCLFunction.

OpenCLFunctionLoad[File[srcfile],fun,argtypes,blockdim]

compiles the source code file srcfile and then loads fun as an OpenCLFunction.

OpenCLFunctionLoad[File[libfile],fun,argtypes,blockdim]

loads fun as an OpenCLFunction from the previously compiled library libfile.

OpenCLLink`

Details and Options

The OpenCLLink application must be loaded using Needs["OpenCLLink`"].
If libfile is a dynamic library, then the dynamic library function fun is loaded.
Possible argument and return types, and their corresponding OpenCL types, include:

_Integer	mint	Wolfram Language integer
"Integer32"	int	32-bit integer
"Integer64"	long/long long	64-bit integer
_Real	Real_t	GPU real type
"Double"	double	machine double
"Float"	float	machine float
{base, rank, io}	OpenCLMemory	memory of specified base type, rank, and input/output option
"Local" | "Shared"	mint	local or shared memory parameter
{"Local" | "Shared", type}	mint	local or shared memory parameter of the specified type

In the specification {base, rank, io}, valid settings of io are "Input", "Output", and "InputOutput".
The argument specification {base} is equivalent to {base,_,"InputOutput"}, and {base,rank} is equivalent to {base,rank,"InputOutput"}.
The rank can be omitted by using {base,_,io} or {base,io}.
Possible base types are:

_Integer	_Real	_Complex
"Byte"	"Bit16"	"Integer32"
"Byte[2]"	"Bit16[2]"	"Integer32[2]"
"Byte[4]"	"Bit16[4]"	"Integer32[4]"
"Byte[8]"	"Bit16[8]"	"Integer32[8]"
"Byte[16]"	"Bit16[16]"	"Integer32[16]"
"UnsignedByte"	"UnsignedBit16"	"UnsignedInteger"
"UnsignedByte[2]"	"UnsignedBit16[2]"	"UnsignedInteger[2]"
"UnsignedByte[4]"	"UnsignedBit16[4]"	"UnsignedInteger[4]"
"UnsignedByte[8]"	"UnsignedBit16[8]"	"UnsignedInteger[8]"
"UnsignedByte[16]"	"UnsignedBit16[16]"	"UnsignedInteger[16]"
"Double"	"Float"	"Integer64"
"Double[2]"	"Float[2]"	"Integer64[2]"
"Double[4]"	"Float[4]"	"Integer64[4]"
"Double[8]"	"Float[8]"	"Integer64[8]"
"Double[16]"	"Float[16]"	"Integer64[16]"

OpenCLFunctionLoad can be called more than once with different arguments.
Functions loaded by OpenCLFunctionLoad run in the same process as the Wolfram Language kernel.
Functions loaded by OpenCLFunctionLoad are unloaded when the Wolfram Language kernel exits.
Block dimensions can be either a list or an integer denoting how many threads per block to launch.
The maximum size of block dimensions is returned by the "Maximum Work Group Size" property of OpenCLInformation.
On launch, if the number of threads is not specified (as an extra argument to OpenCLFunction), then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
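The rounding rule above can be sketched in the Wolfram Language itself. This only illustrates the arithmetic and is not OpenCLLink's internal code; the helper name paddedThreadCount is hypothetical.

```wolfram
(* Hypothetical helper: the smallest multiple of blockDim that is >= threads. *)
paddedThreadCount[threads_Integer?Positive, blockDim_Integer?Positive] :=
  Ceiling[threads, blockDim]

paddedThreadCount[1000, 256]  (* 1024 *)
paddedThreadCount[512, 256]   (* 512 *)
```

Because more threads than requested may be launched, kernels typically guard against out-of-range indices, for example with a check such as index < length before touching memory.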
The following options can be given:

"CompileOptions"	{}	compile options passed directly to the OpenCL compiler
"Defines"	Automatic	defines passed to the OpenCL preprocessor
"Device"	$OpenCLDevice	OpenCL device used in computation
"IncludeDirectories"	{}	directories to include in the compilation
"Platform"	$OpenCLPlatform	OpenCL platform used in computation
"ShellCommandFunction"	None	function to call with the shell commands used for compilation
"ShellOutputFunction"	None	function to call with the shell output of running the compilation commands
"TargetPrecision"	Automatic	precision used in computation
"WorkingDirectory"	Automatic	the directory in which temporary files will be generated

Examples

Basic Examples (5)

First, load the OpenCLLink application:

Define the OpenCL source code to load:

Load the OpenCL function:

Define the input parameters:

Call the function with the arguments:

Plot the result using ArrayPlot:
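The steps above might look roughly as follows. This is a minimal sketch rather than the original example: the kernel addTwo, the data, and the block dimension 16 are all assumptions chosen for illustration, and running it requires a working OpenCL device.

```wolfram
Needs["OpenCLLink`"]

(* Illustrative kernel (an assumption, not the original example's code):
   adds 2 to every element of an integer array. *)
src = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;   /* ignore padding threads */
    arry[index] += 2;
}";

(* Argument types: an integer array used as input and output, and an integer length.
   The block dimension 16 is an arbitrary choice for this sketch. *)
addTwo = OpenCLFunctionLoad[src, "addTwo",
   {{_Integer, _, "InputOutput"}, _Integer}, 16];

(* Input parameters. *)
data = Range[256];

(* The call returns a list with one entry per output argument. *)
res = addTwo[data, Length[data]];

(* Reshape the first (and only) output into a 16 x 16 array for plotting. *)
ArrayPlot[Partition[First[res], 16]]
```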

Define the path to the OpenCL source file "SupportFiles/vectorAdd.cl":

Compile and load the OpenCL function from the file:

This calls the function:

Locate the example OpenCLLink library "addTwo_Double":

Load the library using OpenCLFunctionLoad:

The function adds two to an input list:

The source code for this example is bundled with OpenCLLink:

An extra argument can be given when calling OpenCLFunction. The argument denotes the number of threads to launch (or the global work group size). Using the previous example:

This loads the OpenCL function from the file:

This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:
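The effect of the extra thread-count argument can be sketched with a small self-contained example. The kernel addTwo and the block dimension 16 are assumptions made for illustration; 16 is chosen so that 32 is already a multiple of the block dimension and is not rounded up.

```wolfram
Needs["OpenCLLink`"]

(* Illustrative kernel (an assumption): adds 2 to each element it is launched for. *)
src = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;
    arry[index] += 2;
}";

addTwo = OpenCLFunctionLoad[src, "addTwo",
   {{_Integer, _, "InputOutput"}, _Integer}, 16];

data = Range[64];

(* The last argument is the number of threads to launch.
   With 32 threads, only indices 0..31 are processed by the kernel. *)
res = addTwo[data, Length[data], 32];
First[res]
```

Under these assumptions, only the first 32 elements of the result should differ from the input.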

If the code contains syntax errors, then a "compilation failed" error is returned:

The "ShellOutputFunction" option can be used to print the build log:

The error above indicates a typo in the code, a stray z after the 0:
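A sketch of how such a failure might be provoked and inspected is given below. The kernel and its deliberate typo are hypothetical; only the "ShellOutputFunction" option itself comes from the options table above.

```wolfram
Needs["OpenCLLink`"]

(* Hypothetical kernel with a deliberate typo: a stray z after the 0. *)
badSrc = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0z);
    if (index >= len) return;
    arry[index] += 2;
}";

(* Print the compiler's build log while attempting to load the function. *)
OpenCLFunctionLoad[badSrc, "addTwo",
  {{_Integer, _, "InputOutput"}, _Integer}, 16,
  "ShellOutputFunction" -> Print]
```

The exact wording of the build log depends on the OpenCL implementation in use.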

Scope (2)

Templated Function (1)

Templated functions can be simulated using macros. Leave the type name as an undefined macro in the source:

Set the macro to a concrete type during compilation:

Set the macro to a different type instead:
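The macro technique might be sketched as follows. The macro name elemType and the kernel are hypothetical, and the rule-list form of "Defines" is assumed here rather than confirmed by the text above.

```wolfram
Needs["OpenCLLink`"]

(* elemType is deliberately left undefined in the kernel source. *)
src = "
__kernel void addTwo(__global elemType * arry, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;
    arry[index] += 2;
}";

(* Integer variant: elemType expands to int, matching the Integer32 base type. *)
addTwoInt = OpenCLFunctionLoad[src, "addTwo",
   {{"Integer32", _, "InputOutput"}, _Integer}, 16,
   "Defines" -> {"elemType" -> "int"}];

(* Float variant: the same source compiled with elemType set to float. *)
addTwoFloat = OpenCLFunctionLoad[src, "addTwo",
   {{"Float", _, "InputOutput"}, _Integer}, 16,
   "Defines" -> {"elemType" -> "float"}];

addTwoInt[Range[10], 10]
addTwoFloat[N[Range[10]], 10]
```

Keeping the macro's C type and the declared base type in step (int with "Integer32", float with "Float") matters, since the two are specified independently.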

Shared or Local Memory (1)

OpenCLFunctionLoad can be used to specify "Local" or "Shared" memory on launch. The following code uses shared memory to store global memory for gradient computation:

This specifies the input arguments, with the last argument being "Shared" for shared memory. The block size is set to 256:

This computes the flattened length of a grayscale image:

This invokes the function. The shared memory size is set to (blockSize+2)*sizeof(int) and the number of launch threads is set to the flattened length of the image:

A nicer way of specifying the shared memory size is using types:

Using shared memory types, you need not pass in the size of the type:
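The two ways of specifying shared memory can be sketched on a 1D list instead of an image. Everything in this example is an assumption made for illustration: the kernel gradient, the zero padding at the boundaries, the random data, and a 4-byte int. It is not the original example's code.

```wolfram
Needs["OpenCLLink`"]

(* Hypothetical kernel: stages a block plus two halo cells in local memory,
   then writes the difference of the right and left neighbors. *)
src = "
__kernel void gradient(__global int * input, __global int * output,
                       mint len, __local int * buf) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    buf[lid + 1] = (gid < len) ? input[gid] : 0;
    if (lid == 0)
        buf[0] = (gid > 0 && gid - 1 < len) ? input[gid - 1] : 0;
    if (lid == lsz - 1)
        buf[lsz + 1] = (gid + 1 < len) ? input[gid + 1] : 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < len)
        output[gid] = buf[lid + 2] - buf[lid];
}";

blockSize = 256;
data = RandomInteger[255, 1024];
len = Length[data];
out = OpenCLMemoryAllocate["Integer32", len];

(* Untyped form: the shared memory size is passed in bytes (int assumed to be 4 bytes). *)
gradient = OpenCLFunctionLoad[src, "gradient",
   {{"Integer32", _, "Input"}, {"Integer32", _, "Output"}, _Integer, "Shared"},
   blockSize];
res = gradient[data, out, len, (blockSize + 2)*4, len];

(* Typed form: only the element count is passed. *)
gradientTyped = OpenCLFunctionLoad[src, "gradient",
   {{"Integer32", _, "Input"}, {"Integer32", _, "Output"}, _Integer,
    {"Shared", "Integer32"}}, blockSize];
resTyped = gradientTyped[data, out, len, blockSize + 2, len];

OpenCLMemoryGet[First[resTyped]]
```

The block size of 256 must not exceed the device's maximum work group size; a smaller value can be substituted without changing the kernel.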

Applications (10)

Image Input (1)

The input can be images; here you write code that performs linear interpolation between images (this can be done using ImageCompose):

This loads OpenCLFunction from the source code above:

This sets the height, width, and channel values. It also allocates memory for the output:

This calls the function with {width,height} threads:

This gets the memory and displays it as an image:

You can take the above and make a function OpenCLImageLinearCombine:

The function now has similar syntax to ImageCompose:

A Manipulate can be used to play with the interpolation coefficients:

Effects can be created; this example shows a smooth animation:

Uniform Random Number Generation (1)

Uniform random number generation is a common need in many applications. This implements a uniform random number generator in OpenCL:

This loads the source as an OpenCLFunction. This algorithm uses an image to provide an upper bound to the random number:

This calls OpenCLFunction; note that you can pass images directly into an OpenCLFunction so long as it can be interpreted using the appropriate specified type:

Notice that this is not a regular duck image; it is a 4-channel image with alpha channel set to 1 (using SetAlphaChannel):

The random output can be used to detect important edges in an image:

Random Number Generation Using the Mersenne Twister (1)

The Mersenne Twister is another uniform random number generator algorithm (more sophisticated than the one mentioned above). The implementation is located here:

This loads OpenCLFunction; you specify the type _Real, which means that the real type depends on the device capabilities (whether or not the device supports double precision):

This sets up the Mersenne Twister's input and output parameters (for more information, refer to the algorithm description):

This invokes OpenCLFunction:

This plots the output:

If the output is timed:

There is almost an 11× increase in speed:

Prefix Sum Algorithm (1)

The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The OpenCL implementation is found in:

This loads the three kernels used in computation:

This generates random input data:

This allocates the output buffer:

This computes the block and grid dimensions:

A temporary buffer is needed in computation:

This performs the scan operation:

This retrieves the output buffer:

This deallocates the OpenCLMemory elements:
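The allocate, compute, retrieve, and deallocate pattern used above can be sketched with a much simpler kernel than the scan. The kernel addTwo and the data are assumptions made for illustration; this is not the prefix sum implementation.

```wolfram
Needs["OpenCLLink`"]

(* Illustrative kernel (an assumption): adds 2 to every element. *)
src = "
__kernel void addTwo(__global mint * arry, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;
    arry[index] += 2;
}";

addTwo = OpenCLFunctionLoad[src, "addTwo",
   {{_Integer, _, "InputOutput"}, _Integer}, 16];

(* Register the data with OpenCLLink once and keep a handle to it. *)
mem = OpenCLMemoryLoad[Range[64]];

(* Run the kernel twice on the same buffer without copying it back in between. *)
addTwo[mem, 64];
addTwo[mem, 64];

(* Copy the result back to the Wolfram Language, then release the buffer. *)
result = OpenCLMemoryGet[mem];
OpenCLMemoryUnload[mem];
result
```

Working with memory handles instead of plain lists avoids repeated transfers when several kernels operate on the same data, which is why the scan example manages its buffers explicitly.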

Matrix Operations (1)

Matrix transpose is a fundamental algorithm in many applications. This specifies the inputs:

This loads OpenCLFunction:

This calls OpenCLFunction:

This shows the MatrixForm of the result:

The result agrees with the Wolfram Language:
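A naive transpose might be sketched as follows. The kernel, the matrix dimensions, and the {16, 16} block dimensions are assumptions made for illustration, as is the row-major layout of the flattened matrix; the original example may use a different, optimized kernel.

```wolfram
Needs["OpenCLLink`"]

(* Hypothetical naive kernel: one work item per matrix element. *)
src = "
__kernel void transpose(__global float * input, __global float * output,
                        mint width, mint height) {
    int x = get_global_id(0);   /* column in the input */
    int y = get_global_id(1);   /* row in the input */
    if (x < width && y < height)
        output[x*height + y] = input[y*width + x];
}";

(* Both memory arguments are declared as rank 2; block dimensions are given as a list. *)
transpose = OpenCLFunctionLoad[src, "transpose",
   {{"Float", 2, "Input"}, {"Float", 2, "Output"}, _Integer, _Integer},
   {16, 16}];

{height, width} = {48, 64};
mat = RandomReal[1, {height, width}];
out = OpenCLMemoryAllocate["Float", {width, height}];

(* Launch one thread per element, given as {width, height}. *)
res = transpose[mat, out, width, height, {width, height}];

(* Compare with the built-in Transpose; single precision leaves a small difference. *)
Max[Abs[OpenCLMemoryGet[First[res]] - Transpose[mat]]]
```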

Matrix Multiplication (1)

Matrix multiplication is implemented here:

This defines the block size:

This loads OpenCLFunction; note it is specified that the input must be rank 2:

This creates random input and allocates the output:

This calls OpenCLFunction:

This gets the output memory using OpenCLMemoryGet:

The output agrees with the Wolfram Language:

Fast Fourier Transform (1)

The one-dimensional discrete fast Fourier transform can be implemented using OpenCLLink; this implementation assumes that the input is a power of 2:

This loads OpenCLFunction using OpenCLFunctionLoad:

This creates input and output lists:

This retrieves the output memory and creates a complex list, displaying only the first 50 elements:

The result agrees with Fourier:

Financial Derivative (1)

Black–Scholes models financial derivative investments and is implemented in OpenCL:

This loads OpenCLFunction:

This assigns the input parameters:

This invokes OpenCLFunction:

This gets the call values:

The result agrees with the output of FinancialDerivative:

For timing, the number of options to be valuated is increased:

On the C2050, it takes 1/100 of a second to valuate 2048 options:

On a Core i7 950, FinancialDerivative takes 1.13 seconds. This is a speedup of 280×. Note that increasing the number of options yields larger speedups:

Gaussian Filter (1)

Recursive Gaussian is used to approximate the Gaussian filter. The Gaussian matrix is separable:

It can be written as the outer product of two 1D Gaussians:

Locate the implementation of the recursive Gaussian:

Load two functions using OpenCLFunctionLoad:

Specify the value of the Gaussian parameter:

Calculate the normal distribution:

The Wolfram Language can plot the distribution:

Calculate the recursive Gaussian parameters:

Allocate OpenCLMemory for the input, output, and temporary storage:

Perform the Gaussian horizontally, then transpose, then perform the Gaussian vertically, and finally transpose to get the full Gaussian:

Reconstruct the image from the data:

Again you compare timing:

And notice a 4× performance boost:

Sorting (1)

Bitonic sort sorts a given set of integers. It is similar in principle to merge sort. The OpenCL implementation only works on lists of length of a power of 2 and can be found here:

This sets the length of the input and loads it. The direction denotes whether to sort from highest to lowest or lowest to highest. In this case, you sort from lowest to highest:

This gets the input list:

This calls bitonic sort; as with merge sort, multiple passes are needed for a full sort:

The output list is retrieved sorted:

Possible Issues (5)

The maximum work item sizes (block dimensions) are returned by OpenCLInformation:

On some systems, this can be limited to 1.
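The limit can be queried before choosing a block dimension. In this sketch, the three-argument form of OpenCLInformation and the choice of platform 1 and device 1 are assumptions; the property name is the one given in the details above.

```wolfram
Needs["OpenCLLink`"]

(* Platform 1 and device 1 are assumptions; adjust them for the system at hand. *)
maxGroupSize = OpenCLInformation[1, 1, "Maximum Work Group Size"]

(* Pick a block dimension that does not exceed the reported limit. *)
blockDim = Min[256, maxGroupSize]
```

A block dimension chosen this way can then be passed as the last argument of OpenCLFunctionLoad.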

To use double-precision operations in the OpenCL code, the user must place the following pragmas in the code header:

#ifdef USING_DOUBLE_PRECISIONQ
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif /* USING_DOUBLE_PRECISIONQ */
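A kernel source that carries these pragmas might look as follows. The kernel scale and its arguments are hypothetical; Real_t is the kernel-side type that the table above associates with _Real.

```wolfram
Needs["OpenCLLink`"]

(* The pragmas sit at the top of the source string, before any kernel. *)
src = "
#ifdef USING_DOUBLE_PRECISIONQ
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif /* USING_DOUBLE_PRECISIONQ */

__kernel void scale(__global Real_t * x, Real_t factor, mint len) {
    int index = get_global_id(0);
    if (index >= len) return;
    x[index] *= factor;
}";

(* _Real maps to Real_t, so the same source serves single and double precision. *)
scale = OpenCLFunctionLoad[src, "scale",
   {{_Real, _, "InputOutput"}, _Real, _Integer}, 16];

scale[N[Range[8]], 0.5, 8]
```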

Errors in the function call can place OpenCLLink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both OpenCLLink and the video driver unstable. In an extreme case, this may crash the display driver, but usually it just makes further evaluation of OpenCL code return invalid results.

Bugs in some OpenCL implementations may cause the kernel to crash if one of the "IncludeDirectories" contains a space.

Some memory modifiers are not supported by OpenCLLink. Memory passed into an OpenCLFunction must be in global memory.

Interactive Examples (5)

Mandelbrot Set (1)

The Mandelbrot set consists of all complex numbers c for which the recurrence z(n+1) = z(n)^2 + c, starting from z(0) = 0, remains bounded. The following implements the set in OpenCL (a slightly more complicated coloring strategy is used to ensure that colors transition smoothly):

Julia Set (1)

The Mandelbrot set is a restricted form of the Julia set; here is the code for the Julia set:

This defines the input memory and parameters:

This loads OpenCLFunction:

This computes the Julia set and plots it using ReliefPlot:

This computes the Julia set and displays it as a grayscale image:

Image Adjustment (1)

ImageAdjust rescales the image to input high and low values. Gamma correction is also considered. The following defines a simplified version of ImageAdjust in OpenCL:

This loads OpenCLFunction:

This defines a simple Wolfram Language wrapper function to make the OpenCL function have similar syntax to ImageAdjust:

This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:

This adjusts the image by rescaling the values using Manipulate:

This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:

Bouncing Ball (1)

In this example, you compute the position of each particle in a box with varying initial forces. You delegate the particle physics simulation to OpenCL, while all the rest is done in the Wolfram Language:

This defines the OpenCL code and loads the function into the Wolfram Language:

N-Body Simulation (1)

The N-body simulation is a classic Newtonian problem. This implements it in OpenCL:

This loads OpenCLFunction:

The number of particles, time step, and epsilon distance are chosen:

This sets the input and output memories:

This calls the NBody function:

This plots the body points:

This shows the result as a Dynamic:

Neat Examples (1)

SymbolicC (1)

OpenCLLink can use SymbolicC's code generation capabilities. To use SymbolicC, the user needs to load it:

Here, a function toSymbolicC is defined that takes a Wolfram Language statement and translates it to a SymbolicC expression (it cannot translate all Wolfram Language commands, but more can be added by the user):

Wolfram Language expressions can be transformed:

To translate to C, use ToCCodeString:

You can tie this with OpenCLLink's symbolic code generation capabilities to create an OpenCLMapSource function:

OpenCLMapSource can work with pure Wolfram Language functions:

You can also use the code to work with predefined Wolfram Language functions:

The above code can then be loaded using OpenCLFunctionLoad:

The function can be evaluated:

To make this general, you can implement an OpenCLMap function:

The function can be evaluated. Here, the addTwo function is implemented:

Here, the BitNot operator is used:
