This is documentation for Mathematica 8, which was based on an earlier version of the Wolfram Language.

OpenCLFunctionLoad


loads fun from source code prog, returning an OpenCLFunction object.

loads fun from source file progfile, returning an OpenCLFunction object.
  • The OpenCLLink application must be loaded using Needs.
  • If progfile is a dynamic library, then the dynamic library function fun is loaded.
  • Possible argument and return types, and their corresponding OpenCL type, include:
    _Integer                      mint            Mathematica integer
    "Integer32"                   int             32-bit integer
    _Real                         Real_t          GPU real type
    "Double"                      double          machine double
    "Float"                       float           machine float
    {base, rank, io}              OpenCLMemory    memory of specified base type, rank, and input/output option
    "Local" | "Shared"            mint            local or shared memory parameter
    {"Local" | "Shared", type}    mint            local or shared memory parameter
  • Valid io is , , and .
  • If is passed, then is used by default. If is passed, then is used.
  • The rank can be omitted by using or .
  • Possible base types are:
    _Integer              _Real                 _Complex
    "Byte"                "Bit16"               "Integer"
    "Byte[2]"             "Bit16[2]"            "Integer32[2]"
    "Byte[4]"             "Bit16[4]"            "Integer32[4]"
    "Byte[8]"             "Bit16[8]"            "Integer32[8]"
    "Byte[16]"            "Bit16[16]"           "Integer32[16]"
    "UnsignedByte"        "UnsignedBit16"       "Float"
    "UnsignedByte[2]"     "UnsignedBit16[2]"    "Float[2]"
    "UnsignedByte[4]"     "UnsignedBit16[4]"    "Float[4]"
    "UnsignedByte[8]"     "UnsignedBit16[8]"    "Float[8]"
    "UnsignedByte[16]"    "UnsignedBit16[16]"   "Float[16]"
    "Double"              "Double[2]"           "Double[4]"
    "Double[8]"           "Double[16]"
  • OpenCLFunctionLoad can be called more than once with different arguments.
  • Functions loaded by OpenCLFunctionLoad run in the same process as the Mathematica kernel.
  • Functions loaded by OpenCLFunctionLoad are unloaded when the Mathematica kernel exits.
  • Block dimensions can be either a list or an integer denoting how many threads per block to launch.
  • The maximum size of block dimensions is returned by the property of OpenCLInformation.
  • On launch, if the number of threads is not specified (as an extra argument to OpenCLFunction) then the dimension of the element with largest rank and dimension is chosen. For images, the rank is set to 2.
  • On launch, if the number of threads is not a multiple of the block dimension, then it is incremented to be a multiple of the block dimension.
  • The following options can be given:
    "CompileOptions"          {}                 compile options passed directly to the OpenCL compiler
    "Defines"                 Automatic          defines passed to the OpenCL preprocessor
    "Device"                  $OpenCLDevice      OpenCL device used in computation
    "IncludeDirectories"      {}                 directories to include in the compilation
    "Platform"                $OpenCLPlatform    OpenCL platform used in computation
    "ShellCommandFunction"    None               function to call with the shell commands used for compilation
    "ShellOutputFunction"     None               function to call with the shell output of running the compilation commands
    "TargetPrecision"         Automatic          precision used in computation
    "WorkingDirectory"        Automatic          the directory in which temporary files will be generated
First, load the OpenCLLink application:
This defines the OpenCL source code to load:
This loads the OpenCL function:
This defines the input parameters:
This calls the function with the arguments:
This plots the result using ArrayPlot:
This gets an OpenCL source file from the SupportFiles directory:
This loads the OpenCL function from the file:
This calls the function:
DLLs can be loaded into OpenCLLink for use as an OpenCLFunction:
This makes sure that the file exists, since the precompiled library extension is operating-system dependent:
This loads the library using :
The function adds two to an input list:
The source code for this example is bundled with OpenCLLink:
An extra argument can be given when calling OpenCLFunction. The argument denotes the number of threads to launch (or the global work group size). Using the previous example:
This loads the OpenCL function from the file:
This calls the function with 32 threads, which results in only the first 32 values in the vector add being computed:
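The effect of the thread-count argument, together with the rule that the count is rounded up to a multiple of the block dimension, can be illustrated with a small host-side emulation in plain C. This is only a sketch under stated assumptions: the names `round_up_to_block`, `vector_add_item`, and `launch_vector_add` are hypothetical, the kernel body is a guess at what a typical vector add looks like, and none of it is OpenCLLink's own code.

```c
#include <stddef.h>

/* Smallest multiple of block_dim that is >= threads (block_dim must be > 0). */
size_t round_up_to_block(size_t threads, size_t block_dim)
{
    size_t remainder = threads % block_dim;
    return remainder == 0 ? threads : threads + (block_dim - remainder);
}

/* Body of a hypothetical vector-add kernel for the work-item with index gid. */
static void vector_add_item(size_t gid, const int *a, const int *b,
                            int *out, size_t n)
{
    if (gid < n)   /* padded work-items past the end of the data do nothing */
        out[gid] = a[gid] + b[gid];
}

/* Host-side emulation of a launch: returns the number of work-items run. */
size_t launch_vector_add(size_t threads, size_t block_dim,
                         const int *a, const int *b, int *out, size_t n)
{
    size_t launched = round_up_to_block(threads, block_dim);
    for (size_t gid = 0; gid < launched; ++gid)
        vector_add_item(gid, a, b, out, n);
    return launched;
}
```

With 64 input elements, a request for 32 threads and a block dimension of 16 runs exactly 32 work-items, so only the first 32 outputs are written. A request for 30 threads would be padded to 32.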
If code contains syntax errors, then a "compilation failed" error is returned:
The option can be used to print the build log:
The above error states that there is a typo in the code, with a after the in the code:
Templated functions can be simulated using macros. This source code has as an undefined macro:
This sets the macro to :
This sets the macro to unsigned :
can be used to specify or memory on launch. The following code uses shared memory to store global memory for gradient computation:
This specifies the input arguments, with the last argument being for shared memory. The block size is set to 256:
This computes the flattened length of a grayscale image:
This invokes the function. The shared memory size is set to (blockSize+2)sizeof (int) and the number of launch threads is set to the flattened length of the image:
A nicer way of specifying the shared memory size is using types:
Using shared memory types, you need not pass in the size of the type:
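The size expression (blockSize+2)sizeof(int) reflects a block of values plus one extra "halo" cell on each side. The plain C sketch below emulates one work-group on the host to show why two extra cells are enough. It assumes a central-difference gradient with clamped borders, which may differ from the kernel used on this page, and the names `local_bytes_with_halo` and `gradient_block` are hypothetical.

```c
#include <stddef.h>

/* Bytes of local memory for one work-group: block_size elements plus a
   one-element halo on each side, i.e. (blockSize + 2) * sizeof(int). */
size_t local_bytes_with_halo(size_t block_size)
{
    return (block_size + 2) * sizeof(int);
}

/* Host emulation of one work-group of a 1D central-difference gradient.
   `tile` stands in for local memory and must hold block_size + 2 ints.
   Requires n > 0. */
void gradient_block(const int *in, int *out, size_t n,
                    size_t block_start, size_t block_size, int *tile)
{
    /* Stage the block and its two halo cells, clamping at the array ends.
       tile[t] holds in[block_start + t - 1]. */
    for (size_t t = 0; t < block_size + 2; ++t) {
        size_t g = block_start + t;
        size_t src = (g == 0) ? 0 : g - 1;
        if (src >= n)
            src = n - 1;
        tile[t] = in[src];
    }
    /* Each work-item now reads its two neighbours from the tile only. */
    for (size_t l = 0; l < block_size; ++l) {
        size_t gid = block_start + l;
        if (gid < n)
            out[gid] = tile[l + 2] - tile[l];
    }
}
```

For a block size of 256 this gives 258 * sizeof(int) bytes, matching the expression in the text.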
Reduction, or FoldList, computes the last value of the sequence given and a starting value x:
This loads the reduction code into Mathematica. For performance reasons, the block dimensions are specified. Since the input is a power of 2, you set the define to True:
This specifies the input and allocates memory for the output:
This invokes OpenCLFunction:
You can call again with to further reduce, although since the output is only of length 128, there is no performance benefit in doing this over simply calling Mathematica's Total function:
The result of adding a 65536 constant vector of 1 agrees with Mathematica's Total function:
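The two-stage structure described above (the device produces one partial result per block, and the host totals the short remainder) can be sketched in plain C. This is an illustration only: a real kernel would combine values with a tree reduction in local memory rather than a sequential loop, the figure of 512 elements per block is an assumption chosen so that 65536 inputs yield 128 partial sums, and the names `block_sums` and `total` are hypothetical.

```c
#include <stddef.h>

/* Stage 1: each "work-group" sums elems_per_block consecutive inputs.
   Returns the number of partial sums written to `partial`. */
size_t block_sums(const int *in, size_t n, size_t elems_per_block,
                  long *partial)
{
    size_t blocks = (n + elems_per_block - 1) / elems_per_block;
    for (size_t b = 0; b < blocks; ++b) {
        size_t end = (b + 1) * elems_per_block;
        long acc = 0;
        if (end > n)
            end = n;
        for (size_t i = b * elems_per_block; i < end; ++i)
            acc += in[i];
        partial[b] = acc;
    }
    return blocks;
}

/* Stage 2: the host adds up the few remaining partial sums. */
long total(const long *values, size_t count)
{
    long acc = 0;
    for (size_t i = 0; i < count; ++i)
        acc += values[i];
    return acc;
}
```

Replacing the `acc += in[i]` step with a different associative operation gives other reductions over the same skeleton.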
The source code can be modified to implement operations other than sum. In this case, you count the number of occurrences of the number 5 in the list:
OpenCLFunction is loaded the same as before, except that you use a block size of 512:
Use random integers as input:
This invokes OpenCLFunction:
This invokes the function. Again, you use Mathematica to total the 1024 remaining numbers:
The result agrees with Count:
This times the launch:
Note the 7× performance increase:
If the input is already on the GPU, then a performance increase of 52× is observed:
The input can be images; here you write code that performs linear interpolation between images (this can be done using ImageCompose):
This loads OpenCLFunction from the source code above:
This sets the , , and values. It also allocates memory for the :
This calls the function with threads:
This gets the memory and displays it as an image:
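Per-pixel linear interpolation is simple enough to write out in full. The plain C sketch below assumes both images are stored as flat float arrays of equal length and that the blend weight is called `alpha`; both the layout and the name `lerp_images` are assumptions for illustration, not the kernel used on this page.

```c
#include <stddef.h>

/* out[i] = (1 - alpha) * a[i] + alpha * b[i] for every channel value.
   alpha = 0 reproduces image a, alpha = 1 reproduces image b. */
void lerp_images(const float *a, const float *b, float *out,
                 size_t count, float alpha)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = (1.0f - alpha) * a[i] + alpha * b[i];
}
```

In an OpenCL kernel the loop disappears: each work-item computes the single element selected by its global index.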
You can take the above and make a function :
The function now has similar syntax to ImageCompose:
A Manipulate can be used to play with the interpolation coefficients:
Effects can be made; in this example, a smooth animation is viewed:
Uniform random number generators are a common building block in many applications. This implements uniform random number generators in OpenCL:
This loads the source as an OpenCLFunction. This algorithm uses an image to provide an upper bound to the random number:
This calls OpenCLFunction; note that you can pass images directly into an OpenCLFunction so long as it can be interpreted using the appropriate specified type:
Notice that this is not a regular Lena image; it is a 4-channel image with alpha channel set to 1 (using SetAlphaChannel):
The random output can be used to detect important edges in an image:
The Mersenne Twister is another uniform random number generator algorithm (more sophisticated than the one mentioned above). The implementation is located here:
This loads OpenCLFunction; you specify the type , which means that the type is dependent on the device capabilities (whether it supports double precision or not):
This sets up the Mersenne Twister's input and output parameters (for more information, refer to the algorithm description):
This invokes OpenCLFunction:
This plots the output's results:
If the output is timed:
There is almost an 11× increase in speed:
The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The OpenCL implementation is found in:
This loads the three kernels used in computation:
This generates random input data:
This allocates the output buffer:
This computes the block and grid dimensions:
A temporary buffer is needed in computation:
This performs the scan operation:
This retrieves the output buffer:
This deallocates the OpenCLMemory elements:
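A blocked scan is commonly organized in three phases, which is one plausible reading of the three kernels and the temporary buffer mentioned above: scan each block, scan the per-block totals, then add each block's offset back. The plain C sketch below emulates that structure on the host for an exclusive prefix sum. Whether the bundled implementation is exclusive or inclusive is not stated here, so treat that choice, along with the name `blocked_exclusive_scan`, as an assumption.

```c
#include <stddef.h>

/* Three-phase exclusive prefix sum.
   block_sums is the temporary buffer: one int per block,
   i.e. ceil(n / block) entries. block must be > 0. */
void blocked_exclusive_scan(const int *in, int *out, size_t n,
                            size_t block, int *block_sums)
{
    size_t blocks = (n + block - 1) / block;

    /* Phase 1: scan each block independently and record its total. */
    for (size_t b = 0; b < blocks; ++b) {
        int acc = 0;
        for (size_t i = b * block; i < n && i < (b + 1) * block; ++i) {
            out[i] = acc;
            acc += in[i];
        }
        block_sums[b] = acc;
    }

    /* Phase 2: exclusive scan of the per-block totals. */
    int offset = 0;
    for (size_t b = 0; b < blocks; ++b) {
        int block_total = block_sums[b];
        block_sums[b] = offset;
        offset += block_total;
    }

    /* Phase 3: add each block's offset to all of its elements. */
    for (size_t i = 0; i < n; ++i)
        out[i] += block_sums[i / block];
}
```

On a device, phases 1 and 3 run with one work-group per block, which is why the block and grid dimensions have to be computed before the launch.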
The Sobel filter (only horizontal will be discussed in this example) is a convolution with the kernel ; this is called the kernel. Using local memory in this case will increase the performance; see the Sobel filter in the Canny edge-detection example for a faster implementation. A trivial implementation in OpenCL is located here:
This loads OpenCLFunction:
The input image is grabbed from ExampleData:
This specifies the , , and values of the image:
This calls OpenCLFunction:
One property of the Sobel filter is that it is separable. This means that the kernel above can be represented by first convolving with and then ; these are called the and , respectively. The following loads the functions from the same file:
If a convolution kernel is separable, then composing the separable filters is more efficient than performing the full convolution, since it reduces the number of memory copies needed.
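Separability can be checked directly: a 3x3 kernel is separable when it equals the outer product of a column vector and a row vector. The plain C sketch below builds that outer product. Under one common sign convention, the horizontal Sobel kernel is the outer product of the smoothing column (1, 2, 1) and the difference row (-1, 0, 1); the exact vectors used on this page did not survive in this copy, so these values are an assumption, as is the name `outer_product3`.

```c
/* kernel[r][c] = column[r] * row[c]: the 3x3 kernel that results from
   applying a 1x3 row filter followed by a 3x1 column filter. */
void outer_product3(const int column[3], const int row[3], int kernel[3][3])
{
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            kernel[r][c] = column[r] * row[c];
}
```

The full 3x3 convolution reads 9 neighbours per output pixel, while the two 1D passes read 3 + 3 = 6, and the gap widens as the kernel grows.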
Matrix transpose is a fundamental algorithm in many applications. This specifies the inputs:
This loads OpenCLFunction:
This calls OpenCLFunction:
This shows the MatrixForm of the result:
The result agrees with Mathematica:
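For reference, the index arithmetic behind a transpose is shown below in plain C, assuming row-major storage in a flat array. This is the naive form only; the name `transpose` and the storage layout are assumptions, and a tuned OpenCL kernel would typically stage tiles in local memory to keep global reads and writes contiguous.

```c
#include <stddef.h>

/* Transposes a rows x cols row-major matrix into a cols x rows one.
   Element (r, c) of the input becomes element (c, r) of the output. */
void transpose(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
}
```

In a kernel, the two loops are replaced by a two-dimensional global index, one work-item per element.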
Matrix multiplication is implemented here:
This defines the block size:
This loads OpenCLFunction; note it is specified that the input must be rank 2:
This creates random input and allocates the output:
This calls OpenCLFunction:
This gets the output memory using OpenCLMemoryGet:
The output agrees with Mathematica:
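The role of the block size in matrix multiplication can be seen in a tiled formulation, where the product is accumulated one tile of the shared dimension at a time. The plain C sketch below is a host-side illustration for square row-major matrices. It assumes the block size plays the role of `tile`, and the name `matmul_tiled` is hypothetical; it is not the kernel loaded above.

```c
#include <stddef.h>

/* c = a * b for n x n row-major matrices, accumulated tile by tile.
   tile must be > 0; n need not be a multiple of tile. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t n, size_t tile)
{
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile)
            for (size_t kk = 0; kk < n; kk += tile)
                /* One tile of c receives the contribution of one
                   tile of a times one tile of b. */
                for (size_t i = ii; i < n && i < ii + tile; ++i)
                    for (size_t j = jj; j < n && j < jj + tile; ++j) {
                        float acc = c[i * n + j];
                        for (size_t k = kk; k < n && k < kk + tile; ++k)
                            acc += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = acc;
                    }
}
```

On a device, each (ii, jj) tile would map to one work-group, with the tiles of a and b held in local memory while they are reused.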
The one-dimensional discrete fast Fourier transform can be implemented using OpenCLLink; this implementation assumes that the input is a power of 2:
This loads OpenCLFunction using :
This creates input and output lists:
This gets the output memory and creates a complex list, displaying only the first 50 elements:
The result agrees with Fourier:
Black-Scholes models financial derivative investments and is implemented in OpenCL:
This loads OpenCLFunction:
This assigns the input parameters:
This invokes OpenCLFunction:
This gets the call values:
The result agrees with the output of FinancialDerivative:
For timing, the number of options to be valuated is increased:
On the C2050, it takes 1/100 of a second to valuate 2048 options:
On a Core i7 950, FinancialDerivative takes 1.13 seconds. This is a speedup of 280×. Note that increasing the number of options yields larger speedups:
Recursive Gaussian is used to approximate the Gaussian filter. The algorithm relies on the fact that the Gaussian matrix is separable:
It can be written as the outer product of a 1D Gaussian:
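The separability claim follows from the exponent splitting into an x part and a y part. Written out for the standard isotropic 2D Gaussian with standard deviation \sigma (the usual textbook form, assumed here because the page's own formula did not survive in this copy):

```latex
G(x, y)
  = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)
  = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right)}_{g(x)}
    \cdot
    \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{y^{2}}{2\sigma^{2}}\right)}_{g(y)}
```

Sampled on a grid, this says the Gaussian matrix has entries G_{ij} = g_i g_j, which is exactly the outer product of the sampled 1D Gaussian with itself.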
The recursive Gaussian is implemented here:
This loads OpenCLFunction using :
This specifies the value. Recall that the Gaussian is :
With set to 5.0, the normal distribution is calculated:
Mathematica can plot the distribution:
Here you calculate the recursive Gaussian parameters:
The input, temporary, and output memory is loaded as OpenCLMemory:
Here you call the functions. First you perform the Gaussian horizontally, then transpose, then perform the Gaussian vertically, then transpose to get the full Gaussian:
This gets the output image:
Again you compare timing:
And notice a 10× performance boost:
Bitonic sort sorts a given set of integers. It is similar in principle to merge sort. The OpenCL implementation only works on lists whose length is a power of 2 and can be found here:
This sets the length of the input and loads it. The direction denotes whether to sort from highest to lowest or lowest to highest. In this case, you sort from lowest to highest:
This gets the input list:
This calls bitonic sort; as with merge sort, multiple passes are needed for a full sort:
The output list is retrieved sorted:
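The reason multiple calls are needed is that a bitonic sort is a fixed sequence of compare-exchange passes, each of which is naturally one kernel launch. The plain C sketch below runs the textbook bitonic network on the host for a power-of-2 length. The names `bitonic_pass` and `bitonic_sort` and the `ascending` flag are hypothetical stand-ins for the kernel and direction parameter used on this page.

```c
#include <stddef.h>

/* One compare-exchange pass of the bitonic network.
   k is the size of the bitonic sequences being merged, j the partner
   distance. Every index i is independent, so on a device each i would
   be one work-item. */
static void bitonic_pass(int *v, size_t n, size_t k, size_t j, int ascending)
{
    for (size_t i = 0; i < n; ++i) {
        size_t partner = i ^ j;
        if (partner > i) {
            /* Alternate the direction between neighbouring k-blocks. */
            int up = ((i & k) == 0) ? ascending : !ascending;
            if ((up && v[i] > v[partner]) || (!up && v[i] < v[partner])) {
                int tmp = v[i];
                v[i] = v[partner];
                v[partner] = tmp;
            }
        }
    }
}

/* Full sort: n must be a power of 2. Runs log2(n) * (log2(n) + 1) / 2
   passes in total. */
void bitonic_sort(int *v, size_t n, int ascending)
{
    for (size_t k = 2; k <= n; k <<= 1)
        for (size_t j = k >> 1; j > 0; j >>= 1)
            bitonic_pass(v, n, k, j, ascending);
}
```

For 8 elements this is 6 passes; the power-of-2 restriction comes from the `i ^ j` partner computation.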
The maximum work item sizes (block dimensions) are returned by OpenCLInformation:
On some systems, this can be limited to 1.
To use double-precision operations in the OpenCL code, the user must place the following pragmas in the code header:
Errors in the function call can place OpenCLLink in an unusable state. This is a side effect of allowing users to write arbitrary kernels. Infinite loops, buffer overflows, etc. in the kernel code can make both OpenCLLink and the video driver unstable. In an extreme case, this may crash the display driver, but usually it just makes further evaluation of OpenCL code return invalid results.
Bugs in some OpenCL implementations may cause the kernel to crash if one of the contains a space.
Use of memory modifiers such as is not supported by OpenCLLink. Memory passed into an OpenCLFunction must be .
The Mandelbrot set plots all points satisfying the recurrence equation with a complex number. The following implements the set in OpenCL (a slightly more complicated coloring strategy is used to ensure smooth color transitions):
The Mandelbrot set is a restricted form of the Julia set; here is the code for the Julia set:
This defines the input memory and parameters:
This loads OpenCLFunction:
This computes the Julia set and plots it using ReliefPlot:
This computes the Julia set and displays it as a grayscale image:
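Both sets are built on the same escape-time iteration of the standard quadratic map z → z² + c; only the roles of the pixel coordinate and the constant differ. The plain C sketch below shows that shared core. It uses the conventional escape radius of 2 and a plain iteration count rather than the smooth coloring mentioned above, and the name `escape_iterations` is hypothetical.

```c
/* Number of iterations of z -> z*z + c, starting from z = (zx, zy),
   before |z| exceeds 2; capped at max_iter.
   Mandelbrot: start at z = 0 and take c from the pixel.
   Julia:      start at z from the pixel and keep c fixed. */
int escape_iterations(double zx, double zy, double cx, double cy,
                      int max_iter)
{
    int n = 0;
    while (n < max_iter && zx * zx + zy * zy <= 4.0) {
        double next_x = zx * zx - zy * zy + cx;   /* real part of z*z + c */
        zy = 2.0 * zx * zy + cy;                  /* imaginary part */
        zx = next_x;
        ++n;
    }
    return n;
}
```

Points that reach the cap are treated as belonging to the set; in a kernel, each work-item evaluates this function for one pixel.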
ImageAdjust rescales the image to input high and low values. Gamma correction is also considered. The following defines a simplified version of ImageAdjust in OpenCL:
This loads OpenCLFunction:
This defines a simple Mathematica wrapper function to make the OpenCL function have similar syntax to ImageAdjust:
This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:
This adjusts the image by rescaling the values using Manipulate:
This adjusts the image by rescaling the values between 0.3 and 0.8 to 0.0 and 1.0:
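The rescaling step can be written as a one-line map followed by a clamp. The plain C sketch below covers only the linear part, mapping [lo, hi] onto [0, 1]; gamma correction is deliberately left out, and the name `rescale_clamped` is an assumption rather than anything defined on this page.

```c
/* Maps x linearly so that lo -> 0 and hi -> 1, clamping values outside
   [lo, hi]. Requires hi > lo. */
float rescale_clamped(float x, float lo, float hi)
{
    float t = (x - lo) / (hi - lo);
    if (t < 0.0f)
        return 0.0f;
    if (t > 1.0f)
        return 1.0f;
    return t;
}
```

With lo = 0.3 and hi = 0.8, every pixel value at or below 0.3 becomes 0, every value at or above 0.8 becomes 1, and the range in between is stretched linearly.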
In this example you compute the position of each particle in a box with varying initial forces. You delegate the particle physics simulation to OpenCL, while all the rest is done in Mathematica:
This defines the OpenCL code and loads the function into Mathematica:
The N-body simulation is a classic Newtonian problem. This implements it in OpenCL:
This loads OpenCLFunction:
The number of particles, time step, and epsilon distance are chosen:
This sets the input and output memories:
This calls the function:
This plots the body points:
This shows the result as a Dynamic:
OpenCLLink can use 's code generation capabilities. To use , the user needs to load it:
OpenCLLink can use 's code generation capabilities; here a method is defined that takes a Mathematica statement and translates it to a expression (it cannot translate all Mathematica commands, but they can be added by the user):
Mathematica expressions can be transformed:
To translate to C, the user uses ToCCodeString:
You can tie this with OpenCLLink's symbolic code generation capabilities to create an function:
can work with pure Mathematica functions:
You can also use the code to work with predefined Mathematica functions:
The above code can then be loaded using :
The function can be evaluated:
To make this general, you can implement an function:
The function can be evaluated. Here, the function is implemented:
Here, the BitNot operator is used: