Reduction, or Fold, computes the last value of the sequence given by FoldList and a starting value x:
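Viewed in plain Python (an illustrative sketch, not the OpenCL code from this example), the fold result is simply the last element of the running FoldList-style sequence:

```python
from functools import reduce
from itertools import accumulate

# Illustrative data; the names `data` and `start` are not from the example.
data = [1, 2, 3, 4]
start = 10

# Reduction (fold) carries an accumulator through the list...
folded = reduce(lambda acc, v: acc + v, data, start)

# ...while a FoldList-style scan keeps every intermediate value.
running = list(accumulate(data, lambda acc, v: acc + v, initial=start))

print(running)                  # [10, 11, 13, 16, 20]
print(folded == running[-1])    # True
```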

This loads the reduction code into *Mathematica*. For performance reasons, the block dimensions are specified. Since the input is a power of 2, you set the define to True:

This specifies the input and allocates memory for the output:

This invokes OpenCLFunction:

You can call the function again to reduce further, although since the output is of length 128, there is no performance benefit in doing this over just calling *Mathematica*'s Total function:

The result of adding a constant vector of 65536 ones agrees with *Mathematica*'s Total function:

The source code can be modified to implement operations other than sum. In this case, you count the number of occurrences of the number 5 in the list:
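The change amounts to swapping the reduction operator for an indicator sum. A plain-Python sketch of the idea (illustrative, not the kernel source):

```python
from functools import reduce

# Count occurrences of a target value by reducing with an indicator:
# each element contributes 1 if it equals the target, else 0.
data = [5, 3, 5, 7, 5, 1]   # illustrative input
target = 5

count = reduce(lambda acc, v: acc + (1 if v == target else 0), data, 0)
print(count)   # 3
```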

The OpenCLFunction is loaded the same way as before, except that you use a block size of 512:

Use random integers as input:

This invokes OpenCLFunction:

This invokes the function. Again, you use *Mathematica* to total the 1024 remaining numbers:

The result agrees with Count:

This times the launch:

Note the 7× performance increase:

If the input is already on the GPU, then a performance increase of 52× is observed:

The input can be images; here you write code that performs linear interpolation between images (this can also be done using ImageCompose):
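The per-pixel operation is out = (1 − t)·a + t·b. A plain-Python sketch of it on nested-list "images" (illustrative, not the OpenCL kernel):

```python
# Linearly interpolate two images pixel by pixel; t = 0 gives image a,
# t = 1 gives image b, and intermediate t blends them.
def lerp_images(a, b, t):
    return [[(1 - t) * pa + t * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]

img_a = [[0.0, 0.0], [0.0, 0.0]]   # illustrative 2x2 "images"
img_b = [[1.0, 1.0], [1.0, 1.0]]
print(lerp_images(img_a, img_b, 0.25))   # [[0.25, 0.25], [0.25, 0.25]]
```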

This loads the OpenCLFunction from the source code above:

This sets the width, height, and channel values. It also allocates memory for the output:

This calls the function with the specified number of threads:

This gets the memory and displays it as an image:

You can take the above and make it into a function:

The function now has similar syntax to ImageCompose:

A Manipulate can be used to play with the interpolation coefficients:

Effects can be created; in this example, a smooth animation is shown:

Uniform random number generators are commonly needed in many applications. This implements a uniform random number generator in OpenCL:
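For illustration, here is a minimal xorshift32 generator in plain Python, a common GPU-friendly design (each thread can keep its own 32-bit state), though not necessarily the algorithm used in this example:

```python
# xorshift32: three shift-xor steps permute a nonzero 32-bit state.
def xorshift32(state):
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state & 0xFFFFFFFF

state = 2463534242          # illustrative nonzero seed
samples = []
for _ in range(5):
    state = xorshift32(state)
    samples.append(state / 2**32)   # scale to a uniform float in [0, 1)

print(samples)
```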

This loads the source as an OpenCLFunction. The algorithm uses an image to provide an upper bound for the random numbers:

This calls

OpenCLFunction; note that you can pass images directly into an

OpenCLFunction so long as it can be interpreted using the appropriate specified type:

Notice that this is not a regular Lena image; it is a 4-channel image with the alpha channel set to 1 (using SetAlphaChannel):

The random output can be used to detect important edges in an image:

The Mersenne Twister is another uniform random number generator algorithm (more sophisticated than the one mentioned above). The implementation is located here:
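As a point of reference, CPython's built-in random module is itself an MT19937 (Mersenne Twister) implementation, so it can stand in for the OpenCL version when checking behavior; a short sketch of its seeded determinism:

```python
import random

# random.Random is MT19937: the same seed always yields the same
# uniform sequence, which is what makes GPU/CPU comparison possible.
gen = random.Random(1234)
seq_a = [gen.random() for _ in range(4)]

gen.seed(1234)              # reset to the same state
seq_b = [gen.random() for _ in range(4)]

print(seq_a == seq_b)       # True
```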

This loads OpenCLFunction; you specify a real type, which means that the precision used is dependent on the device's capabilities (whether it supports double precision or not):

This sets up the Mersenne Twister's input and output parameters (for more information, refer to the algorithm description):

This invokes OpenCLFunction:

This plots the results:

If the output is timed:

There is almost an 11× increase in speed:

The scan, or prefix sum, algorithm is similar to FoldList and is a very useful primitive algorithm that can be used in a variety of scenarios. The OpenCL implementation is found in:
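A sequential sketch of the scan primitive in plain Python (illustrative; GPU versions compute the same result in parallel blocks):

```python
from itertools import accumulate

# The inclusive scan of {x1, x2, ...} is {x1, x1+x2, x1+x2+x3, ...},
# matching FoldList without the starting value. The exclusive scan
# shifts it right by one position, with an identity element in front.
data = [3, 1, 7, 0, 4]      # illustrative input
inclusive = list(accumulate(data))
exclusive = [0] + inclusive[:-1]

print(inclusive)   # [3, 4, 11, 11, 15]
print(exclusive)   # [0, 3, 4, 11, 11]
```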

This loads the three kernels used in computation:

This generates random input data:

This allocates the output buffer:

This computes the block and grid dimensions:

A temporary buffer is needed during the computation:

This performs the scan operation:

This retrieves the output buffer:

This deallocates the OpenCLMemory elements:

The Sobel filter (only horizontal will be discussed in this example) is a convolution with the kernel {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}; this is called the horizontal Sobel kernel. Using local memory in this case would increase the performance; see the Sobel filter in the Canny edge-detection example for a faster implementation. A trivial implementation in OpenCL is located here:
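A plain-Python reference for the horizontal Sobel convolution (a sketch assuming the standard 3×3 kernel, "valid" borders, and no kernel flip, as is common in image processing):

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Slide the kernel over every fully covered position of the image and
# accumulate the weighted sum; the output shrinks by kernel size - 1.
def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * image[y + j][x + i]
            row.append(acc)
        out.append(row)
    return out

# On a constant image the response is zero (the kernel weights sum to 0).
flat = [[7] * 4 for _ in range(4)]
print(convolve2d(flat, SOBEL_X))   # [[0, 0], [0, 0]]
```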

The input image is grabbed from ExampleData:

This specifies the width, height, and channel values of the image:

One property of the Sobel filter is that it is separable. This means that the kernel above can be represented by first convolving with (-1, 0, 1) along the rows and then (1, 2, 1) along the columns; these are called the row and column filters, respectively. The following loads the functions from the same file:

If a convolution kernel is separable, then composing the separable filters is more efficient than performing the full convolution, since it reduces the number of arithmetic operations and memory accesses needed.
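A quick sketch checking the decomposition: the 3×3 horizontal Sobel kernel equals the outer product of the column filter (1, 2, 1) and the row filter (−1, 0, 1) (filter values assumed from the standard Sobel definition):

```python
# Outer product of the two 1D filters reconstructs the 2D kernel,
# which is exactly what makes the two-pass implementation valid.
col = [1, 2, 1]
row = [-1, 0, 1]
outer = [[c * r for r in row] for c in col]

print(outer)   # [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
```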

Matrix transpose is a fundamental algorithm in many applications. This specifies the inputs:

This shows the MatrixForm of the result:

The result agrees with *Mathematica*:
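A plain-Python sketch of the transpose itself (the tiling through local memory in a GPU version changes performance, not the result):

```python
# Transpose moves element (i, j) to (j, i); zip(*m) regroups the
# rows of m into its columns.
def transpose(m):
    return [list(col) for col in zip(*m)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(m))   # [[1, 4], [2, 5], [3, 6]]
```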

Matrix multiplication is implemented here:

This defines the block size:

This loads OpenCLFunction; note that the input is specified to be of rank 2:

This creates random input and allocates the output:

This gets the output memory using OpenCLMemoryGet:

The output agrees with *Mathematica*:
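A triple-loop plain-Python reference for the same computation (illustrative; a GPU kernel typically blocks it so tiles of A and B can be reused from local memory):

```python
# C[i][j] is the dot product of row i of A with column j of B.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))   # [[19, 22], [43, 50]]
```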

The one-dimensional fast Fourier transform can be implemented using *OpenCLLink*; this implementation assumes that the input length is a power of 2:

This loads the OpenCLFunction:

This creates input and output lists:

This gets the output memory and creates a complex list, displaying only the first 50 elements:

The result agrees with Fourier:
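A recursive radix-2 Cooley–Tukey sketch in plain Python, under the same power-of-2 assumption (note that Fourier additionally applies its own normalization convention, so results match only up to scaling):

```python
import cmath

# Split into even- and odd-indexed halves, transform each recursively,
# and combine with twiddle factors exp(-2*pi*i*k/n).
def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

print(fft([1, 1, 1, 1]))   # [(4+0j), 0j, 0j, 0j]
```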

Black-Scholes models financial derivative investments and is implemented in OpenCL:

This assigns the input parameters:

This invokes OpenCLFunction:

This gets the call values:

The result agrees with the output of FinancialDerivative:
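The closed-form European call price that such a kernel evaluates can be sketched in plain Python with math.erf (parameter names are illustrative, not from the example):

```python
import math

# Standard normal CDF via the error function.
def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Black-Scholes European call: spot*N(d1) - strike*exp(-r*T)*N(d2).
def bs_call(spot, strike, rate, vol, expiry):
    d1 = ((math.log(spot / strike) + (rate + 0.5 * vol * vol) * expiry)
          / (vol * math.sqrt(expiry)))
    d2 = d1 - vol * math.sqrt(expiry)
    return (spot * norm_cdf(d1)
            - strike * math.exp(-rate * expiry) * norm_cdf(d2))

price = bs_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, expiry=1.0)
print(round(price, 4))   # about 10.45
```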

For timing, the number of options to be valued is increased:

On the C2050, it takes 1/100 of a second to value 2048 options:

On a Core i7 950, FinancialDerivative takes 1.13 seconds; this is a speedup of 280×. Note that increasing the number of options will exhibit even greater speedups:

Recursive Gaussian is used to approximate the Gaussian filter. The algorithm relies on the fact that the Gaussian matrix is separable:

It can be written as the outer product of a 1D Gaussian:
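A plain-Python sketch of that outer-product fact for a small normalized 1D Gaussian (radius and sigma values are illustrative):

```python
import math

# Build a normalized 1D Gaussian of the given radius and sigma.
def gauss1d(radius, sigma):
    w = [math.exp(-x * x / (2 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

g = gauss1d(2, 1.0)

# The 2D Gaussian kernel is the outer product of the 1D Gaussian
# with itself, which is exactly what separability means.
outer = [[a * b for b in g] for a in g]

print(abs(outer[2][2] - g[2] * g[2]) < 1e-12)   # True
```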

The recursive Gaussian is implemented here:

This loads the OpenCLFunction:

This specifies the σ value. Recall that the Gaussian is (1/(σ √(2π))) exp(−x²/(2σ²)):

With σ set to 5.0, the normal distribution is calculated:

*Mathematica* can plot the distribution:

Here you calculate the recursive Gaussian parameters:

The input, temporary, and output memory is loaded as OpenCLMemory:

Here you call the functions. First you perform the Gaussian horizontally, then transpose, then perform the Gaussian vertically, then transpose to get the full Gaussian:

This gets the output image:

Again you compare timing:

And notice a 10× performance boost:

Bitonic sort sorts a given set of integers. It is similar in principle to merge sort. The OpenCL implementation only works on lists whose length is a power of 2 and can be found here:
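A plain-Python sketch of the compare-exchange network (illustrative; a GPU version performs the same exchanges, spread across kernel launches):

```python
# Iterative bitonic sort; the input length must be a power of 2.
def bitonic_sort(values, ascending=True):
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of 2"
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # compare-exchange distance within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0      # direction of this subsequence
                    if (a[i] > a[partner]) == (up == ascending):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 1, 8, 6, 2, 5, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

The direction flag mirrors the example's choice of sorting lowest to highest or highest to lowest.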

This sets the length of the input and loads it. The direction denotes whether to sort from highest to lowest or lowest to highest. In this case, you sort from lowest to highest:

This gets the input list:

This calls bitonic sort; as with merge sort, multiple passes are needed for a full sort:

The output list is retrieved sorted: