Applications
Because GPUs are SIMD machines, to exploit CUDA's potential you must pose the problem in an SIMD manner. Computation that can be partitioned in such a way that each thread can compute one element independently is ideal for the GPU.
Some algorithms either cannot be written in parallel, or cannot be used on CUDA (due to architecture constraints). In those cases, research is ongoing to introduce alternative methods to use the GPU to perform those computations.
In this section some usage of CUDA programming inside
Mathematica is showcased. All the following examples use
CUDAFunctionLoad, which allows you to load CUDA source, binaries, or libraries into
Mathematica.
CUDAFunctionLoad allows you to load CUDA source, binaries, or libraries into Mathematica.
Image Processing
This section contains examples of CUDA applications that perform image processing operations.
CUDALink contains a few built-in functions that perform image processing operations, such as
CUDAImageConvolve,
CUDABoxFilter,
CUDAErosion,
CUDADilation,
CUDAOpening,
CUDAClosing,
CUDAImageAdd, etc.
Image Binarize
Binarize takes an input image and outputs a binary image with pixels set to white if above a threshold and black otherwise.
If you have not done so already, import the
CUDALink application.
This defines the input image. To reduce the memory footprint on the GPU, use

to represent the image.
This loads
CUDAFunction. The type

is signified by

in the parameter list.
| Out[3]= |  |
This defines the input image and allocates

memory for the output.
This calls the binarize function, using

as the threshold value.
| Out[8]= |  |
This displays the output image.
| Out[9]= |  |
The result agrees with
Mathematica.
| Out[10]= |  |
This unloads the memory allocated.
Recursive Gaussian
Recursive Gaussian approximates performing a Gaussian filter on input data. It sacrifices some accuracy in favor of speed. This implements a Deriche-style recursive Gaussian filter.
| Out[12]= |  |
This loads two functions: the

and

functions. The functions take an RGBA image, and the data is passed as

or

.
This specifies the

value. Recall that the Gaussian is

.
The following parameters are calculated based on the

value.
The following defines the input image and allocates memory for the output.
This runs the functions. Perform a vertical Gaussian, transpose, perform another vertical Gaussian, and transpose again to get the result.
This gets the output image.
| Out[25]= |  |
Memory allocated must be freed.
Box Filter
The box filter is an optimized convolution when the kernel is a
BoxMatrix. It is implemented here.
| Out[27]= |  |
This loads the CUDA functions needed to perform the box filter.
This sets the input parameters.
The radius is set to

.
This calls the functions.
| Out[37]= |  |
This unloads the memory allocated.
Image Adjust
This is an implementation of
ImageAdjust in CUDA.
CUDAFunction is loaded from the source string, and a float is used for the constant values.
| Out[40]= |  |
This wraps the
CUDAFunction to make

with similar syntax to
ImageAdjust.
The function can be used similarly to
ImageAdjust.
| Out[43]= |  |
Canny Edge Detection
Canny edge detection combines a dozen or so filters to find edges in an image.
Mathematica's
EdgeDetect provides similar functionality. Here is the implementation.
| Out[44]= |  |
This loads the CUDA functions needed to perform the edge detection.
This sets the input image along with its parameters.
Now add host and device tensors to the CUDA manager. These will hold both the input and the output.
Next, define device-only temporary memory that will be used in the computation. Since Canny edge detection involves many filters, you need quite a few of them.
This calls the Canny edge-detection functions.
This views the output as an image.
| Out[80]= |  |
This unloads the memory allocated.
Linear Algebra and List Processing
This section contains examples of CUDA applications that perform linear algebra operations. Most of the functions discussed can be performed using
CUDATranspose or
CUDADot.
Matrix Transpose
Matrix transposition is essential in many algorithms.
CUDALink provides a ready-made implementation in the form of
CUDATranspose. Users may wish to implement their own, however.
This loads the
CUDAFunction for matrix transposition and defines a new function

that takes a real-valued matrix and outputs its transpose.
Out[83]//MatrixForm= |
| |  |
This transposes matrix

.
Out[84]//MatrixForm= |
| |  |
The result agrees with
Mathematica.
Out[85]//MatrixForm= |
| |  |
Matrix-Vector Multiplication
Matrix-vector multiplication is a common operation in linear algebra, finite element analysis, etc. This loads
CUDAFunction, implementing matrix-vector multiplication.
This sets the input matrix and vector.
This invokes the above defined function, displaying the result using
MatrixForm.
Out[89]//MatrixForm= |
| |  |
The result agrees with
Mathematica.
Out[90]//MatrixForm= |
| |  |
Matrix-Matrix Multiplication
Matrix-matrix multiplication is a pivotal function in many algorithms. This loads
CUDAFunction from a source file, setting the block dimension to

.
This sets input values and allocates memory for the output.
This performs the computation.
This displays the result using
MatrixForm.
Out[97]//MatrixForm= |
| |  |
The result agrees with
Mathematica.
Out[98]//MatrixForm= |
| |  |
This unloads the output buffer.
Dot Product
The dot product of two vectors is a common operation in linear algebra. This implements a function that takes a set of vectors and gives the dot product of each.
| Out[100]= |  |
This loads the function into
Mathematica.
| Out[101]= |  |
Generate 50 random vectors.
This calls the function, returning the result.
| Out[105]= |  |
The results agree with
Mathematica's output.
| Out[106]= |  |
Convex Hull
Convex hull is hard to make parallel on the GPU, so in this example the hybrid approach to GPU programming is taken: do computation that makes sense on the GPU and the rest is done on the CPU.
This convex hull implementation is a textbook implementation of Andrew's algorithm and is designed to be simple, not efficient.
Define the

predicate inside
Mathematica.
This defines a function that splits the point set to be either above or below a given line connecting the extreme points.
This loads the above function as a CUDA function.
Define a
Mathematica function that takes two split point sets and finds the convex hull for them.
This calls the above function. Note that the list is sorted with
CUDASort before being processed by

.
To test, create 20000 uniformly distributed random points.
This visualizes the result. Lines are drawn between the hull points.
| Out[115]= |  |
The above algorithm handles the extreme case where all or most points lie on the hull. This generates uniformly distributed points on the unit disk.
This computes the hull points.
This visualizes the hull.
| Out[119]= |  |
The above is a prime example of combining
Mathematica and CUDA programming to make an algorithm partially parallel that would otherwise be written in only serial code.
Random Number Generation
Many algorithms ranging from ray tracing to PDE solving require random numbers as input. This is, however, a difficult problem on many core systems, where each core has the same state as any other core. To avoid that, parallel random number generators usually use entropy values such as the time of day to seed the random number generators, but those calls are not available in CUDA.
The following section gives three classes of algorithms to generate random numbers. The first is uniform random number generators (where pseudorandom number generators and quasi-random number generators are showcased). The second is random number generators that exploit the uniform distribution of a hashing function to generate random numbers. The final is normal random number generators.
Pseudorandom Number Generators
Pseudorandom number generators are deterministic algorithms that generate numbers that appear to be random. They rely on the seed value to generate further random numbers.
In this section, a simple linear congruential random number generator (the Park-Miller algorithm) is shown, along with a more complex Mersenne Twister.
Park-Miller
The Park-Miller random number generator is defined by the following recurrence equation:
It can be implemented easily in
Mathematica.
Here, common values for

and

are used. Using
NestList, you can generate a list of 1000 numbers and plot them.
| Out[121]= |  |
Here is the timing to generate 1 million numbers.
| Out[122]= |  |
An alternative is to use the
Method option in
SeedRandom; this can be used in the same manner as before.
| Out[123]= |  |
Compared to the
Mathematica implementation, this is around 300 times faster.
| Out[124]= |  |
The CUDA implementation is similar to the one written in
Mathematica. The implementation is distributed along with
CUDALink, and the location is shown below.
| Out[125]= |  |
This loads the CUDA function into
Mathematica.
| Out[126]= |  |
This allocates the output memory.
| Out[127]= |  |
| Out[128]= |  |
| Out[129]= |  |
If you measure the timing, you notice that it is twice as fast as
Mathematica's built-in method, and 600 times faster than pure
Mathematica implementation.
| Out[130]= |  |
| Out[131]= |  |
The timing against a
Compile is similar to that of CUDA. This generates C code from a
Compile statement with no integer overflow detection.
This finds the timing. Notice that there is little difference.
| Out[133]= |  |
If you ignore the time it takes for memory allocation (and you can rightly do so in this case, since generated random numbers are usually reused on the GPU), you notice a 10× speed improvement.
| Out[136]= |  |
Mersenne Twister
Mersenne Twister utilizes shift registers to generate random numbers. Because the implementation is simple, it maps well to the GPU. The following file contains the implementation.
| Out[137]= |  |
This loads the

function from the file.
Here, the Mersenne Twister input parameters are defined. The twister requires seed values. Those values can be computed offline and stored in a file or they can be generated by
Mathematica. The latter is shown here.
This allocates the output memory. Since the output will be overwritten, there is no need to load memory from
Mathematica onto the GPU.
| Out[146]= |  |
This invokes
CUDAFunction with parameters.
| Out[147]= |  |
The output can be plotted to show it is random.
| Out[148]= |  |
A
Mathematica function can be written that takes the number of random numbers to be generated as input, performs the required allocations and setting of parameters, and returns the random output memory.
This generates a plot of the first 10000 elements.
| Out[150]= |  |
The following measures the time it takes for the random number generator to generate 100 million numbers.
| Out[151]= |  |
This is on par with
Mathematica's random number generator timings.
| Out[152]= |  |
Considering that random numbers are seeds to other problems, a user may get performance increase in the overall algorithm even if
Mathematica's timings are superior to the CUDA implementation.
Quasi-Random Number Generators
This section describes quasi-random number generators. Unlike pseudorandom number generators, these sequences are nonuniform and have underlying structures that are sometimes useful in numerical methods. For instance, these sequences typically provide faster convergence in multidimensional Monte Carlo integration.
Halton Sequence
The Halton sequence generates quasi-random numbers that are uniform on the unit interval. While the code works with arbitrary dimensions, only the van der Corput sequence is discussed, which works on 1D space. This is adequate for comparison.
The resulting numbers of the Halton (or van der Corput) sequence are deterministic but have low discrepancy over the unit interval. Because they fill the space uniformly in some applications, such as Monte Carlo integration, they are preferred to pseudorandom number generators.
For a given

with base

:
The one-dimensional Halton (or van der Corput) value in base

:
The sequence of length

is then written as:
Given a number in base representation

, the van der Corput sequence mirrors the number across the decimal point, so that its sequence value is

.
In
Mathematica, you can find the sequence using
IntegerDigits.
Setting the base to

, you can calculate the first 1000 elements in the sequence.
This plots the result; notice how it fills the space uniformly.
| Out[155]= |  |
A property of low-discrepancy sequences is that the next elements in the sequence know where the previous elements are positioned. This can be shown with
Manipulate.
| Out[156]= |  |
For the CUDA implementation, you have to implement your own version of
IntegerDigits, but this is not difficult. First, load the implementation source code.
| Out[157]= |  |
| Out[158]= |  |
This allocates memory for the output. Here, only 1024 random numbers are generated.
| Out[159]= |  |
This runs the function for dimension 1.
| Out[160]= |  |
| Out[161]= |  |
Sobol Sequence
The Sobol sequence is also a low-discrepancy sequence. It is implemented in the following CUDA file.
| Out[162]= |  |
This loads the function, using

as the block dimension.
| Out[163]= |  |
Here the input parameters are loaded. The direction vectors needed by the Sobol sequence are precomputed and stored in a file.
This executes the Sobol function, passing parameters.
This plots the first 10000 values in the sequences. Note that the space is filled evenly with points (a property of quasi-random number generators).
| Out[170]= |  |
When complete, the memory must be unloaded.
Niederreiter Sequence
The Niederreiter sequence is another commonly used low-discrepancy sequence. It is implemented in the following CUDA file.
| Out[172]= |  |
This loads the seed data for the random number generator.
| Out[174]= |  |
This allocates the memory needed for computation and sets the proper parameters. Here the sequence for dimension 3 is generated and the resolution is set to

.
This gets the output from the GPU.
This plots the first 10000 values in the sequences. Note that the space is filled evenly, like the Sobol sequence.
| Out[183]= |  |
When complete, the memory must be unloaded.
Hashing Random Number Generators
Random number generators that depend on hashing generate random numbers of lesser quality, but they generate them fast. For many applications, they are more than adequate.
Tiny Encryption Algorithm Hashing
The Tiny Encryption Algorithm (TEA) is a very simple hashing algorithm implemented in the following file.
| Out[185]= |  |
| Out[186]= |  |
This allocates memory for the output.
| Out[187]= |  |
| Out[188]= |  |
| Out[189]= |  |
This deletes allocated memory.
MD5 Hashing
Other general hashing methods can be used for random number generators. Here is an implementation of the MD5 algorithm—a well-known hashing algorithm.
| Out[191]= |  |
This loads
CUDAFunction from the source.
| Out[192]= |  |
This loads the output memory.
| Out[193]= |  |
| Out[194]= |  |
| Out[195]= |  |
This deletes allocated memory.
Normal Random Numbers
The following algorithms generate normally distributed random numbers.
Inverse Cumulative Normal Distribution
The following implements a way to generate normally distributed random numbers.
| Out[197]= |  |
| Out[198]= |  |
Allocate memory for 100000 random numbers.
| Out[201]= |  |
This gets the memory into
Mathematica.
This plots the result using
Histogram.
| Out[203]= |  |
Box-Muller
Box-Muller is a method of generating normally distributed numbers, given a set of uniformly distributed random numbers. The CUDA implementation is found in the following file.
| Out[205]= |  |
| Out[206]= |  |
This sets the input arguments.
Use the Mersenne Twister (defined two sections ago) to generate a list of uniformly distributed random numbers.
| Out[212]= |  |
Transform the list of uniform random numbers to normally distributed random numbers.
| Out[213]= |  |
You can see the bell curve when using
Histogram.
| Out[214]= |  |
This deletes allocated memory.
Applications of Random Number Generators
Random numbers have applications in many areas. Here two main applications are presented: Monte Carlo integration (by approximating

and an arbitrary function) and simulating Brownian motion.
Approximating 
The value of

can be approximated using Monte Carlo integration. First, generate uniformly random numbers in the unit square. Then the number of points inside the first quadrant of the unit circle is counted. The result is then divided by the number of points. This will give

.
This implements reduction, counting the number of points in a unit circle.
| Out[216]= |  |
| Out[217]= |  |
Generate the random numbers using the Mersenne Twister algorithms discussed previously.
| Out[221]= |  |
This allocates the output memory.
| Out[222]= |  |
This performs the computation.
| Out[223]= |  |
This gets the output memory.
| Out[224]= |  |
The result agrees with
Mathematica.
| Out[225]= |  |
The timing is considerably faster.
| Out[226]= |  |
| Out[227]= |  |
Monte Carlo Integration
Monte Carlo integration finds its way into many areas. Here,
Sqrt
from 0 to 1 is integrated.
| Out[229]= |  |
Use the Sobol quasi-random number generator for random numbers.
Then use the number of random number generators as the length.
| Out[238]= |  |
This allocates memory for the output.
| Out[239]= |  |
| Out[240]= |  |
This checks whether the first few elements make sense.
| Out[241]= |  |
You now need to sum the output. This can be done using
CUDAFold.
| Out[242]= |  |
The result agrees with
NIntegrate.
| Out[243]= |  |
Unload the allocated memory.
Brownian Motion
This allocates memory for the simulation.
| Out[245]= |  |
The values of the pseudorandom sequence are normally distributed.
| Out[248]= |  |
| Out[250]= |  |
Code Generation
Since
CUDALink is integrated in
Mathematica, you can use
Mathematica features like
SymbolicC to generate CUDA kernel code. If you have not done so already, import
CUDALink.
For this example, you need the
SymbolicC package.
This defines some common
Mathematica constructs, translating them to their
SymbolicC representation.
To test, pass in a
Mathematica statement and get the
SymbolicC output.
| Out[271]= |  |
To convert to a C string, use the
ToCCodeString method.
| Out[272]= |  |
The above allows you to write a function that takes a
Mathematica function (pure or not) and would generate the appropriate CUDA kernel source.
Passing a pure function to

returns the kernel code.
| Out[276]= |  |
This defines a function that, given a
Mathematica function and an input list, generates the CUDA kernel code, loads the code as a
CUDAFunction, runs the
CUDAFunction, and returns the result.
You can test

with a pure function that adds 2 to each element in an input list.
| Out[279]= |  |
Any construct translated by

is supported by

. Here, each element is squared.
| Out[280]= |  |
This performs Monte Carlo integration.
| Out[281]= |  |
Functions can be defined and passed into

. Here, a color negation function is defined.
Invoke the color negation function.
| Out[283]= |  |
The above is a simple example, but can be used as a seed for more complicated ones.