CUDALink on Multiple Devices
This feature is not supported on the Wolfram Cloud.

The functional and list-oriented characteristics of the core Wolfram Language allow CUDALink to provide immediate built-in data parallelism, automatically distributing computations across available GPU cards.

Introduction

First, load the CUDALink application.

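This is done with:

    Needs["CUDALink`"]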

This launches as many worker kernels as there are devices.

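Using $CUDADeviceCount, described below:

    LaunchKernels[$CUDADeviceCount]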

$CUDADeviceCount gives the number of CUDA devices on the system.

$CUDADeviceCount    number of CUDA devices on the system

This loads CUDALink on all worker kernels.

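This uses ParallelNeeds:

    ParallelNeeds["CUDALink`"]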

CUDALink relies on existing Wolfram Language parallel computing capabilities to run on multiple GPUs. Throughout this section, the following functions are used.

ParallelNeeds    load a package into all parallel subkernels
DistributeDefinitions    distribute definitions needed for parallel computations
ParallelEvaluate    evaluate the input expression on all available parallel kernels and return the list of results obtained


This sets the $CUDADevice variable on all kernels.

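A common idiom is to assign each worker kernel its own device based on its kernel ID:

    ParallelEvaluate[$CUDADevice = $KernelID]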

CUDALink Functions

High-level CUDALink functions such as image processing, linear algebra, and fast Fourier transforms can be used on worker kernels like any other Wolfram Language function. The only difference is that the $CUDADevice variable is set to the device on which the computation is performed.

Here you set the image names to be taken from an ExampleData dataset.

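A sketch, using the "TestImage" collection (the variable name imgNames is illustrative):

    imgNames = ExampleData["TestImage"];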

Distribute the variable to the worker kernels.

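This uses DistributeDefinitions:

    DistributeDefinitions[imgNames]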

Perform CUDAErosion on images taken from ExampleData.

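A sketch, using an arbitrary erosion radius of 5:

    ParallelMap[CUDAErosion[ExampleData[#], 5] &, imgNames]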

Notice the 2x speed improvement. Since these images are small and the data must be transferred to the worker kernels, you do not get the full 4x speedup.

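A sketch of the comparison; actual timings depend on the hardware:

    Map[CUDAErosion[ExampleData[#], 5] &, imgNames]; // AbsoluteTiming
    ParallelMap[CUDAErosion[ExampleData[#], 5] &, imgNames]; // AbsoluteTiming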

In other cases, the amount of time spent transferring the data is not as significant as the amount of time spent in calculation. Here, you allocate 2000 random integer vectors.

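A sketch; the vector length is arbitrary:

    vecs = RandomInteger[100, {2000, 10000}];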

Map CUDAFold over the vectors on each device.

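ParallelMap distributes the vectors across the worker kernels:

    ParallelMap[CUDAFold[Plus, 0, #] &, vecs]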

Notice that there is now a better speedup.

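Again comparing the sequential and parallel timings:

    Map[CUDAFold[Plus, 0, #] &, vecs]; // AbsoluteTiming
    ParallelMap[CUDAFold[Plus, 0, #] &, vecs]; // AbsoluteTiming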

CUDALink Programming

Since a CUDAFunction is optimized and local to one GPU, it cannot be shared with worker kernels using DistributeDefinitions. This section describes an alternative way of programming the GPU.

Add Two

This defines basic CUDA code that adds 2 to each element of a vector.

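A minimal kernel for this:

    code = "
        __global__ void addTwo(mint * arr, mint len) {
            int index = threadIdx.x + blockIdx.x * blockDim.x;
            if (index < len)
                arr[index] += 2;
        }";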

This loads the CUDAFunction. Notice the use of SetDelayed in the assignment. This allows DistributeDefinitions to distribute all dependent variables in the CUDAFunctionLoad call.

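Note the := in place of =:

    addTwo := CUDAFunctionLoad[code, "addTwo", {{_Integer, "InputOutput"}, _Integer}, 256]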

This sets the input parameters.

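For example (the sizes are arbitrary):

    vec = Range[100000];
    len = Length[vec];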

This distributes the definitions to the worker kernels.

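Since addTwo was defined with SetDelayed, distributing it also distributes code:

    DistributeDefinitions[addTwo, vec, len]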

This runs the CUDAFunction on each worker kernel, each using a different CUDA device.

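A CUDAFunction returns the list of its output arguments, so First extracts the modified vector:

    res = ParallelEvaluate[First[addTwo[vec, len]]];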

This gathers the results, showing the first 20 elements from each kernel.

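For example:

    Take[#, 20] & /@ res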

Mandelbrot Set

This is the same CUDA code defined in other sections of the CUDALink documentation.

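A sketch of such a kernel; the iteration limit and indexing scheme are illustrative:

    code = "
        #define MAX_ITERATIONS 512
        __global__ void mandelbrot(mint * set, float zoom, mint width, mint height) {
            int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
            int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
            if (xIndex < width && yIndex < height) {
                float x0 = zoom * (xIndex - width / 2);
                float y0 = zoom * (yIndex - height / 2);
                float x = 0.0f, y = 0.0f;
                int ii;
                for (ii = 0; ii < MAX_ITERATIONS && x*x + y*y < 4.0f; ii++) {
                    float tmp = x*x - y*y + x0;
                    y = 2.0f * x * y + y0;
                    x = tmp;
                }
                set[xIndex + yIndex * width] = ii;
            }
        }";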

Here, you load the CUDAFunction.

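A sketch, again using SetDelayed so the definition can be distributed; the image dimensions are arbitrary:

    mandelbrot := CUDAFunctionLoad[code, "mandelbrot", {{_Integer, "Output"}, "Float", _Integer, _Integer}, {16, 16}]

    width = 1024; height = 768;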

This shares the variables with the worker kernels.

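Again via DistributeDefinitions:

    DistributeDefinitions[mandelbrot, width, height]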

This runs the function on each worker kernel, each with a different zoom level, returning the images.

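A sketch: each worker allocates output memory, runs the kernel over the whole grid (the trailing {width, height} specifies the launch grid), and assembles an image; the zoom values are illustrative:

    ParallelEvaluate[
        Module[{mem = CUDAMemoryAllocate[Integer, width*height]},
            mandelbrot[mem, 0.002/$KernelID, width, height, {width, height}];
            Image[Partition[N[CUDAMemoryGet[mem]]/512, width]]
        ]
    ]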

Random Number Generators

The Mersenne Twister is implemented in the following file.

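A sketch, assuming the Mersenne Twister source that ships with CUDALink:

    srcf = FileNameJoin[{$CUDALinkPath, "SupportFiles", "mersenne_twister_kernel.cu"}]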

This loads the function into the Wolfram Language.

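A sketch; the kernel name and argument specification here are illustrative and must match the actual source:

    twister := CUDAFunctionLoad[{srcf}, "RandomGPU", {{"Float", "Output"}, {_Integer, "Input"}, _Integer}, 128]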

This sets the input variables for the CUDAFunction.

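For example, the number of random values each thread generates (the value is arbitrary):

    nPerRng = 5000;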

This distributes the function and input parameters.

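Distributing twister also distributes srcf, since the definition uses SetDelayed:

    DistributeDefinitions[twister, nPerRng]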

This allocates the seed values; note that the seeding needs to be performed on each worker kernel so that the random numbers are not correlated. The output memory is also allocated, the computation is performed, and the result is visualized.

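A sketch: each worker generates its own seeds, allocates output memory, runs the function with an explicit thread count, and plots a histogram of its values:

    ParallelEvaluate[
        Module[{seeds = RandomInteger[{1, 2^31 - 1}, 128], out},
            out = CUDAMemoryAllocate["Float", 128*nPerRng];
            twister[out, seeds, nPerRng, 128];
            Histogram[CUDAMemoryGet[out]]
        ]
    ]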

Memory

CUDAMemory is tied to both the kernel and device where it is loaded and thus cannot be distributed among worker kernels.

Load memory in the master kernel.

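For example:

    mem = CUDAMemoryLoad[Range[100]]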

Then distribute the definition.

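This distributes the CUDAMemory expression, but not the underlying GPU data:

    DistributeDefinitions[mem]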

Distributed CUDAMemory cannot be operated on by worker kernels.

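Attempting to read the memory from the workers fails:

    ParallelEvaluate[CUDAMemoryGet[mem]]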

To load memory onto the worker kernels, use ParallelEvaluate.

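Each worker loads its own copy:

    ParallelEvaluate[mem = CUDAMemoryLoad[Range[100]]]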

Further operations can be performed on the memory using ParallelEvaluate.

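For example, totaling each worker's memory with CUDATotal:

    ParallelEvaluate[CUDATotal[mem]]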

Bandwidth

In some cases, the amount of time spent transferring the data dwarfs the time spent in computation.

Here you load a large list.

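A sketch; the length is arbitrary:

    data = RandomInteger[100, 10^7];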

Since the parallel version needs to share the large list with worker kernels, it takes considerably longer than the sequential version.

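A sketch of the parallel version, which must send a piece of the list to each worker:

    ParallelMap[CUDAFold[Plus, 0, #] &,
        Partition[data, UpTo[Ceiling[Length[data]/$KernelCount]]]]; // AbsoluteTiming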

The sequential version is faster since no data transfer is necessary.

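The sequential version, for comparison:

    CUDAFold[Plus, 0, data]; // AbsoluteTiming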