Wolfram Language & System Documentation Center

CUDALink on Multiple Devices

The functional and list-oriented characteristics of the core Wolfram Language allow CUDALink to provide immediate built-in data parallelism, automatically distributing computations across available GPU cards.

Introduction

First, load the CUDALink application.

Wolfram Language code: Needs["CUDALink`"]

This launches as many worker kernels as there are devices.

Wolfram Language code: LaunchKernels[$CUDADeviceCount]

$CUDADeviceCount is the number of CUDA devices on the system.

$CUDADeviceCount	number of CUDA devices on system

$CUDADeviceCount gets the number of CUDA GPUs on the system.

This loads CUDALink on all worker kernels.

Wolfram Language code: ParallelNeeds["CUDALink`"]
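
As a minimal sketch, the setup steps above can be guarded so that worker kernels are launched only when CUDALink reports a usable GPU. The check with CUDAQ and the message text are additions for illustration, not part of the original walkthrough.

```wolfram
Needs["CUDALink`"]

(* CUDAQ[] gives True when CUDALink can use a GPU on this machine *)
If[TrueQ[CUDAQ[]] && $CUDADeviceCount > 0,
	(* one worker kernel per CUDA device, then load CUDALink on each *)
	LaunchKernels[$CUDADeviceCount];
	ParallelNeeds["CUDALink`"],
	(* otherwise stay on the master kernel and report why *)
	Print["No usable CUDA device detected; no worker kernels launched."]
	]
```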

CUDALink relies on existing Wolfram Language parallel computing capabilities to run on multiple GPUs. Throughout this section the following functions will be used.

ParallelNeeds	load a package into all parallel subkernels
DistributeDefinitions	distribute definitions needed for parallel computations
ParallelEvaluate	evaluate the input expression on all available parallel kernels and return the list of results obtained

CUDALink relies on the Wolfram Language's parallel computing capabilities to use multiple GPUs.

This sets the $CUDADevice variable on all kernels.

Wolfram Language code: ParallelEvaluate[$CUDADevice = $KernelID]
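
The assignment above assumes that the kernel IDs run from 1 to the number of devices. If worker kernels have been closed and relaunched during a session, that may not hold. The following sketch wraps the ID into the valid device range and then reports the mapping; the use of Mod here is an assumption for illustration, not the documented method.

```wolfram
(* map each worker's kernel ID onto a device index in 1..$CUDADeviceCount *)
ParallelEvaluate[$CUDADevice = Mod[$KernelID - 1, $CUDADeviceCount] + 1];

(* list {kernel ID, assigned device} for every worker kernel *)
ParallelEvaluate[{$KernelID, $CUDADevice}]
```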

CUDALink Functions

High-level CUDALink functions, such as those for image processing, linear algebra, and fast Fourier transforms, can be used on different kernels like any other Wolfram Language function. The only difference is that the $CUDADevice variable is set to the device on which the computation is performed.

Here you set the names of images to be taken from the "TestImage" collection of ExampleData.

Wolfram Language code: imgNames = {"Girl", "Girl2", "Girl3", "Couple"};

Distribute the variable imgNames to the worker kernels.

Wolfram Language code: DistributeDefinitions[imgNames]

Perform CUDAErosion on images taken from ExampleData.

Wolfram Language code: ParallelEvaluate[CUDAErosion[ExampleData[{"TestImage", imgNames[[$KernelID]]}], 2]]//AbsoluteTiming

Compared with the sequential version below, the parallel version shows a 2x speed improvement. Since these images are small and the data must be transferred, you do not get the ideal 4x speedup.

Wolfram Language code: (CUDAErosion[ExampleData[{"TestImage", #}], 2]& /@ imgNames)//AbsoluteTiming

In other cases, the time spent transferring the data is less significant than the time spent in calculation. Here, you create 2000 random real vectors of length 100.

Wolfram Language code: lsts = Table[RandomReal[1, 100], {ii, 2000}];

Map CUDAFold on each device.

Wolfram Language code: ParallelMap[CUDAFold[Plus, 0, #]&, lsts];//AbsoluteTiming

Compared with the sequential version, there is now a better speedup.

Wolfram Language code: Map[CUDAFold[Plus, 0, #]&, lsts];//AbsoluteTiming
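
To make the comparison concrete, the two timings can be captured and divided. This is a sketch that reuses lsts from above; the names tPar, tSeq, sumsPar, and sumsSeq are hypothetical, and the measured ratio will depend on the hardware.

```wolfram
(* AbsoluteTiming returns {seconds, result} *)
{tPar, sumsPar} = AbsoluteTiming[ParallelMap[CUDAFold[Plus, 0, #] &, lsts]];
{tSeq, sumsSeq} = AbsoluteTiming[Map[CUDAFold[Plus, 0, #] &, lsts]];

(* speedup of the parallel run relative to the sequential run *)
tSeq/tPar

(* both runs should give the same sums, up to floating-point rounding *)
Max[Abs[sumsPar - sumsSeq]]
```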

CUDALink Programming

Since a CUDAFunction is optimized and local to one GPU, it cannot be shared with worker kernels using DistributeDefinitions. This section describes an alternative way of programming the GPU.

Add Two

This defines basic CUDA code that adds 2 to each element of a vector.

Wolfram Language code:

code = "
__global__ void addTwo(mint * in, mint * out, mint length) {
	mint index = threadIdx.x + blockIdx.x*blockDim.x;
	if (index < length)
		out[index] = in[index] + 2;
}";

This defines the CUDAFunction. Notice the use of SetDelayed in the assignment: the call to CUDAFunctionLoad is evaluated on each worker kernel rather than once on the master kernel, and DistributeDefinitions can distribute all dependent variables in the CUDAFunctionLoad call.

Wolfram Language code: cudaFun := CUDAFunctionLoad[code, "addTwo", {{_Integer, _, "Input"}, {_Integer, _, "Output"}, _Integer}, 256]

This sets the input parameters.

Wolfram Language code:

listSize = 1000;
A = ConstantArray[1, {listSize}];
B = ConstantArray[1, {listSize}];

This distributes the definitions between worker kernels.

Wolfram Language code: DistributeDefinitions[cudaFun, listSize, A, B]

This runs the CUDAFunction on each worker kernel using different CUDA devices.

Wolfram Language code: ParallelEvaluate[res = cudaFun[A, B, listSize];]

This gathers the results, showing the first 20 elements from each worker kernel.

Wolfram Language code: ParallelEvaluate[Take[First@res,20]]
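
Because cudaFun is defined with SetDelayed, each use of it evaluates CUDAFunctionLoad again on the worker kernel. A possible variation, sketched here, is to evaluate the definition once per worker and keep the result in a kernel-local variable. The name localFun is hypothetical, and whether this saves noticeable time depends on how much CUDALink already caches.

```wolfram
(* evaluate the delayed definition once on each worker kernel *)
ParallelEvaluate[localFun = cudaFun;];

(* reuse the loaded function; A is all 1s and the kernel adds 2 to each element *)
ParallelEvaluate[Take[First[localFun[A, B, listSize]], 5]]
```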

Mandelbrot Set

This is the same CUDA code defined in other sections of the CUDALink documentation.

Wolfram Language code:

src = "
__global__ void mandelbrot_kernel(char * set, float zoom, float bailout, mint width, mint height) {
   int xIndex = threadIdx.x + blockIdx.x*blockDim.x;
   int yIndex = threadIdx.y + blockIdx.y*blockDim.y;
   mint ii;

   Real_t x0 = zoom*(width/3 - xIndex);
   Real_t y0 = zoom*(height/2 - yIndex);
   Real_t tmp, x = 0, y = 0;

   if (xIndex < width && yIndex < height) {
      for (ii = 0; (x*x+y*y <= bailout) && (ii < MAX_ITERATIONS); ii++) {
         tmp = x*x - y*y + x0;
         y = 2*x*y + y0;
         x = tmp;
      }
      if (ii == MAX_ITERATIONS) {
         set[xIndex + yIndex*width] = 0;
      } else {
         set[xIndex + yIndex*width] = 1;
      }
   }
}
";

Here, you load the CUDAFunction.

Wolfram Language code:

mfun := CUDAFunctionLoad[src, "mandelbrot_kernel", {{"Byte", _, "Output"}, "Float", "Float", _Integer, _Integer}, {16, 16}, "Defines" -> {"MAX_ITERATIONS" -> 1000}]

This sets the width and height of the output image.

Wolfram Language code: {width, height} = {2048, 1024};

This shares the variables with worker kernels.

Wolfram Language code: DistributeDefinitions[mfun, width, height]

This launches the kernel on each worker kernel, each with a different zoom level, and returns the result as a "Bit" image.

Wolfram Language code:

ParallelEvaluate[
	mset = CUDAMemoryAllocate["Byte", {height, width}];
	res = mfun[mset, 0.001 * $KernelID, 8.0, width, height, {width, height}];
	Image[CUDAMemoryGet[First[res]], "Bit"]
	]

Random Number Generators

The Mersenne Twister is implemented in the following file.

Wolfram Language code: srcf = FileNameJoin[{$CUDALinkPath, "SupportFiles", "random.cu"}]

This loads the function into the Wolfram Language.

Wolfram Language code:

mersenneTwister := CUDAFunctionLoad[{srcf}, "MersenneTwister", {{_Real, _, "Output"}, {_Integer, _, "Input"}, {_Integer, _, "Input"}, {_Integer, _, "Input"}, {_Integer, _, "Input"}, _Integer}, 32];

This sets the input variables for the CUDAFunction.

Wolfram Language code:

MTRNGCount = 4096;
PATHN = 2 ^ 14;
NPerRNG = Ceiling[PATHN / MTRNGCount];
NPerRNG = If[EvenQ[NPerRNG], NPerRNG, NPerRNG + 1];
RANDN = MTRNGCount * NPerRNG;

This distributes the mersenneTwister function and input parameters.

Wolfram Language code: DistributeDefinitions[mersenneTwister, MTRNGCount, PATHN, NPerRNG, RANDN]

This allocates the seed values. Note that seed evaluation needs to be performed on each worker kernel so that the random numbers are not correlated. The output memory is also allocated, the computation is performed, and the result is visualized.

Wolfram Language code:

ParallelEvaluate[
	{hsMatrixA, hsMaskB, hsMaskC} = RandomInteger[{-Developer`$MaxMachineInteger, Developer`$MaxMachineInteger}, {3, MTRNGCount}];
	hsSeed = RandomInteger[{-Developer`$MaxMachineInteger, Developer`$MaxMachineInteger}, MTRNGCount];
	output = CUDAMemoryAllocate[Real, RANDN];
	mersenneTwister[output, hsMatrixA, hsMaskB, hsMaskC, hsSeed, NPerRNG, MTRNGCount];
	ListPlot[CUDAMemoryGet[output]]
	]

Memory

CUDAMemory is tied to both the kernel and device where it is loaded and thus cannot be distributed among worker kernels.

Load memory in the master kernel.

Wolfram Language code: x = CUDAMemoryLoad[{1, 2, 3}]

Then distribute the definition.

Wolfram Language code: DistributeDefinitions[x]

Distributed CUDAMemory cannot be operated on by worker kernels.

Wolfram Language code: ParallelEvaluate[CUDAMemoryGet[x]]

To load memory onto the worker kernels, use ParallelEvaluate.

Wolfram Language code: ParallelEvaluate[mem = CUDAMemoryLoad[{1, 2, 3}]]

Further operations on the memory can then be performed using ParallelEvaluate.

Wolfram Language code: ParallelEvaluate[CUDAMemoryGet[mem]]
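
Putting these steps together, each worker kernel can load its own data, read it back, and release the GPU memory when done. This sketch assumes the setup from the Introduction; varying the data by $KernelID and the name result are illustrative choices.

```wolfram
ParallelEvaluate[
	(* each worker kernel loads different data onto its own device *)
	mem = CUDAMemoryLoad[Range[3] + 10*$KernelID];
	(* copy the data back from the GPU into the worker kernel *)
	result = CUDAMemoryGet[mem];
	(* free the GPU memory once it is no longer needed *)
	CUDAMemoryUnload[mem];
	result
	]
```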

Bandwidth

In some cases, the amount of time spent transferring the data dwarfs the time spent in computation.

Here you create 10 lists of 1,000,000 random reals each.

Wolfram Language code: lsts = Table[RandomReal[1, 1000000], {ii, 10}];

Since the parallel version needs to share the large list with worker kernels, it takes considerably longer than the sequential version.

Wolfram Language code: ParallelMap[CUDASort, lsts];//AbsoluteTiming

The sequential version is faster since no data needs to be transferred to worker kernels.

Wolfram Language code: Map[CUDASort, lsts];//AbsoluteTiming
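
One way to avoid this transfer cost, sketched below, is to create the data on the worker kernels themselves so that only small results travel back to the master kernel. The names localLists and sorted and the choice of three lists per worker are assumptions for illustration; no timing is claimed here.

```wolfram
ParallelEvaluate[
	(* generate the lists on the worker, so nothing large is sent from the master kernel *)
	localLists = Table[RandomReal[1, 1000000], {3}];
	(* sort each list on this worker's GPU *)
	sorted = CUDASort /@ localLists;
	(* return only a small summary instead of the sorted data *)
	Length /@ sorted
	] // AbsoluteTiming
```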
