The functional and list-oriented characteristics of the core Wolfram Language allow CUDALink to provide immediate built-in data parallelism, automatically distributing computations across available GPU cards.
First, load the CUDALink application.
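For example:

    Needs["CUDALink`"]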
This launches as many worker kernels as there are devices.
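A minimal sketch, assuming the CUDALink variable $CUDADeviceCount reports the number of GPUs (Length[CUDAInformation[]] could be used instead); ParallelNeeds then makes CUDALink available on each worker:

    LaunchKernels[$CUDADeviceCount]
    ParallelNeeds["CUDALink`"]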
High-level CUDALink functions for image processing, linear algebra, and fast Fourier transforms can be used on different kernels like any other Wolfram Language function. The only difference is that the $CUDADevice variable is set to the device on which the computation is performed.
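One simple scheme, used in the sketches below, is to bind each worker kernel to a distinct device by its kernel ID (an illustrative convention, not the only possible mapping):

    ParallelEvaluate[$CUDADevice = $KernelID]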
Here you set the image names to be taken from the "TestImage" collection of ExampleData.
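For example, ExampleData["TestImage"] returns the available {"TestImage", name} specifications:

    imgNames = ExampleData["TestImage"];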
Distribute the variable imgNames to the worker kernels.
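That is:

    DistributeDefinitions[imgNames]

Each worker can then process its share of the images on its own GPU. The following sketch uses CUDAImageConvolve with a box-blur kernel purely as an illustration; any high-level CUDALink function could stand in its place:

    ParallelMap[
     CUDAImageConvolve[ExampleData[#], N[BoxMatrix[1]]/9] &,
     imgNames]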
This distributes the mersenneTwister function and input parameters.
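Assuming mersenneTwister was created earlier with CUDAFunctionLoad, and using nRng and nPerRng as placeholder names for the input parameters (the number of generators and the number of values per generator), the distribution is another DistributeDefinitions call:

    DistributeDefinitions[mersenneTwister, nRng, nPerRng]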
This allocates the seed values. Note that the seed evaluation must be performed on each worker kernel so that the random numbers are not correlated. The output memory is then allocated, the computation is performed, and the result is visualized.
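A sketch of such a per-kernel run; the mersenneTwister argument order is assumed, and the buffer sizes and plot are purely illustrative:

    ParallelEvaluate[
     seeds = CUDAMemoryLoad[RandomInteger[{1, 2^31 - 1}, nRng]]; (* fresh, uncorrelated seeds on each worker *)
     out = CUDAMemoryAllocate[Real, nRng*nPerRng];               (* output buffer on that worker's device *)
     mersenneTwister[out, seeds, nPerRng];                       (* assumed call signature *)
     ListPlot[Take[CUDAMemoryGet[out], 1000]]                    (* visualize a sample of the result *)
     ]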
CUDAMemory is tied to both the kernel and device where it is loaded and thus cannot be distributed among worker kernels.
Load memory in the master kernel.
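For instance, loading a small vector (the data is arbitrary):

    mem = CUDAMemoryLoad[Range[100]]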
Then distribute the definition.
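That is:

    DistributeDefinitions[mem]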
Distributed CUDAMemory cannot be operated on by worker kernels.
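For example, attempting to read the distributed memory on the workers fails; the exact messages depend on the CUDALink version:

    ParallelEvaluate[CUDAMemoryGet[mem]]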