Parallel Computation

The Wolfram System compiler can run computations in parallel. It does this by threading a compiled function over lists of data in parallel. A first step is to create a compiled function with the Listable attribute.
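For example, a minimal sketch of such a definition (the body Sin[x] + x^2 is an arbitrary illustrative computation; the attribute is given with the RuntimeAttributes option):

    fun = Compile[{{x, _Real}}, Sin[x] + x^2,
      RuntimeAttributes -> {Listable}]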

When the input matches the type specification of the compiled function, it works normally. In the following example, the real number input matches the type of the compiled function and the function executes.
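For example, with the sketch above:

    fun[2.5]   (* a single real result *)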

Here the compiled function receives a list; since this has higher rank than the declared input, it threads over the list.
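Continuing with the sketch:

    fun[{1.5, 2.5, 3.5}]   (* a list of three real results *)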

A listable compiled function can also run in parallel. This is done by setting the Parallelization option to True.
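A parallel version of the same illustrative function differs only in the added option:

    funPar = Compile[{{x, _Real}}, Sin[x] + x^2,
      RuntimeAttributes -> {Listable}, Parallelization -> True]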

When the input matches the type specification of the compiled function, it works normally.
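For example:

    funPar[2.5]   (* a scalar input executes normally *)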

Here the compiled function receives a list; since this has higher rank than the declared input, it threads over the list. It also does this in parallel.
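For example:

    funPar[{1.5, 2.5, 3.5}]   (* threads over the list, in parallel *)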

Computation Speedup

A main reason for using parallelization is to speed up the computation. This works well in many cases.

The following creates and runs a compiled function that runs sequentially.
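A sketch of such an experiment (the loop body is an arbitrary computation chosen to take a measurable amount of time; the names and data size are illustrative):

    funSerial = Compile[{{x, _Real}},
      Module[{sum = 0.}, Do[sum += Sin[x + i], {i, 200}]; sum],
      RuntimeAttributes -> {Listable}];
    data = RandomReal[1, 10^5];
    AbsoluteTiming[funSerial[data];]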

This creates an equivalent compiled function that will run in parallel. On this machine it runs twice as fast.
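The parallel counterpart differs only in the Parallelization option:

    funParallel = Compile[{{x, _Real}},
      Module[{sum = 0.}, Do[sum += Sin[x + i], {i, 200}]; sum],
      RuntimeAttributes -> {Listable}, Parallelization -> True];
    AbsoluteTiming[funParallel[data];]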

The number of parallel threads used to run the compiled function defaults to the setting of $ProcessorCount.

CompilationTarget

You can get further acceleration for the parallel compiled function by setting CompilationTarget to "C".
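For example, adding the option to the sketch above (this assumes a C compiler is installed):

    funParC = Compile[{{x, _Real}},
      Module[{sum = 0.}, Do[sum += Sin[x + i], {i, 200}]; sum],
      RuntimeAttributes -> {Listable}, Parallelization -> True,
      CompilationTarget -> "C"];
    AbsoluteTiming[funParC[data];]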

External Calls

An important issue for parallel computation is how to handle two different threads trying to do the same thing at the same time. For parallel computation with compiled functions, this is handled automatically.

The following function definition also increments a counter.
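A minimal sketch (the names counter and addCount are illustrative):

    counter = 0;
    addCount[x_] := (counter++; x)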

This parallel compiled function calls the external function.
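Since addCount is not compilable, the compiler generates a callback to the Wolfram Language kernel (a MainEvaluate call) at that point:

    cfun = Compile[{{x, _Real}}, addCount[x] + 1.,
      RuntimeAttributes -> {Listable}, Parallelization -> True]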

This executes the compiled function in parallel over a list of input data.
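For example, with 100 inputs:

    pts = RandomReal[1, 100];
    cfun[pts];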

The counter has been incremented the correct number of times.
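Checking the counter:

    counter   (* 100 *)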

In fact, when a compiled function running in parallel makes an external call, the call is always protected by synchronization primitives so that only one thread can actually make the call at a time. This means that if a parallel compiled function makes many external calls, you will not get good parallel acceleration.

Random Numbers

Many computations with random numbers, such as Monte Carlo methods, can be sped up significantly when done in parallel. However, fast and effective use of random numbers in parallel requires having generators on separate execution threads that operate independently and generate random numbers that are statistically independent of the numbers generated on other threads. For this reason, the random generators that the Wolfram Language uses by default for parallel computations are different from the one used for serial computation, and so the actual random numbers will necessarily be different. Further compounding the issue is that for a given parallel computation, the execution thread assigned to a particular part of the computation may be different in different runs, so results may vary from run to run even with the same initial random state.

The following creates two CompiledFunction objects that simulate a random walk of n steps on a lattice, one that runs serially and one that runs in parallel for multiple simulations.
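A sketch of such a pair, for a walk on a two-dimensional lattice (the step logic is illustrative):

    rwSerial = Compile[{{n, _Integer}},
      Module[{x = 0, y = 0},
       Do[If[RandomInteger[] == 0,
          x += 2 RandomInteger[] - 1,
          y += 2 RandomInteger[] - 1], {n}];
       {x, y}],
      RuntimeAttributes -> {Listable}];
    rwParallel = Compile[{{n, _Integer}},
      Module[{x = 0, y = 0},
       Do[If[RandomInteger[] == 0,
          x += 2 RandomInteger[] - 1,
          y += 2 RandomInteger[] - 1], {n}];
       {x, y}],
      RuntimeAttributes -> {Listable}, Parallelization -> True];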

The serial one gives the same result every time when run within BlockRandom.
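For example (the seed value is arbitrary):

    r1 = BlockRandom[SeedRandom[42]; rwSerial[ConstantArray[1000, 8]]];
    r2 = BlockRandom[SeedRandom[42]; rwSerial[ConstantArray[1000, 8]]];
    r1 === r2   (* True *)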

However, run in parallel, the results are different.
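For example:

    p1 = BlockRandom[SeedRandom[42]; rwParallel[ConstantArray[1000, 8]]];
    p2 = BlockRandom[SeedRandom[42]; rwParallel[ConstantArray[1000, 8]]];
    p1 === p2   (* typically False *)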

Much of the difference is in the ordering of results. You can see that many are actually the same using Intersection.
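For example:

    Intersection[p1, p2]   (* the walks that match, in canonical order *)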

The ones that are the same were run on the same execution thread each time.

Typically when you use BlockRandom or SeedRandom, you will want to do it outside the parallel computation. If you use these commands inside a parallel evaluation, they only affect the current execution thread, and there are subtleties, described in "SeedRandom and BlockRandom in Parallel Computations" in "Random Number Generation", that are good to be aware of.

You can change the random generators for parallel computation using SeedRandom[seed,Method->"ParallelGenerator"]. The default random generators for parallel computation are a set of 1024 Mersenne Twister generators that produce good-quality random numbers. The Wolfram Language uses a different one of these generators for each execution thread in a parallel computation. An important feature is that each generator produces random numbers independent of the numbers produced on the others, so there should be no correlations between computations done on different execution threads. A demonstration of this is to do a standard test of random numbers, called the blocking test.
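Before running the test, the parallel generators can be seeded explicitly (the seed value is arbitrary):

    SeedRandom[1234, Method -> "ParallelGenerator"]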

The blocking test checks that sample means of numbers generated from a given distribution converge to the normal distribution, as they should by the central limit theorem. If they do not converge, this indicates a possible problem with the random number generator.

The following defines a CompiledFunction that will get the sample mean from n integers that are equally likely to be 0 or 1, and will run in parallel when given a list argument.
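A sketch of such a function (the accumulator is initialized as a real so the division gives a real result):

    sampleMean = Compile[{{n, _Integer}},
      Module[{s = 0.}, Do[s += RandomInteger[], {n}]; s/n],
      RuntimeAttributes -> {Listable}, Parallelization -> True];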

Using this, define functions that generate m sample means of n bits in serial and in parallel.
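Since a scalar argument runs normally and a list argument threads in parallel, the two drivers might be (the names are illustrative):

    serialMeans[m_, n_] := Table[sampleMean[n], {m}];
    parallelMeans[m_, n_] := sampleMean[ConstantArray[n, m]];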

From the central limit theorem, the sample means should be approximately normally distributed with mean 1/2 and standard deviation 1/(2 Sqrt[n]).
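In Wolfram Language terms:

    dist[n_] := NormalDistribution[1/2, 1/(2 Sqrt[n])]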

Now get a set of sample means in serial and parallel.
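For example, with illustrative sizes:

    {n, m} = {1000, 10000};
    AbsoluteTiming[serialData = serialMeans[m, n];]
    AbsoluteTiming[parallelData = parallelMeans[m, n];]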

Note that the parallel computation is significantly faster.

Show the histograms of the serial and parallel data compared with the PDF of the expected distribution.
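A sketch of the comparison (the plot range covers roughly four standard deviations around the mean):

    Show[Histogram[serialData, Automatic, "PDF"],
     Plot[PDF[dist[n], x], {x, 0.44, 0.56}]]
    Show[Histogram[parallelData, Automatic, "PDF"],
     Plot[PDF[dist[n], x], {x, 0.44, 0.56}]]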

It is hard to tell visually if one is better than the other. A better way is to use DistributionFitTest to get the p-value for the goodness-of-fit hypothesis test, with the null hypothesis that the data has the same distribution as the expected distribution and the alternative hypothesis that it does not.
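For example:

    DistributionFitTest[serialData, dist[n]]
    DistributionFitTest[parallelData, dist[n]]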

The p-value computed by DistributionFitTest is itself a random quantity that, under the null hypothesis, should follow the (continuous) uniform distribution, so the best information comes from comparing the parallel and serial results over many runs.

Parallel Controls

Compiled functions execute in parallel using multiple threads of execution. The number of threads is set initially to be $ProcessorCount.

The actual setting can be modified with SystemOptions using the "ParallelThreadNumber" suboption of "ParallelOptions".
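For example:

    SystemOptions["ParallelOptions"]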

This demonstrates a parallel compiled function.
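Reusing the illustrative parallel function and data from the speedup example above:

    AbsoluteTiming[funParallel[data];]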

Now the number of threads is set to 1, which forces the compiled function to run sequentially.
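For example (restoring the default afterward):

    SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> 1}];
    AbsoluteTiming[funParallel[data];]
    SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> $ProcessorCount}];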

The time to run is the same as if Parallelization -> False had been set.