Failure Recovery, Tracing, and Debugging

Failure of Remote Kernels

A remote kernel in use may fail at any time, due to hardware, network, or software problems. A failure of a remote kernel will be noticed the next time Parallel Computing Toolkit tries to send a command to the kernel or tries to read a result from it. The error message Parallel::rdead is used to notify you of a failed remote kernel.
If the failed kernel had any processes assigned to it, these processes are lost. Without failure recovery, a Wait[] on one of these processes never returns, so your program never terminates.
Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.
$RecoveryMode              gives the current setting of the failure recovery mode
$RecoveryMode = None       does not perform any failure recovery
$RecoveryMode = Abandon    lets processes assigned to a failed kernel return with result $Failed (default)
$RecoveryMode = ReQueue    reassigns processes on the failed kernel to another kernel

Possible failure recovery modes.

The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.
You can use the Abandon recovery mode to implement your own failure recovery method.
Failure recovery affects only processes started with Queue[] and collected with Wait[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.
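As an illustration of a user-defined recovery method built on the Abandon mode, the following is a minimal sketch, not the toolkit's own mechanism. It assumes the toolkit is loaded and remote kernels are running, and that Wait[] applied to the process ID returned by Queue[] gives that process's result. The helper waitWithRetry and its parameter maxTries are hypothetical names introduced only for this example.

```mathematica
(* Sketch only: assumes the toolkit is loaded and remote kernels are running. *)
$RecoveryMode = Abandon;   (* processes on a failed kernel return $Failed *)

(* waitWithRetry is a hypothetical helper, not part of the toolkit.
   It queues expr and waits for the result, repeating while the
   result is $Failed, for at most maxTries attempts in total. *)
SetAttributes[waitWithRetry, HoldFirst];
waitWithRetry[expr_, maxTries_Integer?Positive] :=
  Module[{result = $Failed, tries = 0},
    While[result === $Failed && tries < maxTries,
      result = Wait[Queue[expr]];
      tries++];
    result]

(* Usage: allow up to three attempts *)
waitWithRetry[FactorInteger[2^101 - 1], 3]
```

A computation that legitimately returns $Failed cannot be told apart from a kernel failure in this sketch, and the caveat about side effects that applies to ReQueue applies equally to any retry scheme.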

Tracing and Debugging

Debugging concurrent programs can be tricky. Parallel Computing Toolkit offers a tracing facility that lets you monitor the progress of your computation. To use these features, you have to load the debugging package before loading the toolkit itself.
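The load order described above might look as follows. The context names Parallel`Debug` and Parallel` are assumptions made for this sketch, since the text does not name the packages; check them against your installation.

```mathematica
(* Assumed context names; verify against your installation. *)
Needs["Parallel`Debug`"]   (* load the debugging package first *)
Needs["Parallel`"]         (* then load the toolkit itself *)
```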
SetOptions[$DebugObject, opts...]    set debug options of Parallel Computing Toolkit
Options[$DebugObject]                gives the current debug option settings
Trace→{tracers...}                   set trace events
TraceHandler→handler                 specify how trace events should be handled; possible values include Print and Save
TraceList[]                          gives the current list of trace events
newTraceList[]                       initializes the trace list

Debugging functions.

OptionValues[Trace]    gives the list of possible tracers
MathLink               trace MathLink events
SendReceive            trace Send/Receive operations
Queueing               trace process scheduling (Queue/Wait)
SharedMemory           trace shared variable access (available only if the VirtualShared package has been loaded)


Tracing Events

To see certain events, specify the desired class of events in SetOptions[$DebugObject, Trace→{tracers...}]. From then on, every time one of the selected events occurs, a message is printed.
ParallelEvaluate uses Send and Receive internally, so tracing SendReceive events shows how the computation is divided into parts that are sent to remote kernels.
To turn off tracing, specify the empty list as tracers.
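A minimal sketch of turning tracing on and off, using only the options and tracer names listed in the tables above. It assumes the debugging package and the toolkit are already loaded.

```mathematica
(* Trace Send/Receive operations and process scheduling. *)
SetOptions[$DebugObject, Trace -> {SendReceive, Queueing}];

(* ... run your parallel computation here; with the Print handler,
   each selected event prints a message ... *)

(* Turn tracing off again by giving the empty list of tracers. *)
SetOptions[$DebugObject, Trace -> {}];
```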

Saving Trace Events

Instead of printing trace events, the toolkit can save them in a list for later analysis. First, configure the tracing system to save the events and initialize the trace list.
Now specify which events to trace as before and run your computation.
The list of events is now available in TraceList[], which is best viewed in TableForm.
To reset the list, use newTraceList[] and to end tracing, turn it off as before.
To switch back to printing trace events, set the TraceHandler option back to Print.
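The steps of this section can be put together roughly as follows. This is a sketch built from the options and functions listed under Debugging functions, assuming the debugging package and the toolkit are loaded; the choice of the Queueing tracer is only an example.

```mathematica
(* Save trace events instead of printing them, and start with an empty list. *)
SetOptions[$DebugObject, TraceHandler -> Save];
newTraceList[];

(* Select the events to trace, then run your computation. *)
SetOptions[$DebugObject, Trace -> {Queueing}];
(* ... parallel computation goes here ... *)

(* Inspect the collected events. *)
TableForm[TraceList[]]

(* Reset the list, end tracing, and return to printing events. *)
newTraceList[];
SetOptions[$DebugObject, Trace -> {}];
SetOptions[$DebugObject, TraceHandler -> Print];
```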

The Format of Trace Events


MathLink trace messages are described in the chapter Starting Remote Kernels.


A SendReceive trace message has the format
where slave is the kernel involved, expr is the expression sent or received, and n is the size of the kernel's queue.


A Queueing trace message has one of these formats (pid is a process ID, slave a remote kernel).
  • A process is queued (with Queue[]). n is the length of the queue.
  • A process is sent to a remote kernel.
  • A process has finished and has been received from a remote kernel.
  • A process has been returned to the application (inside Wait[] or WaitOne[]).


A SharedMemory trace message has the format
where slave is the kernel that accessed the shared variable, and access describes how the variable was accessed:
  • The value of the variable var was requested and val was returned.
  • The remote kernel asked to change the variable var to val. The new value val was returned.
  • The value of a part of the variable var was requested and val was returned.
  • The remote kernel asked to change a part of the variable var to val.
  • The remote kernel asked for exclusive access to the variable var, setting it to val. The request was granted because var was currently unused.
  • The remote kernel asked for exclusive access to the variable var, setting it to val. The request was denied because var already had the different value old set by another process.
  • The remote kernel released exclusive access to the variable var.
For shared downvalues, the expression var in the preceding examples will be a normal expression whose head is the shared downvalue, such as f[...].

Aborting Parallel Programs

You can interrupt and abort the local (master) kernel during a concurrent computation. Any evaluations already on remote kernels will continue to run. After an abort, you can wait for any processes still in the queues, abandon them with ResetQueues[], or abort the remote kernels with ResetSlaves[].
If you abort any other operation such as ParallelEvaluate[], you should follow it by ResetSlaves[].
ResetQueues[]    waits for any running processes to finish and clears all queues
ResetSlaves[]    aborts all remote kernels and makes them available again
CloseSlaves[]    closes the MathLink connections to all remote kernels

Recovering from interrupts and resetting remote kernels.

There is not always a reliable way to interrupt a remote kernel; ResetQueues[] waits for any running computations to finish normally to avoid an interrupt. If this takes too long, try to abort the master kernel again and then use ResetSlaves[].
ResetSlaves[] tries to abort any remote kernels that are not responding. Kernels that fail to react are closed.
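A possible cleanup sequence after aborting a computation built on Queue[] and Wait[], sketched from the descriptions above. It assumes the toolkit is loaded. The functions are alternatives to try in turn, not steps that must all be run.

```mathematica
(* First choice: let running processes finish normally and clear all queues. *)
ResetQueues[]

(* If that takes too long, abort the master kernel again, then abort the
   remote kernels; kernels that fail to react are closed. *)
ResetSlaves[]

(* When the remote kernels are no longer needed, close their MathLink connections. *)
CloseSlaves[]
```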
If you quit the local kernel while a remote one is still doing a computation, the remote kernel may continue running and should be aborted or eventually killed using the appropriate operating system command.
