Failure Recovery, Tracing, and Debugging

Failure of Remote Kernels

A remote kernel in use may fail at any time due to hardware, network, or software problems. A failure of a remote kernel will be noticed the next time Parallel Computing Toolkit tries to send a command to the kernel or tries to read a result from it. The error message Parallel::rdead is used to notify you of a failed remote kernel.

If the failed kernel had any processes assigned to it, these processes will be lost. If you are using Wait for one of these processes, your program will never terminate because the process will never return.

Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.

$RecoveryMode              gives the current setting of the failure recovery mode
$RecoveryMode = None       does not perform any failure recovery
$RecoveryMode = Abandon    lets processes assigned to a failed kernel return with result $Failed (default)
$RecoveryMode = ReQueue    reassigns processes on the failed kernel to another kernel

Possible failure recovery modes.
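
As a minimal sketch of how the settings in the table might be used, assuming the toolkit is already loaded and at least one remote kernel is running:

```wolfram
(* Inspect the current failure recovery mode. *)
$RecoveryMode

(* Ask the toolkit to reassign processes from a failed kernel
   to another available kernel. *)
$RecoveryMode = ReQueue;
```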

The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.

You can use the Abandon recovery mode to implement your own failure recovery method.
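
One possible recovery method is sketched below: collect the results, look for $Failed entries, and submit those inputs once more. The helper submitAll, the variable names, and the example computation are hypothetical and not part of the toolkit. The sketch also assumes that the computation itself never legitimately returns $Failed.

```wolfram
$RecoveryMode = Abandon;

(* Hypothetical helper: submit one process per input value. *)
submitAll[ns_List] := Table[ParallelSubmit[{n}, PrimeQ[2^n - 1]], {n, ns}];

inputs = Range[2, 20];
results = WaitAll[submitAll[inputs]];

(* Positions of processes that were abandoned on a failed kernel. *)
failed = Flatten[Position[results, $Failed, {1}, Heads -> False]];

(* Retry the abandoned inputs once on the remaining kernels. *)
If[failed =!= {},
  results[[failed]] = WaitAll[submitAll[inputs[[failed]]]]
];

results
```

A single retry keeps the sketch short. A real program might repeat the retry until no $Failed entries remain, or fall back to evaluating the remaining inputs on the local kernel.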

Failure recovery affects only processes started with ParallelSubmit[] and collected with WaitAll[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.

Tracing and Debugging

Debugging concurrent programs can be tricky. Parallel Computing Toolkit offers a tracing facility that lets you monitor the progress of your computation. To use these features, you have to load the debugging package before loading the toolkit itself.
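
A sketch of loading the debugging package in a fresh session. The context name Parallel`Debug` is an assumption and is not stated in the text above.

```wolfram
(* Load the debugging package first, before any parallel
   function has been used in this session. *)
Needs["Parallel`Debug`"]
```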

SetOptions[$Parallel,opts]    set debug options of Parallel Computing Toolkit
Options[$Parallel]            gives the current debug option settings
Tracers->{tracers}            set trace events
TraceHandler->handler         specify how trace events should be handled; possible values include "Print" and "Save"
TraceList[]                   gives the current list of trace events
newTraceList[]                initializes the trace list

Debugging functions.

OptionValues[Tracers]    gives the list of possible tracers
WSTP                     trace Wolfram Symbolic Transfer Protocol (WSTP) events
SendReceive              trace Send/Receive operations
Queueing                 trace process scheduling (ParallelSubmit/WaitAll)
SharedMemory             trace shared variable access

Tracers.

Tracing Events

To see certain events, specify the desired class of events in SetOptions[$Parallel,Tracers->{tracers}]. From then on, a message is printed every time one of the selected events occurs.

For example, ParallelMap uses Send and Receive internally, so you can see how the computation is divided into parts that are sent to remote kernels.
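
A sketch of such a session, assuming the debugging package is loaded and remote kernels are running. The symbols f and a through e are left undefined on purpose so that the traced expressions are easy to recognize.

```wolfram
(* Print a message for every Send/Receive operation. *)
SetOptions[$Parallel, Tracers -> {SendReceive}];

(* The trace messages show which parts go to which kernel. *)
ParallelMap[f, {a, b, c, d, e}]
```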

To turn off tracing, specify the empty list as tracers.
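
For instance, under the same assumptions as before:

```wolfram
(* An empty tracer list disables all trace events. *)
SetOptions[$Parallel, Tracers -> {}];
```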

Saving Trace Events

Instead of printing trace events, the toolkit can save them in a list for later analysis. First, configure the tracing system to save the events and initialize the trace list.
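
A sketch of this setup step, using the option value "Save" and the function newTraceList[] listed in the table of debugging functions:

```wolfram
(* Collect trace events in a list instead of printing them. *)
SetOptions[$Parallel, TraceHandler -> "Save"];

(* Start with an empty trace list. *)
newTraceList[];
```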

Now specify which events to trace as before.
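
For example, to record process scheduling events (the choice of the Queueing tracer here is only an illustration):

```wolfram
(* Record ParallelSubmit/WaitAll scheduling events. *)
SetOptions[$Parallel, Tracers -> {Queueing}];
```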

Run your computation.
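
Any computation that uses ParallelSubmit and WaitAll produces scheduling events. The following is an arbitrary illustration, not taken from the original text:

```wolfram
(* Submit six independent processes and wait for all results. *)
WaitAll[Table[ParallelSubmit[{i}, Prime[10^i]], {i, 1, 6}]]
```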

The list of events is now available in TraceList[], which is best viewed in TableForm.
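
For instance:

```wolfram
(* Show the saved events, one event per row. *)
TableForm[TraceList[]]
```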

To reset the list, use newTraceList[] and to end tracing, turn it off as before.

To switch back to printing trace events, use the following.
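
A sketch of the corresponding setting, using the "Print" value listed in the table of debugging functions:

```wolfram
(* Print trace events as they occur instead of saving them. *)
SetOptions[$Parallel, TraceHandler -> "Print"];
```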

The Format of Trace Events

WSTP

WSTP trace messages are described in Starting Remote Kernels.

SendReceive

A SendReceive trace message has the format

or

where kernel is the kernel involved, expr is the expression sent or received, and n is the size of the kernel's queue.

Queueing

A Queueing trace message has one of these formats (pid is a process ID, slave a remote kernel).

SharedMemory

A SharedMemory trace message has the format

where kernel is the kernel that accessed the shared variable, and access describes how the variable was accessed:

Aborting Parallel Programs

You can interrupt and abort the local (master) kernel during a concurrent computation. Any evaluations already on remote kernels will continue to run. After an abort, wait for any processes still in the queues using Wait, abandon them with ResetQueues, or abort the remote kernels with AbortKernels[].

If you abort any other operation such as ParallelEvaluate[], you should follow it by AbortKernels[].

ResetQueues[]     waits for any running processes to finish and clears all queues
AbortKernels[]    aborts all remote kernels and makes them available again
CloseKernels[]    closes the WSTP connections to all remote kernels

Recovering from interrupts and resetting remote kernels.

There is not always a reliable way to interrupt a remote kernel; ResetQueues[] waits for any running computations to finish normally to avoid an interrupt. If this takes too long, try to abort the master kernel again and then use AbortKernels[].
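
The recovery sequence described above might look like the following sketch, entered after the master kernel has been aborted during a parallel computation:

```wolfram
(* First attempt: let running processes finish, then clear all queues. *)
ResetQueues[]

(* If ResetQueues[] takes too long, abort the master kernel again
   and then abort the remote kernels explicitly. *)
AbortKernels[]
```

The two commands are alternatives rather than a fixed script: AbortKernels[] is only needed when ResetQueues[] does not return in reasonable time, or after aborting an operation such as ParallelEvaluate[].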

AbortKernels[] tries to abort any remote kernels that are not responding. Kernels that fail to react are closed.

If you quit the local kernel while a remote one is still doing a computation, the remote kernel may continue running. In that case, abort it or, if necessary, kill it using the appropriate operating system command.