Failure Recovery, Tracing, and Debugging

Failure of Remote Kernels
Tracing and Debugging
Aborting Parallel Programs

Failure of Remote Kernels

A remote kernel in use may fail at any time due to hardware, network, or software problems. A failure of a remote kernel will be noticed the next time Parallel Computing Toolkit tries to send a command to the kernel or tries to read a result from it. The error message Parallel::rdead is used to notify you of a failed remote kernel.

If the failed kernel had any processes assigned to it, these processes will be lost. Without failure recovery, a program that uses WaitAll to wait for one of these processes will never terminate, because the process never returns.

Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.

$RecoveryMode	gives the current setting of the failure recovery mode
$RecoveryMode = None	does not perform any failure recovery
$RecoveryMode = Abandon	lets processes assigned to a failed kernel return with result $Failed (default)
$RecoveryMode = ReQueue	reassigns processes on the failed kernel to another kernel

Possible failure recovery modes.

The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.

You can use the Abandon recovery mode to implement your own failure recovery method.

Failure recovery affects only processes started with ParallelSubmit[] and collected with WaitAll[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.
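
As a minimal sketch of building your own recovery on top of the Abandon mode, the following submits a batch, collects it, and resubmits only the entries that came back as $Failed. The function job and the single retry are illustrative assumptions, not part of the toolkit. Depending on your version, $RecoveryMode may have to be referenced with its full context, which this text does not specify.

```wolfram
(* Sketch only: assumes remote kernels are already running and that
   $RecoveryMode is accessible under the name used in this section. *)
$RecoveryMode = Abandon;

(* job is a hypothetical stand-in for your own side-effect-free computation *)
job[i_] := i^2;
DistributeDefinitions[job];

inputs = Range[8];
pids = Table[ParallelSubmit[{i}, job[i]], {i, inputs}];
results = WaitAll[pids];

(* positions of processes that were lost with a failed kernel *)
lost = Flatten[Position[results, $Failed, {1}]];

(* retry the lost inputs once on the kernels that remain *)
If[lost =!= {},
  retries = WaitAll[Table[ParallelSubmit[{i}, job[i]], {i, inputs[[lost]]}]];
  results[[lost]] = retries
];
```

Retrying by hand like this is only safe when job has no side effects, for the same reasons given for the ReQueue mode above. It also cannot distinguish a lost process from a computation that legitimately returns $Failed.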

Tracing and Debugging

Debugging concurrent programs can be tricky. Parallel Computing Toolkit offers a tracing facility that lets you monitor the progress of your computation. To use these features, you have to load the debugging package before loading the toolkit itself.

SetOptions[$Parallel,opts…]	sets debug options of Parallel Computing Toolkit
Options[$Parallel]	gives the current debug option settings
Tracers->{tracers…}	sets the trace events to monitor
TraceHandler->handler	specifies how trace events are handled; possible values include "Print" and "Save"
TraceList[]	gives the current list of trace events
newTraceList[]	initializes the trace list

Debugging functions.

OptionValues[Tracers]	gives the list of possible tracers
WSTP	trace Wolfram Symbolic Transfer Protocol (WSTP) events
SendReceive	trace Send/Receive operations
Queueing	trace process scheduling (ParallelSubmit/WaitAll)
SharedMemory	trace shared variable access

Tracers.

Tracing Events

To see certain events, specify the desired class of events with SetOptions[$Parallel,Tracers->{tracers…}]. From then on, a message is printed every time one of the selected events occurs.

For example, ParallelMap uses Send and Receive internally, so you can see how the computation is divided into parts that are sent to remote kernels.

To turn off tracing, specify the empty list as tracers.
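
The steps above can be sketched as follows. The package name Parallel`Debug` and the call to LaunchKernels[] are assumptions about the current system and are not stated in this section. The option names and the SendReceive tracer are taken from the tables above.

```wolfram
(* Assumption: the debugging package is loaded like this, in a fresh
   session, before any parallel function is used. *)
Needs["Parallel`Debug`"]
LaunchKernels[];

(* print a message for every Send/Receive event *)
SetOptions[$Parallel, Tracers -> {SendReceive}];
ParallelMap[Prime, Range[8]]

(* turn tracing off again by giving the empty list *)
SetOptions[$Parallel, Tracers -> {}];
```

The trace messages printed between the two SetOptions calls show which pieces of the list were sent to which kernel.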

Saving Trace Events

Instead of printing trace events, the toolkit can save them in a list for later analysis. First, configure the tracing system to save the events and initialize the trace list.

Now specify which events to trace as before.

Run your computation.

The list of events is now available in TraceList[], which is best viewed in TableForm.

To reset the list, use newTraceList[] and to end tracing, turn it off as before.

To switch back to printing trace events, set the TraceHandler option back to "Print".
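
A minimal sketch of the whole save-and-inspect workflow described in this subsection is shown here. It assumes the debugging package is already loaded and remote kernels are running. The small ParallelSubmit computation is a placeholder for your own program.

```wolfram
(* save trace events in a list instead of printing them *)
SetOptions[$Parallel, TraceHandler -> "Save"];
newTraceList[];

(* choose the events to trace, as before *)
SetOptions[$Parallel, Tracers -> {Queueing}];

(* placeholder computation; replace with your own program *)
WaitAll[Table[ParallelSubmit[{i}, i^2], {i, 4}]];

(* inspect the collected events *)
TableForm[TraceList[]]

(* reset the list, stop tracing, and return to printing *)
newTraceList[];
SetOptions[$Parallel, Tracers -> {}];
SetOptions[$Parallel, TraceHandler -> "Print"];
```

Saving events is usually preferable for longer runs, because printed messages interleave with the output of the computation itself.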

The Format of Trace Events

WSTP

WSTP trace messages are described in Starting Remote Kernels.

SendReceive

A SendReceive trace message has the format

where kernel is the kernel involved, expr is the expression sent or received, and n is the size of the kernel's queue.

Queueing

A Queueing trace message has one of these formats (pid is a process ID, slave a remote kernel).

A process is queued (with ParallelSubmit[]). n is the length of the queue.

A process is sent to a remote kernel.

A process has finished and has been received from a remote kernel.

A process has been returned to the application (inside WaitAll[] or WaitOne[]).

SharedMemory

A SharedMemory trace message has the format

where kernel is the kernel that accessed the shared variable, and access describes how the variable was accessed:

The value of the variable var was requested and val was returned.

The remote kernel asked to change the variable var to val. The new value val was returned.

The value of a part of the variable var was requested and val was returned.

The remote kernel asked to change a part of the variable var to val.

The remote kernel asked for exclusive access to the variable var, setting it to val. The request was granted because var was currently unused.

The remote kernel asked for exclusive access to the variable var, setting it to val. The request was denied because var already had the different value old set by another process.

The remote kernel released exclusive access to the variable var.

For shared downvalues, the expression var in the preceding examples will be a normal expression whose head is the shared downvalue, such as f[…].

Aborting Parallel Programs

You can interrupt and abort the local (master) kernel during a concurrent computation. Any evaluations already on remote kernels will continue to run. After an abort, wait for any processes still in the queues using WaitAll[], abandon them with ResetQueues[], or abort the remote kernels with AbortKernels[].

If you abort any other operation such as ParallelEvaluate[], you should follow it with AbortKernels[].

ResetQueues[]	waits for any running processes to finish and clears all queues
AbortKernels[]	aborts all remote kernels and makes them available again
CloseKernels[]	closes the WSTP connections to all remote kernels

Recovering from interrupts and resetting remote kernels.

There is not always a reliable way to interrupt a remote kernel; ResetQueues[] waits for any running computations to finish normally to avoid an interrupt. If this takes too long, try to abort the master kernel again and then use AbortKernels[].

AbortKernels[] tries to abort any remote kernels that are not responding. Kernels that fail to react are closed.

If you quit the local kernel while a remote one is still doing a computation, the remote kernel may continue running and should be aborted or eventually killed using the appropriate operating system command.
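
The recovery options after an abort can be sketched in order of increasing severity. Loading Parallel`Developer` to reach ResetQueues is an assumption about the current system. The three functions themselves are the ones listed in the table above.

```wolfram
(* Assumption: ResetQueues is provided by this context in current versions *)
Needs["Parallel`Developer`"]

(* Option 1: let running processes finish normally, then clear all queues *)
ResetQueues[];

(* Option 2: if that takes too long, abort the master kernel again, then *)
AbortKernels[];

(* Last resort: close the connections to all remote kernels *)
CloseKernels[];
```

In practice you would evaluate only the step you need rather than all three in sequence.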
