Failure Recovery, Tracing, and Debugging
Failure of Remote Kernels
A remote kernel in use may fail at any time due to hardware, network, or software problems. Parallel Computing Toolkit notices the failure the next time it tries to send a command to the kernel or read a result from it, and reports it with the message Parallel::rdead.
If the failed kernel had any processes assigned to it, these processes will be lost. If you are using Wait for one of these processes, your program will never terminate because the process will never return.
Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.
$RecoveryMode | gives the current setting of the failure recovery mode |
$RecoveryMode = None | does not perform any failure recovery |
$RecoveryMode = Abandon | lets processes assigned to a failed kernel return with result $Failed (default) |
$RecoveryMode = ReQueue | reassigns processes on the failed kernel to another kernel |
Possible failure recovery modes.
The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.
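As a minimal sketch of selecting a recovery mode, the following assumes the toolkit is already loaded so that $RecoveryMode and the mode names from the table above are on the context path. It is only appropriate when the submitted computations have no side effects, as discussed above.

```wolfram
(* inspect the current failure recovery mode *)
$RecoveryMode

(* reassign processes from a failed kernel to another available kernel;
   only safe if the remote computations have no side effects *)
$RecoveryMode = ReQueue;
```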
You can use the Abandon recovery mode to implement your own failure recovery method.
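One way such a custom recovery method might look is sketched below. The helper submit, the input list, and the PrimeQ workload are hypothetical and chosen only for illustration. The sketch assumes that a legitimate result is never $Failed and that at least one remote kernel is still usable for the retry.

```wolfram
$RecoveryMode = Abandon;

inputs = Range[2, 11];   (* hypothetical inputs *)

(* hypothetical helper: submit one process per input; With injects the
   current value of n into the expression before it is submitted *)
submit[ns_List] :=
  Table[With[{m = n}, ParallelSubmit[PrimeQ[2^m - 1]]], {n, ns}];

(* processes lost to a failed kernel come back as $Failed *)
results = WaitAll[submit[inputs]];

(* positions of the lost results *)
failed = Flatten[Position[results, $Failed, {1}]];

(* resubmit only the lost computations on the remaining kernels *)
If[failed =!= {},
  results[[failed]] = WaitAll[submit[inputs[[failed]]]]
];
```

A single retry is shown for brevity. If another kernel fails during the retry, results may still contain $Failed, so a real implementation would probably loop with a retry limit.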
Failure recovery affects only processes started with ParallelSubmit[] and collected with WaitAll[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.
Tracing and Debugging
Debugging concurrent programs can be tricky. Parallel Computing Toolkit offers a tracing facility that lets you monitor the progress of your computation. To use these features, you have to load the debugging package before loading the toolkit itself.
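A sketch of the load order follows. The package name Parallel`Debug` is an assumption based on current Wolfram Language releases and is not given in this section; check the name used by your installation.

```wolfram
(* assumption: the debugging package is called Parallel`Debug`;
   load it first, before the toolkit or any parallel command is used *)
Needs["Parallel`Debug`"]

(* once the toolkit is loaded, confirm that the debug options are available *)
Options[$Parallel]
```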
SetOptions[$Parallel,opts…] | sets the debug options of Parallel Computing Toolkit |
Options[$Parallel] | gives the current debug option settings |
Tracers->{tracers…} | specifies which events to trace |
TraceHandler->handler | specify how trace events should be handled; possible values include "Print" and "Save" |
TraceList[] | gives the current list of trace events |
newTraceList[] | initializes the trace list |
OptionValues[Tracers] | gives the list of possible tracers |
WSTP | trace Wolfram Symbolic Transfer Protocol (WSTP) events |
SendReceive | trace Send/Receive operations |
Queueing | trace process scheduling (ParallelSubmit/WaitAll) |
SharedMemory | trace shared variable access |
Tracing Events
To see certain events, specify the desired classes of events in SetOptions[$Parallel,Tracers->{tracers…}]. From then on, a message is printed every time one of the selected events occurs.
For example, ParallelMap uses Send and Receive internally, so tracing SendReceive events shows how the computation is divided into parts that are sent to remote kernels.
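A minimal sketch of this, assuming the debugging package is loaded and remote kernels are running. The function and argument list passed to ParallelMap are arbitrary placeholders.

```wolfram
(* print a message for every Send/Receive operation *)
SetOptions[$Parallel, Tracers -> {SendReceive}];

(* placeholder workload; each piece sent to a remote kernel is traced *)
ParallelMap[Prime, Range[8]]
```

The printed trace messages depend on the number of remote kernels and on how the toolkit splits the list, so no output is shown here.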
To turn off tracing, specify the empty list as tracers.
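With the option names from the table above, turning tracing off would look like this:

```wolfram
(* an empty tracer list disables all trace events *)
SetOptions[$Parallel, Tracers -> {}];
```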
Saving Trace Events
Instead of printing trace events, the toolkit can save them in a list for later analysis. First, set TraceHandler to "Save" and initialize the trace list with newTraceList[].
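Using the option and function names listed in the table above, this setup step can be sketched as:

```wolfram
(* collect trace events in a list instead of printing them *)
SetOptions[$Parallel, TraceHandler -> "Save"];

(* start with an empty trace list *)
newTraceList[];
```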
Now specify which events to trace as before.
The list of events is now available in TraceList[], which is best viewed in TableForm.
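Putting the last two steps together, a session might look like the following sketch. The ParallelMap call is a placeholder workload, and the saved events will differ from run to run.

```wolfram
(* choose the events to record, as before *)
SetOptions[$Parallel, Tracers -> {SendReceive}];

(* placeholder workload; run any parallel computation here *)
ParallelMap[Prime, Range[8]];

(* show the saved events, one per row *)
TableForm[TraceList[]]
```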
To reset the list, use newTraceList[]. To end tracing, set Tracers to the empty list as before.
To switch back to printing trace events, set TraceHandler back to "Print".
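Using the TraceHandler values listed in the table above, this amounts to:

```wolfram
(* print trace events again instead of saving them *)
SetOptions[$Parallel, TraceHandler -> "Print"];
```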
The Format of Trace Events
WSTP
WSTP trace messages are described in Starting Remote Kernels.
SendReceive
A SendReceive trace message has the format
where kernel is the kernel involved, expr is the expression sent or received, and n is the size of the kernel's queue.
Queueing
A Queueing trace message has one of these formats (pid is a process ID, slave a remote kernel).
SharedMemory
A SharedMemory trace message has the format
where kernel is the kernel that accessed the shared variable, and access describes how the variable was accessed:
For shared downvalues, the expression var in the preceding examples will be a normal expression whose head is the shared downvalue, such as f[…].
Aborting Parallel Programs
You can interrupt and abort the local (master) kernel during a concurrent computation. Any evaluations already on remote kernels will continue to run. After an abort, wait for any processes still in the queues using Wait, abandon them with ResetQueues[], or abort the remote kernels with AbortKernels[].
If you abort any other operation, such as ParallelEvaluate[], follow the abort with AbortKernels[].
ResetQueues[] | waits for any running processes to finish and clears all queues |
AbortKernels[] | aborts all remote kernels and makes them available again |
CloseKernels[] | closes the WSTP connections to all remote kernels |
Recovering from interrupts and resetting remote kernels.
There is not always a reliable way to interrupt a remote kernel; ResetQueues[] waits for any running computations to finish normally to avoid an interrupt. If this takes too long, try to abort the master kernel again and then use AbortKernels[].
AbortKernels[] tries to abort any remote kernels that are not responding. Kernels that fail to react are closed.
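The escalation described above can be sketched as a sequence of commands entered by hand after aborting the master kernel. It assumes the three functions from the table are on the context path. The commands are not meant to be run as one script, since each later step is needed only if the previous one does not help.

```wolfram
(* step 1: let running processes finish normally, then clear all queues *)
ResetQueues[]

(* step 2: if step 1 takes too long, abort the master kernel again,
   then abort the remote kernels and make them available again *)
AbortKernels[]

(* step 3: to give up on the remote kernels entirely,
   close the connections to them *)
CloseKernels[]
```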
If you quit the local kernel while a remote one is still doing a computation, the remote kernel may continue running and should be aborted or eventually killed using the appropriate operating system command.