Failure Recovery, Tracing, and Debugging

Failure of Remote Kernels

A remote kernel in use may fail at any time because of hardware, network, or software problems. Parallel Computing Toolkit notices the failure the next time it tries to send a command to the kernel or read a result from it, and notifies you with the error message Parallel::rdead.

If the failed kernel had any processes assigned to it, these processes are lost. If you are using Wait for one of these processes and no recovery takes place, your program will never terminate because the process will never return.

Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.

$RecoveryMode | gives the current setting of the failure recovery mode |
$RecoveryMode = None | does not perform any failure recovery |
$RecoveryMode = Abandon | lets processes assigned to a failed kernel return with result $Failed (default) |
$RecoveryMode = ReQueue | reassigns processes on the failed kernel to another kernel |
Possible failure recovery modes.

The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or if your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.

You can use the Abandon recovery mode to implement your own failure recovery method.

Failure recovery affects only processes started with Queue[] and collected with Wait[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.

Tracing and Debugging

Debugging concurrent programs can be tricky. Parallel Computing Toolkit offers a tracing facility that lets you monitor the progress of your computation. To use these features, you have to load the debugging package before loading the toolkit itself.

SetOptions[$DebugObject, opts...] | sets debug options of Parallel Computing Toolkit |
Options[$DebugObject] | gives the current debug option settings |
Trace→{tracers...} | sets trace events |
TraceHandler→handler | specifies how trace events should be handled; possible values include Print and Save |
TraceList[] | gives the current list of trace events |
newTraceList[] | initializes the trace list |
Debugging functions.

OptionValues[Trace] | gives the list of possible tracers |
MathLink | traces MathLink events |
SendReceive | traces Send/Receive operations |
Queueing | traces process scheduling (Queue/Wait) |
SharedMemory | traces shared variable access (available only if the VirtualShared package has been loaded) |
Tracers.

Tracing Events

To see certain events, specify the desired class of events in SetOptions[$DebugObject,Trace→{tracers...}]. From then on, every time one of the selected events occurs, a message is printed.
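As a minimal sketch of the call described above, the following enables individual tracers. It uses only names that appear in this section and assumes that the debugging package was loaded before the toolkit and that remote kernels are already running; the messages actually printed depend on your setup.

```mathematica
(* Trace Send/Receive operations; assumes the debugging package
   was loaded before the toolkit itself *)
SetOptions[$DebugObject, Trace -> {SendReceive}]

(* Several tracers can be selected at once, here Send/Receive
   operations together with process scheduling *)
SetOptions[$DebugObject, Trace -> {SendReceive, Queueing}]
```

Each call replaces the previous tracer selection, so the second call above traces both classes of events.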
ParallelEvaluate uses Send and Receive internally, so with the SendReceive tracer selected you can see how the computation is divided into parts that are sent to remote kernels.
To turn off tracing, specify the empty list as tracers, as in SetOptions[$DebugObject,Trace→{}].
Saving Trace Events

Instead of printing trace events, the toolkit can save them in a list for later analysis. First, configure the tracing system to save the events (TraceHandler→Save) and initialize the trace list with newTraceList[].
Now specify which events to trace as before, and run your computation.
The list of events is now available in TraceList[], which is best viewed in TableForm.
To reset the list, use newTraceList[]. To end tracing, turn it off as before.
To switch back to printing trace events, set TraceHandler→Print in SetOptions[$DebugObject, ...].
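The steps of this section can be sketched end to end as follows. This is an illustration under stated assumptions, not output from a real session: it assumes the debugging package and the toolkit are loaded, remote kernels are running, and Wait accepts a list of process IDs. Prime is used only as a sample workload.

```mathematica
(* Save trace events in a list instead of printing them *)
SetOptions[$DebugObject, TraceHandler -> Save];
newTraceList[];

(* Select the events to trace, here process scheduling *)
SetOptions[$DebugObject, Trace -> {Queueing}];

(* Run a computation: queue four processes and collect the results.
   The pure function inserts each argument before Queue sees it. *)
pids = Queue[Prime[#]] & /@ Range[4];
results = Wait[pids];

(* Inspect the recorded events *)
TableForm[TraceList[]]

(* Reset the list, end tracing, and return to printing events *)
newTraceList[];
SetOptions[$DebugObject, Trace -> {}];
SetOptions[$DebugObject, TraceHandler -> Print];
```

Saving rather than printing keeps the event record separate from the output of the computation, which makes longer runs easier to analyze afterward.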
The Format of Trace Events

MathLink

MathLink trace messages are described in the chapter Starting Remote Kernels.

SendReceive

A SendReceive trace message contains the fields slave, expr, and n, where slave is the kernel involved, expr is the expression sent or received, and n is the size of the kernel's queue.

Queueing

A Queueing trace message reports one of these events (pid is a process ID, slave a remote kernel):
- A process is queued (with Queue[]). n is the length of the queue.
- A process is sent to a remote kernel.
- A process has finished and has been received from a remote kernel.
- A process has been returned to the application (inside Wait[] or WaitOne[]).
SharedMemory

A SharedMemory trace message identifies slave, the kernel that accessed the shared variable, and access, which describes how the variable was accessed:
- The value of the variable var was requested and val was returned.
- The remote kernel asked to change the variable var to val. The new value val was returned.
- The value of a part of the variable var was requested and val was returned.
- The remote kernel asked to change a part of the variable var to val.
- The remote kernel asked for exclusive access to the variable var, setting it to val. The request was granted because var was currently unused.
- The remote kernel asked for exclusive access to the variable var, setting it to val. The request was denied because var already had the different value old set by another process.
- The remote kernel released exclusive access to the variable var.
For shared downvalues, the expression var in the preceding examples will be a normal expression whose head is the shared downvalue, such as f[...].

Aborting Parallel Programs

You can interrupt and abort the local (master) kernel during a concurrent computation. Any evaluations already on remote kernels will continue to run. After an abort, wait for any processes still in the queues, abandon them with ResetQueues[], or abort the remote kernels with ResetSlaves[]. If you abort any other operation, such as ParallelEvaluate[], follow it with ResetSlaves[].

ResetQueues[] | waits for any running processes to finish and clears all queues |
ResetSlaves[] | aborts all remote kernels and makes them available again |
CloseSlaves[] | closes the MathLink connections to all remote kernels |
Recovering from interrupts and resetting remote kernels.

There is not always a reliable way to interrupt a remote kernel. To avoid an interrupt, ResetQueues[] waits for any running computations to finish normally. If this takes too long, try to abort the master kernel again and then use ResetSlaves[]. ResetSlaves[] tries to abort any remote kernels that are not responding. Kernels that fail to react are closed.

If you quit the local kernel while a remote one is still doing a computation, the remote kernel may continue running and should be aborted or eventually killed using the appropriate operating system command.
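The recovery sequence described above might look like this in a session. It is a sketch that uses only the three functions from the table; which of them you need depends on how the remote kernels respond.

```mathematica
(* After aborting the master kernel during a Queue/Wait computation:
   let running processes finish normally and clear all queues *)
ResetQueues[]

(* If that takes too long, abort the master kernel again, then
   abort the remote kernels and make them available again *)
ResetSlaves[]

(* When the remote kernels are no longer needed,
   close the MathLink connections to them *)
CloseSlaves[]
```

Trying ResetQueues[] first avoids interrupting remote kernels, which the text notes is not always reliable.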