A remote kernel in use may fail at any time because of hardware, network, or software problems. Parallel Computing Toolkit notices the failure the next time it tries to send a command to the kernel or read a result from it. The error message Parallel::rdead notifies you of a failed remote kernel.
If the failed kernel had any processes assigned to it, these processes are lost. Without any recovery, a program that uses Wait for one of these processes would never terminate, because the process never returns.
Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The chosen behavior is determined by the value of the variable $RecoveryMode.
The recovery mode that reassigns commands lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or if your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. A deadlock is also possible if a process on a failed kernel acquired, but never released, a shared resource.
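The difference between the two behaviors can be illustrated with a small simulation. The following is a minimal sketch in Python, not the toolkit's implementation: the names run_all, KernelDead, healthy, and broken are hypothetical, kernels are modeled as plain functions, and the string "$Failed" merely stands in for the toolkit's failure result.

```python
from collections import deque

FAILED = "$Failed"  # stand-in for the toolkit's failure result (assumption)


class KernelDead(Exception):
    """Raised by a simulated kernel that has failed."""


def run_all(tasks, kernels, requeue):
    """Run each task on some kernel and return the results in task order.

    tasks   -- list of zero-argument callables (the submitted commands)
    kernels -- list of callables kernel(task) that run a task or raise KernelDead
    requeue -- True: reassign lost tasks to a surviving kernel;
               False: lost tasks keep the result FAILED
    """
    alive = list(kernels)
    pending = deque(enumerate(tasks))
    results = [FAILED] * len(tasks)
    turn = 0
    while pending and alive:
        index, task = pending.popleft()
        kernel = alive[turn % len(alive)]  # simple round-robin assignment
        turn += 1
        try:
            results[index] = kernel(task)
        except KernelDead:
            alive.remove(kernel)  # never schedule on this kernel again
            if requeue:
                pending.append((index, task))  # resubmit to another kernel
    return results


def healthy(task):
    """A simulated kernel that always works."""
    return task()


def broken(task):
    """A simulated kernel that has died."""
    raise KernelDead()


tasks = [lambda i=i: i * i for i in range(4)]
requeued = run_all(tasks, [healthy, broken], requeue=True)
abandoned = run_all(tasks, [healthy, broken], requeue=False)
```

With reassignment, every task completes as long as one kernel survives; without it, the task that was on the dead kernel ends with the failure result. The sketch also hints at the side-effect caveat: a task that changed shared state before its kernel died would apply that change a second time when it is resubmitted.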
You can use the Abandon recovery mode to implement your own failure recovery method.
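One possible shape for such a custom recovery method is a bounded retry with an optional fallback. The following is a sketch in Python under stated assumptions, not toolkit code: submit_and_wait is a hypothetical function that submits a task, waits for it, and returns the failure result if the kernel died; with_retries and its parameters are likewise invented for illustration.

```python
FAILED = "$Failed"  # stand-in for the toolkit's failure result (assumption)


def with_retries(submit_and_wait, task, max_attempts=3, fallback=None):
    """Resubmit a task until it succeeds or max_attempts is reached.

    submit_and_wait -- hypothetical callable: runs task remotely and returns
                       its result, or FAILED if the remote kernel failed
    fallback        -- optional callable used when all attempts fail,
                       for example to evaluate the task locally
    """
    for _ in range(max_attempts):
        result = submit_and_wait(task)
        if result != FAILED:
            return result
    if fallback is not None:
        return fallback(task)
    return FAILED


# Usage: a simulated submission that fails twice before succeeding.
attempts = {"count": 0}


def flaky_submit(task):
    attempts["count"] += 1
    return FAILED if attempts["count"] < 3 else task()


answer = with_retries(flaky_submit, lambda: 6 * 7)
```

Unlike automatic reassignment, a scheme like this lets the caller decide which tasks are safe to repeat, which matters when tasks have side effects.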
Failure recovery affects only processes started with Queue and collected with Wait. Other parallel commands, such as ParallelEvaluate, cannot handle a failed remote kernel and always return $Failed in such cases.