Running CUDALink in Headless Mode

In some environments, such as clusters, a machine may run in headless mode, that is, without starting a GUI server such as X. In that case the operating system may not set up the device permissions correctly for the user, which causes CUDALink to fail. This document shows how to set up the proper permissions in those cases.

Headless CUDALink on Linux

Linux has multiple initialization runlevels. On clusters, the runlevel that starts X (usually runlevel 5) is typically never entered. Because the NVIDIA device files are normally created when X starts in that runlevel, they may be missing on such machines, and CUDAQ may return False if the system is not set up properly.
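To see which runlevel a node is actually in, a minimal sketch follows. It assumes a SysV-compatible runlevel command whose output has the form "previous current" (for example "N 3"); the helper names parse_runlevel and current_runlevel are hypothetical and not part of CUDALink.

```shell
#!/bin/bash
# Extract the current runlevel from runlevel-style output such as "N 3"
# (previous runlevel, then current runlevel). Hypothetical helper.
parse_runlevel() {
    echo "${1##* }"
}

# Print the current runlevel, or return 1 if the `runlevel` command
# is unavailable or fails (assumption: a SysV-compatible tool exists).
current_runlevel() {
    out=$(runlevel 2>/dev/null) || return 1
    parse_runlevel "$out"
}
```

If current_runlevel prints 3 rather than 5, X was most likely never started on that machine, and the device files may need to be created by hand as described below.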

To check whether your machine is set up properly, list the NVIDIA device files and inspect their permissions:

$ ls -l /dev/nvidia*
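The same check can be scripted. The following is a minimal sketch, not part of CUDALink: check_dev_node is a hypothetical helper that reports whether a path is a character device that the current user can both read and write.

```shell
#!/bin/bash
# Report whether a path is a character device that the current user
# can read and write. Returns 0 when usable, 1 otherwise.
# check_dev_node is a hypothetical helper name.
check_dev_node() {
    if [ ! -c "$1" ]; then
        echo "$1: missing or not a character device"
        return 1
    fi
    if [ ! -r "$1" ] || [ ! -w "$1" ]; then
        echo "$1: not readable and writable by $(id -un)"
        return 1
    fi
    echo "$1: ok"
}
```

For example, calling check_dev_node on /dev/nvidiactl and on /dev/nvidia0 as the user who runs CUDALink shows whether that user has the access the driver needs.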

If no /dev/nvidia* files are found, or their permissions are not correct, the following simple script fixes the problem (it needs to be run as root). It creates each device file with mode 666, which ls -l displays as crw-rw-rw-.

#!/bin/bash

# Load the NVIDIA kernel module; nothing else can work without it.
/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then

    # Count the number of NVIDIA controllers found.
    N3D=$(/sbin/lspci | grep -i NVIDIA | grep -c "3D controller")
    NVGA=$(/sbin/lspci | grep -i NVIDIA | grep -c "VGA compatible controller")

    # Create one device file per controller, numbered from 0.
    N=$((N3D + NVGA - 1))
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done

    # Create the control device.
    mknod -m 666 /dev/nvidiactl c 195 255

else
    exit 1
fi
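Before running the script as root, it can help to preview what it would do. The sketch below mirrors the counting and mknod steps of the script above without touching /dev; count_nvidia_controllers and plan_nvidia_nodes are hypothetical helper names introduced here for illustration.

```shell
#!/bin/bash
# Count NVIDIA "3D controller" and "VGA compatible controller" lines in
# lspci-style text read from standard input. Hypothetical helper.
count_nvidia_controllers() {
    grep -i nvidia | grep -c -E '3D controller|VGA compatible controller'
}

# Print, without running them, the mknod commands for the number of
# controllers given as the first argument. Hypothetical helper.
plan_nvidia_nodes() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    # The control device always uses minor number 255.
    echo "mknod -m 666 /dev/nvidiactl c 195 255"
}
```

A dry run would then be: plan_nvidia_nodes "$(/sbin/lspci | count_nvidia_controllers)". If the printed list does not match the GPUs you expect, adjust the lspci filter before creating any device files.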

If you are not root, ask the system administrator to run the script. If the script fixes the issue, the system administrator should add the following script to the rc3 (runlevel 3) startup scripts so that the proper permissions are set up at every boot.

#!/bin/bash
#
# /etc/init.d/cuda startup script for nvidia driver
# symlink from /etc/rc3.d/S80cuda in non xdm environments
#
# Creates devices, sets persistent and compute-exclusive mode
# Useful for compute nodes in runlevel 3 w/o X11 running
#
# chkconfig: 345 80 20

# Source function library
. /lib/lsb/init-functions

# Alias RHEL's success and failure functions
success() {
    log_success_msg "$@"
}
failure() {
    log_failure_msg "$@"
}

# Create /dev nodes
function createdevs() {
    # Count the number of NVIDIA controllers
    # (-i: match both "nVidia" and "NVIDIA" vendor strings)
    N=`/sbin/lspci -m | /bin/egrep -ci '(3D|VGA).+controller.+nVidia'`

    # Create Devices, exit on failure
    while [ ${N} -gt 0 ]
    do
        let N-=1
        /bin/mknod -m 666 /dev/nvidia${N} c 195 ${N} || exit $?
    done
    /bin/mknod -m 666 /dev/nvidiactl c 195 255 || exit $?
}

# Remove /dev nodes
function removedevs() {
/bin/rm -f /dev/nvidia*
}

# Set compute-exclusive
function setcomputemode() {
    # Count the number of NVIDIA controllers
    # (-i: match both "nVidia" and "NVIDIA" vendor strings)
    N=`/sbin/lspci -m | /bin/egrep -ci '(3D|VGA).+controller.+nVidia'`
    # Set compute-exclusive mode, continue on failures
    while [ $N -gt 0 ]
    do
        let N-=1
        /usr/bin/nvidia-smi -c 1 -g ${N} > /dev/null
    done
}

# Start daemon
function start() {
    echo -n $"Loading nvidia kernel module: "
    /sbin/modprobe nvidia && success || { failure ; exit 1 ;}
    echo -n $"Creating CUDA /dev entries: "
    createdevs && success || { failure ; exit 1 ;}
    echo $"Setting CUDA compute-exclusive mode."
    setcomputemode
    echo $"Starting nvidia-smi for persistence."
    /usr/bin/nvidia-smi -l -i 60 > /dev/null &
}

# Stop daemon
function stop() {
    echo $"Killing nvidia-smi."
    /usr/bin/killall nvidia-smi
    echo -n $"Unloading nvidia kernel module: "
    sleep 1
    /sbin/rmmod -f nvidia && success || failure
    echo -n $"Removing CUDA /dev entries: "
    removedevs && success || failure
}

# See how we were called
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart}"
        exit 1
        ;;
esac
exit 0

Since different distributions have different init systems, the system administrator may need to modify the above scripts to make them work with the installed distribution.
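As one possible way to register the startup script for runlevel 3, the sketch below copies it to /etc/init.d/cuda and creates the /etc/rc3.d/S80cuda symlink named in the script's header comment. install_cuda_init and its optional staging-root argument are hypothetical, introduced only so the steps can be tried in a scratch directory first; the exact directory layout is an assumption that may not hold on every distribution.

```shell
#!/bin/bash
# Install an init script as <root>/etc/init.d/cuda and link it into
# <root>/etc/rc3.d as S80cuda. Hypothetical helper.
#   $1: path to the startup script
#   $2: optional staging prefix (defaults to the real filesystem root)
install_cuda_init() {
    src=$1
    root=${2:-}
    mkdir -p "$root/etc/init.d" "$root/etc/rc3.d" || return 1
    # Copy the script and make it executable.
    install -m 755 "$src" "$root/etc/init.d/cuda" || return 1
    # Relative link, so it resolves the same way under any prefix.
    ln -sf ../init.d/cuda "$root/etc/rc3.d/S80cuda"
}
```

Running install_cuda_init ./cuda /tmp/stage lets you inspect the result before repeating the call as root without the second argument. On distributions that provide chkconfig, running chkconfig --add cuda after copying the script reads the "# chkconfig: 345 80 20" header and creates the links instead.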

Headless CUDALink on Windows and Mac OS X

It is not possible to start Windows and Mac OS X in headless mode, so this information does not apply.