HOWTOs for High Performance Computing
On this page we collect useful HOWTOs related to High Performance Computing and cluster management. You are free to use the information provided here, without any warranty or support!
Using the large_files directory in your scratch folder
In your scratch folder /scratch/userid or $SCRATCH there is a subfolder large_files. It is meant for really large files, as explained here or recorded (MP4).
NVIDIA GPU Monitoring with OMD and Checkmk
If you are using OMD and Checkmk for your cluster management, you might have noticed that the provided NVIDIA support only works when an X server is running. This is because the Checkmk agent relies on "nvidia-settings", which needs a running X server; on HPC clusters this is normally not the case. Fortunately, NVIDIA provides another tool, "nvidia-smi", that the Checkmk agent can use without a running X server.
When you SSH to a cluster node with an NVIDIA GPU and run "nvidia-smi -q -x", you see a detailed XML structure with a lot of data. You might also notice that this tool takes quite a long time to display its results. This is because the operating system unloads the GPU driver while it is not in use, and loading and initialising it again takes considerable time and CPU load. As a result, the frequent invocations of the Checkmk agent would disturb normal cluster operation. The solution is to start a daemon process that prevents the driver from being unloaded and keeps the initialisation time short.
You can download the C code for this little daemon here: cuda_lock_driver.c. Just start it at boot time on your GPU nodes, and remember to kill the daemon before upgrading your NVIDIA driver package.
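If you prefer not to use the provided C source, the idea behind it is easy to reproduce: acquire a CUDA context on every GPU and then sleep forever, so the kernel driver is never unloaded. The following is a minimal sketch of that idea in Python, calling libcuda through ctypes; the function choices and error handling are illustrative assumptions, and cuda_lock_driver.c remains the reference implementation.

#!/usr/bin/env python
# Illustrative keep-alive daemon: hold a CUDA context on every GPU so the
# kernel driver stays loaded and initialised (same idea as cuda_lock_driver.c).
import ctypes
import signal
import sys

cuda = ctypes.CDLL("libcuda.so.1")  # CUDA driver library shipped with the NVIDIA driver

def check(result, call):
    if result != 0:  # CUDA_SUCCESS == 0
        sys.exit("%s failed with error %d" % (call, result))

check(cuda.cuInit(0), "cuInit")

count = ctypes.c_int()
check(cuda.cuDeviceGetCount(ctypes.byref(count)), "cuDeviceGetCount")

contexts = []
for ordinal in range(count.value):
    device = ctypes.c_int()
    check(cuda.cuDeviceGet(ctypes.byref(device), ordinal), "cuDeviceGet")
    context = ctypes.c_void_p()
    # Holding a context keeps the driver loaded for this GPU.
    check(cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device), "cuCtxCreate")
    contexts.append(context)

# Sleep until killed; stop this process before upgrading the NVIDIA driver.
signal.pause()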
The next step is to extend your Checkmk agent. If you are using the check_mk-agent RPM on your GPU nodes, then you should add a plugin script. Just download check_mk_agent.txt and save it as /usr/lib/check_mk_agent/plugins/nvidia_smi. Don't forget to make it executable. If you are deploying the OMD-provided check_mk_agent script with your cluster management software, you could alternatively extend this script. Use an editor and find the <<<nvidia>>> section. Insert the lines from check_mk_agent.txt right after this section.
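For illustration, here is a rough Python equivalent of such a plugin: it runs "nvidia-smi -q -x", extracts a few values per GPU and prints them below a <<<nvidia-smi>>> header. The selected columns and XML tag names are assumptions made for this sketch; the provided check_mk_agent.txt (which relies on xml_grep, see the note below) is what actually defines the section format.

#!/usr/bin/env python
# Illustrative Checkmk agent plugin printing a <<<nvidia-smi>>> section.
# Tag names and columns are assumptions; adapt them to your driver version.
import subprocess
import xml.etree.ElementTree as ET

xml_output = subprocess.check_output(["nvidia-smi", "-q", "-x"])
root = ET.fromstring(xml_output)

print("<<<nvidia-smi>>>")
for index, gpu in enumerate(root.findall("gpu")):
    # Strip spaces from the product name so every value stays a single column.
    name = gpu.findtext("product_name", "unknown").replace(" ", "")
    temperature = gpu.findtext("temperature/gpu_temp", "N/A").split()[0]
    power_draw = gpu.findtext("power_readings/power_draw", "N/A").split()[0]
    print("%d %s %s %s" % (index, name, temperature, power_draw))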
Please note that you will have to install the tool "xml_grep" on the GPU nodes to make this work. On EL6 it can be found in the perl-XML-Twig RPM.
If you invoke your agent now (either locally or remotely), you will see a new section called <<<nvidia-smi>>> listing all the GPUs it finds as well as some monitoring data. On a node with two GPUs, the result might look like this:
<<<nvidia-smi>>>
0 TeslaM2090 N/A 0 0 0 0 N/A 32.61 225.00
1 TeslaM2090 N/A 0 0 0 0 N/A 29.66 225.00
Now it is time to update the Checkmk / OMD server to interpret the agent output. A suitable check script must be placed in the Checkmk / OMD checks directory. For OMD this would be something like /omd/sites/omd_XYZ/share/check_mk/checks/nvidia_smi. Additionally, you can place a Perf-O-Meter script into the Checkmk / OMD perfometer directory (e.g. /omd/sites/omd_XYZ/share/check_mk/web/plugins/perfometer/nvidia_smi.py).
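As a rough illustration of what the check script can look like with the classic check API, the sketch below registers one service per GPU and reads the last two columns of the agent output as power draw and power limit in watts; the real column meanings and thresholds are defined by the nvidia_smi check you install, so treat the field positions here as assumptions.

# Illustrative classic Checkmk check for the <<<nvidia-smi>>> agent section.
# Column positions are assumptions made for this sketch.

def inventory_nvidia_smi(info):
    for line in info:
        if line:
            yield line[0], None  # one service per GPU index

def check_nvidia_smi(item, _no_params, info):
    for line in info:
        if not line or line[0] != item:
            continue
        name = line[1]
        try:
            draw, limit = float(line[-2]), float(line[-1])
        except ValueError:
            return 3, "no power reading for %s" % name
        state = 1 if draw > 0.9 * limit else 0
        return state, "%s: %.2f W of %.2f W" % (name, draw, limit), [("power", draw)]
    return 3, "GPU %s not found in agent output" % item

check_info["nvidia_smi"] = {
    "check_function":      check_nvidia_smi,
    "inventory_function":  inventory_nvidia_smi,
    "service_description": "NVIDIA GPU %s",
    "has_perfdata":        True,
}

A Perf-O-Meter script typically follows the same pattern: it registers an entry for "check_mk-nvidia_smi" in the perfometers dictionary of the web plugin and renders one of the performance values graphically.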
If you refresh the OMD inventory for your GPU nodes with "cmk -II <node> ; cmk -O", you will hopefully see the monitoring data for your GPU nodes on your local OMD monitoring web page.