Common problems/issues

Program crashes on login node with message Killed

When running commands or editing files on the login node, users may notice that their processes end abruptly with the error message Killed. Processes with names such as a.out, matlab, etc., are automatically killed on the login node because they may consume excessive computational resources. Unfortunately, this also means that benign processes, such as editing a file with the word matlab as part of its name could also be killed.

Solution: Request an interactive session on a compute node (qsub -I), and then run the application/command.

Home or scratch directories are sluggish or unresponsive

The /home and /scratch directories can become slow/unresponsive when a user (or several users) read/write large amounts of data to these directories. When this happens, all users are affected as these filesystems are shared by all nodes of the cluster.

To avoid this issue, keep in mind the following:

  1. Never use the /home directory as the working directory for jobs that read/write data. If too many jobs read/write data to the /home directory, it can render the cluster unusable by all users. Copy any input data to one of the /scratch directories and use that /scratch directory as the working directory for jobs. Periodically move important data back to the /home directory.

  2. Try to use /local_scratch whenever possible. Unlike /home or the /scratch directories, which are shared by all nodes, each node has its own /local_scratch directory. It is much faster to read/write data to /local_scratch, and doing so will not affect other users. (see example [here])(