
ACCRE networking problems fixed; make note of rules of thumb when reading or writing data to the cluster

Posted on Tuesday, April 9, 2019 in Cluster Status Notice.

Update, 4/10/2019: Early this morning we applied some changes that appear to have resolved the network stability issues we were having yesterday. Feel free to resume normal activities on the cluster. We apologize for the interruption!

On a related note, we have been observing intermittent sluggishness on /scratch and /data over the last several weeks. We believe this may be due in part to users’ jobs that are (unintentionally) straining the system and affecting other users on the cluster (especially those doing interactive work). Please read the rules of thumb below carefully when reading or writing data on the cluster, and keep in mind that you are using a shared resource. We encourage users who are unsure whether their jobs follow best practices to reach out via our Helpdesk.
– Avoid reading or writing to the same file (especially large files on the order of GB in size or greater) from a large number (~50 or more) of jobs at once. This can also apply to especially large Singularity images. Please reach out to us if you are unsure whether this point applies to you.
– Avoid opening and closing files repeatedly within tight loops. If possible, open files once at the beginning of your program/workflow, then close them at the end (see the first sketch after this list).
– Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is generally fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size (see the second sketch after this list).
– Watch your file system quotas. If you are near your quota and a job keeps trying to write data, the repeated failed writes will stress the file system.
– Make use of /tmp (local storage on compute nodes) when possible, especially if you are performing small and frequent I/O. Just make sure to clean up at the end of your jobs, since /tmp is also a shared resource with limited capacity (see the third sketch after this list). See this FAQ for instructions on using /tmp: https://www.vanderbilt.edu/accre/support/faq/#how-do-i-use-local-storage-on-a-node
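
To illustrate the point about tight loops, here is a minimal Python sketch; the file name results.csv and the loop body are placeholders rather than a prescribed ACCRE workflow:

# Hypothetical sketch: results.csv and the loop contents are placeholders.

# Pattern to avoid: opening and closing the file on every iteration
# turns one job into thousands of metadata operations on /scratch or /data.
# for i in range(100000):
#     with open("results.csv", "a") as f:
#         f.write(f"{i},{i * i}\n")

# Preferred pattern: open the file once, write inside the loop,
# and close it once when the loop finishes.
with open("results.csv", "w") as f:
    for i in range(100000):
        f.write(f"{i},{i * i}\n")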
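
If a workflow truly does require many small files, the following sketch (with hypothetical names such as output_root and files_per_dir) shows one way to keep each directory at a manageable size:

# Hypothetical sketch: group many small files into subdirectories
# of manageable size instead of one giant directory.
import os

output_root = "chunks"      # placeholder output location
files_per_dir = 1000        # keep each directory well below tens of thousands of files

for i in range(50000):
    # Files land in group_000, group_001, ..., 1000 files per directory.
    subdir = os.path.join(output_root, f"group_{i // files_per_dir:03d}")
    os.makedirs(subdir, exist_ok=True)
    with open(os.path.join(subdir, f"item_{i}.txt"), "w") as f:
        f.write(f"record {i}\n")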
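
Finally, here is a rough sketch of the /tmp staging pattern, assuming a Python workload; the GPFS destination path is a placeholder, and the FAQ linked above remains the authoritative reference:

# Hypothetical sketch: do small, frequent I/O against node-local /tmp,
# copy only the final result back to GPFS, then clean up after yourself.
import os
import shutil
import tempfile

gpfs_output = "/scratch/your_vunetid/results"   # placeholder GPFS destination

# Create a private working directory under node-local /tmp.
workdir = tempfile.mkdtemp(prefix="myjob_", dir="/tmp")
try:
    # The I/O-heavy work happens on local disk, not on /scratch or /data.
    local_file = os.path.join(workdir, "intermediate.txt")
    with open(local_file, "w") as f:
        for i in range(100000):
            f.write(f"{i}\n")

    # Copy the final result back to GPFS in a single pass.
    os.makedirs(gpfs_output, exist_ok=True)
    shutil.copy(local_file, gpfs_output)
finally:
    # /tmp is shared and limited in capacity, so always remove your working directory.
    shutil.rmtree(workdir)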

Original post:

We are currently experiencing networking issues that are impacting our GPFS filesystem and other cluster tools. This may result in failed logins, long delays in file access, and other problems. We are actively working to track down the issue and will send an update when it has been resolved.