[Resolved] Full cluster downtime on Wednesday, Dec 19 starting at 6am; make sure to log out and halt any running processes before downtime starts

Posted on Monday, December 10, 2018 in Cluster Status Notice.

Update, 12/20/2018:

The GPU driver upgrade on all Maxwell and Pascal nodes in the CentOS 7 cluster is now complete, and the nodes are available to host jobs.

Thank you for your patience.

Update, 12/19/2018:

The cluster is now back online and accessible for normal use, with the exception of the GPU nodes.

We are currently upgrading GPU drivers on the GPU nodes in our CentOS 7 environment. These upgrades should complete shortly, so the CentOS 7 GPU nodes will be available later tonight. The GPU nodes in our legacy CentOS 6 environment are experiencing issues that we will finish troubleshooting tomorrow morning (please open a ticket with us if you are unable to use CentOS 6 GPU nodes by tomorrow afternoon).

The work today was overall a great success! We were able to successfully install the new storage hardware that will enable us to expand capacity on /scratch and /data (quota increases will be available after the holiday break).

We also upgraded Slurm to the most recent version (18.08.4) during today’s downtime, and upgraded firmware on several of our network devices.

Please open a helpdesk ticket with us if you notice anything unusual.

Thank you for bearing with us as we performed this important maintenance work today, and happy holidays!

We are taking a full cluster downtime on Wednesday, Dec 19 beginning at 6AM. We expect the work to be completed by the afternoon of Dec 19.

The primary purpose of this downtime is to install new storage hardware that will allow us to expand capacity on /scratch and /data (the additional space will not be available immediately, but a few weeks after the downtime). We have been running low on capacity for several months now, and expanding capacity will allow us to accommodate several quota increase requests we’ve had during this time.

During the downtime we will also perform a major upgrade of Slurm to the latest release of 18.08, along with a handful of network upgrades.

During the downtime, users will be unable to log into the cluster, access their files, or submit/run jobs. Users will also be unable to log in to custom gateways that mount /home, /scratch, /data, or /dors. To expedite the scheduled maintenance, please make sure that you are logged out of the cluster prior to the downtime, and halt any processes you have running on gateways (you can check with “ps -ef | grep <your vunetid>”).
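As a rough sketch of the process check described above: the grep pipeline also matches the grep command itself and any other line that happens to contain your ID, so selecting by process owner can be a little cleaner. The function name `list_my_processes` is made up for illustration, and the snippet assumes your login name on the gateway is your VUNetID.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list processes owned by the current user.
# Assumption: the login name on this gateway is your VUNetID.
list_my_processes() {
    # -u selects processes by owner; -o picks the columns to print
    # (process ID, elapsed run time, full command line).
    ps -u "$(id -un)" -o pid,etime,args
}

list_my_processes
```

Anything in the output that should not survive until the downtime can then be stopped with `kill <PID>`, using the PID from the first column.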

Please pay special attention to your wall time requests in Slurm jobs leading up to the downtime. We will be placing a maintenance reservation on all compute nodes beginning on Dec 19 at 5AM, so Slurm will not start any jobs that are projected to run past that time point.
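To illustrate the wall time point, here is a minimal sketch that works out the largest wall time request that would still finish before the reservation begins. The helper names `seconds_until` and `seconds_to_walltime` are hypothetical, the script assumes GNU `date` (for the `-d` option), and it takes the reservation start straight from this notice, interpreted in the machine's local time zone.

```shell
#!/usr/bin/env bash

# Hypothetical helper: seconds left between two epoch timestamps,
# clamped to 0 once the start time has passed.
seconds_until() {
    local now=$1 start=$2
    local remaining=$(( start - now ))
    if (( remaining < 0 )); then
        remaining=0
    fi
    echo "$remaining"
}

# Hypothetical helper: format a number of seconds as D-HH:MM:SS,
# one of the time formats accepted by Slurm's --time option.
seconds_to_walltime() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

# Reservation start from this notice: Dec 19, 2018 at 5AM.
# Assumption: GNU date, local time zone of the machine running the script.
reservation_start=$(date -d '2018-12-19 05:00:00' +%s)
now=$(date +%s)

remaining=$(seconds_until "$now" "$reservation_start")
echo "Largest wall time that still fits: $(seconds_to_walltime "$remaining")"
```

The printed value could be passed to a submission as `sbatch --time=<value> job.slurm`, where `job.slurm` stands in for your own batch script. Requesting somewhat less than the printed value leaves a margin for the time the job spends waiting in the queue before it starts.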

Please let us know if you have any questions or concerns.