Custom/private gateway work is complete

May 1, 2019—Update, 5/10: The work is now complete. On Friday, May 10, beginning at 8:30 AM, we will be taking custom/private gateways offline in order to reboot and upgrade the operating system to CentOS 7.6 from 7.4 and the GPFS filesystem to 5.0.2 from 4.2.3. We expect these upgrades to last roughly 1-2 hours. If this time...



4/24/2019: Turbo-Charged Machine Learning: Fitting Models Fast with H2O with Jesse Spencer-Smith

Apr. 22, 2019—Jesse Spencer-Smith, Chief Data Scientist at the Vanderbilt Data Science Institute, presented on April 24, 2019. Due to technical difficulties we couldn’t record this seminar; however, materials are available at the following repo on GitHub: https://github.com/vanderbilt-data-science/turbo_h2o Abstract: The model fitting stage of a data science workflow has always been compute-intensive. Combine this with a need...



GPFS outage resolved; check output of running jobs

Apr. 19, 2019—Around 2 PM today a GPFS manager node had an issue that caused the GPFS filesystem to hang across the cluster, making logins and file access unresponsive. The issue was corrected at 3 PM today and all compute nodes seem to have recovered. Please check the output of running jobs just to be safe.



Scheduled maintenance on public gateways complete

Apr. 18, 2019—Original post: This Saturday morning, April 20th, we will be taking the public gateway and portal servers offline from 7 am to 9 am in order to reboot and upgrade the operating system to CentOS 7.6 from 7.4 and the GPFS filesystem to 5.0.2 from 4.2.3. Updating the entire cluster to GPFS 5 is an...



ACCRE networking problems fixed; make note of rules of thumb when reading or writing data to the cluster

Apr. 9, 2019—Update, 4/10/2019: Early this morning we applied some changes that appear to have resolved the network stability issues we were having yesterday. Feel free to resume normal activities on the cluster. We apologize for the interruption! On a related note, we have been observing intermittent sluggishness on /scratch and /data over the last several weeks....



[Resolved] Visualization portal maintenance Saturday morning is now complete

Mar. 14, 2019—Update, March 16: This maintenance is now complete. The ACCRE Visualization Portal will go down for scheduled maintenance on Saturday, March 16th, from 6 AM to 10 AM. This will only affect web access through the Visualization Portal, so users may still run jobs on the cluster and log in through the gateway nodes via SSH....



New command: slurm_groups

Mar. 11, 2019—We’ve added a new command to the ACCRE cluster, slurm_groups, that allows you to view your current SLURM group membership. This is especially useful if you have GPU access to the cluster, as it gives you the group and partition that you will need to use. Click the link to go to the documentation.  



[Resolved] SLURM scheduler is back online following outage

Mar. 5, 2019—Update, 3/5/2019: The scheduler is now operational. The impact on the cluster queue has been minimal. We are investigating to determine the exact cause of the stuck jobs in order to prevent this from happening again. Thank you for your patience. We are currently experiencing a SLURM overload caused by issues in killing processes related...



[Resolved] /scratch and /data are back online following weekend maintenance

Jan. 24, 2019—Update, 2/12/2019: /scratch and /data are back online and we are now accepting new jobs. We were never able to get the maintenance command to run successfully, but we were able to verify (with IBM’s assistance) the integrity of /scratch and /data, which is great news and means we will not need to take another...



Final Steps for CentOS 7 Upgrade

Jan. 10, 2019—Update, Jan 25: The CentOS 6 login is now closed. Original post below… It has been a long journey, but we are almost to the end! Please see below for a schedule of the final systems to be upgraded to CentOS 7. Note this schedule does not include a handful of custom/private gateways that still...
