DORS offline through the weekend for maintenance; some processes and nodes on the cluster were affected by the DORS unmount
Update, 11/11/2019: DORS is now back online.
“The filesystem check finished and the filesystem has been repaired. We learned from this that 8 files were lost during the problematic disk rebuild. We are running a process now that should identify these files so they can be restored from backup. This process can be run while the system is up, so we have brought DORS back online. NFS and CIFS access are available now. It will be remounted on the ACCRE cluster first thing Monday morning.”
As always, open a help desk ticket with us if you have any issues.
Update, 11/8/2019 7pm: This afternoon, during the attempt to unmount DORS across the cluster, a command sequence was incorrectly ordered on a subset of the compute and gateway nodes, causing processes to be accidentally killed. In addition, some nodes required an unexpected reboot in order to completely unmount DORS.
We apologize for any disruption to your work. If you are still experiencing issues on any custom gateway hardware that you suspect may be related to this incident, please open a help desk ticket detailing the resources/services that are not functioning correctly.
Original post:
For those of you who use DORS as part of your research, please be aware of this weekend’s DORS outage:
“Starting at 1pm on Friday, ACCRE will begin unmounting the DORS filesystem from the cluster. New ACCRE jobs that access DORS will fail after this time. Existing ACCRE jobs that access DORS will be killed, starting at 1pm Friday.
At 5pm, the rest of the DORS services (NFS for Linux workstations and CIFS for Windows/Mac systems) will be taken offline and the filesystem check will begin. This will take an unknown amount of time, but we expect it could take at least a large portion of the weekend.”
The DORS outage follows the detection of bad blocks on a disk last week. The DORS team needs to identify the files that correspond to those blocks so they can be recovered from backups.
All other cluster services should continue as usual during this time.