Completed -
HPC clusters are running jobs again. The HPC GPFS maintenance is complete ahead of schedule.
Jan 6, 12:32 EST
Update -
HPC storage has passed health checks and is now available again from the majority of systems. HPC jobs will start within the next hour (by 1pm).
Jan 6, 12:16 EST
Update -
The drives have been reinstalled in the new enclosure, the enclosure has been cabled, and the system has been brought up. Initial checks appear healthy. We are working through the remaining health checks and will then start bringing up dependent systems and HPC.
Jan 6, 11:53 EST
Update -
HPC GPFS storage has been taken offline. All parts are in our cage within the data center and the vendor engineers have arrived.
Jan 6, 10:31 EST
Update -
HPC clusters are not running any jobs.
Jan 6, 10:07 EST
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 6, 09:30 EST
Update -
We've confirmed that all necessary parts are onsite. Multiple engineers from the vendor are set for data center access and will start at 10am. We plan to stop new HPC jobs from starting on the cluster around 9:30am. All running jobs will be cancelled during maintenance. Pending or held jobs will start once several post-maintenance functionality checks have passed.
Jan 5, 14:28 EST
Scheduled -
GPFS storage will be unavailable for most of January 6th while we work with our storage vendor to replace a disk enclosure. All HPC jobs and many automated systems and internal applications will be offline. All non-ONT instruments that write to instdata or cloud should be able to run.
The effort will involve both a software shutdown of many systems and labor in the data center to remove 104 drives from the old enclosure, install the new enclosure in the rack, and install the drives in the new enclosure.
The currently installed enclosure is faulty: one of its 104 drive slots has bad ports, so a drive swap cannot resolve the issue.
Building a configuration that can always survive disruption to 100+ drives due to enclosure issues is complex. After this maintenance, over half of our GPFS capacity will be resilient to enclosure issues.
Dec 15, 12:17 EST