Resolved: Shared Computing Cluster batch jobs, BU Works test environment

Incident Discovery Time: 02:30pm on 04/12/2022
Time of Resolution: 10:30pm on 04/12/2022
Services Impacted: Server Infrastructure

Description of Impact

The MGHPCC (Holyoke data center) overheated today, and some servers and network switches shut down. This primarily affected batch jobs on the SCC. The other affected systems are not production systems: the nas2/nas-ru2 mirror of the CRC production server and a development/test environment set up for the BU Works team.

Incident Description and Resolution

The SCC nodes were brought back online by 6pm. The remaining servers were back online by 10:30pm, except for a test environment that needs further investigation.

Additional Information

The cause of this incident is still under investigation. If you continue to have issues, please contact the IT Help Center.

Previous Update

Incident Discovery Time: 02:30pm on 04/12/2022
Services Impacted: Server Infrastructure

Description of Impact

The Massachusetts Green High Performance Computing Center (MGHPCC) overheated briefly today and a few servers were shut down.

Current Status

SCC compute nodes are back online and all clients on the SCC were notified. Our data center operations team is en route to Holyoke to investigate other servers.
Next Update: 07:30pm

Previous Update

Incident Discovery Time: 02:30pm on 04/12/2022
Services Impacted: Server Infrastructure

Description of Impact

The Massachusetts Green High Performance Computing Center (MGHPCC) overheated briefly today and a few servers were shut down.

Current Status

IS&T teams have fixed the cooling issue and are currently waiting for temperatures to drop enough to bring servers back online.

Additional Information

Batch computing jobs on the Shared Computing Cluster and test environments for the BU Works Basic team were affected. IS&T teams continue to analyze the scope and impact.
Next Update: 07:30pm

Previous Update

Incident Discovery Time: 02:30pm on 04/12/2022
Services Impacted: Shared Computing Cluster Batch jobs, BU Works test environment

Description of Impact

The Massachusetts Green High Performance Computing Center (MGHPCC) has had an air conditioning issue. The room overheated to the point where some servers shut down. SCC login nodes and the filesystem were not affected, but some batch jobs and BU Works services may be unavailable.

Current Status

IS&T teams are investigating the impact of this outage and working to get servers back online.
Next Update: 05:30pm