Current SCC System Status
The Research Computing Systems group will update this status when there are changes.
- Batch system back to normal.
The issues with the batch system have been resolved and the system is back to normal. (September 2, 5:15 p.m.)
- Problem with the batch system on SCC.
We are aware that there is a problem with the batch system on the SCC cluster. We are working on it now and will resolve it as soon as we can. The login nodes and Project Disk are not affected and you can continue to use them normally. Apologies for the inconvenience. (September 2, 3: p.m.)
- Eversource has scheduled a power outage at 111 Cummington Mall on Saturday, July 18, 2:00 a.m. to 2:00 p.m. to perform emergency repairs. During the outage, RCS Account Management services will not be available. No Project and related resource management forms will be processed and automated messages will not be sent. RCS Account Management services will resume after the power has been restored. The Shared Computing Cluster (SCC) and all other resources located in Holyoke will not be affected.
Other areas affected by the power outage are:
Momentary: 225 BSR (Castle), 226 BSR (HIS), 232 BSR (PLS), 236 BSR (EGL), 264-270 BSR (SSW), 640 Comm Ave. (COM), 675 Comm Ave. (STO), 2 Cummington (BSC), 38 Cummington (SLB), 44 Cummington (BME)
Duration: 704 Comm Ave. (FOB), 708 Comm Ave. (Residence/Commercial), 710 & 712 Comm Ave. (Commercial), 714 Comm Ave. (Residence/Commercial), 722 Comm Ave. (Residence/Commercial), 48 Cummington (ERA), 68-100 Cummington (SOC), 111 Cummington (MCS)
(July 17 1:00PM)
- The unrelated issues involving scc1.bu.edu and the batch system have been resolved. (May 22 7:00PM)
- The login node scc1.bu.edu is having issues, which are being actively investigated. The machine was also just rebooted, so current sessions will have been disrupted. We recommend using another login node (scc2.bu.edu, geo.bu.edu, or scc4.bu.edu) for the time being. (May 22 3:13PM)
Following the reboot, the system seems to be working normally but we continue to actively monitor it. (May 22 3:32PM)
- The scheduled downtime at the MGHPCC completed early and all SCC login and batch nodes are back on line. (May 19 12:40AM)
- The SCC/MGHPCC will have an outage starting at noon on Sunday, May 17. Details are below, in a message just sent to all SCC researchers. (April 14, 4:30PM)
Dear Colleague,
In order to perform yearly scheduled maintenance, there will be a full power outage at the MGHPCC tentatively scheduled to start at 12:00pm on Sunday, May 17th until 12:00pm on Tuesday, May 19th. This outage impacts the Shared Computing Cluster (SCC) and ATLAS, as well as the on-campus Linux Virtual Lab machine (scc-lite.bu.edu) and websites hosted at class.bu.edu that use data stored on systems at the MGHPCC. All SCC login and compute nodes, home directories, and project and restricted project filesystems will be unavailable during the outage. While the planned outage is scheduled to extend through noon on Tuesday, we anticipate that restoration of services could occur earlier on Tuesday during the overnight hours.
We will start a process of draining the batch queues on the Shared Computing Cluster (SCC) on Friday, April 17th to prevent long-running jobs from starting. This process will continue until 12:00pm Sunday, May 17th, at which point all jobs will be stopped and we will shut down all the SCC computer systems, including all the login nodes. Although you may continue submitting jobs between April 17th and the downtime, if your jobs would not complete prior to the shutdown, they will remain pending until the computer systems are returned to normal operations. People submitting jobs to the batch system should plan on adjusting their hard run time limits accordingly (see http://www.bu.edu/tech/support/research/system-usage/running-jobs/submitting-jobs/#job-resources for more details).
We anticipate having all systems returned to production status by 12:00pm on Tuesday, May 19th.
We will post updates before and during the downtime on our Shared Computing Cluster status page at http://www.bu.edu/tech/research/sccupdates/.
Please email email@example.com if you have any additional questions or concerns.
Regards,
The Research Computing Services Group
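The advice in the message above about adjusting hard run time limits can be illustrated with a small calculation. This is only a sketch: the function names, the one-hour safety margin, the year 2015, and the assumption that a job starts running the moment it is submitted are all illustrative and not part of the announcement. The H:MM:SS output format is likewise an assumption about how a run time limit might be written; check the linked job-resources page for the exact syntax the batch system expects.

```python
from datetime import datetime, timedelta


def max_hard_runtime(start_time, shutdown_time, margin=timedelta(0)):
    """Longest run time that still finishes before the shutdown.

    Assumes the job begins running at start_time (no queue wait).
    Returns zero if the shutdown (minus the margin) has already passed.
    """
    remaining = shutdown_time - start_time - margin
    return max(remaining, timedelta(0))


def format_hms(delta):
    """Render a timedelta as H:MM:SS with hours allowed to exceed 24."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Shutdown time from the announcement; the year is an assumption.
shutdown = datetime(2015, 5, 17, 12, 0)
# Hypothetical moment at which a job starts running.
start = datetime(2015, 5, 15, 9, 30)

limit = max_hard_runtime(start, shutdown, margin=timedelta(hours=1))
print(format_hms(limit))  # 49:30:00
```

A job requesting more than this limit would, per the message, stay pending until the systems return to normal operations.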
- As of May 4, 2015, we will be disabling projects from Shared batch system access if they go over their CPU/SU allocation and remain over for two weeks, despite multiple warnings, without requesting additional resources. We are also reducing the number of messages sent about projects over their allocation. A notice was just sent to all SCC users explaining the changes. (April 17, 4:30 PM)
- The following information was just sent out to all SCC users:
Shared Computing Cluster (SCC) Service Degradation March 14th
Dear Researcher,
We are writing to inform you of an IS&T Emergency Outage that will affect services on the Shared Computing Cluster (SCC). On Sat. March 14th from 11PM until Sun. March 15th at 8AM, IS&T will be replacing network fiber cabling that was damaged in a recent steam pipe rupture. This outage will affect the following on the SCC in addition to other University-wide services:
o Users may have intermittent difficulty logging in to the SCC. Existing login sessions should continue to work.
o The archive storage service will be unavailable. Batch jobs that attempt to utilize archive storage space during the outage will hang or fail.
o Interactive or batch sessions that utilize the following licensed products will not work correctly:
Abaqus
In addition, the above software packages will not work from other computers that utilize the campus license servers during the time of the outage.
In an effort to minimize the impact of failed jobs, we will be suspending batch job dispatch during the outage.
We apologize for any inconvenience that this outage may cause.
If you have further questions, please email firstname.lastname@example.org.
Regards,
The Research Computing Services Group
- There was an approximately 20 minute interruption to the Project Disk filesystems. (October 17, 12:30 AM)
- Normal access to the Project Disk filesystems has been restored. (October 02, 12:55 AM)
- The Project Disk filesystems are inaccessible and staff are working to restore access as quickly as possible. (October 02, 10:55 AM)
- The MGHPCC maintenance has been completed well ahead of schedule and the system is now operating normally. (August 11, 3:00 PM)
- The MGHPCC maintenance work is proceeding ahead of schedule. The BU networking work and SCC filesystem maintenance have been completed. The SCC login nodes and filesystems are now on-line using generator power. The batch nodes will remain off-line until power is restored. (August 11, 12:50 PM)
- The MGHPCC and SCC will be down for scheduled maintenance all day on Monday, August 11. The SCC will be brought down at 10PM on August 10 to prepare for this. More details on this outage are given in this letter to all users from Glenn Bresnahan. (July 25, 2:30 PM)
- SCC1 seems to be working fine now. We are not sure what caused the access/performance problems, and interruptions could occur again. The other login nodes did not seem to be affected. Please use one of the other login nodes if you experience further problems with SCC1. (July 18, 2:45PM)
- The problem with SCC1 is being investigated. Please use any of the other login nodes. (July 18, 2:30PM)
- The networking upgrade mentioned in the prior update of May 21 has been indefinitely postponed. We will post a new update when it is rescheduled, which may not be for many months. (June 2, 2:15PM)
- A significant networking upgrade to BU services will be performed at 881 Commonwealth Avenue relatively soon; the date and time are not yet set. This will likely cause a disruption of some services on the SCC. License servers will be down, making MATLAB, Abaqus, Lumerical, Mathematica, and Maple inaccessible. Kerberos authentication will also be affected, making it impossible to log in to the SCC with your Kerberos login and password; if you are already logged in, this should not affect you. There may also be other intermittent login and connectivity issues, including issues with wireless authentication and web login.
If you have questions about this disruption, please send them to email@example.com. (May 21, 10:45AM)
- The issues with the Project Disk Space storage system (/project, /projectnb, /restricted/project, and /restricted/projectnb) on the Shared Computing Cluster (SCC) have now been resolved. There was no loss of data and the system is now operating normally.
The problem was caused by the failure of two RAID disk controllers leading to the failure of a RAID disk array. Intervention by the vendor was needed to restore the disk array to operation. (April 2, 7:30PM)
- All of the Project Disk space partitions (/project, /projectnb, /restricted/project, and /restricted/projectnb) are currently inaccessible from all nodes. This issue is under investigation and we will fix it as soon as we can. We will also post additional updates here as we have them. Home directories remain accessible. (April 2, 2:42PM)
- Power has been restored to the MGHPCC and the SCC is now fully operational. Some electrical issues remain, but it is believed that these can be resolved without further service interruptions. (December 11, 11:30PM)
- Replacement power equipment has been installed and is undergoing final testing. The MGHPCC will attempt to restore full power this evening. SCV staff remain on-site to bring the computer systems up as soon as possible once power is restored. (December 11, 6:30PM)
- The SCC login nodes and filesystems are now accessible using generator power at the MGHPCC. The compute nodes will continue to be unavailable until full power is restored. While the MGHPCC is expecting replacement parts for the main power feed tomorrow morning, we do not currently have an estimate on when full power will be restored. We will continue to post further updates on the SCC status page. (December 10, 9:00 PM)
- Due to equipment failures in the main power path, the MGHPCC was not able to return to operation on the target schedule. Equipment vendors and electric company personnel are on-site assessing the problem. SCV staff are also standing by on-site to return the computing systems to normal operation as soon as possible once power is restored. (December 10, 9:50 AM)
- Job queue is being drained in preparation for power outage at 10:00 PM 12/08 described below. Jobs which would not complete prior to the shutdown are being held until the system returns on 12/10. (December 6, 10:00 AM)
- December 9 – In order to address an exigent issue, there will be a full power outage at the MGHPCC on December 9th. This outage impacts the Shared Computing Cluster and ATLAS, as well as the on-campus Katana and LinGA clusters that use data stored on systems in Holyoke. We anticipate the systems will be down from 10:00PM on December 8 until 9:00AM on December 10. More details are available here. (November 20, 4:00 PM)
- VNC is now available on the SCC. Using this software can greatly speed up the performance of GUI/graphical applications running on the SCC. (October 9, 4:00 PM)
- Note that if one login node is responding slowly, you may get better responsiveness by logging in to another. The login nodes are scc1.bu.edu, scc2.bu.edu, geo.bu.edu (for Earth & Environment department users), and scc4.bu.edu (for BU Medical Campus users). (August 19, 10:30 a.m.)
- Glenn Bresnahan, director of SCV, sends out to all SCF users an update on the SCC Performance Issues. (August 14, 12:30 p.m.)
- Performance back to normal.
As of 7:30 p.m. the SCC’s performance is back to normal. We are still trying to identify the underlying sources of these problems. (August 7, 7:35 p.m.)
- Recurring performance problem.
As of approximately 5:30 p.m. we have again been experiencing intermittent performance degradation. The Systems group is working on it to bring back normal performance as soon as possible. They also continue to try to locate and understand the underlying causes of these performance problems in an effort to prevent them from returning. Many apologies for the interruptions to your productivity. (August 7, 6:15 p.m.)
- File servers hung
Today at approximately 1:00 p.m. two file servers hung and took down the filesystem. The system was restored at 1:30 p.m. This incident was not related to the previous performance degradation issues. (August 7, 1:30 p.m.)
- Update: We believe we have resolved the problem below as of around 4:30 pm on August 6. It was unrelated to the issues on the 2nd and 5th. (August 6, 5:30 p.m.)
- Recurring performance problem.
We are aware that the SCC is having intermittent performance problems again. We are working on it and are trying to fix it as soon as possible. (August 6, 3:00 p.m.)
- On Friday, August 2nd at approximately 3:30pm and again on Monday, August 5th at approximately 12:30pm, the Shared Computing Cluster (SCC) experienced system-wide degradations in performance lasting multiple hours. The SCV systems group has been working to identify the issue causing these degradations. At this time we believe that we have identified the issue and continue to work to fully rectify the problem. Users may continue to experience periods of degraded performance on the cluster until we have fully resolved the issue.
We apologize for any inconvenience that these issues may have caused you during the past few days and appreciate your continued patience as we continue to work hard to resolve them. (August 6, 11:00 a.m.)
- Problem with performance on SCC.
We are aware that there is a problem with performance on the SCC cluster. We are working on it now and will resolve it as soon as we can. Apologies in advance for the inconvenience. (August 5, 12:45 p.m.)
- Performance on the SCC cluster is back to normal. We are still investigating the cause of the problems recently experienced. (August 2, 5:05 p.m.)
- Problem with performance on SCC.
We are aware that there is a problem with performance on the SCC cluster. We are working on it now and will resolve it as soon as we can. Apologies in advance for the inconvenience. (August 2, 4:35 p.m.)
- Charging for usage in Service Units (SUs) begins on July 1, 2013. The compute nodes are charged at an SU factor of 2.6 SUs per CPU hour of usage. Also, note that the way usage is calculated on the SCC is different than it is on the Katana Cluster. Usage is charged on the SCC by wall clock time as on the Blue Gene rather than by actual usage as it is on the Katana Cluster. Thus if you request 12 processors and your code runs for 10 hours, you will be charged for the full 120 hours (multiplied by the SU factor for the node(s) you are running on) even if your actual computation only ran for, say, 30 hours. This change will also apply to the nodes which move out to become part of the SCC that used to be part of the Katana Cluster. (July 1, 2013)
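The wall clock charging rule above can be checked with a short calculation. This is a minimal sketch: the function and constant names are illustrative, and only the 2.6 SU factor and the 12-processor, 10-hour example come from the notice.

```python
# SU factor for the compute nodes, as stated in the notice.
SU_FACTOR = 2.6


def su_charge(processors, wall_clock_hours, su_factor=SU_FACTOR):
    """SUs charged for a job billed by wall clock time.

    Every requested processor is charged for the full wall clock
    duration, regardless of how much computation it actually did.
    """
    cpu_hours = processors * wall_clock_hours
    return cpu_hours * su_factor


# The example from the notice: 12 processors for 10 hours is 120 CPU hours.
charge = su_charge(12, 10)
print(f"{charge:.1f} SUs")  # 312.0 SUs
```

The 30 hours of actual computation mentioned in the notice do not enter the calculation at all, which is the point of the change.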
- During the week of July 8-12, all of geo.bu.edu and the katana-d*, katana-e*, katana-j*, and katana-k* nodes will move out of the Katana Cluster to become part of the SCC. This includes all of the Buy-In Program nodes. All of these nodes will also be renamed during the transition. Details on this are in this note sent out on July 2.
The schedule is:
July 3rd: 6:00am-6:30am Katana outage to physically relocate
July 7th: 7:00am Disable batch queues on machines that are moving.
July 8th: 7:00am Power off all machines that are moving
July 8th: 8:00am-6:00pm Systems de-installed and moved to Holyoke
July 8th: 1:00pm "SCC3" becomes an alias for the system name "GEO"
July 9: 8:00am-6:00pm Reinstallation and cabling of machines in Holyoke
July 10: 12:00pm Target for GEO nodes in production
July 11: 12:00pm Target for 2012 Buy-in nodes in production
July 12: 12:00pm Target for all systems in production
(June 25, 2013)
- During the week of June 24, 2013, the BUDGE nodes are being moved out of the Katana Cluster to become part of the SCC. They will be operational again on Friday, June 28 with the new names scc-ha1..scc-he2 and scc-ja1..scc-je2. These nodes each have 8 NVIDIA Tesla M2070 GPU Cards with 6 GB of Memory. (June 24, 2013)
- A bug in the automounter on the SCC systems has been identified that prevents the /net/HOSTNAME/ automount space from working properly for certain servers. There are two known problem servers at this time:
nfs-archive
As a workaround, until a proper bug fix becomes available, we have created a new automount space to handle the problem cases. If you experience a problem accessing /net/HOSTNAME/ for some HOSTNAME, look in /auto/. If HOSTNAME appears there, try that path; otherwise report the problem to firstname.lastname@example.org.
The /auto space is maintained manually so only the known problem servers can be accessed through that path. All other servers should be accessed through the usual /net path.
This problem does not affect the Katana Cluster. (June 17, 2013)
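The workaround described above amounts to a simple fallback: try the usual /net path first, and only if that fails look for the same host name under /auto. The sketch below illustrates that order. The function name and its parameters are hypothetical, and it treats any error while listing the /net path as "a problem accessing" it, which is an assumption; a mount that hangs rather than failing would not be caught this way.

```python
import os


def find_working_path(hostname, net_root="/net", auto_root="/auto"):
    """Return a usable path for hostname, or None if neither space works.

    Order follows the notice: the usual /net path first, then the
    manually maintained /auto space for known problem servers.
    """
    net_path = os.path.join(net_root, hostname)
    try:
        os.listdir(net_path)  # raises OSError if the path is not accessible
        return net_path
    except OSError:
        pass

    try:
        known_problem_servers = os.listdir(auto_root)
    except OSError:
        known_problem_servers = []

    if hostname in known_problem_servers:
        return os.path.join(auto_root, hostname)

    # Neither path works: per the notice, report the problem.
    return None


if __name__ == "__main__":
    path = find_working_path("nfs-archive")
    print(path if path else "not found under /net or /auto; report the problem")
```

The two root directories are parameters only so the logic can be exercised against ordinary directories; on the cluster the defaults would be used.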
- The SCC officially went into production use on June 10, 2013. However, there are still some transitional things continuing. Not all software packages are yet installed and disk space is still in a transitional state for some projects. (June 10, 2013)
- You may or may not have noticed that most of the files on the old Project Disk on Katana have been moved to the new Project Disk on the SCC. You can continue to use your files from either system using the same paths that you always have. If you have not been accessing your files, we quietly moved them over the past week. If you have been accessing your files, we are contacting you individually to find a time that is convenient for you to take a break from accessing them while we move your files for you.
Projects that did not have directories in /project and /projectnb on the old system do now have them on the new system with 50 GB quotas on each partition.
A note for active Blue Gene users: Since the compute nodes are on a private network, we will not move your files at this time and will be contacting you over the next few weeks to discuss the details and options. (June 7, 2013)
- We will be hosting a seminar on June 11 from 12-2pm to go over issues related to the migration to the SCC. Please do register; a light lunch will be served. (June 3, 2013)
The slides from these talks are posted here.
- MATLAB versions R2012b and R2013a are both available. R2012b is launched by /usr/local/bin/matlab at the moment, but you can access R2013a by running /usr/local/apps/matlab-2013a/bin/matlab. (May 30, 2013)
- In preparation for the new SCC Project Disk Space file systems going live in mid-June, we are making some changes. The first, which you may notice tomorrow, May 29, in the web forms and reports, is that the primary unit for reporting disk space will be Gigabytes, not Megabytes. In addition, when the SCC goes into production in mid-June, all projects on the SCC will have directories and quotas on both the backed-up and not-backed-up Project Disk partitions. Projects that already have directories and quotas on Katana will have them transferred to the SCC. Directories and quotas will be created for projects that did not already have them on Katana. The default minimum will be 50 GB on both partitions. For projects that need more quota, there is no charge for requests up to a total of 1 TB (200 GB backed up and 800 GB not backed up). Researchers who need more than that should look into the Buy-in options.
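The quota figures above can be expressed as a quick check on a proposed request. This is a sketch under one stated assumption: it reads the 1 TB no-charge total as two separate caps (200 GB backed up, 800 GB not backed up) rather than a pool that can be split freely. The function and constant names are illustrative only.

```python
# Figures from the notice, in gigabytes.
DEFAULT_QUOTA_GB = 50            # default minimum on each partition
FREE_BACKED_UP_GB = 200          # no-charge cap, backed-up partition
FREE_NOT_BACKED_UP_GB = 800      # no-charge cap, not-backed-up partition


def within_free_allocation(backed_up_gb, not_backed_up_gb):
    """True if both requested quotas fit under the no-charge caps.

    Assumes the caps apply per partition (an interpretation of the
    notice, not something it states explicitly).
    """
    return (backed_up_gb <= FREE_BACKED_UP_GB
            and not_backed_up_gb <= FREE_NOT_BACKED_UP_GB)


print(within_free_allocation(DEFAULT_QUOTA_GB, DEFAULT_QUOTA_GB))  # True
print(within_free_allocation(200, 800))                            # True
print(within_free_allocation(300, 700))                            # False
```

Under this reading, a request of 300 GB backed up and 700 GB not backed up totals 1 TB but still exceeds the backed-up cap.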
- MATLAB version R2013a is now installed on the SCC. (May 20, 2013)
- FFTW, Mathematica, Accelrys CHARMm, Gaussian, Grace, OpenGL/GLUT, and Nedit have all been installed on the SCC. (May 17, 2013)
- Production use of the SCC will begin in mid-June 2013.
- Added a table of available software packages on the SCC. This will be regularly updated during the friendly user period. (May 2, 2013)
- Made SCC web site live for everyone to access. (May 2, 2013)
- Friendly User access to the SCC begins. (April 26, 2013)
- Initial elements of the Shared Computing Cluster (SCC) are installed at the Massachusetts Green High Performance Computing Center (MGHPCC). BU is the first institution to install HPC resources at the MGHPCC. (January 22, 2013)