IBM Blue Gene
Table of contents
- Getting a Blue Gene Account
- Help Information
- Allocations and Account Management
- Hardware Configuration
- File Systems
- Usage Policies
- Programming Information
- Running Jobs on the Blue Gene
- Scheduling Policy
- Interactive Mpirun Use
- Totalview Debugger
- Blue Gene Links
On May 9, 2005, we took delivery of an IBM Blue Gene system, one of the first such systems installed in the world. The Blue Gene is based on an IBM Research project dedicated to exploring the frontiers in supercomputing. When first installed, our Blue Gene system was ranked at #59 on the Top 500 Supercomputer Sites list. IBM’s Blue Gene is part of a new family of supercomputers optimized for bandwidth, scalability and the ability to handle large amounts of data while consuming a fraction of the power and floor space required by previous systems.
The Boston University Blue Gene is a single rack system, containing 1024 compute nodes. Each compute node contains two 32-bit 700 MHz PowerPC 440 processors. Our Blue Gene has a peak performance of 5.7 Teraflops.
The Blue Gene is designed for codes that scale well on hundreds or even thousands of processors. The individual processors in the Blue Gene are significantly slower than those on the Katana Cluster; as a very rough estimate, a Blue Gene node runs about half as fast as one on the Katana Cluster. Therefore, unless your code scales well and uses many processors (generally at least 256), you should run it on the Katana Cluster instead. Some other restrictions apply to the Blue Gene as well; in particular, most general application software, such as MATLAB and Mathematica, is not available on it. If you are not sure which machine is right for you, email one of the SCV Scientific Programmers (Kadin Tseng, email@example.com, or Yann Tambouret) to discuss it.
The login nodes for the Blue Gene are the Linux machines levi.bu.edu and lee.bu.edu; users must use SSH to log in. Passwords are shared across the Scientific Computing Facilities systems, so if you already have an account and password on our other systems and have been granted access to the Blue Gene, the same login and password will work here.
New SCF users are not automatically given access to the Blue Gene system due to special rules that apply to it. External users need to submit identity documentation (such as a passport) to us, and both internal and external users need to fill out a web form. To apply for access, go to your SCF User Information page (accessing this page requires your BU login ID and SCF (non-Kerberos) password) and submit the Update Personal Information form. External users will also need to follow the identity information instructions on that web form.
This page gives a basic introduction to using the Blue Gene system. For more detailed information, follow the sidebar links, probably starting with the Programming page.
If you are experiencing system problems, please send email to firstname.lastname@example.org.
For more information or help in using or porting applications to the Blue Gene system, please see our Scientific Programming Consulting page.
If you have questions regarding your computer account or resource allocations, please send email to email@example.com.
Note that the Blue Gene is different from our other systems in how time is charged. On the Blue Gene, you will be charged 1 SU for each processor hour you reserve, calculated by wall clock time. On our other systems, users are charged for actual CPU usage rather than by wall clock time.
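For example, a job that reserves 512 processors for two hours of wall clock time is charged 1024 SUs, even if some of those processors are idle for part of that time.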
The Blue Gene rack contains 1024 compute nodes, 128 input/output nodes, and several internal networks used for inter-node communication. Both types of node consist of dual core 32-bit PPC440 processors (700 MHz) with 512 MB of main memory. Each node has a 32 KB L1 cache, 2 KB L2 cache, and a 4 MB L3 cache. See the IBM Journal of Research and Development issue on the Blue Gene for more system details.
The login nodes, levi.bu.edu and lee.bu.edu, are IBM eServer OpenPower 720s with 2-way 1.50 GHz 64-bit POWER5 processors. Each has a main memory of 4 GB and runs the SuSE Linux Enterprise Server 9 operating system. The configuration is similar to our other Linux systems.
User home directories are the same as on the Katana Cluster and all of the standard shared filesystems (e.g. /project and /projectnb) are accessible. Note that there is no /scratch space available on the Blue Gene system.
If you need access to your old twister (retired pSeries) home directory from your Blue Gene account, you can reach it by prepending /ibm to your home directory name, for example, /ibm/usr2/faculty/your_login.
Please also read the information on disk space in our SCF Users Information document.
The login nodes are intended for program development, compilation, and for submitting jobs to be run on the Blue Gene. Do not run CPU intensive applications on these machines. Use one of our other systems for this purpose instead. See the Scientific Computing Facilities Technical Summary for more information.
Since the Blue Gene is a highly specialized computing platform, few commercial or open source packages are currently ported to it. For the most part, only standard Linux tools, compilers and some math libraries are available. Packages that have been ported for use on the Blue Gene are listed here.
Due to the specialized nature of this machine, the process of compiling and running programs is more complex than on our other computing platforms. Instructions are below and we highly recommend that you read and follow them.
Note on Environment Variables
Your home directory is shared between the Linux Cluster and the Blue Gene login nodes. If you are using the default .cshrc and .login files, your environment will automatically be set up properly for both systems. If you have problems finding compilers or running jobs, make sure that you are not overriding the system settings of the PATH or LD_LIBRARY_PATH environment variables; these variables are set properly for each system by the global startup files. See here for help on adding your own directories to these variables.
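For example, in a csh-style startup file you can append your own directory to the existing value instead of replacing it (the directory shown is just a placeholder):
levi% setenv PATH ${PATH}:${HOME}/bin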
Blue Gene Programming Restrictions
The Blue Gene is not a general purpose computer and thus there are many limits on what a Blue Gene application is allowed to do. Here are some:
- Must use MPI.
- No shared libraries.
- No threads.
- Most Unix system calls are not allowed, e.g. fork, exec, signal,…
- File I/O is allowed, but only a limited number of filesystems are available (/project, /projectnb, or any of /usr1-/usr4).
- No /scratch or /tmp.
Compiling for the Blue Gene
Compiling for the Blue Gene is relatively straightforward as long as you use the right compiler, include the right header files, and link with the right libraries.
The login nodes have two types of compilers: native compilers for building programs to run on the login nodes and cross compilers for building programs to run on the Blue Gene. The native compilers have standard names like “gcc” and “xlf.” The Blue Gene cross compilers all have names beginning with “blrts_.” Here is a list:
- IBM compilers
- GNU compilers
The MPI and various Blue Gene header files are in the directory /bgl/BlueLight/ppcfloor/bglsys/include/ . You will therefore need to include -I/bgl/BlueLight/ppcfloor/bglsys/include with your compiler flags.
Every Blue Gene program must be linked with at least four libraries, located in the directory /bgl/BlueLight/ppcfloor/bglsys/lib/: libmpich.rts.a, libdevices.rts.a, libmsglayer.rts.a, and librts.rts.a.
That is all you need for C and Fortran. The library libcxxmpich.rts.a is also required when linking C++ code.
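For example, a C program might be built on a login node with something like the following. This is a rough sketch only: blrts_xlc is assumed here to be the name of the IBM C cross compiler, myprog.c is a placeholder source file, and the exact library link order should be taken from the sample makefile.
levi% blrts_xlc -I/bgl/BlueLight/ppcfloor/bglsys/include -o myprog myprog.c -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lmsglayer.rts -ldevices.rts -lrts.rts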
The compiler option -fno-underscoring is required when compiling with blrts_g77.
More detailed compiler and compiler options information is available here.
A sample makefile which incorporates all of the above information is available.
Math Libraries, etc
There is limited math library support and it is detailed here.
Preparing your job to run on the Blue Gene
The mpirun command is used to execute a job on the Blue Gene. Because we do not allow interactive use of mpirun on the login nodes (except by special arrangement), you must submit it as a job to the LoadLeveler batch system. Preparing your job for submission requires creating a job command file (jcf) that contains the correct arguments to the mpirun command; LoadLeveler then uses this jcf file to dispatch your job to run on the Blue Gene.
To run a Blue Gene executable via the mpirun command, you should compile an executable and place it on a file system that can be accessed by the Blue Gene (/project, /projectnb, or any of /usr1-/usr4). You should then modify the sample jcf file, particularly by changing the “@ arguments” line to reference your executable and set appropriate command line arguments to mpirun.
The main command line arguments to mpirun are:
|Option||Required/Optional||Description|
|-np N||Mandatory||N = the number of MPI tasks, 1-1024 (or 1-2048 in VN mode); see the note below|
|-cwd start_dir||Mandatory||start_dir = full pathname of the directory in which the job runs|
|-exe path_to_executable||Mandatory||Full pathname of your executable|
|-verbose 0-4||Optional||Controls diagnostic output; the default is 0, 1 is recommended|
|-args "list_of_args"||Optional||list_of_args = arguments passed to your executable (enclosed in quotes)|
|-mode CO|VN||Optional||CO = COprocessor mode (the default), VN = Virtual Node mode|
|-connect MESH|TORUS||Optional||Defaults to MESH; N must be a multiple of 512 to use TORUS|
See mpirun -h for a full list of mpirun options.
Important note on N above. Although you can choose N to be any value in the above ranges, the system will only allocate 32, 128, 512, or 1024 physical nodes to a job. The system allocates the smallest number of allowed physical nodes necessary to run one task per node (or two per node in VN mode.)
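For example, a job submitted with -np 200 in CO mode is allocated 512 physical nodes, the smallest allowed size that can provide 200 tasks at one task per node.
Putting this together, a jcf file is essentially a wrapper around a single mpirun invocation. The following is a rough, hypothetical sketch only; the LoadLeveler keywords shown are common ones and the path to mpirun is a placeholder, so start from the sample jcf file rather than from this sketch:
# Hypothetical Blue Gene jcf sketch (illustrative only; use the sample jcf file as your starting point)
# @ output = my_job.$(jobid).out
# @ error = my_job.$(jobid).err
# @ wall_clock_limit = 01:00:00
# @ executable = /path/to/mpirun
# @ arguments = -np 256 -cwd /project/yourgroup/run1 -exe /project/yourgroup/bin/myprog -verbose 1
# @ queue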
Submitting and Tracking your Job Using LoadLeveler
Once you have tailored your jcf file you can use LoadLeveler to submit your job to run on the Blue Gene. The command you use to submit your job is:
levi% llsubmit jcf_file
Once you have submitted your job you can use the following commands to monitor and change your job while it is queued and running on the Blue Gene:
|llq||Shows queued and running jobs|
|llcancel||Deletes a queued or running job|
|llprio||Changes the priority of your queued jobs|
|qstat||Similar to llq but includes more information|
|bglstat||Shows the current allocation of the Blue Gene machine|
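For example, a typical session might look like the following, where my_job.jcf and the job id levi.123.0 are placeholders:
levi% llsubmit my_job.jcf
levi% llq -u your_login
levi% llcancel levi.123.0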
The scheduler implements the following usage limits:
|Limit||During Business Hours*||During Off Hours|
|Maximum Runtime per Job||5 hours||5 hours|
|Maximum Nodes used per User||512||1024|
(*Business Hours: 9am – 5pm Eastern Time, Monday – Friday.)
After enforcing the above limits, the scheduler prioritizes the runnable jobs and runs the highest-priority job if the necessary resources are available. If they are not yet available, the scheduler uses a backfilling strategy to run lower-priority, short-duration jobs on any free resources, as long as doing so will not delay the start of the highest-priority job.
The primary ordering criterion used to prioritize jobs is the amount of recent runtime accumulated by the user. This quantity is displayed by the qstat command under the SYSPRI column as a negative value. Jobs with the same SYSPRI are ordered by submission time. Finally, a user can alter the relative ordering of their own jobs with the llprio command; this command modifies the "user priority," which is displayed in the PRI column of the llq and qstat commands. The qstat command lists the waiting jobs in the scheduling order described above.
Sometimes it is convenient (e.g. during program development or debugging) to execute mpirun directly on the login node rather than through the batch system. This is normally not permitted, but it can be arranged by sending a request to firstname.lastname@example.org. We will allocate a partition of the machine for your exclusive use, and you will be able to use it by invoking mpirun in the following way:
levi% mpirun -noallocate -partition YOUR_PARTITION ...
where YOUR_PARTITION is the name of the partition assigned to you and “…” represents all the other flags you would normally pass to mpirun.
The Totalview debugger is available for use on the Blue Gene. To use the debugger you must first compile your code with the -g flag. It will also be convenient to have your executable and source code in the same directory and to invoke the debugger from that directory.
You can start totalview through the batch system by using an appropriately modified version of the sample totalview jcf file.
You can also start it interactively on a login node if you have your own partition as described under Interactive Mpirun Use above:
levi% totalview mpirun -a -noallocate -partition YOUR_PARTITION ...
In either case, once it starts, two windows will appear on your screen. Click the GO button in the larger window titled mpirun.
A small window titled Question will pop up. Click YES in that window (you want to stop the job.)
After a little while the large window will be retitled mpirun<your_program>.0 and you should see your source code displayed in it.
At this point all of your mpi tasks have been created and are under the control of the debugger. They are stopped before main has been called.
See the totalview documentation for more information.
- IBM Blue Gene Presentation at BU January 30 – February 1, 2006 (http://scv.bu.edu/documentation/presentations/)
- Blue Gene/L Application and Development Redbook (html | pdf)
- Blue Gene/L Performance Analysis Tools Redbook (html | pdf)
- Blue Gene/L System Administration Redbook (html)
- IBM Journal of Research and Development issue on the Blue Gene/L (http://www.research.ibm.com/journal/rd49-23.html)