Scientific Computing Facilities Frequently Asked Questions
Table of Contents
- General Questions
- Filesystem/Disk Questions
- Batch Job/System Questions
- Programming Questions
- Miscellaneous Questions
General Questions
- 1. What are the Scientific Computing Facilities (SCF)?
- The Scientific Computing Facilities (SCF) currently include our IBM Katana Cluster, the IBM Blue Gene system, the soon-to-be-retired IBM pSeries 655 machines, and our virtual reality/scientific visualization facilities. They are part of the more general SCV Computing and Visualization Facilities.
A general introduction to the use of the SCF is available at Information for New SCF Users. The primary computational machines are listed under our Scientific Computing Facility Technical Summary.
- 2. What machines is my account good for?
- Your SCF account gives you access to the IBM Katana Cluster (katana.bu.edu) and the soon-to-be-retired IBM pSeries 655 (twister.bu.edu). Access to the IBM Blue Gene (levi.bu.edu and lee.bu.edu) is not automatic; special restrictions apply. If you need access to the Blue Gene, go to your SCF User Information page (accessing this page requires your BU login ID and SCF [non-Kerberos] password), then select and submit the Update Personal Information form.
The primary computation clusters each have one or two machines designated for interactive use, and you can only log in to those machines. On the Katana Cluster the machine is katana.bu.edu; on the pSeries it is twister.bu.edu; and on the Blue Gene, for those who have access to it, the machines are levi.bu.edu and lee.bu.edu. Your SCF Unix (non-Kerberos) password is shared across all of these systems. On the Katana Cluster you may also log in using your BU Kerberos password, but that password does not give access to SCF web materials.
You should log into one of the above machines using SSH and do all your editing and compiling there as well. If your program runs on a single processor and requires less than ten minutes of CPU time, you can also execute your program on one of these machines (with the exception of the Blue Gene systems) interactively. Otherwise, you should submit your program as a batch job and it will automatically be parceled out to the appropriate machine in the facilities based on available resources and what queue you select (please also see later questions on batch system usage).
- 3. Where do I find documentation?
-
- General
- SCV Help
- SCV Supported Software documentation
- Scientific Computing Facilities Resource Request forms
- SCV Tutorials and Presentations
- Look up online man pages
- Supercomputing Systems
- Katana Cluster
- IBM Blue Gene
- IBM pSeries 655
- 4. What debuggers are there?
- On the Katana cluster, the Portland Group parallel debugger is pgdbg.
The debugger on the Blue Gene system is TotalView.
On the IBM pSeries machines, the debugger is pdbx. pdbx is a command-line parallel debugger suitable for MPI.
There are also the standard debuggers dbx and gdb.
- 5. How do I change my password?
- To change your SCF password, you need to run the passwd command on either katana.bu.edu or twister.bu.edu.
- 6. I can log in to the Katana cluster (katana.bu.edu) but not to twister.bu.edu. Why not?
- Most likely you are using your BU Kerberos password. This password works on the Linux systems but not on our other systems. You also have a Unix password, which you set up when you first got an account on the SCF (and may have changed since). Only this password allows you to log in to twister.bu.edu or the Blue Gene systems (lee.bu.edu, levi.bu.edu). It is optional on the Linux systems but required on the others. Your SCF (non-Kerberos) password is also required to access SCF web materials online. If you can’t remember your SCF Unix password, please send e-mail to scfacct@bu.edu explaining your situation.
- 7. What do I do if I forget my password?
- Please send e-mail to scfacct@bu.edu explaining your situation.
- 8. How do I retrieve lost files?
- Please send e-mail to help@scv.bu.edu, explaining exactly which files you deleted, which machine and filesystem they were on, and the day and time you deleted them.
- 9. How do I get more resources (such as disk space)?
- For home directory disk space, fill out this form. If you are a Principal Investigator for a project which needs more CPU time or /project space, try using the appropriate form linked to from http://www.bu.edu/tech/accounts/special/research/accounts/. Make sure to specify what machine you are requesting resources on, why you need them and what exactly you need. These requests, particularly large ones, can take several weeks to process and consider.
- 10. What is the best way to keep up with system news, such as downtime?
- All system news is posted to the system message board and to the BU mailing list scfug-l. The system message board can be viewed using the program msgs. By default this command will be included in your .login startup file. If you modify this file, we suggest that you continue to include this command.
You can subscribe to the scfug-l mailing list by sending mail to majordomo@bu.edu with the following line as the BODY of the message (the Subject line does not matter):
subscribe scfug-l@bu.edu your_email_address
- 11. Which filesystems are shared?
- You have one home directory on the Katana Cluster/Blue Gene systems and one on the IBM pSeries. Your Katana/Blue Gene home directory is accessible from the IBM pSeries machines with the pathname /linux/$HOME. Similarly, your IBM pSeries home directory is accessible from the Linux machines with the pathname /ibm/$HOME.
Each machine (with the exception of the Blue Gene systems) has its own scratch partition. If necessary, you can access the /scratch partitions on other machines of the same architecture. On the Katana Cluster machines, use the pathname /net/katana-xNN/scratch, where x is the letter a, b, c, d, e, or h and NN is a node number from 01, 02, … 14. On the pSeries systems you can access a remote scratch space via the pathname /hostname/scratch (for example, /frisbee/scratch).
There are several partitions of Project space. All of the /project file systems can be accessed from any of the SCF machines.
- 12. I need large amounts of temporary space for my jobs. What do I do?
- Use /scratch and see the previous question.
If /scratch on a given machine is full, do one of the following: 1) Remove files you no longer need to free up space. 2) Use /scratch on a different machine which has more space (see question 11).
If this is a regular need and /scratch does not adequately take care of it, the Principal Investigator of your project can apply for /project disk space, backed up or not backed up as appropriate.
- 13. Why do my files in /scratch automatically get removed, sometimes even immediately after I unTARed them?
- The /scratch reaper automatically removes files which are more than 10 days old. It determines how old a file is by looking at its “write date.” By default, tar does not modify write dates, so an older file which is unTARed will be reaped at the next opportunity. The -m switch to the tar command can be used to override this behavior. The following is from the tar man page:
m Do not restore the modification times. The modification time will be the time of extraction.
- 14. Does the SCF have a long term storage facility?
- Yes, it is possible to archive your files for long term storage using the IBM Distributed Storage Manager (Tape Robot).
- 15. How do I submit a batch job?
- Each of the four computer systems maintained by SCV uses a different batch scheduler.
- On the Katana Cluster, the batch scheduler is Sun Grid Engine.
- On the IBM Blue Gene, the batch scheduler is LoadLeveler.
- On the IBM pSeries, the batch scheduler is LSF.
- 16. What limitations are there on jobs (# of nodes, runtime, etc…) on the various systems?
- Our Scientific Computing Facility Technical Summary explains the job limitations on all of our systems.
- 17. How do I have one batch job wait for another to complete?
- On the pSeries, the bsub command in the LSF batch system has a wait option (-w) which allows you to specify the conditions which you wish to wait for before starting the job, including waiting for the termination of another job. For example,
bsub -w 'done("myjob1")&&done("myjob2")' myjob3
will cause myjob3 to wait until both myjob1 and myjob2 have completed. Another option -b allows you to specify that jobs should not be run before a certain time. Finally, the -E option provides a completely general mechanism to have a job wait until an arbitrary condition is true. With this option you specify a command which the batch system will execute before running your job. If the command exits with a 0, the job is run. Otherwise it is put back on the queue.
- 18. My batch job starts several other jobs but these other jobs get killed by the reaper. Why?
- If the original job terminates before its children, the reaper cannot determine that the children were started by the batch job and so kills them. Make sure the parent job does not end before the children.
- 19. My batch job is expected to run longer than the queue’s time limit. What can I do?
- The answer depends on how your code is implemented:
- If your code is written as a serial (single processor) application, rewriting it as a parallel (multiprocessor) application could help, provided that the underlying algorithm used in your code is inherently parallelizable. Parallelization can be achieved with MPI. MPI works on shared memory machines as well as distributed memory machines. Please contact Kadin Tseng for more details.
If your program is written for MATLAB, please contact Kadin Tseng to see if your program can be parallelized.
- Modify your program so it periodically saves state and can be restarted where it left off. See Kadin for help doing this.
- If your code is already parallelized with MPI and scales to many (hundreds of) processors, you could port it to the IBM Blue Gene.
- 20. My batch job exited with code ###. What does that mean?
- See this long explanation of the batch system exit codes.
- 21. How does the LSF batch system schedule jobs?
- See this long explanation of the batch system scheduler.
- 22. My LSF (batch) run seemed to run to completion, but I never received the usual e-mail message notifying me that the job had finished. What is the problem?
- At the end of an LSF run the user is automatically sent e-mail to indicate that the job has completed. This e-mail contains everything that was written to standard out during the run. If a large amount of information (greater than 10MB) is written to standard out, the e-mail becomes too large for the mail system to process, and the e-mail is not sent. This sometimes occurs when the user forgets to delete a large number of diagnostic print statements from a run. The best solution is to always re-direct standard out to a file (e.g., myrun > myoutput).
- 23. I am a member of multiple project groups. How do I charge my usage to a project other than my default one?
- On the Katana Cluster, run the command newgrp project_name in your shell window before doing your run or submitting your job (from that window) to the batch system.
On the Blue Gene, you need to add the line # @ group = project_name to your batch script.
On the pSeries, use the -P project_name option to bsub when you submit your job.
You can also change your default project by going to your SCF User Information page (accessing this page will require your BU login ID and SCF [non-Kerberos] password) and then selecting and submitting the Change your SCF default project form. Your default project will then be changed the next time the system configuration files are updated, generally overnight.
- 24. How do I specify the number of processors my job will run on?
- It depends on the computing platform you plan to run the job on. Please consult the appropriate link below:
- 25. How do I run MPI jobs?
- Please read the Multiprocessing by Message Passing Tutorial where you will find instructions on what you need to do to use MPI.
- 26. How do I run PVM jobs?
- PVM is not available on any of our current systems.
- 27. What linear algebra packages are available on the SCF systems?
- On the IBM pSeries, ESSL is available for serial applications while PESSL is available for parallel processing.
On the other SCF systems, LAPACK is available for serial applications and ScaLAPACK for parallel applications. Please go to the Packages page for the Katana Cluster or the IBM Blue Gene systems for details.
Basic Linear Algebra Subprograms (BLAS) is available on all four SCV systems. Add -lblas when linking to give your executable access to the BLAS library.
- 28. What causes the following error message on Twister?
The minimum size of partition 5 exceeds the partition size limit.
- When the highest level of optimization (-O5) is requested, the compiler will perform inter-procedural optimization (ipa). In the process, it needs temporary storage and when the default amount is not sufficient, the above message results. To remedy this, add the following switch to your compile line:
-qipa=partition=large
See the xlf man page for details.
- 29. How do I call Fortran subprograms from a C program on Twister?
- Unlike many other machines, no wrapper routine is required with the IBM xlf compilers. In fact, adding an underscore (“_”) in a C wrapper function will cause an error during compilation. This applies to both user-developed C functions and IBM C-based library functions, such as erand48, a 48-bit random number generator.
- 30. How do I call Fortran functions from a C program on the Linux-based machines (katana, levi, lee)?
- On these machines, you need to append an underscore (“_”) to the Fortran function name. For example, if your Fortran subprogram is called myfunc, then in the C program, invoke it as myfunc_. Note also that because Fortran subprogram arguments are passed by reference, all arguments in the C call, including scalars, must be passed as pointers.
- 31. My Fortran program calls flush. It doesn’t work on the pSeries/Twister!
- You must append an underscore to the routine name: instead of “call flush(iunit)”, write “call flush_(iunit)”.
- 32. Is etime available on the pSeries? I get this error when I compile my program:
ld: 0711-317 ERROR: Undefined symbol: .etime
- Yes. Like flush above, you need to “call etime_”. The following utilities require an underscore: alarm_, clock_, ctime_, dtime_, etime_, fdate_, flush_, gmtime_, idate_, itime_, ltime_, sleep_, time_, usleep_.
- 33. My C code with OpenMP directives behaves strangely when private arrays are allocated with malloc. Why is this?
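- Arrays allocated with malloc live on the heap, which all threads share. Listing the pointer in a private clause gives each thread its own uninitialized copy of the pointer, not its own copy of the array. For a per-thread array, call malloc inside the parallel region.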
- 34. I use trigonometric and hyperbolic functions quite often in my code. Is there a system library that provides efficient implementations of these functions?
- On the Katana Cluster, there is no special library needed. Both the Portland Group and GNU compilers have math libraries built in.
On the Blue Gene and pSeries, the library is called MASS. Add -lmass to link to it. Details on MASS are available.
- 35. What is MPMD and are there special considerations when programming using this paradigm?
- MPMD stands for Multiple Program Multiple Data, as opposed to MPI’s more popular Single Program Multiple Data (SPMD) parallel programming paradigm. This is available on the pSeries only. (More details available here.)
- 36. I ran a batch job that uses /scratch for I/O. LSF can’t find the file which is there. What happened?
- You must prepend the name of the machine on which the file resides, for example /twister/scratch. (Details here.)
- 37. I think I have discovered a bug in F90, gcc, etc… What should I do?
- Send e-mail to help@scv.bu.edu with a description of the problem. If possible, tell us exactly how to reproduce the problem you are having. If we can reproduce your problem, we can probably fix it. If you don’t know how to reproduce the problem, please provide as much information as possible including:
- Hostname of machine.
- The name and location of the program (with flags and input files).
- Any error messages you get.
- 38. On the IBM/AIX machines my program fails with the error:
twister:~> a.out
exec(): 0509-036 Cannot load program a.out because of the following errors:
0509-026 System error: There is not enough memory available now.
How do I deal with this error and get access to more memory on the IBM/AIX machines?
- The error message is misleading; the system has plenty of memory. By default, a 32-bit AIX executable has a 256MB data segment limit. You need to use the -bmaxdata compiler flag to use more memory. For example:
twister:~> xlf -bmaxdata:0x40000000 prog.f
produces an executable with a 1GB data segment limit. It is usually safe to compile with -bmaxdata:0x80000000 for a 2GB data limit. To go above 2GB, you need to append /dsa. The largest value you can specify is -bmaxdata:0xd0000000/dsa for a 3.25GB data limit. However, this may or may not work depending on the details of your program. If you need that much memory, consider compiling a 64-bit executable by using the -q64 flag.
You can also use the ldedit command to “fix” the executable without recompiling.
Finally, if you are using the GNU compilers (gcc, g++, g77) you need an additional -Xlinker flag:
twister:~> g77 -Xlinker -bmaxdata:0x40000000 prog.f
- 39. When I try to run Mathematica I get the error: “xset: bad font path element”. What should I do?
- All machines that display the Mathematica front end (graphical user interface) must have access to the fonts included with Mathematica. If the Mathematica process is running on a remote machine and the front end is displayed on the local machine, the X server on the local machine must know where to find the Mathematica fonts.
In order to run Mathematica on the BU SCF systems and have it display on your remote (e.g., office or home) machine, you need two things: the Mathematica fonts and an X server.
- Mathematica Fonts: The latest version of the MathFonts.
- X servers: Most computers running Linux and OS X will already have an X server installed, so it is only a matter of installing the fonts on those computers. For Windows, you can get an X server by downloading and installing X-Win32.
Exactly what you need to do though will vary depending on your operating system:
Windows
If you install the BU site license of X-Win32 the Mathematica fonts are included and Mathematica will work.
BU Linux/Mac
Download the fonts as mentioned above.
Untar the Font file using the command tar xvzf MathematicaV7FontsLinux.tar.gz
This will create a directory called Fonts
Copy the Fonts directory where you would like to keep it.
Each time you run Mathematica, run the command
xset fp+ full_path_to_Fonts_directory/Fonts/Type1; xset fp rehash
Ubuntu/Kubuntu Linux
The process for installing the fonts under Linux is explained here.
- 40. Why can’t the machine I am using read my datafile that I created on another computer?
- If the file is a binary file, the problem may be endianness. Intel and DEC computers are usually “little-endian”, while MIPS (SGI), SPARC (Sun), and PPC (IBM/AIX) are “big-endian”. This means the two families store the bytes of a multi-byte value, such as an integer, in opposite order. The best solution is to use a portable data format, such as ASCII, for your data files.
- 41. None of this answered my questions. What should I do?
- For other questions or to clarify anything above, please send mail to help@scv.bu.edu and we will do everything in our power to help you with your issue. Please don’t hesitate to contact us whenever you have trouble.
