On a cluster such as the SCC, jobs that run for hours or days are occasionally aborted midstream for a variety of reasons: power failure, exceeding the walltime limit, or a scheduled shutdown, to name a few. Rerunning such jobs from the beginning causes delays and wastes system resources. Checkpoint restarting has long been a common technique for tackling this issue. It essentially means saving data to disk periodically so that, if need be, you can restart the job from the point at which your data was last saved. Checkpointing can be handled by a batch scheduler (if supported), or you can do it the old-fashioned way by writing the data to disk yourself. On the SCC, the OGS batch scheduler's checkpointing feature is not supported. RCS has developed a simple MATLAB checkpointing tool that provides a user-friendly, systematic way to add checkpointing to your code. Included here is a MATLAB example that demonstrates how to use this tool to rerun your job should a system abort occur.

Applicable Platforms

Linux, MS Windows, Mac.

Scope of the Checkpointing Tools

Two types of applications come to mind:

  1. Restarting a job that was terminated for any of a variety of reasons: system crash, system kill, run time exceeding the requested wall clock limit, etc. The current version uses the loop iteration count as the criterion for checkpointing. This may be extended to accommodate other criteria, such as elapsed time.
  2. Splitting a job that takes longer than the SCC batch maximum time limit into smaller jobs, each of which fits within the system’s maximum wall clock limit. Checkpointing bridges the jobs. The current version of the package supports this type of application.
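
The elapsed-time criterion mentioned in item 1 is not part of the current package. As a rough sketch of how it might look, the loop below saves whenever a chosen number of seconds has passed since the last save. The names saveInterval, tLast, and nSaves, the temporary file name, and the short intervals are assumptions made for illustration only.

```matlab
% Sketch only: time-based checkpoint trigger (not part of the RCS package).
saveInterval = 0.05;          % hypothetical: seconds between checkpoints
matfile = [tempname '.mat'];  % hypothetical checkpoint file in the temp folder
Niter = 20;                   % total number of iterations
A = 0;                        % running sum
nSaves = 0;                   % counts checkpoint events, for illustration
tLast = tic;                  % start timing the first interval
for iter = 1:Niter
    A = A + iter;             % the "work" of one iteration
    pause(0.01);              % stand-in for a longer computation
    if toc(tLast) >= saveInterval
        save(matfile, 'iter', 'A');   % overwrite the previous checkpoint
        nSaves = nSaves + 1;
        tLast = tic;                  % restart the interval timer
    end
end
fprintf('A = %d after %d iterations, %d checkpoint(s)\n', A, Niter, nSaves);
if exist(matfile, 'file') == 2
    delete(matfile);          % clean up the demo file
end
```

In a real job the interval would be minutes rather than fractions of a second, chosen so that the cost of a save is small compared with the work that would be lost.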

Design Goals of Checkpointing Tools

The goals are modest: to make available a simple and user-friendly checkpointing package for a limited class of applications that have relatively straightforward code implementations. It takes 20 to 30 minutes to understand what the package does and determine whether it is suitable for your application. In brief: spell out which variables you want to save for restarting, then save them every so often.


  • The checkpointing tools (chkin.m, chkpt.m, README), as well as a usage example (test_checkpoint.m, test_restart.m), are available for download
  • The example computes the sum of a simple arithmetic series, A = 1 + 2 + 3 + . . . + N, in a loop
  • Elements of the arithmetic series map one-to-one to the iteration index. At each iteration, A holds the partial sum up to that iteration. For N = 50, A = N*(N+1)/2 = 1275
  • Each time an iteration is performed, the code confirms with “Completed iteration X”
  • The checkpointing frequency is defined (see below) as every 7 iterations
  • A one-second pause is added to each iteration to slow down the computation and give you ample time to issue Ctrl-c to simulate a system kill. Make sure at least one checkpoint event (7 iterations) has happened before killing the job
  • A mat-file, mydata.mat, is generated. If the job is killed at Iteration 17, data is saved first at Iteration 7 and then overwritten at Iteration 14. The results of Iterations 15, 16, and 17 are not saved. The companion code, test_restart.m, loads the mydata.mat generated by test_checkpoint, resumes execution at Iteration 15, and completes at Iteration 50 (Niter)
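
The bookkeeping in the last bullet can be checked with a few lines of arithmetic. The following is a small sketch, independent of the package; the variable names killIter, lastSaved, A_saved, restartAt, and lostIters are made up for illustration.

```matlab
% Sketch only: what is saved and what is lost when the job dies at iteration 17.
frequency = 7;    % checkpoint period used in the example
killIter  = 17;   % iteration at which the job is killed
Niter     = 50;   % total number of iterations

lastSaved = floor(killIter/frequency) * frequency;  % last checkpointed iteration
A_saved   = lastSaved*(lastSaved+1)/2;              % partial sum stored in mydata.mat
restartAt = lastSaved + 1;                          % first iteration of the restart
lostIters = killIter - lastSaved;                   % iterations that must be redone

A = A_saved;                 % resume from the saved partial sum
for iter = restartAt:Niter
    A = A + iter;
end
fprintf('last save at %d (A = %d), restart at %d, %d iterations redone, final A = %d\n', ...
    lastSaved, A_saved, restartAt, lostIters, A);
```

With these inputs the last save is at Iteration 14 with A = 105, three iterations are redone, and the restart finishes with A = 1275, matching N*(N+1)/2 for N = 50.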

How to run test_checkpoint and test_restart

The demonstration example can be run in either interactive or batch mode.

  • Run interactively

    >> test_checkpoint
    Completed iteration 1
    Completed iteration 2
     .   .   .   .   .   .
    Completed iteration 6
    Checkpointing frequency is every 7 iterations. Data updated at iteration  7
    Completed iteration 7
     .   .   .   .   .   .
    Completed iteration 13
    Checkpointing frequency is every 7 iterations. Data updated at iteration 14
    Completed iteration 14
    Completed iteration 15
    Completed iteration 16
    Completed iteration 17
      ****** Simulate a system kill right here with Ctrl-c
    Operation terminated by user during test_checkpoint (line 65)

    Then run the restart job:

    >> test_restart
    Completed iteration 15
     .   .   .   .   .   .
    Completed iteration 48
    Completed iteration 49
    Completed iteration 50
    iter = 50; A = 1275; Correct answer = 1275
  • Run in batch mode

    scc1% qsub -b y 'matlab -singleCompThread -r "test_checkpoint, exit"'
    scc1% qstat -u kadin
    2168148  . . .     kadin     r     12/12/2013 09:11:42 budge@scc-ha1.scc.bu.edu   1

    Wait at least 7 seconds after the job’s start time (shown above) before issuing qdel to simulate a system kill, so that at least one checkpoint has been written. Each checkpoint cycle takes about 7 seconds (7 iterations at 1 second per iteration because of pause(1)); allow some extra time for MATLAB itself to start.

    scc1% qdel 2168148

    Then run the restart job:

    scc1% qsub -b y 'matlab -singleCompThread -r "test_restart, exit"'

Checkpointing tools

  • Use chkin.m to designate variables for checkpointing

    function s = chkin(s, field)
    % Purpose:
    % Adds or creates one or more empty fields in struct s, as named in
    % cell array "field"
    % Example 1: >> s = chkin(s, {'a'});                 % check in 1 item
    % Example 2: >> s = chkin(s, {'a', 'A', 'myArray'}); % check in multiple items
    % Date: December 7, 2013
    % Kadin Tseng, RCS, kadin@bu.edu
    for k=1:length(field)
       s.(field{k}) = [];
    end   % end of for loop
    end   % end of function
  • Use chkpt.m to perform checkpointing

    % function chkpt(matfile, s)
    % Purpose:
    % This script m-file is part of the checkpoint restart package for
    % users to periodically save data to a file during a batch job run.
    % In the event that the batch job got terminated, a user can rerun the job
    % from the point of the last save instead of starting from the beginning.
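    % Input (read from the caller's workspace, since this is a script):
    % matfile -- data storage file name (e.g., mydata.mat)
    % s is a struct array that stores the various data that the application
    % code needs to save for restarting.
    % chkNames, nNames -- names and count of the checked-in variables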
    % Note:
    % For the objective of restarting, data saved overwrites previous saves
    % to save disk storage as well as for convenience in restarting.
    % December 7, 2013
    % Kadin Tseng, RCS,  kadin@bu.edu
    for k=1:nNames
       s.(chkNames{k}) = eval(chkNames{k});  % update all variables first
    end
    save(matfile, '-struct', 's');   % Only fields of s are saved, not s
  • Checkpointing Example, test_checkpoint.m

    % Usage:
    % >> test_checkpoint    
    % December 8, 2013
    % Kadin Tseng, RCS, Boston University  (kadin@bu.edu)
    a = 123;        % a constant
    n = 456;        % another constant
    Niter = 50;     % total number of iterations
    frequency = 7;  % how often to perform checkpointing (intermediate save)
    % Before the start of the iteration loop, "check-in" each variable
    % that should be checkpointed in the event of restarting the job
    matfile = 'mydata.mat';     % mandatory; name of checkpoint mat-file
    s = struct();               % mandatory; create struct for checkpointing
    s = chkin(s,{'iter'});      % mandatory; iter is iteration loop index
    s = chkin(s,{'frequency'}); % mandatory; frequency is checkpointing period
                                % i.e., how often to perform a save
    s = chkin(s,{'Niter'});     % mandatory; total number of iterations
    s = chkin(s,{'a' 'n' 'A'}); % OK to check in multiple items 
    % Continue until all variables are checked in. Note that you are only
    % checking in the variable names; they don't need to be defined yet
    chkNames = fieldnames(s);    % the full list of variables to checkpoint
    nNames = length(chkNames);   % number of variables in list
    A = 0;             % initialize arithmetic sum A
    for iter=1:Niter
       A = A + iter;   % computes the running sum
       pause(1);       % slow down to let you kill job with ^c
                       % to simulate system kill
       % Checkpoints periodically (determined by the constant frequency)
       if mod(iter, frequency) == 0
          chkpt   % performs checkpointing (save) every *frequency* iterations
          fprintf(1, ['Checkpointing frequency is every %2d iterations. ' ...
              'Data updated at iteration %3d\n'], ...
              frequency, iter);  % Confirm after each checkpointing event
       end
       fprintf(1, 'Completed iteration %d\n', iter);
    end   % ending iteration loop
    Sas = Niter*(Niter+1)/2;   % correct sum of arithmetic series
    fprintf(1,'iter = %d; A = %d; Correct answer = %d\n', iter, A, Sas);
  • Restarting with test_restart.m

    % Usage:
    % >> test_restart    
    % December 8, 2013
    % Kadin Tseng, RCS, Boston University  (kadin@bu.edu)
    %%% Any information defined in original code that has not
    %%% been checkpointed should be placed here.
    matfile = 'mydata';   % should match that in test_checkpoint.m
    load(matfile);        % retrieve data from matfile
    iter1 = iter+1;       % iter is the last time test_checkpoint issued
                          % a save; we start computing on the next step
    %%%%%A = 0;           % should not initialize in restart
    for iter=iter1:Niter  % iteration loop starts from when it was last saved
       A = A + iter;      % computes the sum up to iter
       fprintf(1, 'Completed iteration %d\n', iter);
    end   % ending iteration loop
    Sas = Niter*(Niter+1)/2;   % correct sum of arithmetic series
    fprintf(1,'iter = %d; A = %d; Correct answer = %d\n', iter, A, Sas);
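
For the second type of application (bridging several batch jobs), it can be convenient to fold the start and restart logic into one script that resumes if a checkpoint file exists and starts fresh otherwise. The following is a minimal sketch of that idea, not part of the RCS package: it calls save and load directly instead of chkin and chkpt, writes to a temporary file rather than mydata.mat, and uses a break at Iteration 17 of the first pass as a stand-in for a system kill. The pass loop exists only so that the sketch runs both phases in one go.

```matlab
% Sketch only: one script that either starts fresh or resumes from a checkpoint.
matfile   = [tempname '.mat'];  % hypothetical checkpoint file in the temp folder
Niter     = 50;                 % total number of iterations
frequency = 7;                  % checkpoint every 7 iterations
for pass = 1:2                  % pass 1 is "killed"; pass 2 resumes
    if exist(matfile, 'file') == 2
        s = load(matfile);      % restores fields iter and A
        iter1 = s.iter + 1;     % resume on the step after the last save
        A = s.A;
    else
        iter1 = 1;              % no checkpoint yet: start from scratch
        A = 0;
    end
    fprintf('Pass %d starts at iteration %d\n', pass, iter1);
    for iter = iter1:Niter
        A = A + iter;           % computes the running sum
        if mod(iter, frequency) == 0
            s = struct('iter', iter, 'A', A);
            save(matfile, '-struct', 's');  % fields of s are saved, not s
        end
        if pass == 1 && iter == 17
            break               % stand-in for a system kill
        end
    end
end
fprintf('iter = %d; A = %d\n', iter, A);
delete(matfile);                % clean up the demo file
```

The first pass stops at Iteration 17 with its last save at Iteration 14, so the second pass resumes at Iteration 15 and finishes with A = 1275. In a real workflow each pass would be a separate batch job, and the checkpoint file would be deleted only after the final job completes.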