Contemporary desktop and laptop PCs are, more often than not, equipped with multicore processors. On the one hand, the MATLAB Parallel Computing Toolbox (PCT) may be used to speed up computations on these PCs. On the other hand, because these cores, or threads, share a common memory, many of MATLAB’s thread-parallel enabled functions can be put to work automatically without any code modifications. This is referred to as implicit parallelism, to distinguish it from explicit parallelism, to which the PCT belongs. Vectorized operations are a necessary, but not sufficient, trigger for implicit parallel computation. The particular application or algorithm and the amount of computation also determine whether MATLAB will run an operation with multiple threads. Here is a list of application functions with multithreading potential.

On the Shared Computing Cluster, many of the nodes are multicore. Some nodes have 4 cores and others have 8. Implicit parallelism may be used for MATLAB parallel computing within each of these nodes, but not across them.

Examples of implicit parallel applications include computation of a large number of trigonometric functions, linear algebraic operations such as matrix multiplication, and other matrix operations available in LAPACK (Linear Algebra PACKage) and some level-3 BLAS (Basic Linear Algebra Subprograms) matrix operations. This capability was available to MATLAB users long before the Parallel Computing Toolbox existed.

On a multicore computer, by default, multithreading is turned on to use all threads available. To control the thread count, the (soon-to-be-deprecated) maxNumCompThreads command may be used. For example, you can vary the thread count to study speedup and parallel efficiency trends of an application, such as matrix multiplication (see the examples below).
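
As a minimal sketch of the thread-count control described above, the snippet below saves the current setting, lowers it for one computation, and then restores it. The matrix size is arbitrary and chosen only for illustration, and the calls assume a MATLAB release in which maxNumCompThreads is still available.

```matlab
% Save the current maximum number of computational threads.
nOld = maxNumCompThreads;

% Restrict MATLAB to a single computational thread; the call
% returns the setting that was in effect before the change.
nPrev = maxNumCompThreads(1);

A = rand(1000);          % illustrative size, not from the text
tic
B = A * A;               % runs with at most one thread
tSingle = toc;

% Restore the original thread count.
maxNumCompThreads(nOld);
fprintf('1 thread: %.3f s (previous setting: %d threads)\n', tSingle, nPrev);
```

Restoring the saved value at the end keeps a timing experiment from silently slowing down whatever runs later in the same session.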

Beginning with the R2009b Release, you can turn multithreading off by starting MATLAB with -singleCompThread.
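
One way to confirm that a session was started single-threaded is to query the thread count from inside MATLAB. The sketch below assumes MATLAB was launched from a shell with the option named above; in that case the query is expected to report one thread.

```matlab
% Assumed launch command (typed in a shell, not inside MATLAB):
%   matlab -singleCompThread
nThreads = maxNumCompThreads;   % query only; does not change the setting
if nThreads == 1
    disp('MATLAB is limited to a single computational thread.');
else
    fprintf('MATLAB may use up to %d computational threads.\n', nThreads);
end
```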

When a Parallel Computing Toolbox command such as spmd is in effect, multithreading is preempted. Note that requesting resources with matlabpool open does not, in itself, prevent multithreading, nor does it change the thread count in effect.
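
The following sketch lets you check this behavior on your own system by printing the thread limit seen by the client session and by each worker inside an spmd block. It assumes the Parallel Computing Toolbox is installed and that a pool of workers has already been opened. The worker index labindex is the name used by releases of the matlabpool era; newer releases may name it differently.

```matlab
% Thread limit in the client session.
fprintf('client: up to %d threads\n', maxNumCompThreads);

% Thread limit seen by each worker (assumes an open pool).
spmd
    fprintf('worker %d: up to %d threads\n', labindex, maxNumCompThreads);
end

% Thread limit in the client after the spmd block.
fprintf('client after spmd: up to %d threads\n', maxNumCompThreads);
```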

One of the most common and effective ways to exploit multithreading is to use vector operations in place of for-loops. The use of vector operations, by itself, often improves computational efficiency significantly on single-core processors. On multicore processors, it often results in further speedup, as demonstrated in the following examples.

Example 1. The effect of multithreading on vector operations

n = 5000000;         % set vector length
x = zeros(n,1);
del = 2*pi/n;
% for-loop implementation (will not trigger multithreading)
tic
for i=1:n
  t = i*del;
  x(i) = (sin(t)*exp(-t))^3 +  (t^4+5*t^-2)^0.3;
end
toc

% vector implementation (may trigger multithreading depending on 
% the type of computation and work load)
walltime = zeros(1,4);     % preallocate the timing array
for i=1:4
  m = 2^(i-1);             % thread count: 1, 2, 4, or 8
  maxNumCompThreads(m);    % set the thread count
  tic                      % starts timer
  t = (1:n)*del;
  x = (sin(t).*exp(-t)).^3 +  (t.^4+5*t.^-2).^0.3;
  walltime(i) = toc;       % wall clock time
  Speedup = walltime(1)/walltime(i);
  Efficiency = 100*Speedup/m;
  fprintf('%d threads: %.2f s, speedup %.1f, efficiency %.1f%%\n', ...
          m, walltime(i), Speedup, Efficiency);
end

Table 1. Effect of multithreading on vector operations

Number of threads        Wall clock time (s) *   Speedup †   Efficiency ‡
for-loop operations      10.81                   N/A         N/A
1 (vector operations)    0.53                    1.0         100.0%
2                        0.27                    1.9         97.2%
4                        0.14                    3.7         92.7%
8                        0.08                    6.6         83.1%

* Timings collected on Intel Xeon X5570 2.93 GHz processors.
† Speedup = T1/TN, where T1 and TN are the wall clock times for 1 and N threads, respectively.
‡ Efficiency = 100*Speedup/N

As shown, converting the for-loop into an equivalent vector operation yields the largest single gain in computational efficiency (the wall clock time drops from 10.81 to 0.53 seconds, roughly a 20-fold reduction), even though multithreading then scales almost linearly with the thread count.
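
The speedup and efficiency definitions in the table footnotes can be applied directly to the published timings, as in the sketch below. The variable names are illustrative. Because the timings in Table 1 are rounded to two decimals, the recomputed values differ slightly from the table, which was presumably derived from unrounded measurements.

```matlab
threads  = [1 2 4 8];
walltime = [0.53 0.27 0.14 0.08];     % seconds, from Table 1
tLoop    = 10.81;                     % for-loop time, from Table 1

speedup    = walltime(1) ./ walltime;       % T1/TN
efficiency = 100 * speedup ./ threads;      % percent
vecGain    = tLoop / walltime(1);           % for-loop vs. 1-thread vector

for k = 1:numel(threads)
    fprintf('%d threads: speedup %.2f, efficiency %.1f%%\n', ...
            threads(k), speedup(k), efficiency(k));
end
fprintf('vectorization alone: %.1fx faster\n', vecGain);
```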

Example 2. Matrix Multiplication (multiplying an NxN matrix by another NxN matrix)

n = 2000;                   % set matrix size
A = rand(n);                % create random matrix
B = rand(n);                % create another random matrix
% for-loop implementation (will not trigger multithreading)
tic
C = zeros(n);
for j=1:n
  for i=1:n
    for k=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j);
    end
  end
end
toc

% vector implementation (may trigger multithreading)
walltime = zeros(1,4);         % preallocate the timing array
for i=1:4
   m = 2^(i-1);                % thread count: 1, 2, 4, or 8
   maxNumCompThreads(m);       % set the thread count
   tic                         % starts timer
   C = A * B;                  % matrix multiplication
   walltime(i) = toc;          % wall clock time
   Speedup = walltime(1)/walltime(i);
   Efficiency = 100*Speedup/m;
   fprintf('%d threads: %.2f s, speedup %.1f, efficiency %.1f%%\n', ...
           m, walltime(i), Speedup, Efficiency);
end

Table 2. Effect of multithreading on matrix multiplication

Number of threads        Wall clock time (s) *   Speedup †   Efficiency ‡
for-loop operations      347.2                   N/A         N/A
1 (vector operations)    1.45                    1.0         100.0%
2                        0.74                    2.0         98.2%
4                        0.39                    3.7         92.2%
8                        0.22                    6.5         81.0%

* Timings collected on Intel Xeon X5570 2.93 GHz processors.
† Speedup = T1/TN, where T1 and TN are the wall clock times for 1 and N threads, respectively.
‡ Efficiency = 100*Speedup/N
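
A single tic/toc measurement can be noisy. As an alternative sketch, timeit (available in more recent MATLAB releases) calls the operation several times and returns a typical time. The smaller matrix size and the list of thread counts below are assumptions chosen to keep the run short; they are not the settings used for Table 2.

```matlab
n = 500;                          % smaller than in Example 2, for a quick run
A = rand(n);
B = rand(n);

nOld   = maxNumCompThreads;       % remember the current setting
counts = [1 2 4];                 % thread counts to try (illustrative)
t      = zeros(size(counts));

for k = 1:numel(counts)
    maxNumCompThreads(counts(k)); % set the thread count
    t(k) = timeit(@() A*B);       % typical time over repeated calls
end
maxNumCompThreads(nOld);          % restore the original setting

speedup = t(1) ./ t;              % T1/TN
for k = 1:numel(counts)
    fprintf('%d threads: %.4f s, speedup %.2f\n', counts(k), t(k), speedup(k));
end
```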

Example 3. Effect of multithreading on solving a linear system of equations, Ax=b

% Benchmarking Ax = b algebraic system of equations
% with multithreading
n = 8000;         % set matrix size
M = rand(n);      % create random matrix
A = M + M';       % create A as a symmetric real matrix
x = ones(n,1);    % define solution x as a vector of ones
b = A * x;        % compute RHS b from A and x

walltime = zeros(1,4);    % preallocate the timing array
for i=1:4
  m = 2^(i-1);            % thread count: 1, 2, 4, or 8
  maxNumCompThreads(m);   % set the thread count
  tic                     % starts timer
  y = A\b;                % solves Ay = b; y should equal x
  walltime(i) = toc;      % wall clock time
  Speedup = walltime(1)/walltime(i);
  Efficiency = 100*Speedup/m;
  fprintf('%d threads: %.2f s, speedup %.1f, efficiency %.1f%%\n', ...
          m, walltime(i), Speedup, Efficiency);
end

Table 3. Timing table for solving Ax=b with multithreading

Number of threads   Wall clock time (s) *   Speedup †   Efficiency ‡
1                   6.49                    1.0         100.0%
2                   3.64                    1.8         89.2%
4                   2.18                    3.0         74.3%
8                   1.49                    4.3         54.4%

* Timings collected on Intel Xeon X5570 2.93 GHz processors.
† Speedup = T1/TN, where T1 and TN are the wall clock times for 1 and N threads, respectively.
‡ Efficiency = 100*Speedup/N

With multithreading on, the above solver (y = A\b) solves for y in parallel with the appropriate LAPACK routine.
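
Because the exact solution x is known in advance in this benchmark, the accuracy of the computed y can be checked alongside the timing. The sketch below uses a smaller system than Example 3 so that it runs quickly; the size and the variable names relResidual and relError are assumptions made for illustration.

```matlab
n = 500;                 % smaller than in Example 3, for a quick check
M = rand(n);             % random matrix
A = M + M';              % symmetric real matrix, as in Example 3
x = ones(n,1);           % known solution
b = A * x;               % right-hand side built from the known solution

y = A \ b;               % backslash solve; y should be close to x

relResidual = norm(A*y - b) / norm(b);   % how well y satisfies Ay = b
relError    = norm(y - x) / norm(x);     % distance from the known solution
fprintf('relative residual: %.2e, relative error: %.2e\n', ...
        relResidual, relError);
```

The relative error can be noticeably larger than the relative residual when A happens to be poorly conditioned, so it is worth printing both.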
