The GNU family of compilers produces highly optimized code for Intel and AMD CPUs.  Because the LLVM C and C++ compilers deliberately share the majority of their optimization flags with their GNU equivalents, the information here applies to both sets of compilers.  As with all compilers, programs compiled with optimization should have their output double-checked for accuracy. If the numeric output is incorrect or lacks the desired accuracy, less aggressive compile options should be tried. The following table summarizes some relevant commands on the SCC for the GNU compilers:

Command                 Description
module avail gcc        List available versions of the GNU compilers.
module load gcc/10.2.0  Load a particular version.
gcc                     GNU C compiler.
g++                     GNU C++ compiler.
gfortran                GNU Fortran 90/95/2003/etc compiler.
g77                     GNU Fortran 77 compiler.

The LLVM compiler commands are summarized here:

Command                 Description
module avail llvm       List available versions of the LLVM compilers.
module load llvm/9.0.1  Load a particular version.
clang                   LLVM C compiler.
clang++                 LLVM C++ compiler.

Manuals are available for all of the compilers after their modules are loaded:

man g++
man gfortran
man clang

The optimization flags for the GNU Compiler Collection are described in its online documentation.

General Compiler Optimization Flags

The basic optimization flags are summarized below. Using these flags does not result in any incompatibility between CPU architectures.

Flag              Description
-O                Optimized compile.
-O2               More extensive optimization.  This is the recommended flag for most codes.
-O3               More aggressive than -O2 with longer compile times. Recommended for codes with loops involving intensive floating-point calculations.
-ffast-math       Allows for higher performance with floating-point calculations at the risk of a slight loss of precision.
-Ofast            -O3 plus some extras, including -ffast-math. The GNU documentation notes that this option results in a disregard of "strict standards compliance."
-flto             (GNU only) Link-time optimization, a step that examines function calls between files when the program is linked. This flag, along with any optimization flags, must be used both when compiling and when linking, and gcc/g++/gfortran should be called for linking instead of calling ld directly. Compile times are very long with this flag; however, depending on the application, there may be appreciable performance improvements when it is combined with the -O* flags.
-mtune=processor  Does additional tuning for specific processor types; however, it does not generate extra SIMD instructions, so there are no architecture compatibility issues. The tuning involves optimizations for processor cache sizes, preferred ordering of instructions, and so on. The useful values of processor on the SCC Intel nodes are the same as the architecture flags on the Tech Summary page.  On the AMD Bulldozer nodes the value to use is bdver1, and on the AMD Epyc nodes the value is znver2.
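The -flto entry above describes a two-step workflow: every source file is compiled with -flto, and the same flags are repeated when linking through the compiler driver. The following is a minimal sketch of that workflow, not a prescribed build recipe. The file names main.c, util.c, and app are hypothetical, as is the use of a temporary directory; the CC variable is an assumption that lets the sketch fall back gracefully when no compiler module is loaded.

```shell
#!/bin/bash
# Sketch of a two-step LTO build. main.c, util.c and app are hypothetical names.
workdir=$(mktemp -d)

# A function defined in one file...
cat > "$workdir/util.c" <<'EOF'
int square(int x) { return x * x; }
EOF

# ...and called from another, so LTO can optimize across the file boundary.
cat > "$workdir/main.c" <<'EOF'
#include <stdio.h>
int square(int x);
int main(void) {
    printf("%d\n", square(7));
    return 0;
}
EOF

CC=${CC:-gcc}
if command -v "$CC" >/dev/null 2>&1; then
    # Compile each file with -flto and the chosen optimization level.
    "$CC" -O2 -flto -c "$workdir/util.c" -o "$workdir/util.o"
    "$CC" -O2 -flto -c "$workdir/main.c" -o "$workdir/main.o"
    # Link through the compiler driver (not ld), repeating the same flags.
    "$CC" -O2 -flto "$workdir/util.o" "$workdir/main.o" -o "$workdir/app"
    lto_result=$("$workdir/app")
else
    echo "$CC not found; load a compiler module first" >&2
    lto_result="skipped"
fi
echo "app printed: $lto_result"
```

For C++ or Fortran sources the same pattern should apply with g++ or gfortran in place of gcc.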

 

Flags to Specify SIMD Instructions

These flags will produce executables that contain specific SIMD instructions, which may affect compatibility with compute nodes on the SCC.

Flag           Description
-march=native  Creates an executable that uses SIMD instructions based on the CPU that is compiling the code. Additionally, it includes the optimizations from the -mtune=native flag.
-march=arch    Generates SIMD instructions for a particular architecture and applies the -mtune optimizations.  The useful values of arch are the same as for the -mtune flag above.
-msse4.2       Generates code with SSE4.2 instructions.
-mavx          Generates code with AVX instructions. Code compiled with this flag will not be able to run on Nehalem architecture cores.
-mavx2         Generates code with AVX2 instructions. This requires an additional module to be loaded before compiling: module load binutils/2.28. Code compiled with this flag will not be able to run on Nehalem, Sandybridge, Ivybridge, or the AMD Bulldozer cores.
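Because these flags tie an executable to particular SIMD instruction sets, it can help to check which sets the current node actually supports before choosing one. The sketch below reads the feature flags that Linux reports in /proc/cpuinfo, where SSE4.2, AVX, and AVX2 appear as sse4_2, avx, and avx2. The helper name cpu_flags is hypothetical, and the sketch assumes an x86 Linux node.

```shell
#!/bin/bash
# cpu_flags [FILE]: print the CPU feature flags, one per line.
# Hypothetical helper; FILE defaults to /proc/cpuinfo on the current node.
cpu_flags() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null \
        | cut -d: -f2 | tr -s ' ' '\n' | sed '/^$/d'
}

# Report support for the instruction sets selected by -msse4.2, -mavx, -mavx2.
for f in sse4_2 avx avx2; do
    if cpu_flags | grep -qx "$f"; then
        echo "$f: supported"
    else
        echo "$f: not supported"
    fi
done
```

Running this inside an interactive or batch job reports on the compute node itself rather than on the login node.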

Default Optimization Behavior

Most open source programs that compile from source code use the -O2 or -O3 flags. This will result in fast code that can run on any compute node on the SCC. The -march=native flag, which is sometimes used by default in open source programs, can be problematic when used on the login nodes, as they have Broadwell architecture CPUs, which support AVX2 instructions. Codes compiled with -march=native on a login node will only be able to execute on Broadwell architecture compute nodes on the SCC.
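To see what -march=native would select on the machine doing the compiling, GCC can print its resolved target options with -Q --help=target. The sketch below extracts the -march value from that listing. The helper name parse_march is hypothetical, and the parsing assumes the two-column layout printed by recent GCC releases, so treat it as illustrative only.

```shell
#!/bin/bash
# parse_march: read "gcc -Q --help=target" output on stdin and print the
# value shown for -march=. Hypothetical helper; assumes GCC's two-column layout.
parse_march() {
    awk '$1 == "-march=" { print $2; exit }'
}

CC=${CC:-gcc}
if command -v "$CC" >/dev/null 2>&1; then
    # -Q --help=target lists the target options in effect for the given flags.
    native=$("$CC" -march=native -Q --help=target 2>/dev/null | parse_march || true)
    echo "-march=native resolves to: ${native:-unknown}"
else
    echo "$CC not found; load a compiler module first" >&2
fi
```

If the reported architecture is newer than that of the intended compute nodes, an explicit -march=arch or -msse4.2 is the safer choice.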

Recommendations

Most codes will be well optimized with the -O2 or -O3 flags plus the -msse4.2 flag. Programs that involve intensive floating-point calculations inside loops can additionally be compiled with the -march=arch flag.  For maximum compatibility across the SCC compute nodes and probably the highest performance, a combination of flags should be used:

gcc -O3 -ffast-math -msse4.2 -mtune=intel -c mycode.cpp

Floating-point intensive code can benefit from the use of -mavx or -mavx2 instead of -msse4.2, depending on the compute node that will be used.
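Since -ffast-math and the SIMD flags can change floating-point results slightly, one way to follow the advice about double-checking output is to compare an optimized build against an unoptimized reference. The sketch below does this for a small test program. The helper within_tol, the file name harmonic.c, and the relative tolerance of 1e-12 are all hypothetical choices; an appropriate tolerance depends on the application.

```shell
#!/bin/bash
# within_tol A B TOL: succeed if |A-B| <= TOL * max(|A|,|B|).
# Hypothetical helper; awk does the floating-point comparison.
within_tol() {
    awk -v a="$1" -v b="$2" -v tol="$3" 'BEGIN {
        d = a - b; if (d < 0) d = -d
        m = (a < 0 ? -a : a); n = (b < 0 ? -b : b); if (n > m) m = n
        exit !(d <= tol * m)
    }'
}

workdir=$(mktemp -d)
# Small floating-point loop used as the test case (hypothetical program).
cat > "$workdir/harmonic.c" <<'EOF'
#include <stdio.h>
int main(void) {
    double sum = 0.0;
    int i;
    for (i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    printf("%.15f\n", sum);
    return 0;
}
EOF

CC=${CC:-gcc}
if command -v "$CC" >/dev/null 2>&1; then
    "$CC" -O0 "$workdir/harmonic.c" -o "$workdir/ref"               # reference build
    "$CC" -O3 -ffast-math "$workdir/harmonic.c" -o "$workdir/fast"  # optimized build
    ref=$("$workdir/ref")
    fast=$("$workdir/fast")
    if within_tol "$ref" "$fast" 1e-12; then
        echo "OK: optimized result $fast agrees with reference $ref"
    else
        echo "MISMATCH: $fast vs $ref; try less aggressive flags"
    fi
else
    echo "$CC not found; load a compiler module first" >&2
fi
```

For a real code, the same comparison would be run on the program's own output with a tolerance chosen for that calculation.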

Note that selecting specific SIMD instructions with the -mavx* flags or the -march=arch flag will restrict compatibility with compute nodes unless the job is submitted with this qsub flag: -l cpu_arch=compatible_arch. The compatible_arch value is an architecture name that matches the SIMD instructions.  Alternatively, the qsub flag -l cpu_arch=\!compatible_arch can be used to exclude an incompatible architecture:

g++ -O3 -ffast-math -march=haswell mycode.cpp -o mycode
qsub -l cpu_arch=haswell -b y mycode
# OR, as -march=haswell has produced AVX instructions,
# select nodes that support AVX.
qsub -l avx -b y mycode

Another option is to compile the code as part of a batch job, which completely avoids any architectural issues and allows for the maximum amount of optimization. For example, a job that is submitted to run on a Buy-in node equipped with an Ivybridge architecture CPU could be compiled with tunings for that node. As a precaution the source is copied into $TMPDIR: