GNU and LLVM Compiler Flags : TechWeb : Boston University

The GNU family of compilers produces highly optimized code for Intel and AMD CPUs. Because the LLVM C and C++ compilers deliberately share the majority of their optimization flags with their GNU equivalents, the information here applies to both sets of compilers. As with all compilers, programs compiled with optimization should have their output double-checked for accuracy. If the numeric output is incorrect or lacks the desired accuracy, less aggressive compile options should be tried. The following table summarizes some relevant commands on the SCC for the GNU compilers:

Command	Description
module avail gcc	List available versions of the GNU compilers.
module load gcc/10.2.0	Load a particular version.
gcc	GNU C compiler.
g++	GNU C++ compiler.
gfortran	GNU Fortran 90/95/2003/etc compiler.
g77	GNU Fortran 77 compiler.

On AlmaLinux 8 the system gcc/g++/gfortran compilers are version 8.5.0.

The LLVM compiler commands are summarized here:

Command	Description
module avail llvm	List available versions of the LLVM compilers.
module load llvm/12.0.1	Load a particular version.
clang	LLVM C compiler.
clang++	LLVM C++ compiler.

Manuals are available for all of the compilers after their modules are loaded:

man g++
man gfortran
man clang

The optimization flags of the GNU Compiler Collection are described in an online document.

General Compiler Optimization Flags

The basic optimization flags are summarized below. Using these flags does not result in any incompatibility between CPU architectures.

Flag	Description
-O	Optimized compile.
-O2	More extensive optimization. This is the recommended flag for most codes.
-O3	More aggressive than -O2 with longer compile times. Recommended for codes with loops involving intensive floating point calculations.
-ffast-math	Allows for higher performance with floating point calculations at the risk of a slight loss of precision.
-Ofast	-O3 plus some extras. The GNU documentation notes that this option results in a disregard of “strict standards compliance.”
-flto	Link-time optimization, a step that examines function calls between files when the program is linked. This flag must be used both when compiling and when linking. Compile times are very long with this flag; however, depending on the application, there may be appreciable performance improvements when it is combined with the -O* flags. This flag and any optimization flags must be passed to the linker, and gcc/g++/gfortran should be called for linking instead of calling ld directly.
-mtune=processor	Applies additional tuning for a specific processor type. It does not generate extra SIMD instructions, so there are no architecture compatibility issues. The tuning involves optimizations for processor cache sizes, preferred ordering of instructions, and so on. On the SCC Intel nodes the useful values of processor are the same as the architecture names on the Tech Summary page. On the AMD Bulldozer nodes the value to use is bdver1, and on the AMD Epyc nodes the value is znver2.
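
The requirement that -flto appear at both the compile and the link step can be illustrated with a small two-file build. This is only a minimal sketch: it assumes gcc is on the PATH (for example after loading a gcc module), and the file names square.c, main.c and lto_demo are hypothetical, written to a scratch directory for illustration.

```shell
#!/bin/bash
# Two-step LTO build of a tiny program. The file names are hypothetical and
# are written to a scratch directory so that no existing files are touched.
workdir=$(mktemp -d)

cat > "$workdir/square.c" <<'EOF'
int square(int x) { return x * x; }
EOF

cat > "$workdir/main.c" <<'EOF'
#include <stdio.h>
int square(int x);
int main(void) { printf("%d\n", square(7)); return 0; }
EOF

# Compile step: -flto and the optimization level are given for every file.
# Link step: the same flags are repeated, and gcc (not ld) drives the link.
# The final command runs the result, which prints 49.
gcc -O2 -flto -c "$workdir/square.c" -o "$workdir/square.o" &&
gcc -O2 -flto -c "$workdir/main.c" -o "$workdir/main.o" &&
gcc -O2 -flto "$workdir/main.o" "$workdir/square.o" -o "$workdir/lto_demo" &&
"$workdir/lto_demo"
```

With link-time optimization the call to square() in main.c becomes visible to the optimizer even though the function is defined in a different file, which is where any performance gain would come from.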

Flags to Specify SIMD Instructions

These flags will produce executables that contain specific SIMD instructions, which may affect compatibility with compute nodes on the SCC. For AVX-512 instructions there are a variety of flags that can be used. If you are interested in compiling with support for those, the easiest way is to specify an Intel CPU architecture that supports AVX-512 using the -march=arch flag. For accepted architecture names check the manual for the compiler, e.g. man gcc

Flag	Description
-march=native	Creates an executable that uses SIMD instructions based on the CPU that is compiling the code. Additionally it includes the optimizations from the -mtune=native flag. Not recommended as code compiled on newer architectures will not run on older architectures.
-march=arch	This will generate SIMD instructions for a particular architecture and apply the -mtune optimizations. The useful values of arch are the same as for the -mtune flag above.
-mavx	Generates code with AVX instructions.
-mavx2	Generates code with AVX2 instructions. Code compiled with this flag will not be able to run on CPU architectures without AVX2 instructions.
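
Before choosing one of these flags it can help to confirm which SIMD extensions the CPU of a given node reports. The following is a minimal sketch, assuming a Linux node where /proc/cpuinfo is readable. The function name cpu_has is hypothetical and is not an SCC or compiler tool.

```shell
#!/bin/bash
# cpu_has FLAG [FLAGS_STRING]
# Returns 0 if FLAG appears as a whole word in the CPU flag list.
# When FLAGS_STRING is omitted, the list is read from /proc/cpuinfo (Linux).
cpu_has() {
    local want="$1"
    local flags="${2:-$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)}"
    local f
    for f in $flags; do
        if [ "$f" = "$want" ]; then
            return 0
        fi
    done
    return 1
}

# Report the extensions that correspond to -msse4.2, -mavx, -mavx2 and
# the base AVX-512 feature, using the names found in /proc/cpuinfo.
for ext in sse4_2 avx avx2 avx512f; do
    if cpu_has "$ext"; then
        echo "$ext: supported"
    else
        echo "$ext: not supported"
    fi
done
```

Running this on a login node and again inside a job on a compute node shows whether code built with a given flag on one can execute on the other.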

Default Optimization Behavior

Most open source programs that compile from source code use the -O2 or -O3 flags. This results in fast code that can run on any compute node on the SCC. The -march=native flag, which some open source programs use by default, can be problematic when used on the login nodes, as they have Broadwell architecture CPUs which support AVX2 instructions. Codes compiled with -march=native on a login node will only be able to execute on Broadwell architecture compute nodes on the SCC.
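
To see what -march=native actually selects on the machine doing the compiling, gcc can be asked to print its resolved target options. The sketch below assumes gcc is on the PATH; the -Q --help=target query is specific to gcc and is not expected to work with clang. The helper name march_from_help is hypothetical.

```shell
#!/bin/bash
# march_from_help: read the output of "gcc -Q --help=target" on stdin and
# print the value shown on the "-march=" line. The function name is
# hypothetical; it is not part of gcc or of the SCC tools.
march_from_help() {
    awk '$1 == "-march=" { print $2; exit }'
}

# Ask gcc which architecture -march=native resolves to on this node.
gcc -march=native -Q --help=target 2>/dev/null | march_from_help

# List the SIMD-related macros the compiler defines for that architecture.
gcc -march=native -dM -E - < /dev/null 2>/dev/null \
    | grep -E '__(SSE4_2|AVX|AVX2|AVX512F)__' \
    || echo "no matching SIMD macros defined"
```

If the reported architecture is newer than that of the compute nodes the job may land on, an explicit -march=arch value is the safer choice.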

Recommendations

Most codes will be well optimized with the -O2 or -O3 flags plus the -msse4.2 flag. Programs that involve intensive floating-point calculations inside of loops can additionally be compiled with the -march=arch flag. For maximum cross-compatibility across the SCC compute nodes and probable highest performance, a combination of flags should be used:

g++ -O3 -march=sandybridge -mtune=intel -c mycode.cpp
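
Since optimized builds should have their output double-checked, one simple approach is to compare an optimized executable against an unoptimized reference on the same input. The following is a minimal sketch; the function name outputs_match and the executable names in the usage comments are hypothetical.

```shell
#!/bin/bash
# outputs_match REF_EXE OPT_EXE [ARGS...]
# Runs both executables with the same arguments and compares their stdout.
# Returns 0 if the outputs are identical, 1 if they differ,
# and 2 if either program exits with an error.
outputs_match() {
    local ref="$1"
    local opt="$2"
    shift 2
    local ref_out
    local opt_out
    ref_out=$("$ref" "$@") || return 2
    opt_out=$("$opt" "$@") || return 2
    [ "$ref_out" = "$opt_out" ]
}

# Hypothetical usage: build a reference and an optimized binary, then
# compare them on the same arguments.
#   g++ -O0 mycode.cpp -o mycode_ref
#   g++ -O3 -march=sandybridge -mtune=intel mycode.cpp -o mycode_opt
#   outputs_match ./mycode_ref ./mycode_opt arg1 && echo "outputs agree"
```

This comparison is exact, so it flags any difference at all. For floating-point results, small differences in the last printed digits may be acceptable, and a tolerance-based comparison suited to the program's output format would be needed instead.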

Note that selecting specific SIMD instructions with the -mavx* flag or -march=arch flag will restrict compatibility with compute nodes unless the job is submitted with this qsub flag: -l cpu_arch=compatible_arch. The compatible_arch value is an architecture name that matches the SIMD instructions. Alternatively, the qsub flag -l cpu_arch=\!compatible_arch can be used to exclude an incompatible architecture:

g++ -O3 -ffast-math -march=broadwell mycode.cpp -o mycode
qsub -l cpu_arch=broadwell -b y mycode
# OR...as the -march=broadwell has produced AVX2 instructions
# select nodes that support AVX2.
qsub -l avx2 -b y mycode

Another option is to compile the code as part of a batch job which completely avoids any architectural issues and allows for the maximum amount of optimizations. For example, a job that is submitted to run on a Buy-in node equipped with an Ivybridge architecture CPU could be compiled with tunings for that node. As a precaution the source is copied into $TMPDIR:

Example Batch Script to Recompile on a Compute Node

#!/bin/bash -l
#$ -l cpu_arch=ivybridge
module load gcc/9.3.0

# Copy the source to $TMPDIR to avoid interaction
# with other jobs running
cp -R /projectnb/myproject/mysource $TMPDIR
cd $TMPDIR/mysource

gcc -O3 -march=native -ffast-math -c file1.c
gcc -O3 -march=native -ffast-math -c file2.c
gcc -o myexe file1.o file2.o -lm
./myexe arg1 arg2 ....