Compilation Process Explained

Understand the four stages of C compilation: preprocessing, compiling, assembling, and linking. Learn what happens behind the scenes when you run gcc to transform your code into an executable.

The Four Stages of Compilation

When you run gcc to compile a C program, it performs four distinct stages behind the scenes: preprocessing, compiling, assembling, and linking. Understanding these stages helps you debug compilation errors, optimize your code, and appreciate how C actually works. Each stage transforms your code one step closer to machine-executable binary.

GCC is actually a driver program that orchestrates these stages by calling specialized tools: the preprocessor (cpp), the compiler proper (cc1), the assembler (as), and the linker (ld). You can stop at any stage to inspect the intermediate output, which is invaluable for understanding what's happening and debugging complex issues.

The Four Stages:

Preprocessing: Handle #include, #define, comments → .i file
Compilation: Translate C to assembly code → .s file
Assembly: Convert assembly to machine code → .o file
Linking: Combine objects with libraries → executable

C
/* Example: program.c */
#include <stdio.h>
#define MAX 100

int main(void) {
    printf("Hello, World!\n");
    return 0;
}

/* Complete compilation (all stages): */
// gcc program.c -o program

/* Stop after each stage to see output: */
// gcc -E program.c -o program.i    // Stop after preprocessing
// gcc -S program.c -o program.s    // Stop after compilation (emits assembly)
// gcc -c program.c -o program.o    // Stop after assembly (emits object file)
// gcc program.o -o program         // Linking only

Stage 1: Preprocessing

The preprocessor runs first, handling all directives that start with #. It doesn't understand C syntax - it just performs text substitution and file inclusion. The preprocessor removes comments, includes header files, expands macros, and evaluates conditional compilation directives. The output is pure C code with all preprocessing done, ready for actual compilation.

When you #include a file, the preprocessor literally copies that file's entire contents into your source at that location. For stdio.h, this includes hundreds of lines of declarations and definitions. That's why the preprocessed output (.i file) is much larger than your original source code.

Macro expansion happens during preprocessing. If you #define MAX 100, every occurrence of the identifier MAX in your code (outside string literals and comments) gets replaced with 100 before compilation. The compiler never sees MAX - it only sees the number 100. This is why macros don't have type safety or scope like variables do; they're just text substitution.

C
/* Original source: */
#include <stdio.h>
#define MAX 100
#define SQUARE(x) ((x) * (x))

int main(void) {
    int value = MAX;
    printf("%d\n", SQUARE(value));
    return 0;
}

/* After preprocessing (simplified): */
// All stdio.h contents inserted here (hundreds of lines)
// Comments removed
// Macros expanded

int main(void) {
    int value = 100;  // MAX replaced
    printf("%d\n", ((value) * (value)));  // SQUARE expanded
    return 0;
}

/* View preprocessed output: */
// gcc -E program.c
// This shows exactly what the compiler sees

/* Save preprocessed output to file: */
// gcc -E program.c -o program.i
// gcc -E program.c | less  // View with pagination

Conditional compilation (#ifdef, #ifndef, #if, #else, #endif) also happens during preprocessing. This allows you to include or exclude code based on defined macros, enabling platform-specific code or debug builds. The code that doesn't meet the conditions is removed entirely before compilation.

Stage 2: Compilation (C to Assembly)

The compiler proper translates preprocessed C code into assembly language for your target CPU architecture (x86-64, ARM, etc.). Assembly is a human-readable representation of machine instructions, with mnemonics like mov, add, call instead of raw binary. This stage performs syntax checking, type checking, and optimization.

The compiler understands C semantics and transforms high-level constructs (loops, function calls, expressions) into sequences of low-level CPU instructions. It performs optimizations like constant folding, dead code elimination, and loop unrolling. Different optimization levels (-O0, -O1, -O2, -O3) control how aggressive these transformations are.

Assembly output is specific to your CPU architecture. The same C code produces different assembly on Intel x86, ARM, or RISC-V processors. This is why compiled output is not portable between architectures: from the compilation stage onward, everything GCC produces is architecture-specific. The C source itself remains portable, so you can recompile it for each target.

C
/* C source code: */
int add(int a, int b) {
    return a + b;
}

/* Generated assembly (x86-64, simplified): */
// gcc -S program.c produces program.s

add:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)   // Store 'a'
    movl    %esi, -8(%rbp)   // Store 'b'
    movl    -4(%rbp), %edx   // Load 'a'
    movl    -8(%rbp), %eax   // Load 'b'
    addl    %edx, %eax       // Add them
    popq    %rbp
    ret

/* With optimization (-O2): */
// Much simpler - compiler optimizes away stack operations

add:
    leal    (%rdi,%rsi), %eax  // Single instruction: eax = rdi + rsi
    ret

/* View assembly output: */
// gcc -S program.c           // Creates program.s
// gcc -S -O2 program.c       // With optimization
// cat program.s              // View the file

Examining assembly output helps you understand what the compiler does with your code. You can see how loops unroll, how functions inline, and how optimizations work. This is advanced, but valuable for performance-critical code where you need to ensure the compiler generates efficient instructions.

Stage 3: Assembly (Assembly to Object Code)

The assembler converts human-readable assembly language into machine code - actual binary instructions that the CPU can execute. The output is an object file (.o on Unix/Linux, .obj on Windows) containing machine code plus metadata like symbol tables, relocation information, and debugging data if requested.

Object files are not yet executable. They contain unresolved references to external functions and variables. For example, your code might call printf(), but the object file doesn't contain printf()'s actual implementation - it just notes "I need printf() from somewhere." The linker resolves these references in the next stage.

Object files use binary formats specific to your operating system: ELF (Executable and Linkable Format) on Linux, Mach-O on macOS, and COFF on Windows (where PE, Portable Executable, is the format of the linked executable). These formats structure the machine code with sections for code (.text), initialized data (.data), uninitialized data (.bss), and more.

C
/* Create object file without linking: */
// gcc -c program.c
// Creates program.o (object file)

/* Object file contents are binary, but we can inspect: */
// file program.o
// Output: ELF 64-bit LSB relocatable, x86-64...

// nm program.o                    // List symbols
// objdump -d program.o            // Disassemble object code
// objdump -t program.o            // Show symbol table
// readelf -h program.o            // Show ELF header (Linux)

/* Symbol table example: */
// $ nm program.o
// 0000000000000000 T main
//                  U puts    // U = undefined (needs linking)
// GCC may replace a simple printf("...\n") call with puts,
// so puts can appear here instead of printf

/* Compile multiple files to objects: */
// gcc -c file1.c -o file1.o
// gcc -c file2.c -o file2.o
// gcc -c file3.c -o file3.o
// Later link them together

Object files allow separate compilation - you can compile each source file independently and link them later. This is crucial for large projects with hundreds of source files. When you change one file, you only recompile that file (fast) rather than the entire project (slow). Build systems like make exploit this to speed up development.

Stage 4: Linking

The linker is the final stage, combining your object files with necessary libraries to create an executable program. It resolves all undefined symbols - finding the actual implementations of functions you called but didn't define yourself. The linker also determines memory layout, assigns final addresses to functions and variables, and creates the executable file format.

When your code calls printf(), the linker finds printf()'s implementation in the C standard library (libc) and connects your call to that implementation. It does this for every external function and variable. If the linker can't find something you referenced, you get "undefined reference" errors.

There are two types of linking: static and dynamic. Static linking copies library code directly into your executable, making it larger but self-contained. Dynamic linking creates references to shared libraries (.so on Linux, .dylib on macOS, .dll on Windows) that must be present at runtime. By default, GCC uses dynamic linking for system libraries.

C
/* Link object files into executable: */
// gcc file1.o file2.o file3.o -o program

/* Link with additional libraries: */
// gcc program.o -o program -lm        // Link math library
// gcc program.o -o program -lpthread  // Link pthread library

/* Static vs Dynamic linking: */
// gcc program.c -o program              // Dynamic (default)
// gcc program.c -o program -static      // Static linking

// ldd program                           // Show dynamic dependencies (Linux)
// otool -L program                      // Show dependencies (macOS)

/* Check file size difference: */
// gcc hello.c -o hello_dynamic
// gcc hello.c -o hello_static -static
// ls -lh hello_*
// Dynamic: ~16KB
// Static: ~800KB (includes all library code)
// Approximate sizes on x86-64 Linux with glibc; exact numbers vary by system

/* Common linker errors: */
// undefined reference to 'function_name'
// → Function declared but not defined, or missing library

// multiple definition of 'variable_name'
// → Same global variable defined in multiple files

// cannot find -lname
// → Library 'name' not found in search paths

The linker performs address relocation, updating the addresses embedded in your code now that it knows the final layout. It also handles symbol visibility (which functions/variables are exported from the executable) and records the entry point where the operating system starts executing your program; this is typically C runtime startup code that sets things up and then calls main().

Putting It All Together

Understanding these stages helps you debug compilation problems. Syntax errors occur during compilation, not preprocessing. Undefined reference errors occur during linking, not compilation. Macro expansion issues are preprocessing problems. Knowing which stage produces an error helps you fix it faster.

C
/* Complete compilation process example: */

// Step 1: Preprocessing
// gcc -E program.c -o program.i
// Expands macros, includes headers, removes comments
// Output: Pure C code

// Step 2: Compilation  
// gcc -S program.i -o program.s
// Translates C to assembly language
// Output: Assembly code (.s file)

// Step 3: Assembly
// gcc -c program.s -o program.o
// Converts assembly to machine code
// Output: Object file (.o file)

// Step 4: Linking
// gcc program.o -o program
// Links objects with libraries
// Output: Executable file

/* Or do all steps at once: */
// gcc program.c -o program

/* With verbose output to see all stages: */
// gcc -v program.c -o program
// Shows exact commands for each stage

/* Stopping at each stage for inspection: */
// gcc -E program.c > program.i      // View preprocessed
// gcc -S program.c                  // View assembly
// gcc -c program.c                  // Create object
// gcc program.o -o program          // Link to executable

Summary & What's Next

Key Takeaways:

✅ Compilation has four stages: preprocessing, compiling, assembling, linking
✅ Preprocessing handles #directives, macros, and includes
✅ Compilation translates C to assembly language
✅ Assembly converts assembly to binary machine code
✅ Linking combines objects with libraries into executable
✅ Use -E, -S, -c flags to stop at each stage
✅ Object files allow separate compilation
✅ Understanding stages helps debug compilation errors

What's Next?

Now let's learn C syntax and program structure!
