How GCC Actually Compiles
GCC originally stood for GNU C Compiler. Today it is the GNU Compiler Collection, with frontends for C, C++, Fortran, Ada, Go and more, a shared middle-end, and many backends. VAX and m68k were the first in 1987; today the list spans everything from modern x86-64 and ARM to legacy PDP-11 and MSP430. Each frontend parses its language and lowers it to a common intermediate representation. From there, the shared transformation passes take over, most of them entirely generic regardless of whether the target is a modern 64-core server or our humble 68000.
What happens between "C text goes in" and "assembly comes out" is roughly 360 of these passes, each rewriting the intermediate representation. Some optimize. Some check for errors. Some transform for consistency. Most do a little of everything. We will focus on the ones where our beloved 68000 needs the most help. I have put together a summary of all GCC passes for reference.
Debugging this pipeline is where the fun and the pain live. When the output is wrong, which of the 360 passes is at fault? Often it is not the obvious one; a bad decision in pass 47 might not surface as wrong or inefficient code until pass 180. Understanding the stages, even roughly, is the key to knowing where to look.
GIMPLE: What It Is and Why It Matters
The first ~210 passes operate on GIMPLE, GCC's high-level intermediate representation. If you squint, it looks like simplified C: three-address statements, no complex expressions, explicit temporaries. No registers, no instructions, just operations and variables.
Here is our running example, int base = y * 80 + ((x >> 4) * 4), shown as simplified GIMPLE (the real dump is more verbose):
_1 = y_4(D) * 20;
_2 = x_5(D) >> 4;
_3 = _1 + _2;
_6 = _3 * 4;
You could also read each line as a three-operand assembly instruction: _3 = _1 + _2 becomes add _1,_2,_3. But that is getting ahead of ourselves.
Each GIMPLE statement carries a source location (file and line number), so debug information survives even after heavy optimization. One statement, one operation, one result. _1 through _6 are compiler-generated temporaries, each assigned exactly once, making data dependencies explicit and transformations safe. The original variable names survive as annotations: y_4(D) means "version 4 of y, a function parameter (D)". When debugging a miscompilation, isolate the problem into the smallest possible test case and search the dump. Searching for y immediately shows us that _1 = y_4(D) * 20 is the multiply we care about.
Sharp-eyed readers will notice that the y * 80 is already gone, replaced by y * 20. The compiler factored out the common * 4: y * 80 + (x >> 4) * 4 became (y * 20 + (x >> 4)) * 4 during C parsing, before the first GIMPLE pass even ran. The frontend is already smarter than "just translate each operation," and this is a theme throughout: each stage inherits the work of every stage before it.
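The factoring is easy to verify in plain C; the two helper functions below are mine for illustration, not anything the frontend emits:

```c
#include <assert.h>

/* The expression as written in the source... */
int as_written(int x, int y)    { return y * 80 + ((x >> 4) * 4); }

/* ...and as the C frontend refactors it, with the common * 4 pulled out. */
int as_refactored(int x, int y) { return (y * 20 + (x >> 4)) * 4; }
```

Both compute the same value for every input, which is all the frontend needs to know before it commits to the cheaper form.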
GIMPLE is the workhorse for the large, fundamental transformations. Because every statement has one operation and one result, side effects are obvious, data dependencies are explicit, and it is pretty safe to move, remove, or replace statements. Consider put_pixel: mask = 0x8000 >> (x & 15) depends only on x, not the loop counter p, so the Loop Invariant Motion pass can hoist the computation out of the loop, and likewise with ~mask. In the final assembly, not.w %d4 sits above .L4:. Optimization happens at multiple levels, and what one stage misses, a later stage can still fix.
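Hand-hoisted, the transformation looks like this. This is a simplified stand-in for put_pixel: the buffer layout, plane count, and function name are assumptions for illustration, not the post's actual routine:

```c
#include <assert.h>
#include <stdint.h>

uint16_t screen[4];  /* one word per bitplane; layout is illustrative */

void put_pixel_hoisted(int x, int col)
{
    /* Both values depend only on x, so Loop Invariant Motion computes
       them once, before the loop, instead of on every iteration. */
    uint16_t mask = 0x8000 >> (x & 15);
    uint16_t inv  = ~mask;            /* the hoisted ~mask (not.w above .L4:) */

    for (int p = 0; p < 4; p++) {
        if (col & (1 << p))
            screen[p] |= mask;        /* set the pixel bit in this plane */
        else
            screen[p] &= inv;         /* clear it */
    }
}
```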
GIMPLE was introduced in GCC 4.0 (2005), replacing what had previously been done entirely in RTL. The biggest strides in optimization have happened in the ~20 years since. These are the passes that keep modern GCC competitive even for a neglected backend.
A loop has at least one induction variable (IV): a value that changes predictably each iteration, like our p in for (int p = 0; p < 4; p++). Strength reduction replaces expensive operations on an IV with cheaper equivalents; instead of computing screen[base + p] each iteration, the compiler keeps a pointer and adds 2, which a later stage can turn into an auto-increment. A few of the most important GIMPLE passes:
- Inlining (gcc/ipa-inline.cc): Expands small functions at call sites, eliminating call overhead and exposing cross-function optimization. Transformative for C++.
- Loop Invariant Motion (gcc/tree-ssa-loop-im.cc): Hoists computations that do not change between iterations out of loops. Runs four times at different points in the pipeline.
- IVOPTS (gcc/tree-ssa-loop-ivopts.cc): Strength-reduces induction variables. Turns base + p array indexing into a single incrementing pointer, replacing per-iteration multiplies with adds. Also critical for preserving loop counters that can later become dbra.
- Dead Code Elimination (gcc/tree-ssa-dce.cc): Removes computations whose results are never used. Runs six times, as each round of optimization can expose new dead code.
- Redundancy Elimination (gcc/tree-ssa-sccvn.cc): Reuses an already-computed expression instead of recomputing it. a = x + y; ... b = 8 + x + y becomes a = x + y; ... b = 8 + a. Runs four times.
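The IVOPTS rewrite is worth seeing side by side. Both functions below are hand-written sketches of before and after (the buffer and sizes are mine); on m68k the pointer version is what later becomes a (a0)+ auto-increment:

```c
#include <assert.h>
#include <stdint.h>

uint16_t screen[64];  /* illustrative buffer */

/* Before IVOPTS: address screen + 2*(base + p) recomputed each iteration. */
void set_indexed(int base, uint16_t mask)
{
    for (int p = 0; p < 4; p++)
        screen[base + p] |= mask;
}

/* After IVOPTS: the address itself is the induction variable.
   ptr++ on a uint16_t* advances 2 bytes, exactly the "adds 2" above. */
void set_strength_reduced(int base, uint16_t mask)
{
    uint16_t *ptr = &screen[base];   /* computed once, before the loop */
    for (int p = 0; p < 4; p++)
        *ptr++ |= mask;
}
```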
Both GIMPLE and RTL organize code into basic blocks: straight-line sequences with one entry and one exit, like the code between two labels in assembly. Here is put_pixel's inner loop with basic block boundaries annotated:
for (int p = 0; p < 4; p++) { // ── BB3: loop header ──
if (col & (1 << p)) // branch to BB4 or BB5
screen[base + p] |= mask; // ── BB4: set bit ──
else // (falls through to BB6)
screen[base + p] &= ~mask; // ── BB5: clear bit ──
} // ── BB6: rejoin, loop back ──
Each path through the if/else is its own basic block because the branch breaks the straight-line flow. The loop header, the two branches, and the rejoin point are four separate blocks. Moving statements around within a basic block is straightforward, as long as you respect data dependencies (a + b is the same as b + a, but a / b cannot be reordered). Moving statements between blocks is harder: you must consult the Control Flow Graph to ensure a computation is not moved out of the path where its result is needed.
The GIMPLE types and helpers live in gcc/gimple.h and gcc/gimple-iterator.h. A gimple is a statement, a gimple_stmt_iterator walks them, and a basic_block groups them. We will look at how to use these headers in a future post when we implement the optimization passes.
RTL: What It Is and Why It Matters
After GIMPLE, the code is lowered to RTL, Register Transfer Language. If GIMPLE looks like simplified C, RTL looks like what you get if LISP and assembly had a child. It uses S-expressions (nested parentheses) to describe operations on registers and memory. If you have ever written a LISP interpreter (I have), the syntax will feel familiar; base = y * 80 would be (set base (mult y 80)) in LISP-like notation. Otherwise, it takes some getting used to.
Here is int base = y * 80 + ((x >> 4) * 4) in RTL, right after expansion from GIMPLE. I have abbreviated the syntax; the real dumps still carry source locations and metadata on every insn so debug information survives to the final assembly:
(set r36:HI (reg %d1 [ y ])) ; r36 = y (arrives in d1 via fastcall)
(set r38:HI r36:HI) ; r38 = y
(set r39:HI (plus r38:HI r38:HI)) ; r39 = y+y
(set r40:HI (plus r39:HI r39:HI)) ; r40 = y*4
(set r41:HI (plus r40:HI r36:HI)) ; r41 = y*5
(set r42:HI (plus r41:HI r41:HI)) ; r42 = y*10
(set r43:HI (plus r42:HI r42:HI)) ; r43 = y*20
(set r44:HI (ashiftrt r35:HI (const 4))) ; r44 = x>>4
(set r45:HI (plus r43:HI r44:HI)) ; r45 = y*20 + x>>4
(set r46:HI (plus r45:HI r45:HI)) ; r46 = result*2
(set r47:HI (plus r46:HI r46:HI)) ; r47 = result*4
Notice there is no * 20 in sight. The GIMPLE _1 = y_4(D) * 20 is still a single multiply at the end of the GIMPLE pipeline; it is pass_expand, the GIMPLE-to-RTL boundary, that queries the target cost model and decides a chain of additions is cheaper than a muls.w instruction. Whether that is the right call depends on the target and cost models, a topic for the next post.
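The expander's add chain can be replayed in plain C, one statement per insn (r39 through r43): each (plus rN rN) doubles, and the odd step adds y back in. A sketch of the generated arithmetic, not GCC's own code:

```c
#include <assert.h>

int mul20_by_adds(int y)
{
    int t = y + y;   /* r39 = y*2  */
    t = t + t;       /* r40 = y*4  */
    t = t + y;       /* r41 = y*5  */
    t = t + t;       /* r42 = y*10 */
    t = t + t;       /* r43 = y*20 */
    return t;
}
```

Five adds versus one muls.w: on the 68000, where a 16-bit multiply costs up to 70 cycles, the chain wins whenever the cost model says so.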
Let's read a single line: (set r44:HI (ashiftrt r35:HI (const 4))). In pseudo-C, this is r44 = r35 >> 4. As pseudo-assembly, think of it as asr.w #4,r35,r44, a three-operand instruction where the result goes to a separate destination register. If you have written ARM assembly, this would feel natural. On real m68k, asr.w is a two-operand in-place instruction; later passes will need to reconcile this.
The :HI after each register is the machine mode; think of it as the instruction size suffix you know from assembly: .b is QI (quarter-integer, 8-bit), .w is HI (half-integer, 16-bit), .l is SI (single-integer, 32-bit). With -mshort, most operations naturally land in HI mode. The r35, r38, etc. are pseudo registers; RTL acts as if the CPU has infinitely many, and each temporary gets its own. For put_pixel, the compiler creates 19 pseudos. The CoreMark matrix multiply creates 28.
Everything in RTL is an RTX (RTL eXpression): a plus, a reg, a const_int, a mem. An INSN is also an RTX, but a special one: it wraps a statement (typically a set) together with metadata. The reg, plus, and set inside an INSN are all RTX, but only the outermost insn is what passes iterate over. Each INSN is a pseudo machine instruction. Early on, INSNs are quite free-form, though sticking to canonical RTX forms helps later passes parse what you built. As passes progress, each INSN must be legalized and constrained until it matches a valid target instruction pattern and can be emitted as real assembly. INSNs that cannot be matched cause an Internal Compiler Error (ICE), the dreaded "unrecognizable insn." Compiling GCC with --enable-checking=yes adds extra validation that catches many of these errors early.
Passes iterate the same way as in GIMPLE: for each function, for each basic block (FOR_EACH_BB_FN), for each INSN (FOR_BB_INSNS), examine, transform, write back. The core types and helpers live in gcc/rtl.h (RTX types, PATTERN(), SET_SRC(), SET_DEST()), gcc/emit-rtl.h (creating and inserting INSNs), and gcc/basic-block.h (basic block iteration).
Somewhere in the middle of the ~110 RTL passes lies the Rubicon: register allocation. All those pseudos must be mapped onto the 15 real hardware registers (d0-7/a0-6), and when they do not fit, values must be spilled to the stack, costing memory accesses. Here is the same base calculation after register allocation:
(set %d2:HI %d1:HI) ; d2 = y
(set %d2:HI (plus %d2:HI %d1:HI)) ; d2 = y+y
(set %d2:HI (plus %d2:HI %d2:HI)) ; d2 = y*4
(set %d2:HI (plus %d2:HI %d1:HI)) ; d2 = y*5
(set %d2:HI (plus %d2:HI %d2:HI)) ; d2 = y*10
(set %d2:HI (plus %d2:HI %d2:HI)) ; d2 = y*20
(set %d0:HI (ashiftrt %d0:HI (const 4))) ; d0 = x>>4 (now in-place!)
(set %d0:HI (plus %d0:HI %d2:HI)) ; d0 = y*20 + x>>4
(set %d0:HI (ashift %d0:HI (const 2))) ; d0 = result*4
For this small expression, 14 pseudos collapsed to just two hard registers: %d0 and %d2. The three-operand asr.w #4,r35,r44 became the in-place asr.w #4,%d0, matching the real m68k instruction. This is exactly what falls out as assembly: move.w %d1,%d2; add.w %d1,%d2; ... asr.w #4,%d0; add.w %d2,%d0; lsl.w #2,%d0; rts. How register allocation makes these choices is the topic of the next post.
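As a sanity check, here is the allocated sequence transliterated to C, one statement per insn, with locals standing in for the hard registers. Note that every operation overwrites one of its own inputs, matching m68k's two-address instructions (a hand-written illustration, not compiler output):

```c
#include <assert.h>
#include <stdint.h>

int16_t base_after_ra(int16_t x /* arrives in d0 */, int16_t y /* d1 */)
{
    int16_t d0 = x, d1 = y, d2;
    d2 = d1;         /* move.w %d1,%d2        */
    d2 = d2 + d1;    /* add.w  %d1,%d2 : y*2  */
    d2 = d2 + d2;    /* add.w  %d2,%d2 : y*4  */
    d2 = d2 + d1;    /* add.w  %d1,%d2 : y*5  */
    d2 = d2 + d2;    /* add.w  %d2,%d2 : y*10 */
    d2 = d2 + d2;    /* add.w  %d2,%d2 : y*20 */
    d0 = d0 >> 4;    /* asr.w  #4,%d0  : in place */
    d0 = d0 + d2;    /* add.w  %d2,%d0        */
    d0 = d0 << 2;    /* lsl.w  #2,%d0  : *4   */
    return d0;
}
```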
A few of the most important RTL passes IMHO:
- Instruction Combining (gcc/combine.cc): Merges multiple insns into one when the target has a matching instruction pattern. moveq #0,d0; move.w d0,(a0) becomes clr.w (a0). Can also split one unmatched insn into two that match.
- Prologue/Epilogue (gcc/function.cc): Inserts the register saves and restores you know from assembly: movem.l d3-7/a2-6,-(sp) at entry, movem.l (sp)+,d3-7/a2-6 at exit (fewer, hopefully).
- Peephole2 (gcc/recog.cc): A pluggable pass that runs a list of target-defined pattern-matching rules over small instruction windows. On m68k, this is where, for example, a load+shift+shift sequence for single-bit extraction becomes btst + sne, saving cycles on the 68000, where shifts are expensive.
- Instruction Scheduling (gcc/sched-rgn.cc): Reorders instructions to minimize pipeline stalls. Runs once before register allocation and once after. Tuned for superscalar CPUs, which can hurt code on the 68000 but benefits the 68060, which can dual-issue.
The matching from RTL insns to actual assembly happens via the machine description, gcc/config/m68k/m68k.md. Each m68k instruction GCC knows is a define_insn pattern that says "if the RTL looks like this, emit that." For our shift:
(define_insn "lshrhi3"
[(set (match_operand:HI 0 "register_operand" "=d")
(lshiftrt:HI (match_operand:HI 1 "register_operand" "0")
(match_operand:HI 2 "general_operand" "dI")))]
"!TARGET_COLDFIRE"
"lsr%.w %2,%0")
This says: if we see (set dN (lshiftrt:HI dN something)), emit lsr.w something,dN. "=d": operand 0 must be a data register. "0": operand 1 must be the same register (in-place). "dI": operand 2 is a data register or immediate 1–8. The pattern only matches on non-ColdFire targets. When multiple define_insn patterns could match, GCC picks the best fit based on constraints and ordering in the .md file; this is how a small constant ends up using addq instead of the more general addi. This is the bridge between RTL and assembly, and what combine and peephole2 check against to validate their transformations.
Looking Under the Hood
GCC can dump the intermediate representation after any pass. For GIMPLE, -fdump-tree-<pass> writes a .t file; for RTL, -fdump-rtl-<pass> writes an .r file. The filenames include the pass number, so they sort chronologically:
$ m68k-atari-mintelf-gcc -O2 -mshort -mfastcall \
-fdump-tree-optimized -fdump-rtl-expand \
-fdump-rtl-combine -fdump-rtl-reload -S put_pixel.c
$ ls put_pixel.c.*
put_pixel.c.274t.optimized # final GIMPLE
put_pixel.c.275r.expand # GIMPLE → RTL expansion
put_pixel.c.311r.combine # after instruction combining
put_pixel.c.331r.reload # after register allocation
Here is mask traced through these dumps. In the final GIMPLE (.274t.optimized), it is a clean one-liner outside the loop:
mask_27 = 32768 >> _4;
After RTL expansion (.275r.expand), it becomes two INSNs with pseudo registers:
(insn 12 (set (reg:HI 56)
(const_int -32768)))
(insn 13 (set (reg/v:HI 42 [ mask ])
(lshiftrt:HI (reg:HI 56) (reg:HI 55))))
After register allocation (.331r.reload), pseudos become hard registers. r56 and r42 [mask] both landed in %d1; r55 [_4] in %d4:
(insn 12 (set (reg:HI %d1)
(const_int -32768)))
(insn 13 (set (reg:HI %d1 [mask])
(lshiftrt:HI (reg:HI %d1) (reg:HI %d4))))
The allocator reused %d1 for both the constant and the result; r56 was dead (no longer in use) after INSN 13, so no conflict. INSN 13 matches our lshrhi3 pattern from m68k.md, so the final assembly emits lsr.w %d4,%d1.
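Why does the dump show const_int -32768 rather than 32768? HI constants are signed 16-bit values, so the bit pattern 0x8000 can only be written as -32768; lshiftrt is a logical shift, so no sign bits leak in. A small C check (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Reproduces insns 12-13: load the signed constant, logical-shift right. */
uint16_t pixel_mask(unsigned shift)
{
    uint16_t c = (uint16_t)-32768;   /* same 16 bits as 0x8000 */
    return c >> shift;               /* unsigned type => logical shift,
                                        like lsr.w %d4,%d1 */
}
```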
The nuclear option is -fdump-rtl-all, which dumps every RTL pass (68 files for a simple function). When debugging, the workflow is to diff consecutive dumps to see exactly what changed. Did combine merge those two instructions? Did IRA spill a register you thought should fit? Reading dumps in sequence tells the story of how your code became assembly. A quick grep for a variable name narrows things down:
$ grep 'mask' put_pixel.c.275r.expand
(insn 13 ... (set (reg/v:HI 42 [ mask ]) ...
Even with targeted dumps, the output is dense; a single put_pixel dump runs to hundreds of lines. Having Claude summarize these has been invaluable. Asking "what changed between the combine dump and the IRA dump for the inner loop?" and getting a focused answer has been the single biggest productivity gain in this project.





