Register Allocation and the Cost Model

The m68k has 15 usable registers: d0-d7 and a0-a6 (a7 is the stack pointer, and sometimes a6 is also reserved as a frame pointer). For a CPU from 1979, this is actually quite generous; the 6502 has three, the Z80 and 8086 have seven or eight. So 15 should be plenty, right?

Not so fast. We have two register files with different capabilities. A pointer must be in an address register to dereference it. Most arithmetic must happen in data registers. For many use cases, we effectively have 7 or 8 registers to choose from, not 15. And if our function calls others, the call-clobbered registers cannot safely hold values across those calls, effectively leaving only the ten callee-saved registers, d3-d7/a2-a6, free for use with fastcall. When we run out, the allocator spills to the stack: on 68000, each 16-bit spill costs at least 16 cycles (8 to read, 8 to write back), and a 32-bit value adds another 8. In a tight loop, it adds up fast.

When you write assembly, register allocation is intuition. You look at a routine and think: I need this value for the next three instructions, then I can reuse the register for something else. You juggle lifetimes in your head, naturally overlapping short-lived values. Teaching a computer this intuition is the fundamental challenge. The answer is to build a graph of which pseudo-registers are live simultaneously; two pseudos that overlap in lifetime interfere and cannot share a register. Then color the graph with K colors, one per available register, starting with the cheap caller-clobbered regs. When 19 pseudos from put_pixel compete for 15 registers, some will not fit — especially once you account for the data/address split — and those get spilled.
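
To make the intuition concrete without GCC's real data structures, here is a toy version of the interference test; the ranges and values are made up, and real liveness tracking works per program point rather than as a single interval:

// Toy sketch: a pseudo's lifetime approximated as one [first, last] range.
// Two pseudos whose ranges overlap interfere and cannot share a hard
// register; disjoint ranges can be colored with the same one.
#include <stdbool.h>
#include <stdio.h>

typedef struct { int first, last; } live_range;

static bool interferes(live_range a, live_range b)
{
    return a.first < b.last && b.first < a.last;
}

int main(void)
{
    live_range mask = { 3, 20 };   // hypothetical: live across the whole loop
    live_range base = { 1, 5 };    // hypothetical: dead once the address is formed
    live_range p    = { 6, 20 };   // hypothetical: the loop counter
    printf("mask vs base: %s\n", interferes(mask, base) ? "interfere" : "can share");
    printf("base vs p:    %s\n", interferes(base, p)    ? "interfere" : "can share");
    return 0;
}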

So how does GCC make the best of a difficult situation? 

Switching to LRA

GCC's register allocation is a two-stage system. IRA (Integrated Register Allocator) does the global plan: it assigns pseudos to hard registers based on liveness, conflicts, copy preferences, calling convention constraints (which registers are caller-saved vs callee-saved), and target hooks for special cases and hints. That allocation is intentionally optimistic, because many operand constraints are only fully resolved later. The second stage turns IRA's plan into something legal for the target: the legacy Reload pass or the modern LRA (Local Register Allocator).

LRA was introduced in GCC 4.8 and became the default for all modern targets by GCC 5, but m68k never adopted it. With Reload scheduled for removal in GCC 16, the switch is inevitable. And being more modern, surely it must be better for free?

The old Reload does its fix-up in a brittle, patch-up style: a single pass through the function, inserting spills wherever IRA's plan does not fit. LRA, by contrast, is iterative and tightly integrated with instruction constraints; when it finds a violation, it re-evaluates alternatives and tries again, rather than patching and moving on. The flip side of being iterative: get the costs wrong and LRA can ping-pong between alternatives forever. I discovered this the hard way on ColdFire, where I had costed add.l d0,d0 + move.l (a0,d0.l),d1 the same as move.l (a0,d0.l*2),d1. LRA could not decide which form was cheaper and looped indefinitely. Finding this took actual printf-debugging inside the allocator; the data dumps from the last post only apply once a pass has finished.

Enabling LRA was easy: cherry-pick a few fixes from the GCC 16 branch and it compiled. But the result was worse code. Two full weeks of whack-a-mole followed, chasing the fact that Reload's simple static order was beating LRA's sophisticated constraint-based approach. Something that will be felt when mainline GCC 16 is released :(.

The core problem: when define_insn patterns in m68k.md declare that an operand accepts any register, LRA takes that literally. For example, the doloop_end pattern for dbra accepts both register types; if LRA picks an address register, no dbra can be emitted, and the fallback is subq.w #1,aN; cmpa.w #-1,aN; bcc, four times slower. Fix that constraint, and the next pattern breaks in a different way. Rinse and repeat.

The fix was tightening constraints throughout m68k.md and implementing new IRA target hooks to guide the data/address decision. TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS is called for every pseudo; simplified, the m68k implementation looks like:

// IRA asks: what register class should this pseudo prefer?
reg_class change_pseudo_allocno_class(pseudo, allocno_class, best_class) {
    // If IRA widened a data-preferring pseudo to GENERAL, narrow it back
    if (best_class == DATA_REGS && allocno_class == GENERAL_REGS)
        return DATA_REGS;
    // Pointer actually used as memory base? → address register
    if (is_pointer(pseudo) && used_as_memory_address(pseudo))
        return ADDR_REGS;
    return allocno_class;
}

The real implementation has per-CPU exceptions and LRA-specific paths, but the core logic is two decisions: keep arithmetic pseudos in data registers, and promote pointer pseudos to address registers.

A companion hook, TARGET_PREFERRED_RELOAD_CLASS_FOR_USE, examines specific uses of each pseudo. Comparisons prefer data registers, since there is no tst aN on 68000 and cmpa is slower on 68020+. Small constants prefer data registers for moveq.
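
In the same simplified pseudocode style as above, and with the helper names invented here for illustration, those two rules boil down to roughly:

// Sketch only, not the real hook body.
reg_class preferred_reload_class_for_use(pseudo, use, current_class) {
    // No tst aN on 68000, and cmpa is slower on 68020+
    if (used_in_comparison(use))
        return DATA_REGS;
    // Small constants want moveq, which only writes data registers
    if (is_constant(pseudo) && fits_moveq_range(pseudo))
        return DATA_REGS;
    return current_class;
}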

The existing -mlra flag enables LRA; after my changes -mno-lra can still compile with Reload for comparison. Good news: in the end, LRA is smarter. The result: a binary 1,126 bytes (1.6%) smaller from switching the register allocation alone, measured on my game in progress. But it is not the silver bullet you get for free. And maybe Reload could have been taught the same tricks?

The Cost Model: Teaching GCC What Is Cheap

Most optimization passes need to compare alternatives: should this multiply become an add chain? Is this addressing mode worth the extra displacement word? The answers come from target hooks that price RTL expressions. Even GIMPLE passes use these hooks; they construct temporary RTX expressions just to query the cost.

Before register allocation, registers are still pseudos, not hard registers. Is that an add.w d0,d1 or an adda.w d0,a0? We cannot yet know; sometimes we can guess from context. The costs are necessarily approximate, but the huge performance wins happen in these early passes, where there is still freedom for large-scale rewrites. Later passes have more exact costs but less room to maneuver.

Three target hooks matter most:

- TARGET_RTX_COSTS prices a single RTL expression (a shift, a multiply). expand uses it to decide whether y * 20 becomes muls.w or an add chain. combine uses it when merging instructions.
- TARGET_ADDRESS_COST prices an addressing mode: (a0) vs d16(a0) vs (d8,a0,d0.w). IVOPTS uses it when choosing between induction variable candidates for loops.
- TARGET_INSN_COST prices a complete instruction including operand fetch, used by later passes where full context is known. If-conversion uses it to decide whether replacing a branch with a conditional sequence is profitable.

Important: what these hooks receive is not assembly instructions but RTX expressions. The shift example below arrives as (lshiftrt:HI (reg:HI 56) (reg:HI 55)) with pseudo registers; we have to guess that it will become lsr.w %d4,%d1 and cost accordingly.

GCC's cost model is intentionally unitless; costs are only meaningful relative to each other. The baseline is COSTS_N_INSNS(1) = 4, representing one fast instruction like move.l d0,d1 (possibly a relic of m68k's early presence in GCC?). All other costs are expressed as multiples or fractions of this. Since the baseline is 4, we can use ±1 as gentle nudges to steer tiebreakers without overriding the overall costing.
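
For reference, the macro behind that baseline is as plain as it sounds; this is how current GCC defines it in rtl.h:

// One "fast instruction" equals four cost units.
#define COSTS_N_INSNS(N) ((N) * 4)

// So COSTS_N_INSNS(1) + 1 still reads as "about one instruction" to most
// passes, but breaks a tie in the direction we want.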

Beyond the target hooks, many passes have cost calculations and heuristics of their own. Loops of unknown iteration count are assumed to run 8 times. When selecting a loop terminator, IVOPTS weighs setup cost against per-iteration savings: a dbra needs a dedicated counter register (+X penalty) and an adjusted initial value (+Y penalty), but saves a compare on every iteration (-8*N credit). For a 4-iteration loop like put_pixel, the savings barely cover the setup; for a 200-iteration scanline loop, it is an easy win. The target hooks provide the raw numbers; the passes layer their own judgment on top.
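
With invented values for the two penalties (only the per-iteration credit of 8 comes from the paragraph above), the break-even arithmetic is easy to play with:

// Toy model of the trade-off above; X and Y are made up for illustration.
#include <stdio.h>

int main(void)
{
    const int counter_reg_penalty  = 14;  // hypothetical +X
    const int init_adjust_penalty  = 16;  // hypothetical +Y
    const int per_iteration_saving = 8;   // the compare dbra makes redundant

    const int iterations[] = { 1, 4, 200 };
    for (int i = 0; i < 3; i++) {
        int n = iterations[i];
        int net = counter_reg_penalty + init_adjust_penalty
                - per_iteration_saving * n;
        printf("%3d iterations: net %+d -> %s\n", n, net,
               net < 0 ? "dbra wins" : "keep the compare");
    }
    return 0;
}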

My approach is straightforward: per-CPU lookup tables with exact cycle counts from the MC680X0 User Manual. The 68000, 68020/030, 68040, and 68060 each get their own table (sorry, no dedicated ColdFire table yet). You cannot be more correct than the manufacturer's own documentation, and tables are easy to write and verify. The previous cost model was sparse, with many operations falling through to COSTS_N_INSNS(1), which is simply not true.
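
To give a feel for the shape of a table, here is a cut-down illustration using only the fields that show up in the snippets further down; the real tables cover far more operations:

// Illustration only: a per-CPU table, limited to the fields used by the
// shift and logic costing snippets below.
struct logic_costs { int reg_const, reg_op_ea, ea_op_reg; };

struct cpu_costs {
    int shift_base;        // base cost of a word shift
    int shift_per_count;   // added per position when the count is immediate
    int shift_var_add;     // added instead when the count is in a register
    struct logic_costs logic;
};

static const struct cpu_costs m68000_costs = {
    .shift_base      = 6,   // the 6 + 2n rule, see the shift costing below
    .shift_per_count = 2,
    .shift_var_add   = 16,  // variable count: roughly the midpoint, as below
    // .logic entries elided; filled the same way from the User Manual
};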

Initially I checked for improvements by naively counting instructions in the assembly output. Fine for a start, but eventually not good enough: add.w %d0,%d1 and adda.w %d0,%a0 are both one instruction, but any m68k programmer knows the address register variant costs 4 extra cycles on 68000. Same instruction count, different speed. So I wrote clccnt, a tool that parses assembly and sums actual cycle counts per the User Manual tables. It catches exactly the kind of difference that instruction counting misses.
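
clccnt is not sophisticated. Stripped to the bare idea, it is a pattern table and a running sum; the entries below are a tiny illustrative subset, the real tool keys on mnemonic plus addressing mode and carries one table per CPU:

// Bare-bones illustration of the idea behind clccnt: match each line of
// assembly against a cycle table, most specific pattern first, and sum.
#include <stdio.h>
#include <string.h>

struct entry { const char *pattern; int cycles; };

// A tiny 68000 subset; real entries cover every mnemonic and operand shape.
static const struct entry table[] = {
    { "or.w %d1,(%a0)+", 12 },  // OR to memory, post-increment
    { "or.w %d",          4 },  // OR register to register
    { "move.w (%a0)+,",   8 },  // read word, post-increment
};

int main(void)
{
    char line[256];
    long total = 0;
    while (fgets(line, sizeof line, stdin)) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
            if (strstr(line, table[i].pattern)) {
                total += table[i].cycles;
                break;
            }
        }
    }
    printf("%ld cycles\n", total);
    return 0;
}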

Costing a Shift. In put_pixel, the mask 0x8000 >> (x & 15) becomes a variable-count shift. The RTL:

(set (reg:HI 42 [ mask ])
     (lshiftrt:HI (reg:HI 56) (reg:HI 55)))

This becomes lsr.w %d4,%d1. On 68000, a word shift costs 6 + 2n cycles where n is the count. For known immediate values we calculate the exact cost: a shift by 4 is 6 + 8 = 14 cycles. But here n is in a register, unknown at compile time. It could be 0 (6 cycles) or 15 (36 cycles). The cost model estimates 22 cycles, roughly the midpoint. When compiling for size with -Os, we intentionally use the lower end of the range; a muls.w may be slower than an add chain, but it is smaller.

On 68020+, shifts are constant-time thanks to the barrel shifter: 4 cycles regardless of count. This 5:1 ratio is why the same source should generate different sequences on different CPUs, and the cost model makes that happen. Simplified, the costing looks like:

// TARGET_RTX_COSTS case for LSHIFTRT, ASHIFT, ASHIFTRT
int cost = costs->shift_base;           // 6 for 68000 word, 4 for 68020
if (CONST_INT(count))
    cost += costs->shift_per_count * count;  // +2n for 68000, +0 for 68020
else
    cost += costs->shift_var_add;       // +16 for 68000, +2 for 68020

Costing an OR. The OR from the inner loop. Same operation, different operands, vastly different costs:

(set (reg:HI d0) (ior:HI (reg:HI d0) (reg:HI d1)))

or.w %d1,%d0, register to register: 4 cycles.

(set (mem:HI (reg:SI a0)) (ior:HI (mem:HI (reg:SI a0)) (reg:HI d1)))

or.w %d1,(%a0), register to memory: 12 cycles. Read the word, OR, write back — one instruction, three times slower. Change the addressing mode to d16(a0) and it is 16 cycles. On 68020+, the pipelining and instruction cache make addressing mode cost largely disappear; both forms cost about 2 cycles. The costing branches on operand shape:

// TARGET_RTX_COSTS case for IOR (and AND, XOR)
if (CONST_INT(op1) && REG(op0))
    cost = costs->logic.reg_const + const_cost(op1);  // ori.w #imm,dN
else if (REG(op0) && MEM(op1))
    cost = costs->logic.reg_op_ea + ea_cost(op1);     // or.w (ea),dN
else if (REG(op1) && MEM(op0))
    cost = costs->logic.ea_op_reg + ea_cost(op0);     // or.w dN,(ea)

The Forward-Looking Problem. Each pass estimates cost based on current state, but optimizes for a future it cannot see. Several hooks steer cross-pass decisions: TARGET_DOLOOP_COST_FOR_GENERIC credits dbra in IVOPTS so the counter is not eliminated before doloop runs; TARGET_REGISTER_MOVE_COST penalizes data-to-address reg shuffles. Every cost estimate is a bet on what comes next. My goal: always at least cost-neutral with baseline GCC.
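
The register-move half of that is easy to sketch in the same pseudocode style as earlier; the +1 is my illustrative nudge, the real numbers come from the per-CPU tables:

// Sketch: make moves that cross the data/address split look slightly
// more expensive, so the allocator prefers to stay within one file.
int register_move_cost(mode, from_class, to_class) {
    int cost = 4;                            // move.l dN,dN baseline
    if (is_data_class(from_class) != is_data_class(to_class))
        cost += 1;                           // illustrative tiebreaker nudge
    return cost;
}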

This is not programming the compiler as much as fine-tuning it. The dials are cycle counts, the feedback loop is diff across a hundred test cases.

Revisit: put_pixel

Time to see what this buys us. Since Post 2, we changed x / 16 to x >> 4, avoiding the signed-division rounding check:

void put_pixel(uint16_t *screen, int x, int y, uint8_t col) {
    int base      = y * 80 + ((x >> 4) * 4);
    uint16_t mask = (uint16_t)(0x8000u >> (x & 15));
    for (int p = 0; p < 4; p++) {
        if (col & (1 << p))
            screen[base + p] |=  mask;
        else
            screen[base + p] &= ~mask;
    }
}

Stock GCC 15.2 inner loop (-O2 -mshort -mfastcall):

.L4:
    move.w (%a0)+,%d1       | read word (post-increment)
    move.w %d3,%d2           | copy col
    asr.w %d0,%d2            | variable shift each iteration!
    btst #0,%d2              | test bit 0
    jeq .L2
    or.w %d4,%d1             | OR mask in register
    move.w %d1,-2(%a0)       | write back (undo the ++)
    addq.w #1,%d0
    cmp.w #4,%d0
    jne .L4
...
.L2:
    move.w %a1,%d2           | ~mask from an address register?!
    and.w %d2,%d1
    move.w %d1,-2(%a0)
    addq.w #1,%d0
    cmp.w #4,%d0
    jne .L4
...

My branch:

.L4:
    btst %d3,%d2             | test bit directly
    jeq .L2
    or.w %d1,(%a0)+          | OR mask to memory, post-increment
    subq.w #1,%d0
    jne .L4
...
.L2:
    and.w %d4,(%a0)+         | AND ~mask to memory, post-increment
    subq.w #1,%d0
    jne .L4
...

Better register allocation keeps mask in d1 and ~mask in d4, both data registers. btst replaces the per-iteration variable shift. And or.w %d1,(%a0)+ replaces the three-instruction read/modify/write-back-with-negative-offset sequence: one read-modify-write instruction, 12 cycles instead of 24. Not all of this comes from register allocation alone; some of it is other specific optimization passes, a topic for another day. It is not perfect (there is still subq; jne where a dbra could be), but it is a function an assembly programmer can look at without wincing.



Next time: something more concrete. How modern compiler transforms broke post-increment addressing for us, and what it took to get mikro's memory copy back to (a0)+.

Comments

42Bastian
Monday, 06 April 2026 19:53
Hmm, interesting again. But, isn't in the last code snippet a branch to the end missing (befor .L2:)?
PeyloW
Monday, 06 April 2026 21:28
Quoting 42Bastian:
Hmm, interesting again. But, isn't in the last code snippet a branch to the end missing (befor .L2:)?

You are right, I cut too much for brevity. Added a bit more again. You calling it out means a lot, not only did you read it, you questioned it! Thank you.
42Bastian
Tuesday, 07 April 2026 04:43
"My approach is straightforward: per-CPU lookup tables with exact cycle counts from the MC680X0 User Manual. "
I think you need an "atarist" option, as on the ST cycles are always multiple of 4.
"lsr #2,d0" => 6+2*2 = 10 => 12 on ST.
Means the returned cost should be (c+3)&~3
PeyloW
Tuesday, 07 April 2026 15:36
Quoting 42Bastian:
"My approach is straightforward: per-CPU lookup tables with exact cycle counts from the MC680X0 User Manual. "
I think you need an "atarist" option, as on the ST cycles are always multiple of 4.
"lsr #2,d0" => 6+2*2 = 10 => 12 on ST.
Means the returned cost should be (c+3)&~3

I might have glossed over some of this. Each table is according to official manual, and also has a `buscycle_cost` entry, and I use that to round up to nearest multiple, in case of plain ST 4 cycles.
For Falcon030 and a CT63 this bus cycle cost is probably not accurate, as alt RAM and ST RAM will behave differently.
Loaderror
Wednesday, 08 April 2026 03:47
Thanks for these posts! Good to know about some weak spots in GCC and interesting to read how you solve some of these issues by tuning GCC itself. I briefly looked at GCC's code before, but it just looked impenetrable to me :)
42Bastian
Wednesday, 08 April 2026 04:12
Quoting PeyloW:
Quoting 42Bastian:
"My approach is straightforward: per-CPU lookup tables with exact cycle counts from the MC680X0 User Manual. "
I think you need an "atarist" option, as on the ST cycles are always multiple of 4.
"lsr #2,d0" => 6+2*2 = 10 => 12 on ST.
Means the returned cost should be (c+3)&~3

I might have glossed over some of this. Each table is according to official manual, and also has a `buscycle_cost` entry, and I use that to round up to nearest multiple, in case of plain ST 4 cycles.
For Falcon030 and a CT63 this bus cycle cost is probably not accurate, as alt RAM and ST RAM will behave differently.


Ah, did not look to close into the source. Cool!
