Fixing Post-Increment Addressing
Some years after high school, AiO and I worked at the same company for a while. I spent a lot of time at his apartment in Vimmerby, watching demos on my Falcon030 and his accelerated Amiga, playing Frontier: Elite II, and making grand plans for projects that mostly never shipped. I was going to write a Worms clone called Grubs, built around fractal-generated terrain and a neat parallax scrolling trick. In the end the only game released from that apartment was DB Phone Home, a 4K side-scrolling platformer for the Falcon. But the thing I remember most is reading AiO's copy of the MC68060 User's Manual.
The 68060 can not only do a multiply in two clock cycles, but also execute two instructions at once!? Motorola called it superscalar, and I thought it was the most exciting thing in the world. I imagined what an Atari with this beast could do, and where the 68070 would take this — even wider issue, more parallelism, the same trajectory the industry was already on with the Pentium and the PowerPC. Of course, the 68070 never came. ColdFire does not quite count for me. Our beloved CPU family ended with the 68060, and the dream of wider superscalar m68k died with it.
But the industry kept going. GCC optimizes for that dream-made-real on other architectures: x86-64, ARM, RISC-V. Independent instructions that hardware can overlap or even reorder, executing over half a dozen instructions per cycle on, for example, an Apple Silicon M3/M4. This all works as long as there are no data dependencies between consecutive operations. And this is where it goes wrong for us.
The Problem
This has been discussed many times on Atari-Forum, in threads like GCC on ATARI - how much faster could it be?. Beyond just talking, mikro went as far as filing a GCC bug report, with no success. Many people have posted workarounds and incantations: insert this asm statement, use this specific flag combination, sacrifice a goat on a full moon and compile on a Tuesday. But nothing that Just Works™.
Imagine a simple memory copy, *dst++ = *src++ manually unrolled four times:
void copy4(const int* src, int* dst, unsigned int count) {
    for (unsigned int i = 0; i < count / 4; i++) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
This should generate efficient m68k assembly; we have practically handed the compiler the solution on a silver platter.
And yet stock GCC 15.2 at -O2 -mshort -mfastcall gives us this:
.L26:
move.w (%a0),(%a1) | fixed offset 0
move.w 2(%a0),2(%a1) | fixed offset 2
move.w 4(%a0),4(%a1) | fixed offset 4
addq.l #8,%a0 | pointer advance — too early!
addq.l #8,%a1
move.w -2(%a0),-2(%a1) | negative offset — the addq moved above it
Four copies, zero post-increments. The addq pointer advances got hoisted above the fourth access, creating a negative offset. On 68000, each offset costs an extension word and an extra bus access, and then there are the two addq instructions; that is 32 additional clock cycles of overhead no sane assembly programmer would ever consider. Remind me why I should use GCC again if it fails even this simple case this badly?
Down the Rabbit Hole
GCC already has a pass for this: pass_inc_dec (pass 7.29 in the pipeline). It runs early in the RTL phase and converts move (%a0); addq #4,%a0 into move (%a0)+. The pattern match is straightforward: one memory access followed by one increment of the same register. I added fprintf debugging to see why it was not triggering.
The answer: it only works on single instruction pairs with this very specific match. Our unrolled loop has four accesses with ascending offsets and a single combined increment. But the raw GIMPLE still has separate pointer increments: src_23 = src + 2; *src_23; src_26 = src_23 + 2; *src_26, exactly the chain of pairs that pass_inc_dec could handle one by one. Somewhere between the raw GIMPLE and pass_inc_dec, one of the ~200 GIMPLE passes folds the chain into base+offset form. Finding which one by hand would take hours. Claude found it in under five minutes: pass_forwprop (forward propagation, pass 2.20, very early in the GIMPLE pipeline) is the culprit. It folds src + 2 + 2 into src + 4, turning the chain into MEM[src + 0], MEM[src + 2], etc. with a single increment of 8. Fewer operations, fewer temporaries, better for the general case. But it destroys the one-to-one pattern that pass_inc_dec needs.
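To see what forwprop does here, compare a GIMPLE-flavored sketch of the two forms. This is hand-written pseudo-code, not an actual -fdump-tree dump, and the SSA names beyond the ones quoted above are illustrative:

```
/* Before pass_forwprop: a chain of small increments,
   each one a pair that pass_inc_dec could fold. */
_1 = *src;            *dst = _1;
src_23 = src + 2;     dst_24 = dst + 2;
_2 = *src_23;         *dst_24 = _2;
src_26 = src_23 + 2;  dst_27 = dst_24 + 2;
_3 = *src_26;         *dst_27 = _3;
...

/* After pass_forwprop: base+offset accesses and one increment of 8. */
_1 = *src;            *dst = _1;
_2 = MEM[src + 2];    MEM[dst + 2] = _2;
_3 = MEM[src + 4];    MEM[dst + 4] = _3;
_4 = MEM[src + 6];    MEM[dst + 6] = _4;
src_31 = src + 8;     dst_32 = dst + 8;
```

The second form is exactly what the stock assembly above reflects: fixed offsets plus one combined increment per pointer.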
I briefly considered rewriting pass_inc_dec to handle multi-step sequences, but decided it was simpler and safer to keep the fix m68k-specific. A target-specific pass can make target-specific assumptions without risk to every other backend.
Fixing the Reordering
The first step is not even an optimization, just cleaning up the mess so we can find patterns in it.
A new m68k pass (m68k-reorder-incr, in m68k-pass-memreorder.cc) runs after GCC's instruction scheduler and before register allocation, so RA can make better decisions on the cleaned-up code. In simplified pseudo-C, what this new pass does for each basic block is roughly:
// For each increment (addq, lea) in the basic block:
for (insn = BB_HEAD; insn != BB_END; insn = NEXT_INSN(insn)) {
    if (!is_reg_increment(insn, &regno, &incr))
        continue;
    // Scan forward: collect insns with negative offsets on same register
    for (next = NEXT_INSN(insn); has_negative_offset(next, regno); next = NEXT_INSN(next))
        adjust_offset(next, +incr); // -2 + 8 = 6
    move_insn_after(insn, last_fixup); // move addq to after all accesses
}
In practice the pass also handles arithmetic with memory operands and works across basic block boundaries. The patterns grew over time as I found more cases in real code, but the core idea is what you see above.
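As an illustrative example of the memory-operand case (made up for this post, not a real dump), an arithmetic instruction reading through the incremented register gets the same treatment as a move:

```
| before m68k-reorder-incr: increment hoisted above the access
addq.l #4,%a0
add.w  -4(%a0),%d0   | negative offset

| after: offset adjusted by +4, increment moved below
add.w  (%a0),%d0
addq.l #4,%a0
```

With the increment back where it belongs, the later m68k-autoinc pass can fold this pair into add.w (%a0)+,%d0.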
It moves the pointer advances past the memory accesses that follow them, adjusting negative offsets to positive, leaving us with this assembly:
.L26:
move.w (%a0),(%a1) | offset 0
move.w 2(%a0),2(%a1) | offset 2
move.w 4(%a0),4(%a1) | offset 4
move.w 6(%a0),6(%a1) | offset 6 (was -2, now fixed)
addq.l #8,%a0 | moved to end
addq.l #8,%a1
The instruction count is the same. The offsets still cost extension words. But the accesses are now sequential: 0, 2, 4, 6, followed by an increment of 8. A clean pattern. And since writing optimization passes for GCC is all about finding and transforming patterns, we have set ourselves up for success.
Why GCC Wanted Fixed Offsets
And it is worth pausing here. This reordered form is actually what a hypothetical 68060 successor would want. None of the four move.w instructions modify a0 or a1, so on an ideal superscalar CPU they could all execute in parallel. With repeated move.w (%a0)+,(%a1)+, each instruction depends on the previous one updating the address register: a serial dependency. Two cycles versus four, and this is how it works on modern CPUs.
This is the fundamental trade-off GCC optimizes for across every target: independent instructions that hardware can overlap. The real 68060 is an early superscalar CPU and more constrained: it can dual-issue two instructions per cycle, but only one can access memory. So two consecutive move.w d(aN),d(aN) cannot pair, but you could interleave register-only ALU work like add.w dN,dN for free. Even on 68060, post-increment is at least as fast as fixed offsets, while saving code size for this use-case. On 68000, the calculation is even more lopsided: (aN)+ avoids the offset extension words entirely, saving 4 cycles per access on the Atari ST bus. By every measure that matters to us, fixed offsets are worse.
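As a sketch of that interleaving (illustrative, simply applying the pairing rule above, not a measured trace): each move occupies the one memory-capable pipe per cycle, so a register-only instruction can ride along in the second pipe for free:

```
move.w (%a0)+,(%a1)+ | cycle 1: the one memory access
add.w  %d0,%d2       | cycle 1: register-only ALU op, pairs for free
move.w (%a0)+,(%a1)+ | cycle 2: memory access again
add.w  %d1,%d3       | cycle 2: pairs with it
```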
Converting to Post-Increment
With clean sequential offsets, a second pass (m68k-autoinc, in m68k-pass-autoinc.cc) completes the job. It runs immediately after m68k-reorder-incr, still before register allocation. This matters: the compiler could otherwise choose indexed addressing like 2(%a0,%d0) and 2(%a1,%d0), sharing an offset register so a single addq advances both. Converting to post-increment before RA eliminates that register entirely, freeing it for other uses.
m68k-autoinc is a bit more complicated (actually several passes). In pseudo-C this is basically what it does:
// For each memory access through a bare register (offset 0):
if (MEM_P(src) && mem_at_reg(src, &regno)) {
    int expected = GET_MODE_SIZE(mode); // 2 for .w, 4 for .l
    // Collect following accesses at ascending offsets
    for (next = NEXT_INSN(insn); /*...*/; next = NEXT_INSN(next)) {
        if (mem_offset(next, regno) == expected)
            fixups.push(next), expected += GET_MODE_SIZE(mode);
        else if (is_reg_increment(next, regno, &incr) && incr >= expected)
            break; // found the trailing addq/lea
    }
    // Convert: first insn → (reg)+, each fixup → reduce offset → (reg)+
    replace_mem_with_postinc(insn, regno);
    for (fix : fixups)
        reduce_offset(fix, mode_size); // 6→4→2→0, each 0 becomes (reg)+
    delete_or_adjust(add_insn); // addq deleted if fully consumed
}
It replaces each access with (aN)+ and deletes the now-redundant addq:
.L28:
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
Four instructions, dual post-increment, no offsets, no addq. What the C code says is now what the compiler generates. 48 cycles per loop saved!
By extending this pass to not require a trailing increment, it also finds sequential memory accesses elsewhere. C++ constructors initialize members in declaration order, that is, memory layout order. rect_s scr_rect(0, 0, 320, 200) also generates sequential stores to 0(%aN), 2(%aN), etc., perfect for m68k-autoinc. Some optimization passes end up doing more than you initially plan for.
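A minimal sketch of such a type, assuming 16-bit members under -mshort; the real rect_s in the game may of course look different:

```cpp
#include <cstdint>

// Hypothetical rect_s: four 16-bit members in declaration order,
// so the constructor emits stores at offsets 0, 2, 4, 6.
struct rect_s {
    int16_t x, y, w, h;
    rect_s(int16_t x_, int16_t y_, int16_t w_, int16_t h_)
        : x(x_), y(y_), w(w_), h(h_) {}
};

// rect_s scr_rect(0, 0, 320, 200) writes the members back to back:
// exactly the ascending-offset pattern m68k-autoinc turns into (aN)+.
```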
The Cherry on Top
But four word-sized moves? We are copying 8 bytes per iteration. On m68k, move.l copies 4 bytes at once. Could we merge adjacent word moves into long moves?
GCC has a mechanism for exactly this: define_peephole2. A peephole runs very late, after register allocation, when the assembly is nearly final. It slides a small 5-instruction window over the instruction stream, matching patterns and replacing them. Not powerful enough for structural transformations, but perfect for simple local rewrites like this one.
The pattern added to m68k.md is straightforward if you dust off your Lisp again:
(define_peephole2
[(set (mem:HI (post_inc:SI (match_operand:SI 0 "register_operand")))
(mem:HI (post_inc:SI (match_operand:SI 1 "register_operand"))))
(set (mem:HI (post_inc:SI (match_dup 0)))
(mem:HI (post_inc:SI (match_dup 1))))]
""
[(set (mem:SI (post_inc:SI (match_dup 0)))
(mem:SI (post_inc:SI (match_dup 1))))])
If you see two consecutive move.w (%aN)+,(%aM)+ with the same registers, replace with one move.l (%aN)+,(%aM)+. Two word copies to adjacent addresses are the same as one long copy. Four word moves become two long moves:
.L29:
move.l (%a0)+,(%a1)+
move.l (%a0)+,(%a1)+
Two instructions, down to two pointer advances, and another 8 cycles saved. This is the kind of optimization that falls out naturally once the post-increment form exists. Without it, there is nothing to merge. And obviously this also applies to the rect_s constructor above, and much else.
A Real-World Example
These are not just synthetic test cases. I have a real audio mixing callback for my game: up to four channels of 8-bit samples mixed into a 256-byte DMA buffer at 12,517 Hz, refilled 50 times per second. For performance, I mix 32 bits at a time instead of byte by byte. The bit overflow between adjacent bytes is technically wrong, but the noise is inaudible, to me at least ;), and in practice the speed gain is substantial. Three loop variants handle the cases: a clear for silence when no channels are playing, a copy for the first active channel, and an add for each remaining channel. These same passes transform all three.
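A minimal sketch of the three variants; the function names, types, and the silence value are my assumptions for illustration, not the game's actual code:

```cpp
#include <cstdint>

constexpr int BUF_BYTES = 256;
constexpr int BUF_LONGS = BUF_BYTES / 4;

// No channels playing: fill the DMA buffer with silence
// (assuming signed 8-bit samples, so silence is 0).
void mix_clear(uint32_t* dst) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ = 0;
}

// First active channel: plain copy, four packed samples per move.
void mix_copy(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ = *src++;
}

// Each further channel: 32-bit add; carries between the packed
// bytes are the "technically wrong" overflow mentioned above.
void mix_add(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ += *src++;
}
```

All three inner loops are exactly the shape the reorder, autoinc, and peephole passes above turn into post-increment long moves.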
The clear produces move.l d0,(%a0)+, the copy produces move.l (%a1)+,(%a0)+, and the add generates move.l (%a1)+,%d1; add.l %d1,(%a0)+. Peak cost for mixing all four channels into a 256-byte buffer: about 8,000 cycles (1,400 for the copy and 2,200 for each add), well under the ~160,000 cycles available per frame at 8 MHz. Or 5%, way down from the almost 15% the stock GCC-15.2 generated.
And as a bonus, the generated code is pretty much what I would write in assembly myself, with the added benefit of being readable, generic C++ that I can change to 25 kHz replay or two channels by editing a single constant, not rewriting dozens of lines of assembly by hand.
Next time: teaching the compiler when, and when not, to optimize — the surprisingly hard problem of a loop instruction.
Comments
No, this one is very independent, not even depending on costs; it was the first thing I implemented. There is no single commit to cherry-pick, but it should be isolated enough if someone wanted to take on such a task.
Yep. Full of dangers; it is easy to rewrite into code that is valid to execute but buggy. So the full pattern also needs to check that the next insn is not a conditional: bcc, or even tas. Now I have to go back and double-check that this is what I did :/.
The same merge can also be done on bytes → words → longs, but only if you can guarantee the first access is on an even address.