Fixing Post-Increment Addressing
Some years after high school, AiO and I worked at the same company for a while. I spent a lot of time at his apartment in Vimmerby, watching demos on my Falcon030 and his accelerated Amiga, playing Frontier: Elite II, and making grand plans for projects that mostly never shipped. I was going to write a Worms clone called Grubs, built around fractal-generated terrain and a neat parallax scrolling trick. In the end the only game released from that apartment was DB Phone Home, a 4K side-scrolling platformer for the Falcon. But the thing I remember most is reading AiO's copy of the MC68060 User's Manual.
The 68060 can not only do a multiply in two clock cycles, but also execute two instructions at once!? Motorola called it superscalar, and I thought it was the most exciting thing in the world. I imagined what an Atari with this beast could do, and where the 68070 would take this — even wider issue, more parallelism, the same trajectory the industry was already on with the Pentium and the PowerPC. Of course, the 68070 never came. ColdFire does not quite count for me. Our beloved CPU family ended with the 68060, and the dream of wider superscalar m68k died with it.
But the industry kept going. GCC optimizes for that dream-made-real on other architectures: x86-64, ARM, RISC-V. Independent instructions that hardware can overlap or even reorder, executing over half a dozen instructions per cycle on, for example, an Apple Silicon M3/M4. This all works as long as there are no data dependencies between consecutive operations. And this is where it goes wrong for us.
The Problem
This has been discussed many times on Atari-Forum, in threads like GCC on ATARI - how much faster could it be?. Beyond just talking, mikro went as far as filing a GCC bug report, with no success. Many people have posted workarounds and incantations: insert this asm statement, use this specific flag combination, sacrifice a goat on a full moon and compile on a Tuesday. But nothing that Just Works™.
Imagine a simple memory copy, *dst++ = *src++ manually unrolled four times:
void copy4(const int* src, int* dst, unsigned int count) {
    for (unsigned int i = 0; i < count / 4; i++) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
This should generate efficient m68k assembly; we have practically handed the compiler the solution on a silver platter.
And yet stock GCC 15.2 at -O2 -mshort -mfastcall gives us this:
.L26:
move.w (%a0),(%a1) | fixed offset 0
move.w 2(%a0),2(%a1) | fixed offset 2
move.w 4(%a0),4(%a1) | fixed offset 4
addq.l #8,%a0 | pointer advance — too early!
addq.l #8,%a1
move.w -2(%a0),-2(%a1) | negative offset — the addq moved above it
Four copies, zero post-increments. The addq pointer advances got hoisted above the fourth access, creating a negative offset. On 68000, each offset costs an extension word and an extra bus access, and then there are the two addq instructions; that is 32 additional clock cycles of overhead no sane assembly programmer would ever consider. Remind me why I should use GCC again if it fails even this simple case this badly?
Down the Rabbit Hole
GCC already has a pass for this: pass_inc_dec (pass 7.29 in the pipeline). It runs early in the RTL phase and converts move (%a0); addq #4,%a0 into move (%a0)+. The pattern match is straightforward: one memory access followed by one increment of the same register. I added fprintf debugging to see why it was not triggering.
The answer: it only works on single instruction pairs with this very specific match. Our unrolled loop has four accesses with ascending offsets and a single combined increment. But the raw GIMPLE still has separate pointer increments: src_23 = src + 2; *src_23; src_26 = src_23 + 2; *src_26, exactly the chain of pairs that pass_inc_dec could handle one by one. Somewhere between the raw GIMPLE and pass_inc_dec, one of the ~200 GIMPLE passes folds the chain into base+offset form. Finding which one by hand would take hours. Claude found it in under five minutes: pass_forwprop (forward propagation, pass 2.20, very early in the GIMPLE pipeline) is the culprit. It folds src + 2 + 2 into src + 4, turning the chain into MEM[src + 0], MEM[src + 2], etc. with a single increment of 8. Fewer operations, fewer temporaries, better for the general case. But it destroys the one-to-one pattern that pass_inc_dec needs.
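To see what forwprop does here, compare a GIMPLE-flavored sketch of the two forms. This is hand-written pseudo-code, not an actual -fdump-tree dump, and the SSA names beyond the ones quoted above are illustrative:

```
/* Before pass_forwprop: a chain of small increments,
   each one a pair that pass_inc_dec could fold. */
_1 = *src;            *dst = _1;
src_23 = src + 2;     dst_24 = dst + 2;
_2 = *src_23;         *dst_24 = _2;
src_26 = src_23 + 2;  dst_27 = dst_24 + 2;
_3 = *src_26;         *dst_27 = _3;
...

/* After pass_forwprop: base+offset accesses and one increment of 8. */
_1 = *src;            *dst = _1;
_2 = MEM[src + 2];    MEM[dst + 2] = _2;
_3 = MEM[src + 4];    MEM[dst + 4] = _3;
_4 = MEM[src + 6];    MEM[dst + 6] = _4;
src_31 = src + 8;     dst_32 = dst + 8;
```

The second form is exactly what the stock assembly above reflects: fixed offsets plus one combined increment per pointer.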
I briefly considered rewriting pass_inc_dec to handle multi-step sequences, but decided it was simpler and safer to keep the fix m68k-specific. A target-specific pass can make target-specific assumptions without risk to every other backend.
Fixing the Reordering
The first step is not even an optimization, just cleaning up the mess so we can find patterns in it.
A new m68k pass (m68k-reorder-incr, in m68k-pass-memreorder.cc) runs after GCC's instruction scheduler and before register allocation, so RA can make better decisions on the cleaned-up code. In simplified pseudo-C, what this new pass does for each basic block is roughly:
// For each increment (addq, lea) in the basic block:
for (insn = BB_HEAD; insn != BB_END; insn = NEXT_INSN(insn)) {
    if (!is_reg_increment(insn, &regno, &incr))
        continue;
    // Scan forward: collect insns with negative offsets on same register
    for (next = NEXT_INSN(insn); has_negative_offset(next, regno); next = NEXT_INSN(next))
        adjust_offset(next, +incr); // -2 + 8 = 6
    move_insn_after(insn, last_fixup); // move addq to after all accesses
}
In practice the pass also handles arithmetic with memory operands and works across basic block boundaries. The patterns grew over time as I found more cases in real code, but the core idea is what you see above.
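As an illustrative example of the memory-operand case (made up for this post, not a real dump), an arithmetic instruction reading through the incremented register gets the same treatment as a move:

```
| before m68k-reorder-incr: increment hoisted above the access
addq.l #4,%a0
add.w  -4(%a0),%d0   | negative offset

| after: offset adjusted by +4, increment moved below
add.w  (%a0),%d0
addq.l #4,%a0
```

With the increment back where it belongs, the later m68k-autoinc pass can fold this pair into add.w (%a0)+,%d0.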
It moves the pointer advances past the memory accesses that follow them, adjusting negative offsets to positive, leaving us with this assembly:
.L26:
move.w (%a0),(%a1) | offset 0
move.w 2(%a0),2(%a1) | offset 2
move.w 4(%a0),4(%a1) | offset 4
move.w 6(%a0),6(%a1) | offset 6 (was -2, now fixed)
addq.l #8,%a0 | moved to end
addq.l #8,%a1
The instruction count is the same. The offsets still cost extension words. But the accesses are now sequential: 0, 2, 4, 6, followed by an increment of 8. A clean pattern. And since writing optimization passes for GCC is all about finding and transforming patterns, we have set ourselves up for success.
Why GCC Wanted Fixed Offsets
And it is worth pausing here. This reordered form is actually what a hypothetical 68060 successor would want. None of the four move.w instructions modify a0 or a1, so on an ideal superscalar CPU they could all execute in parallel. With repeated move.w (%a0)+,(%a1)+, each instruction depends on the previous one updating the address register: a serial dependency. Two cycles versus four, and this is how it works on modern CPUs.
This is the fundamental trade-off GCC optimizes for across every target: independent instructions that hardware can overlap. The real 68060 is an early superscalar CPU and more constrained: it can dual-issue two instructions per cycle, but only one can access memory. So two consecutive move.w d(aN),d(aN) cannot pair, but you could interleave register-only ALU work like add.w dN,dN for free. Even on 68060, post-increment is at least as fast as fixed offsets, while saving code size for this use-case. On 68000, the calculation is even more lopsided: (aN)+ avoids the offset extension words entirely, saving 4 cycles per access on the Atari ST bus. By every measure that matters to us, fixed offsets are worse.
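As a sketch of that interleaving (illustrative, simply applying the pairing rule above, not a measured trace): each move occupies the one memory-capable pipe per cycle, so a register-only instruction can ride along in the second pipe for free:

```
move.w (%a0)+,(%a1)+ | cycle 1: the one memory access
add.w  %d0,%d2       | cycle 1: register-only ALU op, pairs for free
move.w (%a0)+,(%a1)+ | cycle 2: memory access again
add.w  %d1,%d3       | cycle 2: pairs with it
```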
Converting to Post-Increment
With clean sequential offsets, a second pass (m68k-autoinc, in m68k-pass-autoinc.cc) completes the job. It runs immediately after m68k-reorder-incr, still before register allocation. This matters: the compiler could otherwise choose indexed addressing like 2(%a0,%d0) and 2(%a1,%d0), sharing an offset register so a single addq advances both. Converting to post-increment before RA eliminates that register entirely, freeing it for other uses.
m68k-autoinc is a bit more complicated (actually several passes). In pseudo-C this is basically what it does:
// For each memory access through a bare register (offset 0):
if (MEM_P(src) && mem_at_reg(src, &regno)) {
    int expected = GET_MODE_SIZE(mode); // 2 for .w, 4 for .l
    // Collect following accesses at ascending offsets
    for (next = NEXT_INSN(insn); /*...*/; next = NEXT_INSN(next)) {
        if (mem_offset(next, regno) == expected)
            fixups.push(next), expected += GET_MODE_SIZE(mode);
        else if (is_reg_increment(next, regno, &incr) && incr >= expected)
            break; // found the trailing addq/lea
    }
    // Convert: first insn → (reg)+, each fixup → reduce offset → (reg)+
    replace_mem_with_postinc(insn, regno);
    for (fix : fixups)
        reduce_offset(fix, mode_size); // 6→4→2→0, each 0 becomes (reg)+
    delete_or_adjust(add_insn); // addq deleted if fully consumed
}
It replaces each access with (aN)+ and deletes the now-redundant addq:
.L28:
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
move.w (%a0)+,(%a1)+
Four instructions, dual post-increment, no offsets, no addq. What the C code says is now what the compiler generates. 48 cycles per loop saved!
By extending this pass to not require a trailing increment, it also finds sequential memory accesses elsewhere. C++ constructors initialize members in declaration order, that is, memory layout order. rect_s scr_rect(0, 0, 320, 200) also generates sequential stores to 0(%aN), 2(%aN), etc., perfect for m68k-autoinc. Some optimization passes end up doing more than you initially plan for.
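A minimal sketch of such a type, assuming 16-bit members under -mshort; the real rect_s in the game may of course look different:

```cpp
#include <cstdint>

// Hypothetical rect_s: four 16-bit members in declaration order,
// so the constructor emits stores at offsets 0, 2, 4, 6.
struct rect_s {
    int16_t x, y, w, h;
    rect_s(int16_t x_, int16_t y_, int16_t w_, int16_t h_)
        : x(x_), y(y_), w(w_), h(h_) {}
};

// rect_s scr_rect(0, 0, 320, 200) writes the members back to back:
// exactly the ascending-offset pattern m68k-autoinc turns into (aN)+.
```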
The Cherry on Top
But four word-sized moves? We are copying 8 bytes per iteration. On m68k, move.l copies 4 bytes at once. Could we merge adjacent word moves into long moves?
GCC has a mechanism for exactly this: define_peephole2. A peephole runs very late, after register allocation, when the assembly is nearly final. It slides a small 5-instruction window over the instruction stream, matching patterns and replacing them. Not powerful enough for structural transformations, but perfect for simple local rewrites like this one.
The pattern added to m68k.md is straightforward if you dust off your Lisp again:
(define_peephole2
[(set (mem:HI (post_inc:SI (match_operand:SI 0 "register_operand")))
(mem:HI (post_inc:SI (match_operand:SI 1 "register_operand"))))
(set (mem:HI (post_inc:SI (match_dup 0)))
(mem:HI (post_inc:SI (match_dup 1))))]
""
[(set (mem:SI (post_inc:SI (match_dup 0)))
(mem:SI (post_inc:SI (match_dup 1))))])
If you see two consecutive move.w (%aN)+,(%aM)+ with the same registers, replace with one move.l (%aN)+,(%aM)+. Two word copies to adjacent addresses are the same as one long copy. Four word moves become two long moves:
.L29:
move.l (%a0)+,(%a1)+
move.l (%a0)+,(%a1)+
Two instructions, down to two pointer advances, and another 8 cycles saved. This is the kind of optimization that falls out naturally once the post-increment form exists. Without it, there is nothing to merge. And obviously this also applies to the rect_s constructor above, and much else.
A Real-World Example
These are not just synthetic test cases. I have a real audio mixing callback for my game: up to four channels of 8-bit samples mixed into a 256-byte DMA buffer at 12,517 Hz, refilled 50 times per second. For performance, I mix 32 bits at a time instead of byte by byte. The bit overflow between adjacent bytes is technically wrong, but the noise is inaudible, to me at least ;), and in practice the speed gain is substantial. Three loop variants handle the cases: a clear for silence when no channels are playing, a copy for the first active channel, and an add for each remaining channel. These same passes transform all three.
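A minimal sketch of the three variants; the function names, types, and the silence value are my assumptions for illustration, not the game's actual code:

```cpp
#include <cstdint>

constexpr int BUF_BYTES = 256;
constexpr int BUF_LONGS = BUF_BYTES / 4;

// No channels playing: fill the DMA buffer with silence
// (assuming signed 8-bit samples, so silence is 0).
void mix_clear(uint32_t* dst) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ = 0;
}

// First active channel: plain copy, four packed samples per move.
void mix_copy(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ = *src++;
}

// Each further channel: 32-bit add; carries between the packed
// bytes are the "technically wrong" overflow mentioned above.
void mix_add(uint32_t* dst, const uint32_t* src) {
    for (int i = 0; i < BUF_LONGS; i++)
        *dst++ += *src++;
}
```

All three inner loops are exactly the shape the reorder, autoinc, and peephole passes above turn into post-increment long moves.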
The clear produces move.l d0,(%a0)+, the copy produces move.l (%a1)+,(%a0)+, and the add generates move.l (%a1)+,%d1; add.l %d1,(%a0)+. Peak cost for mixing all four channels into a 256-byte buffer: about 8,000 cycles (1,400 for the copy and 2,200 for each add), well under the ~160,000 cycles available per frame at 8 MHz. Or 5%, way down from the almost 15% the stock GCC-15.2 generated.
And as a bonus, the generated code is pretty much what I would write in assembly myself, with the added benefit of being readable, generic C++ that I can change to 25 kHz replay or two channels by editing a single constant, not rewriting dozens of lines of assembly by hand.
Next time: teaching the compiler when, and when not, to optimize — the surprisingly hard problem of a loop instruction.
Comments
No, this one is very independent, not even depending on costs; it was the first thing I implemented. There is no single commit to cherry-pick, but it should be isolated enough if someone wanted to take on such a task.
Yep. Full of dangers; it is easy to rewrite into code that is valid to execute but buggy. So the full pattern also needs to check that the next insn is not a conditional: bcc, or even tas. Now I have to go back and double-check that this is what I did :/.
The same merge can also be done on bytes → words → longs, but only if you can guarantee the first access is on an even address.