This is a brain dump of what I have learned working with the GCC m68k backend, and maybe an attempt to convince someone else to try. This is the first of an unknown number of posts. No promises for how many there will be; I will continue as long as I have something to say and I find it fun.
I got my start with STOS Basic on an Atari 520STfm around 1990. Me and my classmate Tam formed T.O.Y.S. (Terror on Your ST) and I dubbed myself PeyloW. But in the scene, elite sceners wrote assembly; only lamers used STOS or GFA, every scroll text was clear about this. So we bought DevPac 2 and taught ourselves 68000 assembly, starting with snippets embedded in STOS and eventually graduating to full demo screens. The pattern that would follow me for decades was established early: high-level languages for tooling, assembly for anything that had to be fast. STOS gave way to PurePascal in the late '90s, but assembly remained the language that mattered — right through to the Falcon030 demo "Wait", released at EIL 2001.
My active participation in the scene waned, but I never lost sight of it. For years I stayed as an observer, following releases and discussions from the sidelines. Then around 2021 I had an itch, maybe a mid-life crisis: get back to the simpler machines (the kind a single person can keep entirely in their head) and realize a teenage dream of publishing a completed game. C and C++ had become my main languages through University and work, and modern cross-development tools meant I could use them for Atari too. Not just for tooling, but as the scaffolding of the entire project, dipping into assembly only for the bottlenecks. And as my friend AiO likes to joke: C is just a really powerful macro assembler.
GCC Is (No Longer) Written for Us
The m68k was one of GCC's first backends, present alongside VAX in the 1987 GCC-1.0 release. For a long time it was a first-class citizen. But the world moved on, and the backend fell into disrepair, barely in maintenance mode, with no one actively working on it.
To be fair, the great strides made in modern compiler optimization are what keep the m68k backend limping along. For most codebases the result is on par with yesteryear, even if it completely fails at many of the specifics. Even a 68060 fitted into a Falcon with a CT63 is ancient by modern CPU standards. The optimizations that GCC's middle-end applies (instruction scheduling, loop transformations, register heuristics and reordering) are tuned for modern highly parallel superscalar CPUs, and when they miss on m68k, they miss badly.
Take the inner loop of a simple memory copy (mikro will recognize this one), in C:
*dst++ = *src++;
*dst++ = *src++;
*dst++ = *src++;
*dst++ = *src++;
Any experienced m68k programmer would expect (a0)+ and (a1)+, post-increment addressing, the most natural idiom on our architecture. The compiler should be able to generate this just as-is — it is how the code reads. Here is what stock GCC-15.2 produces at -O2:
.L3:
move.l (%a0),(%a1) | plain indexed, no post-increment
move.l 4(%a0),4(%a1)
move.l 8(%a0),8(%a1)
lea (16,%a0),%a0 | pointer update separated from accesses
lea (16,%a1),%a1
move.l -4(%a0),-4(%a1) | negative offset — the lea moved too early
The perfectly fine inner loop gets butchered in the name of scheduling for superscalar execution. Instructions get reordered, pointer increments get separated from their memory accesses, and the fourth copy ends up using a negative offset because the lea was hoisted above it. The result is slower and larger than what GCC-2.95 would have produced, and not even close to what an elite scener would have written. For command-line tools and utilities this is tolerable. For realtime demos and games, it is not.
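For context, here is a complete, host-runnable version of the loop those four statements live in. The function name, signature, and the assumption that the count is a multiple of four longwords are my own; this is a sketch, not the original source:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical framing of the unrolled copy from the text;
   n is assumed to be a multiple of 4 longwords */
static void copy_longs(uint32_t *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
```

The post-increment in the C source maps one-to-one onto the (a0)+ addressing mode; the point of the listing above is that modern GCC refuses to take the hint.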
And Yet — GCC Can Work for Us
But there is light at the end of the tunnel.
Using a cross-compiler changes the way you work; if nothing else, I no longer have to code through the mail-box slit that is a 640x200px screen. I write my software to be 95% target-agnostic, so I can build and debug it as a macOS binary with Xcode (my tool and platform of choice). A thin compatibility layer lets me run the same code in Hatari for target-specific testing, and of course on real hardware. Access to modern tools (debuggers, profilers, sanitizers) multiplies productivity. More importantly, it lets me spend time doing the fun stuff instead of getting bogged down in boilerplate.
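The compatibility layer can be very small. A sketch of the shape such a layer might take — every name here is illustrative, not from the actual project:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* sketch of a thin compatibility layer: the 95% target-agnostic code
   talks to a neutral API, and each platform plugs in its own backend */
typedef struct {
    void (*present)(const uint8_t *frame, size_t size);
} backend_t;

/* host backend: just capture the frame so tests can inspect it;
   an Atari backend would instead write bitplanes to screen memory */
static uint8_t host_screen[320 * 200];
static void host_present(const uint8_t *frame, size_t size)
{
    memcpy(host_screen, frame, size);
}

static backend_t backend = { host_present };

void present_frame(const uint8_t *frame)
{
    backend.present(frame, sizeof host_screen);
}
```

The game code only ever calls present_frame; swapping the backend struct at build time is what makes the same sources debuggable in Xcode and runnable in Hatari.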
Looking at the generated code, the mistakes seem so obvious. Reverting to older compilers like GCC-4.6.4, or even 2.95, produces more "correct" code. But then I'm locked into 20-plus-year-old C/C++ and lose many of the high-level optimizations: dead code elimination, constant propagation, inlining, loop unrolling. The systemic improvements in GCC's middle-end over 30+ years are what keep modern GCC competitive despite the neglected backend, and C++20 is just a completely different beast than plain C89. The foundation of modern GCC is solid; it is the last mile that is broken.
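A toy illustration of what those middle-end optimizations buy you — the function names are mine, invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* with inlining and constant propagation, a modern GCC at -O2 folds
   the call in demo() to a single load of the constant 0x1000;
   compilers of the 2.95 era leave the call and the shift at run time */
static inline uint16_t mask_for(int x)
{
    return (uint16_t)(0x8000u >> (x & 15));
}

uint16_t demo(void)
{
    return mask_for(3);
}
```

These cross-function optimizations happen long before the m68k backend gets involved, which is exactly why they survive its neglect.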
For Chroma Grid (the game we released at Sommarhack 2024) this is good enough. By holding GCC's hand and sprinkling in some handwritten assembly for the hot paths, something decent can be made. But it nags at me.
I am far from the only one who has noticed. In threads like "GCC on ATARI - how much faster could it be?" on Atari Forums, and many others over the years, users like mikro, dml, and others have shared tips, workarounds, and musings about how to coax better code out of GCC for m68k. Add to that Bebbo's optimizations for Amiga GCC-6, which I have only ever read about but which seem promising.
What if we could have both: modern high-level optimizations and m68k-aware code generation? How hard could it be to fix some of this?
Enter Claude Code
The answer to "how hard could it be?" turns out to be: very hard.
My first real attempt at modifying the backend was adding a fastcall calling convention, mostly driven by a wish to embed my old assembly snippets in C/C++ code (assembly I had written for the PurePascal ABI, a topic I hope to cover in a future post). I never got that working properly, but the idea lives on as inspiration for the -mfastcall ABI in Thorsten Otto's fork of GCC, which I use and highly recommend.
Beyond that, I have attempted to fix the m68k backend on and off a couple of times over the years, and never got very far. GCC is one huge, daunting codebase: millions of lines of C and C++, with decades of accumulated abstractions, conventions, and here-be-dragons. The documentation that exists is written for compiler developers, not for amateurs trying to understand one very specific backend. Getting something to compile that does what you want is hard enough. But the real wall is debugging: when a change produces wrong code, or an internal compiler error, or simply different code than expected, understanding why requires tracing through pass after pass of intermediate representations that are dense, verbose, and deeply interconnected. I find it almost impossible.
Each attempt follows the same pattern: read some GCC documentation during long Google sessions, make a small change, rebuild (which takes a while), test, get confused by the results, spend hours staring at RTL dumps without understanding what I am looking at, and eventually give up. No breakpoints, no watch expressions, no call stacks — just fprintf debugging in a codebase that feels like bringing a butter knife to a gunfight. Nothing like the modern development environment I have grown used to and, frankly, been spoiled by.
In my day job as a performance engineer at a fruit company in California (a career that the skill set from the Atari demo scene prepared me for surprisingly well) I was introduced to Claude Code in the fall of 2025. Old and grumpy, I was dragged into the "future". But after seeing what it is capable of and not capable of, I thought: what if I point this at the m68k backend of GCC?
It turns out Claude knows far more m68k than I expected (though, not to brag, not quite as much as me). But more importantly, it is an expert at browsing large codebases like GCC and explaining them succinctly in a way I can understand. It can explain what an RTX or INSN is, and why I should care. And even more valuable than that: parsing the intermediate internal data that GCC uses (RTL dumps, pass output, cost calculations) and extracting only the important parts. For bug fixes and for understanding how internal decisions are made, this is indispensable.
As for writing code, you should think of Claude as a very enthusiastic and overconfident, but productive, intern. It can be a bit lazy, preferring to patch symptoms rather than fix underlying issues. But with careful coaching it works very well for formulating plans of attack, and it helps clean up the resulting code, which matters all the more given that GCC has, in my humble opinion, a very particular coding style that takes getting used to. And as for writing documentation: that is where I find it really shines.
With the right tools, I now have a fighting chance to improve my tools.
The Running Example
Throughout this series I will use a simple pixel-plotting function as a running example. It is simplified from real code, but it touches all the things that make Atari ST programming interesting from a compiler's perspective: memory layout, shifts, masks, and a tight inner loop.
A quick reminder: ST-low is 320x200 with 16 colors stored in 4 interleaved bitplanes, each pixel spread across 4 consecutive words.
void put_pixel(uint16_t *screen, int x, int y, uint8_t col) {
    int base = y * 80 + ((x / 16) * 4);
    uint16_t mask = (uint16_t)(0x8000u >> (x & 15));
    for (int p = 0; p < 4; p++) {
        if (col & (1 << p))
            screen[base + p] |= mask;
        else
            screen[base + p] &= ~mask;
    }
}
And here is what stock GCC-15.2 makes of it at -O2 for 68000. Brace yourself:
put_pixel:
movem.l %d2-%d4,-(%sp)
move.l 20(%sp),%d0 | x
move.l 24(%sp),%a0 | y (in an address register?)
moveq #15,%d1
and.l %d0,%d1
move.l #32768,%d2
lsr.l %d1,%d2 | mask
moveq #0,%d3
move.b 31(%sp),%d3 | col
move.l %a0,%d1 | y * 80 done as y*5*16, good
add.l %a0,%d1
add.l %d1,%d1
add.l %a0,%d1
add.l %d1,%d1
add.l %d1,%d1
tst.l %d0 | x < 0 check (signed divide)
jlt .L18
asr.l #4,%d0 | x / 16
add.l %d1,%d0
lsl.l #3,%d0 | * 8: the *4 plane stride and the *2 word-to-byte scaling combined
move.l 16(%sp),%a0
add.l %d0,%a0
moveq #0,%d0
move.w %d2,%d1
not.w %d1
move.w %d1,%a1 | ~mask stashed in an address register?
.L14:
move.w (%a0)+,%d1
btst %d0,%d3
jeq .L12
or.w %d2,%d1
move.w %d1,-2(%a0) | write back with negative offset!
addq.l #1,%d0
moveq #4,%d1
cmp.l %d0,%d1
jne .L14
movem.l (%sp)+,%d2-%d4
rts
.L12:
move.w %a1,%d4 | ~mask recovered from address register?!
and.w %d4,%d1
move.w %d1,-2(%a0)
addq.l #1,%d0
moveq #4,%d1
cmp.l %d0,%d1
jne .L14
jra .L19
There is a lot going on, and not all of it is flattering. y ends up in an address register. The inverted mask gets stashed in a1 because the register allocator ran out of data registers. The inner loop uses move.w (%a0)+ followed by move.w %d1,-2(%a0) — a post-increment immediately undone by a negative offset write-back. This is the code we are going to fix.
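Before changing anything, it is worth pinning down the function's behavior on the host, in the spirit of the target-agnostic workflow described earlier. A minimal sanity check of put_pixel as listed above; the test coordinates and expected words are my own, worked out by hand from the bitplane layout:

```c
#include <assert.h>
#include <stdint.h>

/* put_pixel exactly as in the listing above */
void put_pixel(uint16_t *screen, int x, int y, uint8_t col) {
    int base = y * 80 + ((x / 16) * 4);
    uint16_t mask = (uint16_t)(0x8000u >> (x & 15));
    for (int p = 0; p < 4; p++) {
        if (col & (1 << p))
            screen[base + p] |= mask;
        else
            screen[base + p] &= ~mask;
    }
}
```

For x=17, y=1, color 5 (planes 0 and 2): the row starts at word 80, x lands in the second 4-word group (words 84..87), and the mask is 0x8000 >> 1 = 0x4000.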
In the posts that follow, we will explore how all of this works and maps to the m68k instruction set: how the ABI determines where the arguments arrive, what the cost model thinks of the inner loop, and how the optimization passes transform it step by step, from C text to machine code, even an elite can accept.
- Getting started: If you want to try GCC for m68k yourself, Thorsten Otto maintains cross-compiler packages for MiNT ELF at tho-otto.de/crossmint.php. You will need the compiler, binutils, and either mintlib or libcmini.
- If you are brave: My experimental branch of GCC-15.2 can be found as a PR here github.com/th-otto/m68k-atari-mint-gcc/pull/2. It is not fully production-ready yet, but it can work well enough.
- Hatari Debugging: For debugging and profiling with Hatari, I use Tat's hrdb, a GUI debugger available at clarets.org/steve/projects/hrdb.html. It uses its own modified Hatari, available at the same place.
- On HW Debugging: While I have yet to try it myself, Hildenborg of Omega fame has shared his own tool-chain, including a gdbserver, over at: github.com/hildenborg/m68k-atari-dev.
- Alternative compiler: If you only need a subset of C99 and no C++, vbcc is a solid alternative cross-compiler for m68k with good code generation.
If this was interesting, I am planning to continue with how GCC's compilation pipeline works, ABIs, the cost model, and register allocation. If you have suggestions for topics you would like to see covered, or questions about GCC and m68k, I would love to hear them.
Comments
I've played around a bit with Hildenborg's tool-chain, and it is not only for HW debugging; it works great with Hatari as well. From my point of view it has two major strengths:
1. It is based on VS Code, so you edit, build and debug in an environment that is very familiar to many of us.
2. It offers source-level debugging for C and C++. Just set breakpoints in your source code, run, single-step etc. and inspect your variables just like you are used to.
To me, porting my ~100k-line C++ project to Atari, that is a real game changer :)
You have basically done what I have been wanting to do for years but never had the courage. ;)
Oh and btw, W.A.I.T. still belongs to my favourite Falcon demos, I hope that one day we will see something more on Falcon from you!
I have the same OC-style affliction you suffer from, in that I disassemble even the most infrequent utility functions and freak out if I see stuff like
andi.l #$ffff, d0 ..
or
move.l d0,0(a0)
move.l d1,4(a0)
addq.l #8,a0
or of course terrible loops that might use something like above and then add compares.
My solution was to abuse macros and gcc assembly constraints and build up a library of pure inline functions and macros:
#define _XX_to_32(width_in, bw, out_t) \
ALWAYS_INLINE out_t width_in ## _to_32(width_in ## _t in) { \
if (__builtin_constant_p(in)) { \
return in; \
} else { \
out_t out; \
asm volatile("\t" "moveq #0,%[Reg32]" \
"\t\n""move."bw" %[RegSrc],%[Reg32]" \
:[Reg32] "=&d" (out):[RegSrc] "rim" (in):"cc"); \
return out; \
} \
}
_XX_to_32(uint8, "b", uint32_t); // uint32_t uint8_to_32(uint8_t in8);
_XX_to_32(uint16, "w", uint32_t); // uint32_t uint16_to_32(uint16_t in16);
This kind of approach reaches its limits about when you have to use asm goto, as you run into limitations of even the most modern gcc and llvm.
Basically you wreck the optimization passes, so you can use it only in the simplest, most innermost loops.
So real loop improvements pretty much must be done by modifying the sources, either the gcc lisp/scheme stuff or the cost model.
Everyone's moved on to newer branches, but I prefer to just "stabilize" a tool chain... and then, when I feel I know it inside out (nowhere near there now), then... carefully upgrade and measure changes, so that the floor does not disappear from underneath me just when I think I figured it all out...
Anyway, great work PeyloW!
I have good news for you ;) That inline asm hack is exactly what my branch now does for you, with cleaner code, and in all the places you forgot to use it.
Started typing up the next one, a bit more technical, going into the constraints a compiler has that we usually don't care about when coding asm. The "loop stuff" will be a later one; it is a very multilayered problem.
Jaguar developers will also benefit from this. THX!
Thanks.
I am curious: for the Jaguar, is there any compiler support for the RISC CPUs, or is it always handwritten asm?
I think there is a highly experimental C compiler for the Jaguar RISC, but it is practically useless.
There is an old GCC 2.95, but its output needs some tweaking to fix issues.