DBRA and the Cost Model Balancing Act
When you write assembly for the m68k, dbra is just there. Put your count in a data register, end your loop with dbra dN,.label. Four bytes, twelve cycles per iteration on 68000, no flag dance, no comparison. Your loops naturally form around the instruction.
C and C++ do not work that way. The natural loop in C is for (i = 0; i < N; i++) or while (cond). Neither form says "count down and exit when you wrap" (a while loop can be written that way, but rarely is). A counted-up loop with an unsigned comparison is what the language naturally models, and that is what GIMPLE mostly produces. Turning that into a dbra is a backend job that requires carefully written code. And GCC, for reasons we will get to, mostly stopped doing it for m68k.
What is an Induction Variable?
Before we can talk about why dbra is missing, we need to talk about how GCC sees a loop. Consider:
for (int i = 0; i < 40; i++)
    buf[i] = f(i * 80);
To you and me this is one loop with three things going on. To GCC's GIMPLE-SSA representation, it is a loop with three induction variables, IVs for short. An IV is a variable that changes by a constant amount each iteration:
- i: increments by +1 per iteration, used in the exit test
- i * 80: increments by +80 per iteration, used as f's argument
- &buf[i]: increments by +sizeof(int) per iteration, used as the store address
Each of these can be initialized and updated independently, or derived from one of the others on the fly. The combinations multiply quickly. GCC's loop optimizer's job is to figure out which arrangement is most economical. The cost calculus matters: i + 1 and &buf[i] + sizeof(int) both translate to a cheap addq, but i * 80 is a multiplication, which costs real cycles on 68000. So updating i*80 by +80 each iteration is much cheaper than recomputing i*80 from a freshly-incremented i.
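The difference between recomputing and updating can be sketched in plain C. This is an illustration of the idea, not what GCC emits. The body of f is a stand-in (the post never defines it), and the names fill_recompute and fill_update are made up for the example.

```c
#include <assert.h>

enum { N = 40 };

/* Hypothetical stand-in for f(); the post does not define it. */
static int f(int x) { return x; }

/* Recompute form: one multiplication per iteration. */
void fill_recompute(int buf[N])
{
    for (int i = 0; i < N; i++)
        buf[i] = f(i * 80);
}

/* Update form: i * 80 becomes its own IV, advanced by +80 each iteration. */
void fill_update(int buf[N])
{
    int scaled = 0;                 /* always equal to i * 80 */
    for (int i = 0; i < N; i++) {
        assert(scaled == i * 80);   /* the invariant that makes the rewrite legal */
        buf[i] = f(scaled);
        scaled += 80;               /* an add instead of a multiply */
    }
}
```

Both functions fill the buffer with the same values; the second one simply never multiplies inside the loop.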
Each one is also a candidate for "the" loop counter. IVOPTS, the induction variable optimization pass, evaluates them and picks. The exit test does not have to be i < 40 in the final code; it could equivalently be &buf[i] < &buf[40] (pointer compare) or i*80 < 3200 (multiplied form). The shape that minimizes total cost wins, and the others are eliminated or recomputed from the survivor.
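To make the "any IV can carry the exit test" point concrete, here is the same loop written twice in C, once exiting on the pointer and once on the multiplied form. Again a sketch: f is a hypothetical stand-in, the function names are invented, and real IVOPTS works on GIMPLE rather than on source code.

```c
#include <assert.h>

enum { N = 40 };

/* Hypothetical stand-in for f(); the post does not define it. */
static int f(int x) { return x; }

/* Exit test on the pointer IV: i is gone, p < &buf[40] ends the loop. */
void fill_ptr_exit(int buf[N])
{
    int *p = buf;
    int *end = buf + N;             /* &buf[40], computed once before the loop */
    int scaled = 0;
    while (p < end) {
        *p++ = f(scaled);
        scaled += 80;
    }
    assert(scaled == N * 80);       /* the other IV stayed in step */
}

/* Exit test on the multiplied IV: scaled < 3200 ends the loop. */
void fill_scaled_exit(int buf[N])
{
    int *p = buf;
    for (int scaled = 0; scaled < N * 80; scaled += 80)
        *p++ = f(scaled);
    assert(p == buf + N);           /* the pointer IV stayed in step */
}
```

In both versions the original counter i has disappeared entirely, which is the "eliminated" case described above.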
IVOPTS also accounts for the number of IVs kept alive. Every additional IV is one more value the register allocator has to fit; if the loop body is already register-hungry, an extra IV may force a spill to the stack. So fewer IVs is better, all else equal.
This is where dbra enters. dbra is not one of the "natural" IVs from your C code. It is a fourth candidate that the backend asks IVOPTS to consider: a dedicated count-down counter in a data register, used only for the exit test. Adding it costs one register and one initialization. To prefer it, IVOPTS has to prove, in cost units, that the savings on the exit test pay for that overhead.
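Written back into C, the dedicated counter looks roughly like the sketch below. It assumes the usual dbra convention of loading N-1 and leaving the loop when the 16-bit counter reaches -1; f is a hypothetical stand-in and fill_doloop is an invented name.

```c
#include <assert.h>

enum { N = 40 };

/* Hypothetical stand-in for f(); the post does not define it. */
static int f(int x) { return x; }

/* Three "natural" IVs reduced to two (pointer, scaled value),
   plus a fourth candidate: a counter used for nothing but the exit. */
void fill_doloop(int buf[N])
{
    int *p = buf;
    int scaled = 0;
    short count = N - 1;            /* dbra works on a 16-bit counter */
    do {
        *p++ = f(scaled);
        scaled += 80;
    } while (--count != -1);        /* decrement, fall out of the loop at -1 */
    assert(p == buf + N);           /* exactly N iterations ran */
}
```

The counter never appears in the loop body. That property is what lets the final do/while collapse into a single dbra, and it is also the property that sharing destroys.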
The Hooks That Were Never There
GCC has machinery for this. Three target hooks, in fact:
- TARGET_PREDICT_DOLOOP_P is asked early, in GIMPLE: could this loop use a hardware counter?
- TARGET_IV_COMPARE_COST tells IVOPTS how expensive the exit comparison really is for a given IV.
- TARGET_DOLOOP_COST_FOR_GENERIC returns a penalty when a doloop counter is also read inside the loop body.
All three landed in GCC 11, released April 2021. They were added for ARM Cortex-M's Low Overhead Branch and the Power CTR register. m68k entered maintenance mode many years before that, and the m68k backend never adopted any of them. So if you have ever seen modern GCC emit a dbra, that was not skill; that was luck: the generic doloop pass happened to succeed at final pattern matching. Most of the time, you got subq.w #1,%d0; jne .L1 or, worse, addq + cmp + jne.
The fix is to plug into these hooks and teach the cost model how m68k actually works.
Costing the Exit Test Honestly
TARGET_IV_COMPARE_COST lets the backend tell IVOPTS what an exit comparison really costs for a given mode and IV form. The default assumes register-to-register compare at one insn-cost. On 68000 with -mshort, that is wrong in two ways at once. First, comparing against an immediate costs more than comparing against a register: cmp.w #imm,%dN is a 4-byte instruction (4 cycles for the compare plus 4 cycles for the extension word fetch on the Atari ST bus). Second, dbra has no comparison at all; the decrement-and-branch is a single instruction, and on 68000, the exit test cost is effectively zero compared to a cmp + jne pair.
With accurate costs, IVOPTS can decide whether a count-down dbra IV pays back its setup overhead. For many loops it does, but not always. That is the whole point of having a cost model rather than a rule. Without these costs, IVOPTS does not get to make an informed decision at all: it systematically undervalues dbra because it underestimates how much the count-up form actually costs to test.
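The payback question can be put into a few lines of arithmetic. The sketch below is a toy model, not the backend's cost code: 12 and 24 are the per-iteration exit costs used in this post, while the setup cost is left as a parameter because it depends on the surrounding code (a single move in the best case, a spill and reload in the worst). All names are invented for the example.

```c
#include <assert.h>

/* Per-iteration exit costs in cycles, as quoted in this post. */
enum {
    EXIT_DBRA    = 12,   /* dbra %dN,.label */
    EXIT_COUNTUP = 24    /* count-up exit with an immediate compare */
};

/* Exit-test cycles over the whole loop, count-up form. */
long exit_cycles_countup(long iterations)
{
    assert(iterations >= 1);
    return iterations * EXIT_COUNTUP;
}

/* Exit-test cycles over the whole loop, dbra form.
   setup_cycles is an assumption supplied by the caller. */
long exit_cycles_dbra(long iterations, long setup_cycles)
{
    assert(iterations >= 1 && setup_cycles >= 0);
    return setup_cycles + iterations * EXIT_DBRA;
}

/* Nonzero if the dedicated counter is strictly cheaper overall. */
int dbra_pays_off(long iterations, long setup_cycles)
{
    return exit_cycles_dbra(iterations, setup_cycles)
         < exit_cycles_countup(iterations);
}
```

With four iterations and a 4-cycle setup, the dbra form costs 52 cycles against 96. With a single iteration and a 20-cycle setup, it loses, 32 against 24. The model ignores register pressure, which is exactly the part IVOPTS has to weigh on top.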
When Sharing Costs You Money
For reference, this is the inner loop in put_pixel from Part 1:
for (int p = 0; p < 4; p++) {
    if (col & (1 << p))
        screen[base + p] |= mask;
    else
        screen[base + p] &= ~mask;
}
Two IVs are obvious: the counter p (used both as a bit shift in 1 << p and as the exit test) and the pointer &screen[base + p]. A third IV would be a dedicated dbra counter, used only for the exit. Three uses, three potential candidates. Now the cost model decides.
A quick detour into how IVOPTS makes its choice. The IVs we listed earlier are candidates: possible loop variables IVOPTS could keep alive. The places in the loop that need a value (the exit test, each memory access, each body computation) are uses. Every use has to be served by one of the candidates. IVOPTS's job is to pick a set of candidates that collectively cover every use, and that set is the group. The group is scored as a single total: the cost of every chosen candidate, plus the cost each use pays to be derived from its assigned candidate. The lowest-total group wins.
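The scoring rule is easy to model. The sketch below tries every subset of three candidates and keeps the cheapest one that covers all three uses. It is a toy: the real pass is far more elaborate and does not work by exhaustive search, and the struct, the function names and any numbers fed into it are invented for illustration.

```c
#include <assert.h>
#include <limits.h>

/* NO marks "this candidate cannot serve this use". */
enum { NCAND = 3, NUSE = 3, NO = INT_MAX / 4 };

struct model {
    int cand_cost[NCAND];           /* cost of keeping each candidate alive */
    int use_cost[NUSE][NCAND];      /* cost of deriving use u from candidate c */
};

/* Total cost of one candidate set (bitmask), or NO if a use is uncovered. */
int set_cost(const struct model *m, unsigned set)
{
    int total = 0;
    for (int c = 0; c < NCAND; c++)
        if (set & (1u << c))
            total += m->cand_cost[c];
    for (int u = 0; u < NUSE; u++) {
        int best = NO;              /* each use takes its cheapest chosen candidate */
        for (int c = 0; c < NCAND; c++)
            if ((set & (1u << c)) && m->use_cost[u][c] < best)
                best = m->use_cost[u][c];
        if (best == NO)
            return NO;
        total += best;
    }
    return total;
}

/* Bitmask of the lowest-total set; 0 if no set covers every use. */
unsigned cheapest_set(const struct model *m)
{
    unsigned best_set = 0;
    int best = NO;
    for (unsigned set = 1; set < (1u << NCAND); set++) {
        int cost = set_cost(m, set);
        if (cost < best) {
            best = cost;
            best_set = set;
        }
    }
    return best_set;
}
```

Raising the cost of one candidate, for example because it would force a spill, can flip the winner from the three-candidate set to a two-candidate one without touching anything else in the table.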
TARGET_DOLOOP_COST_FOR_GENERIC is the more subtle hook, and it is the one that fixed put_pixel. The catch with the group cost model is that it accounts for shared candidates correctly only if sharing is actually cheaper. With dbra, sharing is not cheaper; it makes dbra impossible. And the group cost does not reflect that on its own.
For put_pixel's inner loop over four bitplanes, IVOPTS had two solutions tied at the same nominal cost:
The 3-IV solution (what we want). A dedicated count-down d2 for the loop exit, a separate d0 for the bit position used by btst, plus the address pointer in a0:
.L553:
    btst %d0,%d1
    jeq .L551
    or.w %d3,(%a0)+
    addq.w #1,%d0
    dbra %d2,.L553
Five instructions in the body. The dbra does the loop control in 12 cycles. The bit position is its own variable.
The 2-IV shared-counter solution (what we got). Reuse d0 for both the bit position and the loop exit. No dedicated counter, one fewer register. But now d0 is read inside the loop body (by btst), so it cannot be the dbra counter: the dbra counter has to be a register used for nothing but the decrement-and-branch. The exit becomes cmp.w #4,%d0; jne:
.L534:
    ...
    addq.w #1,%d0
    cmp.w #4,%d0
    jne .L534
Six instructions in the body, exit costs 24 cycles instead of 12, and we have lost the dbra. IVOPTS thought it was saving a register. It was actually paying every iteration for the privilege.
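For readers who prefer C to assembly, the two shapes correspond roughly to the functions below. This is a hand-written sketch of what each solution computes, not compiler output. The 16-bit element type is an assumption based on the or.w in the listing, the pointer argument stands for &screen[base], and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

enum { PLANES = 4 };

/* 2-IV shape: p is both the bit position and the exit test. */
void planes_shared(uint16_t *screen, unsigned col, uint16_t mask)
{
    for (int p = 0; p < PLANES; p++) {
        if (col & (1u << p))
            *screen++ |= mask;
        else
            *screen++ &= (uint16_t)~mask;
    }
}

/* 3-IV shape: bit position, pointer, and a dedicated count-down counter. */
void planes_dedicated(uint16_t *screen, unsigned col, uint16_t mask)
{
    int bit = 0;                    /* read by the test, never used for the exit */
    int16_t count = PLANES - 1;     /* used for the exit, never read by the body */
    do {
        if (col & (1u << bit))
            *screen++ |= mask;
        else
            *screen++ &= (uint16_t)~mask;
        bit++;
    } while (--count != -1);
    assert(bit == PLANES);          /* both counters stayed in step */
}
```

The two functions produce identical results. The only difference is which variable the exit test hangs on, and that is the entire difference between the two listings.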
TARGET_DOLOOP_COST_FOR_GENERIC returns a penalty (we use COSTS_N_INSNS(2)) that is added to the cost of any candidate which uses the doloop counter for a non-exit purpose. Sharing the dbra IV with btst now costs 38 instead of 34. The 3-IV solution wins. dbra returns. The inner loop is two instructions shorter.
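The tie-break itself is a one-line comparison. The sketch below models only that last step: two solutions with a base total each, and a penalty applied to the one that reuses the doloop counter in the body. The struct and function names are invented, and how the hook's value is scaled into IVOPTS's internal units is deliberately not modeled; the caller passes the penalty as it shows up in the totals.

```c
#include <assert.h>

struct solution {
    int base_cost;        /* total before any doloop penalty */
    int shares_doloop;    /* nonzero if the doloop counter also serves a body use */
};

/* Total after applying the penalty to a sharing solution. */
int total_cost(const struct solution *s, int penalty)
{
    assert(penalty >= 0);
    return s->base_cost + (s->shares_doloop ? penalty : 0);
}

/* Nonzero if the dedicated-counter solution is strictly cheaper. */
int dedicated_wins(const struct solution *dedicated,
                   const struct solution *shared, int penalty)
{
    return total_cost(dedicated, penalty) < total_cost(shared, penalty);
}
```

With both base totals at 34 and no penalty, the comparison is a tie and nothing favors the dedicated counter. A penalty of 4 turns it into 34 against 38.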
Register Pressure: The Tradeoff
There is no free lunch. A dedicated dbra counter costs one data register that could otherwise hold a value. For tight inner loops with small iteration counts, like put_pixel's four-bitplane loop, that register is well spent. For larger loops with many simultaneously-live values, the right answer might genuinely be to share or skip dbra entirely. The hooks do not force dbra; they let IVOPTS make an informed decision.
That is the balancing act. We are not telling GCC "always emit dbra", we are giving it the cost numbers it needs to figure out when dbra is worth it. The C code does not change. The IV candidates do not change. Only the prices on the menu change, and IVOPTS picks again with better information.
- Doloop target hooks in m68k.cc
- Cost calculations in m68k_costs.cc
Next time: results, lessons learned, and a long-overdue look back at working with Claude.





