Software! Math! Data! The blog of R. Sean Bowman
The blog of R. Sean Bowman
January 12 2017

Here’s some code to find greatest common divisors in Knuth’s MMIX. It’s faster than published implementations I know of and does not use predicated instructions like conditional stores.

I’ve been writing some programs for MMIX, Knuth’s imagined RISC architecture and the successor to the MIX language of TAoCP volumes 1-3. With such a spartan and subtle language it’s probably best to start small, so I’ve been computing greatest common divisors. There are two great implementations in the MMIX supplement to vol. 1-3, a translation of MIX programs to MMIX written by Martin Ruckert. Dr. Ruckert is one of the people responsible for maintaining MMIX at its new home in Munich. This is in itself a fantastic reference of algorithms and implementation tricks.

While I was learning stuff I found a nice way to calculate the GCD in MMIX. You can have a look at the code on github.

Basic GCD with division

Here’s a basic GCD algorithm using integer division:

a       IS $0
b       IS $1
t       IS $2

Gcd     PUT   rD,0    ; ensure divisor a
1H      DIVU  t,a,b   ; t = a/b
        SET   a,b     ; a = b
        GET   b,rR    ; b = remainder
        PBNZ  b,1B    ; loop if b == 0
        POP   1,0     ; else return

The first PUT is there to set the high bits of the dividend to 0. Later we GET the remainder from another special register rR. The label 1H appears as the target of a branch (a “probable” branch PBNZ) in the form 1B; the B is read as “back” so PBNZ b, 1B branches to the first label 1H that appears previously. The POP instruction returns to the caller, returning a in a register of its choice.

This is a nice simple program and there isn’t much to say about it except that integer division takes 60 cycles on MMIX, so even though this program is quite short it takes a long time to run.

Can we do better? Yes, and the basic idea is to use the binary GCD algorithm. Such an algorithm is published in the MMIX supplement together with a hint we’ll use later. First we need a trick.

(The tricks I talk about here are not new and I certainly didn’t come up with them; on the other hand, I don’t know of a good reference. HAKMEM and Hacker’s Delight are good places to look for these kind of tricks.)

Branchless counting

In the binary GCD algorithm we divide by the largest power of 2 possible. So it’s nice to have a way of determining that for a given number. Let’s look at 8 bit numbers so we can do examples. How about 01011000, hex 58 or decimal 88. How can we compute the largest power of 2 dividing 88?

The obvious way is to shift right repeatedly until the number is odd. This works fine, and it’s relatively fast, but suffers from a flaw on architectures with long pipelines: mispredicted branches can have a large time penalty because they must flush the pipeline. This is true for modern x86, ARM, MIPS, Power, and SPARC processors, and the penalty can be substantial, 10 cycles or more.

If we can write more-or-less equivalent code with no branches, we have a potentially big win. One trick is to compute a mask that has ones where there are trailing zeros of the number and then use some intrinsic (like x86’s popcnt) to count them. In our example above, the mask would be 00000111, with 3 zeros, so the largest power is \(2^3\).

Shift and OR

So how do we compute such a mask? Here’s one way, starting with the observation that all we really know is that there’s a 1 somewhere in the binary representation of any nonzero number. We shift the number left \(n\) places and OR it with itself repeatedly, starting with \(n=1\) and doubling each step. After one step we have

01011000 | 10110000 = 11111000,

which is pretty much the answer we wanted (it’s the complement, but hey, not far off). Supposing we started with a more sparse number like 00001000, we’d do

00001000 | (00001000 << 1) = 00011000,
00011000 | (00011000 << 2) = 01111000,
01111000 | (01111000 << 4) = 11111000.

For an \(m\) bit number we need \(\log_2 m\) shifts and ORs, not too bad.

MMIX’s SADD instruction

On MMIX we can do much better than this using the SADD “sideways add” instruction. SADD $X, $Y, $Z counts the number of bit positions in which $Y has a 1 while $Z has a 0. If we can arrange for $Y to hold our mask and $Z to be the number we’re interested in, this calculates the quantity we’re interested in. We have even more freedom in choosing our mask, though, because we only need for it to be different from our number at its trailing zeros. We can accomplish this by simply subtracting one from our number: with the example above, we have

01011000 - 1 = 01010111,

and if we SADD those we get 3. Only two instructions, and no branches!

Absolute value and minimum

Now we’ve found the largest power of 2 dividing both our input numbers \(u\) and \(v\). The next part of the binary GCD algorithm is to repeatedly subtract the smaller from the larger and divide out any powers of 2 as we did before.

Finding the absolute value and/or swapping would seem to require a branch, but again, we can do better on MMIX. There are special predicated instructions for setting one register to another dependent on the value of a third register. In fact, the Ruckert translation of Knuth’s original MIX program uses these conditional instructions to swap \(u\) and \(v\), if needed, so that \(u<v\).

However, predicated instructions are a mixed bag. One reason is that they introduce dependencies which can limit superscalar processors’ freedom in scheduling. (The 64 bit ARM architecture does not support predication at all, whereas its predecessors could conditionally execute almost every instruction.) Can we find a way to write the code without any conditional instructions?

Yes! MMIX’s ODIF instruction implements “saturated subtraction” on unsigned 64 bit quantities:

\[ u ∸ v = \max\{u-v, 0\}. \]

Using this, we can easily compute

\[ |v - u| = (u ∸ v) + (v ∸ u) \]

and

\[ \min\{u, v\} = u - (u ∸ v). \]

These two formulas allow us to completely avoid branches and conditional instructions in the body of our binary GCD algorithm.

Putting it together

Here’s the finished product. The first few lines are just aliases for registers. The subroutine is named GcdRsb and starts on that line. D1 and D2 are labels used as jump targets.

u       IS $0
v       IS $1
t       IS $2
p       IS $3
q       IS $4
r       IS $5

GcdRsb  OR    t,u,v  ; largest power of 2 
        SUBU  k,t,1  ; dividing both u and v
        SADD  k,k,t
        SRU   u,u,k
        SRU   v,v,k

D1      ODIF  p,u,v
        ODIF  q,v,u
        ADDU  q,p,q  ; q = abs(u-v)
        BZ    q,D2
        SUBU  v,u,p  ; v = min(u,v)
        SUBU  p,q,1  ; p = q - 1
        SADD  p,p,q  ; p = k, 2^k | p
        SRU   u,q,p  ; u = q >> k
        JMP   D1
D2      SLU   u,u,k  ; restore powers of 2
        POP   1,0

This version outperforms the binary GCD implementation published in the MMIX supplement by around 10% on some simple tests I did. More importantly, it does much less branching. On a modern computer with large branch misprediction penalties, that can make a big difference.

Conclusion

I wrote a fast, cool GCD on MMIX. It doesn’t use any predicated instructions, and has only a single conditional branch. The arithmetic and bit manipulation tricks are worth keeping in mind for other applications. The secret ingerdient, as always, is love.

Approx. 1302 words, plus code