Here’s some code to find greatest common divisors in Knuth’s MMIX. It’s faster than published implementations I know of and does not use predicated instructions like conditional stores.

I’ve been writing some programs for MMIX, Knuth’s imagined RISC architecture and the successor to the MIX language of TAoCP volumes 1-3. With such a spartan and subtle language it’s probably best to start small, so I’ve been computing greatest common divisors. There are two great implementations in the MMIX supplement to vol. 1-3, a translation of MIX programs to MMIX written by Martin Ruckert. Dr. Ruckert is one of the people responsible for maintaining MMIX at its new home in Munich. This is in itself a fantastic reference of algorithms and implementation tricks.

While I was learning stuff I found a nice way to calculate the GCD in MMIX. You can have a look at the code on github.

## Basic GCD with division

Here’s a basic GCD algorithm using integer division:

```
a IS $0
b IS $1
t IS $2
Gcd PUT rD,0 ; ensure divisor a
1H DIVU t,a,b ; t = a/b
SET a,b ; a = b
GET b,rR ; b = remainder
PBNZ b,1B ; loop if b == 0
POP 1,0 ; else return
```

The first `PUT`

is there to set the high bits of the dividend to 0.
Later we `GET`

the remainder from another special register `rR`

. The
label `1H`

appears as the target of a branch (a “probable” branch
`PBNZ`

) in the form `1B`

; the `B`

is read as “back” so `PBNZ b, 1B`

branches to the first label `1H`

that appears previously. The `POP`

instruction returns to the caller, returning `a`

in a register of its
choice.

This is a nice simple program and there isn’t much to say about it except that integer division takes 60 cycles on MMIX, so even though this program is quite short it takes a long time to run.

Can we do better? Yes, and the basic idea is to use the binary GCD algorithm. Such an algorithm is published in the MMIX supplement together with a hint we’ll use later. First we need a trick.

(The tricks I talk about here are not new and *I* certainly didn’t
come up with them; on the other hand, I don’t know of a good
reference. HAKMEM
and Hacker’s Delight are good
places to look for these kind of tricks.)

## Branchless counting

In the binary GCD algorithm we divide by the largest power of 2
possible. So it’s nice to have a way of determining that for a given
number. Let’s look at 8 bit numbers so we can do examples. How about
`01011000`

, hex 58 or decimal 88. How can we compute the largest
power of 2 dividing 88?

The obvious way is to shift right repeatedly until the number is odd. This works fine, and it’s relatively fast, but suffers from a flaw on architectures with long pipelines: mispredicted branches can have a large time penalty because they must flush the pipeline. This is true for modern x86, ARM, MIPS, Power, and SPARC processors, and the penalty can be substantial, 10 cycles or more.

If we can write more-or-less equivalent code with no branches, we have
a potentially big win. One trick is to compute a mask that has ones
where there are trailing zeros of the number and then use some
intrinsic (like x86’s `popcnt`

) to count them. In our example above,
the mask would be `00000111`

, with 3 zeros, so the largest power is
\(2^3\).

### Shift and OR

So how do we compute such a mask? Here’s one way, starting with the observation that all we really know is that there’s a 1 somewhere in the binary representation of any nonzero number. We shift the number left \(n\) places and OR it with itself repeatedly, starting with \(n=1\) and doubling each step. After one step we have

```
01011000 | 10110000 = 11111000,
```

which is pretty much the answer we wanted (it’s the complement, but
hey, not far off). Supposing we started with a more sparse number
like `00001000`

, we’d do

```
00001000 | (00001000 << 1) = 00011000,
00011000 | (00011000 << 2) = 01111000,
01111000 | (01111000 << 4) = 11111000.
```

For an \(m\) bit number we need \(\log_2 m\) shifts and ORs, not too bad.

### MMIX’s SADD instruction

On MMIX we can do much better than this using the `SADD`

“sideways
add” instruction. `SADD $X, $Y, $Z`

counts the number of bit
positions in which `$Y`

has a 1 while `$Z`

has a 0. If we can arrange
for `$Y`

to hold our mask and `$Z`

to be the number we’re interested
in, this calculates the quantity we’re interested in. We have even
more freedom in choosing our mask, though, because we only need for it
to be different from our number at its trailing zeros. We can
accomplish this by simply subtracting one from our number: with the
example above, we have

```
01011000 - 1 = 01010111,
```

and if we `SADD`

those we get 3. Only two instructions, and no branches!

## Absolute value and minimum

Now we’ve found the largest power of 2 dividing both our input numbers \(u\) and \(v\). The next part of the binary GCD algorithm is to repeatedly subtract the smaller from the larger and divide out any powers of 2 as we did before.

Finding the absolute value and/or swapping would seem to require a branch, but again, we can do better on MMIX. There are special predicated instructions for setting one register to another dependent on the value of a third register. In fact, the Ruckert translation of Knuth’s original MIX program uses these conditional instructions to swap \(u\) and \(v\), if needed, so that \(u<v\).

However, predicated instructions are a mixed bag. One reason is that they introduce dependencies which can limit superscalar processors’ freedom in scheduling. (The 64 bit ARM architecture does not support predication at all, whereas its predecessors could conditionally execute almost every instruction.) Can we find a way to write the code without any conditional instructions?

Yes! MMIX’s `ODIF`

instruction implements “saturated subtraction” on
unsigned 64 bit quantities:

\[ u ∸ v = \max\{u-v, 0\}. \]

Using this, we can easily compute

\[ |v - u| = (u ∸ v) + (v ∸ u) \]

and

\[ \min\{u, v\} = u - (u ∸ v). \]

These two formulas allow us to completely avoid branches *and*
conditional instructions in the body of our binary GCD algorithm.

## Putting it together

Here’s the finished product. The first few lines are just aliases for
registers. The subroutine is named `GcdRsb`

and starts on that line.
`D1`

and `D2`

are labels used as jump targets.

```
u IS $0
v IS $1
t IS $2
p IS $3
q IS $4
r IS $5
GcdRsb OR t,u,v ; largest power of 2
SUBU k,t,1 ; dividing both u and v
SADD k,k,t
SRU u,u,k
SRU v,v,k
D1 ODIF p,u,v
ODIF q,v,u
ADDU q,p,q ; q = abs(u-v)
BZ q,D2
SUBU v,u,p ; v = min(u,v)
SUBU p,q,1 ; p = q - 1
SADD p,p,q ; p = k, 2^k | p
SRU u,q,p ; u = q >> k
JMP D1
D2 SLU u,u,k ; restore powers of 2
POP 1,0
```

This version outperforms the binary GCD implementation published in
the MMIX supplement by around 10% on some simple tests I did. More
importantly, it does *much* less branching. On a modern computer with
large branch misprediction penalties, that can make a big difference.

## Conclusion

I wrote a fast, cool GCD on MMIX. It doesn’t use any predicated instructions, and has only a single conditional branch. The arithmetic and bit manipulation tricks are worth keeping in mind for other applications. The secret ingerdient, as always, is love.

Approx. 1302 words, plus code