## Wednesday, February 9, 2022

### Game Boy Fiver [Wordle clone]: How to compress 12972 five-letter words to 17871 bytes

Update: You can play an updated version online here in the binjgb Game Boy emulator. This is the version with the frequency-based answer list rather than the official Wordle list, for copyright reasons.

There is a Game Boy version of Wordle, using a bloom filter, a reduced vocabulary and a reduced list of guess words, all fitting on one 32K cartridge. I decided to challenge myself and see if I could fit in the whole 12972 word Wordle vocabulary, with the whole 2315 word answer list. So the challenge is:

• Compress 12972 five-letter words (Vocabulary)

• Compress a distinguished 2315 word subset (Answers).

I managed it (download ROM here), and it works in a Game Boy emulator. There is more than one way, and what I did may be excessively complicated, but I don’t have a good feel for how fast the Game Boy runs, so I did a bit of speed optimization.

Step 0: We start with 12972 × 5 = 64860 bytes of uncompressed data.

Step 1: Divide the 12972 word list into 26 lists, based on the first letter of the word. Since in each list, the first letter is the same, we now need only store four letters per word, along with some overhead for each list. (The overhead in the final analysis will be 108 bytes.) If we stop here,

Step 2: Each four letter “word” (or tail of a word) can be stored with 5 bits per letter, thereby yielding a 20 bit unsigned integer. If we stop here, we can store each word in 2.5 bytes, for a total of 32430. That would fit on the cartridge if there was no code, but it is some progress.

Step 3: Here was my one clever idea. Each of the lists of four letter “words”, is in alphabetical order, and encoded the natural way as 20 bit numbers, the numbers will be in ascending order. Instead of storing these numbers, we need only store their arithmetical differences, starting with an initial (invalid) 0.

Step 4: Since the differences are always at least 1, we can subtract one from each difference to make the numbers slightly smaller. (This is a needless complication, but I had it, and don’t feel like removing it.)

Step 5: Store a stream of bytes encoding the difference-minus-ones. Each number is encoded as one, two or three bytes, seven-bits in each byte, with the high bit of each byte being 1 if it’s the last 7-bit sequence and 0 if it’s not. It turns out that the result is 17763 bytes, plus 108 bytes of overhead, for a total of 17871 bytes, or 28% of the original list, with very, very simple decompression.

Step 6: Now we replace each word in the alphabetically-sorted Answers list with an index into the vocabulary list. Since each index fits into 2 bytes, this would let us store the 2315 words of the Answers as 2315 × 2 = 4630 bytes.

Step 7: However, it turns out that the difference between two successive indexes is never bigger than 62. So we can re-use the trick of storing successive differences, and store the Answers in 2315 bytes. (In fact, since we only need 6 bits for the differences, we could go down to 1737 bytes, but it would complicate the code significantly.)

Result: Vocabulary plus Answers goes down to 108+17763+2315=20186 bytes. This was too big to fit on a 32K cartridge using the existing code. But it turns out that most of the existing code was library support code for `gprintf()`, and replacing the single `gprintf()` call, which was just being used to format a string containing a single-digit integer variable, with `gprint()`, seemed to get everything to fit in 32K.

Example of the Vocabulary compression:

• The first six words are: aahed, aalii, aargh, aarti, abaca, abaci.

• Dropping the initial “a”, we get ahed, alii, argh, arti, baca, baci.

• Encoding as 20-bit integers and adding an initial zero, we get 0, 7299, 11528, 17607, 18024, 32832, 32840.

• The differences-minus-one are 7298, 4228, 6078, 416, 14807, 7.

• Each of these fits in 14-bits (two bytes, given the high-bit usage), with the last one in 7-bits. In practice, there are a lot of differences that fit in 7-bits, so this ends up being more efficient than it looks—the first six words are not representative.

Notes:

• With the code as described above, there are 250 bytes to spare in the cartridge.

• One might wonder whether making up the compression algorithm saves much memory over using a standard general purpose compressor. Yes. `gzip` run on the 64860 bytes of uncompressed Vocabulary yields 30338 bytes, which is rather worse than my 17871 byte compression of the Vocabulary. Plus the decompression code would, I expect, be quite a bit more complex.

• One could save a little memory by encoding the four-letter “words” in Step 2 in base-26 instead of four 5-bit sequences. But it would save only about 0.5K of memory, and the code would be much nastier (the Game Boy uses library functions for division!).

• The Answers could be stored as a bitmap of length 12972, which would be 1622 bytes. But this would make the code for generating a random word more complicated and slower.

Alexander R Pruss said...

Looks like I'm not the first one to have done this, though probably my algorithm is different: https://github.com/stacksmashing/gb-wordle/pull/8

Alexander R Pruss said...

The 2315 Answer word list looks like it may have had some creative selection behind it, so there could be copyright problems, and thus I replaced it with a subset of the Vocabulary generated using a frequency-sorted corpus of English. I also changed the name of the ROM. :-)

Alexander R Pruss said...

Interestingly one saves about 0.5K by reversing the letters in the words (and resorting), and almost 1K by combining that with the base-26 encoding. Not sure there is much of a need to do this.

Unknown said...

I ran it through a simple arithmetic compressor and reduced it to 4368 bytes total. I used a compressor I wrote at https://github.com/ChrisLomont/ArithmeticCompression

This was with a zero order model. An order 1 or 2 model should compress a lot more.

Chris Lomont

Alexander R Pruss said...

Nice!

I don't know much about arithmetic compression. What kind of memory and time usage could one expect for decompression? Decompression needs to run on a 4MHz device with 8K RAM whose CPU has no multiplication or division primitives.

Alexander R Pruss said...

I added a link to a version playable in the browser in a Game Boy emulator to the post. That version uses my own list of answer words, and I did eventually move to a bitmap for storing the answer words.

pauli said...

Did you try a TRIE representation? Or more specifically a TRIE reduced to a DAWG? I'd expect to see a similar level of compress (if not better) and very rapid access. Because all words are five letters in length, there should be some additional compression possible -- no need to store which sequences are words e.g.

RW said...

When using a 6-bit byte instead of a 7-bit one, I got a size of 16686.75 bytes. I didn't use your method from step 1-2, it seems to have almost no effect on the result size in my case.

Tero Keski-Valkama said...

It seems to me the main bottleneck here is the number of bits needed to encode the difference from the previous 4-letter postfix word to the next one when the words are alphabetized. It might be that alphabetic order is bad because it has strong correlations, so the small differences accumulate to the common symbol regions, and large differences require reserving more bits for all differences.

It might be that simply creating a random XOR mask across 4-letters and then using a numerical ordering for the XOR'ed postfix representations would make the sequential numerical differences between words more uniformly distributed and thus the worst case bits needed lower, thus lowering the total required space needed.

Adding more words to the vocabulary can also reduce the total space needed if you can get up to a sweet spot where the number of bits needed for the differences steps down for that number of words in the vocabulary.

Tero Keski-Valkama said...
This comment has been removed by the author.
RW said...

Now I got 16197 bytes with using 6-bit byte and conversation words to base-26 numbers

Blair said...

https://news.ycombinator.com/item?id=30402407

Alexander R Pruss said...

RW:

What do you mean by "6-bit byte"? Is that for the differences?

A down-side of base-26 is that you need to import 32-bit multiply, divide and remainder library routines. Maybe you can get enough saving to make that worth it, but just barely.

BTW, I've rewritten the slowest part of the decoding in assembly in the latest version.

RW said...

Yes, for differences. I do almost like you:

1. Get numbers for each word (it takes 25 bits = 5 characters * 5 bits per char)
Way of getting this number is bit different in cases when I use "5-bits bytes" concatenation or base26.

2. Get differences and subtract one (like you do).

3. Convert differences to binary, split it to 5-bits chunks and add 6th byte to have possibility to mark last chunk of sequence. So for each word I have sequence of "6-bits bytes", it may contains 1 or more such of bytes (related to size of specific difference)

Why did you use "7-bits byte"? I think, I think, 6 bits is enough :)

Yes, I got why you did not use base26, I did it just for test, how many it will save. It saved about 500Kb (like in your case)

Jay said...

Oook you nerd sniped me...

Here's me trying two different techniques, but the one I got lower was:
https://github.com/justecorruptio/ardle/blob/main/grey_compress.py

Basically, partition the full word list into sequences of words that differ by one letter, eg: RODEO->ROMEO->RONEO->RONDO->CONDO->KONDO->KENDO->ZENDO.

For all the single words that do not fit into such a sequence, encode with your delta scheme.

For these sequences, encode the list of initial words with your delta scheme. To encode the sequences themselves, we need the letter position and the rotations from the existing letter, eg: ROMEO->RONEO = 3rd position, 1 rotation. This is 5 * 25 possible values, so it fits in 7 bits. We can then use the 8th bit of a byte to denote the end of a sequence, thus each word in a sequence after the first takes exactly one byte each.

Using this method, I'm able take it down to 15467B for the full list. (It can be a few bytes smaller if you reverse some of the sequences, but didn't bother optimizing that).

Alexander R Pruss said...

RW:

I think using chunks smaller than a byte will mean that decoding will have to use a bit-shift by a variable amount. The GameBoy CPU does not have an opcode for that, so such a shift would have to be implemented as a loop of shifts. This will likely slow things down. The same issue applies to Huffman coding and the like.

Does the speed matter? Yes: otherwise there is an annoying delay when you enter a word and it's being checked.

BTW, another way to save memory is in the bitmap encoding of the answers. Not all the bit patterns appear in an actual byte in the encoding. The unused byte patterns could be used to encode run lengths of zeroes. I don't remember how much that would save. Maybe about 200 bytes, but of course you need to subtract the extra decoding code.

Alexander R Pruss said...

Jay:

That's clever, but I think the speed of word checking might suffer, because you'd have to go through every one of the sequences to check the word, while my scheme requires one to only go through words that have the given starting letter.

Moreover, to fit 5*25 in seven bits, decoding would require either a division and a modulo operation, which are not available on the CPU and hence would pull in library code, or else a 125-byte table lookup (splitting the position/rotation data into 3+5 bit bytes). The second could probably be pretty fast, though.

Alexander R Pruss said...

pauli:

I have never used tries or dawgs before. It might be worth trying. One complication is that nodes would probably not be an integral number of bits, and so decoding would require a shift by a variable amount. That would be slow.

Jay said...

Ah yeah the speed of lookups would definitely be bad.

as for the divmod of by 25 though, we can use modular arithmetic here:
for value in the range [0-127]

d = ((x << 5) + (x << 3) + x) >> 10
m = x - (d << 4) - (d << 3) - d

and it looks like this all fits in 16 bit math. Not sure this is actually faster than repeated subtraction haha.

Alexander R Pruss said...

The processor only has shifts of 8-bit values, so a 16-bit value shift would actually be two shifts with carry.

There is 16-bit addition, which would help for d, but no 16-bit subtraction for m (but only one 16-bit subtraction would be needed, of course).

One could do some optimization, though. The right shift by 10 could be done by just taking the high byte and right shifting it by 3.

The table lookup would be a lot faster, I expect.

Alexander R Pruss said...

FYI, here are the opcodes: http://www.devrs.com/gb/files/opcodes.html

Alexander R Pruss said...

Tero Keski-Valkama:

The XOR suggestion is clever. I tried to find the optimal XOR value.

For 5-bit encoding with words reversed, it saves 123 bytes by xor'ing with 0x02.

For 5-bit encoding with words in regular order, it saves 58 bytes by xor'ing with 0x1D.

The time cost is minimal. I am going to consider it.

Tero Keski-Valkama said...

It might be better to XOR with a whole word mask, because that permutates the order of words when ordered in XOR'ed form, and thus can distribute the deltas more uniformly. Using a single byte XOR mask doesn't properly shuffle the permutations yet.

z-10 said...

Hi Alex,