In my last podcast episode, my quick-and-dirty algorithm for cracking a bunch of substitution cyphers made good work of all the encoded messages it was given - all except for one.
This last message goes as follows:BYPPCVACMZQVYAP
I tried tackling it by hand to no avail - with the repeated letters in the 3rd and 4th position, I could guess that the first word is "HELLO" or something like that. But I couldn't make it fit with the rest of the message!
It's very difficult to crack substitution cyphers on small messages because they contain less information, and several different substitutions of letters can make sense.
For example, consider this four-letter substitution cypher:TBBO
The only information we have about this word is that it has 3 unique letters, with the two letters in the middle repeating. It could be LOOK, BOOK, BEEP, BOOM, ROOM, or any such word like that. In this case, there's no hope in pinning down the message directly, unless we could gain more information about the word in question.
For the 15 letter message above, I think there is hope. I'm going to try a couple of ways to get ahead of it, and I'll post this to the github for my project.
1) Improve the language model.
I noticed that for short messages, it can find other messages with the same substitution pattern with lower entropy. In other words, the language model doesn't necessarily pick out the best one due to the simplicity of our language models compared to real natural language. This isn't as much of a problem with larger texts because it has so much more information to go on in separating out "real" from "fake" English.
My first thought is to move from bag-of-words to a bigram model, and work out from there.
I should also keep track of several top messages, to make sure that the real message doesn't wind up in space 2 or 3.
2) Count every possibility.
In the other cases, it wasn't possible to check every possible letter-and-space substitution because that meant checking 27! possibilities. That's a big number!
That's around 10^28. I said on the show that it has "28 zeros" as a shorthand for this order of magnitude, but of course I realized that I misspoke since it's actually not 28 zeros but 28 digits. I think most people got the point!
Anyway, in the case above, only 9 unique letters are used: ABCMPQVYZ. This means that there are far fewer substitutions to check. After selecting on of the letters as the space, there are only 8 substitutions to make! Let's do the math on that:
27! / (27-8)! = 27! / 19! = 27 * 26 * 25 * 24 * 23 * 22 * 21 * 20= 89,513,424,000
At around 89 billion possibilities that no small iteration even for a computer. And no way I'm going spin up an Amazon cluster for this thing. BUT! This can be whittled down further with some educated guesses about which letters could arise, making exhaustive search reasonable.
If you crack this one before me, either email the show firstname.lastname@example.org, or message me on Twitter. I'll give you a shout out in the next episode!