For one of my homework assignments, I have to solve words encrypted via a substitution cipher. These ciphers were insecure before computers came around, but they are still fun. If you’re unfamiliar with them, you’d probably recognize them as the cryptograms (“Cryptoquote”) in your local newspaper. In the simplest form, each letter is mapped to a different letter of the alphabet. A lot of people do these for fun and I know at least one person reading this does. The result is a run of text that might look like:
ov umy rfgs f nmg
MY DOG EATS A LOT
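To make the mapping concrete, here is a minimal Python sketch of how a cipher like that could be applied. The function names and the choice of a random mapping are just mine for illustration, not anything from the assignment:

```python
import random
import string

def make_key():
    """Build a random substitution key: each letter maps to some other letter."""
    shuffled = list(string.ascii_lowercase)
    random.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))

def encrypt(plaintext, key):
    """Substitute letter by letter, leaving spaces and punctuation alone."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())

key = make_key()
print(encrypt("my dog eats a lot", key))
```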
There are many ways of going about solving substitution ciphers, but a common one is counting character frequencies. As most people know, e tends to occur more often than any other letter in most written English. The rest of the letters typically follow a pattern as well, but that pattern degenerates once you get past the most common letters. The domain of the text you are examining is fundamentally important here; by domain I mean whether the text comes from a newspaper, an IM conversation, transcribed speech, etc. You can also look at bigrams, two-character sequences, to find the most commonly appearing pairs. In English, th appears much more often than ty, but ty still occurs. When solving substitution ciphers this way, you are essentially matching the frequency distribution of the ciphertext to the distribution of English and building a mapping from there. Put differently, you are matching up the most common letters or sequences in the garbled text with the most common real English letters or sequences.
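As a rough illustration of the counting step, here is one way it might look in Python. The counts would then be lined up against published English letter and bigram frequencies; the function names are just placeholders of mine:

```python
from collections import Counter

def letter_frequencies(ciphertext):
    """Count single letters, ignoring anything that isn't a letter."""
    return Counter(ch for ch in ciphertext.lower() if ch.isalpha())

def bigram_frequencies(ciphertext):
    """Count adjacent letter pairs within each word."""
    counts = Counter()
    for word in ciphertext.lower().split():
        word = "".join(ch for ch in word if ch.isalpha())
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

cipher = "ov umy rfgs f nmg"
print(letter_frequencies(cipher).most_common())
print(bigram_frequencies(cipher).most_common())
```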
Once the frequency counts have revealed the most common letters, many people proceed by deduction to pin down the rest. Of course, that requires direct knowledge of English words, which complicates computational approaches to solving substitution ciphers automatically. I'm curious what approaches people have taken (if any) other than using a dictionary of English words and trying to find matches from there.
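For what it's worth, the dictionary approach I had in mind looks something like this: compare each cipher word against English words with the same letter-repetition pattern, since a consistent substitution preserves that pattern. This is only a sketch, and words.txt is a stand-in for whatever word list you happen to have:

```python
def pattern(word):
    """Canonical letter-repetition pattern: 'loll' and 'sell' both give (0, 1, 2, 2)."""
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word)

def candidates(cipher_word, dictionary):
    """English words whose repetition pattern matches the cipher word's."""
    target = pattern(cipher_word)
    return [w for w in dictionary if pattern(w) == target]

# 'words.txt' is a placeholder path, one lowercase word per line.
with open("words.txt") as f:
    dictionary = [line.strip().lower() for line in f if line.strip().isalpha()]

print(candidates("rfgs", dictionary))
```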