For one of my homework assignments, I have to solve words encrypted via a substitution cipher. These ciphers were insecure before computers came around, but they are still fun. If you’re unfamiliar with them, you’d probably recognize them as the cryptograms (”Cryptoquote”) in your local newspaper. In the simplest form, each letter is mapped to a different letter of the alphabet. A lot of people do these for fun and I know at least one person reading this does. The result is a run of text that might look like:
ov umy rfgs f nmg
MY DOG EATS A LOT
There are many ways of going about solving substitution ciphers, but a common way is by counting frequencies of characters. As most people know, e tends to occur more than other letters in most written English. The rest of the letters typically follow a pattern, as well, but that pattern degenerates once you leave the most common letters. The domain of the text you are examining is fundamentally important here. By domain I mean whether this text is from a newspaper, an IM, transcribed speech, etc. You can also look at bigrams, two character sequences, to find the most commonly appearing sequences. In English, th appears much more ty, but ty still occurs. When trying to solve substitution ciphers this way, you are essentially matching the frequency distribution of the cipher text to the distribution of English and building a mapping from there. To put that a different way, you are matching up the most common letters or sequences in the garbled text with the most common real English letters or sequences.
Once the frequency counts have revealed the most common letters, many people proceed to deduction to eliminate the rest. Of course, this requires knowledge of English words directly, which has an impact on computational approaches to solving substitution ciphers automatically. I’m curious what approaches people have taken (if any) other than using a dictionary of English words and trying to find matches from there.
So, I’ll talk more about this later, after the homework assignment is due, but there is an interesting connection here with finite state transducers. Here is a little bit of python code to do some simple stuff with substitution ciphers in case you want to play around. I have found python a very handy tool in decrypting these ciphers. Anyone have any other tool they like and want to share?
def decrypt(words, charmap):
# Given the ciphertext in lowercase (words) and a
# mapping of ciphertext to upper case plaintext
# (charmap), return the transformed string
tmp = [" "] * len(words)
for x in xrange(len(words)):
if charmap.has_key(words[x]):
tmp[x] = charmap[words[x]]
else:
tmp[x] = words[x]
return tmp
def encrypt(words, charmap):
# begin by making a reverse map
rev_charmap = dict()
for key in charmap:
rev_charmap[charmap[key]] = key
return decrypt(words, rev_charmap)
def make_random_charmap():
letters = list()
# make the list of uppercase letters
for x in xrange(26):
letters.append(chr(ord('A') + x))
random.shuffle(letters)
charmap = dict()
for x in xrange(26):
charmap[chr(ord('a') + x)] = letters[x]
return charmap
And I’m only semi-fond of this plugin for source code. It does some weird crap and is really temperamental.




2 comments
Comments feed for this article
5 October 2007 at 15:45:38
Jason Adams
For the decrypt function, the implementation is pretty slow. I thought at first it might be faster than the builtin replace function for strings, but it turns out that is blazingly fast. So a better implementation would be to replace the body of the function with:
tmp = words
for key in charmap:
tmp = tmp.replace(key, charmap[key])
return tmp
27 June 2008 at 06:03:19
Cliona Ni Cheallachain
Hi!! I was wondering if you could help me break a code that my friend gave me… For a break from the norm we have been writing code to each other but he has made this one really hard- there may be two codes in there, and he gave it as part of a pearl Jam album so that might be a clue, its driving me mad!: K EPWAK AZC OEE EZWZ OECYA PHT.
K OWBG MCVM HUE OPAH GEFOG OFAT.
EPOAK AZC TBR VSM QBON OQGX.
Isn’t that crazy? And he won’t give me any clues!! I’d really appreciate your help! Thanks, Cliona :)