Hangman, entropy, and the robustness of stupid
One of the "procrastination tools" on Critique Circle is a little game of hangman, and I sure sank a lot of time guessing (and sometimes failing to guess) words there. The rules are fairly simple and you probably know them. You start by knowing how many letters the word has. You guess a letter; if the guess is right, you get to see where the letter appears in the word. If you guess wrong, something is added to the hangman drawing, and if the drawing is completed because you make too many wrong guesses, you lose. If instead you find all the letters before that happens, you win.
Simple, fun game, but it got me thinking that it's a neat illustration of the concept of information-theoretic entropy.
Let's imagine, then, that we have a list of all possible words, and we know (this is VERY important) that the word we have to guess is chosen from this list AND that all words are equally likely.
Let's ignore the rules of hangman for a while, and ask how we might select a word from the list. We need to specify it somehow, and the most compact way to do this is by giving its position (its index) in the list. This is a number, and we have no trouble thinking of numbers as strings of "bits", but nonetheless it pays, imo, to look at what bits actually mean.
So we're not given the index, at first, but we are allowed to ask "yes/no" questions about it. We know the list has, for example, 1024 entries. We might ask: is the index among the numbers 513..1024? The answer splits the list into two equal parts, only one of which contains our word. That is what a bit does: when you learn a new bit of information, the space of possibilities you must consider is halved.
There's something to pay attention to here: that halving of possibilities occurs only if we ask the right questions, and only if we ask new questions to which we don't already know the answer. As a result, what's usually called an "[information] bit" in everyday language doesn't necessarily correspond to the information-theoretic bit, which is the piece of information that lets you halve the number of possibilities.
Anyway, to locate a word in a list of 1024, we need to keep halving it 10 times; we need 10 bits.
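Here's a minimal sketch (in Python) of that question-asking process, with a made-up secret index just for illustration; each yes/no answer halves the remaining range, so ten answers pin down one index out of 1024.

    # A hypothetical secret index in a 1024-entry list, just for illustration.
    secret = 700
    lo, hi = 0, 1024            # the index lies somewhere in [lo, hi)
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if secret >= mid:       # "is the index in the upper half?"
            lo = mid
        else:
            hi = mid
        questions += 1
    print(questions)            # always 10, whatever the secret index is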
Unfortunately when playing hangman, you're not allowed those kinds of questions. Instead, you're allowed to guess letters. This changes things slightly.
As an example, suppose the list of all possible four-letter words is:
abcd
abef
aebh
adbc
So the hangman game starts with four unknown letters ----. What should you guess?
"a" seems a safe guess, but it also has a curious property: it tells you nothing about which word is the chosen one. For, if you guess a, the gamesmaster just updates the word to a--- which fits any on the list. You didn't reduce the list in any way.
Suppose instead you guessed h. Now, it depends. Suppose the gamesmaster reveals you guessed correctly, so the hidden word is now ---h. Well, there's only one word on the list that fits it, so from a list of 4 words, you reduced to a single word: you gained 2 bits, and also found your word (aebh).
But it could also be the case that the gamemaster reveals you guessed wrong. So now you know there's no h in the word, and this reduces your list from 4 to 3 possible words. You didn't quite gain a bit of info, but still you gained something.
Guessing "h" is a bit of a high-risk, high-reward choice. There's a chance you gain 2 bits (one in four, for this example: you gain 2 bits if the word happens to be the one with h in it), but more likely (three in four) you gain a little under 1 bit.
Choosing something like c (or d) has more possible outcomes. If the gamesmaster reveals there's no c, then you know the only possible words are the 2 that do not contain c (so you gain 1 bit, and the chance this will happen is 2 in 4). If the gamesmaster reveals a c at the end, then you know the word (so you gain 2 bits, and the chance this will happen is 1 in 4); and if there's a c in the third position, you also know the word (gain 2 bits, chance is 1 in 4).
So it seems that if you guess a letter you're doing something like gambling: specifically, you gamble for bits. The more bits you gain, the quicker you can reduce the list of possibilities and pinpoint the word. Each guess has several possible outcomes, each outcome brings with it a gain of bits, and each outcome has a certain chance of happening. So, like a typical gambling strategy, you might want to choose which letter to guess based on which letter has the greatest expected value of gained bits: that is, for each candidate letter, sum the probabilities of its outcomes, each multiplied by the bits gained if that outcome happens (p1*n1 + p2*n2 + ...), and pick the letter for which that sum is largest.
And it's not too difficult to see that this expected value of bits gained is exactly the information-theoretic entropy of the outcomes. The number of bits you gain if an outcome happens is simply the (base 2) logarithm of the number of possibilities before the guess, minus the (base 2) logarithm of the number of possibilities that remain once you know the outcome.
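Here's a small sketch of that calculation in Python, under the same assumption that all remaining words are equally likely: group the words by the pattern the gamesmaster would reveal for a given letter (including the "no occurrence" pattern), and the expected number of bits gained is the Shannon entropy of that grouping. On the toy list above it reproduces the numbers worked out by hand.

    from collections import Counter
    from math import log2

    def expected_bits(words, letter):
        """Expected bits gained by guessing `letter`, assuming all words are
        equally likely: the Shannon entropy of the possible revealed patterns."""
        n = len(words)
        # Each outcome is the set of positions the gamesmaster would reveal.
        outcomes = Counter(
            tuple(i for i, ch in enumerate(w) if ch == letter) for w in words
        )
        return sum(-(k / n) * log2(k / n) for k in outcomes.values())

    words = ["abcd", "abef", "aebh", "adbc"]
    print(expected_bits(words, "a"))  # 0.0   -- reveals nothing
    print(expected_bits(words, "h"))  # ~0.81 -- 1/4 * 2 + 3/4 * 0.415
    print(expected_bits(words, "c"))  # 1.5   -- 2/4 * 1 + 1/4 * 2 + 1/4 * 2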
It's not too difficult to code a proper hangman solver this way; I just took the Unix spellcheck list and applied the strategy above to guess words in hangman games. It's very effective. But it does raise a question: can it find all words?
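For the curious, here's roughly what such a solver can look like. This is a sketch rather than my actual code: it reuses the expected_bits function from the previous snippet, assumes the secret word is in the list (the Unix word list usually lives at /usr/share/dict/words), and only tracks wrong guesses.

    import string

    def best_guess(words, guessed):
        # Unguessed letter with the greatest expected information gain.
        return max(
            (ch for ch in string.ascii_lowercase if ch not in guessed),
            key=lambda ch: expected_bits(words, ch),
        )

    def matches(word, letter, positions):
        # Is `word` consistent with the gamesmaster's answer for `letter`?
        return positions == tuple(i for i, ch in enumerate(word) if ch == letter)

    def play(secret, word_list):
        """Play one game against `secret`; return the number of wrong guesses."""
        candidates = [w for w in word_list if len(w) == len(secret)]
        guessed, wrong = set(), 0
        while len(candidates) > 1:
            letter = best_guess(candidates, guessed)
            guessed.add(letter)
            revealed = tuple(i for i, ch in enumerate(secret) if ch == letter)
            if not revealed:
                wrong += 1
            candidates = [w for w in candidates if matches(w, letter, revealed)]
        return wrong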
The algorithm is fairly deterministic, so it is possible to predict what it will do in every possible hangman game. True, if two guesses share the same entropy, the program picks one of them at random, so there's that bit of randomness, but it's easy to take care of: whenever such a choice appears, "split" the simulation into one branch per tied choice, rather than pick only one at random. That way I can compute the probability of any possible branch of a hangman game, and therefore the chance that my program will guess a given word.
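A sketch of that branch-splitting idea, building on the functions above: instead of breaking entropy ties at random, recurse into every tied choice and average the results, which yields the probability that the solver finds a given word within the allowed number of wrong guesses (six here, as an assumption; the real limit depends on the drawing).

    def win_probability(secret, word_list, max_wrong=6):
        """Probability that the entropy strategy finds `secret`, exploring
        every tie among maximum-entropy guesses instead of picking one."""
        def explore(candidates, guessed, wrong):
            if wrong >= max_wrong:
                return 0.0                      # drawing complete: a loss
            if len(candidates) == 1:
                return 1.0                      # only the secret is left: a win
            unguessed = [c for c in string.ascii_lowercase if c not in guessed]
            gains = {c: expected_bits(candidates, c) for c in unguessed}
            best = max(gains.values())
            ties = [c for c in unguessed if abs(gains[c] - best) < 1e-12]
            total = 0.0
            for letter in ties:                 # each tied letter equally likely
                revealed = tuple(i for i, ch in enumerate(secret) if ch == letter)
                remaining = [w for w in candidates if matches(w, letter, revealed)]
                total += explore(remaining, guessed | {letter},
                                 wrong + (0 if revealed else 1))
            return total / len(ties)
        return explore([w for w in word_list if len(w) == len(secret)], set(), 0)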
Unsurprisingly, most words are found with probability 1. But not all of them. In particular, there are words that the algorithm will never guess; "ably" is one of them. What happens is, the program tries letters to maximize its gained bits, but those letters just happen not to zoom in on this word quickly enough. For example, "e" is the first guess, and while the program learns the word doesn't have "e" in it, that still leaves many words possible.
The fact that there are words that are impossible to find has an interesting consequence. Though my program has, "in general", a good chance of solving a hangman game, you can make sure it never wins just by giving it one of these impossible words. That's where the original assumption is important: my program is a good strategy only as long as all words in the list are equally likely.
Consider instead a different hangman solver "strategy": just choose which letter to guess at random, with all letters having an equal chance of being picked. This is a very poor strategy, and its win rate will in general be very low. However, it is so stupid it is impossible to fool. There are no tricky words that are impossible for it to find. True, it has a very low chance of finding any particular word, but that chance is never zero. There's no way to significantly affect its win rate one way or the other. You can probably nudge it up or down depending on which word you choose to give it (I didn't check this), but it doesn't seem like the influence would be great; certainly there's no way to guarantee success or failure.
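For contrast, the whole "strategy" fits in a couple of lines; a sketch:

    import random
    import string

    def random_guess(guessed):
        # The unexploitable-but-weak strategy: any unguessed letter, at random.
        return random.choice([c for c in string.ascii_lowercase if c not in guessed])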
This, incidentally, has a more general application in game theory: randomized behavior. The idea is that if there are patterns, they can be exploited. Imagine a goalkeeper who always jumped to the right when defending a penalty kick. If a player knows this, they'll always kick the ball to the left. So a good goalkeeper will try to jump at random, and not give any indication of which direction they're preparing for.
Randomized strategies are a rich subject, and they can certainly be made more clever than the silly hangman solver I described above that just picks letters at random. But for today, that's enough. Hangman may be a simple game, but even so, it contains some interesting ideas from information and game theory.