Multi-armed Bayesian Bandits

Imagine that you are presented with a row of slot machines, each configured to win at a different (but unknown to you) rate. How do you discover which machines have the best returns, while minimizing the amount of money wasted playing lower-performing machines? If a new machine appears, how can you learn if it’s worth playing?

Imagine that you are presented with several different ads, each with a different (but unknown to you) appeal to the people viewing them. How do you discover which ads have the highest click-through rate, while minimizing the number of ad slots wasted displaying lower-performing ads? If a new ad appears in your inventory, how do you learn if it’s worth showing?

Imagine that you have several different pictures on Tinder, each with a different (but unknown to you) appeal to the people swiping by. How do you discover which picture earns the most matches, while minimizing the time spent showing less appealing ones?

Imagine that you have several potential colour schemes for a sign-up button, each with a different (but unknown to you) effect on visitors. How do you discover which scheme converts best, while minimizing the traffic spent on lower-performing ones?

A/B tests, assuming that they’re run properly, are an extremely useful way of determining which of several options is superior. Unfortunately, they must run to completion before any statistically valid action can be taken, and in the meantime, time and effort are wasted on underperforming options.

The problem of balancing the need to properly explore the set of options against the desire to maximize the time spent pursuing the best option is known as the multi-armed bandit problem. One approach to it is Thompson sampling, which suggests that the amount of time spent exploiting any given option should be proportional to the probability that that option is the best one. Because the vagaries of luck mean we can never be certain that our observations to date are representative of the underlying rates, our probability estimates never reach certainty (0 or 1), so there is always a chance, however small, that any given option will be chosen.

Let’s say that we’re trying to choose a category of ad to show on a page. To simplify matters, each category has a similar payout, so we’re solely concerned with maximizing click-through rate. Each category has a true success rate: given a large enough sample of impressions, the observed click-through rate will converge to it. At this point, however, we don’t know what this true rate is, and we need to recognize that a string of either good or bad luck could skew the observed success rate away from the true one. Because of this, we represent our belief about the actual rate as a probability distribution.

[Figure: a sample probability distribution]

We choose which category of ad to show by sampling the distribution associated with each category — essentially, taking an educated yet still random guess as to what the success rate could be — and choosing the category that has the highest sample. After displaying an ad from that category, we then track whether or not it got a click, and use Bayes’ rule to update that category’s probability distribution.
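
As a minimal sketch of that selection step in Python (the category names and counts below are hypothetical; `random.betavariate` draws a sample from a beta distribution with the given shape parameters):

```python
import random

# Hypothetical per-category state: the (alpha, beta) shape parameters
# of each category's beta distribution. The counts are made up.
categories = {
    "sports":  {"alpha": 12, "beta": 88},
    "fashion": {"alpha": 5,  "beta": 45},
    "news":    {"alpha": 2,  "beta": 8},
}

def choose_category(categories):
    """Draw one plausible success rate per category, then show an ad
    from whichever category produced the highest draw."""
    samples = {
        name: random.betavariate(s["alpha"], s["beta"])
        for name, s in categories.items()
    }
    return max(samples, key=samples.get)
```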

In theory, any type of probability distribution could be used to model the success rate for each category. In practice, the beta distribution is most frequently used. Not only does the beta distribution have some nice mathematical properties that justify its use here, but it’s also extraordinarily easy to update based on new observations. Each distribution in the beta family is determined by two shape parameters, α and β. When the category succeeds in getting a click, we increment α by one. When it fails, we increment β by one. The greater the sum α + β, the narrower the curve. The peak of the probability density function (the mode) is equal to (α - 1) / (α + β - 2). In this way, provided we start from α = β = 1 (the uniform distribution), the most likely value from the distribution matches our observed success rate, and, as we gain more observations, our uncertainty about the true success rate decreases.
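
Continuing the hypothetical `categories` structure from the sketch above, the update and the mode calculation are one-liners:

```python
def record_impression(categories, name, clicked):
    """Bayes' rule for a beta prior reduces to counting:
    a click increments alpha, a miss increments beta."""
    categories[name]["alpha" if clicked else "beta"] += 1

def mode(alpha, beta):
    """Most likely success rate; valid when alpha and beta both exceed 1."""
    return (alpha - 1) / (alpha + beta - 2)
```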

If a new category of ad is introduced, we need to supply it with an initial distribution to work from. Since we expect our new category to be average, we can choose α and β so that the mode sits at the average success rate of all categories. We also want the initial values of α and β to be small, so that the distribution stays broad and the new category is chosen often enough to improve our knowledge of its true rate.
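
One way to derive such a prior, as a sketch: fix the sum α + β (the “strength” of the prior) and solve the mode formula for α. The strength of 10 here is an arbitrary illustration, not a recommendation:

```python
def initial_shape(average_rate, strength=10):
    """Choose (alpha, beta) so that alpha + beta == strength and the mode
    (alpha - 1) / (alpha + beta - 2) sits at the overall average rate.
    Requires strength > 2."""
    alpha = 1 + average_rate * (strength - 2)
    return alpha, strength - alpha
```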

If the true click-through rates change, our Bayesian bandit model is able to correct itself. Given enough time, the best-yielding choice will again end up dominant. Unfortunately, by the time ‘enough time’ has passed, the true rates may have changed again, and you’ve spent a lot of money on low-return slot machines. There are several techniques that can be used to adjust to true rate changes faster. One is to restrict α + β, thus preventing ourselves from becoming overly certain. The lower the sum, the faster the distribution will react to changes, but at the cost of decreased accuracy; without a limit on the sum, a distribution can become very resistant to change.

Another technique is to proportionally reduce α and β at the start of each day, or at any other time that we feel the true rates may have changed. Reducing α and β widens the range of likely values sampled from each category’s distribution, giving categories other than the previous winner a chance to have their performance tested. It also allows the distributions of categories with many observations to adapt to changing conditions more rapidly.
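
Both techniques amount to rescaling the shape parameters, as in this sketch (the cap, factor, and floor values are arbitrary illustrations):

```python
def cap_certainty(state, cap=200):
    """Technique 1: rescale so alpha + beta never exceeds a cap,
    keeping the distribution responsive to change."""
    total = state["alpha"] + state["beta"]
    if total > cap:
        state["alpha"] *= cap / total
        state["beta"] *= cap / total

def reduce_certainty(categories, factor=0.5, floor=10):
    """Technique 2: periodically shrink alpha and beta proportionally,
    unless that would drop alpha + beta below a floor."""
    for state in categories.values():
        if (state["alpha"] + state["beta"]) * factor >= floor:
            state["alpha"] *= factor
            state["beta"] *= factor
```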

To help show how this approach works in practice, here’s a small demo. The large circles under the horizontal axis indicate the true success rate for each option, while the smaller circles indicate which value was sampled from the distribution during that round. ‘Shuffle’ generates new success rates for each option, while ‘Reduce certainty’ halves α and β for each distribution, provided that the reduction doesn’t drop α + β below a certain value.

[Interactive demo: per-option table of Selection rate, Alpha, and Beta, updated each round]
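
If you’d rather experiment in code than with the demo, here’s a self-contained simulation of the whole loop; the true rates are invented for illustration:

```python
import random

def simulate(true_rates, rounds=10_000):
    """Run Thompson sampling against fixed hidden success rates and
    count how often each option gets picked."""
    state = {name: {"alpha": 1, "beta": 1} for name in true_rates}  # uniform priors
    picks = {name: 0 for name in true_rates}
    for _ in range(rounds):
        # Sample each option's distribution and play the highest draw.
        name = max(
            state,
            key=lambda n: random.betavariate(state[n]["alpha"], state[n]["beta"]),
        )
        picks[name] += 1
        # Simulate the click, then update that option's distribution.
        if random.random() < true_rates[name]:
            state[name]["alpha"] += 1
        else:
            state[name]["beta"] += 1
    return picks

print(simulate({"sports": 0.10, "fashion": 0.05, "news": 0.08}))
# The 0.10 option should dominate the pick counts.
```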