# The Bayesian Bandit and the Greedy Bandit¶

In [8]:
from __future__ import division
import random
random.seed(999)
import copy
%matplotlib inline
import seaborn as sns


This is a quick post about various methods of multi armed bandits. For those who are unfamiliar, multi-armed bandits are an interesting class of problems, that have everybody's attention these days, given their importance in content experiments by internet companies.

## Bandit in a Casino¶

At the center of this puzzle is a Bandit, who has gained access to a casino, for a small amount of time. Before the police comes, the Bandit must make the most of his efforts to get access to the casino. Now the casino (and forgive my ignorance on the subject matter, I have never been to one!) consists of many machines with levers. You pull a lever, and a coin might come out. You could be unlucky and nothing could come out too. Each of these levers have a different probability of giving you money.

Of course, there are no guards and such, and no money to bet. So the bandit keeps on pulling the levers feverishly, and putting the coins in his knapsack. However, the Bandit has a problem. He does not know which lever gives how much money. However, the police is on its way. So it needs to make a decision. Should I pull the different levers to determine which is the best one? or should I just go with my knowledge from the pulls I have done, maximize my profit by pulling from the lever which I think is the best? This is the famous explore-exploit dilemma, explore more, or exploit the current knowledge to make the most you can, of the situation? At its heart, the Bandit problem boils down to a solution of this dilemma.

## Bandits and Internet Experiments¶

The old bandit puzzle has gained a recent prominence, because it corresponds to a practical problem that today's internet companies face. Imagine you have 10 different versions of an website, which have been proposed to you by designers. Now, you want to find out which version do your customers like most. The term "like" here is very subjective. Depending upon your business needs, it can mean mean many things, for eg. are people buying from your site? are people clicking on outgoing links from your site etc.

So, how do you test each version of the site? You want to show it to customers. You don't want to spend too much time on the pages which are really bad, but you also don't want to come to a premature conclusion about what customers like or dislike, based on a 5 or 10 customers. Again, you are in the explore-exploit dilemma. So, imagine setting up the experiment the following way. Each website version is a casino machine. For each customer that arrives, you decide which machine they go to. If they buy stuff from that version of the site, you got your coin. If they smirk, and close the page, then your lever pull for that machine went into a waste. So, now, with this setting, how do you maximize your earnings?

## Bayesian and Greedy Approaches¶

Generally, multi armed bandit strategies can be divided into two types. (1) Greedy strategies, like epsilon-greedy and epsilon first (2) Bayesian strategy: Known as probability matching/Thompson sampling. We will explain each of these strategies and run the code to execute them. Let us try to compare the result of these three strategies. TL;DR : Bayesians are wayyyy better than the greedy strategies.

Our framework is this:

There are five machines in the casino. They have probability of success 0.3, 0.4, 0.5, 0.6,0.7 respectively. However, the Bandits do not know this.The bandits have 1000 chances in total, before the police cars show up. We will try epsilon first, epsilon greedy and Thompson sampling to decide who wins.

In [9]:
#first create n bandits

import numpy as np

class Bandits:
def __init__(self, prob_arr):

self.prob_arr= prob_arr
self.total_bandits= len(self.prob_arr)
self.success_arr = [0]*self.total_bandits #keeps track of successes for each arm
self.total_tries = [0]*self.total_bandits #keeps track of how many times each
# arm is tried.
self.empirical_prob=[0]*self.total_bandits

def pull_this_arm(self, arm_no):
if arm_no>= self.total_bandits:
raise ValueError("Arm number is more than total bandits")

#try this arm
self.total_tries[arm_no]+=1
rand_no = np.random.uniform(low=0,high=1,size=1)
if rand_no<= self.prob_arr[arm_no]:
self.success_arr[arm_no]+=1
return True
else:
return False

def find_empirical_prob(self):
for i,tries in enumerate(self.total_tries):
if tries!=0:
self.empirical_prob[i] = self.success_arr[i]/tries

def find_best_arm_now(self):
self.find_empirical_prob()
return self.empirical_prob.index(max(self.empirical_prob))

#this is the Thompson sampling
def find_best_thompson(self):
sample_from_beta = [0]*self.total_bandits
for i in range(self.total_bandits):
sample_from_beta[i] = np.random.beta(1.+self.success_arr[i],\
1.+self.total_tries[i] - self.success_arr[i])
best_bandit = np.argmax(sample_from_beta)
return best_bandit



## Epsilon-first¶

Now, let us first use epsilon first strategy. What does the epsilon first strategy do? This strategy first explores, for a given period of time, which is $\epsilon*T$, and then based on the exploration it has done, it decides on the best bandit. The second phase is the pure exploitation phase where it sticks to its decision, and only pulls the bandit which it decides is the best. Let us see how this strategy performs?

Interesting to note, this is how we make many decisions in real life. Job interviews or elections are an example. Candidates are explored for a given amount of time, and then we give the chance to the best one. A country sticks to its President/ Prime Minister for four/five years.

In [10]:
def epsilon_first(Band, eps_val, total_draws):
money_earned = 0.
best_arm = None

for i in range(total_draws):
if (i*1.0/total_draws)<eps_val:
#explore
pull_lever = np.random.randint(0,3)
result = Band.pull_this_arm(pull_lever)
if result == True:
money_earned +=1.

else:
#exploit
#if exploiting for the first time, find out
#which is the best arm.
if best_arm is None:
best_arm = Band.find_best_arm_now()

result = Band.pull_this_arm(best_arm)
if result == True:
money_earned +=1
return money_earned


So, we have defined the epsilon first strategy. For a given value of epsilon, the code first finds whether we are in the exploration, or exploitation phase. If we are in the exploration phase, it randomly finds an arm and draws. Then, when it enters the exploitation phase, it finds the best arm it has seen so far, and sticks to it. It keeps drawing by pulling that previously decided on "best arm" for all the subsequent chances. In the end, it returns how much money it has made.

But each time it will run, it will give a slightly different result. So, let us run it for a bunch of times, and see the distribution of the money it gives us. If this were an estimate, we would call it the sampling distribution of the estimate. So, let us find the sampling distribution of the money we would earn, each time we do this. This will of course depend on what value of epsilon we choose. Lets first guess 0.2. We will then vary this guess and see what happens.

In [11]:
tot_sims=1000
tot_draws = 1000
universal_bandit = Bandits([0.3,0.4,0.5,0.6,0.7])

def run_sim_many_times(sim_func, tot_sims, tot_draws, eps_val):
eps_first_money = [0]*tot_sims
for i in range(tot_sims):
#Band_aid = Bandits([0.3,0.4,0.5]) #nice name for bandits!
Band_aid = copy.deepcopy(universal_bandit)
eps_first_money[i] = sim_func(Band_aid, eps_val,tot_draws)
return eps_first_money

eps_first_0p2 = run_sim_many_times(epsilon_first, tot_sims, tot_draws, 0.2)


In [12]:
import matplotlib.pyplot as plt
plt.hist(eps_first_0p2)

Out[12]:
(array([  10.,    1.,    1.,   39.,   62.,   20.,   97.,  425.,  312.,   33.]),
array([ 302. ,  324.8,  347.6,  370.4,  393.2,  416. ,  438.8,  461.6,
484.4,  507.2,  530. ]),
<a list of 10 Patch objects>)

### Should we explore more?¶

That's an interesting question. I am sure there is a good statistical answer to where the optimum $\epsilon$ lies, but I cannot think of one off the top of my head. Why don't we brute force our way to an answer.

In [13]:
eps_arr = np.arange(0,1.05,0.05)
expected_eps_first = [0]*len(eps_arr)
for i,eps_val in enumerate(eps_arr):
first_money = run_sim_many_times(epsilon_first, tot_sims, tot_draws, eps_val)
expected_eps_first[i] = np.mean(first_money)

In [15]:
plt.plot(eps_arr, expected_eps_first,'ro')

Out[15]:
[<matplotlib.lines.Line2D at 0x110a81bd0>]
In [16]:
print expected_eps_first

[300.02999999999997, 452.83100000000002, 466.90499999999997, 470.01400000000001, 468.75099999999998, 466.66399999999999, 463.78800000000001, 459.69999999999999, 457.24099999999999, 452.99200000000002, 448.99400000000003, 443.97500000000002, 439.56900000000002, 434.714, 429.51799999999997, 424.92200000000003, 419.315, 415.11000000000001, 409.43000000000001, 404.57299999999998, 399.12900000000002]


So, just by luck, we were pretty close! What this shows is that exploring a lot does not gain you much. After $\epsilon=0.2$, exploring only diminishes your gains. So, now, let us look at the epsilon-greedy strategy.

## Epsilon Greedy¶

The epsilon-greedy strategy tries a slightly different tact. We choose greedily, i.e. based on what we know now, we choose. But sometimes, we choose randomly, in order to explore, i.e. we don't stop exploring entirely after a while. We explore intermittently while we explore. In case we were initially wrong, it gives us a chance at correction. Whether we explore or exploit, we will decide based on a random throw of the dice. We will generate a random number $X$ between 0 and 1. if $X In [17]: def epsilon_greedy(Band, eps_val, total_draws): money_earned = 0. best_arm = None explore = None for i in range(total_draws): random_draw = np.random.uniform(low=0.,high = 1., size=1) if random_draw<eps_val: explore = True else: explore = False if explore == True: pull_lever = np.random.randint(0,3) result = Band.pull_this_arm(pull_lever) if result == True: money_earned +=1. else: best_arm = Band.find_best_arm_now() result = Band.pull_this_arm(best_arm) if result == True: money_earned +=1 return money_earned  Now, all we need to do, is run this function many times. We already have the framework for it, thanks to functional programming. In [18]: eps_greedy_0p2 = run_sim_many_times(epsilon_greedy, tot_sims, tot_draws, 0.2)  In [19]: plt.hist(eps_greedy_0p2)  Out[19]: (array([ 4., 8., 16., 34., 50., 133., 276., 320., 135., 24.]), array([ 352. , 369.1, 386.2, 403.3, 420.4, 437.5, 454.6, 471.7, 488.8, 505.9, 523. ]), <a list of 10 Patch objects>) Well, clearly not very good, right. Okay, lets see if we can find an optimum, just like the last time. Again, similar code to previous case. In [20]: expected_eps_greedy = [0]*len(eps_arr) for i,eps_val in enumerate(eps_arr): first_money = run_sim_many_times(epsilon_greedy, tot_sims, tot_draws, eps_val) expected_eps_greedy[i] = np.mean(first_money)  In [21]: plt.plot(eps_arr, expected_eps_greedy,'bo')  Out[21]: [<matplotlib.lines.Line2D at 0x110d91350>] Around$460 just like last time, but a smaller value of epsilon is more useful. Okay, so this is the best the frequentists can do. Now, let us ask if the Bayesians can do any better?

## Bayesian Bandits aka Thompson sampling¶

Now, do Thompson sampling. The logic of the Thomson sampling lies in the function named find_best_thompson() in the Bandit class. What does this function do?

For each machine, it constructs a beta distribution as a prior. By multiplying it with the likelihood of getting a success, it constructs the posterior distribution for each machine. This posterior corresponds to our belief at a time point on what the true probability of success is, if we draw from that machine. In each step, it samples from each of the posterior distributions. It then chooses the sample with the highest probability. If one machine stands out, clearly, it will be the maximum in most cases, when samples are drawn, and correspond to the exploit phase. Whereas, if two machine have similar posterior distribution, this means, to a Bayesian, from our knowledge, we aren't very sure, who is better. So, in a sense then, the maximum is more likely to come from either distribution. In plain English, this is how Thompson sampling solves the explore exploit dilemma.

In [22]:
def bayesian_bandit(Band, total_draws):
money_earned = 0.
for i in range(total_draws):
pull_lever = Band.find_best_thompson()
result = Band.pull_this_arm(pull_lever)
if result == True:
money_earned +=1.
return money_earned


Now, we want to run all the three. Our main aim is to see who earns the most money?

In [23]:
thompson_earning=[0]*tot_sims
for i in range(tot_sims):
Band_aid = copy.deepcopy(universal_bandit)
thompson_earning[i] = bayesian_bandit(Band_aid, tot_draws)


How well does Thompson sampling do, on average?

In [24]:
print(mean(thompson_earning))

671.139


What did the distribution look like?

In [25]:
plt.hist(thompson_earning)

Out[25]:
(array([   6.,    4.,    7.,   28.,   79.,  246.,  348.,  222.,   56.,    4.]),
array([ 562. ,  579.4,  596.8,  614.2,  631.6,  649. ,  666.4,  683.8,
701.2,  718.6,  736. ]),
<a list of 10 Patch objects>)

Wow! even the worst Thompson sampling is at around 560, which beats the best epsilon greedy or epsilon first by ~\$100.

Let us plot the three methods side by side, to get a sense of how good the Bayesian sampling is, compared to the epsilon first, and epsilon greedy.

In [44]:
#thompson_normal = [w/max(thompson_earning) for w in thompson_earning]
plt.plot(eps_arr*100, expected_eps_first,'ro')
plt.ylabel("Earnings")
plt.plot(eps_arr*100, expected_eps_greedy,'bo')
sns.distplot(thompson_earning, vertical=True, hist=True, kde=False)

Out[44]:
<matplotlib.axes.AxesSubplot at 0x112afe850>
In [ ]:



© Sourav Chatterjee. Built using Pelican. Theme by Giulio Fidente on github.