The following is (seemingly) a simple problem in probability theory that had haunted me for a while, until - with the support of friends and colleagues - I managed to come up with a solution that was both effective and elegant.
The solution can only be appreciated once you have tried to solve the problem yourself, which is why I am posting this in two parts: first a problem description, and then the solution that I have come up with.
Before I start, I would like to acknowledge Alexey Sorikin and Philipp Ustinov: Their help and feedback has been instrumental.
If you have a solution or know about an implementation out there, please let me know via Mastodon (linked below) or e-mail (see About).
This being said, while I have not found this solution anywhere else, it is my firm belief that other people have come up with this solution before me and that it is hidden somewhere in some mathematical paper. But, at least for me, it was not “google friendly”, so I document it here.
Random choice
Imagine there is group of friends living together and, at the end of a nice dinner, somebody has to do the dishes. Nobody really wants to do it, so they agree that somebody is selected randomly.
They sit down and start writing some Python code. Somebody knows about
the numpy function random()
that returns a value between 0.0
and 1.0 (including the 0.0 but excluding 1.0).
Their solution is to take a random value, scaling it by the number of friends and then rounding the result down to give us the index of the value to pick.
In the end, the implementation looks like this:
from numpy.random import random
def choose_one(values):
n = floor(random() * len(values))
return [values[n]]
(Ignore for a moment that we wrap the value into an array)
Our friends give it a spin with a simple test function, and see if it does its job:
print("Today, %s does the dishes." % choose_one(['Alice', 'Bob', 'Charlie', 'Dave'])[0])
>>> Today, Bob does the dishes.
To illustrate, this is the value that random()
might have picked
to come up with the decision to let Bob do the chores:
But is the function fair? To figure this out, our friends rely on the Law of Large Numbers, and run it many times, collecting some statistics.
from collections import defaultdict
def sample(n, fn):
counter = defaultdict(int)
for _ in range(0, n):
choice = fn()
for x in choice:
counter[x] += 1
return dict(counter)
print(sample(10000, lambda: choose_one(['Alice', 'Bob', 'Charlie', 'Dave'])))
>>> {'Bob': 2441, 'Alice': 2525, 'Charlie': 2559, 'Dave': 2475}
This looks simple enough, and matches what is expected: every person should be selected roughly a quater of the times. That is 2500 times out of 10000. All good!
Choice with probabilites
This works all right until - after a long discussion one evening - our friends find out that some folks like to do the dishes more than the others. They agree that these people should be more likely to do the dishes. Put differently, they want the probability for these people to be larger than the probibility for the others.
The probability of an event is the likelyhood that the event will take place. Simply put, it is a number between 0.0 and 1.0, with 0.0 meaning that the event will never happen, and 1.0 meaning it will always happen.
If the probability is 0.5, the event happens roughly every other time (not strictly alternating, but you know: random).
Of course, only one can do the dishes. This means that the sum of all probabilites must add up to 1.0.
Our friends come up with the following table of how much they like/dislike doing the dishes:
Name | Probability |
---|---|
Alice | 0.1 |
Bob | 0.3 |
Charlie | 0.4 |
Dave | 0.2 |
Again, our friends sit down and write some Python code to help them.
from numpy import cumsum
from bisect import bisect_left
def choose_with_probabilities(values, probabilities):
sums = cumsum(probabilities)
n = bisect_left(sums, random())
return [values[n]]
The cumsum()
computes the cummulative sum (i.e. adding all
previous values), so the array [0.1,0.3,0.4,0.2]
becomes [0.1,0.4,0.8,1.0]
.
The function bisect_left()
selects the index of the value in a
sorted list that is just larger than the given value. So, if the
random value is 0.156
it would return 1
, as this values is
between 0.1
and 0.4
. If the random value is 0.073
,
it returns 0
as this value is smaller than all other values.
Our friends run the code again, they see:
print("Today, %s does the dishes." %
choose_with_probabilities(['Alice', 'Bob', 'Charlie', 'Dave'],
[0.1, 0.3, 0.4, 0.2])[0])
>>> Today, Charlie does the dishes.
Here is an illustration of what is going on:
And again, they verify that everything is fair by running many samples:
print(sample(10000, lambda: choose_one(['Alice', 'Bob', 'Charlie', 'Dave'])))
>>> {'Bob': 2957, 'Charlie': 4037, 'Dave': 2005, 'Alice': 1001}
All looking good again, everybody is happy.
Choice with probabilities and without replacement
Life is great for our friends, and the chores are settled. But times change and they move into a new appartment. This appartment has a much larger kitchen, and our friend discover that two people doing the dishes is much more fun than doing it alone!
So, they want to change to a system where two people are selected to do the dishes. They cannot just run their old code twice, because that could end up that the same person is selected both times, working alone with himself in the kitchen again. Our friends need a random draw without replacement.
They want to keep their individual probabilities to do the dishes, but of course now two people are needed in the kitchen, so they expect to have their turn double as often.
Name | Probability |
---|---|
Alice | 0.2 |
Bob | 0.6 |
Charlie | 0.8 |
Dave | 0.4 |
Remark: People with knowledge in probablity theory will now scream inside and argue that probabilites always have to add up to 1.0. This is true if you are taking about an either/or choice. But in our case, two people are selected, so everything is fine with these values. These are in the end the probabilities that a friend has to do the dishes, and as there are two slots to be filled, the total of all probabilities has to add up to two.
And this is currently where the story of our friends ends: how should they implement the code that chooses a couple of friends to do the dishes, but respects their individual probabilities?
Formal problem description
Given a set S
of input values of size n
, and probability
function f
that associates every element of S
with a
probability between 0.0 and 1.0, such that all the probabilities add
up to a value m
, implement an algorithm that chooses a subset
C
of size m
from S
, with the individual probability of
each element s
in S
to be included in C
is equivalent to
f(s)
.
(Clarification after feedback: there can be many probability distributions for results that fulfil the property, the algorithm just has to come up with one of them)
Epilogue to Part 1: trying numpy’s choice function
One evening, Alice discovers the function numpy.random.choice
which seems to do exactly what she wants: it offers to choose a subset
from a given set, and has a parameter for probabilities. So she tries:
choice(['Alice', 'Bob', 'Charlie', 'Dave'], p=[0.2, 0.6, 0.8, 0.4], size=2, replace=False)
>>> ValueError: probabilities do not sum to 1
Ok, this is strange. So perhaps she needs to look at each role in the kitchen separately, in a way who is chosen first and chosen second. It should not matter. So she runs:
choice(['Alice', 'Bob', 'Charlie', 'Dave'], p=[0.1, 0.3, 0.4, 0.2], size=2, replace=False)
>>> array(['Bob', 'Charlie'], dtype='<U7')
“This looks great”, she says. That seems to be exactly what they need. So, she runs the normal sample test, only to be disappointed:
print(sample(10000, lambda: choice(['Alice', 'Bob', 'Charlie', 'Dave'], p=[0.1, 0.3, 0.4, 0.2],
size=2, replace=False)))
{'Dave': 4454, 'Bob': 6052, 'Charlie': 7069, 'Alice': 2425}
Wow! She would need to do the dishes 400 more times than expected. This can’t be right. She runs the code a few more times, but still gets wrong results.
She does a bit more experiments, and discovers that the second value seems to be the one that is off:
print(sample(10000, lambda: [choice(['Alice', 'Bob', 'Charlie', 'Dave'], p=[0.1, 0.3, 0.4, 0.2],
size=2, replace=False)[0]]))
>>> {'Bob': 3022, 'Alice': 1000, 'Dave': 1976, 'Charlie': 4002}
print(sample(10000, lambda: [choice(['Alice', 'Bob', 'Charlie', 'Dave'], p=[0.1, 0.3, 0.4, 0.2],
size=2, replace=False)[1]]))
>>> {'Bob': 3050, 'Alice': 1362, 'Dave': 2334, 'Charlie': 3254}
While the Numpy code talks about probabilities, even demands that they add up to 1.0, it really considers these probabilities as weights. And the probabities shift once the first person has been selected, as all the probabilities of the remaining friends are just scaled up. Very disappointed, Alice switches off her computer.
Continue reading: Part 2 - the solution