Thursday, 13 January 2011

Back of the Envelope Answers to a Hard Problem

In a comment "td" asked if I was going to answer the last question I asked in my blog on a variation of the birthday problem. The question asked the probability that a random selection of 1000 people would include a person with every birth date for all 365 calendar days of a non-leap year.

I had not done the problem, but figured from its nature that it was of a size to be difficult to calculate, but theoretically not too difficult.

I thought of three ways to try and attack it. I was pretty sure it could be done exactly with Markov matrices (more later) if software was available to raise a square matrix of order 365 to the 1000th power. I will illustrate how this might be done with a much smaller problem.

I thought it could probably be approximated really well by using a combination of simulation and statistics... I will do that with also with a smaller problem.

And I thought you could convince any reasonable reader with a bit of basic logic that it was a very, very, low probability... I'll do that now.

I reasoned that the probability of finding all B-days would be small if the probability of failing for any particular day was only relatively small, so I set out to try and find the probability that any given date (they will all be the same) did not appear in the list of Birthdays of the sample. This part of the problem is actually binomial. The birth dates of the random sample are independent of each other, and with a large population to draw from, the probabilities of getting any date on any person selected will be essentially the same. For a particular date, we can say that the probability that a person selected would NOT have that date of birth is 364/365. So the probability of 1000 people being selected who all did NOT have that date is (364/365)1000. My Ti-84 calculator barely blinked to compute .06434 for the probability. Since the probability of each success is a really small, 1/365, we could also use the Poisson with a mean of 1000/365 and find the probability of zero successes. That comes out to .0645 on my calculator.

Once we know the prob of any date not showing up is about .06, we might assume that for 365 dates, even with their lack of independence, the probability of getting all dates in a sample of 1000 would be pretty low.

To see how low, I thought I would try to simulate a few trials and see how often I was successful. I wrote a short program for the TI-84, but it took a long time to run (probably bad program more than bad device) so I switched over and wrote used a spreadsheet run off 1000 random integers between 1 and 365. Then I had it how how many times each of the numbers appeared in the list. Finally I just multiplied these counts all together. If it came out one, then there was at least one birthday missing. It took a little longer to lay out, but it would run a test in a few seconds. Then I got a little more clever and just copied the whole thing across for 100 columns. Now I could run 100 trials at a time..(this time the computer blinked).

I ran that five times, for a total of 500 simulations, and never got a full set of birthdays. That made me wonder if it was working correctly, so I extended the list to 2000 random integers. Now I started to get some hits. It was working after all.

One of the beautiful things about statistics is that you can approximate the probability of something happening that has never happened if it has never happened with enough opportunities to happen. In this case we were 0 for 500. Statistically that doesn't mean the probability is zero. But there is a clever rule of thumb that says if it has not happened in n trials, the probability of happening on any trials is less than 3/n with about a 95% confidence.(How we do that in another blog later.) So I'm figuring that the probability of a full set of birth dates in a group of 1000 is about 3/500 or less.

I wanted to do an exact calculation using Markov Matrices, but none of the software I had access to would handle raising a really big matrix to such a high power. Still, it is a beautiful way to handle (smaller) problems like this so I thought I would illustrate the process for students with a mini version of the problem.

Suppose instead we had picked seven students from a large high school and wanted to know the probability that we would get one from each of the freshman, sophmore, jr and senior classes. The birthday problem for 10000,365 scaled down to 7,4.

We will treat the process as if people were selected one at a time. After the first person is selected, since they will obviously have some birth date, we will say we are in state 1. As each new person is added to the mix, we will move to the next higher state if they have a different birthday to everyone already present, and remain in the previousl state if they match one of the other dates. We can illustrate this with a "State Diagram", a graph of the probability of moving from one state to another as each person is introduced into the mix.

In state one, all the people up to that point have had a single birth date. The probability the next person has a different birthday and moves us to state two is 3/4, and the probability that they have the same birthday as everyone else so far is 1/4. The red lines show the probability of moving from one state to the next, while the blue lines show the probability of staying where they are.

To find the probability we are in any state after n selections we must add the probability we just moved there from a lower state with the probability we were already in that state after n-1 selections. To accomplish this multiplication process we embed the probabilities in a transition matrix. Each row and column gives the probability that you move from the row-state to the column state. For example, row 2 column 3 gives the probability that you move from state two to state three on any selection.
The zeros indicate impossible transitions, like moving from state one to state four on the next person. The one in row 4 column 4 says that once you have all four birthdays, you stay there. This is called an absorbing state.

Now the pretty part. This matrix tells us the probability of being in any state after two poeople (because we started after the first person put us in state one).
The probability we are still in state one is 1/4, and the probability we are in state two is 3/4. Now to find the probability after a third person is selected, we simply multipy our present matrix state by this matrix... we square our transition matrix. Now we have the probability of being in each state in the first line. Notice that this is after three people... two have been added to the original person who put us in state one.
You can check these manually. To be still in state one, we must have had both the 2nd and third person have the same birth date as the initial person, a probability of (1/4)(1/4)=1/16. To be in state three, the second must have differed from the first and the third must have differed from both the others. The probailities in order are 3/4 and 1/2 so there is a 3/8 probability of this outcome after three people. The probablitly of state two is now what is left to total one.
So after seven people, six beyond our initial person, we want to raise the matrix to the sixth power. The top row gives us the probability of being in any of the states after seven people. .

For seven people, the probability we have found all four birthdays is about 52%.

Which means I can solve the problem with 365 birthdays and 1000 people exactly, I just need someone to loan me the money to rent a big Cray Computer for a few seconds.... "anyone, anyone......Bueller?"

*** FOOTNOTE: I got a comment from Jon Ingram that says (in much nicer words) that my inability to find the 1000th power of a 365x365 matrix was a matter more of my ignorance than restrictions in computing power that is readily available. Math teachers should be aware of the power that is out there to do this kind of stuff, much of it either very modestly priced, or free. Here is Jon's comment, with my appreciation... I'm old, but I'm not done learning...
You don't need a Cray! It took my computer (using Python with the Numpy extension) about 10 seconds to calculate the 1000-person, 365-days matrix. It looks like you shouldn't be surprised not to find all the birthdays, as the probability of getting to that state is around 1.7e-12. The most likely outcome is 342 distinct birthdays (probability around 9.4%).
Post a Comment