Wednesday, 12 January 2011

Getting the Full Set

I recently wrote about some different forms of probability distributions, Binomial, Multinomial, Geometric and Poisson. Today I want to address a pretty type of problem that doesn't fit any of those.

Everybody who plays around much with recreational mathematics has encountered the famous "birthday problem." It is often stated similar to this: "How many randomly selected people must be assembled so that the probability that at least two of them have the same birthday is at least p (for some specific p, often 1/2)."

But today I want you to think about a slightly different variation. How many people (on average) would you have to gather in order to have at least one with a birthday on every day of the year" (like the birthday problem, we make some assumptions that are not exactly true... births occur with equal probability on every day of the year (they don't, see here. consider only 365 days a year, etc..)

That's probably a hard one to start, so let's break down to an easier problem. How many times should you have to flip a coin until you get at least one head and one tail?

The probability you get one of them on the first roll is 1. But how long should it take to get the "other" one. Well, each flip the probability it happens is p=1/2, and we know that the expected value of this geometric event is Ev= 1/p.. or 2 flips. So we figure that, on average, we will get both outcomes in three flips. To be suggestive, I will suggest that you see that as 2/2 + 2/1.

Ok, upping the ante a little. How many times would you have to roll a die to get all six possible outcomes? Again, break it into packets. How many times should you have to roll until you get one of the six outcomes? Good, you recognize that ONE of them has to happen on the first roll. Now to get one of the five others on the following rolls has a probability of 5/6, so it should take only 6/5 rolls (on average) to get a second outcome. By now you can see we will probably end up with 1+6/5+6/4+...+6/1. I point out here that that can be written as .
Ok, but that summation of 1/k as k goes from one to some number is a harmonic series. It grows very slowly, and no body wants to add up 365 fractions that get very small very quickly. Fortunately, Euler came to our rescue. What Euler did was come up with a really good approximation for 1/1 + 1/2 + 1/3 + ..... + 1/n for any n... It turns out that as n gets bigger and bigger, the sum gets closer and closer to n*ln(n).. and Euler came up with a really good estimate of how wrong it would be. As n gets really big the sum of is about n ln(n) + .577. So 365 outcomes is not much more difficult to calculate than six was. (365)ln(365)+.577 turns out to be about 2154 people. So if you go to a pretty big school, maybe there is a person enrolled who has a birthday on each day of the year.

So you need a challenge.... Suppose you don't have 2154 people on campus. What if you have say 1000. Now what is the probability that there is someone on campus with all 365 birth dates.

Be the first on YOUR campus to solve it, and send the solution here. It might be easier to start by making the problem smaller.


td said...

Nice blogpost! Thanks!
Are you going to give away the solution to the challenge?

Pat's Blog said...

td, Thanks for the kind words....

Solution??? well I will give a very close approximation to the solution...and write about some different approaches to a solution (None of which are very easy, at least for me) in a (not to far in the) future blog.

Anonymous said...

There is a fascinating wording problem with: "Now what is the probability that there is someone on campus with all 365 birth dates."


Fred Gloff said...

So close and yet so far! You were genuinely on the right track in the paragraph beginning with “I reasoned that the probability . . .”, but you skipped one step before using the Poisson distribution. You had correctly computed the probability of a particular birthdate not occurring among the 1000 people. But that applies to each of 365 birthdates. So we should multiply that probability by 365, thus finding that the expected value of the number of birthdates NOT occurring is 23.486…. Notice that we have used an exact technique to obtain this result. If this expected frequency had turned out to be merely 4 or 5, the probability of no missed birthdates would be “pretty low.” When it is larger than 23, that probability must be extremely small. If you want to get a handle on how small, now is the time to reach for the Poisson distribution, or since we are talking about getting zero occurrences, just compute e^(-23.486), getting about 6E-11 (Now we are approximating). Notice that carrying out these computations on a basic scientific calculator would take little more than 10 seconds.
But we are not done yet. Even if one is not familiar with the Poisson distribution, it should seem apparent that the probability distribution for the number of missed birthdays would be skewed to the right (many more options above 23.5 than below), so the mode would be smaller than 23.5. In the Poisson distribution, it is right at 23, with a probability of about 8.3%. So, as Jon Ingram found by the matrix method, the most likely outcome is 342 distinct birthdays. How do our probabilities compare with his? The first is high (but the discrepancy is hardly meaningful) and the second is low. For those approximations, I actually prefer to use a binomial distribution. With “birthday problems”, using an n-value about half the maximum number of possibilities (364/2 here) seems to work fairly well. With n=182, the probabilities come out around 1E-11 and 8.8%, clearly closer to Jon’s results. Sometimes being willing to accept approximate answers can produce useful results while saving a tremendous amount of effort in both reasoning and computation.
In the previous day’s post, Pat showed that the average group size needed to cover all 365 possible birthdates is about 2154. Using that in place of 1000 produces an expected frequency of .990 (pretty close to 1, which is typical when dealing with waiting time random variables) and a probability of about 37% (not far from e^-1). With a group size of 2284, we get an expected frequency of 0.693 (near ln(1/2)), with a probability very close to 50%. Thus the median waiting time is approximately 2284.

Fred Gloff said...
This comment has been removed by the author.
Fred Gloff said...
This comment has been removed by the author.