Monday 26 July 2010

Standard Deviations of Sums of Distributions

A week or so ago I was at a textbook selection conference with a couple of really good teachers, and one of them (thanks, Dru) pulled out a copy of Robert Hayden's "Advice to Mathematics Teachers on Evaluating Statistics Textbooks." I mention it now because it has two good pieces of advice. (OK, it has way more pieces of good advice than that, but I'm mentioning these two in particular.)

The first, which I hope I follow: "...make sure the textbook mentions assumptions and teaches students to check them rather than make them." One of the ways I try to get students to check assumptions is to make them understand, as much as possible in the limited time of an AP course, the WHY. In order to do that, I frequently violate another of Professor Hayden's pieces of wisdom: "Be wary of an author who is not familiar with enough real data sets to illustrate a textbook."

OK, I'm gonna claim some "weasel" room here. First, I'm thinking more along the lines of an exercise to help students understand why checking independence is so important, not writing a textbook. Second, even Professor Bob himself says, "While there may be places (such as Anscombe's regression examples) in which a skillfully fabricated batch of numbers illustrates a pedagogical point..." OK, so the "skillfully" may not apply to what follows, but I hope the fabricated data at least help drive home a "pedagogical point."

I begin with two simple data populations, X = {1, 1, 1, 2, 2, 2, 3, 3, 3} and Y = {1, 1, 1, 3, 3, 3, 5, 5, 5}. Students who have learned the "standard deviation as distance" approach can quickly check, by hand or with a calculator, that the standard deviation of the X population is sqrt(2/3), or approximately 0.8165. For Y the standard deviation is sqrt(8/3), approximately 1.633. For what we will be doing, we remind them that the variance of each is the square of the standard deviation, so Var(X) = 2/3 and Var(Y) = 8/3.
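For students who want to confirm the arithmetic on a computer, here is a minimal Python sketch; the helper name pop_variance is my own invention, and it uses the population (divide by n) definition, matching the hand calculation above.

from math import sqrt

# Population variance: the mean squared distance from the mean
# (divide by n, matching the hand calculation above).
def pop_variance(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 1, 1, 3, 3, 3, 5, 5, 5]

print(pop_variance(X), sqrt(pop_variance(X)))  # 2/3 and about 0.8165
print(pop_variance(Y), sqrt(pop_variance(Y)))  # 8/3 and about 1.6330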

So what happens if we add or subtract the populations? It all depends! If the populations are independent, then any X and any Y may (must?) be associated with equal probability. I illustrate this by pairing one of each X value with one of each Y value. (Is it possible to have two distributions be independent without this type of each-x-with-each-y association?)

X___1___1___1___2___2___2___3___3___3

Y___1___3___5___1___3___5___1___3___5

and the sums and differences are then

X+Y =2___4___6___3___5___7___4___6___8 and

X-Y =0__-2__-4___1__-1__-3___2___0__-2

I think it is worth drawing the two resulting distributions, because many students will NOT see that these distributions are reflections of each other, and so should have exactly the same standard deviations (this takes a moment's reflection for some students).

Wow, that's good news. If the population items are independent of each other in the way they are combined, it doesn't matter whether you add them or subtract them: the spread is the same, since the two distributions are mirror images of each other, which means the standard deviations should be (and are) the same, about 1.8257. Even better, we can point out that the variance, 10/3, is simply the sum of the original variances, 2/3 + 8/3. For me it is worth pointing out this "Pythagorean" relationship, [StDev(X+Y)]² = [StDev(X)]² + [StDev(Y)]², IF X and Y are independently associated. (Oops, I have been called out on a mistake here; I originally wrote IFF. The statement is true IF X and Y are independent, but it also holds in any situation in which the correlation coefficient is zero, which does not necessarily require independence. See the comment from "gasstationwithoutpumps" below. Mea culpa, and thanks to "gas...")
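Here is a quick numerical check of that claim, pairing the values exactly as in the table above (the helper pop_variance repeats my assumed population-variance definition from the earlier sketch):

# Pair the values as in the table above and check the spreads.
def pop_variance(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 3, 5, 1, 3, 5, 1, 3, 5]  # each x paired once with each y value

sums  = [x + y for x, y in zip(X, Y)]
diffs = [x - y for x, y in zip(X, Y)]

print(pop_variance(sums))                 # 10/3, about 3.3333
print(pop_variance(diffs))                # 10/3 again
print(pop_variance(X) + pop_variance(Y))  # 2/3 + 8/3 = 10/3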

BUT... what if the original populations were NOT independent? (Quick, think of two data sets that you would really combine in real life that are totally independent... better yet, send your ideas in the comments.)

Well, they might have a positive or a negative correlation, so we slightly rearrange our data sets and group lower numbers somewhat together (no ones with the fives), like this:

X___1___1___1___2___2___2___3___3___3

Y___1___3___1___1___3___5___5___3___5

Now our sums and differences are

X+Y =2___4___2___3___5___7___8___6___8 and

X-Y =0__-2___0___1__-1__-3__-2___0__-2

We quickly recognize that the sets no longer have the same shapes. The distribution of sums is almost uniform, with peaks at the ends, while the difference distribution has two peaks closer to the center.

So what are the spread measures now? The standard deviation of the distribution of sums is about 2.26, the square root of its variance of 46/9. The differences have a standard deviation of about 1.247, the square root of a variance of 14/9, a really big difference. In fact, we help the students notice that the two variances are the same distance from the variance of 10/3 = 30/9 that both had when the populations were combined independently: the sums' variance is 16/9 higher, and the differences' variance is 16/9 lower. Is this just a curious coincidence? (By now my students know that almost NOTHING I bring up is a "curious coincidence".)
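A short sketch to verify these numbers, using the same assumed pop_variance helper as before:

def pop_variance(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 3, 1, 1, 3, 5, 5, 3, 5]  # the dependent pairing above

sums  = [x + y for x, y in zip(X, Y)]
diffs = [x - y for x, y in zip(X, Y)]

print(pop_variance(sums))          # 46/9, about 5.111 (SD about 2.26)
print(pop_variance(diffs))         # 14/9, about 1.556 (SD about 1.247)
print(pop_variance(sums) - 10/3)   # 16/9 above the independent value
print(10/3 - pop_variance(diffs))  # 16/9 below it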

So how can we explain this difference? Slowly you lead their thinking: "If the distributions are NOT independent, they must be dependent... and there must be some relationship... some measure of how UN-independent they are." Eventually they will think of the correlation coefficient, r. In this association between X and Y they have a positive correlation of 2/3... can that help? If the relationship when the association was independent is "Pythagorean," maybe we can look for some extension of the Pythagorean theorem to help. Can we find something like the Law of Cosines that would tie the package together? After all, we need something that will add 16/9 to the variance of the sum distribution, and subtract the same amount for the differences. I can't imagine that I would have kids who would see this, and will probably lead them to observe that [StDev(X+Y)]² = [StDev(X)]² + [StDev(Y)]² + 2r[StDev(X)][StDev(Y)]. They can quickly test that the change of sign leads to
[StDev(X-Y)]² = [StDev(X)]² + [StDev(Y)]² - 2r[StDev(X)][StDev(Y)].
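Students who want to confirm the r = 2/3 value and test both formulas numerically could use a sketch like the one below; pop_cov is my own hypothetical helper name, using the same divide-by-n convention as before.

from math import sqrt

def pop_variance(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

# Population covariance, with the same divide-by-n convention.
def pop_cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 3, 1, 1, 3, 5, 5, 3, 5]  # the dependent pairing

sx, sy = sqrt(pop_variance(X)), sqrt(pop_variance(Y))
r = pop_cov(X, Y) / (sx * sy)

print(r)                                # 2/3
print(sx**2 + sy**2 + 2 * r * sx * sy)  # 46/9, matches Var(X+Y)
print(sx**2 + sy**2 - 2 * r * sx * sy)  # 14/9, matches Var(X-Y)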

I hope before I get to this point I have laid a foundation by giving a short presentation based on a post from John D. Cook's blog, "The Endeavour," that shows the geometric relation between the correlation coefficient and the cosine of an angle. I hope to write a blog post about this relationship in a more vector sense later.
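As a small preview of that geometric idea, the sketch below (my own illustration, not Cook's code) centers each data list at its mean and checks that the cosine of the angle between the two centered vectors is exactly the r = 2/3 computed above:

from math import sqrt, acos, degrees

X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 3, 1, 1, 3, 5, 5, 3, 5]  # the dependent pairing

# Center each list at its mean; r is then the cosine of the angle
# between the two centered vectors.
cx = [a - sum(X) / len(X) for a in X]
cy = [b - sum(Y) / len(Y) for b in Y]

def length(v):
    return sqrt(sum(a * a for a in v))

dot = sum(a * b for a, b in zip(cx, cy))
cos_theta = dot / (length(cx) * length(cy))

print(cos_theta)                 # 2/3, the same value as r
print(degrees(acos(cos_theta)))  # about 48.19 degrees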

All of this follows in the wake of a warning about non-real data from Professor Hayden, so it is important to follow up with real data that should bear this out. I'm thinking of something simple like the students' own ages in months and heights. If the relationship is true for all data sets, it should be true for the measures we have about them; but I am very willing to consider suggestions for a more appropriate data set.

2 comments:

Anonymous said...

You said "[StDev(X+Y)]² = [StDev(X)]² + [StDev(Y)]², IFF X and Y are independently associated", but that must be incorrect, if your later assertion about the meaning of 'r' is correct.

Two random variables can be uncorrelated without being independent. Consider pairs like (0,0), (1,-1), (1,+1). X and Y are not independent (X=0 iff Y=0), but the correlation coefficient is still 0.

dru.martin said...

Pat,

The elegance and clarity of thought are beautiful to see. It's clear that your teaching and understanding is about three floors up from mine!

Dru