Monday, 12 July 2010

Standard Deviation as Distance

Earlier today I had a conversation with another HS stats teacher that reminded me that, when I was writing about vectors a while back, I had not covered two nice uses in statistics. I hope to correct one of those omissions today.

As we were talking I bemoaned the fact that few introductory textbooks seem to really help kids to develop any intuitive idea of what the standard deviation is or how it works. As we talked, I mentioned that I thought there was a geometric approach to the standard deviation that might help make it more clear. You be the judge.

I think the standard deviation is most easily approached as a distance (more specifically, a sort of average of distances). Most high school stats students can quickly find the distance between two points on the plane using the square root of the sum of the squares of the differences (deviations) in each direction (dimension). For those who have never been introduced to the idea, it takes only a few moments to convince them that it generalizes to n dimensions. In a few short minutes they can be finding the "distance" between points (vectors) in any number of dimensions, and many can quickly invent a shortcut to the calculation using the list functions of their calculators.

So why does the standard deviation as a distance make sense? The standard deviation is a measure of how much the data items "disagree" with each other. Start with two measurements, and for the moment use the unconventional notation of calling one of them x1 and the other y1. If they agree perfectly, then the point (x1, y1) lies on the line y=x. If they don't, then it will be off the line by some distance. We begin by finding that distance. The perpendicular from the line y=x to the point (x1, y1) crosses y=x at the point where the x and y values are both the average of x1 and y1, a point we call (xbar, xbar). That means the distance of the point (x1, y1) from the line y=x is just

sqrt((x1 - xbar)^2 + (y1 - xbar)^2)

Now if all our data sets had only two values (and statistics was REALLY EASY) then we could use this "distance" measure as a "standard measure". But one of the funny things about distance is that it grows with dimension, "sort of"... here is what I mean. In one dimension, the distance from (0) to (1) is one unit. In two dimensions the distance from (0,0) to (1,1) is farther; it's the square root of two. In three dimensions the distance from (0,0,0) to a point one away in each dimension is the square root of three. This would mean that the data set {1,3} (distance sqrt(2) from its mean point) would seem to be "less spread out" than {1,1,3,3} (distance 2), which seems like a bad thing, since each value in both sets is exactly one unit from the mean. To compensate, we simply divide this Pythagorean distance result by the square root of the dimension.
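For anyone who would rather check this on a computer than a calculator, here is a small Python sketch of the {1,3} versus {1,1,3,3} comparison. The function name distance_from_mean_point is my own invention for illustration, not anything standard.

```python
import math

def distance_from_mean_point(values):
    """Pythagorean distance from (x1, ..., xn) to (xbar, ..., xbar)."""
    xbar = sum(values) / len(values)
    return math.sqrt(sum((x - xbar) ** 2 for x in values))

small = [1, 3]
large = [1, 1, 3, 3]

# Raw distances: the larger set looks "more spread out" only because
# it lives in more dimensions.
raw_small = distance_from_mean_point(small)   # sqrt(2), about 1.414
raw_large = distance_from_mean_point(large)   # 2.0

# Dividing by the square root of the dimension puts them on equal footing.
scaled_small = raw_small / math.sqrt(len(small))
scaled_large = raw_large / math.sqrt(len(large))

print(raw_small, raw_large)
print(scaled_small, scaled_large)
```

After the division both sets come out with a spread of one, which matches the intuition that every value in each set is one unit from the mean.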

In effect then, the standard deviation of a population of values is the distance between the n-dimensional points A = (x1, x2, x3, ..., xn) and B = (xbar, xbar, ..., xbar), divided by the square root of n. In truth, there would seem to be no need to memorize a formula once the student understands it as a "mean distance".
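A minimal Python sketch of that description, assuming Python 3.8 or later for math.dist. The name sd_as_distance and the sample data are made up for illustration; statistics.pstdev from the standard library is used only as a cross-check.

```python
import math
import statistics

def sd_as_distance(values):
    """Population standard deviation computed as a scaled distance."""
    n = len(values)
    xbar = sum(values) / n
    point_a = list(values)      # A = (x1, x2, ..., xn)
    point_b = [xbar] * n        # B = (xbar, xbar, ..., xbar)
    # Distance between A and B, divided by the square root of the dimension.
    return math.dist(point_a, point_b) / math.sqrt(n)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean 5

geometric = sd_as_distance(data)
textbook = statistics.pstdev(data)

print(geometric, textbook)
```

The two results should agree up to floating-point rounding, since the "distance divided by root n" description is just the usual population formula read geometrically.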

As a happy coincidence, John Cook at The Endeavour web site just posted a blog entry about the relationship between vector geometry and statistics when finding the standard deviation of a sum or difference of two distributions. A must-read for intro stats teachers who want to be able to explain what happens (and why) when the distributions are NOT independent.

4 comments:

chudi said...

Hi Pat. I'm just relearning statistics and the standard deviation "smelled" terribly like a Pythagorean distance from something to something that I could not define and that my teacher would not give a moment of thought. Thanks for explaining it so clearly. Now, after the show of affection, I would ask you to formalize a bit more 'this distance is that it grows with dimension, "sort of"'. If it's not too much trouble. Or at least point me to further reading. Thanks.

Pat B said...

Chudi,
Thanks for the kind words, now about your question... if we measure the distance from 0 to one on the x-axis we call that a length of one..
If we go one unit out in two directions (say to the point (1,1)), the distance is sqrt(2) from the origin... as we increase the dimension, with each one-unit increment from the origin the distance grows: (1,1,1) is sqrt(3) from the origin... but the "average" distance is still one in each dimension... so if we want to know the "average distance" of (3,5,4) from the origin, we take the distance sqrt(3^2+5^2+4^2) and divide by sqrt(3), since it is in three dimensions... so sqrt(50/3) is the average, or "standard", deviation from zero. If we want the standard deviation from the mean value (rms) we do the same thing with the deviations (3-4, 5-4, 4-4) and get sqrt(2/3) for the population... there is a minor adjustment if we expect this is a sample from a population... we replace n (in this case three) with n-1 (or two). Hope one or both of those agree with your calculations.

Mihir Manohar said...

That's absolutely fantastic!!! This article is really helpful for understanding the standard deviation. I just want help with one thing that I am not able to understand. Sir, in your blog you have mentioned that to compensate we would divide the result by the square root of the dimension. What does this signify? I am eagerly waiting for your response. Once again, thanks a lot. Mihir Manohar, India.