Monday, 14 November 2022

#34 The Student's t Distribution....from old Math Terms notes

Student's t

The story of the name for this statistical distribution and test is almost legend, and some version of the tale is remembered by Intro Stats students long after they have forgotten the purpose of the t-test. A recent dialogue between Randy Schwartz and James A. Landau on the Historia Matematica discussion group gives both one of the folk versions and a brief history.

Randy Schwartz writes "the distribution now known as 'Student's t distribution' was first discussed in print in a paper by William S. Gosset in the journal Biometrika in 1908, published under the pseudonym 'Student.' The paper solved a problem from the Guinness Brewery concerning how large a sample of people should be used in its tastings of beer. Apparently Gosset was embarrassed to be working on a problem stemming from the liquor industry, thus the pseudonym."
James Landau responds "The story of 'Student' has been told so many times that it has become folklore, and like all folklore variant versions exist until it is difficult to determine which is the original. The variant you tell is one I have not encountered before.
Gosset was an employee of Guinness Brewery (a brewmaster, I believe) who went to study statistics under Karl Pearson. Gosset eventually discovered a result that he published in Biometrika under the pen-name of 'Student.' Why did he choose to use a pseudonym? Here is where the folklore kicks in. The most common story is that Guinness wanted to keep it secret that they were using statistics in their business, and ordered Gosset not to reveal his identity.


In any event, Gosset published all his statistical works as "Student", even though his identity became well known. Why he continued to use the pseudonym is not part of the folklore, and I have never heard a plausible story. Perhaps it is because he became famous as 'Student' and did not want to have to re-establish his professional reputation under his real name. Perhaps he liked the notoriety."
OK, now we know why he used "Student", but why t?

About a year after I wrote the last question in the paragraph above, I picked up a copy of a wonderful book, The Lady Tasting Tea by David Salsburg, and found the answer in a footnote on page 30. He writes "Gosset used the letter 'z' to indicate this ratio. Some years later, writers of textbooks began referring to normally distributed variables with the letter 'z', and they began using the letter 't' for 'Student's ratio'."

While looking up some data on the Data and Story Library site, I came across a little historical note that should be included here also:

W.S. Gosset (1876-1937) was employed by the Guinness Brewing Company of Dublin. Sample sizes available for experimentation in brewing were necessarily small, and Gosset knew that a correct way of dealing with small samples was needed. He consulted Karl Pearson (1857-1936) of University College in London about the problem. Pearson told him the current state of knowledge was unsatisfactory. The following year Gosset undertook a course of study under Pearson. An outcome of his study was the publication in 1908 of Gosset's paper on "The Probable Error of a Mean," which introduced a form of what later became known as Student's t-distribution. Gosset's paper was published under the pseudonym "Student." The modern form of Student's t-distribution was derived by R.A. Fisher and first published in 1925.

The t-distribution is a family of curves that describe the pattern of possible errors when the average of a sample, $\bar{x}$, is used to estimate the mean of the population, $\mu$, from which the sample is taken, provided the underlying population is normally distributed. The distribution for every different sample size is slightly different, becoming more nearly normal as the size of the sample gets larger. The statistic is given by the function $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$. Think of repeatedly drawing samples of size n from a normal distribution with a mean of $\mu$ and a standard deviation of $\sigma$. After each sample is drawn you find the mean of the sample, $\bar{x}$, and the sample standard deviation, s. With all of this in hand you use the formula above to construct a single value of t. If you repeated this operation many, many times, the distribution of the different t values you get from each trial would form the curve called the t-distribution with n-1 degrees of freedom.
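The thought experiment above is easy to try on a computer. The following is only a minimal sketch, assuming the usual form of the statistic, t = (sample mean - population mean) / (s / sqrt(n)); the function names, the seed, and the choice of n = 5 and 20,000 trials are illustrative assumptions, not part of the original notes. It uses the textbook critical value 2.776 (two-sided 95% for 4 degrees of freedom) to show how much heavier the tails are than the normal cutoff of 1.96 would suggest.

```python
import math
import random
import statistics


def simulate_t_values(n, trials, mu=0.0, sigma=1.0, seed=0):
    """Draw `trials` samples of size n from Normal(mu, sigma); return one t value per sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    t_values = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)   # sample mean
        s = statistics.stdev(sample)       # sample standard deviation (n-1 divisor)
        t_values.append((x_bar - mu) / (s / math.sqrt(n)))
    return t_values


def tail_fraction(values, cutoff):
    """Fraction of values whose absolute value exceeds the cutoff."""
    return sum(abs(v) > cutoff for v in values) / len(values)


# Samples of size 5 give t values with 5 - 1 = 4 degrees of freedom.
t_values = simulate_t_values(n=5, trials=20000)
frac_t = tail_fraction(t_values, 2.776)   # should land near 0.05
frac_196 = tail_fraction(t_values, 1.96)  # noticeably more than the normal curve's 0.05
print("beyond 2.776:", frac_t)
print("beyond 1.96: ", frac_196)
```

A histogram of `t_values` would approximate the t-distribution curve with 4 degrees of freedom; rerunning with a larger n should pull the second fraction down toward 0.05.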

The image at right shows a comparison of the t-distribution with one degree of freedom (dark) and the standard normal distribution (dotted). You can experiment with the impact of the size of the sample, n, on the shape of the distribution with this Java applet. For small values of n the curve is leptokurtic, that is, heavier in the tails than the Standard Normal curve. It is lower at its peak than the Normal curve, but has higher density at the extreme regions of the tails.
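For readers without the picture or the applet, the comparison can be reproduced numerically. This is a rough sketch that assumes the standard textbook density of the t-distribution (it is not quoted in these notes); `t_pdf` and `normal_pdf` are names chosen here for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with `df` degrees of freedom (standard textbook formula)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Compare the height at the peak (x = 0) and out in the tail (x = 3).
for df in (1, 4, 30):
    print("df =", df, "peak:", t_pdf(0, df), "tail:", t_pdf(3, df))
print("normal", "peak:", normal_pdf(0), "tail:", normal_pdf(3))
```

With one degree of freedom the peak is lower than the normal curve's and the density at x = 3 is several times higher; as the degrees of freedom grow, both numbers move toward the normal values.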

The t-distribution with n=2 (or 1 degree of freedom) was actually used at least as early as 1824, when Poisson wrote of it (not by that name). The distribution was also briefly explored by F. Y. Edgeworth (1883), who called it the subexponential distribution.

The t-distribution with only one degree of freedom is also called a Cauchy distribution. It can be written as $f(x) = \frac{1}{\pi(1 + x^2)}$. The curve was frequently studied in the early stages of statistics as an alternative to the normal curve for the distribution of errors. Although the distributions are named for Cauchy, they were earlier studied by Fermat, and numerous others including Huygens, Poisson, and Maria Agnesi.
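As a short check on the claim that one degree of freedom gives the Cauchy curve, here is a sketch of the reduction. It assumes the standard density of the t-distribution with $\nu$ degrees of freedom, which is not written out in these notes; the symbol $\nu$ is introduced here only for the illustration.

```latex
f_\nu(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
           \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}
\quad\Longrightarrow\quad
f_1(x) = \frac{\Gamma(1)}{\sqrt{\pi}\,\Gamma\!\left(\frac{1}{2}\right)}\,\left(1 + x^2\right)^{-1}
       = \frac{1}{\pi\left(1 + x^2\right)},
\qquad \text{using } \Gamma(1) = 1,\ \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}.
```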

In the form it was studied by Agnesi, it has often been called the "Witch of Agnesi." A nice visual for how the "witch" is derived geometrically is found at Mathworld.com. The origin of the name seems to have two explanations. One is that Agnesi used the term "averisera" for the versed (turned) sine curve, and it was subsequently mistranslated by John Colson, who was Lucasian Professor of Mathematics at Cambridge. [See the text box below for the story according to Shirley Gray of California State Univ.] According to Stephen Stigler, however, Agnesi had used the term "la versiera" and stated that it meant witch or she-devil. Stigler also adds that she was not the first to use that term for the curve. Guido Grandi had used the same term in 1718, "explaining that the curve had arisen from a consideration of the 'semi versi' versed sine, and that he would call it 'versiera' after the Latin word 'versoria'." In fact, says Stigler, "Although some have interpreted this to mean that versoria meant versed sine, the only relevant definition of versoria given in dictionaries of that time is 'a turning or twisting around'... Versiera is not the Italian word for Versoria but, rather, is the feminine form of 'avversario', sometimes used to mean devil..." Stigler goes on to suggest that Guido may have been indulging in a little humorous play on words and may have been suggesting that the curve resembled a woman's breast.
