As I mentioned in class, humans tend to see patterns even when the results are completely random. When listening to a Geiger counter, we all get the impression that the radioactive decays are clumped. And yet, they are all independent and random. The same phenomenon is sometimes called the "Gambler's Fallacy." Someone who is placing bets looks for patterns and bets to take advantage of them. But the patterns aren't really there (the fall of the ball on the roulette wheel really is random -- at least at an honest casino), and so the gambler who is looking for patterns and 'streaks' is really only fooling himself. Every spin is independent, with equal chance to come up red or black, and equal chance to land in any of the 38 numbered pockets of an American wheel (1 through 36, plus 0 and 00). The fact that the last 5 hits were black doesn't mean that you can now predict that the next one will be black too.
Nor does a string of blacks mean that the next one will be red. That might be the logic of a gambler who says, "in the end, it all has to even out." In fact, it doesn't have to even out. If the number of times red is expected to come up is 10,000, then from the square-root rule we really expect the actual count only to be within the range 10,000 ± sqrt(10,000) = 10,000 ± 100. Note that with more spins of the wheel, the square root gets larger. The fraction of reds does settle down toward one half, but the absolute difference between the red count and the expected count tends to grow. So it doesn't have to even out!
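You can see this at work in a short simulation. The sketch below is in Python (the function name, session count, and random seed are my own choices, and the green 0/00 pockets are ignored for simplicity). It measures the typical spread of the red count over many sessions of fair spins:

```python
import random

def red_count_spread(n_spins, n_sessions=200, seed=0):
    """Standard deviation of the red count over many simulated
    sessions of n_spins fair red/black spins."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sessions):
        counts.append(sum(rng.random() < 0.5 for _ in range(n_spins)))
    mean = sum(counts) / n_sessions
    var = sum((c - mean) ** 2 for c in counts) / n_sessions
    return var ** 0.5

# Exact binomial spread is sqrt(n)/2; the rough square-root rule
# gives the same order of magnitude.
spread_small = red_count_spread(400)     # about sqrt(400)/2 = 10
spread_large = red_count_spread(10000)   # about sqrt(10000)/2 = 50
```

The spread of the count grows (here from about 10 to about 50) as the number of spins grows from 400 to 10,000 -- the percentage imbalance shrinks, but the absolute imbalance does not.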
The same phenomenon -- seeing patterns in random data -- occurs with random patterns of points. Using a computer, I generated completely random locations. I assigned each location to a "star" and then made the plot shown on the left.
To get a more uniform plot, I then redid all the calculations, but in a different way. I divided the box into 100 smaller boxes, and put one star in each little box. However the location in the little box was random. The result is the figure on the right.
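The two constructions are easy to reproduce. Here is a sketch in Python (the original was done with my own computer program; these function names and seeds are mine):

```python
import random

def random_stars(n=100, seed=1):
    """Completely random locations in the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

def one_star_per_box(grid=10, seed=1):
    """One star per little box: the unit square is divided into
    grid*grid boxes, and each box gets a star at a random spot inside."""
    rng = random.Random(seed)
    return [((i + rng.random()) / grid, (j + rng.random()) / grid)
            for i in range(grid) for j in range(grid)]

def boxes(stars, grid=10):
    """Which little box each star falls into."""
    return [(int(x * grid), int(y * grid)) for x, y in stars]
```

Checking box membership confirms the point: the one-star-per-box pattern never puts two stars in the same little box, while the truly random pattern almost always has crowded boxes and empty ones.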
Which pattern looks more random? Most people would say: the pattern on the right! The left diagram seems to be full of clumps, lines, and vacant regions -- full of "constellations," groupings of stars that don't seem random.
It is hard to believe that the diagram on the right is the non-random one. In the figure below (on the left) I duplicated the more uniform star pattern, but also put in dotted lines, showing explicitly the 100 little boxes. The star pattern is exactly the same as for the plot on the upper right. But now you can verify that no box contains two stars. The stars were spread out in this manner to make the coverage more uniform. That makes it look more random! But it is an illusion.
In the actual sky, there are some stars that are not randomly placed. The stars in the cluster known as the Pleiades are truly clustered. But most of the stars are randomly placed. The stars in the constellation of Orion are not even close to each other; some are much farther away than others. We see what appear to be constellations, because truly random patterns appear clustered. We have to make them more uniform to make them look random. That is why many paintings of stars look wrong. The artist did not make the stars sufficiently random.
Here is a CHALLENGE, possibly worth one extra quiz point: find me examples of paintings by great artists that are even better examples of "overly-uniform stars." I want only paintings that are from 1900 or earlier; no 20th or 21st century. Note: you earn a quiz point only if you are the first person to suggest a particular painting, and the painting does indeed show stars -- either randomly painted, or uniformly, or something of similar interest to us. I am not interested in stars that show only the star of Bethlehem, for example, or are an attempt to portray the Big Dipper. I want stars that are supposed to look like real stars, and either succeed, or don't. Please, no more paintings by Van Gogh; I have all of those from other students. I will grant two points if the painting was done prior to 1900! (I recently spent a day in the Louvre looking for such paintings, and found that stars were not depicted by any early painters! Art history students in the class -- why is this?)
The best submission to the Challenge, as of September 21, 2002, is the book cover displayed to the right on the figure above. (Thank you, Poorwa Singh, for submitting this one!) Do you think the stars in this are random or nonrandom?
I'd still like to find such a painting by one of the great masters. (If you know the name of the painting and the artist, then I can probably find a copy online by using Google in the "advanced" mode.)
The one rule of statistics that everyone should know is the square-root rule. In fact, you will be amazed to see how important it is! This is the way it works: if, based on past performance, you expect to have an event happen 1000 times, then don't be surprised if it actually happens 1032 times, or 968 times. Where did I get those numbers? First, I take the square root of 1000. That is about 32. Then I both add it to, and subtract it from, 1000. That gives me the expected range. That's the square-root rule.
Is the square-root rule ever violated? Yes -- about 1/3 of the time! That is pretty frequent. So if you expected the event to occur 1000 times, and it actually occurred 1050 times (that is bigger than 1032), then you are surprised, but not too surprised. This square root is called the "standard deviation." The standard deviation for 1000 events is 32. For 100 events, it is 10. For 1,000,000 events, it is 1,000. The rules of statistics say that you will exceed twice the standard deviation only 5% of the time. That still happens, of course. You exceed three standard deviations only 0.3% of the time. That still happens about one time in 300.
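You can verify these percentages with a short simulation (a Python sketch; the session count and seed are arbitrary choices of mine). It repeatedly counts events that each have a 1-in-100 chance in 10,000 tries, so about 100 events are expected, with a standard deviation of about sqrt(100) = 10:

```python
import random

rng = random.Random(5)
expected, sigma = 100, 10     # sqrt(100) = 10
outside_1sd = outside_2sd = 0
n_sessions = 300
for _ in range(n_sessions):
    # count how many of 10,000 tries produce the 1-in-100 event
    count = sum(rng.random() < 0.01 for _ in range(10000))
    if abs(count - expected) > sigma:
        outside_1sd += 1
    if abs(count - expected) > 2 * sigma:
        outside_2sd += 1

frac_1sd = outside_1sd / n_sessions   # about 1/3
frac_2sd = outside_2sd / n_sessions   # about 5%
```

With these parameters, roughly a third of the sessions miss the one-standard-deviation range and only a few percent miss the two-standard-deviation range, just as the rule says.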
Take a look at the stars in the truly random pattern. How many stars do you expect to find on the left side of the square? Partial answer: 50. That is because they are random, so you expect half to be on half the page. But a much better answer is 50 ± 7. Then you wouldn't be surprised if you found 43 or 57. If you look at the figure on the right side, the uniformly distributed one, you will of course find exactly 50.
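Here is the same left-half check in a few lines of Python (the seed is arbitrary):

```python
import random

rng = random.Random(42)
stars = [(rng.random(), rng.random()) for _ in range(100)]
left_half = sum(1 for x, y in stars if x < 0.5)
# Typically 50 +/- sqrt(50), i.e. 50 +/- 7. Landing exactly on 50
# would itself be a bit of a coincidence: it happens only about
# 8% of the time.
```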
An important example: political polling. Suppose a polling organization is trying to figure out who is going to be the next president. They don't have the money to poll a million people, so they decide to try 1000. They find that 484 of these people want to elect Richard Muller as next president. They report this result to Muller. "Too bad," they say. "Looks like you are going to lose."
But Muller knows the square-root rule. So he says that the real number is 484 ± sqrt(484) = 484 ± 22. That means that he is only a little bit more than 1 standard deviation away from winning! He tells his supporters, "Donate money -- my chances are really good!"
Note that the accuracy of the poll turned out to be ± 22 out of 484. That means the percentage error was 22/484 = 4.5%. This may sound familiar; it is similar to the uncertainties quoted for polls presented on TV and in newspapers. This is the typical uncertainty that you get when you poll 1000 people. You can get better accuracy by polling 10,000 people, but only a little better, and it costs ten times as much. In fact, you can actually do a little better than 4.5%: about 3%. The reason is that most elections are close. For the details, see the optional paragraph that follows.
Note for the experts only. No need to read this unless you have actually studied statistics! The square-root rule is not exact; it is only an approximate rule. For the binomial distribution, the standard deviation is sqrt(Np(1-p)), where p is the probability. So if p = 0.5 (e.g. an election that is close), then the standard deviation is actually sqrt(N*0.5*0.5) = sqrt(N)/2, while the mean is Np = 0.5N. So the uncertainty is not the square root of the mean, but is smaller than that by a factor of sqrt(2). That is what improves the value from 4.5% to 3%. (Nobody cares about the sqrt(2) correction except in close elections.) If the event is rare, e.g. if p is small, then the standard deviation is sqrt(Np(1-p)) ≈ sqrt(Np), and since the mean is Np, this is just the square root of the mean. So in this limit (the Poisson limit) the rule is exact. Either way, the rule of thumb that you shouldn't be surprised if you get a result within one or two times the square root is still a very good rule of thumb.
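A quick numerical check of the sqrt(2) claim (a Python sketch; the poll size, poll count, and seed are my own choices): simulate many polls of 1000 people in a dead-even race and measure the spread of the "yes" count.

```python
import random

def poll_spread(n_polled=1000, p=0.5, n_polls=400, seed=7):
    """Standard deviation of the 'yes' count across simulated polls."""
    rng = random.Random(seed)
    counts = [sum(rng.random() < p for _ in range(n_polled))
              for _ in range(n_polls)]
    mean = sum(counts) / n_polls
    return (sum((c - mean) ** 2 for c in counts) / n_polls) ** 0.5

spread = poll_spread()
# Close to sqrt(N*p*(1-p)) = sqrt(250) ~ 15.8, and clearly smaller
# than the plain square-root-of-the-mean estimate sqrt(500) ~ 22.4.
```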
The great coin flip experiment.
When I taught this class in the fall of 2000, I asked every student to flip a coin 200 times, and to email me with the number of heads and number of tails. 65 students reported the results of their coin tossing experiment. Three of them got exactly 100 heads. One student reported 79, and another 134 (they were the extremes). I plotted the number of students reporting within various intervals (e.g. from 100 to 103, from 104 to 107, etc.), and the plot shows what I found.
For discussion: is the width of this curve what you would expect, based on the square-root rule?
More discussion: Notice the bin near Heads = 120. It has no events in it. None of the students in the class reported 114 to 121 heads. That is what produced the zero in the plot. Does that surprise you? What does it mean? Click for my answer.
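The experiment is easy to replay in software. This Python sketch (the seed is mine; the 65 students and 200 flips match the class experiment) simulates the head counts and bins them:

```python
import random

rng = random.Random(12)
heads = [sum(rng.random() < 0.5 for _ in range(200)) for _ in range(65)]

# Standard deviation per student: sqrt(200*0.5*0.5) ~ 7.1 heads, so
# roughly 95% of the 65 counts should fall between 86 and 114.
within_2sd = sum(1 for h in heads if 86 <= h <= 114)

# Bin the counts (width-4 bins) and see how they spread out:
bins = {}
for h in heads:
    bins[h // 4] = bins.get(h // 4, 0) + 1
```

With only 65 students spread over a dozen or so bins, an empty bin out in a tail is not surprising at all -- the tails of the distribution simply don't get enough entries to fill every bin.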
This section will make most sense to those who are baseball fans, but everyone should read it anyway, even if the terminology is a little unfamiliar. As I mentioned in class, Barry Bonds (as of Friday Sept 9, 2001) had 22 games more to play, and had hit 60 home runs already. Is he likely to match or surpass the record of 70 home runs set by Mark McGwire? (Note: this page was written on Sept 9; by now, you may know the answer. But the discussion is more interesting when you don't, so pretend it is Sept 9, 2001.)
I wrote a little program in the language "Matlab" that simulated a player hitting home runs. I assumed the player plays in 162 games, and that on average he is expected to hit 70 home runs. That means the probability that he will hit a home run in any one game is 70/162. The Matlab function "rand" will generate random numbers, just as coin flipping does. However, instead of 50% odds for a home run, I used rand to make the probability of a home run in each game exactly 70/162. This simulation isn't really precise, since I didn't allow for the possibility that Bonds could hit 2 or more home runs in one game. And it doesn't take into account the fact that he is playing in different stadiums, and that he is getting more "walks" near the end of the season. (That is when he is denied a chance to hit a home run by being deliberately put on first base.)
Here is what the program got. Each 0 is a game with no home run; each 1 is a home run:
In this simulation, there were 69 home runs. I expected the number to be 70 ± sqrt(70), i.e. 70 ± 8.5. When I ran the simulation several more times, I sometimes got over 80 home runs, and sometimes under 70. But about 2/3 of the time the results were within 70 ± 8.5, i.e. between 61.5 and 78.5.
Note the long runs of zeros. Near the end of the season, there is a series of 8 games in a row with 0 home runs. I colored those in red. In this program, they happened even though the probability didn't change. The program was not "in a slump" -- it's just that sometimes you get eight zeros in a row! A similar thing happened to Barry Bonds just before the All-Star game, and many reporters thought he was in a "slump" -- a period when something was wrong, and he wasn't playing well. But it might have just been random luck. In the simulation above, with 8 zeros in a row, it was random -- since that is all the program does. So random runs of 8 zeros in a row do happen.
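How often should we expect such streaks? This Python sketch (the function name and seeds are mine; like the Matlab program later on this page, it allows at most one home run per game) simulates full seasons and records the longest homerless streak in each:

```python
import random

def longest_homerless_streak(p=70/162, n_games=162, seed=0):
    """Longest run of consecutive games with no home run, in one
    simulated season where each game independently contains a home
    run with probability p."""
    rng = random.Random(seed)
    longest = current = 0
    for _ in range(n_games):
        if rng.random() < p:      # home run in this game
            current = 0
        else:                     # no home run: the streak grows
            current += 1
            longest = max(longest, current)
    return longest

streaks = [longest_homerless_streak(seed=s) for s in range(100)]
with_8_or_more = sum(1 for s in streaks if s >= 8)
# Roughly half of the simulated seasons contain a "slump" of 8 or
# more homerless games in a row, with no change in ability at all.
```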
It is possible, even likely, that when Mark McGwire hit his record 70, his expected number (based on his quality of play) was only 60, but because of the way random numbers work, he was lucky and the number fluctuated up to 70. Of course, the same thing might have been true of Ruth and Maris. (Some of us think that the only reason Mickey Mantle didn't set the old record is that he never had that luck.) With an expected number of 60, the standard deviation was sqrt(60) = 7.7. About 5% of the time he would be more than two standard deviations away, i.e. below 60 - 15.4 = 44.6 or above 60 + 15.4 = 75.4. So if you have the skill to get 60, you might be lucky and actually get 75!
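That possibility can be quantified by simulation (a Python sketch; the parameters and seed are my own choices): how often does a player whose expected season total is 60 actually reach 70?

```python
import random

def chance_of_reaching(target=70, expected=60, n_games=162,
                       n_seasons=2000, seed=3):
    """Fraction of simulated seasons in which a player with the given
    expected season total actually reaches the target, counting at
    most one home run per game (the same simplification as before)."""
    p = expected / n_games
    rng = random.Random(seed)
    reached = sum(
        sum(rng.random() < p for _ in range(n_games)) >= target
        for _ in range(n_seasons))
    return reached / n_seasons

fraction = chance_of_reaching()
# A player whose true ability is 60 home runs reaches 70 or more in
# a few percent of seasons -- rare, but far from impossible.
```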
Update (on 9/27). As of today, Bonds has 67 homeruns. At his current rate, he should hit 4 more. But that really means 4 ± 2. So the odds are about 2/3 that he will hit between 4-2 = 2 and 4+2 = 6 more homeruns. I apologize if that doesn't give great insight into what is going to happen. Statistics do have value, but only limited value.
Update (September 2002). If you are not a baseball fan, you might not actually know how it all turned out. Bonds hit 73 homeruns in 2001.
For those of you who are interested in the computer program, I attach it below. You can buy a student version of Matlab at the Scholar's Workstation, or run it on a campus computer that already has it installed. Statements preceded by the % symbol are comments only; they don't do anything, but are there to remind me how the program works. The actual program runs using only the 6 lines that are deeply indented.
% Matlab program
% 70 home runs in a 162 game season.
% prob of hitting a homerun in any one game is 70/162.
% generate 162 random numbers, one for each game:
      game = rand(162,1);
% find the random numbers that are less than 70/162
% and count those as homeruns:
      hr = find(game<70/162);
% find the number of homeruns hit:
      number = length(hr);
% display the results (each 1 is a game with a home run):
      record = zeros(162,1);
      record(hr) = 1;
      record'