Legacy Documentation

Mathematica® Teacher's Edition (2002)

This is documentation for an obsolete product.

Fitting a line to data

Example: women's 100-meter freestyle

Introducing the data

Below are the winning times (in seconds) for the women's 100-meter freestyle swimming competition at the Olympic Games. To make the data easier to see, we plot them with the ListPlot command.

Clear[womens100meter,womens100meterplot]
womens100meter = {{1912,82.2},{1920,73.6},{1924,72.4},
    {1928,71.0},{1932,66.8},{1936,65.9},{1948,66.3},
    {1952,66.8},{1956,62.0},{1960,61.2},{1964,59.5},
    {1968,60.},{1972,58.59},{1976,55.65},{1980,54.79},
    {1984,55.92},{1988,54.93},{1992,54.64},{1996,54.50}};
womens100meterplot = ListPlot[womens100meter,
    PlotStyle->{PointSize[.02],Red},
    AxesLabel->{"year","time in seconds"}];

Source: "Summer Olympic Games Champions 1896-1996", in The World Almanac and Book of Facts 1998, Mahwah, NJ: World Almanac Books, 1997, 856.

Notice that the point where the axes cross is not the point (0, 0).

A good observer might ask: "Why are some of the years missing?"
The answer is simple: the Olympics are held only every four years, and during World War I and World War II, no Olympics were held.
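
The gaps can be checked directly from the data. The following is a small sketch, assuming the cell that defines womens100meter above has been evaluated; the names years and gaps are introduced here only for illustration and are not part of the original courseware.

```mathematica
(* years and gaps are illustrative names, not from the original text *)
Clear[years,gaps]
years = Map[First,womens100meter];
(* number of years between each Olympics in the data and the next *)
gaps = Drop[years,1] - Drop[years,-1]
```

Most entries are 4. The entry 8 marks the missing 1916 Games, and the entry 12 marks the missing 1940 and 1944 Games.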

The data seem to indicate that the winning times have gotten shorter and shorter as we get closer to the present. It also seems reasonable to predict that future winning times are going to be even shorter.

Predicting the year 2000

Try to predict the winning times for the women's 100-meter freestyle swimming competition in the Olympic Games in the years 2000 and 2004!

One possible answer might go like this:

Show[womens100meterplot];

If we look at the plot above, it seems that the points cluster along a straight line. So, let's use Mathematica to find the line that best approximates the data. The command is called Fit, and it takes as input the data, followed by {1,x},x. For more on the Fit command, see its entry in the Mathematica documentation.

Clear[womens100meterline]
womens100meterline[x_] = Fit[womens100meter,{1,x},x]
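
To see what Fit is doing here, the slope and intercept of the least-squares line can also be computed from the usual closed-form formulas. This is only a sketch, assuming womens100meter is defined as above and x has no value; the names years, times, n, sx, sy, sxx, sxy, slope and intercept are made up for this illustration.

```mathematica
(* all names below are illustrative, not from the original text *)
Clear[years,times,n,sx,sy,sxx,sxy,slope,intercept]
years = Map[First,womens100meter];
times = Map[Last,womens100meter];
n = Length[womens100meter];
sx = Apply[Plus,years];      (* sum of the years *)
sy = Apply[Plus,times];      (* sum of the times *)
sxx = years.years;           (* sum of the squared years *)
sxy = years.times;           (* sum of the year*time products *)
slope = (n*sxy - sx*sy)/(n*sxx - sx^2);
intercept = (sy - slope*sx)/n;
intercept + slope*x
```

The last line should agree with the output of Fit up to rounding.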

Here we see how the line looks together with the data.

Clear[lineplot,plotwithline]
lineplot = Plot[womens100meterline[x],{x,1912,1996},
    DisplayFunction->Identity];
plotwithline = Show[womens100meterplot,lineplot,
    DisplayFunction->$DisplayFunction];

Seems to fit reasonably well. Here are the predictions we get from our line function for the 2000 and 2004 Olympic Games.

womens100meterline[2000]
womens100meterline[2004]

That says the winning times in the next two Olympics will be about 50.66 seconds and 49.51 seconds. This seems to be a drastic improvement over the winning time of 54.50 seconds at the 1996 Olympics, so these predicted times might be too fast.
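
One way to judge whether such a drop is plausible is to compare it with how much the winning time actually changed from one Olympics to the next. This is a minimal sketch, assuming womens100meter and womens100meterline are defined as above; times and changes are names made up for this illustration.

```mathematica
(* times and changes are illustrative names, not from the original text *)
Clear[times,changes]
times = Map[Last,womens100meter];
(* change in winning time from each Olympics in the data to the next *)
changes = Drop[times,1] - Drop[times,-1]
(* the five most recent changes *)
Take[changes,-5]
(* drop that the line predicts between 1996 and 2000 *)
Last[times] - womens100meterline[2000]
```

Negative entries in changes are improvements. Setting the most recent entries next to the predicted drop gives a rough sense of whether the prediction is in line with recent history.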

Room for improvement

Let's take a closer look at predicting the winning time for the women's 100-meter freestyle swimming competitions in the Olympic Games in the years 2000 and 2004. Here again is the plot of the data with the line.

Show[plotwithline];

Notice the very first data point. It stands out and seems to be off the trend. Maybe the very first Olympic 100-meter freestyle swim meet was not as competitive as the event was in the later years, and so the time was slower than it could have been. Maybe we can improve things by just ignoring the first data point.
Note that we have bundled all the Mathematica code together.

Clear[womens100meterwithpointdeleted,
    womens100meterwithpointdeletedline,
    womens100meterwithpointdeletedplot,     
    newlineplot,newplotwithline]
womens100meterwithpointdeleted =
    {{1920,73.6},{1924,72.4},
    {1928,71.0},{1932,66.8},{1936,65.9},{1948,66.3},
    {1952,66.8},{1956,62.0},{1960,61.2},{1964,59.5},
    {1968,60.},{1972,58.59},{1976,55.65},{1980,54.79},
    {1984,55.92},{1988,54.93},{1992,54.64},{1996,54.50}};
    
womens100meterwithpointdeletedline[x_] =
    Fit[womens100meterwithpointdeleted,{1,x},x]
    
womens100meterwithpointdeletedplot =
    ListPlot[womens100meterwithpointdeleted,
    PlotStyle->{PointSize[.02],Red},
    DisplayFunction->Identity];
    
newlineplot = Plot[womens100meterwithpointdeletedline[x],
    {x,1912,1996},DisplayFunction->Identity];
    
newplotwithline = Show[womens100meterwithpointdeletedplot,
    newlineplot,DisplayFunction->$DisplayFunction];

Seems to fit reasonably well. Here are the predictions we get from our new line function for the 2000 and 2004 Olympic Games.

womens100meterwithpointdeletedline[2000]
womens100meterwithpointdeletedline[2004]

That says the winning times will be about 51.49 seconds and 50.46 seconds. The previous answers were 50.66 seconds and 49.51 seconds, so the new predictions are slower than the old ones by almost a second. Even so, they are still a drastic improvement over the winning time of 54.50 seconds at the 1996 Olympics. So both of our predictions are probably too fast.

Here is the plot of the data, along with both lines.

Show[{womens100meterplot,lineplot,newlineplot}];

The steeper line is the one fitted to all the data points. For the other line, the upper-left point (the 1912 time) was excluded.
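
The two slopes can also be compared numerically instead of by eye. A short sketch, assuming both line functions are defined as above and x has no value:

```mathematica
(* slope of each best-fit line, in seconds per year *)
D[womens100meterline[x],x]
D[womens100meterwithpointdeletedline[x],x]
```

Both slopes are negative, and the one with the larger absolute value belongs to the steeper line. This is consistent with the predictions above, which fall by about 1.15 seconds and about 1.03 seconds between 2000 and 2004.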

Problems with fitting a line to data

Does the data allow a reasonable line fit?

The two previous attempts to predict the winning times for the women's 100-meter freestyle swimming competitions in the next two Olympics were mathematically as good as a line fit can make them. However, predictions like this have some fundamental weaknesses, which we will discuss in detail below in the section "Some of the problems in predicting from data."

When we look at the women's 100-meter freestyle swimming times, it seems that these data do fit a line reasonably well, with the possible exception of the first point. Here is the data together with the best fit line again.

Show[plotwithline];

Often data do not fit a line well, but may fit some other curve reasonably well. In these cases do not attempt to fit a line to the data. If you do, your predictions will not be supported by the data. You can find an example of such an attempt in the next section.

A poor line fit: U.S. population between 1790 and 1990

Here are some data on the U.S. population, based on the census that is done every 10 years.

Clear[uspop,popdataplot]
uspop =
{{1790,3.9},{1800,5.3},{1810,7.2},{1820,9.6},
{1830,12.9},{1840,17.1},{1850,23.2},{1860,31.4},
{1870,38.6},{1880,50.2},{1890,63.0},{1900,76.2},
{1910,92.2},{1920,106.0},{1930,123.2},
{1940,132.2},{1950,151.3},{1960,179.3},
{1970,203.3},{1980,226.5},{1990,248.7}};

popdataplot = ListPlot[uspop,
    AxesLabel->{"year","population in millions"},
    PlotStyle->{Red,PointSize[0.03]}];

Source: "U.S. Population by Official Census 1790-1990", in The World Almanac and Book of Facts 1998, Mahwah, NJ: World Almanac Books, 1997, 380-381.

We could go through the same steps as in the women's 100-meter swim example and fit a line to the data, but this would not be a good idea. These data clearly do not increase linearly (that is, along a line); instead the population increases slowly at first and then at a faster and faster rate.
We can detect this easily by taking a ruler and holding it against the first and last data points. This results in the following picture.

ruler=Graphics[Line[{{1790,3.9},{1990,248.7}}]];
Show[ruler,popdataplot,Axes->True];

We see that all data points except the first and last are below the line. The whole picture looks like a bow and the ruler is the string of the bow.
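
The ruler test can also be done by counting. The following is a sketch under the assumption that uspop is defined as above; chord, interior and below are names invented for this illustration. It uses only the points between the first and the last, because those two lie on the ruler by construction.

```mathematica
(* chord, interior and below are illustrative names, not from the original text *)
Clear[chord,interior,below]
(* the "ruler": the line through the first and last data points *)
chord[x_] = 3.9 + (248.7 - 3.9)/(1990 - 1790)*(x - 1790);
(* all data points except the first and the last *)
interior = Take[uspop,{2,-2}];
(* interior points that lie strictly below the ruler *)
below = Select[interior,(#[[2]] < chord[#[[1]]])&];
{Length[below],Length[interior]}
```

If every interior point lies below the ruler, the two counts agree.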
Here is the graph of the data and the best fit line.

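Clear[popdataline,popdatalineplot,popplotwithline]
popdataline[x_] = Fit[uspop,{1,x},x];
popdatalineplot = Plot[popdataline[x],
    {x,1790,1990},DisplayFunction->Identity];
(* separate name, so the swimming plot stored in plotwithline is not overwritten *)
popplotwithline = Show[popdataplot,popdatalineplot,
    DisplayFunction->$DisplayFunction];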

We see that this is not a good fit, because the points are not randomly spread around the best fit line. On the left and right they are above the line, and in the center they are below the line. Compare this with the better fit of the line to the women's 100-meter freestyle data we have seen before. In the women's 100-meter freestyle swim example, the red points are spread above and below the line with no clear pattern.
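
The pattern described above can be made visible by listing, for each data point, whether it lies above or below the best fit line. This is a rough sketch, assuming uspop, popdataline, womens100meter and womens100meterline are defined as above; signs is a helper name invented for this illustration.

```mathematica
(* signs is an illustrative helper, not from the original text *)
Clear[signs]
(* +1: point above the line, -1: point below the line *)
signs[data_,line_] := Map[Sign[#[[2]] - line[#[[1]]]]&,data]
signs[uspop,popdataline]
signs[womens100meter,womens100meterline]
```

Long unbroken runs of the same sign suggest a systematic pattern, while signs that switch back and forth are what we hope to see for a reasonable line fit.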
So what can we do in such a case? As noted earlier, such data may fit some other curve reasonably well.

Some of the problems in predicting from data

Does a prediction make sense?

If we look at the first example, the women's 100-meter freestyle, it is clear that our best fit line, which has negative slope, will predict shorter and shorter times.

This is what we expected. But we will never be as fast as a dolphin, so using the line to predict Olympic times into the far future is nonsense. Also, the line we get will eventually hit the x-axis, which would represent a time of zero seconds. How much sense does this make?
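
To find the year in which the line predicts a winning time of zero seconds, we can ask Mathematica where the line crosses the x-axis. A one-line sketch, assuming womens100meterline is defined as above and x has no value:

```mathematica
(* year at which the best-fit line reaches a time of zero seconds *)
Solve[womens100meterline[x] == 0,x]
```

Beyond that year the line predicts negative winning times, which makes even less sense.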

If a function fits data extremely well, does this mean it is suitable for predictions in the long run?
No, it does not mean this at all.

In order to have confidence in a long-term prediction based on a fit-function, we need an underlying theory that explains why things should behave as our fit-function predicts. A good fit could mean that such a theory exists; sometimes it is a clue to go and look for one. As long as we cannot come up with such a theory, we cannot have much confidence in our fit-function.

Is the data set chosen in an impartial manner, and is the collection of data points large enough?

To improve the fit in our first example of the women's 100-meter freestyle, we left off the first data point. This is a standard trick used by all sorts of people and agencies. In order to make their arguments more convincing, they present you with a data set whose points have been chosen to make their arguments look good. But there may be no good reason for preselecting data points.
A data point should only be deleted if there is some convincing reason to do so. That things fit better without that data point is not a convincing reason.
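
To get a feeling for how much a prediction can depend on which points are kept, one can repeat the fit with each data point left out in turn. This is only an illustrative sketch, not part of the original courseware: it assumes womens100meter is defined as above and x has no value, and leaveoneout is a name made up here.

```mathematica
(* leaveoneout is an illustrative name, not from the original text *)
Clear[leaveoneout]
(* prediction for 2000 with the i-th data point removed, for each i *)
leaveoneout = Table[
    Fit[Drop[womens100meter,{i}],{1,x},x] /. x->2000,
    {i,1,Length[womens100meter]}];
{Min[leaveoneout],Max[leaveoneout]}
```

The first entry of leaveoneout corresponds to leaving out the 1912 point, as was done earlier. The spread between the smallest and largest entries shows how far the prediction can be moved simply by choosing which single point to drop.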