The Least Squares Fit of a Straight Line

I. Theory

In many experiments there is a linear relationship between the measured variables. For example, the velocity of an object in free fall changes linearly with time, in the absence of air resistance. When we plot a set of data and find that it approximates a straight line, the next question is how to find the slope and the intercept of the line which seems to provide the best fit. If we were actually to measure speed versus time for a falling object (which we do in next week's lab), the data would be scattered around a straight line. We do not expect to find an exact fit because we know that the presence of random error causes this scatter away from an ideal straight line. We can find approximate values for slope and intercept by using a straight edge to draw a line which seems to "split the difference" between the scattered points. A more exact answer is given by a statistical analysis, which is described below. We will simply use the computer to do this analyis.

The process of finding this best fit proceeds as follows: If there were no random errors present, all of our experimental results y would fall exactly on a line given by the equation

filler image

This equation is our model for the data.  If we expect our data to have this relationship, then we must have a theoretical equation which takes this form, where y and x are variables and m and b are undetermined constants.  We will use Excel to fit this theory to the actual data and determine m and b.  Note that if your data does not look linear, perhaps it is not!

As an example, lets take the expected physical relationship for the velocity versus time for an object undergoing constant acceleration:

v = v0 +at

If we are measuring velocity at certain times, the relationship between v and t should be a linear one: this equation looks a lot like the equation for the straight line, above. If we were to make a graph of v vs. t, we would expect it to look like a straight line. We could then use Excel to fit the best possible straight line to that data, and tell us what the slope and y-intercept is.

In addition to computing the slope and intercept, Excel can perform the mathematical operations to give estimates of the standard errors in the slope and the intercept. These are given by the linest command or by using regression as described in the online help or the Excel Hints sheet provided. Provided your computations are based upon at least five data points, you may expect that the odds are approximately 9 out of 10 that the slope computed from an infinite number of measurements will fall within 2 standard errors of the slope you have obtained, and similarly for the intercept. (Since we have only one estimate of the slope and one estimate of the intercept, the standard error and the standard deviation are identical.) Thus, to 90% confidence, your slope is m + 2*Std Err of m and your intercept is b + 2* Std Err of b, where m, b, and the standard error of each are given by the analysis in Excel.

In order to use the computer program intelligently, keep in mind the following points:

  1. In your experimental procedure, fix one variable (x), then measure the other (y).
  2. You must enter at least three data points or the least squares procedure will not work. You need at least five data points to get approximate 90% confidence intervals in the slope and intercept.
  3. The x, y data you enter might not be simply the experimental values you have obtained. For example, if the relationship of interest is v2 = 2ax, and you are measuring v and x (position) to determine "a" from a straight-line fit, you would want v2 to be plotted on the y axis and x (position) on the x axis.
  4. No matter how wildly scattered your data may be, or even if the variables are not linearly related, the computer can always come up with a slope and an intercept which is a best fit in the least squares sense. It's a good idea to make at least a crude plot of your data to be sure that your chosen method of plotting does yield something reasonably close to a straight line. (The smaller the standard error in the slope, the closer your points come to fitting a straight line exactly.)

You can skip the following section on the background to how Excel makes the fit.

For a given value of x (denoted by xi), the value y that we actually obtain (denoted yi) will differ from the ideal (error-free) value of y by an amount given by

filler image

Based on the mathematics of the Gaussian distribution that we discussed in the first lab, the probability that this value of y occurs is given by

filler image.

A similar statement can be made for each of the y values we have obtained. Now, in general, the overall probability for the occurrence of successive events is given by the product of the individual probabilities for each event. The probability P that N measurements will yield the N experimental values of y that we have obtained is

filler image

Our problem then is this: find values for the slope m and intercept b which will make this probability as large as possible. In other words, find m and b such that the values of y which we have obtained from our measurements represent a set of values of y which is most likely to occur.

The maximum value of P will occur when the exponent has its minimum value, i.e., when the sum of the squares of the deviations of your measured points from the fitted line is a minimum. This minimum value may be found using standard techniques from the differential calculus, and occurs when m and b are given by

filler image,

filler image.

 

Pretty clearly, the computations involved in finding m and b will get tiresome even for a small number of experimental points. And just as clearly, the procedure can be carried out systematically by a computer. The use of the computer to find the best-fit line has already been demonstrated for you.  It is called a "trend-line" in Excel.


Department of Physics

Randolph College