Sie sind auf Seite 1von 7

CURVE FITTING AND THE METHOD OF LEAST SQUARES Introduction.

The full utilization of quantitative data typically invol ves the fitting of the data to a mathematical model. Fitting is usually perform ed for one of two purposes. On the one hand, the scientist wishes to determine a physical quantity and a measure of its uncertainty from experimental data when there is a well-established relationship between variables. For example, Ohm's Law (V = IR) states that the voltage V across a resistor is directly proportio nal to the current I through it. Consequently one can use measured values of V and I to determine the best value of the resistance R, a parameter. On the othe r hand, the goal might be to establish a mathematical relationship between a dep endent variable such as the toxicity of chlorinated hydrocarbons and independent variables such as their chemical and physical properties. A wide range of tool s has been developed to solve these problems. The most frequently employed tech nique is the method of least squares, which will be discussed in this section an d used in General Chemistry. We shall only discuss in General Chemistry the met hod of linear-least-squares, sometimes referred to as linear regression or simpl y least squares because the applications in the lecture and laboratory will be l imited to those cases in which there is a linear relationship, i.e. y = mx + b, between a single dependent variable and a single independent variable. Addition al fitting procedures for more complicated problems are addressed in Chemistry 1 60. Rationale of the method of linear least squares. Suppose that a student measures V in Volts and I in Ampere and wishes to determine the resistance in O hms from the data (c.f. Table 1). The traditional approach would involve constr ucting a graph of V versus I, drawing the "best" line through the points, and ca lculating the slope of the graph, V/ I, which is the required value of R. This tim e honored procedure suffers from a number of deficiencies. First, what is meant by "best" is not clearly defined and the act of drawing the line through the po ints is subject to bias, particularly if the experimental data are noisy. Furth ermore, the freehand procedure does not yield an estimate of the uncertainty of the slope. In careful work we require an objective procedure that reduces bias on the part of the researcher and yields a quantitative measure of the uncertain ty of the parameters and the validity of the model. Statisticians have shown th at the elementary method of least squares yields the best results if two conditi ons often satisfied in chemical experiments are met: 1) The principal source of error in the experiment is in the values of the dependent variable. 2) The error s in the dependent variable are random and fit a normal or Gaussian distribution . Even if these conditions are not met exactly, a careful comparison with other techniques has shown that the method of least squares usually yields acceptable results when applied to a carefully prepared set of chemical data. Table 1. V versus I for a Hypothetical Resistor V(Volt) 5.2 I(Ampere) 9.8 0.10 15.4 0.20 20.1 0.30 24.6 0.40 0.50

Two attempts at drawing a freehand graph (Figures 1 and 2) demonstrate t he basis for the method. In Figure 1, the line has clearly been poorly drawn wi th the result that most of the experimental points are far from the line. In co ntrast, the superior graph in Figure 2 is drawn so that the points are close to the line. Exact agreement should not be expected. The demon noise is always wi th us and a perfect fit is either evidence for fraudulent data or egregious abus e of fitting procedures. Note, however, in Figure 2 that the deviations of the points from the line, sometimes called the residuals, are small and that the sig ns of the deviations are randomly distributed so that there is no pattern to whi ch points are above the line and which are below. Clearly the well-trained draf tsman draws a line which minimizes these deviations or some function of the devi ations. Theorems from statistics show that optimum results, i.e. the best-fit l

ine, is obtained when the sum of the squares of the deviations, S, is minimized, provided that the assumptions outlined above are satisfied. Hence, the techniq ue is referred to as the method of least squares. Furthermore, there is a bonus in the approach. Statisticians have also derived equations for the standard de viations of the parameters and other useful quantities that will be discussed in the following treatment of an illustrative example.

Illustrative Example and Use of Excel. In General Chemistry we shall us e the Analysis Tool Package add-in to Microsoft Excel to perform the linear regr ession calculations. In this section we shall use an illustrative example which will include a discussion of the results produced by Excel. This example will be demonstrated in class. Following this example is a second one that will be a ssigned as homework and will be due at the beginning of the following lab period . Our first example is a set of student measurements in the gas phase of t he temperature, pressure, volume, and mass of an unknown gas that displays small deviations from ideality. The goal is to obtain the molecular weight of the ga s and a measure of its non-ideality. Before jumping in, some algebra is require d. It is well known that an ideal gas obeys the equation pV = nRT from which on e can readily show that pV/nRT = 1. Non-ideality can be modeled by adding a ter m linear in the pressure p, yielding equation (1) pV/nRT = 1 + Bp (1) where the coefficient B is called the virial coefficient. There is nothing myst erious about this equation. It is a simple example of the Maclaurin theorem in mathematics that states that all well behaved functions appear linear over a sma ll range of the independent variable, p in our case. One more step is required, as we do not know the molecular weight. From stoichiometry we know that n = m/ M so with a bit of manipulation one can quickly derive equation (2): pV/mRT = (B/M)p + 1/M (2) Equation 2 states that for real gases a plot of pV/mRT, the dependent variable, versus p, the independent variable, is linear, i.e. an equation of the form y = mx + b, with a slope (B/M) and an intercept, 1/M. Note that a bit of mathematic s must be performed to put the data into a linear format. In the laboratory, we shall provide the identity of the dependent and independent variables that are linearly related. In a new situation, the researcher may have to derive such a relationship. In our case, the student collected samples of the gas in a 1.1153 liter bulb and obtained the data tabulated in the first three columns of Table 2. As a first step the pressures are converted from torr to atm and the dependent vari able pV/mRT is calculated, yielding columns 4 and 5 in the table and the Excel s preadsheet. Table 2. Gas Density Data for an Unknown Gas p(torr) T(K) m(g) mole/g) 46.6 296.2 0.5504 1 76.3 296.2 0.8906 3 p(atm) 0.06118 0.10040 pV/mRT ( 0.00510 0.00517

101.6 144.1 155.8

296.1 297.0 297.2

1.1780 1.6511 1.7871

0.13368 0.18961 0.20500

0.005209 0.005255 0.005246

When the Excel regression (linear least squares) routine is run with pV/ mRT as the dependent (y) variable and p as the independent variable, the informa tion presented on the previous page is obtained. The format of the presentation has been edited somewhat so that the data would fit on one page. The significa nce of the numbers in the report is given below. (We will demonstrate how to run this analysis in class. Detailed written instru ctions for the homework example are also provided in the following pages.) 1) SUMMARY OUTPUT. The information in this section, which is appended to the ri ght of the spreadsheet, indicates how well the data fit the linear model. Multip le R provides the linear correlation coefficient, R, which tests whether the data support the hypothesis that there is a relationship between the dependent varia ble and the independent variable. If there is no relationship at all, R is zero . If there is a statistically significant relationship, R is close in magnitude to 1. The sign of R is given by the sign of the slope. In our case, R is 0.96 2 and the hypothesis that the gas deviates measurably from ideality is supported . The square of R, which is 0.925 in this example, has a straightforward interp retation. R2 gives that fraction of the variation in the dependent variable tha t is explained by the model. If R2 were a small number, we would have to reject the results of the entire calculation and consider another model. Some statist icians adjust R in the case of a small sample size and calculate the adjusted R (third entry in the output); we shall not use this statistic. The fourth entry, the standard error or the standard deviation of the residuals, which is 1.99 x 10-5, measures the goodness of fit. This quantity, like all standard deviations , is based on the deviations of the data points from the line. It is calculated by dividing the sum of the squares of the deviations by the number of degrees o f freedom and by taking the square root of this quotient. The number of degrees of freedom equals the number of data points (the last entry in the section) min us the number of parameters. We determined two parameters, a slope and an inter cept, in this case so the number of degrees of freedom is 5 - 2 = 3. A small st andard error indicates that the model is good and the errors in the values of th e dependent variable are small. The standard error is also used to determine wh ether the rejection of a suspect data point is justified. If a datum is contami nated by systematic error, the standard error will decrease significantly when t he suspected datum is removed from the dataset. In close cases where bias migh t influence the decision to retain or reject a datum, one uses a quantitative pr ocedure called the F test, which is based on the standard error. 2) ANOVA. ANOVA stands for analysis of variance. The most important statistic in this section is the F value, 38.8359. This statistic is used to determine if the proposed relationship is statistically significant. The column labeled sig nificance of F is the probability that the observed fit could be generated by ran dom means alone. Note that this probability, 0.009878, is a small number. Any probability larger than 0.05 is considered to be unacceptably large. A new mode l for the relationship or better data should be sought in such cases. 3) Coefficients. Assuming that we have a statistically significant fit, the num bers that we shall normally use will be found here. The row labeled Intercept con tains statistics for the intercept unless the line is forced through the origin. In this case, which you use in the colorimetric analysis experiment, N/A (not apply) will appear in the columns of this row. The first numerical column in th e Intercept row gives the value of the intercept, 0.005059 g/mole. The second num ber yields the standard deviation or error of the intercept, 2.45 x 10-5. Note that the standard deviation of the intercept is a small fraction of the intercep

t so the intercept is a well-defined statistic. Recall that we use the standard deviation of a parameter to determine the number of significant digits in the v alue to be reported. If one compares the two numbers, one notes that the 10-5th place is the first digit with significant error. This is the least significan t digit (marked with an * below) and all digits to the right are insignificant. * intercept 0.005059 standard deviation of the intercept 0.0000245 Hence the value is rounded off to 0.00506. To obtain the 95% confidence interva l of the intercept, we multiple the standard deviation of the intercept by the S tudents t which is 3.182 for 5 data, 2 parameters, and 3 degrees of freedom and o btain (3.182)(2.45 x 10-5) = 7.8 x 10-5. The final result is reported as (5.06 0. 078) x 10-3 mole/g. Recall that our goal was the determination of the gas molecu lar weight, which we obtain by taking the reciprocal of the intercept. With the use of the calculus, one can calculate a standard deviation for the molecular w eight from the standard deviation of the intercept. This derivation which is ca lled a propagation of errors analysis is covered in Physics 51 and Chemistry 160 . The column labeled P-value is the probability that a non-zero value of t he intercept could be generated by random means alone. Note that P is very smal l; if P were 0.05 or greater, there would be a 5% or greater chance that the res ults are spurious and we would have to reject the analysis. The final data in t he row yield the actual interval defined by the value of the intercept and the 9 5% confidence interval, i.e. the intercept minus the 95% CI and the intercept pl us the 95% CI. The next row labeled by the dependent variable name provides the same in formation for the slope. That is, m = 0.00010 mole/g-atm. Note that the relati ve standard deviation of the slope is much larger than that of the intercept and the slope can only be reported to 2 significant digits. The 95% CI of the slop e is calculated by multiplying the Students t by the standard deviation of the sl ope, (3.182)(0.000165) = 0.00053 so the result is reported as 0.00010 0.00056 mol e/g-atm. The slope is as not well determined. This should come as no surprise. Gases deviate very slightly from ideal-gas behavior and measuring small quanti ties is a difficult task. 4) RESIDUAL OUTPUT. This section contains a table of the predicted values of th e dependent variable and the residuals. The predicted values are calculated usin g the equations, the least-square parameters, and the value of the independent v ariable. The residuals or deviations give the differences between the observed and calculated values of the dependent variable. A large residual marks a suspe ct datum. If all goes well, the signs of the residuals should be randomly distr ibuted. A pattern to these signs is an indication of systematic error and the n eed to modify the equation. Variations on the Theme and Extensions. More advanced packages such as NCSS which is used in Chemistry 160 and Mathematics 57 permit handling more than one independent variable as well as weighting the data points. Weighting is imp ortant if the data are of uneven quality. Our main focus in General Chemistry i s on fitting data to a known equation and determining parameters such as the abs orption coefficient in the colorimetric manganese experiment. In fields such as drug design, the focus shifts to developing models with predictive value and a different approach called cross-validated statistics is used. The NCSS package c an also solve problems in which the relationship between the dependent and indep endent variable is intrinsically non-linear. All of these advanced topics are c onsidered in Chemistry 160 with the aid of the NCSS package.

Least Squares Exercise Using Microsoft Excel. The following exercise illustrates the method of least squares and regression ca lculations with Excel. Complete the following exercises on your own and turn in the printouts and answers to the questions at the beginning of your next lab pe riod. 1) Log on to the campus network. Start the Excel software. It will automa tically open with a new spreadsheet. 2) In the first column, enter the label T_C in the first row and in rows 211 the following values of the temperature in degrees Centigrade: 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100. This will be the independent variable in the firs t model. Label the second column p_torr and use Appendix F in your laboratory man ual to enter in rows 2-12 the vapor pressure in torr of water at the 11 temperat ures defined above.

3) Label columns 3-6 T_K,1/T_K, p_atm, and ln(p_atm), respectively. Use the sp dsheet functions (cell definition, copy, and paste) to load columns 3-5 with the values of the absolute temperature, the inverse of the absolute temperature, th e vapor pressure in atmosphere, and the natural log of the vapor pressure in atm osphere. 4) Consider as the first model a linear dependence of vapor pressure on Cen tigrade temperature. Use the following set of instructions to perform a least s quares fit with vapor pressure in atm as the dependent variable and Centigrade t emperature as the independent variable. a) To access the Regression window, click on Tools in the spreadsheet comma nd line to access the Tools menu and then on Data Analysis. Select Regression f rom the Data Analysis menu and click on OK. (In the event that the regression f eatures have not been installed, further steps are required. To install the fea tures, click on Tools, then Add-Ins, and Analysis ToolPak and Analysis ToolPak-V BA, and OK. You may wish to seek assistance from a consultant at OIT in perform ing this step.) b) When the regression window appears, select the following options by clic king on the appropriate boxes: labels, confidence level, residuals, and line fit plot. Do not select the Constant Is Zero option. This option is used in cases w here the intercept is known to be zero, e.g. the Colorimetric Manganese experime nt. Select the values of the dependent or Y variable by first clicking in the f ield for Input Y Range and then selecting the spreadsheet column that contains t he values of the dependent variable and its label. Similarly select the values o f the independent or X variable by clicking in the field for Input X range and t hen selecting the spreadsheet column that contains the values of the independent variable and its label. In general, you dont have to use all the data in a colu mn. You can use spreadsheet techniques to choose the data selectively. c) You would like to have the put data. To achieve this result, field for Output Range and then in mpty. This will be the G1 cell in cation of the regression results. regression results on the same page as the in select the option Output Range, click in the the leftmost, top spreadsheet cell which is e your case. This step informs Excel of the lo

d) Initiate the calculations by clicking on OK. In a few seconds the regre ssion results and the graph will appear on the screen. e) You will probably want to edit the graph produced by Excel as it is not quite in the format required in your lab reports. The following steps will achi

eve this goal. i. You want the graph to fill or nearly fill a full page. Force Excel to s how the page boundaries by clicking on Page Setup in the File menu and then OK. Then move the graph into the first empty page and change its size to fill the p age. To this end, click on the graph and a frame will appear around its perimet er. Grab the framed graph with the mouse and move the graph to its new location . When the mouse is at one of the four corners, the sizing tool will appear whi ch can be used to change the size of the graph. ii. Add the best-fit line; Excel calls this the trendline. Activate the gra ph (chart in Excelese) for editing by clicking on it. Click on any one of the s ymbols marking a predicted value of the dependent variable. From the Chart menu , choose Add Trendline. A Trendline menu will now appear. Select the Linear tr endline on the Type tab. The best-fit line is drawn between the first and last points. To extend the line so that it covers the entire graph, access the Optio ns tab and adjust the forward and backward values of the Forecast parameter. Fo r example, if the X value of the leftmost point on the graph is 0.2 and you want to line to extend to X = 0, the backward forecast value should be 0.2. Click o n OK when you have set all the options and the best-fit line will be drawn. iii. In some case, you may wish to customize your graph further. Double clic k on the portion of the graph you'd like to edit and an options menu should appe ar. You can now edit the caption for the graph and the axes. By clicking on th e symbols for the data points in the legend box on the right, you can change the symbol type and color. I usually choose the largest possible symbol for the me asured values of the dependent variable and change the size and color of the sym bol for the calculated values so that the latter merges with the best-fit line. You can always exit the editing mode by clicking on an empty cell outside of th e graph. f) Once the graph has edited, print the spreadsheet by clicking the Print i tem under the File menu. g) Save the spreadsheet on your diskette or user space. Insert a formatted diskette in the a: drive. Click on the Save As item under the File menu. The hard drive always comes up as the default drive so you have to select the a: dri ve if you're saving to diskette. Also provide a file name and then execute the save by clicking on OK. Once you have saved a spreadsheet, you can always read it later by clicking on the Open item under the File menu. 5) Now that the calculations and graphics have been done, you can analyze t he results. Do the statistics support the conjecture that vapor pressure depend s on temperature? Is the linear model a good one? How do you know? What are t he least-squares values of the slope and intercept and the associated 95% confid ence intervals to the appropriate number of significant digits? Answer these qu estions directly on your printout or on a separate page and turn in with the pri ntout of your results. 6) Refinement of the model. We deliberately started with a poor model so t hat you could recognize the characteristics of a dog. One can show from thermodyn amics that the natural logarithm of vapor pressure depends linearly on the inver se of the absolute temperature so repeat the analysis using the better model. a) Re-open the Regression window (recall click on Tools, Data Analysis, and Regression) and select the natural log of the vapor pressure in atm as the depe ndent variable and the inverse of the absolute temperature as the independent va riable.

b) Repeat the regression calculations. All of the options in the Regressio n window have been preset in the previous calculation. After you click OK, you will be asked if you want to erase the old output. Choose a new output destinat ion cell to see the new analysis without overwriting the previous results. c) Save and print out your results for the second analysis.

d) Is the second model better? How do you know? As above, report the valu es of the slope and intercept with the associated 95% confidence intervals along with the answers to the above questions and turn in with the printout of your r esults. Dont forget units and significant digits. 7) When you are done, exit Excel by clicking on the Exit item under the Fil e menu. Log off of the network unless you want the next user to have access to your user space. least_sq.doc WES, 3 July 1997 updated, January 2, 2002, JMT

Das könnte Ihnen auch gefallen