25 The top row of the block will now show the values of the slope and intercept, the
second row the corresponding standard deviations, the third will contain the square of
the correlation coeﬃcient and the variance in y, then follow the value of the statistical
function F and the number of degrees of freedom and, in the bottom row, the sums of
squares of the residuals. The function LINEST provides a lot of information very
quickly, albeit in a rather cryptic, unlabeled format.
26 To summarize, the argument syntax of the LINEST( ) function is: (y-array, x-array, do you want the intercept?, do you want the statistical information?)
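The ten numbers LINEST returns are just the standard least-squares formulas. The following Python sketch (with made-up sample data, assumed only for illustration) computes the same 5-row by 2-column block, which may help decode the unlabeled layout:

```python
import math

# Hypothetical sample data; as in the text, all errors are assumed to be in y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope = sxy / sxx                        # top row, left cell
intercept = my - slope * mx              # top row, right cell

df = n - 2                               # fourth row: degrees of freedom
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = slope * sxy                     # bottom row: regression and residual SS
var_y = ss_resid / df                    # variance of the fit
s_slope = math.sqrt(var_y / sxx)         # second row: standard deviations
s_intercept = math.sqrt(var_y * sum(xi * xi for xi in x) / (n * sxx))
r2 = ss_reg / (ss_reg + ss_resid)        # third row: r squared and the fit's sy
F = ss_reg / var_y                       # fourth row: F, next to df

print([[slope, intercept], [s_slope, s_intercept],
       [r2, math.sqrt(var_y)], [F, df], [ss_reg, ss_resid]])
```

Entering the same two arrays in LINEST with both optional arguments set to TRUE should reproduce these ten values, in this arrangement.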
27 If you only want the values of a0 and a1, just highlight two adjacent cells in the same row, while still using Ctrl+Shift+Enter. If you specify a block two cells wide but fewer than five rows high, the values of a0 and a1 will be repeated in all rows. On the other hand, if you assign too large a block, the results will be fine, but the unused spaces will be filled with the error message #N/A.
28 Also compare your results with those from the Regression routine you can find under Data Analysis ⇒ Regression. This routine provides even more statistical information than the LINEST function, and labels it, but takes more time to execute and does not automatically update itself when the input data are changed. Regression can also make graphs.
29 This time, leave the box Constant is Zero blank. (For reasons that will become clear in exercise 2.8, place the output from the regression routine in cell J13.) Locate the places where Regression lists its estimates for a0 and a1, and for the corresponding standard deviations s0 and s1.
30 Verify that the difference between a0 and its true value is of the order of magnitude (i.e., within a factor of two or three) of s0, and that the difference between a1 and its true value is of the order of s1. That is, of course, the significance of these standard deviations: they provide estimates of how close our ‘best values’ come to the true values, provided that all deviations are random and follow a single Gaussian distribution.
31 Save the spreadsheet, and close it.
In the above examples, we started from a precisely known expression
such as y = a0 + a1x, added Gaussian noise, and then extracted from the data
the estimates a0,calc and a1,calc. This allowed us to judge how closely we can
reconstruct the true values of a0 and a1. In practice, however, the experimenter
has no such luxury, since the true parameter values are generally not
known, so that we will only have parameter estimates. There is then little
need to distinguish between the true parameters and their estimates, so that
from now on the subscripts calc will be deleted whenever that can be done
without introducing ambiguity.
Finally, a word of caution. In science, we usually have theoretical models to
provide a basis for assuming a particular dependence of, say, y on x. Least-
squares methods are designed to fit experimental data to such ‘laws’, and to
give us some idea of the goodness of their fit. They are at their best when we
have a large number of data points, justifying the statistical approach.
However, they do not guarantee that the assumed model is the correct one.
Always plot the experimental data together with the curve ﬁtted through
them, in order to make a visual judgment of whether the assumed model
applies. And always plot the residuals, because such a plot, by removing the
main trend of the data, is usually more revealing of systematic rather than
random deviations than a direct comparison of experimental and reconstituted
data. A plot of the residuals may show the presence of non-random
trends, in which case the model chosen may have to be reconsidered.
Least-squares methods are usually favored over more subjective methods
of ﬁtting experimental data to a mathematical function, such as eyeballing
or using French curves. However, the least-squares method is not entirely
objective either: one still has to make the choice of model to which to ﬁt the
data. Least-squares ﬁtting of data to a function gives the best ﬁt to that
chosen function. It is your responsibility to select the most appropriate func-
tion, preferably based on a theoretical model of the phenomenon studied.
Absent theoretical guidance, one is most often led by Occam’s parsimony
rule according to which, all else being equal, the simplest of several satisfac-
tory models is considered preferable.
Looking at the data
It is always useful to inspect the data visually, as plotted in a graph, rather
than to just let the computer analyze them. The four data sets shown in Table
2.8-1 were carefully crafted by Anscombe (Am. Statist. 27#2 (1973) 17) to
illustrate this point. Below we will fit all four data sets to a line y = a0 + a1x,
with the usual assumption that all errors reside in y.
Instructions for exercise 2.8
1 Recall the spreadsheet Line of the previous exercise. We will now make a copy of it for use in this exercise.
2 Right-click on the name tab Line, then click on Move or Copy. This will open the Move
or Copy dialog box.
3 In this dialog box, click on Create a Copy, and in the window Before Sheet click on the spreadsheet you want to copy (here: Line). Click OK.
4 Automatically, the new copy will be called Line (2). Rename it Anscombe.
5 Because we will not need any artificial noise, we simply set the noise amplitude in cell C2 to zero. The noise in C9:C58 then does not affect the analysis.
6 Enter the data from Table 2.8-1 in a block to the right of the region already used, e.g., in
J1:Q11. Note that columns for X are the same for the ﬁrst three data sets, so that you can
copy them to save time and eﬀort.
7 Import the first data set into the data analysis region, e.g., into block A9:B19. To do this, activate block J1:K11, copy it with Ctrl+c, activate cell A9, and paste using Ctrl+v.
8 Now erase all data in A20:H58, because these were not overwritten. Bingo! You will see
the least-squares analysis of the newly reported data set in row 2. A quick check: make
sure that N in cell C5 shows the proper value, 11.
9 Note down the values of the parameters a0 and a1, and of their standard deviations.
10 Verify your results with the Regression routine (which you can place, say, in cell J13),
and also note down some of the other statistical parameters, such as the correlation
coeﬃcient (‘Multiple R’) and its square (‘R Square’).
11 Plot the data, and their residuals, On this Sheet. This will allow you to see all results
(two plots each from four data sets) simultaneously, on the very same sheet. In order to
accommodate eight graphs on one sheet, use the methods described in the second
paragraph of section 1.10.
12 Now analyze the second data set, by copying block L1:M11 to A9. Row 2 will immediately
provide the new values for the parameters a0 and a1 (although you might
hardly notice it, because they will be quite similar). On the other hand, the Regression
routine does not automatically update, and must therefore be called in again.
Since you have noted down the results of the earlier regression analysis, just overwrite them, placing the new results over the old ones in J13.
13 Note down the parameters, and plot the data and their residuals. Place the new graphs
close to the earlier ones.
14 Repeat this process until you have analyzed all four data sets.
15 List the numerical results obtained.
Introduction to statistics
Table 2.8-1: Four sets of x,y data pairs.

      Data set # 1     Data set # 2     Data set # 3     Data set # 4
       x      y         x      y         x      y         x      y
      10.0   8.04      10.0   9.14      10.0   7.46       8.0   6.58
       8.0   6.95       8.0   8.14       8.0   6.77       8.0   5.76
      13.0   7.58      13.0   8.74      13.0  12.74       8.0   7.71
       9.0   8.81       9.0   8.77       9.0   7.11       8.0   8.84
      11.0   8.33      11.0   9.26      11.0   7.81       8.0   8.47
      14.0   9.96      14.0   8.10      14.0   8.84       8.0   7.04
       6.0   7.24       6.0   6.13       6.0   6.08       8.0   5.25
       4.0   4.26       4.0   3.10       4.0   5.39      19.0  12.50
      12.0  10.84      12.0   9.13      12.0   8.15       8.0   5.56
       7.0   4.82       7.0   7.26       7.0   6.42       8.0   7.91
       5.0   5.68       5.0   4.74       5.0   5.73       8.0   6.89
The four data sets in this problem were selected by Anscombe (1973) to
have the same values for their slopes a1 (= 0.50), their intercepts a0 (= 3.00),
the sums of the squares of their residuals ∑(y – ycalc)², the resulting
variances (= 1.53) and standard deviations s0 (= 1.12), as well as their correlation
coefficients r (= 0.816). By all these criteria, then, they fit the same
equation of a straight line equally well. However, visual examination of the
graphs, or of the residual plots, yields a quite diﬀerent answer: only set #1
reasonably ﬁts a straight line. In other words, the statistical analysis in terms
of ‘summary’ statistics does not address the validity of the assumed model,
and can produce results regardless of whether or not the model is appropri-
ate. In the present case, the ﬁrst data set reasonably ﬁts the assumed model
of a linear dependence, the second set should be ﬁtted to a parabola instead
of to a line, while the third and fourth sets both contain an intentional
‘outlier’. Direct observation of the graphs before the analysis, and/or of the
residuals afterwards, can often help us reject clearly inappropriate models.
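The near-identical summary statistics are easy to verify outside the spreadsheet. The following Python sketch uses the Anscombe (1973) values as commonly tabulated, and fits each set to a straight line:

```python
import math

# Anscombe (1973) data; sets 1-3 share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def fit(x, y):
    """Unweighted least-squares line: returns intercept a0, slope a1, and r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = my - a1 * mx
    r = sxy / math.sqrt(sxx * syy)
    return a0, a1, r

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    a0, a1, r = fit(x, y)
    print(round(a0, 2), round(a1, 2), round(r, 3))
```

All four fits return essentially the same a0, a1, and r, even though (as plotting immediately shows) only the first set actually follows a straight line.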
The take-home message of this example is that a quick look at the original
data (and, for more subtle diﬀerences, a quick look at the residuals) is often
much more informative than the numerical values of the statistical analysis.
A glance at the graphs can reveal a trend in the deviations from the model,
and may suggest that the model used is inappropriate to the data, thereby
sending you back to reconsidering the theory behind the phenomenon
studied, or the method used to acquire the data. That theory may need to be
modiﬁed or extended before you can beneﬁt from a statistical data analysis,
or a defect in the data acquisition method may have to be corrected. As is
said in computer jargon, ‘garbage in, garbage out’, i.e., the quality of the
results obtained by any computer analysis depends on the quality of the
input data and, we might add here, on the appropriateness of the analysis
model used. If the input data are of poor quality, no amount of statistical
analysis can make them whole. Likewise, if the model used is inapplicable to
the data, statistics cannot help. Just imagine trying to ﬁt the coordinates of
the numbers on a clock face to a straight line.
In short: it is seldom useful merely to analyze data without a thorough
understanding of what the model used means or implies, and without a
visual veriﬁcation that the model is appropriate to the experimental data.
Mindless application of statistical rules is no guarantee for valid conclu-
sions. Always graph your data and their residuals; it may make you recon-
sider the experiment, or its interpretation.
What is ‘normal’?
So far we have used the Gaussian distribution as our paradigm of experi-
mental imprecision, because Excel makes it so readily available. The
Gaussian distribution has indeed become the norm in much of science, as
reﬂected in the fact that it is often called the normal distribution. The reason
for its popularity, however, should be clearly understood: it is not that the
Gaussian distribution is always, or even most often, followed by experimen-
tal data. We usually do not take the time to test the nature and distribution of
the experimental imprecision, and the assignment is therefore more often
based on hope or faith than on experimental evidence. There are several
reasons that explain why the Gaussian distribution is so popular: it provides
a convenient mathematical model, and it is usually close enough to approxi-
mate the actual distribution of errors or experimental imprecision.
Moreover, a Gaussian distribution provides an optimistic estimate, because
the assumption of a symmetrical distribution makes the imprecision look
small. It also helps that a Gaussian distribution is quite compact: it has fewer
far-out points than, e.g., a Lorentzian distribution would have. However, the
following two examples should give us pause.
Fig. 2.9-1: The fraction f of molecules in an ‘ideal’ gas at a given speed v, in meters per second.
Fig. 2.9-2: The fraction f of molecules in an ‘ideal’ gas at a given speed v, as in Fig. 2.9-1,
and (colored curve) its mirror image (mirrored around the maximum value of f ) to
emphasize the asymmetry of this distribution.
The kinetic theory of an ideal gas leads directly to an expression for the so-
called root-mean-square velocity vrms of the gas molecules, viz.

vrms = √(3kT/m) = √(3RT/M)    (2.9-1)

where k is the Boltzmann constant, T is the absolute temperature, m is the
molecular mass, R is the gas constant and M is the molecular weight in
Daltons. The Maxwell–Boltzmann distribution also provides an expression
for the fraction of the gas molecules at any particular velocity,

f(v) = 4π (m/2πkT)^(3/2) v² exp(–mv²/2kT)    (2.9-2)
Figure 2.9-1 shows that a plot of f(v) versus v according to (2.9-2) has the
general shape of a bell-shaped curve, and therefore resembles a Gaussian
distribution. However, a more careful comparison indicates that the curve
representing (2.9-2) is not quite Gaussian. Speciﬁcally, the distribution of
molecular velocities is asymmetrical, with a ‘tail’ at high velocities. This
asymmetry is more apparent in Fig. 2.9-2, where a colored line shows the
same curve in reverse, mirrored around its maximum. For a symmetrical
curve, such a mirror image would overlay the original curve; for an asym-
metrical curve, it emphasizes the asymmetry. The parameters used in Figs.
2.9-1 and Fig. 2.9-2 are speciﬁed in the spreadsheet exercise.
Here, then, we have the simplest of theoretical situations (so that nobody
can argue that we did not take a statistically valid sample) involving random
thermal motion, and already we ﬁnd that the velocities of the gas molecules
only approximately follow a Gaussian distribution.
‘Unfair,’ I hear you mutter under your breath, ‘velocities have a lower limit
of zero but lack an upper limit, so it is no wonder that they exhibit an asym-
metrical distribution.’ True enough, but this also applies to many other
quantities, such as absolute temperature, mass, concentration, and absorbance.
The point here is not why a number of distributions are decidedly
non-Gaussian, but that they are.
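The asymmetry is easy to exhibit numerically. A Python sketch of (2.9-1) and (2.9-2), here evaluated in SI units (M = 30 Da taken as 0.030 kg mole⁻¹), compares f at equal distances on either side of the maximum:

```python
import math

R = 8.3143        # gas constant, J mole-1 K-1
T = 300.0         # absolute temperature, K
M = 0.030         # molecular weight of 30 Da, expressed in kg mole-1

v_rms = math.sqrt(3 * R * T / M)    # equation (2.9-1)
v_peak = math.sqrt(2 * R * T / M)   # most probable speed, the maximum of f(v)

def f(v):
    """Maxwell-Boltzmann speed distribution, equation (2.9-2), with kT/m = RT/M."""
    a = M / (R * T)
    return 4 * math.pi * (a / (2 * math.pi)) ** 1.5 * v * v * math.exp(-a * v * v / 2)

# Asymmetry: equal steps on either side of the maximum give unequal f,
# with the larger value on the high-speed side (the 'tail').
print(v_peak < v_rms, f(v_peak + 100) > f(v_peak - 100))   # → True True
```

Note that the most probable speed lies below the root-mean-square speed, another symptom of the skew toward high velocities.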
Instructions for exercise 2.9
1 The solid curve in Fig. 2.9-1 shows (2.9-2) for R = 8.3143 J mole–1 K–1 (1 J = 1 m2 kg s–2), T = 300 K and M = 30 Da (1 Da = 1 g mole–1) so that m = M/NA ≈ 5.0 × 10–26 kg. The value of M is appropriate for air molecules: M = 28 Da for N2, M = 32 Da for O2.
2 Open a spreadsheet.
3 In cell A1 deposit a label, such as v.
4 In cell A3 deposit the value 0, in cell A4 the instruction =A3+0.5, then copy this instruction and paste it in cells A5:A83. This will generate v = 0 (0.5) 40, i.e., numbers in the range from 0 to 40 with increments of 0.5.
5 In column B calculate the corresponding values for f(v) as a function of v.
6 By inspection, ﬁnd the approximate location of the maximum in the curve; it should lie
close to v=13.
7 Temporarily expand the v-scale around that maximum in order to get a better esti-
mate. In the present example it is suﬃcient to expand the scale between v=12 and
v= 14 with increments of 0.1.
8 You will then see that the curve has a maximum close to v=12.9.
9 Make the intervals symmetrical around that maximum, e.g., by using in A3 the value
0.4 instead of 0, so that column A now contains v=0.4 (0.5) 40.4.
10 Copy the contents of column B for f(v) and, with Paste Special, paste its values in column C, but starting below where column B ends, i.e., in cell C85.
11 In A85:A136 compute 25.9 (-0.5) 0.4, by depositing the value 25.9 in A85, and the
instruction=A85-0.5 in cell A86, and by copying this instruction down to row 136.
12 Highlight the block A3:C136, then plot the contents of columns B and C versus that of column A.
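The mirror construction of steps 9 through 12 can be mimicked outside the spreadsheet. In the sketch below, M is left in Da (as in the spreadsheet columns) so that the speed axis runs over the same 0 to 40 scale used in the exercise; this is a numerical convenience, not a consistent unit choice:

```python
import math

R, T, M = 8.3143, 300.0, 30.0   # M left in Da, matching the spreadsheet's 0-40 scale

def f(v):
    # same shape as (2.9-2), with M/(RT) playing the role of m/kT
    a = M / (R * T)
    return 4 * math.pi * (a / (2 * math.pi)) ** 1.5 * v * v * math.exp(-a * v * v / 2)

v_max = math.sqrt(2 * R * T / M)                 # analytic maximum, close to 12.9
vs = [0.4 + 0.5 * i for i in range(81)]          # v = 0.4 (0.5) 40.4, as in step 9
mirror = [(2 * v_max - v, f(v)) for v in vs]     # same f at mirrored abscissas

print(round(v_max, 1))   # → 12.9
```

Plotting the pairs in `mirror` on top of f(v) itself reproduces the colored curve of Fig. 2.9-2: the two curves coincide only near the maximum, exposing the asymmetry.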
Our second example will show another reason why we should be careful
with the assumption of a ‘normal’ distribution. Consider the weight of
pennies. We usually assume that all pennies are born equal, but some may
experience more wear than others (reducing their weights somewhat) while
others may have been oxidized more (thereby increasing their weights).
There is no a priori reason to assume that the weight loss by abrasion will be
the same as the weight gain by oxidation (since abrasion and oxidation are
rather independent processes), and therefore there is no reason to assume
that the ﬁnal distribution will be symmetrical, as in a Gaussian distribution.
But that is not the main point we want to make here. When you actually
weigh individual pennies, you will ﬁnd that most of them weigh about 2.5 g,
but once in a while you may encounter a penny that weighs well over 3 g. Say
you weigh 10 pennies, and their weights are as follows: 2.5136 g, 2.5208 g,
2.5078 g, 2.4993 g, 2.5042 g, 2.5085 g, 2.5136 g, 3.1245 g, 2.5219 g, and
2.5084 g. What weight should you report?
You might just use (2.2-1) and (2.2-2) and calculate the weight as 2.57 ±
0.19 g. Or you might reason that the eighth measurement is so far oﬀ that it
should be excluded as an ‘outlier’, in which case you would obtain 2.511 ±
0.007 g. The former result would seem to be the more honest one, because
the heavy penny does not look much diﬀerent from the other ones, and
reweighing conﬁrms the various weights. On the other hand, disregarding
the heavy penny yields a result that certainly ‘looks’ much better, because it
has a considerably smaller standard deviation. Which of these options
should you choose?
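Both candidate results follow directly from the mean and sample standard deviation of (2.2-1) and (2.2-2); a quick Python check on the ten listed weights:

```python
from statistics import mean, stdev

# the ten penny weights quoted in the text, in grams
weights = [2.5136, 2.5208, 2.5078, 2.4993, 2.5042,
           2.5085, 2.5136, 3.1245, 2.5219, 2.5084]

m_all, s_all = mean(weights), stdev(weights)       # keeping the heavy penny
no_heavy = [w for w in weights if w != 3.1245]
m9, s9 = mean(no_heavy), stdev(no_heavy)           # discarding it as an 'outlier'

print(round(m_all, 2), "+/-", round(s_all, 2))     # → 2.57 +/- 0.19
print(round(m9, 3), "+/-", round(s9, 3))           # → 2.511 +/- 0.007
```

The single heavy penny inflates the standard deviation by a factor of nearly thirty, which is exactly why neither blind inclusion nor blind rejection is a substitute for understanding where the outlier comes from.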
Neither choice is correct. Look more carefully at the pennies, and you will
ﬁnd that the heavy one was minted in or before 1982, the lighter ones after