
Figure 6.2: Linear and quadratic fits to the physics data
> g2 <- lm(crossx ~ energy + I(energy^2), weights=sd^-2, strongx)
> summary(g2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 183.830 6.459 28.46 1.7e-08
energy 0.971 85.369 0.01 0.99124
I(energy^2) 1597.505 250.587 6.38 0.00038
Residual standard error: 0.679 on 7 degrees of freedom
Multiple R-Squared: 0.991, Adjusted R-squared: 0.989
F-statistic: 391 on 2 and 7 degrees of freedom, p-value: 6.55e-08
> 0.679^2*7
[1] 3.2273
> 1-pchisq(3.2273,7)
[1] 0.86321
This time we cannot detect a lack of fit. Plot the fit:
> x <- seq(0.05,0.35,by=0.01)
> lines(x,g2$coef[1]+g2$coef[2]*x+g2$coef[3]*x^2,lty=2)
The curve is shown as a dashed line on the plot (thanks to lty=2). This seems clearly more appropriate
than the linear model.
6.2 σ² unknown

The σ̂² that is based on the chosen regression model needs to be compared to some model-free estimate
of σ². We can do this if we have repeated y for one or more fixed x. These replicates do need to be
truly independent. They cannot just be repeated measurements on the same subject or unit. Such repeated
measures would only reveal the within subject variability or the measurement error. We need to know the
between-subject variability; this reflects the σ² described in the model.
The "pure error" estimate of σ² is given by SS_pe/df_pe where

    SS_pe = Σ_{distinct x} Σ_{given x} (y_i - ȳ)²

with ȳ the mean of the replicates at that x. The degrees of freedom are

    df_pe = Σ_{distinct x} (#replicates - 1)
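These sums can be computed directly in R. A minimal sketch, assuming only a pair of vectors x and y with some replicated x-values (the data here are made up purely for illustration):

```r
# illustrative data: three distinct x-values, two of them replicated
x <- c(1, 1, 2, 2, 2, 3)
y <- c(5.1, 4.9, 7.2, 7.0, 7.4, 9.3)

# SSpe: within-group squared deviations about each group mean
SSpe <- sum(tapply(y, x, function(v) sum((v - mean(v))^2)))
# dfpe: (#replicates - 1) summed over the distinct x-values
dfpe <- sum(tapply(y, x, length) - 1)
SSpe/dfpe   # pure-error estimate of sigma^2: 0.1/3 = 0.0333
```

Note that a distinct x with only one observation contributes nothing to either sum.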
If you fit a model that assigns one parameter to each group of observations with fixed x, then the σ̂² from
this model will be the pure-error σ̂². This model is just the one-way ANOVA model, if you are familiar with
that. Comparing this model to the regression model amounts to the lack-of-fit test. This is usually the most
convenient way to compute the test but if you like we can then partition the RSS into that due to lack of fit
and that due to the pure error as in Table 6.1.
                 df              SS           MS                             F
Residual         n - p           RSS
Lack of Fit      n - p - df_pe   RSS - SS_pe  (RSS - SS_pe)/(n - p - df_pe)  Ratio of MS
Pure Error       df_pe           SS_pe        SS_pe/df_pe
Table 6.1: ANOVA for lack of fit
Compute the F-statistic and compare it to F(n - p - df_pe, df_pe); reject if the statistic is too large.
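Table 6.1 can be assembled from quantities R computes readily. A sketch, assuming a straight-line fit on data with replicates (continuing the same illustrative x and y, not data from the text):

```r
x <- c(1, 1, 2, 2, 2, 3)
y <- c(5.1, 4.9, 7.2, 7.0, 7.4, 9.3)
g <- lm(y ~ x)                # model of interest

RSS   <- deviance(g)          # residual SS on n - p df
dfres <- df.residual(g)       # n - p
SSpe  <- sum(tapply(y, x, function(v) sum((v - mean(v))^2)))
dfpe  <- sum(tapply(y, x, length) - 1)

# lack-of-fit F: ratio of the two mean squares in Table 6.1
Flof <- ((RSS - SSpe)/(dfres - dfpe)) / (SSpe/dfpe)
1 - pf(Flof, dfres - dfpe, dfpe)   # p-value for lack of fit
```

This hand computation agrees with anova(g, lm(y ~ factor(x))), which is the more convenient route described below.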
Another way of looking at this is a comparison between the model of interest and a saturated model that
assigns a parameter to each unique combination of the predictors. Because the model of interest represents
a special case of the saturated model where the saturated parameters satisfy the constraints of the model of
interest, we can use the standard F-testing methodology.
The data for this example consist of thirteen specimens of 90/10 Cu-Ni alloys with varying iron content
in percent. The specimens were submerged in sea water for 60 days and the weight loss due to corrosion
was recorded in units of milligrams per square decimeter per day. The data come from Draper and Smith
(1998).
We load in and print the data
> data(corrosion)
> corrosion
Fe loss
1 0.01 127.6
2 0.48 124.0
3 0.71 110.8
4 0.95 103.9
5 1.19 101.5
6 0.01 130.1
7 0.48 122.0
8 1.44 92.3
9 0.71 113.1
10 1.96 83.7
11 0.01 128.0
12 1.44 91.4
13 1.96 86.2
We fit a straight line model:
> g <- lm(loss ~ Fe, corrosion)
> summary(g)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 129.79 1.40 92.5
Fe -24.02 1.28 -18.8 1.1e-09
Residual standard error: 3.06 on 11 degrees of freedom
Multiple R-Squared: 0.97, Adjusted R-squared: 0.967
F-statistic: 352 on 1 and 11 degrees of freedom, p-value: 1.06e-09
Check the fit graphically; see Figure 6.3.
> plot(corrosion$Fe,corrosion$loss,xlab="Iron content",ylab="Weight loss")
> abline(g$coef)
Figure 6.3: Linear fit to the Cu-Ni corrosion data. Group means denoted by black diamonds
We have an R2 of 97% and an apparently good fit to the data. We now fit a model that reserves a
parameter for each group of data with the same value of x. This is accomplished by declaring the predictor
to be a factor. We will describe this in more detail in a later chapter.
> ga <- lm(loss ~ factor(Fe), corrosion)
The fitted values are the means in each group - put these on the plot:
> points(corrosion$Fe,ga$fit,pch=18)
We can now compare the two models in the usual way:
> anova(g,ga)
Analysis of Variance Table
Model 1: loss ~ Fe
Model 2: loss ~ factor(Fe)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 11 102.9
2 6 11.8 5 91.1 9.28 0.0086
The low p-value indicates that we must conclude that there is a lack of fit. The reason is that the pure-error
sd, √(11.8/6) = 1.4, is substantially less than the regression standard error of 3.06. We might investigate
models other than a straight line, although no obvious alternative is suggested by the plot. Before considering
other models, I would first find out whether the replicates are genuine; perhaps the low pure-error SD can
be explained by some correlation in the measurements. Another possible explanation is that an unmeasured
third variable is causing the lack of fit.
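The quoted figures tie back to Table 6.1 and can be checked directly from the rounded sums of squares in the anova output above:

```r
RSS  <- 102.9   # residual SS of the straight-line fit, 11 df
SSpe <- 11.8    # pure-error SS of the factor model, 6 df
sqrt(SSpe/6)                   # pure-error sd, about 1.40
((RSS - SSpe)/5) / (SSpe/6)    # lack-of-fit F, about 9.26 from these
                               # rounded SS (R printed 9.28 from full precision)
```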
When there are replicates, it is impossible to get a perfect fit. Even when there is a parameter assigned
to each group of x-values, the residual sum of squares will not be zero. For the factor model above, the R2
is 99.7%. So even this saturated model does not attain a 100% value for R2. For these data, it's a small
difference but in other cases, the difference can be substantial. In these cases, one should realize that the
maximum R2 that may be attained might be substantially less than 100%, and so perceptions about what a
good value for R2 is should be downgraded appropriately.
These methods are good for detecting lack of fit, but if the null hypothesis is accepted, we cannot
conclude that we have the true model. After all, it may be that we just did not have enough data to detect the
inadequacies of the model. All we can say is that the model is not contradicted by the data.
When there are no replicates, it may be possible to group the responses for similar x but this is not
straightforward. It is also possible to detect lack of fit by less formal, graphical methods.
A more general question is how good a fit do you really want? By increasing the complexity of the
model, it is possible to fit the data more closely. By using as many parameters as data points, we can fit
the data exactly. Very little is achieved by doing this since we learn nothing beyond the data itself and any
predictions made using such a model will tend to have very high variance. The question of how complex a
model to fit is difficult and fundamental. For example, we can fit the mean responses for the example above
exactly using a sixth order polynomial:
> gp <- lm(loss ~ Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5)+I(Fe^6), corrosion)
Now look at this fit:
> plot(loss ~ Fe, data=corrosion, ylim=c(60,130))
> points(corrosion$Fe,ga$fit,pch=18)
> grid <- seq(0,2,len=50)
> lines(grid,predict(gp,data.frame(Fe=grid)))
as shown in Figure 6.4. The fit of this model is excellent; for example:
> summary(gp)$r.squared
[1] 0.99653
but it is clearly ridiculous. There is no plausible reason why corrosion loss should suddenly drop at 1.7 and
thereafter increase rapidly. This is a consequence of overfitting the data. This illustrates the need not to
become too focused on measures of fit like R2.
Figure 6.4: Polynomial fit to the corrosion data
Chapter 7
Diagnostics
Regression model building is often an iterative and interactive process. The first model we try may prove
to be inadequate. Regression diagnostics are used to detect problems with the model and suggest
improvements. This is a hands-on process.
