Lecture 16: F-test for Lack-of-Fit

Earlier we alluded to the fact that regression is the smoothed line of averages at each level of X. We can formally test this with an F-test. The question is whether our assumption that the model is linear is an appropriate one. The test has its limitations, however. It requires more than one Y observation for one or more $X_j$, and it only tests the suitability of the fitted regression relationship.

You can use the Joe MacDonald forestry farm data to test for lack of fit. (Try it; we will use the Acacia data set as an example here.)

Suppose there are c distinct Xs. $X_j$ (j = 1, 2, …, c) has $n_j$ observations of Y, denoted by $Y_{ij}$ (i = 1, 2, …, $n_j$). That is, the total number of observations is given by

\[ n = \sum_{j=1}^{c} n_j \]

Assumptions of lack-of-fit test

$Y_{ij}$ (i = 1, 2, …, $n_j$) for a given $X_j$ are independent.
$Y_{ij}$ for a given $X_j$ are normally distributed.
$Y_{ij}$ for a given $X_j$ have constant variance $\sigma^2$.

Recall that

\[ SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

In the light of multiple Y observations this modifies to

\[ SSE = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_j)^2 \]

Suppose the regression model being tested is $E\{Y|X\} = \beta_0 + \beta_1 X$, i.e. that this is a linear model. We do not know $E\{Y|X\}$ but estimate it by

\[ \hat{Y} = b_0 + b_1 X \]

As we have multiple Y observations for some X, we can estimate $\sigma^2$ from the $Y_{ij}$ for each $X_j$. Let

\[ \bar{Y}_j = \frac{\sum_{i=1}^{n_j} Y_{ij}}{n_j} \]

We can separate out two sums of squares.

SSPE, the variation in the $Y_{ij}$ around their mean $\bar{Y}_j$. That is,

\[ SSPE = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 \]

and the associated degrees of freedom are

\[ \sum_{j=1}^{c} (n_j - 1) = \sum_{j=1}^{c} n_j - \sum_{j=1}^{c} 1 = n - c \]

SSLF, the variation of the $\bar{Y}_j$s around the $\hat{Y}_j$s. That is,

\[ SSLF = \sum_{j=1}^{c} n_j (\bar{Y}_j - \hat{Y}_j)^2 \]

and the associated degrees of freedom with only c different $X_j$s is c − 2.

It can be shown algebraically that SSE = SSPE + SSLF, which also holds for the degrees of freedom. That is, (n − 2) = (n − c) + (c − 2).

Dividing the sums of squares by the corresponding degrees of freedom,

\[ MSPE = \frac{SSPE}{n - c} \quad \text{and} \quad MSLF = \frac{SSLF}{c - 2} \]
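The decomposition SSE = SSPE + SSLF can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `lack_of_fit_sums` and the eight data points are hypothetical and chosen only for illustration (they are not the Acacia data). It fits the straight line by least squares, groups the Ys by distinct X, and returns the sums of squares, mean squares and F* = MSLF / MSPE.

```python
from collections import defaultdict


def lack_of_fit_sums(x, y):
    """Decompose SSE of a straight-line fit into pure error and lack of fit."""
    n = len(x)

    # Least-squares estimates b0 and b1 for Y-hat = b0 + b1 * X
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Collect the Y observations at each distinct X level
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    sspe = 0.0  # variation of Y_ij around the group mean
    sslf = 0.0  # variation of the group means around the fitted values
    for xj, ys in groups.items():
        nj = len(ys)
        mean_j = sum(ys) / nj
        fit_j = b0 + b1 * xj
        sspe += sum((yij - mean_j) ** 2 for yij in ys)
        sslf += nj * (mean_j - fit_j) ** 2

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mspe = sspe / (n - c)  # requires n > c, i.e. at least one replicated X
    mslf = sslf / (c - 2)  # requires c > 2
    return {"sse": sse, "sspe": sspe, "sslf": sslf,
            "mspe": mspe, "mslf": mslf, "f_star": mslf / mspe}


# Hypothetical data: four X levels with two Y observations each
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.0, 1.4, 2.9, 3.3, 4.1, 4.5, 4.4, 4.8]
result = lack_of_fit_sums(x, y)
print(result)
```

On any data set with at least one replicated X, `result["sse"]` should equal `result["sspe"] + result["sslf"]` up to floating-point rounding.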
MSPE is an unbiased estimator of the error variance ($\sigma^2$), no matter what is the true nature of the regression function.

If the true relationship between Y and X is linear, the $\bar{Y}_j$s would be very close to the $\hat{Y}_j$s, that is, both MSPE and MSLF would be unbiased estimators of $\sigma^2$. (Note: this is analogous to MSB and MSW in the analysis of variance test.)

If the $\bar{Y}_j$s are 'substantially' different from the $\hat{Y}_j$s, as may be seen in Fig 4.4, we have lack-of-fit, meaning MSLF > $\sigma^2$, since

\[ E\{MSLF\} = \sigma^2 + \frac{\sum_{j=1}^{c} n_j \left[ E\{\bar{Y}_j\} - (\beta_0 + \beta_1 X_j) \right]^2}{c - 2} \]

[Fig 4.4: height plotted against dbh]

This lack-of-fit is tested by

\[ F^* = \frac{MSLF}{MSPE} \]

Formally, the hypothesis being tested here is

\[ H_0: E\{Y\} = \beta_0 + \beta_1 X \qquad H_a: E\{Y\} \neq \beta_0 + \beta_1 X \]

with significance level = $\alpha$. Reject $H_0$ if

\[ F^* > F(1 - \alpha;\ c - 2,\ n - c) \]

Important Points

Whereas before with the t-test we wanted to reject the H0, here we want to conclude H0, i.e. that a linear relationship exists. Otherwise our assumption that the process between X and Y is a linear one is violated, which renders all our results useless.

Rejection of H0 does not imply that there is, or is not, any relationship between X and Y, only that there is no linear relationship.

Multiple Ys at all levels of X are not necessary. We can still do a lack-of-fit test, but it requires multiple Ys for one or more sets of Xs.

The lack-of-fit test generalizes to multiple linear regression.

Example

Data used here is from 15 Acacia trees with three heights corresponding to each of five diameters. Thus, we have n = 15, c = 5, $n_j$ = 3.

The height diameter plot and the fitted regression line are in Fig 4.5a. The residuals have been plotted in Fig 4.5b.

[Fig 4.5a: height plotted against dbh, with the fitted regression line]
[Fig 4.5b: residuals (resids) plotted against dbh]
Table 4.1 lists the data used in this example, together with the corresponding $\bar{Y}_j$ and $\hat{Y}_j$ values. The individual $Y_{ij}$ and $e_{ij}$ entries are not legible in this copy, so only the group-level columns are shown.

Table 4.1

Xj    nj    Ybar_j    Yhat_j
 3     3     5.200      6.24
 5     3     8.467      7.56
 7     3     9.500      8.88
 9     3    10.400     10.20
11     3    10.833     11.52

SPSS Output

SPSS was used to fit the linear regression line to the 'ht' 'dbh' data. The regression equation is

ht = 4.26 + 0.660 dbh

Coefficients (a)

                 Unstandardized Coefficients    Standardized Coefficients
Model 1          B         Std. Error           Beta                         t        Sig.
(Constant)       4.260     .683                                              6.241    .000
DBH               .660     .090                 .897                         7.300    .000

a. Dependent Variable: HEIGHT

The ANOVA table is produced from SPSS.

ANOVA Table (d)

HEIGHT * DBH                                     Sum of Squares    df    Mean Square    F            Sig.
Between Groups   (Combined)                      60.671             4    15.168          34.842      .000
                 Linearity (a)                   52.272             1    52.272         120.074      .000
                 Deviation from Linearity (b)     8.399             3     2.800           6.431 (e)  .011
Within Groups (c)                                 4.353            10      .435
Total                                            65.024            14

a. This is the regression sums of squares
b. This is the Lack of Fit Sums of Squares
c. This is the Pure Error Sums of Squares
d. Statistics, Compare Means, Means, Options
e. This is our LACK of FIT test statistic

Pure error test: F* = 6.43, P = 0.0106

To use SPSS for the Lack of Fit test go to: STATISTICS, Compare Means, Means, Options, Test for Linearity.

What is the interpretation of the significance value 0.011?

Recall that we reject the H0 if F* > Ftable and accept the alternative that the model is not linear.

At the 98% confidence level, F* is 6.431 and Ftable is 5.22. Since F* > Ftable we reject the H0 and conclude that the relationship is not linear.

At the 99% confidence level, F* is 6.431 and Ftable is 6.55. Since F* < Ftable we do not reject H0 and conclude that the relationship is linear.

Another way to say this is: the evidence against a linear relationship is strong enough to reject H0 at the 98% confidence level, but not at the 99% level. Why? Because F* lies beyond the Ftable value for 98%, in the rejection region, but inside the Ftable value for 99%, in the H0 region. Equivalently, the significance value 0.011 lies between 0.01 and 0.02.
We get different values of Ftable depending on the confidence level we choose.
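The arithmetic behind the example can be retraced from the group-level values alone. This is a rough sketch under stated assumptions: the variable names are illustrative, the fitted values are recomputed from the rounded coefficients 4.26 and 0.660, and SSPE = 4.353 is taken directly from the Within Groups row of the SPSS table rather than recomputed from the raw heights. Because of the rounding, SSLF and F* agree with the SPSS figures only to about two decimal places. The critical values 5.22 and 6.55 are the Ftable values quoted above.

```python
# Group-level values from the Acacia example (n = 15, c = 5, n_j = 3)
x_levels = [3, 5, 7, 9, 11]
group_means = [5.200, 8.467, 9.500, 10.400, 10.833]
n_per_group = 3
n, c = 15, 5

# Rounded regression coefficients from the SPSS output (assumption: rounding
# makes the results below approximate)
b0, b1 = 4.26, 0.660

# Pure error sum of squares, read from the Within Groups row of the ANOVA table
sspe = 4.353

# Lack-of-fit sum of squares: n_j * (group mean - fitted value)^2, summed over j
fitted = [b0 + b1 * xj for xj in x_levels]
sslf = sum(n_per_group * (m - f) ** 2 for m, f in zip(group_means, fitted))

mslf = sslf / (c - 2)
mspe = sspe / (n - c)
f_star = mslf / mspe

# Ftable values quoted in the notes for F(1 - alpha; c - 2, n - c) = F(.; 3, 10)
f_table = {"98%": 5.22, "99%": 6.55}
decisions = {level: f_star > crit for level, crit in f_table.items()}

print(f"SSLF = {sslf:.3f}, MSLF = {mslf:.3f}, MSPE = {mspe:.3f}, F* = {f_star:.2f}")
for level, reject in decisions.items():
    verdict = "reject H0" if reject else "do not reject H0"
    print(f"{level}: {verdict}")
```

The same F* leads to opposite decisions at the two confidence levels, which is what a significance value between 0.01 and 0.02 means.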