Advanced Data Analysis 2

March 26, 2018 | Author: Juan Pablo Madrigal Cianci | Category: Analysis Of Covariance, P Value, Principal Component Analysis, Chi Squared Test, Student's T Test



Lecture notes for Advanced Data Analysis 2 (ADA2)
Stat 428/528, University of New Mexico
Erik B. Erhardt, Edward J. Bedrick, Ronald M. Schrader
Spring 2014

Contents

I Review of ADA1
  1 R statistical software and review
    1.1 R
    1.2 ADA1 Ch 0: R warm-up
    1.3 ADA1 Chapters 2, 4, 6: Estimation in one-sample problems
    1.4 ADA1 Chapters 3, 4, 6: Two-sample inferences
    1.5 ADA1 Chapters 5, 4, 6: One-way ANOVA
    1.6 ADA1 Chapter 7: Categorical data analysis
    1.7 ADA1 Chapter 8: Correlation and regression

II Introduction to multiple regression and model selection
  2 Introduction to Multiple Linear Regression
    2.1 Indian systolic blood pressure example
      2.1.1 Taking Weight Into Consideration
      2.1.2 Important Points to Notice About the Regression Output
      2.1.3 Understanding the Model
    2.2 GCE exam score example
      2.2.1 Some Comments on GCE Analysis
  3 A Taste of Model Selection for Multiple Regression
    3.1 Model
    3.2 Backward Elimination
      3.2.1 Maximum likelihood and AIC/BIC
    3.3 Example: Peru Indian blood pressure
      3.3.1 Analysis for Selected Model
    3.4 Example: Dennis Cook's Rat Data

III Experimental design and observational studies
  4 One Factor Designs and Extensions
  5 Paired Experiments and Randomized Block Experiments
    5.1 Analysis of a Randomized Block Design
    5.2 Extending the One-Factor Design to Multiple Factors
      5.2.1 Example: Beetle insecticide two-factor design
      5.2.2 The Interaction Model for a Two-Factor Experiment
      5.2.3 Example: Survival Times of Beetles
      5.2.4 Example: Output voltage for batteries
      5.2.5 Checking assumptions in a two-factor experiment
      5.2.6 A Remedy for Non-Constant Variance
    5.3 Multiple comparisons: balanced (means) vs unbalanced (lsmeans)
    5.4 Unbalanced Two-Factor Designs and Analysis
      5.4.1 Example: Rat insulin
    5.5 Writing factor model equations and interpreting coefficients
      5.5.1 One-way ANOVA, 1 factor with 3 levels
      5.5.2 Two-way ANOVA, 2 factors with 3 and 2 levels, additive model
      5.5.3 Two-way ANOVA, 2 factors with 3 and 2 levels, interaction model
  6 A Short Discussion of Observational Studies

IV ANCOVA and logistic regression
  7 Analysis of Covariance: Comparing Regression Lines
    7.1 ANCOVA
    7.2 Generalizing the ANCOVA Model to Allow Unequal Slopes
      7.2.1 Unequal slopes ANCOVA model
      7.2.2 Equal slopes ANCOVA model
      7.2.3 Equal slopes and equal intercepts ANCOVA model
      7.2.4 No slopes, but intercepts ANCOVA model
    7.3 Relating Models to Two-Factor ANOVA
    7.4 Choosing Among Models
      7.4.1 Simultaneous testing of regression parameters
    7.5 Comments on Comparing Regression Lines
    7.6 Three-way interaction
  8 Polynomial Regression
    8.1 Polynomial Models with One Predictor
      8.1.1 Example: Cloud point and percent I-8
    8.2 Polynomial Models with Two Predictors
      8.2.1 Example: Mooney viscosity
      8.2.2 Example: Mooney viscosity on log scale
  9 Discussion of Response Models with Factors and Predictors
    9.1 Some Comments on Building Models
    9.2 Example: The Effect of Sex and Rank on Faculty Salary
      9.2.1 A Three-Way ANOVA on Salary Data
      9.2.2 Using Year and Year Since Degree to Predict Salary
      9.2.3 Using Factors and Predictors to Model Salaries
      9.2.4 Discussion of the Salary Analysis
  10 Automated Model Selection for Multiple Regression
    10.1 Forward Selection
    10.2 Backward Elimination
    10.3 Stepwise Regression
      10.3.1 Example: Indian systolic blood pressure
    10.4 Other Model Selection Procedures
      10.4.1 R² Criterion
      10.4.2 Adjusted-R² Criterion, maximize
      10.4.3 Mallows' Cp Criterion, minimize
    10.5 Illustration with Peru Indian data
      10.5.1 R², R̄², and Cp Summary for Peru Indian Data
      10.5.2 Peru Indian Data Summary
    10.6 Example: Oxygen Uptake
      10.6.1 Redo analysis excluding first and last observations
  11 Logistic Regression
    11.1 Generalized linear model variance and link families
    11.2 Example: Age of Menarche in Warsaw
    11.3 Simple logistic regression model
      11.3.1 Estimating Regression Parameters via LS of empirical logits
      11.3.2 Maximum Likelihood Estimation for Logistic Regression Model
      11.3.3 Fitting the Logistic Model by Maximum Likelihood, Menarche
    11.4 Example: Leukemia white blood cell types
    11.5 Example: The UNM Trauma Data
      11.5.1 Selecting Predictors in the Trauma Data
      11.5.2 Checking Predictions for the Trauma Model
    11.6 Historical Example: O-Ring Data

V Multivariate Methods
  12 An Introduction to Multivariate Methods
    12.1 Linear Combinations
    12.2 Vector and Matrix Notation
    12.3 Matrix Notation to Summarize Linear Combinations
  13 Principal Component Analysis
    13.1 Example: Temperature Data
    13.2 PCA on Correlation Matrix
    13.3 Interpreting Principal Components
    13.4 Example: Painted turtle shells
      13.4.1 PCA on shells covariance matrix
      13.4.2 PCA on shells correlation matrix
    13.5 Why is PCA a Sensible Variable Reduction Technique?
      13.5.1 A Warning on Using PCA as a Variable Reduction Technique
      13.5.2 PCA is Used for Multivariate Outlier Detection
    13.6 Example: Sparrows, for Class Discussion
    13.7 PCA for Variable Reduction in Regression
  14 Cluster Analysis
    14.1 Introduction
      14.1.1 Illustration
      14.1.2 Distance measures
    14.2 Example: Mammal teeth
    14.3 Identifying the Number of Clusters
    14.4 Example: 1976 birth and death rates
      14.4.1 Complete linkage
      14.4.2 Single linkage
  15 Multivariate Analysis of Variance
  16 Discriminant Analysis
    16.1 Canonical Discriminant Analysis
    16.2 Example: Owners of riding mowers
    16.3 Discriminant Analysis on Fisher's Iris Data
  17 Classification
    17.1 Classification using Mahalanobis distance
    17.2 Evaluating the Accuracy of a Classification Rule
    17.3 Example: Carapace classification and error
    17.4 Example: Fisher's Iris Data cross-validation
      17.4.1 Stepwise variable selection for classification
    17.5 Example: Analysis of Admissions Data
      17.5.1 Further Analysis of the Admissions Data
      17.5.2 Classification Using Unequal Prior Probabilities
      17.5.3 Classification With Unequal Covariance Matrices, QDA

VI R data manipulation
  18 Data Cleaning
    18.1 The five steps of statistical analysis
    18.2 R background review
      18.2.1 Variable types
      18.2.2 Special values and value-checking functions
    18.3 From raw to technically correct data
      18.3.1 Technically correct data
      18.3.2 Reading text data into an R data.frame
    18.4 Type conversion
      18.4.1 Introduction to R's typing system
      18.4.2 Recoding factors
      18.4.3 Converting dates
    18.5 Character-type manipulation
      18.5.1 String normalization
      18.5.2 Approximate string matching


Part I
Review of ADA1

Chapter 1
R statistical software and review

The purpose of this chapter is to discuss R in the context of a quick review of the topics we covered last semester in ADA1 [1].

1.1 R

R is a programming language for programming, data management, and statistical analysis. So many people have written "An Introduction to R", that I refer you to the course website [2] for links to tutorials. I encourage you to learn R by (1) running the commands in the tutorials, (2) looking at the help for the commands (e.g., ?mean), and (3) trying things on your own as you become curious. Make mistakes, figure out why some things don't work the way you expect, and keep trying. Persistence wins the day with programming (as does asking and searching for help).

R is more difficult to master (though, more rewarding) than some statistical packages (such as Minitab) for the following reasons: (1) R does not, in general, provide a point-and-click environment for statistical analysis. Rather, R uses syntax-based programs (i.e., code) to define, transform, and read data, and to define the procedures for analyzing data. (2) R does not really have a spreadsheet environment for data management. Rather, data are entered directly within an R program, read from a file, or imported from a spreadsheet. All manipulation, transformation, and selection of data is coded in the R program. Well done, this means that all the steps of the analysis are available to be repeatable and understood.

[1] http://statacumen.com/teaching/ada1/
[2] http://statacumen.com/teaching/ada2/

Take a minute to install the packages we'll need this semester by executing the following commands in R.

    #### Install packages needed this semester
    ADA2.package.list <- c("BSDA", "Hmisc", "MASS", "NbClust", "aod",
                           "candisc", "car", "cluster", "ellipse", "ggplot2",
                           "gridExtra", "klaR", "leaps", "lsmeans", "moments",
                           "multcomp", "mvnormtest", "nortest", "plyr", "popbio",
                           "reshape", "reshape2", "scatterplot3d", "vcd",
                           "vioplot", "xtable")
    install.packages(ADA2.package.list)

1.2 ADA1 Ch 0: R warm-up

This example illustrates several strategies for data summarization and analysis. The data for this example are from 15 turkeys raised on farms in either Virginia or Wisconsin.

You should use help to get more information on the functions demonstrated here. Many of the lines in the program have comments, which helps anyone reading the program to more easily follow the logic. I strongly recommend commenting any line of code that isn't absolutely obvious what is being done and why it is being done. I also recommend placing comments before blocks of code in order to describe what the code below is meant to do.

    #### Example: Turkey, R warm-up
    # Read the data file from the website and learn some functions
    # filename
    fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch01_turkey.csv"
    # read file and assign data to turkey variable
    turkey <- read.csv(fn.data)
    # examine the structure of the dataset, is it what you expected?
    # a data.frame containing integers, numbers, and factors
    str(turkey)
    ## 'data.frame': 15 obs. of 3 variables:
    ##  $ age   : int 28 20 32 25 23 22 29 27 28 26 ...
    ##  $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
    ##  $ orig  : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
    # print dataset to screen
    turkey
    ##    age weight orig
    ## 1   28   13.3   va
    ## 2   20    8.9   va
    ## 3   32   15.1   va
    ## 4   25   13.8   wi
    ## 5   23   13.1   wi
    ## 6   22   10.4   va
    ## 7   29   13.1   va
    ## 8   27   12.4   va
    ## 9   28   13.2   va
    ## 10  26   11.8   va
    ## 11  21   11.5   wi
    ## 12  31   16.6   wi
    ## 13  27   14.2   wi
    ## 14  29   15.4   wi
    ## 15  30   15.9   wi

    # Note: to view the age variable (column), there's a few ways to do that
    turkey$age         # name the variable
    ## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
    turkey[, 1]        # give the column number
    ## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
    turkey[, "age"]    # give the column name
    ## [1] 28 20 32 25 23 22 29 27 28 26 21 31 27 29 30
    # and the structure is a vector
    str(turkey$age)
    ## int [1:15] 28 20 32 25 23 22 29 27 28 26 ...

    # let's create an additional variable for later
    # gt25mo will be a variable indicating whether the age is greater than 25 months
    turkey$gt25mo <- (turkey$age > 25)
    # now we also have a Boolean (logical) column
    str(turkey)
    ## 'data.frame': 15 obs. of 4 variables:
    ##  $ age   : int 28 20 32 25 23 22 29 27 28 26 ...
    ##  $ weight: num 13.3 8.9 15.1 13.8 13.1 10.4 13.1 12.4 13.2 11.8 ...
    ##  $ orig  : Factor w/ 2 levels "va","wi": 1 1 1 2 2 1 1 1 1 1 ...
    ##  $ gt25mo: logi TRUE FALSE TRUE FALSE FALSE FALSE ...

    # there are a couple ways of subsetting the rows
    turkey[(turkey$gt25mo == TRUE), ]   # specify the rows
    ##    age weight orig gt25mo
    ## 1   28   13.3   va   TRUE
    ## 3   32   15.1   va   TRUE
    ## 7   29   13.1   va   TRUE
    ## 8   27   12.4   va   TRUE
    ## 9   28   13.2   va   TRUE
    ## 10  26   11.8   va   TRUE
    ## 12  31   16.6   wi   TRUE
    ## 13  27   14.2   wi   TRUE
    ## 14  29   15.4   wi   TRUE
    ## 15  30   15.9   wi   TRUE
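The column and row selections in this warm-up can be checked against each other. The following is a minimal sketch, not part of the original notes: `dat` is a hypothetical four-row stand-in whose values are retyped from the first four turkeys, so it runs without downloading the data file.

```r
# Hypothetical stand-in with the same layout as the turkey data
# (first four rows only, retyped by hand for illustration)
dat <- data.frame(age    = c(28, 20, 32, 25),
                  weight = c(13.3, 8.9, 15.1, 13.8),
                  orig   = c("va", "va", "va", "wi"))

# three ways to extract the same column
by.name  <- dat$age
by.index <- dat[, 1]
by.label <- dat[, "age"]
same.columns <- identical(by.name, by.index) && identical(by.index, by.label)

# two ways to keep the rows with age greater than 25 months
dat$gt25mo   <- (dat$age > 25)
rows.bracket <- dat[(dat$gt25mo == TRUE), ]
rows.subset  <- subset(dat, gt25mo == TRUE)
same.rows    <- isTRUE(all.equal(rows.bracket, rows.subset))

same.columns
same.rows
```

Both comparisons should come back TRUE. One practical difference, as far as I know, is that `subset()` drops rows where the condition is NA, while bracket indexing keeps them as NA rows.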
    subset(turkey, gt25mo == FALSE)   # use subset() to select the data.frame records
    ##    age weight orig gt25mo
    ## 2   20    8.9   va  FALSE
    ## 4   25   13.8   wi  FALSE
    ## 5   23   13.1   wi  FALSE
    ## 6   22   10.4   va  FALSE
    ## 11  21   11.5   wi  FALSE

Analyses can then be done on the entire dataset, or repeated for all subsets of a variable in the dataset.

    # summaries of each variable in the entire dataset,
    summary(turkey)
    # or summarize by a variable in the dataset.
    by(turkey, turkey$orig, summary)
1.3 ADA1 Chapters 2, 4, 6: Estimation in one-sample problems

Plot the weights by origin.

    #### Example: Turkey, Chapters 2, 4, 6
    # subset the data for convenience
    turkeyva <- subset(turkey, orig == "va")
    turkeywi <- subset(turkey, orig == "wi")

    library(ggplot2)
    # Histogram overlaid with kernel density curve
    p11 <- ggplot(turkeyva, aes(x = weight))
    # Histogram with density instead of count on y-axis
    p11 <- p11 + geom_histogram(aes(y=..density..)
                              , binwidth=2
                              , colour="black", fill="white")
    # Overlay with transparent density plot
    p11 <- p11 + geom_density(alpha=0.1, fill="#FF6666")
    p11 <- p11 + geom_point(aes(y = -0.01)
                          , position = position_jitter(height = 0.005)
                          , alpha = 1/5)
    # violin plot
    p12 <- ggplot(turkeyva, aes(x = "weight", y = weight))
    p12 <- p12 + geom_violin(fill = "gray50")
    p12 <- p12 + geom_boxplot(width = 0.2, alpha = 3/4)
    p12 <- p12 + coord_flip()
    # boxplot
    p13 <- ggplot(turkeyva, aes(x = "weight", y = weight))
    p13 <- p13 + geom_boxplot()
    p13 <- p13 + coord_flip()

    library(gridExtra)
    ## Loading required package: grid
    grid.arrange(p11, p12, p13, ncol=1, main="Turkey weights for origin va")

    # Histogram overlaid with kernel density curve
    p21 <- ggplot(turkeywi, aes(x = weight))
    # Histogram with density instead of count on y-axis
    p21 <- p21 + geom_histogram(aes(y=..density..)
                              , binwidth=2
                              , colour="black", fill="white")
    # Overlay with transparent density plot
    p21 <- p21 + geom_density(alpha=0.1, fill="#FF6666")
    p21 <- p21 + geom_point(aes(y = -0.01)
                          , position = position_jitter(height = 0.005)
                          , alpha = 1/5)
    # violin plot
    p22 <- ggplot(turkeywi, aes(x = "weight", y = weight))
    p22 <- p22 + geom_violin(fill = "gray50")
    p22 <- p22 + geom_boxplot(width = 0.2, alpha = 3/4)
    p22 <- p22 + coord_flip()
    # boxplot
    p23 <- ggplot(turkeywi, aes(x = "weight", y = weight))
    p23 <- p23 + geom_boxplot()
    p23 <- p23 + coord_flip()

    library(gridExtra)
    grid.arrange(p21, p22, p23, ncol=1, main="Turkey weights for origin wi")
breaks = 6 .5 weight 10 12 14 12 "weight" "weight" 20.1 0.function(dat. length = 1000) points(x.0 12. # draw a histogram of the means sam.2 0.1. ncol=N). N = 1e4) { n <. main = "Bootstrap sampling distribution of the mean" .dist(turkeyva$weight) bs. ".15 0. xlab = paste("Data: n =". lwd = 2.2.1)) # Histogram overlaid with kernel density curve hist(dat.length(dat).10 0.0 13 14 15 16 15 16 weight weight 12 17.samp.mean <.readonly = TRUE) # make smaller margins par(mfrow=c(2.0 10 15 7.05 0. size = N * n.20 0. mean = mean(dat).seq(min(sam. # resample from data sam <. signif(sd(dat)/sqrt(n)). freq = FALSE.one. mar=c(3. breaks = 25 . # a function to compare the bootstrap sampling distribution with # a normal distribution with mean and SEM estimated from the data bs.1).8 Ch 1: R statistical software and review Turkey weights for origin va Turkey weights for origin wi density density 0. main = "Plot of data with smoothed density curve") points(density(dat). dnorm(x. cex = 1.2 Density 0.0 0.0 −0. se = 0.3 Plot of data with smoothed density curve 8 10 12 14 16 11 12 dat 13 14 15 16 17 dat Bootstrap sampling distribution of the mean 0. lwd = 1 .cex = 1 : is the size of those labels # lwd = 1 : line width qqPlot(turkeyva$weight. turkey origin wi ● 15 ● 16 ● 14 ● 13 ● 15 turkeywi$weight turkeyva$weight ● ● ● 12 ● 11 ● 14 ● ● 13 ● 10 12 9 ● −1.357 . mean = 14. turkey origin va") qqPlot(turkeywi$weight. main="QQ Plot.6 Bootstrap sampling distribution of the mean Density 9 10 11 12 13 14 15 12 13 Data: n = 8 . mean = 12.5 1.67764 5 14 15 16 Data: n = 7 .4 0. lwd = 1 . turkey origin va QQ Plot. main="QQ Plot.3: ADA1 Chapters 2.n = 0.5 1.cex = 1.5 0.20 0. se = 0.665066 5 # normal quantile-quantile (QQ) plot of each orig sample library(car) # qq plot # las = 1 : turns labels on y-axis to read horizontally # id.1 0.0 −0.0 . id. turkey origin wi") QQ Plot. 4.0 norm quantiles 0.275 . las = 1. las = 1. and outputs to console # id.n = 0.1. id.5 ● −1. id.0 1.0 0.2 Density 0.6 0.5 −1. 
[Figure: for each origin, "Plot of data with smoothed density curve" above "Bootstrap sampling distribution of the mean". Origin va: Data: n = 8, mean = 12.275, se = 0.67764. Origin wi: Data: n = 7, mean = 14.357, se = 0.665066.]

    # normal quantile-quantile (QQ) plot of each orig sample
    library(car)
    # qq plot
    # las = 1 : turns labels on y-axis to read horizontally
    # id.n = n : labels n most extreme observations, and outputs to console
    # id.cex = 1 : is the size of those labels
    # lwd = 1 : line width
    qqPlot(turkeyva$weight, las = 1, id.n = 0, id.cex = 1, lwd = 1
         , main="QQ Plot, turkey origin va")
    qqPlot(turkeywi$weight, las = 1, id.n = 0, id.cex = 1, lwd = 1
         , main="QQ Plot, turkey origin wi")

[Figure: "QQ Plot, turkey origin va" and "QQ Plot, turkey origin wi"; sample weight plotted against norm quantiles.]

    # Normality tests
    # VA
    shapiro.test(turkeyva$weight)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data: turkeyva$weight
    ## W = 0.9541, p-value = 0.7528
    library(nortest)
    ad.test(turkeyva$weight)
    ##
    ##  Anderson-Darling normality test
    ##
    ## data: turkeyva$weight
    ## A = 0.283, p-value = 0.5339
    # lillie.test(turkeyva$weight)
    cvm.test(turkeyva$weight)
    ##
    ##  Cramer-von Mises normality test
    ##
    ## data: turkeyva$weight
    ## W = 0.0501, p-value = 0.4642

    # WI
    shapiro.test(turkeywi$weight)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data: turkeywi$weight
    ## W = 0.9733, p-value = 0.9209
    library(nortest)
    ad.test(turkeywi$weight)
    ## Error: sample size must be greater than 7
    # lillie.test(turkeywi$weight)
    cvm.test(turkeywi$weight)
    ## Error: sample size must be greater than 7

Because we do not have any serious departures from normality (the data are consistent with being normal, as well as the sampling distribution of the mean) the t-test is appropriate. We will also look at a couple nonparametric methods.

    # Is the average turkey weight 12 lbs?
    # t-tests of the mean
    # VA
    t.summary <- t.test(turkeyva$weight, mu = 12)
    t.summary
    ##
    ##  One Sample t-test
    ##
    ## data: turkeyva$weight
    ## t = 0.4058, df = 7, p-value = 0.697
    ## alternative hypothesis: true mean is not equal to 12
    ## 95 percent confidence interval:
    ##  10.67 13.88
    ## sample estimates:
    ## mean of x
    ##     12.28

    # WI
    t.summary <- t.test(turkeywi$weight, mu = 12)
    t.summary
    ##
    ##  One Sample t-test
    ##
    ## data: turkeywi$weight
    ## t = 3.544, df = 6, p-value = 0.01216
    ## alternative hypothesis: true mean is not equal to 12
    ## 95 percent confidence interval:
    ##  12.73 15.98
    ## sample estimates:
    ## mean of x
    ##     14.36
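The one-sample t statistic reported for the Virginia turkeys can be reproduced from its definition, t = (mean - mu0) / (s / sqrt(n)). The sketch below is an illustration rather than part of the original notes; the weights are retyped from the printed dataset and the variable names are introduced here.

```r
# Virginia turkey weights, retyped from the printed dataset
w   <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)
mu0 <- 12                      # hypothesized mean weight
n   <- length(w)

se     <- sd(w) / sqrt(n)      # standard error of the mean
t.stat <- (mean(w) - mu0) / se
p.val  <- 2 * pt(-abs(t.stat), df = n - 1)                  # two-sided p-value
ci     <- mean(w) + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95% CI

# compare with the built-in test
fit <- t.test(w, mu = mu0)
c(by.hand = t.stat, t.test = unname(fit$statistic))
```

The hand computation and `t.test()` should agree to rounding with the values printed for origin va (t about 0.41 on 7 df, interval roughly 10.67 to 13.88).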
    # Sign test for the median
    # VA
    library(BSDA)
    ## Loading required package: e1071
    ## Loading required package: class
    ## Loading required package: lattice
    ##
    ## Attaching package: 'BSDA'
    ##
    ## The following objects are masked from 'package:car':
    ##
    ##     Vocab, Wool
    ##
    ## The following object is masked from 'package:datasets':
    ##
    ##     Orange
    SIGN.test(turkeyva$weight, md=12)
    ##
    ##  One-sample Sign-Test
    ##
    ## data: turkeyva$weight
    ## s = 5, p-value = 0.7266
    ## alternative hypothesis: true median is not equal to 12
    ## 95 percent confidence interval:
    ##   9.912 13.885
    ## sample estimates:
    ## median of x
    ##       12.75
    ##
    ##                   Conf.Level L.E.pt U.E.pt
    ## Lower Achieved CI     0.9297 10.400  13.30
    ## Interpolated CI       0.9500  9.912  13.88
    ## Upper Achieved CI     0.9922  8.900  15.10

    # WI
    SIGN.test(turkeywi$weight, md=12)
    ##
    ##  One-sample Sign-Test
    ##
    ## data: turkeywi$weight
    ## s = 6, p-value = 0.125
    ## alternative hypothesis: true median is not equal to 12
    ## 95 percent confidence interval:
    ##  12.00 16.38
    ## sample estimates:
    ## median of x
    ##        14.2
    ##
    ##                   Conf.Level L.E.pt U.E.pt
    ## Lower Achieved CI     0.8750   13.1  15.90
    ## Interpolated CI       0.9500   12.0  16.38
    ## Upper Achieved CI     0.9844   11.5  16.60
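The sign test is a binomial test on the number of observations above the hypothesized median, so its p-values can be reproduced with base R alone. This is a minimal sketch, not the BSDA implementation: `sign.test.by.hand` is a hypothetical helper introduced here, and the weights are retyped from the printed dataset.

```r
# Turkey weights by origin, retyped from the printed dataset
w.va <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)
w.wi <- c(13.8, 13.1, 11.5, 16.6, 14.2, 15.4, 15.9)

# hypothetical helper: count values above md and test that count
# against Binomial(n, 1/2)
sign.test.by.hand <- function(x, md) {
  x <- x[x != md]                 # values equal to md carry no sign
  s <- sum(x > md)                # number of positive signs
  p <- binom.test(s, length(x), p = 0.5)$p.value
  c(s = s, p.value = p)
}

res.va <- sign.test.by.hand(w.va, md = 12)
res.wi <- sign.test.by.hand(w.wi, md = 12)
rbind(va = res.va, wi = res.wi)
```

For origin va, 5 of 8 weights exceed 12, giving a two-sided p-value of 186/256; for origin wi, 6 of 7 exceed 12, giving 16/128. These match the s and p-value lines of the sign test output.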
    # Wilcoxon signed-rank test for the median (or mean, since symmetric assumption)
    # VA
    # with continuity correction in the normal approximation for the p-value
    wilcox.test(turkeyva$weight, mu=12, conf.int=TRUE)
    ## Warning: cannot compute exact p-value with ties
    ## Warning: cannot compute exact confidence interval with ties
    ##
    ##  Wilcoxon signed rank test with continuity correction
    ##
    ## data: turkeyva$weight
    ## V = 21.5, p-value = 0.674
    ## alternative hypothesis: true location is not equal to 12
    ## 95 percent confidence interval:
    ##  10.4 14.1
    ## sample estimates:
    ## (pseudo)median
    ##          12.47

    # without continuity correction
    wilcox.test(turkeyva$weight, mu=12, conf.int=TRUE, correct=FALSE)
    ## Warning: cannot compute exact p-value with ties
    ## Warning: cannot compute exact confidence interval with ties
    ##
    ##  Wilcoxon signed rank test
    ##
    ## data: turkeyva$weight
    ## V = 21.5, p-value = 0.6236
    ## alternative hypothesis: true location is not equal to 12
    ## 95 percent confidence interval:
    ##  10.65 13.75
    ## sample estimates:
    ## (pseudo)median
    ##          12.47

    # WI
    # with continuity correction in the normal approximation for the p-value
    wilcox.test(turkeywi$weight, mu=12, conf.int=TRUE)
    ##
    ##  Wilcoxon signed rank test
    ##
    ## data: turkeywi$weight
    ## V = 27, p-value = 0.03125
    ## alternative hypothesis: true location is not equal to 12
    ## 95 percent confidence interval:
    ##  12.65 16.00
    ## sample estimates:
    ## (pseudo)median
    ##          14.38

    # without continuity correction
    wilcox.test(turkeywi$weight, mu=12, conf.int=TRUE, correct=FALSE)
    ##
    ##  Wilcoxon signed rank test
    ##
    ## data: turkeywi$weight
    ## V = 27, p-value = 0.03125
    ## alternative hypothesis: true location is not equal to 12
    ## 95 percent confidence interval:
    ##  12.65 16.00
    ## sample estimates:
    ## (pseudo)median
    ##          14.38

1.4 ADA1 Chapters 3, 4, 6: Two-sample inferences

Presume it is of interest to compare the center of the weight distributions between the origins. There are many ways to plot the data for visual comparisons.
shape = 3.p5 + geom_histogram(binwidth = 2.p2 + labs(title = "Boxplot with mean (+) and points") # histogram using ggplot p3 <. alpha = 0.p1 + geom_point(position = position_jitter(h=0. position="identity") p4 <. p3. position="dodge") p5 <. geom = "point".p2 + geom_boxplot() # add a "+" at the mean p2 <.ggplot(turkey. aes(x = weight.p4 + geom_histogram(binwidth = 2. size = 2) p2 <. main="Turkey weights compared by origin") .ggplot(turkey.p5 + labs(title = "Histogram with dodge") library(gridExtra) grid.p4 + labs(title = "Histogram with opacity (alpha)") p5 <.ggplot(turkey. alpha = 1. N = 1e4) { n1 <.mean. # calculate the means and take difference between populations sam1.1).mean <. diff.sam2.sam1.readonly = TRUE) # make smaller margins par(mfrow=c(3.par(no. oma=c(1. first check the normality assumptions of the sampling distribution of the mean difference between the populations. replace = TRUE).two.4: ADA1 Chapters 3.matrix(sample(dat2.diff.1. # resample from data sam1 <. sam2 <. # save par() settings old.matrix(sample(dat1. mar=c(3.colMeans(sam1). size = N * n1. n2 <.1.2.length(dat2). sam2.mean <.1.1). # a function to compare the bootstrap sampling distribution # of the difference of means from two samples with # a normal distribution with mean and SEM estimated from the data bs. replace = TRUE).dist <.function(dat1. size = N * n2.mean <.2. ncol=N).colMeans(sam2). ncol=N).length(dat1).mean . 4.par <.samp. dat2. 6: Two-sample inferences 15 Turkey weights compared by origin Dotplot with position jitter wi ● ● ● ● wi ● ● ● ● ● ● ● ● orig ● orig ● Boxplot with mean (+) and points va ● ● 9 ● ●● ● ● 11 13 ● va 15 17 ● ● 9 ● ● 11 weight ●●● ● 13 15 weight Histogram with facets Histogram with opacity (alpha) 4 4 3 va 2 3 count count 1 0 4 orig va 2 wi 3 wi 2 1 1 0 0 10 15 20 weight 10 15 20 weight Histogram with dodge 4 count 3 orig va 2 wi 1 0 10 15 20 weight Using the two-sample t-test.1)) . freq = FALSE. se =".0838 . freq = FALSE. digits = 5)) .2 0. digits = 5) . 
    # a function to compare the bootstrap sampling distribution
    #   of the difference of means from two samples with
    #   a normal distribution with mean and SEM estimated from the data
    bs.two.samp.diff.dist <- function(dat1, dat2, N = 1e4) {
      n1 <- length(dat1);
      n2 <- length(dat2);
      # resample from data
      sam1 <- matrix(sample(dat1, size = N * n1, replace = TRUE), ncol=N);
      sam2 <- matrix(sample(dat2, size = N * n2, replace = TRUE), ncol=N);
      # calculate the means and take difference between populations
      sam1.mean <- colMeans(sam1);
      sam2.mean <- colMeans(sam2);
      diff.mean <- sam1.mean - sam2.mean;
      # save par() settings
      old.par <- par(no.readonly = TRUE)
      # make smaller margins
      par(mfrow=c(3,1), mar=c(3,2,2,1), oma=c(1,1,1,1))
      # Histogram overlaid with kernel density curve
      hist(dat1, freq = FALSE, breaks = 6
          , main = paste("Sample 1", "\n"
                        , "n =", n1
                        , ", mean =", signif(mean(dat1), digits = 5)
                        , ", sd =", signif(sd(dat1), digits = 5))
          , xlim = range(c(dat1, dat2)))
      points(density(dat1), type = "l")
      rug(dat1)

      hist(dat2, freq = FALSE, breaks = 6
          , main = paste("Sample 2", "\n"
                        , "n =", n2
                        , ", mean =", signif(mean(dat2), digits = 5)
                        , ", sd =", signif(sd(dat2), digits = 5))
          , xlim = range(c(dat1, dat2)))
      points(density(dat2), type = "l")
      rug(dat2)

      hist(diff.mean, freq = FALSE, breaks = 25
          , main = paste("Bootstrap sampling distribution of the difference in means", "\n"
                        , "mean =", signif(mean(diff.mean), digits = 5)
                        , ", se =", signif(sd(diff.mean), digits = 5)))
      # overlay a density curve for the sample means
      points(density(diff.mean), type = "l")
      # overlay a normal distribution, bold and red
      x <- seq(min(diff.mean), max(diff.mean), length = 1000)
      points(x, dnorm(x, mean = mean(diff.mean), sd = sd(diff.mean))
           , type = "l", lwd = 2, col = "red")
      # place a rug of points under the plot
      rug(diff.mean)
      # restore par() settings
      par(old.par)
    }

    bs.two.samp.diff.dist(turkeyva$weight, turkeywi$weight)

[Figure: "Sample 1, n = 8, mean = 12.275, sd = 1.9167"; "Sample 2, n = 7, mean = 14.357, sd = 1.7596"; "Bootstrap sampling distribution of the difference in means, mean = −2.0838, se = 0.88081".]

The two-sample t-test is appropriate since the bootstrap sampling distribution of the difference in means is approximately normal. This is the most powerful test and detects a difference at a 0.05 significance level.
    # Two-sample t-test
    ## Equal variances
    #   var.equal = FALSE is the default
    # two-sample t-test specifying two separate vectors
    t.summary.eqvar <- t.test(turkeyva$weight, turkeywi$weight, var.equal = TRUE)
    t.summary.eqvar
    ##
    ##  Two Sample t-test
    ##
    ## data: turkeyva$weight and turkeywi$weight
    ## t = -2.18, df = 13, p-value = 0.04827
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -4.14596 -0.01833
    ## sample estimates:
    ## mean of x mean of y
    ##     12.28     14.36

    # two-sample t-test with unequal variances (Welch = Satterthwaite)
    #   specified using data.frame and a formula, weight by orig
    t.summary.uneqvar <- t.test(weight ~ orig, data = turkey, var.equal = FALSE)
    t.summary.uneqvar
    ##
    ##  Welch Two Sample t-test
    ##
    ## data: weight by orig
    ## t = -2.193, df = 12.96, p-value = 0.04717
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -4.13408 -0.03021
    ## sample estimates:
    ## mean in group va mean in group wi
    ##            12.28            14.36

The (Wilcoxon-)Mann-Whitney two-sample test is appropriate because the shapes of the two distributions are similar, though their locations are different. This is a less powerful test, but doesn't require normality, and fails to detect a difference at a 0.05 significance level.

    # with continuity correction in the normal approximation for the p-value
    wilcox.test(turkeyva$weight, turkeywi$weight, conf.int=TRUE)
    ## Warning: cannot compute exact p-value with ties
    ## Warning: cannot compute exact confidence intervals with ties
    ##
    ##  Wilcoxon rank sum test with continuity correction
    ##
    ## data: turkeyva$weight and turkeywi$weight
    ## W = 11.5, p-value = 0.06384
    ## alternative hypothesis: true location shift is not equal to 0
    ## 95 percent confidence interval:
    ##  -4.19994  0.09994
    ## sample estimates:
    ## difference in location
    ##                 -2.153

    # without continuity correction
    wilcox.test(turkeyva$weight, turkeywi$weight, conf.int=TRUE, correct=FALSE)
    ## Warning: cannot compute exact p-value with ties
    ## Warning: cannot compute exact confidence intervals with ties
    ##
    ##  Wilcoxon rank sum test
    ##
    ## data: turkeyva$weight and turkeywi$weight
    ## W = 11.5, p-value = 0.05598
    ## alternative hypothesis: true location shift is not equal to 0
    ## 95 percent confidence interval:
    ##  -4.100e+00  1.446e-05
    ## sample estimates:
    ## difference in location
    ##                 -2.153
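The equal-variance and Welch t statistics for the turkey comparison differ only in how the standard error and degrees of freedom are formed. The sketch below reproduces both from their textbook formulas; it is an illustration with variable names introduced here, and the weights are retyped from the printed dataset.

```r
# Turkey weights by origin, retyped from the printed dataset
w.va <- c(13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8)
w.wi <- c(13.8, 13.1, 11.5, 16.6, 14.2, 15.4, 15.9)
n1 <- length(w.va); n2 <- length(w.wi)
v1 <- var(w.va);    v2 <- var(w.wi)
d  <- mean(w.va) - mean(w.wi)          # difference in sample means

# equal variances: pool the two sample variances, df = n1 + n2 - 2
sp2       <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t.pooled  <- d / sqrt(sp2 * (1 / n1 + 1 / n2))
df.pooled <- n1 + n2 - 2

# unequal variances (Welch): separate variances, Satterthwaite df
se.welch <- sqrt(v1 / n1 + v2 / n2)
t.welch  <- d / se.welch
df.welch <- se.welch^4 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

rbind(pooled = c(t = t.pooled, df = df.pooled),
      welch  = c(t = t.welch,  df = df.welch))
```

With nearly equal sample sizes and similar variances, as here, the two versions give almost the same answer, which is why both tests above report p-values just under 0.05.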
1.5 ADA1 Chapters 5, 4, 6: One-way ANOVA

The Waste Run-up data3 refer to five suppliers of the Levi-Strauss clothing manufacturing plant in Albuquerque. The firm's quality control department collects weekly data on percentage waste (run-up) relative to what can be achieved by computer layouts of patterns on cloth. It is possible to have negative values, which indicate that the plant employees beat the computer in controlling waste. Under question are differences among the five supplier plants (PT1, . . . , PT5).

3 From http://lib.stat.cmu.edu/DASL/Stories/wasterunup.html, the Data and Story Library (DASL, pronounced "dazzle") is an online library of datafiles and stories that illustrate the use of basic statistics methods. "Waste Run-up" dataset from L. Koopmans, Introduction to Contemporary Statistical Methods, Duxbury Press, 1987, p. 86.

#### Example: Waste Run-up, Chapters 5, 4, 6
# convert to a data.frame by reading the text table
waste <- read.table(text = "
PT1 PT2 PT3 PT4 PT5
1.2 16.4 12.1 11.5 24.0
10.1 -6.0 9.7 10.2 -3.7
-2.0 -11.6 7.4 3.8 8.2
1.5 -1.3 -2.1 8.3 9.2
-3.0 4.0 10.1 6.6 -9.3
-0.7 17.0 4.7 10.2 8.0
3.2 3.8 4.6 8.8 15.8
2.7 4.3 3.9 2.7 22.3
-3.2 10.4 3.6 5.1 3.1
-1.7 4.2 9.6 11.2 16.8
2.4 8.5 9.8 5.9 11.3
0.3 6.3 6.5 13.0 12.3
3.5 9.0 5.7 6.8 16.9
-0.8 7.1 5.1 14.5 NA
19.4 4.3 3.4 5.2 NA
2.8 19.7 -0.8 7.3 NA
13.0 3.0 -3.9 7.1 NA
42.7 7.6 0.9 3.4 NA
1.4 70.2 1.5 0.7 NA
3.0 8.5 NA NA NA
2.4 6.0 NA NA NA
1.3 2.9 NA NA NA
", header=TRUE)
waste
##     PT1   PT2  PT3  PT4  PT5
## 1   1.2  16.4 12.1 11.5 24.0
## 2  10.1  -6.0  9.7 10.2 -3.7
## 3  -2.0 -11.6  7.4  3.8  8.2
## 4   1.5  -1.3 -2.1  8.3  9.2
## 5  -3.0   4.0 10.1  6.6 -9.3
## 6  -0.7  17.0  4.7 10.2  8.0
## 7   3.2   3.8  4.6  8.8 15.8
## 8   2.7   4.3  3.9  2.7 22.3
## 9  -3.2  10.4  3.6  5.1  3.1
## 10 -1.7   4.2  9.6 11.2 16.8
## 11  2.4   8.5  9.8  5.9 11.3
## 12  0.3   6.3  6.5 13.0 12.3
## 13  3.5   9.0  5.7  6.8 16.9
## 14 -0.8   7.1  5.1 14.5   NA
## 15 19.4   4.3  3.4  5.2   NA
## 16  2.8  19.7 -0.8  7.3   NA
## 17 13.0   3.0 -3.9  7.1   NA
## 18 42.7   7.6  0.9  3.4   NA
## 19  1.4  70.2  1.5  0.7   NA
## 20  3.0   8.5   NA   NA   NA
## 21  2.4   6.0   NA   NA   NA
## 22  1.3   2.9   NA   NA   NA

library(reshape2)
waste.long <- melt(waste,
              # id.vars: ID variables
              #   all variables to keep but not split apart on
              id.vars=NULL,
              # measure.vars: The source columns
              #   (if unspecified then all other variables are measure.vars)
              measure.vars = c("PT1","PT2","PT3","PT4","PT5"),
              # variable.name: Name of the destination column identifying each
              #   original column that the measurement came from
              variable.name = "plant",
              # value.name: column name for values in table
              value.name = "runup",
              # remove the NA values
              na.rm = TRUE
            )
## Using  as id variables

str(waste.long)
## 'data.frame': 95 obs. of 2 variables:
##  $ plant: Factor w/ 5 levels "PT1","PT2","PT3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ runup: num 1.2 10.1 -2 1.5 -3 -0.7 3.2 2.7 -3.2 -1.7 ...
head(waste.long)
##   plant runup
## 1   PT1   1.2
## 2   PT1  10.1
## 3   PT1  -2.0
## 4   PT1   1.5
## 5   PT1  -3.0
## 6   PT1  -0.7
tail(waste.long)
##     plant runup
## 96    PT5  22.3
## 97    PT5   3.1
## 98    PT5  16.8
## 99    PT5  11.3
## 100   PT5  12.3
## 101   PT5  16.9

# Calculate the mean, sd, n, and se for the plants
# The plyr package is an advanced way to apply a function to subsets of data
#   "Tools for splitting, applying and combining data"
library(plyr)
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
##     here
# ddply "dd" means the input and output are both data.frames
waste.summary <- ddply(waste.long,
                  "plant",
                  function(X) {
                    data.frame( m = mean(X$runup),
                                s = sd(X$runup),
                                n = length(X$runup)
                              )
                  }
                )
# standard errors
waste.summary$se <- waste.summary$s/sqrt(waste.summary$n)
# individual confidence limits
waste.summary$moe <- qt(1 - 0.05 / 2, df = waste.summary$n - 1) * waste.summary$se
waste.summary$ci.l <- waste.summary$m - waste.summary$moe
waste.summary$ci.u <- waste.summary$m + waste.summary$moe
waste.summary
##   plant      m      s  n    se   moe    ci.l   ci.u
## 1   PT1  4.523 10.032 22 2.139 4.448 0.07477  8.971
## 2   PT2  8.832 15.353 22 3.273 6.807 2.02447 15.639
## 3   PT3  4.832  4.403 19 1.010 2.122 2.70932  6.954
## 4   PT4  7.489  3.657 19 0.839 1.763 5.72681  9.252
## 5   PT5 10.377  9.555 13 2.650 5.774 4.60288 16.151
# Plot the data using ggplot
library(ggplot2)
p <- ggplot(waste.long, aes(x = plant, y = runup))
# plot a reference line at zero and at the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = 0),
                    colour = "black", linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_hline(aes(yintercept = mean(runup)),
                    colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                      colour = "red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                      width = .2, colour = "red", alpha = 0.8)
p <- p + labs(title = "Plant Run-up waste relative to computer layout")
p <- p + ylab("Run-up waste")
print(p)

[Figure: "Plant Run-up waste relative to computer layout": boxplots of run-up waste by plant (PT1 to PT5) with jittered points, group means, and confidence limits; y-axis "Run-up waste", x-axis "plant".]

The outliers here suggest the ANOVA is not an appropriate model. The normality tests below suggest the distributions for the first two plants are not normal.

library(nortest) # provides ad.test()
by(waste.long$runup, waste.long$plant, ad.test)
## waste.long$plant: PT1
##
##  Anderson-Darling normality test
##
## data:  dd[x, ]
## A = 2.869, p-value = 1.761e-07
##
## ----------------------------------------------------
## waste.long$plant: PT2
##
##  Anderson-Darling normality test
##
## data:  dd[x, ]
## A = 2.521, p-value = 1.334e-06
##
## ----------------------------------------------------
## waste.long$plant: PT3
##
##  Anderson-Darling normality test
##
## data:  dd[x, ]
## A = 0.2338, p-value = 0.7624
##
## ----------------------------------------------------
## waste.long$plant: PT4
##
##  Anderson-Darling normality test
##
## data:  dd[x, ]
## A = 0.1236, p-value = 0.9834
##
## ----------------------------------------------------
## waste.long$plant: PT5
##
##  Anderson-Darling normality test
##
## data:  dd[x, ]
## A = 0.2744, p-value = 0.6004

For review purposes, I'll fit the ANOVA, but we would count on the following nonparametric method for inference.

fit.w <- aov(runup ~ plant, data = waste.long)
summary(fit.w)
##             Df Sum Sq Mean Sq F value Pr(>F)
## plant        4    451   112.7    1.16   0.33
## Residuals   90   8749    97.2
fit.w
## Call:
##    aov(formula = runup ~ plant, data = waste.long)
##
## Terms:
##                 plant Residuals
## Sum of Squares    451      8749
## Deg. of Freedom     4        90
##
## Residual standard error: 9.86
## Estimated effects may be unbalanced

# all pairwise comparisons among plants
# Fisher's LSD (FSD) uses "none"
pairwise.t.test(waste.long$runup, waste.long$plant,
                pool.sd = TRUE, p.adjust.method = "none")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  waste.long$runup and waste.long$plant
##
##     PT1   PT2   PT3   PT4
## PT2 0.151 -     -     -
## PT3 0.921 0.198 -     -
## PT4 0.339 0.665 0.408 -
## PT5 0.093 0.655 0.122 0.418
##
## P value adjustment method: none

# Tukey 95% Individual p-values
TukeyHSD(fit.w)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = runup ~ plant, data = waste.long)
##
## $plant
##            diff     lwr    upr  p adj
## PT2-PT1  4.3091  -3.967 12.585 0.5976
## PT3-PT1  0.3089  -8.287  8.905 1.0000
## PT4-PT1  2.9667  -5.630 11.563 0.8719
## PT5-PT1  5.8542  -3.748 15.456 0.4408
## PT3-PT2 -4.0002 -12.597  4.597 0.6947
## PT4-PT2 -1.3423  -9.939  7.254 0.9925
## PT5-PT2  1.5451  -8.057 11.147 0.9915
## PT4-PT3  2.6579  -6.247 11.563 0.9204
## PT5-PT3  5.5453  -4.334 15.425 0.5251
## PT5-PT4  2.8874  -6.992 12.767 0.9258

# Bonferroni 95% Individual p-values
# All Pairwise Comparisons among Levels of waste
pairwise.t.test(waste.long$runup, waste.long$plant,
                pool.sd = TRUE, p.adjust.method = "bonf")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  waste.long$runup and waste.long$plant
##
##     PT1  PT2  PT3  PT4
## PT2 1.00 -    -    -
## PT3 1.00 1.00 -    -
## PT4 1.00 1.00 1.00 -
## PT5 0.93 1.00 1.00 1.00
##
## P value adjustment method: bonferroni

The residuals show many outliers.

# QQ plot
par(mfrow=c(1,1))
library(car)
qqPlot(fit.w$residuals, las = 1, id.n = 10, id.cex = 1, lwd = 1, main="QQ Plot of residuals")
## 41 18 25 93 15 24 90 89 96 38
## 95 94  1  2 93  3  4 92 91 90

[Figure: "QQ Plot of residuals": fit.w$residuals against norm quantiles, with the ten most extreme observations labeled.]

Kruskal-Wallis ANOVA is a non-parametric method for testing the hypothesis of equal population medians against the alternative that not all population medians are equal. It's still not perfect here because the distributional shapes are not all the same, but it is a better alternative than the ANOVA.

# KW ANOVA
fit.wk <- kruskal.test(runup ~ plant, data = waste.long)
fit.wk
##
##  Kruskal-Wallis rank sum test
##
## data:  runup by plant
## Kruskal-Wallis chi-squared = 15.32, df = 4, p-value = 0.004084

# Bonferroni 95% pairwise comparisons with continuity correction
#   in the normal approximation for the p-value
for (i1.pt in 1:4) {
  for (i2.pt in (i1.pt+1):5) {
    wt <- wilcox.test(waste[,names(waste)[i1.pt]], waste[,names(waste)[i2.pt]]
                    , conf.int=TRUE, conf.level = 1 - 0.05/choose(5,2))
    cat(names(waste)[i1.pt], names(waste)[i2.pt])
    print(wt)
  }
}
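The confidence level used in the pairwise Wilcoxon loop, 1 - 0.05/choose(5,2), is a Bonferroni split of the 0.05 family-wise error rate across all pairs of plants. The short sketch below spells out that arithmetic and shows the same idea applied to p-values with p.adjust(); the vector p.raw holds the unadjusted pairwise t-test p-values reported earlier (rounded), and the object names are hypothetical.

```r
# number of plants and number of pairwise comparisons
k       <- 5
n.pairs <- choose(k, 2)

# Bonferroni: split the family-wise 0.05 evenly over the comparisons
alpha.each <- 0.05 / n.pairs
conf.each  <- 1 - alpha.each     # confidence level for each interval

# the same correction applied to p-values: multiply by the number of
# comparisons and cap at 1 (unadjusted values assumed from the LSD output)
p.raw  <- c(0.151, 0.921, 0.339, 0.093, 0.198,
            0.665, 0.655, 0.408, 0.122, 0.418)
p.bonf <- p.adjust(p.raw, method = "bonferroni")

c(n.pairs = n.pairs, alpha.each = alpha.each, conf.each = conf.each)
round(p.bonf, 2)
```

This is why each printed interval is labeled as a 99.5 percent confidence interval, and why nearly every Bonferroni-adjusted t-test p-value is reported as 1.00.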
[Output: for each of the ten pairs of plants the loop prints a Wilcoxon rank sum test of waste[, names(waste)[i1.pt]] and waste[, names(waste)[i2.pt]], with a 99.5 percent confidence interval for the location shift and the estimated difference in location. Every pair except PT3 PT5 uses the continuity correction and carries the warnings "cannot compute exact p-value with ties" and "cannot compute exact confidence intervals with ties". The p-values are:]

Pair      p-value
PT1 PT2   0.009813
PT1 PT3   0.07978
PT1 PT4   0.001241
PT1 PT5   0.02318
PT2 PT3   0.4562
PT2 PT4   0.5563
PT2 PT5   0.1375
PT3 PT4   0.06583
PT3 PT5   0.03018
PT4 PT5   0.1157

1.6 ADA1 Chapter 7: Categorical data analysis

Returning to the turkey dataset, below is the cross-classification of orig by gt25mo.

#### Example: Turkey, Chapter 7
# create a frequency table from two columns of categorical data
xt <- xtabs( ~ orig + gt25mo, data = turkey)
# display the table
xt
##     gt25mo
## orig FALSE TRUE
##   va     2    6
##   wi     3    4

# summary from xtabs() is the same as chisq.test() without continuity correction
summary(xt)
## Call: xtabs(formula = ~orig + gt25mo, data = turkey)
## Number of cases in table: 15
## Number of factors: 2
## Test for independence of all factors:
##  Chisq = 0.5357, df = 1, p-value = 0.4642
##  Chi-squared approximation may be incorrect

# same as xtabs()
x.summary <- chisq.test(xt, correct=FALSE)
## Warning: Chi-squared approximation may be incorrect
x.summary
##
##  Pearson's Chi-squared test
##
## data:  xt
## X-squared = 0.5357, df = 1, p-value = 0.4642

# the default is to perform Yates' continuity correction
chisq.test(xt)
## Warning: Chi-squared approximation may be incorrect
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  xt
## X-squared = 0.0335, df = 1, p-value = 0.8548

# Fisher's exact test
fisher.test(xt)
##
##  Fisher's Exact Test for Count Data
##
## data:  xt
## p-value = 0.6084
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.02688 6.23768
## sample estimates:
## odds ratio
##     0.4698
A mosaic plot is for categorical data. Area represents frequency. The default shading is a good start, since colors only appear when there's evidence of association related to those cell values. In our example, there's insufficient evidence for association, so the default shading is all gray.

library(vcd) # for mosaic()
##
## Attaching package: 'vcd'
##
## The following object is masked from 'package:BSDA':
##
##     Trucks
# shading based on significance relative to appropriate chi-square distribution
mosaic(xt, shade=TRUE)
# you can define your own interpolated shading
mosaic(xt, shade=TRUE, gp_args = list(interpolate = seq(.1,.2,.05)))

[Figure: two mosaic plots of orig (va, wi) by gt25mo (FALSE, TRUE), each with a Pearson residuals legend (residuals between −0.44 and 0.41) and p-value = 0.46.]

From the chisq.test() above, we make a table to summarize important values from that analysis and compare the observed and expected frequencies in plots.

# use output in x.summary and create table
x.table <- data.frame(obs = x.summary$observed
                    , exp = x.summary$expected
                    , res = x.summary$residuals
                    , chisq = x.summary$residuals^2
                    , stdres = x.summary$stdres)
## Warning: row names were found from a short variable and have been discarded
# There are duplicate row and col identifiers in this x.table
#   because we're creating vectors from a two-way table
#   and columns identifying row and col names are automatically added.
# Can you figure out the naming scheme?
x.table
##   obs.orig obs.gt25mo obs.Freq exp.FALSE exp.TRUE res.orig res.gt25mo
## 1       va      FALSE        2     2.667    5.333       va      FALSE
## 2       wi      FALSE        3     2.333    4.667       wi      FALSE
## 3       va       TRUE        6     2.667    5.333       va       TRUE
## 4       wi       TRUE        4     2.333    4.667       wi       TRUE
##   res.Freq chisq.orig chisq.gt25mo chisq.Freq stdres.orig stdres.gt25mo
## 1  -0.4082         va        FALSE    0.16667          va         FALSE
## 2   0.4364         wi        FALSE    0.19048          wi         FALSE
## 3   0.2887         va         TRUE    0.08333          va          TRUE
## 4  -0.3086         wi         TRUE    0.09524          wi          TRUE
##   stdres.Freq
## 1     -0.7319
## 2      0.7319
## 3      0.7319
## 4     -0.7319

# create a single column with a joint cell name
x.table$cellname <- paste(as.character(x.table$obs.orig)
                        , as.character(x.table$obs.gt25mo)
                        , sep="_")
# expected frequencies in a single column
x.table$exp <- c(x.table$exp.FALSE[1:2], x.table$exp.TRUE[3:4])
# create a simpler name for the obs, res, chisq, stdres
x.table$obs <- x.table$obs.Freq
x.table$res <- x.table$res.Freq
x.table$chisq <- x.table$chisq.Freq
x.table$stdres <- x.table$stdres.Freq
# include only the "cleaned" columns
x.table <- subset(x.table, select = c(cellname, obs, exp, res, chisq, stdres))
x.table
##   cellname obs   exp     res   chisq  stdres
## 1 va_FALSE   2 2.667 -0.4082 0.16667 -0.7319
## 2 wi_FALSE   3 2.333  0.4364 0.19048  0.7319
## 3  va_TRUE   6 5.333  0.2887 0.08333  0.7319
## 4  wi_TRUE   4 4.667 -0.3086 0.09524 -0.7319

# reshape the data for plotting
library(reshape2)
x.table.obsexp <- melt(x.table,
              # id.vars: ID variables
              #   all variables to keep but not split apart on
              id.vars=c("cellname"),
              # measure.vars: The source columns
              #   (if unspecified then all other variables are measure.vars)
              measure.vars = c("obs", "exp"),
              # variable.name: Name of the destination column identifying each
              #   original column that the measurement came from
              variable.name = "stat",
              # value.name: column name for values in table
              value.name = "value"
            )
x.table.obsexp
##   cellname stat value
## 1 va_FALSE  obs 2.000
## 2 wi_FALSE  obs 3.000
## 3  va_TRUE  obs 6.000
## 4  wi_TRUE  obs 4.000
## 5 va_FALSE  exp 2.667
## 6 wi_FALSE  exp 2.333
## 7  va_TRUE  exp 5.333
## 8  wi_TRUE  exp 4.667

Plot observed vs expected frequencies, and the contribution to the chi-square statistic sorted in descending order.

# Observed vs Expected counts
library(ggplot2)
p <- ggplot(x.table.obsexp, aes(x = cellname, fill = stat, weight=value))
p <- p + geom_bar(position="dodge")
p <- p + labs(title = "Observed and Expected frequencies")
p <- p + xlab("Age category (years)")
print(p)

# Contribution to chi-sq
# pull out only the cellname and chisq columns
x.table.chisq <- x.table[, c("cellname","chisq")]
# reorder the cellname categories to be descending relative to the chisq statistic
x.table.chisq$cellname <- with(x.table, reorder(cellname, -chisq))
p <- ggplot(x.table.chisq, aes(x = cellname, weight = chisq))
p <- p + geom_bar()
p <- p + labs(title = "Contribution to Chi-sq statistic")
p <- p + xlab("Sorted cellname category (years)")
p <- p + ylab("Contribution")
print(p)

[Figure: two bar charts. "Observed and Expected frequencies": count by cell (va_FALSE, va_TRUE, wi_FALSE, wi_TRUE), bars for stat = obs and exp, x-axis "Age category (years)". "Contribution to Chi-sq statistic": Contribution by cell in the order wi_FALSE, va_FALSE, wi_TRUE, va_TRUE, x-axis "Sorted cellname category (years)".]

1.7 ADA1 Chapter 8: Correlation and regression

Rocket Propellant Data

A rocket motor is manufactured by bonding an igniter propellant and a sustainer propellant together inside a metal housing. The shear strength of the bond between the two types of propellant is an important quality characteristic. It is suspected that shear strength is related to the age in weeks of the batch of sustainer propellant. Twenty observations on these two characteristics are given below. The first column is shear strength in psi, the second is age of propellant in weeks.

#### Example: Rocket, Chapter 8
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch01_rocket.dat"
# this file uses spaces as delimiters, so use read.table()
rocket <- read.table(fn.data, header = TRUE)
rocket$id <- 1:nrow(rocket) # add an id variable to identify observations
str(rocket)
## 'data.frame': 20 obs. of 3 variables:
##  $ shearpsi: num 2159 1678 2316 2061 2208 ...
##  $ agewks  : num 15.5 23.8 8 17 5.5 ...
##  $ id      : int 1 2 3 4 5 6 7 8 9 10 ...
head(rocket)
##   shearpsi agewks id
## 1     2159  15.50  1
## 2     1678  23.75  2
## 3     2316   8.00  3
## 4     2061  17.00  4
## 5     2208   5.50  5
## 6     1708  19.00  6
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(rocket, aes(x = agewks, y = shearpsi, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)

[Figure: scatterplot of shearpsi against agewks for all 20 observations, points labeled by id, with the fitted regression line and confidence band.]

The data are reasonably linear, so fit the regression.

# fit the simple linear regression model
lm.shearpsi.agewks <- lm(shearpsi ~ agewks, data = rocket)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.shearpsi.agewks)
##
## Call:
## lm(formula = shearpsi ~ agewks, data = rocket)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -216.0  -50.7   28.7   66.6  106.8
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2627.82      44.18    59.5  < 2e-16 ***
## agewks        -37.15       2.89   -12.9  1.6e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.1 on 18 degrees of freedom
## Multiple R-squared: 0.902, Adjusted R-squared: 0.896
## F-statistic: 165 on 1 and 18 DF, p-value: 1.64e-10

Plot diagnostics.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))
# residuals vs weight
plot(rocket$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")
  # horizontal line at zero
  abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id.n = 3, main="QQ Plot")
##  5  6  1
##  1  2 20
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
  # horizontal line at zero
  abline(h = 0, col = "gray75")

[Figure: six diagnostic plots for the full-data fit: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), Residuals vs agewks, QQ Plot, and Residuals vs Order of data.]

The relationship between shear strength and age is fairly linear with predicted shear strength decreasing as the age of the propellant increases. The fitted LS line is

Predicted shear strength = 2627.8 − 37.2 Age.

The test for H0 : β1 = 0 (zero slope for the population regression line) is highly significant: p-value < 0.0001. Also note that R² = 0.9018, so the linear relationship between shear strength and age explains about 90% of the variation in shear strength.

The data plot and residual information identify observations 5 and 6 as potential outliers (r5 = −2.38, r6 = −2.32). The predicted values for these observations are much greater than the observed shear strengths. These same observations appear as potential outliers in the normal scores plot and the plot of ri against Ŷi. Observations 5 and 6 also have the largest influence on the analysis; see the Cook's distance values.

A sensible next step would be to repeat the analysis holding out the most influential case, observation 5. It should be somewhat clear that the influence of case 6 would increase dramatically once case 5 is omitted from the analysis. Since both cases have essentially the same effect on the positioning of the LS line, I will assess the impact of omitting both simultaneously.

Before we hold out these cases, how do you think the LS line will change? My guess is these cases are pulling the LS line down, so the intercept of the LS line should increase once these cases are omitted. Holding out either case 5 or 6 would probably also affect the slope, but my guess is that when they are both omitted the slope will change little. (Is this my experience speaking, or have I already seen the output? Both.) What will happen to R² when we delete these points?

Exclude observations 5 and 6 and redo the analysis.

# exclude observations 5 and 6
rocket56 <- rocket[-c(5,6), ]
head(rocket56)
##   shearpsi agewks id
## 1     2159  15.50  1
## 2     1678  23.75  2
## 3     2316   8.00  3
## 4     2061  17.00  4
## 7     1785  24.00  7
## 8     2575   2.50  8

# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(rocket56, aes(x = agewks, y = shearpsi, label = id))
p <- p + geom_point()
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
print(p)

[Figure: scatterplot of shearpsi against agewks with observations 5 and 6 excluded, points labeled by id, with the fitted regression line and confidence band.]

The data are reasonably linear, so fit the regression.

# fit the simple linear regression model
lm.shearpsi.agewks <- lm(shearpsi ~ agewks, data = rocket56)
# use summary() to get t-tests of parameters (slope, intercept)
summary(lm.shearpsi.agewks)
##
## Call:
## lm(formula = shearpsi ~ agewks, data = rocket56)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -118.1  -35.7   11.3   44.8   84.0
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2658.97      30.53    87.1   <2e-16 ***
## agewks        -37.69       1.98   -19.1    2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63 on 16 degrees of freedom
## Multiple R-squared: 0.958, Adjusted R-squared: 0.955
## F-statistic: 363 on 1 and 16 DF, p-value: 2.02e-12

Plot diagnostics.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.shearpsi.agewks, which = c(1,4,6))
# residuals vs weight
plot(rocket56$agewks, lm.shearpsi.agewks$residuals, main="Residuals vs agewks")
  # horizontal line at zero
  abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.shearpsi.agewks$residuals, las = 1, id.n = 3, main="QQ Plot")
## 12 20  2
##  1  2  3
# residuals vs order of data
plot(lm.shearpsi.agewks$residuals, main="Residuals vs Order of data")
  # horizontal line at zero
  abline(h = 0, col = "gray75")

[Figure: six diagnostic plots for the fit omitting observations 5 and 6: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), Residuals vs agewks, QQ Plot, and Residuals vs Order of data.]

Some summaries for the complete analysis, and when cases 5 and 6 are held out, are given below. The summaries lead to the following conclusions:

1. Holding out cases 5 and 6 has little effect on the estimated LS line. Predictions of shear strength are slightly larger after holding out these two cases (recall that intercept increased, but slope was roughly the same!)

2. Holding out these two cases decreases σ̂ considerably, and leads to a modest increase in R². The complete data set will give wider CI and prediction intervals than the analysis which deletes case 5 and 6 because σ̂ decreases when these points are omitted.

3. Once these cases are held out, the normal scores plot and plot of the studentized residuals against fitted values shows no significant problems. One observation has a large Cook's D but does not appear to be extremely influential.

Without any substantive reason to explain the low shear strengths for cases 5 and 6, I am hesitant to delete either case from the analysis. I feel relatively confident that including these cases will not seriously limit the use of the model.

Feature                   Full data   Omit 5 and 6
b0                        2627.82     2658.97
b1                        −37.15      −37.69
R²                        0.9018      0.9578
σ̂                         96.10       62.96
p-val for H0 : β1 = 0     0.0001      0.0001

Here is a comparison of the predicted or fitted values for selected observations in the data set, based on the two fits.

Observation   Actual Shear Strength   Pred, full   Pred, omit 5,6
1             2159                    2052         2075
2             1678                    1745         1764
4             2061                    1996         2018
8             2575                    2535         2565
10            2257                    2219         2244
15            1765                    1810         1830
18            2201                    2163         2188
20            1754                    1829         1849

Review complete

Now that we're warmed up, let's dive into new material!

Part II
Introduction to multiple regression and model selection

Chapter 2
Introduction to Multiple Linear Regression

In multiple linear regression, a linear combination of two or more predictor variables (xs) is used to explain the variation in a response. In essence, the additional predictors are used to explain the variation in the response not explained by a simple linear regression fit.
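For orientation, the "linear combination of predictors" can be written out as a model equation. The notation below is a conventional sketch rather than the notes' own: the symbols $Y_i$, $x_{ij}$, $\beta_j$, $\varepsilon_i$, and the number of predictors $k$ are assumptions introduced here for illustration.

```latex
% Multiple linear regression model for observation i with k predictors
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i,
\qquad i = 1, \ldots, n .

% Simple linear regression is the special case k = 1
Y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i .
```

Read this way, each added term $\beta_j x_{ij}$ is a chance to account for variation in the response that the single-predictor fit leaves in its residuals.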
2.1 Indian systolic blood pressure example

Anthropologists conducted a study1 to determine the long-term effects of an environmental change on systolic blood pressure. They measured the blood pressure and several other characteristics of 39 Indians who migrated from a very primitive environment high in the Andes into the mainstream of Peruvian society at a lower altitude. All of the Indians were males at least 21 years of age, and were born at a high altitude.

1 This problem is from the Minitab handbook.

#### Example: Indian
# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)
# examine the structure of the dataset, is it what you expected?
# a data.frame containing integers, numbers, and factors
str(indian)
## 'data.frame': 39 obs. of 11 variables:
##  $ id   : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ age  : int 21 22 24 24 25 27 28 28 31 32 ...
##  $ yrmig: int 1 6 5 1 1 19 5 25 6 13 ...
##  $ wt   : num 71 56.5 56 61 65 62 53 53 65 57 ...
##  $ ht   : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
##  $ chin : num 8 3.3 3.3 3.7 9 3 7.3 3.7 10.3 5.7 ...
##  $ fore : num 7 5 1.3 3 12.7 3.3 4.7 4.3 9 4 ...
##  $ calf : num 12.7 8 4.3 4.3 20.7 5.7 8 0 10 6 ...
##  $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
##  $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
##  $ diabp: int 76 60 75 120 78 72 76 62 70 64 ...

# Description of variables
#    id = individual id
#   age = age in years              yrmig = years since migration
#    wt = weight in kilos              ht = height in mm
#  chin = chin skin fold in mm       fore = forearm skin fold in mm
#  calf = calf skin fold in mm      pulse = pulse rate-beats/min
# sysbp = systolic bp               diabp = diastolic bp

## print dataset to screen
#indian

   id age yrmig    wt   ht  chin  fore  calf pulse sysbp diabp
1   1  21     1 71.00 1629  8.00  7.00 12.70    88   170    76
2   2  22     6 56.50 1569  3.30  5.00  8.00    64   120    60
3   3  24     5 56.00 1561  3.30  1.30  4.30    68   125    75
4   4  24     1 61.00 1619  3.70  3.00  4.30    52   148   120
5   5  25     1 65.00 1566  9.00 12.70 20.70    72   140    78
6   6  27    19 62.00 1639  3.00  3.30  5.70    72   106    72
7   7  28     5 53.00 1494  7.30  4.70  8.00    64   120    76
8   8  28    25 53.00 1568  3.70  4.30  0.00    80   108    62
9   9  31     6 65.00 1540 10.30  9.00 10.00    76   124    70
10 10  32    13 57.00 1530  5.70  4.00  6.00    60   134    64
11 11  33    13 66.50 1622  6.00  5.70  8.30    68   116    76
12 12  33    10 59.10 1486  6.70  5.30 10.30    73   114    74
13 13  34    15 64.00 1578  3.30  5.30  7.00    88   130    80
14 14  35    18 69.50 1645  9.30  5.00  7.00    60   118    68
15 15  35     2 64.00 1648  3.00  3.70  6.70    60   138    78
16 16  36    12 56.50 1521  3.30  5.00 11.70    72   134    86
17 17  36    15 57.00 1547  3.00  3.00  6.00    84   120    70
18 18  37    16 55.00 1505  4.30  5.00  7.00    64   120    76
19 19  37    17 57.00 1473  6.00  5.30 11.70    72   114    80
20 20  38    10 58.00 1538  8.70  6.00 13.00    64   124    64
21 21  38    18 59.50 1513  5.30  4.00  7.70    80   114    66
22 22  38    11 61.00 1653  4.00  3.30  4.00    76   136    78
23 23  38    11 57.00 1566  3.00  3.00  3.00    60   126    72
24 24  39    21 57.50 1580  4.00  3.00  5.00    64   124    62
25 25  39    24 74.00 1647  7.30  6.30 15.70    64   128    84
26 26  39    14 72.00 1620  6.30  7.70 13.30    68   134    92
27 27  41    25 62.50 1637  6.00  5.30  8.00    76   112    80
28 28  41    32 68.00 1528 10.00  5.00 11.30    60   128    82
29 29  41     5 63.40 1647  5.30  4.30 13.70    76   134    92
30 30  42    12 68.00 1605 11.00  7.00 10.70    88   128    90
31 31  43    25 69.00 1625  5.00  3.00  6.00    72   140    72
32 32  43    26 73.00 1615 12.00  4.00  5.70    68   138    74
33 33  43    10 64.00 1640  5.70  3.00  7.00    60   118    66
34 34  44    19 65.00 1610  8.00  6.70  7.70    74   110    70
35 35  44    18 71.00 1572  3.00  4.70  4.30    72   142    84
36 36  45    10 60.20 1534  3.00  3.00  3.30    56   134    70
37 37  47     1 55.00 1536  3.00  3.00  4.00    64   116    54
38 38  50    43 70.00 1630  4.00  6.00 11.70    72   132    90
39 39  54    40 87.00 1542 11.30 11.70 11.30    92   152    88

A question we consider concerns the long term effects of an environmental change on the systolic blood pressure. In particular, is there a relationship between the systolic blood pressure and how long the Indians lived in their new environment as measured by the fraction of their life spent in the new environment.

# Create the "fraction of their life" variable
#   yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age

# continuous color for wt
# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(indian, aes(x = yrage, y = sysbp, label = id))
p <- p + geom_point(aes(colour=wt), size=2)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + labs(title="Indian sysbp by yrage with continuous wt")
print(p)

# categorical color for wt
indian$wtcat <- rep(NA, nrow(indian))
indian$wtcat <- "M" # init as medium
indian$wtcat[(indian$wt < 60)] <- "L" # update low
indian$wtcat[(indian$wt >= 70)] <- "H" # update high
# define as a factor variable with a specific order
indian$wtcat <- ordered(indian$wtcat, levels=c("L", "M", "H"))
#
library(ggplot2)
p <- ggplot(indian, aes(x = yrage, y = sysbp, label = id))
p <- p + geom_point(aes(colour=wtcat, shape=wtcat), size=2)
library(R.oo) # for ascii code lookup
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
##     getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
##     attach, detach, gc, load, save
p <- p + scale_shape_manual(values=charToInt(sort(unique(indian$wtcat))))
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)

lm.sysbp.yrage <- lm(sysbp ~ yrage, data = indian)
summary(lm.sysbp.yrage)
##
## Call:
## lm(formula = sysbp ~ yrage, data = indian)
intercept) summary(lm.1: Indian systolic blood pressure example 45 p <.1 ' ' 1 .01 '*' 0.p + labs(title="Indian sysbp by yrage with categorical wt") print(p) Indian sysbp by yrage with continuous wt Indian sysbp by yrage with categorical wt 1 ● H 160 160 39 ● H 4 35 5 ● 15 140 29 ● 37 ● 31 ● 32 ● ● 120 M wt 36 22 ● ● 70 1626 10 ● 30 ● 23 93● 20 ● ● ● 7 2 ● 33 ● ● 12 ● 80 ● ● 38 ● 13 25 24 ● 17 18 ● ● 14 ● 11 ● 19 21 ●● 34 M M M M M L H L M H H M M M 120 H M L L L L L L L L M M L M 27 L L L ● M 8 6 M ● L ● 0. Residuals 6033 37 --Signif.25 Coefficients: Estimate Std. type=3) ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: sysbp Sum Sq Df F value Pr(>F) (Intercept) 178221 1 1092. --Signif.05 '.16 -10. df) library(car) Anova(lm. codes: 0 '***' 0.089 .05 0.sysbp.00 yrage 0.75 0.25 L H ● ● 0.75 0.sysbp.sysbp. codes: 0 '***' 0.01 '*' 0.yrage <.25 0.01 -1. Error t value Pr(>|t|) (Intercept) 133.089 . # fit the simple linear regression model lm.05 '.95 <2e-16 *** yrage 498 1 3.50 0.75 9.yrage.' 0.01 3Q 6.04 33.2.1 ' ' 1 # use summary() to get t-tests of parameters (slope.85 Max 37. data = indian) # use Anova() from library(car) to get ANOVA table (Type 3 SS.001 '**' 0.001 '**' 0.00 L M H 60 28 ● 140 M ● ● wtcat H sysbp sysbp ● 0.75 yrage Fit the simple linear regression model reporting the ANOVA table (“Terms”) and parameter estimate table (“Coefficients”). it is usually accepted that systolic blood pressure and weight are related. the weak linear relationship observed in the data is not atypical of a population where there is no linear relationship between systolic blood pressure and the fraction of life spent in a modern society. Even if this test were significant. As in simple linear regression. Nonetheless. consider fitting the regression model sysbp = β0 + β1 yrage + ε. the t-test of H0 : β1 = 0 is not significant at the 5% level (p-value=0. However. 2.1.46 Ch 2: Introduction to Multiple Linear Regression ## Residual standard error: 12. 
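The ANOVA table and the coefficient table for the simple linear regression report the same test of the yrage slope, which is why both show a p-value of 0.089. As a short worked check, using only the rounded values printed in the output (so agreement is only up to rounding):

```latex
\[
t_{\text{obs}} = \frac{b_1}{SE(b_1)} = \frac{-15.75}{9.01} \approx -1.75,
\qquad
t_{\text{obs}}^2 \approx 3.06 \approx F_{\text{obs}} = 3.05 .
\]
```

With a single predictor, the F-statistic on 1 and 37 df is the square of the slope t-statistic on 37 df, so the two tests always agree.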
A plot above of systolic blood pressure against yrage fraction suggests a weak linear relationship. Nonetheless, consider fitting the regression model
$$\text{sysbp} = \beta_0 + \beta_1\,\text{yrage} + \varepsilon.$$
The least squares line (already in the plot) is given by
$$\widehat{\text{sysbp}} = 133.5 - 15.75\,\text{yrage},$$
and suggests that average systolic blood pressure decreases as the fraction of life spent in modern society increases. However, the t-test of $H_0: \beta_1 = 0$ is not significant at the 5% level (p-value=0.0888). That is, the weak linear relationship observed in the data is not atypical of a population where there is no linear relationship between systolic blood pressure and the fraction of life spent in a modern society.

Even if this test were significant, the small value of $R^2 = 0.0763$ suggests that yrage fraction does not explain a substantial amount of the variation in the systolic blood pressures. If we omit the individual with the highest blood pressure then the relationship would be weaker.

2.1.1 Taking Weight Into Consideration

At best, there is a weak relationship between systolic blood pressure and the yrage fraction. However, it is usually accepted that systolic blood pressure and weight are related. A natural way to take weight into consideration is to include wt (weight) and yrage fraction as predictors of systolic blood pressure in the multiple regression model:
$$\text{sysbp} = \beta_0 + \beta_1\,\text{yrage} + \beta_2\,\text{wt} + \varepsilon.$$
As in simple linear regression, the model is written in the form:
Response = Mean of Response + Residual,
so the model implies that average systolic blood pressure is a linear combination of yrage fraction and weight. As in simple linear regression, the standard multiple regression analysis assumes that the responses are normally distributed with a constant variance $\sigma^2$. The parameters of the regression model $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma^2$ are estimated by least squares (LS).

Here is the multiple regression model with yrage and wt (weight) as predictors. Add wt to the right hand side of the previous formula statement.

# fit the multiple linear regression model, (" + wt" added)
lm.sysbp.yrage.wt <- lm(sysbp ~ yrage + wt, data = indian)
library(car)
Anova(lm.sysbp.yrage.wt, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
##             Sum Sq Df F value  Pr(>F)
## (Intercept)   1738  1    18.2 0.00014 ***
## yrage         1315  1    13.8 0.00070 ***
## wt            2592  1    27.1   8e-06 ***
## Residuals     3441 36
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.sysbp.yrage.wt)
##
## Call:
## lm(formula = sysbp ~ yrage + wt, data = indian)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## wt             1.217      0.234    5.21    8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared: 0.473, Adjusted R-squared: 0.444
## F-statistic: 16.2 on 2 and 36 DF, p-value: 9.79e-06
2.1.2 Important Points to Notice About the Regression Output

1. The LS estimates of the intercept and the regression coefficient for yrage fraction, and their standard errors, change from the simple linear model to the multiple regression model. For the simple linear regression:
$$\widehat{\text{sysbp}} = 133.50 - 15.75\,\text{yrage}.$$
For the multiple regression model:
$$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21\,\text{wt}.$$

2. Looking at the ANOVA tables for the simple linear and the multiple regression models we see that the Regression (model) df has increased from 1 to 2 (2 = number of predictor variables) and the Residual (error) df has decreased from 37 to 36 (= n − 1 − number of predictors). Adding a predictor increases the Regression df by 1 and decreases the Residual df by 1.

3. The Residual SS decreases by 6033.37 − 3441.36 = 2592.01 upon adding the weight term. The Regression SS increased by 2592.01 upon adding the weight term to the model. The Total SS does not depend on the number of predictors so it stays the same. The Residual SS, or the part of the variation in the response unexplained by the regression model, never increases when new predictors are added. (You can't add a predictor and explain less variation.)

4. The proportion of variation in the response explained by the regression model:
$$R^2 = \text{Regression SS}/\text{Total SS}$$
never decreases when new predictors are added to a model. The $R^2$ for the simple linear regression was 0.076, whereas $R^2 = 3090.08/6531.44 = 0.473$ for the multiple regression model. Adding the weight variable to the model increases $R^2$ by 0.40, that is, by 40 percentage points. In other words, weight explains an additional 40% of the total variation in systolic blood pressure, beyond what is already explained by fraction.

5. The estimated variability about the regression line, Residual MS $= \hat{\sigma}^2$, decreased dramatically after adding the weight effect. For the simple linear regression model $\hat{\sigma}^2 = 163.06$, whereas $\hat{\sigma}^2 = 95.59$ for the multiple regression model. This suggests that an important predictor has been added to the model.

6. The F-statistic for the multiple regression model
$$F_{\text{obs}} = \text{Regression MS}/\text{Residual MS} = 1545.04/95.59 = 16.163$$
(which is compared to an F-table with 2 and 36 df) tests $H_0: \beta_1 = \beta_2 = 0$ against $H_A$: not $H_0$. This is a test of no relationship between the average systolic blood pressure and fraction and weight, assuming the relationship is linear. If this test is significant, then either fraction or weight, or both, are important for explaining the variation in systolic blood pressure.

7. Given the model
$$\text{sysbp} = \beta_0 + \beta_1\,\text{yrage} + \beta_2\,\text{wt} + \varepsilon,$$
we are interested in testing $H_0: \beta_2 = 0$ against $H_A: \beta_2 \neq 0$. The t-statistic for this test
$$t_{\text{obs}} = \frac{b_2 - 0}{SE(b_2)} = \frac{1.217}{0.234} = 5.207$$
is compared to a t-critical value with Residual df = 36. The test gives a p-value of < 0.0001, which suggests $\beta_2 \neq 0$. The t-test of $H_0: \beta_2 = 0$ in the multiple regression model tests whether adding weight to the simple linear regression model explains a significant part of the variation in systolic blood pressure not explained by yrage fraction. In some sense, the t-test of $H_0: \beta_2 = 0$ will be significant if the increase in $R^2$ (or decrease in Residual SS) obtained by adding weight to this simple linear regression model is substantial. We saw a big increase in $R^2$, which is deemed significant by the t-test. A similar interpretation is given to the t-test for $H_0: \beta_1 = 0$.

8. The t-tests for $\beta_0 = 0$ and $\beta_1 = 0$ are conducted, assessed, and interpreted in the same manner. The p-value for testing $H_0: \beta_0 = 0$ is 0.0001, whereas the p-value for testing $H_0: \beta_1 = 0$ is 0.0007. This implies that fraction is important in explaining the variation in systolic blood pressure after weight is taken into consideration (by including weight in the model as a predictor).

9. We compute CIs for the regression parameters in the usual way: $b_i \pm t_{\text{crit}}\,SE(b_i)$, where $t_{\text{crit}}$ is the t-critical value for the corresponding CI level with df = Residual df.
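The arithmetic in points 3 through 7 and 9 can be reproduced from the sums of squares and the wt coefficient reported in the output. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and the inputs are the rounded values quoted in the text, so results agree with the notes only up to rounding.

```r
# Quantities quoted in the text for the Indian blood pressure models
rss_slr <- 6033.37   # Residual SS, sysbp ~ yrage
rss_mlr <- 3441.36   # Residual SS, sysbp ~ yrage + wt
tss     <- 6531.44   # Total SS (the same for both models)
n       <- 39        # number of observations

ss_drop <- rss_slr - rss_mlr              # point 3: drop in Residual SS
r2_mlr  <- (tss - rss_mlr) / tss          # point 4: R^2 = Regression SS / Total SS
s2_slr  <- rss_slr / (n - 1 - 1)          # point 5: Residual MS, one predictor
s2_mlr  <- rss_mlr / (n - 1 - 2)          # point 5: Residual MS, two predictors
f_obs   <- ((tss - rss_mlr) / 2) / s2_mlr # point 6: Regression MS / Residual MS

b2    <- 1.217                            # LS estimate for wt
se_b2 <- 0.234                            # its standard error
t_obs <- b2 / se_b2                       # point 7 (rounded inputs, so close to 5.207)

# point 9: 95% CI, b2 +/- t_crit * SE(b2), with Residual df = 36
ci_b2 <- b2 + c(-1, 1) * qt(0.975, df = n - 1 - 2) * se_b2

round(c(ss_drop = ss_drop, r2 = r2_mlr, s2_slr = s2_slr,
        s2_mlr = s2_mlr, F = f_obs, t = t_obs), 3)
ci_b2
```

The interval for the wt coefficient excludes zero, which matches the significant t-test in point 7.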
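2.1.3 Understanding the Model

The t-test for $H_0: \beta_1 = 0$ is highly significant (p-value=0.0007), which implies that fraction is important in explaining the variation in systolic blood pressure after weight is taken into consideration (by including weight in the model as a predictor). Weight is called a suppressor variable. Ignoring weight suppresses the relationship between systolic blood pressure and yrage fraction.

The implications of this analysis are enormous! Essentially, the correlation between a predictor and a response says very little about the importance of the predictor in a regression model with one or more additional predictors. This conclusion also holds in situations where the correlation is high, in the sense that a predictor that is highly correlated with the response may be unimportant in a multiple regression model once other predictors are included in the model. In multiple regression "everything depends on everything else."

I will try to convince you that this was expected, given the plot of systolic blood pressure against fraction. This plot used a weight category variable wtcat (L, M, or H) as a plotting symbol. The relationship between systolic blood pressure and fraction is fairly linear within each weight category, and stronger than when we ignore weight. The slopes in the three groups are negative and roughly constant.

To see why yrage fraction is an important predictor after taking weight into consideration, let us return to the multiple regression model. The model implies that the average systolic blood pressure is a linear combination of yrage fraction and weight:
$$\widehat{\text{sysbp}} = \beta_0 + \beta_1\,\text{yrage} + \beta_2\,\text{wt}.$$
For each fixed weight, the average systolic blood pressure is linearly related to yrage fraction with a constant slope $\beta_1$, independent of weight. A similar interpretation holds if we switch the roles of yrage fraction and weight. That is, if we fix the value of fraction, then the average systolic blood pressure is linearly related to weight with a constant slope $\beta_2$, independent of yrage fraction.

To see this point, suppose that the LS estimates of the regression parameters are the true values
$$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21\,\text{wt}.$$
If we restrict our attention to 50kg Indians, the average systolic blood pressure as a function of fraction is
$$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21(50) = 121.39 - 26.76\,\text{yrage}.$$
For 60kg Indians,
$$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21(60) = 133.49 - 26.76\,\text{yrage}.$$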
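The fixed-weight lines above can be generated for any weight. This is a small sketch using the rounded LS estimates from the fitted model; the helper name `line_at_weight` and the 70kg case are illustrative additions, not part of the original notes.

```r
# Rounded LS estimates from the fitted model sysbp ~ yrage + wt
b0      <- 60.89
b_yrage <- -26.76
b_wt    <- 1.21

# For a fixed weight, the fitted model is a line in yrage:
# intercept = b0 + b_wt * wt, slope = b_yrage (the same for every weight)
line_at_weight <- function(wt) {
  c(intercept = b0 + b_wt * wt, slope = b_yrage)
}

# One row per weight: the intercept shifts with weight, the slope does not
lines_by_weight <- t(sapply(c(50, 60, 70), line_at_weight))
rownames(lines_by_weight) <- c("50kg", "60kg", "70kg")
lines_by_weight
```

The intercept rises by 10 × 1.21 = 12.1 for every 10kg while the slope stays at −26.76, which is the parallel-lines structure an additive model imposes.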
Hopefully the pattern is clear: the average systolic blood pressure decreases by 26.76 for each increase of 1 on fraction, regardless of one's weight. If we vary weight over its range of values, we get a set of parallel lines (i.e., equal slopes) when we plot average systolic blood pressure as a function of yrage fraction. The intercept increases by 1.21 for each increase of 1kg in weight.

Similarly, if we plot the average systolic blood pressure as a function of weight, for several fixed values of fraction, we see a set of parallel lines with slope 1.21, and intercepts decreasing by 26.76 for each increase of 1 in fraction.

# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(indian, aes(x = wt, y = sysbp, label = id))
p <- p + geom_point(aes(colour=yrage), size=2)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + labs(title="Indian sysbp by wt with continuous yrage")
print(p)

[Figure: "Indian sysbp by wt with continuous yrage": scatterplot of sysbp against wt, points labelled by id and coloured by yrage, with fitted regression line and confidence band.]

If we had more data we could check the model by plotting systolic blood pressure against fraction, broken down by individual weights. The plot should show a fairly linear relationship between systolic blood pressure and fraction, with a constant slope across weights. I grouped the weights into categories because of the limited number of observations. The same phenomenon should approximately hold, and it does. If the slopes for the different weight groups changed drastically with weight, but the relationships were linear, we would need to include an interaction or product variable wt × yrage in the model, in addition to weight and yrage fraction. This is probably not warranted here.

A final issue that I wish to address concerns the interpretation of the estimates of the regression coefficients in a multiple regression model. For the fitted model
$$\widehat{\text{sysbp}} = 60.89 - 26.76\,\text{yrage} + 1.21\,\text{wt}$$
our interpretation is consistent with the explanation of the regression model given above. For example, focus on the yrage fraction coefficient. The negative coefficient indicates that the predicted systolic blood pressure decreases as yrage fraction increases holding weight constant. In particular, the predicted systolic blood pressure decreases by 26.76 for each unit increase in fraction, holding weight constant at any value. Similarly, the predicted systolic blood pressure increases by 1.21 for each unit increase in weight, holding yrage fraction constant at any level.

This example was meant to illustrate multiple regression. A more complete analysis of the data, including diagnostics, will be given later.

2.2 GCE exam score example

The data below are selected from a larger collection of data referring to candidates for the General Certificate of Education (GCE) who were being considered for a special award. Here, Y denotes the candidate's total mark, out of 1000, in the GCE exam, while X1 is the candidate's score in the compulsory part of the exam, which has a maximum score of 200 of the 1000 points on the exam. X2 denotes the candidate's score, out of 100, in a School Certificate English Language paper taken on a previous occasion.

#### Example: GCE
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_gce.dat"
gce <- read.table(fn.data, header=TRUE)
str(gce)
## 'data.frame': 15 obs. of 3 variables:
##  $ y : int 476 457 540 551 575 698 545 574 645 690 ...
##  $ x1: int 111 92 90 107 98 150 118 110 117 114 ...
##  $ x2: int 68 46 50 59 50 66 54 51 59 80 ...
## print dataset to screen
#gce
       y  x1 x2
   1 476 111 68
   2 457  92 46
   3 540  90 50
   4 551 107 59
   5 575  98 50
   6 698 150 66
   7 545 118 54
   8 574 110 51
   9 645 117 59
  10 690 114 80
  11 634 130 57
  12 637 118 51
  13 390  91 44
  14 562 118 61
  15 560 109 66

A goal here is to compute a multiple regression of Y on X1 and X2, and make the necessary tests to enable you to comment intelligently on the extent to which current performance in the compulsory test (X1) may be used to predict aggregate performance on the GCE exam (Y), and on whether previous performance in the School Certificate English Language (X2) has any predictive value independently of what has already emerged from the current performance in the compulsory papers.

I will lead you through a number of steps to help you answer this question. Let us answer the following straightforward questions.

1. Plot Y against X1 and X2 individually, and comment on the form (i.e., linear, non-linear, logarithmic, etc.), strength, and direction of the relationships.

2. Plot X1 against X2 and comment on the form, strength, and direction of the relationship.

3. Compute the correlation between all pairs of variables. Do the correlations appear sensible, given the plots?

library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
#p <- ggpairs(gce)
# put scatterplots on top so y axis is vertical
p <- ggpairs(gce, upper = list(continuous = "points")
                , lower = list(continuous = "cor")
            )
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)
[Figure: ggpairs scatterplot matrix of y, x1, and x2, with scatterplots in the upper panels and correlations in the lower panels: Corr(y, x1) = 0.731, Corr(y, x2) = 0.548, Corr(x1, x2) = 0.509.]

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(gce))
##       y   x1   x2
## y  1.00 0.73 0.55
## x1 0.73 1.00 0.51
## x2 0.55 0.51 1.00
##
## n= 15
##
## P
##    y      x1     x2
## y         0.0020 0.0346
## x1 0.0020        0.0527
## x2 0.0346 0.0527

In parts 4 through 9, ignore the possibility that Y, X1 or X2 might ideally need to be transformed.

4. Which of X1 and X2 explains a larger proportion of the variation in Y? Which would appear to be a better predictor of Y? (Explain).

Model Y = β0 + β1X1 + ε:

# y ~ x1
lm.y.x1 <- lm(y ~ x1, data = gce)
library(car)
Anova(lm.y.x1, type=3)
## Anova Table (Type III tests)
##
## Response: y
##             Sum Sq Df F value Pr(>F)
## (Intercept)   4515  1    1.25  0.285
## x1           53970  1   14.90  0.002 **
## Residuals    47103 13
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.y.x1)
##
## Call:
## lm(formula = y ~ x1, data = gce)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -97.86 -33.64  -0.03  48.51 111.33
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   128.55     115.16    1.12    0.285
## x1              3.95       1.02    3.86    0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.2 on 13 degrees of freedom
## Multiple R-squared: 0.534, Adjusted R-squared: 0.498
## F-statistic: 14.9 on 1 and 13 DF, p-value: 0.00197

Plot diagnostics.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x1, which = c(1,4,6))
plot(gce$x1, lm.y.x1$residuals, main="Residuals vs x1")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.y.x1$residuals, las = 1, id.n = 3, main="QQ Plot")
## 10 13  1
## 15  1  2
# residuals vs order of data
plot(lm.y.x1$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x1. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), Residuals vs x1, QQ Plot, Residuals vs Order of data.]

Model Y = β0 + β1X2 + ε:

# y ~ x2
lm.y.x2 <- lm(y ~ x2, data = gce)
library(car)
Anova(lm.y.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
##             Sum Sq Df F value Pr(>F)
## (Intercept)  32656  1    6.00  0.029 *
## x2           30321  1    5.57  0.035 *
## Residuals    70752 13
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.y.x2)
##
## Call:
## lm(formula = y ~ x2, data = gce)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -143.8  -37.7    7.1   54.7   99.3
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   291.59     119.04    2.45    0.029 *
## x2              4.83       2.04    2.36    0.035 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 73.8 on 13 degrees of freedom
## Multiple R-squared: 0.3, Adjusted R-squared: 0.246
## F-statistic: 5.57 on 1 and 13 DF, p-value: 0.0346

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x2, which = c(1,4,6))
plot(gce$x2, lm.y.x2$residuals, main="Residuals vs x2")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.y.x2$residuals, las = 1, id.n = 3, main="QQ Plot")
##  1 13 12
##  1  2 15
# residuals vs order of data
plot(lm.y.x2$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x2. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), Residuals vs x2, QQ Plot, Residuals vs Order of data.]

Answer: R2 is 0.53 for the model with X1 and 0.30 with X2. Equivalently, the Model SS is larger for X1 (53970) than for X2 (30321). Thus, X1 appears to be a better predictor of Y than X2.
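The R2 values in the answer to part 4 are the squared correlations from part 3, since in a simple linear regression R2 equals the squared correlation between response and predictor. A minimal sketch of that check follows; it re-enters the 15 GCE observations printed in this section so that it runs without downloading the data file, and the object names are illustrative.

```r
# GCE data as printed in this section (y = total mark, x1 = compulsory score,
# x2 = School Certificate English Language score)
gce <- data.frame(
  y  = c(476, 457, 540, 551, 575, 698, 545, 574, 645, 690, 634, 637, 390, 562, 560),
  x1 = c(111,  92,  90, 107,  98, 150, 118, 110, 117, 114, 130, 118,  91, 118, 109),
  x2 = c( 68,  46,  50,  59,  50,  66,  54,  51,  59,  80,  57,  51,  44,  61,  66)
)

# R^2 from each simple linear regression
r2_x1 <- summary(lm(y ~ x1, data = gce))$r.squared
r2_x2 <- summary(lm(y ~ x2, data = gce))$r.squared

# Squared correlations of y with each predictor
r_sq <- c(x1 = cor(gce$y, gce$x1)^2, x2 = cor(gce$y, gce$x2)^2)

round(rbind(R2 = c(x1 = r2_x1, x2 = r2_x2), cor_squared = r_sq), 3)
```

This identity holds only for one-predictor models; with two or more predictors R2 is no longer the square of any single pairwise correlation.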
5. Consider 2 simple linear regression models for predicting Y, one with X1 as a predictor, and the other with X2 as the predictor. Do X1 and X2 individually appear to be important for explaining the variation in Y? (i.e., test that the slopes of the regression lines are zero). Which parts of the output, if any, support or contradict your answer to the previous question?

Answer: The model with X1 has a t-statistic of 3.86 with an associated p-value of 0.0020, while X2 has a t-statistic of 2.36 with an associated p-value of 0.0346. Both predictors explain a significant amount of variability in Y. This is consistent with part (4).

6. Fit the multiple regression model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.$$
Test $H_0: \beta_1 = \beta_2 = 0$ at the 5% level. Describe in words what this test is doing, and what the results mean here.

Model Y = β0 + β1X1 + β2X2 + ε:

# y ~ x1 + x2
lm.y.x1.x2 <- lm(y ~ x1 + x2, data = gce)
library(car)
Anova(lm.y.x1.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
##             Sum Sq Df F value Pr(>F)
## (Intercept)   1571  1    0.44  0.520
## x1           27867  1    7.80  0.016 *
## x2            4218  1    1.18  0.299
## Residuals    42885 12
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.y.x1.x2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = gce)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -113.2  -29.6   -6.2   56.2   66.3
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    81.16     122.41    0.66    0.520
## x1              3.30       1.18    2.79    0.016 *
## x2              2.09       1.92    1.09    0.299
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.8 on 12 degrees of freedom
## Multiple R-squared: 0.576, Adjusted R-squared: 0.505
## F-statistic: 8.14 on 2 and 12 DF, p-value: 0.00584

Diagnostic plots suggest the residuals are roughly normal with no substantial outliers, though the Cook's distance is substantially larger for observation 10. We may wish to fit the model without observation 10 to see whether conclusions change.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y.x1.x2, which = c(1,4,6))
plot(gce$x1, lm.y.x1.x2$residuals, main="Residuals vs x1")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(gce$x2, lm.y.x1.x2$residuals, main="Residuals vs x2")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.y.x1.x2$residuals, las = 1, id.n = 3, main="QQ Plot")
##  1 13  5
##  1  2 15
## residuals vs order of data
#plot(lm.y.x1.x2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.y.x1.x2. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 − hii), Residuals vs x1, Residuals vs x2, QQ Plot.]

Answer: The ANOVA table reports an F-statistic of 8.14 with associated p-value of 0.0058, indicating that the regression model with both X1 and X2 explains significantly more variability in Y than a model with the intercept alone. That is, X1 and X2 explain variability in Y together. This does not tell us whether X1 or X2 is individually important, or which one (recall the results of the Indian systolic blood pressure example).

7. In the multiple regression model, test $H_0: \beta_1 = 0$ and $H_0: \beta_2 = 0$ individually. Describe in words what these tests are doing, and what the results mean here.

Answer: Each hypothesis is testing, conditional on all other predictors being in the model, whether the addition of the predictor being tested explains significantly more variability in Y than without it.

For $H_0: \beta_1 = 0$, the t-statistic is 2.79 with an associated p-value of 0.0163. Thus, we reject H0 in favor of the alternative that the slope is statistically significantly different from 0 conditional on X2 being in the model. That is, X1 explains significantly more variability in Y given that X2 is already in the model.

For $H_0: \beta_2 = 0$, the t-statistic is 1.09 with an associated p-value of 0.2987. Thus, we fail to reject H0, concluding that there is insufficient evidence that the slope is different from 0 conditional on X1 being in the model. That is, X2 does not explain significantly more variability in Y given that X1 is already in the model.

8. How does the R2 from the multiple regression model compare to the R2 from the individual simple linear regressions? Does what you are seeing here appear reasonable, given the tests on the individual coefficients?

Answer: The R2 for the model with only X1 is 0.5340, only X2 is 0.3000, and both X1 and X2 is 0.5757. There is only a very small increase in R2 from the model with only X1 when X2 is added, which is consistent with X2 not being important given that X1 is already in the model.

9. Do your best to answer the question posed above, in the paragraph after the data "A goal . . . ". Provide an equation (LS) for predicting Y.

Answer: Yes, we've seen that X1 may be used to predict Y, and that X2 does not explain significantly more variability in the model with X1. Thus, the preferred model has only X1:
$$\hat{y} = 128.55 + 3.95\,X_1.$$
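The preferred one-predictor model can be refit and used for prediction directly. This is a minimal sketch that re-enters the GCE data printed earlier so it is self-contained; the compulsory score of 120 for the new candidate is a made-up value for illustration only.

```r
# GCE data as printed in this section
gce <- data.frame(
  y  = c(476, 457, 540, 551, 575, 698, 545, 574, 645, 690, 634, 637, 390, 562, 560),
  x1 = c(111,  92,  90, 107,  98, 150, 118, 110, 117, 114, 130, 118,  91, 118, 109),
  x2 = c( 68,  46,  50,  59,  50,  66,  54,  51,  59,  80,  57,  51,  44,  61,  66)
)

# Preferred model: total mark predicted from the compulsory score only
fit_x1 <- lm(y ~ x1, data = gce)
round(coef(fit_x1), 2)   # LS intercept and slope

# Hypothetical new candidate with a compulsory score of 120 (illustrative value)
new_candidate <- data.frame(x1 = 120)
pred <- predict(fit_x1, newdata = new_candidate, interval = "prediction")
pred                     # columns: fit, lwr, upr
```

With only 15 observations and a residual standard error of about 60 marks, the prediction interval is wide, so the equation is better suited to describing the trend than to pinning down an individual candidate's total.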
keeping the ultimate goal in mind.1 Ch 2: Introduction to Multiple Linear Regression Some Comments on GCE Analysis I will give you my thoughts on these data. The plot of GCE (Y ) against COMP (X1) is fairly linear. and whether any important conclusions are dramatically influenced by individual observations. and with this in mind. I do not see any strong evidence at this point to transform any of the variables. One difficulty that we must face when building a multiple regression model is that these two-dimensional (2D) plots of a response against individual predictors may have little information about the appropriate scales for a multiple regression analysis. In particular. it might be that no variables need to be transformed in the multiple regression model.2. ylab="y | others". we “adjust” the selected variable Xsel for all the other predictors in the model. xlab="x1 | others") partial.regression. col = "red". First.regression.matrix(x[.plot(gce$y...2. . main="Partial regression plot".plot(gce$y. 1.lm(x[.. Then.. cbind(gce$x1. xlab="x2 | others") . gce$x2). gce$x2).) { m <. 2. sel.lm(y ~ m)$res # residuals of x regressed on all other x's x1 <. # function to create partial regression plot partial. -sel]) # residuals of y regressed on all x's except "sel" y1 <.function (y. sel] ~ m)$res # plot residuals of y vs residuals of x plot( y1 ~ x1. Lastly. we “adjust” Y for all the other predictors in the model except the selected one. .plot <.as. cbind(gce$x1. x. plot the residuals from these two models against each other to see what relationship still exists between Y and Xsel after accounting for their relationships with the other predictors.2: GCE exam score example 63 ● 50 ● 10 y | others ●● ● ● 13 ● 1 −10 0 10 x1 | others 20 30 15 ● ● ● −100 −150 ● 5 ●● ● ● ● −50 ● ● ● ● ● 0 50 0 5 ● −50 y | others ● 6● 11 ● ● 10 ● 100 100 Added−Variable Plots ● ● 1● 13 −5 0 5 10 15 20 x2 | others The partial regression residual plot compares the residuals from two model fits. 
2)) partial. lwd = 2) } par(mfrow=c(1.) # add grid grid(lty = "solid") # add red regression line abline(lm(y1 ~ x1).regression. This conclusion is consistent with the fairly weak linear relationship between GCE against SCEL seen in the second partial residual plot. and much less so than the original 2D plot of GCE against COMP. This plot tells us whether we need to transform COMP in the multiple regression model. the multiple regression output indicates that SCEL does not explain a significant amount of the variation in GCE. This indicates that there is no strong evidence for transforming COMP in a multiple regression model that includes SCEL. Put another way. The positive relationship seen here is consistent with the coefficient of COMP being positive in the multiple regression model. given below. Although SCEL appears to somewhat useful as a predictor of GCE on it’s own. The partial residual plot for COMP shows little evidence of curvilinearity. Do diagnostics suggest any deficiencies associated with this conclusion? The partial residual plot of SCEL highlights observation 10.64 Ch 2: Introduction to Multiple Linear Regression ● ● −150 ● ● −10 0 10 x1 | others 20 30 ● ●● ● ● ● ● ● ● −100 −50 ● ● ● 100 ● y | others 0 ● ●● ● −50 y | others 50 ● ● ● 50 ● Partial regression plot 0 100 Partial regression plot ● ● ● ● ● −5 0 5 10 15 20 x2 | others The first partial regression residual plot for COMP. “adjusts” GCE (Y ) and COMP (X1) for their common dependence on all the other predictors in the model (only SCEL (X2) here). once the effect of COMP has been taken into account. and whether any observations are influencing the significance of COMP in the fitted model. which has the largest . A roughly linear trend suggests that no transformation of COMP is warranted. previous performance in the School Certificate English Language (X2) has little predictive value independently of what has already emerged from the current performance in the compulsory papers (X1 or COMP). 
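The two-step adjustment described above can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the GCE analysis: it uses R's built-in mtcars data so that it is self-contained, with mpg in the role of Y, wt as the selected predictor, and hp as the only other predictor. The slope of the least-squares line through the added-variable plot should match the coefficient of the selected predictor in the multiple regression.

```r
# Illustrative data only: built-in mtcars, not the GCE data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# "Adjust" the response and the selected predictor (wt) for the other predictor (hp)
res.y <- resid(lm(mpg ~ hp, data = mtcars))   # y  | others
res.x <- resid(lm(wt  ~ hp, data = mtcars))   # wt | others

# Least-squares line through the added-variable plot
av.fit <- lm(res.y ~ res.x)

coef(av.fit)["res.x"]   # slope in the added-variable plot
coef(fit)["wt"]         # coefficient of wt in the multiple regression

# plot(res.x, res.y); abline(av.fit)   # the added-variable plot itself
```

Because the two slopes agree, a point that pulls the line in the added-variable plot is also pulling the corresponding coefficient in the full model, which is why these plots are useful for spotting influential observations.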
Do diagnostics suggest any deficiencies associated with this conclusion? The partial residual plot of SCEL highlights observation 10, which has the largest value of Cook's distance in the multiple regression model. If we visually hold observation 10 out from this partial residual plot, it would appear that the relationship observed in this plot would weaken. This suggests that observation 10 is actually enhancing the significance of SCEL in the multiple regression model. That is, the p-value for testing the importance of SCEL in the multiple regression model would be inflated by holding out observation 10. The following output confirms this conjecture. The studentized residuals, Cook's distances and partial residual plots show no serious deficiencies.

Model Y = β0 + β1 X1 + β2 X2 + ε, excluding observation 10:

gce10 <- gce[-10,]
# y ~ x1 + x2
lm.y10.x1.x2 <- lm(y ~ x1 + x2, data = gce10)
library(car)
Anova(lm.y10.x1.x2, type=3)
## Anova Table (Type III tests)
##
## Response: y
##             Sum Sq Df F value Pr(>F)
## (Intercept)   5280  1    1.76 0.2118
## x1           37421  1   12.45 0.0047 **
## x2             747  1    0.25 0.6279
## Residuals    33052 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.y10.x1.x2)
## Call:
## lm(formula = y ~ x1 + x2, data = gce10)
##
## Coefficients (t value, Pr(>|t|)):
## (Intercept)   1.33  0.2118
## x1            3.53  0.0047 **
## x2           -0.50  0.6279
##
## Residual standard error: 54.8 on 11 degrees of freedom
## Multiple R-squared: 0.613, Adjusted R-squared: 0.542
## F-statistic: 8.71 on 2 and 11 DF, p-value: 0.00541

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.y10.x1.x2, which = c(1,4,6))
plot(gce10$x1, lm.y10.x1.x2$residuals, main="Residuals vs x1")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(gce10$x2, lm.y10.x1.x2$residuals, main="Residuals vs x2")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.y10.x1.x2$residuals, las = 1, id.n = 3, main="QQ Plot")
## residuals vs order of data
#plot(lm.y10.x1.x2$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the model excluding observation 10. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs x1, Residuals vs x2, QQ Plot.]

library(car)
avPlots(lm.y10.x1.x2, id.n=3)

[Figure: Added-Variable Plots for the model excluding observation 10. Left panel: y | others against x1 | others. Right panel: y | others against x2 | others.]

What are my conclusions? It would appear that SCEL (X2) is not a useful predictor in the multiple regression model. For simplicity, I would likely use a simple linear regression model to predict GCE (Y) from COMP (X1) only. The diagnostic analysis of the model showed no serious deficiencies.
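The hold-one-out check used for observation 10 can be written generically. The sketch below is an illustration under stated assumptions: it uses R's built-in mtcars data rather than the GCE data, finds the case with the largest Cook's distance, refits the model without it, and puts the coefficient p-values side by side.

```r
# Illustrative data only: built-in mtcars, not the GCE data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# case with the largest Cook's distance
d     <- cooks.distance(fit)
i.max <- which.max(d)
names(i.max)

# refit without that case and compare the coefficient tests
fit.out <- lm(mpg ~ wt + hp, data = mtcars[-i.max, ])
cbind(all.data = coef(summary(fit))[, "Pr(>|t|)"],
      held.out = coef(summary(fit.out))[, "Pr(>|t|)"])
```

A p-value that grows noticeably once the case is held out suggests that the case was enhancing the apparent significance of that predictor, which is the pattern described for SCEL and observation 10.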
Chapter 3
A Taste of Model Selection for Multiple Regression

3.1 Model

Given data on a response variable Y and k predictor variables X1, X2, ..., Xk, we wish to develop a regression model to predict Y. Assuming that the collection of variables is measured on the correct scale, and that the candidate list of predictors includes all the important predictors, the most general model is
Y = β0 + β1 X1 + ··· + βk Xk + ε.
In most problems one or more of the predictors can be eliminated from this general or full model without (much) loss of information. We want to identify the important predictors, or equivalently, eliminate the predictors that are not very useful for explaining the variation in Y (conditional on the other predictors in the model).

We will study several automated methods for model selection, which, given a specific criterion for selecting a model, give the best predictors. Before applying any of the methods, you should plot Y against each predictor X1, X2, ..., Xk to see whether transformations are needed. If a transformation of Xi is suggested, include the transformation along with the original Xi in the candidate list. Note that you can transform the predictors differently, for example, log(X1) and √X2. However, if several transformations are suggested for the response, then you should consider doing one analysis for each suggested response scale before deciding on the final scale.

At this point, I will only consider the backward elimination method. Other approaches will be addressed later this semester.

3.2 Backward Elimination

The backward elimination procedure deletes unimportant variables, one at a time, starting from the full model. The steps in the procedure are:
1. Fit the full model
   Y = β0 + β1 X1 + ··· + βk Xk + ε.   (1)
2. Find the variable which, when omitted from the full model (1), reduces R2 the least, or equivalently, increases the Residual SS the least. This is the variable that gives the largest p-value for testing an individual regression coefficient H0: βi = 0 for i > 0. Suppose this variable is Xk. If you reject H0, stop and conclude that the full model is best. If you do not reject H0, delete Xk from the full model, giving the new full model
   Y = β0 + β1 X1 + ··· + βk−1 Xk−1 + ε.
3. Repeat steps 1 and 2 sequentially until no further predictors can be deleted.

In backward elimination we isolate the least important predictor left in the model, and check whether it is important. If not, delete it and repeat the process. Otherwise, stop. A 0.10 significance level is commonly used for this strategy.

Epidemiologists use a slightly different approach to building models. They argue strongly for the need to always include confounding variables in a model, regardless of their statistical significance. I will briefly discuss this issue, but you should recognize that there is no universally accepted best approach to building models. A related issue is that several sets of predictors might give nearly identical fits and predictions to those obtained using any model selection method. This should not be too surprising because predictors are often correlated with each other. However, this should make us question whether one could ever completely unravel which variables are important (and which are not) for predicting a response.

3.2.1 Maximum likelihood and AIC/BIC

The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are related penalized-likelihood criteria of the relative goodness-of-fit of a statistical model to the observed data. For model selection, a parsimonious model minimizes (one of) these quantities, where the penalty term is larger in BIC (k ln(n)) than in AIC (2k). They are defined as
AIC = −2 ln(L) + 2k
and
BIC = −2 ln(L) + k ln(n)
where n is the number of observations, k is the number of model parameters, and L is the maximized value of the likelihood function for the estimated model.

Maximum-likelihood estimation (MLE), applied to a data set and given a statistical model, estimates the model's parameters (the βs and σ2 in regression). MLE finds the particular parametric values that make the observed data the most probable given the model. That is, it selects the set of values of the model parameters that maximizes the likelihood function.

In practice, start with a set of candidate models, and then find the models' corresponding AIC/BIC values. There will almost always be information lost due to using one of the candidate models to represent the "true" (unknown) model. We choose the model that minimizes the (estimated) information loss (the Kullback-Leibler divergence of the "true" unknown model represented with a candidate model).

The penalty discourages overfitting. Increasing the number of free parameters in the model will always improve the goodness-of-fit, regardless of the number of free parameters in the data-generating process. In the spirit of Occam's razor, the principle of parsimony, economy, or succinctness, the penalty helps balance the complexity of the model (low) with its ability to describe the data (high).

There are many methods for model selection. AIC or BIC are good tools for helping to choose among candidate models. A model selected by BIC will tend to have fewer parameters than one selected by AIC. Ultimately, you have to choose a model. I think of automated model selection as a starting point among the models I ultimately consider, and I may decide upon a different model than AIC, BIC, or another method.
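The two definitions above can be evaluated directly from a fitted model. The following is a minimal sketch under stated assumptions: it uses R's built-in mtcars data (not the course data) and a two-predictor linear model chosen only for illustration. Here k counts the three regression coefficients plus σ2.

```r
# Illustrative data only: built-in mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n      <- nobs(fit)                   # number of observations
loglik <- as.numeric(logLik(fit))     # maximized log-likelihood, ln(L)
k      <- attr(logLik(fit), "df")     # parameters: 3 betas + sigma^2

aic.by.hand <- -2 * loglik + 2 * k
bic.by.hand <- -2 * loglik + k * log(n)

# compare with R's built-in functions
c(by.hand = aic.by.hand, builtin = AIC(fit))
c(by.hand = bic.by.hand, builtin = BIC(fit))

# step() ranks models with extractAIC(), which for linear models drops
# terms that are constant for a given data set
aic.step <- extractAIC(fit)[2]
AIC(fit) - aic.step                   # equals n * (1 + log(2 * pi)) + 2
```

Since the difference between AIC() and extractAIC() depends only on n, both versions order a set of candidate models fitted to the same data in the same way. With n = 32 here, ln(n) exceeds 2, so the BIC penalty per parameter is the larger of the two.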
3.3 Example: Peru Indian blood pressure

I will illustrate backward elimination on the Peru Indian data, using systolic blood pressure (sysbp) as the response, and seven candidate predictors: wt = weight in kilos; ht = height in mm; chin = chin skin fold in mm; fore = forearm skin fold in mm; calf = calf skin fold in mm; pulse = pulse rate-beats/min, and yrage = fraction.

The program given below generates simple summary statistics and plots. The plots do not suggest any apparent transformations of the response or the predictors, so we will analyze the data using the given scales. The correlations between the response and each potential predictor indicate that predictors are generally not highly correlated with each other (a few are).

#### Example: Indian
# filename
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)
# Create the "fraction of their life" variable
# yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age
# subset of variables we want in our model
indian2 <- subset(indian, select=c("sysbp", "wt", "ht", "chin", "fore", "calf", "pulse", "yrage"))
str(indian2)
## 'data.frame': 39 obs. of 8 variables:
## $ sysbp: int 170 120 125 148 140 106 120 108 124 134 ...
## $ wt   : num 71 56.5 56 61 65 62 53 53 65 57 ...
## $ ht   : int 1629 1569 1561 1619 1566 1639 1494 1568 1540 1530 ...
## $ chin : num ...
## $ fore : num ...
## $ calf : num ...
## $ pulse: int 88 64 68 52 72 72 64 80 76 60 ...
## $ yrage: num ...

# Description of variables
# id    = individual id
# age   = age in years
# yrmig = years since migration
# wt    = weight in kilos
# ht    = height in mm
# chin  = chin skin fold in mm
# fore  = forearm skin fold in mm
# calf  = calf skin fold in mm
# pulse = pulse rate-beats/min
# sysbp = systolic bp
# diabp = diastolic bp

## print dataset to screen
#indian2
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
#p <- ggpairs(indian2)
# put scatterplots on top so y axis is vertical
p <- ggpairs(indian2, upper = list(continuous = "points"), lower = list(continuous = "cor"))
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)
[Figure: scatterplot matrix of sysbp, wt, ht, chin, fore, calf, pulse, and yrage, with scatterplots in the upper panels and pairwise correlations in the lower panels.]

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(indian2))
## Correlations with sysbp and associated p-values (n = 39):
##       sysbp      P
## wt     0.52 0.0007
## ht     0.22 0.1802
## chin   0.17 0.3003
## fore   0.27 0.0936
## calf   0.25 0.1236
## pulse  0.13 0.4211
## yrage -0.28 0.0888

Below I fit the linear model with all the selected main effects.

# fit full model
lm.indian2.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian2)
library(car)
Anova(lm.indian2.full, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
##             Sum Sq Df F value  Pr(>F)
## (Intercept)    389  1    3.90 0.05728 .
## wt            1956  1   19.59 0.00011 ***
## ht             132  1    1.32 0.25933
## chin           187  1    1.87 0.18124
## fore            27  1    0.27 0.60681
## calf             3  1    0.03 0.86664
## pulse           15  1    0.15 0.70470
## yrage         1387  1   13.88 0.00078 ***
## Residuals     3096 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian2.full)
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.399  -5.792  -0.691   6.945  23.577
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.4577    53.9130    1.97  0.05728 .
## wt            1.7109     0.3866    4.43  0.00011 ***
## ht           -0.0453     0.0395   -1.15  0.25933
## chin         -1.1572     0.8461   -1.37  0.18124
## fore         -0.7018     1.3499   -0.52  0.60681
## calf          0.1036     0.6117    0.17  0.86664
## pulse         0.0749     0.1957    0.38  0.70470
## yrage       -29.3181     7.8684   -3.73  0.00078 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.99 on 31 degrees of freedom
## Multiple R-squared: 0.526, Adjusted R-squared: 0.419
## F-statistic: 4.91 on 7 and 31 DF, p-value: 0.000808

Remarks on Step 0: The full model has 7 predictors, so the regression df = 7. The overall F-test for the full model (F = 4.91 with p-value = 0.0008) tests the hypothesis that the regression coefficient for each predictor variable is zero. This test is highly significant, indicating that one or more of the predictors is important in the model.

In the ANOVA table, the F-value column gives the square of the t-statistic (from the parameter [Coefficients] estimate table) for testing the significance of the individual predictors in the full model (conditional on all other predictors being in the model). The p-value is the same whether the t-statistic or F-value is shown.

The least important variable in the full model, as judged by the p-value, is calf skin fold. This variable, upon omission, reduces R2 the least, or equivalently, increases the Residual SS the least. The p-value of 0.87 exceeds the default 0.10 cut-off, so calf will be the first to be omitted from the model.

Below, we will continue in this way. After deleting calf, the six predictor model can be fitted. Manually, you can find that at least one of the predictors left is important, as judged by the overall F-test p-value. The least important predictor left is pulse. This variable is omitted from the model because the p-value for including it exceeds the 0.10 threshold. This is repeated until all predictors remain significant at a 0.10 significance level.

# model reduction using update() and subtracting (removing) model terms
lm.indian2.red <- lm.indian2.full
# remove calf
lm.indian2.red <- update(lm.indian2.red, ~ . - calf )
summary(lm.indian2.red)
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + pulse + yrage, data = indian2)
## Pr(>|t|): (Intercept) 0.05399, wt 8.7e-05, ht 0.25601, chin 0.17764, fore 0.60120, pulse 0.71302, yrage 0.00051
## Residual standard error: 9.84 on 32 degrees of freedom
## Multiple R-squared: 0.525, Adjusted R-squared: 0.437
## F-statistic: 5.91 on 6 and 32 DF, p-value: 0.00031

# remove pulse
lm.indian2.red <- update(lm.indian2.red, ~ . - pulse)
summary(lm.indian2.red)
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + yrage, data = indian2)
## Pr(>|t|): (Intercept) 0.03860, wt 6.2e-05, ht 0.24681, chin 0.15651, fore 0.66701, yrage 0.00042
## Residual standard error: 9.71 on 33 degrees of freedom
## Multiple R-squared: 0.523, Adjusted R-squared: 0.451
## F-statistic: 7.25 on 5 and 33 DF, p-value: 0.000112

# remove fore
lm.indian2.red <- update(lm.indian2.red, ~ . - fore )
summary(lm.indian2.red)
## Call:
## lm(formula = sysbp ~ wt + ht + chin + yrage, data = indian2)
## Pr(>|t|): (Intercept) 0.03963, wt 1.9e-05, ht 0.27453, chin 0.08635, yrage 0.00036
## Residual standard error: 9.6 on 34 degrees of freedom
## Multiple R-squared: 0.521, Adjusted R-squared: 0.464
## F-statistic: 9.23 on 4 and 34 DF, p-value: 3.66e-05

# remove ht
lm.indian2.red <- update(lm.indian2.red, ~ . - ht )
summary(lm.indian2.red)
## Call:
## lm(formula = sysbp ~ wt + chin + yrage, data = indian2)
## Pr(>|t|): (Intercept) 0.00127, wt 1.5e-06, chin 0.15341, yrage 0.00049
## Residual standard error: 9.63 on 35 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.461
## F-statistic: 11.8 on 3 and 35 DF, p-value: 1.68e-05

# remove chin
lm.indian2.red <- update(lm.indian2.red, ~ . - chin )
summary(lm.indian2.red)
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## wt             1.217      0.234    5.21    8e-06 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared: 0.473, Adjusted R-squared: 0.444
## F-statistic: 16.2 on 2 and 36 DF, p-value: 9.79e-06

# all are significant, stop.
# final model: sysbp ~ wt + yrage
lm.indian2.final <- lm.indian2.red

AIC/BIC automated model selection

The AIC/BIC strategy is more commonly used for model selection, though resulting models are usually the same as the method described above. I use the step() function to perform backward selection using the AIC criterion (and give code for the BIC), then make some last-step decisions. Note that because the BIC has a larger penalty, it arrives at my chosen model directly.

At each step, the predictors are ranked (least significant to most significant) and then a decision of whether to keep the top predictor is made. <none> represents the current model.

## AIC
# option: test="F" includes additional information
# for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.indian2.red.AIC <- step(lm.indian2.full, direction="backward", test="F")
## Start:  AIC=186.6
## sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - calf   1         3 3099 185    0.03 0.86664
## - pulse  1        15 3111 185    0.15 0.70470
## - fore   1        27 3123 185    0.27 0.60681
## - ht     1       132 3228 186    1.32 0.25933
## <none>               3096 187
## - chin   1       187 3283 187    1.87 0.18124
## - yrage  1      1387 4483 199   13.88 0.00078 ***
## - wt     1      1956 5053 204   19.59 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=184.6
## sysbp ~ wt + ht + chin + fore + pulse + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - pulse  1        13 3113 183    0.14 0.71302
## - fore   1        27 3126 183    0.28 0.60120
## - ht     1       130 3229 184    1.34 0.25601
## <none>               3099 185
## - chin   1       184 3283 185    1.90 0.17764
## - yrage  1      1448 4547 198   14.95 0.00051 ***
## - wt     1      1954 5053 202   20.17 8.7e-05 ***
##
## Step:  AIC=182.8
## sysbp ~ wt + ht + chin + fore + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - fore   1        18 3130 181    0.19 0.66701
## - ht     1       131 3244 182    1.39 0.24681
## <none>               3113 183
## - chin   1       198 3311 183    2.10 0.15651
## - yrage  1      1450 4563 196   15.37 0.00042 ***
## - wt     1      1984 5096 200   21.03 6.2e-05 ***
##
## Step:  AIC=181
## sysbp ~ wt + ht + chin + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - ht     1       114 3244 180    1.23 0.27453
## <none>               3130 181
## - chin   1       287 3418 182    3.12 0.08635 .
## - yrage  1      1446 4576 194   15.70 0.00036 ***
## - wt     1      2264 5394 200   24.59 1.9e-05 ***
##
## Step:  AIC=180.4
## sysbp ~ wt + chin + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## <none>               3244 180
## - chin   1       197 3441 181    2.12 0.15341
## - yrage  1      1368 4612 192   14.76 0.00049 ***
## - wt     1      2515 5759 201   27.14 1.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# BIC (not shown)
# step(lm.indian2.full, direction="backward", test="F", k=log(nrow(indian2)))
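The manual p-value procedure of Section 3.2 can also be automated. The sketch below is one possible implementation under stated assumptions, not the author's code: the helper name backward.pvalue is hypothetical, the 0.10 cut-off follows the text, and R's built-in mtcars data stands in for the Peru Indian data so that the example is self-contained. It uses drop1() to obtain the partial F-test for each term and update() to remove the least important one.

```r
# Backward elimination by p-value (hypothetical helper, illustration only).
backward.pvalue <- function(fit, alpha = 0.10) {
  repeat {
    # stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # partial F-test for dropping each term currently in the model
    tests    <- drop1(fit, test = "F")
    pvals    <- tests[["Pr(>F)"]][-1]      # first row is <none>
    terms.in <- rownames(tests)[-1]
    # stop when every remaining predictor is significant at level alpha
    if (max(pvals) <= alpha) break
    worst <- terms.in[which.max(pvals)]    # least important predictor
    message("removing ", worst, " (p = ", signif(max(pvals), 3), ")")
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Illustrative data only: built-in mtcars, not the Peru Indian data.
fit.full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit.final <- backward.pvalue(fit.full, alpha = 0.10)
formula(fit.final)
```

Unlike step(), which compares AIC (or BIC with k = log(n)) at each stage, this loop stops as soon as the largest remaining p-value is at or below the cut-off, so the two approaches can end at different models.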
Remark on Summary Table: The partial R2 is the reduction in R2 achieved by omitting variables sequentially.

The backward elimination procedure eliminates five variables from the full model, in the following order: calf skin fold calf, pulse rate pulse, forearm skin fold fore, height ht, and chin skin fold chin. The model selected by backward elimination includes two predictors: weight wt and fraction yrage. As we progress from the full model to the selected model, R2 decreases as follows: 0.53, 0.53, 0.52, 0.52, 0.50, and 0.47. The decrease is slight across this spectrum of models.

Using a mechanical approach, we are led to a model with weight and years by age fraction as predictors of systolic blood pressure. At this point we should closely examine this model.

3.3.1 Analysis for Selected Model

The summaries and diagnostics for the selected model follow.

Model sysbp = β0 + β1 wt + β2 yrage + ε:

library(car)
Anova(lm.indian2.final, type=3)
## Anova Table (Type III tests)
##
## Response: sysbp
##             Sum Sq Df F value  Pr(>F)
## (Intercept)   1738  1    18.2 0.00014 ***
## wt            2592  1    27.1   8e-06 ***
## yrage         1315  1    13.8 0.00070 ***
## Residuals     3441 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.indian2.final)
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## wt             1.217      0.234    5.21    8e-06 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared: 0.473, Adjusted R-squared: 0.444
## F-statistic: 16.2 on 2 and 36 DF, p-value: 9.79e-06

Comments on the diagnostic plots below.
1. The individual with the highest systolic blood pressure (case 1) has a large studentized residual ri and the largest Cook's Di.
2. Except for case 1, the rankit plot and the plot of the studentized residuals against the fitted values show no gross abnormalities.
3. The plots of studentized residuals against the individual predictors show no patterns. The partial residual plots show roughly linear trends. These plots collectively do not suggest the need to transform either of the predictors. Although case 1 is prominent in the partial residual plots, it does not appear to be influencing the significance of these predictors.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.indian2.final, which = c(1,4,6))
plot(indian2$wt, lm.indian2.final$residuals, main="Residuals vs wt")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(indian2$yrage, lm.indian2.final$residuals, main="Residuals vs yrage")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.indian2.final$residuals, las = 1, id.n = 3, main="QQ Plot")
##  1 34 11
## 39  1  2
## residuals vs order of data
#plot(lm.indian2.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
Recall that the partial regression residual plot for weight, given below, adjusts systolic blood pressure and weight for their common dependence on all the other predictors in the model (only years by age fraction here). This plot tells us whether we need to transform weight in the multiple regression model, and whether any observations are influencing the significance of weight in the fitted model. A roughly linear trend, as seen here, suggests that no transformation of weight is warranted. The positive relationship seen here is consistent with the coefficient of weight being positive in the multiple regression model.

The partial residual plot for fraction exhibits a stronger relationship than is seen in the earlier 2D plot of systolic blood pressure against year by age fraction. This means that fraction is more useful as a predictor after taking an individual's weight into consideration.

library(car)
avPlots(lm.indian2.final, id.n=3)

[Figure: Added-Variable Plots. Panels: sysbp | others against wt | others, and sysbp | others against yrage | others]
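The adjustment described above can be reproduced by hand: regress the response on the other predictors, regress the predictor of interest on the other predictors, and compare the two sets of residuals. The sketch below assumes the built-in mtcars data as a hypothetical stand-in (indian2 is not bundled with R); the variable names e_y and e_wt are illustrative only.

```r
# Stand-in data (hypothetical): mtcars, with mpg as response and wt, qsec as predictors.
fit_full <- lm(mpg ~ wt + qsec, data = mtcars)

# Adjust the response and the predictor of interest for the other predictor
e_y  <- resid(lm(mpg ~ qsec, data = mtcars))   # "mpg | others"
e_wt <- resid(lm(wt  ~ qsec, data = mtcars))   # "wt | others"

# The least-squares slope through these residuals matches the multiple-regression coefficient
fit_av <- lm(e_y ~ e_wt)
slopes <- c(added_variable = unname(coef(fit_av)["e_wt"]),
            full_model     = unname(coef(fit_full)["wt"]))
print(slopes)

# plot(e_wt, e_y); abline(fit_av)   # draws the added-variable plot itself
```

Because the two slopes agree, a case that stands apart in this plot is a case that can drive the estimated coefficient in the full model.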
Model selection methods can be highly influenced by outliers and influential cases. We should hold out case 1, and rerun the backward procedure to see whether case 1 unduly influenced the selection of the two predictor model. If we hold out case 1, we find that the model with weight and fraction as predictors is suggested again. After holding out case 1, there are no large residuals, no extremely influential points, or any gross abnormalities in plots. The R² for the selected model is now R² = 0.408. This decrease in R² should have been anticipated. Why?¹

The two analyses suggest that the "best model" for predicting systolic blood pressure is sysbp = β0 + β1 wt + β2 yrage + ε. Should case 1 be deleted? I have not fully explored this issue, but I will note that eliminating this case does have a significant impact on the estimates of the regression coefficients, and on predicted values. What do you think?

¹ Obs 1 increases the SST, but greatly increases model relationship so greatly increases SSR.

3.4 Example: Dennis Cook's Rat Data

This example illustrates the importance of a careful diagnostic analysis.

An experiment was conducted to investigate the amount of a particular drug present in the liver of a rat. Nineteen (19) rats were randomly selected, weighed, placed under light ether anesthesia and given an oral dose of the drug. Because it was felt that large livers would absorb more of a given dose than small livers, the actual dose an animal received was approximately determined as 40mg of the drug per kilogram of body weight. (Liver weight is known to be strongly related to body weight.) After a fixed length of time, each rat was sacrificed, the liver weighed, and the percent of the dose in the liver determined.

The experimental hypothesis was that, for the method of determining the dose, there is no relationship between the percentage of dose in the liver (Y) and the body weight, liver weight, and relative dose.
#### Example: Rat liver
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch03_ratliver.csv"
ratliver <- read.csv(fn.data)
ratliver <- ratliver[,c(4,1,2,3)] # reorder columns so response is the first
str(ratliver)
## 'data.frame': 19 obs. of 4 variables:
## $ y      : num 0.42 0.25 0.56 0.23 0.23 0.32 0.37 0.41 0.33 0.38 ...
## $ bodywt : int 176 176 190 176 200 167 188 195 176 165 ...
## $ liverwt: num 6.5 9.5 9 8.9 7.2 8.9 8 10 8 7.9 ...
## $ dose   : num 0.88 0.88 1 0.88 1 0.83 0.94 0.98 0.88 0.84 ...

      y bodywt liverwt dose
1  0.42    176    6.50 0.88
2  0.25    176    9.50 0.88
3  0.56    190    9.00 1.00
4  0.23    176    8.90 0.88
5  0.23    200    7.20 1.00
6  0.32    167    8.90 0.83
7  0.37    188    8.00 0.94
8  0.41    195   10.00 0.98
9  0.33    176    8.00 0.88
10 0.38    165    7.90 0.84
11 0.27    158    6.90 0.80
12 0.36    148    7.30 0.74
13 0.21    149    5.20 0.75
14 0.28    163    8.40 0.81
15 0.34    170    7.20 0.85
16 0.28    186    6.80 0.94
17 0.30    146    7.30 0.73
18 0.37    181    9.00 0.90
19 0.46    149    6.40 0.75

library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
#p <- ggpairs(ratliver)
# put scatterplots on top so y axis is vertical
p <- ggpairs(ratliver, upper = list(continuous = "points")
                     , lower = list(continuous = "cor")
            )
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of y, bodywt, liverwt, and dose. Correlations shown: y and bodywt 0.151, y and liverwt 0.203, y and dose 0.228, bodywt and liverwt 0.5, bodywt and dose 0.99, liverwt and dose 0.49]

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
rcorr(as.matrix(ratliver))
##            y bodywt liverwt dose
## y       1.00   0.15    0.20 0.23
## bodywt  0.15   1.00    0.50 0.99
## liverwt 0.20   0.50    1.00 0.49
## dose    0.23   0.99    0.49 1.00
##
## n= 19
##
##
## P
##         y      bodywt liverwt dose
## y              0.5370 0.4038  0.3488
## bodywt  0.5370        0.0293  0.0000
## liverwt 0.4038 0.0293         0.0332
## dose    0.3488 0.0000 0.0332

The correlation between Y and each predictor is small, as expected.

Below I fit the linear model with all the selected main effects.

# fit full model
lm.ratliver.full <- lm(y ~ bodywt + liverwt + dose, data = ratliver)
library(car)
Anova(lm.ratliver.full, type=3)
## Anova Table (Type III tests)
##
## Response: y
##             Sum Sq Df F value Pr(>F)
## (Intercept) 0.0112  1    1.87  0.192
## bodywt      0.0424  1    7.10  0.018 *
## liverwt     0.0041  1    0.69  0.419
## dose        0.0450  1    7.53  0.015 *
## Residuals   0.0896 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.ratliver.full)
##
## Call:
## lm(formula = y ~ bodywt + liverwt + dose, data = ratliver)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.10056 -0.06323  0.00713  0.04597  0.13469
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.26592    0.19459    1.37    0.192
## bodywt      -0.02125    0.00797   -2.66    0.018 *
## liverwt      0.01430    0.01722    0.83    0.419
## dose         4.17811    1.52263    2.74    0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0773 on 15 degrees of freedom
## Multiple R-squared: 0.364, Adjusted R-squared: 0.237
## F-statistic: 2.86 on 3 and 15 DF, p-value: 0.072

The backward elimination procedure selects weight and dose as predictors. The p-values for testing the importance of these variables, when added last to this two predictor model, are small, 0.019 and 0.015.

lm.ratliver.red.AIC <- step(lm.ratliver.full, direction="backward", test="F")
## Start: AIC=-93.78
## y ~ bodywt + liverwt + dose
##
##           Df Sum of Sq    RSS   AIC F value Pr(>F)
## - liverwt  1    0.0041 0.0937 -94.9    0.69  0.419
## <none>                 0.0896 -93.8
## - bodywt   1    0.0424 0.1320 -88.4    7.10  0.018 *
## - dose     1    0.0450 0.1346 -88.0    7.53  0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=-94.92
## y ~ bodywt + dose
##
##          Df Sum of Sq    RSS   AIC F value Pr(>F)
## <none>                0.0937 -94.9
## - bodywt  1    0.0399 0.1336 -90.2     6.8  0.019 *
## - dose    1    0.0439 0.1377 -89.6     7.5  0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm.ratliver.final <- lm.ratliver.red.AIC

This cursory analysis leads to a conclusion that a combination of dose and body weight is associated with Y, but that neither of these predictors is important on its own (low correlations with Y). Although this commonly happens in regression problems, it is somewhat paradoxical here because dose was approximately a multiple of body weight, so to a first approximation, these predictors are linearly related and so only one of them should be needed in a linear regression model. Note that the correlation between dose and body weight is 0.99.

The apparent paradox can be resolved only with a careful diagnostic analysis! For the model with dose and body weight as predictors, there are no cases with large |ri| values, but case 3 has a relatively large Cook's D value.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.ratliver.final, which = c(1,4,6))
plot(ratliver$bodywt, lm.ratliver.final$residuals, main="Residuals vs bodywt")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(ratliver$dose, lm.ratliver.final$residuals, main="Residuals vs dose")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.ratliver.final$residuals, las = 1, id.n = 3, main="QQ Plot")
## 19 13 1
## 19 1 18
## residuals vs order of data
#plot(lm.ratliver.final$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for lm.ratliver.final. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs bodywt, Residuals vs dose, QQ Plot]
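One way to see why a correlation of 0.99 between the two predictors is troubling is to compute how much it inflates the sampling variance of each slope. In a two-predictor model that factor is 1/(1 − r²), which is about 50 when r = 0.99. This factor is not used in the notes; the sketch below is a side calculation on simulated values (the names and numbers are hypothetical, not the rat data).

```r
set.seed(1)                                   # simulated values, not the rat data
n      <- 19
bodywt <- rnorm(n, mean = 170, sd = 15)       # hypothetical body weights (g)
dose   <- bodywt / 200 + rnorm(n, sd = 0.01)  # dose nearly proportional to weight

r   <- cor(bodywt, dose)
vif <- 1 / (1 - r^2)   # variance inflation for either slope in a two-predictor model

print(round(c(correlation = r, variance_inflation = vif,
              se_inflation = sqrt(vif)), 2))
```

With predictors this closely related, the data carry little information for separating their effects, which is part of why the fit can hinge on a single case.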
Further, the partial residual plot for bodywt clearly highlights case 3. Without this case we would see roughly a random scatter of points, suggesting that body weight is unimportant after taking dose into consideration. The importance of body weight as a predictor in the multiple regression model is due solely to the placement of case 3. The partial residual plot for dose gives the same message.

library(car)
avPlots(lm.ratliver.final, id.n=3)

[Figure: Added-Variable Plots. Panels: y | others against bodywt | others, and y | others against dose | others; case 3 stands apart in both]
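The mechanism can be imitated with simulated values. In the sketch below the response is generated independently of both predictors, dose follows a proportional rule, and one case is moved off the rule. All names and numbers are hypothetical and are not the rat data; the point is only to show how to compare fits with and without the unusual case.

```r
set.seed(2)                                    # simulated values, not the rat data
n      <- 19
bodywt <- rnorm(n, mean = 170, sd = 15)
dose   <- bodywt / 200 + rnorm(n, sd = 0.002)  # proportional dosing rule
y      <- rnorm(n, mean = 0.33, sd = 0.08)     # response unrelated to either predictor
dose[3] <- dose[3] + 0.10                      # one case dosed off the rule
sim <- data.frame(y, bodywt, dose)

fit_all  <- lm(y ~ bodywt + dose, data = sim)
fit_drop <- lm(y ~ bodywt + dose, data = sim[-3, ])   # hold out case 3

lev <- hatvalues(fit_all)
print(round(sort(lev, decreasing = TRUE)[1:3], 2))    # case 3 has by far the largest leverage

# slope estimates and p-values with and without the unusual case
print(round(summary(fit_all)$coefficients, 3))
print(round(summary(fit_drop)$coefficients, 3))
```

Because case 3 is the only point that separates dose from body weight, its response value alone largely determines how the two slopes are split.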
Removing case 3

If we delete this case and redo the analysis we find, as expected, no important predictors of Y. The output below shows that the backward elimination removes each predictor from the model. Thus, the apparent relationship between Y and body weight and dose in the initial analysis can be ascribed to Case 3 alone. Can you see this case in the plots?

# remove case 3
ratliver3 <- ratliver[-3,]
# fit full model
lm.ratliver3.full <- lm(y ~ bodywt + liverwt + dose, data = ratliver3)
lm.ratliver3.red.AIC <- step(lm.ratliver3.full, direction="backward", test="F")
## Start: AIC=-88.25
## y ~ bodywt + liverwt + dose
##
##           Df Sum of Sq    RSS   AIC F value Pr(>F)
## - dose     1  0.000979 0.0867 -90.0    0.16   0.70
## - bodywt   1  0.001059 0.0868 -90.0    0.17   0.68
## - liverwt  1  0.001421 0.0871 -90.0    0.23   0.64
## <none>                 0.0857 -88.2
##
## Step: AIC=-90.04
## y ~ bodywt + liverwt
##
##           Df Sum of Sq    RSS   AIC F value Pr(>F)
## - bodywt   1  0.000356 0.0871 -92.0    0.06   0.81
## - liverwt  1  0.000827 0.0875 -91.9    0.14   0.71
## <none>                 0.0867 -90.0
##
## Step: AIC=-91.97
## y ~ liverwt
##
##           Df Sum of Sq    RSS   AIC F value Pr(>F)
## - liverwt  1  0.000509 0.0876 -93.9    0.09   0.76
## <none>                 0.0871 -92.0
##
## Step: AIC=-93.86
## y ~ 1

All variables are omitted!

In his text,² Weisberg writes: The careful analyst must now try to understand exactly why the third case is so influential. Inspection of the data indicates that this rat with weight 190g, was reported to have received a full dose of 1.000, which was a larger dose than it should have received according to the rule for assigning doses, see scatterplot below (e.g., rat 8 with a weight of 195g got a lower dose of 0.98).

² Applied Linear Regression, 3rd Ed. by Sanford Weisberg, published by Wiley/Interscience in 2005 (ISBN 0-471-66379-4)

# ggplot: Plot the data with linear regression fit and confidence bands
library(ggplot2)
p <- ggplot(ratliver, aes(x = bodywt, y = dose, label = 1:nrow(ratliver)))
# plot regression line and confidence band
p <- p + geom_smooth(method = lm)
p <- p + geom_point(alpha=1/3)
# plot labels next to points
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
p <- p + labs(title="Rat liver dose by bodywt: rat 3 overdosed")
print(p)

[Figure: "Rat liver dose by bodywt: rat 3 overdosed". Scatterplot of dose against bodywt with case labels, fitted line, and confidence band; case 3 lies above the line]
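The size of the discrepancy for rat 3 can be gauged with a short calculation. The sketch below assumes the relative dose is exactly proportional to body weight and uses rat 8 (195g, dose 0.98) as the reference; both the exact proportionality and the choice of reference rat are simplifying assumptions made only for this illustration.

```r
# Values quoted in the text
ref_wt    <- 195; ref_dose  <- 0.98   # rat 8
rat3_wt   <- 190; rat3_dose <- 1.00   # rat 3

# Dose rat 3 would have received under a strictly proportional rule
expected_rat3 <- ref_dose / ref_wt * rat3_wt

print(round(c(expected = expected_rat3,
              recorded = rat3_dose,
              excess   = rat3_dose - expected_rat3), 3))
```

Under these assumptions rat 3 would have been expected to receive a relative dose of about 0.955, so the recorded 1.00 is roughly 0.045 too high.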
A number of causes for the result found in the first analysis are possible: (1) the dose or weight recorded for case 3 was in error, so the case should probably be deleted from the analysis, or (2) the regression fit in the second analysis is not appropriate except in the region defined by the 18 points excluding case 3. It is possible that the combination of dose and rat weight chosen was fortuitous, and that the lack of relationship found would not persist for any other combinations of them, since inclusion of a data point apparently taken under different conditions leads to a different conclusion. This suggests the need for collection of additional data, with dose determined by some rule other than a constant proportion of weight.

I hope the point of this analysis is clear! What have we learned from this analysis?

Part III

Experimental design and observational studies

Chapter 4

One Factor Designs and Extensions

This section describes an experimental design to compare the effectiveness of four insecticides to eradicate beetles. The primary interest is determining which treatment is most effective, in the sense of providing the lowest typical survival time.

In a completely randomized design (CRD), the scientist might select a sample of genetically identical beetles for the experiment, and then randomly assign a predetermined number of beetles to the treatment groups (insecticides). The sample sizes for the groups need not be equal. A power analysis is often conducted to determine sample sizes for the treatments. For simplicity, assume that 48 beetles will be used in the experiment, with 12 beetles assigned to each group.

After assigning the beetles to the four groups, the insecticide is applied (uniformly to all experimental units or beetles), and the individual survival times recorded. A natural analysis of the data collected from this one factor design would be to compare the survival times using a one-way ANOVA.

There are several important controls that should be built into this experiment. The same strain of beetles should be used to ensure that the four treatment groups are alike as possible, so that differences in survival times are attributable to the insecticides, and not due to genetic differences among beetles. Other factors that may influence the survival time, say the concentration of the insecticide or the age of the beetles, would be held constant, or fixed by the experimenter, if possible. Thus, the same concentration would be used with the four insecticides.

In complex experiments, there are always potential influences that are not realized or are thought to be unimportant that you do not or can not control. The randomization of beetles to groups ensures that there is no systematic dependence of the observed treatment differences on the uncontrolled influences. This is extremely important in studies where genetic and environmental influences can not be easily controlled (as in humans, more so than in bugs or mice). The randomization of beetles to insecticides tends to diffuse or greatly reduce the effect of the uncontrolled influences on the comparison of insecticides, in the sense that these effects become part of the uncontrolled or error variation of the experiment.

In summary, an experiment is to impose a treatment on experimental units to observe a response. Randomization and carefully controlling factors are important considerations.

Suppose $y_{ij}$ is the response for the $j$th experimental unit in the $i$th treatment group, where $i = 1, 2, \ldots, I$. The statistical model for a completely randomized one-factor design that leads to a one-way ANOVA is given by:

$y_{ij} = \mu_i + e_{ij}$,

where $\mu_i$ is the (unknown) population mean for all potential responses to the $i$th treatment, and $e_{ij}$ is the residual or deviation of the response from the population mean. The responses within and across treatments are assumed to be independent, normal random variables with constant variance.

For the insecticide experiment, $y_{ij}$ is the survival time for the $j$th beetle given the $i$th insecticide, where $i = 1, 2, 3, 4$ and $j = 1, 2, \ldots, 12$. The random selection of beetles coupled with the randomization of beetles to groups ensures the independence assumptions. The assumed population distributions of responses for the $I = 4$ insecticides can be represented as follows:

[Figure: population distributions of responses for Insecticide 1, Insecticide 2, Insecticide 3, and Insecticide 4]

Let

$\mu = \frac{1}{I} \sum_i \mu_i$

be the grand mean, or average of the population means. Let $\alpha_i = \mu_i - \mu$ be the $i$th group treatment effect. The treatment effects are constrained to add to zero, $\alpha_1 + \alpha_2 + \cdots + \alpha_I = 0$, and measure the difference between the treatment population means and the grand mean. Given this notation, the one-way ANOVA model is

$y_{ij} = \mu + \alpha_i + e_{ij}$.

The model specifies that the

Response = Grand Mean + Treatment Effect + Residual.

An hypothesis of interest is whether the population means are equal: $H_0: \mu_1 = \cdots = \mu_I$, which is equivalent to the hypothesis of no treatment effects: $H_0: \alpha_1 = \cdots = \alpha_I = 0$. If $H_0$ is true, then the one-way model is

$y_{ij} = \mu + e_{ij}$,

where $\mu$ is the common population mean.
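The one-way notation can be made concrete with a small simulation. The sketch below assumes hypothetical population means and a hypothetical common standard deviation for the four insecticides; none of the numbers come from a real experiment.

```r
set.seed(3)                           # simulated survival times, hypothetical values
mu_i <- c(10, 12, 9, 13)              # hypothetical population means for I = 4 insecticides
trt  <- factor(rep(1:4, each = 12))   # 12 beetles per insecticide, 48 in total
y    <- rnorm(48, mean = mu_i[as.integer(trt)], sd = 2)

mu    <- mean(mu_i)                   # grand mean of the population means
alpha <- mu_i - mu                    # treatment effects; they sum to zero
print(rbind(mu_i = mu_i, alpha = alpha))

# sample analogues: group means and estimated effects
ybar_i    <- tapply(y, trt, mean)
alpha_hat <- ybar_i - mean(ybar_i)
print(round(alpha_hat, 2))

# F-test of H0: alpha_1 = ... = alpha_I = 0
print(anova(lm(y ~ trt)))
```

Here the grand mean is 11 and the effects are -1, 1, -2, and 2, so the population means and the (grand mean, effects) pair carry the same information.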
You know how to test $H_0$ and do multiple comparisons of the treatments, so I will not review this material.

Most texts use treatment effects to specify ANOVA models, a convention that I will also follow. A difficulty with this approach is that the treatment effects must be constrained to be uniquely estimable from the data (because the $I$ population means $\mu_i$ are modeled in terms of $I + 1$ parameters: $\mu_i = \mu + \alpha_i$). An infinite number of constraints can be considered each of which gives the same structure on the population means. The standard constraint where the treatment effects sum to zero was used above, but many statistical packages impose the constraint $\alpha_I = 0$ (or sometimes $\alpha_1 = 0$). Although estimates of treatment effects depend on which constraint is chosen, the null and alternative models used with the ANOVA F-test, and pairwise comparisons of treatment effects, do not. I will downplay the discussion of estimating treatment effects to minimize problems.

Chapter 5

Paired Experiments and Randomized Block Experiments

A randomized block design is often used instead of a completely randomized design in studies where there is extraneous variation among the experimental units that may influence the response. A significant amount of the extraneous variation may be removed from the comparison of treatments by partitioning the experimental units into fairly homogeneous subgroups or blocks.

For example, suppose you are interested in comparing the effectiveness of four antibiotics for a bacterial infection. The recovery time after administering an antibiotic may be influenced by a patient's general health, the extent of their infection, or their age. Randomly allocating experimental subjects to the treatments (and then comparing them using a one-way ANOVA) may produce one treatment having a "favorable" sample of patients with features that naturally lead to a speedy recovery. Alternatively, if the characteristics that affect the recovery time are spread across treatments, then the variation within samples due to these uncontrolled features can dominate the effects of the treatment, leading to an inconclusive result.

A better way to design this experiment would be to block the subjects into groups of four patients who are alike as possible on factors other than the treatment that influence the recovery time. The four treatments are then randomly assigned to the patients (one per patient) within a block, and the recovery time measured. The blocking of patients usually produces a more sensitive comparison of treatments than does a completely randomized design because the variation in recovery times due to the blocks is eliminated from the comparison of treatments.

A randomized block design is a paired experiment when two treatments are compared. The usual analysis for a paired experiment is a parametric or non-parametric paired comparison.

Randomized block (RB) designs were developed to account for soil fertility gradients in agricultural experiments. The experimental field would be separated into strips (blocks) of fairly constant fertility. Each strip is partitioned into equal size plots. The treatments, say varieties of corn, are randomly assigned to the plots, with each treatment occurring the same number of times (usually once) per block. All other factors that are known to influence the response would be controlled or fixed by the experimenter. For example, when comparing the mean yields, each plot would receive the same type and amount of fertilizer and the same irrigation plan.

The discussion will be limited to randomized block experiments with one factor. Two or more factors can be used with a randomized block design. For example, the agricultural experiment could be modified to compare four combinations of two corn varieties and two levels of fertilizer in each block instead of the original four varieties. In certain experiments, each experimental unit receives each treatment. The experimental units are "natural" blocks for the analysis.

Example: Comparison of Treatments for Itching

Ten¹ male volunteers between 20 and 30 years old were used as a study group to compare seven treatments (5 drugs, a placebo, and no drug) to relieve itching. Each subject was given a different treatment on seven study days. The time ordering of the treatments was randomized across days. Except on the no-drug day, the subjects were given the treatment intravenously, and then itching was induced on their forearms using an effective itch stimulus called cowage. The subjects recorded the duration of itching, in seconds. The data are given in the table below. From left to right the drugs are: papaverine, morphine, aminophylline, pentobarbital, tripelennamine.

¹ Beecher, 1959

The volunteers in the study were treated as blocks in the analysis. At best, the volunteers might be considered a representative sample of males between the ages of 20 and 30. This limits the extent of inferences from the experiment. The scientists can not, without sound medical justification, extrapolate the results to children or to senior citizens.

#### Example: Itching
itch <- read.csv("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_itch.csv")

   Patient Nodrug Placebo Papv Morp Amino Pento Tripel
1        1    174     263  105  199   141   108    141
2        2    224     213  103  143   168   341    184
3        3    260     231  145  113    78   159    125
4        4    255     291  103  225   164   135    227
5        5    165     168  144  176   127   239    194
6        6    237     121   94  144   114   136    155
7        7    191     137   35   87    96   140    121
8        8    100     102  133  120   222   134    129
9        9    115      89   83  100   165   185     79
10      10    189     433  237  173   168   188    317

5.1 Analysis of a Randomized Block Design

Assume that you designed a randomized block experiment with $I$ blocks and $J$ treatments, where each treatment occurs once in each block. Let $y_{ij}$ be the response for the $j$th treatment within the $i$th block. The model for the experiment is

$y_{ij} = \mu_{ij} + e_{ij}$,

where $\mu_{ij}$ is the population mean response for the $j$th treatment in the $i$th block and $e_{ij}$ is the deviation of the response from the mean. The population means are assumed to satisfy the additive model

$\mu_{ij} = \mu + \alpha_i + \beta_j$.
In the additive model, $\mu$ is a grand mean, $\alpha_i$ is the effect for the $i$th block, and $\beta_j$ is the effect for the $j$th treatment. The responses are assumed to be independent across blocks, normally distributed and with constant variance. The randomized block model does not require the observations within a block to be independent, but does assume that the correlation between responses within a block is identical for each pair of treatments.

The model is sometimes written as

Response = Grand Mean + Treatment Effect + Block Effect + Residual.

Given the data, let $\bar{y}_{i\cdot}$ be the $i$th block sample mean (the average of the responses in the $i$th block), $\bar{y}_{\cdot j}$ be the $j$th treatment sample mean (the average of the responses on the $j$th treatment), and $\bar{y}_{\cdot\cdot}$ be the average response of all $IJ$ observations in the experiment.

An ANOVA table for the randomized block experiment partitions the Model SS into SS for Blocks and Treatments.

Source   df             SS                                                                                    MS
Blocks   $I-1$          $J \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$                                SS/df
Treats   $J-1$          $I \sum_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$                               SS/df
Error    $(I-1)(J-1)$   $\sum_{ij} (y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2$  SS/df
Total    $IJ-1$         $\sum_{ij} (y_{ij} - \bar{y}_{\cdot\cdot})^2$

A primary interest is testing whether the treatment effects are zero: $H_0: \beta_1 = \cdots = \beta_J = 0$. The treatment effects are zero if in each block the population mean responses are identical for each treatment. A formal test of no treatment effects is based on the p-value from the F-statistic $F_{obs}$ = MS Treat/MS Error. The p-value is evaluated in the usual way (i.e., as an upper tail area from an F-distribution with $J - 1$ and $(I - 1)(J - 1)$ df). This $H_0$ is rejected when the treatment averages $\bar{y}_{\cdot j}$ vary significantly relative to the error variation.

A test for no block effects ($H_0: \alpha_1 = \cdots = \alpha_I = 0$) is often a secondary interest, because, if the experiment is designed well, the blocks will be, by construction, noticeably different. There are no block effects if the population mean response for an arbitrary treatment is identical across blocks. A formal test of no block effects is based on the p-value from the F-statistic $F_{obs}$ = MS Blocks/MS Error. This $H_0$ is rejected when the block averages $\bar{y}_{i\cdot}$ vary significantly relative to the error variation.
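The sums of squares in the ANOVA table can be computed directly from the block means, treatment means, and overall mean. The sketch below applies the formulas to the itching durations from the data table in the itching example (rows are patients, columns are treatments in the order Nodrug, Placebo, Papv, Morp, Amino, Pento, Tripel). It is meant as a check on the formulas, not as a replacement for fitting the model.

```r
# Itching durations (seconds): rows = patients (blocks), columns = treatments
y <- matrix(c(174, 263, 105, 199, 141, 108, 141,
              224, 213, 103, 143, 168, 341, 184,
              260, 231, 145, 113,  78, 159, 125,
              255, 291, 103, 225, 164, 135, 227,
              165, 168, 144, 176, 127, 239, 194,
              237, 121,  94, 144, 114, 136, 155,
              191, 137,  35,  87,  96, 140, 121,
              100, 102, 133, 120, 222, 134, 129,
              115,  89,  83, 100, 165, 185,  79,
              189, 433, 237, 173, 168, 188, 317),
            nrow = 10, byrow = TRUE)
I <- nrow(y); J <- ncol(y)

ybar_i <- rowMeans(y)    # block (patient) means
ybar_j <- colMeans(y)    # treatment means
ybar   <- mean(y)        # overall mean

ss_block <- J * sum((ybar_i - ybar)^2)
ss_treat <- I * sum((ybar_j - ybar)^2)
ss_error <- sum((y - outer(ybar_i, ybar_j, "+") + ybar)^2)
ss_total <- sum((y - ybar)^2)

# F-test for treatments: upper tail area with J-1 and (I-1)(J-1) df
f_treat <- (ss_treat / (J - 1)) / (ss_error / ((I - 1) * (J - 1)))
p_treat <- pf(f_treat, J - 1, (I - 1) * (J - 1), lower.tail = FALSE)

print(round(c(Blocks = ss_block, Treats = ss_treat, Error = ss_error, Total = ss_total)))
print(round(c(F = f_treat, p = p_treat), 4))
```

The Blocks, Treats, and Error sums of squares add up to the Total, which is the partition the table describes.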
The randomized block model is easily fitted in the lm() function. Before illustrating the analysis on the itching data, let me mention five important points about randomized block analyses:

1. The F-test p-value for comparing $J = 2$ treatments is identical to the p-value for comparing the two treatments using a paired t-test.

2. The Block SS plus the Error SS is the Error SS from a one-way ANOVA comparing the $J$ treatments. If the Block SS is large relative to the Error SS from the two-factor model, then the experimenter has eliminated a substantial portion of the variation that is used to assess the differences among the treatments. This leads to a more sensitive comparison of treatments than would have been obtained using a one-way ANOVA.

3. The RB model is equivalent to an additive or no interaction model for a two-factor experiment, where the blocks are levels of one of the factors. The analysis of a randomized block experiment under this model is the same analysis used for a two-factor experiment with no replication (one observation per cell). We will discuss the two-factor design soon.

4. Under the sum constraint on the parameters (i.e., $\sum_i \alpha_i = \sum_j \beta_j = 0$), the estimates of the grand mean, block effects, and treatment effects are $\hat{\mu} = \bar{y}_{\cdot\cdot}$, $\hat{\alpha}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}$, and $\hat{\beta}_j = \bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot}$, respectively. The estimated mean response for the $(i, j)$th cell is $\hat{\mu}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j = \bar{y}_{i\cdot} + \bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot}$.

5. The F-test for comparing treatments is appropriate when the responses within a block have the same correlation. This is a reasonable working assumption in many analyses. A multivariate repeated measures model can be used to compare treatments when the constant correlation assumption is unrealistic, for example when the same treatment is given to an individual over time.

RB Analysis of the Itching Data

First we reshape the data to long format so each observation is its own row in the data.frame and indexed by the Patient and Treatment variables.

library(reshape2)
itch.long <- melt(itch
                , id.vars       = "Patient"
                , variable.name = "Treatment"
                , value.name    = "Seconds"
                )
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient  : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds  : int 174 224 260 255 165 237 191 100 115 189 ...
head(itch.long, 3)
##   Patient Treatment Seconds
## 1       1    Nodrug     174
## 2       2    Nodrug     224
## 3       3    Nodrug     260
tail(itch.long, 3)
##    Patient Treatment Seconds
## 68       8    Tripel     129
## 69       9    Tripel      79
## 70      10    Tripel     317
# make Patient a factor variable
itch.long$Patient <- factor(itch.long$Patient)
str(itch.long)
## 'data.frame': 70 obs. of 3 variables:
## $ Patient  : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment: Factor w/ 7 levels "Nodrug","Placebo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Seconds  : int 174 224 260 255 165 237 191 100 115 189 ...

As a first step, I made side-by-side boxplots of the itching durations across treatments. The boxplots are helpful for informally comparing treatments and visualizing the data. The differences in the level of the boxplots will usually be magnified by the F-test for comparing treatments because the variability within the boxplots includes block differences which are moved from the Error SS to the Block SS. The plot also includes the 10 Patients with lines connecting their measurements to see how common the treatment differences were over patients. I admit, this plot is a little too busy.

Each of the five drugs appears to have an effect, compared to the placebo and to no drug. Papaverine appears to be the most effective drug. The placebo and no drug have similar medians. The relatively large spread in the placebo group suggests that some patients responded adversely to the placebo compared to no drug, whereas others responded positively.
# Plot the data using ggplot library(ggplot2) p <. alpha = 0... of 3 variables: ## $ Patient : Factor w/ 10 levels "1"... this plot is a little too busy.2."4"..3) p <. I made side-by-side boxplots of the itching durations across treatments.long."3".long.frame': 70 obs. I admit. The plot also includes the 10 Patients with lines connecting their measurements to see how common the treatment differences were over patients."Placebo". of 3 variables: ## $ Patient : int 1 2 3 4 5 6 7 8 9 10 . alpha = 0. size = 0. position="none") print(p) Comparison of Treatments for Itching. lm.y = mean.p + labs(title = "Comparison of Treatments for Itching.5) points for observed data <. data = itch.8) p <. Treatment means ● 400 Duration of itching (seconds) ● ● 300 ● ● ● ● ● ● ● ● ● ● ● ● ● 200 ● ● ● ● ● ● ● ● ● ● 100 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 Nodrug Placebo Papv Morp Amino Pento Tripel Treatment To fit the RB model in lm(). The response variable Seconds appears to the left of the tilde.lm(Seconds ~ Treatment + Patient.104 Ch 5: Paired Experiments and Randomized Block Experiments # p # p # p # p colored line for each patient <.p + theme(legend.s. alpha = 0.75. and include each to the right of the tilde symbol in the formula statement. colour = Patient). alpha = 0.p + geom_line(aes(group = Patient. geom = "point". size=. you need to specify blocks (Patient) and treatments (Treatment) as factor variables.data = "mean_cl_normal".p + geom_boxplot(size = 0. Treatment means") p <.5) # confidence limits based on normal distribution p <. shape = 18. alpha = 0.p <.p + stat_summary(fun.p + geom_point(aes(colour = Patient)) diamond at mean for each group <. aes(colour=Treatment).5) boxplot.long) library(car) .75 to stand out behind CI <.t. geom = "errorbar". alpha = 0.2. size = 6.p + stat_summary(fun. width = .p + ylab("Duration of itching (seconds)") # removes legend p <. ' 0.76 0.56 0.1e-09 *** Treatment 53013 6 2.86 29. 
    Anova(lm.s.t.p, type=3)
    ## Anova Table (Type III tests)
    ##
    ## Response: Seconds
    ##             Sum Sq Df F value  Pr(>F)
    ## (Intercept) 155100  1   50.11 3.1e-09 ***
    ## Treatment    53013  6    2.85  0.0173 *
    ## Patient     103280  9    3.71  0.0011 **
    ## Residuals   167130 54
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm.s.t.p)
    ##
    ## Call:
    ## lm(formula = Seconds ~ Treatment + Patient, data = itch.long)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -81.29 -34.80  -8.39  30.90 148.91
    ##
    ## Coefficients:
    ##                  Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)        188.29      26.60    7.08  3.1e-09 ***
    ## TreatmentPlacebo    13.80      24.88    0.55   0.5814
    ## TreatmentPapv      -72.80      24.88   -2.93   0.0050 **
    ## TreatmentMorp      -43.00      24.88   -1.73   0.0896 .
    ## TreatmentAmino     -46.70      24.88   -1.88   0.0659 .
    ## TreatmentPento     -14.50      24.88   -0.58   0.5625
    ## TreatmentTripel    -23.80      24.88   -0.96   0.3430
    ## Patient2            35.00      29.74    1.18   0.2444
    ## Patient3            -2.86      29.74   -0.10   0.9238
    ## Patient4            38.43      29.74    1.29   0.2018
    ## Patient5            11.71      29.74    0.39   0.6952
    ## Patient6           -18.57      29.74   -0.62   0.5349
    ## Patient7           -46.29      29.74   -1.56   0.1254
    ## Patient8           -27.29      29.74   -0.92   0.3629
    ## Patient9           -45.00      29.74   -1.51   0.1360
    ## Patient10           82.00      29.74    2.76   0.0079 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 55.6 on 54 degrees of freedom
    ## Multiple R-squared:  0.483, Adjusted R-squared:  0.34
    ## F-statistic: 3.37 on 15 and 54 DF,  p-value: 0.00052

The order to look at output follows the hierarchy of multi-parameter tests down to single-parameter tests.

1. The F-test at the bottom of the summary() tests for both no block effects and no treatment effects. If there are no block effects and no treatment effects then the mean itching time is independent of treatment and patients. The p-value of 0.0005 strongly suggests that the population mean itching times are not all equal.

2. The ANOVA table at top from Anova() partitions the Model SS into the SS for Blocks (Patients) and Treatments. The F-statistics and p-values for testing these effects are given (each Mean Square is the SS divided by its Df). For a RB design with the same number of responses per block (i.e., no missing data), the Type I and Type III SS are identical, and correspond to the formulas given earlier. The distinction between Type I and Type III SS is important for unbalanced problems, an issue we discuss later. The F-tests show significant differences among the treatments (p-value=0.017) and among patients (p-value=0.001).

3. The individual parameter (coefficient) estimates in the summary() are likely of less interest since they test differences from the baseline group, only. The multiple comparisons in the next section will indicate which factor levels are different from others.

Multiple comparisons

Multiple comparisons and contrasts are not typically straightforward in R, though some newer packages are helping make them easier. Below I show one way that I think is relatively easy. The package multcomp is used to specify which factor to perform multiple comparisons over and which p-value adjustment method to use. Below I use Tukey adjustments, first.

    # multcomp has functions for multiple comparisons
    library(multcomp)
    ## Loading required package: mvtnorm
    ## Loading required package: TH.data
    ##
    ## Attaching package: 'multcomp'
    ##
    ## The following object is masked by '.GlobalEnv':
    ##
    ##     waste
    # Use the ANOVA object and run a "General Linear Hypothesis Test"
    # specifying a linfct (linear function) to be tested.
    # The mcp (multiple comparison) specifies the factor and method.
    # Here: correcting over Treatment using Tukey contrast corrections.
    glht.itch.t <- glht(aov(lm.s.t.p), linfct = mcp(Treatment = "Tukey"))
    summary(glht.itch.t)
    ##
    ##   Simultaneous Tests for General Linear Hypotheses
    ##
    ## Multiple Comparisons of Means: Tukey Contrasts
    ##
    ## Fit: aov(formula = lm.s.t.p)
    ##
    ## Linear Hypotheses:
    ##                       Estimate Std. Error t value Pr(>|t|)
    ## Placebo - Nodrug == 0     13.8       24.9    0.55    0.998
    ## Papv - Nodrug == 0       -72.8       24.9   -2.93    0.070 .
    ## Morp - Nodrug == 0       -43.0       24.9   -1.73    0.600
    ## Amino - Nodrug == 0      -46.7       24.9   -1.88    0.504
    ## Pento - Nodrug == 0      -14.5       24.9   -0.58    0.997
    ## Tripel - Nodrug == 0     -23.8       24.9   -0.96    0.961
    ## Papv - Placebo == 0      -86.6       24.9   -3.48    0.016 *
    ## Morp - Placebo == 0      -56.8       24.9   -2.28    0.271
    ## Amino - Placebo == 0     -60.5       24.9   -2.43    0.205
    ## Pento - Placebo == 0     -28.3       24.9   -1.14    0.913
    ## Tripel - Placebo == 0    -37.6       24.9   -1.51    0.737
    ## Morp - Papv == 0          29.8       24.9    1.20    0.892
    ## Amino - Papv == 0         26.1       24.9    1.05    0.940
    ## Pento - Papv == 0         58.3       24.9    2.34    0.243
    ## Tripel - Papv == 0        49.0       24.9    1.97    0.446
    ## Amino - Morp == 0         -3.7       24.9   -0.15    1.000
    ## Pento - Morp == 0         28.5       24.9    1.15    0.911
    ## Tripel - Morp == 0        19.2       24.9    0.77    0.987
    ## Pento - Amino == 0        32.2       24.9    1.29    0.852
    ## Tripel - Amino == 0       22.9       24.9    0.92    0.968
    ## Tripel - Pento == 0       -9.3       24.9   -0.37    1.000
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## (Adjusted p values reported -- single-step method)

With summary(), the p-value adjustment can be coerced into one of several popular methods, such as Bonferroni. Notice that the significance is lower (larger p-value) for Bonferroni below than Tukey above. Note the comment at the bottom of the output: "(Adjusted p values reported -- bonferroni method)". Passing the summary to plot() will create a plot of the pairwise intervals for differences between factor levels.

Recall how the Bonferroni correction works. A comparison of c pairs of levels from one factor having a family error rate of 0.05 or less is attained by comparing pairs of treatments at the 0.05/c level. Using this criterion, the population mean responses for two factor levels (averaged over the other factor) are significantly different if the p-value for the test is 0.05/c or less. The output actually adjusts the p-values by reporting p-value×c (capped at 1), so that the reported adjusted p-value can be compared to the 0.05 significance level. Here c = 21 pairs of treatments.

    summary(glht.itch.t, test = adjusted("bonferroni"))
    ##
    ##   Simultaneous Tests for General Linear Hypotheses
    ##
    ## Multiple Comparisons of Means: Tukey Contrasts
    ##
    ## Fit: aov(formula = lm.s.t.p)
    ##
    ## Linear Hypotheses:
    ##                       Estimate Std. Error t value Pr(>|t|)
    ## Placebo - Nodrug == 0     13.8       24.9    0.55    1.000
    ## Papv - Nodrug == 0       -72.8       24.9   -2.93    0.105
    ## Morp - Nodrug == 0       -43.0       24.9   -1.73    1.000
    ## Amino - Nodrug == 0      -46.7       24.9   -1.88    1.000
    ## Pento - Nodrug == 0      -14.5       24.9   -0.58    1.000
    ## Tripel - Nodrug == 0     -23.8       24.9   -0.96    1.000
    ## Papv - Placebo == 0      -86.6       24.9   -3.48    0.021 *
    ## Morp - Placebo == 0      -56.8       24.9   -2.28    0.554
    ## Amino - Placebo == 0     -60.5       24.9   -2.43    0.386
    ## Pento - Placebo == 0     -28.3       24.9   -1.14    1.000
    ## Tripel - Placebo == 0    -37.6       24.9   -1.51    1.000
    ## Morp - Papv == 0          29.8       24.9    1.20    1.000
    ## Amino - Papv == 0         26.1       24.9    1.05    1.000
    ## Pento - Papv == 0         58.3       24.9    2.34    0.479
    ## Tripel - Papv == 0        49.0       24.9    1.97    1.000
    ## Amino - Morp == 0         -3.7       24.9   -0.15    1.000
    ## Pento - Morp == 0         28.5       24.9    1.15    1.000
    ## Tripel - Morp == 0        19.2       24.9    0.77    1.000
    ## Pento - Amino == 0        32.2       24.9    1.29    1.000
    ## Tripel - Amino == 0       22.9       24.9    0.92    1.000
    ## Tripel - Pento == 0       -9.3       24.9   -0.37    1.000
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## (Adjusted p values reported -- bonferroni method)
    # plot the summary
    op <- par(no.readonly = TRUE) # the whole list of settable par's.
    # make wider left margin to fit contrast labels
    par(mar = c(5, 10, 4, 2) + 0.1) # order is c(bottom, left, top, right)
    # plot bonferroni-corrected difference intervals
    plot(summary(glht.itch.t, test = adjusted("bonferroni"))
       , sub="Bonferroni-adjusted Treatment contrasts")
    par(op) # reset plotting options
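As a rough check on how the Bonferroni adjustment described above works, the following sketch applies base R's p.adjust() to a handful of made-up unadjusted p-values (hypothetical numbers, not taken from the itching analysis). It illustrates that each adjusted value is the raw p-value multiplied by the number of comparisons c, capped at 1, and that comparing adjusted p-values to 0.05 gives the same decisions as comparing raw p-values to 0.05/c.

```r
# Hypothetical unadjusted p-values for c = 5 pairwise comparisons
# (illustrative only; not from the itching data).
p.raw  <- c(0.001, 0.005, 0.018, 0.054, 0.60)
n.comp <- length(p.raw)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1.
p.bonf.hand <- pmin(1, p.raw * n.comp)

# The same adjustment using base R.
p.bonf <- p.adjust(p.raw, method = "bonferroni")

# Comparing adjusted p-values to 0.05 is equivalent to
# comparing raw p-values to 0.05 / c.
reject.adj <- p.bonf <= 0.05
reject.raw <- p.raw  <= 0.05 / n.comp

print(data.frame(p.raw, p.bonf, reject.adj))
```

In the itching example the same logic applies with c = 21 treatment pairs.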
[Figure: 95% family-wise confidence level intervals for the 21 pairwise Treatment differences (Placebo - Nodrug through Tripel - Pento). x-axis: Linear Function, -150 to 100; subtitle: Bonferroni-adjusted Treatment contrasts.]

The Bonferroni comparisons for Treatment suggest that papaverine induces a lower mean itching time than placebo. None of the other treatment comparisons is significant. The comparison of Patient blocks is of less interest.

    ### Code for the less interesting contrasts.
    ### Testing multiple factors may be of interest in other problems.
    ### Note that the first block of code below corrects the p-values
    ### for all the tests done for both factors together,
    ### that is, the Bonferroni-corrected significance level is (alpha / (t + p))
    ### where t = number of treatment comparisons
    ### and p = number of patient comparisons.
    # # correcting over Treatment and Patient
    # glht.itch.tp <- glht(aov(lm.s.t.p), linfct = mcp(Treatment = "Tukey"
    #                                                  , Patient = "Tukey"))
    # summary(glht.itch.tp, test = adjusted("bonferroni"))
    # plot(summary(glht.itch.tp, test = adjusted("bonferroni")))
    # # correcting over Patient, only
    # glht.itch.p <- glht(aov(lm.s.t.p), linfct = mcp(Patient = "Tukey"))
    # summary(glht.itch.p, test = adjusted("bonferroni"))
    # plot(summary(glht.itch.p, test = adjusted("bonferroni")))

Diagnostic Analysis for the RB Analysis

A diagnostic analysis of ANOVA-type models, including the RB model, is easily performed using the lm() output.
The normal quantile plot (QQ-plot) shows the residual distribution is slightly skewed to the right, in part due to three cases that are not fitted well by the model (the outliers in the boxplots). Except for these cases, which are also the most influential cases (Cook's distance), the plot of the studentized residuals against fitted values shows no gross abnormalities.

    # plot diagnostics
    par(mfrow=c(2,3))
    plot(lm.s.t.p, which = c(1,4,6))
    plot(itch.long$Treatment, lm.s.t.p$residuals, main="Residuals vs Treatment")
    # horizontal line at zero
    abline(h = 0, col = "gray75")
    plot(itch.long$Patient, lm.s.t.p$residuals, main="Residuals vs Patient")
    # horizontal line at zero
    abline(h = 0, col = "gray75")
    # Normality of Residuals
    library(car)
    # (newer versions of car use id = list(n = 3) in place of id.n = 3)
    qqPlot(lm.s.t.p$residuals, las = 1, id.n = 3, main="QQ Plot")
    ## 20 52 48
    ## 70 69 68
    ## residuals vs order of data
    #plot(lm.s.t.p$residuals, main="Residuals vs Order of data")
    # # horizontal line at zero
    # abline(h = 0, col = "gray75")

[Figure: six diagnostic panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs Treatment, Residuals vs Patient, and QQ Plot. Observations 20, 48, and 52 are flagged.]

Although the F-test for comparing treatments is not overly sensitive to modest deviations from normality, I will present a non-parametric analysis as a backup, to see whether similar conclusions are reached about the treatments.

Non-parametric Analysis of a RB Experiment

Milton Friedman developed a non-parametric test for comparing treatments in an unreplicated randomized block design where the normality assumption may be violated. An unreplicated complete block design has exactly one observation in y for each combination of levels of groups and blocks. The null hypothesis is that apart from an effect of blocks, the location parameter of y is the same in each of the groups. The output suggests significant differences among treatments, which supports the earlier conclusion.

    # Friedman test for differences between groups conditional on blocks.
    # The formula is of the form a ~ b | c,
    # where a, b and c give the data values (a)
    # and corresponding groups (b) and blocks (c), respectively.
    friedman.test(Seconds ~ Treatment | Patient, data = itch.long)
    ##
    ##  Friedman rank sum test
    ##
    ## data:  Seconds and Treatment and Patient
    ## Friedman chi-squared = 14.89, df = 6, p-value = 0.02115
    # Quade test is very similar to the Friedman test (compare the help pages).
    quade.test(Seconds ~ Treatment | Patient, data = itch.long)
    ##
    ##  Quade test
    ##
    ## data:  Seconds and Treatment and Patient
    ## Quade F = 3.732, num df = 6, denom df = 54, p-value = 0.003542

5.2 Extending the One-Factor Design to Multiple Factors

The CRD (completely randomized design) for comparing insecticides below varies the levels of one factor (insecticide), while controlling other factors that influence survival time. The inferences from the one-way ANOVA apply to beetles with a given age from the selected strain that might be given the selected concentration of the insecticides. Any generalization of the conclusions to other situations must be justified scientifically, typically through further experimentation. There are several ways to broaden the scope of the study.
For example, several strains of beetles or several concentrations of the insecticide might be used. For simplicity, consider a simple two-factor experiment where three concentrations (Low, Medium, and High) are applied with each of the four insecticides. This is a completely crossed two-factor experiment where each of the 4 × 3 = 12 combinations of the two factors (insecticide and dose) is included in the comparison of survival times. With this experiment, the scientist can compare insecticides, compare concentrations, and check for an interaction between dose and insecticide.

Assuming that 48 beetles are available, the scientist would randomly assign them to the 12 experimental groups, giving prespecified numbers of beetles to the 12 groups. For simplicity, assume that the experiment is balanced, that is, the same number of beetles (4) is assigned to each group (12 × 4 = 48). This is a CRD with two factors.

5.2.1 Example: Beetle insecticide two-factor design

The data below were collected using the experimental design just described. The table gives survival times of groups of four beetles randomly allocated to twelve treatment groups obtained by crossing the levels of four insecticides (A, B, C, D) at each of three concentrations of the insecticides (1=Low, 2=Medium, 3=High). This is a balanced 4-by-3 factorial design (two-factor design) that is replicated four times. The unit of measure for the survival times is 10 hours, that is, 0.3 is a survival time of 3 hours.

    #### Example: Beetles
    beetles <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_beetles.dat"
                        , header = TRUE)
    # make dose a factor variable and label the levels
    beetles$dose <- factor(beetles$dose, labels = c("low","medium","high"))

         dose insecticide     t1     t2     t3     t4
     1    low           A 0.3100 0.4500 0.4600 0.4300
     2    low           B 0.8200 1.1000 0.8800 0.7200
     3    low           C 0.4300 0.4500 0.6300 0.7600
     4    low           D 0.4500 0.7100 0.6600 0.6200
     5 medium           A 0.3600 0.2900 0.4000 0.2300
     6 medium           B 0.9200 0.6100 0.4900 1.2400
     7 medium           C 0.4400 0.3500 0.3100 0.4000
     8 medium           D 0.5600 1.0200 0.7100 0.3800
     9   high           A 0.2200 0.2100 0.1800 0.2300
    10   high           B 0.3000 0.3700 0.3800 0.2900
    11   high           C 0.2300 0.2500 0.2400 0.2200
    12   high           D 0.3000 0.3600 0.3100 0.3300

First we reshape the data to long format so each observation is its own row in the data.frame and indexed by the dose and insecticide variables.

    library(reshape2)
    beetles.long <- melt(beetles
                  , id.vars       = c("dose", "insecticide")
                  , variable.name = "number"
                  , value.name    = "hours10"
                  )
    str(beetles.long)
    ## 'data.frame': 48 obs. of 4 variables:
    ##  $ dose       : Factor w/ 3 levels "low","medium",..: 1 1 1 1 2 2 2 2 3 3 ...
    ##  $ insecticide: Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
    ##  $ number     : Factor w/ 4 levels "t1","t2","t3",..: 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ hours10    : num 0.31 0.82 0.43 0.45 0.36 0.92 0.44 0.56 0.22 0.3 ...
    head(beetles.long)
    ##     dose insecticide number hours10
    ## 1    low           A     t1    0.31
    ## 2    low           B     t1    0.82
    ## 3    low           C     t1    0.43
    ## 4    low           D     t1    0.45
    ## 5 medium           A     t1    0.36
    ## 6 medium           B     t1    0.92

The basic unit of analysis is the cell means, which are the averages of the 4 observations in each of the 12 treatment combinations. For example, in the table below, the sample mean survival for the 4 beetles given a low dose (dose=1) of insecticide A is 0.413. From the cell means we obtain the dose and insecticide marginal means by averaging over the levels of the other factor. For example, the marginal mean for insecticide A is the average of the cell means for the 3 treatment combinations involving insecticide A: 0.314 = (0.413 + 0.320 + 0.210)/3.
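The cell and marginal means just described can be computed in a few lines of base R. This is a minimal sketch, assuming the survival times are typed in from the data table above rather than read from the course URL; the object names cell.mean, insect.marg, dose.marg, and grand.mean are illustrative and not from the notes.

```r
# Survival times (units of 10 hours) copied from the data table above.
beetles <- data.frame(
  dose        = factor(rep(c("low", "medium", "high"), each = 4),
                       levels = c("low", "medium", "high")),
  insecticide = factor(rep(c("A", "B", "C", "D"), times = 3)),
  t1 = c(0.31, 0.82, 0.43, 0.45, 0.36, 0.92, 0.44, 0.56, 0.22, 0.30, 0.23, 0.30),
  t2 = c(0.45, 1.10, 0.45, 0.71, 0.29, 0.61, 0.35, 1.02, 0.21, 0.37, 0.25, 0.36),
  t3 = c(0.46, 0.88, 0.63, 0.66, 0.40, 0.49, 0.31, 0.71, 0.18, 0.38, 0.24, 0.31),
  t4 = c(0.43, 0.72, 0.76, 0.62, 0.23, 1.24, 0.40, 0.38, 0.23, 0.29, 0.22, 0.33)
)

# Cell means: average the 4 replicates in each row,
# then arrange them as a dose-by-insecticide table.
cell.mean <- tapply(rowMeans(beetles[, c("t1", "t2", "t3", "t4")]),
                    list(dose = beetles$dose, insecticide = beetles$insecticide),
                    mean)

# Marginal means: average the cell means over the levels of the other factor.
insect.marg <- colMeans(cell.mean)  # one per insecticide
dose.marg   <- rowMeans(cell.mean)  # one per dose
grand.mean  <- mean(cell.mean)      # average of all 12 cell means

print(round(cell.mean, 3))
print(round(insect.marg, 3))
print(round(dose.marg, 3))
```

Because the design is balanced, averaging cell means gives the same marginal means as averaging the raw observations directly.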
Cell Means

                        Insecticide
    Dose              A      B      C      D   Dose marg
    1             0.413  0.880  0.568  0.610       0.618
    2             0.320  0.815  0.375  0.668       0.544
    3             0.210  0.335  0.235  0.325       0.276
    Insect marg   0.314  0.677  0.393  0.534       0.479

Because the experiment is balanced, a marginal mean is the average of all observations that receive a given treatment. For example, the marginal mean for insecticide A is the average survival time for the 12 beetles given insecticide A.

Looking at the table of means, the insecticides have noticeably different mean survival times averaged over doses, with insecticide A having the lowest mean survival time averaged over doses. Similarly, higher doses tend to produce lower survival times. A more formal approach to analyzing the table of means is given in the next section.

5.2.2 The Interaction Model for a Two-Factor Experiment

Assume that you designed a balanced two-factor experiment with $K$ responses at each combination of the $I$ levels of factor 1 (F1) with the $J$ levels of factor 2 (F2). The total number of responses is $KIJ$, or $K$ times the $IJ$ treatment combinations.

Let $y_{ijk}$ be the $k$th response at the $i$th level of F1 and the $j$th level of F2. A generic model for the experiment expresses $y_{ijk}$ as a mean response plus a residual:

$$y_{ijk} = \mu_{ij} + e_{ijk},$$

where $\mu_{ij}$ is the population mean response for the treatment defined by the $i$th level of F1 combined with the $j$th level of F2. As in a one-way ANOVA, the responses within and across treatment groups are assumed to be independent, normally distributed, and have constant variance.

The interaction model expresses the population means as

$$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij},$$

where $\mu$ is a grand mean, $\alpha_i$ is the effect for the $i$th level of F1, $\beta_j$ is the effect for the $j$th level of F2, and $(\alpha\beta)_{ij}$ is the interaction between the $i$th level of F1 and the $j$th level of F2. (Note that $(\alpha\beta)$ is an individual term distinct from $\alpha$ and $\beta$; $(\alpha\beta)$ is not their product.) The model is often written

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk},$$

meaning

Response = Grand Mean + F1 effect + F2 effect + F1-by-F2 interaction + Residual.

The additive model having only main effects, no interaction terms, is

$$y_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk},$$

meaning

Response = Grand Mean + F1 effect + F2 effect + Residual.

The effects of F1 and F2 on the mean are additive.

Defining effects from cell means

The effects that define the population means and the usual hypotheses of interest can be formulated from the table of population means, given here.

$$
\begin{array}{c|cccc|c}
 & \multicolumn{4}{c|}{\text{Level of F2}} & \\
\text{Level of F1} & 1 & 2 & \cdots & J & \text{F1 marg} \\ \hline
1 & \mu_{11} & \mu_{12} & \cdots & \mu_{1J} & \bar{\mu}_{1\cdot} \\
2 & \mu_{21} & \mu_{22} & \cdots & \mu_{2J} & \bar{\mu}_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
I & \mu_{I1} & \mu_{I2} & \cdots & \mu_{IJ} & \bar{\mu}_{I\cdot} \\ \hline
\text{F2 marg} & \bar{\mu}_{\cdot 1} & \bar{\mu}_{\cdot 2} & \cdots & \bar{\mu}_{\cdot J} & \bar{\mu}_{\cdot\cdot}
\end{array}
$$

The F1 marginal population means are averages within rows (over columns) of the table:

$$\bar{\mu}_{i\cdot} = \frac{1}{J} \sum_c \mu_{ic}.$$

The F2 marginal population means are averages within columns (over rows):

$$\bar{\mu}_{\cdot j} = \frac{1}{I} \sum_r \mu_{rj}.$$

The overall or grand population mean is the average of the cell means

$$\bar{\mu}_{\cdot\cdot} = \frac{1}{IJ} \sum_{rc} \mu_{rc} = \frac{1}{I} \sum_i \bar{\mu}_{i\cdot} = \frac{1}{J} \sum_j \bar{\mu}_{\cdot j}.$$

Using this notation, the effects in the interaction model are $\mu = \bar{\mu}_{\cdot\cdot}$, $\alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot\cdot}$, $\beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot\cdot}$, and $(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu}_{\cdot\cdot}$. The effects sum to zero:

$$\sum_i \alpha_i = \sum_j \beta_j = \sum_{ij} (\alpha\beta)_{ij} = 0,$$

and satisfy $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ (i.e., the cell mean is the sum of effects), as required under the model.

The F1 and F2 effects are analogous to treatment effects in a one-factor experiment, except that here the treatment means are averaged over the levels of the other factor. The interaction effect will be interpreted later.

Estimating effects from the data

Let

$$\bar{y}_{ij} = \frac{1}{K} \sum_k y_{ijk} \quad \text{and} \quad s^2_{ij}$$

be the sample mean and variance, respectively, for the $K$ responses at the $i$th level of F1 and the $j$th level of F2. Inferences about the population means are based on the table of sample means:

$$
\begin{array}{c|cccc|c}
 & \multicolumn{4}{c|}{\text{Level of F2}} & \\
\text{Level of F1} & 1 & 2 & \cdots & J & \text{F1 marg} \\ \hline
1 & \bar{y}_{11} & \bar{y}_{12} & \cdots & \bar{y}_{1J} & \bar{y}_{1\cdot} \\
2 & \bar{y}_{21} & \bar{y}_{22} & \cdots & \bar{y}_{2J} & \bar{y}_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
I & \bar{y}_{I1} & \bar{y}_{I2} & \cdots & \bar{y}_{IJ} & \bar{y}_{I\cdot} \\ \hline
\text{F2 marg} & \bar{y}_{\cdot 1} & \bar{y}_{\cdot 2} & \cdots & \bar{y}_{\cdot J} & \bar{y}_{\cdot\cdot}
\end{array}
$$
The F1 marginal sample means are averages within rows of the table:

$$\bar{y}_{i\cdot} = \frac{1}{J} \sum_c \bar{y}_{ic}.$$

The F2 marginal sample means are averages within columns:

$$\bar{y}_{\cdot j} = \frac{1}{I} \sum_r \bar{y}_{rj}.$$

Finally, $\bar{y}_{\cdot\cdot}$ is the average of the cell sample means:

$$\bar{y}_{\cdot\cdot} = \frac{1}{IJ} \sum_{ij} \bar{y}_{ij} = \frac{1}{I} \sum_i \bar{y}_{i\cdot} = \frac{1}{J} \sum_j \bar{y}_{\cdot j}.$$

The sample sizes in each of the $IJ$ treatment groups are equal ($K$), so $\bar{y}_{i\cdot}$ is the sample average of all responses at the $i$th level of F1, $\bar{y}_{\cdot j}$ is the sample average of all responses at the $j$th level of F2, and $\bar{y}_{\cdot\cdot}$ is the average response in the experiment.

Under the interaction model, the estimated population mean for the $(i, j)$th cell is the observed cell mean: $\hat{\mu}_{ij} = \bar{y}_{ij}$. This can be partitioned into estimated effects

$$
\begin{array}{rcll}
\hat{\mu} & = & \bar{y}_{\cdot\cdot} & \text{the estimated grand mean} \\
\hat{\alpha}_i & = & \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot} & \text{the estimated F1 effect, } i = 1, 2, \ldots, I \\
\hat{\beta}_j & = & \bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot} & \text{the estimated F2 effect, } j = 1, 2, \ldots, J \\
\widehat{(\alpha\beta)}_{ij} & = & \bar{y}_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot} & \text{the estimated cell interaction}
\end{array}
\qquad (5.1)
$$

that satisfy

$$\hat{\mu}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}.$$

The ANOVA table

The ANOVA table for a balanced two-factor design decomposes the total variation in the data, as measured by the Total SS, into components that measure the variation of marginal sample means for F1 and F2 (the F1 SS and F2 SS), a component that measures the degree to which the factors interact (the F1-by-F2 Interaction SS), and a component that pools the sample variances across the $IJ$ samples (the Error SS). Each SS has a df, given in the following ANOVA table. As usual, the MS for each source of variation is the corresponding SS divided by the df. The MS Error estimates the common population variance for the $IJ$ treatments.

$$
\begin{array}{llll}
\text{Source} & \text{df} & \text{SS} & \text{MS} \\ \hline
\text{F1} & I - 1 & KJ \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 & \text{MS F1} = \text{SS/df} \\
\text{F2} & J - 1 & KI \sum_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2 & \text{MS F2} = \text{SS/df} \\
\text{Interaction} & (I - 1)(J - 1) & K \sum_{ij} (\bar{y}_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot})^2 & \text{MS Inter} = \text{SS/df} \\
\text{Error} & IJ(K - 1) & (K - 1) \sum_{ij} s^2_{ij} & \text{MSE} = \text{SS/df} \\ \hline
\text{Total} & IJK - 1 & \sum_{ijk} (y_{ijk} - \bar{y}_{\cdot\cdot})^2 &
\end{array}
$$

The standard tests in the two-factor analysis, and their interpretations, are:

The test of no F1 effect: $H_0: \alpha_1 = \cdots = \alpha_I = 0$ is equivalent to testing $H_0: \bar{\mu}_{1\cdot} = \bar{\mu}_{2\cdot} = \cdots = \bar{\mu}_{I\cdot}$. The absence of an F1 effect implies that each level of F1 has the same population mean response when the means are averaged over levels of F2. The test for no F1 effect is based on MS F1/MS Error, which is compared to the upper tail of an F-distribution with numerator and denominator df of $I - 1$ and $IJ(K - 1)$, respectively. $H_0$ is rejected when the F1 marginal means $\bar{y}_{i\cdot}$ vary significantly relative to the within sample variation. Equivalently, $H_0$ is rejected when the sum of squared F1 effects (between sample variation) is large relative to the within sample variation.

The test of no F2 effect: $H_0: \beta_1 = \cdots = \beta_J = 0$ is equivalent to testing $H_0: \bar{\mu}_{\cdot 1} = \bar{\mu}_{\cdot 2} = \cdots = \bar{\mu}_{\cdot J}$. The absence of an F2 effect implies that each level of F2 has the same population mean response when the means are averaged over levels of F1. The test for no F2 effect is based on MS F2/MS Error, which is compared to an F-distribution with numerator and denominator df of $J - 1$ and $IJ(K - 1)$, respectively. $H_0$ is rejected when the F2 marginal means $\bar{y}_{\cdot j}$ vary significantly relative to the within sample variation. Equivalently, $H_0$ is rejected when the sum of squared F2 effects (between sample variation) is large relative to the within sample variation.

The test of no interaction: $H_0: (\alpha\beta)_{ij} = 0$ for all $i$ and $j$ is based on MS Interact/MS Error, which is compared to an F-distribution with numerator and denominator df of $(I - 1)(J - 1)$ and $IJ(K - 1)$, respectively.
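To make the ANOVA table formulas concrete, the sketch below computes the four sums of squares directly from cell means and cell variances for the beetle data of Section 5.2.1, and compares them with the sequential table from anova(lm()). It assumes dose plays the role of F1 and insecticide the role of F2 (an arbitrary labeling), and types the data in by hand; the object names are illustrative.

```r
# Beetle survival times (units of 10 hours), 4 replicates per cell,
# cells ordered insecticide A, B, C, D within dose low, medium, high.
hours10 <- c(
  0.31, 0.45, 0.46, 0.43,   0.82, 1.10, 0.88, 0.72,   # low:    A, B
  0.43, 0.45, 0.63, 0.76,   0.45, 0.71, 0.66, 0.62,   # low:    C, D
  0.36, 0.29, 0.40, 0.23,   0.92, 0.61, 0.49, 1.24,   # medium: A, B
  0.44, 0.35, 0.31, 0.40,   0.56, 1.02, 0.71, 0.38,   # medium: C, D
  0.22, 0.21, 0.18, 0.23,   0.30, 0.37, 0.38, 0.29,   # high:   A, B
  0.23, 0.25, 0.24, 0.22,   0.30, 0.36, 0.31, 0.33)   # high:   C, D
dose        <- factor(rep(c("low", "medium", "high"), each = 16),
                      levels = c("low", "medium", "high"))
insecticide <- factor(rep(rep(c("A", "B", "C", "D"), each = 4), times = 3))

I <- nlevels(dose); J <- nlevels(insecticide); K <- 4

ybar.ij <- tapply(hours10, list(dose, insecticide), mean)  # cell means
s2.ij   <- tapply(hours10, list(dose, insecticide), var)   # cell variances
ybar.i  <- rowMeans(ybar.ij)                               # F1 marginal means
ybar.j  <- colMeans(ybar.ij)                               # F2 marginal means
ybar    <- mean(ybar.ij)                                   # grand mean

# Estimated interaction effects: ybar.ij - ybar.i - ybar.j + ybar
ab.hat <- sweep(sweep(ybar.ij, 1, ybar.i), 2, ybar.j) + ybar

SS <- c(F1          = K * J * sum((ybar.i - ybar)^2),
        F2          = K * I * sum((ybar.j - ybar)^2),
        Interaction = K * sum(ab.hat^2),
        Error       = (K - 1) * sum(s2.ij))
df <- c(I - 1, J - 1, (I - 1) * (J - 1), I * J * (K - 1))
MS <- SS / df
F.stat <- MS[1:3] / MS[["Error"]]

# Reference: sequential ANOVA table from the fitted interaction model.
fit <- anova(lm(hours10 ~ dose * insecticide))
print(cbind(df, SS, MS))
print(F.stat)
```

For a balanced design such as this one, the hand-computed SS should agree with the table from anova() up to rounding, and the four components add up to the Total SS.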
The interaction model places no restrictions on the population means $\mu_{ij}$. Since the population means can be arbitrary, the interaction model can be viewed as a one factor model with $IJ$ treatments. One connection between the two ways of viewing the two-factor analysis is that the F1, F2, and Interaction SS for the two-way interaction model sum to the Treatment or Model SS for comparing the $IJ$ treatments. The Error SS for the two-way interaction model is identical to the Error SS for a one-way ANOVA of the $IJ$ treatments. An overall test of no differences in the $IJ$ population means is part of the two-way analysis.

I always summarize the data using the cell and marginal means instead of the estimated effects, primarily because means are the basic building blocks for the analysis. My discussion of the model and tests emphasizes both approaches to help you make the connection with the two ways this material is often presented in texts.

Understanding interaction

To understand interaction, suppose you (conceptually) plot the means in each row of the population table, giving what is known as the population mean profile plot. The F1 marginal population means average the population means within the F1 profiles. At each F2 level, the F2 marginal mean averages the population cell means across F1 profiles.

No interaction is present if the plot has perfectly parallel F1 profiles, as in the plot below for a 3 × 5 experiment. The levels of F1 and F2 do not interact. That is,

parallel profiles
⇔ $\mu_{ij} - \mu_{hj}$ is independent of $j$ for each $i$ and $h$
   (difference between levels of F1 does not depend on level of F2)
⇔ $\mu_{ij} - \bar{\mu}_{i\cdot} = \mu_{hj} - \bar{\mu}_{h\cdot}$ for all $i, j, h$
   (difference between level of F2 $j$ and F2 mean does not depend on level of F1)
⇔ $\mu_{ij} - \bar{\mu}_{i\cdot} = \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot\cdot}$ for all $i, j$
   (difference between level of F2 $j$ and F2 mean is the same for all levels of F1)
⇔ $\mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu}_{\cdot\cdot} = 0$ for all $i, j$
   (interaction effect is 0)
⇔ $(\alpha\beta)_{ij} = 0$ for all $i, j$
   (interaction effect is 0)
⇔ no interaction term in model.

[Figure: population mean profile plot for a 3 × 5 experiment with parallel profiles for F1=1, F1=2, F1=3. x-axis: Level of Factor 2 (1 to 5); y-axis: Population Mean.]

Interaction is present if the profiles are not perfectly parallel. An example of a profile plot for a two-factor experiment (3 × 5) with interaction is given below.

[Figure: population mean profile plot for a 3 × 5 experiment with non-parallel profiles for F1=1, F1=2, F1=3. x-axis: Level of Factor 2 (1 to 5); y-axis: Population Mean.]

The roles of F1 and F2 can be reversed in these plots without changing the assessment of a presence or absence of interaction. It is often helpful to view the interaction plot from both perspectives.

A qualitative check for interaction can be based on the sample means profile plot, but keep in mind that profiles of sample means are never perfectly parallel even when the factors do not interact in the population. The Interaction SS measures the extent of non-parallelism in the sample mean profiles. In particular, the Interaction SS is zero when the sample mean profiles are perfectly parallel because $\widehat{(\alpha\beta)}_{ij} = 0$ for all $i$ and $j$.
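The link between parallel profiles and zero interaction effects can be checked numerically. The sketch below builds a hypothetical additive 3 × 5 table of population means (the values of mu, alpha, and beta are made up for illustration), confirms that every interaction effect is zero and that the gap between two F1 profiles is constant across F2, and then perturbs one cell to show both properties fail together. The helper interaction.effects() is an illustrative name, not a function from the notes.

```r
# Hypothetical additive 3 x 5 table of population means (illustrative values).
mu    <- 7
alpha <- c(-2, 3, -1)          # F1 effects, sum to zero
beta  <- c(-3, -1, 0, 1, 3)    # F2 effects, sum to zero
mu.ij <- mu + outer(alpha, beta, "+")

# Interaction effects from a table of cell means:
# mu.ij - (row mean) - (column mean) + (grand mean)
interaction.effects <- function(m) {
  sweep(sweep(m, 1, rowMeans(m)), 2, colMeans(m)) + mean(m)
}

# Additive table: profiles are parallel and all interaction effects are zero.
ab.additive   <- interaction.effects(mu.ij)
diff.additive <- mu.ij[1, ] - mu.ij[2, ]   # gap between F1 profiles 1 and 2

# Perturb one cell: profiles are no longer parallel.
mu.int       <- mu.ij
mu.int[1, 5] <- mu.int[1, 5] + 4
ab.int       <- interaction.effects(mu.int)
diff.int     <- mu.int[1, ] - mu.int[2, ]

print(diff.additive)     # constant across levels of F2
print(diff.int)          # not constant across levels of F2
print(round(ab.int, 3))  # nonzero interaction effects
```

Calling matplot(t(mu.ij), type = "b") and matplot(t(mu.int), type = "b") would draw the corresponding profile plots.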
5.2.3 Example: Survival Times of Beetles

First we generate cell means and a sample means profile plot (interaction plot) for the beetle experiment. The ddply() function was used to obtain the 12 treatment cell means. Three variables were needed to represent each response in the data set: dose (1-3, categorical), insecticide (A-D, categorical), and time (the survival time).

As noted earlier, the insecticides have noticeably different mean survival times averaged over doses, with insecticide A having the lowest mean. Similarly, higher doses tend to produce lower survival times. The sample means profile plot shows some evidence of interaction.

    library(plyr)
    # Calculate the cell means for each (dose, insecticide) combination
    mean(beetles.long[, "hours10"])
    ## [1] 0.4794
    beetles.mean    <- ddply(beetles.long, .(), summarise, m = mean(hours10))
    beetles.mean
    ##    .id      m
    ## 1 <NA> 0.4794
    beetles.mean.d  <- ddply(beetles.long, .(dose), summarise, m = mean(hours10))
    beetles.mean.d
    ##     dose      m
    ## 1    low 0.6175
    ## 2 medium 0.5444
    ## 3   high 0.2762
    beetles.mean.i  <- ddply(beetles.long, .(insecticide), summarise, m = mean(hours10))
    beetles.mean.i
    ##   insecticide      m
    ## 1           A 0.3142
    ## 2           B 0.6767
    ## 3           C 0.3925
    ## 4           D 0.5342
    beetles.mean.di <- ddply(beetles.long, .(dose,insecticide), summarise, m = mean(hours10))
    beetles.mean.di
    ##      dose insecticide      m
    ## 1     low           A 0.4125
    ## 2     low           B 0.8800
    ## 3     low           C 0.5675
    ## 4     low           D 0.6100
    ## 5  medium           A 0.3200
    ## 6  medium           B 0.8150
    ## 7  medium           C 0.3750
    ## 8  medium           D 0.6675
    ## 9    high           A 0.2100
    ## 10   high           B 0.3350
    ## 11   high           C 0.2350
    ## 12   high           D 0.3250
    # Interaction plots, ggplot
    p <- ggplot(beetles.long, aes(x = dose, y = hours10,
                                  colour = insecticide, shape = insecticide))
    p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                      , linetype = "solid", size = 0.2, alpha = 0.3)
    p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
    p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
    p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
    p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = insecticide), size = 1.5)
    p <- p + labs(title = "Beetles interaction plot, insecticide by dose")
    print(p)
    ## ymax not defined: adjusting position using y instead
    p <- ggplot(beetles.long, aes(x = insecticide, y = hours10,
                                  colour = dose, shape = dose))
    p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                      , linetype = "solid", size = 0.2, alpha = 0.3)
    p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
    p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
    p <- p + geom_point(data = beetles.mean.di, aes(y = m), size = 4)
    p <- p + geom_line(data = beetles.mean.di, aes(y = m, group = dose), size = 1.5)
    p <- p + labs(title = "Beetles interaction plot, dose by insecticide")
    print(p)
    ## ymax not defined: adjusting position using y instead

[Figure: two ggplot panels, "Beetles interaction plot, insecticide by dose" (x-axis: dose; lines by insecticide) and "Beetles interaction plot, dose by insecticide" (x-axis: insecticide; lines by dose). y-axis: hours10.]

    # Interaction plots, base graphics
    interaction.plot(beetles.long$dose, beetles.long$insecticide, beetles.long$hours10
                   , main = "Beetles interaction plot, insecticide by dose")
    interaction.plot(beetles.long$insecticide, beetles.long$dose, beetles.long$hours10
                   , main = "Beetles interaction plot, dose by insecticide")

[Figure: two base-graphics interaction plots with the same titles as above. y-axis: mean of beetles.long$hours10.]
The F-test for no dose by insecticide interaction is not significant at the 0.i. Thus.7 medium low high 0.4 0.i.d. lm.long) # equivalent library(car) Anova(lm.lm(hours10 ~ dose + insecticide + dose:insecticide .long$insecticide In the lm() function below we specify a first-order model with interactions.long$hours10 0.3 mean of beetles.lm(hours10 ~ dose*insecticide.7 B D C A 0.10 level (p-value=0.5 0.2 0.4 0. 321 dosehigh:insecticideD -0.i.i.h.1 ' ' 1 summary(lm.4e-05 insecticideC 0.150 insecticideD 0.92 0.681 1 30.h.389 dosemedium:insecticideD 0.53 2.87 Residuals 0.9e-06 *** 0. .long) Residuals: Min 1Q -0.30 0.1055 -1.1500 0.i <.1491 -0.4125 0.00095 *** 0.h.d.149 on 36 degrees of freedom Multiple R-squared: 0.3250 -0.43 8.87 0.di.734.1491 0. I’ll drop the interaction term and fit the additive model with main effects only.85 insecticide 0. ~ .d.47 0.9e-06 dosemedium -0.652 F-statistic: 9.801 36 --Signif. Error t value Pr(>|t|) (Intercept) 0.2: Extending the One-Factor Design to Multiple Factors ## ## ## ## ## ## ## ## ## ## Response: hours10 Sum Sq Df F value (Intercept) 0.5. type=3) ## Anova Table (Type III tests) ## .0746 5.01 '*' 0.18 0.80 dose:insecticide 0.' 0.507 dosehigh:insecticideC -0.update(lm.250 6 1.05 '.0825 0.1491 -2.1055 1.11225 0.4250 Coefficients: Estimate Std.028 dosemedium:insecticideC -0.001 '**' 0.01 '*' 0. codes: 0 '***' 0.Adjusted R-squared: 0.i.di) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = hours10 ~ dose + insecticide + dose:insecticide.1 ' ' *** .0925 0.dose:insecticide ) library(car) Anova(lm.584 --Signif.1491 -0. * 1 Residual standard error: 0.1491 1.069 dosemedium:insecticideB 0.1300 0.0431 Max 0.2025 0.386 dosehigh -0.17216 0.88 0. lm. 
I update the model by removing the interaction term.3425 0.d.60 dose 0.1550 0.67 0.1055 -0.0050 3Q 0.99e-07 Since the interaction is not significant.d.001 '**' 125 Pr(>F) 2.01 0.1491 -0.h.05 '.1055 4.082 2 1.454 3 6.063 insecticideB 0.1975 0.855 dosehigh:insecticideB -0.4675 0.1000 0. p-value: 1.0275 0. codes: 0 '***' 0.' 0.0487 Median 0. data = beetles. *** .1055 1.55 0.87 0.01 on 11 and 36 DF. 61 1.001 '**' value Pr(>|t|) 8.01 '*' 0. the Bonferroni-corrected significance level is (alpha / (d + i)) # where d = number of dose comparisons # and i = number of insecticide comparisons. codes: 0 '***' 0.0559 dosemedium -0.beetle. linfct = mcp(dose = "Tukey" . # correcting over dose and insecticide glht.6 5.beetle.7e-06 *** Residuals 1.1 ' ' 1 Residual standard error: 0.i) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = hours10 ~ dose + insecticide.4e-06 *** 1.4523 0.2e-10 *** -1.1 ' ' 1 summary(lm.31 0.0559 dosehigh -0.d.h.05 '.126 ## ## ## ## ## ## ## ## Ch 5: Paired Experiments and Randomized Block Experiments Response: hours10 Sum Sq Df F value Pr(>F) (Intercept) 1.2517 -0.05 '.001 '**' 0. # that is.glht(aov(lm.41 0.long) Residuals: Min 1Q Median -0.0646 insecticideD 0. test = adjusted("bonferroni")) ## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## .4 4.609 F-statistic: 15.12e-08 The Bonferroni multiple comparisons indicate which treatment effects are different.1981 -6.di.0618 Max 0.h.0646 --Signif.2319 3. codes: 0 '***' 0.d. insecticide = "Tukey")) summary(glht.637 1 65.09 4. # Testing multiple factors is of interest here.3412 0.3 6.2e-10 *** dose 1.0646 insecticideC 0.7e-07 *** insecticide 0.4983 Coefficients: Estimate Std. # Note that the code below corrects the p-values # for all the tests done for both factors together.01 '*' 0.10 2.' 
0.di <.3625 0.2200 0.0963 -0.0559 insecticideB 0.Adjusted R-squared: 0.6 on 5 and 42 DF.158 on 42 degrees of freedom Multiple R-squared: 0.0731 0.0783 0.21 0.65.0149 3Q 0.0015 ** 0. p-value: 1.921 3 12. data = beetles.033 2 20.051 42 --Signif.' 0.i).8e-07 *** 5. Error t (Intercept) 0. 5.A == 0 0.0646 -2.0646 -4.21 1.A == 0 0. 4.B == 0 -0.5e-06 *** dose: high .2 ) ● 0.001 '**' 0.0559 -4.2 Linear Function Bonferroni−adjusted Treatment contrasts 0.2681 0.0 0.2: Extending the One-Factor Design to Multiple Factors ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 127 Fit: aov(formula = lm.41 0.h.bonferroni method) # plot the summary op <. 2) + 0.40 0. top.low == 0 -0.' 0.4 .05 '.2842 0.3412 0. Error t value Pr(>|t|) dose: medium .31 1.low == 0 -0.0731 0.readonly = TRUE) # the whole list of settable par's. # make wider left margin to fit contrast labels par(mar = c(5.A == 0 0.0559 -1.1) # order is c(bottom.10 2. left.i) Linear Hypotheses: Estimate Std. codes: 0 '***' 0.61 1.1 ' ' 1 (Adjusted p values reported -.medium == 0 -0.1425 0.0559 -6.0646 5.3e-05 *** insecticide: C .00065 *** insecticide: D .00000 dose: high .0783 0.0646 2.01313 * insecticide: C .2200 0.beetle.19 0. 10.3625 0.1417 0.30453 --Signif.B == 0 -0.00019 *** insecticide: B .4 ) ● ) ● insecticide: D − B ) ● −0. test = adjusted("bonferroni")) .d.0646 3. right) # plot bonferroni-corrected difference intervals plot(summary(glht.par(no.C == 0 0.0646 1.di.79 0. sub="Bonferroni-adjusted Treatment contrasts") par(op) # reset plotting options 95% family−wise confidence level ( dose: medium − low dose: high − low ( ) ● ( dose: high − medium ) ● ) ● ( insecticide: B − A ( insecticide: C − A insecticide: C − B ) ● ( insecticide: D − A ( ( ) ● ( insecticide: D − C −0.01 '*' 0.29570 insecticide: D .21 0.00000 insecticide: D . Of course. averaged over insecticides. conclude that the highest dose yields the lowest survival time regardless of insecticide. 
Interpretation of the Dose and Insecticide Effects

The interpretation of the dose and insecticide main effects depends on whether interaction is present. The distinction is important, so I will give both interpretations to emphasize the differences. Given the test for interaction, I would likely summarize the main effects assuming no interaction.

The average survival time decreases as the dose increases, with estimated mean survival times of 0.618, 0.544, and 0.276, respectively. A Bonferroni comparison shows that the population mean survival time for the high dose (averaged over insecticides) is significantly less than the population mean survival times for the low and medium doses (averaged over insecticides). The two lower doses are not significantly different from each other. This leads to two dose groups:

Dose:       1=Low   2=Med   3=High
Marg Mean:  0.618   0.544   0.276
Groups:     -------------   -----

If dose and insecticide interact, you can conclude that beetles given a high dose of the insecticide typically survive for shorter periods of time averaged over insecticides. You cannot, in general, conclude that the highest dose yields the lowest survival time regardless of insecticide. For example, the difference in the medium and high dose marginal means (0.544 - 0.276 = 0.268) estimates the typical decrease in survival time achieved by using the high dose instead of the medium dose, averaged over insecticides. If the two factors interact, then the difference in mean times between the medium and high doses on a given insecticide may be significantly greater than 0.268, significantly less than 0.268, or even negative. In the latter case the medium dose would be better than the high dose for the given insecticide, even though the high dose gives better performance averaged over insecticides. An interaction forces you to use the cell means to decide which combination of dose and insecticide gives the best results (and the multiple comparisons as they were done above do not give multiple comparisons of cell means; a single factor variable combining both factors would need to be created). Of course, our profile plot tells us that this hypothetical situation is probably not tenable here, but it could be so when a significant interaction is present.

If dose and insecticide do not interact, then the difference in marginal dose means averaged over insecticides also estimates the difference in population mean survival times between two doses, regardless of the insecticide. This follows from the parallel profiles definition of no interaction. Thus, the difference in the medium and high dose marginal means (0.544 - 0.276 = 0.268) estimates the expected decrease in survival time anticipated from using the high dose instead of the medium dose, regardless of the insecticide (and hence also when averaged over insecticides). A practical implication of no interaction is that you can conclude that the high dose is best, regardless of the insecticide used. The difference in marginal means for two doses estimates the difference in average survival expected, regardless of the insecticide.

An ordering of the mean survival times on the four insecticides (averaged over the three doses) is given below. Three groups are obtained from the Bonferroni comparisons, with any two insecticides separated by one or more other insecticides in the ordered string having significantly different mean survival times averaged over doses.

If interaction is present, you can conclude that insecticide A is no better than C, but significantly better than B or D, when performance is averaged over doses. If the interaction is absent, then A is not significantly better than C, but is significantly better than B or D, regardless of the dose. Furthermore, for example, the difference in marginal means for insecticides B and A of 0.677 - 0.314 = 0.363 is the expected decrease in survival time from using A instead of B, regardless of dose. This is also the expected decrease in survival times when averaged over doses.

Insect:       B       D       C       A
Marg Mean:  0.677   0.534   0.393   0.314
Groups:     -------------
                    -------------
                            -------------

5.2.4 Example: Output voltage for batteries

The maximum output voltage for storage batteries is thought to be influenced by the temperature in the location at which the battery is operated and the material used in the plates. A scientist designed a two-factor study to examine this hypothesis, using three temperatures (50, 65, 80), and three materials for the plates (1, 2, 3). Four batteries were tested at each of the 9 combinations of temperature and material type. The maximum output voltage was recorded for each battery. This is a balanced 3-by-3 factorial experiment with four observations per treatment.

#### Example: Output voltage for batteries
battery <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_battery.dat"
                    , header = TRUE)
battery$material <- factor(battery$material)
battery$temp     <- factor(battery$temp)
battery
##   material temp  v1  v2  v3  v4
## 1        1   50 130 155  74 180
## 2        1   65  34  40  80  75
## 3        1   80  20  70  82  58
## 4        2   50 150 188 159 126
## 5        2   65 136 122 106 115
## 6        2   80  25  70  58  45
## 7        3   50 138 110 168 160
## 8        3   65 174 120 150 139
## 9        3   80  96 104  82  60

library(reshape2)
battery.long <- melt(battery
                   , id.vars = c("material", "temp")
                   , variable.name = "battery"
                   , value.name = "maxvolt"
                   )
str(battery.long)
## 'data.frame': 36 obs. of 4 variables:
##  $ material: Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...
##  $ temp    : Factor w/ 3 levels "50","65","80": 1 2 3 1 2 3 1 2 3 1 ...
##  $ battery : Factor w/ 4 levels "v1","v2","v3",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ maxvolt : int 130 34 20 150 136 25 138 174 96 155 ...

The overall F-test at the bottom indicates at least one parameter in the model is significant. The two-way ANOVA table indicates that the main effect of temperature and the interaction are significant at the 0.05 level; the main effect of material is not.

lm.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
library(car)
Anova(lm.m.t.mt, type=3)
## Anova Table (Type III tests)
##
## Response: maxvolt
##               Sum Sq Df F value  Pr(>F)
## (Intercept)    72630  1  107.57 6.5e-11 ***
## material         886  2    0.66 0.52689
## temp           15965  2   11.82 0.00021 ***
## material:temp   9614  4    3.56 0.01861 *
## Residuals      18231 27
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.m.t.mt)
##
## Call:
## lm(formula = maxvolt ~ material * temp, data = battery.long)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -60.75 -14.62   1.38  17.94  45.25
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        134.75      12.99   10.37  6.5e-11 ***
## material2           21.00      18.37    1.14  0.26311
## material3            9.25      18.37    0.50  0.61875
## temp65             -77.50      18.37   -4.22  0.00025 ***
## temp80             -77.25      18.37   -4.20  0.00026 ***
## material2:temp65    41.50      25.98    1.60  0.12189
## material3:temp65    79.25      25.98    3.05  0.00508 **
## material2:temp80   -29.00      25.98   -1.12  0.27424
## material3:temp80    18.75      25.98    0.72  0.47676
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26 on 27 degrees of freedom
## Multiple R-squared:  0.765, Adjusted R-squared:  0.696
## F-statistic: 11 on 8 and 27 DF,  p-value: 9.43e-07

The cell means plots of the material profiles have different slopes, which is consistent with the presence of a temperature-by-material interaction.

library(plyr)
# Calculate the cell means for each (material, temp) combination
battery.mean <- ddply(battery.long, .(), summarise, m = mean(maxvolt))
battery.mean
##    .id     m
## 1 <NA> 105.5
battery.mean.m <- ddply(battery.long, .(material), summarise, m = mean(maxvolt))
battery.mean.m
##   material      m
## 1        1  83.17
## 2        2 108.33
## 3        3 125.08
battery.mean.t <- ddply(battery.long, .(temp), summarise, m = mean(maxvolt))
battery.mean.t
##   temp      m
## 1   50 144.83
## 2   65 107.58
## 3   80  64.17
battery.mean.mt <- ddply(battery.long, .(material,temp), summarise, m = mean(maxvolt))
battery.mean.mt
##   material temp      m
## 1        1   50 134.75
## 2        1   65  57.25
## 3        1   80  57.50
## 4        2   50 155.75
## 5        2   65 119.75
## 6        2   80  49.50
## 7        3   50 144.00
## 8        3   65 145.75
## 9        3   80  85.50

# Interaction plots, ggplot
library(ggplot2)
p <- ggplot(battery.long, aes(x = material, y = maxvolt, colour = temp, shape = temp))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                  , linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = battery.mean.mt, aes(y = m), size = 4)
p <- p + geom_line(data = battery.mean.mt, aes(y = m, group = temp), size = 1.5)
p <- p + labs(title = "Battery interaction plot, temp by material")
print(p)
## ymax not defined: adjusting position using y instead

p <- ggplot(battery.long, aes(x = temp, y = maxvolt, colour = material, shape = material))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                  , linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = battery.mean.mt, aes(y = m), size = 4)
p <- p + geom_line(data = battery.mean.mt, aes(y = m, group = material), size = 1.5)
p <- p + labs(title = "Battery interaction plot, material by temp")
print(p)
## ymax not defined: adjusting position using y instead

[Figure: "Battery interaction plot, temp by material" (maxvolt vs material, one profile per temp) and "Battery interaction plot, material by temp" (maxvolt vs temp, one profile per material).]

The Bonferroni multiple comparisons may be inappropriate because of covariate interactions. That is, interactions make the main effects less meaningful (or their interpretation unclear) since the change in response when one factor is changed depends on what the second factor is.

The significant interaction between temperature and material implies that you cannot directly conclude that batteries stored at 50 degrees have the highest average output regardless of the material. Nor can you directly conclude that material 3 has a higher average output than material 1 regardless of the temperature. You can only conclude that the differences are significant when averaged over the levels of the other factor.

library(multcomp)
# correcting over temp
glht.battery.t <- glht(aov(lm.m.t.mt), linfct = mcp(temp = "Tukey"))
## Warning: covariate interactions found -- default contrast might be inappropriate
summary(glht.battery.t, test = adjusted("bonferroni"))
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
## Fit: aov(formula = lm.m.t.mt)
##
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0   -77.50      18.37   -4.22  0.00074 ***
## 80 - 50 == 0   -77.25      18.37   -4.20  0.00077 ***
## 80 - 65 == 0     0.25      18.37    0.01  1.00000
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)

# plot bonferroni-corrected difference intervals
plot(summary(glht.battery.t, test = adjusted("bonferroni"))
   , sub="Bonferroni-adjusted Treatment contrasts")

[Figure: 95% family-wise confidence intervals for the temperature contrasts 65 - 50, 80 - 50, and 80 - 65; x-axis "Linear Function", subtitle "Bonferroni-adjusted Treatment contrasts".]

The Bonferroni comparisons indicate that the population mean max voltage for the three temperatures averaged over material types decreases as the temperature increases:

Temp:          80       65       50
Marg mean:   64.17   107.58   144.83
Group:       --------------   ------

However, you can compare materials at each temperature, and you can compare temperatures for each material. At individual temperatures, material 2 and 3 (or 1 and 2) might be significantly different even though they are not significantly different when averaged over temperatures. For example, material 2 might produce a significantly higher average output than the other two material types at 50 degrees. This comparison of cell means is relevant if you are interested in using the batteries at 50 degrees! Comparing cell means is possible using "lsmeans", a point I will return to later.

5.2.5 Checking assumptions in a two-factor experiment

The normality and constant variance assumptions for a two-factor design can be visually checked using side-by-side boxplots (as was produced in the ggplot() interaction plots) and residual plots. Another useful tool for checking constant variances is to plot the sample deviations for each group against the group means.

Let us check the distributional assumptions for the insecticide experiment. The sampling design suggests that the independence assumptions are reasonable. The group sample sizes are small, so the residual plots are likely to be more informative than the side-by-side boxplots and the plot of the standard deviations.

The code below generates plots and summary statistics for the survival times. I used ddply() to store the means ȳ_ij and standard deviations s_ij for the 12 treatment combinations. The diagnostic plots we've been using for lm() display residual plots. Only the relevant output is presented.

The set of box plots (each representing 4 points) for each insecticide/dose combination indicates both that means and standard deviations of treatments seem different. Also, there appears to be less variability for dose=3 (high) than for doses 1 and 2 in the table, whereas the model assumes that variability is the same and does not depend on treatment. The plot of the standard deviation vs mean shows an increasing trend.

#### Example: Beetles, checking assumptions
# boxplots, ggplot
p <- ggplot(beetles.long, aes(x = dose, y = hours10, colour = insecticide))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/insecticide cell
beetles.meansd.di <- ddply(beetles.long, .(dose,insecticide), summarise
                         , m = mean(hours10), s = sd(hours10))
beetles.meansd.di
##      dose insecticide      m       s
## 1     low           A 0.4125 0.06946
## 2     low           B 0.8800 0.16083
## 3     low           C 0.5675 0.15671
## 4     low           D 0.6100 0.11284
## 5  medium           A 0.3200 0.07528
## 6  medium           B 0.8150 0.33630
## 7  medium           C 0.3750 0.05686
## 8  medium           D 0.6675 0.27097
## 9    high           A 0.2100 0.02160
## 10   high           B 0.3350 0.04655
## 11   high           C 0.2350 0.01291
## 12   high           D 0.3250 0.02646

p <- ggplot(beetles.meansd.di, aes(x = m, y = s, shape = dose, colour = insecticide))
p <- p + geom_point(size=4)
p <- p + labs(title = "Beetles standard deviation vs mean")
print(p)

[Figure: side-by-side boxplots of hours10 by dose and insecticide, and "Beetles standard deviation vs mean" (s vs m, shape by dose, colour by insecticide).]

Diagnostic plots show the following features. The normal quantile plot shows an "S" shape rather than a straight line, suggesting the residuals are not normal, but have higher kurtosis (more peaky) than a normal distribution. The residuals vs the fitted (predicted) values show that the higher the predicted value the more variability (horn shaped). The plot of the Cook's distances indicates a few influential observations.

# interaction model
lm.h.d.i.di <- lm(hours10 ~ dose*insecticide, data = beetles.long)
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.h.d.i.di, which = c(1,4,6))

plot(beetles.long$dose, lm.h.d.i.di$residuals, main="Residuals vs dose")
# horizontal line at zero
abline(h = 0, col = "gray75")

plot(beetles.long$insecticide, lm.h.d.i.di$residuals, main="Residuals vs insecticide")
# horizontal line at zero
abline(h = 0, col = "gray75")

# Normality of Residuals
library(car)
qqPlot(lm.h.d.i.di$residuals, las = 1, id.n = 3, main="QQ Plot")
## 42 20 30
## 48 47  1

## residuals vs order of data
#plot(lm.h.d.i.di$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the interaction model: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs dose, Residuals vs insecticide, and QQ Plot. Observations 20, 30, and 42 are labelled.]
Survival times are usually right skewed, with the spread or variability in the distribution increasing as the mean or median increases. Ideally, the distributions should be symmetric, normal, and the standard deviation should be fairly constant across groups.

The boxplots (note the ordering) and the plot of the s_ij against ȳ_ij show the tendency for the spread to increase with the mean. This is reinforced by the residual plot, where the variability increases as the predicted values (the cell means under the two-factor interaction model) increase.

As noted earlier, the QQ-plot of the studentized residuals is better suited to examine normality here than the boxplots which are constructed from 4 observations. Not surprisingly, the boxplots do not suggest non-normality. Looking at the QQ-plot we clearly see evidence of non-normality.

5.2.6 A Remedy for Non-Constant Variance

A plot of cell standard deviations against the cell means is sometimes used as a diagnostic tool for suggesting transformations of the data. Here are some suggestions for transforming non-negative measurements to make the variability independent of the mean (i.e., stabilize the variance). The transformations also tend to reduce skewness, if present (and may induce skewness if absent!). As an aside, some statisticians prefer to plot the IQR against the median to get a more robust view of the dependence of spread on typical level because s_ij and ȳ_ij are sensitive to outliers.

1. If s_ij increases linearly with ȳ_ij, use a log transformation of the response.
2. If s_ij increases as a quadratic function of ȳ_ij, use a reciprocal (inverse) transformation of the response.
3. If s_ij increases as a square root function of ȳ_ij, use a square root transformation of the response.
4. If s_ij is roughly independent of ȳ_ij, do not transform the response. This idea does not require the response to be non-negative!

A logarithmic transformation or a reciprocal (inverse) transformation of the survival times might help to stabilize the variance. The survival time distributions are fairly symmetric, so these nonlinear transformations may destroy the symmetry. As a first pass, I will consider the reciprocal transformation because the inverse survival time has a natural interpretation as the dying rate. For example, if you survive 2 hours, then 1/2 is the proportion of your remaining lifetime expired in the next hour. The unit of time is actually 10 hours, so 0.1/time is the actual rate. The 0.1 scaling factor has no effect on the analysis provided you appropriately rescale the results on the mean responses.

Create the rate variable.

#### Example: Beetles, non-constant variance
# create the rate variable (1/hours10)
beetles.long$rate <- 1/beetles.long$hours10

Redo the analysis replacing hours10 by rate. The standard deviations of rate appear much more similar than those of time did.

# boxplots, ggplot
p <- ggplot(beetles.long, aes(x = dose, y = rate, colour = insecticide))
p <- p + geom_boxplot()
print(p)

# mean vs sd plot
library(plyr)
# means and standard deviations for each dose/insecticide cell
beetles.meansd.di.rate <- ddply(beetles.long, .(dose,insecticide), summarise
                              , m = mean(rate), s = sd(rate))
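The rules of thumb in the numbered list above amount to asking how steeply the cell standard deviation grows with the cell mean. One rough way to quantify this (a common heuristic, not a step taken in these notes) is to regress log(s_ij) on log(ȳ_ij): a slope near 1 points to the log, near 2 to the reciprocal, near 1/2 to the square root, and near 0 to no transformation. The sketch re-enters the 12 cell means and standard deviations of hours10 from the table in the previous section; the object names m, s, fit.ms, and k are hypothetical.

```r
# cell means and standard deviations of hours10, copied from the
# mean/sd table for the 12 dose-by-insecticide cells
m <- c(0.4125, 0.8800, 0.5675, 0.6100,   # low    A B C D
       0.3200, 0.8150, 0.3750, 0.6675,   # medium A B C D
       0.2100, 0.3350, 0.2350, 0.3250)   # high   A B C D
s <- c(0.06946, 0.16083, 0.15671, 0.11284,
       0.07528, 0.33630, 0.05686, 0.27097,
       0.02160, 0.04655, 0.01291, 0.02646)

# slope k of log(s) on log(m): s is roughly proportional to m^k
fit.ms <- lm(log(s) ~ log(m))
k <- unname(coef(fit.ms)[2])
k

# power suggested by the usual variance-stabilizing argument, y^(1 - k);
# a value near -1 corresponds to the reciprocal (rate) transformation,
# a value near 0 to the log
1 - k
```

With only 12 cells the slope is a loose guide, so it is best read alongside the plot of standard deviation against mean rather than in place of it.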
aes(x = m.p + geom_point(size=4) p <.5532 medium C 2.4175 medium D 1.622 beetles. library(plyr) # Calculate the cell means for each (dose.i <.163 0.803 0.mean.mean ## . summarise. insecticide) combination beetles.4967 low B 1.265 0. summarise.d <.487 0.2 low medium high 1 dose 2 3 4 m The profile plots and ANOVA table indicate that the main effects are significant but the interaction is not.7019 high A 4.393 0.ddply(beetles.5.(dose).801 ## 2 medium 2.092 0. .1995 low C 1. .5296 high B 3.mean.2441 p <.long.4894 low D 1.797 beetles.690 0.714 0.mean <.d beetles.i beetles. size=0.di.ggplot(beetles.5) p <.140 ## ## ## ## ## 1 2 3 4 Ch 5: Paired Experiments and Randomized Block Experiments insecticide A B C D m 3.268 medium B 1.75)) p <.ddply(beetles.690 medium A 3.863 low D 1. aes(x = dose.p + geom_point(alpha = 0.5) p <.size=0.di. colour = "black" . outlier. group = dose).di.mean.p + geom_point(data = beetles.1) p <. linetype = "solid".714 medium D 1. position=position_dodge(width=0. colour = dose. aes(y = m.r.265 high D 3. size = 4) p <.di.2. group = insecticide).di ## ## ## ## ## ## ## ## ## ## ## ## ## 1 2 3 4 5 6 7 8 9 10 11 12 dose insecticide m low A 2.r.487 low B 1.mean.393 medium C 2. type=3) .029 high C 4. size = 1.long. size = 0. ggplot p <.p + geom_line(data = beetles.p + geom_line(data = beetles.2. insecticide by dose") print(p) ## ymax not defined: adjusting position using y instead p <. colour = insecticide. aes(y = m. alpha = 0.mean.947 2.3) p <.long) # equivalent library(car) Anova(lm.75)) p <. . aes(y = m).mean. m = mean(rate)) beetles.3) p <. shape = insecticide)) p <.i.d. outlier. aes(x = insecticide. aes(y = m). y = rate. summarise. data = beetles.702 high A 4.25.ggplot(beetles.161 beetles.di <.i.p + geom_hline(aes(yintercept = 0).519 1. size = 1.p + labs(title = "Beetles interaction plot. y = rate.5.163 low C 1. linetype = "solid".25.di <.mean.5.di.p + geom_point(alpha = 0.long.1) p <. 
position=position_dodge(width=0.p + geom_boxplot(alpha = 0.mean.lm(rate ~ dose*insecticide. insecticide by dose Beetles interaction plot.long. colour = "black" .p + geom_hline(aes(yintercept = 0). shape = dose)) p <.p + labs(title = "Beetles interaction plot. size = 4) p <.803 high B 3.p + geom_point(data = beetles. alpha = 0.092 # Interaction plots. dose by insecticide") print(p) ## ymax not defined: adjusting position using y instead Beetles interaction plot.d.p + geom_boxplot(alpha = 0. size = 0. dose by insecticide ● 4 4 insecticide ● dose ● ● A B ● low rate rate ● medium C ● ● high D ● 2 ● 2 ● ● ● 0 0 low medium dose high A B C D insecticide lm.862 2.(dose.insecticide). 0691 3Q 0.86 0.2 1 239.828 F-statistic: 21.26 0.001 '**' 0.68 8.7816 0.dose:insecticide) library(car) Anova(lm.d.30 0.5e-07 *** insecticide 3.49 on 36 degrees of freedom Multiple R-squared: 0.12495 dosehigh:insecticideD -0.3234 0.07039 --Signif.05 '. type=3) ## Anova Table (Type III tests) ## ## Response: rate ## Sum Sq Df F value Pr(>F) ## (Intercept) 58.1 ' ' 1 summary(lm.3465 -2.i.01 '*' 0.57 0. Error t value Pr(>|t|) (Intercept) 2.4900 -0.86093 dosemedium:insecticideD -0.r.29e-12 Drop the nonsignificant interaction term.d.9137 0.3465 6.57 6 1.80 0. .4900 0. codes: 0 '***' 0.6242 0.00051 insecticideC -0.3465 -3.88783 dosehigh:insecticideC 0.i.74 1 103.2: Extending the One-Factor Design to Multiple Factors ## ## ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: rate Sum Sq Df F value Pr(>F) (Intercept) 24.' 0.i.4900 -1.7697 0.01 '*' 0.r.7972 0.5 on 11 and 36 DF.001 '**' 0.82 0.13 0.0696 0. ~ .0865 0.26767 dosehigh:insecticideB -0.0055 ** dose:insecticide 1.3158 0. 1 141 . data = beetles.09 0.r.36421 dosemedium:insecticideC 0.05 '. codes: 0 '***' 0.2964 -0.4900 -1.7685 -0.5.2546 Max 1.di.' 0.long) Residuals: Min 1Q Median -0.0794 Coefficients: Estimate Std.Adjusted R-squared: 0.d.96 0.4900 -1.03025 dosehigh 2.57 3 4.2e-12 dosemedium 0.6e-08 insecticideB -1.3465 2.14 0. 
* .2e-12 *** dose 11.3465 -1.i <.3867 Residuals 8.2450 10.10 2 23.update(lm. lm.868.di) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = rate ~ dose * insecticide.08001 insecticideD -0.4 < 2e-16 *** *** * *** *** .d.r.4900 0.4869 0. p-value: 1.1 ' ' Residual standard error: 0.12 3.5517 0.64 36 --Signif.15 4.04 4.92 0.02730 dosemedium:insecticideB -0.4503 0.18 0. long$insecticide.0103 * 11.i) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = rate ~ dose + insecticide.n = 3.0069 ** -6. codes: 0 '***' 0.2757 Max 1.84 0.358 0.i£residuals.0212 3Q 0.i.' 0. Error t (Intercept) 2.47 < 2e-16 *** 2.572 0. col = "gray75") # Normality of Residuals library(car) qqPlot(lm.r. data = beetles.1815 Coefficients: Estimate Std. # plot diagnistics par(mfrow=c(2.r. lm. main="QQ Plot") ## 41 4 33 ## 48 47 46 ## residuals vs order of data #plot(lm.174 dosehigh 1.i$residuals.01 '*' 0.657 0.2e-10 *** Residuals 10.174 dosemedium 0.996 0.7 2.1 ' ' 1 summary(lm. Also.9 2 71.r.69 0.0 4.i$residuals.' 0.r.001 '**' value Pr(>|t|) 15.201 insecticideD -1.493 on 42 degrees of freedom Multiple R-squared: 0.d.97e-16 Unlike the original analysis. col = "gray75") plot(beetles.201 --Signif.826 F-statistic: 45.long$dose.d.d.d.i$residuals.4 3 28.201 insecticideC -0.Adjusted R-squared: 0.469 0.long) Residuals: Min 1Q -0. main="Residuals vs Order of data") # # horizontal line at zero # abline(h = 0. p-value: 6.2 42 --Signif.01 '*' 0.75 3. the residual plots do not show any gross deviations from assumptions.05 '.9e-14 *** insecticide 20.3e-08 *** 0.4. codes: 0 '***' 0.r. no case seems relatively influential.174 insecticideB -1. id.7e-10 *** -2.844.142 ## ## ## ## ## Ch 5: Paired Experiments and Randomized Block Experiments dose 34. col = "gray75") .23 2. lm.d.5 on 5 and 42 DF.1 ' ' 1 Residual standard error: 0. which = c(1.d.45 1.698 0. las = 1.r.3762 Median 0. 
main="Residuals vs insecticide") # horizontal line at zero abline(h = 0.6)) plot(beetles.8276 -0.05 '.7e-14 *** -8.3)) plot(lm. main="Residuals vs dose") # horizontal line at zero abline(h = 0.001 '**' 0. 358 0. the Bonferroni-corrected significance level is (alpha / (d + i)) # where d = number of dose comparisons # and i = number of insecticide comparisons.r. # Testing multiple factors is of interest here.69 0.10 4 4● ● 33 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1 0.5e-13 *** 1.10 ● ● ● 0.2: Extending the One-Factor Design to Multiple Factors ● ● ● ● ● ● 0.996 0.di.201 -2.d.572 0.i) Linear Hypotheses: dose: medium dose: high dose: high insecticide: insecticide: insecticide: .5 ●● ● ●●● ● ●●● ● ●● ●●● ●● 0.rate. # correcting over dose and insecticide glht.201 -8.15 Residuals vs Fitted 143 0 10 20 30 40 Fitted values Obs.05 ● ● ● ● ● Cook's distance ● ● ● ● ● ● 0.5 Estimate Std. number Leverage hii Residuals vs dose Residuals vs insecticide QQ Plot 1.0 1.174 11. test = adjusted("bonferroni")) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Simultaneous Tests for General Linear Hypotheses Multiple Comparisons of Means: Tukey Contrasts Fit: aov(formula = lm.rate <.5.5 2 2 3 4 0.0 lm.beetle. Error t value Pr(>|t|) 0.0 0.5 0. # Note that the code below corrects the p-values # for all the tests done for both factors together.5 1.di.5 1.low == 0 low == 0 medium == 0 B .5 Residuals 1.5 ● ● ● ● high A B C D ● ● ●●● ●● ● ●● ● medium ● ● ● −2 −1 0 1 2 norm quantiles Bonferroni multiple comparisons imply differences about the mean rates.15 ● ● ● ● ● ● ● ● ● 41 1 ● 33 0. # that is.657 0.5 0.r.84 0.A == 0 0 0.0 41 ● ●4 Cook's dist vs Leverage hii (1 − hii) Cook's distance 0.528 0.A == 0 D .76 4.174 2.201 -6.0 ● ● ● ● ● 41 33 ● ● ● ● −0.r.4e-09 *** -0. linfct = mcp(dose = "Tukey" .00 ● ● −1.00 ● 3 2.45 1.174 8.125 41 ● low 0.0 ●4 ● 33 ● ● ●● 0.5 0.5 0.i).A == 0 C .0 0.i$residuals −0.5e-10 *** -1.23 2.0e-07 *** . 
-1.092 .d.glht(aov(lm.0 −0.05 ● ● Cook's distance 0.d.0 ●● ●●● −0.062 . insecticide = "Tukey")) summary(glht.beetle.469 0.75 3. 1. 201 -3. For example. 2) + 0.000 insecticide: D .085 0. codes: 0 '***' 0.bonferroni method) par(mfrow=c(1.39 2. sub="Bonferroni-adjusted Treatment contrasts") par(op) # reset plotting options 95% family−wise confidence level ( dose: medium − low ) ● ( dose: high − low ( dose: high − medium insecticide: B − A ( ( ( ) ● ( ( insecticide: D − B ( −2 ) ● ) ● insecticide: C − B insecticide: D − C ) ) ● insecticide: C − A insecticide: D − A ● ● −1 ● ● ) ) ) 0 1 2 Linear Function Bonferroni−adjusted Treatment contrasts Comments on the Two Analyses of Survival Times The effects of the transformation are noticeable.7e-05 *** insecticide: D .di.rate. # make wider left margin to fit contrast labels par(mar = c(5.B == 0 1. top. left. 4.299 0.' 0.001 '**' 0.1)) # plot the summary op <. 10.201 1.1 ' ' 1 (Adjusted p values reported -. right) # plot bonferroni-corrected difference intervals plot(summary(glht.91 0.1) # order is c(bottom.201 5.05 '.786 0.144 ## ## ## ## ## ## Ch 5: Paired Experiments and Randomized Block Experiments insecticide: C . test = adjusted("bonferroni")) .49 1.C == 0 -0. the comparisons among doses and insecticides are less sensitive (differences harder to distinguish) on the original scale (look at the Bonferroni groupings).par(no.beetle.B == 0 0. A comparison of the interaction p-values and profile plots for the two analyses suggests that the transformation eliminates much of the observed interaction between the main .readonly = TRUE) # the whole list of settable par's.003 ** --Signif.01 '*' 0. especially if you believe that the original time scale is most relevant for analysis. Lenth. This need appears to be less pressing with the rates. #### Multiple comparisons #### Example: Battery # fit additive (main effects) model (same as before) lm.t). 
effects. Although the interaction in the original analysis was not significant at the 10% level (p-value=0.112), the small sample sizes suggest that power for detecting interaction might be low. To be on the safe side, one might interpret the main effects in the original analysis as if an interaction were present. This need appears to be less pressing with the rates.

The statistical assumptions are reasonable for an analysis of the rates. I think that the simplicity of the main effects interpretation is a strong motivating factor for preferring the analysis of the transformed data to the original analysis. You might disagree, especially if you believe that the original time scale is most relevant for analysis.

Given the suitability of the inverse transformation, I did not consider the logarithmic transformation.

5.3 Multiple comparisons: balanced (means) vs unbalanced (lsmeans)

The lsmeans package provides a way to compare cell means (combinations of factors), something that is not possible directly with glht(), which compares marginal means. Using the battery example, we compare the multiple comparison methods using means (glht()) and lsmeans [2] (lsmeans()). When there are only main effects, the two methods agree.

[2] lsmeans is a package written by Russell V. Lenth, PhD of UNM 1975, and well-known for his online power calculators.

#### Multiple comparisons
#### Example: Battery
# fit additive (main effects) model (same as before)
lm.m.m.t <- lm(maxvolt ~ material + temp, data = battery.long)
### comparing means (must be balanced or have only one factor)
# correcting over temp
glht.battery.t <- glht(aov(lm.m.m.t), linfct = mcp(temp = "Tukey"))
summary(glht.battery.t, test = adjusted("bonferroni"))
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
## Fit: aov(formula = lm.m.m.t)
##
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0    -37.2       12.2   -3.04   0.0142 *
## 80 - 50 == 0    -80.7       12.2   -6.59  6.9e-07 ***
## 80 - 65 == 0    -43.4       12.2   -3.55   0.0038 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
# temp
lsmeans(lm.m.m.t, list(pairwise ~ temp), adjust = "bonferroni")
## $`lsmeans of temp`
##  temp lsmean    SE df lower.CL upper.CL
##  50   144.83 8.652 31   127.19   162.48
##  65   107.58 8.652 31    89.94   125.23
##  80    64.17 8.652 31    46.52    81.81
##
## Confidence level used: 0.95
##
## $`pairwise differences of temp`
##  contrast estimate    SE df t.ratio p.value
##  50 - 65     37.25 12.24 31   3.044  0.0142
##  50 - 80     80.67 12.24 31   6.593  <.0001
##  65 - 80     43.42 12.24 31   3.548  0.0038
##
## P value adjustment: bonferroni method for 3 tests

When there are model interactions, the comparisons of the main effects are inappropriate, and give different results depending on the method of comparison.

# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
### comparing means (must be balanced or have only one factor)
# correcting over temp
glht.battery.t <- glht(aov(lm.m.m.t.mt), linfct = mcp(temp = "Tukey"))
## Warning: covariate interactions found -- default contrast might be inappropriate
summary(glht.battery.t, test = adjusted("bonferroni"))
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
## Fit: aov(formula = lm.m.m.t.mt)
##
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)
## 65 - 50 == 0   -77.50      18.37   -4.22  0.00074 ***
## 80 - 50 == 0   -77.25      18.37   -4.20  0.00077 ***
## 80 - 65 == 0     0.25      18.37    0.01  1.00000
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
# temp
lsmeans(lm.m.m.t.mt, list(pairwise ~ temp), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of temp`
##  temp lsmean    SE df lower.CL upper.CL
##  50   144.83 7.501 27   129.44   160.22
##  65   107.58 7.501 27    92.19   122.97
##  80    64.17 7.501 27    48.78    79.56
##
## Confidence level used: 0.95
##
## $`pairwise differences of temp`
##  contrast estimate    SE df t.ratio p.value
##  50 - 65     37.25 10.61 27   3.511  0.0048
##  50 - 80     80.67 10.61 27   7.604  <.0001
##  65 - 80     43.42 10.61 27   4.093  0.0010
##
## P value adjustment: bonferroni method for 3 tests

When there are model interactions and you want to compare cell means, levels of one factor at each level of another factor separately, then you must use lsmeans().

# fit interaction model (same as before)
lm.m.m.t.mt <- lm(maxvolt ~ material*temp, data = battery.long)
### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of one factor at each level of another factor separately
# material at levels of temp
lsmeans(lm.m.m.t.mt, list(pairwise ~ material | temp), adjust = "bonferroni")
## $`lsmeans of material | temp`
## temp = 50:
##  material lsmean    SE df lower.CL upper.CL
##  1        134.75 12.99 27   108.09   161.41
##  2        155.75 12.99 27   129.09   182.41
##  3        144.00 12.99 27   117.34   170.66
##
## temp = 65:
##  material lsmean    SE df lower.CL upper.CL
##  1         57.25 12.99 27    30.59    83.91
##  2        119.75 12.99 27    93.09   146.41
##  3        145.75 12.99 27   119.09   172.41
##
## temp = 80:
##  material lsmean    SE df lower.CL upper.CL
##  1         57.50 12.99 27    30.84    84.16
##  2         49.50 12.99 27    22.84    76.16
##  3         85.50 12.99 27    58.84   112.16
##
## Confidence level used: 0.95
##
## $`pairwise differences of material | temp`
## temp = 50:
##  contrast estimate    SE df t.ratio p.value
##  1 - 2      -21.00 18.37 27  -1.143  0.7893
##  1 - 3       -9.25 18.37 27  -0.503  1.0000
##  2 - 3       11.75 18.37 27   0.639  1.0000
##
## temp = 65:
##  contrast estimate    SE df t.ratio p.value
##  1 - 2      -62.50 18.37 27  -3.402  0.0063
##  1 - 3      -88.50 18.37 27  -4.817  0.0001
##  2 - 3      -26.00 18.37 27  -1.415  0.5055
##
## temp = 80:
##  contrast estimate    SE df t.ratio p.value
##  1 - 2        8.00 18.37 27   0.435  1.0000
##  1 - 3      -28.00 18.37 27  -1.524  0.4175
##  2 - 3      -36.00 18.37 27  -1.959  0.1814
##
## P value adjustment: bonferroni method for 3 tests
# temp at levels of material
lsmeans(lm.m.m.t.mt, list(pairwise ~ temp | material), adjust = "bonferroni")
## $`lsmeans of temp | material`
## material = 1:
##  temp lsmean    SE df lower.CL upper.CL
##  50   134.75 12.99 27   108.09   161.41
##  65    57.25 12.99 27    30.59    83.91
##  80    57.50 12.99 27    30.84    84.16
##
## material = 2:
##  temp lsmean    SE df lower.CL upper.CL
##  50   155.75 12.99 27   129.09   182.41
##  65   119.75 12.99 27    93.09   146.41
##  80    49.50 12.99 27    22.84    76.16
##
## material = 3:
##  temp lsmean    SE df lower.CL upper.CL
##  50   144.00 12.99 27   117.34   170.66
##  65   145.75 12.99 27   119.09   172.41
##  80    85.50 12.99 27    58.84   112.16
##
## Confidence level used: 0.95
##
## $`pairwise differences of temp | material`
## material = 1:
##  contrast estimate    SE df t.ratio p.value
##  50 - 65     77.50 18.37 27   4.218  0.0007
##  50 - 80     77.25 18.37 27   4.204  0.0008
##  65 - 80     -0.25 18.37 27  -0.014  1.0000
##
## material = 2:
##  contrast estimate    SE df t.ratio p.value
##  50 - 65     36.00 18.37 27   1.959  0.1814
##  50 - 80    106.25 18.37 27   5.783  <.0001
##  65 - 80     70.25 18.37 27   3.823  0.0021
##
## material = 3:
##  contrast estimate    SE df t.ratio p.value
##  50 - 65     -1.75 18.37 27  -0.095  1.0000
##  50 - 80     58.50 18.37 27   3.184  0.0109
##  65 - 80     60.25 18.37 27   3.279  0.0086
##
## P value adjustment: bonferroni method for 3 tests

Finally, an important point demonstrated in the next section is that the cell and marginal averages given by the means and lsmeans methods agree here for the main effects model because the design is balanced. For unbalanced designs with two or more factors, lsmeans and means compute different averages. I will argue that lsmeans are the appropriate averages for unbalanced analyses. You should use the means statement with caution: it is OK for balanced or unbalanced one-factor designs, and for the balanced two-factor designs (including the RB) that we have discussed.
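The Bonferroni-adjusted p-values in the lsmeans tables of this section can be reproduced, up to rounding, from the reported t-ratios. The sketch below assumes the three temp contrasts and the 31 error degrees of freedom of the main-effects battery model; the variable names are illustrative and not part of the original analysis.

```r
# t-ratios for the temp contrasts of the main-effects battery model,
# copied from the lsmeans output (rounded to three decimals there)
t_ratio  <- c("50 - 65" = 3.044, "50 - 80" = 6.593, "65 - 80" = 3.548)
df_error <- 31               # residual df of the main-effects model
n_tests  <- length(t_ratio)  # number of comparisons being corrected for

# raw two-sided p-values from the t distribution
p_raw <- 2 * pt(-abs(t_ratio), df = df_error)

# Bonferroni: multiply by the number of tests and cap at 1
p_adj <- pmin(n_tests * p_raw, 1)
round(p_adj, 4)

# the same adjustment using the built-in helper
p.adjust(p_raw, method = "bonferroni")
```

Because the adjustment only multiplies by the number of tests, correcting over more comparisons (for example, both factors at once) makes every individual comparison harder to declare significant.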
5.4 Unbalanced Two-Factor Designs and Analysis

Sample sizes are usually unequal, or unbalanced, for the different treatment groups in an experiment. Although this has no consequence on the specification of a model, you have to be more careful with the analysis of unbalanced experiments that have two or more factors. With unbalanced designs, the Type I and Type III SS differ, as do the main effect averages given by means and lsmeans. Inferences might critically depend on which summaries are used. I will use the following example to emphasize the differences between the types of SS and averages.

5.4.1 Example: Rat insulin

The experiment consists of measuring insulin levels in rats a certain length of time after a fixed dose of insulin was injected into their jugular or portal veins. This is a two-factor study with two vein types (jugular, portal) and three time levels (0, 30, and 60). An unusual feature of this experiment is that the rats used in the six vein and time combinations are distinct. I will fit a two-factor interaction model, which assumes that the responses are independent within and across treatments (other models may be preferred). The design is unbalanced, with sample sizes varying from 3 to 12.

An alternative experimental design might randomly assign rats to the two vein groups, and then measure the insulin levels of each rat at the three time points. Depending on the questions of interest, you could compare veins using a one-way MANOVA, or use other multivariate techniques that allow correlated responses within rats.

#### Example: Rat insulin
rat <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch05_ratinsulin.dat"
                , header = TRUE)
# make time a factor variable and label the levels
rat$time <- factor(rat$time)
str(rat)
## 'data.frame': 48 obs. of 3 variables:
##  $ vein   : Factor w/ 2 levels "j","p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time   : Factor w/ 3 levels "0","30","60": 1 1 1 1 1 2 2 2 2 2 ...
##  $ insulin: int 18 36 12 24 43 61 116 63 132 68 ...
head(rat, 3)
##   vein time insulin
## 1    j    0      18
## 2    j    0      36
## 3    j    0      12
tail(rat, 3)
##    vein time insulin
## 46    p   60     105
## 47    p   60      71
## 48    p   60      83

It appears the standard deviation increases with the mean.

# boxplots, ggplot
p <- ggplot(rat, aes(x = time, y = insulin, colour = vein))
p <- p + geom_boxplot()
print(p)
# mean vs sd plot
library(plyr)
# means and standard deviations for each time/interaction cell
rat.meansd.tv <- ddply(rat, .(time,vein), summarise
                     , m = mean(insulin), s = sd(insulin), n = length(insulin))
rat.meansd.tv
##   time vein      m     s  n
## 1    0    j  26.60 12.76  5
## 2    0    p  81.92 27.75 12
## 3   30    j  79.50 36.45  6
## 4   30    p 172.90 76.12 10
## 5   60    j  61.33 62.52  3
## 6   60    p 128.50 49.72 12
p <- ggplot(rat.meansd.tv, aes(x = m, y = s, shape = time, colour = vein, label=n))
p <- p + geom_point(size=4)
# labels are sample sizes
p <- p + geom_text(hjust = 0.5, vjust = -0.5)
p <- p + labs(title = "Rats standard deviation vs mean")
print(p)

[Figure: boxplots of insulin by time, coloured by vein (left); "Rats standard deviation vs mean", cell standard deviation s against cell mean m, points labelled with cell sample sizes (right).]
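One way to see why a log transformation is a natural candidate here is to look at the coefficient of variation, s/m, in each cell. The sketch below simply re-enters the cell summaries reported for this example (the data frame `cells` is an illustrative name, not an object from the original analysis): when the standard deviation grows roughly in proportion to the mean, s/m stays roughly constant, and taking logs tends to stabilize the spread.

```r
# Cell means (m), standard deviations (s), and sample sizes (n)
# copied from the summary table for the untransformed insulin values
cells <- data.frame(
  time = c(0, 0, 30, 30, 60, 60),
  vein = c("j", "p", "j", "p", "j", "p"),
  m    = c(26.60, 81.92, 79.50, 172.90, 61.33, 128.50),
  s    = c(12.76, 27.75, 36.45,  76.12, 62.52,  49.72),
  n    = c(5, 12, 6, 10, 3, 12)
)

# coefficient of variation: SD relative to the mean, per cell
cells$cv <- cells$s / cells$m
cells

# SD and mean move together across the six cells
cor(cells$m, cells$s)
```

The cell with only 3 observations stands out with a much larger ratio than the others, which is worth keeping in mind when judging the transformed data.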
rat$loginsulin <.072 0.(time.4661 6 30 p 5.tv ## ## ## ## ## ## ## 1 2 3 4 5 6 time vein m s n 0 j 3.meansd.p + geom_boxplot() print(p) # mean vs sd plot library(plyr) # means and standard deviations for each time/interaction cell rat.152 Ch 5: Paired Experiments and Randomized Block Experiments We take the log of insulin to correct the problem.meansd.4185 10 60 j 3.0255 3 60 p 4.p + geom_point(size=4) # labels are sample sizes p <. .ggplot(rat. n = length(loginsulin)) rat.ddply(rat.4096 12 30 j 4.ggplot(rat. but because this is based on such a small sample size. colour = vein.5. m = mean(loginsulin) .5166 5 0 p 4. ggplot p <. colour = vein)) p <.p + geom_text(hjust = 0.785 0.tv.log(rat$insulin) # boxplots.3953 12 p <. s = sd(loginsulin) .tv <. vjust = -0.meansd. except for one sample with only 3 observations which has a larger standard deviation than the others.5) p <. aes(x = time. The Type III SS are more difficult to define explicitly. the Type I and Type III interaction SS are identical because this effect was added last to the model statement.rmit.au/~fscholer/anova.6 5 ● 3 6 12 0 30 60 time 12 ● 0.0 5 vein ● a j ● a p vein j p 4 s loginsulin 0.0 m Type I and Type III SS We can request ANOVA tables including Type I or Type III SS3 for each effect in a model. The Type I and III SS for the main effects are not equal. In a regression analysis.php.cs.5 10 5.5 4. where there is no unique way to define the SS for an effect. but are typically different for unbalanced designs.4: Unbalanced Two-Factor Designs and Analysis 153 Rats standard deviation vs mean 3 1. . The problem here is similar to multiple regression.edu. where the SS for a predictor X is the decrease in Residual SS when X is added to a model. see http://goanna. For the insulin analysis. Also note that the Type I SS for the 3 For the ugly details.4 3.0 4. but they roughly correspond to the reduction in Error SS achieved when an effect is added last to the model. 
the standard tests for effects in a model are based on Type III SS and not on the Type I SS. Type I SS and Type III SS are equal for balanced designs and for oneway ANOVA. The Type I SS is the sequential reduction in Error SS achieved when an effect is added to a model that includes only the prior effects listed in the model statement.5. This SS is not unique because the change in the Residual SS depends on which predictors are included in the model prior to X.8 time ● 0 30 ● 60 0. aes(x = time.tv)) ## ## ## ## ## ## ## Df Sum Sq Mean Sq F value Pr(>F) time 2 5. y = loginsulin.ggplot(rat.56 Residuals 42 9. codes: 0 '***' 0.80 time:vein 0 2 0. shape = vein)) p <. alpha = 0. except for the interaction term. we see the Type I and Type III SS are different.lm(loginsulin ~ time*vein.26 0.01 '*' 0.154 Ch 5: Paired Experiments and Randomized Block Experiments main effects and interaction add to the model SS.i. colour = vein. ggplot p <.vein). summarise.05 '.rmit. contrasts = list(time = contr.7e-05 *** vein 1 9.i.v. .8e-08 *** time:vein 2 0.1 ' ' 1 # type III SS Anova(lm.3) . linetype = "solid".13 0.sum.1e-07 *** 0. and because of the interaction Type III SS p-value above.t.au/~fscholer/anova.2. size = 0.58 time 6 2 13.56 '**' 0.001 '**' 0.58 0.i.45 2.tv.' 0.sum" in order for the correct ## Type III SS to be computed. # calculate means for plot library(plyr) rat.tv <. m = mean(loginsulin)) # Interaction plots.001 Pr(>F) < 2e-16 *** 2. type=3) ## ## ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: loginsulin Sum Sq Df F value (Intercept) 669 1 2987.18 6.edu. vein = contr.22 --Signif. data = rat .80 vein 9 1 40.tv <.php library(car) # type I SS (intercept SS not shown) summary(aov(lm.73 12.t. colour = "black" . but the Type III SS do not. ## See http://goanna. lm.' 0. 
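The order dependence of Type I SS is easy to see on a small unbalanced example. The sketch below uses simulated data with hypothetical factors `a` and `b` (not the rat data) and an additive model: the sequential SS for `a` changes depending on whether `a` enters before or after `b`, while the sequential SS for the effects still add up to the same model SS in either order.

```r
# Hypothetical unbalanced two-factor data (simulated; not the rat data)
set.seed(1)
d <- data.frame(
  a = factor(rep(c("a1", "a2"), times = c(8, 4))),
  b = factor(c(rep("b1", 6), rep("b2", 2), rep("b1", 1), rep("b2", 3)))
)
d$y <- 10 + 2 * (d$a == "a2") + 3 * (d$b == "b2") + rnorm(nrow(d))

# Type I (sequential) SS: each effect is adjusted only for the terms listed before it
ss_ab <- anova(lm(y ~ a + b, data = d))[["Sum Sq"]]  # a first, then b given a
ss_ba <- anova(lm(y ~ b + a, data = d))[["Sum Sq"]]  # b first, then a given b
names(ss_ab) <- c("a", "b|a", "Residuals")
names(ss_ba) <- c("b", "a|b", "Residuals")

ss_ab
ss_ba

# the effect SS sum to the same model SS regardless of order
sum(ss_ab[1:2])
sum(ss_ba[1:2])
```

In this additive sketch, the SS for the term entered last ("b|a" or "a|b") is the reduction in Error SS from adding that effect last, which is the rough description of Type III SS given above.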
Because the profile plot lines all seem parallel, and because of the interaction Type III SS p-value above, it appears there is not sufficient evidence for a vein-by-time interaction. For now we'll keep the interaction in the model for the purpose of discussing differences between means and lsmeans and Type I and Type III SS.

# calculate means for plot
library(plyr)
rat.mean.tv <- ddply(rat, .(time,vein), summarise, m = mean(loginsulin))
# Interaction plots, ggplot
p <- ggplot(rat, aes(x = time, y = loginsulin, colour = vein, shape = vein))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                  , linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = rat.mean.tv, aes(y = m), size = 4)
p <- p + geom_line(data = rat.mean.tv, aes(y = m, group = vein), size = 1.5)
p <- p + labs(title = "Rats interaction plot, vein by time")
print(p)
## ymax not defined: adjusting position using y instead
p <- ggplot(rat, aes(x = vein, y = loginsulin, colour = time, shape = time))
p <- p + geom_hline(aes(yintercept = 0), colour = "black"
                  , linetype = "solid", size = 0.2, alpha = 0.3)
p <- p + geom_boxplot(alpha = 0.25, outlier.size=0.1)
p <- p + geom_point(alpha = 0.5, position=position_dodge(width=0.75))
p <- p + geom_point(data = rat.mean.tv, aes(y = m), size = 4)
p <- p + geom_line(data = rat.mean.tv, aes(y = m, group = time), size = 1.5)
p <- p + labs(title = "Rats interaction plot, time by vein")
print(p)
## ymax not defined: adjusting position using y instead

[Figure: "Rats interaction plot, vein by time" (loginsulin against time, one profile per vein) and "Rats interaction plot, time by vein" (loginsulin against vein, one profile per time).]

Means versus lsmeans

The lsmeans (sometimes called adjusted means) for a single factor are arithmetic averages of cell means. For example, the mean responses in the jugular vein at times 0, 30, and 60 are 3.18, 4.29, and 3.76, respectively. The lsmeans for the jugular vein is thus 3.74 = (3.18 + 4.29 + 3.76)/3. This average gives equal weight to the 3 times even though the sample sizes at these times differ (5, 6, and 3). The mean of 3.78 for the jugular is the average of the 14 jugular responses, ignoring time. If the cell sample sizes were equal, the lsmeans and means averages would agree.

The means and lsmeans for individual cells (i.e., for the 6 vein*time combinations) are identical, and equal to cell means.

#### lsmeans
library(plyr)
# unbalanced, don't match
rat.mean.t <- ddply(rat, .(time), summarise, m = mean(loginsulin))
rat.mean.t
##   time     m
## 1    0 3.997
## 2   30 4.778
## 3   60 4.580
library(lsmeans)
lsmeans(lm.i.t.v.tv, list(pairwise ~ time), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of time`
##  time lsmean     SE df lower.CL upper.CL
##  0     3.759 0.1259 42    3.505    4.013
##  30    4.680 0.1221 42    4.433    4.926
##  60    4.272 0.1527 42    3.964    4.580
##
## Confidence level used: 0.95
##
## $`pairwise differences of time`
##  contrast estimate     SE df t.ratio p.value
##  0 - 30    -0.9207 0.1754 42  -5.249  <.0001
##  0 - 60    -0.5133 0.1979 42  -2.594  0.0390
##  30 - 60    0.4073 0.1955 42   2.083  0.1300
##
## P value adjustment: bonferroni method for 3 tests
# unbalanced, don't match
rat.mean.v <- ddply(rat, .(vein), summarise, m = mean(loginsulin))
rat.mean.v
##   vein     m
## 1    j 3.778
## 2    p 4.712
# compare jugular mean above (3.778) with the lsmeans average below (3.742)
(3.179610 + 4.286804 + 3.759076)/3
## [1] 3.742
lsmeans(lm.i.t.v.tv, list(pairwise ~ vein), adjust = "bonferroni")
## NOTE: Results may be misleading due to involvement in interactions
## $`lsmeans of vein`
##  vein lsmean      SE df lower.CL upper.CL
##  j     3.742 0.13193 42    3.476    4.008
##  p     4.732 0.08143 42    4.568    4.896
##
## Confidence level used: 0.95
##
## $`pairwise differences of vein`
##  contrast estimate    SE df t.ratio p.value
##  j - p     -0.9902 0.155 42  -6.387  <.0001
# unbalanced, but highest-order interaction cell means will match
rat.mean.tv <- ddply(rat, .(time,vein), summarise, m = mean(loginsulin))
rat.mean.tv
##   time vein     m
## 1    0    j 3.180
## 2    0    p 4.338
## 3   30    j 4.287
## 4   30    p 5.072
## 5   60    j 3.759
## 6   60    p 4.785
lsmeans(lm.i.t.v.tv, list(pairwise ~ time | vein), adjust = "bonferroni")
## $`lsmeans of time | vein`
## vein = j:
##  time lsmean     SE df lower.CL upper.CL
##  0     3.180 0.2116 42    2.753    3.607
##  30    4.287 0.1931 42    3.897    4.677
##  60    3.759 0.2731 42    3.208    4.310
##
## vein = p:
##  time lsmean     SE df lower.CL upper.CL
##  0     4.338 0.1366 42    4.063    4.614
##  30    5.072 0.1496 42    4.771    5.374
##  60    4.785 0.1366 42    4.510    5.061
##
## Confidence level used: 0.95
##
## $`pairwise differences of time | vein`
## vein = j:
##  contrast estimate     SE df t.ratio p.value
##  0 - 30    -1.1072 0.2864 42  -3.865  0.0011
##  0 - 60    -0.5795 0.3455 42  -1.677  0.3027
##  30 - 60    0.5277 0.3345 42   1.578  0.3664
##
## vein = p:
##  contrast estimate     SE df t.ratio p.value
##  0 - 30    -0.7342 0.2025 42  -3.625  0.0023
##  0 - 60    -0.4472 0.1931 42  -2.316  0.0766
##  30 - 60    0.2870 0.2025 42   1.417  0.4917
##
## P value adjustment: bonferroni method for 3 tests
lsmeans(lm.i.t.v.tv, list(pairwise ~ vein | time), adjust = "bonferroni")
## $`lsmeans of vein | time`
## time = 0:
##  vein lsmean     SE df lower.CL upper.CL
##  j     3.180 0.2116 42    2.753    3.607
##  p     4.338 0.1366 42    4.063    4.614
##
## time = 30:
##  vein lsmean     SE df lower.CL upper.CL
##  j     4.287 0.1931 42    3.897    4.677
##  p     5.072 0.1496 42    4.771    5.374
##
## time = 60:
##  vein lsmean     SE df lower.CL upper.CL
##  j     3.759 0.2731 42    3.208    4.310
##  p     4.785 0.1366 42    4.510    5.061
##
## Confidence level used: 0.95
##
## $`pairwise differences of vein | time`
## time = 0:
##  contrast estimate     SE df t.ratio p.value
##  j - p     -1.1586 0.2518 42  -4.601  <.0001
##
## time = 30:
##  contrast estimate     SE df t.ratio p.value
##  j - p     -0.7856 0.2443 42  -3.216  0.0025
##
## time = 60:
##  contrast estimate     SE df t.ratio p.value
##  j - p     -1.0264 0.3054 42  -3.361  0.0017

For completeness, these diagnostic plots are mostly fine, though the plot of the Cook's distances indicates a couple of influential observations.

# interaction model
lm.i.t.v.tv <- lm(loginsulin ~ time*vein, data = rat
                , contrasts = list(time = contr.sum, vein = contr.sum))
# plot diagnostics
par(mfrow=c(2,3))
plot(lm.i.t.v.tv, which = c(1,4,6))
plot(rat$time, lm.i.t.v.tv$residuals, main="Residuals vs time")
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(rat$vein, lm.i.t.v.tv$residuals, main="Residuals vs vein")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.i.t.v.tv$residuals, las = 1, id.n = 3, main="QQ Plot")
## 13 12 17
## 48  1  2
## residuals vs order of data
#plot(lm.i.t.v.tv$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for the rat insulin interaction model. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage hii/(1 - hii), Residuals vs time, Residuals vs vein, QQ Plot.]

Should I use means or lsmeans, Type I or Type III SS?

Use lsmeans and Type III SS.

Regardless of whether the design is balanced, the basic building blocks for a two-factor analysis are cell means, and the marginal means, defined as the average of the cell means over the levels of the other factor. The F-statistics based on Type III SSs are appropriate for unbalanced two-factor designs because they test the same hypotheses that were considered in balanced designs. That is, the Type III F-tests on the main effects check for equality in population means averaged over levels of the other factor. The Type III F-test for no interaction checks for parallel profiles. Given that the Type III F-tests for the main effects check for equal population cell means averaged over the levels of the other factor, multiple comparisons for main effects should be based on lsmeans.

The Type I SS and F-tests and the multiple comparisons based on means should be ignored because they do not, in general, test meaningful hypotheses. The problem with using the means output is that the experimenter has fixed the sample sizes for a two-factor experiment, so comparisons of means, which ignore the second factor, introduce a potential bias due to choice of sample sizes. Put another way, any differences seen in the means in the jugular and portal could be solely due to the sample sizes used in the experiment and not due to differences in the veins.

Focusing on the Type III SS, the F-tests indicate that the vein and time effects are significant, but that the interaction is not significant. The jugular and portal profiles are reasonably parallel, which is consistent with a lack of interaction. What can you conclude from the lsmeans comparisons of veins and times?

Answer: significant differences between veins, and between times 0 and 30.
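The difference between the two kinds of averages comes down to the weights given to the cell means. As a minimal sketch, assuming only the jugular-vein cell means and cell sizes reported in this example (the object names are illustrative), the lsmean weights the three time cells equally, while the ordinary mean weights each cell by its sample size:

```r
# Jugular-vein cell means of log(insulin) and cell sizes at times 0, 30, 60
# (values taken from the output reported for this example)
cell_mean <- c(t0 = 3.179610, t30 = 4.286804, t60 = 3.759076)
cell_n    <- c(t0 = 5,        t30 = 6,        t60 = 3)

# lsmean: every cell mean gets the same weight
lsmean_j <- mean(cell_mean)

# ordinary mean of the 14 jugular responses: cells weighted by sample size
mean_j <- weighted.mean(cell_mean, w = cell_n)

round(c(lsmean = lsmean_j, mean = mean_j), 3)

# with equal cell sizes the two averages coincide
all.equal(weighted.mean(cell_mean, w = c(4, 4, 4)), lsmean_j)
```

Changing `cell_n` changes the ordinary mean but leaves the lsmean untouched, which is the sense in which comparisons of means can reflect the choice of sample sizes rather than the veins.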
5.5 Writing factor model equations and interpreting coefficients

This section is an exercise in writing statistical factor models, plotting predicted values, and interpreting model coefficients. You'll need pen (preferably multicolored) and paper.

In class I'll discuss indicator variables and writing these models. Together, we'll plot the model predicted values, write the model for each factor combination, and indicate the β coefficients on the plot (βs are vertical differences) [4]. From the exercise, I hope that coefficient interpretation will become clear. Assume a balanced design with n_i observations for each treatment combination.

[4] Please attempt by hand before looking at the solutions at http://statacumen.com/teach/ADA2/ADA2_05_PairedAndBlockDesigns_CoefScan.pdf.

5.5.1 One-way ANOVA, 1 factor with 3 levels

1. Write ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot

Level   1   2   3
ŷ       5   4   6

5.5.2 Two-way ANOVA, 2 factors with 3 and 2 levels, additive model

1. Write two-way ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot

ŷ            Factor 1
Factor 2     1   2   3
1            5   4   6
2            8   7   9

5.5.3 Two-way ANOVA, 2 factors with 3 and 2 levels, interaction model

1. Write two-way ANOVA factor model (general and indicator-variables)
2. Write model for each factor level with βs and predicted values
3. Plot the predicted values on axes
4. Label the βs on the plot
5. Calculate marginal and grand means in table and label in plot

ŷ            Factor 1
Factor 2     1   2   3
1            5   4   6
2            8  10   3
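Once you have worked the exercise by hand, a quick numerical check of the additive table in 5.5.2 is possible. The sketch below is only an illustration under one assumption: it uses R's default treatment (indicator-variable) coding with level 1 of each factor as the baseline, which may differ from the coding used in class. The names `grid`, `f1`, and `f2` are hypothetical.

```r
# Predicted values from the additive two-way table (5.5.2).
# expand.grid varies f1 fastest, so rows are (f1, f2) =
# (1,1), (2,1), (3,1), (1,2), (2,2), (3,2).
grid <- expand.grid(f1 = factor(1:3), f2 = factor(1:2))
grid$yhat <- c(5, 4, 6,   # Factor 2 = 1
               8, 7, 9)   # Factor 2 = 2

# indicator columns: intercept, f1 = 2, f1 = 3, f2 = 2
X <- model.matrix(~ f1 + f2, data = grid)
X

# least-squares coefficients; the fit is exact because the table is additive
beta <- qr.coef(qr(X), grid$yhat)
round(beta, 10)

# each predicted value is the intercept plus the indicator coefficients
# that are switched on for that cell
as.numeric(X %*% beta)
```

Under this coding the intercept is the prediction for the baseline cell, and each remaining coefficient is a vertical difference from that baseline, which is what the exercise asks you to label on the plot.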
Chapter 6

A Short Discussion of Observational Studies

"Thou shall adjust for what thou can not control."

In most scientific studies, the groups being compared do not consist of identical experimental units that have been randomly assigned to receive a treatment. Instead, the groups might be extremely heterogeneous on factors that might be related to a specific response on which you wish to compare the groups. Inferences about the nature of differences among groups in such observational studies can be flawed if this heterogeneity is ignored in the statistical analysis.

The following problem emphasizes the care that is needed when analyzing observational studies, and highlights the distinction between the means and lsmeans output for a two-way table. The data are artificial, but the conclusions are consistent with an interesting analysis conducted by researchers at Sandia National Laboratories.

A representative sample of 550 high school seniors was selected in 1970. A similar sample of 550 was selected in 1990. The final SAT scores (on a 1600 point scale) were obtained for each student [1].

[1] The fake-data example in this chapter is similar to a real-world SAT example illustrated in this paper: "Minority Contributions to the SAT Score Turnaround: An Example of Simpson's Paradox" by Howard Wainer, Journal of Educational Statistics, Vol. 11, No. 4 (Winter, 1986), pp. 239-244, http://www.jstor.org/stable/1164696.

The boxplots for the two samples show heavy-tailed distributions with similar spreads. Given the large sample sizes, the F-test comparing populations is approximately valid even though the population distributions are non-normal.

#### Example: SAT
sat <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch06_sat.dat", header = TRUE)
sat$year <- factor(sat$year)
sat$eth <- factor(sat$eth )
# calculate means by year (also calculated below to illustrate lsmeans())
library(plyr)
sat.mean.y <- ddply(sat, .(year), summarise, m = mean(grade))
# Interaction plots, ggplot
p <- ggplot(sat, aes(x = year, y = grade))
p <- p + geom_boxplot(alpha = 0.5)
p <- p + geom_point(position = position_jitter(w = 0.1, h = 0), colour="gray25", size=1, alpha = 0.2)
p <- p + geom_point(data = sat.mean.y, aes(y = m), size = 4)
#p <- p + geom_line(data = sat.mean.y, aes(y = m), size = 1.5)
p <- p + labs(title = "SAT scores by year")
print(p)

[Figure: "SAT scores by year", boxplots of grade by year (1970, 1990) with jittered points and the yearly means.]

A simple analysis might compare the average SAT scores for the two years, to see whether students are scoring higher, lower, or about the same, over time. The one-way lsmeans and means breakdowns of the SAT scores are identical; the average SAT scores for 1970 and 1990 are 892.8 and 882.2, respectively. The one-way ANOVA, combined with the observed averages, indicates that the typical SAT score has decreased significantly (10.7 points) over the 20 year period.

lm.g.y <- lm(grade ~ year, data = sat
           , contrasts = list(year = contr.sum))
type=3) ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: grade Sum Sq Df F value Pr(>F) (Intercept) 8. there is no randomization of treatments to students selected from a target population.165 library(car) # type III SS Anova(lm. so means and lsmeans match sat.868 <. adjust = "bonferroni") ## $`lsmeans of year` ## year lsmean SE df lower.8 ## 2 1990 882.71e+06 < 2e-16 *** year 3. m = mean(grade)) sat.001 '**' 0. codes: 0 '***' 0.1 ' ' 1 library(plyr) # balanced with respect to year.2 library(lsmeans) lsmeans(lm.9605 1098 891. list(pairwise ~ year).1990 10.95 ## ## $`pairwise differences of year` ## contrast estimate SE df t.(year). The SAT study is an observational study of two distinct populations. all of whom wax eloquently about the impending inability of the U. summarise.g.S.' 0. p + labs(title = "SAT interaction plot. eth by year") print(p) #p <. m = mean(grade)) sat.p + geom_point(data = sat.eth).mean.ye. shape = eth)) p <.ye ## ## ## ## ## 1 2 3 4 year eth m 1970 1 899.5 # Interaction plots. y = grade. eth by year ● ● 950 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 900 ● ● ● ● ● ● ● ● ● ● ● ● ● grade ● ● ● ● ● ● ● eth ● 1 ● ● ● ● 2 ● ● ● ● ● ● 850 ● ● ● ● ● ● 800 ● 1970 1990 year . you see that the typical SAT score within each ethnic group has increased over time. and what are appropriate conclusions in the analysis? sat. y = grade. group = eth). aes(y = m). size = 1. colour = eth.5) p <. group = year). size = 4) #p <. If you construct box-plots of the SAT scores for the four combinations of ethnicity and year. ggplot library(ggplot2) p <.166 Ch 6: A Short Discussion of Observational Studies The observed differences in SAT scores may indeed be due to a decrease in performance.p + geom_boxplot(alpha = 0.p + geom_line(data = sat. aes(y = m). outlier. year by eth") #print(p) SAT interaction plot.mean.mean. Is this a paradox. outlier.mean.size=0. . 
The differences might also be due to factors that make the two populations incomparable for assessing changes in performance over time. summarise.(year.ye. aes(y = m. shape = year)) #p <. size = 4) p <.ye.ggplot(sat.5.mean.p + labs(title = "SAT interaction plot.p + geom_boxplot(alpha = 0.5) p <. aes(y = m. My hypothetical populations have students from two ethnic groups (1 and 2).6 1990 2 875. aes(x = eth.ddply(sat.size=0.p + geom_point(data = sat.ggplot(sat.7 1970 2 824. whereas the typical SAT score ignoring ethnicity decreased over time.1 1990 1 948.mean.ye.5) #p <.5) #p <.ye <.5. colour = year.p + geom_line(data = sat. size = 1. aes(x = year. 1.70e+06 year 2.au/~fscholer/anova.00e+04 year:eth 1. Given the lack of a significant interaction.rmit.861. contrasts = list(year = contr.e.' 0. The marginal lsmeans indicate that the average SAT score increased significantly over time when averaged over ethnicities. The F -test for comparing years adjusts for ethnicity because it is based on comparing the average SAT scores across years after averaging the cell means over ethnicities.g.1 ' ' 1 The year and ethnicity main effects are significant in the two factor model. but the interaction is not. ## See http://goanna. library(plyr) # unbalanced.05 '.lm(grade ~ year * eth.167 I fit a two-factor model with year and ethnicity effects plus an interaction.50e+04 1096 --Signif. . codes: 0 '***' 0.sum" in order for the correct ## Type III SS to be computed.e.01 '*' 0.y ## year m ## 1 1970 892.001 '**' Pr(>F) <2e-16 <2e-16 <2e-16 0.089 *** *** *** . data = sat .ddply(sat.86e+08 1 5.mean.edu.8 .mean. 0.sum)) ## CRITICAL!!! Unbalanced design warning. type=3) ## ## ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: grade Sum Sq Df F value (Intercept) 2.cs. summarise.(year). the expected increase in SAT scores from 1970 to 1990 within each ethnic group is the difference in marginal averages: 912. lm. after adjusting for the effect of ethnicity on performance.9 = 50. 
eth = contr. thereby eliminating from the comparison of years any effects due to changes in the ethnic composition of the populations.55e+03 eth 5.0 .y. don't match (lsmeans is correct) sat.y.y <.45e+02 1 2.ye. This is consistent with the cell mean SAT scores increasing over time within each ethnic group.02e+05 1 1.89e+00 Residuals 5. The two-way analysis is preferable to the unadjusted one-way analysis which ignores ethnicity.ye <.php library(car) # type III SS Anova(lm.sum.g. m = mean(grade)) sat. ## The contrast statement above must be included identifying ## each main effect with "contr.28e+05 1 4. The two-factor model gives a method to compare the SAT scores over time. 6 1990 2 875.e.ye ## ## ## ## ## 1 2 3 4 year eth m 1970 1 899. adjust = "bonferroni") ## $`lsmeans of year | eth` ## eth = 1: ## year lsmean SE df lower.0001 # unbalanced.5253 1096 860.31 0.ye <.y.8 lsmeans(lm.8 0.e.11 0.mean.45 <.8 850.9 Confidence level used: 0.CL 1 924.0017 1096 946.95 $`pairwise differences of eth` contrast estimate SE df t.e.(year.y.3168 1096 899. .g.2 2 849.5253 1096 923.CL upper.2 74. summarise.ye. don't match (lsmeans is correct) sat.g.ye.0 1990 912.7429 1096 100 <.7 1970 2 824.mean.3 ## 1990 948.eth). summarise.e ## eth m ## 1 1 904.ratio p. list(pairwise ~ year | eth).CL upper.2 ## 2 2 870.mean. .7429 1096 -67.5253 1096 911.y.ye.5 ## ## eth = 2: .5 lsmeans(lm.1 1990 1 948.1 900.0001 # unbalanced.0 913.0 0. list(pairwise ~ eth).ddply(sat.CL 1970 861. adjust = "bonferroni") ## ## ## ## ## ## ## ## ## ## ## NOTE: Results may be misleading due to involvement in interactions $`lsmeans of eth` eth lsmean SE df lower.mean.7 0.5253 1096 848.6 950. adjust = "bonferroni") ## ## ## ## ## ## ## ## ## ## ## NOTE: Results may be misleading due to involvement in interactions $`lsmeans of year` year lsmean SE df lower.CL ## 1970 899.1 925. list(pairwise ~ year).168 Ch 6: A Short Discussion of Observational Studies ## 2 1990 882.g.9 863.CL upper. 
m = mean(grade)) sat.95 $`pairwise differences of year` contrast estimate SE df t. but highest-order interaction cell means will match sat. m = mean(grade)) sat.1990 -50.1 Confidence level used: 0.value 1970 .9 0.2 library(lsmeans) lsmeans(lm.(eth).6 1.value 1 .ddply(sat.ratio p.e <.1 0. ratio p.CL 1 899. the 1970 mean SAT score of 892.CL upper.2 73. adjust = "bonferroni") ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## $`lsmeans of eth | year` year = 1970: eth lsmean SE df lower.g.37 1.53 <.5 0.85 1.051 1096 69.6 950.value 1970 .49 <.0017 1096 822.CL 1970 824.0001 As noted in the insulin analysis.CL upper.1 Confidence level used: 0.3 2 824.6 1.0017 1096 946.1 1.CL upper.ratio p.051 1096 -46.05 1.9 876.95 $`pairwise differences of eth | year` year = 1970: contrast estimate SE df t.ye. .value 1970 .3168 1096 874.3168 1096 874. The marginal lsmeans are averages of cell means over the levels of the other factor.1990 -51. The 1970 lsmeans SAT score of 861.1 900.0001 lsmeans(lm.169 ## ## ## ## ## ## ## ## ## ## ## ## ## ## year lsmean SE df lower.7 0.e.9 = (899.y.1 1.1 year = 1990: eth lsmean SE df lower. for example.value 1 . but the year lsmeans are not.1 1990 875.2 75. The marginal means ignore the levels of the other factors when averaging responses.2 826.051 1096 71.0001 year = 1990: contrast estimate SE df t. the marginal lsmeans and means are different for unbalanced two-factor analyses.0017 1096 822. Hopefully.2 826. Thus.5 0.1 Confidence level used: 0. list(pairwise ~ eth | year).5 2 875.1990 -48.57 1.93 <.1)/2.9 is midway between the average 1970 SAT scores for the two ethnic groups: 861.90 <.9 876.7 + 824.ratio p.051 1096 -48.3168 1096 899.CL 1 948. this discussion also clarifies why the year marginal means are identical in the one and two-factor analyses.value 1 .8 is the average of the 550 scores selected that year.0001 eth = 2: contrast estimate SE df t.95 $`pairwise differences of year | eth` eth = 1: contrast estimate SE df t.ratio p. 
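The distinction between marginal means and marginal lsmeans can be checked with a few lines of arithmetic. The sketch below is only an illustration: the cell means are copied from the `sat.mean.ye` output, while the cell sizes (500 and 50 per year) are an assumption inferred from the chapter's later statement that ten out of every eleven students sampled in 1970 were from the first ethnic group. The data frame `cells` and the names `marg.mean` and `marg.lsmean` are hypothetical helpers, not objects from the notes.

```r
# Cell means from the sat.mean.ye output; cell sizes n are ASSUMED
# (550 students per year, split 10:1 and 1:10 between ethnic groups).
cells <- data.frame(
  year = factor(c(1970, 1970, 1990, 1990)),
  eth  = factor(c(1, 2, 1, 2)),
  m    = c(899.7, 824.1, 948.6, 875.5),
  n    = c(500, 50, 50, 500)
)

by.year <- split(cells, cells$year)

# Marginal means: every student counts equally, so cells are weighted by n
marg.mean   <- sapply(by.year, function(d) weighted.mean(d$m, d$n))

# Marginal lsmeans: every cell counts equally, regardless of its size
marg.lsmean <- sapply(by.year, function(d) mean(d$m))

round(rbind(mean = marg.mean, lsmean = marg.lsmean), 1)
```

Because the cell means are rounded to one decimal, the weighted means should agree with the reported 892.8 and 882.2 only up to rounding. The size-weighted averages fall from 1970 to 1990 while the equally weighted averages rise, which is the reversal discussed in this chapter.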
The 1970 and 1990 marginal means estimate the typical SAT score ignoring all factors that may influence performance. These marginal averages are not relevant for understanding any trends in performance over time because they do not account for changes in the composition of the population that may be related to performance.

The average SAT scores (ignoring ethnicity) decreased from 1970 to 1990 because the ethnic composition of the student population changed. Ten out of every eleven students sampled in 1970 were from the first ethnic group. Only one out of eleven students sampled in 1990 was from this group. Students in the second ethnic group are underachievers, but they are becoming a larger portion of the population over time. The decrease in average (means) performance inferred from comparing 1970 to 1990 is confounded with the increased representation of the underachievers over time. Once ethnicity was taken into consideration, the typical SAT scores were shown to have increased, rather than decreased.

In summary, the one-way analysis ignoring ethnicity is valid, and allows you to conclude that the typical SAT score has decreased over time, but it does not provide any insight into the nature of the changes that have occurred. A two-factor analysis backed up with a comparison of the marginal lsmeans is needed to compare performances over time, adjusting for the changes in ethnic composition.

The Sandia study reached the same conclusion. The Sandia team showed that the widely reported decreases in SAT scores over time are due to changes in the ethnic distribution of the student population over time, with individuals in historically underachieving ethnic groups becoming a larger portion of the student population over time.

A more complete analysis of the SAT study would adjust the SAT scores to account for other potential confounding factors, such as sex, and differences due to the number of times the exam was taken. These confounding effects are taken into consideration by including them as effects in the model.

The interpretation of the results from an observational study with several effects of interest, and several confounding variables, is greatly simplified by eliminating the insignificant effects from the model. For example, the year by ethnicity interaction in the SAT study might be omitted from the model to simplify interpretation. The year effects would then be estimated after fitting a two-way additive model with year and ethnicity effects only. The same approach is sometimes used with designed experiments, say the insulin study that we analyzed earlier.

An important caveat

The ideas that we discussed on the design and analysis of experiments and observational studies are universal. They apply regardless of whether you are analyzing categorical data, counts, or measurements.

Part IV ANCOVA and logistic regression

Chapter 7 Analysis of Covariance: Comparing Regression Lines

Suppose that you are interested in comparing the typical lifetime (hours) of two tool types (A and B). A simple analysis of the data given below would consist of making side-by-side boxplots followed by a two-sample test of equal means (or medians). The standard two-sample test using the pooled variance estimator is a special case of the one-way ANOVA with two groups. The summaries suggest that the distribution of lifetimes for the tool types are different. In the output below, µi is population mean lifetime for tool type i (i = A, B).

#### Example: Tool lifetime
tools <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_tools.dat", header = TRUE)
str(tools)
## 'data.frame': 20 obs. of 3 variables:
##  $ lifetime: num 18.7 14.5 17.4 14.5 13.4 ...
##  $ rpm     : int 610 950 720 840 980 530 680 540 890 730 ...
##  $ type    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...

    lifetime   rpm  type
 1   18.7300   610     A
 2   14.5200   950     A
 3   17.4300   720     A
 4   14.5400   840     A
 5   13.4400   980     A
 6   24.3900   530     A
 7   13.3400   680     A
 8   22.7100   540     A
 9   12.6800   890     A
10   19.3200   730     A
11   30.1600   670     B
12   27.0900   770     B
13   25.4000   880     B
14   26.0500  1000     B
15   33.4900   760     B
16   35.6200   590     B
17   26.0700   910     B
18   36.7800   650     B
19   34.9500   810     B
20   43.6700   500     B

library(ggplot2)
p <- ggplot(tools, aes(x = type, y = lifetime))
# plot a reference line for the global mean (assuming no groups)
p <- p + geom_hline(aes(yintercept = mean(lifetime)), colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6, colour="red", alpha = 0.8)
# confidence limits based on normal distribution
p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width = .2, colour="red", alpha = 0.8)
p <- p + labs(title = "Tool type lifetime") + ylab("lifetime (hours)")
p <- p + coord_flip()
print(p)

[Figure: "Tool type lifetime". Horizontal boxplots of lifetime (hours) for type A and type B, with group means and confidence limits.]

A two sample t-test comparing mean lifetimes of tool types indicates a difference between means.

t.summary <- t.test(lifetime ~ type, data = tools)
t.summary
##
##  Welch Two Sample t-test
##
## data:  lifetime by type
## t = -6.435, df = 15.93, p-value = 8.422e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.701  -9.935
## sample estimates:
## mean in group A mean in group B
##           17.11           31.93

This comparison is potentially misleading because the samples are not comparable. A one-way ANOVA is most appropriate for designed experiments where all the factors influencing the response, other than the treatment (tool type), are fixed by the experimenter. The tools were operated at different speeds. If speed influences lifetime, then the observed differences in lifetimes could be due to differences in speeds at which the two tool types were operated.

Fake example

For example, suppose speed is inversely related to lifetime of the tool. Then, the differences seen in the boxplots above could be due to tool type B being operated at lower speeds than tool type A.

To see how this is possible, consider the data plot given below, where the relationship between lifetime and speed is identical in each sample. A simple linear regression model relating hours to speed, ignoring tool type, fits the data exactly, yet the lifetime distributions for the tool types, ignoring speed, differ dramatically. (The data were generated to fall exactly on a straight line). The regression model indicates that you would expect identical mean lifetimes for tool types A and B, if they were, or could be, operated at identical speeds. This is not exactly what happens in the actual data. However, I hope the point is clear.

#### Example: Tools, fake
toolsfake <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_toolsfake.dat", header = TRUE)
library(ggplot2)
p <- ggplot(toolsfake, aes(x = speed, y = hours, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(toolsfake$type))))
p <- p + labs(title="Fake tools data, hours by speed with categorical type")
print(p)

[Figure: "Fake tools data, hours by speed with categorical type". Scatterplot of hours (y-axis) against speed (x-axis, 600 to 1000), points labelled A or B by type.]

As noted in the Chapter 6 SAT example, you should be wary of group comparisons where important factors that influence the response have not been accounted for or controlled. In the SAT example, the differences in scores were affected by a change in the ethnic composition over time. A two-way ANOVA with two factors, time and ethnicity, gave the most sensible analysis.

For the tool lifetime problem, you should compare groups (tools) after adjusting the lifetimes to account for the influence of a measurement variable, speed. The appropriate statistical technique for handling this problem is called analysis of covariance (ANCOVA).

7.1 ANCOVA

A natural way to account for the effect of speed is through a multiple regression model with lifetime as the response and two predictors, speed and tool type. A binary categorical variable, here tool type, is included in the model as a dummy variable or indicator variable (a {0, 1} variable).

Consider the model

    Tool lifetime = β0 + β1 typeB + β2 rpm + e,

where typeB is 0 for type A tools, and 1 for type B tools. For type A tools, the model simplifies to:

    Tool lifetime = β0 + β1(0) + β2 rpm + e
                  = β0 + β2 rpm + e.

For type B tools, the model simplifies to:

    Tool lifetime = β0 + β1(1) + β2 rpm + e
                  = (β0 + β1) + β2 rpm + e.

This ANCOVA model fits two regression lines, one for each tool type, but restricts the slopes of the regression lines to be identical. To see this, let us focus on the interpretation of the regression coefficients. For the ANCOVA model,

    β2 = slope of population regression lines for tool types A and B,

and

    β0 = intercept of population regression line for tool A (called the reference group).

Given that

    β0 + β1 = intercept of population regression line for tool B,

it follows that

    β1 = difference between tool B and tool A intercepts.

A picture of the population regression lines for one version of the model is given below.
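The indicator coding described in this section can be sketched in R as follows. This is a minimal illustration under stated assumptions: the four-row data frame `d` and the coefficient values in `beta` are hypothetical and are not estimates from the tools data. The point is only that R builds the {0, 1} column `typeB` from a factor with reference level A, and that the implied B-minus-A difference is β1 at every speed.

```r
# Hypothetical mini data set, for illustration only
d <- data.frame(
  type = factor(c("A", "A", "B", "B")),   # "A" is the reference level
  rpm  = c(600, 800, 600, 800)
)

# Design matrix for: lifetime = beta0 + beta1 * typeB + beta2 * rpm + e
X <- model.matrix(~ type + rpm, data = d)
X   # columns: (Intercept), typeB (0 for A, 1 for B), rpm

# Hypothetical coefficient values (beta0, beta1, beta2)
beta <- c(37, 15, -0.03)

# Population mean lifetime implied for each row
mu <- drop(X %*% beta)

# Difference B minus A at 600 rpm and at 800 rpm: both equal beta1
diff.600 <- unname(mu[3] - mu[1])
diff.800 <- unname(mu[4] - mu[2])
c(diff.600, diff.800)
```

The equal differences at the two speeds are what "parallel lines" means here: the typeB indicator shifts the intercept and leaves the slope untouched.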
[Figure: population regression lines for Tool A and Tool B. Population Mean Life (y-axis, 15 to 40) against Speed (x-axis, 500 to 1000).]

An important feature of the ANCOVA model is that β1 measures the difference in mean response for the tool types, regardless of the speed. A test of H0 : β1 = 0 is the primary interest, and is interpreted as a comparison of the tool types, after adjusting or allowing for the speeds at which the tools were operated.

The ANCOVA model is plausible. The relationship between lifetime and speed is roughly linear within tool types, with similar slopes but unequal intercepts across groups. The plot of the studentized residuals against the fitted values shows no gross abnormalities, but suggests that the variability about the regression line for tool type A is somewhat smaller than the variability for tool type B. The model assumes that the variability of the responses is the same for each group. The QQ-plot does not show any gross deviations from a straight line.

#### Example: Tool lifetime
library(ggplot2)
p <- ggplot(tools, aes(x = rpm, y = lifetime, colour = type, shape = type))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(tools$type))))
p <- p + geom_smooth(method = lm, se = FALSE)
p <- p + labs(title="Tools data, lifetime by rpm with categorical type")
print(p)

[Figure: "Tools data, lifetime by rpm with categorical type". Scatterplot of lifetime (y-axis) against rpm (x-axis, 500 to 1000), points labelled A or B, with a fitted line per type.]

lm.l.r.t <- lm(lifetime ~ rpm + type, data = tools)
#library(car)
#Anova(aov(lm.l.r.t), type=3)
summary(lm.l.r.t)
##
## Call:
## lm(formula = lifetime ~ rpm + type, data = tools)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -5.553 -1.787 -0.002  1.839  4.984
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.98560    3.51038   10.54  7.2e-09 ***
## rpm         -0.02661    0.00452   -5.89  1.8e-05 ***
## typeB       15.00425    1.35967   11.04  3.6e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.04 on 17 degrees of freedom
## Multiple R-squared: 0.9, Adjusted R-squared: 0.889
## F-statistic: 76.7 on 2 and 17 DF, p-value: 3.09e-09

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.l.r.t, which = c(1,4,6), pch=as.character(tools$type))
plot(tools$rpm, lm.l.r.t$residuals, main="Residuals vs rpm", pch=as.character(tools$type))
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(tools$type, lm.l.r.t$residuals, main="Residuals vs type")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.l.r.t$residuals, las = 1, id.n = 3, main="QQ Plot", pch=as.character(tools$type))
##  7 20 19
##  1 20 19
## residuals vs order of data
#plot(lm.l.r.t$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the tools ANCOVA model. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs rpm, Residuals vs type, QQ Plot.]

The fitted relationship for the combined data set is

    Predicted Lifetime = 36.99 + 15.00 typeB − 0.0266 rpm.

Assigning the LS estimates to the appropriate parameters, the fitted relationships for the two tool types must be, for tool type B:

    Predicted Lifetime = (36.99 + 15.00) − 0.0266 rpm
                       = 51.99 − 0.0266 rpm,

and for tool type A:

    Predicted Lifetime = 36.99 − 0.0266 rpm.

The t-test of H0 : β1 = 0 checks whether the intercepts for the population regression lines are equal, assuming equal slopes. The t-test p-value < 0.0001 suggests that the population regression lines for tools A and B have unequal intercepts. The LS lines indicate that the average lifetime of either type tool decreases by 0.0266 hours for each increase in 1 RPM. Regardless of the lathe speed, the model predicts that type B tools will last 15 hours longer (i.e., the regression coefficient for the typeB predictor) than type A tools. Summarizing this result another way, the t-test suggests that there is a significant difference between the lifetimes of the two tool types, after adjusting for the effect of the speeds at which the tools were operated. The estimated difference in average lifetime is 15 hours, regardless of the lathe speed.

7.2 Generalizing the ANCOVA Model to Allow Unequal Slopes

I will present a flexible approach for checking equal slopes and equal intercepts in ANCOVA-type models. The algorithm also provides a way to build regression models in studies where the primary interest is comparing the regression lines across groups rather than comparing groups after adjusting for a regression effect. The approach can be applied to an arbitrary number of groups and predictors. For simplicity, I will consider a problem with three groups and a single regression effect.

The data (footnote 1) below are the IQ scores of identical twins, one raised in a foster home (IQF) and the other raised by natural parents (IQN). The 27 pairs are divided into three groups by social status of the natural parents (H=high, M=medium, L=low). I will examine the regression of IQF on IQN for each of the three social classes.

Footnote 1: The data were originally analyzed by Sir Cyril Burt.

There is no a priori reason to assume that the regression lines for the three groups have equal slopes or equal intercepts. These are, however, reasonable hypotheses to examine. The easiest way to check these hypotheses is to fit a multiple regression model to the combined data set, and check whether certain carefully defined regression effects are zero. The most general model has six parameters, and corresponds to fitting a simple linear regression model to the three groups separately (3 × 2 = 6).

Two indicator variables are needed to uniquely identify each observation by social class. For example, let I1 = 1 for H status families and I1 = 0 otherwise, and let I2 = 1 for M status families and I2 = 0 otherwise. The indicators I1 and I2 jointly assume 3 values:

Status  I1  I2
L        0   0
M        0   1
H        1   0

Given the indicators I1 and I2 and the predictor IQN, define two interaction or product effects: I1 × IQN and I2 × IQN.

7.2.1 Unequal slopes ANCOVA model

The most general model allows separate slopes and intercepts for each group:

    IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e.    (7.1)

This model is best understood by considering the three status classes separately. If status = L, then I1 = I2 = 0. For these families

    IQF = β0 + β3 IQN + e.

If status = M, then I1 = 0 and I2 = 1. For these families

    IQF = β0 + β2(1) + β3 IQN + β5 IQN + e
        = (β0 + β2) + (β3 + β5) IQN + e.
Finally, if status = H, then I1 = 1 and I2 = 0. For these families

    IQF = β0 + β1(1) + β3 IQN + β4 IQN + e
        = (β0 + β1) + (β3 + β4) IQN + e.

The regression coefficients β0 and β3 are the intercept and slope for the L status population regression line. The other parameters measure differences in intercepts and slopes across the three groups, using L status families as a baseline or reference group. In particular:

    β1 = difference between the intercepts of the H and L population regression lines.
    β2 = difference between the intercepts of the M and L population regression lines.
    β4 = difference between the slopes of the H and L population regression lines.
    β5 = difference between the slopes of the M and L population regression lines.

The plot gives a possible picture of the population regression lines corresponding to the general model (7.1).

[Figure: population regression lines for L Status, M Status, and H Status. Population Mean IQ Foster Twin (IQF) (y-axis, 80 to 160) against Home Twin IQ (IQN) (x-axis, 70 to 130).]

We fit the general model to the twins data.

#### Example: Twins
twins <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch07_twins.dat", header = TRUE)
# set "L" as baseline level
twins$status <- relevel(twins$status, "L")
str(twins)
## 'data.frame': 27 obs. of 3 variables:
##  $ IQF   : int 82 80 88 108 116 117 132 71 75 93 ...
##  $ IQN   : int 82 90 91 115 115 129 131 78 79 82 ...
##  $ status: Factor w/ 3 levels "L","H","M": 2 2 2 2 2 2 2 3 3 3 ...

library(ggplot2)
p <- ggplot(twins, aes(x = IQN, y = IQF, colour = status, shape = status))
p <- p + geom_point(size=4)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(twins$status))))
p <- p + geom_smooth(method = lm, se = FALSE)
p <- p + labs(title="Twins data, IQF by IQN with categorical status")
# equal axes since x- and y-variables are same quantity
dat.range <- range(twins[,c("IQF","IQN")])
p <- p + xlim(dat.range) + ylim(dat.range) + coord_equal(ratio=1)
print(p)

[Figure: "Twins data, IQF by IQN with categorical status". Scatterplot of IQF (y-axis) against IQN (x-axis), both about 60 to 130, points labelled L, M, or H, with a fitted line per status.]

lm.f.n.s.ns <- lm(IQF ~ IQN*status, data = twins)
library(car)
Anova(aov(lm.f.n.s.ns), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
##             Sum Sq Df F value  Pr(>F)
## (Intercept)     12  1    0.18    0.67
## IQN           1700  1   27.10 3.7e-05 ***
## status           9  2    0.07    0.93
## IQN:status       1  2    0.01    0.99
## Residuals     1317 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.f.n.s.ns)
##
## Call:
## lm(formula = IQF ~ IQN * status, data = twins)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.480  -5.248  -0.155   4.582  13.798
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.2046    16.7513    0.43     0.67
## IQN           0.9484     0.1822    5.21  3.7e-05 ***
## statusH      -9.0767    24.4487   -0.37     0.71
## statusM      -6.3886    31.0209   -0.21     0.84
## IQN:statusH   0.0291     0.2446    0.12     0.91
## IQN:statusM   0.0241     0.3393    0.07     0.94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.92 on 21 degrees of freedom
## Multiple R-squared: 0.804, Adjusted R-squared: 0.757
## F-statistic: 17.2 on 5 and 21 DF, p-value: 8.31e-07

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.f.n.s.ns, which = c(1,4,6), pch=as.character(twins$status))
plot(twins$IQN, lm.f.n.s.ns$residuals, main="Residuals vs IQN", pch=as.character(twins$status))
# horizontal line at zero
abline(h = 0, col = "gray75")
plot(twins$status, lm.f.n.s.ns$residuals, main="Residuals vs status")
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.f.n.s.ns$residuals, las = 1, id.n = 3, main="QQ Plot", pch=as.character(twins$status))
## 27 24 23
##  1 27 26
## residuals vs order of data
#plot(lm.f.n.s.ns$residuals, main="Residuals vs Order of data")
# # horizontal line at zero
# abline(h = 0, col = "gray75")
[Figure: diagnostic plots for the general (unequal slopes) model. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs status, Residuals vs IQN, QQ Plot.]

The natural way to express the fitted model is to give separate prediction equations for the three status groups. Here is an easy way to get the separate fits. For the general model (7.1), the predicted IQF satisfies

    Predicted IQF = (Intercept + Coeff for Status Indicator)
                    + (Coeff for Status Product Effect + Coeff for IQN) × IQN.

For the baseline group, use 0 as the coefficients for the status indicator and product effect. Thus, for the baseline group with status = L,

    Predicted IQF = 7.20 + 0 + (0.948 + 0) IQN
                  = 7.20 + 0.948 IQN.

For the M status group with indicator I2 and product effect I2 × IQN:

    Predicted IQF = 7.20 − 6.39 + (0.948 + 0.024) IQN
                  = 0.81 + 0.972 IQN.

For the H status group with indicator I1 and product effect I1 × IQN:

    Predicted IQF = 7.20 − 9.08 + (0.948 + 0.029) IQN
                  = −1.88 + 0.977 IQN.

The LS lines are identical to separately fitting simple linear regressions to the three groups.

7.2.2 Equal slopes ANCOVA model

There are three other models of potential interest besides the general model.
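Before turning to the reduced models, the bookkeeping behind the three general-model prediction equations can be sketched in R. This is only an illustration: the numbers are the rounded estimates quoted in the prediction equations for the general model, and the variable names (`b0`, `b.H`, `fit.lines`, and so on) are hypothetical, not objects created in the notes.

```r
# Rounded estimates quoted in the general-model prediction equations
b0      <- 7.20     # intercept for the baseline group (status = L)
b.H     <- -9.08    # intercept difference, H vs L
b.M     <- -6.39    # intercept difference, M vs L
b.IQN   <- 0.948    # slope for the baseline group (status = L)
b.IQN.H <- 0.029    # slope difference, H vs L
b.IQN.M <- 0.024    # slope difference, M vs L

# Baseline group uses 0 for both the indicator and the product effect
fit.lines <- data.frame(
  status    = c("L", "M", "H"),
  intercept = b0    + c(0, b.M,     b.H),
  slope     = b.IQN + c(0, b.IQN.M, b.IQN.H)
)
fit.lines

# Predicted IQF for a hypothetical twin pair with IQN = 100 in each group
pred.100 <- fit.lines$intercept + fit.lines$slope * 100
```

With a fitted `lm` object the same sums could be formed from `coef()`, but the hand calculation makes explicit which coefficients are added for each group.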
The equal slopes ANCOVA model

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + e

is a special case of (7.1) with β4 = β5 = 0 (no interaction). In the ANCOVA model, β3 is the slope for all three regression lines. The other parameters have the same interpretation as in the general model (7.1), see the plot above. Output from the ANCOVA model is given below.

[Figure: population regression lines under the equal slopes ANCOVA model, one parallel line each for L, M, and H status; x-axis Home Twin IQ (IQN), y-axis Population Mean IQ Foster Twin (IQF).]

lm.f.n.s <- lm(IQF ~ IQN + status, data = twins)
library(car)
Anova(aov(lm.f.n.s), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
##             Sum Sq Df F value Pr(>F)
## (Intercept)     18  1    0.32   0.58
## IQN           4675  1   81.55  5e-09 ***
## status         175  2    1.53   0.24
## Residuals     1318 23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.f.n.s)
## Call:
## lm(formula = IQF ~ IQN + status, data = twins)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.823  -5.237  -0.111   4.476  13.698
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.619      9.963    0.56     0.58
## IQN            0.966      0.107    9.03    5e-09 ***
## statusH       -6.226      3.917   -1.59     0.13
## statusM       -4.191      3.695   -1.13     0.27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.57 on 23 degrees of freedom
## Multiple R-squared: 0.804, Adjusted R-squared: 0.778
## F-statistic: 31.4 on 3 and 23 DF, p-value: 2.6e-08

For the ANCOVA model, the predicted IQF for the three groups satisfies

Predicted IQF = (Intercept + Coeff for Status Indicator) + (Coeff for IQN) × IQN.

As with the general model, use 0 as the coefficient for the status indicator for the baseline group.
For L status families:

Predicted IQF = 5.62 + 0.966 IQN,

for M status:

Predicted IQF = 5.62 − 4.19 + 0.966 IQN = 1.43 + 0.966 IQN,

and for H status:

Predicted IQF = 5.62 − 6.23 + 0.966 IQN = −0.61 + 0.966 IQN.

7.2.3 Equal slopes and equal intercepts ANCOVA model

The model with equal slopes and equal intercepts

IQF = β0 + β3 IQN + e

is a special case of the ANCOVA model with β1 = β2 = 0. This model does not distinguish among social classes. The common intercept and slope for the social classes are β0 and β3, respectively. The predicted IQF for this model is

Predicted IQF = 9.21 + 0.901 IQN

for each social class.

lm.f.n <- lm(IQF ~ IQN, data = twins)
#library(car)
#Anova(aov(lm.f.n), type=3)
summary(lm.f.n)
## Call:
## lm(formula = IQF ~ IQN, data = twins)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.351  -5.731   0.057   4.324  16.353
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.2076     9.2999    0.99     0.33
## IQN           0.9014     0.0963    9.36  1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.73 on 25 degrees of freedom
## Multiple R-squared: 0.778, Adjusted R-squared: 0.769
## F-statistic: 87.6 on 1 and 25 DF, p-value: 1.2e-09

7.2.4 No slopes, but intercepts ANCOVA model

The model with no predictor (IQN) effects

IQF = β0 + β1 I1 + β2 I2 + e

is a special case of the ANCOVA model with β3 = 0. In this model, social status has an effect on IQF but IQN does not. This model of parallel regression lines with zero slopes is identical to a one-way ANOVA model for the three social classes, where the intercepts play the role of the population means, see the plot below.
[Figure: population regression lines under the no-slope model, one horizontal line each for H, M, and L status; x-axis Home Twin IQ (IQN), y-axis Population Mean IQ Foster Twin (IQF).]

For the ANOVA model, the predicted IQF for the three groups satisfies

Predicted IQF = Intercept + Coeff for Status Indicator.

Again, use 0 as the coefficient for the baseline status indicator. For L status families:

Predicted IQF = 93.71,

for M status:

Predicted IQF = 93.71 − 4.88 = 88.83,

and for H status:

Predicted IQF = 93.71 + 9.57 = 103.28.

The predicted IQFs are the mean IQFs for the three groups.

lm.f.s <- lm(IQF ~ status, data = twins)
library(car)
Anova(aov(lm.f.s), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
##             Sum Sq Df F value Pr(>F)
## (Intercept) 122953  1  492.38 <2e-16 ***
## status         732  2    1.46   0.25
## Residuals     5993 24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.f.s)
## Call:
## lm(formula = IQF ~ status, data = twins)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -30.71 -12.27   2.29  12.50  28.71
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    93.71       4.22   22.19   <2e-16 ***
## statusH         9.57       7.32    1.31     0.20
## statusM        -4.88       7.71   -0.63     0.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 24 degrees of freedom
## Multiple R-squared: 0.109, Adjusted R-squared: 0.0345
## F-statistic: 1.46 on 2 and 24 DF, p-value: 0.251

7.3 Relating Models to Two-Factor ANOVA

Recall the multiple regression formulation of the general model (7.1):

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e. (7.2)

If you think of β0 as a grand mean, β1 I1 + β2 I2 as the status effect (i.e., the two indicators I1 and I2 allow you to differentiate among social classes), β3 IQN as the IQN effect and β4 I1 IQN + β5 I2 IQN as the status by IQN interaction, then you can represent the model as

IQF = Grand Mean + Status Effect + IQN effect + Status×IQN interaction + Residual.
(7.3)

This representation has the same form as a two-factor ANOVA model with interaction, except that IQN is a quantitative effect rather than a qualitative (i.e., categorical) effect. The general model has the same structure as a two-factor interaction ANOVA model because the plot of the population means allows non-parallel profiles. However, the general model is a special case of the two-factor interaction ANOVA model because it restricts the means to change linearly with IQN.

The ANCOVA model has main effects for status and IQN but no interaction:

IQF = Grand Mean + Status Effect + IQN effect + Residual. (7.4)

The ANCOVA model is a special case of the additive two-factor ANOVA model because the plot of the population means has parallel profiles, but is not equivalent to the additive two-factor ANOVA model.

The model with equal slopes and intercepts has no main effect for status nor an interaction between status and IQN:

IQF = Grand Mean + IQN effect + Residual. (7.5)

The one-way ANOVA model has no main effect for IQN nor an interaction between status and IQN:

IQF = Grand Mean + Status Effect + Residual. (7.6)

I will expand on these ideas later, as they are useful for understanding the connections between regression and ANOVA models.

7.4 Choosing Among Models

I will suggest a backward sequential method to select which of models (7.1), (7.4), and (7.5) fits best. You would typically be interested in the one-way ANOVA model (7.6) only when the effect of IQN was negligible.

Step 1: Fit the full model (7.1) and test the hypothesis of equal slopes H0 : β4 = β5 = 0. (Aside: t-tests are used to test either β4 = 0 or β5 = 0.) To test H0, eliminate the predictor variables I1 IQN and I2 IQN associated with β4 and β5 from the full model (7.1). Then fit the reduced model (7.4) with equal slopes.
Reject H0 : β4 = β5 = 0 if the increase in the Residual SS obtained by deleting I1 IQN and I2 IQN from the full model is significant. Formally, compute the F-statistic:

Fobs = [(ERROR SS for reduced model − ERROR SS for full model)/2] / (ERROR MS for full model)

and compare it to an upper-tail critical value for an F-distribution with 2 and df degrees of freedom, where df is the Residual df for the full model. The F-test is a direct extension of the single degree-of-freedom F-tests in the stepwise fits. A p-value for the F-test is obtained from library(car) with Anova(aov(LMOBJECT), type=3); read it from the row for the interaction. If H0 is rejected, stop and conclude that the population regression lines have different slopes (and then I do not care whether the intercepts are equal). Otherwise, proceed to step 2.

Step 2: Fit the equal slopes or ANCOVA model (7.4) and test for equal intercepts H0 : β1 = β2 = 0. Follow the procedure outlined in Step 1, treating the ANCOVA model as the full model and the model IQF = β0 + β3 IQN + e with equal slopes and intercepts as the reduced model. The p-value is in the row for the status term (not the (Intercept) row) from library(car) with Anova(aov(LMOBJECT), type=3). If H0 is rejected, conclude that the population regression lines are parallel with unequal intercepts. Otherwise, conclude that the regression lines are identical.

Step 3: Estimate the parameters under the appropriate model, and conduct a diagnostic analysis. Summarize the fitted model by status class.

A comparison of regression lines across k > 3 groups requires k − 1 indicator variables to define the groups, and k − 1 interaction variables, assuming the model has a single predictor. The comparison of models mimics the discussion above, except that the numerator of the F-statistic is divided by k − 1 instead of 2, and the numerator df for the F-test is k − 1 instead of 2. If k = 2, the F-tests for comparing the three models are equivalent to the t-tests given with the parameter estimates summary.
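The F-statistic in Step 1 can be sketched directly from the two residual sums of squares. The data below are synthetic (a hypothetical stand-in, not the twins data), and `F.obs`, `p.obs`, and `tab` are names introduced only for this illustration. The built-in `anova()` comparison of the nested fits is included as a cross-check.

```r
# Synthetic stand-in for the twins data (hypothetical values)
set.seed(1)
status <- factor(rep(c("L", "H", "M"), each = 9), levels = c("L", "H", "M"))
IQN <- round(rnorm(27, mean = 95, sd = 15))
IQF <- 5 + 0.95 * IQN + rnorm(27, sd = 8)

fit.full <- lm(IQF ~ IQN * status)   # general model: unequal slopes
fit.red  <- lm(IQF ~ IQN + status)   # reduced model: equal slopes

rss.full <- sum(resid(fit.full)^2)   # ERROR SS for full model
rss.red  <- sum(resid(fit.red)^2)    # ERROR SS for reduced model
df.full  <- df.residual(fit.full)    # Residual df for full model
q        <- df.residual(fit.red) - df.full   # number of deleted terms (2 here)

# F = [(SS reduced - SS full) / q] / (MS error for full model)
F.obs <- ((rss.red - rss.full) / q) / (rss.full / df.full)
p.obs <- pf(F.obs, q, df.full, lower.tail = FALSE)

# Cross-check with R's nested-model comparison
tab <- anova(fit.red, fit.full)
print(c(F.obs = F.obs, p.obs = p.obs))
print(tab)
```

Step 2 works the same way with `IQF ~ IQN + status` as the full model and `IQF ~ IQN` as the reduced model.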
Recall, for example, how you tested for equal intercepts in the tools problems.

The plot of the twins data shows fairly linear relationships within each social class. The linear relationships appear to have similar slopes and similar intercepts. The p-value for testing the hypothesis that the slopes of the population regression lines are equal is essentially 1. The observed data are consistent with the reduced model of equal slopes.

The p-value for comparing the model of equal slopes and equal intercepts to the ANCOVA model is 0.238, so there is insufficient evidence to reject the reduced model with equal slopes and intercepts. The estimated regression line, regardless of social class, is: Predicted IQF = 9.21 + 0.901*IQN. There are no serious inadequacies with this model, based on a diagnostic analysis (not shown).

An interpretation of this analysis is that the natural parents' social class has no impact on the relationship between the IQ scores of identical twins raised apart. What other features of the data would be interesting to explore? For example, what values of the intercept and slope of the population regression line are of intrinsic interest?

7.4.1 Simultaneous testing of regression parameters

In the twins example, we have this full interaction model,

IQF = β0 + β1 I1 + β2 I2 + β3 IQN + β4 I1 IQN + β5 I2 IQN + e, (7.7)

where I1 = 1 indicates H, and I2 = 1 indicates M, and L is the baseline status. Consider these two specific hypotheses:

1. H0 : equal regression lines for status M and L
2. H0 : equal regression lines for status M and H

That is, the intercept and slope for the regression lines are equal for the pairs of status groups. First, it is necessary to formulate these hypotheses in terms of testable parameters. That is, find the β values that make the null hypothesis true in terms of the model equation.

1. H0 : β2 = 0 and β5 = 0
2. H0 : β1 = β2 and β4 = β5

Using linear model theory, there are methods for testing these multiple-parameter hypotheses. One strategy is to use the Wald test of the null hypothesis Rβ = r, where R is a matrix of contrast coefficients (typically +1 or −1), β is our vector of regression β coefficients, and r is a hypothesized vector of what the linear system Rβ equals. For our first hypothesis test, the linear system we're testing in matrix notation is

[ 0 0 1 0 0 0 ]   [ β0 β1 β2 β3 β4 β5 ]' = [ 0 ]
[ 0 0 0 0 0 1 ]                            [ 0 ]

Let's go about testing another hypothesis first, using the Wald test; then we'll test our two simultaneous hypotheses above.

• H0 : equal slopes for all status groups
• H0 : β4 = β5 = 0

lm.f.n.s.ns <- lm(IQF ~ IQN*status, data = twins)
library(car)
Anova(aov(lm.f.n.s.ns), type=3)
## Anova Table (Type III tests)
##
## Response: IQF
##             Sum Sq Df F value  Pr(>F)
## (Intercept)     12  1    0.18    0.67
## IQN           1700  1   27.10 3.7e-05 ***
## status           9  2    0.07    0.93
## IQN:status       1  2    0.01    0.99
## Residuals     1317 21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# beta coefficients (term positions: 1, 2, 3, 4, 5, 6)
coef(lm.f.n.s.ns)
## (Intercept)         IQN     statusH     statusM IQN:statusH IQN:statusM
##     7.20461     0.94842    -9.07665    -6.38859     0.02914     0.02414

The test for the interaction above (IQN:status) has a p-value=0.9926, which indicates that common slope is reasonable. In the Wald test notation, we want to test whether those last two coefficients (term positions 5 and 6) both equal 0. Here we get the same result as the ANOVA table.

library(aod) # for wald.test()
##
## Attaching package: 'aod'
##
## The following object is masked from 'package:survival':
##
## rats
# Typically, we are interested in testing whether individual parameters or
# set of parameters are all simultaneously equal to 0s
# However, any null hypothesis values can be included in the vector coef.test.values.
coef.test.values <- rep(0, length(coef(lm.f.n.s.ns)))
wald.test(b = coef(lm.f.n.s.ns) - coef.test.values, Sigma = vcov(lm.f.n.s.ns), Terms = c(5,6))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.015, df = 2, P(> X2) = 0.99

Now to our two simultaneous hypotheses. In hypothesis 1 we are testing β2 = 0 and β5 = 0, which are the 3rd and 6th position for coefficients in our original equation (7.7). However, we need to choose the correct positions based on the coef() order, and these are positions 4 and 6. The large p-value=0.55 suggests that M and L can be described by the same regression line, same slope and intercept.

library(aod) # for wald.test()
coef.test.values <- rep(0, length(coef(lm.f.n.s.ns)))
wald.test(b = coef(lm.f.n.s.ns) - coef.test.values, Sigma = vcov(lm.f.n.s.ns), Terms = c(4,6))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55
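To make the Rβ = r notation concrete, here is a minimal sketch of the chi-squared Wald statistic, (Rb − r)' [R V R']^(−1) (Rb − r), computed from the coefficient estimates b and their estimated covariance matrix V. The helper `wald.by.hand()` is a hypothetical function written for this illustration (it is not part of the aod package), and the data are synthetic rather than the twins data.

```r
# Synthetic stand-in for the twins data (hypothetical values)
set.seed(3)
status <- factor(rep(c("L", "H", "M"), each = 9), levels = c("L", "H", "M"))
IQN <- round(rnorm(27, mean = 95, sd = 15))
IQF <- 5 + 0.95 * IQN + rnorm(27, sd = 8)
fit <- lm(IQF ~ IQN * status)   # coef() order: (Intercept), IQN, statusH, statusM,
                                #               IQN:statusH, IQN:statusM

# Hypothetical helper: Wald chi-squared test of H0: L %*% beta = h
wald.by.hand <- function(fit, L, h) {
  est <- L %*% coef(fit) - h                                   # L b - h
  W   <- as.numeric(t(est) %*% solve(L %*% vcov(fit) %*% t(L)) %*% est)
  list(X2 = W, df = nrow(L),
       p = pchisq(W, df = nrow(L), lower.tail = FALSE))
}

# Hypothesis 1 in coef() positions: statusM = 0 (position 4), IQN:statusM = 0 (position 6)
L1   <- rbind(c(0, 0, 0, 1, 0, 0),
              c(0, 0, 0, 0, 0, 1))
res1 <- wald.by.hand(fit, L1, c(0, 0))
print(unlist(res1))
```

With a single row in `L`, the statistic reduces to the square of the usual t-statistic for that coefficient, which is a convenient sanity check.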
# Another way to do this is to define the matrix R and vector r, manually.
mR <- as.matrix(rbind(c(0, 0, 0, 1, 0, 0), c(0, 0, 0, 0, 0, 1)))
mR
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    0    0
## [2,]    0    0    0    0    0    1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns), Sigma = vcov(lm.f.n.s.ns), L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 1.2, df = 2, P(> X2) = 0.55

In hypothesis 2 we are testing β1 − β2 = 0 and β4 − β5 = 0, which are the difference of the 2nd and 3rd coefficients and the difference of the 5th and 6th coefficients. However, we need to choose the correct positions based on the coef() order, and these are positions 3 and 4, and 5 and 6. The large p-value=0.91 suggests that M and H can be described by the same regression line, same slope and intercept.

mR <- as.matrix(rbind(c(0, 0, 1, -1, 0, 0), c(0, 0, 0, 0, 1, -1)))
mR
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    1   -1    0    0
## [2,]    0    0    0    0    1   -1
vR <- c(0, 0)
vR
## [1] 0 0
wald.test(b = coef(lm.f.n.s.ns), Sigma = vcov(lm.f.n.s.ns), L = mR, H0 = vR)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 0.19, df = 2, P(> X2) = 0.91

The results of these tests are not surprising, given our previous analysis where we found that the status effect is not significant for all three groups. Any simultaneous linear combination of parameters can be tested in this way.

7.5 Comments on Comparing Regression Lines

In the twins example, I defined two indicator variables (plus two interaction variables) from an ordinal categorical variable: status (H, M, L). Many researchers would assign numerical codes to the status groups and use the coding as a predictor in a regression model. For status, a "natural" coding might be to define NSTAT=0 for L, 1 for M, and 2 for H status families. This suggests building a multiple regression model with a single status variable (i.e., single df):

IQF = β0 + β1 IQN + β2 NSTAT + e.

If you consider the status classes separately, the model implies that

IQF = β0 + β1 IQN + β2(0) + e = β0 + β1 IQN + e for L status,
IQF = β0 + β1 IQN + β2(1) + e = (β0 + β2) + β1 IQN + e for M status,
IQF = β0 + β1 IQN + β2(2) + e = (β0 + 2β2) + β1 IQN + e for H status.

The model assumes that the IQF by IQN regression lines are parallel for the three groups, and are separated by a constant β2. This model is more restrictive (and less reasonable) than the ANCOVA model with equal slopes but arbitrary intercepts. Of course, this model is easier to work with because it requires keeping track of only one status variable instead of two status indicators. A plot of the population regression lines under this model, assuming β2 < 0, is shown in the accompanying figure.
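Because the numerically coded model is nested in the equal slopes ANCOVA model (it forces the M intercept to sit exactly halfway between the L and H intercepts), the two can be compared with a 1 df F-test. The sketch below uses synthetic data, not the twins data; `fit.num`, `fit.fac`, and `cmp` are names chosen for this illustration.

```r
# Synthetic stand-in for the twins data (hypothetical values)
set.seed(4)
NSTAT  <- rep(0:2, each = 9)                       # 0 = L, 1 = M, 2 = H
status <- factor(c("L", "M", "H")[NSTAT + 1], levels = c("L", "M", "H"))
IQN <- round(rnorm(27, mean = 95, sd = 15))
IQF <- 5 + 0.95 * IQN + rnorm(27, sd = 8)

fit.num <- lm(IQF ~ IQN + NSTAT)    # single status variable (1 df for status)
fit.fac <- lm(IQF ~ IQN + status)   # two status indicators (2 df for status)

# Nested comparison: tests whether the equally spaced intercepts are adequate
cmp <- anova(fit.num, fit.fac)
print(cmp)
```

A small p-value in this comparison would suggest that equal spacing of the intercepts is too restrictive for the data at hand.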
[Figure: population regression lines under the single status variable model, parallel lines for L, M, and H status; x-axis Home Twin IQ (IQN), y-axis Population Mean IQ Foster Twin (IQF).]

7.6 Three-way interaction

In this example, a three-way interaction is illustrated with two categorical variables and one continuous variable. Let a take values 0 or 1 (it's an indicator variable), b take values 0 or 1, and c be a continuous variable taking any value. Below are five models:

(1) Interactions: ab. All lines parallel, different intercepts for each (a, b) combination.
(2) Interactions: ab, ac. (a, c) combinations have parallel lines, different intercepts for each (a, b) combination.
(3) Interactions: ab, bc. (b, c) combinations have parallel lines, different intercepts for each (a, b) combination.
(4) Interactions: ab, ac, bc. All combinations may have different slope lines with different intercepts, but difference in slope between b = 0 and b = 1 is similar for each a group (and vice versa).
(5) Interactions: ab, ac, bc, abc. All combinations may have different slope lines with different intercepts.

Model   Intercepts                       Slopes for c
(1)     y = β0 + β1a + β2b + β3ab        + β4c
(2)     y = β0 + β1a + β2b + β3ab        + β4c + β5ac
(3)     y = β0 + β1a + β2b + β3ab        + β4c + β6bc
(4)     y = β0 + β1a + β2b + β3ab        + β4c + β5ac + β6bc
(5)     y = β0 + β1a + β2b + β3ab        + β4c + β5ac + β6bc + β7abc

X <- expand.grid(c(0,1),c(0,1),c(0,1))
X <- cbind(1, X)
colnames(X) <- c("one", "a", "b", "c")
X$ab <- X$a * X$b
X$ac <- X$a * X$c
X$bc <- X$b * X$c
X$abc <- X$a * X$b * X$c
X <- as.matrix(X)
X <- X[,c(1,2,3,5,4,6,7,8)] # reorder columns to be consistent with table above
vbeta <- matrix(c(3, -1, 2, 2, 5, -4, -2, 8), ncol = 1)
rownames(vbeta) <- paste("beta", 0:7, sep="")
Beta <- matrix(vbeta, nrow = dim(vbeta)[1], ncol = 5)
rownames(Beta) <- rownames(vbeta)
# Beta vector for each model
Beta[c(6,7,8), 1] <- 0
Beta[c(  7,8), 2] <- 0
Beta[c(6,  8), 3] <- 0
Beta[c(    8), 4] <- 0
colnames(Beta) <- 1:5 #paste("model", 1:5, sep="")
# Calculate response values
Y <- X %*% Beta
library(reshape2)
YX <- data.frame(cbind(melt(Y), X[,"a"], X[,"b"], X[,"c"]))
colnames(YX) <- c("obs", "Model", "Y", "a", "b", "c")
YX$a <- factor(YX$a)
YX$b <- factor(YX$b)

These are the β values used for this example.

beta0 beta1 beta2 beta3 beta4 beta5 beta6 beta7
    3    −1     2     2     5    −4    −2     8

library(ggplot2)
p <- ggplot(YX, aes(x = c, y = Y, group = a))
#p <- p + geom_point()
p <- p + geom_line(aes(linetype = a))
p <- p + labs(title = "Three-way Interaction")
p <- p + facet_grid(Model ~ b, labeller = "label_both")
print(p)

[Figure: "Three-way Interaction". Grid of panels with rows Model: 1 to Model: 5 and columns b: 0 and b: 1; each panel plots Y against c (0 to 1) with one line for a = 0 and one for a = 1.]

Chapter 8

Polynomial Regression

8.1 Polynomial Models with One Predictor

A pth order polynomial model relating a dependent variable Y to a predictor X is given by

Y = β0 + β1 X + β2 X^2 + · · · + βp X^p + ε.

This is a multiple regression model with predictors X, X^2, . . . , X^p. For p = 2, 3, 4, and 5 we have quadratic, cubic, quartic and quintic relationships, respectively.

A second order polynomial (quadratic) allows at most one local maximum or minimum (i.e., a point where trend changes direction from increasing to decreasing, or from decreasing to increasing). A third order polynomial (cubic) allows at most two local maxima or minima. In general, a pth order polynomial allows at most p − 1 local maxima or minima. The two panels below illustrate different quadratic and cubic relationships.

#### Creating polynomial plots
# R code for quadratic and cubic plots
x <- seq(-3,3,0.01);
y21 <- x^2-5;
y22 <- -(x+1)^2+3;
y31 <- (x+1)^2*(x-3);
y32 <- -(x-.2)^2*(x+.5)-10;
par(mfrow=c(1,2))
plot( x, y21, type="l", main="Quadratics", ylab="y")
points(x, y22, type="l", lty=2)
plot( x, y31, type="l", main="Cubics", ylab="y")
points(x, y32, type="l", lty=2)
X2 <. rather than the trend. Although R2 = 1. Although polynomial models allow a rich class of non-linear relationships between Y and X (by virtue of Taylor’s Theorem in calculus). . In the picture below. 2.) Intuitively.X^6 . the 10th degree polynomial is modelling the variability in the data. the extreme X-values can be highly influential. X1 <. X4 <. # observed . I show the 10th degree polynomial that fits exactly the 11 distinct data points.rnorm(11). Y <. X5 <. models should always be validated using new data. (If possible.X^2 . To illustrate the third concern. a quadratic or a lower order polynomial would likely be significantly better.204 Ch 8: Polynomial Regression −10 y −4 −20 −2 y 0 2 −5 0 Cubics 4 Quadratics −3 −2 −1 0 1 x 2 3 −3 −2 −1 0 1 2 3 x It is important to recognize that not all. for a high order polynomial to fit exactly. However. In particular. and predictions based on high order polynomial models can be woeful.X^1 . or even none.rnorm(11). consider a data set (Yi. numerical instabilities occur when fitting high order models. One can show mathematically that an (n − 1)st degree polynomial will fit the observed data exactly. In essence. . X6 <. some caution is needed when fitting polynomials. x^9 .x5.5 1.0 0.2. . x4 <. x2 <. Y. . x6 <.5.5 ● −1.5 −10000 −0. cex=1.x9.1 X9 -3396.6 X6 -81282. X p−1) depends on the scale in which we measure X.x8.x^2 .xx %*% fit$coefficients.x6. X^9 . main="High-order polynomial". cex=2) type="l".e.x7. points(x.10000)) type="l".matrix(c(rep(1.length(x)). x3 <.7 ## X5 ## -26416. lt=1) High−order polynomial (same. lt=1) main="(same. X. X^10.0 1.5 0.x^3 .5. pch=20. x5 <.1: Polynomial Models with One Predictor X7 X8 X9 X10 <<<<- 205 X^7 . x8 <.6 X7 15955.x^6 .5 X4 29149.9 X8 70539. suppose for some chemical reaction.ncol=11) y <.x^8 .5 X -461.1 X3 13030.2 X2 -620.x^4 . . x7 <.x^10. longer y-axis)".lm(Y~X+X2+X3+X4+X5+X6+X7+X8+X9+X10) fit$coefficients ## (Intercept) ## 36. ylim=c(-10000. x9 <.8. X^8 .x10).5 ● Y ● 0 ● −1. 
x1 <. points(x. y.3 ## X10 ## -18290.5 −0.x^5 . x10 <.5 0. fit <.x3.x1. y. . plot( X.x^7 .5 Y 0.5 ● −0. plot( X.. Time to reaction = β0 + β1 Temp + β2 Temp2 + ε. Y. xx <.5 1. pch=20. The significance level for the estimate of the Temp coefficient depends on whether we measure temperature in degrees Celsius or Fahrenheit.0.5 X Another concern is that the importance of lower order terms (i.0 X 0. X 2..x^1 .0 1. For example.x2. longer y−axis) 5000 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● −1.seq(-2.01).x4.1 x <. start with the model of maximum acceptable order (for example a fourth or third order polynomial) and consider deleting terms in the order X p. Pay careful attention to diagnostics. sequentially until no additional term is significant. 1 Draper and Smith 1966. If a fourthorder polynomial does not fit. Data were collected to examine this model. It has been suggested that the percent of I-8 (variable “i8”) in the base stock is an excellent predictor of the cloud point using a second order (quadratic) model: Cloud point = β0 + β1 I8 + β2 I82 + ε. X p−1. 8. regardless of what other effects are included. 2. p. . 3.1 Example: Cloud point and percent I-8 The cloud point of a liquid is a measure of the degree of crystallization in a stock.. Add or delete variables using the natural hierarchy among powers of X and include all lower order terms if a higher order term is needed. say quartic or less. 4.. For example. 162 . add terms X. . Similarly. in a forward-selection type algorithm.1. until no further terms can be omitted. The backward option sequentially eliminates the least significant effect in the model. X 2. . . but do not delete powers that were entered earlier.206 Ch 8: Polynomial Regression To avoid these problems. a transformation may provide a more succinct summary. Center the X data at X ¯ + β2(X − X) ¯ 2 + · · · + βp(X − X) ¯ p + ε. and is measured by the refractive index 1. Restrict attention to low order models. I recommend the following: ¯ and fit the model 1. 
#### Example: Cloud point
cloudpoint <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch08_cloudpoint.dat", header = TRUE)
# center i8 by subtracting the mean
cloudpoint$i8 <- cloudpoint$i8 - mean(cloudpoint$i8)

The plot of the data suggests a departure from a linear relationship.

library(ggplot2)
p <- ggplot(cloudpoint, aes(x = i8, y = cloud))
p <- p + geom_point()
p <- p + labs(title="Cloudpoint data, cloud by centered i8")
print(p)

[Figure: "Cloudpoint data, cloud by centered i8". Scatterplot of cloud against centered i8.]

Fit the simple linear regression model and plot the residuals.

lm.c.i <- lm(cloud ~ i8, data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i), type=3)
#summary(lm.c.i)

The data plot is clearly nonlinear, suggesting that a simple linear regression model is inadequate. This is confirmed by a plot of the studentized residuals against the fitted values from a simple linear regression of Cloud point on i8, and also by the residuals against the i8 values. We do not see any local maxima or minima, so a second order model is likely to be adequate. To be sure, we will first fit a cubic model, and see whether the third order term is important.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.c.i, which = c(1,4,6), pch=as.character(cloudpoint$type))
plot(cloudpoint$i8, lm.c.i$residuals, main="Residuals vs i8", pch=as.character(cloudpoint$type))
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.c.i$residuals, las = 1, id.n = 3, main="QQ Plot", pch=as.character(cloudpoint$type))
##  1 11 17
##  1 18  2
# residuals vs order of data
plot(lm.c.i$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")

[Figure: diagnostic plots for the simple linear regression of cloud on i8. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs i8, QQ Plot, Residuals vs Order of data.]

The output below shows that the cubic term improves the fit of the quadratic model (i.e., the cubic term is important when added last to the model). The plot of the studentized residuals against the fitted values does not show any extreme abnormalities. Furthermore, no individual point is poorly fitted by the model. Case 1 has the largest studentized residual: r1 = −1.997.

# I() is used to create an interpreted object treated "as is"
# so we can include quadratic and cubic terms in the formula
# without creating separate columns in the dataset of these terms
lm.c.i3 <- lm(cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
#library(car)
#Anova(aov(lm.c.i3), type=3)
summary(lm.c.i3)
## Call:
## lm(formula = cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4289 -0.1866  0.0735  0.1354  0.3933
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.87045    0.08836  326.72  < 2e-16 ***
## i8           0.84789    0.04854   17.47  6.7e-11 ***
## I(i8^2)     -0.06600    0.00732   -9.01  3.3e-07 ***
## I(i8^3)      0.00974    0.00259    3.76   0.0021 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.26 on 14 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.993
## F-statistic: 813 on 3 and 14 DF, p-value: 6.19e-16

Below are plots of the data and the studentized residuals.

# plot diagnostics
par(mfrow=c(2,3))
plot(lm.c.i3, which = c(1,4,6), pch=as.character(cloudpoint$type))
plot(cloudpoint$i8, lm.c.i3$residuals, main="Residuals vs i8", pch=as.character(cloudpoint$type))
# horizontal line at zero
abline(h = 0, col = "gray75")
# Normality of Residuals
library(car)
qqPlot(lm.c.i3$residuals, las = 1, id.n = 3, main="QQ Plot", pch=as.character(cloudpoint$type))
##  4 12  1
## 18  1  2
# residuals vs order of data
plot(lm.c.i3$residuals, main="Residuals vs Order of data")
# horizontal line at zero
abline(h = 0, col = "gray75")
[Figure: diagnostic plots for the cubic model. Panels: Residuals vs Fitted, Cook's distance, Cook's dist vs Leverage, Residuals vs i8, QQ Plot, Residuals vs Order of data.]

The first and last observations have the lowest and highest values of I8, given by 0 and 10, respectively. These cases are also the most influential points in the data set (largest Cook's D). If we delete these cases and redo the analysis we find that the cubic term is no longer important (p-value=0.55) when added after the quadratic term. One may reasonably conclude that the significance of the cubic term in the original analysis is solely due to the two extreme I8 values, and that the quadratic model appears to fit well over the smaller range of 1 ≤ I8 ≤ 9.

# remove points for minimum and maximum i8 values
cloudpoint2 <- cloudpoint[!(cloudpoint$i8 == min(cloudpoint$i8) | cloudpoint$i8 == max(cloudpoint$i8)), ]
lm.c.i2 <- lm(cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint2)
#library(car)
#Anova(aov(lm.c.i2), type=3)
summary(lm.c.i2)
## Call:
## lm(formula = cloud ~ i8 + I(i8^2) + I(i8^3), data = cloudpoint2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3662 -0.1285  0.0374  0.1403  0.3374
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.85704    0.08946  322.55  < 2e-16 ***
## i8           0.90451    0.05834   15.50    8e-09 ***
## I(i8^2)     -0.06071    0.01269   -4.78  0.00057 ***
## I(i8^3)      0.00317    0.00517    0.61  0.55220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.231 on 11 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.989
## F-statistic: 436 on 3 and 11 DF, p-value: 1.03e-11

8.2 Polynomial Models with Two Predictors

Polynomial models are sometimes fit to data collected from experiments with two or more predictors. For simplicity, consider the general quadratic model, with two predictors:

Y = β0 + β1 X1 + β2 X2 + β3 X1^2 + β4 X2^2 + β5 X1 X2 + ε.

The model, which can be justified as a second order approximation to a smooth trend, includes quadratic terms in X1 and X2 and the product or interaction of X1 and X2.

8.2.1 Example: Mooney viscosity

The data below give the Mooney viscosity at 100 degrees Celsius (Y) as a function of the filler level (X1) and the naphthenic oil (X2) level for an experiment involving filled and plasticized elastomer compounds.

#### Example: Mooney viscosity
mooney <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch08_mooney.dat", header = TRUE)
library(ggplot2)
p <- ggplot(mooney, aes(x = oil, y = mooney, label = filler))
p <- p + geom_text()
p <- p + scale_y_continuous(limits = c(0, max(mooney$mooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, mooney by oil with filler labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
library(ggplot2)
p <- ggplot(mooney, aes(x = filler, y = mooney, label = oil))
p <- p + geom_text()
p <- p + scale_y_continuous(limits = c(0, max(mooney$mooney, na.rm=TRUE)))
p <- p + labs(title="Mooney data, mooney by filler with oil labels")
print(p)
## Warning: Removed 1 rows containing missing values (geom text).
mooney by filler with oil labels 60 0 150 150 60 10 48 0 100 48 60 mooney mooney 100 36 50 24 36 10 60 48 0 10 0 10 20 10 20 40 20 40 40 50 12 24 36 0 12 0 24 12 0 48 36 24 12 0 20 0 0 10 20 20 40 40 0 0 10 20 30 40 0 20 oil 40 60 filler At each of the 4 oil levels.212 Ch 8: Polynomial Regression p <.161 3Q 2.37 9.p + labs(title="Mooney data. mooney by oil with filler labels Mooney data. mooney by filler with oil labels") print(p) ## Warning: Removed 1 rows containing missing values (geom text).o2. na.p + scale_y_continuous(limits = c(0.f2 <. m.2: Polynomial Models with Two Predictors ## ## ## ## ## ## ## ## ## ## ## ## oil -1.f2$residuals. which = c(1.3)) plot(lm.m.f2 <.f2$residuals. get the indices of non-missing ind <.8.o2.21 1.4e-09 I(oil * filler) -0. raw = TRUE).o2.character(mooney$filler[ind])) # horizontal line at zero abline(h = 0.o2. pch=as. lm.992. codes: 0 '***' 0.01 '*' 0.' 213 *** * *** *** *** 0.6e-05 filler 0. degree = 2.character(mooney$oil[ind])) ## 18 20 6 ## 1 2 23 ## residuals vs order of data #plot(lm.86 0.f2£residuals. id. pch=as. pch=as.02732 0. lm.m. pch=as.o2. filler. col = "gray75") . las = 1.Adjusted R-squared: 0.43698 0.o2.03866 0.4.13 8.numeric(names(lm.f2) # plot diagnistics par(mfrow=c(2.21353 -5. mooney£filler. col = "gray75") plot(mooney$filler[ind].m.5e-10 --Signif.03361 0. main="Residuals vs oil with filler labels".34 2.94 on 17 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0.1 ' ' 1 Residual standard error: 3.00241 11.lm(mooney ~ poly(oil.character(mooney$oil[ind])) # horizontal line at zero abline(h = 0. main="Residuals vs Order of data") # # horizontal line at zero # abline(h = 0.m.character(mooney$oil)) # because of one missing value. 10) #head(poly(mooney£oil. 10) ## This model is equivalent to the one above #lm.27144 0. data = mooney) #summary(lm.m.n = 3. degree = 2.15266 2.6).95 1. raw = TRUE). 
p-value: <2e-16 ## poly() will evaluate variables and give joint polynomial values ## which is helpful when you have many predictors #head(mooney.989 F-statistic: 405 on 5 and 17 DF.00466 7.m.5e-06 I(filler^2) 0.001 '**' 0.011 I(oil^2) 0.as.f2$residuals.05 '.f2$residuals)) plot(mooney$oil[ind].00319 -12. main="Residuals vs filler with oil labels". col = "gray75") # Normality of Residuals library(car) qqPlot(lm.o2.o2. main="QQ Plot".m.f2.o2. max(mooney£logmooney.4 1 0 2 6 0. aes(x = filler.p + geom_text() #p <.p + labs(title="Mooney data.0 0. log(mooney) by filler with oil labels") print(p) ## Warning: Removed 1 rows containing missing values (geom text).p + scale_y_continuous(limits = c(0.ggplot(mooney.f2$residuals 4 0 0 2 0 3 4 6 −2 2 2 1 0 0.f2$residuals 0 1 60 2 420 20 0 1. If we make this transformation and replot the data.8 Residuals vs Fitted 60 −6 0 0 18 −2 20 −1 0 1 2 norm quantiles Example: Mooney viscosity on log scale As noted earlier.ggplot(mooney. library(ggplot2) p <. aes(x = oil.0 120 140 5 10 15 20 0.f2$residuals −6 0. The plots of the transformed data suggest that a simpler model will be appropriate.p + labs(title="Mooney data.6 0.4 Residuals vs filler with oil labels QQ Plot 3 3 0 −6 4 1 6 0 10 20 30 40 mooney$oil[ind] 8. log(mooney) by oil with filler labels") print(p) ## Warning: Removed 1 rows containing missing values (geom text).5 2 20 Cook's distance 6 4 2 Cook's dist vs Leverage hii (1 − hii) Cook's distance 0. but is a quadratic function of oil at each filler level.5 0.3 0.6 2 4 0.m. y = logmooney. y = logmooney.5 0 1 2 −4 0 3 4 2 1 2 −6 1 6 1 4 0 Residuals vs oil with filler labels lm.2 2 0. na.5 Leverage hii 4 2 182 204 218 Obs. For example. transformations can often be used instead of polynomials.m.rm=TRUE))) p <.rm=TRUE))) p <. # log transform the response mooney$logmooney <. 
number 0 −2 −4 18 Fitted values 6 lm.p + geom_text() #p <.214 Ch 8: Polynomial Regression 40 60 80 100 0.p + scale_y_continuous(limits = c(0.2 0 −2 −4 Residuals 1 1 0 0. na.m.4 1 2 4 111 22 00400 14 0. we see that the log Mooney viscosity is roughly linearly related to the filler level at each oil level. the original data plots suggest transforming the Moody viscosity to a log scale.o2.2 1 04 2 4 Cook's distance 2 1 4 60 1 2.2 4 2 4 1 4 4 4 1 1 0 2 2 1 0 0 0 10 30 40 50 mooney$filler[ind] 64 2 2 2 2 2 22 0 1 1 −2 1 1 1 0 1 0 0 −4 0 2 20 4 4 2 4 0 4 4 4 lm. max(mooney£logmooney.o2.2.o2. label = oil)) p <. label = filler)) p <.log(mooney$mooney) library(ggplot2) p <. .1 0. we fit the full quadratic model. log(mooney) by oil with filler labels 5.5 0 10 40 20 3.17 I(oil * filler) -4.2e-10 *** I(oil^2) 4.90e-03 -13. The interaction term can be omitted here.03064 Max 0.0 20 10 0 40 0 10 20 10 20 40 3.0535 on 17 degrees of freedom (1 observation deleted due to missingness) .86e-02 2.01 '*' 0.f2) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = logmooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler).97 < 2e-16 *** oil -3.1 ' ' 1 Residual standard error: 0. codes: 0 '***' 0. data = mooney) summary(lm.34 --Signif.lm(logmooney ~ oil + filler + I(oil^2) + I(filler^2) + I(oil * filler).2: Polynomial Models with Two Predictors Mooney data.5 0 12 24 36 24 12 3.5 60 10 0 24 12 36 48 48 24 36 3.28e-05 1. log(mooney) by filler with oil labels 60 0 5.f2 <. The p-value for the interaction term in the quadratic model is 0.0 215 Mooney data. data = mooney) Residuals: Min 1Q -0.07726 -0. without much loss of predictive ability (R-squared is similar).00919 3Q 0.78 1.001 '**' 0.56e-02 90.0 20 40 2.34.0e-06 *** I(filler^2) 4.o2.23e-04 6.5 30 40 20 0 20 oil 40 60 filler To see that a simpler model is appropriate.42 0.92e-02 2.5 36 48 4.07564 Coefficients: Estimate Std.66e-05 3.51 1.lm.98 0.lm.05 '.0 0 0 10 12 0 2.33e-05 -0. 
# I create each term separately lm.23e-05 4.03580 Median 0.24e+00 3.' 0.8.o2.08e-03 13.0 60 10 48 0 4.0 20 4. Error t value Pr(>|t|) (Intercept) 3.5 40 logmooney logmooney 60 4.67 4.6e-10 *** filler 2.34e-05 6. 00 1 −0.3 4 1 0.o2.lm. pch=as.3 0.05 10 20 1 0.f <.o2.f2$residuals.1 0. col = "gray75") # Normality of Residuals library(car) qqPlot(lm.4. each of the remaining effects is significant.5 0. number lm.05 0. main="Residuals vs filler with oil labels".05 Residuals 0.lm.f2$residuals 0.o2. data = mooney) summary(lm.00 30 40 50 60 12 2 2 2 2 11 0 1 1 0 0 0 0 mooney$filler[ind] 12 4 2 4 0 2 4 0 21 4 4 4 −0.f) .character(mooney$filler[ind])) # horizontal line at zero abline(h = 0.f2$residuals 1 1 0.f2$residuals. which = c(1.o2.3)) plot(lm. lm.o2.character(mooney$oil)) # because of one missing value. the quadratic effect in filler is not needed in the model (output not given). col = "gray75") plot(mooney$filler[ind].05 lm. col = "gray75") Residuals vs Fitted 3. Once these two effects are removed.4 0 0 Cook's distance 0.4 Residuals vs oil with filler labels Residuals vs filler with oil labels QQ Plot 6 6 4 0 6 10 0 3 20 mooney$oil[ind] 0 0 2 4 1 1 2 4 0 2 0 40 1 2 0.as.2 0 0.6). las = 1.0 422 3.lm.character(mooney$oil[ind])) ## 22 12 21 ## 1 23 22 ## residuals vs order of data #plot(lm. lm.lm.5 1 lm.0 1 3 60 Fitted values 6 4 0 21 13 0.numeric(names(lm.o2. pch=as.0 1 1 0.f2$residuals 5.0 4.3 0. p-value: <2e-16 # plot diagnistics par(mfrow=c(2.f2$residuals)) plot(mooney$oil[ind].2 1 0.05 421 Cook's dist vs Leverage hii (1 − hii) Cook's distance 6 22 −2 0 −1 0 1 2 norm quantiles After omitting the interaction term.1 1 0 Cook's distance 0 2 4 2 1. main="Residuals vs oil with filler labels".05 2 4 0 Leverage hii 4 2 0 4 Obs. id.lm. pch=as.4 121 1 2 −0.0 0 2 0.994 ## F-statistic: 737 on 5 and 17 DF.995.2 1 421 213 4 1 2 0 1 00 4 111222 0 2 20 0. pch=as.05 1 2 3 4 0.lm. get the indices of non-missing ind <.o2.lm.lm.o2.n = 3.f2£residuals. # I create each term separately lm. 
main="Residuals vs Order of data") # # horizontal line at zero # abline(h = 0.05 3 30 4 2 1 0 0.lm(logmooney ~ oil + filler + I(oil^2).o2.f2$residuals. main="QQ Plot".00 2 2 4 2 0 0.lm.1 0.5 10 15 0.o2.Adjusted R-squared: 0.00 5 4 2 −0.216 Ch 8: Polynomial Regression ## Multiple R-squared: 0.character(mooney$oil[ind])) # horizontal line at zero abline(h = 0.o2.f2.lm.5 0.5 4.lm. o2. which = c(1.01 '*' 0. col = "gray75") # Normality of Residuals library(car) qqPlot(lm.36e-05 6. col = "gray75") plot(mooney$filler[ind].09080 -0.f$residuals.05 '.o2. p-value: <2e-16 # plot diagnistics par(mfrow=c(2. pch=as.995.as.6).8. col = "gray75") .o2. main="QQ Plot". data = mooney) Residuals: Min 1Q Median -0.' 0.02e-02 2.character(mooney$oil[ind])) # horizontal line at zero abline(h = 0.3)) plot(lm.994 F-statistic: 1.lm.o2. id.2e+03 on 3 and 19 DF.lm.lm.03253 Max 0.character(mooney$oil[ind])) ## 12 22 16 ## 23 1 2 ## residuals vs order of data #plot(lm. main="Residuals vs oil with filler labels".lm.23e+00 2.f.3e-12 *** filler 3.70e-03 -14.lm.73e-02 118.f$residuals.00883 3Q 0.10059 Coefficients: Estimate Std.001 '**' 0. get the indices of non-missing ind <. main="Residuals vs filler with oil labels".5e-06 *** --Signif.o2.n = 3.03111 -0. pch=as. main="Residuals vs Order of data") # # horizontal line at zero # abline(h = 0. Error t value Pr(>|t|) (Intercept) 3.f$residuals.4. pch=as.f$residuals)) plot(mooney$oil[ind].45 3.numeric(names(lm. codes: 0 '***' 0.2: Polynomial Models with Two Predictors ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 217 Call: lm(formula = logmooney ~ oil + filler + I(oil^2).72e-04 53.character(mooney$filler[ind])) # horizontal line at zero abline(h = 0. las = 1.Adjusted R-squared: 0.99 < 2e-16 *** I(oil^2) 4. lm.89 6.character(mooney$oil)) # because of one missing value.o2.1 ' ' 1 Residual standard error: 0.0542 on 19 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0. 
lm.lm.f£residuals.10e-04 6.14 < 2e-16 *** oil -4.09e-02 5. pch=as. we might wish to know what combination of oil level between 0 and 40 and filler level between 0 and 60 provides the lowest predicted Mooney viscosity (on the original or log scale). but you may not agree.5 0 1 1 20 0.00 1 2 3 1 6 1 4 −0.10 Residuals vs filler with oil labels 0 4 2 4 3 10 20 mooney$oil[ind] 0.20 0 4 2.lm.15 0.o2.5 216 0. but one can do a more careful job of analysis using standard tools from calculus.00 −0.5 5.0 3.f$residuals 0.20 0.f$residuals 0. agrees with our visual assessment of the data.15 0 1 0.05 0 2 4 1 2 1 1 0 2 0 4 2 2 4 40 0 10 20 30 4 2 4 22 40 50 60 2 2 2 0.0 5 0.f$residuals 0 6 0 2 1 4 0. over the range of predictor variable values where the model is reasonable.10 1 0 Cook's distance 0. the predicted log Moody viscosity is given by \ log(Moody viscosity) = 3.218 Ch 8: Polynomial Regression 0 0 422 3. Quadratic models with two or more predictors are often used in industrial experiments to estimate the optimal combination of predictor values to maximize or minimize the response.10 Residuals vs Fitted 22 −2 0 16 −1 0 1 2 norm quantiles The model does not appear to have inadequacies.05 2 1 0 0 22 0.05 −0. (This strategy is called “response surface methodology”.o2.05 2 22 0.10 1 4 0.10 0.o2.15 QQ Plot 0.25 0.5 2 0.2297 − 0.00 −0.0309 Filler. with linear effects due to the oil and filler levels and a quadratic effect due to the oil level.2 Residuals vs oil with filler labels 2 0 15 4 2 Leverage hii 4 3 6 4 214 Obs.00 11 0 0 1 1 1 1 0 0 0 mooney$filler[ind] 4 0. This is satisfying to me.05 2 1 0.05 0.0004 Oil2 + 0.05 0 3 30 0 0 12 4 4 lm.00 1 2 2 1 21 Cook's distance 4 1 2 2 0. Assuming no inadequacies.lm.lm. an important effect of the transformation is that the resulting model is simpler than the model selected on the original scale. . Assuming no difficulties. number 6 1 0 10 1112224 Fitted values lm.) For example.5 0.1 4 0 0 2 1 2 0 0.0402 Oil + 0.05 lm. 
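As a rough illustration of the calculus approach just mentioned, the sketch below minimizes the fitted log-scale equation over the stated ranges (oil from 0 to 40, filler from 0 to 60). It is only a sketch under assumptions: it uses the rounded coefficients printed in the text rather than the stored model object, and the helper name pred_log_mooney is made up for this example.

```r
# Fitted log-scale equation, using the rounded coefficients quoted in the text.
# pred_log_mooney is a hypothetical helper name, not part of the original notes.
pred_log_mooney <- function(oil, filler) {
  3.2297 - 0.0402 * oil + 0.0004 * oil^2 + 0.0309 * filler
}

# Calculus step: the derivative with respect to oil is
#   -0.0402 + 2 * 0.0004 * oil,
# which equals zero at the stationary point computed here.
oil_stationary <- 0.0402 / (2 * 0.0004)
oil_stationary  # compare with the studied oil range of 0 to 40

# The filler coefficient is positive, so the prediction increases with filler.
# A grid search over the stated ranges locates the constrained minimizer.
grid <- expand.grid(oil = seq(0, 40, by = 1), filler = seq(0, 60, by = 1))
grid$pred <- pred_log_mooney(grid$oil, grid$filler)
best <- grid[which.min(grid$pred), ]
best
```

With these rounded coefficients the stationary point in oil falls beyond the upper end of the oil range, so the minimum over the region sits on its boundary. Whether a boundary point is a sensible recommendation depends on whether the model is reasonable there, which should be judged against the design points actually observed.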
Chapter 9
Discussion of Response Models with Factors and Predictors

We have considered simple models for designed experiments and observational studies where a response variable is modeled as a linear combination of effects due to factors or predictors, or both. With designed experiments, where only qualitative factors are considered, we get a "pure ANOVA" model. For example, in the experiment comparing survival times of beetles, the potential effects of insecticide (with levels A, B, C, and D) and dose (with levels 1=low, 2=medium, and 3=high) are included in the model as factors because these variables are qualitative. The natural model to consider is a two-way ANOVA with effects for dose and insecticide and a dose-by-insecticide interaction. If, however, the dose given to each beetle was recorded on a measurement scale, then the dosages can be used to define a predictor variable which can be used as a "regression effect" in a model. That is, the dose or some function of dose can be used as a (quantitative) predictor instead of as a qualitative effect.

For simplicity, assume that the doses are 10, 20, and 30, but the actual levels are irrelevant to the discussion. The simple additive model, or ANCOVA model, assumes that there is a linear relationship between mean survival time and dose, with different intercepts for the four insecticides. If the data set includes the survival time (times) for each beetle, the insecticide (insect: an alphanumeric variable, with values A, B, C, and D), and dose, you would fit the ANCOVA model this way

beetles$insect <- factor(beetles$insect)
lm.t.i.d <- lm(times ~ insect + dose, data = beetles)

A more complex model that allows separate regression lines for each insecticide is specified as follows:

beetles$insect <- factor(beetles$insect)
lm.t.i.d.id <- lm(times ~ insect + dose + insect:dose, data = beetles)

It is important to recognize that the factor() statement defines which variables in the model are treated as factors. Each effect of Factor data type is treated as a factor. Effects in the model statement that are numeric data types are treated as predictors. To treat a measurement variable as a factor (with one level for each distinct observed value of the variable) instead of a predictor, convert that variable type to a factor using factor(). Thus, in the survival time experiment, these models

beetles$insect <- factor(beetles$insect)
beetles$dose <- factor(beetles$dose)
# call this (A) for additive
lm.t.i.d <- lm(times ~ insect + dose, data = beetles)
# call this (I) for interaction
lm.t.i.d.id <- lm(times ~ insect + dose + insect:dose, data = beetles)

give the analysis for a two-way ANOVA model without interaction and with interaction, respectively, where both dose and insecticide are treated as factors (since dose and insect are both converted to factors), even though we just defined dose on a measurement scale!

Is there a basic connection between the ANCOVA and separate regression line models for dose and two-way ANOVA models where dose and insecticide are treated as factors? Yes. I mentioned a connection when discussing ANCOVA and I will try now to make the connection more explicit.

For the moment, let us simplify the discussion and assume that only one insecticide was used at three dose levels. The LS estimates of the mean responses from the quadratic model

$$\text{Times} = \beta_0 + \beta_1\,\text{Dose} + \beta_2\,\text{Dose}^2 + \varepsilon$$

are the observed average survival times at the three dose levels. The LS curve goes through the mean survival time at each dose, as illustrated in the picture below. If we treat dose as a factor, and fit the one-way ANOVA model

Times = Grand Mean + Dose Effect + Residual,

then the LS estimates of the population mean survival times are the observed mean survival times. The two models are mathematically equivalent, but the parameters have different interpretations. In essence, the one-way ANOVA model places no restrictions on the values of the population means (no a priori relation between them) at the three doses, and neither does the quadratic model! (WHY?)

[Figure: times vs dose (10 to 30), with points for Insecticide A, B, C, D and the Mean.]

In a one-way ANOVA, the standard hypothesis of interest is that the dose effects are zero. This can be tested using the one-way ANOVA F-test, or by testing $H_0: \beta_1 = \beta_2 = 0$ in the quadratic model. With three dosages, the absence of a linear or quadratic effect implies that all the population mean survival times must be equal. An advantage of the polynomial model over the one-way ANOVA is that it provides an easy way to quantify how dose impacts the mean survival, and a convenient way to check whether a simple description such as a simple linear regression model is adequate to describe the effect.

More generally, if dose has $p$ levels, then the one-way ANOVA model

Times = Grand Mean + Dose Effect + Residual,

is equivalent to the $(p-1)$st degree polynomial

$$\text{Times} = \beta_0 + \beta_1\,\text{Dose} + \beta_2\,\text{Dose}^2 + \cdots + \beta_{p-1}\,\text{Dose}^{(p-1)} + \varepsilon$$

and the one-way ANOVA F-test for no treatment effects is equivalent to testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$ in this polynomial.

Returning to the original experiment with 4 insecticides and 3 doses, I can show the following two equivalences.

First, the two-way additive ANOVA model, with insecticide and dose as factors, i.e., model (A), is mathematically equivalent to an additive model with insecticide as a factor, and a quadratic effect in dose:

beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2 <- lm(times ~ insect + dose + I(dose^2), data = beetles)

Thinking of dose² as a quadratic term in dose, rather than as an interaction, this model has an additive insecticide effect, but the dose effect is not differentiated across insecticides. That is, the model assumes that the quadratic curves for the four insecticides differ only in level (i.e., different intercepts) and that the coefficients for the dose and dose² effects are identical across insecticides. This is an additive model, because the population means plot has parallel profiles. A possible pictorial representation of this model is given below.

[Figure: times vs dose (10 to 30), parallel quadratic profiles for Insecticide A, B, C, D.]

Second, the two-way ANOVA interaction model, with insecticide and dose as factors, i.e., model (I), is mathematically equivalent to an interaction model with insecticide as a factor, and a quadratic effect in dose.

beetles$insect <- factor(beetles$insect)
lm.t.i.d.d2.id.id2 <- lm(times ~ insect + dose + I(dose^2)
                         + insect:dose + insect:I(dose^2), data = beetles)

This model fits separate quadratic relationships for each of the four insecticides, by including interactions between insecticides and the linear and quadratic terms in dose. Because dose has three levels, this model places no restrictions on the mean responses.
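The two equivalences can be checked numerically by comparing fitted values. The sketch below does this on simulated data, since the beetles data set is not reproduced here; the data frame sim, its generating coefficients, and the object names are all assumptions made for illustration only.

```r
# Simulated stand-in for the beetle experiment (NOT the real data):
# 4 insecticides x 3 doses x 4 replicates, with arbitrary made-up effects.
set.seed(1)
sim <- expand.grid(insect = c("A", "B", "C", "D"),
                   dose   = c(10, 20, 30),
                   rep    = 1:4)
sim$insect <- factor(sim$insect)
sim$times  <- 5 - 0.05 * sim$dose + 0.3 * as.numeric(sim$insect) +
              rnorm(nrow(sim), sd = 0.5)

# Equivalence 1: additive two-way ANOVA vs insecticide factor + quadratic in dose
fit_A_factor <- lm(times ~ insect + factor(dose), data = sim)
fit_A_quad   <- lm(times ~ insect + dose + I(dose^2), data = sim)
additive_same <- all.equal(fitted(fit_A_factor), fitted(fit_A_quad))

# Equivalence 2: two-way ANOVA with interaction vs separate quadratics per insecticide
fit_I_factor <- lm(times ~ insect + factor(dose) + insect:factor(dose), data = sim)
fit_I_quad   <- lm(times ~ insect + dose + I(dose^2) +
                     insect:dose + insect:I(dose^2), data = sim)
interaction_same <- all.equal(fitted(fit_I_factor), fitted(fit_I_quad))

additive_same
interaction_same

# Each pair also uses the same number of regression coefficients
c(length(coef(fit_A_factor)), length(coef(fit_A_quad)))
c(length(coef(fit_I_factor)), length(coef(fit_I_quad)))
```

Each pair of models gives the same fitted values (up to numerical tolerance) and uses the same number of parameters; only the parameterization, and hence the interpretation of the individual coefficients, differs. The match relies on dose having exactly three distinct values, so that a quadratic can pass through any three means.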
To summarize, we have established that

• The additive two-way ANOVA model with insecticide and dose as factors is mathematically identical to an additive model with an insecticide factor and a quadratic effect in dose. The ANCOVA model with a linear effect in dose is a special case of these models, where the quadratic effect is omitted.

• The two-way ANOVA interaction model with insecticide and dose as factors is mathematically identical to a model with an insecticide factor, a quadratic effect in dose, and interactions between the insecticide and the linear and quadratic dose effects. The separate regression lines model with a linear effect in dose is a special case of these models, where the quadratic dose effect and the interaction of the quadratic term with insecticide are omitted.

Recall that response models with factors and predictors as effects can be fit using the lm() procedure, but each factor or interaction involving a factor must be represented in the model using indicator variables or product terms. The number of required indicators or product effects is one less than the number of distinct levels of the factor. For example, to fit the model with "parallel" quadratic curves in dose, you can define (in the data.frame()) three indicator variables for the insecticide effect, say $I_1$, $I_2$, and $I_3$, and fit the model

$$\text{Times} = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \beta_3 I_3 + \beta_4\,\text{Dose} + \beta_5\,\text{Dose}^2 + \varepsilon.$$

For the "quadratic interaction model", you must define 6 interaction or product terms between the 3 indicators and the 2 dose terms:

$$\begin{aligned}
\text{Times} = {} & \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \beta_3 I_3 + \beta_4\,\text{Dose} + \beta_5\,\text{Dose}^2 \\
& + \beta_6 I_1\,\text{Dose} + \beta_7 I_2\,\text{Dose} + \beta_8 I_3\,\text{Dose} \\
& + \beta_9 I_1\,\text{Dose}^2 + \beta_{10} I_2\,\text{Dose}^2 + \beta_{11} I_3\,\text{Dose}^2 + \varepsilon.
\end{aligned}$$

The $(\beta_6 I_1\,\text{Dose} + \beta_7 I_2\,\text{Dose} + \beta_8 I_3\,\text{Dose})$ component in the model formally corresponds to the insect∗dose interaction, whereas the $(\beta_9 I_1\,\text{Dose}^2 + \beta_{10} I_2\,\text{Dose}^2 + \beta_{11} I_3\,\text{Dose}^2)$ component is equivalent to the insect∗dose∗dose interaction (i.e., testing $H_0: \beta_9 = \beta_{10} = \beta_{11} = 0$).

This discussion is not intended to confuse, but rather to impress upon you the intimate connection between regression and ANOVA, and to convince you of the care that is needed when modelling variation even in simple studies. Researchers are usually faced with more complex modelling problems than we have examined, where many variables might influence the response. If experimentation is possible, a scientist will often control the levels of variables that influence the response but that are not of primary interest. This can result in a manageable experiment with, say, four or fewer qualitative or quantitative variables that are systematically varied in a scientifically meaningful way. In observational studies, where experimentation is not possible, the scientist builds models to assess the effects of interest on the response, adjusting the response for all the uncontrolled variables that might be important. The uncontrolled variables are usually a mixture of factors and predictors. Ideally, the scientist knows what variables to control in an experiment and which to vary, and what variables are important to collect in an observational study.

The level of complexity that I am describing here might be intimidating, but certain basic principles can be applied to many of the studies you will see. Graduate students in statistics often take several courses (5+) in experimental design, regression analysis, and linear model theory to master the breadth of models, and the subtleties of modelling, that are needed to be a good data analyst. I can only scratch the surface here. I will discuss a reasonably complex study having multiple factors and multiple predictors. The example focuses on strategies for building models, with little attempt to do careful diagnostic analyses. Hopefully, the example will give you an appreciation for statistical modelling, but please be careful: these tools are dangerous!

9.1 Some Comments on Building Models

A primary goal in many statistical analyses is to build a model or models to understand the variation in a response. Fortunately, or unfortunately, there is no consensus on how this should be done. Ideally, theory would suggest models to compare, but in many studies the goal is to provide an initial model that will be refined, validated, or refuted, by further experimentation. An extreme view is that the selected model(s) should only include effects that are "statistically important", whereas another extreme suggests that all effects that might be "scientifically important" should be included.

A difficulty with implementing either approach is that importance is relative to specific goals (i.e., Why are you building the model and what do you plan to use the model for? Is the model a prescription or device to make predictions? Is the model a tool to understand the effect that one or more variables have on a response, after adjusting for uninteresting, but important effects that can not be controlled? etc.) Madigan and Raftery, in the 1994 edition of The Journal of the American Statistical Association, comment that "Science is an iterative process in which competing models of reality are compared on the basis of how well they predict what is observed; models that predict much less well than their competitors are discarded." They argue that models should be selected using Occam's razor, a widely accepted norm in scientific investigations whereby the simplest plausible model among all reasonable models, given the data, is preferred. Madigan and Raftery's ideas are fairly consistent with the first extreme, but can be implemented in a variety of ways, depending on how you measure prediction adequacy. They propose a Bayesian approach, based on model averaging and prior beliefs on the plausibility of different models. An alternative method using Mallows' Cp criterion will be discussed later.

A simple compromise between the two extremes might be to start the model building process with the most complex model that is scientifically reasonable, but still interpretable, and systematically eliminate effects using backward elimination. The initial or maximal model might include polynomial effects for predictors, main effects and interactions (2 factor, 3 factor, etc.) between factors, and products or interactions between predictors and factors. This approach might appear to be less than ideal because the importance of effects is assessed using hypothesis tests and no attempt is made to assess the effect of changes on predictions. However, one can show that the average squared error in predictions is essentially reduced by eliminating insignificant regression effects from the model, so this approach seems tenable.

It might be sensible to only assess significance of effects specified in the model statement. However, many of these effects consist of several degrees-of-freedom. That is, the effect corresponds to several regression coefficients in the model. (Refer to the discussion following the displayed indicator-variable equations above.) The individual regression variables that comprise an effect could also be tested individually. However, if the effect is a factor (with 3+ levels) or an interaction involving a factor, then the interpretation of tests on individual regression coefficients depends on the level of the factor that was selected to be the baseline category. The Type III F-test on the entire effect does not depend on the baseline category. In essence, two researchers can start with different representations of the same mathematical model (i.e., the parameters are defined differently for different choices of baseline categories), use the same algorithm for selecting a model, yet come to different final models for the data.

Statisticians often follow the hierarchy principle, which states that a lower order term (be it a factor or a predictor) may be considered for exclusion from a model only if no higher order effects that include the term are present in the model. For example, given an initial model with effects A, B, C, and the A∗B interaction, the only candidates for omission at the first step are C and A∗B. If you follow the hierarchy principle, and test an entire effect rather than test the single degree-of-freedom components that comprise an effect, then the difficulty described above can not occur. The hierarchy principle is most appealing with pure ANOVA models (such as the three-factor model in the example below), where all the regression variables are indicators. In ANOVA models, the ANOVA effects are of interest because they imply certain structure on the means. The individual regression variables that define the effects are not usually a primary interest.

A non-hierarchical backward elimination algorithm where single degree-of-freedom effects are eliminated independently of the other effects in the model is implemented in the step() procedure. Recall our discussion of backwards elimination from Chapter 3 earlier this semester.

9.2 Example: The Effect of Sex and Rank on Faculty Salary

The data in this example were collected from the personnel files of faculty at a small college in the 1970s. The data were collected to assess whether women were being discriminated against (consciously or unconsciously) in salary. The sample consists of tenured and tenure-stream faculty only. Temporary faculty were excluded from consideration (because they were already being discriminated against).

The variables below are id (individual identification numbers from 1 to 52), sex (coded 1 for female and 0 for male), rank (coded 1 for Asst. Professor, 2 for Assoc. Professor and 3 for Full Professor), year (number of years in current rank), degree (coded 1 for Doctorate, 0 else), yd (number of years since highest degree was earned), and salary (academic year salary in dollars).

#### Example: Faculty salary
faculty <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch09_faculty.dat",
                      header = TRUE)
head(faculty)
##   id sex rank year degree yd salary
## 1  1   0    3   25      1 35  36350
## 2  2   0    3   13      1 22  35350
## 3  3   0    3   10      1 23  28200
## 4  4   1    3    7      1 27  26775
## 5  5   0    3   19      0 30  33696
## 6  6   0    3   16      1 21  28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
##  $ id    : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ sex   : int 0 0 0 1 0 0 1 0 0 0 ...
##  $ rank  : int 3 3 3 3 3 3 3 3 3 3 ...
##  $ year  : int 25 13 10 7 19 16 0 16 13 13 ...
##  $ degree: int 1 1 1 1 0 1 0 1 0 0 ...
##  $ yd    : int 35 22 23 27 30 21 32 18 30 31 ...
##  $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...

faculty$sex <- factor(faculty$sex, labels=c("Male", "Female"))
# ordering the rank variable so Full is the baseline, then descending.
faculty$rank <- factor(faculty$rank, levels=c(3,2,1), labels=c("Full", "Assoc", "Asst"))
faculty$degree <- factor(faculty$degree, labels=c("Other", "Doctorate"))
head(faculty)
##   id    sex rank year    degree yd salary
## 1  1   Male Full   25 Doctorate 35  36350
## 2  2   Male Full   13 Doctorate 22  35350
## 3  3   Male Full   10 Doctorate 23  28200
## 4  4 Female Full    7 Doctorate 27  26775
## 5  5   Male Full   19     Other 30  33696
## 6  6   Male Full   16 Doctorate 21  28516
str(faculty)
## 'data.frame': 52 obs. of 7 variables:
##  $ id    : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
##  $ rank  : Factor w/ 3 levels "Full","Assoc",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year  : int 25 13 10 7 19 16 0 16 13 13 ...
##  $ degree: Factor w/ 2 levels "Other","Doctorate": 2 2 2 2 1 2 1 2 1 1 ...
##  $ yd    : int 35 22 23 27 30 21 32 18 30 31 ...
##  $ salary: int 36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...

The data includes two potential predictors of salary (year and yd), and three factors (sex, rank, and degree). A primary statistical interest is whether males and females are compensated equally, on average, after adjusting salary for rank, years in rank, and the other given effects. Furthermore, we wish to know whether an effect due to sex is the same for each rank, or not.

Before answering these questions, let us look at the data. I will initially focus on the effect of the individual factors (sex, rank, and degree) on salary. A series of box-plots is given below. Looking at the boxplots, notice that women tend to earn less than men, that faculty with Doctorates tend to earn more than those without Doctorates (median), and that salary tends to increase with rank.

# plot marginal boxplots

# Plot the data using ggplot
library(ggplot2)
p1 <- ggplot(faculty, aes(x = sex, y = salary, group = sex))
# plot a reference line for the global mean (assuming no groups)
p1 <- p1 + geom_hline(aes(yintercept = mean(salary)),
                      colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p1 <- p1 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p1 <- p1 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p1 <- p1 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6, alpha = 0.5)
# confidence limits based on normal distribution
p1 <- p1 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                        width = .2, alpha = 0.8)
p1 <- p1 + labs(title = "Salary by sex")

# Plot the data using ggplot
library(ggplot2)
p2 <- ggplot(faculty, aes(x = degree, y = salary, group = degree))
# plot a reference line for the global mean (assuming no groups)
p2 <- p2 + geom_hline(aes(yintercept = mean(salary)),
                      colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p2 <- p2 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p2 <- p2 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6, alpha = 0.5)
# confidence limits based on normal distribution
p2 <- p2 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                        width = .2, alpha = 0.8)
p2 <- p2 + labs(title = "Salary by degree")

# Plot the data using ggplot
library(ggplot2)
p3 <- ggplot(faculty, aes(x = rank, y = salary, group = rank))
# plot a reference line for the global mean (assuming no groups)
p3 <- p3 + geom_hline(aes(yintercept = mean(salary)),
                      colour = "black", linetype = "dashed", size = 0.3, alpha = 0.5)
# boxplot, size=.75 to stand out behind CI
p3 <- p3 + geom_boxplot(size = 0.75, alpha = 0.5)
# points for observed data
p3 <- p3 + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.5)
# diamond at mean for each group
p3 <- p3 + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6, alpha = 0.5)
# confidence limits based on normal distribution
p3 <- p3 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                        width = .2, alpha = 0.8)
p3 <- p3 + labs(title = "Salary by rank")

library(gridExtra)
grid.arrange(p1, p2, p3, nrow = 1)

[Figure: three panels of boxplots of salary. "Salary by sex" (Male, Female); "Salary by degree" (Other, Doctorate); "Salary by rank" (Full, Assoc, Asst).]

9.2.1 A Three-Way ANOVA on Salary Data

Hopefully, our earlier analyses have cured you of the desire to claim that a sex effect exists before considering whether the differences between male and female salaries might be due to other factors. The output below gives the sample sizes, means, and standard deviations for the 11 combinations of sex, rank, and degree. Side-by-side boxplots of the salaries for the 11 combinations are also provided.

with a higher percentage of men in the more advanced ranks. p + scale_y_continuous(limits = c(0. p + labs(title = "Salary by rank. ggplot(faculty. alpha = 0.
and sex") print(p) ## ymax not defined: ## ymax not defined: ## ymax not defined: adjusting position using y instead adjusting position using y instead adjusting position using y instead . There is a big difference in the ranks of men and women.5) # boxplot.25 for thin lines p <. ~ rank) p <. size=. alpha = 0.df$salary) = sd(.frame(n . m . Looking at the summaries.25. degree. .75 puts dots up center of boxplots p <.p + geom_hline(aes(yintercept = mean(salary)). size = 0.(sex. One combination of the three factors was not observed: female Associate Professors without Doctorates.75) # 0.df$salary) sex rank degree n m s 1 Male Full Other 4 30712 4242 2 Male Full Doctorate 12 29593 3480 3 Male Assoc Other 7 23585 1733 4 Male Assoc Doctorate 5 23246 2120 5 Male Asst Other 3 20296 3017 6 Male Asst Doctorate 7 16901 729 7 Female Full Other 1 24900 NA 8 Female Full Doctorate 3 30107 6904 9 Female Assoc Other 2 21570 1245 10 Female Asst Other 1 21600 NA 11 Female Asst Doctorate 7 17006 1835 # plot marginal boxplots library(ggplot2) # create position dodge offset for plotting points pd <.230 Ch 9: Discussion of Response Models with Factors and Predictors rank. linetype = "dashed". y = salary.p + geom_point(position = pd.position_dodge(0.ddply(faculty. alpha = 0. data. fill = sex)) # plot a reference line for the global mean (assuming no groups) p <. library(plyr) fac. s ) }) fac. aes(x = degree. and degree observed in the data.sum ## ## ## ## ## ## ## ## ## ## ## ## rank. max(faculty$salary))) p <. function(.sum <.p + geom_boxplot(size = 0. degree).25) # points for observed data p <.5) p <. colour = "black". the differences between sexes within each combination of rank and degree appear to be fairly small.df) { = length(.p + facet_grid(.df$salary) = mean(.3. This might explain the differences between male and female salaries. when other factors are ignored. 
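As a rough check on the claim that the difference in rank mix might explain the raw salary gap, the cell sizes n and cell means m in the fac.sum table can be collapsed by sex. This is only a back-of-the-envelope sketch: it uses the rounded cell means from the table, so the results are approximate.

```latex
\begin{align*}
\bar{y}_{\text{Male}}   &\approx \frac{4(30712) + 12(29593) + 7(23585) + 5(23246) + 3(20296) + 7(16901)}{38}
                         = \frac{938484}{38} \approx 24697, \\
\bar{y}_{\text{Female}} &\approx \frac{1(24900) + 3(30107) + 2(21570) + 1(21600) + 7(17006)}{14}
                         = \frac{299003}{14} \approx 21357.
\end{align*}
```

Ignoring rank and degree, men earn roughly 3340 dollars more on average. At the same time, 16 of the 38 men (42%) but only 4 of the 14 women (29%) are Full Professors, while 10 of the 38 men (26%) and 8 of the 14 women (57%) are Assistant Professors, so much of the marginal gap could come from the difference in rank mix rather than from differences within a rank.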
[Figure: "Salary by rank, degree, and sex". Boxplots of salary against degree (Other, Doctorate), filled by sex (Male, Female), with one panel for each rank (Full, Assoc, Asst).]

I will consider two simple analyses of these data. The first analysis considers the effect of the three factors on salary. The second analysis considers the effect of the predictors. A complete analysis using both factors and predictors is then considered. I am doing the three-factor analysis because the most complex pure ANOVA problem we considered this semester has two factors; the analysis is for illustration only!

The full model for a three-factor study includes the three main effects, the three possible two-factor interactions, plus the three-factor interaction. Identifying the factors by S (sex), D (degree) and R (rank), we write the full model as

Salary = Grand mean + S effect + D effect + R effect
         + S*D interaction + S*R interaction + R*D interaction
         + S*D*R interaction + Residual.

You should understand what main effects and two-factor interactions measure, but what about the three-factor term? If you look at the two levels of degree separately, then a three-factor interaction is needed if the interaction between sex and rank is different for the two degrees (i.e., the profile plots are different for the two degrees). Not surprisingly, three-factor interactions are hard to interpret.

I considered a hierarchical backward elimination of effects (see Chapter 3 for details). Individual regression variables are not considered for deletion, unless they correspond to an effect in the model statement. All tests were performed at the 0.10 level, but this hardly matters here.

The first step in the elimination is to fit the full model and check whether the three-factor term is significant. The three-factor term was not significant (in fact, it couldn't be fit because one category had zero observations). After eliminating this effect, I fit the model with all three two-factor terms, and then sequentially deleted the least important effects, one at a time, while still adhering to the hierarchy principle using the AIC criterion from the step() function. The final model includes only an effect due to rank. Finally, I compute the lsmeans() to compare salary for all pairs of rank.

# fit full model
lm.faculty.factor.full <- lm(salary ~ sex*rank*degree, data = faculty)
## Note that there are not enough degrees-of-freedom to estimate all these effects
## because we have 0 observations for Female/Assoc/Doctorate
library(car)
Anova(lm.faculty.factor.full, type=3)
## Error: there are aliased coefficients in the model
summary(lm.faculty.factor.full)
##
## Call:
## lm(formula = salary ~ sex * rank * degree, data = faculty)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -6262 -1453   -226  1350  7938
##
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            30712       1485   20.68  < 2e-16 ***
## sexFemale                              -5812       3321   -1.75  0.08758 .
## rankAssoc                              -7127       1862   -3.83  0.00043 ***
## rankAsst                              -10416       2268   -4.59  4.1e-05 ***
## degreeDoctorate                        -1119       1715   -0.65  0.51777
## sexFemale:rankAssoc                     3797       4086    0.93  0.35823
## sexFemale:rankAsst                      7116       4774    1.49  0.14373
## sexFemale:degreeDoctorate               6325       3834    1.65  0.10665
## rankAssoc:degreeDoctorate                780       2442    0.32  0.75095
## rankAsst:degreeDoctorate               -2276       2672   -0.85  0.39930
## sexFemale:rankAssoc:degreeDoctorate       NA         NA      NA       NA
## sexFemale:rankAsst:degreeDoctorate     -7525       5384   -1.40  0.16972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2970 on 41 degrees of freedom
## Multiple R-squared: 0.797, Adjusted R-squared: 0.748
## F-statistic: 16.1 on 10 and 41 DF, p-value: 2.99e-11

## AIC
# option: test="F" includes additional information
#   for parameter estimate tests that we're familiar with
# option: for BIC, include k=log(nrow( [data.frame name] ))
lm.faculty.factor.red.AIC <- step(lm.faculty.factor.full, direction="backward",
                                  test="F")
## Start: AIC=841.3
## salary ~ sex * rank * degree
##
##                   Df Sum of Sq      RSS AIC F value Pr(>F)
## <none>                         3.62e+08 841
## - sex:rank:degree  1  17233177 3.79e+08 842    1.95   0.17

## Because the full model can not be fit, the step() procedure does not work
## Below we remove the three-way interaction, then the step() procedure will
## do the rest of the work for us.

Remove the three-way interaction, then use step() to perform backward selection based on AIC.

# model reduction using update() and subtracting (removing) model terms
lm.faculty.factor.red <- lm.faculty.factor.full
# remove variable
lm.faculty.factor.red <- update(lm.faculty.factor.red, ~ . - sex:rank:degree )
Anova(lm.faculty.factor.red, type=3)
## Anova Table (Type III tests)
##
## Response: salary
##               Sum Sq Df F value  Pr(>F)
## (Intercept) 3.93e+09  1  435.91 < 2e-16 ***
## sex         1.12e+07  1    1.24 0.27094
## rank        1.97e+08  2   10.90 0.00015 ***
## degree      4.22e+05  1    0.05 0.82989
## sex:rank    2.70e+06  2    0.15 0.86140
## sex:degree  7.66e+06  1    0.85 0.36202
## rank:degree 3.34e+07  2    1.85 0.16936
## Residuals   3.79e+08 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# AIC backward selection
lm.faculty.factor.red.AIC <- step(lm.faculty.factor.red, direction="backward",
                                  test="F")
## Start: AIC=841.7
## salary ~ sex + rank + degree + sex:rank + sex:degree + rank:degree
##
##               Df Sum of Sq      RSS AIC F value Pr(>F)
## - sex:rank     2   2701493 3.82e+08 838    0.15   0.86
## - sex:degree   1   7661926 3.87e+08 841    0.85   0.36
## <none>                     3.79e+08 842
## - rank:degree  2  33433415 4.12e+08 842    1.85   0.17
##
## Step: AIC=838
## salary ~ sex + rank + degree + sex:degree + rank:degree
##
##               Df Sum of Sq      RSS AIC F value Pr(>F)
## - sex:degree   1  12335789 3.94e+08 838    1.42   0.24
## <none>                     3.82e+08 838
## - rank:degree  2  32435968 4.14e+08 838    1.87   0.17
##
## Step: AIC=837.7
## salary ~ sex + rank + degree + rank:degree
##
##               Df Sum of Sq      RSS AIC F value Pr(>F)
## - sex          1   3009036 3.97e+08 836    0.34   0.56
## - rank:degree  2  27067985 4.21e+08 837    1.55   0.22
## <none>                     3.94e+08 838
##
## Step: AIC=836.1
## salary ~ rank + degree + rank:degree
##
##               Df Sum of Sq      RSS AIC F value Pr(>F)
## - rank:degree  2   3.1e+07 4.28e+08 836     1.8   0.18
## <none>                     3.97e+08 836
##
## Step: AIC=836
## salary ~ rank + degree
##
##          Df Sum of Sq      RSS AIC F value  Pr(>F)
## - degree  1  1.10e+07 4.39e+08 835    1.23    0.27
## <none>                4.28e+08 836
## - rank    2  1.35e+09 1.78e+09 906   75.65 1.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=835.3
## salary ~ rank
##
##        Df Sum of Sq      RSS AIC F value  Pr(>F)
## <none>              4.39e+08 835
## - rank  2  1.35e+09 1.79e+09 904    75.2 1.2e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# all are significant, stop.
# final model: salary ~ rank
lm.faculty.factor.final <- lm.faculty.factor.red.AIC
library(car)
Anova(lm.faculty.factor.final, type=3)
## Anova Table (Type III tests)
##
## Response: salary
##               Sum Sq Df F value  Pr(>F)
## (Intercept) 1.76e+10  1  1963.9 < 2e-16 ***
## rank        1.35e+09  2    75.2 1.2e-15 ***
## Residuals   4.39e+08 49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.faculty.factor.final)
##
## Call:
## lm(formula = salary ~ rank, data = faculty)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -5209 -1819   -418  1587  8386
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    29659        669   44.32  < 2e-16 ***
## rankAssoc      -6483       1043   -6.22  1.1e-07 ***
## rankAsst      -11890        972  -12.23  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2990 on 49 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.744
## F-statistic: 75.2 on 2 and 49 DF, p-value: 1.17e-15

All ranks are different with salaries increasing with rank.

### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
lsmeans(lm.faculty.factor.final, list(pairwise ~ rank), adjust = "bonferroni")
## $`lsmeans of rank`
##  rank  lsmean    SE df lower.CL upper.CL
##  Full   29659 669.3 49    28314    31004
##  Assoc  23176 799.9 49    21568    24783
##  Asst   17769 705.5 49    16351    19186
##
## Confidence level used: 0.95
##
## $`pairwise differences of rank`
##  contrast     estimate     SE df t.ratio p.value
##  Full - Assoc     6483 1043.0 49   6.216  <.0001
##  Full - Asst     11890  972.4 49  12.228  <.0001
##  Assoc - Asst     5407 1066.6 49   5.070  <.0001
##
## P value adjustment: bonferroni method for 3 tests

This analysis suggests that sex is not predictive of salary, once other factors are taken into account. In particular, faculty rank appears to be the sole important effect, in the sense that once salaries are adjusted for rank no other factors explain a significant amount of the unexplained variation in salaries.

As noted earlier, the analysis was meant to illustrate a three-factor ANOVA and backward selection. The analysis is likely flawed, because it ignores the effects of year and year since degree on salary.

9.2.2 Using Year and Year Since Degree to Predict Salary

Plots of the salary against years in rank and years since degree show fairly strong associations with salary. The variability in salaries appears to be increasing with year and with year since degree, which might be expected. You might think to transform the salaries to a log scale to eliminate this effect, but doing so has little impact on the conclusions (not shown).

library(ggplot2)
p1 <- ggplot(faculty, aes(x = year, y = salary, colour = rank, shape = sex,
                          size = degree))
p1 <- p1 + scale_size_discrete(range=c(3,5))
p1 <- p1 + geom_point(alpha = 0.5)
p1 <- p1 + labs(title = "Salary by year")
p1 <- p1 + theme(legend.position = "bottom")
#print(p1)
p2 <- ggplot(faculty, aes(x = yd, y = salary, colour = rank, shape = sex,
                          size = degree))
p2 <- p2 + scale_size_discrete(range=c(3,5))
p2 <- p2 + geom_point(alpha = 0.5)
p2 <- p2 + labs(title = "Salary by yd")
p2 <- p2 + theme(legend.position = "bottom")
#print(p2)
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)

[Figure: scatterplots of salary in two panels, "Salary by year" and "Salary by yd", with colour for rank (Full, Assoc, Asst), plotting symbol for sex (Male, Female), and point size for degree (Other, Doctorate).]

As a point of comparison with the three-factor ANOVA, I fit a multiple regression model with year and years since degree as predictors of salary. These two predictors are important for explaining the variation in salaries, but together they explain much less of the variation (58%) than rank does on its own (75%).

# interaction model
lm.s.y.yd.yyd <- lm(salary ~ year*yd, data = faculty)
summary(lm.s.y.yd.yyd)
##
## Call:
## lm(formula = salary ~ year * yd, data = faculty)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -10369  -2362   -506   2363  12212
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16287.39    1395.05   11.68  1.3e-15 ***
## year          561.15     275.24    2.04   0.0470 *
## yd            235.42      83.27    2.83   0.0068 **
## year:yd        -3.09      10.41   -0.30   0.7680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3960 on 48 degrees of freedom
## Multiple R-squared: 0.579, Adjusted R-squared: 0.553
## F-statistic: 22 on 3 and 48 DF, p-value: 4.17e-09

# interaction is not significant
lm.s.y.yd <- lm(salary ~ year + yd, data = faculty)
summary(lm.s.y.yd)
##
## Call:
## lm(formula = salary ~ year + yd, data = faculty)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -10321  -2347   -333   2299  12241
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  16555.7     1052.4   15.73  < 2e-16 ***
## year           489.3      129.6    3.78  0.00043 ***
## yd             222.3       69.8    3.18  0.00253 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3920 on 49 degrees of freedom
## Multiple R-squared: 0.578, Adjusted R-squared: 0.561
## F-statistic: 33.6 on 2 and 49 DF, p-value: 6.53e-10

9.2.3 Using Factors and Predictors to Model Salaries

The plots we looked at helped us to understand the data. In particular, the plot of salary against years in rank, using rank as a plotting symbol, suggests that a combination of predictors and factors will likely be better for modelling faculty salaries than either of the two models that we proposed up to this point.

There is no evidence of non-linearity in the plots of salary against the predictors, so I will not consider transforming years since degree, years in rank, or salary. Note that the increasing variability in salaries for increasing years in rank and increasing years since degree is partly due to differences in the relationships across ranks. The non-constant variance should be less of a concern in any model that includes rank and either years in rank or years since degree as effects.

I started the model building process with a maximal or full model with the five main effects plus the 10 possible interactions between two effects, regardless of whether the effects were factors or predictors. Notationally, this model is written as follows:

Salary = Grand mean + S effect + D effect + R effect + YEAR effect + YD effect
         + S*D interaction + S*R interaction + S*YEAR interaction + S*YD interaction
         + D*R interaction + D*YEAR interaction + D*YD interaction
         + R*YEAR interaction + R*YD interaction + YEAR*YD interaction + Residual,

where the year and year since degree effects (YD) are linear terms (as in the multiple regression model we considered). To check whether any important effects might have been omitted, I added individual three-factor terms to this model. All of the three-factor terms were insignificant (not shown), so I believe that my choice for the "maximal" model is sensible.

The output below gives the fit to the maximal model, and subsequent fits, using the hierarchy principle. Only selected summaries are provided.

# fit full model with two-way interactions
lm.faculty.full <- lm(salary ~ (sex + rank + degree + year + yd)^2, data = faculty)
library(car)
Anova(lm.faculty.full, type=3)
## Anova Table (Type III tests)
##
## Response: salary
##               Sum Sq Df F value Pr(>F)
## (Intercept) 2.26e+07  1    3.69  0.064 .
## sex         4.09e+06  1    0.67  0.420
## rank        5.73e+06  2    0.47  0.631
## degree      4.14e+06  1    0.68  0.417
## year        2.02e+06  1    0.33  0.570
## yd          3.19e+06  1    0.52  0.476
## sex:rank    9.32e+05  2    0.08  0.927
## sex:degree  7.16e+06  1    1.17  0.288
## sex:year    7.19e+06  1    1.17  0.287
## sex:yd      2.02e+06  1    0.33  0.569
## rank:degree 1.30e+07  2    1.06  0.358
## rank:year   1.57e+06  2    0.13  0.880
## rank:yd     9.82e+06  2    0.80  0.458
## degree:year 4.51e+06  1    0.74  0.397
## degree:yd   6.41e+06  1    1.05  0.314
## year:yd     5.09e+04  1    0.01  0.928
## Residuals   1.90e+08 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# This time I use BIC for model reduction by specifying k=
#   (compare this to model result using AIC --
#    too many nonsignificant parameters left in model)
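For a linear model, step() scores each candidate as n log(RSS/n) + k edf, where edf is the number of estimated regression coefficients; k = 2 gives AIC and k = log(n) gives BIC (the printed label stays "AIC" either way). As a hedged hand check for the maximal model, take RSS ≈ 1.90e+08 on 31 residual df from the ANOVA table, with n = 52 and hence edf = 52 − 31 = 21:

```latex
\begin{align*}
\mathrm{BIC} &= n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + \log(n)\,\mathrm{edf} \\
             &\approx 52 \log\!\left(\frac{1.90 \times 10^{8}}{52}\right) + \log(52)\,(21) \\
             &\approx 785.8 + 83.0 = 868.8 .
\end{align*}
```

Because the RSS is rounded to three digits, this is only approximate (roughly ±0.1), but it should agree closely with the starting value that the step() trace reports for the maximal model.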
test="F" .82 0.degree:year .15e+08 852 1.25 .57 0.rank:degree 2 18612514 2.46 .40 .sex:year 1 7194388 1.01e+08 860 1.red.05e+08 857 1.39 1 6525075 1.13 0.95e+08 854 0.57 .year:yd 1 582367 1.degree:yd 1 8179958 2.91e+08 861 0.98e+08 851 0.39 0.faculty.46 1 4428068 1.96e+08 866 1.year:yd 1 50921 1.sex:rank 2 932237 1.sex:degree 1 7164815 1.sex:yd .95e+08 858 0.81 0.91e+08 857 0. direction="backward".05 0.00e+08 863 0.19 1 10654937 2.18 .degree:yd .rank:yd 2 9822382 2.year:yd .rank:yd 2 20258184 2.67 0.rank:year 2 1571933 1.sex:year .degree:year 1 7497925 2.sex:yd 1 3008739 1.68 2 14587933 2.08 0.97e+08 859 1.90e+08 869 Step: AIC=861.93 .17 0.rank:degree .10 0.240 Ch 9: Discussion of Response Models with Factors and Predictors ## BIC # option: test="F" includes additional information # for parameter estimate tests that we're familiar with # option: for BIC.90e+08 865 0.97e+08 867 1.30 1 10462381 2.23 .29 .34 0. 07e+08 2.9.24 2.01 '*' 0.degree:yd .506 848 2. codes: Df Sum of Sq 1 2456466 2 21836322 1 7414066 1 9232872 1 12831931 1 13646799 RSS 1.172 847 849 4.05 '.216 846 1.sex:year <none> 1 1 12500896 2.59 0.degree:year .14 Step: AIC=850.08e+08 854 12669105 2.20e+08 841 0.2: Example: The Effect of Sex and Rank on Faculty Salary ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## .51 0.21e+08 42516602 2.70 0.122 851 853 3.251 849 1.70 1 1616150 2.sex:year <none> Df Sum of Sq RSS AIC F value Pr(>F) 1 361929 2.6 salary ~ sex + rank + degree + year + yd + sex:degree + sex:year + sex:yd + rank:degree + rank:yd + degree:year + degree:yd .30e+08 843 1.80 1 855102 2.rank:yd --Signif.133 850 2.60 2 24391011 2.046 * 849 3.19e+08 2.sex:year <none> .36 0.14 0.18 2.77 0.41e+08 AIC F value Pr(>F) 845 1.15 0.1 ' ' 1 Step: AIC=847.37e+08 2 AIC F value Pr(>F) 847 0.19e+08 845 Step: AIC=840.' 
0.01 0.08e+08 2.degree:yd .36 0.97 0.88 0.153 846 1.17 0.95e+08 854 2.06 0.97 0.rank:yd .degree:year .09e+08 1.09e+08 1.03e+08 2.degree:year .sex:degree .rank:degree .27 0.09e+08 2.21e+08 841 0.rank:degree .001 '**' 0.08e+08 854 1.98e+08 2.sex:yd .25 0. codes: Df Sum of Sq 2 21157939 1 8497324 1 9463400 1 10394382 1 2 RSS 2.94 0.3 salary ~ sex + rank + degree + year + yd + sex:degree + sex:year + rank:degree + rank:yd + degree:year + degree:yd .sex:year .96e+08 41051000 2.45 0.8 salary ~ sex + rank + degree + year + yd + sex:degree + sex:year + rank:yd + degree:year 241 .44e+08 842 2.01 '*' 0.29 0.20e+08 841 0.05 '.' 0.degree:yd .192 846 1.033 * 0 '***' 0.98e+08 22789419 2.sex:degree <none> .sex:degree .13 1 10569795 2.05e+08 2.027 * 0 '***' 0.77 0.001 '**' 0.rank:yd --Signif.149 849 1.1 ' ' 1 Step: AIC=844.sex:degree .6 salary ~ sex + rank + degree + year + yd + sex:degree + sex:year + rank:yd + degree:year + degree:yd .18e+08 2.201 850 2. 11 838 Step: AIC=834.' 0.06e+08 --Signif.7e-06 *** 870 34.1 ' ' 1 Step: AIC=829.sex:year 1 14770974 <none> RSS 2.45e+08 .46e+06 .degree:year 1 2585275 .03 0. <none> 2.yd 1 1.38e+08 2.rank:yd 2 24695126 2.23e+08 838 0.sex:year 1 8.70e+08 2.001 '**' 0.01 '*' 0.34e+08 832 1.18 8.5e-10 *** .11 830 849 25.rank 2 4.59 0.rank:yd .49 0.20e+08 841 --Signif.25e+08 834 Step: AIC=831.97 0.sex 1 9.12e+07 .49e+07 <none> .degree 1 1.59e+08 2.63e+07 <none> .07e+07 .25e+08 2.49 835 2.33 0.sex:year <none> Df Sum of Sq RSS AIC F value Pr(>F) 2 24905278 2.36e+08 841 3.74e+08 2.00e+08 --- RSS 2.degree:year 1 4414318 2.11 2.68e+08 2.57 0.1 ' ' 1 Step: AIC=837.11 837 2.59e+08 4.rank:yd 2 25367664 .1e-10 *** 0. 
codes: 0 '***' RSS 2.48e+08 2.21 828 1.05 '.5 salary ~ sex + rank + degree + year + yd Df Sum of Sq .32 0.001 '**' 0.23e+08 AIC F value Pr(>F) 834 0.089 .degree .04e+08 6.50e+08 6.50e+08 832 2.229 830 1.13e+06 .57e+08 AIC F value Pr(>F) 830 1.69 6.62e+08 2.18 828 2.11 1 8902098 2.48 0.242 ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Ch 9: Discussion of Response Models with Factors and Predictors Df Sum of Sq RSS AIC F value Pr(>F) .80 0.' 0.05 '.456 . 832 874 35.rank 2 4.119 .sex:degree 1 3112507 2.yd 1 1.167 831 2.25 0.24e+08 838 0.2 salary ~ sex + rank + degree + year + yd + sex:year + rank:yd .58 0.67e+08 2.44e+08 838 2.72 0.01 '*' 0.63 0.40e+08 833 2.8 salary ~ sex + rank + degree + year + yd + sex:year Df Sum of Sq .75 7.098 . codes: 0 '***' 0.6 salary ~ sex + rank + degree + year + yd + sex:year + rank:yd + degree:year Df Sum of Sq .66 0.59e+08 AIC F value Pr(>F) 827 1.375 .87 0.year 1 1.20 1 14134386 2.sex:year 1 16645026 2.86 0.degree 1 1. 41e+08 4.red.05 '.75e+08 2.66 815 0.29 .34 7.01 '*' 0.75e+08 825 .1 2.yd 1 7.5e-10 *** --Signif.7 salary ~ rank + year + yd Df Sum of Sq RSS AIC F value Pr(>F) .16e+08 846 25.35 0.1e-13 *** --Signif.48e+08 4.9 5.62e+08 4. codes: 0 '***' 0.53 <none> 2.' 0.rank 2 6.68 6.rank 2 4.9.77e+08 2.75e+08 2.2 salary ~ rank + year Df Sum of Sq RSS AIC F value Pr(>F) <none> 2.40 0.1 ' ' 1 Step: AIC=821.53 814 1.27 .' 0.001 '**' 0.2: Example: The Effect of Sex and Rank on Faculty Salary ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Signif.68e+08 827 .year 1 1.1 1.1 ' ' 1 Step: AIC=827. .05 '.8 4.4 0.01 '*' 0. our BIC-backward selected model appears adequate.72e+08 867 34.9e-06 *** .62e+08 AIC F value Pr(>F) 813 815 0.25 <none> 2.19 0.39 0.75e+08 825 1.1e-11 *** --Signif.77e+08 821 . In our case.77e+08 821 0. 
test="F") ## ## ## ## ## ## ## ## ## ## Single term additions Model: salary ~ rank + year Df Sum of Sq <none> sex 1 2304648 degree 1 1127718 yd 1 2314414 rank:year 2 15215454 RSS 2.001 '**' 0.76e+08 825 1.faculty.53e+08 869 40.87e+06 2.1 ' ' 1 Step: AIC=824.8e-06 *** .01 '*' 0.1e-05 *** .31e+06 2.34 0.4 salary ~ rank + degree + year + yd Df Sum of Sq RSS AIC F value Pr(>F) .39e+08 841 28.yd 1 2.' 0.15 0.05 '. codes: 0 '***' 0.32e+08 9. add1(lm.04e+08 6.' 0. codes: 0 '***' 0.68e+06 2.1 ' ' 1 The add1() function will indicate whether a variable from the “full” model should be added to the current model.year 1 1.001 '**' 0.BIC.76e+08 2.05 '.year 1 1.53 815 0.001 '**' 0. codes: 243 0 '***' 0.79e+08 7.degree 1 6.01 '*' 0.16e+08 842 24. ~ (sex + rank + degree + year + yd)^2.09e+08 875 54.rank 2 4. data = faculty) Residuals: Min 1Q Median -3462 -1303 -299 3Q 783 Max 9382 Coefficients: Estimate Std.3 48 21985 24567 ## Asst 19014 613.845.faculty. # final model: salary ~ year + rank lm.5 905.faculty. stop.CL upper.final) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = salary ~ rank + year. # all are significant.2 871.1 ' ' 1 Residual standard error: 2400 on 48 degrees of freedom Multiple R-squared: 0.05 '.42e+09 1 766. codes: 0 '***' 0.05 '.1 2.68 < 2e-16 *** rankAssoc -5192. Error t value Pr(>|t|) (Intercept) 25657.CL ## Full 28468 582.BIC library(car) Anova(lm. list(pairwise ~ rank).final <.01 '*' 0.final.9e-07 *** rankAsst -9454.1e-14 *** year 375.1 48 17781 20246 ## ## Confidence level used: 0.final.96 2.lm. adjust = "bonferroni") ## $`lsmeans of rank` ## rank lsmean SE df lower.95 ## .30 2.244 Ch 9: Discussion of Response Models with Factors and Predictors Let’s look carefully at our resulting model. codes: 0 '***' 0. type=3) ## ## ## ## ## ## ## ## ## ## Anova Table (Type III tests) Response: salary Sum Sq Df F value Pr(>F) (Intercept) 4.001 '**' 0.001 '**' 0.' 
## Adjusted R-squared: 0.835
## F-statistic: 87.1 on 3 and 48 DF, p-value: <2e-16

### comparing lsmeans (may be unbalanced)
library(lsmeans)
## compare levels of main effects
lsmeans(lm.faculty.final, list(pairwise ~ rank), adjust = "bonferroni")
## $`pairwise differences of rank`
##  contrast     estimate    SE df t.ratio p.value
##  Full - Assoc     5192 871.8 48   5.956  <.0001
##  Full - Asst      9455 905.8 48  10.437  <.0001
##  Assoc - Asst     4262 882.9 48   4.828  <.0001
##
## P value adjustment: bonferroni method for 3 tests

9.2.4 Discussion of the Salary Analysis

The selected model is a simple ANCOVA model with a rank effect and a linear effect due to years in rank. Note that the maximal model has 20 single df effects with an R2 = 0.89 while the selected model has 3 single df effects with R2 = 0.84.

Looking at the parameter estimates table, all of the single df effects in the selected model are significant. The baseline group is Full Professors, with rank=3. Predicted salaries for the different ranks are given by:

Full:  predicted salary = 25658 + 375.70 year
Assoc: predicted salary = 25658 − 5192 + 375.70 year = 20466 + 375.70 year
Assis: predicted salary = 25658 − 9454 + 375.70 year = 16204 + 375.70 year

Do you remember how to interpret the lsmeans, and the p-values for comparing lsmeans?

You might be tempted to conclude that rank and years in rank are the only effects that are predictive of salaries, and that differences in salaries by sex are insignificant, once these effects have been taken into account. However, you must be careful because you have not done a diagnostic analysis. The following two issues are also important to consider.

A sex effect may exist even though there is insufficient evidence to support it based on these data. (Lack of power corrupts; and absolute lack of power corrupts absolutely!) If we are interested in the possibility of a sex effect, I think that we would do better by focusing on how large the effect might be, and whether it is important. A simple way to check is by constructing a confidence interval for the sex effect, based on a simple additive model that includes sex plus the effects that were selected as statistically significant, rank and year in rank. I am choosing this model because the omitted effects are hopefully small, and because the regression coefficient for a sex indicator is easy to interpret in an additive model. Other models might be considered for comparison. Summary output from this model is given below.

# add sex to the model
lm.faculty.final.sex <- update(lm.faculty.final, . ~ . + sex)
summary(lm.faculty.final.sex)
## Call:
## lm(formula = salary ~ rank + year + sex, data = faculty)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -3286 -1312   -178   939  9003
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  25390.7     1025.1   24.77  < 2e-16 ***
## rankAssoc    -5109.9      887.1   -5.76  6.2e-07 ***
## rankAsst     -9483.8      912.8  -10.39  9.2e-14 ***
## year           390.9       75.4    5.19  4.5e-06 ***
## sexFemale      524.1      834.7    0.63     0.53
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2420 on 47 degrees of freedom
## Multiple R-squared:  0.846, Adjusted R-squared:  0.833
## F-statistic: 64.6 on 4 and 47 DF,  p-value: <2e-16

Men are the baseline group for the sex effect, so the predicted salaries for men are 524 dollars less than that for women, adjusting for rank and year. A rough 95% CI for the sex differential is the estimated sex coefficient plus or minus two standard errors, or 524 ± 2 ∗ (835), or −1146 to 2194 dollars. The range of plausible values for the sex effect would appear to contain values of practical importance, so further analysis is warranted here.

Another concern, and potentially a more important issue, was raised by M. O. Finkelstein in a 1980 discussion in the Columbia Law Review on the use of regression in discrimination cases: “. . . [a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted’.” Thus, if women are unfairly held back from promotion through the faculty ranks, then using faculty rank to adjust salary before comparing sexes may not be acceptable to the courts. This suggests that an analysis comparing sexes but ignoring rank effects might be justifiable. What happens if this is done?

lm.faculty.sex.yd <- lm(salary ~ sex + yd, data = faculty)
library(car)
Anova(lm.faculty.sex.yd, type=3)
## Anova Table (Type III tests)
##
## Response: salary
##               Sum Sq Df F value  Pr(>F)
## (Intercept) 4.28e+09  1  231.44 < 2e-16 ***
## sex         6.72e+07  1    3.64   0.062 .
## yd          7.66e+08  1   41.48 4.9e-08 ***
## Residuals   9.05e+08 49
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.faculty.sex.yd)
## Call:
## lm(formula = salary ~ sex + yd, data = faculty)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -9632 -2529      3  2298 13126
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  18355.2     1206.5   15.21  < 2e-16 ***
## sexFemale    -2572.5     1349.1   -1.91    0.062 .
## yd             380.7       59.1    6.44  4.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4300 on 49 degrees of freedom
## Multiple R-squared:  0.493, Adjusted R-squared:  0.472
## F-statistic: 23.8 on 2 and 49 DF,  p-value: 5.91e-08

Similar result as before: there is insufficient evidence of a difference between sexes (due to the large proportion of variability in salary explained by yd [which I'm using in place of year since year is paired with rank]). Furthermore (not shown), there is insufficient evidence for a sex:yd interaction. However, rank and sex are (potentially) confounded. These data cannot resolve this question. Instead, data on promotions would help resolve this issue.
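The rough interval quoted for the sex effect is simple arithmetic, and a short sketch may make it easier to reproduce. The numbers below are the rounded estimate and standard error as they appear in the text (524 and 835); the variable names are illustrative only. The second interval replaces the multiplier 2 by the t quantile on the 47 residual degrees of freedom, which is what `confint()` would use on the fitted model object.

```r
# Rough 95% CI: estimate +/- 2 standard errors
est <- 524   # sexFemale coefficient, rounded as in the text
se  <- 835   # its standard error, rounded as in the text
ci.rough <- est + c(-2, 2) * se
ci.rough

# t-based interval: replace 2 with the 0.975 quantile of t on 47 residual df
ci.t <- est + c(-1, 1) * qt(0.975, df = 47) * se
ci.t

# With the fitted model from the notes available, the equivalent call would be
#   confint(lm.faculty.final.sex, "sexFemale")
```

Because the t quantile is slightly larger than 2, the t-based interval is slightly wider than the rough one.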
Chapter 10

Automated Model Selection for Multiple Regression

Given data on a response variable Y and k predictors or binary variables X1, X2, . . . , Xk, we wish to develop a regression model to predict Y. Assuming that the collection of variables is measured on the correct scale, and that the candidate list of effects includes all the important predictors or binary variables, the most general model is

Y = β0 + β1X1 + · · · + βkXk + ε.

In most problems one or more of the effects can be eliminated from the full model without loss of information. We want to identify the important effects, or equivalently, eliminate the effects that are not very useful for explaining the variation in Y.

We will study several automated non-hierarchical methods for model selection. Given a specific criterion for selecting a model, a method gives the best model. Before applying these methods, plot Y against each predictor to see whether transformations are needed. Although transformations of binary variables are not necessary, side-by-side boxplots of the response across the levels of a factor give useful information on the predictive ability of the factor. If a transformation of Xi is suggested, include the transformation along with the original Xi in the candidate list. Note that we can transform the predictors differently, for example, log(X1) and √X2. However, if several transformations are suggested for the response, then one should consider doing one analysis for each suggested response scale before deciding on the final scale.

Different criteria for selecting models lead to different “best models.” Given a collection of candidates for the best model, we make a choice of model on the basis of (1) a direct comparison of models, if possible, (2) examination of model adequacy (residuals, influence, etc.), (3) simplicity: all things being equal, simpler models are preferred, and (4) scientific plausibility.

I view the various criteria as a means to generate interesting models for further consideration. I do not take any of them literally as best.

You should recognize that automated model selection methods should not replace scientific theory when building models! Automated methods are best suited for exploratory analyses, in situations where the researcher has little scientific information as a guide.

AIC/BIC were discussed in Section 3.2.1 for stepwise procedures and were used in examples in Chapter 9. In those examples, I included the corresponding F-tests in the ANOVA table as a criterion for dropping variables from a model. The next few sections cover these methods in more detail, then discuss other criteria and selection strategies, finishing with a few examples.

10.1 Forward Selection

In forward selection we add variables to the model one at a time. The steps in the procedure are:

1. Find the variable in the candidate list with the largest correlation (ignoring the sign) with Y. This variable gives a simple linear regression model with the largest R2. Suppose this is X1. Then fit the model

Y = β0 + β1X1 + ε   (10.1)

and test H0 : β1 = 0. A t-test can be used here, or the equivalent ANOVA F-test. If we reject H0, go to step 2. Otherwise stop and conclude that no variables are important.

2. Find the remaining variable which when added to model (10.1) increases R2 the most (or equivalently decreases Residual SS the most). Suppose this is X2. Fit the model

Y = β0 + β1X1 + β2X2 + ε   (10.2)

and test H0 : β2 = 0. If we do not reject H0, stop and use model (10.1) to predict Y. If we reject H0, replace model (10.1) with (10.2) and repeat step 2 sequentially until no further variables are added to the model.

In forward selection we sequentially isolate the most important effect left in the pool, and check whether it is needed in the model. If it is needed we continue the process. Otherwise we stop.

The F-test default level for the tests on the individual effects is sometimes set as high as α = 0.50 (SAS default). This may seem needlessly high. However, in many problems certain variables may be important only in the presence of other variables. If we force the forward selection to test at standard levels then the process will never get “going” when none of the variables is important on its own.

10.2 Backward Elimination

The backward elimination procedure (discussed earlier this semester) deletes unimportant variables, one at a time, starting from the full model. The steps in the procedure are:

1. Fit the full model

Y = β0 + β1X1 + · · · + βkXk + ε.   (10.3)

2. Find the variable which when omitted from the full model (10.3) reduces R2 the least, or equivalently, increases the Residual SS the least. This is the variable that gives the largest p-value for testing an individual regression coefficient H0 : βj = 0 for j > 0. Suppose this variable is Xk. If you reject H0, stop and conclude that the full model is best. If you do not reject H0, delete Xk from the full model, giving the new full model

Y = β0 + β1X1 + · · · + βk−1Xk−1 + ε

to replace (10.3). Repeat steps 1 and 2 sequentially until no further variables can be deleted.

In backward elimination we isolate the least important effect left in the model, and check whether it is important. If not, delete it and repeat the process. Otherwise, stop. The default test level on the individual variables is sometimes set at α = 0.10 (SAS default).
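The forward selection steps of Section 10.1 can be sketched in a few lines of R. This is only a minimal illustration under stated assumptions: `forward.select()` is a hypothetical helper written for this sketch (it is not part of the notes or of any package), it uses the partial F-test p-value as the inclusion criterion, and it is run on the built-in `mtcars` data rather than on the data analyzed in these notes.

```r
# Hypothetical helper: forward selection by partial F-test p-value.
# alpha.enter is the inclusion level (0.50 is the SAS default quoted in the text).
forward.select <- function(data, response, candidates, alpha.enter = 0.50) {
  selected  <- character(0)
  remaining <- candidates
  # fit the model with the given predictors (intercept only if none)
  make.fit <- function(terms) {
    rhs <- if (length(terms) == 0) "1" else terms
    lm(reformulate(rhs, response = response), data = data)
  }
  fit <- make.fit(selected)
  while (length(remaining) > 0) {
    # p-value for adding each remaining candidate to the current model
    pvals <- sapply(remaining, function(v) {
      anova(fit, make.fit(c(selected, v)))[2, "Pr(>F)"]
    })
    best <- names(which.min(pvals))          # largest increase in R^2
    if (pvals[best] > alpha.enter) break     # not needed: stop
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
    fit <- make.fit(selected)
  }
  fit
}

# Example on built-in data, testing at the 0.05 level
fit.fwd <- forward.select(mtcars, "mpg", c("wt", "hp", "qsec", "drat"),
                          alpha.enter = 0.05)
formula(fit.fwd)
```

Since every candidate here has one degree of freedom, choosing the smallest p-value is the same as choosing the variable that increases R2 the most, which is the rule stated in step 2.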
10.3 Stepwise Regression

Stepwise regression combines features of forward selection and backward elimination. A deficiency of forward selection is that variables cannot be omitted from the model once they are selected. This is problematic because many variables that are initially important are not important once several other variables are included in the model. In stepwise regression, we add variables to the model as in forward regression, but include a backward elimination step every time a new variable is added. That is, every time we add a variable to the model we ask whether any of the variables added earlier can be omitted. If variables can be omitted, they are placed back into the candidate pool for consideration at the next step of the process. The process continues until no additional variables can be added, and none of the variables in the model can be excluded. The procedure can start from an empty model, a full model, or an intermediate model, depending on the software.

The p-values used for including and excluding variables in stepwise regression are usually taken to be equal (why is this reasonable?), and sometimes set at α = 0.15 (SAS default).

10.3.1 Example: Indian systolic blood pressure

We revisit the example first introduced in Chapter 2. Anthropologists conducted a study to determine the long-term effects of an environmental change on systolic blood pressure. They measured the blood pressure and several other characteristics of 39 Indians who migrated from a very primitive environment high in the Andes into the mainstream of Peruvian society at a lower altitude. All of the Indians were males at least 21 years of age, and were born at a high altitude.

Let us illustrate the three model selection methods to build a regression model, using systolic blood pressure (sysbp) as the response, and seven candidate predictors: wt = weight in kilos; ht = height in mm; chin = chin skin fold in mm; fore = forearm skin fold in mm; calf = calf skin fold in mm; pulse = pulse rate-beats/min, and yrage = fraction, which is the proportion of each individual's lifetime spent in the new environment.

Below I generate simple summary statistics and plots. The plots do not suggest any apparent transformations of the response or the predictors, so we will analyze the data using the given scales.

#### Example: Indian
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch02_indian.dat"
indian <- read.table(fn.data, header=TRUE)
# Description of variables
# id    = individual id            yrmig = years since migration
# age   = age in years             ht    = height in mm
# wt    = weight in kilos          fore  = forearm skin fold in mm
# chin  = chin skin fold in mm     pulse = pulse rate-beats/min
# calf  = calf skin fold in mm     diabp = diastolic bp
# sysbp = systolic bp

# Create the "fraction of their life" variable
# yrage = years since migration divided by age
indian$yrage <- indian$yrmig / indian$age

# correlation matrix and associated p-values testing "H0: rho == 0"
library(Hmisc)
i.cor <- rcorr(as.matrix(indian[,c("sysbp", "wt", "ht", "chin"
                                 , "fore", "calf", "pulse", "yrage")]))
# print correlations with the response to 3 significant digits
signif(i.cor$r[1, ], 3)
##  sysbp     wt     ht   chin   fore   calf  pulse  yrage
##  1.000  0.521  0.219  0.170  0.272  0.251  0.133 -0.276

# scatterplots
library(ggplot2)
p1 <- ggplot(indian, aes(x = wt   , y = sysbp)) + geom_point(size=2)
p2 <- ggplot(indian, aes(x = ht   , y = sysbp)) + geom_point(size=2)
p3 <- ggplot(indian, aes(x = chin , y = sysbp)) + geom_point(size=2)
p4 <- ggplot(indian, aes(x = fore , y = sysbp)) + geom_point(size=2)
p5 <- ggplot(indian, aes(x = calf , y = sysbp)) + geom_point(size=2)
p6 <- ggplot(indian, aes(x = pulse, y = sysbp)) + geom_point(size=2)
p7 <- ggplot(indian, aes(x = yrage, y = sysbp)) + geom_point(size=2)
library(gridExtra)
# the title argument is top= in gridExtra 2.0 and later (it was main= in older versions)
grid.arrange(p1, p2, p3, p4, p5, p6, p7, ncol=3
           , top = "Scatterplots of response sysbp with each predictor variable")

[Figure: Scatterplots of response sysbp with each predictor variable (panels: wt, ht, chin, fore, calf, pulse, yrage).]

The step() function provides the forward, backward, and stepwise procedures based on AIC or BIC, and provides corresponding F-tests.

## step() function specification
## The first two arguments of step(object, scope, ...) are
# object = a fitted model object.
# scope = a formula giving the terms to be considered for adding or dropping
## default is AIC
# for BIC, include k = log(nrow( [data.frame name] ))
# test="F" includes additional information
# for parameter estimate tests that we're familiar with
Forward selection output

The output for the forward selection method is below. BIC is our selection criterion, though similar decisions are made as if using F-tests.

Step 1 Variable wt = weight is entered first because it has the highest correlation with sysbp = sys bp. The corresponding F-value is the square of the t-statistic for testing the significance of the weight predictor in this simple linear regression model.

Step 2 Adding yrage = fraction to the simple linear regression model with weight as a predictor increases R2 the most, or equivalently, decreases Residual SS (RSS) the most.

Step 3 The last table has “<none>” as the first row, indicating that the current model (no change to current model) is the best under the current selection criterion.

# start with an empty model (just the intercept 1)
lm.indian.empty <- lm(sysbp ~ 1, data = indian)
# Forward selection, BIC with F-tests
lm.indian.forward.red.BIC <- step(lm.indian.empty
                  , sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
                  , direction = "forward", test = "F", k = log(nrow(indian)))
## Start:  AIC=203.4
## sysbp ~ 1
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## + wt     1      1775 4756 195   13.81 0.00067 ***
## <none>               6531 203
## + yrage  1       498 6033 204    3.05 0.08881 .
## + fore   1       484 6047 204    2.96 0.09356 .
## + calf   1       411 6121 204    2.48 0.12357
## + ht     1       314 6218 205    1.87 0.18018
## + chin   1       189 6342 206    1.10 0.30027
## + pulse  1       115 6417 206    0.66 0.42113
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=194.7
## sysbp ~ wt
##
##         Df Sum of Sq  RSS AIC F value Pr(>F)
## + yrage  1      1315 3441 186   13.75 0.0007 ***
## <none>               4756 195
## + chin   1       144 4612 197    1.12 0.2967
## + calf   1        17 4739 198    0.13 0.7240
## + pulse  1         6 4750 198    0.05 0.8309
## + ht     1         2 4754 198    0.02 0.9024
## + fore   1         1 4755 198    0.01 0.9257
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=185.7
## sysbp ~ wt + yrage
##
##         Df Sum of Sq  RSS AIC F value Pr(>F)
## <none>               3441 186
## + chin   1     197.4 3244 187    2.13   0.15
## + fore   1      50.5 3391 189    0.52   0.47
## + calf   1      30.2 3411 189    0.31   0.58
## + ht     1      23.7 3418 189    0.24   0.63
## + pulse  1       5.9 3435 189    0.06   0.81

summary(lm.indian.forward.red.BIC)
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## wt             1.217      0.234    5.21    8e-06 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared:  0.473, Adjusted R-squared:  0.444
## F-statistic: 16.2 on 2 and 36 DF,  p-value: 9.79e-06
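The column labelled AIC in the step() tables is really BIC here, because the penalty was set with k = log(n). For a linear model the criterion is n log(RSS/n) + k × (number of estimated coefficients), so the values printed at each step can be checked directly from the RSS column. The sketch below assumes only the rounded RSS values shown in the forward selection tables and n = 39; `step.ic()` is a hypothetical helper name used for illustration.

```r
n <- 39   # sample size in the Indian blood pressure data

# criterion used by step() for lm fits: n*log(RSS/n) + k*edf
step.ic <- function(rss, edf, k) n * log(rss / n) + k * edf

# RSS and number of estimated coefficients (edf), read from the tables
rss <- c(null = 6531, wt = 4756, wt.yrage = 3441)
edf <- c(1, 2, 3)

bic.vals <- round(step.ic(rss, edf, k = log(n)), 1)  # BIC penalty, as in the notes
bic.vals   # 203.4 194.7 185.7, matching Start and the two Step lines

aic.vals <- round(step.ic(rss, edf, k = 2), 1)       # AIC penalty (the step() default)
aic.vals
```

The BIC penalty per coefficient, log(39) ≈ 3.66, is larger than the AIC penalty of 2, which is why BIC tends to select smaller models than AIC.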
Backward selection output

The output for the backward elimination method is below. BIC is our selection criterion, though similar decisions are made as if using F-tests.

Step 0 The full model has 7 predictors so REG df = 7. The F-test in the full model ANOVA table (F = 4.91 with p-value=0.0008) tests the hypothesis that the regression coefficient for each predictor variable is zero. This test is highly significant, indicating that one or more of the predictors is important in the model. The t-value column gives the t-statistic for testing the significance of the individual predictors in the full model conditional on the other variables being in the model.

# start with a full model
lm.indian.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian)
summary(lm.indian.full)
## Call:
## lm(formula = sysbp ~ wt + ht + chin + fore + calf + pulse + yrage,
##     data = indian)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.399  -5.792  -0.691   6.945  23.577
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.4577    53.9130    1.97  0.05728 .
## wt            1.7109     0.3866    4.43  0.00011 ***
## ht           -0.0453     0.0395   -1.15  0.25933
## chin         -1.1572     0.8461   -1.37  0.18124
## fore         -0.7018     1.3499   -0.52  0.60681
## calf          0.1036     0.6117    0.17  0.86664
## pulse         0.0749     0.1957    0.38  0.70470
## yrage       -29.3181     7.8684   -3.73  0.00078 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.99 on 31 degrees of freedom
## Multiple R-squared:  0.526, Adjusted R-squared:  0.419
## F-statistic: 4.91 on 7 and 31 DF,  p-value: 0.000808

The least important variable in the full model, as judged by the p-value, is calf = calf skin fold. This variable, upon omission, reduces R2 the least, or equivalently, increases the Residual SS the least. So calf is the first to be omitted from the model.

Step 1 After deleting calf, the six predictor model is fitted. At least one of the predictors left is important, as judged by the overall F-test p-value. The least important predictor left is pulse = pulse rate.

# Backward selection, BIC with F-tests
lm.indian.backward.red.BIC <- step(lm.indian.full
                  , direction = "backward", test = "F", k = log(nrow(indian)))
## Start:  AIC=199.9
## sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - calf   1         3 3099 196    0.03 0.86664
## - pulse  1        15 3111 196    0.15 0.70470
## - fore   1        27 3123 197    0.27 0.60681
## - ht     1       132 3228 198    1.32 0.25933
## - chin   1       187 3283 198    1.87 0.18124
## <none>               3096 200
## - yrage  1      1387 4483 211   13.88 0.00078 ***
## - wt     1      1956 5053 215   19.59 0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=196.3
## sysbp ~ wt + ht + chin + fore + pulse + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - pulse  1        13 3113 193    0.14 0.71302
## - fore   1        27 3126 193    0.28 0.60120
## - ht     1       130 3229 194    1.34 0.25601
## - chin   1       184 3283 195    1.90 0.17764
## <none>               3099 196
## - yrage  1      1448 4547 208   14.95 0.00051 ***
## - wt     1      1954 5053 212   20.17 8.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=192.8
## sysbp ~ wt + ht + chin + fore + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - fore   1        18 3130 189    0.19 0.66701
## - ht     1       131 3244 191    1.39 0.24681
## - chin   1       198 3311 192    2.10 0.15651
## <none>               3113 193
## - yrage  1      1450 4563 204   15.37 0.00042 ***
## - wt     1      1984 5096 208   21.03 6.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=189.3
## sysbp ~ wt + ht + chin + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - ht     1       114 3244 187    1.23 0.27453
## - chin   1       287 3418 189    3.12 0.08635 .
## <none>               3130 189
## - yrage  1      1446 4576 200   15.70 0.00036 ***
## - wt     1      2264 5394 207   24.59 1.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=187.1
## sysbp ~ wt + chin + yrage
##
##         Df Sum of Sq  RSS AIC F value  Pr(>F)
## - chin   1       197 3441 186    2.13 0.15341
## <none>               3244 187
## - yrage  1      1368 4612 197   14.76 0.00049 ***
## - wt     1      2515 5759 206   27.14 8.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step:  AIC=185.7
## sysbp ~ wt + yrage
##
##         Df Sum of Sq  RSS AIC F value Pr(>F)
## <none>               3441 186
## - yrage  1      1315 4756 195    13.8  7e-04 ***
## - wt     1      2592 6033 204    27.1  8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.indian.backward.red.BIC)
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## wt             1.217      0.234    5.21    8e-06 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared:  0.473, Adjusted R-squared:  0.444
## F-statistic: 16.2 on 2 and 36 DF,  p-value: 9.79e-06

In the final table we are unable to drop yrage or wt from the model.
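Two facts used in the backward elimination discussion can be checked numerically: the F value that step() reports for dropping a single-df term is the square of that term's t statistic in the current model, and the term with the largest p-value is the one whose removal increases the Residual SS the least. The sketch below uses drop1() on a stand-in model fitted to the built-in `mtcars` data (an assumption made so the example is self-contained; the same calls apply to the full model in the notes).

```r
# stand-in "full" model on built-in data
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# one backward-elimination table: effect of dropping each term
d1 <- drop1(fit, test = "F")
d1 <- d1[rownames(d1) != "<none>", ]

# t statistics for the same terms from the coefficient table
t.stat <- coef(summary(fit))[rownames(d1), "t value"]

# partial F statistic equals the squared t statistic for each single-df term
f.equals.t2 <- isTRUE(all.equal(unname(d1[["F value"]]), unname(t.stat^2)))
f.equals.t2

# the largest p-value belongs to the term whose removal increases RSS the least
drop.by.p  <- rownames(d1)[which.max(d1[["Pr(>F)"]])]
drop.by.ss <- rownames(d1)[which.min(d1[["Sum of Sq"]])]
drop.by.p == drop.by.ss
```

Both comparisons hold because every row of the table shares the same residual mean square in the denominator of F.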
Stepwise selection output

The output for the stepwise selection is given below. Variables are listed in the output tables in the order that best improves the AIC/BIC criterion. In the stepwise case, BIC will decrease (improve) by considering variables to drop or add (indicated in the first column by − and +). Rather than printing a small table at each step of the step() procedure, we use lm.XXX$anova to print a summary of the drop/add choices made.

# Stepwise (both) selection, BIC with F-tests, starting with intermediate model
# (this is a purposefully chosen "opposite" model,
# from the forward and backward methods this model
# includes all the variables dropped and none kept)
lm.indian.intermediate <- lm(sysbp ~ ht + fore + calf + pulse, data = indian)
# option: trace = 0 does not print each step of the selection
lm.indian.both.red.BIC <- step(lm.indian.intermediate
                  , sysbp ~ wt + ht + chin + fore + calf + pulse + yrage
                  , direction = "both", test = "F", k = log(nrow(indian)), trace = 0)
# the anova object provides a summary of the selection steps in order
lm.indian.both.red.BIC$anova
##      Step Df Deviance Resid. Df Resid. Dev   AIC
## 1         NA       NA        34       5651 212.4
## 2 - pulse  1    2.874        35       5654 208.8
## 3  - calf  1   21.844        36       5676 205.2
## 4    + wt -1  925.198        35       4751 202.0
## 5 + yrage -1 1439.707        34       3311 191.5
## 6    - ht  1   79.871        35       3391 188.7
## 7  - fore  1   50.548        36       3441 185.7

summary(lm.indian.both.red.BIC)
## Call:
## lm(formula = sysbp ~ wt + yrage, data = indian)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -18.433  -7.307   0.896   5.728  23.982
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   60.896     14.281    4.26  0.00014 ***
## wt             1.217      0.234    5.21    8e-06 ***
## yrage        -26.767      7.218   -3.71  0.00070 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.78 on 36 degrees of freedom
## Multiple R-squared:  0.473, Adjusted R-squared:  0.444
## F-statistic: 16.2 on 2 and 36 DF,  p-value: 9.79e-06
Summary of three selection methods

All three methods using BIC choose the same final model, sysbp = β0 + β1 wt + β2 yrage. Using the AIC criterion, you will find different results.

10.4 Other Model Selection Procedures

10.4.1 R2 Criterion

R2 is the proportion of variation explained by the model over the grand mean, and we wish to maximize this. A substantial increase in R2 is usually observed when an “important” effect is added to a regression model. With the R2 criterion, variables are added to the model until further additions give inconsequential increases in R2. The R2 criterion is not well-defined in the sense of producing a single best model. All other things being equal, I prefer the simpler of two models with similar values of R2. If several models with similar complexity have similar R2s, then there may be no good reason, at this stage of the analysis, to prefer one over the rest.

10.4.2 Adjusted-R2 Criterion, maximize

The adjusted-R2 criterion gives a way to compare R2 across models with different numbers of variables, and we want to maximize this. This eliminates some of the difficulty with calibrating R2, which increases even when unimportant predictors are added to a model. For a model with p variables and an intercept, the adjusted-R2 is defined by

$\bar{R}^2 = 1 - \frac{n-1}{n-p-1}\,(1 - R^2)$,

where n is the sample size.

There are four properties of $\bar{R}^2$ worth mentioning:
1. $\bar{R}^2 \le R^2$,
2. if two models have the same number of variables, then the model with the larger R2 has the larger $\bar{R}^2$,
3. if two models have the same R2, then the model with fewer variables has the larger adjusted-R2. Put another way, $\bar{R}^2$ penalizes complex models with many variables. And
4. $\bar{R}^2$ can be less than zero for models that explain little of the variation in Y.

The adjusted R2 is easier to calibrate than R2 because it tends to decrease when unimportant variables are added to a model. The model with the maximum $\bar{R}^2$ is judged best by this criterion. As I noted before, I do not take any of the criteria literally, and would also choose other models with $\bar{R}^2$ near the maximum value for further consideration.

10.4.3 Mallows' Cp Criterion, minimize

Mallows' Cp measures the adequacy of predictions from a model, relative to those obtained from the full model, and we want to minimize Cp. Mallows' Cp statistic is defined for a given model with p variables by

$C_p = \frac{\text{Residual SS}}{\hat{\sigma}^2_{\text{FULL}}} - \text{Residual df} + (p + 1)$,

where $\hat{\sigma}^2_{\text{FULL}}$ is the Residual MS from the full model with k variables X1, X2, . . . , Xk.

If all the important effects from the candidate list are included in the model, then the difference between the first two terms of Cp should be approximately zero. Thus, if the model under consideration includes all the important variables from the candidate list, then Cp should be approximately p + 1 (the number of variables in model plus one), or less. If important variables from the candidate list are excluded, Cp will tend to be much greater than p + 1.

Two important properties of Cp are
1. the full model has Cp = p + 1, where p = k, and
2. if two models have the same number of variables, then the model with the larger R2 has the smaller Cp.

Models with Cp ≈ p + 1, or less, merit further consideration. As with R2 and $\bar{R}^2$, I prefer simpler models that satisfy this condition. The “best” model by this criterion has the minimum Cp.

10.5 Illustration with Peru Indian data

R2 Criterion

# The leaps package provides best subsets with other selection criteria.
library(leaps)
# First, fit the full model
lm.indian.full <- lm(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage, data = indian)
# Second, create the design matrix which leaps uses as argument
# using model.matrix(lm.XXX) as input to leaps()
# R^2 -- for each model size, report the 5 best subsets (nbest = 5)
leaps.r2 <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
                , method = 'r2'
                , int = FALSE, nbest = 5, names = colnames(model.matrix(lm.indian.full)))
str(leaps.r2)
## List of 4
##  $ which: logi [1:36, 1:8] FALSE TRUE FALSE FALSE FALSE TRUE ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:36] "1" "1" "1" "1" ...
##   .. ..$ : chr [1:8] "(Intercept)" "wt" "ht" "chin" ...
##  $ label: chr [1:8] "(Intercept)" "wt" "ht" "chin" ...
##  $ size : num [1:36] 1 1 1 1 1 2 2 2 2 2 ...
##  $ r2   : num [1:36] 0.99 0.99 0.989 0.976 0.845 ...
# plot model R^2 vs size of model
plot(leaps.r2$size, leaps.r2$r2, main = "R2")
# report the best model (indicate which terms are in the model)
best.model.r2 <- leaps.r2$which[which((leaps.r2$r2 == max(leaps.r2$r2))),]
# these are the variable names for the best model
names(best.model.r2)[best.model.r2]
## [1] "(Intercept)" "wt"          "ht"          "chin"
## [5] "fore"        "calf"        "pulse"       "yrage"

[Figure: "R2", leaps.r2$r2 plotted against leaps.r2$size.]

Adjusted-R2 Criterion, maximize

# adj-R^2 -- for each model size, report the 5 best subsets (nbest = 5)
leaps.adjr2 <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
                , method = 'adjr2'
                , int = FALSE, nbest = 5, names = colnames(model.matrix(lm.indian.full)))
# plot model adj-R^2 vs size of model
plot(leaps.adjr2$size, leaps.adjr2$adjr2, main = "Adj-R2")
# report the best model (indicate which terms are in the model)
best.model.adjr2 <- leaps.adjr2$which[which((leaps.adjr2$adjr2 == max(leaps.adjr2$adjr2))),]
# these are the variable names for the best model
names(best.model.adjr2)[best.model.adjr2]
## [1] "(Intercept)" "wt"          "ht"          "chin"
## [5] "yrage"

[Figure: "Adj-R2", leaps.adjr2$adjr2 plotted against leaps.adjr2$size.]

Mallows' Cp Criterion, minimize

# Cp -- for each model size, report the 3 best subsets (nbest = 3)
leaps.Cp <- leaps(x = model.matrix(lm.indian.full), y = indian$sysbp
                , method = 'Cp'
                , int = FALSE, nbest = 3, names = colnames(model.matrix(lm.indian.full)))
# plot model Cp vs size of model
plot(leaps.Cp$size, leaps.Cp$Cp, main = "Cp")
lines(leaps.Cp$size, leaps.Cp$size) # adds the line for Cp = p
# report the best model (indicate which terms are in the model)
best.model.Cp <- leaps.Cp$which[which((leaps.Cp$Cp == min(leaps.Cp$Cp))),]
# these are the variable names for the best model
names(best.model.Cp)[best.model.Cp]
## [1] "(Intercept)" "wt"          "yrage"

[Figure: "Cp", leaps.Cp$Cp plotted against leaps.Cp$size, with the reference line Cp = p.]

All together

The function below takes regsubsets() output and formats it into a table.

# best subset, returns results sorted by BIC
f.bestsubset <- function(form, dat, nbest = 5){
  library(leaps)
  bs <- regsubsets(form, data=dat, nvmax=30, nbest=nbest, method="exhaustive");
  bs2 <- cbind(summary(bs)$which, (rowSums(summary(bs)$which)-1)
             , summary(bs)$rss, summary(bs)$rsq
             , summary(bs)$adjr2, summary(bs)$cp, summary(bs)$bic);
  cn <- colnames(bs2);
  cn[(dim(bs2)[2]-5):dim(bs2)[2]] <- c("SIZE", "rss", "r2", "adjr2", "cp", "bic");
  colnames(bs2) <- cn;
  ind <- sort.int(summary(bs)$bic, index.return=TRUE);
  bs2 <- bs2[ind$ix,];
  return(bs2);
}
# perform on our model
i.best <- f.bestsubset(formula(sysbp ~ wt + ht + chin + fore + calf + pulse + yrage)
                     , indian)
op <- options(); # saving old options
options(width=90) # setting command window output text width wider
i.best
##   (Intercept) wt ht chin fore calf pulse yrage SIZE  rss      r2   adjr2    cp     bic
## 2           1  1  0    0    0    0     0     1    2 3441 0.47311 0.44384 1.453 -13.999
## 3           1  1  0    1    0    0     0     1    3 3244 0.50333 0.46075 1.477 -12.639
## 3           1  1  0    0    1    0     0     1    3 3391 0.48085 0.43635 2.947 -10.912
27437 0.044 -3.27276 0.978 -8.200 4.615 6. The two predictor model with the highest value of R2 has weight and yrage = fraction as predictors: R2 = 0.02228 3.753 -6.431 -1.576 -1.43074 0.44493 0.52071 0.280 -3.215 3.43189 0.837 -1. and Cp Summary for Peru Indian Data R2 .04801 0. chin (chin skin fold).356 12.43651 0.679 -10.1 ¯ 2. All of the best three predictor models include weight and fraction as predictors.277 27.542 26.319 5.494 -5.554 14.179 -6. albeit the model with maximum R ¯2 The same conclusion is reached with R includes wt (weight).52548 0.924 -6.871 8.403 25.448 14.43020 0.52344 0.49308 0.47773 0. 2.299 4.25458 0.27213 0.50514 0.23235 0.595 7.234 4.477 4.52178 0.272.118 -8.473.794 5.365 -9.41887 0.270 13.475 3. R Discussion of R2 results: 1.147 4.52135 0.52368 0.47401 0.146 6.44692 0. and yrage (fraction) .07626 0. However.000 25.975 -8.323 4.125 -2.44489 0. ¯ 2. A good model using the R2 criterion has two predictors. 3.03757 0.44847 0.408 options(op).43437 0.05129 0.457 -5.320 14. 
All the other single predictor models have R2 < 0.178 0.23406 0.47674 0.402 -10.06290 0.27182 0.250 -10.029 6.162 4.41305 0.52104 0.151 3.25214 0.52592 0.23169 0.10.04911 0.340 3.43212 0.359 3.43344 0.29381 0.266 ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 3 3 3 4 4 4 4 4 5 5 5 5 5 1 6 6 6 2 6 2 2 2 6 7 1 1 1 1 Ch 10: Automated Model Selection for Multiple Regression 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 3 3 3 4 4 4 4 4 5 5 5 5 5 1 6 6 6 2 6 2 2 2 6 7 1 1 1 1 3411 3418 3435 3130 3232 3244 3244 3311 3113 3126 3128 3229 3232 4756 3099 3111 3123 4612 3228 4739 4750 4754 3283 3096 6033 6047 6121 6218 0.49731 0. None of the more complex models with four or more predictors provides a significant increase in R2.43297 0.44882 0.5.50333 0.394 2. The single predictor model with the highest value of R2 has wt = weight as a predictor: R2 = 0.517 -1. ht (height).50336 0. the increase in R2 achieved by adding a third predictor is minimal. weight and yrage.605 -10.428 -3.50564 0.728 -5.326 5.397 -1.40305 0.50573 0.07414 0. # reset (all) initial options 10.45123 0. 4. No other two predictor model has R2 close to this.46433 0.50517 0.42892 0.177 7. Look at the third and fourth steps of forward selection to see this.50 level. 2. 2. and C_ AIC via stepwise and backward selection AIC via forward selection. yrage wt.2 Peru Indian Data Summary The model selection procedures suggest three models that warrant further consideration. 3. It was suggested by 4 of the 5 methods (ignoring R2). 
The only adequate two predictor model has wt = weight and yrage = fraction as predictors: Cp = 1. Adj-R2 I will give three reasons why I feel that the simpler model is preferable at this point: 1.45 < 2 + 1 = 3. 3. Discussion of Cp results: 1. even when the significance level for inclusion is reduced from the default α = 0. and backward elimination. This is the minimum Cp model. any reasonable model must include both weight and fraction as predictors.5. According to Cp. chin. Each has Cp  1+1 = 2. the target value. Every model with weight and fraction is adequate. None of the single predictor models is adequate. I would select the model with these two predictors as a starting point. Forward selection often chooses predictors that are not important. Predictors ------------------wt. Every model that excludes either weight or fraction is inadequate: Cp  p + 1. Based on simplicity. The AIC/BIC forward and backward elimination outputs suggest that neither chin skin fold nor height is significant at any of the standard levels of significance used in practice. yrage. yrage. I can always add predictors if subsequent analysis suggests this is necessary! 10.5: Illustration with Peru Indian data 267 as predictors. . forward. chin wt.10. ht Methods suggesting model ----------------------------------------------------------BIC via stepwise. total Kjeldahl nitrogen (tkn). but I will note that eliminating this case does have a significant impact on the least squares estimates of the regression coefficients. total volatile solids (tvs). What do you think? 10.268 Ch 10: Automated Model Selection for Multiple Regression Using a mechanical approach. Should case 1 be deleted? I have not fully explored this issue. I will just note (not shown) that the selection methods point to the same model when case 1 is held out. we are led to a model with weight and yrage as predictors of systolic blood pressure. After deleting case 1. or any gross abnormalities in plots. 
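The Cp and adjusted R^2 values quoted for the two-predictor model can be checked by hand from the residual sums of squares. The sketch below is only an illustration: the sample size n = 39 is an assumption about the Peru Indian data set, and the variable names are made up for this example. The RSS and R^2 values are the ones reported in the best-subsets output.

```r
# Hand check of Cp and adjusted R^2 for the model with wt and yrage.
n        <- 39        # number of cases (assumed, not stated in this section)
k_full   <- 7         # number of predictors in the full model
rss_full <- 3096      # RSS of the full seven-predictor model
p        <- 2         # number of predictors in the candidate model
rss_p    <- 3441      # RSS of the candidate model (wt, yrage)
r2_p     <- 0.47311   # R^2 of the candidate model

# error variance estimated from the full model
sigma2_full <- rss_full / (n - k_full - 1)

# Mallows' Cp; an adequate model has Cp close to (or below) p + 1
cp <- rss_p / sigma2_full - n + 2 * (p + 1)

# adjusted R^2 penalizes R^2 for the number of predictors
adjr2 <- 1 - (1 - r2_p) * (n - 1) / (n - p - 1)

print(c(Cp = cp, target = p + 1, adjR2 = adjr2))
```

Under these assumptions the computed Cp falls below the target p + 1 = 3, in line with the discussion above.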
10.5.2 Peru Indian Data Summary

The model selection procedures suggest three models that warrant further consideration.

    Predictors            Methods suggesting model
    -------------------   -----------------------------------------------------------
    wt, yrage             BIC via stepwise, forward, and backward elimination, and C_p
    wt, yrage, chin       AIC via stepwise and backward selection
    wt, yrage, chin, ht   AIC via forward selection, Adj-R2

I will give three reasons why I feel that the simpler model is preferable at this point:

1. It was suggested by 4 of the 5 methods (ignoring R^2).
2. Forward selection often chooses predictors that are not important, even when the significance level for inclusion is reduced from the default α = 0.50 level.
3. The AIC/BIC forward and backward elimination outputs suggest that neither chin skin fold nor height is significant at any of the standard levels of significance used in practice. Look at the third and fourth steps of forward selection to see this.

Using a mechanical approach, we are led to a model with weight and yrage as predictors of systolic blood pressure. At this point we should closely examine this model. We did this earlier this semester and found that observation 1 (the individual with the largest systolic blood pressure) was fitted poorly by the model and potentially influential.

As noted earlier this semester, model selection methods can be highly influenced by outliers and influential cases. Thus, we should hold out case 1, and re-evaluate the various procedures to see whether case 1 unduly influenced the models selected. I will just note (not shown) that the selection methods point to the same model when case 1 is held out. After deleting case 1, there are no large residuals, extremely influential points, or any gross abnormalities in plots.

Both analyses suggest that the "best model" for predicting systolic blood pressure is

    sysbp = β0 + β1 wt + β2 yrage + ε.

Should case 1 be deleted? I have not fully explored this issue, but I will note that eliminating this case does have a significant impact on the least squares estimates of the regression coefficients, and on predicted values. What do you think?

10.6 Example: Oxygen Uptake

An experiment was conducted to model oxygen uptake (o2up), in milligrams of oxygen per minute, from five chemical measurements: biological oxygen demand (bod), total Kjeldahl nitrogen (tkn), total solids (ts), total volatile solids (tvs), which is a component of ts, and chemical oxygen demand (cod), each measured in milligrams per liter. The data were collected on samples of dairy wastes kept in suspension in water in a laboratory for 220 days. All observations were on the same sample over time. We desire an equation relating o2up to the other variables. The goal is to find variables that should be further studied with the eventual goal of developing a prediction equation (day should not be considered as a predictor).

We are interested in developing a regression model with o2up, or some function of o2up, as a response. The researchers believe that the predictor variables are more likely to be linearly related to log10(o2up) rather than o2up, so log10(o2up) was included in the data set. As a first step, we should plot o2up against the different predictors, and see whether the relationship between o2up and the individual predictors is roughly linear. If not, we will consider appropriate transformations of the response and/or predictors.

    #### Example: Oxygen uptake
    fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch10_oxygen.dat"
    oxygen <- read.table(fn.data, header=TRUE)

       day  bod tkn   ts
    1    0 1125 232 7160
    2    7  920 268 8804
    3   15  835 271 8108
    4   22 1000 237 6370
    5   29 1150 192 6441
    6   37  990 202 5154
    7   44  840 184 5896
    8   58  650 200 5336
    9   65  640 180 5041
    10  72  583 165 5012
    11  80  570 151 4825
    12  86  570 171 4391
    13  93  510 243 4320
    14 100  555 147 3709
    15 107  460 286 3969
    16 122  275 198 3558
    17 129  510 196 4361
    18 151  165 210 3301
    19 171  244 327 2964
    20 220   79 334 2777
    [Columns tvs, cod, o2up, and logup of this listing are not legible in this copy.]

The plots showed an exponential relationship between o2up and the predictors. To shorten the output, only one plot is given. An exponential relationship can often be approximately linearized by transforming o2up to the log10(o2up) scale suggested by the researchers. The extreme skewness in the marginal distribution of o2up also gives an indication that a transformation might be needed.

    # scatterplots
    library(ggplot2)
    p1 <- ggplot(oxygen, aes(x = bod, y = o2up)) + geom_point(size=2)
    p2 <- ggplot(oxygen, aes(x = tkn, y = o2up)) + geom_point(size=2)
    p3 <- ggplot(oxygen, aes(x = ts , y = o2up)) + geom_point(size=2)
    p4 <- ggplot(oxygen, aes(x = tvs, y = o2up)) + geom_point(size=2)
    p5 <- ggplot(oxygen, aes(x = cod, y = o2up)) + geom_point(size=2)
    library(gridExtra)
    # 'top' replaces the older 'main' argument in gridExtra >= 2.0
    grid.arrange(p1, p2, p3, p4, p5, nrow=2
               , top = "Scatterplots of response o2up with each predictor variable")

[Figure: "Scatterplots of response o2up with each predictor variable"; five panels of o2up against bod, tkn, ts, tvs, and cod.]

After transformation, several plots show a roughly linear relationship. A sensible next step would be to build a regression model using log(o2up) as the response variable.

    # scatterplots
    library(ggplot2)
    p1 <- ggplot(oxygen, aes(x = bod, y = logup)) + geom_point(size=2)
    p2 <- ggplot(oxygen, aes(x = tkn, y = logup)) + geom_point(size=2)
    p3 <- ggplot(oxygen, aes(x = ts , y = logup)) + geom_point(size=2)
    p4 <- ggplot(oxygen, aes(x = tvs, y = logup)) + geom_point(size=2)
    p5 <- ggplot(oxygen, aes(x = cod, y = logup)) + geom_point(size=2)
    library(gridExtra)
    grid.arrange(p1, p2, p3, p4, p5, nrow=2
               , top = "Scatterplots of response logup with each predictor variable")

[Figure: "Scatterplots of response logup with each predictor variable"; five panels of logup against bod, tkn, ts, tvs, and cod.]
Correlation between response and each predictor.

    # correlation matrix and associated p-values testing "H0: rho == 0"
    library(Hmisc)
    o.cor <- rcorr(as.matrix(oxygen[, c("logup", "bod", "tkn", "ts", "tvs", "cod")]))
    # print correlations with the response to 3 significant digits
    signif(o.cor$r[1, ], 3)
    ##  logup    bod    tkn     ts    tvs    cod
    ## 1.0000 0.7740 0.0906 0.8350 0.7110 0.8320

I used several of the model selection procedures to select out predictors. The model selection criteria below point to a more careful analysis of the model with ts and cod as predictors. This model has the minimum Cp and is selected by the backward and stepwise procedures. Furthermore, no other model has a substantially higher R^2 or adjusted R^2. The fit of the model will not likely be improved substantially by adding any of the remaining three effects to this model.

    # perform on our model
    o.best <- f.bestsubset(formula(logup ~ bod + tkn + ts + tvs + cod)
                         , oxygen, nbest = 3)
    op <- options();    # saving old options
    options(width=90)   # setting command window output text width wider
    o.best
    ##   (Intercept) bod tkn ts tvs cod SIZE rss r2 adjr2 cp bic
    [Table: 13 rows covering models with 1 to 5 predictors, sorted by BIC; the first row is the model with ts and cod. The numeric entries are not legible in this copy.]
    options(op);        # reset (all) initial options

These comments must be taken with a grain of salt because we have not critically assessed the underlying assumptions (linearity, normality, independence), nor have we considered whether the data contain influential points or outliers.

    lm.oxygen.final <- lm(logup ~ ts + cod, data = oxygen)
    summary(lm.oxygen.final)
    ##
    ## Call:
    ## lm(formula = logup ~ ts + cod, data = oxygen)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -0.3764 -0.0924 -0.0423  0.0626  0.5983
    ##
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -1.37e+00   1.97e-01   -6.96  2.3e-06 ***
    ## ts           1.49e-04   5.49e-05    2.72    0.015 *
    ## cod          1.41e-04   5.32e-05    2.66    0.016 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.253 on 17 degrees of freedom
    ## Multiple R-squared:  0.786, Adjusted R-squared:  0.76
    ## F-statistic: 31.2 on 2 and 17 DF,  p-value: 2.06e-06

The p-values for testing the importance of the individual predictors are small, indicating that both predictors are important. However, two observations (1 and 20) are poorly fitted by the model (both have |r_i| > 2) and are individually most influential (largest D_i's). Recall that this experiment was conducted over 220 days, so these observations were the first and last data points collected. We have little information about the experiment, but it is reasonable to conjecture that the experiment may not have reached a steady state until the second time point, and that the experiment was ended when the experimental material dissipated. The end points of the experiment may not be typical of conditions under which we are interested in modelling oxygen uptake. A sensible strategy here is to delete these points and redo the entire analysis to see whether our model changes noticeably.

    # plot diagnostics
    par(mfrow=c(2,3))
    plot(lm.oxygen.final, which = c(1,4,6))
    plot(oxygen$ts, lm.oxygen.final$residuals, main="Residuals vs ts")
      # horizontal line at zero
      abline(h = 0, col = "gray75")
    plot(oxygen$cod, lm.oxygen.final$residuals, main="Residuals vs cod")
      # horizontal line at zero
      abline(h = 0, col = "gray75")
    # Normality of Residuals
    library(car)
    qqPlot(lm.oxygen.final$residuals, las = 1, id.n = 3, main="QQ Plot")
    ##  1 20  7
    ## 20 19  1
    ## residuals vs order of data
    #plot(lm.oxygen.final$residuals, main="Residuals vs Order of data")
    #  # horizontal line at zero
    #  abline(h = 0, col = "gray75")

[Figure: six diagnostic panels for lm.oxygen.final: Residuals vs Fitted, Cook's distance by observation number, Cook's distance vs leverage h_ii/(1 - h_ii), Residuals vs ts, Residuals vs cod, and QQ Plot. Cases 1, 20, 3, and 7 are labelled.]

Further, the partial residual plot for both ts and cod clearly highlights outlying cases 1 and 20.

    library(car)
    avPlots(lm.oxygen.final, id.n=3)

[Figure: "Added-Variable Plots"; logup | others against ts | others and against cod | others.]
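The two checks used above (a studentized residual with |r_i| > 2, and the largest Cook's distance) can also be computed directly rather than read off the plots. The following is a minimal sketch on simulated data, not the oxygen data: the data, the planted outlier, and all object names are assumptions made for illustration.

```r
# Simulated example: one grossly outlying, high-leverage case is planted
# at the last observation, then recovered with rstudent() and cooks.distance().
set.seed(1)
n <- 20
x <- seq(1, 20, length.out = n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)
y[n] <- y[n] + 5                 # planted outlier at the end of the x range

fit <- lm(y ~ x)

r_stud <- rstudent(fit)          # externally studentized residuals
cooks  <- cooks.distance(fit)    # Cook's distance for each case

flag_resid <- which(abs(r_stud) > 2)   # poorly fitted cases
flag_cook  <- which.max(cooks)         # single most influential case

print(flag_resid)
print(flag_cook)
```

A case that appears in both lists is a natural candidate for the "hold out and refit" strategy described in the text.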
10.6.1 Redo analysis excluding first and last observations

For more completeness, we exclude the end observations and repeat the model selection steps. Summaries from the model selection are provided.

The model selection criteria again suggest ts and cod as predictors. After deleting observations 1 and 20 the R^2 for this two predictor model jumps from 0.786 to 0.892. Also note that the LS coefficients change noticeably after these observations are deleted.

    # exclude observations 1 and 20
    oxygen2 <- oxygen[-c(1,20),]

Correlation between response and each predictor.

    # correlation matrix and associated p-values testing "H0: rho == 0"
    library(Hmisc)
    o.cor <- rcorr(as.matrix(oxygen2[, c("logup", "bod", "tkn", "ts", "tvs", "cod")]))
    # print correlations with the response to 3 significant digits
    signif(o.cor$r[1, ], 3)
    ## logup   bod   tkn    ts   tvs   cod
    ## 1.000 0.813 0.116 0.921 0.717 0.806

    # perform on our model
    o.best <- f.bestsubset(formula(logup ~ bod + tkn + ts + tvs + cod)
                         , oxygen2, nbest = 3)
    op <- options();    # saving old options
    options(width=90)   # setting command window output text width wider
    o.best
    ##   (Intercept) bod tkn ts tvs cod SIZE rss r2 adjr2 cp bic
    [Table: 13 rows covering models with 1 to 5 predictors, sorted by BIC; the first row is the model with ts and cod. The numeric entries are not legible in this copy.]
    options(op);        # reset (all) initial options

Below is the model with ts and cod as predictors, after omitting the end observations. Both predictors are significant at the 0.05 level. Furthermore, there do not appear to be any extreme outliers. The QQ-plot and the plot of studentized residuals against predicted values do not show any extreme abnormalities.

    lm.oxygen2.final <- lm(logup ~ ts + cod, data = oxygen2)
    summary(lm.oxygen2.final)
    ##
    ## Call:
    ## lm(formula = logup ~ ts + cod, data = oxygen2)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -0.2416 -0.0852  0.0100  0.1010  0.2509
    ##
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -1.34e+00   1.34e-01   -9.98  5.2e-08 ***
    ## ts           1.85e-04   3.18e-05    5.82  3.4e-05 ***
    ## cod          8.64e-05   3.52e-05    2.46    0.027 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.144 on 15 degrees of freedom
    ## Multiple R-squared:  0.892, Adjusted R-squared:  0.878
    ## F-statistic: 62.2 on 2 and 15 DF,  p-value: 5.51e-08

    # plot diagnostics
    par(mfrow=c(2,3))
    plot(lm.oxygen2.final, which = c(1,4,6))
    plot(oxygen2$ts, lm.oxygen2.final$residuals, main="Residuals vs ts")
      # horizontal line at zero
      abline(h = 0, col = "gray75")
    plot(oxygen2$cod, lm.oxygen2.final$residuals, main="Residuals vs cod")
      # horizontal line at zero
      abline(h = 0, col = "gray75")
    # Normality of Residuals
    library(car)
    qqPlot(lm.oxygen2.final$residuals, las = 1, id.n = 3, main="QQ Plot")
    ##  6  7 15
    ## 18  1  2
    ## residuals vs order of data
    #plot(lm.oxygen2.final$residuals, main="Residuals vs Order of data")
    #  # horizontal line at zero
    #  abline(h = 0, col = "gray75")

[Figure: six diagnostic panels for lm.oxygen2.final: Residuals vs Fitted, Cook's distance by observation number, Cook's distance vs leverage h_ii/(1 - h_ii), Residuals vs ts, Residuals vs cod, and QQ Plot.]

    library(car)
    avPlots(lm.oxygen2.final, id.n=3)

[Figure: "Added-Variable Plots"; logup | others against ts | others and against cod | others.]

Let us recall that the researcher's primary goal was to identify important predictors of o2up. Regardless of whether we are inclined to include the end observations in the analysis or not, it is reasonable to conclude that ts and cod are useful for explaining the variation in log10(o2up). If these data were the final experiment, I might be inclined to eliminate the end observations and use the following equation to predict oxygen uptake:

    log10(o2up) = −1.335302 + 0.000185 ts + 0.000086 cod.
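Because the fitted equation is on the log10 scale, a prediction of o2up itself requires back-transforming with a power of 10. The sketch below wraps the equation in a small helper; the function name predict_o2up and the input values are hypothetical, chosen only to illustrate the arithmetic.

```r
# Predicted oxygen uptake (mg of oxygen per minute) from the fitted equation
#   log10(o2up) = -1.335302 + 0.000185 ts + 0.000086 cod
predict_o2up <- function(ts, cod) {
  log10_o2up <- -1.335302 + 0.000185 * ts + 0.000086 * cod
  10^log10_o2up    # back-transform from the log10 scale
}

# illustrative inputs in mg/l, inside the range of the observed data
pred <- predict_o2up(ts = 5000, cod = 5000)
print(pred)
```

Note that the back-transformed value estimates a median-type response on the original scale, not the mean, so it should be read as a rough prediction.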
Chapter 11

Logistic Regression

Logistic regression analysis is used for predicting the outcome of a categorical dependent variable based on one or more predictor variables. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Logistic regression measures the relationship between a categorical dependent variable and usually (but not necessarily) one or more continuous independent variables, by converting the dependent variable to probability scores. As such it treats the same set of problems as does probit regression using similar techniques.

Logistic regression is frequently used to refer to the problem in which the dependent variable is binary (that is, the number of available categories is two), and problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.

11.1 Generalized linear model variance and link families

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

In R, the basic tool for fitting generalized linear models is the glm() function, which has the following general structure:

    glm(formula, family, data, weights, subset, ...)

where "..." stands for additional options. The key parameter here is family, which is a simple way of specifying a choice of variance and link functions. Some choices of family are listed in the table.

    Family             Variance           Link
    gaussian           gaussian           identity
    binomial           binomial           logit, probit, or cloglog
    poisson            poisson            log, identity, or sqrt
    Gamma              Gamma              inverse, identity, or log
    inverse.gaussian   inverse.gaussian   1/µ^2
    quasi              user-defined       user-defined

As can be seen, each of the first five choices has an associated variance function (for binomial the binomial variance µ(1 − µ)), and one or more choices of link functions (for binomial the logit, probit, or complementary log-log).

As long as you want the default link, all you have to specify is the family name. If you want an alternative link, you must add a link argument. For example, to do probits you use:

    glm(formula, family = binomial(link = probit))

The last family on the list, quasi, is there to allow fitting user-defined models by maximum quasi-likelihood.

The rest of this chapter concerns logistic regression with a binary response variable.
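To make the family and link arguments concrete, the sketch below fits the same binary-response model twice, once with the default logit link and once with a probit link. The data are simulated purely for illustration; the variable names, sample size, and true coefficients are assumptions and have nothing to do with the data sets in these notes.

```r
# Simulated binary response whose success probability increases with x
set.seed(42)
n <- 500
x <- runif(n, 9, 18)
y <- rbinom(n, size = 1, prob = plogis(-21 + 1.6 * x))

# default link for the binomial family is the logit
fit_logit  <- glm(y ~ x, family = binomial)
# an alternative link is requested through the link argument
fit_probit <- glm(y ~ x, family = binomial(link = probit))

# the link actually used is stored with the fitted family object
print(family(fit_logit)$link)
print(family(fit_probit)$link)

# the coefficients are on different scales, but the fitted
# probabilities from the two links are usually close
print(max(abs(fitted(fit_logit) - fitted(fit_probit))))
```

The coefficients of the two fits are not directly comparable, since each is expressed on the scale of its own link function; comparing fitted probabilities is the more meaningful check.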
11.2 Example: Age of Menarche in Warsaw

The data [1] below are from a study conducted by Milicer and Szczotka on pre-teen and teenage girls in Warsaw, Poland in 1965. The subjects were classified into 25 age categories. The number of girls in each group (Total) and the number that reached menarche (Menarche) at the time of the study were recorded. The age for a group corresponds to the midpoint for the age interval.

[1] Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in 1965. Human Biology 38, 199–203.

    #### Example: Menarche
    # menarche dataset is available in MASS package
    rm(menarche)
    ## Warning: object 'menarche' not found
    library(MASS)
    # these frequencies look better in the table as integers
    menarche$Total    <- as.integer(menarche$Total)
    menarche$Menarche <- as.integer(menarche$Menarche)
    str(menarche)
    ## 'data.frame': 25 obs. of 3 variables:
    ##  $ Age     : num 9.21 10.21 10.58 10.83 11.08 ...
    ##  $ Total   : int 376 200 93 120 90 88 105 111 100 93 ...
    ##  $ Menarche: int 0 0 0 2 2 5 10 17 16 29 ...
    # create estimated proportion of girls reaching menarche for each age group
    menarche$p.hat <- menarche$Menarche / menarche$Total

         Age Total Menarche p.hat
    1   9.21   376        0  0.00
    2  10.21   200        0  0.00
    3  10.58    93        0  0.00
    4  10.83   120        2  0.02
    5  11.08    90        2  0.02
    6  11.33    88        5  0.06
    7  11.58   105       10  0.10
    8  11.83   111       17  0.15
    9  12.08   100       16  0.16
    10 12.33    93       29  0.31
    11 12.58   100       39  0.39
    12 12.83   108       51  0.47
    13 13.08    99       47  0.47
    14 13.33   106       67  0.63
    15 13.58   105       81  0.77
    16 13.83   117       88  0.75
    17 14.08    98       79  0.81
    18 14.33    97       90  0.93
    19 14.58   120      113  0.94
    20 14.83   102       95  0.93
    21 15.08   122      117  0.96
    22 15.33   111      107  0.96
    23 15.58    94       92  0.98
    24 15.83   114      112  0.98
    25 17.58  1049     1049  1.00

The researchers were curious about how the proportion of girls that reached menarche (p̂ = Menarche/Total) varied with age. One could perform a test of homogeneity (Multinomial goodness-of-fit test) by arranging the data as a 2-by-25 contingency table with columns indexed by age and two rows: ROW1 = Menarche, and ROW2 = number that have not reached menarche = (Total − Menarche). A more powerful approach treats these as regression data, using the proportion of girls reaching menarche as the response and age as a predictor.

A plot of the observed proportion p̂ of girls that have reached menarche shows that the proportion increases as age increases, but that the relationship is nonlinear. The observed proportions, which are bounded between zero and one, have a lazy S-shape (a sigmoidal function) when plotted against age. The change in the observed proportions for a given change in age is much smaller when the proportion is near 0 or 1 than when the proportion is near 1/2. This phenomenon is common with regression data where the response is a proportion.

The trend is nonlinear so linear regression is inappropriate. A sensible alternative might be to transform the response or the predictor to achieve near linearity. A common transformation of response proportions following a sigmoidal curve is to the logit scale µ̂ = log_e{p̂/(1 − p̂)}. This transformation is the basis for the logistic regression model. The natural logarithm (base e) is traditionally used in logistic regression.

The logit transformation is undefined when p̂ = 0 or p̂ = 1. To overcome this problem, researchers use the empirical logits, defined by log{(p̂ + 0.5/n)/(1 − p̂ + 0.5/n)}, where n is the sample size or the number of observations on which p̂ is based.

A plot of the empirical logits against age is roughly linear, which supports a logistic transformation for the response.

    library(ggplot2)
    p <- ggplot(menarche, aes(x = Age, y = p.hat))
    p <- p + geom_point()
    p <- p + labs(title = paste("Observed probability of girls reaching menarche,\n",
                                "Warsaw, Poland in 1965", sep=""))
    print(p)
    # empirical logits
    menarche$emp.logit <- log((    menarche$p.hat + 0.5/menarche$Total) /
                              (1 - menarche$p.hat + 0.5/menarche$Total))
    library(ggplot2)
    p <- ggplot(menarche, aes(x = Age, y = emp.logit))
    p <- p + geom_point()
    p <- p + labs(title = "Empirical logits")
    print(p)

[Figure: two panels plotted against Age. Left: "Observed probability of girls reaching menarche, Warsaw, Poland in 1965" (p.hat). Right: "Empirical logits" (emp.logit).]
11.3 Simple logistic regression model

The simple logistic regression model expresses the population proportion p of individuals with a given attribute (called the probability of success) as a function of a single predictor variable X. The model assumes that p is related to X through

    \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X \]

or, equivalently, as

    \[ p = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}. \]

The logistic regression model is a binary response model, where the response for each case falls into one of two exclusive and exhaustive categories, success (cases with the attribute of interest) and failure (cases without the attribute of interest).

The odds of success are p/(1 − p). For example, when p = 1/2 the odds of success are 1 (or 1 to 1). When p = 0.9 the odds of success are 9 (or 9 to 1). The logistic model assumes that the log-odds of success is linearly related to X. Graphs of the logistic model relating p to X are given below. The sign of the slope refers to the sign of β1.

I should write p = p(X) to emphasize that p is the proportion of all individuals with score X that have the attribute of interest. In the menarche data, p = p(X) is the population proportion of girls at age X that have reached menarche.

[Figure: two panels showing the logistic model for positive, zero, and negative slope. Left, "Logit Scale": log-odds against X. Right, "Probability Scale": probability against X.]

The data in a logistic regression problem are often given in summarized or aggregate form:

    X     n     y
    X1    n1    y1
    X2    n2    y2
    ...   ...   ...
    Xm    nm    ym

where yi is the number of individuals with the attribute of interest among ni randomly selected or representative individuals with predictor variable value Xi. For raw data on individual cases, yi = 1 or 0, depending on whether the case at Xi is a success or failure, and the sample size column n is omitted with raw data.

For logistic regression, a plot of the sample proportions p̂i = yi/ni against Xi should be roughly sigmoidal, and a plot of the empirical logits against Xi should be roughly linear. If not, then some other model is probably appropriate. I find the second plot easier to calibrate, but neither plot is very informative when the sample sizes are small, say 1 or 2. (Why?)

There are a variety of other binary response models that are used in practice. The probit regression model or the complementary log-log regression model might be appropriate when the logistic model does not fit the data.

The following section describes the standard MLE strategy for estimating the logistic regression parameters.

11.3.1 Estimating Regression Parameters via LS of empirical logits

(This is a naive method; we will discuss a better way in the next section.)

There are two unknown population parameters in the logistic regression model

    \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X. \]

A simple way to estimate β0 and β1 is by least squares (LS), using the empirical logits as responses and the Xi's as the predictor values.

Below we use standard regression to calculate the LS fit between the empirical logits and age.

    lm.menarche.e.a <- lm(emp.logit ~ Age, data = menarche)
    # LS coefficients
    coef(lm.menarche.e.a)
    ## (Intercept)         Age
    ##     -22.028       1.676

The LS estimates for the menarche data are b0 = −22.03 and b1 = 1.68, which gives the fitted relationship

    \[ \log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = -22.03 + 1.68\,\text{Age} \]

or

    \[ \tilde{p} = \frac{\exp(-22.03 + 1.68\,\text{Age})}{1 + \exp(-22.03 + 1.68\,\text{Age})}, \]

where p̃ is the predicted proportion (under the model) of girls having reached menarche at the given age. I used p̃ to identify a predicted probability, in contrast to p̂ which is the observed proportion at a given age.

The power of the logistic model versus the contingency table analysis discussed earlier is that the model gives estimates for the population proportion reaching menarche at all ages within the observed age range. The observed proportions allow you to estimate only the population proportions at the observed ages.
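As a small illustration of that last point, the fitted LS relationship can be evaluated at any age, including ages between the observed group midpoints. The sketch below uses the rounded estimates b0 = −22.03 and b1 = 1.68 quoted in the text; the function name p_tilde and the chosen ages are hypothetical.

```r
# LS estimates quoted in the text (rounded)
b0 <- -22.03
b1 <-  1.68

# predicted proportion at a given age: inverse logit of the linear predictor
p_tilde <- function(age) {
  eta <- b0 + b1 * age          # fitted log-odds
  exp(eta) / (1 + exp(eta))     # same as plogis(eta)
}

# predictions at ages that need not be observed group midpoints
ages <- c(11, 12, 13, 14, 15)
print(round(p_tilde(ages), 3))

# age at which half of the girls are predicted to have reached menarche:
# log-odds = 0, so age = -b0 / b1
age_half <- -b0 / b1
print(age_half)
```

Because b1 is positive, the predicted proportion increases with age, and the age at which it crosses 1/2 follows directly from setting the log-odds to zero.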
and other equivalent experiments such as counting the number of heads in repeated flips of a coin.3: Simple logistic regression model 11. The response variability depends on the population proportions. Software for maximum likelihood estimation is widely available. so LS and WLS methods are not really needed. Menarche # For our summarized data (with frequencies and totals for each age) # The left-hand side of our formula binds two columns together with cbind(): # the columns are the number of "successes" and "failures". that is. The deviance is small when the data fits the model. A p-value for the deviance is given by the area under the chi-squared curve to the right of D. Suppose that b0 and b1 are the MLEs of β0 and β1. D=2 ˜ ˜ n p n − n p i i i i i i=1 where the fitted probabilities p˜i satisfy   p˜i log = b0 + b1 X i . which suggests that the model is inappropriate.286 Ch 11: Logistic Regression The deviance is an analog of the residual sums of squares in linear regression. ˜ ˜ ˜ ˜ n p n (1 − p ) n p (1 − p ) i i i i i i=1 i=1 i i 11. Alternatively. Large values of D occur when one or more of the observed and fitted proportions are far apart. when the observed and fitted proportions are close together. then D has a chi-squared distribution with m − r degrees of freedom.3 Fitting the Logistic Model by Maximum Likelihood. . 1 − p˜i is used to test the adequacy of the model. where m is the the number of groups and r (here 2) is the number of estimated regression parameters. # For logistic regression with logit link we specify family = binomial.     m  X yi ni − yi yi log + (ni − yi) log .3. The deviance evaluated at the MLEs. If the logistic model holds. The choices for β0 and β1 that minimize the deviance are the parameter values that make the observed and fitted proportions as close together as possible in a “likelihood sense”. A small p-value indicates that the data does not fit the model. 
the fit of the model can be evaluated using the chi-squared approximation to the Pearson X 2 statistic:  X m  m 2 2 X p ˜ p ˜ (yi − nip˜i)2 (y − n ) ((n − y ) − n (1 − )) i i i i i i i 2 X = + = . 967.58 14.01 0.values 0.fit <.11 0.17 3.69 0.98 0.00 0.00 Menarche 0.83 17.00 29.02 0.04 0.00 111.00 67.02 0.83 13.28 0.values pred <.fit = TRUE) menarche$fit <.00 112.98 1.86 −3.06 0.07 0.58 10. 3) This printed summary information is easily interpreted.00 .14 fit −6.72 0.96 0.00 108.65 fitted.58 13.00 0.39 0.97 se.00 92.96 * se.54 3.99 0.06 emp.36 0.94 0.96 * se.00 88.75 0.predict(glm.08 0.60 0.00 2.14 0.11 0.00 1049.00 93.18 0.03 0.fit 0.09 0.14 0.00 5.93 0.31 0.a. and fit.11 −0.lower.12 0.93 0.00 97.81 0. 21 Age 15. { fit.13 0.33 14.63 0.16 0.75 2.a$fitted.20 −4. family = binomial.upper. The fitted probabilities and the limits are stored in columns labeled fitted.43 0.95 0.88 0.38 3.98 0.12 0.975. among girls in the age interval with midpoint 15.81 7. You are 95% confident that the population proportion is between 0.00 90.08) that have reached menarche is 0. For example.00 0.34 1.74 −2.Menarche) ~ Age.58 15. fit.00 200.28 fit.58 Total 376.08 15.50 0.55 −3.00 107.hat 0. A variety of other summaries and diagnostics can be produced.91 0.33 15.00 90.00 100.82 0.15 0.05 0.08 (more precisely.08 12.96 * se.values.00 10.96 0.23 0.00 79.00 39.94 1.00 47.pred$fit menarche$se.00 113.07 0.00 0.00 122.00 fit.00 94.44 −0.09 0.glm(cbind(Menarche.21 10.00 95.08 14.00 51.00 120.08 13.958 and 0.98 0.a <.97 fitted.29 0.38 se.33 11.frame(Age = menarche$Age).90 0.85 0.10 0.53 0.m.83 0.83 12.00 1049.33 −1.00 0.99 −5.30 0.57 −2.00 fit 3.02 0.frame menarche$fitted.upper 0.00 16.40 0.96 fit.14 −2.11 0.16 2.00 106.fit 0.83 14.03 0.07 0.46 fit.02 0.96 p.78 −0.08 Total 122.lower = exp(fit .08 0.10 −0. glm.91 0.00 100.94 0.logit −6.00 p.00 Menarche 117.within(menarche.98 3. se.01 0.63 −0.56 0.83 15.96 0.18 0.94 0.10 0.10 0.49 2.06 0.92 −1.06 0.hat 0. 
the estimated population proportion of girls aged 15.62 −5.54 1.00 93.fit)) }) #round(menarche.21 0.fit)) fit.00 17.00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Age 9.21 10.56 −3.01 0.97 0.glm.57 2.upper 0.upper = exp(fit + 1.00 0.98 0.70 −0.10 1.00 120.02 0.77 0.61 3. menarche) The glm() statement creates an object which we can use to create the fitted probabilities and 95% CIs for the population proportions at the ages in menarche.1.00 98.06 3.16 0.66 0.58 11.00 102.values <.01 0.69 −1.00 117.16 0.63 0.93 0.00 105.96 0.00 0.00 0.06 0.00 105.00 99.00 81.13 0.12 0.46 0.15 0.fit) / (1 + exp(fit .51 −1.lower 0.00 88.11 0.99 1.fit # CI for fitted values menarche <.07 0.96 * se.m.logit 3.pred$se.16 0.00 emp.99 1.33 12.1.79 4. respectively.08 11.08 0.97 0.77 0.00 114.10 0.00 111.98 0.20 1.values 0.99 1.21 −1.07 0.79 0.23 −3.40 2. # put the fitted values in the data.22 0.58 12. Total .lower 0.75 0.04 0. data.83 11.20 4.61 7.96 −3.87 0.18 0.47 0.fit) / (1 + exp(fit + 1.47 0. type = "link".53 0.05 0.15 0.97 0.m.33 13.25 0.00 117.72 2.11.33 0.3: Simple logistic regression model 287 # where logit is the default link function for the binomial family.72 −2.00 2. Menarche) ~ Age.367 Coefficients: Estimate Std.995 -0.m.884 Residual deviance: 26.70 on 25 − 2 = 23 df.01 '*' 0.288 Ch 11: Logistic Regression The summary table gives MLEs and standard errors for the regression parameters. residual D = χ2residual df. codes: 0 '***' 0. then the model does not capture all the features in the data. . The large p-value for D suggests no gross deficiencies with the logistic model.778 Max 1. The p-values are used to test whether the corresponding parameters of the logistic model are zero. or the p-value is too small. Total .490 3Q 0.632 0. Also.5 <2e-16 *** Age 1.771 -27.05 '.036 -0. The data fits the logistic regression model reasonably well. Error z value Pr(>|z|) (Intercept) -21.703 AIC: 114. emp. 
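The back-transformation behind the fit.lower and fit.upper limits can be sketched on its own. The helper below is a minimal illustration, not part of the original notes: the function name logit_ci and the input values are hypothetical. It builds a Wald-type interval on the logit scale and maps it to the probability scale with plogis(), which is the same exp(x) / (1 + exp(x)) transformation written out in the code above.

```r
# Back-transform a logit-scale estimate and its standard error to a
# probability-scale estimate with a Wald-type confidence interval.
# logit_ci is a hypothetical helper name used only for this sketch.
logit_ci <- function(fit, se.fit, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(estimate = plogis(fit),
    lower    = plogis(fit - z * se.fit),
    upper    = plogis(fit + z * se.fit))
}

# Hypothetical inputs, not taken from the menarche output
ci <- logit_ci(fit = 1.2, se.fit = 0.3)
print(round(ci, 3))

# plogis(x) agrees with the explicit inverse-logit formula
stopifnot(isTRUE(all.equal(plogis(1.2), exp(1.2) / (1 + exp(1.2)))))
```

Because the interval is formed on the logit scale and then transformed, the limits always stay between 0 and 1 and are generally not symmetric about the estimated probability.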
The summary table gives MLEs and standard errors for the regression parameters. The z-value column is the parameter estimate divided by its standard error. The p-values are used to test whether the corresponding parameters of the logistic model are zero.

summary(glm.m.a)
##
## Call:
## glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
##     data = menarche)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -2.036  -0.995  -0.490   0.778   1.367
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -21.226      0.771   -27.5   <2e-16 ***
## Age            1.632      0.059    27.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 3693.884  on 24  degrees of freedom
## Residual deviance:   26.703  on 23  degrees of freedom
## AIC: 114.8
##
## Number of Fisher Scoring iterations: 4

If the model is correct and when sample sizes are large, the residual deviance $D$ has an approximate chi-square distribution, $D \sim \chi^2_{\textrm{residual df}}$. If $D$ is too large, or the p-value is too small, then the model does not capture all the features in the data.

The deviance statistic is $D = 26.70$ on $25 - 2 = 23$ df. The large p-value for $D$ suggests no gross deficiencies with the logistic model. The observed and fitted proportions (p.hat and fitted.values in the output table above) are reasonably close at each observed age. Also, emp.logit and fit are close. This is consistent with $D$ being fairly small. The data fits the logistic regression model reasonably well.

# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
glm.m.a$deviance
## [1] 26.7
glm.m.a$df.residual
## [1] 23
dev.p.val <- 1 - pchisq(glm.m.a$deviance, glm.m.a$df.residual)
dev.p.val
## [1] 0.2688

The MLEs $b_0 = -21.23$ and $b_1 = 1.63$ for the intercept and slope are close to the LS estimates of $b_{LS0} = -22.03$ and $b_{LS1} = 1.68$, respectively, from Section 11.3.1. The two estimation methods give similar predicted probabilities here. The MLEs of the predicted probabilities satisfy
\[ \log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = -21.23 + 1.63\,\textrm{Age} \]
or
\[ \tilde{p} = \frac{\exp(-21.23 + 1.63\,\textrm{Age})}{1 + \exp(-21.23 + 1.63\,\textrm{Age})}. \]

library(ggplot2)
p <- ggplot(menarche, aes(x = Age, y = p.hat))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = Age, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = Age, y = fitted.values), color = "red")
# fitted values
p <- p + geom_point(aes(y = fitted.values), color = "red", size=2)
# observed values
p <- p + geom_point(size=2)
p <- p + labs(title = paste("Observed and predicted probability of girls reaching menarche,\n",
                            "Warsaw, Poland in 1965", sep=""))
print(p)

[Figure: Observed and predicted probability of girls reaching menarche, Warsaw, Poland in 1965; p.hat versus Age.]

If the model holds, then a slope of $\beta_1 = 0$ implies that $p$ does not depend on AGE, i.e., the proportion of girls that have reached menarche is identical across age groups. The Wald p-value for the slope is < 0.0001, which leads to rejecting $H_0: \beta_1 = 0$ at any of the usual test levels. The proportion of girls that have reached menarche is not constant across age groups. Again, the power of the model is that it gives you a simple way to quantify the effect of age on the proportion reaching menarche. This is more appealing than testing homogeneity across age groups followed by multiple comparisons.

Wald tests can be performed to test the global null hypothesis, that all non-intercept $\beta$s are equal to zero. This is the logistic regression analog of the overall model F-test in ANOVA and regression. The only predictor is AGE, so the implied test is that the slope of the regression line is zero. The Wald test statistic and p-value reported here are identical to the Wald test and p-value for the AGE effect given in the parameter estimates table. The Wald test can also be used to test specific contrasts between parameters.

# Testing Global Null Hypothesis
library(aod)
coef(glm.m.a)
## (Intercept)         Age
##     -21.226       1.632
# specify which coefficients to test = 0 (Terms = 2:4 would be terms 2, 3, and 4)
wald.test(b = coef(glm.m.a), Sigma = vcov(glm.m.a), Terms = 2:2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 766.3, df = 1, P(> X2) = 0.0

11.4 Example: Leukemia white blood cell types

Feigl and Zelen² reported the survival time in weeks and the white cell blood count (WBC) at time of diagnosis for 33 patients who eventually died of acute leukemia. Each person was classified as AG+ or AG−, indicating the presence or absence of a certain morphological characteristic in the white cells. Four variables are given in the data set: WBC, a binary factor or indicator variable AG (1 for AG+, 0 for AG−), NTOTAL (the number of patients with the given combination of AG and WBC), and NRES (the number of NTOTAL that survived at least one year from the time of diagnosis).

The researchers are interested in modelling the probability $p$ of surviving at least one year as a function of WBC and AG. They believe that WBC should be transformed to a log scale, given the skewness in the WBC values.

² Feigl, P., Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826–838. Survival times are given for 33 patients who died from acute myelogenous leukaemia. Also measured was the patient's white blood cell count at the time of diagnosis. The patients were also factored into 2 groups according to the presence or absence of a morphologic characteristic of white blood cells. Patients termed AG positive were identified by the presence of Auer rods and/or significant granulation of the leukaemic cells in the bone marrow at the time of diagnosis.

#### Example: Leukemia
## Leukemia white blood cell types example
# ntotal = number of patients with IAG and WBC combination
# nres   = number surviving at least one year
# ag     = 1 for AG+, 0 for AG-
# wbc    = white cell blood count
# lwbc   = log white cell blood count
# p.hat  = Empirical Probability
leuk <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_leuk.dat"
                 , header = TRUE)
leuk$ag    <- factor(leuk$ag)
leuk$lwbc  <- log(leuk$wbc)
leuk$p.hat <- leuk$nres / leuk$ntotal
str(leuk)
## 'data.frame': 30 obs. of 6 variables:
##  $ ntotal: int 1 1 1 1 1 1 1 1 3 1 ...
##  $ nres  : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ ag    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
##  $ wbc   : int 75 230 260 430 700 940 1000 1050 10000 300 ...
##  $ lwbc  : num 4.32 5.44 5.56 6.06 6.55 ...
##  $ p.hat : num 1 1 1 1 1 ...

     ntotal  nres  ag    wbc  lwbc  p.hat
 1        1     1   1     75  4.32   1.00
 2        1     1   1    230  5.44   1.00
 3        1     1   1    260  5.56   1.00
 4        1     1   1    430  6.06   1.00
 5        1     1   1    700  6.55   1.00
 6        1     1   1    940  6.85   1.00
 7        1     1   1   1000  6.91   1.00
 8        1     1   1   1050  6.96   1.00
 9        3     1   1  10000  9.21   0.33
10        1     1   0    300  5.70   1.00
11        1     1   0    440  6.09   1.00
12        1     0   1    540  6.29   0.00
13        1     0   1    600  6.40   0.00
14        1     0   1   1700  7.44   0.00
15        1     0   1   3200  8.07   0.00
16        1     0   1   3500  8.16   0.00
17        1     0   1   5200  8.56   0.00
18        1     0   0    150  5.01   0.00
19        1     0   0    400  5.99   0.00
20        1     0   0    530  6.27   0.00
21        1     0   0    900  6.80   0.00
22        1     0   0   1000  6.91   0.00
23        1     0   0   1900  7.55   0.00
24        1     0   0   2100  7.65   0.00
25        1     0   0   2600  7.86   0.00
26        1     0   0   2700  7.90   0.00
27        1     0   0   2800  7.94   0.00
28        1     0   0   3100  8.04   0.00
29        1     0   0   7900  8.97   0.00
30        2     0   0  10000  9.21   0.00

As an initial step in the analysis, consider the following model:
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\textrm{LWBC} + \beta_2\,\textrm{AG}, \]
where LWBC = log(WBC). The model is best understood by separating the AG+ and AG− cases. For AG− individuals, AG=0 so the model reduces to
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\textrm{LWBC} + \beta_2 \cdot 0 = \beta_0 + \beta_1\,\textrm{LWBC}. \]
For AG+ individuals, AG=1 and the model implies
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\textrm{LWBC} + \beta_2 \cdot 1 = (\beta_0 + \beta_2) + \beta_1\,\textrm{LWBC}. \]

The model without AG (i.e., $\beta_2 = 0$) is a simple logistic model where the log-odds of surviving one year is linearly related to LWBC, and is independent of AG. The reduced model with $\beta_2 = 0$ implies that there is no effect of the AG level on the survival probability once LWBC has been taken into account.

Including the binary predictor AG in the model implies that there is a linear relationship between the log-odds of surviving one year and LWBC, with a constant slope for the two AG levels. This model includes an effect for the AG morphological factor, but more general models are possible. A natural extension would be to include a product or interaction effect, a point that I will return to momentarily.

The parameters are easily interpreted: $\beta_0$ and $\beta_0 + \beta_2$ are intercepts for the population logistic regression lines for AG− and AG+, respectively. The lines have a common slope, $\beta_1$. The $\beta_2$ coefficient for the AG indicator is the difference between intercepts for the AG+ and AG− regression lines. A picture of the assumed relationship is given below for $\beta_1 < 0$. The population regression lines are parallel on the logit scale only, but the order between AG groups is preserved on the probability scale.

[Figure: assumed relationship for $\beta_1 < 0$. Panels: Logit Scale (Log-Odds versus LWBC) and Probability Scale (Probability versus LWBC), each with lines labeled IAG=0 and IAG=1.]

Before looking at output for the equal slopes model, note that the data set has 30 distinct AG and LWBC combinations, or 30 "groups" or samples. Only two samples have more than 1 observation. The majority of the observed proportions surviving at least one year (number surviving ≥ 1 year/group sample size) are 0 (i.e., 0/1) or 1 (i.e., 1/1). This sparseness of the data makes it difficult to graphically assess the suitability of the logistic model (because the estimated proportions are almost all 0 or 1). Although significance tests on the regression coefficients do not require large group sizes, the chi-squared approximation to the deviance statistic is suspect in sparse data settings. With small group sizes as we have here, most researchers would not interpret the p-value for $D$ literally. Instead, they would use the p-values to informally check the fit of the model. Diagnostics would be used to highlight problems with the model.

glm.i.l <- glm(cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, leuk)
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev.p.val <- 1 - pchisq(glm.i.l$deviance, glm.i.l$df.residual)
dev.p.val
## [1] 0.6843

The large p-value for $D$ indicates that there are no gross deficiencies with the model. Recall that the Testing Global Null Hypothesis gives p-values for testing the hypothesis that the regression coefficients are zero for each predictor in the model. The two predictors are LWBC and AG, so the small p-values indicate that LWBC or AG, or both, are important predictors of survival. The p-values in the estimates table suggest that LWBC and AG are both important. If either predictor was insignificant, I would consider refitting the model omitting the least significant effect, as in regression.

# Testing Global Null Hypothesis
library(aod)
coef(glm.i.l)
## (Intercept)         ag1        lwbc
##       5.543       2.520      -1.109
# specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
wald.test(b = coef(glm.i.l), Sigma = vcov(glm.i.l), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 8.2, df = 2, P(> X2) = 0.017

Given that the model fits reasonably well, a test of $H_0: \beta_2 = 0$ might be a primary interest here. This checks whether the regression lines are identical for the two AG levels, which is a test for whether AG affects the survival probability, after taking LWBC into account. This test is rejected at any of the usual significance levels, suggesting that the AG level affects the survival probability (assuming a very specific model).

summary(glm.i.l)
##
## Call:
## glm(formula = cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial,
##     data = leuk)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.660  -0.659  -0.278   0.644   1.713
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    5.543      3.022    1.83    0.067 .
## ag1            2.520      1.091    2.31    0.021 *
## lwbc          -1.109      0.461   -2.41    0.016 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 38.191  on 29  degrees of freedom
## Residual deviance: 23.014  on 27  degrees of freedom
## AIC: 30.64
##
## Number of Fisher Scoring iterations: 5

A plot of the predicted survival probabilities as a function of LWBC, using AG as the plotting symbol, indicates that the probability of surviving at least one year from the time of diagnosis is a decreasing function of LWBC. For a given LWBC the survival probability is greater for AG+ patients than for AG− patients. This tendency is consistent with the observed proportions, which show little information about the exact form of the trend.

# put the fitted values in the data.frame
leuk$fitted.values <- glm.i.l$fitted.values
pred <- predict(glm.i.l, data.frame(lwbc = leuk$lwbc, ag = leuk$ag), type = "link"
              , se.fit = TRUE)
leuk$fit <- pred$fit
leuk$se.fit <- pred$se.fit
# CI for fitted values
leuk <- within(leuk, {
  fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
  fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
})
#round(leuk, 3)

library(ggplot2)
p <- ggplot(leuk, aes(x = lwbc, y = p.hat, colour = ag, fill = ag))
# predicted curve and point-wise 95% CI
p <- p + geom_ribbon(aes(x = lwbc, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
p <- p + geom_line(aes(x = lwbc, y = fitted.values))
# fitted values
p <- p + geom_point(aes(y = fitted.values), size=2)
# observed values
p <- p + geom_point(size = 2, alpha = 0.5)
p <- p + labs(title = "Observed and predicted probability of 1+ year survival")
print(p)
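The equal-slopes fit in this example can also be tried without downloading the data file. The sketch below is an illustration rather than the authors' code: it types in the 30 groups from the data listing in this section (the object names fit, new, p.new, and or.ag are introduced here for illustration), refits the same model formula, and compares the two AG groups at an arbitrarily chosen LWBC of 7.

```r
# Leukemia data transcribed from the listing in this section (30 AG/WBC groups)
leuk <- data.frame(
  ntotal = c(rep(1, 8), 3, rep(1, 20), 2),
  nres   = c(rep(1, 11), rep(0, 19)),
  ag     = factor(c(rep(1, 9), 0, 0, rep(1, 6), rep(0, 13))),
  wbc    = c(75, 230, 260, 430, 700, 940, 1000, 1050, 10000, 300, 440,
             540, 600, 1700, 3200, 3500, 5200, 150, 400, 530, 900, 1000,
             1900, 2100, 2600, 2700, 2800, 3100, 7900, 10000)
)
leuk$lwbc <- log(leuk$wbc)

# Same model formula as glm.i.l: common LWBC slope, separate AG intercepts
fit <- glm(cbind(nres, ntotal - nres) ~ ag + lwbc, family = binomial, data = leuk)
print(round(coef(fit), 3))

# Predicted survival probabilities for AG- and AG+ at the same LWBC
# (LWBC = 7 is an arbitrary value chosen for illustration)
new   <- data.frame(ag = factor(c("0", "1"), levels = levels(leuk$ag)), lwbc = 7)
p.new <- predict(fit, new, type = "response")
print(round(p.new, 3))

# The ratio of the two odds equals exp(AG coefficient), whatever LWBC is used
odds  <- p.new / (1 - p.new)
or.ag <- unname(odds[2] / odds[1])
print(c(odds.ratio = or.ag, exp.coef = unname(exp(coef(fit)["ag1"]))))
```

Changing lwbc = 7 to any other value changes the two predicted probabilities but leaves the odds ratio unchanged, which is what parallel lines on the logit scale imply.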
[Figure: Observed and predicted probability of 1+ year survival; p.hat versus lwbc, with color indicating ag (0 or 1).]

     ntotal  nres  ag    wbc  lwbc  p.hat  fitted.values    fit  se.fit  fit.upper  fit.lower
 1        1     1   1     75  4.32   1.00           0.96   3.28    1.44       1.00       0.61
 2        1     1   1    230  5.44   1.00           0.88   2.03    0.99       0.98       0.52
 3        1     1   1    260  5.56   1.00           0.87   1.90    0.94       0.98       0.51
 4        1     1   1    430  6.06   1.00           0.79   1.34    0.78       0.95       0.45
 5        1     1   1    700  6.55   1.00           0.69   0.80    0.66       0.89       0.38
 6        1     1   1    940  6.85   1.00           0.62   0.47    0.61       0.84       0.33
 7        1     1   1   1000  6.91   1.00           0.60   0.40    0.61       0.83       0.31
 8        1     1   1   1050  6.96   1.00           0.59   0.35    0.60       0.82       0.30
 9        3     1   1  10000  9.21   0.33           0.10  -2.15    1.12       0.51       0.01
10        1     1   0    300  5.70   1.00           0.31  -0.78    0.87       0.72       0.08
11        1     1   0    440  6.09   1.00           0.23  -1.21    0.83       0.61       0.06
12        1     0   1    540  6.29   0.00           0.75   1.09    0.72       0.92       0.42
13        1     0   1    600  6.40   0.00           0.73   0.97    0.69       0.91       0.41
14        1     0   1   1700  7.44   0.00           0.45  -0.18    0.61       0.73       0.20
15        1     0   1   3200  8.07   0.00           0.29  -0.89    0.73       0.63       0.09
16        1     0   1   3500  8.16   0.00           0.27  -0.99    0.75       0.62       0.08
17        1     0   1   5200  8.56   0.00           0.19  -1.42    0.88       0.57       0.04
18        1     0   0    150  5.01   0.00           0.50  -0.01    1.02       0.88       0.12
19        1     0   0    400  5.99   0.00           0.25  -1.10    0.84       0.63       0.06
20        1     0   0    530  6.27   0.00           0.20  -1.41    0.83       0.55       0.05
21        1     0   0    900  6.80   0.00           0.12  -2.00    0.86       0.42       0.02
22        1     0   0   1000  6.91   0.00           0.11  -2.12    0.87       0.40       0.02
23        1     0   0   1900  7.55   0.00           0.06  -2.83    1.01       0.30       0.01
24        1     0   0   2100  7.65   0.00           0.05  -2.94    1.03       0.29       0.01
25        1     0   0   2600  7.86   0.00           0.04  -3.18    1.09       0.26       0.00
26        1     0   0   2700  7.90   0.00           0.04  -3.22    1.11       0.26       0.00
27        1     0   0   2800  7.94   0.00           0.04  -3.26    1.12       0.26       0.00
28        1     0   0   3100  8.04   0.00           0.03  -3.37    1.15       0.25       0.00
29        1     0   0   7900  8.97   0.00           0.01  -4.41    1.48       0.18       0.00
30        2     0   0  10000  9.21   0.00           0.01  -4.67    1.57       0.17       0.00

The estimated survival probabilities satisfy
\[ \log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = 5.54 - 1.11\,\textrm{LWBC} + 2.52\,\textrm{AG}. \]
For AG− individuals with AG=0, this reduces to
\[ \log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = 5.54 - 1.11\,\textrm{LWBC}, \]
or equivalently,
\[ \tilde{p} = \frac{\exp(5.54 - 1.11\,\textrm{LWBC})}{1 + \exp(5.54 - 1.11\,\textrm{LWBC})}. \]
For AG+ individuals with AG=1,
\[ \log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = 5.54 - 1.11\,\textrm{LWBC} + 2.52(1) = 8.06 - 1.11\,\textrm{LWBC}, \]
or
\[ \tilde{p} = \frac{\exp(8.06 - 1.11\,\textrm{LWBC})}{1 + \exp(8.06 - 1.11\,\textrm{LWBC})}. \]

Using the logit scale, the difference between AG+ and AG− individuals in the estimated log-odds of surviving at least one year, at a fixed but arbitrary LWBC, is the estimated AG regression coefficient
\[ (8.06 - 1.11\,\textrm{LWBC}) - (5.54 - 1.11\,\textrm{LWBC}) = 2.52. \]
Using properties of exponential functions, the odds that an AG+ patient lives at least one year are exp(2.52) = 12.42 times larger than the odds that an AG− patient lives at least one year, regardless of LWBC (12.42 is computed from the unrounded coefficient, as in the odds ratio output below). This summary, and a CI for the AG odds ratio, is given in the odds ratio output below. Similarly, the estimated odds ratio of 0.33 for LWBC implies that the odds of surviving at least one year are reduced by a factor of 3 for each unit increase of LWBC.

We can use the confint() function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.

## CIs using profiled log-likelihood
confint(glm.i.l)
## Waiting for profiling to be done...
##               2.5 % 97.5 %
## (Intercept)  0.1596 12.452
## ag1          0.5993  5.015
## lwbc        -2.2072 -0.332

We can also get CIs based on just the standard errors by using the default method.

## CIs using standard errors
confint.default(glm.i.l)
##               2.5 %  97.5 %
## (Intercept) -0.3804 11.4671
## ag1          0.3819  4.6572
## lwbc        -2.0122 -0.2053

You can also exponentiate the coefficients and confidence interval bounds and interpret them as odds ratios.

## coefficients and 95% CI
cbind(OR = coef(glm.i.l), confint(glm.i.l))
## Waiting for profiling to be done...
##                 OR   2.5 % 97.5 %
## (Intercept)  5.543  0.1596 12.452
## ag1          2.520  0.5993  5.015
## lwbc        -1.109 -2.2072 -0.332

## odds ratios and 95% CI
exp(cbind(OR = coef(glm.i.l), confint(glm.i.l)))
## Waiting for profiling to be done...
##                 OR 2.5 %    97.5 %
## (Intercept) 255.53 1.173 2.559e+05
## ag1          12.42 1.821 1.506e+02
## lwbc          0.33 0.110 7.175e-01

Although the equal slopes model appears to fit well, a more general model might fit better. A natural generalization here would be to add an interaction, or product term, AG × LWBC to the model. The logistic model with an AG effect and the AG × LWBC interaction is equivalent to fitting separate logistic regression lines to the two AG groups. This interaction model provides an easy way to test whether the slopes are equal across AG levels. I will note that the interaction term is not needed here.

Interpreting odds ratios in logistic regression

Let's begin with probability³. Let's say that the probability of success is 0.8, thus $p = 0.8$. Then the probability of failure is $q = 1 - p = 0.2$. The odds of success are defined as odds(success) = p/q = 0.8/0.2 = 4, that is, the odds of success are 4 to 1. The odds of failure would be odds(failure) = q/p = 0.2/0.8 = 0.25, that is, the odds of failure are 1 to 4. Next, let's compute the odds ratio by OR = odds(success)/odds(failure) = 4/0.25 = 16. The interpretation of this odds ratio would be that the odds of success are 16 times greater than for failure. Now if we had formed the odds ratio the other way around, with odds of failure in the numerator, we would have gotten OR = odds(failure)/odds(success) = 0.25/4 = 0.0625.

³ Borrowed graciously from UCLA Academic Technology Services at http://www.ats.ucla.edu/stat/sas/faq/oratio.htm

Another example

This example is adapted from Pedhazur (1997). Suppose that seven out of 10 males are admitted to an engineering school while three of 10 females are admitted. The probabilities for admitting a male are $p = 7/10 = 0.7$ and $q = 1 - 0.7 = 0.3$. Here are the same probabilities for females, $p = 3/10 = 0.3$ and $q = 1 - 0.3 = 0.7$. Now we can use the probabilities to compute the admission odds for both males and females, odds(male) = 0.7/0.3 = 2.33333 and odds(female) = 0.3/0.7 = 0.42857. Next, we compute the odds ratio for admission, OR = 2.3333/0.42857 = 5.44. Thus, the odds of a male being admitted are 5.44 times greater than for a female.

Leukemia example

In the example above, the odds of surviving at least one year are 12.42 times as large for AG+ as for AG−, and are multiplied by 0.33 (that's a decrease) for every unit increase in lwbc.

Example: Mortality of confused flour beetles

This example illustrates a quadratic logistic model. The aim of an experiment originally reported by Strand (1930) and quoted by Bliss (1935) was to assess the response of the confused flour beetle, Tribolium confusum, to gaseous carbon disulphide (CS2). In the experiment, prescribed volumes of liquid carbon disulphide were added to flasks in which a tubular cloth cage containing a batch of about thirty beetles was suspended. Duplicate batches of beetles were used for each concentration of CS2. At the end of a five-hour period, the proportion killed was recorded and the actual concentration of gaseous CS2 in the flask, measured in mg/l, was determined by a volumetric analysis. The mortality data are given in the table below.
#### Example: Beetles
## Beetles data set
# conc = CS2 concentration
# y    = number of beetles killed
# n    = number of beetles exposed
# rep  = Replicate number (1 or 2)
beetles <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_beetles.dat", header = TRUE)
beetles$rep <- factor(beetles$rep)

     conc   y   n  rep
 1  49.06   2  29    1
 2  52.99   7  30    1
 3  56.91   9  28    1
 4  60.84  14  27    1
 5  64.76  23  30    1
 6  68.69  29  31    1
 7  72.61  29  30    1
 8  76.54  29  29    1
 9  49.06   4  30    2
10  52.99   6  30    2
11  56.91   9  34    2
12  60.84  14  29    2
13  64.76  29  33    2
14  68.69  24  28    2
15  72.61  32  32    2
16  76.54  31  31    2

beetles$conc2 <- beetles$conc^2       # for quadratic term (making coding a little easier)
beetles$p.hat <- beetles$y / beetles$n # observed proportion of successes
# empirical logits
beetles$emp.logit <- log((    beetles$p.hat + 0.5/beetles$n) /
                         (1 - beetles$p.hat + 0.5/beetles$n))
#str(beetles)

Plot the observed probability of mortality and the empirical logits with linear and quadratic LS fits (which are not the same as the logistic MLE fits).

library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = p.hat, shape = rep))
# observed values
p <- p + geom_point(color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Observed mortality, probability scale")
print(p)

library(ggplot2)
p <- ggplot(beetles, aes(x = conc, y = emp.logit))
p <- p + geom_smooth(method = "lm", colour = "red", se = FALSE)
p <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2), colour = "blue", se = FALSE)
# observed values
p <- p + geom_point(aes(shape = rep), color = "black", size = 3, alpha = 0.5)
p <- p + labs(title = "Empirical logit with `naive' LS fits (not MLE)")
print(p)

[Figure: two panels. Left: Observed mortality, probability scale (p.hat versus conc, plotting symbol by rep). Right: Empirical logit with `naive' LS fits (not MLE) (emp.logit versus conc, plotting symbol by rep).]

In a number of articles that refer to these data, the responses from the first two concentrations are omitted because of apparent non-linearity.
Bliss himself remarks that . . . in comparison with the remaining observations, the two lowest concentrations gave an exceptionally high kill. Over the remaining concentrations, the plotted values seemed to form a moderately straight line, so that the data were handled as two separate sets, only the results at 56.91 mg of CS2 per litre being included in both sets. However, there does not appear to be any biological motivation for this and so here they are retained in the data set. Combining the data from the two replicates and plotting the empirical logit of the observed proportions against concentration gives a relationship that is better fit by a quadratic than a linear relationship,   p log = β0 + β1X + β2X 2. 1−p The right plot below shows the linear and quadratic model fits to the observed values with point-wise 95% confidence bands on the logit scale, and on the left is the same on the proportion scale. 11.4: Example: Leukemia white blood cell types 303 # fit logistic regression to create lines on plots below # linear glm.beetles1 <- glm(cbind(y, n - y) ~ conc, family = binomial, beetles) # quadratic glm.beetles2 <- glm(cbind(y, n - y) ~ conc + conc2, family = binomial, beetles) ## put model fits for two models together beetles1 <- beetles # put the fitted values in the data.frame beetles1$fitted.values <- glm.beetles1$fitted.values pred <- predict(glm.beetles1, data.frame(conc = beetles1$conc), type = "link", se.fit = TRUE) beetles1$fit <- pred$fit beetles1$se.fit <- pred$se.fit # CI for fitted values beetles1 <- within(beetles1, { fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit)) fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit)) }) beetles1$modelorder <- "linear" beetles2 <- beetles # put the fitted values in the data.frame beetles2$fitted.values <- glm.beetles2$fitted.values pred <- predict(glm.beetles2, data.frame(conc = beetles2$conc, conc2 = beetles2$conc2), type = beetles2$fit <- pred$fit beetles2$se.fit <- 
pred$se.fit # CI for fitted values beetles2 <- within(beetles2, { fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit)) fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit)) }) beetles2$modelorder <- "quadratic" beetles.all <- rbind(beetles1, beetles2) beetles.all$modelorder <- factor(beetles.all$modelorder) # plot on logit and probability scales library(ggplot2) p <- ggplot(beetles.all, aes(x = conc, y = p.hat, shape = rep, colour = modelorder, fill = mod # predicted curve and point-wise 95% CI p <- p + geom_ribbon(aes(x = conc, ymin = fit.lower, ymax = fit.upper), linetype = 0, alpha = p <- p + geom_line(aes(x = conc, y = fitted.values, linetype = modelorder), size = 1) # fitted values p <- p + geom_point(aes(y = fitted.values), size=2) # observed values p <- p + geom_point(color = "black", size = 3, alpha = 0.5) p <- p + labs(title = "Observed and predicted mortality, probability scale") print(p) library(ggplot2) 304 Ch 11: Logistic Regression p <- ggplot(beetles.all, aes(x = conc, y = emp.logit, shape = rep, colour = modelorder, fill = # predicted curve and point-wise 95% CI p <- p + geom_ribbon(aes(x = conc, ymin = fit - 1.96 * se.fit, ymax = fit + 1.96 * se.fit), li p <- p + geom_line(aes(x = conc, y = fit, linetype = modelorder), size = 1) # fitted values p <- p + geom_point(aes(y = fit), size=2) # observed values p <- p + geom_point(color = "black", size = 3, alpha = 0.5) p <- p + labs(title = "Observed and predicted mortality, logit scale") print(p) Observed and predicted mortality, probability scale 1.00 ● ● Observed and predicted mortality, logit scale 7.5 ● ● ● ● ● ● ● 5.0 0.75 ● ● modelorder linear ● quadratic p.hat ● ● 0.50 rep ● ● emp.logit ● ● linear ● quadratic ● ● 2.5 rep 1 ● ● ● 2 1 2 ● ● ● 0.0 ● modelorder ● ● 0.25 ● ● ● ● ● ● −2.5 ● ● 0.00 50 60 70 conc 11.5 50 60 70 conc Example: The UNM Trauma Data The data to be analyzed here were collected on 3132 patients admitted to The University of New Mexico Trauma 
Center between the years 1991 and 1994. For each patient, the attending physician recorded their age, their revised trauma score (RTS), their injury severity score (ISS), whether their injuries were blunt (i.e., the result of a car crash: BP=0) or penetrating (i.e., gunshot/knife wounds: BP=1), and whether they eventually survived their injuries (SURV=0 if not, SURV=1 if survived). Approximately 10% of patients admitted to the UNM Trauma Center eventually die from their injuries.

The ISS is an overall index of a patient's injuries, based on the approximately 1300 injuries cataloged in the Abbreviated Injury Scale. The ISS can take on values from 0 for a patient with no injuries to 75 for a patient with 3 or more life threatening injuries. The ISS is the standard injury index used by trauma centers throughout the U.S. The RTS is an index of physiologic injury, and is constructed as a weighted average of an incoming patient's systolic blood pressure, respiratory rate, and Glasgow Coma Scale. The RTS can take on values from 0 for a patient with no vital signs to 7.84 for a patient with normal vital signs.

Champion et al. (1981) proposed a logistic regression model to estimate the probability of a patient's survival as a function of RTS, the injury severity score ISS, and the patient's age, which is used as a surrogate for physiologic reserve. Subsequent survival models included the binary effect BP as a means to differentiate between blunt and penetrating injuries.

We will develop a logistic model for predicting survival from ISS, AGE, BP, and RTS, and nine body regions. Data on the number of severe injuries in each of the nine body regions is also included in the database, so we will also assess whether these features have any predictive power.
The following labels were used to identify the number of severe injuries in the nine regions: AS = head, BS = face, CS = neck, DS = thorax, ES = abdomen, FS = spine, GS = upper extremities, HS = lower extremities, and JS = skin.

    #### Example: UNM Trauma Data
    trauma <- read.table("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_trauma.dat"
                       , header = TRUE)
    ## Variables
    # surv = survival (1 if survived, 0 if died)
    # rts  = revised trauma score (range: 0 no vital signs to 7.84 normal vital signs)
    # iss  = injury severity score (0 no injuries to 75 for 3 or more life threatening injuries)
    # bp   = blunt or penetrating injuries (e.g., car crash BP=0 vs gunshot/knife wounds BP=1)
    # Severe injuries: add the severe injuries 3--6 to make summary variables
    trauma <- within(trauma, {
      as = a3 + a4 + a5 + a6 # as = head
      bs = b3 + b4 + b5 + b6 # bs = face
      cs = c3 + c4 + c5 + c6 # cs = neck
      ds = d3 + d4 + d5 + d6 # ds = thorax
      es = e3 + e4 + e5 + e6 # es = abdomen
      fs = f3 + f4 + f5 + f6 # fs = spine
      gs = g3 + g4 + g5 + g6 # gs = upper extremities
      hs = h3 + h4 + h5 + h6 # hs = lower extremities
      js = j3 + j4 + j5 + j6 # js = skin
    })
    # keep only columns of interest
    names(trauma)
    ##  [1] "id"   "surv" "a1"   "a2"   "a3"   "a4"   "a5"    "a6"    "b1"   "b2"
    ## [11] "b3"   "b4"   "b5"   "b6"   "c1"   "c2"   "c3"    "c4"    "c5"   "c6"
    ## [21] "d1"   "d2"   "d3"   "d4"   "d5"   "d6"   "e1"    "e2"    "e3"   "e4"
    ## [31] "e5"   "e6"   "f1"   "f2"   "f3"   "f4"   "f5"    "f6"    "g1"   "g2"
    ## [41] "g3"   "g4"   "g5"   "g6"   "h1"   "h2"   "h3"    "h4"    "h5"   "h6"
    ## [51] "j1"   "j2"   "j3"   "j4"   "j5"   "j6"   "iss"   "iciss" "bp"   "rts"
    ## [61] "age"  "prob" "js"   "hs"   "gs"   "fs"   "es"    "ds"    "cs"   "bs"
    ## [71] "as"
    trauma <- subset(trauma, select = c(id, surv, as:js, iss:prob))
    head(trauma)
    ##   id surv as bs cs ds es fs gs hs js iss iciss bp rts age prob
    #str(trauma)

I made side-by-side boxplots of the distributions of ISS, AGE, RTS, and AS through JS for the survivors and non-survivors. In several body regions the number of injuries is limited, so these boxplots are not very enlightening. Survivors tend to have lower ISS scores, tend to be slightly younger, tend to have higher RTS scores, and tend to have fewer severe head (AS) and abdomen injuries (ES) than non-survivors. The importance of the effects individually towards predicting survival is directly related to the separation between the survivors and non-survivors scores.

    # Create boxplots for each variable by survival
    library(reshape2)
    trauma.long <- melt(trauma, id.vars = c("id", "surv", "prob"))
    # Plot the data using ggplot
    library(ggplot2)
    p <- ggplot(trauma.long, aes(x = factor(surv), y = value))
    # boxplot, size=.75 to stand out behind CI
    p <- p + geom_boxplot(size = 0.75, alpha = 0.5)
    # points for observed data
    p <- p + geom_point(position = position_jitter(w = 0.05, h = 0), alpha = 0.1)
    # diamond at mean for each group
    p <- p + stat_summary(fun.y = mean, geom = "point", shape = 18, size = 6,
                          alpha = 0.75, colour = "red")
    # confidence limits based on normal distribution
    p <- p + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar",
                          width = .2, alpha = 0.8)
    p <- p + facet_wrap( ~ variable, scales = "free_y", ncol = 4)
    p <- p + labs(title = "Boxplots of variables by survival")
    print(p)

[Figure: "Boxplots of variables by survival", one panel per variable (as, bs, cs, ds, es, fs, gs, hs, js, iss, iciss, bp, rts, age), value vs. factor(surv).]

11.5.1 Selecting Predictors in the Trauma Data

The same automated methods for model building are available for the glm() procedure, including backward elimination, forward selection, and stepwise methods, among others. Revisit Chapter 10 for more information. Below we perform a stepwise selection using AIC, starting with a full model having 13 effects: ISS, BP, RTS, AGE, and AS–JS.

In our previous logistic regression analyses, the cases in the data set were pre-aggregated into groups of observations having identical levels of the predictor variables. The numbers of cases in the success category and the group sample sizes were specified in the model statement, along with the names of the predictors. The trauma data set, which is not reproduced here, is raw data consisting of one record per patient (i.e., 3132 lines). The logistic model is fitted to data on individual cases by specifying the binary response variable (SURV) with successes and 1 − SURV failures with the predictors on the right-hand side of the formula. Keep in mind that we are defining the logistic model to model the success category, so we are modeling the probability of surviving.

As an aside, there are two easy ways to model the probability of dying (which we don't do below). The first is to swap the order the response is specified in the formula: cbind(1 - surv, surv). The second is to convert a model for the log-odds of surviving to a model for the log-odds of dying by simply changing the sign of each regression coefficient in the model.

I only included the summary table from the backward elimination, and information on the fit of the selected model.

    glm.tr <- glm(cbind(surv, 1 - surv) ~ as + bs + cs + ds + es + fs + gs + hs + js
                                        + iss + rts + age + bp
                , family = binomial, trauma)
    # option: trace = 0 doesn't show each step of the automated selection
    glm.tr.red.AIC <- step(glm.tr, direction = "both", trace = 0)
    # the anova object provides a summary of the selection steps in order
    glm.tr.red.AIC$anova
    ## [Selection table: eight single-term deletions (as, bs, cs, ds, fs, gs, hs, js).
    ##  Residual deviance goes from 869.4 on 3118 df (AIC 897.4) for the full model
    ##  to 875.4 on 3126 df (AIC 887.4) for the selected model.]

The final model includes effects for ES (number of severe abdominal injuries), ISS, RTS, AGE, and BP. All of the effects in the selected model are significant at the 5% level.

    summary(glm.tr.red.AIC)
    ##
    ## Call:
    ## glm(formula = cbind(surv, 1 - surv) ~ es + iss + rts + age + bp,
    ##     family = binomial, data = trauma)
    ##
    ## Deviance Residuals:
    ##    Min      1Q  Median      3Q     Max
    ## -3.155   0.099   0.143   0.232   3.445
    ##
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)  0.35584    0.44294    0.80    0.422
    ## es          -0.46132    0.11010   -4.19  2.8e-05 ***
    ## iss         -0.05692    0.00741   -7.68  1.6e-14 ***
    ## rts          0.84314    0.05534   15.24  < 2e-16 ***
    ## age         -0.04971    0.00529   -9.39  < 2e-16 ***
    ## bp          -0.63514    0.24960   -2.54    0.011 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ##     Null deviance: 1825.37  on 3131  degrees of freedom
    ## Residual deviance:  875.43  on 3126  degrees of freedom
    ## AIC: 887.4
    ##
    ## Number of Fisher Scoring iterations: 7

The p-value for D is large, indicating no gross deficiencies with the selected model.

    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev.p.val <- 1 - pchisq(glm.tr.red.AIC$deviance, glm.tr.red.AIC$df.residual)
    dev.p.val
    ## [1] 1

Letting p be the probability of survival, the estimated survival probability is given by

$$\log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = 0.3558 - 0.4613\,\mathrm{ES} - 0.6351\,\mathrm{BP} - 0.0569\,\mathrm{ISS} + 0.8431\,\mathrm{RTS} - 0.0497\,\mathrm{AGE}.$$

Let us interpret the sign of the coefficients, and the odds ratios, in terms of the impact that individual predictors have on the survival probability.

    ## coefficients and 95% CI
    cbind(OR = coef(glm.tr.red.AIC), confint(glm.tr.red.AIC))
    ## Waiting for profiling to be done...
    ##                   OR    2.5 %   97.5 %
    ## (Intercept)  0.35584 -0.51977  1.21869
    ## es          -0.46132 -0.67604 -0.24308
    ## iss         -0.05692 -0.07160 -0.04250
    ## rts          0.84314  0.73818  0.95531
    ## age         -0.04971 -0.06021 -0.03944
    ## bp          -0.63514 -1.12052 -0.14005

    ## odds ratios and 95% CI
    exp(cbind(OR = coef(glm.tr.red.AIC), confint(glm.tr.red.AIC)))
    ## Waiting for profiling to be done...
    ##                 OR  2.5 % 97.5 %
    ## (Intercept) 1.4274 0.5947 3.3828
    ## es          0.6305 0.5086 0.7842
    ## iss         0.9447 0.9309 0.9584
    ## rts         2.3237 2.0921 2.5995
    ## age         0.9515 0.9416 0.9613
    ## bp          0.5299 0.3261 0.8693

11.5.2 Checking Predictions for the Trauma Model

To assess the ability of the selected model to accurately predict survival, assume that the two types of errors (a false positive prediction of survival, and a false negative prediction) have equal costs. Under this assumption, the optimal classification rule is to predict survival for an individual with an estimated survival probability of 0.50 or larger.

In the below script, a table is generated that gives classifications for cutoff probabilities thresh = 0.0, 0.1, ..., 1.0 based on the selected model. While I do not do this below, it is common to use the jackknife method for assessing classification accuracy. To implement the jackknife method, each observation is temporarily held out and the selected model is fitted to the remaining (3131) cases. This leads to an estimated survival probability for the case, which is compared to the actual result. The results are summarized in terms of total number of correct and incorrect classifications for each possible outcome. In the table below, the columns labeled Event (a survival) and Non-Event (a death) refer to the classification of observations, and not to the actual outcomes. The columns labeled Correct and Incorrect identify whether the classifications are accurate.

    # thresholds for classification given model proportion predictions for each observation
    thresh <- seq(0, 1, by = 0.1)
    # predicted probabilities
    Yhat <- fitted(glm.tr.red.AIC)
    # Name: lower (0) = NonEvent, higher (1) = Event
    YObs <- cut(trauma$surv, breaks = c(-Inf, mean(trauma$surv), Inf)
              , labels = c("NonEvent", "Event"))
    classify.table <- data.frame(Thresh    = rep(NA, length(thresh))
                               , Cor.Event = rep(NA, length(thresh))
                               , Cor.NonEv = rep(NA, length(thresh))
                               , Inc.Event = rep(NA, length(thresh))
                               , Inc.NonEv = rep(NA, length(thresh))
                               , Cor.All   = rep(NA, length(thresh))
                               , Sens      = rep(NA, length(thresh))
                               , Spec      = rep(NA, length(thresh))
                               , Fal.P     = rep(NA, length(thresh))
                               , Fal.N     = rep(NA, length(thresh)))
    for (i.thresh in 1:length(thresh)) {
      # choose a threshold for dichotomizing according to predicted probability
      YhatPred <- cut(Yhat, breaks = c(-Inf, thresh[i.thresh], Inf)
                    , labels = c("NonEvent", "Event"))
      # contingency table and marginal sums
      cTab <- table(YhatPred, YObs)
      addmargins(cTab)
      # Classification Table
      classify.table$Thresh   [i.thresh] <- thresh[i.thresh]                  # Prob.Level
      classify.table$Cor.Event[i.thresh] <- cTab[2,2]                         # Correct.Event
      classify.table$Cor.NonEv[i.thresh] <- cTab[1,1]                         # Correct.NonEvent
      classify.table$Inc.Event[i.thresh] <- cTab[2,1]                         # Incorrect.Event
      classify.table$Inc.NonEv[i.thresh] <- cTab[1,2]                         # Incorrect.NonEvent
      classify.table$Cor.All  [i.thresh] <- 100 * sum(diag(cTab)) / sum(cTab) # Correct.Overall
      classify.table$Sens     [i.thresh] <- 100 * cTab[2,2] / sum(cTab[,2])   # Sensitivity
      classify.table$Spec     [i.thresh] <- 100 * cTab[1,1] / sum(cTab[,1])   # Specificity
      classify.table$Fal.P    [i.thresh] <- 100 * cTab[2,1] / sum(cTab[2,])   # False.Pos
      classify.table$Fal.N    [i.thresh] <- 100 * cTab[1,2] / sum(cTab[1,])   # False.Neg
    }
    round(classify.table, 1)
    ##    Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All  Sens  Spec Fal.P Fal.N
    ## 1     0.0      2865         0       267         0    91.5 100.0   0.0   8.5   NaN
    ## 2     0.1      2861        79       188         4    93.9  99.9  29.6   6.2   4.8
    ## 3     0.2      2856       105       162         9    94.5  99.7  39.3   5.4   7.9
    ## 4     0.3      2848       118       149        17    94.7  99.4  44.2   5.0  12.6
    ## 5     0.4      2837       125       142        28    94.6  99.0  46.8   4.8  18.3
    ## 6     0.5      2825       139       128        40    94.6  98.6  52.1   4.3  22.3
    ## 7     0.6      2805       157       110        60    94.6  97.9  58.8   3.8  27.6
    ## 8     0.7      2774       174        93        91    94.1  96.8  65.2   3.2  34.3
    ## 9     0.8      2727       196        71       138    93.3  95.2  73.4   2.5  41.3
    ## 10    0.9      2627       229        38       238    91.2  91.7  85.8   1.4  51.0
    ## 11    1.0         0       267         0      2865     8.5   0.0 100.0   NaN  91.5

The data set has 2865 survivors and 267 people that died. Using a 0.50 cutoff, 2825 of the survivors would be correctly identified, and 40 misclassified. Similarly, 139 of the patients that died would be correctly classified and 128 would not. The overall percentage of cases correctly classified is (2825 + 139)/3132 = 94.6%. The sensitivity is the percentage of survivors that are correctly classified, 2825/(2825 + 40) = 98.6%. The specificity is the percentage of patients that died that are correctly classified, 139/(139 + 128) = 52.1%. The false positive rate, which is the % of those predicted to survive that did not, is 128/(128 + 2825) = 4.3%. The false negative rate, which is the % of those predicted to die that did not, is 40/(40 + 139) = 22.3%.

    # Thresh = 0.5 classification table
    YhatPred <- cut(Yhat, breaks = c(-Inf, 0.5, Inf), labels = c("NonEvent", "Event"))
    # contingency table and marginal sums
    cTab <- table(YhatPred, YObs)
    addmargins(cTab)
    ##           YObs
    ## YhatPred   NonEvent Event  Sum
    ##   NonEvent      139    40  179
    ##   Event         128  2825 2953
    ##   Sum           267  2865 3132
    round(subset(classify.table, Thresh == 0.5), 1)
    ##   Thresh Cor.Event Cor.NonEv Inc.Event Inc.NonEv Cor.All Sens Spec Fal.P Fal.N
    ## 6    0.5      2825       139       128        40    94.6 98.6 52.1   4.3  22.3

The misclassification rate seems small, but you should remember that approximately 10% of patients admitted to UNM eventually die from their injuries. Given this historical information only, you could achieve a 10% misclassification rate by completely ignoring the data and classifying each admitted patient as a survivor. Using the data reduces the misclassification rate by about 50% (from 10% down to 5.4%), which is an important reduction in this problem.

11.6 Historical Example: O-Ring Data

The table below presents data from Dalal, Fowlkes, and Hoadley (1989) on field O-ring failures in the 23 pre-Challenger space shuttle launches. Temperature at lift-off and O-ring joint pressure are predictors. The binary version of the data views each flight as an independent trial. The result of a trial (y) is a 1 if at least one field O-ring failure occurred on the flight and a 0 if all six O-rings functioned properly. A regression for the binary data would model the
probability that any O-ring failed as a function of temperature and pressure. The binomial version of these data is also presented. It views the six field O-rings on each flight as independent trials. A regression for the binomial data would model the probability that a particular O-ring on a given flight failed as a function of temperature and pressure. The two regressions model different probabilities. The Challenger explosion occurred during a takeoff at 31 degrees Fahrenheit.

    #### Example: Shuttle O-ring data
    shuttle <- read.csv("http://statacumen.com/teach/ADA2/ADA2_notes_Ch11_shuttle.csv")

    case flight y six temp pressure
       1     14 1   2   53       50
       2      9 1   1   57       50
       3     23 1   1   58      200
       4     10 1   1   63       50
       5      1 0   0   66      200
       6      5 0   0   67       50
       7     13 0   0   67      200
       8     15 0   0   67       50
       9      4 0   0   68      200
      10      3 0   0   69      200
      11      8 0   0   70       50
      12     17 0   0   70      200
      13      2 1   1   70      200
      14     11 1   1   70      200
      15      6 0   0   72      200
      16      7 0   0   73      200
      17     16 0   0   75      100
      18     21 1   2   75      200
      19     19 0   0   76      200
      20     22 0   0   76      200
      21     12 0   0   78      200
      22     20 0   0   79      200
      23     18 0   0   81      200

The regression model for the binomial data (i.e., six trials per launch) is suspect on substantive grounds. The binomial model makes no allowance for an effect due to having the six O-rings on the same flight. If these data were measurements rather than counts, they would almost certainly be treated as dependent repeated measures. If interest is focused on whether one or more O-rings failed, the simplest, most direct data are the binary data.

Consider fitting a logistic regression model using temperature and pressure as predictors. Let $p_i$ be the probability that any O-ring fails in case i and model this as

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \mathrm{Temp}_i + \beta_2 \mathrm{Pressure}_i.$$

Logistic histogram plots of the data show a clear marginal relationship of failure with temp but not with pressure. We still need to assess the model with both variables together.

    # plot logistic plots of response to each predictor individually
    library(popbio)
    ## Loading required package: quadprog
    logi.hist.plot(shuttle$temp, shuttle$y, boxp = FALSE, type = "hist"
                 , rug = TRUE, col = "gray", ylabel = "Probability", xlabel = "Temp")
    logi.hist.plot(shuttle$pressure, shuttle$y, boxp = FALSE, type = "hist"
                 , rug = TRUE, col = "gray", ylabel = "Probability", xlabel = "Pressure")

[Figure: two logistic histogram plots, Probability (left axis) and Frequency (right axis) vs. Temp and vs. Pressure.]

We fit the logistic model below using Y = 1 if at least one O-ring failed, and 0 otherwise. We are modelling the chance of one or more O-ring failures as a function of temperature and pressure.

The D goodness-of-fit statistic suggests no gross deviations from the model. Furthermore, the test of $H_0: \beta_1 = \beta_2 = 0$ (no regression effects) based on the Wald test has a p-value of 0.1, which suggests that neither temperature nor pressure, nor both together, are useful predictors of the probability of O-ring failure. The z-test p-values for testing $H_0: \beta_1 = 0$ and $H_0: \beta_2 = 0$ individually are 0.037 and 0.576, respectively, which indicates pressure is not important (when added last to the model), but that temperature is important. This conclusion might be anticipated by looking at the data plots above.

    glm.sh <- glm(cbind(y, 1 - y) ~ temp + pressure, family = binomial, shuttle)
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev.p.val <- 1 - pchisq(glm.sh$deviance, glm.sh$df.residual)
    dev.p.val
    ## [1] 0.4589
    # Testing Global Null Hypothesis
    library(aod)
    coef(glm.sh)
    ## (Intercept)        temp    pressure
    ##   16.385319   -0.263404    0.005178
    # specify which coefficients to test = 0 (Terms = 2:3 is for terms 2 and 3)
    wald.test(b = coef(glm.sh), Sigma = vcov(glm.sh), Terms = 2:3)
    ## Wald test:
    ## ----------
    ##
    ## Chi-squared test:
    ## X2 = 4.6, df = 2, P(> X2) = 0.1
    # Model summary
    summary(glm.sh)
    ##
    ## Call:
    ## glm(formula = cbind(y, 1 - y) ~ temp + pressure, family = binomial,
    ##     data = shuttle)
    ##
    ## Deviance Residuals:
    ##    Min      1Q  Median      3Q     Max
    ## -1.193  -0.788  -0.379   0.417   2.203
    ##
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept) 16.38532    8.02747    2.04    0.041 *
    ## temp        -0.26340    0.12637   -2.08    0.037 *
    ## pressure     0.00518    0.00926    0.56    0.576
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ##     Null deviance: 28.267  on 22  degrees of freedom
    ## Residual deviance: 19.984  on 20  degrees of freedom
    ## AIC: 25.98
    ##
    ## Number of Fisher Scoring iterations: 5

A reasonable next step would be to refit the model, after omitting pressure as a predictor. The target model is now

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \mathrm{Temp}_i.$$

    glm.sh <- glm(cbind(y, 1 - y) ~ temp, family = binomial, shuttle)
    # Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
    dev.p.val <- 1 - pchisq(glm.sh$deviance, glm.sh$df.residual)
    dev.p.val
    ## [1] 0.5014
    # Model summary
    summary(glm.sh)
    ##
    ## Call:
    ## glm(formula = cbind(y, 1 - y) ~ temp, family = binomial, data = shuttle)
    ##
    ## Deviance Residuals:
    ##    Min      1Q  Median      3Q     Max
    ## -1.061  -0.761  -0.378   0.452   2.217
    ##
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)   15.043      7.379    2.04    0.041 *
    ## temp          -0.232      0.108   -2.14    0.032 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ##     Null deviance: 28.267  on 22  degrees of freedom
    ## Residual deviance: 20.315  on 21  degrees of freedom
    ## AIC: 24.32
    ##
    ## Number of Fisher Scoring iterations: 5

Our conclusions on the overall fit of the model and the significance of the effect of temperature on the probability of O-ring failure are consistent with the results for the two-predictor model. The model estimates the log-odds of (at least one) O-ring failure to be

$$\log\left(\frac{\tilde{p}}{1-\tilde{p}}\right) = 15.043 - 0.2322\,\mathrm{Temp}.$$

The estimated probability of (at least one) O-ring failure is

$$\tilde{p} = \frac{\exp(15.043 - 0.2322\,\mathrm{Temp})}{1 + \exp(15.043 - 0.2322\,\mathrm{Temp})}.$$

This is a decreasing function of temperature.

The Challenger was set to launch on a morning where the predicted temperature at lift-off was 31 degrees. Given that temperature appears to affect the probability of O-ring failure (a point that NASA missed), what can/should we say about the potential for O-ring failure? Clearly, the launch temperature is outside the region for which data were available. Thus, we really have no prior information about what is likely to occur. If we assume the logistic model holds, and we can extrapolate back to 31 degrees, what is the estimated probability of O-ring failure?

The following answers this question. I augmented the original data set to obtain predicted probabilities for temperatures not observed in the data set. Note that the fitted model to data with missing observations gives the same model fit because glm() excludes observations with missing values. Predictions are then made for all temperatures in the dataset and the resulting table and plot are reported.

    # append values to dataset for which we wish to make predictions
    shuttle.pred <- data.frame( case     = rep(NA, 5)
                              , flight   = rep(NA, 5)
                              , y        = rep(NA, 5)
                              , six      = rep(NA, 5)
                              , temp     = c(31, 35, 40, 45, 50) # temp values to predict
                              , pressure = rep(NA, 5)
                              )
    shuttle <- rbind(shuttle.pred, shuttle)
    # fit model
    glm.sh <- glm(cbind(y, 1 - y) ~ temp, family = binomial, shuttle)
    # Note: same model fit as before since glm() does not use missing values
    round(summary(glm.sh)$coefficients, 3)
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)   15.043      7.379   2.039    0.041
    ## temp          -0.232      0.108  -2.145    0.032
    # put the fitted values in the data.frame
    shuttle$fitted.values <- c(rep(NA, 5), glm.sh$fitted.values)
    # predict() uses all the temp values in dataset, including appended values
    pred <- predict(glm.sh, data.frame(temp = shuttle$temp), type = "link", se.fit = TRUE)
    shuttle$fit    <- pred$fit
    shuttle$se.fit <- pred$se.fit
    # CI for fitted values
    shuttle <- within(shuttle, {
      # added "fitted" to make predictions at appended temp values
      fitted    = exp(fit) / (1 + exp(fit))
      fit.lower = exp(fit - 1.96 * se.fit) / (1 + exp(fit - 1.96 * se.fit))
      fit.upper = exp(fit + 1.96 * se.fit) / (1 + exp(fit + 1.96 * se.fit))
    })

    row case flight  y six temp pressure fitted.values   fit se.fit fit.upper fit.lower fitted
      1   NA     NA NA  NA   31       NA            NA  7.85   4.04      1.00      0.48   1.00
      2   NA     NA NA  NA   35       NA            NA  6.92   3.61      1.00      0.46   1.00
      3   NA     NA NA  NA   40       NA            NA  5.76   3.08      1.00      0.43   1.00
      4   NA     NA NA  NA   45       NA            NA  4.60   2.55      1.00      0.40   0.99
      5   NA     NA NA  NA   50       NA            NA  3.43   2.02      1.00      0.37   0.97
      6    1     14  1   2   53       50          0.94  2.74   1.71      1.00      0.35   0.94
      7    2      9  1   1   57       50          0.86  1.81   1.31      0.99      0.32   0.86
      8    3     23  1   1   58      200          0.83  1.58   1.21      0.98      0.31   0.83
      9    4     10  1   1   63       50          0.60  0.42   0.77      0.87      0.25   0.60
     10    5      1  0   0   66      200          0.43 -0.28   0.59      0.71      0.19   0.43
     11    6      5  0   0   67       50          0.38 -0.51   0.56      0.64      0.17   0.38
     12    7     13  0   0   67      200          0.38 -0.51   0.56      0.64      0.17   0.38
     13    8     15  0   0   67       50          0.38 -0.51   0.56      0.64      0.17   0.38
     14    9      4  0   0   68      200          0.32 -0.74   0.55      0.58      0.14   0.32
     15   10      3  0   0   69      200          0.27 -0.98   0.56      0.53      0.11   0.27
     16   11      8  0   0   70       50          0.23 -1.21   0.59      0.49      0.09   0.23
     17   12     17  0   0   70      200          0.23 -1.21   0.59      0.49      0.09   0.23
     18   13      2  1   1   70      200          0.23 -1.21   0.59      0.49      0.09   0.23
     19   14     11  1   1   70      200          0.23 -1.21   0.59      0.49      0.09   0.23
     20   15      6  0   0   72      200          0.16 -1.67   0.70      0.43      0.05   0.16
     21   16      7  0   0   73      200          0.13 -1.90   0.78      0.40      0.03   0.13
     22   17     16  0   0   75      100          0.09 -2.37   0.94      0.37      0.01   0.09
     23   18     21  1   2   75      200          0.09 -2.37   0.94      0.37      0.01   0.09
     24   19     19  0   0   76      200          0.07 -2.60   1.03      0.36      0.01   0.07
     25   20     22  0   0   76      200          0.07 -2.60   1.03      0.36      0.01   0.07
     26   21     12  0   0   78      200          0.04 -3.07   1.22      0.34      0.00   0.04
     27   22     20  0   0   79      200          0.04 -3.30   1.32      0.33      0.00   0.04
     28   23     18  0   0   81      200          0.02 -3.76   1.51      0.31      0.00   0.02
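As a quick arithmetic check on the extrapolation, the fitted equation reported above can be evaluated by hand at 31 degrees. The sketch below assumes only the rounded coefficients 15.043 and −0.2322 quoted in the text (it does not refit the model), and the helper name `p_fail` is hypothetical.

```r
# Hypothetical helper: estimated probability of at least one O-ring failure,
# using the rounded coefficients quoted in the text (an assumption; refit the
# model with glm() for full precision).
p_fail <- function(temp, b0 = 15.043, b1 = -0.2322) {
  eta <- b0 + b1 * temp   # estimated log-odds (logit scale)
  plogis(eta)             # inverse logit: exp(eta) / (1 + exp(eta))
}

eta_31 <- 15.043 - 0.2322 * 31   # estimated log-odds at 31 degrees
p_31   <- p_fail(31)             # estimated failure probability at 31 degrees

# Compare with the 31-degree row of the prediction table (fit and fitted columns);
# small differences come from rounding the coefficients.
round(c(log.odds = eta_31, probability = p_31), 2)
```

Because 31 degrees lies far below the observed range of 53 to 81 degrees, this point estimate is best read together with the wide interval reported for that row of the table (roughly 0.48 to 1.00).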
    library(ggplot2)
    p <- ggplot(shuttle, aes(x = temp, y = y))
    # predicted curve and point-wise 95% CI
    p <- p + geom_ribbon(aes(x = temp, ymin = fit.lower, ymax = fit.upper), alpha = 0.2)
    p <- p + geom_line(aes(x = temp, y = fitted), colour = "red")
    # fitted values
    p <- p + geom_point(aes(y = fitted.values), size = 2, colour = "red")
    # observed values
    p <- p + geom_point(size = 2)
    p <- p + ylab("Probability")
    p <- p + labs(title = "Observed events and predicted probability of 1+ O-ring failures")
    print(p)
    ## Warning: Removed 5 rows containing missing values (geom_point).
    ## Warning: Removed 5 rows containing missing values (geom_point).

[Figure: "Observed events and predicted probability of 1+ O-ring failures", Probability vs. temp (30 to 80 degrees), with fitted curve and point-wise 95% band.]

Part V
Multivariate Methods

Chapter 12
An Introduction to Multivariate Methods

Multivariate statistical methods are used to display, analyze, and describe data on two or more features or variables simultaneously. I will discuss multivariate methods for measurement data. Methods for multi-dimensional count data, or mixtures of counts and measurements, are available, but are beyond the scope of what I can do here. I will give a brief overview of the type of problems where multivariate methods are appropriate.

Example: Turtle shells
Jolicoeur and Mosimann provided data on the height, length, and width of the carapace (shell) for a sample of female painted turtles. Cluster analysis is used to identify which shells are similar on the three features. Principal component analysis is used to identify the linear combinations of the measurements that account for most of the variation in size and shape of the shells.

Cluster analysis and principal component analysis are primarily descriptive techniques.
Example: Fisher's Iris data
Random samples of 50 flowers were selected from three iris species: Setosa, Virginica, and Versicolor. Four measurements were made on each flower: sepal length, sepal width, petal length, and petal width. Suppose the sample means on each feature are computed within the three species. Are the means on the four traits significantly different across species? This question can be answered using four separate one-way ANOVAs. A more powerful MANOVA (multivariate analysis of variance) method compares species on the four features simultaneously.

Discriminant analysis is a technique for comparing groups on multidimensional data. Discriminant analysis can be used with Fisher's Iris data to find the linear combinations of the flower features that best distinguish species. The linear combinations are optimally selected, so insignificant differences on one or all features may be significant (or better yet, important) when the features are considered simultaneously! Furthermore, the discriminant analysis could be used to classify flowers into one of these three species when their species is unknown.

MANOVA, discriminant analysis, and classification are primarily inferential techniques.

12.1 Linear Combinations

Suppose data are collected on p measurements or features $X_1, X_2, \ldots, X_p$. Most multivariate methods use linear combinations of the features as the basis for analysis. A linear combination has the form

$$Y = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p,$$

where the coefficients $a_1, a_2, \ldots, a_p$ are known constants. Y is evaluated for each observation in the data set, keeping the coefficients constant.

For example, three linear combinations of $X_1, X_2, \ldots, X_p$ are:

$$Y = 1X_1 + 0X_2 + 0X_3 + \cdots + 0X_p = X_1,$$
$$Y = \frac{1}{p}(X_1 + X_2 + \cdots + X_p), \quad \text{and}$$
$$Y = 2X_1 - 4X_2 + 55X_3 - 1954X_4 + \cdots + 44X_p.$$

Vector and matrix notation are useful for representing and summarizing multivariate data.
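To make the definition of a linear combination concrete, here is a minimal sketch that evaluates a few of them for every observation. It uses R's built-in iris data frame as a stand-in for the features $X_1, \ldots, X_4$; the coefficient vectors are illustrative choices only, not values used elsewhere in these notes.

```r
# Features as an n x p matrix (p = 4 flower measurements)
X <- as.matrix(iris[, 1:4])

# Each linear combination is defined by a fixed coefficient vector a
a.first <- c(1, 0, 0, 0)               # Y = X1
a.mean  <- rep(1 / ncol(X), ncol(X))   # Y = (X1 + ... + Xp) / p
a.other <- c(2, -4, 55, -1954)         # arbitrary illustrative coefficients

# Y is evaluated for each observation, keeping the coefficients constant;
# X %*% a computes a1*X1 + ... + ap*Xp row by row
Y.first <- drop(X %*% a.first)
Y.mean  <- drop(X %*% a.mean)
Y.other <- drop(X %*% a.other)

head(cbind(Y.first, Y.mean, Y.other))
```

Writing the combination as a matrix product is a design choice that anticipates the vector and matrix notation introduced next: one coefficient vector applied to all n rows at once.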
Before introducing this notation, let us try to understand linear combinations geometrically when p = 2.

Example: −45° rotation
A plot of data on two features $X_1$ and $X_2$ is given below. Also included is a plot for the two linear combinations

$$Y_1 = \frac{1}{\sqrt{2}}(X_1 + X_2) \quad \text{and} \quad Y_2 = \frac{1}{\sqrt{2}}(X_2 - X_1).$$

This transformation creates two (roughly) uncorrelated linear combinations $Y_1$ and $Y_2$ from two highly correlated features $X_1$ and $X_2$. The transformation corresponds to a rotation of the original coordinate axes by −45 degrees. Each data point is then expressed relative to the new axes. The new features are uncorrelated!

[Figure: left, scatterplot of X2 vs. X1 with the rotated Y1 and Y2 axes drawn at 45°; right, the same data plotted as Y2 vs. Y1.]

The $\sqrt{2}$ divisor in $Y_1$ and $Y_2$ does not alter the interpretation of these linear combinations: $Y_1$ is essentially the sum of $X_1$ and $X_2$, whereas $Y_2$ is essentially the difference between $X_2$ and $X_1$.

Example: Two groups
The plot below shows data on two features $X_1$ and $X_2$ from two distinct groups.

[Figure: left, scatterplot of X2 vs. X1 for group 1 and group 2 with the Y1 and Y2 axes rotated by θ°; right, the same data plotted as Y2 vs. Y1.]

If you compare the groups on $X_1$ and $X_2$ separately, you may find no significant differences because the groups overlap substantially on each feature. The plot on the right was obtained by rotating the coordinate axes −θ degrees, and then plotting the data relative to the new coordinate axes. The rotation corresponds to creating two linear combinations:

$$Y_1 = \cos(\theta) X_1 + \sin(\theta) X_2$$
$$Y_2 = -\sin(\theta) X_1 + \cos(\theta) X_2.$$

The two groups differ substantially on $Y_2$. This linear combination is used with discriminant analysis and MANOVA to distinguish between the groups.

The linear combinations used in certain multivariate methods do not correspond to a rotation of the original coordinate axes.
However, the pictures given above should provide some insight into the motivation for creating linear combinations of two features. The ideas extend to three or more features, but are more difficult to represent visually.

12.2 Vector and Matrix Notation

A vector is a string of numbers or variables that is stored in either a row or in a column. For example, the collection $X_1, X_2, \ldots, X_p$ of features can be represented as a column-vector with $p$ rows, using the notation

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}.$$

The entry in the $j$th row is $X_j$. The transpose of $X$, represented by $X'$, is a row-vector with $p$ columns: $X' = (X_1, X_2, \ldots, X_p)$. The $j$th column of $X'$ contains $X_j$.

Suppose you collect data on $p$ features $X_1, X_2, \ldots, X_p$ for a sample of $n$ individuals. The data for the $i$th individual can be represented as the column-vector

$$X_i = \begin{bmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ip} \end{bmatrix}$$

or as the row-vector $X_i' = (X_{i1}, X_{i2}, \cdots, X_{ip})$. Here $X_{ij}$ is the value on the $j$th variable. Two subscripts are needed for the data values. One subscript identifies the individual and the other subscript identifies the feature.

A matrix is a rectangular array of numbers or variables. A data set can be viewed as a matrix with $n$ rows and $p$ columns, where $n$ is the sample size. Each row contains data for a given individual:

$$\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}.$$

Vector and matrix notation are used for summarizing multivariate data. For example, the sample mean vector is

$$\bar{X} = \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_p \end{bmatrix},$$

where $\bar{X}_j$ is the sample average on the $j$th feature. Using matrix algebra, $\bar{X}$ is defined using a familiar formula:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

This mathematical operation is well-defined because vectors are added elementwise.

The sample variances and covariances on the $p$ variables can be grouped together in a $p \times p$ sample variance-covariance matrix $S$ (i.e., $p$ rows and $p$ columns)

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix},$$

where

$$s_{ii} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)^2$$

is the sample variance for the $i$th feature, and

$$s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j)$$

is the sample covariance between the $i$th and $j$th features. The subscripts on the elements in $S$ identify where the element is found in the matrix: $s_{ij}$ is stored in the $i$th row and the $j$th column. The variances are found on the main diagonal of the matrix. The covariances are off-diagonal elements. $S$ is symmetric, meaning that the elements above the main diagonal are a reflection of the entries below the main diagonal. More formally, $s_{ij} = s_{ji}$.

Matrix algebra allows you to express $S$ using a formula analogous to the sample variance for a single feature:

$$S = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})(X_k - \bar{X})'.$$

Here $(X_k - \bar{X})(X_k - \bar{X})'$ is the matrix product of a column vector with $p$ entries times a row vector with $p$ entries. This matrix product is a $p \times p$ matrix with $(X_{ki} - \bar{X}_i)(X_{kj} - \bar{X}_j)$ in the $i$th row and $j$th column. The matrix products are added up over all $n$ observations and then divided by $n - 1$.

The interpretation of covariances is enhanced by standardizing them to give correlations. The sample correlation matrix is denoted by the $p \times p$ symmetric matrix

$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{bmatrix}.$$

The $i$th row and $j$th column element of $R$ is the correlation between the $i$th and $j$th features. The diagonal elements are one: $r_{ii} = 1$. The off-diagonal elements satisfy

$$r_{ij} = r_{ji} = \frac{s_{ij}}{\sqrt{s_{ii} s_{jj}}}.$$

In many applications the data are standardized to have mean 0 and variance 1 on each feature. The data are standardized through the so-called Z-score transformation: $(X_{ki} - \bar{X}_i)/\sqrt{s_{ii}}$ which, on each feature, subtracts the mean from each observation and divides by the corresponding standard deviation. The sample variance-covariance matrix for the standardized data is the correlation matrix $R$ for the raw data.

Example: Let $X_1$, $X_2$, and $X_3$ be the reaction times for three visual stimuli named A, B and C, respectively. Suppose you are given the following summaries based on a sample of 30 individuals:

$$\bar{X} = \begin{bmatrix} 4 \\ 5 \\ 4.7 \end{bmatrix}, \quad S = \begin{bmatrix} 2.26 & 2.18 & 1.63 \\ 2.18 & 2.66 & 1.82 \\ 1.63 & 1.82 & 2.47 \end{bmatrix}, \quad R = \begin{bmatrix} 1.00 & 0.89 & 0.69 \\ 0.89 & 1.00 & 0.71 \\ 0.69 & 0.71 & 1.00 \end{bmatrix}.$$

The average response time on B is 5. The sample variance of response times on A is 2.26. The sample covariance between response times on A and C is 1.63. The sample correlation between response times on B and C is 0.71.

12.3 Matrix Notation to Summarize Linear Combinations

Matrix algebra is useful for computing sample summaries for linear combinations of the features $X' = (X_1, X_2, \ldots, X_p)$ from the sample summaries on these features. For example, suppose you define the linear combination

$$Y_1 = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p.$$

Using matrix algebra, $Y_1$ is the matrix product $Y_1 = a'X$, where $a' = (a_1, a_2, \ldots, a_p)$. The sample mean and variance of $Y_1$ are

$$\bar{Y} = a_1 \bar{X}_1 + a_2 \bar{X}_2 + \cdots + a_p \bar{X}_p = a'\bar{X}$$

and

$$s_Y^2 = \sum_{ij} a_i a_j s_{ij} = a'Sa,$$

where $\bar{X}$ and $S$ are the sample mean vector and sample variance-covariance matrix for $X' = (X_1, X_2, \ldots, X_p)$.

Similarly, the sample covariance between $Y_1$ and $Y_2 = b'X = b_1 X_1 + b_2 X_2 + \cdots + b_p X_p$ is

$$s_{Y_1, Y_2} = \sum_{ij} a_i b_j s_{ij} = a'Sb = b'Sa.$$

Example: In the stimuli example, the total reaction time per individual is

$$Y = [1\ 1\ 1] \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = X_1 + X_2 + X_3.$$

The mean reaction time is

$$\bar{Y} = [1\ 1\ 1]\, \bar{X} = [1\ 1\ 1] \begin{bmatrix} 4 \\ 5 \\ 4.7 \end{bmatrix} = 4 + 5 + 4.7 = 13.7.$$

The variance of $Y$ is the sum of the elements in the variance-covariance matrix:

$$s_Y^2 = [1\ 1\ 1]\, S \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \sum_{ij} s_{ij} = 2.26 + 2.18 + \cdots + 2.47 = 18.65.$$
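As a quick check of the matrix formulas in this section, the stimuli example can be reproduced in R. This is a minimal sketch that types in the summary statistics given above. The second coefficient vector `b` (A minus B) is a hypothetical combination added for illustration and is not part of the original example.

```r
# Sketch: summaries of linear combinations computed from the sample mean
# vector and the sample variance-covariance matrix of the stimuli example.
xbar <- c(A = 4, B = 5, C = 4.7)
S <- matrix(c(2.26, 2.18, 1.63,
              2.18, 2.66, 1.82,
              1.63, 1.82, 2.47),
            nrow = 3, byrow = TRUE,
            dimnames = list(names(xbar), names(xbar)))

a <- c(1, 1, 1)                          # total reaction time Y = X1 + X2 + X3
Y.mean <- sum(a * xbar)                  # a' xbar   = 13.7
Y.var  <- as.numeric(t(a) %*% S %*% a)   # a' S a    = 18.65

b <- c(1, -1, 0)                         # hypothetical combination: A minus B
Y.cov <- as.numeric(t(a) %*% S %*% b)    # a' S b, covariance of the two combinations

R <- cov2cor(S)                          # standardize covariances to correlations
round(R, 2)                              # matches the correlation matrix R given above
```

The function `cov2cor()` applies $r_{ij} = s_{ij}/\sqrt{s_{ii} s_{jj}}$ to every element of $S$.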
Chapter 13

Principal Component Analysis

Principal component analysis (PCA) is a multivariate technique for understanding variation, and for summarizing measurement data possibly through variable reduction. Principal components (the variables created in PCA) are sometimes used in addition to, or in place of, the original variables in certain analyses. I will illustrate the use and misuse of principal components in a series of examples.

Given data on $p$ variables or features $X_1, X_2, \ldots, X_p$, PCA uses a rotation of the original coordinate axes to produce a new set of $p$ uncorrelated variables, called principal components, that are unit-length linear combinations of the original variables. A unit-length linear combination $a_1 X_1 + a_2 X_2 + \cdots + a_p X_p$ has $a_1^2 + a_2^2 + \cdots + a_p^2 = 1$.

The principal components have the following properties. The first principal component

$$\text{PRIN1} = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p$$

has the largest variability among all unit-length linear combinations of the original variables. The second principal component

$$\text{PRIN2} = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p$$

has the largest variability among all unit-length linear combinations of $X_1, X_2, \ldots, X_p$ that are uncorrelated with PRIN1. In general, the $j$th principal component PRINj for $j = 1, 2, \ldots, p$, has the largest variability among all unit-length linear combinations of the features that are uncorrelated with PRIN1, PRIN2, ..., PRIN(j − 1). The last or $p$th principal component PRINp has the smallest variability among all unit-length linear combinations of the features.

I have described PCA on the raw or unstandardized data. This method is often called PCA on the sample covariance matrix, because the principal components are computed numerically using a singular value decomposition of the sample covariance matrix for the $X_i$s. The variances of the PCs are eigenvalues of the sample covariance matrix. The coefficients in the PCs are eigenvectors of the sample covariance matrix. The sum of the variances of the principal components is equal to the sum of the variances in the original features. An alternative method for PCA uses standardized data, which is often called PCA on the correlation matrix.

The ordered principal components are uncorrelated variables with progressively less variation. Principal components are often viewed as separate dimensions corresponding to the collection of features. The variability of each component divided by the total variability of the components is the proportion of the total variation in the data captured by each component. If data reduction is your goal, then you might need only the first few principal components to capture most of the variability in the data. This issue will be returned to later.

The unit-length constraint on the coefficients in PCA is needed to make the maximization well-defined. Without this constraint, there does not exist a linear combination with maximum variation. For example, the variability of an arbitrary linear combination $a_1 X_1 + a_2 X_2 + \cdots + a_p X_p$ is increased by a factor of 100 when each coefficient is multiplied by 10!

The principal components are unique only up to a change of the sign for each coefficient. For example,

$$\text{PRIN1} = 0.2 X_1 - 0.4 X_2 + 0.4 X_3 + 0.8 X_4$$

and

$$\text{PRIN1} = -0.2 X_1 + 0.4 X_2 - 0.4 X_3 - 0.8 X_4$$

have the same variability, so either could play the role of the first principal component. This non-uniqueness does not have an important impact on the analysis.

13.1 Example: Temperature Data

The following temperature example includes mean monthly temperatures in January and July for 64 U.S. cities.

#### Example: Temperature of cities
## The Temperature data file is in "fixed width format", an older data file format.
## Each field is specified by column ranges.
## Below I've provided numbers to help identify the column numbers
## as well as the first three observations in the dataset.
## 123456789012345678901234
## [ 14 char    ][ 5 ][ 5 ]
## mobile         51.2 81.6
## phoenix        51.2 91.2
## little rock    39.5 81.4
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_temperature.dat"
temp <- read.fwf(fn.data, widths = c(14, 5, 5))
# the city names have trailing white space (we fix this below)
str(temp)
## 'data.frame': 64 obs. of 3 variables:
##  $ V1: Factor w/ 64 levels "albany        ",..: 39 48 33 56 21 27 64 62 31 36 ...
##  $ V2: num 51.2 51.2 39.5 45.1 29.9 24.8 32 35.6 54.6 67.2 ...
##  $ V3: num 81.6 91.2 81.4 75.2 73 72.7 75.8 78.7 81 82.3 ...
head(temp)
##               V1   V2   V3
## 1 mobile         51.2 81.6
## 2 phoenix        51.2 91.2
## 3 little rock    39.5 81.4
## 4 sacramento     45.1 75.2
## 5 denver         29.9 73.0
## 6 hartford       24.8 72.7
# remove that white space with strip.white=TRUE
temp <- read.fwf(fn.data, widths = c(14, 5, 5), strip.white = TRUE)
# name columns
colnames(temp) <- c("city", "january", "july")
temp$id <- 1:nrow(temp)
str(temp)
## 'data.frame': 64 obs. of 4 variables:
##  $ city   : Factor w/ 64 levels "albany","albuquerque",..: 39 48 33 56 21 27 64 62 31 36 ...
##  $ january: num 51.2 51.2 39.5 45.1 29.9 24.8 32 35.6 54.6 67.2 ...
##  $ july   : num 81.6 91.2 81.4 75.2 73 72.7 75.8 78.7 81 82.3 ...
##  $ id     : int 1 2 3 4 5 6 7 8 9 10 ...
head(temp)
##          city january july id
## 1      mobile    51.2 81.6  1
## 2     phoenix    51.2 91.2  2
## 3 little rock    39.5 81.4  3
## 4  sacramento    45.1 75.2  4
## 5      denver    29.9 73.0  5
## 6    hartford    24.8 72.7  6
# plot original data
library(ggplot2)
p1 <- ggplot(temp, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
                                  # good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = city), vjust = -0.5, alpha = 0.25) # city labels
p1 <- p1 + labs(title = "Mean temperature in Jan and July for selected cities")
print(p1)

[Figure: "Mean temperature in Jan and July for selected cities"; scatterplot of july versus january with each point labeled by city.]

The princomp() procedure is used for PCA. By default the principal components are computed based on the covariance matrix. The correlation matrix may also be used (it effectively z-scores the data first) with the cor = TRUE option. The principal component scores are the values of the principal components across cases. The principal component scores PRIN1, PRIN2, ..., PRINp are centered to have mean zero.

Output from a PCA on the covariance matrix is given. Two principal components are created because p = 2.

# perform PCA on covariance matrix
temp.pca <- princomp( ~ january + july, data = temp)
# standard deviation and proportion of variation for each component
summary(temp.pca)
## Importance of components:
##                        Comp.1  Comp.2
## Standard deviation     12.322 3.00046
## Proportion of Variance  0.944 0.05598
## Cumulative Proportion   0.944 1.00000
# coefficients for PCs
loadings(temp.pca)
##
## Loadings:
##         Comp.1 Comp.2
## january -0.939  0.343
## july    -0.343 -0.939
##
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
# scores are coordinates of each observation on PC scale
head(temp.pca$scores)
##    Comp.1  Comp.2
## 1 -20.000  0.9240
## 2 -23.291 -8.0942
## 3  -8.941 -2.8995
## 4 -12.076  4.8447
## 5   2.957  1.7000
## 6   7.851  0.2333

PCA is effectively doing a location shift (to the origin, zero) and a rotation of the data. When the correlation is used for PCA (instead of the covariance), it also rescales each feature to have unit variance before the rotation. The resulting PC scores do not, in general, have unit variance.

# create small data.frame with endpoints of PC lines through data
line.scale <- c(35, 15) # length of PCA lines to draw
# endpoints of lines to draw
temp.pca.line.endpoints <-
  data.frame(PC = c(rep("PC1", 2), rep("PC2", 2))
           , x = c(temp.pca$center[1] - line.scale[1] * temp.pca$loadings[1, 1]
                 , temp.pca$center[1] + line.scale[1] * temp.pca$loadings[1, 1]
                 , temp.pca$center[1] - line.scale[2] * temp.pca$loadings[1, 2]
                 , temp.pca$center[1] + line.scale[2] * temp.pca$loadings[1, 2])
           , y = c(temp.pca$center[2] - line.scale[1] * temp.pca$loadings[2, 1]
                 , temp.pca$center[2] + line.scale[1] * temp.pca$loadings[2, 1]
                 , temp.pca$center[2] - line.scale[2] * temp.pca$loadings[2, 2]
                 , temp.pca$center[2] + line.scale[2] * temp.pca$loadings[2, 2])
            )
temp.pca.line.endpoints
##    PC       x     y
## 1 PC1 64.9740 87.61
## 2 PC1 -0.7834 63.61
## 3 PC2 26.9526 89.70
## 4 PC2 37.2381 61.52
# plot original data with PCA vectors overlaid
library(ggplot2)
p1 <- ggplot(temp, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
                                  # good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = id), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC1"), aes(x=x, y=y), alpha=0.5)
p1 <- p1 + geom_path(data = subset(temp.pca.line.endpoints, PC=="PC2"), aes(x=x, y=y), alpha=0.5)
# label lines (add vjust or hjust to nudge a label if needed)
p1 <- p1 + annotate("text"
                  , x = temp.pca.line.endpoints$x[1]
                  , y = temp.pca.line.endpoints$y[1]
                  , label = as.character(temp.pca.line.endpoints$PC[1])
                  , size = 10)
p1 <- p1 + annotate("text"
                  , x = temp.pca.line.endpoints$x[3]
                  , y = temp.pca.line.endpoints$y[3]
                  , label = as.character(temp.pca.line.endpoints$PC[3])
                  , size = 10)
p1 <- p1 + labs(title = "Mean temperature in Jan and July for selected cities")
print(p1)

[Figure: "Mean temperature in Jan and July for selected cities"; scatterplot of july versus january with points labeled by id and the PC1 and PC2 axes drawn through the center of the data.]

# plot PCA scores (data on PC-scale centered at 0)
library(ggplot2)
p2 <- ggplot(as.data.frame(temp.pca$scores), aes(x = Comp.1, y = Comp.2))
p2 <- p2 + geom_point() # points
p2 <- p2 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
                                  # good idea since both are in the same units
p2 <- p2 + geom_text(aes(label = rownames(temp.pca$scores)), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p2 <- p2 + geom_vline(xintercept = 0, alpha=0.5)
p2 <- p2 + geom_hline(yintercept = 0, alpha=0.5)
p2 <- p2 + labs(title = "Same, PC scores")
#print(p2)
# plot PCA scores (data on (negative) PC-scale centered at 0)
library(ggplot2)
# negative temp.pca$scores
p3 <- ggplot(as.data.frame(-temp.pca$scores), aes(x = Comp.1, y = Comp.2))
p3 <- p3 + geom_point() # points
p3 <- p3 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
                                  # good idea since both are in the same units
p3 <- p3 + geom_text(aes(label = rownames(temp.pca$scores)), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p3 <- p3 + geom_vline(xintercept = 0, alpha=0.5)
p3 <- p3 + geom_hline(yintercept = 0, alpha=0.5)
p3 <- p3 + labs(title = "Same, but negative PC scores match orientation of original data")
#print(p3)
library(gridExtra)
# current gridExtra versions name the title argument top= (older versions used main=)
grid.arrange(p2, p3, ncol=1, top="Temperature data and PC scores")

[Figure: "Temperature data and PC scores"; top panel "Same, PC scores" and bottom panel "Same, but negative PC scores match orientation of original data", each plotting Comp.2 versus Comp.1 with points labeled by id.]

Some comments on the output:

1. You can visualize PCA when p = 2. In the temperature plot, the direction of maximal variability corresponds to the first PC axis. The PRIN1 score for each city is obtained by projecting the temperature pairs perpendicularly onto this axis. The direction of minimum variation corresponds to the second PC axis, which is perpendicular to the first PC axis. The PRIN2 score for each city is obtained by projecting the temperature pairs onto this axis.

2. The total variance is the sum of variances for the monthly temperatures: 163.38 = 137.18 + 26.20.

# variance of data (on diagonals, covariance of off-diags)
var(temp[,c("january","july")])
##         january  july
## january  137.18 46.73
## july      46.73 26.20
# sum of variance
sum(diag(var(temp[,c("january","july")])))
## [1] 163.4
# variance of PC scores
var(temp.pca$scores)
##           Comp.1    Comp.2
## Comp.1 1.542e+02 5.831e-15
## Comp.2 5.831e-15 9.146e+00
# sum is same as original data
sum(diag(var(temp.pca$scores)))
## [1] 163.4

3. The eigenvalues of the covariance matrix are variances for the PCs. The variability of PRIN1 = +0.939 JAN + 0.343 JULY is 154.236. The variability of PRIN2 = −0.343 JAN + 0.939 JULY is 9.146. The proportion of the total variability due to PRIN1 is 0.944 = 154.23/163.38. The proportion of the total variability due to PRIN2 is 0.056 = 9.146/163.38. (R reports these loadings with the opposite signs, which does not matter because principal components are unique only up to a change of sign.)

# eigenvalues and eigenvectors of covariance matrix give PC variance and loadings
eigen(var(temp[,c("january","july")]))
## $values
## [1] 154.236   9.146
##
## $vectors
##         [,1]    [,2]
## [1,] -0.9394  0.3428
## [2,] -0.3428 -0.9394

4. Almost all of the variability (94.4%) in the original temperatures is captured by the first PC. The second PC accounts for the remaining 5.6% of the total variability.

5. PRIN1 weights the January temperature about three times the July temperature. This is sensible because PRIN1 maximizes variation among linear combinations of the January and July temperatures. January temperatures are more variable, so they are weighted heavier in this linear combination.

6. The PCs PRIN1 and PRIN2 are standardized to have mean zero. This explains why some PRIN1 scores are negative, even though PRIN1 is a weighted average of the January and July temperatures, each of which is non-negative.
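The numerical claims in the comments above can be checked directly. The sketch below uses simulated two-feature data rather than the temperature file, so that it runs without a network connection. The variable names `x1`, `x2`, `dat`, `eig`, and `pca` are illustrative only. It also makes explicit one detail of R's implementation: `princomp()` computes variances with divisor n, whereas `var()` uses n − 1, which is why the standard deviation reported by `summary(temp.pca)` (12.322) is slightly smaller than the square root of the first eigenvalue (154.236).

```r
# Sketch with simulated data (not the temperature data): check the relationship
# between princomp() output and the eigen-decomposition of the covariance matrix.
set.seed(428)
n  <- 64
x1 <- rnorm(n, mean = 32, sd = 12)
x2 <- 60 + 0.35 * x1 + rnorm(n, sd = 3)
dat <- data.frame(x1, x2)

S   <- var(dat)      # sample covariance matrix (divisor n - 1)
eig <- eigen(S)      # eigenvalues and eigenvectors of S
pca <- princomp(~ x1 + x2, data = dat)

# PC variances are the eigenvalues of S; princomp() uses divisor n,
# so its variances are (n - 1)/n times the eigenvalues
all.equal(unname(pca$sdev^2), eig$values * (n - 1) / n)

# loadings agree with the eigenvectors up to the sign of each column
all.equal(abs(unclass(pca$loadings)), abs(eig$vectors), check.attributes = FALSE)

# total variance is preserved by the rotation
all.equal(sum(diag(var(pca$scores))), sum(diag(S)))

# scores are uncorrelated and centered at zero (up to rounding error)
round(cor(pca$scores)[1, 2], 10)
round(colMeans(pca$scores), 10)

# proportion of the total variance captured by each PC
eig$values / sum(eig$values)
```

Because the proportions are ratios of variances, they are the same whichever divisor is used.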
The built-in plots show the scores together with the original data directions (biplot) and the variance of each component in decreasing order (screeplot).

# a couple built-in plots
par(mfrow=c(1,2))
biplot(temp.pca)
screeplot(temp.pca)

[Figure: left panel, biplot of temp.pca (Comp.2 versus Comp.1 with the january and july directions); right panel, screeplot of temp.pca (variances of Comp.1 and Comp.2).]

13.2 PCA on Correlation Matrix

The coefficients in the first principal component reflect the relative sizes of the feature variances. The features with large variances have larger coefficients or loadings. This might be considered a problem, because variability is scale dependent and the principal component analysis on the raw data does not take scale into account. For example, if you measure height in meters but then change height to centimeters, the variability increases by a factor of 100*100 = 10,000. This might have a dramatic effect on the PCA.

You might prefer to standardize the features when they are measured on
different scales, or when the features have wildly different variances. The features are standardized to have mean zero and variance one by using the Z-score transformation: (Obs − Mean)/Std Dev. The PCA is then performed on the standardized data.

temp.z <- temp
# manual z-score
temp.z$january <- (temp.z$january - mean(temp.z$january)) / sd(temp.z$january)
# z-score using R function scale()
temp.z$july <- scale(temp.z$july)
# the manual z-score and scale() match
all.equal(temp.z$january, as.vector(scale(temp.z$january)))
## [1] TRUE
# scale() includes attributes for the mean() and sd() used for z-scoring
str(temp.z)
## 'data.frame': 64 obs. of 4 variables:
##  $ city   : Factor w/ 64 levels "albany","albuquerque",..: 39 48 33 56 21 27 64 62 31 36 ...
##  $ january: num 1.631 1.631 0.632 1.11 -0.187 ...
##  $ july   : num [1:64, 1] 1.1701 3.0456 1.131 -0.0803 -0.5101 ...
##   ..- attr(*, "scaled:center")= num 75.6
##   ..- attr(*, "scaled:scale")= num 5.12
##  $ id     : int 1 2 3 4 5 6 7 8 9 10 ...
head(temp.z)
##          city january     july id
## 1      mobile  1.6311  1.17005  1
## 2     phoenix  1.6311  3.04555  2
## 3 little rock  0.6322  1.13098  3
## 4  sacramento  1.1103 -0.08028  4
## 5      denver -0.1874 -0.51009  5
## 6    hartford -0.6229 -0.56869  6
# z-scored data has mean 0 and variance 1
colMeans(temp.z[,c("january","july")])
##    january       july
##  1.229e-16 -1.215e-15
var(temp.z[,c("january","july")])
##         january   july
## january  1.0000 0.7794
## july     0.7794 1.0000
# the correlation is used to construct the PCs
# (same as covariance for z-scored data)
cor(temp.z[,c("january","july")])
##         january   july
## january  1.0000 0.7794
## july     0.7794 1.0000
## Plot z-scored data
temp.z.pca <- princomp( ~ january + july, data = temp.z)
# create small data.frame with endpoints of PC lines through data
line.scale <- c(3, 3) # length of PCA lines to draw
# endpoints of lines to draw
temp.z.pca.line.endpoints <-
  data.frame(PC = c(rep("PC1", 2), rep("PC2", 2))
           , x = c(temp.z.pca$center[1] - line.scale[1] * temp.z.pca$loadings[1, 1]
                 , temp.z.pca$center[1] + line.scale[1] * temp.z.pca$loadings[1, 1]
                 , temp.z.pca$center[1] - line.scale[2] * temp.z.pca$loadings[1, 2]
                 , temp.z.pca$center[1] + line.scale[2] * temp.z.pca$loadings[1, 2])
           , y = c(temp.z.pca$center[2] - line.scale[1] * temp.z.pca$loadings[2, 1]
                 , temp.z.pca$center[2] + line.scale[1] * temp.z.pca$loadings[2, 1]
                 , temp.z.pca$center[2] - line.scale[2] * temp.z.pca$loadings[2, 2]
                 , temp.z.pca$center[2] + line.scale[2] * temp.z.pca$loadings[2, 2])
            )
temp.z.pca.line.endpoints
##    PC      x      y
## 1 PC1  2.121  2.121
## 2 PC1 -2.121 -2.121
## 3 PC2  2.121 -2.121
## 4 PC2 -2.121  2.121
# plot original data with PCA vectors overlaid
library(ggplot2)
p1 <- ggplot(temp.z, aes(x = january, y = july))
p1 <- p1 + geom_point() # points
p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis
                                  # good idea since both are in the same units
p1 <- p1 + geom_text(aes(label = id), vjust = -0.5, alpha = 0.25) # city labels
# plot PC lines
p1 <- p1 + geom_path(data = subset(temp.z.pca.line.endpoints, PC=="PC1"), aes(x=x, y=y), alpha=0.5)
p1 <- p1 + geom_path(data = subset(temp.z.pca.line.endpoints, PC=="PC2"), aes(x=x, y=y), alpha=0.5)
# label lines (add vjust or hjust to nudge a label if needed)
p1 <- p1 + annotate("text"
                  , x = temp.z.pca.line.endpoints$x[1]
                  , y = temp.z.pca.line.endpoints$y[1]
                  , label = as.character(temp.z.pca.line.endpoints$PC[1])
                  , size = 10)
p1 <- p1 + annotate("text"
                  , x = temp.z.pca.line.endpoints$x[3]
                  , y = temp.z.pca.line.endpoints$y[3]
                  , label = as.character(temp.z.pca.line.endpoints$PC[3])
                  , size = 10)
p1 <- p1 + labs(title = "Z-score temperature in Jan and July for selected cities")
print(p1)

[Figure: "Z-score temperature in Jan and July for selected cities"; scatterplot of july versus january on the z-score scale with points labeled by id and the PC1 and PC2 axes drawn through the origin.]

The covariance matrix computed from the standardized data is the correlation matrix. Thus, principal components based on the standardized data are computed from the correlation matrix. This is implemented by adding the cor = TRUE option on the princomp() procedure statement.

# perform PCA on correlation matrix
temp.pca2 <- princomp( ~ january + july, data = temp, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(temp.pca2)
## Importance of components:
##                        Comp.1 Comp.2
## Standard deviation     1.3340 0.4696
## Proportion of Variance 0.8897 0.1103
## Cumulative Proportion  0.8897 1.0000
# coefficients for PCs
loadings(temp.pca2)
##
## Loadings:
##         Comp.1 Comp.2
## january  0.707 -0.707
## july     0.707  0.707
##
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
# scores are coordinates of each observation on PC scale
head(temp.pca2$scores)
##    Comp.1   Comp.2
## 1  1.9964 -0.32862
## 2  3.3331  1.00804
## 3  1.2566  0.35547
## 4  0.7341 -0.84855
## 5 -0.4971 -0.22995
## 6 -0.8492  0.03861
# a couple built-in plots
par(mfrow=c(1,2))
biplot(temp.z.pca)
screeplot(temp.z.pca)

[Figure: left panel, biplot of temp.z.pca; right panel, screeplot of temp.z.pca.]

This plot is the same except for the top/right scale around the biplot and the variance scale on the screeplot.

The standardized features are dimensionless, so the PCs are not influenced by the original units of measure, nor are they affected by the variability in the features. The only important factor is the correlation between the features, which is not changed by standardization.

The PCs from the correlation matrix are

PRIN1 = +0.707 JAN + 0.707 JULY
and

PRIN2 = −0.707 JAN + 0.707 JULY.

PCA is an exploratory tool, so neither a PCA on the covariance matrix nor a PCA on the correlation matrix is always the “right” method. I often do both and see which analysis is more informative.

13.3 Interpreting Principal Components

You should try to interpret the linear combinations created by multivariate techniques. The interpretations are usually non-trivial, and some degree of creativity is involved.

The coefficients or loadings in a principal component reflect the relative contribution of the features to the linear combination. Most researchers focus more on the signs of the coefficients than on the magnitude of the coefficients. The principal components are then interpreted as weighted averages or comparisons of weighted averages.

A weighted average is a linear combination of features with non-negative loadings. The simplest case of a weighted average is an arithmetic average, where each feature has the same coefficient.

The difference Z = X − Y is a comparison of X and Y. The sign and magnitude of Z indicate which of X and Y is larger, and by how much. To see this, note that Z = 0 if and only if X = Y, whereas Z < 0 when X < Y and Z > 0 when X > Y.

In the temperature data, PRIN1 is a weighted average of January and July temperatures (signs of the loadings: JAN is + and JULY is +):

PRIN1 = +0.94 JAN + 0.34 JULY.

PRIN2 is a comparison of January and July temperatures (signs of the loadings: JAN is − and JULY is +):

PRIN2 = −0.34 JAN + 0.94 JULY.

Principal components often have positive and negative loadings when p ≥ 3. To interpret the components, group the features with + and − signs together and then interpret the linear combination as a comparison of weighted averages.
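The sign-grouping rule can be sketched as a small R helper. The function `describe.pc()` is hypothetical, written only for this illustration (it is not part of base R or of the original notes), and the loadings are the rounded temperature loadings quoted above.

```r
# Sketch: classify a principal component by the signs of its loadings.
# describe.pc() is a hypothetical helper written for this illustration.
describe.pc <- function(loadings) {
  pos <- names(loadings)[loadings > 0]   # features with + loadings
  neg <- names(loadings)[loadings < 0]   # features with - loadings
  if (length(pos) == 0 || length(neg) == 0) {
    # all loadings share one sign: a weighted average (up to an overall sign)
    paste("weighted average of", paste(c(pos, neg), collapse = ", "))
  } else {
    # mixed signs: a comparison of the + group with the - group
    paste0("comparison of (", paste(pos, collapse = ", "),
           ") with (", paste(neg, collapse = ", "), ")")
  }
}

prin1 <- c(JAN = 0.94, JULY = 0.34)
prin2 <- c(JAN = -0.34, JULY = 0.94)

describe.pc(prin1)    # "weighted average of JAN, JULY"
describe.pc(prin2)    # "comparison of (JULY) with (JAN)"
describe.pc(-prin1)   # same description: flipping every sign changes nothing
```

The helper looks only at signs, so it is a first pass at interpretation. The magnitudes still indicate how heavily each feature is weighted within its group.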
You can often simplify the interpretation of principal components by mentally eliminating features from the linear combination that have relatively small (in magnitude) loadings or coefficients. This strategy does not carry over to all multivariate analyses, so I will be careful about this issue when necessary.

13.4 Example: Painted turtle shells

Jolicoeur and Mosimann gave the length, width, and height in mm of the carapace (shell) for a sample of 24 female painted turtles. I perform a PCA on the original data and on the standardized data.

#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_shells.dat"
shells <- read.table(fn.data, header = TRUE)
str(shells)
## 'data.frame': 24 obs. of 3 variables:
##  $ length: int 98 103 103 105 109 123 123 133 133 133 ...
##  $ width : int 81 84 86 86 88 92 95 99 102 102 ...
##  $ height: int 38 38 42 42 44 50 46 51 51 51 ...
head(shells)
##   length width height
## 1     98    81     38
## 2    103    84     38
## 3    103    86     42
## 4    105    86     42
## 5    109    88     44
## 6    123    92     50
## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(shells, upper = list(continuous = "points")
                   , lower = list(continuous = "cor")
            )
print(p)
# detach packages after use so reshape2 works (old reshape (v.1) conflicts);
# only detach a package if it is actually attached
for (pkg in c("package:GGally", "package:reshape")) {
  if (pkg %in% search()) detach(pkg, unload = TRUE, character.only = TRUE)
}
## 3D scatterplot
library(scatterplot3d)
par(mfrow=c(1,1))
with(shells, {
  scatterplot3d(x=length
              , y=width
              , z=height
              , main="Shells 3D Scatterplot"
              , type = "h"             # lines to the horizontal xy-plane
              , color="blue", pch=19   # filled blue circles
              #, highlight.3d = TRUE   # makes color change with z-axis value
              )
})
#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height) })

[Figure: scatterplot matrix of length, width, and height; the pairwise correlations shown are 0.966, 0.971, and 0.973.]

[Figure: "Shells 3D Scatterplot" of height against length and width.]

13.4.1 PCA on shells covariance matrix

The plots show that the shell measurements are strongly positively correlated, which is not surprising. Let us perform a PCA on the covariance matrix and interpret the results.

# perform PCA on covariance matrix
shells.pca <- princomp( ~ length + width + height, data = shells)
# standard deviation and proportion of variation for each component
summary(shells.pca)
## Importance of components:
##                        Comp.1  Comp.2   Comp.3
## Standard deviation     25.497 2.54708 1.653746
## Proportion of Variance  0.986 0.00984 0.004148
## Cumulative Proportion   0.986 0.99585 1.000000
# coefficients for PCs
.17 Length + 0.81 Length + 0.333 0.94 Height.29 Width) + 0.1 Comp. and height.291 height 0. and a comparison of height with length and width.000 1.30 Height PRIN2 = −0.15 Height) PRIN3 = −(0.82 Width + 0. so larger lengths tend to occur with larger widths and heights. The primary way the shells vary is with regards to their overall size. width. Length and width are grouped in PRIN3 because they have negative loadings.814 0. The form of PRIN1 makes sense conceptually. Jolicouer and Mosimann interpreted the second and third principal components as measures of shape.555 -0.000 Proportion Var 0. PRIN1 is a weighted average of the carapace measurements.2 Comp. 2 Comp.1 Comp.000000 # coefficients for PCs loadings(shells.pca) ## ## ## ## ## ## ## ## ## ## ## Loadings: Comp.13. True.000 The three principal components for the standardized data are PRIN1 = 0.577 0.01145 0.578 -0.pca <.3 length -0.333 0.14 Length + 0.333 0.160820 Proportion of Variance 0.667 1.2 Comp.52 Width + 0.522 height -0.000 Proportion Var 0.63 Width) + 0.9799 0.766 -0. but not obvious.628 -0. . The standardized features are essentially interchangeable with regards to the construction of the first principal component. Here.9799 0. add the cor = TRUE option.99138 1.2 Comp.pca) ## ## ## ## ## Importance of components: Comp.000 1.80 Length + (0. and height. # perform PCA on correlation matrix shells. The first principal component accounts for 98% of the total variability in the standardized data.137 0. data = shells.2 349 PCA on shells correlation matrix For the analysis on the correlation matrix. Little loss of information is obtained by summarizing the standardized data using PRIN1.28 Height).58 Width + 0. which is essentially an average of length.princomp( ~ length + width + height.333 0.4.284 Comp.77 Height PRIN3 = −0. 
The total variability for correlation is always the number p of features because it is the sum of the variances.000 1.333 Cumulative Var 0.18530 0.4: Example: Painted turtle shells 13.008621 Cumulative Proportion 0. p = 3. PRIN2 and PRIN3 are measures of shape.1 Comp.577 -0.3 Standard deviation 1. so they must be weighted similarly.58 Height PRIN2 = −(0.3 SS loadings 1. cor = TRUE) # standard deviation and proportion of variation for each component summary(shells. width.58 Length + 0. The loadings in the first principal component are approximately equal because the correlations between pairs of features are almost identical.1 Comp.804 width -0.7146 0. For simplicity. consider the following data plot of two features. and the implied principal components.5 Ch 13: Principal Component Analysis Why is PCA a Sensible Variable Reduction Technique? My description of PCA provides little insight into why principal components are reasonable summaries for variable reduction. .350 13. Given the value of PRIN1. this is plausible. One can show mathematically that PRIN1 is the best (in some sense) linear combination of the two features to predict the original two features simultaneously. you get a good prediction of the original feature scores by moving PRIN1 units along the axis of maximal variation in the feature space. Intuitively. and then rotating the axes appropriately. you know the direction for the axis of maximal variation. Specifically. The PC scores for an observation are obtained by projecting the feature scores onto the axes of maximal and minimal variation. In a PCA. why does it make sense for researchers to consider the linear combination of the original features with the largest variance as the best single variable summary of the data? There is an alternative description of PCA that provides more insight into this issue. but the improvement is slight when the added principal components have little variability. 
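The projection description above can be checked numerically. The following is a minimal sketch on simulated two-feature data (the data, the seed, and the helper name sse.predict are assumptions made for illustration, not part of these notes). It projects the centered features onto the eigenvectors of the covariance matrix to get the PC scores, then measures how well a single linear combination predicts both features at once.

```r
# Simulated two-feature data (hypothetical, for illustration only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n, sd = 0.3)
X  <- cbind(Feature1 = x1, Feature2 = x2)

# Center the features; eigenvectors of the covariance matrix give the
# axes of maximal (column 1) and minimal (column 2) variation
Xc <- scale(X, center = TRUE, scale = FALSE)
e  <- eigen(cov(X))
V  <- e$vectors

# PC scores: project the centered features onto those axes
scores <- Xc %*% V    # column 1 is PRIN1, column 2 is PRIN2

# Hypothetical helper: total squared error when BOTH features are
# predicted (by least squares) from the single combination z = Xc %*% a
sse.predict <- function(a) {
  z   <- drop(Xc %*% a)
  fit <- lm(Xc ~ z)             # one regression per feature
  sum(residuals(fit)^2)
}

sse <- c(PRIN1    = sse.predict(V[, 1]),
         Feature1 = sse.predict(c(1, 0)),
         Feature2 = sse.predict(c(0, 1)))
sse
```

In this sketch PRIN1 should give the smallest total error of the three candidates, and that error should equal (n − 1) times the variance of PRIN2, which is the variation left over after moving along the first axis.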
[Figure: scatterplot of Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid, and the same points plotted on the PRIN1 and PRIN2 axes.]

The LS line from regressing feature 2 on feature 1 gives the best prediction for feature 2 scores when the feature 1 score is known. Similarly, the LS line from regressing feature 1 on feature 2 gives the best prediction for feature 1 scores when the feature 2 score is known. PRIN1 is the best linear combination of features 1 and 2 to predict both features simultaneously. Note that feature 1 and feature 2 are linear combinations as well!

This idea generalizes. The first k principal components give the best simultaneous prediction of the original p features, among all possible choices of k uncorrelated unit-length linear combinations of the features. Prediction of the original features improves as additional components are added, but the improvement is slight when the added principal components have little variability. Thus, summarizing the data using the principal components with maximum variation is a sensible strategy for data reduction.

13.5.1 A Warning on Using PCA as a Variable Reduction Technique

Some researchers view PCA as a “catchall” technique for reducing the number of variables that need to be considered in an analysis. They will replace the original variables with a small number of principal components that explain most of the variation in the original data and proceed with an analysis on the principal components.

This strategy is not always sensible, especially if a primary interest in the analysis is a comparison of heterogeneous groups. If the group structure was ignored in the PCA analysis, then the linear combinations retained by the researcher may contain little information for distinguishing among groups. For example, consider the following data plot on two features and two groups, and the implied principal components.

[Figure: Feature2 versus Feature1 for group 1 and group 2 with the PRIN1 and PRIN2 axes overlaid (rotated by an angle θ), and the same points plotted on the PRIN1 and PRIN2 axes. The groups are separated along PRIN2.]

Although PRIN1 explains most of the variability in the two features (ignoring the groups), little of the total variation is due to group differences. If the researcher reduced the two features to the first principal component, he would be throwing away most of the information for distinguishing between the groups. PRIN2 accounts for little of the total variation in the features, but most of the variation in PRIN2 is due to group differences.

If a comparison of the two groups was the primary interest, then the researcher should use discriminant analysis instead. Although there is little gained by reducing two variables to one, this principle always applies in multivariate problems. In discriminant analysis, a stepwise selection of variables can be implemented to eliminate features that have no information for distinguishing among groups. This is data reduction as it should be practiced: with a final goal in mind.

Variable reduction using PCA followed by group comparisons might be fruitful if you are fortunate enough to have the directions with large variation correspond to the directions of group differences. For example, in the plot below, the first principal component is a linear combination of the features that distinguishes between the groups. A comparison of the groups based on the first principal component will lead to similar conclusions as a discriminant analysis.

[Figure: Feature2 versus Feature1 for group 1 and group 2 with the PRIN1 and PRIN2 axes overlaid, and the same points plotted on the PRIN1 and PRIN2 axes. The groups are separated along PRIN1.]

There is nothing unreasonable about using principal components in a comparison of groups, provided you recognize that the principal component scores with the largest variability need not be informative for group comparisons!

In summary, PCA should be used to summarize the variation within a homogeneous group, and should not, in general, be used as a data reduction tool prior to a comparison across groups. The same concern applies to using PCA for identifying groups of similar objects (use cluster analysis instead), or when factor analysis is used prior to a group comparison.

13.5.2 PCA is Used for Multivariate Outlier Detection

An outlier in a multidimensional data set has atypical readings on one or on several features simultaneously. These observations can often be found in univariate plots of the lead principal component scores, even when outliers are not extreme in any individual feature.

[Figure: Feature2 versus Feature1 with the PRIN1 and PRIN2 axes overlaid, and the same points plotted on the PRIN1 and PRIN2 axes; one point is unusual on a principal component without being extreme on either feature.]

13.6 Example: Sparrows, for Class Discussion

After a severe storm in 1898, a number of sparrows were taken to the biological laboratory at Brown University in Providence, Rhode Island. H. Bumpus [1] measured several morphological characteristics on each bird. The data here correspond to five measurements on a sample of 49 females. The measurements are the total length, alar extent, beak-head length, humerus length, and length of keel of sternum.

http://media-2.web.britannica.com/eb-media/46/51946-004-D003BC49.gif

[1] Bumpus, Hermon C. 1898. Eleventh lecture. The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol. Lectures: Woods Hole Marine Biological Laboratory, 209–225.

Let us look at the output, paying careful attention to the interpretations of the principal components (zeroing out small loadings).
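Zeroing out small loadings can be done at print time. The sketch below uses R's built-in USArrests data purely as a stand-in so that the code is self-contained (it is not one of the data sets in these notes), and the threshold 0.3 is an arbitrary choice for illustration. The cutoff argument of the print method for loadings blanks entries smaller in magnitude than the threshold; the last lines do the same thing by hand.

```r
# Stand-in data: built-in USArrests (4 features), PCA on the correlation matrix
fit <- princomp(USArrests, cor = TRUE)

# The print method for loadings hides entries with magnitude below cutoff
# (the default cutoff is 0.1); 0.3 here is an arbitrary illustrative choice
print(loadings(fit), cutoff = 0.3)

# The same idea by hand: set small loadings to zero and round the rest
L        <- unclass(loadings(fit))            # plain numeric matrix
L.simple <- ifelse(abs(L) < 0.3, 0, round(L, 2))
L.simple
```

Only the display changes. The component scores are still computed from the full set of loadings, so the simplified table is an aid to interpretation rather than a new analysis.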
How many components seem sufficient to capture the total variation in the morphological measurements?

#### Example: Sparrows
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_sparrows.dat"
sparrows <- read.table(fn.data, header = TRUE)
str(sparrows)
## 'data.frame': 49 obs. of 5 variables:
## $ Total   : int 156 153 155 157 164 158 161 157 158 155 ...
## $ Alar    : int 245 240 243 238 248 240 246 235 244 236 ...
## $ BeakHead: num 31.6 31 31.5 30.9 32.7 31.3 ...
## $ Humerus : num 18.5 18.4 18.6 18.4 19.1 18.6 ...
## $ Keel    : num 20.5 20.6 20.3 20.2 21.2 22 ...
head(sparrows)
##   Total Alar BeakHead Humerus Keel
## 1   156  245     31.6    18.5 20.5
## 2   153  240     31.0    18.4 20.6
## 3   155  243     31.5    18.6 20.3
## 4   157  238     30.9    18.4 20.2
## 5   164  248     32.7    19.1 21.2
## 6   158  240     31.3    18.6 22.0

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(sparrows, upper = list(continuous = "points")
            , lower = list(continuous = "cor")
            )
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of Total, Alar, BeakHead, Humerus, and Keel; the pairwise correlations range from 0.529 to 0.769.]

# perform PCA on covariance matrix
sparrows.pca <- princomp( ~ Total + Alar + BeakHead + Humerus + Keel
                        , data = sparrows)
# standard deviation and proportion of variation for each component
summary(sparrows.pca)
## Importance of components:
##                        Comp.1 Comp.2  Comp.3   Comp.4   Comp.5
## Standard deviation     5.8829 2.1281 0.78469 0.552957 0.275613
## Proportion of Variance 0.8623 0.1128 0.01534 0.007618 0.001893
## Cumulative Proportion  0.8623 0.9751 0.99049 0.998107 1.000000
# coefficients for PCs
# loadings(sparrows.pca) # print method for loadings() uses cutoff = 0.1 by default
print(loadings(sparrows.pca), cutoff = 0) # to show all values

[Output: 5 x 5 table of loadings of Total, Alar, BeakHead, Humerus, and Keel on Comp.1 to Comp.5.]

# perform PCA on correlation matrix
sparrows.pca <- princomp( ~ Total + Alar + BeakHead + Humerus + Keel
                        , data = sparrows, cor = TRUE)
# standard deviation and proportion of variation for each component
summary(sparrows.pca)
## Importance of components:
##                        Comp.1 Comp.2  Comp.3  Comp.4 Comp.5
## Standard deviation     1.9026 0.7268 0.62139 0.54902 0.4056
## Proportion of Variance 0.7239 0.1056 0.07723 0.06029 0.0329
## Cumulative Proportion  0.7239 0.8296 0.90682 0.96710 1.0000
# coefficients for PCs
#loadings(sparrows.pca)
print(loadings(sparrows.pca), cutoff = 0) # to show all values

[Output: 5 x 5 table of loadings. The Comp.1 loadings are Total 0.452, Alar 0.461, BeakHead 0.450, Humerus 0.470, and Keel 0.399.]

# a couple built-in plots
par(mfrow=c(1,2))
biplot(sparrows.pca)
screeplot(sparrows.pca)

[Figure: biplot of the first two components (observations numbered 1 to 49, with arrows for Total, Alar, BeakHead, Humerus, and Keel) and a scree plot of the variances of Comp.1 to Comp.5.]

13.7 PCA for Variable Reduction in Regression

I will outline an analysis where PCA was used to create predictors in a regression model.
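As a rough sketch of what "PCA to create predictors" means in code, the example below uses simulated data (the variables y, x1, x2, x3 and all numbers are hypothetical, chosen only to give three correlated predictors). It computes principal component scores from the predictors and then uses those scores as the regressors.

```r
# Simulated data with three strongly correlated predictors (hypothetical)
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- x1 + rnorm(n, sd = 0.5)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# PCA on the standardized predictors only (the response is not used)
pc <- princomp( ~ x1 + x2 + x3, data = dat, cor = TRUE)
sc <- as.data.frame(pc$scores)   # columns Comp.1, Comp.2, Comp.3
sc$y <- dat$y

# Regression on the original predictors and on the PC scores
fit.orig <- lm(y ~ x1 + x2 + x3, data = dat)
fit.all  <- lm(y ~ Comp.1 + Comp.2 + Comp.3, data = sc)  # all components
fit.one  <- lm(y ~ Comp.1, data = sc)                    # reduced model

# The scores are uncorrelated, so the collinearity among x1, x2, x3 is gone
round(cor(pc$scores), 3)
```

Keeping all three components reproduces the fitted values of the original regression, because the scores are just a re-expression of the same predictors. Any reduction comes from dropping components, as in fit.one, and whether that is sensible depends on how well the dropped components predict the response.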
The data were selected from the Berkeley Guidance Study, a longitudinal monitoring of children born in Berkeley, California, between 1928 and 1929. The variables selected from the study are
    ID    an identification number,
    WT2   weight at age 2 in kg,
    HT2   height at age 2 in cm,
    WT9   weight at age 9,
    HT9   height at age 9,
    LG9   leg circumference at age 9 in cm,
    ST9   a composite measure of strength at age 9 (higher is stronger),
    WT18  weight at age 18,
    HT18  height at age 18,
    LG18  leg circumference at age 18,
    ST18  a composite measure of strength at age 18, and
    SOMA  somatotype on a 7-point scale, as a measure of fatness (1=slender to 7=fat) determined using a photo taken at age 18.

Data on 26 boys are given below. We are interested in building a regression model to predict somatotype (SOMA) from the other variables.

#### Example: BGS (Berkeley Guidance Study)
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch13_bgs.dat"
bgs <- read.table(fn.data, header = TRUE)
str(bgs)
## 'data.frame': 26 obs. of 12 variables:
## $ ID  : int 201 202 203 204 205 206 207 209 210 211 ...
## $ WT2 : num 13.6 12.7 12.6 14.8 12.7 11.9 ...
## $ HT2 : num 90.2 91.4 86.4 87.6 86.7 88.1 ...
## $ WT9 : num 41.5 31 30.1 34.1 24.5 29.8 ...
## $ HT9 : num 139 144 136 135 129 ...
## $ LG9 : num 31.6 26 26.6 28.2 24.2 26.7 ...
## $ ST9 : int 74 73 64 75 63 77 45 70 61 74 ...
## $ WT18: num 110.2 79.4 76.3 74.5 55.7 ...
## $ HT18: num 179 195 184 179 172 ...
## $ LG18: num 44.1 36.1 36.9 37.3 31 37 ...
## $ ST18: int 226 252 216 220 200 215 152 189 183 193 ...
## $ SOMA: num 7 4 6 3 1.5 3 6 4 3 3 ...
head(bgs)
##    ID  WT2  HT2  WT9   HT9  LG9 ST9  WT18  HT18 LG18 ST18 SOMA
## 1 201 13.6 90.2 41.5 139.4 31.6  74 110.2 179.0 44.1  226  7.0
## 2 202 12.7 91.4 31.0 144.3 26.0  73  79.4 195.1 36.1  252  4.0
## 3 203 12.6 86.4 30.1 136.5 26.6  64  76.3 183.7 36.9  216  6.0
## 4 204 14.8 87.6 34.1 135.4 28.2  75  74.5 178.7 37.3  220  3.0
## 5 205 12.7 86.7 24.5 128.9 24.2  63  55.7 171.5 31.0  200  1.5
## 6 206 11.9 88.1 29.8 136.0 26.7  77  68.2 181.8 37.0  215  3.0

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(bgs)
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of the twelve BGS variables with pairwise correlations in the upper panels.]

As an aside, there are other ways to visualize the linear relationships more quickly. The ellipse library has a function plotcorr(), though its output is less than ideal. An improvement has been made with an updated version [2] of the plotcorr() function.

## my.plotcorr example, see associated R file for my.plotcorr() function code
# calculate correlations to plot
corr.bgs <- cor(bgs)
# Change the column and row names for clarity
colnames(corr.bgs) = colnames(bgs)
rownames(corr.bgs) = colnames(corr.bgs)
# set colors to use (blue positive, red negative)
colsc=c(rgb(241, 54, 23, maxColorValue=255), 'white', rgb(0, 61, 104, maxColorValue=255))
colramp = colorRampPalette(colsc, space='Lab')
colors = colramp(100)
# plot correlations, colored ellipses on lower diagonal, numerical correlations on upper
my.plotcorr(corr.bgs, col=colors[((corr.bgs + 1)/2) * 100]
          , diag='ellipse', upper.panel="number", mar=c(0,2,0,0), main='Correlations')

[2] http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library

[Figure: "Correlations", with colored ellipses below the diagonal and numerical correlations above it for the twelve BGS variables.]

It is reasonable to expect that the characteristics measured over time, for example HT2, HT9, and HT18, are strongly correlated. Evidence supporting this hypothesis is given in the following output, which summarizes correlations within subsets of the predictors. Two of the three subsets include measures over time on the same characteristic.

cor(bgs[,c("WT2", "WT9", "WT18")])
##         WT2    WT9   WT18
## WT2  1.0000 0.5792 0.2159
## WT9  0.5792 1.0000 0.7089
## WT18 0.2159 0.7089 1.0000
cor(bgs[,c("HT2", "HT9", "HT18")])
##         HT2    HT9   HT18
## HT2  1.0000 0.7758 0.6847
## HT9  0.7758 1.0000 0.8645
## HT18 0.6847 0.8645 1.0000
cor(bgs[,c("LG9", "LG18", "ST9", "ST18")])
##         LG9   LG18    ST9   ST18
## LG9  1.0000 0.6645 0.5239 0.3441
## LG18 0.6645 1.0000 0.2085 0.2845
## ST9  0.5239 0.2085 1.0000 0.6596
## ST18 0.3441 0.2845 0.6596 1.0000

library(ggplot2)
p1 <- ggplot(bgs, aes(x = WT2 , y = WT9 )) + geom_point()
p2 <- ggplot(bgs, aes(x = WT2 , y = WT18)) + geom_point()
p3 <- ggplot(bgs, aes(x = WT9 , y = WT18)) + geom_point()
p4 <- ggplot(bgs, aes(x = HT2 , y = HT9 )) + geom_point()
p5 <- ggplot(bgs, aes(x = HT2 , y = HT18)) + geom_point()
p6 <- ggplot(bgs, aes(x = HT9 , y = HT18)) + geom_point()
p7 <- ggplot(bgs, aes(x = LG9 , y = LG18)) + geom_point()
p8 <- ggplot(bgs, aes(x = ST9 , y = ST18)) + geom_point()
library(gridExtra)
# the title argument is top= in current gridExtra (older versions used main=)
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol=3, top="Selected BGS variables")

[Figure: "Selected BGS variables", eight scatterplots: WT9 vs WT2, WT18 vs WT2, WT18 vs WT9, HT9 vs HT2, HT18 vs HT2, HT18 vs HT9, LG18 vs LG9, and ST18 vs ST9.]
Strong correlation among predictors can cause collinearity problems in regression. The presence of collinearity makes the interpretation of regression effects more difficult, and can wreak havoc with the numerical stability of certain algorithms used to compute least squares summaries. A natural way to avoid collinearity and improve interpretation of regression effects is to use uncorrelated linear combinations of the original predictors, such as principal components, to build a model.

The interpretation of regression effects changes when linear combinations of predictors are used in place of the original predictors. However, two models will be equivalent in the sense of giving identical fitted values when p linearly independent combinations of p predictors are used. The fitted model can be expressed in terms of the original predictors, or in terms of the linear combinations. Similarly, standardizing the original predictors does not change the significance of individual regression effects nor does it change the interpretation of the model.

A reasonable strategy with the Berkeley data might be to find principal components for the three height variables separately, the three weights separately, and so on. It may not make sense to combine the four different types of measures (HT, WT, ST, and LG) together because the resulting linear combinations would likely be uninterpretable.

Output from a PCA on the four sets of standardized measures is given below. Two sets (ST9, ST18) and (LG9, LG18) have two predictors. The PCs on the strength and leg measures are essentially the sum and difference between the two standardized features. The loadings have magnitude 0.707 = (1/2)^(1/2) to satisfy the unit-length restriction.

Here are some comments on how I might use the PCA to build a regression model to predict somatotype. First, I would not necessarily use the given PCs, but would instead use interpretable linear combinations that were nearly identical to the PCs. For example, the first principal component of the weights is roughly an unweighted average of the weights at ages 2, 9, and 18. I would use (WT2+WT9+WT18)/3 instead. The overall sum of squared loadings is not important in the regression analysis. Only the relative sizes are important. Following this idea, what linear combinations of the heights are reasonable?

Second, you should not assume that principal components with low variance are unimportant for predicting somatotype. Recall our earlier discussion on potential problems with ignoring low variability components when group comparisons are the primary interest. The same problem is possible here.

Feel free to explore these ideas at your leisure.

# WT
bgsWT.pca <- princomp( ~ WT2 + WT9 + WT18, data = bgs, cor = TRUE)
summary(bgsWT.pca)
## Importance of components:
##                        Comp.1 Comp.2  Comp.3
## Standard deviation      1.424 0.8882 0.42764
## Proportion of Variance  0.676 0.2630 0.06096
## Cumulative Proportion   0.676 0.9390 1.00000
print(loadings(bgsWT.pca), cutoff = 0)
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## WT2  -0.492  0.781  0.384
## WT9  -0.665 -0.053 -0.745
## WT18 -0.562 -0.622  0.545
##
##                Comp.1 Comp.2 Comp.3
## SS loadings     1.000  1.000  1.000
## Proportion Var  0.333  0.333  0.333
## Cumulative Var  0.333  0.667  1.000

# HT
bgsHT.pca <- princomp( ~ HT2 + HT9 + HT18, data = bgs, cor = TRUE)
summary(bgsHT.pca)
## Importance of components:
##                        Comp.1 Comp.2  Comp.3
## Standard deviation     1.5976 0.5724 0.34652
## Proportion of Variance 0.8508 0.1092 0.04003
## Cumulative Proportion  0.8508 0.9600 1.00000
print(loadings(bgsHT.pca), cutoff = 0)
##
## Loadings:
##      Comp.1 Comp.2 Comp.3
## HT2  -0.554  0.800  0.231
## HT9  -0.599 -0.190 -0.778
## HT18 -0.578 -0.570  0.584
##
##                Comp.1 Comp.2 Comp.3
## SS loadings     1.000  1.000  1.000
## Proportion Var  0.333  0.333  0.333
## Cumulative Var  0.333  0.667  1.000

# LG
bgsLG.pca <- princomp( ~ LG9 + LG18, data = bgs, cor = TRUE)
summary(bgsLG.pca)
## Importance of components:
##                        Comp.1 Comp.2
## Standard deviation     1.2901 0.5792
## Proportion of Variance 0.8322 0.1678
## Cumulative Proportion  0.8322 1.0000
print(loadings(bgsLG.pca), cutoff = 0)

[Output: 2 x 2 table of loadings for LG9 and LG18. Every loading has magnitude 0.707; the two loadings have the same sign on Comp.1 and opposite signs on Comp.2.]

# ST
bgsST.pca <- princomp( ~ ST9 + ST18, data = bgs, cor = TRUE)
summary(bgsST.pca)
## Importance of components:
##                        Comp.1 Comp.2
## Standard deviation     1.2883 0.5834
## Proportion of Variance 0.8298 0.1702
## Cumulative Proportion  0.8298 1.0000
print(loadings(bgsST.pca), cutoff = 0)

[Output: 2 x 2 table of loadings for ST9 and ST18. Every loading has magnitude 0.707; the two loadings have the same sign on Comp.1 and opposite signs on Comp.2.]

Chapter 14 Cluster Analysis

14.1 Introduction

Cluster analysis is an exploratory tool for locating and grouping observations that are similar to each other across features. Cluster analysis can also be used to group variables that are similar across observations.

Clustering or grouping is distinct from discriminant analysis and classification. In discrimination problems there are a given number of known groups to compare or distinguish. The aim in cluster analysis is to define groups based on similarities. The clusters are then examined for underlying characteristics that might help explain the grouping.

There are a variety of clustering algorithms [1]. I will discuss a simple (agglomerative) hierarchical clustering method for grouping observations. The method begins with each observation as an individual cluster or group. The two most similar observations are then grouped, giving one cluster with two observations. The remaining clusters have one observation. The clusters are then joined sequentially until one cluster is left.

[1] http://cran.r-project.org/web/views/Cluster.html
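The first grouping step of the agglomerative method described above can be sketched directly from a distance matrix. The five observations below are made up for illustration (they are not data from these notes), and Euclidean distance is assumed as the measure of similarity.

```r
# Five hypothetical observations on two features (made up for illustration)
toy <- data.frame(x1 = c(1, 2, 6, 7, 12),
                  x2 = c(1, 2, 5, 7,  1))

# Euclidean distances between all pairs of observations
d <- as.matrix(dist(toy))
d[upper.tri(d, diag = TRUE)] <- NA     # keep each pair only once

# The two most similar observations are the pair with the smallest distance
closest <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
closest

# hclust() performs the full sequence of merges; row i of $merge lists what
# was joined at step i, with negative numbers denoting single observations
hc <- hclust(dist(toy), method = "average")
hc$merge
```

In this sketch the closest pair found by hand is the same pair that appears in the first row of hc$merge. The later rows record the remaining merges until a single cluster is left.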
suppose eight observations are collected on two features X1 and X2. The clusters are then examined for underlying characteristics that might help explain the grouping. A plot of the data is given below. Each observation is a cluster. Clustering or grouping is distinct from discriminant analysis and classification. The two most similar observations are then grouped.1. The method begins with each observation as an individual cluster or group. In discrimination problems there are a given number of known groups to compare or distinguish.1 Introduction Cluster analysis is an exploratory tool for locating and grouping observations that are similar to each other across features. 14. The remaining clusters have one observation.Chapter 14 Cluster Analysis 14. table(text = " x1 x2 4 8 6 6 10 11 11 8 17 5 19 3 20 11 21 2 ".pca <.p2 + geom_point() # points p2 <. aes(x = Comp.5) # labels p1 <. Merge (fuse or combine) the remaining two clusters. Form a new cluster by grouping the two clusters that are most similar. alpha = 0.pca$scores). hjust = -0.princomp( ~ x1 + x2. aes(x = x1.ggplot(as.1: Introduction 369 Step 2.frame by reading the text table intro <. Form a new cluster by grouping the two clusters that are most similar. header=TRUE) str(intro) ## 'data.p1 + geom_text(aes(label = 1:nrow(intro)). #### Example: Fake data cluster illustration # convert to a data. Finally Use a tree or dendrogram to summarize the steps in the cluster formation. Step 3.2)) p2 <. PCA scores") print(p2) . This leaves six clusters.frame': 8 obs.p1 + labs(title = "Introductory example") print(p1) # plot PCA scores (data on PC-scale centered at 0) library(ggplot2) p2 <.5) # lab p2 <.5.read.14. Step 8.p1 + geom_point() # points p1 <.pca$scores)).frame(intro. This leaves seven clusters. or closest to each other.p2 + labs(title = "Introductory example. data = intro) # plot original data library(ggplot2) p1 <. or closest to each other.5. 
of 2 variables: ## $ x1: int 4 6 10 11 17 19 20 21 ## $ x2: int 8 6 11 8 5 3 11 2 # perform PCA on covariance matrix intro.data.p2 + geom_text(aes(label = rownames(intro. y = x2)) p1 <.1. hjust = -0. Step 4–7.ggplot(intro. alpha = 0. Continue the process of merging clusters one at a time. y = Comp. dist(intro) intro.clus) . left. cutree(intro. cex.hc.370 Ch 14: Cluster Analysis Introductory example ● 3 ● 7 Introductory example. 2. PCA scores 8 ● ● 2 6 ● 2 9 ● 1 6 ● ● 4 Comp. 2. sub = NULL) } par(op) # reset plot options .hc.average <.hclust(intro. top.par(no. k = i.txt = 1. 1)) # margins are c(bottom. color = TRUE. which will be discussed in more detail after the plots.5.clus in 7:2) { clusplot(intro. # create distance matrix between points intro. "clusters"). method = "average") op <.2). labels = 2.dist. The clustering algorithm order for average linkage is plotted here. lines = 0 . col.1 Here are the results of one distance measure.average.2 x2 ● 2 ● 5 ● 5 0 ● ● ● 10 15 3 6 ● 5 4 −2 −4 3 1 8 −6 20 ● 7 −5 x1 0 5 10 Comp. right) library(cluster) for (i. main = paste(i. mar = c(2.clus.readonly = TRUE) # save original plot options par(mfrow = c(3.txt = "gray20" .dist <. cex = 2. see http://gastonsanchez.385 3.dist <.dist ## ## 2 ## 3 ## 4 2 1 2.708 7.14.000 2 3 6.html for several examples.dist(intro) intro.com/blog/how-to/ 2012/10/03/Dendrograms.828 6. .403 5.1: Introduction 371 7 clusters 6 clusters 5 22 1 ●2 6 2 2 6 8 1 55 45 −2 44 34 23 −4 −4 33 ●1 0 Component 2 0 ●1 −2 6 8 67 −6 −6 77 −5 0 5 10 −5 0 5 clusters1 Component 4 clusters1 Component 38 1 ●2 1 ●2 6 2 2 6 35 −2 24 24 3 −4 −4 3 ●1 5 0 Component 2 ●1 0 10 −2 4 8 5 47 −6 −6 57 −5 0 5 10 −5 0 3 clusters1 Component 8 Component 2 0 −2 ●4 2 ●4 ●3 7 −6 −6 37 0 5 1 ●1 5 −4 −4 ●3 ●2 6 0 ●1 5 2 4 1 −2 ●2 6 −5 10 2 clusters1 Component 2 28 5 10 −5 0 5 10 The order of clustering is summarized in the average linkage dendrogram on the right reading the tree from the bottom upwards2. 
# create distance matrix between points intro.162 4 5 6 7 There are many ways to create dendrograms in R. ## Use ’plot’ instead.828 6.readonly = TRUE) # save original plot options par(mfrow = c(1.487 15.811 16.342 15.3)) # margins are c(bottom. hang = -1.236 9. hang = -1. left. The single linkage distance is the minimum distance between points across two clusters. "single") intro.342 12.par(no.042 9. main = "complete") ## Warning: ’plclust’ is deprecated.dist.045 9. main = "average") ## Warning: ’plclust’ is deprecated.hc.028 11. "complete") intro.dist hclust (*.hclust(intro.average.062 2.524 14.220 6.hc.372 ## ## ## ## Ch 14: Cluster Analysis 5 6 7 8 13. ## Use ’plot’ instead.dist hclust (*.1. method = "complete") plclust(intro. .000 8.hc. method = "single") plclust(intro.hc.2 Distance measures There are several accepted measures of distance between clusters. top.average <.complete.dist.hclust(intro.434 14. right) intro.single.662 2.708 5. ## See help("Deprecated") par(op) # reset plot options complete average 6 4 Height 10 Height 3 7 5 6 8 1 2 3 4 1 2 3 4 7 5 6 8 1 2 3 4 7 5 6 8 0 2 5 2 Height 4 8 5 10 15 6 12 7 single intro.dist. ## See help("Deprecated") intro.complete <. ## See help("Deprecated") intro.279 18. ## Use ’plot’ instead.dist hclust (*.hc.000 9.hc. method = "average") plclust(intro.hclust(intro.866 10. main = "single") ## Warning: ’plclust’ is deprecated.708 13.single <.055 op <. "average") 14. hang = -1.213 11. Average linkage tends to produce clusters with similar variability. Complete linkage is biased towards producing clusters with roughly equal diameters.0 1. The two clusters that are closest to each other are merged.2: Example: Mammal teeth 373 The complete linkage distance is the maximum distance between points across two clusters.0 5 2 −2. single linkage complete linkage 1 2 −10 −5 0 Component 1 5 −10 0 1 0. from left to right.0 3 Component 2 −1.0 0. Complete uses the length of the longest line between points in clusters. 
You should try different distances to decide the most sensible measure for your problem.2 Example: Mammal teeth The table below gives the numbers of different types of teeth for 32 mammals.0 Component 2 1. Single uses the length of the shortest line between points in clusters. (v2) .0 1 4 Component 2 1.0 2 average linkage 1 5 Component 1 −10 −5 0 5 Component 1 Different distance measures can produce different shape clusters. Average uses the average length of all line between points in clusters. Given a distance measure. Single linkage has the ability to produce and detect elongated and irregular clusters.14. 14. The observations are usually standardized prior to clustering to eliminate the effect of different variability on the distance measure.0 −5 5 2 4 −1.0 3 1 4 1 2 5 2 −1.0 −2. The pictures below illustrate the measures. The columns. The average linkage distance is the average distance between points across two clusters. give the numbers of (v1) top incisors. In these three cases the distance between points is the Euclidean or “ruler” distance.0 0. the distance between each pair of clusters is evaluated at each step.0 3 −2. 374 Ch 14: Cluster Analysis bottom incisors, (v3) top canines, (v4) bottom canines, (v5) top premolars, (v6) bottom premolars, (v7) top molars, (v8) bottom molars, respectively. A cluster analysis will be used to identify the mammals that have similar counts across the eight types of teeth. #### Example: Mammal teeth ## Mammal teeth data # mammal = name # number of teeth # v1 = top incisors # v2 = bottom incisors # v3 = top canines # v4 = bottom canines # v5 = top premolars # v6 = bottom premolars # v7 = top molars # v8 = bottom molars fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch14_teeth.dat" teeth <- read.table(fn.data, header = TRUE) str(teeth) ## 'data.frame': 32 obs. of 9 variables: ## $ mammal: Factor w/ 32 levels "Badger","Bear",..: 4 17 29 19 13 24 20 22 3 12 ... ## $ v1 : int 2 3 2 2 2 1 2 2 1 1 ... 
## $ v2 : int 3 2 3 3 3 3 1 1 1 1 ...
## $ v3 : int 1 1 1 1 1 1 0 0 0 0 ...
## $ v4 : int 1 0 1 1 1 1 0 0 0 0 ...
## $ v5 : int 3 3 2 2 1 2 2 3 2 2 ...
## $ v6 : int 3 3 3 2 2 2 2 2 1 1 ...
## $ v7 : int 3 3 3 3 3 3 3 3 3 3 ...
## $ v8 : int 3 3 3 3 3 3 3 3 3 3 ...

    mammal            v1 v2 v3 v4 v5 v6 v7 v8
 1  Brown Bat          2  3  1  1  3  3  3  3
 2  Mole               3  2  1  0  3  3  3  3
 3  Silver Hair Bat    2  3  1  1  2  3  3  3
 4  Pigmy Bat          2  3  1  1  2  2  3  3
 5  House Bat          2  3  1  1  1  2  3  3
 6  Red Bat            1  3  1  1  2  2  3  3
 7  Pika               2  1  0  0  2  2  3  3
 8  Rabbit             2  1  0  0  3  2  3  3
 9  Beaver             1  1  0  0  2  1  3  3
10  Groundhog          1  1  0  0  2  1  3  3
11  Gray Squirrel      1  1  0  0  1  1  3  3
12  House Mouse        1  1  0  0  0  0  3  3
13  Porcupine          1  1  0  0  1  1  3  3
14  Wolf               3  3  1  1  4  4  2  3
15  Bear               3  3  1  1  4  4  2  3
16  Raccoon            3  3  1  1  4  4  3  2
17  Marten             3  3  1  1  4  4  1  2
18  Weasel             3  3  1  1  3  3  1  2
19  Wolverine          3  3  1  1  4  4  1  2
20  Badger             3  3  1  1  3  3  1  2
21  River Otter        3  3  1  1  4  3  1  2
22  Sea Otter          3  2  1  1  3  3  1  2
23  Jaguar             3  3  1  1  3  2  1  1
24  Cougar             3  3  1  1  3  2  1  1
25  Fur Seal           3  2  1  1  4  4  1  1
26  Sea Lion           3  2  1  1  4  4  1  1
27  Grey Seal          3  2  1  1  3  3  2  2
28  Elephant Seal      2  1  1  1  4  4  1  1
29  Reindeer           0  4  1  0  3  3  3  3
30  Elk                0  4  1  0  3  3  3  3
31  Deer               0  4  0  0  3  3  3  3
32  Moose              0  4  0  0  3  3  3  3

The program below produces cluster analysis summaries for the mammal teeth data.

# create distance matrix between points
teeth.dist <- dist(teeth[,-1])
# number of clusters to identify with red boxes and ellipses
# i.clus <- 8
# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
   , main = paste("Teeth with average linkage") # and", i.clus, "clusters")
   , labels = teeth[,1])
# rect.hclust(teeth.hc.average, k = i.clus)
# # create PCA scores plot with ellipses
# clusplot(teeth, cutree(teeth.hc.average, k = i.clus)
#        , color = TRUE, labels = 2, lines = 0
#        , cex = 2, cex.txt = 1, col.txt = "gray20"
#        , main = paste("Teeth PCA with average linkage and", i.clus, "clusters")
#        , sub = NULL)

[Figure: dendrogram "Teeth with average linkage" (Height 0 to 4; hclust, "average"). Leaf order, left to right: Raccoon, Wolf, Bear, Elephant_Seal, Fur_Seal, Sea_Lion, Jaguar, Cougar, River_Otter, Marten, Wolverine, Grey_Seal, Sea_Otter, Weasel, Badger, Reindeer, Elk, Deer, Moose, House_Mouse, Beaver, Groundhog, Gray_Squirrel, Porcupine, Brown_Bat, Silver_Hair_Bat, Red_Bat, Pigmy_Bat, House_Bat, Mole, Pika, Rabbit]

14.3 Identifying the Number of Clusters

Cluster analysis can be used to produce an “optimal” splitting of the data into a prespecified number of groups or clusters, with different algorithms usually giving different clusters (there are thirty in the NbClust package: http://cran.r-project.org/web/packages/NbClust/NbClust.pdf). However, the important issue in many analyses revolves around identifying the number of clusters in the data. A simple empirical method is to continue grouping until the clusters being fused are relatively dissimilar, as measured by the normalized RMS between clusters. Experience with your data is needed to provide a reasonable stopping rule.

# NbClust provides methods for determining the number of clusters
library(NbClust)
str(teeth)
## 'data.frame': 32 obs. of 9 variables:
## $ mammal: Factor w/ 32 levels "Badger","Bear",..: 4 17 29 19 13 24 20 22 3 12 ...
## $ v1 : int 2 3 2 2 2 1 2 2 1 1 ...
## $ v2 : int 3 2 3 3 3 3 1 1 1 1 ...
## $ v3 : int 1 1 1 1 1 1 0 0 0 0 ...
## $ v4 : int 1 0 1 1 1 1 0 0 0 0 ...
## $ v5 : int 3 3 2 2 1 2 2 3 2 2 ...
## $ v6 : int 3 3 3 2 2 2 2 2 1 1 ...
## $ v7 : int 3 3 3 3 3 3 3 3 3 3 ...
## $ v8 : int 3 3 3 3 3 3 3 3 3 3 ...
# Because the data type is "int" for integer, the routine fails
NbClust(teeth[,-1], method = "average", index = "all")
## Error: system is computationally singular: reciprocal condition number = 1.51394e-16

# However, change the data type from integer to numeric and it works just fine!
teeth.num <- as.numeric(as.matrix(teeth[,-1]))
NC.out <- NbClust(teeth.num, method = "average", index = "all")
## Warning: no non-missing arguments to max; returning -Inf
## [1] "*** : The Hubert index is a graphical method of determining the number of clusters. In
## [1] "*** : The D index is a graphical method of determining the number of clusters. In the
## Warning: data length [51] is not a sub-multiple or multiple of the number of rows [2]
## [1] "All 256 observations were used."

# most of the methods suggest 4 or 5 clusters, as do the plots
NC.out$Best.nc
[Output: Number_clusters and Value_Index rows for the indices KL, CH, Hartigan, CCC, Scott, Marriot, TrCovW, TraceW, Friedman, Rubin, Cindex, DB, Silhouette, Duda, PseudoT2, Beale, Ratkowsky, Ball, PtBiserial, Frey, McClain, Dunn, Hubert, SDindex, Dindex, and SDbw]

[Figure: four panels of index value versus number of clusters (2 to 14): "Normalized Hubert Statistic", "Second Differences of Hubert index", "Dindex", and "Second Differences of D index"]
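One detail of the conversion step may be worth checking. In base R, as.numeric() applied to a matrix returns a plain vector with no dimensions, which would be consistent with the message above reporting 256 (= 32 x 8) observations rather than 32. The following is a minimal sketch, using a small made-up integer matrix (x.int is hypothetical, standing in for teeth[,-1]), of a conversion that keeps the rows and columns intact.

```r
# hypothetical 4 x 2 integer matrix standing in for teeth[, -1]
x.int <- matrix(1:8, nrow = 4, dimnames = list(NULL, c("v1", "v2")))

# as.numeric() drops the dim attribute and returns a vector of length 8
x.vec <- as.numeric(x.int)
is.null(dim(x.vec))

# changing the storage mode converts integer to double and keeps the shape
x.num <- x.int
storage.mode(x.num) <- "double"
dim(x.num)
is.double(x.num)
```

Whether NbClust then runs on the teeth data without the singularity error is not something this sketch establishes; it only shows how to change the type without flattening the data.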
There are several statistical methods for selecting the number of clusters. No method is best. A common suggestion is to use the cubic clustering criterion (ccc), a pseudo-F statistic, and a pseudo-t statistic. At a given step, the pseudo-t statistic is the distance between the centers of the two clusters to be merged, relative to the variability within these clusters. A large pseudo-t statistic implies that the clusters to be joined are relatively dissimilar (i.e., much more variability between the clusters to be merged than within these clusters). The pseudo-F statistic at a given step measures the variability among the centers of the current clusters relative to the variability within the clusters. A large pseudo-F value implies that the clusters merged consist of fairly similar observations. As clusters are joined, the pseudo-t statistic tends to increase, and the pseudo-F statistic tends to decrease. The ccc is more difficult to describe.

The RSQ summary is also useful for determining the number of clusters. RSQ is a pseudo-R^2 statistic that measures the proportion of the total variation explained by the differences among the existing clusters at a given step. RSQ will typically decrease as the pseudo-F statistic decreases.

A common recommendation on cluster selection is to choose a number of clusters where the values of ccc and the pseudo-F statistic are relatively high (compared to what you observe with other numbers of clusters), and where the pseudo-t statistic is relatively low and increases substantially at the next proposed merger. For the mammal teeth data this corresponds to four clusters. Six clusters is a sensible second choice.

Let’s look at the results of 5 clusters.
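Before turning to the five-cluster solution, the RSQ and pseudo-F summaries described above can be sketched directly from a hierarchical clustering. This is a minimal illustration on the eight-point data from Section 14.1.1, assuming the usual sums-of-squares definitions: RSQ = 1 - W/T and pseudo-F = [(T - W)/(k - 1)] / [W/(n - k)], where T is the total sum of squares and W is the pooled within-cluster sum of squares. The helper names ss and clus.summary are hypothetical and not part of any package.

```r
# eight-point illustration data from Section 14.1.1
intro <- data.frame(x1 = c(4, 6, 10, 11, 17, 19, 20, 21),
                    x2 = c(8, 6, 11,  8,  5,  3, 11,  2))
intro.hc <- hclust(dist(intro), method = "average")

# sum of squared deviations from the column means
ss <- function(d) sum(scale(d, center = TRUE, scale = FALSE)^2)

# hypothetical helper: RSQ and pseudo-F for a k-cluster cut of the tree
clus.summary <- function(k, hc, dat) {
  grp  <- cutree(hc, k = k)
  n    <- nrow(dat)
  T.ss <- ss(dat)                              # total sum of squares
  W.ss <- sum(sapply(split(dat, grp), ss))     # pooled within-cluster SS
  c(k = k
  , RSQ = 1 - W.ss / T.ss
  , pseudoF = ((T.ss - W.ss) / (k - 1)) / (W.ss / (n - k)))
}

out <- t(sapply(2:6, clus.summary, hc = intro.hc, dat = intro))
print(round(out, 3))
```

Reading down the rows shows how RSQ changes as the tree is cut into more clusters. Because the cuts are nested, RSQ cannot decrease as k grows.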
# create distance matrix between points
teeth.dist <- dist(teeth[,-1])
# number of clusters to identify with red boxes and ellipses
i.clus <- 5
# create dendrogram
teeth.hc.average <- hclust(teeth.dist, method = "average")
plot(teeth.hc.average, hang = -1
   , main = paste("Teeth with average linkage and", i.clus, "clusters")
   , labels = teeth[,1])
rect.hclust(teeth.hc.average, k = i.clus)
# create PCA scores plot with ellipses
clusplot(teeth, cutree(teeth.hc.average, k = i.clus)
       , color = TRUE, labels = 2, lines = 0
       , cex = 2, cex.txt = 1, col.txt = "gray20"
       , main = paste("Teeth PCA with average linkage and", i.clus, "clusters")
       , sub = NULL)

[Figure: dendrogram "Teeth with average linkage and 5 clusters" (Height 0 to 4; hclust, "average"), with boxes marking the five clusters]

[Figure: clusplot "Teeth PCA with average linkage and 5 clusters" (Component 2 versus Component 1), observations labeled 1 to 32]

# print the observations in each cluster
for (i.cut in 1:i.clus) {
  print(paste("Cluster", i.cut, " ----------------------------- "))
  print(teeth[(cutree(teeth.hc.average, k = i.clus) == i.cut),])
}
## [1] "Cluster 1  ----------------------------- "
##            mammal v1 v2 v3 v4 v5 v6 v7 v8
## 1       Brown_Bat  2  3  1  1  3  3  3  3
## 3 Silver_Hair_Bat  2  3  1  1  2  3  3  3
## 4       Pigmy_Bat  2  3  1  1  2  2  3  3
## 5       House_Bat  2  3  1  1  1  2  3  3
## 6         Red_Bat  1  3  1  1  2  2  3  3
## [1] "Cluster 2  ----------------------------- "
##   mammal v1 v2 v3 v4 v5 v6 v7 v8
## 2   Mole  3  2  1  0  3  3  3  3
## 7   Pika  2  1  0  0  2  2  3  3
## 8 Rabbit  2  1  0  0  3  2  3  3
## [1] "Cluster 3  ----------------------------- "
##           mammal v1 v2 v3 v4 v5 v6 v7 v8
## 9         Beaver  1  1  0  0  2  1  3  3
## 10     Groundhog  1  1  0  0  2  1  3  3
## 11 Gray_Squirrel  1  1  0  0  1  1  3  3
## 12   House_Mouse  1  1  0  0  0  0  3  3
## 13     Porcupine  1  1  0  0  1  1  3  3
## [1] "Cluster 4  ----------------------------- "
##           mammal v1 v2 v3 v4 v5 v6 v7 v8
## 14          Wolf  3  3  1  1  4  4  2  3
## 15          Bear  3  3  1  1  4  4  2  3
## 16       Raccoon  3  3  1  1  4  4  3  2
## 17        Marten  3  3  1  1  4  4  1  2
## 18        Weasel  3  3  1  1  3  3  1  2
## 19     Wolverine  3  3  1  1  4  4  1  2
## 20        Badger  3  3  1  1  3  3  1  2
## 21   River_Otter  3  3  1  1  4  3  1  2
## 22     Sea_Otter  3  2  1  1  3  3  1  2
## 23        Jaguar  3  3  1  1  3  2  1  1
## 24        Cougar  3  3  1  1  3  2  1  1
## 25      Fur_Seal  3  2  1  1  4  4  1  1
## 26      Sea_Lion  3  2  1  1  4  4  1  1
## 27     Grey_Seal  3  2  1  1  3  3  2  2
## 28 Elephant_Seal  2  1  1  1  4  4  1  1
## [1] "Cluster 5  ----------------------------- "
##      mammal v1 v2 v3 v4 v5 v6 v7 v8
## 29 Reindeer  0  4  1  0  3  3  3  3
## 30      Elk  0  4  1  0  3  3  3  3
## 31     Deer  0  4  0  0  3  3  3  3
## 32    Moose  0  4  0  0  3  3  3  3

14.4 Example: 1976 birth and death rates

Below are the 1976 crude birth and death rates in 74 countries. A data plot and output from complete and single linkage cluster analyses are given.

#### Example: Birth and death rates
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch14_birthdeath.dat"
bd <- read.table(fn.data, header = TRUE)
str(bd)
## 'data.frame': 74 obs. of 3 variables:
## $ country: Factor w/ 74 levels "afghan","algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ birth : int 52 50 47 22 16 12 47 12 36 17 ...
## $ death : int 30 16 23 10 8 13 19 12 10 10 ...
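Section 14.1.2 remarks that observations are usually standardized before clustering. Birth rates here vary over a wider range than death rates, so it may be worth seeing how standardizing changes the distances. The sketch below is only an illustration on the first ten rows shown in the str() output above (the data frame bd10 is hypothetical). It is not the analysis carried out in these notes, which clusters the raw rates.

```r
# first ten countries' rates, copied from the str(bd) output above
bd10 <- data.frame(birth = c(52, 50, 47, 22, 16, 12, 47, 12, 36, 17),
                   death = c(30, 16, 23, 10,  8, 13, 19, 12, 10, 10))

# on the raw scale the more variable feature contributes more to the distance
apply(bd10, 2, sd)
d.raw <- dist(bd10)

# scale() centers each column and divides by its standard deviation
bd10.std <- scale(bd10)
apply(bd10.std, 2, sd)
d.std <- dist(bd10.std)

# compare complete linkage clusters built from the two distance matrices
hc.raw <- hclust(d.raw, method = "complete")
hc.std <- hclust(d.std, method = "complete")
table(raw = cutree(hc.raw, k = 3), std = cutree(hc.std, k = 3))
```

If the cross-tabulation has a single nonzero count in each row, standardizing made no difference to the three-cluster solution for these ten rows; otherwise the choice of scale matters and deserves some thought for the problem at hand.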
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 country afghan algeria angola argentina australia austria banglades belguim brazil bulgaria burma cameroon canada chile china taiwan columbia cuba czechosla ecuador egypt ethiopia france german dr german fr birth 52 50 47 22 16 12 47 12 36 17 38 42 17 22 31 26 34 20 19 42 39 48 14 12 10 death 30 16 23 10 8 13 19 12 10 10 15 22 7 7 11 5 10 6 11 11 13 23 11 14 12 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 country ghana greece guatamala hungary india indonesia iran iraq italy ivory cst japan kenya nkorea skorea madagasca malaysia mexico morocco mozambique nepal netherlan nigeria pakistan peru phillip birth 46 16 40 18 36 38 42 48 14 48 16 50 43 26 47 30 40 47 45 46 13 49 44 40 34 death 14 9 14 12 15 16 12 14 10 23 6 14 12 6 22 6 7 16 18 20 8 22 14 13 10 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 country poland portugal rhodesia romania saudi ar sth africa spain sri lanka sudan sweden switzer syria tanzania thailand turkey ussr uganda uk usa upp volta venez vietnam yugoslav zaire birth 20 19 48 19 49 36 18 26 49 12 12 47 47 34 34 18 48 12 15 50 36 42 18 45 death 9 10 14 10 19 12 8 9 17 11 9 14 17 10 12 9 17 12 9 28 6 17 8 18 # plot original data library(ggplot2) p1 <- ggplot(bd, aes(x = birth, y = death)) p1 <- p1 + geom_point(size = 2) # points p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2) # labels p1 <- p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x- and y-axis p1 <- p1 + labs(title = "1976 crude birth and death rates") print(p1) Scott index.1 Complete linkage library(NbClust) # Change integer data type to numeric bd.0 2.matrix(bd[.-1])) NC.nc ## ## ## ## ## ## index.num <.CH index.000 15 5. returning -Inf [1] "*** : The Hubert index is a graphical method of determining the number of clusters.4.Marriot index.out$Best.76 index. 
In the Warning: data length [51] is not a sub-multiple or multiple of the number of rows [2] [1] "All 148 observations were used. In [1] "*** : The D index is a graphical method of determining the number of clusters.NbClust(bd.KL index.num.2 20.333 1781 209.14.0 Number_clusters Value_Index .00 3.TrCovW index.79 9041 4 15. index = "all") ## ## ## ## ## Warning: no non-missing arguments to max.CCC 2. as do the plots NC.6 Value_Index 86." # most of the methods suggest 2 to 6 clusters. method = "complete".00 6 -Inf 854.TraceW Number_clusters 4.Hartigan index.as.out <.numeric(as.4: Example: 1976 birth and death rates 383 1976 crude birth and death rates 30 ● ● ● death 20 10 afghan upp_volta ● ● angola ethiopia ivory_cst ● ● cameroon madagasca nigeria nepal ● ● banglades saudi_ar ● zaire mozambique ● ● ● vietnam● tanzania uganda sudan ● ● ● indonesia morocco algeria ● ● india burma ● ● ● ● ● ● ● german_dr guatamala pakistan ghana syria iraq rhodesia kenya ● ● ● austria egypt peru ● ● ● ● ● german_fr uk belguim ● hungary turkey sth_africa●iran nkorea ● ● ● ● sweden france ● czechosla china ecuador ● ● ● ● ● italy ● bulgaria romania portugal argentina phillip thailand columbia brazil ● ● ● ● ● switzer usa greece ussr poland ● sri_lanka ● ● ● netherlan australia spain yugoslav ● ● canada ● chile mexico ● ● japan ● cuba skorea● malaysia ● venez ● taiwan ● 10 20 30 40 50 birth 14. 
17 0.PseudoT2 0.Cindex index.23 0.Frey index.32 0 0.0 2.Dunn 0.SDbw 0 0.00 15 2.00 index.Ratkowsky index.0 2.43 13.46 5166 3.Rubin index.01 3 0.Duda index.00 3.DB 395.00 2.Beale index.25 142 2.33 3.00 .00 2 index.99 0.Friedman index.00 index.384 ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Number_clusters Value_Index Number_clusters Value_Index Number_clusters Value_Index Number_clusters Value_Index Number_clusters Value_Index Ch 14: Cluster Analysis index.Silhouette index.SDindex index.PtBiserial index.McClain index.00 2.4 -131.85 3.Ball 0.00 2.5 0.54 0.00 2 index.75 0.00 0.00 13.Dindex index.Hubert index. hang = -1 .4e−05 ● ● 0e+00 4.hc.dist(bd[. # create distance matrix between points bd.0 3 Index Value ● 1 Index Value 4 0. i.4 0.0e−05 Index Value Normalized Hubert Statistic 385 ● 2 4 6 8 10 12 14 ● ● ● ● 2 4 6 8 ● ● 10 ● 12 ● 14 Number of clusters Number of clusters Dindex Second Differences of D index ● 2 ● ● ● ● ● ● 2 4 6 8 10 ● ● 12 ● ● ● 14 Number of clusters 0.clus.dist.hclust(bd.8e−05 ● ● Second Differences of Hubert index ● ● ● ● ● ● 2e−06 ● Index Value ● 4.3 # create dendrogram bd. First we’ll use complete linkage. "clusters") .complete.hc.2 0.complete <.14. method = "complete") plclust(bd.6 ● ● ● ● ● 2 4 ● ● 6 8 10 12 14 Number of clusters Let’s try 3 clusters based on the dendrogram plots below.-1]) # number of clusters to identify with red boxes and ellipses i.dist <. main = paste("Teeth with complete linkage and".2 ● ● ● ● ● ● −0.4: Example: 1976 birth and death rates ● ● ● ● ● 4e−06 4.clus <. 
k = i.clus)) 20 10 afghan upp_volta algeria kenya iraq rhodesia ghana syria cameroon vietnam nigeria madagasca angola ethiopia ivory_cst mozambique zaire banglades nepal saudi_ar morocco tanzania sudan uganda german_fr german_dr austria belguim uk netherlan switzer sweden france italy argentina poland chile cuba hungary czechosla portugal romania canada japan ussr spain yugoslav bulgaria usa australia greece malaysia sri_lanka taiwan skorea pakistan nkorea ecuador iran guatamala egypt peru india burma indonesia mexico venez china brazil thailand columbia phillip sth_africa turkey 0 Height 30 40 Teeth with complete linkage and 3 clusters bd.complete.hclust(bd. k = i.hc.dist hclust (*. col. labels = 2.1]) ## Warning: ’plclust’ is deprecated. "complete") .txt = 1.comp <.hc. ## See help("Deprecated") rect.txt = "gray20" . cutree(bd. labels = bd[.clus. cex = 2.hc.clus) . cex. i.386 Ch 14: Cluster Analysis .factor(cutree(bd.complete. main = paste("Birth/Death PCA with complete linkage and". lines = 0 . ## Use ’plot’ instead. "clusters"). sub = # create a column with group membership bd$cut. k = i.complete.clus) # create PCA scores plot with ellipses clusplot(bd. color = TRUE. clus) { print(paste("Cluster".]) } ## [1] "Cluster 1 ----------------------------.cut in 1:i.")) print(bd[(cutree(bd.clus) == i. i.cut. k = i." 
## country birth death cut.comp ## 1 afghan 52 30 1 ## 2 algeria 50 16 1 ## 3 angola 47 23 1 ## 7 banglades 47 19 1 ## 12 cameroon 42 22 1 ## 22 ethiopia 48 23 1 ## 26 ghana 46 14 1 ## 33 iraq 48 14 1 ## 35 ivory_cst 48 23 1 ## 37 kenya 50 14 1 ## 40 madagasca 47 22 1 ## 43 morocco 47 16 1 ## 44 mozambique 45 18 1 ## 45 nepal 46 20 1 ## 47 nigeria 49 22 1 ## 53 rhodesia 48 14 1 ## 55 saudi_ar 49 19 1 ## 59 sudan 49 17 1 ## 62 syria 47 14 1 3 .hc.complete.cut).14.4: Example: 1976 birth and death rates 387 Birth/Death PCA with complete linkage and 3 clusters 2 1 3 71 2 73 1 69 16 61 66 65 64 ● 56 58 6039 57 46 50 49 38 37● ● ● ● ● 0 47 ● 40 ● 33 3028 34 26 ● 22 20 21 24 19 14 ● 17 15 12 7● 11 9 2 ● ● 6 4 3 ● 1 ● −2 10 8 35 ● 31 29 2523 5 45 43 44 36 13 ● 48 32 18 55 53 41 27 70 ● ● 42 −1 67 ● 63 ● 59 ● 62 68 54 5152 Component 2 74 ● 72 ● −2 −1 0 1 2 Component 1 # print the observations in each cluster for (i. " ----------------------------. comp 9 brazil 36 10 3 11 burma 38 15 3 15 china 31 11 3 16 taiwan 26 5 3 17 columbia 34 10 3 20 ecuador 42 11 3 21 egypt 39 13 3 28 guatamala 40 14 3 30 india 36 15 3 31 indonesia 38 16 3 32 iran 42 12 3 38 nkorea 43 12 3 39 skorea 26 6 3 41 malaysia 30 6 3 42 mexico 40 7 3 Ch 14: Cluster Analysis ." country birth death cut.388 ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 63 tanzania 47 17 1 67 uganda 48 17 1 70 upp_volta 50 28 1 72 vietnam 42 17 1 74 zaire 45 18 1 [1] "Cluster 2 ----------------------------." 
country birth death cut.comp 4 argentina 22 10 2 5 australia 16 8 2 6 austria 12 13 2 8 belguim 12 12 2 10 bulgaria 17 10 2 13 canada 17 7 2 14 chile 22 7 2 18 cuba 20 6 2 19 czechosla 19 11 2 23 france 14 11 2 24 german_dr 12 14 2 25 german_fr 10 12 2 27 greece 16 9 2 29 hungary 18 12 2 34 italy 14 10 2 36 japan 16 6 2 46 netherlan 13 8 2 51 poland 20 9 2 52 portugal 19 10 2 54 romania 19 10 2 57 spain 18 8 2 60 sweden 12 11 2 61 switzer 12 9 2 66 ussr 18 9 2 68 uk 12 12 2 69 usa 15 9 2 73 yugoslav 18 8 2 [1] "Cluster 3 ----------------------------. p1 + coord_fixed(ratio = 1) # makes 1 unit equal length on x.comp ● nepal ● ● banglades saudi_ar a● 1 ● zaire mozambique ● ● ● ● vietnam tanzania uganda sudan a 2 ● indonesia ● morocco algeria a 3 india burma ● ● ● ● german_dr guatamala pakistan ghana syria iraq rhodesia kenya austria egypt peru german_fr uk belguim hungary turkey sth_africairan nkorea sweden france czechosla china ecuador italy bulgaria romania portugal argentina phillip thailand columbia brazil switzer usa greece ussrpoland sri_lanka netherlan australia spain yugoslav canadachile mexico japan cuba skoreamalaysia venez taiwan ● death 20 10 10 20 30 40 50 birth In very general/loose terms4.p1 + geom_text(aes(label = country). shape = cut. y = death.1.and y-axis p1 <. aes(x = birth.ggplot(bd. while the countries with more Euro-centric wealth are mostly clustered on the left side of the swoop. complete linkage afghan ● upp_volta 30 ● ● ● angola ethiopia ivory_cst ● ● cameroon madagasca nigeria cut.14.wikipedia. Perhaps the birth and death rates of a given country are influenced in part by 4 5 Thanks to Drew Enigk from Spring 2013 who provided this interpretation.p1 + geom_point(size = 2) # points p1 <. colour = cut. http://en.org/wiki/Four_Asian_Tigers . 
it appears that at least some members of the “Four Asian Tigers5” are toward the bottom of the swoop.4: Example: 1976 birth and death rates ## ## ## ## ## ## ## ## 48 pakistan 49 peru 50 phillip 56 sth_africa 58 sri_lanka 64 thailand 65 turkey 71 venez 44 40 34 36 26 34 34 36 14 13 10 12 9 10 12 6 389 3 3 3 3 3 3 3 3 # plot original data library(ggplot2) p1 <. and many developing countries make up the steeper right side of the swoop.2) # labels p1 <.comp. complete linkage") print(p1) 1976 crude birth and death rates.p1 + labs(title = "1976 crude birth and death rates. alpha = 0. hjust = -0.comp)) p1 <. the Four Asian Tigers have supposedly developed wealth in more recent years through export-driven economies.out$Best. we seek a significant knee Warning: data length [51] is not a sub-multiple or multiple of the number of rows [2] [1] "All 222 observations were used.html http://en. 14.investopedia.Duda index.wikipedia.00 2 index. method = "single".00 2.Scott index.390 Ch 14: Cluster Analysis the primary means by which the country has obtained wealth6 (if it is considered a wealthy country).num. library(NbClust) # Change integer data type to numeric bd. In the plot of Hubert index.78 index.85 4.00 http://www.00 2.Dunn 0. we seek a signif [1] "*** : The D index is a graphical method of determining the number of clusters.38 -12.-1])) NC.Rubin index.as.00 8.00 2.matrix(bd[. In the plot of D index.00 0.TraceW 6.Hartigan index.19 0.06 0.35 9162 7.PseudoT2 0.SDbw 0 0.Friedman index.04 3 0.2 Single linkage Now we’ll use single linkage to compare.NbClust(bd.0 5.Marriot index.00 15 7.00 index. 
and the Tiger Cub Economies[7] are currently developing in a similar fashion[8].

[7] http://en.wikipedia.org/wiki/Tiger_Cub_Economies
[8] http://www.povertyeducation.org/the-rise-of-asia

## Warning: no non-missing arguments to max; returning -Inf
## [1] "*** : The Hubert index is a graphical method of determining the number of clusters."

# most of the methods suggest 4 to 11 clusters, as do the plots

[Figure: NbClust diagnostic plots against the number of clusters (2 to 14): normalized Hubert statistic with its second differences, and Dindex with its second differences.]

# create distance matrix between points
bd.dist <- dist(bd[,-1])

# number of clusters to identify with red boxes and ellipses
i.clus <- 3

# create dendrogram
bd.hc.single <- hclust(bd.dist, method = "single")
plclust(bd.hc.single, hang = -1
      , main = paste("Teeth with single linkage and", i.clus, "clusters")
      , labels = bd[,1])
## Warning: 'plclust' is deprecated.
## Use 'plot' instead.
## See help("Deprecated")
rect.hclust(bd.hc.single, k = i.clus)

# create PCA scores plot with ellipses
clusplot(bd, cutree(bd.hc.single, k = i.clus)
       , color = TRUE, labels = 2, lines = 0
       , cex = 2, cex.txt = 1, col.txt = "gray20"
       , main = paste("Birth/Death PCA with single linkage and", i.clus, "clusters")
       , sub = NULL)

# create a column with group membership
bd$cut.sing <- factor(cutree(bd.hc.single, k = i.clus))

[Figure: dendrogram "Teeth with single linkage and 3 clusters" (bd.dist, hclust with "single") and PCA scores plot "Birth/Death PCA with single linkage and 3 clusters" (Component 1 versus Component 2).]

# print the observations in each cluster
for (i.cut in 1:i.clus) {
  print(paste("Cluster", i.cut, " ---------------------------- "))
  print(bd[(cutree(bd.hc.single, k = i.clus) == i.cut),])
}
## [1] "Cluster 1  ---------------------------- "
##    country    birth death cut.comp cut.sing
## 1  afghan        52    30        1        1
## 70 upp_volta     50    28        1        1
## [1] "Cluster 2  ---------------------------- "
##    country    birth death cut.comp cut.sing
## 2  algeria       50    16        1        2
## 3  angola        47    23        1        2
## 7  banglades     47    19        1        2
## 9  brazil        36    10        3        2
## 11 burma         38    15        3        2
## 12 cameroon      42    22        1        2
## 15 china         31    11        3        2
## 17 columbia      34    10        3        2
## 20 ecuador       42    11        3        2
## 21 egypt         39    13        3        2
## 22 ethiopia      48    23        1        2
## 26 ghana         46    14        1        2
## 28 guatamala     40    14        3        2
## 30 india         36    15        3        2
## 31 indonesia     38    16        3        2
## 32 iran          42    12        3        2
## 33 iraq          48    14        1        2
## 35 ivory_cst     48    23        1        2
## 37 kenya         50    14        1        2
## 38 nkorea        43    12        3        2
## 40 madagasca     47    22        1        2
## 42 mexico        40     7        3        2
## 43 morocco       47    16        1        2
## 44 mozambique    45    18        1        2
## 45 nepal         46    20        1        2
## 47 nigeria       49    22        1        2
## 48 pakistan      44    14        3        2
## 49 peru          40    13        3        2
## 50 phillip       34    10        3        2
## 53 rhodesia      48    14        1        2
## 55 saudi_ar      49    19        1        2
## 56 sth_africa    36    12        3        2
## 59 sudan         49    17        1        2
## 62 syria         47    14        1        2
## 63 tanzania      47    17        1        2
## 64 thailand      34    10        3        2
## 65 turkey        34    12        3        2
## 67 uganda        48    17        1        2
## 71 venez         36     6        3        2
## 72 vietnam       42    17        1        2
## 74 zaire         45    18        1        2
## [1] "Cluster 3  ---------------------------- "
##    country    birth death cut.comp cut.sing
## 4  argentina     22    10        2        3
## 5  australia     16     8        2        3
## 6  austria       12    13        2        3
## 8  belguim       12    12        2        3
## 10 bulgaria      17    10        2        3
## 13 canada        17     7        2        3
## 14 chile         22     7        2        3
## 16 taiwan        26     5        3        3
## 18 cuba          20     6        2        3
## 19 czechosla     19    11        2        3
## 23 france        14    11        2        3
## 24 german_dr     12    14        2        3
## 25 german_fr     10    12        2        3
## 27 greece        16     9        2        3
## 29 hungary       18    12        2        3
## 34 italy         14    10        2        3
## 36 japan         16     6        2        3
## 39 skorea        26     6        3        3
## 41 malaysia      30     6        3        3
## 46 netherlan     13     8        2        3
## 51 poland        20     9        2        3
## 52 portugal      19    10        2        3
## 54 romania       19    10        2        3
## 57 spain         18     8        2        3
## 58 sri_lanka     26     9        3        3
## 60 sweden        12    11        2        3
## 61 switzer       12     9        2        3
## 66 ussr          18     9        2        3
## 68 uk            12    12        2        3
## 69 usa           15     9        2        3
## 73 yugoslav      18     8        2        3

# plot original data
library(ggplot2)
p1 <- ggplot(bd, aes(x = birth, y = death, colour = cut.sing, shape = cut.sing))
p1 <- p1 + geom_point(size = 2)      # points
p1 <- p1 + geom_text(aes(label = country), hjust = -0.1, alpha = 0.2)   # labels
p1 <- p1 + coord_fixed(ratio = 1)    # makes 1 unit equal length on x- and y-axis
p1 <- p1 + labs(title = "1976 crude birth and death rates, single linkage")
print(p1)

[Figure: "1976 crude birth and death rates, single linkage": death versus birth, with countries labeled and colored by cut.sing (1, 2, 3).]

The two methods suggest three clusters. Complete linkage also suggests 14 clusters, but the clusters were unappealing so this analysis will not be presented here.

The three clusters generated by the two methods are very different. The same tendency was observed using average linkage and Ward's method.

An important point to recognize is that different clustering algorithms may agree on the number of clusters, but they may not agree on the composition of the clusters.
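The last point, that two linkage methods can agree on the number of clusters yet disagree on who belongs to them, can be checked by cross-tabulating the memberships. The following is a minimal sketch, not the analysis from these notes: it assumes the built-in USArrests data as a stand-in for the birth/death data, and the object names (cl.single, cl.complete, tab) are illustrative.

```r
# Sketch only: USArrests (built into R) stands in for the birth/death data.
x <- scale(USArrests)                 # standardize so no variable dominates the distances
d <- dist(x)                          # Euclidean distance matrix

k <- 3                                # same number of clusters for both methods
cl.single   <- cutree(hclust(d, method = "single"),   k = k)
cl.complete <- cutree(hclust(d, method = "complete"), k = k)

# cross-tabulate memberships: rows = single linkage, columns = complete linkage
tab <- table(single = cl.single, complete = cl.complete)
print(tab)

# cluster sizes under each method
print(table(cl.single))
print(table(cl.complete))
```

If the two methods agreed on composition, each row of the cross-tabulation would have a single nonzero cell; counts spread across a row indicate that a cluster from one method is split among several clusters of the other.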
Chapter 15

Multivariate Analysis of Variance

Jolicoeur and Mosimann studied the relationship between the size and shape of painted turtles. The table below gives the length, width, and height (all in mm) for 24 males and 24 females.

#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch15_shells_mf.dat"
shells <- read.table(fn.data, header = TRUE)
str(shells)
## 'data.frame': 48 obs. of 4 variables:
## $ sex   : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ length: int 98 103 103 105 109 123 123 133 133 133 ...
## $ width : int 81 84 86 86 88 92 95 99 102 102 ...
## $ height: int 38 38 42 42 44 50 46 51 51 51 ...
#head(shells)
    sex length width height        sex length width height
 1  F       98    81     38    25  M       93    74     37
 2  F      103    84     38    26  M       94    78     35
 3  F      103    86     42    27  M       96    80     35
 4  F      105    86     42    28  M      101    84     39
 5  F      109    88     44    29  M      102    85     38
 6  F      123    92     50    30  M      103    81     37
 7  F      123    95     46    31  M      104    83     39
 8  F      133    99     51    32  M      106    83     39
 9  F      133   102     51    33  M      107    82     38
10  F      133   102     51    34  M      112    89     40
11  F      134   100     48    35  M      113    88     40
12  F      136   102     49    36  M      114    86     40
13  F      138    98     51    37  M      116    90     43
14  F      138    99     51    38  M      117    90     41
15  F      141   105     53    39  M      117    91     41
16  F      147   108     57    40  M      119    93     41
17  F      149   107     55    41  M      120    89     40
18  F      153   107     56    42  M      121    93     44
19  F      155   115     63    43  M      121    95     42
20  F      155   117     60    44  M      125    93     45
21  F      158   115     62    45  M      127    96     45
22  F      159   118     63    46  M      128    95     45
23  F      162   124     61    47  M      131    95     46
24  F      177   132     67    48  M      135   106     47

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(shells, colour = "sex")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

## 3D scatterplot
library(scatterplot3d)
with(shells, {
  scatterplot3d(x = length
              , y = width
              , z = height
              , main = "Shells 3D Scatterplot"
              , type = "h"                  # lines to the horizontal xy-plane
              , color = as.integer(sex)     # color by group
              , pch = as.integer(sex)+19    # plotting character by group
              #, highlight.3d = TRUE        # makes color change with z-axis value
              , angle = 40                  # viewing angle (seems hard to control)
              )
})

#### Try this!
#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height, col = sex) })

[Figure: scatterplot matrix of length, width, and height by sex with pairwise correlations, and the "Shells 3D Scatterplot".]

MANOVA considers the following two questions:
• Are the population mean length, width, and height the same for males and females?
• If not, then what combination of features is most responsible for the differences?

To describe MANOVA, suppose you measure $p$ features on independent random samples from $k$ strata, groups, or populations. Let
\[ \mu_i = \begin{bmatrix} \mu_{i1} & \mu_{i2} & \cdots & \mu_{ip} \end{bmatrix}' \]
be the vector of population means for the $i$th population, where $\mu_{ij}$ is the $i$th population mean on the $j$th feature. For the turtles, $p = 3$ features and $k = 2$ strata (sexes).

A one-way MANOVA tests the hypothesis that the population mean vectors are identical: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ against $H_A$: not $H_0$. For the carapace data, you are simultaneously testing that the sexes have equal population mean lengths, equal population mean widths, and equal population mean heights.

Assume that the sample sizes from the different groups are $n_1, n_2, \ldots, n_k$. The total sample size is $n = n_1 + n_2 + \cdots + n_k$. Let
\[ X_{ij} = \begin{bmatrix} X_{ij1} & X_{ij2} & \cdots & X_{ijp} \end{bmatrix}' \]
be the vector of responses for the $j$th individual from the $i$th sample. Let
\[ \bar{X}_i = \begin{bmatrix} \bar{X}_{i1} & \bar{X}_{i2} & \cdots & \bar{X}_{ip} \end{bmatrix}' \]
and $S_i$ be the mean vector and variance-covariance matrix for the $i$th sample. Finally, let
\[ \bar{X} = \begin{bmatrix} \bar{X}_1 & \bar{X}_2 & \cdots & \bar{X}_p \end{bmatrix}' \]
be the vector of means ignoring samples (combine all the data across samples and compute the average on each feature), and let
\[ S = \frac{\sum_i (n_i - 1) S_i}{n - k} \]
be the pooled variance-covariance matrix. The pooled variance-covariance matrix is a weighted average of the variance-covariance matrices from each group.

To test $H_0$, construct the following MANOVA table, which is the multivariate analog of the ANOVA table:

Source    df       SS                                                          MS
Between   k - 1    $\sum_i n_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'$
Within    n - k    $\sum_i (n_i - 1) S_i$
Total     n - 1    $\sum_{ij} (X_{ij} - \bar{X})(X_{ij} - \bar{X})'$

where all the MSs are SS/df.

The expressions for the SS have the same form as SS in univariate analysis of variance, except that each SS is a $p \times p$ symmetric matrix. The diagonal elements of the SS matrices are the SS for one-way ANOVAs on the individual features. The off-diagonal elements are SS between features. The Error MS matrix is the pooled variance-covariance matrix $S$.

The standard MANOVA assumes that you have independent samples from multivariate normal populations with identical variance-covariance matrices. This implies that each feature is normally distributed in each population, that a feature has the same variability across populations, and that the correlation (or covariance) between two features is identical across populations. The Error MS matrix estimates the common population variance-covariance matrix when the population variance-covariance matrices are identical.

The $H_0$ of equal population mean vectors should be rejected when the difference among mean vectors, as measured by the Between MS matrix, is large relative to the variability within groups, as measured by the Error MS matrix. Equivalently, $H_0$ is implausible if a significant portion of the total variation in the data, as measured by the Total SS matrix, is due to differences among the groups. The same idea is used in a one-way ANOVA to motivate the F-test of no differences in population means. However, some care is needed to quantify these ideas in a MANOVA because there are several natural matrix definitions for comparing the Between MS matrix to the Error MS matrix. As a result, several MANOVA tests of $H_0$ have been proposed.

Graphical summaries for the carapace data are given above, with numerical summaries below.

# summary statistics for each sex
by(shells, shells$sex, summary)
## shells$sex: F
##  sex        length        width           height
##  F:24   Min.   : 98   Min.   : 81.0   Min.   :38.0
##  M: 0   1st Qu.:123   1st Qu.: 94.2   1st Qu.:47.5
##         Median :137   Median :102.0   Median :51.0
##         Mean   :136   Mean   :102.6   Mean   :52.0
##         3rd Qu.:154   3rd Qu.:109.8   3rd Qu.:57.8
##         Max.   :177   Max.   :132.0   Max.   :67.0
## ----------------------------------------------------
## shells$sex: M
##  sex        length        width           height
##  F: 0   Min.   : 93   Min.   : 74.0   Min.   :35.0
##  M:24   1st Qu.:104   1st Qu.: 83.0   1st Qu.:38.8
##         Median :115   Median : 89.0   Median :40.0
##         Mean   :113   Mean   : 88.3   Mean   :40.7
##         3rd Qu.:121   3rd Qu.: 93.0   3rd Qu.:43.2
##         Max.   :135   Max.   :106.0   Max.   :47.0

# standard deviations
by(shells[, 2:4], shells$sex, apply, 2, sd)
## shells$sex: F
## length  width height
## 21.249 13.105  8.046
## ----------------------------------------------------
## shells$sex: M
## length  width height
## 11.806  7.074  3.355

# correlation matrix (excluding associated p-values testing "H0: rho == 0")
library(Hmisc)
rcorr(as.matrix(shells[, 2:4]))[[1]]                     # all
##        length  width height
## length 1.0000 0.9779 0.9628
## width  0.9779 1.0000 0.9599
## height 0.9628 0.9599 1.0000
rcorr(as.matrix(shells[shells$sex == "F", 2:4]))[[1]]    # females
##        length  width height
## length 1.0000 0.9731 0.9707
## width  0.9731 1.0000 0.9659
## height 0.9707 0.9659 1.0000
rcorr(as.matrix(shells[shells$sex == "M", 2:4]))[[1]]    # males
##        length  width height
## length 1.0000 0.9501 0.9471
## width  0.9501 1.0000 0.9123
## height 0.9471 0.9123 1.0000

The features are positively correlated within each sex. The correlations between pairs of features are similar for males and females. Females tend to be larger on each feature. The distributions for length, width, and height are fairly symmetric within sexes. No outliers are present. Although females are more variable on each feature than males, the MANOVA assumptions do not appear to be grossly violated here. (Additionally, you could consider transforming each dimension of the data in hopes of making the covariances between sexes more similar, though it may not be easy to find a good transformation to use.)

pca.sh <- princomp(shells[, 2:4])
df.pca.sh <- data.frame(sex = shells$sex, pca.sh$scores)
str(df.pca.sh)
## 'data.frame': 48 obs. of 4 variables:
## $ sex   : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Comp.1: num -31.4 -25.9 -23.6 -22 -17.1 ...
## $ Comp.2: num -2.27 -1.43 -4.44 -3.26 -3.11 ...
## $ Comp.3: num 0.943 -0.73 1.671 1.618 2.2 ...

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(df.pca.sh, colour = "sex"
           , title = "Principal components of Shells")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: "Principal components of Shells": scatterplot matrix of Comp.1, Comp.2, and Comp.3 by sex.]
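The Between, Within, and Total SS matrices of the MANOVA table described earlier in this chapter can be computed directly from their definitions. The following is a minimal sketch under stated assumptions: the built-in iris data stand in for the shells data, and the names Y, g, H, E, Tot, and S are illustrative rather than taken from these notes.

```r
# Sketch only: iris (built into R) stands in for the shells data;
# Y, g, H, E, Tot, and S are illustrative names.
Y <- as.matrix(iris[, 1:4])          # n x p matrix of features
g <- iris$Species                    # grouping factor with k levels
n <- nrow(Y); p <- ncol(Y); k <- nlevels(g)

xbar <- colMeans(Y)                  # overall mean vector
H <- matrix(0, p, p)                 # Between SS matrix
E <- matrix(0, p, p)                 # Within (Error) SS matrix
for (lev in levels(g)) {
  Yi <- Y[g == lev, , drop = FALSE]  # observations in group i
  ni <- nrow(Yi)
  di <- colMeans(Yi) - xbar          # group mean minus overall mean
  H  <- H + ni * tcrossprod(di)      # n_i (xbar_i - xbar)(xbar_i - xbar)'
  E  <- E + (ni - 1) * cov(Yi)       # (n_i - 1) S_i
}
Tot <- (n - 1) * cov(Y)              # Total SS matrix
S   <- E / (n - k)                   # pooled variance-covariance matrix (Error MS)

# Between + Within should reproduce Total
print(all.equal(H + E, Tot, check.attributes = FALSE))
```

Each diagonal element of E is the residual SS from the one-way ANOVA of that feature on the grouping factor, which is one way to sanity-check the matrices against univariate output.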
For comparison with the MANOVA below, here are the univariate ANOVAs for each feature. For the carapace data, the univariate ANOVAs indicate significant differences between sexes on length, width, and height. Females are larger on average than males on each feature.

# Univariate ANOVA tests, by each response variable
lm.sh <- lm(cbind(length, width, height) ~ sex, data = shells)
summary(lm.sh)
## Response length :
##
## Call:
## lm(formula = length ~ sex, data = shells)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -38.04 -10.67   1.27  11.93  40.96
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   136.04       3.51   38.77  < 2e-16 ***
## sexM          -22.63       4.96   -4.56  3.8e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.2 on 46 degrees of freedom
## Multiple R-squared: 0.311, Adjusted R-squared: 0.296
## F-statistic: 20.8 on 1 and 46 DF, p-value: 3.79e-05
##
##
## Response width :
##
## Call:
## lm(formula = width ~ sex, data = shells)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.583  -5.542  -0.438   4.885  29.417
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   102.58       2.15    47.7  < 2e-16 ***
## sexM          -14.29       3.04    -4.7  2.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.5 on 46 degrees of freedom
## Multiple R-squared: 0.325, Adjusted R-squared: 0.31
## F-statistic: 22.1 on 1 and 46 DF, p-value: 2.38e-05
##
##
## Response height :
##
## Call:
## lm(formula = height ~ sex, data = shells)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -14.042  -2.792  -0.708   4.042  14.958
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    52.04       1.26   41.36  < 2e-16 ***
## sexM          -11.33       1.78   -6.37  8.1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.16 on 46 degrees of freedom
## Multiple R-squared: 0.469, Adjusted R-squared: 0.457
## F-statistic: 40.6 on 1 and 46 DF, p-value: 8.09e-08

# Alternatively, for many ANOVAs at once, it may be easier
# to select by column number, but you won't get the column names in the output.
# Also, the left-hand side needs to be a matrix data type.
# lm.sh <- lm(as.matrix(shells[, 2:4]) ~ shells[, 1])

A few procedures can be used for one-way MANOVA; two are manova() and the car package's Manova(). First we check the assumption of multivariate normality.

# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
# with observations as columns.
mshapiro.test(t(shells[shells$sex == "F", 2:4]))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.8932, p-value = 0.01551
mshapiro.test(t(shells[shells$sex == "M", 2:4]))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.936, p-value = 0.1329

# Graphical Assessment of Multivariate Normality
f.mnv.norm.qqplot <- function(x, name = "") {
  # creates a QQ-plot for assessing multivariate normality
  x <- as.matrix(x)          # n x p numeric matrix
  center <- colMeans(x)      # centroid
  n <- nrow(x)
  p <- ncol(x)
  cov <- cov(x)
  d <- mahalanobis(x, center, cov)   # distances
  qqplot(qchisq(ppoints(n), df=p), d
       , main=paste("QQ Plot MV Normality:", name)
       , ylab="Mahalanobis D2 distance"
       , xlab="Chi-squared quantiles")
  abline(a = 0, b = 1, col = "red")
}
f.mnv.norm.qqplot(shells[shells$sex == "F", 2:4], "Female")
f.mnv.norm.qqplot(shells[shells$sex == "M", 2:4], "Male")

[Figure: "QQ Plot MV Normality: Female" and "QQ Plot MV Normality: Male": Mahalanobis D2 distance versus Chi-squared quantiles.]

The curvature in the Female sample causes us to reject normality, while the males do not deviate from normality. We'll proceed anyway since this deviation from normality in the female sample will largely increase the variability of the sample and not displace the mean greatly, and the sample sizes are somewhat large.

Multivariate test statistics

These four multivariate test statistics are among the most common to assess differences across the levels of the categorical variables for a linear combination of responses. In general Wilks' lambda is recommended unless there are problems with small total sample size, unequal sample sizes between groups, violations of assumptions, etc., in which case Pillai's trace is more robust.

Wilks' lambda, (λ)
• Most commonly used statistic for overall significance
• Considers differences over all the characteristic roots
• The smaller the value of Wilks' lambda, the larger the between-groups dispersion

Pillai's trace
• Considers differences over all the characteristic roots
• More robust than Wilks'; should be used when sample size decreases, cell sizes are unequal, or homogeneity of covariances is violated

Hotelling's trace
• Considers differences over all the characteristic roots

Roy's greatest characteristic root
• Tests for differences on only the first discriminant function (Chapter 16)
• Most appropriate when responses are strongly interrelated on a single dimension
• Highly sensitive to violation of assumptions, but most powerful when all assumptions are met

# Multivariate MANOVA test
# the specific test is specified in summary()
# test = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
man.sh <- manova(cbind(length, width, height) ~ sex, data = shells)
summary(man.sh, test="Wilks")
##           Df Wilks approx F num Df den Df  Pr(>F)
## sex        1 0.387     23.2      3     44 3.6e-09 ***
## Residuals 46
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
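All four test statistics are functions of the characteristic roots of E⁻¹H, and with two groups there is only one nonzero root, so the statistics can be worked out by hand. The sketch below is an illustration under stated assumptions, not output from the notes: it takes the nonzero root to be 1.5843 (the value reported for the carapace data, rounded to four decimals) with p = 3, k = 2, and n = 48, and it relies on the standard result that with a single nonzero root all four tests share one exact F statistic.

```r
# Sketch only: lambda, p, k, and n are taken from the carapace example;
# the formulas below assume a single nonzero characteristic root (k = 2 groups).
lambda <- 1.5843                      # nonzero root of solve(E) %*% H (rounded)
p <- 3; k <- 2; n <- 48

wilks     <- 1 / (1 + lambda)         # product of 1/(1 + lambda_j)
pillai    <- lambda / (1 + lambda)    # sum of lambda_j/(1 + lambda_j)
hotelling <- lambda                   # sum of lambda_j
roy       <- lambda                   # largest lambda_j

# with one nonzero root, all four tests share the same exact F statistic
num.df <- p
den.df <- n - k - p + 1
F.stat <- lambda * den.df / num.df
p.val  <- pf(F.stat, num.df, den.df, lower.tail = FALSE)

print(round(c(Pillai = pillai, Wilks = wilks,
              Hotelling = hotelling, Roy = roy), 4))
print(round(F.stat, 2))
print(signif(p.val, 3))
```

Because the root is entered rounded, the computed Wilks and Pillai values can differ from software output in the last printed digit; note also that Wilks and Pillai sum to 1 in this single-root case.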
# I prefer the output from the car package
library(car)
lm.man <- lm(cbind(length, width, height) ~ sex, data = shells)
man.sh <- Manova(lm.man)
summary(man.sh)
##
## Type II MANOVA Tests:
##
## Sum of squares and products for error:
##        length width height
## length  13591  8057   4680
## width    8057  5101   2840
## height   4680  2840   1748
##
## ------------------------------------------
##
## Term: sex
##
## Sum of squares and products for the hypothesis:
##        length width height
## length   6143  3880   3077
## width    3880  2451   1944
## height   3077  1944   1541
##
## Multivariate Tests: sex
##                  Df test stat approx F num Df den Df   Pr(>F)
## Pillai            1    0.6131    23.24      3     44 3.62e-09 ***
## Wilks             1    0.3869    23.24      3     44 3.62e-09 ***
## Hotelling-Lawley  1    1.5843    23.24      3     44 3.62e-09 ***
## Roy               1    1.5843    23.24      3     44 3.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The four MANOVA tests of no differences between sexes are all highly significant. These tests reinforce the univariate analyses. Of the four tests, I prefer Roy's test because it has an intuitive interpretation. I will mostly ignore the other three tests for discussion.

Roy's test locates the linear combination of the features that produces the most significant one-way ANOVA test for no differences among groups. If the groups are not significantly different on the linear combination that best separates the groups in a one-way ANOVA sense, then there is no evidence that the population mean vectors are different. The critical value for Roy's test accounts for the linear combination being suggested by the data. That is, the critical value for Roy's test is not the same critical value that is used in a one-way ANOVA. The idea is similar to a Bonferroni-type correction with multiple comparisons.

Roy's method has the ability to locate linear combinations of the features on which the groups differ, even when the differences across groups are not significant on any feature. This is a reason for treating multivariate problems using multivariate methods rather than through individual univariate analyses on each feature.

## For Roy's characteristic Root and vector
#str(man.sh)
H <- man.sh$SSP$sex     # H = hypothesis matrix
# man.sh$df             # hypothesis df
E <- man.sh$SSPE        # E = error matrix
# man.sh$error.df       # error df
# characteristic roots of (E inverse * H)
EinvH <- solve(E) %*% H   # solve() computes the matrix inverse
ev <- eigen(EinvH)        # eigenvalue/eigenvectors
ev
## $values
## [1]  1.584e+00  1.680e-15 -1.193e-15
##
## $vectors
##         [,1]     [,2]     [,3]
## [1,]  0.2152 -0.47793  0.03786
## [2,]  0.1014  0.06175  0.58342
## [3,] -0.9713  0.87623 -0.81129

The characteristic vector (eigenvector) output gives one linear combination of the features for each variable in the data set. By construction, the linear combinations are uncorrelated (adjusting for groups). In general, the first a = minimum(p, k − 1) linear combinations contain information in decreasing amounts for distinguishing among the groups. The first linear combination is used by Roy's test.

The three linear combinations for the carapace data are (reading down the columns in the matrix of eigenvectors)

D1 =  0.2152 Length + 0.1014 Width + −0.9713 Height
D2 = −0.4779 Length + 0.0618 Width +  0.8762 Height
D3 =  0.0379 Length + 0.5834 Width + −0.8113 Height.

Here p = 3 and k = 2 gives a = min(p, k − 1) = min(3, 2 − 1) = 1 so only D1 from (D1, D2, and D3) contains information for distinguishing between male and female painted turtles.

As in PCA, the linear combinations should be interpreted. However, do not discount the contribution of a feature with a small loading (though, in the shells example, D2 and D3 have characteristic roots of essentially 0 and so carry no information about the group difference). In particular, the most important feature for distinguishing among the groups might have a small loading because of the measurement scale.

The following output shows the differences between sexes on D1. The separation between sexes on D1 is greater than on any single feature. With the sign of the eigenvector returned here, females typically have much lower (more negative) D1 scores than males; the overall sign of a linear combination is arbitrary.

The columns of the eigenvector matrix hold the coefficients of the linear combinations (unlike a rotation matrix[1] in PCA, these columns are generally not orthogonal), so we can create the D linear combinations by matrix multiplication of the original data with the eigenvector matrix (being careful about dimensions).

[1] http://en.wikipedia.org/wiki/Rotation_matrix

# linear combinations of features
D <- as.matrix(shells[,2:4]) %*% ev$vectors
colnames(D) <- c("D1", "D2", "D3")
df.D <- data.frame(sex = shells$sex, D)
str(df.D)
## 'data.frame': 48 obs. of 4 variables:
## $ sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ D1 : num -7.6 -6.22 -9.91 -9.48 -10.36 ...
## $ D2 : num -8.54 -10.74 -7.11 -8.07 -8.11 ...
## $ D3 : num 20.1 22.1 20 20.1 19.8 ...

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(df.D, colour = "sex"
           , title = "D1 is the linear combination that best distinguishes the sexes")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: "D1 is the linear combination that best distinguishes the sexes": scatterplot matrix of D1, D2, and D3 by sex.]
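The link between Roy's greatest root and a one-way ANOVA on the first linear combination can be checked numerically: if a is the eigenvector of E⁻¹H with root λ, then a'Ha = λ a'Ea, so the ANOVA F statistic on the scores Ya equals λ(n − k)/(k − 1). The following is a minimal sketch under stated assumptions: the built-in iris data stand in for the shells data, and the object names are illustrative.

```r
# Sketch only: iris (built into R) stands in for the shells data;
# object names are illustrative.
Y <- as.matrix(iris[, 1:4])
g <- iris$Species
n <- nrow(Y); k <- nlevels(g)

fit <- lm(Y ~ g)                                      # multivariate linear model
E <- crossprod(residuals(fit))                        # error SS matrix
H <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))    # hypothesis SS matrix

ev <- eigen(solve(E) %*% H)
lambda1 <- Re(ev$values[1])          # largest characteristic root
a1      <- Re(ev$vectors[, 1])       # coefficients of the first linear combination

D1   <- drop(Y %*% a1)               # scores on the first linear combination
F.D1 <- anova(lm(D1 ~ g))[1, "F value"]

print(c(F.from.anova = F.D1, F.from.root = lambda1 * (n - k) / (k - 1)))
```

For the carapace data the same relation gives 1.5843 × 46 / 1 ≈ 72.9, which is the F-statistic reported for the one-way ANOVA on D1. Re() is used only as a guard, since eigen() on a nonsymmetric matrix may return values typed as complex.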
# Univariate ANOVA tests, by D1 linear combination variable
lm.D.sh <- lm(D1 ~ sex, data = df.D)
summary(lm.D.sh)
##
## Call:
## lm(formula = D1 ~ sex, data = df.D)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -5.305 -0.979  0.326  0.957  4.643
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -10.868      0.388  -27.98  < 2e-16 ***
## sexM           4.690      0.549    8.54  4.8e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.9 on 46 degrees of freedom
## Multiple R-squared: 0.613, Adjusted R-squared: 0.605
## F-statistic: 72.9 on 1 and 46 DF, p-value: 4.84e-11

Chapter 16

Discriminant Analysis

A researcher collected data on two external features for two (known) sub-species of an insect. She can use discriminant analysis to find linear combinations of the features that best distinguish the sub-species. The analysis can then be used to classify insects with unknown sub-species origin into one of the two sub-species based on their external features.

To see how this might be done, consider the following data plot. Can1 is the linear combination of the two features that best distinguishes or discriminates the two sub-species. The value of Can1 could be used to classify insects into one of the two groups, as illustrated.

[Figure: data plot of the two features for the two sub-species, with the Can1 direction.]

The method generalizes to more than two features and sub-species.

16.1 Canonical Discriminant Analysis

While there's a connection between canonical discriminant analysis and canonical correlation, I prefer to emphasize the connection between canonical discriminant analysis and MANOVA because these techniques are essentially identical.

Assume that you have representative samples from $k$ groups, strata, or subpopulations. Each selected individual is measured on $p$ features (measurements) $X_1, X_2, \ldots, X_p$. As in MANOVA, canonical discriminant analysis assumes you have independent samples from multivariate normal populations with identical variance-covariance matrices.

Canonical discriminant analysis computes $r = \min(p, k - 1)$ linear combinations of the features with the following properties. The first linear combination, called the first linear discriminant function
\[ \mathrm{Can1} = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p \]
gives the most significant F-test for a null hypothesis of no group differences in a one-way ANOVA, among all linear combinations of the features. The second linear combination or the second linear discriminant function:
\[ \mathrm{Can2} = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p \]
gives the most significant F-test for no group differences in a one-way ANOVA, among all linear combinations of the features that are uncorrelated (adjusting for groups) with Can1. In general, the $j$th linear combination Can$j$ ($j = 1, 2, \ldots, r$) gives the most significant F-test for no group differences in a one-way ANOVA, among all linear combinations of the features that are uncorrelated with Can1, Can2, ..., Can($j - 1$).

The coefficients in the canonical discriminant functions can be multiplied by a constant, or all the signs can be changed (that is, multiplied by the constant −1), without changing their properties or interpretations.

16.2 Example: Owners of riding mowers

The manufacturer of a riding lawn mower wishes to identify the best prospects for buying their product using data on the incomes ($X_1$) and lot sizes ($X_2$) of homeowners (Johnson and Wichern, 1988). The data below are the incomes and lot sizes from independent random samples of 12 current owners and 12 non-owners of the mowers.

#### Example: Riding mowers
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch16_mower.dat"
mower <- read.table(fn.data, header = TRUE)
# income  = income in $1000
# lotsize = lot size in 1000 sq ft
# owner   = nonowners or owners
str(mower)
## 'data.frame': 24 obs. of 3 variables:
## $ income : num 20 28.5 21.6 20.5 29 36.7 36 27.6 23 31 ...
## $ lotsize: num 9.2 8.4 10.8 10.4 11.8 9.6 8.8 11.2 10 10.4 ...
## $ owner  : Factor w/ 2 levels "nonowner","owner": 2 2 2 2 2 2 2 2 2 2 ...

    income lotsize owner          income lotsize owner
 1   20.00    9.20 owner     13    25.00    9.80 nonowner
 2   28.50    8.40 owner     14    17.60   10.40 nonowner
 3   21.60   10.80 owner     15    21.60    8.60 nonowner
 4   20.50   10.40 owner     16    14.40   10.20 nonowner
 5   29.00   11.80 owner     17    28.00    8.80 nonowner
 6   36.70    9.60 owner     18    16.40    8.80 nonowner
 7   36.00    8.80 owner     19    19.80    8.00 nonowner
 8   27.60   11.20 owner     20    22.00    9.20 nonowner
 9   23.00   10.00 owner     21    15.80    8.20 nonowner
10   31.00   10.40 owner     22    11.00    9.40 nonowner
11   17.00   11.00 owner     23    17.00    7.00 nonowner
12   27.00   10.00 owner     24    21.00    7.40 nonowner

library(ggplot2)
p <- ggplot(mower, aes(x = income, y = lotsize, shape = owner, colour = owner))
p <- p + geom_point(size = 3)
p <- p + scale_y_continuous(limits = c(0, 15))
p <- p + scale_x_continuous(limits = c(0, 40))
p <- p + coord_fixed(ratio = 1)   # square axes (for perp lines)
p <- p + xlab("income in $1000")
p <- p + ylab("lot size in 1000 sq ft")
print(p)

[Figure: lot size in 1000 sq ft versus income in $1000, by owner.]

suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(rev(mower), colour = "owner")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of owner, lotsize, and income.]

Although the two groups overlap, the owners tend to have higher incomes and larger lots than the non-owners. Income seems to distinguish owners and non-owners better than lot size, but both variables seem to be useful for discriminating between groups.

Qualitatively, one might classify prospects based on their location relative to a roughly vertical line on the scatter plot. A discriminant analysis gives similar results to this heuristic approach because the Can1 scores will roughly correspond to the projection of the two features onto a line perpendicular to the hypothetical vertical line. candisc() computes one discriminant function here because p = 2 and k = 2 gives r = min(p, k − 1) = min(2, 1) = 1.

Below we first fit a lm() and use that object to compare populations. First we compare using univariate ANOVAs. The p-values are for one-way ANOVA comparing owners to non-owners, and both income and lotsize features are important individually for distinguishing between the groups.

# first fit lm() with formula = continuous variables ~ factor variables
lm.mower <- lm(cbind(income, lotsize) ~ owner, data = mower)
# univariate ANOVA tests
summary(lm.mower)
## Response income :
##
## Call:
## lm(formula = income ~ owner, data = mower)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -9.492 -3.802  0.587  2.598 10.208
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    19.13       1.60   11.95  4.3e-11 ***
## ownerowner      7.36       2.26    3.25   0.0037 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.54 on 22 degrees of freedom
## Multiple R-squared: 0.324, Adjusted R-squared: 0.294
## F-statistic: 10.6 on 1 and 22 DF, p-value: 0.00367
##
##
## Response lotsize :
##
## Call:
## lm(formula = lotsize ~ owner, data = mower)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8167 -0.6667 -0.0167  0.7167  1.6667
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.817      0.298   29.55   <2e-16 ***
## ownerowner     1.317      0.422    3.12    0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 22 degrees of freedom
## Multiple R-squared: 0.307, Adjusted R-squared: 0.275
## F-statistic: 9.74 on 1 and 22 DF, p-value: 0.00498

Second, the MANOVA indicates the multivariate means are different, indicating both income and lotsize features taken together are important for distinguishing between the groups.

# test whether the multivariate means of the two populations are different
library(car)
man.mo <- Manova(lm.mower)
summary(man.mo)
##
## Type II MANOVA Tests:
##
## Sum of squares and products for error:
##         income lotsize
## income  676.32  -26.41
## lotsize -26.41   23.50
##
## ------------------------------------------
##
## Term: owner
##
## Sum of squares and products for the hypothesis:
##         income lotsize
## income  324.87   58.13
## lotsize  58.13   10.40
##
## Multivariate Tests: owner
##                  Df test stat approx F num Df den Df   Pr(>F)
## Pillai            1    0.5386    12.26      2     21 0.000297 ***
## Wilks             1    0.4614    12.26      2     21 0.000297 ***
## Hotelling-Lawley  1    1.1673    12.26      2     21 0.000297 ***
## Roy               1    1.1673    12.26      2     21 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, we fit the canonical discriminant function with candisc(). The LR (likelihood ratio) p-values below correspond to tests of no differences between groups on the canonical discriminant functions. There is only one canonical discriminant function here. The test of no differences based on the first canonical discriminant function is equivalent to Roy's MANOVA test.

# perform canonical discriminant analysis
library(candisc)
## Loading required package: heplots
##
## Attaching package: 'candisc'
##
## The following object is masked from 'package:stats':
##
## cancor
can.mower <- candisc(lm.mower)
can.mower
##
## Canonical Discriminant Analysis for owner:
##
##   CanRsq Eigenvalue Difference Percent Cumulative
## 1  0.539       1.17                100        100
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
##   LR test stat approx F num Df den Df Pr(> F)
## 1        0.461     25.7      1     22 4.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The objects available from the candisc() object are named below, and we'll soon use a few. There are also a few plots available, but I'll be creating other plots shortly.

names(can.mower)   # list of objects in can.mower
##  [1] "dfh"         "dfe"         "eigenvalues" "canrsq"
##  [5] "pct"         "rank"        "ndim"        "means"
##  [9] "factors"     "term"        "terms"       "coeffs.raw"
## [13] "coeffs.std"  "structure"   "scores"

# plot(can.mower) # this plot causes Rnw compile errors
# it would show box plots
# with proportional contribution of each variable to Can1
### can also plot 2D plots when have more than two groups (will use later)
## library(heplots)
#heplot(can.mower, scale=6, fill=TRUE)
#heplot3d(can.mower, scale=6, fill=TRUE)

The raw canonical coefficients define the canonical discriminant variables and are identical to the feature loadings in a one-way MANOVA, except for an unimportant multiplicative factor. Only Can1 is generated here.

can.mower$coeffs.raw
##            Can1
## income  -0.1453
## lotsize -0.7590

The means output gives the mean score on the canonical discriminant variables by group, after centering the scores to have mean zero over all groups. These are in order of the owner factor levels (nonowner, owner).

can.mower$means
## [1]  1.034 -1.034

The linear combination of income and lotsize that best distinguishes owners from non-owners

Can1 = −0.1453 INCOME + −0.759 LOTSIZE

is a weighted average of income and lotsize.

In the scatterplot below, Can1 is the direction indicated by the dashed line.
##
##   CanRsq Eigenvalue Difference Percent Cumulative
## 1  0.539       1.17                100        100
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
##   LR test stat approx F num Df den Df Pr(> F)
## 1        0.461     25.7      1     22 4.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The objects available from the candisc() object are named below, and we'll soon use a few. There are also a few plots available, but I'll be creating other plots shortly.

names(can.mower) # list of objects in can.mower
##  [1] "dfh"         "dfe"         "eigenvalues" "canrsq"
##  [5] "pct"         "rank"        "ndim"        "means"
##  [9] "factors"     "term"        "terms"       "coeffs.raw"
## [13] "coeffs.std"  "structure"   "scores"
# plot(can.mower) # this plot causes Rnw compile errors
#   it would show box plots
#   with proportional contribution of each variable to Can1
### can also plot 2D plots when have more than two groups (will use later)
## library(heplots)
#heplot(can.mower, scale=6, fill=TRUE)
#heplot3d(can.mower, scale=6, fill=TRUE)

The raw canonical coefficients define the canonical discriminant variables and are identical to the feature loadings in a one-way MANOVA, except for an unimportant multiplicative factor. Only Can1 is generated here.

can.mower$coeffs.raw
##            Can1
## income  -0.1453
## lotsize -0.7590

The means output gives the mean score on the canonical discriminant variables by group, after centering the scores to have mean zero over all groups. These are in order of the owner factor levels (nonowner, owner).

can.mower$means
## [1]  1.034 -1.034

The linear combination of income and lotsize that best distinguishes owners from non-owners,

  Can1 = −0.1453 INCOME + −0.759 LOTSIZE,

is a weighted average of income and lotsize. In the scatterplot below, Can1 is the direction indicated by the dashed line.

library(ggplot2)
# Scatterplots with Can1 line overlayed
p <- ggplot(mower, aes(x = income, y = lotsize, shape = owner, colour = owner))
p <- p + geom_point(size = 3)
# use a little algebra to determine the intercept and slopes of the
#   Can1 line and a line perpendicular to it.
# dashed line of Can1
b1 <- can.mower$coeffs.raw[1]/can.mower$coeffs.raw[2]     # slope
a1 <- mean(mower$lotsize) - b1 * mean(mower$income) - 3.5 # intercept
p <- p + geom_abline(intercept = a1, slope = b1, linetype = 2)
p <- p + annotate("text", x = 10, y = 6, label = "Can1"
                , hjust = 0, vjust = 1, size = 4)
# solid line to separate groups (perpendicular to Can1)
b2 <- -can.mower$coeffs.raw[2]/can.mower$coeffs.raw[1]    # slope
a2 <- mean(mower$lotsize) - b2 * mean(mower$income) - 4.5 # intercept
p <- p + geom_abline(intercept = a2, slope = b2, linetype = 1, alpha = 0.5)
p <- p + annotate("text", x = 22, y = 15, label = "Perp to Can1 for discrim"
                , hjust = 0, vjust = 1, size = 4)
p <- p + scale_y_continuous(limits = c(0, 15))
p <- p + scale_x_continuous(limits = c(0, 40))
p <- p + coord_fixed(ratio = 1) # square axes (for perp lines)
p <- p + xlab("income in $1000")
p <- p + ylab("lot size in 1000 sq ft")
print(p)

[Figure: scatterplot of lot size versus income by owner, with the dashed Can1 line and the solid line perpendicular to Can1 used for discrimination.]

# Plots of Can1
p1 <- ggplot(can.mower$scores, aes(x = Can1, fill = owner))
p1 <- p1 + geom_histogram(binwidth = 2/3, alpha = 0.5, position="identity")
p1 <- p1 + scale_x_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
#p1 <- p1 + labs(title = "Can1 for mower data")
#print(p1)
p2 <- ggplot(can.mower$scores, aes(y = Can1, x = owner, fill = owner))
p2 <- p2 + geom_boxplot(alpha = 0.5)
# add a "+" at the mean (fun.y is named fun in ggplot2 >= 3.3.0)
p2 <- p2 + stat_summary(fun.y = mean, geom = "point", shape = 3, size = 2)
p2 <- p2 + geom_point()
p2 <- p2 + coord_flip()
p2 <- p2 + scale_y_continuous(limits = c(min(can.mower$scores$Can1), max(can.mower$scores$Can1)))
#p2 <- p2 + labs(title = "Can1 for mower data")
#print(p2)
library(gridExtra)
# (newer gridExtra versions use top = instead of main =)
grid.arrange(p1, p2, ncol=2, main = "Can1 for mower data")

[Figure: "Can1 for mower data". Histogram (left) and boxplot (right) of Can1 scores by owner.]

The standardized coefficients (use the pooled within-class coefficients) indicate the relative contributions of the features to the discrimination. The standardized coefficients are roughly equal, which suggests that income and lotsize contribute similarly to distinguishing the owners from non-owners.

can.mower$coeffs.std
##            Can1
## income  -0.8058
## lotsize -0.7846

The p-value of 4.5e-05 on the likelihood ratio test indicates that Can1 strongly distinguishes between owners and non-owners. This is consistent with the separation between owners and non-owners in the boxplot of Can1 scores.

I noted above that Can1 is essentially the same linear combination given in a MANOVA comparison of owners to non-owners. Here is some Manova() output to support this claim. The MANOVA tests (p-value 0.000297) lead to the same conclusion as the candisc output (as we saw earlier). The first characteristic vector from the MANOVA is given here.

## For Roy's characteristic Root and vector
H <- man.mo$SSP$owner # H = hypothesis matrix
E <- man.mo$SSPE      # E = error matrix
# characteristic roots of (E inverse * H)
EinvH <- solve(E) %*% H # solve() computes the matrix inverse
ev <- eigen(EinvH)      # eigenvalue/eigenvectors
ev
## $values
## [1] 1.167 0.000
##
## $vectors
##        [,1]    [,2]
## [1,] 0.1881 -0.1761
## [2,] 0.9822  0.9844
mult.char.can.disc <- can.mower$coeffs.raw[1] / ev$vectors[1,1]
mult.char.can.disc
## [1] -0.7728

The first canonical discriminant function is obtained by multiplying the first characteristic vector given in MANOVA by -0.7728:

  Can1 = −0.1453 INCOME + −0.759 LOTSIZE
       = −0.7728 (0.1881 INCOME + 0.9822 LOTSIZE)

16.3 Discriminant Analysis on Fisher's Iris Data

Fisher's iris data consists of samples of 50 flowers from each of three species of iris: Setosa, Versicolor, and Virginica. Four measurements (recorded in cm in R's iris data set) were taken on each flower: sepal length, sepal width, petal length,
and petal width. The plots show big differences between Setosa and the other two species. The differences between Versicolor and Virginica are smaller, and appear to be mostly due to differences in the petal widths and lengths.

#### Example: Fisher's iris data
# The "iris" dataset is included with R in the library(datasets)
str(iris)
## 'data.frame': 150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(iris[,c(5,1,2,3,4)], colour = "Species")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, with overall and within-species correlations.]

candisc was used to discriminate among species. There are k = 3 species and p = 4 features, so the number of discriminant functions is 2 (the minimum of 4 and 3 − 1).

# first fit lm() with formula = continuous variables ~ factor variables
lm.iris <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species
            , data = iris)
## univariate ANOVA tests
#summary(lm.iris)
## test whether the multivariate means of the populations are different
#library(car)
#man.mo <- Manova(lm.iris)
#summary(man.mo)
# perform canonical discriminant analysis
library(candisc)
can.iris <- candisc(lm.iris)
can.iris$coeffs.raw
##                 Can1    Can2
## Sepal.Length -0.8294  0.0241
## Sepal.Width  -1.5345  2.1645
## Petal.Length  2.2012 -0.9319
## Petal.Width   2.8105  2.8392

Can1 is a comparison of petal and sepal measurements (from Raw Canonical Coefficients):

  Can1 = −0.8294 sepalL + −1.534 sepalW + 2.201 petalL + 2.81 petalW.

Can2 is not easily interpreted, though perhaps it is a comparison of lengths and widths ignoring sepalL:

  Can2 = 0.0241 sepalL + 2.165 sepalW + −0.9319 petalL + 2.839 petalW.

The canonical directions provide a maximal separation of the species. Two lines across Can1 will provide a classification rule.

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(can.iris$scores, colour = "Species")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: scatterplot matrix of Species, Can1, and Can2 scores; the overall correlation between Can1 and Can2 is essentially zero (−3.36e−16).]

There are significant differences among species on both discriminant functions; see the p-values under the likelihood ratio tests. Of course, Can1 produces the largest differences: the overlap among species on Can1 is small. Setosa has the lowest Can1 scores because this species has the smallest petal measurements relative to its sepal measurements. Virginica has the highest Can1 scores.

can.iris
##
## Canonical Discriminant Analysis for Species:
##
##   CanRsq Eigenvalue Difference Percent Cumulative
## 1  0.970     32.192       31.9  99.121       99.1
## 2  0.222      0.285       31.9   0.879      100.0
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
##   LR test stat approx F num Df den Df Pr(> F)
## 1        0.023      404      4    292 < 2e-16 ***
## 2        0.778       42      1    147 1.3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Questions:
1. What is the most striking feature of the plot of the Can1 scores?
2. Does the assumption of equal population covariance matrices across species seem plausible?
3. How about multivariate normality?

# Covariance matrices by species
by(iris[,1:4], iris$Species, cov)
## iris$Species: setosa
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      0.12425    0.099216     0.016355    0.010331
## Sepal.Width       0.09922    0.143690     0.011698    0.009298
## Petal.Length      0.01636    0.011698     0.030159    0.006069
## Petal.Width       0.01033    0.009298     0.006069    0.011106
## ----------------------------------------------------
## iris$Species: versicolor
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      0.26643     0.08518      0.18290     0.05578
## Sepal.Width       0.08518     0.09847      0.08265     0.04120
## Petal.Length      0.18290     0.08265      0.22082     0.07310
## Petal.Width       0.05578     0.04120      0.07310     0.03911
## ----------------------------------------------------
## iris$Species: virginica
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length      0.40434     0.09376      0.30329     0.04909
## Sepal.Width       0.09376     0.10400      0.07138     0.04763
## Petal.Length      0.30329     0.07138      0.30459     0.04882
## Petal.Width       0.04909     0.04763      0.04882     0.07543

# Test multivariate normality using the Shapiro-Wilk test for multivariate normality
library(mvnormtest)
# The data needs to be transposed t() so each variable is a row
#   with observations as columns.
mshapiro.test(t(iris[iris$Species == "setosa"    , 1:4]))
##
##  Shapiro-Wilk normality test
##
## data:  Z
## W = 0.9588, p-value = 0.07906
mshapiro.test(t(iris[iris$Species == "versicolor", 1:4]))
##
##  Shapiro-Wilk normality test
##
## data:  Z
## W = 0.9304, p-value = 0.005739
mshapiro.test(t(iris[iris$Species == "virginica" , 1:4]))
##
##  Shapiro-Wilk normality test
##
## data:  Z
## W = 0.9341, p-value = 0.007955

# Graphical Assessment of Multivariate Normality
f.mnv.norm.qqplot <- function(x, name = "") {
  # creates a QQ-plot for assessing multivariate normality
  x <- as.matrix(x)      # n x p numeric matrix
  center <- colMeans(x)  # centroid
  n <- nrow(x);
  p <- ncol(x);
  cov <- cov(x);
  d <- mahalanobis(x, center, cov) # distances
  qqplot(qchisq(ppoints(n), df=p), d
    , main=paste("QQ Plot MV Normality:", name)
    , ylab="Mahalanobis D2 distance"
    , xlab="Chi-squared quantiles")
  abline(a = 0, b = 1, col = "red")
}
par(mfrow=c(1,3))
f.mnv.norm.qqplot(iris[iris$Species == "setosa"    , 1:4], "setosa"    )
f.mnv.norm.qqplot(iris[iris$Species == "versicolor", 1:4], "versicolor")
f.mnv.norm.qqplot(iris[iris$Species == "virginica" , 1:4], "virginica" )
par(mfrow=c(1,1))

[Figure: three panels, "QQ Plot MV Normality: setosa", "QQ Plot MV Normality: versicolor", and "QQ Plot MV Normality: virginica"; Mahalanobis D2 distance versus Chi-squared quantiles.]
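Returning to the candisc() output for the iris data, the link between the MANOVA matrices and the canonical discriminant analysis that was shown for the mower data can be checked here with base R alone. The following is a minimal sketch, not the candisc implementation: it assumes the standard relationship CanRsq = λ/(1 + λ), where λ is an eigenvalue of E⁻¹H, and the object names (fit, Tot, lambda, canrsq) are my own choices for illustration.

```r
# Sketch: recover the candisc eigenvalues and CanRsq for iris from E^{-1} H.
fit <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
          data = iris)

E   <- crossprod(residuals(fit))                          # error SSP matrix
Y   <- as.matrix(iris[, 1:4])
Tot <- crossprod(scale(Y, center = TRUE, scale = FALSE))  # total SSP matrix
H   <- Tot - E                                            # hypothesis SSP matrix

# E^{-1} H is not symmetric, so eigen() may return complex values with
# negligible imaginary parts; keep the real parts of the two nonzero roots.
ev     <- eigen(solve(E) %*% H)
lambda <- Re(ev$values)[1:2]

# squared canonical correlations
canrsq <- lambda / (1 + lambda)

round(cbind(lambda, canrsq), 3)
```

The two values of lambda and canrsq should agree with the Eigenvalue and CanRsq columns printed by can.iris above. Only r = min(p, k − 1) = 2 roots are nonzero, which is why the remaining two eigenvalues are discarded.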
Chapter 17 Classification

A goal with discriminant analysis might be to classify individuals of unknown origin into one of several known groups. How should this be done? Recall our example where two subspecies of an insect are compared on two external features. The discriminant analysis gives one discriminant function (CAN1) for distinguishing the subspecies. It makes sense to use CAN1 to classify insects because CAN1 is the best (linear) combination of the features to distinguish between subspecies.

Given the score on CAN1 for each insect to be classified, assign insects to the sub-species that they most resemble. Similarity is measured by the distance on CAN1 to the average CAN1 scores for the two subspecies, identified by X's on the plot.

To classify using r discriminant functions, compute the average response on CAN1, ..., CANr in each sample. Then compute the canonical discriminant function scores for each individual to be classified. Each observation is classified into the group it is closest to, as measured by the distance from the observation in r-space to the sample mean vector on the canonical variables. How do you measure distance in r-space?

The plot below illustrates the idea with the r = 2 discriminant functions in Fisher's iris data: Obs 1 is classified as Versicolor and Obs 2 is classified as Setosa.

[Figure: Can2 versus Can1 scores for the iris data by Species (setosa, versicolor, virginica), with obs 1 and obs 2 marked.]

17.1 Classification using Mahalanobis distance

Classification using discrimination is based on the original features and not the canonical discriminant function scores. Although a linear classification rule is identical to the method I just outlined, the equivalence is not obvious. I will discuss the method without justifying the equivalence.

Suppose p features $X = (X_1, X_2, \ldots, X_p)'$ are used to discriminate among the k groups. Let $\bar{X}_i = (\bar{X}_{i1}, \bar{X}_{i2}, \ldots, \bar{X}_{ip})'$ be the vector of mean responses for the ith sample, and let $S_i$ be the p-by-p variance-covariance matrix for the ith sample. The pooled variance-covariance matrix is given by

$$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2 + \cdots + (n_k - 1)S_k}{n - k},$$

where the $n_i$s are the group sample sizes and $n = n_1 + n_2 + \cdots + n_k$ is the total sample size.

The classification rules below can be defined using either the Mahalanobis generalized squared distance or in terms of a probability model. I will describe each, starting with the Mahalanobis or M-distance.

The M-distance from an observation X to (the center of) the ith sample is

$$D_i^2(X) = (X - \bar{X}_i)' S^{-1} (X - \bar{X}_i),$$

where $(X - \bar{X}_i)'$ is the transpose of the column vector $(X - \bar{X}_i)$, and $S^{-1}$ is the matrix inverse of S. Note that if S is the identity matrix (a matrix with 1s on the diagonal and 0s on the off-diagonals), then this is the squared Euclidean distance. Given the M-distance from X to each sample, classify X into the group which has the minimum M-distance.

The M-distance is an elliptical distance measure that accounts for correlation between features, and adjusts for different scales by standardizing the features to have unit variance. The picture below (left) highlights the idea when p = 2. All of the points on a given ellipse are the same M-distance to the center $(\bar{X}_1, \bar{X}_2)'$. As the ellipse expands, the M-distance to the center increases.

[Figure: left panel, ellipses of constant M-distance in the (X1, X2) plane centered at (3, 5) with Corr(X1, X2) > 0; right panel, three elliptical groups (Group 1, Group 2, Group 3) with obs 1, obs 2, and obs 3 marked.]

To see how classification works, suppose you have three groups and two features, as in the plot above (right). Observation 1 is closest in M-distance to the center of group 3. Observation 2 is closest to group 1. Thus, classify observations 1 and 2 into groups 3 and 1, respectively. Observation 3 is closest to the center of group 2 in terms of the standard Euclidean (walking) distance. However, observation 3 is more similar to data in group 1 than it is to either of the other groups. The M-distance from observation 3 to group 1 is substantially smaller than the M-distances to either group 2 or 3. The M-distance accounts for the elliptical cloud of data within each group, which reflects the correlation between the two features. Thus, you would classify observation 3 into group 1.

The M-distance from the ith group to the jth group is the M-distance between the centers of the groups:

$$D^2(i, j) = D^2(j, i) = (\bar{X}_i - \bar{X}_j)' S^{-1} (\bar{X}_i - \bar{X}_j).$$

Larger values suggest relatively better potential for discrimination between groups. In the plot above, $D^2(1, 2) < D^2(1, 3)$, which implies that it should be easier to distinguish between groups 1 and 3 than groups 1 and 2.

M-distance classification is equivalent to classification based on a probability model that assumes the samples are independently selected from multivariate normal populations with identical covariance matrices. This assumption is consistent with the plot above where the data points form elliptical clouds with similar orientations and spreads across samples. Suppose you can assume a priori (without looking at the data for the individual that you wish to classify) that a randomly selected individual from the combined population (i.e., merge all sub-populations) is equally likely to be from any group:

$$\mathrm{PRIOR}_j \equiv \Pr(\text{observation is from group } j) = \frac{1}{k},$$

where k is the number of groups. Then, given the observed features X for an individual,

$$\Pr(j \mid X) \equiv \Pr(\text{observation is from group } j \text{ given } X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_{l=1}^{k} \exp\{-0.5\, D_l^2(X)\}}.$$

To be precise, I will note that $\Pr(j \mid X)$ is unknown, and the expression for $\Pr(j \mid X)$ is an estimate based on the data.

The group with the largest posterior probability $\Pr(j \mid X)$ is the group into which X is classified. Maximizing $\Pr(j \mid X)$ across groups is equivalent to minimizing the M-distance $D_j^2(X)$ across groups, so the two classification rules are equivalent.
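The M-distance rule and the equal-prior posterior probabilities described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in iris data as a convenient example, assumes equal priors of 1/k, and the object names (centers, S, D2, post, class_dist, class_post) are mine rather than part of any package.

```r
# Sketch: M-distance classification and equal-prior posteriors on iris.
X      <- as.matrix(iris[, 1:4])
groups <- iris$Species
k      <- nlevels(groups)
n      <- nrow(X)

# group mean vectors, one row per group
centers <- t(sapply(levels(groups), function(g) {
  colMeans(X[groups == g, , drop = FALSE])
}))

# pooled covariance: S = sum((n_i - 1) S_i) / (n - k)
S <- Reduce(`+`, lapply(levels(groups), function(g) {
  Xg <- X[groups == g, , drop = FALSE]
  (nrow(Xg) - 1) * cov(Xg)
})) / (n - k)

# squared M-distance from every observation to each group center
D2 <- sapply(levels(groups), function(g) mahalanobis(X, centers[g, ], S))

# estimated posterior probabilities under equal priors
post <- exp(-0.5 * D2)
post <- post / rowSums(post)

# the two rules: minimum M-distance and maximum posterior probability
class_dist <- levels(groups)[apply(D2,   1, which.min)]
class_post <- levels(groups)[apply(post, 1, which.max)]

all(class_dist == class_post)            # the two rules agree
table(true = groups, classified = class_dist)
```

Because the table classifies the same observations that were used to estimate the centers and S, it gives a resubstitution summary, which will tend to look better than the rule's performance on new data.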

17.2 Evaluating the Accuracy of a Classification Rule

The misclassification rate, or the expected proportion of misclassified observations, is a good yardstick to gauge a classification rule. Better rules have smaller misclassification rates, but there is no universal cutoff for what is considered good in a given problem. You should judge a classification rule relative to the current standards in your field for "good classification".

Resubstitution evaluates the misclassification rate using the data from which the classification rule is constructed. The resubstitution estimate of the error rate is optimistic (too small). A greater percentage of misclassifications is expected when the rule is used on new data, or on data from which the rule is not constructed.

Cross-validation is a better way to estimate the misclassification rate. In many statistical packages, you can implement cross-validation by randomly splitting the data into a training or calibration set from which the classification rule is constructed. The remaining data, called the test data set, is used with the classification rule to estimate the error rate. In particular, the proportion of test cases misclassified estimates the misclassification rate. This process is often repeated, say 10 times, and the error rate estimated to be the average of the error rates from the individual splits. With repeated random splitting, it is common to use 10% of each split as the test data set (a 10-fold cross-validation).

Repeated random splitting can be coded. As an alternative, you might consider using one random 50-50 split (a 2-fold) to estimate the misclassification rate, provided you have a reasonably large data base.

Another form of cross-validation uses a jackknife method where single cases are held out of the data (an n-fold), then classified after constructing the classification rule from the remaining data. The process is repeated for each case, giving an estimated misclassification rate as the proportion of cases misclassified.

The lda() function allows for jackknife cross-validation (CV) and cross-validation using a single test data set (predict()). The jackknife method is necessary with small data sets so single observations don't greatly bias the classification. You can also classify observations with unknown group membership, by treating the observations to be classified as a test data set.

17.3 Example: Carapace classification and error

#### Example: Painted turtle shells
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch15_shells_mf.dat"
# stringsAsFactors = TRUE keeps sex as a factor (needed under R >= 4.0)
shells <- read.table(fn.data, header = TRUE, stringsAsFactors = TRUE)
## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
# put scatterplots on top so y axis is vertical
p <- ggpairs(shells, colour = "sex")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)
## 3D scatterplot
library(scatterplot3d)
with(shells, {
  scatterplot3d(x = length
              , y = width
              , z = height
              , main = "Shells 3D Scatterplot"
              , type = "h"                 # lines to the horizontal xy-plane
              , color = as.integer(sex)    # color by group
              , pch = as.integer(sex)+19   # plotting character by group
              #, highlight.3d = TRUE       # makes color change with z-axis value
              , angle = 100                # viewing angle (seems hard to control)
              )
})
#### Try this!
#### For a rotatable 3D plot, use plot3d() from the rgl library
# ## This uses the R version of the OpenGL (Open Graphics Library)
# library(rgl)
# with(shells, { plot3d(x = length, y = width, z = height, col = sex) })

[Figure: scatterplot matrix of sex, length, width, and height for the shells data, and the "Shells 3D Scatterplot" of height versus length and width.]

As suggested in the partimat() plot below (and by an earlier analysis), classification based on the length and height of the carapace appears best (considering only pairs). For this example, we'll only consider those two features.

# classification of observations based on classification methods
#   (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(sex ~ length + width + height, data = shells
       , plot.matrix = TRUE)

[Figure: partimat() plot matrix for length, width, and height with classification regions for F and M; the panel error rates are 0.083, 0.146, and 0.188, with the smallest for the length and height pair.]

By default, lda() uses the class proportions in the data as the prior probabilities; with 24 females and 24 males, this gives equal prior probabilities for males and females.

library(MASS)
lda.sh <- lda(sex ~ length + height, data = shells)
lda.sh
## Call:
## lda(sex ~ length + height, data = shells)
##
## Prior probabilities of groups:
##   F   M
## 0.5 0.5
##
## Group means:
##   length height
## F  136.0  52.04
## M  113.4  40.71
##
## Coefficients of linear discriminants:
##            LD1
## length  0.1371
## height -0.4891

The linear discriminant function is in the direction that best separates the sexes,

  LD1 = 0.1371 length + −0.4891 height.

The plot of the lda object shows the groups across the linear discriminant function. From the klaR package we can get color-coded classification areas based on a perpendicular line across the LD function.

plot(lda.sh, dimen = 1, type = "both", col = as.numeric(shells$sex))

[Figure: histograms with density curves of LD1 scores for group F and group M.]

The constructed table gives the jackknife-based classification and posterior probabilities of being male or female for each observation in the data set. The misclassification rate follows.

# CV = TRUE does jackknife (leave-one-out) crossvalidation
lda.sh.cv <- lda(sex ~ length + height, data = shells, CV = TRUE)
# Create a table of classification and posterior probabilities for each observation
classify.sh <- data.frame(sex = shells$sex
                        , class = lda.sh.cv$class
                        , error = ""
                        , round(lda.sh.cv$posterior,3))
colnames(classify.sh) <- c("sex", "class", "error"
                         , paste("post", colnames(lda.sh.cv$posterior), sep=""))
                           # "postF" and "postM" column names
# error column
classify.sh$error <- as.character(classify.sh$error)
classify.agree <- as.character(as.numeric(shells$sex) - as.numeric(lda.sh.cv$class))
classify.sh$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]

The classification summary table is constructed from the canonical discriminant functions by comparing the predicted group membership for each observation to its actual group label. To be precise, if you assume that the sex for the 24 males is unknown then you would classify each of them correctly. Similarly, 20 of the 24 females are classified correctly, with the other four classified as males. The Total Error of 0.0833 is the estimated misclassification rate, computed as the sum of Rates×Prior over sexes: 0.0833 = 0.1667×0.5 + 0×0.5. Are the misclassification results sensible, given the data plots that you saw earlier?

The listing of the posterior probabilities for each sex, by case, gives you an idea of the clarity of classification, with larger differences between the male and female posteriors corresponding to more definitive (but not necessarily correct!) classifications.

# print table
classify.sh
##    sex class error postF postM
## [48-row listing: observations 1 to 24 are F and 25 to 48 are M; the four
##  misclassified females (error = -1) are observations 1, 2, 11, and 12.]

# Assess the accuracy of the prediction
# row = true sex, col = classified sex
pred.freq <- table(shells$sex, lda.sh.cv$class)
pred.freq
##
##      F  M
##   F 20  4
##   M  0 24
prop.table(pred.freq, 1) # proportions by row
The 75 observations in the calibration set were used to develop a classification rule. This rule was applied to the remaining 75 flowers, which form the test data set. There is no general rule about the relative sizes of the test data and the training data. Many researchers use a 50-50 split. Regardless of the split, you should combine the two data sets at the end of the cross-validation to create the actual rule for classifying future data.

Below, half of the indices of the iris data set are randomly selected within each species and assigned the label "test", whereas the rest are labelled "train". A plot indicates the two subsamples are similar.

# Randomly assign equal train/test by Species strata
library(plyr)
iris <- ddply(iris, .(Species), function(X) {
  ind <- sample.int(nrow(X), size = round(nrow(X)/2))
  sort(ind)
  X$test <- "train"
  X$test[ind] <- "test"
  X$test <- factor(X$test)
  X$test
  return(X)
})
summary(iris$test)
##  test train
##    75    75
table(iris$Species, iris$test)
##
##              test train
##   setosa       25    25
##   versicolor   25    25
##   virginica    25    25

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(subset(iris, test == "train")[,c(5,1,2,3,4)], colour = "Species", title = "train")
print(p)
p <- ggpairs(subset(iris, test == "test")[,c(5,1,2,3,4)], colour = "Species", title = "test")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

[Figure: two ggpairs() scatterplot matrices of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width coloured by Species, one titled "train" and one titled "test".]

As suggested in the partimat() plot below, we should expect Sepal.Length to potentially not contribute much to the classification, since (pairwise) more errors are introduced with that variable than between other pairs. (In fact, we'll see below that the coefficients for Sepal.Length are the smallest.)

# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
       , data = subset(iris, test == "train")
       , plot.matrix = TRUE)

[Figure: partimat() partition plots for every pair of the four measurements in the training data; the apparent error rates shown in the panels range from 0.04 to 0.227.]
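The stratified 50-50 split described above can also be sketched with base R alone. This is a minimal sketch, not the code used in these notes: the seed value and the names rows_by_species, test_rows, split_label, and split_counts are my own choices for illustration, and the labels are kept in a separate vector rather than added to iris.

```r
# Sketch: stratified 50/50 train/test split of iris using base R only.
# The seed is an arbitrary value chosen for this sketch.
set.seed(2014)

# Row indices grouped by species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Within each species, draw half of the rows (without replacement) for the test set
test_rows <- unlist(lapply(rows_by_species,
                           function(i) sample(i, size = length(i) %/% 2)))

# Label every row (hypothetical name; the notes store the label in iris$test)
split_label <- factor(ifelse(seq_len(nrow(iris)) %in% test_rows, "test", "train"))

# Check the allocation: each species should contribute 25 rows to each half
split_counts <- table(iris$Species, split_label)
print(split_counts)
```

Fixing the seed makes the split reproducible, which is convenient when comparing error rates across runs.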
library(MASS)
lda.iris <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
              , data = subset(iris, test == "train"))
lda.iris
## Call:
## lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
##     data = subset(iris, test == "train"))
##
## Prior probabilities of groups:
##     setosa versicolor  virginica
##     0.3333     0.3333     0.3333
##
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            4.960       3.400        1.448       0.240
## versicolor        5.968       2.792        4.304       1.344
## virginica         6.512       2.944        5.508       2.020
##
## Coefficients of linear discriminants:
##                  LD1     LD2
## Sepal.Length  0.5981 -0.3320
## Sepal.Width   1.6842  2.0882
## Petal.Length -2.2871 -0.8035
## Petal.Width  -2.2972  2.8096
##
## Proportion of trace:
##    LD1    LD2
## 0.9934 0.0066

The linear discriminant functions that best classify the Species in the training set are
LD1 = 0.5981 sepalL + 1.684 sepalW − 2.287 petalL − 2.297 petalW
LD2 = −0.332 sepalL + 2.088 sepalW − 0.8035 petalL + 2.81 petalW.

The plots of the lda object show the data on the LD scale.

# colour by the Species of the training rows, which are the rows used in the fit
plot(lda.iris, dimen = 1, col = as.numeric(subset(iris, test == "train")$Species))
plot(lda.iris, dimen = 2, col = as.numeric(subset(iris, test == "train")$Species))
#pairs(lda.iris, col = as.numeric(subset(iris, test == "train")$Species))

[Figure: histograms of LD1 for groups setosa, versicolor, and virginica, and a scatterplot of LD2 against LD1 with points labelled by Species; setosa is well separated from the other two species.]

# CV = TRUE does jackknife (leave-one-out) cross-validation
lda.iris.cv <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
                 , data = subset(iris, test == "train"), CV = TRUE)
# Create a table of classification and posterior probabilities for each observation
classify.iris <- data.frame(Species = subset(iris, test == "train")$Species
                          , class = lda.iris.cv$class
                          , error = ""
                          , round(lda.iris.cv$posterior,3))
colnames(classify.iris) <- c("Species", "class", "error"
                           , paste("post", colnames(lda.iris.cv$posterior), sep=""))
# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(subset(iris, test == "train")$Species)
                             - as.numeric(lda.iris.cv$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
#classify.iris

The misclassification error is low within the training set.

# Assess the accuracy of the prediction
# row = true Species, col = classified Species
pred.freq <- table(subset(iris, test == "train")$Species, lda.iris.cv$class)
pred.freq
##
##              setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         23         2
##   virginica       0          1        24
prop.table(pred.freq, 1) # proportions by row
##
##              setosa versicolor virginica
##   setosa       1.00       0.00      0.00
##   versicolor   0.00       0.92      0.08
##   virginica    0.00       0.04      0.96
# proportion correct for each category
diag(prop.table(pred.freq, 1))
##     setosa versicolor  virginica
##       1.00       0.92       0.96
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.96
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.04
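The accuracy summaries above are simple arithmetic on the classification table. As a small sketch, the jackknife table for the training set can be entered by hand and the same rates recomputed; the object names conf, row_correct, total_correct, and total_error are hypothetical names used only here.

```r
# Jackknife classification table for the training set, copied from the output above
# (rows = true Species, columns = classified Species)
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(25,  0,  0,
                  0, 23,  2,
                  0,  1, 24),
               nrow = 3, byrow = TRUE,
               dimnames = list(true = species, classified = species))

# Proportion correct within each true class: diagonal count over row total
row_correct <- diag(conf) / rowSums(conf)

# Overall proportion correct and overall error rate
total_correct <- sum(diag(conf)) / sum(conf)
total_error   <- 1 - total_correct

print(row_correct)
print(c(correct = total_correct, error = total_error))
```

The overall rate is 72 correct out of 75, which is where the 0.96 and 0.04 above come from.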
How well do the LD functions constructed on the training data predict the Species in the independent test data?

# predict the test data from the training data LDFs
pred.iris <- predict(lda.iris, newdata = subset(iris, test == "test"))
# Create a table of classification and posterior probabilities for each observation
classify.iris <- data.frame(Species = subset(iris, test == "test")$Species
                          , class = pred.iris$class
                          , error = ""
                          , round(pred.iris$posterior,3))
colnames(classify.iris) <- c("Species", "class", "error"
                           , paste("P", colnames(lda.iris.cv$posterior), sep=""))
# error column
classify.iris$error <- as.character(classify.iris$error)
classify.agree <- as.character(as.numeric(subset(iris, test == "test")$Species)
                             - as.numeric(pred.iris$class))
classify.iris$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
classify.iris
## (listing of the 75 test observations with columns Species, class, error,
##  Psetosa, Pversicolor, Pvirginica; every observation is classified correctly)

# Assess the accuracy of the prediction
# row = true Species, col = classified Species
pred.freq <- table(subset(iris, test == "test")$Species, pred.iris$class)
pred.freq
##
##              setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         25         0
##   virginica       0          0        25
prop.table(pred.freq, 1) # proportions by row
##
##              setosa versicolor virginica
##   setosa          1          0         0
##   versicolor      0          1         0
##   virginica       0          0         1
# proportion correct for each category
diag(prop.table(pred.freq, 1))
##     setosa versicolor  virginica
##          1          1          1
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 1
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0

The classification rule based on the training set works well with the test data. Do not expect such nice results on all classification problems! Usually the error rate is slightly higher on the test data than on the training data.

It is important to recognize that statistically significant differences (MANOVA) among groups on linear discriminant function scores do not necessarily translate into accurate classification rules! (WHY?)

17.4.1 Stepwise variable selection for classification

Stepwise variable selection for classification can be performed using package klaR function stepclass() using any specified classification function. Classification performance is estimated by one of Uschi's classification performance measures. The resulting model can be very sensitive to the starting model. Below, the first model starts full and ends full.
The second model starts empty and ends after one variable is added. Note that running this repeatedly could result in slightly different models because the k-fold cross-validation partitions the data at random. The formula object gives the selected model.

library(klaR)
# start with full model and do stepwise (direction = "backward")
step.iris.b <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
                       , data = iris
                       , method = "lda"
                       , improvement = 0.01 # stop criterion: improvement less than 1%
                                            # default of 5% is too coarse
                       , direction = "backward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method lda’.
## 150 observations of 4 variables in 3 classes; direction: backward
## stop criterion: improvement less than 1%.
## correctness rate: 0.98; starting variables (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
##  hr.elapsed min.elapsed sec.elapsed
##        0.00        0.00        0.31
plot(step.iris.b, main = "Start = full model, backward selection")
step.iris.b$formula
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## <environment: 0x000000001e2fdc38>

# start with empty model and do stepwise (direction = "forward")
step.iris.f <- stepclass(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
                       , data = iris
                       , method = "lda"
                       , improvement = 0.01 # stop criterion: improvement less than 1%
                                            # default of 5% is too coarse
                       , direction = "forward")
## ‘stepwise classification’, using 10-fold cross-validated correctness rate of method lda’.
## 150 observations of 4 variables in 3 classes; direction: forward
## stop criterion: improvement less than 1%.
## correctness rate: 0.96; in: "Petal.Width"; variables (1): Petal.Width
##
##  hr.elapsed min.elapsed sec.elapsed
##        0.00        0.00        0.37
plot(step.iris.f, main = "Start = empty model, forward selection")
step.iris.f$formula
## Species ~ Petal.Width
## <environment: 0x0000000021e66e18>

[Figure: estimated correctness rate by step for the two searches, titled "Start = full model, backward selection" and "Start = empty model, forward selection"; the forward search goes from START to + Petal.Width.]

Given your selected model, you can then go on to fit your classification model by using the formula from the stepclass() object.

library(MASS)
lda.iris.step <- lda(step.iris.b$formula, data = iris)
lda.iris.step
## Call:
## lda(step.iris.b$formula, data = iris)
##
## Prior probabilities of groups:
##     setosa versicolor  virginica
##     0.3333     0.3333     0.3333
##
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026
##
## Coefficients of linear discriminants:
##                  LD1     LD2
## Sepal.Length  0.8294  0.0241
## Sepal.Width   1.5345  2.1645
## Petal.Length -2.2012 -0.9319
## Petal.Width  -2.8105  2.8392
##
## Proportion of trace:
##    LD1    LD2
## 0.9912 0.0088

Note that if you have many variables, you may wish to use the alternate syntax below to specify your formula (see the help ?stepclass for this example).

iris.d <- iris[,1:4] # the data
iris.c <- iris[,5]   # the classes
sc_obj <- stepclass(iris.d, iris.c, "lda", start.vars = "Sepal.Width")
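To see what a single forward step is doing, here is a rough sketch that scores each variable on its own and picks the best one. It is not what stepclass() runs internally: I am assuming leave-one-out cross-validation from lda(..., CV = TRUE) in place of the 10-fold scheme, and the names vars, cv_rate, and best are hypothetical.

```r
library(MASS)

# Candidate predictors for the first forward step
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Leave-one-out correctness rate of a one-variable LDA for each candidate.
# (Assumption of this sketch: LOO via CV = TRUE stands in for 10-fold CV.)
cv_rate <- sapply(vars, function(v) {
  fit <- lda(reformulate(v, response = "Species"), data = iris, CV = TRUE)
  mean(fit$class == iris$Species)
})

# The variable a greedy forward search would add first
best <- names(which.max(cv_rate))

print(round(cv_rate, 3))
print(best)
```

Because leave-one-out involves no random partition, this sketch gives the same answer on every run, unlike the k-fold search above.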
17.5 Example: Analysis of Admissions Data

The admissions officer of a business school has used an index of undergraduate GPA and management aptitude test scores (GMAT) to help decide which applicants should be admitted to graduate school. The data below gives the GPA and GMAT scores for recent applicants who are classified as admit (A), borderline (B), or not admit (N). An equal number of A, B, and N's (roughly) were selected from their corresponding populations (Johnson and Wichern, 1988).

#### Example: Business school admissions data
fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch17_business.dat"
# stringsAsFactors = TRUE keeps admit a factor (no longer the default in R >= 4.0)
business <- read.table(fn.data, header = TRUE, stringsAsFactors = TRUE)

## Scatterplot matrix
library(ggplot2)
suppressMessages(suppressWarnings(library(GGally)))
p <- ggpairs(business, colour = "admit")
print(p)
# detach package after use so reshape2 works (old reshape (v.1) conflicts)
detach("package:GGally", unload=TRUE)
detach("package:reshape", unload=TRUE)

library(ggplot2)
p <- ggplot(business, aes(x = gpa, y = gmat, shape = admit, colour = admit))
p <- p + geom_point(size = 6)
library(R.oo) # for ascii code lookup
p <- p + scale_shape_manual(values=charToInt(sort(unique(business$admit))))
p <- p + theme(legend.position="none") # remove legend with fill colours
print(p)

[Figure: ggpairs() scatterplot matrix of admit, gpa, and gmat. Correlation between gpa and gmat: 0.442 overall; A: 0.147, B: −0.331, N: 0.507.]

[Figure: scatterplot of gmat against gpa with each applicant plotted as the letter of the admission group (A, B, N).]

The officer wishes to use these data to develop a more quantitative (i.e., less subjective) approach to classify prospective students. Historically, about 20% of all applicants have been admitted initially, 10% are classified as borderline, and the remaining 70% are not admitted. The officer would like to keep these percentages roughly the same in the future.

This is a natural place to use discriminant analysis. Let us do a more careful analysis here, paying attention to underlying assumptions of normality and equal covariance matrices.

The GPA and GMAT distributions are reasonably symmetric. Although a few outliers are present, it does not appear that any transformation will eliminate the outliers and preserve the symmetry. Given that the outliers are not very extreme, I would analyze the data on this scale. Except for the outliers, the spreads (IQRs) are roughly equal across groups within GPA and GMAT. I will look carefully at the variance-covariance matrices later.

There is a fair amount of overlap between the borderline and other groups, but this should not be too surprising. Otherwise these applicants would not be borderline!
17.5.1 Further Analysis of the Admissions Data

The assumption of constant variance-covariance matrices is suspect. The GPA and GMAT variances are roughly constant across groups, but the correlation between GPA and GMAT varies greatly over groups.

# Covariance matrices by admit
by(business[,2:3], business$admit, cov)
## business$admit: A
##          gpa     gmat
## gpa  0.05867    2.858
## gmat 2.85760 6479.991
## ----------------------------------------------------
## business$admit: B
##           gpa     gmat
## gpa   0.05423   -4.878
## gmat -4.87757 4002.761
## ----------------------------------------------------
## business$admit: N
##           gpa    gmat
## gpa   0.05603   10.01
## gmat 10.01171 6973.16

# Correlation matrices by admit
by(business[,2:3], business$admit, cor)
## business$admit: A
##         gpa   gmat
## gpa  1.0000 0.1466
## gmat 0.1466 1.0000
## ----------------------------------------------------
## business$admit: B
##          gpa    gmat
## gpa   1.0000 -0.3311
## gmat -0.3311  1.0000
## ----------------------------------------------------
## business$admit: N
##         gpa   gmat
## gpa  1.0000 0.5065
## gmat 0.5065 1.0000

This is consistent with the original data plots. Assuming equal variance-covariance matrices, both GPA and GMAT are important for discriminating among entrance groups.

# classification of observations based on classification methods
# (e.g. lda, qda) for every combination of two variables.
library(klaR)
partimat(admit ~ gmat + gpa
       , data = business
       , plot.matrix = FALSE)

[Figure: "Partition Plot" of gmat against gpa with the three classification regions; app. error rate: 0.157.]

library(MASS)
lda.business <- lda(admit ~ gpa + gmat
                  , data = business)
lda.business
## Call:
## lda(admit ~ gpa + gmat, data = business)
##
## Prior probabilities of groups:
##      A      B      N
## 0.3596 0.3483 0.2921
##
## Group means:
##     gpa  gmat
## A 3.322 554.4
## B 3.005 454.2
## N 2.400 443.7
##
## Coefficients of linear discriminants:
##            LD1      LD2
## gpa  -3.977913 -1.48346
## gmat -0.003058  0.01292
##
## Proportion of trace:
##    LD1    LD2
## 0.9473 0.0527

The linear discriminant functions that best classify the admit groups are
LD1 = −3.978 gpa − 0.0031 gmat
LD2 = −1.483 gpa + 0.0129 gmat,
interpreted as a weighted average of the scores and a contrast of the scores.

The plots of the lda() object show the data on the LD scale.

plot(lda.business, dimen = 1)
plot(lda.business, dimen = 2, col = as.numeric(business$admit))
#pairs(lda.business, col = as.numeric(business$admit))

[Figure: histograms of LD1 for groups A, B, and N, and a scatterplot of LD2 against LD1 with points labelled A, B, N.]

# CV = TRUE does jackknife (leave-one-out) cross-validation
lda.business.cv <- lda(admit ~ gpa + gmat
                     , data = business, CV = TRUE)
# Create a table of classification and posterior probabilities for each observation
classify.business <- data.frame(admit = business$admit
                              , class = lda.business.cv$class
                              , error = ""
                              , round(lda.business.cv$posterior,3))
colnames(classify.business) <- c("admit", "class", "error"
                               , paste("post", colnames(lda.business.cv$posterior), sep=""))
# error column
classify.business$error <- as.character(classify.business$error)
classify.agree <- as.character(as.numeric(business$admit)
                             - as.numeric(lda.business.cv$class))
classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
# print table
#classify.business

The misclassification error within the training set is reasonably low, given the overlap between the B group and the others, and there is never a misclassification between A and N.

# Assess the accuracy of the prediction
# row = true admit, col = classified admit
pred.freq <- table(business$admit, lda.business.cv$class)
pred.freq
##
##      A  B  N
##   A 26  6  0
##   B  3 25  3
##   N  0  3 23
prop.table(pred.freq, 1) # proportions by row
##
##           A       B       N
##   A 0.81250 0.18750 0.00000
##   B 0.09677 0.80645 0.09677
##   N 0.00000 0.11538 0.88462
# proportion correct for each category
diag(prop.table(pred.freq, 1))
##      A      B      N
## 0.8125 0.8065 0.8846
# total proportion correct
sum(diag(prop.table(pred.freq)))
## [1] 0.8315
# total error rate
1 - sum(diag(prop.table(pred.freq)))
## [1] 0.1685

17.5.2 Classification Using Unequal Prior Probabilities

A new twist to this problem is that we have prior information on the relative sizes of the populations. Historically 20% of applicants are admitted, 10% are borderline, and 70% are not admitted. This prior information is incorporated into the classification rule by using a prior option with lda(). When this option is omitted, lda() uses the group proportions in the training data as the prior probabilities, which are roughly equal here (0.36, 0.35, 0.29, as printed above).
The classification rule for unequal prior probabilities uses both the M-distance and the prior probability for making decisions, as outlined below. When the prior probabilities are unequal, classification is based on the generalized distance to group j:

$$D_j^2(X) = (X - \bar{X}_j)' S^{-1} (X - \bar{X}_j) - 2 \log(\mathrm{PRIOR}_j),$$

or on the estimated posterior probability of membership in group j:

$$\Pr(j \mid X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_k \exp\{-0.5\, D_k^2(X)\}}.$$

Here S is the pooled covariance matrix, and log(PRIOR_j) is the (natural) log of the prior probability of being in group j. As before, you classify observation X into the group that it is closest to in terms of generalized distance, or equivalently, into the group with the maximum posterior probability.

Note that −2 log(PRIOR_j) exceeds zero, and is extremely large when PRIOR_j is near zero. The generalized distance is the M-distance plus a penalty term that is large when the prior probability for a given group is small. If the prior probabilities are equal, the penalty terms are equal so the classification rule depends only on the M-distance.

The penalty makes it harder (relative to equal probabilities) to classify into a low probability group, and easier to classify into high probability groups. In the admissions data, an observation has to be very close to the B or A groups to not be classified as N.

Note that in the analysis below, we make the tenuous assumption that the population covariance matrices are equal. We also have 6 new observations that we wish to classify. These observations are entered as a test data set with missing class levels.

# new observations to classify
business.test <- read.table(text = "
admit gpa gmat
NA 2.7 630
NA 3.3 450
NA 3.4 540
NA 2.8 420
NA 3.5 340
NA 3.0 500
", header = TRUE)

With priors, the LDs are different.

library(MASS)
lda.business <- lda(admit ~ gpa + gmat
                  , prior = c(0.2, 0.1, 0.7)
                  , data = business)
lda.business
## Call:
## lda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1,
##     0.7))
##
## Prior probabilities of groups:
##   A   B   N
## 0.2 0.1 0.7
##
## Group means:
##     gpa  gmat
## A 3.322 554.4
## B 3.005 454.2
## N 2.400 443.7
##
## Coefficients of linear discriminants:
##            LD1    LD2
## gpa  -4.014778 -1.381
## gmat -0.002724  0.013
##
## Proportion of trace:
##    LD1    LD2
## 0.9808 0.0192

About 1/2 of the borderlines in the calibration set are misclassified. This is due to the overlap of the B group with the other 2 groups, but also reflects the low prior probability for the borderline group. The classification rule requires strong evidence that an observation is borderline before it can be classified as such.
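To make the generalized distance and posterior formulas concrete, here is a hand calculation for an applicant with GPA 2.7 and GMAT 630, using only the group means and covariance matrices printed earlier. Treat it as a sketch under stated assumptions: the group sizes (32, 31, 26) are taken from the row totals of the jackknife classification table, the pooled covariance uses divisor n − g, and the inputs are rounded, so the result is approximate. The object names are hypothetical.

```r
# Group summaries copied (rounded) from the output in the notes
means <- rbind(A = c(gpa = 3.322, gmat = 554.4),
               B = c(gpa = 3.005, gmat = 454.2),
               N = c(gpa = 2.400, gmat = 443.7))
covs <- list(A = matrix(c(0.05867,  2.85760,  2.85760, 6479.991), 2),
             B = matrix(c(0.05423, -4.87757, -4.87757, 4002.761), 2),
             N = matrix(c(0.05603, 10.01171, 10.01171, 6973.16 ), 2))
n     <- c(A = 32, B = 31, N = 26)   # assumed group sizes (table row totals)
prior <- c(A = 0.2, B = 0.1, N = 0.7)

# Pooled covariance matrix S: weight each S_j by (n_j - 1), divide by (n - g)
S <- Reduce(`+`, Map(function(S_j, n_j) (n_j - 1) * S_j, covs, n)) /
  (sum(n) - length(n))

# Applicant to classify
x <- c(gpa = 2.7, gmat = 630)

# Generalized squared distance: M-distance plus the prior penalty
D2 <- sapply(rownames(means), function(j) {
  mahalanobis(x, center = means[j, ], cov = S) - 2 * log(prior[[j]])
})

# Posterior probabilities of membership in each group
post <- exp(-0.5 * D2) / sum(exp(-0.5 * D2))
print(round(D2, 2))
print(round(post, 3))
```

The applicant has the smallest generalized distance to N, so N receives the largest posterior probability (a little over 0.8), largely because of the small penalty attached to the 0.7 prior.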
lda.329 0.360 N A 2 0.627 0.020 0.894 B N -1 0.001 B A 1 0.955 B N -1 0.table(pred.freq.129 0.classify.412 0.016 0.557 0. col = classified admit pred.freq <.365 0.059 0.06250 0.045 0.9231 # total proportion correct sum(diag(prop.114 0.146 0.595 0.table(pred.107 0.cv$class)) classify.044 0.table(business$admit.as.383 0.agree == 0)] <.32258 N 0.233 0.747 0.9062 0.freq))) ## [1] 0.545 B A 1 0.numeric(lda.416 A N -2 0.16129 0.business.344 0.17.03125 B 0.5: Example: Analysis of Admissions Data 457 classify.367 0.numeric(business$admit) .442 0.633 0.03846 0.089 0.070 0.979 0.161 0.451 B N -1 0.agree[!(classify.982 B N -1 0. 1) # proportions by row ## ## ## ## ## A B N A 0. ] ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 29 30 31 35 36 59 60 61 66 67 68 71 72 74 75 82 85 86 87 88 admit class error postA postB postN A B -1 0.009 B A 1 0.428 B N -1 0.51613 0.549 B A 1 0.834 B N -1 0.business.business[!(classify.business$error == "").002 0. errors only classify.008 B A 1 0.character(as.as.000 B N -1 0.285 N B 1 0.126 0.714 B N -1 0.agree <.000 # Assess the accuracy of the prediction # row = true admit.92308 # proportion correct for each category diag(prop.758 0.253 0.503 A B -1 0.273 0.516 0.027 0.table(pred.5161 0.037 0.7753 . business. "class"#.test ## ## ## ## ## ## ## 1 2 3 4 5 6 admit class postA postB postN NA N 0. so all the test data cases are identified as misclassified.test <. the posterior probabilities for the test cases give strong evidence in favor of classification into a specific group. class = pred.297 0.freq))) ## [1] 0.table(pred.148 Except for observation 5.business. which are unknown. paste("post".business$posterior.102 0. sep="")) ## error column #classify. error = "" .538 0.458 Ch 17: Classification # total error rate 1 .074 0.business. 
    # total error rate
    1 - sum(diag(prop.table(pred.freq)))
    ## [1] 0.2247

The test data cases were entered with missing group IDs. The classification table compares the group IDs, which are unknown, to the ID for the group into which an observation is classified. These two labels differ, so all the test data cases are identified as misclassified. Do not be confused by this! Just focus on the classification for each case, and ignore the other summaries.

    # predict the test data from the training data LDFs
    pred.business <- predict(lda.business, newdata = business.test)
    # Create a table of classification and posterior probabilities for each observation
    classify.business.test <- data.frame(admit = business.test$admit
                                       , class = pred.business$class
                                       #, error = ""
                                       , round(pred.business$posterior, 3))
    colnames(classify.business.test) <- c("admit", "class"#, "error"
                                       , paste("post", colnames(pred.business$posterior), sep=""))
    ## error column
    #classify.business.test$error <- as.character(classify.business.test$error)
    #classify.agree <- as.character(as.numeric(business.test$admit) - as.numeric(pred.business$class))
    #classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
    # print table
    classify.business.test
    ## [Output table: test observations 1 to 6 with columns admit (NA for every case),
    ##  class, postA, postB, postN.]

Except for observation 5, the posterior probabilities for the test cases give strong evidence in favor of classification into a specific group.

17.5.3 Classification With Unequal Covariance Matrices, QDA

The assumption of multivariate normal populations with equal covariance matrices is often unrealistic.
Although there are no widely available procedures for MANOVA or stepwise variable selection in discriminant analysis that relax these assumptions, there are a variety of classification rules that weaken one or both of these assumptions. The qda() function is a quadratic discriminant classification rule that assumes normality but allows unequal covariance matrices.

The quadratic discriminant classification rule is based on the generalized distance to group j:

$$D_j^2(X) = (X - \bar{X}_j)' S_j^{-1} (X - \bar{X}_j) - 2\log(\mathrm{PRIOR}_j) + \log|S_j|,$$

or equivalently, the posterior probability of membership in group j:

$$\Pr(j \mid X) = \frac{\exp\{-0.5\, D_j^2(X)\}}{\sum_k \exp\{-0.5\, D_k^2(X)\}}.$$

Here $S_j$ is the sample covariance matrix from group j and $\log|S_j|$ is the log of the determinant of this covariance matrix. The determinant penalty term is large for groups having large variability. The rule is not directly tied to linear discriminant function variables, so interpretation and insight into this method is less straightforward.

There is evidence that quadratic discrimination does not improve misclassification rates in many problems with small to modest sample sizes, in part, because the quadratic rule requires an estimate of the covariance matrix for each population. A modest to large number of observations is needed to accurately estimate variances and correlations. I often compute the linear and quadratic rules, but use the linear discriminant analysis unless the quadratic rule noticeably reduces the misclassification rate.

Recall that the GPA and GMAT sample variances are roughly constant across admission groups, but the correlation between GPA and GMAT varies widely across groups.

The quadratic rule does not classify the training data noticeably better than the linear discriminant analysis. The individuals in the test data have the same classifications under both approaches. Assuming that the optimistic error rates for the two rules were "equally optimistic", I would be satisfied with the standard linear discriminant analysis, and would summarize my analysis based on this approach. Additional data is needed to decide whether the quadratic rule might help reduce the misclassification rates.
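To make the quadratic rule concrete, here is a minimal sketch that evaluates the generalized distance and the posterior probabilities by hand. It uses R's built-in iris data with two predictors and equal priors purely for illustration (an assumption; the admissions data are not reproduced here), and the helper qda.dist2() and the new observation x.new are hypothetical names introduced for this example.

```r
# Generalized squared distance for the quadratic rule:
#   (x - xbar_j)' S_j^{-1} (x - xbar_j) - 2 log(prior_j) + log|S_j|
qda.dist2 <- function(x, xbar, S, prior) {
  d <- x - xbar
  as.numeric(t(d) %*% solve(S) %*% d) -
    2 * log(prior) +
    as.numeric(determinant(S, logarithm = TRUE)$modulus)
}

# Illustration data (assumption): two predictors from the built-in iris data
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
g <- iris$Species
prior <- c(setosa = 1/3, versicolor = 1/3, virginica = 1/3)

# Hypothetical new observation to classify
x.new <- c(Sepal.Length = 6.0, Sepal.Width = 3.0)

# One distance per group, each using that group's own mean and covariance matrix
D2 <- sapply(levels(g), function(j) {
  Xj <- X[g == j, , drop = FALSE]
  qda.dist2(x.new, colMeans(Xj), cov(Xj), prior[[j]])
})

# Posterior probabilities; subtracting min(D2) only guards against underflow
post <- exp(-0.5 * (D2 - min(D2)))
post <- post / sum(post)

round(D2, 3)
round(post, 3)
names(which.max(post))  # same group as the smallest generalized distance
```

Classifying to the smallest distance and classifying to the largest posterior probability give the same answer, because the posterior is a decreasing function of the distance.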
    # classification of observations based on classification methods
    # (e.g. lda, qda) for every combination of two variables.
    library(klaR)
    partimat(admit ~ gmat + gpa, data = business, plot.matrix = FALSE, method = "lda", main = "LDA partition")
    partimat(admit ~ gmat + gpa, data = business, plot.matrix = FALSE, method = "qda", main = "QDA partition")

[Figure: two partimat() panels, "LDA partition" and "QDA partition", plotting gmat against gpa with observations labeled by group (A, B, N); the panels report apparent error rates of 0.157 and 0.146.]

    library(MASS)
    qda.business <- qda(admit ~ gpa + gmat, prior = c(0.2, 0.1, 0.7), data = business)
    qda.business
    ## Call:
    ## qda(admit ~ gpa + gmat, data = business, prior = c(0.2, 0.1, 0.7))
    ##
    ## Prior probabilities of groups:
    ##   A   B   N
    ## 0.2 0.1 0.7
    ##
    ## Group means:
    ##     gpa  gmat
    ## A 3.322 554.4
    ## B 3.005 454.2
    ## N 2.400 443.7

    # CV = TRUE does jackknife (leave-one-out) crossvalidation
    qda.business.cv <- qda(admit ~ gpa + gmat, prior = c(0.2, 0.1, 0.7), data = business, CV = TRUE)
    # Create a table of classification and posterior probabilities for each observation
    classify.business <- data.frame(admit = business$admit
                                  , class = qda.business.cv$class
                                  , error = ""
                                  , round(qda.business.cv$posterior, 3))
    colnames(classify.business) <- c("admit", "class", "error"
                                  , paste("post", colnames(qda.business.cv$posterior), sep=""))
    # error column
    classify.business$error <- as.character(classify.business$error)
    classify.agree <- as.character(as.numeric(business$admit) - as.numeric(qda.business.cv$class))
    classify.business$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
    # print table
    #classify.business

    # Assess the accuracy of the prediction
    # row = true admit, col = classified admit
    pred.freq <- table(business$admit, qda.business.cv$class)
    pred.freq
    ##      A  B  N
    ##   A 27  3  2
    ##   B  5 17  9
    ##   N  0  1 25
    prop.table(pred.freq, 1) # proportions by row
    ##           A       B       N
    ##   A 0.84375 0.09375 0.06250
    ##   B 0.16129 0.54839 0.29032
    ##   N 0.00000 0.03846 0.96154
    # proportion correct for each category
    diag(prop.table(pred.freq, 1))
    ##      A      B      N
    ## 0.8438 0.5484 0.9615
    # total proportion correct
    sum(diag(prop.table(pred.freq)))
    ## [1] 0.7753
    # total error rate
    1 - sum(diag(prop.table(pred.freq)))
    ## [1] 0.2247
    # predict the test data from the training data QDA rule
    pred.business <- predict(qda.business, newdata = business.test)
    # Create a table of classification and posterior probabilities for each observation
    classify.business.test <- data.frame(admit = business.test$admit
                                       , class = pred.business$class
                                       #, error = ""
                                       , round(pred.business$posterior, 3))
    colnames(classify.business.test) <- c("admit", "class"#, "error"
                                       , paste("post", colnames(pred.business$posterior), sep=""))
    ## error column
    #classify.business.test$error <- as.character(classify.business.test$error)
    #classify.agree <- as.character(as.numeric(business.test$admit) - as.numeric(pred.business$class))
    #classify.business.test$error[!(classify.agree == 0)] <- classify.agree[!(classify.agree == 0)]
    # print table
    classify.business.test
    ## [Output table: test observations 1 to 6 with columns admit (NA for every case),
    ##  class, postA, postB, postN.]

Part VI
R data manipulation

Chapter 18
Data Cleaning

Data cleaning¹, or data preparation, is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. Data cleaning may profoundly influence the statistical statements based on the data. Typical actions like imputation or outlier handling obviously influence the results of a statistical analysis. For this reason, data cleaning should be considered a statistical operation, to be performed in a reproducible manner.

¹ Content in this chapter is derived with permission from Statistics Netherlands at http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
The R statistical environment provides a good environment for reproducible data cleaning since all cleaning actions can be scripted and therefore reproduced.

18.1 The five steps of statistical analysis

Statistical analysis can be viewed as the result of a number of value-increasing data processing steps.

[Figure: statistical analysis value chain. Raw data -> (type checking, normalizing) -> Technically correct data -> (fix and impute) -> Consistent data -> (estimate, analyze, derive, etc.) -> Statistical results -> (tabulate, plot) -> Formatted output. The steps from raw data to consistent data are labeled "Data cleaning".]

Each box represents data in a certain state while each arrow represents the activities needed to get from one state to the other.

1. Raw Data. The data "as is" may lack headers, contain wrong data types (e.g., numbers stored as strings), wrong category labels, unknown or unexpected character encoding and so on. Reading such files into an R data.frame directly is either difficult or impossible without some sort of preprocessing.

2. Technically correct data. The data can be read into an R data.frame, with correct names, types and labels, without further trouble. However, that does not mean that the values are error-free or complete. For example, an age variable may be reported negative, an under-aged person may be registered to possess a driver's license, or data may simply be missing. Such inconsistencies obviously depend on the subject matter that the data pertains to, and they should be ironed out before valid statistical inference from such data can be produced.

3. Consistent data. The data is ready for statistical inference. It is the data that most statistical theories use as a starting point. Ideally, such theories can still be applied without taking previous data cleaning steps into account. In practice however, data cleaning methods like imputation of missing values will influence statistical results and so must be accounted for in the following analyses or interpretation thereof.
4. Statistical results. The results of the analysis have been produced and can be stored for reuse.

5. Formatted output. The results in tables and figures ready to include in statistical reports or publications.

Best practice: Store the input data for each stage (raw, technically correct, consistent, results, and formatted) separately for reuse. Each step between the stages may be performed by a separate R script for reproducibility.

18.2 R background review

18.2.1 Variable types

The most basic variable in R is a vector. An R vector is a sequence of values of the same type. All basic operations in R act on vectors (think of the elementwise arithmetic, for example). The basic types in R are as follows.

    numeric    Numeric data (approximations of the real numbers)
    integer    Integer data (whole numbers)
    factor     Categorical data (simple classifications, like gender)
    ordered    Ordinal data (ordered classifications, like educational level)
    character  Character data (strings)
    raw        Binary data (rarely used)

All basic operations in R work element-wise on vectors where the shortest argument is recycled if necessary. Why does the following code work the way it does?

    # vectors have variables of _one_ type
    c(1, 2, "three")
    ## [1] "1"     "2"     "three"
    # shorter arguments are recycled
    (1:3) * 2
    ## [1] 2 4 6
    (1:4) * c(1, 2)
    ## [1] 1 4 3 8
    # warning! (why?)
    (1:4) * (1:3)
    ## Warning: longer object length is not a multiple of shorter object length
    ## [1] 1 4 9 4
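The behaviour in the code above follows from two rules: mixed values are coerced to the most general type present, and the shorter operand is repeated to the length of the longer one. The following minimal sketch spells both out; the object names x, y, and z are illustrative only.

```r
# Coercion: combining types promotes everything to the most general type,
# in the order logical < integer < numeric < character
class(c(TRUE, 2L))
class(c(2L, 3.5))
class(c(1, 2, "three"))

# Recycling: the shorter vector is repeated to the length of the longer one,
# so multiplying by c(1, 2) is the same as multiplying by c(1, 2, 1, 2)
x <- 1:4
y <- c(1, 2)
identical(x * y, x * rep_len(y, length(x)))

# When the longer length is not a multiple of the shorter one, R still
# recycles (here 1:3 is used as 1, 2, 3, 1) but issues a warning, because
# a partial recycle is usually a sign of a mistake
z <- suppressWarnings((1:4) * (1:3))
z
```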
18.2.2 Special values and value-checking functions

Below are the definitions and some illustrations of the special values NA, NULL, ±Inf, and NaN.

• NA Stands for "not available". NA is a placeholder for a missing value. All basic operations in R handle NA without crashing and mostly return NA as an answer whenever one of the input arguments is NA. If you understand NA, you should be able to predict the result of the following R statements.

    NA + 1
    sum(c(NA, 1, 2))
    median(c(NA, 1, 2, 3), na.rm = TRUE)
    length(c(NA, 2, 3, 4))
    3 == NA
    NA == NA
    TRUE | NA
    # use is.na() to detect NAs
    is.na(c(1, NA, 3))

• NULL Think of NULL as the empty set from mathematics; it has no class (its class is NULL) and has length 0 so it does not take up any space in a vector.

    length(c(1, 2, NULL, 4))
    sum(c(1, 2, NULL, 4))
    x <- NULL
    length(x)
    c(x, 2)
    # use is.null() to detect NULL variables
    is.null(x)

• Inf Stands for "infinity" and only applies to vectors of class numeric (not integer). Technically, Inf is a valid numeric that results from calculations like division of a number by zero. Since Inf is a numeric, operations between Inf and a finite numeric are well-defined and comparison operators work as expected.

    pi/0
    2 * Inf
    Inf - 1e+10
    Inf + Inf
    3 < -Inf
    Inf == Inf
    # use is.infinite() to detect Inf variables
    is.infinite(-Inf)

• NaN Stands for "not a number". This is generally the result of a calculation of which the result is unknown, but it is surely not a number. In particular operations like 0/0, Inf − Inf and Inf/Inf result in NaN. Technically, NaN is of class numeric, which may seem odd since it is used to indicate that something is not numeric. Computations involving numbers and NaN always result in NaN.

    NaN + 1
    exp(NaN)
    # use is.nan() to detect NaN values
    is.nan(0/0)

Note that is.finite() checks a numeric vector for the occurrence of any non-numerical or special values.

    is.finite(c(1, NA, 2, Inf, 3, -Inf, 4, NULL, 5, NaN, 6))
    ## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
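The value-checking functions above can be combined into a quick tally of special values, which is a handy first look at a freshly read numeric column. This is only a sketch; count.special() is a hypothetical helper written for this illustration, not a base R function. One detail it has to handle is that is.na() is TRUE for NaN as well as for NA.

```r
# Tally the special values in a numeric vector (hypothetical helper)
count.special <- function(x) {
  c(n     = length(x),
    n.NA  = sum(is.na(x) & !is.nan(x)),  # is.na() is also TRUE for NaN
    n.NaN = sum(is.nan(x)),
    n.Inf = sum(is.infinite(x)),         # counts both Inf and -Inf
    n.ok  = sum(is.finite(x)))           # ordinary finite numbers
}

# NULL takes up no space, so this vector has 10 elements, not 11
x <- c(1, NA, 2, Inf, 3, -Inf, 4, NULL, 5, NaN, 6)
count.special(x)
```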
-Inf.frame is of the R type that adequately represents the value domain.1 Technically correct data Limiting ourselves to “rectangular” data sets read from a text-based format. The second demand implies that numeric data should be stored as numeric or integer. This is generally the result of a calculation of which the result is unknown. Technically.3.3: From raw to technically correct data 469 is. each column of the data.infinite(-Inf) ˆ NaN Stands for “not a number”. 3. make that software responsible for exporting the data to an open format that can be read by R. Best practice Whenever you need to read data from a foreign file format. we assume that the text-files we are reading contain data of at most one unit per line. Will coerce the columns to the specified types.csv2() for semicolon separated values with comma as decimal separator.table() and similar functions below will read a text file and return a data. str().frame should always be inspected with functions like head(). Reading text read. ˆ read. The read.frame. A freshly read data.2 Ch 18: Data Cleaning Reading text data into an R data. This includes files in fixed-width or csv-like format.table() with some fixed parameters and possibly after some preprocessing.frame In the following. converts all character vectors into factor vectors. Additional optional arguments include: Argument Description header Does the first line contain column names? col. ˆ read.names character vector with column names. stringsAsFactors If TRUE. ˆ read. Best practice. sep Field separator.470 18. na. but excludes XML-like storage formats. their format and separation symbols in lines containing data may differ over the lines. The number of attributes. Specifically ˆ read. 
Except for read.table() and read.fwf(), each of the above functions assumes by default that the first line in the text file contains column headers. The following demonstrates this on the following text file.

    21,6.0
    42,5.9
    18,5.7*
    21,NA

Read the file with defaults, then specifying necessary options.

    fn.data <- "http://statacumen.com/teach/ADA2/ADA2_notes_Ch18_unnamed.txt"
    # first line is erroneously interpreted as column names
    person <- read.csv(fn.data)
    person
    ##   X21 X6.0
    ## 1  42  5.9
    ## 2  18 5.7*
    ## 3  21 <NA>
    # instead, use header = FALSE and specify the column names
    person <- read.csv(file = fn.data
                     , header = FALSE
                     , col.names = c("age", "height"))
    person
    ##   age height
    ## 1  21    6.0
    ## 2  42    5.9
    ## 3  18   5.7*
    ## 4  21   <NA>

If colClasses is not specified by the user, read.table() will try to determine the column types. Although this may seem convenient, it is noticeably slower for larger files (say, larger than a few MiB) and it may yield unexpected results. For example, in the above script, one of the rows contains a malformed numerical variable (5.7*), causing R to interpret the whole column as a text variable. Moreover, by default text variables are converted to factor, so we are now stuck with a height variable expressed as levels in a categorical variable:

    str(person)
    ## 'data.frame': 4 obs. of 2 variables:
    ##  $ age   : int 21 42 18 21
    ##  $ height: Factor w/ 3 levels "5.7*","5.9","6.0": 3 2 1 NA

Using colClasses, we can force R to either interpret the columns in the way we want or throw an error when this is not possible.

    read.csv(fn.data
           , header = FALSE
           , colClasses = c("numeric", "numeric"))
    ## Error: scan() expected 'a real', got '5.7*'
    # no data.frame output because of error
This behaviour is desirable if you need to be strict about how data is offered to your R script. However, unless you are prepared to write tryCatch() constructions, a script containing the above code will stop executing completely when an error is encountered.

As an alternative, columns can be read in as character by setting stringsAsFactors=FALSE. Next, one of the as.-functions can be applied to convert to the desired type, as shown below.

    person <- read.csv(file = fn.data
                     , header = FALSE
                     , col.names = c("age", "height")
                     , stringsAsFactors = FALSE)
    person
    ##   age height
    ## 1  21    6.0
    ## 2  42    5.9
    ## 3  18   5.7*
    ## 4  21   <NA>
    person$height <- as.numeric(person$height)
    ## Warning: NAs introduced by coercion
    person
    ##   age height
    ## 1  21    6.0
    ## 2  42    5.9
    ## 3  18     NA
    ## 4  21     NA

Now, everything is read in and the height column is translated to numeric, with the exception of the row containing 5.7*. Moreover, since we now get a warning instead of an error, a script containing this statement will continue to run, albeit with less data to analyse than it was supposed to. It is of course up to the programmer to check for these extra NA's and handle them appropriately.
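One way to check for those extra NA's is to compare missingness before and after the coercion. The sketch below is self-contained: it passes the four example records through the text= argument of read.csv() instead of downloading the file, and the names height.raw and bad are illustrative.

```r
# In-memory copy of the example records (text = ... avoids the download)
person <- read.csv(text = "21,6.0\n42,5.9\n18,5.7*\n21,NA",
                   header = FALSE,
                   col.names = c("age", "height"),
                   stringsAsFactors = FALSE)

# Keep the raw character values so nothing is lost by the coercion
height.raw <- person$height
person$height <- suppressWarnings(as.numeric(height.raw))

# Values that were present as text but could not be converted to a number
bad <- !is.na(height.raw) & is.na(person$height)

person[bad, ]     # rows that need attention
height.raw[bad]   # the offending raw values
```

Only the malformed entry is flagged; the record that was already missing in the file is not, so the two kinds of NA can be handled differently.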
"http://statacumen.txt <.000 records we could process them all.frame data.1861.18. We want a general strategy so that if we had a file with 10.. Reading in the Daltons file yields the following. 1 2 3 4 5 6 Step Read the data with readLines Select lines containing data Split lines into separate fields Standardize rows Transform to data. readLines() detects both the end-of-line and carriage return characters so lines are detected regardless of whether the file was created under DOS. in the third row the name and birth date have been swapped. Selecting lines containing data.frame Normalize and coerce to correct type result character character list of character vectors list of equivalent vectors data. Name Gratt Bob Emmet Birth 1861 NA 1871 473 Death 1892 1892 1937 The file has comments on several lines (starting with a % sign) and a missing value in the second row.data <..com/teach/ADA2/ADA2_notes_Ch18_dalton. This is generally done by throwing out lines containing comments or otherwise lines that do not contain any data fields. You can use grep() or grepl() to detect such lines. UNIX.1937" ## [5] "% Names. Moreover. The table suggests one strategy.frame Step 1.txt has 5 character elements.1892" ## [3] "Bob.txt) ## chr [1:5] "%% Data on the Dalton Brothers" .Emmet .1892" "1871. or MAC (each OS has traditionally had different ways of marking an end-of-line). The variable dalton.readLines(fn. The readLines() function accepts filename as argument and returns a character vector containing one element for each line in the file. equal to the number of lines in the textfile. fn. Regular .txt ## [1] "%% Data on the Dalton Brothers" "Gratt . birth and death dates" str(dalton.data) dalton.txt" dalton. Step 2. Reading data.3: From raw to technically correct data And this is the table we want. org/wiki/Regular_expression .1892" "1871. split = ". I usually search for an example and modify it to meet my needs. 
Regular expressions², though challenging to learn, can be used to specify what you're searching for. I usually search for an example and modify it to meet my needs.

² http://en.wikipedia.org/wiki/Regular_expression

    # detect lines starting (^) with a percentage sign (%)
    ind.nodata <- grepl("^%", dalton.txt)
    ind.nodata
    ## [1]  TRUE FALSE FALSE FALSE  TRUE
    # and throw them out
    !ind.nodata
    ## [1] FALSE  TRUE  TRUE  TRUE FALSE
    dalton.dat <- dalton.txt[!ind.nodata]
    dalton.dat
    ## [1] "Gratt ,1861,1892" "Bob,1892"         "1871,Emmet ,1937"

Here, the first argument of grepl() is a search pattern, where the caret (^) indicates a start-of-line. The result of grepl() is a logical vector that indicates which elements of dalton.txt contain the pattern 'start-of-line' followed by a percent-sign. The functionality of grep() and grepl() will be discussed in more detail later.

Step 3. Split lines into separate fields. This can be done with strsplit(). This function accepts a character vector and a split argument which tells strsplit() how to split a string into substrings. The result is a list of character vectors.

    # remove whitespace by substituting nothing where spaces appear
    dalton.dat2 <- gsub(" ", "", dalton.dat)
    # split strings by comma
    dalton.fieldList <- strsplit(dalton.dat2, split = ",")
    dalton.fieldList
    ## [[1]]
    ## [1] "Gratt" "1861"  "1892"
    ##
    ## [[2]]
    ## [1] "Bob"  "1892"
    ##
    ## [[3]]
    ## [1] "1871"  "Emmet" "1937"

Here, split= is a single character or sequence of characters that are to be interpreted as field separators. By default, split is interpreted as a regular expression, and the meaning of special characters can be ignored by passing fixed=TRUE as extra parameter.
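The difference between a regular-expression split and a fixed split matters as soon as the separator is a character with a special meaning. The sketch below uses a made-up record with a period as the separator (an assumption for illustration; the Dalton file uses commas, which have no special meaning).

```r
# Hypothetical record that uses "." as the field separator
x <- "1861.1892.1937"

# As a regular expression, "." matches ANY character, so every character
# is treated as a separator and only empty strings remain
parts.regex <- strsplit(x, split = ".")[[1]]

# fixed = TRUE treats the split string literally
parts.fixed <- strsplit(x, split = ".", fixed = TRUE)[[1]]

# Escaping the period in the regular expression gives the same fields
parts.escaped <- strsplit(x, split = "\\.")[[1]]

parts.regex
parts.fixed
parts.escaped
```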
Step 4. Standardize rows. The goal of this step is to make sure that (a) every row has the same number of fields and (b) the fields are in the right order. In read.table(), lines that contain fewer fields than the maximum number of fields detected are appended with NA. One advantage of the do-it-yourself approach shown here is that we do not have to make this assumption. The easiest way to standardize rows is to write a function that takes a single character vector as input and assigns the values in the right order.

The function below accepts a character vector and assigns three values to an output vector of class character. The grepl() statement detects fields containing alphabetical values a-z or A-Z. To assign year of birth and year of death, we use the knowledge that all Dalton brothers were born before and died after 1890. To retrieve the fields for each row in the example, we need to apply this function to every element of dalton.fieldList.

    # function to correct column order for Dalton data
    f.assignFields <- function(x) {
      # create a blank character vector of length 3
      out <- character(3)
      # get name and put into first position
      ind.alpha <- grepl("[[:alpha:]]", x)
      out[1] <- x[ind.alpha]
      # get birth date (if any) and put into second position
      ind.num.birth <- which(as.numeric(x) < 1890)
      # if there are more than 0 years <1890,
      #   then return that value to second position,
      #   else return NA to second position
      out[2] <- ifelse(length(ind.num.birth) > 0, x[ind.num.birth], NA)
      # get death date (if any) and put into third position (same strategy as birth)
      ind.num.death <- which(as.numeric(x) > 1890)
      out[3] <- ifelse(length(ind.num.death) > 0, x[ind.num.death], NA)
      out
    }

The function lapply() will apply the function f.assignFields() to each list element in dalton.fieldList.

    dalton.standardFields <- lapply(dalton.fieldList, f.assignFields)
    ## Warning: NAs introduced by coercion
    ## Warning: NAs introduced by coercion
    ## Warning: NAs introduced by coercion
    ## Warning: NAs introduced by coercion
    ## Warning: NAs introduced by coercion
    ## Warning: NAs introduced by coercion
    dalton.standardFields
    ## [[1]]
    ## [1] "Gratt" "1861"  "1892"
    ##
    ## [[2]]
    ## [1] "Bob"  NA     "1892"
    ##
    ## [[3]]
    ## [1] "Emmet" "1871"  "1937"
The advantage of this approach is having greater flexibility than read.table offers. However, since we are interpreting the value of fields here, it is unavoidable to know about the contents of the dataset which makes it hard to generalize the field assigner function. Furthermore, the f.assignFields() function we wrote is still relatively fragile. That is, it crashes for example when the input vector contains two or more text-fields or when it contains more than one numeric value larger than 1890. Again, no one but the data analyst is probably in a better position to choose how safe and general the field assigner should be.

Step 5. Transform to data.frame. There are several ways to transform a list to a data.frame object. Here, first all elements are copied into a matrix which is then coerced into a data.frame.

    # unlist() returns each value in a list in a single object
    unlist(dalton.standardFields)
    ## [1] "Gratt" "1861"  "1892"  "Bob"   NA      "1892"  "Emmet" "1871"
    ## [9] "1937"
    # there are three list elements in dalton.standardFields
    length(dalton.standardFields)
    ## [1] 3
    # fill a matrix with the character values
    dalton.mat <- matrix(unlist(dalton.standardFields)
                       , nrow = length(dalton.standardFields)
                       , byrow = TRUE)
    dalton.mat
    ##      [,1]    [,2]   [,3]
    ## [1,] "Gratt" "1861" "1892"
    ## [2,] "Bob"   NA     "1892"
    ## [3,] "Emmet" "1871" "1937"
    # name the columns
    colnames(dalton.mat) <- c("name", "birth", "death")
    dalton.mat
    ##      name    birth  death
    ## [1,] "Gratt" "1861" "1892"
    ## [2,] "Bob"   NA     "1892"
    ## [3,] "Emmet" "1871" "1937"
    # convert to a data.frame but don't turn character variables into factors
    dalton.df <- as.data.frame(dalton.mat, stringsAsFactors=FALSE)
    str(dalton.df)
    ## 'data.frame': 3 obs. of 3 variables:
    ##  $ name : chr "Gratt" "Bob" "Emmet"
    ##  $ birth: chr "1861" NA "1871"
    ##  $ death: chr "1892" "1892" "1937"
    dalton.df
    ##    name birth death
    ## 1 Gratt  1861  1892
    ## 2   Bob  <NA>  1892
    ## 3 Emmet  1871  1937

The function unlist() concatenates all vectors in a list into one large character vector. We then use that vector to fill a matrix of class character. However, the matrix function usually fills up a matrix column by column. Here, our data is stored with rows concatenated, so we need to add the argument byrow=TRUE. Finally, we add column names and coerce the matrix to a data.frame. We use stringsAsFactors=FALSE since we have not started interpreting the values yet.
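Since there are several ways to turn a list into a data.frame, here is a sketch of one alternative to the unlist()/matrix() route: binding the list elements together as rows with do.call(rbind, ...). The list is re-entered by hand so the example runs on its own, and the names dalton.mat2 and dalton.df2 are illustrative.

```r
# Standardized fields as produced in Step 4, re-entered by hand
dalton.standardFields <- list(c("Gratt", "1861", "1892"),
                              c("Bob",   NA,     "1892"),
                              c("Emmet", "1871", "1937"))

# rbind() stacks the vectors as rows, so no byrow = TRUE is needed
dalton.mat2 <- do.call(rbind, dalton.standardFields)

# Same matrix as the unlist()/matrix(..., byrow = TRUE) construction
same <- identical(dalton.mat2,
                  matrix(unlist(dalton.standardFields),
                         nrow = length(dalton.standardFields),
                         byrow = TRUE))
same

colnames(dalton.mat2) <- c("name", "birth", "death")
dalton.df2 <- as.data.frame(dalton.mat2, stringsAsFactors = FALSE)
str(dalton.df2)
```

This route relies on every list element having the same length, which is exactly what Step 4 guarantees.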
Step 6. Normalize and coerce to correct types. This step consists of preparing the character columns of our data.frame for coercion and translating numbers into numeric vectors and possibly character vectors to factor variables. String normalization and type conversion are discussed later. In this example we can suffice with the following statements.

    dalton.df$birth <- as.numeric(dalton.df$birth)
    dalton.df$death <- as.numeric(dalton.df$death)
    str(dalton.df)
    ## 'data.frame': 3 obs. of 3 variables:
    ##  $ name : chr "Gratt" "Bob" "Emmet"
    ##  $ birth: num 1861 NA 1871
    ##  $ death: num 1892 1892 1937
    dalton.df
    ##    name birth death
    ## 1 Gratt  1861  1892
    ## 2   Bob    NA  1892
    ## 3 Emmet  1871  1937

18.4 Type conversion

Converting a variable from one type to another is called coercion. The reader is probably familiar with R's basic coercion functions, but as a reference they are listed here.

    as.numeric    as.logical
    as.integer    as.factor
    as.character  as.ordered

Each of these functions takes an R object and tries to convert it to the class specified behind the "as.". By default, values that cannot be converted to the specified type will be converted to a NA value while a warning is issued.

    as.numeric(c("7", "7*", "7.0", "7,0"))
    ## Warning: NAs introduced by coercion
    ## [1]  7 NA  7 NA

In the remainder of this section we introduce R's typing and storage system and explain the difference between R types and classes. After that we discuss date conversion.

18.4.1 Introduction to R's typing system

Everything in R is an object. An object is a container of data endowed with a label describing the data. Objects can be created, destroyed, or overwritten on-the-fly by the user. The function class returns the class label of an R object.

    class(c("abc", "def"))
    ## [1] "character"
    class(1:10)
    ## [1] "integer"
    class(c(pi, exp(1)))
    ## [1] "numeric"
    class(factor(c("abc", "def")))
    ## [1] "factor"
    # all columns in a data.frame
    sapply(dalton.df, class)
    ##        name       birth       death
    ## "character"   "numeric"   "numeric"

For the user of R these class labels are usually enough to handle R objects in R scripts.
Under the hood, the basic R objects are stored as C structures as C is the language in which R itself has been written. The type of C structure that is used to store a basic type can be found with the typeof function. Compare the results below with those in the previous code snippet.

    typeof(c("abc", "def"))
    ## [1] "character"
    typeof(1:10)
    ## [1] "integer"
    typeof(c(pi, exp(1)))
    ## [1] "double"
    typeof(factor(c("abc", "def")))
    ## [1] "integer"

Note that the type of an R object of class numeric is double. The term double refers to double precision, which is a standard way for lower-level computer languages such as C to store approximations of real numbers. Also, the type of an object of class factor is integer. The reason is that R saves memory (and computational time!) by storing factor values as integers, while a translation table between factor and integers is kept in memory. Normally, a user should not have to worry about these subtleties, but there are exceptions (the homework includes an example of the subtleties).

In short, one may regard the class of an object as the object's type from the user's point of view while the type of an object is the way R looks at the object. It is important to realize that R's coercion functions are fundamentally functions that change the underlying type of an object and that class changes are a consequence of the type changes.

18.4.2 Recoding factors

In R, the value of categorical variables is stored in factor variables. A factor is an integer vector endowed with a table specifying what integer value corresponds to what level. The values in this translation table can be requested with the levels function.

    f <- factor(c("a", "b", "a", "a", "c"))
    f
    ## [1] a b a a c
    ## Levels: a b c
    levels(f)
    ## [1] "a" "b" "c"
    as.numeric(f)
    ## [1] 1 2 1 1 3
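Because a factor is stored as integer codes, as.numeric() applied to a factor returns the codes, not the numbers that the labels appear to spell. This is one of the subtleties worth seeing once. The sketch below uses a made-up factor whose labels look numeric; the object names are illustrative.

```r
# A factor whose labels happen to look like numbers
f <- factor(c("10", "20", "10", "5"))

# The levels are sorted as character strings, not as numbers
levels(f)

# as.numeric() returns the integer codes from the translation table
codes <- as.numeric(f)
codes

# To recover the values, go through the labels first
values <- as.numeric(as.character(f))
values

# Equivalent: convert the (few) levels once, then index by the codes
values2 <- as.numeric(levels(f))[f]
identical(values, values2)
```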
You may need to create a translation table by hand. For example, suppose we read in a vector where 1 stands for male, 2 stands for female and 0 stands for unknown. Conversion to a factor variable can be done as in the example below.

    # example:
    gender <- c(2, 1, 1, 2, 0, 1, 1)
    gender
    ## [1] 2 1 1 2 0 1 1
    # recoding table, stored in a simple vector
    recode <- c(male = 1, female = 2)
    recode
    ##   male female
    ##      1      2
    gender <- factor(gender, levels = recode, labels = names(recode))
    gender
    ## [1] female male   male   female <NA>   male   male
    ## Levels: male female

Note that we do not explicitly need to set NA as a label. Every integer value that is encountered in the first argument, but not in the levels argument will be regarded missing.

Levels in a factor variable have no natural ordering. However in multivariate (regression) analyses it can be beneficial to fix one of the levels as the reference level. R's standard multivariate routines (lm, glm) use the first level as reference level. The relevel function allows you to determine which level comes first.

    gender <- relevel(gender, ref = "female")
    gender
    ## [1] female male   male   female <NA>   male   male
    ## Levels: female male

Levels can also be reordered, depending on the mean value of another variable, for example:

    age <- c(27, 52, 65, 34, 89, 45, 68)
    gender <- reorder(gender, age)
    gender
    ## [1] female male   male   female <NA>   male   male
    ## attr(,"scores")
    ## female   male
    ##   30.5   57.5
    ## Levels: female male

Here, the means are added as a named vector attribute to gender. It can be removed by setting that attribute to NULL.

    attr(gender, "scores") <- NULL
    gender
    ## [1] female male   male   female <NA>   male   male
    ## Levels: female male
but not in the levels argument will be regarded missing. age) gender ## ## ## ## ## [1] female male male attr(. The Date object can only be used to store dates.c("15/02/2013" . For example. Such a storage format facilitates the calculation of durations by subtraction of two POSIXct objects. As an example. current_time <. Converting from text to POSIXct is further complicated by the many textual conventions of time/date denotation. POSIXlt. For example. Under the hood. 1970 00:00. R shows it in a human-readable calender format. library(lubridate) dates <.time() returns the system time provided by the operating system in POSIXct format.3 Converting dates The base R installation has three types of objects to store a time instance: Date. since there are many idiosyncrasies to handle in calender systems. The lubridate package contains a number of functions facilitating the conversion of text to POSIXct dates. daylight saving times. Moreover. time zones and so on. When a POSIXct object is printed.time() uses the time zone that is stored in the locale settings of the machine running R.time() class(current_time) ## [1] "POSIXct" "POSIXt" current_time ## [1] "2014-04-22 20:33:19 MDT" Here. where the language is again defined in the operating system’s locale settings.18. both 28 September 1976 and 1976/09/28 indicate the same day of the same year. and POSIXct. Sys. Here. Converting from a calender time to POSIXct and back is not entirely trivial.4. consider the following code. the name of the month (or weekday) is language-dependent. we focus on converting text to POSIXct objects since this is the most portable way to store such information.Sys. a POSIXct object stores the number of seconds that have passed since January 1. "15 Feb 13" . the command Sys. the other two store date and/or time. "It happened on 15 02 '13") dmy(dates) . leap seconds.4: Type conversion 481 ## Levels: female male 18. These include leap days. the following notation does not convert. Explicitly. 
all of the following functions exist. Note that the code above will only work properly in locale settings where the name of the second month is abbreviated to Feb.482 Ch 18: Data Cleaning ## [1] "2013-02-15 UTC" "2013-02-15 UTC" "2013-02-15 UTC" Here. Currently all are now 2000-2099. leaving out the indication of century. This holds for English or Dutch locales. dmy() dym() mdy() myd() ydm() ymd() So once it is known in what order days. 2013") ## Warning: ## [1] NA All formats failed to parse. The complete list can be found by . and y. this behaviour is according to the 2008 POSIX standard. months and years are denoted. the function dmy assumes that dates are denoted in the order daymonth-year and tries to extract valid dates. dmy("01 01 68") ## [1] "2068-01-01 UTC" dmy("01 01 69") ## [1] "2069-01-01 UTC" dmy("01 01 90") ## [1] "2090-01-01 UTC" dmy("01 01 00") ## [1] "2000-01-01 UTC" It should be noted that lubridate (as well as R’s base functionality) is only capable of converting certain standard notations. m. For example. extraction is very easy. 00-69 was interpreted as 2000-2069 and 70-99 as 1970-1999. but one should expect that this interpretation changes over time. There are similar functions for all permutations of d. No formats found. The standard notations that can be recognized by R. dmy("15 Febr. Recently in R. but fails for example in a French locale (Fevrier). either using lubridate or R’s built-in functionality are shown below. Note It is not uncommon to indicate years with two numbers. 28 %y Year without century (00-99) 13 %Y Year including century. Code Description Example %a Abbreviated weekday name in the current locale. so 01. . format = "I was born on %B %d. the %d indicator makes R look for numbers 1-31 where precursor zeros are allowed. to convert dates from POSIXct back to character. Strings that are not in the exact format specified by the format argument (like the third string in the above example) will not be converted by as.18. Finally. 
.POSIXct(dates. 1976" . These are the day. . 2013 Here. This can be done with the as. you may want to use R’s core functionality to convert from text to POSIXct. September %m Month number (01-12) 09 %d Day of the month as decimal number (01-31).dmy("28 Sep 1976") format(mybirth.4: Type conversion 483 typing ?strptime in the R console. "16-07-2008". such as the leap day in the fourth date above are also not converted. and year formats recognized by R. It takes as arguments a character vector with time/date strings and a string describing the format. 02. such a %-code tells R to look for a range of substrings. 31 are recognized as well. the names of (abbreviated) week or month names that are sought for in the text depend on the locale settings of the machine that is running R. If you know the textual format that is used to describe a date in the input. dates <. Monday %b Abbreviated month name in the current locale. . Impossible dates. Sep %B Full month name in the current locale. "29-02-2011") as. date and time fields are indicated by a letter preceded by a percent sign (%). It accepts a POSIXct date/time object and an output format string. %Y") ## [1] "I was born on September 28.POSIXct function. one may use the format function that comes with base R. month.c("15-9-2009". For example.POSIXct. "17 12-2007". Mon %A Full weekday name in the current locale. mybirth <. Basically. format = "%d-%m-%Y") ## [1] "2009-09-15 MDT" "2008-07-16 MDT" NA ## [4] NA In the format string. In statistical contexts. Below we discuss two complementary approaches to string coding: string normalization and approximate text matching. extra white spaces at the beginning or end of a string can be removed using str_trim(). the following topics are discussed. By default. ˆ Approximate matching procedures based on string distances. . 
gender M male Female fem.1 String normalization String normalization techniques are aimed at transforming a variety of strings to a smaller set of string values which are more easily processed. In particular. ˆ Search for strings containing simple patterns (substrings). obviously four. If this would be treated as a factor variable without any preprocessing. For example. consider the following excerpt of a data set with a gender variable. character data can be difficult to process.5. classifying such “messy” text strings into a number of fixed categories is often referred to as coding. R comes with extensive string manipulation functionality that is based on the two basic string operations: finding a pattern in a string and replacing one pattern with another. 18. We will deal with R’s generic functions below but start by pointing out some common string cleaning operations. ˆ Pad strings to a certain width. ˆ Transform to upper/lower case.5 Ch 18: Data Cleaning Character-type manipulation Because of the many ways people can write the same things down.484 18. ˆ Remove prepending or trailing white spaces. The job at hand is therefore to automatically recognize from the above data whether each element pertains to male or female. The stringr package offers a number of functions that make some some string manipulation tasks a lot easier than they would be with R’s base functions. For example. not two classes would be stored. pad = 0) ## [1] "000112" Both str_trim() and str_pad() accept a side argument to indicate whether trimming or padding should occur at the beginning (left).)). side = "right") ## [1] " hello world" Conversely. indicating which element of the input character vector contains the pattern.) as which(grepl(.2 Approximate string matching There are two forms of string matching. The first consists of determining whether a (range of) substring(s) occurs within another string. The most used are probably grep() and grepl(). width = 6. 
toupper("Hello world") ## [1] "HELLO WORLD" tolower("Hello World") ## [1] "hello world" 18. Below we will give a short introduction to pattern matching and string distances with R. end (right).. Converting strings to complete upper or lower case can be done with R’s built-in toupper() and tolower() functions. You may think of grep(... For example. side = "left". strings can be padded with spaces or other characters with str_pad() to a certain width. side = "left") ## [1] "hello world " str_trim(" hello world ". Both functions take a pattern and a character vector as input. In the second form one defines a distance metric between strings that measures how “different” two strings are. In this case one needs to specify a range of substrings (called a pattern) to search for in another string. while grep() returns a numerical index. The output only differs in that grepl() returns a logical index.5: Character-type manipulation 485 library(stringr) str_trim(" hello world ") ## [1] "hello world" str_trim(" hello world ". or both sides of the string. numerical codes are often represented with prepending zeros..18.5. There are several pattern matching functions that come with base R. str_pad(112. . "fem. gender. That is.") grepl("m". For example. The search patterns that grep(). ignore. you can use the option fixed=TRUE. Either by case normalization or by the optional argument ignore. the grepl() function now finds only the first two elements of gender. fixed = TRUE) ## [1] FALSE FALSE FALSE FALSE This will make grepl() or grep() ignore any meta-characters in the search string (and thereby search for the “^” character). \ | ( ) [ { ^ $ * + ? If you need to search a string for any of these characters. looking for the occurrence of m or M in the gender vector does not allow us to determine which strings pertain to male and which not. If 3 http://en. Fortunately. Regular expressions3 offer powerful and flexible ways to search (and alter) text. 
grepl() (and sub() and gsub()) understand have more of these meta-characters. the search patterns that grep() accepts allow for such searches.case = TRUE) ## [1] TRUE TRUE FALSE FALSE Indeed. grepl("^m". Search patterns using meta-characters are called regular expressions. gender <. the pattern to look for is a simple substring. The caret is an example of a so-called meta-character. grepl("m". namely the beginning of a string.case. we get the following. gender. "male ". it does not indicate the caret itself but something else. gender) ## [1] 2 3 4 Note that the result is case sensitive: the capital M in the first element of gender does not match the lower case m.c("M". gender) ## [1] FALSE TRUE TRUE TRUE grep("m".486 Ch 18: Data Cleaning In the most simple case.org/wiki/Regular_expression . Preferably we would like to search for strings that start with an m or M. "Female". grepl("^". A concise description of regular expressions allowed by R’s built-in string processing functions can be found by typing ?regex at the R command line.case = TRUE) ## [1] TRUE TRUE TRUE TRUE grepl("m". There are several ways to circumvent this case sensitivity. gender. namely: .wikipedia. ignore. The beginning of a string is indicated with a caret (^). tolower(gender)) ## [1] TRUE TRUE TRUE TRUE Obviously. from the previous example. we need to find the index of the smallest distance for each row of dist.c) <. For example: codes <. "bac") ## [.c("male".gender dist.c) <. it is an investment that could pay off several times.g. learning to work with regular expressions is a worthwhile investment. For example adist("abc". Using adist(). An important distance measure is implemented by the R’s native adist() function.c.g.c. Note that .g.c.adist(gender.] 2 The result equals two since turning ”abc” into ”bac” involves two character substitutions: abc → bbc → bac.18. ind. This can be done as follows. These operations include insertion. 
A string distance is an algorithm or equation that indicates how much two strings differ from each other.1] ## [1. 4 3 Here. to find out which code matches best with our raw data.g. female We use apply() to apply which. or substitution of a single character.5: Character-type manipulation 487 you frequently have to deal with “messy” text variables. 1.codes rownames(dist. We now turn our attention to the second method of approximate matching. deletion.c ## ## ## ## ## male female M 4 6 male 1 3 Female 2 1 fem. Moreover. which.apply(dist. "female") # calculate pairwise distances between the gender strings and codes strings dist. This function counts how many basic operations are needed to turn one string into another. namely string distances. since many popular programming languages support some dialect of regexps.frame(rawtext = gender.min <. coded = codes[ind.g.g.min]) ## ## ## ## ## 1 2 3 4 rawtext coded M male male male Female female fem. Now. For readability we added row and column names accordingly.g.c <.min() to every row of dist.min) data. adist() returns the distance matrix between our vector of fixed codes and the input data. codes) # add column and row names colnames(dist. we can compare fuzzy text strings to a list of known codes. Secondly. Recall the earlier gender and code example. At the end of this subsection we show how this code can be simplified with the stringdist package.frame data.. codes. female Informally. . Most importantly. maxDist = 4) ind ## [1] 1 1 2 2 # store results in a data. code = codes[ind]) ## ## ## ## ## 1 2 3 4 4 rawtext code M male male male Female female fem. Using the optimal string alignment distance (the default choice for stringdist()) we get library(stringdist) stringdist("abc". the Levenshtein distance between two words is the minimum number of single-character edits (i. the first match will be returned. we mention three more functions based on string distances. 
which mimics the behaviour of R’s match() function: it returns an index to the closest match within a maximum distance. or substitutions) required to change one word into the other: https: //en. Thirdly. Finally. some of which are likely to provide results that are better than adist()’s. which makes it very flexible. which is a common typographical error. The agrep() function allows for searching for regular expression patterns. but it allows one to specify a maximum Levenshtein distance4 between the input pattern and the found substring.frame(rawtext = gender.amatch(gender. # this yields the closest match of 'gender' in 'codes' (within a distance of 4) ind <. insertions. the stringdist package offers a function called stringdist() which can compute a variety of string distance metrics.wikipedia.488 Ch 18: Data Cleaning in the case of multiple minima. the stringdist package provides a function called amatch().e. "bac") ## [1] 1 The answer is now 1 (not 2 as with adist()). the distance function used by adist() does not allow for character transpositions. since the optimal string alignment distance allows for transpositions of adjacent characters: abc → bac. deletions. the R built-in function agrep() is similar to grep().org/wiki/Levenshtein_distance. First.
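The two approaches of this section, string normalization and distance-based matching, can be combined into one small helper. The following is only a minimal sketch using base R: the function name code_strings and its max_dist argument are hypothetical names introduced here for illustration, not part of any package, and the sketch assumes that the input contains no missing values.

```r
# Hypothetical helper (not from any package): normalize raw strings,
# then assign each one to the closest code by edit distance.
code_strings <- function(x, codes, max_dist = Inf) {
  # normalization: strip surrounding white space, convert to lower case
  x_norm <- tolower(trimws(x))
  # pairwise edit distances: rows are inputs, columns are codes
  d <- adist(x_norm, codes)
  # index of the closest code per row (first match in case of ties)
  best <- apply(d, 1, which.min)
  # distance to that closest code
  best_dist <- d[cbind(seq_along(x), best)]
  # no code is assigned when even the closest one is too far away
  best[best_dist > max_dist] <- NA
  codes[best]
}

gender <- c("M", "male ", "Female", "fem.")
codes  <- c("male", "female")
data.frame(rawtext = gender, coded = code_strings(gender, codes))
```

Normalizing before computing distances means that differences in case and surrounding white space no longer count as edits, so the distances reflect only genuine spelling differences. Setting max_dist plays the same role as the maxDist argument of amatch(): strings that are far from every code are left uncoded (NA) rather than forced into a category.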