Stata Step by Step







Table of Contents

SETTING UP STATA
SETTING UP A PANEL
HOW TO GENERATE VARIABLES
GENERATING VARIABLES
GENERATING DATES
HOW TO GENERATE DUMMIES
GENERATING GENERAL DUMMIES
GENERATING TIME DUMMIES
TIME-SERIES ANALYSES
1. ASSUMPTIONS OF THE OLS ESTIMATOR
2. CHECK THE INTERNAL AND EXTERNAL VALIDITY
A. THREATS TO INTERNAL VALIDITY
B. THREATS TO EXTERNAL VALIDITY
3. THE LINEAR REGRESSION MODEL
4. LINEAR REGRESSION WITH MULTIPLE REGRESSORS
ASSUMPTIONS OF THE OLS ESTIMATOR
5. NONLINEAR REGRESSION FUNCTIONS
A. EXAMPLES OF NONLINEAR REGRESSIONS
1) POLYNOMIAL REGRESSION MODEL OF A SINGLE INDEPENDENT VARIABLE
2) LOGARITHMS
B. INTERACTIONS BETWEEN TWO BINARY VARIABLES
C. INTERACTIONS BETWEEN A CONTINUOUS AND A BINARY VARIABLE
D. INTERACTIONS BETWEEN TWO CONTINUOUS VARIABLES
RUNNING TIME-SERIES ANALYSES IN STATA
A TIME SERIES REGRESSION
REGRESSION DIAGNOSTICS: NON-NORMALITY
REGRESSION DIAGNOSTICS: NON-LINEARITY
REGRESSION DIAGNOSTICS: HETEROSCEDASTICITY
REGRESSION DIAGNOSTICS: OUTLIERS
REGRESSION DIAGNOSTICS: MULTICOLLINEARITY
REGRESSION DIAGNOSTICS: NON-INDEPENDENCE
TIME-SERIES CROSS-SECTION ANALYSES (TSCS) OR PANEL DATA MODELS
A. THE FIXED EFFECTS REGRESSION MODEL
B. REGRESSION WITH TIME FIXED EFFECTS
RUNNING POOLED OLS REGRESSIONS IN STATA
THE FIXED AND RANDOM EFFECTS MODELS
CHOICE OF ESTIMATOR
TESTING PANEL MODELS
ROBUST STANDARD ERRORS
RUNNING PANEL REGRESSIONS IN STATA
1. PROBIT AND LOGIT REGRESSIONS
PROBIT REGRESSION
LOGIT REGRESSION
LINEAR PROBABILITY MODEL
EXAMPLES
HEALTH CARE
APPENDIX
1. IT IS ABSOLUTELY FUNDAMENTAL THAT THE ERROR TERM IS NOT CORRELATED WITH THE INDEPENDENT VARIABLES
CHOOSING BETWEEN FIXED EFFECTS AND RANDOM EFFECTS? THE HAUSMAN TEST
IF YOU QUALIFY FOR A FIXED EFFECTS MODEL, SHOULD YOU INCLUDE TIME EFFECTS?
FIXED EFFECTS OR RANDOM EFFECTS WHEN TIME DUMMIES ARE INVOLVED: A TEST
DYNAMIC PANELS AND GMM ESTIMATIONS
HOW DOES IT WORK?
TESTS
IN NEED OF A CAUSALITY TEST?
MAXIMUM LIKELIHOOD ESTIMATION

Setting up Stata

We are going to allocate 10 megabytes to the dataset. You do not want to allocate too much memory to the dataset, because the more memory you allocate to the dataset, the less memory is available to perform the commands. You could slow down Stata or even make it crash.

set mem 10m

We can also decide whether or not to have the "more" separation line on the screen when the software displays results:

set more on
set more off

You should describe and summarize the dataset, as usual, before you perform estimations. Stata has specific commands for describing and summarizing panel datasets.

xtdes
xtsum

xtdes lets you observe the pattern of the data, such as the number of individuals with different patterns of observations across time periods. In our case, we have an unbalanced panel, because not all individuals have observations for all years. The xtsum command gives you general descriptive statistics of the variables in the dataset, considering the overall, the between, and the within variation. Overall refers to the whole dataset. Between refers to the variation of the means for each individual (across time periods). Within refers to the variation of the deviations from the respective mean for each individual.
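Putting these setup steps together, a minimal do-file sketch might look as follows. The file name mydata.dta is a placeholder for your own panel file, and on recent versions of Stata the set mem step is unnecessary; the panel declaration with tsset is explained in the next section.

* A minimal setup sketch; "mydata.dta" is a placeholder for your own file.
set mem 10m
use mydata, clear
* Declare the panel structure (see the next section), then describe it.
tsset idcode year
xtdes
xtsum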
You may be interested in applying the panel data tabulate command to a variable, for instance to the variable south, in order to obtain a one-way table.

xttab south

As in the previous commands, Stata will report the tabulation for the overall, the within, and the between variation.

Setting up a panel

Now we have to instruct Stata that we have a panel dataset. We do it with the command tsset, or with iis and tis:

iis idcode
tis year

or

tsset idcode year

In the previous command, idcode is the variable that identifies individuals in our dataset, and year is the variable that identifies time periods. This is always the rule. The commands referring to panel data in Stata almost always start with the prefix xt. You can check for these commands by calling the help file for xt.

help xt

How to generate variables

Generating variables

gen age2=age^2
gen ttl_exp2=ttl_exp^2
gen tenure2=tenure^2

Now, let's compute the average wage for each individual (across time periods).

bysort idcode: egen meanw=mean(ln_wage)

In this case, we did not apply the sort command first and the by prefix afterwards. We could have done it that way, but with this single command you can abbreviate the implementation of the by prefix. The command egen is an extension of the gen command for generating new variables. The general rule is to use egen when the new variable is created using a function inside Stata; in our case, we used the function mean. You can apply the command list to list the first 10 observations of the new variable meanw.

list meanw in 1/10

And then apply the xtsum command to summarize the new variable.

xtsum meanw

You may want to obtain the average of the logarithm of wages for each year in the panel.

bysort year: egen meanw1=mean(ln_wage)

And then you can apply the xttab command.

xttab meanw1

Suppose you want to generate a new variable called tenure1 that is equal to the variable tenure lagged one period. Then you would use a time-series operator (l.). First, you would need to sort the dataset according to idcode and year, and then generate the new variable with the "by" prefix on the variable idcode.

sort idcode year
by idcode: gen tenure1=l.tenure

If you were interested in generating a new variable tenure3 equal to the first difference of the variable tenure, you would use the time-series d. operator.

by idcode: gen tenure3=d.tenure

If you would like to generate a new variable tenure4 equal to two lags of the variable tenure, you would type:

by idcode: gen tenure4=l2.tenure

The same principle applies to the operator d.

Let's just save our data file with the changes that we made to it.

save, replace

Generating dates

Let's generate dates:

gen varname2 = date(varname1, "dmy")

And format:

format varname2 %d

How to generate dummies

Generating general dummies

Let's generate the dummy variable black, which is not in our dataset.

gen black=1 if race==2
replace black=0 if black==.
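A note of caution on the two-step gen/replace pattern above: it codes observations with a missing race as 0. If that is not what you want, a one-line alternative to the two commands above (a sketch, not part of the original example) keeps missing values missing:

* generate the dummy in one step, preserving missingness of race
gen black = (race==2) if !missing(race)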
Generating time dummies

In order to do this, let's first generate our time dummies. We use the "tabulate" command with the option "gen" to generate a time dummy for each year of our dataset.

tab year, gen(y)

We will name the time dummies "y", and we will get a first time dummy called "y1" which takes the value 1 if year=1980 and 0 otherwise, a second time dummy "y2" which takes the value 1 if year=1982 and 0 otherwise, and similarly for the remaining years. You could give any other name to your time dummies.

Another way would be to use the xi command. It takes the items (strings of letters, for instance) of a designated variable (category, for instance) and creates a dummy variable for each item. You need to change the base anyway:

char _dta[omit] "prevalent"
xi: i.category
tabulate category

Time-series analyses

1. Assumptions of the OLS estimator

The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X:

$$ \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \qquad (1) $$

with $b_0$ and $b_1$ being estimators of $\beta_0$ and $\beta_1$.

1. The conditional distribution of $u_i$ given $X_i$ has a mean of zero, and the correlation between $X_i$ and $u_i$ should be nil: $corr(X_i, u_i) = 0$. This means that the other factors captured in the error term are unrelated to $X_i$. This is the most important assumption in practice. If this assumption does not hold, then it is likely because there is an omitted variable bias. One should test for omitted variables using (Ramsey and Braithwaite, 1931)'s test.

2. $(X_i, Y_i)$, $i = 1, \ldots, n$, are independently and identically distributed. This second assumption holds in many cross-sectional data sets, but it is inappropriate for time series data. This is to be sure that there is no selection bias in the sample.

3. The error term $u_i$ is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$ is constant for $i = 1, \ldots, n$ and in particular does not depend on $X_i$. Related to the first assumption: if the variance of this conditional distribution of $u_i$ does not depend on $X_i$, then the errors are said to be homoskedastic; otherwise, the error term is heteroskedastic. To test for heteroskedasticity, we use (Breusch and Pagan, 1979)'s test. Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent, and asymptotically normal. If the errors are heteroskedastic, one should use heteroskedasticity-robust standard errors.

4. The fourth assumption is that the fourth moments of $X_i$ and $u_i$ are nonzero and finite:

$$ 0 < E(X_i^4) < \infty \quad \text{and} \quad 0 < E(u_i^4) < \infty $$
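In Stata, both checks named above are available as post-estimation commands after regress: ovtest runs the Ramsey RESET-type test for omitted variables, and hettest runs the Breusch-Pagan/Cook-Weisberg test for heteroskedasticity; both appear in the worked examples later in this guide. A minimal sketch, where y, x1, and x2 are placeholder variable names:

* fit the regression, then test the specification and the error variance
regress y x1 x2
ovtest
hettest
* if heteroskedasticity is a concern, re-estimate with robust standard errors
regress y x1 x2, robust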
2. Check the internal and external validity

A statistical analysis is internally valid if the statistical inferences about causal effects are valid for the population being studied. The analysis is externally valid if its inferences and conclusions can be generalized from the population and setting studied to other populations and settings. Internal and external validity thus distinguish between the population and setting studied and the population and setting to which the results are generalized.

A. Threats to internal validity

Internal validity has two components:

1. The estimator of the causal effect should be unbiased and consistent.
2. Hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level.

Causal effects are estimated using the estimated regression function, and hypothesis tests are performed using the estimated regression coefficients and their standard errors. Studies based on regression analysis are therefore internally valid if the estimated regression coefficients are unbiased and consistent, and if their standard errors yield confidence intervals with the desired confidence level. Reasons why the OLS estimator of the multiple regression coefficients might be biased are sevenfold: omitted variables, misspecification of the functional form of the regression function, imprecise measurement of the independent variables, sample selection, simultaneous causality, heteroskedasticity, and correlation of the error term across observations (sample not i.i.d.). All seven sources of bias arise because the regressor is correlated with the error term, violating the first least squares assumption.

B. Threats to external validity

External validity must be judged using specific knowledge of the populations and settings studied and those of interest. Important differences between the two will cast doubt on the external validity of the study. Sometimes there are two or more studies on different but related populations; if so, the external validity of both studies can be checked by comparing their results.

3. The linear regression model

$$ Y_i = \beta_0 + \beta_1 X_i + u_i \qquad (2) $$

This can be a time series analysis or not, for instance test scores and class sizes in 1998 in 420 California school districts.

4. Linear regression with multiple regressors

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \qquad (3) $$

Assumptions of the OLS estimator

1. The conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \ldots, X_{ki}$ has a mean of zero, and the correlation between $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ should be nil. This means that the other factors captured in the error term are unrelated to the regressors. This is the most important assumption in practice; if it does not hold, then it is likely because there is an omitted variable bias. One should test for omitted variables using (Ramsey and Braithwaite, 1931)'s test.

2. $(X_{1i}, X_{2i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$, are independently and identically distributed. This second assumption holds in many cross-sectional data sets, but it is inappropriate for time series data. This is to be sure that there is no selection bias in the sample.

3. The error term $u_i$ is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \ldots, X_{ki}$ is constant for $i = 1, \ldots, n$ and in particular does not depend on the regressors; otherwise, the error term is heteroskedastic. To test for heteroskedasticity, we use (Breusch and Pagan, 1979)'s test. Whether the errors are homoskedastic or heteroskedastic, the OLS estimator is unbiased, consistent, and asymptotically normal; if the errors are heteroskedastic, one should use heteroskedasticity-robust standard errors.

4. The fourth moments of $X_{1i}, X_{2i}, \ldots, X_{ki}$ and $u_i$ are nonzero and finite.

5. No perfect multicollinearity. The regressors are said to be perfectly multicollinear if one of the regressors is a perfect linear function of one of the other regressors. In case of perfect multicollinearity, it is impossible to compute the OLS estimator.

5. Nonlinear regression functions

$$ Y_i = f(X_{1i}, X_{2i}, \ldots, X_{ki}) + u_i $$

A. Examples of nonlinear regressions

1) Polynomial regression model of a single independent variable

$$ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \ldots + \beta_r X_i^r + u_i \qquad (5) $$

2) Logarithms

Logarithms convert changes in variables into percentage changes.

1. Lin-log model

$$ Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i \qquad (6) $$

A 1% change in X is associated with a change in Y of $0.01\,\beta_1$.

2. Log-lin model

$$ \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i \qquad (7) $$

A change in X by 1 unit is associated with a $100\,\beta_1$% change in Y.

3. Log-log model

$$ \ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i \qquad (8) $$

A 1% change in X is associated with a $\beta_1$% change in Y, so $\beta_1$ is the elasticity of Y with respect to X. In the economic analysis of consumer demand, it is often assumed that a 1% increase in price leads to a certain percentage decrease in the quantity demanded. This percentage change in demand is called the price elasticity. The regressor coefficients will then measure the elasticity in a log-log model.
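To estimate a log-log specification such as equation (8) in Stata, generate the logged variables first and then regress one on the other; the slope coefficient is the estimated elasticity. A minimal sketch, where price and quantity are hypothetical variable names:

* create logged variables, then estimate the price elasticity as the slope
gen ln_q = ln(quantity)
gen ln_p = ln(price)
regress ln_q ln_p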
B. Interactions between two binary variables

$$ Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 (D_{1i} \times D_{2i}) + u_i \qquad (9) $$

C. Interactions between a continuous and a binary variable

$$ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i \qquad (10) $$

D. Interactions between two continuous variables

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i} \times X_{2i}) + u_i \qquad (11) $$

Running time-series analyses in Stata

A time series regression

use http://www.ats.ucla.edu/stat/stata/modules/reg/ok, clear
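The walkthrough that follows uses a small teaching dataset from the UCLA ATS Stata modules. Before the walkthrough begins, it can help to glance at the variables it contains; a minimal sketch, not part of the original example:

* quick look at the data just loaded
describe
summarize y x1 x2 x3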
.0950425 2. pnorm rstud Below we show a boxplot of the residuals.020196 We can use the kdensity command (with the normal option) to show the distribution of the residuals (in yellow). and a normal overlay (in red). calling that to our attention. A perfect normal distribution would be an exact diagonal line (as shown in red).885066 5% -1.0016893 Largest Std. The largest residual (-2.0789393 Mean . Dev.217658 -2.694324 -1.6562946 2. 1. The actual data is plotted in yellow and is fairly close to the diagonal.043349 Obs 100 25% -. . box © Thierry Warin 19 © Thierry Warin 20 . kdensity rstud.253903 2. graph rstud.411592 Kurtosis 3.Studentized residuals ------------------------------------------------------------Percentiles Smallest 1% -2.409712 2.173733 10% -1. . this is not seriously non-normal.015969 .013153 1.763776 Sum of Wgt.5294 -2.52447 -2. The results look pretty close to normal. 0358513 99% 26.389845 5 17 10 17 Variance 70. detail y ------------------------------------------------------------Percentiles Smallest 1% 44.ucla.5 -17 10% -11 -16 Obs 100 25% -5 -16 Sum of Wgt. Dev.374975 x2 ------------------------------------------------------------Percentiles Smallest 1% -18 -18 5% -13 -18 10% -11. The skewness for y suggests that y might be skewed.5 25 5% 173 64 10% 306.5 -22 Obs 100 25% -7 -20 Sum of Wgt.12 Largest Std. Dev.84 Largest Std.085636 We use the kdensity command below to show the distribution of y (in yellow) and a normal overlay (in red).38949 15. .ats. summarize y x1 x2 x3.8479 1600 3136 2162. We can see that y is positively skewed. Dev. 8.5 19 Skewness .5 Mean .5 4096 Skewness 1.Regression Diagnostics: Non normality use http://www.5 -16 Obs 100 25% -4.5 121 Obs 100 25% 576 121 Sum of Wgt.0378385 19.7 2864.5 -15 Sum of Wgt. 845.e.5 19.5 -38 5% -18.edu/stat/stata/modules/reg/nonnorm. 12.965568 18 19 Variance 80. 100 50% 75% 90% 95% 99% 961 Mean 1151.5 20 Kurtosis 2.17527 © Thierry Warin 21 © Thierry Warin 22 . kdensity y.5 22 Skewness . Dev. 100 50% Mean -. 100 50% 75% 90% 95% 99% Mean . .5 31 Kurtosis 3. The skewness for x1 x2 and x3 all look fine.9168 95% 19.5 3364 Variance 715458.2514963 20 Kurtosis 3.711806 0 x3 ------------------------------------------------------------Percentiles Smallest 1% -30. it has a long tail to the right. normal -1 x1 ------------------------------------------------------------Percentiles Smallest 1% -22.5 16. i.12092 75% 8 22 90% 18 22 Variance 146. 8.5 -23 10% -15.68 Largest Std.38141 19 Skewness -..5 7 10. clear Let's start by using the summarize command to look at the distribution of x1 x2 x3 and y looking for evidence of non-normality.18 Largest Std. 100 50% 75% 90% 95% 99% 1.45566 4226 4356 Kurtosis 5.5 -28 5% -15. after y has been adjusted for x2 and x3.59041 x3 | 25. indicating that there could be a problem of non-normally distributed residuals.000 1006.54624 5. and the plot in the bottom left shows x3 on the bottom axis.448 0. .24294 x2 | 29. We see that the points are more densely packed together at the bottom part of the graph (for the negative residuals). We can see that the observed values of y depart from this line.416 66. the plot in the top left shows the relationship between x1 and y.4 99 715458. another possible indicator that the residuals are not normally distributed.794 1272.038 -----------------------------------------------------------------------------The rvfplot command gives us a graph of the residual value by the fitted (predicted) value. 
let's run a regression predicting y from x1 x2 and x3 and look at some of the regression diagnostics.57 Prob > F = 0.3808 Total | 70830413.722 Root MSE = 665. and the dependent variable after adjusting for all other predictors.054 0. Err. Interval] Below we use the avplots command to produce added variable plots.019 3.7 3 9432997. t P>|t| [95% Conf.---------+-------------------------------------------------------------------x1 | 19. pnorm y Even though y is skewed.54852 46. .0000 Residual | 42531420.001 12. For example. avplots © Thierry Warin 23 © Thierry Warin 24 .81458 8.276317 2.000 13.61 -----------------------------------------------------------------------------y| Coef.842875 4. 96) = 21.38622 36.372 0.3995 ---------+-----------------------------Adj R-squared = 0. Looking at these plots show the data points seem to be more densely packed below the regression line. We are looking for a nice even distribution of the residuals across the levels of the fitted value.633 R-squared = 0. We would expect the points to be normally distributed around these regression lines.574851 3. These plots show the relationship between each predictor.7 96 443035. . The plot in the top right shows x2 on the bottom axis.29 Model | 28298992.56946 8. rvfplot We can make a normal probability plot of y using the pnorm command. the yellow points would be a diagonal line (right atop the red line).14426 _cons | 1139. .94823 37. If y were normal. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.81248 17. Std.394 0. 022305 75% .753908 Skewness .740343 10% -1. while the five highest values go from 2.089911 3.587549 Obs 100 25% -.115274 -1. Stata knew we wanted studentized residuals because we used the rstudent option after the comma. . We can see the skew in the residuals below.363443 2. . graph box rstud Finally.42. a normal probability plot can be used to examine the normality of the residuals. see below. kdensity rstud.44973 -1. 100 50% -.0068091 Largest Std.59326 Variance 1. We then use the summarize command to examine the residuals for normality.764866 -1.9225476 99% 3. We see that the residuals are positively skewed. also shows the skew in the residuals.045107 95% 2. rstudent .44 to 3. another indicator of the positive skew.106763 2.78. summarize rstud. .6660804 -1.789388 5% -1. creating a variable called rstud containing the studentized residuals. normal A boxplot.1976601 Mean .933529 The kdensity command is used below to display the distribution of the residuals (in yellow) as compared to a normal distribution (in red). . detail Studentized residuals ------------------------------------------------------------Percentiles Smallest 1% -1. and that the 5 smallest values go as low as -1. pnorm rstud © Thierry Warin 25 © Thierry Warin 26 . Dev.5167466 2.425914 Kurtosis 3.Below we create studentized residuals using the predict command.515767 Sum of Wgt. 1.446876 90% 1. predict rstud. 1864127 .Let us try using a square root transformation on y creating sqy. sqy below.3182 -----------------------------------------------------------------------------sqy | Coef. summarize sqy. 100 Mean 31.8288354 R-squared = 0.61973 . Err.76309 33.91722 75% 40 56 90% 46. and examine the distribution of sqy. 11.018 . .806 0.000 . detail sqy ------------------------------------------------------------Percentiles Smallest 1% 6.3487791 .1200437 3.5682 96 86.6848214 x3 | .4465366 .4513345 99% 65 66 Kurtosis 3.4318 3 1908. it is much improved. . 
rvfplot © Thierry Warin 27 © Thierry Warin 28 .508619 x2 | . normal 50% 31 We run the regression again using the transformed value.219272 Looking at the distribution using kdensity we can see that although the distribution of sqy is not completely normal. t P>|t| [95% Conf.5 64 Skewness .5 58 Variance 142.0000 Residual | 8335.000 29.0486412 .8 Largest Std.1158643 2. kdensity sqy.264 0. .2082518 .720 0.405 0. 96) = 21.3886 Total | 14060.9353415 33.0202 95% 53.47637 -----------------------------------------------------------------------------The distribution of the residuals in the rvfplot below look much better.14393 Prob > F = 0.000 . Interval] ---------+-------------------------------------------------------------------x1 | . generate sqy = sqrt(y) . Std.5111456 _cons | 31. Dev.0817973 4. The residuals are more evenly distributed.5 5 5% 13 8 10% 17. The transformation has considerably reduced (but not totally eliminated) the skewness.5 11 Obs 100 25% 24 11 Sum of Wgt.4071 ---------+-----------------------------Adj R-squared = 0.98 Model | 5724.00 99 142.020202 Root MSE = 9. . regress sqy x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.2786301 . 010917 2. 100 In this case. Dev. detail Studentized residuals ------------------------------------------------------------Percentiles Smallest 1% -2. normal The avplots below also look improved (not perfect.180998 Skewness . predict rstud2.825144 Obs 100 25% -. . and outliers in the residuals. 1.014223 75% .7082379 -1.550404 -2. and there are no outliers in the plot.0026017 Largest Std. .262745 -1.450073 2.0782854 Mean .033796 90% 1.243119 5% -1. . but much improved).451008 Kurtosis 2. avplots The boxplot of the residuals looks symmetrical. box We create studentized residuals (called rstud2) and look at their distribution using summarize. The skewness is much better (.718882 The distribution of the residuals below look nearly normal.6425427 2. a square root transformation of the dependent variable addressed both problems in skewness in the residuals.180267 -2.316003 2. rstudent . kdensity rstud2.117415 10% -1.50% -. graph rstud2.27) and the 5 smallest and 5 largest values are nearly symmetric. Had we tried © Thierry Warin 29 © Thierry Warin 30 .2682509 99% 2. summarize rstud2. .789484 Sum of Wgt.085326 Variance 1.028647 95% 2. 8729 Total | 87318.00 Root MSE = 29.09967 11.3757505 -0.0647 ---------+-----------------------------Adj R-squared = 0.023 .0000 A scatterplot matrix is used below to look for higher order trends.0000 Residual | 10648.00 Model | 76669.ats. Because the test is significant.0850439 1. 87) = 67.313 0. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.101495 _cons | 20. the problem of the skewness of the residuals would have remained.6237 95 112. 95) = 171.6064824 .ucla. squared. rhs Ramsey RESET test using powers of the independent variables Ho: model has no omitted variables F(9. Regression diagnostics: Non-linearity use http://www. this suggest there are higher order trends in the data that we have overlooked.g.to address these problems via dealing with the outliers.1134093 .981 -.2560351 2. it can be useful to assess whether the residuals are skewed.75 96 850. t P>|t| [95% Conf.167 -----------------------------------------------------------------------------y| Coef. The null hypothesis. cubed) are present but omitted from the regression model.0355 Total | 87318. 
avplots Below we create x2sq and add it to the regression equation to account for the curvilinear relationship between x2 and y. 96) = 2. Err.25003 3 1883. as shown below.755 -.3441 Prob > F = 0. clear .0915 Residual | 81668. Consistent with the scatterplot.833301 x2 | -.61974 1. .965335 43.00 99 882.16468 -----------------------------------------------------------------------------The ovtest command with the rhs option tests whether higher order trend effects (e.3763 4 19167. graph matrix y x1 x2 x3 We can likewise use avplots to look for non-linear trend patterns.5932696 . the avplot for x2 (top right) exhibits a distinct curved pattern.716146 R-squared = 0. generate x2sq = x2*x2 .087 -2. regress y x1 x2 x2sq x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 4. is that there are no omitted variables (no significant higher order trends).3626687 0.21 Model | 5649.090776 R-squared = 0.8780 ---------+-----------------------------Adj R-squared = 0.7548232 . .00 Root MSE = 10. Std. When there are outliers in the residuals. Interval] ---------+-------------------------------------------------------------------x1 | .7368946 x3 | . If so.00 99 882. . ovtest. .024 0.317 0. addressing the skewness may also solve the outliers at the same time. We can see that there is a very clear curvilinear relationship between x2 and y.86 Prob > F = 0.587 ------------------------------------------------------------------------------ © Thierry Warin 31 © Thierry Warin 32 .edu/stat/stata/modules/reg/nonlin.08334 Prob > F = 0.0089643 .730 0. 1706278 .14 0.6237 95 112.2970677 . t P>|t| [95% Conf. . .3187645 x3 | .45 © Thierry Warin 33 © Thierry Warin 34 .0720997 .874328 x2centsq | 1.011738 25. Interval] ---------+-------------------------------------------------------------------x1 | .00 Model | 76669.848 -. however the results are misleading.82489 .2721586 .864632 x3 | 1. the VIF values are much better. Std.812539 x2cent | 1. If we "center" x2 (i.753 0.015 0.4448688 _cons | -.000 . A general rule of thumb is that a VIF in excess of 20 (or a 1/VIF or tolerance of less than 0.0262898 .0938846 2.2954615 . The reason for this is that x2 and x2sq are very highly correlated. We see that the VIF for x2 and x2sq are over 32. We use the vif command below to look for problems of multicollinearity. This time. and this time it does not drop any of the terms that we placed into the regression model. the results of the ovtest will no longer be misleading. The resulting ovtest misleads us into thinking there may be higher order trends. ovtest.7208091 -24.639 0.007 .978676 ---------+---------------------Mean VIF | 1.3187645 x3 | .8729 Total | 87318.9799 10.23 0.05) may merit further investigation.729 0.1316641 1. egen x2mean = mean(x2) .8589132 1. and then x2sq was discarded since it was the same as the term Stata created.1316641 1.171 0. rhs (note: x2cent^2 dropped due to collinearity) (note: x2cent^3 dropped due to collinearity) Ramsey RESET test using powers of the independent variables Ho: model has no omitted variables F(10.0938846 2. ovtest.8780 ---------+-----------------------------Adj R-squared = 0.0907586 .0000 There is another minor problem.14 0.1706277 . Err.78 We can solve both of these problems with one solution. Interval] ---------+-------------------------------------------------------------------x1 | .y| Coef.587 -----------------------------------------------------------------------------y| Coef. subtract its mean) before squaring it.00 99 882. 
t P>|t| [95% Conf.4320141 x2 | -17.14 We try the ovtest again. . vif Variable | VIF 1/VIF ---------+---------------------x2sq | 32. In testing for higher order trends.4448688 _cons | 267.00 Root MSE = 10. but it has discarded the higher trend we just included.0000 Residual | 10648.3441 Prob > F = 0.30 0. 95) = 171.296 0.090775 R-squared = 0.808669 -----------------------------------------------------------------------------As we expected.812539 x3 | 1.x2mean . Stata created x2sq^2 which duplicated x2sq.524 -3. .4320141 x2cent | -.2584843 . 85) = 0. rhs (note: x2sq dropped due to collinearity) (note: x2sq^2 dropped due to collinearity) Ramsey RESET test using powers of the independent variables Ho: model has no omitted variables F(11.193 0.874328 ---------+---------------------Mean VIF | 16. vif Variable | VIF 1/VIF ---------+---------------------x1 | 1. .526496 1.198 -. regress y x1 x2cent x2centsq x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 4.011738 25. generate x2cent = x2 .25588 -16. Stata gave us a note saying that x2sq was dropped due to collinearity.000 -19.030834 x2 | 32. Err.000 . and the VIF values for x2 and x2sq will get much better. Below.43 0. we center x2 (called x2cent) and then square that value (creating x2centsq).198 -.23 0. 85) = 54.000 246.007 .3763 4 19167.712 289.71298 25. .030959 x1 | 1.2478 -----------------------------------------------------------------------------We use the ovtest again below.57 Prob > F = 0. Std.39391 x2sq | .3437 -0.753 0.16 0.1363948 -0. generate x2centsq = x2cent^2 We now run the regression using x2cent and x2centsq in the equation.296 0. the results indicate that there are no more significant higher order terms that have been omitted from the model.2954615 .171 0.0720997 .244488 x2centsq | .e.2584843 .0907586 .02 0.2721586 . 0591071 6. rvfplot Note that if we examine the avplot for x2cent.19 99 134.3381903 R-squared = 0. avplot x2cent Had we simply run the regression and reported the initial results.30 Prob > chi2 = 0. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.68 Model | 8933. Std. .0000 Looking at the rvfplot below that shows the residual by fitted (predicted) value.ucla.5837503 .490543 _cons | 33. .2158539 .715 0.086744 8.23969 .Prob > F = 0.011 .3732164 .6758811 49. The variability of the residuals at the left side of the graph is much smaller than the variability of the residuals at the right side of the graph.90791 Prob > F = 0.6724 ---------+-----------------------------Adj R-squared = 0. Err.314 0. The test indicates that the regression results are indeed heteroscedastic.ats. hettest Cook-Weisberg test for heteroscedasticity using fitted values of y Ho: Constant variance chi2(1) = 21.9281211 x3 | .edu/stat/stata/modules/reg/hetsc. This is because the avplot adjusts for all other terms in the model.180 0. it shows no curvilinear trend.2558898 .7334 -----------------------------------------------------------------------------y| Coef. t P>|t| [95% Conf. . . so after adjusting for the other terms (including x2centsq) there is no longer any curved trend between x2cent and the adjusted value of y.000 .89807 34. so we need to further understand this problem and try to address it.0000 Residual | 4352. avplots Regression Diagnostics: Heteroscedasticity use http://www.46627 96 45.7559357 . we would have ignored the significant curvilinear component between x2 and y. 
we can clearly see evidence for heteroscedasticity.203939 Root MSE = 6.0496631 .72373 3 2977.578 0. Interval] ---------+-------------------------------------------------------------------x1 | .6622 Total | 13286.3820447 x2 | .9178 We create avplots below and no longer see any substantial non-linear trends in the data.5813 -----------------------------------------------------------------------------We can use the hettest command to test for heteroscedasticity. .083724 2. 96) = 65.000 . clear We try running a regression predicting y from x1 x2 and x3. © Thierry Warin 35 © Thierry Warin 36 .000 31. 0024562 2.000 . the chi-square value is somewhat reduced.0508362 .6760 Total | 11. t P>|t| [95% Conf.0054677 .19754 -----------------------------------------------------------------------------lny | Coef. hettest Cook-Weisberg test for heteroscedasticity using fitted values of sqy Ho: Constant variance chi2(1) = 13.017 . and run the regression.484862 -----------------------------------------------------------------------------We again try the hettest and the results are much improved. Err.000 .0652379 .317176905 R-squared = 0. The square root transformation was not successful. 96) = 69.0565313 100. .85 Model | 8.4489829 96 . regress lny x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3. hettest Cook-Weisberg test for heteroscedasticity using fitted values of lny Ho: Constant variance © Thierry Warin 37 © Thierry Warin 38 . regress sqy x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.0000 Residual | 30.0328274 .06 We next try a natural log transformation.974272688 Root MSE = . Std.0005921 .521 0. . t P>|t| [95% Conf.37 Model | 66.432 0.570379 5.445503 .0040132 3 22.03902155 R-squared = 0.17710164 3 2. .0426407 _cons | 5. but the test for heteroscedasticity is still quite significant.0309297 x2 | .0179788 .120436065 Root MSE = . and then run the regression again.0152643 _cons | 3.6843 ---------+-----------------------------Adj R-squared = 0.000 3.003129 .0070027 2.406144 3.0198285 173.0013377 Prob > F = 0. generate sqy = y^.028 .6744 Total | 96.050 0.682593 .001734 6.0103432 x2 | .5 . generate lny = ln(y) .0280817 x3 | .72570055 Prob > F = 0.000 5.000 .74606877 96 .0072553 8.4529961 99 .0796397 x3 | .0118223 . but the test is still significant.640 0. Interval] ---------+-------------------------------------------------------------------x1 | . Interval] ---------+-------------------------------------------------------------------x1 | .0025448 9. Err.0049438 6.56318 -----------------------------------------------------------------------------sqy | Coef.Prob > chi2 = 0.992 0.0230141 .0003 Looking at the rvfplot below indeed shows that the results are still heteroscedastic.818 0.9231704 99 . .6858 ---------+-----------------------------Adj R-squared = 0.0000 Residual | 3. 96) = 69.794807 -----------------------------------------------------------------------------Using the hettest again.000 . Std.0083803 .765 0. rvfplot We will try to stabilize the variance by using a square root transformation. .226 0.0230303 .0170293 . 0011052 9.54837 Prob > F = 0.78014 -----------------------------------------------------------------------------Let's start an examination for outliers by looking at a scatterplot matrix showing scatterplots among y. Std.0051344 . 96) = 14.0121957 x3 | . graph matrix y x1 x2 x3 © Thierry Warin 39 © Thierry Warin 40 .007359919 R-squared = 0.24884946 99 .06578 R-squared = 0. 
We show that below.3062 ---------+-----------------------------Adj R-squared = 0.1037212 .479269 1.60 Prob > chi2 = 0. .028 .85 Model | 1.dta .6858 ---------+-----------------------------Adj R-squared = 0.12 Model | 6358. generate log10y = log10(y) .0000 Residual | .304 0. Interval] ---------+-------------------------------------------------------------------x1 | . Regression Diagnostics: Outliers use http://www. .0007531 6. .2635928 . the effect in reducing heteroscedasticity (as measured by hettest) was the same.000 29.1399371 .050 0.0002571 .576853 . .5009867 x2 | . Err. and x3.496363 .edu/stat/stata/modules/reg/outlier.765 0.ats.3149 96 150. we will be content for now that this has substantially reduced the heteroscedasticity as compared to the original data.3533915 .1523206 1. The results suggest that x2 and x3 are significant.022715651 Root MSE = . Whether we chose a log to the base e or a log to the base 10.0023746 .286 0.33932 1.ucla.655 0.0078081 .0036395 . regress log10y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.000 . . x2.2845 Total | 20764. x2. but it is much improved. x1.0100019 .004492 x2 | .0066292 _cons | 1.229643 26.0086114 173.25 -----------------------------------------------------------------------------y| Coef.8901132 x3 | .1578149 3.54229722 3 .chi2(1) = 5.226 0. t P>|t| [95% Conf.64512 3 2119.300 0. clear Below we run an ordinary least squares (OLS) regression predicting y from x1. Std.747071 Root MSE = 12.000 . t P>|t| [95% Conf.000 1. Interval] ---------+-------------------------------------------------------------------x1 | .0179 While these results are not perfect.60 Prob > chi2 = 0.1075346 3. rvfplot Cook-Weisberg test for heteroscedasticity using fitted values of log10y Ho: Constant variance chi2(1) = 5.0000 Residual | 14406.08579 -----------------------------------------------------------------------------log10y | Coef.001 .195 -.513456 -----------------------------------------------------------------------------The results for the hettest are the same as before.706552237 96 .000 .5668459 _cons | 32. 96) = 69. Err. but x1 is not significant.818 0.1986327 .96 99 209. and x3. regress y x1 x2 x3 Perhaps you might want to try a log (to the base 10) transformation.514099074 Prob > F = 0.0179 Below we see that the rvfplot does not look perfect. Although we cannot see a great deal of detail in these plots (especially since we have reduced their size for faster web access) we can see that there is a single point that stands out from the rest.0010667 2.6760 Total | 2.8985 34. This looks like a potential outlier. hettest Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3. . and the residual value of y after using x2 and x3 as predictors. symbol([case]) The rvfplot command shows us residuals by fitted (predicted) values. The variable case is the case id of the observation. © Thierry Warin 41 © Thierry Warin 42 . the top right plot shows x2 on the horizontal axis. The line plotted has the slope of the coefficient for x1 and is the least squares regression line for the data in the scatterplot. these plots allow you to view each of the scatterplots much like you would look at a scatterplot from a simple regression analysis with one predictor. In short. The most problematic outliers would be in the top right of the plot. The plot in the top left shows x1 on the horizontal axis. . indicating both high leverage and a large residual. If you run this in Stata yourself. 
symbol([case]) The avplots command gives added variable plots (sometimes called partial regression plots). using the symbol([case]) option that indicates to make the symbols the value of the variable case. This plot shows us that case 100 has a very large residual (compared to the others) but does not have exceptionally high leverage. rvfplot. we see that case 100 appears to be an outlier in each plot. the case numbers will be much easier to see. lvr2plot. . In looking at these plots. and the residual value of y after using x1 and x3 as predictors. ranging from 1 to 100. symbol([case]) We can use the lvr2plot command to obtain a plot of the leverage by normalized residual squared plot. graph matrix y x1 x2 x3. and the residual value of y after using x1 and x2 as predictors. Returning to the top left plot. It is difficult to see below. this shows us the relationship between x1 and y. but the case for the outlier is 100. Likewise. after adjusting y for x2 and x3. and the bottom plot shows x3 on the horizontal axis.We repeat the scatterplot matrix below. and also indicates that case 100 has the largest residual. Beyond noting that x1 is an outlier. leverage . observation 100 stands out as an outlier. graph box d. Finally. possibly increasing the overall slope for x2. For x1 we see that x1 seems to be tugging the line up at the left giving the line a smaller slope. . This is consistent with the lvr2plot (see above) that showed us that observation 100 had a high residual. . showing that you can obtain an avplot for a single variable at a time. graph box rstu. . avplots. but not exceptionally high leverage. symbol([case]) Below we use the predict command to create a variable called l that will contain the leverage for each observation. . symbol([case]) graph command to make a boxplot looking at the studentized residuals. for x3 the outlier is right in the center and seems to have no influence on the slope (but would pull the entire line up influencing the intercept). By contrast. predict rstu. Also. symbol([case]) We use the predict command below to create a variable containing the studentized residuals called rstu. we can better see the influence of observation 100 tugging the regression line up at the right. We make a boxplot of that below. . symbol([case]) Below we repeat the avplot just for the variable x2. Stata knew we wanted leverages because we used the leverage option after the comma. symbol([case]) © Thierry Warin 43 © Thierry Warin 44 . rstudent . As we would expect. the outlier for x2 seems to be tugging the line up at the right giving the line a greater slope. avplot x2. graph box l. and 100 shows to have the highest value for Cooks D. predict d. looking for outliers. Stata knew we wanted studentized residuals because we used the rstudent option after the comma. we can see the type of influence that it has on each of the regression lines. The boxplot shows some observations that might be outliers based on their leverage. We can then use the Below use use the predict command to compute Cooks D for each observation. predict l. Note that observation 100 is not among them. cooksd . This indicates that the presence of observation x1 decreases the value of the coefficient for x1 and if it was removed. symbol([case]) We repeat the graph above. the plot below shows an observation that has a very large residual. DFx2. but does not have a very large leverage. . and DFx3. The output below shows that three variables were created. 
dfbeta DFx1: DFbeta(x1) DFx2: DFbeta(x2) DFx3: DFbeta(x3) Below we make a graph of the studentized residual by DFx1. graph rstu DFx1. . We see that observation 100 has a very high residual value. and for each predictor. As we would expect. except using the symbol([case]) option to show us the variable case as the symbol. the larger the symbol will be. From our examination of the avplots above.We can make a plot that shows us the studentized residual. The dfbeta value shows the degree the coefficient will change when that single observation is omitted. how influential a single observation can be. symbol([case]) © Thierry Warin 45 © Thierry Warin 46 . it appeared that the outlier for case 100 influences x1 and x2 much more than it influences x3. This allows you to see. the coefficient for x1 would get larger. DFx1. . and that it has a large negative DFBeta. and cooks D all in one plot. graph rstu l [w=d] The leverage gives us an overall idea of how influential an observation is. The graph command below puts the studentized residual (rstud) on the vertical axis. We can use the dfbeta command to generate dfbeta values for observation. graph rstu l. a very large value of Cook's D. which shows us that the observation we identified above to be case = 100. The [w=d] tells Stata to weight the size of the symbol by the variable d so the higher the value of Cook's D. . and the size of the bubble reflects the size of Cook's D (d). leverage (l) on the horizontal axis. for a given predictor. leverage. but it has a small DFBeta (small DFx2). Removing this observation will make the coefficient for x2 go from . that coefficient would get smaller. avplot x2. graph rstu DFx3. a DFbeta value of 1 or larger is considered worthy of attention. We can see that the outlier at the top right has the largest DFbeta value and that observation enhances the coefficient (. It looks like observation 100 diminishes the coefficient for x1.1578 =.0.Below we make a graph of the studentized residual by the value of DFx2. As a rule of thumb. symbol([case]) The results of looking at the DFbeta values is consistent with our observations looking at the avplots.422.98 * . Like above. symbol([case]) © Thierry Warin 47 © Thierry Warin 48 . enhances the coefficient for x2. . and has little impact on the coefficient for x3. generate rDFx2 = round(DFx2. . or . This suggests that the presence of observation 100 enhances the coefficient for x2 and its removal would lower the coefficient for x2. Below we take the DFbeta value for x2 (DFx2) and round it to 2 decimal places. . This shows that observation 100 has a large residual.154. we see that x2 has a very large residual. we could look at the DFbeta values right in the added variable plots.576 to . Instead of looking at this information separately. the value of the DFbeta tells us exactly how much smaller. We can see that the information provided by the avplots and the values provided by dfbeta are related.01) . In fact. This suggests that the exclusion of observation 100 would have little impact on the coefficient for x3. it indicates that the coefficient will be . graph rstu DFx2. We then include rDFx2 as a symbol in the added variable plot below. symbol([rDFx2]) Finally.98 standard errors smaller.576) and if this value were omitted. creating rDFx2. but instead DFx2 is a large positive value. we make a plot showing the studentized residual and DFx3. Interval] ---------+-------------------------------------------------------------------x1 | . We checked the original data.000 29. 
The lvr2plot shows a point with a larger residual than most but low leverage. symbol([case]) Below. t P>|t| [95% Conf.325). avplot x2. The value really should have been 11.000 0.0298236 d .1737208 . The coefficient for x1 increased (from . replace y = 11 if (case == 100) (1 real change made) © Thierry Warin 49 © Thierry Warin 50 .34). the coefficient for x2 decreased by the exact amount the DFbeta indicated (from . we look at the data for observation 100 and see that it has a value of 110. and found that this was a data entry error. but shows us the case numbers (the variable case) as the symbol.28053 .0861919 4. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3.0784904 rDFx2 . 96) = 20. As we expected. Std.4193103 . . Err. . the coefficient for x3 was changed very little (from .001 .28744 96 96.56752 Prob > F = 0.35 to . but a small residual.325315 .19 to . . Note that for x2 the coefficient went from being non-significant to being significant.23692 -----------------------------------------------------------------------------We repeat the regression diagnostic plots below. lvr2plot The plot below is the same as above.5676601 x2 | .3878 ---------+-----------------------------Adj R-squared = 0.126493 3.57 to . .08297 . We need to look carefully at the scale of these plots since the axes will be rescaled with the omission of the large outlier.5159 _cons | 31.Having fixed the value of y we run the regression again.000 .32415 33.717071 Root MSE = 9.3448104 . allowing us to see that observation 100 is the outlying case.27 Model | 5863.98 We change the value of y to be 11.1220892 2.670397 x3 | .9855929 31.70256 3 1954.2893599 DFx1 -.738 0.1682236 .8180526 DFx2 .99 99 152.3687 Total | 15118. . list in 100 Observation 100 case 100 x1 -5 x2 8 x3 0 y 110 rstu 7. and 3 points with larger leverage than most. The coefficients change just as the regression diagnostics indicated.315 0.42).829672 l .009 .9819155 DFx3 .8188 -----------------------------------------------------------------------------y| Coef.0000 Residual | 9255.665 0. the correct value.4092442 R-squared = 0. regress y x1 x2 x3 x4 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 4.ats. leverage and Cook's D all in one graph.5719733 R-squared = 0. . and Cook's D.91563 Prob > F = 0. y. leverage and Cook's D value. Regression Diagnotics: Multicollinearity use http://www. clear Below we run a regression predicting y from x1 x2 x3 and x4. drop l rstu d . we note that the test of all four predictors is significant (F = 16. rvfplot .4080 We create values of leverage. . If we look more carefully. predict l. yet none of them are significant.66253 4 1498. p = 0. None of the points really jump out as having an exceptionally high residual. . It seems like a contradiction that the combination of these 4 predictors should be so strongly related to y. studentized residuals. (We first drop the variables l rstu and d because the predict command will not replace existing values). predict rstu.33747 95 91. . Had we skipped checking the residuals. 95) = 16. we will leave this up to the reader.0000 Residual | 8699.37 Model | 5995. avplots Although we could scrutinize the data a bit more closely. leverage .40). we would conclude that none of these predictors are significant predictors of the dependent variable. 
cooksd We then make the plot that shows the residual.ucla.The rvfplot below shows a fairly even distribution of residuals without any points that dramatically stand out from the crowd. rstudent © Thierry Warin 51 © Thierry Warin 52 . If we were to report these results without any further checking.0000) and these predictors account for 40% of the variance in y (R-squared = 0. graph rstu l [w=d] The avplots show a couple of points here and there that stand out from the crowd. predict d. we can tentatively state that these revised results are good. For now.37. we would have used the original results which would have underestimated the impact of x1 and overstated the impact of x2. .edu/stat/stata/modules/reg/multico. Looking at an avplot with the DFbeta value in each plot would be a useful followup to assess the influence of these points on the regression coefficients. Let us investigate further. 010964 x2 | 82. using x1 as an example.83 0. 1-Rsquared equals 0.010964 (the value of 1/VIF for x1). With these improved tolerances.2084695 .892234 ---------+---------------------Mean VIF | 1.16 0.8971469 3. A tolerance (1/VIF) can be described in this way.9155806 3.4040 ---------+-----------------------------Adj R-squared = 0.191635 1.78069 96 91.5693 -----------------------------------------------------------------------------y| Coef. Interval] ---------+-------------------------------------------------------------------x1 | 1.298443 .278 -. A general rule of thumb is that a VIF in excess of 20.220 -.6516 0.8370988 1. .6969874 x3 | .0626879 . also called tolerances).05 or less may be worthy of further investigation. The solutions are often driven by the nature of your study © Thierry Warin 53 © Thierry Warin 54 .69 Model | 5936.042406 1.092 0.859 0.7281 0.1187692 2. we see that less than .9709127 32.7790 1.73977 Prob > F = 0.806 0.23 0.566 0. and is really not needed in the model.21931 3 1978.4527284 . If you compare the standard errors in the table above with the standard errors below.280417 x4 | -.54662 -----------------------------------------------------------------------------We use the vif command to examine the VIF values (and 1/VIF values. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3. and use x2 x3 and x4 as predictors and compute the Rsquared (the proportion of variance that x2 x3 and x4 explain in x1) and then take 1-Rsquared. Note that the variance explained is about the same (still about 40%) but now the predictors x1 x2 and x3 are now significant.000 29. Err.134 0.012093 ---------+---------------------Mean VIF | 221.3553 1.5518 -----------------------------------------------------------------------------y| Coef.0000 x4 | 0.12 0.001869 x3 | 175. vif Variable | VIF 1/VIF ---------+---------------------x4 | 534. and much better than the prior results.0000 x3 | 0.---------+-----------------------------Total | 14695.50512 .864654 x3 | 1.5130679 _cons | 31.133 0. t P>|t| [95% Conf.434343 Root MSE = 9. Err. the standard errors in the table above are reduced.17 We should emphasize that dropping variables is not the only solution to problems of multicollinearity.69 0.00 99 148. or a tolerance of 0.21 0.513 0. 96) = 21. corr x1 x2 x3 x4 (obs=100) | x1 x2 x3 x4 ---------+-----------------------------------x1 | 1.118277 1. If we look at x4.899733 1.356131 x3 | 1.225535 _cons | 31. 
Std.40831 -----------------------------------------------------------------------------Below we look at the VIF and tolerances and see that they are very good.00 99 148. This means that only about 1% of the variance in x1 is not explained by the other predictors.024484 1.3466306 .422 -2. This makes sense.3853 Total | 14695. .61912 . t P>|t| [95% Conf. You can see that these results indicate that there is a problem of multicollinearity.014 .2021 1.17 If we examine the correlations among the variables. In this example.2372989 R-squared = 0.260 -.812781 x2 | 1.152135 x2 | 1.9587921 32.0000 Residual | 8758. we see that the standard errors in the table above were much larger. so we try removing x4 from the regression equation.1230534 3. it seems that x4 is most strongly related to the other predictors.05215 1.2% of the variance in x4 is not explained by x1 x2 and x3.3831 Root MSE = 9.679 0. Std.000 .000 29.97 0.69161 33.0000 We might conclude that x4 is redundant.000 .5341981 x2 | .0000 x2 | 0. .1801934 .286694 1.7827429 3.60194 33.3136 0.434343 Adj R-squared = 0. because when a variable has a low tolerance.0838481 4. . its standard error will be increased. Interval] ---------+-------------------------------------------------------------------x1 | . vif Variable | VIF 1/VIF ---------+---------------------x1 | 1.005687 x1 | 91.038979 -0. Use x1 as a dependent variable.234 0. We first need to tell Stata the name of the time variable using the tsset command. . . Stata replies back that the time values range from 1 to 100.834634 96 9.ucla.0717626 .0030059 .0118 -----------------------------------------------------------------------------y| Coef. regress y x1 x2 x3 Source | SS df MS Number of obs = 100 ---------+-----------------------------F( 3. This model would permit correlations between observations across time. and the closer the value is to 0. In this instance. These results suggest that none of the predictors are related to y. Below we see the residuals are clearly not distributed evenly across time. the greater the autocorrelation. we might have concluded that none of these predictors were related to the dependent variable.0356469 .468 -. Std. Interval] ---------+-------------------------------------------------------------------x1 | . you might choose to generate factor scores from a principal component analysis or factor analysis.0740128 .04442 © Thierry Warin 55 © Thierry Warin 56 . with a first order autocorrelation.67536 Iteration 1: log likelihood = -148. the value of . section 8. .75845547 Prob > F = 0. .41 Model | 11. Or.edu/stat/stata/modules/reg/nonind.938 -. .ats. and decide how you might combine the variables.04916 Iteration 6: log likelihood = -148. arima y x1 x2 x3. Hadi and Price. If there were no autocorrelation. the results were dramatically different showing x1 x2 and x3 all significantly related to the dependent variable.0181 Total | 882.3023225 4.902034 -----------------------------------------------------------------------------Let's create and examine the residuals for this analysis.0264387 -0.14569 (switching optimization to BFGS) Iteration 5: log likelihood = -148.2753664 3 3.3 for more information). and use the factor scores as predictors. t P>|t| [95% Conf.301929 .0374498 0. 96) = 0.0388007 0.1099843 x2 | . showing the residuals over time. predict rstud.306 0.14 is sufficiently close to 0 to indicate a strong autocorrelation (see Chatterjee. suggesting the results are not independent over time.000 .91020202 Root MSE = 3.0331981 _cons | 1.7431 Residual | 870. 
but not a test of the significance of "d".0192823 .19301 Iteration 3: log likelihood = -148.1389823 Let's run the analysis using the arima command.0386904 .0128 ---------+-----------------------------Adj R-squared = -0. the value of "d" would be 2.729 0. tsset time time variable: time. You might decide to use principal component analysis or factor analysis to study the structure of your variables. Err.952 0. graph rstud time We can use the dwstat command to test to see if the results are independent over time.344 -. 100) = . rstud .04442 Iteration 8: log likelihood = -148.0711941 R-squared = 0.0800247 x3 | -. After dropping x4.and the nature of your variables.04445 Iteration 7: log likelihood = -148.11 99 8. clear Below we run a regression predicting y from x1 x2 and x3. Regression Diagnostics: Non-Independence use http://www.077 0. 1 to 100 The dwstat command gives a value of "d". You may decide to combine variables that are very highly correlated because you realize that the measures are really tapping the exact same thing. ar(1) (setting optimization to BHHH) Iteration 0: log likelihood = -151. dwstat Durbin-Watson d-statistic( 4.94194 Iteration 2: log likelihood = -148.7018232 1. Had we not investigated further.16849 Iteration 4: log likelihood = -148. so can time fixed effects control for variables that said to be homoskedastic. X ki . constant for i = 1.. conditional on regressors... that is. 6. No perfect multicollinearity.. Otherwise.... we use (Breusch and Pagan. one should use hetereoskedastic-robust standard errors. X ki . X 2i . and asymptotically normal. X ki and ui are nonzero and finite.. 1979)’s test. 2. In case of perfect multicollinearity. it is impossible to compute the OLS estimator... ( X 1i . the variables are observed for each entity and each time period. consistent.. heteroskedasticity. X 2i .. d... Regression with time fixed effects Just as fixed effects for each entity can control for variables that are constant over time but differ across entities.. In cross-sectional data. This is the most important assumption in practice. the conditional distribution of ui given X1i .. If this assumption does not hold. n are Independently and Identically (12) Distributed. and here errors are uncorrelated across time as well as entities. X 2i . One should test for omitted variables using (Ramsey and Braithwaite.. conditional on regressors. This is to be sure that there is no selection bias in the sample. i = 1. 1931)’s test. the errors are uncorrelated across entities. Yi ) .. then it is likely because there is an omitted variable bias. X 2i . X ki . but it is (13) inappropriate for time series data. X 2i . Assumptions of the OLS estimator 1..... To test for A. This means that the other factors captured in the error term are unrelated to X1i . If the standard errors are heteroskedastic. X ki and ui should be nil. Whether the errors are homoskedastic or heteroskedastic. Yit = β 0 + β1 X it + β 2 Zi + uit Where Z i is an unobserved variable that varies from one state to the next but does not change over time.... X ki has a mean of zero.. X 2i ..... X ki . B.. The fixed effects regression model 3. n and in particular does not depend on X1i . X ki is Time-Series Cross-Section Analyses (TSCS) or Panel data models A balanced panel has all its observations. then the errors are 5. X1i . c. X ki and ui have four moments. A panel that has some missing data for at least one time period for at least one entity is called an unbalanced panel data. 
Related to the first assumption: if the variance of this conditional distribution of ui does not depend on X1i . The regressors are said to be perfectly multicollinear if one of regressors is a perfect linear function of one of the other regressors... The error term ui is homoskedastic if the are constant across entities but evolve over time. The correlation between X1i .... the error term is heteroskedastic... the OLS estimator is unbiased.. This second assumption holds in many cross-sectional data sets. 4..variance of the conditional distribution of ui given X1i . The fourth assumption is that the fourth moments of X1i . X 2i .. X 2i .. X 2i ... We can rewrite equation (12): Yit = β1 X it + α i + uit Where α i = β 0 + β 2 Z i . © Thierry Warin 57 © Thierry Warin 58 . A somewhat analogous prohibition applies to the random effects estimator. various statistical issues must be taken into account. the random effects approach attempts to model the group effects as drawings from a probability distribution instead of removing them. In most cases this is unlikely to be adequate. GLS estimation is equivalent to OLS using “quasi-demeaned” variables. however. If the panel comprises observations on a fixed and relatively small set of units of interest (say. there is a presumption in favor of random effects. the vis are not treated as fixed parameters. This is sometimes called the Least Squares Dummy Variables (LSDV) method or “de-meaned” variables method. that is. (14) Choice of estimator Which panel method should one use. If you want to include such variables in the model. that is. variables from which we subtract a fraction of their average. Some panel data sets contain variables whose values are specific to the crosssectional unit but which do not vary over time. If k > n. if. the time-invariant differences in mean between the groups) beyond the fact that they exist — and that can be tested. see below. This means that if all the variance is attributable to the individual effects. This can be done by including a dummy variable for each cross-sectional unit (and suppressing the global constant). uncorrelated with the regressors. we decompose uit into a unit-specific and time-invariant component. there is a presumption in favor of fixed effects. and an observation specific error. the “between” estimator is undefined — since we have only n effective observations — and hence so is the random effects estimator. © Thierry Warin 59 © Thierry Warin 60 . then the fixed effects estimator is optimal. but as random drawings from a given probability distribution. to be the optimal estimator. When using the approach of subtracting the group means.Yit = β 0 + β1 X it + β 2 Zi + β 3 St + uit Where St is unobserved. However. That is. 1. which are to be estimated. The fixed and random effects models The fixed and random effects models have in common that they decompose the unitary pooled error term. the member states of the European Union). where the subscript “t” emphasizes that the variable S changes over time but is constant across states. taking into account the covariance structure of the error term. In the fixed effects approach. individual effects are negligible. On the other hand. the problem is that the time-invariant variables are perfectly collinear with the per-unit dummies. unit-specific y-intercepts). the fixed effects option is simply not available. unsurprisingly. the remaining parameters can be estimated. 
the choice between fixed effects and random effects may be expressed in terms of the two econometric desiderata. This requires that individual effects are representable as a legitimate part of the disturbance term. This estimator is in effect a matrix-weighted average of pooled OLS and the “between” estimator. The celebrated Gauss–Markov theorem. As a consequence. according to which OLS is the best linear unbiased estimator (BLUE). then pooled OLS turns out. efficiency and consistency. zero-mean random variables. 2. When the fixed effects approach is implemented using dummy variables. If these assumptions are not met — and they are unlikely to be met in the context of panel data — OLS is not the most efficient estimator. fixed effects or random effects? One way of answering this question is in relation to the nature of the data set. we do not make any hypotheses on the “group effects” (that is. For the random effects model. "it . If one does not fall foul of one or other of the prohibitions mentioned above. the issue is that after de-meaning these variables are nothing but zeros. on the other hand. once these effects are swept out by taking deviations from the group means. depends on the assumption that the error term is independently and identically distributed (IID). If it comprises observations on a large number of randomly selected individuals (as in many epidemiological and other longitudinal studies). Suppose we have observations on n units or individuals and there are k independent variables of interest. we could say that there is a tradeoff between robustness and efficiency. uit . From a purely statistical viewpoint. _i. Besides this general heuristic.1 The _is are then treated as fixed parameters (in effect. Running Pooled OLS regressions in Stata The simplest estimator for panel data is pooled OLS. In contrast to the fixed effects model. Greater efficiency may be gained using generalized least squares (GLS). you do not need to type age1 or age2. These advantages. it is not considering for the panel nature of the dataset. When you do this. When you estimate a model using fixed effects. the fixed-effects estimator “always works”. for example. Let's now turn to estimation commands for panel data. you would type: ereturn list Testing panel models Panel models carry certain complications that make it difficult to implement all of the tests one expects to see for models estimated on straight time-series or crosssectional data. and that estimation of the parameters for time-varying regressors is carried out more efficiently.time Let's perform a regression where only the variation of the means across individuals is considered. if this hypothesis is not rejected. H. are tied to the validity of the additional hypotheses. If you want to control for some categories: xi: reg dependent ind1 ind2 i.category2 i.and random effects estimates agree. This is valid for any regression that you perform. you are instructing Stata to include all the variables starting with the expression age to be included in the regression. but at the cost of not being able to estimate the effect of time-invariant regressors. to within the usual statistical margin of error. there is no reason to think the additional hypotheses invalid. This is the between regression. In the case of panel data. Suppose you want to observe the internal results saved in Stata associated with the last estimation. and so. the Breusch–Pagan and Hausman tests are presented automatically. 
xtreg ln_wage grade age ttl_exp tenure black not_smsa south.category1 i. You just need to type age. no reason not to use the more efficient RE estimator. then again we conclude that the simple pooled model is adequate.As a consequence. which is simply an OLS regression applied to the whole dataset. When you estimate using random effects. while fixed-effects estimates would still be valid. you automatically get an F-test for the null hypothesis that the cross-sectional units all have a common intercept. The test is based on a measure. The null hypothesis is that these estimates are consistent — that is. It is precisely on this principle that the Hausman test is built: if the fixed. that the requirement of orthogonality of the vi and the Xi is satisfied. be Robust standard errors For most estimators. The null hypothesis is that the variance of vi in equation equals zero. If. Stata offers the option of computing an estimate of the covariance matrix that is robust with respect to heteroskedasticity and/or © Thierry Warin 61 © Thierry Warin 62 . The richer hypothesis set of the random-effects estimator ensures that parameters for timeinvariant regressors can be estimated. This regression is not considering that you have different individuals across time periods. The first type of regression that you may run is a pooled OLS regression. In order to observe them. The Hausman test probes the consistency of the GLS estimates. constructed such that under the null it follows the _2 distribution with degrees of freedom equal to the number of time-varying regressors in the matrix X. of the “distance” between the fixed-effects and random-effects estimates. If the value of H is “large” this suggests that the random effects estimator is not consistent and the fixed-effects model is preferable. reg ln_wage grade age ttl_exp tenure black not_smsa south In the previous command. though. autocorrelation (and hence also robust standard errors). then the random-effects estimator would be inconsistent. and as a consequence. The Breusch–Pagan test is the counterpart to the F-test mentioned above. there is reason to think that individual effects may be correlated with some of the explanatory variables. robust covariance matrix estimators are available for the pooled and fixed effects model but not currently for random effects. if there is correlation between the individual and/or time effects and the independent variables. This choice is between fixed effects (or within. x.xi.t..t + v.v..x.) © Thierry Warin 63 © Thierry Warin 64 .. then the individual and time effects (fixed effects model) must be estimated as dummy variables in order to solve for the endogeneity problem. you are always concerned in choosing between two alternative regressions. In panel data. The fixed effects (or within regression) is an OLS regression of the form: (yit .. a specific time effect 3. then the random effects model should be used because it is a weighted average of between and within estimations. x. or least squares dummy variables . in the two-way model.t are the means of the respective variables (and the error) within each time period across individuals and y.t and v. so you should run random effects if it is statistically justifiable to do so.y..)B + (vit .LSDV) estimation and random effects (or feasible generalized least squares . Statistically. and v. is the overall mean of the respective variables (and the error). But. Choosing between Fixed effects and Random effects? 
The Hausman test It is absolutely fundamental that the error term is not correlated with the independent variables. • If you have no correlation. 2. The two-way model assumes the error term as having a specific individual term effect. assumes the error term as having a specific individual term effect where yi.. and vi.yi. .vi. Random effects will give you better P-values as they are a more efficient estimator.FGLS) estimation. are the means of the respective variables (and the error) within the individual across time.. the error term can be the result of the sum of three components: 1.) = (xit . y. • The generally accepted way of choosing between fixed and random effects is running a Hausman test.Running Panel regressions in Stata In empirical work in panel data. xi. and an additional idiosyncratic term.t + x.t + y. fixed effects are always a reasonable thing to do with panel data (they always give consistent results) but they may not be the most efficient model to run. In the one-way model. . .. the error term can be the result of the sum of one component: 1. testparm y 3. 5.The Hausman test checks a more efficient model against a less efficient but consistent model to make sure that the more efficient model also gives consistent results. you need to first estimate the fixed effects model. xtreg ln_wage grade age ttl_exp tenure black not_smsa south y. and then do the comparison. 2... If you get a significant Pvalue.. We reject the null hypothesis that the time dummies are not jointly significant if p-value smaller than 10%. the time dummies were abbreviated to "y" (see “Generating time dummies”. however. In order to perform the test for the inclusion of time dummies in our fixed effects regression.) © Thierry Warin 65 © Thierry Warin 66 . If you want a fixed effects model with robust standard errors. The hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. . In the next fixed effects regression.the purpose here is to show you an additional test for random effects in panel data. 1. 3. but you could type them all if you prefer. absorb(idcode) robust You may be interested in running a maximum likelihood estimation in panel data. fe 2. xtreg dependentvar independentvar1 independentvar2. 1. you can use the following command: areg ln_wage grade age ttl_exp tenure black not_smsa south. save the coefficients so that you can compare them with the results of the next model.. we apply the "testparm" command. re estimates store random hausman fixed random If you qualify for a fixed effects model. first we run fixed effects including the time dummies. fe estimates store fixed xtreg dependentvar independentvar1 independentvar2. You would type: xtreg ln_wage grade age ttl_exp tenure black not_smsa south. Second. you should use fixed effects. and as a consequence our fixed effects regression should include time effects. when you are doing empirical work in panel data is to choose for the inclusion or not of time effects (time dummies) in your fixed effects model. . 4. If they are insignificant (P-value. mle Fixed effects or random effects when time dummies are involved: a test What about if the inclusion of time dummies in our regression would permit us to use a random effects model in the individual effects? [This question is not usually considered in typical empirical work. Prob>chi2 larger than . should you include time effects? Other important question. 
which assumes the null hypothesis that the time dummies are not jointly significant. To run a Hausman test comparing fixed with random effects in Stata.05) then it is safe to use random effects. estimate the random effects model. It is the test for time dummies. That means that OLS will be inconsistent as well as inefficient. then yit 1 is bound to be correlated with the error. and producing consistent estimates of _ and _. However. since the value of vi affects yi at all t. an alternative tactic for sweeping out the group effects: Although the Anderson–Hsiao estimator is consistent. but a subtler issue remains. This procedure has the double effect of handling heteroskedasticity and/or serial correlation. which applies to both fixed and random effects estimation.1. One-step estimators have sometimes been preferred on the grounds that they are more robust. vi. First. Two additional commands that are very usefull in empirical work are the Arellano and Bond estimator (GMM estimator) and the Arellano and Bover estimator (system GMM). they suggest taking the first difference. we will run a random effects regression including dummies. and then we will apply the "xttest0" command to test for random effects in this case. © Thierry Warin 67 © Thierry Warin 68 . The null hypothesis of random effects is again rejected if p-value smaller than 10%. The fixed-effects model sweeps out the group effects and so overcomes this particular problem. computing the covariance matrix of the 2-step estimator via the standard GMM formulae has been shown to produce grossly biased results in finite samples. was proposed by Anderson and Hsiao (1981). Moreover. One strategy for handling this problem. that of a GMM estimator. and thus we should use a fixed effects model with time effects. First. strictly speaking. re 2. The rationale behind it is. plus producing estimators that are asymptotically efficient. it is not most efficient: it does not make the fullest use of the available instruments. which assumes the null hypothesis of random effects. Dynamic panels and GMM estimations Special problems arise when a lag of the dependent variable is included among the regressors in a panel model. implementing the finite-sample correction devised by Windmeijer (2005). nor does it take into account the differenced structure of the error _it . Stata implements natively the Arellano–Bond estimator. our time xtreg ln_wage grade age ttl_exp tenure black not_smsa south y. Estimators which ignore this correlation will be consistent only as T ! 1 (in which case the marginal effect of "it on the group mean of y tends to vanish). if the error uit includes a group effect. Instead of de-meaning the data. It is improved upon by the methods of Arellano and Bond (1991) and Blundell and Bond (1998). xttest0 3. leads to standard errors for the 2-step estimator that can be considered relatively accurate. xtabond2 dep_variable ind_variables (if. the lagged values of two and three periods will be used as instruments. o Option collapse reduces the size of the instruments matrix and aloow to prevent the overestimation bias in small samples when the number of instruments is close to the number of observations. eq(level)) gmm(x. eq(level) or eq(both) mean that the instruments must be used respectively for the equation in first difference. eq(level).b) means that for the equation in difference. But although this two-step estimation is asymptotically more efficient. the option is eq(both). 
the exogenous variables are differentiated to serve as instruments in the equations in first difference. Example: gmm(x y. whereas for variable y. Option two: • This option specifies the use of the GMM estimation in two steps. and eq(both): see above o By default. will be used as instruments. the lagged values of one period and two periods will be used as instruments. lagged by © Thierry Warin 69 © Thierry Warin 70 . "xtabond" is a built in command in Stata. whereas for the equation in level. This option impacts the coefficients only if the variables are exogenous. the first differences dated t-a+1 will be used as instruments. If b=●. Example 2: gmm(x. iv(list2. o Options eq(diff). so in order to check how it works. eq(both) and collapse o lag(a. options2) two robust small 1. eq(diff). allowing thus to include in the regression the observations whose data on exogenous variables are missing. noleveleq gmm(list1. a=1. gmm(list1. previously. it is the GMM estimator in system that’s used. How does it work? The xtabond2 commands allows to estimate dynamic models either with the GMM estimator in difference or the GMM estimator in system. eq(level). Otherwise. the lagged variables (in level) of each variable from list1. at least two periods.Both commands permit you do deal with dynamic panels (where you want to use as independent variable lags of the dependent variable) as well with problems of endogeneity. leads to biased results. or for both. If you want to look at it. options1) iv(list2. the equation in level. When noleveleq is specified. You may want to have a look at them The commands are respectively "xtabond" and "xtabond2". By default. you must get it from the net (this is another feature of Stata. and b=●. and are used undifferentiated to serve as instruments in the equations in level. options2): • List2 is the list of variables that are strictly exogenous. Option robust: • This option allows to correct the t-test for heteroscedasticity. will be used as instruments. lag(1 2)) gmm (y. there is no test to know whether the on-step GMM estimator or two-step GMM estimator should be used. and options2 may take the following values: eq(diff). the xtabond2 command proceeds to a correction of the covariance matrix for finite samples.you can always get additional commands from the net). To fix this issue. if noleveleq is not specified. You type the following: findit xtabond2 The next steps to install the command should be obvious. it is the GMM estimator in difference that’s used. o Eq(diff). eq(both). 5. 4. in). lag(2 . 2. o Option mz replaces the missing values of the exogenous variables by zero. eq(diff) pass) allows to use variable x in level as an instrument in the equation in level as well as in the equation in difference. pass and mz. eq(level). Example: gmm(z. By default. But it reduces the statistical efficiency of the estimator in large samples. So far. 3. it means b is infinite. lag (2 3)) ⇒ for variable x. The pass option allows to prevent that exogenous variables are differentiated to serve as instruments in equations in first difference.)) ⇒ all the lagged variables of x and y.b). dated from t-a to tb. just type: help xtabond "xtabond2" is not a built in command in Stata. options): • list1 is the list of the non-exogenous independent variables • options1 may take the following values: lag(a. 
to obtain a description of the data.9 ------------------------------------------------------------------------------ test L.9612 Total | 39495157.44 841.chic Source | SS df MS Number of obs = 53 ---------+-----------------------------F( 2. Err. 50) = 0.6.egg ( 1) L. t P>|t| [95% Conf.8 2 19010988.25 2.999 0. For example. Option small: • This option replaces the z-statistics by the t-test results. _cons | 279. Interval] ---------+-------------------------------------------------------------------egg | L1 | .0712e+10 50 614248751 R-squared = 0.chic Source | SS df MS Number of obs = 53 ---------+-----------------------------F( 2.05 Prob > F = 0.1170e+11 52 2.289 0.92 Model | 8.5832 R-squared = 0.0 F( 1.277 -12.egg L.906597 1.099 0. Table 1.7140 Total | 1. Err. Std.3413 279.9613121 .72 40384. but here you need to regress chickens against the lags of chickens and the lags of eggs.9627 ---------+-----------------------------Adj R-squared = 0. A simple example in Stata: *Causality direction A: Do chickens Granger-cause eggs? For example.027241 35. it is required that you show explicitly what are the NULL and ALTERNATIVE hypotheses of this test.0984e+10 2 4.chic = 0.569 170065. detail or other functions presented in the previous tutorials.0000 Residual | 1473179.933252 -1.016027 chic | L1 | -.203 0.32139 3.6937 0.000 . using one lag you have: regress chic L. and the regression equations you are going to run. The results of Thurman and Fisher's (1988).0009383 ( 1) L.1480e+09 Root MSE = 24784 -----------------------------------------------------------------------------chic | Coef.323 -282.22156 3.egg = 0.65 -----------------------------------------------------------------------------egg | Coef.egg L.0492e+10 Prob > F = 0.829 -. Once again. 50) = 645.032 7837.0005237 -0. Std.0 © Thierry Warin 71 © Thierry Warin 72 .1226 -----------------------------------------------------------------------------And you can test if chickens Granger cause eggs using a F-test: test L.217 0.0011655 .8292 **Causality direction B: Do eggs Granger-cause chickens? This involves the same techniques.25 Root MSE = 171.chic TESTS In need for a causality test? The first thing to do is to use the command summarize.6830493 .16 50 29463. Interval] ---------+-------------------------------------------------------------------egg | L1 | -4.0000 Residual | 3.0001136 .9868117 _cons | 88951.8349305 . using the number of lags equals 1 you proceed as follows: regress egg L.9 Prob > F = 0.0 52 759522. can be easily replicated using OLS regressions and the time series commands introduced in the previous tutorials.57878 chic | L1 | .24 Model | 38021977.7250 ---------+-----------------------------Adj R-squared = 0. t P>|t| [95% Conf.000 .042 0.075617 11. 50) = 65. 0 F( 4.26989 80.egg = 0.0 L2.3336602 . plus a graphical analysis.496 0. Interval] ---------+-------------------------------------------------------------------egg | L1 | 87.76817 -1.0184877 .21 Prob > F = 0.egg L2. 41) = 4. 50) = 1.252 0.7802 Total | 1.3 -----------------------------------------------------------------------------and then test the joint significance of all lags of eggs © Thierry Warin 73 © Thierry Warin 74 .2059394 -0. Just remember to test the joint hypothesis of non-significance of the "causality" terms.egg ( 1) ( 2) ( 3) ( 4) L.886 -.2 241007.142 -146.egg L2. and 4.0057 Do that for the for lags 1.3.0461663 . 41) = 22.84086 L4 | -22.38472 26.090 0.chic L4.0000 Residual | 2.26 Prob > F = 0. 
Please provide a table in the same format of Thurman and Fisher's (1988).0256691 .1934323 1.15897 chic | L1 | .egg L3.egg L4.8466 21.egg = 0.egg L.63552 30.0 L3.2.3849984 _cons | 147330.929 -.0961e+11 49 2.75 Model | 8.3 46385.002 33.235 -.144 0.2332566 .246 0.85845 L3 | -8.09684 -0.214513 44.003 53653.8697736 L3 | -.740 0.1779262 0.9451e+10 8 1. containing your results.2039095 2.8161 ---------+-----------------------------Adj R-squared = 0.2369e+09 Root MSE = 22171 -----------------------------------------------------------------------------chic | Coef.1181e+10 Prob > F = 0.egg L3.11014 141.59828 -0.egg L4. Std.4343907 .176 0.0154e+10 41 491569158 R-squared = 0.623901 L2 | .186 0.chic Source | SS df MS Number of obs = 50 ---------+-----------------------------F( 8.87471 3.030 . Err.0 L4.chic L3.3974153 L4 | .egg = 0. Example: Do eggs Granger cause chickens (in four lags)? regress chic L.6593 L2 | -62.2772 test L. Causality in further lags: To test Granger causality in further lags.chic L2. t P>|t| [95% Conf.853 -97.egg = 0. the procedures are the same.45797 .1573878 .49408 41.206 0.464 -84.32 3.43 39.F( 1. . Probit and logit regressions Probit and logit regressions are models designed for binary dependent variables. Probit regression Pr ( Y = 1 X 1 . Because a regression with a binary dependent variable Y models the probability that Y=1.. + β k X k ) Pr ( Y = 1 X 1 ... X k ) = F ( β0 + β1 X 1 + ... Logit regression uses the logistic cumulative probability distribution function... + βk X k ) Where φ is the cumulative standard normal distribution.. (15) Logit regression Pr ( Y = 1 X 1 .. X k ) = 1 1+ e − ( β 0 + β1 X 1 +....+ β k X k ) (16) Logit regression is similar to probit regression except that the cumulative distribution function is different.. © Thierry Warin 75 © Thierry Warin 76 ... Probit regression uses the standard normal cumulative probability distribution function.. it makes sense to adopt a nonlinear formulation that forces the predicted values to be between zero and one. X k ) = φ ( β0 + β1 X 1 + ..Linear probability model Maximum likelihood estimation 1. kdensity timedrs. clear Let's start by checking univariate the distribution of these variables. visits physical/mental health prof ------------------------------------------------------------Percentiles Smallest 1% 0 0 5% 0 0 10% 1 0 Obs 465 25% 2 0 Sum of Wgt.698121 Life Change Units ------------------------------------------------------------Percentiles Smallest 1% 0 0 5% 25 0 10% 59 0 Obs 465 25% 98 0 Sum of Wgt.098588 EXAMPLES Health Care use http://www.99% 12 15 Kurtosis 4.23763 81 Kurtosis 15. Dev.6005144 18 Kurtosis 2. We see that timedrs phyheal and stress show considerable skewness.768424 Let's graph the distribution of the variables.94849 60 60 Variance 119. Dev. Dev. of physical health problems ------------------------------------------------------------Percentiles Smallest 1% 2 2 5% 2 2 10% 2 2 Obs 465 25% 3 2 Sum of Wgt. Variance Skewness 4. 465 50% 178 Mean 204. Dev. .edu/stat/stata/modules/reg/health. . 4.66 95% 441 731 Skewness 1. 465 50% 75% 90% 95% 99% 6 9 12 14 17 Mean 6.039773 99% 594 920 Kurtosis 4.028006 © Thierry Warin 77 © Thierry Warin 78 .972043 2.7927 75% 278 597 90% 389 643 Variance 18439.ats.2172 Largest Std.901075 Largest Std.122581 Largest Std.193594 17 18 Variance 17. normal No. 465 50% 75% 90% 95% 5 6 8 9 Largest 13 13 14 Mean Std. 10. summarize timedrs phyheal menheal stress.8695 75 Skewness 3. 
of mental health problems ------------------------------------------------------------Percentiles Smallest 1% 0 0 5% 0 0 10% 1 0 Obs 465 25% 3 0 Sum of Wgt.388296 5.ucla.58623 18 Skewness . detail No.9472 No. 135. 465 50% 75% 90% 95% 99% 4 10 18 27 58 Mean 7.703958 1. normal Even though we know there are problems with these variables. kdensity menheal.3154 3 4056. normal © Thierry Warin 79 © Thierry Warin 80 .10512 Prob > F = 0. kdensity stress.2188 ---------+-----------------------------Adj R-squared = 0.0000 Residual | 43451. normal From the graphs above. Below we create scatterplot matrices and they clearly show problems that need to be addressed. timedrs and phyheal seem the most skewed.7085 .869503 Root MSE = 9. while stress is somewhat less skewed. . graph timedrs phyheal menheal stress. 461) = 43.1341 461 94. kdensity phyheal.) ..254087 R-squared = 0. regress timedrs phyheal menheal stress Source | SS df MS Number of obs = 465 ---------+-----------------------------F( 3.2137 Total | 55619. .4495 464 119. let's try running a regression and examine the diagnostics.03 Model | 12168. matrix symbol(. 4771213 Obs 465 25% . Err.296 0.9542425 1.0207128 _cons | -3.30103 0 Obs 465 25% .770852 1. Tabachnick and Fidell recommend a log (to the base 10) transformation for ltimedrs and phyheal and a square root transformation for stress. .940 -.000 .811711 lphyheal ------------------------------------------------------------Percentiles Smallest 1% . rvfplot ltimedrs ------------------------------------------------------------Percentiles Smallest 1% 0 0 5% 0 0 10% . 465 50% 75% 13.769 0. . Dev.83 Prob > chi2 = 0.786948 . Interval] ---------+-------------------------------------------------------------------phyheal | 1.880814 Skewness .146128 Variance .113943 1. .913814 Kurtosis 2.075 0.352511 2.0036121 3. summarize ltimedrs lphyheal sstress.4771213 . hettest Cook-Weisberg test for heteroscedasticity using fitted values of timedrs Ho: Constant variance chi2(1) = 148.845098 1. We make these transformations below. generate lphyheal = log10(phyheal+1) .60206 .4771213 0 Sum of Wgt.7781513 Mean .4152538 1.914029 -1. .67333 24. generate ltimedrs = log10(timedrs+1) .78533 Variance . t P>|t| [95% Conf. .972175 © Thierry Warin 81 © Thierry Warin 82 .447158 1.704848 1.495666 -----------------------------------------------------------------------------The rvfplot shows a real fan spread pattern where the variability of the residuals grows across the fitted values.0065162 .000 1.34166 Mean Largest Std.2210735 8.0000 Let's address the problems of non-normality and heteroscedasticity.124195 -3.-----------------------------------------------------------------------------timedrs | Coef.278754 1.221385 menheal | -.78533 1.2277155 1.354632 sstress ------------------------------------------------------------Percentiles Smallest 1% 0 0 5% 5 0 10% 7.001 -5.4771213 10% .1668434 .0278367 1 1.39955 4.1724357 1. detail 50% 75% 90% 95% 99% .4771213 .2438915 stress | .43358 13.2632227 .7437625 Largest Std.0136145 . 465 The hettest command confirms there is a problem of heteroscedasticity. 465 50% 75% 90% 95% 99% . Dev. These transformations have nearly completely reduced the skewness.083 0.69897 Mean .1290286 -0.176091 Skewness .20412 Kurtosis 2.4771213 Sum of Wgt.681146 0 Obs 465 25% 9. Std.4771213 .1555756 1.041393 1.741285 Largest Std.899495 0 Sum of Wgt. generate sstress = sqrt(stress) Let's examine the distributions of these new variables.146128 . . 16.4771213 5% .0096656 . Dev. 
3747 Total | 80.000 1.90% 19.293965 .107815788 R-squared = 0. kdensity ltimedrs. 461) = 93.32835 -----------------------------------------------------------------------------ltimedrs | Coef.172435699 Root MSE = . Std. The distributions look pretty good.72308 25.3788 ---------+-----------------------------Adj R-squared = 0. normal © Thierry Warin 83 © Thierry Warin 84 .35744 Variance 24.70 Model | 30. kdensity lphyheal.37212 30.1077396 12. . normal The scatterplots for the transformed variables look better.3315 Kurtosis 3.102605 Let's use kdensity to look at the distribution of these new variables. matrix symbol(.) . kdensity sstress.0908912 99% 24. normal Now let's try running a regression and diagnostics with these transformed variables.0000 Residual | 49. .505687 . graph ltimedrs lphyheal menheal sstress. regress ltimedrs lphyheal menheal sstress Source | SS df MS Number of obs = 465 ---------+-----------------------------F( 3.010 0.0101644 464 . Interval] ---------+-------------------------------------------------------------------lphyheal | 1.102362 Prob > F = 0.3070861 3 10.082244 1.7030783 461 . Err.72252 95% 21 27.03701 Skewness -. t P>|t| [95% Conf. . 0016188 . There still is a flat portion in the bottom left of the plot.0156626 . avplots The hettest command is no longer significant. ovtest Ramsey RESET test using powers of the fitted values of ltimedrs Ho: model has no omitted variables F(3. rvfplot Ho: model has no omitted variables F(9.713 -.0102645 sstress | .0033582 4.2923398 -----------------------------------------------------------------------------The distribution of the residuals looks better.4409002 . .664 0.5525 Examination of the added variable plots below show no dramatic problems.0070268 . and see that observation 548 is the observation we identified in the plot above.87 Prob > F = 0.000 . hettest Cook-Weisberg test for heteroscedasticity using fitted values of ltimedrs Ho: Constant variance chi2(1) = 0. studentized residuals.0222619 _cons | -. These result look mostly OK. 458) = 0. graph rstu l [w=d] Below we show the same plot showing the subject number. .368 0.000 -. leverage . rstudent . ovtest.6134 We use the ovtest with the rhs option to test for omitted higher order trends (e.3529 We use the ovtest command to test for omitted variables from the equation.0043995 0. predict rstud. . .60 Prob > F = 0. rhs Ramsey RESET test using powers of the independent variables Let's create leverage. predict l. The results suggest no omitted variables. and Cook's D. . The results suggest there are no omitted higher order trends. .5894606 -. but not a very large leverage.g. . quadratic. cubic trends). 452) = 0. graph rstu l. cooksd .menheal | . There is one observation in the middle top-right section that has a large Cook's D (large bubble) a fairly large residual.0755985 -5. symbol([subjno]) © Thierry Warin 85 © Thierry Warin 86 . and plot these.0090632 . and there is a residual in the top left.832 0.86 Prob > chi2 = 0. predict d. suggesting that the residuals are homoscedastic. 307 0. regress ltimedrs lphyheal menheal sstress.16334 1.910 0. rreg ltimedrs lphyheal menheal sstress Huber iteration 1: maximum difference in weights = .0699104 -6.400 0.0156626 . 
.66878052 Huber iteration 2: maximum difference in weights = .0000 Pseudo R2 = 465 -----------------------------------------------------------------------------| Robust ltimedrs | Coef.3788 Root MSE = .3035176 -----------------------------------------------------------------------------Let's try robust regression and again check to see if the results change.01514261 Log likelihood = -2398.000 -. .5995681 -.1570797 . Err.0000 R-squared = 0.1084569 11.0044165 1. Err.507096 menheal | .000 .4409002 .0223958 _cons | -.1019097 13.0000 465 The residuals look like they are OK.0094834 sstress | .314 0.466 0.0016188 .772 Poisson regression Number of obs = LR chi2(3) = 1307. Std. t P>|t| [95% Conf.00324634 Robust regression estimates Number of obs = F( 3.000 . we could have tried analyzing the data using poisson regression. We try analyzing the original variables using poisson regression.000 .0044805 0.363605 .3092 Iteration 1: log likelihood = -2398. 461) = 114.000 1.162 -.0065397 25.Biweight iteration 6: maximum difference in weights = .000 . Interval] ---------+-------------------------------------------------------------------lphyheal | 1.980 0. Std.718 -.0011976 . Interval] ---------+-------------------------------------------------------------------phyheal | .66 Prob > F = 0.931 0. Err.252 0. z P>|z| [95% Conf.571 0.05608508 Huber iteration 3: maximum difference in weights = .0016444 _cons | . robust Regression with robust standard errors Number of obs = F( 3. the results below (using robust standard errors) are virtually the same as the prior results.0148395 stress | .293965 . Indeed.0041615 0. 461) = 105.0124211 .1698972 .5782828 -. Let's try running the regression using robust standard errors and see if we get the same results. t P>|t| [95% Conf. the results are nearly identical to the original results.8240076 -----------------------------------------------------------------------------We can check to see if there is overdispersion in the poisson regression.0013055 .001421 .3185249 -----------------------------------------------------------------------------Since the dependent variable was a count variable.27285455 Biweight iteration 5: maximum difference in weights = .0715078 -6.0034264 4.0186633 _cons | -.0061789 .0031765 3.32835 465 -----------------------------------------------------------------------------ltimedrs | Coef.772 Iteration 2: log likelihood = -2398.000 1.0104235 sstress | .0068723 .56387 menheal | . by trying negative binomial regression.64 Prob > chi2 = 0.6558835 . Again.01226236 Biweight iteration 4: maximum difference in weights = .2142 -----------------------------------------------------------------------------timedrs | Coef.420 0.000114 12.0024729 . Interval] ---------+-------------------------------------------------------------------lphyheal | 1.7399455 . .4590465 .0061833 .754 -. poisson timedrs phyheal menheal stress Iteration 0: log likelihood = -2399.080834 1.772 0. nbreg timedrs phyheal menheal stress Fitting comparison Poisson model: © Thierry Warin 87 © Thierry Warin 88 .0089293 .361 0.94 Prob > F = 0.0071859 .1827147 menheal | . Std.000 -.381 0. .000 .0428896 17. 1821909 .000 .5528521 ---------+-------------------------------------------------------------------/lnalpha | -.000 -. indicating that the negative binomial model would be preferred over the poisson model.4704323 -. Std.0788172 -4.6247321 .0165 This module illustrated some of the diagnostic techniques and remedies that can be used in regression analysis.0168 log likelihood = -1453.0017756 . 
z P>|z| [95% Conf.0165 log likelihood = -1453.8849 log likelihood = -1360.3092 Iteration 1: log likelihood = -2398.909 0.7290933 .463 0.000 . © Thierry Warin 89 © Thierry Warin 90 .0629302 .850888 -----------------------------------------------------------------------------Likelihood ratio test of alpha=0: chi2(1) = 2075.3159535 .0000 Pseudo R2 = 0.2253456 .4125 log likelihood = -1453.2685003 menheal | . Fitting full model: Iteration 0: Iteration 1: Iteration 2: Iteration 3: Iteration 4: log likelihood = -1380.235 0. Err. The main problems shown here were problems of non-normality and heteroscedasticity that could be mended using log and square root transformations.382 0.772 Iteration 2: log likelihood = -2398.772 Fitting constant-only model: Iteration 0: Iteration 1: Iteration 2: Iteration 3: log likelihood = -1454.0130667 .001129 .5593 log likelihood = -1360.0124366 0.009 0.1249824 2.1614747 ---------+-------------------------------------------------------------------alpha | .0634 Negative binomial regression LR chi2(3) Prob > chi2 Log likelihood = -1360.8849 -----------------------------------------------------------------------------timedrs | Coef. Interval] ---------+-------------------------------------------------------------------phyheal | .0220181 10.0024222 _cons | .Iteration 0: log likelihood = -2399.26 = 0.3078912 .0000 The test of overdispersion (test of alpha=0) is significant.8849 Number of obs = 465 = 184.0356838 stress | .77 Prob > chi2 = 0.0113085 .014 .363 -.0003299 5.3758 log likelihood = -1362.8911 log likelihood = -1360.0574651 . linear-by-linear association test. For tables in which both rows and columns contain ordered values. Fisher’s exact test is computed when a table that does not result from missing rows or columns in a larger table has a cell with an expected frequency of less than 5. The Mantel-Haenszel common odds ratio is also computed. select Chi-square to calculate the Pearson chi-square and the likelihood-ratio chi-square. For tables that have the same categories in the columns as in the rows (for example.Appendix 1 The Crosstabs procedure forms two-way and multiway tables and provides a variety of tests and measures of association for two-way tables. select Cohen’ s Kappa. the Crosstabs procedure forms one panel of associated statistics and measures for each value of the layer factor (or a combination of values for two or more control variables). no) against LIFE (is life exciting. Goodman and Kruskal’s tau. conditional upon covariate patterns defined by one or more layer (control) variables. or dull). Fisher’s exact test. while the majority of large companies (more than 2500 employees) yield low service profits. select Risk for relative risk estimates and the odds ratio. When both table variables are quantitative. For example. The McNemar test is a nonparametric test for two related dichotomous variables. McNemar. When one variable is categorical and the other is quantitative. Spearman’s rho. Lambda (symmetric and asymmetric lambdas and Goodman and Kruskal’s tau). Correlations yields the Pearson correlation coefficient. Nominal. r. Are customers from small companies more likely to be profitable in sales of services (for example. Pearson chi-square. For tables in which both rows and columns contain ordered values. It tests for changes in responses using the chi-square distribution. When both table variables (factors) are quantitative. Risk. 
the results for a two-way table for the females are computed separately from those for the males and printed as panels following one another. Nominal by Interval. For tables with any number of rows and columns. Cochran’s and Mantel-Haenszel. gamma. rho (numeric data only). For tables with two rows and two columns. The categorical variable must be coded numerically. Crosstabs’ statistics and measures of association are computed for two-way tables only. the likelihood-ratio chi-square. and Jewish). Somers’ d. you might learn that the majority of small companies (fewer than 500 employees) yield high service profits. symmetric and asymmetric lambdas. Contingency coefficient. a measure of linear association between the variables. and Yates’ corrected chi-square (continuity correction). measuring agreement between two raters). phi. For predicting column categories from row categories. and Uncertainty coefficient. training and consulting) than those from larger companies? From a crosstabulation. and a layer factor (control variable). and Kendall’s tau-c. select Eta. Correlations. Pearson’s r. along with Breslow-Day and Tarone's statistics for testing the homogeneity of the common odds ratio. Cramér’s V. Chi-square. if GENDER is a layer factor for a table of MARRIED (yes. For nominal data (no intrinsic order. Kendall’s tau-c. select Gamma (zero-order for 2-way tables and conditional for 3-way to 10-way tables). Cohen’s kappa. eta coefficient. Yates’ corrected chisquare. Cochran's and Mantel-Haenszel. select Chi-square to calculate the Pearson chi-square. Cochran’s and Mantel-Haenszel statistics can be used to test for independence between a dichotomous factor variable and a dichotomous response variable. If you specify a row. Correlations yields Spearman’s correlation coefficient. Kendall’s tau-b. select Somers’ d. Example. contingency coefficient. Yates’ corrected chi-square is computed for all other 2 ´ 2 tables. Fisher’s exact test. For tables with two rows and two columns. Spearman’s rho is a measure of association between rank orders. a column. © Thierry Warin 91 © Thierry Warin 92 . you can select Phi (coefficient) and Cramér’s V. Ordinal. Statistics and measures of association. Kendall’s tau-b. The structure of the table and whether categories are ordered determine what test or measure to use. Kappa. uncertainty coefficient. likelihood-ratio chisquare. Protestant. McNemar test. Chi-square yields the linear-by-linear association test. odds ratio. relative risk estimate. routine. such as Catholic. It is useful for detecting changes in responses due to experimental intervention in "before and after" designs. For 2 ´ 2 tables. References © Thierry Warin 93 .
Copyright © 2024 DOKUMEN.SITE Inc.