EDU5950 SEM2 2010-11
CORRELATION & SIMPLE REGRESSION

Correlation - Test of association
- A correlation measures the "degree of association" between two variables (interval or ordinal)
- Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other)
- Correlation is measured in "r" (parametric, Pearson's) or "ρ" (non-parametric, Spearman's)

Test of association - Correlation
- Compare two continuous variables in terms of degree of association, e.g. attitude scale vs behavioural frequency
[Figure: two scatterplots, one labelled Positive and one labelled Negative]

Test of association - Correlation
- Test statistic is "r" (parametric) or "ρ" (non-parametric)
- Its magnitude runs from 0 (random distribution, zero correlation) to 1 (perfect correlation); the sign gives the direction of the association
[Figure: scatterplots labelled High, Low and Zero correlation]

Regression & Correlation
- A correlation measures the "degree of association" between two variables: interval (50, 100, 150, …) or ordinal (1, 2, 3, …)

Example: Height vs. Weight
Graph One: Relationship between Height and Weight
[Figure: scatterplot of Weight (kgs) against Height (cms)]
- Strong positive correlation between height and weight
- Can see how the relationship works, but cannot predict one from the other
- If 120cm tall, then how heavy?

Example: Symptom Index vs Drug A
Graph Two: Relationship between Symptom Index and Drug A
[Figure: scatterplot of Symptom Index against Drug A (dose in mg)]
- Strong negative correlation
- Can see how relationship works, but cannot make predictions
- What Symptom Index might we predict for a standard dose of 150mg?

Example: Symptom Index vs Drug A
Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug A (dose in mg), with best-fit line]
- "Best fit line" allows us to describe relationship between variables more accurately
- We can now predict specific values of one variable from knowledge of the other
- All points are close to the line

Example: Symptom Index vs Drug B
Graph Four: Relationship between Symptom Index and Drug B (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug B (dose in mg), with best-fit line]
- We can still predict specific values of one variable from knowledge of the other
- Will predictions be as accurate? Why not?
- "Residuals": the points lie further from the line than they do for Drug A

Correlation examples
[Figure]

Regression
- Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some dependent variable (DV) for all members of a population.
- A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.
- The most basic application of regression analysis is the bivariate situation, which is referred to as simple linear regression, or just simple regression.
- Simple regression involves a single independent variable (IV) and a single DV.
- Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
- Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.
- The correlation tells us how much information about the DV is contained in the IV.
- If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to perfectly predict one from the other.
- Regression analysis is the means by which we determine the best-fitting line, called the regression line.
- The regression line is the straight line that lies closest to all points in a given scatterplot.
- This line always passes through the centroid of the scatterplot, i.e. the point (mean of X, mean of Y).

- 3 important facts about the regression line must be known:
  - The extent to which points are scattered around the line
  - The slope of the regression line
  - The point at which the line crosses the Y-axis
- The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y).
- This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.
- The degree of slope is determined by the amount of change in Y that accompanies a unit change in X.
- It is the slope that largely determines the predicted values of Y from known values for X.
- It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).
- The regression line is essentially an equation that expresses Y as a function of X.
- The basic equation for simple regression is:
  Y = a + bX
  where Y is the predicted value for the DV, X is the known raw score value on the IV, b is the slope of the regression line, and a is the Y-intercept.

Simple Linear Regression
♠ Purpose
  ► determine relationship between two metric variables
  ► predict value of the dependent variable (Y) based on value of independent variable (X)
♠ Requirement:
  ► DV: Interval / Ratio
  ► IV: Interval / Ratio
♠ Requirement:
  - The independent and dependent variables are normally distributed in the population
  - The cases represent a random sample from the population

Simple Regression
How best to summarise the data?
[Figure: two scatterplots of Symptom Index against Drug A (dose in mg), the second with a best-fit line]
- Adding a best-fit line allows us to describe data simply

General Linear Model (GLM)
How best to summarise the data?
- Establish equation for the best-fit line: Y = a + bX
  Where:
  a = y intercept (constant)
  b = slope of best-fit line
  Y = dependent variable
  X = independent variable
[Figure: scatterplot with best-fit line]

Simple Regression
R² - "Goodness of fit"
- For simple regression, R² is the square of the correlation coefficient
- Reflects variance accounted for in data by the best-fit line
- Takes values between 0 (0%) and 1 (100%)
- Frequently expressed as percentage, rather than decimal
- High values show good fit, low values show poor fit

Simple Regression
Low values of R²
[Figure: scatterplot of DV against IV (regressor, predictor)]
- R² = 0 (0% - randomly scattered points, no apparent relationship between X and Y)
- Implies that a best-fit line will be a very poor description of data

Simple Regression
High values of R²
[Figure: two scatterplots of DV against IV]
- R² = 1 (100% - points lie directly on the line - perfect relationship between X and Y)
- Implies that a best-fit line will be a very good description of data

Simple Regression
R² - "Goodness of fit"
[Figure: Symptom Index against Drug A (dose in mg) and Symptom Index against Drug B (dose in mg), each with best-fit line]
- Drug A: Good fit ⇒ R² high, high variance explained
- Drug B: Moderate fit ⇒ R² lower, less variance explained
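The link between the best-fit line and R² can be checked numerically. The sketch below is an added illustration, not part of the original slides: the dose and symptom values are made up, and the helper name `fit_line` is hypothetical. It fits Y = a + bX by least squares, computes R² as the share of variance accounted for by the line, and compares it with the squared Pearson correlation.

```python
import math


def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = ss_xy / ss_xx          # slope: change in Y per unit change in X
    a = mean_y - b * mean_x    # intercept: where the line crosses the Y-axis
    return a, b


# Hypothetical data (assumption): dose in mg and a symptom index.
dose = [50, 100, 150, 200, 250]
symptom = [150, 118, 97, 62, 30]

a, b = fit_line(dose, symptom)
predicted = [a + b * xi for xi in dose]

# R squared = 1 - (residual variation / total variation)
mean_x = sum(dose) / len(dose)
mean_y = sum(symptom) / len(symptom)
ss_total = sum((yi - mean_y) ** 2 for yi in symptom)
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(symptom, predicted))
r_squared = 1 - ss_residual / ss_total

# Pearson correlation coefficient r, for comparison with R squared
ss_xx = sum((xi - mean_x) ** 2 for xi in dose)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(dose, symptom))
r = ss_xy / math.sqrt(ss_xx * ss_total)

print(f"Y = {a:.2f} + ({b:.3f})X")
print(f"R squared = {r_squared:.4f}, r squared = {r ** 2:.4f}")
print(f"Predicted symptom index at 150 mg: {a + b * 150:.1f}")
```

With a negative association the slope b and the correlation r are both negative, while R² stays between 0 and 1.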
Problem: to draw a straight line through the points that best explains the variance
[Figure: scatterplot of points, X from 0 to 6, Y from 0 to 9]
- Line can then be used to predict Y from X

Regression
- Establish equation for the best-fit line: Y = a + bX
- Best-fit line same as regression line
- b is the regression coefficient for x
- x is the predictor or regressor variable for y

Regression - Types
[Figure]

Linear Regression - Model
Population: Yi = β0 + β1Xi + εi, with regression coefficients β0 (the constant) and β1
Sample: Ŷ = a + bX
- The population parameters β0 and β1 are simply the least squares estimates computed on all the members of the population, not just the sample
- Population parameters: β0 and β1
- Sample statistics: a and b

Inference About the Population Slope and Intercept
Y = β0 + β1X + ε
- β0 + β1X is the mean of Y for those whose independent variable is X
- If β1 > 0 then we have a graph like this: [Figure: line β0 + β1X rising with X]
- If β1 < 0 then we have a graph like this: [Figure: line β0 + β1X falling with X]
- If β1 = 0 then we have a graph like this: [Figure: line β0 + β1X flat]
  Note how the mean of Y does not depend on X: Y and X are independent
Copyright (c) Bani K. Mallick

Linear Regression and Correlation
Y = β0 + β1X + ε
- If β1 = 0 then Y and X are independent
- So, we can test the null hypothesis H0: that Y and X are independent, by testing H0: β1 = 0
- The p-value in regression tables tests this hypothesis
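As a sketch of how H0: β1 = 0 can be tested by hand, the block below uses the ten assignment and test score pairs from Example 1 in these notes. It computes the slope b, its standard error, and t = b / SE(b) with n - 2 degrees of freedom. The standard-error formula and the critical value 2.306 (two-tailed, α = 0.05, df = 8, from a standard t table) are not stated in the slides and are assumptions of this illustration, as are all variable names.

```python
import math

# Example 1 data: average assignment score (X) and test score (Y)
assign = [8.5, 6, 9, 10, 8, 7, 5, 6, 7.5, 5]
test = [88, 66, 94, 98, 87, 72, 45, 63, 85, 77]

n = len(assign)
mean_x = sum(assign) / n
mean_y = sum(test) / n

ss_xx = sum((x - mean_x) ** 2 for x in assign)
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(assign, test))

b = ss_xy / ss_xx            # sample slope
a = mean_y - b * mean_x      # sample intercept

# Residual variation around the fitted line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(assign, test))
mse = sse / (n - 2)          # mean square error, n - 2 degrees of freedom

se_b = math.sqrt(mse / ss_xx)   # standard error of the slope
t_stat = b / se_b               # test statistic for H0: beta1 = 0

# Assumption: critical value taken from a standard t table
# (two-tailed, alpha = 0.05, df = n - 2 = 8).
T_CRITICAL = 2.306
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"b = {b:.3f}, a = {a:.2f}")
print(f"SE(b) = {se_b:.3f}, t = {t_stat:.3f}, df = {n - 2}")
print("Reject H0: beta1 = 0" if reject_h0 else "Fail to reject H0: beta1 = 0")
```

Statistical packages report the same test as a p-value in the coefficients table; comparing |t| with a tabled critical value is the hand-calculation equivalent.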
Ice Cream Example
X (Temperature): 63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92, 75, 98, 100, 92, 87, 84, 88, 80, 82, 76
Y (Sales): [values not recoverable]
[Figure: scatterplot of Ice Cream Sales against Temperature]

Simple Regression Line
Ŷ = a + bX

TWO STEPS TO SIMPLE LINEAR REGRESSION
1. Descriptive
   ● Regression equation: Ŷ = a + bX
   ● Correlation coefficient (r)
   ● Coefficient of Determination (r²)
2. Inferential
   Hypothesis Test:
   ● Regression Model
   ● Slope

First Step - Descriptive
Derive Regression / Prediction equation
● Calculate a and b
  b = [ΣXY - (ΣX)(ΣY)/n] / [ΣX² - (ΣX)²/n]
  a = ȳ - b x̄
  Ŷ = a + bX

Example 1: Data were collected from a randomly selected sample to determine relationship between average assignment scores and test scores in statistics. Distribution for the data is presented in the table below.
1. Calculate coefficient of determination and the correlation coefficient
2. Determine the prediction equation
3. Test hypothesis for the slope at 0.05 level of significance

Data set:
ID    Assign (X)    Test (Y)
1     8.5           88
2     6             66
3     9             94
4     10            98
5     8             87
6     7             72
7     5             45
8     6             63
9     7.5           85
10    5             77

Summary stat:
n      ΣX     ΣY      ΣX²      ΣY²       ΣXY
10     72     775     544.5    62,441    5,795.5
x̄ = 7.2, ȳ = 77.5

Derive Regression / Prediction equation
b = 215.5 / 26.1 = 8.257
a = ȳ - b x̄ = 77.5 - 8.257(7.2) = 18.05
Prediction equation: Ŷ = 18.05 + 8.257X

Interpretation of regression equation
Ŷ = 18.05 + 8.257X
For every 1 unit change in X, Y will change by 8.257 units
[Figure: regression line with Y-intercept 18.05 and slope ΔY/ΔX = 8.257]

Example 2: MARITAL SATISFACTION
Parents (X):  1, 3, 7, 9, 8, 4, 5
Children (Y): 3, 2, 6, 7, 8, 6, 3
Quantities to compute: number of pairs; mean of X, ΣX, ΣX², standard deviation of X; mean of Y, ΣY, ΣY², standard deviation of Y; ΣXY

Derive Regression / Prediction equation
a = ȳ - b x̄ = 5.00 - .65(5.29) = 1.56
Prediction equation: Ŷ = 1.56 + .65X

Interpretation of regression equation
Ŷ = 1.56 + .65X
For every 1 unit change in X, Y will change by .65 units
[Figure: regression line with Y-intercept and slope ΔY/ΔX marked]

Regression output: Grade - PMR MATH on TEACHER_FACTOR

Descriptive Statistics
                      Mean      Std. Deviation    N
Grade - PMR MATH      2.53      1.468             62
TEACHER_FACTOR        3.9643    .91443            62

Correlations
                                           Grade - PMR MATH    TEACHER_FACTOR
Pearson Correlation    Grade - PMR MATH    1.000               .571
                       TEACHER_FACTOR      .571                1.000
Sig. (1-tailed)        Grade - PMR MATH    .                   .000
                       TEACHER_FACTOR      .000                .
N                      Grade - PMR MATH    62                  62
                       TEACHER_FACTOR      62                  62

Model Summary(b)
Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .571(a)    .326        .315                 1.215
a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1       Sum of Squares    df    Mean Square    F         Sig.
Regression    42.848            1     42.848         29.019    .000(a)
Residual      88.588            60    1.476
Total         131.435           61
a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
                  Unstandardized Coefficients    Standardized Coefficients
Model 1           B         Std. Error           Beta                         t         Sig.
(Constant)        -1.101    .692                                              -1.591    .117
TEACHER_FACTOR    .917      .170                 .571                         5.387     .000
a. Dependent Variable: Grade - PMR MATH

Regression output: Grade - PMR MATH on TEACHER_FACTOR and Race

Descriptive Statistics
                      Mean      Std. Deviation    N
Grade - PMR MATH      2.53      1.468             62
TEACHER_FACTOR        3.9643    .91443            62
Race                  …         …                 62

Correlations
                                           Grade - PMR MATH    TEACHER_FACTOR    Race
Pearson Correlation    Grade - PMR MATH    1.000               .571              -.019
                       TEACHER_FACTOR      .571                1.000             .015
                       Race                -.019               .015              1.000
Sig. (1-tailed)        Grade - PMR MATH    .                   .000              .440
                       TEACHER_FACTOR      .000                .                 .453
                       Race                .440                .453              .
N = 62 for every pair

Model Summary(b)
Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .572(a)    .327        .304                 1.225
a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1       Sum of Squares    df    Mean Square    F         Sig.
Regression    42.939            2     21.469         14.313    .000(a)
Residual      88.497            59    1.500
Total         131.435           61
a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
                  Unstandardized Coefficients    Standardized Coefficients
Model 1           B         Std. Error           Beta                         t         Sig.
(Constant)        …         …                                                 …         …
TEACHER_FACTOR    .917      .172                 .571                         5.349     .000
Race              …         …                    -.026                        -.246     .806
a. Dependent Variable: Grade - PMR MATH
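The figures in the first regression output (single predictor TEACHER_FACTOR, N = 62) are tied together by standard identities, so they can be cross-checked from the sums of squares alone. The sketch below is an illustration added for checking, with hypothetical variable names. It takes the sums of squares and degrees of freedom from the ANOVA table and recomputes R Square, Adjusted R Square, the standard error of the estimate, F, and the t value for the slope (which in simple regression is the square root of F).

```python
import math

# Values read from the ANOVA table of the single-predictor model
ss_regression = 42.848
ss_residual = 88.588
df_regression = 1
df_residual = 60

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# Proportion of variance in the DV accounted for by the regression
r_square = ss_regression / ss_total
r = math.sqrt(r_square)  # magnitude of the correlation with one predictor

# Mean squares and the F ratio for the regression model
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

# Adjusted R Square corrects R Square for the degrees of freedom used
adjusted_r_square = 1 - ms_residual / (ss_total / df_total)

# Standard error of the estimate is the square root of the residual mean square
std_error_estimate = math.sqrt(ms_residual)

# With a single predictor, the slope's t statistic squared equals F
t_slope = math.sqrt(f_ratio)

print(f"R = {r:.3f}, R Square = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.3f}")
print(f"F = {f_ratio:.3f}, t (slope) = {t_slope:.3f}")
```

Small differences in the last decimal place against the printed output are expected, because the table entries used as inputs are already rounded.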