Course Notesci

March 21, 2018 | Author: Courtney Williams | Category: Probability Theory, Expected Value, Random Variable, Probability Distribution, Empty Set


Comments



Description

Econometric TheoryJohn Stachurski October 10, 2013 Contents Preface v I 1 Background Material Probability 1.1 1.2 1.3 1.4 1.5 Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 15 25 30 38 49 49 59 66 71 74 79 2 Linear Algebra 2.1 2.2 2.3 2.4 2.5 2.6 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . Span, Dimension and Independence . . . . . . . . . . . . . . . . . . . Matrices and Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . Random Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . Convergence of Random Matrices . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i CONTENTS ii 84 84 90 93 3 Projections 3.1 3.2 3.3 3.4 Orthogonality and Projection . . . . . . . . . . . . . . . . . . . . . . . Overdetermined Systems of Equations . . . . . . . . . . . . . . . . . . Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 II 4 Foundations of Statistics Statistical Learning 4.1 4.2 4.3 4.4 4.5 4.6 4.7 107 108 Inductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Parametric vs Nonparametric Estimation . . . . . . . . . . . . . . . . 125 Empirical Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . 137 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 153 5 Methods of Inference 5.1 5.2 5.3 5.4 5.5 Making Inference about Theory . . . . . . . . . . . . . . . . . . . . . . 153 Confidence Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Use and Misuse of Testing . . . . . . . . . . . . . . . . . . . . . . . . . 170 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 175 6 Linear Least Squares 6.1 6.2 6.3 6.4 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Transformations of the Data . . . . . . . . . . . . . . . . . . . . . . . . 179 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 October 10, 2013 J OHN S TACHURSKI CONTENTS iii III 7 Econometric Models Classical OLS 7.1 7.2 7.3 7.4 7.5 7.6 194 195 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Variance and Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 The FWL Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Normal Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 When the Assumptions Fail . . . . . . . . . . . . . . . . . . . . . . . . 215 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 226 8 Time Series Models 8.1 8.2 8.3 8.4 8.5 Some Common Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Dynamic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Maximum Likelihood for Markov Processes . . . . . . . . . . . . . . . 248 Models with Latent Variables . . . . . . . . . . . . . . . . . . . . . . . 256 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 264 9 Large Sample OLS 9.1 9.2 9.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 276 10 Further Topics 10.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 10.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 10.3 Breaking the Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 J OHN S TACHURSKI October 10, 2013 CONTENTS iv IV Appendices 294 295 11 Appendix A: An R Primer 11.1 The R Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 11.2 Variables and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 11.3 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 11.4 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311 11.5 Simple Regressions in R . . . . . . . . . . . . . . . . . . . . . . . . . . 319 12 Appendix B: More R Techniques 323 12.1 Input and Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 12.2 Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330 12.3 Repetition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 12.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340 12.5 General Programming Tips . . . . . . . . . . . . . . . . . . . . . . . . 345 12.6 More Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358 13 Appendix C: Analysis 360 13.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360 13.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 13.3 Logical Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 J OHN S TACHURSKI October 10, 2013 Preface This is a quick course on modern econometric and statistical theory, focusing on fundamental ideas and general principles. The course is intended to be concise— suitable for learning concepts and results—rather than an encyclopedic treatment or a reference manual. It includes background material in probability and statistics that budding econometricians often need to build up. Most of the topics covered here are standard, and I have borrowed ideas, results and exercises from many sources, usually without individual citations to that effect. Some of the large sample theory is new—although similar results have doubtless been considered elsewhere. The large sample theory has been developed to illuminate more clearly the kinds of conditions that allow the law of large numbers and central limit theorem to function, and to make large sample theory accessible to students without knowledge of measure theory. The other originality is in the presentation of otherwise standard material. In my humble opinion, many of the mathematical arguments found here are neater, more precise and more insightful than other econometrics texts at a similar level. Finally, the course also teaches programming techniques via an extensive set of examples, and a long discussion of the statistical programming environment R in the appendix. Even if only for the purpose of understanding theory, good programming skills are important. In fact, one of the best ways to understand a result in econometric theory is to first work your way through the proof, and then run a simulation which shows the theory in action. These notes have benefitted from the input of many students. In particular, I wish to thank without implication Blair Alexander, Frank Cai, Yiyong Cai, Patrick Carvalho, Paul Kitney, Bikramaditya Datta, Stefan Webb and Varang Wiriyawit. v Part I Background Material 1 and is typically denoted Ω. We can think of Ω as being the collection of all possible outcomes in a random experiment. then every extra detail of probability theory that you can grasp and internalize will prove an excellent investment. What follows will involve a few set operations. . Example 1. .1 Sample Spaces In setting up a model of probability. . . with the outcome being an integer ω in the set {1. . 1.1. in general. Let Ω := {1. we usually start with the notion of a sample space. which.1. and you might like to glance over the results on set operations (and the definition of functions) in §13.Chapter 1 Probability Probability theory forms the foundation stones of statistics and econometrics. If you want to be a first class statistician/econometrician.1. can be any nonempty set. . .1 Probability Models We begin with the basic foundations of probability theory. 1. 6}.1. . 6} represent the six different faces of a dice. 2 . The general idea is that a realization of uncertainty will lead to the selection of a particular ω ∈ Ω. A typical element of Ω is denoted ω . A realization of uncertainty corresponds to a roll of the dice. PROBABILITY MODELS 3 The specification of all possible outcomes Ω is one part of our probability model. P(Ω) = 1. Additivity: P( A ∪ B) = P( A) + P( B) whenever A. See §1.1. and hence ∅ ⊂ Ω). skirting technical details here. satisfies 1. because they are so messy that assigning probabilities to these sets cause problems for the theory. (For example. Ω is called the certain event because it always occurs (regardless of which outcome ω is selected. then the notation P( B) will represent the probability that event B occurs. we exclude a few troublesome subsets of Ω from F . Example 1. The other thing we need to do is to assign probabilities to outcomes. let B be the event {1. In the language of probability theory.1. 1991). subsets of Ω are called events.1. 2}. In order to make sure our model of probability is well behaved. Continuing example 1. ω ∈ Ω is true by definition). The empty set ∅ is called the impossible event. The number P( B) represents the probability that the face ω selected by the roll is either 1 or 2.. we wouldn’t want to have a B with P( B) = −93. that is is not the right way forward.” the statement ω ∈ B is true.1. but it turns out. if B is an event (i. The obvious thing to do here is to assign a probability to every ω in Ω.3. it’s best to put certain restrictions on P. F ) is a function that associates to each event in F a number in [0. Two events we always find in F are Ω itself and the empty set ∅.1. (The empty set is regarded as being a subset of every set. the standard approach is to assign probabilities to subsets of Ω.g.) These restrictions are imposed in the next definition: A probability P on (Ω. B ∈ F ). In all of what follows..1 Example 1.2 for more discussion. Let Ω be any sample space.2. 2013 . and. In many common situations. in addition. Williams. The way you should think about it is this: P( B) represents the probability that when uncertainty is resolved and some ω ∈ Ω is selected by “nature.e. In particular. Instead. 1]. The set of all events is usually denoted by F .1. The second stage of our model construction is to assign probabilities to elements of F . e. and 2. In this context.1. and we follow this convention. as negative probabilities don’t make much sense. we take F to be a proper subset of the set of all subsets of Ω. B ∈ F with A ∩ B = ∅. 1 I’m J OHN S TACHURSKI October 10. for technical reasons beyond the scope of this text (see. we see that P{2.1. let P( A) := # A/6. the number of elements in the union is just the number contributed by A plus the number contributed by B.1. and. and that P(Ω) = 1.4. and then sum their probabilities. In particular. disjoint) ways that the event could occur. 2013 .1.1. For example. Suppose that A and B are two disjoint subsets of {1. if A1 . Example 1. 6}. A J are disjoint in the sense that Ai ∩ A j = ∅ whenever i = j. 6} = P[{2} ∪ {4} ∪ {6}] = P{2} + P{4} + P{6} = 1/6 + 1/6 + 1/6 = 1/2 To finish off the proof that P is a probability on (Ω. P{2. .1) For example. since.e.1. . then P(∪ jJ=1 A j ) = ∑ jJ=1 P( A j ) also holds. . PROBABILITY MODELS 4 Together. It is simple to check that P is a probability on (Ω. The axioms in the definition of P are fairly sensible. but this implies additivity over any disjoint finite union. Formally. F ).5. where # A := number of elements in A (1. Third. the probability of getting an even number is the probability of getting a 2 plus that of getting a 4 plus that of getting a 6. . Remark 1. For starters.1. .1. P) is called a probability space. it is clear why we require P(Ω) = 1. let Ω := {1.1. if we roll the dice. See §1. . given this definition of P. . we are restricting the probability of any event to be between zero and one.2 if you’d like to know more. 6} = 3/6 = 1/2. Second. F ). These details are left to you. P( A ∪ B) = #( A ∪ B)/6 = (# A + #B)/6 = # A/6 + #B/6 = P( A) + P( B) The additivity property is very intuitive in this setting. See exercise 1. the triple (Ω. In this case we must have #( A ∪ B) = # A + # B. since the realization ω will always be chosen from Ω by its definition.. . F . . . we can determine all the different (i.1. 4. Continuing example 1. the additivity property is natural: To find the probability of a given event. Confession: I’ve simplified the standard definition of a probability space slightly—in ways that will never matter to us in this course—to avoid technical discussions that we don’t need to go into. it only remains to check that 0 ≤ P( B) ≤ 1 for any B ∈ F . . Additivity of P was defined for two sets. . by disjointness. 4. It describes a set of events and their probabilities for a given experiment. for A ∈ F . 6} represent the six different faces of a dice. Let’s check additivity. J OHN S TACHURSKI October 10. As a result. F .1.1. These claims are not hard to prove. b N ) : bn ∈ {0. J OHN S TACHURSKI October 10. and that P is additive. if A ⊂ B. starting with the next fact. b N ) : bn ∈ {on. then we have B = ( B \ A) ∪ A. . Rearranging gives part 1.5. One sample space for this experiment is Ω0 := {(b1 . and 4. while nonnegativity of P gives part 2. P(∅) = 0. . PROBABILITY MODELS 5 Example 1. . made up of billions of tiny switches. Let’s list the key ones.5. Imagine that a random number generator accesses a subset of N switches. From the axioms above.1. . and setting B = A gives part 4. we define P ( A ) : = 2− N (# A ) To see that this is indeed a probability on (Ω. F ) we need to check that 0 ≤ P( A) ≤ 1 for all A ⊂ Ω. Specializing to B = Ω gives part 3. Fact 1.1. (Sketching the Venn diagram will help confirm this equality in your mind.1. 2013 . that P(Ω) = 1. . we can derive a suprising number of properties. For example.7 asks you to confirm that P is additive. and let A. P) be a probability space. F .1. 3. P) is an arbitrary probability space. If A ⊂ B. As our probability. P( Ac ) := P(Ω \ A) = 1 − P( A). 2. . Let (Ω. That P(Ω) = 1 follows from the fact that the number of binary sequences of length N is 2 N . then 1. Consider a memory chip in a computer. regarding the part 1. off} for each n} Letting zero represent off and one represent on. Now let’s go back to the general case. 1} for each n} Thus. B ∈ F . P( B \ A) = P( B) − P( A). setting each one to “on” or “off” at random. Exercise 1. P( A) ≤ P( B). Ω is the set of all binary sequences of length N .1.) Since B \ A and A are disjoint. . where (Ω. . we can also use the more practical space Ω := {(b1 . additivity of P now gives P( B ) = P( B \ A ) + P( A ) (whenever A ⊂ B) This equality implies parts 1–4 of fact 1. For example. j ∈ {1.1.1. Ω := {(i. j ∈ J and k ∈ K J OHN S TACHURSKI October 10. j).1. while the second element j represents the outcome of the second roll. Many crucial ideas in probability boil down to this one point. (In this case. B ∈ F . . . given the information that B has occurred. for any A. and is fundamental. where #E is the number of elements in E ⊂ Ω. The basic principle for counting ordered tuples is that the total number of possible tuples is the product of the number of possibilities for each tuple. Hence. then the conditional probability of A given B is just the probability of A. we need to be able to count the number of elements in A. the number of distinct tuples and B := {(i. then we know that B occurs whenever A occurs (because if ω lands in A. indicating that A and B are independent under the probability P.6. the probability of B should be larger. j) ∈ Ω : i is even} In this case we have A ∩ B = {(i. If A and B are events.2) It represents the probability that A will occur. B and A ∩ B. Events A and B are called independent if P( A ∩ B) = P( A)P( B). then P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B ) In particular. If A and B are any (not necessarily disjoint) events. Example 1. elements are pairs.2. 2013 . Formally. j. so #E is the number of pairs in E. The first element i represents the outcome of the first roll.) Now consider the events A := {(i. then it also lands in B). then the conditional probability of A given B is P( A | B ) : = P( A ∩ B ) /P( B ) (1. j) ∈ Ω : i and j are even} With a bit of work we can verify that P( A ∩ B) = P( A)P( B). it requires that P( B) > 0. . 6}} For our probability. let’s define P( E) := #E/36.1. PROBABILITY MODELS 6 The property that if A ⊂ B. A suitable sample space is the set of pairs (i. . we have P( A ∪ B) ≤ P( A) + P( B). then P( A) ≤ P( B) is called monotonicity. k) where i ∈ I . Fact 1. To check this. If A ⊂ B. Consider an experiment where we roll a dice twice. j) ∈ Ω : j is even} (i. If A and B are independent. j) : i. where i and j are between 1 and 6. For the definition to make sense. and the number of elements in A ∩ B is 3 × 3 = 9. P( A ∩ B) = 9/36 = 1/4 = (18/36) × (18/36) = P( A)P( B) Thus. 2013 . . . eventually you will have to work your way through them. as claimed.1. The overall chance of losing money (LM) that evening is P{LM } = P{LM | play}P{play} + P{LM | don’t play}P{don’t play} which is (2/3) × (1/2) + 0 × (1/2) = 1/3.1. then P( A) = ∑ P( A | Bm ) · P( Bm ) m =1 M The proof is quite straightforward. Bm ∈ F for each m. 1. B M is a partition of Ω (i. which says that if A ∈ F and B1 . and this whole course can be completed successfully without knowing anything about them. Hence you can skip this section on first reading. I’ve swept some technical details under the carpet to make the presentation smooth. in my presentation of probability spaces. A and B are independent.1. as alluded to above. So let’s note them for the record.1. if you intend to keep going deeper into probability and statistics. J OHN S TACHURSKI October 10. the Bm ’s are mutually disjoint M B = Ω) with P( B ) > 0 in the sense that Bj ∩ Bk is empty when j = k.e. A very useful result is the law of total probability. . As a result. the number of elements in A is 3 × 6 = 18. PROBABILITY MODELS 7 is (# I ) × (# J ) × (#K ).2 Technical Details Okay. These details won’t affect anything that follows. . Hence. the number of elements in B is 6 × 3 = 18. the chance of losing money when I play is 2/3. Being a bad player. Here’s an informal example of the law of total probability: Suppose I flip a coin to decide whether to take part in a poker game. Nevertheless.7. and ∪m m =1 m for all m.. although you should check that the manipulations of intersections and unions work if you have not seen them before: M M P( A ) = P[ A ∩ ( ∪ m =1 Bm )] = P[ ∪m=1 ( A ∩ Bm )] = m =1 ∑ P( A ∩ Bm ) = ∑ P( A | Bm ) · P( Bm ) m =1 M M Example 1. is a disjoint sequence of sets in F (disjoint means that Ai ∩ A j = ∅ for any i = j). we don’t have to. we say that F is “closed under the taking of complements”. Also. For example. the events we assign probability to. 2013 . A2 .1. because doing so will make it hard to form a consistent theory. Suffice to say that when F satisfies all these properties. When this is true. because (see fact 13. There is one more restriction that’s typically placed on F . then P(∪i Ai ) := P{ω ∈ Ω : ω ∈ Ai for some i} = ∑ P( Ai ) i =1 ∞ J OHN S TACHURSKI October 10. which is the property of being closed under “countable” unions.) Another sensible restriction concerns complements. then A ∩ B is also in F . it would be natural to think about the probability of the event “ A and B”. then A ∪ B is also in F ? Actually. if F is closed under the taking of complements and intersections. So we also require that if A and B are in F .1 on page 362). so we assign probabilities to these events. We won’t go into details. . given that we can assign a probability to the event A. then F is automatically closed under the taking of unions. so that P( A) is well defined.” Finally. Thus. we require that P(Ω) = 1. . One restriction we always put on F is to require that it contains the empty set ∅ and the whole set Ω. in standard probability theory. and represents the “probability of event A. A ∪ B = ( Ac ∩ Bc )c Thus. usually we don’t just choose freely. When we choose F . the first stage of our model construction is to choose (i) a sample space Ω. which corresponds to Ac . let’s suppose that A ∈ F .” Now. In this case. then Ac ∈ F . . it would be a bit odd if we couldn’t assign a probability to the event “not A”. Hence we need Ω to be an element of F .” Perhaps we should also require that if A and B are in F . so in practice we permit our set of events F to be a sub-collection of the subsets of Ω. which corresponds to A ∩ B. We say that F is “closed under the taking of intersections. So normally we require that if A ∈ F . let’s suppose that A and B are both in F . and (ii) a collection of its subsets F that we want to assign probabilities to.1. called countable additivity. and only assign probabilities to elements of F . (In the definition of P. it is called a “σ-algebra. The definition of countable additivity is that if A1 .1. PROBABILITY MODELS 8 Assigning probabilities to all subsets of Ω in a consistent way can be quite a difficult task. there is another restriction placed on P that I have not mentioned. depending on the definition of x). That is. you can learn all about σ-algebras and countable additivity in any text on measure theory. . The definition of random variables on infinite Ω is actually a bit more subtle. In this course we will never meet the nasty functions. . . This is useful. The remaining “nice” functions are our random variables.1. we typically exclude some particularly nasty (i. In terms of example 1. Depending on the conversion rule (i. Thus. . .1. etc. I recommend Williams (1991).2 To visualize the definition.” or something to that effect. 1991). Example 1. PROBABILITY MODELS 9 Why strengthen additivity to countable additivity? Countable additivity works behind the scenes to make probability theory run smoothly (expectations operators are suitably continuous. in this formal model. A random variable x converts this sequence into a real number. and so on).. 1. 1} for each n} The set of events and probability were defined as follows: F := all subsets of Ω and P( A) := 2− N (# A) Consider a random variable x on Ω that returns the first element of any given sequence. . when identifying random variables with the class of all functions from Ω to R. however.5. and there’s no need to go into further details. None of these details will concern us in this course. a random number generator picks out an ω . complicated and generally badly behaved) functions. 2013 .1. and imagine that “nature” picks out an element ω in Ω according to some probability. the outcome x (ω ) might simulate a Bernoulli random variable.. Recall example 1. which is a binary sequence.. For this reason. The random variable now sends this ω into x (ω ) ∈ R. x (ω ) = x (b1 .1.1. consider a random variable x. a normal random variable. mathematicians define random variables to be functions from Ω into R. random variables convert outcomes in sample space into numerical outcomes. with sample space Ω := {(b1 .g.e. . b N ) : bn ∈ {0. In practice. because numerical outcomes are easy to manipulate and interpret.8.3 Random Variables If you’ve done an elementary probability course. If you wish.1. To work more deeply with these objects. Williams. you might have been taught that a random variable is a “value that changes randomly. a uniform random variable. .e. 2 I’m J OHN S TACHURSKI October 10. we need a sounder definition. Those who want to know more should consult any text on measure theory (e. b N ) = b1 skipping some technical details again.5. Imagine a computer with an infinite amount of memory. and hence we can ignore it. 1} for each n} Ω is called the set of all infinite binary sequences. . However.1. What if ω = ω0 is an infinite sequence containing only zeros? Then {n : bn = 1} = ∅. we can set x (ω0 ) = 0 without changing anything significant. . .5. .1. b N ) : b1 = 1} The number of length N binary sequences with b1 = 1 is 2 N −1 . (This is an infinite version of the sample space in example 1. x is a well-defined function from Ω into R. .e. In general (in fact always).) = min{n : bn = 1} As per the definition. x takes only the values zero and one). But then x is not a map into R.3 Let’s go back to the general case. The experiment of flipping until we get a heads can be modeled with the random variable x (ω ) = x (b1 . P). because it can take the value ∞. The probability that x = 1 is 1/2. For example. There is a generic way to create Bernoulli random variables. b N ) : b1 = 1} = 2− N × #{(b1 .) : bn ∈ {0. . using indicator functions. .1. P{ x = 1} : = P{ ω ∈ Ω : x ( ω ) = 1} = P{(b1 . so P{ x = 1} = 1/2. F . b2 .9. .1.e.1 for the definition of “range. a Bernoulli random variable has the form x ( ω ) = 1{ ω ∈ C } where C is some subset of Ω. . b2 . . and talk a bit more about Bernoulli (i. If Q is a statement.) Consider an experiment where we flip a coin until we get a “heads”. The convention here is to set x (ω0 ) = min{n : bn = 1} = min ∅ = ∞. 2013 . Thus.4 We can create disthat’s not strictly true. . . it turns out that this event has probability zero. Now we’re back to a well-defined function from Ω to R. and zero when the statement Q is false. such as “on the planet Uranus. Indeed. A discrete random variable is a random variable with finite range. .” 3 Actually. J OHN S TACHURSKI October 10.” then 1{Q} is considered as equal to one when the statement Q is true. PROBABILITY MODELS 10 Then x is a Bernoulli random variable (i. binary) random variables..1. Hence. Consider the sample space Ω := {(b1 . 4 See §13. x is a binary random variable indicating whether or not the event C occurs (one means “yes” and zero means “no”). 1{ Q} is a binary indicator of the truth of the statement Q. Example 1. We let 0 represent tails and 1 represent heads. with arbitrary probability space (Ω.. there exists a tribe of three-headed monkeys. . From Bernoulli random variables we can create “discrete” random variables. and zero otherwise. A J form partition of Ω.4) We will work with this expression quite a lot.) For example. . . and • the sets A1 . PROBABILITY MODELS 11 Figure 1. . t when ω falls in B.1. we can also define a discrete random variable as a random variable having the form x (ω ) = j =1 ∑ s j 1{ ω ∈ A j } J (1. In doing so.3) is a discrete random variable taking the value s when ω falls in A.1 shows a graph of x when Ω = R. . . The random variable x ( ω ) = s1{ ω ∈ A } + t1{ ω ∈ B } (1. Ai ∩ A j = ∅ when i = j. let A and B be disjoint subsets of Ω. we will always assume that • the scalars s1 . s J are distinct.1: Simple function x (ω ) = s1{ω ∈ A} + t1{ω ∈ B} crete random variables by taking “linear combinations” of Bernoulli random variables. 2013 . J OHN S TACHURSKI October 10. (Check it!) Figure 1. It turns out that any discrete random variable can be created by taking linear combinations of Bernoulli random variables. .1. In particular. . and ∪ j A j = Ω. . (A linear combination of certain elements is formed by multiplying these elements by scalars and then summing the result.5 5 That is. 1.1. PROBABILITY MODELS 12 Given these assumptions, we then have • x (ω ) = s j if and only if ω ∈ A j . • {x = sj } = Aj. • P{ x = s j } = P( A j ). The second two statements follow from the first. Convince yourself of these results before continuing. In addition to discrete random variables there are also random variables with infinite range. We’ll talk about these a bit more later on. Before doing so, we note that there’s a common notational convention with random variables that will be adopted in this course. With a random variable x, we often write { x has some property} as a shorthand for {ω ∈ Ω : x (ω ) has some property} We’ll follow this convention, but you should translated it backwards and forwards in your mind if you’re not yet familiar with the notation. To give you an example, consider the claim that, for any random variable x, P{ x ≤ a} ≤ P{ x ≤ b} whenever a ≤ b (1.5) This is intuitively obvious. The mathematical argument goes as follows: Observe that { x ≤ a} := {ω ∈ Ω : x (ω ) ≤ a} ⊂ {ω ∈ Ω : x (ω ) ≤ b} =: { x ≤ b} (The inclusion ⊂ must hold, because if ω is such that x (ω ) ≤ a, then, since a ≤ b, we also have x (ω ) ≤ b. Hence any ω in the left-hand side is also in the right-hand side.) The result in (1.5) now follows from monotonicity of P (fact 1.1.1 on page 5). 1.1.4 Expectations Our next task is to define expectations for an arbitrary random variable x on probability space (Ω, F , P). Roughly speaking, E [ x ] is defined as the “sum” of all possible values of x, weighted by their probabilities. (Here “sum” is in quotes because J OHN S TACHURSKI October 10, 2013 1.1. PROBABILITY MODELS 13 there may be an infinite number of possibilities.) The expectation E [ x ] of x also represents the “average” value of x over a “very large” sample. (This is a theorem, not a definition—see the law of large numbers below.) Let’s start with the definition when x is a discrete random variable. Let x be the discrete random variable x (ω ) = ∑ jJ=1 s j 1{ω ∈ A j }, as defined in (1.4). In this case, the expectation of x is defined as E [ x ] : = ∑ s j P( A j ) j =1 J (1.6) The definition is completely intuitive: For this x, given our assumption that the sets A j ’s are a partition of Ω and the s j ’s are distinct, we have A j = { x = s j } := {ω ∈ Ω : x (ω ) = s j } Hence, (1.6) tells us that J E [ x ] = ∑ s j P{ x = s j } j =1 Thus, the expectation is the sum of the different values that x may take, weighted by their probabilities. How about arbitrary random variables, with possibly infinite range? Unfortunately, the full definition of expectation for these random variables involves measure theory, and we can’t treat in in detail. But the short story is that any arbitrary random variable x can be approximated by a sequence of discrete variables xn . The expectation of discrete random variables was defined in (1.6). The expectation of the limit x is then defined as E [ x] := lim E [ xn ] (1.7) n→∞ When things are done carefully (details omitted), this value doesn’t depend on the particular approximating sequence { xn }, and hence the value E [ x ] is well defined. Let’s list some facts about expectation, and then discuss them one by one. Fact 1.1.3. Expectation of indicators equals probability of event: For any A ∈ F , E [1{ ω ∈ A } ] = P( A ) Fact 1.1.4. Expectation of a constant is the constant: If α ∈ R, then E [α] = α. (1.8) J OHN S TACHURSKI October 10, 2013 1.1. PROBABILITY MODELS 14 Fact 1.1.5. Linearity: If x and y are random variables and α and β are constants, then E [ α x + β y ] = αE [ x ] + βE [ y ] Fact 1.1.6. Monotonicity: If x, y are random variables and x ≤ y,6 then E [ x ] ≤ E [y]. Fact 1.1.3 follows from the definition in (1.6). We have 1{ ω ∈ A } = 1 × 1{ ω ∈ A } + 0 × 1{ ω ∈ A c } Applying (1.6), we get E [1{ ω ∈ A } ] = 1 × P( A ) + 0 × P( A c ) = P( A ) Fact 1.1.4 say that the expectation of a constant α is just the value of the constant. The idea here is that the constant α should be understood in this context as the constant random variable α1{ω ∈ Ω}. From our definition (1.6), the expectation of this “constant” is indeed equal to its value α: E [ α ] : = E [ α1{ ω ∈ Ω } ] = αP( Ω ) = α Now let’s think about linearity (fact 1.1.5). The way this is proved, is to first prove linearity for discrete random variables, and then extend the proof to arbitrary random variables via (1.7). We’ll omit the last step, which involves measure theory. We’ll also omit the full proof for discrete random variables, since it’s rather long. Instead, let’s cover a quick sketch of the argument that still provides most of the intuition. Suppose we take the random variable x in (1.4) and double it, producing the new random variable y = 2x. More precisely, for each ω , we set y(ω ) = 2x (ω ). (Whatever happens with x, we’re going to double it and y will return that value.) In that case, we have E [y] = 2E [ x ]. To see this, observe that y(ω ) = 2x (ω ) = 2 Hence, applying (1.6), j =1 ∑ s j 1{ ω ∈ A j } = J J j =1 ∑ 2s j 1{ ω ∈ A j } J E [y] = ∑ 2s j P( A j ) = 2 ∑ s j P( A j ) = 2E [ x] j =1 j =1 J statement x ≤ y means that x is less than y for any realization of uncertainty. Formally, it means that x (ω ) ≤ y(ω ) for all ω ∈ Ω. 6 The J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 15 What we have shown is that E [2x ] = 2E [ x ]. Looking back over our argument, we can see that there is nothing special about the number 2 here—we could have used any constant. In other words, For any constant γ, we have E [γ x ] = γE [ x ] Another aspect of linearity of expectations is additivity, which says that given random variables x and y, the statement E [ x + y] = E [ x ] + E [y] is always true. Instead of giving the full proof, let’s show this for the Bernoulli random variables x ( ω ) = 1{ ω ∈ A } and y ( ω ) = 1{ ω ∈ B } (1.9) Consider the sum x + y. By this, we mean the random variable z(ω ) = x (ω ) + y(ω ). More succinctly, ( x + y)(ω ) := x (ω ) + y(ω ). We claim that E [ x + y] = E [ x ] + E [y]. To see that this is the case, note first that ( x + y)(ω ) = 1{ω ∈ A \ B} + 1{ω ∈ B \ A} + 21{ω ∈ A ∩ B} (To check this, just go through the different cases for ω , and verify that the right hand side of this expression agrees with x (ω ) + y(ω ). Sketching a Venn diagram will help.) Therefore, by the definition of expectation, E [ x + y ] = P ( A \ B ) + P ( B \ A ) + 2P ( A ∩ B ) Now observe that A = ( A \ B) ∪ ( A ∩ B) It follows (why?) that (1.10) E [ x ] : = P( A ) = P( A \ B ) + P( A ∩ B ) Performing a similar calculation with y produces E [ y ] : = P( B ) = P( B \ A ) + P( A ∩ B ) Adding these two produces the value on the right-hand side of (1.10), and we have now confirmed that E [ x + y] = E [ x ] + E [y]. 1.2 Distributions All random variables have distributions. Distributions summarize the probabilities of different outcomes for the random variable in question, and allow us to compute expectations. In this section, we describe the link between random variables and distributions. J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 16 F(s) 0.0 −15 0.2 0.4 0.6 0.8 1.0 −10 −5 0 s 5 10 15 Figure 1.2: Cauchy cdf 1.2.1 CDFs A cumulative distribution function (cdf) on R is a right-continuous, monotone increasing function F : R → [0, 1] satisfying lims→−∞ F (s) = 0 and lims→∞ F (s) = 1. ( F is monotone increasing if F (s) ≤ F (s ) whenever s ≤ s , and right continuous if F (sn ) ↓ F (s) whenever sn ↓ s.) Example 1.2.1. The function F (s) = arctan(s)/π + 1/2 is a cdf—one variant of the Cauchy distribution. A plot is given in figure 1.2. Let x be a random variable on some probability space (Ω, F , P), and consider the function Fx (s) := P{ x ≤ s} := P{ω ∈ Ω : x (ω ) ≤ s} ( s ∈ R) (1.11) It turns out that this function is always a cdf.7 We say that Fx is the cdf of x, or, alternatively, that Fx is the distribution of x, and write x ∼ Fx . following is also true: For every cdf F, there exists a probability space (Ω, F , P) and a random variable x : Ω → R such that the distribution of x is F. Exercise 1.5.14 gives some hints on how the construction works. 7 The J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 17 We won’t go through the proof that the function Fx defined by (1.11) is a cdf. Note however that monotonicity is immediate from (1.5) on page 12. Fact 1.2.1. If x ∼ F, then P{ a < x ≤ b} = F (b) − F ( a) for any a ≤ b. Proof: If a ≤ b, then { a < x ≤ b} = { x ≤ b} \ { x ≤ a} and { x ≤ a} ⊂ { x ≤ b}. Applying fact 1.1.1 on page 5 gives the desired result. A cdf F is called symmetric if F (−s) = 1 − F (s) for all s ∈ R.8 The proof of the next fact is an exercise (exercise 1.5.12). Fact 1.2.2. Let F be a cdf and let x ∼ F. If F is symmetric and P{ x = s} = 0 for all s ∈ R, then the distribution F| x| of the absolute value | x | is given by F| x| (s) := P{| x | ≤ s} = 2 F (s) − 1 ( s ≥ 0) 1.2.2 Densities and Probability Mass Functions Cdfs are important because every random variable has a well-defined cdf via (1.11). However, they can be awkward to manipulate mathematically, and plotting cdfs is not a very good way to convey information about probabilities. For example, consider figure 1.2. The amount of probability mass in different regions of the xaxis is determined by the slope of the cdf. Research shows that humans are poor at extracting quantitative information from slopes. They do much better with heights, which leads us into our discussion of densities and probability mass functions. Densities and probability mass functions correspond to two different, mutually exclusive cases. The first (density) case arises when the increase of the cdf in question is smooth, and contains no jumps. The second (probability mass function) case arises when the increase consists of jumps alone. Let’s have a look at these two situations, starting with the second case. The pure jump case occurs when the cdf represents a discrete random variable. To understand this, suppose that x takes values s1 , . . . , s J . Let p j := P{ x = s j }. We then have 0 ≤ p j ≤ 1 for each j, and ∑ jJ=1 p j = 1 (exercise 1.5.13). A finite collection of the probability that x ≤ −s is equal to the probability that x > s. Centered normal distributions and t-distributions have this property. 8 Thus, J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 18 s1 s2 Figure 1.3: Discrete cdf numbers p1 , . . . , p J such that 0 ≤ p j ≤ 1 and p1 + · · · + p J = 1 is called a probability mass function (pmf). The cdf corresponding to this random variable is Fx (s) = j =1 ∑ 1{ s j ≤ s } p j J (1.12) How do we arrive at this expression? Because, for this random variable,     J { x = s j } = ∑ P{ x = s j } = ∑ 1{ s j ≤ s } p j P{ x ≤ s } = P   j s.t. s ≤s j =1 j s.t. s j ≤s j Visually, Fx is a step function, with a jump up of size p j at point s j . Figure 1.3 gives an example with J = 2. The other case of interest is the density case. A density is a nonnegative function p that integrates to 1. For example, suppose that F is a smooth cdf, so that the derivative F exists. Let p := F . By the fundamental theorem of calculus, we then have s s p(t)dt = F (t)dt = F (s) − F (r ) r r J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 19 p(s) 0.00 −15 0.05 0.10 0.15 0.20 0.25 0.30 −10 −5 0 s 5 10 15 Figure 1.4: Cauchy density +∞ From the definition of cdfs, we can see that p is nonnegative and −∞ p(s)ds = 1. In other words, p is a density. Also, taking the limit as r → −∞ we obtain s F (s) = −∞ p(t)dt which tells us that F can be recovered from p. More generally, if F is the cdf of random variable x and p : R → [0, ∞) satisfies s F (s) = then p is called the density of x. −∞ p(t)dt for all s ∈ R Not every random variable has a density. The exact necessary and sufficient condition for a density to exist is that F is “absolutely continuous.” The most important special case of absolute continuity is when F is differentiable. (See any text on measure theory for details.) For our purposes, you can think of absolute continuity meaning that F does not have any jumps. In particular, discrete random variables are out, as J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 20 is any random variable putting positive probability mass on a single point.9 Fact 1.2.3. If x has a density, then P{ x = s} = 0 for all s ∈ R. As discussed above, cdfs are useful because every random variable has one, but pmfs and densities are nicer to work with, and visually more informative. For example, consider figure 1.4, which shows the density corresponding to the Cauchy cdf in figure 1.2. Information about probability mass is now conveyed by height rather than slope, which is easier for us humans to digest. 1.2.3 The Quantile Function Let F be any cdf on R. Suppose that F is strictly increasing, so that the inverse function F −1 exists: F −1 (q) := the unique s such that F (s) = q (0 < q < 1) (1.13) The inverse of the cdf is called the quantile function, and has many applications in probability and statistics. Example 1.2.2. The quantile function associated with the Cauchy cdf in example 1.2.1 is F −1 (q) = tan[π (q − 1/2)]. See figure 1.5. Things are a bit more complicated when F is not strictly increasing, as the inverse F −1 is not well defined. (If F is not strictly increasing, then there exists at least two distinct points s and s such that F (s) = F (s ).) This problem is negotiated by setting F −1 (q) := inf{s ∈ R : F (s) ≥ q} (0 < q < 1) This expression is a bit more complicated, but in the case where F is strictly increasing, it reduces to (1.13). The value F −1 (1/2) is called the median of F. The quantile function features in hypothesis testing, where it can be used to define critical values (see §5.3). An abstract version of the problem is as follows: Let x ∼ F, where F is strictly increasing, differentiable (so that a density exists and x puts no elementary texts, random variables with densities are often called “continuous random variables.” The notation isn’t great, because “continuous” here has nothing to do with the usual definition of continuity of functions. 9 In J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 21 −5 0 s 5 0.2 0.4 q 0.6 0.8 Figure 1.5: Cauchy quantile function probability mass on any one point) and symmetric. Given α ∈ (0, 1), we want to find the c such that P{−c ≤ x ≤ c} = 1 − α (see figure 1.6). The solution is given by c := F −1 (1 − α/2). That is, c = F −1 (1 − α/2) =⇒ P{| x | ≤ c} = 1 − α To see this, fix α ∈ (0, 1). From fact 1.2.2, we have (1.14) P{| x| ≤ c} = 2 F(c) − 1 = 2 F[ F−1 (1 − α/2)] − 1 = 1 − α In the case where F is the standard normal cdf Φ, this value c is usually denoted by zα/2 . We will adopt the same notation: zα/2 := Φ−1 (1 − α/2) (1.15) 1.2.4 Expectations from Distributions Until now, we’ve been calculating expectations using the expectation operator E , which was defined from a given probability P in §1.1.4. One of the most useful J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 22 0.00 0.05 0.10 0.15 0.20 0.25 0.30 −c 0 c Figure 1.6: Finding critical values facts about distributions is that one need not know about x or E to calculate E [ x ]— knowledge of the distribution of x is sufficient. Mathematically, this is an interesting topic, and a proper treatment requires measure theory (see, e.g., Williams, 1991). Here I’ll just tell you what you need to know for the course, and then follow that up with a bit of intuition. In all of what follows, h is an arbitrary function from R to R. Fact 1.2.4. If x is a discrete random variable taking values s1 , . . . , s J with probabilities p1 , . . . , p J , then E [h( x)] = ∑ h(s j ) p j j =1 J (1.16) Fact 1.2.5. If x is a “continuous” random variable with density p, then E [h( x)] = ∞ −∞ h(s) p(s)ds (1.17) It’s convenient to have a piece of notation that captures both of these cases. As a J OHN S TACHURSKI October 10, 2013 1.2. DISTRIBUTIONS 23 result, if x ∼ F, then we will write E [h( x)] = h(s) F (ds) The way you should understand this expression is that when F is differentiable with ∞ derivative p = F , then h(s) F (ds) is defined as −∞ h(s) p(s)ds. If, on the other random variable in fact 1.2.4, then h(s) F (ds) is defined as ∑ jJ=1 h(s j ) p j . hand, F is the step function F (s) = ∑ jJ=1 1{s j ≤ s} p j corresponding to the discrete Example 1.2.3. Suppose that h( x ) = x2 . In this case, E [h( x )] is the second moment of x. If we know the density p of x, then fact 1.2.5 can be used to evaluate that second moment by solving the integral on the right-hand side of (1.17). Just for the record, let me note that if you learn measure theory you will come to understand that, for a given cdf F, the expression h(s) F (ds) has its own precise definition, as the “Lebesgue-Stieltjes” integral of h with respect to F. In the special case where F is differentiable with p = F , one can prove that h(s) F (ds) = ∞ −∞ h ( s ) p ( s ) ds, where the left hand side is the Lebesgue-Stieltjes integral, and the right hand side is the ordinary (Riemann) integral you learned in high school. An similar statement holds for the discrete case. However, this is not the right place for a full presentation of the Lebesgue-Stieltjes integral. We want to move on to statistics. Although we’re skipping a lot of technical details here, we can at least prove fact 1.2.4. This is the discrete case, where x is of the form x (ω ) = ∑ jJ=1 s j 1{ω ∈ A j }, and p j := P{ x = s j } = P( A j ). As usual, the values {s j } are distinct and the sets { A j } are a partition of Ω. As we saw in §1.1.4, the expectation is E [ x ] = ∑ s j P( A j ) = ∑ s j P{ x = s j } = ∑ s j p j j =1 j =1 j =1 J J J Now let h : R → R. You should be able to convince yourself that h( x (ω )) = j =1 ∑ h ( s j )1{ ω ∈ A j } J (Pick an arbitrary A j and check that the left- and right-hand sides are equal when ω ∈ A j .) This is a discrete random variable, which we can take the expectation of using (1.6) (page 13). We get E [h( x)] = ∑ h(s j )P( A j ) = ∑ h(s j ) p j j =1 j =1 J J J OHN S TACHURSKI October 10, 2013 b] is the distribution associated with the density 1 p ( s. .) The mean is b a s p(s. α N are any constants. xk ∼ N (0.6. . 1) is called the standard normal distribution It is well-known that if x ∼ N (µ. then E [ x ] = µ.2. then p(s. b) := 0.2. The chi-squared distribution with k degrees of freedom is the distribution with density 1 p(s. If x has a distribution described by this density.7. Fact 1. . . and this is one of many attractive features of the distribution.5 Common Distributions Let’s list a few well-known distributions that will come up in this course. and var[ x ] = σ2 . . If x1 . then Q := IID i =1 ∑ xi2 ∼ χ2 (k) October 10. We represent this distribution symbolically by N (µ.2. . The uniform distribution on interval [ a. a. µ. then we write x ∼ χ2 (k ). b ) : = ( a ≤ s ≤ b) b−a (If s < a or s > b. Fact 1. . . k ) := k/2 sk/2−1 e−s/2 ( s ≥ 0) 2 Γ(k/2) where Γ is the Gamma function (details omitted). DISTRIBUTIONS 24 Fact 1. 2013 k J OHN S TACHURSKI . a. b)ds = a+b 2 The univariate normal density or Gaussian density is a function p of the form 1 p(s) := p(s.4 is now confirmed. N then α0 + ∑n =1 αn xn is also normally distributed. If x1 . σ2 ). σ) := (2πσ2 )−1/2 exp − (s − µ)2 σ−2 2 for some µ ∈ R and σ > 0. a. 1. 1).2.1. . σ2 ). . The distribution N (0. x N are normally distributed and α0 .2. . . Let a < b. Hence the two parameters separately define the mean and the variance (or standard deviation). then ∑ jJ=1 Q j ∼ χ2 (∑ j k j ). some of which we will examine later.3. then Q1 / k 1 Q2 / k 2 is distributed as F (k1 . . k ) : = 1 Γ( k+ 2 ) k (kπ )1/2 Γ( 2 ) s2 1+ k −(k+1)/2 Fact 1. and 3.8. Q ∼ χ2 (k). .2. The F-distribution arises in certain hypothesis tests.2. 2.1. the t-distribution with k degrees of freedom. If Z and Q are two random variables such that 1. If Q1 .3 Dependence [roadmap] J OHN S TACHURSKI October 10. k2 ). k 2 ) : = sB(k1 /2. Student’s t-distribution with k degrees of freedom. 1. Z ∼ N (0. Fact 1. more simply. DEPENDENCE 25 Fact 1. 2013 . k2 /2) ( s ≥ 0) where B is the Beta function (details omitted). Q J are independent with Q j ∼ χ2 (k j ). .10. is the distribution on R with density p ( s.2. 1). If Q1 ∼ χ2 (k1 ) and Q2 ∼ χ2 (k2 ) are independent. or. k2 is the distribution with the unlikely looking density k1 +k2 2 (k 1 s)k1 k k ] 2 /[k1 s + k2 p ( s. The F-distribution with parameters k1 . Z and Q are independent. then Z (k / Q)1/2 has the t-distribution with k degrees of freedom.9. . k 1 . N ) In this setting. . . . . x N . Once special case where we can tell the joint from the marginals is when there is no interaction. . . . . n = 1. s N ) p ( s1 . This is called independence. . . ∞) satisfying tN −∞ ··· t1 −∞ p(s1 . . . x N . . For each individual random variable xn : Ω → R. . . x N to be F ( s1 . . s N ) = p ( s k +1 .1 Joint Distributions Consider a collection of N random variables x1 . s k ) p ( s 1 . . To quantify these things. . . . since the marginals do not tell us about the interactions between the different variables. . From joint densities we can construct conditional densities. . . . (1. . . x N given x1 = s1 . . . . . . .18) This distribution tells us about the random properties of xn viewed as a single entity. . . The joint density of x1 . . s N | s 1 . . . . . . . . . . n = 1. s N | s 1 .1. x N . . s k ) This decomposition is useful in many situations. . . . DEPENDENCE 26 1. . x N ≤ s N } (−∞ < sn < ∞ . . . . in order to distinguish it from the joint distribution. . . s N ) : = P{ x1 ≤ s1 . . . . and we treat it in the next section. we define the joint distribution of x1 . . . s N )ds1 · · · ds N = F (t1 . . and outcomes for the group of variables as a whole.3. . . .3. the distribution Fn of xn is Fn (s) := P{ xn ≤ s} (−∞ < s < ∞) (1. 2013 . the distribution Fn of xn is sometimes called the marginal distribution. . s k ) (1. . s k ) : = p ( s1 . . . . xk = sk is defined by p ( s k +1 . t N ) (1. . N .19) for all tn ∈ R.20) Rearranging this expression we obtain a decomposition of the joint density: p ( s 1 .21) J OHN S TACHURSKI October 10. . Typically. . . . . . But we often want to know about the relationships between the variables x1 . The conditional density of xk+1 . if it exists. . . the joint distribution cannot be determined from the N marginal distributions alone. . is a function p : R N → [0. . . 2. s N . x M are independent and E | xm | is finite for each m. . .9).3.2 Independence Let x1 . . If x1 . Fact 1. .1. we can illustrate the idea by showing that E [ xy] = E [ x ]E [y] when x and y are independent and defined by (1. The variables x1 .3. . . . x N be a collection of random variables with xn ∼ Fn . and let y denote the vertical location. by the definition of expectations. then f ( x ) and g(y) are independent. . . then E ∏ xm = ∏ E [ xm ] m =1 m =1 M M We won’t prove the last fact in the general case. . They are called independent if. . . . . given any s1 . (For example. we have E [ xy] = stP( A ∩ B) = stP( A)P( B) = sP( A)tP( B) = E [ x]E [y] Fact 1. Now observe that ( xy)(ω ) := x (ω )y(ω ) = s1{ω ∈ A}t1{ω ∈ B} = st1{ω ∈ A ∩ B} Hence. . . we have P{ x1 ≤ s1 . Let x denote the horizontal location of the dart relative to the center of the board. . x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N } (1. Example 1. . m.) At first pass. . x N and Fn is the marginal distribution of xn . then the dart is 1cm to the left of the center. . .3. if F is the joint distribution of x1 . . we might suppose that x and y are independent and identically distributed. .22) Equivalently. x N are called identically distributed Fn = Fm for all n. . if x = −1 and y = 3.1. s N ) = F1 (s1 ) × · · · × FN (s N ) = n =1 ∏ Fn (sn ) N We use the abbreviation IID for collections of random variables that are both independent and identically distributed. . as this involves measure theory. If x and y are independent and g and f are any functions. . .3. where Fn is a cdf. then independence states that F (s1 . and 3cm above.1. DEPENDENCE 27 1. it can be shown (details omitted) that the random variables x and y are independent precisely when the events A and B are independent. J OHN S TACHURSKI October 10. . 2013 . In this case. Consider a monkey throwing darts at a dartboard. However.3. Fact 1. . The variance of random variable x is defined as var[ x ] := E [( x − E [ x ])2 ] This gives a measure of the dispersion of x.3 Variance and Covariance Let x ∼ F.10 var[α + β x ] = β2 var[ x ]. Fact 1. but in general we’ll just talk about the variance of a given random variable without adding the caveat “assuming it exists. .1. .”) The standard deviation of x is var[ x ]. x N are random variables and α1 .1. the k-th moment of x is defined as E [ x k ] = sk F (ds). DEPENDENCE 28 An important special case of the “independence means multiply” rule is as follows. 2013 .5. and is the product of the marginal densities: p ( s1 . (Not all random variables have a well defined variance. . if α and β are real numbers and x and y are random variables. For the normal distribution. xm ] n<m In particular. . and each has density pn . .3. . then var[α] = 0.3.3. y] := E [( x − E [ x ])(y − E [y])] Fact 1. The covariance of random variables x and y is defined as cov[ x. as was the case when we discussed fact 1. For k ∈ N. then the joint density p exists. s N ) = n =1 ∏ pn (sn ) N 1. If the k-th moment of x exists. .3. . If random variables x1 . If E [| x|k ] < ∞ then the k-th moment is said to exist. . 10 Here J OHN S TACHURSKI October 10. α N are constant scalars.3. If x1 . For a random variable with the Cauchy distribution. even the first moment does not exist. then so does the j-th moment for all j ≤ k. x N are independent. . every moment exists. and var[α x + βy] = α2 var[ x ] + β2 var[y] + 2αβ cov[ x. then var n =1 ∑ αn xn N = α2 n n =1 ∑ var[xn ] + 2 ∑ N αn αm cov[ xn . y] var[α] should be understood as var[α1{ω ∈ Ω}]. . .3.4 on page 13. . . .4. 1. The class of functions L is more correctly known as the set of affine functions. y and positive constants α. let’s consider the problem of predicting the value of a random variable y given knowledge of the value of a second random variable x.6. y] = 0. 2013 J OHN S TACHURSKI .) Thus.7. β. in fact they are not linear unless α = 0 (see §2. the Given two random variables x and y with finite variances σx y correlation of x and y is defined as corr[ x. y] ≤ 1 and corr[α x. For this to occur. Thus. The class of functions we will consider is the set of “linear” functions L := { all functions of the form ( x ) = α + β x } (While elementary courses refer to these functions as linear.3. y] σx σy If corr[ x.3. Given any two random variables x. βy] = corr[ x. while negative correlation means that corr[ x. the minimizer of the mean squared deviation over all functions of x is obtained by choosing f ( x ) = E [y | x ].1. y] is positive. DEPENDENCE 29 2 and σ2 respectively. then cov[ x. However. it is necessary and sufficient that cov[ x.3. we consider the problem min E [(y − ( x ))2 ] = min E [(y − α − β x )2 ] ∈L α. To measure the “average distance” between f ( x ) and y. so let’s now consider the simpler problem of finding a good predictor of y within a small and well-behaved class of functions.23) October 10. the conditional expectation may be nonlinear and complicated. β ∈ R (1. we will use the mean squared deviation between f ( x ) and y. Fact 1. If x and y are independent. y] = 0.3. y] := cov[ x. y] = 0. Note that the converse is not true: One can construct examples of dependent random variables with zero covariance. we seek a function f such that f ( x ) is close to y on average. Positive correlation means that corr[ x. where the right-hand size is the conditional expectation of y given x. we have −1 ≤ corr[ x.1. y] is negative. y] Fact 1.4 Best Linear Predictors As a little exercise that starts moving us in the direction of statistics. which is E [(y − f ( x))2 ] As we will learn in chapter 3. y] = corr[ x.3). we say that x and y are uncorrelated. We begin by discussing three modes of convergence for random variables. β) =0 ∂α and ∂ψ(α.5. To this end. we often want to know how our tests and procedures will perform as the amount of data we have at hand becomes large. 2013 .4. In almost all the applications we consider. you will realize that α∗ and β∗ are the “population” counterparts for the coefficient estimates in the regression setting. We say that { xn }n=1 converges to random variable x in probability if for any δ > 0. y] var[ x ] and α∗ := E [y] − β∗ E [ x ] (1. 1. J OHN S TACHURSKI October 10. β) := E [y2 ] − 2αE [y] − 2 βE [ xy] + 2αβE [ x ] + α2 + β2 E [ x2 ] Computing the derivatives and solving the equations ∂ψ(α.4.4 Asymptotics In statistics.1. β) =0 ∂β We obtain (exercise 1. P{| xn − x| > δ} → 0 as n → ∞ p In symbols.1 Modes of Convergence ∞ Let { xn }∞ n=1 be a sequence of random variables. this convergence is represented by xn → x. We’ll talk more about the connections in the next chapter. the limit x will be a constant.25) the minimizers β∗ := cov[ x. we now investigate the limiting properties of sequences of random variables. all of which are used routinely in econometrics. the objective function becomes ψ(α.24) The best linear predictor is therefore ∗ ( x ) := α∗ + β∗ x If you’ve studied elementary linear least squares regression before. 1. ASYMPTOTICS 30 Expanding the square on the right-hand side and using linearity of E . The next example illustrates the definition for this case. This probability collapses to zero as n → ∞. we have P{| xn − α| > δ} → 0.1. Fact 1.4.6 0. 1/n) Example 1. If xn ∼ N (α. If xn → x and αn → α. If g : R → R is continuous and xn → x.8 0. then xn → α. J OHN S TACHURSKI October 10. decreasing the variance and causing the density to become more peaked.4. That is.2 0.8 1.4 0.1.6 0.2 0. A full proof of the convergence result in example 1. 1/n). the following statements are true: 1.7 for two different values of n. The details are below.4.7: P{| xn − α| > δ} for xn ∼ N (α.1 can be found by looking at the normal density and bounding tail probabilities.0 0. If xn → x and yn → y.0 0.4 0.4. 2. then xn + αn → x + α and xn αn → x α. for any δ > 0.11 11 Here p p p p p p p p p p {αn } is a nonrandom scalar sequence. then g( xn ) → g( x ).0 α−δ α α+δ (a) n = 10 (b) n = 20 Figure 1. the probability P{| xn − α| > δ} is shown in figure 1. 2013 . Regarding convergence in probability. then xn + yn → x + y and xn yn → xy. where it corresponds to the size of the shaded areas. However. 3. Fixing δ > 0. ASYMPTOTICS 31 1.1. a much simpler proof can also be obtained by exploiting the connection between convergence in probability and convergence in mean squared error.0 α−δ α α+δ 0. J OHN S TACHURSKI October 10.2. 1/n).29.4. then xn → x. 2013 . this convergence is represented by xn → x. Regarding convergence in mean square. and let F be a cdf.4.4.4. which states that for any random variable y with finite second moment and any δ > 0.4. This follows from parts 1 and 2 of fact 1.4.8. In example 1. ASYMPTOTICS 32 We say that { xn } converges to x in mean square if E [( xn − x)2 ] → 0 as n → ∞ In symbols. Part 1 of fact 1. It is well-known that the cdf of the t-distribution with k degrees of freedom converges to the standard normal cdf as k → ∞.5. We say that Fn converges weakly to F if. Example 1. Let { Fn }∞ n=1 be a sequence of cdfs. the following statements are true: 1. we stated that if xn ∼ N (α. This convergence is illustrated in figure 1. we have Fn (s) → F (s) as n→∞ p Example 1.2 follows.4. Part 2 is implied by the equality E [( xn − α)2 ] = var[ xn ] + (E [ xn ] − α)2 Verification of this equality is an exercise. If xn → x. we have ms ms p ms P{|y| ≥ δ} ≤ E [ y2 ] δ2 (1.4.) Using monotonicity of P and then applying (1. 2. for any s such that F is continuous at s. since E [ xn ] = α and var[ xn ] = 1/n → 0. then xn → α.1.2. then xn → α if and only if E [ xn ] → α and var[ xn ] → 0.25) (See exercise 1.2 follows from Chebychev’s inequality. If α is constant. Fact 1.25) to y = xn − x.2.3. we obtain E [( xn − x)2 ] P{| xn − x| > δ} ≤ P{| xn − x| ≥ δ} ≤ δ2 Part 1 of fact 1.1. Let { Fn }∞ n=1 be a sequence of cdfs.8: t-distribution with k df converges to N (0.0 df = 1 df = 10 N(0. J OHN S TACHURSKI October 10.3. the following statements are true: 1. We say that xn converges in distribution to x if Fn converges weakly to F.4.1) 0. In this connection.4. Let { xn }∞ n=1 and x be random variables. where xn ∼ Fn and x ∼ F. then g( xn ) → g( x ). 1) as k → ∞ Sometimes densities are easier to work with than cdfs.0 −3 0.6 0. then xn → x. In symbols. If g : R → R is continuous and xn → x.2 0. 3.1.4. and let F be a cdf. 2. Suppose that all these cdfs are differentiable. Regarding convergence in distribution.4. ASYMPTOTICS 33 1. then xn → α.8 −2 −1 0 1 2 3 Figure 1. then Fn converges weakly to F. If α is constant and xn → α. 2013 d p p d d d d .4 0. this convergence is represented by xn → x. If pn (s) → p(s) for all s ∈ R. If xn → x. note that pointwise convergence of densities implies weak convergence of the corresponding distribution functions: Fact 1. and let pn and p be the densities of Fn and F respectively. Fact 1. This means that if we simulate many observations of x and take the sample mean. The central limit theorem tells us that a simple transform of the average converges to a normal distribution. Let x be the number of tails observed in the process. then xn + yn → α + y and xn yn → αy. The coin is not fair: The probability of heads is 0. we should get a value close to 15. If α is constant. xn → α and yn → y. To illustrate the law of large numbers. 12 The proof involves a bit of cheating.1. 2013 . Let’s start with the law of large numbers.4. and.4.2 The Law of Large Numbers Two of the most important theorems in both probability and statistics are the law of large numbers and the central limit theorem.5.2 on page 32. we can use fact 1.12 Example 1.5.4. This second moment assumption is not necessary for the result. In view of this fact. . The law of large numbers tells us that these averages converge in probability to the mean of the distribution in question.31). which relates to the sample mean ¯ N := x of a given sample x1 . These steps suffices to show that E [ x are left as an exercise (exercise 1.1. ASYMPTOTICS 34 The next result is sometimes known as Slutsky’s theorem. consider flipping a coin until 10 heads have occurred. with a little bit of googling. If the first moment |s| F (ds) is finite. p d d d 1.4. . x N Theorem 1. it ¯ N ] → sF (ds) and var[ x ¯ N ] → 0 as N → ∞. Code to do this is provided in because it assumes that the variance of each xn is finite. .4. but it helps to simplify the proof. J OHN S TACHURSKI October 10. note to yourself exactly where independence bites.1. . It is known that such an x has the negative binomial distribution.4. we find that the mean E [ x ] is 15. When you do the exercise.4. then ¯ N → E [ xn ] = x p 1 N n =1 ∑ xn N sF (ds) as N→∞ (1. In their simplest forms.26) To prove theorem 1.4. these theorems deal with averages of independent and identically distributed (IID) sequences. Fact 1.4. Let { xn } be an IID sequence of random variables with common distribution F. The generation of a single observation has been wrapped in a function called f. repetitions ) for ( i in 1: num .runif (1) num .1.num .num .27) This can be confirmed by letting yn := h( xn ) and then applying theorem 1. let x ∼ F.1. ASYMPTOTICS 35 listing 1.4) } outcomes [ i ] <.0 num .10000 outcomes <. the law of large numbers (1. Using the principle that expectations of indicator functions equal probabilities of events (page 13). 13 Hint: J OHN S TACHURSKI October 10.4. 1] and q ∈ [0.4) num .4. repetitions ) { num . tails <. which is handy for simulations. heads <. 2013 .26) appears to only be a statement about the sample mean. the law of large numbers applies to probabilities as well as expectations. Listing 1 Illustrates the LLN num .0 while ( num . heads + ( b < 0.numeric ( num . then P{u ≤ q} = q. Also. we have E [h( x)] = E [1{ x ∈ B}] = P{ x ∈ B} If u is uniform on [0. we have used the R function replicate. To see this. if h : R → R is such that |h(s)| F (ds) is finite. heads <. and consider the probability P{ x ∈ B}. fix B ⊂ R. heads < 10) { b <.num . but actually it can be applied to functions of the random variable as well. Let h be the function defined by h(s) = 1{s ∈ B} for all s ∈ R. Can you see how this program works?13 An improved implementation is given in listing 2. To generate multiple observations. repetitions <. For example. 1]. This fact is used to simulate the coin flips. then 1 N n =1 ∑ h(xn ) → E [h(xn )] = N p h(s) F (ds) (1. tails } print ( mean ( outcomes ) ) At first glance. tails <. Also recall that the logical values TRUE and FALSE are treated as 1 and 0 respectively in algebraic expressions. tails + ( b >= 0. in addition. the second moment s2 F (ds) is finite.1. tails <.num . 2013 . tails <. Relative to the LLN. tails ) } outcomes <. It is arguably one of the most beautiful and important results in all of mathematics. it requires an additional second moment condition.1.function ( q ) { num . tails + ( b >= q ) } return ( num .4. then √ ¯N − µ x d z N := N → z ∼ N (0. Theorem 1. 1.30) σ J OHN S TACHURSKI October 10. σ2 ) as N → ∞ N (x (1.0 num .3 The Central Limit Theorem The central limit theorem is another classical result from probability theory. f (0.runif (1) num . Assume the conditions of theorem 1.4.4. heads < 10) { b <.4.replicate (10000 . Another common statement of the central limit theorem is as follows: If all the conditions of theorem 1.4) ) print ( mean ( outcomes ) ) It now follows from (1.0 while ( num .28) tells us that this fraction converges to the probability that xn ∈ B. then 1 N n =1 ∑ 1{ x n ∈ B } → P{ x n ∈ B } N p (1. then √ d ¯ N − µ) → y ∼ N (0.num .4.2 are satisfied. heads <.2. heads + ( b < q ) num .27) that if { xn } is an IID sample from F. ASYMPTOTICS 36 Listing 2 Second version of listing 1 f <.28) The left hand side is the fraction of the sample that falls in the set B. If. and (1. 1) as N → ∞ (1. heads <.29) where µ := sF (ds) = E [ xn ] and σ2 := (s − µ)2 F (ds) = var[ xn ]. Assume the conditions of theorem 1. We see that x approximately normal.2.4. x N N ¯ N is Here ≈ means that the distributions are approximately equal.numeric ( num .1000 k <.30).4. breaks =50 . The last line of the listing superimposes the density of the standard normal distribution over the histogram. we briefly note the following extension to the central limit theorem: Theorem 1. The central limit theorem tells us about the distribution of the sample mean when N is large.9. If g : R → R is differentiable at µ and g (µ) = 0.4. g (µ)2 σ2 ) N { g( x d as N→∞ (1. (The mean of this distribution is 5. freq = FALSE ) curve ( dnorm .5. then √ ¯ N ) − g(µ)} → N (0.1. k ) outcomes [ i ] <. Arguing informally. replications ) N <.rchisq (N . As expected. ∴ The convergence in (1. the output of which is given in figure 1. the fit is pretty good. with mean equal to µ := E [ x1 ] and variance converging to zero at a rate proportional to 1/ N . where each xn is χ2 (5).32 asks you to confirm this via theorem 1. The listing generates 5. and the variance is 2 × 5 = 10.k ) } hist ( outcomes . replications <.2 and fact 1. and then histogrammed.31) J OHN S TACHURSKI October 10. ASYMPTOTICS 37 Exercise 1.5 # Degrees of freedom for ( i in 1: num .3. 2013 .30) is illustrated by listing 3. add = TRUE . lw =2 . Listing 3 Illustrates the CLT num .5000 outcomes <.4.4. for N large we have √ ¯ N − µ) ≈ y ∼ N (0. σ2 ) N (x y σ2 ¯ N ≈ √ + µ ∼ N µ.000 observations of the random variable z N defined in (1.) The observations of z N are stored in the vector outcomes. replications ) { xvec <.4. col = " blue " ) Before finishing this section.sqrt ( N / (2 * k ) ) * ( mean ( xvec ) . 1.31) to occur. a rather large value of N is required for the convergence in (1.3 −3 −2 −1 0 outcomes 1 2 3 Figure 1. 1. B and C are disjoint.9: Illustration of the CLT This theorem is used frequently in statistics. 1.14 Sketching the Venn diagram.2.5.5. so that P( A ∪ B) = P( A) + P( B) whenever A and B are disjoint. Finish the proof using the definition of a probability and fact 1. B.5. The proof of theorem 1. F ). EXERCISES 38 Histogram of outcomes 0. Show that if A.5.1. The technique is referred to as the delta method. Prove fact 1. convince yourself that A = [( A ∪ B) \ B] ∪ ( A ∩ B). 2013 . then P( A ∪ B ∪ C ) = P( A ) + P( B ) + P( C ).4 Density 0.2: P( A ∪ B) = P( A) + P( B) − P( A ∩ B) for any A.39.0 −4 0.1. 14 Hint: J OHN S TACHURSKI October 10.1. 1. Suppose that P is a probability on (Ω.3 is based on Taylor expansion of g around the point µ. to obtain the asymptotic distribution of certain kinds estimators. One word of warning when applying this theorem: In many situations. Some of the details are given in exercise 1.2 0.4. Ex.1 0.1 (page 5).5 Exercises Ex. Ex.e. Let Ω be a nonempty finite set.5. B := {2} and C := {3}. Recall Fx defined in (1.4. Let G be another cdf on R.5. and let P be a probability on the subsets F .5.1.5. .3. then G −1 (u) ∼ G. J OHN S TACHURSKI October 10.8. Ex. P( A ∩ B).5.12. and let ω0 be a fixed element of Ω. Suppose that x is the finite-valued random variable in (1.5. Suppose that G is strictly increasing. Let P and Ω be defined as in example 1. s J .5. Show that when Ω is finite. then P( A ∪ B) = P( A) + P( B).14. 1. in the sense that if A and B are disjoint events. 1.4).2. P( Ac ∪ Bc ) and P( A | B).5.6. where m ∈ {1. define P( A) := 1{ω0 ∈ A}.5. . Let P( A) = P( B) = 1/3. 6} and q is a constant. Let A be the event that the first switch is on.13. 1.11). . Show that if A is independent of itself. Show that lims→−∞ Fx (s) = 0.5. This exercise describes the inverse transform method for generating random variables with arbitrary distribution from uniform random variables. P( A ∪ B). The uniform cdf on [0. . Prove the claim in fact 1. Compute q. 1. has finite range). 1.1. For each A ⊂ Ω. 1. 1. Let A ∈ F . Show that if A and B are independent. Show that 0 ≤ p j ≤ 1 for each j. 2013 . . Verify this when x is the finite-valued random variable in (1. and ∑ jJ=1 p j = 1. A dice is designed so that the probability of getting face m is qm. Let Ω be any sample space. 1] is given by F (s) = 0 if s < 0.11).5.5. 3}. 1.5.1. 1. Show that A and B are independent under P. and let G −1 be the inverse (quantile). then A is independent of every other event in F .5.5. a random variable x on Ω can only take on a finite set of values (i.1. and let B be the event that the second switch is on.5. F (s) = s if 0 ≤ s ≤ 1.7. Are A and C independent? Ex. Let x be a discrete random variable taking values s1 . EXERCISES 39 Ex.15 Ex. and F (s) = 1 if s > 1. Show that P is additive. Is P a probability on Ω? Why or why not? Ex. Let P and Ω be defined as in example 1.2 on page 17. Ex.9. Ex. P( Ac ). then Ac and Bc are also independent. Ex.10. . and let p j := P{ x = s j }. Ex. which implies that lims→∞ Fx (s) = 1. 2.11.4). If you can. 1. Show that if P( A) = 0 or P( A) = 1. show that F is rightcontinuous.1. 1. 15 Hint: Have a look at the definition of a function in §13. . We claimed that Fx is a cdf. then either P( A) = 0 or P( A) = 1. Compute P(C ). .. Recall Fx defined in (1. Given sample space Ω := {1. 1. Show that if u ∼ F. Ex. let A := {1}. Ex. Let x and y be independent uniform random variables on [0. then cov[ x.5.20.5.6 on page 29. Let z := max{ x. Let the prediction error u be defined as u := y − ∗ ( x ). Let α∗ . Confirm the solutions in (1. 1.16. Show that F (s) = E [1{y ≤ s}] for any s.3. then f ( x ) and g(y) are independent. Confirm the claim in fact 1. Ex. y]2 var[y] Fix s ∈ R and compare the sets {z ≤ s} and { x ≤ s} ∩ {y ≤ s}. 1. 1.24).16 In addition.5. y] for any constant scalars α and β? Why or why not? Ex. βy] = corr[ x.5.5. 1. 1]. 1]. Ex. Consider the setting of §1.18.5. Ex. var[ ∗ ( x )] = corr[ x.19.5.5.7: If x and y are independent.15. E [ ∗ ( x )] = E [y] 2.25. (Existence of k-th moment implies existence of j-th moment for all j ≤ k. s ∈ R.2 tells us that if x and y are independent random variables and g and f are any two functions. 1. y}. Prove fact 1.1. Confirm monotonicity of expectations (fact 1. With reference to fact 1. 1.5. Show that 1. Ex.5. β∗ and ∗ be as defined there. Give an expression for the cdf G of the random variable y = x2 . Let x and y be scalar random variables.) Ex. 1.3. Prove this for the case where f ( x ) = 2x and g(y) = 3y − 1. where F is a cdf. 1.9).4. What is the relationship between these two sets? 16 Hint: J OHN S TACHURSKI October 10.5. s ) = p1 (s) p2 (s ) for every s. density and mean of z. Ex.3. 1. Ex. is it true that corr[α x. Confirm the expression for variance of linear combinations in fact 1.21. Show that x1 and x2 are independent whenever q(s.4. Let q be their joint density.1. Ex.5.26.5.23. 2013 . Ex.3. Let x1 and x2 be random variables with densities p1 and p2 . 1. 1. y}. Fact 1. y] = corr[ x.24. Ex. Compute the cdf.6 on page 14) for the special case where x and y are the random variables in (1.5. Let x ∼ F where F is the uniform cdf on [0. compute the cdf of w := min{ x. EXERCISES 40 Ex. y] = 0.3. 1. Let y ∼ F.3.17.22. 1. J OHN S TACHURSKI October 10. then xn → x. y]2 ) var[y] Ex. EXERCISES 41 3.30.26. we complete the proof of the LLN on page 34. lnorm. Ex. all on the same figure. 4. 1.33 (Computational). In particular. give an example of a sequence of random variables { xn } and random variable x such that xn converges to x in distribution. plot the chi-squared density for k = 1.25).5. Let { xn } be an IID sequence of random variables with common distribution F. 1. G2 and G3 available in R. Show that the converse is not generally true. In R. where name is one of norm. Use different colors for different k. 1. compute the first 10 moments of the exponential distribution with mean 1.5.4. Let { xn } be a sequence of random variables satisfying xn = y for all n. 3.5. E [x Ex. Using numerical integration. 2. 1. 1. 17 Look them up if you don’t know what they are. etc.31. Show that ¯ N ] → sF (ds) and var[ x ¯ N ] → 0 as N → ∞.32. Ex. Provided that we can at least generate uniform random variables. 1. Using numerical integration and a for loop. 1.2 and fact 1.17 examine for each Gi whether the random variables generated via inverse transform do appear equally distributed to the random variables generated from Gi using R’s built in algorithms (accessed through rname. where y is a single random variable.5. show that if x is a random variable with finite second moment and δ > 0. show that cov[ ∗ ( x ). 1.35 (Computational).5. In this exercise. the value of the parameter is pinned down to what value?) Ex. Show that if P{y = 0} = 1. but not in probability. Ex.5. 2013 .27.5.5. Ex.28. Prove Chebychev’s inequality (1. Pick three different continuous distributions G1 . then xn → 0 holds. see the documentation on qqplot. Ex.4. u] = 0. Using Q-Q plots. 1.5. Confirm (1.5. 5.29. Continuing on from exercise 1. We saw in fact 1.5. (The exponential distribution has one parameter.30) via theorem 1. In other words. Ex. δ2 Ex.36 (Computational).4.1. Show that if P{y = −1} = P{y = 1} = 0. then xn → 0 fails. then P{| x | ≥ δ} ≤ p d p p E [ x2 ] .5.5. var[u] = (1 − corr[ x. and include a legend.).4.4 that if xn → x. If the mean is 1.34 (Computational).5. show that the 8th moment of the standard normal density is approximately 105. Using a for loop.14) can be (and is) used to generate random variables with arbitrary distribution G. the inverse transform method (see exercise 1. B and C are disjoint. 1. Compare histograms and normal density plots in the manner of figure 1. Why is the fit not as good? Ex. N = 5 and N = 200.2.5. Replicate the simulation performed in listing 3. J OHN S TACHURSKI October 10.1 Solutions to Selected Exercises Solution to Exercise 1.37 (Computational). Pick any sets A. This exercise covers some of the proof behind theorem 1. Produce plots for N = 1. It turns out that under these conditions we √ p have nR(tn − θ ) → 0.3 on page 37. Using additivity of P. Use the appropriate mean and variance in the normal density. we can apply part 1 of fact 1. P( A ∪ B ∪ C) = P(( A ∪ B) ∪ C) = P( A ∪ B) + P(C) = P( A) + P( B) + P(C) This result can be extended to an arbitrary number of sets by using induction. The details are omitted. but this time for N = 2.5. σ2 ) as n → ∞ Let g : R → R be differentiable at θ with g (θ ) = 0. Suppose that {tn } is a sequence of random variables. 1.5. EXERCISES 42 Ex. then A ∪ B and C are also disjoint. Ex. To show that P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B ) we start by decomposing A into the union of two disjoint sets: A = [( A ∪ B) \ B] ∪ ( A ∩ B).1. Solution to Exercise 1.1. θ is a constant.1 (page 5) to obtain P( A ) = P( A ∪ B ) − P( B ) + P( A ∩ B ) Rearranging this expression gives the result that we are seeking. we can write g(tn ) = g(θ ) + g (θ )(tn − θ ) + R(tn − θ ).5.39. If A. prove carefully that √ d n{ g(tn ) − g(θ )} → N (0. 2013 . B ∈ F . and √ d n(tn − θ ) → N (0.5. 1. Taking a first order Taylor expansion of g around θ . 1].38 (Computational). using additivity over pairs. g (θ )2 σ2 ).1.5. Replicate the simulation performed in listing 3. N = 2. we then have P( A) = P[( A ∪ B) \ B] + P( A ∩ B) Since B ⊂ ( A ∪ B). where R(tn − θ ) is a remainder term. Using this fact.4. As a result. and A ∪ B ∪ C = ( A ∪ B) ∪ C.9.5. 1. but this time for uniformly distributed random variables on [−1. 5. P( Ac ∪ Bc ) = P(( A ∩ B)c ) = P(Ω) = 1.5. . and we have 1 { ω0 ∈ A ∪ B } = 1 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B } If ω0 ∈ B. Solution to Exercise 1. we have P{1. To show that P is a probability on Ω we need to check that 1. and once again we have 1 { ω0 ∈ A ∪ B } = 1 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B } Finally. pick any disjoint A and B. . 2. if ω0 is in neither A nor B. 1] for every A ⊂ Ω. P( Ac ) = 2/3.5. . letting Ω = {1. 1{ω0 ∈ A} ∈ [0. In addition. P(C ) = 1/3 as 1 = P(Ω) = P( A ∪ B ∪ C ) = P( A) + P( B) + P(C) = 1/3 + 1/3 + P(C). First.5. Solution to Exercise 1.3. 2 holds because ω0 ∈ Ω. 6} = P ∪6 m=1 { m } = ∑ P{ m } = ∑ qm = 1 m =1 m =1 6 6 Solving the last equality for q. and hence P is a probability on Ω.4. and P( A ∩ C) = 0 = 1/9 = P( A)P(C). 2013 . . When the dice is rolled one face must come up. P( A ∪ B) = 2/3. then ω0 ∈ / A. 6} be the sample space. . . we get q = 1/21.5. If A ∩ B = ∅. More formally. 1{ω0 ∈ Ω} = 1 3. J OHN S TACHURSKI October 10. then ω0 ∈ / B. P( A ∩ B) = 0. . so the sum of the probabilities is one. then 1{ω0 ∈ A ∪ B} = 1{ω0 ∈ A} + 1{ω0 ∈ B} 1 is immediate from the definition of an indicator function. and hence P(C) = 1/3. Therefore A is not independent of C. If ω0 ∈ A.1. then 1 { ω0 ∈ A ∪ B } = 0 = 1 { ω0 ∈ A } + 1 { ω0 ∈ B } We have shown that 1–3 hold. Regarding 3. EXERCISES 43 Solution to Exercise 1. . Solution to Exercise 1.5.2 on page 6. P( A ∩ B) = 0. we can transform the right-hand side to obtain P( Ac ∩ Bc ) = (1 − P( A))(1 − P( B)) = P( Ac )P( Bc ) In other words. or.4 (page 4).1). We have P( Ac ∩ Bc ) = P(( A ∪ B)c ) = 1 − P( A ∪ B) Applying fact 1. in this case. it suffices to show that P( A ∪ B) = 1.10.6 (page 6).1. P( A ∩ B) = P( B). and hence takes only finitely many different values.) On the other hand. we obtain 0 ≤ P( A ∩ B ) ≤ P ( A ) = 0 Therefore P( A ∩ B) = 0 as claimed. s→∞ lim Fx (s) = lim P{ x ≤ s} ≤ lim P(Ω) = 1 s→∞ s→∞ From these two inequalities we get 1 ≤ lims→∞ Fx (s) ≤ 1. The proof of independence is essentially the same as the proof of independence of A and B in example 1. Finally. or. This last equality is implied by monotonicity of P. 2013 .6.5.5. For this m.2 and independence. Ac and Bc are independent. Now suppose that P( A) = 1.5.7. In view of fact 1. J OHN S TACHURSKI October 10.1. Solution to Exercise 1.5. Solution to Exercise 1. The proof is almost identical to the proof of additivity in example 1. Using nonnegativity and monotonicity of P (fact 1. Next. Let m be the largest such value. then either a = 0 or a = 1.8.1. we have P( A ∩ B ) = P( A ) + P( B ) − P( A ∪ B ) Since P( A) = 1. EXERCISES 44 Solution to Exercise 1. We claim that P( A ∩ B) = P( A)P( B). which is equivalent to lims→∞ Fx (s) = 1.1. in this case. because 1 = P( A) ≤ P( A ∪ B) ≤ 1.1. suppose that A is independent of itself. If a = a2 . let A and B be independent.1. we have lim Fx (s) ≥ Fx (m) = P{ω ∈ Ω : x (ω ) ≤ m} = P(Ω) = 1 s→∞ (The inequality is due to the fact that Fx is increasing. We claim that P( A ∩ B) = P( A)P( B). We are assuming that x has finite range. Then P( A) = P( A ∩ A) = P( A)P( A) = P( A)2 . Suppose that P( A) = 0 and that B ∈ F . √ s } = P{ x ≤ √ √ s} = F ( s) Solution to Exercise 1. Solution to Exercise 1. As a result.1. P{z ≤ s} = P{G−1 (u) ≤ s} = P{G(G−1 (u)) ≤ G (s)} = P{u ≤ G(s)} = G(s) We have shown that z ∼ G as claimed.1.14.8) and monotonicity of P.13.) Solution to Exercise 1. That 0 ≤ p j ≤ 1 for each j follows immediately from the definition of P.1 on page 17 then yields F| x| (s) = P{−s < x ≤ s} = F (s) − F (−s) The claim F| x| (s) = 2 F (s) − 1 now follows from the definition of symmetry. Let z := G −1 (u). Since G is monotone increasing we have G ( a) ≤ G (b) whenever a ≤ b. we have j =1 ∑ J pj = j =1 ∑ P { x = s j } = P ∪ j =1 { x = s j } = P ( Ω ) = 1 J J (1.5.1.12. EXERCISES 45 Solution to Exercise 1.17.15. we have F| x| (s) := P{| x | ≤ s} = P{−s ≤ x ≤ s} = P{ x = −s} + P{−s < x ≤ s} By assumption. Fix s ≥ 0. Solution to Exercise 1.) Using (1. for any s ∈ R. Using additivity over disjoint sets. and hence ω ∈ B. we then have y(ω ) = 1. (If ω ∈ A. In addition.32) (We are using the fact that the sets { x = s j } disjoint. We want to show that z ∼ G. then A ⊂ B.5. J OHN S TACHURSKI October 10. P{ x = −s} = 0. Evidently G (s) = 0 when s < 0.5. If x (ω ) := 1{ω ∈ A} ≤ 1{ω ∈ B} =: y(ω ) for any ω ∈ Ω. G (s) = F ( s)1{s ≥ 0}. Why is this always true? Look carefully at the definition of a function given in §13. we then have E [ x ] = E [1{ ω ∈ A } ] = P( A ) ≤ P( B ) = E [1{ ω ∈ B } ] = E [ y ] as was to be shown.5. 2013 . then x (ω ) = 1. using additivity of P.2. Since x (ω ) ≤ y(ω ) ≤ 1.5.5. Applying fact 1. For s ≥ 0 we have P{ x2 ≤ s} = P{| x| ≤ √ Thus. you should convince yourself that if a. and for any random variable x we have | x | j ≤ | x |k + 1.5. regarding the cdf of w. b and c are three numbers. we have P{ z ≤ s } = P[ { x ≤ s } ∩ { y ≤ s } ] = P{ x ≤ s }P{ y ≤ s } = s2 By differentiating we get the density p(s) = 2s.5.24.23. we then have E [| x | j ] ≤ E [| x k |] + 1. then a j ≤ 1. for any a ≥ 0.5. 1]. Solution to Exercise 1. for any s ∈ R. Hence the j-th moment exists whenever the k-th moment exists.1. J OHN S TACHURSKI October 10. Independence of u and v can be confirmed via (1. for s ∈ [0. Solution to Exercise 1. and by integrating E [z] = 2/3. • {z ≤ s} = { x ≤ s} ∩ {y ≤ s} • {w ≤ s} = { x ≤ s} ∪ {y ≤ s} (For each. we have a j ≤ ak + 1. EXERCISES 46 Solution to Exercise 1. Thus.) Now. Let u := f ( x ) = 2x and v := g(y) = 3y − 1. Using monotonicity of expectations (fact 1. then ω is in the left-hand side. y}. show that if ω is in the right-hand side. If a < 1. v ≤ s2 } = P{ x ≤ s1 /2. next convince yourself that. Fixing s1 and s2 in R.1.18. Let a be any nonnegative number. and let j ≤ k. x and y are independent uniform random variables on [0. As in the statement of the exercise. 1].6 on page 14) and E [1] = 1.5. 2013 . z := max{ x. and vice versa.22) on page 27. we have P{u ≤ s1 . where x and y are independent. Finally. b} ≤ c if and only if a ≤ c or b ≤ c Using these facts. y ≤ (s2 + 1)/3} = P{ x ≤ s1 /2}P{y ≤ (s2 + 1)/3} = P{u ≤ s1 }P{v ≤ s2 } Thus u and v are independent as claimed. 1] we have 1 0 sp ( s ) ds we get P{ w ≤ s } = P[ { x ≤ s } ∪ { y ≤ s } ] = P{ x ≤ s } + P{ y ≤ s } − P[ { x ≤ s } ∩ { y ≤ s } ] Hence P{w ≤ s} = 2s − s2 . for s ∈ [0. y} and w := min{ x. then a j ≤ ak . If a ≥ 1. equality. b} ≤ c if and only if a ≤ c and b ≤ c • min{ a. then • max{ a. As a first step to the proofs. Consider first the case where P{y = −1} = P{y = 1} = 0. you should be able to convince yourself that x2 = 1{| x | ≥ δ} x2 + 1{| x | < δ} x2 ≥ 1{| x | ≥ δ}δ2 Using fact 1. Hence xn → 0 fails.4.29.5} = 1 Thus. u] = 0 as claimed. Pick any random variable x and δ > 0. the statement xn → 0 means that. then xn and x have the same distribution for all n. p P{| xn | > δ} = P{|y| > 0. since xn = y for all n. but not in probability.6 (page 14). given any δ > 0. zn := xn − x has distribution N (0.3 (page 13) and rearranging completes the proof that P{| x | ≥ δ} ≤ E [ x2 ] .26. y]2 var[y] + (1 − corr[ x. J OHN S TACHURSKI October 10.27. We want to give an example of a sequence of random variables { xn } and random variable x such that xn converges to x in distribution.5.1.1.5.28. Solution to Exercise 1. Solution to Exercise 1. the sequence does not converge to zero. Thus. 2013 .5. On the other hand. For example. Using y = we have var[ ∗ ( x ) + u] = var[y] = corr[ x. and xn → 0 holds.5. However. y]2 ) var[y] = var[ ∗ ( x )] + var[u] It follows (why?) that cov[ ∗ ( x ). fact 1.5. Then. EXERCISES 47 ∗ ( x ) + u and the results form exercise 1.1).1. δ2 Solution to Exercise 1. Many examples can be found by using IID sequences.5. Take δ = 0. P{| xn − x| ≥ δ} does not converge to zero. if { xn }∞ n=1 and x are IID standard normal random variables.30. By considering what happens at an arbitrary ω ∈ Ω.5. 2) and δ be any strictly positive constant. From the definition of convergence in probability (see §1. if P{y = 0} = 1. p Solution to Exercise 1. Letting z be any random variable with distribution N (0. and hence xn converges in distribution to x. we have P{| xn − x| ≥ δ} = P{|z| ≥ δ} > 0. 2) for all n.5. we have P{| xn | > δ} → 0. then for any δ > 0 we have p P{| xn | > δ} = P{|y| > δ} = 0 This sequence does converge to zero (in fact it’s constant at zero). 5. 2013 . Using fact 1.5. we obtain var 1 N n =1 ∑ N xn = 1 N2 n =1 ∑ σ2 + N 2 ∑ N 2 cov[ xn . By linearity of expectations.5. ¯N ] = E [x 1 N n =1 ∑ E [ xn ] = N N N sF (ds) = sF (ds) ¯ N ] → sF (ds) as claimed.3. this reduces to var[ x J OHN S TACHURSKI October 10. This confirms that E [ x 2 let σ be the common variance of each xn . To see that var[ x ¯ N ] → 0 as N → ∞. By independence. EXERCISES 48 Solution to Exercise 1. which converges to zero.1. xm ] n<m ¯ N ] = σ2 / N .31. .  where xn ∈ R for each n  . This set is denoted by R N . while the second deals with random matrices. as a column of numbers. We could also write x horizontally. We start our story from the beginning. when we come to deal with matrix algebra. so it makes no difference whether they are written vertically or horizontally. the set of all N -vectors.Chapter 2 Linear Algebra The first part of this chapter is mainly about solving systems of linear equations. we are viewing vectors just as sequences of numbers. 2. 49 . we will distinguish between column (vertical) and row (horizontal) vectors. with the notions vectors and matrices. Later. x N ). . like so: x = ( x1 .1 Vectors and Matrices [roadmap] 2. for arbitrary N ∈ N. .1 Vectors An important set for us will be.1. . and a typical element is of the form   x1  x2    x =  . . At this stage. or vectors of length N .  xN Here x has been written vertically.  .1. analogous to addition. Figures 2. In the figure. . starting at the origin and ending at the location in R2 defined by the vector. 1 :=  . vectors are represented as arrows.    .  αxN Thus. 0 :=  .     .  :=   . If x ∈ R N and y ∈ R N .1) October 10. then the sum is defined by       x1 y1 x1 + y1  x2   y2   x2 + y2        x + y :=:  . xN yN xN + yN If α ∈ R.  +  . The way to remember this is to draw a line from y to x. . 2013 . and defined as the sum of the products of their elements: x y := n =1 ∑ xn yn = y x N 1/2 2 xn N The (euclidean) norm of a vector x ∈ R N is defined as √ x := J OHN S TACHURSKI x x := n =1 ∑ (2. We have defined addition and scalar multiplication of vectors. Subtraction is performed element by element. VECTORS AND MATRICES 50 The vector of ones will be denoted 1.  1 0 For elements of R N there are two fundamental algebraic operations: addition and scalar multiplication.1 and 2.   . . and then shift it to the origin.1 show examples of vector addition and scalar multiplication in the case N = 2. An illustration of this operation is given in figure 2. but not subtraction. addition and scalar multiplication are defined in terms of ordinary addition and multiplication in R.2. x − y := x + (−1)y.  . The inner product of two vectors x and y in R N is denoted by x y. The definition can be given in terms of addition and scalar multiplication. . while the vector of zeros will be denoted 0:     1 0  . and computed element-by-element. then the scalar product of α and x is defined to be   α x1  α x2    αx :=  . by adding and multiplying respectively.  . .3. 1: Vector addition Figure 2. 2013 .2: Scalar multiplication J OHN S TACHURSKI October 10. VECTORS AND MATRICES 51 Figure 2.2.1. 3 again. the value x − y has the interpretation of being the “distance” between these points. J OHN S TACHURSKI October 10.3. x+y ≤ x + y . y ∈ R N .1. Given two vectors x and y.) Fact 2.3: Difference between vectors and represents the length of the vector x. (In the arrow representation of vectors in figures 2. 2013 . 2.1.2. while the fourth is called the Cauchy-Schwartz inequality. αx = |α| x .1–2.1. x ≥ 0 and x = 0 if and only if x = 0. For any α ∈ R and any x. consult figure 2. y . the norm of the vector is equal to the length of the arrow. 4. To see why. |x y| ≤ x The third property is called the triangle inequality. the following properties are satisfied by the norm: 1. 3. VECTORS AND MATRICES 52 Figure 2.  . . . If.  .1. In the former case. such as a11 x1 + a12 x2 + · · · + a1K x3 = b1 a21 x1 + a22 x2 + · · · + a2K x3 = b2 . .2. . (Clearly every diagonal matrix is symmetric. ··· ··· a1K a2K . . For obvious reasons. A is called a row vector. written in the following way:    A=  a11 a21 . then A is called the identity J OHN S TACHURSKI October 10. . When convenient. in addition to being diagonal. If A is N × K and N = K. .1. . we will use the notation rown (A) to refer to the n-th row of A. 2013 . For a square matrix A. . a NK      a N1 a N2 · · · Often. . while in the latter case it is called a column vector. the matrix A is also called a vector if either N = 1 or K = 1. then A is called symmetric. In matrix A. each element ann along the principal diagonal is equal to 1. . a12 a22 . then A is called square.  a N1 a N2 · · · a NN A is called diagonal if the only nonzero entries are on the principal diagonal. a N 1 x1 + a N 2 x2 + · · · + a NK x3 = b N We’ll explore this relationship further after some more definitions. .2 Matrices A N × K matrix is a rectangular array A of real numbers with N rows and K columns. the symbol ank stands for the element in the n-th row of the k-th column. . the values ank in the matrix represent coefficients in a system of linear equations.) If. the N elements of the form ann for n = 1. . N are called the principal diagonal:   a11 a12 · · · a1 N  a   21 a22 · · · a2 N   . VECTORS AND MATRICES 53 2. . . and colk (A) to refer to it’s k-th column. . . in addition ank = akn for every k and n. .      a11 · · · a1K b11 · · · b1 J c11 · · · c1 J  a      21 · · · a2K   b21 · · · b2 J   c21 · · · c2 J  =  .  . and denoted by I:    I :=   1 0 . 2013 . a1K + b1K a2K + b2K . . If A and B are two matrices. .  . . . . . The first two. . VECTORS AND MATRICES 54 matrix.1. we let     a11 a12 · · · a1K γ a11 γ a12 · · · γ a1K   γa  a γ a22 · · · γ a2K  21    21 a22 · · · a2K  γ . . . . . . . Now let’s look at multiplication of matrices. 1      0 0 ··· Just as was the case for vectors. j-th element the inner product of the i-th row of A and the j-th column of B. . . .2.  a N1 · · · a NK bK 1 · · · bK J c N1 · · · cN J Here c11 is computed as c11 = row1 (A) col1 (B) = k =1 ∑ a 1 k bk 1 K There are many good tutorials for multiplying matrices on the web (try Wikipedia. . . . a NK γa N1 γa N2 · · · γ a NK ··· ··· . For example. . so I’ll leave it to you to get a feeling for this operation. .  . . . .  . 0 ··· 1 ··· . . the matrices have to have the same number of rows and columns in order for the definition to make sense. . .   . then their product AB is formed by taking as it’s i. . . . . ··· ··· . . for example). . ··· ··· . . . .  :=  . . . J OHN S TACHURSKI October 10.  a N1 a N2 · · · while  a11  a  21  . a NK + b NK      a N1 · · · bN1 · · · a N1 + bN1 · · · In the latter case. .  . . . consider the following product. . .  . . . . 0 0 . a1K a2K . are immediate generalizations of the vector case: For γ ∈ R. scalar multiplication and addition. . . .  . . a number of algebraic operations are defined for matrices.      . a NK       +   b11 b21 . . . b NK     :=       a11 + b11 a21 + b21 . b1K b2K . . . . . 1. The function f : R → R defined by f ( x ) = x2 is nonlinear. because if we take α = β = x = y = 1. we’ll say “for two conformable matrices A and B. we have • A(BC) = (AB)C • A(B + C) = AB + AC • (A + B)C = AC + BC (Here.1. in that AB and BA are not generally the same thing. In other words. We will see that these two perspectives are very closely related. while α f ( x ) + β f (y) = 1 + 1 = 2. the two are not generally equal. etc. The resulting matrix AB is N × M. 2013 .3 Linear Functions One way to view matrices is as objects representing coefficients in linear systems of equations. Even in this case.1. the product AB satisfies xyz” if the dimensions of A and B are such that the product is well defined. Other than that. Let’s begin with a definition: A function f : R N → R M is called linear if f (αx + βy) = α f (x) + β f (y) for any x. Indeed BA is not well-defined unless N = M also holds. Another way to view matrices is as functions. or mappings from one space to another. In particular. VECTORS AND MATRICES 55 Since inner products are only defined for vectors of equal length. then we require K = J . if A is N × K and B is J × M. B and C. it is clear that multiplication is not commutative. multiplication behaves pretty much as we’d expect.1. we are using the word “conformable” to indicate dimensions are such that the operation in question makes sense. then f (α x + βy) = f (2) = 4. Here’s the rule to remember: product of N × K and K × M is N × M From the definition.2. Put differently. β ∈ R Example 2. the number of columns of A is equal to the number of rows of B. and similarly for addition. For example. y ∈ R N and α. this requires that the length of the rows of A is equal to the length of the columns of B.) 2. J OHN S TACHURSKI October 10. for conformable matrices A. I can’t give you any more examples because there aren’t any more. and any scalars α and β. The affine function f : R → R defined by f ( x ) = 1 + 2x is nonlinear. The proof of the second part of theorem 2. How about some more examples of a linear functions. if f : RK → R N is linear. Moreover. to help you grasp the intuition? As I mentioned just above. then there exists an N × K matrix A such that f (x) = Ax for all x ∈ RK . f is linear. If A is an N × K matrix and f : RK → R N is defined by f (x) = Ax. The function f : R → R defined by f ( x ) = 2x is linear.2.1. then f (α x + βy) = f (2) = 5.1.1. The next theorem states this result. because if we take any α. Let’s look at some of the details.3.6. 2013 . Theorem 2.1. y in RK . β. while α f ( x ) + β f (y) = 3 + 3 = 6.2. it turns out that the functions defined by matrices are the only linear functions. we can identify A with the function f (x) = Ax. because if we take α = β = x = y = 1. we are considering the operation of sending a vector x ∈ RK into a new vector y = Ax in R N . y in R. these functions defined by matrices have a special property: they are all linear.1.1 is an exercise (exercise 2. Conversely. Now let’s return to matricies. 2.1. then the function f is linear. the set of linear functions from RK to R N and the set of N × K matrices are essentially the same thing. When we think of an N × K matrix A as a mapping.1. The rules of matrix addition and scalar multiplication tell us that f (αx + βy) := A(αx + βy) = Aαx + A βy = αAx + βAy =: α f (x) + β f (y) In other words. pick any x. then f ( α x + β y ) = 2( α x + β y ) = α2 x + β2y = α f ( x ) + β f ( y ) Example 2. x. A defines a function from RK to R N . VECTORS AND MATRICES 56 Example 2.1. given an N × K matrix A. Among the collection of all functions from RK to R N .3).3.4 Maps and Linear Equations In §2. In other words. In this sense. That is. we started to think about matrices as maps. To see that f is a linear function. Now let’s think about linear J OHN S TACHURSKI October 10. Take a fixed N × K matrix A and consider the function f : RK → R N defined by f (x) = Ax. 1] over which f is defined is called the domain of f . and consists of all y such that f ( x ) = y for some x in the domain [0. For one. The best way to think about this is that we want to invert the mapping f (x) = Ax. and consider the equation Ax = b for some fixed b. Consider figure 2. we want to find f −1 (b). there may be multiple x with f (x) = b. In turn. this last question turns out to be closely connected to the issue of uniqueness of solutions. VECTORS AND MATRICES 57 equations. Secondly. The set [0. Even if the equation f ( x ) = b has a solution. To grasp these ideas takes a bit of effort but is certainly worthwhile. there may be no x such that f (x) = b. 1]. 1] → R. where A is a given matrix. the advantage being that we can graph the function and gain visual intuition. Evidently. That is. you will know that there are various potential problems here. b falls outside rng( f ) and there is no solution.5. the solution may not be unique. the rank is related to whether or not the columns of A are linearly independent or not. 2013 . Figure 2. which graphs a onedimensional function f : [0. Conveniently. let’s go back to the general problem of inverting functions in the one-dimensional case. we want to find a vector x such that Ax = b.6 gives an example. Before tackling this problem directly. Here both x1 and x2 solve f ( x ) = b. This is more likely if the range of f is “large. The canonical problem is as follows: Given A and a fixed vector b. for an arbitrary function f : X → Y . because what we want to do is find an x such that f (x) = b. When will this equation have a solution? In view of our preceding discussion. Let f (x) = Ax. In figure 2. which is a measure of the size of its range.1.4. have another look at the function f in figure 2.4. the preimage of b under f . Let’s get started. the answer is that there will be a solution if and only if b is in rng(A) := rng( f ).2. the range of f is rng( f ) := {y ∈ Y : f ( x ) = y for some x ∈ X } Returning to the problems of existence and uniqueness of solutions. The red interval is called the range of f . Let’s return now to the matrix setting. J OHN S TACHURSKI October 10.” One way to check this by looking at something called the rank of A. More generally. These problems concern existence and uniqueness of solutions respectively. Now if you know anything about inverting functions. The other issue we must consider is uniqueness. the equation f ( x ) = b has a solution if and only if b ∈ rng( f ). 2.1. VECTORS AND MATRICES 58 b 0 x 1 Figure 2.4: Preimage of b under f b 0 1 Figure 2.5: No solution J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 59 b 0 x1 x2 1 Figure 2.6: Multiple solutions 2.2 Span, Dimension and Independence Motivated by the preceding discussion, we now introduce the important notions of linear subspaces, spans, linear independence, basis and dimension. 2.2.1 Spans and Linear Subspaces Given K vectors x1 , . . . , xK in R N , we can form linear combinations, which are vectors of the form y= k =1 ∑ α k x k = α1 x1 + · · · + α K x K K for some collection α1 , . . . , αK of K real numbers. Fact 2.2.1. Inner products of linear combinations satisfy the following rule: k =1 ∑ αk xk K j =1 ∑ β j yj J = k =1 j =1 ∑ ∑ αk β j xk y j K J J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 60 The set of all linear combinations of X := {x1 , . . . , xK } is called the span of X , and denoted by span( X ): span( X ) := all vectors k =1 ∑ αk xk such that α := (α1, . . . , αK ) ∈ RK K Let Y be any subset of R N , and let X be as above. If Y ⊂ span( X ), we say that the vectors X := {x1 , . . . , xK } span the set Y , or that X is a spanning set for Y . This is a particularly nice situation when Y is large but X is small, because it means that all the vectors in the large set Y are “described” by the small number of vectors in X . Example 2.2.1. Let X = {1} = {(1, 1)} ⊂ R2 . The span of X is all vectors of the form (α, α) with α ∈ R. This constitutes a line in the plane. Since we can take α = 0, it follows that the origin 0 is in span( X ). In fact span( X ) is the unique line in the plane that passes through both 0 and the vector 1 = (1, 1). Example 2.2.2. Consider the vectors {e1 , . . . , e N } ⊂ R N , where en has all zeros except for a 1 as the n-th element. The case of R2 , where e1 := (1, 0) and e2 := (0, 1), is illustrated in figure 2.7. The vectors e1 , . . . , e N are called the canonical basis vectors of R N —we’ll see why later on. One reason is that {e1 , . . . , e N } spans all of R N . To see this in the case of N = 2 (check general N yourself), observe that for any y ∈ R2 , we have y := y1 y2 = y1 0 + 0 y1 = y1 1 0 + y2 0 1 = y1 e1 + y2 e2 Thus, y ∈ span{e1 , e2 } as claimed. Since y is just an arbitrary vector in R2 , we have shown that {e1 , e2 } spans R2 . Example 2.2.3. Consider the set P := {( x1 , x2 , 0) ∈ R3 : x1 , x2 ∈ R}. Graphically, P corresponds to the flat plane in R3 , where the height coordinate is always zero. If we take e1 = (1, 0, 0) and e2 = (0, 1, 0), then given y = (y1 , y2 , 0) ∈ P we have y = y1 e1 + y2 e2 . In other words, any y ∈ P can be expressed as a linear combination of e1 and e2 , and {e1 , e2 } is a spanning set for P. Fact 2.2.2. Let X and Y be any two finite subsets of R N . If X ⊂ Y , then we have span( X ) ⊂ span(Y ). One of the key features of the span of a set X is that it is “closed” under the linear operations of vector addition and scalar multiplication, in the sense that if we take J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 61 Figure 2.7: Canonical basis vectors in R2 elements of the span and combine them using these operations, the resulting vectors are still in the span. For example, to see that the span is closed under vector addition, observe that if X = {x1 , . . . , xK } and y, z are both in span( X ), then we can write them as y= k =1 ∑ αk xk K K and z= k =1 ∑ β k xk K for a suitable scalars αk , β k ∈ R. It then follows that y+z = k =1 ∑ (αk + βk )xk ∈ span(X ) Hence span( X ) is closed under vector addition as claimed. Another easy argument shows that span( X ) is closed under scalar multiplication. The notion of a set being closed under scalar multiplication and vector addition is important enough to have its own name: A set S ⊂ R N with this property is called a linear subspace of R N . More succinctly, a nonempty subset S of R N is called a linear subspace if, for any x and y in S, and any α and β in R, the linear combination αx + βy is also in S. Example 2.2.4. It follows immediately from the proceeding discussion that if X is any finite nonempty subset of R N , then span( X ) is a linear subspace of R N . For this reason, span( X ) is often called the linear subspace spanned by X . J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 62 Figure 2.8: The vectors e1 and x Example 2.2.5. In R3 , lines and planes that pass through the origin are linear subspaces. Other linear subspaces of R3 are the singleton set containing the zero element 0, and the set R3 itself. Fact 2.2.3. Let S be a linear subspace of R N . The following statements are true: 1. The origin 0 is an element of S. 2. If X is a finite subset of S, then span( X ) ⊂ S. 2.2.2 Linear Independence In some sense, the span of X := {x1 , . . . , xK } is a measure of the “diversity” of the vectors in X —the more diverse are the elements of X , the greater is the set of vectors that can be represented as linear combinations of its elements. In fact, if X is not very diverse, then some “similar” elements may be redundant, in the sense that one can remove an element xi from the collection X without reducing its span. Let’s consider two extremes. First consider the vectors e1 := (1, 0) and e2 := (0, 1) in R2 (figure 2.7). As we saw in example 2.2.2, the span {e1 , e2 } is all of R2 . With just these two vectors, we can span the whole plane. In algebraic terms, these vectors are relatively diverse. We can also see their diversity in the fact that if we remove one of the vectors from {e1 , e2 }, the span is no longer all of R2 . In fact it is just a line in R2 . Hence both vectors have their own role to play in forming the span. Now consider the pair e1 and x := −2e1 = (−2, 0), as shown in figure 2.8. This pair is not very diverse. In fact, if y ∈ span{e1 , x}, then, for some α1 and α2 , y = α1 e1 + α2 x = α1 e1 + α2 (−2)e1 = (α1 − 2α2 )e1 ∈ span{e1 } J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 63 In other words, any element of span{e1 , x} is also an element of span{e1 }. We can kick x out of the set {e1 , x} without reducing the span. Let’s translate these ideas into formal definitions. In general, the set of vectors X := {x1 , . . . , xK } in R N is called linearly dependent if one (or more) vector(s) can be removed without changing span( X ). We call X linearly independent if it is not linearly dependent. To see this definition in a slightly different light, suppose that X := {x1 , . . . , xK } is linearly dependent, with span{x1 , . . . , xK } = span{x2 , . . . , xK } Since x1 ∈ span{x1 , . . . , xK } certainly holds, this equality implies that x1 ∈ span{x2 , . . . , xK } Hence, there exist constants α2 , . . . , αK with x1 = α2 x2 + · · · α K x K In other words, x1 can be expressed as a linear combination of the other elements in X . This is a general rule: Linear dependence means that at least one vector in the set can be written as a linear combination of the others. Linear independence means the opposite is true. The following fact clarifies matters further: Fact 2.2.4 (Definitions of linear independence). The following statements are all equivalent: 1. The set X := {x1 , . . . , xK } is linearly independent. 2. If X0 is a proper subset of X , then span( X0 ) is a proper subset of span( X ).1 3. No vector in X can be written as a linear combination of the others. K 4. If ∑k =1 αk xk = 0, then α1 = α2 = · · · = αK = 0. K 5. If α j = 0 for some j, then ∑k =1 α k x k = 0 . 1A is a proper subset of B if A ⊂ B and A = B. J OHN S TACHURSKI October 10, 2013 2.2. SPAN, DIMENSION AND INDEPENDENCE 64 Part 5 is just the contrapositive of part 4, and hence the two are equivalent. (See §13.3 if you don’t know what a contrapositive is.) The equivalence of part 4 and part 3 might not be immediately obvious, but the connection is clear when you think about K it. To say that (α1 , . . . , αK ) = 0 whenever ∑k =1 αk xk = 0 means precisely that no xk can be written as a linear combination of the other vectors. For example, if there K does exist some α j = 0 with ∑k =1 αk xk = 0, then x j = ∑k = j ( − αk / α j ) xk . Example 2.2.6. The set of canonical basis vectors in example 2.2.2 is linearly indeK pendent. Indeed, if α j = 0 for some j, then ∑k =1 α k e k = ( α 1 , . . . , α K ) = 0 . One reason for our interest in the concept of linear independence lies in the following problem: We know when a point in R N can be expressed as a linear combination of some fixed set of vectors X . This is true precisely when that point is in the span of X . What we do not know is when that representation is unique. It turns out that the relevant condition is independence: Theorem 2.2.1. Let X := {x1 , . . . , xK } be any collection of vectors in R N , and let y be any vector in span( X ). If X is linearly independent, then there exists one and only one set of K scalars α1 , . . . , αK such that y = ∑k =1 α k x k . Proof. Since y is in the span of X , we know that there exists at least one such set of scalars. Suppose now that there are two. In particular, suppose that y= k =1 ∑ K αk xk = k =1 ∑ β k xk K K It follows from the second equality that ∑k =1 ( αk − β k ) xk = 0. Using fact 2.2.4, we conclude that αk = β k for all k. In other words, the representation is unique. 2.2.3 Dimension In essence, the dimension of a linear subspace is the minimum number of vectors needed to span it. To understand this idea more clearly, let’s look at an example. Consider the plane P := {( x1 , x2 , 0) ∈ R3 : x1 , x2 ∈ R} (2.2) from example 2.2.3. Intuitively, this plane is a “two-dimensional” subset of R3 . This intuition agrees with the definition above. Indeed, P cannot be spanned by one vector, for if we take a single vector in R3 , then the span created by that singleton is J OHN S TACHURSKI October 10, 2013 e2 } is a basis. Applying theorem 2. we see that B2 has at most K1 elements. If S is a linear subspace of R N .2. The pair {e1 . if S is spanned by K vectors.2. . K2 ≤ K1 . Proof. On the other hand. We now come to a key definition. Moreover. By definition.2.8.2. a line through the origin is a one-dimensional subspace. and its span is equal to all of R N . because the canonical basis vectors form a basis. Hence K1 = K2 . Theorem 2. . In R3 . Example 2.8. In other words. and there are N canonical basis vectors. . That is.2. then every basis of S has the same number of elements.7. {e1 . if P is the plane in (2. Consider the set of canonical basis vectors {e1 . any collection of 3 or more vectors from P will be linearly dependent. If S is a linear subspace of R N and B is some finite subset of R N . If S is a linear subspace of R N spanned by K vectors. This common number is called the dimension of S. The following theorem contains the general statement of this idea: Theorem 2.2. For example. then every linearly independent subset of S has at most K vectors.2. but it is a little long. The whole space R N is N dimensional. . and written as dim(S). e2 } is a basis for the set P defined in (2. The proof is not overly hard. P can be spanned by two vectors.3 states that if S is a linear subspace of R N .2. As a result. Put differently. .2). .3. while a plane through the origin is a two-dimensional subspace. Example 2. then any subset of S with more than K vectors will be linearly dependent. not a plane. . Theorem 2. Let B1 and B2 be two bases of S.2). 2013 . with K1 and K2 elements respectively. J OHN S TACHURSKI October 10. then dim( P) = 2. While P can also be spanned by three or more vectors. then every basis of S has the same number of elements. e N } ⊂ R N described in example 2. DIMENSION AND INDEPENDENCE 65 only a line in R3 . S is spanned by the set B1 .2. . then B is called a basis of S if B spans S and is linearly independent. B2 is a linearly independent subset of S.3. Readers keen to learn more will find it in most texts on linear algebra. because the set {e1 .2.2.2. and this set contains two elements. as we saw in example 2. it turns out that one of the vectors will always be redundant—it does not change the span. SPAN. which has K1 elements. e N } is a basis for R N —as anticipated by the name. Repeating the same argument while reversing the roles of B1 and B2 we obtain K1 ≤ K2 . This set is linearly independent. This result is sometimes called the exchange theorem. 1. Lemma 2. 2. Proof.3 Matrices and Equations [Roadmap] 2.2. 2. Conversely.1 is important in what follows.2 implies that B has no more than K elements. xK } ⊂ R N . 2013 . Therefore.3.1. . B is a linearly independent subset of span( X ).4.3–2. If b ∈ R N is given and we are looking for an x to solve Ax = b. Hence. We take A to be an N × K matrix. By definition. Regarding part 1. and also rather intuitive. dim(span( X )) ≤ K. By part 1 of this theorem. Since X has K elements. Then X is a basis for span( X ). It says that the span of a set will be large when it’s elements are algebraically diverse.1 Rank Let’s now connect matrices to our discussion of span.2.3. Let X := {x1 . Since span( X ) is spanned by K vectors. dim(span( X )) ≤ K − 1. we can view this matrix as a mapping f (x) = Ax from RK to R N . then how large will its span be in terms of dimension? The next lemma answers this question.2. For if X is not linearly independent. We will be particularly interested in solving equations of the form Ax = b for unknown x. suppose first that X is linearly independent. If we take a set of K vectors. dim(span( X )) = K if and only if X is linearly independent. theorem 2. The only N -dimensional linear subspace of R N is R N .2. then exists a proper subset X0 of X such that span( X0 ) = span( X ). we conclude that dim(span( X )) = K.2. if dim(span( X )) = K then X must be linearly independent. .5. let B be a basis of span( X ). As discussed in §2. . linear independence and dimension. dim(span( X )) ≤ K. then we know at least one such x will J OHN S TACHURSKI October 10. MATRICES AND EQUATIONS 66 Fact 2. we then have dim(span( X0 )) ≤ #X0 ≤ K − 1. .1. Regarding part 2. Then 1. Part 2 of lemma 2. Contradiction. 2. That is. The dimension of rng(A) is known as the rank of A.2. A is of full column rank if and only if the only x satisfying Ax = 0 is x = 0. A is said to have full column rank if rank(A) is equal to K. by definition. 2013 . By fact 2.1. When is this maximum achieved? By part 2 of lemma 2. Let’s return to the problem of solving the system Ax = b for some fixed b ∈ R N . theorem 2.3. rank(A) := dim(rng(A)) Furthermore. we require that b ∈ rng(A).2.3.2. this set is sometimes called the column space of A. Being defined as a span. and this range will be large when A is full column rank. That is. we’ll probably be wanting to check that rng(A) is suitably “large. by part 1 of lemma 2. the number of its columns. Why do we say “full” rank here? Because. for the system Ax = b to have a solution.1.1.2. . A is said to have full column rank when this maximum is achieved.1 translates to the following result: Fact 2. J OHN S TACHURSKI October 10. rng(A) := {Ax : x ∈ RK } Just a little bit of thought will convince you that this is precisely the span of the columns of A: rng(A) = span(col1 (A). As stated above. Fact 2.1. we will denote the range by rng(A) instead of rng( f ). If we want to check that this is true. In other words.3.4 on page 63.” The obvious measure of size for a linear subspace such as rng(A) is its dimension. the matrix A is of full column rank if and only if the columns of A are linearly independent. we have dim(rng(A)) ≤ K. . If A has full column rank and b ∈ rng(A). as follows immediately from theorem 2. colK (A)) For obvious reasons. For existence of a solution we need b ∈ rng(A). Even better.2.2. . . rng(A) is the span by K vectors. it is obviously a linear subspace of R N . the rank of A is less than or equal to K. the next characterization is also equivalent. Since A and f are essentially the same thing. the full column rank condition is exactly what we need for uniqueness as well. and hence. MATRICES AND EQUATIONS 67 exist if b is in the range of f . this while be the case precisely when the columns of A are linearly independent. In matrix terminology. Thus. then the system of equations Ax = b has a unique solution. So the property of A being full column rank will be connected to the problem of existence. A is of full column rank.3. the following are equivalent: 1. then A is called singular.) If det(A) = 0. Since this problem is so important. 2013 (2. it follows immediately that. For N × N matrix A.2. so that A is a square N × N matrix. For each b ∈ R N . so that its columns are independent. but note that it won’t be required in what follows. for any b ∈ R N . there are several different ways of describing it. but the general definition is a bit fiddly to state and hence we omit it.2 Square Matrices Now let N = K. If this is the case. A is invertible. It turns out that nonsingularity is also equivalent to invertibility. the system Ax = b has a solution. First. the solution is unique.2. we can associate a unique number det(A) called the determinant of A. Since rng(A) = R N . and written as A−1 . 5. 2. The determinant crops up in many ways. Let’s summarize our discussion: Fact 2. if A is square and full column rank.5 on page 66 implies that rng(A) is equal to R N . To repeat. Otherwise A is called nonsingular. then B is called the inverse of A. and its dimension is N by the full column rank assumption.3. Moreover. the existence of an inverse A−1 means that we can take our solution as x b = A −1 b since Axb = AA−1 b = Ib = b. The columns of A are linearly independent. J OHN S TACHURSKI October 10. MATRICES AND EQUATIONS 68 2. Indeed. the system Ax = b has unique solution A−1 b.3. Invertibility of A is exactly equivalent to the existence of a unique solution x to the system Ax = b for all b ∈ R N .2. because rng(A) is a linear subspace of R N by definition. by fact 2. the there is a unique x ∈ R N such that Ax = b. a square matrix A is called invertible if there exists a second square matrix B such that AB = BA = I. A is nonsingular. where I is the identity matrix. to each square matrix A. fact 2.3. In that case. Suppose that A is full column rank. then for any b ∈ R N .3) .3. In addition. (Look it up if you are interested. 4. 3. 3 Other Properties of Matrices The transpose of N × K matrix A is a K × N matrix A such that the first column of A is the first row of A.4. the second column of A is the second row of A.3.6. and 4. and so on. (αA)−1 = α−1 A−1 .3. Note also that A A and AA are well-defined and symmetric.2. Fact 2.4) A = Fact 2. given  10 40 A :=  20 50  30 60 the transposes are 10 20 30 40 50 60  1 2 B :=  3 4  5 6   1 3 5 2 4 6 B := (2. transposition satisfies the following: 1.3. 2. (A + B) = A + B 3. (A−1 )−1 = A. det(A−1 ) = (det(A))−1 .3. Note that a square matrix C is symmetric precisely when C = C. (AB)−1 = B−1 A−1 . 3. then 1. (AB) = B A 2. MATRICES AND EQUATIONS 69 The next fact collects useful results about the inverse. Fact 2. we have J OHN S TACHURSKI October 10. 2. For conformable matrices A and B.3. (cA) = cA for any constant c. For example. For each square matrix A. 2013 .5. If A and B are invertible and α = 0. If A is idempotent.9. The expression x Ax = j =1 i =1 ∑ ∑ aij xi x j N N is called a quadratic form in x. Notice that if A = I.3.7.3. MATRICES AND EQUATIONS 70 1. ∑ ann . then this reduces to x 2 . The rank of a matrix can be difficult to determine.  . Transposition does not alter trace: trace(A) = trace(A ). the identity matrix is positive definite. det(A ) = det(A). That is. . The trace of a square matrix is the sum of the elements on its principal diagonal. then trace(AB) = trace(BA). . Fact 2. As we have just seen. . (A−1 ) = (A )−1 whenever the inverse exists. which is positive whenever x is nonzero. then trace(αA + βB) = α trace(A) + β trace(B) Moreover. =  .  n =1 a N 1 a N 2 · · · a NN Fact 2. Fact 2.3. J OHN S TACHURSKI October 10. 2013 . A square matrix A is called idempotent if AA = A. and 2.   a11 a12 · · · a1 N  a  N  21 a22 · · · a2 N  trace  .8. The next two definitions generalize this idea: An N × N symmetric matrix A is called • nonnegative definite if x Ax ≥ 0 for all x ∈ R N .3.3. and • positive definite if x Ax > 0 for all x ∈ R N with x = 0. If A and B are N × N matrices and α and β are two scalars.4 Quadratic Forms Let A be N × N and symmetric. if A is N × M and B is M × N . . and let x be N × 1. . One case where it is easy is where the matrix is idempotent. 2.2. then rank(A) = trace(A). 10. If A is positive definite.3. then each element ann on the principal diagonal is nonnegative. some subsets of RK are so messy that it’s not possible to integrate over them. . . This contradicts the definition of positive definiteness.2. (Here and in what follows. with det(A) > 0.4.5) for each s in RK . . . If A is nonnegative definite. . We say that f : RK → R is the density of random vector x := ( x1 . the statement x ≤ s means that xn ≤ sn for n = 1. xK ). . . 2. Each realization of x is an element of RK .3. . xK ) if B f ( s ) d s = P{ x ∈ B } (2.6) for every subset B of RK . To see that positive definiteness implies full column rank. . . . . K. Each element ann on the principal diagonal is positive. .2). But then x Ax = 0 for nonzero x.11. . The distribution (or cdf) of x is the joint distribution F of ( x1 . . 2 Actually. J OHN S TACHURSKI October 10. s K ) : = P{ x1 ≤ s1 .4 Random Vectors and Matrices A random vector x is just a sequence of K random variables ( x1 . so we only require (2.6) to hold for a large but suitably well-behaved class of sets called the Borel sets.3. That is. then A must be full column rank. . for if not there exists a x = 0 with Ax = 0 (fact 2. . F ( s ) : = : F ( s1 . 2013 . 2.2 Most of the distributions we work with in this course have density representations. then: 1.1).2.) Just as some but not all distributions on R have a density representation (see §1. . See any text on measure theory for details. x K ≤ s K } : = : P{ x ≤ s } (2. consider the following argument: If A is positive definite. . RANDOM VECTORS AND MATRICES 71 Fact 2. xK ). . . . Fact 2. some but not all distributions on RK can be represented by a density. . . A is full column rank and invertible. . E [x N1 ] E [x N2 ] · · ·  . if X is a random N × K matrix.1 Expectations for Vectors and Matrices Let x = ( x1 . we briefly review some properties of random vectors and matrices. In this section.4. . E [ x NK ] Expectation of vectors and matrices maintains the linearity of scalar expectations: Fact 2.3. . .2: Fact 2.4.2. . x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N } We note the following multivariate version of fact 1. .2. given any s1 . . x N is called independent if. . . . E [ xK ] µK More generally.  . then E [A + BX + CY] = A + BE [X] + CE [Y] J OHN S TACHURSKI October 10. . In particular. . . . .4. s N .  . . . . . . a collection of random vectors x1 . B and C are conformable constant matrices. If x and y are independent and g and f are any functions. then its expectation E [X] is the matrix of the expectations: E [X] :=   E [ x11 ] E [ x12 ] · · · E [ x1K ]  E [ x ] E [ x22 ] · · · E [ x2K ]  21     . A random N × K matrix X is a rectangular N × K array of random variables. the definition of independence mirrors the scalar case. . . 2. . xK ) be a random vector taking values in RK with µk := E [ xk ] for all k = 1.    . we have P{ x1 ≤ s1 . . K.1.4. . RANDOM VECTORS AND MATRICES 72 For random vectors. . If X and Y are random and A. then f (x) and g(y) are also independent. The expectation E [x] of vector x is defined as the vector of expectations:     µ1 E [ x1 ]  E [ x2 ]   µ2     =  . . . 2013 .  =: µ E [x] :=    . In symbols. positive definite N × N matrix Σ. it can be shown that if x ∼ N (µ. 2013 . then E [x] = µ and var[x] = Σ We say that x is normally distributed if x ∼ N (µ. x) = E [(x − E [x])(x − E [x]) ] = E [(x − µ)(x − µ) ] Expanding this out. y] := E [(x − E [x])(y − E [y]) ] The variance-covariance matrix of random vector x with µ := E [x] is defined as var[x] := cov(x.4. positive definite N × N matrix. k-th term is the scalar covariance between x j and xk . Although we omit the derivations. we get  E [( x1 − µ1 )( x1 − µ1 )]  E [( x2 − µ2 )( x − µ )] 1 1  var[x] =  . we represent this distribution by N (µ. symmetric and nonnegative definite. . any constant conformable matrix A and any constant conformable vector a.2 Multivariate Gaussians The multivariate normal density or Gaussian density in R N is a function p of the form 1 p(s) = (2π )− N /2 det(Σ)−1/2 exp − (s − µ) Σ−1 (s − µ) 2 where µ is any N × 1 vector and Σ is a symmetric. we have var[a + Ax] = A var[x]A 2. Σ). We say that x is standard normal if µ = 0 and Σ = I.4. y] = E [xy ] − E [x]E [y] and var[x] = E [xx ] − E [x]E [x] Fact 2.  . J OHN S TACHURSKI October 10.  .4. E [( x N − µ N )( x1 − µ1 )] · · · ··· ··· .2.3. . Fact 2. For any random vector x. the principle diagonal contains the variance of each xn . Σ) for some N × 1 vector µ and symmetric. the variance-covariance matrix var[x] is square. Some simple algebra yields the alternative expressions cov[x. .4. For any random vector x.  . .4. As a result. E [( x N − µ N )( x N − µ N )] E [( x1 − µ1 )( x N − µ N )] E [( x2 − µ2 )( x N − µ N )]    The j. Σ). RANDOM VECTORS AND MATRICES 73 The covariance between random N × 1 vectors x and y is cov[x. 4.) 2. j-th element of X. Σ).5 Convergence of Random Matrices As a precursor to time series analyais. if both x and y are normally distributed and cov[ x.4. We say that Xn converges to a random I × J matrix X in probability (and write Xn → X) if every element of Xn converges to the corresponding element of X in probability in the scalar sense. If x ∼ N (µ. Normally distributed random variables are independent if and only if they are uncorrelated. That is. If x ∼ N (µ.4.2. I) and A is a conformable idempotent and symmetric matrix with rank(A) = j. then (x − µ) Σ−1 (x − µ) ∼ χ2 (k ). CONVERGENCE OF RANDOM MATRICES 74 Fact 2.4. Fact 2. y] = 0. Fact 2. then x Ax ∼ χ2 ( j). J OHN S TACHURSKI October 10.8. 2013 . In particular.7.5.4.5.4. we extend the probabilistic notions of convergence discussed in §1.3. n Xn → X whenever xij → xij for all i and j n is the i. (In view of fact 2.1 to random vectors and matrices.6. j-th element of X and x is the i. If x ∼ N (0. AΣA ).4. Σ). Fact 2.1). then a + Ax ∼ N (a + Aµ. where k := length of x. Here xij n ij 3 If p p p a = 0 then we can interpret a x as a “normal” random variable with zero variance.9. when using this result it is sufficient to show that trace(A) = j. What is important is that normality is preserved . then x and y are independent. N × 1 random vector x is normally distributed if and only if a x is normally distributed in R for every constant N × 1 vector a. let {Xn }∞ n=1 be a sequence of random I × J matrices.1 Convergence in Probability We already have a notion of scalar convergence in probability (see §1.3 Fact 2. Here. the fact that a + Ax has mean a + Aµ and variance-covariance matrix AΣA is not surprising.9. 2. Extending this to the matrix case.5. If Xn → X and An → A. then X− n → X . then p p xn → x if and only if xn − x → 0 In other words. Although fact 2. J OHN S TACHURSKI October 10. 2013 .1 is applicable: Xn → X if and only if Xn − X → 0. if we take an N × K matrix A. the same result is in fact true for matrices if we regard matrices as vectors. then X n + Y n → X + Y.1 is stated in terms of vectors. Thinking of matrices this way. fact 2.2. CONVERGENCE OF RANDOM MATRICES 75 If we are dealing with vectors ( I = 1 or J = 1 in the previous definition). p p p p 2. p p p p p Xn Yn → XY p and Yn Xn → YX p Xn An → XA p and An Xn → AX p In part 3 of fact 2.  p  . we have the following result. the following statements are true: 1 −1 1.5. The convergence An → A means that each element of An converges in the usual scalar sense to the corresponding element of A: n An → A means aij → aij for all i and j 4 There are various notions of matrix norms.5.5. In this connection. 3. then X n + A n → X + A.  n p xn :=  .5. Fact 2.5. we can also consider norm deviation. Assuming conformability.2.1. If Xn → X and Yn → Y. . . then the condition for convergence has the form  n    x1 x1  . we can think of it as a vector in R N ×K = R NK by stacking all the columns into one long column. In other words.2. or rows into one long row—it doesn’t matter which. If Xn → X and Xn and X are invertible. If {xn } is a sequence of random vectors in RK and x is a random vector in RK . The one defined here is called the Frobenius norm.4 Fact 2.5.  =: x whenever xk → xk for all k n xK xK With vectors. the matrices An and A are nonrandom. each element of xn converges in probability to the corresponding element of x if and only if the norm distance between the vectors goes to zero in probability. → . 2 can be used.5. Then we say that An → A if An − A → 0. one setting where convergence of the marginals implies convergence of the joint is when the elements of the vectors are independent.  d  . we have Fn (s) → F (s) as n → ∞ K Let {xn }∞ n=1 and x be random vectors in R . We say that Fn converges to F weakly if. where xn ∼ Fn and x ∼ F.7). this convergence is represented by xn → x. we have the following results. Applying it a second time yields the convergence in (2.2 once we get 2. As an example of how fact 2.5. 2013 . Fortunately. . convergence of xn to x in probability simply requires that the elements of xn converge in probability (in the scalar sense) to the corresponding elements of x. (As you might have guessed.  =: x n xK d xK Put differently.5. In symbols. → .  n d xk → xk for all k does not imply xn :=  . and let F be a cdf on R . convergence of the marginals does not necessarily imply convergence of the joint distribution. The two definitions can be shown to be equivalent.2 Convergence in Distribution Now let’s extend the notion of convergence in distribution to random vectors. which provide a link from scalar to vector convergence: J OHN S TACHURSKI October 10.5.2. with only the obvious modifications. let’s establish convergence of the quadratic form a Xn a → a Xa whenever a is a conformable constant vector and Xn → X a Xn → a X. because vector convergence is harder to work with than scalar convergence. CONVERGENCE OF RANDOM MATRICES 76 Alternatively. Applying fact 2.5. As discussed above. p p p (2. .) The fact that elementwise convergence in distribution does not necessarily imply convergence of the vectors is problematic. as discussed above.2. For convergence in distribution this is not generally true:  n    x1 x1  .7) This follows from two applications of fact 2. We say that xn converges in distribution to x if Fn converges weakly to F. The definition is almost identical to the scalar case. K K Let { Fn }∞ n=1 be a sequence of cdfs on R . and the joint is just the product of the marginals. for any s such that F is continuous at s. we can stack the matrices into vectors and take the norms. If Yn → C and xn → x. convergence in distribution is preserved under continuous transformations: Fact 2.4 (Continuous mapping theorem). CONVERGENCE OF RANDOM MATRICES 77 Fact 2.4.5. Let xn and x be random vectors in RK . we have the following result: Theorem 2. The second of these results is quite straightforward to prove (exercise 8. Let xn and x be random vectors in RK . For this sequence we have ¯ N := x 1 N 2] < ∞.5 (Slutsky’s theorem).3. n =1 ∑ xn → µ N p and √ ¯ N − µ ) → N ( 0. Σ ) N (x d (2. let Yn be random matrices. 2.5.2. If a xn → a x for any a ∈ RK . Let {xn } be an IID sequence of random vectors in RK with E [ xn Let µ := E [xn ] and let Σ := var[xn ]. 1. 2. Another result used routinely in econometric theory is the vector version of Slutsky’s theorem: Fact 2.4).5. we can move on to the next topic: Vector LLN and CLT. then g(xn ) → g(x).2).3 Vector LLN and CLT With the above definitions of convergence in hand. 2013 .8) J OHN S TACHURSKI October 10. It is often referred as the Cramer-Wold device.5. The first is more difficult (the standard argument uses characteristic functions).1. For example.5.4 extend to the vector case in a natural way. then xn → x.5.5. then Yn xn → Cx d p d d d p p d d and Yn + xn → C + x d whenever the matrices are conformable. As in the scalar case (fact 1. If a xn → a x for any a ∈ RK . and let C be a constant matrix. Let xn and x be random vectors in RK . If g : RK → R J is continuous and xn → x. then xn → x. The scalar LLN and CLT that we discussed in §1. let xn be as in theorem 2.1 follows from the scalar LLN. The claim x 5 Functions p of independent random variables are themselves independent (fact 1.9 illustrates the LLN in two dimensions. . . page 27). vector case Figure 2. J OHN S TACHURSKI October 10. In particular. The green dot is the point 0 = (0. .2. . The black dots are IID observations x1 .5. 0) in R2 .3. x N . so the summation and scalar multiplication in the sam¯ N is done elementwise—in this case for two elements. the red dot converges to the green dot.5. the ple mean x sample mean is a linear combination of the observations x1 . .4.1) we have 1 N But 1 N n =1 ∑ yn → E [yn ] = E [a xn ] = a E [xn ] = a µ N N p n =1 ∑ yn = 1 N n =1 ∑ p N a xn = a 1 N n =1 ∑ xn N ¯N =ax Since a was arbitrary. To see this.1.5. The red dot is the sample mean N ∑n =1 xn .5. x N of a random vector with N 1 mean µ = 0. The vector LLN in theorem 2. CONVERGENCE OF RANDOM MATRICES 78 Figure 2.9: LLN. The sequence {yn } inherets the IID property from {xn }.3.2. we have shown that ¯ N → a µ for any a ∈ RK ax ¯ N → µ now follows from fact 2.5 By the scalar LLN (theorem 1. (Remember that we are working with vectors here. .) By the vector LLN. 2013 . let a be any constant vector in RK and consider the scalar sequence {yn } defined by yn = a xn . . . 6. Find two unit vectors (i. show that if f : RK → R N is linear.4. 2.5. Choose the one that’s easiest to work with in terms of algebra. Let Q be the subset of R3 defined by Q := {( x1 .7.3. Ex. Ex.6. Prove the second part of theorem 2.10. 2. 2. show that | x − y | ≤ x − y . 2) are linearly independent.1.6.e.6.8 Ex.6. 2. Ex. x3 ) ∈ R3 : x2 = 1} Is Q a linear subspace of R3 ? Why or why not? Use the triangle inequality.6. vectors with norm equal to one) that are orthogonal to (1. Given two vectors x and y. then y = 0. 2.6 Exercises Ex.6.6. Let Q be the subset of R3 defined by Q := {( x1 . 2. x2 . Ex. 7 Hint: 6 Hint: J OHN S TACHURSKI October 10. x3 ) ∈ R3 : x2 = x1 + x3 } Is Q a linear subspace of R3 ? Why or why not? Ex. Show that A is a linear subspace of R N . Use the first property in fact 2.1.6.9.8.5. −2). The proof is rather similar to the vector LLN proof we have just completed.2. 2. Show that if S and S are two linear subspaces of R N . EXERCISES 79 The vector CLT in theorem 2.1 also follows from the scalar case. 2. 8 Hint: Look at the different definitions of linear independence. 2013 .5.1. 1) and (−1. See exercise 8.1 to show that if y ∈ R N is such that y x = 0 for every x ∈ R N . colk (A) = f (ek ).. 2. Ex.6 Ex. x2 . then S ∩ S is also a linear subspace. Show that every linear subspace of R N contains the origin 0.6. 2.6. 2. In particular.1.7 Ex. then there exists an N × K matrix A such that f (x) = Ax for all x ∈ RK .2. Let a ∈ R N and let A := {x ∈ R N : a x = 0}.6.5. Show that the vectors (1. e.2.13. Explain briefly why I N is full column rank. the matrix A A is well-defined (i. and A is an N × N matrix.21. Ex. 2. 2. Let A := Show that A is nonnegative definite. Let X := {x1 . Let 1 be an N × 1 vector of ones. Let A be N × K.6.3. Ex. and then show that the object in question has the stated property.6.10.12. Ex.6. Ex. the i. . 1. . then A = 0 (i. 2. we have (AB)−1 = B − 1 A − 1 .6.19.6. Ex. Show that for any two conformable matrices A and B. and nonnegative definite. 2. 2.18. .11. Show that if ei and e j are the i-th and j-th canonical basis vectors of R N respectively. every element of A is zero). Show that if A and B are positive definite and A + B is well defined.6.9 Ex.14. then it is also positive definite. . 2.6. then ei Ae j = aij .22. J OHN S TACHURSKI October 10. EXERCISES 80 Ex. 2. Consider the matrix Z := 9 Hint: 1 −1 −1 1 1 11 N Look at the definition of the inverse! Always look at the definition.6. Let A := αI N . Ex.6.6. 2. Is it possible that 0 ∈ X ? Why or why not? Ex.17. 3. 2. 2013 . Let A be a constant N × N matrix. Show that X is symmetric and XX = I N .. Give a condition on α such that A is positive definite. where u is an N × 1 vector with u = 1. 2. j-th element of A. xK } be a linearly independent subset of R N . Let I N be the N × N identity matrix. Ex.20. multiplication is possible). Let X := I N − 2uu .e.6. 2.6. Ex. 2. square.1 and 2.6.16. 2.. show that (A )−1 = (A−1 ) . Show that for any matrix A. Assuming existence of the inverse A−1 .3. Prove facts 2.15. Show that if Ax = 0 for all K × 1 vectors x. Ex. Show that I N is the inverse of itself. Ex.1. 2.6. then E [a x] = a µ and var[a x] = a Σa. Let x = ( x1 . Ex. (Which is which?) Check against the built-in functions colSums and rowSums. 2. .6.28 (Computational).6. then E [x Ax] = trace(A). 2.6. Show that if E [x] = µ and var[x] = Σ.1 all hold. I N ). Show that E [xx ] is nonnegative definite. create a new matrix B such that the j-th column of B is the j-th column of A minus the column mean of the j-th column of A. If a is an N × 1 constant vector.6. the elements of which are chosen randomly via uniform sampling from {1. What is the distribution of x1 2 / x 2 ? Why? 3. Form a 4 × 5 matrix A. Ex.6. What is the distribution of x1 2 2 + x 2 )]1/2 ? Why? 4. that computes the norm of any vector. The built-in function colMeans can be used to obtain the column means. What is the distribution of x1 [2/( x2 3 5. Choose two vectors and a scalar. Show that Z is idempotent. 1. 2. J OHN S TACHURSKI October 10. Show that if x is any N × 1 vector. Compute the row sums and column sums by either pre. 2. Show that if x is a random vector with E [xx ] = I and A is a conformable constant matrix.29 (Computational). then Zx is a vector with all elements equal to the sample mean of the elements of x. Write a function in R called innerprod that computes the inner product of any two vectors using sum. Ex. Ex. 2013 . . .6. also using sum.6.28. 3. Let x be a random K × 1 vector. Let x be random and let a be constant.23. 2. 2.27 (Computational). Using the matrix A in exercise 2.6. EXERCISES 81 1.24.2. x N ) ∼ N (0.or postmultiplying by a vector of ones. what is the distribution of a x? Ex.26. Ex.25. Are x1 and x2 independent? Why or why not? 2 ? Why? 2. and check that the properties in fact 2. What is the distribution of x 2 ? Why? 6. . 2. 2. Write another function called norm. 4} with replacement (look up the online documentation on the sample function). 6. x j ] = 0 for all i = j.9. Since uncorrelated normal random variables are independent. x1 is a linear combination of the elements of X . We must show that z := αx + βy ∈ A. XX = I N . we have x αI N x = α x 2 > 0. Solution to Exercise 2.2. suppose (without any loss of generality) that x1 = 0 ∈ X . To see this. I N ) we have cov[ xi .6. If X has more than one element. equivalently. First note that since x ∼ N (0. then given x = 0. The solutions are as follows: (1) I N is full column rank because its columns are the canonical basis vectors.11. If a := (1.1 Solutions to Selected Exercises Solution to Exercise 2. β ∈ R. This contradicts linear independence.8.6.21. y ∈ A and let α. then Q is all x with a x = 0. It follows immediately that I N is the inverse of itself. EXERCISES 82 2. 2013 . If this holds. This is immediate.6.26. we J OHN S TACHURSKI October 10. Solution to Exericse 2.6.6.6. then it is not possible. B is the inverse of A if BA = AB = I N . which are independent.6. as shown in exercise 2.6. (3) A sufficient condition is α > 0.8. (2) By definition. because XX = (I N − 2uu )(I N − 2uu ) = I N I N − 2I N 2uu + (2uu )(2uu ) = I N − 4uu + 4uu uu = I N − 4uu + 4uu = I N The second last equality is due to the assumption that u u = u 2 = 1. Let x. First.20. −1. Then x1 = 0 = k =2 ∑ 0x k K In other words. or. because a (αx + βy) = αa x + βa y = 0 + 0 = 0. Solution to Exercise 2. a z = a (αx + βy) = 0. Solution to Exercise 2. 1). Solution to Exercise 2. X is symmetric because X = (I N − 2uu ) = I N − 2(uu ) = I N − 2(u ) u = I N − 2uu = X Second. This set is a linear subspace of R3 . Evidently E [y] = E [a x] = a E [x] = 0. we obtain var[y] = N 2 Hence y ∼ N (0.6. Yes. Using independence. 5. then Z (k / Q)1/2 ∼ t(k ). x N ∼ N (0. 2013 . x1 2 / x 2 ∼ F (1. so y := a x is normal. . because if Z ∼ N (0. The solutions to the exercise can now be given: 1. 2 + x2 )]1/2 ∼ t (2).9) n =1 for any k ≤ N . 6. 1). n =1 ∑ N a2 n var[ xn ] = n =1 ∑ a2 n N J OHN S TACHURSKI October 10. 2 ∼ χ2 (1) by (2.2. we have in particular that 2 ∼ χ2 ( k ) ∑ xn k (2. . EXERCISES IID 83 then have x1 . 1) and Q ∼ χ2 ( k ) and Z and Q 4. ∑n =1 a n ). because if Q ∼ χ2 ( k ) and Q ∼ χ2 ( k ) and Q and Q and 3. x1 [2/( x2 3 are independent. Since sums of squares of independent standard normals are chi-squared. . . k2 ). 1). for the reason just described. Linear combinations of normals are normal.9) 2. x1 2 2 2 1 1 1 2 independent.9). x 2 N 2 2 = ∑n =1 x N ∼ χ ( N ) by (2. then ( Q1 /k1 )/( Q2 /k2 ) ∼ F (k1 . For example. xK are vectors in R N and xi ⊥ x j whenever i = j. The first thing you need to know about orthogonal vectors is the Pythagorean Law: Theorem 3. then x and z are said to be orthogonal.1.1 Orthogonality and Projection [roadmap] 3. which lies behind many of the key results in OLS theory. 3. then we say that x is orthogonal to S if x ⊥ z for all z ∈ S. The theory of projections also allows us to define conditional expectations. . x and z are orthogonal when they are perpendicular to one another (figure 3. 84 . If x z = 0.2 illustrates.1). and we write x ⊥ z. If x1 . Figure 3.Chapter 3 Projections This chapter provides further background in linear algebra for studying OLS with multiple regressors.1. then x1 + · · · + x K 2 = x1 2 + · · · + xK 2 Orthogonality and linear independence are related. . In this case we write x ⊥ S. and determine the properties of the conditioning operation.1. In R2 . At the heart of the chapter is the orthogonal projection theorem. . If x is a vector and S is a set. .1 Orthogonality Let x and z be two vectors in R N . 2: x ⊥ S J OHN S TACHURSKI October 10. ORTHOGONALITY AND PROJECTION 85 Figure 3.3. 2013 .1: x ⊥ z Figure 3.1. the problem is.1. Moreover.1.1. The closest point in S to y is the unique vector y ˆ ∈ S such that y − y ˆ ⊥ S. 2013 . In general.2 (Orthogonal Projection Theorem.3 (Orthogonal Projection Theorem 2).1. suggesting that y ˆ may not be well-defined. and let P : R N → R N be the orthogonal projection onto S.1.2 Projections One problem that comes up in many different contexts is approximation of an element y of R N by an element of a given subspace S of R N . Closeness is in terms of euclidean norm. Figure 3. and. we can think of the operation y → the orthogonal projection of y onto S as a function from R N to R N .2 is called the orthogonal projection of y onto S. Theorem 3. we can see that the closest point y ˆ to y within S is indeed the one and only point in S such that y − y ˆ is orthogonal to S. 0∈ / V .1 The function is typically denoted by P. The function P is linear. we have 1 Confirm in your mind that we are describing a functional relationship. as defined in §13.1. so that P(y) or Py represents the orthogonal projection y ˆ . Using this notation. The vector y ˆ in theorem 3. so y ˆ is the minimizer of y − z over all z ∈ S: y ˆ := argmin y − z z∈S Existence of a minimizer is not immediately obvious. However.1. Figure 3. y ∈ V . P is called the orthogonal projection onto S. given y and S. to find the closest element y ˆ of S to y. we can restate the orthogonal projection theorem.1. the intuition is easy to grasp from a graphical presentation. J OHN S TACHURSKI October 10. If V is a finite set with x ⊥ y for all distinct pairs x.1. Let y ∈ R N and let S be a subspace of R N . as well as adding some properties of P: Theorem 3. Stated more precisely.4 illustrates the action of P on two different vectors. Part 1). then V is linearly independent.3. it turns out that we need not be concerned. as y ˆ always exists (given any S and y). moreover.3 illustrates. Although we do not prove the theorem here. as well as providing a way to identify y ˆ. The next theorem states this fact. 3. ORTHOGONALITY AND PROJECTION 86 Fact 3. Holding S fixed. Looking at the figure. Let S be any linear subspace. for any y ∈ R N . ORTHOGONALITY AND PROJECTION 87 Figure 3.1.4: Orthogonal projection under P J OHN S TACHURSKI October 10.3. 2013 .3: Orthogonal projection Figure 3. These results are not difficult to prove.1. Fact 3. What then is the difference between (a) first projecting a point onto the bigger subspace S2 . the orthogonal complement of S is defined as S⊥ := {x ∈ R N : x ⊥ S} In other words. S⊥ is the set of all vectors that are orthogonal to S. Py = y if and only if y ∈ S. If S1 ⊂ S2 .3.1.3. observe that y can be decomposed as y = Py + y − Py Part 3 now follows from parts 1–2 and the Pythagorean law.2. 4. 3. 2. Let S1 and S2 be two subspaces of R N . Py ∈ S. Py ≤ y . where S1 ⊂ S2 . Fact 3.5 gives an example in R2 . To see part 3. 2013 .1.6). and let y ∈ R N . There’s one more very important property of P that we need to make note of: Suppose we have two linear subspaces S1 and S2 of R N . and 5. the orthogonal complement S⊥ is always a linear subspace. Given S ⊂ R N .2. Parts 1 and 2 follow directly from theorem 3. Given any S. Linearity of P is left as an exercise (exercise 3. then P1 P2 y = P2 P1 y = P1 y There’s yet another way of stating the orthogonal projection theorem. given theorem 3.1. J OHN S TACHURSKI October 10.4.1. (Why?) Part 5 is obvious from the definition of Py as the closest point to y in S. y − Py ⊥ S. and then projecting the result onto the smaller subspace S1 . which is also informative. y 2 = Py 2 + y − Py 2 . Let P1 and P2 be the projections onto S1 and S2 respectively. and (b) projecting directly to the smaller subspace S1 ? The answer is none—we get the same result.2. (Why?) Part 4 follows from part 3. ORTHOGONALITY AND PROJECTION 88 1. Figure 3. 5: Orthogonal complement of S This is easy enough to confirm: Looking back at the definition of linear subspaces. y ∈ S⊥ and α. If P is the orthogonal projection onto S and M is the orthogonal projection onto S⊥ . 2013 . The next theorem states this somewhat more mathematically. then Py and My are orthogonal. The figure suggests that we will have y = y ˆ +u case. so that u ˆ := My.2 is reversed for M.1. the vector that αx + βy is also in S⊥ . thus confirming that αx + βy ∈ S⊥ . β ∈ R. then (αx + βy) z = αx z + βy z (∵ linearity of inner products) (∵ x. let’s look at the orthogonal projection theorem again. Just as we used P to denote the function sending y into its projection onto S.4. we have just learned that S⊥ is itself a linear subspace. Our interest was in projecting y onto S. then S2 1 the result in fact 3.4 (Orthogonal Projection Theorem 3). y ∈ S⊥ and z ∈ S) = α×0+β×0 = 0 We have shown that αx + βy ⊥ z for any z ∈ S. ORTHOGONALITY AND PROJECTION 89 Figure 3.6 its projection onto S⊥ . For S ⊂ R N . J OHN S TACHURSKI October 10. Clearly this is the case. and y = Py + My ⊥ ⊂ S⊥ . Fact 3. so we can also project y onto S⊥ . we have S ∩ S⊥ = {0}. so we’ll use M to denote the function sending y into ˆ .1. This means that If S1 and S2 are two subspaces of R N with S1 ⊂ S2 . we see that the following statement must be verified: Given x. The result we’ll denote by u ˆ .3. Theorem 3. and indeed that is the illustrates.1. because if z ∈ S.1. Now. Let S be a linear subspace of R N . However. Figure 3. then the full column rank assumption and fact 2. In this case. that’s because the point is already in that subspace (see.2. mathematically speaking.2 3. and y is N × 1.2 Overdetermined Systems of Equations When we get to multivariate linear regression. the problem we are presented with is one of solving what is called an overdetermined system of equations. OVERDETERMINED SYSTEMS OF EQUATIONS 90 Figure 3. We begin with a system of equations such as X β = y. the system of equations is said to be overdetermined. Py = 0 if and only if y ∈ S⊥ . where X is N × K. β is K × 1. If an orthogonal projection onto a subspace doesn’t shift a point. we will see that. e. if Py = 0. and we conclude that y ∈ S⊥ . If S ⊂ S . Let S1 and S2 be two subspaces of R N and let y ∈ R N .5. overdetermined systems of equations are usually solved using orthogonal projection. Throughout this section. In turn. and My = 0 if and only if y ∈ S. 2 For J OHN S TACHURSKI October 10.1. Hence M does not shift y.g. This corresponds to example. we maintain the assumption that X is full column rank. we are going to study the case when N > K.. then y = Py + My = My. However. We regard the matrix X and the vector y as given.3).1.3.3. the projections onto S1 2 1 2 M1 M2 y = M2 M1 y = M2 y Fact 3. In this case the subspace is S⊥ . 2013 .3 imply that this system has precisely one solution. then. theorem 3.6: Orthogonal projection Fact 3.1. and seek a β ∈ RK that solves this equation.6. If K = N . Let M1 and M2 be ⊥ and S⊥ respectively. Let’s have a quick look at how this is done. 3. By this theorem. a solution to X β = y exists precisely when y lies in rng(X). the standard approach is to admit that an exact solution may not exist. ˆ is the closest point in rng(X) to y when y ˆ := X β 1. all K-dimensional subspaces of R N are “small.3. To understand this problem. and instead focus on finding a β ∈ RK such that X β is as close to y as possible. If we can show that X β ˆ ≤ y − X β for any β ∈ RK y − Xβ ˆ is in fact the closest point in which is all we need to prove. In general. if y is drawn from a continuous distribution over R3 . while y is any point in R N . and 2.3. y ˆ ∈ rng(X). Theorem 3. The minimizer of y − X β over all β ∈ RK is β ˆ is the closest point in rng(X) to y.1 that X can be viewed as a mapping from RK to R N .2. given our assumption that K < N . then the probability that it falls in this plane is zero. Using the orthogonal projection theorem. recall from §2. OVERDETERMINED SYSTEMS OF EQUATIONS 91 the situation where the number of equations (equal to N ) is larger than the number of unknowns (the K elements of β). recall the orthogonal projection theorem (page 86). Then the column space of X forms a 2 dimensional plane in R3 . Closeness is defined in the euclidean sense. due to the fact that planes in R3 are always of “Lebesgue measure zero.1. In a sense.” J OHN S TACHURSKI October 10. this set has no volume because planes have no “thickness.” and hence the chance of a randomly chosen y lying in this plane is near zero. in such a situation. Intuitively. so the problem is to minimize y − X β over the set of all β ∈ RK . the range of X is a K-dimensional subspace of R N .3 As a result. and its range is the linear subspace of R N spanned by the columns of X: rng(X) := {all vectors X β with β ∈ RK } :=: column space of X As discussed in §2. Intuitively. the minimizer is easy to identify: ˆ : = ( X X ) −1 X y .” and the “chance” of y happening to lie in this subspace is likewise small. for K < N .1. More formally. we then have Proof. y − y ˆ ⊥ rng(X) Here 1 is true by construction. 2013 . To verify that y ˆ := X β rng(X) to y. this outcome is unlikely. consider the case where N = 3 and K = 2. For example. we may not be able find a β that satisfies all N equations.2. and 2 translates to claim y − X ( X X ) −1 X y ⊥ X β 3 Why for all β ∈ RK is it unlikely that y lies in the range of X? Since X is assumed to be full column rank. 1.2.4 on page 89. Let’s drop this assumption and consider what happens. because X is assumed to be full column rank.9. while M projects onto the orthogonal complement of rng(X).1. there still exists a vector β such that y ˆ = X β. Notice that theorem 3. we then have ˆ =y Py = X(X X)−1 X y = X β ˆ and My = (I − P)y = y − Py = y − y ˆ The projection matrix and the annihilator correspond to the two projections P and M in theorem 3.2. J OHN S TACHURSKI October 10. Let’s tie this discussion in to theorem 3. Since ˆ ˆ y ˆ ∈ rng(X). We define the projection matrix P associated with X as P : = X ( X X ) −1 X We also define the annihilator M associated with X as M := I − P (3. why do we need full column rank for X? Full rank means that the columns of X are linearly independent. to find the closest element of rng(X) to a given vector y in R N . This is justified. P projects onto rng(X). Fact 3. Both P and M are symmetric and idempotent.1 implicitly assumes that X X is invertible. Applying the mapping a second time has no effect. 2013 . because if β ∈ RK . Idempotence is rather intuitive here.1) where I is.4.) Remark 3. and the orthogonal projection theorem still gives us a closest point y ˆ to y in rng(X).10). The proof is an exercise (exercise 3.4.2. because both P and M represent orthogonal projections onto linear subspaces. the identity matrix (in this case N × N ). Given these definitions.2. Such projections map vectors into their respective subspaces. The set rng(X) is still a linear subspace.1. (Exercise 3. The problems is that now there may exist two such vectors—or even an infinity. however.1 is done.3. we can just premultiply y by P := X(X X)−1 X . OVERDETERMINED SYSTEMS OF EQUATIONS 92 This is true. On an intuitive level.1. then ( X β ) [ y − X ( X X ) −1 X y ] = β [ X y − X X ( X X ) −1 X y ] = β [ X y − X y ] = 0 The proof of theorem 3.2.4.2) (3. because the vector is already in the subspace. In particular. as usual. to my mind.2. This follows from fact 3. In advanced texts. 3.4. For the purposes of this section. it will be more convenient if we make a slight adjustment. Since we’ll be using it a lot. if we define z := E [(u − v)2 ] E [ z2 ] (3. 2013 . by far the most intuitive.6 on page 90.2. but it’s also quite intuitive in light of figure 3. the presentation involves orthogonal projection. then the RMSE between u and v is the “norm” of the random variable u − v. there are several different approaches to presenting conditional expectations. where x j is the j-th column of X. and fails to provide the big picture. we usually talk about “closeness” instead of similarity.11. but it is. The annihilator M associated with X satisfies MX = 0.3 Conditioning The main purpose of this section is to introduce conditional expectations and study their properties. (Exercise 3. but the meaning is the same. As you might expect given the location of this discussion. The mean squared error of v as a predictor of u is defined as E [(u − v)2 ]. replacing the mean squared error with the root mean squared error (RMSE). 3. Since x j is in rng(X).3. CONDITIONING 93 Fact 3. Since it helps to think geometrically. The definition of conditional expectations given in elementary probability texts is often cumbersome to work with.6 (where S corresponds to rng(X)). it gets mapped into the zero vector by M. In this case we’d want u and v to be similar to each other in some sense.3) and regard this as the “norm” of the random variable z.3.1.) The intuition is as follows: The j-th column of MX is Mx j . A natural measure of closeness is mean squared error (MSE). The proof is an exercise. the square root of the MSE. J OHN S TACHURSKI October 10. The one I present here is less common than the plain vanila treatment.3. let’s give the RMSE its own notation: u − v := More generally.1 The Space L2 Suppose we want to predict the value of a random variable u using another variable v. as the name suggests. which is. then orthogonality of x and y is equivalent to cov[ x. x of x is precisely the As in the euclidean case. y] = 0. if the inner product of x and y is zero.. However. then the inner product is x y.1) on page 50. In this sense.3. caveat is that while x = 0 implies that x = 0. So let’s stop calling · a “norm. z(ω ) = 0 for all ω ∈ Ω). That is. it is not true that z = 0 implies z is the zero random variable (i. CONDITIONING 94 In fact the random variable “norm” · defined in (3. and write x ⊥ y. for random variables x and y.3. then the definitions of the two norms written side by side look pretty similar: z = n =1 ∑ N 1/2 z2 n and z = s f (s)ds 2 1/2 More importantly.1. As we saw in §2. So for the purposes of this section. we define inner product of x and y := E [ xy] As for the euclidean case. If z is a vector in R N and z is a random variable with density f .1 (page 52) carry over to then “norm” · if we replace vectors with random variables. then the set E := {ω ∈ Ω : |z(ω )| > 0} satisfies P( E) = 0.1. all the properties of the euclidean norm · given in fact 2. and the norm of vector x is x = x x.” and just start calling it a norm. The standard name of this set of random variables is L2 . then we say that x and y are orthogonal. L2 := { all random variables x with E [ x2 ] < ∞} We can draw another parallel with the euclidean norm. we can say that if z = 0. the euclidean norm is defined in terms of the inner product on R N . if either x or y is zero mean. there is a risk here that z may not be defined because E [z2 ] = ∞. z differs from the zero random variable only in a trivial way.1. Clearly. you can see here that the norm square root of the inner product of x with itself. 2013 . If x and y are √two N vectors in R . 4 One J OHN S TACHURSKI October 10.3) behaves very much like the euclidean norm · over vectors defined in (2. let’s restrict attention to random variables with finite second moment.e.4 Unlike the situation with the euclidean norm. This terminology is used frequently in econometrics (often by people who aren’t actually sure why the term “orthogonal” is used—which puts you one step ahead of them). Similarly. . . . . .5 technical note: In the definition of measurability above. Let x1 . . The first step is a definition at the very heart of probability theory: measurability. we think about how to project onto it. x p (ω )).2 and 3. and this is precisely how we will study conditional expectation. Once we have this subspace. we might get to observe the realized values x1 (ω ). orthogonal projection starts with a linear subspace S of R N . . . .3. where we speak of existence of the function g.3). even without knowning ω . . we implictly define the orthogonal projection mapping P that projects onto S (see theorems 3. x p }. . but we do get to view certain random outcomes. G is a set of random variables. . The advantage of this is that we have a lot of geometric intuition about the vector space R N . If z is G -measurable. often referred to in what follows as the information set. in the case of R N . We will say that another random variable z is G -measurable if there exists a (nonrandom) function g : R p → R such that z = g ( x1 . Since L2 with its norm · behaves a lot like R N with its norm · .1.. You might like to imagine it like this: Uncertainty is realized. . See any text on measure theory for further details. x p (ω ). we will see that the orthogonal projection theorem carries over to L2 . In fact S is the crucial component here. . the distinction will be ignored.e. we’ve made it look rather similar to euclidean vector space R N .3. and let G := { x1 . 5A J OHN S TACHURSKI October 10. . it is additional required that the function g is “Borel measurable. that same geometric intuition concerning vectors can be applied to the study of random variables. For example. we can regard non-Borel measurable functions as a mere theoretical curiosity. because once we select S.1. we can now calculate the realized value z(ω ) of z. . . Thus. It is to this topic that we now turn.” For the purposes of this course. the variable z is completely determined (i. For example. no longer random) and it’s realized value can be calculated (assuming that we can calculate the functional form g). .3. So when I tell you that conditional expectation is characterized by orthogonal projection. CONDITIONING 95 3. . what this means is that once the values of the random variables x1 . Recall that. x p ) Informally. 2013 . because we can compute z(ω ) = g( x1 (ω ).2 Measurability What’s the main point of the discussion in the previous section? By providing the collection of random variables L2 with a norm. . . As such. you will understand that the first thing we need to think about is the linear subspaces that we want to project onto. . in the sense that some ω is selected from the sample space Ω. x p be some collection of random variables. Suppose that we don’t get to view ω itself. . . x p have been realized. 2. products. Let x and z be two random variables. Let x. 2013 .3. Example 3.4. then we can take y = g( x1 . if G = { x } and y is G -measurable. Let x1 . . where α is a constant. then it is also H-measurable. Example 3. y) = 2x + 3 + 0y. Hence G -measurability is preserved under the taking of sums. y}. x N be random variables and let x N − 1 ¯ N := N ∑n=1 xn is clearly G -measurable. then u := xy and v := α x + βy are also G -measurable. . . . Example 3. because y is already deterministic. Let x. . Formally. The next example helps to clarify. x p }. . If x and y are both G -measurable. CONDITIONING 96 As a matter of notation. and let x and y be random variables. . the realization of z cannot be computed until we know the realized value of y. Formally. Example 3. To see this formally. so that z is G -measurable.3. . y).3. so y is x-measurable. y and z be three random variables with z = x + y. In this case. if y was x-measurable. then x Example 3.3. ¯ N be their sample mean. then y is not x-measurable.1. when x is realized. if G = { x1 . . Suppose that x and y are independent. This contradicts independence of x and y. But then y = g( x ) − x. Hence z is also H-measurable as claimed. we can write z = g( x ) when g( x ) = 2x + 3. Let G and H be two information sets with G ⊂ H.3. then it is certainly known when the extra information provided by H is available. etc. and let H = { x. Less formally. Then z is not x-measurable. y and z be three random variables. Let α. the value of z can be calculated. Then z is also H-measurable. we can write z = g( x. If x and y are known given the information in G . Indeed. . Informally. Example 3. Intuitively. if random variable z is G measurable.5.3.6. . This contradicts independence of x and y. x p ) = α + ∑i=1 0xi . Fact 3. Let y = α. . if z is x-measurable then z = g( x ) for some function g. then we will also say that y is x-measurable. This follows from our intuitive definition of measurability: If the value z is known once the variables in G are known. Suppose that z = 2x + 3. J OHN S TACHURSKI October 10. x N }. If x and y are independent. .3. . where g( x. then we would have y = g( x ) for some function g.3. β be any scalars.3. . If z = 2x + 3. then a third random variable that depends on only on x and y is likewise known given G . then z is xmeasurable. In particular. For p example.1.3. even if we know the realized value of x. we can see that z is deterministic once the variables in H are realized. If G = { x1 . . let G = { x }. This degenerate random variable is G -measurable for any information set G . in the sense of set inclusion. so that’s the one we’ll use. 2013 . If G ⊂ H and z is G -measurable. β and vectors x. y in S. we have the following important result: Fact 3. In particular.4) z ∈ L2 ( G ) This definition makes a lot of sense. the subspaces of interest are the subspaces of measurable random variables.4. prefer the notation E G [y] to E [y | G ] because. the set L2 (G ) is a linear subspace of L2 . However. Fact 3. and defined as the closest G -measurable random variable to y. given arbitrary scalars α. For any G ⊂ L2 . We started off this section by talking about projecting onto linear subspaces.3. 6I J OHN S TACHURSKI October 10. the linear combination αx + βy is again in S. Our intuitive understanding of the conditional expectation E [y | G ] is that it is the best predictor of y given the information contained in G .3 Conditional Expectation Now it’s time to define conditional expectations.4) says the same thing. and the former notation complements this view. the linear subspace is increasing with respect to the information set. given G ⊂ L2 . we define L2 (G ) := {the set of all G -measurable random variables in L2 } In view of fact 3. Similarly S ⊂ L2 is called a linear subspace of L2 if. as we will see.2 we see that. For conditional expectations. so we can actually compute it once the variables in G are realized. The definition in (3. E G is a function (an orthogonal projection) from L2 to L2 . then L2 (G ) ⊂ L2 (H ).3.3. y in S. and selects E [y | G ] as the closest such variable to y in terms of RMSE. From fact 3.3.2. then z is H-measurable. E [y | G ] := argmin y − z (3.1. given arbitrary scalars α.3. CONDITIONING 97 Let’s note this idea as a fact: Fact 3.3. 3. the notation E [y | G ] is a bit more standard.3. Let G ⊂ L2 and y be some random variable in L2 . If G ⊂ H.6 More formally. The conditional expectation of y given G is written as E [y | G ] or E G [y].3.3. Recall that S ⊂ R N is called a linear subspace of R N if. It simultaneously restricts E [y | G ] to be G -measurable. the random variable α x + βy is also in S. β and random variables x. Given a linear subspace S of L2 ˆ ∈ S such that and a random variable y in L2 . which means that if { xn } ⊂ S. 2. are two small caveats I should mention. but fortunately the orthogonal projection theorem comes to the rescue. x ∈ L2 and xn − x → 0. if the do exist. Second. Theorem 3. For the subspaces we consider here. Let’s state this as a theorem for the record.5) is a well-defined linear function from L2 to S. we do not distinguish between elements x and x of L2 such that P{ x = x } = 1. when we talk about uniqueness in L2 . the function Py := argmin y − z z∈S (3. 7 There J OHN S TACHURSKI October 10. and 3.3. we actually require that S is a “closed” linear subspace of L2 . then x ∈ S. So is our definition really a definition? Moreover. the projection is characterized by two properties: ˆ is the orthogonal projection of y onto S if and only if y ˆ ∈ S and y − y ˆ⊥S y ˆ as a function.3. CONDITIONING 98 While the definition makes sense. there are many situations where minimizers don’t exist. this condition is always true. The orthogonal projection theorem in L2 is almost identical to the orthogonal projection theorem we gave for R N . even assuming we do have a proper definition. y − Py ⊥ S. there are lots of them. Most other books covering Hilbert space will provide some discussion. Moreover. which we denote by P. we can think of y → y the orthogonal projection of y onto S for arbitrary y ∈ L2 . For example. we have 1. Py ∈ S.1. or. Given a linear subspace S of L2 . there is a unique y ˆ ≤ y−z y−y for all z ∈ S ˆ ∈ S is called the orthogonal projection of y onto S. A nice treatment of orthogonal projection in Hilbert spaces (of which L2 is one example) is provided in Cheney (2001. P satisfies all the properties in theorem 3. chapter 2).3 (page 86). how do we actually go about computing conditional expectations in practical situations? And what properties do conditional expectations have? These look like tricky questions. so that Py is As for R N .3. Given any y ∈ L2 .1. it still leaves many open questions. Py = y if and only if y ∈ S. First. 2013 .7 Just as for the The variable y R N case. 1 implies that E [y | G ] is always well defined and unique. . by definition. . Given { x1 . it gives us a useful characterization of E [y | G ]. Rewriting these conditions in a slightly different way. . The first claim is clearly true. .5. This is obvious because. E [y | G ] is G -measurable.3. Let’s check this using the second definition of conditional expectations given above.4) and (3. E [y | G ] is G -measurable.6) October 10. we need to show that x + E [w] is x-measurable and that E [( x + E [w]) z] = E [yz] for all x-measurable z. it’s worth keeping in mind: A conditional expectation with respect to a collection of random variables is some function of those random variables. there exists a function g : R p → R such that E [y | x1 . let’s note the following “obvious” fact: Fact 3. E [E [y | G ] z] = E [yz] for all G -measurable z ∈ L2 . . we see that y → E [y | G ] is exactly the orthogonal projection function P in the special case where the subspace S is the G -measurable functions L2 (G ). The second claim translates to the claim that E [( x + E [w]) g( x)] = E [( x + w) g( x)] J OHN S TACHURSKI (3. If x and w are independent and y = x + w. x p ). This definition seems a bit formidable. That’s kind of neat. .7. but it’s not too hard to use. . because we now know that E [y | G ] is the unique point in L2 such that E [y | G ] ∈ L2 (G ) and y − E [y | G ] ⊥ z for all z ∈ L2 (G ). because x + E [w] is a deterministic function of x. Okay. . we can give an alternative (and equivalent) definition of conditional expectation: E [y | G ] ∈ L2 is the conditional expectation of y given G if 1. let’s bow to common notation and define E [ y | x1 . Before giving an application. Example 3. . . so E [y | G ] is the orthogonal projection of y onto L2 (G ).3.5). .3. . . . CONDITIONING 99 Comparing (3. x p } and y in L2 . . At the same time. . and 2. For starters. 2013 . x p ] = g( x1 . it tells us quite a lot. x p ] : = E [ y | G ] Also. but what does it actually tell us? Well. then E [y | x ] = x + E [w]. To check that x + E [w] is indeed the conditional expectation of y given G = { x }. theorem 3.3.3. Second. or the law of iterated expectations (by econometricians). The law follows from the property of orthogonal projections given in fact 3. then E [E [y | H ] | G ] = E [y | G ] and E [E [y | G ]] = E [y]. 1.2 on page 88: Projecting onto the bigger subspace L2 (H ) and from there onto L2 (G ) is the same as projecting directly onto the smaller subspace L2 (G ). If y is independent of the variables in G .3.4.1.4. In other words. The proof of the claim in the example is the topic of exercise 3. CONDITIONING 100 for any function g. then E [y | G ] = y.3. Most are fairly straightforward. and let G and H be subsets of L2 . then y is projected into itself. then E [E [y | H ] | G ] = E [y | G ] is called the “tower” property of conditional expectations (by mathematicians). 5. then E [y | x ] = tp(t | x )dt There are some additional goodies we can harvest using the fact that conditional expectation is an orthogonal projection.1.6.3. Verifying this equality is left as an exercise (exercise 3. Let x and y be random variables in L2 . consider the claim that if y is G -measurable. Checking of these facts is mainly left to the exercises. Linearity: E [α x + βy | G ] = αE [ x | G ] + βE [y | G ]. 2013 .8.17. then E [ xy | G ] = xE [y | G ]. For example.3. The fact that if G ⊂ H.12) The next example shows that when x and y are linked by a conditional density (remember: densities don’t always exist). then E [y | G ] = E [y]. If G ⊂ H. If x and y are random variables and p(y | x ) is the conditional density of y given x. The following properties hold. Example 3. 3. 2. J OHN S TACHURSKI October 10. If x is G -measurable. let α and β be scalars. If y is G -measurable. Fact 3. then E [y | G ] = y. 4. This is immediate from the last statement in theorem 3.3. then our definition of conditional expectation reduces to the one seen in elementary probability texts. we are saying that if y ∈ L2 (G ). 6 continue to hold in the current setting. . then E [Y | X] = E [Y]. . E [AX + BY | Z] = AE [X | Z] + BE [Y | Z]. . . . one can show that all of the results on conditional expectations in fact 3. the following results hold: 1. Z] | X] = E [Y | X]. 2. 5. replacing scalars with vectors and matrices. x m .4 The Vector/Matrix Case Conditional expectations of random matrices are defined using the notion of conditional expectations for scalar random variables. 3.   .3. 2013 . . . and let A and B be constant matrices. . . so that g(X) is a matrix depending only on X. E [y N1 | X] E [y N2 | X] · · · E [y NK | X] where We also define and E [ynk | X] := E [ynk | x11 . 4. If g is a (nonrandom) function. x LM ] cov[x. If X and Y are independent. . Let X. We state necessary results for convenience: Fact 3. E [Y | Z] = E [Y | Z].3. . Assuming conformability of matrix operations.7. E [E [Y | X]] = E [Y] and E [E [Y | X. Y and Z be random matrices. then • E [ g(X) | X] = g(X) • E [ g ( X ) Y | X ] = g ( X )E [ Y | X ] • E [Y g(X) | X] = E [Y | X] g(X) J OHN S TACHURSKI October 10. given random matrices X and Y.3. we set   E [y11 | X] E [y12 | X] · · · E [y1K | X]  E [y | X] E [y22 | X] · · · E [y2K | X]  21  E [Y | X] :=    . . . .3. For example. . y | Z] := E [xy | Z] − E [x | Z]E [y | Z] var[x | Z] := E [xx | Z] − E [x | Z]E [x | Z] Using the definitions. CONDITIONING 101 3.3. CONDITIONING 102 3. In other words. We saw that E [y | x ] is a function f of x such that f ( x ) is the best predictor of y in terms of root mean squared error. we conclude that f = argmin E [(y − g( x ))2 ] g∈ G J OHN S TACHURSKI October 10. We can also reverse this process.3.8) We can re-write the term inside the curly brackets on the right-hand side of (3. f also minimizes the mean squared error.6 are we using here?) Regarding the second term in this product. From this definition of conditional expectations.8) as (Which part of fact 3.3. suppose that the properties in fact 3. and we conclude that E [(y − g( x))2 ] ≥ E [(y − f ( x))2 ] :=: E [(y − E [y | x])2 ] Since g was an arbitrary element of G. From the law of iterated expectations (fact 3.3. Since monotone increasing transformations do not affect minimizers.3.6).3. and fix an arbitrary g ∈ G.6 hold. It then follows that E [(y − g( x))2 ] = E [(y − f ( x))2 + 2(y − f ( x))( f ( x) − g( x)) + ( f ( x) − g( x))2 ] = E [(y − f ( x ))2 ] + E [( f ( x ) − g( x ))2 ] Since ( f ( x ) − g( x ))2 ≥ 0 we have E [( f ( x ) − g( x ))2 ] ≥ 0. f solves min E [(y − g( x ))2 ] g∈ G (3. showing directly that f ( x ) := E [y | x ] solves (3. To begin. we obtain E {(y − f ( x))( f ( x) − g( x))} = E { E [(y − f ( x))( f ( x) − g( x)) | x] } ( f ( x ) − g( x ))E [(y − f ( x )) | x ] (3. We have (y − g( x ))2 = (y − f ( x ) + f ( x ) − g( x ))2 = (y − f ( x ))2 + 2(y − f ( x ))( f ( x ) − g( x )) + ( f ( x ) − g( x ))2 Let’s consider the expectation of the cross-product term.8) is E [0] = 0.7) where G is the set of functions from R to R.7).3. we have (by which facts?) the result E [y − f ( x ) | x ] = E [y | x ] − E [ f ( x ) | x ] = E [y | x ] − f ( x ) = E [y | x ] − E [y | x ] = 0 We conclude that the expectation in (3. given the various properties of conditional expectations listed in fact 3. we employed the orthogonal projection theorem to deduce various properties of conditional expectations. 2013 .3.5 An Exercise in Conditional Expectations Let x and y be two random variables.6. Confirm that P is a linear function from R N to R N .2–3. To make the proof a bit simpler.4 Exercises Ex. Is it true that Px = Py whenever x = y? Why or why not?9 Ex.4: If S ⊂ R N .4. Ex. 3. Prove this using the (second) definition of the conditional expectation E [y | G ].1.1.2. Let P be the orthogonal projection described in theorem 3. x3 ) ∈ R3 : x3 = 0}. Ex.4. Look at the different equivalent conditions for linear independence of a set of vectors.1.3.3 (page 86). Using the orthogonal projection theorem. Ex. 3.3.4.1. Show by direct computation that P and M in (3.8.4. Prove the Pythagorean law in theorem 3. 1. it suffices to show that X X is positive definite.5.11. Ex.4. as defined in §2.7.4.6.4. 3. Ex. Ex..4.1. it is stated that if y is independent of the variables in G .1.4. find the closest point in S to y. then S ∩ S⊥ = {0}. 9 Hint: 8 Hint: J OHN S TACHURSKI October 10.10 Ex. 1).4. 1. Ex. See fact 2. Sketch the graph and think about it visually.2. Ex.8 Ex. Prove fact 3. 3.1.4. Show that x + y 2 = x 2 + y 2 + 2x y 2.4.1.3 (page 86). 3.e.6. 3. EXERCISES 103 3.11. Let S := {( x1 . Verify fact 3.12. Let x and y be any two N × 1 vectors. 3. Prove theorem 3. In fact 3.3.1. 3.1. x2 .9.1.4 using theorems 3. 10 Hint: This is non-trivial. 2013 . Show that when N × K matrix X is full column rank.4. then E [y | G ] = E [y]. Let P be the orthogonal projection described in theorem 3.6) holds when x and w are independent.2 (i.2. In view of fact 2.10.2. 3.3.3. Explain the connection between this equality and the Pythagorean Law. the matrix X X is invertible.1.4. 3. Ex. 3. 3.13. MX = 0) directly using matrix algebra. Make use of the full column rank assumption.4.2) are both symmetric and idempotent. you can take G = { x }.1) and (3. 3. Prove fact 3. Show that the equality in (3. and let y := 1 := (1.3. Confirm the claim in fact 3. EXERCISES 104 Ex. Fix a ∈ S ∩ S⊥ . we need to show that (i) αPx + βPy ∈ S and (ii) for any z ∈ S. and.3. The claim is that P(αx + βy) = αPx + βPy To verify this equality.1. We aim to show that S ∩ S⊥ = {0}. y ∈ R N . Show that the conditional expectation of a constant α is α. Since a ∈ S⊥ . Show that var[y] = E [var[y | x ]] + var[E [y | x ]] Ex.) 3. we have (αx + βy − (αPx + βPy)) z = 0 Here (i) is immediate. we know that a s = 0 for any s ∈ S.8. we have in particular. moreover S is a linear subspace.1 Solutions to Selected Exercises Solution to Exercise 3.15. 3. Ex. then E [ xy | G ] = x E [ y | G ]. because Px and Py are in S by definition. Since a ∈ S.6 (page 100) as appropriate. 3.4. Fix α. To see that (ii) holds. Ex. Prove the claim in example 3. Let S ⊂ R N . Solution to Exercise 3.16. (Warning: The proof is a little advanced and you should be comfortable manipulating double integrals.3. β ∈ R and x.4.6 that if x is G -measurable. 3. using the results in fact 3.4. Going back to theorem 3.4.1. we need to show that the right-hand side is the orthogonal projection of αx + βy onto S. the only such vector is 0.3.17.6.4. so we have (x − Px) z = (y − Py) z = 0. We are done. a a = a 2 = 0.1.4.2. the projections of x and y are orthogonal to S. 3.3.4. 2013 .14. show that if α is a constant and G is any information set. J OHN S TACHURSKI October 10.3.4. Let var[y | x ] := E [y2 | x ] − (E [y | x ])2 . In particular. then E [α | G ] = α. just note that (αx + βy − (αPx + βPy)) z = α(x − Px) z + β(y − Py) z By definition. As we saw in fact 2. 0).4. e1 = 0 and y − x. observe that b Ab = b X Xb = (Xb) Xb = Xb 2 By the properties of norms. and 2. and any matrix with nonzero determinant is invertible. Solution to Exercise 3.4. e2 = 0 These equations can be expressed more simply as 1 − x1 = 0 and 1 − x2 = 0. this last term is zero only when Xb = 0. Indeed.2.4. x2 . J OHN S TACHURSKI October 10. to show that E [y | x ] = E [y] we need to show that 1. Let A = X X.7.4. Let x = ( x1 . To see this. Solution to Exercise 3.4. We must show that b Ab > 0. It is false to say that Px = Py whenever x = y: We can find examples of vectors x and y such that x = y but Px = Py. EXERCISES 105 Solution to Exercise 3. you should be able to confirm that Px = Py.6). Let g be any function from R → R. E [E [y] g( x )] = E [yg( x )] for any function g : R → R. and (ii) y − x ⊥ S.13. 1.4. pick any b = 0. By the orthogonal projection theorem we have (i) x ∈ S.9.3. Solution to Exercise 3. From the (second) definition of conditional expectation. we have E [( x + E [w]) g( x)] = E [ xg( x)] + E [w]E [ g( x)] = E [ xg( x )] + E [wg( x )] = E [( x + w) g( x )] This confirms (3. We conclude that x = (1. But this is not true. because b = 0 and X is full column rank (see fact 2. It suffices to show that A is positive definite. From (i) we have x3 = 0. From (ii) we have y − x. To see that A is positive definite. E [y] is G -measurable. x3 ) be the closest point in S to y. Let y be independent of x.2 on page 27). if we fix any y and then set x = Py + αMy for some constant α. 2013 . Solution to Exercise 3. and also that x = y when α = 1.12. since this implies that its determinant is strictly positive.4. part 5).3. Given independence of x and w (and applying fact 1.8. Note that e1 ∈ S and e2 ∈ S. t)dtds This is equal to the right-hand side of (3. E [y]E [ g( x)] = E [E [y] g( x)]. let h be any such function. By fact 3. Regarding part 1. We need to show that E [E [y | G ]u] = E [yu] Since u ∈ L2 (G ).9). Solution to Exercise 3. J OHN S TACHURSKI October 10. then E [α | G ] = α. Since x ∈ L2 (G ).2 (see page 27) we have E [yg( x)] = E [y]E [ g( x)]. because E [y] is constant (see example 3.6 (page 100). xE [y | G ] is G -measurable. so xE [y | G ] is G -measurable by fact 3.5 on page 96 tells us that α is indeed G -measurable.3. then by facts 1.1 and 1. E [ xE [y | G ]z] = E [ xyz] for any z ∈ L2 (G ). then E [ xy | G ] = xE [y | G ].3. Solution to Exercise 3. EXERCISES 106 Part 1 is immediate. this is immediate from the definition of E [y | G ].3.4. and x is G -measurable by assumption. and that E [ g( x)h( x)] = E [yh( x)] for any function h : R → R (3. we can write E [ g( x)h( x)] = E = = = tp(t | x )dt h( x ) tp(t | s)dt h(s) p(s)ds t p ( s. we have u ∈ L2 (G ).8. and the proof is done.5 on page 96).3. By linearity of expectations. we know that if α is G -measurable.4.9). Regarding part 2. fix z ∈ L2 (G ). Regarding (3. To confirm this.3. let x and y be random variables where p(y | x ) is the conditional density of y given x.16. Let g( x ) := tp(t | x )dt. 2013 .17.3. As in example 3.1 on page 96.3.3. and 2. Solution to Exercise 3.14. Example 3.4. if g is any function. Regarding part 2.9) The first claim is obvious. The claim is that E [y | x ] = g( x ). To prove this. t ) dt h(s) p(s)ds p(s) t h(s) p(s. E [y | G ] is G -measurable by definition. we must show that 1.20) on page 26. and let u := xz. We need to show that if x is G -measurable.4. we need to show that g( x ) is xmeasurable. Using the notation in (1. Part II Foundations of Statistics 107 . 000 volunteers. including parametric and nonparametric methods. suppose that a certain drug is tested on 1. empirical distributions. We should probably call it “statistical economics. on the basis of this data. For example. What then is the process of extracting knowledge from data? Under what conditions will this process be successful? 4. or how different economic variables are related to one another. Learning from data concerns generalization.” but I guess people feel that the term “econometrics” has a better ring to it. The only cost of using the term “econometrics” is that we are sometimes fooled into thinking that we work on a distinct discipline. but still lack fundamental knowledge on how many systems work. This is not true. The next two chapters provides a short. hypothesis testing and confidence intervals. and. A finite set of data is observed.Chapter 4 Statistical Learning Econometrics is just statistics applied to economic problems—nothing more and nothing less. one seeks to make more general statements. and found to produce the desired effect in 95% 108 .1 Inductive Learning In the modern world we have lots of data. 4. concise review of the foundations of modern statistics.1 Generalization The fundamental problem of statistics is learning from data. separate from statistics.1. . We observe “inputs” x1 . but now we only care about learning the mean of F—or the standard deviation. The implication of their claim is that we can generalize to the wider population. 2013 .4. If. If. y N .1.4 (Regression). Here are some typical statistical problems. J OHN S TACHURSKI October 10. y) pairs. and the child determined the creature was a dog on this basis. x N are drawn from a given but unknown cdf F. . Example 4.2. we don’t know the distributions. . N random values x1 . then we could work out the conditional expectation E [y | x ]. We must do the best we can given the information contained in this sample. the value f ( x ) will be a good guess of the corresponding output y. Inductive learning is where reasoning proceeds from the specific to the general—as opposed to deductive learning. . the problem lies in the fact that we do not know the underlying distributions.3. but rather on what the outcome for the volunteers implies for other people. In statistical applications. etc. (Here we imagine that y is observed after x or not at all.” as well as the corresponding “outputs” y1 . on the other hand. . we wish to compute a function f such that. . x N to some “system. . Another word for generalization is induction.1. Example 4. . y). You show a child pictures of dogs in a book and say ’dog’. the drug company claims that the drug is highly effective.1. . or the median.) In these examples. The child has generalized from specific examples. .1. As we’ll see in §3.1. Same as example 4.4. We wish to learn about F from this sample.1. phrased in more mathematical language: Example 4. INDUCTIVE LEARNING 109 of cases.1. we knew the joint distribution of ( x.3. Hence. which proceeds from general to specific. . The interest is not so much in what happened to the volunteers themselves. given a new input/output pair ( x. On the basis of this study. Example 4. Given this data. . in example 4. All we have is the observations.1. After a while. however. the child sees a dog on the street and says ’dog’.3. you had told the child that dogs are hairy. the learning is inductive. and hence the need to predict y from x. there is a natural sense in which the conditional expectation is the best predictor of y given x. then the nature of the learning process could be called deductive.2. four legged animals that stick their tongues out when hot. 1 helps to illustrate this idea. in the end what we are doing is bringing our own assumptions into play. J OHN S TACHURSKI October 10. But why does it look reasonable? Because our brain picks up a pattern: The blue dots lie roughly on a straight line.0 0.65 0.4. This is called fitting the model to the data. given that the input value is 0.1. Was your guess something like the red dot in Figure 4.1. Figure 4.2 Data is not Enough As a rule.4.0 Figure 4.8 1. If our model is good.1: Generalization requires knowledge 4. 2013 . INDUCTIVE LEARNING 110 output 0.75 G 0. our brains have been trained (or are hard-wired?) to think in straight lines.2? It looks reasonable to me too. One way or another.4 input 0. Ideally. And even though this thought process is subconscious. Consider the regression setting of example 4. and suppose we observe the blue dots as our data.2 0. we still need to combine the data with some assumptions in order to generalize. or at least we feel it would be natural for that to occur. statistical learning requires more than just data.1.60 G 0.6 0. We instictively predict that the red dot will lie on the same line.55 G 0. data is combined with a theoretical model that encapsulates our knowledge of the system we are studying.50 0.8. then combining model with data allows to gain an understanding of how the system works. The data is often used to pin down parameter values for the model. Now make a subjective guess as to the likely value of the output. Even when we have no formal model of how the system works.70 0. 55 G 0.2 0. our assumptions should be based on sound theory and understanding of the system we are studying.” Perhaps Dr Fisher based his projection on subconscious attraction to straight lines. Those assumptions may come from knowledge of the system.4. INDUCTIVE LEARNING 111 0.8.2: Generalization requires knowledge Obviously our assumption about the linear relationship could be completely wrong. or they may come from subconscious preference for straight lines.50 0. we haven’t even talked about the kind of system we are observing here. Either way.4 input 0. we are adding something to the data in order to make inference about likely outcomes.0 Figure 4. 2013 . the point is that we cannot forecast the new observation from the data alone. We have to make some assumptions as to the functional relationship in order to come up with a guess of likely output given input 0. rather than some subconscious feeling that straight lines are most likely. Maybe the functional relationship between inputs and outputs is totally different to what we perceived from these few data points. Ideally.65 G G 0.1. If the assumptions we add to the data are to some extent correct.1 Either way.8 1.75 output 0.60 G 0. this injection of prior knowledge into the learning process allows us to generalize from the observed 1 In 1929. the economist Irving Fisher famously declared that “Stocks have reached what looks like a permanently high plateau.0 0. J OHN S TACHURSKI October 10. rather than some deeper understanding of the underlying forces generating the time series of equity prices he observed.6 0.70 0. After all. regardless of the process that led to our assumptions. .4. or the closing price of one share in Google over N consecutive days.1) and the sample standard deviation √ s N :=: s := s2 = N 1 ¯ )2 ∑ ( xn − x N−1 n =1 1/2 (4. STATISTICS 112 data points. Thus. . x N . . . suppose that we have data x1 . 2013 . roughly speaking. Common statistics used to summarize the data are the sample mean 1 N ¯ N :=: x ¯ := x ∑ xn Nn =1 the sample variance 2 s2 N :=: s := N 1 ¯ )2 ( xn − x ∑ N − 1 n =1 (4. To repeat: • A statistic is an observable function of the sample data. For example. but here we’re using the term statistic with a special meaning. Specifically. . . sure it is.3) J OHN S TACHURSKI October 10. y N ). . the rule is statistical learning = prior knowledge + data 4. the sample k-th moment is given by 1 N k ∑ xn N n =1 If we have bivariate data ( x1 .2) For positive integer k. then the sample covariance is N 1 ¯ )(yn − y ¯) ∑ ( xn − x N−1 n =1 (4. . ( x N . Isn’t this whole course about statistics? Yes. y1 ).2. which might represent the price of a Big Mac in N different countries.2 Statistics “Statistics” sounds like an odd name for a section. a statistic is any function of a given data set. 004054976 Perhaps the most important thing to remember about statistics is that. y ) [1] 0. STATISTICS 113 and the sample correlation is the sample covariance divided by the product of the two sample standard deviations. we tend to think of the data as a fixed set of numbers in a file on our hard disk. Hence. var and sd respectively. For exam¯ := N −1 ∑n xn of an identically distributed sample. each statistic is also a random quantity (i. if we look at the sample mean.001906421 > cor (x . y ) [1] 0. when we write x N − 1 N ∑n=1 xn . Sample covariance and sample correlation can be computed using cov and cor: > x <. random variable). and we only observe one value of any particular statistic—one sample mean. consider the sample mean x Each observation xn is drawn from some fixed distribution F with unknown mean J OHN S TACHURSKI October 10. x ¯ N is a random variable.rnorm (10) > cov (x .4) R has functions for all of these common statistics.rnorm (10) > y <. statistics have expectations. At this stage. etc. ¯ N := More formally. 2013 . Put differently. for example. the way that statisticians think about it is that they imagine designing the statistical exercise prior to observing the data. they are also random variables.5) ¯ N is a function from Ω → R. since. this becomes N ¯ )(yn − y ¯) ∑n =1 ( x n − x N N ¯ )2 ¯ )2 ∑ n ∑n =1 ( y n − y =1 ( x n − x (4.4. ple. With some rearranging.2.e. However. variances and so on. sample variance and sample standard deviation are calculated using the functions mean. Statistics are deterministic functions of these numbers. one sample variance. the data is regarded as a collection of random variables—even though these variables may have been previously determined in some historical data set. determined by some previous historical outcome. being functions of the sample. Hence x Being random variables. This might not be clear.. The sample mean. what we actually mean is 1 ¯ N (ω ) := x N n =1 ∑ xn (ω ) N (ω ∈ Ω) (4. x2 . Hence. .7) In R. we stack them into rows and use cov: > cov ( rbind ( x1 . this result might not seem very helpful. or even matrix valued.2. . functions of the data. the observed vectors x1 . For example. we will be working with a data matrix X where the rows are observations. if x1 . . The next session discusses this and other properties of estimators. µ. and we can just use cov(X). to obtain the sample variance covariance matrix Q of observations x1 . 4. then the sample mean is the random vector defined by 1 N ¯ := x ∑ xn Nn =1 and the sample variance-covariance matrix is defined as Q := N 1 ¯ )(xn − x ¯) ] ∑ [(xn − x N−1 n =1 (4. . . x3 .4.2 Estimators and their Properties Estimators are just statistics—that is. but ¯ is a useful predictor of the unknown quantity µ. 4. In that case. x N are random vectors of equal length.2. from linearity of expectations. This is a standard convention in statistics: rows of a matrix are observations of a random vector. we have in mind the idea of estimating a specific quantity J OHN S TACHURSKI October 10. . . x2 ) ) More commonly.6) Since we don’t actually know what µ is. x N are treated as row vectors of the matrix. the sample variance-covariance matrix is obtained using cov.2. the mean of x ¯] = E E [x 1 N n =1 ∑ N xn = 1 N n =1 ∑ E [ xn ] = µ N (4. x3 . However. 2013 .1 Vector Statistics Statistics can be vector valued. when we talk about estimators. in the sense it is: It tells us that x ¯ is unbiased that it’s “most likely” outcome is this unknown quantity.7). because. In relation to Q in (4. STATISTICS 114 ¯ is also µ. We say that x for µ. . which acts on matrices. when estimating the mean in the preceding problem we can also use the so-called mid-range estimator minn xn + maxn xn m N := 2 Which is better. 2 In J OHN S TACHURSKI October 10. .9) Low mean squared error means that probability mass is concentrated around θ . .2. For example.8) ˆ is called unbiased for θ if its bias is zero. x N is identically distributed. x N from F. . This may not be true for certain estimators.” To begin the discussion. A common way to do ¯ of the observations x1 . the sample mean or the mid-range estimator? More generally. . In other words. As we saw in (4. that depends on how you define “good. x N .9) is finite. . suppose we wish to estimate the mean µ := s F (ds) of F given observations x1 . let us consider the problem of what makes a good estimator. . If. . in addition. In this setting. . The bias of θ ˆ is defined as we collect some terminology.2). there are always many estimators we can use. the sample variance s2 is an unbiased estimator of the variance (exercise 4. For example. x ¯ is this is to use the sample mean x regarded as an estimator of µ. For an IID sample. . first ˆ be an estimator of θ . . Let θ ˆ] := E [θ ˆ] − θ bias[θ (4.2 Example 4. The mean The estimator θ ˆ of some fixed quantity θ is squared error of a given estimator θ ˆ] := E [(θ ˆ − θ )2 ] mse[θ (4.1. the random variables squared error of x ¯ is x1 . the definition of mean squared error. we are implicitly assuming that the expression on the right hand side of (4.10) (s − θ )2 F (ds) is the common variance of each xn . .7. simply because the second moment of the estimator = ∞. . if x1 . x N in question are uncorrelated.6). . the mean sample mean x ¯ is equal to the variance. Not surprisingly. to discuss an estimator.4.2. then the ¯ is always an unbiased estimator of the mean. then the variance of x ¯ ] = var var[ x where σ2 := 1 N n =1 ∑ N xn = σ2 N (4.2. In any given problem. .2. Example 4. . As a result. we need to also specify what it is we’re trying to estimate. . STATISTICS 115 of interest. or E [θ ˆ] = θ . 2013 . 4.3: Unbiased estimators Example 4. 2013 . The mid-range estimator m N may be biased as an estimator of the mean of a distribution. . ˆ.10). For example. Is that low or is it not? One way to approach this kind of question is to take the class of unbiased estimators of a given quantity θ .3. This is precisely what we want. the estimator in the set of unbiased estimators ˆ with E [θ ˆ] = θ } Uθ := {all statistics θ that has the lowest variance within this class (i. STATISTICS 116 ˆ2 θ ˆ1 θ θ Figure 4. as given in (4.2.replicate (5000 .800108 > exp (1 / 2) # Mean of lognormal [1] 1. set) is called the minimum variance unbiased estimator.648721 By the LLN. the sample mean of m N is close to E [m N ]. depending on the distribution in question. Low variance means that For an unbiased estimator θ ˆ is concentrated around its mean (see probability mass for the random variable θ figure 4. since the mean of an unbiased estimator is the quantity θ that we wish to estimate.. One issue here is that low variance is a bit hard to quantify. Bias in the lognormal case is illustrated in the following simulation: > mr <.2. .range [1] 3.function ( x ) return (( min ( x ) + max ( x ) ) / 2) > observations <. consider the variance of the sample mean. J OHN S TACHURSKI October 10. and find the estimator in the class with the lowest variance. mr ( rlnorm (20) ) ) > mean ( observations ) # Sample mean of mid . x N .e. low variance is desirable. . . This is clearly a long way from the mean of the lognormal density.3). For given θ and given data x1 . For example. x N . setting L ( α1 . . the estimator in the set of linear unbiased estimators ˆ with E [θ ˆ] = θ } Uθ := {all linear statistics θ with the lowest variance—if it exists—is called the best linear unbiased estimaˆ is a linear function of the data x1 . .3. it may be hard to determine in practice. . Example 4. we see that this set can be re-written as Uµ := ˆ= all µ n =1 ∑ αn xn with ∑ αn = 1 n =1 N N By fact 1. STATISTICS 117 While minimum variance unbiased estimators are nice in theory. For now let’s just look at an example. .4.1. or BLUE. tor. Let x1 .2. and find the best estimator in that class. . . even if such an estimator does exist in this class. . To find the BLUE. it’s certainly possible that no minimizer exists. . Hence. α N with n =1 ∑ αn = 1 N To solve this constrained optimization problem. Second. Here “linear” means that θ Linearity will be defined formally in §2. the set linear unbiased estimators of µ is given by Uµ := ˆ= all µ n =1 ∑ N αn xn with αn ∈ R. x N ∼ F.3. . where F has finite mean µ = 0 and variance σ2 . where αn ∈ R for n = 1. . we need to solve minimize σ2 n =1 ∑ N α2 n over all α1 . . the variance of an element of this class is given by var n =1 ∑ αn xn N = n =1 ∑ N α2 n var[ xn ] + 2 n<m ∑ αn αm cov[ xn .2. there is a tendency to focus on smaller classes than Uθ . . we can use the Lagrangian. . N and E n =1 ∑ αn xn N =µ Using linearity of expectations. xm ] = σ 2 n =1 ∑ α2 n N where the last equality is due to independence. N N Hence. . . . . . Take it on trust for now that the set of linear estimators of µ is given by ˆ= all statistics of the form µ IID n =1 ∑ αn xn . . .4. 2013 N . . . n = 1. . . α N . λ ) : = σ J OHN S TACHURSKI 2 n =1 ∑ N α2 n −λ n =1 ∑ αn − 1 October 10.5 on page 28. and far away with low probability. the sample mean is the best linear unbiased estimator of µ. and hence. Indeed. and it is not clear that its performance will be better. we have α∗ n = 1/ N . our estimator becomes n =1 ∑ N α∗ n xn = n =1 ¯ ∑ (1/ N )xn = x N We conclude that.4.7.1) decompose mean squared error into the sum of variance and squared bias: ˆ] = var[θ ˆ] + bias[θ ˆ ]2 mse[θ (4. each α∗ n takes the same value. we can (exercise 4. The unbiased estimator θ ˆ2 . from the constraint ∑n αn = 1. in recent years the use of biased estimators has become very common.11) In many situations we find a trade-off between bias and variance: We can lower variance at the cost of extra bias and vice-versa. J OHN S TACHURSKI October 10. note that while classical statistics puts much emphasis on unbiased estimators.4. Differentiating with respect to αn and setting the result equal to zero. . the minimizer α∗ n satisfies 2 α∗ n = λ / (2σ ) n = 1. Returning to the general case. consider the two estimators θ ˆ1 has higher variance than the biased estimator figure 4. STATISTICS 118 ˆ2 θ ˆ1 θ ˆ1 ] = θ E [θ ˆ2 ] E [θ Figure 4.2.9). . In this sense. 2013 . Using these values. To understand why. an unbiased estimator is not necessarily better than ˆ1 and θ ˆ2 of θ depicted a biased one. under our assumptions.4: Biased and unbiased estimators where λ is the Lagrange multiplier. θ An overall measure of performance that takes into account both bias and variance is mean squared error. For example. . as defined in (4. it’s important to bear in mind that what we seek is an estimator of θ that is close to θ with high probability. . N ∗ In particular. Fact 1.4. where N denotes the size of the sample Let θ ˆN is constructed. . . as explained in chapter 5. Second. which implies that x 2 If.6. then convergence of θ √ ˆN − θ ) does not diverge to infinity. E [ xn ] < ∞. To emphasize √ ˆ we sometimes say that an asymptotically normal estimator θ N is N -consistent.10) how.2. but asymptotic normality is important for two reasons. the variance of the sample mean converges to zero in N . v(θ )) for some v(θ ) > 0. x N be an IID sample √ as before.2. STATISTICS 119 4. 2 with each xn having mean µ. This means that the term θ ˆN − θ goes to N (θ √ this point. all probability mass concentrates on the mean—which is the value that the sample mean seeks to estimate. let’s now consider asymptotic properties. .2). which ¯ N is asymptotically normal (see (1. . ˆN be an estimator of a quantity θ ∈ R. We say that θ ˆN is from which θ ˆN ] → θ as N → ∞ • asymptotically unbiased for θ if E [θ p ˆN → • consistent for θ if θ θ as N → ∞ √ d ˆN − θ ) → • asymptotically normal if N (θ N (0. As another example of consistency. let’s consider the sample standard deviation s N = s defined in (4. x N in the sample are also independent. it gives us an idea of the rate of ˆN to θ . ˆN . In order to formalize it. in addition. of large numbers. . 2013 . . under the IID assumption. The sample mean x totically unbiased for the common mean µ because it is unbiased.3 Asymptotic Properties Notice in (4. In the last definition. as well as generalize to other estimators. it provides a means of forming confidence intervals and hypothesis tests. To see this. then g ( s N ) → g ( σ ). then we can also apply the central limit theorem.26) on page 34). If the random variables x1 . implies that x Example 4. The desirability of asymptotic normality is somewhat less obvious than that of consistency. .2. zero at least fast enough to offset the diverging term N . variance σ and standard deviation σ = σ2 . observe that if θ ˆN is asymptotically normal.29) on page 36). then we can apply the law ¯ N is consistent for µ (see (1. v(θ ) is called the asymptotic variance of θ All of these properties are desirable. .1 2 2 2 on page 31 tells us that if g is continuous and s2 N → σ . Taking p p J OHN S TACHURSKI October 10. Let x1 . This is a useful piece of information.2. Since the sample mean is unbiased for the mean. First.5. ¯ N of any identically distributed sample is asympExample 4.4. this suggests that in the limit. note that s2 N = N 1 N 1 ¯ N )2 = ( xn − x ∑ N − 1 n =1 N−1 N n =1 ¯ N )2 ∑ ( xn − x N 1 = N−1 N ¯ N − µ)]2 ∑ [(xn − µ) − (x N N n =1 Expanding out the square.3 Maximum Likelihood How does one come up with an estimator having nice properties. which is what we 2 want to show.4. we see that if s2 N → σ . we get s2 N N = N−1 1 N 1 ∑ ( xn − µ) − 2 N n =1 2 N n =1 ¯ N − µ) + ( x ¯ N − µ )2 ∑ (xn − µ)(x N = N−1 1 N ¯ N − µ )2 ∑ ( x n − µ )2 − ( x N N n =1 By the law of large numbers. then s N = s2 N → p p √ σ2 = σ. In more complicated settings. consistency.3. etc. Hence it suffices to prove that s2 N → σ . however. it’s natural to use the sample mean to estimate the mean of a random variable. 2013 . J OHN S TACHURSKI October 10.4. To see that this is the case. such as unbiasedness. more systematic approaches are required. One such approach is the celebrated principle of maximum likelihood.? Sometimes we can just use intuition. we then have s2 N = N N−1 1 N ¯ N )2 ∑ ( x n − µ )2 − ( µ − x N → 1 × [ σ 2 − 0] = σ 2 p n =1 Hence the sample variance and sample standard deviation are consistent estimators of the variance and standard deviation respectively. For example.1 (page 31). 1 N n =1 ∑ ( x n − µ )2 → σ 2 N p and ¯N ) → 0 (µ − x p Applying the various results in fact 1. 4. MAXIMUM LIKELIHOOD 120 p g( x ) = √ 2 x. What’s meant by the statement about x1 being relatively “unlikely” is that there is little probability mass in the neighborhood of x1 . To center the density on x1 . The prinˆ of µ the value that ciple of maximum likelihood suggests that we take as our guess µ that our distribution can be represented by a density. MAXIMUM LIKELIHOOD 121 N ( x1 . we could equivalently say that your job is to guess the the distribution N (µ distribution that generated the observation x1 . all individual outcomes s ∈ R have probability zero.4. given the observation x1 . suppose I present you with a single draw x1 from distribution N (µ. we must choose the mean to be x1 . In other words. µ) := (2π )−1/2 exp − ( s − µ )2 2 ( s ∈ R) Consider plugging the observed value x1 into this density: p( x1 . 2013 . and in this sense. where the µ is unknown. The density of x1 is p(s. then our observed data point x1 would be an “unlikely” outcome for this distribution. the most obvious guess would be that the normal density is centered on x1 . 1).1 The Idea To motivate the methodology.3. µ) = (2π )−1/2 exp − ( x1 − µ )2 2 Even though were dealing with a continuous random variable here. 1) x1 γ Figure 4. 3 Given J OHN S TACHURSKI October 10. 1) N ( γ. In guessing this distribution. our guess of µ ˆ = x1 .5: Maximizing the likelihood 4. all outcomes are equally unlikely. and the standard deviation is known ˆ of the mean µ of and equal to one for simplicity. if we centered it around some number γ much larger than x1 . See figure 4.3 In fact. in the absence of any additional information. is µ Maximum likelihood leads to the same conclusion. 1). let’s think of p( x1 . Your job is to make a guess µ ˆ of µ also pins down the distribution.5. µ) as representing the “probability” of realizing our sample point x1 . The same logic would apply if we centered the density at a point much smaller than x1 .3. Since a guess µ ˆ . and each choice of θ pins down a particular density p = p(·. . the likelihood function is p evaluated at the sample x1 . . 2013 .2). one then maximizes the joint density with respect to µ. equivaThe maximum likelihood estimate (MLE) θ lently. By independence. the joint density of this sample is obtained as the product of the marginal densities.5. . .13) The principle of maximum likelihood tells us to estimate θ using the maximizer of L(θ ) over θ ∈ Θ. µ) µ −∞<µ<∞ This coincides with our intuitive discussion regarding figure 4. Exercise 4.12) is called the maximum likelihood estimate. In this setting. . Let’s suppose now that the data x1 . A little thought will convince you that µ ˆ = x1 = argmax p( x1 .19).6 treats one example. the maximizer µ We can generalize these ideas in several ways. The same principle applies when we have x1 . we can maximize the log likelihood function (see §13. θ ) that generated the data is unknown. . θ ). . but the value of θ in the density p(·. 1).4. . this is the “most likely” µ given the samˆ = x1 is the maximizer. We will assume that p = p(·. θ ) (θ ∈ Θ) (4. Plugging the sample values into the joint density. The maximizer ˆ := argmax (2π ) µ −∞<µ<∞ − N /2 IID n =1 ∏ exp N − ( x n − µ )2 2 (4.4. As you are asked to show in exerˆ is precisely the sample mean of x1 . cise 4. In other words. p was a density function.) J OHN S TACHURSKI October 10.3.14) (In the preceding discussion. defined as the log of L: (θ ) := ln( L(θ )) (θ ∈ Θ) ˆ is the maximizer of L(θ ). . . but it can be a probability mass function as well. x N . x N . MAXIMUM LIKELIHOOD 122 maximizes this probability. x N has joint density p in the sense of (1. the functional form of p is known. . . of (θ ): ˆ := argmax L(θ ) = argmax (θ ) θ θ ∈Θ θ ∈Θ (4. . where µ is unknown. . . . . That is. θ ) is known up to a vector of parameters θ ∈ Θ ⊂ RK . . x N . ple.7. or. Alternatively. x N ∼ N (µ. In some sense. . and regarded as a function of θ : L ( θ ) : = p ( x1 .7. . 15) 4. As we saw in (4. we obviously need to know the joint density p of the data.1.3. which in this case is (θ ) = n =1 ∑ ln p(xn . . More generally. we can use the decomposition (1. when the data points are independent this is easy because the joint density p is the product of the marginals. we suppose that the pairs ( xn .2 Conditional Maximum Likelihood In many statistical problems we wish to learn about the relationship between one or more “input” variables and an “output” or “response” variable y.3. we wish to estimate the relationship between x and y.12). yn ≤ t P{ x n ≤ s ¯ t ¯ s −∞ −∞ p(s. since n =1 ∑ ln[ f θ (yn |xn ) g(xn )] = ∑ ln f θ (yn |xn ) + ∑ ln g(xn ) n =1 n =1 N N N J OHN S TACHURSKI October 10. then L(θ ) = n =1 ∏ N pn ( xn . if g is not a function of θ then this is unnecessary. θ ) N (4. (See example 4. . if each xn is drawn independently from fixed arbitrary (marginal) density pn (·. t) = f θ (t | s) g(s). yn ) N Letting g be the marginal density of x. yn ) are independent of each other and share a common density p: ¯} = ¯. and re-express the log likelihood as (θ ) = n =1 ∑ ln[ f θ (yn |xn ) g(xn )] N Since the function g enters into this expression.4. . θ ) and (θ ) = n =1 ∑ ln pn (xn . Given this data. we observe inputs x1 . . it might seem like we need to specify the marginal distribution of x in order to maximize . 2013 . MAXIMUM LIKELIHOOD 123 To implement this method. t) dsdt ¯∈R ¯. For the following theoretical discussion. . However.) In the case of scalar input. x N and corresponding outputs y1 .21) on page 26 to write p(s. How can we choose θ by maximum likelihood? The principle of maximum likelihood tells us to maximize the log likelihood function formed from the joint density of the sample. t for all s Let’s suppose that in investigating the relationship between x and y. θ ) on R. y N .4 on page 109. where θ ∈ Θ is a vector of parameters. . we have decided that the conditional density of y given x has the form f θ (y| x ). . . . which means that most hill climbing algorithms will converge to the global maximum. Analyzed by a series of brilliant statisticians. but in applications most people just call it the log likelihood. where θ is an unknown parameter and F is a cdf.g.3. 1} Taking this expression as the conditional density of y given x.e.3. then it’s called the logit model. We believe that the decision yn of the n-th individual is influenced by a variable xn measuring income from the rest of the household (e. the salary of their spouse). 4 To J OHN S TACHURSKI October 10. we will imagine that binary (i. Instead. and as such it does not affect the maximizer. If F is the logistic cdf F (s) = 1/(1 − e−s ). but there is no analytical solution for either the probit or logit case. Let q(s) be the probability that y = 1 (indicates participation) when x = s. numerical optimization is required. . maximum likelihood estimators were find the MLE. Consider a binary response model with scalar input x. To be concrete. We can then write P{y = t | x = s} = F(θ s)t (1 − F(θ s))1−t for s ∈ R and t ∈ {0. Often this is modelled by taking q(s) = F (θ s). As a result. .1. Bernoulli) random variables y1 .3.3. However. Maximum likelihood theory formed the cornerstone of early to mid 20th Century statistics. .4.3 Comments on Maximum Likelihood Let’s finish with some general comments on maximum likelihood. MAXIMUM LIKELIHOOD 124 By assumption. the (conditional) log likelihood is therefore (θ ) = = n =1 N n =1 ∑ ln[ F(θ xn )yn (1 − F(θ xn ))1−yn ] ∑ yn ln F (θ xn ) + n =1 N ∑ (1 − yn ) ln(1 − F(θ xn )) N If F is the standard normal cdf. y N indicate whether or not a sample of married women participate in the labor force. 2013 .. . the MLE is argmax θ ∈Θ n =1 ∑ ln f θ (yn |xn ) N N The term ∑n =1 ln f θ ( yn | xn ) is formally referred to as the conditional log likelihood.4 4. then the binary response model is called the probit model. we can differentiate with respect to θ to obtain the first order condition. Example 4.3. can be shown to be concave on R. the second term is independent of θ . Some discussion of numerical optimization is given in §8. which in turn requires the joint distribution of the sample as a function of θ . the set D of all normal densities. p(s.4 Parametric vs Nonparametric Estimation [roadmap] 4. asymptotically normal. σ > 0 The set D is an example of a parametric class. Consider. J OHN S TACHURSKI October 10. Dasgupta (2008. For details. at least asymptotically. PARAMETRIC VS NONPARAMETRIC ESTIMATION 125 shown to be good estimators under a variety of different criteria. σ ) = (2πσ2 )−1/2 exp − ( s − µ )2 2σ 2 . Thus.1 Classes of Distributions We used the terminology “parametric class” in the preceding discussion.t. That is. and 3. we see that to determine the MLE we must first specify the likelihood function L. 2.14).4. µ ∈ R. 4. under a bunch of regularity conditions that we won’t delve into. see. Discussion of these issues continues in §4. In this case the parametric class is the set of all normal densities p that can be formed by different choices of µ and σ. however. to pin down the MLE we need to know the parametric class of the density from which the sample was drawn. All of the nice properties of the MLE mentioned above are entirely dependent correct specification of the joint distribution. The most important criticism of maximum likelihood is that the statistician must bring a lot of knowledge to the table in order to form an estimator.4. many statisticians have become increasingly dissatisfied with the limitations of maximum likelihood. D := all p s. chapter 16). If we look at (4. µ.4.4. 2013 . For example. In the last several decades. have small variance. for example. This is a very big caveat indeed. These results are genuinely remarkable. MLEs are 1. consistent. and other approaches have become more popular. for example. • The density belongs to parametric class D = { pθ }θ ∈Θ . In such cases. a parametric class of densities D : = { pθ }θ ∈Θ : = : { pθ : θ ∈ Θ } is a set of densities pθ indexed by a vector of parameters θ ∈ Θ ⊂ RK . but θ ∈ Θ is unknown. A plot of f is given in figure 4. let’s look at an extended example. In particular. 1). Often this class will be very broad. f (s) := 1 ( s + 1)2 (2π )−1/2 exp − 2 2 + 1 ( s − 1)2 (2π )−1/2 exp − 2 2 (4. θ = (µ. In the previous example.4. consider the set D of all densities p with finite second moment. σ).6. we say that the class of densities is nonparametric.4. we typically assume that: • The data is generated by an unknown density. For example. s2 p(s)ds < ∞ This is a large set of densities that cannot be expressed as a parametric class.4. p(s)ds = 1. 2013 .t. If we get heads then we draw x from N (−1. If we get tails then we draw x from N (1. Nonparametric statistical techniques assume instead that the unknown density is in some nonparametric class of densities. and a particular choice of the parameters determines (parameterizes) an element of the class D . In the example.16) To understand the density f . PARAMETRIC VS NONPARAMETRIC ESTIMATION 126 The parameters are µ and σ. J OHN S TACHURSKI October 10. It may even be the set of all densities. and Θ ⊂ R2 . Not all classes of densities are parametric. Classical methods of inference such as maximum likelihood are parametric in nature. 4. D := all p : R → R s. suppose that we flip a fair coin. 1). we let f be a density that is a mixture of two normals. p ≥ 0. however. • We know the class. In other words. In this setting. The random variable x then has density f . More generally.2 Parametric Estimation To understand the difference between parametric and nonparametric estimation. and σ via σ µ via µ ˆ ).. PARAMETRIC VS NONPARAMETRIC ESTIMATION 127 0. . Let’s now consider whether fˆ a is good estimator of f .4. not having any particular pointers from theory as to the nature of the distribution. µ the normal density.10 0. Let’s suppose that. . the sample standard deviation. σ Plugging these into the density. This involves determining two parameters.25 −2 0 2 4 Figure 4. he makes the assumption that f ∈ D . . His (or her) objective is to estimate the whole density f . . In other words. based on the data. based on the sample. 2013 . our econometrician decides in the end to model the density as being normally distributed. Even though the fit in figure 4. his instinct will be to choose a particular parametric class for f . the deviation between J OHN S TACHURSKI October 10.05 0. his estimate becomes fˆ(s) := p(s. and then estimate the parameters in order to pin down a particular density in that class. The estimator fˆ of f proposed by our econometrician is the black density in figure 4.4. where D is the class of normal densities. Assuming that he is trained in traditional econometrics/statistics. the sample mean.15 0.e. . x200 from f and see how this procedure goes. He does this in the obvious way: He estimates ˆ := s.00 −4 0. one would probably have to say no.7 (i.7. x N from f .6: The true density f Now consider an econometrician who does not know f . as defined above. On balance.20 0. where p is ˆ. . . The density is superimposed over the histogram and the original true density f (in blue). µ and σ. He now wants to form an estimator fˆ ∈ D of f . ˆ := x ¯ . but has instead access to an IID sample x1 . Let’s generate 200 independent observations x1 . . 7: Estimate fˆ (black line) and true f (blue line) the black and blue lines) might be thought of as reasonable.4. 2013 . PARAMETRIC VS NONPARAMETRIC ESTIMATION 128 0.4. In particular. See any advanced text on density estimation. not function-valued estimators like fˆ. and seeks to construct a estimate fˆ of f on the basis of the data. the fit is not much better. We can see this by rerunning the simulation with ten times more observations (2. How can she proceed? fˆ is not consistent. the estimator fˆ does not have good properties.5 The problem is the mistaken original assumption that f ∈ D . one would say that J OHN S TACHURSKI October 10. She does not have knowledge of f . However.05 0.16) and presented to her. .15 0.3 Nonparametric Kernel Density Estimation Let’s now look at a standard nonparametric approach to the same problem.4.25 0. because consistency was defined for real-valued estimators. Unlike the previous econometrician. Our next econometrician is in the same position as the econometrician in the previous section: IID data x1 .8.20 0. There is no element of D that can approximate f well. The result is given in figure 4.30 −3 −2 −1 0 1 2 3 Figure 4.000 instead of 200). .00 0. . and then give a definition of consistency that applies here. one can define a notion of convergence in probability for functions. x N is generated from the density f in (4. however. she does not presume to know the parametric class that the density f belongs to. 4. . fˆ will not converge to f as the sample size goes to infinity.10 0. I haven’t used this terminology. 5 In general. As expected. 9 for the case of two sample points.7 asks you to confirm that fˆ is indeed a density. and let δ be a positive real number. if K is taken to be Gaussian.00 −4 0. 6 Although J OHN S TACHURSKI October 10. x1 and x2 . nonparametric kernel density estimates can be produced using the function density. we then define 1 N fˆN (s) :=: fˆ(s) := ∑K Nδ n =1 s − xn δ (4. Let K : R → R be a density function.8: Sample size = 2.05 0. In R.17) is illustrated in figure 4.25 0. drawn in black. our choice of K and δ are associated with some assumptions about the shape and form of f .4. 2) Summing these two bumps gives fˆ = g1 + g2 .7. and δ is called the bandwidth. Try.30 −2 0 2 4 Figure 4.15 0. For example. 2013 . which is the function gn ( s ) : = 1 K Nδ s − xn δ (n = 1. we have not specified a parametric class.20 0.6 A common choice for K is the standard normal density.10 0. Centered on sample point xn we place a smooth “bump” drawn in red. Exercise 4.4. PARAMETRIC VS NONPARAMETRIC ESTIMATION 129 0.17) Here K is called the kernel function of the estimator. the function fˆ in (4. For this choice of K.000 One approach to estimating f without making assumptions about the parametric class is to use a kernel density estimator. for example. From the sample. then fˆN will have exponentially decreasing tails. 4.1.4. see Devroye and Lugosi (2001.17). unlike the parametric case. type ?density. theorem 9. and. it can be proved in that.000 sample points. PARAMETRIC VS NONPARAMETRIC ESTIMATION 130 x1 x2 Figure 4. then E | fˆN (s) − f (s)| ds → 0 as N → ∞ For a proof.4.11 is given in listing 4.8. using the default settings in R. the nonparametric kernel density estimator converges to f in the following sense: IID ˆ Theorem 4. If the bandwidth δN is a sequence depending on N and satisfying δN → 0 and N δN → ∞ as N → ∞.) The fit in figure 4.11 is much better than the comparable parametric fit in figure 4. J OHN S TACHURSKI October 10.11 shows the estimate with 2. Returning to our estimation problem. 2013 . (The code for generating figure 4. for any density f . Indeed. further increases in sample size continue to improve the fit. Figure 4. figure 4. Let f N be as defined in (4.2).9: Function fˆ when N = 2 > plot ( density ( runif (200) ) ) To learn more.10 shows a kernel density estimate of f from 200 sample points. and let { xn }∞ n=1 ∼ f . Let f and K be any densities on R. The good theoretical results we have discussed are all asymptotic. and hence require abundant data. as in the two examples above. Classical physics tells her that the relationship is approximately proportional.4.20 −4 −2 0 2 4 Figure 4. Classical statistics J OHN S TACHURSKI October 10. When underlying theory provides us with knowledge of functional forms.10 0. For example.10: Nonparametric. the underlying theory provides the exact parametric class of the density. 2013 .05 0.4. you will see that. 4. the theory of Brownian motion describes how the location of a tiny particle in liquid is approximately normally distributed. This stands to reason: Nonparametric methods have little structure in the form of prior knowledge. the nonparametric estimate is actually poorer than the parametric alternative. PARAMETRIC VS NONPARAMETRIC ESTIMATION 131 0.4 Commentary In some fields of science. for small sample sizes (try 10).4. and nonparametric methods generally need large sample sizes. researchers have considerable knowledge about parametric classes and specific functional forms.00 0. This provides a functional form that the engineer can use to estimate a regression function. Here’s another example. if you experiment with the code in listing 4. sample size = 200 On the other hand.15 0. from a regression perspective: An engineer is interested in studying the effect of a certain load on the length of a spring. Hence. the parametric paradigm is excellent. seed (1234) fden <. 2013 .5 * dnorm (x . fden ( xgrid ) . mean = -1) + 0.11 set . mean =1) } return ( observations ) } observations <. 4.runif ( N ) for ( i in 1: N ) { if ( u [ i ] < 0.5 . lwd =2) J OHN S TACHURSKI October 10. xlab = " " .rnorm (1 .function ( x ) { # The density function of f return (0.5) { observations [ i ] <.fsamp (2000) xgrid <.4. length =200) plot ( density ( observations ) . mean = -1) } else observations [ i ] <.4.numeric ( N ) u <.function ( N ) { # Generates N draws from f observations <. ylab = " " ) lines ( xgrid .seq ( -4.5 .5 * dnorm (x .rnorm (1 . main = " " . PARAMETRIC VS NONPARAMETRIC ESTIMATION 132 Listing 4 The source code for figure 4. mean =1) ) } fsamp <. col = " blue " . 4. to be honest. a flexible parametric class of densities with many parameters can do a great job of approximating a given nonparametric class. economics and other social sciences are generally messier than the physical sciences. Moreover. an econometrician trying to estimate the effect of human capital on GDP does not have the same information as the engineer studying load on a spring. all the knowledge embodied in the functional forms helps us to conduct inference: Generalization is about combining knowledge and data. and be more convenient to work with. Often. modern statistics is something of a mix. These notes follow this mixed approach.25 −4 −2 0 2 4 Figure 4. 2013 . J OHN S TACHURSKI October 10. blending both parametric and nonparametric methods. Similarly.11: Nonparametric.15 0. no such elegant equivalent of the theory of Brownian motion exists for the movement of economic variables such as stock prices. there is no firm dividing line between parametric and nonparametric methods.20 0. At the same time. or exchange rates. PARAMETRIC VS NONPARAMETRIC ESTIMATION 133 0. sample size = 2. Overall. and the econometrician usually comes to the table with much less knowledge of parametric classes and functional forms.4. What about econometrics? Unfortunately.10 0.000 yields many well behaved estimators. For this reason. and the more knowledge we have the better.05 0.00 0. What does classical economics tell the econometrician about the precise functional form relating human capital to GDP? Not a lot. This suggests that nonparametric techniques will become increasingly popular in econometrics. who knows that her relationship is proportional. . Throughout this section. we’re going to look at an important kind of discrete distribution. . you will see that we can also write FN (s) = fraction of the sample less than or equal to s Graphically. The concept of the empirical distribution is a bit slippery because it’s a random distribution. x N be draws from some unknown distribution F. we will denote it by FN . . let x1 . s J with probabilities p1 . FN is a step function. . In R.12 shows an example with N = 10 and each data point drawn independently from a Beta(5. . . we see that the ecdf can be written as FN (s) = P{ x e ≤ s} = n =1 ∑ 1{ x n ≤ s } N ( s ∈ R) N 1 It’s more common to put the 1/ N term at the start. .12) on page 18 that if x is a discrete random variable taking values s1 .4. depending on the sample. . . . or ecdf. that means that probability 1/ N is placed on each point xn . Throughout the text.18). EMPIRICAL DISTRIBUTIONS 134 4. . then the cdf of x is F ( s ) = P{ x ≤ s } = j =1 ∑ 1{ s j ≤ s } p j J (4. The empirical distribution of the sample is the discrete distribution that puts equal probability on each sample point. with an upward jump of 1/ N at each data point. . x N with uniform probability 1/ N . Try this example: plot ( ecdf ( rnorm (20) ) ) J OHN S TACHURSKI October 10. . Since there are N sample points. so let’s use this as our definition: 1 FN (s) := N n =1 ∑ 1{ x n ≤ s } N If you think about it. p J .5. . we’ll work with the same fixed observations x1 . . . and x e will denote a random variable with the corresponding empirical distribution. 2013 . To describe it. .18) In this section. . 5) distribution. x N ∼ F. x e is a random variable taking each of the values x1 . Invoking (4. Figure 4. it’s a very useful object.5 Empirical Distributions Recall from (1. the ecdf is implemented by the function ecdf. . . That is. The cdf for the empirical distribution is called the empirical cumulative distribution function. Nevertheless. . then by the law of large numbers.6 0.0 0. note that if h is a function from R into R. the value of the expression (4. x N : 1 sFN (ds) :=: E [ x ] = N e n =1 ¯N ∑ xn =: x N If the sample is IID. .0 0.8 1. the mean of the empirical distribution is the sample mean of x1 . .2 0. EMPIRICAL DISTRIBUTIONS 135 1. .4 0. In other words.16) on page 22.19) converges in probability to the expectation E [h( x1 )] = h(s) F (ds).1 Plug in Estimators Continuing the proceeding discussion with the same notation.4.19) For example.5.6 0.8 0. for h : R → R and large N we have h(s) FN (ds) ≈ h(s) F (ds) This suggests an approach for producing estimators: Whenever we want to estimate a quantity θ such that J OHN S TACHURSKI October 10.0 Figure 4.4 0. 2013 .2 0.0 F FN 0. its expectation with respect to the empirical distribution is h(s) FN (ds) :=: E [h( x e )] = n =1 ∑ N h( xn ) 1 1 = N N n =1 ∑ h( xn ) N (4.12: FN and F with N = 10 4. then by (1.5. 5. For example.5. The plug in estimator of the variance 2 sk F (ds) of F is the sample t− is t− sFN (ds) 2 sF (ds) F (dt) FN (dt) = 1 N n =1 ¯ N )2 ∑ ( xn − x N This differs slightly from the sample variance s2 N defined on page 112. EMPIRICAL DISTRIBUTIONS 136 1. and 2.4. 4. . the function h is known.5). Notice that plug in estimators are nonparametric. Example 4. then we replace the cdf F with the ecdf FN and use the resulting statistic ˆN : = θ h(s) FN (ds) = 1 N n =1 ∑ h( xn ) N ˆN is called the plug in estimator of θ = h(s) F (ds).1. θ = h(s) F (ds) for some h : R → R. in this terminology. Although we have defined the plug in estimator as an estimator of quantities θ that can be expressed as integrals using F.5. because FN The estimator θ is plugged in to the expression in place of F.1. However. the deviation is negligible when N is large. 2013 .5) is −1 FN (0.2 Properties of the ecdf Once again. Remark 4.5. and let FN be the corresponding ecdf. the plug in estimator of the median F −1 (0. The plug in estimator of the k-th moment k-th moment 1 N k k s FN (ds) = ∑ xn Nn =1 Example 4.2. Perhaps the most important single fact about J OHN S TACHURSKI October 10. . . in the sense that we need no parametric class assumption in order to form the estimator. the term “plug in estimator” is often used more generally for any estimator produced by replacing F with FN . x N be IID draws from some fixed underlying distribution F. let x1 . .5. 15 illustrate the idea. Instead. . . . Both are random variables. 7 In J OHN S TACHURSKI October 10. x N are sponding ecdf. then we can learn the underlying distribution without having to impose any assumptions. Indeed. followed by an output y. Each picture shows 10 observations of FN .5. . it does not require specification of the parametric form of the underlying density in order to form the estimator. This is certainly a nice result.6 Empirical Risk Minimization Empirical risk minimization is an inductive principle that is essentially nonparametric in nature. which states the subjective loss (opposite of utility) from incorrect prediction. and we fact. from (1.13–4.1 (FTS). EMPIRICAL RISK MINIMIZATION 137 the ecdf FN is that it converges to the cdf F as N → ∞. The following theorem is sometimes called the fundamental theorem of statistics.) Thus. it starts with loss function. If x1 . at least in the IID setting. x N . assumptions are still required to generalize from this data.6. a stronger statement is true.6. the theorem tells us that convergence occurs “almost surely. if we have an infinite amount of data. . . As such. depending on 10 different observations of the data x1 .1 The ERM Principle To understand the principle. The details are omitted. 2013 . However. 4. . and FN is the correp sup | FN (s) − F (s)| → 0 R (Here “sup” is roughly equivalent to “max”—see the appendix for more discussion.4. we should bear in mind that in reality we only ever have a finite amount of data.7 Figures 4. then s∈ IID draws from some cdf F. we see that the maximal deviation between the two functions goes to zero in probability. The theorem tells us that.28) on page 36. consider a setting where we repeatedly observe an input x to a system. we have 1 FN (s) := N n =1 ∑ 1{ x n ≤ s } → P{ x n ≤ s } = : F ( s ) N p In fact. or the Glivenko-Cantelli theorem: Theorem 4. 4.” which is a stronger notion that in probability. . Except in special cases. 4 0.8 0. 2013 .4 0.2 0.6 0.8 0.4 0.0 0.0 Figure 4.0 Figure 4.8 1.2 0.0 F 0.6 0.4.0 F 0.2 0. EMPIRICAL RISK MINIMIZATION 138 1.0 0.0 0.6.4 0.8 1.6 0.6 0.0 0.13: Realizations of FN with N = 10 1.2 0.14: Realizations of FN with N = 100 J OHN S TACHURSKI October 10. EMPIRICAL RISK MINIMIZATION 139 1.6 0. f ( x )) = |y − f ( x )| • The discrete loss function L(y.0 F 0.8 0. Returning to the general problem. . .6. f ( x )).0 0.20) October 10. Incorrect prediction incurs a loss. f ( x )) = (y − f ( x ))2 • The absolute deviation L(y. f ( x ))] :=: J OHN S TACHURSKI L(t. Letting y be the actual outcome. the size of this subjective loss is taken to be L(y. ( x N . Suppose our aim is to predict new output values from observed input values.2 0.0 0. one might consider choosing f so as to minimize the expected loss R( f ) := E [ L(y. . Common choices for the loss function include: • The quadratic loss function L(y. we assume that the input-output pairs ( x1 .4 0. The function L is called the loss function. . f ( x )) = 1{y = f ( x )} The discrete loss function is typically used when y takes only finitely many possible values. In particular.2 0.8 1. A natural way to treat this problem is to choose a function f such that f ( x ) is our prediction of y once x is observed. t)dsdt (4.6 0.0 Figure 4.4. y N ) we observe are IID draws from joint density p. No loss is incurred otherwise.4 0.15: Realizations of FN with N = 1000 believe there is some stable relation between them. A unit loss is incurred if our guess is incorrect. y1 ). with arbitrary loss function L. f (s)) p(s. 2013 . 22) is not equal to f ∗ . f (xn )) N (4.20). .21) In particular. Although this reasoning seems logical. If we knew the joint density p.2. we obtain an estimate fˆ of the true minimizer argmin f R( f ) by solving ˆ ( f ) = argmin 1 fˆ = argmin R N f ∈F f ∈F n =1 ∑ L(yn . we do have at hand the observed inputoutput pairs ( x1 . as visualized in figure 4. Hence R( f ) cannot be evaluated. By repeating this calculation for different f . J OHN S TACHURSKI October 10. which are independent draws from p. y1 ). it might seem that we should F to be the set of all functions f : R → R. the law of large numbers tells us for large N we have R( f ) := E [ L(y. After all. then the solution to (4.6. there is an obvious difficultly: We don’t know p. At first pass. f (xn )) N (4. . f ( x ))] ≈ 1 N n =1 ∑ L(yn .22) This inductive principle.4. However. at least in principle. then. These ideas are explored in detail in §4. Draws from p give us information about p. which produces an estimate of the risk-minimizing function by minimizing the empirical risk. if the risk minimizing function f ∗ := argmin f R( f ) is not in F . the expected loss is called the risk. y N ). in the statistical setting. and is a class of candidate functions chosen by the econometrician or researcher. All is not lost. While we don’t know p. let alone minimized. . In fact. and we are making a sub-optimal choice. f (xn )) N This motivates us to replace the risk function R with the empirical risk function ˆ ( f ) := 1 R N n =1 ∑ L(yn .16.20). we want to be quite restrictive in our choice of F . EMPIRICAL RISK MINIMIZATION 140 In (4. it turns out that setting F to be the set of all functions from R → R is a bad idea. This set of functions is called the hypothesis space.6. 2013 . and R is called the risk function. or at least take it to be as large as possible. Notice that in (4. . is called empirical risk minimization (ERM). ( x N . we could evaluate R( f ) for any f by calculating the double integral in (4.22) we are minimizing over a set of functions F . we could search for a minimizer. For example. 23) For obvious reasons.24) on page 30. 2013 .16: Choosing the hypothesis space 4. y 1 makes no difference to the solution fˆ (see §13.26) Comparing with (1.25) This is the empirical risk counterpart to the risk minimization problem (1.6. this optimization problem is called the least squares problem. the ERM problem becomes term N min f ∈F n=1 ∑ (yn − f (xn ))2 N (4.2).24) ∈ L n =1 ∑ (yn − (xn ))2 = min N α.6.4.2 ERM and Least Squares ˆ ) = (y − y ˆ )2 .23) on page 29. and observing that the Specializing to the quadratic loss function L(y. which gives the minimizers of the risk. Direct differentiation and simple manipulations show that the minimizers of the empirical risk are N ¯ )(yn − y ¯) ˆ = ∑ n =1 ( x n − x β N ¯ )2 ∑ n =1 ( x n − x and ˆ = α 1 N ˆ1 ∑ yn − β N n =1 N n =1 ∑ xn N (4. β n =1 ∑ ( y n − α − β x n )2 N (4. EMPIRICAL RISK MINIMIZATION 141 all f : R → R F f ∗ := argmin f R( f ) Figure 4. J OHN S TACHURSKI October 10. we see that in this case the minimizers of the empirical risk are the sample analogues of the minimizers of the risk. If we specialize F to be the set of affine functions L := { all functions of the form ( x ) = α + β x } then the problem reduces to the linear least squares problem min (4. 4. To fix notation. let Pd be the set of all polynomials of degree d. That is. 1] and then y = cos(π x ) + u where u ∼ N (0. rather using the expectation from the actual distribution. the actual risk R( fˆ) may not be. We need to be careful about reading “too much” into this particular sample. while the function fˆ obtained by minimizing empirical risk over a large set of functions will make the empirical risk ˆ ( fˆ) small.6. 1]. if A is any set. g : A → R. EMPIRICAL RISK MINIMIZATION 142 Now let’s return to the issue of hypothesis space mentioned above: Why would we want to minimize the empirical risk over a restricted hypothesis space such as L. In the example.24) is equal to P1 . the set of linear functions L defined in (4. then f can be represented as a polynomial of degree d + 1 just by setting the last coefficient cd+1 to zero: Pd f d ( x ) = c0 x 0 + c1 x 1 + · · · c d x d = c 0 x 0 + c 1 x 1 + · · · c d x d + 0 x d +1 ∈ P d +1 Also. then infa∈ D g( a) ≤ infa∈ D g( a) always holds. The underlying problem is that we are R attempting to minimize expected loss on the basis of a sample mean. the risk minimization problem is min R( f ) f ∈Pd 8 Intuitively. Let’s illustrate this point by way of an example. 1] is the uniform distribution on the interval [−1.28) if we expand the set of candidates. J OHN S TACHURSKI October 10. the model we will consider is one that generates input-output pairs via x ∼ U [−1.27) where U [−1.8 Isn’t that desirable? The answer is: not necessarily. and D ⊂ D ⊂ A. where empirical risk is minimized over progressively larger hypothesis spaces. If we seek to predict y from x using quadratic loss and the set Pd as our candidate functions. Pd := { all functions f d ( x ) = c0 x0 + c1 x1 + · · · cd x d where each ci ∈ R} Clearly P1 ⊂ P2 ⊂ P3 ⊂ · · · because if f is a polynomial of degree d. 2013 . Formally. 1) (4. The reason is that. Our hypothesis spaces for predicting y from x will be sets of polynomial functions. minimizing the empirical risk over a bigger set of functions makes the empirical risk smaller. then we can find a smaller value. rather than the entire set of functions from R to R? After all. where R( f ) = E [(y − f ( x ))2 ] (4. 17. and is. we ˆ ( fˆd ). a polynomial of degree d. as expected.28) and calculating the expectation numerically.17: Risk and empirical risk as a function of d while the empirical risk minimization problem is ˆ(f) min R f ∈Pd where ˆ(f) = 1 R N n =1 ∑ (yn − f (xn ))2 N (4.7 0. . For large d.29) repeatedly.5 0. by construction. the risk decreases slightly and then increases rapidly. 9 The J OHN S TACHURSKI October 10.9 The results are in figure 4.2 2 4 6 8 G 0. . Finally. . Taking this as our data set. . empirical risk falls monotonically with d.4. compare the risk R( fˆd ) and empirical risk R Analysing the figure. the minimizer fˆd of the empirical risk is associated with high risk in the sense of large expected loss.4 risk GG G GGGGGGG GG G GGG G GG G GGGGGGGGGGGG GGG 0.6. we first generate N = 25 data points from the model (4. EMPIRICAL RISK MINIMIZATION 143 Risk G Empirical risk G 0.6 empirical risk 0. The solution to the d-th minimization problem is denoted fˆd . once for each d in 1. we see that. 15.1 G G 5 10 degree 15 20 0 5 10 degree 15 20 Figure 4. 2. we then solve (4.27). On the other hand.3 0. 2013 . This must be the case because minimizing a function over larger and larger domains produces smaller and smaller values. risk R( fˆd ) is evaluated by substituting fˆd into the expression for R in (4.29) To illustrate the difference between risk and empirical risk. This situation is called “over-fitting. There is no function in this class that can do a good job of fitting the underlying model. When d = 3. fˆ11 and fˆ14 . the prediction fˆ14 ( x ) is likely to be a poor predictor of y.18–4. the fit to the underlying model is poor. and fitted polynomial fˆd in red.21.” What can we learn from this discussion? The main lesson is that the choice of the hypothesis space F in the empirical risk minimization problem (4. the hypothesis space should be carefully chosen on the basis of economic theory: F should be the set of “reasonable” candidate descriptions of the relationship between x and y.17.20 and 4. In figures 4.23) is crucial.21 we have d = 3. This may or may not be a good choice. In real statistical applications. EMPIRICAL RISK MINIMIZATION 144 We can get a feeling for what is happening by plotting the data and the functions. the set of linear functions. many researchers simply choose F = L.21).” and is reflected in the poor fit of the red line to the black line in figure 4. When a new input x is drawn. passing near to many of the data points. 4. if we take d up to 11 or even 14.20 and 4. the fit of the function is fairly good (figure 4. the message is that statistical learning equals prior knowledge plus data. and that we only have 25 observations.27) in black. given our knowledge of the economic system we are modelling. The function y = cos(π x ) is the risk minimizer.19. the N data points are plotted alongside the function y = cos(π x ) from the true model (4. In figures 4.6. the hypothesis space Pd = P1 is quite small. When d = 1. In response. and the fitted polynomials are fˆ3 . we see that it is lower than for d = 1. we do not have the luxury of knowing the true model when we choose F .19). This is the problem of model selection. 2013 .18 we have d = 1. Examining the corresponding figures (figures 4. This situation is called “under-fitting. and represents the ideal prediction function. In a sense it is paying too much attention to this particular realization of the data. and the fitted polynomial fˆ1 is the linear regression line. Once again. On the other hand. The problem of model selection is discussed in more depth in chapter 10. Given that the data is relatively noisy. J OHN S TACHURSKI October 10. we see that the fitted polynomial has been able to fit the observed data closely. If we look at the risk for d = 3 on the right-hand side of figure 4. and the risk is high. In figure 4.18.4. Ideally. the class of functions Pd = P3 is considerably larger. and the risk is correspondingly high. d = 11 and d = 14 respectively. 0 G G G G G G G G −0.6.18: Fitted polynomial.0 G G G G G G G G −0.5 1. d = 3 J OHN S TACHURSKI October 10.0 G −0.0 degree = 1 0.0 G G G 0.5 G G G G G 0. EMPIRICAL RISK MINIMIZATION G G 145 G 1.5 G G G G G 0.0 G Figure 4. 2013 .5 G G G G G G G G G −1.0 G G −1.4.19: Fitted polynomial.0 G −0.5 0.5 0. d = 1 G G G 1.0 G Figure 4.0 degree = 3 0.5 G G G G G G G G G −1.0 G G G 0.0 G G −1.5 1. 2013 .5 G G G G G 0.0 G G −1.5 G G G G G G G G G −1.0 G Figure 4.0 G G G G G G G G −0.5 G G G G G G G G G −1.0 G Figure 4.0 G −0.21: Fitted polynomial. EMPIRICAL RISK MINIMIZATION G G 146 G 1.5 G G G G G 0.0 degree = 11 0.20: Fitted polynomial.0 G G G G G G G G −0. d = 11 G G G 1.0 G −0.5 0.0 degree = 14 0.5 1.4. d = 14 J OHN S TACHURSKI October 10.0 G G −1.0 G G G 0.0 G G G 0.6.5 0.5 1. One is maximum likelihood. We want to predict the payoff of the next draw. We wish to learn the density q by observing draws from this density. We take our loss function to be L( p. we can recover many parametric techniques as special cases of empirical risk minimization. . Unlike maximum likelihood. 2013 . . say.6. we usually don’t have to specify the parametric class of the unknown distributions in order to solve the ERM problem. then our loss is − ln p( x ). this becomes R(α) = E [(α − y)2 ] = The empirical risk is ˆ (α) = 1 R N IID (α − s)2 F (ds) n =1 ∑ ( α − y n )2 = N (α − s)2 FN (ds) (4. A simpler situation is where we observe only y and seek to predict it. at least with quadratic loss. suppose that x is drawn from unknown density q. Minimizing R to α. y N ∼ F. the corresponding risk is E [ L(α.6. if our guess of q is p and the value x is drawn. To see this. . This makes the learning problem simpler. that we observe y1 .3 Other Applications of ERM In our discussion of ERM so far. For the sake of concreteness. which is the most natural predictor of y. where F is an unknown cdf. we obtain our prediction of y as ˆ (α) = 1 α∗ := argmin R N α n =1 ∑ yn N Thus. At the same time. . let’s imagine that each observation is the monetary payoff of a particular game at a casino.30) ˆ (α) with respect Here FN is the empirical distribution of the sample. In this case the object we seek to calculate is just a constant (a prediction of y) rather than a function (that predicts y from any given x). we have talked about finding functions to predict y given x. for example. Letting α be our prediction and L be the loss function. x ) = − ln p( x ). Suppose. J OHN S TACHURSKI October 10. As this last example helps to clarify. the ERM principle is essentially nonparametric in nature. In other words.4. ERM leads to the sample mean. EMPIRICAL RISK MINIMIZATION 147 4. If we specialize to the quadratic loss. y)]. The empirical risk is determined only by the loss function and the empirical distribution. The KL deviation is possibly infinite. Now suppose that we observe IID draws x1 . θ )}θ ∈Θ . . let’s now add the assumption that the unknown density lies in some parametric class { p(·. It follows from this expression that the ERM estimator is precisely the maximum likelihood estimator. then we suffer large loss. x )] = − ln[ p(s)]q(s)ds Although it may not be obvious at first glance. the ERM principle indicates we should solve ˆ ( p) = argmin ˆ := argmin R p p p 1 N n =1 ∑ N − ln p( xn ) = argmax p n =1 ∑ ln p(xn ) N To make the connection with maximum likelihood. . Note that if two density are equal almost everywhere then they share the same cdf. minimization of the risk comes down to minimization of q(s) D (q. Equality almost everywhere is a basic concept from measure theory that is (only just) beyond the scope of these notes. and hence represent the same distribution. and zero if and only if p = q. always nonnegative.10 It follows that the unique minimizer of the risk is the true density q. . precisely.4. we obtain ˆ = argmax θ θ n =1 ∑ ln p(xn . EMPIRICAL RISK MINIMIZATION 148 Loosely speaking. let us first transform our expression for the risk function to q(s) R( p) = ln q(s)ds − ln[q(s)]q(s)ds p(s) (Can you see why the two expressions for R( p) are equal?) The term on the far right is called the entropy of the density q. x N from q. but the true value θ generating the data is unknown. Suppose that we know the parametric class. Hence. the loss function encourages us to choose p close to q. Our choice of loss leads to the risk function R( p) = E [ L( p. p) := ln q(s)ds p(s) This quantity is called the Kullback-Leibler (KL) deviation between q and p. Re-writing ˆ of q now reduces to choosing an estimate θ Choosing our estimate p our optimization problem for this case. p) = 0 if and only if p = q almost everywhere. 10 More J OHN S TACHURSKI October 10. minimizing this risk function yields the unknown density q. To see this. To estimate q. θ ) N = argmax (θ ) θ where is the log-likelihood. 2013 . . ˆ of θ . and does not involve p. D (q.6. Hence. if p puts small probabilities on regions where x is realized. x N with variance σ2 .4 Concluding Comments The ERM principle is a very general principle for solving statistical problems and producing estimators. Use the principle of maximum likelihood to which is just the sample mean = x 12 obtain the same conclusion. . some rather general consistency results have been obtained. . . . 1}.2) is typically biased. . and let σN be a consistent estimator of σ.7.2. To see why. Each xn is a Bernoulli random variable.e. ¯ N be the sample Ex.7. x100 . as discussed in §4. Let x1 . . . .6. Ex. . . 4. Let x mean.6. for any estimator θ ˆ] + bias[θ ˆ ]2 . . . . Having said that.7.7. 12 Hint: 11 The sample standard deviation IID J OHN S TACHURSKI October 10. . with each payoff independent of the previous outcome. x N be IID with mean µ and variance σ2 .1. Let µ be the probability of winning (i.6.7. x N . . m4 := s4 F (ds) Define the plug in estimator of m4 .7. . where F is a cdf. Some discussion can be found in [19]. 4. there will be instances when ERM produces poor estimators.3. That is.2.5. Let m4 < ∞ be the 4-th moment. Is the estimator consistent? Why or why not? Ex. 4. Suppose we are playing a slot machine (one-armed bandit) that either pays one dollar or nothing. . What is the limiting distribution of y N := N ¯N − µ x σN 2 Ex.4. Confirm (4. . Confirm that for an IID sample x1 . Let x1 .. 4. x N ∼ F. s (4. Indeed. a natural estimator of µ is the fraction of wins. 4. receiving one dollar).7. we have mse[θ ˆ] = Ex.7 Exercises ˆ of θ . var[θ Ex. look up Jensen’s inequality. ¯ . 2013 . the sample 2 11 variance s2 N defined in (4. Confirm that the maximizer of (4.4. The details are beyond the level of these notes.12) is the sample mean of x1 . EXERCISES 149 4. For such a general method it is difficult to give a set of strong results showing that ERM produces good estimators. . 1}. Having observed 100 plays x1 . where xn ∈ {0. the probability mass function for which can be written as p(s. 4. .1) is unbiased for σ . µ) := µs (1 − µ)1−s for s ∈ {0. 4.11): Show that. 7.5. . the first histogram should be of 10. What is the expectation and variance interval [0. In the simulation. (Warning: This exercise is harder. u N }. 500. 14 Hint: Use the add and subtract strategy. Let x be a random variable with µ := E [ x ]. ˆN be an estimator of θ .12.4. 4. .) What do you example. simulate 10.) 4.9 (Computational). 2. Extending your results in exercise 1. every δ > 0 and every realization of the sample. . (For x ¯1 . where u1 . Ex.000 observations of x observe about these four distributions? What interpretation can you give? Ex. .1 Solutions to Selected Exercises ˆ]. . 2013 .8. 13 Hint: fˆ(s)ds = 1 J OHN S TACHURSKI October 10.7.10 (Computational). x N be IID and uniformly distributed on the ¯ N be the sample mean. 4. Let x ¯ N ? For N = 1. using one histogram for each value of N . determine the cdf of z := max{u1 . . . 1].7. 10. Let x1 .. The ecdf should be close to the cdf. 1].000 observations of the random variable of x ¯ N . .7. Show that fˆ in (4. .11 (Computational).7. . along with your expression for the cdf. based on the definition (i. and requires a bit of then θ experience with analysis to solve properly.24. 4.14 Ex. Ex. EXERCISES 150 Ex.1. Try a change-of-variables argument. Adding and subtracting E [θ ˆ] := E [(θ ˆ − θ )2 ] = E [(θ ˆ − E [θ ˆ] + E [θ ˆ ] − θ )2 ] mse[θ Expanding the square and minor manipulations yield the desired result. Check this by generating 1. . 4.7. Consider the risk function given by R(θ ) = E [(θ − x )2 ].7. Let θ ˆN is consistent for θ .7. we get Solution to Exercise 4.000 draws of y and plotting the ecdf. without using differentiation.7.7.17) is a density for every N . 4. set N = 5. Show that µ is the minimizer of R(θ ) over all θ ∈ R. You need to show that fˆ is nonnegative and integrates to one. 4. Showing that is the hard part. Show that if θ ˆN is asymptotically normal. . Histogram the observations. Implement the ecdf as your own user-defined function in R. u N are N independent random variables uniformly distributed on [0.13 Ex. that it reports the fraction of the sample falling below a given point).e. J OHN S TACHURSKI October 10. By fact 1.4.5. µ) = ∑ [xn log µ + (1 − xn ) log(1 − µ)] n =1 N N ˆ=x ¯ as Differentiating with respect to µ and setting the result equal to zero yields µ claimed. given the stated assumption that m4 < ∞. fact 1.7. To show that fˆ(s)ds = 1.4.7.7. with pmf given by p(s.4. Solution to Exercise 4.2. the proof is done. y N = w2 N converges in distribution to z . Each xn is a Bernoulli random variable. µ) := µs (1 − µ)1−s for s ∈ {0. The sample fourth moment is consistent for m4 under the IID assumption by the law of large numbers.2.2. we then have w N → z ∼ N (0. Solution to Exercise 4. Solution to Exercise 4. the joint distribution is the product of the marginals.3 that s2 = N N−1 1 N ¯ − µ )2 ∑ ( x n − µ )2 − ( x N n =1 ¯ ] = E [( x ¯ − µ)2 ] = σ2 / N from (4. 1). The nonnegativity of fˆ is obvious. 2013 .7. 1}. By the continuous mapping theorem (fact 1.4. Applying the central limit theorem and Slutsky’s theorem (fact 1.5 on page 34) together. it’s enough to show that s−a K ds = δ δ for any given number a.4 2 on page 33).7.7 on page 1. applying the result var[ x and rearranging gives E [s2 ] = σ2 as claimed. The plug in estimator of m4 is the sample fourth moment. By independence.10) Taking expectations.3. EXERCISES 151 Solution to Exercise 4.2. which leads to K s−a δ ds = K (u)δdu = δ K (u)du Since K is a density.1 on page 31 yields σ /σN → σ /σ = 1.7. Let w N := p √ x ¯N − µ ¯N − µ σ √ x N = N σN σN σ p Since σN → σ by assumption. We showed in §4. the distribution of z2 is χ2 (1).7. This equality can be obtained by the change of variable u := (s − a)/δ. Solution to Exercise 4. and hence the log likelihood is d (µ) = n =1 ∑ log p(xn .6.7. and choose M such that P{|z| ≥ M} ≤ . Solution to Exercise 4. we can express R(θ ) as R(θ ) = E {[(θ − µ) + (µ − x )]2 } Expanding this out and using E [ x ] = µ. applying asymptotic normality.7. EXERCISES 152 Solution to Exercise 4.4.31) N →∞ (If a ≥ 0 and a ≤ for any > 0. then a = 0.12. Evidently a minimum is obtained when θ = µ.8.4 on page 33) and the definition of M gives (4. It suffices to show that for any positive number we have ˆN − θ | > δ } ≤ lim P{|θ (4.7. the continuous mapping theorem (fact 1. 2013 . Let z √be standard normal.4.) To establish (4. Fix δ > 0.7.31). fix > 0. J OHN S TACHURSKI October 10. Adding and subtracting µ. For N such that N δ ≥ M we have √ √ √ ˆ N − θ | > δ } = P{ N | θ ˆ N − θ | > N δ } ≤ P{ N | θ ˆN − θ | > M } P{|θ Taking N → ∞. we obtain R(θ ) = (θ − µ)2 + var[ x ].31). 1. then the distribution of θ the sampling distribution depends partly on the known object T . . . Usually this The distribution G of θ IID distribution will depend on unknown quantities. Thus. For example. x N is the data. x N ) for known function T . which is the cdf G (s) = P{θ ˆ ≤ s }. for example. . . Being a random variable. . (4. .5) on page 113. If x1 . x N (ω )) θ (See. . and partly on the uknown object F. For example. if θ statistic. are random ˆ is an estimator of some quantity θ . . then θ ˆ is the random variable θ ˆ(ω ) = T ( x1 (ω ). . . must be an observable function of the data.1 Making Inference about Theory [roadmap] 5. being a variables. .2 we emphasized the fact that statistics. . for some unknown cdf F. . then θ ˆ.) ˆ has a distribution. and hence estimators. if x1 .Chapter 5 Methods of Inference [roadmap] 5. and ˆ = T ( x1 . 153 (ω ∈ Ω) . . . x N ∼ F ˆ will depend on F and T .1 Sampling Distributions In §4. θ ˆ is sometimes called its sampling distribution. . ∑n =1 xn must be gamma. σ2 ). . As a second example. we will estimate mean duration using the sample mean x Since sums of independent exponential random variables are known to have the N gamma distribution. Combining fact 1.6 on page 24. . . Thus. The problem is one where we hold a belief or theory concerning the probabilities generating the data. .1) ¯ . the sampling distribution still depends on the unknown quantities θ and σ. . we are going to consider a different style of problem. 5.6) sample mean x on page 114 and (4. Using these observa¯ = N −1 ∑ n x n . ¯ is evaluated. tions. First. . Since scalar multiples of gammas are N − 1 ¯ again gamma. its value will be a draw from a gamma when the data is collected and x distribution. . We propose to monitor consecutive calls until observations x1 .) In this case. we obtain ¯ ∼ N ( θ . Suppose that some economic theory that implies a specific value θ0 for the unknown parameter θ (prices should be equal to marginal cost. x = N ∑n =1 xn must also be gamma.1. Which particular gamma distribution depends on the unknown quantity λ. where both θ and σ are unknown. . (4.2 Comparing Theory with Outcomes In chapter 4. . where both θ and σ are unknown. suppose again that we have x1 . σ2 / N ) x1 . .10) on page 115. etc. Consider the ¯ as an estimator of the mean θ .2. with density f (s) = 1 − exp(−λs). We model the duration x between two calls as having the exponential distribution. . . and we are interested in whether the observed data provides evidence for or against that theory. and at the properties of these estimators. 2013 . . . our interest in observing θ ˆ shed on our theory? light does this realization θ IID J OHN S TACHURSKI October 10. excess profits ˆ will be: What should be equal to zero. x N are recorded. we were interested in estimating and predicting. suppose that x1 .1. Although this expression The right-hand side is the sampling distribution of x provides some information. In this chapter. σ2 ) =⇒ x IID (5.5. suppose that we are interested in the average length of time between incoming telephone calls at a call center during business hours. x N ∼ N (θ . MAKING INFERENCE ABOUT THEORY 154 Let’s look at two examples. . x N ∼ N (θ . We looked at ways to find estimators. x N is an IID sample from the normal distribution N (θ . σ2 ). The parameter λ is unknown. under our assumptions. To illustrate. or models that are “plausible” given the observed outcome of a statistic. But this is not really an answer until we specify what “a long way” is. For example. we can look at the sampling distribution of ˆ.1. σ2 / N ).1: Theoretical and realized values N ( θ0 . J OHN S TACHURSKI October 10. Looking at this figure. 2013 . Plugging the hypothesized mean θ0 and the estimated variance s2 into the sampling distribution gives the density in figure 5.2. and own distribution.2. then θ ˆ is “unlikely” when our theory is true. In the present case. and the second parameter σ2 can be estimated consistently by the sample variance s2 .5. Although θ the parameters θ and σ are not known. this distribution is N (θ . we ˆ can indeed be regarded as a long way apart. our theory specifies that θ should be equal to θ0 . 5. consider figure 5. CONFIDENCE SETS 155 x ˆ θ θ0 Figure 5. in the sense that if can see that θ0 and θ ˆ would be a realization from way out in the tail of its our theory was correct. the realization θ this fact can be construed as evidence against the theory. distributions. s2 / N ) ˆ θ θ0 Figure 5. Is θ0 a long way ˆ? from θ To determine what “a long way” means. In the following sections we formalize these ideas.1).2: Estimated sampling distribution when the theory is correct ˆ appears to contradict our theory when it’s a The naive answer is: The realization θ long way from our hypothesized value θ0 .2 Confidence Sets The idea of confidence sets is to provide a set of parameter values. Thus. as shown in (5. If C (x) is an interval in R. 1). then our confidence sets will contain the true parameter about (1 − α) × 100% of the time. Let x1 . not the parameter θ .14) on page 21. A random set C (x) ⊂ Θ is called a 1 − α confidence set for θ if Pθ {θ ∈ C(x)} ≥ 1 − α for all θ ∈ Θ Remember. then C (x) is also called a confidence interval. . . Some rearranging gives ¯ N − √ zα/2 ≤ θ ≤ x ¯ N + √ zα/2 = 1 − α Pθ x N N Since this √ argument is true regardless of the value of θ . . Let Fθ be the joint distribution of the sample vector x when the data is generated by Mθ .2) σ and hence. we let Pθ {x ∈ B} be the probability that x ∈ B given x ∼ Fθ . x N ∼ N (θ . σ2 ). or a family of proposed regression functions { f θ } in a regression problem.1. By (5. where θ ∈ Θ = R is unknown.1 Parametric Examples Suppose that we have a parametric class of models { Mθ } indexed by parameter θ ∈ Θ ⊂ R. let Pθ {x ∈ B} := 1{s ∈ B} Fθ (ds). We wish to form a confidence interval for θ .2. The interpretation of the confidence set is that. Suppose for the moment that σ is known. then C (x) := ( x 1 For σ σ those readers who prefer more formal definitions. .1).1 Fix α ∈ (0. 1) (5. We anticipate observing a sample x := ( x1 . . and hence the experiment should be designed such that Pθ {θ ∈ C (x)} ≥ 1 − α regardless of which θ is generating the data. October 10. We use the notation Pθ to refer to probabilities for x.5. . For example. IID √ Pθ N ¯ N − θ | ≤ zα/2 |x σ = 1 − α when zα/2 := Φ−1 (1 − α/2) Here Φ is the standard normal cdf. The “for all” statement is necessary since we don’t actually know what the true value of θ is. . it is the set that is random here. CONFIDENCE SETS 156 5.2. x ¯ N + en ) is a 1 − α confidence interval for θ σ zα/2 / N . the models might describe a set of cdfs { Fθ }. x N ) generated by one of the models Mθ . . if we conduct all our statistical experiments with a fixed value of α. 2013 J OHN S TACHURSKI . we have √ (x ¯N − θ) N ∼ N (0. given B ⊂ R N . For example.2. we conclude that if en := ¯ N − en . applying (1. Example 5. 4) 5.3) where FN −1 is the cdf of the t-distribution with N − 1 degrees of freedom.3 that an estimator θ is called asymptotically normal if √ d ˆN − θ ) → N (θ v(θ ) Z as N → ∞ (5.2 The reasoning in example 5.2. but nevertheless the terminology is very common. This is part of a more general phenomenon ˆN of θ ∈ R related to the central limit theorem.5) proof of this statement is clearly related to fact 1. rather than an estimate of the standard deviation.4) and the approximation will be accurate.2 Asymptotic Confidence Sets As the degrees of freedom increases.2) becomes √ (x ¯N − θ) ∼ FN −1 N sN (5.2.2 as ¯ N − se( x ¯ N )tα/2 .2. x ¯ N + se( x ¯ N )tα/2 ) C (x) := ( x (5.9 on page 25. a natural approach is to replace σ with the sample standard deviation s N . we can write the confidence interval in example 5.2. quantiles of the t-distribution converge to those of the normal distribution. (5. 2 The J OHN S TACHURSKI October 10. the term In example 5. In this case. for large N .2.5.1. Hence. then the standard error is ˆ) := a sample estimate of the standard deviation of θ ˆ se(θ Of course this is not really a formal definition because we haven’t specified which estimate of the standard deviation we are talking about. a more realistic situation is that σ is also unknown. 3 Sometimes the term “standard error” is used to refer to the standard deviation of the estimator. The details are a bit fiddly. note that since the standard deviation of x is σ / N √ ¯ .1 now goes through when zα/2 is replaced by tα/2 := −1 FN −1 (1 − α /2). CONFIDENCE SETS 157 Example 5. We omit them because more general results are established in chapter 7. we can replace tα/2 with zα/2 in (5.2.2.2.2.2. and we obtain ¯ N − √N tα/2 ≤ θ ≤ x ¯ N + √N tα/2 = 1 − α Pθ x N N s s √ ¯ N . In that case.2. It is often s N / N is a sample estimate of the standard deviation of the estimator x ˆ called the standard error.3 Using our new notation. More generally. Recall from §4. Continuing on from example 5. if θ is an estimator of some quantity θ . 2013 . as are smooth transformations of sample means (theorem 1. For is called the asymptotic variance of θ example. notation. so there is a 95% likelihood that the true parameter lies within 2 standard deviations of the observed value of our estimator. (5.5) and (5.7) ˆN ) is where the approximate equality is in terms of distribution. which explains our choice of an approximation to the standard deviation of θ ˆN ) is referred to as the standard error of the estimator. ˆN ) such that Now suppose we have a sequence of statistics se(θ √ p ˆN ) → N se(θ v(θ ) as N→∞ (5. CONFIDENCE SETS 158 where v(θ ) is some positive constant and Z is standard normal. for large N . we have the following useful rule of thumb: If α = 0.5. We see that se(θ ˆN . the sequence For our asymptotically normal estimator θ ˆN − se(θ ˆN )zα/2 . se(θ A sequence of random sets CN (x) ⊂ Θ is said to form an asymptotic 1 − α confidence set for θ if N →∞ lim Pθ {θ ∈ CN (x)} ≥ 1 − α for all θ∈Θ ˆN . taking the limit and applying (5. Many estimators have this property.6) As exercise 5. we can write ˆN ≈ θ + se(θ ˆN ) Z θ (5. 2013 .5.6) imply that ˆN − θ d θ → Z ˆN ) se(θ As a result. N →∞ (5. The constant v(θ ) ˆN . rearranging.8) as N→∞ (5. θ ˆN + se(θ ˆN )zα/2 ) CN (x) := (θ can be used because.9) ˆN − se(θ ˆN )zα/2 ≤ θ ≤ θ ˆN + se(θ ˆN )zα/2 lim Pθ {θ ∈ CN (x)} = lim Pθ θ N →∞ = lim Pθ N →∞ −zα/2 ≤ ˆN − θ θ ≤ zα/2 ˆN ) se(θ = 1−α Looking at (5.1 asks you to show.2. J OHN S TACHURSKI October 10. most maximum likelihood estimators are asymptotically normal.4.3 on page 37).9) gives a good indication of why standard errors are normally reported along with the point estimate.05. For example.7). As before. then zα/2 ≈ 2. he showed that √ N sup | FN (s) − F (s)| → K s∈ d R (5. Conversely.2.10) where K is the Kolmogorov distribution √ 2π ∞ ( 2i − 1 ) 2 π 2 K (s) := exp − s i∑ 8s2 =1 ( s ≥ 0) (5. we get √ { F ∈ CN (x)} = −k1−α ≤ N ( FN (s) − F (s)) ≤ k1−α for all s √ = N | FN (s) − F (s)| ≤ k1−α for all s √ = sup N | FN (s) − F (s)| ≤ k1−α s Hopefully the last equality is clear. the limiting distribution K is independent of the cdf F that generates the data. and let CN (x) := k 1− α k 1− α ≤ G (s) ≤ FN (s) + √ for all s ∈ R F ∈ F : FN (s) − √ N N The set CN (x) ⊂ F is an asymptotic 1 − α confidence set for F. Indeed. In particular. A.10) now confirms our claim: lim P{ F ∈ CN (x)} = lim P sup N →∞ s √ N →∞ 4 If N | FN (s) − F (s)| ≤ k1−α = 1−α g : D → R and g(s) ≤ M for all s ∈ D.N. . In § 4. like in the CLT. x N are IID draws p from some cdf F.5. 2013 . A nice example is confidence sets for the ecdf. if sups∈ D g(s) ≤ M. We can use this result to produce an asymptotic 1 − α confidence set for F. . J OHN S TACHURSKI October 10.5. This is the fundamental theorem of statistics (theorem 4. then sups∈ D g(s) ≤ M.4 Applying (5.11) Notice that.5. CONFIDENCE SETS 159 5.2. let k1−α := K −1 (1 − α). Kolmogorov used an extension of the central limit theorem to obtain an asymptotic distribution for this term. To do so. then g(s) ≤ M for all s ∈ D. let F be the set of all cdfs on R.3 A Nonparametric Example Confidence sets can be obtained in nonparametric settings too.2. we learned that if x1 . rearranging the expression. . In 1933. . then sups | FN (s) − F (s)| → 0. and FN is the corresponding ecdf.1). The code for producing (d) of figure 5. This is done in figure 5. . Y . equivalently. ylab = " " ) lines ( xgrid .seq ( -3 . col = " red " ) 5. See listing 6.numeric ( length ( xgrid ) ) for ( i in 1: length ( xgrid ) ) Y [ i ] <.5. This value was obtained numerically from the definition of K in (5. 2) # RVs from t . In the code. col = " blue " . .36 / sqrt ( samp _ size ) plot ( xgrid .400 xgrid <.95) was 1.3 for α = 0. HYPOTHESIS TESTS 160 Given our data x1 . and then search for an s satisfying K (s) = 1 − α.36.95 := K −1 (0. In (c) and (d).Y . we can present the con√ fidence set CN (x) visually by plotting the lower bound function F ( s ) − k / N N 1 − α √ and the the upper bound function FN (s) + k1−α / N .3 samp _ size <.3 is shown in listing 5. length = grid _ size ) FN <. x N and the corresponding ecdf FN . The root of f was found using the R univariate root-finding function uniroot.36 / sqrt ( samp _ size ) Y _ lower <.function (s .1000 grid _ size <. Listing 5 The source code for figure 5. .3 Hypothesis Tests [roadmap] J OHN S TACHURSKI October 10. The preceding theory tells us that realizations of the confidence set will catch the true function F about 95 times in 100. 3 . The technique used was to truncate the infinite sum in (5. or. Y _ lower ) lines ( xgrid . F is shown in red.11). X ) return ( mean ( X <= s ) ) # ecdf X <. .Y + 1.dist with 2 DF Y <.3. the true function F is not shown.FN ( xgrid [ i ] .11) at 20 to provide an approximation to K. xlab = " " .rt ( samp _ size . 2013 . you will see that the value we used for k1−α = k0. In figures (a) and (b).05. type = " l " . X ) Y _ upper <.1. 2) . pt ( xgrid . f (s) = 0 when f (s) := K (s) − 1 + α. The data is generated using an arbitrary distribution F (the t-distribtion with 2 degrees of freedom). Y _ upper ) lines ( xgrid . HYPOTHESIS TESTS 161 0.3: Confidence sets for the ecdf J OHN S TACHURSKI October 10.2 0.4 −3 −2 −1 0 1 2 3 0.5.6 0.8 −2 −1 0 1 2 3 (c) N = 200.6 0.8 0.4 0.6 0.2 −3 −2 −1 0 1 2 3 0. 2013 . true F shown in red (d) N = 1000.6 0.2 −3 0.8 0.2 −3 0.3. true F shown in red Figure 5.8 −2 −1 0 1 2 3 (a) N = 200 (b) N = 1000 0.4 0.4 0. but it is not.b + exp ( . our null hypothesis might be that the distribution of u belongs to a certain parametric class of distributions. In this sense. suppose we observe data pairs ( xn .(2 * i .5) print ( ur ) 5. which is a statement that the observed data is being generated by a certain model. Popper used the example of testing the theory that all swans are white.3.sqrt (2 * pi ) / s b <. On the other hand.uniroot (f . Karl Popper (1902–1994) was a major originator and proponent of this approach. where f is a specific function and u has a specific distribution.) It is futile to attempt to prove this theory correct: Regardless of how many white swans we find. attempting to falsify a theory (i. a hypothesis test is an attempt reject the null. or one of a certain class of models. lower =1. reject the null) is more constructive than attempting to confirm it.3. For example.1 + alpha ) ur <. Our null hypothesis might be that all pairs are generated by a particular functional relationship y = f ( x ) + u. Alternatively.0.5. or the set of 3rd order polynomials.2 . upper =1.1) ^2 * ( pi ^2 / (8 * s ^2) ) ) } return ( a * b ) } alpha <. or the twice differentiable functions.e. J OHN S TACHURSKI October 10. such as the increasing functions.function ( s ) return ( K ( s ) . To illustrate why we should focus on rejection. One might imagine that the standard procedure in statistics is to show the validity of the null hypothesis.0 for ( i in 1:20) { b <. Rather.function ( s ) { # Kolmogorov cdf a <. HYPOTHESIS TESTS 162 Listing 6 Finding the value of k1−α K <.05 f <.. and that the function f lies in some particular set of functions H.1 The Framework Hypothesis testing begins with the specification of a null hypothesis. yn ). (Consider this to be our null hypothesis. a single black swan can show the claim to be false. 2013 . no amount can ever confirm the claim that that all swans on planet earth are white. 5. We use the notation Pθ to refer to probabilities for x. However.3.5. at each stage taking correctness of the theory as the null hypothesis and attempting to reject. a test is a binary function φ mapping the observed data x into {0. φ(x) is to be considered as a random variable.3. given B ⊂ R N . texts identify tests with a rejection region. Note that. To understand how this convention came about. which is a subset R of R N . then the null hypothesis is called a simple hypothesis. This amounts to specifying a subset Θ0 of Θ. 5 Some J OHN S TACHURSKI October 10.2. 1}. we only reject when there is strong evidence against the null hypothesis. prior to implementation of the test. we let Pθ {x ∈ B} be the probability that x ∈ B given x ∼ Fθ .3. The null hypothesis is often written as H0 : θ ∈ Θ0 If Θ0 is a singleton. Formally. (R N is the set of N vectors—see chapter 2 for more. then do not reject H0 Remark 5.1. then the null hypothesis is called a composite hypothesis. This is a useful process of elimination. then reject H0 if φ(x) = 0. HYPOTHESIS TESTS 163 Although we attempt to reject the null hypothesis. suppose that we have a collection of theories about how the economy works. If not. and go on to the remaining theories. the distribution of which depends on the distribution of x and the function φ. then we take φ to be defined by φ(x) := 1{x ∈ R}. Note also that failing to reject H0 should not be confused with accepting H0 ! More on this below. A null hypothesis is a specification of the set of models { Mθ } that we believe generated x. A test of the null hypothesis amounts to a test of whether the observed data was generated by Mθ for some θ ∈ Θ0 . Conversely. This is equivalent to our formulation: If a rejection region R is specified.) The null is rejected if x ∈ R. we do so only if strong evidence against it is observed.2 Constructing Tests Let’s look at testing in the parametric setting of §5. If the theory is rejected then we can discard it. if φ is specified. For example. The decision rule is5 if φ(x) = 1. Hence. then we take R as equal to {s ∈ R N : φ(s) = 1}. we don’t want to mistakenly discard a good theory.1. We let Fθ be the joint distribution of the sample vector x when the data is generated by Mθ . 2013 . The procedure would then be to step through the theories. We have a parametric class of models { Mθ } indexed by parameter θ ∈ Θ ⊂ R. in the sense that it is a function of the data.12) If (5. then the test is said to be of size α. a common (but not universal) set up is to define a real-valued test statistic T and a critical value c. this is only for convenience. In practice. and β(θ ) = 1 when θ ∈ / Θ0 . and then adjust the test such that β(θ ) ≤ α for all θ ∈ Θ0 (5.01. This means that they do not depend on any unknown quantities. 2013 . HYPOTHESIS TESTS 164 Remark 5. the standard procedure is to choose a small number α such as 0. Then. Second. because we don’t want to discard good theories.5. it’s unlikely the null is true. if our test tells us to reject the null. we can mistakenly reject the null hypothesis when it is in fact true. it is traditional to keep the probability of type I error small. we tend to be conservative in rejecting the null. we can fail to reject the null hypothesis when it is false.3. Ideally. being random. In constructing tests. β(θ ) is the probability that the test rejects when the data is generated by Mθ .12) holds. This is called type I error. this is usually difficult to achieve. As discussed above. In the case of a test statistic.3. There are two different ways in which the realization can mislead us. it is common to allow the test statistic to depend on unknown parameters (a subset of Θ). and the rule becomes: Reject H0 if and only if T (x) > c minor point: A test statistic is a kind of statistic. Because of this.) 6A (5. Although we are using the language of parametric hypothesis testing. statistics are thought of as being computable given the data. we would like β(θ ) = 0 when θ ∈ Θ0 . This is called type II error. First. (Otherwise the test cannot be implemented.2. its realization can be misleading. We can also think of θ as an arbitrary index over a (possibly nonparametric) class of models. and then set6 φ ( x ) : = 1{ T ( x ) > c } The pair ( T .13) J OHN S TACHURSKI October 10. The power function associated with the test φ is the function β ( θ ) = Pθ { φ ( x ) = 1 } (θ ∈ Θ) In other words. The outcome of our test depends on the random sample x. Usually. with the caveat that the values of these unknown parameters are all pinned down by the null hypothesis. and.05 or 0. c) then defines the test. For this reason. if our hypothesis is a statement about the second moment of a random variable. As a result. choose a desired size α according to our tolerance for type I error. .15) is fixed at 10.13).3. if Φ is the cdf of the standard normal distribution on R and z ∼ Φ. . Suppose we “know” that x1 . J OHN S TACHURSKI October 10. we need to adjust c such that ( T . 1/ N ). The null hypothesis is θ ≤ 0. c) is of size α.3 Choosing Critical Values Let’s think a bit more about the test in (5.3. . making the event { x ¯ N > c} more likely. 2. 1) for some θ ∈ R.12) holds with equality.3. In performing the last step.14) Given c. we seek a suitable test statistic.7 5. a natural choice for our statistic is the sample mean. but the value of θ is unknown. the problem is to choose c to solve α = sup Pθ { T (x) > c} θ ∈ Θ0 7N (5. A plot of β is presented in c. because higher θ pushes up the mean ¯ N . then ¯ N ≤ c} = P{θ + N −1/2 z ≤ c} = P{z ≤ N 1/2 (c − θ )} = Φ[ N 1/2 (c − θ )] Pθ { x ∴ β(θ ) = 1 − Φ[ N 1/2 (c − θ )] (5. it’s common to choose c such that (5. the function is decreasing of x ¯ N > c} less likely. identify a suitable test statistic T . and 3. . Given θ .1. Typically. the standard procedure is to: 1. choose a critical value c so that ( T .5. c) attains the appropriate size. so that ¯N T ( x ) : = : T ( x1 . . .4 for two different values of c. the choice of T is suggested by the problem. 0]. then we might take T to be the sample second moment. . HYPOTHESIS TESTS IID 165 Example 5. x N ∼ N (θ . 2013 . Once T is fixed. In this case. the power function is increasing in θ . or Θ0 = (−∞. Since we want to make inference about the mean. Each c ∈ R now gives us a test via (5. To construct a test. because higher c makes the event { x in figure 5. Thus. with power function β(θ ) = Pθ { x ¯N ∼ We can obtain a clearer expression for the power function by observing that x N (θ .13). x N ) : = x ¯ N > c }. For example. . our task is to find the appropriate critical value c so that the test ( T .4: The power function β Figure 5.15).6 0. . Setting θ = 0 and solving for c. we’ve taken Θ0 to be the two element set {θ a . Example 5. Given α. θb }. .3. The blue line gives an imaginary distribution of T (x) when x ∼ Fθa .5 gives an illustration.8 −1 0 θ 1 2 Figure 5. .2 0. .4 0. In the figure.8 power 0. In R. The black line gives the same for θb .1. Let’s look again at example 5. HYPOTHESIS TESTS 166 1.2. where x1 . Applying the expression for the power function in (5.0 −2 0.5.3.3. Here. 2013 . Assuming that a value of α be prescribed. the next step is to determine the value of c such that (5. so the supremum is obtained by setting θ = 0. this becomes α = sup{1 − Φ[ N 1/2 (c − θ )]} θ ≤0 IID The right-hand side is increasing in θ . we choose c such that the largest of the two shaded areas is equal to α. and our null hypothesis is θ ≤ 0. J OHN S TACHURSKI October 10. x N ∼ N (θ .0 c = 0. c) is of size α. For example. we obtain c(α) := N −1/2 Φ−1 (1 − α) where Φ−1 is the quantile function of the standard normal distribution.2 c = 0. this function can be evaluated using qnorm.14). To solve for c given α we use (5. 1) for θ unknown.15) holds. represented as a density. 05 > qnorm (1 .1. Lower α means less tolerance for type I error. but may not reject at size 0.5: Determining the critical value > alpha = 0. 2013 . J OHN S TACHURSKI October 10. HYPOTHESIS TESTS 167 θa θb α cα Figure 5. then.644854 Since Φ−1 is increasing. we see that smaller α corresponds to larger c(α)—we reduce ¯ N must the probability of type I error by increasing the critical value that the mean x obtain for rejection to occur.3.alpha ) [1] 1.05 will also reject at size 0. is: What is the smallest value of α at which we can still reject a given test statistic? This value is called the p-value.5. Also. Hence.4 p-Values Typically. A natural question. the result of the test may switch from reject to accept. for a fixed value of the test statistic.01. and forces the critical value to become larger. a test that rejects at size 0.3. higher N brings down c(α): More data allows us to reduce the critical value without increasing the probability of rejecting a true null. 5. and c(α) is continuous in α. and in this case the expression for p(x) reduces to p(x) := the α such that c(α) = T (x) (5. With a bit of rearranging and an application of symmetry (page 17).2. this is the α at which the test switches from accept to reject.17) 5. To go down this path we need to switch to a notion of asymptotic size. To solve for p(x).16) Example 5.5 Asymptotic Tests As we saw in examples 5. Here c(α) := Φ−1 (1 − α/2). 1) : c(α) < T (x)} Roughly speaking.3. Typically α → c(α) is continuous. it may still be hard to work out the implied distribution of a given statistic T (x). we need to solve for α in the expression Φ−1 (1 − α/2) = |t N (x)|.18) N →∞ J OHN S TACHURSKI October 10.3. we know relatively little about the distribution of the test statistic. a test ( T . the central limit theorem and its consequences provide an elegant solution: Even if we don’t know the distribution of the test statistic.16). c(α)) of size α.15). We assume here that c(α) is determined via the relationship in (5. and can hopefully choose critical values to obtain an appropriate size. Fortunately.3. however. we have an idea of the power of the test. In many cases. Once we know the asymptotic distribution. the p-value of the test is defined as p(x) := inf{α ∈ (0. we may still be able to determine it’s asymptotic distribution via the CLT. a test is called asymptotically of size α if lim β N (θ ) ≤ α for all θ ∈ Θ0 (5.3. HYPOTHESIS TESTS 168 Let’s give a formal definition.1 and 5. Often we don’t want to make parametric assumptions about the distribution of the underlying data x. at least for large samples.3.3. then. And even if we make such assumptions. 2013 . constructing appropriate tests requires knowledge of the distribution of the test statistic. Recall the test (5.21). we obtain p(x) = 2Φ(−|t N (x)|) (5. for each α ∈ (0. We have a null hypothesis is H0 : θ ∈ Θ0 and. for many problems. Consider again the general parametric setting of §5.5.3. 1). In this setting. Writing β N instead of β to emphasize the fact that the power function typically depends on sample size. rather than finite sample size. so we can apply the definition of p(x) in (5. The histogram is of standardized daily returns on the Nikkei 225 index from January 1984 until May 2009. The data file can be downloaded at http://johnstachurski. then. then.2. Let α be given. and the null hypothesis is rejected.txt The standard normal density has been superimposed on the histogram in blue. as shown in §5. and let k1−α = K −1 (1 − α) be the 1 − α quantile of the Kolmogorov distribution K. This is a common observation for asset price returns. Here “standardized” means that we have subtracted the sample mean and divided by the sample deviation.36.3. 2013 .19).2. Let’s think about making this analysis more precise by constructing an asymptotic test. √ The value of the statistic N sups∈R | FN (s) − Φ(s)| produced by listing 7 is 5. by (5. let Φ be the standard normal cdf. By (5. Finally.3. The construction of the test essentially “inverts” the confidence set in §5. let FN be the ecdf of the data. The histogram suggests that the density of returns is more peaked and has heavier tails than the normal density. J OHN S TACHURSKI October 10.67. We can see that the fit is not particularly good. If α = 0. the critical value k1−α is 1.net/emet/nikkei_daily. and the test is asymptotically of size α. To begin. The code for producing the histogram is in the first part of listing 7. and let the null hypothesis be that standardized returns are IID draws from Φ. we have √ d (5. HYPOTHESIS TESTS 169 To illustrate how we might go about constructing a test which is asymptotically of size α. then the standardized returns would be approximately standard normal.19) N sup | FN (s) − Φ(s)| → K s∈ R For the test φN (x) := 1 √ N sup | FN (s) − Φ(s)| > k1−α s∈ R let β N (Φ) be the value of the power function when the null hypothesis is true.6. consider the data presented in figure 5.05.11).10) on page 159. we have lim β N (Φ) = lim P N →∞ √ N →∞ N sup | FN (s) − Φ(s)| > k1−α s∈ R =α Hence (5. The test statistic exceeds the critical value.3. If the null hypothesis is true.18) is verified.5. If the original returns were normally distributed. as defined in (5. 2 0. but it may be the IID assumption rather than the normality assumption that makes our null a poor fit to the data. Nikkei 225 Looking back on what we’ve done. there’s an obvious weakness in our approach. The test we’ve implemented can be modified to tackle the case of dependent data.3 0. it’s interesting to discuss the current state of hypothesis testing in macroeconomics in J OHN S TACHURSKI October 10.4.4.6: Standardized daily returns. 2013 . but such a discussion is beyond the scope of these notes.5 0.6 −4 −2 0 returns 2 4 6 Figure 5. Our null hypothesis was that standardized returns are IID draws from the standard normal density.4 Use and Misuse of Testing [roadmap] 5.5.1 Testing and Model Selection To to think about testing and model selection in the context of econometrics. USE AND MISUSE OF TESTING 170 0.1 0. The rejection suggests this null is false.0 −6 0. 5. See Negri and Nishiyama (2010) for one such test and many references.4 0. The results were disappointing. After all. Date ( rev ( nikkei $ Date ) ) close <. One of the first things the proponents of rational expectations did was to test these models against data. I already know that my model is wrong.mean ( returns ) ) / sd ( returns ) hist ( data . col = " blue " ) FN <. 2013 . The standard paradigm recognizes that all J OHN S TACHURSKI October 10.as .diff ( log ( close ) ) data <. xlim = c ( -6 .rev ( nikkei $ Close ) returns <. You are probably aware that in the 1970s. USE AND MISUSE OF TESTING 171 Listing 7 Testing normality nikkei <. length =400) lines ( xgrid . dnorm ( xgrid ) .5.function ( s ) return ( mean ( data <= s ) ) FN <. sep = " . header =T . prob =T .6) ) xgrid <. As a result.Vectorize ( FN ) m <. ylab = " " .pnorm ( xgrid ) ) ) print ( sqrt ( length ( data ) ) * m ) # ecdf particular. main = " " . Rejection of my model using standard inference tells me nothing new. But after about five years of doing likelihood ratio tests on rational expectations models.4.max ( abs ( FN ( xgrid ) .seq ( min ( data ) . many proponents of these “good” models have moved away from formal hypothesis testing in favor of so-called calibration. breaks =100 .” This line of argument runs contrary to the standard paradigm under which most of science and statistical testing takes place. 2005): My recollection is that Bob Lucas and Ed Prescott were initially very enthusiastic about rational expectations econometrics.( returns . Hence. The consensus of this school of though can be paraphrased as “No model is a true description of the real world. xlab = " returns " . txt " . I recall Bob Lucas and Ed Prescott telling me that those tests were rejecting too many good models. it simply involved imposing on ourselves the same high standards we had criticized the Keynesians for failing to live up to. max ( data ) . table ( " nikkei _ daily ." ) dates <.read . Thomas Sargent’s account runs as follows (Evans and Honkapohja. the rational expectations revolution ushered in a brand new class of macroeconomic models. 5. and.5. Ex. without ever being culled via rejection: Just make sure that the model has no testable implications. However.5 Exercises d Ex. because a model with no testable implications cannot be considered as a theory of anything. Assume that both θ and σ are unknown.1. thus enabling rejection. 5. Let x1 . Consider the statistic √ ¯ N − θ0 x t N := N (5.6) hold.5. 5. Those models that explain observed phenomena effectively. .5. Others are harder to reject. are most difficult to reject. For any given model a poor fit to the data can usually be found along some dimension.2 Accepting the Null? power and lack of power. 5. Let α ∈ (0. example: unit root tests. x N be an IID sample with mean θ and variance σ2 . 1) whenever H0 is true. In other words. 1) and set zα/2 := Φ−1 (1 − α/2). . as proposed by Karl Popper. Show that if t N (x) → N (0. it should be acknowledged that social science is not physics. is that the theory has one or more testable implications.20) sN given this description of model selection. failing to reject doesn’t mean the null is right.5) and (5. We wish to test the hypothesis H0 : θ = θ0 .8 On the other hand. None of these are easy questions. But perhaps some of these models are still useful in guiding our thinking. and TN (x) = |t N (x)|. Ex. J OHN S TACHURSKI October 10.2.5. . The standard definition of a “scientific” theory. .4. 8 Incidentally. EXERCISES 172 models are wrong. at the same time. Some models are easily rejected. Show that (5. this strategy does not work. The overall outcome resembles survival of the fittest. 2013 .7) is valid when (5. 5. and perhaps they represent the best models we have for now. example: diagnostic tests for regression. the sequence of tests φN (x) := 1{ TN (x) > zα/2 } is asymptotically of size α. it might appear there is a way for us to produce a model that survives forever in the pool of currently acceptable theories. the theory can be falsified.3. will survive.5. we try to cull old ones that perform poorly and produce new ones that perform better. The process of culling takes place by statistical rejection. and our models tend to be more imperfect. Given that all models are wrong. Ex. . .5.5. plot the chi-squared cdf with 2 degrees of freedom.5. where observations is a vector storing the sample x1 . . . x N be N random variables. replacing the expresion x ¯ N − θ ) + ( θ − θ0 ).5. . p J over values 1. x N taking the value j. p) should return the value X in equation (5. p J . rather than exact.22) is asymptotically chi-squared with J − 1 degrees of freedom. 3. let x1 . x20 from this pmf. . show that the sequence of tests φN (x) := 1{|t N | > zα/2 } is asymptotically of size α. . .5. 2. . 11 In particular. . . Show that the test in exercise 5.11 generate 5. .6. By repeatedly simulating 20 IID observations x1 . and N = 20 is not a large sample. . Try the add and subtract strategy. n n j 12 You can improve the fit further by taking N larger.22) where q j is the fraction of the sample x1 . 5. .9 Ex. . . The reason is that the fit is only asymptotic.2. 2013 .12 In essence. Do not use this built-in function in your solution.2.21) ( q j − p j )2 X := N ∑ pj j =1 J (5.10 Ex.22). .6 (Computational). then the rejection probability will converge to one as N → ∞ whenever H0 is false: β N (θ ) → 1 as N → ∞ whenever θ ∈ Θ1 Such a test is said to be consistent. . x N is IID with P{ xn = j} = p j for n ∈ {1. . The test statistic is given by (5. J }. More precisely. N } and j ∈ {1. draw each x such that P{ x = j } = p for j = 1.2 and p3 = 0. . . . The null hypothesis of the test is that the sample x1 .4. 5. This exercise continues on from exercise 5. The function call chsqts(observations. The functions should be close. . x N .test for implementing the chi-squared goodness of fit test. J . . The chi-squared goodness of fit test is used to test whether or not a given data set x1 . 9 Hint: J OHN S TACHURSKI October 10. .5.5. . . If a test is well designed. p2 = 0. .3 is consistent whenever α ∈ (0. Plot the ecdf of obsX. . . . each xn taking integer values between 1 and J . .5 (Computational). you need to show that the absolute value of the test statistic |t N | in (5. . Write a function in R called chsqts that takes two arguments observations and p.20) with ( x 10 R has a built-in function called chisq. and p is a vector storing the values p1 . In the same figure. and store them in a vector called obsX. . 5. . in (5. . described by a pmf p1 . EXERCISES 173 With reference to exercise 5. Under the null hypothesis. . Let J = 3 with p1 = 0. . Let N = 20. . x N was generated from a particular discrete distribution. .000 independent observations of the statistic X . 1).5.20) ¯ N − θ0 diverges to infinity when θ = θ0 . .5. . . X in (5. . v(θ )) and N se(θ v(θ ).2.4.5. and the exercise is done. θ0 = θ ). observe that ˆN − θ θ = ξ N ηN ˆN ) se(θ √ where ξ N := ˆN − θ ) N (θ v(θ ) d and η N := √ v(θ ) ˆN ) N se(θ p Using the various rules for converges in probabilty and distribution (check them carefully—see facts 1.5. Fix α ∈ (0.1 Solutions to Selected Exercises ˆN − θ )/ se(θ ˆN ) converges in disSolution to Exercise 5.5. In view of (1.4 on page 33 implies that |t N (x)| → |z|. As a result. 1) and η N → 1. we have P{|z| > zα/2 } = α.18). J OHN S TACHURSKI October 10.4. Appealing to (1. Solution to Exercise 5. Since g(s) := |s| is continuous. It follows that exercise 5. In §4.2 we showed that the sample standard deviation s N defined in (4. 2013 .5. d Solution to Exercise 5. Applying Slutsky’s theorem (page 34) now gives ξ N η N → Z. In particular.5 on page 34. If H0 is true. fact 1. we can state that for zα/2 := Φ−1 (1 − α/2).2) is consistent for the standard deviation σ. EXERCISES 174 5.. We aim to show that (θ √ √ p d ˆN − θ ) → ˆN ) → tribution to a standard normal when N (θ N (0.e.5.3.4.2. To see this.2 can be applied. we have N →∞ d d lim β N (θ ) = lim Pθ {|t N (x)| > zα/2 } = P{|z| > zα/2 } = α N →∞ This confirms (5. as was to be shown. the test is asymptotically of size α. 1).1 and 1. we can see that t N := √ N ¯ N − θ0 x sN = σ√ N sN ¯ N − θ0 x σ is asymptotically standard normal whenever H0 is true (i.4. 1) and let z ∼ N (0.1.30) on page 36 and fact 1.5. then t N (x) → z by assumption.4) we obtain ξ N → Z ∼ N (0.14) on page 21.5. 1) .1 Least Squares [roadmap] 6. our problem for this chapter is to choose a function f : RK → R such that f (x) is a good predictor of y Throughout this chapter. so our expected loss (risk) from given function f is R( f ) := E [(y − f (x))2 ] 175 (6.1. In particular. Suppose we repeatedly observe a vector input x to a system. .Chapter 6 Linear Least Squares [roadmap] 6. and x takes values in RK . . Our aim is to predict new output values from observed input values. We assume that the inputoutput pairs (x1 . y1 ). (x N . This distribution is unknown to us. Both are random. followed by a scalar output y.1 Multivariate Least Squares and ERM Let’s consider a multivariate version of the regression problem we studied in §4. we will be using quadratic loss as our measure of “goodness”. .6. y N ) we observe are all draws from some common joint distribution on RK +1 . . 2) can hold when the input-output pairs form a time series (each n is a point in time). we solve N ˆ ( f ) where R ˆ ( f ) := 1 ∑ (yn − f (xn ))2 min R N n =1 f ∈F This is the general least squares problem.2) is a minimal requirement to justify the approach that we are taking. which is what we actually want to minimize. When we presented the theory of empirical risk minimization in §4. LEAST SQUARES 176 As we learned in chapter 3. provided that this correlation dies out sufficiently quickly over time. we will use the principle of empirical risk minimization instead. (x N . Thus.6.1. and in this chapter we consider the case F = L.2) can hold under much weaker assumptions than independence. we cannot compute this conditional expectation.6. If this is true. ˆ ( f ) := 1 R N n =1 ∑ (yn − f (xn ))2 → E [(y − f (x))2 ] = R( f ) N p (6. and correlation between the input-output pairs is present. y1 ). . we will usually be able to make the empirical risk R small by choosing a function f such that yn − f (xn ) is very small for all n. In other words. The function f is chosen from a set of candidate functions F mapping RK into R. . y N ) were independent of each other. and. replacing the risk function with the empirical risk before minimizing.6. (6. For now. the minimizer of the risk is f ∗ (x) = E [y | x].1 on page 72). An extensive discussion of this ideas is given in chapter 8. If we take F to be the set of all functions ˆ ( f ) arbitrarily from RK to R.3) J OHN S TACHURSKI October 10. the scalar random variables (yn − f (xn ))2 are also independent (fact 2. L := { all functions : RK → R. as we discussed extensively in §4. then. Since the underlying distribution is not known. where L is the linear functions from RK to R. Instead. Let us note at this stage that (6. F is called the hypothesis space. . by the law of large numbers.2) This gives a fundamental justification for the empirical risk minimization principle: for large N . F must be restricted. As before. 2013 .4. for fixed f . . the empirical risk and the true risk are close. Just keep in mind that validity of (6. For example. where (x) = b x for some b ∈ RK } (6. Let’s now turn to the hypothesis space F . That is. we assumed that the input-output pairs (x1 . However. this is not the same think as making the risk small.2. we will not make any particular assumptions.  . the technique has a very natural extension from L to a very broad class of functions. In fact. the idea of choosing b to minimize (6. 2013 . let     y1 x n1  y2   x n2      y :=  .2 The Least Squares Estimator The next step is to solve (6. This is an excellent question. colk (X) is all observations of the k-th regressor. ··· ··· x1K x2K .1. . more simply. .2 below.2). Although our presentation is rather modern.) This is the multivariate version of (4.1. . It dates back at least as far as Carl Gauss’s work on the orbital position of Celes. 6. . setting F = L allows us to obtain an analytical expression for the minimizer. One might well ask whether the choice F = L is suitable for most problems we encounter. as described in §6. It may not be. . .  yN and    X :=   x1 x2 . However.1. . which is usually satisfied in applications unless you’re doing something silly. J OHN S TACHURSKI October 10. xn :=  . armed with our knowledge of overdetermined systems (see §3. published in 1801. Moreover.5) .4). xN xnK     :=:       x11 x21 . (6.  . This will be more obvious after we switch to matrix notation.6) x N1 x N2 · · · We will use the notational convention colk (X) := the k-th column of X.   . or.  = n-th observation of all regressors (6. x12 x22 . In other words. The derivation is in §6.4) is intuitive. and this optimization problem has a very long tradition.1. Throughout this chapter. we already have all the necessary tools. .2. x NK      (6. . we will always maintain the following assumption on X. LEAST SQUARES 177 N ∈ L ∑ n =1 ( y n The problem we need to solve is then min N − (xn ))2 . which greatly simplifies computations.25) on page 141.4) ( y n − b x n )2 ∑ b ∈R n =1 min K (The constant 1/ N has been dropped since it does not affect the minimizer.6. To do this. you will be able to verify that n =1 ∑ ( y n − b x n )2 = 2 N y − Xb 2 Moreover.  xN b b xN Regarding the objective function in (6. and β become an estimator of an unknown parameter vector β. The significance is that we already know now to solve for the minimizer on the right-hand side of (6.  =  .8) ˆ is called the least squares estimator. we have argmin y − Xb b∈ R = argmin y − Xb b∈ K R (6.2). we have     x1 b b x1  x b   b x2   2    Xb =  . or the OLS Traditionally. since ˆ is an estimator of anything in particular. (Right now.7) is a minimizer of (6. Note that. In chapwe haven’t really assumed that β ˆ will ter 7 we will add some parametric structure to the underlying model. this random vector β estimator.1 (page 91).4) into matrix form. . any solution to the right-hand side of (6.4) and vice versa.6. 2013 .   . this terminology doesn’t fit well with our presentation. with a little bit of effort. and called the vector of fitted values: ˆ = Py y ˆ := X β J OHN S TACHURSKI October 10. . as defined on page 92.1.7). LEAST SQUARES 178 Assumption 6. First.1.1. the solution is ˆ : = ( X X ) −1 X y β (6. let P and M be the projection matrix and annihilator ˆ is often denoted y associated with X.4).7) K In summary.1. Let’s transform the minimization problem (6. for any b ∈ RK .   .2.3 Standard Notation There’s a fair bit of notation associated with linear least squares estimation. since increasing transforms don’t affect minimizers (see §13. By theorem 3. Let’s try to collect it in one place. The vector Py = X β ˆ. N > K and the matrix X is full column rank.) 6. y can be decomposed into two orthogonal vectors Py and My. := Py 2 . 1) (6.6. the choice of F is crucial.10) 6. As we saw in §4. SSR ESS • Sum of squared residuals :=: • Explained sum of squares :=: := My 2 . In that section. ˆ .11) October 10. TRANSFORMATIONS OF THE DATA 179 ˆ associated with OLS estimate and the ˆ n is the prediction xn β The n-th fitted value y n-th observation xn of the input vector.2. Related to the fitted values and residuals.1. By (6.2. and the second represents the error.1.1 Basis Functions Let’s now revisit the decision to set F = L made in §6. we have some standard definitions: • Total sum of squares :=: TSS := y 2 .1. From theorem 3. we considered data generated by the model y = cos(π x ) + u J OHN S TACHURSKI where u ∼ N (0.4 on page 89 we have My ⊥ Py and y = Py + My (6.6. where the first represents the best approximation to y in rng(X). 2013 .2.9) In other words.9) and the Pythagorean law (page 84) we have the following fundamental relation: TSS = ESS + SSR (6.2 Transformations of the Data [roadmap] 6. and called the vector of residuals: The vector My is often denoted u ˆ := My = y − y u ˆ The vector of residuals corresponds to the error that occurs when y is approximated by its orthogonal projection Py. the individual functions φ1 . we would also like F to be small. .12) October 10. . the minimizer of the risk is the conditional expectation of y given x. deleveraging-induced debt deflation. From such theory or perhaps from more primitive intuition. dynamic network effects. We saw that if F is too small. φ J mapping RK into R are referred to as basis functions. TRANSFORMATIONS OF THE DATA 180 We imagined that this model was unknown to us. .21 on page 146). . The action of φ on x ∈ RK is   φ1 (x)  φ2 (x)    x → φ(x) =  . it is easy to extend our previous analysis to a broad class of nonlinear functions.). where is a linear function from R J to R} The empirical risk minimization problem is then min J OHN S TACHURSKI n =1 ∑ {yn − N (φ(xn ))}2 = min ∑ (yn − γ φ(xn ))2 γ ∈R J n =1 N (6. etc.  φ J (x) In this context. in our quadratic loss setting.18 on page 145). As we learned in §3. This is called underfitting. we first transform the data using some arbitrary nonlinear function φ : RK → R J . Now we apply linear least squares estimation to the transformed data. For example. we choosing the hypothesis space to be Fφ := {all functions ◦ φ. . causing overfitting. many macroeconomic phenomena have a distinctly self-reinforcing flavor (poverty traps. and both the empirical risk and the risk are large (figure 4. we may suspect that. Fortunately. the conditional expectation E [y | x] is nonlinear. and self-reinforcing dynamics are inherently nonlinear. Conversely. it would be best if F contained the conditional expectation. To do so.3.6. As our information is limited by the quantity of data. if F is too big. for the problem at hand.2. then no function in F provides a good fit to the model. Of course choosing F in this way is not easy since E [y | x] is unknown. 2013 . then the empirical risk can be made small. The ideal case is where theory guides us. This seems to suggest that setting F = L is probably not appropriate. and attempted to minimize risk (expected quadratic loss) by minimizing empirical risk with different choices of hypothesis space F . Formally. Since our goal is to make the risk small.  ∈ R J  . We are paying too much attention to one particular data set. so that the minimizer of the empirical risk has to be “close” to the risk minimizer E [y | x]. but this risk itself is large (figure 4. providing information about E [y | x]. 1. . φ J (x N )     ∈ RN× J  In transforming (6. In most of this chapter. J −1 xn    =  j =1 ∑ γ j xn J j −1 (6.6. the solution is γ ˆ : = ( Φ Φ ) −1 Φ y Example 6. as previously discussed in §4.12) becomes argmin y − Φγ (6. .2. no loss of generality is entailed. .2.2. TRANSFORMATIONS OF THE DATA 181 Switching to matrix notation. this means that if we take J large enough.6.14) This case corresponds to univariate polynomial regression. the objective function can be expressed as n =1 ∑ ( y n − γ φ n )2 = N y − Φγ 2 Once again. However. Similarly. and xn is scalar valued. φN       =   φ1 (x1 ) · · · φ1 (x2 ) · · · . as we can just imagine that the data has already been transformed.12) into matrix form. for any given continuous function f on a closed interval of R. we return to the elementary case of regressing y on x. and x is ˆ to the result. On an intuitive level. rather than on some nonlinear transformation φ(x). . .13) γ∈ RJ Assuming that Φ is full column rank. we’ll use X to denote the data matrix instead of Φ. Suppose that K = 1. so that    γ φ( xn ) = γ φn = γ   0 xn 1 xn . and β denote the least squares estimator (X X)−1 X y.14) is capable of representing pretty much any (one-dimensional) nonlinear relationship we want. and the problem (6. increasing functions don’t affect minimizers. There is a theorem proved by Karl Weierstrass in 1885 which states that. 2013 . Consider the mononomial basis functions φj ( x ) = x j−1 . the relationship in (6. . J OHN S TACHURSKI October 10. . let  φn := φ(xn ) and   Φ :=   φ1 φ2 . ··· φ1 (x N ) · · · φ J ( x1 ) φ J ( x2 ) . there exists a polynomial function g such that g is arbitrarily close to f . . . To see why this is the case. TRANSFORMATIONS OF THE DATA 182 6. because we now have 1 N 1 Remember      1 1 1 1 ˆn = 1 y ˆ = 1 Py = 1 y = ∑y N N N N n =1 N n =1 ∑ yn N that rng(X) is the span of the columns of X. To add an intercept. since 1 is a column vector of X. 1 My = 1 y − (P1) y = 1 y − 1 y = 0 In other words. 2013 . adding an intercept means fitting an extra parameter. and each of the remaining columns of X contains N observations on one of the non-constant regressors. the vector of residuals sums to zero.4. Moreover. as exercise 6. J OHN S TACHURSKI October 10. then the mean of the fitted values y ˆ = Py is equal to the mean of y. and this extra degree of freedom allows a more flexible fit in our regression. and clearly each column is in the span. (As mentioned at the end of the last section. we use the transformation  φ(x) = 1 x   =  1 x1 .2.2 The Intercept There’s one special transformation that is worth treating in more detail: Adding an intercept to our regression. xK We’ll call the resulting matrix X instead of Φ. This follows from the previous argument.) In practice. we have P1 = 1 whenever 1 ∈ rng(X).6. It’s also the case that if the regression contains the intercept. we’re going to use X to denote the data matrix from now on.5 asks you to confirm.1 Therefore.4.2. One consequence of adding an intercept is that the vector of residuals must sum to zero. Clearly 1 ∈ rng(X) holds. even though the data it contains may have been subject to some transformations.10). What this means in the present case is that the first column of X is now 1. . observe that 1 My = 1 (I − P)y = 1 y − 1 Py = 1 y − (P 1) y = 1 y − (P1) y where the last equality uses the fact the P is symmetric (exercise 3. The nontrivial inequality R2 ≤ 1 follows from the fact that for P and any point y. If R2 = 1. Let y be 2 given. As many people have noted. we always have Py ≤ y .7 asks you to verify.6. we must have Py = y and y ∈ rng(X). 2013 . there are all sorts of problems with viewing R2 in this way. as we add regressors we increase the column space of X.e.. The closer is y to the subspace rng(X) that P projects onto. goodness of fit refers to the in-sample fit of the model to the data (i. 6. consider two groups of regressors. expanding it out towards y. the larger the column space X.3. In this section we cover some of the issues. X a and Xb . Intuitively. To see this more formally. It is usually represented by the symbol R2 (read “R squared”). a so-called perfect fit.3 Goodness of Fit In traditional econometrics. and let R2 a and Rb be the respective R squareds: R2 a := J OHN S TACHURSKI Pa y 2 y 2 and R2 b := Pb y 2 y 2 October 10. as exercise 6.3 (page 86).4. and hence increasings R2 . GOODNESS OF FIT 183 6.3. R2 has often been viewed as a one-number summary of the success of a regression model. the closer Py / y will be to one. Let P a and Pb be the projection matrices corresponding to X a and Xb respectively. then. An extreme case is when R2 = 1. the better we can approximate a given vector y with an element of that column space.4.1 R Squared and Empirical Risk One issue that arises when we equate high R2 with successful regression is that we can usually make R2 arbitrarily close to one in a way that nobody would consider good science: By putting as many regressors as we can think of into our regression. We assume that Xb is larger.1. Put differently. The most common measure of goodness of fit is the coefficient of determination. Historically. in the sense that every column of X a is also a column of Xb . as opposed to the potentially more important question of how well the model would predict new y from new x). how well the model fits the observed data. and the geometric intuition can be seen in figure 3. and defined as ESS Py 2 R2 : = = TSS y 2 For any X and y we have 0 ≤ R2 ≤ 1. This was discussed in theorem 3. Suppose that xn and yn are draws from a uniform distribution on [0. J OHN S TACHURSKI October 10. We will draw xn and yn independently. R2 = 1 and we have a perfect fit. 2013 . we obtain R2 a = 2 Rb Pa y Pb y 2 = P a Pb y Pb y 2 = P a yb yb 2 ≤1 2 where the final inequality follows from theorem 3.11). There is a close connection between R2 and empirical risk. Such an outcome was presented in figure 4. we can make the empirical risk arbitrarily small. Since SSR = My 2 = n =1 ˆ x n )2 ∑ (yn − β N we see that SSR is proportional to the empirical risk ˆ(f) = 1 R N n =1 ∑ (yn − f (xn ))2 N ˆ x.1. the latter being unobservable. Moreover. when we use empirical risk minimization. Empirical risk is just proxy for risk. (See (4. If we can drive the empirical risk to zero. then SSR = 0. We can understand what is happening by recalling our discussion of empirical risk minimization in §4. Let’s look at this phenomenon from a more statistical perspective. 1]. As we discussed. Hence R2 b ≥ Ra . Thus.2. It then follows from fact 3.6. and regressing with Xb produces (weakly) larger R2 . In fact. the increase in R2 is due to a fall in empirical of our fitted function f (x) = β risk. In particular.2 on page 88 that P a Pb y = P a y. then the increase in R2 occurs because SSR is falling. we change SSR = My 2 while leaving other terms constant.) The R2 of a regression of y on X can be written as R2 = 1 − SSR / y 2 . but this does not guarantee small risk.1.4.17 on page 143. GOODNESS OF FIT 184 Since Xb is larger than X a . it follows that the column space of X a is contained in the column space of Xb (exercise 6.6. this excess of attention to the data set we have in hand often causes the risk to explode. Let’s look at a small simulation that illustrates the idea. and setting yb := Pb y.3 on page 86.3.21) on page 140 for the definition of the latter. Using this fact. our true goal is to minimize risk. rather than empirical risk. if we raise R2 by including more regressors. if we allow ourselves unlimited flexibility in fitting functional relationships. If we alter the regressors. . We will fit the polynomial β 1 + β 2 x + β 3 x2 + · · · β K x K −1 to the data for successively larger K.0 G 0. the polynomial minimizes empirical risk by getting close to the sample points in this particular draw of observations.6 G G G G G y 0. Increasing K to 25. which is 0.6.4 G G G G G G G G 0. by using the so-called centered R2 in place of R2 .0 Figure 6. even though we know there is actually no relationship whatsoever between x and y. we will regress y on X = ( xn k.3. This reduces SSR and increases the R2 .n where n = 1. Since x and y are independent. .95. . Nevertheless. for this regression.) By this measure. the R2 was 0. A simulation and plot is shown in figure 6.3. .2 0. N and k = 1.4 x 0. In other k −1 ) words.6 0. J OHN S TACHURSKI October 10. the regression is very successful. Incrementing K by one corresponds to one additional regressor. GOODNESS OF FIT 185 G G G 0. 2013 . . . (The code is given in listing 8. .8 1.1: Minimizing empirical risk on independent data so there is no relationship at all between these two variables. Here K = 8. I obtained an R2 of 0. the best guess of y given x is just the mean of y. This problem is easily rectified. so we are fitting a relatively flexible polynomial.5.1. 6.2 Centered R squared Another issue with R2 is that it is not invariant to certain kinds of changes of units.87.2 G G G G G 0.8 G G G 0. . K . Indeed. cbind (X . kilometers versus miles. R2 is not invariant to changes of units that involve addition or subtraction (actual inflation versus inflation in excess of a certain level. we can make the R squared as large as we like. etc. etc. " R ^2 = " .3.25 y <. values ^2) y2 <. Py2 / y2 . we find that the R squared converges to one. just by a change of units. seed (1234) N <.sum ( y ^2) cat ( " K = " . K . Taking the limit as α → ∞. income versus income over a certain threshold.rep (1 . " \ n " ) } The main ideas are presented below. First. 2013 . when the regression contains an intercept. N ) KMAX <. many economists and statisticians use the centered R squared rather than the R squared. In other words. To see this.runif ( N ) X <. GOODNESS OF FIT 186 Listing 8 Generating the sequence of R2 set . The R2 in the latter case is P(y + α1) y + α1 2 2 = Py + αP1 2 Py + α1 2 α2 Py/α + 1 2 Py/α + 1 2 = = = y + α1 2 y + α1 2 α2 y / α + 1 2 y/α + 1 2 where the second inequality follows from the fact that 1 ∈ rng(X).lm ( y ~ 0 + X ) Py2 <.25 for ( K in 1: KMAX ) { X <. x ^ K ) results <. at least when the regression contains an intercept. R2 is invariant to changes of units that involve rescaling of the regressand y (dollars versus cents.runif ( N ) x <. For the J OHN S TACHURSKI October 10. where α ∈ R.6. let’s compare the R2 associated with y with that associated with y + α1.).) because if y is scaled by α ∈ R. then αPy 2 α2 Py 2 Py 2 Pαy 2 = = = αy 2 αy 2 α2 y 2 y 2 On the other hand.sum ( results $ fitted . For this reason. As has been observed on many occasions. let’s assume that this is the case (or.13) as R2 c N ˆn − y ¯ )2 (y ∑n = N=1 ¯ )2 ∑ n =1 ( y n − y (6. the fitted values Py and the residuals My.3. because the notion that the regressors “explain” variation in y suggests causation. centered R squared can be thought of as a measure of correlation. The idea is that regression amounts to decomposing y into the sum of two orthogonal parts. M c : = I − 1 ( 1 1 ) −1 1 = I − The centered R squared can be re-written (exercise 6.” The value of R2 is then claimed to be the fraction of the variation in y “explained” by the regressors.2). correlation and causation are not the same thing. 2013 . that 1 ∈ rng(X)). R2 is better thought of as a measure of correlation (see §6.6.4.3.10).17) It is a further exercise (exercise 6. 6. Thus. in the case of the simple regression. Some informal examples are as follows: J OHN S TACHURSKI October 10.4. The equality of the two expressions for R2 c is left as an exercise (exercise 6. as in (6.12). as defined in (4.4. Hopefully it is clear that adding a constant to each element of y will have no effect on R2 c.16) N is the annihilator associated with 1. more generally. correlation should not be confused with causation.16) to show that. By the Pythagorean theorem. As discussed below. This is a poor choice of terminology. This is sometimes paraphrased as “the total variation in y is the sum of explained and unexplained variation. GOODNESS OF FIT 187 purposes of this section.3. Instead. the squared norm y 2 =: TSS of y can then be represented Py 2 =: ESS plus My 2 =: SSR.4) on page 113. The centered R squared is defined as R2 c := where PMc y 2 Mc Py 2 = Mc y 2 Mc y 2 (6.3 A Note on Causality You have probably heard R2 interpreted as measuring the “explanatory power” of the regressors in a particular regression. and R2 says nothing about causation per se. the centered R squared is equal to the square of the sample correlation between the regressor and regressand.15) 1 11 (6. + X( β 2 ˆ is the minimizer of y − Xb 2. This does not imply that ambulances cause crashes.2 2 You 2 is also a solution can use the ideas on optimization in the appendix. but provide your own careful argument. . . 6. Does that mean fitting ABS to a given motorcycle will reduce the probability that bike is involved in an accident? Perhaps. One possible explanation is that that wearing shoes to bed causes headaches. Verify that ∑n =1 ( yn − b xn ) = y − Xb . .1. are they more likely to do so? Try asking your national research body to fund that experiment.4. • Suppose we observe that people sleeping with their shoes on often wake up with headaches. . EXERCISES 188 • We often see car crashes and ambulances together (correlation). 6. but another likely explanation is selection bias in the sample— cautious motorcyclists choose bikes with ABS. (If we stand someone on a bridge and tell them to jump. argue that β vectors b. and vice versa. This is especially so in the social sciences. while crazy motorcyclists don’t. • It has been observed that motorcycles fitted with ABS are less likely to be involved in accidents than those without ABS. Argue that the sample mean of a random sample y1 .) An excellent starting point for learning more is Freedman (2009).4. 6. where properly controlled experiments are often costly or impossible to implement. Using this equality. way: Let b be any K × 1 vector. A more likely explanation is that both phenomena are caused by too many pints at the pub the night before. Identifying causality in statistical studies can be an extremely difficult problem. Make sure you use the definition of a minimizer in your argument.4.3.4. 2013 . and let β 1.4. Show carefully that any solution to minb∈RK y − Xb to minb∈RK y − Xb .4. J OHN S TACHURSKI October 10. N 2 2 Ex. 6.6. Show that y − Xb 2 ˆ = y − Xβ 2 ˆ − b) 2 .2. y N from a given distribution F can be viewed as a least squares estimator of the mean of F. over all K × 1 Ex. 6. ˆ solves the least squares problem in a slightly different Ex. Let’s show that β ˆ : = ( X X ) −1 X y .4 Exercises Ex. 3 Ex. 6. .4. Show that the coefficient of determination R2 is invariant to a rescaling of the regressors (where all elements of the data matrix X are scaled by the same constant).4.4. Ex.14.4. Explain why P1 = 1 whenever 1 ∈ rng(X). Ex. 6.10. Let y ˆ. J OHN S TACHURSKI October 10.16).9. .3.4. Suppose the regression contains an intercept.4. y N ) be sequences of scalar random ˆ between x and y (defined in (4.4. Ex. Suppose that every column of X a is also a column of Xb .4.1. Confirm the equality of the two alternative expressions for R2 c in (6.17). R2 c is equal to the square of the sample correlation between x and y. 6.7. Using this fact (and not the orthogonal projection theorem).15. 6. in the case of the simple regression model (see §7. so that the first column ¯ be the sample mean of y.4.12. Show that the sample correlation ρ page 113) can be written as (Mc x) (Mc y) ˆ= ρ Mc x Mc y Ex. . and let x ¯ be a 1 × K row vector such that of X is 1.5 on page 90 and the Pythagorean law (page 84). 6.4) on variables. then Py = y and y ∈ rng(X). . 6. . Let X a and Xb be N × Ka and N × Kb respectively. Show that y ¯β the k-th element of x Ex.4. Show that My 2 = Mc y 2 − PMc y 2 (6. Ex. 6. 2013 . EXERCISES 189 Ex.4. Show that PM = MP = 0.18) always holds.4.4. ¯=x ¯ is the sample mean of the k-th column of X. Show that if R2 = 1. 6. (Quite hard) Show that. Show that if R2 = 1. show that the vector of fitted values and the vector of residuals are orthogonal.15). then every element of the vector of residuals is zero.13. 6.8.16. Verify the expression for R2 c in (6.6. 3 Hint: 4 Hints: Refresh your memory of theorem 3. 6. .3). Let Mc be as defined in (6. .4 Ex. Suppose that the regression contains an intercept. See fact 3.1.11. Ex. Let x := ( x1 . . Show that rng(X a ) ⊂ rng(Xb ). x N ) and y := (y1 .3 on page 86. Ex. Ex. 6.6. 6.5. 4. The resulting matrix should have all entries very close to (but not exactly) zero. The inner product should be very close to (but not exactly) zero.) Let N = 20 and let x be an evenly spaced grid on [0. In each round of the loop.19 (Computational). start with while(TRUE).14) for successively higher and higher values of K.6 where each iteration. EXERCISES 190 Ex. . Build an arbitrary data set X. 6 To set up an infinite loop. and record the values of the estimated coefficients of the non-constant (i. y N ). The OLS estimate of µ is ˆ : = ( 1 1 ) −1 1 y = µ 1 1 1y= N N ¯N ∑ yn = y N n =1 Reading right to left. calculate the regression coefficients using first lm and then the direct method (computing (X X)−1 via the function solve). (Example 6. . use control-C.17 (Computational). Test the orthogonality of My and Py. 2013 .. x N ) and y = (y1 .R] Ex. 6. the inversion routine involves calculating many very small numbers. Ex. . In §12. 6. print the current degree of the polynomial and the coefficients from the two methods.5 To see this in action. . 1). calculate the coefficients of (6. Construct M and calculate MX.4. 6.1 explains how to estimate the coefficients using multiple regression. y by simulation. Generate y as yn = xn + N (0. 6. Letting µ := E [yn ] we can write yn = µ + un when un := yn − µ.6. .1 Solutions to Selected Exercises Solution to Exercise 6. where each regression uses the same set of observations x = ( x1 .20 (Computational). y = 1µ + u. .4.2. Confirm that these values are equal to the estimated coefficients of the no-intercept regression after all variables have been centered around their mean. Run a regression with the intercept.1. Since the amount of memory allocated to storing each of these numbers is fixed. we discussed how the direct method of comˆ = (X X)−1 X y may be problematic in puting the OLS estimate via the expression β some settings. .4.18 (Computational).4. Ex. 6. . The difficulty is inverting the matrix (X X)−1 when it is large and almost singular. 5 In J OHN S TACHURSKI October 10. k ≥ 2) regressors.6. To be written: [R squared increases–see file rsquared. You should find that the direct method fails first. the result of the calculations may be imprecise. 1]. this setting. Build an arbitrary data matrix X by simulation.4.e. To exist from an infinite loop running in the terminal. the sample mean of y is the OLS estimate of the mean. Set the program running in an infinite loop.4.4. In other words. 1.9. we have My 2 = 0.7. This follows from the definition of β ˆ. Let β b∈ min y − Xb RK 2 which is to say that for any b ∈ RK √ √ If a and b and nonnegative constants with a ≤ b. then. My = 0.2. Part 2 follows immediately from part 1.4. and hence ˆ y − Xβ 2 ≤ y − Xb 2 ˆ ≤ y − Xb y − Xβ ˆ is a solution to In other words. If R2 = 1.3. Solution to Exercise 6.4. Solution to Exercise 6. But then y ∈ rng(X) by 5 of theorem 3. will be confirmed if y − X β because for arbitrary a ∈ RK we have ˆ ) = a ( X y − X X ( X X ) −1 X y ) = a ( X y − X y ) = 0 (aX) (y − X β ˆ be a solution to Solution to Exercise 6. If R2 = 1. then Py y Therefore My 2 2 2 = y 2 . Regarding part 1. October 10. and hence 2 = Py 2 + My 2 = y + My 2 = 0.1. observe that ˆ + X( β ˆ − b) 2 y − Xb 2 = y − X β By the Pythagorean law. the claim y − Xb 2 ˆ = y − Xβ 2 ˆ − b) + X( β 2 ˆ ⊥ X( β ˆ − b).4. then a ≤ b.4. and hence have My = 0. and. EXERCISES 191 Solution to Exercise 6. Since My + Py = y. 2013 J OHN S TACHURSKI .6. this implies that Py = y.4. β b∈ for any b ∈ RK min y − Xb RK The “vice versa” argument follows along similar lines. by fact 2. by (6.1 on page 52.10) on page 179.4. If z ∈ rng(X a ). It suffices to show that if z ∈ rng(X a ). . then z= for some scalars α1 .5 we have MMc y = My.4 tells us that for any conformable vector z we have z = Pz + Mz. . It is sufficient to show that PMc = Mc P Since 1 ∈ rng(X) by assumption. EXERCISES 192 Solution to Exercise 6.6. Therefore P11 = 11 P.4.18) Solution to Exercise 6. .4. z ∈ rng(Xb ).4. . . orthogonality and the Pythagorean law.13.4. Using this result. Let x1 . then z ∈ rng(Xb ).11. . 2013 N . where the two vectors on the right-hand side are orthogonal. and 1 P = (P 1) = (P1) = 1 .10. . .4. Solution to Exercise 6. we have P1 = 1. x J be the columns of X a and let x1 . .1. α J . we obtain Mc y = PMc y + MMc y From fact 3. Letting z = Mc y. . Theorem 3. . It suffices to show that PMc y J OHN S TACHURSKI 2 = n =1 ˆn − y ¯ )2 ∑ (y October 10. we obtain Mc y Rearranging gives (6. . and PMc = P − 1 1 P11 = P − 11 P = Mc P N N j =1 2 = PMc y 2 + My 2 ∑ αj xj J+M J j =1 ∑ αj xj + ∑ J 0 xj j = J +1 Solution to Exercise 6.12. But then z= In other words.1. x J + M be the columns of Xb . the squared sample correlation between x and y can be written as ˆ2 = ρ On the other hand.3. where the first column is 1 and the second column is x.3).3. 2013 . This completes the proof. we have R2 c = [(Mc x) (Mc y)]2 Mc x 2 Mc y 2 Mc Py 2 Mc y 2 Therefore it suffices to show that.19) Mc Py = Mc x Let X = (1.4.4. for the simple linear regression model in §7.3.4. We then have ˆ = 1β ˆ 1 + xβ ˆ2 Py = X β ∴ ∴ ˆ = Mc x β ˆ2 Mc Py = Mc X β ˆ 2 = |β ˆ 2 | Mc x = |(Mc x) (Mc y)| Mc x Mc Py = Mc x β Mc x 2 Cancelling Mc x we get (6. x) be the data matrix. Let ˆ 1 := y ˆ2x ˆ 2 := (Mc x) (Mc y) ¯−β ¯ and β β Mc x 2 be the OLS estimators of β 1 and β 2 respectively (see §7.4.14. for any α = 0. This follows immediately from the definition of R2 . From exercise 6.15. we have |(Mc x) (Mc y)| (6. and the fact that. J OHN S TACHURSKI October 10.6.16. P = X ( X X ) −1 X = α2 X(X X)−1 X = (αX)((αX) (αX))−1 (αX) α2 Solution to Exercise 6. EXERCISES 193 This is the case because ˆn − y ¯ )2 = ∑ (y N ¯ Py − 1y 2 ¯ = Py − P1y 2 ¯) = P(y − 1y 2 = PMc y 2 n =1 Solution to Exercise 6.19). Part III Econometric Models 194 . . we have to impose (much!) more structure. for some sets of assumptions on the data generating process.1 The Model [roadmap] 7. while for other assumptions the performance is poor. .Chapter 7 Classical OLS In this chapter we continue our study of linear least squares and multivariate regression. The main purpose of this chapter is to describe the performance of linear least squares estimation under the classical OLS assumptions. as begun in chapter 6.1. To some people (like me). As such. where OLS stands for ordinary least squares. . the results obtained from these assumptions form the bread and butter of econometrics. At the same time. they need to be understood. the performance of linear least squares estimation is good. . we need to make more assumptions on the process that generates our data. 7. we will 195 . (x N .1 The OLS Assumptions In chapter 6 we assumed that the observed input-output pairs (x1 . the standard OLS assumptions are somewhat difficult to swallow. y1 ). y N ) were draws from some common joint distribution on RK +1 . For starters. To say more about whether linear least squares estimation is a “good” procedure or otherwise. Not surprisingly. To work with the classical OLS model. .1. The proof of the next fact is an exercise (exercise 7. we obtain the linear regression model C = β x + u. where G is the growth rate of GDP over a given period. . Okun’s law is an empirical rule of thumb relating output to unemployment. and U is the change in the unemployment rate over the same period (i. Letting β := ( β 1 . THE MODEL 196 assume that the input-output pairs all satisfy yn = β xn + un (7. A commonly estimated form of Okun’s law is U = β 1 + β 2 G + u.2. The Cobb-Douglas production function relates capital and labor inputs with output via y = Akγ δ where A is a random. β 2 ) and x := (1. the classical OLS model makes two assumptions. The value − β 1 / β 2 is thought of as the rate of GDP growth necessary to maintain a stable unemployment rate. u N ) be the column vector formed by the N realizations of the shock.1) can be expressed as y = Xβ + u (7.1. Regarding the shocks u1 . Example 7. Fact 7.1. When (7. u2 . The parameter β 2 is expected to be negative.e. .2) holds we have My = Mu and SSR = u Mu. end value minus starting value). D ) and adding an error term u. and u1 .. . . and the other concerning the variance.1) where β is an unknown K × 1 vector of parameters.6. u N are unobservable random variables. . . Letting u := (u1 .1. Taking logs yields the linear regression model ln y = β + γ ln k + δ ln + u where the random term ln A is represented by β + u. 2013 .7.1. .1.3. Example 7. The elementary Keynesian consumption function is of the form C = β 1 + β 2 D. . The first assumption is as follows: J OHN S TACHURSKI October 10. where C is consumption and D is household disposable income.1. the N equations in (7. one concerning the first moments. firm-specific productivity term and γ and δ are parameters. The model is called linear because the deterministic part of the output value β x is linear as a function of x. . . .1). u N .2) Here are three very traditional examples: Example 7. Let M be the annihilator associated with X. assumption 7. n. note that var[u | X] := E [uu | X] − E [u | X]E [u | X] = E [uu | X] where the second equality is due to assumption 7.1. 3. Together.1.1.2.1. Fact 7. Given assumption 7.1.1. The most standard assumption is as follows: Assumption 7. 4.1. ˆ . j in 1.1.1. N .3.2.1. J OHN S TACHURSKI October 10. Hence. n. we have: 1. k. 2013 . To understand assumption 7. we have the following results: 1. 3. k. E [um xnk ] = 0 for any m.6. Given assumption 7. 2. Shocks are homoskedastic: E [u2 i | X ] = E [ u j | X ] = σ for any i. the proof of which is an exercise (exercise 7. E [u] = 0. n. Fact 7.1.1. THE MODEL 197 Assumption 7. u and X satisfy E [u | X] = 0. xnk ] = 0 for any m. . cov[um . Distinct shocks are uncorrelated: E [ui u j | X] = 0 whenever i = j. . k. we will need an assumption on second moments of To consider the variance of β the shock. E [um | xnk ] = 0 for any m.2).2.2. in the same sense that the vector of coefficients β is unknown. . Together. This assumption expresses a number of ideas in very compact form. Important details are recorded in the next fact. u and X satisfy E [uu | X] = σ2 I. The value of the parameter σ is unknown.1. 2 2 2. where σ is a positive constant.7.2 implies that the conditional variance-covariance matrix is the constant matrix σ2 I. . var[u] = E [uu ] = σ2 I. 3) The standard estimator of the parameter σ2 introduced in assumption 7.2). 2013 . Under assumption 7.1.1. β ˆ ] = E [β ˆ | X ] = β.2.8). As a precursor to the arguments.1.2. we obtain ˆ − β : = ( X X ) −1 X y − β = ( X X ) −1 X ( X β + u ) − β = ( X X ) −1 X u β ˆ .2. To repeat: ˆ : = ( X X ) −1 X y β (7. This proves E [ β J OHN S TACHURSKI October 10. 7.2. By (7.2 The OLS Estimators ˆ deThe standard estimator of the unknown parameter vector β is the estimator β fined in (6. we have ˆ | X] = β + E [(X X)−1 X u | X] = β + (X X)−1 X E [u | X] = β E [β ˆ | X] = β. note that. both estimators have nice properties.2 is ˆ 2 := σ SSR N−K We will show below that.2.1. Adding β to both sides yields This deviation is known as the sampling error of β the useful expression ˆ = β + ( X X ) −1 X u β (7.1.2.1–7. applying (7.1. Theorem 7. we have E [σ Proof of theorem 7.1. Under assumptions 7.2. under the classical OLS assumptions given in §7.1.1.4) 7. VARIANCE AND BIAS 198 7. and hence E [ β ˆ ] = E [E [ β ˆ | X]] = E [ β] = β.1. and σ ˆ 2 is an unbiased estimator of σ2 : shock.4) and assumption 7.1.1.2 Variance and Bias ˆ and σ ˆ 2 under In this section we will investigate the properties of the estimators β the standard OLS assumptions.7. we have E [ β ˆ 2 ] = E [σ ˆ 2 | X ] = σ2 . Theorem 7.1 Bias Under the linearity assumption y = X β + u and the standard assumptions on the ˆ is an unbiased estimator of β. 4 on page 73. trace(P) = trace[X(X X)−1 X ] = trace[(X X)−1 X X] = trace[IK ] = K ∴ trace(M) = trace(I N − P) = trace(I N ) − trace(P) = N − K Now let mij (X) be the i. observe that.2.7. If A := (X X)−1 X .2.3. recalling fact 2.2 ˆ Variance of β ˆ is known to be unbiased. we can treat it as non-random given X. we start the proof by showing that trace(M) = N − K. To see that this is so. and Proof.2.1. and E [σ ˆ 2 ] = E [E [ σ ˆ 2 | X]] = E [σ2 ] = σ2 . we have E [ SSR | X] = E [u Mu | X] = E ∑ ∑ ui u j mij (X) | X = ∑ ∑ mij (X)E [ui u j | X] i =1 j =1 i =1 j =1 N N N N In view of assumption 7. j-th element of M. we want to say something about the variance. Under assumptions 7.4.3. and hence.1.1.2.2.1–7. this reduces to n =1 ∑ mnn (X)σ2 = trace(M)σ2 = ( N − K)σ2 N ˆ 2 | X] = σ2 . by fact 2. VARIANCE AND BIAS 199 Proof of theorem 7. Theorem 7. 2013 .2.1 on page 196. then β ˆ | X] = var[ β + Au | X] = var[Au | X] var[ β Since A is a function of X. Hence E [σ 7.8. A(σ2 I)A = σ2 AA = σ2 (X X)−1 X X(X X)−1 = σ2 (X X)−1 ∴ ˆ | X] = var[Au | X] = σ2 (X X)−1 var[ β J OHN S TACHURSKI October 10. Now that β ˆ | X ] = σ 2 ( X X ) −1 . we have var[ β ˆ = β + Au. Applying fact 7. we have var[Au | X] = A var[u | X]A = A(σ2 I)A Moreover.2.1. For reasons that will become apparent. how to interpret the statement that var[b | X] − var[ β inite? Matrices have no standard ordering. In view of theorem 2. as described above. The matrix C is allowed to depend on X (i. There are a couple of points to clarify.3 The Gauss-Markov Theorem ˆ is known to be unbiased for β under the OLS assumptions.. Let b = Cy.e. it’s not clear whether this is low or high. var[b | X] ≥ var[ β ˆ is BLUE.2.2.4 (Gauss-Markov).5) Taking conditional expectations and using the fact that D is a function of X. the meaning of linearity: Although it’s not immediately clear from the statement of the theorem.e. be a function of X). and let D := C − A. and hence it’s hard to say when one random vector has “larger” variance than another. and was discussed previously in §4.2.7. then ˆ | X]. In particular. the meaning of unbiasedness: In this theorem.1 on page 56. for any β ∈ RK ). 2013 . If b is any other linear unbiased estimator of β. we have E [b | X] = E [Cy | X] = β.2. but not y. But nonnegative definiteness of the difference is a natural criterion. Proof of theorem 7. BLUE stands for Best The theorem is often summarized by stating that β Linear Unbiased Estimator.. ˆ | X] is positive defSecond. Although we obtained an expression for the step is to show that β variance in theorem 7.4. this is equivalent to requiring that b = Cy for some matrix C. it means that. VARIANCE AND BIAS 200 7. all elements of the principle diagonal of a nonnegative definite matrix are themselves nonnegative. the next Now that β ˆ has low variance. This leads us to the famous Gauss-Markov theorem. Theorem 7. we obtain ˆ | X] E [b | X] = E [DX β | X] + E [Du | X] + E [ β ˆ | X] = DX β + 0 + β = DX E [ β | X] + D E [u | X] + E [ β J OHN S TACHURSKI October 10. Then ˆ = DX β + Du + β ˆ b = Cy = Dy + Ay = D(X β + u) + β (7. The natural way ˆ with that of other unbiased to answer this question is to compare the variance of β estimators.2. where A := (X X)−1 X . First.1.2.3. regardless of the value of β (i. is that var[bk | X] ≥ var[ β Third. in the sense that var[b | X] − var[ β ˆ | X] is nonnegative definite. so the implication ˆ k | X] for all k. here linearity of b means that b is linear as a function of y (taking X as fixed).2. the proof is now done.5. Recall that y can be decomposed as implying an OLS estimate β ˆ + My y = Py + My = X β J OHN S TACHURSKI (7.3.3. it turns out to have many useful applications.5).1. an explicit ˆ . in particular. Combining this ˆ . While that might not expression for an arbitrary sub-vector of the OLS estimator β sound terribly exciting. we conclude that DX = 0. this says result with (7.2 and fact 2.1 Statement and Proof Continuing to work with our linear regression model.6.3.7. 7. ˆ = (X X)−1 X y. 2013 . THE FWL THEOREM 201 In light of the fact that b is unbiased. 7. E [b | X] = β for any given β. among other things. and.6) October 10. we have β = DX β + β for all β ∈ RK ∴ 0 = DX β for all β ∈ RK In light of exercise 2.19 on page 80.3 The FWL Theorem The Frisch-Waugh-Lovell (FWL) Theorem yields. To complete the proof. Roughly speaking. we obtain the expression b = Du + β that b is equal to the OLS estimator plus some zero mean noise. observe that ˆ | X] = var[(D + A)u | X] = (D + A) var[u | X](D + A) var[b | X] = var[Du + β Using assumption 7. the right-hand side of this expression becomes σ2 (D + A)(D + A ) = σ2 (DD + DA + AD + AA ) Since DA = DX(X X)−1 = 0(X X)−1 = 0 and since AA = (X X)−1 X X(X X)−1 = (X X)−1 we now conclude that ˆ | X] var[b | X] = σ2 [DD + (X X)−1 ] = σ2 DD + var[ β Since σ2 DD is nonnegative definite (why?). let’s take y and X as given. 5 on page 90.8). observe that (X2 M1 My) = y M M1 X2 = y MM1 X2 = y MX2 = 0 In the first equality we used the usual property of transposes (fact 2.6.6) as ˆ + X2 β ˆ + My y = X1 β 1 2 (7. and in the fourth we used the fact that M is the annihilator for X. Regarding the last term. in the third we used fact 3.7) by X2 M1 . Similarly.5). Premultiplying both sides of (7. let’s present the proof. projecting onto the orthogonal complement of the column space of X1 . The K2 × 1 vector β 2 ˆ = ( X M 1 X 2 ) −1 X M 1 y β 2 2 2 (7.7) Let P1 := X1 (X1 X1 )−1 X1 be the projection onto the column space of X1 .3.3. β 2 We can then rewrite (7. we just need to check that X2 M1 X2 is invertible.3.3.8) The theorem gives an explicit analytical expression for our arbitrarily chosen subset ˆ of the OLS estimate β ˆ . To see this. β 2 Proof of theorem 7.9) The first and last terms on the right-hand side are zero.1. Before discussing it’s implications. β 1 ˆ be the K2 × 1 vector consisting of the remaining K2 elements of β ˆ. The proof of this last fact is left as an exercise (exercise 7. and let M1 := I − P1 be the corresponding annihilator. (7. and hence MX2 = 0. Hence M1 X1 = 0. and 1. and let X2 be a matrix consisting of the remaining K2 := K − K1 columns of X. With this notation we have the following result: ˆ can be expressed as Theorem 7. we obtain ˆ + X M1 X2 β ˆ + X M1 My X2 M1 y = X2 M1 X1 β 2 2 1 2 (7. 2013 .1.7. In light of the above.10 or direct calculation).9) becomes ˆ X2 M1 y = X2 M1 X2 β 2 To go from this equation to (7.4. 2.10). because M1 is the annihilator associated with X1 . it suffices to show that the transpose of the term is 0 . in the second we used symmetry of M and M1 (exercise 3. THE FWL THEOREM 202 We can write (7. This is clear for the first term. J OHN S TACHURSKI October 10. let ˆ be the K1 × 1 vector consisting of the first K1 elements of β ˆ .1 (FWL theorem).6) in a slightly different way by partitioning the regressors and the estimated coefficients into two classes: Let X1 be a matrix consisting of the first K1 columns of X. Regress y ˜ on x This is obviously different from the process for obtaining the coefficient of the vector colK (X) in a simple univariate regression. the latter being just 1. where X2 is the single column colK (X). the ˆ K is: process for obtaining the OLS estimate β ˜ K. Regress y on colK (X). In view ˆ K can be found by regressing of the preceding discussion. THE FWL THEOREM 203 7. 2013 IID .2 Intuition ˆ in theorem 7. To get some feeling for what this means. the expression for β 2 rewritten as ˆ = [(M1 X2 ) M1 X2 ]−1 (M1 X2 ) M1 y β (7. Thus. these two residual terms y ˜ and x of y and colK (X) that are “not explained by” X1 . Remove effects of all other regressors from y and colK (X).10) 2 Close inspection of this formula confirms the following claim: There is another way ˆ besides just regressing y on X and then extracting the last K2 elements: to obtain β 2 We can also regress M1 y on M1 X2 to produce the same result. We can illustrate this idea further with a small simulation. on an intuitive level. 1.3. producing y ˜ and x ˜ K.8 asks you to show. let’s look at a special case.6.1 can be As exercise 7. In words.3. the OLS estimate β y ˜ := M1 y = residuals of regressing y on X1 on ˜ K := M1 colK (X) = residuals of regressing colK (X) on X1 x ˜ K can be thought of as the parts Loosely speaking. 1) October 10. 2. without regression coefficient β taking into account indirect channels involving other variables. containing the observations on the K-th regressor.7. Suppose that y = x1 + x2 + u J OHN S TACHURSKI where u ∼ N (0.3. the difference between the univariate least squares estimated coefficient of the K-th regressor and the multiple regression OLS coefficient is that the multiple ˆ K measures the isolated relationship between xK and y. As we saw in (4. THE FWL THEOREM 204 If we generate N independent observations from this model and regress y the observations of ( x1 . A proof can be found in chapter 7. The coefficient β where x ˆ 1 is called the intercept coefficient. J OHN S TACHURSKI October 10. where the intercept is included (i.lm ( y ~ 0 + X1 ) results $ coefficients X1 30. let’s derive the familiar expression for the slope coefficient in simple regression.runif ( N ) X2 <. if we regress y on x1 alone.c (1 . in this setting. x2 ).1 However. X2 ) y <. then. The coefficient in the univariate regression reflects this total effect. while β ˆ 2 more succinctly as We can rewrite β ˆ 2 = [(x − x ¯ 1)]−1 (x − x ¯ 1) (y − y ¯ 1) ¯ 1) (x − x β 1 The (7. 2013 . For simplicity. For example: > > > > > > > > N <. because an increase in x1 tends to have a large positive effect on x2 .1000 beta <. 1 is the first column of X) and K = 2. the second column of X will be denoted simply by x.e. provided that N is sufficiently large. Simple regression is a special case of multivariate regression.76840 Here the coefficient for x1 is much larger than unity. 1) X1 <.3. then the coefficient for x1 will depend on the relationship between x1 and x2 . known as the slope coefficient. which in turn increases y. the coefficients for x1 and x2 will both be close to unity. the OLS estimator is consistent for the coefficients.3.26) on page 141.3 Simple Regression As an application of the FWL theorem. 7.10 * exp ( X1 ) + rnorm ( N ) X <.X % * % beta + rnorm ( N ) results <. the OLS estimates are N ¯ )(yn − y ¯) ˆ 2 = ∑ n =1 ( x n − x β N ¯ )2 ∑ n =1 ( x n − x and ˆ1 = y ˆ2x ¯−β ¯ β ˆ 2 is ¯ is the sample mean of x and y ¯ is the sample mean of y.7.cbind ( X1 .11) reason is that.. 11) and (7. 7. non-intercept) regressors are equal to the estimated coefficients of a zero-intercept regression performed after all variables have been centered around their mean.e.12) where Mc is the annihilator associated with the single regressor 1.3.10). As we saw in the last section.1–7.16).12) coincide. ˆ = (X X)−1 X y is partitioned into ( β ˆ ).4 Centered Observations Let’s generalize the discussion in the preceding section to the case where there are multiple non-constant regressors. 7. then we can write ˆ 1. β If the OLS estimate β 2 ˆ = 1 β 1 + X2 β ˆ Xβ 2 ˆ as Applying the FWL theorem (equation 7. It is straightforward to show that for this annihilator Mc and any z. each a vector of observations on a nonconstant regressor. we showed that the variance-covariance matrix of the OLS estimate β J OHN S TACHURSKI October 10. Similarly.) It is now easy to see that the right-hand sides of (7.5 Precision of the OLS Estimates Let’s return to the regression problem. THE FWL THEOREM 205 By the FWL theorem (equation 7. Mc y is y centered around its mean. as defined in (6.16). The only difference to the preceding case is that instead of one column x of observations on a single non-constant regressor.1.10) once more. Mc X2 is a matrix formed by taking each column of X2 and centering it around its mean. the annihilator associated with 1 converts vectors Mc z = z − z into deviations around their mean. we have a matrix X2 containing multiple columns. the estimated coefficients of the non-constant (i. we can write β 2 ˆ = [(Mc X2 ) Mc X2 ]−1 (Mc X2 ) Mc y β 2 where Mc is the annihilator in (6. we have ¯ 1.2 in force. with assumptions 7. In theˆ orem 7. In other words.3.2. What we have shown is this: In an OLS regression with an intercept.3..7. (The “c” in Mc reminds us that Mc centers vectors around their mean. we also have ˆ 2 = [(Mc x) Mc x]−1 (Mc x) Mc y β (7. 2013 .1.3. β ˆ K are given by the principle diagonal of this matrix. That is. the OLS estimators will have the most precision. the variance σ2 of the shock Thus. From the FWL theorem. However. (Precision of an estimator is sometimes defined as the inverse of the variance. X1 contains as its ˆ is the OLS estimates of the columns the observations of the other regressors. we can then express β ˆ k = (colk (X) M1 colk (X))−1 colk (X) M1 y β (7. and β 1 ˆ k as corresponding coefficients.) The Gauss-Markov theorem tells us that. In this case.13).1). . . we obtain M1 y = M1 colk (X) β k + M1 u ˆ k in terms of the Substituting this into (7. Put differently.11) that ˆ k | X] = σ2 (colk (X) M1 colk (X))−1 = σ2 M1 colk (X) var[ β −2 (7. although definitions do vary. the variance of β u. we obtain a second expression for β shock vector u: ˆ k = β k + (colk (X) M1 colk (X))−1 colk (X) M1 u β Some calculations then show (exercise 7.14). J OHN S TACHURSKI October 10.13) where colk (X) is the vector of observations of the k-th regressor.7. small variance of β ˆ k correof these OLS estimates β sponds to probability mass concentrated around the true parameter β k .3. if we fix the regression problem and vary the estimators. 2013 . .16) ˆ k depends on two components. the OLS estimates will have low variance. Applying M1 to both sides of (7. we say that the estimator has high precision. . and which will have low precision estimates? To answer this question.14) where M1 is the annihilator corresponding to X1 . at least as far as unbiased linear estimators go. The scalar variances of the individual OLS coefficient estiˆ 1. we want to think about precision a different way: If we hold the estimation technique fixed (use only OLS) and consider different regression problems.2. THE FWL THEOREM 206 given X is σ2 (X X)−1 . We can write the regression model y = X β + u as y = X1 β1 + colk (X) β k + u (7. and the norm of the vector M1 colk (X). let’s focus on the variance of a fixed coefficient β k .6.15) (7. which problems will have high precision estimates. Since any one mates β ˆ k is unbiased (theorem 7. M1 := I − P1 where P1 is the matrix X1 (X1 X1 )−1 X1 projecting onto the column space of X1 . then the variance of β ˆ k correspondingly large)? This When will this norm be small (and the variance of β will be the case when colk (X) is “almost” a linear combination of the other regressors. Once this class is specified. the harder it will be to estimate β k with good precession. suppose that colk (X) is indeed “almost” a linear combination of the other regressors. The vector M1 colk (X) is the residuals from regressing colk (X) on X1 . then P1 colk (X) − colk (X) ≈ 0. β K .4. or contradicts the normality assumption. and M1 colk (X) is the ˆ k will be large. so we are saying that colk (X) is “almost in” that span. norm of this vector. This implies that P1 colk (X) ≈ colk (X) because P1 projects into the column space of the other regressors X1 . 7. and hence M1 colk (X) = colk (X) − P1 colk (X) ≈ 0 ˆ k is In other words.4 Normal Errors In this section. To see this. (In other words. As we have just seen. .) This will allow us to test hypotheses about the distribution of β coefficients. 2013 . Following this grand tradition. we’re going to strengthen and augment our previous assumptions by specifying the parametric class of the error vector u. . given that we’re strengthening our previous assumptions. The term M1 colk (X) −2 is more interesting. If this norm is small. This situation is sometimes referred to as multicollinearity. then the ˆ is fully specified. we J OHN S TACHURSKI October 10. and hence the variance of β large. β 1 . A normal distribution in R N is fully specified by its mean and variance-covariance matrix. NORMAL ERRORS 207 The variance in σ2 is in some sense unavoidable: Some data is noisier that other data. if values for these parameters are specified. Because of its many attractive properties. Now if P1 colk (X) ≈ colk (X). In this case.7. the norm of M1 colk (X) is small. we ˆ up to the unknown parameters can determine the distribution of the OLS estimate β σ2 . multicollinearity is associated with poor precision in estimates of the coefficients. . The larger the variance in the unobservable shock. we will assume that u is a normally distributed element of R N . the normal distribution is our go-to distribution. at least for the case where we have no information that suggests another distribution. . X and u are independent.6.4. NORMAL ERRORS 208 have no choice here. from assumption 7.1 and assumption 7.1 Preliminary Results ˆ given X is norAssumption 7. Theorem 7. Notice that assumption 7.1 also implies that the conditional distribution of β − 1 ˆ = β + (X X) X u. the distribution of β ˆ k given It follows from theorem 7.1. ˆ given X is N ( β.4. or more precisely.4.1 implies that shocks and regressors are uncorrelated (see fact 7. let ek be the k-th canonical basis vector.4.1. of Our second preliminary result concerns the distribution of σ Q := ( N − K ) 2 See ˆ2 σ σ2 (7. 1) (7.15 on page 80.1. σ2 I). Furthermore. 7.16).7.4. To make life a bit easier for ourselves. ˆ k ∼ N ( β k .4. This can be established directly from (7. k )-th element of the matrix (X X)−1 .1.2 It then follows that ˆ k − βk β zk := ∼ N (0.1.4. 2013 . since it gives an expression for the variance which is easier to compute. assumption 7. and.1 implies both assumption 7.1. σ 2 e ( X X ) −1 e k ) β k (7.2 on page 197).18) σ e k ( X X ) −1 e k ˆ 2 .4.1 instead.1. and observe that ˆ ∼ N ( e β .1.4.17) Note that ek (X X)−1 ek is the (k. σ2 (X X)−1 ). σ 2 e ( X X ) −1 e k ) ek β k k In other words. Assumption 7. and the variance is given in (7. J OHN S TACHURSKI October 10. we’ll go a step further and assume they are independent.15).19) exercise 2.1 that the distribution of individual coefficient β X is also normal. Under assumption 7. since β next theorem records this result. σ2 I). However.2. we will use theorem 7.1.2. From assumption 7. The mal. and linear combinations of normals are normal. To do this. the mean is E [u] = 0. and u ∼ N (0.1. We must then have u ∼ N (0. the variance-covariance matrix is E [uu ] = σ2 I. Under assumption 7. Since we don’t. we consider the null hypothesis H0 : β k = β0 k against H1 : β k = β0 k 2 where β0 k is any number.4.4.18). I). If the null hypothesis H0 is true. Let assumption 7. Specifically.20) The distribution of this statistic under the null is described in the next theorem. To see that Q ∼ χ2 ( N − K ) given X.9 and our previous result that trace(M) = rank(M) = N − K (recall that for idempotent matrices. ˆ 2 .1.20) is Student’s t with N − K degrees of freedom. Our next result implements this idea. we could test H0 via (7. then.2 The t-test Let’s consider the problem of testing a hypothesis about an individual coefficient β k . 2013 .4.3.9). observe that Q= (Mu) (Mu) u Mu = = ( σ −1 u ) M ( σ −1 u ) σ2 σ2 Since σ−1 u ∼ N (0. we obtain the t-statistic sample estimate se( β k tk := ˆ k − β0 β k ˆ se( β k ) (7.18).7.2. Theorem 7. we will make use of the following notation: ˆ k ) := se( β ˆ 2 e k ( X X ) −1 e k σ ˆ k ) is called the standard error of β ˆ k . as follows from fact 2. the expression on the far right is χ2 ( N − K ).4. and determine the the standard methodology is to replace σ2 with its estimator σ distribution of the resulting test statistic. the distribution of Q given X is χ2 ( N − K ). 7. In doing so. NORMAL ERRORS 209 Theorem 7.1 hold. J OHN S TACHURSKI October 10. If we knew σ .4. trace and rank are equal—see fact 2. the distribution of the t-statistic in (7.4.3. Proof.4. Replacing this standard deviation with its estimate of the standard deviation of β ˆ k ) and β k with β0 in (7. It can be regarded as the sample The term se( β ˆ k . conditional on X. 5. an application with simulated data is given in listing 9.6.14) on page 21. NORMAL ERRORS 210 Proof. The first two results were established in §7. we can write this as tk = zk N−K = zk ˆ 2 / σ2 ( N − K )σ N−K Q where Q is defined in (7. note that if a and b are independent random vectors and f and g are two functions. For the k-th coefficient β k . it suffices to show that β ˆ are zk as a function of β independent. we know that tk is Student’s t with N − K degrees of freedom if zk is standard normal. This is exercise 7. Let a desired size α be given.2. In view of fact 1. To see this.4.1. For T we take T = |tk |.15) on page 165.3. with the rule: Reject H0 if T > c. we choose c = cα to solve α = Pθ { T > c}. this leads to the statistic ˆk β tk := ˆ k) se( β This statistic is sometimes called the Z-score. you will find that J OHN S TACHURSKI October 10. we know that the solution is cα = F −1 (1 − α/2). Let’s look at an example. We then have tk = ˆ k − βk β = ˆ k) se( β ˆ k − βk β σ 2 e k ( X X ) −1 e k 1 ˆ 2 / σ2 σ = zk 1 ˆ 2 / σ2 σ where zk is defined in (7. The most common implementation of the t-test is the test that a given coefficent is equal to zero.18).19).4. Since both are normally distributed given X.7 on page 74 implies that they will be independent whenever their covariance is zero. where F is the Student’s t cdf with N − K degrees of freedom. so it remains only to show that zk and Q are independent. In view of (5. fact 2.3. Since we can write ˆ and Q as a function of u ˆ and u ˆ . 2013 .3. To illustrate further. Q is χ2 ( N − K ) and zk and Q are independent. which completes the proof of theorem 7.4. Suppose that the null hypothesis is true. In view of example 5. Multiplying and dividing by the square root of N − K. If you run the program.4.9 on page 25. or 1 − α = Pθ {|tk | ≤ c} From (1. then f (a) and g(b) are likewise independent. the corresponding p-value is 2 F (−|tk |).7. Recall that a test is a test statistic T and a critical value c. K ) X <. K <. NORMAL ERRORS 211 the Z-scores calculated by the zscore function agree with the “t value” column of the summary table produced by R’s summary function (the last line of listing 9). zscore ( k ) . seed (1234) N <.y .X % * % beta + u betahat <.stat .cbind ( runif ( N ) .th regressor zscore <. " \ n " ) } # For comparison : print ( summary ( lm ( y ~ X .stat (Z .1) ) ) 7.4. k = " . but for simplicity we will focus on null hypotheses that restrict a subset of the coefficients to be zero.3 beta <. J OHN S TACHURSKI October 10.X % * % betahat sigmahat <.solve ( t ( X ) % * % X ) % * % t ( X ) % * % y residuals <. 2013 . It is left as an exercise to check that the p-values in the same table agree with the formula 2 F (−|tk |) given in the last paragraph.sigmahat * sqrt ( solve ( t ( X ) % * % X ) [k . " : " . Listing 9 Calculating Z-scores set .rnorm ( N ) y <.7. the most common test is the F-test.rep (1 .3 The F-test The t-test is used to test hypotheses about individual regressors. k .stats for ( k in 1:3) { cat ( "t .function ( k ) { se <.K ) ) # Compute t . The F-test can test quite general hypotheses. k ]) return ( betahat [ k ] / se ) } # Print t . runif ( N ) . For hypotheses concerning multiple regressors.50. runif ( N ) ) u <.sqrt ( sum ( residuals ^2) / ( N .score ) for k .4. Theorem 7. Let Q1 := ( RSSR − USSR )/σ2 and let Q2 := F= USSR / σ2 . (a) Q1 is chi-squared with K2 degrees of freedom. 2013 J OHN S TACHURSKI .2. with parameters (K2 . If the null hypothesis is true.10 on page 25.4. Part (b) was established in theorem 7.7. which translates to evidence against the null hypothesis. under the null hypothesis.22) 2 := My 2 and RSSR : = M1 y be the sums of squared residuals for the unrestricted regression (7.21) (7.1. Regarding part (a).23) Large residuals in the restricted regression (7.21) result in large values for F.2. NORMAL ERRORS 212 ˆ . (b) Q2 is chi-squared with N − K degrees of freedom. the standard test statistic for our null hypothesis is F := ( RSSR − USSR )/K2 USSR / ( N − K ) (7.3. so that Q 1 / K2 Q2 / ( N − K ) In view of fact 1. conditional on X. Proof. P1 and M1 be as defined in §7.4. As our null In what follows. β 1 2 hypothesis we take H0 : β2 = 0 against H1 : β2 = 0 Since y = X β + u = X1 β1 + X2 β2 + u it follows that under the null hypothesis we have y = X1 β1 + u Letting USSR (7. it now suffices to show that. we let X1 . then.β ˆ .4.4. observe that.22) relative to those in (7.1 hold.23) has the F distribution. (c) Q1 and Q2 are independent.4. • USSR = My 2 = M ( X1 β1 + u ) 2 = Mu 2 = u Mu. and October 10. the statistic F defined in (7.21) and restricted regression (7. X2 . under the null hypothesis. N − K ).22) respectively. Let assumption 7. In this case (7.3 Applying the techniques in the proof of theorem 7. because cov[(P − P1 )u.4. This is the case. this becomes (P − P1 )E [uu | X]M = σ2 (P − P1 )M = σ2 (P − P1 )(I − P) Using idempotence and fact 3. observe that Q1 is a function of (P − P1 )u. we see that rank(P − P1 ) = trace(P − P1 ) = trace(P) − trace(P1 ) = K − K1 = K2 Via fact 2.1. and hence of theorem 7.24) where β2 is the vector of coefficients corresponding to the non-constant regressors. NORMAL ERRORS 213 2 • RSSR = M1 y 2 = M1 ( X1 β1 + u ) = M1 u 2 = u M1 u It follows that RSSR − USSR = u M1 u − u Mu = u (M1 − M)u Using the definitions of M and M1 . it remains to show that. Since X1 = 1.4.7.2. To complete the proof. as was to be shown. P1 and M are just functions of X.16) on J OHN S TACHURSKI October 10.2. we conclude that Q1 ∼ χ2 (K2 ).21) becomes y = 1 β 1 + X2 β2 + u (7.4. Q1 and Q2 are independent. 2013 . Since both (P − P1 )u and Mu are normal given X. we then obtain Q1 = RSSR − USSR σ2 = u ( I − P1 − I + P ) u = (σ1 u) (P − P1 )(σ−1 u) 2 σ It is an exercise to show that (P − P1 ) is symmetric and idempotent.9. where the latter is defined in (6. the matrix product on the right is (P − P1 )(I − P) = P − P2 − P1 + P1 P = P − P − P1 + P1 = 0 This completes the proof of independence.2.4. The most common implementation of the F test is the test that all coefficients of non-constant regressors are zero. we then have M1 = Mc . under the null hypothesis and taking X as given. it suffices to show that their covariance is zero. Mu | X] = E [(P − P1 )u(Mu) | X] = E [(P − P1 )uu M | X] Since P. while Q2 is a function of Mu. To see this. x3 <.50. NORMAL ERRORS 214 Listing 10 Calculating the F statistic set .cbind ( rep (1 . x2 .7.X % * % beta + u betahat <.mean ( y ) ) ^2) Fa <. " \ n " ) # For comparison : print ( summary ( lm ( y ~ x2 + x3 ) ) ) J OHN S TACHURSKI October 10. K <.K ) cat ( " F = " .y .3 beta <. N ) .4.rep (1 . K ) x2 <.rnorm ( N ) y <. 2013 . seed (1234) N <.X % * % betahat ussr <. x3 ) u <.runif ( N ) .ussr / ( N .sum (( y .sum ( residuals ^2) rssr <. Fa / Fb .( rssr .solve ( t ( X ) % * % X ) % * % t ( X ) % * % y residuals <.ussr ) / 2 Fb <.runif ( N ) X <. When it fails. A likely problem here is that the productivity shocks are positively correlated.2.5. As a first example. k is capital. and the results we have obtained are sensitive to their failure. consider again the Cobb-Douglas example on page 196. the OLS estimates can be biased when assumption 7. moreover.1 Endogeneity Bias Even if the model is correctly specified.25) K2 1 − R2 c where R2 c is the centered R squared defined in §6. An application with simpage 187. 2013 . 7.5 When the Assumptions Fail The standard OLS assumptions are very strict. and. let’s be polite and consider situations where the standard OLS assumptions are only slightly wrong. In what follows. you will find that the F statistic calculated by the theoretical formula agrees with the F statistic produced by R’s summary function. and subscript n indicates observation on the n-th firm. For example. The term un is a firm specific productivity shock.1 (i.1. the F statistic in (7. Can you provide some intuition as to why large F is evidence against the null? 7.24).1.3. E [u | X] = 0) fails. the bias is called endogeneity bias. If you run the program. which yields the regression model ln yn = β + γ ln k n + δ ln n + un Here y is output. and hence RSSR is the squared norm of y − y ulated data is given in listing 10. then pretty much all the results discussed in this chapter are invalid.23) can be rewritten as R2 N−K c F= (7. if our basic assumption y = X β + u is not true. J OHN S TACHURSKI October 10.1.7. It is an exercise to show that in the case of (7.5.2 on page 88. WHEN THE ASSUMPTIONS FAIL 215 ¯ 1.. Assumption 7.1 is sometimes called an exogeneity assumption. the firm will 3 Hint: See fact 3.e. is labor. There are many sources of endogeneity bias. We will look at two examples. −1 ] This term will be strictly positive whenever un. u N ) we can write the N equations in (7. the firm forecasts period n productivity as E [un | un.1. . and this value is observable to the firm. . For an illustration of careful modeling (and discussion of other potential problems with estimating production functions) see the paper of Olley and Pakes (1996).−1 + ηn )] = E [bu2 n.7. obtaining the OLS estimate ˆ : = ( x x ) −1 x y = x y β xx ˆ is a biased estimate of β. with un = un.2).1. suppose that un. When all shocks are zero mean we then have E [ n un ] = E [( a + bun. Thus. . x : = ( y 0 .1.−1 . .−1 is the productivity shock received by firm n last period. This source of endogeneity bias in estimating production functions has been discussed many times in the literature.−1 ] = un. . This will lead to endogeneity bias. 2013 . The best solution is better modeling. For example. suppose next that we have in hand data that is generated according to the simple AR(1) model y0 = 0 and y n = β y n −1 + u n IID for n = 1. Finally. y N −1 ) and u : = ( u1 .−1 has positive variance. Suppose that productivity follows a random walk. N (7. The source of the bias is failure of It turns out that β the exogeneity assumption. As a result.1 implies fact 7.−1 + ηn . where ηn is zero mean white noise. with the specific relationship n = a + bE [un | un.1 J OHN S TACHURSKI October 10. . .2 tells us that if assumption 7. WHEN THE ASSUMPTIONS FAIL 216 choose higher levels of both capital and labor when it anticipates high productivity in the current period. . . To illustrate this.26) N 2 2 Here we assume that {un }n =1 ∼ N (0. y N ). . σ ).2 (page 197) fail.26) as y = βx + u Suppose that we now estimate β by regressing y on x. . suppose that the firm increases labor input when productivity is anticipated to be high. . .5. fact 7.1. the conditions of fact 7. .1.−1 )(un.1 does not hold (because assumption 7.−1 ] for b > 0. . As a second example of endogeneity bias. The unknown parameters are β and σ . .1. and therefore assumption 7. Letting y : = ( y1 . 7.5. WHEN THE ASSUMPTIONS FAIL 217 holds, then we must have E [um xn+1 ] = 0 for any m and n. In the current set-up, this equates to E [um yn ] = 0. We now show that this fails whenever β = 0 and n ≥ m. To see this, observe that (exercise 7.6.15) we can write yn as n −1 yn = and, therefore, j =0 ∑ β j un− j (7.27) E [yn um ] = ∑ β j E [un− j um ] = βn−m σ2 whenever n ≥ m j =0 n −1 (7.28) It follows that when β = 0, assumption 7.1.1 must fail. To help illustrate the bias in our estimate of β, let’s generate the data 10,000 times, ˆ on each occasion, and then take the sample mean. The code to do this is compute β given in Listing 11, with N = 20 and β = 0.9. The resulting sample mean was 0.82. Since the number of replications is very large (i.e., 10,000), this will be very close ˆ ]. In fact, the asymptotic 95% confidence interval for our estimate of E [ β ˆ ] is to E [ β (0.818, 0.824). This bias towards small values is reinforced by the histogram of the observations given in figure 7.1, which is skewed to the left. 7.5.2 Misspecification and Bias The linearity assumption y = X β + u can fail in many ways, and when it does the ˆ will typically be biased. Let’s look at one possible failure, where the estimator β model is still linear, but some variables are omitted. In particular, let’s suppose that the data generating process is in fact y = X β + Zθ + u We’ll also assume that θ = 0, and that E [u | X, Z] = 0. Suppose that we are aware of the relationship between y and X, but unaware of the relationship between y and Z. This will lead us to mistakenly ignore Z, and simply regress y on X. Our OLS estimator will then be given by the usual expression ˆ = (X X)−1 X y. Substituting in (7.29), we get β ˆ = ( X X ) −1 X ( X β + Z θ + u ) = β + ( X X ) −1 X Z θ + ( X X ) −1 X u β J OHN S TACHURSKI October 10, 2013 (7.29) 7.5. WHEN THE ASSUMPTIONS FAIL 218 ˆ Listing 11 Generates observations of β N <- 20 y <- numeric ( N ) y _ zero <- 0 beta <- 0.9 num . reps <- 10000 betahat . obs <- numeric ( num . reps ) for ( j in 1: num . reps ) { u <- rnorm ( N ) y [1] <- beta * y _ zero + u [1] for ( t in 1:( N -1) ) { y [ t +1] <- beta * y [ t ] + u [ t +1] } x <- c ( y _ zero , y [ - N ]) # Lagged y betahat . obs [ j ] <- sum ( x * y ) / sum ( x ^2) } print ( mean ( betahat . obs ) ) J OHN S TACHURSKI October 10, 2013 7.6. EXERCISES 219 Density 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 ˆ Figure 7.1: Observations of β We now have ˆ ] = E [E [ β ˆ | X, Z]] = β + E [(X X)−1 X Z]θ E [β If the columns of X and Z are orthogonal, then X Z = 0, the last term on the rightˆ is unbiased. If this is not the case (typically it won’t be), hand side drops out, and β ˆ is a biased estimator of β. then β 7.5.3 Heteroskedasticity [to be written] 7.6 Exercises Throughout these exercises, we maintain assumptions 7.1.1 and 7.1.2. In particular, we assume that y = X β + u, E [u | X] = 0, and E [uu | X] = σ2 I, where σ is a positive constant. J OHN S TACHURSKI October 10, 2013 7.6. EXERCISES 220 A general hint for the exercises is to remember that, as stated in fact 3.3.7, when computing expectations conditional on X, you can treat X or any function of X as constant. For example, matrices and vectors that depend only on X and non-random vectors/matrices, such as X β or P = X(X X)−1 X , can be regarded as constant. Ex. 7.6.1. Show that My = Mu and SSR = u Mu. Ex. 7.6.2. Prove the claims in fact 7.1.2. Ex. 7.6.3. Confirm part 1 of fact 7.1.3: Show that var[u] = E [uu ] = σ2 I. Ex. 7.6.4. Show that cov[Py, My | X] = 0. ˆ,u ˆ | X ] = 0. Ex. 7.6.5. Show that, under assumptions 7.1.1–7.1.2, we have cov[ β Ex. 7.6.6. Show that E [Py | X] = X β and var[Py | X] = σ2 P. Ex. 7.6.7. Show that E [My | X] = 0 and var[My | X] = σ2 M. ˆ in (7.8) and (7.10) are equal.4 Ex. 7.6.8. Show that the two expressions for β 2 Ex. 7.6.9. In the proof of theorem 7.2.2 we used a clever trick with the trace to show that trace( P) = K. An alternative way to obtain the same result is to observe that, since P is idempotent, the trace is equal to the rank (fact 2.3.9 on page 70). Prove directly that the rank of P equals K. Ex. 7.6.10. (Hard) At the end of the proof of theorem 7.3.1, it was claimed that the matrix X2 M1 X2 is invertible. Verify this claim. Ex. 7.6.11. Confirm (7.16) on page 206. Ex. 7.6.12. Using (7.16), show that for the simple OLS model y = β 1 1 + β 2 x + u, the ˆ 2 given x is σ2 / ∑ N ( xn − x ¯ )2 . variance of β n =1 Ex. 7.6.13. Show that in the case of (7.24), the F statistic in (7.23) can be rewritten as (7.25).5 Ex. 7.6.14. Suppose that assumption 7.4.1 holds, so that X and u are independent, and u ∼ N (0, σ2 I). Show that, conditional on X, 1. Py and My are normally distributed, and 2. Py and My are independent. Ex. 7.6.15. Verify the expression for yn in (7.27). 4 Hint: 5 Hint: Use the symmetry and idempotence of the matrix M1 . Note that My = MMc y by fact 3.1.5 on page 90. J OHN S TACHURSKI October 10, 2013 7.6. EXERCISES 221 7.6.1 Solutions to Selected Exercises Solution to Exercise 7.6.1. We have My = M(X β + u) = Mu ∴ SSR := My 2 = Mu 2 = (Mu) (Mu) = u M Mu = u MMu by symmetry of M. The result now follows from idempotence of M. Solution to Exercise 7.6.2. The respective proofs for claims 1–4 are as follows: 1. E [u] = E [E [u | X]] = E [0] = 0 2. E [um | xnk ] = E [E [um | X] | xnk ] = E [0 | xnk ] = 0 3. E [um xnk ] = E [E [um xnk | xnk ]] = E [ xnk E [um | xnk ]] = 0 4. cov[um , xnk ] = E [um xnk ] − E [um ]E [ xnk ] = 0 Solution to Exercise 7.6.3. By definition, var[u] = E [uu ] − E [u]E [u ] Since E [u] = E [E [u | X]] = 0, this reduces to var[u] = E [uu ]. Moreover, E [uu ] = E [E [uu | X]] = E [σ2 I] = σ2 I Solution to Exercise 7.6.4. Note that • My = M(X β + u) = Mu • Py = P(X β + u) = X β + Pu • E [My | X] = E [Mu | X] = ME [u | X] = 0 J OHN S TACHURSKI October 10, 2013 7.6. EXERCISES 222 Using these facts, we obtain cov[Py, My | X] = cov[X β + Pu, Mu | X] = E [(X β + Pu)(Mu) | X] From linearity of expectations and symmetry of M, this becomes cov[Py, My | X] = E [X βu M | X] + E [Puu M | X] Regarding the first term on the right-hand side, we have E [ X β u M | X ] = X βE [ u | X ] M = 0 Regarding the second term on the right-hand side, we have E [Puu M | X] = PE [uu | X]M = Pσ2 IM = σ2 PM = 0 ∴ cov[Py, My | X] = 0 Solution to Exercise 7.6.5. Since M is a function of X, we have E [Mu | X] = ME [u | X] = 0. As a result, ˆ,u ˆu ˆ | X ]E [ u ˆ | X] = E [ β ˆ | X] − E [ β ˆ | X] cov[ β ˆ (Mu) | X] = E [β = E [ β(Mu) + (X X)−1 X u (Mu) | X] = E [(X X)−1 X u (Mu) | X] = E [(X X)−1 X uu M | X] = σ 2 ( X X ) −1 X M ˆ,u ˆ | X] = 0, and the proof is done. Since X M = (MX) = 0 we have cov[ β Solution to Exercise 7.6.6. Regarding the claim that E [Py | X] = X β, our previous results and linearity of expectations gives E [Py | X] = E [X β + Pu | X] = X β + PE [u | X] = X β Regarding the claim that var[Py | X] = σ2 P, our rules for manipulating variances yield var[Py | X] = var[X β + Pu | X] = var[Pu | X] = P var[u | X]P = Pσ2 IP Using symmetry and idempotence of P, we obtain var[Py | X] = σ2 P. J OHN S TACHURSKI October 10, 2013 7.6. EXERCISES 223 Solution to Exercise 7.6.7. Similar to the solution of exercise 7.6.6. Solution to Exercise 7.6.9. Since rank(X) = dim(rng(X)) = K, to establish that rank(P) = dim(rng(P)) = K, it suffices to show that rng(X) = rng(P). To see this, suppose first that z ∈ rng(X). Since P is the projection onto rng(X), we then have z = Pz (see theorem 3.1.3 on page 86), and hence z ∈ rng(P). Conversely, if z ∈ rng(P), then, z = Pa for some a ∈ R N . By definition, P maps every point into rng(X), so we conclude that z ∈ rng(X). Solution to Exercise 7.6.10. To see that the matrix X2 M1 X2 is invertible, note that, in view of idempotence and symmetry of M1 , X2 M1 X2 = X2 M1 M1 X2 = X2 M1 M1 X2 = ( M1 X2 ) M1 X2 In view of fact 2.3.11, to show that this matrix is invertible, it suffices to show that the matrix is positive definite. So take any a = 0. We need to show that a ( M1 X2 ) M1 X2 a = ( M1 X2 a ) M1 X2 a = M1 X2 a 2 >0 Since the only vector with zero norm is the zero vector, it now suffices to show that M1 X2 a is non-zero. From fact 3.1.6 on page 90, we see that M1 X2 a = 0 only when X2 a is in the column span of X1 . Thus, the proof will be complete if we can show that X2 a is not in the column span of X1 . Indeed, X2 a is not in the column span of X1 . For if it were, then we could write X1 b = X2 a for some b ∈ RK1 . Rearranging, we get Xc = 0 for some non-zero c (recall a = 0). This contradicts linear independence of the columns of X. Solution to Exercise 7.6.11. Repeating (7.15) we have ˆ k = β k + (colk (X) M1 colk (X))−1 colk (X) M1 u β Since β k is constant, taking the variance of (7.30) conditional on X we obtain ˆ k | X] = var[Au | X] var[ β where A := (colk (X) M1 colk (X))−1 colk (X) M1 (7.30) Since A is a function of X, we can treat it as constant given X, and we obtain ˆ k | X] = A var[u | X]A = Aσ2 IA = σ2 AA var[ β J OHN S TACHURSKI October 10, 2013 31). EXERCISES 224 To complete the proof.24). since the only random variables in X are the random variables in x.6. For the model y = β 1 1 + β 2 x + u. RSSR − USSR USSR = R2 c 1 − R2 c (7. Hence.7. from (7.16) is valid. we can write ˆ 2 | x ] = σ2 / var[ β as was to be shown. In the case of (7.31) Consider first the left-hand side of (7. regarding the right-hand side of (7.16). equivalently. 2013 . Solution to Exercise 7. the definition of R2 c and some minor manipulation gives R2 PMc y 2 c = 1 − R2 Mc y 2 − PMc y c J OHN S TACHURSKI 2 October 10.6.24) we have n =1 ¯ )2 ∑ ( xn − x N ( RSSR − USSR )/K2 R2 N−K c = 2 USSR / ( N − K ) 1 − R c K2 or. and colk (X) is just x. this becomes RSSR − USSR USSR = Mc y − My My 2 2 2 On the other hand. this becomes ˆ 2 | X] = σ2 [(Mc x) Mc x]−1 = σ2 / var[ β ¯ )2 ∑ ( xn − x N n =1 Finally.12.13.31). we have ˆ 2 | X ] = σ 2 ( x M c x ) −1 var[ β Using symmetry and idempotence of Mc . We conclude that (7.6. Solution to Exercise 7. the matrix M1 is the centering annihilator Mc associated with 1. We need to show that in the special case (7. we just observe that AA = (colk (X) M1 colk (X))−1 colk (X) M1 M1 colk (X)(colk (X) M1 colk (X))−1 = (colk (X) M1 colk (X))−1 where the last equality is due to symmetry and idempotence of M1 (recall that M1 is an annihilator). M and X β are all constant.6. Solution to Exercise 7. This was already proved in exercise 7.18). Mu and X β + Pu are both normally distributed.31). Since linear (or affine) transformations of normal random vectors are normal. We have shown previously that My = Mu and Py = X β + Pu. Since they are normally distributed given X. to establish (7.7. we need only show that they are uncorrelated given X. It remains to show that Py and My are independent given X.6.6. When X is constant. Since we are conditioning on X we can treat it as constant. P. we need to show that Mc y − My My 2 2 2 = PMc y 2 Mc y 2 − PMc y 2 This is can be established using (6. J OHN S TACHURSKI October 10.14. EXERCISES 225 Hence.4. 2013 . The random variable xt is called the state variable.1. the friendliest is the scalar Gaussian AR(1) model.) Since xt is a well-defined random 226 .1 Linear Models In time series as in other fields. which takes the form x t +1 = α + ρ x t + w t +1 with {wt } ∼ N (0.1 Some Common Models In this section we introduce some of the most common time series models. Large sample OLS theory is in many ways more attractive and convincing than its finite sample counterpart. we use the techniques developed in this process to move from finite sample OLS to large sample OLS. Next.1) fully defines the time t state xt as a random variable for each t. (We’ll spell out how this works in some detail below. Of the linear time series models. Our first step is to introduce and study common time series models. Note that (8.Chapter 8 Time Series Models The purpose of this chapter is to move away from the IID restriction and towards the study of data that has some dependence structure over time. σ2 ) and x0 given IID (8. working from specific to more general. Such data is very common in economics and finance. 8. 8.1) Here α. the easiest models to analyze are the linear models. ρ and σ are parameters. For this reason.1) generalizes in serveral different directions. In what follows. Σ) for some a symmetric. but of the last p previous states. or VAR(1). yielding the vector AR(1) model. The scalar AR(1) model further generalizes to RK . where the next state xt+1 is a (linear) function not just of the current state xt . For example.2). The message is that higher order processes can be reduced to first order processes by increasing the number of state variables. we can remove the assumption that the shocks are normally distributed. in what follows we concentrate primarily on first order models. then (8. 2013 . The vector xt is called the state vector. For example. This distribution is often called the marginal distribution of xt . positive definite K × K matrix Σ.2) is called the Gaussian VAR(1). The dynamics of the process are given by x t +1 = a + Λ x t + w t +1 with {wt } ∼ φ and x0 given IID (8. Λ is a K × K matrix and φ is some distribution on RK . Another common generalization of the scalar AR(1) model is the scalar AR( p) model. we can reformulate it as a first order model.1. in which case it is simply called the scalar AR(1) model. J OHN S TACHURSKI October 10. xt and xt−1 . and the space RK in which it takes values is called the state space. The dynamics can then be expressed as x t +1 = α + ρ x t + γ y t + w t +1 y t +1 = x t We can write this in matrix form as x t +1 y t +1 =α 1 0 + ρ γ 1 0 xt yt + 1 0 w t +1 Notice that this is a special case of the VAR(1) model in (8. If φ = N (0. SOME COMMON MODELS 227 variable.8. we will be interested in the dynamics both of the state process { xt } and also of the corresponding distribution sequence {Πt }. To begin. it has a distribution Πt defined by Πt (s) := P{ xt ≤ s}. the AR(2) process has dynamics x t +1 = α + ρ x t + γ x t −1 + w t +1 Although xt+1 is a function of two lagged states. we define an additional state variable yt via yt = xt−1 .2) where a is a K × 1 column vector. The scalar Gaussian AR(1) model (8. the ARCH(1) version of which has dynamics IID 2 1/2 x t +1 = ( α 0 + α 1 x t ) w t +1 . In the model. 1) (8.3). When s is small we have τ (s) ≈ 0.1) process xt 2 σt+1 = σt wt 2 = α0 + α1 x t + α2 σt2 Another popular nonlinear model is the smooth transition threshold autoregression (STAR) model x t +1 = g ( x t ) + w t +1 (8. with the smoothness of the transition depending on the shape of τ . Combining these equations gives the dynamics for x displayed by σt2+1 = α0 + α1 xt t in (8.2 Nonlinear Models The previous examples are all linear models. J OHN S TACHURSKI October 10. but this is not always a good thing. 1] is an increasing function satisfying lims→−∞ τ (s) = 0 and lims→∞ τ (s) = 1. econometricians studying asset prices have moved away from the ARCH model in favor of a generalized ARCH (GARCH) model. xt = σt wt .3) The model arose as an effort to model evolution of returns on a given asset. Moreover. the simplest of which is the GARCH(1. many theoretical modeling exercises produce models that are not linear. Their simplicity means they cannot always capture the kinds of dynamics we observe in data. The evolution of σt is specified 2 . In this section we introduce several popular nonlinear models. Linear models are simple. When s is large we have τ (s) ≈ 1. returns xt are first written as the product of an IID shock wt and a timevarying volatility component σt .4) where g is of the form g(s) := (α0 + ρ0 s)(1 − τ (s)) + (α1 + ρ1 s)τ (s) Here τ : R → [0. 2013 .8. the dynamics transition between two different linear models.1.1. Thus. and g(s) ≈ α0 + ρ0 s. In recent years. That is. {wt } ∼ N (0. and g(s) ≈ α1 + ρ1 s. One well-known nonlinear model is the p-th order autoregressive conditional heteroskedasticity model (ARCH( p) model). SOME COMMON MODELS 228 8. Continuing in this fashion. .1. w1 . w t +1 ) with x0 ∼ Π0 (8. depending on the initial condition and the shocks up until date t. we see that. w2 ). 2013 . w2 ) x3 = G ( G ( G ( x0 . there is a neat expression for Ht in (8. Example 8. w1 ). By repeated use of (8.8.5) Here we assume that {wt }t≥1 is an IID sequence of R M -valued shocks with common density φ.7) October 10. .1.6). w1 ). Where necessary. πt will represent the corresponding density. w2 . The density Π0 is the distribution of x0 . time homogeneous) Markov process looks as follows: x t +1 = G ( x t . for any t. we’ll let Π t ( s ) : = P{ x t ≤ s } represent the marginal distribution of the state xt . and that G is a given function mapping the current state xt ∈ RK and shock wt+1 ∈ R M into the new state xt+1 ∈ RK . In other words. Indeed. . equation (8. As before.1. . . . The general formulation for a (first order. for all t ≥ 0 we have t −1 t −1 xt = α The proof is an exercise.5). wt ) (8. the state vector xt can be written as a function of x0 and the shocks w1 .1. w3 ) .3 Markov Models All of the previous examples in this section are special cases of a general class of process called Markov processes.5) pins down each xt as a well-defined random variable.6) clarifies the fact that (8.6) Although there may be no neat expression for the function Ht . The initial condition x0 and the shocks {wt }t≥1 are also assumed to be independent of each other. there exists a function Ht such that xt = Ht (x0 . SOME COMMON MODELS 229 8. w1 ) x2 = G ( G ( x0 . A simple example is the scalar linear AR(1) process x t +1 = α + ρ x t + w t +1 In this case. . . . J OHN S TACHURSKI k =0 ∑ ρk + k =0 ∑ ρ k w t − k + ρ t x0 (8. . for each t. wt . we obtain the sequence of expressions x1 = G ( x0 . w2 (ω ). w1 ( ω ). It’s worth mentioning that. although it’s often not made explicit. so that x t +1 = G ( x t .6). w t +1 ) = g ( x t ) + w t +1 with {wt } ∼ φ IID (8. because this density may not in fact exist. the current state xt and future shocks wt+ j are independent for every t and every j > 0. w1 . which is the conditional density of the next period state given the current state. In this case. wt . the transition density has the form p(s | s) = φ(s − g(s)) (8.5). and in all cases we consider. The idea is that an element ω of Ω is selected by “nature” at the start of the “experiment” (with the probability that ω lies in E ⊂ Ω equal to P( E)). each state vector xt is determined via xt (ω ) = Ht (x0 (ω ).2. independent of wt+ j . Let Φ be the being a little careless here.8) We can usually derive an expression for the transition density in terms of the model. For example. . In this case the random variable xt+1 is equal to zero with probability one.) However. w2 . Such a random variable does not have a density. w) = 0 for all s and w. In particular. wt (ω )) where Ht is the function in (8. . . . An important object in Markov process theory is the transition density. and will be denoted by p. w2 ( ω ). by the IID assumption.9) for some function g and density φ. p(· | s) := the conditional density of xt+1 given xt equals s (8.1.2. w3 ( ω ).1.10)? Let’s go through the argument for the scalar case. xt is a function of the random variables x0 . behind the all random vectors {xt } lies a single sample space Ω and a probability P. or stochastic kernel.1 follows from fact 2. SOME COMMON MODELS 230 Fact 8. For the Markov process (8. . . 2013 .1. . .1 In particular. This determines the initial condition x0 and the shocks wt as x0 ( ω ). the density will exist in most applications.8.1 on page 72. Fact 8. .10) How does one arrive at expression (8. take the process xt+1 = G ( xt . See the discussion in §1. .1. where the model is xt+1 = g( xt ) + wt+1 with wt+1 ∼ φ. From these. suppose that the shock is additive. (For example. w1 (ω ).4. and these are all. wt+1 ) where G (s. 1 I’m J OHN S TACHURSKI October 10. . and let F0 : = ∅. This is equivalent to the claim that the density of x = g(s) + w is equal to φ(s − g(s)). Ft is an information set for each t.. we need first to introduce the notion of a filtration. Recall from §3.3. · · · Then {Ft } is a filtration. Now let {mt } be a scalar stochastic process (i. The sequence {Ft } is called a filtration if.2 that an information set is just a set of random variables or vectors. A presentation along these lines is beyond the scope of these notes. so that the derivative of Φ is φ. since the mid 20th Century. x2 . it satisfies Ft ⊂ Ft+1 for all t. we have F (s ) := P{ x ≤ s } = P{ g(s) + w ≤ s } = P{w ≤ s − g(s)} = Φ(s − g(s)) The density we seek is the derivative of this expression with respect to s . the underlying meaning is almost identical. and if {mt } is adapted to Ft .1.1. you will learn that {Ft } is actually best thought of as an increasing sequence of σ-algebras. Let {xt } be a sequence of random vectors. Ft contains the information available at time t. F3 : = { x1 . in addition. SOME COMMON MODELS 231 cdf corresponding to φ. a sequence of random variables) and let {Ft } be a filtration. 2 If J OHN S TACHURSKI October 10. Ft represents the variables we know as of time t. we claim that the density of xt+1 when xt equals constant s is given by p(s | s) := φ(s − g(s)).3. To give a more formal definition. you learn measure theory. Martingales arise naturally in many kinds of economic and financial models. a martingale is a stochastic process evolving over time such that the best guess of the next value given the current value is the current value. 2013 . (For the definition of measurability see §3. then we can compute mt at time t as well. which is an increasing sequence of information sets. F2 : = { x1 .) In many applications. That is. and the requirement that the sequence be increasing reflects the idea that more and more information is revealed over time. Now.8. which is φ(s − g(s)). Intuitively. martingales have contributed to much progress in the foundations of probability theory. In other words.2. Let {Ft } be a sequence of information sets.4 Martingales Loosely speaking. Moreover.e. F1 : = { x1 }. However. 8.1. In fact this is the canonical example of a filtration. x3 }. Letting F be the cdf of x = g(s) + w. p(s | s) = φ(s − g(s)) as claimed.2. We say that {mt } is adapted to the filtration {Ft } if mt is Ft -measurable for every t.2 Example 8. x2 }. then E [mt | Ft+ j ] = mt for any j ≥ 0. Note that the unconditional mean of a martingale difference sequence is always zero. then { mt } is adapted to {Ft }. Let {ηt } be an IID sequence of random variables with E [η1 ] = 0. To define martingales we let {mt } be a sequence of random variables adapted to a filtration {Ft }. 2013 . If {mt } is adapted to {Ft }.2. SOME COMMON MODELS 232 Example 8. and satisfying E [|mt |] < ∞ for all t. m t = m t −1 + η t = j =1 ∑ ηj t J OHN S TACHURSKI October 10.3. From the definition of a filtration and fact 3. F2 : = { x1 . we say that {mt } is a martingale with respect to {Ft } if E [mt+1 | Ft ] = mt for all t We say that {mt } is a martingale difference sequence with respect to {Ft } if E [mt+1 | Ft ] = 0 for all t A martingale difference sequence is so named because if {mt } is a martingale with respect to {Ft }. Fact 8.5. because E [mt ] = E [E [mt | Ft−1 ]] = E [0] = 0 Example 8. then dt = mt − mt−1 is a martingale difference sequence with respect to {Ft }.6 on page 100.g. In this setting.1.1.3. See exercise 8.1.2 on page 97 it follows that mt is Ft+ j measurable.8. and let mt := ∑t j=1 η j . ηt might be the payoff on the t-th round of a game (e. poker). (We are assuming that wealth starts at zero and may take arbitrarily large negative values without the gambler getting ejected from the game and knee-capped by the mafia. · · · and mt := t−1 ∑t j=1 x j .1. If {Ft } is the filtration defined by F0 : = ∅. F1 : = { x1 }. For example. x3 }. The classic example of a martingale is a random walk..1.8. The result in fact 8. and mt is the wealth of a gambler after the t-th round.) In this case.2 now follows from fact 3. Hopefully the reason is clear: By adaptedness. x2 .3. F3 : = { x1 . we know that mt is Ft -measurable.4. x2 }. 1 A Simple Example: Linear AR(1) Consider again the scalar Gaussian AR(1) process x t +1 = α + ρ x t + w t +1 J OHN S TACHURSKI with {wt } ∼ N (0. rt is an interest rate and ρ is a discount factor. ηt }. . 8. consider an Euler equation of the form u (ct ) = E t 1 + r t +1 u ( c t +1 ) 1+ρ where u is the derivative of a utility function u.5. 1) October 10. where Ft contains all variables observable at time t. . 1978). E [ m t + 1 | F t ] = E [ η1 + · · · η t + η t + 1 | F t ] = E [ η1 | F t ] + · · · E [ η t | F t ] + E [ η t + 1 | F t ] = η1 + · · · η t + E [ η t + 1 | F t ] = η1 + · · · η t + E [ η t + 1 ] = η1 + · · · η t = m t Example 8. The “time t” expectation E t [·] can be thought of as a conditional expectation E [· | Ft ]. starting with a very simple case. under the theory. This has profound implications for asymptotic theory. . the Euler equation reduces to c t = E t [ c t +1 ] = : E [ c t +1 | F t ] Thus. DYNAMIC PROPERTIES 233 is a martingale with respect to Ft := {η1 . 2013 IID .2. Specializing to the case rt+1 = ρ and u(c) = c − ac2 /2. To understand his hypothesis. 8.8. That {mt } is adapted to {Ft } follows immediately from the definition of mt and Ft . such as the law of large numbers or central limit theorem. .2 Dynamic Properties The time series models discussed above can display very different dynamics from the simple IID data processes considered earlier in this text. A famous example of a martingale in economic theory is Robert Hall’s hypothesis that consumption is a martingale (Hall. In this section we try to unravel some of the mysteries. Moreover.1. consumption is a martingale with respect to {Ft }.2. 2.1 . For example.function ( init . 2013 .2. let’s begin with some simulated time series and see what we observe. if you look at the time series for ρ = 0.9 .1. DYNAMIC PROPERTIES 234 For simplicity. we take α = 1. xlab = paste ( " rho = " . -0. let’s assume that J OHN S TACHURSKI October 10.numeric ( n ) x [1] <. then the process does not diverge. it follows from this expression and fact 1.1. -1. you will see that. Whenever ρ is outside the interval (−1. the simulated time paths are quite sensitive to the value of the coefficient ρ.c (0.1) N <. each generated using a different value of ρ.2) ) # Arrangement of figures for ( rho in rhos ) { plot ( ar1ts (0 .1 . Listing 12 Code for figure 8.9 in figure 8. rho ) . rho ) . Six individual time series { xt } are shown in figure 8. N .7).1 # Generates an AR (1) time series starting from x = init ar1ts <. n . 1. on the other hand.8.alpha + rho * x [ t ] + w [ t ] } return ( x ) } rhos <. 0. In particular. 1). alpha .9 . Since the shocks {wt } are assumed to be normal. 1 . the series tend to diverge. type = " l " . |ρ| < 1.init w <.1 .rnorm (n -1) for ( t in 1:( n -1) ) { x [ t +1] <. -0. In order to learn about the dynamics of this process.200 par ( mfrow = c (3 . rho ) { x <. As suggested by the figure (experiment with the code to verify it for yourself). Let’s assume that this is the case. ylab = " " ) } We can investigate this phenomenon analytically by looking at expression (8. the variance of the shock wt has been set to one.6 on page 24 that xt will be normally distributed whenever x0 is either normal or constant. If. the process settles down to random motion within a band (between about 5 and 15 in this case). In all simulations. The code for generating the figure is given in listing 12. after an initial burn in period where the series is affected by the initial condition x0 . 0e+09 0.9 150 200 1.0e+00 0 50 100 rho = 1. 2013 . DYNAMIC PROPERTIES 235 3 2 1 −1 0 0 50 100 rho = 0.9 150 200 −4 0 0 2 4 6 50 100 rho = −0.1 150 200 −6e+07 0 0e+00 6e+07 50 100 rho = −1.1 150 200 −2 0 0 1 2 3 50 100 rho = −0.2.1 150 200 Figure 8.1: Dynamics of the linear AR(1) model J OHN S TACHURSKI October 10.1 150 200 15 10 5 0 0 50 100 rho = 0.8. A more formal definition is given below. For both Π0 := N (µ0 . convergence is from left to right. σ∞ ) := N d 1 α . when α = 0 and ρ = 0. where µ and σ are given constants. 1 − ρ 1 − ρ2 (8. and if you experiment with different choices of µ0 and σ0 .2) to Π∞ := N (µ∞ . The initial distribution in the figure is 2 ) with arbitrarily chosen constants µ = −6 and σ2 = 4. as described in the next section. we’ve pinned down the marginal distribution Πt for xt . on the other hand.9. it seems likely that the marginal distribution Πt = N (µt . then 1 α 2 and σt2 → σ∞ := µt → µ∞ := 1−ρ 1 − ρ2 In this case. σ0 0 0 expectation and variance to (8. then we will have Πt = Π∞ term ergodicity is sometimes used to signify that the process satisfies the law of large numbers. 2013 . σt2 ) of xt con2 ).2 and 8. In other Observe that this limit does not depend on the starting values µ0 and σ0 words. σt2 ) → Π∞ := N (µ∞ . Π∞ does not depend on Π0 . one can then show that this is indeed the case. we have shown that Πt = N (µt .8. global stability and the law of large numbers are closely related. σ∞ page 33. the distribution Π∞ has another special property: If we start with Π0 = Π∞ . 2 Πt = N (µt . Figures 8. The fact that Πt → Π∞ for any choice of Π0 is called global stability. you will see that convergence to the same distribution Π∞ always occurs.2. DYNAMIC PROPERTIES 236 2 ). Using fact 1. In particular. then the mean and variance diverge. The code is given in listing 13. or ergodicity. However. and of the corresponding densities πt to π∞ . as will be discussed at length there. |ρ| < 1. Applying our usual rules for x0 ∼ N (µ0 .5.2.7).11) 2 . we can also see that µt := E [ xt ] = α t −1 k =0 ∑ ρ k + ρ t µ0 and σt2 := var[ xt ] = t −1 k =0 2 ∑ ρ2k + ρ2t σ0 Since xt is normal and we’ve now found the mean and variance. σt2 ) = N t −1 α k =0 ∑ ρ k + ρ t µ0 . 3 The J OHN S TACHURSKI October 10.3 on verges weakly (see the definition in §2. t −1 k =0 2 ∑ ρ2k + ρ2t σ0 Notice that if |ρ| ≥ 1.3 illustrate convergence of Πt to Π∞ . That is.3 Besides being the limiting distribution of the sequence {Πt }.4. σ0 0 0 the sequence of cdfs and the sequence of densities. If. 05 0.20 −5 0 5 Figure 8. DYNAMIC PROPERTIES 237 0.2 0.8 1.00 −10 0.4 0.0 −5 0 5 Figure 8.10 0.8.0 −10 0.2.15 0. 2013 .3: Convergence of densities J OHN S TACHURSKI October 10.6 0.2: Convergence of cdfs 0. dnorm ( xgrid . we get Π2 = Π∞ .0. taking Π0 = Π∞ implies that xt has the same marginal distribution Π∞ for every t. Thus. and so on. as was to be shown. In particular. Using (b) one more time we get Π3 = Π∞ . mean = mu .8.5. then Πt+1 = Π∞ is also true.8 / (1 . IID.rho ^2) # mu _ 0 and sigma ^2 _ 0 xgrid <. In other words. showing that if Πt = Π∞ . mean = mu . The way to do this is by induction. then we would see only one curve. xlab = " " . in figure 8. sigma2 <.9 N <. length =200) plot ( xgrid . because xt and xt+ j are not independent (unless ρ = 0). sd = sqrt ( sigma2 ) ) . then Πt = Π∞ for all t is a very important point.2 we had started at Π0 = Π∞ . type = " l " .7). ylab = " " . The fact that if Π0 = Π∞ . and (b) Πt = Π∞ implies Πt+1 = Π∞ . logic is as follows: Suppose we know that (a) Π0 = Π∞ . DYNAMIC PROPERTIES 238 Listing 13 Code for figure 8.2. 7 .0. For example. sd = sqrt ( sigma2 ) ) ) } for all t.rho ^2 * sigma2 + 1 lines ( xgrid . 2013 . then we will see motion in the figure as the sequence Πt converges to Π∞ . the sequence of random variables { xt } is identically distributed. dnorm ( xgrid . We’ll say more about this in just a moment. Note also that for our model. if we start at any other cdf Π0 = Π∞ . however. It is not. which corresponds to Π∞ . using (b) again.seq ( -10 .30 # Number of densities to plot mu <.3 rho <. if. Next. Π∞ is called the stationary distribution of the process. For this reason. one can use the relation xt+1 = α + ρ xt + wt+1 . and as such it’s worth checking analytically as well. The sequence of distributions is constant. Hence Πt = Π∞ for all t. The details are left as an exercise (exercise 8. main = " " ) for ( i in 2: N ) { mu <.4 To verify the latter. 4 The J OHN S TACHURSKI October 10. Then (a) and (b) together imply that Π1 = Π∞ . we have only one stationary distribution.-6.rho * mu sigma2 <. In particular. the sequence { xt } does not satisfy the LLN.2. as we’ll see below. 2013 . ρ and σ are such that some stationary distribution Π∞ does exist. In this case. in other words. (The distribution Π∞ is stationary for this process. or. as we just established. Indeed. If | ρ | < 1. DYNAMIC PROPERTIES 239 8. the LLN can easily fail. So let’s restrict attention to the case where the parameters α. 6 For example. and hence x t = x t −1 = x t −2 = · · · = x 1 If x0 has some given distribution Π∞ . they required zero correlation between elements of the sequence. does not generally converge to anything. σ2 ) IID Suppose for now that α = 0. Let’s think about this how this connection might work. which is why I chose the symbol Π∞ . 5 The J OHN S TACHURSKI October 10. then xt will have that same distribution for all t. the dynamics reduce to xt+1 = xt . wt is identically equal to zero. then a unique stationary distribution always exists. if | ρ | ≥ 1 and σ > 0.2 Linear AR(1) Continued: The LLN For consistency of statistical procedures. ρ = 1 and σ = 0.6 and only exception is if x0 is a degenerate random variable. 1 T 1 T ¯ T : = ∑ x t = ∑ x1 = x1 x T t =1 T t =1 and x1 . If we admit nonzero correlation. then clearly Π t ( s ) : = P{ x t ≤ s } = P{ x1 ≤ s } = Π ∞ ( s ) for all t This tells us that { xt } is identically distributed. putting all its probability mass on a single point. 0).5 So under what conditions does the sample mean of the AR(1) process converge to the common mean of xt ? One necessary condition is that the sequence { xt } does indeed have a common mean. then no stationary distribution exists. with common distribution Π∞ . the global stability concept we have just investigated provides one way to obtain the LLN without IID data. The scalar and vector LLNs we have considered so far used the IID assumption. being a random variable. starting with the AR(1) model x t +1 = α + ρ x t + w t +1 with {wt } ∼ N (0. then.28. because if we fix any distribution and let x0 have that distribution. If this is not clear. In fact every distribution is stationary for this process. Fortunately. Setting σ = 0 means that wt ∼ N (0. some version of the LLN is almost always necessary.) However.8. go back to exercise 1.5.2. for any values of α and σ. xm ] n<m = 2 σ2 + 2 N N ∑ cov[ xn . we saw that when { xn } is an IID sequence of random variables with common mean µ and variance σ2 .4.1 (page 34). When correlation is non-zero. let’s recall the proof of the LLN for IID sequences in theorem 1.8. xm ] ¯N ] → 0 In the IID case.2 on page 32 implies For the proof. the property that correlations die out over time is closely related with global stability.12) We saw in the previous example (α = 0. we just observe that since E [ x ¯ N ] → 0 as N → ∞. In the theorem. Hence { xt } is identically distributed. In view of fact 1. we have ¯ N := x 1 N n =1 ∑ xn → E [ xn ] = µ N p ¯ N ] = µ. 1) IID (8.3.11) that. Furthermore. We need something more. This will be the case if the covariances die out relatively quickly.4. cov[ xn . and try to think about whether the LLN can be salvaged. xm ] are or not var[ x small. To investigate this. globally stable stationary distribution given by Π∞ := N J OHN S TACHURSKI 0. We want to know when ¯ T → E [ xt ] = µ∞ := x p sΠ∞ (ds) (8. fact 1. under the assumption −1 < ρ < 1.2. so that cov[ xn .5 on that the result will hold whenever var[ x page 28. the model has a unique.13) We saw in (8. Now let’s weaken the assumption of zero correlation. let’s take α = 0 and σ = 1. For example. and hence the convergence var[ x does indeed hold. the question of whether ¯ N ] → 0 depends whether or not most of the terms cov[ xn . 1 1 − ρ2 October 10. xn+ j ] ≈ 0 when j is large. so that Πt = Π∞ for all t. σ = 0) that the mere fact that { xt } is identically distributed is not enough. so the dynamics are x t +1 = ρ x t + w t +1 with {wt } ∼ N (0. xm ] = 0 for all n < m. DYNAMIC PROPERTIES 240 start the process off with Π0 = Π∞ . ρ = 1. we have 1 var N n =1 ∑ xn N 1 = 2 N n =1 ∑ var[xn ] + N 2 ∑ n<m N 2 cov[ xn . 2013 . we obtain xt+ j = ρ j xt + we now have cov[ xt+ j . 8. 2013 .e.4 is given in listing 14. |ρ| < 1). but now we need to look at more general (and complex) cases. xt ] = E [ xt+ j xt ] = E ρ xt + 2 E [ xt ]+ j j k =1 ∑ ρ j−k wt+k j j k =1 ∑ ρ j−k wt+k xt =ρ j 2 = ρ j E [ xt ]+ k =1 j k =1 ∑ ρ j−k E [ wt+k xt ] ∑ ρ j − k E [ w t + k ]E [ x t ] =ρ j 2 E [ xt ] (In the second last equality.3 Markov Process Dynamics Studying the dynamics of the AR(1) process has helped us build intuition. we used the fact that xt depends only on current and lagged shocks (see (8.8. the mean sΠ∞ (ds) of Π∞ is zero.12) in the AR(1) model (8. Let’s look again at the Markov J OHN S TACHURSKI October 10.7) on page 229). and hence xt and wt+k are independent. fixing j ≥ 1 and iterating with (8. to make the LLN work we need covariances to go to zero as we compare elements of the sequence that are more and more separated in time. The blue line is the plot ¯ T against T . The grey line is a simulated time series of the process.2. Indeed. As anticipated by the LLN. The LLN result (8.. DYNAMIC PROPERTIES 241 It turns out that the stability condition −1 < ρ < 1 is precisely what we need for the covariances to die out.2.13). The of x code for producing figure 8.) Since |ρ| < 1 we now have 2 cov[ xt+ j . xt ] = ρ j E [ xt ]= ρj →0 1 − ρ2 as j→∞ To summarize our discussion.13) is illustrated in figure 8.8. The condition for covariances going to zero is exactly the condition we require for global stability (i. Note that for this model. In the simulation.4. we see that x ¯ T converges to zero. we set ρ = 0. 0) J OHN S TACHURSKI October 10.1 for ( t in 2: N ) x [ t +1] <.rho * x [ t ] + rnorm (1) for ( T in 1: N ) sample _ mean [ T ] <. DYNAMIC PROPERTIES 242 x −4 0 −2 0 2 4 500 1000 Index 1500 2000 Figure 8.8 N <.4 rho = 0.numeric ( N ) sample _ mean <.2000 x <. type = " l " ) lines ( sample _ mean .2.numeric ( N ) x [1] <. col = " gray " .8. col = " blue " . lwd =2) abline (0 .sum ( x [1: T ] / T ) plot (x . 2013 .4: LLN in the AR(1) case Listing 14 Code for figure 8. w2 . which includes the AR(1) model as a special case.14) says that this probability is π∞ (s ). These definitions include and formalize the stability-related definitions given for the scalar AR(1) model. .15) if π∞ (s ) = p(s | s)π∞ (s) ds for all s ∈ RK (8. .15) Here {wt }t≥1 is an IID sequence of RK -valued shocks with common density φ. This is the right-hand side of (8. xt+1 ∼ π∞ whenever xt ∼ π∞ .5) on page 229. the analysis can be somewhat tricky. wt ) ∈ B} where Ht is defined on page 229. and x0 has density π0 . a density π∞ on RK is called stationary for the process (8. Suppose that xt ∼ π∞ . with common marginal distribution π∞ . .8).2. the presentation below we’re going to work with densities rather than cdfs because the presentation is a little easier. In particular.8. as defined in (8. for the more general Markov model we are studying now. summed over all possible s and weighted by the probability that xt = s. 7 In J OHN S TACHURSKI October 10.14) A stationary density π∞ is called globally stable if the sequence of marginal densities {πt }∞ t=0 converges to π∞ for any choice of initial density π0 . w1 .7 Let’s work towards finding conditions for global stability. Here is a useful result pertaining to the additive shock model x t +1 = g ( x t ) + w t +1 (8. Let {xt } be a Markov process with stationary distribution π∞ .14). If x0 ∼ π∞ . the probability that xt+1 = s should be equal to the probability that xt+1 = s given xt = s. Given p. In other words. The equality in (8. First we need some definitions. Let’s start with the definition of stationary distributions. we have the following result: Fact 8.14) is as follows. let πt denote the density of xt .2. then {xt } is identically distributed. The interpretation of (8.1. Let p(· | s) be the transition density of the process. While stability or instability of linear models like the AR(1) process is relatively easy to study. As a consequence of the preceding discussion. for B ⊂ RK . For each t ≥ 1. Informally. . so that this sequence of densities will converge to a unique limit for all initial π0 . This is the reason π∞ is called a “stationary” distribution. we have B πt (s) ds = P{xt ∈ B} = P{ Ht (x0 . DYNAMIC PROPERTIES 243 process (8. 2013 . 2.2.1 are satisfied. 9 The condition that x ∼ π is actually unnecessary.2. | g(s)| = |ρ||s|.2. To see this. 1 − ρ2 ) Clearly the distribution N (0. Indeed.2. Example 8. we need to be able to check the conditions of theorem 8. This caries over to the general Markov case treated here. There are many other conditions for this kind of stability.16) is satisfied with fact the conditions of theorem 8.8 We’ll look at how to check the conditions of theorem 8.2. the stability conditions in theorem 8. Let {xt } be an generated by (8.8.2.1.1 and 8. 1) IID The conditions of theorem 8.1 in a moment.2. g(s) = ρ|s| and {vt } ∼ N (0.1 is a very handy result.1 hold. The best way to learn to do this is by looking at examples. 2013 .1.2.1. Suppose that the conditions of theorem 8.1 are rather strong (in order to make the statement of the theorem straightforward).2. so that (8. If the density φ has finite mean and is strictly positive everywhere on RK . See Stachurski (2009) for a more formal discussion of this LLN.15) with x0 ∼ π∞ . Let π∞ be the unique stationary density. We saw in the scalar AR(1) case that this connection is rather close. but it means that E [ h ( x )] = h(s)π∞ (s)ds ∞ t 0 for all t. Theorem 8.2. Here’s a variation on the scalar “threshold autoregressive” model: xt+1 = ρ| xt | + (1 − ρ2 )1/2 wt+1 with − 1 < ρ < 1 and {wt } ∼ N (0. Moreover.2. we can rewrite the model as IID xt+1 = g( xt ) + vt+1 . the function g is continuous and there exist positive constants λ and L such that λ < 1 and g(s) ≤ λ s + L for all s ∈ RK (8.1 are sufficient for the LLN: Theorem 8. 8 In J OHN S TACHURSKI October 10.16) then (8.15) has a unique stationary distribution that is globally stable.2. but before that let’s consider the connection between this stability theorem and the LLN.2.2. and references containing proofs. although it is but one example of a condition ensuring global stability. If h : RK → R is any function such that |h(s)|π∞ (s)ds < ∞. based on a variety of different criteria. For further discussion of the Markov case see Stachurski (2009) and the references therein. 1 − ρ2 ) of vt has finite mean and a density that is everywhere positive on R. which makes the result a little easier to digest. DYNAMIC PROPERTIES 244 Theorem 8. then9 1 T t =1 ∑ h(xt ) → E [h(xt )] = T p h(s)π∞ (s)ds as T→∞ To apply theorems 8. φ is the standard normal density and Φ is the standard normal cdf. The two definitions are equivalent.2. The second definition is the most useful for when it comes to numerical computation.16) applied to the law of motion (8.1. We then have g(s) ≤ Λs + L = Λs s s + L ≤ ρ(Λ) s + L =: λ s + L We can now see that the drift condition (8. Applying the triangle inequality | a + b| ≤ | a| + |b|. where q := ρ(1 − ρ2 )−1/2 . J OHN S TACHURSKI October 10.1.1 on page 52).2.17) To study the dynamics of this process. Consider again the VAR(1) process from (8.8.2. using the triangle inequality for the norm (fact 2. where the function g is given by g(s) := (α0 + ρ0 s)(1 − τ (s)) + (α1 + ρ1 s)τ (s) and τ : R → [0. and a unique. While for many Markov processes the stationary density has no known closed form solution.2. in the present case the stationary density is known to have the form π∞ (s) = 2φ(s)Φ(qs).3. In this case. By assumption.16) will be satisfied whenever the spectral norm of Λ is less than one. globally stable stationary density exists.17). Example 8. we obtain | g(s)| ≤ |(α0 + ρ0 s)(1 − τ (s))| + |(α1 + ρ1 s)τ (s)| ≤ |α0 | + |α1 | + |ρ0 | · |s(1 − τ (s))| + |ρ1 | · |sτ (s)| 10 Readers familiar with the notion of eigenvalues might have seen the spectral norm defined as the square root of the largest eigenvalue of Λ. with x t +1 = a + Λ x t + w t +1 (8. 1]. DYNAMIC PROPERTIES 245 λ = |ρ| and L = 0.2. Consider the STAR model introduced in §8. and hence all the conditions of theorem 8. and. we have λ < 1. it’s useful to recall the definition of the spectral norm of Λ. 2013 . Example 8. g(s) = a + Λs.1 are satisfied. we have g(s) = a + Λs ≤ a + Λs Let λ := ρ(Λ) and let L := a . which is given by10 ρ(Λ) := max s =0 Λs s Let us consider the drift condition (8.2.2). Here we do not wish to assume J OHN S TACHURSKI October 10.2 respectively). Suppose further that {mt } is identically distributed. τ is continuous.) Next. DYNAMIC PROPERTIES 246 Letting L := |α0 | + |α1 | and λ := max{|ρ0 |.2 yields 1 T t =1 ∑ 1{ x t ≤ s } → Π ∞ ( s ) T p In other words. suppose that {mt } is a martingale difference sequence with respect to filtration {Ft }. Let’s test this for the model in example 8.5 is given in listing 15. and the fit is pretty good.1 and 1.2 is if we take h(xt ) = 1{xt ≤ s} for some fixed s ∈ RK .4. the condition (8. The ecdf is a nonparametric estimator of the cdf.95. and E [ m2 1 ] < ∞. using R’s default choice of K and δ. This is the blue line in figure 8. then it has an everywhere positive density on R. Martingale difference sequences are important to us because they are good candidates for the LLN and CLT. the ecdf converges to the stationary cdf Π∞ .2. Provided that the cdf Π∞ has a density. If. where q := ρ(1 − ρ2 )−1/2 . If the variables { mt } are also independent. The code for producing figure 8.1. just as for the IID case. we generate a time series.4. the stationary density is known to be π∞ (s) = 2φ(s)Φ(qs). say. 8.8. Hence the process is globally stable.5.1 are satisfied.2.2. we have E [1{ x t ≤ s } ] = P{ x t ≤ s } = Π ∞ ( s ) so in this case theorem 8.3.2.16) is satisfied. (The value of ρ is 0. 2013 .8) on page 13. then g is continuous. the same idea works for the standard nonparametric density estimator discussed in §4. and plot s− xt T 1 the nonparametric kernel density estimate T δ ∑t=1 K ( δ ) as a black line. To see this.2. in addition. For this model. then the classical LLN and CLT apply (theorems 1.4 Martingale Difference LLN and CLT Next we consider asymptotics for martingale difference sequences.4. One interesting special case of theorem 8. φ is the standard normal density and Φ is the standard normal cdf. |ρ1 |}. we then have | g(s)| ≤ λ|s| + L If both |ρ0 | and |ρ1 | are strictly less than one.2. By (1. Here T = 5000. a normal distribution. and all the conditions of theorem 8. If the distribution of the shock has. 2 0.5 0.3 0. xlab = " " .rho / sqrt (1 .6 −1 0 1 2 3 4 Figure 8.numeric ( T ) x [1] = 0 for ( t in 2: T ) { x [ t ] = rho * abs ( x [t -1]) + sqrt (1 . 4 .function ( s ) { return (2 * dnorm ( s ) * pnorm ( s * delta ) ) } lines ( xgrid .seq ( -4 .1 0.5 rho <.2.5: Stationary density and nonparametric estimate Listing 15 Code for figure 8.rho ^2) T <.0 0.95 delta <.4 0. pistar ( xgrid ) . length =200) pistar <.rho ^2) * rnorm (1) } plot ( density ( x ) . DYNAMIC PROPERTIES 247 0. ylab = " " . col = " blue " ) J OHN S TACHURSKI October 10.5000 x <. main = " " ) xgrid <. 2013 .8.0. 3. γ2 := E [m2 t ] is positive and finite.11 We will use theorem 8.2. we might hope that the LLN and CLT will still hold in some form. and hence E [E [mt+ j mt | Ft+ j−1 ]] = E [mt E [mt+ j | Ft+ j−1 ]] = E [mt · 0] = 0 This confirms that martingale difference sequences are uncorrelated. Given the lack of correlation.18) can be proved in exactly the the same way we proved the classical LLN in theorem 1. in addition. 11 Deriving J OHN S TACHURSKI October 10. mt ] = E [mt+ j mt ] = E [E [mt+ j mt | Ft+ j−1 ]] Since t + j − 1 ≥ t and {Ft } is a filtration.2. as claimed.4. theorem 7. we know that mt is Ft+ j−1 -measurable.3.3 Maximum Likelihood for Markov Processes [roadmap] theorem 8. 8. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 248 independence. then 1 T t =1 ∑ mt → 0 T p as T→∞ (8. 2013 .1. and 1 T then 2 ∑ E [ m2 t | F t −1 ] → γ T p as T→∞ t =1 √ T 1 T t =1 ∑ mt = T −1/2 T t =1 ∑ mt → N (0.3 is a consequence of a martingale CLT proved in Durrett (1996. We have cov[mt+ j .19) The LLN result in (8.3 from the result in Durrett requires some measure theory. fix t ≥ 0 and j ≥ 1.18) If.2. If you know measure theory then you should be able to work out the proof.3 in our large sample OLS theory below. the following result is true: Theorem 8. γ2 ) T d as T→∞ (8. and is beyond the scope of these notes. Indeed. Let {mt } be identically distributed.8. If {mt } is a martingale difference sequence with respect to some filtration {Ft }. Theorem 8. To see this.2.4). but with martingale difference sequences we do at least have zero correlation. . is pretty much the defining property of a first order Markov process. . . To obtain a convenient expression for the joint density in this general case. In this case. If the process we are observing has been running for a while. . and the maximum likelihood estimate (MLE) of θ is ˆ := argmax L(θ ) θ where L(θ ) = n =1 ∏ pθ ( xn ) N (8. then the joint density is no longer the product of the marginals. wt+1 ) given on page 229. time homogeneous) Markov process. we get p ( s1 . . . . on the other hand. 2013 . s T ) = p ( s1 ) t =1 ∏ p ( s t +1 | s 1 . . . and follows from the general expression xt+1 = G ( xt . for a (first order. . MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 249 8.8.21) again. . . the joint density is the product of the marginals. given global stability. our data x1 . nothing in this expression changes if it by induction if you wish. . it is not unreasonable to assume that x1 ∼ π∞ . . . as we saw in §4.3. s2 ) p ( s2 | s1 ) p ( s1 ) Extending this from T = 3 to general T we get12 T −1 p ( s1 . . our expression for the joint density becomes T −1 p ( s1 .20) If.1. then. x3 . 13 This 12 Check J OHN S TACHURSKI October 10. .3. . x T are observations of a globally stable Markov process with transition density p(st+1 | st ) and stationary density π∞ . .3. s3 ) = p ( s3 | s1 . st ) = p(st+1 | st ). s2 ). . let’s begin by constructing a joint density for the first three data points x1 . . s2 ) p ( s1 . s2 . x2 . . . then. s T ) = π ∞ ( s1 ) t =1 ∏ p ( s t +1 | s t ) where we are using the fact that. x N from common density pθ . s2 . this time to p(s1 . Suppose that x1 . Using (1. . s2 ) Applying (1. . we can write the joint density as p ( s1 .13 Finally. . . s t ) We can specialize further if we are dealing with a Markov process.21) on page 26. s3 ) = p ( s3 | s1 .1 The Likelihood Function If we have IID scalar observations x1 . . p(st+1 | s1 . x T is a time series where the independence assumption does not hold. {wt } ∼ N (0. then there are many elements in the sum. for a great many processes there is no known analytical expression for this density. abusing notation slightly. let’s suppose now that p depends on an unknown parameter vector θ ∈ Θ. and the influence of a single element is likely to be θ is formally defined by negligible. If xt = s. J OHN S TACHURSKI October 10. a + bs2 ). The log-likelihood function is then given by as π∞ T −1 θ (θ ) = ln π∞ ( x1 ) + t =1 ∑ ln pθ (xt+1 | xt ) In practice it is common to drop the first term in this expression. There are two reasons.14 Here we’ll follow this convention and. then xt+1 ∼ N (0. 1) IID (8. so the joint density of a Markov process x1 . Since the stationary density π∞ is determined by pθ (see (8. First.14). . . using simulation. write T −1 (θ ) = t =1 ∑ ln pθ (xt+1 | xt ) (8.3.21) Turning to the likelihood function. s T ) = π ∞ ( s1 ) t =1 ∏ p ( s t +1 | s t ) (8. Second. . MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 250 we shift to the vector case. and write pθ . particularly when the data size is large. 2013 . The discussion surrounding figure 8. .5 gives some idea. A better technique is discussed in Stachurski and Martin (2008).8.14) on page 243) we indicate this dependence by writing it θ . . . . and hence p(s | s) = (2π ( a + bs2 ))−1/2 exp − 14 In ( s )2 2( a + bs2 ) this situation it is still possible to compute the density numerically. x T with transition density p(st+1 | st ) and x1 ∼ π∞ has the form T −1 p ( s1 . .2 Example: The ARCH Case Recall the ARCH process 2 1/2 xt+1 = ( a + bxt ) w t +1 . if the data size is large.3.23) The transition density for this model is the density of xt+1 given xt = s. even though the stationary density π∞ (8.22) 8. and we observe a time series x1 .61. ˆ. b ) = t =1 ∑ 2 xt 1 +1 2 − ln(2π ( a + bxt )) − 2) 2 2( a + bxt (8. .15 From (8. x500 is generated using the true parameter values a = b = 0. b) and data corresponding to the time series x1 . In each of the four simulations. dropping terms that do not depend on a or b.6. To get more accurate estimates. then one can show that the process is globally stable.8. a rough guess of the MLEs can be obtained just by looking for maximizers in the figures.e. there ˆ and b in (8. we can rewrite this (abusing notation again) as T −1 ( a.1 (page 244) cannot be used to check global stability. the log-likelihood function is T −1 ( a. and the vertical axis is b values. 2013 . J OHN S TACHURSKI October 10.5. b ) = − t =1 ∑ 2 xt ln zt + +1 zt where 2 zt := a + bxt (8. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 251 Since a + bs2 is the conditional variance of xt+1 . . in simulation (a).2. .44 and b to be around a form of analytical or numerical optimization.25). the parameters a and b are restricted to be nonnegative. if b < 1. x500 generated by these parameters. . and multiplying by 2 (an increasing transformation). . the MLEs look ˆ = 0. b (a Four different simulations of are given in figure 8. Since the graph of the function is three dimensional (i.3. On the other hand. and the function in (8. with theta corresponding to ( a. we have plotted it using contour lines and a color map. The function arch_like(theta.25) is then plotted. If you wish to verify global stability then have a look at the techniques in Chapter 8 of Stachurski (2009). . .25) Let’s run some simulations to see what this function looks like. theorem 8. In order to estimate a and b. In each figure.25). and obtain the MLEs a ˆ ) that maximizes ( a. ˆ as the vector ˆ and b we form the likelihood function (8. x T . The code for producing one of these figures (modulo randomness) is given in listing 16. . For example. the function has two arguments). .24) Rearranging. Moreover. b). Lighter colors refer to larger values. Thus. 15 Unfortunately. . because the shock is not additive. . In the simulations we will set T = 500 and a = b = 0. . data) represents in (8. unbeknownst to us.5.5.22). the true parameter values are a = b = 0. a separate data set x1 . we don’t have any analytical expressions for the MLEs because setting the two partial derivatives of ˆ . The horizontal axis is a values. we imagine the situation where. For this problem.25) to zero does not yield neat expressions for a are many numerical routines we can use to obtain the MLEs for a given data set. we can use some ˆ = 0. . . which is −1 times . The return value of optim is a list. However. and the approximate minimizing vector is one element of this list (called par). for most realizations of the data and starting values. given the definition of arch_like in listing 16 and a sequence of observations x1 . neg _ like . Any function g : R → R. which is variation on the Newton-Raphson algorithm.3. In J OHN S TACHURSKI October 10. . and how to code up simple implementations on your own. The first argument to optim is a vector of starting values (a guess of the MLEs). b ) neg _ like <. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 252 The simplest approach is to use one of R’s inbuilt optimization routines. and optim is no exception.function ( theta ) { return ( . x T stored in a vector xdata. and let s0 be some initial point in R that we think (hope) is somewhere near a root. let g : R → R. method = " BFGS " ) Here optim is an built-in R function for numerical optimization of multivariate functions. For this reason. interior optimizers are always roots of the objective function’s first derivative.3. given a ¯ ∈ R such that g(s ¯) = 0.3 The Newton-Raphson Algorithm The Newton-Raphson algorithm is a root-finding algorithm. .optim ( start _ theta . the function that we pass to optim is neg_like. 0. there’s no guarantee that it will. To describe the algorithm. The next section will get you started. In case of problems. The last argument tells optim to use the BFGS routine.35) # An initial guess of (a . for differentiable functions. but what we can do is move to the root of the function which forms the tangent line to g at s0 . 8.c (0. the function arch_like can be optimized numerically via the commands start _ theta <. you will find that the algorithm converges to a good approximation to the global optimizer.arch _ like ( theta .65 . In this particular set up.8. Most built-in functions in most languages perform minimization rather than maximization. it’s useful to know how these kinds of algorithms work. xdata ) ) # xdata is the data } opt <. let’s begin with the root-finding problem and then specialize to optimization. the algorithm searches for points s root-finding algorithm can be used to optimize differentiable functions because. We don’t know how to jump from s0 straight to a root of g (otherwise there would be no problem to solve). For example. 2013 . To begin. In other words. 4 −4 1 0 −4 −4 20 −4 −4 12 −4 22 −4 −4 18 24 −4 16 −4 −4 26 −4 34 −40 0 04 −4 44 −4 10 6 −40 2 −41 08 −4 −4 18 16 −4 26 0.5 a 0.3 −3 98 −3 94 −3 92 2 −40 0 −40 8 −39 6 −39 −3 9 0.4 0.4 −4 −4 14 1 −4 8 −458 −46 −464 −468 −476 −484 −41 −4 −42 −43 0 0 2 20 3 −4 6 −41 −42 4 26 −4 4 −472 0 −47 8 −47 −47 4 2 −48 −480 −490 −488 8 −42 6 −43 −4 44 −48 −49 −494 6 6 −49 2 4 −50 6 −51 0.8 0.7 0.3 −43 2 −498 −506 −53 0 0.4 0.5 b b 0.5 0. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 0.3 −42 0.7 −42 6 −41 0 6 −41 −4 −4 14 12 4 −41 −41 2 −41 0 (a) Simulation 1 (c) Simulation 3 0.6 0.6 6 −400 −396 −404 −384 −386 −388 −398 −402 0.5 −448 −446 0 −4 −394 −444 −442 4 −4 −39 2 00 02 −398 8.3 8 −43 −44 6 −4 56 −500 −512 −508 0 −52 0.6: Simulations of the function (8.6 0.3 3 −4 2 −4 4 −41 0 −42 8 38 −4 0.7 0.3 1 −4 4 −40 0 8 −40 6 −40 4 0.7 −42 4 −4 18 6 −42 −4 22 −43 4 2 −43 8 −42 −4 26 8 −43 −43 6 −43 −42 6 2 −42 0 −42 −42 0 4 8 −41 0.−4 36 −4 30 −4 34 −42 4 −42 68 −4 66 −4 64 −4 2 −4 32 −42 6 −41 4 −41 0 62 −4 60 −4 58 6 8 −41 −4 28 −4 26 2 −41 0 −41 8 −40 −4 5 −4 54 −4 −4 52 −396 −440 a 0 −4 2 0.4 08 1 −4 2 0.4 0.8 0 −43 0.5 −390 0.5 b 0.8 253 0.8 0.6 −40 a −39 0 0.3 0.3 −4 08 0.6 October 10. 2013 0.5 b 0.5 a −4 −438 06 Figure 8.4 96 −3 14 02 J OHN S TACHURSKI −390 −392 −394 −388 −3 80 −382 −386 8 −40 06 −3 92 −4 0.8 0.8 0.7 0.7 0.7 0.6 −4 50 −436 −434 0.8 50 0.4 0.3.6 0.6 0.8 .25) with T = 500 (d) Simulation 4 (b) Simulation 2 0.7 −45 2 −4 04 −4 06 −4 0 −4 0 −4 −4 54 −45 6 −46 0 2 6 −46 0. 2013 . col = topo .3 .8 . b .theta [1] + theta [2] * X ^2 return ( .3 .matrix ( nrow =K . b .8 . b . n =500) { x <.arch _ like ( theta . 0.data [ .numeric ( n ) x [1] = 0 w = rnorm ( n ) for ( t in 1:( n -1) ) { x [ t +1] = sqrt ( a + b * x [ t ]^2) * w [ t ] } return ( x ) } xdata <. b [ j ]) M [i .length ( data ) ] # All but last element Z <.sim _ data (0. nlevels =40 . 0. colors (12) ) contour (a . j ] <.6 arch _ like <. xdata ) } } image (a .data [ -1] # All but first element X <.3. M .function ( theta . M . length = K ) b <. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 254 Listing 16 Code for figure 8. 0.sum ( log ( Z ) + Y ^2 / Z ) ) } sim _ data <.50 a <.5) # True parameters K <. ncol = K ) for ( i in 1: K ) { for ( j in 1: K ) { theta <.c ( a [ i ] .5 .seq (0. data ) { Y <.seq (0.function (a . length = K ) M <. add = T ) J OHN S TACHURSKI October 10.8. This can be done by applying the Newton-Raphson algorithm to g .7: The first step of the NR algorithm other words.8. and there have been many attempts to make the procedure more robust. suppose now that g : R → R is a differentiable function we wish to maximize.7.3. which is given by ˜ (s) := g(s0 ) + g (s0 )(s − s0 ) g ( s ∈ R) ˜ . The point s1 is taken as our next guess of the root. and the procedure is repeated. we replace g with its linear approximation around s0 . The R function optim described above is a child of this process. solving for the root. and so on. taking the tangent of g at s1 . This point is represented as s1 in figure 8. then g (s∗ ) = 0. MAXIMUM LIKELIHOOD FOR MARKOV PROCESSES 255 g ¯ g ¯ s s1 s0 Figure 8. J OHN S TACHURSKI October 10. and the value and solve for the root of g is easily seen to be s1 := s0 − g(s0 )/ g (s0 ). This generates a sequence of points {sk } satisfying s k +1 = s k − g(sk ) g (sk ) There are various results telling us that when g is suitably well-behaved and s0 is ¯. Let’s suppose that g is a function of two arguments. which yields the sequence s k +1 = s k − g (sk ) g (sk ) (8.26) We can extend this algorithm to the multivariate case as well. 2013 . then sequence {sk } will converge to s ¯. Hence it is natural to begin our search for maximizers by looking for roots to this equation. We know that if s∗ is a maximizer of g. suppose that g is twice differentiable 16 In practical situations we often have no way of knowing whether the conditions are satisfied. In particular.16 sufficiently close to a given root s To move from general root-finding to the specific problem of optimization. Markov switching. y) := g1 ( x . Max likelihood with latent variables. 8. b ) : = ∂b T −1 t =1 ∑ 2 xt 2 xt +1 z2 t − 1 zt while the second partials are ∂2 ( a. the Newton-Raphson algorithm for this two dimensional case is the algorithm that generates the sequence {( xk . Let zt be as defined in (8.8 show four iterations of this procedure.26). HMM.27). zt ∂ ( a. and so on. 0. factor models? 17 We 18 As are assuming that the Hessian matrix is invertible. y) g22 ( x. b ) : = ∂ a∂b T −1 t =1 T −1 t =1 ∑ 2 xt 1 +1 .5 and T = 500. GARCH. y) g21 ( x. gij second derivative cross-partial. yk ) − [∇2 g( xk . yk ) from some initial guess ( x0 . By analogy with (8. J OHN S TACHURSKI October 10.8. and we are already close to the global optimum. y) Here gi is the first partial of g with respect to its i-th argument.4 Models with Latent Variables To be written. y) ∈ R2 are defined as ∇ g( x.35). b ) : = ∂a T −1 t =1 ∑ 2 xt +1 z2 t − 1 . the simulation uses a = b = 0.65.4. y0 ). The first partials are then ∂ ( a. y ) g2 ( x .17 (8. Replication of this figure (modulo randomness) is left as an exercise. before. b ) : = ∂ b2 T −1 t =1 ∑ 4 xt 2 xt 1 +1 − 2 3 z2 z t t ∑ 2 xt 2 xt 1 1 −2 + 3 2 zt zt From these expressions we can easily form the gradient vector and the Hessian. 2013 .27) For the sake of the exercise.25). y) g12 ( x. pick an initial guess. Figure 8. yk )]−1 ∇ g( xk . starting from ( a0 . yk )} defined by ( xk+1 . yk+1 ) = ( xk . let’s apply this to maximization of (8. MODELS WITH LATENT VARIABLES 256 and g : R2 → R. b ) : = ∂ a2 The cross-partial is ∂2 ( a. y ) and ∇2 g ( x .18 In this case the convergence is quick. b0 ) = (0.25). − 2 3 z2 z t t ∂2 ( a. and iterate according to (8. The gradient vector and Hessian of g at ( x. y ) : = g11 ( x. 3: If a xn → a x for every a ∈ RK .8: Newton-Raphson iterates −4 24 −3 92 −3 98 −3 06 86 0.5. 8. prove the following part of fact 2.4.5 Exercises p p p Ex.5 a 0.1.5.1 (page 31) as appropriate.e.5.5.3 is a warm up to this exercise.2. 8. Ex.8. EXERCISES 257 0. Show that xn → 0.3.5. xn → 0 and yn → 0).5.5.4.5. 8.4 0. Confirm the claim N (x Ex.5.) √ d ¯ N − µ) → N (0.1 on page 75.7 0.5. Let {xn } be a sequence of vectors in R2 .. Let {xn } be an IID sequence of random vectors in RK with E [xn ] = 0 and var[xn ] = IK .7 −40 6 −39 0 2 b 0.8 −41 −40 −39 4 −41 −40 0 8 4 −40 6 2 −41 −39 −40 4 8 −38 6 −38 4 −38 −41 8 8 2 −38 0 −38 −39 2 −41 6 −39 0 0.5 0.8 8. 2013 p p p p p p .3 0.3 0.5.4 −370 −3 88 −372 6 −4 00 −4 14 −4 −3 90 −3 80 −3 8 −3 84 −3 2 −374 0. Ex.6.5. then xn → x. (Note that exercise 8.2: If Xn → X and Yn → Y. Ex. Let 1 N ¯ N := ¯N 2 x ∑ xn and y N := N · x Nn =1 J OHN S TACHURSKI October 10.1. Confirm the following claim in fact 2. yn ) for each n. 8. Verify fact 2. 8. Suppose that xn → 0 (i. Σ) in theorem 2.6 −3 −376 78 78 Figure 8. then Xn Yn → XY whenever the matrices are conformable.5. Using fact 1. Ex.6 −368 4 −39 4 −40 −40 8 −39 0. where xn := ( xn . 8. Is {ηt } a martingale with respect to {Ft }? 2. If not. 8. Is {zt } a martingale difference sequence with respect to {Ft }? Why or why not? Ex.9. Is {κt } a martingale with respect to {Ft }? 3. Look in the course notes for specific examples of martingales. Suppose that xt+1 = α + ρ xt + wt+1 . and let {mt } be a martingale with respect to {Ft }. give a proof. . Let mt := j =1 ∑ ηj t and κ t : = m2 t −t Show that {κt } is a martingale with respect to {Ft }. Let {Ft } be a filtration. 8.5 and let {Ft } be the filtration defined by Ft := {η1 . . 1 − ρ 1 − ρ2 and wt+1 ∼ N (0. 8. Ex. 8. 2013 . Let {ηt } be an IID sequence of scalar random variables with E [η1 ] = 0 and var[η1 ] = σ2 > 0. . ηt }.5.8.5. let κt := 2mt . . 1. give a counterexample.7. Let {ηt } be an IID sequence of scalar random variables with P{η1 = 1} = P{η1 = −1} = 0. and let γt := m2 t.5.19 Ex. you need to give a specific example of the pair {mt } and {Ft } where the stated property fails.5. . Let {Ft } be the filtration defined by Ft := {η1 . then dt = mt − mt−1 is a martingale difference sequence with respect to {Ft }. give a counterexample. . . 1) Show that xt+1 ∼ π∞ also holds. ηt }. Is {zt } IID? Why or why not? 2. 8.5. Is {γt } a martingale with respect to {Ft }? If yes. Show that if {mt } is a martingale with respect to {Ft }. Let {Ft } be a filtration. . where xt ∼ π∞ := N α 1 .10. 19 To J OHN S TACHURSKI October 10. Let ηt := mt + 1. 1. EXERCISES 258 What is the asymptotic distribution of {y N }? Ex. Ex.8.5. and let zt := t · ηt for each t.11. 1 Solutions to Selected Exercises p p p Solution to Exercise 8. 8. j-th element of Xn Yn converges in probability to the i.1 on page 31 twice. then we know in particular that this convergence holds for the canonical basis vectors. the i.4. j-th element of XY. Consider again the scalar Markov sequence xt+1 = ρ xt + wt+1 . . . 8. j-th element of Xn Yn converges in probability to the i.2. we have n → xik xik p and yn kj → ykj p for all k Applying fact 1. Ex.1. To prove that Xn Yn → XY. . we need to show that the i. By hypothesis. Prove that this process has a unique. Consider the scalar sequence xt+1 = ρ xt + wt+1 . If a xn → a x for every a ∈ RK . Hence ek xn → ek x p p for every k (elementwise convergence) ∴ k xn → xk p for every k ∴ xn → x p (vector convergence.2. globally stable stationary distribution using theorem 8.5. wt }.5. 2013 .12.13. EXERCISES IID 259 Ex. { xt } is a martingale with respect to {Ft }. and that −1 < ρ < 1. 1) and x0 = 0. . { xt } is a martingale difference sequence with respect to {Ft }.5.8. where {wt } ∼ N (0. by definition) J OHN S TACHURSKI October 10.5.5. we obtain n n ykj → xik ykj xik p for all k and then n n ykj → ∑ xik ykj ∑ xik k k p In other words. 2.5. j-th element of XY. Give conditions on ρ such that 1. Assume that {wt } is IID. 8. Solution to Exercise 8. Let Xn → X and Yn → Y.1. having Student’s t-distribution with 2 degrees of freedom. Let Ft := {w1 . By assumption. We also know that if un → u and vn → v. Combining the various general case is similar: Suppose first that xk k results about scalar convergence in probability in fact 1. Let {xn } be a sequence of random vectors in RK and x be a random vector in RK .3. n {| xk − xk | > } ⊂ { xn − x > } n 0 ≤ P{| xk − x k | > } ≤ P{ x n − x > } → 0 Solution to Exercise 8. then g(un ) → g(u).3.5.8.1 (page 31). Σ ) October 10.1 on page 31.5. and From the definition of the norm we see that | xk n k hence n | xk − xk | > =⇒ xn − x > ∴ ∴ The proof is done. n − x | ≤ x − x is always true.4.5.4. we have xn → 0 p p p p p p and and yn → 0 2 y2 n →0 =0 p p p ∴ ∴ 2 xn → 02 = 0 p xn xn 2 ∴ 2 = xn + y2 n → 0+0 = 0 p √ = xn 2 → 0 = 0 Solution to Exercise 8.5.4. EXERCISES 260 Solution to Exercise 8.5. suppose now that xn − x → 0.5. Fix > 0 and arbitrary k. The n → x for all k. then un + vn → u + v. 2013 . Define √ ¯ N − µ) zn := N (x J OHN S TACHURSKI and z ∼ N ( 0. From fact 1. We need to show that n → xk for all k ⇐⇒ xk p xn − x → 0 p p A special case of this argument can be found in the solution to exercise 8. one can then verify (details left to you) that K xn − x := k =1 n − x )2 → 0 k ∑ ( xk p p (n → ∞) Regarding the converse. we know that if g : R → R is continuous and {un } is a scalar sequence of random variables with un → u. 5.2. wt+1 ] J OHN S TACHURSKI October 10. I K ) Nx d Letting g(s) := s 2 and applying the continuous mapping theorem (fact 2. By assumption.3. var[ xt+1 ] = var[ρ xt + wt+1 ] = ρ2 var[ xt ] + var[wt+1 ] + 2ρ cov[ xt . we have shown that a zn → a z. we obtain K √ d ¯ N 2 → z 2 = ∑ z2 Nx yN = k k =1 From fact 1. Since linear combinations of normals are normal. functions of independent random variables are independent) and var[yn ] = var[a xn ] = a var[xn ]a = a Σa the scalar CLT yields a zn → N (0. d E [ x t +1 ] = E [ ρ x t + w t +1 ] = ρ E [ x t ] + E [ u ] = 0 The second claim also holds. Observe that √ ¯ n − E [yn ]) a zn := N (y where yn := a xn . EXERCISES d 261 We need to show that zn → z.5. we know that xt+1 is normal. Since a was arbitrary.7 on page 24 we conclude that y N → χ2 (K ). Since yn is IID (in particular. Solution to Exercise 8. and leave further details to you. To simplify the algebra. from the rule for variance of linear combinations. Thus.5. 2013 .4.8. a Σa) Since a z ∼ N (0. It follows from the vector central limit theorem that d d √ ¯ N → z ∼ N ( 0. To begin.5. Solution to Exercise 8.5. {xn } is an IID sequence in RK with E [xn ] = 0 and var[xn ] = IK . a Σa).4 on page 77).6. because. I’ll solve the case where α = 0. page 36).2.7. the Cramer-Wold device tells us that zn converges in distribution to z. fix a ∈ RK . The first claim is true because. we apply the Cramer-Wold device (fact 2. by linearity of expectations. page 77) and the scalar CLT (theorem 1. To do this. it remains only to show that E [ xt+1 ] = 0 and var[ xt+1 ] = 1/(1 − ρ2 ). their difference dt = mt − mt−1 can also be computed. {ηt } adapted to {Ft }. 2013 . 21 Once again. using the fact that {mt } is a martingale with respect to {Ft }. we have E [ d t +1 | F t ] = E [ m t +1 − m t | F t ] = E [ m t +1 | F t ] − E [ m t | F t ] = m t − m t = 0 Solution to Exercise 8. way to think about this intuitively is to think about whether or not dt can be computed on the basis of information available at time t. Hence.5.8.5. we need to show that dt is a function of variables in Ft . mt−1 is a function of variables in Ft−1 . the value of ηt = mt + 1 can also be computed. clearly dt = mt − mt−1 is also a function of variables in Ft . mt−1 is also is a function of variables in Ft . so mt is a function of variables in Ft . Regarding part 1. and. the way to remember this is to recognize that since the value of m can be computed t at time t (by assumption). {ηt } a martingale with respect to {Ft }. by the definition of filtrations. Second. Hence ηt = mt + 1 is a function of variables in Ft .5. Since both mt and mt−1 can be computed at time t. EXERCISES 262 Since xt and wt+1 are independent (xt is a function of current and lagged shocks only). {dt } is adapted to {Ft }. 20 The J OHN S TACHURSKI October 10. The proof is similar to part 1. and we get var[ xt+1 ] = ρ2 1 1 +1 = 2 1−ρ 1 − ρ2 Solution to Exercise 8.8.2).1. We need to show that 1.9. {κt } a martingale with respect to {Ft }. To see this. we have E [ η t +1 | F t ] = E [ m t +1 + 1 | F t ] = E [ m t +1 | F t ] + 1 = m t + 1 = η t Regarding part 2.21 Moreover. E [dt+1 | Ft ] = 0 To prove that {dt } is adapted to {Ft }. every variable in Ft−1 is also in Ft . since E [mt+1 | Ft ] = mt (recall that {mt } is a martingale) and E [mt | Ft ] = mt (by fact 8. Firstly. observe first that mt is a function of variables in Ft . Since both mt and mt−1 are functions of variables in Ft . the final term is zero. and hence omitted. because {mt } adapted to {Ft } by assumption. and 2.20 Finally. It is clear from (8. mt := ∑t j=1 ξ j and Ft : = { ξ 1 . EXERCISES 263 Regarding part 3. then { mt } is a martingale with respect to {Ft }.5. ξ t }.4. then xt = wt . if ρ = 0.12. the process {γt } given by γt = m2 t is not a martingale 2 2 whenever σ := E [ξ 1 ] is strictly positive. if ρ = 1. . then { xt } is the random walk in example 8. J OHN S TACHURSKI October 10. 2013 . Solution to Exercise 8. For example. Hence { xt } is a martingale with respect to {Ft }.8. . However. and E [ x t +1 | F t ] = E [ w t +1 | F t ] = E [ w t +1 ] = 0 Hence { xt } is a martingale difference sequence with respect to {Ft }. . we saw in the course notes that if {ξ t } is an IID sequence of random variables with E [ξ 1 ] = 0. {γt } is not generally a martingale with respect to {Ft }. observe that γt + 1 = m2 t +1 = j =1 ∑ ξ j + ξ t +1 t 2 2 = ( m t + ξ t +1 )2 = m 2 t + m t ξ t +1 + ξ t +1 ∴ Since and we now have 2 E [ γt + 1 | F t ] = E [ m 2 t | F t ] + E [ m t ξ t +1 | F t ] + E [ ξ t +1 | F t ] E [ m t ξ t +1 | F t ] = m t E [ ξ t +1 | F t ] = m t E [ ξ t +1 ] = 0 2 2 2 E [ξ t +1 | F t ] = E [ ξ t +1 ] = σ 2 2 2 2 E [ γt + 1 | F t ] = E [ m 2 t | F t ] + σ = m t + σ = γt + σ > γt Hence {γt } is not a martingale with respect to {Ft }.5. Regarding part 2. Regarding part 1. To see this.6) on page 229 that { xt } is adapted to the filtration for all values of ρ. .1. observations will be indexed by t rather than n. . and the sample size will be denoted by T rather than N . In particular. The only assumption we will retain from chapter 7 is the linear model assumption (see 7. . . . . and u be the vector of shocks. we assume that our data (y1 . To remind us of this. . .1) where β is a K-vector of unknown coefficients. We let X be the T × K matrix   x11 x12 · · · x1K  x   21 x22 · · · x2K  X :=  . We let y be the T -vector of observed outputs. .  xT1 xT2 · · · 264 x TK . so that ut is the t-th element of the T × 1 vector u.Chapter 9 Large Sample OLS [roadmap] 9. and ut is an unobservable shock. Although the large sample theory we develop here has applications in a crosssectional environment with no correlation between observations.  . x T ) is generated by the linear model yt = xt β + ut (9. .1).1 Consistency Let’s return to the linear multivariate regression problem studied in chapters 6 and 7. (y T . .  . so that yt is the t-th element of the T × 1 vector y. x1 ).1. for additional generality we will imagine ourselves to be in a time series setting. 2) ˆ − β = (X X)−1 X u for the sampling error and Also. we obtain ˆ = β T 1 T t =1 ∑ xt xt T −1 · 1 T t =1 ∑ xt yt T (9. For example.1.. taking our usual expression β performing a similar manipulation. .9. .1.1. we make the we will aim for a large sample property: consistency of β following assumptions: J OHN S TACHURSKI October 10. we showed in §7. we get ˆ = β T 1 XX T −1 · 1 Xy T Expanding out the matrix products (exercise 9.1 is close to the minimum requirement for unbiasedness.1. we abandon assumption 7. We will estimate the parameter vector β via least squares. 2013 . . We can make this dependence clearer by rewriting the estimator in a different way.4) The problem is that for this specification of (9. Instead ˆ .1 Assumptions Let’s now study the properties of this estimator in the time series setting.1).2. and without it there is little chance of establishing this property.5. The expression for the OLS estimate is unchanged: ˆ : = ( X X ) −1 X y β T ˆ with T to make the dependence of the estimaNotice also that we are subscripting β tor on T explicit. the OLS estimator is unbiased for β (theorem 7.1.3) 9. In this setting. we get ˆ −β= β T 1 T t =1 ∑ xt xt T −1 · 1 T t =1 ∑ xt ut T (9.1).3. T (9. To this end. which is the exogeneity assumption E [u | X] = 0.e. CONSISTENCY 265 We continue to assume that X is full column rank (i.1 that the assumption fails when we try to estimate the simple AR(1) model yt+1 = βyt + ut+1 by setting xt = yt−1 .1). thereby producing the regression model yt = β xt + ut . In fact assumption 7.1.1. Multiplying and dividing by T . The reason is that this assumption excludes too many models. t = 1. We know that under assumption 7. the regressor is correlated with lagged values of the shock. rank(X) = K). . the model has a unique. this will be the case whenever the shock process {ut } is IID. because current shocks usually feed into future state variables. Let { xt } be the Markov process in example 8. . .3. and therefore independent of ut . and Σxx := E [x1 x1 ] is positive definite. CONSISTENCY 266 Assumption 9. Second.1.1. . . .2. To repeat. . in the AR(1) regression (9.1.9. For example. . globally stable stationary distribution.1 are satisfied. 1) IID As discussed in example 8. . 2013 .2 (Weak exogeneity). Example 9. This is a second moment assumption.1 In this case. all of the conditions in assumption 9. Let’s look at a scalar example. x T is identically distributed.2 asks you to step through the details. ut−1 . . xt are equal to the lagged state variables y0 . xt+1 = ρ| xt | + (1 − ρ2 )1/2 wt+1 with − 1 < ρ < 1 and {wt } ∼ N (0. where q := ρ(1 − ρ2 )−1/2 . xt for all t Remark: Assumption 9.1 are required to be identically distributed. they are not assumed to be IID—some correlation is allowed. .5) Two remarks: First. which in turn are functions of only y0 and u1 . . . . It is desirable to admit this possibility in a time series setting. because the contemporaneous and lagged regressors x1 . . it is standard to assume that the first data point is drawn from the stationary distribution. This seems justified when the process has been running for a long time.1. 1 In J OHN S TACHURSKI October 10. Exercise 9. . Example 9. . .2 permits dependence between current shocks and future regressors.1.1 (Ergodic regressors). 1 T t =1 ∑ xt xt → Σxx T p as T→∞ (9. Moreover.4). the econometric setting. The shocks {ut } are IID with E [ut ] = 0 and 2 E [ u2 t ] = σ . . the shocks are independent of contemporaneous and lagged regressors: ut is independent of x1 . x2 .1.2. x T in assumption 9. . The sequence x1 .2.1.1. φ is the standard normal density and Φ is the standard normal cdf. Moreover. and hence the distribtion of the state has converged to the stationary distribution by the time the first data point is observed. Assumption 9. we are implicity assuming that Σxx := E [x1 x1 ] exists. although the observations x1 . . Let’s assume that x0 has density π∞ . . given by π∞ (s) = 2φ(s)Φ(qs). yt−1 .1.1.1 on page 244. . . . . . Moreover. x t . and 2. Since {ut } and {xt } are identically distributed.1.2 for us is that linear functions of {xt ut } now form a martingale difference sequence. distribution of mt depends on the joint distribution of a xt and ut . . First let’s check part 1.1. note that {mt } is adapted to {Ft }.9. . x t ] = The proof is an exercise (exercise 9.6) One of the most important consequences of assumptions 9.1. x t +1 .3.2 Regarding the second moment E [m2 1 ].2 both hold. .2 is that we have E [ u s u t | x1 . 2013 . . . these marginal distributions do not dependend on t. we have E [ m t +1 | F t ] = E [ u t +1 a x t +1 | F t ] = a x t +1 E [ u t +1 | F t ] = a x t +1 E [ u t +1 ] = 0 (Here the second equality follows from the fact that xt+1 ∈ Ft .1. From this LLN and CLT.3 (page 248).7) Proof.1.1 and 9. the inner expectation is σ2 .1 and 9. then. Since a xt and ut are independent.3) σ2 0 if if s=t s<t (9. . we will be able to show that the OLS estimator is consistent and asymptotically normal.1.1. u 1 . a martingale difference sequence with respect to the filtration defined by F t : = { x 1 . . while the third follows from the independence in assumption 9. Lemma 9. CONSISTENCY 267 One consequence of assumption 9. Moreover. we have 2 2 2 2 E [ m2 1 ] = E [E [ u1 ( a x1 ) | x1 ]] = E [( a x1 ) E [ u1 | x1 ]] From independence of u1 and x1 . the sequence {mt } defined by mt = a xt ut is 2 1. and that xt and ut are independent. u t } (9. .) This confirms that {mt } is a martingale difference sequence with respect to {Ft }. 2 The J OHN S TACHURSKI October 10. for any constant vector a ∈ RK .2.1.2. their joint distribution is just the product of their marginal distributions. . This allows us to apply the LLN and CLT results in theorem 8. If assumptions 9. identically distributed with E [m2 1 ] = σ a Σxx a. . since mt := ut a xt is a function of variables in Ft .1. ( a x1 )2 = a x1 a x1 = a x1 x1 a ∴ 2 2 2 E [ m2 1 ] = E [ a x1 x1 a σ ] = σ a E [ x1 x1 ] a = σ a Σxx a To check part 2. That {mt } is identically distributed follows from the assumption that {ut } and {xt } are identically distributed. CONSISTENCY 268 9.9) can be written as T −1 ∑t =1 mt .8) is true. for any a ∈ RK . it suffices to show that.1 and fact 2.3).5.1. the convergence T −1 ∑t =1 mt → 0 follows from theorem 8.5. p Now let us return to the expression on the right-hand side of (9.2. It suffices to show that the expression on the right-hand side of (9.2 both hold.1.1. In particular.3 on page 77.1.2 ˆ Consistency of β T Under the assumptions of the previous section. In view of fact 2.3 (page 248). By assumption 9.1 is now done. 2013 .1. we have p 1 T a xt ut → a 0 = 0 (9. we have the following result: ˆ → β as T → ∞. let’s show that 1 T p t =1 ∑ xt ut → 0 T p (9. then β T Proof.1.8). the OLS estimator is consistent for the parameter vector β. Since { mt } is an identically distributed martingale difference sequence (see lemma 9.1. 2 ˆT Consistency of σ 9. In the current setting T is the σ J OHN S TACHURSKI October 10.9. As a first step. If assumptions 9.3 To estimate the variance σ2 of the error terms. where N was the sample size. We have now verified (9.5. then (9.2 once more.9) ∑ T t =1 T If we define mt := a xt ut . we obtain ˆ −β= β T 1 T t =1 ∑ xt xt T −1 · 1 T t =1 ∑ ut xt → Σxx −1 0 = 0 T p The proof of theorem 9. we see that 1 T t =1 ∑ xt xt T −1 → Σxx −1 as T → ∞ p Appealing to fact 2.1.1.1.3) converges in probability to 0.2.1 and 9.1 on T page 267). we previously used the expression ˆ 2 := SSR /( N − K ). Theorem 9. 1 ˆ Asymptotic Normality of β Let’s begin by establishing asymptotic normality of the OLS estimator: J OHN S TACHURSKI October 10. ˆT Theorem 9.2. Hence it suffices to show that the second and third term converge in probability to zero as T → ∞ (recall fact 1.2 Asymptotic Normality √ ˆ − β) is asymptotiUnder our assumptions.2 on page 75.1.4.2 both hold.1 and 9.) In summary. Hence for our new expression we just take σ theory is affected if we use SSR /( T − K ) instead. ASYMPTOTIC NORMALITY 269 sample size.10) 2 → σ2 as T → ∞. 2 ˆT σ := 1 T t =1 ˆ )2 ˆ2 (yt − xt β ∑u t :=: T T ∑ t =1 p T 1 T (9. 2013 . If assumptions 9.5. the first term on the right-hand side converges in probability to σ2 . 9. since T is assumed to be large relative to K. combined with various convergence results we have already established. and.2.1.1.2. we get 2 ˆT = σ 1 T t =1 ∑ T u2 t ˆ ) 1 + 2( β − β T T t =1 ˆ ) ∑ xt ut + ( β − β T T 1 T t =1 ∑ xt xt T ˆ ) (β − β T By assumption 9. 9. We have 2 ˆT σ = 1 T t =1 ˆ )2 = ∑ (yt − xt β T T 1 T t =1 ˆ )]2 ∑ [ut + xt ( β − β T T Expanding out the square.9. (None of the following ˆT 1/ T . we will now show that the term T ( β T cally normal.1. we have 1/( T − K ) ≈ 2 = SSR / T .2 and the law of large numbers. The details are left as an exercise. then σ Proof. From this information we can develop asymptotic tests and confidence intervals.1 on page 31). These results follow from repeated applications of fact 2. 1. to establish (9.12) is valid. σ2 a Σxx a) d (9. J OHN S TACHURSKI October 10. then. the expression on the left of (9.12) is valid.13) we need to show that T −1/2 3 Remember t =1 ∑ mt → N (0.13) Fixing a and letting mt := ut a xt . σ2 Σxx ).12) is valid. applying assumption 9.2. conditional on the assumption that (9. Σxx var[ z ] Σxx ) = N ( 0.1. we have 1 −1 −1 −1 2 −1 2 −1 Σ− xx z ∼ N ( 0.2. we obtain √ ˆ − β) = T(β T 1 T t =1 ∑ xt xt T 1 · T −1/2 ∑ ut xt → Σ− xx z t =1 T d 13 In view of fact 2.2.6 on page 74 and symmetry of Σ− xx .12) If (9. the OLS estimator β T √ d 1 ˆ − β) → T(β N ( 0.5. Σxx σ Σxx Σxx ) = N ( 0. it suffices to show that for any a ∈ RK we have a T −1/2 t =1 ∑ ut xt T T →az d (9.4. ASYMPTOTIC NORMALITY 270 ˆ :=: β ˆ satisfies Theorem 9.13) can be rewritten as follows: a T −1/2 t =1 ∑ ut xt T T = T −1/2 ∑ ut a xt =: T −1/2 ∑ mt t =1 t =1 T Since z ∼ N (0.9. 2013 . Suppose we can show that T −1/2 t =1 ∑ ut xt → z −1 T d as T→∞ (9.1.5.5. σ2 Σxx ).14) that the transpose of the inverse is the inverse of the transpose.1 and 9. By the Cramer-Wold device (fact 2. Under assumptions 9. Let’s now check that (9. σ 2 Σ − xx ) T as T→∞ Proof.3) we obtain √ ˆ − β) = T(β T 1 T t =1 ∑ xt xt T −1 · T −1/2 ∑ ut xt t =1 T (9.2. σ Σxx ) This completes the proof of theorem 9. and the transpose of Σxx is just Σxx . since all variance-covariance matrices are symmetric.3 on page 77). Using the expression (9.1.11) Let z be a random variable satisfying z ∼ N (0.1 along with fact 2.1. 1. Let σ hold. and completes the proof of theorem 9. we cannot use this result directly.) ˆ T be as defined in (9.2. . as our model assumptions are quite different. However. Let’s consider this problem again in the large sample setting.2 Large Sample Tests In §7.2 we considered the problem of testing a hypothesis about an individual coefficient β k . ASYMPTOTIC NORMALITY 271 From lemma 9.15) ∑ t | Ft−1 ] → σ a Σxx a as T → ∞ T t =1 Since xt ∈ Ft−1 . .2. we have 2 2 2 2 2 2 2 E [ m2 t | F t −1 ] = E [ u t ( a x t ) | F t −1 ] = ( a x t ) E [ u t | F t −1 ] = σ ( a x t ) = σ a x t x t a The right-hand side of (9. this is not surprising.2. then the expression ( β degrees of freedom. .3. and let 2 e ( X X ) −1 e ˆ T ) := σ ˆT se( β k k k J OHN S TACHURSKI October 10.4. x t . In the large sample case.1 and (2.1 and 9. u 1 .9.15). because the t-distribution converges to the standard normal distribution as the degrees of freedom converges to infinity. 2013 .15) is therefore 1 T t =1 ∑ T E [ m2 t | F t −1 ] 1 = T t =1 ∑ (σ T 2 a xt xt a) = σ a 2 1 T t =1 ∑ xt xt T a → σ2 a Σxx a p where the convergence in probability is due to assumption 9.1. we showed that if the error terms are normally ˆ k − β k )/ se( β ˆ k ) has the t-distribution with N − K distributed.1. . 9.1.2.7). we already know that {mt } is an identically distributed with 2 E [ m2 1 ] = σ a Σxx a. the result (9.1. . Let assumptions 9. .2. u t } In view of the martingale difference CLT in theorem 8.2. (In a sense.10) on page 269. .2.1. and a martingale difference sequence with respect to the filtration defined by F t : = { x 1 . . The hypothesis to be tested is H0 : β k = β0 k against H1 : β k = β0 k In the finite sample theory of §7.4.14) will hold whenever p 1 T 2 E [ m2 (9. we can use the central limit theorem to show that the same statistic is asymptotically normal. This verifies (9. x t +1 .2 Theorem 9. 1) (9. σ 2 Σ − xx ) T as T→∞ where. Making the obvious k transformation.4. ek var[z]ek ) T √ d 1 ˆ T − βk ) → N (0.2. we then We have already shown that ( T −1 ∑t xx =1 x t x t ) have −1 T p 1 xt xt ek → ek Σxx −1 ek T e k ( X X ) −1 e k = e k ∑ T t =1 2 → σ2 .16). Recall from theorem 9. we then have √ ˆ T − βk ) T(β d k → N (0.17) and fact 1.4. 1) 2 T e ( X X ) −1 e ˆT σ k k k Assuming validity of the null hypothesis (so that β0 k = β ) and cancelling have established (9. 1) ˆ ) se( β k as T→∞ (9.2.1 that √ d 1 ˆ − β) → T(β z ∼ N ( 0.17) 1 2 σ ek Σ− e xx k T −1 → Σ −1 . σ2 ek Σ− In other words. Using this result plus fact 1.5 on page 34. we then have p p 1/ 2 T e ( X X )−1 e → 1/ ˆT σ k k p σ2 ek Σxx −1 ek In view of (9. we J OHN S TACHURSKI October 10.16) Proof. as shown in theorem 9. we have T ( β xx ek ). ASYMPTOTIC NORMALITY 272 Under the null hypothesis H0 . It now follows (from which facts?) that √ d ˆ − β)] → ek [ T ( β ek z ∼ N (ek E [z]. 2013 .1. β is the true parameter vector. as usual.9. √ T . we have T zk ˆ T − β0 d β := k T k → N (0.1 on ˆT Recall that σ page 31.7). we obtain √ ˆ T − βk ) d T(β k → N (0. Applying (2.2. 2.1.3.3 Exercises Ex. then assumptions fail.3. 9.3.2 is valid. Ex. Verify the claim that (9. derive ols estimator of ρ in the model of example 8. Verify expression (9. Suppose that this estiEx.3.2. To be careful. Verify that this is the case.6 (as in example 9.3. Let Q that d ˆ T (θ ˆ T − θ) 2 → T Q χ2 ( K ) (9. (see zhao. 9. and this is also true because Φ(qs) ≤ 1. 9. 9. 9. φ is the standard normal density and Φ is the standard normal cdf. In example 9.3.1.3. To verify that Σxx is positive definite.5 (To be written. It is know that for such a C there exists ˆ T be a consistent estimator of Q.2. but ut itself AR(1).4. Let K × 1 random vector θ mator is asymptotically normal. 9.6) holds when assumption 9. It remains to check the conditions on Σxx .1 on page 243).9. we should also check that Σxx is finite. 9. Recall here that xt is the t-th row of X.1. 2013 J OHN S TACHURSKI .2.2). because the function inside the integral is strictly positive everywhere but zero.18) Ex.) is it consistent?).3. in the sense that √ d ˆ T − θ) → T (θ N ( 0.1. This is clearly true.1. C ) where C is symmetric and positive definite.2 is identically distributed as a result of our assumption that x0 ∼ π∞ (see fact 8. 2 Σxx = E [ xt ]= ∞ −∞ s2 π∞ (s)ds = ∞ −∞ s2 2φ(s)Φ(qs)ds where q := ρ(1 − ρ2 )−1/2 .1. Ex.1 it was claimed that the threshold process studied in that example satisfies all of the conditions of assumption 9. The process { xt } studied in example 9.1 Solutions to Selected Exercises Solution to Exercise 9. ˆ T be an estimator of θ.1. we need to check that the term on the right-hand side is strictly positive. Ex.3. make this an exercise?). EXERCISES 273 9. Show a K × K matrix Q such that QCQ = I.3. In the present case.2. and hence ∞ −∞ s2 2φ(s)Φ(qs)ds ≤ ∞ −∞ s2 2φ(s)ds = 2 ∞ −∞ s2 φ(s)ds = 2 October 10.3. then E [ u s u t | x1 .19) is valid. x t . . we need to show that 1 T t =1 ∑ xt2 → Σxx = E [xt2 ] T p (9. Applying Slutsky’s theorem.3. . . x t ] = E [ u s 0 | x1 . By assumption we have p ˆT→ Q Q √ and d ˆ T − θ) → T (θ z where z ∼ N (0.3. C). On the other hand. xt ] = E [ ut ] = σ by independence.20) ∼ χ2 (K ).2. . x t ] = E [ u s E [ u t ] | x1 . This theorem confirms that the convergence in (9. .9. . Qz continuous mapping theorem to (9.3. EXERCISES 274 Finally.1. x t ] = E [E [ u s u t | x1 . if s < t. .1 (page 244) are satisfied. .20). var[Qz] = Q var[z]Q = QCQ = I In other words. .2. then E [u2 t | x1 . . . We need to show that E [ u s u t | x1 . . if s = t. . . u s ] | x1 . x t . .2 (page 266) is valid. x t ] = E [ u s E [ u t | x1 . . . . . . Solution to Exercise 9. x t ] = σ2 0 if if s=t s<t 2 2 On one hand. .4. 2013 .19) Since the conditions of theorem 8. . . . . we can appeal to theorem 8. Moreover. Applying the √ ˆ T T (θ ˆ T − θ) Q J OHN S TACHURSKI 2 d → Qz 2 ∼ χ2 ( K ) October 10. . .2 (page 244). . . x t ] = 0 Solution to Exercises 9. . . we obtain 2 (9.3. . . Suppose that assumption 9. u s ] | x1 . Qz is standard normal. we obtain √ d ˆ T T (θ ˆ T − θ) → Q Qz Clearly Qz is normally distributed with mean 0. . As a result. . . it should be clear that (9.18). 2013 . EXERCISES 275 This is equivalent to (9. we have ˆ T (θ ˆ T − θ0 ) T Q 2 d → χ2 ( K ) Fixing α and letting c be the 1 − α quantile of the χ2 (K ) distribution. Incidentally. J OHN S TACHURSKI October 10. we reject the null if ˆ T (θ ˆ T − θ0 ) 2 > c T Q This test is asymptotically of size α. Under the null hypothesis.3.18) can be used to test the null hypothesis that θ = θ0 .9. has unknown distribution. Ridge regression is an important method in its own right.Don’t use tests 10. where β is unknown. In particular.Chapter 10 Further Topics [roadmap] 10. it immediately presents us will a class of models we need to choose between. we estimate β with the OLS estimator ˆ := (X X)−1 X y = argmin y − Xb β b∈ 2 RK = argmin b∈ RK n =1 ∑ ( y n − x n b )2 N 276 .1 Model Selection [roadmap] . and hence a model selection problem. Moreover. Let’s begin by putting ourselves in the classical OLS setting of chapter 7. with connections to many areas of statistics and approximation theory. but satisfies E [u | X] = 0 and E [uu | X] = σ2 I for some unknown σ > 0. and u is unobservable.1 Ridge Regression We are going to begin our discussion of model selection by introducing ridge regression. we will assume that y = X β + u.1. In traditional OLS theory. it turns out that we can find a (biased) linear ˆ . a better way to evaluate the estimator is to consider its mean squared error.1 on page 198). in particular. β pirical risk corresponding to the natural risk function R( f ) = E [(y − f (x))2 ] when the hypothesis space F is the set of linear functions. In solving (10. and. which tells us directly how much probability mass the estimator puts around the object it’s trying to estimate.” To minimize mean squared error we face a trade off between these two terms. we are minimizing the empirical risk plus a term that penalizes large values of b . Firstly.1. Second. under our current assumptions.2) where λ ≥ 0 is called the regularization parameter.4. moreover. it is unbiased for β (theorem 7. But at least some of this celebration is misplaced. the optimal choice is not at either extreme.2.3) October 10. Applying this idea to the OLS setting. In many situations involving trade off. 2013 .2). A bit of calculus is to “shrink” the solution relative to the unpenalized solution β shows that the solution to (10. and tells us that the mean squared error is the sum of “variance” and “bias. Rather than looking at whether an estimator is best linear unbiased. the mean squared error of a estimator b ˆ ) := E [ b ˆ − β 2] mse(b It is an exercise (exercise 10.1) This equation is analogous to (4. but somewhere in the middle. The estimator is defined as the estimator that has lower mean squared error that β solution to the modified least squares problem min b∈ RK n =1 ∑ ( y n − x n b )2 + λ N b 2 (10.1) to show that the mean squared error can also be expressed as ˆ) = E[ b ˆ − E [b ˆ ] 2 ] + E [b ˆ] − β 2 mse(b (10. MODEL SELECTION 277 ˆ is a natural choice for estimating β. the Gauss-Markov theorem are much celebrated foundation stones of standard OLS theory. it has the lowest variance among all linear unbiased estimators of β (see the Gauss-Markov theorem on page 200). it minimizes the emIn many ways.10.) In the ˆ of β is defined as vector case. but rather when some small amount of bias is admitted. (This point was illustrated in figure 4. These results. Many estimation techniques exhibit this property: Mean squared error is at its minimum not when bias is zero. The effect ˆ .11) on page 118.2) is ˆ : = ( X X + λ I ) −1 X y β λ J OHN S TACHURSKI (10. and.4 on page 118. A will be chosen fairly arbitrarily.1. especially when the columns of A are almost linearly dependent. . 10. We can see from the figure that the variance of the regularized solutions has fallen even further. While β ˆ is the solution to the least squares estimator β λ J OHN S TACHURSKI October 10. while the latter are plotted in blue. While the theory of Tikhonov regularization is too deep to treat in detail here.2) is certainly a less obvious approach than simply N 2 2 minimizing the empirical risk ∑n =1 ( yn − xn b ) = y − Xb . σ2 I) where σ is a small positive number.1. corresponding to 20 draws of c0 . one often does better by minimizing Ab − c0 2 + λ b 2 for some small but positive λ. In our simulation. which minimizes Ab − c0 2 + λ b 2 . Let b∗ be the least squares solution: b∗ = argminb Ab − c 2 . very large values of λ pull the regularized solutions too close to the zero vector. Suppose in addition that c cannot be calculated perfectly. MODEL SELECTION 278 Minimizing the objective in (10. The regularized solutions are clearly better on average. this result is dependent on a reasonable choice for λ. and the regularized solution. To simplify. b∗ is then a solution to the system Ab∗ = c. In particular. and also the least squares solution (because it solves minb Ab − c 2 ). Instead we observe c0 ≈ c. . obtaining the least squares solution to the system Ab = c0 . . it turns out that this is not always the case. The true solution b∗ is plotted in red. we first set b∗ := (10. The result is figure 10. but they are now substantially biased. and then set c := Ab∗ . By construction. due to some form of measurement error. . We then plot the ordinary least squares solution based on c0 . The former are plotted against their index in black. we draw c0 ∼ N (c.2. you might guess that the best way to compute an approximation to b∗ is to solve minb Ab − c0 2 . Instead. which minimizes Ab − c0 2 . where A is N × K with N > K. In the absence of additional information. we can illustrate the rather surprising benefits of regularization with a simulation. This second approach is called Tikhonov regularization. One indication as to why it might be a good idea comes from regularization theory. If you experiment with the code (listing 17). 2013 . Conversely. the regularized solutions are almost the same as the unregularized solutions. suppose that Ab = c is an overdetermined system. To illustrate regularization. where the value of λ has been increased by a factor of 50. Surprisingly. Such a situation is depicted in figure 10. When measuring c we will corrupt it with a Gaussian shock. you will see that for very small values of λ. but such that the columns are quite close to being linearly dependent.10. The figure shows 20 solutions each for the ordinary and regularized solutions. 10). Tikhonov regularization gives some understanding as to why the ridge regression ˆ can perform better than β ˆ . Of course. 1.1: Effect of Tikhonov regularization. 2013 . λ = 1 6 8 10 b 12 14 5 10 15 20 Figure 10. MODEL SELECTION 279 6 8 10 b 12 14 5 10 15 20 Figure 10. λ = 50 J OHN S TACHURSKI October 10.2: Effect of Tikhonov regularization.10. lwd =2 . Ap % * % c0 ) lines ( b1 . col = " blue " ) # Compute the standard least squares solution b2 <. sd =0. col = " black " . sd = sigma ) # Compute the regularized solution b1 <.1. K ) c <.1] <. i ] <.20 A <.A % * % bstar # A transpose # True solution # Corresponding c # Create empty plot plot ( bstar .t ( A ) bstar <. 2013 .lm ( c0 ~ 0 + A ) $ coefficients lines ( b2 . ylab = " b " ) # Plot the solutions for ( j in 1: numreps ) { # Observe c with error c0 <.5 lambda <.1) } Ap <.solve (( Ap % * % A + lambda * diag ( K ) ) . lty =2) } lines ( bstar .i -1] + rnorm (N . K <.0.c + rnorm (N .10.1 numreps <.20 # Parameterizes measurement error for c # Regularization parameter # Number of times to solve system # Construct an arbitrary N x K matrix A N <. type = " n " .matrix ( nrow =N .40. ylim = c (6 .A [ . col = " red " ) J OHN S TACHURSKI October 10. xlab = " " . ncol = K ) A [ .1 for ( i in 2: K ) { A [ . MODEL SELECTION 280 Listing 17 Effect of Tikhonov regularization sigma <. 14) .rep (10 . given a single covariate x. and the details of the argument can be found there. For example. This was proved by Hoerl and Kennard exists a λ > 0 such that mse( β λ (1970). As mentioned above. the ridge regression estimator β λ ularized problem minb y − Xb 2 + λ b 2 . .1. this ˆ is biased (see exercise 10.2 Subset Selection and Ridge Regression One problem frequently faced in specific regression problems is which variables to include. for some intermediate ˆ falls by more than enough to offset the extra bias.2. And a similar problem of variable selection arises in time series models. since we are trying to choose the right subset of all candidate regressors. x. where we want to know how many lags of the state variables to include. educational attainment across schools. we can expect that the regularized estimator will sometimes perform better.10. The general problem is known as subset selection. which is the topic treated in the next few sections. with the right ˆ outperforms β ˆ even though all of the choice of λ.1. This problem takes on another dimension (no pun intended) when we start to think about basis functions.1. 2013 J OHN S TACHURSKI . ˆ can always outperform β ˆ .2). The reduction in mean implies that the estimator β λ squared error over the least squares estimator occurs because. For example. x2 . . the ridge regression estimator β λ classical OLS assumptions are completely valid. where the hypothesis space is F := all functions of the form f ( x ) = b0 x0 + b1 x1 + · · · + bd x d October 10. Since y is indeed a noisy observation. adoption of technologies. in the sense that there always In fact. This problem falls under the heading of model selection. population. .4. as discussed in §6. value of λ. it turns out that β λ ˆ ) < mse( β ˆ ). unemployment. we can think of any number of variables that might be relevant (median wage. we have the option to map this into a larger vector φ(x) using basis functions. MODEL SELECTION 281 ˆ is the solution to the regproblem minb y − Xb 2 . x d ) and regressing y on φ( x ). One is that. demand for certain products. we may consider mapping it into φ( x ) = (1. and so on. police density. etc. The same is true if we are trying to model credit default rates for some group of firms or individuals. if we are comparing crime rates across different cites.. This amounts to polynomial regression. the variance of β λ It is worth emphasizing two things before we move on. The other is that the right choice of λ is an important and nontrivial problem. 10. etc. . Given a set of covariates x.). x j }. and so on. solve the least squares problem) over the hypothesis space of linear functions from RK to R: L := all functions of the form f (x) = b x with b1 . .21 (page 145). As we saw in figures 4. (In other words. there are 2K subsets to step through. and the other increasing in # I .10. The subset selection problem has been tackled by many researchers. What J OHN S TACHURSKI October 10. . We will also suppose that we have already applied any transformations we think might be necessary.6. We imagine that x is a large set of K candidate regressors. the size of the subset selected.e. . one alternative is to use ridge regression. x might include not just the original regressors but squares of the regressors. . With K regressors. Subset selection can be viewed from the lens of empirical risk minimization. . the regularization term leads us to choose an estimate with smaller norm. we are back to the problem of choosing a suitable hypothesis space over which to minimize empirical risk. 2013 .. For example. MODEL SELECTION 282 Polynomial regression was discussed extensively in §4. bK ∈ R Doing so produces the usual OLS estimator. Choosing d is another example of the subset selection problem because we are trying to decide whether to include the regressor x j for some given j. a good choice of d is crucial. Let I ⊂ {1.18– 4. . the Bayesian Information Criterion (BIC) and Mallow’s C p statistic. we have already applied the basis functions. Suppose that we are trying to model a given system with output y and inputs x. Now suppose that we wish to exclude some subset of regressors { xi . For example. With ridge regression. . . The objective is to minimize the statistic. . . To avoid this problem.) If we want to include all regressors then we can minimize empirical risk (i. one increasing in the size of the empirical risk. . One of the problems with subset selection is that there is usually a large number of possible subsets. Mallow’s C p statistic consists of two terms. which involves trading off poor fit (large empirical risk) against excess complexity of the hypothesis space (large # I ). Regressing y on the remaining regressors { xk }k∈ / I is equivalent to minimizing the empirical risk over the hypothesis space L− I := all functions f (x) = b x with bk = 0 for all k ∈ I Once again. K } be the set of indices of the regressors we want to exclude. Well-known approaches include those based on the Akaike Information Criterion (AIC).1. . cross-products.2. The effect is to “almost exclude” those regressors. . x1 .18–4.3– 10.21 from §4. and the resulting prediction function we ˆ φ ( x ). instead of using empirical risk minimization.1. However. On the basis of what we’ve seen so far. We can illustrate the idea by reconsidering the regression problem discussed in §4. The data used here is exactly the same data used in the ˆ .6.3.2.6. because we still need to choose the value of the regularization parameter λ. rather than searching over 2K subsets. we can compute the risk of each function fˆλ .27) on page 142). and the fit is good. The hypothesis spaces were the sets Pd of degree d polynomials for different values of d. but with empirical risk minimization the result is overfitting (see figure 4. As in §4. The x-axis is on log-scale.18–4.2.21 on page 146). First. Figures 4. which translates into solving min b n =1 ∑ (yn − b φ(xn ))2 N where φ( x ) = ( x0 . .6.10. Figure 4.2.17 (page 143) showed that intermediate values of d did best at minimizing risk.4. because we are deciding what powers of x to include as regressors. choosing the right d is a essentially a subset selection problem. so that fˆλ ( x ) = β λ The function fˆλ is plotted in red for increasingly larger values of λ over figures 10. Of course. 2013 .6. In figure 10. the model selection problem is not solved. since we know the underlying model (see (4. This space is certainly large enough to provide a good fit to the data. MODEL SELECTION 283 this means in practice is that the coefficients of less helpful regressors are driven towards zero. we solve the regularized problem min b n =1 ∑ N (yn − b φ( xn ))2 + λ b 2 for different values of λ. We can do a very similar thing using ridge regression. and the procedure overfits. In figures 10. original figures 4. x d ) As discussed above. . The risk is plotted against λ in figure 10. the value of λ is too large. denote by fˆλ . In figure 10. J OHN S TACHURSKI October 10. the value of λ is too small to impose any real restriction.5 and 10. Here.6. For each d we minimized the empirical risk over Pd . . The solution for each λ we denote by β λ which is the ridge regression estimator. The black line is the risk minimizing function. the value of λ is a bit larger. and all coefficients are shrunk towards zero.7. let’s take as our hypothesis space the relatively large space P14 .21 (see page 145) showed the fit we obtained by minimizing empirical risk over larger and larger hypothesis spaces. the problem has been reduced to tuning a single parameter. it’s not surprising that risk is smallest for small but nonzero values of λ. 0 G G G G G G G G −0.0 G G −1.1.0 G −0. 2013 .0 0.5 0.5 G G G G G G G G G −1.0 0.0 G −0.5 G G G G G 0.10.0 G G G G G G G G −0.4: Fitted polynomial.78946809286892e−10 G Figure 10.5 G G G G G 0.5 0.000418942123448384 G Figure 10.0 lambda = 0.3: Fitted polynomial. MODEL SELECTION G G 284 G 1.0 G G −1.5 G G G G G G G G G −1.5 1.0 G G G 0.0 G G G 0.0 lambda = 2. λ ≈ 0. λ ≈ 0 G G G 1.0004 J OHN S TACHURSKI October 10.5 1. 0 G G −1.0 lambda = 17.5 G G G G G G G G G −1.0 G G −1.5 G G G G G 0. λ ≈ 18 G G G 1.1. MODEL SELECTION G G 285 G 1.5 0.5: Fitted polynomial.5 1.0 G −0.6: Fitted polynomial. λ ≈ 2200 J OHN S TACHURSKI October 10.4657948067 0.973328138195 0.10. 2013 .0 G G G 0.0 G Figure 10.5 G G G G G 0.0 G G G G G G G G −0.0 lambda = 22026.0 G G G 0.5 1.5 0.0 G G G G G G G G −0.0 G −0.0 G Figure 10.5 G G G G G G G G G −1. 6 0. One technique for injecting prior information into statistical estimation is via Bayesian analysis.5 G 0. Nevertheless. which states that for any sets A and B we have P( A | B ) = P( B | A )P( A ) P( B ) This formula follows easily from the definition of conditional probability on page 6.7 G risk 0.1. which values of our regularization parameter to choose.10. which functional forms to use. In what follows. and so on. we focus on Bayesian linear regression. MODEL SELECTION 286 Risk G G 0. Bayesian methods are currently very popular in econometrics and other fields of statistics (such as machine learning). let’s recall Bayes’ formula.1.3 A Bayesian Perspective The ideal case with model selection is that we have clear guidance from theory on which regressors to include. An analogous statement holds true for densities. although the derivation is a bit J OHN S TACHURSKI October 10. If theory or prior knowledge provides this information then every effort should be made to exploit it.7: The risk of fˆλ plotted against λ 10. the brief treatment we present in this section does provide useful intuition on their strengths and weaknesses.3 G G G G G 1e−10 1e−07 1e−04 lambda 1e−01 1e+02 Figure 10. To begin. 2013 .4 G 0. and perhaps a future version of these notes will give them more attention. which to exclude. β) . Often we wish to summarize the information contained in the posterior. we can write our distributions as p ( y | β ) = N ( X β.7) October 10. by looking at the “most likely” value of β given our priors and the information contained in the data. expressed in the form of probability distributions. or at its maximum value.10. The main idea of Bayesian analysis is to treat parameters as random variables. . it can be expressed as ˆ := argmax {ln p(y | β) + ln p( β)} β M β (10.1. In generic notation. u is random and unobservable. for example. We assume that the pairs satisfy y = X β + u. See. our prior on u implies that the density of y given β is N (X β.4) Applying Bayes formula to the pair (y. To simplify the presentation we will assume that X is nonrandom. While u and β are unobservable random quantities. we obtain (exercise 10. These subjective beliefs are called priors. we obtain p( β | y) = p(y | β) p( β) p(y) (10.5) The left-hand side is called the posterior density of β given the data y.4). . Suppose for example that we observe input-output pairs (y1 . Bishop. x N ). x1 ). Taking logs of (10. The new feature provided by the Bayesian perspective is that we take β to be random (and unobservable) as well. As before. Given our model y = X β + u. 2006.6) Inserting the distributions in (10. MODEL SELECTION 287 more involved. Here we will take the priors to be u ∼ N (0. The maximizer of the posterior is called the maximum a posteriori probability (MAP) estimate. τ I). (Taking X to be random leads to the same conclusions but with a longer derivation. in the sense of being unknown quantities for which we hold subjective beliefs regarding their likely values.5) and dropping the term that does not contain β. σI). and represents our new beliefs updated from the prior on the basis of the data y. 2013 J OHN S TACHURSKI .4. We can do this by looking either at the mean of the posterior. . dropping constant terms and multiplying by −1. σ I ) and p ( β ) = N ( 0. let’s suppose that we have subjective prior beliefs regarding their likely values. chapter 3).3) the expression ˆ = argmin β M β σ2 ∑ (yn − xn β) + τ 2 β n =1 2 N 2 (10. τ I ) (10. σ I) and β ∼ N (0. . (y N . 5. Recall that. at least in principle. Bayesian estimation provides a principled derivation of the penalized least squares method commonly known as ridge regression. given a selection of models (or estimation procedures). and we discuss it further in §10.10. f (x))] = L(t.1. 10. MODEL SELECTION 288 This is exactly equivalent to the penalized least squares problem (10. where the regularization parameter λ is equal to (σ/τ )2 . fˆ(x))] then we are taking expectation over all randomness.4 Cross-Validation The most natural way to think about model selection is to think about minimizing risk. Previously. and see how well we can do in terms of predicting new values as evaluated by expected J OHN S TACHURSKI October 10. where regularization arises out of combining prior knowledge with the data. Bayesian analysis provides the same regularization. because if we simply define the risk as E [ L(y. That’s not a bad idea per-se.1. In the next section we forgo the assumption that this strong prior knowledge is available. x) ∈ RK+1 with joint density p. the risk of a function f : RK → R is the expected loss R( f ) := E [ L(y. Now suppose that we observe N input-output pairs D : = { ( y1 .1. ( y N . then we are back at the model selection problem. 2013 . If not. . f (s)) p(t. Moreover.3). we would like to find the one that takes this data set D and returns a predictor fˆ such that fˆ has lower risk than the predictors returned by the other models. given loss function L and a system producing input-output pairs (y. x N ) } IID Intuitively. the value (σ/τ )2 is part of our prior knowledge. . x1 ). and hence there is no model selection problem. and consider a more automated approach to choosing λ. But what we want to do for now is just take the data set as given. including that in fˆ. In view of (10. . s) dt ds that occurs when we use f (x) to predict y. we justified ridge regression via Tikhonov regularization. Here we have to be a bit careful in defining risk. which depends implicitly on the data set D .2) on page 277. In practice one can of course question the assertion that we have so much prior knowledge that the regularization parameter λ := (σ/τ )2 is pinned down. Here. the solution is ˆ : = ( X X + ( σ / τ )2 I ) −1 X y β M Thus. we are using it to fit the model. We then repeat this for all models. Second. If we had J new observations (yv j . producing fˆ. because we don’t have any new data in general. From the law of large numbers. fˆ(s)) p(t. the problem is that we are using the data D twice. Hence we define the risk of fˆ as the expected loss taking D (and hence fˆ) as given: R( fˆ | D ) := E [ L(y. and fˆm is the predictor produced by fitting model m with data D . called the training set and the validation set. you might have the following idea: Although we don’t know p. J OHN S TACHURSKI October 10. we are using it to evalute the predictive ability of fˆ on new observations. See. but then again. f ( x j )) J j =1 Of course. in particular. so we could approximate R ( fˆ | D ) by 1 N n =1 ∑ L(yn . This point was discussed extensively in §4. The point that figure made was that complex models tend to overfit. then we would like to find the model m∗ such that R( fˆm∗ | D ) ≤ R( fˆm | D ) for all m∈M The obvious problem with this idea is that risk is unobservable. MODEL SELECTION 289 loss. figure 4. New data will tell us how fˆ performs out of v sample. we do have the data D . then we could estimate the risk by 1 J ˆ v ∑ L(yv j . which consists of IID draws from p. if you think about it for a moment more. for conflicting objectives.2.17 on page 143. One way that statisticians try to work around this problem is to take D and split it into two disjoint subsets. So what we really need is fresh data. s) dt ds If we have a collection of models M indexed by m. and the empirical risk is a very biased estimator of the risk. First. The training set is used to fit fˆ and the validation set is used to estimate the risk of fˆ. you will realize that this is just the empirical risk. this is not really a solution. 2013 . but high risk. x j ). fˆ(x)) | D ] = L(t. if we knew p there would be no need to estimate anything in the first place. xn ) are from the training data D . fˆ(xn )) N where the pairs (yn .10. However. If we knew the joint density p then we could calculate it. producing low empirical risk. Looking at this problem. and choose the one with lowest estimated risk. In essence. we know that expectations can be approximated by averages over IID draws.1.6. 2013 . fˆ−n (xn )) . The prediction quality is evaluated in terms of loss. the estimated risk. .1. the set of models is indexed by λ. suppose that we partition the data set D into two subsets D1 and D2 . MODEL SELECTION 290 Since data is scarce. so we are fitting a polynomial of degree 14 to the data by minimizing regularized least squares error. . In terms of model selection. . the procedure is attractive because we are using the available data quite intensively. The extreme is to partition D into N subsets. the algorithm can be expressed as follows: Algorithm 1: Leave-one-out cross-validation 1 2 3 4 5 for n = 1.2. xn ) omitted. the regularization parameter in the ridge regression. we then return an estimate of the risk. a more common procedure is cross-validation. xn )}. Next. x1 . . In this problem. First. Letting D−n := D \ {(yn .1. To illustrate the idea. we use D2 as the training set. For each λ.10. by considering again the ridge regression procedure used in §10. xn ). end N 1 return r := N ∑n =1 r n At each step inside the loop. but still evaluating based on out-of-sample prediction. The data set D is the set of points shown in figures 10. . the fitted function fˆλ is ˆ φ( x ) fˆλ ( x ) = β λ where ˆ := argmin β λ b n =1 ∑ N (yn − b φ( xn ))2 + λ b 2 Recall here that φ( x ) = ( x0 . we could divide the data into more than two sets. the data set D with just the n-th data point (yn . using the average loss. This procedure is called leave-one-out cross validation. and D1 as the validation set. The amount of regularization is increasing in λ. x d ) with d fixed at 14. we use D1 as the training set and D2 as the validation set.3–10. each with one element (yn . . Let’s illustrate this idea. set rn := L(yn .6. and then select the one that produces the lowest value of r. . we fit the model using all but the n-th data point. . which attempts to use the whole data set for both fitting the model and estimating the risk. and then try to predict the n-th data point using the fitted model. On an intuitive level. Repeating this n times. The resulting functions fˆλ were shown J OHN S TACHURSKI October 10. the idea is to run each model through the cross-validation procedure. The estimate of the risk is the average of the estimates of the risk produced in these two steps. Of course. N do fit fˆ−n using data D−n . N do 2 2 ˆ set β λ. we used the fact that we knew the underlying model to evaluate the risk. Minimizer is regression function. λ.10. and indeed the fit is excellent. we use the following algorithm to estimate the risk: Algorithm 2: Leave-one-out cross-validation for ridge regression 1 2 3 4 5 for n = 1. The associated function fˆλ is plotted in red in figure 10.4 and 10. we were able to determine which values of λ produced low risk (figure 10. In that discussion. 10.015.3).3–10.5 The Theory of Model Selection Generalization error. 10 .1.− n end return rλ := 1 N N ∑n =1 r λ. In real estimation problems. Since we could evaluate risk. for each λ in the grid lambda <.1.−n : = argminb ∑i =n { ( yi − b φ ( xi )) + λ b } . Intermediate values of λ produced the best fit in terms of minimizing risk (see figures 10. . for each λ in the grid.n The value of λ producing the smallest estimated risk rλ is around 0. The fit at each step within the loop is via ridge regression. . as in the Bayesian case—see §10. Variance-bias interpretation? J OHN S TACHURSKI October 10. ˆ set rλ. omitting the n-th data point. In this experiment.7). MODEL SELECTION 291 for different values of λ in figures 10.7 on page 286).exp ( seq ( -22 . and the resulting polynomial is used to predict yn from xn . In other words. our fully automated procedure is very successful. risk is unobservable. The prediction error is measured by squared loss. Let’s see how a data-based procedure such as cross-validation performs in terms of selecting a good value of λ. and we need to choose λ on the basis of the data alone (assuming we don’t have prior knowledge. .6. This is in fact very close to the value that minimizes the actual risk (see figure 10.1.n := (yn − β φ( xn ))2 . . 2013 . In this instance. length =10) ) we perform leave-one-out cross-validation as in algorithm 1.7).8. 5 G G G G G 0.0 G −0.0 G G G G G G G G −0.0 G G G 0.0 0. GMM is a generalization of the plug in estimator (which is a special case of ERM).0146660171165271 G Figure 10.2 Method of Moments . METHOD OF MOMENTS G G 292 G 1.2.IV. method of moments. simulated method of moments.8: Fitted polynomial.5 G G G G G G G G G −1.5 1. Is there one algorithm that outperforms all others over all possible specifications of the DGP? 10.015 Best is prior knowledge: choose the regression function. 2013 N . Suppose we want to estimate unknown quantity θ := The plug in estimator is ˆ := θ J OHN S TACHURSKI h(s) FN (ds) = 1 N h(s) F (ds) = E [h( x )] n =1 ∑ h( xn ) October 10.0 lambda = 0.0 G G −1.10.5 0. Change following to vector case. GMM. λ ≈ 0. 10. Ex. Verify (10.4.2.1) that mse(b ˆ .6). J OHN S TACHURSKI October 10. 10.1.7) on page 287 using (10.3.9) with 1 N n =1 (10.4 Exercises ˆ) = E[ b ˆ − E [b ˆ ] 2 ] + E [b ˆ ] − β 2.Gaussian copula 10.10) ˆ) − h( xn )] = 0 ∑ [ g(θ N (10. Derive the expectation of the ridge regression estimator β λ ˆ is a biased estimator of β when λ > 0. 2013 . Verify the claim in (10.4) and (10. We want to estimate θ where g(θ ) = E [h( x )] (10. x)] = 0 (10.4.8) ˆ is the solution to The method of moments estimator θ ˆ) = 1 g(θ N n =1 ∑ h( xn ) N (10. In particular.11) with the empirical counterpart 1 N ˆ. replacing (10.9) We can express the same thing slightly differently.3 Breaking the Bank .11) Generalized method of moments extends this idea further. by replacing (10. BREAKING THE BANK 293 The method of moments is a generalization of this idea.10. x n ) = 0 ∑ G (θ N (10. show that β λ Ex. 10.3. Ex.12) and (10.4.13) n =1 10.8) with E [ g(θ ) − h( x)] = 0 and (10.10) with the more general expression E [G(θ . Part IV Appendices 294 . This is done through a web-based API 295 .1 The R Language [roadmap] 11. a nonprofit organization that allows people to make small loans over the Internet. Let’s say that you are interested in development and microfinance. This chapter gives a quick introduction. mostly to local entrepreneurs in developing countries. 11. Here’s an example.1 Why R? It probably behooves me to say a few words about why I want you to learn R. and it’s done. R is not trivial to learn. Kiva has been good enough to make almost all its data freely available for download. And perhaps you like using Eviews/STATA/whatever because it’s nice and simple: point and click. and opening up a world of new possibilities. Let’s leave point and click regression for the high-school kiddies.Chapter 11 Appendix A: An R Primer These notes use the statistical programming language R for applications and illustration of various statistical and probabilistic concepts. Who needs to learn a fancy programming language? The short answer is that computers and computing are radically changing econometrics and statistics. A new development in this area is the arrival of Kiva. Real programming skills will set you free.1. The fact that R is open source means that R is • free as in “free beer”—it costs nothing to download • free as in “free speech”—owned by the community. The important this is that you learn how to code up your ideas. You will need to tell your computer what to do with a sequence of written text commands—a program. in this course you’ll be exposed to R.org/) introduces R as a language and environment for statistical computing and graphics.11.1. and not all can or will be conveniently canned and packaged in a point and click interface. J OHN S TACHURSKI October 10. Almost all the skills you learn in this course are portable to other languages. the main aim here is to teach you how to write programs for statistics and econometrics. you need to know a bit about computing. and it’s also free. designed as an open source implementation of S (the latter being a statistics language developed at Bell Laboratories. and descriptive enough to do almost anything you want. if you’re already a programmer. such as GAUSS or Matlab. it’s rapidly becoming the default language of statistics within statistics departments around the world. (Finally. R is a good way to jump into more computer intensive statistical techniques. Once again. Either way. The other side of the story is what gets done to the data once its stored on our computer. Computers have opened up a whole new world of statistical techniques.2 Introducing R What is R? The R homepage (http://www.) 11. The short answer is that R is certainly as good as any other programming language for statistics. you might be wondering why I’ve chosen R over other programming languages popular in econometrics. Whether you’re converted or not. Moreover. 2013 . The Kiva example illustrates the need for programming skills when obtaining and parsing data.r-project. R is well worth a look. It’s programming language is simple and robust enough to learn easily. That said. good programming skills are what give you the freedom to do what you want. for the community Despite being free. R is every bit as good as commercial statistical packages. If you want to get your hands on Kiva’s data and manipulate it.1. THE R LANGUAGE 296 which returns HTTP web queries in XML or JSON (Javascript Object Notation—a text based data exchange format suitable for parsing by computers). followed by a prompt like so: > The first thing you need to know is how to quit: > q () If you’re prompted to save your workspace. THE R LANGUAGE 297 Of course R is not perfect. Someone once said that the best thing about R is that it was written by statisticians. (Try searching YouTube too—at the time of writing there are some helpful videos.1. 2013 . etc.). say no for now. timeseries analysis. 11. including linear and nonlinear modelling. classification. You should now be greeted with information on the version number. .r-project. classical statistical tests. The second thing you need to know is how to get help.3 Getting Started R can be downloaded from http://www. the latest version and add on packages will be found on CRAN—the Comprehensive R Archive Network. More can be found by perusing the available packages on http://cran.org/. There’s an interactive help system that can be accessed as follows: > help ( plot ) J OHN S TACHURSKI October 10. and you have just fired up R for the first time. which allows users to tackle arbitrarily sophisticated problems. where you will be directed to a local mirror of the archive. but the language is a little quirky relative to some of the more elegant modern programming languages (Python. A quick Google search will provide lots of information on getting R up and running. Typically. It also has a steep learning curve relative to point-and-click style environments such as Eviews and STATA. It implements a vast range of statistical functions. On the other hand. and so on.1. CRAN can be accessed from the R homepage.r-project.org/. and the worst thing about R is that it was written by statisticians. It has excellent graphics and visual presentation. . tests and graphical techniques. copyright.. etc. combining sensible defaults with good user control. That’s pretty accurate: It’s a great environment to jump into and do serious analysis. well-structured programming language.) I’ll assume that installation has gone fine. R contains a complete.11. if the command you’ve begun is uniquely identified by what you’ve typed so far. which is accessed via the up and down arrow keys. J OHN S TACHURSKI October 10. try typing > plo and press the tab key twice. Here are a few tips. which might make some people nervous. the help system is fairly good. Once we start working with our own variables. If you have a variable called interwar . it will be expanded by the tab key to the full command. southern . This is particularly helpful with long names.search : > help . and I consult it often. THE R LANGUAGE 298 or > ? plot You’ll be presented with a manual page for the plot function.1. Also.11. We are not in the land of point-and-click here. so general Internet searches may be useful too. or edit the command and then run (use the left and right arrow keys). First. Another useful feature of the command line is the command history. in . However. the same expansion technique will work for them. press the up arrow key until it reappears. and returns you to the prompt. you can try help. This is useful if you’re not sure what related commands are available. alabama then typing the first few letters and tabbing out will save a lot of time. let’s learn a bit about the command interface. 2013 . If help doesn’t turn anything up. pressing “q” exits the manual page. it can be technical. (Try it and you’ll see. For example. You’ll be presented with a list of possible expansions. search ( " plotting " ) Overall.) You can now press enter to re-run the command as is. On my computer (which is running Linux—might be different for Windows or Mac). But the command line is highly efficient once you get used to it. to recall a previously entered command from the current session. Now we’ve covered help and quitting. consumption . The value 5 is an object stored somewhere in your computers’ memory. > x <. but the latter is more common. and then multiply the result by 12. R retrieves that value and uses it in place of x: J OHN S TACHURSKI October 10. For example.2. let’s see how we can put R to good use. So far so good. 2013 . For example.11.) To multiply we type > 12 * 23 To raise 12 to the power of 23 we type > 12^23 and so on. The simplest way to use R is as a fancy calculator.1 Variables At the heart of any programming language is the concept of a variable. A variable is a name (often a symbol such as x or y) associated with a value. but to do anything more interesting we’ll need to use variables 11.) Figure 11. Parentheses are used in a natural way. For example: > 12 * (2 + 10) [1] 144 This indicates to the R interpreter that it should first add 2 and 10.5 binds the name x to the value 5. (The equal sign = can be used as a substitute for the assignment operator <-.2.2 Variables and Vectors Now we have some feel for the command line. Now. to add 12 and 23 we type > 12 + 23 and get the response [1] 35 (Just ignore the [1] for now. when you use the symbol x. VARIABLES AND VECTORS 299 11. and the name x is associated to that little patch of memory.1 illustrates the idea. we are using the * symbol to multiply. the value of x is now 10.5 5 + 3 8 5 <.x * 2 First the r.x + x. Understanding this is important for interpreting commands such as > x <. as in x * 2. is evaluated to obtain 10.x 2 " x2 object ’ x2 ’ not found Exponential and division are as follows: J OHN S TACHURSKI October 10. Hence. 2013 . the R interpreter first evaluates the expression x + x and then binds the name y to the result.h.x + x 10 Notice how typing the variable name by itself causes R to return the value of the variable. It cannot be omitted: > x <Error : > x <Error : x 2 unexpected numeric constant in " x <. VARIABLES AND VECTORS 300 Figure 11. As before.s.2.1: Variables stored in memory > x > x [1] > x [1] > x [1] > y > y [1] <.11. Notice also that assignment works from right to left: in the statement y <. and then the name x is bound to this number. etc. but 1x is not (variables can’t start with numbers).200 or > number .) 11. (Actually. up until now we’ve used simple names for our variables. Now let’s create some vectors. A vector is an array of values such as numbers. 7. and we can verify this as follows: J OHN S TACHURSKI October 10. 5. 5 . 0. Also. If fact.2. observations <. of . such as x and y. as we’ll see soon enough. 2013 .3. 0 .2. which concatenates the numbers 2. there are other possibilities besides numbers. > a <.0 Here we’ve created a vector a using the c function.3 0.0 5. -1) > a [1] 2. the “scalar” variables we’ve created so far are stored internally as vectors of length 1. More complex names can be used to help us remember what our variables stand for: > number _ of _ observations <. The resulting vector is of length 5.2 Vectors So far.0 7. 7.0 -1. and -1 into a vector. the variables we’ve created have been scalar-valued.200 Some rules to remember: x1 is a legitimate variable name.11.c (2 .) Vectors are important in R.333333 We can use ls to see what variables we’ve created so far. VARIABLES AND VECTORS 301 > x ^3 [1] 1000 > x / 3 [1] 3. R is case sensitive (a and A are distinct names.3 . and rm if we decide that one of them should be deleted: > ls () [1] " x " " y " > rm ( x ) > ls () [1] " y " Incidentally. rep (0 .2. try rep: > z <.c (a .5 1. 2013 . To generate a constant array. 5) > z [1] 0 0 0 0 0 We can also generate vectors of random variables. we can use the function seq: > b <.rnorm (3) > y <. VARIABLES AND VECTORS 302 > length ( a ) [1] 5 The c function can also concatenate vectors: > a <. For example > x <.1] We’ll learn more about this later on. 4) > b <. b ) > a _ and _ b [1] 2 4 6 8 On thing we often want to do is create vectors of regular sequences.5:1 5 4 3 2 1 If we need to be a bit more flexible.0 Try help(seq) to learn more. 1 .0 0.seq ( -1 .rlnorm (30) > z <.c (6 . J OHN S TACHURSKI October 10.runif (300) # 3 draws from N (0 .0 -0.11. Here’s one way: > b > b [1] > b > b [1] <. length =5) > b [1] -1. 8) > a _ and _ b <.1:5 1 2 3 4 5 <.c (2 .5 0.1) # 30 lognormals # 300 uniforms on [0 . 7.3 0. VARIABLES AND VECTORS 303 11.0 11. 5) We can obtain the sum of all elements via > sum ( x ) [1] 15 and the minimal and maximal values via J OHN S TACHURSKI October 10.3 0. 3 . so there’s no need to distinguish.c (1 .3 Indices Using [k] after the name of a vector references the k-th element: > a <. 2013 .3 > a [3] <.3 .0 7.c (2 .0 7.2.c (2 .0 -1.1:5 # Same as x <. 2 .11. consider the vector > x <. To begin. Most of these can be performed on scalar variables as well. 5 .2. -1) > a [3] [1] 7.3 . 4 . The most common is by putting a vector of integers inside the square brackets like so: > a [1:4] [1] 2. 7.0 5. -1) > a [ -1] [1] 5.0 There are other ways to extract several elements at once. 0 .4 Vector Operations Now let’s look at some operations that we can perform on vectors. but remember that scalar variables are just vectors of length 1 in R.100 > a [1] 2 5 100 0 -1 Entering a negative index returns all but the indicated value: > a <.2. 5 . 0 . 0 2. 2013 .389056 0.0986123 1. VARIABLES AND VECTORS 304 > min ( x ) [1] 1 > max ( x ) [1] 5 To get the average of the values we use > mean ( x ) [1] 3 while the median is obtained by > median ( x ) [1] 3 The sample variance and standard deviation are obtained by var(x) and sd(x) respectively. In general.413159 0.1411200 -0.7 2.6536436 0. Here are some more examples.. returning the exponential and sine of x respectively: > exp ( x ) [1] 2.11.3862944 1.718282 > sin ( x ) [1] 0.9092974 20. For example J OHN S TACHURSKI October 10.4161468 0. the log function returns the natural log of each element of the vector: > log ( x ) [1] 0.4 1.2. standard arithmetic operations like addition and multiplication are performed elementwise.e.0 1.2 Now let’s look at arithmetic operations.0000000 0. element by element) on its argument. 1) [1] 1.6094379 This is a common pattern: The function acts elementwise (i.9899925 0. So far we’ve looked at functions that take a vector and return a single number.6931472 1. For example.8414710 7. we can perform two or more operations at once: > abs ( cos ( x ) ) [1] 0.7568025 -0.598150 148.5403023 0.2836622 > round ( sqrt ( x ) .085537 54. There are many others that transform the vector in question into a new vector of equal length.9589243 Naturally. more generally. length =200) > y <. there’s a nicer way to do it: > a * 2 [1] 2 4 6 8 The same principle works for addition. division and so on. > a * rep (2 .seq ( -3 . length ( a ) ) [1] 2 4 6 8 However.11. A common one is the plot() command.3 Graphics R has strong graphics capabilities when it comes to producing statistical figures. Here’s an example > x <. GRAPHICS 305 > a > b > a [1] > b [1] > a [1] > a [1] <.5:8 1 2 3 4 5 6 7 8 + b 6 8 10 12 * b 5 12 21 32 How about if we want to multiply each element of a by 2? One way would be to enter > a * rep (2 . 4) [1] 2 4 6 8 or. 2013 . There are many different ways to create such figures. # 4 because length ( a ) = 4 11. 3 .cos ( x ) > plot (x . y ) J OHN S TACHURSKI October 10.1:4 <.3. 0 −0. type = " l " ) 11. y. For example.rnorm (40) > plot (x .3. If we prefer blue we use the command plot(x. lines and text to a plotting area. GRAPHICS 306 1.5 0. graphical commands are either “high-level” or “low-level.11. 2013 .3.3. To produce a blue line. type="l"). High-level commands are built on top of low-level commands. If you give only one vector to plot() it will be interpreted as a time series. as in figure 11. try > x <.2: Illustration of the plot command This produces a scatter plot of x and y. J OHN S TACHURSKI October 10. try plot(x.5 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G −3 −2 −1 0 x 1 2 3 Figure 11. col="blue".0 y 0. col="blue") instead.1 High-Level Graphical Commands In R.0 −1. offering a convenient interface to the creation of common statistical figures. Low-level commands do things like add points.2.” The function plot() is an example of a high-level command. y. producing figure 11. 4. The code for producing it is > x <. then use par.5 0.rlnorm (100) > hist ( x ) # lognormal density The output is shown in figure 11.0 y 0.0 −1. Histograms are a really common way to investigate a univariate data sample.5 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G −3 −2 −1 0 x 1 2 3 Figure 11. Figure 11.0 −0.rnorm (1000) > hist (x . GRAPHICS 307 1.5 is a bit fancier. breaks =100 . 2013 . To produce a histogram we use hist: > x <.3.11. col = " midnightblue " ) If you want a background color too.3: Illustration of plot again The important thing to remember about creating figures in R is that the commands for a given figure should contain one and only one high-level command. which sets parameters for the figure. followed optionally by multiple low-level commands. J OHN S TACHURSKI October 10. Let’s discuss some more examples of high-level commands. See the on-line help for details (type “?par” without the quotes). 2013 .4: Illustration of the hist command Histogram of x 50 Frequency 0 −4 10 20 30 40 −2 x 0 2 Figure 11. GRAPHICS 308 Histogram of x Frequency 0 0 10 20 30 40 50 2 4 x 6 8 10 Figure 11.11.3.5: Illustration of hist again J OHN S TACHURSKI October 10. > x <. boxplot produces box plots. which adds straight lines. contour produces contour maps of 3D data. which adds polygons (useful for filling regions). which adds points. A 3D graph produced using persp is shown in figure 11. abline. For example. text. which adds arrows. Details are omitted. 2 * pi . 2013 . Still more high-level graphics functions are available if you want to dig deeper. Low-level commands are also available to the user.6: Plotting in 3D There are several other high-level plotting functions for presenting statistical data.2 Low-Level Graphical Commands High-level graphics commands are built on top of low-level commands.seq ( -2 * pi . For example. while persp produces 3D graphs. length =200) J OHN S TACHURSKI October 10.11.3. which adds lines described by x and y coordinates. The following code gives an example. GRAPHICS 309 Figure 11. 11. and can be utilized to add additional components to figures that were initially produced by a high-level command. which adds text. lines. and legend. Examples include points. We will meet some of them as we go along. arrows.6. pie produces pie charts. which adds a legend. and so on.3. barplot produces bar plots. polygon. 0 −6 −4 −2 0 x 2 4 6 Figure 11.11. pdf " ) # Write to a file called foo . to produce a histogram as a PDF.sin ( x ) z <. col = " blue " ) First we call the high-level graphics command plot.3.3. The resulting output is shown in figure 11. 11. or replace the word lines with plot.cos ( x ) plot (x . z .rlnorm (100) > pdf ( " foo .5 1.7: Adding points and lines > > > > y <. 2013 . try > x <.7. type = " l " ) lines (x .0 y 0. GRAPHICS 310 −1. pdf > hist ( x ) J OHN S TACHURSKI October 10. y . For example. You can experiment to see what happens if you issues these commands in the reverse order.5 0.3 Getting Hardcopy R provides range of graphics drivers to produce hardcopy. and then the low-level command lines.0 −0. strings have mode “character. DATA TYPES 311 > dev . 11. We can investigate the type of a variable by using either the mode or class function: > b <. 2013 . type getwd.11.4. We’ll talk more about classes below.1 for a discussion of the latter. such as numbers. mode refers to the primitive data type. In general. Although mode and class returned the same answer in the previous example. 4) > class ( b ) [1] " numeric " > mode ( b ) [1] " numeric " Here both mode and class return "numeric". Floating point numbers (or floats) are a computer’s approximation to real numbers—see §13. Strings are pieces of text—any of the alphanumeric characters and other symbols on your keyboard. text and Boolean values (see below).off signals to the graphics driver that you have finished adding components to foo.4.4 Data Types [roadmap] 11. R now writes the PDF output to a file called foo. whereas class is more specific.c (3 . In R. Another common data type is strings.” > x <.1 Basic Data Types Like most programming languages. R can work with and perform operations on various kinds of data. we will see that this is not always the case." foobar " > mode ( x ) [1] " character " J OHN S TACHURSKI # Bind name x to string " foobar " October 10. off () The command dev. To see what your current working directory is.pdf. which indicates that the elements of the vector are stored as floating point numbers.pdf in your current working directory. 2013 . For example. There are a few other basic data types besides numeric and character. In §11. One is Boolean. Here’s another example. col = midnightblue ) # Wrong ! # Separate by empty string instead? Because then R would think that midnightblue is a variable. then R interprets the sequence of letters as the name of a variable. " bar " ) [1] " foo bar " By default. sep = " " ) [1] " foobar " Here’s a more useful example of paste: > paste ( " Today is " . breaks =100 . and.3. Why do we do this? The reason is that if we don’t add quote marks. breaks =100 . and look for it in the current environment. Boolean values can be stored in vectors. the two strings are separated by a space. col = " midnightblue " ) was used to produce figure 11. that’s passed as an argument to the function hist.11. which holds the logical values TRUE and FALSE.foobar Error : object ’ foobar ’ not found Here the interpreter looks for the variable foobar. not finding it. As with other types. Here "midnightblue" is a string. date () ) [1] " Today is Mon Feb 28 11:22:11 2011 " In the code above. when we work with strings. For example: J OHN S TACHURSKI October 10. > x <. we usually write them between quote marks. DATA TYPES 312 We can concatenate strings using the paste function: > paste ( " foo " .4.5. the command > hist (x . Why can’t we just type > hist (x . issues an error message. you will have noticed that. " bar " . but we can eliminate the space as follows: > paste ( " foo " . 1 " " foobar " We see that the numeric value 1. The elements can now be accessed via these names using the “$” symbol: J OHN S TACHURSKI October 10. For example. where we create a list using the list function: > employee1 <. suppose that we try to make a vector with two different modes: > x <.1 . you can use the abbreviations T and F: > x <. FALSE ) > mode ( x ) [1] " logical " As we’ll soon see.c (T .c ( TRUE . To save a bit of typing. all data must be of the same type.11.TRUE > mode ( x ) [1] " logical " > x <. Boolean vectors are very important in R. A list is like a vector.1 has been converted to a character string. Here’s an example. and. in order to ensure all elements have the same type. in addition. 2013 . TRUE . an obvious question is: What if we want to store several different data types as a single object? For example.c (1. DATA TYPES 313 > x <. let’s say we have data on employees at a given firm. T . except that it’s elements have “names” that can be used to access them. salary =50000) Here we’ve given the names surname and salary to the two elements of the list. " foobar " ) > mode ( x ) [1] " character " > x [1] " 1. F ) > x [1] TRUE TRUE FALSE One thing to remember about vectors is that in any one vector. there’s no restriction on the data types of the elements. How should we accomplish this? In R. and we want to store their surname and salary together. one way to do it is to use a list.4. At this point.list ( surname = " Smith " . whenever we run a regression using the standard linear regression function lm. One final point on basic data types is that sometimes we need to test the type of a variable to make sure it will work in a given operation. Here’s an example: > x <.4.4. we will still be creating quite a lot of lists. 2013 . we may need to change its type. and so on. this function will return a list that contains information about the estimated coefficients. you can extract them with names: > names ( employee1 ) [1] " surname " " salary " Although we won’t have much cause to use the list function during this course. making changes. functions. data is usually stored in data frames. R provides a collection of is. DATA TYPES 314 > employee1 $ surname [1] " Smith " > employee1 $ salary [1] 50000 If you’re dealing with a list and you can’t remember the names of the elements. numeric ( x ) # Convert x to numeric > is . and so on. For example. numeric ( x ) [1] FALSE > sum ( x ) Error in sum ( x ) : invalid type > x <. If not. selecting subsets. they do so using a list. numeric ( x ) [1] TRUE > sum ( x ) [1] 300 11. For this purpose. The reason is that when R functions need to return a whole lot of different information. and as.c ( " 100 " . Data frames are special kinds of lists that are used to store “columns” J OHN S TACHURSKI October 10. reading in data.2 Data Frames In statistical programming we work a lot with data sets. " 200 " ) > is . In R.11. the residuals.as . storing it. 4. $50. a data frame is a list: > mode ( summers ) [1] " list " But it’s not just a list. -34) and then build a data frame: > summers <. For now.11.000 and $70. but the columns can hold different data types (numeric. who paid Summers speaking fees of (roughly) $140. each column corresponds to observations on a particular variable. $70. Over the year to April 2009. let’s have a look at our data frame: > summers fee price 1 140000 -35 2 70000 -100 3 50000 -89 4 70000 -34 J OHN S TACHURSKI October 10. On the financial blog “Alphaville. their share prices fell by 35%.000. Lehman Bros. 89% and 34% respectively. frame ( fee . 50000 . it’s a special kind of list called a data frame: > class ( summers ) [1] " data .data . price ) As discussed above. frame " We’ll talk more about the class function in §11. 70000 .3 below. character.c (140000 . Citigroup and JP Morgan. 100%. Typically. Let’s start with a simple and light-hearted example.000 respectively. Let’s record this in a data frame.c ( -35 . 70000) > price <. The firms in question were Goldman Sachs. -89 . We can do this in different ways. Data frames are a bit like matrices (which we’ll meet later). etc).” Tracy Alloway noted the correlation between speaking fees paid by major financial firms to economist Lawrence Summers (Director of the White House National Economic Council) and the relative stability of their share price during the GFC. -100 . Perhaps the easiest is to first put the data in vectors > fee <.000. 2013 . DATA TYPES 315 of related data.4. DATA TYPES 316 We see that R has numbered the rows for us. frame ( Speak . the columns can be accessed using their names: > summers $ Speak . Price .4. 2] # First row . It’s the same idea with directories (folders) on computers: Files are grouped in different directories to keep the organized by topic. Returning to our data frame. Fee = fee .11. 2013 . but rather they are “hidden” inside the data frame. we can have the same column name in different data frames. Fee Error : object " Speak . Change = price ) > summers Speak .] # All of second row Speak .2] # All of second column [1] -35 -100 -89 -34 J OHN S TACHURSKI October 10.data . the column names by themselves won’t work: > Speak . > summers [1 . This data encapsulation is deliberate. Change [1] -35 -100 -89 -34 70000 (Are you remembering to use the TAB key to expand these long names?) On the other hand. with columns of the data frame corresponding to elements of the list. Change 2 70000 -100 > summers [ . Just as we can have the same file name in different directories. We can produce more descriptive column names as follows: > summers <. second column [1] -35 > summers [2 . we can also access the data in the data frame using “matrix style” index notation. and helps us things organized. Fee [1] 140000 70000 50000 > summers $ Price . and used the variable names as names for the columns. For example. Fee Price . Fee Price . Change 1 140000 -35 2 70000 -100 3 50000 -89 4 70000 -34 Since summers is a kind of list. Fee " not found This is because these variables are not part of the current workspace. names ( summers ) [1] " 1 " " 2 " " 3 " " 4 " > firm <.00 Mean : 82500 Mean : -64.: 87500 3 rd Qu . run a linear regression and then add the line of best fit to the plot: > plot ( summers ) > reg <.75 Max .firm > row . Change -35 -100 -89 -34 One of the nice things about data frames is that many R functions know how to interact with them directly. " JP Morgan " ) > row . names ( summers ) <. data = summers ) > abline ( reg ) J OHN S TACHURSKI October 10. : -34.4.lm ( Price . if we enter > plot ( summers ) we immediately get a scatter plot. Fee . which acts on data frames.00 1 st Qu . " Lehman " .: -91. Let’s see how this works: > row . :140000 Max . Fee 140000 70000 50000 70000 Price .: -34. For example.c ( " Goldman " . : -100. we plot. If we use the function summary we get a summary of the data: > summary ( summers ) Speak . " Citi " . Fee Price .00 Here. DATA TYPES 317 One thing we could do to make the data frame more descriptive is to replace the row numbers with the names of the banks.names function. This is done through the row.75 Median : 70000 Median : -62.: 65000 1 st Qu .50 3 rd Qu . names ( summers ) [1] " Goldman " " Lehman " " Citi " " JP Morgan " Now the summers data frame looks as follows: > summers Goldman Lehman Citi JP Morgan Speak . 2013 . : 50000 Min .11. Change ~ Speak . Change Min . We’ll talk more about this output in §11.3 A Note on Classes In the proceeding example.4.Change −90 −100 −80 G −70 −60 −50 −40 G 60000 80000 100000 Speak. Fee . don’t read too much into this.) More discussion of univariate linear regression is given in §11.11. this is lighthearted.8 shows positive correlation.4.Fee 120000 140000 Figure 11.8: Scatter plot and line of best fit The resulting plot in figure 11. data = summers ) If you now type > summary ( reg ) you’ll be presented with a nice table giving you estimated coefficients and other summary statistics. DATA TYPES 318 G G Price. but for now I’d like you to notice that the function summary was used previously in the code > summary ( summers ) J OHN S TACHURSKI October 10. 2013 .lm ( Price .5. 11. (Again. Change ~ Speak . we created a data frame called summers and ran a regression with the code > reg <.5. we can check that both the data frame summers and the object reg returned by the regression are lists: > mode ( reg ) [1] " list " > mode ( summers ) [1] " list " However. it knows what action to perform.11. 11. and each time R gives an appropriate result. Functions like summary.1 The lm Function To set up an example. that act differently on different objects according to their class. SIMPLE REGRESSIONS IN R 319 which gave a basic description of the data in the data frame summers. 2013 . This second. frame " The function summary. length = N ) > y <. type information is called the objects class: > class ( reg ) [1] " lm " > class ( summers ) [1] " data .seq (0 . let’s generate some data: > N <.25 > x <. the summary function needs to distinguish between these objects.5 + 10 * x + rnorm ( N ) J OHN S TACHURSKI October 10. 1 . more specific. What’s interesting here is that we are using the same function summary with two very different arguments. first investigates its class.5. It’s useful to know how this works. when passed an object such as reg.5. On one hand. Once it knows the class of the object. The way this is accomplished is that the two objects are given addition type information beyond their mode. are called generic functions. so that it knows what kind of information to return.5 Simple Regressions in R Let’s have a quick look at the basic technique for running regressions in R. 11. In this example.3. We can regress y on x as follows: > results <.4 x 0. The list object returned by a called to lm includes as its elements various various other vectors and lists containing the information produced by the regression. It returns an object of class lm. For example.0 Figure 11. along with the function y = 5 + 10x (in blue). is a kind of list.11. 2013 .491593 11.456543 To see the full list.9: The data The data is plotted in figure 11. as discussed in §11.5.lm ( y ~ x ) > class ( results ) [1] " lm " The function lm is the standard function for linear. least squares regression. use names(results).4.9.8 1. J OHN S TACHURSKI October 10. which.0 0. the coefficients are an element of the list: > results $ coefficients ( Intercept ) x 4. SIMPLE REGRESSIONS IN R 320 16 G G G 14 G G G G 12 G G G G G G 10 y G G G G 8 G G G 6 G G G G 4 G 0.6 0. we have bound the name results to this list.2 0. 2013 . lty =2 .2 0. In figure 11.11.6 0. double thickness 11. for example.10 we’ve added a regression line via the call > abline ( results . Another such function is abline. > coef ( results ) > fitted . SIMPLE REGRESSIONS IN R 321 16 G G G 14 G G G G 12 G G G G G G 10 y G G G G 8 G G G 6 G G G G 4 G 0. This argument is called a formula. fitted values and so on from results.2 Formulas Now let’s have another look at the syntax of our call to lm.0 Figure 11.3.0 0.4.5. several generic functions perform further calculations based on the results to provide information about the regression. which is a special syntax used for J OHN S TACHURSKI October 10. values ( results ) > residuals ( results ) On top of these extractor functions that obtain low level information about the output of the regression. We learned about summary in §11. which adds the regression line to a plot. The argument we used was y ~ x.5.4 x 0.8 1. Try. lw =2) # Dashed .10: The regression line There are in fact various “extractor” functions for obtaining coefficients. we sometimes want R to evaluate the expressions in our formulas as ordinary arithmetic operations. First. For example. rather than x itself.11. 2013 . suppose that we want to regress y on the square of x. SIMPLE REGRESSIONS IN R 322 specifying statistical models in R. and indicates that y should be regressed on x and z. In regressions such as the one above. the intercept is included by default. Let’s look at some further examples. if you happen to have another predictor z. then you can regress y on x and z via the call > lm ( y ~ x + z ) What’s important to remember is that the “+” symbol does not represent addition per se—it is part of the formula.5. To remove the intercept we can use the formula > lm ( y ~ 0 + x + z ) # Or lm ( y ~ x + z . J OHN S TACHURSKI October 10. rather than be regarded as part of the formula.1) Finally. We can also include it explicitly via the call > lm ( y ~ 1 + x + z ) but this is exactly equivalent to the last call. This can be achieved by the call > lm ( y ~ I ( x ^2) ) The function I indicates that the operation x^2 should be treated as an arithmetic operation. when you start R. you will have a directory (folder) that contains files related to the project you are working on—data files. then R will 323 . we’ll learn about working with files. an internal variable is initialized that stores the current working directory. however. If you are using something other than a Linux machine. if you are using a Windows machine and have a directory called "C:\Foobar" on your computer.1 The Current Working Directory Often. 12. This can cause confusion. Regardless of your operating system. On part of this confusion is that some operating systems use the backslash symbol \ to separate directories in a path. On the other hand. script files containing R programs.1. these two directories will not be automatically matched. 12. figures you have created. This is the directory where R looks for any files you reference at the command line.1 Input and Output Writing programs and getting data in and out of R involves sending information to the screen and working with files. R uses /. when you are working with R. and so on. Let’s run through the basics of how this is done. In this chapter. while others use the forward slash /. For example. and also how to code up simple programs.Chapter 12 Appendix B: More R Techniques Now let’s dig a bit deeper into R. and writes files that you create. I’ve created a file on my computer called "test. 2013 . txt " ) Read 10 items > x [1] 10 9 8 7 6 5 4 3 2 1 If you want to see the contents of your current working directory. It is a text file that contains the single line 10 9 8 7 6 5 4 3 2 1 The full path to the file on my Linux machine is /home/john/emet_project/test.txt". This can be accomplished as follows: > setwd ( " / home / john / emet _ project / " ) and the call to scan now succeeds: > x <. Please be aware of this in what follows.scan ( " test .scan ( " test . Let’s look at an example.1. including shifting the file into the current working directory.12.txt" is not in the current working directory. you can do so with the dir function: > dir () [1] " test . txt " ) but receive an error message telling me the file cannot be found. and convert forward slashes to backslashes as necessary. INPUT AND OUTPUT 324 refer to this directory as "C:/Foobar". There are a few ways I can rectify this problem. The problem is that "test.txt If I start R in "/home/john" then this will be my current working directory. but the best solution is to change the current working directory to where the file is. This can be checked with the command getwd: > getwd () [1] " / home / john " Now I try to read the data in "test. txt " J OHN S TACHURSKI October 10.txt" into a vector with the scan function: > x <. There are work-arounds for all operating systems. pdf " " test . try the file. txt " Two points to finish this section: First. The scan function returns either a vector or a list. When working with data sets. off () null device 1 > dir () [1] " foo . the file will be saved in the current working directory: > pdf ( " foo . Second.table function.choose function. this should open up a dialog box which will allow you to select the file. 2013 . INPUT AND OUTPUT 325 If I next create a figure and save it.12. choose () ) On Windows machines. it’s tedious to have to manually set the current working directory every time you start R. a much more common function to use is the read.2 Reading Data in We have already me the scan function. which attempts to read the data from a file into a data frame. which is a low-level I/O function for reading data from files. as in the following example: x <. Let’s look at an example. there may be times when you want to read data from a file that is located somewhere on your computer. Googling will tell you what you need to know.1.1. depending on the contents of the file. 12. and it’s easiest to locate that file by point and click. pdf " ) > plot ( x ) > dev . I’ve created a text file on my computer called "testdf. In that case.txt " with two columns of data that looks as follows: X Y 1 10 2 20 3 30 J OHN S TACHURSKI October 10.scan ( file = file . read . STATA. SPSS and SAS. (One place to find those details is in your browser. You can skip lines of crud at the start of your file using skip.1.) Another thing you can do with read. frame " Here I’ve set header=TRUE because the first row contains column names. header = TRUE ) > df X Y 1 1 10 2 2 20 3 3 30 > class ( df ) [1] " data . header = TRUE ) On your home installation this command should work fine. J OHN S TACHURSKI October 10. However. 2013 . pass is your password. net / emet / testdf . and then proceed as before. table ( " http : / / johnstachurski . and proxy is your proxy server. (R can also handle many foreign data file formats. which must be aware of the proxy server if it’s working.) If the above command still doesn’t help. For example. table ( " testdf . R can read and write data files in the formats used by Excel.table . At work or at university it may not. INPUT AND OUTPUT 326 This file is in my current working directory. txt " .table and other input functions is read data directly from the Internet by giving the URL instead of a filename: > read . Please look up the documentation as required. which you can investigate ?read. where HTTP traffic is routed through a proxy server.table function has many options. setenv ( http _ proxy = " http : / / user : pass@proxy :8080 " ) Here user is your username. The read. If R doesn’t know about the proxy.12. you will not be able to see your password. txt " . then you can set it as follows: Sys . you can always save the file in question to your local hard disk with your browser. because many office and university computers are behind firewalls. work with comma separated values via sep and so on. and I can read it into a data frame as follows: > df <. One important use for this function is as a substitute for the print function. txt " ) There are many options to this command. which we met in §12.3 Other I/O Functions There are several low level functions for input and output. you can get information from them along the following lines: > x <. For example.1.) By specifying a file name. if other people are going to use your program. The vector can then be converted into a matrix if desired—we’ll talk about this process later. If no file name is given. and I leave you to investigate.1. we can also write this information to the hard disk: > cat ( " The value of x is " .3 > cat ( " The value of x is " . The last few functions we have discussed pertain to input. This function is quite flexible. Let’s talk briefly about output. " \ n " .1. A lower level function for data output is cat. For example. so you may need to call a function such as as. INPUT AND OUTPUT 327 12. > x <. One is scan. with more detailed control. file = " test .1. try something along the lines of > write . txt " ) J OHN S TACHURSKI October 10. 2013 .scan () 1: 1 12 3 55 128 # Me typing numbers in 6: # And hitting return a second time Read 5 items > x [1] 1 12 3 55 128 Another way to get information in is via the readline function.numeric if you want to convert x to a numeric value. " \ n " ) The value of x is 3 The final "\n" tells R to end with a new line. x . but the most common use is the one we mentioned: Reading a sequence of numbers from a text file and converting it to a vector. then scan reads from the screen: > x <. (Try without if you’re not sure what I mean.12. To write a whole data frame. table ( summers . x .readline ( " Enter the value of x : " ) Note that the result will be a string. " summers . isolate problems. type = " s " ) > points (x . 2013 . There are many reasons why you’ll need to write and work with scripts.sort ( rnorm (50) ) .12.1.5 . etc. Suppose for example that we type the next two commands at the prompt: > plot ( x <. make incremental improvements and so on.1: Step function 12. In interpreted languages such as R. The resulting program can then be shared with colleagues.1. these commands produce the graph in figure 12. J OHN S TACHURSKI October 10. We might be rather happy with this figure. a sequence of commands saved in a file is often called a script. cex = . A more generic term is program. col = " dark red " ) On my computer.4 Scripting Another kind of data you will want to read into R is sequences of commands. typing the commands one-by-one at the prompt becomes impractical. INPUT AND OUTPUT 328 G G G G G G G G G G G G G G G 1 G G G G x <− sort(rnorm(50)) 0 G G G G G G G G G G G G G G G G G G G G G G G G −1 G G G G −2 G G G 0 10 20 Index 30 40 50 Figure 12. and decide to save these commands in a file.1. or a programs. As the tasks you implement become longer and more complicated. When you have a long sequence of commands that need to be executed in a certain order. writing a script allows to you run the whole sequence of commands easily. and entering commands at the prompt (i. For starters. These features may sound minor. On Linux there are many options. which reads in commands from a specified text file: > source ( " foobar . you should have the option to create a script. There are plenty of good open source text editors. Selecting this option opens up a simple text editor. but in fact they make coding much faster and more enjoyable. and install a nice one on your home computer. For a short script. and they don’t do it well.. 1 Yes. Good text editors include features such as syntax highlighting.1 for information on how to set the current working directory.12.e.1. The latter is always useful for testing. and rapid prototyping of programs. code completion. running commands via a script produces the same result as typing the commands at the prompt one by one. note that the script must be in the current working directory. Alternatively. automatic indentation. word processors aren’t designed for the job. google around to find out about text editors that play well with R. a simple one is Notepad. On Windows. the most important thing to remember is that scripts are saved in text files. A final comment: Generally speaking. but they are quite different. J OHN S TACHURSKI October 10. text files are manipulated with text editors.. no-one will take you seriously if you tell them you code in MS Word. Students often confuse text files with word processor documents. But please don’t. See §12. and then copy and paste into R’s interpreter (i. into the command line. while on Mac OS you can use TextEdit. you might find that above the R command line there are several menus. where the > prompt is). By all means. Once you are familiar with R. getting help. It’s just not cool. Now let’s say that we’ve written a script in some text editor and we want to run it. 2013 . depending on the version of R you are running. R will not be able to read a program written in a word processor such as MS Word unless you specifically save it as a text file. One exception is as if you save as text then you can actually use a word processor to write scripts.e. INPUT AND OUTPUT 329 When creating a script for the first time.1 Usually. In the “File” menu. the most elementary way to run it is to open up the text file with the text editor. and so on. While this is not a bad way to start off. a much faster way is to use the R function source. However. Pretty much all computing environments supply a basic text editor. More importantly. you will find that your work is a mix of reading in commands from programs (source files). R " ) For serious scripting this second method is the only viable one.1. interacting directly with the interpreter). 2. numeric ( " foo " ) [1] TRUE J OHN S TACHURSKI October 10. Testing for equality and inequality is as follows: > 2 [1] > 2 [1] == 3 FALSE != 3 TRUE # Note double equal sign ! ! The exclamation mark means “not. In a script run via source.2.” and reverses truth values.) Some expressions in R evaluate to either TRUE or FALSE. numeric ( " foo " ) [1] FALSE > ! is . TRUE and FALSE can be shortened to T and F. using the relational operators > and >=. For example: > 2 [1] > 3 [1] > 3 FALSE >= 2 TRUE Here we’re testing for strict and weak inequality respectively. For example: > is . you will need an explicit call to print or cat to get the information to the screen. CONDITIONS 330 follows: If I have a variable x and I type x at the prompt and hit return.2 Conditions [roadmap] 12.TRUE > x [1] TRUE (In many situations.12. 2013 .1 Comparisons We have already met the logical values TRUE and FALSE: > x <. I get the value of the variable. 12. When applied to vectors. CONDITIONS 331 Note the double equal sign when testing for equality. 2013 . Inf behaves much as the symbol ∞ does in comparisons.e.c (0 . Often.Inf # [1] NaN > 1 + Inf [1] Inf > 0 > . these operators produce element by element comparisons. 2) > y <.c (1 .2 While on the topic of greater than and less than.2.1 == 2 FALSE = 2 2 # Testing equality # Assignment . For example. let’s create a vector x and then ask which elements are greater than 1: J OHN S TACHURSKI October 10.12. as well as arithmetic operations: > 1 > Inf [1] FALSE > 10^100 > Inf [1] FALSE > Inf + Inf [1] Inf > Inf . one interesting numerical object in R is Inf. consider: > x > x [1] > x > x [1] <. A single equal sign means assignment (i. equivalent to x <. For example.. is equivalent to <-). For example: > x <.Inf [1] TRUE > 10 / Inf [1] 0 Result is NaN ( Not a Number ) Let’s continue our discussion of the relational operators. we want to compare a whole vector against a single value. 10) > x > y [1] TRUE FALSE Here x[1] is compared against y[1] and x[2] is compared against y[2]. we can also consider the statements P AND Q and P OR Q. 0) > x > 1 [1] FALSE TRUE TRUE TRUE FALSE 12. To understand these operators. and true otherwise.” or “I live on Mars”. The statement P AND Q is true if both P and Q are true. and false otherwise. such as “2 is greater than 3. 9 . 5 .2.2 Boolean Operators Relational operators can be combined with the Boolean operators AND and OR. In R.2. consider two statements P and Q. 5 . The statement P OR Q is false if both P and Q are false.c (1 . so true # OR : Both true .2.3 Boolean Arithmetic As we saw earlier. 7] FALSE FALSE TRUE FALSE FALSE <= 1 | x > 5 TRUE FALSE FALSE TRUE TRUE 12. CONDITIONS 332 > x <.12. the values TRUE and FALSE are primitive data types in R. The operators AND and OR can also be applied to vectors. 4 . of class logical: J OHN S TACHURSKI October 10. so false # OR : One true . As usual.c (1 . 4 . 2013 . the operators AND and OR are represented by the symbols & and |respectively: > 1 [1] > 1 [1] > 1 [1] > 1 [1] < 2 & TRUE < 2 & FALSE < 2 | TRUE < 2 | TRUE 2 < 3 2 < 1 2 < 1 2 < 3 # AND : Both true . Given statements P and Q. 0) >= 5 & x <= 7 # All x in the interval [5 . 9 . so true # AND : One false . so true Try experimenting with different combinations. the action is performed element by element: > x > x [1] > x [1] <. c ( TRUE . FALSE ) ) [1] 2 This is very handy.5 3.2. 4 .5 4. if we want to know how many elements of a numerical vector y exceed 3.0 3. TRUE . FALSE . if y is any vector. 2013 . then > y [ y > 3] J OHN S TACHURSKI October 10. For example. For example. TRUE ) > y [ index ] # Extract first and last element of y [1] 2 4 This feature of vector indexing allows us to perform conditional extraction on vectors with very simple syntax. For example. we can use > mean ( y > 3) Can you see how this works? Make sure that it’s clear in your mind.4 Conditional Extraction One important fact regarding logical values is that vectors can be indexed by logical vectors. FALSE . > y <. length =5) > y [1] 2. FALSE .0 2. CONDITIONS 333 > mode ( TRUE ) [1] " logical " One important property of logical values is that they can be used in algebraic expressions.0 > index <. we can use the command > sum ( y > 3) If we want to know the fraction of elements of y that exceed 3.seq (2 . where TRUE evaluates to one and FALSE evaluates to zero: > FALSE + TRUE [1] 1 > FALSE * TRUE [1] 0 > sum ( c ( TRUE .12. 12.2. 2013 .75 F 9. while > if (3 > 2) print ( " foo " ) prints “foo”. rather than at the command line. Listing 18 gives a (completely artificial) example of the full if-then-else construct. and the second of which records his or her salary. The first few lines of wages are as follows: 1 2 3 4 5 6 sex salary F 11.03096 Take your time to think through how this works. 12.2.2.42 M 9. This works because the expression inside the square brackets produces a logical vector. and see what it does. J OHN S TACHURSKI October 10. Try reproducing and then running it.79 M 8. In the simplest case. > if (2 > 3) print ( " foo " ) prints nothing. we can combine logical expressions with the if statement to determine whether a piece of code should be executed or not. like so: > mean ( wages $ salary [ wages $ sex == " F " ]) [1] 10. Let’s suppose we have a data frame called wages.12.92 F 10.21 F 10. the first column of which records the sex of the individual. For example.5 If-Then-Else Next let’s discuss the if-then-else construct. contained in a small program written in a text file. CONDITIONS 334 returns all elements of y that exceed 3. Here’s an example of what we can achieve with conditional extraction.90 How can we compute the average wage for females? We can do it in one line. Conditionals are mainly used in programs. and only the elements of y corresponding to TRUE are extracted. e. " foo " .3. For example..e.3 Repetition The beauty of computers is that they can perform lots of small caculations quickly— much faster than a human. If a human needs to intervene and give the command for J OHN S TACHURSKI October 10. " bar " ) [1] " foo " > ifelse ( -1 > 1 . " bar " ) [1] " bar " # Returns " foo " # Returns " bar " The first statement inside the brackets is evaluated. If true. consider the following: > ifelse (1 > -1 . suppose we have a vector of data on years of schooling. " ) # Do something else } Often. This can be accomplished as follows: > tertiary <. the if-then-else syntax can be replaced by the convenient function ifelse. binary) variable in a new vector tertiary that has value 1 if more than 12 years of schooling has been attained (i. tertiary educated) and zero if not. 0) > tertiary [1] 0 0 1 0 1 1 0 12.. The function ifelse is vectorized.12.readline ( " Enter your password : " ) if ( password == " foobar " ) { print ( " Welcome " ) # Do something } else { print ( " Access denied . If false. " foo " . 2013 . the second value is returned. including university: > ys [1] 10 12 15 12 16 17 11 We want to create a dummy (i. the third value is returned.ifelse ( ys > 12 . REPETITION 335 Listing 18 If-then-else password <. To illustrate the latter. 1 . What we want to do is provide a set of instructions at the start. REPETITION 336 each calculation explicitly. This is done using loops. detailing all the calculations in a parsimonious way. The next piece of code performs this task: > x <. you’ve already learned a simpler way of performing the same calculation: > sum (1:100) [1] 5050 In R. the latter is more efficient than the for loop. 12. this kind of misses the point. i is stepped through each element of the vector 1:100. and the calculation x <. Listing 19 J OHN S TACHURSKI October 10. Suppose we want to simulate flipping a coin 1.1. The previous example loop is just for illustration. and count the number of heads we get in the process.3. i x <<<<<. Suppose for example that we want to sum the integers from 1 to 100. > > x i x i x . <<0 1 x + i 2 x + i # Many lines omitted 100 x + i You can see that would be more than a little tedious.0 > for ( i in 1:100) { x <.3. We’ll talk about why that’s the case in §12.1 For Loops The most common kind of loop is a for loop. 2013 . Let’s look at another example of a for loop. Next.12.5. As a matter of fact.000 times. Another way to get R to do this would be to write it all out in full: > > > > > .x + i is performed at each step.x + i } > x [1] 5050 How does this work? First x is set to zero. because I program in several languages. heads ) Once again.12.5) num . Listing 19 A for loop num . REPETITION 337 shows how we can do this using a loop. Once again. 2 Personally. The coin flip is simulated by drawing a uniform random number between zero and one.2 I often favor explicit loops over specialized R functions. depending on whether the number of employees is in [0. 1000] or [1000. This is why we need the explicit print call. there’s an easier way to do this in R.num . heads + 1 } print ( num .5) If you’re not sure about this last one. For loops are very often used to step through the indices of a vector.0 for ( i in 1:1000) { b <.runif (1) if ( b < 0. (Can you see how it works?) So will > rbinom (1 .5) will also do the job. M. For example. medium and large). the outcome is regarded as heads. for example). there are special functions in R that can be used to avoid this loop (see the function cut. heads <. ∞). you need to read up on the binomial distribution. let’s say that we have a numeric vector with data on firm sizes (in number of employees) called fsize. > sum ( runif (1000) < 0. and L (small. [500. to get the value of num.3. How does it work? The variable i is stepped through each element of the vector 1:1000. This can be accomplished as in listing 20. For example. and my brain is more comfortable with generic—rather than R-specific—coding styles. heads <. J OHN S TACHURSKI October 10. it’s been written in a text file. Since the program is a bit longer. 2013 . and the commands inside the curly brackets are performed at each step. 500). size =1000 . prob =0. If the number is less than 1/2.heads sent to the screen. We want to create a new vector ftype that replaces these numbers with the labels S. Suppose we want to model flipping a coin until the time the first head appears.character (0) for ( i if if if } # Empty character vector in 1: length ( fsize ) ) { ( fsize [ i ] < 500) ftype [ i ] <. and calculate the average value of the random variable over these repetitions. 2013 .face == 0 evaluates to FALSE." M " ( fsize [ i ] > 1000) ftype [ i ] <. especially with short programs performing standard operations. random variable has the so-called negative binomial distribution.2. we want to simulate a random variable that returns the number of the flip resulting in the first head. and a for loop would be a natural choice to step through these possibilities." S " ( fsize [ i ] >= 500 & fsize [ i ] <= 1000) ftype [ i ] <. because there are many convenient functions that avoid the need for explicit looping." L " From the previous examples. In other words.000 times. which can be simulated using rnbinom. However.4 To do this. In the listing.3. REPETITION 338 Listing 20 Another for loop ftype <. As the for loop progresses. 3 This J OHN S TACHURSKI October 10. Often this is true. We’ll produce our own implementation for the sake of the exercise.000 different possible subsets. At the end we take the mean and print it. for longer programs. There are over 1. and see which has the highest adjusted R squared. Now let’s repeat this simulation 1. it might seem that for loops are almost unnecessary in R. as seen in listing 22.3. The logic of the program is illustrated in figure 12.2 While Loops Now let’s briefly look at while loops. this vector is populated with the result of each individual simulation. 12. 4 This gives us an estimate of the expected value—more on this later.3 An implementation is given in listing 21. Suppose for example that we want to regress a variable such as inflation on all possible subsets of ten different regressors. we put our while loop inside a for loop. This loop continues to execute the statements inside the curly brackets until the condition coin. explicit loops are pretty much essential. the second line creates an empty vector outcomes.12. 0 coin . num + 1 b <. num ) Figure 12. num <.2: Flow chart J OHN S TACHURSKI October 10. REPETITION 339 Listing 21 A while loop flip . 2013 .1 } print ( flip .5) coin .3. num <.12.flip .runif (1) if ( b < 0. face <. face <. face == 0) { flip .0 while ( coin . 4. repetitions ) # Set up empty vector for ( i in 1: num . fitting a polynomial to those points. y gets return value Another built-in function is integrate. let’s spend some time learning the mechanics of building our own functions. face <. 2013 5 .flip .4. 12. which performs numerical integration.runif (1) if ( b < 0. num # Record result of i . many of which we’ve already met.12. sum is a function.0 while ( coin . face == 0) { flip .0 coin . As the next step along our journey to becoming good programmers.5) coin . 5) computes the integral −1 cos( x )dx numerically by evaluating the function cos at a number of points between -1 and 5. num . It takes a vector as its argument (also called parameter).rep (0 . num + 1 b <.1 } outcomes [ i ] <. -1 .4 Functions One aspect of programming that separates good and bad programmers is their use of functions. num <. On my computer.1 Built-In Functions R provides numerous functions as part of the R environment. repetitions <. repetitions ) { flip . face <.flip . For example. and returns the sum of that vector’s elements: > y <. For example. > integrate ( cos . and returning the integral of this approximating polynomial. num <. FUNCTIONS 340 Listing 22 Loop inside a loop num .1000 outcomes <. this function call returns the output J OHN S TACHURSKI October 10.th flip } print ( mean ( outcomes ) ) 12.sum ( x ) # x is argument . ” We begin with an example. a number representing the lower bound of integration. user-defined functions become almost indispensable. First. because if a and b differ. col = " blue " ) # Order of a and b reversed The argument "blue" is a named argument. one must use = rather than <-. It’s often convenient to define our own user-defined functions.) The function integrate takes three arguments: a function representing the integrand. the function call > plot (a . 5] via the call J OHN S TACHURSKI October 10. a string. R knows that the elements of a correspond to x-axis values. Second. which are part of the R environment. or even another function. if there are many arguments. Typically. This example illustrates the fact that R functions can take any object as an argument—a vector.4. named arguments have a default value attached. then so will the output of the call > plot (b . Further discussion of “why” is left to §12. In particular.4. b . In §12. and often many more. a . when we start writing longer programs. while those of b correspond to y-axis values. this function receives at least two arguments. which means that if such an argument is not supplied.3 e -14 (You can verify that this is approximately correct by computing − sin(5) + sin(−1). Consider. since a appears before b. Note that we write col="blue" rather that col<-"blue". the function can revert to a sensible default. When assigning values to positional arguments. Their meaning is determined from their order in the function call. Let’s look at a third example: the plot function. then distinguishing by names rather than position makes it easier to remember the roles of the different arguments. a data frame. 2013 . FUNCTIONS 341 -0.1174533 with absolute error < 4. for example.12. Named arguments serve two purposes. col = " blue " ) # a and b are vectors The first two arguments are called positional arguments. In fact. The ordering is clearly important here. For now let’s concentrate on “how.4.4. with col being the name of the argument. and a number representing the upper bound of integration.4. 12.2 User-Defined Functions The functions sum and plot are examples of built-in functions.1 we used the built-in function integrate to calculate the approximate integral of the cosine function over [−1. > f <.97950 with absolute error < 2. FUNCTIONS 342 > integrate ( cos . For example. We can do this by creating our own user-defined function that represents y = x2 cos( x ) and then passing it to integrate. we can have as many arguments as we like.12. For example.909932 > 3 * 3 * cos (3) [1] -8. the variable name x can be any name here. using the function name followed by the argument in brackets. passing f as the first argument to integrate: > integrate (f . Let’s test out our function to check that it works as expected: > f (3) [1] -8. while function(x) indicates that we are creating a function with one argument.4.function ( x ) return ( x * x * cos ( x ) ) Here f is just a name that we’ve chosen arbitrarily. we are now ready to perform the integration. called x. Simple R functions like the one we have just defined are much like the mathematical notion of a function. we want to compute −1 x2 cos( x )dx. as we confirmed in the third and fourth line. -1 . 2013 5 . In doing so. for whatever reason. For example. The first step is to create the function: > f <. just as y = x2 cos( x ) and y = z2 cos(z) describe exactly the same functional relationship. The return value is the right one. 5) -18. Anyway.5 e -13 Now let’s create some more complicated functions.function ( z ) return ( z * z * cos ( z ) ) creates the same function as did the previous function definition. 5) Now let’s suppose that. The built in function return determines what value the function will return when we call it. the code > f <.function () print ( " foo " ) > f () [1] " foo " J OHN S TACHURSKI October 10.909932 In the first line we are calling our function. -1 . consider > g <. the default value will be used: > h (1) [1] 2 > h (1 . with each call involving a sequence of commands. intercept =3) [1] 5 Note that. Notice the curly brackets in the first and last line.9) [1] 1 Why did calling f with a small number return a big number? 5 Actually. that represents the probability of heads for our (biased) coin. they always have just one return value (i. the function in listing 23 packages the simulation in listing 21 in a more convenient form. then bundle that data into a vector or a list. and no specified return value. The function in listing 23 has an argument q. If the function is called without specifying a value for intercept. To create such a function. 3) [1] 25 We can also create functions with named arguments: > h <.5 For an example with two arguments. Many functions are larger than the ones we’ve described.01) [1] 408 > f (.e.function (x . 2013 . f returns the string "foo". FUNCTIONS 343 creates and calls a function with no arguments. y ) return ( x ^2 + 2 * x * y + y ^2) > g (2 . For example. Commands inside these brackets are executed at each function call. The function returns the number of flips it took to obtain the first heads. intercept =0) return ( intercept + 2 * x ) Here the statement intercept=0 in the definition of the function h indicates that intercept is a named argument with default value 0.. which indicate the start and end of the function body respectively. we enclose the commands in curly brackets.function (x . although functions can have many arguments. If you want to send back multiple pieces of data from your user-defined function.4. Here’s how the function is called: > f (. they always return a single object). J OHN S TACHURSKI October 10.12. even though we did not specify a return value. and the local x is lost. In this case. Any use of the name x inside that environment is resolved by first looking for the variable name inside this environment. face <.1 binds the name x to 1. consider the following code: > x <.0 while ( coin . Once execution of the function call finishes.1 > f <.1 # with prob q } return ( flip .4. For example. When we execute the function call f. 2013 .function ( q ) { # q = the probability of heads flip . num <. The next assignment x <. num <. Execution returns to the global J OHN S TACHURSKI October 10. and. if it is not found.0 coin . This variable has local scope.12. face <. and bound to the value 2. then looking in the global environment.4. the local variable x is created in a separate environment specific to that function call. which involved binding x to 2.function () { x <. print ( x ) } > f () [1] 2 > x [1] 1 You might find it surprising to see that even after the function call f. num ) } 12.2.flip . and the local value 2 is printed.runif (1) if ( b < q ) coin . FUNCTIONS 344 Listing 23 A longer function f <. the local environment created for that function call is destroyed.2 occurs inside the body of the function f. the name x is found in the local environment. Here’s how it works: The first assignment x <.3 Variable Scope One technical issue about functions needs to be mentioned: Variables defined inside a function are local variables. This variable has what is called global scope. face == 0) { flip . because it is created outside of any function. num + 1 b <. when we query the value of x at the end we find it is still bound to 1. 12. Computers may be getting faster. Now. sooner or later you will bump up against the limits of what your computer can do. R looks for this variable name in the global environment.1 Efficiency If you persist with statistics and econometrics. In this case. and the programming problems tackled by econometricians are becoming increasingly complex.5. because they are used to solve specific tasks. For example.4 Why Functions? The single most important reason to use functions is that they break programs down into smaller logical components. more time can be spent making that one implementation as good as possible. Each of these logical components can be designed individually. It would not be a happy situation in those variable names conflicted with the variable names that you are using in your global environment. and everyone had to write their own. but data sets are also getting larger. The reason is data encapsulation: If you call some function in R that implements a complex operation. 2013 . Moreover. if there’s just one implementation that everyone uses. GENERAL PROGRAMMING TIPS 345 environment. This would involve an enormous duplication of effort. Almost all programming languages differentiate between local and global variables. the value returned is 1.12. This process of reducing programs to functions fits our brains well because it corresponds to the way that solve complex problems in our heads: By breaking them down into smaller pieces.5 General Programming Tips Let’s finish up our introduction to the R language by covering some general pointers for writing programs in R (and other related languages). suppose for some reason that R had no integrate function for numerical integration.5. that function will likely declare lots of variables that you have no prior knowledge of.4. functions encourage code reuse. Related to this point is the fact that. since numerical integration is common to many statistical problems. Throwing money at these J OHN S TACHURSKI October 10. when we query the value of x. 12. 12. considering only the task it must perform. R is written mainly in C. you decide to write statistical procedures in R rather than hand-code them in C. Once off compilation prior to execution is efficient for two reasons. In fact. it’s worth understanding the basics of how interpreted languages like R work. pre-compiled from purpose-built C routines. and you will soon find out: Programming statistics in C is a painful experience. However. like most people. GENERAL PROGRAMMING TIPS 346 problems by buying new hardware often makes little difference. a language like C can be hundreds of times faster than R in certain operations. The cost is that you lose low-level control relative to the alternative option of hand-coding C yourself. but it is optimized for humans. and optimize the machine code accordingly. and human time is far more valuable than computer time. you can still learn from the preceding discussion. we need to pack big batches of operations into individual commands. If you are faced with such a problem. Let’s compare it against the natural. this is done in one pass of the entire program. Even if. Why don’t we all just program in C then. individual commands are converted to machine code on the fly. In this course we won’t be dealing with huge data sets or enormous computational problems. then you will almost always need to look at the efficiency of your code. in an interpreted language such as MATLAB or R. First. prior to execution by the user. if it’s so much faster? Go ahead and try. Second. In a compiled language such as C. whereas compiled languages like C need do this only once. You can think of R as a friendly interface to C. R may not be optimized for computers. and all you have to do to use them is type in intuitive commands at the prompt. The benefit is that most of the necessary C routines have been coded up for you.12. This allows R to pass the whole operation out to optimized machine code. interpreted languages must pay the overhead of continually calling the machine code compiler. we can learn that to program R efficiently. suitable for statistical calculations. as in listing 24. In particular. All standard computer programs must be converted into machine code before they can be executed by the CPU. As a result. the compiler is able to see the program as a whole. 2013 . in order that code can be structured appropriately. vectorized operation in R: J OHN S TACHURSKI October 10. suppose we write our own naive function to obtain the square of all elements in a vector x. To illustrate these ideas. Fortran or Java. On the other hand.5. for the vast majority of your code. 99% of your CPU time will be used by a tiny subset of the code that you write. it can be far slower. and only if excessive run-time of the program justifies the extra work.x [ i ]^2 } return ( y ) } > n <. clarity is the priority.046 > system . Clarity is crucial.024 0. time ( f ( x ) ) user system elapsed 3. As a rule of thumb. Only this code needs to be optimized.004 0. When operations are not vectorized. the performance of R is often quite similar to that of C. vectorized calculations are far more efficient than explicit loops.423 We see that our function is over 100 times slower.2 Clarity and Debugging Having talked about efficiency. and pass it to the compiler in an optimal way. the vectorized method allows R to see the problem as a whole. Often. GENERAL PROGRAMMING TIPS 347 Listing 24 A slow loop f <. Once you’ve spent a few days debugging your programs. because R blindly steps through the instructions in the for loop.12.5. it’s now very important to note that very little of your code needs to be optimized.10^6 > x <. time ( x ^2) user system elapsed 0.5. translating into machine code as it goes.412 0. 12.012 3.numeric ( length ( x ) ) for ( i in 1: length ( x ) ) { y [ i ] <.function ( x ) { y <. 2013 . On the other hand. Fortran or Java.rnorm ( n ) > system . it will become very clear to you that. When R calculations can be vectorized. since writing and reading programs is not an easy business for J OHN S TACHURSKI October 10. however. Although this technique for tracking your variables is not very sophisticated. because you don’t have the benefit of an error message. and to you when you come back to your code after several weeks. months or years. Errors come in two main classes: syntax errors. whitespace at the start of a line of code). which are logical mistakes that cause your program to operate incorrectly. For example.6. this is irrelevant to the R interpreter: Whitespace is ignored.1 Distributions in R R has handy functions for accessing all the common distributions.) The text can be anything.6 More Statistics [Roadmap] 12. That said. One is to add comments to your programs. there are some things you can do to make it easier on yourself and others who read your programs. should you have a bug in one of your programs. you will still find that a lot of your programming time is spend hunting down bugs. Adding comments is helpful to others who read your code. These can be very difficult to track down. y . but the idea is to make a useful comment on your code. followed by some text. (A comment is a # symbol. which are flaws in the syntax of your instructions. (Like comments. the indentation in listings 22 and 23 helps to separate different logical parts of the program for the reader. These functions have the form lettername J OHN S TACHURSKI October 10.6. " \ n " ) so that you can keep track of your variables during execution. 12. and cause R to issue an error message. a good first step is to fill the program with calls to cat such as cat ( " x = " . Another useful technique to improve the clarity of your code is to use indentation (i. " and y = " . it can be extremely helpful. x . and I won’t say much about it..12. MORE STATISTICS 348 human beings.) Despite your best efforts at clarity. Debugging is a bit of an art. apart from suggesting that. The other kinds of errors are semantic errors. 2013 .e. 7883051 0.4566147 > set . seed (123) > runif (5) # Back to start again [1] 0. without resetting seed [1] 0.6. such as when you run a simulation experiment.2875775 0. By default. sometimes it’s helpful to set the initial conditions. The meanings of the letters are p d q r cumulative distribution function density function quantile function generates random variables Here are a couple of examples: > pnorm (2 . lnorm (log normal). these initial conditions come from the system clock and are different each time you call on the random number generator. d. Here’s an example of usage: > set .8830174 0. Pseudo random numbers follow well defined patterns determined by initial conditions. and you want others to be able to reproduce your results. mean =.5514350 0. you should be aware that “random” numbers generated by a computer are not truly random. This is a good thing if you want new draws each time. By clever design of these rules. s on [2 . The way to do this is via the function set. 4] See the documentation for further details on these functions. v . MORE STATISTICS 349 where “name” is one of the named distributions in R.2875775 0.4089769 0. q or r. unif (uniform). 4) > qcauchy (1 / 2) # median of cauchy distribution > runif (100 .seed.9404673 > runif (5) # Draw again . etc. 2013 .8830174 0.5281055 0.7883051 0. cdf of N (.1 . seed (123) > runif (5) # First draw [1] 0.0455565 0.8924190 0. These sequences are called pseudo random numbers. However. 4) # 100 uniform r . 2 .12. which sets the initial conditions (seed) for the random number generator. With respect to random number generation. it is possible to generate deterministic sequences the statistical properties of which resemble independent draws from common distributions.1 .4089769 0. sd =2) # F (2) . and “letter” is one of p .9404673 J OHN S TACHURSKI October 10. In fact they are not random at all—they are generated in a purely deterministic way according to specified rules. such as norm (normal). 50 .3 Working with Matrices Let’s look at how to work with matrices in R. As we saw in §11.6. 40 .1:4 <. 60) . 1) > dim ( x ) [1] 20 1 # Make x a column vector This is useful for performing matrix multiplication. > dim ( x ) <. nrow =3 . 30 .12. we can alter the dim attribute to make x a column or row vector. This can be observed by displaying the dim attribute associated with a given vector via the function dim. MORE STATISTICS 350 12. Here’s an example: > x <. we create the matrices A and B defined on page 69: > A <. 2013 .1] [ . Here. ncol =2) > A [ . an operation that distinguishes between row and column vectors.1 <..2] J OHN S TACHURSKI October 10.rnorm (20) > dim ( x ) NULL # Create an arbitrary vector x in R # By default .5 * x + y * x + b * y # # # # # # # Vector Vector Scalar Scalar Scalar multiplication Vector addition Both together 12. they have no notion of being either row vectors or column vectors.6. scalar multiplication and vector addition are performed in a natural way: > > > > > > > x y a b a x a <. We can also create them from scratch if we need to. In particular.4. x is a flat vector If we wish to.2. Most often matrices are read in from data files. or created from other computations.5:8 <.6.2 Working with Vectors Vectors in R are rather similar to the abstract mathematical notion of a vector we’ve just discussed.c (20 . 20 .matrix ( c (10 . 2] [ .6.2] [ . ncol =3) > B [ .] 2 4 6 Internally.] <.] NA NA NA [2 . 2013 .] 30 60 > B <.1] [ .] <. 6) > B [ .3] [1 .3] [1 .3] [1 .12.1] [ . 5) > B [2 . 6) . 2 .1] [ . 5 . we can create B in two steps: > B <.] 2 4 6 Alternatively. 2 .] 20 50 [3 . 5 . nrow =2 .] 2 4 6 > class ( B ) # Now B is a matrix [1] " matrix " Finally. here’s a third way that I use often: > B <. 3 .2] [ .3] [1 . 4 . MORE STATISTICS 351 [1 . 3) # Set the dimension attribute > B [ .c (2 .c (1 .] 1 3 5 [2 .c (2 .] 1 3 5 [2 .] NA NA NA > B [1 .matrix ( nrow =2 . 4 . In particular.] 1 3 5 [2 . 3 .] 10 40 [2 . 6) > class ( B ) # B is a flat vector with no dimension [1] " numeric " > dim ( B ) <.2] [ .1] [ . matrices in R are closely related to data frames. we can access elements of the matrices using square brackets followed by row and column numbers: J OHN S TACHURSKI October 10. 3 . 4 . ncol =3) > B [ .c (1 .matrix ( c (1 . 2013 .1] [ .] 11 23 35 [2 .1] [ . MORE STATISTICS 352 > A [3 .] 80 200 360 This is natural.] 10 20 30 [2 .2] [ .6.] 90 190 290 [2 .1] [ . and uses the symbol combination %*%: > A %*% B [ . but A and B are. because it’s consistent with the algebraic operations on vectors we saw earlier.2] [ . the product AB is formed by taking as it’s i. all columns All rows .] # [1] 30 60 > A [ .3] [1 .3] [1 . j-th element the inner product of the i-th row of A and the j-th column of B. 2] # [1] 40 50 60 Third row .2] [ .] 150 330 510 As we learned above.] 40 50 60 Addition is now straightforward: > t(A) + B [ . second column A and B are not conformable for addition.] 42 54 66 Notice that the multiplication symbol * gives element by element multiplication. Matrix multiplication is different.] 120 260 400 [3 .3] [1 .12. as follows > t(A) * B [ .2] [ .3] [1 . For example: J OHN S TACHURSKI October 10.] 10 60 150 [2 .1] [ . second column Third row . 2] # [1] 60 > A [3 . To take the transpose of A we use the transpose function t: > t(A) [ . Let’s check this: > det ( A ) [1] 236430 The inverse of A can be calculated as follows: J OHN S TACHURSKI October 10.3] [ .2] [ .] 1 [2 .2] [ .1] [ .2] [ .] 190 260 330 [3 .1] [1 . According to fact 2.3.] 290 400 510 > t(B) %*% t(A) [ .3] [1 .3.] % * % B [ .] 2 [3 .] 90 120 150 [2 .] 18 32 80 and b is the column vector > b [ . 2013 .3. a unique solution will exist provided that A has nonzero determinant (and is therefore of full rank).12.] 58 25 3 [2 . Suppose now that A is the matrix > A [ .] 290 400 510 Let’s have a look at solving linear systems of equations.3] [1 .1] [1 .] 90 120 150 [2 .3] [1 .1] [ .1] [ .] 43 97 90 [3 .5 that (AB) = B A : > t(A %*% B) [ .] 3 We are interesting in solving the system of equations Ax = b.6. MORE STATISTICS 353 > A [2 .] 400 We can also check the claim in fact 2.] 190 260 330 [3 . First.3] [1 .] -0.01924883 In view of (2.02939136 [2 .03350252 [3 . As a rule. In R. 4) .] 0. 2013 .2] [ .1] [1 .6.] 0.solve ( A ) > Ainv [ .3).007697839 0.008053124 0.005946792 0.matrix ( c (1 .Ainv % * % b > x [ . b): > solve (A .] 0.) The answer is that R will make an educated guess. but for large matrices the latter method is significantly more accurate and computationally efficient. it turns out that there are more efficient ways to do this numerically.04428795 While this is valid in theory.04428795 In this case we can see that there was no difference in the results produced by the two techniques.001564945 -0. depending on which of the two choices is conformable. nrow =2) J OHN S TACHURSKI October 10.1] [ .019396862 -0.] -0. we can solve for x as A−1 b: > x <.00828575 [2 .12. MORE STATISTICS 354 > Ainv <. computing inverses of matrices directly is relatively unstable numerically. Since a vector in R has no dimension attribute.02939136 [2 . suppose that we want to use a vector in a matrix operation such as matrix multiplication.020640359 -0. Let’s conclude this section with a few tips for working with matrices and vectors. Other techniques are available that mitigate this problem. 2 . The next piece of code illustrates: > A <.03350252 [3 .02153280 [3 .] 0. in the sense that round off errors have large effects.] -0.] -0.] 0.1] [1 . b ) [ . the preferred method is to use the formula solve(A. 3 . how will R know whether to treat it as a column vector or a row vector? (Clearly the result will depend on which choice is made. ] 2 4 > x <.2] [ . 3) > b <.1] [1 .c (4 .] 2 3 4 October 10. > a <.] 3 6 > rbind (a . b ) a b [1 .] 2 5 [3 . 5 .12. with no dimension NULL > A %*% x [ .c (5 .] 1 [2 . and R always gives the inner product.1] [ .2] [1 .] 17 39 In the first case x is treated as a column vector.6.] 1 4 [2 .1] [ . The case x %*% x is ambiguous. 6) > dim ( x ) # x is a flat vector . b ) [ .3] a 1 2 3 b 4 5 6 The function diag provides simple way to extract the diagonal elements of a matrix: > A [ . while in the second it is treated as a row vector.] 34 > x %*% A [ .] 23 [2 .2] [1 .1] [ . 6) > cbind (a . 2013 J OHN S TACHURSKI . MORE STATISTICS 355 > A [ .1] [ . 2 .c (1 . Here’s an example that forms a matrix by stacking two vectors row-wise and then column-wise.2] [1 .] 1 3 [2 . Sometimes we need to combine matrices or vectors. seed (1234) > N <.) Somewhat confusingly.] 0 0 0 0 [3 .] 12. 4) [ .4 Multiple Regression in R Let’s look briefly at running multiple regressions in R.] 0 0 0 0 [internally.1] [ .12. want to center each row around its mean. write function. 2013 .4] [1 .3] [1 .] 1 1 [2 .3] [ . 4 . example. and compare it with the result of direct calculations according to the theoretical results we have derived. so some vectorized operations can be performed one-off.2] [1 .6. matrices are just flat arrays.] 1 1 > matrix (0 .1] [ .] 0 0 0 0 [2 .2] [ .] 1 1 [3 .500 J OHN S TACHURSKI October 10. As a starting point. which we discussed in §11. Here’s how: > matrix (1 . use apply. MORE STATISTICS 356 > diag ( A ) [1] 1 4 (The trace can now be obtained by summing. the same function is also used to create diagonal matrices: > diag (3) [ .2] [ . 2) [ . Let’s now recall the method. we generate some synthetic data: > set .] 0 0 0 0 [4 . 3 . we often need to create matrices of zeros or ones.6.] 0 1 0 [3 .] 0 0 1 Last but not least.1] [ .5.] 1 0 0 [2 . The standard method is to use the function lm. The coefficient vector β has been set to (1. 1.4.12. Hence.9532119 1. you should be aware that the direct method is less stable numerically. runif ( N ) ) > y <.6.20.X % * % beta + rnorm ( N ) The seed for the random number generator has been set so that the same pseudorandom numbers are produced each time the program is run.c (1 . see exercise 6. > results <.5. We can run an OLS regression in R as follows.cbind ( rep (1 . exactly the same results are produced.lm ( y ~ 0 + X ) In the call to lm.X % * % solve ( t ( X ) % * % X ) % * % t ( X ) J OHN S TACHURSKI October 10.3. For an illustration of this numerical instability.results $ residuals > P <. runif ( N ) . 2013 . 1) > X <. which correspond to the ESS and SSR respectively: > yhat <.1] [1 .results $ fitted .0045879 > solve ( t ( X ) % * % X ) % * % t ( X ) % * % y [ .] 1. The matrix X consists of a vector of ones as the first column. let’s compare the estimate of the coefficient vector produced by lm to the one we get when using the theoretical formula directly: > results $ coefficients X1 X2 X3 0. MORE STATISTICS 357 > beta <. K = 3. 1 . because we already set up X to have a constant vector as the first column. values > uhat <. we’ll look instead at the squared norms of the vectors. However. but may be significant when data sets are large and multicollinearity is present in the data.9532119 [3 . Numerical error will be zero or negligible in mosts settings.9662670 0. 1) for simplicity.] 0. the term 0 indicates that a constant term should not be added—in this case.] 0. N ) . For a discussion of multicollinearity. and then two columns of observations on non-constant regressors.0045879 Here. ˆ .9662670 [2 . As a first step. see §7. Since these vectors Let’s also check our theoretical methods for computing y ˆ and u are too large to print out and compare. EXERCISES 358 > M <. direct method [1] 2002. this should agree with y ˆ 2 / y 2 .8137 According to the theory. In a given vector.squared : 0.diag ( N ) . 1 means that the individual was employed at the associated point in time.222 > sum ( uhat * uhat ) # Residuals calculated by lm () [1] 458. direct method [1] 458. [new exercise: computing a Lorenz curve from a data set on the web.1. 2013 . Suppose we are processing vector of zeros and ones. computing and plotting a set of Lorenz curves from a collection of data sets.12. 12. 12. and maybe.4417 Again. the calculation by lm can be viewed using the summary function: > summary ( results ) # Some lines omitted Multiple R .2. How could you use sum to determine whether a numeric vector x contains the value 10? Ex.8136919 The results are in agreement.7. Regarding the coefficient of determination.222 > sum (( P % * % y ) * ( P % * % y ) ) # Fitted vals .7 Exercises Ex. while 0 means unemployed. the results of lm and the results of our direct calculations from our theoretical results are in exact agreement. Let’s check this: > sum ( yhat * yhat ) / sum ( y * y ) [1] 0.7.P > sum ( yhat * yhat ) # Fitted values calculated by lm () [1] 2002.4417 > sum (( M % * % y ) * ( M % * % y ) ) # Residuals . using J OHN S TACHURSKI October 10.7. Write a function that takes such a vector (of arbitrary length) and compute the longest (consecutive) period of employment. The vectors corresponds to the employment histories of individuals. 12. 12. or using built in piecewise linear function approximation. 2013 .7.] [add more exercises!] J OHN S TACHURSKI October 10. make this a computer lab as well? to replace computer lab on chaotic dynamics? Lorenz curves can be calculated with for loops. EXERCISES 359 a loop. is an element of) A. then the statement x ∈ A means that x is contained in (alternatively. Actually. This set is denoted by R. and we understand it to contain “all of the numbers.” where the latter refers to the set of imaginary numbers. or the set of all monkeys in Japan. If A is a set.Chapter 13 Appendix C: Analysis 13. but we don’t need to talk any more about this. A set is a collection of objects viewed as a whole. The statement A = B means that A and B contain the same elements (each element of A is an element of B and 360 .) Other examples of sets are the set of all rectangles in the plane. (In this case the objects are numbers. then A ⊂ B means that any element of A is also an element of B. “real” is in contrast to “imaginary. and we say that A is a subset of B. If B is another set. R is an example of a set.” R can be visualized as the “continuous” real line: 0 It contains both the rational and the irrational numbers.1 Sets and Functions In the course we often refer to the real numbers. the imaginary numbers are no more imaginary (or less real) than any other kind of numbers. What’s “real” about the real numbers? Well. and so on. The union of A and B is the set of elements of S that are in A or B or both: A ∪ B := { x ∈ S : x ∈ A or x ∈ B} Here and below.1. It means “and/or”. “or” is used in the mathematical sense. the open inverval ( a. For arbitrary a and b in R.1 then I ⊂ R. −3 ∈ R. Also.13. Commonly used subsets of R include the intervals.1. if I is the irrational numbers. 0 ∈ R. b) = { x ∈ R : x < b}. For example. and so on. b ) : = { x ∈ R : a < x < b } while the closed inverval [ a. half lines such as (−∞.1: Sets A and B in S vice versa). e ∈ R. Let S be a set and let A and B be two subsets of S. 2013 . SETS AND FUNCTIONS S 361 A B Figure 13. b] is defined as [ a. The intersection of A and B is the set of all elements of S that are in both A and B: A ∩ B := { x ∈ S : x ∈ A and x ∈ B} The set A \ B is all points in A that are not points in B: A \ B := { x ∈ S : x ∈ A and x ∈ / B} The complement of A is the set of elements of S that are not contained in A: Ac := S \ A :=: { x ∈ S : x ∈ / A} J OHN S TACHURSKI October 10. π ∈ R. b) := { x ∈ R : a ≤ x < b}. b) is defined as ( a. b ] : = { x ∈ R : a ≤ x ≤ b } We also use half open intervals such as [ a. as illustrated in figure 13. 2: Unions. irrationals are those numbers such as π and whole numbers. Also. 1 The √ 2 that cannot be expressed as fractions of J OHN S TACHURSKI October 10.} ⊂ Q ⊂ R The empty set is. The following statements are true: 1. 2.1. . 2. ( A ∪ B)c = Bc ∩ Ac and ( A ∩ B)c = Bc ∪ Ac . I ⊂ R. Figure 13. since R consists of the irrationals I and the rationals Q.1. etc. 3. . The next fact lists some well known rules for set theoretic operations. For example. the set containing no elements. we have Q ⊂ R. If the intersection of A and B equals ∅. Q ∪ I = R.13. 2013 . SETS AND FUNCTIONS 362 Figure 13. intersection and complements Here x ∈ / A means that x is not an element of A. A ∪ B = B ∪ A and A ∩ B = B ∩ A. Fact 13.1. . unsurprisingly. Let A and B be subsets of S. then A and B are said to be disjoint. N := {1. Qc = I.2 illustrate these definitions. It is denoted by ∅. ( Ac )c = A. A function f from A to B is a rule that associates to each element a of A one and only one element of B. and written f ( a). and continuous if it is continuous at x for all x ∈ A. 4. From the definition. 13. That element of B is usually called the image of a under f . A function f : A → B is called continuous at x if f (xn ) → f (x) whenever xn → x. If we don’t know it’s morning. so feel free to skip this). We say that xn converges to x ∈ R if xn − x → 0. (For each n = 1. xn → 0. we have a corresponding xn ∈ R. Bottom right is not a function because the top point on the left-hand side is not associated with any image. Each position of the two hands is associated with one and only one time in the morning. Whole branches of mathematics are built on this idea. 2. there exists an N ∈ N such that | xn | < whenever n ≥ N . given any neighborhood of 0. More formally (we won’t use the formal definition.1 Functions Let A and B be sets. we won’t be doing any formal proofs using the definition—we just state it for the record. A \ B = A ∩ Bc . 2013 . . Let’s say we know it’s the morning. think of the hands on an old school clock. J OHN S TACHURSKI October 10.2 Convergence and Continuity Let { xn }∞ n=1 be a sequence of real numbers. N Now let {xn }∞ n=1 be a sequence of vectors in R . SETS AND FUNCTIONS 363 3. Symbolically.1.3 is instructive.13. the sequence points are eventually in that neighborhood. Let A ⊂ R N and B ⊂ R M . 13. The relationship is no longer functional. . one position of the hands is associated with two different times. Top right is not a function because the middle point on the left-hand side is associated with two different points (images). Symbolically.4 illustrates. given any > 0. However. however.1. this means that f is a function from A to B. Figure 13. . xn → x. If we write f : A → B.1. Figure 13. The notion of continuity is also massively important to mathematical analysis. This is the fundamental notion of convergence in R N . am and pm.) We say that xn converges to 0 if. For an example of a function. this is not allowed. 1. SETS AND FUNCTIONS 364 Figure 13.4: Continuity J OHN S TACHURSKI October 10.3: Function and non-functions Figure 13. 2013 .13. f is a function such that f ( a) is a number for each a ∈ A. consider the function f : (0. then m( x ) ≤ m( x ). we have now shown that g( a∗ ) ≥ g( a) for all a ∈ A In other words. 2013 . let’s recall the notions of supremum and infimum. 1) defined by f ( x ) = x. 1) → (0. J OHN S TACHURSKI October 10. OPTIMIZATION 365 Figure 13. we can always choose another point a∗∗ ∈ (0. Since a∗ is a maximizer of f . it must be the case that f ( a) ≤ f ( a∗ ).5: Monotone transforms preserve maximizers 13. a∗ is a maximizer of g on A. and let g be the function defined by g( a) = m( f ( a)). 1) such that a∗∗ = f ( a∗∗ ) > f ( a∗ ) = a∗ . Since m is monotone increasing. No maximizer exists and the optimization problem maxx∈(0. 1). Given that a was chosen arbitrarily. and let f : A → R. Before finishing this topic. To illustrate.1) f ( x ) has no solution.13.2 Optimization Monotone increasing transformations of functions do not affect maximizers: Let A be any set. It should be clear that f has no maximiser on (0. 1): given any a∗ ∈ (0. this implies that m( f ( a)) ≤ m( f ( a∗ )). A maximizer of f on A is a point a∗ ∈ A such that f ( a∗ ) ≥ f ( a) for all a ∈ A Now let m be a monotone increasing function. in the sense that if x ≤ x . Our claim is this: Any maximizer of f on A is also a maximizer of g on A. That is. It’s easy to see why this is the case.2. Let a ∈ A. LOGICAL ARGUMENTS 366 To get around this kind of problem. then the supremum s := sup A is the unique number s such that a ≤ s for every a ∈ A. and.Proof by induction.13. For example.3. For example. 1].Contrapositive. Many logical statements are of the form “if it’s an A. 1]. Relate to (7.27) and (8.Role of counterexamples. 2013 . Simple example with Venn diagram. 1) = 1. (Examples. To show it’s correct. while maxx∈(0. there exists a sequence { xn } ⊂ A such that xn → s. 13.) .4?) . 1) and [0. If A is a set. Simple examples. (fact 2.1) f ( x ) := sup{ f ( x ) : x ∈ (0. moreover. Relate to discussion of statistial learning and induction. and prove that it’s a B. we take an arbitrary A.7).2. To show it’s false. and. then it’s a B”. moreover. (Give examples of this process. 1)} = sup(0. One can show that the supremum and infimum of any bounded set A exist. 0 is the supremum of both (0.) The statement may be correct or incorrect. Returning to our original example with f ( x ) = x. Explain how it’s useful with an example from the text.1) f ( x ) is not well defined. we often the notion of supremum instead.3 Logical Arguments To be written: . there exists a sequence { xn } ⊂ A such that xn → i. J OHN S TACHURSKI October 10. and any set A when the values −∞ and ∞ are admitted as a possible infima and supremum. 1) and [0. 1 is the supremum of both (0. supx∈(0. we provide a counterexample. The infimum i := inf A is the unique number i such that a ≥ i for every a ∈ A. CA. 367 . [10] Freedman. Springer. [7] Doeblin. (1996): Probability: Theory and Examples. W. 9. 77–105.” Macroeconomic Dynamics. [3] Casella. and R. [13] Hayashi. 55–67. E. [8] Durrett. (1938): “Expos´ e de la theorie des chaˆ ıns simples constantes de ` un nombre fini d’´ Markov a etats. L. D. 971–87.” Technometrics. Union Interbalkanique. (2008): Asymptotic Theory of Statistics and Probability. [11] Greene. [2] Bishop.Bibliography [1] Amemiya. G. W. Springer.” Journal of Political Economy 86 (6). Prentice-Hall. Kennard (1970): “Ridge Regression: Biased Estimation for Nonorthogonal Problems. E. [9] Evans. Duxbury Press. New Jersey. (1993): Econometric Analysis. Springer. R. (2006): Pattern Recognition and Machine Learning. (2001): Analysis for Applied Mathematics. and R. (2000): Econometrics. Princeton UP. A. (1994): Introduction to Statistics and Econometrics. T. [14] Hoerl. (1978): “Stochastic implications of the life cycle-permanent income hypothesis. Cambridge UP. New York. H. F. Duxbury Press.M. Luc and Gabor Lugosi (2001): “Combinatorial Methods in Density Estimation” Springer-Verlag. Math. and S. 561–583. G. A. Harvard UP. W. R. 2. (2009): Statistical Models: Theory and Practice. [5] Dasgupta. C. A. [6] Devroye. [4] Cheney. 12 (1). Honkapohja (2005): “An Interview with Thomas Sargent. [12] Hall. Berger (1990): Statistical Inference. W.” Rev. ” Economic Notes.” Econometrica. V. S. [20] Wasserman. 443–450. I and Y. J. 76 (2). Springer-Verlag. and V. D. Nishiyama (2010): “Review on Goodness of Fit Tests for Ergodic Diffusion processes by Different Sampling Schemes. [21] Williams. Springer. [19] Vapnik. 64 (6). (2009): Economic Dynamics: Theory and Computation. and A. J. Pakes (1996): “The Dynamics of Productivity in the Telecommunications Equipment Industry. [18] Stachurski.BIBLIOGRAPHY 368 [15] Olley. Martin (2008): “Computing the Distributions of Economic Models via Simulation. (2000): The Nature of Statistical Learning Theory. MIT Press. [16] Negri. 39. (2004): All of Statistics. L. J OHN S TACHURSKI October 10. (2001): Probabiliity with Martingales. Cambridge Mathematical Textbooks. N. [17] Stachurski. 2013 . New York. 1263-97. 91– 106.” Econometrica. New York. 314 Delta method. 140 Empirical risk minimization. 119 Asymptotically unbiased. 38 Density. 92 Asymptotic variance. 65 Basis functions. 36 Chi-squared distribution. 32 Convergence in probability. 321 369 Adapted process. 119 Convergence in distribution. 134 Empirical risk. 164 Cumulative distribution function. 156 Consistent. 12. 97 Conditional probability. 6 Confidence interval. 332 Cauchy-Schwartz inequality. 114 Expectation. 115 Binary response model. 24 Class. 362 Ergodicity.Index √ N -consistent. 298 Complement. 30. 16 Current working directory. 124 Boolean operators. 119 Asymptotically normal. vector. 76 Convergence in mean square. 361 Conditional density. 134 Empirical cdf. 319 Coefficient of determination. 362 ecdf. 68 Diagonal matrix. 286 Bernoulli random variable. 231 Floating point number. 173 . 67 Column vector. 179 F-distribution. 10 Bias. 18 Determinant. 74 Covariance. 16 Central limit theorem. 28 Critical value. 231 Annihilator. 25 Filtration. 32. 180 Bayes’ formula. see Empirical risk minimization Estimator. 140 Empty set. 236 ERM. 26 Conditional expectation. 323 Data frame. 65 Disjoint sets. 311 Formula (in R). 183 Column space. 52 cdf. 119 Basis. 53 Command history. 13 Expectation. 53 Dimension. R. 72 Explained sum of squares. 119 Consistent test. 256 Homoskedastic. 90 Parametric class. 97 List. 236. 6 Independence. 360 Joint density. 68 Irrational numbers. 313 Log likelihood function. 67 Gaussian distribution. 365 Information set. 115 Measurability. 68 Inverse transform method. 61. 183 Plug in estimator. 86 Orthogonal vectors. 122 Linear combinations. 287 Probit. 7 Least squares. 55 Linear independence. 86 Orthogonal projection theorem. 148 Law of large numbers. 24 Generalization. 229 Matrix. 59 Linear function.INDEX 370 Linear subspace. 162. 92 October 10.s. 365 Maximum likelihood estimate. 231 Inner product.v. 122 Logit. 26 Joint distribution. 10 Induction. 207 Nonnegative definite. 287 Power function. of r. 136 Positive definite. 95 Moment. 27 Indicator function. 163 Ordinary least squares. 109 Inf. 139 Marginal distribution. 63 J OHN S TACHURSKI . 26 Markov process. 26 Kullback-Leibler deviation. 197 Hypothesis space. 140 Idempotent. 120 Priors. 126 Perfect fit. 361 Inverse matrix. 108 Generic function. 28 Monotone increasing function. 243 Gradient vector. 256 Hessian. 124 Projection matrix. 164 Principle of maximum likelihood. 331 Infimum. 54 Independence. 319 Global stability. 365 Multicollinearity. 50 Intersection. 34 Law of Total Probability. of events. 70 Posterior distribution. 70 Identity matrix. 50 Normal distribution. 2013 Full rank. 141. 53 Maximizer. 176 Likelihood function. 122 Mean squared error. 195 Orthogonal projection. 24 Null hypothesis. 70 Nonparametric class. 126 Norm. 39 Invertible. 84 Overdetermined system. 95. 124 Loss function. 330 Risk function. 50 J OHN S TACHURSKI October 10. 163 Test statistic. 69 Triangle inequality. 365 Symmetric cdf. 157.INDEX 371 Supremum. 57 Rank. uncentered. 114 Vector of fitted values. 164 Unbiased. vector case. 360 Sum of squared residuals. 112 Sample mean. 34 Span. 60 Spectral norm. 164 Slutsky’s theorem. 73. 113 Sample covariance. 164 Type II error. 153 Sampling error. 183 Range. 360 Singular matrix. programming. 28 Variance-covariance matrix. 243 Statistic. 179 Vector. 68 Size of a test. 164 Text editor. 28 Standard error. 50 Set. 53 Standard deviation. 2013 . 112 Sample variance. 67 Rational numbers. 52 Type I error. 311 Student’s t-distribution. real r. 112 Sampling distribution. 112 String. 198 Scalar product. 245 Square matrix. 209 Stationary distribution. 53 Test. 329 Text file. 361 Variable. programming. 115 Uniform distribution. 301 Pythagorean law. 278 Total sum of squares. 84 R squared. 186 R squared. 70 Transition density. 17 Symmetric matrix. 112 Sample correlation. 299 Variance. 163 Relational operators. 25 Subset. 238. 360 Rejection region. 114 Sample standard deviation. 112 Sample mean. 179 Trace.v. 53 Sample k-th moment. 230 Transpose. 178 Vector of residuals.. 360 Real numbers. centered. 328 Tikhonov regularization. 179 Sum. 24 Union. vectors. 139 Row vector.
Copyright © 2025 DOKUMEN.SITE Inc.