Notes for ECE 534
An Exploration of Random Processes for Engineers

Bruce Hajek
July 15, 2011

© 2011 by Bruce Hajek. All rights reserved. Permission is hereby given to freely print and circulate copies of these notes so long as the notes are left intact and not reproduced for commercial purposes. Email to
[email protected], pointing out errors or hard to understand passages or providing comments, is welcome.

Contents

1 Getting Started
  1.1 The axioms of probability theory
  1.2 Independence and conditional probability
  1.3 Random variables and their distribution
  1.4 Functions of a random variable
  1.5 Expectation of a random variable
  1.6 Frequently used distributions
  1.7 Jointly distributed random variables
  1.8 Cross moments of random variables
  1.9 Conditional densities
  1.10 Transformation of random vectors
  1.11 Problems
2 Convergence of a Sequence of Random Variables
  2.1 Four definitions of convergence of random variables
  2.2 Cauchy criteria for convergence of random variables
  2.3 Limit theorems for sequences of independent random variables
  2.4 Convex functions and Jensen’s inequality
  2.5 Chernoff bound and large deviations theory
  2.6 Problems
3 Random Vectors and Minimum Mean Squared Error Estimation
  3.1 Basic definitions and properties
  3.2 The orthogonality principle for minimum mean square error estimation
  3.3 Conditional expectation and linear estimators
    3.3.1 Conditional expectation as a projection
    3.3.2 Linear estimators
    3.3.3 Discussion of the estimators
  3.4 Joint Gaussian distribution and Gaussian random vectors
  3.5 Linear Innovations Sequences
  3.6 Discrete-time Kalman filtering
  3.7 Problems
4 Random Processes
  4.1 Definition of a random process
  4.2 Random walks and gambler’s ruin
  4.3 Processes with independent increments and martingales
  4.4 Brownian motion
  4.5 Counting processes and the Poisson process
  4.6 Stationarity
  4.7 Joint properties of random processes
  4.8 Conditional independence and Markov processes
  4.9 Discrete-state Markov processes
  4.10 Space-time structure of discrete-state Markov processes
  4.11 Problems
5 Inference for Markov Models
  5.1 A bit of estimation theory
  5.2 The expectation-maximization (EM) algorithm
  5.3 Hidden Markov models
    5.3.1 Posterior state probabilities and the forward-backward algorithm
    5.3.2 Most likely state sequence – Viterbi algorithm
    5.3.3 The Baum-Welch algorithm, or EM algorithm for HMM
  5.4 Notes
  5.5 Problems
6 Dynamics of Countable-State Markov Models
  6.1 Examples with finite state space
  6.2 Classification and convergence of discrete-time Markov processes
  6.3 Classification and convergence of continuous-time Markov processes
  6.4 Classification of birth-death processes
  6.5 Time averages vs. statistical averages
  6.6 Queueing systems, M/M/1 queue and Little’s law
  6.7 Mean arrival rate, distributions seen by arrivals, and PASTA
  6.8 More examples of queueing systems modeled as Markov birth-death processes
  6.9 Foster-Lyapunov stability criterion and moment bounds
    6.9.1 Stability criteria for discrete-time processes
    6.9.2 Stability criteria for continuous time processes
  6.10 Problems
7 Basic Calculus of Random Processes
  7.1 Continuity of random processes
  7.2 Mean square differentiation of random processes
  7.3 Integration of random processes
  7.4 Ergodicity
  7.5 Complexification, Part I
  7.6 The Karhunen-Loève expansion
  7.7 Periodic WSS random processes
  7.8 Problems
8 Random Processes in Linear Systems and Spectral Analysis
  8.1 Basic definitions
  8.2 Fourier transforms, transfer functions and power spectral densities
  8.3 Discrete-time processes in linear systems
  8.4 Baseband random processes
  8.5 Narrowband random processes
  8.6 Complexification, Part II
  8.7 Problems
9 Wiener filtering
  9.1 Return of the orthogonality principle
  9.2 The causal Wiener filtering problem
  9.3 Causal functions and spectral factorization
  9.4 Solution of the causal Wiener filtering problem for rational power spectral densities
  9.5 Discrete time Wiener filtering
  9.6 Problems
10 Martingales
  10.1 Conditional expectation revisited
  10.2 Martingales with respect to filtrations
  10.3 Azuma-Hoeffding inequality
  10.4 Stopping times and the optional sampling theorem
  10.5 Notes
  10.6 Problems
11 Appendix
  11.1 Some notation
  11.2 Convergence of sequences of numbers
  11.3 Continuity of functions
  11.4 Derivatives of functions
  11.5 Integration
    11.5.1 Riemann integration
    11.5.2 Lebesgue integration
    11.5.3 Riemann-Stieltjes integration
    11.5.4 Lebesgue-Stieltjes integration
  11.6 On the convergence of the mean
  11.7 Matrices
12 Solutions to Problems

Preface

From an applications viewpoint, the main reason to study the subject of these notes is to help deal with the complexity of describing random, time-varying functions. A random variable can be interpreted as the result of a single measurement. The distribution of a single random variable is fairly simple to describe. It is completely specified by the cumulative distribution function $F(x)$, a function of one variable. It is relatively easy to approximately represent a cumulative distribution function on a computer. The joint distribution of several random variables is much more complex, for in general, it is described by a joint cumulative probability distribution function, $F(x_1, x_2, \ldots, x_n)$, which is much more complicated than $n$ functions of one variable. A random process, for example a model of time-varying fading in a communication channel, involves many, possibly infinitely many (one for each time instant $t$ within an observation interval) random variables. Woe the complexity!

These notes help prepare the reader to understand and use the following methods for dealing with the complexity of random processes:

• Work with moments, such as means and covariances.

• Use extensively processes with special properties. Most notably, Gaussian processes are characterized entirely by means and covariances, Markov processes are characterized by one-step transition probabilities or transition rates, and initial distributions. Independent increment processes are characterized by the distributions of single increments.

• Appeal to models or approximations based on limit theorems for reduced complexity descriptions, especially in connection with averages of independent, identically distributed random variables. The law of large numbers tells us that, in a certain context, a probability distribution can be characterized by its mean alone.
The central limit theorem similarly tells us that a probability distribution can be characterized by its mean and variance. These limit theorems are analogous to, and in fact examples of, perhaps the most powerful tool ever discovered for dealing with the complexity of functions: Taylor’s theorem, in which a function in a small interval can be approximated using its value and a small number of derivatives at a single point.

• Diagonalize. A change of coordinates transforms an arbitrary $n$-dimensional Gaussian vector into a Gaussian vector with $n$ independent coordinates. In the new coordinates the joint probability distribution is the product of $n$ one-dimensional distributions, representing a great reduction of complexity. Similarly, a random process on an interval of time is diagonalized by the Karhunen-Loève representation. A periodic random process is diagonalized by a Fourier series representation. Stationary random processes are diagonalized by Fourier transforms.

• Sample. A narrowband continuous time random process can be exactly represented by its samples taken with sampling rate twice the highest frequency of the random process. The samples offer a reduced complexity representation of the original process.

• Work with baseband equivalent. The range of frequencies in a typical radio transmission is much smaller than the center frequency, or carrier frequency, of the transmission. The signal could be represented directly by sampling at twice the largest frequency component. However, the sampling frequency, and hence the complexity, can be dramatically reduced by sampling a baseband equivalent random process.

These notes were written for the first semester graduate course on random processes, offered by the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign.
Students in the class are assumed to have had a previous course in probability, which is briefly reviewed in the first chapter of these notes. Students are also expected to have some familiarity with real analysis and elementary linear algebra, such as the notions of limits, definitions of derivatives, Riemann integration, and diagonalization of symmetric matrices. These topics are reviewed in the appendix. Finally, students are expected to have some familiarity with transform methods and complex analysis, though the concepts used are reviewed in the relevant chapters.

Each chapter represents roughly two weeks of lectures, and includes homework problems. Solutions to the even numbered problems without stars can be found at the end of the notes. Students are encouraged to first read a chapter, then try doing the even numbered problems before looking at the solutions. Problems with stars, for the most part, investigate additional theoretical issues, and solutions are not provided.

Hopefully some students reading these notes will find them useful for understanding the diverse technical literature on systems engineering, ranging from control systems and image processing to communication theory and communication network performance analysis. Hopefully some students will go on to design systems, and define and analyze stochastic models. Hopefully others will be motivated to continue study in probability theory, going on to learn measure theory and its applications to probability and analysis in general.

A brief comment is in order on the level of rigor and generality at which these notes are written. Engineers and scientists have great intuition and ingenuity, and routinely use methods that are not typically taught in undergraduate mathematics courses. For example, engineers generally have good experience and intuition about transforms, such as Fourier transforms, Fourier series, and z-transforms, and some associated methods of complex analysis.
In addition, they routinely use generalized functions; in particular, the delta function is frequently used. The use of these concepts in these notes leverages this knowledge, and it is consistent with mathematical definitions, but full mathematical justification is not given in every instance. The mathematical background required for a full mathematically rigorous treatment of the material in these notes is roughly at the level of a second year graduate course in measure theoretic probability, pursued after a course on measure theory.

The author gratefully acknowledges the students and faculty (Todd Coleman, Christoforos Hadjicostis, Andrew Singer, R. Srikant, and Venu Veeravalli) in the past five years for their comments and corrections.

Bruce Hajek
July 2011

Organization

The first four chapters of the notes are used heavily in the remaining chapters, so most readers should cover those chapters before moving on.

Chapter 1 is meant primarily as a review of concepts found in a typical first course on probability theory, with an emphasis on axioms and the definition of expectation.

Chapter 2 focuses on various ways in which a sequence of random variables can converge, and the basic limit theorems of probability: law of large numbers, central limit theorem, and the asymptotic behavior of large deviations.

Chapter 3 focuses on minimum mean square error estimation and the orthogonality principle.

Chapter 4 introduces the notion of random process, and briefly covers several examples and classes of random processes. Markov processes and martingales are introduced in this chapter, but are covered in greater depth in later chapters.

The following four additional topics can be covered independently of each other.

Chapter 5 describes the use of Markov processes for modeling and statistical inference. Applications include natural language processing.
Chapter 6 describes the use of Markov processes for modeling and analysis of dynamical systems. Applications include the modeling of queueing systems.

Chapters 7-9 develop calculus for random processes based on mean square convergence, moving to linear filtering, orthogonal expansions, and ending with causal and noncausal Wiener filtering.

Chapter 10 explores martingales with respect to filtrations.

In previous semesters, Chapters 1-4 and 7-9 were covered in ECE 534 at Illinois. In Fall 2011, ECE 534 at Illinois is to cover Chapters 1-5, Sections 6.1-6.8, Chapter 7, Sections 8.1-8.4, and Section 9.1.

A number of background topics are covered in the appendix, including basic notation.

Chapter 1
Getting Started

This chapter reviews many of the main concepts in a first level course on probability theory, with more emphasis on axioms and the definition of expectation than is typical of a first course.

1.1 The axioms of probability theory

Random processes are widely used to model systems in engineering and scientific applications. These notes adopt the most widely used framework of probability and random processes, namely the one based on Kolmogorov’s axioms of probability. The idea is to assume a mathematically solid definition of the model. This structure encourages a modeler to have a consistent, if not accurate, model.

A probability space is a triplet $(\Omega, \mathcal{F}, P)$. The first component, $\Omega$, is a nonempty set. Each element $\omega$ of $\Omega$ is called an outcome and $\Omega$ is called the sample space. The second component, $\mathcal{F}$, is a set of subsets of $\Omega$ called events. The set of events $\mathcal{F}$ is assumed to be a σ-algebra, meaning it satisfies the following axioms (see Appendix 11.1 for set notation):

A.1 $\Omega \in \mathcal{F}$
A.2 If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$
A.3 If $A, B \in \mathcal{F}$ then $A \cup B \in \mathcal{F}$. Also, if $A_1, A_2, \ldots$ is a sequence of elements in $\mathcal{F}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$

If $\mathcal{F}$ is a σ-algebra and $A, B \in \mathcal{F}$, then $AB \in \mathcal{F}$ by A.2, A.3 and the fact $AB = (A^c \cup B^c)^c$.
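For a finite sample space, axioms A.1 through A.3 can be checked mechanically, since countable unions reduce to finite ones. The following is a minimal sketch, not part of the original notes; the function name `is_sigma_algebra` and the four-point sample space are illustrative assumptions.

```python
from itertools import combinations


def is_sigma_algebra(omega, events):
    """Check axioms A.1-A.3 for a collection of subsets of a FINITE set omega.

    For finite omega, closure under pairwise unions implies closure under
    countable unions, so checking pairs is enough for A.3.
    """
    omega = frozenset(omega)
    events = {frozenset(e) for e in events}
    if omega not in events:                            # A.1
        return False
    if any((omega - a) not in events for a in events):  # A.2
        return False
    return all((a | b) in events for a, b in combinations(events, 2))  # A.3


# Hypothetical example: a four-point sample space.
omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
bad = [set(), {1}, {1, 2, 3, 4}]  # the complement {2, 3, 4} is missing

print(is_sigma_algebra(omega, good))  # True
print(is_sigma_algebra(omega, bad))   # False

# Closure under intersection then follows from AB = (A^c ∪ B^c)^c.
good_sets = {frozenset(e) for e in good}
print(all((a & b) in good_sets for a in good_sets for b in good_sets))  # True
```

The check is only meaningful for finite sample spaces; for infinite ones, such as the unit interval, a σ-algebra cannot be enumerated this way.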
By the same reasoning (complementation and De Morgan's law), if $A_1, A_2, \ldots$ is a sequence of elements in a σ-algebra $\mathcal{F}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$.

Events $A_i$, $i \in I$, indexed by a set $I$ are called mutually exclusive if the intersection $A_i A_j = \emptyset$ for all $i, j \in I$ with $i \neq j$. The final component, $P$, of the triplet $(\Omega, \mathcal{F}, P)$ is a probability measure on $\mathcal{F}$ satisfying the following axioms:

P.1 $P[A] \geq 0$ for all $A \in \mathcal{F}$
P.2 If $A, B \in \mathcal{F}$ and if $A$ and $B$ are mutually exclusive, then $P[A \cup B] = P[A] + P[B]$. Also, if $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{F}$ then $P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i]$.
P.3 $P[\Omega] = 1$.

The axioms imply a host of properties including the following. For any events $A, B, C$ in $\mathcal{F}$:

• If $A \subset B$ then $P[A] \leq P[B]$
• $P[A \cup B] = P[A] + P[B] - P[AB]$
• $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[AB] - P[AC] - P[BC] + P[ABC]$
• $P[A] + P[A^c] = 1$
• $P[\emptyset] = 0$.

Example 1.1.1 (Toss of a fair coin) Using “H” for “heads” and “T” for “tails,” the toss of a fair coin is modelled by
$$\Omega = \{H, T\}, \qquad \mathcal{F} = \{\{H\}, \{T\}, \{H, T\}, \emptyset\},$$
$$P\{H\} = P\{T\} = \tfrac{1}{2}, \qquad P\{H, T\} = 1, \qquad P[\emptyset] = 0.$$
Note that, for brevity, we omitted the square brackets and wrote $P\{H\}$ instead of $P[\{H\}]$.

Example 1.1.2 (Standard unit-interval probability space) Take $\Omega = \{\theta : 0 \leq \theta \leq 1\}$. Imagine an experiment in which the outcome $\omega$ is drawn from $\Omega$ with no preference towards any subset. In particular, we want the set of events $\mathcal{F}$ to include intervals, and the probability of an interval $[a, b]$ with $0 \leq a \leq b \leq 1$ to be given by:
$$P[\,[a, b]\,] = b - a. \tag{1.1}$$
Taking $a = b$, we see that $\mathcal{F}$ contains singleton sets $\{a\}$, and these sets have probability zero. Since $\mathcal{F}$ is to be a σ-algebra, it must also contain all the open intervals $(a, b)$ in $\Omega$, and for such an open interval, $P[\,(a, b)\,] = b - a$. Any open subset of $\Omega$ is the union of a finite or countably infinite set of open intervals, so that $\mathcal{F}$ should contain all open and all closed subsets of $\Omega$.
Thus, $\mathcal{F}$ must contain any set that is the intersection of countably many open sets, and so on. The specification of the probability function $P$ must be extended from intervals to all of $\mathcal{F}$. It is not a priori clear how large $\mathcal{F}$ can be. It is tempting to take $\mathcal{F}$ to be the set of all subsets of $\Omega$. However, that idea doesn’t work: see the starred homework showing that the length of all subsets of $\mathbb{R}$ can’t be defined in a consistent way. The problem is resolved by taking $\mathcal{F}$ to be the smallest σ-algebra containing all the subintervals of $\Omega$, or equivalently, containing all the open subsets of $\Omega$. This σ-algebra is called the Borel σ-algebra for $[0, 1]$, and the sets in it are called Borel sets. While not every subset of $\Omega$ is a Borel subset, any set we are likely to encounter in applications is a Borel set. The existence of the Borel σ-algebra is discussed in an extra credit problem. Furthermore, extension theorems of measure theory (see footnote 1) imply that $P$ can be extended from its definition (1.1) for interval sets to all Borel sets.

The smallest σ-algebra, $\mathcal{B}$, containing the open subsets of $\mathbb{R}$ is called the Borel σ-algebra for $\mathbb{R}$, and the sets in it are called Borel sets. Similarly, the Borel σ-algebra $\mathcal{B}^n$ of subsets of $\mathbb{R}^n$ is the smallest σ-algebra containing all sets of the form $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$. Sets in $\mathcal{B}^n$ are called Borel subsets of $\mathbb{R}^n$. The class of Borel sets includes not only rectangle sets and countable unions of rectangle sets, but all open sets and all closed sets. Virtually any subset of $\mathbb{R}^n$ arising in applications is a Borel set.

Example 1.1.3 (Repeated binary trials) Suppose we’d like to represent an infinite sequence of binary observations, where each observation is a zero or one with equal probability. For example, the experiment could consist of repeatedly flipping a fair coin, and recording a one for heads and a zero for tails, for each flip.
Then an outcome $\omega$ would be an infinite sequence, $\omega = (\omega_1, \omega_2, \ldots)$, such that for each $i \geq 1$, $\omega_i \in \{0, 1\}$. Let $\Omega$ be the set of all such $\omega$’s. Certainly we’d want $\mathcal{F}$ to contain any set that can be defined in terms of only finitely many of the observations. In particular, for any binary sequence $(b_1, \ldots, b_n)$ of some finite length $n$, the set $\{\omega \in \Omega : \omega_i = b_i \text{ for } 1 \leq i \leq n\}$ should be in $\mathcal{F}$, and the probability of such a set should be $2^{-n}$. We take $\mathcal{F}$ to be the smallest σ-algebra of subsets of $\Omega$ containing all such sets. Again, the extension theorems of measure theory can be used to show that $P$ can be extended to a probability measure defined on all of $\mathcal{F}$.

For a specific example of a set in $\mathcal{F}$, consider $A = \{\omega \in \Omega : \omega_i = 1 \text{ for all even } i\}$. Let’s check that $A \in \mathcal{F}$. The key is to express $A$ as $A = \bigcap_{n=1}^{\infty} A_n$, where $A_n = \{\omega \in \Omega : \omega_i = 1 \text{ for all even } i \text{ with } i \leq n\}$ for each $n$. For any $n$, $A_n$ is determined by only finitely many observations, so $A_n \in \mathcal{F}$. Therefore, $A$ is the intersection of countably many sets in $\mathcal{F}$, so $A$ itself is in $\mathcal{F}$. Next, we identify $P[A]$, by first identifying $P[A_{2n}]$ for $n \geq 1$. For any sequence $b_1, b_2, \ldots, b_{2n}$ of length $2n$, the probability of the set $\{\omega \in \Omega : \omega_i = b_i \text{ for } 1 \leq i \leq 2n\}$ is $2^{-2n}$. The set $A_{2n}$ is the union of $2^n$ such sets, which are disjoint, due to the $2^n$ possible choices of $b_i$ for odd $i$ with $1 \leq i \leq 2n$. So $P[A_{2n}] = 2^n 2^{-2n} = 2^{-n}$. Furthermore, $A \subset A_{2n}$ so that $P[A] \leq P[A_{2n}] = 2^{-n}$. Since $n$ can be arbitrarily large, it follows that $P[A] = 0$.

The following lemma gives a continuity property of probability measures which is analogous to continuity of functions on $\mathbb{R}^n$, reviewed in Appendix 11.3. If $B_1, B_2, \ldots$ is a sequence of events such that $B_1 \subset B_2 \subset B_3 \subset \cdots$, then we can think that $B_j$ converges to the set $\bigcup_{i=1}^{\infty} B_i$ as $j \to \infty$. The lemma states that in this case, $P[B_j]$ converges to the probability of the limit set as $j \to \infty$.

Footnote 1: See, for example, H.L. Royden, Real Analysis, Third edition.
Macmillan, New York, 1988, or S.R.S. Varadhan, Probability Theory Lecture Notes, American Mathematical Society, 2001. The σ-algebra $\mathcal{F}$ and $P$ can be extended somewhat further by requiring the following completeness property: if $B \subset A \in \mathcal{F}$ with $P[A] = 0$, then $B \in \mathcal{F}$ (and also $P[B] = 0$).

Lemma 1.1.4 (Continuity of Probability) Suppose $B_1, B_2, \ldots$ is a sequence of events.
(a) If $B_1 \subset B_2 \subset \cdots$ then $\lim_{j \to \infty} P[B_j] = P\left[\bigcup_{i=1}^{\infty} B_i\right]$
(b) If $B_1 \supset B_2 \supset \cdots$ then $\lim_{j \to \infty} P[B_j] = P\left[\bigcap_{i=1}^{\infty} B_i\right]$

Proof Suppose $B_1 \subset B_2 \subset \cdots$. Let $D_1 = B_1$, $D_2 = B_2 - B_1$, and, in general, let $D_i = B_i - B_{i-1}$ for $i \geq 2$, as shown in Figure 1.1. Then $P[B_j] = \sum_{i=1}^{j} P[D_i]$ for each $j \geq 1$, so
$$\lim_{j \to \infty} P[B_j] = \lim_{j \to \infty} \sum_{i=1}^{j} P[D_i] \overset{(a)}{=} \sum_{i=1}^{\infty} P[D_i] \overset{(b)}{=} P\left[\bigcup_{i=1}^{\infty} D_i\right] = P\left[\bigcup_{i=1}^{\infty} B_i\right]$$
where (a) is true by the definition of the sum of an infinite series, and (b) is true by axiom P.2. This proves Lemma 1.1.4(a). Lemma 1.1.4(b) can be proved similarly, or can be derived by applying Lemma 1.1.4(a) to the sets $B_j^c$.

[Figure 1.1: A sequence of nested sets.]

Example 1.1.5 (Selection of a point in a square) Take $\Omega$ to be the square region in the plane,
$$\Omega = \{(x, y) : 0 \leq x, y \leq 1\}.$$
Let $\mathcal{F}$ be the Borel σ-algebra for $\Omega$, which is the smallest σ-algebra containing all the rectangular subsets of $\Omega$ that are aligned with the axes. Take $P$ so that for any rectangle $R$,
$$P[R] = \text{area of } R.$$
(It can be shown that $\mathcal{F}$ and $P$ exist.) Let $T$ be the triangular region $T = \{(x, y) : 0 \leq y \leq x \leq 1\}$. Since $T$ is not rectangular, it is not immediately clear that $T \in \mathcal{F}$, nor is it clear what $P[T]$ is. That is where the axioms come in. For $n \geq 1$, let $T_n$ denote the region shown in Figure 1.2. Since $T_n$ can be written as a union of finitely many mutually exclusive rectangles, it follows that $T_n \in \mathcal{F}$ and it is easily seen that $P[T_n] = \frac{1 + 2 + \cdots + n}{n^2} = \frac{n+1}{2n}$.

[Figure 1.2: Approximation of a triangular region.]
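The closed form for $P[T_n]$ can be checked with exact rational arithmetic. The sketch below is not from the original notes. It assumes, consistently with the formula above but without access to the figure, that $T_n$ is a staircase of $n$ side-by-side rectangles of width $1/n$ with heights $1/n, 2/n, \ldots, n/n$; the function name `staircase_probability` is illustrative.

```python
from fractions import Fraction


def staircase_probability(n):
    """Total area of n rectangles of width 1/n and heights 1/n, 2/n, ..., n/n.

    Assumed geometry for T_n (the figure itself is not reproduced here).
    """
    return sum(Fraction(1, n) * Fraction(i, n) for i in range(1, n + 1))


for n in (1, 2, 4, 8, 1024):
    p = staircase_probability(n)
    # Agrees with the closed form (n + 1) / (2n) from the text.
    assert p == Fraction(n + 1, 2 * n)
    print(n, p, float(p))
```

Under this assumption the values decrease toward 1/2 as $n$ grows, since $\frac{n+1}{2n} = \frac{1}{2} + \frac{1}{2n}$.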
Since $T_1 \supset T_2 \supset T_4 \supset T_8 \supset \cdots$ and $\bigcap_j T_{2^j} = T$, it follows that $T \in \mathcal{F}$ and $P[T] = \lim_{n \to \infty} P[T_n] = \frac{1}{2}$.

The reader is encouraged to show that if $C$ is the diameter one disk inscribed within $\Omega$ then $P[C] = (\text{area of } C) = \frac{\pi}{4}$.

1.2 Independence and conditional probability

Events $A_1$ and $A_2$ are defined to be independent if $P[A_1 A_2] = P[A_1] P[A_2]$. More generally, events $A_1, A_2, \ldots, A_k$ are defined to be independent if
$$P[A_{i_1} A_{i_2} \cdots A_{i_j}] = P[A_{i_1}] P[A_{i_2}] \cdots P[A_{i_j}]$$
whenever $j$ and $i_1, i_2, \ldots, i_j$ are integers with $j \geq 1$ and $1 \leq i_1 < i_2 < \cdots < i_j \leq k$. For example, events $A_1, A_2, A_3$ are independent if the following four conditions hold:
$$P[A_1 A_2] = P[A_1] P[A_2]$$
$$P[A_1 A_3] = P[A_1] P[A_3]$$
$$P[A_2 A_3] = P[A_2] P[A_3]$$
$$P[A_1 A_2 A_3] = P[A_1] P[A_2] P[A_3]$$

A weaker condition is sometimes useful: Events $A_1, \ldots, A_k$ are defined to be pairwise independent if $A_i$ is independent of $A_j$ whenever $1 \leq i < j \leq k$. Independence of $k$ events requires that $2^k - k - 1$ equations hold: one for each subset of $\{1, 2, \ldots, k\}$ of size at least two. Pairwise independence only requires that $\binom{k}{2} = \frac{k(k-1)}{2}$ equations hold.

If $A$ and $B$ are events and $P[B] \neq 0$, then the conditional probability of $A$ given $B$ is defined by
$$P[A \mid B] = \frac{P[AB]}{P[B]}.$$
It is not defined if $P[B] = 0$, which has the following meaning. If you were to write a computer routine to compute $P[A \mid B]$ and the inputs are $P[AB] = 0$ and $P[B] = 0$, your routine shouldn’t simply return the value 0. Rather, your routine should generate an error message such as “input error: conditioning on event of probability zero.” Such an error message would help you or others find errors in larger computer programs which use the routine.

As a function of $A$ for $B$ fixed with $P[B] \neq 0$, the conditional probability of $A$ given $B$ is itself a probability measure for $\Omega$ and $\mathcal{F}$. More explicitly, fix $B$ with $P[B] \neq 0$. For each event $A$ define $P'[A] = P[A \mid B]$.
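The routine described above, and the conditional measure $P'$, can be sketched as follows. This is a minimal illustration rather than code from the notes: the function names, the choice of `ValueError`, and the fair-die sample space are all assumptions made for the example.

```python
from fractions import Fraction


def conditional_probability(p_ab, p_b):
    """Return P[A | B] = P[AB] / P[B]; refuse to condition on a null event."""
    if p_b == 0:
        raise ValueError("input error: conditioning on event of probability zero")
    return p_ab / p_b


# Hypothetical finite probability space: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))


def prob(event):
    """Uniform probability measure on omega."""
    return Fraction(len(frozenset(event) & omega), len(omega))


B = frozenset({2, 4, 6})  # the event "the roll is even"


def prob_given_B(A):
    """The measure P'[A] = P[A | B] for the fixed event B above."""
    return conditional_probability(prob(frozenset(A) & B), prob(B))


print(prob_given_B({2}))    # 1/3
print(prob_given_B(omega))  # 1, as axiom P.3 requires of P'
# Additivity (axiom P.2) on two mutually exclusive events:
print(prob_given_B({1, 2}) + prob_given_B({4, 5}) == prob_given_B({1, 2, 4, 5}))  # True
```

Raising an exception rather than returning 0 follows the advice in the text: a caller that conditions on a null event learns about it immediately instead of silently propagating a meaningless value.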
Then $(\Omega, \mathcal{F}, P')$ is a probability space, because $P'$ satisfies the axioms P.1 through P.3. (Try showing that.)

If $A$ and $B$ are independent then $A^c$ and $B$ are independent. Indeed, if $A$ and $B$ are independent then
$$P[A^c B] = P[B] - P[AB] = (1 - P[A]) P[B] = P[A^c] P[B].$$
Similarly, if $A$, $B$, and $C$ are independent events then $AB$ is independent of $C$. More generally, suppose $E_1, E_2, \ldots, E_n$ are independent events, suppose $n = n_1 + \cdots + n_k$ with $n_i \geq 1$ for each $i$, and suppose $F_1$ is defined by Boolean operations (intersections, complements, and unions) of the first $n_1$ events $E_1, \ldots, E_{n_1}$, $F_2$ is defined by Boolean operations on the next $n_2$ events, $E_{n_1+1}, \ldots, E_{n_1+n_2}$, and so on. Then $F_1, \ldots, F_k$ are independent.

Events $E_1, \ldots, E_k$ are said to form a partition of $\Omega$ if the events are mutually exclusive and $\Omega = E_1 \cup \cdots \cup E_k$. Of course for a partition, $P[E_1] + \cdots + P[E_k] = 1$. More generally, for any event $A$, the law of total probability holds because $A$ is the union of the mutually exclusive sets $AE_1, AE_2, \ldots, AE_k$:
$$P[A] = P[AE_1] + \cdots + P[AE_k].$$
If $P[E_i] \neq 0$ for each $i$, this can be written as
$$P[A] = P[A \mid E_1] P[E_1] + \cdots + P[A \mid E_k] P[E_k].$$
Figure 1.3 illustrates the condition of the law of total probability.

[Figure 1.3: Partitioning a set A using a partition of Ω.]

Judicious use of the definition of conditional probability and the law of total probability leads to Bayes’ formula for $P[E_i \mid A]$ (if $P[A] \neq 0$) in simple form
$$P[E_i \mid A] = \frac{P[AE_i]}{P[A]} = \frac{P[A \mid E_i] P[E_i]}{P[A]},$$
or in expanded form:
$$P[E_i \mid A] = \frac{P[A \mid E_i] P[E_i]}{P[A \mid E_1] P[E_1] + \cdots + P[A \mid E_k] P[E_k]}.$$

The remainder of this section gives the Borel-Cantelli lemma. It is a simple result based on continuity of probability and independence of events, but it is not typically encountered in a first course on probability. Let $(A_n : n \geq 1)$ be a sequence of events for a probability space $(\Omega, \mathcal{F}, P)$.
Definition 1.2.1 The event $\{A_n \text{ infinitely often}\}$ is the set of $\omega \in \Omega$ such that $\omega \in A_n$ for infinitely many values of $n$.

Another way to describe $\{A_n \text{ infinitely often}\}$ is that it is the set of $\omega$ such that for any $k$, there is an $n \geq k$ such that $\omega \in A_n$. Therefore,
$$\{A_n \text{ infinitely often}\} = \bigcap_{k \geq 1} \left( \bigcup_{n \geq k} A_n \right).$$
For each $k$, the set $\bigcup_{n \geq k} A_n$ is a countable union of events, so it is an event, and $\{A_n \text{ infinitely often}\}$ is an intersection of countably many such events, so that $\{A_n \text{ infinitely often}\}$ is also an event.

Lemma 1.2.2 (Borel-Cantelli lemma) Let $(A_n : n \geq 1)$ be a sequence of events and let $p_n = P[A_n]$.
(a) If $\sum_{n=1}^{\infty} p_n < \infty$, then $P\{A_n \text{ infinitely often}\} = 0$.
(b) If $\sum_{n=1}^{\infty} p_n = \infty$ and $A_1, A_2, \ldots$ are mutually independent, then $P\{A_n \text{ infinitely often}\} = 1$.

Proof. (a) Since $\{A_n \text{ infinitely often}\}$ is the intersection of the monotonically nonincreasing sequence of events $\bigcup_{n \geq k} A_n$, it follows from the continuity of probability for monotone sequences of events (Lemma 1.1.4) that $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left[\bigcup_{n \geq k} A_n\right]$. Lemma 1.1.4, the fact that the probability of a union of events is less than or equal to the sum of the probabilities of the events, and the definition of the sum of a sequence of numbers, yield that for any $k \geq 1$,
$$P\left[\bigcup_{n \geq k} A_n\right] = \lim_{m \to \infty} P\left[\bigcup_{n=k}^{m} A_n\right] \leq \lim_{m \to \infty} \sum_{n=k}^{m} p_n = \sum_{n=k}^{\infty} p_n.$$
Combining the above yields $P\{A_n \text{ infinitely often}\} \leq \lim_{k \to \infty} \sum_{n=k}^{\infty} p_n$. If $\sum_{n=1}^{\infty} p_n < \infty$, then $\lim_{k \to \infty} \sum_{n=k}^{\infty} p_n = 0$, which implies part (a) of the lemma.

(b) Suppose that $\sum_{n=1}^{\infty} p_n = +\infty$ and that the events $A_1, A_2, \ldots$ are mutually independent. For any $k \geq 1$, using the fact $1 - u \leq \exp(-u)$ for all $u$,
$$P\left[\bigcup_{n \geq k} A_n\right] = \lim_{m \to \infty} P\left[\bigcup_{n=k}^{m} A_n\right] = \lim_{m \to \infty} \left(1 - \prod_{n=k}^{m} (1 - p_n)\right) \geq \lim_{m \to \infty} \left(1 - \exp\left(-\sum_{n=k}^{m} p_n\right)\right) = 1 - \exp\left(-\sum_{n=k}^{\infty} p_n\right) = 1 - \exp(-\infty) = 1.$$
Therefore, $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left[\bigcup_{n \geq k} A_n\right] = 1$.
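The two quantities compared in part (b) of the proof can be evaluated numerically for independent events. The sketch below is an illustration under assumed inputs, not part of the original notes: the sequences $p_n = 1/n$ (divergent sum) and $p_n = 1/n^2$ (convergent sum), the starting index $k = 10$, and the function names are all choices made for the example.

```python
import math


def union_prob_independent(p, k, m):
    """P[A_k ∪ ... ∪ A_m] = 1 - prod(1 - p_n), valid for INDEPENDENT events."""
    return 1.0 - math.prod(1.0 - p(n) for n in range(k, m + 1))


def exp_lower_bound(p, k, m):
    """The bound 1 - exp(-sum p_n), which follows from 1 - u <= exp(-u)."""
    return 1.0 - math.exp(-sum(p(n) for n in range(k, m + 1)))


def harmonic(n):
    return 1.0 / n       # sum over n diverges


def inverse_square(n):
    return 1.0 / n ** 2  # sum over n converges


k = 10
for m in (100, 10_000, 100_000):
    exact = union_prob_independent(harmonic, k, m)
    bound = exp_lower_bound(harmonic, k, m)
    summable = union_prob_independent(inverse_square, k, m)
    print(m, exact, bound, summable)
```

With $p_n = 1/n$ the product telescopes to $(k-1)/m$, so the union probability is $1 - (k-1)/m$ and tends to 1 as $m$ grows, staying above the exponential bound. With $p_n = 1/n^2$ the union probability stays below the tail sum $\sum_{n \geq k} 1/n^2$, which shrinks as $k$ increases, in line with part (a).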
Example 1.2.3 Consider independent coin tosses using biased coins, such that $P[A_n] = p_n = \frac{1}{n}$, where $A_n$ is the event of getting heads on the $n$th toss. Since $\sum_{n=1}^{\infty} \frac{1}{n} = +\infty$, the part of the Borel-Cantelli lemma for independent events implies that $P\{A_n \text{ infinitely often}\} = 1$.

Example 1.2.4 Let $(\Omega, \mathcal{F}, P)$ be the standard unit-interval probability space defined in Example 1.1.2, and let $A_n = [0, \frac{1}{n}]$. Then $p_n = \frac{1}{n}$ and $A_{n+1} \subset A_n$ for $n \ge 1$. The events are not independent, because for $1 < m < n$, $P[A_m A_n] = P[A_n] = \frac{1}{n} \ne P[A_m]P[A_n]$. Of course $0 \in A_n$ for all $n$. But for any $\omega \in (0, 1]$, $\omega \notin A_n$ for $n > \frac{1}{\omega}$. Therefore, $\{A_n \text{ infinitely often}\} = \{0\}$. The single point set $\{0\}$ has probability zero, so $P\{A_n \text{ infinitely often}\} = 0$. This conclusion holds even though $\sum_{n=1}^{\infty} p_n = +\infty$, illustrating the need for the independence assumption in Lemma 1.2.2(b).

1.3 Random variables and their distribution

Let a probability space $(\Omega, \mathcal{F}, P)$ be given. By definition, a random variable is a function $X$ from $\Omega$ to the real line $\mathbb{R}$ that is $\mathcal{F}$ measurable, meaning that for any number $c$,
\[ \{\omega : X(\omega) \le c\} \in \mathcal{F}. \tag{1.2} \]
If $\Omega$ is finite or countably infinite, then $\mathcal{F}$ can be the set of all subsets of $\Omega$, in which case any real-valued function on $\Omega$ is a random variable.

If $(\Omega, \mathcal{F}, P)$ is given as in the uniform phase example with $\mathcal{F}$ equal to the Borel subsets of $[0, 2\pi]$, then the random variables on $(\Omega, \mathcal{F}, P)$ are called the Borel measurable functions on $\Omega$. Since the Borel $\sigma$-algebra contains all subsets of $[0, 2\pi]$ that come up in applications, for practical purposes we can think of any function on $[0, 2\pi]$ as being a random variable. For example, any piecewise continuous or piecewise monotone function on $[0, 2\pi]$ is a random variable for the uniform phase example.

The cumulative distribution function (CDF) of a random variable $X$ is denoted by $F_X$.
It is the function, with domain the real line $\mathbb{R}$, defined by
\begin{align}
F_X(c) &= P\{\omega : X(\omega) \le c\} \tag{1.3} \\
&= P\{X \le c\} \quad \text{(for short)} \tag{1.4}
\end{align}
If $X$ denotes the outcome of the roll of a fair die (“die” is singular of “dice”) and if $Y$ is uniformly distributed on the interval $[0, 1]$, then $F_X$ and $F_Y$ are shown in Figure 1.4.

Figure 1.4: Examples of CDFs.

The CDF of a random variable $X$ determines $P\{X \le c\}$ for any real number $c$. But what about $P\{X < c\}$ and $P\{X = c\}$? Let $c_1, c_2, \ldots$ be a monotone nondecreasing sequence that converges to $c$ from the left. This means $c_i \le c_j < c$ for $i < j$ and $\lim_{j \to \infty} c_j = c$. Then the events $\{X \le c_j\}$ are nested: $\{X \le c_i\} \subset \{X \le c_j\}$ for $i < j$, and the union of all such events is the event $\{X < c\}$. Thus, by Lemma 1.1.4,
\[ P\{X < c\} = \lim_{i \to \infty} P\{X \le c_i\} = \lim_{i \to \infty} F_X(c_i) = F_X(c-). \]
Therefore, $P\{X = c\} = F_X(c) - F_X(c-) = \triangle F_X(c)$, where $\triangle F_X(c)$ is defined to be the size of the jump of $F_X$ at $c$. For example, if $X$ has the CDF shown in Figure 1.5 then $P\{X = 0\} = \frac{1}{2}$. The requirement that $F_X$ be right continuous implies that for any number $c$ (such as $c = 0$ for this example), if the value $F_X(c)$ is changed to any other value, the resulting function would no longer be a valid CDF.

The collection of all events $A$ such that $P\{X \in A\}$ is determined by $F_X$ is a $\sigma$-algebra containing the intervals, and thus this collection contains all Borel sets. That is, $P\{X \in A\}$ is determined by $F_X$ for any Borel set $A$.

Figure 1.5: An example of a CDF.

Proposition 1.3.1 A function $F$ is the CDF of some random variable if and only if it has the following three properties:
F.1 $F$ is nondecreasing
F.2 $\lim_{x \to +\infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$
F.3 $F$ is right continuous

Proof. The “only if” part is proved first. Suppose that $F$ is the CDF of some random variable $X$. Then if $x < y$,
\[ F(y) = P\{X \le y\} = P\{X \le x\} + P\{x < X \le y\} \ge P\{X \le x\} = F(x), \]
so that F.1 is true. Consider the events $B_n = \{X \le n\}$.
Then $B_n \subset B_m$ for $n \le m$. Thus, by Lemma 1.1.4,
\[ \lim_{n \to \infty} F(n) = \lim_{n \to \infty} P[B_n] = P\Big[\bigcup_{n=1}^{\infty} B_n\Big] = P[\Omega] = 1. \]
This and the fact $F$ is nondecreasing imply the following. Given any $\epsilon > 0$, there exists $N_\epsilon$ so large that $F(x) \ge 1 - \epsilon$ for all $x \ge N_\epsilon$. That is, $F(x) \to 1$ as $x \to +\infty$. Similarly,
\[ \lim_{n \to -\infty} F(n) = \lim_{n \to \infty} P[B_{-n}] = P\Big[\bigcap_{n=1}^{\infty} B_{-n}\Big] = P[\emptyset] = 0, \]
so that $F(x) \to 0$ as $x \to -\infty$. Property F.2 is proved.

The proof of F.3 is similar. Fix an arbitrary real number $x$. Define the sequence of events $A_n$ for $n \ge 1$ by $A_n = \{X \le x + \frac{1}{n}\}$. Then $A_n \subset A_m$ for $n \ge m$, so
\[ \lim_{n \to \infty} F\Big(x + \frac{1}{n}\Big) = \lim_{n \to \infty} P[A_n] = P\Big[\bigcap_{k=1}^{\infty} A_k\Big] = P\{X \le x\} = F_X(x). \]
Convergence along the sequence $x + \frac{1}{n}$, together with the fact that $F$ is nondecreasing, implies that $F(x+) = F(x)$. Property F.3 is thus proved. The proof of the “only if” portion of Proposition 1.3.1 is complete.

To prove the “if” part of Proposition 1.3.1, let $F$ be a function satisfying properties F.1-F.3. It must be shown that there exists a random variable with CDF $F$. Let $\Omega = \mathbb{R}$ and let $\mathcal{F}$ be the set $\mathcal{B}$ of Borel subsets of $\mathbb{R}$. Define $\tilde{P}$ on intervals of the form $(a, b]$ by $\tilde{P}[(a, b]] = F(b) - F(a)$. It can be shown by an extension theorem of measure theory that $\tilde{P}$ can be extended to all of $\mathcal{F}$ so that the axioms of probability are satisfied. Finally, let $\tilde{X}(\omega) = \omega$ for all $\omega \in \Omega$. Then
\[ \tilde{P}[\tilde{X} \in (a, b]] = \tilde{P}[(a, b]] = F(b) - F(a). \]
Therefore, $\tilde{X}$ has CDF $F$.

The vast majority of random variables described in applications are one of two types, to be described next. A random variable $X$ is a discrete random variable if there is a finite or countably infinite set of values $\{x_i : i \in I\}$ such that $P\{X \in \{x_i : i \in I\}\} = 1$. The probability mass function (pmf) of a discrete random variable $X$, denoted $p_X(x)$, is defined by $p_X(x) = P\{X = x\}$. Typically the pmf of a discrete random variable is much more useful than the CDF.
However, the pmf and CDF of a discrete random variable are related by $p_X(x) = \triangle F_X(x)$ and conversely,
\[ F_X(x) = \sum_{y : y \le x} p_X(y), \tag{1.5} \]
where the sum in (1.5) is taken only over $y$ such that $p_X(y) \ne 0$. If $X$ is a discrete random variable with only finitely many mass points in any finite interval, then $F_X$ is a piecewise constant function.

A random variable $X$ is a continuous random variable if the CDF is the integral of a function:
\[ F_X(x) = \int_{-\infty}^{x} f_X(y)\,dy. \]
The function $f_X$ is called the probability density function (pdf). If the pdf $f_X$ is continuous at a point $x$, then the value $f_X(x)$ has the following nice interpretation:
\[ f_X(x) = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \int_{x}^{x+\epsilon} f_X(y)\,dy = \lim_{\epsilon \to 0} \frac{1}{\epsilon} P\{x \le X \le x + \epsilon\}. \]
If $A$ is any Borel subset of $\mathbb{R}$, then
\[ P\{X \in A\} = \int_{A} f_X(x)\,dx. \tag{1.6} \]
The integral in (1.6) can be understood as a Riemann integral if $A$ is a finite union of intervals and $f_X$ is piecewise continuous or monotone. In general, $f_X$ is required to be Borel measurable and the integral is defined by Lebesgue integration (see footnote 2).

Any random variable $X$ on an arbitrary probability space has a CDF $F_X$. As noted in the proof of Proposition 1.3.1 there exists a probability measure $P_X$ (called $\tilde{P}$ in the proof) on the Borel subsets of $\mathbb{R}$ such that for any interval $(a, b]$,
\[ P_X[(a, b]] = P\{X \in (a, b]\}. \]
We define the probability distribution of $X$ to be the probability measure $P_X$. The distribution $P_X$ is determined uniquely by the CDF $F_X$. The distribution is also determined by the pdf $f_X$ if $X$ is continuous type, or the pmf $p_X$ if $X$ is discrete type. In common usage, the response to the question “What is the distribution of $X$?” is answered by giving one or more of $F_X$, $f_X$, or $p_X$, or possibly a transform of one of these, whichever is most convenient.

1.4 Functions of a random variable

Recall that a random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is a function mapping $\Omega$ to the real line $\mathbb{R}$, satisfying the condition $\{\omega : X(\omega) \le a\} \in \mathcal{F}$ for all $a \in \mathbb{R}$.
Suppose $g$ is a function mapping $\mathbb{R}$ to $\mathbb{R}$ that is not too bizarre. Specifically, suppose for any constant $c$ that $\{x : g(x) \le c\}$ is a Borel subset of $\mathbb{R}$. Let $Y(\omega) = g(X(\omega))$. Then $Y$ maps $\Omega$ to $\mathbb{R}$ and $Y$ is a random variable. See Figure 1.6. We write $Y = g(X)$.

Figure 1.6: A function of a random variable as a composition of mappings.

Often we’d like to compute the distribution of $Y$ from knowledge of $g$ and the distribution of $X$. In case $X$ is a continuous random variable with known distribution, the following three step procedure works well:

(1) Examine the ranges of possible values of $X$ and $Y$. Sketch the function $g$.
(2) Find the CDF of $Y$, using $F_Y(c) = P\{Y \le c\} = P\{g(X) \le c\}$. The idea is to express the event $\{g(X) \le c\}$ as $\{X \in A\}$ for some set $A$ depending on $c$.
(3) If $F_Y$ has a piecewise continuous derivative, and if the pdf $f_Y$ is desired, differentiate $F_Y$.

If instead $X$ is a discrete random variable then step 1 should be followed. After that the pmf of $Y$ can be found from the pmf of $X$ using
\[ p_Y(y) = P\{g(X) = y\} = \sum_{x : g(x) = y} p_X(x). \]

Footnote 2: Lebesgue integration is defined in Sections 1.5 and 11.5.

Example 1.4.1 Suppose $X$ is a $N(\mu = 2, \sigma^2 = 3)$ random variable (see Section 1.6 for the definition) and $Y = X^2$. Let us describe the density of $Y$. Note that $Y = g(X)$ where $g(x) = x^2$. The support of the distribution of $X$ is the whole real line, and the range of $g$ over this support is $\mathbb{R}_+$. Next we find the CDF, $F_Y$. Since $P\{Y \ge 0\} = 1$, $F_Y(c) = 0$ for $c < 0$.
For $c \ge 0$,
\begin{align*}
F_Y(c) &= P\{X^2 \le c\} = P\{-\sqrt{c} \le X \le \sqrt{c}\} \\
&= P\Big\{ \frac{-\sqrt{c} - 2}{\sqrt{3}} \le \frac{X - 2}{\sqrt{3}} \le \frac{\sqrt{c} - 2}{\sqrt{3}} \Big\} \\
&= \Phi\Big(\frac{\sqrt{c} - 2}{\sqrt{3}}\Big) - \Phi\Big(\frac{-\sqrt{c} - 2}{\sqrt{3}}\Big).
\end{align*}
Differentiate with respect to $c$, using the chain rule and the fact $\Phi'(s) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{s^2}{2})$, to obtain
\[ f_Y(c) = \begin{cases} \dfrac{1}{\sqrt{24\pi c}} \Big\{ \exp\Big(-\Big[\dfrac{\sqrt{c} - 2}{\sqrt{6}}\Big]^2\Big) + \exp\Big(-\Big[\dfrac{-\sqrt{c} - 2}{\sqrt{6}}\Big]^2\Big) \Big\} & \text{if } c \ge 0 \\ 0 & \text{if } c < 0 \end{cases} \tag{1.7} \]

Example 1.4.2 Suppose a vehicle is traveling in a straight line at speed $a$, and that a random direction is selected, subtending an angle $\Theta$ from the direction of travel which is uniformly distributed over the interval $[0, \pi]$. See Figure 1.7.

Figure 1.7: Direction of travel and a random direction.

Then the effective speed of the vehicle in the random direction is $B = a\cos(\Theta)$. Let us find the pdf of $B$. The range of $a\cos(\theta)$ as $\theta$ ranges over $[0, \pi]$ is the interval $[-a, a]$. Therefore, $F_B(c) = 0$ for $c \le -a$ and $F_B(c) = 1$ for $c \ge a$. Let now $-a < c < a$. Then, because $\cos$ is monotone nonincreasing on the interval $[0, \pi]$,
\[ F_B(c) = P\{a\cos(\Theta) \le c\} = P\Big\{\cos(\Theta) \le \frac{c}{a}\Big\} = P\Big\{\Theta \ge \cos^{-1}\Big(\frac{c}{a}\Big)\Big\} = 1 - \frac{\cos^{-1}(\frac{c}{a})}{\pi}. \]
Therefore, because $\cos^{-1}(y)$ has derivative $-(1 - y^2)^{-\frac{1}{2}}$,
\[ f_B(c) = \begin{cases} \dfrac{1}{\pi\sqrt{a^2 - c^2}} & |c| < a \\ 0 & |c| > a \end{cases} \]
A sketch of the density is given in Figure 1.8.

Figure 1.8: The pdf of the effective speed in a uniformly distributed direction.

Example 1.4.3 Suppose $Y = \tan(\Theta)$, as illustrated in Figure 1.9, where $\Theta$ is uniformly distributed over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. Let us find the pdf of $Y$. The function $\tan(\theta)$ increases from $-\infty$ to $\infty$ as $\theta$ ranges over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$.

Figure 1.9: A horizontal line, a fixed point at unit distance, and a line through the point with random direction.
For any real $c$,
\[ F_Y(c) = P\{Y \le c\} = P\{\tan(\Theta) \le c\} = P\{\Theta \le \tan^{-1}(c)\} = \frac{\tan^{-1}(c) + \frac{\pi}{2}}{\pi}. \]
Differentiating the CDF with respect to $c$ yields that $Y$ has the Cauchy pdf:
\[ f_Y(c) = \frac{1}{\pi(1 + c^2)}, \quad -\infty < c < \infty. \]

Example 1.4.4 Given an angle $\theta$ expressed in radians, let $(\theta \bmod 2\pi)$ denote the equivalent angle in the interval $[0, 2\pi]$. Thus, $(\theta \bmod 2\pi)$ is equal to $\theta + 2\pi n$, where the integer $n$ is such that $0 \le \theta + 2\pi n < 2\pi$. Let $\Theta$ be uniformly distributed over $[0, 2\pi]$, let $h$ be a constant, and let
\[ \tilde{\Theta} = (\Theta + h \bmod 2\pi). \]
Let us find the distribution of $\tilde{\Theta}$. Clearly $\tilde{\Theta}$ takes values in the interval $[0, 2\pi]$, so fix $c$ with $0 \le c < 2\pi$ and seek to find $P\{\tilde{\Theta} \le c\}$. Let $A$ denote the interval $[h, h + 2\pi]$. Thus, $\Theta + h$ is uniformly distributed over $A$. Let $B = \bigcup_n [2\pi n, 2\pi n + c]$. Thus $\tilde{\Theta} \le c$ if and only if $\Theta + h \in B$. Therefore,
\[ P\{\tilde{\Theta} \le c\} = \int_{A \cap B} \frac{1}{2\pi}\,d\theta. \]
By sketching the set $B$, it is easy to see that $A \cap B$ is either a single interval of length $c$, or the union of two intervals with lengths adding to $c$. Therefore, $P\{\tilde{\Theta} \le c\} = \frac{c}{2\pi}$, so that $\tilde{\Theta}$ is itself uniformly distributed over $[0, 2\pi]$.

Example 1.4.5 Let $X$ be an exponentially distributed random variable with parameter $\lambda$. Let $Y = \lfloor X \rfloor$, which is the integer part of $X$, and let $R = X - \lfloor X \rfloor$, which is the remainder. We shall describe the distributions of $Y$ and $R$. Clearly $Y$ is a discrete random variable with possible values $0, 1, 2, \ldots$, so it is sufficient to find the pmf of $Y$. For integers $k \ge 0$,
\[ p_Y(k) = P\{k \le X < k + 1\} = \int_{k}^{k+1} \lambda e^{-\lambda x}\,dx = e^{-\lambda k}(1 - e^{-\lambda}), \]
and $p_Y(k) = 0$ for other $k$.

Turn next to the distribution of $R$. Clearly $R$ takes values in the interval $[0, 1]$. So let $0 < c < 1$ and find $F_R(c)$:
\begin{align*}
F_R(c) &= P\{X - \lfloor X \rfloor \le c\} = P\Big\{X \in \bigcup_{k=0}^{\infty} [k, k + c]\Big\} \\
&= \sum_{k=0}^{\infty} P\{k \le X \le k + c\} = \sum_{k=0}^{\infty} e^{-\lambda k}(1 - e^{-\lambda c}) = \frac{1 - e^{-\lambda c}}{1 - e^{-\lambda}},
\end{align*}
where we used the fact $1 + \alpha + \alpha^2 + \cdots = \frac{1}{1 - \alpha}$ for $|\alpha| < 1$. Differentiating $F_R$ yields the pdf:
\[ f_R(c) = \begin{cases} \dfrac{\lambda e^{-\lambda c}}{1 - e^{-\lambda}} & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases} \]
What happens to the density of $R$ as $\lambda \to 0$ or as $\lambda \to \infty$?
By l’Hospital’s rule,
\[ \lim_{\lambda \to 0} f_R(c) = \begin{cases} 1 & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases} \]
That is, in the limit as $\lambda \to 0$, the density of $X$ becomes more and more “evenly spread out,” and $R$ becomes uniformly distributed over the interval $[0, 1]$. If $\lambda$ is very large then the factor $e^{-\lambda}$ is nearly zero, and the density of $R$ is nearly the same as the exponential density with parameter $\lambda$.

An important step in many computer simulations of random systems is to generate a random variable with a specified CDF, by applying a function to a random variable that is uniformly distributed on the interval $[0, 1]$. Let $F$ be a function satisfying the three properties required of a CDF, and let $U$ be uniformly distributed over the interval $[0, 1]$. The problem is to find a function $g$ so that $F$ is the CDF of $g(U)$. An appropriate function $g$ is given by the inverse function of $F$. Although $F$ may not be strictly increasing, a suitable version of $F^{-1}$ always exists, defined for $0 < u < 1$ by
\[ F^{-1}(u) = \min\{x : F(x) \ge u\}. \tag{1.8} \]

Figure 1.10: A CDF and its inverse.

If the graphs of $F$ and $F^{-1}$ are closed up by adding vertical lines at jump points, then the graphs are reflections of each other about the $x = y$ line, as illustrated in Figure 1.10. It is not hard to check that for any real $x_o$ and $u_o$ with $0 < u_o < 1$,
\[ F(x_o) \ge u_o \quad \text{if and only if} \quad x_o \ge F^{-1}(u_o). \]
Thus, if $X = F^{-1}(U)$ then
\[ P\{F^{-1}(U) \le x\} = P\{U \le F(x)\} = F(x), \]
so that indeed $F$ is the CDF of $X$.

Example 1.4.6 Suppose $F(x) = 1 - e^{-x}$ for $x \ge 0$ and $F(x) = 0$ for $x < 0$. Since $F$ is continuously increasing in this case, we can identify its inverse by solving for $x$ as a function of $u$ so that $F(x) = u$. That is, for $0 < u < 1$, we’d like $1 - e^{-x} = u$, which is equivalent to $e^{-x} = 1 - u$, or $x = -\ln(1 - u)$. Thus, $F^{-1}(u) = -\ln(1 - u)$. So we can take $g(u) = -\ln(1 - u)$ for $0 < u < 1$. That is, if $U$ is uniformly distributed on the interval $[0, 1]$, then the CDF of $-\ln(1 - U)$ is $F$. The choice of $g$ is not unique in general.
For example, $1 - U$ has the same distribution as $U$, so the CDF of $-\ln(U)$ is also $F$. To double check the answer, note that if $x \ge 0$, then
\[ P\{-\ln(1 - U) \le x\} = P\{\ln(1 - U) \ge -x\} = P\{1 - U \ge e^{-x}\} = P\{U \le 1 - e^{-x}\} = F(x). \]

Example 1.4.7 Suppose $F$ is the CDF for the experiment of rolling a fair die, shown on the left half of Figure 1.4. One way to generate a random variable with CDF $F$ is to actually roll a die. To simulate that on a computer, we’d seek a function $g$ so that $g(U)$ has the same CDF. Using $g = F^{-1}$ and using (1.8) or the graphical method illustrated in Figure 1.10 to find $F^{-1}$, we get that for $0 < u < 1$, $g(u) = i$ for $\frac{i-1}{6} < u \le \frac{i}{6}$ for $1 \le i \le 6$. To double check the answer, note that if $1 \le i \le 6$, then
\[ P\{g(U) = i\} = P\Big\{\frac{i - 1}{6} < U \le \frac{i}{6}\Big\} = \frac{1}{6}, \]
so that $g(U)$ has the correct pmf, and hence the correct CDF.

1.5 Expectation of a random variable

The expectation, alternatively called the mean, of a random variable $X$ can be defined in several different ways. Before giving a general definition, we shall consider a straightforward case. A random variable $X$ is called simple if there is a finite set $\{x_1, \ldots, x_m\}$ such that $X(\omega) \in \{x_1, \ldots, x_m\}$ for all $\omega$. The expectation of such a random variable is defined by
\[ E[X] = \sum_{i=1}^{m} x_i P\{X = x_i\}. \tag{1.9} \]
The definition (1.9) clearly shows that $E[X]$ for a simple random variable $X$ depends only on the pmf of $X$.

Like all random variables, $X$ is a function on a probability space $(\Omega, \mathcal{F}, P)$. Figure 1.11 illustrates that the sum defining $E[X]$ in (1.9) can be viewed as an integral over $\Omega$. This suggests writing
\[ E[X] = \int_{\Omega} X(\omega) P[d\omega]. \tag{1.10} \]

Figure 1.11: A simple random variable with three possible values.

Let $Y$ be another simple random variable on the same probability space as $X$, with $Y(\omega) \in \{y_1, \ldots, y_n\}$ for all $\omega$. Of course $E[Y] = \sum_{i=1}^{n} y_i P\{Y = y_i\}$. One learns in any elementary probability class that $E[X + Y] = E[X] + E[Y]$.
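The two ways of computing the expectation of a simple random variable, by the pmf as in (1.9) and by summing over the sample space as suggested by (1.10), can be compared on a small example. The following is a minimal sketch, not part of the original notes: the four-point sample space, its probabilities, and the values of `X` and `Y` are made-up illustrative assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite probability space: outcome -> probability (sums to 1).
P = {
    "a": Fraction(1, 2),
    "b": Fraction(1, 4),
    "c": Fraction(1, 8),
    "d": Fraction(1, 8),
}

# Two simple random variables, represented as functions on the sample space.
X = {"a": 1, "b": 1, "c": 3, "d": 5}
Y = {"a": 0, "b": 2, "c": 2, "d": -1}


def expectation_via_pmf(rv, prob):
    """Compute E[rv] as in (1.9): sum of x * P{rv = x} over distinct values x."""
    pmf = defaultdict(Fraction)
    for outcome, p in prob.items():
        pmf[rv[outcome]] += p
    return sum(x * px for x, px in pmf.items())


def expectation_via_omega(rv, prob):
    """Compute E[rv] as a sum over outcomes, the finite analogue of (1.10)."""
    return sum(rv[outcome] * p for outcome, p in prob.items())


# X + Y is again a simple random variable on the same space.
S = {outcome: X[outcome] + Y[outcome] for outcome in P}

if __name__ == "__main__":
    print(expectation_via_pmf(X, P), expectation_via_omega(X, P))
    print(expectation_via_pmf(S, P),
          expectation_via_pmf(X, P) + expectation_via_pmf(Y, P))
```

Summing over outcomes makes additivity immediate, because $X + Y$ is evaluated outcome by outcome; the pmf form hides this, since the pmf of $X + Y$ is not determined by the two marginal pmfs alone.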
Note that $X + Y$ is again a simple random variable, so that $E[X + Y]$ can be defined in the same way as $E[X]$ was defined. How would you prove $E[X + Y] = E[X] + E[Y]$? Is (1.9) helpful? We shall give a proof that $E[X + Y] = E[X] + E[Y]$ motivated by (1.10).

The sets $\{X = x_1\}, \ldots, \{X = x_m\}$ form a partition of $\Omega$. A refinement of this partition consists of another partition $C_1, \ldots, C_{m'}$ such that $X$ is constant over each $C_j$. If we let $x'_j$ denote the value of $X$ on $C_j$, then clearly
\[ E[X] = \sum_{j} x'_j P[C_j]. \]
Now, it is possible to select the partition $C_1, \ldots, C_{m'}$ so that both $X$ and $Y$ are constant over each $C_j$. For example, each $C_j$ could have the form $\{X = x_i\} \cap \{Y = y_k\}$ for some $i, k$. Let $y'_j$ denote the value of $Y$ on $C_j$. Then $x'_j + y'_j$ is the value of $X + Y$ on $C_j$. Therefore,
\[ E[X + Y] = \sum_{j} (x'_j + y'_j) P[C_j] = \sum_{j} x'_j P[C_j] + \sum_{j} y'_j P[C_j] = E[X] + E[Y]. \]

While the expression (1.10) is rather suggestive, it would be overly restrictive to interpret it as a Riemann integral over $\Omega$. For example, if $X$ is a random variable for the standard unit-interval probability space defined in Example 1.1.2, then it is tempting to define $E[X]$ by Riemann integration (see the appendix):
\[ E[X] = \int_{0}^{1} X(\omega)\,d\omega. \tag{1.11} \]
However, suppose $X$ is the simple random variable such that $X(\omega) = 1$ for rational values of $\omega$ and $X(\omega) = 0$ otherwise. Since the set of rational numbers in $\Omega$ is countably infinite, such $X$ satisfies $P\{X = 0\} = 1$. Clearly we’d like $E[X] = 0$, but the Riemann integral (1.11) is not convergent for this choice of $X$.

The expression (1.10) can be used to define $E[X]$ in great generality if it is interpreted as a Lebesgue integral, defined as follows: Suppose $X$ is an arbitrary nonnegative random variable. Then there exists a sequence of simple random variables $X_1, X_2, \ldots$ such that for every $\omega \in \Omega$, $X_1(\omega) \le X_2(\omega) \le \cdots$ and $X_n(\omega) \to X(\omega)$ as $n \to \infty$.
Then $E[X_n]$ is well defined for each $n$ and is nondecreasing in $n$, so the limit of $E[X_n]$ as $n \to \infty$ exists with values in $[0, +\infty]$. Furthermore it can be shown that the value of the limit depends only on $(\Omega, \mathcal{F}, P)$ and $X$, not on the particular choice of the approximating simple sequence. We thus define $E[X] = \lim_{n \to \infty} E[X_n]$. Thus, $E[X]$ is always well defined in this way, with possible value $+\infty$, if $X$ is a nonnegative random variable.

Suppose $X$ is an arbitrary random variable. Define the positive part of $X$ to be the random variable $X_+$ defined by $X_+(\omega) = \max\{0, X(\omega)\}$ for each value of $\omega$. Similarly define the negative part of $X$ to be the random variable $X_-(\omega) = \max\{0, -X(\omega)\}$. Then $X(\omega) = X_+(\omega) - X_-(\omega)$ for all $\omega$, and $X_+$ and $X_-$ are both nonnegative random variables. As long as at least one of $E[X_+]$ or $E[X_-]$ is finite, define $E[X] = E[X_+] - E[X_-]$. The expectation $E[X]$ is undefined if $E[X_+] = E[X_-] = +\infty$. This completes the definition of $E[X]$ using (1.10) interpreted as a Lebesgue integral.

We will prove that $E[X]$ defined by the Lebesgue integral (1.10) depends only on the CDF of $X$. It suffices to show this for a nonnegative random variable $X$. For such a random variable, and $n \ge 1$, define the simple random variable $X_n$ by
\[ X_n(\omega) = \begin{cases} k 2^{-n} & \text{if } k 2^{-n} \le X(\omega) < (k + 1) 2^{-n}, \; k = 0, 1, \ldots, 2^{2n} - 1 \\ 0 & \text{else} \end{cases} \]
Then
\[ E[X_n] = \sum_{k=0}^{2^{2n} - 1} k 2^{-n} \big( F_X((k + 1) 2^{-n}) - F_X(k 2^{-n}) \big), \]
so that $E[X_n]$ is determined by the CDF $F_X$ for each $n$. Furthermore, the $X_n$’s are nondecreasing in $n$ and converge to $X$. Thus, $E[X] = \lim_{n \to \infty} E[X_n]$, and therefore the limit $E[X]$ is determined by $F_X$.

In Section 1.3 we defined the probability distribution $P_X$ of a random variable such that the canonical random variable $\tilde{X}(\omega) = \omega$ on $(\mathbb{R}, \mathcal{B}, P_X)$ has the same CDF as $X$.
Therefore $E[X] = E[\tilde{X}]$, or
\[ E[X] = \int_{-\infty}^{\infty} x\,P_X(dx) \quad \text{(Lebesgue)} \tag{1.12} \]
By definition, the integral (1.12) is the Lebesgue-Stieltjes integral of $x$ with respect to $F_X$, so that
\[ E[X] = \int_{-\infty}^{\infty} x\,dF_X(x) \quad \text{(Lebesgue-Stieltjes)} \tag{1.13} \]

Expectation has the following properties. Let $X$, $Y$ be random variables and $c$ a constant.

E.1 (Linearity) $E[cX] = cE[X]$. If $E[X]$, $E[Y]$ and $E[X] + E[Y]$ are well defined, then $E[X + Y]$ is well defined and $E[X + Y] = E[X] + E[Y]$.

E.2 (Preservation of order) If $P\{X \ge Y\} = 1$ and $E[Y]$ is well defined then $E[X]$ is well defined and $E[X] \ge E[Y]$.

E.3 If $X$ has pdf $f_X$ then
\[ E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \quad \text{(Lebesgue)} \]

E.4 If $X$ has pmf $p_X$ then
\[ E[X] = \sum_{x > 0} x p_X(x) + \sum_{x < 0} x p_X(x). \]

E.5 (Law of the unconscious statistician) If $g$ is Borel measurable,
\[ E[g(X)] = \int_{\Omega} g(X(\omega)) P[d\omega] \quad \text{(Lebesgue)} \quad = \int_{-\infty}^{\infty} g(x)\,dF_X(x) \quad \text{(Lebesgue-Stieltjes)} \]
and in case $X$ is a continuous type random variable
\[ E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \quad \text{(Lebesgue)} \]

E.6 (Integration by parts formula)
\[ E[X] = \int_{0}^{\infty} (1 - F_X(x))\,dx - \int_{-\infty}^{0} F_X(x)\,dx, \tag{1.14} \]
which is well defined whenever at least one of the two integrals in (1.14) is finite. There is a simple graphical interpretation of (1.14). Namely, $E[X]$ is equal to the area of the region between the horizontal line $\{y = 1\}$ and the graph of $F_X$ and contained in $\{x \ge 0\}$, minus the area of the region bounded by the $x$ axis and the graph of $F_X$ and contained in $\{x \le 0\}$, as long as at least one of these regions has finite area. See Figure 1.12.

Figure 1.12: $E[X]$ is the difference of two areas.

Properties E.1 and E.2 are true for simple random variables and they carry over to general random variables in the limit defining the Lebesgue integral (1.10). Properties E.3 and E.4 follow from the equivalent definition (1.12) and properties of Lebesgue-Stieltjes integrals. Property E.5 can be proved by approximating $g$ by piecewise constant functions.
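The integration by parts formula (1.14) can be checked numerically for a distribution that takes both positive and negative values. The following is a minimal sketch, not part of the original notes: it assumes, purely for illustration, that $X$ is uniformly distributed on $[-1, 2]$ (so $E[X] = \frac{1}{2}$), and it uses a plain midpoint rule; the function names `cdf_uniform`, `midpoint_integral`, and `mean_from_cdf` are hypothetical.

```python
def cdf_uniform(x, a=-1.0, b=2.0):
    """CDF of a random variable uniformly distributed on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def midpoint_integral(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] by the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def mean_from_cdf(cdf, lo, hi):
    """Evaluate the right side of (1.14).

    Assumes lo <= 0 <= hi and that the distribution is supported on
    [lo, hi], so the two improper integrals reduce to finite ones.
    """
    area_above = midpoint_integral(lambda x: 1.0 - cdf(x), 0.0, hi)
    area_below = midpoint_integral(cdf, lo, 0.0)
    return area_above - area_below


if __name__ == "__main__":
    print(mean_from_cdf(cdf_uniform, -1.0, 2.0))  # close to (a + b) / 2 = 0.5
```

Here the two areas are $\frac{2}{3}$ and $\frac{1}{6}$, matching the picture of $E[X]$ as a difference of two areas.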
Property E.6 can be proved by integration by parts applied to (1.13). Alternatively, since $F_X^{-1}(U)$ has the same distribution as $X$, if $U$ is uniformly distributed on the interval $[0, 1]$, the law of the unconscious statistician yields that $E[X] = \int_{0}^{1} F_X^{-1}(u)\,du$, and this integral can also be interpreted as the difference of the areas of the same two regions.

Sometimes, for brevity, we write $EX$ instead of $E[X]$. The variance of a random variable $X$ with $EX$ finite is defined by $\mathrm{Var}(X) = E[(X - EX)^2]$. By the linearity of expectation, if $EX$ is finite, the variance of $X$ satisfies the useful relation:
\[ \mathrm{Var}(X) = E[X^2 - 2X(EX) + (EX)^2] = E[X^2] - (EX)^2. \]

The following two inequalities are simple and fundamental. The Markov inequality states that if $Y$ is a nonnegative random variable, then for $c > 0$,
\[ P\{Y \ge c\} \le \frac{E[Y]}{c}. \]
To prove Markov’s inequality, note that $I_{\{Y \ge c\}} \le \frac{Y}{c}$, and take expectations on each side. The Chebychev inequality states that if $X$ is a random variable with finite mean $\mu$ and variance $\sigma^2$, then for any $d > 0$,
\[ P\{|X - \mu| \ge d\} \le \frac{\sigma^2}{d^2}. \]
The Chebychev inequality follows by applying the Markov inequality with $Y = |X - \mu|^2$ and $c = d^2$.

The characteristic function $\Phi_X$ of a random variable $X$ is defined by
\[ \Phi_X(u) = E[e^{juX}] \]
for real values of $u$, where $j = \sqrt{-1}$. For example, if $X$ has pdf $f_X$, then
\[ \Phi_X(u) = \int_{-\infty}^{\infty} \exp(jux) f_X(x)\,dx, \]
which is $2\pi$ times the inverse Fourier transform of $f_X$.

Two random variables have the same probability distribution if and only if they have the same characteristic function. If $E[X^k]$ exists and is finite for an integer $k \ge 1$, then the derivatives of $\Phi_X$ up to order $k$ exist and are continuous, and
\[ \Phi_X^{(k)}(0) = j^k E[X^k]. \]
For a nonnegative integer-valued random variable $X$ it is often more convenient to work with the $z$ transform of the pmf, defined by
\[ \Psi_X(z) = E[z^X] = \sum_{k=0}^{\infty} z^k p_X(k) \]
for real or complex $z$ with $|z| \le 1$.
Two such random variables have the same probability distribution if and only if their $z$ transforms are equal. If $E[X^k]$ is finite it can be found from the derivatives of $\Psi_X$ up to the $k$th order at $z = 1$,
\[ \Psi_X^{(k)}(1) = E[X(X - 1) \cdots (X - k + 1)]. \]

1.6 Frequently used distributions

The following is a list of the most basic and frequently used probability distributions. For each distribution an abbreviation, if any, and valid parameter values are given, followed by either the CDF, pdf or pmf, then the mean, variance, a typical example and significance of the distribution.

The constants $p$, $\lambda$, $\mu$, $\sigma$, $a$, $b$, and $\alpha$ are real-valued, and $n$ and $i$ are integer-valued, except $n$ can be noninteger-valued in the case of the gamma distribution.

Bernoulli: $Be(p)$, $0 \le p \le 1$
pmf: $p(i) = \begin{cases} p & i = 1 \\ 1 - p & i = 0 \\ 0 & \text{else} \end{cases}$
z-transform: $1 - p + pz$
mean: $p$
variance: $p(1 - p)$
Example: Number of heads appearing in one flip of a coin. The coin is called fair if $p = \frac{1}{2}$ and biased otherwise.

Binomial: $Bi(n, p)$, $n \ge 1$, $0 \le p \le 1$
pmf: $p(i) = \binom{n}{i} p^i (1 - p)^{n - i}$, $0 \le i \le n$
z-transform: $(1 - p + pz)^n$
mean: $np$
variance: $np(1 - p)$
Example: Number of heads appearing in $n$ independent flips of a coin.

Poisson: $Poi(\lambda)$, $\lambda \ge 0$
pmf: $p(i) = \dfrac{\lambda^i e^{-\lambda}}{i!}$, $i \ge 0$
z-transform: $\exp(\lambda(z - 1))$
mean: $\lambda$
variance: $\lambda$
Example: Number of phone calls placed during a ten second interval in a large city.
Significance: The Poisson pmf is the limit of the binomial pmf as $n \to +\infty$ and $p \to 0$ in such a way that $np \to \lambda$.

Geometric: $Geo(p)$, $0 < p \le 1$
pmf: $p(i) = (1 - p)^{i - 1} p$, $i \ge 1$
z-transform: $\dfrac{pz}{1 - z + pz}$
mean: $\dfrac{1}{p}$
variance: $\dfrac{1 - p}{p^2}$
Example: Number of independent flips of a coin until heads first appears.
Significant property: If $X$ has the geometric distribution, $P\{X > i\} = (1 - p)^i$ for integers $i \ge 1$. So $X$ has the memoryless property:
\[ P\{X > i + j \mid X > i\} = P\{X > j\} \quad \text{for } i, j \ge 1. \]
Any positive integer-valued random variable with this property has a geometric distribution.
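The tail formula, the memoryless property, and the mean of the geometric distribution can be checked directly from the pmf. The following is a minimal sketch, not part of the original notes: the parameter value $p = 0.3$, the truncation of the infinite sums after `terms` summands, and the function names are illustrative assumptions.

```python
def geo_pmf(i, p):
    """pmf of the geometric distribution: (1 - p)^(i - 1) * p for i >= 1."""
    return (1.0 - p) ** (i - 1) * p if i >= 1 else 0.0


def geo_tail(i, p, terms=2000):
    """Approximate P{X > i} by summing the pmf over i + 1, ..., i + terms.

    The neglected remainder is (1 - p)^(i + terms), which is negligible
    for the parameter values used here.
    """
    return sum(geo_pmf(k, p) for k in range(i + 1, i + terms + 1))


def geo_mean(p, terms=2000):
    """Approximate E[X] by a truncated sum of i * p(i)."""
    return sum(i * geo_pmf(i, p) for i in range(1, terms + 1))


if __name__ == "__main__":
    p = 0.3
    i, j = 2, 3
    print(geo_tail(i, p), (1.0 - p) ** i)                    # tail formula
    print(geo_tail(i + j, p) / geo_tail(i, p), geo_tail(j, p))  # memoryless
    print(geo_mean(p), 1.0 / p)                              # mean
```

The second printed pair compares $P\{X > i + j \mid X > i\}$ with $P\{X > j\}$; their agreement reflects that the tail $(1 - p)^i$ is an exponential function of $i$.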
Gaussian (also called Normal): $N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma \ge 0$
pdf (if $\sigma^2 > 0$): $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\dfrac{(x - \mu)^2}{2\sigma^2}\Big)$
pmf (if $\sigma^2 = 0$): $p(x) = \begin{cases} 1 & x = \mu \\ 0 & \text{else} \end{cases}$
characteristic function: $\exp\Big(ju\mu - \dfrac{u^2\sigma^2}{2}\Big)$
mean: $\mu$
variance: $\sigma^2$
Example: Instantaneous voltage difference (due to thermal noise) measured across a resistor held at a fixed temperature.
Notation: The character $\Phi$ is often used to denote the CDF of a $N(0, 1)$ random variable, and $Q$ is often used for the complementary CDF:
\[ Q(c) = 1 - \Phi(c) = \int_{c}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx. \]
(Footnote 3: As noted earlier, $\Phi$ is also used to denote characteristic functions. The meaning should be clear from the context.)
Significant property (Central limit theorem): If $X_1, X_2, \ldots$ are independent and identically distributed with mean $\mu$ and nonzero variance $\sigma^2$, then for any constant $c$,
\[ \lim_{n \to \infty} P\Big\{ \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n\sigma^2}} \le c \Big\} = \Phi(c). \]

Exponential: $Exp(\lambda)$, $\lambda > 0$
pdf: $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
characteristic function: $\dfrac{\lambda}{\lambda - ju}$
mean: $\dfrac{1}{\lambda}$
variance: $\dfrac{1}{\lambda^2}$
Example: Time elapsed between noon sharp and the first telephone call placed in a large city, on a given day.
Significance: If $X$ has the $Exp(\lambda)$ distribution, $P\{X \ge t\} = e^{-\lambda t}$ for $t \ge 0$. So $X$ has the memoryless property:
\[ P\{X \ge s + t \mid X \ge s\} = P\{X \ge t\}, \quad s, t \ge 0. \]
Any nonnegative random variable with this property is exponentially distributed.

Uniform: $U(a, b)$, $-\infty < a < b < \infty$
pdf: $f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{else} \end{cases}$
characteristic function: $\dfrac{e^{jub} - e^{jua}}{ju(b - a)}$
mean: $\dfrac{a + b}{2}$
variance: $\dfrac{(b - a)^2}{12}$
Example: The phase difference between two independent oscillators operating at the same frequency may be modelled as uniformly distributed over $[0, 2\pi]$.
Significance: Uniform is uniform.

Gamma$(n, \alpha)$: $n, \alpha > 0$ ($n$ real valued)
pdf: $f(x) = \dfrac{\alpha^n x^{n - 1} e^{-\alpha x}}{\Gamma(n)}$, $x \ge 0$, where $\Gamma(n) = \int_{0}^{\infty} s^{n - 1} e^{-s}\,ds$
characteristic function: $\Big(\dfrac{\alpha}{\alpha - ju}\Big)^n$
mean: $\dfrac{n}{\alpha}$
variance: $\dfrac{n}{\alpha^2}$
Significance: If $n$ is a positive integer then $\Gamma(n) = (n - 1)!$
and a Gamma$(n, \alpha)$ random variable has the same distribution as the sum of $n$ independent, $Exp(\alpha)$ distributed random variables.

Rayleigh$(\sigma^2)$: $\sigma^2 > 0$
pdf: $f(r) = \dfrac{r}{\sigma^2} \exp\Big(-\dfrac{r^2}{2\sigma^2}\Big)$, $r > 0$
CDF: $1 - \exp\Big(-\dfrac{r^2}{2\sigma^2}\Big)$
mean: $\sigma\sqrt{\dfrac{\pi}{2}}$
variance: $\sigma^2\Big(2 - \dfrac{\pi}{2}\Big)$
Example: Instantaneous value of the envelope of a mean zero, narrow band noise signal.
Significance: If $X$ and $Y$ are independent, $N(0, \sigma^2)$ random variables, then $(X^2 + Y^2)^{\frac{1}{2}}$ has the Rayleigh$(\sigma^2)$ distribution. Also notable is the simple form of the CDF.

1.7 Jointly distributed random variables

Let $X_1, X_2, \ldots, X_m$ be random variables on a single probability space $(\Omega, \mathcal{F}, P)$. The joint cumulative distribution function (CDF) is the function on $\mathbb{R}^m$ defined by
\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_m \le x_m\}. \]
The CDF determines the probabilities of all events concerning $X_1, \ldots, X_m$. For example, if $R$ is the rectangular region $(a, b] \times (a', b']$ in the plane, then
\[ P\{(X_1, X_2) \in R\} = F_{X_1 X_2}(b, b') - F_{X_1 X_2}(a, b') - F_{X_1 X_2}(b, a') + F_{X_1 X_2}(a, a'). \]
We write $+\infty$ as an argument of $F_X$ in place of $x_i$ to denote the limit as $x_i \to +\infty$. By the countable additivity axiom of probability,
\[ F_{X_1 X_2}(x_1, +\infty) = \lim_{x_2 \to \infty} F_{X_1 X_2}(x_1, x_2) = F_{X_1}(x_1). \]

The random variables are jointly continuous if there exists a function $f_{X_1 X_2 \cdots X_m}$, called the joint probability density function (pdf), such that
\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_m} f_{X_1 X_2 \cdots X_m}(u_1, \ldots, u_m)\,du_m \cdots du_1. \]
Note that if $X_1$ and $X_2$ are jointly continuous, then
\[ F_{X_1}(x_1) = F_{X_1 X_2}(x_1, +\infty) = \int_{-\infty}^{x_1} \Big[ \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\,du_2 \Big] du_1, \]
so that $X_1$ has pdf given by
\[ f_{X_1}(u_1) = \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\,du_2. \]

If $X_1, X_2, \ldots, X_m$ are each discrete random variables, then they have a joint pmf $p_{X_1 X_2 \cdots X_m}$ defined by
\[ p_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = P[\{X_1 = u_1\} \cap \{X_2 = u_2\} \cap \cdots \cap \{X_m = u_m\}]. \]
The sum of the probability masses is one, and for any subset $A$ of $\mathbb{R}^m$
\[ P\{(X_1, \ldots, X_m) \in A\} = \sum_{(u_1, \ldots, u_m) \in A} p_X(u_1, u_2, \ldots, u_m). \]
The joint pmf of subsets of $X_1, \ldots, X_m$ can be obtained by summing out the other coordinates of the joint pmf. For example,
\[ p_{X_1}(u_1) = \sum_{u_2} p_{X_1 X_2}(u_1, u_2). \]

The joint characteristic function of $X_1, \ldots, X_m$ is the function on $\mathbb{R}^m$ defined by
\[ \Phi_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = E[e^{j(X_1 u_1 + X_2 u_2 + \cdots + X_m u_m)}]. \]

Random variables $X_1, \ldots, X_m$ are defined to be independent if for any Borel subsets $A_1, \ldots, A_m$ of $\mathbb{R}$, the events $\{X_1 \in A_1\}, \ldots, \{X_m \in A_m\}$ are independent. The random variables are independent if and only if the joint CDF factors:
\[ F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = F_{X_1}(x_1) \cdots F_{X_m}(x_m). \]
If the random variables are jointly continuous, independence is equivalent to the condition that the joint pdf factors. If the random variables are discrete, independence is equivalent to the condition that the joint pmf factors. Similarly, the random variables are independent if and only if the joint characteristic function factors.

1.8 Cross moments of random variables

Let $X$ and $Y$ be random variables on the same probability space with finite second moments. Three important related quantities are:

the correlation: $E[XY]$
the covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
the correlation coefficient: $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$

A fundamental inequality is Schwarz’s inequality:
\[ |E[XY]| \le \sqrt{E[X^2]E[Y^2]}. \tag{1.15} \]
Furthermore, if $E[Y^2] \ne 0$, equality holds if and only if $P[X = cY] = 1$ for some constant $c$. Schwarz’s inequality (1.15) is equivalent to the $L^2$ triangle inequality for random variables:
\[ E[(X + Y)^2]^{\frac{1}{2}} \le E[X^2]^{\frac{1}{2}} + E[Y^2]^{\frac{1}{2}}. \tag{1.16} \]
Schwarz’s inequality can be proved as follows. If $P\{Y = 0\} = 1$ the inequality is trivial, so suppose $E[Y^2] > 0$.
By the inequality $(a + b)^2 \le 2a^2 + 2b^2$ it follows that $E[(X - \lambda Y)^2] < \infty$ for any constant $\lambda$. Take $\lambda = E[XY]/E[Y^2]$ and note that

$$0 \le E[(X - \lambda Y)^2] = E[X^2] - 2\lambda E[XY] + \lambda^2 E[Y^2] = E[X^2] - \frac{E[XY]^2}{E[Y^2]},$$

which is clearly equivalent to the Schwarz inequality. If $P[X = cY] = 1$ for some $c$ then equality holds in (1.15), and conversely, if equality holds in (1.15) then $P[X = cY] = 1$ for $c = \lambda$.

Application of Schwarz's inequality to $X - E[X]$ and $Y - E[Y]$ in place of $X$ and $Y$ yields that

$$|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}.$$

Furthermore, if $\mathrm{Var}(Y) \ne 0$ then equality holds if and only if $X = aY + b$ for some constants $a$ and $b$. Consequently, if $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ are not zero, so that the correlation coefficient $\rho_{XY}$ is well defined, then $|\rho_{XY}| \le 1$ with equality if and only if $X = aY + b$ for some constants $a, b$.

The following alternative expressions for $\mathrm{Cov}(X, Y)$ are often useful in calculations:

$$\mathrm{Cov}(X, Y) = E[X(Y - E[Y])] = E[(X - E[X])Y] = E[XY] - E[X]E[Y].$$

In particular, if either $X$ or $Y$ has mean zero then $E[XY] = \mathrm{Cov}(X, Y)$.

Random variables $X$ and $Y$ are called orthogonal if $E[XY] = 0$ and are called uncorrelated if $\mathrm{Cov}(X, Y) = 0$. If $X$ and $Y$ are independent then they are uncorrelated. The converse is far from true. Independence requires a large number of equations to be true, namely $F_{XY}(x, y) = F_X(x) F_Y(y)$ for every real value of $x$ and $y$. The condition of being uncorrelated requires only a single equation to hold.

Covariance generalizes variance, in that $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$. Covariance is linear in each of its two arguments:

$$\mathrm{Cov}(X + Y, U + V) = \mathrm{Cov}(X, U) + \mathrm{Cov}(X, V) + \mathrm{Cov}(Y, U) + \mathrm{Cov}(Y, V)$$
$$\mathrm{Cov}(aX + b, cY + d) = ac \, \mathrm{Cov}(X, Y)$$

for constants $a, b, c, d$. For example, consider the sum $S_m = X_1 + \cdots + X_m$, such that $X_1, \ldots, X_m$ are (pairwise) uncorrelated with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$ for $1 \le i \le m$. Then $E[S_m] = m\mu$ and

$$\mathrm{Var}(S_m) = \mathrm{Cov}(S_m, S_m) = \sum_i \mathrm{Var}(X_i) + \sum_{i,j : i \ne j} \mathrm{Cov}(X_i, X_j) = m\sigma^2.$$
Therefore, $\frac{S_m - m\mu}{\sqrt{m\sigma^2}}$ has mean zero and variance one.

1.9 Conditional densities

Suppose that $X$ and $Y$ have a joint pdf $f_{XY}$. Recall that the pdf $f_Y$, the second marginal density of $f_{XY}$, is given by

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dx.$$

The conditional pdf of $X$ given $Y$, denoted by $f_{X|Y}(x \mid y)$, is undefined if $f_Y(y) = 0$. It is defined for $y$ such that $f_Y(y) > 0$ by

$$f_{X|Y}(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)}, \qquad -\infty < x < +\infty.$$

If $y$ is fixed and $f_Y(y) > 0$, then as a function of $x$, $f_{X|Y}(x \mid y)$ is itself a pdf. The mean of the conditional pdf is called the conditional expectation (or conditional mean) of $X$ given $Y = y$, written as

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y) \, dx.$$

If the deterministic function $E[X \mid Y = y]$ is applied to the random variable $Y$, the result is a random variable denoted by $E[X \mid Y]$.

Note that conditional pdf and conditional expectation were so far defined in case $X$ and $Y$ have a joint pdf. If instead, $X$ and $Y$ are both discrete random variables, the conditional pmf $p_{X|Y}$ and the conditional expectation $E[X \mid Y = y]$ can be defined in a similar way. More general notions of conditional expectation are considered in a later chapter.

1.10 Transformation of random vectors

A random vector $X$ of dimension $m$ has the form

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$$

where $X_1, \ldots, X_m$ are random variables. The joint distribution of $X_1, \ldots, X_m$ can be considered to be the distribution of the vector $X$. For example, if $X_1, \ldots, X_m$ are jointly continuous, the joint pdf $f_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m)$ can as well be written as $f_X(x)$, and be thought of as the pdf of the random vector $X$.

Let $X$ be a continuous type random vector on $\mathbb{R}^n$. Let $g$ be a one-to-one mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$. Think of $g$ as mapping $x$-space (here $x$ is lower case, representing a coordinate value) into $y$-space. As $x$ varies over $\mathbb{R}^n$, $y$ varies over the range of $g$. All the while, $y = g(x)$ or, equivalently, $x = g^{-1}(y)$.
Suppose that the Jacobian matrix of derivatives $\frac{\partial y}{\partial x}(x)$ is continuous in $x$ and nonsingular for all $x$. By the inverse function theorem of vector calculus, it follows that the Jacobian matrix of the inverse mapping (from $y$ to $x$) exists and satisfies $\frac{\partial x}{\partial y}(y) = \left(\frac{\partial y}{\partial x}(x)\right)^{-1}$. Use $|K|$ for a square matrix $K$ to denote $|\det(K)|$.

Proposition 1.10.1 Under the above assumptions, $Y = g(X)$ is a continuous type random vector and for $y$ in the range of $g$:

$$f_Y(y) = \frac{f_X(x)}{\left| \frac{\partial y}{\partial x}(x) \right|} = f_X(x) \left| \frac{\partial x}{\partial y}(y) \right|.$$

Example 1.10.2 Let $U, V$ have the joint pdf:

$$f_{UV}(u, v) = \begin{cases} u + v & 0 \le u, v \le 1 \\ 0 & \text{else} \end{cases}$$

and let $X = U^2$ and $Y = U(1 + V)$. Let's find the pdf $f_{XY}$. The vector $(U, V)$ in the $u$-$v$ plane is transformed into the vector $(X, Y)$ in the $x$-$y$ plane under a mapping $g$ that maps $u, v$ to $x = u^2$ and $y = u(1 + v)$. The image in the $x$-$y$ plane of the square $[0, 1]^2$ in the $u$-$v$ plane is the set $A$ given by

$$A = \{(x, y) : 0 \le x \le 1, \text{ and } \sqrt{x} \le y \le 2\sqrt{x}\}.$$

See Figure 1.13.

[Figure 1.13: Transformation from the $u$-$v$ plane to the $x$-$y$ plane.]

The mapping from the square is one to one, for if $(x, y) \in A$ then $(u, v)$ can be recovered by $u = \sqrt{x}$ and $v = \frac{y}{\sqrt{x}} - 1$. The Jacobian determinant is

$$\left| \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} \right| = \left| \begin{pmatrix} 2u & 0 \\ 1 + v & u \end{pmatrix} \right| = 2u^2.$$

Therefore, using the transformation formula and expressing $u$ and $v$ in terms of $x$ and $y$ yields

$$f_{XY}(x, y) = \begin{cases} \frac{\sqrt{x} + \left(\frac{y}{\sqrt{x}} - 1\right)}{2x} & \text{if } (x, y) \in A \\ 0 & \text{else} \end{cases}$$

Example 1.10.3 Let $U$ and $V$ be independent continuous type random variables. Let $X = U + V$ and $Y = V$. Let us find the joint density of $X, Y$ and the marginal density of $X$. The mapping

$$g : \begin{pmatrix} u \\ v \end{pmatrix} \to \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} u + v \\ v \end{pmatrix}$$

is invertible, with inverse given by $u = x - y$ and $v = y$.
The absolute value of the Jacobian determinant is given by

$$\left| \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} \right| = \left| \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \right| = 1.$$

Therefore

$$f_{XY}(x, y) = f_{UV}(u, v) = f_U(x - y) f_V(y).$$

The marginal density of $X$ is given by

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y) \, dy = \int_{-\infty}^{\infty} f_U(x - y) f_V(y) \, dy.$$

That is, $f_X = f_U * f_V$.

Example 1.10.4 Let $X_1$ and $X_2$ be independent $N(0, \sigma^2)$ random variables, and let $X = (X_1, X_2)^T$ denote the two-dimensional random vector with coordinates $X_1$ and $X_2$. Any point $x \in \mathbb{R}^2$ can be represented in polar coordinates by the vector $(r, \theta)^T$ such that $r = \|x\| = (x_1^2 + x_2^2)^{1/2}$ and $\theta = \tan^{-1}\left(\frac{x_2}{x_1}\right)$ with values $r \ge 0$ and $0 \le \theta < 2\pi$. The inverse of this mapping is given by

$$x_1 = r\cos(\theta), \qquad x_2 = r\sin(\theta).$$

We endeavor to find the pdf of the random vector $(R, \Theta)^T$, the polar coordinates of $X$. The pdf of $X$ is given by

$$f_X(x) = f_{X_1}(x_1) f_{X_2}(x_2) = \frac{1}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}.$$

The range of the mapping is the set $r > 0$ and $0 \le \theta < 2\pi$. On the range,

$$\left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \left| \begin{pmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta} \\ \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta} \end{pmatrix} \right| = \left| \begin{pmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{pmatrix} \right| = r.$$

Therefore for $(r, \theta)^T$ in the range of the mapping,

$$f_{R,\Theta}(r, \theta) = f_X(x) \left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \frac{r}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}.$$

Of course $f_{R,\Theta}(r, \theta) = 0$ off the range of the mapping. The joint density factors into a function of $r$ and a function of $\theta$, so $R$ and $\Theta$ are independent. Moreover, $R$ has the Rayleigh density with parameter $\sigma^2$, and $\Theta$ is uniformly distributed on $[0, 2\pi]$.

1.11 Problems

1.1 Simple events
A register contains 8 random binary digits which are mutually independent. Each digit is a zero or a one with equal probability.
(a) Describe an appropriate probability space $(\Omega, \mathcal{F}, P)$ corresponding to looking at the contents of the register.
(b) Express each of the following four events explicitly as subsets of $\Omega$, and find their probabilities:
E1 = “No two neighboring digits are the same”
E2 = “Some cyclic shift of the register contents is equal to 01100110”
E3 = “The register contains exactly four zeros”
E4 = “There is a run of at least six consecutive ones”
(c) Find $P[E_1 \mid E_3]$ and $P[E_2 \mid E_3]$.

1.2 Independent vs. mutually exclusive
(a) Suppose that an event $E$ is independent of itself. Show that either $P[E] = 0$ or $P[E] = 1$.
(b) Events $A$ and $B$ have probabilities $P[A] = 0.3$ and $P[B] = 0.4$. What is $P[A \cup B]$ if $A$ and $B$ are independent? What is $P[A \cup B]$ if $A$ and $B$ are mutually exclusive?
(c) Now suppose that $P[A] = 0.6$ and $P[B] = 0.8$. In this case, could the events $A$ and $B$ be independent? Could they be mutually exclusive?

1.3 Congestion at output ports
Consider a packet switch with some number of input ports and eight output ports. Suppose four packets simultaneously arrive on different input ports, and each is routed toward an output port. Assume the choices of output ports are mutually independent, and for each packet, each output port has equal probability.
(a) Specify a probability space $(\Omega, \mathcal{F}, P)$ to describe this situation.
(b) Let $X_i$ denote the number of packets routed to output port $i$ for $1 \le i \le 8$. Describe the joint pmf of $X_1, \ldots, X_8$.
(c) Find $\mathrm{Cov}(X_1, X_2)$.
(d) Find $P\{X_i \le 1 \text{ for all } i\}$.
(e) Find $P\{X_i \le 2 \text{ for all } i\}$.

1.4 Frantic search
At the end of each day Professor Plum puts her glasses in her drawer with probability 0.90, leaves them on the table with probability 0.06, leaves them in her briefcase with probability 0.03, and she actually leaves them at the office with probability 0.01. The next morning she has no recollection of where she left the glasses. She looks for them, but each time she looks in a place the glasses are actually located, she misses finding them with probability 0.1, whether or not she already looked in the same place.
(After all, she doesn't have her glasses on and she is in a hurry.)
(a) Given that Professor Plum didn't find the glasses in her drawer after looking one time, what is the conditional probability the glasses are on the table?
(b) Given that she didn't find the glasses after looking for them in the drawer and on the table once each, what is the conditional probability they are in the briefcase?
(c) Given that she failed to find the glasses after looking in the drawer twice, on the table twice, and in the briefcase once, what is the conditional probability she left the glasses at the office?

1.5 Conditional probability of failed device given failed attempts
A particular webserver may be working or not working. If the webserver is not working, any attempt to access it fails. Even if the webserver is working, an attempt to access it can fail due to network congestion beyond the control of the webserver. Suppose that the a priori probability that the server is working is 0.8. Suppose that if the server is working, then each access attempt is successful with probability 0.9, independently of other access attempts. Find the following quantities.
(a) $P[\text{first access attempt fails}]$
(b) $P[\text{server is working} \mid \text{first access attempt fails}]$
(c) $P[\text{second access attempt fails} \mid \text{first access attempt fails}]$
(d) $P[\text{server is working} \mid \text{first and second access attempts fail}]$.

1.6 Conditional probabilities–basic computations of iterative decoding
(a) Suppose $B_1, \ldots, B_n, Y_1, \ldots, Y_n$ are discrete random variables with joint pmf

$$p(b_1, \ldots, b_n, y_1, \ldots, y_n) = \begin{cases} 2^{-n} \prod_{i=1}^n q_i(y_i \mid b_i) & \text{if } b_i \in \{0, 1\} \text{ for } 1 \le i \le n \\ 0 & \text{else} \end{cases}$$

where $q_i(y_i \mid b_i)$ as a function of $y_i$ is a pmf for $b_i \in \{0, 1\}$. Finally, let $B = B_1 \oplus \cdots \oplus B_n$ represent the modulo two sum of $B_1, \ldots, B_n$. Thus, the ordinary sum of the $n + 1$ random variables $B_1, \ldots, B_n, B$ is even. Express $P[B = 1 \mid Y_1 = y_1, \ldots, Y_n = y_n]$ in terms of the $y_i$ and the functions $q_i$. Simplify your answer.
(b) Suppose $B$ and $Z_1, \ldots, Z_k$ are discrete random variables with joint pmf

$$p(b, z_1, \ldots, z_k) = \begin{cases} \frac{1}{2} \prod_{j=1}^k r_j(z_j \mid b) & \text{if } b \in \{0, 1\} \\ 0 & \text{else} \end{cases}$$

where $r_j(z_j \mid b)$ as a function of $z_j$ is a pmf for $b \in \{0, 1\}$ fixed. Express $P[B = 1 \mid Z_1 = z_1, \ldots, Z_k = z_k]$ in terms of the $z_j$ and the functions $r_j$.

1.7 Conditional lifetimes and the memoryless property of the geometric distribution
(a) Let $X$ represent the lifetime, rounded up to an integer number of years, of a certain car battery. Suppose that the pmf of $X$ is given by $p_X(k) = 0.2$ if $3 \le k \le 7$ and $p_X(k) = 0$ otherwise. (i) Find the probability, $P\{X > 3\}$, that a three year old battery is still working. (ii) Given that the battery is still working after five years, what is the conditional probability that the battery will still be working three years later? (i.e. what is $P[X > 8 \mid X > 5]$?)
(b) A certain Illini basketball player shoots the ball repeatedly from half court during practice. Each shot is a success with probability $p$ and a miss with probability $1 - p$, independently of the outcomes of previous shots. Let $Y$ denote the number of shots required for the first success. (i) Express the probability that she needs more than three shots for a success, $P\{Y > 3\}$, in terms of $p$. (ii) Given that she already missed the first five shots, what is the conditional probability that she will need more than three additional shots for a success? (i.e. what is $P[Y > 8 \mid Y > 5]$?) (iii) What type of probability distribution does $Y$ have?

1.8 Blue corners
Suppose each corner of a cube is colored blue, independently of the other corners, with some probability $p$. Let $B$ denote the event that at least one face of the cube has all four corners colored blue.
(a) Find the conditional probability of $B$ given that exactly five corners of the cube are colored blue.
(b) Find $P[B]$, the unconditional probability of $B$.

1.9 Distribution of the flow capacity of a network
A communication network is shown.
The link capacities in megabits per second (Mbps) are given by $C_1 = C_3 = 5$, $C_2 = C_5 = 10$ and $C_4 = 8$, and are the same in each direction.

[Figure: network connecting a source node to a destination node through links numbered 1 to 5.]

Information flow from the source to the destination can be split among multiple paths. For example, if all links are working, then the maximum communication rate is 10 Mbps: 5 Mbps can be routed over links 1 and 2, and 5 Mbps can be routed over links 3 and 5. Let $F_i$ be the event that link $i$ fails. Suppose that $F_1, F_2, F_3, F_4$ and $F_5$ are independent and $P[F_i] = 0.2$ for each $i$. Let $X$ be defined as the maximum rate (in Mbits per second) at which data can be sent from the source node to the destination node. Find the pmf $p_X$.

1.10 Recognizing cumulative distribution functions
Which of the following are valid CDF's? For each that is not valid, state at least one reason why. For each that is valid, find $P\{X^2 > 5\}$.

$$F_1(x) = \begin{cases} \frac{e^{-x^2}}{4} & x < 0 \\ 1 - \frac{e^{-x^2}}{4} & x \ge 0 \end{cases} \qquad F_2(x) = \begin{cases} 0 & x < 0 \\ 0.5 + e^{-x} & 0 \le x < 3 \\ 1 & x \ge 3 \end{cases} \qquad F_3(x) = \begin{cases} 0 & x \le 0 \\ 0.5 + \frac{x}{20} & 0 < x \le 10 \\ 1 & x \ge 10 \end{cases}$$

1.11 A CDF of mixed type
Let $X$ have the CDF shown.

[Figure: graph of the CDF $F_X$, with horizontal axis ticks at 0, 1, 2 and vertical axis ticks at 0.5 and 1.0.]

(a) Find $P\{X \le 0.8\}$.
(b) Find $E[X]$.
(c) Find $\mathrm{Var}(X)$.

1.12 CDF and characteristic function of a mixed type random variable
Let $X = (U - 0.5)_+$, where $U$ is uniformly distributed over the interval $[0, 1]$. That is, $X = U - 0.5$ if $U - 0.5 \ge 0$, and $X = 0$ if $U - 0.5 < 0$.
(a) Find and carefully sketch the CDF $F_X$. In particular, what is $F_X(0)$?
(b) Find the characteristic function $\Phi_X(u)$ for real values of $u$.

1.13 Poisson and geometric random variables with conditioning
Let $Y$ be a Poisson random variable with mean $\mu > 0$ and let $Z$ be a geometrically distributed random variable with parameter $p$ with $0 < p < 1$. Assume $Y$ and $Z$ are independent.
(a) Find $P\{Y < Z\}$. Express your answer as a simple function of $\mu$ and $p$.
(b) Find $P[Y < Z \mid Z = i]$ for $i \ge 1$. (Hint: This is a conditional probability for events.)
(c) Find $P[Y = i \mid Y < Z]$ for $i \ge 0$.
Express your answer as a simple function of $p$, $\mu$ and $i$. (Hint: This is a conditional probability for events.)
(d) Find $E[Y \mid Y < Z]$, which is the expected value computed according to the conditional distribution found in part (c). Express your answer as a simple function of $\mu$ and $p$.

1.14 Conditional expectation for uniform density over a triangular region
Let $(X, Y)$ be uniformly distributed over the triangle with coordinates $(0, 0)$, $(1, 0)$, and $(2, 1)$.
(a) What is the value of the joint pdf inside the triangle?
(b) Find the marginal density of $X$, $f_X(x)$. Be sure to specify your answer for all real values of $x$.
(c) Find the conditional density function $f_{Y|X}(y \mid x)$. Be sure to specify which values of $x$ the conditional density is well defined for, and for such $x$ specify the conditional density for all $y$. Also, for such $x$ briefly describe the conditional density of $y$ in words.
(d) Find the conditional expectation $E[Y \mid X = x]$. Be sure to specify which values of $x$ this conditional expectation is well defined for.

1.15 Transformation of a random variable
Let $X$ be exponentially distributed with mean $\lambda^{-1}$. Find and carefully sketch the distribution functions for the random variables $Y = \exp(X)$ and $Z = \min(X, 3)$.

1.16 Density of a function of a random variable
Suppose $X$ is a random variable with probability density function

$$f_X(x) = \begin{cases} 2x & 0 \le x \le 1 \\ 0 & \text{else} \end{cases}$$

(a) Find $P[X \ge 0.4 \mid X \le 0.8]$.
(b) Find the density function of $Y$ defined by $Y = -\log(X)$.

1.17 Moments and densities of functions of a random variable
Suppose the length $L$ and width $W$ of a rectangle are independent and each uniformly distributed over the interval $[0, 1]$. Let $C = 2L + 2W$ (the length of the perimeter) and $A = LW$ (the area). Find the means, variances, and probability densities of $C$ and $A$.

1.18 Functions of independent exponential random variables
Let $X_1$ and $X_2$ be independent random variables, with $X_i$ being exponentially distributed with parameter $\lambda_i$.
(a) Find the pdf of $Z = \min\{X_1, X_2\}$.
(b) Find the pdf of $R = \frac{X_1}{X_2}$.

1.19 Using the Gaussian Q function
Express each of the given probabilities in terms of the standard Gaussian complementary CDF $Q$.
(a) $P\{X \ge 16\}$, where $X$ has the $N(10, 9)$ distribution.
(b) $P\{X^2 \ge 16\}$, where $X$ has the $N(10, 9)$ distribution.
(c) $P\{|X - 2Y| > 1\}$, where $X$ and $Y$ are independent, $N(0, 1)$ random variables. (Hint: Linear combinations of independent Gaussian random variables are Gaussian.)

1.20 Gaussians and the Q function
Let $X$ and $Y$ be independent, $N(0, 1)$ random variables.
(a) Find $\mathrm{Cov}(3X + 2Y, X + 5Y + 10)$.
(b) Express $P\{X + 4Y \ge 2\}$ in terms of the $Q$ function.
(c) Express $P\{(X - Y)^2 > 9\}$ in terms of the $Q$ function.

1.21 Correlation of histogram values
Suppose that $n$ fair dice are independently rolled. Let

$$X_i = \begin{cases} 1 & \text{if a 1 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases} \qquad Y_i = \begin{cases} 1 & \text{if a 2 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases}$$

Let $X$ denote the sum of the $X_i$'s, which is simply the number of 1's rolled. Let $Y$ denote the sum of the $Y_i$'s, which is simply the number of 2's rolled. Note that if a histogram is made recording the number of occurrences of each of the six numbers, then $X$ and $Y$ are the heights of the first two entries in the histogram.
(a) Find $E[X_1]$ and $\mathrm{Var}(X_1)$.
(b) Find $E[X]$ and $\mathrm{Var}(X)$.
(c) Find $\mathrm{Cov}(X_i, Y_j)$ if $1 \le i, j \le n$. (Hint: Does it make a difference if $i = j$?)
(d) Find $\mathrm{Cov}(X, Y)$ and the correlation coefficient $\rho(X, Y) = \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$.
(e) Find $E[Y \mid X = x]$ for any integer $x$ with $0 \le x \le n$. Note that your answer should depend on $x$ and $n$, but otherwise your answer is deterministic.

1.22 Working with a joint density
Suppose $X$ and $Y$ have joint density function $f_{X,Y}(x, y) = c(1 + xy)$ if $2 \le x \le 3$ and $1 \le y \le 2$, and $f_{X,Y}(x, y) = 0$ otherwise.
(a) Find $c$.
(b) Find $f_X$ and $f_Y$.
(c) Find $f_{X|Y}$.
1.23 A function of jointly distributed random variables
Suppose $(U, V)$ is uniformly distributed over the square with corners $(0,0)$, $(1,0)$, $(1,1)$, and $(0,1)$, and let $X = UV$. Find the CDF and pdf of $X$.

1.24 Density of a difference
Let $X$ and $Y$ be independent, exponentially distributed random variables with parameter $\lambda$, such that $\lambda > 0$. Find the pdf of $Z = |X - Y|$.

1.25 Working with a two dimensional density
Let the random variables $X$ and $Y$ be jointly uniformly distributed over the region shown.

[Figure: region in the plane, with horizontal axis ticks at 0, 1, 2, 3 and vertical axis ticks at 0 and 1.]

(a) Determine the value of $f_{X,Y}$ on the region shown.
(b) Find $f_X$, the marginal pdf of $X$.
(c) Find the mean and variance of $X$.
(d) Find the conditional pdf of $Y$ given that $X = x$, for $0 \le x \le 1$.
(e) Find the conditional pdf of $Y$ given that $X = x$, for $1 \le x \le 2$.
(f) Find and sketch $E[Y \mid X = x]$ as a function of $x$. Be sure to specify which range of $x$ this conditional expectation is well defined for.

1.26 Some characteristic functions
Find the mean and variance of random variables with the following characteristic functions: (a) $\Phi(u) = \exp(-5u^2 + 2ju)$, (b) $\Phi(u) = (e^{ju} - 1)/(ju)$, and (c) $\Phi(u) = \exp(\lambda(e^{ju} - 1))$.

1.27 Uniform density over a union of two square regions
Let the random variables $X$ and $Y$ be jointly uniformly distributed on the region $\{0 \le u \le 1, 0 \le v \le 1\} \cup \{-1 \le u < 0, -1 \le v < 0\}$.
(a) Determine the value of $f_{XY}$ on the region shown.
(b) Find $f_X$, the marginal pdf of $X$.
(c) Find the conditional pdf of $Y$ given that $X = a$, for $0 < a \le 1$.
(d) Find the conditional pdf of $Y$ given that $X = a$, for $-1 \le a < 0$.
(e) Find $E[Y \mid X = a]$ for $|a| \le 1$.
(f) What is the correlation coefficient of $X$ and $Y$?
(g) Are $X$ and $Y$ independent?
(h) What is the pdf of $Z = X + Y$?

1.28 A transformation of jointly continuous random variables
Suppose $(U, V)$ has joint pdf

$$f_{U,V}(u, v) = \begin{cases} 9u^2 v^2 & \text{if } 0 \le u \le 1 \text{ and } 0 \le v \le 1 \\ 0 & \text{else} \end{cases}$$

Let $X = 3U$ and $Y = UV$.
(a) Find the joint pdf of $X$ and $Y$, being sure to specify where the joint pdf is zero.
(b) Using the joint pdf of $X$ and $Y$, find the conditional pdf, $f_{Y|X}(y \mid x)$, of $Y$ given $X$. (Be sure to indicate which values of $x$ the conditional pdf is well defined for, and for each such $x$ specify the conditional pdf for all real values of $y$.)

1.29 Transformation of densities
Let $U$ and $V$ have the joint pdf:

$$f_{UV}(u, v) = \begin{cases} c(u - v)^2 & 0 \le u, v \le 1 \\ 0 & \text{else} \end{cases}$$

for some constant $c$.
(a) Find the constant $c$.
(b) Suppose $X = U^2$ and $Y = U^2 V^2$. Describe the joint pdf $f_{X,Y}(x, y)$ of $X$ and $Y$. Be sure to indicate where the joint pdf is zero.

1.30 Jointly distributed variables
Let $U$ and $V$ be independent random variables, such that $U$ is uniformly distributed over the interval $[0, 1]$, and $V$ has the exponential probability density function
(a) Calculate $E\left[\frac{V^2}{1 + U}\right]$.
(b) Calculate $P\{U \le V\}$.
(c) Find the joint probability density function of $Y$ and $Z$, where $Y = U^2$ and $Z = UV$.

1.31 * Why not every set has a length
Suppose a length (actually, “one-dimensional volume” would be a better name) of any subset $A \subset \mathbb{R}$ could be defined, so that the following four axioms are satisfied:
L0: $0 \le \mathrm{length}(A) \le \infty$ for any $A \subset \mathbb{R}$
L1: $\mathrm{length}([a, b]) = b - a$ for $a < b$
L2: $\mathrm{length}(A) = \mathrm{length}(A + y)$, for any $A \subset \mathbb{R}$ and $y \in \mathbb{R}$, where $A + y$ represents the translation of $A$ by $y$, defined by $A + y = \{x + y : x \in A\}$
L3: If $A = \cup_{i=1}^\infty B_i$ such that $B_1, B_2, \ldots$ are disjoint, then $\mathrm{length}(A) = \sum_{i=1}^\infty \mathrm{length}(B_i)$.
The purpose of this problem is to show that the above supposition leads to a contradiction. Let $\mathbb{Q}$ denote the set of rational numbers, $\mathbb{Q} = \{p/q : p, q \in \mathbb{Z}, q \ne 0\}$.
(a) Show that the set of rational numbers can be expressed as $\mathbb{Q} = \{q_1, q_2, \ldots\}$, which means that $\mathbb{Q}$ is countably infinite. Say that $x, y \in \mathbb{R}$ are equivalent, and write $x \sim y$, if $x - y \in \mathbb{Q}$.
(b) Show that $\sim$ is an equivalence relation, meaning it is reflexive ($a \sim a$ for all $a \in \mathbb{R}$), symmetric ($a \sim b$ implies $b \sim a$), and transitive ($a \sim b$ and $b \sim c$ implies $a \sim c$). For any $x \in \mathbb{R}$, let $Q_x = \mathbb{Q} + x$.
(c) Show that for any $x, y \in \mathbb{R}$, either $Q_x = Q_y$ or $Q_x \cap Q_y = \emptyset$. Sets of the form $Q_x$ are called equivalence classes of the equivalence relation $\sim$.
(d) Show that $Q_x \cap [0, 1] \ne \emptyset$ for all $x \in \mathbb{R}$, or in other words, each equivalence class contains at least one element from the interval $[0, 1]$. Let $V$ be a set obtained by choosing exactly one element in $[0, 1]$ from each equivalence class (by accepting that $V$ is well defined, you'll be accepting what is called the Axiom of Choice). So $V$ is a subset of $[0, 1]$. Suppose $q'_1, q'_2, \ldots$ is an enumeration of all the rational numbers in the interval $[-1, 1]$, with no number appearing twice in the list. Let $V_i = V + q'_i$ for $i \ge 1$.
(e) Verify that the sets $V_i$ are disjoint, and $[0, 1] \subset \cup_{i=1}^\infty V_i \subset [-1, 2]$. Since the $V_i$'s are translations of $V$, they should all have the same length as $V$. If the length of $V$ is defined to be zero, then $[0, 1]$ would be covered by a countable union of disjoint sets of length zero, so $[0, 1]$ would also have length zero. If the length of $V$ were strictly positive, then the countable union would have infinite length, and hence the interval $[-1, 2]$ would have infinite length. Either way there is a contradiction.

1.32 * On sigma-algebras, random variables, and measurable functions
Prove the seven statements lettered (a)-(g) in what follows.
Definition. Let $\Omega$ be an arbitrary set. A nonempty collection $\mathcal{F}$ of subsets of $\Omega$ is defined to be an algebra if: (i) $A^c \in \mathcal{F}$ whenever $A \in \mathcal{F}$ and (ii) $A \cup B \in \mathcal{F}$ whenever $A, B \in \mathcal{F}$.
(a) If $\mathcal{F}$ is an algebra then $\emptyset \in \mathcal{F}$, $\Omega \in \mathcal{F}$, and the union or intersection of any finite collection of sets in $\mathcal{F}$ is in $\mathcal{F}$.
Definition. $\mathcal{F}$ is called a $\sigma$-algebra if $\mathcal{F}$ is an algebra such that whenever $A_1, A_2, \ldots$ are each in $\mathcal{F}$, so is the union, $\cup A_i$.
(b) If $\mathcal{F}$ is a $\sigma$-algebra and $B_1, B_2, \ldots$ are in $\mathcal{F}$, then so is the intersection, $\cap B_i$.
(c) Let $U$ be an arbitrary nonempty set, and suppose that $\mathcal{F}_u$ is a $\sigma$-algebra of subsets of $\Omega$ for each $u \in U$.
Then the intersection $\cap_{u \in U} \mathcal{F}_u$ is also a $\sigma$-algebra.
(d) The collection of all subsets of $\Omega$ is a $\sigma$-algebra.
(e) If $\mathcal{F}_o$ is any collection of subsets of $\Omega$ then there is a smallest $\sigma$-algebra containing $\mathcal{F}_o$. (Hint: use (c) and (d).)
Definitions. $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra of subsets of $\mathbb{R}$ which contains all sets of the form $(-\infty, a]$. Sets in $\mathcal{B}(\mathbb{R})$ are called Borel sets. A real-valued random variable on a probability space $(\Omega, \mathcal{F}, P)$ is a real-valued function $X$ on $\Omega$ such that $\{\omega : X(\omega) \le a\} \in \mathcal{F}$ for any $a \in \mathbb{R}$.
(f) If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $A \in \mathcal{B}(\mathbb{R})$ then $\{\omega : X(\omega) \in A\} \in \mathcal{F}$. (Hint: Fix a random variable $X$. Let $\mathcal{L}$ be the collection of all sets $A$ in $\mathcal{B}(\mathbb{R})$ for which the conclusion is true. It is enough (why?) to show that $\mathcal{L}$ contains all sets of the form $(-\infty, a]$ and that $\mathcal{L}$ is a $\sigma$-algebra of subsets of $\mathbb{R}$. You must use the fact that $\mathcal{F}$ is a $\sigma$-algebra.)
Remark. By (f), $P\{\omega : X(\omega) \in A\}$, or $P\{X \in A\}$ for short, is well defined for $A \in \mathcal{B}(\mathbb{R})$.
Definition. A function $g$ mapping $\mathbb{R}$ to $\mathbb{R}$ is called Borel measurable if $\{x : g(x) \in A\} \in \mathcal{B}(\mathbb{R})$ whenever $A \in \mathcal{B}(\mathbb{R})$.
(g) If $X$ is a real-valued random variable on $(\Omega, \mathcal{F}, P)$ and $g$ is a Borel measurable function, then $Y$ defined by $Y = g(X)$ is also a random variable on $(\Omega, \mathcal{F}, P)$.

Chapter 2
Convergence of a Sequence of Random Variables

Convergence to limits is a central concept in the theory of calculus. Limits are used to define derivatives and integrals. We wish to consider derivatives and integrals of random functions, so it is natural to begin by examining what it means for a sequence of random variables to converge. See the Appendix for a review of the definition of convergence for a sequence of numbers.

2.1 Four definitions of convergence of random variables

Recall that a random variable $X$ is a function on $\Omega$ for some probability space $(\Omega, \mathcal{F}, P)$. A sequence of random variables $(X_n(\omega) : n \ge 1)$ is hence a sequence of functions.
There are many possible definitions for convergence of a sequence of random variables. One idea is to require $X_n(\omega)$ to converge for each fixed $\omega$. However, at least intuitively, what happens on an event of probability zero is not important. Thus, we use the following definition.

Definition 2.1.1 A sequence of random variables $(X_n : n \ge 1)$ converges almost surely to a random variable $X$, if all the random variables are defined on the same probability space, and $P\{\lim_{n \to \infty} X_n = X\} = 1$. Almost sure convergence is denoted by $\lim_{n \to \infty} X_n = X$ a.s. or $X_n \xrightarrow{a.s.} X$.

Conceptually, to check almost sure convergence, one can first find the set $\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}$ and then see if it has probability one.

We shall construct some examples using the standard unit-interval probability space defined in Example 1.1.2. This particular choice of $(\Omega, \mathcal{F}, P)$ is useful for generating examples, because random variables, being functions on $\Omega$, can be simply specified by their graphs. For example, consider the random variable $X$ pictured in Figure 2.1. The probability mass function for such $X$ is given by $P\{X = 1\} = P\{X = 2\} = \frac{1}{4}$ and $P\{X = 3\} = \frac{1}{2}$. Figure 2.1 is a bit sloppy, in that it is not clear what the values of $X$ are at the jump points, $\omega = 1/4$ or $\omega = 1/2$. However, each of these points has probability zero, so the distribution of $X$ is the same no matter how $X$ is defined at those points.

[Figure 2.1: A random variable on $(\Omega, \mathcal{F}, P)$.]

Example 2.1.2 Let $(X_n : n \ge 1)$ be the sequence of random variables on the standard unit-interval probability space defined by $X_n(\omega) = \omega^n$, illustrated in Figure 2.2.

[Figure 2.2: $X_n(\omega) = \omega^n$ on the standard unit-interval probability space, shown for $n = 1, 2, 3, 4$.]

This sequence converges for all $\omega \in \Omega$, with the limit

$$\lim_{n \to \infty} X_n(\omega) = \begin{cases} 0 & \text{if } 0 \le \omega < 1 \\ 1 & \text{if } \omega = 1. \end{cases}$$
The single point set $\{1\}$ has probability zero, so it is also true (and simpler to say) that $(X_n : n \ge 1)$ converges a.s. to zero. In other words, if we let $X$ be the zero random variable, defined by $X(\omega) = 0$ for all $\omega$, then $X_n \xrightarrow{a.s.} X$.

Example 2.1.3 (Moving, shrinking rectangles) Let $(X_n : n \ge 1)$ be the sequence of random variables on the standard unit-interval probability space, as shown in Figure 2.3. The variable $X_1$ is identically one. The variables $X_2$ and $X_3$ are one on intervals of length $\frac{1}{2}$. The variables $X_4$, $X_5$, $X_6$, and $X_7$ are one on intervals of length $\frac{1}{4}$. In general, each $n \ge 1$ can be written as $n = 2^k + j$ where $k = \lfloor \log_2 n \rfloor$ and $0 \le j < 2^k$. The variable $X_n$ is one on the length $2^{-k}$ interval $(j 2^{-k}, (j + 1) 2^{-k}]$.

[Figure 2.3: A sequence of random variables on $(\Omega, \mathcal{F}, P)$; panels show $X_1$ through $X_7$.]

To investigate a.s. convergence, fix an arbitrary value of $\omega$ with $0 < \omega \le 1$. Then for each $k \ge 1$, there is one value of $n$ with $2^k \le n < 2^{k+1}$ such that $X_n(\omega) = 1$, and $X_n(\omega) = 0$ for all other $n$ in that range. Therefore, $\lim_{n \to \infty} X_n(\omega)$ does not exist. That is, $\{\omega : \lim_{n \to \infty} X_n(\omega) \text{ exists}\} \subset \{0\}$, so of course, $P\{\lim_{n \to \infty} X_n \text{ exists}\} = 0$. Thus, $X_n$ does not converge in the a.s. sense.

However, for large $n$, $P\{X_n = 0\}$ is close to one. This suggests that $X_n$ converges to the zero random variable in some weaker sense.

Example 2.1.3 motivates us to consider the following weaker notion of convergence of a sequence of random variables.

Definition 2.1.4 A sequence of random variables $(X_n)$ converges to a random variable $X$ in probability if all the random variables are defined on the same probability space, and for any $\epsilon > 0$, $\lim_{n \to \infty} P\{|X - X_n| \ge \epsilon\} = 0$. Convergence in probability is denoted by $\lim_{n \to \infty} X_n = X$ p., or $X_n \xrightarrow{p.} X$.
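The contrast between Example 2.1.3 and Definition 2.1.4 can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `rectangle` and `X` and the sample point $\omega = 1/3$ are assumptions chosen for illustration. It uses exact rational arithmetic to list the indices $n$ at which $X_n(\omega) = 1$, and the lengths $2^{-k}$ of the intervals on which $X_n$ is nonzero.

```python
from fractions import Fraction

def rectangle(n):
    """Return (left, right) such that X_n equals one on the interval (left, right].

    Writes n = 2**k + j with k = floor(log2(n)) and 0 <= j < 2**k,
    as in Example 2.1.3.
    """
    k = n.bit_length() - 1            # exact floor(log2(n)) for n >= 1
    j = n - 2**k
    width = Fraction(1, 2**k)
    return j * width, (j + 1) * width

def X(n, omega):
    """Value of X_n at the sample point omega in [0, 1]."""
    if n == 1:
        return 1                      # X_1 is identically one
    left, right = rectangle(n)
    return 1 if left < omega <= right else 0

omega = Fraction(1, 3)                # hypothetical sample point
hits = [n for n in range(1, 64) if X(n, omega) == 1]

# P{X_n != 0} is the interval length 2**(-k); sample it at n = 1, 2, 4, ..., 32.
lengths = [rectangle(2**k)[1] - rectangle(2**k)[0] for k in range(6)]

print(hits)
print(lengths)
```

For this choice of $\omega$ the value one recurs exactly once in each block $2^k \le n < 2^{k+1}$, so $X_n(\omega)$ keeps returning to one and cannot converge, while the interval lengths halve from block to block, which is the sense in which $X_n \xrightarrow{p.} 0$.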
Convergence in probability requires that $|X - X_n|$ be small with high probability (to be precise, less than or equal to $\epsilon$ with probability that converges to one as $n \to \infty$), but on the small probability event that $|X - X_n|$ is not small, it can be arbitrarily large. For some applications that is unacceptable. Roughly speaking, the next definition of convergence requires that $|X - X_n|$ be small with high probability for large $n$, and even if it is not small, the average squared value has to be small enough.

Definition 2.1.5 A sequence of random variables $(X_n)$ converges to a random variable $X$ in the mean square sense if all the random variables are defined on the same probability space, $E[X_n^2] < +\infty$ for all $n$, and $\lim_{n \to \infty} E[(X_n - X)^2] = 0$. Mean square convergence is denoted by $\lim_{n \to \infty} X_n = X$ m.s. or $X_n \xrightarrow{m.s.} X$.

Although it isn't explicitly stated in the definition of m.s. convergence, the limit random variable must also have a finite second moment:

Proposition 2.1.6 If $X_n \xrightarrow{m.s.} X$, then $E[X^2] < +\infty$.

Proof. Suppose that $X_n \xrightarrow{m.s.} X$. By definition, $E[X_n^2] < \infty$ for all $n$. Also by definition, there exists some $n_o$ so that $E[(X - X_n)^2] < 1$ for all $n \ge n_o$. The $L^2$ triangle inequality for random variables, (1.16), yields

$$E[X^2]^{1/2} \le E[(X - X_{n_o})^2]^{1/2} + E[X_{n_o}^2]^{1/2} < +\infty.$$

Example 2.1.7 (More moving, shrinking rectangles) This example is along the same lines as Example 2.1.3, using the standard unit-interval probability space. Each random variable of the sequence $(X_n : n \ge 1)$ is defined as indicated in Figure 2.4, where the value $a_n > 0$ is some constant depending on $n$.

[Figure 2.4: A sequence of random variables corresponding to moving, shrinking rectangles.]

The graph of $X_n$ for $n \ge 1$ has height $a_n$ over some subinterval of $\Omega$ of length $\frac{1}{n}$.
We don't explicitly identify the location of the interval, but we require that for any fixed $\omega$, $X_n(\omega) = a_n$ for infinitely many values of $n$, and $X_n(\omega) = 0$ for infinitely many values of $n$. Such a choice of the locations of the intervals is possible because the sum of the lengths of the intervals, $\sum_{n=1}^{\infty} \frac{1}{n}$, is infinite.

Of course $X_n \xrightarrow{a.s.} 0$ if the deterministic sequence $(a_n)$ converges to zero. However, if there is a constant $\epsilon > 0$ such that $a_n \ge \epsilon$ for all $n$ (for example if $a_n = 1$ for all $n$), then $\{\omega : \lim_{n\to\infty} X_n(\omega) \text{ exists}\} = \emptyset$, just as in Example 2.1.3.

The sequence converges to zero in probability for any choice of the constants $(a_n)$, because for any $\epsilon > 0$,
$$P\{|X_n - 0| \ge \epsilon\} \le P\{X_n \ne 0\} = \frac{1}{n} \to 0.$$

Finally, to investigate mean square convergence, note that $E[|X_n - 0|^2] = \frac{a_n^2}{n}$. Hence, $X_n \xrightarrow{m.s.} 0$ if and only if the sequence of constants $(a_n)$ is such that $\lim_{n\to\infty} \frac{a_n^2}{n} = 0$. For example, if $a_n = \ln(n)$ for all $n$, then $X_n \xrightarrow{m.s.} 0$, but if $a_n = \sqrt{n}$, then $(X_n)$ does not converge to zero in the m.s. sense. (Proposition 2.1.13 below shows that a sequence can have only one limit in the a.s., p., or m.s. senses, so the fact $X_n \xrightarrow{p.} 0$ implies that zero is the only possible limit in the m.s. sense. So if $\frac{a_n^2}{n} \not\to 0$, then $(X_n)$ doesn't converge to any random variable in the m.s. sense.)

Example 2.1.8 (Anchored, shrinking rectangles) Let $(X_n : n \ge 1)$ be a sequence of random variables defined on the standard unit-interval probability space, as indicated in Figure 2.5, where the value $a_n > 0$ is some constant depending on $n$. That is, $X_n(\omega)$ is equal to $a_n$ if $0 \le \omega \le 1/n$, and to zero otherwise.

Figure 2.5: A sequence of random variables corresponding to anchored, shrinking rectangles.

For any nonzero $\omega$ in $\Omega$, $X_n(\omega) = 0$ for all $n$ such that $n > 1/\omega$. Therefore, $X_n \xrightarrow{a.s.} 0$. Whether the sequence $(X_n)$ converges in the p. or m.s.
sense for this example is exactly the same as in Example 2.1.7. That is, for convergence in probability or mean square sense, the locations of the shrinking intervals of support don't matter. So $X_n \xrightarrow{p.} 0$. And $X_n \xrightarrow{m.s.} 0$ if and only if $\frac{a_n^2}{n} \to 0$.

It is shown in Proposition 2.1.13 below that either a.s. or m.s. convergence implies convergence in probability. Example 2.1.8 shows that a.s. convergence, like convergence in probability, can allow $|X_n(\omega) - X(\omega)|$ to be extremely large for $\omega$ in a small probability set. So neither convergence in probability nor a.s. convergence implies m.s. convergence, unless an additional assumption is made to control the difference $|X_n(\omega) - X(\omega)|$ everywhere on $\Omega$.

Example 2.1.9 (Rearrangements of rectangles) Let $(X_n : n \ge 1)$ be a sequence of random variables defined on the standard unit-interval probability space. The first three random variables in the sequence are indicated in Figure 2.6. Suppose that the sequence is periodic, with period three, so that $X_{n+3} = X_n$ for all $n \ge 1$.

Figure 2.6: A sequence of random variables obtained by rearrangement of rectangles.

Intuitively speaking, the sequence of random variables persistently jumps around. Obviously it does not converge in the a.s. sense. The sequence does not settle down to converge, even in the sense of convergence in probability, to any one random variable. This can be proved as follows. Suppose for the sake of contradiction that $X_n \xrightarrow{p.} X$ for some random variable $X$. Then for any $\epsilon > 0$ and $\delta > 0$, if $n$ is sufficiently large, $P\{|X_n - X| \ge \epsilon\} \le \delta$. But because the sequence is periodic, it must be that $P\{|X_n - X| \ge \epsilon\} \le \delta$ for $1 \le n \le 3$. Since $\delta$ is arbitrary it must be that $P\{|X_n - X| \ge \epsilon\} = 0$ for $1 \le n \le 3$. Since $\epsilon$ is arbitrary it must be that $P\{X = X_n\} = 1$ for $1 \le n \le 3$. Hence, $P\{X_1 = X_2 = X_3\} = 1$, which is a contradiction.
Thus, the sequence does not converge in probability. A similar argument shows it does not converge in the m.s. sense, either.

Even though the sequence fails to converge in a.s., m.s., or p. senses, it can be observed that all of the $X_n$'s have the same probability distribution. The variables are only different in that the places they take their possible values are rearranged.

Example 2.1.9 suggests that it would be useful to have a notion of convergence that just depends on the distributions of the random variables. One idea for a definition of convergence in distribution is to require that the sequence of CDFs $F_{X_n}(x)$ converge as $n \to \infty$ for all $x$. The following example shows such a definition could give unexpected results in some cases.

Example 2.1.10 Let $U$ be uniformly distributed on the interval $[0, 1]$, and for $n \ge 1$, let $X_n = \frac{(-1)^n U}{n}$. Let $X$ denote the random variable such that $X = 0$ for all $\omega$. It is easy to verify that $X_n \xrightarrow{a.s.} X$ and $X_n \xrightarrow{p.} X$. Does the CDF of $X_n$ converge to the CDF of $X$? The CDF of $X_n$ is graphed in Figure 2.7. The CDF $F_{X_n}(x)$ converges to 0 for $x < 0$ and to one for $x > 0$. However, $F_{X_n}(0)$ alternates between 0 and 1 and hence does not converge to anything. In particular, it doesn't converge to $F_X(0)$. Thus, $F_{X_n}(x)$ converges to $F_X(x)$ for all $x$ except $x = 0$.

Figure 2.7: CDF of $X_n = \frac{(-1)^n U}{n}$, shown for $n$ even and for $n$ odd.

Recall that the distribution of a random variable $X$ has probability mass $\Delta$ at some value $x_o$, i.e. $P\{X = x_o\} = \Delta > 0$, if and only if the CDF has a jump of size $\Delta$ at $x_o$: $F(x_o) - F(x_o-) = \Delta$. Example 2.1.10 illustrates the fact that if the limit random variable $X$ has such a point mass, then even if $X_n$ is very close to $X$, the value $F_{X_n}(x)$ need not converge. To overcome this phenomenon, we adopt a definition of convergence in distribution which requires convergence of the CDFs only at the continuity points of the limit CDF. Continuity points are defined for general functions in Appendix 11.3.
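The behavior of the CDFs in Example 2.1.10 can be checked with a few lines of code. This is a minimal sketch under the assumption, taken from the example, that $X_n = (-1)^n U / n$ with $U$ uniform on $[0, 1]$; the closed form used for $F_{X_n}$ is worked out here for illustration, and the function names are hypothetical.

```python
def cdf_xn(n, x):
    """CDF of X_n = (-1)^n U / n, with U uniform on [0, 1].

    n even: X_n = U/n lies in [0, 1/n], so F(x) = n x clipped to [0, 1].
    n odd:  X_n = -U/n lies in [-1/n, 0], so F(x) = 1 + n x clipped to [0, 1].
    """
    if n % 2 == 0:
        return min(1.0, max(0.0, n * x))
    return min(1.0, max(0.0, 1.0 + n * x))


def cdf_limit(x):
    """CDF of the zero random variable X."""
    return 1.0 if x >= 0 else 0.0


at_zero = [cdf_xn(n, 0.0) for n in range(1, 7)]
print("F_Xn(0) for n = 1..6:", at_zero)            # alternates, no limit
for x in (-0.01, 0.01):
    tail = [cdf_xn(n, x) for n in (1000, 1001)]
    print("x =", x, "F_Xn(x) for n = 1000, 1001:", tail,
          "limit CDF:", cdf_limit(x))
```

At the single discontinuity point $x = 0$ of $F_X$ the values keep alternating, while at any other $x$ they agree with $F_X(x)$ once $n$ is large.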
Since CDFs are right-continuous and nondecreasing, a point $x$ is a continuity point of a CDF $F$ if and only if there is no jump of $F$ at $x$: i.e. if $F_X(x) = F_X(x-)$.

Definition 2.1.11 A sequence $(X_n : n \ge 1)$ of random variables converges in distribution to a random variable $X$ if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x) \text{ at all continuity points } x \text{ of } F_X.$$
Convergence in distribution is denoted by $\lim_{n\to\infty} X_n = X$ d. or $X_n \xrightarrow{d.} X$.

One way to investigate convergence in distribution is through the use of characteristic functions.

Proposition 2.1.12 Let $(X_n)$ be a sequence of random variables and let $X$ be a random variable. Then the following are equivalent:
(i) $X_n \xrightarrow{d.} X$
(ii) $E[f(X_n)] \to E[f(X)]$ for any bounded continuous function $f$.
(iii) $\Phi_{X_n}(u) \to \Phi_X(u)$ for each $u \in \mathbb{R}$ (i.e. pointwise convergence of characteristic functions)

The relationships among the four types of convergence discussed in this section are given in the following proposition, and are pictured in Figure 2.8. The definitions use differing amounts of information about the random variables $(X_n : n \ge 1)$ and $X$ involved. Convergence in the a.s. sense involves joint properties of all the random variables. Convergence in the p. or m.s. sense involves only pairwise joint distributions, namely those of $(X_n, X)$ for all $n$. Convergence in distribution involves only the individual distributions of the random variables. Convergence in the a.s., m.s., and p. senses requires the variables to all be defined on the same probability space. For convergence in distribution, the random variables need not be defined on the same probability space.

Figure 2.8: Relationships among four types of convergence of random variables. (Annotation in the figure: if the sequence is dominated by a single random variable with a finite second moment.)

Proposition 2.1.13
(a) If $X_n \xrightarrow{a.s.} X$ then $X_n \xrightarrow{p.} X$.
(b) If $X_n \xrightarrow{m.s.} X$ then $X_n \xrightarrow{p.} X$.
(c) If $P\{|X_n| \le Y\} = 1$ for all $n$ for some fixed random variable $Y$ with $E[Y^2] < \infty$, and if $X_n \xrightarrow{p.} X$, then $X_n \xrightarrow{m.s.} X$.
(d) If $X_n \xrightarrow{p.} X$ then $X_n \xrightarrow{d.} X$.
(e) Suppose $X_n \to X$ in the p., m.s., or a.s. sense and $X_n \to Y$ in the p., m.s., or a.s. sense. Then $P\{X = Y\} = 1$. That is, if differences on sets of probability zero are ignored, a sequence of random variables can have only one limit (if p., m.s., and/or a.s. senses are used).
(f) Suppose $X_n \xrightarrow{d.} X$ and $X_n \xrightarrow{d.} Y$. Then $X$ and $Y$ have the same distribution.

Proof. (a) Suppose $X_n \xrightarrow{a.s.} X$ and let $\epsilon > 0$. Define a sequence of events $A_n$ by
$$A_n = \{\omega : |X_n(\omega) - X(\omega)| < \epsilon\}.$$
We only need to show that $P[A_n] \to 1$. Define $B_n$ by
$$B_n = \{\omega : |X_k(\omega) - X(\omega)| < \epsilon \text{ for all } k \ge n\}.$$
Note that $B_n \subset A_n$ and $B_1 \subset B_2 \subset \cdots$ so $\lim_{n\to\infty} P[B_n] = P[B]$ where $B = \bigcup_{n=1}^{\infty} B_n$. Clearly
$$B \supset \{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$$
so $1 = P[B] = \lim_{n\to\infty} P[B_n]$. Since $P[A_n]$ is squeezed between $P[B_n]$ and 1, $\lim_{n\to\infty} P[A_n] = 1$, so $X_n \xrightarrow{p.} X$.

(b) Suppose $X_n \xrightarrow{m.s.} X$ and let $\epsilon > 0$. By the Markov inequality applied to $|X - X_n|^2$,
$$P\{|X - X_n| \ge \epsilon\} \le \frac{E[|X - X_n|^2]}{\epsilon^2} \tag{2.1}$$
The right side of (2.1), and hence the left side of (2.1), converges to zero as $n$ goes to infinity. Therefore $X_n \xrightarrow{p.} X$ as $n \to \infty$.

(c) Suppose $X_n \xrightarrow{p.} X$. Then for any $\epsilon > 0$,
$$P\{|X| \ge Y + \epsilon\} \le P\{|X - X_n| \ge \epsilon\} \to 0$$
so that $P\{|X| \ge Y + \epsilon\} = 0$ for every $\epsilon > 0$. Thus, $P\{|X| \le Y\} = 1$, so that $P\{|X - X_n|^2 \le 4Y^2\} = 1$. Therefore, with probability one, for any $\epsilon > 0$,
$$|X - X_n|^2 \le 4Y^2 I_{\{|X - X_n| \ge \epsilon\}} + \epsilon^2$$
so
$$E[|X - X_n|^2] \le 4E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}] + \epsilon^2.$$
In the special case that $P\{Y = L\} = 1$ for a constant $L$, the term $E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}]$ is equal to $L^2 P\{|X - X_n| \ge \epsilon\}$, and by the hypotheses, $P\{|X - X_n| \ge \epsilon\} \to 0$. Even if $Y$ is random, since $E[Y^2] < \infty$ and $P\{|X - X_n| \ge \epsilon\} \to 0$, it still follows that $E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}] \to 0$ as $n \to \infty$, by Corollary 11.6.5. So, for $n$ large enough, $E[|X - X_n|^2] \le 2\epsilon^2$. Since $\epsilon$ was arbitrary, $X_n \xrightarrow{m.s.} X$.
(d) Assume $X_n \xrightarrow{p.} X$. Select any continuity point $x$ of $F_X$. It must be proved that $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$. Let $\epsilon > 0$. Then there exists $\delta > 0$ so that $F_X(x) \le F_X(x - \delta) + \frac{\epsilon}{2}$. (See Figure 2.9.)

Figure 2.9: A CDF at a continuity point.

Now
$$\{X \le x - \delta\} = \{X \le x - \delta, X_n \le x\} \cup \{X \le x - \delta, X_n > x\} \subset \{X_n \le x\} \cup \{|X - X_n| \ge \delta\}$$
so
$$F_X(x - \delta) \le F_{X_n}(x) + P\{|X_n - X| \ge \delta\}.$$
For all $n$ sufficiently large, $P\{|X_n - X| \ge \delta\} \le \frac{\epsilon}{2}$. This and the choice of $\delta$ yield, for all $n$ sufficiently large, $F_X(x) \le F_{X_n}(x) + \epsilon$. Similarly, for all $n$ sufficiently large, $F_X(x) \ge F_{X_n}(x) - \epsilon$. So for all $n$ sufficiently large, $|F_{X_n}(x) - F_X(x)| \le \epsilon$. Since $\epsilon$ was arbitrary, $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$.

(e) By parts (a) and (b), already proved, we can assume that $X_n \xrightarrow{p.} X$ and $X_n \xrightarrow{p.} Y$. Let $\epsilon > 0$ and $\delta > 0$, and select $N$ so large that $P\{|X_n - X| \ge \epsilon\} \le \delta$ and $P\{|X_n - Y| \ge \epsilon\} \le \delta$ for all $n \ge N$. By the triangle inequality, $|X - Y| \le |X_N - X| + |X_N - Y|$. Thus,
$$\{|X - Y| \ge 2\epsilon\} \subset \{|X_N - X| \ge \epsilon\} \cup \{|X_N - Y| \ge \epsilon\}$$
so that
$$P\{|X - Y| \ge 2\epsilon\} \le P\{|X_N - X| \ge \epsilon\} + P\{|X_N - Y| \ge \epsilon\} \le 2\delta.$$
We've proved that $P\{|X - Y| \ge 2\epsilon\} \le 2\delta$. Since $\delta$ was arbitrary, it must be that $P\{|X - Y| \ge 2\epsilon\} = 0$. Since $\epsilon$ was arbitrary, it must be that $P\{|X - Y| = 0\} = 1$.

(f) Suppose $X_n \xrightarrow{d.} X$ and $X_n \xrightarrow{d.} Y$. Then $F_X(x) = F_Y(x)$ whenever $x$ is a continuity point of both $F_X$ and $F_Y$. Since $F_X$ and $F_Y$ are nondecreasing and bounded, they can have only finitely many discontinuities of size greater than $1/n$ for any $n$, so that the total number of discontinuities is at most countably infinite. Hence, in any nonempty interval, there is a point of continuity of both functions. So for any $x \in \mathbb{R}$, there is a strictly decreasing sequence of numbers $(x_n)$ converging to $x$, such that each $x_n$ is a point of continuity of both $F_X$ and $F_Y$. So $F_X(x_n) = F_Y(x_n)$ for all $n$. Taking the limit as $n \to \infty$ and using the right-continuity of CDFs, we have $F_X(x) = F_Y(x)$.
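Part (b) of Proposition 2.1.13 and its one-way nature can be illustrated with the anchored rectangles of Example 2.1.8, for which both sides of the Markov bound (2.1) are available in closed form. The sketch below is only illustrative: the helper names, the value $\epsilon = 0.5$, and the sample values of $n$ are assumptions made for this example.

```python
import math


def prob_exceeds(n, a_n, eps):
    """P{|X_n - 0| >= eps} for a rectangle of height a_n on [0, 1/n]."""
    return 1.0 / n if a_n >= eps else 0.0


def second_moment(n, a_n):
    """E[|X_n - 0|^2] = a_n^2 / n for the same rectangle."""
    return a_n ** 2 / n


eps = 0.5  # hypothetical tolerance
for label, height in (("a_n = ln n", math.log), ("a_n = sqrt n", math.sqrt)):
    for n in (10, 1000, 100000):
        a_n = height(n)
        p = prob_exceeds(n, a_n, eps)
        m2 = second_moment(n, a_n)
        # Markov bound (2.1): P{|X_n| >= eps} <= E[X_n^2] / eps^2
        print(f"{label:13s} n={n:6d}  P={p:.6f}  "
              f"E[X_n^2]={m2:.6f}  bound={m2 / eps ** 2:.6f}")
```

With $a_n = \ln n$ the second moment tends to zero, so the bound forces convergence in probability. With $a_n = \sqrt{n}$ the second moment stays at one and the bound is uninformative, yet the probability still tends to zero, consistent with the fact that convergence in probability does not imply m.s. convergence without a domination assumption as in part (c).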
Example 2.1.14 Suppose $X_0$ is a random variable with $P\{X_0 \ge 0\} = 1$. Suppose $X_n = 6 + \sqrt{X_{n-1}}$ for $n \ge 1$. For example, if for some $\omega$ it happens that $X_0(\omega) = 12$, then
$$X_1(\omega) = 6 + \sqrt{12} = 9.464\ldots$$
$$X_2(\omega) = 6 + \sqrt{9.464} = 9.076\ldots$$
$$X_3(\omega) = 6 + \sqrt{9.076} = 9.0127\ldots$$
Examining Figure 2.10, it is clear that for any $\omega$ with $X_0(\omega) > 0$, the sequence of numbers $X_n(\omega)$ converges to 9. Therefore, $X_n \xrightarrow{a.s.} 9$.

Figure 2.10: Graph of the functions $6 + \sqrt{x}$ and $6 + \frac{x}{3}$.

The rate of convergence can be bounded as follows. Note that for each $x \ge 0$, $|6 + \sqrt{x} - 9| \le |6 + \frac{x}{3} - 9|$. Therefore,
$$|X_n(\omega) - 9| \le \left|6 + \frac{X_{n-1}(\omega)}{3} - 9\right| = \frac{1}{3}|X_{n-1}(\omega) - 9|$$
so that by induction on $n$,
$$|X_n(\omega) - 9| \le 3^{-n}|X_0(\omega) - 9| \tag{2.2}$$
Since $X_n \xrightarrow{a.s.} 9$ it follows that $X_n \xrightarrow{p.} 9$.

Finally, we investigate m.s. convergence under the assumption that $E[X_0^2] < +\infty$. By the inequality $(a + b)^2 \le 2a^2 + 2b^2$, it follows that
$$E[(X_0 - 9)^2] \le 2(E[X_0^2] + 81) \tag{2.3}$$
Squaring and taking expectations on each side of (2.2) and using (2.3) thus yields
$$E[|X_n - 9|^2] \le 2 \cdot 3^{-2n}\{E[X_0^2] + 81\}.$$
Therefore, $X_n \xrightarrow{m.s.} 9$.

Example 2.1.15 Let $W_0, W_1, \ldots$ be independent, normal random variables with mean 0 and variance 1. Let $X_{-1} = 0$ and
$$X_n = (0.9)X_{n-1} + W_n, \quad n \ge 0.$$
In what sense does $X_n$ converge as $n$ goes to infinity? For fixed $\omega$, the sequence of numbers $X_0(\omega), X_1(\omega), \ldots$ might appear as in Figure 2.11.

Intuitively speaking, $X_n$ persistently moves. We claim that $X_n$ does not converge in probability (so also not in the a.s. or m.s. senses). Here is a proof of the claim. Examination of a table for the normal distribution yields that $P\{W_n \ge 2\} = P\{W_n \le -2\} \ge 0.02$. Then
$$P\{|X_n - X_{n-1}| \ge 2\} \ge P\{X_{n-1} \ge 0, W_n \le -2\} + P\{X_{n-1} < 0, W_n \ge 2\}$$
$$= P\{X_{n-1} \ge 0\}P\{W_n \le -2\} + P\{X_{n-1} < 0\}P\{W_n \ge 2\}$$
$$= P\{W_n \ge 2\} \ge 0.02$$
Figure 2.11: A typical sample sequence of $X$.

Therefore, for any random variable $X$,
$$P\{|X_n - X| \ge 1\} + P\{|X_{n-1} - X| \ge 1\} \ge P\{|X_n - X| \ge 1 \text{ or } |X_{n-1} - X| \ge 1\} \ge P\{|X_n - X_{n-1}| \ge 2\} \ge 0.02$$
so $P\{|X_n - X| \ge 1\}$ does not converge to zero as $n \to \infty$. So $X_n$ does not converge in probability to any random variable $X$. The claim is proved.

Although $X_n$ does not converge in probability (or in the a.s. or m.s. senses), it nevertheless seems to asymptotically settle into an equilibrium. To probe this point further, let's find the distribution of $X_n$ for each $n$.
$$X_0 = W_0 \text{ is } N(0, 1)$$
$$X_1 = (0.9)X_0 + W_1 \text{ is } N(0, 1.81)$$
$$X_2 = (0.9)X_1 + W_2 \text{ is } N(0, (0.81)(1.81) + 1)$$
In general, $X_n$ is $N(0, \sigma_n^2)$ where the variances satisfy the recursion $\sigma_n^2 = (0.81)\sigma_{n-1}^2 + 1$ so $\sigma_n^2 \to \sigma_\infty^2$ where $\sigma_\infty^2 = \frac{1}{0.19} = 5.263$. Therefore, the CDF of $X_n$ converges everywhere to the CDF of any random variable $X$ which has the $N(0, \sigma_\infty^2)$ distribution. So $X_n \xrightarrow{d.} X$ for any such $X$.

The previous example involved convergence in distribution of Gaussian random variables. The limit random variable was also Gaussian. In fact, we close this section by showing that limits of Gaussian random variables are always Gaussian. Recall that $X$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$ if either $\sigma^2 > 0$ and $F_X(c) = \Phi(\frac{c - \mu}{\sigma})$ for all $c$, where $\Phi$ is the CDF of the standard $N(0, 1)$ distribution, or $\sigma^2 = 0$, in which case $F_X(c) = I_{\{c \ge \mu\}}$ and $P\{X = \mu\} = 1$.

Proposition 2.1.16 Suppose $X_n$ is a Gaussian random variable for each $n$, and that $X_n \to X_\infty$ as $n \to \infty$, in any one of the four senses, a.s., m.s., p., or d. Then $X_\infty$ is also a Gaussian random variable.

Proof. Since convergence in the other senses implies convergence in distribution, we can assume that the sequence converges in distribution. Let $\mu_n$ and $\sigma_n^2$ denote the mean and variance of $X_n$. The first step is to show that the sequence $\sigma_n^2$ is bounded.
Intuitively, if it weren't bounded, the distribution of $X_n$ would get too spread out to converge. Since $F_{X_\infty}$ is a valid CDF, there exists a value $L$ so large that $F_{X_\infty}(-L) < \frac{1}{3}$ and $F_{X_\infty}(L) > \frac{2}{3}$. By increasing $L$ if necessary, we can also assume that $L$ and $-L$ are continuity points of $F_{X_\infty}$. So there exists $n_o$ such that, whenever $n \ge n_o$, $F_{X_n}(-L) \le \frac{1}{3}$ and $F_{X_n}(L) \ge \frac{2}{3}$. Therefore, for $n \ge n_o$, $P\{|X_n| \le L\} \ge F_{X_n}(L) - F_{X_n}(-L) \ge \frac{1}{3}$. For $\sigma_n^2$ fixed, the probability $P\{|X_n| \le L\}$ is maximized by $\mu_n = 0$, so no matter what the value of $\mu_n$ is, $2\Phi(\frac{L}{\sigma_n}) - 1 \ge P\{|X_n| \le L\}$. Therefore, for $n \ge n_o$, $\Phi(\frac{L}{\sigma_n}) \ge \frac{2}{3}$, or equivalently, $\sigma_n \le L/\Phi^{-1}(\frac{2}{3})$, where $\Phi^{-1}$ is the inverse of $\Phi$. The first $n_o - 1$ terms of the sequence $(\sigma_n^2)$ are finite. Therefore, the whole sequence $(\sigma_n^2)$ is bounded.

Constant random variables are considered to be Gaussian random variables, namely degenerate ones with zero variance. So assume without loss of generality that $X_\infty$ is not a constant random variable. Then there exists a value $c_o$ so that $F_{X_\infty}(c_o)$ is strictly between zero and one. Since $F_{X_\infty}$ is right-continuous, the function must lie strictly between zero and one over some interval of positive length, with left endpoint $c_o$. The function can only have countably many points of discontinuity, so it has infinitely many points of continuity such that the function value is strictly between zero and one. Let $c_1$ and $c_2$ be two distinct such points, and let $p_1$ and $p_2$ denote the values of $F_{X_\infty}$ at those two points, and let $b_i = \Phi^{-1}(p_i)$ for $i = 1, 2$. It follows that $\lim_{n\to\infty} \frac{c_i - \mu_n}{\sigma_n} = b_i$ for $i = 1, 2$. The limit of the difference of the sequences is the difference of the limits, so $\lim_{n\to\infty} \frac{c_1 - c_2}{\sigma_n} = b_1 - b_2$. Since $c_1 - c_2 \ne 0$ and the sequence $(\sigma_n)$ is bounded, it follows that $(\sigma_n)$ has a finite limit, $\sigma_\infty$, and therefore also $(\mu_n)$ has a finite limit, $\mu_\infty$.
Therefore, the CDFs $F_{X_n}$ converge pointwise to the CDF for the $N(\mu_\infty, \sigma_\infty^2)$ distribution. Thus, $X_\infty$ has the $N(\mu_\infty, \sigma_\infty^2)$ distribution.

2.2 Cauchy criteria for convergence of random variables

It is important to be able to show that a limit exists even if the limit value is not known. For example, it is useful to determine if the sum of an infinite series of numbers is convergent without needing to know the value of the sum. One useful result for this purpose is that if $(x_n : n \ge 1)$ is monotone nondecreasing, i.e. $x_1 \le x_2 \le \cdots$, and if it satisfies $x_n \le L$ for all $n$ for some finite constant $L$, then the sequence is convergent. This result carries over immediately to random variables: if $(X_n : n \ge 1)$ is a sequence of random variables such that $P\{X_n \le X_{n+1}\} = 1$ for all $n$ and if there is a random variable $Y$ such that $P\{X_n \le Y\} = 1$ for all $n$, then $(X_n)$ converges a.s.

For deterministic sequences that are not monotone, the Cauchy criterion gives a simple yet general condition that implies convergence to a finite limit. A deterministic sequence $(x_n : n \ge 1)$ is said to be a Cauchy sequence if $\lim_{m,n\to\infty} |x_m - x_n| = 0$. This means that, for any $\epsilon > 0$, there exists $N$ sufficiently large, such that $|x_m - x_n| < \epsilon$ for all $m, n \ge N$. If the sequence $(x_n)$ has a finite limit $x_\infty$, then the triangle inequality for distances between numbers, $|x_m - x_n| \le |x_m - x_\infty| + |x_n - x_\infty|$, implies that the sequence is a Cauchy sequence. More useful is the converse statement, called the Cauchy criterion for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$ is a Cauchy sequence then $(x_n)$ converges to a finite limit as $n \to \infty$. The following proposition gives similar criteria for convergence of random variables.

Proposition 2.2.1 (Cauchy criteria for random variables) Let $(X_n)$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$.
(a) $X_n$ converges a.s.
to some random variable if and only if $P\{\omega : \lim_{m,n\to\infty} |X_m(\omega) - X_n(\omega)| = 0\} = 1$.
(b) $X_n$ converges m.s. to some random variable if and only if $(X_n)$ is a Cauchy sequence in the m.s. sense, meaning $E[X_n^2] < +\infty$ for all $n$ and
$$\lim_{m,n\to\infty} E[(X_m - X_n)^2] = 0. \tag{2.4}$$
(c) $X_n$ converges p. to some random variable if and only if for every $\epsilon > 0$,
$$\lim_{m,n\to\infty} P\{|X_m - X_n| \ge \epsilon\} = 0. \tag{2.5}$$

Proof. (a) For any $\omega$ fixed, $(X_n(\omega) : n \ge 1)$ is a sequence of numbers. So by the Cauchy criterion for convergence of a sequence of numbers, the following equality of sets holds:
$$\{\omega : \lim_{n\to\infty} X_n(\omega) \text{ exists and is finite}\} = \{\omega : \lim_{m,n\to\infty} |X_m(\omega) - X_n(\omega)| = 0\}.$$
Thus, the set on the left has probability one (i.e. $X_n$ converges a.s. to a random variable) if and only if the set on the right has probability one. Part (a) is proved.

(b) First the "only if" part is proved. Suppose $X_n \xrightarrow{m.s.} X_\infty$. By the $L^2$ triangle inequality for random variables,
$$E[(X_n - X_m)^2]^{\frac{1}{2}} \le E[(X_m - X_\infty)^2]^{\frac{1}{2}} + E[(X_n - X_\infty)^2]^{\frac{1}{2}} \tag{2.6}$$
Since $X_n \xrightarrow{m.s.} X_\infty$, the right side of (2.6) converges to zero as $m, n \to \infty$, so that (2.4) holds. The "only if" part of (b) is proved.

Moving to the proof of the "if" part, suppose (2.4) holds. Choose the sequence $k_1 < k_2 < \ldots$ recursively as follows. Let $k_1$ be so large that $E[(X_n - X_{k_1})^2] \le 1/2$ for all $n \ge k_1$. Once $k_1, \ldots, k_{i-1}$ are selected, let $k_i$ be so large that $k_i > k_{i-1}$ and $E[(X_n - X_{k_i})^2] \le 2^{-i}$ for all $n \ge k_i$. It follows from this choice of the $k_i$'s that $E[(X_{k_{i+1}} - X_{k_i})^2] \le 2^{-i}$ for all $i \ge 1$. Let $S_n = |X_{k_1}| + \sum_{i=1}^{n-1} |X_{k_{i+1}} - X_{k_i}|$. Note that $|X_{k_i}| \le S_n$ for $1 \le i \le n$ by the triangle inequality for differences of real numbers. By the $L^2$ triangle inequality for random variables (1.16),
$$E[S_n^2]^{\frac{1}{2}} \le E[X_{k_1}^2]^{\frac{1}{2}} + \sum_{i=1}^{n-1} E[(X_{k_{i+1}} - X_{k_i})^2]^{\frac{1}{2}} \le E[X_{k_1}^2]^{\frac{1}{2}} + 1.$$
Since $S_n$ is monotonically increasing, it converges a.s. to a limit $S_\infty$. Note that $|X_{k_i}| \le S_\infty$ for all $i \ge 1$.
By the monotone convergence theorem, $E[S_\infty^2] = \lim_{n\to\infty} E[S_n^2] \le (E[X_{k_1}^2]^{\frac{1}{2}} + 1)^2$. So, $S_\infty$ is in $L^2(\Omega, \mathcal{F}, P)$. In particular, $S_\infty$ is finite a.s., and for any $\omega$ such that $S_\infty(\omega)$ is finite, the sequence of numbers $(X_{k_i}(\omega) : i \ge 1)$ is a Cauchy sequence. (See Example 11.2.3 in the appendix.) By completeness of $\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists. Let $X_\infty(\omega) = 0$ on the zero probability event that $(X_{k_i}(\omega) : i \ge 1)$ does not converge. Summarizing, we have $\lim_{i\to\infty} X_{k_i} = X_\infty$ a.s. and $|X_{k_i}| \le S_\infty$ where $S_\infty \in L^2(\Omega, \mathcal{F}, P)$. It therefore follows from Proposition 2.1.13(c) that $X_{k_i} \xrightarrow{m.s.} X_\infty$.

The final step is to prove that the entire sequence $(X_n)$ converges in the m.s. sense to $X_\infty$. For this purpose, let $\epsilon > 0$. Select $i$ so large that $E[(X_n - X_{k_i})^2] < \epsilon^2$ for all $n \ge k_i$, and $E[(X_{k_i} - X_\infty)^2] \le \epsilon^2$. Then, by the $L^2$ triangle inequality, for any $n \ge k_i$,
$$E[(X_n - X_\infty)^2]^{\frac{1}{2}} \le E[(X_n - X_{k_i})^2]^{\frac{1}{2}} + E[(X_{k_i} - X_\infty)^2]^{\frac{1}{2}} \le 2\epsilon.$$
Since $\epsilon$ was arbitrary, $X_n \xrightarrow{m.s.} X_\infty$. The proof of (b) is complete.

(c) First the "only if" part is proved. Suppose $X_n \xrightarrow{p.} X_\infty$. Then for any $\epsilon > 0$,
$$P\{|X_m - X_n| \ge 2\epsilon\} \le P\{|X_m - X_\infty| \ge \epsilon\} + P\{|X_n - X_\infty| \ge \epsilon\} \to 0$$
as $m, n \to \infty$, so that (2.5) holds. The "only if" part is proved.

Moving to the proof of the "if" part, suppose (2.5) holds. Select an increasing sequence of integers $k_i$ so that $P\{|X_n - X_m| \ge 2^{-i}\} \le 2^{-i}$ for all $m, n \ge k_i$. It follows, in particular, that $P\{|X_{k_{i+1}} - X_{k_i}| \ge 2^{-i}\} \le 2^{-i}$. Since the sum of the probabilities of these events is finite, the probability that infinitely many of the events are true is zero, by the Borel-Cantelli lemma (specifically, Lemma 1.2.2(a)). Thus, $P\{|X_{k_{i+1}} - X_{k_i}| \le 2^{-i} \text{ for all large enough } i\} = 1$. Thus, for all $\omega$ in a set with probability one, $(X_{k_i}(\omega) : i \ge 1)$ is a Cauchy sequence of numbers. By completeness of $\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists.
Let $X_\infty(\omega) = 0$ on the zero probability event that $(X_{k_i}(\omega) : i \ge 1)$ does not converge. Then, $X_{k_i} \xrightarrow{a.s.} X_\infty$. It follows that $X_{k_i} \xrightarrow{p.} X_\infty$ as well.

The final step is to prove that the entire sequence $(X_n)$ converges in the p. sense to $X_\infty$. For this purpose, let $\epsilon > 0$. Select $i$ so large that $P\{|X_n - X_{k_i}| \ge \epsilon\} < \epsilon$ for all $n \ge k_i$, and $P\{|X_{k_i} - X_\infty| \ge \epsilon\} < \epsilon$. Then $P\{|X_n - X_\infty| \ge 2\epsilon\} \le 2\epsilon$ for all $n \ge k_i$. Since $\epsilon$ was arbitrary, $X_n \xrightarrow{p.} X_\infty$. The proof of (c) is complete.

The following is a corollary of Proposition 2.2.1(c) and its proof.

Corollary 2.2.2 If $X_n \xrightarrow{p.} X_\infty$, then there is a subsequence $(X_{k_i} : i \ge 1)$ such that $\lim_{i\to\infty} X_{k_i} = X_\infty$ a.s.

Proof. By Proposition 2.2.1(c), the sequence satisfies (2.5). By the proof of Proposition 2.2.1(c) there is a subsequence $(X_{k_i})$ that converges a.s. By uniqueness of limits in the p. or a.s. senses, the limit of the subsequence is the same random variable, $X_\infty$ (up to differences on a set of measure zero).

Proposition 2.2.1(b), the Cauchy criterion for mean square convergence, is used extensively in these notes. The remainder of this section concerns a more convenient form of the Cauchy criterion for m.s. convergence.

Proposition 2.2.3 (Correlation version of the Cauchy criterion for m.s. convergence) Let $(X_n)$ be a sequence of random variables with $E[X_n^2] < +\infty$ for each $n$. Then there exists a random variable $X$ such that $X_n \xrightarrow{m.s.} X$ if and only if the limit $\lim_{m,n\to\infty} E[X_n X_m]$ exists and is finite. Furthermore, if $X_n \xrightarrow{m.s.} X$, then $\lim_{m,n\to\infty} E[X_n X_m] = E[X^2]$.

Proof. The "if" part is proved first. Suppose $\lim_{m,n\to\infty} E[X_n X_m] = c$ for a finite constant $c$. Then
$$E[(X_n - X_m)^2] = E[X_n^2] - 2E[X_n X_m] + E[X_m^2] \to c - 2c + c = 0 \text{ as } m, n \to \infty.$$
Thus, $X_n$ is Cauchy in the m.s. sense, so $X_n \xrightarrow{m.s.} X$ for some random variable $X$.

To prove the "only if" part, suppose $X_n \xrightarrow{m.s.} X$.
Observe next that
$$E[X_m X_n] = E[(X + (X_m - X))(X + (X_n - X))] = E[X^2 + (X_m - X)X + X(X_n - X) + (X_m - X)(X_n - X)].$$
By the Cauchy-Schwarz inequality,
$$E[|(X_m - X)X|] \le E[(X_m - X)^2]^{\frac{1}{2}} E[X^2]^{\frac{1}{2}} \to 0$$
$$E[|(X_m - X)(X_n - X)|] \le E[(X_m - X)^2]^{\frac{1}{2}} E[(X_n - X)^2]^{\frac{1}{2}} \to 0$$
and similarly $E[|X(X_n - X)|] \to 0$. Thus $E[X_m X_n] \to E[X^2]$. This establishes both the "only if" part of the proposition and the last statement of the proposition. The proof of the proposition is complete.

Corollary 2.2.4 Suppose $X_n \xrightarrow{m.s.} X$ and $Y_n \xrightarrow{m.s.} Y$. Then $E[X_n Y_n] \to E[XY]$.

Proof. By the inequality $(a + b)^2 \le 2a^2 + 2b^2$, it follows that $X_n + Y_n \xrightarrow{m.s.} X + Y$ as $n \to \infty$. Proposition 2.2.3 therefore implies that $E[(X_n + Y_n)^2] \to E[(X + Y)^2]$, $E[X_n^2] \to E[X^2]$, and $E[Y_n^2] \to E[Y^2]$. Since $X_n Y_n = ((X_n + Y_n)^2 - X_n^2 - Y_n^2)/2$, the corollary follows.

Corollary 2.2.5 Suppose $X_n \xrightarrow{m.s.} X$. Then $E[X_n] \to E[X]$.

Proof. Corollary 2.2.5 follows from Corollary 2.2.4 by taking $Y_n = 1$ for all $n$.

Example 2.2.6 This example illustrates the use of Proposition 2.2.3. Let $X_1, X_2, \ldots$ be mean zero random variables such that
$$E[X_i X_j] = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else.} \end{cases}$$
Does the series $\sum_{k=1}^{\infty} \frac{X_k}{k}$ converge in the mean square sense to a random variable with a finite second moment? Let $Y_n = \sum_{k=1}^{n} \frac{X_k}{k}$. The question is whether $Y_n$ converges in the mean square sense to a random variable with finite second moment. The answer is yes if and only if $\lim_{m,n\to\infty} E[Y_m Y_n]$ exists and is finite. Observe that
$$E[Y_m Y_n] = \sum_{k=1}^{\min(m,n)} \frac{1}{k^2} \to \sum_{k=1}^{\infty} \frac{1}{k^2} \text{ as } m, n \to \infty.$$
This sum is smaller than $1 + \int_1^{\infty} \frac{1}{x^2}dx = 2 < \infty$.$^1$ Therefore, by Proposition 2.2.3, the series $\sum_{k=1}^{\infty} \frac{X_k}{k}$ indeed converges in the m.s. sense.

2.3 Limit theorems for sequences of independent random variables

Sums of many independent random variables often have distributions that can be characterized by a small number of parameters.
For engineering applications, this represents a low complexity method for describing the random variables. An analogous tool is the Taylor series approximation. A continuously differentiable function $f$ can be approximated near zero by the first order Taylor's approximation
$$f(x) \approx f(0) + xf'(0).$$
A second order approximation, in case $f$ is twice continuously differentiable, is
$$f(x) \approx f(0) + xf'(0) + \frac{x^2}{2}f''(0).$$
Bounds on the approximation error are given by Taylor's theorem, found in Appendix 11.4. In essence, Taylor's approximation lets us represent the function by the numbers $f(0)$, $f'(0)$ and $f''(0)$. We shall see that the law of large numbers and central limit theorem can be viewed not just as analogies of the first and second order Taylor's approximations, but actually as consequences of them.

Footnote 1 (to Example 2.2.6): In fact, the sum is equal to $\frac{\pi^2}{6}$, but the technique of comparing the sum to an integral to show the sum is finite is the main point here.

Lemma 2.3.1 If $x_n \to x$ as $n \to \infty$ then $(1 + \frac{x_n}{n})^n \to e^x$ as $n \to \infty$.

Proof. The basic idea is to note that $(1 + s)^n = \exp(n \ln(1 + s))$, and apply Taylor's theorem to $\ln(1 + s)$ about the point $s = 0$. The details are given next. Since $\ln(1 + s)|_{s=0} = 0$, $\ln(1 + s)'|_{s=0} = 1$, and $\ln(1 + s)'' = -\frac{1}{(1+s)^2}$, the mean value form of Taylor's Theorem (see the appendix) yields that if $s > -1$, then $\ln(1 + s) = s - \frac{s^2}{2(1+y)^2}$, where $y$ lies in the closed interval with endpoints 0 and $s$. Thus, if $s \ge 0$, then $y \ge 0$, so that
$$s - \frac{s^2}{2} \le \ln(1 + s) \le s \quad \text{if } s \ge 0.$$
More to the point, if it is only known that $s \ge -\frac{1}{2}$, then $y \ge -\frac{1}{2}$, so that
$$s - 2s^2 \le \ln(1 + s) \le s \quad \text{if } s \ge -\frac{1}{2}.$$
Letting $s = \frac{x_n}{n}$, multiplying through by $n$, and applying the exponential function, yields that
$$\exp\left(x_n - \frac{2x_n^2}{n}\right) \le \left(1 + \frac{x_n}{n}\right)^n \le \exp(x_n) \quad \text{if } x_n \ge -\frac{n}{2}.$$
If $x_n \to x$ as $n \to \infty$ then the condition $x_n > -\frac{n}{2}$ holds for all large enough $n$, and $x_n - \frac{2x_n^2}{n} \to x$, yielding the desired result.
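The two-sided bound at the end of the proof of Lemma 2.3.1 can be checked numerically. The following is a small sketch, not part of the notes: the sequence $x_n = 1.5 + 1/n$ and the function name are hypothetical choices made only to exercise the inequality.

```python
import math


def sandwich(x_n, n):
    """Return (lower, middle, upper) from the proof of Lemma 2.3.1.

    Valid when x_n >= -n/2:
        exp(x_n - 2 x_n^2 / n) <= (1 + x_n / n)^n <= exp(x_n)
    """
    lower = math.exp(x_n - 2.0 * x_n ** 2 / n)
    middle = (1.0 + x_n / n) ** n
    upper = math.exp(x_n)
    return lower, middle, upper


x = 1.5  # hypothetical limit of the sequence x_n
for n in (10, 100, 10000):
    x_n = x + 1.0 / n                      # hypothetical sequence with x_n -> x
    lower, middle, upper = sandwich(x_n, n)
    print(f"n={n:6d}  lower={lower:.6f}  (1+x_n/n)^n={middle:.6f}  "
          f"upper={upper:.6f}  e^x={math.exp(x):.6f}")
```

As $n$ grows, both bounds close in on $e^x$, which is the squeeze used in the last step of the proof.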
A sequence of random variables $(X_n)$ is said to be independent and identically distributed (iid) if the $X_i$'s are mutually independent and identically distributed.

Proposition 2.3.2 (Law of large numbers) Suppose that $X_1, X_2, \ldots$ is a sequence of random variables such that each $X_i$ has finite mean $m$. Let $S_n = X_1 + \cdots + X_n$. Then
(a) $\frac{S_n}{n} \xrightarrow{m.s.} m$ (hence also $\frac{S_n}{n} \xrightarrow{p.} m$ and $\frac{S_n}{n} \xrightarrow{d.} m$) if for some constant $c$, $\mathrm{Var}(X_i) \le c$ for all $i$, and $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$ (i.e. if the variances are bounded and the $X_i$'s are uncorrelated).
(b) $\frac{S_n}{n} \xrightarrow{p.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the weak law of large numbers.)
(c) $\frac{S_n}{n} \xrightarrow{a.s.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the strong law of large numbers.)

We give a proof of (a) and (b), but prove (c) only under an extra condition. Suppose the conditions of (a) are true. Then
$$E\left[\left(\frac{S_n}{n} - m\right)^2\right] = \mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) = \frac{1}{n^2}\sum_i \sum_j \mathrm{Cov}(X_i, X_j) = \frac{1}{n^2}\sum_i \mathrm{Var}(X_i) \le \frac{c}{n}.$$
Therefore $\frac{S_n}{n} \xrightarrow{m.s.} m$.

Turn next to part (b). If in addition to the conditions of (b) it is assumed that $\mathrm{Var}(X_1) < +\infty$, then the conditions of part (a) are true. Since mean square convergence implies convergence in probability, the conclusion of part (b) follows. An extra credit problem shows how to use the same approach to verify (b) even if $\mathrm{Var}(X_1) = +\infty$.

Here a second approach to proving (b) is given. The characteristic function of $\frac{X_i}{n}$ is given by
$$E\left[\exp\left(\frac{juX_i}{n}\right)\right] = E\left[\exp\left(j\left(\frac{u}{n}\right)X_i\right)\right] = \Phi_X\left(\frac{u}{n}\right)$$
where $\Phi_X$ denotes the characteristic function of $X_1$. Since the characteristic function of the sum of independent random variables is the product of the characteristic functions,
$$\Phi_{\frac{S_n}{n}}(u) = \left(\Phi_X\left(\frac{u}{n}\right)\right)^n.$$
Since $E(X_1) = m$ it follows that $\Phi_X$ is differentiable with $\Phi_X(0) = 1$, $\Phi'_X(0) = jm$ and $\Phi'_X$ is continuous. By Taylor's theorem, for any $u$ fixed,
$$\Phi_X\left(\frac{u}{n}\right) = 1 + \frac{u\Phi'_X(u_n)}{n}$$
for some $u_n$ between 0 and $\frac{u}{n}$ for all $n$.
Since $\Phi'_X(u_n) \to jm$ as $n \to \infty$, Lemma 2.3.1 yields $\Phi_X(\frac{u}{n})^n \to \exp(jum)$ as $n \to \infty$. Note that $\exp(jum)$ is the characteristic function of a random variable equal to $m$ with probability one. Since pointwise convergence of characteristic functions to a valid characteristic function implies convergence in distribution, it follows that $\frac{S_n}{n} \xrightarrow{d.} m$. However, convergence in distribution to a constant implies convergence in probability, so (b) is proved.

Part (c) is proved under the additional assumption that $E[X_1^4] < +\infty$. Without loss of generality we assume that $E[X_1] = 0$. Consider expanding $S_n^4$. There are $n$ terms of the form $X_i^4$ and $3n(n-1)$ terms of the form $X_i^2 X_j^2$ with $1 \le i, j \le n$ and $i \ne j$. The other terms have the form $X_i^3 X_j$, $X_i^2 X_j X_k$ or $X_i X_j X_k X_l$ for distinct $i, j, k, l$, and these terms have mean zero. Thus,
$$E[S_n^4] = nE[X_1^4] + 3n(n-1)E[X_1^2]^2.$$
Let $Y = \sum_{n=1}^{\infty} (\frac{S_n}{n})^4$. The value of $Y$ is well defined but it is a priori possible that $Y(\omega) = +\infty$ for some $\omega$. However, by the monotone convergence theorem, the expectation of the sum of nonnegative random variables is the sum of the expectations, so that
$$E[Y] = \sum_{n=1}^{\infty} E\left[\left(\frac{S_n}{n}\right)^4\right] = \sum_{n=1}^{\infty} \frac{nE[X_1^4] + 3n(n-1)E[X_1^2]^2}{n^4} < +\infty.$$
Therefore, $P\{Y < +\infty\} = 1$. However, $\{Y < +\infty\}$ is a subset of the event of convergence $\{\omega : \frac{S_n(\omega)}{n} \to 0 \text{ as } n \to \infty\}$, so the event of convergence also has probability one. Thus, part (c) under the extra fourth moment condition is proved.

Proposition 2.3.3 (Central Limit Theorem) Suppose that $X_1, X_2, \ldots$ are i.i.d., each with mean $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then the normalized sum
$$\frac{S_n - n\mu}{\sqrt{n}}$$
converges in distribution to the $N(0, \sigma^2)$ distribution as $n \to \infty$.

Proof. Without loss of generality, assume that $\mu = 0$. Then the characteristic function of the normalized sum $\frac{S_n}{\sqrt{n}}$ is given by $\Phi_X(\frac{u}{\sqrt{n}})^n$, where $\Phi_X$ denotes the characteristic function of $X_1$.
Since $X_1$ has mean 0 and finite second moment $\sigma^2$, it follows that $\Phi_X$ is twice differentiable with $\Phi_X(0) = 1$, $\Phi_X'(0) = 0$, $\Phi_X''(0) = -\sigma^2$, and $\Phi_X''$ is continuous. By Taylor’s theorem, for any $u$ fixed,
\[
\Phi_X\left(\frac{u}{\sqrt{n}}\right) = 1 + \frac{u^2}{2n}\Phi_X''(u_n)
\]
for some $u_n$ between 0 and $\frac{u}{\sqrt{n}}$ for all $n$. Since $\Phi_X''(u_n) \to -\sigma^2$ as $n \to \infty$, Lemma 2.3.1 yields $\Phi_X(\frac{u}{\sqrt{n}})^n \to \exp(-\frac{u^2\sigma^2}{2})$ as $n \to \infty$. Since pointwise convergence of characteristic functions to a valid characteristic function implies convergence in distribution, the proposition is proved.

2.4 Convex functions and Jensen’s inequality

Let $\varphi$ be a function on $\mathbb{R}$ with values in $\mathbb{R} \cup \{+\infty\}$ such that $\varphi(x) < \infty$ for at least one value of $x$. Then $\varphi$ is said to be convex if for any $a$, $b$ and $\lambda$ with $a < b$ and $0 \le \lambda \le 1$
\[
\varphi(a\lambda + b(1-\lambda)) \le \lambda\varphi(a) + (1-\lambda)\varphi(b).
\]
This means that the graph of $\varphi$ on any interval $[a, b]$ lies on or below the line segment equal to $\varphi$ at the endpoints of the interval.

Proposition 2.4.1 Suppose $f$ is twice continuously differentiable on $\mathbb{R}$. Then $f$ is convex if and only if $f''(v) \ge 0$ for all $v$.

Proof. Suppose that $f$ is twice continuously differentiable. Given $a < b$ and $0 \le \lambda \le 1$, define $D_{ab} = \lambda f(a) + (1-\lambda)f(b) - f(\lambda a + (1-\lambda)b)$. For any $x \ge a$,
\[
f(x) = f(a) + \int_a^x f'(u)\,du = f(a) + (x-a)f'(a) + \int_a^x \int_a^u f''(v)\,dv\,du = f(a) + (x-a)f'(a) + \int_a^x (x-v)f''(v)\,dv.
\]
Applying this to $D_{ab}$ yields
\[
D_{ab} = \int_a^b \min\{\lambda(v-a), (1-\lambda)(b-v)\}\,f''(v)\,dv. \qquad (2.7)
\]
On one hand, if $f''(v) \ge 0$ for all $v$, then $D_{ab} \ge 0$ whenever $a < b$, so that $f$ is convex. On the other hand, if $f''(v_o) < 0$ for some $v_o$, then by the assumed continuity of $f''$, there is a small open interval $(a, b)$ containing $v_o$ so that $f''(v) < 0$ for $a < v < b$. But then $D_{ab} < 0$ for $0 < \lambda < 1$, implying that $f$ is not convex.

Examples of convex functions include: $ax^2 + bx + c$ for constants $a, b, c$ with $a \ge 0$; $e^{\lambda x}$ for $\lambda$ constant;
\[
\varphi(x) = \begin{cases} -\ln x & x > 0 \\ +\infty & x \le 0, \end{cases}
\qquad
\varphi(x) = \begin{cases} x\ln x & x > 0 \\ 0 & x = 0 \\ +\infty & x < 0. \end{cases}
\]
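As a quick numerical sanity check of the convexity definition and of the integral representation (2.7), the following sketch may help. It is only an illustration under stated assumptions: the grid points, the values of $\lambda$, the helper names (`convexity_gap`, `min_gap_on_grid`, `d_ab_integral`) and the midpoint-rule step count are choices made here, not part of the notes. It evaluates the gap $D_{ab}$ for the two piecewise examples above on points where they are finite, and compares $D_{ab}$ for $f(x) = e^x$ with a numerical approximation of the integral in (2.7).

```python
import math

def neg_log(x):
    """phi(x) = -ln x for x > 0, +infinity otherwise."""
    return -math.log(x) if x > 0 else math.inf

def x_log_x(x):
    """phi(x) = x ln x for x > 0, 0 at x = 0, +infinity for x < 0."""
    if x > 0:
        return x * math.log(x)
    return 0.0 if x == 0 else math.inf

def convexity_gap(phi, a, b, lam):
    """D_ab = lam*phi(a) + (1-lam)*phi(b) - phi(lam*a + (1-lam)*b)."""
    return lam * phi(a) + (1 - lam) * phi(b) - phi(lam * a + (1 - lam) * b)

def min_gap_on_grid(phi, points, lams):
    """Smallest gap over all pairs a < b in points and all lam in lams.
    The points are assumed to lie where phi is finite."""
    return min(convexity_gap(phi, a, b, lam)
               for a in points for b in points if a < b
               for lam in lams)

def d_ab_integral(f2, a, b, lam, steps=20000):
    """Midpoint-rule approximation of the right side of (2.7),
    where f2 is the second derivative of f."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        v = a + (k + 0.5) * h
        total += min(lam * (v - a), (1 - lam) * (b - v)) * f2(v)
    return total * h

lams = [0.0, 0.25, 0.5, 0.75, 1.0]
gap_neg_log = min_gap_on_grid(neg_log, [0.1, 0.5, 1.0, 2.0, 5.0], lams)
gap_x_log_x = min_gap_on_grid(x_log_x, [0.0, 0.1, 0.5, 1.0, 2.0, 5.0], lams)

# For f(x) = exp(x) the second derivative is again exp(x).
direct = convexity_gap(math.exp, 0.0, 2.0, 0.3)
via_27 = d_ab_integral(math.exp, 0.0, 2.0, 0.3)

print("min gap, -ln x  :", gap_neg_log)
print("min gap, x ln x :", gap_x_log_x)
print("D_ab direct vs (2.7):", direct, via_27)
```

A nonnegative minimum gap on a finite grid is of course only consistent with convexity, not a proof of it; the proposition above is what settles the question for twice continuously differentiable functions.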
Theorem 2.4.2 (Jensen’s inequality) Let $\varphi$ be a convex function and let $X$ be a random variable such that $E[X]$ is finite. Then $E[\varphi(X)] \ge \varphi(E[X])$.

For example, Jensen’s inequality implies that $E[X^2] \ge E[X]^2$, which also follows from the fact $\mathrm{Var}(X) = E[X^2] - E[X]^2$.

Proof. Since $\varphi$ is convex, there is a tangent to the graph of $\varphi$ at $E[X]$, meaning there is a function $L$ of the form $L(x) = a + bx$ such that $\varphi(x) \ge L(x)$ for all $x$ and $\varphi(E[X]) = L(E[X])$. See the illustration in Figure 2.12. Therefore $E[\varphi(X)] \ge E[L(X)] = L(E[X]) = \varphi(E[X])$, which establishes the theorem.

Figure 2.12: A convex function and a tangent linear function.

A function $\varphi$ is called concave if $-\varphi$ is convex. If $\varphi$ is concave then $E[\varphi(X)] \le \varphi(E[X])$.

2.5 Chernoff bound and large deviations theory

Let $X_1, X_2, \ldots$ be an iid sequence of random variables with finite mean $\mu$, and let $S_n = X_1 + \cdots + X_n$. The weak law of large numbers implies that for fixed $a$ with $a > \mu$, $P\{\frac{S_n}{n} \ge a\} \to 0$ as $n \to \infty$. In case the $X_i$’s have finite variance, the central limit theorem offers a refinement of the law of large numbers, by identifying the limit of $P\{\frac{S_n}{n} \ge a_n\}$, where $(a_n)$ is a sequence that converges to $\mu$ in the particular manner: $a_n = \mu + \frac{c}{\sqrt{n}}$. For fixed $c$, the limit is not zero. One can therefore think of the central limit theorem as concerning “normal” deviations of $S_n$ from its mean. Large deviations theory, by contrast, addresses $P\{\frac{S_n}{n} \ge a\}$ for $a$ fixed, and in particular it identifies how quickly $P\{\frac{S_n}{n} \ge a\}$ converges to zero as $n \to \infty$. We shall first describe the Chernoff bound, which is a simple upper bound on $P\{\frac{S_n}{n} \ge a\}$. Then Cramér’s theorem, to the effect that the Chernoff bound is in a certain sense tight, is stated.

The moment generating function of $X_1$ is defined by $M(\theta) = E[e^{\theta X_1}]$, and $\ln M(\theta)$ is called the log moment generating function.
Since $e^{\theta X_1}$ is a positive random variable, the expectation, and hence $M(\theta)$ itself, is well-defined for all real values of $\theta$, with possible value $+\infty$. The Chernoff bound is simply given as
\[
P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n[\theta a - \ln M(\theta)]) \quad \text{for } \theta \ge 0. \qquad (2.8)
\]
The bound (2.8), like the Chebychev inequality, is a consequence of Markov’s inequality applied to an appropriate function. For $\theta \ge 0$:
\[
P\left\{\frac{S_n}{n} \ge a\right\} \le P\{e^{\theta(X_1 + \cdots + X_n - na)} \ge 1\} \le E[e^{\theta(X_1 + \cdots + X_n - na)}] = E[e^{\theta X_1}]^n e^{-n\theta a} = \exp(-n[\theta a - \ln M(\theta)]).
\]
(The first inequality is an equality when $\theta > 0$.) To make the best use of the Chernoff bound we can optimize the bound by selecting the best $\theta$. Thus, we wish to select $\theta \ge 0$ to maximize $a\theta - \ln M(\theta)$.

In general the log moment generating function $\ln M$ is convex. Note that $\ln M(0) = 0$. Let us suppose that $M(\theta)$ is finite for some $\theta > 0$. Then
\[
\left.\frac{d \ln M(\theta)}{d\theta}\right|_{\theta=0} = \left.\frac{E[X_1 e^{\theta X_1}]}{E[e^{\theta X_1}]}\right|_{\theta=0} = E[X_1].
\]
The sketch of a typical case is shown in Figure 2.13. Figure 2.13 also shows the line of slope $a$.

Figure 2.13: A log moment generating function and a line of slope $a$.

Because of the assumption that $a > E[X_1]$, the line lies strictly above $\ln M(\theta)$ for small enough positive $\theta$ and below $\ln M(\theta)$ for all $\theta < 0$. Therefore, the maximum value of $\theta a - \ln M(\theta)$ over $\theta \ge 0$ is equal to $l(a)$, defined by
\[
l(a) = \sup_{-\infty < \theta < \infty} \{\theta a - \ln M(\theta)\}. \qquad (2.9)
\]
Thus, the Chernoff bound in its optimized form is
\[
P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n\,l(a)), \qquad a > E[X_1].
\]
There does not exist such a clean lower bound on the large deviation probability $P\{\frac{S_n}{n} \ge a\}$, but by the celebrated theorem of Cramér stated next, the Chernoff bound gives the right exponent.

Theorem 2.5.1 (Cramér’s theorem) Suppose $E[X_1]$ is finite, and that $E[X_1] < a$. Then for $\epsilon > 0$ there exists a number $n_\epsilon$ such that
\[
P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp(-n(l(a) + \epsilon)) \qquad (2.10)
\]
for all $n \ge n_\epsilon$.
Combining this bound with the Chernoff inequality yields
\[
\lim_{n \to \infty} \frac{1}{n} \ln P\left\{\frac{S_n}{n} \ge a\right\} = -l(a).
\]
In particular, if $l(a)$ is finite (equivalently if $P\{X_1 \ge a\} > 0$) then
\[
P\left\{\frac{S_n}{n} \ge a\right\} = \exp(-n(l(a) + \epsilon_n)),
\]
where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n \to \infty} \epsilon_n = 0$. Similarly, if $a < E[X_1]$ and $l(a)$ is finite, then
\[
P\left\{\frac{S_n}{n} \le a\right\} = \exp(-n(l(a) + \epsilon_n)),
\]
where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n \to \infty} \epsilon_n = 0$. Informally, we can write for $n$ large:
\[
P\left\{\frac{S_n}{n} \in da\right\} \approx e^{-n\,l(a)}\,da. \qquad (2.11)
\]

Proof. The lower bound (2.10) is proved here under the additional assumption that $X_1$ is a bounded random variable: $P\{|X_1| \le C\} = 1$ for some constant $C$; this assumption can be removed by a truncation argument covered in a homework problem. Also, to avoid trivialities, suppose $P\{X_1 > a\} > 0$. The assumption that $X_1$ is bounded and the monotone convergence theorem imply that the function $M(\theta)$ is finite and infinitely differentiable over $\theta \in \mathbb{R}$. Given $\theta \in \mathbb{R}$, let $P_\theta$ denote a new probability measure on the same probability space that $X_1, X_2, \ldots$ are defined on such that for any $n$ and any event of the form $\{(X_1, \ldots, X_n) \in B\}$,
\[
P_\theta\{(X_1, \ldots, X_n) \in B\} = \frac{E\left[I_{\{(X_1, \ldots, X_n) \in B\}}\, e^{\theta S_n}\right]}{M(\theta)^n}.
\]
In particular, if $X_i$ has pdf $f$ for each $i$ under the original probability measure $P$, then under the new probability measure $P_\theta$, each $X_i$ has pdf $f_\theta$ defined by $f_\theta(x) = \frac{f(x)e^{\theta x}}{M(\theta)}$, and the random variables $X_1, X_2, \ldots$ are independent under $P_\theta$. The pdf $f_\theta$ is called the tilted version of $f$ with parameter $\theta$, and $P_\theta$ is similarly called the tilted version of $P$ with parameter $\theta$. It is not difficult to show that the mean and variance of the $X_i$’s under $P_\theta$ are given by:
\[
E_\theta[X_1] = \frac{E\left[X_1 e^{\theta X_1}\right]}{M(\theta)} = (\ln M(\theta))'
\]
\[
\mathrm{Var}_\theta[X_1] = E_\theta[X_1^2] - E_\theta[X_1]^2 = (\ln M(\theta))''
\]
Under the assumptions we’ve made, $X_1$ has strictly positive variance under $P_\theta$ for all $\theta$, so that $\ln M(\theta)$ is strictly convex.
The assumption $P\{X_1 > a\} > 0$ implies that $(a\theta - \ln M(\theta)) \to -\infty$ as $\theta \to \infty$. Together with the fact that $\ln M(\theta)$ is differentiable and strictly convex, there thus exists a unique value $\theta^*$ of $\theta$ that maximizes $a\theta - \ln M(\theta)$. So $l(a) = a\theta^* - \ln M(\theta^*)$. Also, the derivative of $a\theta - \ln M(\theta)$ at $\theta = \theta^*$ is zero, so that
\[
E_{\theta^*}[X_1] = \left.(\ln M(\theta))'\right|_{\theta = \theta^*} = a.
\]
Observe that for any $b$ with $b > a$,
\begin{align*}
P\left\{\frac{S_n}{n} \ge a\right\} &= \int_{\{\omega : na \le S_n\}} 1\,dP \\
&= \int_{\{\omega : na \le S_n\}} M(\theta^*)^n e^{-\theta^* S_n}\,\frac{e^{\theta^* S_n}\,dP}{M(\theta^*)^n} \\
&= M(\theta^*)^n \int_{\{\omega : na \le S_n\}} e^{-\theta^* S_n}\,dP_{\theta^*} \\
&\ge M(\theta^*)^n \int_{\{\omega : na \le S_n \le nb\}} e^{-\theta^* S_n}\,dP_{\theta^*} \\
&\ge M(\theta^*)^n e^{-\theta^* nb}\,P_{\theta^*}\{na \le S_n \le nb\}.
\end{align*}
Now $M(\theta^*)^n e^{-\theta^* nb} = \exp(-n(l(a) + \theta^*(b-a)))$, and by the central limit theorem, $P_{\theta^*}\{na \le S_n \le nb\} \to \frac{1}{2}$ as $n \to \infty$, so $P_{\theta^*}\{na \le S_n \le nb\} \ge 1/3$ for $n$ large enough. Therefore, for $n$ large enough,
\[
P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp\left(-n\left(l(a) + \theta^*(b-a) + \frac{\ln 3}{n}\right)\right).
\]
Taking $b$ close enough to $a$ implies (2.10) for large enough $n$.

Example 2.5.2 Let $X_1, X_2, \ldots$ be independent and exponentially distributed with parameter $\lambda = 1$. Then
\[
\ln M(\theta) = \ln \int_0^\infty e^{\theta x} e^{-x}\,dx = \begin{cases} -\ln(1-\theta) & \theta < 1 \\ +\infty & \theta \ge 1. \end{cases}
\]
See Figure 2.14.

Figure 2.14: $\ln M(\theta)$ and $l(a)$ for an Exp(1) random variable.

Therefore, for any $a \in \mathbb{R}$,
\[
l(a) = \max_\theta \{a\theta - \ln M(\theta)\} = \max_{\theta < 1} \{a\theta + \ln(1-\theta)\}.
\]
If $a \le 0$ then $l(a) = +\infty$. On the other hand, if $a > 0$ then setting the derivative of $a\theta + \ln(1-\theta)$ to 0 yields the maximizing value $\theta = 1 - \frac{1}{a}$, and therefore
\[
l(a) = \begin{cases} a - 1 - \ln(a) & a > 0 \\ +\infty & a \le 0. \end{cases}
\]
The function $l$ is shown in Figure 2.14.

Example 2.5.3 Let $X_1, X_2, \ldots$ be independent Bernoulli random variables with parameter $p$ satisfying $0 < p < 1$. Thus $S_n$ has the binomial distribution. Then $\ln M(\theta) = \ln(pe^\theta + (1-p))$, which has asymptotic slope 1 as $\theta \to +\infty$ and converges to a constant as $\theta \to -\infty$. Therefore, $l(a) = +\infty$ if $a > 1$ or if $a < 0$.
For $0 \le a \le 1$, we find $a\theta - \ln M(\theta)$ is maximized by $\theta = \ln\left(\frac{a(1-p)}{p(1-a)}\right)$, leading to
\[
l(a) = \begin{cases} a\ln\left(\frac{a}{p}\right) + (1-a)\ln\left(\frac{1-a}{1-p}\right) & 0 \le a \le 1 \\ +\infty & \text{else.} \end{cases}
\]
See Figure 2.15.

Figure 2.15: $\ln M(\theta)$ and $l(a)$ for a Bernoulli distribution.

2.6 Problems

2.1 Limits and infinite sums for deterministic sequences
(a) Using the definition of a limit, show that $\lim_{\theta \to 0} \theta(1 + \cos(\theta)) = 0$.
(b) Using the definition of a limit, show that $\lim_{\theta \to 0, \theta > 0} \frac{1 + \cos(\theta)}{\theta} = +\infty$.
(c) Determine whether the following sum is finite, and justify your answer: $\sum_{n=1}^{\infty} \frac{1 + \sqrt{n}}{1 + n^2}$.

2.2 The limit of the product is the product of the limits
Consider two (deterministic) sequences with finite limits: $\lim_{n \to \infty} x_n = x$ and $\lim_{n \to \infty} y_n = y$.
(a) Prove that the sequence $(y_n)$ is bounded.
(b) Prove that $\lim_{n \to \infty} x_n y_n = xy$. (Hint: Note that $x_n y_n - xy = (x_n - x)y_n + x(y_n - y)$ and use part (a).)

2.3 The reciprocal of the limit is the limit of the reciprocal
Using the definition of convergence for deterministic sequences, prove that if $(x_n)$ is a sequence with a nonzero finite limit $x_\infty$, then the sequence $(1/x_n)$ converges to $1/x_\infty$.

2.4 Limits of some deterministic series
Determine which of the following series are convergent (i.e. have partial sums converging to a finite limit). Justify your answers.
(a) $\sum_{n=0}^{\infty} \frac{3^n}{n!}$
(b) $\sum_{n=1}^{\infty} \frac{(n+2)\ln n}{(n+5)^3}$
(c) $\sum_{n=1}^{\infty} \frac{1}{(\ln(n+1))^5}$.

2.5 On convergence of deterministic sequences and functions
(a) Let $x_n = \frac{8n^2 + n}{3n^2}$ for $n \ge 1$. Prove that $\lim_{n \to \infty} x_n = \frac{8}{3}$.
(b) Suppose $f_n$ is a function on some set $D$ for each $n \ge 1$, and suppose $f$ is also a function on $D$. Then $f_n$ is defined to converge to $f$ uniformly if for any $\epsilon > 0$, there exists an $n_\epsilon$ such that $|f_n(x) - f(x)| \le \epsilon$ for all $x \in D$ whenever $n \ge n_\epsilon$. A key point is that $n_\epsilon$ does not depend on $x$. Show that the functions $f_n(x) = x^n$ on the semi-open interval $[0, 1)$ do not converge uniformly to the zero function.
(c) The supremum of a function $f$ on $D$, written $\sup_D f$, is the least upper bound of $f$. Equivalently, $\sup_D f$ satisfies $\sup_D f \ge f(x)$ for all $x \in D$, and given any $c < \sup_D f$, there is an $x \in D$ such that $f(x) \ge c$. Show that $|\sup_D f - \sup_D g| \le \sup_D |f - g|$. Conclude that if $f_n$ converges to $f$ uniformly on $D$, then $\sup_D f_n$ converges to $\sup_D f$.

2.6 Convergence of sequences of random variables
Let $\Theta$ be uniformly distributed on the interval $[0, 2\pi]$. In which of the four senses (a.s., m.s., p., d.) does each of the following two sequences converge? Identify the limits, if they exist, and justify your answers.
(a) $(X_n : n \ge 1)$ defined by $X_n = \cos(n\Theta)$.
(b) $(Y_n : n \ge 1)$ defined by $Y_n = \left|1 - \frac{\Theta}{\pi}\right|^n$.

2.7 Convergence of random variables on (0,1]
Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. For the following two sequences of random variables on $(\Omega, \mathcal{F}, P)$, find and sketch the distribution function of $X_n$ for typical $n$, and decide in which sense(s) (if any) each of the two sequences converges.
(a) $X_n(\omega) = n\omega - \lfloor n\omega \rfloor$, where $\lfloor x \rfloor$ is the largest integer less than or equal to $x$.
(b) $X_n(\omega) = n^2\omega$ if $0 < \omega < 1/n$, and $X_n(\omega) = 0$ otherwise.

2.8 Convergence of random variables on (0,1], version 2
Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. Determine in which of the four senses (a.s., p., m.s., d.), if any, each of the following three sequences of random variables converges. Justify your answers.
(a) $X_n(\omega) = \frac{(-1)^n}{n\sqrt{\omega}}$.
(b) $X_n(\omega) = n\omega^n$.
(c) $X_n(\omega) = \omega \sin(2\pi n\omega)$. (Try at least for a heuristic justification.)

2.9 On the maximum of a random walk with negative drift
Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean $E[X_i] = -1$.
Let $S_0 = 0$, and for $n \ge 1$, let $S_n = X_1 + \cdots + X_n$. Let $Z = \max\{S_n : n \ge 0\}$.
(a) Show that $Z$ is well defined with probability one, and $P\{Z < +\infty\} = 1$.
(b) Does there exist a finite constant $L$, depending only on the above assumptions, such that $E[Z] \le L$? Justify your answer. (Hint: $Z \ge \max\{S_0, S_1\} = \max\{0, X_1\}$.)

2.10 Convergence of a sequence of discrete random variables
Let $X_n = X + (1/n)$ where $P[X = i] = 1/6$ for $i = 1, 2, 3, 4, 5$ or 6, and let $F_n$ denote the distribution function of $X_n$.
(a) For what values of $x$ does $F_n(x)$ converge to $F(x)$ as $n$ tends to infinity?
(b) At what values of $x$ is $F_X(x)$ continuous?
(c) Does the sequence $(X_n)$ converge in distribution to $X$?

2.11 Convergence in distribution to a nonrandom limit
Let $(X_n, n \ge 1)$ be a sequence of random variables and let $X$ be a random variable such that $P[X = c] = 1$ for some constant $c$. Prove that if $\lim_{n \to \infty} X_n = X$ d., then $\lim_{n \to \infty} X_n = X$ p. That is, prove that convergence in distribution to a constant implies convergence in probability to the same constant.

2.12 Convergence of a minimum
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly distributed over the interval $[0, 1]$, and let $X_n = \min\{U_1, \ldots, U_n\}$ for $n \ge 1$.
(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answers.
(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta X_n$ converges in distribution as $n \to \infty$ to a nonzero limit, and identify the limit distribution.

2.13 Convergence of a product
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly distributed over the interval $[0, 2]$, and let $X_n = U_1 U_2 \cdots U_n$ for $n \ge 1$.
(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answers.
(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta \ln(X_n)$ converges in distribution as $n \to \infty$ to a nonzero limit.

2.14 Limits of functions of random variables
Let $g$ and $h$ be functions defined as follows:
\[
g(x) = \begin{cases} -1 & \text{if } x \le -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x \ge 1 \end{cases}
\qquad
h(x) = \begin{cases} -1 & \text{if } x \le 0 \\ 1 & \text{if } x > 0. \end{cases}
\]
Thus, $g$ represents a clipper and $h$ represents a hard limiter. Suppose that $(X_n : n \ge 0)$ is a sequence of random variables, and that $X$ is also a random variable, all on the same underlying probability space. Give a yes or no answer to each of the four questions below. For each yes answer, identify the limit and give a justification. For each no answer, give a counterexample.
(a) If $\lim_{n \to \infty} X_n = X$ a.s., then does $\lim_{n \to \infty} g(X_n)$ a.s. necessarily exist?
(b) If $\lim_{n \to \infty} X_n = X$ m.s., then does $\lim_{n \to \infty} g(X_n)$ m.s. necessarily exist?
(c) If $\lim_{n \to \infty} X_n = X$ a.s., then does $\lim_{n \to \infty} h(X_n)$ a.s. necessarily exist?
(d) If $\lim_{n \to \infty} X_n = X$ m.s., then does $\lim_{n \to \infty} h(X_n)$ m.s. necessarily exist?

2.15 Sums of i.i.d. random variables, I
A gambler repeatedly plays the following game: She bets one dollar and then there are three possible outcomes: she wins two dollars back with probability 0.4, she gets just the one dollar back with probability 0.1, and otherwise she gets nothing back. Roughly what is the probability that she is ahead after playing the game one hundred times?

2.16 Sums of i.i.d. random variables, II
Let $X_1, X_2, \ldots$ be independent random variables with $P[X_i = 1] = P[X_i = -1] = 0.5$.
(a) Compute the characteristic function of the following random variables: $X_1$, $S_n = X_1 + \cdots + X_n$, and $V_n = S_n/\sqrt{n}$.
(b) Find the pointwise limits of the characteristic functions of $S_n$ and $V_n$ as $n \to \infty$.
(c) In what sense(s), if any, do the sequences $(S_n)$ and $(V_n)$ converge?

2.17 Sums of i.i.d. random variables, III
Fix $\lambda > 0$. For each integer $n > \lambda$, let $X_{1,n}, X_{2,n}, \ldots$
, $X_{n,n}$ be independent random variables such that $P[X_{i,n} = 1] = \lambda/n$ and $P[X_{i,n} = 0] = 1 - (\lambda/n)$. Let $Y_n = X_{1,n} + X_{2,n} + \cdots + X_{n,n}$.
(a) Compute the characteristic function of $Y_n$ for each $n$.
(b) Find the pointwise limit of the characteristic functions as $n \to \infty$. The limit is the characteristic function of what probability distribution?
(c) In what sense(s), if any, does the sequence $(Y_n)$ converge?

2.18 On the growth of the maximum of n independent exponentials
Suppose that $X_1, X_2, \ldots$ are independent random variables, each with the exponential distribution with parameter $\lambda = 1$. For $n \ge 2$, let $Z_n = \frac{\max\{X_1, \ldots, X_n\}}{\ln(n)}$.
(a) Find a simple expression for the CDF of $Z_n$.
(b) Show that $(Z_n)$ converges in distribution to a constant, and find the constant. (Note: It follows immediately that $Z_n$ converges in p. to the same constant. It can also be shown that $(Z_n)$ converges in the a.s. and m.s. senses to the same constant.)

2.19 Normal approximation for quantization error
Suppose each of 100 real numbers is rounded to the nearest integer and then added. Assume the individual roundoff errors are independent and uniformly distributed over the interval $[-0.5, 0.5]$. Using the normal approximation suggested by the central limit theorem, find the approximate probability that the absolute value of the sum of the errors is greater than 5.

2.20 Limit behavior of a stochastic dynamical system
Let $W_1, W_2, \ldots$ be a sequence of independent, $N(0, 0.5)$ random variables. Let $X_0 = 0$, and define $X_1, X_2, \ldots$ recursively by $X_{k+1} = X_k^2 + W_k$. Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answer.

2.21 Applications of Jensen’s inequality
Explain how each of the inequalities below follows from Jensen’s inequality. Specifically, identify the convex function and random variable used.
(a) $E[\frac{1}{X}] \ge \frac{1}{E[X]}$, for a positive random variable $X$ with finite mean.
(b) $E[X^4] \ge E[X^2]^2$, for a random variable $X$ with finite second moment.
(c) $D(f|g) \ge 0$, where $f$ and $g$ are positive probability densities on a set $A$, and $D$ is the divergence distance defined by $D(f|g) = \int_A f(x) \ln \frac{f(x)}{g(x)}\,dx$. (The base used in the logarithm is not relevant.)

2.22 Convergence analysis of successive averaging
Let $U_1, U_2, \ldots$ be independent random variables, each uniformly distributed on the interval $[0, 1]$. Let $X_0 = 0$ and $X_1 = 1$, and for $n \ge 1$ let $X_{n+1} = (1 - U_n)X_n + U_n X_{n-1}$. Note that given $X_{n-1}$ and $X_n$, the variable $X_{n+1}$ is uniformly distributed on the interval with endpoints $X_{n-1}$ and $X_n$.
(a) Sketch a typical sample realization of the first few variables in the sequence.
(b) Find $E[X_n]$ for all $n$.
(c) Show that $X_n$ converges in the a.s. sense as $n$ goes to infinity. Explain your reasoning. (Hint: Let $D_n = |X_n - X_{n-1}|$. Then $D_{n+1} = U_n D_n$, and if $m > n$ then $|X_m - X_n| \le D_n$.)

2.23 Understanding the Markov inequality
Suppose $X$ is a random variable with $E[X^4] = 30$.
(a) Derive an upper bound on $P[|X| \ge 10]$. Show your work.
(b) (Your bound in (a) must be the best possible in order to get both parts (a) and (b) correct.) Find a distribution for $X$ such that the bound you found in part (a) holds with equality.

2.24 Mean square convergence of a random series
The sum of infinitely many random variables, $X_1 + X_2 + \cdots$, is defined as the limit as $n$ tends to infinity of the partial sums $X_1 + X_2 + \cdots + X_n$. The limit can be taken in the usual senses (in probability, in distribution, etc.). Suppose that the $X_i$ are mutually independent with mean zero. Show that $X_1 + X_2 + \cdots$ exists in the mean square sense if and only if the sum of the variances, $\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots$, is finite. (Hint: Apply the Cauchy criteria for mean square convergence.)

2.25 Portfolio allocation
Suppose that you are given one unit of money (for example, a million dollars).
Each day you bet a fraction $\alpha$ of it on a coin toss. If you win, you get double your money back, whereas if you lose, you get half of your money back. Let $W_n$ denote the wealth you have accumulated (or have left) after $n$ days. Identify in what sense(s) the limit $\lim_{n \to \infty} W_n$ exists, and when it does, identify the value of the limit
(a) for $\alpha = 0$ (pure banking),
(b) for $\alpha = 1$ (pure betting),
(c) for general $\alpha$.
(d) What value of $\alpha$ maximizes the expected wealth, $E[W_n]$? Would you recommend using that value of $\alpha$?
(e) What value of $\alpha$ maximizes the long term growth rate of $W_n$? (Hint: Consider $\ln(W_n)$ and apply the LLN.)

2.26 A large deviation
Let $X_1, X_2, \ldots$ be independent, $N(0, 1)$ random variables. Find the constant $b$ such that
\[
P\{X_1^2 + X_2^2 + \cdots + X_n^2 \ge 2n\} = \exp(-n(b + \epsilon_n))
\]
where $\epsilon_n \to 0$ as $n \to \infty$. What is the numerical value of the approximation $\exp(-nb)$ if $n = 100$?

2.27 Sums of independent Cauchy random variables
Let $X_1, X_2, \ldots$ be independent, each with the standard Cauchy density function. The standard Cauchy density and its characteristic function are given by $f(x) = \frac{1}{\pi(1 + x^2)}$ and $\Phi(u) = \exp(-|u|)$. Let $S_n = X_1 + X_2 + \cdots + X_n$.
(a) Find the characteristic function of $\frac{S_n}{n^\theta}$ for a constant $\theta$.
(b) Does $\frac{S_n}{n}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes, identify the limiting distribution.
(c) Does $\frac{S_n}{n^2}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes, identify the limiting distribution.
(d) Does $\frac{S_n}{\sqrt{n}}$ converge in distribution as $n \to \infty$? Justify your answer, and if the answer is yes, identify the limiting distribution.

2.28 A rapprochement between the central limit theorem and large deviations
Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean zero, variance $\sigma^2$, and probability density function $f$. Suppose the moment generating function $M(\theta)$ is finite for $\theta$ in an open interval $I$ containing zero.
(a) Show that for $\theta \in I$, $(\ln M(\theta))''$ is the variance for the “tilted” density function $f_\theta$ defined by $f_\theta(x) = f(x)\exp(\theta x - \ln M(\theta))$. In particular, since $(\ln M(\theta))''$ is nonnegative, $\ln M$ is a convex function. (The interchange of expectation and differentiation with respect to $\theta$ can be justified for $\theta \in I$. You needn’t give details.)
Let $b > 0$ and let $S_n = X_1 + \cdots + X_n$ for $n$ any positive integer. By the central limit theorem, $P[S_n \ge b\sqrt{n}] \to Q(b/\sigma)$ as $n \to \infty$. An upper bound on the $Q$ function is given by
\[
Q(u) = \int_u^\infty \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds \le \int_u^\infty \frac{s}{u\sqrt{2\pi}} e^{-s^2/2}\,ds = \frac{1}{u\sqrt{2\pi}} e^{-u^2/2}.
\]
This bound is a good approximation if $u$ is moderately large. Thus, $Q(b/\sigma) \approx \frac{\sigma}{b\sqrt{2\pi}} e^{-b^2/2\sigma^2}$ if $b/\sigma$ is moderately large.
(b) The large deviations upper bound yields $P[S_n \ge b\sqrt{n}] \le \exp(-n\,l(b/\sqrt{n}))$. Identify the limit of the large deviations upper bound as $n \to \infty$, and compare with the approximation given by the central limit theorem. (Hint: Approximate $\ln M$ near zero by its second order Taylor’s approximation.)

2.29 Chernoff bound for Gaussian and Poisson random variables
(a) Let $X$ have the $N(\mu, \sigma^2)$ distribution. Find the optimized Chernoff bound on $P\{X \ge E[X] + c\}$ for $c \ge 0$.
(b) Let $Y$ have the $Poi(\lambda)$ distribution. Find the optimized Chernoff bound on $P\{Y \ge E[Y] + c\}$ for $c \ge 0$.
(c) (The purpose of this problem is to highlight the similarity of the answers to parts (a) and (b).) Show that your answer to part (b) can be expressed as $P\{Y \ge E[Y] + c\} \le \exp(-\frac{c^2}{2\lambda}\psi(\frac{c}{\lambda}))$ for $c \ge 0$, where $\psi(u) = 2g(1 + u)/u^2$, with $g(s) = s(\ln s - 1) + 1$. (Note: $Y$ has variance $\lambda$, so the essential difference between the normal and Poisson bounds is the $\psi$ term. The function $\psi$ is strictly positive and strictly decreasing on the interval $[-1, +\infty)$, with $\psi(-1) = 2$ and $\psi(0) = 1$. Also, $u\psi(u)$ is strictly increasing in $u$ over the interval $[-1, +\infty)$.)

2.30 Large deviations of a mixed sum
Let $X_1, X_2, \ldots$
have the Exp(1) distribution, and $Y_1, Y_2, \ldots$ have the Poi(1) distribution. Suppose all these random variables are mutually independent. Let $0 \le f \le 1$, and suppose $S_n = X_1 + \cdots + X_{nf} + Y_1 + \cdots + Y_{(1-f)n}$. Define $l(f, a) = \lim_{n \to \infty} \frac{1}{n} \ln P\{\frac{S_n}{n} \ge a\}$ for $a > 1$. Cramér’s theorem can be extended to show that $l(f, a)$ can be computed by replacing the probability $P\{\frac{S_n}{n} \ge a\}$ by its optimized Chernoff bound. (For example, if $f = 1/2$, we simply view $S_n$ as the sum of the $\frac{n}{2}$ i.i.d. random variables, $X_1 + Y_1, \ldots, X_{n/2} + Y_{n/2}$.) Compute $l(f, a)$ for $f \in \{0, \frac{1}{3}, \frac{2}{3}, 1\}$ and $a = 4$.

2.31 Large deviation exponent for a mixture distribution
Problem 2.30 concerns an example such that $0 < f < 1$ and $S_n$ is the sum of $n$ independent random variables, such that a fraction $f$ of the random variables have a CDF $F_Y$ and a fraction $1 - f$ have a CDF $F_Z$. It is shown in the solutions that the large deviations exponent for $\frac{S_n}{n}$ is given by:
\[
l(a) = \max_\theta \{\theta a - fM_Y(\theta) - (1-f)M_Z(\theta)\}
\]
where $M_Y(\theta)$ and $M_Z(\theta)$ are the log moment generating functions for $F_Y$ and $F_Z$ respectively.
Consider the following variation. Let $X_1, X_2, \ldots, X_n$ be independent, and identically distributed, each with CDF given by $F_X(c) = fF_Y(c) + (1-f)F_Z(c)$. Equivalently, each $X_i$ can be generated by flipping a biased coin with probability of heads equal to $f$, and generating $X_i$ using CDF $F_Y$ if heads shows and generating $X_i$ with CDF $F_Z$ if tails shows. Let $\tilde{S}_n = X_1 + \cdots + X_n$, and let $\tilde{l}$ denote the large deviations exponent for $\frac{\tilde{S}_n}{n}$.
(a) Express the function $\tilde{l}$ in terms of $f$, $M_Y$, and $M_Z$.
(b) Determine which is true and give a proof: $\tilde{l}(a) \le l(a)$ for all $a$, or $\tilde{l}(a) \ge l(a)$ for all $a$. Can you also offer an intuitive explanation?

2.32 The limit of a sum of cumulative products of a sequence of uniform random variables
Let $A_1, A_2, \ldots$ be a sequence of independent random variables, with $P[A_i = 1] = P[A_i = \frac{1}{2}] = \frac{1}{2}$ for all $i$. Let $B_k = A_1 \cdots A_k$.
(a) Does $\lim_{k \to \infty} B_k$ exist in the m.s. sense? Justify your answer.
(b) Does $\lim_{k \to \infty} B_k$ exist in the a.s. sense? Justify your answer.
(c) Let $S_n = B_1 + \cdots + B_n$. Show that $\lim_{m,n \to \infty} E[S_m S_n] = \frac{35}{3}$, which implies that $\lim_{n \to \infty} S_n$ exists in the m.s. sense.
(d) Find the mean and variance of the limit random variable.
(e) Does $\lim_{n \to \infty} S_n$ exist in the a.s. sense? Justify your answer.

2.33 * Distance measures (metrics) for random variables
For random variables $X$ and $Y$, define
\[
d_1(X, Y) = E\left[\frac{|X - Y|}{1 + |X - Y|}\right]
\]
\[
d_2(X, Y) = \min\{\epsilon \ge 0 : F_X(x + \epsilon) + \epsilon \ge F_Y(x) \text{ and } F_Y(x + \epsilon) + \epsilon \ge F_X(x) \text{ for all } x\}
\]
\[
d_3(X, Y) = (E[(X - Y)^2])^{1/2},
\]
where in defining $d_3(X, Y)$ it is assumed that $E[X^2]$ and $E[Y^2]$ are finite.
(a) Show that $d_i$ is a metric for $i = 1, 2$ or 3. Clearly $d_i(X, X) = 0$ and $d_i(X, Y) = d_i(Y, X)$. Verify in addition the triangle inequality. (The only other requirement of a metric is that $d_i(X, Y) = 0$ only if $X = Y$. For this to be true we must think of the metric as being defined on equivalence classes of random variables.)
(b) Let $X_1, X_2, \ldots$ be a sequence of random variables and let $Y$ be a random variable. Show that $X_n$ converges to $Y$
(i) in probability if and only if $d_1(X_n, Y)$ converges to zero,
(ii) in distribution if and only if $d_2(X_n, Y)$ converges to zero,
(iii) in the mean square sense if and only if $d_3(X_n, Y)$ converges to zero (assume $E[Y^2] < \infty$).
(Hint for (i): It helps to establish that
\[
d_1(X, Y) - \frac{\epsilon}{1 + \epsilon} \le P\{|X - Y| \ge \epsilon\} \le \frac{d_1(X, Y)(1 + \epsilon)}{\epsilon}.
\]
The “only if” part of (ii) is a little tricky. The metric $d_2$ is called the Lévy metric.)

2.34 * Weak Law of Large Numbers
Let $X_1, X_2, \ldots$ be a sequence of random variables which are independent and identically distributed. Assume that $E[X_i]$ exists and is equal to zero for all $i$. If $\mathrm{Var}(X_i)$ is finite, then Chebychev’s inequality easily establishes that $(X_1 + \cdots + X_n)/n$ converges in probability to zero.
Taking that result as a starting point, show that the convergence still holds even if $\mathrm{Var}(X_i)$ is infinite. (Hint: Use “truncation” by defining $U_k = X_k I\{|X_k| \ge c\}$ and $V_k = X_k I\{|X_k| < c\}$ for some constant $c$. $E[|U_k|]$ and $E[V_k]$ don’t depend on $k$ and converge to zero as $c$ tends to infinity. You might also find the previous problem helpful.)

2.35 * Completing the proof of Cramér’s theorem
Prove Theorem 2.5.1 without the assumption that the random variables are bounded. To begin, select a large constant $C$ and let $\tilde{X}_i$ denote a random variable with the conditional distribution of $X_i$ given that $|X_i| \le C$. Let $\tilde{S}_n = \tilde{X}_1 + \cdots + \tilde{X}_n$ and let $\tilde{l}$ denote the large deviations exponent for $\tilde{X}_i$. Then
\[
P\left\{\frac{S_n}{n} \ge a\right\} \ge P\{|X_1| \le C\}^n\, P\left\{\frac{\tilde{S}_n}{n} \ge a\right\}.
\]
One step is to show that $\tilde{l}(a)$ converges to $l(a)$ as $C \to \infty$. It is equivalent to showing that if a pointwise monotonically increasing sequence of convex functions converges pointwise to a nonnegative convex function, then the minima of the convex functions converge to a nonnegative value.

Chapter 3
Random Vectors and Minimum Mean Squared Error Estimation

The reader is encouraged to review the section on matrices in the appendix before reading this chapter.

3.1 Basic definitions and properties

A random vector $X$ of dimension $m$ has the form
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}
\]
where the $X_i$’s are random variables all on the same probability space. The expectation of $X$ (also called the mean of $X$) is the vector $EX$ (or $E[X]$) defined by
\[
EX = \begin{pmatrix} EX_1 \\ EX_2 \\ \vdots \\ EX_m \end{pmatrix}.
\]
Suppose $Y$ is another random vector on the same probability space as $X$, with dimension $n$. The cross correlation matrix of $X$ and $Y$ is the $m \times n$ matrix $E[XY^T]$, which has $ij$th entry $E[X_i Y_j]$. The cross covariance matrix of $X$ and $Y$, denoted by $\mathrm{Cov}(X, Y)$, is the matrix with $ij$th entry $\mathrm{Cov}(X_i, Y_j)$.
Note that the correlation matrix is the matrix of correlations, and the covariance matrix is the matrix of covariances.

In the particular case that $n = m$ and $Y = X$, the cross correlation matrix of $X$ with itself is simply called the correlation matrix of $X$, and is written as $E[XX^T]$, and it has $ij$th entry $E[X_i X_j]$. The cross covariance matrix of $X$ with itself, $\mathrm{Cov}(X, X)$, has $ij$th entry $\mathrm{Cov}(X_i, X_j)$. This matrix is called the covariance matrix of $X$, and it is also denoted by $\mathrm{Cov}(X)$. So the notations $\mathrm{Cov}(X)$ and $\mathrm{Cov}(X, X)$ are interchangeable. While the notation $\mathrm{Cov}(X)$ is more concise, the notation $\mathrm{Cov}(X, X)$ is more suggestive of the way the covariance matrix scales when $X$ is multiplied by a constant.

Elementary properties of expectation, correlation, and covariance for vectors follow immediately from similar properties for ordinary scalar random variables. These properties include the following (here $A$ and $C$ are nonrandom matrices and $b$ and $d$ are nonrandom vectors).

1. $E[AX + b] = AE[X] + b$
2. $\mathrm{Cov}(X, Y) = E[X(Y - EY)^T] = E[(X - EX)Y^T] = E[XY^T] - (EX)(EY)^T$
3. $E[(AX)(CY)^T] = AE[XY^T]C^T$
4. $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^T$
5. $\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T$
6. $\mathrm{Cov}(W + X, Y + Z) = \mathrm{Cov}(W, Y) + \mathrm{Cov}(W, Z) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)$

In particular, the second property above shows the close connection between correlation matrices and covariance matrices: if the mean vector of either $X$ or $Y$ is zero, then the cross correlation and cross covariance matrices are equal.

Not every square matrix is a correlation matrix. For example, the diagonal elements must be nonnegative. Also, Schwarz’s inequality (see Section 1.8) must be respected, so that $|\mathrm{Cov}(X_i, X_j)| \le \sqrt{\mathrm{Cov}(X_i, X_i)\,\mathrm{Cov}(X_j, X_j)}$. Additional inequalities arise for consideration of three or more random variables at a time.
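The scaling property $\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T$ and the Schwarz constraint can be checked on simulated data. The sketch below is only an illustration: the sample size, the particular matrix `A`, vector `b`, the three component distributions and all helper names are assumptions made here. It uses empirical covariances with $1/n$ normalization, for which the same identities hold exactly (up to floating point rounding), since they are the covariances of the empirical distribution of the sample.

```python
import random

def mean_vec(samples):
    """Componentwise sample mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def cov(xs, ys):
    """Empirical cross covariance matrix with (i, j) entry Cov(X_i, Y_j)."""
    n = len(xs)
    mx, my = mean_vec(xs), mean_vec(ys)
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def affine(A, b, x):
    """Return the vector A x + b."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [[rng.gauss(0, 1), rng.uniform(-1, 1), rng.expovariate(1.0)]
      for _ in range(500)]

A = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]  # hypothetical 2 x 3 matrix
b = [4.0, -2.0]                          # hypothetical shift vector
zs = [affine(A, b, x) for x in xs]

K = cov(xs, xs)
lhs = cov(zs, zs)                         # Cov(AX + b)
rhs = matmul(matmul(A, K), transpose(A))  # A Cov(X) A^T
max_err = max(abs(l - r) for lrow, rrow in zip(lhs, rhs)
              for l, r in zip(lrow, rrow))

# Schwarz: Cov(X_i, X_j)^2 <= Cov(X_i, X_i) Cov(X_j, X_j) for every pair.
schwarz_ok = all(K[i][j] ** 2 <= K[i][i] * K[j][j] + 1e-12
                 for i in range(3) for j in range(3))

print("max |Cov(AX+b) - A Cov(X) A^T| entry:", max_err)
print("Schwarz inequality holds for all pairs:", schwarz_ok)
```

Note that the shift `b` drops out of the covariance, as property 5 predicts.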
Of course a square diagonal matrix is a correlation matrix if and only if its diagonal entries are nonnegative, because only vectors with independent entries need be considered. But if an $m \times m$ matrix is not diagonal, it is not a priori clear whether there are $m$ random variables with all $m(m + 1)/2$ correlations matching the entries of the matrix. The following proposition neatly resolves these issues.

Proposition 3.1.1 Correlation matrices and covariance matrices are positive semidefinite. Conversely, if $K$ is a positive semidefinite matrix, then $K$ is the covariance matrix and correlation matrix for some mean zero random vector $X$.

Proof. If $K$ is a correlation matrix, then $K = E[XX^T]$ for some random vector $X$. Given any vector $\alpha$, $\alpha^T X$ is a scalar random variable, so
\[
\alpha^T K \alpha = E[\alpha^T XX^T \alpha] = E[(\alpha^T X)(X^T \alpha)] = E[(\alpha^T X)^2] \geq 0.
\]
Similarly, if $K = \mathrm{Cov}(X, X)$ then for any vector $\alpha$,
\[
\alpha^T K \alpha = \alpha^T \mathrm{Cov}(X, X)\alpha = \mathrm{Cov}(\alpha^T X, \alpha^T X) = \mathrm{Var}(\alpha^T X) \geq 0.
\]
The first part of the proposition is proved.

For the converse part, suppose that $K$ is an arbitrary symmetric positive semidefinite matrix. Let $\lambda_1, \ldots, \lambda_m$ and $U$ be the corresponding set of eigenvalues and orthonormal matrix formed by the eigenvectors. (See Section 11.7 in the appendix.) Let $Y_1, \ldots, Y_m$ be independent, mean 0 random variables with $\mathrm{Var}(Y_i) = \lambda_i$, and let $Y$ be the random vector $Y = (Y_1, \ldots, Y_m)^T$. Then $\mathrm{Cov}(Y, Y) = \Lambda$, where $\Lambda$ is the diagonal matrix with the $\lambda_i$’s on the diagonal. Let $X = UY$. Then $EX = 0$ and
\[
\mathrm{Cov}(X, X) = \mathrm{Cov}(UY, UY) = U\Lambda U^T = K.
\]
Therefore, $K$ is both the covariance matrix and the correlation matrix of $X$.

The characteristic function $\Phi_X$ of $X$ is the function on $\mathbb{R}^m$ defined by
\[
\Phi_X(u) = E[\exp(ju^T X)].
\]

3.2 The orthogonality principle for minimum mean square error estimation

Let $X$ be a random variable with some known distribution. Suppose $X$ is not observed but that we wish to estimate $X$.
If we use a constant $b$ to estimate $X$, the estimation error will be $X - b$. The mean square error (MSE) is $E[(X - b)^2]$. Since $E[X - EX] = 0$ and $EX - b$ is constant,
\[
\begin{aligned}
E[(X - b)^2] &= E[((X - EX) + (EX - b))^2] \\
&= E[(X - EX)^2 + 2(X - EX)(EX - b) + (EX - b)^2] \\
&= \mathrm{Var}(X) + (EX - b)^2.
\end{aligned}
\]
From this expression it is easy to see that the mean square error is minimized with respect to $b$ if and only if $b = EX$. The minimum possible value is $\mathrm{Var}(X)$.

Random variables $X$ and $Y$ are called orthogonal if $E[XY] = 0$. Orthogonality is denoted by “$X \perp Y$.”

The essential fact $E[X - EX] = 0$ is equivalent to the following condition: $X - EX$ is orthogonal to constants: $(X - EX) \perp c$ for any constant $c$. Therefore, the choice of constant $b$ yielding the minimum mean square error is the one that makes the error $X - b$ orthogonal to all constants. This result is generalized by the orthogonality principle, stated next.

Fix some probability space and let $L^2(\Omega, \mathcal{F}, P)$ be the set of all random variables on the probability space with finite second moments. Let $X$ be a random variable in $L^2(\Omega, \mathcal{F}, P)$, and let $\mathcal{V}$ be a collection of random variables on the same probability space as $X$ such that

V.1 $\mathcal{V} \subset L^2(\Omega, \mathcal{F}, P)$
V.2 $\mathcal{V}$ is a linear class: If $Z_1 \in \mathcal{V}$ and $Z_2 \in \mathcal{V}$ and $a_1, a_2$ are constants, then $a_1Z_1 + a_2Z_2 \in \mathcal{V}$
V.3 $\mathcal{V}$ is closed in the mean square sense: If $Z_1, Z_2, \ldots$ is a sequence of elements of $\mathcal{V}$ and if $Z_n \to Z_\infty$ m.s. for some random variable $Z_\infty$, then $Z_\infty \in \mathcal{V}$.

That is, $\mathcal{V}$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$. The problem of interest is to find $Z^*$ in $\mathcal{V}$ to minimize the mean square error, $E[(X - Z)^2]$, over all $Z \in \mathcal{V}$. That is, $Z^*$ is the random variable in $\mathcal{V}$ that is closest to $X$ in the minimum mean square error (MMSE) sense. We call it the projection of $X$ onto $\mathcal{V}$ and denote it as $\Pi_{\mathcal{V}}(X)$.
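The constant-estimator computation at the start of this section can be checked on data. The sketch below is illustrative only: it uses simulated samples in place of the distribution of $X$ (the exponential distribution and the grid of candidate constants are arbitrary assumptions), and it relies on the fact that the decomposition $E[(X-b)^2] = \mathrm{Var}(X) + (EX-b)^2$ holds exactly for empirical averages.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)   # samples standing in for X

def mse(b):
    """Empirical mean square error of the constant estimator b."""
    return np.mean((x - b) ** 2)

mean, var = x.mean(), x.var()                 # empirical EX and Var(X)

# Candidate constants on a grid of spacing 0.01 centered at the empirical mean.
candidates = np.linspace(mean - 2.0, mean + 2.0, 401)
errors = np.array([mse(b) for b in candidates])

# E[(X - b)^2] = Var(X) + (EX - b)^2 for every candidate b.
decomposition_ok = np.allclose(errors, var + (mean - candidates) ** 2)

# The minimizing constant on the grid is the grid point closest to the mean.
best_b = candidates[np.argmin(errors)]
```

In the language of projections, `best_b` approximates the projection of $X$ onto the set of constant random variables.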
Estimating a random variable by a constant corresponds to the case that $\mathcal{V}$ is the set of constant random variables: the projection of a random variable $X$ onto the set of constant random variables is $EX$. The orthogonality principle stated next is illustrated in Figure 3.1.

[Figure 3.1: Illustration of the orthogonality principle.]

Theorem 3.2.1 (The orthogonality principle) Let $\mathcal{V}$ be a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$, and let $X \in L^2(\Omega, \mathcal{F}, P)$, for some probability space $(\Omega, \mathcal{F}, P)$.

(a) (Existence and uniqueness) There exists a unique element $Z^*$ (also denoted by $\Pi_{\mathcal{V}}(X)$) in $\mathcal{V}$ so that $E[(X - Z^*)^2] \leq E[(X - Z)^2]$ for all $Z \in \mathcal{V}$. (Here, we consider two elements $Z$ and $Z'$ of $\mathcal{V}$ to be the same if $P\{Z = Z'\} = 1$.)

(b) (Characterization) Let $W$ be a random variable. Then $W = Z^*$ if and only if the following two conditions hold:
(i) $W \in \mathcal{V}$
(ii) $(X - W) \perp Z$ for all $Z$ in $\mathcal{V}$.

(c) (Error expression) The minimum mean square error (MMSE) is given by $E[(X - Z^*)^2] = E[X^2] - E[(Z^*)^2]$.

Proof. The proof of part (a) is given in an extra credit homework problem. The technical condition V.3 on $\mathcal{V}$ is essential for the proof of existence. Here parts (b) and (c) are proved.

To establish the “if” half of part (b), suppose $W$ satisfies (i) and (ii) and let $Z$ be an arbitrary element of $\mathcal{V}$. Then $W - Z \in \mathcal{V}$ because $\mathcal{V}$ is a linear class. Therefore, $(X - W) \perp (W - Z)$, which implies that
\[
\begin{aligned}
E[(X - Z)^2] &= E[(X - W + W - Z)^2] \\
&= E[(X - W)^2 + 2(X - W)(W - Z) + (W - Z)^2] \\
&= E[(X - W)^2] + E[(W - Z)^2].
\end{aligned}
\]
Thus $E[(X - W)^2] \leq E[(X - Z)^2]$. Since $Z$ is an arbitrary element of $\mathcal{V}$, it follows that $W = Z^*$, and the “if” half of (b) is proved.

To establish the “only if” half of part (b), note that $Z^* \in \mathcal{V}$ by the definition of $Z^*$. Let $Z \in \mathcal{V}$ and let $c \in \mathbb{R}$. Then $Z^* + cZ \in \mathcal{V}$, so that $E[(X - (Z^* + cZ))^2] \geq E[(X - Z^*)^2]$.
But
\[
E[(X - (Z^* + cZ))^2] = E[((X - Z^*) - cZ)^2] = E[(X - Z^*)^2] - 2cE[(X - Z^*)Z] + c^2E[Z^2],
\]
so that
\[
-2cE[(X - Z^*)Z] + c^2E[Z^2] \geq 0. \qquad (3.1)
\]
As a function of $c$ the left side of (3.1) is a parabola with value zero at $c = 0$. Hence its derivative with respect to $c$ at 0 must be zero, which yields that $(X - Z^*) \perp Z$. The “only if” half of (b) is proved.

The expression of part (c) is proved as follows. Since $X - Z^*$ is orthogonal to all elements of $\mathcal{V}$, including $Z^*$ itself,
\[
E[X^2] = E[((X - Z^*) + Z^*)^2] = E[(X - Z^*)^2] + E[(Z^*)^2].
\]
This proves part (c).

The following propositions give some properties of the projection mapping $\Pi_{\mathcal{V}}$, with proofs based on the orthogonality principle.

Proposition 3.2.2 (Linearity of projection) Suppose $\mathcal{V}$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$, $X_1$ and $X_2$ are in $L^2(\Omega, \mathcal{F}, P)$, and $a_1$ and $a_2$ are constants. Then
\[
\Pi_{\mathcal{V}}(a_1X_1 + a_2X_2) = a_1\Pi_{\mathcal{V}}(X_1) + a_2\Pi_{\mathcal{V}}(X_2). \qquad (3.2)
\]
Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_{\mathcal{V}}(a_1X_1 + a_2X_2)$ is characterized by two properties. So, to prove (3.2), it suffices to show that $a_1\Pi_{\mathcal{V}}(X_1) + a_2\Pi_{\mathcal{V}}(X_2)$ satisfies these two properties. First, we must check that $a_1\Pi_{\mathcal{V}}(X_1) + a_2\Pi_{\mathcal{V}}(X_2) \in \mathcal{V}$. This follows immediately from the fact that $\Pi_{\mathcal{V}}(X_i) \in \mathcal{V}$, for $i = 1, 2$, and $\mathcal{V}$ is a linear subspace, so the first property is checked. Second, we must check that $e \perp Z$, where $e = a_1X_1 + a_2X_2 - (a_1\Pi_{\mathcal{V}}(X_1) + a_2\Pi_{\mathcal{V}}(X_2))$, and $Z$ is an arbitrary element of $\mathcal{V}$. Now $e = a_1e_1 + a_2e_2$, where $e_i = X_i - \Pi_{\mathcal{V}}(X_i)$ for $i = 1, 2$, and $e_i \perp Z$ for $i = 1, 2$. So $E[eZ] = a_1E[e_1Z] + a_2E[e_2Z] = 0$, or equivalently, $e \perp Z$. Thus, the second property is also checked, and the proof is complete.

Proposition 3.2.3 (Projections onto nested subspaces) Suppose $\mathcal{V}_1$ and $\mathcal{V}_2$ are closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $\mathcal{V}_2 \subset \mathcal{V}_1$. Then for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$.
(In words, the projection of $X$ onto $\mathcal{V}_2$ can be found by first projecting $X$ onto $\mathcal{V}_1$, and then projecting the result onto $\mathcal{V}_2$.) Furthermore,
\[
E[(X - \Pi_{\mathcal{V}_2}(X))^2] = E[(X - \Pi_{\mathcal{V}_1}(X))^2] + E[(\Pi_{\mathcal{V}_1}(X) - \Pi_{\mathcal{V}_2}(X))^2]. \qquad (3.3)
\]
In particular, $E[(X - \Pi_{\mathcal{V}_2}(X))^2] \geq E[(X - \Pi_{\mathcal{V}_1}(X))^2]$.

Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_{\mathcal{V}_2}(X)$ is characterized by two properties. So, to prove $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$, it suffices to show that $\Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$ satisfies the two properties. First, we must check that $\Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X) \in \mathcal{V}_2$. This follows immediately from the fact that $\Pi_{\mathcal{V}_2}$ maps into $\mathcal{V}_2$, so the first property is checked. Second, we must check that $e \perp Z$, where $e = X - \Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$, and $Z$ is an arbitrary element of $\mathcal{V}_2$. Now $e = e_1 + e_2$, where $e_1 = X - \Pi_{\mathcal{V}_1}(X)$ and $e_2 = \Pi_{\mathcal{V}_1}(X) - \Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$. By the characterization of $\Pi_{\mathcal{V}_1}(X)$, $e_1$ is perpendicular to any random variable in $\mathcal{V}_1$. In particular, $e_1 \perp Z$, because $Z \in \mathcal{V}_2 \subset \mathcal{V}_1$. The characterization of the projection of $\Pi_{\mathcal{V}_1}(X)$ onto $\mathcal{V}_2$ implies that $e_2 \perp Z$. Since $e_i \perp Z$ for $i = 1, 2$, it follows that $e \perp Z$. Thus, the second property is also checked, so it is proved that $\Pi_{\mathcal{V}_2}(X) = \Pi_{\mathcal{V}_2}\Pi_{\mathcal{V}_1}(X)$.

As mentioned above, $e_1$ is perpendicular to any random variable in $\mathcal{V}_1$, which implies that $e_1 \perp e_2$. Thus, $E[e^2] = E[e_1^2] + E[e_2^2]$, which is equivalent to (3.3). Therefore, (3.3) is proved. The last inequality of the proposition follows, of course, from (3.3). The inequality is also equivalent to the inequality $\min_{W \in \mathcal{V}_2} E[(X - W)^2] \geq \min_{W \in \mathcal{V}_1} E[(X - W)^2]$, and this inequality is true because the minimum of a set of numbers cannot increase if more numbers are added to the set.

The following proposition is closely related to the use of linear innovations sequences, discussed in Sections 3.5 and 3.6.
Proposition 3.2.4 (Projection onto the span of orthogonal subspaces) Suppose $\mathcal{V}_1$ and $\mathcal{V}_2$ are closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $\mathcal{V}_1 \perp \mathcal{V}_2$, which means that $E[Z_1Z_2] = 0$ for any $Z_1 \in \mathcal{V}_1$ and $Z_2 \in \mathcal{V}_2$. Let $\mathcal{V} = \mathcal{V}_1 \oplus \mathcal{V}_2 = \{Z_1 + Z_2 : Z_i \in \mathcal{V}_i\}$ denote the span of $\mathcal{V}_1$ and $\mathcal{V}_2$. Then for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$. The minimum mean square error satisfies
\[
E[(X - \Pi_{\mathcal{V}}(X))^2] = E[X^2] - E[(\Pi_{\mathcal{V}_1}(X))^2] - E[(\Pi_{\mathcal{V}_2}(X))^2].
\]
Proof. The space $\mathcal{V}$ is also a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$ (see a starred homework problem). By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_{\mathcal{V}}(X)$ is characterized by two properties. So to prove $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$, it suffices to show that $\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$ satisfies these two properties. First, we must check that $\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X) \in \mathcal{V}$. This follows immediately from the fact that $\Pi_{\mathcal{V}_i}(X) \in \mathcal{V}_i$, for $i = 1, 2$, so the first property is checked. Second, we must check that $e \perp Z$, where $e = X - (\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X))$, and $Z$ is an arbitrary element of $\mathcal{V}$. Now any such $Z$ can be written as $Z = Z_1 + Z_2$ where $Z_i \in \mathcal{V}_i$ for $i = 1, 2$. Observe that $\Pi_{\mathcal{V}_2}(X) \perp Z_1$ because $\Pi_{\mathcal{V}_2}(X) \in \mathcal{V}_2$ and $Z_1 \in \mathcal{V}_1$. Therefore,
\[
E[eZ_1] = E[(X - (\Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)))Z_1] = E[(X - \Pi_{\mathcal{V}_1}(X))Z_1] = 0,
\]
where the last equality follows from the characterization of $\Pi_{\mathcal{V}_1}(X)$. Thus, $e \perp Z_1$, and similarly $e \perp Z_2$, so $e \perp Z$. Thus, the second property is also checked, so $\Pi_{\mathcal{V}}(X) = \Pi_{\mathcal{V}_1}(X) + \Pi_{\mathcal{V}_2}(X)$ is proved.

Since $\Pi_{\mathcal{V}_i}(X) \in \mathcal{V}_i$ for $i = 1, 2$, $\Pi_{\mathcal{V}_1}(X) \perp \Pi_{\mathcal{V}_2}(X)$. Therefore, $E[(\Pi_{\mathcal{V}}(X))^2] = E[(\Pi_{\mathcal{V}_1}(X))^2] + E[(\Pi_{\mathcal{V}_2}(X))^2]$, and the expression for the MMSE in the proposition follows from the error expression in the orthogonality principle.
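The orthogonality principle and Propositions 3.2.3 and 3.2.4 can be illustrated on a finite probability space, where every random variable is a vector and projection reduces to least squares. The sketch below is an illustration under stated assumptions, not a construction from the notes: $\Omega$ is taken to have $N$ equally likely outcomes, the subspaces are spans of randomly generated orthonormal vectors, and the names `inner`, `project`, `Va`, `Vb` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50                                   # number of equally likely outcomes

def inner(u, v):
    """E[UV] on the uniform probability space with N outcomes."""
    return np.mean(u * v)

def project(x, basis):
    """Projection of x onto the span of the columns of `basis` (least squares)."""
    coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return basis @ coef

x = rng.normal(size=N)                   # the random variable X, one value per outcome
Q, _ = np.linalg.qr(rng.normal(size=(N, 4)))
Va, Vb = Q[:, :2], Q[:, 2:]              # two mutually orthogonal subspaces
V = Q                                    # their span, V = Va (+) Vb

z = project(x, V)                        # Z* = projection of X onto V
e = x - z                                # estimation error

# Theorem 3.2.1(b): the error is orthogonal to every element of V.
orth_ok = np.allclose(V.T @ e, 0.0)

# Theorem 3.2.1(c): MMSE = E[X^2] - E[(Z*)^2].
mmse_ok = np.isclose(inner(e, e), inner(x, x) - inner(z, z))

# Proposition 3.2.3 with Va nested in V: project onto V first, then onto Va.
za = project(x, Va)
nested_ok = np.allclose(project(z, Va), za)

# Equation (3.3): the error for the smaller space splits into two orthogonal parts.
pythagoras_ok = np.isclose(inner(x - za, x - za),
                           inner(e, e) + inner(z - za, z - za))

# Proposition 3.2.4: projection onto the span of orthogonal subspaces is the sum.
sum_ok = np.allclose(z, za + project(x, Vb))
```

On this finite space condition V.3 is automatic, since every linear subspace of a finite-dimensional space is closed.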
3.3 Conditional expectation and linear estimators

In many applications, a random variable $X$ is to be estimated based on observation of a random variable $Y$. Thus, an estimator is a function of $Y$. In applications, the two most frequently considered classes of functions of $Y$ used in this context are essentially all functions, leading to the best unconstrained estimator, or all linear functions, leading to the best linear estimator. These two possibilities are discussed in this section.

3.3.1 Conditional expectation as a projection

Suppose a random variable $X$ is to be estimated using an observed random vector $Y$ of dimension $m$. Suppose $E[X^2] < +\infty$. Consider the most general class of estimators based on $Y$, by setting
\[
\mathcal{V} = \{g(Y) : g : \mathbb{R}^m \to \mathbb{R},\ E[g(Y)^2] < +\infty\}. \qquad (3.4)
\]
There is also the implicit condition that $g$ is Borel measurable so that $g(Y)$ is a random variable. The projection of $X$ onto this class $\mathcal{V}$ is the unconstrained minimum mean square error (MMSE) estimator of $X$ given $Y$.

Let us first proceed to identify the optimal estimator by conditioning on the value of $Y$, thereby reducing this example to the estimation of a random variable by a constant, as discussed at the beginning of Section 3.2. For technical reasons we assume for now that $X$ and $Y$ have a joint pdf. Then, conditioning on $Y$,
\[
E[(X - g(Y))^2] = \int_{\mathbb{R}^m} E[(X - g(Y))^2 \mid Y = y]\, f_Y(y)\,dy
\]
where
\[
E[(X - g(Y))^2 \mid Y = y] = \int_{-\infty}^{\infty} (x - g(y))^2 f_{X|Y}(x \mid y)\,dx.
\]
Since the mean is the MMSE estimator of a random variable among all constants, for each fixed $y$, the minimizing choice for $g(y)$ is
\[
g^*(y) = E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx. \qquad (3.5)
\]
Therefore, the optimal estimator in $\mathcal{V}$ is $g^*(Y)$ which, by definition, is equal to the random variable $E[X \mid Y]$.

What does the orthogonality principle imply for this example? It implies that there exists an optimal estimator $g^*(Y)$ which is the unique element of $\mathcal{V}$ such that
\[
(X - g^*(Y)) \perp g(Y)
\]
for all $g(Y) \in \mathcal{V}$. If $X, Y$ have a joint pdf then we can check that $E[X \mid Y]$ satisfies the required condition. Indeed,
\[
\begin{aligned}
E[(X - E[X \mid Y])g(Y)] &= \int\!\!\int (x - E[X \mid Y = y])\, g(y)\, f_{X|Y}(x \mid y) f_Y(y)\,dx\,dy \\
&= \int \left\{ \int (x - E[X \mid Y = y]) f_{X|Y}(x \mid y)\,dx \right\} g(y) f_Y(y)\,dy = 0,
\end{aligned}
\]
because the expression within the braces is zero.

In summary, if $X$ and $Y$ have a joint pdf (and similarly if they have a joint pmf) then the MMSE estimator of $X$ given $Y$ is $E[X \mid Y]$. Even if $X$ and $Y$ don’t have a joint pdf or joint pmf, we define the conditional expectation $E[X \mid Y]$ to be the MMSE estimator of $X$ given $Y$. By the orthogonality principle $E[X \mid Y]$ exists as long as $E[X^2] < \infty$, and it is the unique function of $Y$ such that
\[
E[(X - E[X \mid Y])g(Y)] = 0
\]
for all $g(Y)$ in $\mathcal{V}$.

Estimation of a random variable has been discussed, but often we wish to estimate a random vector. A beauty of the MSE criterion is that it easily extends to estimation of random vectors, because the MSE for estimation of a random vector is the sum of the MSEs of the coordinates:
\[
E[\|X - g(Y)\|^2] = \sum_{i=1}^{m} E[(X_i - g_i(Y))^2].
\]
Therefore, for most sets of estimators $\mathcal{V}$ typically encountered, finding the MMSE estimator of a random vector $X$ decomposes into finding the MMSE estimators of the coordinates of $X$ separately.

Suppose a random vector $X$ is to be estimated using estimators of the form $g(Y)$, where here $g$ maps $\mathbb{R}^n$ into $\mathbb{R}^m$. Assume $E[\|X\|^2] < +\infty$ and seek an estimator to minimize the MSE. As seen above, the MMSE estimator for each coordinate $X_i$ is $E[X_i \mid Y]$, which is also the projection of $X_i$ onto the set of unconstrained estimators based on $Y$, defined in (3.4). So the optimal estimator $g^*(Y)$ of the entire vector $X$ is given by
\[
g^*(Y) = E[X \mid Y] = \begin{pmatrix} E[X_1 \mid Y] \\ E[X_2 \mid Y] \\ \vdots \\ E[X_m \mid Y] \end{pmatrix}.
\]
Let the estimation error be denoted by $e$, $e = X - E[X \mid Y]$. (Even though $e$ is a random vector we use lower case for it for an obvious reason.)
The mean of the error is given by $Ee = 0$. As for the covariance of the error, note that $E[X_j \mid Y]$ is in $\mathcal{V}$ for each $j$, so $e_i \perp E[X_j \mid Y]$ for each $i, j$. Since $Ee_i = 0$, it follows that $\mathrm{Cov}(e_i, E[X_j \mid Y]) = 0$ for all $i, j$. Equivalently, $\mathrm{Cov}(e, E[X \mid Y]) = 0$. Using this and the fact $X = E[X \mid Y] + e$ yields
\[
\begin{aligned}
\mathrm{Cov}(X) &= \mathrm{Cov}(E[X \mid Y] + e) \\
&= \mathrm{Cov}(E[X \mid Y]) + \mathrm{Cov}(e) + \mathrm{Cov}(E[X \mid Y], e) + \mathrm{Cov}(e, E[X \mid Y]) \\
&= \mathrm{Cov}(E[X \mid Y]) + \mathrm{Cov}(e).
\end{aligned}
\]
Thus, $\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(E[X \mid Y])$.

In practice, computation of $E[X \mid Y]$ (for example, using (3.5) in case a joint pdf exists) may be too complex or may require more information about the joint distribution of $X$ and $Y$ than is available. For both of these reasons, it is worthwhile to consider classes of estimators that are constrained to smaller sets of functions of the observations. A widely used set is the set of all linear functions, leading to linear estimators, described next.

3.3.2 Linear estimators

Let $X$ and $Y$ be random vectors with $E[\|X\|^2] < +\infty$ and $E[\|Y\|^2] < +\infty$. Seek estimators of the form $AY + b$ to minimize the MSE. Such estimators are called linear estimators because each coordinate of $AY + b$ is a linear combination of $Y_1, Y_2, \ldots, Y_n$ and 1. Here “1” stands for the random variable that is always equal to 1.

To identify the optimal linear estimator we shall apply the orthogonality principle for each coordinate of $X$ with
\[
\mathcal{V} = \{c_0 + c_1Y_1 + c_2Y_2 + \cdots + c_nY_n : c_0, c_1, \ldots, c_n \in \mathbb{R}\}.
\]
Let $e$ denote the estimation error $e = X - (AY + b)$. We must select $A$ and $b$ so that $e_i \perp Z$ for all $Z \in \mathcal{V}$. Equivalently, we must select $A$ and $b$ so that
\[
\begin{aligned}
e_i &\perp 1 \quad \text{all } i \\
e_i &\perp Y_j \quad \text{all } i, j.
\end{aligned}
\]
The condition $e_i \perp 1$, which means $Ee_i = 0$, implies that $E[e_iY_j] = \mathrm{Cov}(e_i, Y_j)$. Thus, the required orthogonality conditions on $A$ and $b$ become $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$.
The condition $Ee = 0$ requires that $b = EX - AEY$, so we can restrict our attention to estimators of the form $EX + A(Y - EY)$, so that $e = X - EX - A(Y - EY)$. The condition $\mathrm{Cov}(e, Y) = 0$ becomes $\mathrm{Cov}(X, Y) - A\,\mathrm{Cov}(Y, Y) = 0$. If $\mathrm{Cov}(Y, Y)$ is not singular, then $A$ must be given by $A = \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}$. In this case the optimal linear estimator, denoted by $\widehat{E}[X \mid Y]$, is given by
\[
\widehat{E}[X \mid Y] = E[X] + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}(Y - EY). \qquad (3.6)
\]
Proceeding as in the case of unconstrained estimators of a random vector, we find that the covariance of the error vector satisfies
\[
\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(\widehat{E}[X \mid Y])
\]
which by (3.6) yields
\[
\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}\mathrm{Cov}(Y, X). \qquad (3.7)
\]

3.3.3 Discussion of the estimators

As seen above, the expectation $E[X]$, the MMSE linear estimator $\widehat{E}[X \mid Y]$, and the conditional expectation $E[X \mid Y]$, are all instances of projection mappings $\Pi_{\mathcal{V}}$, for $\mathcal{V}$ consisting of constants, linear estimators based on $Y$, or unconstrained estimators based on $Y$, respectively. Hence, the orthogonality principle, and Propositions 3.2.2-3.2.4 all apply to these estimators.

Proposition 3.2.2 implies that these estimators are linear functions of $X$. In particular, $E[a_1X_1 + a_2X_2 \mid Y] = a_1E[X_1 \mid Y] + a_2E[X_2 \mid Y]$, and the same is true with “$E$” replaced by “$\widehat{E}$.”

Proposition 3.2.3, regarding projections onto nested subspaces, implies an ordering of the mean square errors:
\[
E[(X - E[X \mid Y])^2] \leq E[(X - \widehat{E}[X \mid Y])^2] \leq \mathrm{Var}(X).
\]
Furthermore, it implies that the best linear estimator of $X$ based on $Y$ is equal to the best linear estimator of the estimator $E[X \mid Y]$: that is, $\widehat{E}[X \mid Y] = \widehat{E}[E[X \mid Y] \mid Y]$. It follows, in particular, that $E[X \mid Y] = \widehat{E}[X \mid Y]$ if and only if $E[X \mid Y]$ has the linear form, $AY + b$. Similarly, $E[X]$, the best constant estimator of $X$, is also the best constant estimator of $\widehat{E}[X \mid Y]$ or of $E[X \mid Y]$. That is, $E[X] = E[\widehat{E}[X \mid Y]] = E[E[X \mid Y]]$. In fact, $E[X] = E[\widehat{E}[E[X \mid Y] \mid Y]]$.
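Equations (3.6) and (3.7) translate directly into a few lines of linear algebra. The following sketch is one possible way to compute the coefficients, not the author's code: the function name `linear_mmse` and its argument names are illustrative assumptions, and $\mathrm{Cov}(Y, Y)$ is assumed nonsingular as in the text. The usage lines feed in the first and second moments $EX = EY = 7/12$, $\mathrm{Var}(X) = \mathrm{Var}(Y) = 11/144$ and $\mathrm{Cov}(X, Y) = -1/144$ of the joint pdf $f_{XY}(x, y) = x + y$ on the unit square treated in Example 3.3.1 ($\mathrm{Var}(X) = \mathrm{Var}(Y)$ by the symmetry of the pdf).

```python
import numpy as np

def linear_mmse(mean_x, mean_y, cov_xx, cov_xy, cov_yy):
    """Coefficients of the MMSE linear estimator A y + b and its error covariance.

    Implements (3.6) and (3.7); cov_yy is assumed to be nonsingular.
    """
    # A = cov_xy @ inv(cov_yy), computed with a linear solve instead of an inverse.
    A = np.linalg.solve(cov_yy.T, cov_xy.T).T
    b = mean_x - A @ mean_y                 # from the condition Ee = 0
    cov_e = cov_xx - A @ cov_xy.T           # Cov(X) - Cov(X,Y) Cov(Y,Y)^{-1} Cov(Y,X)
    return A, b, cov_e

# Moments for the pdf f(x, y) = x + y on the unit square (scalar X and Y).
mean_x = np.array([7 / 12])
mean_y = np.array([7 / 12])
cov_xx = np.array([[11 / 144]])
cov_yy = np.array([[11 / 144]])
cov_xy = np.array([[-1 / 144]])

A, b, cov_e = linear_mmse(mean_x, mean_y, cov_xx, cov_xy, cov_yy)

def estimate(y):
    """Evaluate the linear estimator at an observed value y (a 1-D array)."""
    return A @ y + b
```

With these moments the slope is $A = -1/11$, so the estimator can be written as $\frac{7}{12} - \frac{1}{11}\left(y - \frac{7}{12}\right)$, and the error variance `cov_e` is slightly smaller than $\mathrm{Var}(X)$, consistent with the ordering of mean square errors stated above.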
Proposition 3.2.3 also implies relations among estimators based on different sets of observations. For example, suppose $X$ is to be estimated and $Y_1$ and $Y_2$ are both possible observations. The space of unrestricted estimators based on $Y_1$ alone is a subspace of the space of unrestricted estimators based on both $Y_1$ and $Y_2$. Therefore, Proposition 3.2.3 implies that $E[E[X \mid Y_1, Y_2] \mid Y_1] = E[X \mid Y_1]$, a property that is sometimes called the tower property of conditional expectation. The same relation holds true for the same reason for the best linear estimators: $\widehat{E}[\widehat{E}[X \mid Y_1, Y_2] \mid Y_1] = \widehat{E}[X \mid Y_1]$.

Example 3.3.1 Let $X, Y$ be jointly continuous random variables with the pdf
\[
f_{XY}(x, y) = \begin{cases} x + y & 0 \leq x, y \leq 1 \\ 0 & \text{else.} \end{cases}
\]
Let us find $E[X \mid Y]$ and $\widehat{E}[X \mid Y]$. To find $E[X \mid Y]$ we first identify $f_Y(y)$ and $f_{X|Y}(x \mid y)$.
\[
f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx = \begin{cases} \frac{1}{2} + y & 0 \leq y \leq 1 \\ 0 & \text{else.} \end{cases}
\]
Therefore, $f_{X|Y}(x \mid y)$ is defined only for $0 \leq y \leq 1$, and for such $y$ it is given by
\[
f_{X|Y}(x \mid y) = \begin{cases} \dfrac{x + y}{\frac{1}{2} + y} & 0 \leq x \leq 1 \\ 0 & \text{else.} \end{cases}
\]
So for $0 \leq y \leq 1$,
\[
E[X \mid Y = y] = \int_0^1 x f_{X|Y}(x \mid y)\,dx = \frac{2 + 3y}{3 + 6y}.
\]
Therefore, $E[X \mid Y] = \frac{2 + 3Y}{3 + 6Y}$. To find $\widehat{E}[X \mid Y]$ we compute $EX = EY = \frac{7}{12}$, $\mathrm{Var}(Y) = \frac{11}{144}$ and $\mathrm{Cov}(X, Y) = -\frac{1}{144}$ so $\widehat{E}[X \mid Y] = \frac{7}{12} - \frac{1}{11}\left(Y - \frac{7}{12}\right)$.

Example 3.3.2 Suppose that $Y = XU$, where $X$ and $U$ are independent random variables, $X$ has the Rayleigh density
\[
f_X(x) = \begin{cases} \dfrac{x}{\sigma^2} e^{-x^2/2\sigma^2} & x \geq 0 \\ 0 & \text{else} \end{cases}
\]
and $U$ is uniformly distributed on the interval $[0, 1]$. We find $\widehat{E}[X \mid Y]$ and $E[X \mid Y]$.
To compute $\widehat{E}[X \mid Y]$ we find
\[
\begin{aligned}
EX &= \int_0^{\infty} \frac{x^2}{\sigma^2} e^{-x^2/2\sigma^2}\,dx = \frac{1}{\sigma}\sqrt{\frac{\pi}{2}} \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}\,dx = \sigma\sqrt{\frac{\pi}{2}} \\
EY &= EX\,EU = \frac{\sigma}{2}\sqrt{\frac{\pi}{2}} \\
E[X^2] &= 2\sigma^2 \\
\mathrm{Var}(Y) &= E[Y^2] - E[Y]^2 = E[X^2]E[U^2] - E[X]^2E[U]^2 = \sigma^2\left(\frac{2}{3} - \frac{\pi}{8}\right) \\
\mathrm{Cov}(X, Y) &= E[U]E[X^2] - E[U]E[X]^2 = \frac{1}{2}\mathrm{Var}(X) = \sigma^2\left(1 - \frac{\pi}{4}\right).
\end{aligned}
\]
Thus
\[
\widehat{E}[X \mid Y] = \sigma\sqrt{\frac{\pi}{2}} + \frac{\left(1 - \frac{\pi}{4}\right)}{\left(\frac{2}{3} - \frac{\pi}{8}\right)} \left(Y - \frac{\sigma}{2}\sqrt{\frac{\pi}{2}}\right).
\]
To find $E[X \mid Y]$ we first find the joint density and then the conditional density. Now
\[
f_{XY}(x, y) = f_X(x) f_{Y|X}(y \mid x) = \begin{cases} \dfrac{1}{\sigma^2} e^{-x^2/2\sigma^2} & 0 \leq y \leq x \\ 0 & \text{else} \end{cases}
\]
\[
f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx = \begin{cases} \displaystyle\int_y^{\infty} \frac{1}{\sigma^2} e^{-x^2/2\sigma^2}\,dx = \frac{\sqrt{2\pi}}{\sigma} Q\left(\frac{y}{\sigma}\right) & y \geq 0 \\ 0 & y < 0 \end{cases}
\]
where $Q$ is the complementary CDF for the standard normal distribution. So for $y \geq 0$
\[
E[X \mid Y = y] = \frac{\int_{-\infty}^{\infty} x f_{XY}(x, y)\,dx}{f_Y(y)} = \frac{\int_y^{\infty} \frac{x}{\sigma^2} e^{-x^2/2\sigma^2}\,dx}{\frac{\sqrt{2\pi}}{\sigma} Q\left(\frac{y}{\sigma}\right)} = \frac{\sigma \exp(-y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\left(\frac{y}{\sigma}\right)}.
\]
Thus,
\[
E[X \mid Y] = \frac{\sigma \exp(-Y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\left(\frac{Y}{\sigma}\right)}.
\]

Example 3.3.3 Suppose that $Y$ is a random variable and $f$ is a Borel measurable function such that $E[f(Y)^2] < \infty$. Let us show that $E[f(Y) \mid Y] = f(Y)$. By definition, $E[f(Y) \mid Y]$ is the random variable of the form $g(Y)$ which is closest to $f(Y)$ in the mean square sense. If we take $g(Y) = f(Y)$, then the mean square error is zero. No other estimator can have a smaller mean square error. Thus, $E[f(Y) \mid Y] = f(Y)$. Similarly, if $Y$ is a random vector with $E[\|Y\|^2] < \infty$, and if $A$ is a matrix and $b$ a vector, then $\widehat{E}[AY + b \mid Y] = AY + b$.

3.4 Joint Gaussian distribution and Gaussian random vectors

Recall that a random variable $X$ is Gaussian (or normal) with mean $\mu$ and variance $\sigma^2 > 0$ if $X$ has pdf
\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}.
\]
As a degenerate case, we say $X$ is Gaussian with mean $\mu$ and variance 0 if $P\{X = \mu\} = 1$. Equivalently, $X$ is Gaussian with mean $\mu$ and variance $\sigma^2 \geq 0$ if its characteristic function is given by
\[
\Phi_X(u) = \exp\left(-\frac{u^2\sigma^2}{2} + j\mu u\right).
\]

Lemma 3.4.1 Suppose $X_1, X_2, \ldots, X_n$ are independent Gaussian random variables. Then any linear combination $a_1X_1 + \cdots + a_nX_n$ is a Gaussian random variable.

Proof. By an induction argument on $n$, it is sufficient to prove the lemma for $n = 2$. Also, if $X$ is a Gaussian random variable, then so is $aX$ for any constant $a$, so we can assume without loss of generality that $a_1 = a_2 = 1$. It remains to prove that if $X_1$ and $X_2$ are independent Gaussian random variables, then the sum $X = X_1 + X_2$ is also a Gaussian random variable. Let $\mu_i = E[X_i]$ and $\sigma_i^2 = \mathrm{Var}(X_i)$. Then the characteristic function of $X$ is given by
\[
\begin{aligned}
\Phi_X(u) &= E[e^{juX}] = E[e^{juX_1}e^{juX_2}] = E[e^{juX_1}]E[e^{juX_2}] \\
&= \exp\left(-\frac{u^2\sigma_1^2}{2} + j\mu_1 u\right) \exp\left(-\frac{u^2\sigma_2^2}{2} + j\mu_2 u\right) = \exp\left(-\frac{u^2\sigma^2}{2} + j\mu u\right),
\end{aligned}
\]
where $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Thus, $X$ is a $N(\mu, \sigma^2)$ random variable.

Let $(X_i : i \in I)$ be a collection of random variables indexed by some set $I$, which possibly has infinite cardinality. A finite linear combination of $(X_i : i \in I)$ is a random variable of the form
\[
a_1X_{i_1} + a_2X_{i_2} + \cdots + a_nX_{i_n}
\]
where $n$ is finite, $i_k \in I$ for each $k$, and $a_k \in \mathbb{R}$ for each $k$.

Definition 3.4.2 A collection $(X_i : i \in I)$ of random variables has a joint Gaussian distribution (and the random variables $X_i : i \in I$ themselves are said to be jointly Gaussian) if every finite linear combination of $(X_i : i \in I)$ is a Gaussian random variable. A random vector $X$ is called a Gaussian random vector if its coordinate random variables are jointly Gaussian. A collection of random vectors is said to have a joint Gaussian distribution if all of the coordinate random variables of all of the vectors are jointly Gaussian.

We write that $X$ is a $N(\mu, K)$ random vector if $X$ is a Gaussian random vector with mean vector $\mu$ and covariance matrix $K$.

Proposition 3.4.3
(a) If $(X_i : i \in I)$ has a joint Gaussian distribution, then each of the random variables itself is Gaussian.
(b) If the random variables $X_i : i \in I$ are each Gaussian and if they are independent, which means that $X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ are independent for any finite number of indices $i_1, i_2, \ldots, i_n$, then $(X_i : i \in I)$ has a joint Gaussian distribution.

(c) (Preservation of joint Gaussian property under linear combinations and limits) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Let $(Y_j : j \in J)$ denote a collection of random variables such that each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, and let $(Z_k : k \in K)$ denote a set of random variables such that each $Z_k$ is a limit in probability (or in the m.s. or a.s. senses) of a sequence from $(Y_j : j \in J)$. Then $(Y_j : j \in J)$ and $(Z_k : k \in K)$ each have a joint Gaussian distribution.

(c′) (Alternative version of (c)) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Let $\mathcal{Z}$ denote the smallest set of random variables that contains $(X_i : i \in I)$, is a linear class, and is closed under taking limits in probability. Then $\mathcal{Z}$ has a joint Gaussian distribution.

(d) The characteristic function of a $N(\mu, K)$ random vector is given by $\Phi_X(u) = E[e^{ju^T X}] = e^{ju^T\mu - \frac{1}{2}u^T K u}$.

(e) If $X$ is a $N(\mu, K)$ random vector and $K$ is a diagonal matrix (i.e. $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$, or equivalently, the coordinates of $X$ are uncorrelated) then the coordinates $X_1, \ldots, X_m$ are independent.

(f) A $N(\mu, K)$ random vector $X$ such that $K$ is nonsingular has a pdf given by
\[
f_X(x) = \frac{1}{(2\pi)^{\frac{m}{2}} |K|^{\frac{1}{2}}} \exp\left(-\frac{(x - \mu)^T K^{-1} (x - \mu)}{2}\right). \qquad (3.8)
\]
Any random vector $X$ such that $\mathrm{Cov}(X)$ is singular does not have a pdf.

(g) If $X$ and $Y$ are jointly Gaussian vectors, then they are independent if and only if $\mathrm{Cov}(X, Y) = 0$.

Proof. (a) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution, so that all finite linear combinations of the $X_i$’s are Gaussian random variables.
Each $X_i$ for $i \in I$ is itself a finite linear combination of all the variables (with only one term). So each $X_i$ is a Gaussian random variable.

(b) Suppose the variables $X_i : i \in I$ are mutually independent, and each is Gaussian. Then any finite linear combination of $(X_i : i \in I)$ is a linear combination of finitely many independent Gaussian random variables, and is hence, by Lemma 3.4.1, also a Gaussian random variable. So $(X_i : i \in I)$ has a joint Gaussian distribution.

(c) Suppose the hypotheses of (c) are true. Let $V$ be a finite linear combination of $(Y_j : j \in J)$: $V = b_1Y_{j_1} + b_2Y_{j_2} + \cdots + b_nY_{j_n}$. Each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, so $V$ can be written as a finite linear combination of $(X_i : i \in I)$:
\[
V = b_1(a_{11}X_{i_{11}} + a_{12}X_{i_{12}} + \cdots + a_{1k_1}X_{i_{1k_1}}) + \cdots + b_n(a_{n1}X_{i_{n1}} + \cdots + a_{nk_n}X_{i_{nk_n}}).
\]
Therefore $V$ is a Gaussian random variable. Thus, any finite linear combination of $(Y_j : j \in J)$ is Gaussian, so that $(Y_j : j \in J)$ has a joint Gaussian distribution.

Let $W$ be a finite linear combination of $(Z_k : k \in K)$: $W = a_1Z_{k_1} + \cdots + a_mZ_{k_m}$. By assumption, for $1 \leq l \leq m$, there is a sequence $(j_{l,n} : n \geq 1)$ of indices from $J$ such that $Y_{j_{l,n}} \xrightarrow{p.} Z_{k_l}$ as $n \to \infty$. Let $W_n = a_1Y_{j_{1,n}} + \cdots + a_mY_{j_{m,n}}$. Each $W_n$ is a Gaussian random variable, because it is a finite linear combination of $(Y_j : j \in J)$. Also,
\[
|W - W_n| \leq \sum_{l=1}^{m} |a_l|\,|Z_{k_l} - Y_{j_{l,n}}|. \qquad (3.9)
\]
Since each term on the right-hand side of (3.9) converges to zero in probability, it follows that $W_n \xrightarrow{p.} W$ as $n \to \infty$. Since limits in probability of Gaussian random variables are also Gaussian random variables (Proposition 2.1.16), it follows that $W$ is a Gaussian random variable. Thus, an arbitrary finite linear combination $W$ of $(Z_k : k \in K)$ is Gaussian, so, by definition, $(Z_k : k \in K)$ has a joint Gaussian distribution.

(c′) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution.
Using the notation of part (c), let $(Y_j : j \in J)$ denote the set of all finite linear combinations of $(X_i : i \in I)$ and let $(Z_k : k \in K)$ denote the set of all random variables that are limits in probability of random variables in $(Y_j : j \in J)$. We will show that $\mathcal{Z} = (Z_k : k \in K)$, which together with part (c) already proved, will establish (c′). We begin by establishing that $(Z_k : k \in K)$ satisfies the three properties required of $\mathcal{Z}$:

(i) $(Z_k : k \in K)$ contains $(X_i : i \in I)$.
(ii) $(Z_k : k \in K)$ is a linear class.
(iii) $(Z_k : k \in K)$ is closed under taking limits in probability.

Property (i) follows from the fact that for any $i_o \in I$, the random variable $X_{i_o}$ is trivially a finite linear combination of $(X_i : i \in I)$, and it is trivially the limit in probability of the sequence with all entries equal to itself. Property (ii) is true because a linear combination of the form $a_1Z_{k_1} + a_2Z_{k_2}$ is the limit in probability of a sequence of random variables of the form $a_1Y_{j_{n,1}} + a_2Y_{j_{n,2}}$, and, since $(Y_j : j \in J)$ is a linear class, $a_1Y_{j_{n,1}} + a_2Y_{j_{n,2}}$ is a random variable from $(Y_j : j \in J)$ for each $n$. To prove (iii), suppose $Z_{k_n} \xrightarrow{p.} Z_\infty$ as $n \to \infty$ for some sequence $k_1, k_2, \ldots$ from $K$. By passing to a subsequence if necessary, it can be assumed that $P\{|Z_\infty - Z_{k_n}| \geq 2^{-(n+1)}\} \leq 2^{-(n+1)}$ for all $n \geq 1$. Since each $Z_{k_n}$ is the limit in probability of a sequence of random variables from $(Y_j : j \in J)$, for each $n$ there is a $j_n \in J$ so that $P\{|Z_{k_n} - Y_{j_n}| \geq 2^{-(n+1)}\} \leq 2^{-(n+1)}$. Since $|Z_\infty - Y_{j_n}| \leq |Z_\infty - Z_{k_n}| + |Z_{k_n} - Y_{j_n}|$, it follows that $P\{|Z_\infty - Y_{j_n}| \geq 2^{-n}\} \leq 2^{-n}$. So $Y_{j_n} \xrightarrow{p.} Z_\infty$. Therefore, $Z_\infty$ is a random variable in $(Z_k : k \in K)$, so $(Z_k : k \in K)$ is closed under convergence in probability. In summary, $(Z_k : k \in K)$ has properties (i)-(iii). Any set of random variables with these three properties must contain $(Y_j : j \in J)$, and hence must contain $(Z_k : k \in K)$.
So (Z k : k ∈ K) is indeed the smallest set of random variables with properties (i)-(iii). That is, (Z k : k ∈ K) = Z, as claimed. (d) Let X be a N(µ, K) random vector. Then for any vector u with the same dimension as X, the random variable u T X is Gaussian with mean u T µ and variance given by Var(u T X) = Cov(u T X, u T X) = u T Ku. Thus, we already know the characteristic function of u T X. But the characteristic function of the vector X evaluated at u is the characteristic function of u T X evaluated at 1: Φ X (u) = E[e ju T X ] = E[e j(u T X) ] = Φ u T X (1) = e ju T µ− 1 2 u T Ku , which establishes part (d) of the proposition. (e) If X is a N(µ, K) random vector and K is a diagonal matrix, then Φ X (u) = m i=1 exp(ju i µ i − k ii u 2 i 2 ) = i Φ i (u i ) where k ii denotes the i th diagonal element of K, and Φ i is the characteristic function of a N(µ i , k ii ) random variable. By uniqueness of joint characteristic functions, it follows that X 1 , . . . , X m are independent random variables. (f) Let X be a N(µ, K) random vector. Since K is positive semidefinite it can be written as K = UΛU T where U is orthonormal (so UU T = U T U = I) and Λ is a diagonal matrix with the eigenvalues λ 1 , λ 2 , . . . , λ m of K along the diagonal. (See Section 11.7 of the appendix.) Let Y = U T (X − µ). Then Y is a Gaussian vector with mean 0 and covariance matrix given by Cov(Y, Y ) = Cov(U T X, U T X) = U T KU = Λ. In summary, we have X = UY + µ, and Y is a vector of independent Gaussian random variables, the i th one being N(0, λ i ). Suppose further that K is nonsingular, meaning det(K) ,= 0. Since det(K) = λ 1 λ 2 λ m this implies that λ i > 0 for each i, so that Y has the joint pdf f Y (y) = m i=1 1 √ 2πλ i exp _ − y 2 i 2λ i _ = 1 (2π) m 2 _ det(K) exp _ − y T Λ −1 y 2 _ . 3.4. 
Since $|\det(U)| = 1$ and $U\Lambda^{-1}U^T = K^{-1}$, the joint pdf for the $N(\mu, K)$ random vector $X$ is given by
\[ f_X(x) = f_Y(U^T(x - \mu)) = \frac{1}{(2\pi)^{m/2}|K|^{1/2}} \exp\left(-\frac{(x-\mu)^T K^{-1}(x-\mu)}{2}\right). \]

Now suppose, instead, that $X$ is any random vector with some mean $\mu$ and a singular covariance matrix $K$. That means that $\det K = 0$, or equivalently that $\lambda_i = 0$ for one of the eigenvalues of $K$, or equivalently, that there is a nonzero vector $\alpha$ such that $\alpha^T K \alpha = 0$ (such an $\alpha$ is an eigenvector of $K$ for eigenvalue zero). But then $0 = \alpha^T K \alpha = \alpha^T \mathrm{Cov}(X, X)\alpha = \mathrm{Cov}(\alpha^T X, \alpha^T X) = \mathrm{Var}(\alpha^T X)$. Therefore, $P\{\alpha^T X = \alpha^T \mu\} = 1$. That is, with probability one, $X$ is in the subspace $\{x \in \mathbb{R}^m : \alpha^T(x - \mu) = 0\}$. Therefore, $X$ does not have a pdf.

(g) Suppose $X$ and $Y$ are jointly Gaussian vectors and uncorrelated (so $\mathrm{Cov}(X, Y) = 0$). Let $Z$ denote the dimension $m + n$ vector with coordinates $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. Since $\mathrm{Cov}(X, Y) = 0$, the covariance matrix of $Z$ is block diagonal:
\[ \mathrm{Cov}(Z) = \begin{pmatrix} \mathrm{Cov}(X) & 0 \\ 0 & \mathrm{Cov}(Y) \end{pmatrix}. \]
Therefore, for $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$,
\[ \Phi_Z\left(\begin{pmatrix} u \\ v \end{pmatrix}\right) = \exp\left(-\frac{1}{2}\begin{pmatrix} u \\ v \end{pmatrix}^T \mathrm{Cov}(Z) \begin{pmatrix} u \\ v \end{pmatrix} + j\begin{pmatrix} u \\ v \end{pmatrix}^T EZ\right) = \Phi_X(u)\Phi_Y(v). \]
Such factorization implies that $X$ and $Y$ are independent. The "if" part of part (g) is proved. Conversely, if $X$ and $Y$ are jointly Gaussian and independent of each other, then the characteristic function of the joint density must factor, which implies that $\mathrm{Cov}(Z)$ is block diagonal as above. That is, $\mathrm{Cov}(X, Y) = 0$.

Recall that in general, if $X$ and $Y$ are two random vectors on the same probability space, then the mean square error for the MMSE linear estimator $\widehat{E}[X|Y]$ is greater than or equal to the mean square error for the best unconstrained estimator, $E[X|Y]$. The tradeoff, however, is that $E[X|Y]$ can be much more difficult to compute than $\widehat{E}[X|Y]$, which is determined entirely by first and second moments. As shown in the next proposition, if $X$ and $Y$ are jointly Gaussian, the two estimators coincide.
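The change of variables used in part (f) of the proof above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the particular values of the mean, covariance, and evaluation point are made up for illustration and do not come from the notes. It evaluates the density once from the closed-form expression and once by rotating to the independent coordinates $Y = U^T(X - \mu)$ and multiplying one-dimensional Gaussian densities.

```python
import numpy as np

def gaussian_pdf_closed_form(x, mu, K):
    """Closed-form N(mu, K) density; valid only when K is nonsingular."""
    m = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(K, d)  # (x - mu)^T K^{-1} (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** m * np.linalg.det(K))

def gaussian_pdf_via_rotation(x, mu, K):
    """Same density, computed as in part (f) by diagonalizing K."""
    lam, U = np.linalg.eigh(K)  # K = U diag(lam) U^T with U orthonormal
    y = U.T @ (x - mu)          # coordinates of Y = U^T (X - mu)
    # Product of the one-dimensional N(0, lam_i) densities.
    return np.prod(np.exp(-y ** 2 / (2 * lam)) / np.sqrt(2 * np.pi * lam))

# Illustrative (hypothetical) parameters.
mu = np.array([1.0, -2.0])
K = np.array([[4.0, 3.0], [3.0, 9.0]])
x = np.array([0.5, 0.0])

p_closed = gaussian_pdf_closed_form(x, mu, K)
p_rotated = gaussian_pdf_via_rotation(x, mu, K)
print(p_closed, p_rotated)  # the two values should agree up to rounding
```

The rotation route only works when every eigenvalue is strictly positive; for a singular $K$ the division by `lam` fails, which mirrors the statement that no pdf exists in that case.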
That is, the unconstrained MMSE estimator of $X$ is linear in $Y$. We also know that $E[X|Y = y]$ is the mean of the conditional distribution of $X$ given $Y = y$. The proposition identifies not only the conditional mean, but the entire conditional distribution of $X$ given $Y = y$, for the case $X$ and $Y$ are jointly Gaussian.

Proposition 3.4.4 Let $X$ and $Y$ be jointly Gaussian vectors. Given $Y = y$, the conditional distribution of $X$ is $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$. In particular, the conditional mean $E[X|Y = y]$ is equal to $\widehat{E}[X|Y = y]$. That is, if $X$ and $Y$ are jointly Gaussian, then $E[X|Y] = \widehat{E}[X|Y]$.

If $\mathrm{Cov}(Y)$ is nonsingular,
\[ E[X|Y = y] = \widehat{E}[X|Y = y] = EX + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}(y - E[Y]) \tag{3.10} \]
\[ \mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, X), \tag{3.11} \]
and if $\mathrm{Cov}(e)$ is nonsingular,
\[ f_{X|Y}(x|y) = \frac{1}{(2\pi)^{m/2}|\mathrm{Cov}(e)|^{1/2}} \exp\left(-\frac{1}{2}\left(x - \widehat{E}[X|Y = y]\right)^T \mathrm{Cov}(e)^{-1}\left(x - \widehat{E}[X|Y = y]\right)\right). \tag{3.12} \]

Proof. Consider the MMSE linear estimator $\widehat{E}[X|Y]$ of $X$ given $Y$, and let $e$ denote the corresponding error vector: $e = X - \widehat{E}[X|Y]$. Recall that, by the orthogonality principle, $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$. Since $Y$ and $e$ are obtained from $X$ and $Y$ by linear transformations, they are jointly Gaussian. Since $\mathrm{Cov}(e, Y) = 0$, the random vectors $e$ and $Y$ are also independent. For the next part of the proof, the reader should keep in mind that if $a$ is a deterministic vector of some dimension $m$, and $Z$ is a $N(0, K)$ random vector, for a matrix $K$ that is not a function of $a$, then $Z + a$ has the $N(a, K)$ distribution.

Focus on the following rearrangement of the definition of $e$:
\[ X = e + \widehat{E}[X|Y]. \tag{3.13} \]
(Basically, the whole proof of the proposition hinges on (3.13).) Since $\widehat{E}[X|Y]$ is a function of $Y$ and since $e$ is independent of $Y$ with distribution $N(0, \mathrm{Cov}(e))$, the following key observation can be made.
Given $Y = y$, the conditional distribution of $e$ is the $N(0, \mathrm{Cov}(e))$ distribution, which does not depend on $y$, while $\widehat{E}[X|Y = y]$ is completely determined by $y$. So, given $Y = y$, $X$ can be viewed as the sum of the $N(0, \mathrm{Cov}(e))$ vector $e$ and the determined vector $\widehat{E}[X|Y = y]$. So the conditional distribution of $X$ given $Y = y$ is $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$. In particular, $E[X|Y = y]$, which in general is the mean of the conditional distribution of $X$ given $Y = y$, is therefore the mean of the $N(\widehat{E}[X|Y = y], \mathrm{Cov}(e))$ distribution. Hence $E[X|Y = y] = \widehat{E}[X|Y = y]$. Since this is true for all $y$, $E[X|Y] = \widehat{E}[X|Y]$.

Equations (3.10) and (3.11), respectively, are just the equations (3.6) and (3.7) derived for the MMSE linear estimator, $\widehat{E}[X|Y]$, and its associated covariance of error. Equation (3.12) is just the formula (3.8) for the pdf of a $N(\mu, K)$ vector, with $\mu = \widehat{E}[X|Y = y]$ and $K = \mathrm{Cov}(e)$.

Example 3.4.5 Suppose $X$ and $Y$ are jointly Gaussian mean zero random variables such that the vector $\begin{pmatrix} X \\ Y \end{pmatrix}$ has covariance matrix $\begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. Let us find simple expressions for the two random variables $E[X^2|Y]$ and $P[X \ge c|Y]$. Note that if $W$ is a random variable with the $N(\mu, \sigma^2)$ distribution, then $E[W^2] = \mu^2 + \sigma^2$ and $P\{W \ge c\} = Q\left(\frac{c - \mu}{\sigma}\right)$, where $Q$ is the standard Gaussian complementary CDF. The idea is to apply these facts to the conditional distribution of $X$ given $Y$. Given $Y = y$, the conditional distribution of $X$ is $N\left(\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(Y)}y,\ \mathrm{Cov}(X) - \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(Y)}\right)$, or $N\left(\frac{y}{3}, 3\right)$. Therefore, $E[X^2|Y = y] = \left(\frac{y}{3}\right)^2 + 3$ and $P[X \ge c|Y = y] = Q\left(\frac{c - (y/3)}{\sqrt{3}}\right)$. Applying these two functions to the random variable $Y$ yields $E[X^2|Y] = \left(\frac{Y}{3}\right)^2 + 3$ and $P[X \ge c|Y] = Q\left(\frac{c - (Y/3)}{\sqrt{3}}\right)$.

3.5 Linear Innovations Sequences

Let $X, Y_1, \ldots, Y_n$ be random vectors with finite second moments, all on the same probability space. In general, computation of the joint projection Ê[X | Y_1, . . .
, Y_n] is considerably more complicated than computation of the individual projections $\widehat{E}[X|Y_i]$, because it requires inversion of the covariance matrix of all the $Y$'s. However, if $E[Y_i] = 0$ for all $i$ and $E[Y_iY_j^T] = 0$ for $i \ne j$ (i.e., all coordinates of $Y_i$ are orthogonal to constants and to all coordinates of $Y_j$ for $i \ne j$), then
\[ \widehat{E}[X|Y_1, \ldots, Y_n] = \overline{X} + \sum_{i=1}^n \widehat{E}[X - \overline{X}|Y_i], \tag{3.14} \]
where we write $\overline{X}$ for $EX$. The orthogonality principle can be used to prove (3.14) as follows. It suffices to prove that the right side of (3.14) satisfies the two properties that together characterize the left side of (3.14). First, the right side is a linear combination of $1, Y_1, \ldots, Y_n$. Secondly, let $e$ denote the error when the right side of (3.14) is used to estimate $X$:
\[ e = X - \overline{X} - \sum_{i=1}^n \widehat{E}[X - \overline{X}|Y_i]. \]
It must be shown that $E[e(Y_1^Tc_1 + Y_2^Tc_2 + \cdots + Y_n^Tc_n + b)] = 0$ for any constant vectors $c_1, \ldots, c_n$ and constant $b$. It is enough to show that $E[e] = 0$ and $E[eY_j^T] = 0$ for all $j$. But $\widehat{E}[X - \overline{X}|Y_i]$ has the form $B_iY_i$, because $X - \overline{X}$ and $Y_i$ have mean zero. Thus, $E[e] = 0$. Furthermore,
\[ E[eY_j^T] = E\left[\left(X - \widehat{E}[X|Y_j]\right)Y_j^T\right] - \sum_{i: i \ne j} E[B_iY_iY_j^T]. \]
Each term on the right side of this equation is zero, so $E[eY_j^T] = 0$, and (3.14) is proved.

If $1, Y_1, Y_2, \ldots, Y_n$ have finite second moments but are not orthogonal, then (3.14) doesn't directly apply. However, by orthogonalizing this sequence we can obtain a sequence $1, \widetilde{Y}_1, \widetilde{Y}_2, \ldots, \widetilde{Y}_n$ that can be used instead. Let $\widetilde{Y}_1 = Y_1 - E[Y_1]$, and for $k \ge 2$ let
\[ \widetilde{Y}_k = Y_k - \widehat{E}[Y_k|Y_1, \ldots, Y_{k-1}]. \tag{3.15} \]
Then $E[\widetilde{Y}_i] = 0$ for all $i$ and $E[\widetilde{Y}_i\widetilde{Y}_j^T] = 0$ for $i \ne j$. In addition, by induction on $k$, we can prove that the set of all random variables obtained by linear transformation of $1, Y_1, \ldots, Y_k$ is equal to the set of all random variables obtained by linear transformation of $1, \widetilde{Y}_1, \ldots, \widetilde{Y}_k$.
Thus, for any random variable $X$ with finite second moments,
\[ \widehat{E}[X|Y_1, \ldots, Y_n] = \widehat{E}[X|\widetilde{Y}_1, \ldots, \widetilde{Y}_n] = \overline{X} + \sum_{i=1}^n \widehat{E}[X - \overline{X}|\widetilde{Y}_i] = \overline{X} + \sum_{i=1}^n \frac{E[X\widetilde{Y}_i]\widetilde{Y}_i}{E[(\widetilde{Y}_i)^2]}. \]
Moreover, this same result can be used to compute the innovations sequence recursively: $\widetilde{Y}_1 = Y_1 - E[Y_1]$, and
\[ \widetilde{Y}_k = Y_k - E[Y_k] - \sum_{i=1}^{k-1} \frac{E[Y_k\widetilde{Y}_i]\widetilde{Y}_i}{E[(\widetilde{Y}_i)^2]}, \qquad k \ge 2. \]
The sequence $\widetilde{Y}_1, \widetilde{Y}_2, \ldots, \widetilde{Y}_n$ is called the linear innovations sequence for $Y_1, Y_2, \ldots, Y_n$.

3.6 Discrete-time Kalman filtering

Kalman filtering is a state-space approach to the problem of estimating one random sequence from another. Recursive equations are found that are useful in many real-time applications. For notational convenience, because there are so many matrices in this section, lower case letters are used for random vectors. All the random variables involved are assumed to have finite second moments. The state sequence $x_0, x_1, \ldots$ is to be estimated from an observed sequence $y_0, y_1, \ldots$. These sequences of random vectors are assumed to satisfy the following state and observation equations.

State: $x_{k+1} = F_kx_k + w_k$, $k \ge 0$
Observation: $y_k = H_k^Tx_k + v_k$, $k \ge 0$.

It is assumed that
• $x_0, v_0, v_1, \ldots, w_0, w_1, \ldots$ are pairwise uncorrelated.
• $Ex_0 = \overline{x}_0$, $\mathrm{Cov}(x_0) = P_0$, $Ew_k = 0$, $\mathrm{Cov}(w_k) = Q_k$, $Ev_k = 0$, $\mathrm{Cov}(v_k) = R_k$.
• $F_k, H_k, Q_k, R_k$ for $k \ge 0$; $P_0$ are known matrices.
• $\overline{x}_0$ is a known vector.

See Figure 3.2 for a block diagram of the state and observation equations.

[Figure 3.2: Block diagram of the state and observations equations.]

The evolution of the state sequence $x_0, x_1, \ldots$ is driven by the random vectors $w_0, w_1, \ldots$, while the random vectors $v_0, v_1, \ldots$ represent observation noise.

Let $\overline{x}_k = E[x_k]$ and $P_k = \mathrm{Cov}(x_k)$.
These quantities are recursively determined for $k \ge 1$ by
\[ \overline{x}_{k+1} = F_k\overline{x}_k \quad \text{and} \quad P_{k+1} = F_kP_kF_k^T + Q_k, \tag{3.16} \]
where the initial conditions $\overline{x}_0$ and $P_0$ are given as part of the state model. The idea of the Kalman filter equations is to recursively compute conditional expectations in a similar way.

Let $y^k = (y_0, y_1, \ldots, y_k)$ represent the observations up to time $k$. Define for nonnegative integers $i, j$
\[ \widehat{x}_{i|j} = \widehat{E}[x_i|y^j] \]
and the associated covariance of error matrices
\[ \Sigma_{i|j} = \mathrm{Cov}(x_i - \widehat{x}_{i|j}). \]
The goal is to compute $\widehat{x}_{k+1|k}$ for $k \ge 0$. The Kalman filter equations will first be stated, then briefly discussed, and then derived. The Kalman filter equations are given by
\[ \widehat{x}_{k+1|k} = \left[F_k - K_kH_k^T\right]\widehat{x}_{k|k-1} + K_ky_k = F_k\widehat{x}_{k|k-1} + K_k\left[y_k - H_k^T\widehat{x}_{k|k-1}\right] \tag{3.17} \]
with the initial condition $\widehat{x}_{0|-1} = \overline{x}_0$, where the gain matrix $K_k$ is given by
\[ K_k = F_k\Sigma_{k|k-1}H_k\left[H_k^T\Sigma_{k|k-1}H_k + R_k\right]^{-1} \tag{3.18} \]
and the covariance of error matrices are recursively computed by
\[ \Sigma_{k+1|k} = F_k\left[\Sigma_{k|k-1} - \Sigma_{k|k-1}H_k\left(H_k^T\Sigma_{k|k-1}H_k + R_k\right)^{-1}H_k^T\Sigma_{k|k-1}\right]F_k^T + Q_k \tag{3.19} \]
with the initial condition $\Sigma_{0|-1} = P_0$. See Figure 3.3 for the block diagram.

[Figure 3.3: Block diagram of the Kalman filter.]

We comment briefly on the Kalman filter equations, before deriving them. First, observe what happens if $H_k$ is the zero matrix, $H_k = 0$, for all $k$. Then the Kalman filter equations reduce to (3.16) with $\widehat{x}_{k|k-1} = \overline{x}_k$, $\Sigma_{k|k-1} = P_k$ and $K_k = 0$. Taking $H_k = 0$ for all $k$ is equivalent to having no observations available.

In many applications, the sequence of gain matrices can be computed ahead of time according to (3.18) and (3.19). Then as the observations become available, the estimates can be computed using only (3.17).
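The recursion (3.17)-(3.19) can be sketched in a few lines. This is only an illustrative sketch, assuming NumPy and, for brevity, matrices $F$, $H$, $Q$, $R$ that do not depend on $k$; the function name `kalman_one_step_predictor` and all numerical values below are hypothetical and not part of the notes.

```python
import numpy as np

def kalman_one_step_predictor(F, H, Q, R, x0_bar, P0, ys):
    """Run (3.17)-(3.19) for observations ys = [y_0, y_1, ...].

    Returns the lists of xhat_{k+1|k}, Sigma_{k+1|k}, and gains K_k.
    The observation model is y_k = H^T x_k + v_k, so H has one column
    per observation coordinate.
    """
    x_hat = np.asarray(x0_bar, dtype=float)  # xhat_{0|-1} = mean of x_0
    Sigma = np.asarray(P0, dtype=float)      # Sigma_{0|-1} = P_0
    x_hats, Sigmas, gains = [], [], []
    for y in ys:
        # Inverse of H^T Sigma_{k|k-1} H + R, shared by (3.18) and (3.19).
        S_inv = np.linalg.inv(H.T @ Sigma @ H + R)
        K = F @ Sigma @ H @ S_inv                      # gain, (3.18)
        x_hat = F @ x_hat + K @ (y - H.T @ x_hat)      # estimate, (3.17)
        Sigma = F @ (Sigma - Sigma @ H @ S_inv @ H.T @ Sigma) @ F.T + Q  # (3.19)
        x_hats.append(x_hat)
        Sigmas.append(Sigma)
        gains.append(K)
    return x_hats, Sigmas, gains

# Hypothetical two-dimensional state observed through its first coordinate.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0], [0.0]])
Q = 0.1 * np.eye(2)
R = np.array([[1.0]])
ys = [np.array([1.0]), np.array([2.1]), np.array([2.9])]
x_hats, Sigmas, gains = kalman_one_step_predictor(F, H, Q, R, np.zeros(2), np.eye(2), ys)
print(x_hats[-1])
```

Note that the gains and error covariances never touch the observations, which is why they can be computed ahead of time. Passing an all-zero `H` makes every gain zero and reduces the covariance recursion to (3.16), matching the remark above about having no observations.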
In some applications the matrices involved in the state and observation models, including the covariance matrices of the $v_k$'s and $w_k$'s, do not depend on $k$. The gain matrices $K_k$ could still depend on $k$ due to the initial conditions, but if the model is stable in some sense, then the gains converge to a constant matrix $K$, so that in steady state the filter equation (3.17) becomes time invariant: $\widehat{x}_{k+1|k} = (F - KH^T)\widehat{x}_{k|k-1} + Ky_k$.

In other applications, particularly those involving feedback control, the matrices in the state and/or observation equations might not be known until just before they are needed.

The Kalman filter equations are now derived. Roughly speaking, there are two considerations for computing $\widehat{x}_{k+1|k}$ once $\widehat{x}_{k|k-1}$ is computed: (1) the time update, accounting for the change in state from $x_k$ to $x_{k+1}$, and (2) the information update, accounting for the availability of the new observation $y_k$. Indeed, for $k \ge 0$, we can write
\[ \widehat{x}_{k+1|k} = \widehat{x}_{k+1|k-1} + \left[\widehat{x}_{k+1|k} - \widehat{x}_{k+1|k-1}\right], \tag{3.20} \]
where the first term on the right of (3.20), namely $\widehat{x}_{k+1|k-1}$, represents the result of modifying $\widehat{x}_{k|k-1}$ to take into account the passage of time on the state, but with no new observation. The difference in square brackets in (3.20) is the contribution of the new observation, $y_k$, to the estimation of $x_{k+1}$.

Time update: In view of the state update equation and the fact that $w_k$ is uncorrelated with the random variables of $y^{k-1}$ and has mean zero,
\[ \widehat{x}_{k+1|k-1} = \widehat{E}[F_kx_k + w_k|y^{k-1}] = F_k\widehat{E}[x_k|y^{k-1}] + \widehat{E}[w_k|y^{k-1}] = F_k\widehat{x}_{k|k-1}. \tag{3.21} \]
Thus, the time update consists of simply multiplying the previous estimate by $F_k$. If there were no new observation, then this would be the entire Kalman filter. Furthermore, the covariance of error matrix for predicting $x_{k+1}$ by $\widehat{x}_{k+1|k-1}$ is given by
\[ \Sigma_{k+1|k-1} = \mathrm{Cov}(x_{k+1} - \widehat{x}_{k+1|k-1}) = \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}) + w_k) = F_k\Sigma_{k|k-1}F_k^T + Q_k. \]
(3.22)

Information update: Consider next the new observation $y_k$. The observation $y_k$ is not totally new, for it can be predicted in part from the previous observations, or simply by its mean in the case $k = 0$. Specifically, we can consider $\widetilde{y}_k = y_k - \widehat{E}[y_k|y^{k-1}]$ to be the new part of the observation $y_k$. Here, $\widetilde{y}_0, \widetilde{y}_1, \ldots$ is the linear innovation sequence for the observation sequence $y_0, y_1, \ldots$, as defined in Section 3.5 (with the minor difference that here the vectors are indexed from time $k = 0$ on, rather than from time $k = 1$). Since the linear span of the random variables in $(1, y^{k-1}, y_k)$ is the same as the linear span of the random variables in $(1, y^{k-1}, \widetilde{y}_k)$, for the purposes of incorporating the new observation we can pretend that $\widetilde{y}_k$ is the new observation rather than $y_k$. By the observation equation and the facts $E[v_k] = 0$ and $E[y^{k-1}v_k^T] = 0$, it follows that
\[ \widehat{E}[y_k|y^{k-1}] = \widehat{E}\left[H_k^Tx_k + v_k|y^{k-1}\right] = H_k^T\widehat{x}_{k|k-1}, \]
so $\widetilde{y}_k = y_k - H_k^T\widehat{x}_{k|k-1}$. Since $(1, y^{k-1}, y_k)$ and $(1, \widetilde{y}^{k-1}, \widetilde{y}_k)$ have the same span and the random variables in $\widetilde{y}^{k-1}$ are orthogonal to the random variables in $\widetilde{y}_k$, and all these random variables have mean zero,
\[ \widehat{x}_{k+1|k} = \widehat{E}\left[x_{k+1}|\widetilde{y}^{k-1}, \widetilde{y}_k\right] = \overline{x}_{k+1} + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1}|\widetilde{y}^{k-1}\right] + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1}|\widetilde{y}_k\right] = \widehat{x}_{k+1|k-1} + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1}|\widetilde{y}_k\right]. \]
Therefore, the term to be added for the information update (in square brackets in (3.20)) is $\widehat{E}[x_{k+1} - \overline{x}_{k+1}|\widetilde{y}_k]$. Since $x_{k+1} - \overline{x}_{k+1}$ and $\widetilde{y}_k$ both have mean zero, the information update term can be simplified to:
\[ \widehat{E}\left[x_{k+1} - \overline{x}_{k+1}|\widetilde{y}_k\right] = K_k\widetilde{y}_k, \tag{3.23} \]
where (use the fact $\mathrm{Cov}(x_{k+1} - \overline{x}_{k+1}, \widetilde{y}_k) = \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)$ because $E[\widetilde{y}_k] = 0$)
\[ K_k = \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)\mathrm{Cov}(\widetilde{y}_k)^{-1}. \]
(3.24)

Putting it all together: Equation (3.20), with the time update equation (3.21) and the fact the information update term is $K_k\widetilde{y}_k$, yields the main Kalman filter equation:
\[ \widehat{x}_{k+1|k} = F_k\widehat{x}_{k|k-1} + K_k\widetilde{y}_k. \tag{3.25} \]
Taking into account the new observation $\widetilde{y}_k$, which is orthogonal to the previous observations, yields a reduction in the covariance of error:
\[ \Sigma_{k+1|k} = \Sigma_{k+1|k-1} - \mathrm{Cov}(K_k\widetilde{y}_k). \tag{3.26} \]
The Kalman filter equations (3.17), (3.18), and (3.19) follow easily from (3.25), (3.24), and (3.26), as follows. To convert (3.24) into (3.18), use
\begin{align*}
\mathrm{Cov}(x_{k+1}, \widetilde{y}_k) &= \mathrm{Cov}(F_kx_k + w_k,\ H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \tag{3.27} \\
&= \mathrm{Cov}(F_kx_k,\ H_k^T(x_k - \widehat{x}_{k|k-1})) \\
&= \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}),\ H_k^T(x_k - \widehat{x}_{k|k-1})) \\
&= F_k\Sigma_{k|k-1}H_k
\end{align*}
and
\begin{align*}
\mathrm{Cov}(\widetilde{y}_k) &= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \\
&= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1})) + \mathrm{Cov}(v_k) \\
&= H_k^T\Sigma_{k|k-1}H_k + R_k.
\end{align*}
To convert (3.26) into (3.19) use (3.22) and
\[ \mathrm{Cov}(K_k\widetilde{y}_k) = K_k\mathrm{Cov}(\widetilde{y}_k)K_k^T = \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)\mathrm{Cov}(\widetilde{y}_k)^{-1}\mathrm{Cov}(\widetilde{y}_k, x_{k+1}). \]
This completes the derivation of the Kalman filtering equations.

3.7 Problems

3.1 Rotation of a joint normal distribution yielding independence
Let $X$ be a Gaussian vector with
\[ E[X] = \begin{pmatrix} 10 \\ 5 \end{pmatrix}, \qquad \mathrm{Cov}(X) = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}. \]
(a) Write an expression for the pdf of $X$ that does not use matrix notation.
(b) Find a vector $b$ and orthonormal matrix $U$ such that the vector $Y$ defined by $Y = U^T(X - b)$ is a mean zero Gaussian vector such that $Y_1$ and $Y_2$ are independent.

3.2 Linear approximation of the cosine function over an interval
Let $\Theta$ be uniformly distributed on the interval $[0, \pi]$ (yes, $[0, \pi]$, not $[0, 2\pi]$). Suppose $Y = \cos(\Theta)$ is to be estimated by an estimator of the form $a + b\Theta$. What numerical values of $a$ and $b$ minimize the mean square error?
3.3 Calculation of some minimum mean square error estimators
Let $Y = X + N$, where $X$ has the exponential distribution with parameter $\lambda$, and $N$ is Gaussian with mean 0 and variance $\sigma^2$. The variables $X$ and $N$ are independent, and the parameters $\lambda$ and $\sigma^2$ are strictly positive. (Recall that $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.)
(a) Find $\widehat{E}[X|Y]$ and also find the mean square error for estimating $X$ by $\widehat{E}[X|Y]$.
(b) Does $E[X|Y] = \widehat{E}[X|Y]$? Justify your answer. (Hint: Answer is yes if and only if there is no estimator for $X$ of the form $g(Y)$ with a smaller MSE than $\widehat{E}[X|Y]$.)

3.4 Valid covariance matrix
For what real values of $a$ and $b$ is the following matrix the covariance matrix of some real-valued random vector?
\[ K = \begin{pmatrix} 2 & 1 & b \\ a & 1 & 0 \\ b & 0 & 1 \end{pmatrix}. \]
Hint: A symmetric $n \times n$ matrix is positive semidefinite if and only if the determinant of every matrix obtained by deleting a set of rows and the corresponding set of columns is nonnegative.

3.5 Conditional probabilities with joint Gaussians I
Let $\begin{pmatrix} X \\ Y \end{pmatrix}$ be a mean zero Gaussian vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$.
(a) Express $P[X \le 1|Y]$ in terms of $\rho$, $Y$, and the standard normal CDF, $\Phi$.
(b) Find $E[(X - Y)^2|Y = y]$ for real values of $y$.

3.6 Conditional probabilities with joint Gaussians II
Let $X, Y$ be jointly Gaussian random variables with mean zero and covariance matrix
\[ \mathrm{Cov}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 6 & 18 \end{pmatrix}. \]
You may express your answers in terms of the $\Phi$ function defined by $\Phi(u) = \int_{-\infty}^u \frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds$.
(a) Find $P[|X - 1| \ge 2]$.
(b) What is the conditional density of $X$ given that $Y = 3$? You can either write out the density in full, or describe it as a well known density with specified parameter values.
(c) Find $P[|X - E[X|Y]| \ge 1]$.

3.7 An estimation error bound
Suppose the random vector $\begin{pmatrix} X \\ Y \end{pmatrix}$ has mean vector $\begin{pmatrix} 2 \\ -2 \end{pmatrix}$ and covariance matrix $\begin{pmatrix} 8 & 3 \\ 3 & 2 \end{pmatrix}$. Let $e = X - E[X|Y]$.
(a) If possible, compute $E[e^2]$. If not, give an upper bound.
(b) For what joint distribution of $X$ and $Y$ (consistent with the given information) is $E[e^2]$ maximized? Is your answer unique?

3.8 An MMSE estimation problem
(a) Let $X$ and $Y$ be jointly uniformly distributed over the triangular region in the $x$-$y$ plane with corners (0,0), (0,1), and (1,2). Find both the linear minimum mean square error (LMMSE) estimator of $X$ given $Y$ and the (possibly nonlinear) MMSE estimator of $X$ given $Y$. Compute the mean square error for each estimator. What percentage reduction in MSE does the MMSE estimator provide over the LMMSE?
(b) Repeat part (a) assuming $Y$ is a $N(0, 1)$ random variable and $X = |Y|$.

3.9 Comparison of MMSE estimators for an example
Let $X = \frac{1}{1+U}$, where $U$ is uniformly distributed over the interval $[0, 1]$.
(a) Find $E[X|U]$ and calculate the MSE, $E[(X - E[X|U])^2]$.
(b) Find $\widehat{E}[X|U]$ and calculate the MSE, $E[(X - \widehat{E}[X|U])^2]$.

3.10 Conditional Gaussian comparison
Suppose that $X$ and $Y$ are jointly Gaussian, mean zero, with $\mathrm{Var}(X) = \mathrm{Var}(Y) = 10$ and $\mathrm{Cov}(X, Y) = 8$. Express the following probabilities in terms of the $Q$ function.
(a) $p_a \doteq P\{X \ge 2\}$.
(b) $p_b \doteq P[X \ge 2|Y = 3]$.
(c) $p_c \doteq P[X \ge 2|Y \ge 3]$. (Note: $p_c$ can be expressed as an integral. You need not carry out the integration.)
(d) Indicate how $p_a$, $p_b$, and $p_c$ are ordered, from smallest to largest.

3.11 Diagonalizing a two-dimensional Gaussian distribution
Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ be a mean zero Gaussian random vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$. Find an orthonormal 2 by 2 matrix $U$ such that $X = UY$ for a Gaussian vector $Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$ such that $Y_1$ is independent of $Y_2$. Also, find the variances of $Y_1$ and $Y_2$.

Note: The following identity might be useful for some of the problems that follow. If $A$, $B$, $C$, and $D$ are jointly Gaussian and mean zero, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.
This implies that $E[A^4] = 3E[A^2]^2$, $\mathrm{Var}(A^2) = 2E[A^2]^2$, and $\mathrm{Cov}(A^2, B^2) = 2\mathrm{Cov}(A, B)^2$. Also, $E[A^2B] = 0$.

3.12 An estimator of an estimator
Let $X$ and $Y$ be square integrable random variables and let $Z = E[X|Y]$, so $Z$ is the MMSE estimator of $X$ given $Y$. Show that the LMMSE estimator of $X$ given $Y$ is also the LMMSE estimator of $Z$ given $Y$. (Can you generalize this result?)

3.13 Projections onto nested linear subspaces
(a) Use the Orthogonality Principle to prove the following statement: Suppose $\mathcal{V}_0$ and $\mathcal{V}_1$ are two closed linear spaces of second order random variables, such that $\mathcal{V}_0 \supset \mathcal{V}_1$, and suppose $X$ is a random variable with finite second moment. Let $Z_i^*$ be the random variable in $\mathcal{V}_i$ with the minimum mean square distance from $X$. Then $Z_1^*$ is the variable in $\mathcal{V}_1$ with the minimum mean square distance from $Z_0^*$.
(b) Suppose that $X$, $Y_1$, and $Y_2$ are random variables with finite second moments. For each of the following three statements, identify the choice of subspace $\mathcal{V}_0$ and $\mathcal{V}_1$ such that the statement follows from part (a):
(i) $\widehat{E}[X|Y_1] = \widehat{E}[\,\widehat{E}[X|Y_1, Y_2]\,|Y_1]$.
(ii) $E[X|Y_1] = E[\,E[X|Y_1, Y_2]\,|Y_1]$. (Sometimes called the "tower property.")
(iii) $E[X] = E[\widehat{E}[X|Y_1]]$. (Think of the expectation of a random variable as the constant closest to the random variable, in the m.s. sense.)

3.14 Some identities for estimators
Let $X$ and $Y$ be random variables with $E[X^2] < \infty$. For each of the following statements, determine if the statement is true. If yes, give a justification using the orthogonality principle. If no, give a counter example.
(a) $E[X\cos(Y)|Y] = E[X|Y]\cos(Y)$
(b) $E[X|Y] = E[X|Y^3]$
(c) $E[X^3|Y] = E[X|Y]^3$
(d) $E[X|Y] = E[X|Y^2]$
(e) $\widehat{E}[X|Y] = \widehat{E}[X|Y^3]$

3.15 Some identities for estimators, version 2
Let $X$, $Y$, and $Z$ be random variables with finite second moments and suppose $X$ is to be estimated. For each of the following, if true, give a brief explanation. If false, give a counter example.
(a) $E[(X - E[X|Y])^2] \le E[(X - \widehat{E}[X|Y, Y^2])^2]$.
(b) $E[(X - E[X|Y])^2] = E[(X - \widehat{E}[X|Y, Y^2])^2]$ if $X$ and $Y$ are jointly Gaussian.
(c) $E[(X - E[\,E[X|Z]\,|Y])^2] \le E[(X - E[X|Y])^2]$.
(d) If $E[(X - E[X|Y])^2] = \mathrm{Var}(X)$, then $X$ and $Y$ are independent.

3.16 Some simple examples
Give an example of each of the following, and in each case, explain your reasoning.
(a) Two random variables $X$ and $Y$ such that $\widehat{E}[X|Y] = E[X|Y]$, and such that $E[X|Y]$ is not simply constant, and $X$ and $Y$ are not jointly Gaussian.
(b) A pair of random variables $X$ and $Y$ on some probability space such that $X$ is Gaussian, $Y$ is Gaussian, but $X$ and $Y$ are not jointly Gaussian.
(c) Three random variables $X$, $Y$, and $Z$, which are pairwise independent, but all three together are not independent.

3.17 The square root of a positive-semidefinite matrix
(a) True or false? If $B$ is a square matrix over the reals, then $BB^T$ is positive semidefinite.
(b) True or false? If $K$ is a symmetric positive semidefinite matrix over the reals, then there exists a symmetric positive semidefinite matrix $S$ over the reals such that $K = S^2$. (Hint: What if $K$ is also diagonal?)

3.18 Estimating a quadratic
Let $\begin{pmatrix} X \\ Y \end{pmatrix}$ be a mean zero Gaussian vector with correlation matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where $|\rho| < 1$.
(a) Find $E[X^2|Y]$, the best estimator of $X^2$ given $Y$.
(b) Compute the mean square error for the estimator $E[X^2|Y]$.
(c) Find $\widehat{E}[X^2|Y]$, the best linear (actually, affine) estimator of $X^2$ given $Y$, and compute the mean square error.

3.19 A quadratic estimator
Suppose $Y$ has the $N(0, 1)$ distribution and that $X = |Y|$. Find the estimator for $X$ of the form $\widehat{X} = a + bY + cY^2$ which minimizes the mean square error. (You can use the following numerical values: $E[|Y|] = 0.8$, $E[Y^4] = 3$, $E[|Y|Y^2] = 1.6$.)
(a) Use the orthogonality principle to derive equations for $a$, $b$, and $c$.
(b) Find the estimator $\widehat{X}$.
(c) Find the resulting minimum mean square error.
3.20 An innovations sequence and its application
Let $\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ X \end{pmatrix}$ be a mean zero random vector with correlation matrix
\[ \begin{pmatrix} 1 & 0.5 & 0.5 & 0 \\ 0.5 & 1 & 0.5 & 0.25 \\ 0.5 & 0.5 & 1 & 0.25 \\ 0 & 0.25 & 0.25 & 1 \end{pmatrix}. \]
(a) Let $\widetilde{Y}_1, \widetilde{Y}_2, \widetilde{Y}_3$ denote the innovations sequence. Find the matrix $A$ so that $\begin{pmatrix} \widetilde{Y}_1 \\ \widetilde{Y}_2 \\ \widetilde{Y}_3 \end{pmatrix} = A\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}$.
(b) Find the correlation matrix of $\begin{pmatrix} \widetilde{Y}_1 \\ \widetilde{Y}_2 \\ \widetilde{Y}_3 \end{pmatrix}$ and the cross covariance matrix $\mathrm{Cov}\left(X, \begin{pmatrix} \widetilde{Y}_1 \\ \widetilde{Y}_2 \\ \widetilde{Y}_3 \end{pmatrix}\right)$.
(c) Find the constants $a$, $b$, and $c$ to minimize $E[(X - a\widetilde{Y}_1 - b\widetilde{Y}_2 - c\widetilde{Y}_3)^2]$.

3.21 Estimation for an additive Gaussian noise model
Assume $x$ and $n$ are independent Gaussian vectors with means $\bar{x}$, $\bar{n}$ and covariance matrices $\Sigma_x$ and $\Sigma_n$. Let $y = x + n$. Then $x$ and $y$ are jointly Gaussian.
(a) Show that $E[x|y]$ is given by either
\[ \bar{x} + \Sigma_x(\Sigma_x + \Sigma_n)^{-1}(y - (\bar{x} + \bar{n})) \quad \text{or} \quad \Sigma_n(\Sigma_x + \Sigma_n)^{-1}\bar{x} + \Sigma_x(\Sigma_x + \Sigma_n)^{-1}(y - \bar{n}). \]
(b) Show that the conditional covariance matrix of $x$ given $y$ is given by any of the three expressions:
\[ \Sigma_x - \Sigma_x(\Sigma_x + \Sigma_n)^{-1}\Sigma_x = \Sigma_x(\Sigma_x + \Sigma_n)^{-1}\Sigma_n = (\Sigma_x^{-1} + \Sigma_n^{-1})^{-1}. \]
(Assume that the various inverses exist.)

3.22 A Kalman filtering example
(a) Let $\sigma^2 > 0$, let $f$ be a real constant, and let $x_0$ denote a $N(0, \sigma^2)$ random variable. Consider the state and observation sequences defined by:
(state) $x_{k+1} = fx_k + w_k$
(observation) $y_k = x_k + v_k$
where $w_1, w_2, \ldots; v_1, v_2, \ldots$ are mutually independent $N(0, 1)$ random variables. Write down the Kalman filter equations for recursively computing the estimates $\widehat{x}_{k|k-1}$, the (scalar) gains $K_k$, and the sequence of the variances of the errors (for brevity write $\sigma_k^2$ for the covariance of error instead of $\Sigma_{k|k-1}$).
(b) For what values of $f$ is the sequence of error variances bounded?

3.23 Steady state gains for one-dimensional Kalman filter
This is a continuation of the previous problem.
(a) Show that $\lim_{k \to \infty}\sigma_k^2$ exists.
(b) Express the limit, $\sigma_\infty^2$, in terms of $f$.
(c) Explain why $\sigma_\infty^2 = 1$ if $f = 0$.

3.24 A variation of Kalman filtering
(a) Let $\sigma^2 > 0$, let $f$ be a real constant, and let $x_0$ denote a $N(0, \sigma^2)$ random variable. Consider the state and observation sequences defined by:
(state) $x_{k+1} = fx_k + w_k$
(observation) $y_k = x_k + w_k$
where $w_1, w_2, \ldots$ are mutually independent $N(0, 1)$ random variables. Note that the state and observation equations are driven by the same sequence, so that some of the Kalman filtering equations derived in the notes do not apply. Derive recursive equations needed to compute $\widehat{x}_{k|k-1}$, including recursive equations for any needed gains or variances of error. (Hints: What modifications need to be made to the derivation for the standard model? Check that your answer is correct for $f = 1$.)

3.25 The Kalman filter for $\widehat{x}_{k|k}$
Suppose in a given application a Kalman filter has been implemented to recursively produce $\widehat{x}_{k+1|k}$ for $k \ge 0$, as in class. Thus by time $k$, $\widehat{x}_{k+1|k}$, $\Sigma_{k+1|k}$, $\widehat{x}_{k|k-1}$, and $\Sigma_{k|k-1}$ are already computed. Suppose that it is desired to also compute $\widehat{x}_{k|k}$ at time $k$. Give additional equations that can be used to compute $\widehat{x}_{k|k}$. (You can assume as given the equations in the class notes, and don't need to write them all out. Only the additional equations are asked for here. Be as explicit as you can, expressing any matrices you use in terms of the matrices already given in the class notes.)

3.26 An innovations problem
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, each uniformly distributed on the interval $[0, 1]$. Let $Y_0 = 1$, and $Y_n = U_1U_2\cdots U_n$ for $n \ge 1$.
(a) Find the variance of $Y_n$ for each $n \ge 1$.
(b) Find $E[Y_n|Y_0, \ldots, Y_{n-1}]$ for $n \ge 1$.
(c) Find $\widehat{E}[Y_n|Y_0, \ldots, Y_{n-1}]$ for $n \ge 1$.
(d) Find the linear innovations sequence $\widetilde{Y} = (\widetilde{Y}_0, \widetilde{Y}_1, \ldots)$.
(e) Fix a positive integer $M$ and let $X_M = U_1 + \ldots + U_M$.
Using the answer to part (d), find $\widehat{E}[X_M|Y_0, \ldots, Y_M]$, the best linear estimator of $X_M$ given $(Y_0, \ldots, Y_M)$.

3.27 Linear innovations and orthogonal polynomials for the normal distribution
(a) Let $X$ be a $N(0, 1)$ random variable. Show that for integers $n \ge 0$,
\[ E[X^n] = \begin{cases} \dfrac{n!}{(n/2)!\,2^{n/2}} & n \text{ even} \\ 0 & n \text{ odd} \end{cases} \]
Hint: One approach is to apply the power series expansion for $e^x$ on each side of the identity $E[e^{uX}] = e^{u^2/2}$, and identify the coefficients of $u^n$.
(b) Let $X$ be a $N(0, 1)$ random variable, and let $Y_n = X^n$ for integers $n \ge 1$. Express the first four terms, $\widetilde{Y}_1$ through $\widetilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $X$.

3.28 Linear innovations and orthogonal polynomials for the uniform distribution
(a) Let $U$ be uniformly distributed on the interval $[-1, 1]$. Show that for integers $n \ge 0$,
\[ E[U^n] = \begin{cases} \dfrac{1}{n+1} & n \text{ even} \\ 0 & n \text{ odd} \end{cases} \]
(b) Let $Y_n = U^n$ for integers $n \ge 0$. Note that $Y_0 \equiv 1$. Express the first four terms, $\widetilde{Y}_1$ through $\widetilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $U$.

3.29 Representation of three random variables with equal cross covariances
Let $K$ be a matrix of the form
\[ K = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix}, \]
where $a \in \mathbb{R}$.
(a) For what values of $a$ is $K$ the covariance matrix of some random vector?
(b) Let $a$ have one of the values found in part (a). Fill in the missing entries of the matrix $U$,
\[ U = \begin{pmatrix} * & * & \frac{1}{\sqrt{3}} \\ * & * & \frac{1}{\sqrt{3}} \\ * & * & \frac{1}{\sqrt{3}} \end{pmatrix}, \]
to yield an orthonormal matrix, and find a diagonal matrix $\Lambda$ with nonnegative entries, so that if $Z$ is a three dimensional random vector with $\mathrm{Cov}(Z) = I$, then $U\Lambda^{1/2}Z$ has covariance matrix $K$. (Hint: It happens that the matrix $U$ can be selected independently of $a$. Also, $1 + 2a$ is an eigenvalue of $K$.)
3.30 Kalman filter for a rotating state
Consider the Kalman state and observation equations for the following matrices, where $\theta_o = 2\pi/10$ (the matrices don't depend on time, so the subscript $k$ is omitted):
\[ F = (0.99)\begin{pmatrix} \cos(\theta_o) & -\sin(\theta_o) \\ \sin(\theta_o) & \cos(\theta_o) \end{pmatrix}, \quad H = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad R = 1. \]
(a) Explain in words what successive iterates $F^nx_o$ are like, for a nonzero initial state $x_o$ (this is the same as the state equation, but with the random term $w_k$ left off).
(b) Write out the Kalman filter equations for this example, simplifying as much as possible (but no more than possible! The equations don't simplify all that much.)

3.31 * Proof of the orthogonality principle
Prove the seven statements lettered (a)-(g) in what follows. Let $X$ be a random variable and let $\mathcal{V}$ be a collection of random variables on the same probability space such that
(i) $E[Z^2] < +\infty$ for each $Z \in \mathcal{V}$.
(ii) $\mathcal{V}$ is a linear class, i.e., if $Z, Z' \in \mathcal{V}$ then so is $aZ + bZ'$ for any real numbers $a$ and $b$.
(iii) $\mathcal{V}$ is closed in the sense that if $Z_n \in \mathcal{V}$ for each $n$ and $Z_n$ converges to a random variable $Z$ in the mean square sense, then $Z \in \mathcal{V}$.
The Orthogonality Principle is that there exists a unique element $Z^* \in \mathcal{V}$ so that $E[(X - Z^*)^2] \le E[(X - Z)^2]$ for all $Z \in \mathcal{V}$. Furthermore, a random variable $W \in \mathcal{V}$ is equal to $Z^*$ if and only if $(X - W) \perp Z$ for all $Z \in \mathcal{V}$. ($(X - W) \perp Z$ means $E[(X - W)Z] = 0$.)
The remainder of this problem is aimed at a proof. Let $d = \inf\{E[(X - Z)^2] : Z \in \mathcal{V}\}$. By definition of infimum there exists a sequence $Z_n \in \mathcal{V}$ so that $E[(X - Z_n)^2] \to d$ as $n \to +\infty$.
(a) The sequence $Z_n$ is Cauchy in the mean square sense. (Hint: Use the "parallelogram law": $E[(U - V)^2] + E[(U + V)^2] = 2(E[U^2] + E[V^2])$.) Thus, by the Cauchy criteria, there is a random variable $Z^*$ such that $Z_n$ converges to $Z^*$ in the mean square sense.
(b) $Z^*$ satisfies the conditions advertised in the first sentence of the principle.
(c) The element $Z^*$ satisfying the condition in the first sentence of the principle is unique. (Consider two random variables that are equal to each other with probability one to be the same.) This completes the proof of the first sentence.
(d) ("if" part of second sentence). If $W \in \mathcal{V}$ and $(X - W) \perp Z$ for all $Z \in \mathcal{V}$, then $W = Z^*$.
(The "only if" part of second sentence is divided into three parts:)
(e) $E[(X - Z^* - cZ)^2] \geq E[(X - Z^*)^2]$ for any real constant $c$.
(f) $-2c E[(X - Z^*)Z] + c^2 E[Z^2] \geq 0$ for any real constant $c$.
(g) $(X - Z^*) \perp Z$, and the principle is proved.

3.32 * The span of two closed subspaces is closed
Check that the span, $\mathcal{V} = \mathcal{V}_1 \oplus \mathcal{V}_2$, of two closed linear spaces (defined in Proposition 3.2.4) is also a closed linear space. A hint for showing that $\mathcal{V}$ is closed is to use the fact that if $(Z_n)$ is a m.s. convergent sequence of random variables in $\mathcal{V}$, then each variable in the sequence can be represented as $Z_n = Z_{n,1} + Z_{n,2}$, where $Z_{n,i} \in \mathcal{V}_i$, and $E[(Z_n - Z_m)^2] = E[(Z_{n,1} - Z_{m,1})^2] + E[(Z_{n,2} - Z_{m,2})^2]$.

3.33 * Von Neumann's alternating projections algorithm
Let $\mathcal{V}_1$ and $\mathcal{V}_2$ be closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$, and let $X \in L^2(\Omega, \mathcal{F}, P)$. Define a sequence $(Z_n : n \geq 0)$ recursively, by alternating projections onto $\mathcal{V}_1$ and $\mathcal{V}_2$, as follows. Let $Z_0 = X$, and for $k \geq 0$, let $Z_{2k+1} = \Pi_{\mathcal{V}_1}(Z_{2k})$ and $Z_{2k+2} = \Pi_{\mathcal{V}_2}(Z_{2k+1})$. The goal of this problem is to show that $Z_n \xrightarrow{m.s.} \Pi_{\mathcal{V}_1 \cap \mathcal{V}_2}(X)$. The approach will be to establish that $(Z_n)$ converges in the m.s. sense, by verifying the Cauchy criteria, and then use the orthogonality principle to identify the limit. Define $D(i, j) = E[(Z_i - Z_j)^2]$ for $i \geq 0$ and $j \geq 0$, and let $\epsilon_i = D(i + 1, i)$ for $i \geq 0$.
(a) Show that $\epsilon_i = E[(Z_i)^2] - E[(Z_{i+1})^2]$.
(b) Show that $\sum_{i=0}^{\infty} \epsilon_i \leq E[X^2] < \infty$.
(c) Use the orthogonality principle to show that for $n \geq 1$ and $k \geq 0$:
\[
\begin{aligned}
D(n, n + 2k + 1) &= \epsilon_n + D(n + 1, n + 2k + 1) \\
D(n, n + 2k + 2) &= D(n, n + 2k + 1) - \epsilon_{n+2k+1}.
\end{aligned}
\]
(d) Use the above equations to show that for $n \geq 1$ and $k \geq 0$,
\[
\begin{aligned}
D(n, n + 2k + 1) &= \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k}) \\
D(n, n + 2k + 2) &= \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k+1}).
\end{aligned}
\]
Consequently, $D(n, m) \leq \sum_{i=n}^{m-1} \epsilon_i$ for $1 \leq n < m$, and therefore $(Z_n : n \geq 0)$ is a Cauchy sequence, so $Z_n \xrightarrow{m.s.} Z_\infty$ for some random variable $Z_\infty$.
(e) Verify that $Z_\infty \in \mathcal{V}_1 \cap \mathcal{V}_2$.
(f) Verify that $(X - Z_\infty) \perp Z$ for any $Z \in \mathcal{V}_1 \cap \mathcal{V}_2$. (Hint: Explain why $(X - Z_n) \perp Z$ for all $n$, and let $n \to \infty$.)
By the orthogonality principle, (e) and (f) imply that $Z_\infty = \Pi_{\mathcal{V}_1 \cap \mathcal{V}_2}(X)$.

Chapter 4
Random Processes

4.1 Definition of a random process

A random process $X$ is an indexed collection $X = (X_t : t \in \mathbb{T})$ of random variables, all on the same probability space $(\Omega, \mathcal{F}, P)$. In many applications the index set $\mathbb{T}$ is a set of times. If $\mathbb{T} = \mathbb{Z}$, or more generally, if $\mathbb{T}$ is a set of consecutive integers, then $X$ is called a discrete-time random process. If $\mathbb{T} = \mathbb{R}$ or if $\mathbb{T}$ is an interval of $\mathbb{R}$, then $X$ is called a continuous-time random process. Three ways to view a random process $X = (X_t : t \in \mathbb{T})$ are as follows:
• For each $t$ fixed, $X_t$ is a function on $\Omega$.
• $X$ is a function on $\mathbb{T} \times \Omega$ with value $X_t(\omega)$ for given $t \in \mathbb{T}$ and $\omega \in \Omega$.
• For each $\omega$ fixed with $\omega \in \Omega$, $X_t(\omega)$ is a function of $t$, called the sample path corresponding to $\omega$.

Example 4.1.1 Suppose $W_1, W_2, \ldots$ are independent random variables with $P\{W_k = 1\} = P\{W_k = -1\} = \frac{1}{2}$ for each $k$, and suppose $X_0 = 0$ and $X_n = W_1 + \cdots + W_n$ for positive integers $n$. Let $W = (W_k : k \geq 1)$ and $X = (X_n : n \geq 0)$. Then $W$ and $X$ are both discrete-time random processes. The index set $\mathbb{T}$ for $X$ is $\mathbb{Z}_+$. A sample path of $W$ and a corresponding sample path of $X$ are shown in Figure 4.1.

[Figure 4.1: Typical sample paths.]

The following notation is used:
\[
\begin{aligned}
\mu_X(t) &= E[X_t] \\
R_X(s, t) &= E[X_s X_t] \\
C_X(s, t) &= \mathrm{Cov}(X_s, X_t) \\
F_{X,n}(x_1, t_1; \ldots; x_n, t_n) &= P\{X_{t_1} \leq x_1, \ldots, X_{t_n} \leq x_n\}
\end{aligned}
\]
and $\mu_X$ is called the mean function, $R_X$ is called the correlation function, $C_X$ is called the covariance function, and $F_{X,n}$ is called the $n$th order CDF. Sometimes the prefix "auto," meaning "self," is added to the words correlation and covariance, to emphasize that only one random process is involved.

Definition 4.1.2 A second order random process is a random process $(X_t : t \in \mathbb{T})$ such that $E[X_t^2] < +\infty$ for all $t \in \mathbb{T}$.

The mean, correlation, and covariance functions of a second order random process are all well-defined and finite.
If $X_t$ is a discrete random variable for each $t$, then the $n$th order pmf of $X$ is defined by
\[
p_{X,n}(x_1, t_1; \ldots; x_n, t_n) = P\{X_{t_1} = x_1, \ldots, X_{t_n} = x_n\}.
\]
Similarly, if $X_{t_1}, \ldots, X_{t_n}$ are jointly continuous random variables for any distinct $t_1, \ldots, t_n$ in $\mathbb{T}$, then $X$ has an $n$th order pdf $f_{X,n}$, such that for $t_1, \ldots, t_n$ fixed, $f_{X,n}(x_1, t_1; \ldots; x_n, t_n)$ is the joint pdf of $X_{t_1}, \ldots, X_{t_n}$.

Example 4.1.3 Let $A$ and $B$ be independent, $N(0, 1)$ random variables. Suppose $X_t = A + Bt + t^2$ for all $t \in \mathbb{R}$. Let us describe the sample functions, the mean, correlation, and covariance functions, and the first and second order pdfs of $X$.
Each sample function corresponds to some fixed $\omega$ in $\Omega$. For $\omega$ fixed, $A(\omega)$ and $B(\omega)$ are numbers. The sample paths all have the same shape: they are parabolas with constant second derivative equal to 2. The sample path for $\omega$ fixed has $t = 0$ intercept $A(\omega)$, and minimum value $A(\omega) - \frac{B(\omega)^2}{4}$ achieved at $t = -\frac{B(\omega)}{2}$. Three typical sample paths are shown in Figure 4.2. The various moment functions are given by
\[
\begin{aligned}
\mu_X(t) &= E[A + Bt + t^2] = t^2 \\
R_X(s, t) &= E[(A + Bs + s^2)(A + Bt + t^2)] = 1 + st + s^2 t^2 \\
C_X(s, t) &= R_X(s, t) - \mu_X(s)\mu_X(t) = 1 + st.
\end{aligned}
\]
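The value of $R_X(s,t)$ in Example 4.1.3 can be checked by expanding the product term by term. The following worked step is a sketch that uses only the stated assumptions that $A$ and $B$ are independent and $N(0,1)$, so that $E[A] = E[B] = 0$, $E[AB] = E[A]E[B] = 0$, and $E[A^2] = E[B^2] = 1$:
\[
\begin{aligned}
R_X(s, t) &= E[(A + Bs + s^2)(A + Bt + t^2)] \\
&= E[A^2] + (s + t)E[AB] + (s^2 + t^2)E[A] + st\,E[B^2] + (s t^2 + s^2 t)E[B] + s^2 t^2 \\
&= 1 + st + s^2 t^2 .
\end{aligned}
\]
Subtracting $\mu_X(s)\mu_X(t) = s^2 t^2$ leaves $C_X(s, t) = 1 + st$.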
As for the densities, for each $t$ fixed, $X_t$ is a linear combination of two independent Gaussian random variables, and $X_t$ has mean $\mu_X(t) = t^2$ and variance $\mathrm{Var}(X_t) = C_X(t, t) = 1 + t^2$. Thus, $X_t$ is a $N(t^2, 1 + t^2)$ random variable. That specifies the first order pdf $f_{X,1}$ well enough, but if one insists on writing it out in all detail it is given by
\[
f_{X,1}(x, t) = \frac{1}{\sqrt{2\pi(1 + t^2)}} \exp\left( -\frac{(x - t^2)^2}{2(1 + t^2)} \right).
\]

[Figure 4.2: Typical sample paths.]

For $s$ and $t$ fixed distinct numbers, $X_s$ and $X_t$ are jointly Gaussian and their covariance matrix is given by
\[
\mathrm{Cov}\begin{pmatrix} X_s \\ X_t \end{pmatrix} = \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}.
\]
The determinant of this matrix is $(s - t)^2$, which is nonzero. Thus $X$ has a second order pdf $f_{X,2}$. For most purposes, we have already written enough about $f_{X,2}$ for this example, but in full detail it is given by
\[
f_{X,2}(x, s; y, t) = \frac{1}{2\pi |s - t|} \exp\left( -\frac{1}{2} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix}^T \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}^{-1} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix} \right).
\]
The $n$th order distributions of $X$ for this example are joint Gaussian distributions, but densities don't exist for $n \geq 3$ because $X_{t_1}$, $X_{t_2}$, and $X_{t_3}$ are linearly dependent for any $t_1, t_2, t_3$.

A random process $(X_t : t \in \mathbb{T})$ is said to be Gaussian if the random variables $X_t : t \in \mathbb{T}$ comprising the process are jointly Gaussian. The process $X$ in the example just discussed is Gaussian. All the finite order distributions of a Gaussian random process $X$ are determined by the mean function $\mu_X$ and autocorrelation function $R_X$. Indeed, for any finite subset $\{t_1, t_2, \ldots, t_n\}$ of $\mathbb{T}$, $(X_{t_1}, \ldots, X_{t_n})^T$ is a Gaussian vector with mean $(\mu_X(t_1), \ldots, \mu_X(t_n))^T$ and covariance matrix with $ij$th element $C_X(t_i, t_j) = R_X(t_i, t_j) - \mu_X(t_i)\mu_X(t_j)$. Two or more random processes are said to be jointly Gaussian if all the random variables comprising the processes are jointly Gaussian.
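For the process of Example 4.1.3, the two determinant facts used above (the $2 \times 2$ covariance matrix has determinant $(s-t)^2$, and for three time points the covariance matrix is singular, so no third order pdf exists) can be checked with exact rational arithmetic. The following Python sketch is only an illustration; the helper names `cov_example`, `cov_matrix`, and `det` are hypothetical and not part of the notes.

```python
from fractions import Fraction

def cov_example(s, t):
    """C_X(s, t) = 1 + s*t for X_t = A + B*t + t**2 (A, B independent N(0, 1))."""
    return 1 + s * t

def cov_matrix(times):
    """Covariance matrix of (X_t : t in times) for the example process."""
    return [[cov_example(s, t) for t in times] for s in times]

def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

s, t = Fraction(1, 2), Fraction(3)
# Second order: the determinant equals (s - t)**2, so a joint pdf exists.
print(det(cov_matrix([s, t])), (s - t) ** 2)
# Third order: the determinant vanishes, so the joint distribution has no pdf.
print(det(cov_matrix([Fraction(0), s, t])))
```

Exact fractions are used instead of floating point so that the singular case prints an exact zero rather than a small rounding residue.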

Example 4.1.4 Let $U = (U_k : k \in \mathbb{Z})$ be a random process such that the random variables $U_k : k \in \mathbb{Z}$ are independent, and $P[U_k = 1] = P[U_k = -1] = \frac{1}{2}$ for all $k$. Let $X = (X_t : t \in \mathbb{R})$ be the random process obtained by letting $X_t = U_n$ for $n \leq t < n + 1$ for any $n$. Equivalently, $X_t = U_{\lfloor t \rfloor}$. A sample path of $U$ and a corresponding sample path of $X$ are shown in Figure 4.3.

[Figure 4.3: Typical sample paths.]

Both random processes have zero mean, so their covariance functions are equal to their correlation functions and are given by
\[
R_U(k, l) = \begin{cases} 1 & \text{if } k = l \\ 0 & \text{else} \end{cases}
\qquad
R_X(s, t) = \begin{cases} 1 & \text{if } \lfloor s \rfloor = \lfloor t \rfloor \\ 0 & \text{else.} \end{cases}
\]
The random variables of $U$ are discrete, so the $n$th order pmf of $U$ exists for all $n$. It is given by
\[
p_{U,n}(x_1, k_1; \ldots; x_n, k_n) = \begin{cases} 2^{-n} & \text{if } (x_1, \ldots, x_n) \in \{-1, 1\}^n \\ 0 & \text{else} \end{cases}
\]
for distinct integers $k_1, \ldots, k_n$. The $n$th order pmf of $X$ exists for the same reason, but it is a bit more difficult to write down. In particular, the joint pmf of $X_s$ and $X_t$ depends on whether $\lfloor s \rfloor = \lfloor t \rfloor$. If $\lfloor s \rfloor = \lfloor t \rfloor$ then $X_s = X_t$, and if $\lfloor s \rfloor \neq \lfloor t \rfloor$ then $X_s$ and $X_t$ are independent. Therefore, the second order pmf of $X$ is given as follows:
\[
p_{X,2}(x_1, t_1; x_2, t_2) = \begin{cases}
\frac{1}{2} & \text{if } \lfloor t_1 \rfloor = \lfloor t_2 \rfloor \text{ and either } x_1 = x_2 = 1 \text{ or } x_1 = x_2 = -1 \\
\frac{1}{4} & \text{if } \lfloor t_1 \rfloor \neq \lfloor t_2 \rfloor \text{ and } x_1, x_2 \in \{-1, 1\} \\
0 & \text{else.}
\end{cases}
\]

4.2 Random walks and gambler's ruin

Suppose $p$ is given with $0 < p < 1$. Let $W_1, W_2, \ldots$ be independent random variables with $P\{W_i = 1\} = p$ and $P\{W_i = -1\} = 1 - p$ for $i \geq 1$. Suppose $X_0$ is an integer valued random variable independent of $(W_1, W_2, \ldots)$, and for $n \geq 1$, define $X_n$ by $X_n = X_0 + W_1 + \cdots + W_n$. A sample path of $X = (X_n : n \geq 0)$ is shown in Figure 4.4. The random process $X$ is called a random walk. Write $P_k$ and $E_k$ for conditional probabilities and conditional expectations given that $X_0 = k$. For example, $P_k[A] = P[A \mid X_0 = k]$ for any event $A$. Let us summarize some of the basic properties of $X$.
[Figure 4.4: A typical sample path.]
• $E_k[X_n] = k + n(2p - 1)$.
• $\mathrm{Var}_k(X_n) = \mathrm{Var}(k + W_1 + \cdots + W_n) = 4np(1 - p)$.
• $\lim_{n \to \infty} \frac{X_n}{n} = 2p - 1$ (a.s. and m.s. under $P_k$, $k$ fixed).
• $\lim_{n \to \infty} P_k\left\{ \frac{X_n - n(2p - 1)}{\sqrt{4np(1 - p)}} \leq c \right\} = \Phi(c)$.
• $P_k\{X_n = k + j - (n - j)\} = \binom{n}{j} p^j (1 - p)^{n - j}$ for $0 \leq j \leq n$.

Almost all the properties listed are properties of the one dimensional distributions of $X$. In fact, only the strong law of large numbers, giving the a.s. convergence in the third property listed, depends on the joint distribution of the $X_n$'s.

The so-called Gambler's Ruin problem is a nice example of the calculation of a probability involving the joint distributions of the random walk $X$. Interpret $X_n$ as the number of units of money a gambler has at time $n$. Assume that the initial wealth $k$ satisfies $k \geq 0$, and suppose the gambler has a goal of accumulating $b$ units of money for some positive integer $b \geq k$. While the random walk $(X_n : n \geq 0)$ continues on forever, we are only interested in it until it hits either 0 or $b$. Let $R_b$ denote the event that the gambler is eventually ruined, meaning the random walk reaches zero without first reaching $b$. The gambler's ruin probability is $P_k[R_b]$. A simple idea allows us to compute the ruin probability. The idea is to condition on the value of the first step $W_1$, and then to recognize that after the first step is taken, the conditional probability of ruin is the same as the unconditional probability of ruin for initial wealth $k + W_1$.

Let $r_k = P_k[R_b]$ for $0 \leq k \leq b$, so $r_k$ is the ruin probability for the gambler with initial wealth $k$ and target wealth $b$. Clearly $r_0 = 1$ and $r_b = 0$. For $1 \leq k \leq b - 1$, condition on $W_1$ to yield
\[
r_k = P_k\{W_1 = 1\} P_k[R_b \mid W_1 = 1] + P_k\{W_1 = -1\} P_k[R_b \mid W_1 = -1]
\]
or $r_k = p r_{k+1} + (1 - p) r_{k-1}$. This yields $b - 1$ linear equations for the $b - 1$ unknowns $r_1, \ldots, r_{b-1}$.
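The $b - 1$ linear equations can also be solved numerically for particular values of $p$ and $b$. The following Python sketch sets up the equations $r_k = p r_{k+1} + (1-p) r_{k-1}$ with the boundary values $r_0 = 1$ and $r_b = 0$ and solves them by Gauss-Jordan elimination over exact fractions. The function name `ruin_probabilities` is a hypothetical helper introduced for illustration, and the dense elimination is chosen for clarity rather than efficiency.

```python
from fractions import Fraction

def ruin_probabilities(p, b):
    """Solve r_k = p*r_{k+1} + (1-p)*r_{k-1} for 1 <= k <= b-1, with r_0 = 1, r_b = 0.

    p should be a Fraction in (0, 1) and b a positive integer.
    Returns the list [r_0, r_1, ..., r_b].
    """
    n = b - 1                      # number of unknowns r_1, ..., r_{b-1}
    q = 1 - p
    # Augmented matrix for r_k - p*r_{k+1} - q*r_{k-1} = 0; known boundary
    # terms are moved to the right-hand side (last column).
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for i in range(n):
        k = i + 1
        rows[i][i] = Fraction(1)
        if k + 1 <= b - 1:
            rows[i][i + 1] = -p    # coefficient of r_{k+1}; r_b = 0 adds nothing
        if k - 1 >= 1:
            rows[i][i - 1] = -q    # coefficient of r_{k-1}
        else:
            rows[i][n] += q        # q * r_0 with r_0 = 1
    # Gauss-Jordan elimination with a search for a nonzero pivot.
    for c in range(n):
        piv = next(r for r in range(c, n) if rows[r][c] != 0)
        rows[c], rows[piv] = rows[piv], rows[c]
        pivot_value = rows[c][c]
        rows[c] = [x / pivot_value for x in rows[c]]
        for r in range(n):
            if r != c and rows[r][c] != 0:
                factor = rows[r][c]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[c])]
    return [Fraction(1)] + [rows[i][n] for i in range(n)] + [Fraction(0)]

print([str(x) for x in ruin_probabilities(Fraction(1, 2), 4)])
# prints ['1', '3/4', '1/2', '1/4', '0']
```

Working with `Fraction` values keeps the answers exact, which makes it easy to compare them with any closed-form expression for $r_k$.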
If $p = \frac{1}{2}$ the equations become $r_k = \frac{1}{2}\{r_{k-1} + r_{k+1}\}$ so that $r_k = A + Bk$ for some constants $A$ and $B$. Using the boundary conditions $r_0 = 1$ and $r_b = 0$, we find that $r_k = 1 - \frac{k}{b}$ in case $p = \frac{1}{2}$. Note that, interestingly enough, after the gambler stops playing, he'll have $b$ units with probability $\frac{k}{b}$ and zero units otherwise. Thus, his expected wealth after completing the game is equal to his initial capital, $k$.

If $p \neq \frac{1}{2}$, we seek a solution of the form $r_k = A\theta_1^k + B\theta_2^k$, where $\theta_1$ and $\theta_2$ are the two roots of the quadratic equation $\theta = p\theta^2 + (1 - p)$ and $A$, $B$ are selected to meet the two boundary conditions. The roots are 1 and $\frac{1-p}{p}$, and finding $A$ and $B$ yields that if $p \neq \frac{1}{2}$,
\[
r_k = \frac{\left(\frac{1-p}{p}\right)^k - \left(\frac{1-p}{p}\right)^b}{1 - \left(\frac{1-p}{p}\right)^b} \qquad 0 \leq k \leq b.
\]
Focus, now, on the case that $p > \frac{1}{2}$. By the law of large numbers, $\frac{X_n}{n} \to 2p - 1$ a.s. as $n \to \infty$. This implies, in particular, that $X_n \to +\infty$ a.s. as $n \to \infty$. Thus, unless the gambler is ruined in finite time, his capital converges to infinity. Let $R$ be the event that the gambler is eventually ruined. The events $R_b$ increase with $b$ because if $b$ is larger the gambler has more possibilities to be ruined before accumulating $b$ units of money: $R_b \subset R_{b+1} \subset \cdots$ and $R = \cup_{b=k}^{\infty} R_b$. Therefore by the countable additivity of probability,
\[
P_k[R] = \lim_{b \to \infty} P_k[R_b] = \lim_{b \to \infty} r_k = \left( \frac{1 - p}{p} \right)^k.
\]
Thus, the probability of eventual ruin decreases geometrically with the initial wealth $k$.

4.3 Processes with independent increments and martingales

The increment of a random process $X = (X_t : t \in \mathbb{T})$ over an interval $[a, b]$ is the random variable $X_b - X_a$. A random process is said to have independent increments if for any positive integer $n$ and any $t_0 < t_1 < \cdots < t_n$ in $\mathbb{T}$, the increments $X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}}$ are mutually independent.

A random process $(X_t : t \in \mathbb{T})$ is called a martingale if $E[X_t]$ is finite for all $t$ and for any positive integer $n$ and $t_1 < t_2 < \cdots < t_n < t_{n+1}$,
\[
E[X_{t_{n+1}} \mid X_{t_1}, \ldots, X_{t_n}] = X_{t_n}
\]
or, equivalently,
\[
E[X_{t_{n+1}} - X_{t_n} \mid X_{t_1}, \ldots, X_{t_n}] = 0.
\]
If $t_n$ is interpreted as the present time, then $t_{n+1}$ is a future time and the value of $(X_{t_1}, \ldots, X_{t_n})$ represents information about the past and present values of $X$. With this interpretation, the martingale property is that the future increments of $X$ have conditional mean zero, given the past and present values of the process.

An example of a martingale is the following. Suppose a gambler has initial wealth $X_0$. Suppose the gambler makes bets with various odds, such that, as far as the past history of $X$ can determine, the bets made are all for fair games in which the expected net gains are zero. If $X_t$ denotes the wealth of the gambler at any time $t \geq 0$, then $(X_t : t \geq 0)$ is a martingale.

Suppose $(X_t)$ is an independent increment process with index set $\mathbb{T} = \mathbb{R}_+$ or $\mathbb{T} = \mathbb{Z}_+$, with $X_0$ equal to a constant and with mean zero increments. Then $X$ is a martingale, as we now show. Let $t_1 < \cdots < t_{n+1}$ be in $\mathbb{T}$. Then $(X_{t_1}, \ldots, X_{t_n})$ is a function of the increments $X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$, and hence it is independent of the increment $X_{t_{n+1}} - X_{t_n}$. Thus
\[
E[X_{t_{n+1}} - X_{t_n} \mid X_{t_1}, \ldots, X_{t_n}] = E[X_{t_{n+1}} - X_{t_n}] = 0.
\]
The random walk $(X_n : n \geq 0)$ arising in the gambler's ruin problem is an independent increment process, and if $p = \frac{1}{2}$ it is also a martingale.

The following proposition is stated, without proof, to give an indication of some of the useful deductions that follow from the martingale property.

Proposition 4.3.1 (a) Let $X_0, X_1, X_2, \ldots$ be nonnegative random variables such that $E[X_{k+1} \mid X_0, \ldots, X_k] \leq X_k$ for $k \geq 0$ (such $X$ is a nonnegative supermartingale). Then
\[
P\left\{ \left( \max_{0 \leq k \leq n} X_k \right) \geq \gamma \right\} \leq \frac{E[X_0]}{\gamma}.
\]
(b) (Doob's $L^2$ Inequality) Let $X_0, X_1, \ldots$ be a martingale sequence with $E[X_n^2] < +\infty$ for some $n$. Then
\[
E\left[ \left( \max_{0 \leq k \leq n} X_k \right)^2 \right] \leq 4 E[X_n^2].
\]
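Part (b) of Proposition 4.3.1 can be illustrated, though of course not proved, by an exact computation for one particular martingale. The Python sketch below takes the symmetric random walk with $p = \frac{1}{2}$ and $X_0 = 0$, enumerates all $2^n$ equally likely paths, and returns both sides of Doob's $L^2$ inequality. The function name `doob_l2_check` is a hypothetical helper, and brute-force enumeration is only practical for small $n$.

```python
from fractions import Fraction
from itertools import product

def doob_l2_check(n):
    """Return (E[(max_{0<=k<=n} X_k)^2], 4*E[X_n^2]) for the symmetric walk, X_0 = 0."""
    lhs = Fraction(0)
    rhs = Fraction(0)
    weight = Fraction(1, 2 ** n)           # each +/-1 step sequence has probability 2**-n
    for steps in product((-1, 1), repeat=n):
        x, running_max = 0, 0              # X_0 = 0 is included in the maximum
        for w in steps:
            x += w
            running_max = max(running_max, x)
        lhs += weight * running_max ** 2
        rhs += weight * 4 * x ** 2
    return lhs, rhs

lhs, rhs = doob_l2_check(8)
print(lhs, rhs, lhs <= rhs)
```

For this walk $E[X_n^2] = n$, so the right-hand side is exactly $4n$, which gives a quick sanity check on the enumeration.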

4.4 Brownian motion

A Brownian motion, also called a Wiener process, is a random process $W = (W_t : t \geq 0)$ such that
B.0 $P\{W_0 = 0\} = 1$.
B.1 $W$ has independent increments.
B.2 $W_t - W_s$ has the $N(0, \sigma^2(t - s))$ distribution for $t \geq s$.
B.3 $P[W_t \text{ is a continuous function of } t] = 1$, or in other words, $W$ is sample path continuous with probability one.

A typical sample path of a Brownian motion is shown in Figure 4.5.

[Figure 4.5: A typical sample path of Brownian motion.]

A Brownian motion, being a mean zero independent increment process with $P\{W_0 = 0\} = 1$, is a martingale.
The mean, correlation, and covariance functions of a Brownian motion $W$ are given by
\[
\mu_W(t) = E[W_t] = E[W_t - W_0] = 0
\]
and, for $s \leq t$,
\[
\begin{aligned}
R_W(s, t) &= E[W_s W_t] \\
&= E[(W_s - W_0)(W_s - W_0 + W_t - W_s)] \\
&= E[(W_s - W_0)^2] = \sigma^2 s
\end{aligned}
\]
so that, in general,
\[
C_W(s, t) = R_W(s, t) = \sigma^2 (s \wedge t).
\]
A Brownian motion is Gaussian, because if $0 = t_0 \leq t_1 \leq \cdots \leq t_n$, then each coordinate of the vector $(W_{t_1}, \ldots, W_{t_n})$ is a linear combination of the $n$ independent Gaussian random variables $(W_{t_i} - W_{t_{i-1}} : 1 \leq i \leq n)$. Thus, properties B.0–B.2 imply that $W$ is a Gaussian random process with $\mu_W = 0$ and $R_W(s, t) = \sigma^2 (s \wedge t)$. In fact, the converse is also true. If $W = (W_t : t \geq 0)$ is a Gaussian random process with mean zero and $R_W(s, t) = \sigma^2 (s \wedge t)$, then B.0–B.2 are true.

Property B.3 does not come automatically. For example, if $W$ is a Brownian motion and if $U$ is a Unif(0,1) distributed random variable independent of $W$, let $\widetilde{W}$ be defined by
\[
\widetilde{W}_t = W_t + I_{\{U = t\}}.
\]
Then $P\{\widetilde{W}_t = W_t\} = 1$ for each $t \geq 0$ and $\widetilde{W}$ also satisfies B.0–B.2, but $\widetilde{W}$ fails to satisfy B.3. Thus, $\widetilde{W}$ is not a Brownian motion. The difference between $W$ and $\widetilde{W}$ is significant if events involving uncountably many values of $t$ are investigated. For example,
\[
P\{W_t \leq 1 \text{ for } 0 \leq t \leq 1\} \neq P\{\widetilde{W}_t \leq 1 \text{ for } 0 \leq t \leq 1\}.
\]
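The covariance formula $C_W(s, t) = \sigma^2 (s \wedge t)$ derived earlier in this section can be reproduced mechanically from properties B.0–B.2. The Python sketch below writes each $W_{t_i}$ as a sum of independent increments and adds up the variances of the increments shared by $W_{t_i}$ and $W_{t_j}$. The function name `brownian_cov_from_increments` and its parameters are hypothetical, and the time points are assumed to be increasing and nonnegative.

```python
from fractions import Fraction

def brownian_cov_from_increments(times, sigma2=1):
    """Covariance matrix of (W_{t_1}, ..., W_{t_n}) built from independent increments.

    Uses W_{t_i} = sum over m <= i of (W_{t_m} - W_{t_{m-1}}) with t_0 = 0 (B.0),
    where the increments are independent (B.1) with variance
    sigma2 * (t_m - t_{m-1}) (B.2). Assumes 0 <= t_1 < ... < t_n.
    """
    grid = [0] + list(times)
    inc_var = [sigma2 * (grid[m] - grid[m - 1]) for m in range(1, len(grid))]
    n = len(times)
    # Cov(W_{t_i}, W_{t_j}) is the total variance of the increments common to both sums.
    return [[sum(inc_var[: min(i, j) + 1]) for j in range(n)] for i in range(n)]

times = [Fraction(1, 2), Fraction(1), Fraction(3)]
for row in brownian_cov_from_increments(times, sigma2=2):
    print([str(entry) for entry in row])
```

Each printed entry should agree with $\sigma^2 \min(t_i, t_j)$, here with $\sigma^2 = 2$.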

4.5 Counting processes and the Poisson process

A function $f$ on $\mathbb{R}_+$ is called a counting function if $f(0) = 0$, $f$ is nondecreasing, $f$ is right continuous, and $f$ is integer valued. The interpretation is that $f(t)$ is the number of "counts" observed during the interval $(0, t]$. An increment $f(b) - f(a)$ is the number of counts in the interval $(a, b]$. If $t_i$ denotes the time of the $i$th count for $i \geq 1$, then $f$ can be described by the sequence $(t_i)$. Or, if $u_1 = t_1$ and $u_i = t_i - t_{i-1}$ for $i \geq 2$, then $f$ can be described by the sequence $(u_i)$. See Figure 4.6. The numbers $t_1, t_2, \ldots$ are called the count times and the numbers $u_1, u_2, \ldots$ are called the intercount times.

[Figure 4.6: A counting function.]

The following equations clearly hold:
\[
\begin{aligned}
f(t) &= \sum_{n=1}^{\infty} I_{\{t \geq t_n\}} \\
t_n &= \min\{t : f(t) \geq n\} \\
t_n &= u_1 + \cdots + u_n.
\end{aligned}
\]
A random process is called a counting process if with probability one its sample path is a counting function. A counting process has two corresponding random sequences, the sequence of count times and the sequence of intercount times.

The most widely used example of a counting process is a Poisson process, defined next.

Definition 4.5.1 Let $\lambda \geq 0$. By definition, a Poisson process with rate $\lambda$ is a random process $N = (N_t : t \geq 0)$ such that
N.1 $N$ is a counting process,
N.2 $N$ has independent increments,
N.3 $N(t) - N(s)$ has the $\mathrm{Poi}(\lambda(t - s))$ distribution for $t \geq s$.

Proposition 4.5.2 Let $N$ be a counting process and let $\lambda > 0$. The following are equivalent:
(a) $N$ is a Poisson process with rate $\lambda$.
(b) The intercount times $U_1, U_2, \ldots$ are mutually independent, $\mathrm{Exp}(\lambda)$ random variables.
(c) For each $\tau > 0$, $N_\tau$ is a Poisson random variable with parameter $\lambda\tau$, and given $\{N_\tau = n\}$, the times of the $n$ counts during $[0, \tau]$ are the same as $n$ independent, $\mathrm{Unif}[0, \tau]$ random variables, reordered to be nondecreasing.
That is, for any $n \geq 1$, the conditional density of the first $n$ count times, $(T_1, \ldots, T_n)$, given the event $\{N_\tau = n\}$, is:
\[
f(t_1, \ldots, t_n \mid N_\tau = n) = \begin{cases} \dfrac{n!}{\tau^n} & 0 < t_1 < \cdots < t_n \leq \tau \\ 0 & \text{else.} \end{cases} \tag{4.1}
\]

Proof. It will be shown that (a) implies (b), (b) implies (c), and (c) implies (a).

(a) implies (b). Suppose $N$ is a Poisson process. The joint pdf of the first $n$ count times $T_1, \ldots, T_n$ can be found as follows. Let $0 < t_1 < t_2 < \cdots < t_n$. Select $\epsilon > 0$ so small that $(t_1 - \epsilon, t_1], (t_2 - \epsilon, t_2], \ldots, (t_n - \epsilon, t_n]$ are disjoint intervals of $\mathbb{R}_+$. Then the probability that $(T_1, \ldots, T_n)$ is in the $n$-dimensional cube with upper corner $t_1, \ldots, t_n$ and sides of length $\epsilon$ is given by
\[
\begin{aligned}
P\{T_i \in (t_i - \epsilon, t_i] \text{ for } 1 \leq i \leq n\}
&= P\{N_{t_1 - \epsilon} = 0,\ N_{t_1} - N_{t_1 - \epsilon} = 1,\ N_{t_2 - \epsilon} - N_{t_1} = 0,\ \ldots,\ N_{t_n} - N_{t_n - \epsilon} = 1\} \\
&= (e^{-\lambda(t_1 - \epsilon)})(\lambda\epsilon e^{-\lambda\epsilon})(e^{-\lambda(t_2 - \epsilon - t_1)}) \cdots (\lambda\epsilon e^{-\lambda\epsilon}) \\
&= (\lambda\epsilon)^n e^{-\lambda t_n}.
\end{aligned}
\]
The volume of the cube is $\epsilon^n$. Therefore $(T_1, \ldots, T_n)$ has the pdf
\[
f_{T_1 \cdots T_n}(t_1, \ldots, t_n) = \begin{cases} \lambda^n e^{-\lambda t_n} & \text{if } 0 < t_1 < \cdots < t_n \\ 0 & \text{else.} \end{cases} \tag{4.2}
\]
The vector $(U_1, \ldots, U_n)$ is the image of $(T_1, \ldots, T_n)$ under the mapping $(t_1, \ldots, t_n) \to (u_1, \ldots, u_n)$ defined by $u_1 = t_1$, $u_k = t_k - t_{k-1}$ for $k \geq 2$. The mapping is invertible, because $t_k = u_1 + \cdots + u_k$ for $1 \leq k \leq n$, it has range $\mathbb{R}_+^n$, and the Jacobian
\[
\frac{\partial u}{\partial t} = \begin{pmatrix}
1 & & & & \\
-1 & 1 & & & \\
 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{pmatrix}
\]
has unit determinant. Therefore, by the formula for the transformation of random vectors (see Section 1.10),
\[
f_{U_1 \ldots U_n}(u_1, \ldots, u_n) = \begin{cases} \lambda^n e^{-\lambda(u_1 + \cdots + u_n)} & u \in \mathbb{R}_+^n \\ 0 & \text{else.} \end{cases} \tag{4.3}
\]
The joint pdf in (4.3) factors into the product of $n$ pdfs, with each pdf being for an $\mathrm{Exp}(\lambda)$ random variable. Thus the intercount times $U_1, U_2, \ldots$ are independent and each is exponentially distributed with parameter $\lambda$. So (a) implies (b).

(b) implies (c). Suppose that $N$ is a counting process such that the intercount times $U_1, U_2, \ldots$
are independent, $\mathrm{Exp}(\lambda)$ random variables, for some $\lambda > 0$. Thus, for $n \geq 1$, the first $n$ intercount times have the joint pdf given in (4.3). Equivalently, appealing to the transformation of random vectors in the reverse direction, the pdf of the first $n$ count times, $(T_1, \ldots, T_n)$, is given by (4.2). Fix $\tau > 0$ and an integer $n \geq 1$. The event $\{N_\tau = n\}$ is equivalent to the event $(T_1, \ldots, T_{n+1}) \in A_{n,\tau}$, where
\[
A_{n,\tau} = \{t \in \mathbb{R}_+^{n+1} : 0 < t_1 < \cdots < t_n \leq \tau < t_{n+1}\}.
\]
The conditional pdf of $(T_1, \ldots, T_{n+1})$, given that $\{N_\tau = n\}$, is obtained by starting with the joint pdf of $(T_1, \ldots, T_{n+1})$, namely $\lambda^{n+1} e^{-\lambda t_{n+1}}$ on $\{t \in \mathbb{R}^{n+1} : 0 < t_1 < \cdots < t_{n+1}\}$, setting it equal to zero off of the set $A_{n,\tau}$, and scaling it up by the factor $1/P\{N_\tau = n\}$ on $A_{n,\tau}$:
\[
f(t_1, \ldots, t_{n+1} \mid N_\tau = n) = \begin{cases} \dfrac{\lambda^{n+1} e^{-\lambda t_{n+1}}}{P\{N_\tau = n\}} & 0 < t_1 < \cdots < t_n \leq \tau < t_{n+1} \\ 0 & \text{else.} \end{cases} \tag{4.4}
\]
The joint density of $(T_1, \ldots, T_n)$, given that $\{N_\tau = n\}$, is obtained for each $(t_1, \ldots, t_n)$ by integrating the density in (4.4) with respect to $t_{n+1}$ over $\mathbb{R}$. If $0 < t_1 < \cdots < t_n \leq \tau$ does not hold, the density in (4.4) is zero for all values of $t_{n+1}$. If $0 < t_1 < \cdots < t_n \leq \tau$, then the density in (4.4) is nonzero for $t_{n+1} \in (\tau, \infty)$. Integrating (4.4) with respect to $t_{n+1}$ over $(\tau, \infty)$ yields:
\[
f(t_1, \ldots, t_n \mid N_\tau = n) = \begin{cases} \dfrac{\lambda^n e^{-\lambda\tau}}{P\{N_\tau = n\}} & 0 < t_1 < \cdots < t_n \leq \tau \\ 0 & \text{else.} \end{cases} \tag{4.5}
\]
The conditional density in (4.5) is constant over the set $\{t \in \mathbb{R}_+^n : 0 < t_1 < \cdots < t_n \leq \tau\}$. Since the density must integrate to one, that constant must be the reciprocal of the $n$-dimensional volume of the set. The cube $[0, \tau]^n$ in $\mathbb{R}^n$ has volume $\tau^n$. It can be partitioned into $n!$ equal volume subsets corresponding to the $n!$ possible orderings of the numbers $t_1, \ldots, t_n$. Therefore, the set $\{t \in \mathbb{R}_+^n : 0 \leq t_1 < \cdots < t_n \leq \tau\}$, corresponding to one of the orderings, has volume $\tau^n / n!$. Hence, (4.5) implies both that (4.1) holds and that $P\{N_\tau = n\} = \frac{(\lambda\tau)^n e^{-\lambda\tau}}{n!}$. These implications are for $n \geq 1$.
Also, $P\{N_\tau = 0\} = P\{U_1 > \tau\} = e^{-\lambda\tau}$. Thus, $N_\tau$ is a $\mathrm{Poi}(\lambda\tau)$ random variable.

(c) implies (a). Suppose $0 = t_0 < t_1 < \cdots < t_k$ and let $n_1, \ldots, n_k$ be nonnegative integers. Set $n = n_1 + \cdots + n_k$ and $p_i = (t_i - t_{i-1})/t_k$ for $1 \leq i \leq k$. Suppose (c) is true. Given there are $n$ counts in the interval $[0, t_k]$, by (c) applied with $\tau = t_k$, the distribution of the numbers of counts in each subinterval is as if each of the $n$ counts is thrown into a subinterval at random, falling into the $i$th subinterval with probability $p_i$. The probability that, for $1 \leq i \leq k$, $n_i$ particular counts fall into the $i$th interval, is $p_1^{n_1} \cdots p_k^{n_k}$. The number of ways to assign $n$ counts to the intervals such that there are $n_i$ counts in the $i$th interval is $\binom{n}{n_1 \cdots n_k} = \frac{n!}{n_1! \cdots n_k!}$. This thus gives rise to what is known as a multinomial distribution for the numbers of counts per interval. We have
\[
\begin{aligned}
P\{N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \leq i \leq k\}
&= P\{N(t_k) = n\}\, P[N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \leq i \leq k \mid N(t_k) = n] \\
&= \frac{(\lambda t_k)^n e^{-\lambda t_k}}{n!} \binom{n}{n_1 \cdots n_k} p_1^{n_1} \cdots p_k^{n_k} \\
&= \prod_{i=1}^{k} \frac{(\lambda(t_i - t_{i-1}))^{n_i} e^{-\lambda(t_i - t_{i-1})}}{n_i!}.
\end{aligned}
\]
Therefore the increments $N(t_i) - N(t_{i-1})$, $1 \leq i \leq k$, are independent, with $N(t_i) - N(t_{i-1})$ being a Poisson random variable with mean $\lambda(t_i - t_{i-1})$, for $1 \leq i \leq k$. So (a) is proved.

A Poisson process is not a martingale. However, if $\widetilde{N}$ is defined by $\widetilde{N}_t = N_t - \lambda t$, then $\widetilde{N}$ is an independent increment process with mean 0 and $\widetilde{N}_0 = 0$. Thus, $\widetilde{N}$ is a martingale. Note that $\widetilde{N}$ has the same mean and covariance function as a Brownian motion with $\sigma^2 = \lambda$, which shows how little one really knows about a process from its mean function and correlation function alone.

4.6 Stationarity

Consider a random process $X = (X_t : t \in \mathbb{T})$ such that either $\mathbb{T} = \mathbb{Z}$ or $\mathbb{T} = \mathbb{R}$. Then $X$ is said to be stationary if for any $t_1, \ldots, t_n$ and $s$ in $\mathbb{T}$, the random vectors $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1 + s}, \ldots, X_{t_n + s})$ have the same distribution.
In other words, the joint statistics of $X$ of all orders are unaffected by a shift in time. The condition of stationarity of $X$ can also be expressed in terms of the CDFs of $X$: $X$ is stationary if for any $n \geq 1$, $s, t_1, \ldots, t_n \in \mathbb{T}$, and $x_1, \ldots, x_n \in \mathbb{R}$,
\[
F_{X,n}(x_1, t_1; \ldots; x_n, t_n) = F_{X,n}(x_1, t_1 + s; \ldots; x_n, t_n + s).
\]
Suppose $X$ is a stationary second order random process. (Recall that second order means that $E[X_t^2] < \infty$ for all $t$.) Then by the $n = 1$ part of the definition of stationarity, $X_t$ has the same distribution for all $t$. In particular, $\mu_X(t)$ and $E[X_t^2]$ do not depend on $t$. Moreover, by the $n = 2$ part of the definition, $E[X_{t_1} X_{t_2}] = E[X_{t_1 + s} X_{t_2 + s}]$ for any $s \in \mathbb{T}$. If $E[X_t^2] < +\infty$ for all $t$, then $E[X_{t+s}]$ and $R_X(t_1 + s, t_2 + s)$ are finite and neither depends on $s$.

A second order random process $(X_t : t \in \mathbb{T})$ with $\mathbb{T} = \mathbb{Z}$ or $\mathbb{T} = \mathbb{R}$ is called wide sense stationary (WSS) if
\[
\mu_X(t) = \mu_X(s + t) \quad \text{and} \quad R_X(t_1, t_2) = R_X(t_1 + s, t_2 + s)
\]
for all $t, s, t_1, t_2 \in \mathbb{T}$. As shown above, a stationary second order random process is WSS. Wide sense stationarity means that $\mu_X(t)$ is a finite number, not depending on $t$, and $R_X(t_1, t_2)$ depends on $t_1, t_2$ only through the difference $t_1 - t_2$. By a convenient and widely accepted abuse of notation, if $X$ is WSS, we use $\mu_X$ to be the constant and $R_X$ to be the function of one real variable such that
\[
\begin{aligned}
E[X_t] &= \mu_X & t &\in \mathbb{T} \\
E[X_{t_1} X_{t_2}] &= R_X(t_1 - t_2) & t_1, t_2 &\in \mathbb{T}.
\end{aligned}
\]
The dual use of the notation $R_X$ if $X$ is WSS leads to the identity $R_X(t_1, t_2) = R_X(t_1 - t_2)$. As a practical matter, this means replacing a comma by a minus sign. Since one interpretation of $R_X$ requires it to have two arguments, and the other interpretation requires only one argument, the interpretation is clear from the number of arguments.
Some brave authors even skip mentioning that $X$ is WSS when they write: "Suppose $(X_t : t \in \mathbb{R})$ has mean $\mu_X$ and correlation function $R_X(\tau)$," because it is implicit in this statement that $X$ is WSS.

Since the covariance function $C_X$ of a random process $X$ satisfies
\[
C_X(t_1, t_2) = R_X(t_1, t_2) - \mu_X(t_1)\mu_X(t_2),
\]
if $X$ is WSS then $C_X(t_1, t_2)$ is a function of $t_1 - t_2$. The notation $C_X$ is also used to denote the function of one variable such that $C_X(t_1 - t_2) = \mathrm{Cov}(X_{t_1}, X_{t_2})$. Therefore, if $X$ is WSS then $C_X(t_1 - t_2) = C_X(t_1, t_2)$. Also, $C_X(\tau) = R_X(\tau) - \mu_X^2$, where in this equation $\tau$ should be thought of as the difference of two times, $t_1 - t_2$.

In general, there is much more to know about a random vector or a random process than the first and second moments. Therefore, one can mathematically define WSS processes that are spectacularly different in appearance from any stationary random process. For example, any random process $(X_k : k \in \mathbb{Z})$ such that the $X_k$ are independent with $E[X_k] = 0$ and $\mathrm{Var}(X_k) = 1$ for all $k$ is WSS. To be specific, we could take the $X_k$ to be independent, with $X_k$ being $N(0, 1)$ for $k \leq 0$ and with $X_k$ having pmf
\[
p_{X,1}(x, k) = P\{X_k = x\} = \begin{cases} \dfrac{1}{2k^2} & x \in \{k, -k\} \\ 1 - \dfrac{1}{k^2} & \text{if } x = 0 \\ 0 & \text{else} \end{cases}
\]
for $k \geq 1$. A typical sample path of this WSS random process is shown in Figure 4.7.

[Figure 4.7: A typical sample path.]

The situation is much different if $X$ is a Gaussian process. Indeed, suppose $X$ is Gaussian and WSS. Then for any $t_1, t_2, \ldots, t_n, s \in \mathbb{T}$, the random vector $(X_{t_1 + s}, X_{t_2 + s}, \ldots, X_{t_n + s})^T$ is Gaussian with mean $(\mu, \mu, \ldots, \mu)^T$ and covariance matrix with $ij$th entry $C_X((t_i + s) - (t_j + s)) = C_X(t_i - t_j)$. This mean and covariance matrix do not depend on $s$. Thus, the distribution of the vector does not depend on $s$. Therefore, $X$ is stationary.

In summary, if $X$ is stationary then $X$ is WSS, and if $X$ is both Gaussian and WSS, then $X$ is stationary.
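The non-Gaussian WSS example given earlier in this section (independent $X_k$ with a three-point pmf for $k \geq 1$) can be checked with a few lines of exact arithmetic. The Python sketch below confirms that each $X_k$ with $k \geq 1$ has mean 0 and variance 1, while $P\{X_k = 0\}$ changes with $k$, so the first order distributions are not all the same. The helper names `pmf` and `moments` are hypothetical.

```python
from fractions import Fraction

def pmf(k):
    """pmf of X_k for k >= 1: mass 1/(2k^2) at each of +k and -k, and 1 - 1/k^2 at 0."""
    k = Fraction(k)
    return {k: 1 / (2 * k * k), -k: 1 / (2 * k * k), Fraction(0): 1 - 1 / (k * k)}

def moments(dist):
    """Return (total mass, mean, variance) of a finite pmf given as {value: probability}."""
    total = sum(dist.values())
    mean = sum(x * p for x, p in dist.items())
    second = sum(x * x * p for x, p in dist.items())
    return total, mean, second - mean ** 2

for k in (1, 2, 3, 10):
    total, mean, var = moments(pmf(k))
    # The mass is 1, the mean is 0 and the variance is 1 for every k,
    # but the probability of the value 0 depends on k.
    print(k, total, mean, var, pmf(k)[Fraction(0)])
```

Because the first order pmf depends on $k$, the process cannot be stationary even though its mean and correlation functions are shift invariant.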

Example 4.6.1 Let $X_t = A\cos(\omega_c t + \Theta)$, where $\omega_c$ is a nonzero constant, $A$ and $\Theta$ are independent random variables with $P\{A > 0\} = 1$ and $E[A^2] < +\infty$. Each sample path of the random process $(X_t : t \in \mathbb{R})$ is a pure sinusoidal function at frequency $\omega_c$ radians per unit time, with amplitude $A$ and phase $\Theta$.

We address two questions. First, what additional assumptions, if any, are needed on the distributions of $A$ and $\Theta$ to imply that $X$ is WSS? Second, we consider two distributions for $\Theta$ which each make $X$ WSS, and see if they make $X$ stationary.

To address whether $X$ is WSS, the mean and correlation functions can be computed as follows. Since $A$ and $\Theta$ are independent and since $\cos(\omega_c t + \Theta) = \cos(\omega_c t)\cos(\Theta) - \sin(\omega_c t)\sin(\Theta)$,
\[
\mu_X(t) = E[A] \left( E[\cos(\Theta)]\cos(\omega_c t) - E[\sin(\Theta)]\sin(\omega_c t) \right).
\]
Thus, the function $\mu_X(t)$ is a linear combination of $\cos(\omega_c t)$ and $\sin(\omega_c t)$. The only way such a linear combination can be independent of $t$ is if the coefficients of both $\cos(\omega_c t)$ and $\sin(\omega_c t)$ are zero (in fact, it is enough to equate the values of $\mu_X(t)$ at $\omega_c t = 0$, $\frac{\pi}{2}$, and $\pi$). Therefore, $\mu_X(t)$ does not depend on $t$ if and only if $E[\cos(\Theta)] = E[\sin(\Theta)] = 0$.

Turning next to $R_X$, using the trigonometric identity $\cos(a)\cos(b) = (\cos(a - b) + \cos(a + b))/2$ yields
\[
\begin{aligned}
R_X(s, t) &= E[A^2] E[\cos(\omega_c s + \Theta)\cos(\omega_c t + \Theta)] \\
&= \frac{E[A^2]}{2} \left\{ \cos(\omega_c(s - t)) + E[\cos(\omega_c(s + t) + 2\Theta)] \right\}.
\end{aligned}
\]
Since $s + t$ can be arbitrary for $s - t$ fixed, in order that $R_X(s, t)$ be a function of $s - t$ alone it is necessary that $E[\cos(\omega_c(s + t) + 2\Theta)]$ be a constant, independent of the value of $s + t$. Arguing just as in the case of $\mu_X$, with $\Theta$ replaced by $2\Theta$, yields that $R_X(s, t)$ is a function of $s - t$ if and only if $E[\cos(2\Theta)] = E[\sin(2\Theta)] = 0$.

Combining the findings for $\mu_X$ and $R_X$ yields that $X$ is WSS if and only if
\[
E[\cos(\Theta)] = E[\sin(\Theta)] = E[\cos(2\Theta)] = E[\sin(2\Theta)] = 0.
\]
There are many distributions for $\Theta$ in $[0, 2\pi]$ such that the four moments specified are zero.
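For a candidate discrete distribution of $\Theta$, the four moment conditions are easy to test numerically. The Python sketch below is only an illustration: the helper names `wss_moments` and `satisfies_wss_condition`, the tolerance, and the two test distributions (phases equally spaced at three points, and at the two points $0$ and $\pi$) are assumptions chosen for the example, not taken from the notes.

```python
import math

def wss_moments(phases, probs):
    """Return (E[cos T], E[sin T], E[cos 2T], E[sin 2T]) for a discrete phase T."""
    return tuple(
        sum(p * f(m * th) for th, p in zip(phases, probs))
        for m in (1, 2)
        for f in (math.cos, math.sin)
    )

def satisfies_wss_condition(phases, probs, tol=1e-12):
    """True if all four moments vanish up to the floating point tolerance tol."""
    return all(abs(v) < tol for v in wss_moments(phases, probs))

three_point = [2 * math.pi * j / 3 for j in range(3)]   # equally likely phases
two_point = [0.0, math.pi]                              # equally likely phases

print(satisfies_wss_condition(three_point, [1 / 3] * 3))  # all four moments vanish
print(satisfies_wss_condition(two_point, [0.5, 0.5]))     # E[cos(2T)] = 1, so it fails
```

The two-point case shows that the conditions on $2\Theta$ are not redundant: the mean function is constant there, but the correlation function still depends on $s + t$.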
Two possibilities are (a) $\Theta$ is uniformly distributed on the interval $[0, 2\pi]$, or, (b) $\Theta$ is a discrete random variable, taking the four values $0, \frac{\pi}{2}, \pi, \frac{3\pi}{2}$ with equal probability. Is $X$ stationary for either possibility?

We shall show that $X$ is stationary if $\Theta$ is uniformly distributed over $[0, 2\pi]$. Stationarity means that for any fixed constant $s$, the random processes $(X_t : t \in \mathbb{R})$ and $(X_{t+s} : t \in \mathbb{R})$ have the same finite order distributions. For this example,
\[
X_{t+s} = A\cos(\omega_c(t + s) + \Theta) = A\cos(\omega_c t + \widetilde{\Theta})
\]
where $\widetilde{\Theta} = ((\omega_c s + \Theta) \bmod 2\pi)$. By Example 1.4.4, $\widetilde{\Theta}$ is again uniformly distributed on the interval $[0, 2\pi]$. Thus $(A, \Theta)$ and $(A, \widetilde{\Theta})$ have the same joint distribution, so $A\cos(\omega_c t + \Theta)$ and $A\cos(\omega_c t + \widetilde{\Theta})$ have the same finite order distributions. Hence, $X$ is indeed stationary if $\Theta$ is uniformly distributed over $[0, 2\pi]$.

Assume now that $\Theta$ takes on each of the values of $0, \frac{\pi}{2}, \pi$, and $\frac{3\pi}{2}$ with equal probability. Is $X$ stationary? If $X$ were stationary then, in particular, $X_t$ would have the same distribution for all $t$. On one hand, $P\{X_0 = 0\} = P\{\Theta = \frac{\pi}{2} \text{ or } \Theta = \frac{3\pi}{2}\} = \frac{1}{2}$. On the other hand, if $\omega_c t$ is not an integer multiple of $\frac{\pi}{2}$, then $\omega_c t + \Theta$ cannot be an integer multiple of $\frac{\pi}{2}$, so $P\{X_t = 0\} = 0$. Hence $X$ is not stationary.

(With more work it can be shown that $X$ is stationary if and only if $(\Theta \bmod 2\pi)$ is uniformly distributed over the interval $[0, 2\pi]$.)

4.7 Joint properties of random processes

Two random processes $X$ and $Y$ are said to be jointly stationary if their parameter set $\mathbb{T}$ is either $\mathbb{Z}$ or $\mathbb{R}$, and if for any $t_1, \ldots, t_n, s \in \mathbb{T}$, the distribution of the random vector
\[
(X_{t_1 + s}, X_{t_2 + s}, \ldots, X_{t_n + s}, Y_{t_1 + s}, Y_{t_2 + s}, \ldots, Y_{t_n + s})
\]
does not depend on $s$.

The random processes $X$ and $Y$ are said to be jointly Gaussian if all the random variables comprising $X$ and $Y$ are jointly Gaussian.
If $X$ and $Y$ are second order random processes on the same probability space, the cross correlation function, $R_{XY}$, is defined by $R_{XY}(s, t) = E[X_s Y_t]$, and the cross covariance function, $C_{XY}$, is defined by $C_{XY}(s, t) = \mathrm{Cov}(X_s, Y_t)$.

The random processes $X$ and $Y$ are said to be jointly WSS if $X$ and $Y$ are each WSS, and if $R_{XY}(s, t)$ is a function of $s - t$. If $X$ and $Y$ are jointly WSS, we use $R_{XY}(\tau)$ for $R_{XY}(s, t)$ where $\tau = s - t$, and similarly $C_{XY}(s - t) = C_{XY}(s, t)$. Note that $C_{XY}(s, t) = C_{YX}(t, s)$, so $C_{XY}(\tau) = C_{YX}(-\tau)$.

4.8 Conditional independence and Markov processes

Markov processes are naturally associated with the state space approach for modeling a system. The idea of a state space model for a given system is to define the state of the system at any given time $t$. The state of the system at time $t$ should summarize everything about the system up to and including time $t$ that is relevant to the future of the system. For example, the state of an aircraft at time $t$ could consist of the position, velocity, and remaining fuel at time $t$. Think of $t$ as the present time. The state at time $t$ determines the possible future part of the aircraft trajectory. For example, it determines how much longer the aircraft can fly and where it could possibly land. The state at time $t$ does not completely determine the entire past trajectory of the aircraft. Rather, the state summarizes enough about the system up to the present so that if the state is known, no more information about the past is relevant to the future possibilities. The concept of state is inherent in the Kalman filtering model discussed in Chapter 3. The notion of state is captured for random processes using the notion of conditional independence and the Markov property, which are discussed next.

Let $X, Y, Z$ be random vectors. We shall define the condition that $X$ and $Z$ are conditionally independent given $Y$. This condition is denoted by $X - Y - Z$.
If $X, Y, Z$ are discrete, then $X - Y - Z$ is defined to hold if
\[ P[X = i, Z = k \mid Y = j] = P[X = i \mid Y = j]\,P[Z = k \mid Y = j] \tag{4.6} \]
for all $i, j, k$ with $P\{Y = j\} > 0$. Equivalently, $X - Y - Z$ if
\[ P\{X = i, Y = j, Z = k\}\,P\{Y = j\} = P\{X = i, Y = j\}\,P\{Z = k, Y = j\} \tag{4.7} \]
for all $i, j, k$. Equivalently again, $X - Y - Z$ if
\[ P[Z = k \mid X = i, Y = j] = P[Z = k \mid Y = j] \tag{4.8} \]
for all $i, j, k$ with $P\{X = i, Y = j\} > 0$. The forms (4.6) and (4.7) make it clear that the condition $X - Y - Z$ is symmetric in $X$ and $Z$: thus $X - Y - Z$ is the same condition as $Z - Y - X$. The form (4.7) does not involve conditional probabilities, so no requirement about conditioning events having positive probability is needed. The form (4.8) shows that $X - Y - Z$ means that knowing $Y$ alone is as informative as knowing both $X$ and $Y$, for the purpose of determining conditional probabilities of $Z$. Intuitively, the condition $X - Y - Z$ means that the random variable $Y$ serves as a state.

If $X$, $Y$, and $Z$ have a joint pdf, then the condition $X - Y - Z$ can be defined using the pdfs and conditional pdfs in a similar way. For example, the conditional independence condition $X - Y - Z$ holds by definition if
\[ f_{XZ|Y}(x, z \mid y) = f_{X|Y}(x \mid y)\,f_{Z|Y}(z \mid y) \quad \text{whenever } f_Y(y) > 0. \]
An equivalent condition is
\[ f_{Z|XY}(z \mid x, y) = f_{Z|Y}(z \mid y) \quad \text{whenever } f_{XY}(x, y) > 0. \tag{4.9} \]

Example 4.8.1 Suppose $X, Y, Z$ are jointly Gaussian vectors. Let us see what the condition $X - Y - Z$ means in terms of the covariance matrices. Assume without loss of generality that the vectors have mean zero. Because $X$, $Y$, and $Z$ are jointly Gaussian, the condition (4.9) is equivalent to the condition that $E[Z \mid X, Y] = E[Z \mid Y]$ (because given $X, Y$, or just given $Y$, the conditional distribution of $Z$ is Gaussian, and in the two cases the mean and covariance of the conditional distribution of $Z$ are the same). The idea of linear innovations applied to the length two sequence $(Y, X)$ yields $E[Z \mid X, Y] = E[Z \mid Y] + E[Z \mid \tilde{X}]$ where $\tilde{X} = X - E[X \mid Y]$.
Thus $X - Y - Z$ if and only if $E[Z \mid \tilde{X}] = 0$, or equivalently, if and only if $\mathrm{Cov}(\tilde{X}, Z) = 0$. Since $\tilde{X} = X - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}Y$, it follows that
\[ \mathrm{Cov}(\tilde{X}, Z) = \mathrm{Cov}(X, Z) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z). \]
Therefore, $X - Y - Z$ if and only if
\[ \mathrm{Cov}(X, Z) = \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z). \tag{4.10} \]
In particular, if $X$, $Y$, and $Z$ are jointly Gaussian random variables with nonzero variances, the condition $X - Y - Z$ holds if and only if the correlation coefficients satisfy $\rho_{XZ} = \rho_{XY}\rho_{YZ}$.

A general definition of conditional probabilities and conditional independence, based on the general definition of conditional expectation given in Chapter 3, is given next. Recall that $P[F] = E[I_F]$ for any event $F$, where $I_F$ denotes the indicator function of $F$. If $Y$ is a random vector, we define $P[F \mid Y]$ to equal $E[I_F \mid Y]$. This means that $P[F \mid Y]$ is the unique (in the sense that any two versions are equal with probability one) random variable such that
(1) $P[F \mid Y]$ is a function of $Y$ and it has finite second moments, and
(2) $E[g(Y)P[F \mid Y]] = E[g(Y)I_F]$ for any $g(Y)$ with finite second moment.
Given arbitrary random vectors, we define $X$ and $Z$ to be conditionally independent given $Y$ (written $X - Y - Z$) if for any Borel sets $A$ and $B$,
\[ P[\{X \in A\}\{Z \in B\} \mid Y] = P[X \in A \mid Y]\,P[Z \in B \mid Y]. \]
Equivalently, $X - Y - Z$ holds if for any Borel set $B$, $P[Z \in B \mid X, Y] = P[Z \in B \mid Y]$.

Definition 4.8.2 A random process $X = (X_t : t \in \mathbf{T})$ is said to be a Markov process if for any $t_1, \ldots, t_{n+1}$ in $\mathbf{T}$ with $t_1 < \cdots < t_{n+1}$, the following conditional independence condition holds:
\[ (X_{t_1}, \ldots, X_{t_n}) - X_{t_n} - X_{t_{n+1}} \tag{4.11} \]
It turns out that the Markov property is equivalent to the following conditional independence property: for any $t_1, \ldots, t_{n+m}$ in $\mathbf{T}$ with $t_1 < \cdots < t_{n+m}$,
\[ (X_{t_1}, \ldots, X_{t_n}) - X_{t_n} - (X_{t_n}, \ldots, X_{t_{n+m}}) \tag{4.12} \]
The definition (4.11) is easier to check than condition (4.12), but (4.12) is appealing because it is symmetric in time.
In words, thinking of $t_n$ as the present time, the Markov property means that the past and future of $X$ are conditionally independent given the present state $X_{t_n}$.

Example 4.8.3 (Markov property of independent increment processes) Let $(X_t : t \geq 0)$ be an independent increment process such that $X_0$ is a constant. Then for any $t_1, \ldots, t_{n+1}$ with $0 \leq t_1 \leq \cdots \leq t_{n+1}$, the vector $(X_{t_1}, \ldots, X_{t_n})$ is a function of the $n$ increments $X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$, and is thus independent of the increment $V = X_{t_{n+1}} - X_{t_n}$. But $X_{t_{n+1}}$ is determined by $V$ and $X_{t_n}$. Thus, $X$ is a Markov process. In particular, random walks, Brownian motions, and Poisson processes are Markov processes.

Example 4.8.4 (Gaussian Markov processes) Suppose $X = (X_t : t \in \mathbf{T})$ is a Gaussian random process with $\mathrm{Var}(X_t) > 0$ for all $t$. By the characterization of conditional independence for jointly Gaussian vectors (4.10), the Markov property (4.11) is equivalent to
\[ \mathrm{Cov}\left(\begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_{n+1}}\right) = \mathrm{Cov}\left(\begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_n}\right)\mathrm{Var}(X_{t_n})^{-1}\,\mathrm{Cov}(X_{t_n}, X_{t_{n+1}}) \]
which, letting $\rho(s, t)$ denote the correlation coefficient between $X_s$ and $X_t$, is equivalent to the requirement
\[ \begin{pmatrix} \rho(t_1, t_{n+1}) \\ \rho(t_2, t_{n+1}) \\ \vdots \\ \rho(t_n, t_{n+1}) \end{pmatrix} = \begin{pmatrix} \rho(t_1, t_n) \\ \rho(t_2, t_n) \\ \vdots \\ \rho(t_n, t_n) \end{pmatrix}\rho(t_n, t_{n+1}). \]
Therefore a Gaussian process $X$ is Markovian if and only if
\[ \rho(r, t) = \rho(r, s)\rho(s, t) \quad \text{whenever } r, s, t \in \mathbf{T} \text{ with } r < s < t. \tag{4.13} \]
If $X = (X_k : k \in \mathbb{Z})$ is a discrete-time stationary Gaussian process, then $\rho(s, t)$ may be written as $\rho(k)$, where $k = s - t$. Note that $\rho(k) = \rho(-k)$. Such a process is Markovian if and only if $\rho(k_1 + k_2) = \rho(k_1)\rho(k_2)$ for all positive integers $k_1$ and $k_2$. Therefore, $X$ is Markovian if and only if $\rho(k) = b^{|k|}$ for all $k$, for some constant $b$ with $|b| \leq 1$.
Equivalently, a stationary Gaussian process $X = (X_k : k \in \mathbb{Z})$ with $\mathrm{Var}(X_k) > 0$ for all $k$ is Markovian if and only if the covariance function has the form $C_X(k) = Ab^{|k|}$ for some constants $A$ and $b$ with $A > 0$ and $|b| \leq 1$.

Similarly, if $(X_t : t \in \mathbb{R})$ is a continuous-time stationary Gaussian process with $\mathrm{Var}(X_t) > 0$ for all $t$, $X$ is Markovian if and only if $\rho(s + t) = \rho(s)\rho(t)$ for all $s, t \geq 0$. The only bounded real-valued functions satisfying such a multiplicative condition are exponential functions. Therefore, a stationary Gaussian process $X$ with $\mathrm{Var}(X_t) > 0$ for all $t$ is Markovian if and only if $\rho$ has the form $\rho(\tau) = \exp(-\alpha|\tau|)$ for some constant $\alpha \geq 0$, or equivalently, if and only if $C_X$ has the form $C_X(\tau) = A\exp(-\alpha|\tau|)$ for some constants $A > 0$ and $\alpha \geq 0$.

4.9 Discrete-state Markov processes

This section delves further into the theory of Markov processes in the technically simplest case of a discrete state space. Let $\mathcal{S}$ be a finite or countably infinite set, called the state space. Given a probability space $(\Omega, \mathcal{F}, P)$, an $\mathcal{S}$ valued random variable is defined to be a function $Y$ mapping $\Omega$ to $\mathcal{S}$ such that $\{\omega : Y(\omega) = s\} \in \mathcal{F}$ for each $s \in \mathcal{S}$. Assume that the elements of $\mathcal{S}$ are ordered so that $\mathcal{S} = \{a_1, a_2, \ldots, a_n\}$ in case $\mathcal{S}$ has finite cardinality, or $\mathcal{S} = \{a_1, a_2, a_3, \ldots\}$ in case $\mathcal{S}$ has infinite cardinality. Given the ordering, an $\mathcal{S}$ valued random variable is equivalent to a positive integer valued random variable, so it is nothing exotic. Think of the probability distribution of an $\mathcal{S}$ valued random variable $Y$ as a row vector of possibly infinite dimension, called a probability vector: $p_Y = (P\{Y = a_1\}, P\{Y = a_2\}, \ldots)$. Similarly think of a deterministic function $g$ on $\mathcal{S}$ as a column vector, $g = (g(a_1), g(a_2), \ldots)^T$. Since the elements of $\mathcal{S}$ may not even be numbers, it might not make sense to speak of the expected value of an $\mathcal{S}$ valued random variable.
However, if $g$ is a function mapping $\mathcal{S}$ to the reals, then $g(Y)$ is a real-valued random variable and its expectation is given by the inner product of the probability vector $p_Y$ and the column vector $g$:
\[ E[g(Y)] = \sum_{i \in \mathcal{S}} p_Y(i)g(i) = p_Y g. \]
A random process $X = (X_t : t \in \mathbf{T})$ is said to have state space $\mathcal{S}$ if $X_t$ is an $\mathcal{S}$ valued random variable for each $t \in \mathbf{T}$, and the Markov property of such a random process is defined just as it is for a real valued random process.

Let $(X_t : t \in \mathbf{T})$ be a Markov process with state space $\mathcal{S}$. For brevity we denote the first order pmf of $X$ at time $t$ as $\pi(t) = (\pi_i(t) : i \in \mathcal{S})$. That is, $\pi_i(t) = p_X(i, t) = P\{X(t) = i\}$. The following notation is used to denote conditional probabilities:
\[ P[X_{t_1} = j_1, \ldots, X_{t_n} = j_n \mid X_{s_1} = i_1, \ldots, X_{s_m} = i_m] = p_X(j_1, t_1; \ldots; j_n, t_n \mid i_1, s_1; \ldots; i_m, s_m). \]
For brevity, conditional probabilities of the form $P[X_t = j \mid X_s = i]$ are written as $p_{ij}(s, t)$, and are called the transition probabilities of $X$.

The first order pmfs $\pi(t)$ and the transition probabilities $p_{ij}(s, t)$ determine all the finite order distributions of the Markov process as follows. Given
\[ t_1 < t_2 < \cdots < t_n \text{ in } \mathbf{T}, \qquad i_1, i_2, \ldots, i_n \in \mathcal{S}, \tag{4.14} \]
one writes
\[ p_X(i_1, t_1; \cdots; i_n, t_n) = p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_X(i_n, t_n \mid i_1, t_1; \cdots; i_{n-1}, t_{n-1}) = p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_{i_{n-1} i_n}(t_{n-1}, t_n). \]
Application of this operation $n - 2$ more times yields that
\[ p_X(i_1, t_1; \cdots; i_n, t_n) = \pi_{i_1}(t_1)\,p_{i_1 i_2}(t_1, t_2) \cdots p_{i_{n-1} i_n}(t_{n-1}, t_n) \tag{4.15} \]
which shows that the finite order distributions of $X$ are indeed determined by the first order pmfs and the transition probabilities. Equation (4.15) can be used to easily verify that the form (4.12) of the Markov property holds.
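The product formula (4.15) translates directly into a short computation. The sketch below is illustrative only: the function name `finite_order_probability`, the two-state example, and its numerical values are assumptions made here, with states labeled by the integers 0 and 1 and `step_matrices[k][i][j]` standing for $p_{ij}(t_{k+1}, t_{k+2})$.

```python
from itertools import product

def finite_order_probability(pi_t1, step_matrices, states):
    """Evaluate (4.15): pi_{i1}(t1) * p_{i1 i2}(t1, t2) * ... for one state sequence.

    pi_t1: first order pmf at time t1, indexed by state.
    step_matrices: one transition matrix per consecutive pair of times.
    states: the sequence (i1, ..., in), one state per time.
    """
    prob = pi_t1[states[0]]
    for matrix, i, j in zip(step_matrices, states, states[1:]):
        prob *= matrix[i][j]
    return prob

# Hypothetical two-state example with three observation times.
pi_t1 = [0.5, 0.5]
step = [[0.9, 0.1], [0.2, 0.8]]
step_matrices = [step, step]

p_001 = finite_order_probability(pi_t1, step_matrices, (0, 0, 1))

# Sanity check: summing (4.15) over all state sequences gives total mass one.
total = sum(
    finite_order_probability(pi_t1, step_matrices, seq)
    for seq in product(range(2), repeat=3)
)
```

Here `p_001` equals $0.5 \times 0.9 \times 0.1$, and `total` equals one up to rounding, as it must for a probability distribution on sequences.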
Given $s < t$, the collection $H(s, t)$ defined by $H(s, t) = (p_{ij}(s, t) : i, j \in \mathcal{S})$ should be thought of as a matrix, and it is called the transition probability matrix for the interval $[s, t]$. Let $e$ denote the column vector with all ones, indexed by $\mathcal{S}$. Since $\pi(t)$ and the rows of $H(s, t)$ are probability vectors, it follows that $\pi(t)e = 1$ and $H(s, t)e = e$. Computing the distribution of $X_t$ by summing over all possible values of $X_s$ yields that
\[ \pi_j(t) = \sum_i P[X_s = i, X_t = j] = \sum_i \pi_i(s)\,p_{ij}(s, t), \]
which in matrix form yields that $\pi(t) = \pi(s)H(s, t)$ for $s, t \in \mathbf{T}$, $s \leq t$. Similarly, given $s < \tau < t$, computing the conditional distribution of $X_t$ given $X_s$ by summing over all possible values of $X_\tau$ yields
\[ H(s, t) = H(s, \tau)H(\tau, t) \qquad s, \tau, t \in \mathbf{T},\ s < \tau < t. \tag{4.16} \]
The relations (4.16) are known as the Chapman-Kolmogorov equations.

A Markov process is time-homogeneous if the transition probabilities $p_{ij}(s, t)$ depend on $s$ and $t$ only through $t - s$. In that case we write $p_{ij}(t - s)$ instead of $p_{ij}(s, t)$, and $H_{ij}(t - s)$ instead of $H_{ij}(s, t)$. If the Markov process is time-homogeneous, then $\pi(s + \tau) = \pi(s)H(\tau)$ for $s, s + \tau \in \mathbf{T}$ and $\tau \geq 0$. A probability distribution $\pi$ is called an equilibrium (or invariant) distribution if $\pi H(\tau) = \pi$ for all $\tau \geq 0$.

Recall that a random process is stationary if its finite order distributions are invariant with respect to translation in time. On one hand, referring to (4.15), we see that a time-homogeneous Markov process is stationary if and only if $\pi(t) = \pi$ for all $t$ for some equilibrium distribution $\pi$. On the other hand, a Markov random process that is stationary is time-homogeneous.

Repeated application of the Chapman-Kolmogorov equations yields that $p_{ij}(s, t)$ can be expressed in terms of transition probabilities for $s$ and $t$ close together. For example, consider Markov processes with index set the integers.
Then $H(n, k + 1) = H(n, k)P(k)$ for $n \leq k$, where $P(k) = H(k, k + 1)$ is the one-step transition probability matrix. Fixing $n$ and using forward recursion starting with $H(n, n) = I$, $H(n, n + 1) = P(n)$, $H(n, n + 2) = P(n)P(n + 1)$, and so forth yields
\[ H(n, l) = P(n)P(n + 1) \cdots P(l - 1). \]
In particular, if the chain is time-homogeneous then $H(k) = P^k$ for all $k$, where $P$ is the time independent one-step transition probability matrix, and $\pi(l) = \pi(k)P^{l-k}$ for $l \geq k$. In this case a probability distribution $\pi$ is an equilibrium distribution if and only if $\pi P = \pi$.

Example 4.9.1 Consider a two-stage pipeline through which packets flow, as pictured in Figure 4.8. Some assumptions about the pipeline will be made in order to model it as a simple discrete-time Markov process. Each stage has a single buffer. Normalize time so that in one unit of time a packet can make a single transition. Call the time interval between $k$ and $k + 1$ the $k$th "time slot," and assume that the pipeline evolves in the following way during a given slot.

Figure 4.8: A two-stage pipeline.

If at the beginning of the slot there are no packets in stage one, then a new packet arrives to stage one with probability $a$, independently of the past history of the pipeline and of the outcome at stage two.

If at the beginning of the slot there is a packet in stage one and no packet in stage two, then the packet is transferred to stage two with probability $d_1$.

If at the beginning of the slot there is a packet in stage two, then the packet departs from the stage and leaves the system with probability $d_2$, independently of the state or outcome of stage one.

Figure 4.9: One-step transition probability diagram.
These assumptions lead us to model the pipeline as a discrete-time Markov process with the state space $\mathcal{S} = \{00, 01, 10, 11\}$, transition probability diagram shown in Figure 4.9 (using the notation $\bar{x} = 1 - x$), and one-step transition probability matrix $P$ given by
\[ P = \begin{pmatrix} \bar{a} & 0 & a & 0 \\ \bar{a}d_2 & \bar{a}\bar{d}_2 & ad_2 & a\bar{d}_2 \\ 0 & d_1 & \bar{d}_1 & 0 \\ 0 & 0 & d_2 & \bar{d}_2 \end{pmatrix}. \]
The rows of $P$ are probability vectors. For example, the first row is the probability distribution of the state at the end of a slot, given that the state is 00 at the beginning of a slot. Now that the model is specified, let us determine the throughput rate of the pipeline.

The equilibrium probability distribution $\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})$ is the probability vector satisfying the linear equation $\pi = \pi P$. Once $\pi$ is found, the throughput rate $\eta$ can be computed as follows. It is defined to be the rate (averaged over a long time) that packets transit the pipeline. Since at most two packets can be in the pipeline at a time, the following three quantities are all clearly the same, and can be taken to be the throughput rate: the rate of arrivals to stage one; the rate of departures from stage one (or rate of arrivals to stage two); and the rate of departures from stage two.

Focus on the first of these three quantities to obtain
\[ \eta = P[\text{an arrival at stage 1}] = P[\text{an arrival at stage 1} \mid \text{stage 1 empty at slot beginning}]\,P[\text{stage 1 empty at slot beginning}] = a(\pi_{00} + \pi_{01}). \]
Similarly, by focusing on departures from stage 1, obtain $\eta = d_1\pi_{10}$. Finally, by focusing on departures from stage 2, obtain $\eta = d_2(\pi_{01} + \pi_{11})$. These three expressions for $\eta$ must agree.

Consider the numerical example $a = d_1 = d_2 = 0.5$. The equation $\pi = \pi P$ yields that $\pi$ is proportional to the vector $(1, 2, 3, 1)$. Applying the fact that $\pi$ is a probability distribution yields that $\pi = (1/7, 2/7, 3/7, 1/7)$. Therefore $\eta = 3/14 = 0.214\ldots$.
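The numerical example can be reproduced with a few lines of linear algebra. The following is a sketch under stated assumptions rather than a prescribed method: it assumes NumPy is available, the helper names `pipeline_matrix` and `equilibrium` are hypothetical, and the equilibrium vector is found by solving $\pi(P - I) = 0$ together with the normalization $\sum_i \pi_i = 1$ as a least-squares problem.

```python
import numpy as np

def pipeline_matrix(a, d1, d2):
    """One-step transition matrix of Example 4.9.1, states ordered 00, 01, 10, 11."""
    a_bar, d1_bar, d2_bar = 1 - a, 1 - d1, 1 - d2
    return np.array([
        [a_bar,      0.0,            a,      0.0],
        [a_bar * d2, a_bar * d2_bar, a * d2, a * d2_bar],
        [0.0,        d1,             d1_bar, 0.0],
        [0.0,        0.0,            d2,     d2_bar],
    ])

def equilibrium(P):
    """Solve pi = pi P with sum(pi) = 1 (assumes the solution is unique)."""
    n = P.shape[0]
    # Stack the n balance equations and the normalization equation.
    A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

a = d1 = d2 = 0.5
P = pipeline_matrix(a, d1, d2)
pi = equilibrium(P)

# The three expressions for the throughput rate from the example.
eta_arrivals = a * (pi[0] + pi[1])
eta_stage1 = d1 * pi[2]
eta_stage2 = d2 * (pi[1] + pi[3])
```

With these parameter values the three throughput expressions agree with each other and with the value $3/14$ found in the example.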
In the remainder of this section we assume that $X$ is a continuous-time, finite-state Markov process. The transition probabilities for arbitrary time intervals can be described in terms of the transition probabilities over arbitrarily short time intervals. By saving only a linearization of the transition probabilities, the concept of generator matrix arises naturally, as we describe next.

Let $\mathcal{S}$ be a finite set. A pure-jump function for a finite state space $\mathcal{S}$ is a function $x : \mathbb{R}_+ \to \mathcal{S}$ such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \cdots$ with $\lim_{i \to \infty} \tau_i = \infty$, and a sequence of states with $s_i \neq s_{i+1}$, $i \geq 0$, such that $x(t) = s_i$ for $\tau_i \leq t < \tau_{i+1}$. A pure-jump Markov process is an $\mathcal{S}$ valued Markov process such that, with probability one, the sample functions are pure-jump functions.

Let $Q = (q_{ij} : i, j \in \mathcal{S})$ be such that
\[ q_{ij} \geq 0 \quad i, j \in \mathcal{S},\ i \neq j, \qquad q_{ii} = -\sum_{j \in \mathcal{S}, j \neq i} q_{ij} \quad i \in \mathcal{S}. \tag{4.17} \]
An example for state space $\mathcal{S} = \{1, 2, 3\}$ is
\[ Q = \begin{pmatrix} -1 & 0.5 & 0.5 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}, \]
and this matrix $Q$ can be represented by the transition rate diagram shown in Figure 4.10. A pure-jump, time-homogeneous Markov process $X$ has generator matrix $Q$ if the transition probabilities $(p_{ij}(\tau))$ satisfy
\[ \lim_{h \searrow 0} \frac{p_{ij}(h) - I_{\{i=j\}}}{h} = q_{ij} \qquad i, j \in \mathcal{S} \tag{4.18} \]
or equivalently
\[ p_{ij}(h) = I_{\{i=j\}} + hq_{ij} + o(h) \qquad i, j \in \mathcal{S} \tag{4.19} \]
where $o(h)$ represents a quantity such that $\lim_{h \to 0} o(h)/h = 0$. For the example this means that the transition probability matrix for a time interval of duration $h$ is given by
\[ \begin{pmatrix} 1 - h & 0.5h & 0.5h \\ h & 1 - 2h & h \\ 0 & h & 1 - h \end{pmatrix} + \begin{pmatrix} o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \end{pmatrix}. \]

Figure 4.10: Transition rate diagram for a continuous-time Markov process.

For small enough $h$, the rows of the first matrix are probability distributions, owing to the assumptions on the generator matrix $Q$.
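The last remark can be checked directly for the example matrix. The sketch below assumes NumPy is available; the helper names `is_generator`, `first_order_transition`, and `is_stochastic` are hypothetical and introduced only for this illustration. It forms the linearization $I + hQ$ from (4.19), ignoring the $o(h)$ term, and tests whether its rows are probability vectors.

```python
import numpy as np

# Example generator matrix for state space {1, 2, 3} from the text.
Q = np.array([[-1.0,  0.5,  0.5],
              [ 1.0, -2.0,  1.0],
              [ 0.0,  1.0, -1.0]])

def is_generator(Q):
    """Check (4.17): nonnegative off-diagonal entries and zero row sums."""
    off_diagonal = Q - np.diag(np.diag(Q))
    return bool(np.all(off_diagonal >= 0) and np.allclose(Q.sum(axis=1), 0.0))

def first_order_transition(Q, h):
    """Linearized transition matrix I + hQ from (4.19), without the o(h) term."""
    return np.eye(Q.shape[0]) + h * Q

def is_stochastic(M):
    """Check that every row of M is a probability vector."""
    return bool(np.all(M >= 0) and np.allclose(M.sum(axis=1), 1.0))

small_h_ok = is_stochastic(first_order_transition(Q, 0.1))
large_h_ok = is_stochastic(first_order_transition(Q, 0.6))
```

For this $Q$ the rows of $I + hQ$ always sum to one, and they are nonnegative exactly when $h \leq 0.5$, since the most negative diagonal entry is $-2$. This is what "small enough $h$" means here: `small_h_ok` is true while `large_h_ok` is false.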
Proposition 4.9.2 Given a matrix $Q$ satisfying (4.17), and a probability distribution $\pi(0) = (\pi_i(0) : i \in \mathcal{S})$, there is a pure-jump, time-homogeneous Markov process with generator matrix $Q$ and initial distribution $\pi(0)$. The finite order distributions of the process are uniquely determined by $\pi(0)$ and $Q$.

The first order distributions and the transition probabilities can be derived from $Q$ and an initial distribution $\pi(0)$ by solving differential equations, derived as follows. Fix $t > 0$ and let $h$ be a small positive number. The Chapman-Kolmogorov equations imply that
\[ \frac{\pi_j(t + h) - \pi_j(t)}{h} = \sum_{i \in \mathcal{S}} \pi_i(t)\left(\frac{p_{ij}(h) - I_{\{i=j\}}}{h}\right). \tag{4.20} \]
Letting $h$ converge to zero yields the differential equation:
\[ \frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in \mathcal{S}} \pi_i(t)q_{ij} \tag{4.21} \]
or, in matrix notation, $\frac{\partial \pi(t)}{\partial t} = \pi(t)Q$. These equations, known as the Kolmogorov forward equations, can be rewritten as
\[ \frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in \mathcal{S}, i \neq j} \pi_i(t)q_{ij} - \sum_{i \in \mathcal{S}, i \neq j} \pi_j(t)q_{ji}, \tag{4.22} \]
which shows that the rate of change of the probability of being at state $j$ is the rate of "probability flow" into state $j$ minus the rate of probability flow out of state $j$.

Example 4.9.3 Consider the two-state, continuous-time Markov process with the transition rate diagram shown in Figure 4.11 for some positive constants $\alpha$ and $\beta$. The generator matrix is given by
\[ Q = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}. \]

Figure 4.11: Transition rate diagram for a two-state continuous-time Markov process.

Let us solve the forward Kolmogorov equation for a given initial distribution $\pi(0)$. The equation for $\pi_1(t)$ is
\[ \frac{\partial \pi_1(t)}{\partial t} = -\alpha\pi_1(t) + \beta\pi_2(t); \qquad \pi_1(0) \text{ given.} \]
But $\pi_1(t) = 1 - \pi_2(t)$, so
\[ \frac{\partial \pi_1(t)}{\partial t} = -(\alpha + \beta)\pi_1(t) + \beta; \qquad \pi_1(0) \text{ given.} \]
By differentiation we check that this equation has the solution
\[ \pi_1(t) = \pi_1(0)e^{-(\alpha+\beta)t} + \int_0^t e^{-(\alpha+\beta)(t-s)}\beta\,ds = \pi_1(0)e^{-(\alpha+\beta)t} + \frac{\beta}{\alpha + \beta}\left(1 - e^{-(\alpha+\beta)t}\right), \]
so that
\[ \pi(t) = \pi(0)e^{-(\alpha+\beta)t} + \left(\frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta}\right)\left(1 - e^{-(\alpha+\beta)t}\right). \]
For any initial distribution $\pi(0)$,
\[ \lim_{t \to \infty} \pi(t) = \left(\frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta}\right). \]
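The closed-form solution for $\pi_1(t)$ can be cross-checked against a direct numerical integration of the forward equation. This is a minimal sketch, not a recommended solver: the function names are hypothetical, the forward Euler scheme and its step count are choices made here for simplicity, and the parameter values $\alpha = 2$, $\beta = 1$, $\pi_1(0) = 1$ are assumed for illustration.

```python
import math

def pi1_closed_form(pi1_0, alpha, beta, t):
    """Closed-form pi_1(t) from Example 4.9.3."""
    c = alpha + beta
    return pi1_0 * math.exp(-c * t) + (beta / c) * (1.0 - math.exp(-c * t))

def pi1_euler(pi1_0, alpha, beta, t, steps=100_000):
    """Forward Euler integration of d(pi_1)/dt = -(alpha + beta) pi_1 + beta."""
    h = t / steps
    p = pi1_0
    for _ in range(steps):
        p += h * (-(alpha + beta) * p + beta)
    return p

# Illustrative parameter values (assumptions, not from the notes).
alpha, beta, pi1_0 = 2.0, 1.0, 1.0

exact_at_1 = pi1_closed_form(pi1_0, alpha, beta, 1.0)
euler_at_1 = pi1_euler(pi1_0, alpha, beta, 1.0)

# For large t the solution approaches beta / (alpha + beta).
limit_value = pi1_closed_form(pi1_0, alpha, beta, 50.0)
```

The two values at $t = 1$ agree to within the discretization error of the Euler scheme, and `limit_value` is numerically indistinguishable from $\beta/(\alpha + \beta) = 1/3$.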
The rate of convergence is exponential, with rate parameter $\alpha + \beta$, and the limiting distribution is the unique probability distribution satisfying $\pi Q = 0$.

4.10 Space-time structure of discrete-state Markov processes

The previous section showed that the distribution of a time-homogeneous, discrete-state Markov process can be specified by an initial probability distribution, and either a one-step transition probability matrix $P$ (for discrete-time processes) or a generator matrix $Q$ (for continuous-time processes). Another way to describe these processes is to specify the space-time structure, which is simply the sequences of states visited and how long each state is visited. The space-time structure is discussed first for discrete-time processes, and then for continuous-time processes. One benefit is to show how little difference there is between discrete-time and continuous-time processes.

Let $(X_k : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with one-step transition probability matrix $P$. Let $T_k$ denote the time that elapses between the $k$th and $(k+1)$th jumps of $X$, and let $X^J(k)$ denote the state after $k$ jumps. See Fig. 4.12 for illustration.

Figure 4.12: Illustration of jump process and holding times.

More precisely, the holding times are defined by
\[ T_0 = \min\{t \geq 0 : X(t) \neq X(0)\} \tag{4.23} \]
\[ T_k = \min\{t \geq 0 : X(T_0 + \cdots + T_{k-1} + t) \neq X(T_0 + \cdots + T_{k-1})\} \tag{4.24} \]
and the jump process $X^J = (X^J(k) : k \geq 0)$ is defined by
\[ X^J(0) = X(0) \quad \text{and} \quad X^J(k) = X(T_0 + \cdots + T_{k-1}). \tag{4.25} \]
Clearly the holding times and jump process contain all the information needed to construct $X$, and vice versa. Thus, the following description of the joint distribution of the holding times and the jump process characterizes the distribution of $X$.
Proposition 4.10.1 Let $X = (X(k) : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with one-step transition probability matrix $P$.
(a) The jump process $X^J$ is itself a time-homogeneous Markov process, and its one-step transition probabilities are given by $p^J_{ij} = p_{ij}/(1 - p_{ii})$ for $i \neq j$, and $p^J_{ii} = 0$, $i, j \in \mathcal{S}$.
(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.
(c) Given $(X^J(0), \ldots, X^J(n)) = (j_0, \ldots, j_n)$, the variables $T_0, \ldots, T_n$ are conditionally independent, and the conditional distribution of $T_l$ is geometric with parameter $p_{j_l j_l}$:
\[ P[T_l = k \mid X^J(0) = j_0, \ldots, X^J(n) = j_n] = p_{j_l j_l}^{k-1}(1 - p_{j_l j_l}) \qquad 0 \leq l \leq n,\ k \geq 1. \]

Proof. Observe that if $X(0) = i$, then
\[ \{T_0 = k, X^J(1) = j\} = \{X(1) = i, X(2) = i, \ldots, X(k - 1) = i, X(k) = j\}, \]
so
\[ P[T_0 = k, X^J(1) = j \mid X(0) = i] = p_{ii}^{k-1}p_{ij} = [(1 - p_{ii})p_{ii}^{k-1}]\,p^J_{ij}. \tag{4.26} \]
Because for $i$ fixed the last expression in (4.26) displays the product of two probability distributions, conclude that given $X(0) = i$: $T_0$ has distribution $((1 - p_{ii})p_{ii}^{k-1} : k \geq 1)$, the geometric distribution of mean $1/(1 - p_{ii})$; $X^J(1)$ has distribution $(p^J_{ij} : j \in \mathcal{S})$ ($i$ fixed); and $T_0$ and $X^J(1)$ are independent.

More generally, check that, with $j_0 = i$,
\[ P[X^J(1) = j_1, \ldots, X^J(n) = j_n, T_0 = k_0, \ldots, T_n = k_n \mid X^J(0) = i] = p^J_{i j_1}p^J_{j_1 j_2} \cdots p^J_{j_{n-1} j_n} \prod_{l=0}^{n}\left(p_{j_l j_l}^{k_l - 1}(1 - p_{j_l j_l})\right). \]
This establishes the proposition.

Next we consider the space-time structure of time-homogeneous continuous-time pure-jump Markov processes. Essentially the only difference between the discrete- and continuous-time Markov processes is that the holding times for the continuous-time processes are exponentially distributed rather than geometrically distributed. Indeed, define the holding times $T_k$, $k \geq 0$, and the jump process $X^J$ using (4.23)-(4.25) as before.
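Before turning to the continuous-time statement, the discrete-time formulas of Proposition 4.10.1 can be sketched in a few lines. The helper names `jump_chain` and `mean_holding_time` are hypothetical, states are labeled by row index, and the matrix used for illustration is the pipeline matrix of Example 4.9.1 with $a = d_1 = d_2 = 0.5$ (an assumption made here only to have concrete numbers).

```python
def jump_chain(P):
    """Jump-chain matrix from part (a): p^J_ij = p_ij / (1 - p_ii), p^J_ii = 0.

    Assumes p_ii < 1 for every state, so that each state is eventually left.
    """
    n = len(P)
    PJ = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if P[i][i] >= 1.0:
            raise ValueError(f"state {i} is absorbing; the jump chain is undefined")
        for j in range(n):
            if j != i:
                PJ[i][j] = P[i][j] / (1.0 - P[i][i])
    return PJ

def mean_holding_time(P, i):
    """Mean of the geometric holding time in state i: 1 / (1 - p_ii)."""
    return 1.0 / (1.0 - P[i][i])

# Illustrative matrix: Example 4.9.1 with a = d1 = d2 = 0.5,
# states ordered 00, 01, 10, 11.
P = [
    [0.50, 0.00, 0.50, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.50, 0.50],
]
PJ = jump_chain(P)
holding_means = [mean_holding_time(P, i) for i in range(len(P))]
```

Each row of `PJ` is again a probability vector with a zero on the diagonal. For instance, from state 01 the jump chain moves to each of the other three states with probability $1/3$, and the mean holding time there is $4/3$ slots.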
Proposition 4.10.2 Let $X = (X(t) : t \in \mathbb{R}_+)$ be a time-homogeneous, pure-jump Markov process with generator matrix $Q$. Then
(a) The jump process $X^J$ is a discrete-time, time-homogeneous Markov process, and its one-step transition probabilities are given by
\[ p^J_{ij} = \begin{cases} -q_{ij}/q_{ii} & \text{for } i \neq j \\ 0 & \text{for } i = j \end{cases} \tag{4.27} \]
(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.
(c) Given $X^J(0) = j_0, \ldots, X^J(n) = j_n$, the variables $T_0, \ldots, T_n$ are conditionally independent, and the conditional distribution of $T_l$ is exponential with parameter $-q_{j_l j_l}$:
\[ P[T_l \geq c \mid X^J(0) = j_0, \ldots, X^J(n) = j_n] = \exp(cq_{j_l j_l}) \qquad 0 \leq l \leq n. \]

Figure 4.13: Illustration of sampling of a pure-jump function.

Proof. Fix $h > 0$ and define the "sampled" process $X^{(h)}$ by $X^{(h)}(k) = X(hk)$ for $k \geq 0$. See Fig. 4.13. Then $X^{(h)}$ is a discrete-time Markov process with one-step transition probabilities $p_{ij}(h)$ (the transition probabilities for the original process for an interval of length $h$). Let $(T^{(h)}_k : k \geq 0)$ denote the sequence of holding times and $(X^{J,h}(k) : k \geq 0)$ the jump process for the process $X^{(h)}$.

The assumption that with probability one the sample paths of $X$ are pure-jump functions implies that with probability one:
\[ \lim_{h \to 0}\,(X^{J,h}(0), X^{J,h}(1), \ldots, X^{J,h}(n), hT^{(h)}_0, hT^{(h)}_1, \ldots, hT^{(h)}_n) = (X^J(0), X^J(1), \ldots, X^J(n), T_0, T_1, \ldots, T_n). \tag{4.28} \]
Since convergence with probability one implies convergence in distribution, the goal of identifying the distribution of the random vector on the right-hand side of (4.28) can be accomplished by identifying the limit of the distribution of the vector on the left.

First, the limiting distribution of the process $X^{J,h}$ is identified.
Since $X^{(h)}$ has one-step transition probabilities $p_{ij}(h)$, the formula for the jump process probabilities for discrete-time processes (see Proposition 4.10.1, part (a)) yields that the one-step transition probabilities $p^{J,h}_{ij}$ for $X^{J,h}$ are given by
\[ p^{J,h}_{ij} = \frac{p_{ij}(h)}{1 - p_{ii}(h)} = \frac{p_{ij}(h)/h}{(1 - p_{ii}(h))/h} \to \frac{q_{ij}}{-q_{ii}} \quad \text{as } h \to 0 \tag{4.29} \]
for $i \neq j$, where the limit indicated in (4.29) follows from the definition (4.18) of the generator matrix $Q$. Thus, the limiting distribution of $X^{J,h}$ is that of a Markov process with one-step transition probabilities given by (4.27), establishing part (a) of the proposition. The conditional independence properties stated in (b) and (c) of the proposition follow in the limit from the corresponding properties for the jump process $X^{J,h}$ guaranteed by Proposition 4.10.1. Finally, since $\log(1 + \theta) = \theta + o(\theta)$ by Taylor's formula, we have for all $c \geq 0$ that
\[ P[hT^{(h)}_l > c \mid X^{J,h}(0) = j_0, \ldots, X^{J,h}(n) = j_n] = (p_{j_l j_l}(h))^{\lfloor c/h \rfloor} = \exp(\lfloor c/h \rfloor \log(p_{j_l j_l}(h))) = \exp(\lfloor c/h \rfloor (q_{j_l j_l}h + o(h))) \to \exp(q_{j_l j_l}c) \quad \text{as } h \to 0, \]
which establishes the remaining part of (c), and the proposition is proved.

4.11 Problems

4.1 Event probabilities for a simple random process
Define the random process $X$ by $X_t = 2A + Bt$ where $A$ and $B$ are independent random variables with $P[A = 1] = P[A = -1] = P[B = 1] = P[B = -1] = 0.5$. (a) Sketch the possible sample functions. (b) Find $P[X_t \geq 0]$ for all $t$. (c) Find $P[X_t \geq 0 \text{ for all } t]$.

4.2 Correlation function of a product
Let $Y$ and $Z$ be independent random processes with $R_Y(s, t) = 2\exp(-|s - t|)\cos(2\pi f(s - t))$ and $R_Z(s, t) = 9 + \exp(-3|s - t|^4)$. Find the autocorrelation function $R_X(s, t)$ where $X_t = Y_t Z_t$.

4.3 A sinusoidal random process
Let $X_t = A\cos(2\pi Vt + \Theta)$ where the amplitude $A$ has mean 2 and variance 4, the frequency $V$ in Hertz is uniform on $[0, 5]$, and the phase $\Theta$ is uniform on $[0, 2\pi]$. Furthermore, suppose $A$, $V$ and $\Theta$ are independent.
Find the mean function µ X (t) and autocorrelation function R X (s, t). Is X WSS? 4.4 Another sinusoidal random process Suppose that X 1 and X 2 are random variables such that EX 1 = EX 2 = EX 1 X 2 = 0 and Var(X 1 ) = Var(X 2 ) = σ 2 . Define Y t = X 1 cos(2πt) − X 2 sin(2πt). (a) Is the random process Y necessarily wide-sense stationary? (b) Give an example of random variables X 1 and X 2 satisfying the given conditions such that Y is stationary. (c) Give an example of random variables X 1 and X 2 satisfying the given conditions such that Y is not (strict sense) stationary. 4.5 A random line Let X = (X t : t ∈ R) be a random process such that X t = R − St for all t, where R and S are independent random variables, having the Rayleigh distribution with positive parameters σ 2 R and σ 2 S , respectively. (a) Indicate three typical sample paths of X in a single sketch. Describe in words the set of possible sample paths of X. (b) Is X a Markov process? Why or why not? (c) Does X have independent increments? Why or why not? (d) Let A denote the area of the triangle bounded by portions of the coordinate axes and the graph of X. Find E[A]. Simplify your answer as much as possible. 4.11. PROBLEMS 133 4.6 Brownian motion: Ascension and smoothing Let W be a Brownian motion process and suppose 0 ≤ r < s < t. (a) Find P¦W r ≤ W s ≤ W t ¦. (b) Find E[W s [W r , W t ]. (This part is unrelated to part (a).) 4.7 Brownian bridge Let W = (W t : t ≥ 0) be a standard Brownian motion (i.e. a Brownian motion with paramter σ 2 = 1.) Let B t = W t −tW 1 for 0 ≤ t ≤ 1. The process B = (B t : 0 ≤ t ≤ 1) is called a Brownian bridge process. Like W, B is a mean zero Gaussian random process. (a) Sketch a typical sample path of W, and the corresponding sample path of B. (b) Find the autocorrelation function of B. (c) Is B a Markov process? (d) Show that B is independent of the random variable W 1 . (This means that for any finite collec- tion, t 1 , . . . 
, t n ∈ [0, 1], the random vector (B t 1 , . . . , B tn ) T is independent of W 1 .) (e) (Due to J.L. Doob.) Let X t = (1 − t)W t 1−t , for 0 ≤ t < 1 and let X 1 = 0. Let X denote the random process X = (X t : 0 ≤ t ≤ 1). Like W, X is a mean zero, Gaussian random process. Find the autocorrelation function of X. Can you draw any conclusions? 4.8 Some Poisson process calculations Let N = (N t : t ≥ 0) be a Poisson process with rate λ > 0. (a) Give a simple expression for P[N 1 ≥ 1[N 2 = 2] in terms of λ. (b) Give a simple expression for P[N 2 = 2[N 1 ≥ 1] in terms of λ. (c) Let X t = N 2 t . Is X = (X t : t ≥ 0) a time-homogeneous Markov process? If so, give the transition probabilities p ij (τ). If not, explain. 4.9 A random process corresponding to a random parabola Define a random process X by X t = A+Bt +t 2 , where A and B are independent, N(0, 1) random variables. (a) Find ˆ E[X 5 [X 1 ], the linear minimum mean square error (LMMSE) estimator of X 5 given X 1 , and compute the mean square error. (b) Find the MMSE (possibly nonlinear) estimator of X 5 given X 1 , and compute the mean square error. (c) Find ˆ E[X 5 [X 0 , X 1 ] and compute the mean square error. (Hint: Can do by inspection.) 4.10 MMSE prediction for a Gaussian process based on two observations Let X be a stationary Gaussian process with mean zero and R X (τ) = 5 cos( πτ 2 )3 −[τ[ . (a) Find the covariance matrix of the random vector (X(2), X(3), X(4)) T . (b) Find E[X(4)[X(2)]. (c) Find E[X(4)[X(2), X(3)]. 4.11 A simple discrete-time random process Let U = (U n : n ∈ Z) consist of independent random variables, each uniformly distributed on the interval [0, 1]. Let X = (X k : k ∈ Z¦ be defined by X k = max¦U k−1 , U k ¦. (a) Sketch a typical sample path of the process X. (b) Is X stationary? (c) Is X Markov? (d) Describe the first order distributions of X. (e) Describe the second order distributions of X. 134 CHAPTER 4. 

4.12 Poisson process probabilities
Consider a Poisson process with rate $\lambda > 0$.
(a) Find the probability that there is (exactly) one count in each of the three intervals $[0,1]$, $[1,2]$, and $[2,3]$.
(b) Find the probability that there are two counts in the interval $[0,2]$ and two counts in the interval $[1,3]$. (Note: your answer to part (b) should be larger than your answer to part (a).)
(c) Find the probability that there are two counts in the interval $[1,2]$, given that there are two counts in the interval $[0,2]$ and two counts in the interval $[1,3]$.

4.13 Sliding function of an i.i.d. Poisson sequence
Let $X = (X_k : k \in \mathbb{Z})$ be a random process such that the $X_i$ are independent, Poisson random variables with mean $\lambda$, for some $\lambda > 0$. Let $Y = (Y_k : k \in \mathbb{Z})$ be the random process defined by $Y_k = X_k + X_{k+1}$.
(a) Show that $Y_k$ is a Poisson random variable with parameter $2\lambda$ for each $k$.
(b) Show that $X$ is a stationary random process.
(c) Is $Y$ a stationary random process? Justify your answer.

4.14 Adding jointly stationary Gaussian processes
Let $X$ and $Y$ be jointly stationary, jointly Gaussian random processes with mean zero, autocorrelation functions $R_X(t) = R_Y(t) = \exp(-|t|)$, and cross-correlation function $R_{XY}(t) = (0.5)\exp(-|t-3|)$.
(a) Let $Z(t) = (X(t) + Y(t))/2$ for all $t$. Find the autocorrelation function of $Z$.
(b) Is $Z$ a stationary random process? Explain.
(c) Find $P[X(1) \le 5Y(2) + 1]$. You may express your answer in terms of the standard normal cumulative distribution function $\Phi$.

4.15 Invariance of properties under transformations
Let $X = (X_n : n \in \mathbb{Z})$, $Y = (Y_n : n \in \mathbb{Z})$, and $Z = (Z_n : n \in \mathbb{Z})$ be random processes such that $Y_n = X_n^2$ for all $n$ and $Z_n = X_n^3$ for all $n$. Determine whether each of the following statements is always true. If true, give a justification. If not, give a simple counterexample.
(a) If $X$ is Markov then $Y$ is Markov.
(b) If $X$ is Markov then $Z$ is Markov.
(c) If $Y$ is Markov then $X$ is Markov.
(d) If $X$ is stationary then $Y$ is stationary.
(e) If $Y$ is stationary then $X$ is stationary.
(f) If $X$ is wide sense stationary then $Y$ is wide sense stationary.
(g) If $X$ has independent increments then $Y$ has independent increments.
(h) If $X$ is a martingale then $Z$ is a martingale.

4.16 A linear evolution equation with random coefficients
Let the variables $A_k, B_k$, $k \ge 0$, be mutually independent with mean zero. Let $A_k$ have variance $\sigma_A^2$ and let $B_k$ have variance $\sigma_B^2$ for all $k$. Define a discrete-time random process $Y$ by $Y = (Y_k : k \ge 0)$, such that $Y_0 = 0$ and $Y_{k+1} = A_k Y_k + B_k$ for $k \ge 0$.
(a) Find a recursive method for computing $P_k = E[(Y_k)^2]$ for $k \ge 0$.
(b) Is $Y$ a Markov process? Explain.
(c) Does $Y$ have independent increments? Explain.
(d) Find the autocorrelation function of $Y$. (You can use the second moments $(P_k)$ in expressing your answer.)
(e) Find the corresponding linear innovations sequence $(\widetilde{Y}_k : k \ge 1)$.

4.17 On an M/D/infinity system
Suppose customers enter a service system according to a Poisson point process on $\mathbb{R}$ of rate $\lambda$, meaning that the number of arrivals, $N(a,b]$, in an interval $(a,b]$, has the Poisson distribution with mean $\lambda(b-a)$, and the numbers of arrivals in disjoint intervals are independent. Suppose each customer stays in the system for one unit of time, independently of other customers. Because the arrival process is memoryless, because the service times are deterministic, and because the customers are served simultaneously, corresponding to infinitely many servers, this queueing system is called an M/D/$\infty$ queueing system. The number of customers in the system at time $t$ is given by $X_t = N(t-1, t]$.
(a) Find the mean and autocovariance function of $X$.
(b) Is $X$ stationary? Is $X$ wide sense stationary?
(c) Is $X$ a Markov process?
(d) Find a simple expression for $P\{X_t = 0 \text{ for } t \in [0,1]\}$ in terms of $\lambda$.
(e) Find a simple expression for $P\{X_t > 0 \text{ for } t \in [0,1]\}$ in terms of $\lambda$.

4.18 A fly on a cube
Consider a cube with vertices 000, 001, 010, 100, 110, 101, 011, 111. Suppose a fly walks along edges of the cube from vertex to vertex, and for any integer $t \ge 0$, let $X_t$ denote the vertex the fly occupies at time $t$. Assume $X = (X_t : t \ge 0)$ is a discrete-time Markov process, such that given $X_t$, the next state $X_{t+1}$ is equally likely to be any one of the three vertices neighboring $X_t$.
(a) Sketch the one-step transition probability diagram for $X$.
(b) Let $Y_t$ denote the distance, measured in number of hops, between vertex 000 and $X_t$. For example, if $X_t = 101$, then $Y_t = 2$. The process $Y$ is a Markov process with states 0, 1, 2, and 3. Sketch the one-step transition probability diagram for $Y$.
(c) Suppose the fly begins at vertex 000 at time zero. Let $\tau$ be the first time that $X$ returns to vertex 000 after time 0, or equivalently, the first time that $Y$ returns to 0 after time 0. Find $E[\tau]$.

4.19 Time elapsed since Bernoulli renewals
Let $U = (U_k : k \in \mathbb{Z})$ be such that for some $p \in (0,1)$, the random variables $U_k$ are independent, with each having the Bernoulli distribution with parameter $p$. Interpret $U_k = 1$ to mean that a renewal, or replacement, of some part takes place at time $k$. For $k \in \mathbb{Z}$, let $X_k = \min\{i \ge 1 : U_{k-i} = 1\}$. In words, $X_k$ is the time elapsed since the last renewal strictly before time $k$.
(a) The process $X$ is a time-homogeneous Markov process. Indicate a suitable state space, and describe the one-step transition probabilities.
(b) Find the distribution of $X_k$ for $k$ fixed.
(c) Is $X$ a stationary random process? Explain.
(d) Find the $k$-step transition probabilities, $p_{i,j}(k) = P\{X_{n+k} = j \mid X_n = i\}$.

4.20 A random process created by interpolation
Let $U = (U_k : k \in \mathbb{Z})$ be such that the $U_k$ are independent, and each is uniformly distributed on the interval $[0,1]$. Let $X = (X_t : t \in \mathbb{R})$ denote the continuous time random process obtained by linearly interpolating between the $U$'s.
Specifically, $X_n = U_n$ for any $n \in \mathbb{Z}$, and $X_t$ is affine on each interval of the form $[n, n+1]$ for $n \in \mathbb{Z}$.
(a) Sketch a sample path of $U$ and a corresponding sample path of $X$.
(b) Let $t \in \mathbb{R}$. Find and sketch the first order marginal density, $f_{X,1}(x,t)$. (Hint: Let $n = \lfloor t \rfloor$ and $a = t - n$, so that $t = n + a$. Then $X_t = (1-a)U_n + aU_{n+1}$. It's helpful to consider the cases $0 \le a \le 0.5$ and $0.5 < a < 1$ separately. For brevity, you need only consider the case $0 \le a \le 0.5$.)
(c) Is the random process $X$ WSS? Justify your answer.
(d) Find $P\{\max_{0 \le t \le 10} X_t \le 0.5\}$.

4.21 Reinforcing samples
(Due to G. Polya) Suppose at time $k = 2$, there is a bag with two balls in it, one orange and one blue. During each time step between $k$ and $k+1$, one of the balls is selected from the bag at random, with all balls in the bag having equal probability. That ball, and a new ball of the same color, are both put into the bag. Thus, at time $k$ there are $k$ balls in the bag, for all $k \ge 2$. Let $X_k$ denote the number of blue balls in the bag at time $k$.
(a) Is $X = (X_k : k \ge 2)$ a Markov process?
(b) Let $M_k = \frac{X_k}{k}$. Thus, $M_k$ is the fraction of balls in the bag at time $k$ that are blue. Determine whether $M = (M_k : k \ge 2)$ is a martingale.
(c) By the theory of martingales, since $M$ is a bounded martingale, it converges a.s. to some random variable $M_\infty$. Let $V_k = M_k(1 - M_k)$. Show that $E[V_{k+1} \mid V_k] = \frac{k(k+2)}{(k+1)^2} V_k$, and therefore that $E[V_k] = \frac{k+1}{6k}$. It follows that $\mathrm{Var}(\lim_{k\to\infty} M_k) = \frac{1}{12}$.
(d) More concretely, find the distribution of $M_k$ for each $k$, and then identify the distribution of the limit random variable, $M_\infty$.

4.22 Restoring samples
Suppose at time $k = 2$, there is a bag with two balls in it, one orange and one blue. During each time step between $k$ and $k+1$, one of the balls is selected from the bag at random, with all balls in the bag having equal probability. That ball, and a new ball of the other color, are both put into the bag.
Thus, at time $k$ there are $k$ balls in the bag, for all $k \ge 2$. Let $X_k$ denote the number of blue balls in the bag at time $k$.
(a) Is $X = (X_k : k \ge 2)$ a Markov process? If so, describe the one-step transition probabilities.
(b) Compute $E[X_{k+1} \mid X_k]$ for $k \ge 2$.
(c) Let $M_k = \frac{X_k}{k}$. Thus, $M_k$ is the fraction of balls in the bag at time $k$ that are blue. Determine whether $M = (M_k : k \ge 2)$ is a martingale.
(d) Let $D_k = M_k - \frac{1}{2}$. Show that
\[ E[D_{k+1}^2 \mid X_k] = \frac{1}{(k+1)^2}\left\{ k(k-2)D_k^2 + \frac{1}{4} \right\}. \]
(e) Let $v_k = E[D_k^2]$. Prove by induction on $k$ that $v_k \le \frac{1}{4k}$. What can you conclude about the limit of $M_k$ as $k \to \infty$? (Be sure to specify what sense(s) of limit you mean.)

4.23 A space-time transformation of Brownian motion
Suppose $X = (X_t : t \ge 0)$ is a real-valued, mean zero, independent increment process, and let $E[X_t^2] = \rho_t$ for $t \ge 0$. Assume $\rho_t < \infty$ for all $t$.
(a) Show that $\rho$ must be nonnegative and nondecreasing over $[0, \infty)$.
(b) Express the autocorrelation function $R_X(s,t)$ in terms of the function $\rho$ for all $s \ge 0$ and $t \ge 0$.
(c) Conversely, suppose a nonnegative, nondecreasing function $\rho$ on $[0, \infty)$ is given. Let $Y_t = W(\rho_t)$ for $t \ge 0$, where $W$ is a standard Brownian motion with $R_W(s,t) = \min\{s,t\}$. Explain why $Y$ is an independent increment process with $E[Y_t^2] = \rho_t$ for all $t \ge 0$.
(d) Define a process $Z$ in terms of a standard Brownian motion $W$ by $Z_0 = 0$ and $Z_t = tW(\frac{1}{t})$ for $t > 0$. Does $Z$ have independent increments? Justify your answer.

4.24 An M/M/1/B queueing system
Suppose $X$ is a continuous-time Markov process with the transition rate diagram shown, for a positive integer $B$ and positive constant $\lambda$.
[Transition rate diagram: states $0, 1, 2, \ldots, B-1, B$ in a row; each transition from a state to its right neighbor has rate $\lambda$, and each transition from a state to its left neighbor has rate 1.]
(a) Find the generator matrix, $Q$, of $X$ for $B = 4$.
(b) Find the equilibrium probability distribution. (Note: The process $X$ models the number of customers in a queueing system with a Poisson arrival process, exponential service times, one server, and a finite buffer.)

4.25 Identification of special properties of two discrete-time processes
Determine which of the properties: (i) Markov property, (ii) martingale property, (iii) independent increment property, are possessed by the following two random processes. Justify your answers.
(a) $X = (X_k : k \ge 0)$ defined recursively by $X_0 = 1$ and $X_{k+1} = (1 + X_k)U_k$ for $k \ge 0$, where $U_0, U_1, \ldots$ are independent random variables, each uniformly distributed on the interval $[0,1]$.
(b) $Y = (Y_k : k \ge 0)$ defined by $Y_0 = V_0$, $Y_1 = V_0 + V_1$, and $Y_k = V_{k-2} + V_{k-1} + V_k$ for $k \ge 2$, where $(V_k : k \in \mathbb{Z})$ are independent Gaussian random variables with mean zero and variance one.

4.26 Identification of special properties of two discrete-time processes (version 2)
Determine which of the properties: (i) Markov property, (ii) martingale property, (iii) independent increment property, are possessed by the following two random processes. Justify your answers.
(a) $(X_k : k \ge 0)$, where $X_k$ is the number of cells alive at time $k$ in a colony that evolves as follows. Initially, there is one cell, so $X_0 = 1$. During each discrete time step, each cell either dies or splits into two new cells, each possibility having probability one half. Suppose cells die or split independently.
(b) $(Y_k : k \ge 0)$, such that $Y_0 = 1$ and, for $k \ge 1$, $Y_k = U_1 U_2 \cdots U_k$, where $U_1, U_2, \ldots$ are independent random variables, each uniformly distributed over the interval $[0,2]$.

4.27 Identification of special properties of two continuous-time processes
Answer as in the previous problem, for the following two random processes:
(a) $Z = (Z_t : t \ge 0)$, defined by $Z_t = \exp\left(W_t - \frac{\sigma^2 t}{2}\right)$, where $W$ is a Brownian motion with parameter $\sigma^2$. (Hint: Observe that $E[Z_t] = 1$ for all $t$.)
(b) $R = (R_t : t \ge 0)$ defined by $R_t = D_1 + D_2 + \cdots + D_{N_t}$, where $N$ is a Poisson process with rate $\lambda > 0$ and $(D_i : i \ge 1)$ is an iid sequence of random variables, each having mean 0 and variance $\sigma^2$.

4.28 Identification of special properties of two continuous-time processes (version 2)
Answer as in the previous problem, for the following two random processes:
(a) $Z = (Z_t : t \ge 0)$, defined by $Z_t = W_t^3$, where $W$ is a Brownian motion with parameter $\sigma^2$.
(b) $R = (R_t : t \ge 0)$, defined by $R_t = \cos(2\pi t + \Theta)$, where $\Theta$ is uniformly distributed on the interval $[0, 2\pi]$.

4.29 A branching process
Let $p = (p_i : i \ge 0)$ be a probability distribution on the nonnegative integers with mean $m$. Consider a population beginning with a single individual, comprising generation zero. The offspring of the initial individual comprise the first generation, and, in general, the offspring of the $k$th generation comprise the $(k+1)$st generation. Suppose the number of offspring of any individual has the probability distribution $p$, independently of how many offspring other individuals have. Let $Y_0 = 1$, and for $k \ge 1$ let $Y_k$ denote the number of individuals in the $k$th generation.
(a) Is $Y = (Y_k : k \ge 0)$ a Markov process? Briefly explain your answer.
(b) Find constants $c_k$ so that $\frac{Y_k}{c_k}$ is a martingale.
(c) Let $a_m = P[Y_m = 0]$, the probability of extinction by the $m$th generation. Express $a_{m+1}$ in terms of the distribution $p$ and $a_m$. (Hint: condition on the value of $Y_1$, and note that the $Y_1$ subpopulations beginning with the $Y_1$ individuals in generation one are independent and statistically identical to the whole population.)
(d) Express the probability of eventual extinction, $a_\infty = \lim_{m\to\infty} a_m$, in terms of the distribution $p$. Under what condition is $a_\infty = 1$?
(e) Find $a_\infty$ in terms of $\theta$ in case $p_k = \theta^k(1-\theta)$ for $k \ge 0$ and $0 \le \theta < 1$. (This distribution is similar to the geometric distribution, and it has mean $m = \frac{\theta}{1-\theta}$.)
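
A reader who wants to sanity-check an answer to Problem 4.29(e) numerically could simulate the branching process. The following Python sketch is not part of the notes: the function names, the generation horizon, and the population cap `max_pop` are illustrative assumptions. It estimates the probability of extinction within a fixed number of generations, which only approximates $a_\infty$, and it counts any population that exceeds the cap as not extinct.

```python
import random

def sample_offspring(theta, rng):
    """Draw K with P{K = k} = theta**k * (1 - theta), k = 0, 1, 2, ..."""
    k = 0
    while rng.random() < theta:
        k += 1
    return k

def estimate_extinction(theta, generations=30, trials=5000, max_pop=1000, seed=0):
    """Monte Carlo estimate of P{Y_m = 0} for m = generations.

    Assumption: a population that grows beyond max_pop is treated as
    surviving, which keeps the run time bounded.
    """
    rng = random.Random(seed)
    extinct = 0
    for _ in range(trials):
        pop = 1  # generation zero consists of a single individual
        for _ in range(generations):
            pop = sum(sample_offspring(theta, rng) for _ in range(pop))
            if pop == 0 or pop > max_pop:
                break
        if pop == 0:
            extinct += 1
    return extinct / trials

if __name__ == "__main__":
    for theta in (0.3, 0.5, 0.7):
        print(theta, estimate_extinction(theta))
```

Comparing the printed estimates with a closed-form answer for several values of $\theta$ is a quick consistency check; agreement is only approximate because of sampling error and the finite horizon.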

4.30 Moving balls
Consider the motion of three indistinguishable balls on a linear array of positions, indexed by the positive integers, such that one or more balls can occupy the same position. Suppose that at time $t = 0$ there is one ball at position one, one ball at position two, and one ball at position three. Given the positions of the balls at some integer time $t$, the positions at time $t + 1$ are determined as follows. One of the balls in the leftmost occupied position is picked up, and one of the other two balls is selected at random (but not moved), with each choice having probability one half. The ball that was picked up is then placed one position to the right of the selected ball.
(a) Define a finite-state Markov process that tracks the relative positions of the balls. Try to use a small number of states. (Hint: Take the balls to be indistinguishable, and don't include the position numbers.) Describe the significance of each state, and give the one-step transition probability matrix for your process.
(b) Find the equilibrium distribution of your process.
(c) As time progresses, the balls all move to the right, and the average speed has a limiting value, with probability one. Find that limiting value. (You can use the fact that for a finite-state Markov process in which any state can eventually be reached from any other, the fraction of time the process is in a state $i$ up to time $t$ converges a.s. to the equilibrium probability for state $i$ as $t \to \infty$.)
(d) Consider the following continuous time version of the problem. Given the current state at time $t$, a move as described above happens in the interval $[t, t+h]$ with probability $h + o(h)$. Give the generator matrix $Q$, find its equilibrium distribution, and identify the long term average speed of the balls.

4.31 Mean hitting time for a discrete-time, discrete-state Markov process
Let $(X_k : k \ge 0)$ be a time-homogeneous Markov process with the one-step transition probability diagram shown.
[Transition probability diagram: states 1, 2, 3, with edge labels 0.2, 0.6, 0.6, 0.4, 0.4, 0.8.]
(a) Write down the one-step transition probability matrix $P$.
(b) Find the equilibrium probability distribution $\pi$.
(c) Let $\tau = \min\{k \ge 0 : X_k = 3\}$ and let $a_i = E[\tau \mid X_0 = i]$ for $1 \le i \le 3$. Clearly $a_3 = 0$. Derive equations for $a_1$ and $a_2$ by considering the possible values of $X_1$, in a way similar to the analysis of the gambler's ruin problem. Solve the equations to find $a_1$ and $a_2$.

4.32 Mean hitting time for a continuous-time, discrete-space Markov process
Let $(X_t : t \ge 0)$ be a time-homogeneous Markov process with the transition rate diagram shown.
[Transition rate diagram: states 1, 2, 3, with edge labels 10, 1, 5, 1.]
(a) Write down the rate matrix $Q$.
(b) Find the equilibrium probability distribution $\pi$.
(c) Let $\tau = \min\{t \ge 0 : X_t = 3\}$ and let $a_i = E[\tau \mid X_0 = i]$ for $1 \le i \le 3$. Clearly $a_3 = 0$. Derive equations for $a_1$ and $a_2$ by considering the possible values of $X_t(h)$ for small values of $h > 0$ and taking the limit as $h \to 0$. Solve the equations to find $a_1$ and $a_2$.

4.33 Poisson merger
Summing counting processes corresponds to "merging" point processes. Show that the sum of $K$ independent Poisson processes, having rates $\lambda_1, \ldots, \lambda_K$, respectively, is a Poisson process with rate $\lambda_1 + \cdots + \lambda_K$. (Hint: First formulate and prove a similar result for sums of random variables, and then think about what else is needed to get the result for Poisson processes. You can use the definition of a Poisson process or one of the equivalent descriptions given by Proposition 4.5.2 in the notes. Don't forget to check required independence properties.)

4.34 Poisson splitting
Consider a stream of customers modeled by a Poisson process with rate $\lambda$, and suppose each customer is one of $K$ types. Let $(p_1, \ldots, p_K)$ be a probability vector, and suppose that for each $k$, the $k$th customer is type $i$ with probability $p_i$. The types of the customers are mutually independent and also independent of the arrival times of the customers.
Show that the stream of customers of a given type $i$ is again a Poisson stream, and that its rate is $\lambda p_i$. (Same hint as in the previous problem applies.) Show furthermore that the $K$ substreams are mutually independent.

4.35 Poisson method for coupon collector's problem
(a) Suppose a stream of coupons arrives according to a Poisson process $(A(t) : t \ge 0)$ with rate $\lambda = 1$, and suppose there are $k$ types of coupons. (In network applications, the coupons could be pieces of a file to be distributed by some sort of gossip algorithm.) The type of each coupon in the stream is randomly drawn from the $k$ types, each possibility having probability $\frac{1}{k}$, and the types of different coupons are mutually independent. Let $p(k,t)$ be the probability that at least one coupon of each type arrives by time $t$. (The letter "p" is used here because the number of coupons arriving by time $t$ has the Poisson distribution.) Express $p(k,t)$ in terms of $k$ and $t$.
(b) Find $\lim_{k\to\infty} p(k, k\ln k + kc)$ for an arbitrary constant $c$. That is, find the limit of the probability that the collection is complete at time $t = k \ln k + kc$. (Hint: If $a_k \to a$ as $k \to \infty$, then $(1 + \frac{a_k}{k})^k \to e^a$.)
(c) The rest of this problem shows that the limit found in part (b) also holds if the total number of coupons is deterministic, rather than Poisson distributed. One idea is that if $t$ is large, then $A(t)$ is not too far from its mean with high probability. Show, specifically, that
\[ \lim_{k\to\infty} P[A(k \ln k + kc) \ge k \ln k + kc'] = \begin{cases} 0 & \text{if } c < c' \\ 1 & \text{if } c > c' \end{cases} \]
(d) Let $d(k,n)$ denote the probability that the collection is complete after $n$ coupon arrivals. (The letter "d" is used here because the number of coupons, $n$, is deterministic.) Show that for any $k$, $t$, and $n$ fixed,
\[ d(k,n)\,P[A(t) \ge n] \le p(k,t) \le P[A(t) \ge n] + P[A(t) \le n]\,d(k,n). \]
(e) Combine parts (c) and (d) to identify $\lim_{k\to\infty} d(k, k \ln k + kc)$.
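
The quantity $d(k,n)$ in Problem 4.35(d) can also be estimated by direct simulation, which gives a numerical check on a limit conjectured in part (e). The Python sketch below is an illustration under stated assumptions rather than part of the notes: the function name `estimate_d`, the trial count, and the seed are hypothetical choices, and the estimate carries ordinary Monte Carlo error.

```python
import math
import random

def estimate_d(k, n, trials=10_000, seed=0):
    """Monte Carlo estimate of d(k, n): the probability that n coupons,
    each uniformly distributed over k types and mutually independent,
    include at least one coupon of every type."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            seen.add(rng.randrange(k))  # coupon type in {0, ..., k-1}
        if len(seen) == k:
            complete += 1
    return complete / trials

if __name__ == "__main__":
    c = 0.5  # arbitrary constant, as in part (b)
    for k in (10, 50, 200):
        n = round(k * math.log(k) + k * c)
        print(k, n, estimate_d(k, n, trials=2000))
```

Running the loop for increasing $k$ with $c$ held fixed shows how quickly the estimates settle, which can then be compared with the limit derived in parts (b) through (e).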

4.36 Some orthogonal martingales based on Brownian motion
(This problem is related to the problem on linear innovations and orthogonal polynomials in the previous problem set.) Let $W = (W_t : t \ge 0)$ be a Brownian motion with $\sigma^2 = 1$ (called a standard Brownian motion), and let $M_t = \exp\left(\theta W_t - \frac{\theta^2 t}{2}\right)$ for an arbitrary constant $\theta$.
(a) Show that $(M_t : t \ge 0)$ is a martingale. (Hint for parts (a) and (b): For notational brevity, let $J_s$ represent $(W_u : 0 \le u \le s)$ for the purposes of conditioning. If $Z_t$ is a function of $W_t$ for each $t$, then a sufficient condition for $Z$ to be a martingale is that $E[Z_t \mid J_s] = Z_s$ whenever $0 < s < t$, because then $E[Z_t \mid Z_u, 0 \le u \le s] = E[E[Z_t \mid J_s] \mid Z_u, 0 \le u \le s] = E[Z_s \mid Z_u, 0 \le u \le s] = Z_s$.)
(b) By the power series expansion of the exponential function,
\[ \exp\left(\theta W_t - \frac{\theta^2 t}{2}\right) = 1 + \theta W_t + \frac{\theta^2}{2}(W_t^2 - t) + \frac{\theta^3}{3!}(W_t^3 - 3tW_t) + \cdots = \sum_{n=0}^{\infty} \frac{\theta^n}{n!} M_n(t), \]
where $M_n(t) = t^{n/2} H_n\left(\frac{W_t}{\sqrt{t}}\right)$, and $H_n$ is the $n$th Hermite polynomial. The fact that $M$ is a martingale for any value of $\theta$ can be used to show that $M_n$ is a martingale for each $n$ (you don't need to supply details). Verify directly that $W_t^2 - t$ and $W_t^3 - 3tW_t$ are martingales.
(c) For fixed $t$, $(M_n(t) : n \ge 0)$ is a sequence of orthogonal random variables, because it is the linear innovations sequence for the variables $1, W_t, W_t^2, \ldots$. Use this fact and the martingale property of the $M_n$ processes to show that if $n \ne m$ and $s, t \ge 0$, then $M_n(s) \perp M_m(t)$.

4.37 A state space reduction preserving the Markov property
Consider a time-homogeneous, discrete-time Markov process $X = (X_k : k \ge 0)$ with state space $\mathcal{S} = \{1, 2, 3\}$, initial state $X_0 = 3$, and one-step transition probability matrix
\[ P = \begin{pmatrix} 0.0 & 0.8 & 0.2 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.8 & 0.0 \end{pmatrix}. \]
(a) Sketch the transition probability diagram and find the equilibrium probability distribution $\pi = (\pi_1, \pi_2, \pi_3)$.
(b) Identify a function $f$ on the state space $\mathcal{S}$ so that $f(s) = a$ for two choices of $s$ and $f(s) = b$ for the third choice of $s$, where $a \ne b$, such that the process $Y = (Y_k : k \ge 0)$ defined by $Y_k = f(X_k)$ is a Markov process with only two states, and give the one-step transition probability matrix of $Y$. Briefly explain your answer.

4.38 * Autocorrelation function of a stationary Markov process
Let $X = (X_k : k \in \mathbb{Z})$ be a Markov process such that the state space, $\{\rho_1, \rho_2, \ldots, \rho_n\}$, is a finite subset of the real numbers. Let $P = (p_{ij})$ denote the matrix of one-step transition probabilities. Let $e$ be the column vector of all ones, and let $\pi(k)$ be the row vector $\pi(k) = (P[X_k = \rho_1], \ldots, P[X_k = \rho_n])$.
(a) Show that $Pe = e$ and $\pi(k+1) = \pi(k)P$.
(b) Show that if the Markov chain $X$ is a stationary random process then $\pi(k) = \pi$ for all $k$, where $\pi$ is a vector such that $\pi = \pi P$.
(c) Prove the converse of part (b).
(d) Show that $P[X_{k+m} = \rho_j \mid X_k = \rho_i, X_{k-1} = s_1, \ldots, X_{k-m} = s_m] = p_{ij}^{(m)}$, where $p_{ij}^{(m)}$ is the $i,j$th element of the $m$th power of $P$, $P^m$, and $s_1, \ldots, s_m$ are arbitrary states.
(e) Assume that $X$ is stationary. Express $R_X(k)$ in terms of $P$, $(\rho_i)$, and the vector $\pi$ of parts (b) and (c).

Chapter 5 Inference for Markov Models

This chapter gives a glimpse of the theory of iterative algorithms for graphical models, as well as an introduction to statistical estimation theory. It begins with a brief introduction to estimation theory: maximum likelihood and Bayesian estimators are introduced, and an iterative algorithm, known as the expectation-maximization algorithm, for computation of maximum likelihood estimators in certain contexts, is described. This general background is then focused on three inference problems posed using Markov models.

5.1 A bit of estimation theory

The two most commonly used methods for producing estimates of unknown quantities are the maximum likelihood (ML) and Bayesian methods.
These two methods are briefly described in this section, beginning with the ML method. Suppose a parameter $\theta$ is to be estimated, based on observation of a random variable $Y$. An estimator of $\theta$ based on $Y$ is a function $\widehat{\theta}$, which for each possible observed value $y$, gives the estimate $\widehat{\theta}(y)$. The ML method is based on the assumption that $Y$ has a pmf $p_Y(y|\theta)$ (if $Y$ is discrete type) or a pdf $f_Y(y|\theta)$ (if $Y$ is continuous type), where $\theta$ is the unknown parameter to be estimated, and the family of functions $p_Y(y|\theta)$ or $f_Y(y|\theta)$ is known.

Definition 5.1.1 For a particular value $y$ and parameter value $\theta$, the likelihood of $y$ for $\theta$ is $p_Y(y|\theta)$, if $Y$ is discrete type, or $f_Y(y|\theta)$, if $Y$ is continuous type. The maximum likelihood estimate of $\theta$ given $Y = y$ for a particular $y$ is the value of $\theta$ that maximizes the likelihood of $y$. That is, the maximum likelihood estimator $\widehat{\theta}_{ML}$ is given by $\widehat{\theta}_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$, or $\widehat{\theta}_{ML}(y) = \arg\max_\theta f_Y(y|\theta)$.

Note that the maximum likelihood estimator is not defined as one maximizing the likelihood of the parameter $\theta$ to be estimated. In fact, $\theta$ need not even be a random variable. Rather, the maximum likelihood estimator is defined by selecting the value of $\theta$ that maximizes the likelihood of the observation.

Example 5.1.2 Suppose $Y$ is assumed to be a $N(\theta, \sigma^2)$ random variable, where $\sigma^2$ is known. Equivalently, we can write $Y = \theta + W$, where $W$ is a $N(0, \sigma^2)$ random variable. Given that a value $y$ is observed, the ML estimator is obtained by maximizing
\[ f_Y(y|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\theta)^2}{2\sigma^2}\right) \]
with respect to $\theta$. By inspection, $\widehat{\theta}_{ML}(y) = y$.

Example 5.1.3 Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable, for some $\theta > 0$. Given the observation $Y = k$ for some fixed $k \ge 0$, the ML estimator is obtained by maximizing $p_Y(k|\theta) = \frac{e^{-\theta}\theta^k}{k!}$ with respect to $\theta$. Equivalently, dropping the constant $k!$ and taking the logarithm, $\theta$ is to be selected to maximize $-\theta + k \ln\theta$.
The derivative is $-1 + k/\theta$, which is positive for $\theta < k$ and negative for $\theta > k$. Hence, $\widehat{\theta}_{ML}(k) = k$.

Note that in the ML method, the quantity to be estimated, $\theta$, is not assumed to be random. This has the advantage that the modeler does not have to come up with a probability distribution for $\theta$, and can still impose hard constraints on $\theta$. But the ML method does not permit incorporation of soft probabilistic knowledge the modeler may have about $\theta$ before any observation is used.

The Bayesian method is based on estimating a random quantity. Thus, in the end, the variable to be estimated, say $Z$, and the observation, say $Y$, are jointly distributed random variables.

Definition 5.1.4 The Bayesian estimator of $Z$ given $Y$, for jointly distributed random variables $Z$ and $Y$, and cost function $C(z, \widehat{z})$, is the function $\widehat{Z} = g(Y)$ of $Y$ which minimizes the average cost, $E[C(Z, \widehat{Z})]$.

The assumed distribution of $Z$ is called the prior or a priori distribution, whereas the conditional distribution of $Z$ given $Y$ is called the posterior or a posteriori distribution. In particular, if $Z$ is discrete, there is a prior pmf, $p_Z$, and a posterior pmf, $p_{Z|Y}$, or if $Z$ and $Y$ are jointly continuous, there is a prior pdf, $f_Z$, and a posterior pdf, $f_{Z|Y}$.

One of the most common choices of the cost function is the squared error, $C(z, \widehat{z}) = (z - \widehat{z})^2$, for which the Bayesian estimators are the minimum mean squared error (MMSE) estimators, examined in Chapter 3. Recall that the MMSE estimators are given by the conditional expectation, $g(y) = E[Z \mid Y = y]$, which, given the observation $Y = y$, is the mean of the posterior distribution of $Z$ given $Y = y$.

A commonly used choice of $C$ in case $Z$ is a discrete random variable is $C(z, \widehat{z}) = I_{\{z \ne \widehat{z}\}}$. In this case, the Bayesian objective is to select $\widehat{Z}$ to minimize $P\{Z \ne \widehat{Z}\}$, or equivalently, to maximize $P\{Z = \widehat{Z}\}$. For an estimator $\widehat{Z} = g(Y)$,
\[ P\{Z = \widehat{Z}\} = \sum_y P[Z = g(y) \mid Y = y]\, p_Y(y) = \sum_y p_{Z|Y}(g(y)|y)\, p_Y(y). \]
So a Bayesian estimator for $C(z, \widehat{z}) = I_{\{z \ne \widehat{z}\}}$ is one such that $g(y)$ maximizes $P[Z = g(y) \mid Y = y]$ for each $y$. That is, for each $y$, $g(y)$ is a maximizer of the posterior pmf of $Z$. The estimator, called the maximum a posteriori probability (MAP) estimator, can be written concisely as
\[ \widehat{Z}_{MAP}(y) = \arg\max_z p_{Z|Y}(z|y). \]

Suppose there is a parameter $\theta$ to be estimated based on an observation $Y$, and suppose that the pmf of $Y$, $p_Y(y|\theta)$, is known for each $\theta$. This is enough to determine the ML estimator, but determination of a Bayesian estimator requires, in addition, a choice of cost function $C$ and a prior probability distribution (i.e. a distribution for $\theta$). For example, if $\theta$ is a discrete variable, the Bayesian method would require that a prior pmf for $\theta$ be selected. In that case, we can view the parameter to be estimated as a random variable, which we might denote by the upper case symbol $\Theta$, and the prior pmf could be denoted by $p_\Theta(\theta)$. Then, as required by the Bayesian method, the variable to be estimated, $\Theta$, and the observation, $Y$, would be jointly distributed random variables. The joint pmf would be given by $p_{\Theta,Y}(\theta, y) = p_\Theta(\theta)\, p_Y(y|\theta)$. The posterior probability distribution can be expressed as a conditional pmf, by Bayes' formula:
\[ p_{\Theta|Y}(\theta|y) = \frac{p_\Theta(\theta)\, p_Y(y|\theta)}{p_Y(y)} \tag{5.1} \]
where $p_Y(y) = \sum_{\theta'} p_{\Theta,Y}(\theta', y)$. Given $y$, the value of the MAP estimator is a value of $\theta$ that maximizes $p_{\Theta|Y}(\theta|y)$ with respect to $\theta$. For that purpose, the denominator in the right-hand side of (5.1) can be ignored, so that the MAP estimator is given by
\[ \widehat{\Theta}_{MAP}(y) = \arg\max_\theta p_{\Theta|Y}(\theta|y) = \arg\max_\theta p_\Theta(\theta)\, p_Y(y|\theta). \tag{5.2} \]
The expression, (5.2), for $\widehat{\Theta}_{MAP}(y)$ is rather similar to the expression for the ML estimator, $\widehat{\theta}_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$. In fact, the two estimators agree if the prior $p_\Theta(\theta)$ is uniform, meaning it is the same for all $\theta$.
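
The relationship between (5.2) and the ML estimator can be checked numerically on a small discrete example. The Python sketch below is an illustration only: the finite parameter set, the Poisson observation model, the two priors, and all function names are assumptions chosen for the demonstration, not taken from the notes.

```python
import math

def poisson_pmf(k, theta):
    """Likelihood p_Y(k | theta) for a Poi(theta) observation."""
    return math.exp(-theta) * theta**k / math.factorial(k)

def ml_estimate(y, thetas, likelihood):
    """arg max over theta of p_Y(y | theta)."""
    return max(thetas, key=lambda th: likelihood(y, th))

def map_estimate(y, thetas, prior, likelihood):
    """arg max over theta of p_Theta(theta) * p_Y(y | theta), as in (5.2)."""
    return max(thetas, key=lambda th: prior[th] * likelihood(y, th))

def posterior(y, thetas, prior, likelihood):
    """Posterior pmf from Bayes' formula (5.1)."""
    joint = {th: prior[th] * likelihood(y, th) for th in thetas}
    p_y = sum(joint.values())  # p_Y(y), the normalizing constant
    return {th: v / p_y for th, v in joint.items()}

# Hypothetical setup: theta restricted to a finite set of candidate values.
thetas = [1, 2, 3, 4, 5]
uniform_prior = {th: 1 / len(thetas) for th in thetas}
skewed_prior = {1: 0.6, 2: 0.25, 3: 0.1, 4: 0.04, 5: 0.01}
y = 4

theta_ml = ml_estimate(y, thetas, poisson_pmf)
theta_map_uniform = map_estimate(y, thetas, uniform_prior, poisson_pmf)
theta_map_skewed = map_estimate(y, thetas, skewed_prior, poisson_pmf)
print(theta_ml, theta_map_uniform, theta_map_skewed)
```

With the uniform prior the MAP and ML estimates coincide, while the prior that favors small values of $\theta$ pulls the MAP estimate below the ML estimate. Dividing by $p_Y(y)$ in `posterior` rescales every candidate by the same factor, which is why `map_estimate` can skip the normalization.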

The MAP criterion for selecting estimators can be extended to the case that $Y$ and $\theta$ are jointly continuous variables, leading to the following:
\[ \widehat{\Theta}_{MAP}(y) = \arg\max_\theta f_{\Theta|Y}(\theta|y) = \arg\max_\theta f_\Theta(\theta)\, f_Y(y|\theta). \tag{5.3} \]
In this case, the probability that any estimator is exactly equal to $\theta$ is zero, but taking $\widehat{\Theta}_{MAP}(y)$ to maximize the posterior pdf maximizes the probability that the estimator is within $\epsilon$ of the true value of $\theta$, in an asymptotic sense as $\epsilon \to 0$.

Example 5.1.5 Suppose $Y$ is assumed to be a $N(\theta, \sigma^2)$ random variable, where the variance $\sigma^2$ is known and $\theta$ is to be estimated. Using the Bayesian method, suppose the prior density of $\theta$ is the $N(0, b^2)$ density for some known parameter $b^2$. Equivalently, we can write $Y = \Theta + W$, where $\Theta$ is a $N(0, b^2)$ random variable and $W$ is a $N(0, \sigma^2)$ random variable, independent of $\Theta$. By the properties of joint Gaussian densities given in Chapter 3, given $Y = y$, the posterior distribution (i.e. the conditional distribution of $\Theta$ given $y$) is the normal distribution with mean $\widehat{E}[\Theta \mid Y = y] = \frac{b^2 y}{b^2 + \sigma^2}$ and variance $\frac{b^2\sigma^2}{b^2 + \sigma^2}$. The mean and maximizing value of this conditional density are both equal to $\widehat{E}[\Theta \mid Y = y]$. Therefore, $\widehat{\Theta}_{MMSE}(y) = \widehat{\Theta}_{MAP}(y) = \widehat{E}[\Theta \mid Y = y]$. It is interesting to compare this example to Example 5.1.2. The Bayesian estimators (MMSE and MAP) are both smaller in magnitude than $\widehat{\theta}_{ML}(y) = y$, by the factor $\frac{b^2}{b^2 + \sigma^2}$. If $b^2$ is small compared to $\sigma^2$, the prior information indicates that $|\theta|$ is believed to be small, resulting in the Bayes estimators being smaller in magnitude than the ML estimator. As $b^2 \to \infty$, the prior distribution gets increasingly uniform, and the Bayes estimators converge to the ML estimator.

Example 5.1.6 Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian method, suppose the prior distribution for $\theta$ is the uniform distribution over the interval $[0, \theta_{max}]$, for some known value $\theta_{max}$.
Given the observation $Y = k$ for some fixed $k \ge 0$, the MAP estimator is obtained by maximizing
\[ p_Y(k|\theta) f_\Theta(\theta) = \frac{e^{-\theta}\theta^k}{k!} \cdot \frac{I_{\{0 \le \theta \le \theta_{\max}\}}}{\theta_{\max}} \]
with respect to $\theta$. As seen in Example 5.1.3, the term $\frac{e^{-\theta}\theta^k}{k!}$ is increasing in $\theta$ for $\theta < k$ and decreasing in $\theta$ for $\theta > k$. Therefore,
\[ \widehat{\Theta}_{MAP}(k) = \min\{k, \theta_{\max}\}. \]
It is interesting to compare this example to Example 5.1.3. Intuitively, the prior probability distribution indicates knowledge that $\theta \le \theta_{\max}$, but no more than that, because the prior restricted to $\theta \le \theta_{\max}$ is uniform. If $\theta_{\max}$ is less than $k$, the MAP estimator is strictly smaller than $\widehat{\theta}_{ML}(k) = k$. As $\theta_{\max} \to \infty$, the MAP estimator converges to the ML estimator. Actually, deterministic prior knowledge, such as $\theta \le \theta_{\max}$, can also be incorporated into ML estimation as a hard constraint.

The next example makes use of the following lemma.

Lemma 5.1.7 Suppose $c_i \ge 0$ for $1 \le i \le n$ and that $c = \sum_{i=1}^n c_i > 0$. Then $\sum_{i=1}^n c_i \log p_i$ is maximized over all probability vectors $p = (p_1, \ldots, p_n)$ by $p_i = c_i/c$.

Proof. If $c_j = 0$ for some $j$, then clearly $p_j = 0$ for the maximizing probability vector. By eliminating such terms from the sum, we can assume without loss of generality that $c_i > 0$ for all $i$. The function to be maximized is a strictly concave function of $p$ over a region with linear constraints. The positivity constraints, namely $p_i \ge 0$, will be satisfied with strict inequality. The remaining constraint is the equality constraint, $\sum_{i=1}^n p_i = 1$. We thus introduce a Lagrange multiplier $\lambda$ for the equality constraint and seek the stationary point of the Lagrangian
\[ L(p, \lambda) = \sum_{i=1}^n c_i \log p_i - \lambda\left(\Big(\sum_{i=1}^n p_i\Big) - 1\right). \]
By definition, the stationary point is the point at which the partial derivatives with respect to the variables $p_i$ are all zero. Setting $\frac{\partial L}{\partial p_i} = \frac{c_i}{p_i} - \lambda = 0$ yields that $p_i = \frac{c_i}{\lambda}$ for all $i$. To satisfy the linear constraint, $\lambda$ must equal $c$.

Example 5.1.8 Suppose $b = (b_1, b_2, \ldots, b_n)$ is a probability vector to be estimated by observing $Y = (Y_1, \ldots, Y_T)$. Assume $Y_1, \ldots, Y_T$ are independent, with each $Y_t$ having probability distribution $b$: $P\{Y_t = i\} = b_i$ for $1 \le t \le T$ and $1 \le i \le n$. We shall determine the maximum likelihood estimate, $\widehat{b}_{ML}(y)$, given a particular observation $y = (y_1, \ldots, y_T)$. The likelihood to be maximized with respect to $b$ is
\[ p(y|b) = b_{y_1} \cdots b_{y_T} = \prod_{i=1}^n b_i^{k_i}, \]
where $k_i = |\{t : y_t = i\}|$. The log likelihood is $\ln p(y|b) = \sum_{i=1}^n k_i \ln(b_i)$. By Lemma 5.1.7, this is maximized by the empirical distribution of the observations, namely $b_i = \frac{k_i}{T}$ for $1 \le i \le n$. That is, $\widehat{b}_{ML} = \left(\frac{k_1}{T}, \ldots, \frac{k_n}{T}\right)$.

Example 5.1.9 This is a Bayesian version of the previous example. Suppose $b = (b_1, b_2, \ldots, b_n)$ is a probability vector to be estimated by observing $Y = (Y_1, \ldots, Y_T)$, and assume $Y_1, \ldots, Y_T$ are independent, with each $Y_t$ having probability distribution $b$. For the Bayesian method, a distribution of the unknown distribution $b$ must be assumed. That is right, a distribution of the distribution is needed. A convenient choice is the following. Suppose for some known numbers $d_i \ge 1$ that $(b_1, \ldots, b_{n-1})$ has the prior density:
\[ f_B(b) = \begin{cases} \dfrac{\prod_{i=1}^n b_i^{d_i - 1}}{Z(d)} & \text{if } b_i \ge 0 \text{ for } 1 \le i \le n-1, \text{ and } \sum_{i=1}^{n-1} b_i \le 1 \\ 0 & \text{else} \end{cases} \]
where $b_n = 1 - b_1 - \cdots - b_{n-1}$, and $Z(d)$ is a constant chosen so that $f_B$ integrates to one. A larger value of $d_i$ for a fixed $i$ expresses an a priori guess that the corresponding value $b_i$ may be larger. It can be shown, in particular, that if $B$ has this prior distribution, then $E[B_i] = \frac{d_i}{d_1 + \cdots + d_n}$. The MAP estimate, $\widehat{b}_{MAP}(y)$, for a given observation vector $y$, is given by:
\[ \widehat{b}_{MAP}(y) = \arg\max_b \ln\left(f_B(b) p(y|b)\right) = \arg\max_b \left\{ -\ln(Z(d)) + \sum_{i=1}^n (d_i - 1 + k_i) \ln(b_i) \right\}. \]
By Lemma 5.1.7, $\widehat{b}_{MAP}(y) = \left(\frac{d_1 - 1 + k_1}{\widetilde{T}}, \ldots, \frac{d_n - 1 + k_n}{\widetilde{T}}\right)$, where $\widetilde{T} = \sum_{i=1}^n (d_i - 1 + k_i) = T - n + \sum_{i=1}^n d_i$. Comparison with Example 5.1.8 shows that the MAP estimate is the same as the ML estimate, except that $d_i - 1$ is added to $k_i$ for each $i$. If the $d_i$'s are integers, the MAP estimate is the ML estimate with some prior observations mixed in, namely, $d_i - 1$ prior observations of outcome $i$ for each $i$. A prior distribution such that the MAP estimate has the same algebraic form as the ML estimate is called a conjugate prior, and the specific density $f_B$ for this example is called the Dirichlet density with parameter vector $d$.

Example 5.1.10 Suppose that $Y = (Y_1, \ldots, Y_T)$ is observed, and that it is assumed that the $Y_t$ are independent, with the binomial distribution with parameters $n$ and $q$. Suppose $n$ is known, and $q$ is an unknown parameter to be estimated from $Y$. Let us find the maximum likelihood estimate, $\widehat{q}_{ML}(y)$, for a particular observation $y = (y_1, \ldots, y_T)$. The likelihood is
\[ p(y|q) = \prod_{t=1}^T \left[ \binom{n}{y_t} q^{y_t} (1-q)^{n - y_t} \right] = c\, q^s (1-q)^{nT - s}, \]
where $s = y_1 + \cdots + y_T$, and $c$ depends on $y$ but not on $q$. The log likelihood is $\ln c + s \ln(q) + (nT - s) \ln(1-q)$. Maximizing over $q$ yields $\widehat{q}_{ML} = \frac{s}{nT}$. An alternative way to think about this is to realize that each $Y_t$ can be viewed as the sum of $n$ independent Bernoulli($q$) random variables, and $s$ can be viewed as the observed sum of $nT$ independent Bernoulli($q$) random variables.

5.2 The expectation-maximization (EM) algorithm

The expectation-maximization algorithm is a computational method for computing maximum likelihood estimates in contexts where there are hidden random variables, in addition to observed data and unknown parameters. The following notation will be used.
$\theta$, a parameter to be estimated
$X$, the complete data
$p_{cd}(x|\theta)$, the pmf of the complete data, which is a known function for each value of $\theta$
$Y = h(X)$, the observed random vector
$Z$, the unobserved data (This notation is used in the common case that $X$ has the form $X = (Y, Z)$.)

We write $p(y|\theta)$ to denote the pmf of $Y$ for a given value of $\theta$. It can be expressed in terms of the pmf of the complete data by:
\[ p(y|\theta) = \sum_{\{x : h(x) = y\}} p_{cd}(x|\theta) \tag{5.4} \]
In some applications, there can be a very large number of terms in the sum in (5.4), making it difficult to numerically maximize $p(y|\theta)$ with respect to $\theta$ (i.e. to compute $\widehat{\theta}_{ML}(y)$).

Algorithm 5.2.1 (Expectation-maximization (EM) algorithm) An observation $y$ is given, along with an initial estimate $\theta^{(0)}$. The algorithm is iterative. Given $\theta^{(k)}$, the next value $\theta^{(k+1)}$ is computed in the following two steps:

(Expectation step) Compute $Q(\theta|\theta^{(k)})$ for all $\theta$, where
\[ Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(X|\theta) \mid y, \theta^{(k)} ]. \tag{5.5} \]
(Maximization step) Compute $\theta^{(k+1)} \in \arg\max_\theta Q(\theta|\theta^{(k)})$. In other words, find a value $\theta^{(k+1)}$ of $\theta$ that maximizes $Q(\theta|\theta^{(k)})$ with respect to $\theta$.

Some intuition behind the algorithm is the following. If a vector of complete data $x$ could be observed, it would be reasonable to estimate $\theta$ by maximizing the pmf of the complete data, $p_{cd}(x|\theta)$, with respect to $\theta$. This plan is not feasible if the complete data is not observed. The idea is to estimate $\log p_{cd}(X|\theta)$ by its conditional expectation, $Q(\theta|\theta^{(k)})$, and then find $\theta$ to maximize this conditional expectation. The conditional expectation is well defined if some value of the parameter $\theta$ is fixed. For each iteration of the algorithm, the expectation step is completed using the latest value of $\theta$, namely $\theta^{(k)}$, in computing the expectation of $\log p_{cd}(X|\theta)$.

In most applications there is some additional structure that helps in the computation of $Q(\theta|\theta^{(k)})$.
This typically happens when $p_{cd}$ factors into simple terms, such as in the case of hidden Markov models discussed in this chapter, or when $p_{cd}$ has the form of an exponential raised to a low degree polynomial, such as the Gaussian or exponential distribution. In some cases there are closed form expressions for $Q(\theta|\theta^{(k)})$. In others, there may be an algorithm that generates samples of $X$ with the desired pmf $p_{cd}(x|\theta^{(k)})$ using random number generators, and then $\log p_{cd}(X|\theta)$ is used as an approximation to $Q(\theta|\theta^{(k)})$.

Example 5.2.2 (Estimation of the variance of a signal) An observation $Y$ is modeled as $Y = S + N$, where the signal $S$ is assumed to be a $N(0, \theta)$ random variable, where $\theta$ is an unknown parameter, assumed to satisfy $\theta \ge 0$, and the noise $N$ is a $N(0, \sigma^2)$ random variable where $\sigma^2$ is known and strictly positive. Suppose it is desired to estimate $\theta$, the variance of the signal. Let $y$ be a particular observed value of $Y$. We consider two approaches to finding $\widehat{\theta}_{ML}$: a direct approach, and the EM algorithm.

For the direct approach, note that for $\theta$ fixed, $Y$ is a $N(0, \theta + \sigma^2)$ random variable. Therefore, the pdf of $Y$ evaluated at $y$, or likelihood of $y$, is given by
\[ f(y|\theta) = \frac{\exp\left(-\frac{y^2}{2(\theta + \sigma^2)}\right)}{\sqrt{2\pi(\theta + \sigma^2)}}. \]
The natural log of the likelihood is given by
\[ \log f(y|\theta) = -\frac{\log(2\pi)}{2} - \frac{\log(\theta + \sigma^2)}{2} - \frac{y^2}{2(\theta + \sigma^2)}. \]
Maximizing over $\theta$ yields $\widehat{\theta}_{ML} = (y^2 - \sigma^2)_+$. While this one-dimensional case is fairly simple, the situation is different in higher dimensions, as explored in Problem 5.5. Thus, we examine use of the EM algorithm for this example.

To apply the EM algorithm for this example, take $X = (S, N)$ as the complete data. The observation is only the sum, $Y = S + N$, so the complete data is not observed. For given $\theta$, $S$ and $N$ are independent, so the log of the joint pdf of the complete data is given as follows:
\[ \log p_{cd}(s, n|\theta) = -\frac{\log(2\pi\theta)}{2} - \frac{s^2}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{n^2}{2\sigma^2}. \]
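As a numerical sanity check on the direct approach in this example, the following minimal sketch compares the closed form $(y^2 - \sigma^2)_+$ with a brute-force grid maximization of $\log f(y|\theta)$. The function names, the grid range `theta_max`, and the grid resolution are illustrative assumptions, not part of the notes.

```python
import math

def log_likelihood(y, theta, sigma2):
    """Log of the N(0, theta + sigma2) pdf evaluated at y."""
    v = theta + sigma2
    return -0.5 * math.log(2 * math.pi) - 0.5 * math.log(v) - y * y / (2 * v)

def theta_ml_direct(y, sigma2):
    """Closed-form maximizer (y^2 - sigma^2)_+ over theta >= 0."""
    return max(y * y - sigma2, 0.0)

def theta_ml_grid(y, sigma2, theta_max=20.0, steps=20000):
    """Brute-force maximizer over a uniform grid on [0, theta_max].

    Illustrative only: theta_max and steps are arbitrary choices, and the
    true maximizer is assumed to lie inside [0, theta_max].
    """
    grid = [theta_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda th: log_likelihood(y, th, sigma2))
```

For an observation with $y^2 \le \sigma^2$ the log likelihood is decreasing on $\theta \ge 0$, so both functions return zero; otherwise the grid maximizer should agree with the closed form up to the grid spacing.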
For the expectation step, we find
\[ Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(S, N|\theta) \mid y, \theta^{(k)} ] = -\frac{\log(2\pi\theta)}{2} - \frac{E[S^2|y, \theta^{(k)}]}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{E[N^2|y, \theta^{(k)}]}{2\sigma^2}. \]
For the maximization step, we find
\[ \frac{\partial Q(\theta|\theta^{(k)})}{\partial \theta} = -\frac{1}{2\theta} + \frac{E[S^2|y, \theta^{(k)}]}{2\theta^2}, \]
from which we see that $\theta^{(k+1)} = E[S^2|y, \theta^{(k)}]$. Computation of $E[S^2|y, \theta^{(k)}]$ is an exercise in conditional Gaussian distributions, similar to Example 3.4.5. The conditional second moment is the sum of the square of the conditional mean and the variance of the estimation error. Thus, the EM algorithm becomes the following recursion:
\[ \theta^{(k+1)} = \left( \frac{\theta^{(k)}}{\theta^{(k)} + \sigma^2} \right)^2 y^2 + \frac{\theta^{(k)} \sigma^2}{\theta^{(k)} + \sigma^2} \tag{5.6} \]
Problem 5.3 shows that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \widehat{\theta}_{ML}$ as $k \to \infty$.

The following proposition shows that the likelihood $p(y|\theta^{(k)})$ is nondecreasing in $k$. In the ideal case, the likelihood converges to the maximum possible value of the likelihood, and $\lim_{k \to \infty} \theta^{(k)} = \widehat{\theta}_{ML}(y)$. However, the sequence could converge to a local, but not global, maximizer of the likelihood, or possibly even to an inflection point of the likelihood. This behavior is typical of gradient type nonlinear optimization algorithms, to which the EM algorithm is similar. Note that even if the parameter set is convex (as it is for the case of hidden Markov models), the corresponding sets of probability distributions on $Y$ are not convex. It is the geometry of the set of probability distributions that really matters for the EM algorithm, rather than the geometry of the space of the parameters.

Proposition 5.2.3 (Convergence of the EM algorithm) Suppose that the complete data pmf can be factored as $p_{cd}(x|\theta) = p(y|\theta) k(x|y, \theta)$ such that
(i) $\log p(y|\theta)$ is differentiable in $\theta$
(ii) $E\left[ \log k(X|y, \bar{\theta}) \mid y, \bar{\theta} \right]$ is finite for all $\bar{\theta}$
(iii) $D(k(\cdot|y, \theta) \| k(\cdot|y, \theta'))$ is differentiable with respect to $\theta'$ for any $\theta$ fixed,
and suppose that $p(y|\theta^{(0)}) > 0$.
Then the likelihood $p(y|\theta^{(k)})$ is nondecreasing in $k$, and any limit point $\theta^*$ of the sequence $(\theta^{(k)})$ is a stationary point of the objective function $p(y|\theta)$, which by definition means
\[ \left. \frac{\partial p(y|\theta)}{\partial \theta} \right|_{\theta = \theta^*} = 0. \tag{5.7} \]
The proof of Proposition 5.2.3 is given after the notion of divergence between probability vectors is introduced.

Definition 5.2.4 The divergence between probability vectors $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$, denoted by $D(p\|q)$, is defined by $D(p\|q) = \sum_i p_i \log(p_i/q_i)$, with the understanding that $p_i \log(p_i/q_i) = 0$ if $p_i = 0$ and $p_i \log(p_i/q_i) = +\infty$ if $p_i > q_i = 0$.

Lemma 5.2.5 (Basic properties of divergence)
(i) $D(p\|q) \ge 0$, with equality if and only if $p = q$
(ii) $D$ is a convex function of the pair $(p, q)$.

Proof. Property (i) follows from Lemma 5.1.7. Here is another proof. In proving (i), we can assume that $q_i > 0$ for all $i$. The function
\[ \phi(u) = \begin{cases} u \log u & u > 0 \\ 0 & u = 0 \end{cases} \]
is convex. Thus, by Jensen's inequality,
\[ D(p\|q) = \sum_i \phi\left(\frac{p_i}{q_i}\right) q_i \ge \phi\left(\sum_i \frac{p_i}{q_i} q_i\right) = \phi(1) = 0, \]
so (i) is proved.

The proof of (ii) is based on the log-sum inequality, which is the fact that for nonnegative numbers $a_1, \ldots, a_n, b_1, \ldots, b_n$:
\[ \sum_i a_i \log \frac{a_i}{b_i} \ge a \log \frac{a}{b}, \tag{5.8} \]
where $a = \sum_i a_i$ and $b = \sum_i b_i$. To verify (5.8), note that it is true if and only if it is true with each $a_i$ replaced by $c a_i$, for any strictly positive constant $c$. So it can be assumed that $a = 1$. Similarly, it can be assumed that $b = 1$. For $a = b = 1$, (5.8) is equivalent to the fact $D(a\|b) \ge 0$, already proved. So (5.8) is proved.

Let $0 < \alpha < 1$. Suppose $p^j = (p^j_1, \ldots, p^j_n)$ and $q^j = (q^j_1, \ldots, q^j_n)$ are probability distributions for $j = 1, 2$, and let $p_i = \alpha p^1_i + (1 - \alpha) p^2_i$ and $q_i = \alpha q^1_i + (1 - \alpha) q^2_i$, for $1 \le i \le n$. That is, $(p^1, q^1)$ and $(p^2, q^2)$ are two pairs of probability distributions, and $(p, q) = \alpha (p^1, q^1) + (1 - \alpha)(p^2, q^2)$.
For $i$ fixed with $1 \le i \le n$, the log-sum inequality (5.8) with $(a_1, a_2, b_1, b_2) = (\alpha p^1_i, (1 - \alpha) p^2_i, \alpha q^1_i, (1 - \alpha) q^2_i)$ yields
\[ \alpha p^1_i \log \frac{p^1_i}{q^1_i} + (1 - \alpha) p^2_i \log \frac{p^2_i}{q^2_i} = \alpha p^1_i \log \frac{\alpha p^1_i}{\alpha q^1_i} + (1 - \alpha) p^2_i \log \frac{(1 - \alpha) p^2_i}{(1 - \alpha) q^2_i} \ge p_i \log \frac{p_i}{q_i}. \]
Summing each side of this inequality over $i$ yields $\alpha D(p^1\|q^1) + (1 - \alpha) D(p^2\|q^2) \ge D(p\|q)$, so that $D(p\|q)$ is a convex function of the pair $(p, q)$.

Proof of Proposition 5.2.3 Using the factorization $p_{cd}(x|\theta) = p(y|\theta) k(x|y, \theta)$,
\begin{align}
Q(\theta|\theta^{(k)}) &= E[\log p_{cd}(X|\theta) \mid y, \theta^{(k)}] \nonumber \\
&= \log p(y|\theta) + E[\, \log k(X|y, \theta) \mid y, \theta^{(k)} ] \nonumber \\
&= \log p(y|\theta) + E\left[ \log \frac{k(X|y, \theta)}{k(X|y, \theta^{(k)})} \,\Big|\, y, \theta^{(k)} \right] + R \nonumber \\
&= \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) + R, \tag{5.9}
\end{align}
where
\[ R = E[\, \log k(X|y, \theta^{(k)}) \mid y, \theta^{(k)} ]. \]
By assumption (ii), $R$ is finite, and it depends on $y$ and $\theta^{(k)}$, but not on $\theta$. Therefore, the maximization step of the EM algorithm is equivalent to:
\[ \theta^{(k+1)} = \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) \right] \tag{5.10} \]
Thus, at each step, the EM algorithm attempts to maximize the log likelihood $\log p(y|\theta)$ itself, minus a term which penalizes large differences between $\theta$ and $\theta^{(k)}$.

The definition of $\theta^{(k+1)}$ implies that $Q(\theta^{(k+1)}|\theta^{(k)}) \ge Q(\theta^{(k)}|\theta^{(k)})$. Therefore, using (5.9) and the fact $D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k)})) = 0$ yields
\[ \log p(y|\theta^{(k+1)}) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k+1)})) \ge \log p(y|\theta^{(k)}). \tag{5.11} \]
In particular, since the divergence is nonnegative, $p(y|\theta^{(k)})$ is nondecreasing in $k$. Therefore, $\lim_{k \to \infty} \log p(y|\theta^{(k)})$ exists.

Suppose now that the sequence $(\theta^{(k)})$ has a limit point, $\theta^*$. By continuity, implied by the differentiability assumption (i), $\lim_{k \to \infty} p(y|\theta^{(k)}) = p(y|\theta^*) < \infty$.
For each $k$,
\begin{align}
0 &\le \max_\theta \left[ \log p(y|\theta) - D\left( k(\cdot|y, \theta^{(k)}) \,\|\, k(\cdot|y, \theta) \right) \right] - \log p(y|\theta^{(k)}) \tag{5.12} \\
&\le \log p(y|\theta^{(k+1)}) - \log p(y|\theta^{(k)}) \to 0 \quad \text{as } k \to \infty, \tag{5.13}
\end{align}
where (5.12) follows from the fact that $\theta^{(k)}$ is a possible value of $\theta$ in the maximization, and the inequality in (5.13) follows from (5.10) and the fact that the divergence is always nonnegative. Thus, the quantity on the right-hand side of (5.12) converges to zero as $k \to \infty$. So by continuity, for any limit point $\theta^*$ of the sequence $(\theta^{(k)})$,
\[ \max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \,\|\, k(\cdot|y, \theta)) \right] - \log p(y|\theta^*) = 0 \]
and therefore,
\[ \theta^* \in \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \,\|\, k(\cdot|y, \theta)) \right]. \]
So the derivative of $\log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ is zero. The same is true of the term $D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ alone, because this term is nonnegative, it has value 0 at $\theta = \theta^*$, and it is assumed to be differentiable in $\theta$. Therefore, the derivative of the first term, $\log p(y|\theta)$, must be zero at $\theta^*$.

Remark 5.2.6 In the above proposition and proof, we assume that $\theta^*$ is unconstrained. If there are inequality constraints on $\theta$ and if some of them are tight for $\theta^*$, then we still find that if $\theta^*$ is a limit point of $\theta^{(k)}$, then it is a maximizer of $f(\theta) = \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$. Thus, under regularity conditions implying the existence of Lagrange multipliers, the Kuhn-Tucker optimality conditions are satisfied for the problem of maximizing $f(\theta)$. Since the derivatives of $D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ are zero, and since the Kuhn-Tucker optimality conditions only involve the first derivatives of the objective function, those conditions for the problem of maximizing the true log likelihood function, $\log p(y|\theta)$, also hold at $\theta^*$.

5.3 Hidden Markov models

A popular model of one-dimensional sequences with dependencies, explored especially in the context of speech processing, is the hidden Markov model.
Suppose that $X = (Y, Z)$, where $Z$ is unobserved data and $Y$ is the observed data.

$Z = (Z_1, \ldots, Z_T)$ is a time-homogeneous Markov process, with one-step transition probability matrix $A = (a_{i,j})$, and with $Z_1$ having the initial distribution $\pi$. Here, $T$, with $T \ge 1$, denotes the total number of observation times. The state space of $Z$ is denoted by $\mathcal{S}$, and the number of states of $\mathcal{S}$ is denoted by $N_s$.

$Y = (Y_1, \ldots, Y_T)$ is the observed data. It is such that given $Z = z$, for some $z = (z_1, \ldots, z_T)$, the variables $Y_1, \ldots, Y_T$ are conditionally independent with $P[Y_t = l | Z = z] = b_{z_t, l}$, for a given observation generation matrix $B = (b_{i,l})$. The observations are assumed to take values in a set of size $N_o$, so that $B$ is an $N_s \times N_o$ matrix and each row of $B$ is a probability vector.

The parameter for this model is $\theta = (\pi, A, B)$. The model is illustrated in Figure 5.1.

[Figure 5.1: Structure of hidden Markov model.]

The pmf of the complete data, for a given choice of $\theta$, is
\[ p_{cd}(y, z|\theta) = \pi_{z_1} \prod_{t=1}^{T-1} a_{z_t, z_{t+1}} \prod_{t=1}^{T} b_{z_t, y_t}. \tag{5.14} \]
The correspondence between the pmf and the graph shown in Figure 5.1 is that each term on the right-hand side of (5.14) corresponds to an edge in the graph.

In what follows we consider the following three estimation tasks associated with this model:

1. Given the observed data and $\theta$, compute the conditional distribution of the state (solved by the forward-backward algorithm).
2. Given the observed data and $\theta$, compute the most likely sequence for hidden states (solved by the Viterbi algorithm).
3. Given the observed data, compute the maximum likelihood (ML) estimate of $\theta$ (solved by the Baum-Welch/EM algorithm).

These problems are addressed in the next three subsections. As we will see, the first of these problems arises in solving the third problem. The second problem has some similarities to the first problem, but it can be addressed separately.
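To make the complete-data pmf (5.14) concrete, here is a minimal sketch that evaluates its logarithm for given sequences, together with a brute-force evaluation of $p(y|\theta)$ by the sum (5.4). The function names and the 0-based indexing of states and observations are assumptions of this sketch (the notes index from 1). The brute-force sum has $N_s^T$ terms, so it is only usable for very short sequences.

```python
import math
from itertools import product

def log_p_cd(y, z, pi, A, B):
    """Log of (5.14) for observation indices y and state indices z.

    pi is a list, A and B are lists of rows; all indices are 0-based
    (an assumption of this sketch). Returns -inf if (y, z) has probability 0.
    """
    if len(y) != len(z) or len(z) == 0:
        raise ValueError("y and z must be nonempty and of equal length")
    factors = [pi[z[0]]]                                        # initial state
    factors += [A[z[t]][z[t + 1]] for t in range(len(z) - 1)]   # transitions
    factors += [B[z[t]][y[t]] for t in range(len(z))]           # observations
    if any(f == 0.0 for f in factors):
        return float("-inf")
    return sum(math.log(f) for f in factors)

def likelihood_brute_force(y, pi, A, B):
    """p(y|theta) as the sum of p_cd(y, z|theta) over all state sequences z."""
    total = 0.0
    for z in product(range(len(pi)), repeat=len(y)):
        lp = log_p_cd(y, z, pi, A, B)
        if lp > float("-inf"):
            total += math.exp(lp)
    return total
```

For example, with the hypothetical parameter `pi = [0.6, 0.4]`, `A = [[0.7, 0.3], [0.2, 0.8]]`, `B = [[0.5, 0.5], [0.1, 0.9]]`, the pair `y = (0, 1)`, `z = (0, 1)` has complete-data probability $0.6 \cdot 0.3 \cdot 0.5 \cdot 0.9 = 0.081$, and summing `likelihood_brute_force` over all observation sequences of a fixed length should give one.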
5.3.1 Posterior state probabilities and the forward-backward algorithm

In this subsection we assume that the parameter $\theta = (\pi, A, B)$ of the hidden Markov model is known and fixed. We shall describe computationally efficient methods for computing posterior probabilities for the state at a given time $t$, or for a transition at a given pair of times $t$ to $t + 1$, of the hidden Markov process, based on past observations (case of causal filtering) or based on past and future observations (case of smoothing). These posterior probabilities would allow us to compute, for example, MAP estimates of the state or transition of the Markov process at a given time. For example, we have:
\begin{align}
\widehat{Z}_{t|t\,MAP} &= \arg\max_{i \in \mathcal{S}} P[Z_t = i | Y_1 = y_1, \ldots, Y_t = y_t, \theta] \tag{5.15} \\
\widehat{Z}_{t|T\,MAP} &= \arg\max_{i \in \mathcal{S}} P[Z_t = i | Y_1 = y_1, \ldots, Y_T = y_T, \theta] \tag{5.16} \\
\widehat{(Z_t, Z_{t+1})}_{|T\,MAP} &= \arg\max_{(i,j) \in \mathcal{S} \times \mathcal{S}} P[Z_t = i, Z_{t+1} = j | Y_1 = y_1, \ldots, Y_T = y_T, \theta], \tag{5.17}
\end{align}
where the convention for subscripts is similar to that used for Kalman filtering: "$t|T$" denotes that the state is to be estimated at time $t$ based on the observations up to time $T$. The key to efficient computation is to recursively compute certain quantities through a recursion forward in time, and others through a recursion backward in time. We begin by deriving a forward recursion for the variables $\alpha_i(t)$ defined as follows:
\[ \alpha_i(t) \doteq P[Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i | \theta], \]
for $i \in \mathcal{S}$ and $1 \le t \le T$. The initial value is $\alpha_i(1) = \pi_i b_{i y_1}$. By the law of total probability, the update rule is:
\begin{align*}
\alpha_j(t+1) &= \sum_{i \in \mathcal{S}} P[Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}, Z_t = i, Z_{t+1} = j | \theta] \\
&= \sum_{i \in \mathcal{S}} P[Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i | \theta] \cdot P[Z_{t+1} = j, Y_{t+1} = y_{t+1} | Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i, \theta] \\
&= \sum_{i \in \mathcal{S}} \alpha_i(t) a_{ij} b_{j y_{t+1}}.
\end{align*}
The right-hand side of (5.15) can be expressed in terms of the $\alpha$'s as follows.
\[ P[Z_t = i | Y_1 = y_1, \ldots, Y_t = y_t, \theta] = \frac{P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t | \theta]}{P[Y_1 = y_1, \ldots, Y_t = y_t | \theta]} = \frac{\alpha_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)} \tag{5.18} \]
The computation of the $\alpha$'s and the use of (5.18) is an alternative to, and very similar to, the Kalman filtering equations. The difference is that for the Kalman filtering equations, the distributions involved are all Gaussian, so it suffices to compute means and variances, and also the normalization in (5.18), which is done once after the $\alpha$'s are computed, is more or less done at each step in the Kalman filtering equations.

To express the posterior probabilities involving both past and future observations used in (5.16), the following $\beta$ variables are introduced:
\[ \beta_i(t) \doteq P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T | Z_t = i, \theta], \]
for $i \in \mathcal{S}$ and $1 \le t \le T$. The definition is not quite the time reversal of the definition of the $\alpha$'s, because the event $Z_t = i$ is being conditioned upon in the definition of $\beta_i(t)$. This asymmetry is introduced because the presentation of the model itself is not symmetric in time. The backward equation for the $\beta$'s is as follows. The initial condition for the backward equations is $\beta_i(T) = 1$ for all $i$. By the law of total probability, the update rule is
\begin{align*}
\beta_i(t-1) &= \sum_{j \in \mathcal{S}} P[Y_t = y_t, \ldots, Y_T = y_T, Z_t = j | Z_{t-1} = i, \theta] \\
&= \sum_{j \in \mathcal{S}} P[Y_t = y_t, Z_t = j | Z_{t-1} = i, \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T | Z_t = j, Y_t = y_t, Z_{t-1} = i, \theta] \\
&= \sum_{j \in \mathcal{S}} a_{ij} b_{j y_t} \beta_j(t).
\end{align*}
Note that
\begin{align*}
P[Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T | \theta] &= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t | \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T | \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t] \\
&= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t | \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T | \theta, Z_t = i] \\
&= \alpha_i(t) \beta_i(t),
\end{align*}
from which we derive the smoothing equation for the conditional distribution of the state at a time $t$, given all the observations:
\begin{align*}
\gamma_i(t) &\doteq P[Z_t = i | Y_1 = y_1, \ldots, Y_T = y_T, \theta] \\
&= \frac{P[Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T | \theta]}{P[Y_1 = y_1, \ldots, Y_T = y_T | \theta]} \\
&= \frac{\alpha_i(t) \beta_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t) \beta_j(t)}.
\end{align*}
The variable $\gamma_i(t)$ defined here is the same as the probability in the right-hand side of (5.16), so that we have an efficient way to find the MAP smoothing estimator defined in (5.16). For later use, we note from the above that for any $i$ such that $\gamma_i(t) > 0$,
\[ P[Y_1 = y_1, \ldots, Y_T = y_T | \theta] = \frac{\alpha_i(t) \beta_i(t)}{\gamma_i(t)}. \tag{5.19} \]
Similarly,
\begin{align*}
P[Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_T = y_T | \theta] &= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t | \theta] \\
&\quad \cdot P[Z_{t+1} = j, Y_{t+1} = y_{t+1} | \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t] \\
&\quad \cdot P[Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T | \theta, Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}] \\
&= \alpha_i(t) a_{ij} b_{j y_{t+1}} \beta_j(t+1),
\end{align*}
from which we derive the smoothing equation for the conditional distribution of a state transition for some pair of consecutive times $t$ and $t + 1$, given all the observations:
\begin{align*}
\xi_{ij}(t) &\doteq P[Z_t = i, Z_{t+1} = j | Y_1 = y_1, \ldots, Y_T = y_T, \theta] \\
&= \frac{P[Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_T = y_T | \theta]}{P[Y_1 = y_1, \ldots, Y_T = y_T | \theta]} \\
&= \frac{\alpha_i(t) a_{ij} b_{j y_{t+1}} \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t) a_{i'j'} b_{j' y_{t+1}} \beta_{j'}(t+1)} \\
&= \frac{\gamma_i(t) a_{ij} b_{j y_{t+1}} \beta_j(t+1)}{\beta_i(t)},
\end{align*}
where the final expression is derived using (5.19). The variable $\xi_{ij}(t)$ defined here is the same as the probability in the right-hand side of (5.17), so that we have an efficient way to find the MAP smoothing estimator of a state transition, defined in (5.17).

Summarizing, the forward-backward or $\alpha$-$\beta$ algorithm for computing the posterior distribution of the state or a transition is given by:

Algorithm 5.3.1 (The forward-backward algorithm) The $\alpha$'s can be recursively computed forward in time, and the $\beta$'s recursively computed backward in time, using:
\begin{align*}
\alpha_j(t+1) &= \sum_{i \in \mathcal{S}} \alpha_i(t) a_{ij} b_{j y_{t+1}}, \quad \text{with initial condition } \alpha_i(1) = \pi_i b_{i y_1} \\
\beta_i(t-1) &= \sum_{j \in \mathcal{S}} a_{ij} b_{j y_t} \beta_j(t), \quad \text{with initial condition } \beta_i(T) = 1.
\end{align*}
Then the posterior probabilities can be found:
\begin{align}
P[Z_t = i | Y_1 = y_1, \ldots, Y_t = y_t, \theta] &= \frac{\alpha_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t)} \tag{5.20} \\
\gamma_i(t) \doteq P[Z_t = i | Y_1 = y_1, \ldots, Y_T = y_T, \theta] &= \frac{\alpha_i(t) \beta_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t) \beta_j(t)} \tag{5.21} \\
\xi_{ij}(t) \doteq P[Z_t = i, Z_{t+1} = j | Y_1 = y_1, \ldots, Y_T = y_T, \theta] &= \frac{\alpha_i(t) a_{ij} b_{j y_{t+1}} \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t) a_{i'j'} b_{j' y_{t+1}} \beta_{j'}(t+1)} \tag{5.22} \\
&= \frac{\gamma_i(t) a_{ij} b_{j y_{t+1}} \beta_j(t+1)}{\beta_i(t)}. \tag{5.23}
\end{align}

Remark 5.3.2 If the number of observations runs into the hundreds or thousands, the $\alpha$'s and $\beta$'s can become so small that underflow problems can be encountered in numerical computation. However, the formulas (5.20), (5.21), and (5.22) for the posterior probabilities in the forward-backward algorithm are still valid if the $\alpha$'s and $\beta$'s are multiplied by time dependent (but state independent) constants (for this purpose, (5.22) is more convenient than (5.23), because (5.23) involves $\beta$'s at two different times). Then, the $\alpha$'s and $\beta$'s can be renormalized after each time step of computation to have sum equal to one. Moreover, the sum of the logarithms of the normalization factors for the $\alpha$'s can be stored in order to recover the log of the likelihood, $\log p(y|\theta) = \log \sum_{i=0}^{N_s - 1} \alpha_i(T)$.

5.3.2 Most likely state sequence – Viterbi algorithm

Suppose the parameter $\theta = (\pi, A, B)$ is known, and that $Y = (Y_1, \ldots, Y_T)$ is observed. In some applications one wishes to have an estimate of the entire sequence $Z$. Since $\theta$ is known, $Y$ and $Z$ can be viewed as random vectors with a known joint pmf, namely $p_{cd}(y, z|\theta)$. For the remainder of this section, let $y$ denote a fixed observed sequence, $y = (y_1, \ldots, y_T)$. We will seek the MAP estimate, $\widehat{Z}_{MAP}(y, \theta)$, of the entire state sequence $Z = (Z_1, \ldots, Z_T)$, given $Y = y$.
By definition, it is the $z$ that maximizes the posterior pmf $p(z|y, \theta)$, and as shown in Section 5.1, it is also equal to the maximizer of the joint pmf of $Y$ and $Z$:
\[ \widehat{Z}_{MAP}(y, \theta) = \arg\max_z p_{cd}(y, z|\theta). \]
The Viterbi algorithm (a special case of dynamic programming), described next, is a computationally efficient algorithm for simultaneously finding the maximizing sequence $z^* \in \mathcal{S}^T$ and computing $p_{cd}(y, z^*|\theta)$. It uses the variables:
\[ \delta_i(t) \doteq \max_{(z_1, \ldots, z_{t-1}) \in \mathcal{S}^{t-1}} P[Z_1 = z_1, \ldots, Z_{t-1} = z_{t-1}, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t | \theta]. \]
The $\delta$'s can be computed by a recursion forward in time, using the initial values $\delta_i(1) = \pi_i b_{i y_1}$ and the recursion derived as follows:
\begin{align*}
\delta_j(t) &= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P[Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Z_t = j, Y_1 = y_1, \ldots, Y_t = y_t | \theta] \\
&= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P[Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Y_1 = y_1, \ldots, Y_{t-1} = y_{t-1} | \theta] \, a_{i,j} b_{j y_t} \\
&= \max_i \{ \delta_i(t-1) a_{i,j} b_{j y_t} \}.
\end{align*}
Note that $\delta_i(T) = \max_{z : z_T = i} p_{cd}(y, z|\theta)$. Thus, the following algorithm correctly finds $\widehat{Z}_{MAP}(y, \theta)$.

Algorithm 5.3.3 (Viterbi algorithm) Compute the $\delta$'s and associated back pointers by a recursion forward in time:
\begin{align}
\text{(initial condition)} \quad & \delta_i(1) = \pi_i b_{i y_1} \nonumber \\
\text{(recursive step)} \quad & \delta_j(t) = \max_i \{ \delta_i(t-1) a_{ij} b_{j, y_t} \} \tag{5.24} \\
\text{(storage of back pointers)} \quad & \phi_j(t) \doteq \arg\max_i \{ \delta_i(t-1) a_{i,j} b_{j, y_t} \} \nonumber
\end{align}
Then $z^* = \widehat{Z}_{MAP}(y, \theta)$ satisfies $p_{cd}(y, z^*|\theta) = \max_i \delta_i(T)$, and $z^*$ is given by tracing backward in time:
\[ z^*_T = \arg\max_i \delta_i(T) \quad \text{and} \quad z^*_{t-1} = \phi_{z^*_t}(t) \ \text{ for } 2 \le t \le T. \tag{5.25} \]

5.3.3 The Baum-Welch algorithm, or EM algorithm for HMM

The EM algorithm, introduced in Section 5.2, can be usefully applied to many parameter estimation problems with hidden data. This section shows how to apply it to the problem of estimating the parameter of a hidden Markov model from an observed output sequence.
This results in the Baum-Welch algorithm, which was developed earlier than the EM algorithm, in the particular context of HMMs.

The parameter to be estimated is $\theta = (\pi, A, B)$. The complete data consists of $(Y, Z)$ whereas the observed, incomplete data consists of $Y$ alone. The initial parameter $\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$ should have all entries strictly positive, because any entry that is zero will remain zero at the end of an iteration. Suppose $\theta^{(k)}$ is given. The first half of an iteration of the EM algorithm is to compute, or determine in closed form, $Q(\theta|\theta^{(k)})$. Taking logarithms in the expression (5.14) for the pmf of the complete data yields
\[ \log p_{cd}(y, z|\theta) = \log \pi_{z_1} + \sum_{t=1}^{T-1} \log a_{z_t, z_{t+1}} + \sum_{t=1}^{T} \log b_{z_t, y_t}. \]
Taking the expectation yields
\begin{align*}
Q(\theta|\theta^{(k)}) &= E[\log p_{cd}(y, Z|\theta) \mid y, \theta^{(k)}] \\
&= \sum_{i \in \mathcal{S}} \gamma_i(1) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_{ij}(t) \log a_{i,j} + \sum_{t=1}^{T} \sum_{i \in \mathcal{S}} \gamma_i(t) \log b_{i, y_t},
\end{align*}
where the variables $\gamma_i(t)$ and $\xi_{ij}(t)$ are defined using the model with parameter $\theta^{(k)}$. In view of this closed form expression for $Q(\theta|\theta^{(k)})$, the expectation step of the EM algorithm essentially comes down to computing the $\gamma$'s and the $\xi$'s. This computation can be done using the forward-backward algorithm, Algorithm 5.3.1, with $\theta = \theta^{(k)}$.

The second half of an iteration of the EM algorithm is to find the value of $\theta$ that maximizes $Q(\theta|\theta^{(k)})$, and set $\theta^{(k+1)}$ equal to that value. The parameter $\theta = (\pi, A, B)$ for this problem can be viewed as a set of probability vectors. Namely, $\pi$ is a probability vector, and, for each $i$ fixed, $a_{ij}$ as $j$ varies, and $b_{il}$ as $l$ varies, are probability vectors. Therefore, Example 5.1.8 and Lemma 5.1.7 will be of use.
Motivated by these, we rewrite the expression found for $Q(\theta|\theta^{(k)})$ to get
\begin{align}
Q(\theta|\theta^{(k)}) &= \sum_{i \in \mathcal{S}} \gamma_i(1) \log \pi_i + \sum_{i,j} \sum_{t=1}^{T-1} \xi_{ij}(t) \log a_{i,j} + \sum_{i \in \mathcal{S}} \sum_{t=1}^{T} \gamma_i(t) \log b_{i, y_t} \nonumber \\
&= \sum_{i \in \mathcal{S}} \gamma_i(1) \log \pi_i + \sum_{i,j} \left( \sum_{t=1}^{T-1} \xi_{ij}(t) \right) \log a_{i,j} + \sum_{i \in \mathcal{S}} \sum_{l} \left( \sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}} \right) \log b_{i,l} \tag{5.26}
\end{align}
The first summation in (5.26) has the same form as the sum in Lemma 5.1.7. Similarly, for each $i$ fixed, the sum over $j$ involving $a_{i,j}$, and the sum over $l$ involving $b_{i,l}$, also have the same form as the sum in Lemma 5.1.7. Therefore, the maximization step of the EM algorithm can be written in the following form:
\begin{align}
\pi_i^{(k+1)} &= \gamma_i(1) \tag{5.27} \\
a_{i,j}^{(k+1)} &= \frac{\sum_{t=1}^{T-1} \xi_{i,j}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \tag{5.28} \\
b_{i,l}^{(k+1)} &= \frac{\sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}}}{\sum_{t=1}^{T} \gamma_i(t)} \tag{5.29}
\end{align}
The update equations (5.27)-(5.29) have a natural interpretation. Equation (5.27) means that the new value of the distribution of the initial state, $\pi^{(k+1)}$, is simply the posterior distribution of the initial state, computed assuming $\theta^{(k)}$ is the true parameter value. The other two update equations are similar, but are more complicated because the transition matrix $A$ and observation generation matrix $B$ do not change with time. The denominator of (5.28) is the posterior expected number of times the state is equal to $i$ up to time $T - 1$, and the numerator is the posterior expected number of times two consecutive states are $i, j$. Thus, if we think of the time of a jump as being random, the right-hand side of (5.28) is the time-averaged posterior conditional probability that, given the state at the beginning of a transition is $i$ at a typical time, the next state will be $j$. Similarly, the right-hand side of (5.29) is the time-averaged posterior conditional probability that, given the state is $i$ at a typical time, the observation will be $l$.
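The update equations (5.27)-(5.29), combined with the forward-backward recursions of Algorithm 5.3.1, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, states and observations are indexed from 0, the parameter entries are assumed strictly positive, $T \ge 2$ is assumed so that (5.28) is well defined, and the $\alpha$'s and $\beta$'s are renormalized at every time step in the way Remark 5.3.2 suggests.

```python
import math

def forward_backward(y, pi, A, B):
    """Scaled alpha-beta recursions; returns (gamma, xi, log p(y|theta))."""
    T, ns = len(y), len(pi)
    alpha, scale = [], []
    a = [pi[i] * B[i][y[0]] for i in range(ns)]          # alpha_i(1), unscaled
    for t in range(T):
        if t > 0:                                         # forward update
            a = [sum(alpha[t - 1][i] * A[i][j] for i in range(ns)) * B[j][y[t]]
                 for j in range(ns)]
        c = sum(a)                                        # normalization factor
        scale.append(c)
        alpha.append([v / c for v in a])
    beta = [[1.0] * ns for _ in range(T)]                 # beta_i(T) = 1
    for t in range(T - 2, -1, -1):                        # backward update
        b = [sum(A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] for j in range(ns))
             for i in range(ns)]
        s = sum(b)
        beta[t] = [v / s for v in b]
    gamma = []
    for t in range(T):                                    # (5.21)
        g = [alpha[t][i] * beta[t][i] for i in range(ns)]
        s = sum(g)
        gamma.append([v / s for v in g])
    xi = []
    for t in range(T - 1):                                # (5.22)
        x = [[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j]
              for j in range(ns)] for i in range(ns)]
        s = sum(map(sum, x))
        xi.append([[v / s for v in row] for row in x])
    # The product of the alpha normalization factors equals p(y|theta).
    return gamma, xi, sum(math.log(c) for c in scale)

def baum_welch_step(y, pi, A, B):
    """One EM iteration; also returns log p(y|theta) for the input parameter."""
    gamma, xi, loglik = forward_backward(y, pi, A, B)
    T, ns, no = len(y), len(pi), len(B[0])
    pi_new = gamma[0][:]                                              # (5.27)
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(ns)] for i in range(ns)]                 # (5.28)
    B_new = [[sum(gamma[t][i] for t in range(T) if y[t] == l) /
              sum(gamma[t][i] for t in range(T))
              for l in range(no)] for i in range(ns)]                 # (5.29)
    return pi_new, A_new, B_new, loglik
```

Calling `baum_welch_step` repeatedly, feeding each output parameter back in, gives the EM iteration; by Proposition 5.2.3 the returned log likelihoods should be nondecreasing, up to floating point error.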
Algorithm 5.3.4 (Baum-Welch algorithm, or EM algorithm for HMM) Select the state space $\mathcal{S}$, and in particular, the cardinality, $N_s$, of the state space, and let $\theta^{(0)}$ denote a given initial choice of parameter. Given $\theta^{(k)}$, compute $\theta^{(k+1)}$ by using the forward-backward algorithm (Algorithm 5.3.1) with $\theta = \theta^{(k)}$ to compute the $\gamma$'s and $\xi$'s. Then use (5.27)-(5.29) to compute $\theta^{(k+1)} = (\pi^{(k+1)}, A^{(k+1)}, B^{(k+1)})$.

5.4 Notes

The EM algorithm is due to A.P. Dempster, N.M. Laird, and D.B. Rubin [3]. The paper includes examples and a proof that the likelihood is increased with each iteration of the algorithm. An article on the convergence of the EM algorithm is given in [16]. Earlier related work includes that of Baum et al. [2], giving the Baum-Welch algorithm. A tutorial on inference for HMMs and applications to speech recognition is given in [11].

5.5 Problems

5.1 Estimation of a Poisson parameter
Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian method, suppose the prior distribution of $\theta$ is the exponential distribution with some known parameter $\lambda > 0$. (a) Find $\widehat{\Theta}_{MAP}(k)$, the MAP estimate of $\theta$ given that $Y = k$ is observed, for some $k \ge 0$. (b) For what values of $\lambda$ is $\widehat{\Theta}_{MAP}(k) \approx \widehat{\theta}_{ML}(k)$? (The ML estimator was found in Example 5.1.3.) Why should that be expected?

5.2 A variance estimation problem with Poisson observation
The input voltage to an optical device is $X$ and the number of photons observed at a detector is $N$. Suppose $X$ is a Gaussian random variable with mean zero and variance $\sigma^2$, and that given $X$, the random variable $N$ has the Poisson distribution with mean $X^2$. (Recall that the Poisson distribution with mean $\lambda$ has probability mass function $\lambda^n e^{-\lambda}/n!$ for $n \ge 0$.) (a) Express $P\{N = n\}$ in terms of $\sigma^2$. You can express this as an integral. You do not have to perform the integration. (b) Find the maximum likelihood estimator of $\sigma^2$ given $N$. (Caution: Estimate $\sigma^2$, not $X$.
Be as explicit as possible; the final answer has a simple form. Hint: You can first simplify your answer to part (a) by using the fact that if $X$ is a $N(0, \tilde{\sigma}^2)$ random variable, then $E[X^{2n}] = \frac{\tilde{\sigma}^{2n}(2n)!}{n!\,2^n}$.)

5.3 Convergence of the EM algorithm for an example
The purpose of this exercise is to verify for Example 5.2.2 that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat{\theta}_{ML}$ as $k \to \infty$. As shown in the example, $\hat{\theta}_{ML} = (y^2 - \sigma^2)_+$. Let
\[ F(\theta) = \left(\frac{\theta}{\theta+\sigma^2}\right)^2 y^2 + \frac{\theta\sigma^2}{\theta+\sigma^2} \]
so that the recursion (5.6) has the form $\theta^{(k+1)} = F(\theta^{(k)})$. Clearly, over $\mathbb{R}_+$, $F$ is increasing and bounded. (a) Show that 0 is the only nonnegative solution of $F(\theta) = \theta$ if $y^2 \le \sigma^2$ and that 0 and $y^2 - \sigma^2$ are the only nonnegative solutions of $F(\theta) = \theta$ if $y^2 > \sigma^2$. (b) Show that for small $\theta > 0$,
\[ F(\theta) = \theta + \frac{\theta^2 (y^2 - \sigma^2)}{\sigma^4} + O(\theta^3). \]
(Hint: For $0 < \theta < \sigma^2$, $\frac{\theta}{\theta+\sigma^2} = \frac{\theta}{\sigma^2}\,\frac{1}{1+\theta/\sigma^2} = \frac{\theta}{\sigma^2}\left(1 - \frac{\theta}{\sigma^2} + \left(\frac{\theta}{\sigma^2}\right)^2 - \ldots\right)$.) (c) Sketch $F$ and argue, using the above properties of $F$, that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat{\theta}_{ML}$.

5.4 Transformation of estimators and estimators of transformations
Consider estimating a parameter $\theta \in [0, 1]$ from an observation $Y$. A prior density of $\theta$ is available for the Bayesian estimators, MAP and MMSE, and the conditional density of $Y$ given $\theta$ is known. Answer the following questions and briefly explain your answers. (a) Does $3 + 5\hat{\theta}_{ML} = \widehat{(3 + 5\theta)}_{ML}$? (b) Does $(\hat{\theta}_{ML})^3 = \widehat{(\theta^3)}_{ML}$? (c) Does $3 + 5\hat{\theta}_{MAP} = \widehat{(3 + 5\theta)}_{MAP}$? (d) Does $(\hat{\theta}_{MAP})^3 = \widehat{(\theta^3)}_{MAP}$? (e) Does $3 + 5\hat{\theta}_{MMSE} = \widehat{(3 + 5\theta)}_{MMSE}$? (f) Does $(\hat{\theta}_{MMSE})^3 = \widehat{(\theta^3)}_{MMSE}$?

5.5 Using the EM algorithm for estimation of a signal variance
This problem generalizes Example 5.2.2 to vector observations. Suppose the observation is $Y = S + N$, such that the signal $S$ and noise $N$ are independent random vectors in $\mathbb{R}^d$. Assume that $S$ is $N(0, \theta I)$, and $N$ is $N(0, \Sigma_N)$, where $\theta$, with $\theta > 0$, is the parameter to be estimated, $I$ is the identity matrix, and $\Sigma_N$ is known.
(a) Suppose $\theta$ is known. Find the MMSE estimate of $S$, $\hat{S}_{MMSE}$, and find an expression for the covariance matrix of the error vector, $S - \hat{S}_{MMSE}$. (b) Suppose now that $\theta$ is unknown. Describe a direct approach to computing $\hat{\theta}_{ML}(Y)$. (c) Describe how $\hat{\theta}_{ML}(Y)$ can be computed using the EM algorithm. (d) Consider how your answers to parts (b) and (c) simplify in case $d = 2$ and the covariance matrix of the noise, $\Sigma_N$, is the identity matrix.

5.6 Finding a most likely path
Consider an HMM with state space $\mathcal{S} = \{0, 1\}$, observation space $\{0, 1, 2\}$, and parameter $\theta = (\pi, A, B)$ given by:
\[ \pi = (a, a^3) \qquad A = \begin{pmatrix} a & a^3 \\ a^3 & a \end{pmatrix} \qquad B = \begin{pmatrix} ca & ca^2 & ca^3 \\ ca^2 & ca^3 & ca \end{pmatrix} \]
Here $a$ and $c$ are positive constants. Their actual numerical values aren't important, other than the fact that $a < 1$. Find the MAP state sequence for the observation sequence 021201, using the Viterbi algorithm. Show your work.

5.7 An underconstrained estimation problem
Suppose the parameter $\theta = (\pi, A, B)$ for an HMM is unknown, but that it is assumed that the number of states $N_s$ in the state space $\mathcal{S}$ for $(Z_t)$ is equal to the number of observations, $T$. Describe a trivial choice of the ML estimator $\hat{\theta}_{ML}(y)$ for a given observation sequence $y = (y_1, \ldots, y_T)$. What is the likelihood of $y$ for this choice of $\theta$?

5.8 Specialization of Baum-Welch algorithm for no hidden data
(a) Determine how the Baum-Welch algorithm simplifies in the special case that $B$ is the identity matrix, so that $X_t = Y_t$ for all $t$. (b) Still assuming that $B$ is the identity matrix, suppose that $\mathcal{S} = \{0, 1\}$ and the observation sequence is 0001110001110001110001. Find the ML estimator for $\pi$ and $A$.

5.9 Free energy and the Boltzmann distribution
Let $S$ denote a finite set of possible states of a physical system, and suppose the (internal) energy of any state $s \in S$ is given by $V(s)$ for some function $V$ on $S$. Let $T > 0$.
The Helmholtz free energy of a probability distribution $Q$ on $S$ is defined to be the average (internal) energy minus the temperature times entropy: $F(Q) = \sum_i Q(i)V(i) + T\sum_i Q(i)\log Q(i)$. Note that $F$ is a convex function of $Q$. (We're assuming Boltzmann's constant is normalized to one, so that $T$ should actually be in units of energy, but by abuse of notation we will call $T$ the temperature.) (a) Use the method of Lagrange multipliers to show that the Boltzmann distribution defined by $B_T(i) = \frac{1}{Z(T)}\exp(-V(i)/T)$ minimizes $F(Q)$. Here $Z(T)$ is the normalizing constant required to make $B_T$ a probability distribution. (b) Describe the limit of the Boltzmann distribution as $T \to \infty$. (c) Describe the limit of the Boltzmann distribution as $T \to 0$. If it is possible to simulate a random variable with the Boltzmann distribution, does this suggest an application? (d) Show that $F(Q) = T D(Q\|B_T) +$ (term not depending on $Q$). Therefore, given an energy function $V$ on $S$ and temperature $T > 0$, minimizing free energy over $Q$ in some set is equivalent to minimizing the divergence $D(Q\|B_T)$ over $Q$ in the same set.

5.10 Baum-Welch saddlepoint
Suppose that the Baum-Welch algorithm is run on a given data set with initial parameter $\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$ such that $\pi^{(0)} = \pi^{(0)}A^{(0)}$ (i.e., the initial distribution of the state is an equilibrium distribution of the state) and every row of $B^{(0)}$ is identical. Explain what happens, assuming an ideal computer with infinite precision arithmetic is used.

5.11 Inference for a mixture model
(a) An observed random vector $Y$ is distributed as a mixture of Gaussian distributions in $d$ dimensions. The parameter of the mixture distribution is $\theta = (\theta_1, \ldots, \theta_J)$, where $\theta_j$ is a $d$-dimensional vector for $1 \le j \le J$. Specifically, to generate $Y$ a random variable $Z$, called the class label for the observation, is generated. The variable $Z$ is uniformly distributed on {1, . . .
, J}, and the conditional distribution of $Y$ given $(\theta, Z)$ is Gaussian with mean vector $\theta_Z$ and covariance the $d \times d$ identity matrix. The class label $Z$ is not observed. Assuming that $\theta$ is known, find the posterior pmf $p(z|y, \theta)$. Give a geometrical interpretation of the MAP estimate $\hat{Z}$ for a given observation $Y = y$. (b) Suppose now that the parameter $\theta$ is random with the uniform prior over a very large region and suppose that given $\theta$, $n$ random variables are each generated as in part (a), independently, to produce $(Z^{(1)}, Y^{(1)}, Z^{(2)}, Y^{(2)}, \ldots, Z^{(n)}, Y^{(n)})$. Give an explicit expression for the joint distribution $P(\theta, z^{(1)}, y^{(1)}, z^{(2)}, y^{(2)}, \ldots, z^{(n)}, y^{(n)})$. (c) The iterative conditional modes (ICM) algorithm for this example corresponds to taking turns maximizing $P(\hat{\theta}, \hat{z}^{(1)}, y^{(1)}, \hat{z}^{(2)}, y^{(2)}, \ldots, \hat{z}^{(n)}, y^{(n)})$ with respect to $\hat{\theta}$ for $\hat{z}$ fixed and with respect to $\hat{z}$ for $\hat{\theta}$ fixed. Give a simple geometric description of how the algorithm works and suggest a method to initialize the algorithm (there is no unique answer for the latter). (d) Derive the EM algorithm for this example, in an attempt to compute the maximum likelihood estimate of $\theta$ given $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$.

5.12 Constraining the Baum-Welch algorithm
The Baum-Welch algorithm as presented placed no prior assumptions on the parameters $\pi$, $A$, $B$, other than the number of states $N_s$ in the state space of $(Z_t)$. Suppose matrices $\bar{A}$ and $\bar{B}$ are given with the same dimensions as the matrices $A$ and $B$ to be estimated, with all elements of $\bar{A}$ and $\bar{B}$ having values 0 and 1. Suppose that $A$ and $B$ are constrained to satisfy $A \le \bar{A}$ and $B \le \bar{B}$, in the element-by-element ordering (for example, $a_{ij} \le \bar{a}_{ij}$ for all $i, j$). Explain how the Baum-Welch algorithm can be adapted to this situation.
5.13 * Implementation of algorithms
Write a computer program to (a) simulate an HMM for a specified value of the parameter $\theta = (\pi, A, B)$, (b) run the forward-backward algorithm and compute the $\alpha$'s, $\beta$'s, $\gamma$'s, and $\xi$'s, and (c) run the Baum-Welch algorithm. Experiment a bit and describe your results. For example, if $T$ observations are generated, and then the Baum-Welch algorithm is used to estimate the parameter, how large does $T$ need to be to ensure that the estimates of $\theta$ are pretty accurate?

Chapter 6
Dynamics of Countable-State Markov Models

Markov processes are useful for modeling a variety of dynamical systems. Often questions involving the long-time behavior of such systems are of interest, such as whether the process has a limiting distribution, or whether time-averages constructed using the process are asymptotically the same as statistical averages.

6.1 Examples with finite state space

Recall that a probability distribution $\pi$ on $\mathcal{S}$ is an equilibrium probability distribution for a time-homogeneous Markov process $X$ if $\pi = \pi H(t)$ for all $t$. In the discrete-time case, this condition reduces to $\pi = \pi P$. We shall see in this section that under certain natural conditions, the existence of an equilibrium probability distribution is related to whether the distribution of $X(t)$ converges as $t \to \infty$. Existence of an equilibrium distribution is also connected to the mean time needed for $X$ to return to its starting state. To motivate the conditions that will be imposed, we begin by considering four examples of finite state processes. Then the relevant definitions are given for finite or countably-infinite state space, and propositions regarding convergence are presented.

Example 6.1.1 Consider the discrete-time Markov process with the one-step probability diagram shown in Figure 6.1.
Note that the process can't escape from the set of states $\mathcal{S}_1 = \{a, b, c, d, e\}$, so that if the initial state $X(0)$ is in $\mathcal{S}_1$ with probability one, then the limiting distribution is supported by $\mathcal{S}_1$. Similarly if the initial state $X(0)$ is in $\mathcal{S}_2 = \{f, g, h\}$ with probability one, then the limiting distribution is supported by $\mathcal{S}_2$. Thus, the limiting distribution is not unique for this process. The natural way to deal with this problem is to decompose the original problem into two problems. That is, consider a Markov process on $\mathcal{S}_1$, and then consider a Markov process on $\mathcal{S}_2$. Does the distribution of $X(k)$ necessarily converge if $X(0) \in \mathcal{S}_1$ with probability one? The answer is no. For example, note that if $X(0) = a$, then $X(k) \in \{a, c, e\}$ for all even values of $k$, whereas $X(k) \in \{b, d\}$ for all odd values of $k$. That is, $\pi_a(k) + \pi_c(k) + \pi_e(k)$ is one if $k$ is even and is zero if $k$ is odd. Therefore, if $\pi_a(0) = 1$, then $\pi(k)$ does not converge as $k \to \infty$.

[Figure 6.1: A one-step transition probability diagram with eight states.]

Basically speaking, the Markov process of Example 6.1.1 fails to have a unique limiting distribution independent of the initial state for two reasons: (i) the process is not irreducible, and (ii) the process is not aperiodic.

Example 6.1.2 Consider the two-state, continuous time Markov process with the transition rate diagram shown in Figure 6.2 for some positive constants $\alpha$ and $\beta$. This was already considered in Example 4.9.3, where we found that for any initial distribution $\pi(0)$,
\[ \lim_{t\to\infty} \pi(t) = \lim_{t\to\infty} \pi(0)H(t) = \left(\frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta}\right). \]
[Figure 6.2: A transition rate diagram with two states.]
The rate of convergence is exponential, with rate parameter $\alpha+\beta$, which is the magnitude of the nonzero eigenvalue, $-(\alpha+\beta)$, of $Q$. Note that the limiting distribution is the unique probability distribution satisfying $\pi Q = 0$.
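The convergence in Example 6.1.2 can be checked numerically. The sketch below integrates the Kolmogorov forward equation $\partial\pi(t)/\partial t = \pi(t)Q$ with plain forward Euler steps and compares the result with $(\beta/(\alpha+\beta), \alpha/(\alpha+\beta))$. The function name `forward_euler`, the rates $\alpha = 2$, $\beta = 3$, the horizon and the step count are all illustrative assumptions, not values from the text.

```python
def forward_euler(Q, pi0, t_end, steps):
    """Approximate pi(t_end) for d(pi)/dt = pi Q using forward Euler steps."""
    h = t_end / steps
    pi = list(pi0)
    n = len(pi)
    for _ in range(steps):
        # flow[j] = sum_i pi_i q_ij, the right-hand side of the forward equation
        flow = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
        pi = [pi[j] + h * flow[j] for j in range(n)]
    return pi


alpha, beta = 2.0, 3.0                 # illustrative rates
Q = [[-alpha, alpha],
     [beta, -beta]]
limit = [beta / (alpha + beta), alpha / (alpha + beta)]

# Two different initial distributions lead to the same limit.
pi_from_1 = forward_euler(Q, [1.0, 0.0], t_end=10.0, steps=10000)
pi_from_2 = forward_euler(Q, [0.0, 1.0], t_end=10.0, steps=10000)
```

Both runs end within rounding error of `limit`, and `limit` itself satisfies $\pi Q = 0$, which is why it is a fixed point of every Euler step.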
The periodicity problem of Example 6.1.1 does not arise for continuous-time processes.

Example 6.1.3 Consider the continuous-time Markov process with the transition rate diagram in Figure 6.3. The $Q$ matrix is the block-diagonal matrix given by
\[ Q = \begin{pmatrix} -\alpha & \alpha & 0 & 0 \\ \beta & -\beta & 0 & 0 \\ 0 & 0 & -\alpha & \alpha \\ 0 & 0 & \beta & -\beta \end{pmatrix} \]
[Figure 6.3: A transition rate diagram with four states.]
This process is not irreducible, but rather the transition rate diagram can be decomposed into two parts, each equivalent to the diagram for Example 6.1.2. The equilibrium probability distributions are the probability distributions of the form $\pi = \left(\lambda\frac{\beta}{\alpha+\beta}, \lambda\frac{\alpha}{\alpha+\beta}, (1-\lambda)\frac{\beta}{\alpha+\beta}, (1-\lambda)\frac{\alpha}{\alpha+\beta}\right)$, where $\lambda$ is the probability placed on the subset $\{1, 2\}$.

Example 6.1.4 Consider the discrete-time Markov process with the transition probability diagram in Figure 6.4. The one-step transition probability matrix $P$ is given by
\[ P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \]
[Figure 6.4: A one-step transition probability diagram with three states.]
Solving the equation $\pi = \pi P$ we find there is a unique equilibrium probability vector, namely $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. On the other hand, if $\pi(0) = (1, 0, 0)$, then
\[ \pi(k) = \pi(0)P^k = \begin{cases} (1, 0, 0) & \text{if } k \equiv 0 \bmod 3 \\ (0, 1, 0) & \text{if } k \equiv 1 \bmod 3 \\ (0, 0, 1) & \text{if } k \equiv 2 \bmod 3 \end{cases} \]
Therefore, $\pi(k)$ does not converge as $k \to \infty$.

6.2 Classification and convergence of discrete-time Markov processes

The following definition applies for either discrete time or continuous time.

Definition 6.2.1 Let $X$ be a time-homogeneous Markov process on the countable state space $\mathcal{S}$. The process is said to be irreducible if for all $i, j \in \mathcal{S}$, there exists $s > 0$ so that $p_{ij}(s) > 0$.

The next definition is relevant only for discrete-time processes.

Definition 6.2.2 The period of a state $i$ is defined to be GCD$\{k \ge 0 : p_{ii}(k) > 0\}$, where “GCD” stands for greatest common divisor.
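Definition 6.2.2 can be explored numerically for a finite chain. The sketch below is only an approximation of the definition: it takes the GCD over return times $1 \le k \le$ `horizon` rather than over all $k$, so the cutoff `horizon` (like the function name `period` and the second example matrix) is an assumption of this example. Omitting $k = 0$ is harmless because the GCD of 0 and $k$ is $k$. States are indexed from zero, so state 1 of Example 6.1.4 is index 0.

```python
from functools import reduce
from math import gcd


def period(P, i, horizon=50):
    """GCD of {1 <= k <= horizon : p_ii(k) > 0}; returns 0 if that set is empty.

    Only the positions of the nonzero entries of P matter, so the k-step
    matrices are tracked as boolean reachability matrices.
    """
    n = len(P)
    step = [[P[a][b] > 0 for b in range(n)] for a in range(n)]
    reach = step                       # reach[a][b] is True iff p_ab(k) > 0, for k = 1
    return_times = []
    for k in range(1, horizon + 1):
        if reach[i][i]:
            return_times.append(k)
        # advance from the k-step pattern to the (k+1)-step pattern
        reach = [[any(reach[a][c] and step[c][b] for c in range(n))
                  for b in range(n)] for a in range(n)]
    return reduce(gcd, return_times, 0)


P_cycle = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]                  # the chain of Example 6.1.4
P_lazy = [[0.5, 0.5, 0],
          [0, 0, 1],
          [1, 0, 0]]                   # hypothetical variant with a self-loop added

period_cycle = period(P_cycle, 0)      # returns occur only at multiples of 3
period_lazy = period(P_lazy, 1)        # returns at times 3 and 4 are both possible
```

For the cyclic chain the computed period is 3, matching the three-step cycle of $\pi(k)$ in Example 6.1.4, while adding a single self-loop makes every state of the variant have period one.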
The set $\{k \ge 0 : p_{ii}(k) > 0\}$ is closed under addition, which by a result in elementary algebra (see footnote 1) implies that the set contains all sufficiently large integer multiples of the period. The Markov process is called aperiodic if the period of all the states is one.

Proposition 6.2.3 If $X$ is irreducible, all states have the same period.

Proof. Let $i$ and $j$ be two states. By irreducibility, there are integers $k_1$ and $k_2$ so that $p_{ij}(k_1) > 0$ and $p_{ji}(k_2) > 0$. For any integer $n \ge 0$, $p_{ii}(n + k_1 + k_2) \ge p_{ij}(k_1)p_{jj}(n)p_{ji}(k_2)$, so the set $\{k \ge 0 : p_{ii}(k) > 0\}$ contains the set $\{k \ge 0 : p_{jj}(k) > 0\}$ translated up by $k_1 + k_2$. Thus the period of $i$ is less than or equal to the period of $j$. Since $i$ and $j$ were arbitrary states, the proposition follows.

For a fixed state $i$, define $\tau_i = \min\{k \ge 1 : X(k) = i\}$, where we adopt the convention that the minimum of an empty set of numbers is $+\infty$. Let $M_i = E[\tau_i | X(0) = i]$. If $P[\tau_i < +\infty | X(0) = i] < 1$, then state $i$ is called transient (and by convention, $M_i = +\infty$). Otherwise $P[\tau_i < +\infty | X(0) = i] = 1$, and $i$ is then said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if $M_i = +\infty$.

Proposition 6.2.4 Suppose $X$ is irreducible and aperiodic.
(a) All states are transient, or all are positive recurrent, or all are null recurrent.
(b) For any initial distribution $\pi(0)$, $\lim_{t\to\infty} \pi_i(t) = 1/M_i$, with the understanding that the limit is zero if $M_i = +\infty$.
(c) An equilibrium probability distribution $\pi$ exists if and only if all states are positive recurrent.
(d) If it exists, the equilibrium probability distribution $\pi$ is given by $\pi_i = 1/M_i$. (In particular, if it exists, the equilibrium probability distribution is unique).

Proof. (a) Suppose state $i$ is recurrent. Given $X(0) = i$, after leaving $i$ the process returns to state $i$ at time $\tau_i$. The process during the time interval $\{0, \ldots, \tau_i\}$ is the first excursion of $X$ from state $i$. From time $\tau_i$ onward, the process behaves just as it did initially.
Thus there is a second excursion from $i$, third excursion from $i$, and so on. Let $T_k$ for $k \ge 1$ denote the length of the $k$th excursion. Then the $T_k$'s are independent, and each has the same distribution as $T_1 = \tau_i$. Let $j$ be another state and let $\epsilon$ denote the probability that $X$ visits state $j$ during one excursion from $i$. Since $X$ is irreducible, $\epsilon > 0$. The excursions are independent, so state $j$ is visited during the $k$th excursion with probability $\epsilon$, independently of whether $j$ was visited in earlier excursions. Thus, the number of excursions needed until state $j$ is reached has the geometric distribution with parameter $\epsilon$, which has mean $1/\epsilon$. In particular, state $j$ is eventually visited with probability one. After $j$ is visited the process eventually returns to state $i$, and then within an average of $1/\epsilon$ additional excursions, it will return to state $j$ again. Thus, state $j$ is also recurrent. Hence, if one state is recurrent, all states are recurrent.

The same argument shows that if $i$ is positive recurrent, then $j$ is positive recurrent. Given $X(0) = i$, the mean time needed for the process to visit $j$ and then return to $i$ is $M_i/\epsilon$, since on average $1/\epsilon$ excursions of mean length $M_i$ are needed. Thus, the mean time to hit $j$ starting from $i$, and the mean time to hit $i$ starting from $j$, are both finite. Thus, $j$ is positive recurrent. Hence, if one state is positive recurrent, all states are positive recurrent.

Footnote 1: Such as the Euclidean algorithm, Chinese remainder theorem, or Bezout theorem.

(b) Part (b) of the proposition follows by an application of the renewal theorem, which can be found in [1].

(c) Suppose all states are positive recurrent. By the law of large numbers, for any state $j$, the long run fraction of time the process is in state $j$ is $1/M_j$ with probability one.
Similarly, for any states $i$ and $j$, the long run fraction of time the process is in state $j$ is $\gamma_{ij}/M_i$, where $\gamma_{ij}$ is the mean number of visits to $j$ in an excursion from $i$. Therefore $1/M_j = \gamma_{ij}/M_i$. Since the visits to all states during an excursion from $i$ account for its whole length, $\sum_j \gamma_{ij} = M_i$, so summing over $j$ gives $\sum_j 1/M_j = 1$. That is, $\pi$ defined by $\pi_i = 1/M_i$ is a probability distribution. The convergence for each $i$ separately given in part (b), together with the fact that $\pi$ is a probability distribution, imply that $\sum_i |\pi_i(t) - \pi_i| \to 0$. Thus, taking $s$ to infinity in the equation $\pi(s)H(t) = \pi(s + t)$ yields $\pi H(t) = \pi$, so that $\pi$ is an equilibrium probability distribution.

Conversely, if there is an equilibrium probability distribution $\pi$, consider running the process with initial distribution $\pi$. Then $\pi(t) = \pi$ for all $t$. So by part (b), for any state $i$, $\pi_i = 1/M_i$. Taking a state $i$ such that $\pi_i > 0$, it follows that $M_i < \infty$. So state $i$ is positive recurrent. By part (a), all states are positive recurrent.

(d) Part (d) was proved in the course of proving part (c).

We conclude this section by describing a technique to establish a rate of convergence to the equilibrium distribution for finite-state Markov processes. Define $\delta(P)$ for a one-step transition probability matrix $P$ by
\[ \delta(P) = \min_{i,k} \sum_j p_{ij} \wedge p_{kj}, \]
where $a \wedge b = \min\{a, b\}$. The number $\delta(P)$ is known as Dobrushin's coefficient of ergodicity. Since $a + b - 2(a \wedge b) = |a - b|$ for $a, b \ge 0$, and the rows of $P$ sum to one, we also have
\[ 2 - 2\delta(P) = \max_{i,k} \sum_j |p_{ij} - p_{kj}|. \]
Let $\|\mu\|$ for a vector $\mu$ denote the $L_1$ norm: $\|\mu\| = \sum_i |\mu_i|$.

Proposition 6.2.5 For any probability vectors $\pi$ and $\sigma$, $\|\pi P - \sigma P\| \le (1 - \delta(P))\|\pi - \sigma\|$. Furthermore, if $\delta(P) > 0$ then there is a unique equilibrium distribution $\pi^\infty$, and for any other probability distribution $\pi$ on $\mathcal{S}$, $\|\pi P^l - \pi^\infty\| \le 2(1 - \delta(P))^l$.

Proof. Let $\tilde{\pi}_i = \pi_i - \pi_i \wedge \sigma_i$ and $\tilde{\sigma}_i = \sigma_i - \pi_i \wedge \sigma_i$. Note that if $\pi_i \ge \sigma_i$ then $\tilde{\pi}_i = \pi_i - \sigma_i$ and $\tilde{\sigma}_i = 0$, and if $\pi_i \le \sigma_i$ then $\tilde{\sigma}_i = \sigma_i - \pi_i$ and $\tilde{\pi}_i = 0$. Also, $\|\tilde{\pi}\|$ and $\|\tilde{\sigma}\|$ are both equal to
$1 - \sum_i \pi_i \wedge \sigma_i$. Therefore, $\|\pi - \sigma\| = \|\tilde{\pi} - \tilde{\sigma}\| = 2\|\tilde{\pi}\| = 2\|\tilde{\sigma}\|$. Furthermore,
\[
\begin{aligned}
\|\pi P - \sigma P\| &= \|\tilde{\pi}P - \tilde{\sigma}P\| \\
&= \sum_j \Big| \sum_i \tilde{\pi}_i P_{ij} - \sum_k \tilde{\sigma}_k P_{kj} \Big| \\
&= (1/\|\tilde{\pi}\|) \sum_j \Big| \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k (P_{ij} - P_{kj}) \Big| \\
&\le (1/\|\tilde{\pi}\|) \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k \sum_j |P_{ij} - P_{kj}| \\
&\le \|\tilde{\pi}\|(2 - 2\delta(P)) = \|\pi - \sigma\|(1 - \delta(P)),
\end{aligned}
\]
which proves the first part of the proposition. Iterating the inequality just proved yields that
\[ \|\pi P^l - \sigma P^l\| \le (1 - \delta(P))^l \|\pi - \sigma\| \le 2(1 - \delta(P))^l. \tag{6.1} \]
This inequality for $\sigma = \pi P^n$ yields that $\|\pi P^l - \pi P^{l+n}\| \le 2(1 - \delta(P))^l$. Thus the sequence $\pi P^l$ is a Cauchy sequence and has a limit $\pi^\infty$, and $\pi^\infty P = \pi^\infty$. Finally, taking $\sigma$ in (6.1) equal to $\pi^\infty$ yields the last part of the proposition.

Proposition 6.2.5 typically does not yield the exact asymptotic rate that $\|\pi P^l - \pi^\infty\|$ tends to zero. The asymptotic behavior can be investigated by computing $(I - zP)^{-1}$, and then matching powers of $z$ in the identity $(I - zP)^{-1} = \sum_{n=0}^\infty z^n P^n$.

6.3 Classification and convergence of continuous-time Markov processes

Chapter 4 discusses Markov processes in continuous time with a finite number of states. Here we extend the coverage of continuous-time Markov processes to include countably infinitely many states. For example, the state of a simple queue could be the number of customers in the queue, and if there is no upper bound on the number of customers that can be waiting in the queue, the state space is $\mathbb{Z}_+$. One possible complication, that rarely arises in practice, is that a continuous time process can make infinitely many jumps in a finite amount of time.

Let $\mathcal{S}$ be a finite or countably infinite set, and let $\triangle \notin \mathcal{S}$. A pure-jump function is a function $x : \mathbb{R}_+ \to \mathcal{S} \cup \{\triangle\}$ such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \ldots$, and a sequence of states, $s_0, s_1, \ldots$ with $s_i \in \mathcal{S}$, and $s_i \ne s_{i+1}$, $i \ge 0$, so that
\[ x(t) = \begin{cases} s_i & \text{if } \tau_i \le t < \tau_{i+1},\ i \ge 0 \\ \triangle & \text{if } t \ge \tau^* \end{cases} \tag{6.2} \]
where $\tau^* = \lim_{i\to\infty} \tau_i$.
If $\tau^*$ is finite it is said to be the explosion time of the function $x$, and if $\tau^* = +\infty$ the function is said to be nonexplosive. The example corresponding to $\mathcal{S} = \{0, 1, \ldots\}$, $\tau_i = i/(i + 1)$ and $s_i = i$ is pictured in Fig. 6.5. Note that $\tau^* = 1$ for this example.

[Figure 6.5: A pure-jump function with an explosion time.]

Definition 6.3.1 A pure-jump Markov process $(X_t : t \ge 0)$ is a Markov process such that, with probability one, its sample paths are pure-jump functions. Such a process is said to be nonexplosive if its sample paths are nonexplosive, with probability one.

Generator matrices are defined for countable-state Markov processes just as they are for finite-state Markov processes. A pure-jump, time-homogeneous Markov process $X$ has generator matrix $Q = (q_{ij} : i, j \in \mathcal{S})$ if
\[ \lim_{h \searrow 0} (p_{ij}(h) - I_{\{i=j\}})/h = q_{ij} \qquad i, j \in \mathcal{S} \tag{6.3} \]
or equivalently
\[ p_{ij}(h) = I_{\{i=j\}} + h q_{ij} + o(h) \qquad i, j \in \mathcal{S} \tag{6.4} \]
where $o(h)$ represents a quantity such that $\lim_{h\to 0} o(h)/h = 0$.

The space-time properties for continuous-time Markov processes with a countably infinite number of states are the same as for a finite number of states. There is a discrete-time jump process, and the holding times, given the jump process, are exponentially distributed. Also, the following holds.

Proposition 6.3.2 Given a matrix $Q = (q_{ij} : i, j \in \mathcal{S})$ satisfying $q_{ij} \ge 0$ for distinct states $i$ and $j$, and $q_{ii} = -\sum_{j\in\mathcal{S}, j\ne i} q_{ij}$ for each state $i$, and a probability distribution $\pi(0) = (\pi_i(0) : i \in \mathcal{S})$, there is a pure-jump, time-homogeneous Markov process with generator matrix $Q$ and initial distribution $\pi(0)$. The finite-dimensional distributions of the process are uniquely determined by $\pi(0)$ and $Q$. The Chapman-Kolmogorov equations, $H(s, t) = H(s, \tau)H(\tau, t)$, and the Kolmogorov forward equations, $\frac{\partial \pi_j(t)}{\partial t} = \sum_{i\in\mathcal{S}} \pi_i(t) q_{ij}$, hold.
Example 6.3.3 (Birth-death processes) A useful class of countable-state Markov processes is the set of birth-death processes. A (continuous-time) birth-death process with parameters $(\lambda_0, \lambda_1, \ldots)$ and $(\mu_1, \mu_2, \ldots)$ (also set $\lambda_{-1} = \mu_0 = 0$) is a pure-jump Markov process with state space $\mathcal{S} = \mathbb{Z}_+$ and generator matrix $Q$ defined by $q_{k,k+1} = \lambda_k$, $q_{kk} = -(\mu_k + \lambda_k)$, and $q_{k,k-1} = \mu_k$ for $k \ge 0$, and $q_{ij} = 0$ if $|i - j| \ge 2$. The transition rate diagram is shown in Fig. 6.6. The space-time structure, as defined in Section 4.10, of such a process is as follows. Given the process is in state $k$ at time $t$, the next state visited is $k+1$ with probability $\lambda_k/(\lambda_k + \mu_k)$ and $k-1$ with probability $\mu_k/(\lambda_k + \mu_k)$. The holding time of state $k$ is exponential with parameter $\lambda_k + \mu_k$. The Kolmogorov forward equations for birth-death processes are
\[ \frac{\partial \pi_k(t)}{\partial t} = \lambda_{k-1}\pi_{k-1}(t) - (\lambda_k + \mu_k)\pi_k(t) + \mu_{k+1}\pi_{k+1}(t) \tag{6.5} \]

Example 6.3.4 (Description of a Poisson process as a Markov process) Let $\lambda > 0$ and consider a birth-death process $N$ with $\lambda_k = \lambda$ and $\mu_k = 0$ for all $k$, with initial state zero with probability one. The space-time structure of this Markov process is then rather simple. Each transition is an upward jump of size one, so the jump process is deterministic: $N^J(k) = k$ for all $k$. Ordinarily, the holding times are only conditionally independent given the jump process, but since the jump process is deterministic, the holding times are independent. Also, since $q_{k,k} = -\lambda$ for all $k$, each holding time is exponentially distributed with parameter $\lambda$. Therefore, $N$ satisfies condition (b) of Proposition 4.5.2, so that $N$ is a Poisson process with rate $\lambda$.

Define, for $i \in \mathcal{S}$, $\tau_i^o = \min\{t > 0 : X(t) \ne i\}$, and $\tau_i = \min\{t > \tau_i^o : X(t) = i\}$. Thus, if $X(0) = i$, $\tau_i$ is the first time the process returns to state $i$, with the exception that $\tau_i = +\infty$ if the process never returns to state $i$.
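The space-time structure described in Example 6.3.3 translates directly into a simulation: draw an exponential holding time with parameter $\lambda_k + \mu_k$, then jump up with probability $\lambda_k/(\lambda_k + \mu_k)$ and down otherwise. The following Python sketch does this under stated assumptions: the function name `simulate_birth_death`, the callable rate arguments, the fixed seed and the numerical rates are all hypothetical choices for illustration, and $\lambda_k + \mu_k$ is assumed to be strictly positive in every state visited.

```python
import random


def simulate_birth_death(lam, mu, x0, num_jumps, seed=0):
    """Simulate jump times and states of a continuous-time birth-death process.

    lam(k), mu(k) : birth and death rates in state k, with mu(0) == 0
    Returns (times, states) with times[0] == 0.0 and states[0] == x0.
    """
    rng = random.Random(seed)
    t, k = 0.0, x0
    times, states = [t], [k]
    for _ in range(num_jumps):
        rate = lam(k) + mu(k)
        t += rng.expovariate(rate)            # holding time, exponential with parameter rate
        if rng.random() < lam(k) / rate:      # jump up with probability lam_k / (lam_k + mu_k)
            k += 1
        else:
            k -= 1
        times.append(t)
        states.append(k)
    return times, states


# Poisson process of Example 6.3.4 with an illustrative rate of 2: every jump is up.
p_times, p_states = simulate_birth_death(lambda k: 2.0, lambda k: 0.0, 0, 100)

# A birth-death process with unit birth rate and unit death rate away from zero.
b_times, b_states = simulate_birth_death(
    lambda k: 1.0, lambda k: 1.0 if k > 0 else 0.0, 0, 200, seed=1)
```

In the Poisson case the jump process is deterministic, so `p_states` is simply 0, 1, ..., 100 and only the jump times are random, as noted in Example 6.3.4.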
The following definitions are the same as when $X$ is a discrete-time process. Let $M_i = E[\tau_i | X(0) = i]$. If $P[\tau_i < +\infty] < 1$, then state $i$ is called transient. Otherwise $P[\tau_i < +\infty] = 1$, and $i$ is then said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if $M_i = +\infty$. The following propositions are analogous to those for discrete-time Markov processes. Proofs can be found in [1, 10].

Proposition 6.3.5 Suppose $X$ is irreducible.
(a) All states are transient, or all are positive recurrent, or all are null recurrent.
(b) For any initial distribution $\pi(0)$, $\lim_{t\to+\infty} \pi_i(t) = 1/(-q_{ii}M_i)$, with the understanding that the limit is zero if $M_i = +\infty$.

[Figure 6.6: Transition rate diagram of a birth-death process.]

Proposition 6.3.6 Suppose $X$ is irreducible and nonexplosive.
(a) A probability distribution $\pi$ is an equilibrium distribution if and only if $\pi Q = 0$.
(b) An equilibrium probability distribution exists if and only if all states are positive recurrent.
(c) If all states are positive recurrent, the equilibrium probability distribution is given by $\pi_i = 1/(-q_{ii}M_i)$. (In particular, if it exists, the equilibrium probability distribution is unique).

The assumption that $X$ be nonexplosive is needed for Proposition 6.3.6(a) (per Problem 6.9), but the following proposition shows that the Markov processes encountered in most applications are nonexplosive.

Proposition 6.3.7 Suppose $X$ is irreducible. Fix a state $i_o$ and for $k \ge 1$ let $\mathcal{S}_k$ denote the set of states reachable from $i_o$ in $k$ jumps. Suppose for each $k \ge 1$ there is a constant $\gamma_k$ so that the jump intensities on $\mathcal{S}_k$ are bounded by $\gamma_k$, that is, suppose $-q_{ii} \le \gamma_k$ for $i \in \mathcal{S}_k$. If $\sum_{k=1}^\infty \frac{1}{\gamma_k} = +\infty$, then the process $X$ is nonexplosive.

6.4 Classification of birth-death processes

The classification of birth-death processes, introduced in Example 6.3.3, is relatively simple.
To avoid trivialities, consider a birth-death process such that the birth rates, $(\lambda_i : i \ge 0)$ and death rates $(\mu_i : i \ge 1)$ are all strictly positive. Then the process is irreducible.

First, investigate whether the process is nonexplosive, because this is a necessary condition for both recurrence and positive recurrence. This is usually a simple matter, because if the rates are bounded or grow at most linearly, the process is nonexplosive by Proposition 6.3.7. In some cases, even if Proposition 6.3.7 doesn't apply, it can be shown by some other means that the process is nonexplosive. For example, a test is given below for the process to be recurrent, and if it is recurrent, it is not explosive.

Next, investigate whether $X$ is positive recurrent. Suppose we already know that the process is nonexplosive. Then the process is positive recurrent if and only if $\pi Q = 0$ for some probability distribution $\pi$, and if it is positive recurrent, $\pi$ is the equilibrium distribution. Now $\pi Q = 0$ if and only if flow balance holds for any state $k$:
\[ (\lambda_k + \mu_k)\pi_k = \lambda_{k-1}\pi_{k-1} + \mu_{k+1}\pi_{k+1}. \tag{6.6} \]
Equivalently, flow balance must hold for all sets of the form $\{0, \ldots, n-1\}$ (just sum each side of (6.6) over $k \in \{0, \ldots, n-1\}$). Therefore, $\pi Q = 0$ if and only if $\pi_{n-1}\lambda_{n-1} = \pi_n\mu_n$ for $n \ge 1$, which holds if and only if there is a probability distribution $\pi$ with $\pi_n = \pi_0\lambda_0 \cdots \lambda_{n-1}/(\mu_1 \cdots \mu_n)$ for $n \ge 1$. Thus, a probability distribution $\pi$ with $\pi Q = 0$ exists if and only if $S_1 < +\infty$, where
\[ S_1 = \sum_{i=0}^\infty \frac{\lambda_0 \cdots \lambda_{i-1}}{\mu_1 \cdots \mu_i}, \tag{6.7} \]
with the understanding that the $i = 0$ term in the sum defining $S_1$ is one. Thus, under the assumption that $X$ is nonexplosive, $X$ is positive recurrent if and only if $S_1 < \infty$, and if $X$ is positive recurrent, the equilibrium distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1\mu_1 \cdots \mu_n)$.

Finally, investigate whether $X$ is recurrent.
This step is not necessary if we already know that $X$ is positive recurrent, because a positive recurrent process is recurrent. The following test for recurrence is valid whether or not $X$ is explosive. Since all states have the same classification, the process is recurrent if and only if state 0 is recurrent. Thus, the process is recurrent if the probability the process never hits 0, for initial state 1, is zero. We shall first find the probability of never hitting state zero for a modified process, which stops upon reaching a large state $n$, and then let $n \to \infty$ to find the probability the original process never reaches state 0. Let $b_{in}$ denote the probability, for initial state $i$, the process does not reach zero before reaching $n$. Set the boundary conditions, $b_{0n} = 0$ and $b_{nn} = 1$. Fix $i$ with $1 \le i \le n-1$, and derive an expression for $b_{in}$ by first conditioning on the state reached by the first jump of the process, starting from state $i$. By the space-time structure, the probability the first jump is up is $\lambda_i/(\lambda_i + \mu_i)$ and the probability the first jump is down is $\mu_i/(\lambda_i + \mu_i)$. Thus,
\[ b_{in} = \frac{\lambda_i}{\lambda_i + \mu_i} b_{i+1,n} + \frac{\mu_i}{\lambda_i + \mu_i} b_{i-1,n}, \]
which can be rewritten as $\mu_i(b_{in} - b_{i-1,n}) = \lambda_i(b_{i+1,n} - b_{i,n})$. In particular, $b_{2n} - b_{1n} = b_{1n}\mu_1/\lambda_1$ and $b_{3n} - b_{2n} = b_{1n}\mu_1\mu_2/(\lambda_1\lambda_2)$, and so on, which upon summing yields the expression
\[ b_{kn} = b_{1n} \sum_{i=0}^{k-1} \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i}, \]
with the convention that the $i = 0$ term in the sum is one. Finally, the condition $b_{nn} = 1$ yields the solution
\[ b_{1n} = \frac{1}{\sum_{i=0}^{n-1} \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i}}. \tag{6.8} \]
Note that $b_{1n}$ is the probability, for initial state 1, of the event $B_n$ that state $n$ is reached without an earlier visit to state 0. Since $B_{n+1} \subset B_n$ for all $n \ge 1$,
\[ P[\cap_{n\ge 1} B_n | X(0) = 1] = \lim_{n\to\infty} b_{1n} = 1/S_2 \tag{6.9} \]
where $S_2 = \sum_{i=0}^\infty \frac{\mu_1\mu_2 \cdots \mu_i}{\lambda_1\lambda_2 \cdots \lambda_i}$, with the understanding that the $i = 0$ term in the sum defining $S_2$ is one.
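Formula (6.8) is easy to evaluate numerically, which gives a feel for how $b_{1n}$ approaches $1/S_2$ in (6.9). The sketch below is an illustration only: the function name `escape_probability`, the callable rate arguments and the constant rates are assumptions chosen so that the sums can be checked by hand.

```python
def escape_probability(lam, mu, n):
    """b_{1n} from (6.8): probability, from state 1, of reaching n before 0."""
    total = 0.0
    term = 1.0     # mu_1...mu_i / (lam_1...lam_i); the i = 0 term is one
    for i in range(n):
        total += term
        term *= mu(i + 1) / lam(i + 1)
    return 1.0 / total


# Births twice as fast as deaths: the terms are (1/2)^i, so S_2 = 2.
b_fast = [escape_probability(lambda k: 2.0, lambda k: 1.0, n) for n in (1, 2, 5, 50)]

# Equal rates: the partial sums equal n, so b_{1n} = 1/n and S_2 is infinite.
b_equal = escape_probability(lambda k: 1.0, lambda k: 1.0, 100)
```

With births twice as fast as deaths, $b_{1n}$ decreases from 1 toward $1/S_2 = 1/2$, so state zero is avoided forever with positive probability. With equal rates, $b_{1n} = 1/n$ tends to zero, matching the case $S_2 = \infty$.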
Due to the definition of pure jump processes used, whenever $X$ visits a state in $\mathcal{S}$ the number of jumps up until that time is finite. Thus, on the event $\cap_{n\ge 1} B_n$, state zero is never reached. Conversely, if state zero is never reached, then either the process remains bounded (which has probability zero) or $\cap_{n\ge 1} B_n$ is true. Thus, $P[\text{zero is never reached} | X(0) = 1] = 1/S_2$. Consequently, $X$ is recurrent if and only if $S_2 = \infty$. In summary, the following proposition is proved.

Proposition 6.4.1 Suppose $X$ is a continuous-time birth-death process with strictly positive birth rates and death rates. If $X$ is nonexplosive (for example, if the rates are bounded or grow at most linearly with $n$, or if $S_2 = \infty$) then $X$ is positive recurrent if and only if $S_1 < +\infty$. If $X$ is positive recurrent the equilibrium probability distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1\mu_1 \cdots \mu_n)$. The process $X$ is recurrent if and only if $S_2 = \infty$.

Discrete-time birth-death processes have a similar characterization. They are discrete-time, time-homogeneous Markov processes with state space equal to the set of nonnegative integers. Let nonnegative birth probabilities $(\lambda_k : k \ge 0)$ and death probabilities $(\mu_k : k \ge 1)$ satisfy $\lambda_0 \le 1$, and $\lambda_k + \mu_k \le 1$ for $k \ge 1$. The one-step transition probability matrix $P = (p_{ij} : i, j \ge 0)$ is given by
\[ p_{ij} = \begin{cases} \lambda_i & \text{if } j = i + 1 \\ \mu_i & \text{if } j = i - 1 \\ 1 - \lambda_i - \mu_i & \text{if } j = i \ge 1 \\ 1 - \lambda_0 & \text{if } j = i = 0 \\ 0 & \text{else.} \end{cases} \tag{6.10} \]
Implicit in the specification of $P$ is that births and deaths can't happen simultaneously. If the birth and death probabilities are strictly positive, Proposition 6.4.1 holds as before, with the exception that the discrete-time process cannot be explosive (see footnote 2).

6.5 Time averages vs. statistical averages

Let $X$ be a positive recurrent, irreducible, time-homogeneous Markov process with equilibrium probability distribution $\pi$.
To be definite, suppose $X$ is a continuous-time process, with pure-jump sample paths and generator matrix $Q$. The results of this section apply with minor modifications to the discrete-time setting as well. Above it is noted that $\lim_{t \to \infty} \pi_i(t) = \pi_i = 1/(-q_{ii} M_i)$, where $M_i$ is the mean "cycle time" of state $i$. A related consideration is convergence of the empirical distribution of the Markov process, where the empirical distribution is the distribution observed over a (usually large) time interval.

For a fixed state $i$, the fraction of time the process spends in state $i$ during $[0, t]$ is

$$\frac{1}{t} \int_0^t I_{\{X(s) = i\}}\, ds.$$

Let $T_0$ denote the time that the process is first in state $i$, and let $T_k$ for $k \ge 1$ denote the time that the process jumps to state $i$ for the $k$th time after $T_0$. The cycle times $T_{k+1} - T_k$, $k \ge 0$ are independent and identically distributed, with mean $M_i$. Therefore, by the law of large numbers, with probability one,

$$\lim_{k \to \infty} T_k/(k M_i) = \lim_{k \to \infty} \frac{1}{k M_i} \sum_{l=0}^{k-1} (T_{l+1} - T_l) = 1.$$

Footnote 2: If in addition $\lambda_i + \mu_i = 1$ for all $i$, then the discrete-time process has period 2.

Furthermore, during the $k$th cycle interval $[T_k, T_{k+1})$, the amount of time spent by the process in state $i$ is exponentially distributed with mean $-1/q_{ii}$, and the time spent in the state during disjoint cycles is independent. Thus, with probability one,

$$\lim_{k \to \infty} \frac{1}{k M_i} \int_0^{T_k} I_{\{X(s)=i\}}\, ds = \lim_{k \to \infty} \frac{1}{k M_i} \sum_{l=0}^{k-1} \int_{T_l}^{T_{l+1}} I_{\{X(s)=i\}}\, ds = \frac{1}{M_i} E\left[ \int_{T_0}^{T_1} I_{\{X(s)=i\}}\, ds \right] = 1/(-q_{ii} M_i).$$

Combining these two observations yields that

$$\lim_{t \to \infty} \frac{1}{t} \int_0^t I_{\{X(s)=i\}}\, ds = 1/(-q_{ii} M_i) = \pi_i \qquad (6.11)$$

with probability one. In short, the limit (6.11) is expected, because the process spends on average $-1/q_{ii}$ time units in state $i$ per cycle from state $i$, and the cycle rate is $1/M_i$.
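The limit (6.11) can be observed in a small simulation. The sketch below is only an illustration under stated assumptions: it uses a hypothetical two-state process with jump rates $q_{01} = a$ and $q_{10} = b$, for which the mean cycle time of state 0 is $M_0 = 1/a + 1/b$, so that $\pi_0 = 1/(-q_{00} M_0) = b/(a+b)$. The function name and parameters are invented for this example.

```python
import random


def fraction_of_time_in_state0(a, b, num_jumps, seed=0):
    """Simulate a two-state pure-jump Markov process with q_01 = a and
    q_10 = b, started in state 0, and return the fraction of time spent
    in state 0 over the first num_jumps holding periods."""
    rng = random.Random(seed)
    state = 0
    time_in_0 = 0.0
    total_time = 0.0
    for _ in range(num_jumps):
        rate = a if state == 0 else b
        hold = rng.expovariate(rate)  # exponential holding time, mean 1/rate
        if state == 0:
            time_in_0 += hold
        total_time += hold
        state = 1 - state  # the only possible jump is to the other state
    return time_in_0 / total_time


a, b = 1.0, 3.0
estimate = fraction_of_time_in_state0(a, b, num_jumps=100_000)
print(estimate, "vs pi_0 =", b / (a + b))
```

For a long run the empirical fraction should be close to $b/(a+b)$, in line with the cycle argument: mean time $1/a$ in state 0 per cycle, divided by the mean cycle length $1/a + 1/b$.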
Of course, since state $i$ is arbitrary, if $j$ is any other state then

$$\lim_{t \to \infty} \frac{1}{t} \int_0^t I_{\{X(s)=j\}}\, ds = 1/(-q_{jj} M_j) = \pi_j \qquad (6.12)$$

By considering how the time in state $j$ is distributed among the cycles from state $i$, it follows that the mean time spent in state $j$ per cycle from state $i$ is $M_i \pi_j$. So for any nonnegative function $\phi$ on $\mathcal{S}$,

$$\begin{aligned}
\lim_{t \to \infty} \frac{1}{t} \int_0^t \phi(X(s))\, ds &= \lim_{k \to \infty} \frac{1}{k M_i} \int_0^{T_k} \phi(X(s))\, ds \\
&= \frac{1}{M_i} E\left[ \int_{T_0}^{T_1} \phi(X(s))\, ds \right] \\
&= \frac{1}{M_i} E\left[ \sum_{j \in \mathcal{S}} \phi(j) \int_{T_0}^{T_1} I_{\{X(s)=j\}}\, ds \right] \\
&= \frac{1}{M_i} \sum_{j \in \mathcal{S}} \phi(j)\, E\left[ \int_{T_0}^{T_1} I_{\{X(s)=j\}}\, ds \right] \\
&= \sum_{j \in \mathcal{S}} \phi(j) \pi_j \qquad (6.13)
\end{aligned}$$

Finally, if $\phi$ is a function on $\mathcal{S}$ such that either $\sum_{j \in \mathcal{S}} \phi^+(j) \pi_j < \infty$ or $\sum_{j \in \mathcal{S}} \phi^-(j) \pi_j < \infty$, then since (6.13) holds for both $\phi^+$ and $\phi^-$, it must hold for $\phi$ itself.

Figure 6.7: A single server queueing system.

6.6 Queueing systems, M/M/1 queue and Little's law

Some basic terminology of queueing theory will now be explained. A simple type of queueing system is pictured in Figure 6.7. Notice that the system is comprised of a queue and a server. Ordinarily whenever the system is not empty, there is a customer in the server, and any other customers in the system are waiting in the queue. When the service of a customer is complete it departs from the server and then another customer from the queue, if any, immediately enters the server. The choice of which customer to be served next depends on the service discipline. Common service disciplines are first-come first-served (FCFS) in which customers are served in the order of their arrival, or last-come first-served (LCFS) in which the customer that arrived most recently is served next. Some of the more complicated service disciplines involve priority classes, or the notion of "processor sharing" in which all customers present in the system receive equal attention from the server. Often models of queueing systems involve a stochastic description.
For example, given positive parameters $\lambda$ and $\mu$, we may declare that the arrival process is a Poisson process with rate $\lambda$, and that the service times of the customers are independent and exponentially distributed with parameter $\mu$. Many queueing systems are given labels of the form A/B/s, where "A" is chosen to denote the type of arrival process, "B" is used to denote the type of departure process, and s is the number of servers in the system. In particular, the system just described is called an M/M/1 queueing system, so-named because the arrival process is memoryless (i.e. a Poisson arrival process), the service times are memoryless (i.e. are exponentially distributed), and there is a single server. Other labels for queueing systems have a fourth descriptor and thus have the form A/B/s/b, where b denotes the maximum number of customers that can be in the system. Thus, an M/M/1 system is also an M/M/1/$\infty$ system, because there is no finite bound on the number of customers in the system.

A second way to specify an M/M/1 queueing system with parameters $\lambda$ and $\mu$ is to let $A(t)$ and $D(t)$ be independent Poisson processes with rates $\lambda$ and $\mu$ respectively. Process $A$ marks customer arrival times and process $D$ marks potential customer departure times. The number of customers in the system, starting from some initial value $N(0)$, evolves as follows. Each time there is a jump of $A$, a customer arrives to the system. Each time there is a jump of $D$, there is a potential departure, meaning that if there is a customer in the server at the time of the jump then the customer departs. If a potential departure occurs when the system is empty then the potential departure has no effect on the system. The number of customers in the system $N$ can thus be expressed as

$$N(t) = N(0) + A(t) - \int_0^t I_{\{N(s-) \ge 1\}}\, dD(s).$$

It is easy to verify that the resulting process $N$ is Markov, which leads to the third specification of an M/M/1 queueing system.
A third way to specify an M/M/1 queueing system is that the number of customers in the system $N(t)$ is a birth-death process with $\lambda_k = \lambda$ and $\mu_k = \mu$ for all $k$, for some parameters $\lambda$ and $\mu$. Let $\rho = \lambda/\mu$. Using the classification criteria derived for birth-death processes, it is easy to see that the system is recurrent if and only if $\rho \le 1$, and that it is positive recurrent if and only if $\rho < 1$. Moreover, if $\rho < 1$ then the equilibrium distribution for the number of customers in the system is given by $\pi_k = (1-\rho)\rho^k$ for $k \ge 0$. This is the geometric distribution with zero as a possible value, and with mean

$$\bar{N} = \sum_{k=0}^{\infty} k \pi_k = (1-\rho)\rho \sum_{k=1}^{\infty} \rho^{k-1} k = (1-\rho)\rho \left( \frac{1}{1-\rho} \right)' = \frac{\rho}{1-\rho},$$

where the prime denotes differentiation with respect to $\rho$. The probability the server is busy, which is also the mean number of customers in the server, is $1 - \pi_0 = \rho$. The mean number of customers in the queue is thus given by $\rho/(1-\rho) - \rho = \rho^2/(1-\rho)$. This third specification is the most commonly used way to define an M/M/1 queueing process.

Since the M/M/1 process $N(t)$ is positive recurrent when $\rho < 1$, the Markov ergodic convergence theorem implies that the statistical averages just computed, such as $\bar{N}$, are also equal to the limit of the time-averaged number of customers in the system as the averaging interval tends to infinity.

An important performance measure for a queueing system is the mean time spent in the system or the mean time spent in the queue. Little's law, described next, is a quite general and useful relationship that aids in computing mean transit time.

Little's law can be applied in a great variety of circumstances involving flow through a system with delay. In the context of queueing systems we speak of a flow of customers, but the same principle applies to a flow of water through a pipe. Little's law is that $\bar{\lambda}\, \bar{T} = \bar{N}$ where $\bar{\lambda}$ is the mean flow rate, $\bar{T}$ is the mean delay in the system, and $\bar{N}$ is the mean content of the system.
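The M/M/1 equilibrium quantities above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the function name `mm1_summary`, the truncation level `kmax`, and the dictionary keys are assumptions made for this illustration. It compares a truncated version of the sum $\sum_k k \pi_k$ with the closed form $\rho/(1-\rho)$, and applies Little's law as stated above to obtain a mean time in system.

```python
def mm1_summary(lam, mu, kmax=500):
    """Equilibrium quantities of an M/M/1 queue with arrival rate lam and
    service rate mu (requires lam < mu). kmax is a hypothetical truncation
    level for the numerical check of the mean."""
    rho = lam / mu
    if rho >= 1:
        raise ValueError("positive recurrence requires rho = lam/mu < 1")
    # Geometric equilibrium distribution pi_k = (1 - rho) rho^k, truncated.
    pi = [(1 - rho) * rho**k for k in range(kmax + 1)]
    mean_in_system = rho / (1 - rho)
    return {
        "rho": rho,
        "mean_in_system": mean_in_system,
        "mean_in_system_numeric": sum(k * p for k, p in enumerate(pi)),
        "prob_server_busy": 1 - pi[0],
        "mean_in_queue": rho**2 / (1 - rho),
        # Little's law: mean time in system = mean content / mean flow rate.
        "mean_time_in_system": mean_in_system / lam,
    }


summary = mm1_summary(lam=1.0, mu=2.0)
for name, value in summary.items():
    print(name, value)
```

For `lam=1.0` and `mu=2.0` the load is $\rho = 1/2$, so the closed forms give one customer in the system on average and half a customer in the queue.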
For example, if water flows through a pipe with volume one cubic meter at the rate of two cubic meters per minute, then the mean time (averaged over all drops of water) that water spends in the pipe is $\bar{T} = \bar{N}/\bar{\lambda} = 1/2$ minute. This is clear if water flows through the pipe without mixing, for then the transit time of each drop of water is 1/2 minute. However, mixing within the pipe does not affect the average transit time.

Little's law is actually a set of results, each with somewhat different mathematical assumptions. The following version is quite general. Figure 6.8 pictures the cumulative number of arrivals ($\alpha(t)$) and the cumulative number of departures ($\delta(t)$) versus time, for a queueing system assumed to be initially empty. Note that the number of customers in the system at any time $s$ is given by the difference $N(s) = \alpha(s) - \delta(s)$, which is the vertical distance between the arrival and departure graphs in the figure. On the other hand, assuming that customers are served in first-come first-served order, the horizontal distance between the graphs gives the times in system for the customers.

Figure 6.8: A single server queueing system.

Given a (usually large) $t > 0$, let $\gamma_t$ denote the area of the region between the two graphs over the interval $[0, t]$. This is the shaded region indicated in the figure. It is then natural to define the time-averaged values of arrival rate and system content as

$$\bar{\lambda}_t = \alpha(t)/t \quad \text{and} \quad \bar{N}_t = \frac{1}{t} \int_0^t N(s)\, ds = \gamma_t/t.$$

Finally, the average, over the $\alpha(t)$ customers that arrive during the interval $[0, t]$, of the time spent in the system up to time $t$, is given by $\bar{T}_t = \gamma_t/\alpha(t)$. Once these definitions are accepted, we have the following obvious proposition.
Proposition 6.6.1 (Little's law, expressed using averages over time) For any $t > 0$,

$$\bar{N}_t = \bar{\lambda}_t \bar{T}_t \qquad (6.14)$$

Furthermore, if any two of the three variables in (6.14) converge to a positive finite limit as $t \to \infty$, then so does the third variable, and the limits satisfy $\bar{N}_\infty = \bar{\lambda}_\infty \bar{T}_\infty$.

For example, the number of customers in an M/M/1 queue is a positive recurrent Markov process so that

$$\lim_{t \to \infty} \bar{N}_t = \bar{N} = \rho/(1-\rho)$$

where calculation of the statistical mean $\bar{N}$ was previously discussed. Also, by the law of large numbers applied to interarrival times, we have that the Poisson arrival process for an M/M/1 queue satisfies $\lim_{t \to \infty} \bar{\lambda}_t = \lambda$ with probability one. Thus, with probability one,

$$\lim_{t \to \infty} \bar{T}_t = \bar{N}/\lambda = \frac{1}{\mu - \lambda}.$$

In this sense, the average waiting time in an M/M/1 system is $1/(\mu - \lambda)$. The average time in service is $1/\mu$ (this follows from the third description of an M/M/1 queue, or also from Little's law applied to the server alone) so that the average waiting time in queue is given by $W = 1/(\mu - \lambda) - 1/\mu = \rho/(\mu - \lambda)$. This final result also follows from Little's law applied to the queue alone.

6.7 Mean arrival rate, distributions seen by arrivals, and PASTA

The mean arrival rate for the M/M/1 system is $\lambda$, the parameter of the Poisson arrival process. However for some queueing systems the arrival rate depends on the number of customers in the system. In such cases the mean arrival rate is still typically meaningful, and it can be used in Little's law.

Suppose the number of customers in a queueing system is modeled by a birth-death process with arrival rates $(\lambda_k)$ and departure rates $(\mu_k)$. Suppose in addition that the process is positive recurrent. Intuitively, the process spends a fraction of time $\pi_k$ in state $k$ and while in state $k$ the arrival rate is $\lambda_k$.
Therefore, the average arrival rate is

$$\bar{\lambda} = \sum_{k=0}^{\infty} \pi_k \lambda_k.$$

Similarly the average departure rate is

$$\bar{\mu} = \sum_{k=1}^{\infty} \pi_k \mu_k$$

and of course $\bar{\lambda} = \bar{\mu}$ because both are equal to the throughput of the system.

Often the distribution of a system at particular system-related sampling times is more important than the distribution in equilibrium. For example, the distribution seen by arriving customers may be the most relevant distribution, as far as the customers are concerned. If the arrival rate depends on the number of customers in the system then the distribution seen by arrivals need not be the same as the equilibrium distribution. Intuitively, $\pi_k \lambda_k$ is the long-term frequency of arrivals which occur when there are $k$ customers in the system, so that the fraction of customers that see $k$ customers in the system upon arrival is given by

$$r_k = \frac{\pi_k \lambda_k}{\bar{\lambda}}.$$

The following is an example of a system with variable arrival rate.

Example 6.7.1 (Single-server, discouraged arrivals) Suppose $\lambda_k = \alpha/(k+1)$ and $\mu_k = \mu$ for all $k$, where $\mu$ and $\alpha$ are positive constants. Then

$$S_2 = \sum_{k=0}^{\infty} \frac{(k+1)!\, \mu^k}{\alpha^k} = \infty \quad \text{and} \quad S_1 = \sum_{k=0}^{\infty} \frac{\alpha^k}{k!\, \mu^k} = \exp\left(\frac{\alpha}{\mu}\right) < \infty$$

so that the number of customers in the system is a positive recurrent Markov process, with no additional restrictions on $\alpha$ and $\mu$. Moreover, the equilibrium probability distribution is given by $\pi_k = (\alpha/\mu)^k \exp(-\alpha/\mu)/k!$, which is the Poisson distribution with mean $\bar{N} = \alpha/\mu$. The mean arrival rate is

$$\begin{aligned}
\bar{\lambda} &= \sum_{k=0}^{\infty} \frac{\pi_k \alpha}{k+1} \\
&= \mu \exp\left(-\frac{\alpha}{\mu}\right) \sum_{k=0}^{\infty} \frac{(\alpha/\mu)^{k+1}}{(k+1)!} \\
&= \mu \exp\left(-\frac{\alpha}{\mu}\right) \left( \exp\left(\frac{\alpha}{\mu}\right) - 1 \right) = \mu \left( 1 - \exp\left(-\frac{\alpha}{\mu}\right) \right). \qquad (6.15)
\end{aligned}$$

This expression derived for $\bar{\lambda}$ is clearly equal to $\bar{\mu}$, because the departure rate is $\mu$ with probability $1 - \pi_0$ and zero otherwise.
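The computation in (6.15) can be reproduced numerically. The sketch below is an illustration only: the function name `discouraged_arrivals` and the truncation level `kmax` are assumptions of this example. It builds the Poisson equilibrium distribution, evaluates $\bar{\lambda} = \sum_k \pi_k \lambda_k$ by a truncated sum, and forms the distribution $r_k = \pi_k \lambda_k / \bar{\lambda}$ seen by arrivals.

```python
import math


def discouraged_arrivals(alpha, mu, kmax=100):
    """Equilibrium distribution pi, mean arrival rate lam_bar, and the
    distribution r seen by arrivals, for lambda_k = alpha/(k+1) and
    mu_k = mu. Sums are truncated at the hypothetical level kmax."""
    a = alpha / mu
    # Poisson distribution with mean alpha/mu.
    pi = [math.exp(-a) * a**k / math.factorial(k) for k in range(kmax + 1)]
    # Mean arrival rate: sum over states of pi_k * lambda_k.
    lam_bar = sum(p * alpha / (k + 1) for k, p in enumerate(pi))
    # Fraction of arrivals that find k customers in the system.
    r = [p * alpha / ((k + 1) * lam_bar) for k, p in enumerate(pi)]
    return pi, lam_bar, r


alpha, mu = 2.0, 1.0
pi, lam_bar, r = discouraged_arrivals(alpha, mu)
print("numeric lam_bar:", lam_bar)
print("closed form    :", mu * (1 - math.exp(-alpha / mu)))
print("mean seen by arrivals:", sum(k * rk for k, rk in enumerate(r)))
print("equilibrium mean     :", alpha / mu)
```

The truncated sum should agree with $\mu(1 - \exp(-\alpha/\mu))$ to within rounding, and the mean of $r$ comes out smaller than the equilibrium mean $\alpha/\mu$, since arrivals are more frequent when the system is lightly loaded.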
The distribution of the number of customers in the system seen by arrivals, $(r_k)$, is given by

$$r_k = \frac{\pi_k \alpha}{(k+1)\bar{\lambda}} = \frac{(\alpha/\mu)^{k+1} \exp(-\alpha/\mu)}{(k+1)!\,(1 - \exp(-\alpha/\mu))} \quad \text{for } k \ge 0$$

which in words can be described as the result of removing the probability mass at zero in the Poisson distribution, shifting the distribution down by one, and then renormalizing. The mean number of customers in the system seen by a typical arrival is therefore $(\alpha/\mu - 1 + \exp(-\alpha/\mu))/(1 - \exp(-\alpha/\mu))$. This mean is somewhat less than $\bar{N}$ because, roughly speaking, the customer arrival rate is higher when the system is more lightly loaded.

The equivalence of time-averages and statistical averages for computing the mean arrival rate and the distribution seen by arrivals can be shown by application of ergodic properties of the processes involved. The associated formal approach is described next, in slightly more generality. Let $X$ denote an irreducible, positive-recurrent pure-jump Markov process. If the process makes a jump from state $i$ to state $j$ at time $t$, say that a transition of type $(i, j)$ occurs. The sequence of transitions of $X$ forms a new Markov process, $Y$. The process $Y$ is a discrete-time Markov process with state space $\{(i, j) \in \mathcal{S} \times \mathcal{S} : q_{ij} > 0\}$, and it can be described in terms of the jump process for $X$, by $Y(k) = (X^J(k-1), X^J(k))$ for $k \ge 0$. (Let $X^J(-1)$ be defined arbitrarily.)

The one-step transition probability matrix of the jump process $X^J$ is given by $p^J_{ij} = q_{ij}/(-q_{ii})$ for $i \ne j$, and $X^J$ is recurrent because $X$ is recurrent. Its equilibrium distribution $\pi^J$ (if it exists) is proportional to $-\pi_i q_{ii}$ (see Problem 6.3), and $X^J$ is positive recurrent if and only if this distribution can be normalized to make a probability distribution, i.e. if and only if $R = -\sum_i \pi_i q_{ii} < \infty$. Assume for simplicity that $X^J$ is positive recurrent. Then $\pi^J_i = -\pi_i q_{ii}/R$ is the equilibrium probability distribution of $X^J$.
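The relation $\pi^J_i = -\pi_i q_{ii}/R$ can be verified on a small example. The sketch below assumes a hypothetical three-state birth-death process (birth rates 1 and 2, death rates 3 and 4, chosen only for illustration) and uses exact rational arithmetic; the function name `jump_chain_equilibrium` is likewise an assumption of this example.

```python
from fractions import Fraction


def jump_chain_equilibrium(Q, pi):
    """Given a generator matrix Q (list of rows) and the equilibrium
    distribution pi of X, return the jump-chain transition matrix P_J
    and the distribution pi_J proportional to -pi_i q_ii."""
    n = len(Q)
    R = -sum(pi[i] * Q[i][i] for i in range(n))
    pi_J = [-pi[i] * Q[i][i] / R for i in range(n)]
    P_J = [[Fraction(0) if i == j else Q[i][j] / (-Q[i][i])
            for j in range(n)] for i in range(n)]
    return P_J, pi_J


# Hypothetical rates, not taken from the notes.
lam0, lam1, mu1, mu2 = map(Fraction, (1, 2, 3, 4))
Q = [[-lam0, lam0, Fraction(0)],
     [mu1, -(lam1 + mu1), lam1],
     [Fraction(0), mu2, -mu2]]
# Birth-death equilibrium: pi_n proportional to lam_0...lam_{n-1}/(mu_1...mu_n).
weights = [Fraction(1), lam0 / mu1, lam0 * lam1 / (mu1 * mu2)]
pi = [w / sum(weights) for w in weights]

P_J, pi_J = jump_chain_equilibrium(Q, pi)
# Stationarity check: pi_J P_J should reproduce pi_J.
pi_J_next = [sum(pi_J[i] * P_J[i][j] for i in range(3)) for j in range(3)]
print(pi_J, pi_J_next)
```

Because the arithmetic is exact, `pi_J_next` should equal `pi_J` entry by entry, confirming on this example that the reweighted distribution is stationary for the jump chain.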
Furthermore, $Y$ is positive recurrent and its equilibrium distribution is given by

$$\pi^Y_{ij} = \pi^J_i p^J_{ij} = \frac{-\pi_i q_{ii}}{R} \cdot \frac{q_{ij}}{-q_{ii}} = \frac{\pi_i q_{ij}}{R}.$$

Since limiting time averages equal statistical averages for $Y$,

$$\lim_{n \to \infty} \frac{\text{number of first } n \text{ transitions of } X \text{ that are type } (i, j)}{n} = \frac{\pi_i q_{ij}}{R}$$

with probability one. Therefore, if $A \subset \mathcal{S} \times \mathcal{S}$, and if $(i, j) \in A$, then

$$\lim_{n \to \infty} \frac{\text{number of first } n \text{ transitions of } X \text{ that are type } (i, j)}{\text{number of first } n \text{ transitions of } X \text{ with type in } A} = \frac{\pi_i q_{ij}}{\sum_{(i', j') \in A} \pi_{i'} q_{i'j'}} \qquad (6.16)$$

To apply this setup to the special case of a queueing system in which the number of customers in the system is a Markov birth-death process, let the set $A$ be the set of transitions of the form $(i, i+1)$. Then deduce that the fraction of the first $n$ arrivals that see $i$ customers in the system upon arrival converges to $\pi_i \lambda_i / \sum_j \pi_j \lambda_j$ with probability one.

Note that if $\lambda_i = \lambda$ for all $i$, then $\bar{\lambda} = \lambda$ and $\pi = r$. The condition $\lambda_i = \lambda$ also implies that the arrival process is Poisson. This situation is called "Poisson Arrivals See Time Averages" (PASTA).

6.8 More examples of queueing systems modeled as Markov birth-death processes

For each of the four examples of this section it is assumed that new customers are offered to the system according to a Poisson process with rate $\lambda$, so that the PASTA property holds. Also, when there are $k$ customers in the system then the service rate is $\mu_k$ for some given numbers $\mu_k$. The number of customers in the system is a Markov birth-death process with $\lambda_k = \lambda$ for all $k$. Since the number of transitions of the process up to any given time $t$ is at most twice the number of customers that arrived by time $t$, the Markov process is not explosive. Therefore the process is positive recurrent if and only if $S_1$ is finite, where

$$S_1 = \sum_{k=0}^{\infty} \frac{\lambda^k}{\mu_1 \mu_2 \cdots \mu_k}.$$

Special cases of this example are presented in the next four examples.
Example 6.8.1 (M/M/m systems) An M/M/m queueing system consists of a single queue and $m$ servers. The arrival process is Poisson with some rate $\lambda$ and the customer service times are independent and exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. The total number of customers in the system is a birth-death process with $\mu_k = \mu \min(k, m)$. Let $\rho = \lambda/(m\mu)$. Since $\mu_k = m\mu$ for all $k$ large enough it is easy to check that the process is positive recurrent if and only if $\rho < 1$. Assume now that $\rho < 1$. Then the equilibrium distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1\, k!} \quad \text{for } 0 \le k \le m, \qquad \pi_{m+j} = \pi_m \rho^j \quad \text{for } j \ge 1$$

where $S_1$ is chosen to make the probabilities sum to one (use the fact $1 + \rho + \rho^2 + \cdots = 1/(1-\rho)$):

$$S_1 = \left( \sum_{k=0}^{m-1} \frac{(\lambda/\mu)^k}{k!} \right) + \frac{(\lambda/\mu)^m}{m!\,(1-\rho)}.$$

An arriving customer must join the queue (rather than go directly to a server) if and only if the system has $m$ or more customers in it. By the PASTA property, this is the same as the equilibrium probability of having $m$ or more customers in the system:

$$P_Q = \sum_{j=0}^{\infty} \pi_{m+j} = \pi_m/(1-\rho).$$

This formula is called the Erlang C formula for probability of queueing.

Example 6.8.2 (M/M/m/m systems) An M/M/m/m queueing system consists of $m$ servers. The arrival process is Poisson with some rate $\lambda$ and the customer service times are independent and exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. Since there is no queue, if a customer arrives when there are already $m$ customers in the system, then the arrival is blocked and cleared from the system. The total number of customers in the system is a birth-death process, but with the state space reduced to $\{0, 1, \ldots, m\}$, and with $\mu_k = k\mu$ for $1 \le k \le m$. The unique equilibrium distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1\, k!} \quad \text{for } 0 \le k \le m$$

where $S_1$ is chosen to make the probabilities sum to one.
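The equilibrium distributions of Examples 6.8.1 and 6.8.2 are easy to evaluate numerically. The sketch below is a hedged illustration: the function names `erlang_c` and `mmm_loss_distribution` are assumptions of this example, and both functions take the service rate $\mu$ (so that $\mu_k = \mu \min(k, m)$, respectively $\mu_k = k\mu$).

```python
import math


def erlang_c(lam, mu, m):
    """Probability of queueing P_Q = pi_m / (1 - rho) for an M/M/m system
    with arrival rate lam, per-server service rate mu, and m servers."""
    rho = lam / (m * mu)
    if rho >= 1:
        raise ValueError("positive recurrence requires lam < m * mu")
    a = lam / mu
    s1 = (sum(a**k / math.factorial(k) for k in range(m))
          + a**m / (math.factorial(m) * (1 - rho)))
    pi_m = a**m / (s1 * math.factorial(m))
    return pi_m / (1 - rho)


def mmm_loss_distribution(lam, mu, m):
    """Equilibrium distribution (pi_0, ..., pi_m) of an M/M/m/m system."""
    a = lam / mu
    weights = [a**k / math.factorial(k) for k in range(m + 1)]
    s1 = sum(weights)  # normalizing constant S_1 for the finite state space
    return [w / s1 for w in weights]


print(erlang_c(lam=1.0, mu=1.0, m=2))            # two servers, rho = 1/2
print(mmm_loss_distribution(lam=1.0, mu=1.0, m=2))
```

As a sanity check, with a single server the M/M/m formula reduces to the M/M/1 case, where the probability that an arrival must wait equals $\rho$.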
An arriving customer is blocked and cleared from the system if and only if the system already has $m$ customers in it. By the PASTA property, this is the same as the equilibrium probability of having $m$ customers in the system:

$$P_B = \pi_m = \frac{(\lambda/\mu)^m / m!}{\sum_{j=0}^{m} (\lambda/\mu)^j / j!}.$$

This formula is called the Erlang B formula for probability of blocking.

Example 6.8.3 (A system with a discouraged server) The number of customers in this system is a birth-death process with constant birth rate $\lambda$ and death rates $\mu_k = 1/k$. It is easy to check that all states are transient for any positive value of $\lambda$ (to verify this it suffices to check that $S_2 < \infty$). It is not difficult to show that $N(t)$ converges to $+\infty$ with probability one as $t \to \infty$.

Example 6.8.4 (A barely stable system) The number of customers in this system is a birth-death process with constant birth rate $\lambda$ and death rates $\mu_k = \frac{\lambda(1+k^2)}{1+(k-1)^2}$ for all $k \ge 1$. Since the departure rates are barely larger than the arrival rates, this system is near the borderline between recurrence and transience. However, we see that

$$S_1 = \sum_{k=0}^{\infty} \frac{1}{1+k^2} < \infty$$

so that $N(t)$ is positive recurrent with equilibrium distribution $\pi_k = 1/(S_1(1+k^2))$. Note that the mean number of customers in the system is

$$\bar{N} = \sum_{k=0}^{\infty} \frac{k}{S_1(1+k^2)} = \infty.$$

By Little's law the mean time customers spend in the system is also infinity. It is debatable whether this system should be thought of as "stable" even though all states are positive recurrent and all waiting times are finite with probability one.

6.9 Foster-Lyapunov stability criterion and moment bounds

Communication network models can become quite complex, especially when dynamic scheduling, congestion, and physical layer effects such as fading wireless channel models are included. It is thus useful to have methods to give approximations or bounds on key performance parameters.
The criteria for stability and related moment bounds discussed in this chapter are useful for providing such bounds.

Aleksandr Mikhailovich Lyapunov (1857-1918) contributed significantly to the theory of stability of dynamical systems. Although a dynamical system may evolve on a complicated, multiple dimensional state space, a recurring theme of dynamical systems theory is that stability questions can often be settled by studying the potential of a system for some nonnegative potential function $V$. Potential functions used for stability analysis are widely called Lyapunov functions. Similar stability conditions have been developed by many authors for stochastic systems. Below we present the well known criteria due to Foster [4] for recurrence and positive recurrence. In addition we present associated bounds on the moments, which are expectations of some functions on the state space, computed with respect to the equilibrium probability distribution (see footnote 3).

Subsection 6.9.1 discusses the discrete-time tools, and presents examples involving load balancing routing, and input queued crossbar switches. Subsection 6.9.2 presents the continuous time tools, and an example.

6.9.1 Stability criteria for discrete-time processes

Consider an irreducible discrete-time Markov process $X$ on a countable state space $\mathcal{S}$, with one-step transition probability matrix $P$. If $f$ is a function on $\mathcal{S}$, then $Pf$ represents the function obtained by multiplication of the vector $f$ by the matrix $P$: $Pf(i) = \sum_{j \in \mathcal{S}} p_{ij} f(j)$. If $f$ is nonnegative, then $Pf$ is well defined, with the understanding that $Pf(i) = +\infty$ is possible for some, or all, values of $i$. An important property of $Pf$ is that $Pf(i) = E[f(X(t+1)) \mid X(t) = i]$. Let $V$ be a nonnegative function on $\mathcal{S}$, to serve as the Lyapunov function. The drift vector of $V(X(t))$ is

Footnote 3: A version of these moment bounds was given by Tweedie [15], and a version of the moment bound method was used by Kingman [5] in a queueing context.
As noted in [9], the moment bound method is closely related to Dynkin's formula. The works [13, 14, 6, 12], and many others, have demonstrated the wide applicability of the stability methods in various queueing network contexts, using quadratic Lyapunov functions.

defined by $d(i) = E[V(X(t+1)) \mid X(t) = i] - V(i)$. That is, $d = PV - V$. Note that $d(i)$ is always well-defined, if the value $+\infty$ is permitted. The drift vector is also given by

$$d(i) = \sum_{j : j \ne i} p_{ij} (V(j) - V(i)). \qquad (6.17)$$

Proposition 6.9.1 (Foster-Lyapunov stability criterion) Suppose $V : \mathcal{S} \to \mathbb{R}_+$ and $C$ is a finite subset of $\mathcal{S}$.
(a) If $\{i : V(i) \le K\}$ is finite for all $K$, and if $PV - V \le 0$ on $\mathcal{S} - C$, then $X$ is recurrent.
(b) If $\epsilon > 0$ and $b$ is a constant such that $PV - V \le -\epsilon + b I_C$, then $X$ is positive recurrent.

Proposition 6.9.2 (Moment bound) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$ and suppose

$$PV(i) - V(i) \le -f(i) + g(i) \quad \text{for all } i \in \mathcal{S} \qquad (6.18)$$

In addition, suppose $X$ is positive recurrent, so that the means, $\bar{f} = \pi f$ and $\bar{g} = \pi g$ are well-defined. Then $\bar{f} \le \bar{g}$. (In particular, if $g$ is bounded, then $\bar{g}$ is finite, and therefore $\bar{f}$ is finite.)

Corollary 6.9.3 (Combined Foster-Lyapunov stability criterion and moment bound) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$ such that

$$PV(i) - V(i) \le -f(i) + g(i) \quad \text{for all } i \in \mathcal{S} \qquad (6.19)$$

In addition, suppose for some $\epsilon > 0$ that the set $C$ defined by $C = \{i : f(i) < g(i) + \epsilon\}$ is finite. Then $X$ is positive recurrent and $\bar{f} \le \bar{g}$. (In particular, if $g$ is bounded, then $\bar{g}$ is finite, and therefore $\bar{f}$ is finite.)

Proof. Let $b = \max\{g(i) + \epsilon - f(i) : i \in C\}$. Then $V$, $C$, $b$, and $\epsilon$ satisfy the hypotheses of Proposition 6.9.1(b), so that $X$ is positive recurrent. Therefore the hypotheses of Proposition 6.9.2 are satisfied, so that $\bar{f} \le \bar{g}$.

The assumptions in Propositions 6.9.1 and 6.9.2 and Corollary 6.9.3 do not imply that $\bar{V}$ is finite.
Even so, since $V$ is nonnegative, for a given initial state $X(0)$, the long term average drift of $V(X(t))$ is nonnegative. This gives an intuitive reason why the mean downward part of the drift, $\bar{f}$, must be less than or equal to the mean upward part of the drift, $\bar{g}$.

Example 6.9.4 (Probabilistic routing to two queues) Consider the routing scenario with two queues, queue 1 and queue 2, fed by a single stream of packets, as pictured in Figure 6.9. Here, $0 \le a, u, d_1, d_2 \le 1$, and $\bar{u} = 1 - u$. The state space for the process is $\mathcal{S} = \mathbb{Z}_+^2$, where the state $x = (x_1, x_2)$ denotes $x_1$ packets in queue 1 and $x_2$ packets in queue 2. In each time slot, a new arrival is generated with probability $a$, and then is routed to queue 1 with probability $u$ and to queue 2 with probability $\bar{u}$. Then each queue $i$, if not empty, has a departure with probability $d_i$.

Figure 6.9: Two queues fed by a single arrival stream.

Note that we allow a packet to arrive and depart in the same slot. Thus, if $X_i(t)$ is the number of packets in queue $i$ at the beginning of slot $t$, then the system dynamics can be described as follows:

$$X_i(t+1) = X_i(t) + A_i(t) - D_i(t) + L_i(t) \quad \text{for } i \in \{1, 2\} \qquad (6.20)$$

where

• $A(t) = (A_1(t), A_2(t))$ is equal to $(1, 0)$ with probability $au$, $(0, 1)$ with probability $a\bar{u}$, and $A(t) = (0, 0)$ otherwise.
• $D_i(t) : t \ge 0$, are Bernoulli($d_i$) random variables, for $i \in \{1, 2\}$.
• All the $A(t)$'s, $D_1(t)$'s, and $D_2(t)$'s are mutually independent.
• $L_i(t) = (-(X_i(t) + A_i(t) - D_i(t)))_+$ (see explanation next).

If $X_i(t) + A_i(t) = 0$, there can be no actual departure from queue $i$. However, we still allow $D_i(t)$ to equal one. To keep the queue length process from going negative, we add the random variable $L_i(t)$ in (6.20). Thus, $D_i(t)$ is the potential number of departures from queue $i$ during the slot, and $D_i(t) - L_i(t)$ is the actual number of departures.
This completes the specification of the one-step transition probabilities of the Markov process.

A necessary condition for positive recurrence is, for any routing policy, $a < d_1 + d_2$, because the total arrival rate must be less than the total departure rate. We seek to show that this necessary condition is also sufficient, under the random routing policy.

Let us calculate the drift of $V(X(t))$ for the choice $V(x) = (x_1^2 + x_2^2)/2$. Note that $(X_i(t+1))^2 = (X_i(t) + A_i(t) - D_i(t) + L_i(t))^2 \le (X_i(t) + A_i(t) - D_i(t))^2$, because addition of the variable $L_i(t)$ can only push $X_i(t) + A_i(t) - D_i(t)$ closer to zero. Thus,

$$\begin{aligned}
PV(x) - V(x) &= E[V(X(t+1)) \mid X(t) = x] - V(x) \\
&\le \frac{1}{2} \sum_{i=1}^{2} E[(x_i + A_i(t) - D_i(t))^2 - x_i^2 \mid X(t) = x] \\
&= \sum_{i=1}^{2} \left( x_i E[A_i(t) - D_i(t) \mid X(t) = x] + \frac{1}{2} E[(A_i(t) - D_i(t))^2 \mid X(t) = x] \right) \qquad (6.21) \\
&\le \left( \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] \right) + 1 \\
&= -\left( x_1 (d_1 - au) + x_2 (d_2 - a\bar{u}) \right) + 1 \qquad (6.22)
\end{aligned}$$

Under the necessary condition $a < d_1 + d_2$, there are choices of $u$ so that $au < d_1$ and $a\bar{u} < d_2$, and for such $u$ the conditions of Corollary 6.9.3 are satisfied, with $f(x) = x_1(d_1 - au) + x_2(d_2 - a\bar{u})$, $g(x) = 1$, and any $\epsilon > 0$, implying that the Markov process is positive recurrent. In addition, the first moments under the equilibrium distribution satisfy:

$$(d_1 - au)\bar{X}_1 + (d_2 - a\bar{u})\bar{X}_2 \le 1. \qquad (6.23)$$

In order to deduce an upper bound on $\bar{X}_1 + \bar{X}_2$, we select $u^*$ to maximize the minimum of the two coefficients in (6.23). Intuitively, this entails selecting $u$ to minimize the absolute value of the difference between the two coefficients. We find:

$$\epsilon = \max_{0 \le u \le 1} \min\{d_1 - au,\; d_2 - a\bar{u}\} = \min\left\{ d_1,\; d_2,\; \frac{d_1 + d_2 - a}{2} \right\}$$

and the corresponding value $u^*$ of $u$ is given by

$$u^* = \begin{cases} 0 & \text{if } d_1 - d_2 < -a \\ \frac{1}{2} + \frac{d_1 - d_2}{2a} & \text{if } |d_1 - d_2| \le a \\ 1 & \text{if } d_1 - d_2 > a \end{cases}$$

For the system with $u = u^*$, (6.23) yields

$$\bar{X}_1 + \bar{X}_2 \le \frac{1}{\epsilon}.$$
(6.24)

We remark that, in fact,

$$\bar{X}_1 + \bar{X}_2 \le \frac{2}{d_1 + d_2 - a} \qquad (6.25)$$

If $|d_1 - d_2| \le a$ then the bounds (6.24) and (6.25) coincide, and otherwise, the bound (6.25) is strictly tighter. If $d_1 - d_2 < -a$ then $u^* = 0$, so that $\bar{X}_1 = 0$, and (6.23) becomes $(d_2 - a)\bar{X}_2 \le 1$, which implies (6.25). Similarly, if $d_1 - d_2 > a$, then $u^* = 1$, so that $\bar{X}_2 = 0$, and (6.23) becomes $(d_1 - a)\bar{X}_1 \le 1$, which implies (6.25). Thus, (6.25) is proved.

Example 6.9.5 (Route-to-shorter policy) Consider a variation of the previous example such that when a packet arrives, it is routed to the shorter queue. To be definite, in case of a tie, the packet is routed to queue 1. Then the evolution equation (6.20) still holds, but with the description of the arrival variables changed to the following:

• Given $X(t) = (x_1, x_2)$, $A(t) = (I_{\{x_1 \le x_2\}}, I_{\{x_1 > x_2\}})$ with probability $a$, and $A(t) = (0, 0)$ otherwise.

Let $P^{RS}$ denote the one-step transition probability matrix when the route-to-shorter policy is used. Proceeding as in (6.21) yields:

$$\begin{aligned}
P^{RS} V(x) - V(x) &\le \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] + 1 \\
&= a \left( x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \right) - d_1 x_1 - d_2 x_2 + 1
\end{aligned}$$

Note that $x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \le u x_1 + \bar{u} x_2$ for any $u \in [0, 1]$, with equality for $u = I_{\{x_1 \le x_2\}}$. Therefore, the drift bound for $V$ under the route-to-shorter policy is less than or equal to the drift bound (6.22), for $V$ for any choice of probabilistic splitting. In fact, route-to-shorter routing can be viewed as a controlled version of the independent splitting model, for which the control policy is selected to minimize the bound on the drift of $V$ in each state. It follows that the route-to-shorter process is positive recurrent as long as $a < d_1 + d_2$, and (6.23) holds for any value of $u$ such that $au < d_1$ and $a\bar{u} \le d_2$. In particular, (6.24) holds for the route-to-shorter process.
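The optimized split $u^*$, the slack $\epsilon$, and the two bounds (6.24) and (6.25) from Example 6.9.4 can be evaluated for given parameters. The following is a minimal sketch under stated assumptions: the function name `best_split` and its return convention are invented for this illustration, and the parameter values in the usage lines are arbitrary.

```python
def best_split(a, d1, d2):
    """For arrival probability a and departure probabilities d1, d2 with
    0 < a < d1 + d2, return (u_star, eps, bound_624, bound_625), where
    eps = min(d1 - a*u_star, d2 - a*(1 - u_star))."""
    if not 0 < a < d1 + d2:
        raise ValueError("requires 0 < a < d1 + d2")
    if d1 - d2 < -a:
        u_star = 0.0          # route everything to queue 2
    elif d1 - d2 > a:
        u_star = 1.0          # route everything to queue 1
    else:
        u_star = 0.5 + (d1 - d2) / (2 * a)  # equalize the two coefficients
    eps = min(d1 - a * u_star, d2 - a * (1 - u_star))
    bound_624 = 1 / eps                 # bound (6.24) on the mean total backlog
    bound_625 = 2 / (d1 + d2 - a)       # bound (6.25), valid for random routing
    return u_star, eps, bound_624, bound_625


print(best_split(0.5, 0.4, 0.4))   # balanced case: the two bounds coincide
print(best_split(0.5, 0.2, 0.9))   # unbalanced case: (6.25) is tighter
```

In the balanced case $|d_1 - d_2| \le a$ the two bounds agree, while in the unbalanced case the second bound is strictly smaller, matching the remark following (6.25). Only the first bound carries over to the route-to-shorter policy.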
We remark that the stronger bound (6.25) is not always true for the route-to-shorter policy. The problem is that even if $d_1 - d_2 < -a$, the route-to-shorter policy can still route to queue 1, and so $\bar{X}_1 \ne 0$. In fact, if $a$ and $d_2$ are fixed with $0 < a < d_2 < 1$, then $\bar{X}_1 \to \infty$ as $d_1 \to 0$ for the route-to-shorter policy. Intuitively, that is because occasionally there will be a large number of customers in the system due to statistical fluctuations, and then there will be many customers in queue 1. But if $d_1 \ll 1$, those customers will remain in queue 1 for a very long time.

Example 6.9.6 (An input queued switch with probabilistic switching) (see footnote 4) Consider a packet switch with $N$ inputs and $N$ outputs, as pictured in Figure 6.10.

Figure 6.10: A $4 \times 4$ input queued switch.

Footnote 4: Tassiulas [12] originally developed the results of Examples 6.9.6 and 6.9.7, in the context of wireless networks. The paper [8] presents similar results in the context of a packet switch.

Suppose there are $N^2$ queues ($N$ at each input) with queue $i, j$ containing packets that arrived at input $i$ and are destined for output $j$, for $i, j \in E$, where $E = \{1, \ldots, N\}$. Suppose the packets are all the same length, and adopt a discrete-time model, so that during one time slot, a transfer of packets can occur, such that at most one packet can be transferred from each input, and at most one packet can be transferred to each output. A permutation $\sigma$ of $E$ has the form $\sigma = (\sigma_1, \ldots, \sigma_N)$, where $\sigma_1, \ldots, \sigma_N$ are distinct elements of $E$. Let $\Pi$ denote the set of all $N!$ such permutations. Given $\sigma \in \Pi$, let $R(\sigma)$ be the $N \times N$ switching matrix defined by $R_{ij} = I_{\{\sigma_i = j\}}$.
Thus, $R_{ij}(\sigma) = 1$ means that under permutation $\sigma$, input $i$ is connected to output $j$, or, equivalently, a packet in queue $i, j$ is to depart, if there is any such packet. A state $x$ of the system has the form $x = (x_{ij} : i, j \in E)$, where $x_{ij}$ denotes the number of packets in queue $i, j$. The evolution of the system over a time slot $[t, t + 1)$ is described as follows:

\[ X_{ij}(t + 1) = X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t)) + L_{ij}(t), \]

where

• $A_{ij}(t)$ is the number of packets arriving at input $i$, destined for output $j$, in the slot. Assume that the variables $(A_{ij}(t) : i, j \in E, t \ge 0)$ are mutually independent, and for each $i, j$, the random variables $(A_{ij}(t) : t \ge 0)$ are independent, identically distributed, with mean $\lambda_{ij}$ and $E[A_{ij}^2] \le K_{ij}$, for some constants $\lambda_{ij}$ and $K_{ij}$. Let $\Lambda = (\lambda_{ij} : i, j \in E)$.

• $\sigma(t)$ is the switch state used during the slot.

• $L_{ij}(t) = \left(-(X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t)))\right)_+$, which takes value one if there was an unused potential departure at queue $ij$ during the slot, and is zero otherwise.

The number of packets at input $i$ at the beginning of the slot is given by the row sum $\sum_{j \in E} X_{ij}(t)$, the mean number of arrivals at input $i$ in a slot is given by the row sum $\sum_{j \in E} \lambda_{ij}$, and at most one packet at input $i$ can be served in a time slot. Similarly, the set of packets waiting for output $j$, called the virtual queue for output $j$, has size given by the column sum $\sum_{i \in E} X_{ij}(t)$. The mean number of arrivals to the virtual queue for output $j$ is $\sum_{i \in E} \lambda_{ij}$, and at most one packet in the virtual queue can be served in a time slot. These considerations lead us to impose the following restrictions on $\Lambda$:

\[ \sum_{j \in E} \lambda_{ij} < 1 \ \text{ for all } i \quad \text{and} \quad \sum_{i \in E} \lambda_{ij} < 1 \ \text{ for all } j. \tag{6.26} \]

Except for trivial cases involving deterministic arrival sequences, the conditions (6.26) are necessary for stable operation, for any choice of the switch schedule $(\sigma(t) : t \ge 0)$.

Let's first explore random, independent and identically distributed (i.i.d.) switching.
That is, given a probability distribution $u$ on $\Pi$, let $(\sigma(t) : t \ge 0)$ be independent with common probability distribution $u$. Once the distributions of the $A_{ij}$'s and $u$ are fixed, we have a discrete-time Markov process model. Given $\Lambda$ satisfying (6.26), we wish to determine a choice of $u$ so that the process with i.i.d. switch selection is positive recurrent.

Some standard background from switching theory is given in this paragraph. A line sum of a matrix $M$ is either a row sum, $\sum_j M_{ij}$, or a column sum, $\sum_i M_{ij}$. A square matrix $M$ is called doubly stochastic if it has nonnegative entries and if all of its line sums are one. Birkhoff's theorem, celebrated in the theory of switching, states that any doubly stochastic matrix $M$ is a convex combination of switching matrices. That is, such an $M$ can be represented as $M = \sum_{\sigma \in \Pi} R(\sigma) u(\sigma)$, where $u = (u(\sigma) : \sigma \in \Pi)$ is a probability distribution on $\Pi$. If $M$ is a nonnegative matrix with all line sums less than or equal to one, then if some of the entries of $M$ are increased appropriately, a doubly stochastic matrix can be obtained. That is, there exists a doubly stochastic matrix $\widetilde{M}$ so that $M_{ij} \le \widetilde{M}_{ij}$ for all $i, j$. Applying Birkhoff's theorem to $\widetilde{M}$ yields that there is a probability distribution $u$ so that $M_{ij} \le \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma)$ for all $i, j$.

Suppose $\Lambda$ satisfies the necessary conditions (6.26). That is, suppose that all the line sums of $\Lambda$ are less than one. Then with $\epsilon$ defined by

\[ \epsilon = \frac{1 - (\text{maximum line sum of } \Lambda)}{N}, \]

each line sum of $(\lambda_{ij} + \epsilon : i, j \in E)$ is less than or equal to one. Thus, by the observation at the end of the previous paragraph, there is a probability distribution $u^*$ on $\Pi$ so that $\lambda_{ij} + \epsilon \le \mu_{ij}(u^*)$, where

\[ \mu_{ij}(u) = \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma). \]

We consider the system using probability distribution $u^*$ for the switch states. That is, let $(\sigma(t) : t \ge 0)$ be independent, each with distribution $u^*$. Then for each $ij$, the random variables $R_{ij}(\sigma(t))$ are independent, Bernoulli($\mu_{ij}(u^*)$) random variables.
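The quantities in this construction are easy to compute for a small switch. The following sketch uses hypothetical helper names, 0-based indices, and a made-up $2 \times 2$ rate matrix; the distribution `U_STAR` is picked by hand for this one matrix, not produced by a general Birkhoff decomposition.

```python
from fractions import Fraction
from itertools import permutations


def switching_matrix(sigma):
    """R(sigma): entry (i, j) is 1 when sigma[i] == j (0-based indices)."""
    n = len(sigma)
    return [[1 if sigma[i] == j else 0 for j in range(n)] for i in range(n)]


def line_sums(m):
    """All row sums followed by all column sums of a square matrix."""
    n = len(m)
    return [sum(row) for row in m] + [sum(m[i][j] for i in range(n)) for j in range(n)]


def slack(lam):
    """epsilon = (1 - maximum line sum of Lambda) / N."""
    return (1 - max(line_sums(lam))) / len(lam)


def service_rates(u, n):
    """mu(u) = sum over sigma of R(sigma) * u(sigma), with u a dict {sigma: prob}."""
    mu = [[Fraction(0)] * n for _ in range(n)]
    for sigma, prob in u.items():
        r = switching_matrix(sigma)
        for i in range(n):
            for j in range(n):
                mu[i][j] += r[i][j] * prob
    return mu


def uniform_on_permutations(n):
    """Uniform distribution on all n! permutations of {0, ..., n-1}."""
    perms = list(permutations(range(n)))
    return {sigma: Fraction(1, len(perms)) for sigma in perms}


# Hypothetical 2 x 2 arrival rate matrix; every line sum equals 3/4.
LAM = [[Fraction(1, 4), Fraction(1, 2)],
       [Fraction(1, 2), Fraction(1, 4)]]
EPS = slack(LAM)
# Hand-picked distribution on the two permutations of {0, 1}.
U_STAR = {(0, 1): Fraction(3, 8), (1, 0): Fraction(5, 8)}
MU = service_rates(U_STAR, 2)
```

Exact rational arithmetic is used so that the comparison $\lambda_{ij} + \epsilon \le \mu_{ij}(u^*)$ can be tested without rounding concerns. For this particular matrix the inequality holds with equality in every entry.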
Consider the quadratic Lyapunov function $V$ given by $V(x) = \frac{1}{2} \sum_{i,j} x_{ij}^2$. As in (6.21),

\[ PV(x) - V(x) \le \sum_{i,j} x_{ij} E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] + \frac{1}{2} \sum_{i,j} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x]. \]

Now

\[ E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] = E[A_{ij}(t) - R_{ij}(\sigma(t))] = \lambda_{ij} - \mu_{ij}(u^*) \le -\epsilon \]

and

\[ \frac{1}{2} \sum_{i,j} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x] \le \frac{1}{2} \sum_{i,j} E[(A_{ij}(t))^2 + (R_{ij}(\sigma(t)))^2] \le K, \]

where $K = \frac{1}{2}\left(N + \sum_{i,j} K_{ij}\right)$. Thus,

\[ PV(x) - V(x) \le -\epsilon \left( \sum_{ij} x_{ij} \right) + K. \tag{6.27} \]

Therefore, by Corollary 6.9.3, the process is positive recurrent, and

\[ \sum_{ij} \overline{X}_{ij} \le \frac{K}{\epsilon}. \tag{6.28} \]

That is, the necessary condition (6.26) is also sufficient for positive recurrence and finite mean queue length in equilibrium, under i.i.d. random switching, for an appropriate probability distribution $u^*$ on the set of permutations.

Example 6.9.7 (An input queued switch with maximum weight switching) The random switching policy used in Example 6.9.6 depends on the arrival rate matrix $\Lambda$, which may be unknown a priori. Also, the policy allocates potential departures to a given queue $ij$, whether or not the queue is empty, even if other queues could be served instead. This suggests using a dynamic switching policy, such as the maximum weight switching policy, defined by $\sigma(t) = \sigma^{MW}(X(t))$, where for a state $x$,

\[ \sigma^{MW}(x) = \arg\max_{\sigma \in \Pi} \sum_{ij} x_{ij} R_{ij}(\sigma). \tag{6.29} \]

The use of "arg max" here means that $\sigma^{MW}(x)$ is selected to be a value of $\sigma$ that maximizes the sum on the right hand side of (6.29), which is the weight of permutation $\sigma$ with edge weights $x_{ij}$. In order to obtain a particular Markov model, we assume that the set of permutations $\Pi$ is numbered from 1 to $N!$ in some fashion, and in case there is a tie between two or more permutations for having the maximum weight, the lowest numbered permutation is used. Let $P^{MW}$ denote the one-step transition probability matrix when the maximum weight policy is used.
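For small $N$, the selection rule (6.29) can be sketched by brute force over all permutations. The function name below is hypothetical, indices are 0-based, and the numbering of $\Pi$ is assumed to be the lexicographic order in which `itertools.permutations` enumerates tuples; the text leaves the numbering unspecified.

```python
from itertools import permutations


def max_weight_permutation(x):
    """Return (sigma, weight) maximizing sum_i x[i][sigma[i]].

    Ties are broken in favor of the permutation enumerated first,
    which plays the role of the lowest numbered permutation
    (assumption: lexicographic numbering).
    """
    n = len(x)
    best_sigma, best_weight = None, None
    for sigma in permutations(range(n)):
        weight = sum(x[i][sigma[i]] for i in range(n))
        # Strict inequality keeps the earliest maximizer on ties.
        if best_weight is None or weight > best_weight:
            best_sigma, best_weight = sigma, weight
    return best_sigma, best_weight
```

Enumerating all $N!$ permutations is only practical for very small switches. The maximization is a maximum weight matching (assignment) problem, for which polynomial-time algorithms exist, so the brute-force loop is an illustration rather than a suggested implementation.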
Letting $V$ and $K$ be as in Example 6.9.6, we find under the maximum weight policy that

\[ P^{MW}V(x) - V(x) \le \sum_{ij} x_{ij} \left( \lambda_{ij} - R_{ij}(\sigma^{MW}(x)) \right) + K. \]

The maximum of a function is greater than or equal to the average of the function, so that for any probability distribution $u$ on $\Pi$,

\begin{align*}
\sum_{ij} x_{ij} R_{ij}(\sigma^{MW}(x)) &\ge \sum_{\sigma} u(\sigma) \sum_{ij} x_{ij} R_{ij}(\sigma) \tag{6.30} \\
&= \sum_{ij} x_{ij} \mu_{ij}(u),
\end{align*}

with equality in (6.30) if and only if $u$ is concentrated on the set of maximum weight permutations. In particular, the choice $u = u^*$ shows that

\[ \sum_{ij} x_{ij} R_{ij}(\sigma^{MW}(x)) \ge \sum_{ij} x_{ij} \mu_{ij}(u^*) \ge \sum_{ij} x_{ij} (\lambda_{ij} + \epsilon). \]

Therefore, if $P$ is replaced by $P^{MW}$, (6.27) still holds. Therefore, by Corollary 6.9.3, the process is positive recurrent, and the same moment bound, (6.28), holds, as for the randomized switching strategy of Example 6.9.6. On one hand, implementing the maximum weight algorithm does not require knowledge of the arrival rates, but on the other hand, it requires that queue length information be shared, and that a maximization problem be solved for each time slot. Much recent work has gone towards reduced complexity dynamic switching algorithms.

6.9.2 Stability criteria for continuous time processes

Here is a continuous time version of the Foster-Lyapunov stability criteria and the moment bounds. Suppose $X$ is a time-homogeneous, irreducible, continuous-time Markov process with generator matrix $Q$. The drift vector of $V(X(t))$ is the vector $QV$. This definition is motivated by the fact that the mean drift of $V(X)$ for an interval of duration $h$ is given by

\begin{align*}
d_h(i) &= \frac{E[V(X(t + h)) \mid X(t) = i] - V(i)}{h} \\
&= \sum_{j \in \mathcal{S}} \left( \frac{p_{ij}(h) - \delta_{ij}}{h} \right) V(j) \\
&= \sum_{j \in \mathcal{S}} \left( q_{ij} + \frac{o(h)}{h} \right) V(j), \tag{6.31}
\end{align*}

so that if the limit as $h \to 0$ can be taken inside the summation in (6.31), then $d_h(i) \to QV(i)$ as $h \to 0$. The following useful expression for $QV$ follows from the fact that the row sums of $Q$ are zero:

\[ QV(i) = \sum_{j : j \ne i} q_{ij} (V(j) - V(i)). \]
(6.32)

Formula (6.32) is quite similar to the formula (6.17) for the drift vector for a discrete-time process.

Proposition 6.9.8 (Foster-Lyapunov stability criterion–continuous time) Suppose $V : \mathcal{S} \to \mathbb{R}_+$ and $C$ is a finite subset of $\mathcal{S}$.
(a) If $QV \le 0$ on $\mathcal{S} - C$, and $\{i : V(i) \le K\}$ is finite for all $K$, then $X$ is recurrent.
(b) Suppose for some $b > 0$ and $\epsilon > 0$ that

\[ QV(i) \le -\epsilon + b I_C(i) \quad \text{for all } i \in \mathcal{S}. \tag{6.33} \]

Suppose further that $\{i : V(i) \le K\}$ is finite for all $K$, or that $X$ is nonexplosive. Then $X$ is positive recurrent.

Figure 6.11: A system of three queues with two servers.

Example 6.9.9 Suppose $X$ has state space $\mathcal{S} = \mathbb{Z}_+$, with $q_{i0} = \mu$ for all $i \ge 1$, $q_{i,i+1} = \lambda_i$ for all $i \ge 0$, and all other off-diagonal entries of the rate matrix $Q$ equal to zero, where $\mu > 0$ and $\lambda_i > 0$ such that $\sum_{i \ge 0} \frac{1}{\lambda_i} < +\infty$. Let $C = \{0\}$, $V(0) = 0$, and $V(i) = 1$ for $i \ge 1$. Then $QV = -\mu + (\lambda_0 + \mu) I_C$, so that (6.33) is satisfied with $\epsilon = \mu$ and $b = \lambda_0 + \mu$. However, $X$ is not positive recurrent. In fact, $X$ is explosive. To see this, note that $p^J_{i,i+1} = \frac{\lambda_i}{\mu + \lambda_i} \ge \exp\left(-\frac{\mu}{\lambda_i}\right)$. Let $\delta$ be the probability that, starting from state 0, the jump process does not return to zero. Then

\[ \delta = \prod_{i=0}^{\infty} p^J_{i,i+1} \ge \exp\left( -\mu \sum_{i=0}^{\infty} \frac{1}{\lambda_i} \right) > 0. \]

Thus, $X^J$ is transient. After the last visit to state zero, all the jumps of $X^J$ are up one. The corresponding mean holding times of $X$ are $\frac{1}{\lambda_i + \mu}$, which have a finite sum, so that the process $X$ is explosive. This example illustrates the need for the assumption just after (6.33) in Proposition 6.9.8.

As for the case of discrete time, the drift conditions imply moment bounds.

Proposition 6.9.10 (Moment bound–continuous time) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$, and suppose $QV(i) \le -f(i) + g(i)$ for all $i \in \mathcal{S}$. In addition, suppose $X$ is positive recurrent, so that the means, $\overline{f} = \pi f$ and $\overline{g} = \pi g$, are well-defined. Then $\overline{f} \le \overline{g}$.
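As a sanity check of the moment bound, one can take a case where the equilibrium distribution is known in closed form. The sketch below is not from the notes: it assumes an M/M/1 queue with arrival rate `lam` and service rate `mu` (with `lam < mu`), the Lyapunov function $V(i) = i^2/2$, $f(i) = (\mu - \lambda) i$, and the constant $g = (\lambda + \mu)/2$. All function names are hypothetical.

```python
def generator_drift(i, lam, mu, V):
    """QV(i) from formula (6.32) for a birth-death chain with constant
    birth rate lam and death rate mu (no deaths in state 0)."""
    drift = lam * (V(i + 1) - V(i))
    if i >= 1:
        drift += mu * (V(i - 1) - V(i))
    return drift


def quadratic(i):
    """Lyapunov function V(i) = i^2 / 2."""
    return 0.5 * i * i


def drift_inequality_holds(lam, mu, states=200, tol=1e-9):
    """Check QV(i) <= -f(i) + g on states 0, ..., states - 1,
    with f(i) = (mu - lam) * i and g = (lam + mu) / 2."""
    g = (lam + mu) / 2.0
    return all(
        generator_drift(i, lam, mu, quadratic) <= -(mu - lam) * i + g + tol
        for i in range(states)
    )


def mean_queue_length(lam, mu, terms=2000):
    """Mean of the geometric equilibrium distribution pi_i = (1 - rho) rho^i,
    computed by a truncated sum."""
    rho = lam / mu
    return sum(i * (1.0 - rho) * rho ** i for i in range(terms))


def moment_bound(lam, mu):
    """Bound on the mean number in system implied by f-bar <= g-bar."""
    return (lam + mu) / (2.0 * (mu - lam))
```

With these choices the proposition gives $(\mu - \lambda)\overline{N} \le (\lambda + \mu)/2$, where $\overline{N}$ is the mean number in the system. The exact mean for this model is $\lambda/(\mu - \lambda)$, which is indeed no larger than the bound because $\lambda \le (\lambda + \mu)/2$.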
Corollary 6.9.11 (Combined Foster-Lyapunov stability criterion and moment bound–continuous time) Suppose $V$, $f$, and $g$ are nonnegative functions on $\mathcal{S}$ such that $QV(i) \le -f(i) + g(i)$ for all $i \in \mathcal{S}$, and, for some $\epsilon > 0$, the set $C$ defined by $C = \{i : f(i) < g(i) + \epsilon\}$ is finite. Suppose also that $\{i : V(i) \le K\}$ is finite for all $K$. Then $X$ is positive recurrent and $\overline{f} \le \overline{g}$.

Example 6.9.12 (Random server allocation with two servers) Consider the system shown in Figure 6.11. Suppose that each queue $i$ is fed by a Poisson arrival process with rate $\lambda_i$, and suppose there are two potential departure processes, $D_1$ and $D_2$, which are Poisson processes with rates $m_1$ and $m_2$, respectively. The five Poisson processes are assumed to be independent. No matter how the potential departures are allocated to the permitted queues, the following conditions are necessary for stability:

\[ \lambda_1 < m_1, \quad \lambda_3 < m_2, \quad \text{and} \quad \lambda_1 + \lambda_2 + \lambda_3 < m_1 + m_2. \tag{6.34} \]

That is because server 1 is the only one that can serve queue 1, server 2 is the only one that can serve queue 3, and the sum of the potential service rates must exceed the sum of the potential arrival rates for stability. A vector $x = (x_1, x_2, x_3) \in \mathbb{Z}_+^3$ corresponds to $x_i$ packets in queue $i$ for each $i$. Let us consider random selection, so that when $D_i$ has a jump, the queue served is chosen at random, with the probabilities determined by $u = (u_1, u_2)$. As indicated in Figure 6.11, a potential service by server 1 is given to queue 1 with probability $u_1$, and to queue 2 with probability $\bar{u}_1$. Similarly, a potential service by server 2 is given to queue 2 with probability $u_2$, and to queue 3 with probability $\bar{u}_2$. The rates of potential service at the three stations are given by

\begin{align*}
\mu_1(u) &= u_1 m_1, \\
\mu_2(u) &= \bar{u}_1 m_1 + u_2 m_2, \\
\mu_3(u) &= \bar{u}_2 m_2.
\end{align*}

Let $V(x) = \frac{1}{2}(x_1^2 + x_2^2 + x_3^2)$.
Using (6.32), we find that the drift vector $QV$ is given by

\[ QV(x) = \frac{1}{2} \left( \sum_{i=1}^{3} \left( (x_i + 1)^2 - x_i^2 \right) \lambda_i \right) + \frac{1}{2} \left( \sum_{i=1}^{3} \left( (x_i - 1)_+^2 - x_i^2 \right) \mu_i(u) \right). \]

Now $(x_i - 1)_+^2 \le (x_i - 1)^2$, so that

\[ QV(x) \le \left( \sum_{i=1}^{3} x_i (\lambda_i - \mu_i(u)) \right) + \frac{\gamma}{2}, \tag{6.35} \]

where $\gamma$ is the total rate of events, given by $\gamma = \lambda_1 + \lambda_2 + \lambda_3 + \mu_1(u) + \mu_2(u) + \mu_3(u)$, or equivalently, $\gamma = \lambda_1 + \lambda_2 + \lambda_3 + m_1 + m_2$. Suppose that the necessary condition (6.34) holds. Then there exists some $\epsilon > 0$ and choice of $u$ so that

\[ \lambda_i + \epsilon \le \mu_i(u) \quad \text{for } 1 \le i \le 3, \]

and the largest such choice of $\epsilon$ is

\[ \epsilon = \min\left\{ m_1 - \lambda_1, \; m_2 - \lambda_3, \; \frac{m_1 + m_2 - \lambda_1 - \lambda_2 - \lambda_3}{3} \right\}. \]

(See Problem 6.21.) So $QV(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$, so Corollary 6.9.11 implies that $X$ is positive recurrent and $\overline{X}_1 + \overline{X}_2 + \overline{X}_3 \le \frac{\gamma}{2\epsilon}$.

Example 6.9.13 (Longer first server allocation with two servers) This is a continuation of Example 6.9.12, concerned with the system shown in Figure 6.11. Examine the right hand side of (6.35). Rather than taking a fixed value of $u$, suppose that the choice of $u$ could be specified as a function of the state $x$. The maximum of a function is greater than or equal to the average of the function, so that for any probability distribution $u$,

\begin{align*}
\sum_{i=1}^{3} x_i \mu_i(u) &\le \max_{u'} \sum_{i} x_i \mu_i(u') \tag{6.36} \\
&= \max_{u'} \left\{ m_1 (x_1 u'_1 + x_2 \bar{u}'_1) + m_2 (x_2 u'_2 + x_3 \bar{u}'_2) \right\} \\
&= m_1 (x_1 \vee x_2) + m_2 (x_2 \vee x_3),
\end{align*}

with equality in (6.36) for a given state $x$ if and only if a longer first policy is used: each service opportunity is allocated to the longer queue connected to the server. Let $Q^{LF}$ denote the generator matrix when the longer first policy is used. Then (6.35) continues to hold for any fixed $u$, when $Q$ is replaced by $Q^{LF}$. Therefore if the necessary condition (6.34) holds, $\epsilon$ can be taken as in Example 6.9.12, and $Q^{LF}V(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$. So Corollary 6.9.11 implies that $X$ is positive recurrent under the longer first policy, and $\overline{X}_1 + \overline{X}_2 + \overline{X}_3 \le \frac{\gamma}{2\epsilon}$.
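The quantities in Examples 6.9.12 and 6.9.13 are simple to evaluate for given rates. The sketch below uses hypothetical function names and made-up rate values in its usage; it computes the slack $\epsilon$, the resulting bound $\gamma/(2\epsilon)$, and both sides of (6.36).

```python
def largest_slack(lam, m):
    """epsilon = min{m1 - l1, m2 - l3, (m1 + m2 - l1 - l2 - l3) / 3}."""
    l1, l2, l3 = lam
    m1, m2 = m
    return min(m1 - l1, m2 - l3, (m1 + m2 - l1 - l2 - l3) / 3.0)


def mean_total_bound(lam, m):
    """Bound gamma / (2 * epsilon) on the mean total number in the system,
    where gamma = l1 + l2 + l3 + m1 + m2. Requires condition (6.34)."""
    gamma = sum(lam) + sum(m)
    return gamma / (2.0 * largest_slack(lam, m))


def service_rates(u, m):
    """(mu_1(u), mu_2(u), mu_3(u)) for splitting probabilities u = (u1, u2)."""
    u1, u2 = u
    m1, m2 = m
    return (u1 * m1, (1.0 - u1) * m1 + u2 * m2, (1.0 - u2) * m2)


def random_allocation_weight(x, u, m):
    """Left side of (6.36): sum_i x_i * mu_i(u)."""
    return sum(xi * mui for xi, mui in zip(x, service_rates(u, m)))


def longer_first_weight(x, m):
    """Right side of (6.36): m1 * max(x1, x2) + m2 * max(x2, x3)."""
    x1, x2, x3 = x
    m1, m2 = m
    return m1 * max(x1, x2) + m2 * max(x2, x3)
```

For instance, with the illustrative values `lam = (1, 1, 1)` and `m = (2, 2)`, condition (6.34) holds, the binding term in the minimum is the third one, and `u = (2/3, 1/3)` gives all three service rates equal to $4/3$.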
(Note: We see that

\[ Q^{LF}V(x) \le \left( \sum_{i=1}^{3} x_i \lambda_i \right) - m_1 (x_1 \vee x_2) - m_2 (x_2 \vee x_3) + \frac{\gamma}{2}, \]

but for obtaining a bound on $\overline{X}_1 + \overline{X}_2 + \overline{X}_3$ it was simpler to compare to the case of random service allocation.)

6.10 Problems

6.1 Mean hitting time for a simple Markov process
Let $(X(n) : n \ge 0)$ denote a discrete-time, time-homogeneous Markov chain with state space $\{0, 1, 2, 3\}$ and one-step transition probability matrix

\[ P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 - a & 0 & a & 0 \\ 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \end{pmatrix} \]

for some constant $a$ with $0 \le a \le 1$.
(a) Sketch the transition probability diagram for $X$ and give the equilibrium probability vector. If the equilibrium vector is not unique, describe all the equilibrium probability vectors.
(b) Compute $E[\min\{n \ge 1 : X(n) = 3\} \mid X(0) = 0]$.

6.2 A two station pipeline in continuous time
This is a continuous-time version of Example 4.9.1. Consider a pipeline consisting of two single-buffer stages in series. Model the system as a continuous-time Markov process. Suppose new packets are offered to the first stage according to a rate $\lambda$ Poisson process. A new packet is accepted at stage one if the buffer in stage one is empty at the time of arrival. Otherwise the new packet is lost. If at a fixed time $t$ there is a packet in stage one and no packet in stage two, then the packet is transferred during $[t, t + h)$ to stage two with probability $h\mu_1 + o(h)$. Similarly, if at time $t$ the second stage has a packet, then the packet leaves the system during $[t, t + h)$ with probability $h\mu_2 + o(h)$, independently of the state of stage one. Finally, the probability of two or more arrival, transfer, or departure events during $[t, t + h)$ is $o(h)$.
(a) What is an appropriate state-space for this model?
(b) Sketch a transition rate diagram.
(c) Write down the $Q$ matrix.
(d) Derive the throughput, assuming that $\lambda = \mu_1 = \mu_2 = 1$.
(e) Still assuming $\lambda = \mu_1 = \mu_2 = 1$, suppose the system starts with one packet in each stage.
What is the expected time until both buffers are empty?

6.3 Equilibrium distribution of the jump chain
Suppose that $\pi$ is the equilibrium distribution for a time-homogeneous Markov process with transition rate matrix $Q$. Suppose that $B^{-1} = \sum_i -q_{ii} \pi_i$, where the sum is over all $i$ in the state space, is finite. Show that the equilibrium distribution for the jump chain $(X^J(k) : k \ge 0)$ (defined in Section 4.10) is given by $\pi^J_i = -B q_{ii} \pi_i$. (So $\pi$ and $\pi^J$ are identical if and only if $q_{ii}$ is the same for all $i$.)

6.4 A simple Poisson process calculation
Let $(N(t) : t \ge 0)$ be a Poisson random process with rate $\lambda > 0$. Compute $P[N(s) = i \mid N(t) = k]$ where $0 < s < t$ and $i$ and $k$ are nonnegative integers. (Caution: note the order of $s$ and $t$ carefully.)

6.5 A simple question of periods
Consider a discrete-time Markov process with the nonzero one-step transition probabilities indicated by the following graph.

(Figure: transition graph on states 1 through 8.)

(a) What is the period of state 4?
(b) What is the period of state 6?

6.6 A mean hitting time problem
Let $(X(t) : t \ge 0)$ be a time-homogeneous, pure-jump Markov process with state space $\{0, 1, 2\}$ and $Q$ matrix

\[ Q = \begin{pmatrix} -4 & 2 & 2 \\ 1 & -2 & 1 \\ 2 & 0 & -2 \end{pmatrix}. \]

(a) Write down the state transition diagram and compute the equilibrium distribution.
(b) Compute $a_i = E[\min\{t \ge 0 : X(t) = 1\} \mid X(0) = i]$ for $i = 0, 1, 2$. If possible, use an approach that can be applied to larger state spaces.
(c) Derive a variation of the Kolmogorov forward differential equations for the quantities: $\alpha_i(t) = P[X(s) \ne 2 \text{ for } 0 \le s \le t \text{ and } X(t) = i \mid X(0) = 0]$ for $0 \le i \le 2$. (You need not solve the equations.)
(d) The forward Kolmogorov equations describe the evolution of an initial probability distribution going forward in time, given an initial condition. In other problems, a boundary condition is given at a final time, and a differential equation working backwards in time from a final condition is called for (called Kolmogorov backward equations).
Derive a backward differential equation for: $\beta_j(t) = P[X(s) \ne 2 \text{ for } t \le s \le t_f \mid X(t) = j]$, for $0 \le j \le 2$ and $t \le t_f$ for some fixed time $t_f$. (Hint: Express $\beta_i(t - h)$ in terms of the $\beta_j(t)$'s for $t \le t_f$, and let $h \to 0$. You need not solve the equations.)

6.7 A birth-death process with periodic rates
Consider a single server queueing system in which the number in the system is modeled as a continuous time birth-death process with the transition rate diagram shown, where $\lambda_a$, $\lambda_b$, $\mu_a$, and $\mu_b$ are strictly positive constants.

(Figure: transition rate diagram on states 0, 1, 2, 3, 4, ..., with rates $\lambda_a$, $\lambda_b$, $\mu_a$, $\mu_b$.)

(a) Under what additional assumptions on these four parameters is the process positive recurrent?
(b) Assuming the system is positive recurrent, under what conditions on $\lambda_a$, $\lambda_b$, $\mu_a$, and $\mu_b$ is it true that the distribution of the number in the system at the time of a typical arrival is the same as the equilibrium distribution of the number in the system?

6.8 Markov model for a link with resets
Suppose that a regulated communication link resets at a sequence of times forming a Poisson process with rate $\mu$. Packets are offered to the link according to a Poisson process with rate $\lambda$. Suppose the link shuts down after three packets pass in the absence of resets. Once the link is shut down, additional offered packets are dropped, until the link is reset again, at which time the process begins anew.

(Figure: the link, with rates $\lambda$ and $\mu$.)

(a) Sketch a transition rate diagram for a finite state Markov process describing the system state.
(b) Express the dropping probability (same as the long term fraction of packets dropped) in terms of $\lambda$ and $\mu$.

6.9 An unusual birth-death process
Consider the birth-death process $X$ with arrival rates $\lambda_k = (p/(1 - p))^k / a_k$ and death rates $\mu_k = (p/(1 - p))^{k-1} / a_k$, where $0.5 < p < 1$, and $a = (a_0, a_1, \ldots)$ is a probability distribution on the nonnegative integers with $a_k > 0$ for all $k$.
(a) Classify the states for the process $X$ as transient, null recurrent or positive recurrent.
(b) Check that $aQ = 0$. Is $a$ an equilibrium distribution for $X$? Explain.
(c) Find the one-step transition probabilities for the jump-chain, $X^J$.
(d) Classify the states for the process $X^J$ as transient, null recurrent or positive recurrent.

6.10 A queue with decreasing service rate
Consider a queueing system in which the arrival process is a Poisson process with rate $\lambda$. Suppose the instantaneous completion rate is $\mu$ when there are $K$ or fewer customers in the system, and $\mu/2$ when there are $K + 1$ or more customers in the system. The number in the system is modeled as a birth-death Markov process.
(a) Sketch the transition rate diagram.
(b) Under what condition on $\lambda$ and $\mu$ are all states positive recurrent? Under this condition, give the equilibrium distribution.
(c) Suppose that $\lambda = (2/3)\mu$. Describe in words the typical behavior of the system, given that it is initially empty.

6.11 Limit of a discrete time queueing system
Model a queue by a discrete-time Markov chain by recording the queue state after intervals of $q$ seconds each. Assume the queue evolves during one of the atomic intervals as follows: There is an arrival during the interval with probability $\alpha q$, and no arrival otherwise. If there is a customer in the queue at the beginning of the interval then a single departure will occur during the interval with probability $\beta q$. Otherwise no departure occurs. Suppose that it is impossible to have an arrival and a departure in a single atomic interval.
(a) Find $a_k = P[\text{an interarrival time is } kq]$ and $b_k = P[\text{a service time is } kq]$.
(b) Find the equilibrium distribution, $p = (p_k : k \ge 0)$, of the number of customers in the system at the end of an atomic interval. What happens as $q \to 0$?

6.12 An M/M/1 queue with impatient customers
Consider an M/M/1 queue with parameters $\lambda$ and $\mu$ with the following modification. Each customer in the queue will defect (i.e.
depart without service) with probability $\alpha h + o(h)$ in an interval of length $h$, independently of the other customers in the queue. Once a customer makes it to the server it no longer has a chance to defect and simply waits until its service is completed and then departs from the system. Let $N(t)$ denote the number of customers in the system (queue plus server) at time $t$.
(a) Give the transition rate diagram and generator matrix $Q$ for the Markov chain $N = (N(t) : t \ge 0)$.
(b) Under what conditions are all states positive recurrent? Under this condition, find the equilibrium distribution for $N$. (You need not explicitly sum the series.)
(c) Suppose that $\alpha = \mu$. Find an explicit expression for $p_D$, the probability that a typical arriving customer defects instead of being served. Does your answer make sense as $\lambda/\mu$ converges to zero or to infinity?

6.13 Statistical multiplexing
Consider the following scenario regarding a one-way link in a store-and-forward packet communication network. Suppose that the link supports eight connections, each generating traffic at 5 kilobits per second (kbps). The data for each connection is assumed to be in packets exponentially distributed in length with mean packet size 1 kilobit. The packet lengths are assumed mutually independent and the packets for each stream arrive according to a Poisson process. Packets are queued at the beginning of the link if necessary, and queue space is unlimited. Compute the mean delay (queueing plus transmission time; neglect propagation delay) for each of the following three scenarios. Compare your answers.
(a) (Full multiplexing) The link transmit speed is 50 kbps.
(b) The link is replaced by two 25 kbps links, and each of the two links carries four sessions. (Of course the delay would be larger if the sessions were not evenly divided.)
(c) (Multiplexing over two links) The link is replaced by two 25 kbps links.
Each packet is transmitted on one link or the other, and neither link is idle whenever a packet from any session is waiting.

6.14 A queue with blocking
(M/M/1/5 system) Consider an M/M/1 queue with service rate $\mu$, arrival rate $\lambda$, and the modification that at any time, at most five customers can be in the system (including the one in service, if any). If a customer arrives and the system is full (i.e. already has five customers in it) then the customer is dropped, and is said to be blocked. Let $N(t)$ denote the number of customers in the system at time $t$. Then $(N(t) : t \ge 0)$ is a Markov chain.
(a) Indicate the transition rate diagram of the chain and find the equilibrium probability distribution.
(b) What is the probability, $p_B$, that a typical customer is blocked?
(c) What is the mean waiting time in queue, $W$, of a typical customer that is not blocked?
(d) Give a simple method to numerically calculate, or give a simple expression for, the mean length of a busy period of the system. (A busy period begins with the arrival of a customer to an empty system and ends when the system is again empty.)

6.15 Three queues and an autonomously traveling server
Consider three stations that are served by a single rotating server, as pictured.

(Figure: stations 1, 2, and 3 with arrival rates $\lambda_1$, $\lambda_2$, $\lambda_3$, and a rotating server with transition rates $\alpha$, $\beta$, $\gamma$ shown as dashed lines.)

Customers arrive to station $i$ according to a Poisson process of rate $\lambda_i$ for $1 \le i \le 3$, and the total service requirement of each customer is exponentially distributed, with mean one. The rotation of the server is modelled by a three state Markov process with the transition rates $\alpha$, $\beta$, and $\gamma$ as indicated by the dashed lines. When at a station, the server works at unit rate, or is idle if the station is empty. If the service to a customer is interrupted because the server moves to the next station, the service is resumed when the server returns.
(a) Under what condition is the system stable? Briefly justify your answer.
(b) Identify a method for computing the mean customer waiting time at station one.

6.16 On two distributions seen by customers
Consider a queueing system in which the number in the system only changes in steps of plus one or minus one. Let $D(k, t)$ denote the number of customers that depart in the interval $[0, t]$ that leave behind exactly $k$ customers, and let $R(k, t)$ denote the number of customers that arrive in the interval $[0, t]$ to find exactly $k$ customers already in the system.
(a) Show that $|D(k, t) - R(k, t)| \le 1$ for all $k$ and $t$.
(b) Let $\alpha_t$ (respectively $\delta_t$) denote the number of arrivals (departures) up to time $t$. Suppose that $\alpha_t \to \infty$ and $\alpha_t / \delta_t \to 1$ as $t \to \infty$. Show that if the following two limits exist for a given value $k$, then they are equal: $r_k = \lim_{t \to \infty} R(k, t)/\alpha_t$ and $d_k = \lim_{t \to \infty} D(k, t)/\delta_t$.

6.17 Recurrence of mean zero random walks
Suppose $B_1, B_2, \ldots$ is a sequence of independent, mean zero, integer valued random variables, which are bounded, i.e. $P[|B_i| \le M] = 1$ for some $M$.
(a) Let $X_0 = 0$ and $X_n = B_1 + \cdots + B_n$ for $n \ge 1$. Show that $X$ is recurrent.
(b) Suppose $Y_0 = 0$ and $Y_{n+1} = Y_n + B_n + L_n$, where $L_n = (-(Y_n + B_n))_+$. The process $Y$ is a reflected version of $X$. Show that $Y$ is recurrent.

6.18 Positive recurrence of reflected random walk with negative drift
Suppose $B_1, B_2, \ldots$ is a sequence of independent, integer valued random variables, each with mean $\overline{B} < 0$ and second moment $\overline{B^2} < +\infty$. Suppose $X_0 = 0$ and $X_{n+1} = X_n + B_n + L_n$, where $L_n = (-(X_n + B_n))_+$. Show that $X$ is positive recurrent, and give an upper bound on the mean under the equilibrium distribution, $\overline{X}$. (Note, it is not assumed that the $B$'s are bounded.)

6.19 Routing with two arrival streams
(a) Generalize Example 6.9.4 to the scenario shown,

(Figure: routing of two arrival streams, with parameters $a_1$ and $a_2$ and splitting variables $u_1$ and $u_2$, to queues 1, 2, and 3, with departure parameters $d_1$, $d_2$, and $d_3$.)

where $a_i, d_j \in (0, 1)$ for $1 \le i \le 2$ and $1 \le j \le 3$.
In particular, determine conditions on $a_1$ and $a_2$ that ensure there is a choice of $u = (u_1, u_2)$ which makes the system positive recurrent. Under those conditions, find an upper bound on $\overline{X}_1 + \overline{X}_2 + \overline{X}_3$, and select $u$ to minimize the bound.
(b) Generalize Example 6.9.5 to the scenario shown. In particular, can you find a version of route-to-shorter routing so that the bound found in part (a) still holds?

6.20 An inadequacy of a linear potential function
Consider the system of Example 6.9.5 (a discrete-time model, using the route to shorter policy, with ties broken in favor of queue 1, so $u = I_{\{x_1 \le x_2\}}$):

(Figure: an arrival stream with parameter $a$, split with probabilities $u$ and $\bar{u}$ to queues 1 and 2, with departure parameters $d_1$ and $d_2$.)

Assume $a = 0.7$ and $d_1 = d_2 = 0.4$. The system is positive recurrent. Explain why the function $V(x) = x_1 + x_2$ does not satisfy the Foster-Lyapunov stability criteria for positive recurrence, for any choice of the constant $b$ and the finite set $C$.

6.21 Allocation of service
Prove the claim in Example 6.9.12 about the largest value of $\epsilon$.

6.22 Opportunistic scheduling
(Based on [14]) Suppose $N$ queues are in parallel, and suppose the arrivals to a queue $i$ form an independent, identically distributed sequence, with the number of arrivals in a given slot having mean $a_i > 0$ and finite second moment $K_i$. For each $t \ge 0$, let $S(t)$ be a subset of $E = \{1, \ldots, N\}$. The random sets $(S(t) : t \ge 0)$ are assumed to be independent with common distribution $w$. The interpretation is that there is a single server, and in slot $t$, it can serve one packet from one of the queues in $S(t)$. For example, the queues might be in the base station of a wireless network with packets queued for $N$ mobile users, and $S(t)$ denotes the set of mobile users that have working channels for time slot $[t, t + 1)$. See the illustration:

(Figure: queues 1 through $N$ with arrival rates $a_1, \ldots, a_N$, served over a fading channel with state $s$.)
(a) Explain why the following condition is necessary for stability: For all $s \subset E$ with $s \ne \emptyset$,

\[ \sum_{i \in s} a_i < \sum_{B : B \cap s \ne \emptyset} w(B). \tag{6.37} \]

(b) Consider $u$ of the form $u = (u(i, s) : i \in E, s \subset E)$, with $u(i, s) \ge 0$, $u(i, s) = 0$ if $i \notin s$, and $\sum_{i \in E} u(i, s) = I_{\{s \ne \emptyset\}}$. Suppose that given $S(t) = s$, the queue that is given a potential service opportunity has probability distribution $(u(i, s) : i \in E)$. Then the probability of a potential service at queue $i$ is given by $\mu_i(u) = \sum_s u(i, s) w(s)$ for $i \in E$. Show that under the condition (6.37), for some $\epsilon > 0$, $u$ can be selected so that $a_i + \epsilon \le \mu_i(u)$ for $i \in E$. (Hint: Apply the min-cut, max-flow theorem to an appropriate graph.)
(c) Show that using the $u$ found in part (b) that the process is positive recurrent.
(d) Suggest a dynamic scheduling method which does not require knowledge of the arrival rates or the distribution $w$, which yields the same bound on the mean sum of queue lengths found in part (b).

6.23 Routing to two queues – continuous time model
Give a continuous time analog of Examples 6.9.4 and 6.9.5. In particular, suppose that the arrival process is Poisson with rate $\lambda$ and the potential departure processes are Poisson with rates $\mu_1$ and $\mu_2$.

6.24 Stability of two queues with transfers
Let $(\lambda_1, \lambda_2, \nu, \mu_1, \mu_2)$ be a vector of strictly positive parameters, and consider a system of two service stations with transfers as pictured.

(Figure: stations 1 and 2 with arrival rates $\lambda_1$ and $\lambda_2$, service rates $\mu_1$ and $\mu_2$, and transfers from station 1 to station 2 at rate $u\nu$.)

Station $i$ has Poisson arrivals at rate $\lambda_i$ and an exponential type server, with rate $\mu_i$. In addition, customers are transferred from station 1 to station 2 at rate $u\nu$, where $u$ is a constant with $u \in U = [0, 1]$. (Rather than applying dynamic programming here, we will apply the method of Foster-Lyapunov stability theory in continuous time.) The system is described by a continuous-time Markov process on $\mathbb{Z}_+^2$ with some transition rate matrix $Q$. (You don't need to write out $Q$.)
(a) Under what condition on $(\lambda_1, \lambda_2, \nu, \mu_1, \mu_2)$ is there a choice of the constant $u$ such that the Markov process describing the system is positive recurrent?
(b) Let $V$ be the quadratic Lyapunov function, $V(x_1, x_2) = \frac{x_1^2}{2} + \frac{x_2^2}{2}$. Compute the drift vector $QV$.
(c) Under the condition of part (a), and using the moment bound associated with the Foster-Lyapunov criteria, find an upper bound on the mean number in the system in equilibrium, $\overline{X}_1 + \overline{X}_2$. (The smaller the bound the better.)

6.25 Stability of a system with two queues and modulated server
Consider two queues, queue 1 and queue 2, such that in each time slot, queue $i$ receives a new packet with probability $a_i$, where $0 < a_1 < 1$ and $0 < a_2 < 1$. Suppose the server is described by a three state Markov process, as shown.

(Figure: queues 1 and 2 with arrival probabilities $a_1$ and $a_2$, and a server process with states 0, 1, and 2, where state 0 serves the longer queue.)

If the server process is in state $i$ for $i \in \{1, 2\}$ at the beginning of a slot, then a potential service is given to station $i$. If the server process is in state 0 at the beginning of a slot, then a potential service is given to the longer queue (with ties broken in favor of queue 1). Then during the slot, the server state jumps with the probabilities indicated. (Note that a packet can arrive and depart in one time slot.) For what values of $a_1$ and $a_2$ is the process stable? Briefly explain your answer (but rigorous proof is not required).

Chapter 7
Basic Calculus of Random Processes

The calculus of deterministic functions revolves around continuous functions, derivatives, and integrals. These concepts all involve the notion of limits. See the appendix for a review of continuity, differentiation and integration. In this chapter the same concepts are treated for random processes. We've seen four different senses in which a sequence of random variables can converge: almost surely (a.s.), in probability (p.), in mean square (m.s.), and in distribution (d.).
Of these senses, we will use the mean square sense of convergence the most, and make use of the correlation version of the Cauchy criterion for m.s. convergence, and the associated facts that for m.s. convergence, the means of the limits are the limits of the means, and correlations of the limits are the limits of correlations (Proposition 2.2.3 and Corollaries 2.2.4 and 2.2.5). As an application of integration of random processes, ergodicity and the Karhunen-Loève expansion are discussed. In addition, notation for complex-valued random processes is introduced.

7.1 Continuity of random processes

The topic of this section is the definition of continuity of a continuous-time random process, with a focus on continuity defined using m.s. convergence. Chapter 2 covers convergence of sequences. Limits for deterministic functions of a continuous variable can be defined in either of two equivalent ways. Specifically, a function $f$ on $\mathbb{R}$ has a limit $y$ at $t_o$, written as $\lim_{s \to t_o} f(s) = y$, if either of the two equivalent conditions is true:

(1) (Definition based on $\epsilon$ and $\delta$) Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(s) - y| \leq \epsilon$ whenever $|s - t_o| \leq \delta$.

(2) (Definition based on sequences) $f(s_n) \to y$ for any sequence $(s_n)$ such that $s_n \to t_o$.

Let's check that (1) and (2) are equivalent. Suppose (1) is true, and let $(s_n)$ be such that $s_n \to t_o$. Let $\epsilon > 0$ and then let $\delta$ be as in condition (1). Since $s_n \to t_o$, it follows that there exists $n_o$ so that $|s_n - t_o| \leq \delta$ for all $n \geq n_o$. But then $|f(s_n) - y| \leq \epsilon$ for all $n \geq n_o$ by the choice of $\delta$. Thus, $f(s_n) \to y$. That is, (1) implies (2).

For the converse direction, it suffices to prove the contrapositive: if (1) is not true then (2) is not true. Suppose (1) is not true. Then there exists an $\epsilon > 0$ so that, for any $n \geq 1$, there exists a value $s_n$ with $|s_n - t_o| \leq \frac{1}{n}$ such that $|f(s_n) - y| > \epsilon$. But then $s_n \to t_o$, and yet $f(s_n) \not\to y$, so (2) is false. That is, not (1) implies not (2).
This completes the proof that (1) and (2) are equivalent.

Similarly, and for essentially the same reasons, convergence for a continuous-time random process can be defined using either $\epsilon$ and $\delta$, or using sequences, at least for limits in the p., m.s., or d. senses. As we will see, the situation is slightly different for a.s. limits. Let $X = (X_t : t \in T)$ be a random process such that the index set $T$ is equal to either all of $\mathbb{R}$, or an interval in $\mathbb{R}$, and fix $t_o \in T$.

Definition 7.1.1 (Limits for continuous-time random processes.) The process $(X_t : t \in T)$ has limit $Y$ at $t_o$:

(i) in the m.s. sense, written $\lim_{s \to t_o} X_s = Y$ m.s., if for any $\epsilon > 0$, there exists $\delta > 0$ so that $E[(X_s - Y)^2] < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \overset{m.s.}{\to} Y$ as $n \to \infty$, whenever $s_n \to t_o$.

(ii) in probability, written $\lim_{s \to t_o} X_s = Y$ p., if given any $\epsilon > 0$, there exists $\delta > 0$ so that $P\{|X_s - Y| \geq \epsilon\} < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \overset{p.}{\to} Y$ as $n \to \infty$, whenever $s_n \to t_o$.

(iii) in distribution, written $\lim_{s \to t_o} X_s = Y$ d., if given any continuity point $c$ of $F_Y$ and any $\epsilon > 0$, there exists $\delta > 0$ so that $|F_{X,1}(c, s) - F_Y(c)| < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \overset{d.}{\to} Y$ as $n \to \infty$, whenever $s_n \to t_o$. (Recall that $F_{X,1}(c, s) = P\{X_s \leq c\}$.)

(iv) almost surely, written $\lim_{s \to t_o} X_s = Y$ a.s., if there is an event $F_{t_o}$ having probability one such that $F_{t_o} \subset \{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ (see footnote 1).

The relationship among the above four types of convergence in continuous time is the same as the relationship among the four types of convergence of sequences, illustrated in Figure 2.8. That is, the following is true:

Proposition 7.1.2 The following statements hold as $s \to t_o$ for a fixed $t_o$ in $T$: If either $X_s \overset{a.s.}{\to} Y$ or $X_s \overset{m.s.}{\to} Y$ then $X_s \overset{p.}{\to} Y$. If $X_s \overset{p.}{\to} Y$ then $X_s \overset{d.}{\to} Y$. Also, if there is a random variable $Z$ with $E[Z^2] < \infty$ and $|X_t| \leq Z$ for all $t$, and if $X_s \overset{p.}{\to} Y$, then $X_s \overset{m.s.}{\to} Y$.

Proof.
As indicated in Definition 7.1.1, the first three types of convergence are equivalent to convergence along sequences, in the corresponding senses. The fourth type of convergence, namely a.s. convergence as $s \to t_o$, implies convergence along sequences (Example 7.1.3 shows that the converse is not true). That is true because if $(s_n)$ is a sequence converging to $t_o$,
$$\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\} \subset \{\omega : \lim_{n \to \infty} X_{s_n}(\omega) = Y(\omega)\}.$$
Therefore, if the first of these sets contains an event which has probability one, the second of these sets is an event which has probability one. The proposition then follows from the same relations for convergence of sequences. In particular, a.s. convergence for continuous time implies a.s. convergence along sequences (as just shown), which implies convergence in p. along sequences, which is the same as convergence in probability. The other implications of the proposition follow directly from the same implications for sequences, and the fact that the first three definitions of convergence for continuous time have a form based on sequences.

Footnote 1: This definition is complicated by the fact that the set $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ involves uncountably many random variables, and it is not necessarily an event. There is a way to simplify the definition as follows, but it requires an extra assumption. A probability space $(\Omega, \mathcal{F}, P)$ is complete, if whenever $N$ is an event having probability zero, all subsets of $N$ are events. If $(\Omega, \mathcal{F}, P)$ is complete, the definition of $\lim_{s \to t_o} X_s = Y$ a.s. is equivalent to the requirement that $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ be an event and have probability one.

The following example shows that a.s. convergence as $s \to t_o$ is strictly stronger than a.s. convergence along sequences.

Example 7.1.3 Let $U$ be uniformly distributed on the interval $[0, 1]$. Let $X_t = 1$ if $t - U$ is a rational number, and $X_t = 0$ otherwise.
Each sample path of $X$ takes values zero and one in any finite interval, so that $X$ is not a.s. convergent at any $t_o$. However, for any fixed $t$, $P\{X_t = 0\} = 1$. Therefore, for any sequence $(s_n)$, since there are only countably many terms, $P\{X_{s_n} = 0 \text{ for all } n\} = 1$ so that $X_{s_n} \to 0$ a.s.

Definition 7.1.4 (Four types of continuity at a point for a random process) For each $t_o \in T$ fixed, the random process $X = (X_t : t \in T)$ is continuous at $t_o$ in any one of the four senses: m.s., p., a.s., or d., if $\lim_{s \to t_o} X_s = X_{t_o}$ in the corresponding sense.

The following is immediately implied by Proposition 7.1.2. It shows that for convergence of a random process at a single point, the relations illustrated in Figure 2.8 again hold.

Corollary 7.1.5 If $X$ is continuous at $t_o$ in either the a.s. or m.s. sense, then $X$ is continuous at $t_o$ in probability. If $X$ is continuous at $t_o$ in probability, then $X$ is continuous at $t_o$ in distribution. Also, if there is a random variable $Z$ with $E[Z^2] < \infty$ and $|X_t| \leq Z$ for all $t$, and if $X$ is continuous at $t_o$ in probability, then it is continuous at $t_o$ in the m.s. sense.

A deterministic function $f$ on $\mathbb{R}$ is simply called continuous if it is continuous at all points. Since we have four senses of continuity at a point for a random process, this gives four types of continuity for random processes. Before stating them formally, we describe a fifth type of continuity of random processes, which is often used in applications. Recall that for a fixed $\omega \in \Omega$, the random process $X$ gives a sample path, which is a function on $T$. Continuity of a sample path is thus defined as it is for any deterministic function. The subset of $\Omega$, $\{\omega : X_t(\omega) \text{ is a continuous function of } t\}$, or more concisely, $\{X_t \text{ is a continuous function of } t\}$, is the set of $\omega$ such that the sample path for $\omega$ is continuous. The fifth type of continuity requires that the sample paths be continuous, if a set of probability zero is ignored.
Definition 7.1.6 (Five types of continuity for a whole random process) A random process $X = (X_t : t \in T)$ is said to be

m.s. continuous if it is m.s. continuous at each $t$,
continuous in p. if it is continuous in p. at each $t$,
continuous in d. if it is continuous in d. at each $t$,
a.s. continuous at each $t$, if, as the phrase indicates, it is a.s. continuous at each $t$ (see footnote 2),
a.s. sample-path continuous, if $F \subset \{X_t \text{ is continuous in } t\}$ for some event $F$ with $P(F) = 1$.

The relationship among the five types of continuity for a whole random process is pictured in Figure 7.1, and is summarized in the following proposition.

[Figure 7.1: Relationships among five types of continuity of random processes. Nodes: a.s. sample-path continuous; a.s. continuous at each $t$; m.s.; p.; d. Annotation on the arrow from p. to m.s.: "If process is dominated by a single random variable with a finite second moment."]

Proposition 7.1.7 If a process is a.s. sample-path continuous it is a.s. continuous at each $t$. If a process is a.s. continuous at each $t$ or m.s. continuous, it is continuous in p. If a process is continuous in p. it is continuous in d. Also, if there is a random variable $Y$ with $E[Y^2] < \infty$ and $|X_t| \leq Y$ for all $t$, and if $X$ is continuous in p., then $X$ is m.s. continuous.

Proof. Suppose $X$ is a.s. sample-path continuous. Then for any $t_o \in T$,
$$\{\omega : X_t(\omega) \text{ is continuous at all } t \in T\} \subset \{\omega : X_t(\omega) \text{ is continuous at } t_o\}. \tag{7.1}$$
Since $X$ is a.s. sample-path continuous, the set on the left-hand side of (7.1) contains an event $F$ with $P(F) = 1$, and $F$ is also a subset of the set on the right-hand side of (7.1). Thus, $X$ is a.s. continuous at $t_o$. Since $t_o$ was an arbitrary element of $T$, it follows that $X$ is a.s. continuous at each $t$. The remaining implications of the proposition follow from Corollary 7.1.5.

Footnote 2: We avoid using the terminology "a.s. continuous" for the whole random process, because such terminology could too easily be confused with a.s.
sample-path continuous.

Example 7.1.8 (Shows a.s. sample-path continuity is strictly stronger than a.s. continuity at each $t$.) Let $X = (X_t : 0 \leq t \leq 1)$ be given by $X_t = I_{\{t \geq U\}}$ for $0 \leq t \leq 1$, where $U$ is uniformly distributed over $[0, 1]$. Thus, each sample path of $X$ has a single upward jump of size one, at a random time $U$ uniformly distributed over $[0, 1]$. So every sample path is discontinuous, and therefore $X$ is not a.s. sample-path continuous. For any fixed $t$ and $\omega$, if $U(\omega) \neq t$ (i.e. if the jump of $X$ is not exactly at time $t$) then $X_s(\omega) \to X_t(\omega)$ as $s \to t$. Since $P\{U \neq t\} = 1$, it follows that $X$ is a.s. continuous at each $t$. Therefore $X$ is also continuous in the p. and d. senses. Finally, since $|X_t| \leq 1$ for all $t$ and $X$ is continuous in p., it is also m.s. continuous.

The remainder of this section focuses on m.s. continuity. Recall that the definition of m.s. convergence of a sequence of random variables requires that the random variables have finite second moments, and consequently the limit also has a finite second moment. Thus, in order for a random process $X = (X_t : t \in T)$ to be continuous in the m.s. sense, it must be a second order process: $E[X_t^2] < \infty$ for all $t \in T$. Whether $X$ is m.s. continuous depends only on the correlation function $R_X$, as shown in the following proposition.

Proposition 7.1.9 Suppose $(X_t : t \in T)$ is a second order process. The following are equivalent:

(i) $R_X$ is continuous at all points of the form $(t, t)$. (This condition involves $R_X$ for points in and near the set of points of the form $(t, t)$. It is stronger than requiring $R_X(t, t)$ to be continuous in $t$; see Example 7.1.10.)
(ii) $X$ is m.s. continuous.
(iii) $R_X$ is continuous over $T \times T$.

If $X$ is m.s. continuous, then the mean function, $\mu_X(t)$, is continuous. If $X$ is wide sense stationary, the following are equivalent:

(i') $R_X(\tau)$ is continuous at $\tau = 0$.
(ii') $X$ is m.s. continuous.
(iii') $R_X(\tau)$ is continuous over all of $\mathbb{R}$.

Proof.
((i) implies (ii)) Fix $t \in T$ and suppose that $R_X$ is continuous at the point $(t, t)$. Then $R_X(s, s)$, $R_X(s, t)$, and $R_X(t, s)$ all converge to $R_X(t, t)$ as $s \to t$. Therefore,
$$\lim_{s \to t} E[(X_s - X_t)^2] = \lim_{s \to t} \left( R_X(s, s) - R_X(s, t) - R_X(t, s) + R_X(t, t) \right) = 0.$$
So $X$ is m.s. continuous at $t$. Therefore if $R_X$ is continuous at all points of the form $(t, t) \in T \times T$, then $X$ is m.s. continuous at all $t \in T$. Therefore (i) implies (ii).

((ii) implies (iii)) Suppose condition (ii) is true. Let $(s, t) \in T \times T$, and suppose $(s_n, t_n) \in T \times T$ for all $n \geq 1$ such that $\lim_{n \to \infty} (s_n, t_n) = (s, t)$. Therefore, $s_n \to s$ and $t_n \to t$ as $n \to \infty$. By condition (ii), it follows that $X_{s_n} \overset{m.s.}{\to} X_s$ and $X_{t_n} \overset{m.s.}{\to} X_t$ as $n \to \infty$. Since the limit of the correlations is the correlation of the limit for a pair of m.s. convergent sequences (Corollary 2.2.4), it follows that $R_X(s_n, t_n) \to R_X(s, t)$ as $n \to \infty$. Thus, $R_X$ is continuous at $(s, t)$, where $(s, t)$ was an arbitrary point of $T \times T$. Therefore $R_X$ is continuous over $T \times T$, proving that (ii) implies (iii).

Obviously (iii) implies (i), so the proof of the equivalence of (i)-(iii) is complete.

If $X$ is m.s. continuous, then, by definition, for any $t \in T$, $X_s \overset{m.s.}{\to} X_t$ as $s \to t$. It thus follows that $\mu_X(s) \to \mu_X(t)$, because the limit of the means is the mean of the limit, for a m.s. convergent sequence (Corollary 2.2.5). Thus, m.s. continuity of $X$ implies that the deterministic mean function, $\mu_X$, is continuous.

Finally, if $X$ is WSS, then $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$, and the three conditions (i)-(iii) become (i')-(iii'), so the equivalence of (i)-(iii) implies the equivalence of (i')-(iii').

Example 7.1.10 Let $X = (X_t : t \in \mathbb{R})$ be defined by $X_t = U$ for $t < 0$ and $X_t = V$ for $t \geq 0$, where $U$ and $V$ are independent random variables with mean zero and variance one. Let $(t_n)$ be a sequence of strictly negative numbers converging to 0. Then $X_{t_n} = U$ for all $n$ and $X_0 = V$.
Since $P\{|U - V| \geq \epsilon\} \neq 0$ for $\epsilon$ small enough, $X_{t_n}$ does not converge to $X_0$ in the p. sense. So $X$ is not continuous in probability at zero. It is thus not continuous in the m.s. or a.s. sense at zero either. The only one of the five senses in which the whole process could be continuous is continuity in distribution. The process $X$ is continuous in distribution if and only if $U$ and $V$ have the same distribution. Finally, let us check the continuity properties of the autocorrelation function. The autocorrelation function is given by $R_X(s, t) = 1$ if either $s, t < 0$ or $s, t \geq 0$, and $R_X(s, t) = 0$ otherwise. So $R_X$ is not continuous at $(0, 0)$, because $R_X(\frac{1}{n}, -\frac{1}{n}) = 0$ for all $n \geq 1$, so $R_X(\frac{1}{n}, -\frac{1}{n}) \not\to R_X(0, 0) = 1$ as $n \to \infty$. However, it is true that $R_X(t, t) = 1$ for all $t$, so that $R_X(t, t)$ is a continuous function of $t$. This illustrates the fact that continuity of the function of two variables, $R_X(s, t)$, at a particular fixed point $(t_o, t_o)$, is a stronger requirement than continuity of the function of one variable, $R_X(t, t)$, at $t = t_o$.

Example 7.1.11 Let $W = (W_t : t \geq 0)$ be a Brownian motion with parameter $\sigma^2$. Then $E[(W_t - W_s)^2] = \sigma^2 |t - s| \to 0$ as $s \to t$. Therefore $W$ is m.s. continuous. Another way to show $W$ is m.s. continuous is to observe that the autocorrelation function, $R_W(s, t) = \sigma^2 (s \wedge t)$, is continuous. Since $W$ is m.s. continuous, it is also continuous in the p. and d. senses. As we stated in defining $W$, it is a.s. sample-path continuous, and therefore a.s. continuous at each $t \geq 0$, as well.

Example 7.1.12 Let $N = (N_t : t \geq 0)$ be a Poisson process with rate $\lambda > 0$. Then for fixed $t$, $E[(N_t - N_s)^2] = \lambda |t - s| + (\lambda (t - s))^2 \to 0$ as $s \to t$. Therefore $N$ is m.s. continuous. As required, $R_N$, given by $R_N(s, t) = \lambda (s \wedge t) + \lambda^2 s t$, is continuous. Since $N$ is m.s. continuous, it is also continuous in the p. and d. senses. $N$ is also a.s. continuous at any fixed $t$, because the probability of a jump at exactly time $t$ is zero for any fixed $t$.
However, $N$ is not a.s. sample-path continuous. In fact, $P[N \text{ is continuous on } [0, a]] = e^{-\lambda a}$ and so $P[N \text{ is continuous on } \mathbb{R}_+] = 0$.

Definition 7.1.13 A random process $(X_t : t \in T)$, such that $T$ is a bounded interval (open, closed, or mixed) in $\mathbb{R}$ with endpoints $a < b$, is piecewise m.s. continuous, if there exist $n \geq 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \leq k \leq n$: $X$ is m.s. continuous over $(t_{k-1}, t_k)$ and has m.s. limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $X$ is piecewise m.s. continuous over $T$ if it is piecewise m.s. continuous over every bounded subinterval of $T$.

7.2 Mean square differentiation of random processes

Before considering the m.s. derivative of a random process, we review the definition of the derivative of a function (also, see Appendix 11.4). Let the index set $T$ be either all of $\mathbb{R}$ or an interval in $\mathbb{R}$. Suppose $f$ is a deterministic function on $T$. Recall that for a fixed $t$ in $T$, $f$ is differentiable at $t$ if $\lim_{s \to t} \frac{f(s) - f(t)}{s - t}$ exists and is finite, and if $f$ is differentiable at $t$, the value of the limit is the derivative, $f'(t)$. The whole function $f$ is called differentiable if it is differentiable at all $t$. The function $f$ is called continuously differentiable if $f$ is differentiable, and the derivative function $f'$ is continuous.

In many applications of calculus, it is important that a function $f$ be not only differentiable, but continuously differentiable. In much of the applied literature, when there is an assumption that a function is differentiable, it is understood that the function is continuously differentiable. For example, by the fundamental theorem of calculus,
$$f(b) - f(a) = \int_a^b f'(s)\,ds \tag{7.2}$$
holds if $f$ is a continuously differentiable function with derivative $f'$. Example 11.4.2 shows that (7.2) might not hold if $f$ is simply assumed to be differentiable.
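As a rough numerical illustration of the gap between differentiable and continuously differentiable, the sketch below uses the function $f(t) = t^2 \sin(1/t^2)$, $f(0) = 0$, which Example 7.2.3 attributes to Example 11.4.2. The closed form used for $f'$ away from zero is computed here by the product and chain rules and is not quoted from the notes; the helper names and the choice of evaluation points $t_k = 1/\sqrt{k\pi}$ are illustrative assumptions. At those points $|f'(t_k)| = 2\sqrt{k\pi}$, so $f'$ is unbounded near zero even though the difference quotient at zero tends to $f'(0) = 0$.

```python
import math

def f(t):
    """f(t) = t^2 sin(1/t^2) for t != 0, and f(0) = 0."""
    return t * t * math.sin(1.0 / (t * t)) if t != 0.0 else 0.0

def f_prime(t):
    """Derivative of f: 2t sin(1/t^2) - (2/t) cos(1/t^2) for t != 0, and 0 at t = 0."""
    if t == 0.0:
        return 0.0
    return 2.0 * t * math.sin(1.0 / (t * t)) - (2.0 / t) * math.cos(1.0 / (t * t))

def difference_quotient_at_zero(h):
    """(f(h) - f(0)) / h, which is bounded in magnitude by |h|."""
    return (f(h) - f(0.0)) / h

# Illustrative evaluation points t_k = 1/sqrt(k*pi), which tend to 0 as k grows.
ks = [1, 10, 100, 1000, 10000]
points = [1.0 / math.sqrt(k * math.pi) for k in ks]
derivs = [f_prime(t) for t in points]
quotients = [difference_quotient_at_zero(t) for t in points]

for k, t, d, q in zip(ks, points, derivs, quotients):
    # |f'(t_k)| grows like 2*sqrt(k*pi) while the difference quotient at 0 shrinks.
    print(f"k={k:6d}  t={t:.5f}  f'(t)={d:+.3f}  (f(t)-f(0))/t={q:+.2e}")
```

Since $f'$ is unbounded on every neighborhood of zero, it is not Riemann integrable over an interval containing zero, which is consistent with the warning that (7.2) can fail without continuity of $f'$.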
Let $X = (X_t : t \in T)$ be a second order random process such that the index set $T$ is equal to either all of $\mathbb{R}$ or an interval in $\mathbb{R}$. The following definition for m.s. derivatives is analogous to the definition of derivatives for deterministic functions.

Definition 7.2.1 For each $t$ fixed, the random process $X = (X_t : t \in T)$ is mean square (m.s.) differentiable at $t$ if the following limit exists:
$$\lim_{s \to t} \frac{X_s - X_t}{s - t} \quad \text{m.s.}$$
The limit, if it exists, is the m.s. derivative of $X$ at $t$, denoted by $X'_t$. The whole random process $X$ is said to be m.s. differentiable if it is m.s. differentiable at each $t$, and it is said to be m.s. continuously differentiable if it is m.s. differentiable and the derivative process $X'$ is m.s. continuous.

Let $\partial_i$ denote the operation of taking the partial derivative with respect to the $i$th argument. For example, if $f(x, y) = x^2 y^3$ then $\partial_2 f(x, y) = 3x^2 y^2$ and $\partial_1 \partial_2 f(x, y) = 6xy^2$. The partial derivative of a function is the same as the ordinary derivative with respect to one variable, with the other variables held fixed. We shall be applying $\partial_1$ and $\partial_2$ to an autocorrelation function $R_X = (R_X(s, t) : (s, t) \in T \times T)$, which is a function of two variables.

Proposition 7.2.2

(a) (The derivative of the mean is the mean of the derivative) If $X$ is m.s. differentiable, then the mean function $\mu_X$ is differentiable, and $\mu'_X(t) = \mu_{X'}(t)$. (i.e. the operations of (i) taking expectation, which basically involves integrating over $\omega$, and (ii) differentiation with respect to $t$, can be done in either order.)

(b) If $X$ is m.s. differentiable, the cross correlation functions are given by $R_{X'X} = \partial_1 R_X$ and $R_{XX'} = \partial_2 R_X$, and the autocorrelation function of $X'$ is given by $R_{X'} = \partial_1 \partial_2 R_X = \partial_2 \partial_1 R_X$. (In particular, the indicated partial derivatives exist.)

(c) $X$ is m.s.
differentiable at $t$ if and only if the following limit exists and is finite:
$$\lim_{s, s' \to t} \frac{R_X(s, s') - R_X(s, t) - R_X(t, s') + R_X(t, t)}{(s - t)(s' - t)}. \tag{7.3}$$
(Therefore, the whole process $X$ is m.s. differentiable if and only if the limit in (7.3) exists and is finite for all $t \in T$.)

(d) $X$ is m.s. continuously differentiable if and only if $R_X$, $\partial_2 R_X$, and $\partial_1 \partial_2 R_X$ exist and are continuous. (By symmetry, if $X$ is m.s. continuously differentiable, then also $\partial_1 R_X$ is continuous.)

(e) (Specialization of (d) for WSS case) Suppose $X$ is WSS. Then $X$ is m.s. continuously differentiable if and only if $R_X(\tau)$, $R'_X(\tau)$, and $R''_X(\tau)$ exist and are continuous functions of $\tau$. If $X$ is m.s. continuously differentiable then $X$ and $X'$ are jointly WSS, $X'$ has mean zero (i.e. $\mu_{X'} = 0$) and autocorrelation function given by $R_{X'}(\tau) = -R''_X(\tau)$, and the cross correlation functions are given by $R_{X'X}(\tau) = R'_X(\tau)$ and $R_{XX'}(\tau) = -R'_X(\tau)$.

(f) (A necessary condition for m.s. differentiability) If $X$ is WSS and m.s. differentiable, then $R'_X(0)$ exists and $R'_X(0) = 0$.

(g) If $X$ is a m.s. differentiable Gaussian process, then $X$ and its derivative process $X'$ are jointly Gaussian.

Proof. (a) Suppose $X$ is m.s. differentiable. Then for any $t$ fixed,
$$\frac{X_s - X_t}{s - t} \overset{m.s.}{\to} X'_t \quad \text{as } s \to t.$$
It thus follows that
$$\frac{\mu_X(s) - \mu_X(t)}{s - t} \to \mu_{X'}(t) \quad \text{as } s \to t, \tag{7.4}$$
because the limit of the means is the mean of the limit, for a m.s. convergent sequence (Corollary 2.2.5). But (7.4) is just the definition of the statement that the derivative of $\mu_X$ at $t$ is equal to $\mu_{X'}(t)$. That is, $\frac{d\mu_X}{dt}(t) = \mu_{X'}(t)$ for all $t$, or more concisely, $\mu'_X = \mu_{X'}$.

(b) Suppose $X$ is m.s. differentiable. Since the limit of the correlations is the correlation of the limits for m.s. convergent sequences (Corollary 2.2.4), for $t, t' \in T$,
$$R_{X'X}(t, t') = \lim_{s \to t} E\left[\left(\frac{X(s) - X(t)}{s - t}\right) X(t')\right] = \lim_{s \to t} \frac{R_X(s, t') - R_X(t, t')}{s - t} = \partial_1 R_X(t, t').$$
Thus, $R_{X'X} = \partial_1 R_X$, and in particular, the partial derivative $\partial_1 R_X$ exists. Similarly, $R_{XX'} = \partial_2 R_X$. Also, by the same reasoning,
$$R_{X'}(t, t') = \lim_{s' \to t'} E\left[X'(t) \left(\frac{X(s') - X(t')}{s' - t'}\right)\right] = \lim_{s' \to t'} \frac{R_{X'X}(t, s') - R_{X'X}(t, t')}{s' - t'} = \partial_2 R_{X'X}(t, t') = \partial_2 \partial_1 R_X(t, t'),$$
so that $R_{X'} = \partial_2 \partial_1 R_X$. Similarly, $R_{X'} = \partial_1 \partial_2 R_X$.

(c) By the correlation form of the Cauchy criterion (Proposition 2.2.3), $X$ is m.s. differentiable at $t$ if and only if the following limit exists and is finite:
$$\lim_{s, s' \to t} E\left[\left(\frac{X(s) - X(t)}{s - t}\right)\left(\frac{X(s') - X(t)}{s' - t}\right)\right]. \tag{7.5}$$
Multiplying out the terms in the numerator in (7.5) and using $E[X(s)X(s')] = R_X(s, s')$, $E[X(s)X(t)] = R_X(s, t)$, and so on, shows that (7.5) is equivalent to (7.3). So part (c) is proved.

(d) The numerator in (7.3) involves $R_X$ evaluated at the four corners of the rectangle $[t, s] \times [t, s']$, shown in Figure 7.2.

[Figure 7.2: Sampling points of $R_X$.]

Suppose $R_X$, $\partial_2 R_X$ and $\partial_1 \partial_2 R_X$ exist and are continuous functions. Then by the fundamental theorem of calculus,
$$(R_X(s, s') - R_X(s, t)) - (R_X(t, s') - R_X(t, t)) = \int_t^{s'} \partial_2 R_X(s, v)\,dv - \int_t^{s'} \partial_2 R_X(t, v)\,dv$$
$$= \int_t^{s'} \left[\partial_2 R_X(s, v) - \partial_2 R_X(t, v)\right] dv = \int_t^{s'} \int_t^{s} \partial_1 \partial_2 R_X(u, v)\,du\,dv. \tag{7.6}$$
Therefore, the ratio in (7.3) is the average value of $\partial_1 \partial_2 R_X$ over the rectangle $[t, s] \times [t, s']$. Since $\partial_1 \partial_2 R_X$ is assumed to be continuous, the limit in (7.3) exists and it is equal to $\partial_1 \partial_2 R_X(t, t)$. Therefore, by part (c) already proved, $X$ is m.s. differentiable. By part (b), the autocorrelation function of $X'$ is $\partial_1 \partial_2 R_X$. Since this is assumed to be continuous, it follows that $X'$ is m.s. continuous. Thus, $X$ is m.s. continuously differentiable. Conversely, if $X$ is m.s. continuously differentiable, then $X$ and $X'$ are both m.s. continuous (m.s. differentiability implies m.s. continuity), so by part (b) and Corollary 2.2.4 the functions $R_X$, $\partial_2 R_X = R_{XX'}$, and $\partial_1 \partial_2 R_X = R_{X'}$ exist and are continuous.

(e) If $X$ is WSS, then $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$.
Suppose $R_X(\tau)$, $R'_X(\tau)$ and $R''_X(\tau)$ exist and are continuous functions of $\tau$. Then
$$\partial_1 R_X(s, t) = R'_X(\tau) \quad \text{and} \quad \partial_2 \partial_1 R_X(s, t) = -R''_X(\tau). \tag{7.7}$$
The minus sign in (7.7) appears because $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$, and the derivative of $\tau$ with respect to $t$ is $-1$. So, the hypotheses of part (d) hold, so that $X$ is m.s. continuously differentiable. Since $X$ is WSS, its mean function $\mu_X$ is constant, which has derivative zero, so $X'$ has mean zero. Also by part (b) and (7.7), $R_{X'X}(\tau) = R'_X(\tau)$ and $R_{X'} = -R''_X$. Similarly, $R_{XX'}(\tau) = -R'_X(\tau)$. Note that $X$ and $X'$ are each WSS and the cross correlation functions depend on $\tau$ alone, so $X$ and $X'$ are jointly WSS.

(f) If $X$ is WSS then
$$E\left[\left(\frac{X(t) - X(0)}{t}\right)^2\right] = -\frac{2(R_X(t) - R_X(0))}{t^2}. \tag{7.8}$$
Therefore, if $X$ is m.s. differentiable then the right side of (7.8) must converge to a finite limit as $t \to 0$, so in particular it is necessary that $(R_X(t) - R_X(0))/t \to 0$ as $t \to 0$. Therefore $R'_X(0) = 0$.

(g) The derivative process $X'$ is obtained by taking linear combinations and m.s. limits of random variables in $X = (X_t : t \in T)$. Therefore, (g) follows from the fact that the joint Gaussian property is preserved under linear combinations and limits (Proposition 3.4.3(c)).

Example 7.2.3 Let $f(t) = t^2 \sin(1/t^2)$ for $t \neq 0$ and $f(0) = 0$ as in Example 11.4.2, and let $X = (X_t : t \in \mathbb{R})$ be the deterministic random process such that $X(t) = f(t)$ for all $t \in \mathbb{R}$. Since $X$ is differentiable as an ordinary function, it is also m.s. differentiable, and its m.s. derivative $X'$ is equal to $f'$. Since $X'$, as a deterministic function, is not continuous at zero, it is also not continuous at zero in the m.s. sense. We have $R_X(s, t) = f(s)f(t)$ and $\partial_2 R_X(s, t) = f(s)f'(t)$, which is not continuous. So indeed the conditions of Proposition 7.2.2(d) do not hold, as required.

Example 7.2.4 A Brownian motion $W = (W_t : t \geq 0)$ is not m.s. differentiable. If it were, then for any fixed $t \geq 0$, $\frac{W(s) - W(t)}{s - t}$ would converge in the m.s.
sense as $s \to t$ to a random variable with a finite second moment. For a m.s. convergent sequence, the second moments of the variables in the sequence converge to the second moment of the limit random variable, which is finite. But $W(s) - W(t)$ has mean zero and variance $\sigma^2 |s - t|$, so that
$$\lim_{s \to t} E\left[\left(\frac{W(s) - W(t)}{s - t}\right)^2\right] = \lim_{s \to t} \frac{\sigma^2}{|s - t|} = +\infty. \tag{7.9}$$
Thus, $W$ is not m.s. differentiable at any $t$. For another approach, we could appeal to Proposition 7.2.2 to deduce this result. The limit in (7.9) is the same as the limit in (7.5), but with $s$ and $s'$ restricted to be equal. Hence (7.5), or equivalently (7.3), is not a finite limit, implying that $W$ is not differentiable at $t$.

Similarly, a Poisson process is not m.s. differentiable at any $t$. A WSS process $X$ with $R_X(\tau) = e^{-\alpha |\tau|}$ is not m.s. differentiable because $R'_X(0)$ does not exist. A WSS process $X$ with $R_X(\tau) = \frac{1}{1 + \tau^2}$ is m.s. differentiable, and its derivative process $X'$ is WSS with mean 0 and covariance function
$$R_{X'}(\tau) = -\left(\frac{1}{1 + \tau^2}\right)'' = \frac{2 - 6\tau^2}{(1 + \tau^2)^3}.$$

Proposition 7.2.5 Suppose $X$ is a m.s. differentiable random process and $f$ is a differentiable function. Then the product $Xf = (X(t)f(t) : t \in \mathbb{R})$ is mean square differentiable and $(Xf)' = X'f + Xf'$.

Proof: Fix $t$. Then for each $s \neq t$,
$$\frac{X(s)f(s) - X(t)f(t)}{s - t} = \frac{(X(s) - X(t))f(s)}{s - t} + \frac{X(t)(f(s) - f(t))}{s - t} \overset{m.s.}{\to} X'(t)f(t) + X(t)f'(t) \quad \text{as } s \to t.$$

Definition 7.2.6 A random process $X$ on a bounded interval (open, closed, or mixed) with endpoints $a < b$ is continuous and piecewise continuously differentiable in the m.s. sense, if $X$ is m.s. continuous over the interval, and if there exist $n \geq 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \leq k \leq n$: $X$ is m.s. continuously differentiable over $(t_{k-1}, t_k)$ and $X'$ has finite limits at the endpoints of $(t_{k-1}, t_k)$.
More generally, if $T$ is all of $\mathbb{R}$ or a subinterval of $\mathbb{R}$, then a random process $X = (X_t : t \in T)$ is continuous and piecewise continuously differentiable in the m.s. sense if its restriction to any bounded interval is continuous and piecewise continuously differentiable in the m.s. sense.

7.3 Integration of random processes

Let $X = (X_t : a \leq t \leq b)$ be a random process and let $h$ be a function on a finite interval $[a, b]$. How shall we define the following integral?
$$\int_a^b X_t h(t)\,dt. \tag{7.10}$$
One approach is to note that for each fixed $\omega$, $X_t(\omega)$ is a deterministic function of time, and so the integral can be defined as the integral of a deterministic function for each $\omega$. We shall focus on another approach, namely mean square (m.s.) integration. An advantage of m.s. integration is that it relies much less on properties of sample paths of random processes.

As for integration of deterministic functions, the m.s. Riemann integrals are based on Riemann sums, defined as follows. Given:

• A partition of $(a, b]$ of the form $(t_0, t_1], (t_1, t_2], \ldots, (t_{n-1}, t_n]$, where $n \geq 1$ and $a = t_0 < t_1 < \cdots < t_n = b$
• A sampling point from each subinterval, $v_k \in (t_{k-1}, t_k]$, for $1 \leq k \leq n$,

the corresponding Riemann sum for $Xh$ is defined by
$$\sum_{k=1}^{n} X_{v_k} h(v_k)(t_k - t_{k-1}).$$
The norm of the partition is defined to be $\max_k |t_k - t_{k-1}|$.

Definition 7.3.1 The Riemann integral $\int_a^b X_t h(t)\,dt$ is said to exist in the m.s. sense and its value is the random variable $I$ if the following is true. Given any $\epsilon > 0$, there is a $\delta > 0$ so that $E\left[\left(\sum_{k=1}^{n} X_{v_k} h(v_k)(t_k - t_{k-1}) - I\right)^2\right] \leq \epsilon$ whenever the norm of the partition is less than or equal to $\delta$. This definition is equivalent to the following condition, expressed using convergence of sequences. The m.s. Riemann integral exists and is equal to $I$, if for any sequence of partitions, specified by $((t^m_1, t^m_2, \ldots, t^m_{n_m}) : m \geq 1)$, with corresponding sampling points $((v^m_1, \ldots, v^m_{n_m}) : m \geq 1)$, such that the norm of the $m$th partition converges to zero as $m \to \infty$, the corresponding sequence of Riemann sums converges in the m.s. sense to $I$ as $m \to \infty$. The process $X_t h(t)$ is said to be m.s. Riemann integrable over $(a, b]$ if the integral $\int_a^b X_t h(t)\,dt$ exists and is finite.

Next, suppose $X_t h(t)$ is defined over the whole real line. If $X_t h(t)$ is m.s. Riemann integrable over every bounded interval $[a, b]$, then the Riemann integral of $X_t h(t)$ over $\mathbb{R}$ is defined by
$$\int_{-\infty}^{\infty} X_t h(t)\,dt = \lim_{a, b \to \infty} \int_{-a}^{b} X_t h(t)\,dt \quad \text{m.s.}$$
provided that the indicated limit exists as $a, b$ jointly converge to $+\infty$.

Whether an integral exists in the m.s. sense is determined by the autocorrelation function of the random process involved, as shown next. The condition involves Riemann integration of a deterministic function of two variables. As reviewed in Appendix 11.5, a two-dimensional Riemann integral over a bounded rectangle is defined as the limit of Riemann sums corresponding to a partition of the rectangle into subrectangles and choices of sampling points within the subrectangles. If the sampling points for the Riemann sums are required to be horizontally and vertically aligned, then we say the two-dimensional Riemann integral exists with aligned sampling.

Proposition 7.3.2 The integral $\int_a^b X_t h(t)\,dt$ exists in the m.s. Riemann sense if and only if
$$\int_a^b \int_a^b R_X(s, t) h(s) h(t)\,ds\,dt \tag{7.11}$$
exists as a two dimensional Riemann integral with aligned sampling. The m.s. integral exists, in particular, if $X$ is m.s. piecewise continuous over $[a, b]$ and $h$ is piecewise continuous over $[a, b]$.

Proof. By definition, the m.s. integral of $X_t h(t)$ exists if and only if the Riemann sums converge in the m.s. sense for an arbitrary sequence of partitions and sampling points, such that the norms of the partitions converge to zero.
So consider an arbitrary sequence of partitions of $(a, b]$ into intervals specified by the collection of endpoints, $((t^m_0, t^m_1, \ldots, t^m_{n_m}) : m \geq 1)$, with corresponding sampling point $v^m_k \in (t^m_{k-1}, t^m_k]$ for each $m$ and $1 \leq k \leq n_m$, such that the norm of the $m$th partition converges to zero as $m \to \infty$. For each $m \geq 1$, let $S_m$ denote the corresponding Riemann sum:
$$S_m = \sum_{k=1}^{n_m} X_{v^m_k} h(v^m_k)(t^m_k - t^m_{k-1}).$$
By the correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), $(S_m : m \geq 1)$ converges in the m.s. sense if and only if $\lim_{m, m' \to \infty} E[S_m S_{m'}]$ exists and is finite. Now
$$E[S_m S_{m'}] = \sum_{j=1}^{n_m} \sum_{k=1}^{n_{m'}} R_X(v^m_j, v^{m'}_k) h(v^m_j) h(v^{m'}_k)(t^m_j - t^m_{j-1})(t^{m'}_k - t^{m'}_{k-1}), \tag{7.12}$$
and the right-hand side of (7.12) is the Riemann sum for the integral (7.11), for the partition of $(a, b] \times (a, b]$ into rectangles of the form $(t^m_{j-1}, t^m_j] \times (t^{m'}_{k-1}, t^{m'}_k]$ and the sampling points $(v^m_j, v^{m'}_k)$. Note that the $n_m n_{m'}$ sampling points are aligned, in that they are determined by the $n_m + n_{m'}$ numbers $v^m_1, \ldots, v^m_{n_m}, v^{m'}_1, \ldots, v^{m'}_{n_{m'}}$. Moreover, any Riemann sum for the integral (7.11) with aligned sampling can arise in this way. Further, as $m, m' \to \infty$, the norm of this partition, which is the maximum length or width of any rectangle of the partition, converges to zero. Thus, the limit $\lim_{m, m' \to \infty} E[S_m S_{m'}]$ exists for any sequence of partitions and sampling points if and only if the integral (7.11) exists as a two-dimensional Riemann integral with aligned sampling.

Finally, if $X$ is piecewise m.s. continuous over $[a, b]$ and $h$ is piecewise continuous over $[a, b]$, then there is a partition of $[a, b]$ into intervals of the form $(s_{k-1}, s_k]$ such that $X$ is m.s. continuous over $(s_{k-1}, s_k)$ with m.s. limits at the endpoints, and $h$ is continuous over $(s_{k-1}, s_k)$ with finite limits at the endpoints.
Therefore, $R_X(s, t)h(s)h(t)$, restricted to each rectangle of the form $(s_{j-1}, s_j) \times (s_{k-1}, s_k)$, is the restriction of a continuous function on $[s_{j-1}, s_j] \times [s_{k-1}, s_k]$. Thus $R_X(s, t)h(s)h(t)$ is Riemann integrable over $[a, b] \times [a, b]$.

Proposition 7.3.3 Suppose $X_t h(t)$ and $Y_t k(t)$ are both m.s. integrable over $[a, b]$. Then
\[ E\left[\int_a^b X_t h(t)\,dt\right] = \int_a^b \mu_X(t) h(t)\,dt \tag{7.13} \]
\[ E\left[\left(\int_a^b X_t h(t)\,dt\right)^2\right] = \int_a^b \int_a^b R_X(s, t) h(s) h(t)\,ds\,dt \tag{7.14} \]
\[ \mathrm{Var}\left(\int_a^b X_t h(t)\,dt\right) = \int_a^b \int_a^b C_X(s, t) h(s) h(t)\,ds\,dt \tag{7.15} \]
\[ E\left[\left(\int_a^b X_s h(s)\,ds\right)\left(\int_a^b Y_t k(t)\,dt\right)\right] = \int_a^b \int_a^b R_{XY}(s, t) h(s) k(t)\,ds\,dt \tag{7.16} \]
\[ \mathrm{Cov}\left(\int_a^b X_s h(s)\,ds, \int_a^b Y_t k(t)\,dt\right) = \int_a^b \int_a^b C_{XY}(s, t) h(s) k(t)\,ds\,dt \tag{7.17} \]
\[ \int_a^b \left(X_t h(t) + Y_t k(t)\right) dt = \int_a^b X_t h(t)\,dt + \int_a^b Y_t k(t)\,dt \tag{7.18} \]

Proof. Let $(S_m)$ denote the sequence of Riemann sums appearing in the proof of Proposition 7.3.2. Since the mean of a m.s. convergent sequence of random variables is the limit of the means (Corollary 2.2.5),
\[ E\left[\int_a^b X_t h(t)\,dt\right] = \lim_{m \to \infty} E[S_m] = \lim_{m \to \infty} \sum_{k=1}^{n_m} \mu_X(v_k^m) h(v_k^m)(t_k^m - t_{k-1}^m). \tag{7.19} \]
The right-hand side of (7.19) is a limit of Riemann sums for the integral $\int_a^b \mu_X(t)h(t)\,dt$. Since this limit exists and is equal to $E\left[\int_a^b X_t h(t)\,dt\right]$ for any sequence of partitions and sample points, it follows that $\int_a^b \mu_X(t)h(t)\,dt$ exists as a Riemann integral, and is equal to $E\left[\int_a^b X_t h(t)\,dt\right]$, so (7.13) is proved.

The second moment of the m.s. limit of $(S_m : m \geq 1)$ is equal to $\lim_{m,m' \to \infty} E[S_m S_{m'}]$, by the correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), which implies (7.14). It follows from (7.13) that
\[ \left(E\left[\int_a^b X_t h(t)\,dt\right]\right)^2 = \int_a^b \int_a^b \mu_X(s)\mu_X(t) h(s) h(t)\,ds\,dt. \]
Subtracting each side of this from the corresponding side of (7.14) yields (7.15). The proofs of (7.16) and (7.17) are similar to the proofs of (7.14) and (7.15), and are left to the reader.
For any partition of $[a, b]$ and choice of sampling points, the Riemann sums for the three integrals appearing in (7.18) satisfy the corresponding additivity condition, implying (7.18).

The fundamental theorem of calculus, stated in Appendix 11.5, states that the increments of a continuous, piecewise continuously differentiable function are equal to integrals of the derivative of the function. The following is the generalization of the fundamental theorem of calculus to the m.s. calculus.

Theorem 7.3.4 (Fundamental Theorem of m.s. Calculus) Let $X$ be a m.s. continuously differentiable random process. Then for $a < b$,
\[ X_b - X_a = \int_a^b X'_t\,dt \quad \text{(m.s. Riemann integral)} \tag{7.20} \]
More generally, if $X$ is m.s. continuous and piecewise m.s. continuously differentiable, (7.20) holds with $X'_t$ replaced by the right-hand derivative, $D_+X_t$. (Note that $D_+X_t = X'_t$ whenever $X'_t$ is defined.)

Proof. The m.s. Riemann integral in (7.20) exists because $X'$ is assumed to be m.s. continuous. Let $B = X_b - X_a - \int_a^b X'_t\,dt$, and let $Y$ be an arbitrary random variable with a finite second moment. It suffices to show that $E[YB] = 0$, because a possible choice of $Y$ is $B$ itself. Let $\varphi(t) = E[Y X_t]$. Then for $s \neq t$,
\[ \frac{\varphi(s) - \varphi(t)}{s - t} = E\left[Y \left(\frac{X_s - X_t}{s - t}\right)\right]. \]
Taking a limit as $s \to t$ and using the fact that the correlation of a limit is the limit of the correlations for m.s. convergent sequences, it follows that $\varphi$ is differentiable and $\varphi'(t) = E[Y X'_t]$. Since $X'$ is m.s. continuous, it similarly follows that $\varphi'$ is continuous.

Next, we use the fact that the integral in (7.20) is the m.s. limit of Riemann sums, with each Riemann sum corresponding to a partition of $(a, b]$ specified by some $n \geq 1$ and $a = t_0 < \cdots < t_n = b$ and sampling points $v_k \in (t_{k-1}, t_k]$ for $1 \leq k \leq n$. Since the limit of the correlation is the correlation of the limit for m.s.
convergence,
\[ E\left[Y \int_a^b X'_t\,dt\right] = \lim_{|t_k - t_{k-1}| \to 0} E\left[Y \sum_{k=1}^{n} X'_{v_k}(t_k - t_{k-1})\right] = \lim_{|t_k - t_{k-1}| \to 0} \sum_{k=1}^{n} \varphi'(v_k)(t_k - t_{k-1}) = \int_a^b \varphi'(t)\,dt. \]
Therefore, $E[YB] = \varphi(b) - \varphi(a) - \int_a^b \varphi'(t)\,dt$, which is equal to zero by the fundamental theorem of calculus for deterministic continuously differentiable functions. This establishes (7.20) in case $X$ is m.s. continuously differentiable. If $X$ is m.s. continuous and only piecewise continuously differentiable, we can use essentially the same proof, observing that $\varphi$ is continuous and piecewise continuously differentiable, so that $E[YB] = \varphi(b) - \varphi(a) - \int_a^b \varphi'(t)\,dt = 0$ by the fundamental theorem of calculus for deterministic continuous, piecewise continuously differentiable functions.

Proposition 7.3.5 Suppose $X$ is a Gaussian random process. Then $X$, together with all mean square derivatives of $X$ that exist, and all m.s. Riemann integrals of $X$ of the form $I(a, b) = \int_a^b X_t h(t)\,dt$ that exist, are jointly Gaussian.

Proof. The m.s. derivatives and integrals of $X$ are obtained by taking m.s. limits of linear combinations of $X = (X_t : t \in T)$. Therefore, the proposition follows from the fact that the joint Gaussian property is preserved under linear combinations and limits (Proposition 3.4.3(c)).

Theoretical Exercise Suppose $X = (X_t : t \geq 0)$ is a random process such that $R_X$ is continuous. Let $Y_t = \int_0^t X_s\,ds$. Show that $Y$ is m.s. differentiable, and $P[Y'_t = X_t] = 1$ for $t \geq 0$.

Example 7.3.6 Let $(W_t : t \geq 0)$ be a Brownian motion with $\sigma^2 = 1$, and let $X_t = \int_0^t W_s\,ds$ for $t \geq 0$. Let us find $R_X$ and $P[|X_t| \geq t]$ for $t > 0$. Since $R_W(u, v) = u \wedge v$,
\[ R_X(s, t) = E\left[\int_0^s W_u\,du \int_0^t W_v\,dv\right] = \int_0^s \int_0^t (u \wedge v)\,dv\,du. \]
To proceed, consider first the case $s \geq t$ and partition the region of integration into three parts as shown in Figure 7.3.
The contributions from the two triangular subregions are the same, so
\[ R_X(s, t) = 2\int_0^t \int_0^u v\,dv\,du + \int_t^s \int_0^t v\,dv\,du = \frac{t^3}{3} + \frac{t^2(s - t)}{2} = \frac{t^2 s}{2} - \frac{t^3}{6}. \]

[Figure 7.3: Partition of region of integration.]

Still assuming that $s \geq t$, this expression can be rewritten as
\[ R_X(s, t) = \frac{st(s \wedge t)}{2} - \frac{(s \wedge t)^3}{6}. \tag{7.21} \]
Although we have found (7.21) only for $s \geq t$, both sides are symmetric in $s$ and $t$. Thus (7.21) holds for all $s, t$.

Since $W$ is a Gaussian process, $X$ is a Gaussian process. Also, $E[X_t] = 0$ (because $W$ is mean zero) and $E[X_t^2] = R_X(t, t) = \frac{t^3}{3}$. Thus,
\[ P[|X_t| \geq t] = 2P\left[\frac{X_t}{\sqrt{t^3/3}} \geq \frac{t}{\sqrt{t^3/3}}\right] = 2Q\left(\sqrt{\frac{3}{t}}\right). \]
Note that $P[|X_t| \geq t] \to 1$ as $t \to +\infty$.

Example 7.3.7 Let $N = (N_t : t \geq 0)$ be a second order process with a continuous autocorrelation function $R_N$ and let $x_0$ be a constant. Consider the problem of finding a m.s. differentiable random process $X = (X_t : t \geq 0)$ satisfying the linear differential equation
\[ X'_t = -X_t + N_t, \quad X_0 = x_0. \tag{7.22} \]
Guided by the case that $N_t$ is a smooth nonrandom function, we write
\[ X_t = x_0 e^{-t} + \int_0^t e^{-(t - v)} N_v\,dv \tag{7.23} \]
or
\[ X_t = x_0 e^{-t} + e^{-t} \int_0^t e^{v} N_v\,dv. \tag{7.24} \]
Using Proposition 7.2.5, it is not difficult to check that (7.24) indeed gives the solution to (7.22).

Next, let us find the mean and autocovariance functions of $X$ in terms of those of $N$. Taking the expectation on each side of (7.23) yields
\[ \mu_X(t) = x_0 e^{-t} + \int_0^t e^{-(t - v)} \mu_N(v)\,dv. \tag{7.25} \]
A different way to derive (7.25) is to take expectations in (7.22) to yield the deterministic linear differential equation
\[ \mu'_X(t) = -\mu_X(t) + \mu_N(t); \quad \mu_X(0) = x_0, \]
which can be solved to yield (7.25). To summarize, we found two methods to start with the stochastic differential equation (7.22) and derive (7.25), thereby expressing the mean function of the solution $X$ in terms of the mean function of the driving process $N$.
The first is to solve (7.22) to obtain (7.23) and then take expectations; the second is to take expectations first and then solve the deterministic differential equation for $\mu_X$.

The same two methods can be used to express the covariance function of $X$ in terms of the covariance function of $N$. For the first method, we use (7.23) to obtain
\[ C_X(s, t) = \mathrm{Cov}\left(x_0 e^{-s} + \int_0^s e^{-(s - u)} N_u\,du,\; x_0 e^{-t} + \int_0^t e^{-(t - v)} N_v\,dv\right) = \int_0^s \int_0^t e^{-(s - u)} e^{-(t - v)} C_N(u, v)\,dv\,du. \tag{7.26} \]
The second method is to derive deterministic differential equations. To begin, note that
\[ \partial_1 C_X(s, t) = \mathrm{Cov}(X'_s, X_t) = \mathrm{Cov}(-X_s + N_s, X_t) \]
so
\[ \partial_1 C_X(s, t) = -C_X(s, t) + C_{NX}(s, t). \tag{7.27} \]
For $t$ fixed, this is a differential equation in $s$. Also, $C_X(0, t) = 0$. If somehow the cross covariance function $C_{NX}$ is found, (7.27) and the boundary condition $C_X(0, t) = 0$ can be used to find $C_X$. So we turn next to finding a differential equation for $C_{NX}$:
\[ \partial_2 C_{NX}(s, t) = \mathrm{Cov}(N_s, X'_t) = \mathrm{Cov}(N_s, -X_t + N_t) \]
so
\[ \partial_2 C_{NX}(s, t) = -C_{NX}(s, t) + C_N(s, t). \tag{7.28} \]
For $s$ fixed, this is a differential equation in $t$ with initial condition $C_{NX}(s, 0) = 0$. Solving (7.28) yields
\[ C_{NX}(s, t) = \int_0^t e^{-(t - v)} C_N(s, v)\,dv. \tag{7.29} \]
Using (7.29) to replace $C_{NX}$ in (7.27) and solving (7.27) yields (7.26).

7.4 Ergodicity

Let $X$ be a stationary or WSS random process. Ergodicity generally means that certain time averages are asymptotically equal to certain statistical averages. For example, suppose $X = (X_t : t \in \mathbb{R})$ is WSS and m.s. continuous. The mean $\mu_X$ is defined as a statistical average: $\mu_X = E[X_t]$ for any $t \in \mathbb{R}$. The time average of $X$ over the interval $[0, t]$ is given by
\[ \frac{1}{t} \int_0^t X_u\,du. \]
Of course, for $t$ fixed, the time average is a random variable, and is typically not equal to the statistical average $\mu_X$. The random process $X$ is called mean ergodic (in the m.s. sense) if
\[ \lim_{t \to \infty} \frac{1}{t} \int_0^t X_u\,du = \mu_X \quad \text{m.s.} \]
A discrete time WSS random process $X$ is similarly called mean ergodic (in the m.s. sense) if
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu_X \quad \text{m.s.} \tag{7.30} \]
For example, by the m.s. version of the law of large numbers, if $X = (X_n : n \in \mathbb{Z})$ is WSS with $C_X(n) = I_{\{n = 0\}}$ (so that the $X_i$'s are uncorrelated) then (7.30) is true. For another example, if $C_X(n) = 1$ for all $n$, it means that $X_0$ has variance one and $P\{X_k = X_0\} = 1$ for all $k$ (because equality holds in the Schwarz inequality: $C_X(n) \leq C_X(0)$). Then for all $n \geq 1$,
\[ \frac{1}{n} \sum_{k=1}^{n} X_k = X_0. \]
Since $X_0$ has variance one, the process $X$ is not mean ergodic if $C_X(n) = 1$ for all $n$. In general, whether $X$ is mean ergodic in the m.s. sense is determined by the autocovariance function, $C_X$. The result is stated and proved next for continuous time, and the discrete-time version is true as well.

Proposition 7.4.1 Let $X$ be a real-valued, WSS, m.s. continuous random process. Then $X$ is mean ergodic (in the m.s. sense) if and only if
\[ \lim_{t \to \infty} \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau)\,d\tau = 0. \tag{7.31} \]
Sufficient conditions are
(a) $\lim_{\tau \to \infty} C_X(\tau) = 0$. (This condition is also necessary if $\lim_{\tau \to \infty} C_X(\tau)$ exists.)
(b) $\int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau < +\infty$.
(c) $\lim_{\tau \to \infty} R_X(\tau) = 0$.
(d) $\int_{-\infty}^{\infty} |R_X(\tau)|\,d\tau < +\infty$.

Proof. By the definition of m.s. convergence, $X$ is mean ergodic if and only if
\[ \lim_{t \to \infty} E\left[\left(\frac{1}{t} \int_0^t X_u\,du - \mu_X\right)^2\right] = 0. \tag{7.32} \]
Since $E\left[\frac{1}{t} \int_0^t X_u\,du\right] = \frac{1}{t} \int_0^t \mu_X\,du = \mu_X$, (7.32) is the same as $\mathrm{Var}\left(\frac{1}{t} \int_0^t X_u\,du\right) \to 0$ as $t \to \infty$. By the properties of m.s.
integrals,
\[ \mathrm{Var}\left(\frac{1}{t} \int_0^t X_u\,du\right) = \mathrm{Cov}\left(\frac{1}{t} \int_0^t X_u\,du, \frac{1}{t} \int_0^t X_v\,dv\right) = \frac{1}{t^2} \int_0^t \int_0^t C_X(u - v)\,du\,dv \tag{7.33} \]
\[ = \frac{1}{t^2} \int_0^t \int_{-v}^{t - v} C_X(\tau)\,d\tau\,dv \tag{7.34} \]
\[ = \frac{1}{t^2} \left[\int_0^t \int_0^{t - \tau} C_X(\tau)\,dv\,d\tau + \int_{-t}^0 \int_{-\tau}^{t} C_X(\tau)\,dv\,d\tau\right] \tag{7.35} \]
\[ = \frac{1}{t} \int_{-t}^{t} \left(\frac{t - |\tau|}{t}\right) C_X(\tau)\,d\tau = \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau)\,d\tau, \]
where for $v$ fixed the variable $\tau = u - v$ was introduced, and we use the fact that in both (7.34) and (7.35), the pair $(v, \tau)$ ranges over the region pictured in Figure 7.4. This establishes the first statement of the proposition.

[Figure 7.4: Region of integration for (7.34) and (7.35).]

For the remainder of the proof, it is important to keep in mind that the integral in (7.33) is simply the average of $C_X(u - v)$ over the square $[0, t] \times [0, t]$. The function $C_X(u - v)$ is equal to $C_X(0)$ along the diagonal of the square, and the magnitude of the function is bounded by $C_X(0)$ everywhere in the square. Thus, if $C_X(u - v)$ is small for $|u - v|$ larger than some constant, then for $t$ large the average of $C_X(u - v)$ over the square will be small. The integral in (7.31) is equivalent to the integral in (7.33), and both can be viewed as a weighted average of $C_X(\tau)$, with a triangular weighting function.

It remains to prove the assertions regarding (a)-(d). Suppose $C_X(\tau) \to c$ as $\tau \to \infty$. We claim the limit on the left side of (7.31) is equal to $c$. Indeed, given $\varepsilon > 0$ there exists $L > 0$ so that $|C_X(\tau) - c| \leq \varepsilon$ whenever $\tau \geq L$. For $0 \leq \tau \leq L$ we can use the Schwarz inequality to bound $C_X(\tau)$, namely $|C_X(\tau)| \leq C_X(0)$. Therefore for $t \geq L$,
\[ \left|\frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau)\,d\tau - c\right| = \left|\frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) (C_X(\tau) - c)\,d\tau\right| \leq \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) |C_X(\tau) - c|\,d\tau \]
\[ \leq \frac{2}{t} \int_0^L (C_X(0) + |c|)\,d\tau + \frac{2\varepsilon}{t} \int_L^t \frac{t - \tau}{t}\,d\tau \leq \frac{2L(C_X(0) + |c|)}{t} + \frac{2\varepsilon}{t} \int_0^t \frac{t - \tau}{t}\,d\tau = \frac{2L(C_X(0) + |c|)}{t} + \varepsilon \leq 2\varepsilon \quad \text{for } t \text{ large enough.} \]
Thus the limit on the left side of (7.31) is equal to $c$, as claimed. Hence if $\lim_{\tau \to \infty} C_X(\tau) = c$, (7.31) holds if and only if $c = 0$.
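The claim that the triangular weighted average in (7.31) approaches the limit of the covariance function can be illustrated numerically. The following is a minimal sketch, not part of the notes: the covariance function used, a constant plus a decaying exponential, is a hypothetical choice made only for illustration, and the helper name `triangular_average`, the value of `c`, and the step count are likewise assumptions.

```python
import math

def triangular_average(cov, t, steps=100_000):
    """Midpoint-rule estimate of (2/t) * integral_0^t ((t - tau)/t) * cov(tau) dtau."""
    d = t / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * d
        total += (t - tau) / t * cov(tau)
    return 2.0 / t * total * d

c = 0.25  # assumed limit of the covariance function as tau -> infinity

def cov(tau):
    # Hypothetical covariance function: a constant plus a decaying exponential.
    return c + math.exp(-abs(tau))

# The weighted average should drift toward c as the window length t grows.
for t in (10.0, 100.0, 1000.0):
    print(t, triangular_average(cov, t))
```

With `c` set to zero the same function gives values that shrink toward zero, which is the situation of condition (a).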
It remains to prove that (b), (c) and (d) each imply (7.31). Suppose condition (b) holds. Then
\[ \left|\frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau)\,d\tau\right| \leq \frac{2}{t} \int_0^t |C_X(\tau)|\,d\tau \leq \frac{1}{t} \int_{-\infty}^{\infty} |C_X(\tau)|\,d\tau \to 0 \quad \text{as } t \to \infty \]
so that (7.31) holds.

Suppose either condition (c) or condition (d) holds. By the same arguments applied to $C_X$ for parts (a) and (b), it follows that
\[ \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) R_X(\tau)\,d\tau \to 0 \quad \text{as } t \to \infty. \]
Since the integral in (7.31) is the variance of a random variable, it is nonnegative. Also, the integral is a weighted average of $C_X(\tau)$, and $C_X(\tau) = R_X(\tau) - \mu_X^2$. Therefore,
\[ 0 \leq \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) C_X(\tau)\,d\tau = -\mu_X^2 + \frac{2}{t} \int_0^t \left(\frac{t - \tau}{t}\right) R_X(\tau)\,d\tau \to -\mu_X^2 \quad \text{as } t \to \infty. \]
Thus, (7.31) holds, so that $X$ is mean ergodic in the m.s. sense. In addition, we see that conditions (c) and (d) also each imply that $\mu_X = 0$.

Example 7.4.2 Let $f_c$ be a nonzero constant, let $\Theta$ be a random variable such that $\cos(\Theta)$, $\sin(\Theta)$, $\cos(2\Theta)$, and $\sin(2\Theta)$ have mean zero, and let $A$ be a random variable independent of $\Theta$ such that $E[A^2] < +\infty$. Let $X = (X_t : t \in \mathbb{R})$ be defined by $X_t = A\cos(2\pi f_c t + \Theta)$. Then $X$ is WSS with $\mu_X = 0$ and
\[ R_X(\tau) = C_X(\tau) = \frac{E[A^2]\cos(2\pi f_c \tau)}{2}. \]
Condition (7.31) is satisfied, so $X$ is mean ergodic. Mean ergodicity can also be directly verified:
\[ \left|\frac{1}{t} \int_0^t X_u\,du\right| = \left|\frac{A}{t} \int_0^t \cos(2\pi f_c u + \Theta)\,du\right| = \left|\frac{A(\sin(2\pi f_c t + \Theta) - \sin(\Theta))}{2\pi f_c t}\right| \leq \frac{|A|}{\pi |f_c| t} \to 0 \quad \text{m.s. as } t \to \infty. \]

Example 7.4.3 (Composite binary source) A student has two biased coins, each with a zero on one side and a one on the other. Whenever the first coin is flipped the outcome is a one with probability $\frac{3}{4}$. Whenever the second coin is flipped the outcome is a one with probability $\frac{1}{4}$. Consider a random process $(W_k : k \in \mathbb{Z})$ formed as follows. First, the student selects one of the coins, each coin being selected with equal probability. Then the selected coin is used to generate the $W_k$'s; the other coin is not used at all.
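The selection procedure just described can be simulated directly. The following is a minimal sketch using only Python's standard `random` module; the function name `time_average`, the seed, and the run length of 20,000 flips are illustrative assumptions rather than anything specified in the notes.

```python
import random

def time_average(n, rng):
    """Pick one coin at random, flip it n times, and return the fraction of ones."""
    p_one = rng.choice([0.75, 0.25])  # the coin is selected once, with equal probability
    ones = sum(1 for _ in range(n) if rng.random() < p_one)
    return ones / n

rng = random.Random(0)  # fixed seed so that the run is repeatable
averages = [time_average(20_000, rng) for _ in range(10)]
print([round(a, 3) for a in averages])
```

Each printed value is the fraction of ones in one long run of the process. By the law of large numbers it should sit close to the bias of whichever coin was selected for that run, so the values can be compared with the statistical average of the process.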
This scenario can be modelled as in Figure 7.5, using the following random variables:

[Figure 7.5: A composite binary source.]

• $(U_k : k \in \mathbb{Z})$ are independent $Be\left(\frac{3}{4}\right)$ random variables
• $(V_k : k \in \mathbb{Z})$ are independent $Be\left(\frac{1}{4}\right)$ random variables
• $S$ is a $Be\left(\frac{1}{2}\right)$ random variable
• The above random variables are all independent
• $W_k = (1 - S)U_k + SV_k$.

The variable $S$ can be thought of as a switch state. Value $S = 0$ corresponds to using the coin with probability of a one equal to $\frac{3}{4}$ for each flip. Clearly $W$ is stationary, and hence also WSS. Is $W$ mean ergodic? One approach to answering this is the direct one. Clearly
\[ \mu_W = E[W_k] = E[W_k \mid S = 0]P[S = 0] + E[W_k \mid S = 1]P[S = 1] = \frac{3}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{2} = \frac{1}{2}. \]
So the question is whether
\[ \frac{1}{n} \sum_{k=1}^{n} W_k \overset{?}{\to} \frac{1}{2} \quad \text{m.s.} \]
But by the strong law of large numbers
\[ \frac{1}{n} \sum_{k=1}^{n} W_k = \frac{1}{n} \sum_{k=1}^{n} ((1 - S)U_k + SV_k) = (1 - S)\left(\frac{1}{n} \sum_{k=1}^{n} U_k\right) + S\left(\frac{1}{n} \sum_{k=1}^{n} V_k\right) \overset{\text{m.s.}}{\to} (1 - S)\frac{3}{4} + S\frac{1}{4} = \frac{3}{4} - \frac{S}{2}. \]
Thus, the limit is a random variable, rather than the constant $\frac{1}{2}$. Intuitively, the process $W$ has such strong memory due to the switch mechanism that even averaging over long time intervals does not diminish the randomness due to the switch.

Another way to show that $W$ is not mean ergodic is to find the covariance function $C_W$ and use the necessary and sufficient condition (7.31) for mean ergodicity. Note that for $k$ fixed, $W_k^2 = W_k$ with probability one, so $E[W_k^2] = \frac{1}{2}$. If $k \neq l$, then
\[ E[W_k W_l] = E[W_k W_l \mid S = 0]P[S = 0] + E[W_k W_l \mid S = 1]P[S = 1] = E[U_k U_l]\frac{1}{2} + E[V_k V_l]\frac{1}{2} \]
\[ = E[U_k]E[U_l]\frac{1}{2} + E[V_k]E[V_l]\frac{1}{2} = \left(\frac{3}{4}\right)^2 \frac{1}{2} + \left(\frac{1}{4}\right)^2 \frac{1}{2} = \frac{5}{16}. \]
Therefore,
\[ C_W(n) = \begin{cases} \frac{1}{4} & \text{if } n = 0 \\ \frac{1}{16} & \text{if } n \neq 0 \end{cases} \]
Since $\lim_{n \to \infty} C_W(n)$ exists and is not zero, $W$ is not mean ergodic.

In many applications, we are interested in averages of functions that depend on multiple random variables. We discuss this topic for a discrete time stationary random process, $(X_n : n \in \mathbb{Z})$.
Let $h$ be a bounded, Borel measurable function on $\mathbb{R}^k$ for some $k$. What time average would we expect to be a good approximation to the statistical average $E[h(X_1, \ldots, X_k)]$? A natural choice is
\[ \frac{1}{n} \sum_{j=1}^{n} h(X_j, X_{j+1}, \ldots, X_{j+k-1}). \]
We define a stationary random process $(X_n : n \in \mathbb{Z})$ to be ergodic if
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} h(X_j, \ldots, X_{j+k-1}) = E[h(X_1, \ldots, X_k)] \]
for every $k \geq 1$ and for every bounded Borel measurable function $h$ on $\mathbb{R}^k$, where the limit is taken in any of the three senses a.s., p. or m.s.$^3$ An interpretation of the definition is that if $X$ is ergodic then all of its finite dimensional distributions are determined as time averages.

As an example, suppose
\[ h(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 > 0 \geq x_2 \\ 0 & \text{else.} \end{cases} \]
Then $h(X_1, X_2)$ is one if the process $(X_k)$ makes a "down crossing" of level 0 between times one and two. If $X$ is ergodic then with probability 1,
\[ \lim_{n \to \infty} \frac{1}{n} (\text{number of down crossings between times } 1 \text{ and } n + 1) = P[X_1 > 0 \geq X_2]. \tag{7.36} \]
Equation (7.36) relates quantities that are quite different in nature. The left hand side of (7.36) is the long time-average downcrossing rate, whereas the right hand side of (7.36) involves only the joint statistics of two consecutive values of the process.

Ergodicity is a strong property. Two types of ergodic random processes are the following:
• a process $X = (X_k)$ such that the $X_k$'s are iid.
• a stationary Gaussian random process $X$ such that $\lim_{n \to \infty} R_X(n) = 0$ or, $\lim_{n \to \infty} C_X(n) = 0$.

7.5 Complexification, Part I

In some application areas, primarily in connection with spectral analysis as we shall see, complex valued random variables naturally arise. Vectors and matrices over $\mathbb{C}$ are reviewed in the appendix. A complex random variable $X = U + jV$ can be thought of as essentially a two dimensional random variable with real coordinates $U$ and $V$.
Similarly, a random complex $n$-dimensional vector $X$ can be written as $X = U + jV$, where $U$ and $V$ are each $n$-dimensional real vectors. As far as distributions are concerned, a random vector in $n$-dimensional complex space $\mathbb{C}^n$ is equivalent to a random vector with $2n$ real dimensions. For example, if the $2n$ real variables in $U$ and $V$ are jointly continuous, then $X$ is a continuous type complex random vector and its density is given by a function $f_X(x)$ for $x \in \mathbb{C}^n$. The density $f_X$ is related to the joint density of $U$ and $V$ by $f_X(u + jv) = f_{UV}(u, v)$ for all $u, v \in \mathbb{R}^n$.

As far as moments are concerned, all the second order analysis covered in the notes up to this point can be easily modified to hold for complex random variables, simply by inserting complex conjugates in appropriate places. To begin, if $X$ and $Y$ are complex random variables, we define their correlation by $E[XY^*]$ and similarly their covariance as $E[(X - E[X])(Y - E[Y])^*]$. The Schwarz inequality becomes $|E[XY^*]| \leq \sqrt{E[|X|^2]E[|Y|^2]}$ and its proof is essentially the same as for real valued random variables. The cross correlation matrix for two complex random vectors $X$ and $Y$ is given by $E[XY^*]$, and similarly the cross covariance matrix is given by $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])^*]$. As before, $\mathrm{Cov}(X) = \mathrm{Cov}(X, X)$. The various formulas for covariance still apply. For example, if $A$ and $C$ are complex matrices and $b$ and $d$ are complex vectors, then $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^*$. Just as in the case of real valued random variables, a matrix $K$ is a valid covariance matrix (in other words, there exists some random vector $X$ such that $K = \mathrm{Cov}(X)$) if and only if $K$ is Hermitian symmetric and positive semidefinite.

Footnote 3: The mathematics literature uses a different definition of ergodicity for stationary processes, which is equivalent. There are also definitions of ergodicity that do not require stationarity.
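The identity Cov(AX + b, CY + d) = A Cov(X, Y) C* can be sanity-checked numerically in the scalar case, where A, C, b and d are complex numbers and C* is the complex conjugate of C. The sketch below is illustrative only: the helper `sample_cov`, the synthetic data, and the particular constants are assumptions, and the empirical covariance (with the conjugate on the second argument, as in the definition above) stands in for the true covariance.

```python
import random

def sample_cov(xs, ys):
    """Empirical Cov(X, Y): average of (x - mean x) * conj(y - mean y)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my).conjugate() for x, y in zip(xs, ys)) / n

rng = random.Random(1)
# Synthetic correlated complex samples (hypothetical data for illustration).
xs = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
ys = [x * (0.5 - 0.2j) + complex(rng.gauss(0, 1), rng.gauss(0, 1)) for x in xs]

a, b = 2 - 1j, 3 + 4j      # scalar stand-ins for A and b
c, d = -1 + 0.5j, 1 - 2j   # scalar stand-ins for C and d

lhs = sample_cov([a * x + b for x in xs], [c * y + d for y in ys])
rhs = a * sample_cov(xs, ys) * c.conjugate()
print(abs(lhs - rhs))       # difference is at the level of rounding error
var_x = sample_cov(xs, xs)  # Cov(X, X) should be real and nonnegative
print(var_x)
```

The conjugate on `c` is the scalar version of the Hermitian transpose; dropping it would make the two sides disagree whenever `c` has a nonzero imaginary part.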
Complex valued random variables $X$ and $Y$ with finite second moments are said to be orthogonal if $E[XY^*] = 0$, and with this definition the orthogonality principle holds for complex valued random variables. If $X$ and $Y$ are complex random vectors, then again $E[X \mid Y]$ is the MMSE estimator of $X$ given $Y$, and the covariance matrix of the error vector is given by $\mathrm{Cov}(X) - \mathrm{Cov}(E[X \mid Y])$. The MMSE estimator for $X$ of the form $AY + b$ (i.e. the best linear estimator of $X$ based on $Y$) and the covariance of the corresponding error vector are given just as for vectors made of real random variables:
\[ \hat{E}[X \mid Y] = E[X] + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}(Y - E[Y]) \]
\[ \mathrm{Cov}(X - \hat{E}[X \mid Y]) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, X) \]

By definition, a sequence $X_1, X_2, \ldots$ of complex valued random variables converges in the m.s. sense to a random variable $X$ if $E[|X_n|^2] < \infty$ for all $n$ and if $\lim_{n \to \infty} E[|X_n - X|^2] = 0$. The various Cauchy criteria still hold with minor modification. A sequence with $E[|X_n|^2] < \infty$ for all $n$ is a Cauchy sequence in the m.s. sense if $\lim_{m,n \to \infty} E[|X_n - X_m|^2] = 0$. As before, a sequence converges in the m.s. sense if and only if it is a Cauchy sequence. In addition, a sequence $X_1, X_2, \ldots$ of complex valued random variables with $E[|X_n|^2] < \infty$ for all $n$ converges in the m.s. sense if and only if $\lim_{m,n \to \infty} E[X_m X_n^*]$ exists and is a finite constant $c$. If the m.s. limit exists, then the limiting random variable $X$ satisfies $E[|X|^2] = c$.

Let $X = (X_t : t \in T)$ be a complex random process. We can write $X_t = U_t + jV_t$ where $U$ and $V$ are each real valued random processes. The process $X$ is defined to be a second order process if $E[|X_t|^2] < \infty$ for all $t$. Since $|X_t|^2 = U_t^2 + V_t^2$ for each $t$, $X$ being a second order process is equivalent to both $U$ and $V$ being second order processes. The correlation function of a second order complex random process $X$ is defined by $R_X(s, t) = E[X_s X_t^*]$.
The covariance function is given by $C_X(s, t) = \mathrm{Cov}(X_s, X_t)$ where the definition of Cov for complex random variables is used. The definitions and results given for m.s. continuity, m.s. differentiation, and m.s. integration all carry over to the case of complex processes, because they are based on the use of the Cauchy criteria for m.s. convergence which also carries over. For example, a complex valued random process is m.s. continuous if and only if its correlation function $R_X$ is continuous. Similarly the cross correlation function for two second order random processes $X$ and $Y$ is defined by $R_{XY}(s, t) = E[X_s Y_t^*]$. Note that $R_{XY}(s, t) = R_{YX}^*(t, s)$.

Let $X = (X_t : t \in T)$ be a complex random process such that $T$ is either the real line or the set of integers, and write $X_t = U_t + jV_t$ where $U$ and $V$ are each real valued random processes. By definition, $X$ is stationary if and only if for any $t_1, \ldots, t_n \in T$, the joint distribution of $(X_{t_1 + s}, \ldots, X_{t_n + s})$ is the same for all $s \in T$. Equivalently, $X$ is stationary if and only if $U$ and $V$ are jointly stationary. The process $X$ is defined to be WSS if $X$ is a second order process such that $E[X_t]$ does not depend on $t$, and $R_X(s, t)$ is a function of $s - t$ alone. If $X$ is WSS we use $R_X(\tau)$ to denote $R_X(s, t)$, where $\tau = s - t$. A pair of complex-valued random processes $X$ and $Y$ are defined to be jointly WSS if both $X$ and $Y$ are WSS and if the cross correlation function $R_{XY}(s, t)$ is a function of $s - t$. If $X$ and $Y$ are jointly WSS then $R_{XY}(-\tau) = R_{YX}^*(\tau)$.

In summary, everything we've discussed in this section regarding complex random variables, vectors, and processes can be considered a simple matter of notation. One simply needs to add lines to indicate complex conjugates, to use $|X|^2$ instead of $X^2$, and to use a star "$*$" for Hermitian transpose in place of "$T$" for transpose.
We shall begin using the notation at this point, and return to a discussion of the topic of complex valued random processes in a later section. In particular, we will examine complex normal random vectors and their densities, and we shall see that there is somewhat more to complexification than just notation.

7.6 The Karhunen-Loève expansion

We've seen that under a change of coordinates, an $n$-dimensional random vector $X$ is transformed into a vector $Y = U^*X$ such that the coordinates of $Y$ are orthogonal random variables. Here $U$ is the unitary matrix such that $E[XX^*] = U\Lambda U^*$. The columns of $U$ are eigenvectors of the Hermitian symmetric matrix $E[XX^*]$ and the corresponding nonnegative eigenvalues of $E[XX^*]$ comprise the diagonal of the diagonal matrix $\Lambda$. The columns of $U$ form an orthonormal basis for $\mathbb{C}^n$. The Karhunen-Loève expansion gives a similar change of coordinates for a random process on a finite interval, using an orthonormal basis of functions instead of an orthonormal basis of vectors.

Fix an interval $[a, b]$. The $L^2$ norm of a real or complex valued function $f$ on the interval $[a, b]$ is defined by
\[ \|f\| = \sqrt{\int_a^b |f(t)|^2\,dt}. \]
We write $L^2[a, b]$ for the set of all functions on $[a, b]$ which have finite $L^2$ norm. The inner product of two functions $f$ and $g$ in $L^2[a, b]$ is defined by
\[ (f, g) = \int_a^b f(t) g^*(t)\,dt. \]
The functions $f$ and $g$ are said to be orthogonal if $(f, g) = 0$. Note that $\|f\| = \sqrt{(f, f)}$ and the Schwarz inequality holds: $|(f, g)| \leq \|f\| \cdot \|g\|$. An orthonormal basis for $L^2[a, b]$ is a sequence of functions $\varphi_1, \varphi_2, \ldots$ which are mutually orthogonal and have norm one, or in other words, $(\varphi_i, \varphi_j) = I_{\{i = j\}}$ for all $i$ and $j$.

In many applications it is useful to use representations of the form
\[ f(t) = \sum_{n=1}^{N} c_n \varphi_n(t). \tag{7.37} \]
In such a case, we think of $(c_1, \ldots, c_N)$ as the coordinates of $f$ relative to the basis $(\varphi_n)$, and we might write $f \leftrightarrow (c_1, \ldots, c_N)$.
For example, transmitted signals in many digital communication systems have this form, where the coordinate vector $(c_1, \ldots, c_N)$ represents a data symbol. The geometry of the space of all functions $f$ of the form (7.37) for the fixed orthonormal basis $(\varphi_n)$ is equivalent to the geometry of the coordinate vectors. For example, if $g$ has a similar representation,
\[ g(t) = \sum_{n=1}^{N} d_n \varphi_n(t), \]
or equivalently $g \leftrightarrow (d_1, \ldots, d_N)$, then $f + g \leftrightarrow (c_1, \ldots, c_N) + (d_1, \ldots, d_N)$ and
\[ (f, g) = \int_a^b \left(\sum_{m=1}^{N} c_m \varphi_m(t)\right)\left(\sum_{n=1}^{N} d_n^* \varphi_n^*(t)\right) dt = \sum_{m=1}^{N} \sum_{n=1}^{N} c_m d_n^* \int_a^b \varphi_m(t)\varphi_n^*(t)\,dt = \sum_{m=1}^{N} \sum_{n=1}^{N} c_m d_n^* (\varphi_m, \varphi_n) = \sum_{m=1}^{N} c_m d_m^*. \]
That is, the inner product of the functions, $(f, g)$, is equal to the inner product of their coordinate vectors. Note that for $1 \leq n \leq N$, $\varphi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$, such that the one is in the $n$th position. If $f \leftrightarrow (c_1, \ldots, c_N)$, the coordinates of $f$ can be obtained by computing inner products between $f$ and the basis functions:
\[ (f, \varphi_n) = \int_a^b \left(\sum_{m=1}^{N} c_m \varphi_m(t)\right) \varphi_n^*(t)\,dt = \sum_{m=1}^{N} c_m (\varphi_m, \varphi_n) = c_n. \]
Another way to derive that $(f, \varphi_n) = c_n$ is to note that $f \leftrightarrow (c_1, \ldots, c_N)$ and $\varphi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$, so $(f, \varphi_n)$ is the inner product of $(c_1, \ldots, c_N)$ and $(0, \ldots, 0, 1, 0, \ldots, 0)$, or $c_n$. Thus, the coordinate vector for $f$ is given by $f \leftrightarrow ((f, \varphi_1), \ldots, (f, \varphi_N))$.

The dimension of the space $L^2[a, b]$ is infinite, so there are orthonormal bases $(\varphi_n : n \geq 1)$ with infinitely many functions (i.e. $N = \infty$). For such a basis, a function $f$ can have the representation
\[ f(t) = \sum_{n=1}^{\infty} c_n \varphi_n(t). \tag{7.38} \]
In many instances encountered in practice, the sum (7.38) converges for each $t$, but in general what is meant is that the convergence is in the sense of the $L^2[a, b]$ norm:
\[ \lim_{N \to \infty} \int_a^b \left|f(t) - \sum_{n=1}^{N} c_n \varphi_n(t)\right|^2 dt = 0, \]
or equivalently,
\[ \lim_{N \to \infty} \left\|f - \sum_{n=1}^{N} c_n \varphi_n\right\| = 0. \]
An orthonormal basis $(\varphi_n)$ is defined to be complete if any function $f$ in $L^2[a, b]$ can be written as in (7.38). A commonly used example of a complete orthonormal basis for an interval $[0, T]$ is given by
\[ \varphi_1(t) = \frac{1}{\sqrt{T}}, \quad \varphi_2(t) = \sqrt{\frac{2}{T}} \cos\left(\frac{2\pi t}{T}\right), \quad \varphi_3(t) = \sqrt{\frac{2}{T}} \sin\left(\frac{2\pi t}{T}\right), \]
\[ \varphi_{2k}(t) = \sqrt{\frac{2}{T}} \cos\left(\frac{2\pi k t}{T}\right), \quad \varphi_{2k+1}(t) = \sqrt{\frac{2}{T}} \sin\left(\frac{2\pi k t}{T}\right) \quad \text{for } k \geq 1. \tag{7.39} \]

What happens if we replace $f$ by a random process $X = (X_t : a \leq t \leq b)$? Suppose $(\varphi_n : 1 \leq n \leq N)$ is an orthonormal basis consisting of continuous functions, with $N \leq \infty$. (The basis does not have to be complete, and we allow $N$ to be finite or to equal infinity. Of course, if the basis is complete, then $N = \infty$.) Suppose that $X$ is m.s. continuous, or equivalently, that $R_X$ is continuous as a function on $[a, b] \times [a, b]$. In particular, $R_X$ is bounded. Then $E\left[\int_a^b |X_t|^2\,dt\right] = \int_a^b R_X(t, t)\,dt < \infty$, so that $\int_a^b |X_t|^2\,dt$ is finite with probability one. Suppose that $X$ can be represented as
\[ X_t = \sum_{n=1}^{N} C_n \varphi_n(t). \tag{7.40} \]
Such a representation exists if the basis $(\varphi_n)$ is complete, but some random processes have the form (7.40) even if $N$ is finite or if $N$ is infinite but the basis is not complete. The representation (7.40) reduces the description of the continuous-time random process to the description of the coefficients, $(C_n)$. This representation of $X$ is much easier to work with if the coordinate random variables are orthogonal.

Definition 7.6.1 A Karhunen-Loève (KL) expansion for a random process $X = (X_t : a \leq t \leq b)$ is a representation of the form (7.40) such that:
(1) the functions $(\varphi_n)$ are orthonormal: $(\varphi_m, \varphi_n) = I_{\{m = n\}}$, and
(2) the coordinate random variables $C_n$ are mutually orthogonal: $E[C_m C_n^*] = 0$ for $m \neq n$.

Example 7.6.2 Let $X_t = A$ for $0 \leq t \leq T$, where $A$ is a random variable with $0 < E[A^2] < \infty$. Then $X$ has the form in (7.40) for $N = 1$, $C_1 = A\sqrt{T}$, and $\varphi_1(t) = \frac{I_{\{0 \leq t \leq T\}}}{\sqrt{T}}$.
This is trivially a KL expansion, with only one term.

Example 7.6.3 Let $X_t = A\cos(2\pi t/T + \Theta)$ for $0 \leq t \leq T$, where $A$ is a real-valued random variable with $0 < E[A^2] < \infty$, and $\Theta$ is a random variable uniformly distributed on $[0, 2\pi]$ and independent of $A$. By the cosine angle addition formula, $X_t = A\cos(\Theta)\cos(2\pi t/T) - A\sin(\Theta)\sin(2\pi t/T)$. Then $X$ has the form in (7.40) for $N = 2$,
\[ C_1 = A\sqrt{\frac{T}{2}}\cos(\Theta), \quad C_2 = -A\sqrt{\frac{T}{2}}\sin(\Theta), \quad \varphi_1(t) = \sqrt{\frac{2}{T}}\cos(2\pi t/T), \quad \varphi_2(t) = \sqrt{\frac{2}{T}}\sin(2\pi t/T). \]
(The factor $\sqrt{2/T}$ is what gives $\varphi_1$ and $\varphi_2$ norm one, since $\int_0^T \cos^2(2\pi t/T)\,dt = T/2$.) In particular, $\varphi_1$ and $\varphi_2$ form an orthonormal basis with $N = 2$ elements. To check whether this is a KL expansion, we see if $E[C_1 C_2^*] = 0$. Since $E[C_1 C_2^*] = -\frac{T}{2}E[A^2]E[\cos(\Theta)\sin(\Theta)] = -\frac{T}{4}E[A^2]E[\sin(2\Theta)] = 0$, this is indeed a KL expansion, with two terms.

Proposition 7.6.4 Suppose $X = (X_t : a \leq t \leq b)$ is m.s. continuous and $(\varphi_n)$ is an orthonormal basis of continuous functions. If (7.40) holds for some random variables $(C_n)$, it is a KL expansion (i.e., the coordinate random variables are orthogonal) if and only if the basis functions are eigenfunctions of $R_X$:
\[ R_X \varphi_n = \lambda_n \varphi_n, \tag{7.41} \]
where for a function $\varphi \in L^2[a, b]$, $R_X\varphi$ denotes the function $(R_X\varphi)(s) = \int_a^b R_X(s, t)\varphi(t)\,dt$. In case (7.40) is a KL expansion, the eigenvalues are given by $\lambda_n = E[|C_n|^2]$.

Proof. Suppose (7.40) holds. Then $C_n = (X, \varphi_n) = \int_a^b X_t \varphi_n^*(t)\,dt$, so that
\[ E[C_m C_n^*] = E[(X, \varphi_m)(X, \varphi_n)^*] = E\left[\left(\int_a^b X_s \varphi_m^*(s)\,ds\right)\left(\int_a^b X_t \varphi_n^*(t)\,dt\right)^*\right] = \int_a^b \int_a^b R_X(s, t)\varphi_m^*(s)\varphi_n(t)\,ds\,dt = (R_X\varphi_n, \varphi_m). \tag{7.42} \]
Now, if the basis functions $(\varphi_n)$ are eigenfunctions of $R_X$, then $E[C_m C_n^*] = (R_X\varphi_n, \varphi_m) = (\lambda_n\varphi_n, \varphi_m) = \lambda_n(\varphi_n, \varphi_m) = \lambda_n I_{\{m = n\}}$. In particular, $E[C_m C_n^*] = 0$ if $n \neq m$, so that (7.40) is a KL expansion. Also, taking $m = n$ yields $E[|C_n|^2] = \lambda_n$.

Conversely, suppose (7.40) is a KL expansion. Without loss of generality, suppose that the basis $(\varphi_n)$ is complete.
(If it weren’t complete, it could be extended to a complete basis by augmenting it with functions from a complete basis and applying the Gramm-Schmidt method of orthogonalizing.) Then for n fixed, (R X φ n , φ m ) = 0 for all m ,= n. By completeness of the (φ n ), the function R X φ n has an expansion of the form (7.38), but all terms except possibly the n th are zero. Hence, R n φ n = λ n φ n for some constant λ n , so the eigenrelations (7.41) hold. Again, E[[C n [ 2 ] = λ n by the computation above. The following theorem is stated without proof. Theorem 7.6.5 (Mercer’s theorem) If X = (X t : a ≤ t ≤ b) is m.s. continuous, or, equivalently, that R X is continuous. Then there exists a complete orthonormal basis (φ n : n ≥ 1) of continuous 234 CHAPTER 7. BASIC CALCULUS OF RANDOM PROCESSES eigenfunctions and corresponding nonnegative eigenvalues (λ n : n ≥ 1) for R X , and R X is given by the following series expansion: R X (s, t) = ∞ n=1 λ n φ n (s)φ ∗ n (t). (7.43) The series converges uniformly in s, t, meaning that lim N→∞ max a≤s,t≤b [R X (s, t) − N n=1 λ n φ n (s)φ ∗ n (t)[ = 0 Theorem 7.6.6 ( Karhunen-Lo`eve expansion) Any process X satisfying the conditions of Mercer’s theorem has a KL expansion, X t = ∞ n=1 φ n (t)(X, φ n ), and the series converges in the m.s. sense, uniformly in t. Proof. Use the complete orthonormal basis (φ n ) guaranteed by Mercer’s theorem. By (7.42), E[(X, φ m ) ∗ (X, φ n )] = (R X φ n , φ m ) = λ n I ¡n=m¦ . Also, E[X t (X, φ n ) ∗ ] = E[X t _ b a X ∗ s φ n (s)ds] = _ b a R X (t, s)φ n (s)ds = λ n φ n (t). These facts imply that for finite N, E _ _ ¸ ¸ ¸ ¸ ¸ X t − N n=1 φ n (t)(X, φ n ) ¸ ¸ ¸ ¸ ¸ 2 _ _ = R X (t, t) − N n=1 λ n [φ n (t)[ 2 , (7.44) which, since the series on the right side of (7.44) converges uniformly in t as n → ∞, implies the stated convergence property for the representation of X. 
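The eigenrelation (7.41) can also be explored numerically. The following is a minimal sketch, not a method taken from these notes: it assumes the kernel $R_X(s,t) = \sigma^2 (s \wedge t)$ of a Brownian motion on $[0,T]$, replaces the integral in $(R_X\phi)(s)$ by a midpoint-rule sum, and uses power iteration to approximate the largest eigenvalue and its eigenfunction. The function name `top_kl_eigenpair` and the parameters `n_grid` and `n_iter` are illustrative choices. The result is compared with the closed form $\lambda = 4\sigma^2 T^2/\pi^2$, $\phi(t) = \sqrt{2/T}\sin(\pi t/(2T))$ for this kernel.

```python
import math

def top_kl_eigenpair(kernel, a, b, n_grid=100, n_iter=60):
    """Approximate the largest eigenvalue and eigenfunction of the operator
    phi -> integral_a^b kernel(s, t) phi(t) dt.

    Midpoint-rule discretization plus power iteration: an illustrative
    choice, assumed here for simplicity.
    """
    h = (b - a) / n_grid
    grid = [a + (i + 0.5) * h for i in range(n_grid)]   # midpoints of subintervals
    K = [[kernel(s, t) for t in grid] for s in grid]
    phi = [1.0 / math.sqrt(b - a)] * n_grid             # starting guess with unit L2 norm
    lam = 0.0
    for _ in range(n_iter):
        # (R phi)(s_i) is approximated by h * sum_j K[i][j] * phi[j]
        r_phi = [h * sum(k_ij * p for k_ij, p in zip(row, phi)) for row in K]
        # phi has unit norm, so ||R phi|| converges to the largest eigenvalue
        lam = math.sqrt(h * sum(v * v for v in r_phi))
        phi = [v / lam for v in r_phi]
    return grid, lam, phi

sigma2, T = 1.0, 1.0
grid, lam, phi = top_kl_eigenpair(lambda s, t: sigma2 * min(s, t), 0.0, T)

# Closed-form largest eigenpair for the Brownian motion kernel on [0, T].
lam_exact = 4 * sigma2 * T**2 / math.pi**2
phi_exact = [math.sqrt(2 / T) * math.sin(math.pi * t / (2 * T)) for t in grid]
max_err = max(abs(p - q) for p, q in zip(phi, phi_exact))
print(lam, lam_exact, max_err)
```

Power iteration recovers only the leading eigenpair; for the full set of eigenpairs one would hand the discretized kernel matrix to a symmetric eigenvalue routine instead.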
Remarks

(1) The means of the coordinates of $X$ in a KL expansion can be expressed using the mean function $\mu_X(t) = E[X_t]$ as follows:

$$E[(X,\phi_n)] = \int_a^b \mu_X(t)\phi_n^*(t)\,dt = (\mu_X, \phi_n).$$

Thus, the mean of the $n$th coordinate of $X$ is the $n$th coordinate of the mean function of $X$.

(2) Symbolically, mimicking matrix notation, we can write the representation (7.43) of $R_X$ as

$$R_X(s,t) = \big[\,\phi_1(s) \mid \phi_2(s) \mid \cdots\,\big] \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \lambda_3 & \\ & & & \ddots \end{bmatrix} \begin{bmatrix} \phi_1^*(t) \\ \phi_2^*(t) \\ \vdots \end{bmatrix}.$$

(3) If $f \in L^2[a,b]$ and $f(t)$ represents a voltage or current across a resistor, then the energy dissipated during the interval $[a,b]$ is, up to a multiplicative constant, given by

$$(\text{Energy of } f) = \|f\|^2 = \int_a^b |f(t)|^2\,dt = \sum_{n=1}^{\infty} |(f,\phi_n)|^2.$$

The mean total energy of $(X_t : a < t < b)$ is thus given by

$$E\Big[\int_a^b |X_t|^2\,dt\Big] = \int_a^b R_X(t,t)\,dt = \int_a^b \sum_{n=1}^{\infty} \lambda_n |\phi_n(t)|^2\,dt = \sum_{n=1}^{\infty} \lambda_n.$$

(4) If $(X_t : a \le t \le b)$ is a real valued mean zero Gaussian process and if the orthonormal basis functions are real valued, then the coordinates $(X,\phi_n)$ are uncorrelated, real valued, jointly Gaussian random variables, and therefore are independent.

Example 7.6.7 Let $W = (W_t : t \ge 0)$ be a Brownian motion with parameter $\sigma^2$. Let us find the KL expansion of $W$ over the interval $[0,T]$. Substituting $R_X(s,t) = \sigma^2(s \wedge t)$ into the eigenrelation (7.41) yields

$$\int_0^t \sigma^2 s\,\phi_n(s)\,ds + \int_t^T \sigma^2 t\,\phi_n(s)\,ds = \lambda_n\phi_n(t). \tag{7.45}$$

Differentiating (7.45) with respect to $t$ yields

$$\sigma^2 t\,\phi_n(t) - \sigma^2 t\,\phi_n(t) + \int_t^T \sigma^2\phi_n(s)\,ds = \lambda_n\phi_n'(t), \tag{7.46}$$

and differentiating a second time yields that the eigenfunctions satisfy the differential equation $\lambda\phi'' = -\sigma^2\phi$. Also, setting $t = 0$ in (7.45) yields the boundary condition $\phi_n(0) = 0$, and setting $t = T$ in (7.46) yields the boundary condition $\phi_n'(T) = 0$. Solving yields that the eigenvalue and eigenfunction pairs for $W$ are

$$\lambda_n = \frac{4\sigma^2 T^2}{(2n+1)^2\pi^2}, \qquad \phi_n(t) = \sqrt{\frac{2}{T}}\sin\Big(\frac{(2n+1)\pi t}{2T}\Big), \qquad n \ge 0.$$

These functions form a complete orthonormal basis for $L^2[0,T]$.

Example 7.6.8 Let $X$ be a white noise process. Such a process is not a random process as defined in these notes, but can be defined as a generalized process in the same way that a delta function can be defined as a generalized function. Generalized random processes, just like generalized functions, only make sense when multiplied by a suitable function and then integrated. For example, the delta function $\delta$ is defined by the requirement that for any function $f$ that is continuous at $t = 0$,

$$\int_{-\infty}^{\infty} f(t)\delta(t)\,dt = f(0).$$

A white noise process $X$ is such that integrals of the form $\int_{-\infty}^{\infty} f(t)X(t)\,dt$ exist for functions $f$ with finite $L^2$ norm $\|f\|$. The integrals are random variables with finite second moments, mean zero and correlations given by

$$E\Big[\Big(\int_{-\infty}^{\infty} f(s)X_s\,ds\Big)\Big(\int_{-\infty}^{\infty} g(t)X_t\,dt\Big)^*\Big] = \sigma^2\int_{-\infty}^{\infty} f(t)g^*(t)\,dt.$$

In a formal or symbolic sense, this means that $X$ is a WSS process with mean $\mu_X = 0$ and autocorrelation function $R_X(s,t) = E[X_s X_t^*]$ given by $R_X(\tau) = \sigma^2\delta(\tau)$.

What would the KL expansion be for a white noise process over some fixed interval $[a,b]$? The eigenrelation (7.41) becomes simply $\sigma^2\phi(t) = \lambda_n\phi(t)$ for all $t$ in the interval. Thus, all the eigenvalues of a white noise process are equal to $\sigma^2$, and any function $f$ with finite norm is an eigenfunction. Thus, if $(\phi_n : n \ge 1)$ is an arbitrary complete orthonormal basis for the square integrable functions on $[a,b]$, then the coordinates of the white noise process $X$, formally given by $X_n = (X,\phi_n)$, satisfy

$$E[X_n X_m^*] = \sigma^2 I_{\{n=m\}}. \tag{7.47}$$

This offers a reasonable interpretation of white noise.
It is a generalized random process such that its coordinates $(X_n : n \ge 1)$ relative to an arbitrary orthonormal basis for a finite interval have mean zero and satisfy (7.47).

7.7 Periodic WSS random processes

Let $X = (X_t : t \in \mathbb{R})$ be a WSS random process and let $T$ be a positive constant.

Proposition 7.7.1 The following three conditions are equivalent:
(a) $R_X(T) = R_X(0)$
(b) $P[X_{T+\tau} = X_\tau] = 1$ for all $\tau \in \mathbb{R}$
(c) $R_X(T+\tau) = R_X(\tau)$ for all $\tau \in \mathbb{R}$ (i.e. $R_X(\tau)$ is periodic with period $T$.)

Proof. Suppose (a) is true. Since $R_X(0)$ is real valued, so is $R_X(T)$, yielding

$$E[|X_{T+\tau} - X_\tau|^2] = E[X_{T+\tau}X_{T+\tau}^* - X_{T+\tau}X_\tau^* - X_\tau X_{T+\tau}^* + X_\tau X_\tau^*] = R_X(0) - R_X(T) - R_X^*(T) + R_X(0) = 0.$$

Therefore, (a) implies (b). Next, suppose (b) is true and let $\tau \in \mathbb{R}$. Since two random variables that are equal with probability one have the same expectation, (b) implies that

$$R_X(T+\tau) = E[X_{T+\tau}X_0^*] = E[X_\tau X_0^*] = R_X(\tau).$$

Therefore (b) implies (c). Trivially (c) implies (a), so the equivalence of (a) through (c) is proved.

Definition 7.7.2 We call $X$ a periodic, WSS process of period $T$ if $X$ is WSS and any of the three equivalent properties (a), (b), or (c) of Proposition 7.7.1 holds.

Property (b) almost implies that the sample paths of $X$ are periodic. However, for each $\tau$ it can be that $X_\tau \ne X_{\tau+T}$ on an event of probability zero, and since there are uncountably many real numbers $\tau$, the sample paths need not be periodic. However, suppose (b) is true and define a process $Y$ by $Y_t = X_{(t \bmod T)}$. (Recall that by definition, $(t \bmod T)$ is equal to $t + nT$, where $n$ is the integer selected so that $0 \le t + nT < T$.) Then $Y$ has periodic sample paths, and $Y$ is a version of $X$, which by definition means that $P[X_t = Y_t] = 1$ for any $t \in \mathbb{R}$. Thus, the properties (a) through (c) are equivalent to the condition that $X$ is WSS and there is a version of $X$ with periodic sample paths of period $T$.

Suppose $X$ is a m.s. continuous, periodic, WSS random process. Due to the periodicity of $X$, it is natural to consider the restriction of $X$ to the interval $[0,T]$. The Karhunen-Loève expansion of $X$ restricted to $[0,T]$ is described next. Let $\phi_n$ be the function on $[0,T]$ defined by

$$\phi_n(t) = \frac{e^{2\pi jnt/T}}{\sqrt{T}}.$$

The functions $(\phi_n : n \in \mathbb{Z})$ form a complete orthonormal basis for $L^2[0,T]$. [4] In addition, for any $n$ fixed, both $R_X(\tau)$ and $\phi_n$ are periodic with period dividing $T$, so

$$\int_0^T R_X(s,t)\phi_n(t)\,dt = \int_0^T R_X(s-t)\phi_n(t)\,dt = \int_{s-T}^{s} R_X(t)\phi_n(s-t)\,dt = \int_0^T R_X(t)\phi_n(s-t)\,dt$$

$$= \frac{1}{\sqrt{T}}\int_0^T R_X(t)e^{2\pi jns/T}e^{-2\pi jnt/T}\,dt = \lambda_n\phi_n(s),$$

where $\lambda_n$ is given by

$$\lambda_n = \int_0^T R_X(t)e^{-2\pi jnt/T}\,dt = \sqrt{T}\,(R_X, \phi_n). \tag{7.48}$$

[4] Here it is more convenient to index the functions by the integers, rather than by the nonnegative integers. Sums of the form $\sum_{n=-\infty}^{\infty}$ should be interpreted as limits of $\sum_{n=-N}^{N}$ as $N \to \infty$.

Therefore $\phi_n$ is an eigenfunction of $R_X$ with eigenvalue $\lambda_n$. The Karhunen-Loève expansion of $X$ over the interval $[0,T]$ (Theorem 7.6.6) can be written as

$$X_t = \sum_{n=-\infty}^{\infty} \hat{X}_n e^{2\pi jnt/T}, \tag{7.49}$$

where $\hat{X}_n$ is defined by

$$\hat{X}_n = \frac{1}{\sqrt{T}}(X, \phi_n) = \frac{1}{T}\int_0^T X_t e^{-2\pi jnt/T}\,dt.$$

Note that

$$E[\hat{X}_m \hat{X}_n^*] = \frac{1}{T}E[(X,\phi_m)(X,\phi_n)^*] = \frac{\lambda_n}{T} I_{\{m=n\}}.$$

Although the representation (7.49) has been derived only for $0 \le t \le T$, both sides of (7.49) are periodic with period $T$. Therefore, the representation (7.49) holds for all $t$. It is called the spectral representation of the periodic, WSS process $X$.

By (7.48), the series expansion (7.38) applied to the function $R_X$ over the interval $[0,T]$ can be written as

$$R_X(t) = \sum_{n=-\infty}^{\infty} \frac{\lambda_n}{T} e^{2\pi jnt/T} = \sum_{\omega} p_X(\omega)e^{j\omega t}, \tag{7.50}$$

where $p_X$ is the function on the real line $\mathbb{R} = (\omega : -\infty < \omega < \infty)$, [5] defined by

$$p_X(\omega) = \begin{cases} \lambda_n/T & \omega = \frac{2\pi n}{T} \text{ for some integer } n \\ 0 & \text{else} \end{cases}$$

and the sum in (7.50) is only over $\omega$ such that $p_X(\omega) \ne 0$. The function $p_X$ is called the power spectral mass function of $X$.
It is similar to a probability mass function, in that it is positive for at most a countable infinity of values. The value $p_X(\frac{2\pi n}{T})$ is equal to the power of the $n$th term in the representation (7.49):

$$E[|\hat{X}_n e^{2\pi jnt/T}|^2] = E[|\hat{X}_n|^2] = p_X\Big(\frac{2\pi n}{T}\Big),$$

and the total mass of $p_X$ is the total power of $X$, $R_X(0) = E[|X_t|^2]$.

Periodicity is a rather restrictive assumption to place on a WSS process. In the next chapter we shall further investigate spectral properties of WSS processes. We shall see that many WSS random processes have a power spectral density. A given random variable might have a pmf or a pdf, and it definitely has a CDF. In the same way, a given WSS process might have a power spectral mass function or a power spectral density function, and it definitely has a cumulative power spectral distribution function. The periodic WSS processes of period $T$ are precisely those WSS processes that have a power spectral mass function that is concentrated on the integer multiples of $\frac{2\pi}{T}$.

[5] The Greek letter $\omega$ is used here as it is traditionally used for frequency measured in radians per second. It is related to the frequency $f$ measured in cycles per second by $\omega = 2\pi f$. Here $\omega$ is not the same as a typical element of the underlying space of all outcomes, $\Omega$. The meaning of $\omega$ should be clear from the context.

7.8 Problems

7.1 Calculus for a simple Gaussian random process
Define $X = (X_t : t \in \mathbb{R})$ by $X_t = A + Bt + Ct^2$, where $A$, $B$, and $C$ are independent, $N(0,1)$ random variables.
(a) Verify directly that $X$ is m.s. differentiable.
(b) Express $P\{\int_0^1 X_s\,ds \ge 1\}$ in terms of $Q$, the standard normal complementary CDF.

7.2 Lack of sample path continuity of a Poisson process
Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$.
(a) Find the following two probabilities, explaining your reasoning: $P\{N \text{ is continuous over the interval } [0,T]\}$ for a fixed $T > 0$, and $P\{N \text{ is continuous over the interval } [0,\infty)\}$.
(b) Is $N$ sample path continuous a.s.? Is $N$ m.s. continuous?

7.3 Properties of a binary valued process
Let $Y = (Y_t : t \ge 0)$ be given by $Y_t = (-1)^{N_t}$, where $N$ is a Poisson process with rate $\lambda > 0$.
(a) Is $Y$ a Markov process? If so, find the transition probability function $p_{i,j}(s,t)$ and the transition rate matrix $Q$.
(b) Is $Y$ mean square continuous?
(c) Is $Y$ mean square differentiable?
(d) Does $\lim_{T\to\infty} \frac{1}{T}\int_0^T Y_t\,dt$ exist in the m.s. sense? If so, identify the limit.

7.4 Some statements related to the basic calculus of random processes
Classify each of the following statements as either true (meaning always holds) or false, and justify your answers.
(a) Let $X_t = Z$, where $Z$ is a Gaussian random variable. Then $X = (X_t : t \in \mathbb{R})$ is mean ergodic in the m.s. sense.
(b) The function $R_X$ defined by $R_X(\tau) = \begin{cases} \sigma^2 & |\tau| \le 1 \\ 0 & |\tau| > 1 \end{cases}$ is a valid autocorrelation function.
(c) Suppose $X = (X_t : t \in \mathbb{R})$ is a mean zero stationary Gaussian random process, and suppose $X$ is m.s. differentiable. Then for any fixed time $t$, $X_t$ and $X_t'$ are independent.

7.5 Differentiation of the square of a Gaussian random process
(a) Show that if random variables $(A_n : n \ge 0)$ are mean zero and jointly Gaussian and if $\lim_{n\to\infty} A_n = A$ m.s., then $\lim_{n\to\infty} A_n^2 = A^2$ m.s. (Hint: If $A$, $B$, $C$, and $D$ are mean zero and jointly Gaussian, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)
(b) Show that if random variables $(A_n, B_n : n \ge 0)$ are jointly Gaussian and $\lim_{n\to\infty} A_n = A$ m.s. and $\lim_{n\to\infty} B_n = B$ m.s. then $\lim_{n\to\infty} A_n B_n = AB$ m.s. (Hint: Use part (a) and the identity $ab = \frac{(a+b)^2 - a^2 - b^2}{2}$.)
(c) Let $X$ be a mean zero, m.s. differentiable Gaussian random process, and let $Y_t = X_t^2$ for all $t$. Is $Y$ m.s. differentiable? If so, justify your answer and express the derivative in terms of $X_t$ and $X_t'$.

7.6 Cross correlation between a process and its m.s. derivative
Suppose $X$ is a m.s. differentiable random process. Show that $R_{X'X} = \partial_1 R_X$. (It follows, in particular, that $\partial_1 R_X$ exists.)

7.7 Fundamental theorem of calculus for m.s. calculus
Suppose $X = (X_t : t \ge 0)$ is a m.s. continuous random process. Let $Y$ be the process defined by $Y_t = \int_0^t X_u\,du$ for $t \ge 0$. Show that $X$ is the m.s. derivative of $Y$. (It follows, in particular, that $Y$ is m.s. differentiable.)

7.8 A windowed Poisson process
Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$, and let $X = (X_t : t \ge 0)$ be defined by $X_t = N_{t+1} - N_t$. Thus, $X_t$ is the number of counts of $N$ during the time window $(t, t+1]$.
(a) Sketch a typical sample path of $N$, and the corresponding sample path of $X$.
(b) Find the mean function $\mu_X(t)$ and covariance function $C_X(s,t)$ for $s, t \ge 0$. Express your answer in a simple form.
(c) Is $X$ Markov? Why or why not?
(d) Is $X$ mean-square continuous? Why or why not?
(e) Determine whether $\frac{1}{t}\int_0^t X_s\,ds$ converges in the mean square sense as $t \to \infty$.

7.9 An integral of white noise times an exponential
Let $X_t = \int_0^t Z_u e^{-u}\,du$, for $t \ge 0$, where $Z$ is white Gaussian noise with autocorrelation function $\delta(\tau)\sigma^2$, for some $\sigma^2 > 0$.
(a) Find the autocorrelation function, $R_X(s,t)$ for $s, t \ge 0$.
(b) Is $X$ mean square differentiable? Justify your answer.
(c) Does $X_t$ converge in the mean square sense as $t \to \infty$? Justify your answer.

7.10 A singular integral with a Brownian motion
Consider the integral $\int_0^1 \frac{w_t}{t}\,dt$, where $w = (w_t : t \ge 0)$ is a standard Brownian motion. Since $\mathrm{Var}(\frac{w_t}{t}) = \frac{1}{t}$ diverges as $t \to 0$, we define the integral as $\lim_{\epsilon\to 0}\int_\epsilon^1 \frac{w_t}{t}\,dt$ m.s. if the limit exists.
(a) Does the limit exist? If so, what is the probability distribution of the limit?
(b) Similarly, we define $\int_1^\infty \frac{w_t}{t}\,dt$ to be $\lim_{T\to\infty}\int_1^T \frac{w_t}{t}\,dt$ m.s. if the limit exists. Does the limit exist? If so, what is the probability distribution of the limit?

7.11 An integrated Poisson process
Let $N = (N_t : t \ge 0)$ denote a Poisson process with rate $\lambda > 0$, and let $Y_t = \int_0^t N_s\,ds$ for $t \ge 0$.
(a) Sketch a typical sample path of $Y$.
(b) Compute the mean function, $\mu_Y(t)$, for $t \ge 0$.
(c) Compute $\mathrm{Var}(Y_t)$ for $t \ge 0$.
(d) Determine the value of the limit, $\lim_{t\to\infty} P[Y_t < t]$.

7.12 Recognizing m.s. properties
Suppose $X$ is a mean zero random process. For each choice of autocorrelation function shown, indicate which of the following properties $X$ has: m.s. continuous, m.s. differentiable, m.s. integrable over finite length intervals, and mean ergodic in the m.s. sense.
(a) $X$ is WSS with $R_X(\tau) = (1 - |\tau|)_+$,
(b) $X$ is WSS with $R_X(\tau) = 1 + (1 - |\tau|)_+$,
(c) $X$ is WSS with $R_X(\tau) = \cos(20\pi\tau)\exp(-10|\tau|)$,
(d) $R_X(s,t) = \begin{cases} 1 & \text{if } \lfloor s \rfloor = \lfloor t \rfloor \\ 0 & \text{else} \end{cases}$ (not WSS, you don't need to check for mean ergodic property),
(e) $R_X(s,t) = \sqrt{s \wedge t}$ for $s, t \ge 0$ (not WSS, you don't need to check for mean ergodic property).

7.13 A random Taylor's approximation
Suppose $X$ is a mean zero WSS random process such that $R_X$ is twice continuously differentiable. Guided by Taylor's approximation for deterministic functions, we might propose the following estimator of $X_t$ given $X_0$ and $X_0'$: $\widehat{X}_t = X_0 + tX_0'$.
(a) Express the covariance matrix for the vector $(X_0, X_0', X_t)^T$ in terms of the function $R_X$ and its derivatives.
(b) Express the mean square error $E[(X_t - \widehat{X}_t)^2]$ in terms of the function $R_X$ and its derivatives.
(c) Express the optimal linear estimator $\widehat{E}[X_t \mid X_0, X_0']$ in terms of $X_0$, $X_0'$, and the function $R_X$ and its derivatives.
(d) (This part is optional - not required.) Compute and compare $\lim_{t\to 0}(\text{mean square error})/t^4$ for the two estimators, under the assumption that $R_X$ is four times continuously differentiable.

7.14 A stationary Gaussian process
Let $X = (X_t : t \in \mathbb{R})$ be a real stationary Gaussian process with mean zero and $R_X(t) = \frac{1}{1+t^2}$. Answer the following unrelated questions.
(a) Is $X$ a Markov process? Justify your answer.
(b) Find $E[X_3 \mid X_0]$ and express $P\{|X_3 - E[X_3 \mid X_0]| \ge 10\}$ in terms of $Q$, the standard Gaussian complementary cumulative distribution function.
(c) Find the autocorrelation function of $X'$, the m.s. derivative of $X$.
(d) Describe the joint probability density of $(X_0, X_0', X_1)^T$. You need not write it down in detail.

7.15 Integral of a Brownian bridge
A standard Brownian bridge $B$ can be defined by $B_t = W_t - tW_1$ for $0 \le t \le 1$, where $W$ is a Brownian motion with parameter $\sigma^2 = 1$. A Brownian bridge is a mean zero, Gaussian random process which is a.s. sample path continuous, and has autocorrelation function $R_B(s,t) = s(1-t)$ for $0 \le s \le t \le 1$.
(a) Why is the integral $X = \int_0^1 B_t\,dt$ well defined in the m.s. sense?
(b) Describe the joint distribution of the random variables $X$ and $W_1$.

7.16 Correlation ergodicity of Gaussian processes
A WSS random process $X$ is called correlation ergodic (in the m.s. sense) if for any constant $h$,

$$\lim_{t\to\infty} \text{m.s.}\ \frac{1}{t}\int_0^t X_{s+h}X_s\,ds = E[X_{s+h}X_s].$$

Suppose $X$ is a mean zero, real-valued Gaussian process such that $R_X(\tau) \to 0$ as $|\tau| \to \infty$. Show that $X$ is correlation ergodic. (Hints: Let $Y_t = X_{t+h}X_t$. Then correlation ergodicity of $X$ is equivalent to mean ergodicity of $Y$. If $A$, $B$, $C$, and $D$ are mean zero, jointly Gaussian random variables, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)

7.17 A random process which changes at a random time
Let $Y = (Y_t : t \in \mathbb{R})$ and $Z = (Z_t : t \in \mathbb{R})$ be stationary Gaussian Markov processes with mean zero and autocorrelation functions $R_Y(\tau) = R_Z(\tau) = e^{-|\tau|}$. Let $U$ be a real-valued random variable and suppose $Y$, $Z$, and $U$ are mutually independent. Finally, let $X = (X_t : t \in \mathbb{R})$ be defined by

$$X_t = \begin{cases} Y_t & t < U \\ Z_t & t \ge U \end{cases}$$

(a) Sketch a typical sample path of $X$.
(b) Find the first order distributions of $X$.
(c) Express the mean and autocorrelation function of $X$ in terms of the CDF, $F_U$, of $U$.
(d) Under what condition on $F_U$ is $X$ m.s. continuous?
(e) Under what condition on $F_U$ is $X$ a Gaussian random process?

7.18 Gaussian review question
Let $X = (X_t : t \in \mathbb{R})$ be a real-valued stationary Gauss-Markov process with mean zero and autocorrelation function $C_X(\tau) = 9\exp(-|\tau|)$.
(a) A fourth degree polynomial of two variables is given by $p(x,y) = a + bx + cy + dxy + ex^2y + fxy^2 + \ldots$ such that all terms have the form $cx^iy^j$ with $i + j \le 4$. Suppose $X_2$ is to be estimated by an estimator of the form $p(X_0, X_1)$. Find the fourth degree polynomial $p$ to minimize the MSE: $E[(X_2 - p(X_0, X_1))^2]$ and find the resulting MMSE. (Hint: Think! Very little computation is needed.)
(b) Find $P[X_2 \ge 4 \mid X_0 = \frac{1}{\pi}, X_1 = 3]$. You can express your answer using the Gaussian $Q$ function $Q(c) = \int_c^\infty \frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$. (Hint: Think! Very little computation is needed.)

7.19 First order differential equation driven by Gaussian white noise
Let $X$ be the solution of the ordinary differential equation $X' = -X + N$, with initial condition $x_0$, where $N = (N_t : t \ge 0)$ is a real valued Gaussian white noise with $R_N(\tau) = \sigma^2\delta(\tau)$ for some constant $\sigma^2 > 0$. Although $N$ is not an ordinary random process, we can interpret this as the condition that $N$ is a Gaussian random process with mean $\mu_N = 0$ and correlation function $R_N(\tau) = \sigma^2\delta(\tau)$.
(a) Find the mean function $\mu_X(t)$ and covariance function $C_X(s,t)$.
(b) Verify that $X$ is a Markov process by checking the necessary and sufficient condition: $C_X(r,s)C_X(s,t) = C_X(r,t)C_X(s,s)$ whenever $r < s < t$. (Note: The very definition of $X$ also suggests that $X$ is a Markov process, because if $t$ is the "present time," the future of $X$ depends only on $X_t$ and the future of the white noise. The future of the white noise is independent of the past $(X_s : s \le t)$. Thus, the present value $X_t$ contains all the information from the past of $X$ that is relevant to the future of $X$. This is the continuous-time analog of the discrete-time Kalman state equation.)
(c) Find the limits of $\mu_X(t)$ and $R_X(t+\tau, t)$ as $t \to \infty$. (Because these limits exist, $X$ is said to be asymptotically WSS.)

7.20 KL expansion of a simple random process
Let $X$ be a WSS random process with mean zero and autocorrelation function $R_X(\tau) = 100(\cos(10\pi\tau))^2 = 50 + 50\cos(20\pi\tau)$.
(a) Is $X$ mean square differentiable? (Justify your answer.)
(b) Is $X$ mean ergodic in the m.s. sense? (Justify your answer.)
(c) Describe a set of eigenfunctions and corresponding eigenvalues for the Karhunen-Loève expansion of $(X_t : 0 \le t \le 1)$.

7.21 KL expansion of a finite rank process
Suppose $Z = (Z_t : 0 \le t \le T)$ has the form $Z_t = \sum_{n=1}^{N} X_n\xi_n(t)$ such that the functions $\xi_1, \ldots, \xi_N$ are orthonormal over the interval $[0,T]$, and the vector $X = (X_1, \ldots, X_N)^T$ has a correlation matrix $K$ with $\det(K) \ne 0$. The process $Z$ is said to have rank $N$. Suppose $K$ is not diagonal. Describe the Karhunen-Loève expansion of $Z$. That is, describe an orthonormal basis $(\phi_n : n \ge 1)$, and eigenvalues for the KL expansion of $Z$, in terms of the given functions $(\xi_n)$ and correlation matrix $K$. Also, describe how the coordinates $(Z, \phi_n)$ are related to $X$.

7.22 KL expansion for derivative process
Suppose that $X = (X_t : 0 \le t \le 1)$ is a m.s. continuously differentiable random process on the interval $[0,1]$. Differentiating the KL expansion of $X$ yields $X'(t) = \sum_n (X,\phi_n)\phi_n'(t)$, which looks similar to a KL expansion for $X'$, but it may be that the functions $\phi_n'$ are not orthonormal. For some cases it is not difficult to identify the KL expansion for $X'$. To explore this, let $(\phi_n(t))$, $((X,\phi_n))$, and $(\lambda_n)$ denote the eigenfunctions, coordinate random variables, and eigenvalues, for the KL expansion of $X$ over the interval $[0,1]$. Let $(\psi_k(t))$, $((X',\psi_k))$, and $(\mu_k)$ denote the corresponding quantities for $X'$. For each of the following choices of $(\phi_n(t))$, express the eigenfunctions, coordinate random variables, and eigenvalues, for $X'$ in terms of those for $X$:
(a) $\phi_n(t) = e^{2\pi jnt}$, $n \in \mathbb{Z}$
(b) $\phi_1(t) = 1$, $\phi_{2k}(t) = \sqrt{2}\cos(2\pi kt)$, and $\phi_{2k+1}(t) = \sqrt{2}\sin(2\pi kt)$ for $k \ge 1$.
(c) $\phi_n(t) = \sqrt{2}\sin(\frac{(2n+1)\pi t}{2})$, $n \ge 0$. (Hint: Sketch $\phi_n$ and $\phi_n'$ for $n = 1, 2, 3$.)
(d) $\phi_1(t) = c_1(1 + \sqrt{3}t)$ and $\phi_2(t) = c_2(1 - \sqrt{3}t)$. (Suppose $\lambda_n = 0$ for $n \notin \{1, 2\}$. The constants $c_n$ should be selected so that $\|\phi_n\| = 1$ for $n = 1, 2$, but there is no need to calculate the constants for this problem.)

7.23 An infinitely differentiable process
Let $X = (X_t : t \in \mathbb{R})$ be WSS with autocorrelation function $R_X(\tau) = e^{-\tau^2/2}$.
(a) Show that $X$ is $k$-times differentiable in the m.s. sense, for all $k \ge 1$.
(b) Let $X^{(k)}$ denote the $k$th derivative process of $X$, for $k \ge 1$. Is $X^{(k)}$ mean ergodic in the m.s. sense for each $k$? Justify your answer.

7.24 Mean ergodicity of a periodic WSS random process
Let $X$ be a mean zero periodic WSS random process with period $T > 0$. Recall that $X$ has a power spectral representation

$$X_t = \sum_{n\in\mathbb{Z}} \widehat{X}_n e^{2\pi jnt/T},$$

where the coefficients $\widehat{X}_n$ are orthogonal random variables. The power spectral mass function of $X$ is the discrete mass function $p_X$ supported on frequencies of the form $\frac{2\pi n}{T}$, such that $E[|\widehat{X}_n|^2] = p_X(\frac{2\pi n}{T})$. Under what conditions on $p_X$ is the process $X$ mean ergodic in the m.s. sense? Justify your answer.

7.25 Application of the KL expansion to estimation
Let $X = (X_t : 0 \le t \le T)$ be a random process given by $X_t = AB\sin(\frac{\pi t}{T})$, where $A$ and $T$ are positive constants and $B$ is a $N(0,1)$ random variable. Think of $X$ as an amplitude modulated random signal.
(a) What is the expected total energy of $X$?
(b) What are the mean and covariance functions of $X$?
(c) Describe the Karhunen-Loève expansion of $X$. (Hint: Only one eigenvalue is nonzero, call it $\lambda_1$. What are $\lambda_1$, the corresponding eigenfunction $\phi_1$, and the first coordinate $X_1 = (X, \phi_1)$? You don't need to explicitly identify the other eigenfunctions $\phi_2, \phi_3, \ldots$. They can simply be taken to fill out a complete orthonormal basis.)
(d) Let $N = (N_t : 0 \le t \le T)$ be a real-valued Gaussian white noise process independent of $X$ with $R_N(\tau) = \sigma^2\delta(\tau)$, and let $Y = X + N$. Think of $Y$ as a noisy observation of $X$. The same basis functions used for $X$ can be used for the Karhunen-Loève expansions of $N$ and $Y$. Let $N_1 = (N, \phi_1)$ and $Y_1 = (Y, \phi_1)$. Note that $Y_1 = X_1 + N_1$. Find $E[B \mid Y_1]$ and the resulting mean square error. (Remark: The other coordinates $Y_2, Y_3, \ldots$ are independent of both $X$ and $Y_1$, and are thus useless for the purpose of estimating $B$. Thus, $E[B \mid Y_1]$ is equal to $E[B \mid Y]$, the MMSE estimate of $B$ given the entire observation process $Y$.)

7.26 * An autocorrelation function or not?
Let $R_X(s,t) = \cosh(a(|s-t| - 0.5))$ for $-0.5 \le s, t \le 0.5$ where $a$ is a positive constant. Is $R_X$ the autocorrelation function of a random process of the form $X = (X_t : -0.5 \le t \le 0.5)$? If not, explain why not. If so, give the Karhunen-Loève expansion for $X$.

7.27 * On the conditions for m.s. differentiability
(a) Let $f(t) = \begin{cases} t^2\sin(1/t^2) & t \ne 0 \\ 0 & t = 0 \end{cases}$. Sketch $f$ and show that $f$ is differentiable over all of $\mathbb{R}$, and find the derivative function $f'$. Note that $f'$ is not continuous, and $\int_{-1}^{1} f'(t)\,dt$ is not well defined, whereas this integral would equal $f(1) - f(-1)$ if $f'$ were continuous.
(b) Let $X_t = Af(t)$, where $A$ is a random variable with mean zero and variance one. Show that $X$ is m.s. differentiable.
(c) Find $R_X$. Show that $\partial_1 R_X$ and $\partial_2\partial_1 R_X$ exist but are not continuous.

Chapter 8
Random Processes in Linear Systems and Spectral Analysis

Random processes can be passed through linear systems in much the same way as deterministic signals can.
A time-invariant linear system is described in the time domain by an impulse response function, and in the frequency domain by the Fourier transform of the impulse response function. In a sense we shall see that Fourier transforms provide a diagonalization of WSS random processes, just as the Karhunen-Loève expansion allows for the diagonalization of a random process defined on a finite interval. While a m.s. continuous random process on a finite interval has a finite average energy, a WSS random process has a finite mean average energy per unit time, called the power.

Nearly all the definitions and results of this chapter can be carried through in either discrete time or continuous time. The set of frequencies relevant for continuous-time random processes is all of $\mathbb{R}$, while the set of frequencies relevant for discrete-time random processes is the interval $[-\pi, \pi]$. For ease of notation we shall primarily concentrate on continuous-time processes and systems in the first two sections, and give the corresponding definition for discrete time in the third section.

Representations of baseband random processes and narrowband random processes are discussed in Sections 8.4 and 8.5. Roughly speaking, baseband random processes are those which have power only in low frequencies. A baseband random process can be recovered from samples taken at a sampling frequency that is at least twice as large as the largest frequency component of the process. Thus, operations and statistical calculations for a continuous-time baseband process can be reduced to considerations for the discrete time sampled process. Roughly speaking, narrowband random processes are those processes which have power only in a band (i.e. interval) of frequencies. A narrowband random process can be represented as a baseband random process that is modulated by a deterministic sinusoid. Complex random processes naturally arise as baseband equivalent processes for real-valued narrowband random processes. A related discussion of complex random processes is given in the last section of the chapter.

8.1 Basic definitions

The output $(Y_t : t \in \mathbb{R})$ of a linear system with impulse response function $h(s,t)$ and a random process input $(X_t : t \in \mathbb{R})$ is defined by

$$Y_s = \int_{-\infty}^{\infty} h(s,t)X_t\,dt. \tag{8.1}$$

See Figure 8.1.

[Figure 8.1: A linear system with input $X$, impulse response function $h$, and output $Y$.]

For example, the linear system could be a simple integrator from time zero, defined by

$$Y_s = \begin{cases} \int_0^s X_t\,dt & s \ge 0 \\ 0 & s < 0, \end{cases}$$

in which case the impulse response function is

$$h(s,t) = \begin{cases} 1 & s \ge t \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

The integral (8.1) defining the output $Y$ will be interpreted in the m.s. sense. Thus, the integral defining $Y_s$ for $s$ fixed exists if and only if the following Riemann integral exists and is finite:

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h^*(s,\tau)h(s,t)R_X(t,\tau)\,dt\,d\tau. \tag{8.2}$$

A sufficient condition for $Y_s$ to be well defined is that $R_X$ is a bounded continuous function, and $h(s,t)$ is continuous in $t$ with $\int_{-\infty}^{\infty} |h(s,t)|\,dt < \infty$. The mean function of the output is given by

$$\mu_Y(s) = E\Big[\int_{-\infty}^{\infty} h(s,t)X_t\,dt\Big] = \int_{-\infty}^{\infty} h(s,t)\mu_X(t)\,dt. \tag{8.3}$$

As illustrated in Figure 8.2, the mean function of the output is the result of passing the mean function of the input through the linear system.

[Figure 8.2: A linear system with input $\mu_X$ and impulse response function $h$.]

The cross correlation function between the output and input processes is given by

$$R_{YX}(s,\tau) = E\Big[\int_{-\infty}^{\infty} h(s,t)X_t\,dt\,X_\tau^*\Big] = \int_{-\infty}^{\infty} h(s,t)R_X(t,\tau)\,dt, \tag{8.4}$$

and the correlation function of the output is given by

$$R_Y(s,u) = E\Big[Y_s\Big(\int_{-\infty}^{\infty} h(u,\tau)X_\tau\,d\tau\Big)^*\Big] = \int_{-\infty}^{\infty} h^*(u,\tau)R_{YX}(s,\tau)\,d\tau \tag{8.5}$$

$$= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} h^*(u,\tau)h(s,t)R_X(t,\tau)\,dt\,d\tau. \tag{8.6}$$

Recall that $Y_s$ is well defined as a m.s. integral if and only if the integral (8.2) is well defined and finite.
Comparing with (8.6), it means that Y s is well defined if and only if the right side of (8.6) with u = s is well defined and gives a finite value for E[[Y s [ 2 ]. The linear system is time invariant if h(s, t) depends on s, t only through s −t. If the system is time invariant we write h(s − t) instead of h(s, t), and with this substitution the defining relation (8.1) becomes a convolution: Y = h ∗ X. A linear system is called bounded input bounded output (bibo) stable if the output is bounded whenever the input is bounded. In case the system is time invariant, bibo stability is equivalent to the condition _ ∞ −∞ [h(τ)[dτ < ∞. (8.7) In particular, if (8.7) holds and if an input signal x satisfies [x s [ < L for all s, then the output signal y = x ∗ h satisfies [y(t)[ ≤ _ ∞ −∞ [h(t −s)[Lds = L _ ∞ −∞ [h(τ)[dτ for all t. If X is a WSS random process then by the Schwarz inequality, R X is bounded by R X (0). Thus, if X is WSS and m.s. continuous, and if the linear system is time-invariant and bibo stable, the integral in (8.2) exists and is bounded by R X (0) _ ∞ −∞ _ ∞ −∞ [h(s −τ)[[h(s −t)[dtdτ = R X (0)( _ ∞ −∞ [h(τ)[dτ) 2 < ∞ Thus, the output of a linear, time-invariant bibo stable system is well defined in the m.s. sense if the input is a stationary, m.s. continuous process. A paragraph about convolutions is in order. It is useful to be able to recognize convolution integrals in disguise. If f and g are functions on R, the convolution is the function f ∗ g defined by f ∗ g(t) = _ ∞ −∞ f(s)g(t −s)ds 248CHAPTER 8. RANDOM PROCESSES IN LINEAR SYSTEMS AND SPECTRAL ANALYSIS or equivalently f ∗ g(t) = _ ∞ −∞ f(t −s)g(s)ds or equivalently, for any real a and b f ∗ g(a +b) = _ ∞ −∞ f(a +s)g(b −s)ds. A simple change of variable shows that the above three expressions are equivalent. 
However, in order to immediately recognize a convolution, the salient feature is that the convolution is the integral of the product of $f$ and $g$, with the arguments of both $f$ and $g$ ranging over $\mathbb{R}$ in such a way that the sum of the two arguments is held constant. The value of the constant is the value at which the convolution is being evaluated. Convolution is commutative: $f * g = g * f$, and associative: $(f * g) * k = f * (g * k)$ for three functions $f, g, k$. We simply write $f * g * k$ for $(f * g) * k$. The convolution $f * g * k$ is equal to a double integral of the product of $f$, $g$, and $k$, with the arguments of the three functions ranging over all triples in $\mathbb{R}^3$ with a constant sum. The value of the constant is the value at which the convolution is being evaluated. For example,

$$f * g * k(a+b+c) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(a+s+t) g(b-s) k(c-t) \, ds \, dt.$$

Suppose that $X$ is WSS and that the linear system is time invariant. Then (8.3) becomes

$$\mu_Y(s) = \int_{-\infty}^{\infty} h(s-t) \mu_X \, dt = \mu_X \int_{-\infty}^{\infty} h(t) \, dt$$

Observe that $\mu_Y(s)$ does not depend on $s$. Equation (8.4) becomes

$$R_{YX}(s,\tau) = \int_{-\infty}^{\infty} h(s-t) R_X(t-\tau) \, dt = h * R_X(s-\tau), \tag{8.8}$$

which in particular means that $R_{YX}(s,\tau)$ is a function of $s - \tau$ alone. Equation (8.5) becomes

$$R_Y(s,u) = \int_{-\infty}^{\infty} h^*(u-\tau) R_{YX}(s-\tau) \, d\tau. \tag{8.9}$$

The right side of (8.9) looks nearly like a convolution, but as $\tau$ varies the sum of the two arguments is $u - \tau + s - \tau$, which is not constant as $\tau$ varies. To arrive at a true convolution, define the new function $\widetilde{h}$ by $\widetilde{h}(v) = h^*(-v)$. Using the definition of $\widetilde{h}$ and (8.8) in (8.9) yields

$$R_Y(s,u) = \int_{-\infty}^{\infty} \widetilde{h}(\tau-u) (h * R_X)(s-\tau) \, d\tau = \widetilde{h} * (h * R_X)(s-u) = \widetilde{h} * h * R_X(s-u)$$

which in particular means that $R_Y(s,u)$ is a function of $s - u$ alone.

To summarize, if $X$ is WSS and if the linear system is time invariant, then $X$ and $Y$ are jointly WSS with

$$\mu_Y = \mu_X \int_{-\infty}^{\infty} h(t) \, dt \qquad R_{YX} = h * R_X \qquad R_Y = h * \widetilde{h} * R_X. \tag{8.10}$$

The convolution $\widetilde{h} * h$, equal to $h * \widetilde{h}$, can also be written as

$$h * \widetilde{h}(t) = \int_{-\infty}^{\infty} h(s) \widetilde{h}(t-s) \, ds = \int_{-\infty}^{\infty} h(s) h^*(s-t) \, ds \tag{8.11}$$

The expression shows that $h * \widetilde{h}(t)$ is the correlation between $h$ and $h^*$ translated by $t$ from the origin.

The equations derived in this section for the correlation functions $R_X$, $R_{YX}$ and $R_Y$ also hold for the covariance functions $C_X$, $C_{YX}$, and $C_Y$. The derivations are the same except that covariances rather than correlations are computed. In particular, if $X$ is WSS and the system is linear and time invariant, then $C_{YX} = h * C_X$ and $C_Y = h * \widetilde{h} * C_X$.

8.2 Fourier transforms, transfer functions and power spectral densities

Fourier transforms convert convolutions into products, so this is a good point to begin using Fourier transforms. The Fourier transform of a function $g$ mapping $\mathbb{R}$ to the complex numbers $\mathbb{C}$ is formally defined by

$$\widehat{g}(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t} g(t) \, dt \tag{8.12}$$

Some important properties of Fourier transforms are stated next.

Linearity: $\widehat{ag + bh} = a\widehat{g} + b\widehat{h}$

Inversion: $g(t) = \int_{-\infty}^{\infty} e^{j\omega t} \widehat{g}(\omega) \frac{d\omega}{2\pi}$

Convolution to multiplication: $\widehat{g * h} = \widehat{g}\,\widehat{h}$ and $\widehat{g} * \widehat{h} = 2\pi \widehat{gh}$

Parseval's identity: $\int_{-\infty}^{\infty} g(t) h^*(t) \, dt = \int_{-\infty}^{\infty} \widehat{g}(\omega) \widehat{h}^*(\omega) \frac{d\omega}{2\pi}$

Transform of time reversal: $\widehat{\widetilde{h}} = \widehat{h}^*$, where $\widetilde{h}(t) = h^*(-t)$

Differentiation to multiplication by $j\omega$: $\widehat{\frac{dg}{dt}}(\omega) = (j\omega)\widehat{g}(\omega)$

Pure sinusoid to delta function: For $\omega_o$ fixed: $\widehat{e^{j\omega_o t}}(\omega) = 2\pi\delta(\omega - \omega_o)$

Delta function to pure sinusoid: For $t_o$ fixed: $\widehat{\delta(t - t_o)}(\omega) = e^{-j\omega t_o}$

The inversion formula above shows that a function $g$ can be represented as an integral (basically a limiting form of linear combination) of sinusoidal functions of time $e^{j\omega t}$, and $\widehat{g}(\omega)$ is the coefficient in the representation for each $\omega$.
Parseval's identity applied with $g = h$ yields that the total energy of $g$ (the square of the $L^2$ norm) can be computed in either the time or frequency domain:

$$\|g\|^2 = \int_{-\infty}^{\infty} |g(t)|^2 \, dt = \int_{-\infty}^{\infty} |\widehat{g}(\omega)|^2 \frac{d\omega}{2\pi}.$$

The factor $2\pi$ in the formulas can be attributed to the use of frequency $\omega$ in radians. If $\omega = 2\pi f$, then $f$ is the frequency in Hertz (Hz) and $\frac{d\omega}{2\pi}$ is simply $df$.

The Fourier transform can be defined for a very large class of functions, including generalized functions such as delta functions. In these notes we won't attempt a systematic treatment, but will use Fourier transforms with impunity. In applications, one is often forced to determine in what senses the transform is well defined on a case-by-case basis. Two sufficient conditions for the Fourier transform of $g$ to be well defined are mentioned in the remainder of this paragraph. The relation (8.12) defining a Fourier transform of $g$ is well defined if, for example, $g$ is a continuous function which is integrable: $\int_{-\infty}^{\infty} |g(t)| \, dt < \infty$, and in this case the dominated convergence theorem implies that $\widehat{g}$ is a continuous function. The Fourier transform can also be naturally defined whenever $g$ has a finite $L^2$ norm, through the use of Parseval's identity. The idea is that if $g$ has finite $L^2$ norm, then it is the limit in the $L^2$ norm of a sequence of functions $g_n$ which are integrable. Owing to Parseval's identity, the Fourier transforms $\widehat{g}_n$ form a Cauchy sequence in the $L^2$ norm, and hence have a limit, which is defined to be $\widehat{g}$.

Return now to consideration of a linear time-invariant system with an impulse response function $h = (h(\tau) : \tau \in \mathbb{R})$. The Fourier transform of $h$ is used so often that a special name and notation is used: it is called the transfer function and is denoted by $H(\omega)$.

The output signal $y = (y_t : t \in \mathbb{R})$ for an input signal $x = (x_t : t \in \mathbb{R})$ is given in the time domain by the convolution $y = x * h$. In the frequency domain this becomes $\widehat{y}(\omega) = H(\omega)\widehat{x}(\omega)$.
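The relation $\widehat{y} = H\widehat{x}$ can be illustrated numerically in a finite, discrete-time setting. The sketch below is only an analogue of the continuous-time statement: it assumes NumPy, uses an arbitrary short impulse response and random input chosen for illustration, and replaces the Fourier transform by a zero-padded DFT so that the product of transforms corresponds to a linear (not circular) convolution.

```python
import numpy as np

def filter_via_fft(x, h):
    """Compute the linear convolution x * h as a product of zero-padded DFTs."""
    n = len(x) + len(h) - 1        # length of the full linear convolution
    X = np.fft.fft(x, n)           # zero-padded DFT of the input
    H = np.fft.fft(h, n)           # sampled transfer function
    return np.fft.ifft(H * X)      # back to the time domain

rng = np.random.default_rng(0)
x = rng.standard_normal(64)        # illustrative input signal
h = np.array([0.5, 0.3, 0.2])      # illustrative impulse response

y_time = np.convolve(x, h)         # time-domain convolution
y_freq = filter_via_fft(x, h)      # frequency-domain multiplication
max_err = np.max(np.abs(y_time - y_freq))
```

The two outputs agree up to floating-point rounding, which is the discrete counterpart of "convolution to multiplication".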
For example, given $a < b$ let $H_{[a,b]}(\omega)$ be the ideal bandpass transfer function for frequency band $[a,b]$, defined by

$$H_{[a,b]}(\omega) = \begin{cases} 1 & a \le \omega \le b \\ 0 & \text{otherwise.} \end{cases} \tag{8.13}$$

If $x$ is the input and $y$ is the output of a linear system with transfer function $H_{[a,b]}$, then the relation $\widehat{y}(\omega) = H_{[a,b]}(\omega)\widehat{x}(\omega)$ shows that the frequency components of $x$ in the frequency band $[a,b]$ pass through the filter unchanged, and the frequency components of $x$ outside of the band are completely nulled. The total energy of the output function $y$ can therefore be interpreted as the energy of $x$ in the frequency band $[a,b]$. Therefore,

$$\text{Energy of } x \text{ in frequency interval } [a,b] = \|y\|^2 = \int_{-\infty}^{\infty} |H_{[a,b]}(\omega)|^2 |\widehat{x}(\omega)|^2 \frac{d\omega}{2\pi} = \int_a^b |\widehat{x}(\omega)|^2 \frac{d\omega}{2\pi}.$$

Consequently, it is appropriate to call $|\widehat{x}(\omega)|^2$ the energy spectral density of the deterministic signal $x$.

Given a WSS random process $X = (X_t : t \in \mathbb{R})$, the Fourier transform of its correlation function $R_X$ is denoted by $S_X$. For reasons that we will soon see, the function $S_X$ is called the power spectral density of $X$. Similarly, if $Y$ and $X$ are jointly WSS, then the Fourier transform of $R_{YX}$ is denoted by $S_{YX}$, called the cross power spectral density function of $Y$ and $X$. The Fourier transform of the time reverse complex conjugate function $\widetilde{h}$ is equal to $H^*$, so $|H(\omega)|^2$ is the Fourier transform of $h * \widetilde{h}$. With the above notation, the second moment relationships in (8.10) become:

$$S_{YX}(\omega) = H(\omega) S_X(\omega) \qquad S_Y(\omega) = |H(\omega)|^2 S_X(\omega)$$

Let us examine some of the properties of the power spectral density, $S_X$. If $\int_{-\infty}^{\infty} |R_X(t)| \, dt < \infty$ then $S_X$ is well defined and is a continuous function. Because $R_{YX} = \widetilde{R}_{XY}$, it follows that $S_{YX} = S_{XY}^*$. In particular, taking $Y = X$ yields $R_X = \widetilde{R}_X$ and $S_X = S_X^*$, meaning that $S_X$ is real-valued.

The Fourier inversion formula applied to $S_X$ yields that $R_X(\tau) = \int_{-\infty}^{\infty} e^{j\omega\tau} S_X(\omega) \frac{d\omega}{2\pi}$.
In particular,

$$E[|X_t|^2] = R_X(0) = \int_{-\infty}^{\infty} S_X(\omega) \frac{d\omega}{2\pi}. \tag{8.14}$$

The expectation $E[|X_t|^2]$ is called the power (or total power) of $X$, because if $X_t$ is a voltage or current across a resistor, $|X_t|^2$ is the instantaneous rate of dissipation of heat energy. Therefore, (8.14) means that the total power of $X$ is the integral of $S_X$ over $\mathbb{R}$. This is the first hint that the name power spectral density for $S_X$ is justified.

Let $a < b$ and let $Y$ denote the output when the WSS process $X$ is passed through the linear time-invariant system with transfer function $H_{[a,b]}$ defined by (8.13). The process $Y$ represents the part of $X$ in the frequency band $[a,b]$. By the relation $S_Y = |H_{[a,b]}|^2 S_X$ and the power relationship (8.14) applied to $Y$, we have

$$\text{Power of } X \text{ in frequency interval } [a,b] = E[|Y_t|^2] = \int_{-\infty}^{\infty} S_Y(\omega) \frac{d\omega}{2\pi} = \int_a^b S_X(\omega) \frac{d\omega}{2\pi} \tag{8.15}$$

Two observations can be made concerning (8.15). First, the integral of $S_X$ over any interval $[a,b]$ is nonnegative. If $S_X$ is continuous, this implies that $S_X$ is nonnegative. Even if $S_X$ is not continuous, we can conclude that $S_X$ is nonnegative except possibly on a set of zero measure. The second observation is that (8.15) fully justifies the name "power spectral density of $X$" given to $S_X$.

Example 8.2.1 Suppose $X$ is a WSS process and that $Y$ is a moving average of $X$ with averaging window duration $T$ for some $T > 0$:

$$Y_t = \frac{1}{T} \int_{t-T}^{t} X_s \, ds$$

Equivalently, $Y$ is the output of the linear time-invariant system with input $X$ and impulse response function $h$ given by

$$h(\tau) = \begin{cases} \frac{1}{T} & 0 \le \tau \le T \\ 0 & \text{else} \end{cases}$$

The output correlation function is given by $R_Y = h * \widetilde{h} * R_X$. Using (8.11) and referring to Figure 8.3 we find that $h * \widetilde{h}$ is a triangular shaped waveform:

$$h * \widetilde{h}(\tau) = \frac{1}{T} \left( 1 - \frac{|\tau|}{T} \right)_+$$

Similarly, $C_Y = h * \widetilde{h} * C_X$.
Let's find in particular an expression for the variance of $Y_t$ in terms of the function $C_X$.

Figure 8.3: Convolution of two rectangle functions.

$$\mathrm{Var}(Y_t) = C_Y(0) = \int_{-\infty}^{\infty} (h * \widetilde{h})(0 - \tau) C_X(\tau) \, d\tau = \frac{1}{T} \int_{-T}^{T} \left( 1 - \frac{|\tau|}{T} \right) C_X(\tau) \, d\tau \tag{8.16}$$

The expression in (8.16) arose earlier in these notes, in the section on mean ergodicity.

Let's see the effect of the linear system on the power spectral density of the input. Observe that

$$H(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t} h(t) \, dt = \frac{1}{T} \left[ \frac{e^{-j\omega T} - 1}{-j\omega} \right] = \frac{2 e^{-j\omega T/2}}{T\omega} \left[ \frac{e^{j\omega T/2} - e^{-j\omega T/2}}{2j} \right] = e^{-j\omega T/2} \left[ \frac{\sin(\frac{\omega T}{2})}{\frac{\omega T}{2}} \right]$$

Equivalently, using the substitution $\omega = 2\pi f$,

$$H(2\pi f) = e^{-j\pi f T} \, \mathrm{sinc}(fT)$$

where in these notes the sinc function is defined by

$$\mathrm{sinc}(u) = \begin{cases} \frac{\sin(\pi u)}{\pi u} & u \ne 0 \\ 1 & u = 0. \end{cases} \tag{8.17}$$

(Some authors use somewhat different definitions for the sinc function.) Therefore $|H(2\pi f)|^2 = |\mathrm{sinc}(fT)|^2$, so that the output power spectral density is given by $S_Y(2\pi f) = S_X(2\pi f) |\mathrm{sinc}(fT)|^2$. See Figure 8.4.

Figure 8.4: The sinc function and the impulse response function.

Example 8.2.2 Consider two linear time-invariant systems in parallel as shown in Figure 8.5. The first has input $X$, impulse response function $h$, and output $U$. The second has input $Y$, impulse response function $k$, and output $V$. Suppose that $X$ and $Y$ are jointly WSS. We can find $R_{UV}$ as follows. The main trick is notational: to use enough different variables of integration so that none are used twice.

Figure 8.5: Parallel linear systems.

\begin{align}
R_{UV}(t,\tau) &= E\left[ \int_{-\infty}^{\infty} h(t-s) X_s \, ds \left( \int_{-\infty}^{\infty} k(\tau-v) Y_v \, dv \right)^* \right] \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(t-s) R_{XY}(s-v) k^*(\tau-v) \, ds \, dv \\
&= \int_{-\infty}^{\infty} \{ h * R_{XY}(t-v) \} \, k^*(\tau-v) \, dv \\
&= h * \widetilde{k} * R_{XY}(t-\tau).
\end{align}

Note that $R_{UV}(t,\tau)$ is a function of $t - \tau$ alone. Together with the fact that $U$ and $V$ are individually WSS, this implies that $U$ and $V$ are jointly WSS, and $R_{UV} = h * \widetilde{k} * R_{XY}$.
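As a sanity check on the bookkeeping in this derivation, the following sketch verifies a discrete-time analogue of the identity: the double sum $\sum_a \sum_b h(a) k^*(b) R_{XY}(d - a + b)$ (obtained from the double integral by the substitutions $a = t - s$, $b = \tau - v$, $d = t - \tau$) is compared with the triple convolution $h * \widetilde{k} * R_{XY}$ evaluated at lag $d$. The finite sequences are arbitrary illustrative choices, NumPy is assumed, and the function names are hypothetical.

```python
import numpy as np

def cross_corr_direct(h, k, r, M, d):
    """Double sum of h(a) conj(k(b)) R(d - a + b), with R = 0 outside [-M, M]."""
    total = 0j
    for a in range(len(h)):
        for b in range(len(k)):
            lag = d - a + b
            if -M <= lag <= M:
                total += h[a] * np.conj(k[b]) * r[lag + M]   # r[i] stores R(i - M)
    return total

def cross_corr_conv(h, k, r):
    """h * k~ * R, where k~(n) = conj(k(-n)) is stored for n = -(len(k)-1), ..., 0."""
    k_tilde = np.conj(k[::-1])
    return np.convolve(np.convolve(h, k_tilde), r)

rng = np.random.default_rng(1)
M = 4                                           # R is supported on lags -M..M
h = rng.standard_normal(3) + 1j * rng.standard_normal(3)
k = rng.standard_normal(5) + 1j * rng.standard_normal(5)
r = rng.standard_normal(2 * M + 1) + 1j * rng.standard_normal(2 * M + 1)

conv = cross_corr_conv(h, k, r)
offset = (len(k) - 1) + M                       # entry i of conv corresponds to lag i - offset
direct = np.array([cross_corr_direct(h, k, r, M, i - offset) for i in range(len(conv))])
max_diff = np.max(np.abs(conv - direct))
```

Complex sequences are used so that the conjugate in $\widetilde{k}$ is actually exercised; the identity is purely algebraic, so $r$ need not be a valid correlation function for the check to hold.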
The relationship is expressed in the frequency domain as $S_{UV} = H K^* S_{XY}$, where $K$ is the Fourier transform of $k$. Special cases of this example include the case that $X = Y$ or $h = k$.

Example 8.2.3 Consider the circuit with a resistor and a capacitor shown in Figure 8.6. Take as the input signal the voltage difference on the left side, and as the output signal the voltage across the capacitor. Also, let $q_t$ denote the charge on the upper side of the capacitor. Let us first identify the impulse response function by assuming a deterministic input $x$ and a corresponding output $y$. The elementary equations for resistors and capacitors yield

$$\frac{dq}{dt} = \frac{1}{R}(x_t - y_t) \quad \text{and} \quad y_t = \frac{q_t}{C}$$

Figure 8.6: An RC circuit modeled as a linear system.

Therefore

$$\frac{dy}{dt} = \frac{1}{RC}(x_t - y_t)$$

which in the frequency domain is

$$j\omega \widehat{y}(\omega) = \frac{1}{RC}(\widehat{x}(\omega) - \widehat{y}(\omega))$$

so that $\widehat{y} = H\widehat{x}$ for the system transfer function $H$ given by

$$H(\omega) = \frac{1}{1 + RCj\omega}$$

Suppose, for example, that the input $X$ is a real-valued, stationary Gaussian Markov process, so that its autocorrelation function has the form $R_X(\tau) = A^2 e^{-\alpha|\tau|}$ for some constants $A^2$ and $\alpha > 0$. Then

$$S_X(\omega) = \frac{2A^2\alpha}{\omega^2 + \alpha^2}$$

and

$$S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{2A^2\alpha}{(\omega^2 + \alpha^2)(1 + (RC\omega)^2)}$$

Example 8.2.4 A random signal, modeled by the input random process $X$, is passed into a linear time-invariant system with feedback and with noise modeled by the random process $N$, as shown in Figure 8.7. The output is denoted by $Y$. Assume that $X$ and $N$ are jointly WSS and that the random variables comprising $X$ are orthogonal to the random variables comprising $N$: $R_{XN} = 0$. Assume also, for the sake of system stability, that the magnitude of the gain around the loop satisfies $|H_3(\omega)H_1(\omega)H_2(\omega)| < 1$ for all $\omega$ such that $S_X(\omega) > 0$ or $S_N(\omega) > 0$.
We shall express the output power spectral density $S_Y$ in terms of the power spectral densities of $X$ and $N$, and the three transfer functions $H_1$, $H_2$, and $H_3$. An expression for the signal-to-noise power ratio at the output will also be computed.

Figure 8.7: A feedback system.

Under the assumed stability condition, the linear system can be written in the equivalent form shown in Figure 8.8. The process $\tilde{X}$ is the output due to the input signal $X$, and $\tilde{N}$ is the output due to the input noise $N$. The structure in Figure 8.8 is the same as considered in Example 8.2.2. Since $R_{XN} = 0$ it follows that $R_{\tilde{X}\tilde{N}} = 0$, so that $S_Y = S_{\tilde{X}} + S_{\tilde{N}}$. Consequently,

Figure 8.8: An equivalent representation.

$$S_Y(\omega) = S_{\tilde{X}}(\omega) + S_{\tilde{N}}(\omega) = \frac{|H_2(\omega)|^2 \left[ |H_1(\omega)|^2 S_X(\omega) + S_N(\omega) \right]}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2}$$

The output signal-to-noise ratio is the ratio of the power of the signal at the output to the power of the noise at the output. For this example it is given by

$$\frac{E[|\tilde{X}_t|^2]}{E[|\tilde{N}_t|^2]} = \frac{\displaystyle\int_{-\infty}^{\infty} \frac{|H_2(\omega)H_1(\omega)|^2 S_X(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2} \frac{d\omega}{2\pi}}{\displaystyle\int_{-\infty}^{\infty} \frac{|H_2(\omega)|^2 S_N(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2} \frac{d\omega}{2\pi}}$$

Example 8.2.5 Consider the linear time-invariant system defined as follows. For input signal $x$ the output signal $y$ is defined by $y''' + y' + y = x + x'$. We seek to find the power spectral density of the output process if the input is a white noise process $X$ with $R_X(\tau) = \sigma^2\delta(\tau)$ and $S_X(\omega) = \sigma^2$ for all $\omega$. To begin, we identify the transfer function of the system. In the frequency domain, the system is described by $((j\omega)^3 + j\omega + 1)\widehat{y}(\omega) = (1 + j\omega)\widehat{x}(\omega)$, so that

$$H(\omega) = \frac{1 + j\omega}{1 + j\omega + (j\omega)^3} = \frac{1 + j\omega}{1 + j(\omega - \omega^3)}$$
Hence,

$$S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{\sigma^2(1 + \omega^2)}{1 + (\omega - \omega^3)^2} = \frac{\sigma^2(1 + \omega^2)}{1 + \omega^2 - 2\omega^4 + \omega^6}.$$

Observe that

$$\text{output power} = \int_{-\infty}^{\infty} S_Y(\omega) \frac{d\omega}{2\pi} < \infty.$$

8.3 Discrete-time processes in linear systems

The basic definitions and use of Fourier transforms described above carry over naturally to discrete time. In particular, if the random process $X = (X_k : k \in \mathbb{Z})$ is the input of a linear, discrete-time system with impulse response function $h$, then the output $Y$ is the random process given by

$$Y_k = \sum_{n=-\infty}^{\infty} h(k,n) X_n.$$

The equations in Section 8.1 can be modified to hold for discrete time simply by replacing integration over $\mathbb{R}$ by summation over $\mathbb{Z}$. In particular, if $X$ is WSS and if the linear system is time-invariant then (8.10) becomes

$$\mu_Y = \mu_X \sum_{n=-\infty}^{\infty} h(n) \qquad R_{YX} = h * R_X \qquad R_Y = h * \widetilde{h} * R_X, \tag{8.18}$$

where the convolution in (8.18) is defined for functions $g$ and $h$ on $\mathbb{Z}$ by

$$g * h(n) = \sum_{k=-\infty}^{\infty} g(n-k) h(k)$$

Again, Fourier transforms can be used to convert convolution to multiplication. The Fourier transform of a function $g = (g(n) : n \in \mathbb{Z})$ is the function $\widehat{g}$ on $[-\pi, \pi]$ defined by

$$\widehat{g}(\omega) = \sum_{n=-\infty}^{\infty} e^{-j\omega n} g(n).$$

Some of the most basic properties are:

Linearity: $\widehat{ag + bh} = a\widehat{g} + b\widehat{h}$

Inversion: $g(n) = \int_{-\pi}^{\pi} e^{j\omega n} \widehat{g}(\omega) \frac{d\omega}{2\pi}$

Convolution to multiplication: $\widehat{g * h} = \widehat{g}\,\widehat{h}$ and $\widehat{g} * \widehat{h} = 2\pi \widehat{gh}$, where $\widehat{g} * \widehat{h}$ denotes the convolution over one period, $\int_{-\pi}^{\pi} \widehat{g}(\nu)\widehat{h}(\omega - \nu) \, d\nu$, with $\widehat{g}$ and $\widehat{h}$ extended periodically

Parseval's identity: $\sum_{n=-\infty}^{\infty} g(n) h^*(n) = \int_{-\pi}^{\pi} \widehat{g}(\omega) \widehat{h}^*(\omega) \frac{d\omega}{2\pi}$

Transform of time reversal: $\widehat{\widetilde{h}} = \widehat{h}^*$, where $\widetilde{h}(n) = h^*(-n)$

Pure sinusoid to delta function: For $\omega_o \in [-\pi, \pi]$ fixed: $\widehat{e^{j\omega_o n}}(\omega) = 2\pi\delta(\omega - \omega_o)$

Delta function to pure sinusoid: For $n_o$ fixed: $\widehat{I_{\{n = n_o\}}}(\omega) = e^{-j\omega n_o}$

The inversion formula above shows that a function $g$ on $\mathbb{Z}$ can be represented as an integral (basically a limiting form of linear combination) of sinusoidal functions of time $e^{j\omega n}$, and $\widehat{g}(\omega)$ is the coefficient in the representation for each $\omega$.
Parseval's identity applied with $g = h$ yields that the total energy of $g$ (the square of the $L^2$ norm) can be computed in either the time or frequency domain:

$$\|g\|^2 = \sum_{n=-\infty}^{\infty} |g(n)|^2 = \int_{-\pi}^{\pi} |\widehat{g}(\omega)|^2 \frac{d\omega}{2\pi}.$$

The Fourier transform and its inversion formula for discrete-time functions are equivalent to the Fourier series representation of functions in $L^2[-\pi, \pi]$ using the complete orthogonal basis $(e^{j\omega n} : n \in \mathbb{Z})$ for $L^2[-\pi, \pi]$, as discussed in connection with the Karhunen-Loève expansion. The functions in this basis all have squared norm $2\pi$. Recall that when we considered the Karhunen-Loève expansion for a periodic WSS random process of period $T$, functions on a time interval were important and the power was distributed on the integers $\mathbb{Z}$ scaled by $\frac{1}{T}$. In this section, $\mathbb{Z}$ is considered to be the time domain and the power is distributed over an interval. That is, the roles of $\mathbb{Z}$ and a finite interval are interchanged. The transforms used are essentially the same, but with $j$ replaced by $-j$.

Given a linear time-invariant system in discrete time with an impulse response function $h = (h(\tau) : \tau \in \mathbb{Z})$, the Fourier transform of $h$ is denoted by $H(\omega)$. The defining relation for the system in the time domain, $y = h * x$, becomes $\widehat{y}(\omega) = H(\omega)\widehat{x}(\omega)$ in the frequency domain. For $-\pi \le a < b \le \pi$,

$$\text{Energy of } x \text{ in frequency interval } [a,b] = \int_a^b |\widehat{x}(\omega)|^2 \frac{d\omega}{2\pi},$$

so it is appropriate to call $|\widehat{x}(\omega)|^2$ the energy spectral density of the deterministic, discrete-time signal $x$.

Given a WSS random process $X = (X_n : n \in \mathbb{Z})$, the Fourier transform of its correlation function $R_X$ is denoted by $S_X$, and is called the power spectral density of $X$. Similarly, if $Y$ and $X$ are jointly WSS, then the Fourier transform of $R_{YX}$ is denoted by $S_{YX}$, called the cross power spectral density function of $Y$ and $X$.
With the above notation, the second moment relationships in (8.18) become:

$$S_{YX}(\omega) = H(\omega) S_X(\omega) \qquad S_Y(\omega) = |H(\omega)|^2 S_X(\omega)$$

The Fourier inversion formula applied to $S_X$ yields that $R_X(n) = \int_{-\pi}^{\pi} e^{j\omega n} S_X(\omega) \frac{d\omega}{2\pi}$. In particular,

$$E[|X_n|^2] = R_X(0) = \int_{-\pi}^{\pi} S_X(\omega) \frac{d\omega}{2\pi}.$$

The expectation $E[|X_n|^2]$ is called the power (or total power) of $X$, and for $-\pi \le a < b \le \pi$ we have

$$\text{Power of } X \text{ in frequency interval } [a,b] = \int_a^b S_X(\omega) \frac{d\omega}{2\pi}$$

8.4 Baseband random processes

Deterministic baseband signals are considered first. Let $x$ be a continuous-time signal (i.e. a function on $\mathbb{R}$) such that its energy, $\int_{-\infty}^{\infty} |x(t)|^2 \, dt$, is finite. By the Fourier inversion formula, the signal $x$ is an integral, which is essentially a sum, of sinusoidal functions of time, $e^{j\omega t}$. The weights are given by the Fourier transform $\widehat{x}(\omega)$. Let $f_o > 0$ and let $\omega_o = 2\pi f_o$. The signal $x$ is called a baseband signal, with one-sided band limit $f_o$ Hz, or equivalently $\omega_o$ radians/second, if $\widehat{x}(\omega) = 0$ for $|\omega| \ge \omega_o$. For such a signal, the Fourier inversion formula becomes

$$x(t) = \int_{-\omega_o}^{\omega_o} e^{j\omega t} \widehat{x}(\omega) \frac{d\omega}{2\pi} \tag{8.19}$$

Equation (8.19) displays the baseband signal $x$ as a linear combination of the functions $e^{j\omega t}$ indexed by $\omega \in [-\omega_o, \omega_o]$.

A celebrated theorem of Nyquist states that the baseband signal $x$ is completely determined by its samples taken at sampling frequency $2f_o$. Specifically, define $T$ by $\frac{1}{T} = 2f_o$. Then

$$x(t) = \sum_{n=-\infty}^{\infty} x(nT) \, \mathrm{sinc}\left( \frac{t - nT}{T} \right), \tag{8.20}$$

where the sinc function is defined by (8.17). Nyquist's equation (8.20) is indeed elegant. It obviously holds by inspection if $t = mT$ for some integer $m$, because for $t = mT$ the only nonzero term in the sum is the one indexed by $n = m$. The equation shows that the sinc function gives the correct interpolation of the baseband signal $x$ for times in between the integer multiples of $T$.
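A small numerical sketch of the interpolation formula (8.20) may help. It assumes NumPy, whose `np.sinc(u)` equals $\sin(\pi u)/(\pi u)$ and so matches (8.17). The test signal $x(t) = \mathrm{sinc}^2((t - t_0)/(2T))$ is an illustrative choice, not taken from the notes: its transform is a triangle supported on $|f| \le 1/(2T) = f_o$, so it is a baseband signal with the required band limit. The offset $t_0$ and the truncation level $N$ are likewise hypothetical, and truncating the infinite sum leaves a small residual error.

```python
import numpy as np

T = 0.5                    # sampling period, so the band limit is f_o = 1/(2T) = 1 Hz
t0 = 0.37 * T              # arbitrary offset so the samples are not special values

def x(t):
    """Baseband test signal: sinc^2 has a triangular transform on |f| <= 1/(2T)."""
    return np.sinc((t - t0) / (2 * T)) ** 2

def reconstruct(t, N):
    """Truncated version of (8.20): sum over n = -N..N of x(nT) sinc((t - nT)/T)."""
    n = np.arange(-N, N + 1)
    return np.sum(x(n * T) * np.sinc((t - n * T) / T))

t_test = np.linspace(-2.0, 2.0, 41)                    # times between the sample points
x_hat = np.array([reconstruct(t, 500) for t in t_test])
max_err = np.max(np.abs(x_hat - x(t_test)))
```

Because the samples of this particular signal decay like $1/n^2$, the truncated sum converges quickly; a signal whose samples decay only like $1/n$ would need far more terms for the same accuracy.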
We shall give a proof of (8.20) for deterministic signals, before considering its extension to random processes. A proof of (8.20) goes as follows. Henceforth we will use $\omega_o$ more often than $f_o$, so it is worth remembering that $\omega_o T = \pi$. Taking $t = nT$ in (8.19) yields

$$x(nT) = \int_{-\omega_o}^{\omega_o} e^{j\omega nT} \widehat{x}(\omega) \frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} \widehat{x}(\omega) (e^{-j\omega nT})^* \frac{d\omega}{2\pi} \tag{8.21}$$

Equation (8.21) shows that $x(nT)$ is given by an inner product of $\widehat{x}$ and $e^{-j\omega nT}$. The functions $e^{-j\omega nT}$, considered on the interval $-\omega_o < \omega < \omega_o$ and indexed by $n \in \mathbb{Z}$, form a complete orthogonal basis for $L^2[-\omega_o, \omega_o]$, and $\int_{-\omega_o}^{\omega_o} T |e^{-j\omega nT}|^2 \frac{d\omega}{2\pi} = 1$. Therefore, $\widehat{x}$ over the interval $[-\omega_o, \omega_o]$ has the following Fourier series representation:

$$\widehat{x}(\omega) = T \sum_{n=-\infty}^{\infty} e^{-j\omega nT} x(nT) \qquad \omega \in [-\omega_o, \omega_o] \tag{8.22}$$

Plugging (8.22) into (8.19) yields

$$x(t) = \sum_{n=-\infty}^{\infty} x(nT) \, T \int_{-\omega_o}^{\omega_o} e^{j\omega t} e^{-j\omega nT} \frac{d\omega}{2\pi}. \tag{8.23}$$

The integral in (8.23) can be simplified using

$$T \int_{-\omega_o}^{\omega_o} e^{j\omega\tau} \frac{d\omega}{2\pi} = \mathrm{sinc}\left( \frac{\tau}{T} \right) \tag{8.24}$$

with $\tau = t - nT$ to yield (8.20) as desired.

The sampling theorem extends naturally to WSS random processes. A WSS random process $X$ with spectral density $S_X$ is said to be a baseband random process with one-sided band limit $\omega_o$ if $S_X(\omega) = 0$ for $|\omega| \ge \omega_o$.

Proposition 8.4.1 Suppose $X$ is a WSS baseband random process with one-sided band limit $\omega_o$ and let $T = \pi/\omega_o$. Then for each $t \in \mathbb{R}$

$$X_t = \sum_{n=-\infty}^{\infty} X_{nT} \, \mathrm{sinc}\left( \frac{t - nT}{T} \right) \quad \text{m.s.} \tag{8.25}$$

If $B$ is the process of samples defined by $B_n = X_{nT}$, then the power spectral densities of $B$ and $X$ are related by

$$S_B(\omega) = \frac{1}{T} S_X\left( \frac{\omega}{T} \right) \quad \text{for } |\omega| \le \pi \tag{8.26}$$

Proof. Fix $t \in \mathbb{R}$. It must be shown that $\varepsilon_N$ defined by the following expectation converges to zero as $N \to \infty$:

$$\varepsilon_N = E\left[ \left| X_t - \sum_{n=-N}^{N} X_{nT} \, \mathrm{sinc}\left( \frac{t - nT}{T} \right) \right|^2 \right]$$

When the square is expanded, terms of the form $E[X_a X_b^*]$ arise, where $a$ and $b$ take on the values $t$ or $nT$ for some $n$. But

$$E[X_a X_b^*] = R_X(a - b) = \int_{-\infty}^{\infty} e^{j\omega a} (e^{j\omega b})^* S_X(\omega) \frac{d\omega}{2\pi}.$$
Therefore, $\varepsilon_N$ can be expressed as an integration over $\omega$ rather than as an expectation:

$$\varepsilon_N = \int_{-\infty}^{\infty} \left| e^{j\omega t} - \sum_{n=-N}^{N} e^{j\omega nT} \, \mathrm{sinc}\left( \frac{t - nT}{T} \right) \right|^2 S_X(\omega) \frac{d\omega}{2\pi}. \tag{8.27}$$

For $t$ fixed, the function $(e^{j\omega t} : -\omega_o < \omega < \omega_o)$ has a Fourier series representation (use (8.24))

$$e^{j\omega t} = T \sum_{n=-\infty}^{\infty} e^{j\omega nT} \int_{-\omega_o}^{\omega_o} e^{j\omega t} e^{-j\omega nT} \frac{d\omega}{2\pi} = \sum_{n=-\infty}^{\infty} e^{j\omega nT} \, \mathrm{sinc}\left( \frac{t - nT}{T} \right),$$

so that the quantity inside the absolute value signs in (8.27) is the approximation error for the $N$th partial Fourier series sum for $e^{j\omega t}$. Since $e^{j\omega t}$ is continuous in $\omega$, a basic result in the theory of Fourier series yields that the Fourier approximation error is bounded by a single constant for all $N$ and $\omega$, and as $N \to \infty$ the Fourier approximation error converges to 0 uniformly on sets of the form $|\omega| \le \omega_o - \varepsilon$. Thus $\varepsilon_N \to 0$ as $N \to \infty$ by the dominated convergence theorem. The representation (8.25) is proved.

Clearly $B$ is a WSS discrete time random process with $\mu_B = \mu_X$ and

$$R_B(n) = R_X(nT) = \int_{-\infty}^{\infty} e^{jnT\omega} S_X(\omega) \frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} e^{jnT\omega} S_X(\omega) \frac{d\omega}{2\pi},$$

so, using a change of variable $\nu = T\omega$ and the fact $T = \frac{\pi}{\omega_o}$ yields

$$R_B(n) = \int_{-\pi}^{\pi} e^{jn\nu} \frac{1}{T} S_X\left( \frac{\nu}{T} \right) \frac{d\nu}{2\pi}. \tag{8.28}$$

But $S_B(\omega)$ is the unique function on $[-\pi, \pi]$ such that

$$R_B(n) = \int_{-\pi}^{\pi} e^{jn\omega} S_B(\omega) \frac{d\omega}{2\pi}$$

so (8.26) holds. The proof of Proposition 8.4.1 is complete.

As a check on (8.26), we note that $B_0 = X_0$, so the processes have the same total power. Thus, it must be that

$$\int_{-\pi}^{\pi} S_B(\omega) \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} S_X(\omega) \frac{d\omega}{2\pi}, \tag{8.29}$$

which is indeed consistent with (8.26).

Example 8.4.2 If $\mu_X = 0$ and the spectral density $S_X$ of $X$ is constant over the interval $[-\omega_o, \omega_o]$, then $\mu_B = 0$ and $S_B(\omega)$ is constant over the interval $[-\pi, \pi]$. Therefore $R_B(n) = C_B(n) = 0$ for $n \ne 0$, and the samples $(B_n)$ are mean zero, uncorrelated random variables.

Theoretical Exercise What does (8.26) become if $X$ is WSS and has a power spectral density, but $X$ is not a baseband signal?
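The relation (8.26) and the power check (8.29) can be illustrated numerically. The sketch below assumes NumPy and a hypothetical triangular spectral density $S_X(\omega) = (1 - |\omega|/\omega_o)_+$, for which the inverse transform has the closed form $R_X(\tau) = \frac{\omega_o}{2\pi}\,\mathrm{sinc}^2\!\left(\frac{\omega_o \tau}{2\pi}\right)$ with sinc as in (8.17). It computes $R_B(n)$ by numerically inverting $S_B$ from (8.26) and compares with $R_X(nT)$; the case $n = 0$ is the total power identity (8.29).

```python
import numpy as np

T = 0.5                          # hypothetical sampling period
omega_o = np.pi / T              # one-sided band limit, so omega_o * T = pi

def S_X(omega):
    """Assumed triangular spectral density, zero for |omega| >= omega_o."""
    return np.maximum(1.0 - np.abs(omega) / omega_o, 0.0)

def S_B(omega):
    """Spectral density of the samples B_n = X_{nT}, as given by (8.26)."""
    return S_X(omega / T) / T

def trapezoid(f, grid):
    """Plain trapezoidal rule for samples f on the given grid."""
    return np.sum((f[1:] + f[:-1]) * np.diff(grid)) / 2.0

nu = np.linspace(-np.pi, np.pi, 200001)

def R_B(n):
    """R_B(n) obtained by inverting S_B over [-pi, pi]."""
    return trapezoid(np.exp(1j * n * nu) * S_B(nu), nu).real / (2 * np.pi)

def R_X(tau):
    """Closed-form inverse transform of the triangular S_X."""
    return omega_o / (2 * np.pi) * np.sinc(omega_o * tau / (2 * np.pi)) ** 2

errors = [abs(R_B(n) - R_X(n * T)) for n in range(6)]
```

With a flat $S_X$ instead of a triangle, the same computation would give $R_B(n) = 0$ for $n \ne 0$, which is the situation of Example 8.4.2.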
8.5 Narrowband random processes

As noted in the previous section, a signal (modeled as either a deterministic finite energy signal or a WSS random process) can be reconstructed from samples taken at a sampling rate twice the highest frequency of the signal. For example, a typical voice signal may have highest frequency 5 kHz. If such a signal is multiplied by a signal with frequency $10^9$ Hz, the highest frequency of the resulting product is about 200,000 times larger than that of the original signal. Naïve application of the sampling theorem would mean that the sampling rate would have to increase by the same factor. Fortunately, because the energy or power of such a modulated signal is concentrated in a narrow band, the signal is nearly as simple as the original baseband signal. The motivation of this section is to see how signals and random processes with narrow spectral ranges can be analyzed in terms of equivalent baseband signals. For example, the effects of filtering can be analyzed using baseband equivalent filters. As an application, an example at the end of the section is given which describes how a narrowband random process (to be defined) can be simulated using a sampling rate equal to twice the one-sided width of a frequency band of a signal, rather than twice the highest frequency of the signal.

Deterministic narrowband signals are considered first, and the development for random processes follows a similar approach. Let $\omega_c > \omega_o > 0$. A narrowband signal (relative to $\omega_o$ and $\omega_c$) is a signal $x$ such that $\widehat{x}(\omega) = 0$ unless $\omega$ is in the union of two intervals: the upper band, $(\omega_c - \omega_o, \omega_c + \omega_o)$, and the lower band, $(-\omega_c - \omega_o, -\omega_c + \omega_o)$. More compactly, $\widehat{x}(\omega) = 0$ if $\big| |\omega| - \omega_c \big| \ge \omega_o$.

A narrowband signal arises when a sinusoidal signal is modulated by a baseband signal, as shown next.
Let $u$ and $v$ be real-valued baseband signals, each with one-sided bandwidth less than $\omega_o$, as defined at the beginning of the previous section. Define a signal $x$ by

$$x(t) = u(t)\cos(\omega_c t) - v(t)\sin(\omega_c t). \tag{8.30}$$

Since $\cos(\omega_c t) = (e^{j\omega_c t} + e^{-j\omega_c t})/2$ and $-\sin(\omega_c t) = (je^{j\omega_c t} - je^{-j\omega_c t})/2$, (8.30) becomes

$$\widehat{x}(\omega) = \frac{1}{2} \left\{ \widehat{u}(\omega - \omega_c) + \widehat{u}(\omega + \omega_c) + j\widehat{v}(\omega - \omega_c) - j\widehat{v}(\omega + \omega_c) \right\} \tag{8.31}$$

Graphically, $\widehat{x}$ is obtained by sliding $\frac{1}{2}\widehat{u}$ to the right by $\omega_c$, $\frac{1}{2}\widehat{u}$ to the left by $\omega_c$, $\frac{j}{2}\widehat{v}$ to the right by $\omega_c$, and $\frac{-j}{2}\widehat{v}$ to the left by $\omega_c$, and then adding. Of course $x$ is real-valued by its definition. The reader is encouraged to verify from (8.31) that $\widehat{x}(\omega) = \widehat{x}^*(-\omega)$. Equation (8.31) shows that indeed $x$ is a narrowband signal.

A convenient alternative expression for $x$ is obtained by defining a complex valued baseband signal $z$ by $z(t) = u(t) + jv(t)$. Then $x(t) = \mathrm{Re}(z(t)e^{j\omega_c t})$. It is a good idea to keep in mind the case that $\omega_c$ is much larger than $\omega_o$ (written $\omega_c \gg \omega_o$). Then $z$ varies slowly compared to the complex sinusoid $e^{j\omega_c t}$. In a small neighborhood of a fixed time $t$, $x$ is approximately a sinusoid with frequency $\omega_c$, peak amplitude $|z(t)|$, and phase given by the argument of $z(t)$. The signal $z$ is called the complex envelope of $x$ and $|z(t)|$ is called the real envelope of $x$.

So far we have shown that a real-valued narrowband signal $x$ results from modulating sinusoidal functions by a pair of real-valued baseband signals, or equivalently, modulating a complex sinusoidal function by a complex-valued baseband signal. Does every real-valued narrowband signal have such a representation? The answer is yes, as we now show. Let $x$ be a real-valued narrowband signal with finite energy. One attempt to obtain a baseband signal from $x$ is to consider $e^{-j\omega_c t}x(t)$.
This has Fourier transform $\widehat{x}(\omega + \omega_c)$, and the graph of this transform is obtained by sliding the graph of $\widehat{x}(\omega)$ to the left by $\omega_c$. As desired, that shifts the portion of $\widehat{x}$ in the upper band to the baseband interval $(-\omega_o, \omega_o)$. However, the portion of $\widehat{x}$ in the lower band gets shifted to an interval centered about $-2\omega_c$, so that $e^{-j\omega_c t}x(t)$ is not a baseband signal.

An elegant solution to this problem is to use the Hilbert transform of $x$, denoted by $\check{x}$. By definition, $\check{x}$ is the signal with Fourier transform $-j\,\mathrm{sgn}(\omega)\widehat{x}(\omega)$, where

$$\mathrm{sgn}(\omega) = \begin{cases} 1 & \omega > 0 \\ 0 & \omega = 0 \\ -1 & \omega < 0 \end{cases}$$

Therefore $\check{x}$ can be viewed as the result of passing $x$ through a linear, time-invariant system with transfer function $-j\,\mathrm{sgn}(\omega)$ as pictured in Figure 8.9. Since this transfer function satisfies $H^*(\omega) = H(-\omega)$, the output signal $\check{x}$ is again real-valued. In addition, $|H(\omega)| = 1$ for all $\omega$, except $\omega = 0$, so that the Fourier transforms of $x$ and $\check{x}$ have the same magnitude for all nonzero $\omega$. In particular, $x$ and $\check{x}$ have equal energies.

Figure 8.9: The Hilbert transform as a linear, time-invariant system.

Consider the Fourier transform of $x + j\check{x}$. It is equal to $2\widehat{x}(\omega)$ in the upper band and it is zero elsewhere. Thus, $z$ defined by $z(t) = (x(t) + j\check{x}(t))e^{-j\omega_c t}$ is a baseband complex valued signal. Note that $x(t) = \mathrm{Re}(x(t)) = \mathrm{Re}(x(t) + j\check{x}(t))$, or equivalently

$$x(t) = \mathrm{Re}\left( z(t)e^{j\omega_c t} \right) \tag{8.32}$$

If we let $u(t) = \mathrm{Re}(z(t))$ and $v(t) = \mathrm{Im}(z(t))$, then $u$ and $v$ are real-valued baseband signals such that $z(t) = u(t) + jv(t)$, and (8.32) becomes (8.30).

In summary, any finite energy real-valued narrowband signal $x$ can be represented as (8.30) or (8.32), where $z(t) = u(t) + jv(t)$. The Fourier transform $\widehat{z}$ can be expressed in terms of $\widehat{x}$ by

$$\widehat{z}(\omega) = \begin{cases} 2\widehat{x}(\omega + \omega_c) & |\omega| \le \omega_o \\ 0 & \text{else,} \end{cases} \tag{8.33}$$

and $\widehat{u}$ is the Hermitian symmetric part of $\widehat{z}$ and $\widehat{v}$ is $-j$ times the Hermitian antisymmetric part of $\widehat{z}$:

$$\widehat{u}(\omega) = \frac{1}{2} \left( \widehat{z}(\omega) + \widehat{z}^*(-\omega) \right) \qquad \widehat{v}(\omega) = \frac{-j}{2} \left( \widehat{z}(\omega) - \widehat{z}^*(-\omega) \right)$$
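The recovery of the complex envelope can be tried out on sampled data. The sketch below is a discrete, periodic analogue under stated assumptions: NumPy is assumed, the signals $u$, $v$ and the carrier frequency are hypothetical choices that are exactly periodic over the sampling window, and the Hilbert transform is implemented by multiplying the DFT by $-j\,\mathrm{sgn}$ of the bin frequency. With these choices the DFT-based transform is exact up to rounding, which is not true for general finite records.

```python
import numpy as np

N = 1024
t = np.arange(N) / N                         # one second sampled at N Hz (illustrative)
fc = 100                                     # carrier frequency in Hz (illustrative)
u = np.cos(2 * np.pi * 3 * t)                # baseband components, |f| <= 5 Hz
v = 0.5 + np.sin(2 * np.pi * 5 * t)
x = u * np.cos(2 * np.pi * fc * t) - v * np.sin(2 * np.pi * fc * t)   # as in (8.30)

def hilbert_transform(signal):
    """Apply the transfer function -j sgn(omega) in the DFT domain."""
    spectrum = np.fft.fft(signal)
    H = -1j * np.sign(np.fft.fftfreq(len(signal)))
    return np.real(np.fft.ifft(H * spectrum))

x_check = hilbert_transform(x)
z = (x + 1j * x_check) * np.exp(-2j * np.pi * fc * t)   # complex envelope
u_rec, v_rec = z.real, z.imag                            # recovered u and v
```

Here `u_rec` and `v_rec` reproduce `u` and `v`, and `np.abs(z)` gives the real envelope of `x`.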
In the other direction, $\widehat{x}$ can be expressed in terms of $\widehat{u}$ and $\widehat{v}$ by (8.31).

If $x_1$ and $x_2$ are each narrowband signals with corresponding complex envelope processes $z_1$ and $z_2$, then the convolution $x = x_1 * x_2$ is again a narrowband signal, and the corresponding complex envelope is $\frac{1}{2} z_1 * z_2$. To see this, note that the Fourier transform, $\widehat{z}$, of the complex envelope $z$ for $x$ is given by (8.33). Similar equations hold for $\widehat{z}_i$ in terms of $\widehat{x}_i$ for $i = 1, 2$. Using these equations and the fact $\widehat{x}(\omega) = \widehat{x}_1(\omega)\widehat{x}_2(\omega)$, it is readily seen that $\widehat{z}(\omega) = \frac{1}{2}\widehat{z}_1(\omega)\widehat{z}_2(\omega)$ for all $\omega$, establishing the claim. Thus, the analysis of linear, time invariant filtering of narrowband signals can be carried out in the baseband equivalent setting.

A similar development is considered next for WSS random processes. Let $U$ and $V$ be jointly WSS real-valued baseband random processes, and let $X$ be defined by

$$X_t = U_t\cos(\omega_c t) - V_t\sin(\omega_c t) \tag{8.34}$$

or equivalently, defining $Z_t$ by $Z_t = U_t + jV_t$,

$$X_t = \mathrm{Re}\left( Z_t e^{j\omega_c t} \right) \tag{8.35}$$

In some sort of generalized sense, we expect that $X$ is a narrowband process. However, such an $X$ need not even be WSS. Let us find the conditions on $U$ and $V$ that make $X$ WSS. First, in order that $\mu_X(t)$ not depend on $t$, it must be that $\mu_U = \mu_V = 0$.

Using the notation $c_t = \cos(\omega_c t)$, $s_t = \sin(\omega_c t)$, and $\tau = a - b$,

$$R_X(a,b) = R_U(\tau)c_a c_b - R_{UV}(\tau)c_a s_b - R_{VU}(\tau)s_a c_b + R_V(\tau)s_a s_b.$$

Using the trigonometric identities such as $c_a c_b = (c_{a-b} + c_{a+b})/2$, this can be rewritten as

\begin{align}
R_X(a,b) = {} & \left( \frac{R_U(\tau) + R_V(\tau)}{2} \right) c_{a-b} + \left( \frac{R_{UV}(\tau) - R_{VU}(\tau)}{2} \right) s_{a-b} \\
& + \left( \frac{R_U(\tau) - R_V(\tau)}{2} \right) c_{a+b} - \left( \frac{R_{UV}(\tau) + R_{VU}(\tau)}{2} \right) s_{a+b}.
\end{align}

Therefore, in order that $R_X(a,b)$ is a function of $a - b$, it must be that $R_U = R_V$ and $R_{UV} = -R_{VU}$. Since in general $R_{UV}(\tau) = R_{VU}(-\tau)$, the condition $R_{UV} = -R_{VU}$ means that $R_{UV}$ is an odd function: $R_{UV}(\tau) = -R_{UV}(-\tau)$.

We summarize the results as a proposition.
Proposition 8.5.1 Suppose $X$ is given by (8.34) or (8.35), where $U$ and $V$ are jointly WSS. Then $X$ is WSS if and only if $U$ and $V$ are mean zero with $R_U = R_V$ and $R_{UV} = -R_{VU}$. Equivalently, $X$ is WSS if and only if $Z = U + jV$ is mean zero and $E[Z_a Z_b] = 0$ for all $a, b$. If $X$ is WSS then
\[ R_X(\tau) = R_U(\tau)\cos(\omega_c\tau) + R_{UV}(\tau)\sin(\omega_c\tau) \]
\[ S_X(\omega) = \frac{1}{2}\left[ S_U(\omega-\omega_c) + S_U(\omega+\omega_c) - jS_{UV}(\omega-\omega_c) + jS_{UV}(\omega+\omega_c) \right] \]
and, with $R_Z(\tau)$ defined by $R_Z(a-b) = E[Z_a Z_b^*]$,
\[ R_X(\tau) = \frac{1}{2}\mathrm{Re}\left( R_Z(\tau)e^{j\omega_c\tau} \right) \]
The functions $S_X$, $S_U$, and $S_V$ are nonnegative, even functions, and $S_{UV}$ is a purely imaginary odd function (i.e. $S_{UV}(\omega) = j\,\mathrm{Im}(S_{UV}(\omega)) = -S_{UV}(-\omega)$).

Let $X$ be any WSS real-valued random process with a spectral density $S_X$, and continue to let $\omega_c > \omega_o > 0$. Then $X$ is defined to be a narrowband random process if $S_X(\omega) = 0$ whenever $|\,|\omega| - \omega_c\,| \ge \omega_o$. Equivalently, $X$ is a narrowband random process if $R_X(t)$ is a narrowband function. We've seen how such a process can be obtained by modulating a pair of jointly WSS baseband random processes $U$ and $V$. We show next that all narrowband random processes have such a representation.

To proceed as in the case of deterministic signals, we first wish to define the Hilbert transform of $X$, denoted by $\check{X}$. A slight concern about defining $\check{X}$ is that the function $-j\,\mathrm{sgn}(\omega)$ does not have finite energy. However, we can replace this function by the function given by
\[ H(\omega) = -j\,\mathrm{sgn}(\omega)I_{\{|\omega| \le \omega_o+\omega_c\}}, \]
which has finite energy and has a real-valued inverse transform $h$. Define $\check{X}$ as the output when $X$ is passed through the linear system with impulse response $h$. Since $X$ and $h$ are real valued, the random process $\check{X}$ is also real valued. As in the deterministic case, define random processes $Z$, $U$, and $V$ by $Z_t = (X_t + j\check{X}_t)e^{-j\omega_c t}$, $U_t = \mathrm{Re}(Z_t)$, and $V_t = \mathrm{Im}(Z_t)$.
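The definitions of $\check{X}$, $Z$, $U$, and $V$ just given can be tried out on a simulated sample path. The sketch below is illustrative only and rests on several assumptions not made in the notes: the process is replaced by a finite, periodic sample path, the narrowband process is produced by an ideal band-pass filter applied to white Gaussian noise in the DFT domain, and the values of `fs`, `fc`, and `fo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, fs = 4096, 4096.0                 # one second of samples, 1 Hz DFT resolution
fc, fo = 400.0, 50.0                 # hypothetical carrier and band half-width (Hz)
t = np.arange(n) / fs
f = np.fft.fftfreq(n, d=1.0 / fs)    # signed DFT frequencies in Hz

# Narrowband sample path: white Gaussian noise through an ideal band-pass
# filter that keeps only frequencies with | |f| - fc | < fo.
band = np.abs(np.abs(f) - fc) < fo
x = np.real(np.fft.ifft(np.fft.fft(rng.standard_normal(n)) * band))

# Band-limited Hilbert filter H(f) = -j sgn(f) on |f| <= fo + fc, zero elsewhere.
H = -1j * np.sign(f) * (np.abs(f) <= fo + fc)
x_check = np.real(np.fft.ifft(np.fft.fft(x) * H))

# Complex envelope and its real and imaginary parts.
z = (x + 1j * x_check) * np.exp(-2j * np.pi * fc * t)
u, v = z.real, z.imag

# Residual spectrum of z outside the baseband interval |f| < fo.
out_of_band = np.max(np.abs(np.fft.fft(z)[np.abs(f) >= fo]))
print(out_of_band)
```

In this discrete setting `x` equals `u*cos - v*sin` up to rounding, and the DFT of `z` is negligible outside $|f| < f_o$, which is the sample-path counterpart of $U$ and $V$ being baseband.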
Proposition 8.5.2 Let $X$ be a narrowband WSS random process, with spectral density $S_X$ satisfying $S_X(\omega) = 0$ unless $\omega_c - \omega_o \le |\omega| \le \omega_c + \omega_o$, where $\omega_o < \omega_c$. Then $\mu_X = 0$ and the following representations hold:
\[ X_t = \mathrm{Re}(Z_t e^{j\omega_c t}) = U_t\cos(\omega_c t) - V_t\sin(\omega_c t) \]
where $Z_t = U_t + jV_t$, and $U$ and $V$ are jointly WSS real-valued random processes with mean zero and
\[ S_U(\omega) = S_V(\omega) = \left[ S_X(\omega-\omega_c) + S_X(\omega+\omega_c) \right] I_{\{|\omega| \le \omega_o\}} \tag{8.36} \]
and
\[ S_{UV}(\omega) = j\left[ S_X(\omega+\omega_c) - S_X(\omega-\omega_c) \right] I_{\{|\omega| \le \omega_o\}} \tag{8.37} \]
Equivalently,
\[ R_U(\tau) = R_V(\tau) = R_X(\tau)\cos(\omega_c\tau) + \check{R}_X(\tau)\sin(\omega_c\tau) \tag{8.38} \]
and
\[ R_{UV}(\tau) = R_X(\tau)\sin(\omega_c\tau) - \check{R}_X(\tau)\cos(\omega_c\tau) \tag{8.39} \]

Proof To show that $\mu_X = 0$, consider passing $X$ through a linear, time-invariant system with transfer function $K(\omega) = 1$ if $\omega$ is in either the upper band or lower band, and $K(\omega) = 0$ otherwise. Let $Y$ denote the output and, for this step only, let $h$ denote the impulse response of this system. Then $\mu_Y = \mu_X\int_{-\infty}^{\infty} h(\tau)\,d\tau = \mu_X K(0) = 0$. Since $K(\omega) = 1$ for all $\omega$ such that $S_X(\omega) > 0$, it follows that $R_X = R_Y = R_{XY} = R_{YX}$. Therefore $E[|X_t - Y_t|^2] = 0$ so that $X_t$ has the same mean as $Y_t$, namely zero, as claimed.

By the definitions of the processes $Z$, $U$, and $V$, using the notation $c_t = \cos(\omega_c t)$ and $s_t = \sin(\omega_c t)$, we have
\[ U_t = X_t c_t + \check{X}_t s_t \qquad V_t = -X_t s_t + \check{X}_t c_t \]
The remainder of the proof consists of computing $R_U$, $R_V$, and $R_{UV}$ as functions of two variables, because it is not yet clear that $U$ and $V$ are jointly WSS.

By the fact $X$ is WSS and the definition of $\check{X}$, the processes $X$ and $\check{X}$ are jointly WSS, and the various spectral densities are given by
\[ S_{\check{X}X} = HS_X \qquad S_{X\check{X}} = H^*S_X = -HS_X \qquad S_{\check{X}} = |H|^2 S_X = S_X \]
Therefore,
\[ R_{\check{X}X} = \check{R}_X \qquad R_{X\check{X}} = -\check{R}_X \qquad R_{\check{X}} = R_X \]
Thus, for real numbers $a$ and $b$,
\[ R_U(a,b) = E\left[ \left( X(a)c_a + \check{X}(a)s_a \right)\left( X(b)c_b + \check{X}(b)s_b \right) \right] \]
\[ = R_X(a-b)(c_a c_b + s_a s_b) + \check{R}_X(a-b)(s_a c_b - c_a s_b) \]
\[ = R_X(a-b)c_{a-b} + \check{R}_X(a-b)s_{a-b} \]
Thus, $R_U(a,b)$ is a function of $a - b$, and $R_U(\tau)$ is given by the right side of (8.38).
The proof that $R_V$ also satisfies (8.38), and the proof of (8.39) are similar. Finally, it is a simple matter to derive (8.36) and (8.37) from (8.38) and (8.39), respectively. □

Equations (8.36) and (8.37) have simple graphical interpretations, as illustrated in Figure 8.10. Equation (8.36) means that $S_U$ and $S_V$ are each equal to the sum of the upper lobe of $S_X$ shifted to the left by $\omega_c$ and the lower lobe of $S_X$ shifted to the right by $\omega_c$. Similarly, equation (8.37) means that $S_{UV}$ is equal to the sum of $j$ times the upper lobe of $S_X$ shifted to the left by $\omega_c$ and $-j$ times the lower lobe of $S_X$ shifted to the right by $\omega_c$. Equivalently, $S_U$ and $S_V$ are each twice the symmetric part of the upper lobe of $S_X$, and $S_{UV}$ is $2j$ times the antisymmetric part of the upper lobe of $S_X$. Since $R_{UV}$ is an odd function of $\tau$, it follows that $R_{UV}(0) = 0$. Thus, for any fixed time $t$, $U_t$ and $V_t$ are uncorrelated. That does not imply that $U_s$ and $V_t$ are uncorrelated for all $s$ and $t$, for the cross correlation function $R_{UV}$ is identically zero if and only if the upper lobe of $S_X$ is symmetric about $\omega_c$.

[Figure 8.10: A narrowband power spectral density and associated baseband spectral densities.]

Example 8.5.3 (Baseband equivalent filtering of a random process) As noted above, filtering of narrowband deterministic signals can be described using equivalent baseband signals, namely the complex envelopes. The same is true for filtering of narrowband random processes. Suppose $X$ is a narrowband WSS random process, suppose $g$ is a finite energy narrowband signal, and suppose $Y$ is the output process when $X$ is filtered using impulse response function $g$. Then $Y$ is also a WSS narrowband random process.
Let $Z$ denote the complex envelope of $X$, given in Proposition 8.5.2, and let $z_g$ denote the complex envelope signal of $g$, meaning that $z_g$ is the complex baseband signal such that $g(t) = \mathrm{Re}(z_g(t)e^{j\omega_c t})$. It can be shown that the complex envelope process of $Y$ is $\frac{1}{2} z_g * Z$.^1 Thus, the filtering of $X$ by $g$ is equivalent to the filtering of $Z$ by $\frac{1}{2} z_g$.

Footnote 1: An elegant proof of this fact is based on spectral representation theory for WSS random processes, covered for example in Doob, Stochastic Processes, Wiley, 1953. The basic idea is to define the Fourier transform of a WSS random process, which, like white noise, is a generalized random process. Then essentially the same method we described for filtering of deterministic narrowband signals works.

Example 8.5.4 (Simulation of a narrowband random process) Let $\omega_o$ and $\omega_c$ be positive numbers with $0 < \omega_o < \omega_c$. Suppose $S_X$ is a nonnegative function which is even (i.e. $S_X(\omega) = S_X(-\omega)$ for all $\omega$) with $S_X(\omega) = 0$ if $|\,|\omega| - \omega_c\,| \ge \omega_o$. We discuss briefly the problem of writing a computer simulation to generate a real-valued WSS random process $X$ with power spectral density $S_X$.

By Proposition 8.5.1, it suffices to simulate baseband random processes $U$ and $V$ with the power spectral densities specified by (8.36) and cross power spectral density specified by (8.37). For increased tractability, we impose an additional assumption on $S_X$, namely that the upper lobe of $S_X$ is symmetric about $\omega_c$. This assumption is equivalent to the assumption that $S_{UV}$ vanishes, and therefore that the processes $U$ and $V$ are uncorrelated with each other. Thus, the processes $U$ and $V$ can be generated independently.

In turn, the processes $U$ and $V$ can be simulated by first generating sequences of random variables $U_{nT}$ and $V_{nT}$ for sampling frequency $\frac{1}{T} = 2f_o = \frac{\omega_o}{\pi}$. A discrete time random process with power spectral density $S_U$ can be generated by passing a discrete-time white noise sequence
with unit variance through a discrete-time linear time-invariant system with real-valued impulse response function such that the transfer function $H$ satisfies $S_U = |H|^2$. For example, taking $H(\omega) = \sqrt{S_U(\omega)}$ works, though it might not be the most well behaved linear system. (The problem of finding a transfer function $H$ with additional properties such that $S_U = |H|^2$ is called the problem of spectral factorization, which we shall return to in the next chapter.) The samples $V_{kT}$ can be generated similarly.

For a specific example, suppose that (using kHz for kilohertz, or thousands of Hertz)
\[ S_X(2\pi f) = \begin{cases} 1 & 9{,}000 \text{ kHz} < |f| < 9{,}020 \text{ kHz} \\ 0 & \text{else.} \end{cases} \tag{8.40} \]
Notice that the parameters $\omega_o$ and $\omega_c$ are not uniquely determined by $S_X$. They must simply be positive numbers with $\omega_o < \omega_c$ such that
\[ (9{,}000 \text{ kHz},\ 9{,}020 \text{ kHz}) \subset (f_c - f_o,\ f_c + f_o) \]
However, only the choice $f_c = 9{,}010$ kHz makes the upper lobe of $S_X$ symmetric around $f_c$. Therefore we take $f_c = 9{,}010$ kHz. We take the minimum allowable value for $f_o$, namely $f_o = 10$ kHz. For this choice, (8.36) yields
\[ S_U(2\pi f) = S_V(2\pi f) = \begin{cases} 2 & |f| < 10 \text{ kHz} \\ 0 & \text{else} \end{cases} \tag{8.41} \]
and (8.37) yields $S_{UV}(2\pi f) = 0$ for all $f$. The processes $U$ and $V$ are continuous-time baseband random processes with one-sided bandwidth limit 10 kHz. To simulate these processes it is therefore enough to generate samples of them with sampling period $T = 0.5 \times 10^{-4}$ seconds, and then use the Nyquist sampling representation described in Section 8.4. The processes of samples will, according to (8.26), have power spectral density equal to $4 \times 10^4$ over the interval $[-\pi, \pi]$. Consequently, the samples, written $A_k$ for $U_{kT}$ and $B_k$ for $V_{kT}$, can be taken to be uncorrelated with $E[|A_k|^2] = E[|B_k|^2] = 4 \times 10^4$. For example, these variables can be taken to be independent real Gaussian random variables.
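The sample-generation step just described, followed by sinc interpolation and modulation onto the carrier as in (8.34), can be sketched in a few lines. This is only an illustration under stated assumptions: the infinite interpolation sums are truncated to indices $|n| \le$ `n_half` (a hypothetical parameter, not from the notes), the samples are taken to be independent real Gaussian variables, and `np.sinc(u)` is used for $\sin(\pi u)/(\pi u)$.

```python
import numpy as np

f_c = 9.010e6            # carrier frequency in Hz (9,010 kHz)
f_o = 10e3               # one-sided baseband limit in Hz (10 kHz)
T = 1.0 / (2 * f_o)      # sampling period, 0.5e-4 seconds
var = 4e4                # E[A_k^2] = E[B_k^2]

def simulate_narrowband(t, n_half, rng):
    """Approximate X_t at the times in t using samples n = -n_half..n_half.

    The truncation to finitely many terms is an assumption of this sketch.
    """
    n = np.arange(-n_half, n_half + 1)
    A = rng.normal(0.0, np.sqrt(var), size=n.size)   # samples of U
    B = rng.normal(0.0, np.sqrt(var), size=n.size)   # samples of V
    # kernel[i, k] = sinc((t_i - n_k T) / T), with sinc(u) = sin(pi u)/(pi u)
    kernel = np.sinc((t[:, None] - n[None, :] * T) / T)
    U = kernel @ A
    V = kernel @ B
    X = U * np.cos(2 * np.pi * f_c * t) - V * np.sin(2 * np.pi * f_c * t)
    return X, U, V, A, B

# Example use: a short stretch of X on a grid fine enough to resolve the carrier.
rng = np.random.default_rng(0)
t_fine = np.linspace(0.0, 2e-6, 200)
X, U, V, A, B = simulate_narrowband(t_fine, n_half=200, rng=rng)
```

At the sampling instants $t = kT$ the interpolated baseband value reduces to the sample $A_k$ itself, which gives a quick sanity check on the interpolation kernel.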
Putting the steps together, we find the following representation for $X$:
\[ X_t = \cos(\omega_c t)\left( \sum_{n=-\infty}^{\infty} A_n\,\mathrm{sinc}\left( \frac{t-nT}{T} \right) \right) - \sin(\omega_c t)\left( \sum_{n=-\infty}^{\infty} B_n\,\mathrm{sinc}\left( \frac{t-nT}{T} \right) \right) \]

8.6 Complexification, Part II

A complex random variable $Z$ is said to be circularly symmetric if $Z$ has the same distribution as $e^{j\theta}Z$ for every real value of $\theta$. If $Z$ has a pdf $f_Z$, circular symmetry of $Z$ means that $f_Z(z)$ is invariant under rotations about zero, or, equivalently, $f_Z(z)$ depends on $z$ only through $|z|$. A collection of random variables $(Z_i : i \in I)$ is said to be jointly circularly symmetric if for every real value of $\theta$, the collection $(Z_i : i \in I)$ has the same finite dimensional distributions as the collection $(Z_i e^{j\theta} : i \in I)$. Note that if $(Z_i : i \in I)$ is jointly circularly symmetric, and if $(Y_j : j \in J)$ is another collection of random variables such that each $Y_j$ is a linear combination of $Z_i$'s (with no constants added in) then the collection $(Y_j : j \in J)$ is also jointly circularly symmetric.

Recall that a complex random vector $Z$, expressed in terms of real random vectors $U$ and $V$ as $Z = U + jV$, has mean $EZ = EU + jEV$ and covariance matrix $\mathrm{Cov}(Z) = E[(Z - EZ)(Z - EZ)^*]$. The pseudo-covariance matrix of $Z$ is defined by $\mathrm{Cov}^p(Z) = E[(Z - EZ)(Z - EZ)^T]$, and it differs from the covariance of $Z$ in that a transpose, rather than a Hermitian transpose, is involved. Note that $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$ are readily expressed in terms of $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U,V)$ as:
\[ \mathrm{Cov}(Z) = \mathrm{Cov}(U) + \mathrm{Cov}(V) + j\left( \mathrm{Cov}(V,U) - \mathrm{Cov}(U,V) \right) \]
\[ \mathrm{Cov}^p(Z) = \mathrm{Cov}(U) - \mathrm{Cov}(V) + j\left( \mathrm{Cov}(V,U) + \mathrm{Cov}(U,V) \right) \]
where $\mathrm{Cov}(V,U) = \mathrm{Cov}(U,V)^T$. Conversely,
\[ \mathrm{Cov}(U) = \mathrm{Re}\left( \mathrm{Cov}(Z) + \mathrm{Cov}^p(Z) \right)/2, \qquad \mathrm{Cov}(V) = \mathrm{Re}\left( \mathrm{Cov}(Z) - \mathrm{Cov}^p(Z) \right)/2, \]
and
\[ \mathrm{Cov}(U,V) = \mathrm{Im}\left( -\mathrm{Cov}(Z) + \mathrm{Cov}^p(Z) \right)/2. \]
The vector $Z$ is defined to be Gaussian if the random vectors $U$ and $V$ are jointly Gaussian.

Suppose that $Z$ is a complex Gaussian random vector.
Then its distribution is fully determined by its mean and the matrices $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U,V)$, or equivalently by its mean and the matrices $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$. Therefore, for a real value of $\theta$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if they have the same mean, covariance matrix, and pseudo-covariance matrix. Since $E[e^{j\theta}Z] = e^{j\theta}EZ$, $\mathrm{Cov}(e^{j\theta}Z) = \mathrm{Cov}(Z)$, and $\mathrm{Cov}^p(e^{j\theta}Z) = e^{j2\theta}\mathrm{Cov}^p(Z)$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if $(e^{j\theta} - 1)EZ = 0$ and $(e^{j2\theta} - 1)\mathrm{Cov}^p(Z) = 0$. Hence, if $\theta$ is not a multiple of $\pi$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if $EZ = 0$ and $\mathrm{Cov}^p(Z) = 0$. Consequently, a Gaussian random vector $Z$ is circularly symmetric if and only if its mean vector and pseudo-covariance matrix are zero.

The joint density function of a circularly symmetric complex Gaussian random vector $Z$ with $n$ complex dimensions and covariance matrix $K$, with $\det K \ne 0$, has the particularly elegant form:
\[ f_Z(z) = \frac{\exp(-z^*K^{-1}z)}{\pi^n\det(K)}. \tag{8.42} \]
Equation (8.42) can be derived in the same way the density for Gaussian vectors with real components is derived. Namely, (8.42) is easy to verify if $K$ is diagonal. If $K$ is not diagonal, the Hermitian symmetric positive definite matrix $K$ can be expressed as $K = U\Lambda U^*$, where $U$ is a unitary matrix and $\Lambda$ is a diagonal matrix with strictly positive diagonal entries. The random vector $Y$ defined by $Y = U^*Z$ is Gaussian and circularly symmetric with covariance matrix $\Lambda$, and since $\det(\Lambda) = \det(K)$, it has pdf
\[ f_Y(y) = \frac{\exp(-y^*\Lambda^{-1}y)}{\pi^n\det(K)}. \]
Since $|\det(U)| = 1$, $f_Z(z) = f_Y(U^*z)$, which yields (8.42).

Let us switch now to random processes. Let $Z$ be a complex-valued random process and let $U$ and $V$ be the real-valued random processes such that $Z_t = U_t + jV_t$. Recall that $Z$ is Gaussian if $U$ and $V$ are jointly Gaussian, and the covariance function of $Z$ is defined by $C_Z(s,t) = \mathrm{Cov}(Z_s, Z_t)$.
The pseudo-covariance function of $Z$ is defined by $C^p_Z(s,t) = \mathrm{Cov}^p(Z_s, Z_t)$. As for covariance matrices of vectors, both $C_Z$ and $C^p_Z$ are needed to determine $C_U$, $C_V$, and $C_{UV}$.

Following the vast majority of the literature, we define $Z$ to be wide sense stationary (WSS) if $\mu_Z(t)$ is constant and if $C_Z(s,t)$ (or $R_Z(s,t)$) is a function of $s - t$ alone. Some authors use a stronger definition of WSS, by defining $Z$ to be WSS if either of the following two equivalent conditions is satisfied:
• $\mu_Z(t)$ is constant, and both $C_Z(s,t)$ and $C^p_Z(s,t)$ are functions of $s - t$
• $U$ and $V$ are jointly WSS
If $Z$ is Gaussian then it is stationary if and only if it satisfies the stronger definition of WSS.

A complex random process $Z = (Z_t : t \in T)$ is called circularly symmetric if the random variables of the process, $(Z_t : t \in T)$, are jointly circularly symmetric. If $Z$ is a complex Gaussian random process, it is circularly symmetric if and only if it has mean zero and $C^p_Z(s,t) = 0$ for all $s, t$. Proposition 8.5.2 shows that the baseband equivalent process $Z$ for a Gaussian real-valued narrowband WSS random process $X$ is circularly symmetric. Nearly all complex valued random processes in applications arise in this fashion. For circularly symmetric complex random processes, the definition of WSS we adopted, and the stronger definition mentioned in the previous paragraph, are equivalent. A circularly symmetric complex Gaussian random process is stationary if and only if it is WSS.

The interested reader can find more related to the material in this section in Neeser and Massey, "Proper Complex Random Processes with Applications to Information Theory," IEEE Transactions on Information Theory, vol. 39, no. 4, July 1993.

8.7 Problems

8.1 On filtering a WSS random process
Suppose $Y$ is the output of a linear time-invariant system with WSS input $X$, impulse response function $h$, and transfer function $H$.
Indicate whether the following statements are true or false. Justify your answers.
(a) If $|H(\omega)| \le 1$ for all $\omega$ then the power of $Y$ is less than or equal to the power of $X$.
(b) If $X$ is periodic (in addition to being WSS) then $Y$ is WSS and periodic.
(c) If $X$ has mean zero and strictly positive total power, and if $\|h\|^2 > 0$, then the output power is strictly positive.

8.2 On the cross spectral density
Suppose $X$ and $Y$ are jointly WSS such that the power spectral densities $S_X$, $S_Y$, and $S_{XY}$ are continuous. Show that for each $\omega$, $|S_{XY}(\omega)|^2 \le S_X(\omega)S_Y(\omega)$. Hint: Fix $\omega_o$, let $\epsilon > 0$, and let $J^{\epsilon}$ denote the interval of length $\epsilon$ centered at $\omega_o$. Consider passing both $X$ and $Y$ through a linear time-invariant system with transfer function $H^{\epsilon}(\omega) = I_{J^{\epsilon}}(\omega)$. Apply the Schwarz inequality to the output processes sampled at a fixed time, and let $\epsilon \to 0$.

8.3 Modulating and filtering a stationary process
Let $X = (X_t : t \in \mathbb{Z})$ be a discrete-time mean-zero stationary random process with power $E[X_0^2] = 1$. Let $Y$ be the stationary discrete time random process obtained from $X$ by modulation as follows:
\[ Y_t = X_t\cos(80\pi t + \Theta), \]
where $\Theta$ is independent of $X$ and is uniformly distributed over $[0, 2\pi]$. Let $Z$ be the stationary discrete time random process obtained from $Y$ by the linear equations:
\[ Z_{t+1} = (1-a)Z_t + aY_{t+1} \]
for all $t$, where $a$ is a constant with $0 < a < 1$.
(a) Why is the random process $Y$ stationary?
(b) Express the autocorrelation function of $Y$, $R_Y(\tau) = E[Y_\tau Y_0]$, in terms of the autocorrelation function of $X$. Similarly, express the power spectral density of $Y$, $S_Y(\omega)$, in terms of the power spectral density of $X$, $S_X(\omega)$.
(c) Find and sketch the transfer function $H(\omega)$ for the linear system describing the mapping from $Y$ to $Z$.
(d) Can the power of $Z$ be arbitrarily large (depending on $a$)? Explain your answer.
(e) Describe an input $X$ satisfying the assumptions above so that the power of $Z$ is at least 0.5, for any value of $a$ with $0 < a < 1$.

8.4 Filtering a Gauss Markov process
Let $X = (X_t : -\infty < t < +\infty)$ be a stationary Gauss Markov process with mean zero and autocorrelation function $R_X(\tau) = \exp(-|\tau|)$. Define a random process $Y = (Y_t : t \in \mathbb{R})$ by the differential equation $\dot{Y}_t = X_t - Y_t$.
(a) Find the cross correlation function $R_{XY}$. Are $X$ and $Y$ jointly stationary?
(b) Find $E[Y_5 \mid X_5 = 3]$. What is the approximate numerical value?
(c) Is $Y$ a Gaussian random process? Justify your answer.
(d) Is $Y$ a Markov process? Justify your answer.

8.5 Slight smoothing
Suppose $Y$ is the output of the linear time-invariant system with input $X$ and impulse response function $h$, such that $X$ is WSS with $R_X(\tau) = \exp(-|\tau|)$, and $h(\tau) = \frac{1}{a} I_{\{|\tau| \le \frac{a}{2}\}}$ for $a > 0$. If $a$ is small, then $h$ approximates the delta function $\delta(\tau)$, and consequently $Y_t \approx X_t$. This problem explores the accuracy of the approximation.
(a) Find $R_{YX}(0)$, and use the power series expansion of $e^u$ to show that $R_{YX}(0) = 1 - \frac{a}{4} + o(a)$ as $a \to 0$. Here, $o(a)$ denotes any term such that $o(a)/a \to 0$ as $a \to 0$.
(b) Find $R_Y(0)$, and use the power series expansion of $e^u$ to show that $R_Y(0) = 1 - \frac{a}{3} + o(a)$ as $a \to 0$.
(c) Show that $E[|X_t - Y_t|^2] = \frac{a}{6} + o(a)$ as $a \to 0$.

8.6 A stationary two-state Markov process
Let $X = (X_k : k \in \mathbb{Z})$ be a stationary Markov process with state space $\mathcal{S} = \{1, -1\}$ and one-step transition probability matrix
\[ P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}, \]
where $0 < p < 1$. Find the mean, correlation function and power spectral density function of $X$. Hint: For nonnegative integers $k$:
\[ P^k = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} + (1-2p)^k \begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{pmatrix}. \]

8.7 A stationary two-state Markov process in continuous time
Let $X = (X_t : t \in \mathbb{R})$ be a stationary Markov process with state space $\mathcal{S} = \{1, -1\}$ and $Q$ matrix
\[ Q = \begin{pmatrix} -\alpha & \alpha \\ \alpha & -\alpha \end{pmatrix}, \]
where $\alpha > 0$. Find the mean, correlation function and power spectral density function of $X$.
(Hint: Recall from the example in the chapter on Markov processes that for $s < t$, the matrix of transition probabilities $p_{ij}(s,t)$ is given by $H(\tau)$, where $\tau = t - s$ and
\[ H(\tau) = \begin{pmatrix} \frac{1+e^{-2\alpha\tau}}{2} & \frac{1-e^{-2\alpha\tau}}{2} \\ \frac{1-e^{-2\alpha\tau}}{2} & \frac{1+e^{-2\alpha\tau}}{2} \end{pmatrix}.) \]

8.8 A linear estimation problem
Suppose $X$ and $Y$ are possibly complex valued jointly WSS processes with known autocorrelation functions, cross-correlation function, and associated spectral densities. Suppose $Y$ is passed through a linear time-invariant system with impulse response function $h$ and transfer function $H$, and let $Z$ be the output. The mean square error of estimating $X_t$ by $Z_t$ is $E[|X_t - Z_t|^2]$.
(a) Express the mean square error in terms of $R_X$, $R_Y$, $R_{XY}$ and $h$.
(b) Express the mean square error in terms of $S_X$, $S_Y$, $S_{XY}$ and $H$.
(c) Using your answer to part (b), find the choice of $H$ that minimizes the mean square error. (Hint: Try working out the problem first assuming the processes are real valued. For the complex case, note that for $\sigma^2 > 0$ and complex numbers $z$ and $z_o$, $\sigma^2|z|^2 - 2\mathrm{Re}(z^*z_o)$ is equal to $\left| \sigma z - \frac{z_o}{\sigma} \right|^2 - \frac{|z_o|^2}{\sigma^2}$, which is minimized with respect to $z$ by $z = \frac{z_o}{\sigma^2}$.)

8.9 Linear time invariant, uncorrelated scattering channel
A signal transmitted through a scattering environment can propagate over many different paths on its way to a receiver. The channel gains along distinct paths are often modeled as uncorrelated. The paths may differ in length, causing a delay spread. Let $h = (h_u : u \in \mathbb{Z})$ consist of uncorrelated, possibly complex valued random variables with mean zero and $E[|h_u|^2] = g_u$. Assume that $G = \sum_u g_u < \infty$. The variable $h_u$ is the random complex gain for delay $u$, and $g = (g_u : u \in \mathbb{Z})$ is the energy gain delay mass function with total gain $G$. Given a deterministic signal $x$, the channel output is the random signal $Y$ defined by $Y_i = \sum_{u=-\infty}^{\infty} h_u x_{i-u}$.
(a) Determine the mean and autocorrelation function for $Y$ in terms of $x$ and $g$.
(b) Express the average total energy of $Y$: $E[\sum_i Y_i^2]$, in terms of $x$ and $g$.
(c) Suppose instead that the input is a WSS random process $X$ with autocorrelation function $R_X$. The input $X$ is assumed to be independent of the channel $h$. Express the mean and autocorrelation function of the output $Y$ in terms of $R_X$ and $g$. Is $Y$ WSS?
(d) Since the impulse response function $h$ is random, so is its Fourier transform, $H = (H(\omega) : -\pi \le \omega \le \pi)$. Express the autocorrelation function of the random process $H$ in terms of $g$.

8.10 The accuracy of approximate differentiation
Let $X$ be a WSS baseband random process with power spectral density $S_X$, and let $\omega_o$ be the one-sided band limit of $X$. The process $X$ is m.s. differentiable and $X'$ can be viewed as the output of a time-invariant linear system with transfer function $H(\omega) = j\omega$.
(a) What is the power spectral density of $X'$?
(b) Let $Y_t = \frac{X_{t+a} - X_{t-a}}{2a}$, for some $a > 0$. We can also view $Y = (Y_t : t \in \mathbb{R})$ as the output of a time-invariant linear system, with input $X$. Find the impulse response function $k$ and transfer function $K$ of the linear system. Show that $K(\omega) \to j\omega$ as $a \to 0$.
(c) Let $D_t = X'_t - Y_t$. Find the power spectral density of $D$.
(d) Find a value of $a$, depending only on $\omega_o$, so that $E[|D_t|^2] \le (0.01)E[|X'_t|^2]$. In other words, for such $a$, the m.s. error of approximating $X'_t$ by $Y_t$ is less than one percent of $E[|X'_t|^2]$. You can use the fact that $0 \le 1 - \frac{\sin(u)}{u} \le \frac{u^2}{6}$ for all real $u$. (Hint: Find $a$ so that $S_D(\omega) \le (0.01)S_{X'}(\omega)$ for $|\omega| \le \omega_o$.)

8.11 Some linear transformations of some random processes
Let $U = (U_n : n \in \mathbb{Z})$ be a random process such that the variables $U_n$ are independent, identically distributed, with $E[U_n] = \mu$ and $\mathrm{Var}(U_n) = \sigma^2$, where $\mu \ne 0$ and $\sigma^2 > 0$. Please keep in mind that $\mu \ne 0$. Let $X = (X_n : n \in \mathbb{Z})$ be defined by $X_n = \sum_{k=0}^{\infty} U_{n-k}a^k$, for a constant $a$ with $0 < a < 1$.
(a) Is $X$ stationary?
Find the mean function $\mu_X$ and autocovariance function $C_X$ for $X$.
(b) Is $X$ a Markov process? (Hint: $X$ is not necessarily Gaussian. Does $X$ have a state representation driven by $U$?)
(c) Is $X$ mean ergodic in the m.s. sense?
Let $U$ be as before, and let $Y = (Y_n : n \in \mathbb{Z})$ be defined by $Y_n = \sum_{k=0}^{\infty} U_{n-k}A^k$, where $A$ is a random variable distributed on the interval $(0, 0.5)$ (the exact distribution is not specified), and $A$ is independent of the random process $U$.
(d) Is $Y$ stationary? Find the mean function $\mu_Y$ and autocovariance function $C_Y$ for $Y$. (Your answer may include expectations involving $A$.)
(e) Is $Y$ a Markov process? (Give a brief explanation.)
(f) Is $Y$ mean ergodic in the m.s. sense?

8.12 Filtering Poisson white noise
A Poisson random process $N = (N_t : t \ge 0)$ has independent increments. The derivative of $N$, written $N'$, does not exist as an ordinary random process, but it does exist as a generalized random process. Graphically, picture $N'$ as a superposition of delta functions, one at each arrival time of the Poisson process. As a generalized random process, $N'$ is stationary with mean and autocovariance functions given by $E[N'_t] = \lambda$, and $C_{N'}(s,t) = \lambda\delta(s-t)$, respectively, because, when integrated, these functions give the correct values for the mean and covariance of $N$: $E[N_t] = \int_0^t \lambda\,ds$ and $C_N(s,t) = \int_0^s\int_0^t \lambda\delta(u-v)\,dv\,du$. The random process $N'$ can be extended to be defined for negative times by augmenting the original random process $N$ by another rate $\lambda$ Poisson process for negative times. Then $N'$ can be viewed as a stationary random process, and its integral over intervals gives rise to a process $N(a,b]$ as described in Problem 4.17. (The process $N' - \lambda$ is a white noise process, in that it is a generalized random process which is stationary, mean zero, and has autocorrelation function $\lambda\delta(\tau)$. Both $N'$ and $N' - \lambda$ are called Poisson shot noise processes.
One application for such processes is modeling noise in small electronic devices, in which effects of single electrons can be registered. For the remainder of this problem, $N'$ is used instead of the mean zero version.) Let $X$ be the output when $N'$ is passed through a linear time-invariant filter with an impulse response function $h$, such that $\int_{-\infty}^{\infty} |h(t)|\,dt$ is finite. (Remark: In the special case that $h(t) = I_{\{0 \le t < 1\}}$, $X$ is the M/D/∞ process of Problem 4.17.)
(a) Find the mean function and covariance functions of $X$.
(b) Consider the special case that $h(t) = e^{-t}I_{\{t \ge 0\}}$. Explain why $X$ is a Markov process in this case. (Hint: What is the behavior of $X$ between the arrival times of the Poisson process? What does $X$ do at the arrival times?)

8.13 A linear system with a feedback loop
The system with input $X$ and output $Y$ involves feedback with the loop transfer function shown.

[Figure: block diagram of the feedback system, with input X, a summing junction, output Y, and the loop transfer function (label not recoverable from the source).]

(a) Find the transfer function $K$ of the system describing the mapping from $X$ to $Y$.
(b) Find the corresponding impulse response function.
(c) The power of $Y$ divided by the power of $X$ depends on the power spectral density, $S_X$. Find the supremum of this ratio, over all choices of $S_X$, and describe what choice of $S_X$ achieves this supremum.

8.14 Linear and nonlinear reconstruction from samples
Suppose $X_t = \sum_{n=-\infty}^{\infty} g(t-n-U)B_n$, where the $B_n$'s are independent with mean zero and variance $\sigma^2 > 0$, $g$ is a function with finite energy $\int |g(t)|^2\,dt$ and Fourier transform $G(\omega)$, $U$ is a random variable which is independent of $B$ and uniformly distributed on the interval $[0, 1]$. The process $X$ is a typical model for a digital baseband signal, where the $B_n$'s are random data symbols.
(a) Show that $X$ is WSS, with mean zero and $R_X(t) = \sigma^2 g * \bar{g}(t)$.
(b) Under what conditions on $G$ and $T$ can the sampling theorem be used to recover $X$ from its samples of the form $(X(nT) : n \in \mathbb{Z})$?
(c) Consider the particular case $g(t) = (1 - |t|)_+$ and $T = 0.5$.
Although this falls outside the conditions found in part (b), show that by using nonlinear operations, the process $X$ can be recovered from its samples of the form $(X(nT) : n \in \mathbb{Z})$. (Hint: Consider a sample path of $X$.)

8.15 Sampling a cubed Gaussian process
Let $X = (X_t : t \in \mathbb{R})$ be a baseband mean zero stationary real Gaussian random process with one-sided band limit $f_o$ Hz. Thus, $X_t = \sum_{n=-\infty}^{\infty} X_{nT}\,\mathrm{sinc}\left( \frac{t-nT}{T} \right)$ where $\frac{1}{T} = 2f_o$. Let $Y_t = X_t^3$ for each $t$.
(a) Is $Y$ stationary? Express $R_Y$ in terms of $R_X$, and $S_Y$ in terms of $S_X$ and/or $R_X$. (Hint: If $A$, $B$ are jointly Gaussian and mean zero, $\mathrm{Cov}(A^3, B^3) = 6\,\mathrm{Cov}(A,B)^3 + 9E[A^2]E[B^2]\mathrm{Cov}(A,B)$.)
(b) At what rate $\frac{1}{T'}$ should $Y$ be sampled in order that $Y_t = \sum_{n=-\infty}^{\infty} Y_{nT'}\,\mathrm{sinc}\left( \frac{t-nT'}{T'} \right)$?
(c) Can $Y$ be recovered with fewer samples than in part (b)? Explain.

8.16 An approximation of white noise
White noise in continuous time can be approximated by a piecewise constant process as follows. Let $T$ be a small positive constant, let $A_T$ be a positive scaling constant depending on $T$, and let $(B_k : k \in \mathbb{Z})$ be a discrete-time white noise process with $R_B(k) = \sigma^2 I_{\{k=0\}}$. Define $(N_t : t \in \mathbb{R})$ by $N_t = A_T B_k$ for $t \in [kT, (k+1)T)$.
(a) Sketch a typical sample path of $N$ and express $E[|\int_0^1 N_s\,ds|^2]$ in terms of $A_T$, $T$ and $\sigma^2$. For simplicity assume that $T = \frac{1}{K}$ for some large integer $K$.
(b) What choice of $A_T$ makes the expectation found in part (a) equal to $\sigma^2$? This choice makes $N$ a good approximation to a continuous-time white noise process with autocorrelation function $\sigma^2\delta(\tau)$.
(c) What happens to the expectation found in part (a) as $T \to 0$ if $A_T = 1$ for all $T$?

8.17 Simulating a baseband random process
Suppose a real-valued Gaussian baseband process $X = (X_t : t \in \mathbb{R})$ with mean zero and power spectral density
\[ S_X(2\pi f) = \begin{cases} 1 & \text{if } |f| \le 0.5 \\ 0 & \text{else} \end{cases} \]
is to be simulated over the time interval $[-500, 500]$ through use of the sampling theorem with sampling time $T = 1$.
(a) What is the joint distribution of the samples, $X_n : n \in \mathbb{Z}$?
(b) Of course a computer cannot generate infinitely many random variables in a finite amount of time. Therefore, consider approximating $X$ by $X^{(N)}$ defined by
\[ X^{(N)}_t = \sum_{n=-N}^{N} X_n\,\mathrm{sinc}(t-n) \]
Find a condition on $N$ to guarantee that $E[(X_t - X^{(N)}_t)^2] \le 0.01$ for $t \in [-500, 500]$. (Hint: Use $|\mathrm{sinc}(\tau)| \le \frac{1}{\pi|\tau|}$ and bound the series by an integral. Your choice of $N$ should not depend on $t$ because the same $N$ should work for all $t$ in the interval $[-500, 500]$.)

8.18 Filtering to maximize signal to noise ratio
Let $X$ and $N$ be continuous time, mean zero WSS random processes. Suppose that $X$ has power spectral density $S_X(\omega) = |\omega| I_{\{|\omega| \le \omega_o\}}$, and that $N$ has power spectral density $S_N(\omega) = \sigma^2$ for all $\omega$. Suppose also that $X$ and $N$ are uncorrelated with each other. Think of $X$ as a signal, and $N$ as noise. Suppose the signal plus noise $X + N$ is passed through a linear time-invariant filter with transfer function $H$, which you are to specify. Let $\bar{X}$ denote the output signal and $\bar{N}$ denote the output noise. What choice of $H$, subject to the constraints (i) $|H(\omega)| \le 1$ for all $\omega$, and (ii) (power of $\bar{X}$) $\ge$ (power of $X$)/2, minimizes the power of $\bar{N}$?

8.19 Finding the envelope of a deterministic signal
(a) Find the complex envelope $z(t)$ and real envelope $|z(t)|$ of $x(t) = \cos(2\pi(1000)t) + \cos(2\pi(1001)t)$, using the carrier frequency $f_c = 1000.5$ Hz. Simplify your answer as much as possible.
(b) Repeat part (a), using $f_c = 995$ Hz. (Hint: The real envelope should be the same as found in part (a).)
(c) Explain why, in general, the real envelope of a narrowband signal does not depend on which frequency $f_c$ is used to represent the signal (as long as $f_c$ is chosen so that the upper band of the signal is contained in an interval $[f_c - a, f_c + a]$ with $a \ll f_c$.)

8.20 Sampling a signal or process that is not band limited
(a) Fix $T > 0$ and let $\omega_o = \pi/T$. Given a finite energy signal $x$, let $x^o$ be the band-limited signal with Fourier transform $\widehat{x}^o(\omega) = I_{\{|\omega| \le \omega_o\}}\sum_{n=-\infty}^{\infty} \widehat{x}(\omega + 2n\omega_o)$. Show that $x(nT) = x^o(nT)$ for all integers $n$.
(b) Explain why $x^o(t) = \sum_{n=-\infty}^{\infty} x(nT)\,\mathrm{sinc}\left( \frac{t-nT}{T} \right)$.
(c) Let $X$ be a mean zero WSS random process, and let $R^o_X$ be the autocorrelation function for power spectral density $S^o_X(\omega)$ defined by $S^o_X(\omega) = I_{\{|\omega| \le \omega_o\}}\sum_{n=-\infty}^{\infty} S_X(\omega + 2n\omega_o)$. Show that $R_X(nT) = R^o_X(nT)$ for all integers $n$.
(d) Explain why the random process $Y$ defined by $Y_t = \sum_{n=-\infty}^{\infty} X_{nT}\,\mathrm{sinc}\left( \frac{t-nT}{T} \right)$ is WSS with autocorrelation function $R^o_X$.
(e) Find $S^o_X$ in case $S_X(\omega) = \exp(-\alpha|\omega|)$ for $\omega \in \mathbb{R}$.

8.21 A narrowband Gaussian process
Let $X$ be a real-valued stationary Gaussian process with mean zero and $R_X(\tau) = \cos(2\pi(30\tau))(\mathrm{sinc}(6\tau))^2$.
(a) Find and carefully sketch the power spectral density of $X$.
(b) Sketch a sample path of $X$.
(c) The process $X$ can be represented by $X_t = \mathrm{Re}(Z_t e^{2\pi j30t})$, where $Z_t = U_t + jV_t$ for jointly stationary narrowband real-valued random processes $U$ and $V$. Find the spectral densities $S_U$, $S_V$, and $S_{UV}$.
(d) Find $P[|Z_{33}| > 5]$. Note that $|Z_t|$ is the real envelope process of $X$.

8.22 Another narrowband Gaussian process
Suppose a real-valued Gaussian random process $R = (R_t : t \in \mathbb{R})$ with mean 2 and power spectral density $S_R(2\pi f) = e^{-|f|/10^4}$ is fed through a linear time-invariant system with transfer function
\[ H(2\pi f) = \begin{cases} 0.1 & 5000 \le |f| \le 6000 \\ 0 & \text{else} \end{cases} \]
(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$.
(b) Find $P[X_{25} > 6]$.
(c) The random process $X$ is a narrowband random process.
Find the power spectral densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$ and $V$ so that $X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t)$, using $f_c = 5500$.
(d) Repeat part (c) with $f_c = 5000$.

8.23 Another narrowband Gaussian process (version 2)
Suppose a real-valued Gaussian white noise process $N$ (we assume white noise has mean zero) with power spectral density $S_N(2\pi f) \equiv \frac{N_o}{2}$ for $f \in \mathbb{R}$ is fed through a linear time-invariant system with transfer function $H$ specified as follows, where $f$ represents the frequency in gigahertz (GHz) and a gigahertz is $10^9$ cycles per second.
$$H(2\pi f) = \begin{cases} 1 & 19.10 \le |f| \le 19.11 \\ \sqrt{\dfrac{19.12 - |f|}{0.01}} & 19.11 \le |f| \le 19.12 \\ 0 & \text{else} \end{cases}$$
(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$.
(b) Express $P[X_{25} > 2]$ in terms of $N_o$ and the standard normal complementary CDF function $Q$.
(c) The random process $X$ is a narrowband random process. Find and sketch the power spectral densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$ and $V$ so that $X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t)$, using $f_c = 19.11$ GHz.
(d) The complex envelope process is given by $Z = U + jV$ and the real envelope process is given by $|Z|$. Specify the distributions of $Z_t$ and $|Z_t|$ for $t$ fixed.

8.24 Declaring the center frequency for a given random process
Let $a > 0$ and let $g$ be a nonnegative function on $\mathbb{R}$ which is zero outside of the interval $[a, 2a]$. Suppose $X$ is a narrowband WSS random process with power spectral density function $S_X(\omega) = g(|\omega|)$, or equivalently, $S_X(\omega) = g(\omega) + g(-\omega)$. The process $X$ can thus be viewed as a narrowband signal for carrier frequency $\omega_c$, for any choice of $\omega_c$ in the interval $[a, 2a]$. Let $U$ and $V$ be the baseband random processes in the usual complex envelope representation: $X_t = \mathrm{Re}((U_t + jV_t)e^{j\omega_c t})$.
(a) Express $S_U$ and $S_{UV}$ in terms of $g$ and $\omega_c$.
(b) Describe which choice of $\omega_c$ minimizes $\int_{-\infty}^{\infty} |S_{UV}(\omega)|^2 \frac{d\omega}{2\pi}$. (Note: If $g$ is symmetric around some frequency $\nu$, then $\omega_c = \nu$. But what is the answer otherwise?)

8.25 * Cyclostationary random processes
A random process $X = (X_t : t \in \mathbb{R})$ is said to be cyclostationary with period $T$ if, whenever $s$ is an integer multiple of $T$, $X$ has the same finite dimensional distributions as $(X_{t+s} : t \in \mathbb{R})$. This property is weaker than stationarity, because stationarity requires equality of finite dimensional distributions for all real values of $s$.
(a) What properties of the mean function $\mu_X$ and autocorrelation function $R_X$ does any second order cyclostationary process possess? A process with these properties is called a wide sense cyclostationary process.
(b) Suppose $X$ is cyclostationary and that $U$ is a random variable independent of $X$ that is uniformly distributed on the interval $[0, T]$. Let $Y = (Y_t : t \in \mathbb{R})$ be the random process defined by $Y_t = X_{t+U}$. Argue that $Y$ is stationary, and express the mean and autocorrelation function of $Y$ in terms of the mean function and autocorrelation function of $X$. Although $X$ is not necessarily WSS, it is reasonable to define the power spectral density of $X$ to equal the power spectral density of $Y$.
(c) Suppose $B$ is a stationary discrete-time random process and that $g$ is a deterministic function. Let $X$ be defined by
$$X_t = \sum_{n=-\infty}^{\infty} g(t - nT) B_n.$$
Show that $X$ is a cyclostationary random process. Find the mean function and autocorrelation function of $X$ in terms of $g$, $T$, and the mean and autocorrelation function of $B$. If your answer is complicated, identify special cases which make the answer nice.
(d) Suppose $Y$ is defined as in part (b) for the specific $X$ defined in part (c). Express the mean $\mu_Y$, autocorrelation function $R_Y$, and power spectral density $S_Y$ in terms of $g$, $T$, $\mu_B$, and $S_B$.
8.26 * Zero crossing rate of a stationary Gaussian process
Consider a zero-mean stationary Gaussian random process $X$ with $S_X(2\pi f) = |f| - 50$ for $50 \le |f| \le 60$, and $S_X(2\pi f) = 0$ otherwise. Assume the process has continuous sample paths (it can be shown that such a version exists). A zero crossing from above is said to occur at time $t$ if $X(t) = 0$ and $X(s) > 0$ for all $s$ in an interval of the form $[t - \epsilon, t)$ for some $\epsilon > 0$. Determine the mean rate of zero crossings from above for $X$. If you can find an analytical solution, great. Alternatively, you can estimate the rate (aim for three significant digits) by Monte Carlo simulation of the random process.

Chapter 9 Wiener filtering

9.1 Return of the orthogonality principle

Consider the problem of estimating a random process $X$ at some fixed time $t$ given observation of a random process $Y$ over an interval $[a, b]$. Suppose both $X$ and $Y$ are mean zero second order random processes and that the mean square error is to be minimized. Let $\hat{X}_t$ denote the best linear estimator of $X_t$ based on the observations $(Y_s : a \le s \le b)$. In other words, define
$$\mathcal{V}_o = \{c_1 Y_{s_1} + \cdots + c_n Y_{s_n} : \text{for some constants } n,\ a \le s_1, \ldots, s_n \le b, \text{ and } c_1, \ldots, c_n\}$$
and let $\mathcal{V}$ be the m.s. closure of $\mathcal{V}_o$, which includes $\mathcal{V}_o$ and any random variable that is the m.s. limit of a sequence of random variables in $\mathcal{V}_o$. Then $\hat{X}_t$ is the random variable in $\mathcal{V}$ that minimizes the mean square error, $E[|X_t - \hat{X}_t|^2]$. By the orthogonality principle, the estimator $\hat{X}_t$ exists and it is unique in the sense that any two solutions to the estimation problem are equal with probability one.

Perhaps the most useful part of the orthogonality principle is that a random variable $W$ is equal to $\hat{X}_t$ if and only if (i) $W \in \mathcal{V}$ and (ii) $(X_t - W) \perp Z$ for all $Z \in \mathcal{V}$. Equivalently, $W$ is equal to $\hat{X}_t$ if and only if (i) $W \in \mathcal{V}$ and (ii) $(X_t - W) \perp Y_u$ for all $u \in [a, b]$.
Furthermore, the minimum mean square error (i.e. the error for the optimal estimator $\hat{X}_t$) is given by $E[|X_t|^2] - E[|\hat{X}_t|^2]$.

Note that m.s. integrals of the form $\int_a^b h(t, s) Y_s\, ds$ are in $\mathcal{V}$, because m.s. integrals are m.s. limits of finite linear combinations of the random variables of $Y$. Typically the set $\mathcal{V}$ is larger than the set of all m.s. integrals of $Y$. For example, if $u$ is a fixed time in $[a, b]$ then $Y_u \in \mathcal{V}$. In addition, if $Y$ is m.s. differentiable, then $Y'_u$ is also in $\mathcal{V}$. Typically neither $Y_u$ nor $Y'_u$ can be expressed as a m.s. integral of $(Y_s : s \in \mathbb{R})$. However, $Y_u$ can be obtained as an integral of the process $Y$ multiplied by a delta function, though the integration has to be taken in a generalized sense.

The integral $\int_a^b h(t, s) Y_s\, ds$ is the linear MMSE estimator if and only if
$$X_t - \int_a^b h(t, s) Y_s\, ds \perp Y_u \quad \text{for } u \in [a, b]$$
or equivalently
$$E\left[\left(X_t - \int_a^b h(t, s) Y_s\, ds\right) Y_u^*\right] = 0 \quad \text{for } u \in [a, b]$$
or equivalently
$$R_{XY}(t, u) = \int_a^b h(t, s) R_Y(s, u)\, ds \quad \text{for } u \in [a, b].$$

Suppose now that the observation interval is the whole real line $\mathbb{R}$ and suppose that $X$ and $Y$ are jointly WSS. Then for $t$ and $v$ fixed, the problem of estimating $X_t$ from $(Y_s : s \in \mathbb{R})$ is the same as the problem of estimating $X_{t+v}$ from $(Y_{s+v} : s \in \mathbb{R})$. Therefore, if $h(t, s)$ for $t$ fixed is the optimal function to use for estimating $X_t$ from $(Y_s : s \in \mathbb{R})$, then it is also the optimal function to use for estimating $X_{t+v}$ from $(Y_{s+v} : s \in \mathbb{R})$. Therefore, $h(t, s) = h(t + v, s + v)$, so that $h(t, s)$ is a function of $t - s$ alone, meaning that the optimal impulse response function $h$ corresponds to a time-invariant system. Thus, we seek to find an optimal estimator of the form $\hat{X}_t = \int_{-\infty}^{\infty} h(t - s) Y_s\, ds$. The optimality condition becomes
$$X_t - \int_{-\infty}^{\infty} h(t - s) Y_s\, ds \perp Y_u \quad \text{for } u \in \mathbb{R}$$
which is equivalent to the condition
$$R_{XY}(t - u) = \int_{-\infty}^{\infty} h(t - s) R_Y(s - u)\, ds \quad \text{for } u \in \mathbb{R}$$
or $R_{XY} = h * R_Y$.
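The step from the orthogonality condition to the convolution equation can be spelled out as follows. This is only a sketch: it assumes that expectation and m.s. integration can be interchanged, and it uses the convention $R_{XY}(\tau) = E[X_{t+\tau} Y_t^*]$ for jointly WSS processes; the substitution variables $\tau$ and $v$ are introduced here for illustration.

```latex
\begin{align*}
0 &= E\left[\left(X_t - \int_{-\infty}^{\infty} h(t-s)\, Y_s\, ds\right) Y_u^*\right]
   = R_{XY}(t-u) - \int_{-\infty}^{\infty} h(t-s)\, R_Y(s-u)\, ds . \\
\intertext{Setting $\tau = t - u$ and changing variables to $v = t - s$, so that $s - u = \tau - v$,}
R_{XY}(\tau) &= \int_{-\infty}^{\infty} h(v)\, R_Y(\tau - v)\, dv = (h * R_Y)(\tau)
   \quad \text{for all } \tau \in \mathbb{R}.
\end{align*}
```

Because $u$ ranges over all of $\mathbb{R}$, so does $\tau$, which is why the condition holds for every lag in this noncausal setting.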
In the frequency domain the optimality condition becomes $S_{XY}(\omega) = H(\omega) S_Y(\omega)$ for all $\omega$. Consequently, the optimal filter $H$ is given by
$$H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)}$$
and the corresponding minimum mean square error is given by
$$E[|X_t - \hat{X}_t|^2] = E[|X_t|^2] - E[|\hat{X}_t|^2] = \int_{-\infty}^{\infty} \left( S_X(\omega) - \frac{|S_{XY}(\omega)|^2}{S_Y(\omega)} \right) \frac{d\omega}{2\pi}$$

Example 9.1.1 Consider estimating a random process from observation of the random process plus noise, as shown in Figure 9.1. Assume that $X$ and $N$ are jointly WSS with mean zero. Suppose $X$ and $N$ have known autocorrelation functions and suppose that $R_{XN} \equiv 0$, so the variables of the process $X$ are uncorrelated with the variables of the process $N$. The observation process is given by $Y = X + N$. Then $S_{XY} = S_X$ and $S_Y = S_X + S_N$, so the optimal filter is given by
$$H(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)} = \frac{S_X(\omega)}{S_X(\omega) + S_N(\omega)}$$

[Figure 9.1: An estimator of a signal from signal plus noise, as the output of a linear filter.]

The associated minimum mean square error is given by
$$E[|X_t - \hat{X}_t|^2] = \int_{-\infty}^{\infty} \left( S_X(\omega) - \frac{S_X(\omega)^2}{S_X(\omega) + S_N(\omega)} \right) \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} \frac{S_X(\omega) S_N(\omega)}{S_X(\omega) + S_N(\omega)} \frac{d\omega}{2\pi}$$

Example 9.1.2 This example is a continuation of the previous example, for a particular choice of power spectral densities. Suppose that the signal process $X$ is WSS with mean zero and power spectral density $S_X(\omega) = \frac{1}{1+\omega^2}$, suppose the noise process $N$ is WSS with mean zero and power spectral density $S_N(\omega) = \frac{4}{4+\omega^2}$, and suppose $S_{XN} \equiv 0$. Equivalently, $R_X(\tau) = \frac{e^{-|\tau|}}{2}$, $R_N(\tau) = e^{-2|\tau|}$ and $R_{XN} \equiv 0$. We seek the optimal linear estimator of $X_t$ given $(Y_s : s \in \mathbb{R})$, where $Y = X + N$. Seeking an estimator of the form
$$\hat{X}_t = \int_{-\infty}^{\infty} h(t - s) Y_s\, ds$$
we find from the previous example that the transform $H$ of $h$ is given by
$$H(\omega) = \frac{S_X(\omega)}{S_X(\omega) + S_N(\omega)} = \frac{\frac{1}{1+\omega^2}}{\frac{1}{1+\omega^2} + \frac{4}{4+\omega^2}} = \frac{4 + \omega^2}{8 + 5\omega^2}$$
We will find $h$ by finding the inverse transform of $H$.
First, note that
$$\frac{4 + \omega^2}{8 + 5\omega^2} = \frac{\frac{8}{5} + \omega^2}{8 + 5\omega^2} + \frac{\frac{12}{5}}{8 + 5\omega^2} = \frac{1}{5} + \frac{\frac{12}{5}}{8 + 5\omega^2}$$
We know that $\frac{1}{5}\delta(t) \leftrightarrow \frac{1}{5}$. Also, for any $\alpha > 0$,
$$e^{-\alpha|t|} \leftrightarrow \frac{2\alpha}{\omega^2 + \alpha^2}, \qquad (9.1)$$
so
$$\frac{1}{8 + 5\omega^2} = \frac{\frac{1}{5}}{\frac{8}{5} + \omega^2} = \left( \frac{1}{5 \cdot 2} \sqrt{\frac{5}{8}} \right) \frac{2\sqrt{\frac{8}{5}}}{\frac{8}{5} + \omega^2} \leftrightarrow \left( \frac{1}{4\sqrt{10}} \right) e^{-\sqrt{\frac{8}{5}}|t|}$$
Therefore the optimal filter is given in the time domain by
$$h(t) = \frac{1}{5}\delta(t) + \left( \frac{3}{5\sqrt{10}} \right) e^{-\sqrt{\frac{8}{5}}|t|}$$
The associated minimum mean square error is given by (one way to do the integration is to use the fact that if $k \leftrightarrow K$ then $\int_{-\infty}^{\infty} K(\omega) \frac{d\omega}{2\pi} = k(0)$):
$$E[|X_t - \hat{X}_t|^2] = \int_{-\infty}^{\infty} \frac{S_X(\omega) S_N(\omega)}{S_X(\omega) + S_N(\omega)} \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} \frac{4}{8 + 5\omega^2} \frac{d\omega}{2\pi} = 4 \left( \frac{1}{4\sqrt{10}} \right) = \frac{1}{\sqrt{10}}$$
In an example later in this chapter we will return to the same random processes, but seek the best linear estimator of $X_t$ given $(Y_s : s \le t)$.

9.2 The causal Wiener filtering problem

A linear system is causal if the value of the output at any given time does not depend on the future of the input. That is to say that the impulse response function satisfies $h(t, s) = 0$ for $s > t$. In the case of a linear, time-invariant system, causality means that the impulse response function satisfies $h(\tau) = 0$ for $\tau < 0$. Suppose $X$ and $Y$ are mean zero and jointly WSS. In this section we will consider estimates of $X$ given $Y$ obtained by passing $Y$ through a causal linear time-invariant system. For convenience in applications, a fixed parameter $T$ is introduced. Let $\hat{X}_{t+T|t}$ be the minimum mean square error linear estimate of $X_{t+T}$ given $(Y_s : s \le t)$. Note that if $Y$ is the same process as $X$ and $T > 0$, then we are addressing the problem of predicting $X_{t+T}$ from $(X_s : s \le t)$.

An estimator of the form $\int_{-\infty}^{\infty} h(t - s) Y_s\, ds$ is sought such that $h$ corresponds to a causal system.
Once again, the orthogonality principle implies that the estimator is optimal if and only if it satisfies
$$X_{t+T} - \int_{-\infty}^{\infty} h(t - s) Y_s\, ds \perp Y_u \quad \text{for } u \le t$$
which is equivalent to the condition
$$R_{XY}(t + T - u) = \int_{-\infty}^{\infty} h(t - s) R_Y(s - u)\, ds \quad \text{for } u \le t$$
or $R_{XY}(t + T - u) = h * R_Y(t - u)$. Setting $\tau = t - u$ and combining this optimality condition with the constraint that $h$ is a causal function, the problem is to find an impulse response function $h$ satisfying:
$$R_{XY}(\tau + T) = h * R_Y(\tau) \quad \text{for } \tau \ge 0 \qquad (9.2)$$
$$h(v) = 0 \quad \text{for } v < 0 \qquad (9.3)$$
Equations (9.2) and (9.3) are called the Wiener-Hopf equations. We shall show how to solve them in the case the power spectral densities are rational functions by using the method of spectral factorization. The next section describes some of the tools needed for the solution.

9.3 Causal functions and spectral factorization

A function $h$ on $\mathbb{R}$ is said to be causal if $h(\tau) = 0$ for $\tau < 0$, and it is said to be anticausal if $h(\tau) = 0$ for $\tau > 0$. Any function $h$ on $\mathbb{R}$ can be expressed as the sum of a causal function and an anticausal function as follows. Simply let $u(t) = I_{\{t \ge 0\}}$ and notice that $h(t)$ is the sum of the causal function $u(t)h(t)$ and the anticausal function $(1 - u(t))h(t)$. More compactly, we have the representation $h = uh + (1 - u)h$.

A transfer function $H$ is said to be of positive type if the corresponding impulse response function $h$ is causal, and $H$ is said to be of negative type if the corresponding impulse response function is anticausal. Any transfer function can be written as the sum of a positive type transfer function and a negative type transfer function. Indeed, suppose $H$ is the Fourier transform of an impulse response function $h$. Define $[H]_+$ to be the Fourier transform of $uh$ and $[H]_-$ to be the Fourier transform of $(1 - u)h$. Then $[H]_+$ is called the positive part of $H$ and $[H]_-$ is called the negative part of $H$. The following properties hold:
• $H = [H]_+ + [H]_-$ (because $h = uh + (1 - u)h$)
• $[H]_+ = H$ if and only if $H$ is positive type
• $[H]_- = 0$ if and only if $H$ is positive type
• $[[H]_+]_- = 0$ for any $H$
• $[[H]_+]_+ = [H]_+$ and $[[H]_-]_- = [H]_-$
• $[H + G]_+ = [H]_+ + [G]_+$ and $[H + G]_- = [H]_- + [G]_-$

Note that $uh$ is the causal function that is closest to $h$ in the $L^2$ norm. That is, $uh$ is the projection of $h$ onto the space of causal functions. Indeed, if $k$ is any causal function, then
$$\int_{-\infty}^{\infty} |h(t) - k(t)|^2\, dt = \int_{-\infty}^{0} |h(t)|^2\, dt + \int_{0}^{\infty} |h(t) - k(t)|^2\, dt \ge \int_{-\infty}^{0} |h(t)|^2\, dt \qquad (9.4)$$
and equality holds in (9.4) if and only if $k = uh$ (except possibly on a set of measure zero). By Parseval's relation, it follows that $[H]_+$ is the positive type function that is closest to $H$ in the $L^2$ norm. Equivalently, $[H]_+$ is the projection of $H$ onto the space of positive type functions. Similarly, $[H]_-$ is the projection of $H$ onto the space of negative type functions.

Up to this point in these notes, Fourier transforms have been defined for real values of $\omega$ only. However, for the purposes of factorization to be covered later, it is useful to consider the analytic continuation of the Fourier transforms to larger sets in $\mathbb{C}$. We use the same notation $H(\omega)$ for the function $H$ defined for real values of $\omega$ only, and its continuation defined for complex $\omega$. The following examples illustrate the use of the projections $[\ ]_+$ and $[\ ]_-$, and consideration of transforms for complex $\omega$.

Example 9.3.1 Let $g(t) = e^{-\alpha|t|}$ for a constant $\alpha > 0$. The functions $g$, $ug$ and $(1 - u)g$ are pictured in Figure 9.2.

[Figure 9.2: Decomposition of a two-sided exponential function. Panels: $g(t)$, $u(t)g(t)$, $(1 - u(t))g(t)$, each plotted against $t$.]
The corresponding transforms are given by:
$$[G]_+(\omega) = \int_0^{\infty} e^{-\alpha t} e^{-j\omega t}\, dt = \frac{1}{j\omega + \alpha}$$
$$[G]_-(\omega) = \int_{-\infty}^{0} e^{\alpha t} e^{-j\omega t}\, dt = \frac{1}{-j\omega + \alpha}$$
$$G(\omega) = [G]_+(\omega) + [G]_-(\omega) = \frac{2\alpha}{\omega^2 + \alpha^2}$$
Note that $[G]_+$ has a pole at $\omega = j\alpha$, so that the imaginary part of the pole of $[G]_+$ is positive. Equivalently, the pole of $[G]_+$ is in the upper half plane.

More generally, suppose that $G(\omega)$ has the representation
$$G(\omega) = \sum_{n=1}^{N_1} \frac{\gamma_n}{j\omega + \alpha_n} + \sum_{n=N_1+1}^{N} \frac{\gamma_n}{-j\omega + \alpha_n}$$
where $\mathrm{Re}(\alpha_n) > 0$ for all $n$. Then
$$[G]_+(\omega) = \sum_{n=1}^{N_1} \frac{\gamma_n}{j\omega + \alpha_n} \qquad [G]_-(\omega) = \sum_{n=N_1+1}^{N} \frac{\gamma_n}{-j\omega + \alpha_n}$$

Example 9.3.2 Let $G$ be given by
$$G(\omega) = \frac{1 - \omega^2}{(j\omega + 1)(j\omega + 3)(j\omega - 2)}$$
Note that $G$ has only three simple poles. The numerator of $G$ has no factors in common with the denominator, and the degree of the numerator is smaller than the degree of the denominator. By the theory of partial fraction expansions in complex analysis, it therefore follows that $G$ can be written as
$$G(\omega) = \frac{\gamma_1}{j\omega + 1} + \frac{\gamma_2}{j\omega + 3} + \frac{\gamma_3}{j\omega - 2}$$
In order to identify $\gamma_1$, for example, multiply both expressions for $G$ by $(j\omega + 1)$ and then let $j\omega = -1$. The other constants are found similarly. Thus
$$\gamma_1 = \left. \frac{1 - \omega^2}{(j\omega + 3)(j\omega - 2)} \right|_{j\omega = -1} = \frac{1 + (-1)^2}{(-1 + 3)(-1 - 2)} = -\frac{1}{3}$$
$$\gamma_2 = \left. \frac{1 - \omega^2}{(j\omega + 1)(j\omega - 2)} \right|_{j\omega = -3} = \frac{1 + 3^2}{(-3 + 1)(-3 - 2)} = 1$$
$$\gamma_3 = \left. \frac{1 - \omega^2}{(j\omega + 1)(j\omega + 3)} \right|_{j\omega = 2} = \frac{1 + 2^2}{(2 + 1)(2 + 3)} = \frac{1}{3}$$
Consequently,
$$[G]_+(\omega) = -\frac{1}{3(j\omega + 1)} + \frac{1}{j\omega + 3} \quad \text{and} \quad [G]_-(\omega) = \frac{1}{3(j\omega - 2)}$$

Example 9.3.3 Suppose that $G(\omega) = \frac{e^{-j\omega T}}{j\omega + \alpha}$. Multiplication by $e^{-j\omega T}$ in the frequency domain represents a shift by $T$ in the time domain, so that
$$g(t) = \begin{cases} e^{-\alpha(t - T)} & t \ge T \\ 0 & t < T \end{cases},$$
as pictured in Figure 9.3.

[Figure 9.3: Exponential function shifted by $T$. Panels: $g(t)$ against $t$ for $T > 0$ and for $T < 0$.]

Consider two cases. First, if $T \ge 0$, then $g$ is causal, $G$ is positive type, and therefore $[G]_+ = G$ and $[G]_- = 0$.
Second, if $T \le 0$ then
$$g(t)u(t) = \begin{cases} e^{\alpha T} e^{-\alpha t} & t \ge 0 \\ 0 & t < 0 \end{cases}$$
so that $[G]_+(\omega) = \frac{e^{\alpha T}}{j\omega + \alpha}$ and $[G]_-(\omega) = G(\omega) - [G]_+(\omega) = \frac{e^{-j\omega T} - e^{\alpha T}}{j\omega + \alpha}$. We can also find $[G]_-$ by computing the transform of $(1 - u(t))g(t)$ (still assuming that $T \le 0$):
$$[G]_-(\omega) = \int_T^0 e^{\alpha(T - t)} e^{-j\omega t}\, dt = \left. \frac{e^{\alpha T - (\alpha + j\omega)t}}{-(\alpha + j\omega)} \right|_{t=T}^{0} = \frac{e^{-j\omega T} - e^{\alpha T}}{j\omega + \alpha}$$

Example 9.3.4 Suppose $H$ is the transfer function for impulse response function $h$. Let us unravel the notation and express
$$\int_{-\infty}^{\infty} \left| \left[ e^{j\omega T} H(\omega) \right]_+ \right|^2 \frac{d\omega}{2\pi}$$
in terms of $h$ and $T$. (Note that the factor $e^{j\omega T}$ is used, rather than $e^{-j\omega T}$ as in the previous example.) Multiplication by $e^{j\omega T}$ in the frequency domain corresponds to shifting by $-T$ in the time domain, so that
$$e^{j\omega T} H(\omega) \leftrightarrow h(t + T)$$
and thus
$$\left[ e^{j\omega T} H(\omega) \right]_+ \leftrightarrow u(t)h(t + T)$$
Applying Parseval's identity, the definition of $u$, and a change of variables yields
$$\int_{-\infty}^{\infty} \left| \left[ e^{j\omega T} H(\omega) \right]_+ \right|^2 \frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} |u(t)h(t + T)|^2\, dt = \int_0^{\infty} |h(t + T)|^2\, dt = \int_T^{\infty} |h(t)|^2\, dt$$
The integral decreases from the energy of $h$ to zero as $T$ ranges from $-\infty$ to $\infty$.

Example 9.3.5 Suppose $[H]_- = [K]_- = 0$. Let us find $[HK]_-$. As usual, let $h$ denote the inverse transform of $H$, and $k$ denote the inverse transform of $K$. The supposition implies that $h$ and $k$ are both causal functions. Therefore the convolution $h * k$ is also a causal function. Since $HK$ is the transform of $h * k$, it follows that $HK$ is a positive type function. Equivalently, $[HK]_- = 0$.

The decomposition $H = [H]_+ + [H]_-$ is an additive one. Next we turn to multiplicative decomposition, concentrating on rational functions. A function $H$ is said to be rational if it can be written as the ratio of two polynomials. Since polynomials can be factored over the complex numbers, a rational function $H$ can be expressed in the form
$$H(\omega) = \gamma \frac{(j\omega + \beta_1)(j\omega + \beta_2) \cdots (j\omega + \beta_K)}{(j\omega + \alpha_1)(j\omega + \alpha_2) \cdots (j\omega + \alpha_N)}$$
for complex constants $\gamma, \alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_K$.
Without loss of generality, we assume that $\{\alpha_i\} \cap \{\beta_j\} = \emptyset$. We also assume that the real parts of the constants $\alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_K$ are nonzero. The function $H$ is positive type if and only if $\mathrm{Re}(\alpha_i) > 0$ for all $i$, or equivalently, if and only if all the poles of $H(\omega)$ are in the upper half plane $\mathrm{Im}(\omega) > 0$.

A positive type function $H$ is said to have minimum phase if $\mathrm{Re}(\beta_i) > 0$ for all $i$. Thus, a positive type function $H$ is minimum phase if and only if $1/H$ is also positive type.

Suppose that $S_Y$ is the power spectral density of a WSS random process and that $S_Y$ is a rational function. The function $S_Y$, being nonnegative, is also real-valued, so $S_Y = S_Y^*$. Thus, if the denominator of $S_Y$ has a factor of the form $j\omega + \alpha$ then the denominator must also have a factor of the form $-j\omega + \alpha^*$. Similarly, if the numerator of $S_Y$ has a factor of the form $j\omega + \beta$ then the numerator must also have a factor of the form $-j\omega + \beta^*$.

Example 9.3.6 The function $S_Y$ given by
$$S_Y(\omega) = \frac{8 + 5\omega^2}{(1 + \omega^2)(4 + \omega^2)}$$
can be factored as
$$S_Y(\omega) = \underbrace{\sqrt{5}\, \frac{\left(j\omega + \sqrt{\frac{8}{5}}\right)}{(j\omega + 2)(j\omega + 1)}}_{S_Y^+(\omega)}\ \underbrace{\sqrt{5}\, \frac{\left(-j\omega + \sqrt{\frac{8}{5}}\right)}{(-j\omega + 2)(-j\omega + 1)}}_{S_Y^-(\omega)} \qquad (9.5)$$
where $S_Y^+$ is a positive type, minimum phase function and $S_Y^-$ is a negative type function with $S_Y^- = (S_Y^+)^*$.

Note that the operators $[\ ]_+$ and $[\ ]_-$ give us an additive decomposition of a function $H$ into the sum of a positive type and a negative type function, whereas spectral factorization has to do with products. At least formally, the factorization can be accomplished by taking a logarithm, doing an additive decomposition, and then exponentiating:
$$S_X(\omega) = \underbrace{\exp([\ln S_X(\omega)]_+)}_{S_X^+(\omega)}\ \underbrace{\exp([\ln S_X(\omega)]_-)}_{S_X^-(\omega)}. \qquad (9.6)$$
Notice that if $h \leftrightarrow H$ then, formally,
$$\delta + h + \frac{h * h}{2!} + \frac{h * h * h}{3!} + \cdots \leftrightarrow \exp(H) = 1 + H + \frac{H^2}{2!} + \frac{H^3}{3!} + \cdots$$
so that if $H$ is positive type, then $\exp(H)$ is also positive type.
Thus, the factor $S_X^+$ in (9.6) is indeed a positive type function, and the factor $S_X^-$ is a negative type function. Use of (9.6) is called the cepstrum method. Unfortunately, there is a host of problems, both numerical and analytical, in using the method, so that it will not be used further in these notes.

9.4 Solution of the causal Wiener filtering problem for rational power spectral densities

The Wiener-Hopf equations (9.2) and (9.3) can be formulated in the frequency domain as follows: Find a positive type transfer function $H$ such that
$$\left[ e^{j\omega T} S_{XY} - H S_Y \right]_+ = 0 \qquad (9.7)$$
Suppose $S_Y$ is factored as $S_Y = S_Y^+ S_Y^-$ such that $S_Y^+$ is a minimum phase, positive type transfer function and $S_Y^- = (S_Y^+)^*$. Then $S_Y^-$ and $\frac{1}{S_Y^-}$ are negative type functions. Since the product of two negative type functions is again negative type, (9.7) is equivalent to the equation obtained by multiplying the quantity within square brackets in (9.7) by $\frac{1}{S_Y^-}$, yielding the equivalent problem: Find a positive type transfer function $H$ such that
$$\left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} - H S_Y^+ \right]_+ = 0 \qquad (9.8)$$
The function $H S_Y^+$, being the product of two positive type functions, is itself positive type. Thus (9.8) becomes
$$\left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ - H S_Y^+ = 0$$
Solving for $H$ yields that the optimal transfer function is given by
$$H = \frac{1}{S_Y^+} \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \qquad (9.9)$$
The orthogonality principle yields that the mean square error satisfies
$$E[|X_{t+T} - \hat{X}_{t+T|t}|^2] = E[|X_{t+T}|^2] - E[|\hat{X}_{t+T|t}|^2] = R_X(0) - \int_{-\infty}^{\infty} |H(\omega)|^2 S_Y(\omega) \frac{d\omega}{2\pi} = R_X(0) - \int_{-\infty}^{\infty} \left| \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \right|^2 \frac{d\omega}{2\pi} \qquad (9.10)$$
where we used the fact that $|S_Y^+|^2 = S_Y$.

Another expression for the MMSE, which involves the optimal filter $h$, is the following:
$$\mathrm{MMSE} = E[(X_{t+T} - \hat{X}_{t+T|t})(X_{t+T} - \hat{X}_{t+T|t})^*] = E[(X_{t+T} - \hat{X}_{t+T|t}) X_{t+T}^*] = R_X(0) - R_{\hat{X}X}(t, t + T) = R_X(0) - \int_{-\infty}^{\infty} h(s) R_{XY}^*(s + T)\, ds.$$

Exercise Evaluate the limit as $T \to -\infty$ and the limit as $T \to \infty$ in (9.10).
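The last step of (9.10) relies on the identity $|S_Y^+|^2 = S_Y$ for real $\omega$. As a quick numerical sanity check, the following sketch evaluates both sides for the factorization (9.5) of Example 9.3.6. The function names `S_Y` and `S_Y_plus` and the frequency grid are illustrative assumptions, not part of the notes.

```python
import numpy as np

def S_Y(w):
    # Power spectral density from Example 9.3.6
    return (8 + 5 * w**2) / ((1 + w**2) * (4 + w**2))

def S_Y_plus(w):
    # Positive type, minimum phase factor from (9.5)
    jw = 1j * w
    return np.sqrt(5) * (jw + np.sqrt(8 / 5)) / ((jw + 2) * (jw + 1))

# Arbitrary grid of real frequencies (an assumption made for this check)
w = np.linspace(-20.0, 20.0, 2001)

# For real w, S_Y^- is the complex conjugate of S_Y^+
S_minus = np.conj(S_Y_plus(w))

# Largest deviation between S_Y^+ S_Y^- and S_Y over the grid
max_err = np.max(np.abs(S_Y_plus(w) * S_minus - S_Y(w)))
print(max_err)
```

The printed deviation should be at the level of floating point rounding, consistent with $S_Y = S_Y^+ S_Y^-$ on the real axis.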
Example 9.4.1 This example involves the same model as in an example in Section 9.1, but here a causal estimator is sought. The observed random process is $Y = X + N$, where $X$ is WSS with mean zero and power spectral density $S_X(\omega) = \frac{1}{1+\omega^2}$, $N$ is WSS with mean zero and power spectral density $S_N(\omega) = \frac{4}{4+\omega^2}$, and $S_{XN} = 0$. We seek the optimal causal linear estimator of $X_t$ given $(Y_s : s \le t)$. The power spectral density of $Y$ is given by
$$S_Y(\omega) = S_X(\omega) + S_N(\omega) = \frac{8 + 5\omega^2}{(1 + \omega^2)(4 + \omega^2)}$$
and its spectral factorization is given by (9.5), yielding $S_Y^+$ and $S_Y^-$. Since $R_{XN} = 0$ it follows that
$$S_{XY}(\omega) = S_X(\omega) = \frac{1}{(j\omega + 1)(-j\omega + 1)}.$$
Therefore
$$\frac{S_{XY}(\omega)}{S_Y^-(\omega)} = \frac{(-j\omega + 2)}{\sqrt{5}(j\omega + 1)\left(-j\omega + \sqrt{\frac{8}{5}}\right)} = \frac{\gamma_1}{j\omega + 1} + \frac{\gamma_2}{-j\omega + \sqrt{\frac{8}{5}}}$$
where
$$\gamma_1 = \left. \frac{-j\omega + 2}{\sqrt{5}\left(-j\omega + \sqrt{\frac{8}{5}}\right)} \right|_{j\omega = -1} = \frac{3}{\sqrt{5} + \sqrt{8}}$$
$$\gamma_2 = \left. \frac{-j\omega + 2}{\sqrt{5}(j\omega + 1)} \right|_{j\omega = \sqrt{\frac{8}{5}}} = \frac{-\sqrt{\frac{8}{5}} + 2}{\sqrt{5} + \sqrt{8}}$$
Therefore
$$\left[ \frac{S_{XY}(\omega)}{S_Y^-(\omega)} \right]_+ = \frac{\gamma_1}{j\omega + 1} \qquad (9.11)$$
and thus
$$H(\omega) = \frac{\gamma_1 (j\omega + 2)}{\sqrt{5}\left(j\omega + \sqrt{\frac{8}{5}}\right)} = \frac{3}{5 + 2\sqrt{10}} \left( 1 + \frac{2 - \sqrt{\frac{8}{5}}}{j\omega + \sqrt{\frac{8}{5}}} \right)$$
so that the optimal causal filter is
$$h(t) = \frac{3}{5 + 2\sqrt{10}} \left( \delta(t) + \left(2 - \sqrt{\tfrac{8}{5}}\right) u(t) e^{-t\sqrt{\frac{8}{5}}} \right)$$
Finally, by (9.10) with $T = 0$, (9.11), and (9.1), the minimum mean square error is given by
$$E[|X_t - \hat{X}_t|^2] = R_X(0) - \int_{-\infty}^{\infty} \frac{\gamma_1^2}{1 + \omega^2} \frac{d\omega}{2\pi} = \frac{1}{2} - \frac{\gamma_1^2}{2} \approx 0.3246$$
which is slightly larger than $\frac{1}{\sqrt{10}} \approx 0.3162$, the MMSE found for the best noncausal estimator (see the example in Section 9.1), and slightly smaller than $\frac{1}{3}$, the MMSE for the best "instantaneous" estimator of $X_t$ given $Y_t$, which is $\frac{Y_t}{3}$.

Example 9.4.2 A special case of the causal filtering problem formulated above is when the observed process $Y$ is equal to $X$ itself. This leads to the pure prediction problem. Let $X$ be a WSS mean zero random process and let $T > 0$.
Then the optimal linear predictor of $X_{t+T}$ given $(X_s : s \le t)$ corresponds to a linear time-invariant system with transfer function $H$ given by (because $S_{XY} = S_X$, $S_Y = S_X$, $S_Y^+ = S_X^+$, and $S_Y^- = S_X^-$):
$$H = \frac{1}{S_X^+} \left[ S_X^+ e^{j\omega T} \right]_+ \qquad (9.12)$$
To be more specific, suppose that $S_X(\omega) = \frac{1}{\omega^4 + 4}$. Observe that $\omega^4 + 4 = (\omega^2 + 2j)(\omega^2 - 2j)$. Since $2j = (1 + j)^2$, we have $(\omega^2 - 2j) = (\omega + 1 + j)(\omega - 1 - j)$. Factoring the term $(\omega^2 + 2j)$ in a similar way, and rearranging terms as needed, yields that the factorization of $S_X$ is given by
$$S_X(\omega) = \underbrace{\frac{1}{(j\omega + (1 + j))(j\omega + (1 - j))}}_{S_X^+(\omega)}\ \underbrace{\frac{1}{(-j\omega + (1 + j))(-j\omega + (1 - j))}}_{S_X^-(\omega)}$$
so that
$$S_X^+(\omega) = \frac{1}{(j\omega + (1 + j))(j\omega + (1 - j))} = \frac{\gamma_1}{j\omega + (1 + j)} + \frac{\gamma_2}{j\omega + (1 - j)}$$
where
$$\gamma_1 = \left. \frac{1}{j\omega + (1 - j)} \right|_{j\omega = -(1 + j)} = \frac{j}{2} \qquad \gamma_2 = \left. \frac{1}{j\omega + (1 + j)} \right|_{j\omega = -1 + j} = \frac{-j}{2}$$
yielding that the inverse Fourier transform of $S_X^+$ is given by
$$S_X^+ \leftrightarrow \frac{j}{2} e^{-(1+j)t} u(t) - \frac{j}{2} e^{-(1-j)t} u(t)$$
Hence
$$S_X^+(\omega) e^{j\omega T} \leftrightarrow \begin{cases} \frac{j}{2} e^{-(1+j)(t+T)} - \frac{j}{2} e^{-(1-j)(t+T)} & t \ge -T \\ 0 & \text{else} \end{cases}$$
so that
$$\left[ S_X^+(\omega) e^{j\omega T} \right]_+ = \frac{j e^{-(1+j)T}}{2(j\omega + (1 + j))} - \frac{j e^{-(1-j)T}}{2(j\omega + (1 - j))}$$
The formula (9.12) for the optimal transfer function yields
$$H(\omega) = \frac{j e^{-(1+j)T}(j\omega + (1 - j))}{2} - \frac{j e^{-(1-j)T}(j\omega + (1 + j))}{2} = e^{-T} \left[ \frac{e^{jT}(1 + j) - e^{-jT}(1 - j)}{2j} + \frac{j\omega(e^{jT} - e^{-jT})}{2j} \right] = e^{-T} [\cos(T) + \sin(T) + j\omega \sin(T)]$$
so that the optimal predictor for this example is given by
$$\hat{X}_{t+T|t} = X_t e^{-T}(\cos(T) + \sin(T)) + X'_t e^{-T} \sin(T)$$

9.5 Discrete time Wiener filtering

Causal Wiener filtering for discrete-time random processes can be handled in much the same way that it is handled for continuous time random processes. An alternative approach can be based on the use of whitening filters and linear innovations sequences. Both of these approaches will be discussed in this section, but first the topic of spectral factorization for discrete-time processes is discussed.

Spectral factorization for discrete time processes naturally involves z-transforms.
The z-transform of a function $(h(k) : k \in \mathbb{Z})$ is given by
$$H(z) = \sum_{k=-\infty}^{\infty} h(k) z^{-k}$$
for $z \in \mathbb{C}$. Setting $z = e^{j\omega}$ yields the Fourier transform: $H(\omega) = H(e^{j\omega})$ for $0 \le \omega \le 2\pi$. Thus, the z-transform $H$ restricted to the unit circle in $\mathbb{C}$ is equivalent to the Fourier transform $H$ on $[0, 2\pi]$, and $H(z)$ for other $z \in \mathbb{C}$ is an analytic continuation of its values on the unit circle.

Let $\tilde{h}(k) = h^*(-k)$ as before. Then the z-transform of $\tilde{h}$ is related to the z-transform $H$ of $h$ as follows:
$$\sum_{k=-\infty}^{\infty} \tilde{h}(k) z^{-k} = \sum_{k=-\infty}^{\infty} h^*(-k) z^{-k} = \sum_{l=-\infty}^{\infty} h^*(l) z^{l} = \left( \sum_{l=-\infty}^{\infty} h(l) (1/z^*)^{-l} \right)^* = H^*(1/z^*)$$
The impulse response function $h$ is called causal if $h(k) = 0$ for $k < 0$. The z-transform $H$ is said to be positive type if $h$ is causal. Note that if $H$ is positive type, then $\lim_{|z| \to \infty} H(z) = h(0)$. The projection $[H]_+$ is defined as it was for Fourier transforms: it is the z-transform of the function $u(k)h(k)$, where $u(k) = I_{\{k \ge 0\}}$. (We will not need to define or use $[\ ]_-$ for discrete time functions.)

If $X$ is a discrete-time WSS random process with correlation function $R_X$, the z-transform of $R_X$ is denoted by $\mathcal{S}_X$. Similarly, if $X$ and $Y$ are jointly WSS then the z-transform of $R_{XY}$ is denoted by $\mathcal{S}_{XY}$. Recall that if $Y$ is the output random process when $X$ is passed through a linear time-invariant system with impulse response function $h$, then $X$ and $Y$ are jointly WSS and
$$R_{YX} = h * R_X \qquad R_{XY} = \tilde{h} * R_X \qquad R_Y = h * \tilde{h} * R_X$$
which in the z-transform domain becomes:
$$\mathcal{S}_{YX}(z) = H(z)\mathcal{S}_X(z) \qquad \mathcal{S}_{XY}(z) = H^*(1/z^*)\mathcal{S}_X(z) \qquad \mathcal{S}_Y(z) = H(z)H^*(1/z^*)\mathcal{S}_X(z)$$

Example 9.5.1 Suppose $Y$ is the output process when white noise $W$ with $R_W(k) = I_{\{k=0\}}$ is passed through a linear time invariant system with impulse response function $h(k) = \rho^k I_{\{k \ge 0\}}$, where $\rho$ is a complex constant with $|\rho| < 1$. Let us find $H$, $\mathcal{S}_Y$, and $R_Y$. To begin,
$$H(z) = \sum_{k=0}^{\infty} (\rho/z)^k = \frac{1}{1 - \rho/z}$$
and the z-transform of $\tilde{h}$ is $\frac{1}{1 - \rho^* z}$.
Note that the z-transform for $h$ converges absolutely for $|z| > |\rho|$, whereas the z-transform for $\tilde{h}$ converges absolutely for $|z| < 1/|\rho|$. Then
$$\mathcal{S}_Y(z) = H(z)H^*(1/z^*)\mathcal{S}_W(z) = \frac{1}{(1 - \rho/z)(1 - \rho^* z)}$$
The autocorrelation function $R_Y$ can be found either in the time domain using $R_Y = h * \tilde{h} * R_W$ or by inverting the z-transform $\mathcal{S}_Y$. Taking the latter approach, factor out $z$ and use the method of partial fraction expansion to obtain
$$\mathcal{S}_Y(z) = \frac{z}{(z - \rho)(1 - \rho^* z)} = z \left( \frac{1}{(1 - |\rho|^2)(z - \rho)} + \frac{1}{((1/\rho^*) - \rho)(1 - \rho^* z)} \right) = \frac{1}{1 - |\rho|^2} \left( \frac{1}{1 - \rho/z} + \frac{z\rho^*}{1 - \rho^* z} \right)$$
which is the z-transform of
$$R_Y(k) = \begin{cases} \dfrac{\rho^k}{1 - |\rho|^2} & k \ge 0 \\[2mm] \dfrac{(\rho^*)^{-k}}{1 - |\rho|^2} & k < 0 \end{cases}$$
The z-transform $\mathcal{S}_Y$ of $R_Y$ converges absolutely for $|\rho| < |z| < 1/|\rho|$.

Suppose that $H(z)$ is a rational function of $z$, meaning that it is a ratio of two polynomials of $z$ with complex coefficients. We assume that the numerator and denominator have no zeros in common, and that neither has a root on the unit circle. The function $H$ is positive type (the z-transform of a causal function) if its poles (the zeros of its denominator polynomial) are inside the unit circle in the complex plane. If $H$ is positive type and if its zeros are also inside the unit circle, then $h$ and $H$ are said to be minimum phase functions (in the time domain and z-transform domain, respectively). A positive-type, minimum phase function $H$ has the property that both $H$ and its inverse $1/H$ are causal functions. Two linear time-invariant systems in series, one with transfer function $H$ and one with transfer function $1/H$, pass all signals unchanged. Thus if $H$ is positive type and minimum phase, we say that $H$ is causal and causally invertible.

Assume that $\mathcal{S}_Y$ corresponds to a WSS random process $Y$ and that $\mathcal{S}_Y$ is a rational function with no poles or zeros on the unit circle in the complex plane. We shall investigate the symmetries of $\mathcal{S}_Y$, with an eye towards its factorization.
First, $R_Y = \tilde{R}_Y$ so that
$$\mathcal{S}_Y(z) = \mathcal{S}_Y^*(1/z^*) \qquad (9.13)$$
Therefore, if $z_0$ is a pole of $\mathcal{S}_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a pole. Similarly, if $z_0$ is a zero of $\mathcal{S}_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a zero of $\mathcal{S}_Y$. These observations imply that $\mathcal{S}_Y$ can be uniquely factored as
$$\mathcal{S}_Y(z) = \mathcal{S}_Y^+(z)\mathcal{S}_Y^-(z)$$
such that for some constant $\beta > 0$:
• $\mathcal{S}_Y^+$ is a minimum phase, positive type z-transform
• $\mathcal{S}_Y^-(z) = (\mathcal{S}_Y^+(1/z^*))^*$
• $\lim_{|z| \to \infty} \mathcal{S}_Y^+(z) = \beta$

There is an additional symmetry if $R_Y$ is real-valued:
$$\mathcal{S}_Y(z) = \sum_{k=-\infty}^{\infty} R_Y(k) z^{-k} = \sum_{k=-\infty}^{\infty} \left( R_Y(k) (z^*)^{-k} \right)^* = \mathcal{S}_Y^*(z^*) \quad \text{(for real-valued } R_Y\text{)} \qquad (9.14)$$
Therefore, if $R_Y$ is real and if $z_0$ is a nonzero pole of $\mathcal{S}_Y$, then $z_0^*$ is also a pole. Combining (9.13) and (9.14) yields that if $R_Y$ is real then the real-valued nonzero poles of $\mathcal{S}_Y$ come in pairs: $z_0$ and $1/z_0$, and the other nonzero poles of $\mathcal{S}_Y$ come in quadruples: $z_0$, $z_0^*$, $1/z_0$, and $1/z_0^*$. A similar statement concerning the zeros of $\mathcal{S}_Y$ also holds true. Some example factorizations are as follows (where $|\rho| < 1$ and $\beta > 0$):
$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta}{1 - \rho/z}}_{\mathcal{S}_Y^+(z)}\ \underbrace{\frac{\beta}{1 - \rho^* z}}_{\mathcal{S}_Y^-(z)}$$
$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta(1 - .8/z)}{(1 - .6/z)(1 - .7/z)}}_{\mathcal{S}_Y^+(z)}\ \underbrace{\frac{\beta(1 - .8z)}{(1 - .6z)(1 - .7z)}}_{\mathcal{S}_Y^-(z)}$$
$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta}{(1 - \rho/z)(1 - \rho^*/z)}}_{\mathcal{S}_Y^+(z)}\ \underbrace{\frac{\beta}{(1 - \rho z)(1 - \rho^* z)}}_{\mathcal{S}_Y^-(z)}$$

An important application of spectral factorization is the generation of a discrete-time WSS random process with a specified correlation function $R_Y$. The idea is to start with a discrete-time white noise process $W$ with $R_W(k) = I_{\{k=0\}}$, or equivalently, with $\mathcal{S}_W(z) \equiv 1$, and then pass it through an appropriate linear, time-invariant system. The appropriate filter is given by taking $H(z) = \mathcal{S}_Y^+(z)$, for then the spectral density of the output is indeed given by
$$H(z)H^*(1/z^*)\mathcal{S}_W(z) = \mathcal{S}_Y^+(z)\mathcal{S}_Y^-(z) = \mathcal{S}_Y(z)$$

The spectral factorization can be used to solve the causal filtering problem in discrete time.
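The recipe above for generating a process with a prescribed correlation function can be checked without any randomness, since for unit-variance white noise input the output autocorrelation is $R_Y = h * \tilde{h}$. The sketch below does this for the first example factorization, $\mathcal{S}_Y^+(z) = \beta/(1 - \rho/z)$, with a real $\rho$. The numerical values of `rho` and `beta` and the truncation length `K` are illustrative assumptions only.

```python
import numpy as np

rho, beta = 0.6, 1.5   # illustrative values (assumptions); requires |rho| < 1, beta > 0
K = 200                # truncation length of the causal impulse response (assumption)

# Impulse response of H(z) = S_Y^+(z) = beta / (1 - rho/z): h(k) = beta * rho^k for k >= 0
h = beta * rho ** np.arange(K)

# For white noise input with R_W(k) = I{k=0}, R_Y(k) = sum_m h(m + k) h(m)
R_filter = np.correlate(h, h, mode="full")   # lags -(K-1), ..., K-1
lags = np.arange(-(K - 1), K)

# Inverse z-transform of beta^2 / ((1 - rho/z)(1 - rho z)), as in Example 9.5.1 scaled by beta^2
R_target = beta**2 * rho ** np.abs(lags) / (1 - rho**2)

max_err = np.max(np.abs(R_filter - R_target))
print(max_err)
```

The truncation of $h$ introduces an error of order $\rho^{K}$, which is negligible for the values chosen here, so the two correlation sequences should agree to rounding error.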
Arguing just as in the continuous time case, we find that if $X$ and $Y$ are jointly WSS random processes, then the best estimator of $X_{n+T}$ given $(Y_k : k \le n)$ having the form

$$\hat X_{n+T|n} = \sum_{k=-\infty}^{\infty} Y_k h(n-k)$$

for a causal function $h$ is the function $h$ satisfying the Wiener-Hopf equations (9.2) and (9.3), and the z-transform of the optimal $h$ is given by

$$H = \frac{1}{\mathcal{S}_Y^+}\left[\frac{z^T \mathcal{S}_{XY}}{\mathcal{S}_Y^-}\right]_+ \qquad (9.15)$$

Finally, an alternative derivation of (9.15) is given, based on the use of a whitening filter. The idea is the same as the idea of linear innovations sequence considered in Chapter 3. The first step is to notice that the causal estimation problem is particularly simple if the observation process is white noise. Indeed, if the observed process $Y$ is white noise with $R_Y(k) = I_{\{k=0\}}$ then for each $k \ge 0$ the choice of $h(k)$ is simply made to minimize the mean square error when $X_{n+T}$ is estimated by the single term $h(k)Y_{n-k}$. This gives $h(k) = R_{XY}(T+k) I_{\{k\ge 0\}}$. Another way to get the same result is to solve the Wiener-Hopf equations (9.2) and (9.3) in discrete time in case $R_Y(k) = I_{\{k=0\}}$. In general, of course, the observation process $Y$ is not white, but the idea is to replace $Y$ by an equivalent observation process $Z$ that is white.

Let $Z$ be the result of passing $Y$ through a filter with transfer function $\mathcal{G}(z) = 1/\mathcal{S}_Y^+(z)$. Since $\mathcal{S}_Y^+(z)$ is a minimum phase function, $\mathcal{G}$ is a positive type function and the system is causal. Thus, any random variable in the m.s. closure of the linear span of $(Z_k : k \le n)$ is also in the m.s. closure of the linear span of $(Y_k : k \le n)$. Conversely, since $Y$ can be recovered from $Z$ by passing $Z$ through the causal linear time-invariant system with transfer function $\mathcal{S}_Y^+(z)$, any random variable in the m.s. closure of the linear span of $(Y_k : k \le n)$ is also in the m.s. closure of the linear span of $(Z_k : k \le n)$.
Hence, the optimal causal linear estimator of $X_{n+T}$ based on $(Y_k : k \le n)$ is equal to the optimal causal linear estimator of $X_{n+T}$ based on $(Z_k : k \le n)$. By the previous paragraph, such an estimator is obtained by passing $Z$ through the linear time-invariant system with impulse response function $R_{XZ}(T+k) I_{\{k\ge 0\}}$, which has z-transform $[z^T \mathcal{S}_{XZ}]_+$. See Figure 9.4.

[Figure 9.4: block diagram in which $Y$ is passed through $1/\mathcal{S}_Y^+(z)$ to produce $Z$, and $Z$ is passed through $[z^T \mathcal{S}_{XZ}(z)]_+$ to produce $\hat X_{t+T|t}$.]
Figure 9.4: Optimal filtering based on whitening first.

The transfer function for two linear, time-invariant systems in series is the product of their z-transforms. In addition,

$$\mathcal{S}_{XZ}(z) = \mathcal{G}^*(1/z^*)\mathcal{S}_{XY}(z) = \frac{\mathcal{S}_{XY}(z)}{\mathcal{S}_Y^-(z)}.$$

Hence, the series system shown in Figure 9.4 is indeed equivalent to passing $Y$ through the linear time invariant system with $H(z)$ given by (9.15).

Example 9.5.2 Suppose that $X$ and $N$ are discrete-time mean zero WSS random processes such that $R_{XN} = 0$. Suppose $\mathcal{S}_X(z) = \frac{1}{(1-\rho/z)(1-\rho z)}$ where $0 < \rho < 1$, and suppose that $N$ is a discrete-time white noise with $\mathcal{S}_N(z) \equiv \sigma^2$ and $R_N(k) = \sigma^2 I_{\{k=0\}}$. Let the observed process $Y$ be given by $Y = X + N$. Let us find the minimum mean square error linear estimator of $X_n$ based on $(Y_k : k \le n)$. We begin by factoring $\mathcal{S}_Y$.

$$\mathcal{S}_Y(z) = \mathcal{S}_X(z) + \mathcal{S}_N(z) = \frac{z}{(z-\rho)(1-\rho z)} + \sigma^2 = \frac{-\sigma^2\rho\left\{z^2 - \left(\frac{1+\rho^2}{\rho} + \frac{1}{\sigma^2\rho}\right)z + 1\right\}}{(z-\rho)(1-\rho z)}$$

The quadratic expression in braces can be expressed as $(z - z_0)(z - 1/z_0)$, where $z_0$ is the smaller root of the expression in braces, yielding the factorization

$$\mathcal{S}_Y(z) = \underbrace{\frac{\beta(1-z_0/z)}{1-\rho/z}}_{\mathcal{S}_Y^+(z)}\ \underbrace{\frac{\beta(1-z_0 z)}{1-\rho z}}_{\mathcal{S}_Y^-(z)} \quad \text{where } \beta^2 = \frac{\sigma^2\rho}{z_0}.$$

Using the fact $\mathcal{S}_{XY} = \mathcal{S}_X$, and appealing to a partial fraction expansion yields

$$\frac{\mathcal{S}_{XY}(z)}{\mathcal{S}_Y^-(z)} = \frac{1}{\beta(1-\rho/z)(1-z_0 z)} = \frac{1}{\beta(1-\rho/z)(1-z_0\rho)} + \frac{z}{\beta((1/z_0)-\rho)(1-z_0 z)} \qquad (9.16)$$

The first term in (9.16) is positive type, and the second term in (9.16) is the z-transform of a function that is supported on the negative integers.
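The factorization of $\mathcal{S}_Y$ and the partial fraction expansion (9.16) can be sanity-checked numerically before proceeding. The following is a minimal sketch, not part of the derivation: the function names and the parameter values used in the demonstration ($\rho = 0.5$, $\sigma^2 = 2$) are assumptions made for illustration, and the check simply evaluates both sides at points on the unit circle.

```python
import numpy as np

def factor_signal_plus_noise(rho, sigma2):
    """Return (z0, beta2) for S_Y(z) = 1/((1 - rho/z)(1 - rho z)) + sigma2."""
    b = (1 + rho ** 2) / rho + 1 / (sigma2 * rho)
    z0 = (b - np.sqrt(b ** 2 - 4)) / 2      # smaller root of z^2 - b z + 1
    beta2 = sigma2 * rho / z0
    return z0, beta2

def S_Y(z, rho, sigma2):
    """S_X(z) + S_N(z), evaluated directly."""
    return 1 / ((1 - rho / z) * (1 - rho * z)) + sigma2

def S_Y_factored(z, rho, sigma2):
    """Product S_Y^+(z) S_Y^-(z) from the spectral factorization."""
    z0, beta2 = factor_signal_plus_noise(rho, sigma2)
    s_plus = (1 - z0 / z) / (1 - rho / z)    # without the factor beta
    s_minus = (1 - z0 * z) / (1 - rho * z)   # without the factor beta
    return beta2 * s_plus * s_minus

def partial_fraction_sides(z, rho, sigma2):
    """Left and right sides of (9.16)."""
    z0, beta2 = factor_signal_plus_noise(rho, sigma2)
    beta = np.sqrt(beta2)
    lhs = 1 / (beta * (1 - rho / z) * (1 - z0 * z))
    rhs = (1 / (beta * (1 - rho / z) * (1 - z0 * rho))
           + z / (beta * (1 / z0 - rho) * (1 - z0 * z)))
    return lhs, rhs

if __name__ == "__main__":
    z = np.exp(1j * np.linspace(0.1, 3.0, 7))   # sample points with |z| = 1
    print(np.allclose(S_Y(z, 0.5, 2.0), S_Y_factored(z, 0.5, 2.0)))
    print(np.allclose(*partial_fraction_sides(z, 0.5, 2.0)))
```

Since $b > 2$ whenever $0 < \rho < 1$ and $\sigma^2 > 0$, the discriminant is positive and the smaller root $z_0$ lies in $(0, 1)$, as the factorization requires.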
Thus, the first term in (9.16) is equal to $\left[\frac{\mathcal{S}_{XY}}{\mathcal{S}_Y^-}\right]_+$. Finally, dividing by $\mathcal{S}_Y^+$ yields that the z-transform of the optimal filter is given by

$$H(z) = \frac{1}{\beta^2(1-z_0\rho)(1-z_0/z)}$$

or in the time domain

$$h(n) = \frac{z_0^n I_{\{n\ge 0\}}}{\beta^2(1-z_0\rho)}.$$

9.6 Problems

9.1 A quadratic predictor
Suppose $X$ is a mean zero, stationary discrete-time random process and that $n$ is an integer with $n \ge 1$. Consider estimating $X_{n+1}$ by a nonlinear one-step predictor of the form

$$\hat X_{n+1} = h_0 + \sum_{k=1}^{n} h_1(k) X_k + \sum_{j=1}^{n}\sum_{k=1}^{j} h_2(j,k) X_j X_k$$

(a) Find equations in terms of the moments (second and higher, if needed) of $X$ for the triple $(h_0, h_1, h_2)$ to minimize the one step prediction error: $E[(X_{n+1} - \hat X_{n+1})^2]$.
(b) Explain how your answer to part (a) simplifies if $X$ is a Gaussian random process.

9.2 A smoothing problem
Suppose $X$ and $Y$ are mean zero, second order random processes in continuous time. Suppose the MMSE estimator of $X_5$ is to be found based on observation of $(Y_u : u \in [0,3] \cup [7,10])$. Assuming the estimator takes the form of an integral, derive the optimality conditions that must be satisfied by the kernel function (the function that $Y$ is multiplied by before integrating). Use the orthogonality principle.

9.3 A simple, noncausal estimation problem
Let $X = (X_t : t \in \mathbb{R})$ be a real valued, stationary Gaussian process with mean zero and autocorrelation function $R_X(t) = A^2 \mathrm{sinc}(f_o t)$, where $A$ and $f_o$ are positive constants. Let $N = (N_t : t \in \mathbb{R})$ be a real valued Gaussian white noise process with $R_N(\tau) = \sigma^2\delta(\tau)$, which is independent of $X$. Define the random process $Y = (Y_t : t \in \mathbb{R})$ by $Y_t = X_t + N_t$. Let $\hat X_t = \int_{-\infty}^{\infty} h(t-s) Y_s\, ds$, where the impulse response function $h$, which can be noncausal, is chosen to minimize $E[D_t^2]$ for each $t$, where $D_t = X_t - \hat X_t$.
(a) Find $h$.
(b) Identify the probability distribution of $D_t$, for $t$ fixed.
(c) Identify the conditional distribution of $D_t$ given $Y_t$, for $t$ fixed.
(d) Identify the autocorrelation function, $R_D$, of the error process $D$, and the cross correlation function, $R_{DY}$.

9.4 Interpolating a Gauss Markov process
Let $X$ be a real-valued, mean zero stationary Gaussian process with $R_X(\tau) = e^{-|\tau|}$. Let $a > 0$. Suppose $X_0$ is estimated by $\hat X_0 = c_1 X_{-a} + c_2 X_a$ where the constants $c_1$ and $c_2$ are chosen to minimize the mean square error (MSE).
(a) Use the orthogonality principle to find $c_1$, $c_2$, and the resulting minimum MSE, $E[(X_0 - \hat X_0)^2]$. (Your answers should depend only on $a$.)
(b) Use the orthogonality principle again to show that $\hat X_0$ as defined above is the minimum MSE estimator of $X_0$ given $(X_s : |s| \ge a)$. (This implies that $X$ has a two-sided Markov property.)

9.5 Estimation of a filtered narrowband random process in noise
Suppose $X$ is a mean zero real-valued stationary Gaussian random process with the spectral density shown.

[Figure: plot of $S_X(2\pi f)$ versus $f$; recoverable labels: height 1, width 8 Hz, and frequency $10^4$ Hz.]

(a) Explain how $X$ can be simulated on a computer using a pseudo-random number generator that generates standard normal random variables. Try to use the minimum number per unit time. How many normal random variables does your construction require per simulated unit time?
(b) Suppose $X$ is passed through a linear time-invariant system with approximate transfer function $H(2\pi f) = 10^7/(10^7 + f^2)$. Find an approximate numerical value for the power of the output.
(c) Let $Z_t = X_t + W_t$ where $W$ is a Gaussian white noise random process, independent of $X$, with $R_W(\tau) = \delta(\tau)$. Find $h$ to minimize the mean square error $E[(X_t - \hat X_t)^2]$, where $\hat X = h * Z$.
(d) Find the mean square error for the estimator of part (c).

9.6 Proportional noise
Suppose $X$ and $N$ are second order, mean zero random processes such that $R_{XN} \equiv 0$, and let $Y = X + N$. Suppose the correlation functions $R_X$ and $R_N$ are known, and that $R_N = \gamma^2 R_X$ for some nonnegative constant $\gamma^2$.
Consider the problem of estimating $X_t$ using a linear estimator based on $(Y_u : a \le u \le b)$, where $a$, $b$, and $t$ are given times with $a < b$.
(a) Use the orthogonality principle to show that if $t \in [a,b]$, then the optimal estimator is given by $\hat X_t = \kappa Y_t$ for some constant $\kappa$, and identify the constant $\kappa$ and the corresponding MSE.
(b) Suppose in addition that $X$ and $N$ are WSS and that $X_{t+T}$ is to be estimated from $(Y_s : s \le t)$. Show how the equation for the optimal causal filter reduces to your answer to part (a) in case $T \le 0$.
(c) Continue under the assumptions of part (b), except consider $T > 0$. How is the optimal filter for estimating $X_{t+T}$ from $(Y_s : s \le t)$ related to the problem of predicting $X_{t+T}$ from $(X_s : s \le t)$?

9.7 Predicting the future of a simple WSS process
Let $X$ be a mean zero, WSS random process with power spectral density $S_X(\omega) = \frac{1}{\omega^4 + 13\omega^2 + 36}$.
(a) Find the positive type, minimum phase rational function $S_X^+$ such that $S_X(\omega) = |S_X^+(\omega)|^2$.
(b) Let $T$ be a fixed known constant with $T \ge 0$. Find $\hat X_{t+T|t}$, the MMSE linear estimator of $X_{t+T}$ given $(X_s : s \le t)$. Be as explicit as possible. (Hint: Check that your answer is correct in case $T = 0$ and in case $T \to \infty$.)
(c) Find the MSE for the optimal estimator of part (b).

9.8 Short answer filtering questions
(a) Prove or disprove: If $H$ is a positive type function then so is $H^2$.
(b) Prove or disprove: Suppose $X$ and $Y$ are jointly WSS, mean zero random processes with continuous spectral densities such that $S_X(2\pi f) = 0$ unless $|f| \in$ [9012 MHz, 9015 MHz] and $S_Y(2\pi f) = 0$ unless $|f| \in$ [9022 MHz, 9025 MHz]. Then the best linear estimate of $X_0$ given $(Y_t : t \in \mathbb{R})$ is 0.
(c) Let $H(2\pi f) = \mathrm{sinc}(f)$. Find $[H]_+$.
9.9 On the MSE for causal estimation
Recall that if $X$ and $Y$ are jointly WSS and have power spectral densities, and if $S_Y$ is rational with a spectral factorization, then the mean square error for linear estimation of $X_{t+T}$ using $(Y_s : s \le t)$ is given by

$$(\mathrm{MSE}) = R_X(0) - \int_{-\infty}^{\infty} \left| \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \right|^2 \frac{d\omega}{2\pi}.$$

Evaluate and interpret the limits of this expression as $T \to -\infty$ and as $T \to \infty$.

9.10 A singular estimation problem
Let $X_t = A e^{j2\pi f_o t}$, where $f_o > 0$ and $A$ is a mean zero complex valued random variable with $E[A^2] = 0$ and $E[|A|^2] = \sigma_A^2$. Let $N$ be a white noise process with $R_N(\tau) = \sigma_N^2\delta(\tau)$. Let $Y_t = X_t + N_t$. Let $\hat X$ denote the output process when $Y$ is filtered using the impulse response function $h(\tau) = \alpha e^{-(\alpha - j2\pi f_o)\tau} I_{\{\tau \ge 0\}}$.
(a) Verify that $X$ is a WSS periodic process, and find its power spectral density (the power spectral density only exists as a generalized function, i.e. there is a delta function in it).
(b) Give a simple expression for the output of the linear system when the input is $X$.
(c) Find the mean square error, $E[|X_t - \hat X_t|^2]$. How should the parameter $\alpha$ be chosen to approximately minimize the MSE?

9.11 Filtering a WSS signal plus noise
Suppose $X$ and $N$ are jointly WSS, mean zero, continuous time random processes with $R_{XN} \equiv 0$. The processes are the inputs to a system with the block diagram shown, for some transfer functions $K_1(\omega)$ and $K_2(\omega)$:

[Block diagram: $X$ is passed through $K_1$ and $N$ is passed through $K_2$; the two outputs are summed to give $Y = X_{out} + N_{out}$.]

Suppose that for every value of $\omega$, $K_i(\omega) \ne 0$ for $i = 1$ and $i = 2$. Because the two subsystems are linear, we can view the output process $Y$ as the sum of two processes, $X_{out}$, due to the input $X$, plus $N_{out}$, due to the input $N$. Your answers to the first four parts should be expressed in terms of $K_1$, $K_2$, and the power spectral densities $S_X$ and $S_N$.
(a) What is the power spectral density $S_Y$?
(b) Find the signal-to-noise ratio at the output (the power of $X_{out}$ divided by the power of $N_{out}$).
(c) Suppose $Y$ is passed into a linear system with transfer function $H$, designed so that the output at time $t$ is $\hat X_t$, the best linear estimator of $X_t$ given $(Y_s : s \in \mathbb{R})$. Find $H$.
(d) Find the resulting minimum mean square error.
(e) The correct answer to part (d) (the minimum MSE) does not depend on the filter $K_2$. Why?

9.12 A prediction problem
Let $X$ be a mean zero WSS random process with correlation function $R_X(\tau) = e^{-|\tau|}$. Using the Wiener filtering equations, find the optimal linear MMSE estimator (i.e. predictor) of $X_{t+T}$ based on $(X_s : s \le t)$, for a constant $T > 0$. Explain why your answer takes such a simple form.

9.13 Properties of a particular Gaussian process
Let $X$ be a zero-mean, wide-sense stationary Gaussian random process in continuous time with autocorrelation function $R_X(\tau) = (1 + |\tau|)e^{-|\tau|}$ and power spectral density $S_X(\omega) = (2/(1+\omega^2))^2$. Answer the following questions, being sure to provide justification.
(a) Is $X$ mean ergodic in the m.s. sense?
(b) Is $X$ a Markov process?
(c) Is $X$ differentiable in the m.s. sense?
(d) Find the causal, minimum phase filter $h$ (or its transform $H$) such that if white noise with autocorrelation function $\delta(\tau)$ is filtered using $h$ then the output autocorrelation function is $R_X$.
(e) Express $X$ as the solution of a stochastic differential equation driven by white noise.

9.14 Spectral decomposition and factorization
(a) Let $x$ be the signal with Fourier transform given by $\hat x(2\pi f) = \left[\mathrm{sinc}(100 f) e^{j2\pi f T}\right]_+$. Find the energy of $x$ for all real values of the constant $T$.
(b) Find the spectral factorization of the power spectral density $S(\omega) = \frac{1}{\omega^4 + 16\omega^2 + 100}$. (Hint: $1 + 3j$ is a pole of $S$.)

9.15 A continuous-time Wiener filtering problem
Let $(X_t)$ and $(N_t)$ be uncorrelated, mean zero random processes with $R_X(t) = \exp(-2|t|)$ and $S_N(\omega) \equiv N_o/2$ for a positive constant $N_o$. Suppose that $Y_t = X_t + N_t$.
(a) Find the optimal (noncausal) filter for estimating $X_t$ given $(Y_s : -\infty < s < +\infty)$ and find the resulting mean square error. Comment on how the MMSE depends on $N_o$.
(b) Find the optimal causal filter with lead time $T$, that is, the Wiener filter for estimating $X_{t+T}$ given $(Y_s : -\infty < s \le t)$, and find the corresponding MMSE. For simplicity you can assume that $T \ge 0$. Comment on the limiting value of the MMSE as $T \to \infty$, as $N_o \to \infty$, or as $N_o \to 0$.

9.16 Estimation of a random signal, using the KL expansion
Suppose that $X$ is a m.s. continuous, mean zero process over an interval $[a,b]$, and suppose $N$ is a white noise process, with $R_{XN} \equiv 0$ and $R_N(s,t) = \sigma^2\delta(s-t)$. Let $(\phi_k : k \ge 1)$ be a complete orthonormal basis for $L^2[a,b]$ consisting of eigenfunctions of $R_X$, and let $(\lambda_k : k \ge 1)$ denote the corresponding eigenvalues. Suppose that $Y = (Y_t : a \le t \le b)$ is observed.
(a) Fix an index $i$. Express the MMSE estimator of $(X, \phi_i)$ given $Y$ in terms of the coordinates, $(Y,\phi_1), (Y,\phi_2), \ldots$ of $Y$, and find the corresponding mean square error.
(b) Now suppose $f$ is a function in $L^2[a,b]$. Express the MMSE estimator of $(X, f)$ given $Y$ in terms of the coordinates $((f,\phi_j) : j \ge 1)$ of $f$, the coordinates of $Y$, the $\lambda$'s, and $\sigma$. Also, find the mean square error.

9.17 Noiseless prediction of a baseband random process
Fix positive constants $T$ and $\omega_o$, suppose $X = (X_t : t \in \mathbb{R})$ is a baseband random process with one-sided frequency limit $\omega_o$, and let $H^{(n)}(\omega) = \sum_{k=0}^{n} \frac{(j\omega T)^k}{k!}$, which is a partial sum of the power series of $e^{j\omega T}$. Let $\hat X^{(n)}_{t+T|t}$ denote the output at time $t$ when $X$ is passed through the linear time invariant system with transfer function $H^{(n)}$. As the notation suggests, $\hat X^{(n)}_{t+T|t}$ is an estimator (not necessarily optimal) of $X_{t+T}$ given $(X_s : s \le t)$.
(a) Describe $\hat X^{(n)}_{t+T|t}$ in terms of $X$ in the time domain. Verify that the linear system is causal.
(b) Show that $\lim_{n\to\infty} a_n = 0$, where $a_n = \max_{|\omega| \le \omega_o} |e^{j\omega T} - H^{(n)}(\omega)|$. (This means that the power series converges uniformly for $\omega \in [-\omega_o, \omega_o]$.)
(c) Show that the mean square error can be made arbitrarily small by taking $n$ sufficiently large. In other words, show that $\lim_{n\to\infty} E[|X_{t+T} - \hat X^{(n)}_{t+T|t}|^2] = 0$.
(d) Thus, the future of a narrowband random process $X$ can be predicted perfectly from its past. What is wrong with the following argument for general WSS processes? If $X$ is an arbitrary WSS random process, we could first use a bank of (infinitely many) narrowband filters to split $X$ into an equivalent set of narrowband random processes (call them "subprocesses") which sum to $X$. By the above, we can perfectly predict the future of each of the subprocesses from its past. So adding together the predictions would yield a perfect prediction of $X$ from its past.

9.18 Linear innovations and spectral factorization
Suppose $X$ is a discrete time WSS random process with mean zero. Suppose that the z-transform version of its power spectral density has the factorization as described in the notes: $\mathcal{S}_X(z) = \mathcal{S}_X^+(z)\mathcal{S}_X^-(z)$ such that $\mathcal{S}_X^+(z)$ is a minimum phase, positive type function, $\mathcal{S}_X^-(z) = (\mathcal{S}_X^+(1/z^*))^*$, and $\lim_{|z|\to\infty} \mathcal{S}_X^+(z) = \beta$ for some $\beta > 0$. The linear innovations sequence of $X$ is the sequence $\tilde X$ such that $\tilde X_k = X_k - \hat X_{k|k-1}$, where $\hat X_{k|k-1}$ is the MMSE predictor of $X_k$ given $(X_l : l \le k-1)$. Note that there is no constant multiplying $X_k$ in the definition of $\tilde X_k$. You should use $\mathcal{S}_X^+(z)$, $\mathcal{S}_X^-(z)$, and/or $\beta$ in giving your answers.
(a) Show that $\tilde X$ can be obtained by passing $X$ through a linear time-invariant filter, and identify the corresponding value of $H$.
(b) Identify the mean square prediction error, $E[|X_k - \hat X_{k|k-1}|^2]$.

9.19 A singular nonlinear estimation problem
Suppose $X$ is a standard Brownian motion with parameter $\sigma^2 = 1$ and suppose $N$ is a Poisson random process with rate $\lambda = 10$, which is independent of $X$.
Let $Y = (Y_t : t \ge 0)$ be defined by $Y_t = X_t + N_t$.
(a) Find the optimal estimator of $X_1$ among the estimators that are linear functions of $(Y_t : 0 \le t \le 1)$ and the constants, and find the corresponding mean square error. Your estimator can include a constant plus a linear combination, or limits of linear combinations, of $Y_t : 0 \le t \le 1$. (Hint: There is a closely related problem elsewhere in this problem set.)
(b) Find the optimal possibly nonlinear estimator of $X_1$ given $(Y_t : 0 \le t \le 1)$, and find the corresponding mean square error. (Hint: No computation is needed. Draw sample paths of the processes.)

9.20 A discrete-time Wiener filtering problem
Extend the discrete-time Wiener filtering problem considered at the end of the notes to incorporate a lead time $T$. Assume $T$ to be integer valued. Identify the optimal filter in both the z-transform domain and in the time domain. (Hint: Treat the case $T \le 0$ separately. You need not identify the covariance of error.)

9.21 Causal estimation of a channel input process
Let $X = (X_t : t \in \mathbb{R})$ and $N = (N_t : t \in \mathbb{R})$ denote WSS random processes with $R_X(\tau) = \frac{3}{2}e^{-|\tau|}$ and $R_N(\tau) = \delta(\tau)$. Think of $X$ as an input signal and $N$ as noise, and suppose $X$ and $N$ are orthogonal to each other. Let $k$ denote the impulse response function given by $k(\tau) = 2e^{-3\tau} I_{\{\tau \ge 0\}}$, and suppose an output process $Y$ is generated according to the block diagram shown:

[Block diagram: $X$ is passed through the filter $k$, and $N$ is added to the filter output to give $Y$.]

That is, $Y = X * k + N$. Suppose $X_t$ is to be estimated by passing $Y$ through a causal filter with impulse response function $h$, and transfer function $H$. Find the choice of $H$ and $h$ in order to minimize the mean square error.

9.22 Estimation given a strongly correlated process
Suppose $g$ and $k$ are minimum phase causal functions in discrete-time, with $g(0) = k(0) = 1$, and z-transforms $\mathcal{G}$ and $\mathcal{K}$. Let $W = (W_k : k \in \mathbb{Z})$ be a mean zero WSS process with $S_W(\omega) \equiv 1$, let $X_n = \sum_{i=-\infty}^{\infty} g(n-i)W_i$ and $Y_n = \sum_{i=-\infty}^{\infty} k(n-i)W_i$.
(a) Express $R_X$, $R_Y$, $R_{XY}$, $\mathcal{S}_X$, $\mathcal{S}_Y$, and $\mathcal{S}_{XY}$ in terms of $g$, $k$, $\mathcal{G}$, $\mathcal{K}$.
(b) Find $h$ so that $\hat X_{n|n} = \sum_{i=-\infty}^{\infty} Y_i h(n-i)$ is the MMSE linear estimator of $X_n$ given $(Y_i : i \le n)$.
(c) Find the resulting mean square error. Give an intuitive reason for your answer.

9.23 Estimation of a process with raised cosine spectrum
Suppose $Y = X + N$, where $X$ and $N$ are independent, mean zero, WSS random processes with

$$S_X(\omega) = \frac{\left(1 + \cos\left(\frac{\pi\omega}{\omega_o}\right)\right)}{2} I_{\{|\omega| \le \omega_o\}} \quad \text{and} \quad S_N(\omega) = \frac{N_o}{2},$$

where $N_o > 0$ and $\omega_o > 0$.
(a) Find the transfer function $H$ for the filter such that if the input process is $Y$, the output process, $\hat X$, is such that $\hat X_t$ is the optimal linear estimator of $X_t$ based on $(Y_s : s \in \mathbb{R})$.
(b) Express the mean square error, $\sigma_e^2 = E[(\hat X_t - X_t)^2]$, as an integral in the frequency domain. (You needn't carry out the integration.)
(c) Describe the limits of your answers to (a) and (b) as $N_o \to 0$.
(d) Describe the limits of your answers to (a) and (b) as $N_o \to \infty$.

9.24 * Resolution of Wiener and Kalman filtering
Consider the state and observation models:

$$X_n = F X_{n-1} + W_n$$
$$Y_n = H^T X_n + V_n$$

where $(W_n : -\infty < n < +\infty)$ and $(V_n : -\infty < n < +\infty)$ are independent vector-valued random sequences of independent, identically distributed mean zero random variables. Let $\Sigma_W$ and $\Sigma_V$ denote the respective covariance matrices of $W_n$ and $V_n$. ($F$, $H$ and the covariance matrices must satisfy a stability condition. Can you find it?)
(a) What are the autocorrelation function $R_X$ and crosscorrelation function $R_{XY}$?
(b) Use the orthogonality principle to derive conditions for the causal filter $h$ that minimizes $E[\| X_{n+1} - \sum_{j=0}^{\infty} h(j) Y_{n-j} \|^2]$. (i.e. derive the basic equations for the Wiener-Hopf method.)
(c) Write down and solve the equations for the Kalman predictor in steady state to derive an expression for $h$, and verify that it satisfies the orthogonality conditions.
Chapter 10

Martingales

10.1 Conditional expectation revisited

The general definition of a martingale requires the general definition of conditional expectation. We begin by reviewing the definitions we have given so far. In Chapter 1 we reviewed the following elementary definition of $E[X|Y]$. If $X$ and $Y$ are both discrete random variables,

$$E[X|Y=i] = \sum_j j\, P[X = j | Y = i],$$

which is well defined if $P\{Y = i\} > 0$ and either the sum restricted to $j > 0$ or to $j < 0$ is convergent. That is, $E[X|Y=i]$ is the mean of the conditional pmf of $X$ given $Y = i$. Note that $g(i) = E[X|Y=i]$ is a function of $i$. Let $E[X|Y]$ be the random variable defined by $E[X|Y] = g(Y)$. Similarly, if $X$ and $Y$ have a joint pdf, $E[X|Y=y] = \int x f_{X|Y}(x|y)\,dx = g(y)$ and $E[X|Y] = g(Y)$.

Chapter 3 showed that $E[X|Y]$ could be defined whenever $E[X^2] < \infty$, even if $X$ and $Y$ are neither discrete random variables nor have a joint pdf. The definition is based on a projection, characterized by the orthogonality principle. Specifically, if $E[X^2] < \infty$, then $E[X|Y]$ is the unique random variable such that:

• it has the form $g(Y)$ for some (Borel measurable) function $g$ such that $E[(g(Y))^2] < \infty$, and
• $E[(X - E[X|Y])f(Y)] = 0$ for all (Borel measurable) functions $f$ such that $E[(f(Y))^2] < \infty$.

That is, $E[X|Y]$ is an unconstrained estimator based on $Y$, such that the error $X - E[X|Y]$ is orthogonal to all other unconstrained estimators based on $Y$. By the orthogonality principle, $E[X|Y]$ exists and is unique, if differences on a set of probability zero are ignored. This second definition of $E[X|Y]$ is more general than the elementary definition, because it doesn't require $X$ and $Y$ to be discrete or to have a joint pdf, but it is less general because it requires that $E[X^2] < \infty$.

The definition of $E[X|Y]$ given next generalizes the previously given definition in two ways. First, the definition applies as long as $E[|X|] < \infty$, which is a weaker requirement than $E[X^2] < \infty$.
Second, the definition is based on having information represented by a σ-algebra, rather than by a random variable. Recall that, by definition, a σ-algebra $\mathcal{F}$ for a set $\Omega$ is a set of subsets of $\Omega$ such that:

(a) $\Omega \in \mathcal{F}$,
(b) if $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$,
(c) if $A, B \in \mathcal{F}$ then $A \cup B \in \mathcal{F}$, and more generally, if $A_1, A_2, \ldots$ is such that $A_i \in \mathcal{F}$ for $i \ge 1$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{F}$.

In particular, the set of events, $\mathcal{F}$, in a probability space $(\Omega, \mathcal{F}, P)$, is required to be a σ-algebra. The original motivation for introducing $\mathcal{F}$ in this context was a technical one, related to the impossibility of extending $P$ to be defined on all subsets of $\Omega$, for important examples such as $\Omega = [0,1]$ and $P\{(a,b)\} = b - a$ for all intervals $(a,b)$. However, σ-algebras are also useful for representing information available to an observer. We call $\mathcal{D}$ a sub-σ-algebra of $\mathcal{F}$ if $\mathcal{D}$ is a σ-algebra such that $\mathcal{D} \subset \mathcal{F}$. A random variable $Z$ is said to be $\mathcal{D}$-measurable if $\{Z \le c\} \in \mathcal{D}$ for all $c$. By definition, random variables are functions on $\Omega$ that are $\mathcal{F}$-measurable. The smaller the σ-algebra $\mathcal{D}$ is, the smaller the set of $\mathcal{D}$-measurable random variables. In practice, sub-σ-algebras are usually generated by collections of random variables:

Definition 10.1.1 The σ-algebra generated by a collection of random variables $(Y_i : i \in I)$, denoted by $\sigma(Y_i : i \in I)$, is the smallest σ-algebra containing all sets of the form $\{Y_i \le c\}$.$^1$ The σ-algebra generated by a single random variable $Y$ is denoted by $\sigma(Y)$, and sometimes as $\mathcal{F}^Y$.

An equivalent definition would be that $\sigma(Y_i : i \in I)$ is the smallest σ-algebra such that each $Y_i$ is measurable with respect to it.

A sub-σ-algebra of $\mathcal{F}$ represents knowledge about the probability experiment modeled by the probability space $(\Omega, \mathcal{F}, P)$. In Chapter 3, the information gained from observing a random variable $Y$ was modeled by requiring estimators to be random variables of the form $g(Y)$, for a Borel measurable function $g$.
An equivalent condition would be to allow any estimator that is a $\sigma(Y)$-measurable random variable. That is, as shown in a starred homework problem, if $Y$ and $Z$ are random variables on the same probability space, then $Z = g(Y)$ for some Borel measurable function $g$ if and only if $Z$ is $\sigma(Y)$-measurable. Using sub-σ-algebras is more general, because some σ-algebras on some probability spaces are not generated by random variables. Using σ-algebras to represent information also works better when there is an uncountably infinite number of observations, such as observation of a continuous random process over an interval of time. But in engineering practice, the main difference between the two ways to model information is simply a matter of notation.

Example 10.1.2 (The trivial σ-algebra) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Suppose $X$ is a random variable such that, for some constant $c_o$, $X(\omega) = c_o$ for all $\omega \in \Omega$. Then $X$ is measurable with respect to the trivial σ-algebra $\mathcal{D}$ defined by $\mathcal{D} = \{\emptyset, \Omega\}$. That is, constant random variables are $\{\emptyset, \Omega\}$-measurable.

Conversely, suppose $Y$ is a $\{\emptyset, \Omega\}$-measurable random variable. Select an arbitrary $\omega_o \in \Omega$ and let $c_o = Y(\omega_o)$. On one hand, $\{\omega : Y(\omega) \le c\}$ can't be empty for $c \ge c_o$, because it contains $\omega_o$, so $\{\omega : Y(\omega) \le c\} = \Omega$ for $c \ge c_o$. On the other hand, $\{\omega : Y(\omega) \le c\}$ doesn't contain $\omega_o$ for $c < c_o$, so $\{\omega : Y(\omega) \le c\} = \emptyset$ for $c < c_o$. Therefore, $Y(\omega) = c_o$ for all $\omega$. That is, $\{\emptyset, \Omega\}$-measurable random variables are constant.

Footnote 1: The smallest one exists; it is equal to the intersection of all σ-algebras which contain all sets of the form $\{Y_i \le c\}$.

Definition 10.1.3 If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ with finite mean and $\mathcal{D}$ is a sub-σ-algebra of $\mathcal{F}$, the conditional expectation of $X$ given $\mathcal{D}$, $E[X|\mathcal{D}]$, is the unique (two versions equal with probability one are considered to be the same) random variable on $(\Omega, \mathcal{F}, P)$ such that
(i) $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable
(ii) $E[(X - E[X|\mathcal{D}])I_D] = 0$ for all $D \in \mathcal{D}$.
(Here $I_D$ is the indicator function of $D$.)

We remark that a possible choice of $D$ in property (ii) of the definition is $D = \Omega$, so $E[X|\mathcal{D}]$ should satisfy $E[X - E[X|\mathcal{D}]] = 0$, or equivalently, since $E[X]$ is assumed to be finite, $E[X] = E[E[X|\mathcal{D}]]$. In particular, an implication of the definition is that $E[X|\mathcal{D}]$ also has a finite mean.

Proposition 10.1.4 Definition 10.1.3 is well posed. Specifically, there exists a random variable satisfying conditions (i) and (ii), and it is unique.

Proof. (Uniqueness) Suppose $U$ and $V$ are each $\mathcal{D}$-measurable random variables such that $E[(X-U)I_D] = 0$ and $E[(X-V)I_D] = 0$ for all $D \in \mathcal{D}$. It follows that $E[(U-V)I_D] = E[(X-V)I_D] - E[(X-U)I_D] = 0$ for any $D \in \mathcal{D}$. A possible choice of $D$ is $\{U > V\}$, so $E[(U-V)I_{\{U>V\}}] = 0$. Since $(U-V)I_{\{U>V\}}$ is nonnegative and is strictly positive on the event $\{U > V\}$, it must be that $P\{U > V\} = 0$. Similarly, $P\{U < V\} = 0$. So $P\{U = V\} = 1$.

(Existence) Existence is first proved under the added assumption that $P\{X \ge 0\} = 1$. Let $L^2(\mathcal{D})$ be the space of $\mathcal{D}$-measurable random variables with finite second moments. Then $L^2(\mathcal{D})$ is a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$, so the orthogonality principle can be applied. For any $n \ge 0$, the random variable $X \wedge n$ is bounded and thus has a finite second moment. Let $\hat X_n$ be the projection of $X \wedge n$ onto $L^2(\mathcal{D})$. Then by the orthogonality principle, $X \wedge n - \hat X_n$ is orthogonal to any random variable in $L^2(\mathcal{D})$. In particular, $X \wedge n - \hat X_n$ is orthogonal to $I_D$ for any $D \in \mathcal{D}$. Therefore, $E[(X \wedge n - \hat X_n)I_D] = 0$ for all $D \in \mathcal{D}$. Equivalently,

$$E[(X \wedge n)I_D] = E[\hat X_n I_D]. \qquad (10.1)$$

The next step is to take a limit as $n \to \infty$. Since $E[(X \wedge n)I_D]$ is nondecreasing in $n$ for each $D \in \mathcal{D}$, the same is true of $E[\hat X_n I_D]$. Thus, for any $n \ge 0$, $E[(\hat X_{n+1} - \hat X_n)I_D] \ge 0$ for any $D \in \mathcal{D}$. Taking $D = \{\hat X_{n+1} - \hat X_n < 0\}$ implies that $P\{\hat X_{n+1} \ge \hat X_n\} = 1$. Therefore, the sequence $(\hat X_n)$ converges a.s., and we denote the limit by $\hat X_\infty$.
We show that $\hat X_\infty$ satisfies the two properties, (i) and (ii), required of $E[X|\mathcal{D}]$. First, $\hat X_\infty$ is $\mathcal{D}$-measurable because it is the limit of a sequence of $\mathcal{D}$-measurable random variables. Secondly, for any $D \in \mathcal{D}$, the sequences of random variables $(X \wedge n)I_D$ and $\hat X_n I_D$ are a.s. nondecreasing and nonnegative, so by the monotone convergence theorem (Theorem 11.6.6) and (10.1):

$$E[X I_D] = \lim_{n\to\infty} E[(X \wedge n)I_D] = \lim_{n\to\infty} E[\hat X_n I_D] = E[\hat X_\infty I_D].$$

So property (ii), $E[(X - \hat X_\infty)I_D] = 0$, is also satisfied. Existence is proved in case $P\{X \ge 0\} = 1$.

For the general case, $X$ can be represented as $X = X_+ - X_-$, where $X_+$ and $X_-$ are nonnegative with finite means. By the case already proved, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ exist, and, of course, they satisfy conditions (i) and (ii) in Definition 10.1.3. Therefore, with $E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$, it is a simple matter to check that $E[X|\mathcal{D}]$ also satisfies conditions (i) and (ii), as required.

Proposition 10.1.5 Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{A}$ and $\mathcal{D}$ be sub-σ-algebras of $\mathcal{F}$.

1. (Consistency with definition based on projection) If $E[X^2] < \infty$ and $\mathcal{V} = \{g(Y) : g$ is Borel measurable such that $E[g(Y)^2] < \infty\}$, then $E[X|Y]$, defined as the MMSE projection of $X$ onto $\mathcal{V}$ (also written as $\Pi_{\mathcal{V}}(X)$) is equal to $E[X|\sigma(Y)]$.
2. (Linearity) If $E[X]$ and $E[Y]$ are finite, then $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.
3. (Tower property) If $E[X]$ is finite and $\mathcal{A} \subset \mathcal{D} \subset \mathcal{F}$, then $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$. (In particular, $E[E[X|\mathcal{D}]] = E[X]$.)
4. (Positivity preserving) If $E[X]$ is finite and $X \ge 0$ a.s. then $E[X|\mathcal{D}] \ge 0$ a.s.
5. (L1 contraction property) $E[|E[X|\mathcal{D}]|] \le E[|X|]$.
6. (L1 continuity) If $E[X_n]$ is finite for all $n$ and $E[|X_n - X_\infty|] \to 0$, then $E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] \to 0$.
7. (Pull out property) If $X$ is $\mathcal{D}$-measurable and $E[XY]$ and $E[Y]$ are finite, then $E[XY|\mathcal{D}] = XE[Y|\mathcal{D}]$.

Proof. (Consistency with definition based on projection) Suppose $X$ and $\mathcal{V}$ are as in part 1.
Then, by definition, $E[X|Y] \in \mathcal{V}$ and $E[(X - E[X|Y])Z] = 0$ for any $Z \in \mathcal{V}$. As mentioned above, a random variable has the form $g(Y)$ if and only if it is $\sigma(Y)$-measurable. In particular, $\mathcal{V}$ is simply the set of $\sigma(Y)$-measurable random variables $Z$ such that $E[Z^2] < \infty$. Thus, $E[X|Y]$ is $\sigma(Y)$-measurable, and $E[(X - E[X|Y])Z] = 0$ for any $\sigma(Y)$-measurable random variable $Z$ such that $E[Z^2] < \infty$. As a special case, $E[(X - E[X|Y])I_D] = 0$ for any $D \in \sigma(Y)$. Thus, $E[X|Y]$ satisfies conditions (i) and (ii) in Definition 10.1.3 of $E[X|\sigma(Y)]$. So $E[X|Y] = E[X|\sigma(Y)]$.

(Linearity Property) (This is similar to the proof of linearity for projections, Proposition 3.2.2.) It suffices to check that the linear combination $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}]$ satisfies the two conditions that define $E[aX + bY|\mathcal{D}]$. First, $E[X|\mathcal{D}]$ and $E[Y|\mathcal{D}]$ are both $\mathcal{D}$-measurable, so their linear combination is also $\mathcal{D}$-measurable. Secondly, if $D \in \mathcal{D}$, then $E[(X - E[X|\mathcal{D}])I_D] = E[(Y - E[Y|\mathcal{D}])I_D] = 0$, from which it follows that

$$E[(aX + bY - (aE[X|\mathcal{D}] + bE[Y|\mathcal{D}]))I_D] = aE[(X - E[X|\mathcal{D}])I_D] + bE[(Y - E[Y|\mathcal{D}])I_D] = 0.$$

Therefore, $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.

(Tower Property) (This is similar to the proof of Proposition 3.2.3, about projections onto nested subspaces.) It suffices to check that $E[E[X|\mathcal{D}]|\mathcal{A}]$ satisfies the two conditions that define $E[X|\mathcal{A}]$. First, $E[E[X|\mathcal{D}]|\mathcal{A}]$ itself is a conditional expectation given $\mathcal{A}$, so it is $\mathcal{A}$-measurable. Second, let $D \in \mathcal{A}$. Now $X - E[E[X|\mathcal{D}]|\mathcal{A}] = (X - E[X|\mathcal{D}]) + (E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}])$, and (use the fact $D \in \mathcal{D}$): $E[(X - E[X|\mathcal{D}])I_D] = 0$ and $E[(E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}])I_D] = 0$. Adding these last two equations yields $E[(X - E[E[X|\mathcal{D}]|\mathcal{A}])I_D] = 0$. Therefore, $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$.

The proofs of linearity and tower property are nearly identical to the proofs of the same properties for projections (Propositions 3.2.2 and 3.2.3) and are left to the reader.

(Positivity preserving) Suppose $E[X]$ is finite and $X \ge 0$ a.s. Let $D = \{E[X|\mathcal{D}] < 0\}$. Then $D \in \mathcal{D}$ because $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable.
So $E[E[X|\mathcal{D}]I_D] = E[XI_D] \ge 0$, while $P\{E[X|\mathcal{D}]I_D \le 0\} = 1$. Hence, $P\{E[X|\mathcal{D}]I_D = 0\} = 1$, which is to say that $E[X|\mathcal{D}] \ge 0$ a.s.

(L1 contraction property) (This property is a special case of the conditional version of Jensen's inequality, established in a starred homework problem. Here a different proof is given.) The variable $X$ can be represented as $X = X_+ - X_-$, where $X_+$ is the positive part of $X$ and $X_-$ is the negative part of $X$, given by $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$. Since $X$ is assumed to have a finite mean, the same is true of $X_\pm$. Moreover, $E[E[X_\pm|\mathcal{D}]] = E[X_\pm]$, and by the linearity property, $E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$. By the positivity preserving property, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are both nonnegative a.s., so $E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}] \ge |E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|$ a.s. (The inequality is strict for $\omega$ such that both $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are strictly positive.) Therefore,
$$E[|X|] = E[X_+] + E[X_-] = E[E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}]] \ge E[|E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|] = E[|E[X|\mathcal{D}]|],$$
and the L1 contraction property is proved.

(L1 continuity) Since for any $n$, $|X_\infty| \le |X_n| + |X_n - X_\infty|$, the hypotheses imply that $X_\infty$ has a finite mean. By linearity and the L1 contraction property,
$$E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] = E[|E[X_n - X_\infty|\mathcal{D}]|] \le E[|X_n - X_\infty|],$$
which implies the L1 continuity property.

(Pull out property) The pull out property will be proved first under the added assumption that $X$ and $Y$ are nonnegative random variables. Clearly $XE[Y|\mathcal{D}]$ is $\mathcal{D}$-measurable. Let $D \in \mathcal{D}$. It remains to show that
$$E[XYI_D] = E[XE[Y|\mathcal{D}]I_D]. \tag{10.2}$$
If $X$ has the form $I_{D_1}$ for $D_1 \in \mathcal{D}$, then (10.2) becomes $E[YI_{D \cap D_1}] = E[E[Y|\mathcal{D}]I_{D \cap D_1}]$, which holds by the definition of $E[Y|\mathcal{D}]$ and the fact that $D \cap D_1 \in \mathcal{D}$. Equation (10.2) is thus also true if $X$ is a finite linear combination of random variables of the form $I_{D_1}$, that is, if $X$ is a simple $\mathcal{D}$-measurable random variable. More generally, a nonnegative $\mathcal{D}$-measurable random variable $X$ is the a.s. limit of a nondecreasing sequence of nonnegative simple $\mathcal{D}$-measurable random variables $X_n$.
Now (10.2) holds for $X$ replaced by $X_n$:
$$E[X_n Y I_D] = E[X_n E[Y|\mathcal{D}] I_D]. \tag{10.3}$$
Also, $X_n Y I_D$ is a nondecreasing sequence converging to $XYI_D$ a.s., and $X_n E[Y|\mathcal{D}] I_D$ is a nondecreasing sequence converging to $XE[Y|\mathcal{D}]I_D$ a.s. By the monotone convergence theorem, taking $n \to \infty$ on both sides of (10.3) yields (10.2). This proves the pull out property under the added assumption that $X$ and $Y$ are nonnegative.

In the general case, $X = X_+ - X_-$, where $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$, and similarly $Y = Y_+ - Y_-$. The hypotheses imply $E[X_\pm Y_\pm]$ and $E[Y_\pm]$ are finite, so that $E[X_\pm Y_\pm|\mathcal{D}] = X_\pm E[Y_\pm|\mathcal{D}]$, and therefore
$$E[X_\pm Y_\pm I_D] = E[X_\pm E[Y_\pm|\mathcal{D}] I_D], \tag{10.4}$$
where in these equations, the sign on both appearances of $X$ should be the same, and the sign on both appearances of $Y$ should be the same. The left side of (10.2) can be expressed as a linear combination of terms of the form $E[X_\pm Y_\pm I_D]$:
$$E[XYI_D] = E[X_+Y_+I_D] - E[X_+Y_-I_D] - E[X_-Y_+I_D] + E[X_-Y_-I_D].$$
Similarly, the right side of (10.2) can be expressed as a linear combination of terms of the form $E[X_\pm E[Y_\pm|\mathcal{D}] I_D]$. Therefore, (10.2) follows from (10.4).

10.2 Martingales with respect to filtrations

A filtration of a σ-algebra $\mathcal{F}$ is a sequence of sub-σ-algebras $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ of $\mathcal{F}$, such that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for $n \ge 0$. If $Y = (Y_n : n \ge 0)$ or $Y = (Y_n : n \ge 1)$ is a sequence of random variables on $(\Omega, \mathcal{F}, P)$, the filtration generated by $Y$, often written as $\boldsymbol{\mathcal{F}}^Y = (\mathcal{F}^Y_n : n \ge 0)$, is defined by letting $\mathcal{F}^Y_n = \sigma(Y_k : k \le n)$. (If there is no variable $Y_0$ defined, we take $\mathcal{F}^Y_0$ to be the trivial σ-algebra, $\mathcal{F}^Y_0 = \{\emptyset, \Omega\}$, representing no observations.) In practice, a filtration represents a sequence of observations or measurements. If the filtration is generated by a random process, then the information available at time $n$ represents observation of the random process up to time $n$.
A random process $(X_n : n \ge 0)$ is adapted to a filtration $\boldsymbol{\mathcal{F}}$ if $X_n$ is $\mathcal{F}_n$-measurable for each $n \ge 0$.

Definition 10.2.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space with a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. Let $Y = (Y_n : n \ge 0)$ be a sequence of random variables adapted to $\boldsymbol{\mathcal{F}}$. Then $Y$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$ if for all $n \ge 0$:
(0) $Y_n$ is $\mathcal{F}_n$-measurable (i.e. the process $Y$ is adapted to $\boldsymbol{\mathcal{F}}$),
(i) $E[|Y_n|] < \infty$,
(ii) $E[Y_{n+1}|\mathcal{F}_n] = Y_n$ a.s.
Similarly, $Y$ is a submartingale relative to $\boldsymbol{\mathcal{F}}$ if (0) and (i) are true and $E[Y_{n+1}|\mathcal{F}_n] \ge Y_n$ a.s., and $Y$ is a supermartingale relative to $\boldsymbol{\mathcal{F}}$ if (0) and (i) are true and $E[Y_{n+1}|\mathcal{F}_n] \le Y_n$ a.s.

Some comments are in order. Note that condition (ii) in the definition of a martingale implies condition (0), because conditional expectations with respect to a σ-algebra are random variables measurable with respect to the σ-algebra.

Note that if $Y = (Y_n : n \ge 0)$ is a martingale with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$, then $Y$ is also a martingale with respect to the filtration generated by $Y$ itself. Indeed, for each $n$, $Y_0, \ldots, Y_n$ are $\mathcal{F}_n$-measurable, whereas $\mathcal{F}^Y_n$ is the smallest σ-algebra with respect to which $Y_0, \ldots, Y_n$ are measurable, so $\mathcal{F}^Y_n \subset \mathcal{F}_n$. Therefore, the tower property of conditional expectation, the fact that $Y$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$, and the fact that $Y_n$ is $\mathcal{F}^Y_n$-measurable imply
$$E[Y_{n+1}|\mathcal{F}^Y_n] = E[E[Y_{n+1}|\mathcal{F}_n]|\mathcal{F}^Y_n] = E[Y_n|\mathcal{F}^Y_n] = Y_n.$$
Thus, in practice, if $Y$ is said to be a martingale and no filtration $\boldsymbol{\mathcal{F}}$ is specified, at least $Y$ is a martingale with respect to the filtration it generates.

Note that if $Y$ is a martingale with respect to a filtration $\boldsymbol{\mathcal{F}}$, then for any $n, k \ge 0$,
$$E[Y_{n+k+1}|\mathcal{F}_n] = E[E[Y_{n+k+1}|\mathcal{F}_{n+k}]|\mathcal{F}_n] = E[Y_{n+k}|\mathcal{F}_n].$$
Therefore, by induction on $k$ for $n$ fixed:
$$E[Y_{n+k}|\mathcal{F}_n] = Y_n \tag{10.5}$$
for $n, k \ge 0$.

Example 10.2.2 Suppose $(U_i : i \ge 1)$ is a collection of independent random variables, each with mean zero.
Let $S_0 = 0$ and for $n \ge 1$, $S_n = \sum_{i=1}^n U_i$. Let $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ denote the filtration generated by $S$: $\mathcal{F}_n = \sigma(S_0, \ldots, S_n)$. Equivalently, $\boldsymbol{\mathcal{F}}$ is the filtration generated by $(U_i : i \ge 1)$: $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \sigma(U_1, \ldots, U_n)$ for $n \ge 1$. Then $S = (S_n : n \ge 0)$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$:
$$E[S_{n+1}|\mathcal{F}_n] = E[U_{n+1}|\mathcal{F}_n] + E[S_n|\mathcal{F}_n] = 0 + S_n = S_n.$$

Example 10.2.3 Suppose $S = (S_n : n \ge 0)$ and $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ are defined as in Example 10.2.2 in terms of a sequence of independent random variables $U = (U_i : i \ge 1)$. Suppose in addition that $\mathrm{Var}(U_i) = \sigma^2$ for some finite constant $\sigma^2$. Finally, let $M_n = S_n^2 - n\sigma^2$ for $n \ge 0$. Then $M = (M_n : n \ge 0)$ is a martingale relative to $\boldsymbol{\mathcal{F}}$. Indeed, $M$ is adapted to $\boldsymbol{\mathcal{F}}$. Since $S_{n+1} = S_n + U_{n+1}$, we have $M_{n+1} = M_n + 2S_nU_{n+1} + U_{n+1}^2 - \sigma^2$, so that
$$E[M_{n+1}|\mathcal{F}_n] = E[M_n|\mathcal{F}_n] + 2S_nE[U_{n+1}|\mathcal{F}_n] + E[U_{n+1}^2 - \sigma^2|\mathcal{F}_n] = M_n + 2S_nE[U_{n+1}] + E[U_{n+1}^2 - \sigma^2] = M_n.$$

Example 10.2.4 $M(n) = e^{\theta S_n}/E[e^{\theta X_1}]^n$, in case the $X$'s are iid.

Example 10.2.5 (Branching process) Let $G_n$ denote the number of individuals in the $n$th generation. Suppose the number of offspring per individual is represented by a random variable $X$. Select $a > 0$ so that $E[a^X] = 1$. Then $a^{G_n}$ is a martingale.

Example 10.2.6 (Cumulative innovations process) Let $M_n = Y_n - \sum_{k=0}^{n-1} E[Y_{k+1} - Y_k|Y_0, \ldots, Y_k]$.

Example 10.2.7 (Doob martingale) $Y_n = E[\Phi|\mathcal{F}_n]$. For example, $n$ nodes, $m$ edges in a graph, $X_i = I_{\{i\text{th edge exists}\}}$, and the filtration is generated by $X_0, X_1, \ldots$

Definition 10.2.8 A martingale difference sequence $(D_n : n \ge 1)$ relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ is a sequence of random variables $(D_n : n \ge 1)$ such that
(0) $(D_n : n \ge 1)$ is adapted to $\boldsymbol{\mathcal{F}}$ (i.e. $D_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$),
(i) $E[|D_n|] < \infty$ for $n \ge 1$,
(ii) $E[D_{n+1}|\mathcal{F}_n] = 0$ a.s. for all $n \ge 0$.
Equivalently, $(D_n : n \ge 1)$ has the form $D_n = M_n - M_{n-1}$ for $n \ge 1$, for some $(M_n : n \ge 0)$ which is a martingale with respect to $\boldsymbol{\mathcal{F}}$.

Definition 10.2.9 A random process $(H_n : n \ge 1)$ is said to be predictable with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $H_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. (Sometimes this is called "one-step" predictable, because $\mathcal{F}_n$ determines $H$ one step ahead.)

Example 10.2.10 Suppose $(D_n : n \ge 1)$ is a martingale difference sequence and $(H_k : k \ge 1)$ is predictable, both relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. We claim that the new process $\widetilde{D} = (\widetilde{D}_n : n \ge 1)$ defined by $\widetilde{D}_n = H_nD_n$ is also a martingale difference sequence with respect to $\boldsymbol{\mathcal{F}}$. Indeed, it is adapted, has finite means, and
$$E[H_{n+1}D_{n+1}|\mathcal{F}_n] = H_{n+1}E[D_{n+1}|\mathcal{F}_n] = 0,$$
where we pulled out the $\mathcal{F}_n$-measurable random variable $H_{n+1}$ from the conditional expectation given $\mathcal{F}_n$. An interpretation is that $D_n$ is the net gain to a gambler if one dollar is staked on the outcome of a fair game in round $n$, and so $H_nD_n$ is the net gain if $H_n$ dollars are staked on round $n$. The requirement that $(H_k : k \ge 1)$ be predictable means that the gambler must decide how much to stake in round $n$ based only on information available at the end of round $n - 1$. It would be an unfair advantage if the gambler already knew $D_n$ when deciding how much money to stake in round $n$.

If the initial reserves of the gambler were some constant $M_0$, then the reserves of the gambler after $n$ rounds would be given by:
$$M_n = M_0 + \sum_{k=1}^n H_kD_k.$$
Then $(M_n : n \ge 0)$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$. The random variables $H_kD_k$, $1 \le k \le n$, are orthogonal. Also, $E[(H_kD_k)^2] = E[E[(H_kD_k)^2|\mathcal{F}_{k-1}]] = E[H_k^2E[D_k^2|\mathcal{F}_{k-1}]]$. Therefore,
$$E[(M_n - M_0)^2] = \sum_{k=1}^n E[H_k^2E[D_k^2|\mathcal{F}_{k-1}]].$$

10.3 Azuma-Hoeffding inequality

One of the simplest inequalities for martingales is the Azuma-Hoeffding inequality.
It is proven in this section, and applications to prove concentration inequalities for some combinatorial problems are given. (See McDiarmid's survey paper.)

Lemma 10.3.1 Suppose $D$ is a random variable with $E[D] = 0$ and $P\{|D - b| \le d\} = 1$ for some constant $b$. Then for any $\alpha \in \mathbb{R}$, $E[e^{\alpha D}] \le e^{(\alpha d)^2/2}$.

Proof. Since $D$ has mean zero and $D$ lies in the interval $[b - d, b + d]$ with probability one, the interval must contain zero, so $|b| \le d$. To avoid trivial cases we assume that $|b| < d$. Since $e^{\alpha x}$ is convex in $x$, the value of $e^{\alpha x}$ for $x \in [b - d, b + d]$ is bounded above by the linear function that is equal to $e^{\alpha x}$ at the endpoints, $x = b \pm d$, of the interval:
$$e^{\alpha x} \le \frac{x - b + d}{2d}e^{\alpha(b+d)} + \frac{b + d - x}{2d}e^{\alpha(b-d)}. \tag{10.6}$$
Since $D$ lies in that interval with probability one, (10.6) remains true if $x$ is replaced by the random variable $D$. Taking expectations on both sides and using $E[D] = 0$ yields
$$E[e^{\alpha D}] \le \frac{d - b}{2d}e^{\alpha(b+d)} + \frac{b + d}{2d}e^{\alpha(b-d)}. \tag{10.7}$$
The proof is completed by showing that the right side of (10.7) is less than or equal to $e^{(\alpha d)^2/2}$ for any $|b| < d$. Letting $u = \alpha d$ and $\theta = b/d$, the inequality to be proved becomes $f(u) \le u^2/2$, for $u \in \mathbb{R}$ and $|\theta| < 1$, where
$$f(u) = \ln\left(\frac{(1 - \theta)e^{u(1+\theta)} + (1 + \theta)e^{u(-1+\theta)}}{2}\right).$$
Taylor's formula implies that $f(u) = f(0) + f'(0)u + \frac{f''(v)u^2}{2}$ for some $v$ in the interval with endpoints $0$ and $u$. Elementary, but somewhat tedious, calculations show that
$$f'(u) = \frac{(1 - \theta^2)(e^u - e^{-u})}{(1 - \theta)e^u + (1 + \theta)e^{-u}}$$
and
$$f''(u) = \frac{4(1 - \theta^2)}{[(1 - \theta)e^u + (1 + \theta)e^{-u}]^2} = \frac{1}{\cosh^2(u + \beta)},$$
where $\beta = \frac{1}{2}\ln\left(\frac{1-\theta}{1+\theta}\right)$. Note that $f(0) = f'(0) = 0$, and $f''(u) \le 1$ for all $u \in \mathbb{R}$. Therefore, $f(u) \le u^2/2$ for all $u \in \mathbb{R}$, as was to be shown.

Definition 10.3.2 A random process $(B_n : n \ge 0)$ is said to be predictable with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $B_0$ is a constant and $B_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$.
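Returning to Lemma 10.3.1: the right side of (10.7) is exactly the moment generating function of the mean-zero distribution supported on the two endpoints $b \pm d$, so the lemma can be spot-checked numerically on that extreme case. The following is a minimal sketch in Python; the function names, the grid of test values, and the choice $d = 1$ are illustrative assumptions, not part of the notes.

```python
import math

def two_point_mgf(alpha, b, d):
    """Right side of (10.7): E[exp(alpha*D)] for the mean-zero D on {b-d, b+d}."""
    p_high = (d - b) / (2 * d)  # P{D = b + d}
    p_low = (b + d) / (2 * d)   # P{D = b - d}
    return p_high * math.exp(alpha * (b + d)) + p_low * math.exp(alpha * (b - d))

def hoeffding_bound(alpha, d):
    """Bound claimed by Lemma 10.3.1: exp((alpha*d)^2 / 2)."""
    return math.exp((alpha * d) ** 2 / 2)

def worst_ratio(d=1.0, steps=41):
    """Largest value of (10.7) divided by the bound over a grid of (b, alpha)."""
    worst = 0.0
    for i in range(steps):
        b = -0.95 * d + 1.9 * d * i / (steps - 1)   # b ranges over (-d, d)
        for j in range(steps):
            alpha = -5.0 + 10.0 * j / (steps - 1)   # alpha ranges over [-5, 5]
            ratio = two_point_mgf(alpha, b, d) / hoeffding_bound(alpha, d)
            worst = max(worst, ratio)
    return worst

if __name__ == "__main__":
    print("largest ratio on the grid:", worst_ratio())
```

On this grid the ratio should never exceed one (up to floating point rounding), with equality at $\alpha = 0$. A check like this does not replace the proof, but it is a quick way to catch a mistyped constant in (10.6) or (10.7).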
Proposition 10.3.3 (Azuma-Hoeffding inequality with centering) Let $(Y_n : n \ge 0)$ be a martingale and $(B_n : n \ge 0)$ be a predictable process, both with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$, such that $P[|Y_{n+1} - B_{n+1}| \le d_{n+1}] = 1$ for all $n \ge 0$. Then
$$P\{|Y_n - Y_0| \ge \lambda\} \le 2\exp\left(-\frac{\lambda^2}{2\sum_{i=1}^n d_i^2}\right).$$

Proof. Let $n \ge 1$. The idea is to write $Y_n = Y_n - Y_{n-1} + Y_{n-1}$, to use the tower property of conditional expectation, and to apply Lemma 10.3.1, conditionally on $\mathcal{F}_{n-1}$, to the random variable $Y_n - Y_{n-1}$ with $d = d_n$ and $b = B_n - Y_{n-1}$. This yields:
$$E[e^{\alpha(Y_n - Y_0)}] = E[E[e^{\alpha(Y_n - Y_{n-1} + Y_{n-1} - Y_0)}|\mathcal{F}_{n-1}]] = E[e^{\alpha(Y_{n-1} - Y_0)}E[e^{\alpha(Y_n - Y_{n-1})}|\mathcal{F}_{n-1}]] \le E[e^{\alpha(Y_{n-1} - Y_0)}]e^{(\alpha d_n)^2/2}.$$
Thus, by induction on $n$,
$$E[e^{\alpha(Y_n - Y_0)}] \le e^{(\alpha^2/2)\sum_{i=1}^n d_i^2}.$$
The remainder of the proof is essentially the Chernoff inequality: for $\alpha > 0$,
$$P\{Y_n - Y_0 \ge \lambda\} \le E[e^{\alpha(Y_n - Y_0 - \lambda)}] \le e^{(\alpha^2/2)\sum_{i=1}^n d_i^2 - \alpha\lambda}.$$
Finally, taking $\alpha$ to make this bound as tight as possible, i.e. $\alpha = \frac{\lambda}{\sum_{i=1}^n d_i^2}$, yields
$$P\{Y_n - Y_0 \ge \lambda\} \le \exp\left(-\frac{\lambda^2}{2\sum_{i=1}^n d_i^2}\right).$$
Similarly, $P\{Y_n - Y_0 \le -\lambda\}$ satisfies the same bound because the previous bound applies for $(Y_n)$ replaced by $(-Y_n)$, yielding the proposition.

Definition 10.3.4 A function $f$ of $n$ variables $x_1, \ldots, x_n$ is said to satisfy the Lipschitz condition with constant $c$ if $|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)| \le c$ for any $x_1, \ldots, x_n$, $i$, and $y_i$.$^3$

Proposition 10.3.5 (McDiarmid's inequality) Suppose $F = f(X_1, \ldots, X_n)$, where $f$ satisfies the Lipschitz condition with constant $c$, and $X_1, \ldots, X_n$ are independent random variables. Then
$$P\{|F - E[F]| \ge \lambda\} \le 2\exp\left(-\frac{2\lambda^2}{nc^2}\right).$$

Proof. Let $(Z_k : 0 \le k \le n)$ denote the Doob martingale defined by $Z_k = E[F|\mathcal{F}^X_k]$, where, as usual, $\mathcal{F}^X_k = \sigma(X_j : 1 \le j \le k)$ is the filtration generated by $(X_k)$. Note that $\mathcal{F}^X_0$ is the trivial σ-algebra $\{\emptyset, \Omega\}$, corresponding to no observations, so $Z_0 = E[F]$. Also, $Z_n = F$.
In words, $Z_k$ is the expected value of $F$, given that the first $k$ $X$'s are revealed. For $0 \le k \le n - 1$, let
$$g_k(x_1, \ldots, x_k, x_{k+1}) = E[f(x_1, \ldots, x_{k+1}, X_{k+2}, \ldots, X_n)].$$
Note that $Z_{k+1} = g_k(X_1, \ldots, X_{k+1})$. Since $f$ satisfies the Lipschitz condition with constant $c$, the same is true of $g_k$. In particular, for $x_1, \ldots, x_k$ fixed, the set of possible values (i.e. range) of $g_k(x_1, \ldots, x_{k+1})$ as $x_{k+1}$ varies lies within some interval (depending on $x_1, \ldots, x_k$) with length at most $c$. We define $m_k(x_1, \ldots, x_k)$ to be the midpoint of the smallest such interval:
$$m_k(x_1, \ldots, x_k) = \frac{\sup_{x_{k+1}} g_k(x_1, \ldots, x_{k+1}) + \inf_{x_{k+1}} g_k(x_1, \ldots, x_{k+1})}{2}$$
and let $B_{k+1} = m_k(X_1, \ldots, X_k)$. Then $B$ is a predictable process and $|Z_{k+1} - B_{k+1}| \le \frac{c}{2}$ with probability one. Thus, the Azuma-Hoeffding inequality with centering can be applied with $d_i = \frac{c}{2}$ for all $i$, giving the desired result.

Example 10.3.6 Let $V = \{v_1, \ldots, v_n\}$ be a finite set of cardinality $n \ge 1$. For each $i, j$ with $1 \le i < j \le n$, suppose that $Z_{i,j}$ is a Bernoulli random variable with parameter $p$, where $0 \le p \le 1$. Suppose that the $Z$'s are mutually independent. Let $G = (V, E)$ be a random graph, such that for $i < j$, there is an undirected edge between vertices $v_i$ and $v_j$ (i.e. $v_i$ and $v_j$ are neighbors) if and only if $Z_{i,j} = 1$. Equivalently, the set of edges is $E = \{\{i, j\} : i < j \text{ and } Z_{i,j} = 1\}$. An independent set in the graph is a set of vertices, no two of which are neighbors. Let $\mathcal{I} = \mathcal{I}(G)$ denote the maximum of the cardinalities of all independent sets for $G$. Note that $\mathcal{I}$ is a random variable, because the graph is random. We shall apply McDiarmid's inequality to find a concentration bound for $\mathcal{I}(G)$. Note that $\mathcal{I}(G) = F((Z_{i,j} : 1 \le i < j \le n))$, for an appropriate function $F$.
We could write a computer program for computing $F$, for example by cycling through all subsets of $V$, seeing which ones are independent sets, and reporting the largest cardinality of the independent sets. However, there is no need to be so explicit about what $F$ is. Observe next that changing any one of the $Z$'s would change $\mathcal{I}(G)$ by at most one. In particular, if there is an independent set in a graph, and if one edge is added to the graph, then at most one vertex would have to be removed from the independent set for the original graph to obtain one for the new graph. Thus, $F$ satisfies the Lipschitz condition with constant $c = 1$. Thus, by McDiarmid's inequality with $c = 1$ and $m = n(n - 1)/2$ variables,
$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge \lambda\} \le 2\exp\left(-\frac{4\lambda^2}{n(n - 1)}\right).$$

$^3$ Equivalently, $|f(x) - f(y)| \le c\,d_H(x, y)$, where $d_H(x, y)$ denotes the Hamming distance, which is the number of coordinates in which $x$ and $y$ differ. In the analysis of functions of a continuous variable, the Euclidean distance is used instead of the Hamming distance.

More thought yields a tighter bound. For $1 \le i \le n$, let $X_i = (Z_{i,i+1}, Z_{i,i+2}, \ldots, Z_{i,n})$. In words, for each $i$, $X_i$ determines which vertices with index larger than $i$ are neighbors of vertex $v_i$. Of course $\mathcal{I}$ is also determined by $X_1, \ldots, X_n$. Moreover, if any one of the $X$'s changes, $\mathcal{I}$ changes by at most one. That is, $\mathcal{I}$ can be expressed as a function of the $n$ variables $X_1, \ldots, X_n$, such that the function satisfies the Lipschitz condition with constant $c = 1$. Therefore, by McDiarmid's inequality with $c = 1$ and $n$ variables,$^4$
$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge \lambda\} \le 2\exp\left(-\frac{2\lambda^2}{n}\right).$$
For example, if $\lambda = a\sqrt{n}$, we have
$$P\{|\mathcal{I} - E[\mathcal{I}]| \ge a\sqrt{n}\} \le 2\exp(-2a^2)$$
whenever $n \ge 1$, $0 \le p \le 1$, and $a > 0$.
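The brute-force program mentioned in Example 10.3.6 is easy to sketch for small graphs, and it gives a direct way to spot-check the claim that changing a single $Z_{i,j}$ changes the independence number by at most one. The following Python sketch is illustrative only: the function names, the 0-based vertex labels, and the representation of the edge set as a set of pairs `(i, j)` with `i < j` are assumptions made here, and the exhaustive search is practical only for small $n$.

```python
import random
from itertools import combinations

def independence_number(n, edges):
    """Largest size of a vertex set containing no edge (brute force over subsets)."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            # subset is sorted, so each pair (u, v) has u < v, matching `edges`
            if all((u, v) not in edges for u, v in combinations(subset, 2)):
                return size
    return 0

def random_edges(n, p, rng):
    """Edge set of the random graph: pair (i, j), i < j, is present w.p. p."""
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}

def max_change_from_one_flip(n, edges):
    """Largest change in the independence number when one Z_{i,j} is flipped."""
    base = independence_number(n, edges)
    worst = 0
    for pair in combinations(range(n), 2):
        flipped = edges ^ {pair}  # add the edge if absent, remove it if present
        worst = max(worst, abs(independence_number(n, flipped) - base))
    return worst

if __name__ == "__main__":
    rng = random.Random(0)
    edges = random_edges(7, 0.4, rng)
    print("independence number:", independence_number(7, edges))
    print("max change from one flip:", max_change_from_one_flip(7, edges))
```

For any graph the second printed value should be at most one, which is the Lipschitz condition with $c = 1$ used in the example.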
McDiarmid's inequality can similarly be applied to obtain concentration inequalities for many other numbers associated with graphs, such as the size of a maximum matching (a matching is a set of edges, no two of which have a node in common), chromatic index (number of colors needed to color all edges so that all edges containing a single vertex are different colors), chromatic number (number of colors needed to color all vertices so that neighbors are different colors), minimum number of edges that need to be cut to break the graph into two equal size components, and so on.

$^4$ Since $X_n$ is degenerate, we could use $n - 1$ instead of $n$, but it makes little difference.

10.4 Stopping times and the optional sampling theorem

Let $X = (X_k : k \ge 0)$ be a martingale with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$. Note that $E[X_{k+1}] = E[E[X_{k+1}|\mathcal{F}_k]] = E[X_k]$. So, by induction on $n$, $E[X_n] = E[X_0]$ for all $n \ge 0$.

A useful interpretation of a martingale $X = (X_k : k \ge 0)$ is that $X_k$ is the reserve (amount of money on hand) that a gambler playing a fair game at each time step has after $k$ time steps, if $X_0$ is the initial reserve. (If the gambler is allowed to go into debt, the reserve can be negative.) The condition $E[X_{k+1}|\mathcal{F}_k] = X_k$ means that, given the knowledge that is observable up to time $k$, the expected reserve after the next game is equal to the reserve at time $k$. The equality $E[X_n] = E[X_0]$ has the natural interpretation that the expected reserve of the gambler after $n$ games have been played is equal to the expected initial reserve $E[X_0]$.

What happens if the gambler stops after a random number, $T$, of games? Is it true that $E[X_T] = E[X_0]$?

Example 10.4.1 Suppose that $X_n = W_1 + \cdots + W_n$, where $P\{W_k = 1\} = P\{W_k = -1\} = 0.5$ for all $k$, and the $W$'s are independent. Let $T$ be the random time:
$$T = \begin{cases} 3 & \text{if } W_1 + W_2 + W_3 = 1 \\ 0 & \text{else.} \end{cases}$$
Then $X_T = 1$ with probability 3/8, and $X_T = 0$ otherwise. Hence, $E[X_T] = 3/8$.
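Because Example 10.4.1 involves only three fair $\pm 1$ steps, the value $E[X_T] = 3/8$ can be confirmed by enumerating the eight equally likely outcomes. The short Python sketch below does this with exact fractions; the variable and function names are illustrative assumptions, and $T$ is taken exactly as displayed in the example.

```python
from fractions import Fraction
from itertools import product

def stopped_value(w):
    """X_T for Example 10.4.1: T = 3 if W1 + W2 + W3 = 1, and T = 0 otherwise."""
    t = 3 if sum(w) == 1 else 0
    return sum(w[:t])  # X_0 = 0 (empty sum) and X_3 = W1 + W2 + W3

# The eight equally likely outcomes (W1, W2, W3), each with probability 1/8.
outcomes = list(product([-1, 1], repeat=3))
prob = Fraction(1, len(outcomes))

expected_x_T = sum(prob * stopped_value(w) for w in outcomes)  # mean at the random time T
expected_x_3 = sum(prob * sum(w) for w in outcomes)            # mean at the fixed time 3

if __name__ == "__main__":
    print("E[X_T] =", expected_x_T)
    print("E[X_3] =", expected_x_3)
```

The enumeration gives $E[X_T] = 3/8$, while the mean at the fixed time 3 is $E[X_3] = 0 = E[X_0]$. The gain comes entirely from letting the decision to stop at time zero depend on $W_1, W_2, W_3$.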
Does Example 10.4.1 give a realistic strategy for a gambler to obtain a strictly positive expected payoff from a fair game? To implement the strategy, the gambler should stop gambling after $T$ games. However, the event $\{T = 0\}$ depends on the outcomes $W_1$, $W_2$, and $W_3$. Thus, at time zero, the gambler is required to make a decision about whether to stop before any games are played based on the outcomes of the first three games. Unless the gambler can somehow predict the future, the gambler will be unable to implement the strategy of stopping play after $T$ games.

Intuitively, a random time corresponds to an implementable stopping strategy if the gambler has enough information after $n$ games to tell whether to play future games. That type of condition is captured by the notion of optional stopping time, defined as follows.

Definition 10.4.2 An optional stopping time $T$ relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$ is a random variable with values in $\mathbb{Z}_+$ such that for any $n \ge 0$, $\{T \le n\} \in \mathcal{F}_n$.

The intuitive interpretation of the condition $\{T \le n\} \in \mathcal{F}_n$ is that the gambler should have enough information by time $n$ to know whether to stop by time $n$. Since σ-algebras are closed under set complements, the condition in the definition of an optional stopping time is equivalent to requiring that, for any $n \ge 0$, $\{T > n\} \in \mathcal{F}_n$. This means that the gambler should have enough information by time $n$ to know whether to continue gambling strictly beyond time $n$.

Example 10.4.3 Let $(X_n : n \ge 0)$ be a random process adapted to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. Let $A$ be some fixed (Borel measurable) set, and let $T = \min\{n \ge 0 : X_n \in A\}$. Then $T$ is a stopping time relative to $\boldsymbol{\mathcal{F}}$. Indeed, $\{T \le n\} = \{X_k \in A \text{ for some } k \text{ with } 0 \le k \le n\}$. So $\{T \le n\}$ is an event determined by $(X_0, \ldots, X_n)$, which is in $\mathcal{F}_n$ because $X$ is adapted to the filtration.

Example 10.4.4 Suppose $W_1, W_2, \ldots$ are independent Bernoulli random variables with $p = 0.5$, modeling fair coin flips.
Suppose that if a gambler stakes some money at the beginning of the $n$th round, then if $W_n = 1$, the gambler wins back the stake and an additional equal amount. If $W_n = 0$, the gambler loses the money staked. Let $X_n$ denote the reserve of the gambler after $n$ rounds. For simplicity, we assume that the gambler can borrow money as needed, and that the initial reserve of the gambler is zero. So $X_0 = 0$. Suppose the gambler adopts the following strategy. The gambler continues playing until the first win, and in each round until stopping, the gambler stakes the amount of money needed to ensure that the reserve at the time the gambler stops is one dollar. For example, the gambler initially borrows one dollar, and stakes it on the first outcome. If $W_1 = 1$ the gambler's reserve (money in hand minus the amount borrowed) is one dollar, and the gambler stops, so $T = 1$ and $X_T = 1$. Since no money is staked after time $T$, $X_k = X_T$ for all $k \ge T$. If the gambler loses in the first round (i.e. $W_1 = 0$), then $X_1 = -1$. In that case, the gambler keeps playing, and, next, borrows two more dollars and stakes them on the second outcome. If $W_2 = 1$ the gambler's reserve is one dollar, and the gambler stops. So $T = 2$ and $X_T = 1$. If the gambler loses in the second round (i.e. $W_2 = 0$), then $X_2 = -3$. In that case, the gambler keeps playing, and, next, borrows four more dollars and stakes them on the third outcome, and so on. The random process $(X_n : n \ge 0)$ is a martingale. For this strategy, the number of rounds, $T$, that the gambler plays has the geometric distribution with parameter $p = 0.5$. Thus, $E[T] = 2$. In particular, $T$ is finite with probability one. Thus, $X_T = 1$ a.s., while $X_0 = 0$. Thus, $E[X_T] \ne E[X_0]$. This strategy does not require the gambler to be able to predict the future, and the gambler is always up one dollar after stopping. But don't run out and start playing this strategy, expecting to make money for sure.
There is a catch: the amount borrowed can be very large. Indeed, let us compute the expectation of $B$, the total amount borrowed before the final win. If $T = 1$ then $B = 1$ (only the dollar borrowed in the first round is counted). If $T = 2$ then $B = 3$ (the first dollar in the first round, and two more in the second). In general, $B = 2^T - 1$. Thus,
$$E[B] = \sum_{n=1}^\infty (2^n - 1)P\{T = n\} = \sum_{n=1}^\infty (2^n - 1)2^{-n} = \sum_{n=1}^\infty (1 - 2^{-n}) = +\infty.$$
That is, the expected amount of money the gambler will need to borrow is infinite.

Proposition 10.4.5 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, then $E[X_{T \wedge n}] = E[X_0]$ for any $n$.

Proof. Note that
$$X_{T \wedge (n+1)} - X_{T \wedge n} = \begin{cases} 0 & \text{if } T \le n \\ X_{n+1} - X_n & \text{if } T > n \end{cases} \;=\; (X_{n+1} - X_n)I_{\{T > n\}}.$$
Using this and the tower property of conditional expectation yields
$$E[X_{T \wedge (n+1)} - X_{T \wedge n}] = E[E[(X_{n+1} - X_n)I_{\{T > n\}}|\mathcal{F}_n]] = E[E[(X_{n+1} - X_n)|\mathcal{F}_n]I_{\{T > n\}}] = 0$$
because $E[(X_{n+1} - X_n)|\mathcal{F}_n] = 0$. Therefore, $E[X_{T \wedge (n+1)}] = E[X_{T \wedge n}]$ for all $n \ge 0$. So by induction on $n$, $E[X_{T \wedge n}] = E[X_0]$ for all $n \ge 0$.

The following corollary follows immediately from Proposition 10.4.5.

Corollary 10.4.6 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, then $E[X_0] = \lim_{n\to\infty} E[X_{T \wedge n}]$. In particular, if
$$\lim_{n\to\infty} E[X_{T \wedge n}] = E[X_T] \tag{10.8}$$
then $E[X_T] = E[X_0]$.

By Corollary 10.4.6, the trick to establishing $E[X_T] = E[X_0]$ comes down to proving (10.8). Note that $X_{T \wedge n} \to X_T$ a.s. as $n \to \infty$, so (10.8) is simply requiring the convergence of the means to the mean of the limit, for an a.s. convergent sequence of random variables. There are several different sufficient conditions for this to happen, involving conditions on the martingale $X$, the stopping time $T$, or both.
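The doubling strategy of Example 10.4.4 shows concretely how (10.8) can fail even though Proposition 10.4.5 holds. The Python sketch below enumerates all $2^n$ equally likely flip sequences and computes the exact distribution of the stopped reserve $X_{T \wedge n}$; the function names are illustrative assumptions, and the stake rule (stake $1 - X_k$ in round $k + 1$) is one way to encode "stake the amount needed to end with one dollar."

```python
from fractions import Fraction
from itertools import product

def reserve_after(flips):
    """Reserve X_{T ^ n} after following the doubling strategy on n = len(flips) flips."""
    reserve = 0
    for w in flips:
        stake = 1 - reserve         # stake chosen so that a win leaves reserve 1
        if w == 1:
            return reserve + stake  # first win: stop with reserve equal to 1
        reserve -= stake            # loss: reserve after k losses is -(2^k - 1)
    return reserve                  # no win within the n rounds

def stopped_distribution(n):
    """Exact distribution of X_{T ^ n} as a dict {value: probability}."""
    dist = {}
    for flips in product([0, 1], repeat=n):
        x = reserve_after(flips)
        dist[x] = dist.get(x, 0) + Fraction(1, 2 ** n)
    return dist

def stopped_mean(n):
    """E[X_{T ^ n}], computed exactly."""
    return sum(x * p for x, p in stopped_distribution(n).items())

if __name__ == "__main__":
    for n in range(6):
        print(n, stopped_distribution(n), stopped_mean(n))
```

For every $n$ the mean is exactly zero, in agreement with Proposition 10.4.5: the reserve equals $1$ with probability $1 - 2^{-n}$ and $-(2^n - 1)$ with probability $2^{-n}$. Since $X_T = 1$ a.s., the limit of the means is $0$ while the mean of the limit is $1$, so (10.8) fails here even though $T$ is finite a.s., and some additional condition on $X$ or $T$ is needed.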
For example:

Corollary 10.4.7 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, and if $T$ is bounded (so $P\{T \le n\} = 1$ for some $n$), then $E[X_T] = E[X_0]$.

Proof. If $P\{T \le n\} = 1$ then $T \wedge n = T$ with probability one, so $E[X_{T \wedge n}] = E[X_T]$. Therefore, the corollary follows from Proposition 10.4.5.

Corollary 10.4.8 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, and if there is a random variable $Z$ such that $|X_n| \le Z$ a.s. for all $n$, and $E[Z] < \infty$, then $E[X_T] = E[X_0]$.

Proof. Let $\epsilon > 0$. Since $E[Z] < \infty$, there exists $\delta > 0$ so that if $A$ is any set with $P[A] \le \delta$, then $E[ZI_A] < \epsilon$. Since $X_{T \wedge n} \to X_T$ a.s., we also have $X_{T \wedge n} \to X_T$ in probability. Therefore, if $n$ is sufficiently large, $P\{|X_{T \wedge n} - X_T| \ge \epsilon\} \le \delta$. For such $n$,
$$|X_{T \wedge n} - X_T| \le \epsilon + |X_{T \wedge n} - X_T|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}} \le \epsilon + 2|Z|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}. \tag{10.9}$$
Now $E[|Z|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}] < \epsilon$ by the choice of $\delta$ and $n$. So taking expectations of each side of (10.9) yields $E[|X_{T \wedge n} - X_T|] \le 3\epsilon$. Both $X_{T \wedge n}$ and $X_T$ have finite means, because both have absolute values less than or equal to $Z$, so
$$|E[X_{T \wedge n}] - E[X_T]| = |E[X_{T \wedge n} - X_T]| \le E[|X_{T \wedge n} - X_T|] \le 3\epsilon.$$
Since $\epsilon$ was an arbitrary positive number, the corollary is proved.

Corollary 10.4.9 Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$. Suppose
(i) there is a constant $c$ such that $E[\,|X_{n+1} - X_n|\,|\mathcal{F}_n] \le c$ for $n \ge 0$,
(ii) $T$ is a stopping time such that $E[T] < \infty$.
Then $E[X_T] = E[X_0]$. If, instead, $(X_n : n \ge 0)$ is a submartingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying (i) and (ii), then $E[X_T] \ge E[X_0]$.

Proof. Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying (i) and (ii). Looking at the proof of Corollary 10.4.8, we see that it is enough to show that there is a random variable $Z$ such that $E[Z] < +\infty$ and $|X_{T \wedge n}| \le Z$ for all $n \ge 0$. Let
$$Z = |X_0| + |X_1 - X_0| + \cdots + |X_T - X_{T-1}|.$$
Obviously, $|X_{T \wedge n}| \le Z$ for all $n \ge 0$, so it remains to show that $E[Z] < \infty$.
But, since $\{i \le T\} = \{T > i - 1\} \in \mathcal{F}_{i-1}$,
$$\begin{aligned}
E[Z] &= E[|X_0|] + E\left[\sum_{i=1}^\infty |X_i - X_{i-1}|I_{\{i \le T\}}\right] \\
&= E[|X_0|] + \sum_{i=1}^\infty E\left[E\left[|X_i - X_{i-1}|I_{\{i \le T\}} \,\big|\, \mathcal{F}_{i-1}\right]\right] \\
&= E[|X_0|] + \sum_{i=1}^\infty E\left[I_{\{i \le T\}}E\left[|X_i - X_{i-1}| \,\big|\, \mathcal{F}_{i-1}\right]\right] \\
&\le E[|X_0|] + c\sum_{i=1}^\infty P\{i \le T\} = E[|X_0|] + cE[T] < \infty.
\end{aligned}$$
The first statement of the corollary is proved. If instead $X$ is a submartingale, then a minor variation of Proposition 10.4.5 yields that $E[X_{T \wedge n}] \ge E[X_0]$. The proof for the first part of the corollary, already given, shows that conditions (i) and (ii) imply that $E[X_{T \wedge n}] \to E[X_T]$ as $n \to \infty$. Therefore, $E[X_T] \ge E[X_0]$.

Martingale inequalities offer a way to provide upper and lower bounds on the completion times of algorithms. The following example shows how a lower bound can be found for a particular game.

Example 10.4.10 Consider the following game. There is an urn, initially with $k_1$ red marbles and $k_2$ blue marbles. A player takes turns until the urn is empty, and the goal of the player is to minimize the expected number of turns required. At the beginning of each turn, the player can remove a set of marbles, and the set must be one of four types: one red, one blue, two reds, or two blues. After removing the set of marbles, a fair coin is flipped. If tails appears, the turn is over. If heads appears, then some marbles are added back to the urn, according to Table 10.1. Our goal will be to find a lower bound on $E[T]$, where $T$ is the number of turns needed by the player until the urn is empty. The bound should hold for any strategy the player adopts. Let $X_n$ denote the total number of marbles in the urn after $n$ turns. If the player elects to remove only one marble during a turn (either red or blue) then with probability one half, two marbles are put back. Hence, for either set with one marble, the expected change in the total number of marbles in the urn is zero. If the player elects to remove two reds or two blues, then with probability one half, three marbles are put back into the urn.
For these turns, the expected change in the number of marbles in the urn is $-0.5$. Hence, for any choice of $u_n$ (representing the decision of the player for the $(n + 1)$th turn),
$$E[X_{n+1}|X_n, u_n] \ge X_n - 0.5 \quad \text{on } \{T > n\}.$$

Table 10.1: Rules of the marble game

| Set removed | Set returned to urn on "heads" |
| one red | one red and one blue |
| one blue | one red and one blue |
| two reds | three blues |
| two blues | three reds |

That is, the drift of $X_n$ towards zero is at most 0.5 in magnitude, so we suspect that no strategy can empty the urn in average time less than $(k_1 + k_2)/0.5$. In fact, this result is true, and it is now proved. Let $M_n = X_{n \wedge T} + \frac{n \wedge T}{2}$. By the observations above, $M$ is a submartingale. Furthermore, $|M_{n+1} - M_n| \le 2$. Either $E[T] = +\infty$ or $E[T] < \infty$. If $E[T] = +\infty$ then the inequality to be proved, $E[T] \ge 2(k_1 + k_2)$, is trivially true, so suppose $E[T] < \infty$. Then by Corollary 10.4.9, $E[M_T] \ge E[M_0] = k_1 + k_2$. Also, $M_T = \frac{T}{2}$ with probability one, so $E[T] \ge 2(k_1 + k_2)$, as claimed.

10.5 Notes

Material on the Azuma-Hoeffding inequality and McDiarmid's method can be found in McDiarmid's tutorial article [7].

10.6 Problems

10.1 Two martingales associated with a simple branching process
Let $Y = (Y_n : n \ge 0)$ denote a simple branching process. Thus, $Y_n$ is the number of individuals in the $n$th generation, $Y_0 = 1$, the numbers of offspring of different individuals are independent, and each has the same distribution as a random variable $X$.
(a) Identify a constant $\theta$ so that $G_n = \frac{Y_n}{\theta^n}$ is a martingale.
(b) Let $\mathcal{E}$ denote the event of eventual extinction, and let $\alpha = P\{\mathcal{E}\}$. Show that $P[\mathcal{E}|Y_0, \ldots, Y_n] = \alpha^{Y_n}$. Thus, $M_n = \alpha^{Y_n}$ is a martingale.
(c) Using the fact $E[M_1] = E[M_0]$, find an equation for $\alpha$. (Note: Problem 4.29 shows that $\alpha$ is the smallest positive solution to the equation, and $\alpha < 1$ if and only if $E[X] > 1$.)

10.2 A covering problem
Consider a linear array of $n$ cells.
Suppose that $m$ base stations are randomly placed among the cells, such that the locations of the base stations are independent, and uniformly distributed among the $n$ cell locations. Let $r$ be a positive integer. Call a cell $i$ covered if there is at least one base station at some cell $j$ with $|i - j| \le r - 1$. Thus, each base station (except those near the edge of the array) covers $2r - 1$ cells. Note that there can be more than one base station at a given cell, and interference between base stations is ignored.
(a) Let $F$ denote the number of cells covered. Apply the method of bounded differences based on the Azuma-Hoeffding inequality to find an upper bound on $P[|F - E[F]| \ge \gamma]$.
(b) (This part is related to the coupon collector problem and may not have anything to do with martingales.) Rather than fixing the number of base stations, $m$, let $X$ denote the number of base stations needed until all cells are covered. In case $r = 1$ we have seen that $P[X \ge n\ln n + cn] \to \exp(-e^{-c})$ (the coupon collector's problem). For general $r \ge 1$, find $g_1(r)$ and $g_2(r)$ so that for any $\epsilon > 0$, $P[X \ge (g_2(r) + \epsilon)n\ln n] \to 0$ and $P[X \le (g_1(r) - \epsilon)n\ln n] \to 0$. (Ideally you can find $g_1 = g_2$, but if not, it would be nice if they are close.)

10.3 Doob decomposition
Suppose $X = (X_k : k \ge 0)$ is an integrable (meaning $E[|X_k|] < \infty$ for each $k$) sequence adapted to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$.
(a) Show that there is a sequence $B = (B_k : k \ge 0)$ which is predictable relative to $\boldsymbol{\mathcal{F}}$ (which means that $B_0$ is a constant and $B_k$ is $\mathcal{F}_{k-1}$-measurable for $k \ge 1$) and a mean zero martingale $M = (M_k : k \ge 0)$, such that $X_k = B_k + M_k$ for all $k$.
(b) Are the sequences $B$ and $M$ uniquely determined by $X$ and $\boldsymbol{\mathcal{F}}$?

10.4 On uniform integrability
(a) Show that if $(X_i : i \in I)$ and $(Y_i : i \in I)$ are both uniformly integrable collections of random variables with the same index set $I$, then $(Z_i : i \in I)$, where $Z_i = X_i + Y_i$ for all $i$, is also a uniformly integrable collection.
(b) Show that a collection of random variables $(X_i : i \in I)$ is uniformly integrable if and only if there exists a convex increasing function $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ with $\lim_{c \to \infty} \frac{\varphi(c)}{c} = +\infty$ and a constant $K$, such that $E[\varphi(|X_i|)] \le K$ for all $i \in I$.

10.5 Stopping time properties
(a) Show that if $S$ and $T$ are stopping times for some filtration $\boldsymbol{\mathcal{F}}$, then $S \wedge T$, $S \vee T$, and $S + T$ are also stopping times.
(b) Show that if $\boldsymbol{\mathcal{F}}$ is a filtration and $X = (X_k : k \ge 0)$ is the random sequence defined by $X_k = I_{\{T \le k\}}$ for some random time $T$ with values in $\mathbb{Z}_+$, then $T$ is a stopping time if and only if $X$ is $\boldsymbol{\mathcal{F}}$-adapted.
(c) If $T$ is a stopping time for a filtration $\boldsymbol{\mathcal{F}}$, recall that $\mathcal{F}_T$ is the set of events $A$ such that $A \cap \{T \le n\} \in \mathcal{F}_n$ for all $n$. (Or, for discrete time, the set of events $A$ such that $A \cap \{T = n\} \in \mathcal{F}_n$ for all $n$.) Show that (i) $\mathcal{F}_T$ is a $\sigma$-algebra, (ii) $T$ is $\mathcal{F}_T$ measurable, and (iii) if $X$ is an adapted process then $X_T$ is $\mathcal{F}_T$ measurable.

10.6 A stopped random walk
Let $W_1, W_2, \ldots$ be a sequence of independent, identically distributed mean zero random variables. To avoid triviality, assume $P\{W_1 = 0\} \ne 1$. Let $S_0 = 0$ and $S_n = W_1 + \cdots + W_n$ for $n \ge 1$. Fix a constant $c > 0$ and let $\tau = \min\{n \ge 0 : |S_n| \ge c\}$. The goal of this problem is to show that $E[S_\tau] = 0$.
(a) Show that $E[S_\tau] = 0$ if there is a constant $D$ so that $P[|W_i| > D] = 0$. (Hint: Invoke a version of the optional stopping theorem.)
(b) In view of part (a), we need to address the case that the $W$’s are not bounded. Let

$$\tilde{W}_n = \begin{cases} W_n & \text{if } |W_n| \le 2c \\ a & \text{if } W_n > 2c \\ -b & \text{if } W_n < -2c \end{cases}$$

where the constants $a$ and $b$ are selected so that $a \ge 2c$, $b \ge 2c$, and $E[\tilde{W}_i] = 0$. Let $\tilde{S}_n = \tilde{W}_1 + \cdots + \tilde{W}_n$ for $n \ge 0$. Note that if $\tau \ge n$ and if $\tilde{W}_n \ne W_n$, then $\tau = n$. Thus, $\tau$ defined above also satisfies $\tau = \min\{n \ge 0 : |\tilde{S}_n| \ge c\}$. Let $\tilde{\sigma}^2 = \mathrm{Var}(\tilde{W}_i)$ and let $M_n = \tilde{S}_n^2 - n\tilde{\sigma}^2$. Show that $M$ is a martingale. Hence, $E[M_{\tau \wedge n}] = 0$ for all $n$. Conclude that $E[\tau] < \infty$.
(c) Show that $E[S_\tau] = 0$.
(Hint: Use part (b) and invoke a version of the optional stopping theorem.)

10.7 Bounding the value of a game
Consider the following game. Initially a jar has $a_o$ red marbles and $b_o$ blue marbles. On each turn, the player removes a set of marbles, consisting of either one or two marbles of the same color, and then flips a fair coin. If heads appears on the coin, then if one marble was removed, one of each color is added to the jar, and if two marbles were removed, then three marbles of the other color are added back to the jar. If tails appears, no marbles are added back to the jar. The turn is then over. Play continues until the jar is empty after a turn, and then the game ends. Let $\tau$ be the number of turns in the game. The goal of the player is to minimize $E[\tau]$. A strategy is a rule to decide what set of marbles to remove at the beginning of each turn.
(a) Find a lower bound on $E[\tau]$ that holds no matter what strategy the player selects.
(b) Suggest a strategy that approximately minimizes $E[\tau]$, and for that strategy, find an upper bound on $E[\tau]$.

10.8 On the size of a maximum matching in a random bipartite graph
Given $1 \le d < n$, let $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ be disjoint sets of cardinality $n$, and let $G$ be a bipartite random graph with vertex set $U \cup V$, such that if $V_i$ denotes the set of neighbors of $u_i$, then $V_1, \ldots, V_n$ are independent, and each is uniformly distributed over the set of all $\binom{n}{d}$ subsets of $V$ of cardinality $d$. A matching for $G$ is a subset of edges $M$ such that no two edges in $M$ have a common vertex. Let $Z$ denote the maximum of the cardinalities of the matchings for $G$.
(a) Find bounds $a$ and $b$, with $0 < a \le b < n$, so that $a \le E[Z] \le b$.
(b) Give an upper bound on $P\{|Z - E[Z]| \ge \gamma\sqrt{n}\}$, for $\gamma > 0$, showing that for fixed $d$, the distribution of $Z$ is concentrated about its mean as $n \to \infty$.
(c) Suggest a greedy algorithm for finding a large cardinality matching.
10.9 * Equivalence of having the form $g(Y)$ and being measurable relative to the sigma algebra generated by $Y$
Let $Y$ and $Z$ be random variables on the same probability space. The purpose of this problem is to establish that $Z = g(Y)$ for some Borel measurable function $g$ if and only if $Z$ is $\sigma(Y)$ measurable.
(“only if” part) Suppose $Z = g(Y)$ for a Borel measurable function $g$, and let $c \in \mathbb{R}$. It must be shown that $\{Z \le c\} \in \sigma(Y)$. Since $g$ is a Borel measurable function, by definition, $A = \{y : g(y) \le c\}$ is a Borel subset of $\mathbb{R}$.
(a) Show that $\{Z \le c\} = \{Y \in A\}$.
(b) Using the definition of Borel sets, show that $\{Y \in A\} \in \sigma(Y)$ for any Borel set $A$. The “only if” part follows.
(“if” part) Suppose $Z$ is $\sigma(Y)$ measurable. It must be shown that $Z$ has the form $g(Y)$ for some Borel measurable function $g$.
(c) Prove this first in the special case that $Z$ has the form of an indicator function: $Z = I_B$, for some event $B$, which satisfies $B \in \sigma(Y)$. (Hint: Appeal to the definition of $\sigma(Y)$.)
(d) Prove the “if” part in general. (Hint: $Z$ can be written as the supremum of a countable set of random variables, with each being a constant times an indicator function: $Z = \sup_n q_n I_{\{Z \le q_n\}}$, where $q_1, q_2, \ldots$ is an enumeration of the set of rational numbers.)

10.10 * Regular conditional distributions
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. A conditional probability such as $P[X \le c \mid \mathcal{D}]$ for a fixed constant $c$ can sometimes have different versions, but any two such versions are equal with probability one. Roughly speaking, the idea of regular conditional distributions, defined next, is to select a version of $P[X \le c \mid \mathcal{D}]$ for every real number $c$ so that, as a function of $c$ for $\omega$ fixed, the result is a valid CDF (i.e. nondecreasing, right-continuous, with limit zero at $-\infty$ and limit one at $+\infty$). The difficulty is that there are uncountably many choices of $c$. Here is the definition.
A regular conditional CDF of $X$ given $\mathcal{D}$, denoted by $F_{X|\mathcal{D}}(c|\omega)$, is a function of $(c, \omega) \in \mathbb{R} \times \Omega$ such that:
(1) for each $c \in \mathbb{R}$ fixed, $F_{X|\mathcal{D}}(c|\omega)$ is a $\mathcal{D}$ measurable function of $\omega$,
(2) for each $\omega$ fixed, as a function of $c$, $F_{X|\mathcal{D}}(c|\omega)$ is a valid CDF,
(3) for any $c \in \mathbb{R}$, $F_{X|\mathcal{D}}(c|\omega)$ is a version of $P[X \le c \mid \mathcal{D}]$.
The purpose of this problem is to prove the existence of a regular conditional CDF. For each rational number $q$, let $\Phi(q) = P[X \le q \mid \mathcal{D}]$. That is, for each rational number $q$, we pick $\Phi(q)$ to be one particular version of $P[X \le q \mid \mathcal{D}]$. Thus, $\Phi(q)$ is a random variable, and so we can also write it as $\Phi(q, \omega)$ to make explicit the dependence on $\omega$. By the positivity preserving property of conditional expectations, $P\{\Phi(q) > \Phi(q')\} = 0$ if $q < q'$. Let $\{q_1, q_2, \ldots\}$ denote the set of rational numbers, listed in some order. The event $N$ defined by

$$N = \bigcup_{n,m \,:\, q_n < q_m} \{\Phi(q_n) > \Phi(q_m)\}$$

is a countable union of events of probability zero, and thus has probability zero. Modify $\Phi(q, \omega)$ for $\omega \in N$ by letting $\Phi(q, \omega) = F_o(q)$ for $\omega \in N$ and all rational $q$, where $F_o$ is an arbitrary, fixed CDF. Then for any $c \in \mathbb{R}$ and $\omega \in \Omega$, let

$$\Phi(c, \omega) = \inf_{q \ge c} \Phi(q, \omega).$$

Show that $\Phi$ so defined is a regular conditional CDF of $X$ given $\mathcal{D}$.

10.11 * An even more general definition of conditional expectation, and the conditional version of Jensen’s inequality
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{D}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Let $F_{X|\mathcal{D}}(c|\omega)$ be a regular conditional CDF of $X$ given $\mathcal{D}$. Then for each $\omega$, we can define $E[X \mid \mathcal{D}]$ at $\omega$ to equal the mean for the CDF $(F_{X|\mathcal{D}}(c|\omega) : c \in \mathbb{R})$, which is contained in the extended real line $\mathbb{R} \cup \{-\infty, +\infty\}$. Symbolically:

$$E[X \mid \mathcal{D}](\omega) = \int_{\mathbb{R}} c \, F_{X|\mathcal{D}}(dc|\omega).$$

Show that, in the special case that $E[|X|] < \infty$, this definition is consistent with the one given previously. As an application, the following conditional version of Jensen’s inequality holds: If $\phi$ is a convex function on $\mathbb{R}$, then $E[\phi(X) \mid \mathcal{D}] \ge \phi(E[X \mid \mathcal{D}])$ a.s.
The proof is given by applying the ordinary Jensen’s inequality for each $\omega$ fixed, for the regular conditional CDF of $X$ given $\mathcal{D}$ evaluated at $\omega$.

Chapter 11
Appendix

11.1 Some notation

The following notational conventions are used in these notes.

$A^c$ = complement of $A$
$AB = A \cap B$
$A \subset B \leftrightarrow$ any element of $A$ is also an element of $B$
$A - B = AB^c$
$\bigcup_{i=1}^{\infty} A_i = \{a : a \in A_i \text{ for some } i\}$
$\bigcap_{i=1}^{\infty} A_i = \{a : a \in A_i \text{ for all } i\}$
$a \vee b = \max\{a, b\} = \begin{cases} a & \text{if } a \ge b \\ b & \text{if } a < b \end{cases}$
$a \wedge b = \min\{a, b\}$
$a_+ = a \vee 0 = \max\{a, 0\}$
$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{else} \end{cases}$
$(a, b) = \{x : a < x < b\}$
$(a, b] = \{x : a < x \le b\}$
$[a, b) = \{x : a \le x < b\}$
$[a, b] = \{x : a \le x \le b\}$
$\mathbb{Z}$ = set of integers
$\mathbb{Z}_+$ = set of nonnegative integers
$\mathbb{R}$ = set of real numbers
$\mathbb{R}_+$ = set of nonnegative real numbers
$\mathbb{C}$ = set of complex numbers
$A_1 \times \cdots \times A_n = \{(a_1, \ldots, a_n)^T : a_i \in A_i \text{ for } 1 \le i \le n\}$
$A^n = \underbrace{A \times \cdots \times A}_{n \text{ times}}$
$\lfloor t \rfloor$ = greatest integer $n$ such that $n \le t$
$\lceil t \rceil$ = least integer $n$ such that $n \ge t$
$A \triangleq$ expression: denotes that $A$ is defined by the expression

All the trigonometric identities required in these notes can be easily derived from the two identities:

$$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$$
$$\sin(a + b) = \sin(a)\cos(b) + \cos(a)\sin(b)$$

and the facts $\cos(-a) = \cos(a)$ and $\sin(-b) = -\sin(b)$.

A set of numbers is countably infinite if the numbers in the set can be listed in a sequence $(x_i : i = 1, 2, \ldots)$. For example, the set of rational numbers is countably infinite, but the set of all real numbers in any interval of positive length is not countably infinite.

11.2 Convergence of sequences of numbers

We begin with some basic definitions. Let $(x_n) = (x_1, x_2, \ldots)$ and $(y_n) = (y_1, y_2, \ldots)$ be sequences of numbers and let $x$ be a number. By definition, $x_n$ converges to $x$ as $n$ goes to infinity if for each $\epsilon > 0$ there is an integer $n_\epsilon$ so that $|x_n - x| < \epsilon$ for every $n \ge n_\epsilon$. We write $\lim_{n \to \infty} x_n = x$ to denote that $x_n$ converges to $x$.

Example 11.2.1 Let $x_n = \frac{2n+4}{n^2+1}$.
Let us verify that $\lim_{n \to \infty} x_n = 0$. The inequality $|x_n| < \epsilon$ holds if $2n + 4 < \epsilon(n^2 + 1)$. Therefore it holds if $2n + 4 \le \epsilon n^2$. Therefore it holds if both $2n \le \frac{\epsilon}{2} n^2$ and $4 \le \frac{\epsilon}{2} n^2$. So if $n_\epsilon = \left\lceil \max\left\{ \frac{4}{\epsilon}, \sqrt{\frac{8}{\epsilon}} \right\} \right\rceil$ then $n \ge n_\epsilon$ implies that $|x_n| < \epsilon$. So $\lim_{n \to \infty} x_n = 0$.

By definition, $(x_n)$ converges to $+\infty$ as $n$ goes to infinity if for every $K > 0$ there is an integer $n_K$ so that $x_n \ge K$ for every $n \ge n_K$. Convergence to $-\infty$ is defined in a similar way.$^1$ For example, $n^3 \to \infty$ as $n \to \infty$ and $n^3 - 2n^4 \to -\infty$ as $n \to \infty$.

$^1$ Some authors reserve the word “convergence” for convergence to a finite limit. When we say a sequence converges to $+\infty$ some would say the sequence diverges to $+\infty$.

Occasionally a two-dimensional array of numbers $(a_{m,n} : m \ge 1, n \ge 1)$ is considered. By definition, $a_{m,n}$ converges to a number $a^*$ as $m$ and $n$ jointly go to infinity if for each $\epsilon > 0$ there is $n_\epsilon > 0$ so that $|a_{m,n} - a^*| < \epsilon$ for every $m, n \ge n_\epsilon$. We write $\lim_{m,n \to \infty} a_{m,n} = a^*$ to denote that $a_{m,n}$ converges to $a^*$ as $m$ and $n$ jointly go to infinity.

Theoretical Exercise Let $a_{m,n} = 1$ if $m = n$ and $a_{m,n} = 0$ if $m \ne n$. Show that $\lim_{n \to \infty} (\lim_{m \to \infty} a_{m,n}) = \lim_{m \to \infty} (\lim_{n \to \infty} a_{m,n}) = 0$ but that $\lim_{m,n \to \infty} a_{m,n}$ does not exist.

Theoretical Exercise Let $a_{m,n} = \frac{(-1)^{m+n}}{\min(m,n)}$. Show that $\lim_{m \to \infty} a_{m,n}$ does not exist for any $n$ and $\lim_{n \to \infty} a_{m,n}$ does not exist for any $m$, but $\lim_{m,n \to \infty} a_{m,n} = 0$.

Theoretical Exercise If $\lim_{m,n \to \infty} a_{m,n} = a^*$ and $\lim_{m \to \infty} a_{m,n} = b_n$ for each $n$, then $\lim_{n \to \infty} b_n = a^*$.

The condition $\lim_{m,n \to \infty} a_{m,n} = a^*$ can be expressed in terms of convergence of sequences depending on only one index (as can all the other limits discussed in these notes) as follows. Namely, $\lim_{m,n \to \infty} a_{m,n} = a^*$ is equivalent to the following: $\lim_{k \to \infty} a_{m_k,n_k} = a^*$ whenever $((m_k, n_k) : k \ge 1)$ is a sequence of pairs of positive integers such that $m_k \to \infty$ and $n_k \to \infty$ as $k \to \infty$.
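As a small numerical illustration of the sequence criterion just stated (a sketch, not a proof), the following Python snippet evaluates the array from the first Theoretical Exercise, with entry 1 when m = n and 0 otherwise, along two sequences of index pairs that both go to infinity. The function name `a`, the horizon `K`, and the two particular pair sequences are choices made here for illustration only. The values along the two sequences settle at different numbers, which is consistent with the joint limit failing to exist.

```python
def a(m: int, n: int) -> int:
    """Array from the exercise: 1 on the diagonal, 0 elsewhere."""
    return 1 if m == n else 0


K = 50  # number of index pairs examined (arbitrary, for illustration)

# Pairs (m_k, n_k) = (k, k): both indices go to infinity along the diagonal.
diagonal = [a(k, k) for k in range(1, K + 1)]

# Pairs (m_k, n_k) = (k, 2k): both indices go to infinity off the diagonal.
off_diagonal = [a(k, 2 * k) for k in range(1, K + 1)]

print("last values along (k, k): ", diagonal[-5:])
print("last values along (k, 2k):", off_diagonal[-5:])
```

A finite computation like this cannot establish a limit statement; it only shows which pair sequences are worth using in a written argument.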
The condition that the limit $\lim_{m,n \to \infty} a_{m,n}$ exists is equivalent to the condition that the limit $\lim_{k \to \infty} a_{m_k,n_k}$ exists whenever $((m_k, n_k) : k \ge 1)$ is a sequence of pairs of positive integers such that $m_k \to \infty$ and $n_k \to \infty$ as $k \to \infty$.$^2$

A sequence $a_1, a_2, \ldots$ is said to be nondecreasing if $a_i \le a_j$ for $i < j$. Similarly a function $f$ on the real line is nondecreasing if $f(x) \le f(y)$ whenever $x < y$. The sequence is called strictly increasing if $a_i < a_j$ for $i < j$ and the function is called strictly increasing if $f(x) < f(y)$ whenever $x < y$.$^3$ A strictly increasing or strictly decreasing sequence is said to be strictly monotone, and a nondecreasing or nonincreasing sequence is said to be monotone.

The sum of an infinite sequence is defined to be the limit of the partial sums. That is, by definition,

$$\sum_{k=1}^{\infty} y_k = x \quad \text{means that} \quad \lim_{n \to \infty} \sum_{k=1}^{n} y_k = x.$$

Often we want to show that a sequence converges even if we don’t explicitly know the value of the limit. A sequence $(x_n)$ is bounded if there is a number $L$ so that $|x_n| \le L$ for all $n$. Any sequence that is bounded and monotone converges to a finite number.

Example 11.2.2 Consider the sum $\sum_{k=1}^{\infty} k^{-\alpha}$ for a constant $\alpha > 1$. For each $n$ the $n$th partial sum can be bounded by comparison to an integral, based on the fact that for $k \ge 2$, the $k$th term of the sum is less than the integral of $x^{-\alpha}$ over the interval $[k-1, k]$:

$$\sum_{k=1}^{n} k^{-\alpha} \le 1 + \int_1^n x^{-\alpha}\,dx = 1 + \frac{1 - n^{1-\alpha}}{\alpha - 1} \le 1 + \frac{1}{\alpha - 1} = \frac{\alpha}{\alpha - 1}.$$

The partial sums are also monotone nondecreasing (in fact, strictly increasing). Therefore the sum $\sum_{k=1}^{\infty} k^{-\alpha}$ exists and is finite.

$^2$ We could add here the condition that the limit should be the same for all choices of sequences, but it is automatically true. If two sequences were to yield different limits of $a_{m_k,n_k}$, a third sequence could be constructed by interleaving the first two, and $a_{m_k,n_k}$ wouldn’t be convergent for that sequence.
$^3$ We avoid simply saying “increasing,” because for some authors it means strictly increasing and for other authors it means nondecreasing. While inelegant, our approach is safer.

A sequence $(x_n)$ is a Cauchy sequence if $\lim_{m,n \to \infty} |x_m - x_n| = 0$. It is not hard to show that if $x_n$ converges to a finite limit $x$ then $(x_n)$ is a Cauchy sequence. More useful is the converse statement, called the Cauchy criteria for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$ is a Cauchy sequence then $x_n$ converges to a finite limit as $n$ goes to infinity.

Example 11.2.3 Suppose $(x_n : n \ge 1)$ is a sequence such that $\sum_{i=1}^{\infty} |x_{i+1} - x_i| < \infty$. The Cauchy criteria can be used to show that the sequence $(x_n : n \ge 1)$ is convergent. Suppose $1 \le m < n$. Then by the triangle inequality for absolute values:

$$|x_n - x_m| \le \sum_{i=m}^{n-1} |x_{i+1} - x_i|$$

or, equivalently,

$$|x_n - x_m| \le \left| \sum_{i=1}^{n-1} |x_{i+1} - x_i| - \sum_{i=1}^{m-1} |x_{i+1} - x_i| \right|. \qquad (11.1)$$

Inequality (11.1) also holds if $1 \le n \le m$. By the definition of the sum $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$, both sums on the right side of (11.1) converge to $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$ as $m, n \to \infty$, so the right side of (11.1) converges to zero as $m, n \to \infty$. Thus, $(x_n)$ is a Cauchy sequence, and it is hence convergent.

Theoretical Exercise
1. Show that if $\lim_{n \to \infty} x_n = x$ and $\lim_{n \to \infty} y_n = y$ then $\lim_{n \to \infty} x_n y_n = xy$.
2. Find the limits and prove convergence as $n \to \infty$ for the following sequences:
(a) $x_n = \frac{\cos(n^2)}{n^2+1}$, (b) $y_n = \frac{n^2}{\log n}$, (c) $z_n = \sum_{k=2}^{n} \frac{1}{k \log k}$

The minimum of a set of numbers, $A$, written $\min A$, is the smallest number in the set, if there is one. For example, $\min\{3, 5, 19, -2\} = -2$. Of course, $\min A$ is well defined if $A$ is finite (i.e. has finite cardinality). Some sets fail to have a minimum; for example, neither $\{1, 1/2, 1/3, 1/4, \ldots\}$ nor $\{0, -1, -2, \ldots\}$ has a smallest number. The infimum of a set of numbers $A$, written $\inf A$, is the greatest lower bound for $A$.
If $A$ is bounded below, then $\inf A = \max\{c : c \le a \text{ for all } a \in A\}$. For example, $\inf\{1, 1/2, 1/3, 1/4, \ldots\} = 0$. If there is no finite lower bound, the infimum is $-\infty$. For example, $\inf\{0, -1, -2, \ldots\} = -\infty$. By convention, the infimum of the empty set is $+\infty$. With these conventions, if $A \subset B$ then $\inf A \ge \inf B$. The infimum of any subset of $\mathbb{R}$ exists, and if $\min A$ exists, then $\min A = \inf A$, so the notion of infimum extends the notion of minimum to all subsets of $\mathbb{R}$.

Similarly, the maximum of a set of numbers $A$, written $\max A$, is the largest number in the set, if there is one. The supremum of a set of numbers $A$, written $\sup A$, is the least upper bound for $A$. We have $\sup A = -\inf\{-a : a \in A\}$. In particular, $\sup A = +\infty$ if $A$ is not bounded above, and $\sup \emptyset = -\infty$. The supremum of any subset of $\mathbb{R}$ exists, and if $\max A$ exists, then $\max A = \sup A$, so the notion of supremum extends the notion of maximum to all subsets of $\mathbb{R}$.

The notions of infimum and supremum of a set of numbers are useful because they exist for any set of numbers. There is a pair of related notions that generalizes the notion of limit. Not every sequence has a limit, but the following terminology is useful for describing the limiting behavior of a sequence, whether or not the sequence has a limit.

Definition 11.2.4 The liminf (also called limit inferior) of a sequence $(x_n : n \ge 1)$ is defined by

$$\liminf_{n \to \infty} x_n = \lim_{n \to \infty} \left[ \inf\{x_k : k \ge n\} \right], \qquad (11.2)$$

and the limsup (also called limit superior) is defined by

$$\limsup_{n \to \infty} x_n = \lim_{n \to \infty} \left[ \sup\{x_k : k \ge n\} \right]. \qquad (11.3)$$

The possible values of the liminf and limsup of a sequence are $\mathbb{R} \cup \{-\infty, +\infty\}$. The limit on the right side of (11.2) exists because the infimum inside the square brackets is monotone nondecreasing in $n$. Similarly, the limit on the right side of (11.3) exists. So every sequence of numbers has a liminf and limsup.
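To make (11.2) and (11.3) concrete, here is a minimal Python sketch that tabulates the tail infimum and tail supremum for an illustrative sequence chosen here (not taken from the notes), namely x_n = (-1)^n (1 + 1/n). The function names and the finite horizon `N` are assumptions of this sketch. A computer can only scan finitely many terms, but for this particular sequence the extreme value of each tail is attained at one of its first two terms, so truncating at `N` does not change the values for the starting indices used below.

```python
def x(n: int) -> float:
    """Illustrative sequence: alternates in sign, with magnitude 1 + 1/n."""
    return (-1) ** n * (1 + 1 / n)


N = 1000  # finite horizon standing in for "all k >= n" (an approximation in general)


def tail_inf(n: int) -> float:
    """inf{x_k : n <= k <= N}."""
    return min(x(k) for k in range(n, N + 1))


def tail_sup(n: int) -> float:
    """sup{x_k : n <= k <= N}."""
    return max(x(k) for k in range(n, N + 1))


starts = [1, 10, 100, 500]
infs = [tail_inf(n) for n in starts]
sups = [tail_sup(n) for n in starts]

for n, lo, hi in zip(starts, infs, sups):
    print(f"n = {n:4d}   tail inf = {lo:+.4f}   tail sup = {hi:+.4f}")
```

The tail infima increase towards -1 and the tail suprema decrease towards +1, matching the monotonicity used above to argue that the limits in (11.2) and (11.3) exist.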
Definition 11.2.5 A subsequence of a sequence $(x_n : n \ge 1)$ is a sequence of the form $(x_{k_i} : i \ge 1)$, where $k_1, k_2, \ldots$ is a strictly increasing sequence of integers. The set of limit points of a sequence is the set of all limits of convergent subsequences. The values $-\infty$ and $+\infty$ are possible limit points.

Example 11.2.6 Suppose $y_n = 121 - 25n^2$ for $n \le 100$ and $y_n = 1/n$ for $n \ge 101$. The liminf and limsup of a sequence do not depend on any finite number of terms of the sequence, so the values of $y_n$ for $n \le 100$ are irrelevant. For all $n \ge 101$, $\inf\{y_k : k \ge n\} = \inf\{1/n, 1/(n+1), \ldots\} = 0$, which trivially converges to 0 as $n \to \infty$. So the liminf of $(y_n)$ is zero. For all $n \ge 101$, $\sup\{y_k : k \ge n\} = \sup\{1/n, 1/(n+1), \ldots\} = \frac{1}{n}$, which also converges to 0 as $n \to \infty$. So the limsup of $(y_n)$ is also zero. Zero is also the only limit point of $(y_n)$.

Example 11.2.7 Consider the sequence of numbers $(2, -3/2, 4/3, -5/4, 6/5, \ldots)$, which we also write as $(x_n : n \ge 1)$ such that $x_n = \frac{(n+1)(-1)^{n+1}}{n}$. The maximum (and supremum) of the sequence is 2, and the minimum (and infimum) of the sequence is $-3/2$. But for large $n$, the sequence alternates between numbers near one and numbers near minus one. More precisely, the subsequence of odd numbered terms, $(x_{2i-1} : i \ge 1)$, converges to 1, and the subsequence of even numbered terms, $(x_{2i} : i \ge 1)$, has limit $-1$. Thus, both 1 and $-1$ are limit points of the sequence, and there aren’t any other limit points. The overall sequence itself does not converge (i.e. does not have a limit) but $\liminf_{n \to \infty} x_n = -1$ and $\limsup_{n \to \infty} x_n = +1$.

Some simple facts about the limit, liminf, limsup, and limit points of a sequence are collected in the following proposition. The proof is left to the reader.

Proposition 11.2.8 Let $(x_n : n \ge 1)$ denote a sequence of numbers.
1. The condition $\liminf_{n \to \infty} x_n \ge x_\infty$ is equivalent to the following: for any $\gamma < x_\infty$, $x_n \ge \gamma$ for all sufficiently large $n$.
2. The condition $\limsup_{n \to \infty} x_n \le x_\infty$ is equivalent to the following: for any $\gamma > x_\infty$, $x_n \le \gamma$ for all sufficiently large $n$.
3. $\liminf_{n \to \infty} x_n \le \limsup_{n \to \infty} x_n$.
4. $\lim_{n \to \infty} x_n$ exists if and only if the liminf equals the limsup, and if the limit exists, then the limit, liminf, and limsup are equal.
5. $\lim_{n \to \infty} x_n$ exists if and only if the sequence has exactly one limit point, $x^*$, and if the limit exists, it is equal to that one limit point.
6. Both the liminf and limsup of the sequence are limit points. The liminf is the smallest limit point and the limsup is the largest limit point (keep in mind that $-\infty$ and $+\infty$ are possible values of the liminf, limsup, or a limit point).

Theoretical Exercise
1. Prove Proposition 11.2.8.
2. Here’s a more challenging one. Let $r$ be an irrational constant, and let $x_n = nr - \lfloor nr \rfloor$ for $n \ge 1$. Show that every point in the interval $[0, 1]$ is a limit point of $(x_n : n \ge 1)$. (P. Bohl, W. Sierpinski, and H. Weyl independently proved a stronger result in 1909-1910: namely, the fraction of the first $n$ values falling into a subinterval converges to the length of the subinterval.)

11.3 Continuity of functions

Let $f$ be a function on $\mathbb{R}^n$ for some $n$, and let $x_o \in \mathbb{R}^n$. The function has a limit $y$ at $x_o$, and this situation is denoted by $\lim_{x \to x_o} f(x) = y$, if the following is true. Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(x) - y| \le \epsilon$ whenever $0 < \|x - x_o\| < \delta$. This convergence condition can also be expressed in terms of convergence of sequences, as follows. The condition $\lim_{x \to x_o} f(x) = y$ is equivalent to the condition $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $\mathbb{R}^n - \{x_o\}$ such that $x_n \to x_o$.

The function $f$ is said to be continuous at $x_o$, or equivalently, $x_o$ is said to be a continuity point of $f$, if $\lim_{x \to x_o} f(x) = f(x_o)$. In terms of sequences, $f$ is continuous at $x_o$ if $f(x_n) \to f(x_o)$ whenever $x_1, x_2, \ldots$ is a sequence converging to $x_o$. The function $f$ is simply said to be continuous if it is continuous at every point in $\mathbb{R}^n$.
Let $n = 1$, so consider a function $f$ on $\mathbb{R}$, and let $x_o \in \mathbb{R}$. The function has a right-hand limit $y$ at $x_o$, and this situation is denoted by $f(x_o+) = y$ or $\lim_{x \searrow x_o} f(x) = y$, if the following is true. Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(x) - y| \le \epsilon$ whenever $0 < x - x_o < \delta$. Equivalently, $f(x_o+) = y$ if $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $(x_o, +\infty)$ such that $x_n \to x_o$. The left-hand limit $f(x_o-) = \lim_{x \nearrow x_o} f(x)$ is defined similarly. If $f$ is monotone nondecreasing, then the left-hand and right-hand limits exist, and $f(x_o-) \le f(x_o) \le f(x_o+)$ for all $x_o$.

A function $f$ is called right-continuous at $x_o$ if $f(x_o) = f(x_o+)$. A function $f$ is simply called right-continuous if it is right-continuous at all points.

Definition 11.3.1 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints $a < b$ is piecewise continuous, if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $f$ is continuous over $(t_{k-1}, t_k)$ and has finite limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $f$ is piecewise continuous over $T$ if it is piecewise continuous over every bounded subinterval of $T$.

11.4 Derivatives of functions

Let $f$ be a function on $\mathbb{R}$ and let $x_o \in \mathbb{R}$. Then $f$ is differentiable at $x_o$ if the following limit exists and is finite:

$$\lim_{x \to x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$

The value of the limit is the derivative of $f$ at $x_o$, written as $f'(x_o)$. In more detail, this condition that $f$ is differentiable at $x_o$ means there is a finite value $f'(x_o)$ so that, for any $\epsilon > 0$, there exists $\delta > 0$, so that

$$\left| \frac{f(x) - f(x_o)}{x - x_o} - f'(x_o) \right| \le \epsilon$$

whenever $0 < |x - x_o| < \delta$. Alternatively, in terms of convergence of sequences, it means there is a finite value $f'(x_o)$ so that

$$\lim_{n \to \infty} \frac{f(x_n) - f(x_o)}{x_n - x_o} = f'(x_o)$$

whenever $(x_n : n \ge 1)$ is a sequence with values in $\mathbb{R} - \{x_o\}$ converging to $x_o$.
The function $f$ is differentiable if it is differentiable at all points.

The right-hand derivative of $f$ at a point $x_o$, denoted by $D_+ f(x_o)$, is defined the same way as $f'(x_o)$, except the limit is taken using only $x$ such that $x > x_o$. The extra condition $x > x_o$ is indicated by using a slanting arrow in the limit notation:

$$D_+ f(x_o) = \lim_{x \searrow x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$

Similarly, the left-hand derivative of $f$ at a point $x_o$ is $D_- f(x_o) = \lim_{x \nearrow x_o} \frac{f(x) - f(x_o)}{x - x_o}$.

Theoretical Exercise
1. Suppose $f$ is defined on an open interval containing $x_o$. Show that $f'(x_o)$ exists if and only if $D_+ f(x_o) = D_- f(x_o)$, and that if $f'(x_o)$ exists then $D_+ f(x_o) = D_- f(x_o) = f'(x_o)$.

We write $f''$ for the derivative of $f'$. For an integer $n \ge 0$ we write $f^{(n)}$ to denote the result of differentiating $f$ $n$ times.

Theorem 11.4.1 (Mean value form of Taylor’s theorem) Let $f$ be a function on an interval $(a, b)$ such that its $n$th derivative $f^{(n)}$ exists on $(a, b)$. Then for $a < x, x_0 < b$,

$$f(x) = \sum_{k=0}^{n-1} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k + \frac{f^{(n)}(y)(x - x_0)^n}{n!}$$

for some $y$ between $x$ and $x_0$.

Clearly differentiable functions are continuous. But they can still have rather odd properties, as indicated by the following example.

Example 11.4.2 Let $f(t) = t^2 \sin(1/t^2)$ for $t \ne 0$ and $f(0) = 0$. This function $f$ is a classic example of a differentiable function with a derivative function that is not continuous. To check the derivative at zero, note that $\left| \frac{f(s) - f(0)}{s} \right| \le |s| \to 0$ as $s \to 0$, so $f'(0) = 0$. The usual calculus can be used to compute $f'(t)$ for $t \ne 0$, yielding

$$f'(t) = \begin{cases} 2t \sin\left(\frac{1}{t^2}\right) - \frac{2\cos\left(\frac{1}{t^2}\right)}{t} & t \ne 0 \\ 0 & t = 0 \end{cases}$$

The derivative $f'$ is not even close to being continuous at zero. As $t$ approaches zero, the cosine term dominates, and $f'$ reaches both positive and negative values with arbitrarily large magnitude.
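The behavior of the derivative near zero in Example 11.4.2 can be seen numerically. The sketch below evaluates the formula for the derivative at two families of points chosen here for illustration: points where the cosine factor is close to +1 and points where it is close to -1. The function name `f_prime` and the variable names are assumptions of this sketch.

```python
import math


def f_prime(t: float) -> float:
    """Derivative of f(t) = t^2 sin(1/t^2) from Example 11.4.2, with f'(0) = 0."""
    if t == 0.0:
        return 0.0
    return 2 * t * math.sin(1 / t**2) - 2 * math.cos(1 / t**2) / t


for k in (1, 10, 100, 1000):
    # Here 1/t^2 = 2*k*pi, so cos(1/t^2) is about +1 and f'(t) is about -2/t.
    t_neg = 1 / math.sqrt(2 * k * math.pi)
    # Here 1/t^2 = (2k+1)*pi, so cos(1/t^2) is about -1 and f'(t) is about +2/t.
    t_pos = 1 / math.sqrt((2 * k + 1) * math.pi)
    print(f"k = {k:5d}   f'({t_neg:.5f}) = {f_prime(t_neg):9.3f}   "
          f"f'({t_pos:.5f}) = {f_prime(t_pos):9.3f}")
```

Both families of points tend to zero as k grows, while the derivative values grow in magnitude with opposite signs, so the derivative has no limit at zero even though its value at zero is 0.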
Even though the function $f$ of Example 11.4.2 is differentiable, it does not satisfy the fundamental theorem of calculus (stated in the next section). One way to rule out the wild behavior of Example 11.4.2 is to assume that $f$ is continuously differentiable, which means that $f$ is differentiable and its derivative function is continuous. For some applications, it is useful to work with functions more general than continuously differentiable ones, but for which the fundamental theorem of calculus still holds. A possible approach is to use the following condition.

Definition 11.4.3 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints $a < b$ is continuous and piecewise continuously differentiable, if $f$ is continuous over the interval, and if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $f$ is continuously differentiable over $(t_{k-1}, t_k)$ and $f'$ has finite limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or a subinterval of $\mathbb{R}$, then a function $f$ on $T$ is continuous and piecewise continuously differentiable if its restriction to any bounded interval is continuous and piecewise continuously differentiable.

Example 11.4.4 Two examples of continuous, piecewise continuously differentiable functions on $\mathbb{R}$ are: $f(t) = \min\{t^2, 1\}$ and $g(t) = |\sin(t)|$.

Example 11.4.5 The function given in Example 11.4.2 is not considered to be piecewise continuously differentiable because the derivative does not have finite limits at zero.

Theoretical Exercise
1. Suppose $f$ is a continuously differentiable function on an open bounded interval $(a, b)$. Show that if $f'$ has finite limits at the endpoints, then so does $f$.
2. Suppose $f$ is a continuous function on a closed, bounded interval $[a, b]$ such that $f'$ exists and is continuous on the open subinterval $(a, b)$.
Show that if the right-hand limit of the derivative at $a$, $f'(a+) = \lim_{x \searrow a} f'(x)$, exists, then the right-hand derivative at $a$, defined by

$$D_+ f(a) = \lim_{x \searrow a} \frac{f(x) - f(a)}{x - a}$$

also exists, and the two limits are equal.

Let $g$ be a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Thus for each vector $x \in \mathbb{R}^n$, $g(x)$ is an $m$ vector. The derivative matrix of $g$ at a point $x$, $\frac{\partial g}{\partial x}(x)$, is the $m \times n$ matrix with $ij$th entry $\frac{\partial g_i}{\partial x_j}(x)$. Sometimes for brevity we write $y = g(x)$ and think of $y$ as a variable depending on $x$, and we write the derivative matrix as $\frac{\partial y}{\partial x}(x)$.

Theorem 11.4.6 (Implicit function theorem) If $m = n$ and if $\frac{\partial y}{\partial x}$ is continuous in a neighborhood of $x_0$ and if $\frac{\partial y}{\partial x}(x_0)$ is nonsingular, then the inverse mapping $x = g^{-1}(y)$ is defined in a neighborhood of $y_0 = g(x_0)$ and

$$\frac{\partial x}{\partial y}(y_0) = \left( \frac{\partial y}{\partial x}(x_0) \right)^{-1}$$

11.5 Integration

11.5.1 Riemann integration

Let $g$ be a bounded function on a bounded interval of the form $(a, b]$. Given:
• A partition of $(a, b]$ of the form $(t_0, t_1], (t_1, t_2], \ldots, (t_{n-1}, t_n]$, where $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$
• A sampling point from each subinterval, $v_k \in (t_{k-1}, t_k]$, for $1 \le k \le n$,
the corresponding Riemann sum for $g$ is defined by

$$\sum_{k=1}^{n} g(v_k)(t_k - t_{k-1}).$$

The norm of the partition is defined to be $\max_k |t_k - t_{k-1}|$. The Riemann integral $\int_a^b g(x)\,dx$ is said to exist and its value is $I$ if the following is true. Given any $\epsilon > 0$, there is a $\delta > 0$ so that $\left| \sum_{k=1}^{n} g(v_k)(t_k - t_{k-1}) - I \right| \le \epsilon$ whenever the norm of the partition is less than or equal to $\delta$. This definition is equivalent to the following condition, expressed using convergence of sequences. The Riemann integral exists and is equal to $I$, if for any sequence of partitions, specified by $((t_1^m, t_2^m, \ldots, t_{n_m}^m) : m \ge 1)$, with corresponding sampling points $((v_1^m, \ldots, v_{n_m}^m) : m \ge 1)$, such that the norm of the $m$th partition converges to zero as $m \to \infty$, the corresponding sequence of Riemann sums converges to $I$ as $m \to \infty$.
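The definition can be tried out numerically. The following Python sketch forms Riemann sums over (a, b] using equal-width subintervals and a sampling point drawn at random from each subinterval, to reflect that the sampling points are arbitrary. The function name `riemann_sum`, the use of equal widths, the random choice of sampling points, and the test integrand x^2 on (0, 1] (whose integral is 1/3) are all assumptions of this sketch, not part of the definition.

```python
import random


def riemann_sum(g, a: float, b: float, n: int, rng: random.Random) -> float:
    """Riemann sum of g over (a, b] with n equal subintervals (t_{k-1}, t_k]
    and one randomly chosen sampling point v_k in each subinterval."""
    width = (b - a) / n
    total = 0.0
    for k in range(1, n + 1):
        t_prev = a + (k - 1) * width
        t_k = a + k * width
        # rng.random() lies in [0, 1), so v_k lies in (t_{k-1}, t_k].
        v_k = t_k - rng.random() * width
        total += g(v_k) * (t_k - t_prev)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
approximations = {
    n: riemann_sum(lambda x: x * x, 0.0, 1.0, n, rng) for n in (10, 100, 1000, 10000)
}
for n, value in approximations.items():
    print(f"n = {n:6d}   norm = {1 / n:.4f}   Riemann sum = {value:.6f}")
```

For a nondecreasing integrand such as this one, any two Riemann sums on the same equal-width partition differ by at most (g(b) - g(a)) times the norm, so the printed sums are within 1/n of 1/3 regardless of how the sampling points fall.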
The function $g$ is said to be Riemann integrable over $(a, b]$ if the integral $\int_a^b g(x)\,dx$ exists and is finite.

Next, suppose $g$ is defined over the whole real line. If for every interval $(a, b]$, $g$ is bounded over $[a, b]$ and Riemann integrable over $(a, b]$, then the Riemann integral of $g$ over $\mathbb{R}$ is defined by

$$\int_{-\infty}^{\infty} g(x)\,dx = \lim_{a,b \to \infty} \int_{-a}^{b} g(x)\,dx$$

provided that the indicated limit exists as $a, b$ jointly converge to $+\infty$. The values $+\infty$ or $-\infty$ are possible.

A function that is continuous, or just piecewise continuous, is Riemann integrable over any bounded interval. Moreover, the following is true for Riemann integration:

Theorem 11.5.1 (Fundamental theorem of calculus) Let $f$ be a continuously differentiable function on $\mathbb{R}$. Then for $a < b$,

$$f(b) - f(a) = \int_a^b f'(x)\,dx. \qquad (11.4)$$

More generally, if $f$ is continuous and piecewise continuously differentiable, (11.4) holds with $f'(x)$ replaced by the right-hand derivative, $D_+ f(x)$. (Note that $D_+ f(x) = f'(x)$ whenever $f'(x)$ is defined.)

We will have occasion to use Riemann integrals in two dimensions. Let $g$ be a bounded function on a bounded rectangle of the form $(a_1, b_1] \times (a_2, b_2]$. Given:
• A partition of $(a_1, b_1] \times (a_2, b_2]$ into $n_1 n_2$ rectangles of the form $(t_{j-1}^1, t_j^1] \times (t_{k-1}^2, t_k^2]$, where $n_i \ge 1$ and $a_i = t_0^i < t_1^i < \cdots < t_{n_i}^i = b_i$ for $i = 1, 2$
• A sampling point $(v_{jk}^1, v_{jk}^2)$ inside $(t_{j-1}^1, t_j^1] \times (t_{k-1}^2, t_k^2]$ for $1 \le j \le n_1$ and $1 \le k \le n_2$,
the corresponding Riemann sum for $g$ is

$$\sum_{j=1}^{n_1} \sum_{k=1}^{n_2} g(v_{jk}^1, v_{jk}^2)(t_j^1 - t_{j-1}^1)(t_k^2 - t_{k-1}^2).$$

The norm of the partition is $\max_{i \in \{1,2\}} \max_k |t_k^i - t_{k-1}^i|$. As in the case of one dimension, $g$ is said to be Riemann integrable over $(a_1, b_1] \times (a_2, b_2]$, and $\iint_{(a_1,b_1] \times (a_2,b_2]} g(x_1, x_2)\,dx_1\,dx_2 = I$, if the value of the Riemann sum converges to $I$ for any sequence of partitions and sampling point pairs, with the norms of the partitions converging to zero.
The above definition of a Riemann sum allows the $n_1 n_2$ sampling points to be selected arbitrarily from the $n_1 n_2$ rectangles. If, instead, the sampling points are restricted to have the form $(v_j^1, v_k^2)$, for $n_1 + n_2$ numbers $v_1^1, \ldots, v_{n_1}^1, v_1^2, \ldots, v_{n_2}^2$, we say the corresponding Riemann sum uses aligned sampling. We define a function $g$ on $[a, b] \times [a, b]$ to be Riemann integrable with aligned sampling in the same way as we defined $g$ to be Riemann integrable, except the family of Riemann sums used are the ones using aligned sampling. Since the set of sequences that must converge is more restricted for aligned sampling, a function $g$ on $[a, b] \times [a, b]$ that is Riemann integrable is also Riemann integrable with aligned sampling.

Proposition 11.5.2 A sufficient condition for $g$ to be Riemann integrable (and hence Riemann integrable with aligned sampling) over $(a_1, b_1] \times (a_2, b_2]$ is that $g$ be the restriction to $(a_1, b_1] \times (a_2, b_2]$ of a continuous function on $[a_1, b_1] \times [a_2, b_2]$. More generally, $g$ is Riemann integrable over $(a_1, b_1] \times (a_2, b_2]$ if there is a partition of $(a_1, b_1] \times (a_2, b_2]$ into finitely many subrectangles of the form $(t_{j-1}^1, t_j^1] \times (t_{k-1}^2, t_k^2]$, such that $g$ on $(t_{j-1}^1, t_j^1] \times (t_{k-1}^2, t_k^2]$ is the restriction to $(t_{j-1}^1, t_j^1] \times (t_{k-1}^2, t_k^2]$ of a continuous function on $[t_{j-1}^1, t_j^1] \times [t_{k-1}^2, t_k^2]$.

Proposition 11.5.2 is a standard result in real analysis. Its proof uses the fact that continuous functions on bounded, closed sets are uniformly continuous, from which it follows that, for any $\epsilon > 0$, there is a $\delta > 0$ so that the Riemann sums for any two partitions with norm less than or equal to $\delta$ differ by at most $\epsilon$. The Cauchy criteria for convergence of sequences of numbers is also used.
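A two-dimensional Riemann sum with aligned sampling can be sketched in a few lines of Python. The example below uses equal-width partitions in each coordinate and takes the midpoints of the subintervals as the aligned sampling coordinates; these choices, the function name `aligned_riemann_sum`, and the test integrand x y^2 on (0, 1] x (0, 1] (with integral 1/6) are assumptions made here for illustration.

```python
def aligned_riemann_sum(g, a1, b1, a2, b2, n1: int, n2: int) -> float:
    """Riemann sum of g over (a1, b1] x (a2, b2] with aligned sampling:
    the sampling point in rectangle (j, k) is (v1[j], v2[k])."""
    w1 = (b1 - a1) / n1
    w2 = (b2 - a2) / n2
    # n1 + n2 sampling coordinates in total: midpoints of the subintervals.
    v1 = [a1 + (j - 0.5) * w1 for j in range(1, n1 + 1)]
    v2 = [a2 + (k - 0.5) * w2 for k in range(1, n2 + 1)]
    return sum(g(x, y) * w1 * w2 for x in v1 for y in v2)


for n in (5, 50):
    value = aligned_riemann_sum(lambda x, y: x * y**2, 0.0, 1.0, 0.0, 1.0, n, n)
    print(f"n1 = n2 = {n:3d}   aligned Riemann sum = {value:.6f}   (integral is 1/6)")
```

Only n1 + n2 numbers are needed to specify all n1 n2 sampling points, which is the restriction that distinguishes aligned sampling from the general definition.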
11.5.2 Lebesgue integration

Lebesgue integration with respect to a probability measure is defined in the section defining the expectation of a random variable $X$ and is written as
$$E[X] = \int_\Omega X(\omega)\,P(d\omega).$$
The idea is to first define the expectation for simple random variables, then for nonnegative random variables, and then for general random variables by $E[X] = E[X_+] - E[X_-]$. The same approach can be used to define the Lebesgue integral
$$\int_{-\infty}^{\infty} g(\omega)\,d\omega$$
for Borel measurable functions $g$ on $\mathbb{R}$. Such an integral is well defined if either $\int_{-\infty}^{\infty} g_+(\omega)\,d\omega < +\infty$ or $\int_{-\infty}^{\infty} g_-(\omega)\,d\omega < +\infty$.

11.5.3 Riemann-Stieltjes integration

Let $g$ be a bounded function on a closed interval $[a, b]$ and let $F$ be a nondecreasing function on $[a, b]$. The Riemann-Stieltjes integral
$$\int_a^b g(x)\,dF(x) \qquad \text{(Riemann-Stieltjes)}$$
is defined the same way as the Riemann integral, except that the Riemann sums are changed to
$$\sum_{k=1}^{n} g(v_k)\,(F(t_k) - F(t_{k-1})).$$
Extension of the integral over the whole real line is done as it is for Riemann integration. An alternative definition of $\int_{-\infty}^{\infty} g(x)\,dF(x)$, preferred in the context of these notes, is given next.

11.5.4 Lebesgue-Stieltjes integration

Let $F$ be a CDF. As seen in Section 1.3, there is a corresponding probability measure $\tilde{P}$ on the Borel subsets of $\mathbb{R}$. Given a Borel measurable function $g$ on $\mathbb{R}$, the Lebesgue-Stieltjes integral of $g$ with respect to $F$ is defined to be the Lebesgue integral of $g$ with respect to $\tilde{P}$:
$$\text{(Lebesgue-Stieltjes)} \quad \int_{-\infty}^{\infty} g(x)\,dF(x) = \int_{-\infty}^{\infty} g(x)\,\tilde{P}(dx) \quad \text{(Lebesgue)}$$
The same notation $\int_{-\infty}^{\infty} g(x)\,dF(x)$ is used for both Riemann-Stieltjes (RS) and Lebesgue-Stieltjes (LS) integration. If $g$ is continuous and the LS integral is finite, then the integrals agree. In particular, $\int_{-\infty}^{\infty} x\,dF(x)$ is identical as either an LS or RS integral. However, for equivalence of the integrals
$$\int_\Omega g(X(\omega))\,P(d\omega) \quad \text{and} \quad \int_{-\infty}^{\infty} g(x)\,dF(x),$$
even for continuous functions $g$, it is essential that the integral on the right be understood as an LS integral.
Hence, in these notes, only the LS interpretation is used, and RS integration is not needed.

If $F$ has a corresponding pdf $f$, then
$$\text{(Lebesgue-Stieltjes)} \quad \int_{-\infty}^{\infty} g(x)\,dF(x) = \int_{-\infty}^{\infty} g(x) f(x)\,dx \quad \text{(Lebesgue)}$$
for any Borel measurable function $g$.

11.6 On the convergence of the mean

Suppose $(X_n : n \ge 1)$ is a sequence of random variables such that $X_n \xrightarrow{p.} X_\infty$, for some random variable $X_\infty$. The theorems in this section address the question of whether $E[X_n] \to E[X_\infty]$. The hypothesis $X_n \xrightarrow{p.} X_\infty$ means that for any $\epsilon > 0$ and $\delta > 0$, $P\{|X_n - X_\infty| \le \epsilon\} \ge 1 - \delta$ for all sufficiently large $n$. Thus, the event that $X_n$ is close to $X_\infty$ has probability close to one. But the mean of $X_n$ can differ greatly from the mean of $X_\infty$ if, in the unlikely event that $|X_n - X_\infty|$ is not small, it is very, very large.

Example 11.6.1 Suppose $U$ is a random variable with a finite mean, and suppose $A_1, A_2, \ldots$ is a sequence of events, each with positive probability, but such that $P[A_n] \to 0$, and let $b_1, b_2, \ldots$ be a sequence of nonzero numbers. Let $X_n = U + b_n I_{A_n}$ for $n \ge 1$. Then for any $\epsilon > 0$, $P\{|X_n - U| \ge \epsilon\} \le P\{X_n \ne U\} = P[A_n] \to 0$ as $n \to \infty$, so $X_n \xrightarrow{p.} U$. However, $E[X_n] = E[U] + b_n P[A_n]$. Thus, if the $b_n$ have very large magnitude, the mean $E[X_n]$ can be far larger or far smaller than $E[U]$, for all large $n$.

The simplest way to rule out the very, very large values of $|X_n - X_\infty|$ is to require the sequence $(X_n)$ to be bounded. That would rule out using constants $b_n$ with arbitrarily large magnitudes in Example 11.6.1. The following result is a good start; it is generalized to yield the dominated convergence theorem further below.

Theorem 11.6.2 (Bounded convergence theorem) Let $X_1, X_2, \ldots$ be a sequence of random variables such that for some finite $L$, $P\{|X_n| \le L\} = 1$ for all $n \ge 1$, and such that $X_n \xrightarrow{p.} X$ as $n \to \infty$. Then $E[X_n] \to E[X]$.

Proof. For any $\epsilon > 0$, $P\{|X| \ge L + \epsilon\} \le P\{|X - X_n| \ge \epsilon\} \to 0$, so that $P\{|X| \ge L + \epsilon\} = 0$. Since $\epsilon$ was arbitrary, $P\{|X| \le L\} = 1$.
Therefore, $P\{|X - X_n| \le 2L\} = 1$ for all $n \ge 1$. Again let $\epsilon > 0$. Then
$$|X - X_n| \le \epsilon + 2L\,I_{\{|X - X_n| \ge \epsilon\}}, \qquad (11.5)$$
so that $|E[X] - E[X_n]| = |E[X - X_n]| \le E[|X - X_n|] \le \epsilon + 2L\,P\{|X - X_n| \ge \epsilon\}$. By the hypotheses, $P\{|X - X_n| \ge \epsilon\} \to 0$ as $n \to \infty$. Thus, for $n$ large enough, $|E[X] - E[X_n]| < 2\epsilon$. Since $\epsilon$ is arbitrary, $E[X_n] \to E[X]$.

Equation (11.5) is central to the proof just given. It bounds the difference $|X - X_n|$ by $\epsilon$ on the event $\{|X - X_n| < \epsilon\}$, which has probability close to one for $n$ large, and on the complement of this event, the difference $|X - X_n|$ is still bounded so that its contribution is small for $n$ large enough.

The following lemma, used to establish the dominated convergence theorem, is similar to the bounded convergence theorem, but the variables are assumed to be bounded only on one side: specifically, the random variables are restricted to be greater than or equal to zero. The result is that $E[X_n]$ for large $n$ can still be much larger than $E[X_\infty]$, but cannot be much smaller. The restriction to nonnegative $X_n$'s would rule out using negative constants $b_n$ with arbitrarily large magnitudes in Example 11.6.1. The statement of the lemma uses "liminf," which is defined in Appendix 11.2.

Lemma 11.6.3 (Fatou's lemma) Suppose $(X_n)$ is a sequence of nonnegative random variables such that $X_n \xrightarrow{p.} X_\infty$. Then $\liminf_{n\to\infty} E[X_n] \ge E[X_\infty]$. (Equivalently, for any $\gamma < E[X_\infty]$, $E[X_n] \ge \gamma$ for all sufficiently large $n$.)

Proof. We shall prove the equivalent form of the conclusion given in the lemma, so let $\gamma$ be any constant with $\gamma < E[X_\infty]$. By the definition of $E[X_\infty]$, there is a simple random variable $Z$ with $Z \le X_\infty$ such that $E[Z] > \gamma$. Since $Z = X_\infty \wedge Z$,
$$|X_n \wedge Z - Z| = |X_n \wedge Z - X_\infty \wedge Z| \le |X_n - X_\infty| \xrightarrow{p.} 0,$$
so that $X_n \wedge Z \xrightarrow{p.} Z$. Therefore, by the bounded convergence theorem, $\lim_{n\to\infty} E[X_n \wedge Z] = E[Z] > \gamma$. Since $E[X_n] \ge E[X_n \wedge Z]$, it follows that $E[X_n] \ge \gamma$ for all sufficiently large $n$.
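Example 11.6.1 and the one-sided conclusion of Fatou's lemma can be made concrete with exact arithmetic. The sketch below is illustrative only: the helper name `mean_of_perturbed` and the specific choices $E[U] = 1/2$, $P[A_n] = 1/n$, and $b_n \in \{1, n^2, -n^2\}$ are assumptions, not taken from the notes. It uses only the identity $E[X_n] = E[U] + b_n P[A_n]$ from the example.

```python
from fractions import Fraction

def mean_of_perturbed(mean_u, b_n, p_n):
    """E[X_n] for X_n = U + b_n * I_{A_n}, given E[U] = mean_u and P[A_n] = p_n."""
    return mean_u + b_n * p_n

mean_u = Fraction(1, 2)  # hypothetical value of E[U]
for n in (1, 10, 100, 1000):
    p_n = Fraction(1, n)  # P[A_n] -> 0, so X_n -> U in probability
    bounded = mean_of_perturbed(mean_u, Fraction(1), p_n)          # b_n = 1
    growing = mean_of_perturbed(mean_u, Fraction(n * n), p_n)      # b_n = n^2
    negative = mean_of_perturbed(mean_u, Fraction(-n * n), p_n)    # b_n = -n^2
    print(n, float(bounded), float(growing), float(negative))
```

With $b_n = 1$ the means converge to $E[U]$. With $b_n = n^2$ the means grow without bound, which Fatou's lemma permits when $U \ge 0$, since only a lower bound on $\liminf E[X_n]$ is asserted. With $b_n = -n^2$ the $X_n$ are no longer nonnegative and the means fall far below $E[U]$.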
Theorem 11.6.4 (Dominated convergence theorem) If $X_1, X_2, \ldots$ is a sequence of random variables and $X_\infty$ and $Y$ are random variables such that the following three conditions hold:
(i) $X_n \xrightarrow{p.} X_\infty$ as $n \to \infty$
(ii) $P\{|X_n| \le Y\} = 1$ for all $n$
(iii) $E[Y] < +\infty$
then $E[X_n] \to E[X_\infty]$.

Proof. The hypotheses imply that $(X_n + Y : n \ge 1)$ is a sequence of nonnegative random variables which converges in probability to $X_\infty + Y$. So Fatou's lemma implies that $\liminf_{n\to\infty} E[X_n + Y] \ge E[X_\infty + Y]$, or equivalently, subtracting $E[Y]$ from both sides, $\liminf_{n\to\infty} E[X_n] \ge E[X_\infty]$. Similarly, since $(-X_n + Y : n \ge 1)$ is a sequence of nonnegative random variables which converges in probability to $-X_\infty + Y$, Fatou's lemma implies that $\liminf_{n\to\infty} E[-X_n + Y] \ge E[-X_\infty + Y]$, or equivalently, $\limsup_{n\to\infty} E[X_n] \le E[X_\infty]$. Summarizing,
$$\limsup_{n\to\infty} E[X_n] \le E[X_\infty] \le \liminf_{n\to\infty} E[X_n].$$
In general, the liminf of a sequence is less than or equal to the limsup, and if the liminf is equal to the limsup, then the limit exists and is equal to both the liminf and limsup. Thus, $E[X_n] \to E[X_\infty]$.

Corollary 11.6.5 (A consequence of integrability) If $Z$ has a finite mean, then given any $\epsilon > 0$, there exists a $\delta > 0$ so that if $P[A] < \delta$, then $|E[Z I_A]| \le \epsilon$.

Proof. If not, there would exist a sequence of events $A_n$ with $P\{A_n\} \to 0$ and $|E[Z I_{A_n}]| > \epsilon$ for all $n$. But $Z I_{A_n} \xrightarrow{p.} 0$, and $Z I_{A_n}$ is dominated by the integrable random variable $|Z|$ for all $n$, so the dominated convergence theorem implies that $E[Z I_{A_n}] \to 0$, which would result in a contradiction.

Theoretical Exercise 1. Work backwards, and deduce the dominated convergence theorem from Corollary 11.6.5.

The following theorem is based on a different way to control the difference between $E[X_n]$ for large $n$ and $E[X_\infty]$. Rather than a domination condition, it is assumed that the sequence is monotone in $n$.

Theorem 11.6.6 (Monotone convergence theorem) Let $X_1, X_2, \ldots$
be a sequence of random variables such that $E[X_1] > -\infty$ and such that $X_1(\omega) \le X_2(\omega) \le \cdots$. Then the limit $X_\infty$ given by $X_\infty(\omega) = \lim_{n\to\infty} X_n(\omega)$ for all $\omega$ is an extended random variable (with possible value $\infty$) and $E[X_n] \to E[X_\infty]$ as $n \to \infty$.

Proof. By adding $\max\{0, -X_1\}$ to all the random variables involved if necessary, we can assume without loss of generality that $X_1, X_2, \ldots$, and therefore also $X_\infty$, are nonnegative. Recall that $E[X_\infty]$ is equal to the supremum of the expectations of simple random variables that are less than or equal to $X_\infty$. So let $\gamma$ be any number such that $\gamma < E[X_\infty]$. Then, there is a simple random variable $\tilde{X}$ less than or equal to $X_\infty$ with $E[\tilde{X}] > \gamma$. The simple random variable $\tilde{X}$ takes only finitely many possible values. Let $L$ be the largest. Then $\tilde{X} \le X_\infty \wedge L$, so that $E[X_\infty \wedge L] > \gamma$. By the bounded convergence theorem, $E[X_n \wedge L] \to E[X_\infty \wedge L]$. Therefore, $E[X_n \wedge L] > \gamma$ for all large enough $n$. Since $E[X_n \wedge L] \le E[X_n] \le E[X_\infty]$, it follows that $\gamma < E[X_n] \le E[X_\infty]$ for all large enough $n$. Since $\gamma$ is an arbitrary constant with $\gamma < E[X_\infty]$, the desired conclusion, $E[X_n] \to E[X_\infty]$, follows.

11.7 Matrices

An $m \times n$ matrix over the reals $\mathbb{R}$ has the form
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
where $a_{ij} \in \mathbb{R}$ for all $i, j$. This matrix has $m$ rows and $n$ columns. A matrix over the complex numbers $\mathbb{C}$ has the same form, with $a_{ij} \in \mathbb{C}$ for all $i, j$. The transpose of an $m \times n$ matrix $A = (a_{ij})$ is the $n \times m$ matrix $A^T = (a_{ji})$. For example
$$\begin{pmatrix} 1 & 0 & 3 \\ 2 & 1 & 1 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 3 & 1 \end{pmatrix}$$
The matrix $A$ is symmetric if $A = A^T$. Symmetry requires that the matrix $A$ be square: $m = n$. The diagonal of a matrix consists of the entries of the form $a_{ii}$. A square matrix $A$ is called diagonal if the entries off of the diagonal are zero. The $n \times n$ identity matrix is the $n \times n$ diagonal matrix with ones on the diagonal. We write $I$ to denote an identity matrix of some dimension $n$.
If $A$ is an $m \times k$ matrix and $B$ is a $k \times n$ matrix, then the product $AB$ is the $m \times n$ matrix with $ij$th element $\sum_{l=1}^{k} a_{il} b_{lj}$. A vector $x$ is an $m \times 1$ matrix, where $m$ is the dimension of the vector. Thus, vectors are written in column form:
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$
The set of all dimension $m$ vectors over $\mathbb{R}$ is the $m$ dimensional Euclidean space $\mathbb{R}^m$. The inner product of two vectors $x$ and $y$ of the same dimension $m$ is the number $x^T y$, equal to $\sum_{i=1}^{m} x_i y_i$. The vectors $x$ and $y$ are orthogonal if $x^T y = 0$. The Euclidean length or norm of a vector $x$ is given by $\|x\| = (x^T x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to each other and $\|\varphi_i\| = 1$ for all $i$.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{R}^m$ is said to span $\mathbb{R}^m$ if any vector in $\mathbb{R}^m$ can be expressed as a linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. An orthonormal set of vectors $\varphi_1, \ldots, \varphi_n$ in $\mathbb{R}^m$ spans $\mathbb{R}^m$ if and only if $n = m$. An orthonormal basis for $\mathbb{R}^m$ is an orthonormal set of $m$ vectors in $\mathbb{R}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for $\mathbb{R}^m$. Given a vector $v$ in $\mathbb{R}^m$, the coordinates of $v$ relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^T v$. The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix $U$ is called orthonormal if any of the following three equivalent conditions is satisfied:
1. $U^T U = I$
2. $U U^T = I$
3. the columns of $U$ form an orthonormal basis.
Given an $m \times m$ orthonormal matrix $U$ and a vector $v \in \mathbb{R}^m$, the coordinates of $v$ relative to $U$ are given by the vector $U^T v$. Given a square matrix $A$, a vector $\varphi$ is an eigenvector of $A$ and $\lambda$ is an eigenvalue of $A$ if the eigen relation $A\varphi = \lambda\varphi$ is satisfied.

A permutation $\pi$ of the numbers $1, \ldots, m$ is a one-to-one mapping of $\{1, 2, \ldots, m\}$ onto itself. That is, $(\pi(1), \ldots, \pi(m))$ is a reordering of $(1, 2, \ldots, m)$. Any permutation is either even or odd.
A permutation is even if it can be obtained by an even number of transpositions of two elements. Otherwise a permutation is odd. We write
$$(-1)^\pi = \begin{cases} 1 & \text{if } \pi \text{ is even} \\ -1 & \text{if } \pi \text{ is odd} \end{cases}$$
The determinant of a square matrix $A$, written $\det(A)$, is defined by
$$\det(A) = \sum_\pi (-1)^\pi \prod_{i=1}^{m} a_{i\pi(i)}$$
The absolute value of the determinant of a matrix $A$ is denoted by $|A|$. Thus $|A| = |\det(A)|$.

Some important properties of determinants are the following. Let $A$ and $B$ be $m \times m$ matrices.
1. If $B$ is obtained from $A$ by multiplication of a row or column of $A$ by a scalar constant $c$, then $\det(B) = c \det(A)$.
2. If $\mathcal{U}$ is a subset of $\mathbb{R}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined by $A$: $\mathcal{V} = \{Ax : x \in \mathcal{U}\}$, then (the volume of $\mathcal{V}$) $= |A| \times$ (the volume of $\mathcal{U}$).
3. $\det(AB) = \det(A)\det(B)$
4. $\det(A) = \det(A^T)$
5. $|U| = 1$ if $U$ is orthonormal.
6. The columns of $A$ span $\mathbb{R}^m$ if and only if $\det(A) \ne 0$.
7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial $p$ of degree $m$ called the characteristic polynomial of $A$.
8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of $A$, repeated according to multiplicity, are the eigenvalues of $A$, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex valued with nonzero imaginary parts.

If $K$ is a symmetric $m \times m$ matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not necessarily distinct) and there exists an orthonormal basis consisting of the corresponding eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let $U$ be the orthonormal matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be the diagonal matrix with diagonal entries given by the eigenvalues
$$\Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}$$
Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore $K = U \Lambda U^T$ and $\Lambda = U^T K U$. A symmetric $m \times m$ matrix $A$ is positive semidefinite if $\alpha^T A \alpha \ge 0$ for all $m$-dimensional vectors $\alpha$. A symmetric matrix is positive semidefinite if and only if its eigenvalues are nonnegative.
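The permutation-sum definition of the determinant given earlier in this section can be evaluated directly for small matrices, which gives a quick way to spot-check Properties 1, 3, and 4. The sketch below is illustrative: the helper names (`sign`, `det`, `matmul`, `transpose`) and the two integer test matrices are assumptions, and the sign of a permutation is computed by counting inversions, which has the same parity as any sequence of transpositions producing the permutation.

```python
from itertools import permutations

def sign(perm):
    """(-1)^pi: +1 for an even permutation, -1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1

def det(a):
    """Determinant from the permutation-sum definition (m! terms, small m only)."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        term = sign(perm)
        for i in range(m):
            term *= a[i][perm[i]]
        total += term
    return total

def matmul(a, b):
    """Matrix product: entry (i, j) is the sum over l of a[i][l] * b[l][j]."""
    return [
        [sum(a[i][l] * b[l][j] for l in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(row) for row in zip(*a)]

A = [[2, 0, 1], [1, 3, -1], [0, 5, 4]]
B = [[1, 2, 0], [0, 1, 1], [3, 0, 2]]
print(det(A), det(B), det(matmul(A, B)), det(transpose(A)))
```

Integer entries keep the arithmetic exact, so the product and transpose identities can be compared with `==` rather than a tolerance. For matrices of any real size, a factorization-based routine is the practical choice; this form is only meant to mirror the definition.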
The remainder of this section deals with matrices over $\mathbb{C}$. The Hermitian transpose of a matrix $A$ is the matrix $A^*$, obtained from $A^T$ by taking the complex conjugate of each element of $A^T$. For example,
$$\begin{pmatrix} 1 & 0 & 3 + 2j \\ 2 & j & 1 \end{pmatrix}^* = \begin{pmatrix} 1 & 2 \\ 0 & -j \\ 3 - 2j & 1 \end{pmatrix}$$
The set of all dimension $m$ vectors over $\mathbb{C}$ is the $m$-complex dimensional space $\mathbb{C}^m$. The inner product of two vectors $x$ and $y$ of the same dimension $m$ is the complex number $y^* x$, equal to $\sum_{i=1}^{m} x_i y_i^*$. The vectors $x$ and $y$ are orthogonal if $x^* y = 0$. The length or norm of a vector $x$ is given by $\|x\| = (x^* x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to each other and $\|\varphi_i\| = 1$ for all $i$.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{C}^m$ is said to span $\mathbb{C}^m$ if any vector in $\mathbb{C}^m$ can be expressed as a linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$. An orthonormal set of vectors $\varphi_1, \ldots, \varphi_n$ in $\mathbb{C}^m$ spans $\mathbb{C}^m$ if and only if $n = m$. An orthonormal basis for $\mathbb{C}^m$ is an orthonormal set of $m$ vectors in $\mathbb{C}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for $\mathbb{C}^m$. Given a vector $v$ in $\mathbb{C}^m$, the coordinates of $v$ relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^* v$. The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix $U$ over $\mathbb{C}$ is called unitary (rather than orthonormal) if any of the following three equivalent conditions is satisfied:
1. $U^* U = I$
2. $U U^* = I$
3. the columns of $U$ form an orthonormal basis.
Given an $m \times m$ unitary matrix $U$ and a vector $v \in \mathbb{C}^m$, the coordinates of $v$ relative to $U$ are given by the vector $U^* v$. Eigenvectors, eigenvalues, and determinants of square matrices over $\mathbb{C}$ are defined just as they are for matrices over $\mathbb{R}$. The absolute value of the determinant of a matrix $A$ is denoted by $|A|$. Thus $|A| = |\det(A)|$.

Some important properties of determinants of matrices over $\mathbb{C}$ are the following. Let $A$ and $B$ be $m \times m$ matrices.
1.
If $B$ is obtained from $A$ by multiplication of a row or column of $A$ by a constant $c \in \mathbb{C}$, then $\det(B) = c \det(A)$.
2. If $\mathcal{U}$ is a subset of $\mathbb{C}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined by $A$: $\mathcal{V} = \{Ax : x \in \mathcal{U}\}$, then (the volume of $\mathcal{V}$) $= |A|^2 \times$ (the volume of $\mathcal{U}$).
3. $\det(AB) = \det(A)\det(B)$
4. $\det(A)^* = \det(A^*)$
5. $|U| = 1$ if $U$ is unitary.
6. The columns of $A$ span $\mathbb{C}^m$ if and only if $\det(A) \ne 0$.
7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial $p$ of degree $m$ called the characteristic polynomial of $A$.
8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of $A$, repeated according to multiplicity, are the eigenvalues of $A$, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex valued with nonzero imaginary parts.

A matrix $K$ is called Hermitian symmetric if $K = K^*$. If $K$ is a Hermitian symmetric $m \times m$ matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not necessarily distinct) and there exists an orthonormal basis consisting of the corresponding eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let $U$ be the unitary matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be the diagonal matrix with diagonal entries given by the eigenvalues
$$\Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}$$
Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore $K = U \Lambda U^*$ and $\Lambda = U^* K U$. A Hermitian symmetric $m \times m$ matrix $A$ is positive semidefinite if $\alpha^* A \alpha \ge 0$ for all $\alpha \in \mathbb{C}^m$. A Hermitian symmetric matrix is positive semidefinite if and only if its eigenvalues are nonnegative.

Many questions about matrices over $\mathbb{C}$ can be addressed using matrices over $\mathbb{R}$. If $Z$ is an $m \times m$ matrix over $\mathbb{C}$, then $Z$ can be expressed as $Z = A + Bj$, for some $m \times m$ matrices $A$ and $B$ over $\mathbb{R}$. Similarly, if $x$ is a vector in $\mathbb{C}^m$ then it can be written as $x = u + jv$ for vectors $u, v \in \mathbb{R}^m$. Then $Zx = (Au - Bv) + j(Bu + Av)$. There is a one-to-one and onto mapping from $\mathbb{C}^m$ to $\mathbb{R}^{2m}$ defined by $u + jv \mapsto \begin{pmatrix} u \\ v \end{pmatrix}$.
Multiplication of $x$ by the matrix $Z$ is thus equivalent to multiplication of $\begin{pmatrix} u \\ v \end{pmatrix}$ by $\bar{Z} = \begin{pmatrix} A & -B \\ B & A \end{pmatrix}$. We will show that
$$|Z|^2 = \det(\bar{Z}) \qquad (11.6)$$
so that Property 2 for determinants of matrices over $\mathbb{C}$ follows from Property 2 for determinants of matrices over $\mathbb{R}$.

It remains to prove (11.6). Suppose that $A^{-1}$ exists and examine the two $2m \times 2m$ matrices
$$\begin{pmatrix} A & -B \\ B & A \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} A & 0 \\ B & A + BA^{-1}B \end{pmatrix}. \qquad (11.7)$$
The second matrix is obtained from the first by right multiplying each sub-block in the left column of the first matrix by $A^{-1}B$, and adding the result to the right column. Equivalently, the second matrix is obtained by right multiplying the first matrix by $\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}$. But $\det \begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix} = 1$, so that the two matrices in (11.7) have the same determinant. Equating the determinants of the two matrices in (11.7) yields $\det(\bar{Z}) = \det(A)\det(A + BA^{-1}B)$. Similarly, the following four matrices have the same determinant:
$$\begin{pmatrix} A + Bj & 0 \\ 0 & A - Bj \end{pmatrix} \quad \begin{pmatrix} A + Bj & A - Bj \\ 0 & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & A - Bj \\ A - Bj & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & 0 \\ A - Bj & \frac{A + BA^{-1}B}{2} \end{pmatrix} \qquad (11.8)$$
Equating the determinants of the first and last of the matrices in (11.8) yields that $|Z|^2 = \det(Z)\det(Z)^* = \det(A + Bj)\det(A - Bj) = \det(A)\det(A + BA^{-1}B)$. Combining these observations yields that (11.6) holds if $A^{-1}$ exists. Since each side of (11.6) is a continuous function of $A$, (11.6) holds in general.

Chapter 12 Solutions to Problems

1.2 Independent vs. mutually exclusive
(a) If $E$ is an event independent of itself, then $P[E] = P[E \cap E] = P[E]P[E]$. This can happen if $P[E] = 0$. If $P[E] \ne 0$ then cancelling a factor of $P[E]$ on each side yields $P[E] = 1$. In summary, either $P[E] = 0$ or $P[E] = 1$.
(b) In general, we have $P[A \cup B] = P[A] + P[B] - P[AB]$. If the events $A$ and $B$ are independent, then $P[A \cup B] = P[A] + P[B] - P[A]P[B] = 0.3 + 0.4 - (0.3)(0.4) = 0.58$. On the other hand, if the events $A$ and $B$ are mutually exclusive, then $P[AB] = 0$ and therefore $P[A \cup B] = 0.3 + 0.4 = 0.7$.
(c) If $P[A] = 0.6$ and $P[B] = 0.8$, then the two events could be independent. However, if $A$ and $B$ were mutually exclusive, then $P[A] + P[B] = P[A \cup B] \le 1$, so it would not be possible for $A$ and $B$ to be mutually exclusive if $P[A] = 0.6$ and $P[B] = 0.8$.

1.4 Frantic search
Let $D$, $T$, $B$, and $O$ denote the events that the glasses are in the drawer, on the table, in the briefcase, or in the office, respectively. These four events partition the probability space.
(a) Let $E$ denote the event that the glasses were not found in the first drawer search.
$$P[T|E] = \frac{P[TE]}{P[E]} = \frac{P[E|T]P[T]}{P[E|D]P[D] + P[E|D^c]P[D^c]} = \frac{(1)(0.06)}{(0.1)(0.9) + (1)(0.1)} = \frac{0.06}{0.19} \approx 0.316$$
(b) Let $F$ denote the event that the glasses were not found after the first drawer search and first table search.
$$P[B|F] = \frac{P[BF]}{P[F]} = \frac{P[F|B]P[B]}{P[F|D]P[D] + P[F|T]P[T] + P[F|B]P[B] + P[F|O]P[O]} = \frac{(1)(0.03)}{(0.1)(0.9) + (0.1)(0.06) + (1)(0.03) + (1)(0.01)} \approx 0.22$$
(c) Let $G$ denote the event that the glasses were not found after the two drawer searches, two table searches, and one briefcase search.
$$P[O|G] = \frac{P[OG]}{P[G]} = \frac{P[G|O]P[O]}{P[G|D]P[D] + P[G|T]P[T] + P[G|B]P[B] + P[G|O]P[O]} = \frac{(1)(0.01)}{(0.1)^2(0.9) + (0.1)^2(0.06) + (0.1)(0.03) + (1)(0.01)} \approx 0.4425$$

1.6 Conditional probabilities: basic computations of iterative decoding
(a) Here is one of several approaches to this problem. Note that the $n$ pairs $(B_1, Y_1), \ldots, (B_n, Y_n)$ are mutually independent, and
$$\lambda_i(b_i) \stackrel{\text{def}}{=} P[B_i = b_i | Y_i = y_i] = \frac{q_i(y_i|b_i)}{q_i(y_i|0) + q_i(y_i|1)}.$$
Therefore
$$P[B = 1 | Y_1 = y_1, \ldots, Y_n = y_n] = \sum_{b_1,\ldots,b_n : b_1 \oplus \cdots \oplus b_n = 1} P[B_1 = b_1, \ldots, B_n = b_n | Y_1 = y_1, \ldots, Y_n = y_n] = \sum_{b_1,\ldots,b_n : b_1 \oplus \cdots \oplus b_n = 1} \prod_{i=1}^{n} \lambda_i(b_i).$$
(b) Using the definitions,
$$P[B = 1 | Z_1 = z_1, \ldots, Z_k = z_k] = \frac{p(1, z_1, \ldots, z_k)}{p(0, z_1, \ldots, z_k) + p(1, z_1, \ldots, z_k)} = \frac{\frac{1}{2}\prod_{j=1}^{k} r_j(1|z_j)}{\frac{1}{2}\prod_{j=1}^{k} r_j(0|z_j) + \frac{1}{2}\prod_{j=1}^{k} r_j(1|z_j)} = \frac{\eta}{1 + \eta}$$
where $\eta = \prod_{j=1}^{k} \frac{r_j(1|z_j)}{r_j(0|z_j)}$.
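The three posteriors in Problem 1.4 above all follow one pattern: weight each location's prior by the probability that every search made there failed, then normalize. The sketch below evaluates that pattern exactly. It is illustrative only: the priors (0.9, 0.06, 0.03, 0.01) and the per-search miss probability 0.1 are read off from the numbers appearing in the solution rather than from the problem statement, and the names `PRIOR`, `MISS`, and `posterior` are hypothetical.

```python
from fractions import Fraction

# Assumed prior over locations, inferred from the numbers in the solution.
PRIOR = {
    "D": Fraction(90, 100),  # drawer
    "T": Fraction(6, 100),   # table
    "B": Fraction(3, 100),   # briefcase
    "O": Fraction(1, 100),   # office
}
MISS = Fraction(1, 10)  # assumed chance one search overlooks glasses that are there

def posterior(searches):
    """Posterior over locations given that every listed search failed.

    searches maps a location to the number of unsuccessful searches made there;
    locations that were not searched contribute a factor of one.
    """
    weight = {loc: p * MISS ** searches.get(loc, 0) for loc, p in PRIOR.items()}
    total = sum(weight.values())
    return {loc: w / total for loc, w in weight.items()}

part_a = posterior({"D": 1})["T"]                  # P[T | E]
part_b = posterior({"D": 1, "T": 1})["B"]          # P[B | F]
part_c = posterior({"D": 2, "T": 2, "B": 1})["O"]  # P[O | G]
print(float(part_a), float(part_b), float(part_c))
```

Working with `Fraction` avoids rounding, so each result is an exact ratio that can be compared with the decimal approximations in the text.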
1.8 Blue corners
(a) There are 24 ways to color 5 corners so that at least one face has four blue corners (there are 6 choices of the face, and for each face there are four choices for which additional corner to color blue). Since there are $\binom{8}{5} = 56$ ways to select 5 out of 8 corners, $P[B \mid \text{exactly 5 corners colored blue}] = 24/56 = 3/7$.
(b) By counting the number of ways that $B$ can happen for different numbers of blue corners we find
$$P[B] = 6p^4(1-p)^4 + 24p^5(1-p)^3 + 24p^6(1-p)^2 + 8p^7(1-p) + p^8.$$

1.10 Recognizing cumulative distribution functions
(a) Valid (draw a sketch). $P[X^2 \ge 5] = P[X \le -\sqrt{5}] + P[X \ge \sqrt{5}] = F_1(-\sqrt{5}) + 1 - F_1(\sqrt{5}) = \frac{e^{-5}}{2}$.
(b) Invalid. $F(0) > 1$. Another reason is that $F$ is not nondecreasing.
(c) Invalid, not right continuous at 0.

1.12 CDF and characteristic function of a mixed type random variable
(a) The range of $X$ is $[0, 0.5]$. For $0 \le c \le 0.5$, $P[X \le c] = P[U \le c + 0.5] = c + 0.5$. Thus,
$$F_X(c) = \begin{cases} 0 & c < 0 \\ c + 0.5 & 0 \le c \le 0.5 \\ 1 & c > 0.5 \end{cases}$$
(b) $\Phi_X(u) = 0.5 + \int_0^{0.5} e^{jux}\,dx = 0.5 + \frac{e^{ju/2} - 1}{ju}$

1.14 Conditional expectation for uniform density over a triangular region
(a) The triangle has base and height one, so the area of the triangle is 0.5. Thus the joint pdf is 2 inside the triangle.
(b)
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy = \begin{cases} \int_0^{x/2} 2\,dy = x & \text{if } 0 < x < 1 \\ \int_{x-1}^{x/2} 2\,dy = 2 - x & \text{if } 1 < x < 2 \\ 0 & \text{else} \end{cases}$$
(c) In view of part (b), the conditional density $f_{Y|X}(y|x)$ is not well defined unless $0 < x < 2$. In general we have
$$f_{Y|X}(y|x) = \begin{cases} \frac{2}{x} & \text{if } 0 < x \le 1 \text{ and } y \in [0, \frac{x}{2}] \\ 0 & \text{if } 0 < x \le 1 \text{ and } y \notin [0, \frac{x}{2}] \\ \frac{2}{2-x} & \text{if } 1 < x < 2 \text{ and } y \in [x - 1, \frac{x}{2}] \\ 0 & \text{if } 1 < x < 2 \text{ and } y \notin [x - 1, \frac{x}{2}] \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases}$$
Thus, for $0 < x \le 1$, the conditional distribution of $Y$ is uniform over the interval $[0, \frac{x}{2}]$. For $1 < x < 2$, the conditional distribution of $Y$ is uniform over the interval $[x - 1, \frac{x}{2}]$.
(d) Finding the midpoints of the intervals that $Y$ is conditionally uniformly distributed over, or integrating $y$ against the conditional density found in part (c), yields:
$$E[Y | X = x] = \begin{cases} \frac{x}{4} & \text{if } 0 < x \le 1 \\ \frac{3x - 2}{4} & \text{if } 1 < x < 2 \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases}$$

1.16 Density of a function of a random variable
(a) $P[X \ge 0.4 | X \le 0.8] = P[0.4 \le X \le 0.8 | X \le 0.8] = (0.8^2 - 0.4^2)/0.8^2 = \frac{3}{4}$.
(b) The range of $Y$ is the interval $[0, +\infty)$. For $c \ge 0$,
$$P\{-\ln(X) \le c\} = P\{\ln(X) \ge -c\} = P\{X \ge e^{-c}\} = \int_{e^{-c}}^{1} 2x\,dx = 1 - e^{-2c}$$
so
$$f_Y(c) = \begin{cases} 2\exp(-2c) & c \ge 0 \\ 0 & \text{else} \end{cases}$$
That is, $Y$ is an exponential random variable with parameter 2.

1.18 Functions of independent exponential random variables
(a) $Z$ takes values in the positive real line. So let $z \ge 0$.
$$P[Z \le z] = P[\min\{X_1, X_2\} \le z] = P[X_1 \le z \text{ or } X_2 \le z] = 1 - P[X_1 > z \text{ and } X_2 > z] = 1 - P[X_1 > z]P[X_2 > z] = 1 - e^{-\lambda_1 z}e^{-\lambda_2 z} = 1 - e^{-(\lambda_1 + \lambda_2)z}$$
Differentiating yields that
$$f_Z(z) = \begin{cases} (\lambda_1 + \lambda_2)e^{-(\lambda_1 + \lambda_2)z}, & z \ge 0 \\ 0, & z < 0 \end{cases}$$
That is, $Z$ has the exponential distribution with parameter $\lambda_1 + \lambda_2$.
(b) $R$ takes values in the positive real line and by independence the joint pdf of $X_1$ and $X_2$ is the product of their individual densities. So for $r \ge 0$,
$$P[R \le r] = P\left[\frac{X_1}{X_2} \le r\right] = P[X_1 \le rX_2] = \int_0^\infty \int_0^{rx_2} \lambda_1 e^{-\lambda_1 x_1} \lambda_2 e^{-\lambda_2 x_2}\,dx_1\,dx_2 = \int_0^\infty (1 - e^{-r\lambda_1 x_2})\lambda_2 e^{-\lambda_2 x_2}\,dx_2 = 1 - \frac{\lambda_2}{r\lambda_1 + \lambda_2}.$$
Differentiating yields that
$$f_R(r) = \begin{cases} \frac{\lambda_1 \lambda_2}{(\lambda_1 r + \lambda_2)^2}, & r \ge 0 \\ 0, & r < 0 \end{cases}$$

1.20 Gaussians and the Q function
(a) $\mathrm{Cov}(3X + 2Y, X + 5Y + 10) = 3\mathrm{Cov}(X, X) + 10\mathrm{Cov}(Y, Y) = 3\mathrm{Var}(X) + 10\mathrm{Var}(Y) = 13$.
(b) $X + 4Y$ is $N(0, 17)$, so $P\{X + 4Y \ge 2\} = P\left\{\frac{X + 4Y}{\sqrt{17}} \ge \frac{2}{\sqrt{17}}\right\} = Q\left(\frac{2}{\sqrt{17}}\right)$.
(c) $X - Y$ is $N(0, 2)$, so $P\{(X - Y)^2 > 9\} = P\{X - Y \ge 3 \text{ or } X - Y \le -3\} = 2P\left\{\frac{X - Y}{\sqrt{2}} \ge \frac{3}{\sqrt{2}}\right\} = 2Q\left(\frac{3}{\sqrt{2}}\right)$.

1.22 Working with a joint density
(a) The density must integrate to one, so $c = 4/19$.
(b)
$$f_X(x) = \begin{cases} \frac{4}{19}\int_1^2 (1 + xy)\,dy = \frac{4}{19}\left[1 + \frac{3x}{2}\right] & 2 \le x \le 3 \\ 0 & \text{else} \end{cases}$$
$$f_Y(y) = \begin{cases} \frac{4}{19}\int_2^3 (1 + xy)\,dx = \frac{4}{19}\left[1 + \frac{5y}{2}\right] & 1 \le y \le 2 \\ 0 & \text{else} \end{cases}$$
Therefore $f_{X|Y}(x|y)$ is well defined only if $1 \le y \le 2$. For $1 \le y \le 2$:
$$f_{X|Y}(x|y) = \begin{cases} \frac{1 + xy}{1 + \frac{5}{2}y} & 2 \le x \le 3 \\ 0 & \text{for other } x \end{cases}$$

1.24 Density of a difference
(a) Method 1. The joint density is the product of the marginals, and for any $c \ge 0$, the probability $P\{|X - Y| \le c\}$ is the integral of the joint density over the region of the positive quadrant such that $\{|x - y| \le c\}$, which by symmetry is one minus twice the integral of the density over the region $\{y \ge 0 \text{ and } x \ge y + c\}$. Thus,
$$P\{|X - Y| \le c\} = 1 - 2\int_0^\infty \exp(-\lambda(y + c))\,\lambda\exp(-\lambda y)\,dy = 1 - \exp(-\lambda c).$$
Thus,
$$f_Z(c) = \begin{cases} \lambda\exp(-\lambda c) & c \ge 0 \\ 0 & \text{else} \end{cases}$$
That is, $Z$ has the exponential distribution with parameter $\lambda$.
Method 2. The problem can be solved without calculation by the memoryless property of the exponential distribution, as follows. Suppose $X$ and $Y$ are lifetimes of identical lightbulbs which are turned on at the same time. One of them will burn out first. At that time, the other lightbulb will be the same as a new lightbulb, and $|X - Y|$ is equal to how much longer that lightbulb will last.

1.26 Some characteristic functions
(a) Differentiation is straightforward, yielding $jE[X] = \Phi'(0) = 2j$ or $E[X] = 2$, and $j^2 E[X^2] = \Phi''(0) = -14$, so $\mathrm{Var}(X) = 14 - 2^2 = 10$. In fact, this is the characteristic function of a $N(2, 10)$ random variable (mean 2, variance 10).
(b) Evaluation of the derivatives at zero requires l'Hospital's rule, and is a little tedious. A simpler way is to use the Taylor series expansion $\exp(ju) = 1 + (ju) + (ju)^2/2! + (ju)^3/3! + \cdots$. The result is $E[X] = 0.5$ and $\mathrm{Var}(X) = 1/12$. In fact, this is the characteristic function of a $U(0, 1)$ random variable.
(c) Differentiation is straightforward, yielding $E[X] = \mathrm{Var}(X) = \lambda$. In fact, this is the characteristic function of a $\mathrm{Poi}(\lambda)$ random variable.
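The moment calculation in part (a) of Problem 1.26 above can be checked by differentiating a characteristic function numerically at zero. The problem's $\Phi$ is not reproduced in this section, so the sketch below assumes $\Phi(u) = \exp(2ju - 5u^2)$, the characteristic function of a Gaussian with mean 2 and variance 10; this choice is an inference from the derivative values $\Phi'(0) = 2j$ and $\Phi''(0) = -14$ quoted in the solution. The names `phi` and `moments_from_cf` and the step size `h` are illustrative.

```python
import cmath

def phi(u):
    """Assumed characteristic function exp(2ju - 5u^2) (Gaussian, mean 2, variance 10)."""
    return cmath.exp(2j * u - 5.0 * u * u)

def moments_from_cf(cf, h=1e-4):
    """Estimate mean and variance from central finite differences of cf at zero."""
    d1 = (cf(h) - cf(-h)) / (2 * h)               # approximates Phi'(0)  = j E[X]
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h ** 2  # approximates Phi''(0) = j^2 E[X^2]
    mean = (d1 / 1j).real
    second_moment = (-d2).real
    return mean, second_moment - mean ** 2

mean, var = moments_from_cf(phi)
print(mean, var)
```

The estimates are approximate because of the finite step, so they should be compared with the exact values 2 and 10 using a tolerance rather than equality.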
1.28 A transformation of jointly continuous random variables
(a) We are using the mapping, from the square region $\{(u, v) : 0 \le u, v \le 1\}$ in the $u$-$v$ plane to the triangular region with corners (0,0), (3,0), and (3,1) in the $x$-$y$ plane, given by
$$x = 3u, \qquad y = uv.$$
The mapping is one-to-one, meaning that for any $(x, y)$ in the range we can recover $(u, v)$. Indeed, the inverse mapping is given by
$$u = \frac{x}{3}, \qquad v = \frac{3y}{x}.$$
The Jacobian determinant of the transformation is
$$J(u, v) = \det\begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} = \det\begin{pmatrix} 3 & 0 \\ v & u \end{pmatrix} = 3u \ne 0, \quad \text{for all } (u, v) \in (0, 1)^2.$$
Therefore the required pdf is
$$f_{X,Y}(x, y) = \frac{f_{U,V}(u, v)}{|J(u, v)|} = \frac{9u^2v^2}{|3u|} = 3uv^2 = \frac{9y^2}{x}$$
within the triangle with corners (0,0), (3,0), and (3,1), and $f_{X,Y}(x, y) = 0$ elsewhere.
(b) Integrating out $y$ from the joint pdf yields
$$f_X(x) = \begin{cases} \int_0^{x/3} \frac{9y^2}{x}\,dy = \frac{x^2}{9} & \text{if } 0 \le x \le 3 \\ 0 & \text{else} \end{cases}$$
Therefore the conditional density $f_{Y|X}(y|x)$ is well defined only if $0 \le x \le 3$. For $0 \le x \le 3$,
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \begin{cases} \frac{81y^2}{x^3} & \text{if } 0 \le y \le \frac{x}{3} \\ 0 & \text{else} \end{cases}$$

1.30 Jointly distributed variables
(a) $E\left[\frac{V^2}{1+U}\right] = E[V^2]\,E\left[\frac{1}{1+U}\right] = \int_0^\infty v^2 \lambda e^{-\lambda v}\,dv \int_0^1 \frac{1}{1+u}\,du = \left(\frac{2}{\lambda^2}\right)(\ln 2) = \frac{2\ln 2}{\lambda^2}$.
(b) $P\{U \le V\} = \int_0^1 \int_u^\infty \lambda e^{-\lambda v}\,dv\,du = \int_0^1 e^{-\lambda u}\,du = (1 - e^{-\lambda})/\lambda$.
(c) The support of both $f_{UV}$ and $f_{YZ}$ is the strip $[0, 1] \times [0, \infty)$, and the mapping $(u, v) \to (y, z)$ defined by $y = u^2$ and $z = uv$ is one-to-one. Indeed, the inverse mapping is given by $u = y^{1/2}$ and $v = zy^{-1/2}$. The absolute value of the Jacobian determinant of the forward mapping is
$$\left|\frac{\partial(y, z)}{\partial(u, v)}\right| = \left|\det\begin{pmatrix} 2u & 0 \\ v & u \end{pmatrix}\right| = 2u^2 = 2y.$$
Thus,
$$f_{Y,Z}(y, z) = \begin{cases} \frac{\lambda}{2y}e^{-\lambda z y^{-1/2}} & (y, z) \in [0, 1] \times [0, \infty) \\ 0 & \text{otherwise.} \end{cases}$$

2.2 The limit of the product is the product of the limits
(a) There exists $n_1$ so large that $|y_n - y| \le 1$ for $n \ge n_1$. Thus, $|y_n| \le L$ for all $n$, where $L = \max\{|y_1|, |y_2|, \ldots, |y_{n_1 - 1}|, |y| + 1\}$.
(b) Given $\epsilon > 0$, there exists $n_\epsilon$ so large that $|x_n - x| \le \frac{\epsilon}{2L}$ and $|y_n - y| \le \frac{\epsilon}{2(|x| + 1)}$ for all $n \ge n_\epsilon$.
Thus, for $n \ge n_\epsilon$,
$$|x_n y_n - xy| \le |(x_n - x)y_n| + |x(y_n - y)| \le |x_n - x|L + |x||y_n - y| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} \le \epsilon.$$
So $x_n y_n \to xy$ as $n \to \infty$.

2.4 Limits of some deterministic series
(a) Convergent. This is the power series expansion for $e^x$, which is everywhere convergent, evaluated at $x = 3$. The value of the sum is thus $e^3$. Another way to show the series is convergent is to notice that for $n \ge 3$ the $n$th term can be bounded above by
$$\frac{3^n}{n!} = \frac{3^3}{3!} \cdot \frac{3}{4} \cdot \frac{3}{5} \cdots \frac{3}{n} \le (4.5)\left(\frac{3}{4}\right)^{n-3}.$$
Thus, the sum is bounded by a constant plus a geometric series, so it is convergent.
(b) Convergent. Let $0 < \eta < 1$. Then $\ln n < n^\eta$ for all large enough $n$. Also, $n + 2 \le 2n$ for all large enough $n$, and $n + 5 \ge n$ for all $n$. Therefore, the $n$th term in the series is bounded above, for all sufficiently large $n$, by $\frac{2n \cdot n^\eta}{n^3} = 2n^{\eta - 2}$. Therefore, the sum in (b) is bounded above by finitely many terms of the sum, plus $2\sum_{n=1}^{\infty} n^{\eta - 2}$, which is finite, because, for $\alpha > 1$, $\sum_{n=1}^{\infty} n^{-\alpha} < 1 + \int_1^\infty x^{-\alpha}\,dx = \frac{\alpha}{\alpha - 1}$, as shown in an example in the appendix of the notes.
(c) Not convergent. Let $0 < \eta < 0.2$. Then $\log(n + 1) \le n^\eta$ for all $n$ large enough, so for $n$ large enough the $n$th term in the series is greater than or equal to $n^{-5\eta}$. The series is therefore divergent. We used the fact that $\sum_{n=1}^{\infty} n^{-\alpha}$ is infinite for any $0 \le \alpha \le 1$, because it is greater than or equal to the integral $\int_1^\infty x^{-\alpha}\,dx$, which is infinite for $0 \le \alpha \le 1$.

2.6 Convergence of sequences of random variables
(a) The distribution of $X_n$ is the same for all $n$, so the sequence converges in distribution to any random variable with the distribution of $X_1$. To check for mean square convergence, use the fact $\cos(a)\cos(b) = (\cos(a + b) + \cos(a - b))/2$ to calculate that $E[X_n X_m] = \frac{1}{2}$ if $n = m$ and $E[X_n X_m] = 0$ if $n \ne m$. Therefore, $\lim_{n,m\to\infty} E[X_n X_m]$ does not exist, so the sequence $(X_n)$ does not satisfy the Cauchy criterion for m.s. convergence, so it doesn't converge in the m.s. sense.
Since it is a bounded sequence, it therefore does not converge in the p. sense either. (Because for bounded sequences, convergence p. implies convergence m.s.) Therefore the sequence doesn't converge in the a.s. sense either. In summary, the sequence converges in distribution but not in the other three senses. (Another approach is to note that the distribution of $X_n - X_{2n}$ is the same for all $n$, so that the sequence doesn't satisfy the Cauchy criterion for convergence in probability.)
(b) If $\omega$ is such that $0 < \Theta(\omega) < 2\pi$, then $\left|1 - \frac{\Theta(\omega)}{\pi}\right| < 1$ so that $\lim_{n\to\infty} Y_n(\omega) = 0$ for such $\omega$. Since $P[0 < \Theta(\omega) < 2\pi] = 1$, it follows that $(Y_n)$ converges to zero in the a.s. sense, and hence also in the p. and d. senses. Since the sequence is bounded, it also converges to zero in the m.s. sense.

2.8 Convergence of random variables on (0,1], version 2
(a) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, the deterministic sequence $X_n(\omega)$ converges to zero. So $X_n \to 0$ a.s. The sequence thus also converges in p. and d. If the sequence converged in the m.s. sense, the limit would also have to be zero, but
$$E[|X_n - 0|^2] = E[|X_n|^2] = \frac{1}{n^2}\int_0^1 \frac{1}{\omega}\,d\omega = +\infty \not\to 0.$$
The sequence thus does not converge in the m.s. sense.
(b) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, except the single point 1 which has zero probability, the deterministic sequence $X_n(\omega)$ converges to zero. So $X_n \to 0$ a.s. The sequence also converges in p. and d. If the sequence converged in the m.s. sense, the limit would also have to be zero, but
$$E[|X_n - 0|^2] = E[|X_n|^2] = n^2\int_0^1 \omega^{2n}\,d\omega = \frac{n^2}{2n + 1} \not\to 0.$$
The sequence thus does not converge in the m.s. sense.
(c) (d. only) For $\omega$ fixed and irrational, the sequence does not even come close to settling down, so intuitively we expect the sequence does not converge in any of the three strongest senses: a.s., m.s., or p. To prove this, it suffices to prove that the sequence doesn't converge in p.
Since the sequence is bounded, convergence in probability is equivalent to convergence in the m.s. sense, so it also would suffice to prove the sequence does not converge in the m.s. sense. The Cauchy criteria for m.s. convergence would be violated if $E[(X_n - X_{2n})^2] \not\to 0$ as $n \to \infty$. By the double angle formula, $X_{2n}(\omega) = 2\omega \sin(2\pi n \omega) \cos(2\pi n \omega)$ so that
\[ E[(X_n - X_{2n})^2] = \int_0^1 \omega^2 (\sin(2\pi n \omega))^2 (1 - 2\cos(2\pi n \omega))^2 d\omega, \]
and this integral clearly does not converge to zero as $n \to \infty$. In fact, following the heuristic reasoning below, the limit can be shown to equal $E[\sin^2(\Theta)(1 - 2\cos(\Theta))^2]/3$, where $\Theta$ is uniformly distributed over the interval $[0, 2\pi]$. So the sequence $(X_n)$ does not converge in m.s., p., or a.s. senses.
The sequence does converge in the distribution sense. We shall give a heuristic derivation of the limiting CDF. Note that the CDF of $X_n$ is given by
\[ F_{X_n}(c) = \int_0^1 I_{\{f(\omega) \sin(2\pi n \omega) \le c\}} d\omega \qquad (12.1) \]
where $f$ is the function defined by $f(\omega) = \omega$. As $n \to \infty$, the integrand in (12.1) jumps between zero and one more and more frequently. For any small $\epsilon > 0$, we can imagine partitioning $[0, 1]$ into intervals of length $\epsilon$. The number of oscillations of the integrand within each interval converges to infinity, and the factor $f(\omega)$ is roughly constant over each interval. The fraction of a small interval for which the integrand is one converges approximately to $P\{f(\omega) \sin(\Theta) \le c\}$, where $\Theta$ is a random variable that is uniformly distributed over the interval $[0, 2\pi]$, and $\omega$ is a fixed point in the small interval. So the CDF of $X_n$ converges for all constants $c$ to:
\[ \int_0^1 P\{f(\omega) \sin(\Theta) \le c\} d\omega. \qquad (12.2) \]
(Note: The following observations can be used to make the above argument rigorous. The integrals in (12.1) and (12.2) would be equal if $f$ were constant within each interval of the form $\left( \frac{i}{n}, \frac{i+1}{n} \right)$. If $f$ is continuous on $[0, 1]$, it can be approximated by such step functions with maximum approximation error converging to zero as $n \to \infty$.
Details are left to the reader.)

2.10 Convergence of a sequence of discrete random variables
(a) The CDF of $X_n$ is shown in Figure 12.1. Since $F_n(x) = F_X\left( x - \frac{1}{n} \right)$ it follows that $\lim_{n \to \infty} F_n(x) = F_X(x-)$ for all $x$. So $\lim_{n \to \infty} F_n(x) = F_X(x)$ unless $F_X(x) \ne F_X(x-)$, i.e., unless $x = 1, 2, 3, 4, 5,$ or $6$.
[Figure 12.1: the CDF $F_{X_n}$.]
(b) $F_X$ is continuous at $x$ unless $x \in \{1, 2, 3, 4, 5, 6\}$.
(c) Yes, $\lim_{n \to \infty} X_n = X$ d. by definition.

2.12 Convergence of a minimum
(a) The sequence $(X_n)$ converges to zero in all four senses. Here is one proof, and there are others. For any $\epsilon$ with $0 < \epsilon < 1$, $P[|X_n - 0| \ge \epsilon] = P[U_1 \ge \epsilon, \ldots, U_n \ge \epsilon] = (1 - \epsilon)^n$, which converges to zero as $n \to \infty$. Thus, by definition, $X_n \to 0$ p. Thus, the sequence converges to zero in d. sense and, since it is bounded, in the m.s. sense. For each $\omega$, as a function of $n$, the sequence of numbers $X_1(\omega), X_2(\omega), \ldots$ is a nonincreasing sequence of numbers bounded below by zero. Thus, the sequence $X_n$ converges in the a.s. sense to some limit random variable. If a limit of random variables exists in different senses, the limit random variable has to be the same, so the sequence $(X_n)$ converges a.s. to zero.
(b) For $n$ fixed, the variable $Y_n$ is distributed over the interval $[0, n^\theta]$, so let $c$ be a number in that interval. Then
\[ P[Y_n \le c] = P[X_n \le c n^{-\theta}] = 1 - P[X_n > c n^{-\theta}] = 1 - (1 - c n^{-\theta})^n. \]
Thus, if $\theta = 1$, $\lim_{n \to \infty} P[Y_n \le c] = 1 - \lim_{n \to \infty} \left( 1 - \frac{c}{n} \right)^n = 1 - \exp(-c)$ for any $c \ge 0$. Therefore, if $\theta = 1$, the sequence $(Y_n)$ converges in distribution, and the limit distribution is the exponential distribution with parameter one.

2.14 Limits of functions of random variables
(a) Yes. Since $g$ is a continuous function, if a sequence of numbers $a_n$ converges to a limit $a$, then $g(a_n)$ converges to $g(a)$. Therefore, for any $\omega$ such that $\lim_{n \to \infty} X_n(\omega) = X(\omega)$, it holds that $\lim_{n \to \infty} g(X_n(\omega)) = g(X(\omega))$. If $X_n \to X$ a.s., then the set of all such $\omega$ has probability one, so $g(X_n) \to g(X)$ a.s.
(b) Yes. A direct proof is to first note that $|g(b) - g(a)| \le |b - a|$ for any numbers $a$ and $b$. So, if $X_n \to X$ m.s., then $E[|g(X_n) - g(X)|^2] \le E[|X - X_n|^2] \to 0$ as $n \to \infty$. Therefore $g(X_n) \to g(X)$ m.s. (A slightly more general proof would be to use the continuity of $g$ (implying uniform continuity on bounded intervals) to show that $g(X_n) \to g(X)$ p., and then, since $g$ is bounded, use the fact that convergence in probability for a bounded sequence implies convergence in the m.s. sense.)
(c) No. For a counterexample, let $X_n = (-1)^n / n$. Then $X_n \to 0$ deterministically, and hence in the a.s. sense. But $h(X_n) = (-1)^n$, which converges with probability zero, not with probability one.
(d) No. For a counterexample, let $X_n = (-1)^n / n$. Then $X_n \to 0$ deterministically, and hence in the m.s. sense. But $h(X_n) = (-1)^n$ does not converge in the m.s. sense. (For a proof, note that $E[h(X_m) h(X_n)] = (-1)^{m+n}$, which does not converge as $m, n \to \infty$. Thus, $h(X_n)$ does not satisfy the necessary Cauchy criteria for m.s. convergence.)

2.16 Sums of i.i.d. random variables, II
(a) $\Phi_{X_1}(u) = \frac{1}{2} e^{ju} + \frac{1}{2} e^{-ju} = \cos(u)$, so $\Phi_{S_n}(u) = \Phi_{X_1}(u)^n = (\cos(u))^n$, and $\Phi_{V_n}(u) = \Phi_{S_n}(u/\sqrt{n}) = \cos(u/\sqrt{n})^n$.
(b)
\[ \lim_{n \to \infty} \Phi_{S_n}(u) = \begin{cases} 1 & \text{if } u \text{ is an even multiple of } \pi \\ \text{does not exist} & \text{if } u \text{ is an odd multiple of } \pi \\ 0 & \text{if } u \text{ is not a multiple of } \pi. \end{cases} \]
\[ \lim_{n \to \infty} \Phi_{V_n}(u) = \lim_{n \to \infty} \left( 1 - \frac{1}{2} \left( \frac{u}{\sqrt{n}} \right)^2 + o\left( \frac{u^2}{n} \right) \right)^n = e^{-\frac{u^2}{2}}. \]
(c) $S_n$ does not converge in distribution, because, for example, $\lim_{n \to \infty} \Phi_{S_n}(\pi) = \lim_{n \to \infty} (-1)^n$ does not exist. So $S_n$ does not converge in the m.s., a.s. or p. sense either. The limit of $\Phi_{V_n}$ is the characteristic function of the $N(0, 1)$ distribution, so that $(V_n)$ converges in distribution and the limit distribution is $N(0, 1)$. It will next be proved that $V_n$ does not converge in probability.
The intuitive idea is that if $m$ is much larger than $n$, then most of the random variables in the sum defining $V_m$ are independent of the variables defining $V_n$. Hence, there is no reason for $V_m$ to be close to $V_n$ with high probability. The proof below looks at the case $m = 2n$. Note that
\[ V_{2n} - V_n = \frac{X_1 + \cdots + X_{2n}}{\sqrt{2n}} - \frac{X_1 + \cdots + X_n}{\sqrt{n}} = \frac{\sqrt{2} - 2}{2} \left\{ \frac{X_1 + \cdots + X_n}{\sqrt{n}} \right\} + \frac{1}{\sqrt{2}} \left\{ \frac{X_{n+1} + \cdots + X_{2n}}{\sqrt{n}} \right\}. \]
The two terms within the two pairs of braces are independent, and by the central limit theorem, each converges in distribution to the $N(0, 1)$ distribution. Thus $\lim_{n \to \infty} d.\ V_{2n} - V_n = W$, where $W$ is a normal random variable with mean 0 and $\mathrm{Var}(W) = \left( \frac{\sqrt{2} - 2}{2} \right)^2 + \left( \frac{1}{\sqrt{2}} \right)^2 = 2 - \sqrt{2}$. Thus, $\lim_{n \to \infty} P(|V_{2n} - V_n| > \epsilon) \ne 0$, so by the Cauchy criteria for convergence in probability, $V_n$ does not converge in probability. Hence $V_n$ does not converge in the a.s. sense or m.s. sense either.

2.18 On the growth of the maximum of n independent exponentials
(a) Let $n \ge 2$. Clearly $F_{Z_n}(c) = 0$ for $c \le 0$. For $c > 0$,
\[ \begin{aligned} F_{Z_n}(c) &= P\{\max\{X_1, \ldots, X_n\} \le c \ln n\} = P\{X_1 \le c \ln n, X_2 \le c \ln n, \ldots, X_n \le c \ln n\} \\ &= P\{X_1 \le c \ln n\} P\{X_2 \le c \ln n\} \cdots P\{X_n \le c \ln n\} = (1 - e^{-c \ln n})^n = (1 - n^{-c})^n. \end{aligned} \]
(b) Equivalently, $F_{Z_n}(c) = \left( 1 + \frac{x_n}{n} \right)^n$, where $x_n = -n^{1-c}$. Observe that as $n \to \infty$,
\[ x_n \to \begin{cases} -\infty & c < 1 \\ -1 & c = 1 \\ 0 & c > 1, \end{cases} \]
so by Lemma 2.3.1 (and the monotonicity of the function $e^x$ to extend to the case $x = -\infty$),
\[ F_{Z_n}(c) \to \begin{cases} 0 & c < 1 \\ e^{-1} & c = 1 \\ 1 & c > 1. \end{cases} \]
Therefore, if $Z_\infty$ is the random variable that is equal to one with probability one, then $F_{Z_n}(c) \to F_{Z_\infty}(c)$ at the continuity points (i.e. at $c \ne 1$) of $F_{Z_\infty}$. So the sequence $(Z_n)$ converges to one in distribution.

2.20 Limit behavior of a stochastic dynamical system
Due to the persistent noise, just as for the example following Theorem 2.1.5 in the notes, the sequence does not converge to an ordinary random variable in the a.s., p., or m.s. senses.
To gain some insight, imagine (or simulate on a computer) a typical sample path of the process. A typical sample sequence hovers around zero for a while, but eventually, since the Gaussian variables can be arbitrarily large, some value of $X_n$ will cross above any fixed threshold with probability one. After that, $X_n$ would probably converge to infinity quickly. For example, if $X_n = 3$ for some $n$, and if the noise were ignored from that time forward, then $X$ would go through the sequence $9, 81, 6561, 43046721, 1853020188851841, 3.43 \times 10^{30}, \ldots$, and one suspects the noise terms would not stop the growth. This suggests that $X_n \to +\infty$ in the a.s. sense (and hence in the p. and d. senses as well). (Convergence to $+\infty$ in the m.s. sense is not well defined.) Of course, then, $X_n$ does not converge in any sense to an ordinary random variable.
We shall follow the above intuition to prove that $X_n \to \infty$ a.s. If $W_{n-1} \ge 3$ for some $n$, then $X_n \ge 3$. Thus, the sequence $X_n$ will eventually cross above the threshold 3. We say that $X$ diverges nicely from time $n$ if the event $E_n = \{X_{n+k} \ge 3 \cdot 2^k \text{ for all } k \ge 0\}$ is true. Note that if $X_{n+k} \ge 3 \cdot 2^k$ and $W_{n+k} \ge -3 \cdot 2^k$, then $X_{n+k+1} \ge (3 \cdot 2^k)^2 - 3 \cdot 2^k = 3 \cdot 2^k (3 \cdot 2^k - 1) \ge 3 \cdot 2^{k+1}$. Therefore, $E_n \supset \{X_n \ge 3 \text{ and } W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\}$. Thus, using a union bound and the bound $Q(u) \le \frac{1}{2} \exp(-u^2/2)$ for $u \ge 0$:
\[ \begin{aligned} P[E_n | X_n \ge 3] &\ge P\{W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\} = 1 - P\left( \cup_{k=0}^\infty \{W_{n+k} \le -3 \cdot 2^k\} \right) \\ &\ge 1 - \sum_{k=0}^\infty P\{W_{n+k} \le -3 \cdot 2^k\} = 1 - \sum_{k=0}^\infty Q(3 \cdot 2^k \cdot \sqrt{2}) \\ &\ge 1 - \frac{1}{2} \sum_{k=0}^\infty \exp(-(3 \cdot 2^k)^2) \ge 1 - \frac{1}{2} \sum_{k=0}^\infty (e^{-9})^{k+1} = 1 - \frac{e^{-9}}{2(1 - e^{-9})} \ge 0.9999. \end{aligned} \]
The pieces are put together as follows. Let $N_1$ be the smallest time such that $X_{N_1} \ge 3$. Then $N_1$ is finite with probability one, as explained above. Then $X$ diverges nicely from time $N_1$ with probability at least 0.9999. However, if $X$ does not diverge nicely from time $N_1$, then there is some first time of the form $N_1 + k$ such that $X_{N_1 + k} < 3 \cdot 2^k$.
Note that the future of the process beyond that time has the same evolution as the original process. Let $N_2$ be the first time after that such that $X_{N_2} \ge 3$. Then $X$ again has chance at least 0.9999 to diverge nicely to infinity. And so on. Thus, $X$ will have arbitrarily many chances to diverge nicely to infinity, with each chance having probability at least 0.9999. The number of chances needed until success is a.s. finite (in fact it has the geometric distribution), so that $X$ diverges nicely to infinity from some time, with probability one.

2.22 Convergence analysis of successive averaging
(b) The means $\mu_n$ of $X_n$ for all $n$ are determined by the recursion $\mu_0 = 0$, $\mu_1 = 1$, and, for $n \ge 1$, $\mu_{n+1} = (\mu_n + \mu_{n-1})/2$. This second order recursion has a solution of the form $\mu_n = A\theta_1^n + B\theta_2^n$, where $\theta_1$ and $\theta_2$ are the solutions to the equation $\theta^2 = (1 + \theta)/2$. This yields $\mu_n = \frac{2}{3} \left( 1 - \left( -\frac{1}{2} \right)^n \right)$.
(c) It is first proved that $\lim_{n \to \infty} D_n = 0$ a.s. Note that $D_n = U_1 \cdots U_{n-1}$. Since $\ln D_n = \ln(U_1) + \cdots + \ln(U_{n-1})$ and $E[\ln U_i] = \int_0^1 \ln(u) du = (x \ln x - x)\big|_0^1 = -1$, the strong law of large numbers implies that $\lim_{n \to \infty} \frac{\ln D_n}{n - 1} = -1$ a.s., which in turn implies $\lim_{n \to \infty} \ln D_n = -\infty$ a.s., or equivalently, $\lim_{n \to \infty} D_n = 0$ a.s., which was to be proved. By the hint, for each $\omega$ such that $D_n(\omega)$ converges to zero, the sequence $X_n(\omega)$ is a Cauchy sequence of numbers, and hence has a limit. The set of such $\omega$ has probability one, so $X_n$ converges a.s.

2.24 Mean square convergence of a random series
Let $Y_n = X_1 + \cdots + X_n$. We are interested in determining whether $\lim_{n \to \infty} Y_n$ exists in the m.s. sense. By Proposition 2.2.3, the m.s. limit exists if and only if the limit $\lim_{m,n \to \infty} E[Y_m Y_n]$ exists and is finite. But $E[Y_m Y_n] = \sum_{k=1}^{n \wedge m} \sigma_k^2$, which converges to $\sum_{k=1}^\infty \sigma_k^2$ as $n, m \to \infty$. Thus, $(Y_n)$ converges in the m.s. sense if and only if $\sum_{k=1}^\infty \sigma_k^2 < \infty$.
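As a numerical illustration of the criterion in Problem 2.24, the following sketch evaluates $E[Y_m Y_n] = \sum_{k \le \min(m,n)} \sigma_k^2$ for two variance sequences. Both sequences ($\sigma_k^2 = 1/k^2$ and $\sigma_k^2 = 1/k$) and the function names are assumptions chosen for illustration; they are not part of the problem statement.

```python
import math

def cross_moment(m, n, var):
    """E[Y_m Y_n] = sum of var(k) for k = 1..min(m, n),
    valid when the X_k are orthogonal with E[X_k^2] = var(k)."""
    return sum(var(k) for k in range(1, min(m, n) + 1))

def summable(k):
    # Assumed example: sigma_k^2 = 1/k^2, whose sum is pi^2/6.
    return 1.0 / k**2

def not_summable(k):
    # Assumed example: sigma_k^2 = 1/k, whose partial sums grow like ln(n).
    return 1.0 / k

for n in (10, 1000, 100000):
    # With m = 2n the cross moment depends only on min(m, n) = n.
    print(n, cross_moment(n, 2 * n, summable), cross_moment(n, 2 * n, not_summable))
```

In the first case the cross moments settle near $\pi^2/6$, so the series converges in the m.s. sense; in the second they keep growing, so it does not.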

2.26 A large deviation
Since $E[X_1^2] = 1 < 2$, Cramér's theorem implies that $b = \ell(2)$, which we now compute. Note, for $a > 0$,
\[ \int_{-\infty}^\infty e^{-ax^2} dx = \int_{-\infty}^\infty e^{-\frac{x^2}{2(\frac{1}{2a})}} dx = \sqrt{\frac{\pi}{a}}. \]
So
\[ M(\theta) = \ln E[e^{\theta X^2}] = \ln \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2 (\frac{1}{2} - \theta)} dx = -\frac{1}{2} \ln(1 - 2\theta), \]
\[ \ell(a) = \max_\theta \left\{ \theta a + \frac{1}{2} \ln(1 - 2\theta) \right\} = \frac{1}{2}(a - 1 - \ln a), \quad \text{attained at } \theta^* = \frac{1}{2} \left( 1 - \frac{1}{a} \right), \]
\[ b = \ell(2) = \frac{1}{2}(1 - \ln 2) = 0.1534, \qquad e^{-100b} = 2.18 \times 10^{-7}. \]

2.28 A rapprochement between the central limit theorem and large deviations
(a) Differentiating with respect to $\theta$ yields $M'(\theta) = \left( \frac{dE[\exp(\theta X)]}{d\theta} \right) / E[\exp(\theta X)]$. Differentiating again yields
\[ M''(\theta) = \left( \frac{d^2 E[\exp(\theta X)]}{(d\theta)^2} E[\exp(\theta X)] - \left( \frac{dE[\exp(\theta X)]}{d\theta} \right)^2 \right) / E[\exp(\theta X)]^2. \]
Interchanging differentiation and expectation yields $\frac{d^k E[\exp(\theta X)]}{(d\theta)^k} = E[X^k \exp(\theta X)]$. Therefore, $M'(\theta) = E[X \exp(\theta X)] / E[\exp(\theta X)]$, which is the mean for the tilted distribution $f_\theta$, and
\[ M''(\theta) = \left( E[X^2 \exp(\theta X)] E[\exp(\theta X)] - E[X \exp(\theta X)]^2 \right) / E[\exp(\theta X)]^2, \]
which is the second moment, minus the first moment squared, or simply the variance, for the tilted density $f_\theta$.
(b) In particular, $M'(0) = 0$ and $M''(0) = \mathrm{Var}(X) = \sigma^2$, so the second order Taylor's approximation for $M$ near zero is $M(\theta) \approx \theta^2 \sigma^2 / 2$. Therefore, $\ell(a)$ for small $a$ satisfies $\ell(a) \approx \max_\theta \left( a\theta - \frac{\theta^2 \sigma^2}{2} \right) = \frac{a^2}{2\sigma^2}$, so as $n \to \infty$, the large deviations upper bound behaves as
\[ P[S_n \ge b\sqrt{n}] \le \exp(-n \ell(b/\sqrt{n})) \approx \exp\left( -n \frac{b^2}{2\sigma^2 n} \right) = \exp\left( -\frac{b^2}{2\sigma^2} \right). \]
The exponent is the same as in the bound/approximation to the central limit approximation described in the problem statement. Thus, for moderately large $b$, the central limit theorem approximation and large deviations bound/approximation are consistent with each other.

2.30 Large deviations of a mixed sum
Modifying the derivation for iid random variables, we find that for $\theta \ge 0$:
\[ P\left\{ \frac{S_n}{n} \ge a \right\} \le E[e^{\theta(S_n - an)}] = E[e^{\theta X_1}]^{nf} E[e^{\theta Y_1}]^{n(1-f)} e^{-n\theta a} = \exp(-n[\theta a - f M_X(\theta) - (1 - f) M_Y(\theta)]) \]
where $M_X$ and $M_Y$ are the log moment generating functions of $X_1$ and $Y_1$ respectively.
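Therefore,
\[ l(f, a) = \max_\theta \; \theta a - f M_X(\theta) - (1 - f) M_Y(\theta) \]
where
\[ M_X(\theta) = \begin{cases} -\ln(1 - \theta) & \theta < 1 \\ +\infty & \theta \ge 1 \end{cases} \qquad M_Y(\theta) = \ln \sum_{k=0}^\infty \frac{e^{\theta k} e^{-1}}{k!} = \ln(e^{e^\theta - 1}) = e^\theta - 1. \]
Note that $l(0, a) = a \ln a + 1 - a$ (large deviations exponent for the Poi(1) distribution) and $l(1, a) = a - 1 - \ln(a)$ (large deviations exponent for the Exp(1) distribution). For $0 < f < 1$ we compute $l(f, a)$ by numerical optimization. The result is

  f         0      0+     1/3    2/3    1
  l(f, 4)   2.545  2.282  1.876  1.719  1.614

Note: $l(f, 4)$ is discontinuous in $f$ at $f = 0$. In fact, adding only one exponentially distributed random variable to a sum of Poisson random variables can change the large deviations exponent.

2.32 The limit of a sum of cumulative products of a sequence of uniform random variables
(a) Yes. $E[(B_k - 0)^2] = E[A_1^2]^k = \left( \frac{5}{8} \right)^k \to 0$ as $k \to \infty$. Thus, $B_k \to 0$ in the m.s. sense.
(b) Yes. Each sample path of the sequence $B_k$ is monotone nonincreasing and bounded below by zero, and is hence convergent. Thus, $\lim_{k \to \infty} B_k$ a.s. exists. (The limit has to be the same as the m.s. limit, so $B_k$ converges to zero almost surely.)
(c) If $j \le k$, then $E[B_j B_k] = E[A_1^2 \cdots A_j^2 A_{j+1} \cdots A_k] = \left( \frac{5}{8} \right)^j \left( \frac{3}{4} \right)^{k-j}$. Therefore,
\[ E[S_n S_m] = E\left[ \sum_{j=1}^n B_j \sum_{k=1}^m B_k \right] = \sum_{j=1}^n \sum_{k=1}^m E[B_j B_k] \to \sum_{j=1}^\infty \sum_{k=1}^\infty E[B_j B_k] \qquad (12.3) \]
\[ = 2 \sum_{j=1}^\infty \sum_{k=j+1}^\infty \left( \frac{5}{8} \right)^j \left( \frac{3}{4} \right)^{k-j} + \sum_{j=1}^\infty \left( \frac{5}{8} \right)^j = 2 \sum_{j=1}^\infty \sum_{l=1}^\infty \left( \frac{5}{8} \right)^j \left( \frac{3}{4} \right)^l + \sum_{j=1}^\infty \left( \frac{5}{8} \right)^j \]
\[ = \left( \sum_{j=1}^\infty \left( \frac{5}{8} \right)^j \right) \left( 2 \sum_{l=1}^\infty \left( \frac{3}{4} \right)^l + 1 \right) \qquad (12.4) \]
\[ = \frac{5}{3} (2 \cdot 3 + 1) = \frac{35}{3}. \]
A visual way to derive (12.4) is to note that (12.3) is the sum of all entries in the infinite 2-d array:
\[ \begin{array}{cccc} \vdots & \vdots & \vdots & \\ \left( \frac{5}{8} \right) \left( \frac{3}{4} \right)^2 & \left( \frac{5}{8} \right)^2 \left( \frac{3}{4} \right) & \left( \frac{5}{8} \right)^3 & \cdots \\ \left( \frac{5}{8} \right) \left( \frac{3}{4} \right) & \left( \frac{5}{8} \right)^2 & \left( \frac{5}{8} \right)^2 \left( \frac{3}{4} \right) & \cdots \\ \left( \frac{5}{8} \right) & \left( \frac{5}{8} \right) \left( \frac{3}{4} \right) & \left( \frac{5}{8} \right) \left( \frac{3}{4} \right)^2 & \cdots \end{array} \]
Therefore, $\left( \frac{5}{8} \right)^j \left( 2 \sum_{l=1}^\infty \left( \frac{3}{4} \right)^l + 1 \right)$ is readily seen to be the sum of the $j$th term on the diagonal, plus all terms directly above or directly to the right of that term.
(d) Mean square convergence implies convergence of the mean.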
Thus, the mean of the limit is $\lim_{n \to \infty} E[S_n] = \lim_{n \to \infty} \sum_{k=1}^n E[B_k] = \sum_{k=1}^\infty \left( \frac{3}{4} \right)^k = 3$. The second moment of the limit is the limit of the second moments, namely $\frac{35}{3}$, so the variance of the limit is $\frac{35}{3} - 3^2 = \frac{8}{3}$.
(e) Yes. Each sample path of the sequence $S_n$ is monotone nondecreasing and is hence convergent. Thus, $\lim_{n \to \infty} S_n$ a.s. exists. The limit has to be the same as the m.s. limit. (To show that the limit is finite with probability one, and hence an ordinary random variable, the monotone convergence theorem could be applied, yielding $E[\lim_{n \to \infty} S_n] = \lim_{n \to \infty} E[S_n] = 3$.)

3.2 Linear approximation of the cosine function over an interval
$\widehat{E}[Y|\Theta] = E[Y] + \frac{\mathrm{Cov}(\Theta, Y)}{\mathrm{Var}(\Theta)} (\Theta - E[\Theta])$, where $E[Y] = \frac{1}{\pi} \int_0^\pi \cos(\theta) d\theta = 0$, $E[\Theta] = \frac{\pi}{2}$, $\mathrm{Var}(\Theta) = \frac{\pi^2}{12}$,
\[ E[\Theta Y] = \int_0^\pi \frac{\theta \cos(\theta)}{\pi} d\theta = \frac{\theta \sin(\theta)}{\pi} \Big|_0^\pi - \int_0^\pi \frac{\sin(\theta)}{\pi} d\theta = -\frac{2}{\pi}, \]
and $\mathrm{Cov}(\Theta, Y) = E[\Theta Y] - E[\Theta] E[Y] = -\frac{2}{\pi}$. Therefore, $\widehat{E}[Y|\Theta] = -\frac{24}{\pi^3} \left( \Theta - \frac{\pi}{2} \right)$, so the optimal choice is $a = \frac{12}{\pi^2}$ and $b = -\frac{24}{\pi^3}$.

3.4 Valid covariance matrix
Set $a = 1$ to make $K$ symmetric. Choose $b$ so that the determinants of the following seven matrices are nonnegative:
\[ (2) \quad (1) \quad (1) \quad \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \quad \begin{pmatrix} 2 & b \\ b & 1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad K \text{ itself} \]
The fifth matrix has determinant $2 - b^2$ and $\det(K) = 2 - 1 - b^2 = 1 - b^2$. Hence $K$ is a valid covariance matrix (i.e. symmetric and positive semidefinite) if and only if $a = 1$ and $-1 \le b \le 1$.

3.6 Conditional probabilities with joint Gaussians II
(a) $P[|X - 1| \ge 2] = P[X \le -1 \text{ or } X \ge 3] = P\left[ \frac{X}{2} \le -\frac{1}{2} \right] + P\left[ \frac{X}{2} \ge \frac{3}{2} \right] = \Phi\left( -\frac{1}{2} \right) + 1 - \Phi\left( \frac{3}{2} \right)$.
(b) Given $Y = 3$, the conditional density of $X$ is Gaussian with mean $E[X] + \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} (3 - E[Y]) = 1$ and variance $\mathrm{Var}(X) - \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Var}(Y)} = 4 - \frac{6^2}{18} = 2$.
(c) The estimation error $X - E[X|Y]$ is Gaussian, has mean zero and variance 2, and is independent of $Y$. (The variance of the error was calculated to be 2 in part (b).)
Thus the probability is $\Phi\left( -\frac{1}{\sqrt{2}} \right) + 1 - \Phi\left( \frac{1}{\sqrt{2}} \right)$, which can also be written as $2\Phi\left( -\frac{1}{\sqrt{2}} \right)$ or $2\left( 1 - \Phi\left( \frac{1}{\sqrt{2}} \right) \right)$.

3.8 An MMSE estimation problem
(a) $E[XY] = 2 \int_0^1 \int_{2x}^{1+x} xy \, dy \, dx = \frac{5}{12}$. The other moments can be found in a similar way. Alternatively, note that the marginal densities are given by
\[ f_X(x) = \begin{cases} 2(1 - x) & 0 \le x \le 1 \\ 0 & \text{else} \end{cases} \qquad f_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 \le y \le 2 \\ 0 & \text{else} \end{cases} \]
so that $EX = \frac{1}{3}$, $\mathrm{Var}(X) = \frac{1}{18}$, $EY = 1$, $\mathrm{Var}(Y) = \frac{1}{6}$, $\mathrm{Cov}(X, Y) = \frac{5}{12} - \frac{1}{3} = \frac{1}{12}$. So
\[ \widehat{E}[X|Y] = \frac{1}{3} + \frac{1}{12} \left( \frac{1}{6} \right)^{-1} (Y - 1) = \frac{1}{3} + \frac{Y - 1}{2} \]
\[ E[e^2] = \frac{1}{18} - \left( \frac{1}{12} \right) \left( \frac{1}{6} \right)^{-1} \left( \frac{1}{12} \right) = \frac{1}{72} = \text{the MMSE for } \widehat{E}[X|Y]. \]
Inspection of Figure 12.2 shows that for $0 \le y \le 2$, the conditional distribution of $X$ given $Y = y$ is the uniform distribution over the interval $[0, y/2]$ if $0 \le y \le 1$ and over the interval $[y - 1, y/2]$ if $1 \le y \le 2$. The conditional mean of $X$ given $Y = y$ is thus the midpoint of that interval, yielding:
\[ E[X|Y] = \begin{cases} \frac{Y}{4} & 0 \le Y \le 1 \\ \frac{3Y - 2}{4} & 1 \le Y \le 2 \end{cases} \]
To find the corresponding MSE, note that given $Y$, the conditional distribution of $X$ is uniform over some interval. Let $L(Y)$ denote the length of the interval. Then
\[ E[e^2] = E[E[e^2|Y]] = E\left[ \frac{1}{12} L(Y)^2 \right] = 2 \left( \frac{1}{12} \int_0^1 y \left( \frac{y}{2} \right)^2 dy \right) = \frac{1}{96}. \]
For this example, the MSE for the best estimator is 25% smaller than the MSE for the best linear estimator.
[Figure 12.2: Sketch of $E[X|Y = y]$ and $\widehat{E}[X|Y = y]$.]
(b) $EX = \int_{-\infty}^\infty |y| \frac{1}{\sqrt{2\pi}} e^{-y^2/2} dy = 2 \int_0^\infty \frac{y}{\sqrt{2\pi}} e^{-\frac{1}{2} y^2} dy = \sqrt{\frac{2}{\pi}}$ and $EY = 0$, $\mathrm{Var}(Y) = 1$, $\mathrm{Cov}(X, Y) = E[|Y| Y] = 0$, so
\[ \widehat{E}[X|Y] = \sqrt{\frac{2}{\pi}} + \frac{0}{1} Y \equiv \sqrt{\frac{2}{\pi}}. \]
That is, the best linear estimator is the constant $EX$. The corresponding MSE is $\mathrm{Var}(X) = E[X^2] - (EX)^2 = E[Y^2] - \frac{2}{\pi} = 1 - \frac{2}{\pi}$. Note that $|Y|$ is a function of $Y$ with mean square error $E[(X - |Y|)^2] = 0$. Nothing can beat that, so $|Y|$ is the MMSE estimator of $X$ given $Y$. So $|Y| = E[X|Y]$.
The corresponding MSE is 0, or 100% smaller than the MSE for the best linear estimator.

3.10 Conditional Gaussian comparison
(a) $p_a = P\{X \ge 2\} = P\left\{ \frac{X}{\sqrt{10}} \ge \frac{2}{\sqrt{10}} \right\} = Q\left( \frac{2}{\sqrt{10}} \right) = Q(0.6324)$.
(b) By the theory of conditional distributions for jointly Gaussian random variables, the conditional distribution of $X$ given $Y = y$ is Gaussian, with mean $\widehat{E}[X|Y = y]$ and variance $\sigma_e^2$, which is the MSE for estimation of $X$ by $\widehat{E}[X|Y]$. Since $X$ and $Y$ are mean zero and $\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} = 0.8$, we have $\widehat{E}[X|Y = y] = 0.8y$, and $\sigma_e^2 = \mathrm{Var}(X) - \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Var}(Y)} = 3.6$. Hence, given $Y = y$, the conditional distribution of $X$ is $N(0.8y, 3.6)$. Therefore, $P[X \ge 2|Y = y] = Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right)$. In particular, $p_b = P[X \ge 2|Y = 3] = Q\left( \frac{2 - (0.8)3}{\sqrt{3.6}} \right) = Q(-0.2108)$.
(c) Given the event $\{Y \ge 3\}$, the conditional pdf of $Y$ is obtained by setting the pdf of $Y$ to zero on the interval $(-\infty, 3)$, and then renormalizing by $P\{Y \ge 3\} = Q\left( \frac{3}{\sqrt{10}} \right)$ to make the density integrate to one. We can write this as
\[ f_{Y|Y \ge 3}(y) = \begin{cases} \frac{f_Y(y)}{1 - F_Y(3)} = \frac{e^{-y^2/20}}{Q(\frac{3}{\sqrt{10}}) \sqrt{20\pi}} & y \ge 3 \\ 0 & \text{else.} \end{cases} \]
Using this density, by considering the possible values of $Y$, we have
\[ p_c = P[X \ge 2|Y \ge 3] = \int_3^\infty P[X \ge 2, Y \in dy|Y \ge 3] = \int_3^\infty P[X \ge 2|Y = y] P[Y \in dy|Y \ge 3] = \int_3^\infty Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right) f_{Y|Y \ge 3}(y) dy. \]
(ALTERNATIVE) The same expression can be derived in a more conventional fashion as follows:
\[ \begin{aligned} p_c = P[X \ge 2|Y \ge 3] &= \frac{P\{X \ge 2, Y \ge 3\}}{P\{Y \ge 3\}} = \int_3^\infty \left[ \int_2^\infty f_{X|Y}(x|y) dx \right] f_Y(y) dy / P\{Y \ge 3\} \\ &= \int_3^\infty Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right) f_Y(y) dy / (1 - F_Y(3)) = \int_3^\infty Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right) f_{Y|Y \ge 3}(y) dy. \end{aligned} \]
(d) We will show that $p_a < p_b < p_c$. The inequality $p_a < p_b$ follows from parts (a) and (b) and the fact the function $Q$ is decreasing. By part (c), $p_c$ is an average of $Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right)$ with respect to $y$ over the region $y \in [3, \infty)$ (using the pdf $f_{Y|Y \ge 3}$). But everywhere in that region, $Q\left( \frac{2 - (0.8)y}{\sqrt{3.6}} \right) > p_b$, showing that $p_c > p_b$.
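The ordering $p_a < p_b < p_c$ in Problem 3.10 can also be checked numerically. The sketch below is only an illustration: the values $\mathrm{Var}(X) = \mathrm{Var}(Y) = 10$ are inferred from the expressions in parts (a) and (c), the integral for $p_c$ is approximated by a trapezoid rule truncated at an arbitrarily chosen upper limit, and all function names are hypothetical.

```python
import math

def Q(u):
    """Standard normal complementary CDF, via the complementary error function."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

VAR_X = 10.0   # inferred from p_a = Q(2 / sqrt(10))
VAR_Y = 10.0   # inferred from the density exp(-y^2 / 20) in part (c)
SLOPE = 0.8    # Cov(X, Y) / Var(Y), from part (b)
VAR_E = 3.6    # conditional variance of X given Y, from part (b)

def cond_tail(y):
    """P[X >= 2 | Y = y] for the conditional law N(0.8 y, 3.6)."""
    return Q((2.0 - SLOPE * y) / math.sqrt(VAR_E))

def pdf_y(y):
    """Density of Y ~ N(0, VAR_Y)."""
    return math.exp(-y * y / (2.0 * VAR_Y)) / math.sqrt(2.0 * math.pi * VAR_Y)

def p_c(upper=40.0, steps=20000):
    """Trapezoid estimate of the integral in part (c), truncated at `upper`."""
    h = (upper - 3.0) / steps
    total = 0.5 * (cond_tail(3.0) * pdf_y(3.0) + cond_tail(upper) * pdf_y(upper))
    for i in range(1, steps):
        y = 3.0 + i * h
        total += cond_tail(y) * pdf_y(y)
    # Renormalize by P{Y >= 3} to obtain the conditional average.
    return total * h / Q(3.0 / math.sqrt(VAR_Y))

pa = Q(2.0 / math.sqrt(VAR_X))
pb = cond_tail(3.0)
pc = p_c()
print(pa, pb, pc)
```

The truncation point 40 is more than twelve standard deviations of $Y$ from zero, so the neglected tail is negligible at this precision.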

3.12 An estimator of an estimator
To show that $\widehat{E}[X|Y]$ is the LMMSE estimator of $E[X|Y]$, it suffices by the orthogonality principle to note that $\widehat{E}[X|Y]$ is linear in $(1, Y)$ and to prove that $E[X|Y] - \widehat{E}[X|Y]$ is orthogonal to 1 and to $Y$. However $E[X|Y] - \widehat{E}[X|Y]$ can be written as the difference of two random variables $(X - \widehat{E}[X|Y])$ and $(X - E[X|Y])$, which are each orthogonal to 1 and to $Y$. Thus, $E[X|Y] - \widehat{E}[X|Y]$ is also orthogonal to 1 and to $Y$, and the result follows.
Here is a generalization, which can be proved in the same way. Suppose $\mathcal{V}_0$ and $\mathcal{V}_1$ are two closed linear subspaces of random variables with finite second moments, such that $\mathcal{V}_0 \supset \mathcal{V}_1$. Let $X$ be a random variable with finite second moment, and let $X_i^*$ be the variable in $\mathcal{V}_i$ with the minimum mean square distance to $X$, for $i = 0$ or $i = 1$. Then $X_1^*$ is the variable in $\mathcal{V}_1$ with the minimum mean square distance to $X_0^*$.
Another solution to the original problem can be obtained by using the formula for $\widehat{E}[Z|Y]$ applied to $Z = E[X|Y]$:
\[ \widehat{E}[E[X|Y]|Y] = E[E[X|Y]] + \mathrm{Cov}(Y, E[X|Y]) \mathrm{Var}(Y)^{-1} (Y - EY) \]
which can be simplified using $E[E[X|Y]] = EX$ and
\[ \begin{aligned} \mathrm{Cov}(Y, E[X|Y]) &= E[Y(E[X|Y] - EX)] = E[Y E[X|Y]] - EY \, EX \\ &= E[E[XY|Y]] - EY \, EX = E[XY] - EX \, EY = \mathrm{Cov}(X, Y) \end{aligned} \]
to yield the desired result.

3.14 Some identities for estimators
(a) True. The random variable $E[X|Y] \cos(Y)$ has the following two properties:
• It is a function of $Y$ with finite second moment (because $E[X|Y]$ is a function of $Y$ with finite second moment and $\cos(Y)$ is a bounded function of $Y$).
• $(X \cos(Y) - E[X|Y] \cos(Y)) \perp g(Y)$ for any $g$ with $E[g(Y)^2] < \infty$ (because for any such $g$, $E[(X \cos(Y) - E[X|Y] \cos(Y)) g(Y)] = E[(X - E[X|Y]) \tilde{g}(Y)] = 0$, where $\tilde{g}(Y) = g(Y) \cos(Y)$).
Thus, by the orthogonality principle, $E[X|Y] \cos(Y)$ is equal to $E[X \cos(Y)|Y]$.
(b) True.
The left hand side is the projection of $X$ onto the space $\{g(Y) : E[g(Y)^2] < \infty\}$ and the right hand side is the projection of $X$ onto the space $\{f(Y^3) : E[f(Y^3)^2] < \infty\}$. But these two spaces are the same, because for each function $g$ there is the function $f(u) = g(u^{1/3})$. The point is that the function $y^3$ is an invertible function, so any function of $Y$ can also be written as a function of $Y^3$.
(c) False. For example, let $X$ be uniform on the interval $[0, 1]$ and let $Y$ be identically zero. Then $E[X^3|Y] = E[X^3] = \frac{1}{4}$ and $E[X|Y]^3 = E[X]^3 = \frac{1}{8}$.
(d) False. For example, let $P\{X = Y = 1\} = P\{X = Y = -1\} = 0.5$. Then $E[X|Y] = Y$ while $E[X|Y^2] = 0$. The point is that the function $y^2$ is not invertible, so that not every function of $Y$ can be written as a function of $Y^2$. Equivalently, $Y^2$ can give less information than $Y$.
(e) False. For example, let $X$ be uniformly distributed on $[-1, 1]$, and let $Y = X$. Then $\widehat{E}[X|Y] = Y$ while
\[ \widehat{E}[X|Y^3] = E[X] + \frac{\mathrm{Cov}(X, Y^3)}{\mathrm{Var}(Y^3)} (Y^3 - E[Y^3]) = \frac{E[X^4]}{E[X^6]} Y^3 = \frac{7}{5} Y^3. \]

3.16 Some simple examples
Of course there are many valid answers for this problem; we only give one.
(a) Let $X$ denote the outcome of a roll of a fair die, and let $Y = 1$ if $X$ is odd and $Y = 2$ if $X$ is even. Then $E[X|Y]$ has to be linear. In fact, since $Y$ has only two possible values, any function of $Y$ can be written in the form $a + bY$. That is, any function of $Y$ is linear. (There is no need to even calculate $E[X|Y]$ here, but we note that it is given by $E[X|Y] = Y + 2$.)
(b) Let $X$ be a $N(0, 1)$ random variable, and let $W$ be independent of $X$, with $P\{W = 1\} = P\{W = -1\} = \frac{1}{2}$. Finally, let $Y = XW$. The conditional distribution of $Y$ given $W$ is $N(0, 1)$, for either possible value of $W$, so the unconditional distribution of $Y$ is also $N(0, 1)$. However, $P\{X - Y = 0\} = 0.5$, so that $X - Y$ is not a Gaussian random variable, so $X$ and $Y$ are not jointly Gaussian.
(c) Let the triplet $(X, Y, Z)$ take on the four values $(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)$ with equal probability. Then any pair of these variables takes the values $(0, 0), (0, 1), (1, 0), (1, 1)$ with equal probability, indicating pairwise independence. But $P\{(X, Y, Z) = (0, 0, 1)\} = 0 \ne P\{X = 0\} P\{Y = 0\} P\{Z = 1\} = \frac{1}{8}$. So the three random variables are not independent.

3.18 Estimating a quadratic
(a) Recall the fact that $E[Z^2] = E[Z]^2 + \mathrm{Var}(Z)$ for any second order random variable $Z$. The idea is to apply the fact to the conditional distribution of $X$ given $Y$. Given $Y$, the conditional distribution of $X$ is Gaussian with mean $\rho Y$ and variance $1 - \rho^2$. Thus, $E[X^2|Y] = (\rho Y)^2 + 1 - \rho^2$.
(b) The MSE $= E[(X^2)^2] - E[(E[X^2|Y])^2] = E[X^4] - \rho^4 E[Y^4] - 2\rho^2 E[Y^2](1 - \rho^2) - (1 - \rho^2)^2 = 2(1 - \rho^4)$.
(c) Since $\mathrm{Cov}(X^2, Y) = E[X^2 Y] = 0$, it follows that $\widehat{E}[X^2|Y] = E[X^2] = 1$. That is, the best linear estimator in this case is just the constant estimator equal to 1.

3.20 An innovations sequence and its application
(a) $\tilde{Y}_1 = Y_1$. (Note: $E[\tilde{Y}_1^2] = 1$.)
$\tilde{Y}_2 = Y_2 - \frac{E[Y_2 \tilde{Y}_1]}{E[\tilde{Y}_1^2]} \tilde{Y}_1 = Y_2 - 0.5 Y_1$. (Note: $E[\tilde{Y}_2^2] = 0.75$.)
$\tilde{Y}_3 = Y_3 - \frac{E[Y_3 \tilde{Y}_1]}{E[\tilde{Y}_1^2]} \tilde{Y}_1 - \frac{E[Y_3 \tilde{Y}_2]}{E[\tilde{Y}_2^2]} \tilde{Y}_2 = Y_3 - (0.5) \tilde{Y}_1 - \frac{1}{3} \tilde{Y}_2 = Y_3 - \frac{1}{3} Y_1 - \frac{1}{3} Y_2$.
Summarizing,
\[ \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \quad \text{where} \quad A = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{1}{2} & 1 & 0 \\ -\frac{1}{3} & -\frac{1}{3} & 1 \end{pmatrix}. \]
(b) Since
\[ \mathrm{Cov} \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} \quad \text{and} \quad \mathrm{Cov}\left( X, \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \right) = (0 \;\; 0.25 \;\; 0.25), \]
it follows that
\[ \mathrm{Cov} \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} A^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{3}{4} & 0 \\ 0 & 0 & \frac{2}{3} \end{pmatrix}, \]
and that
\[ \mathrm{Cov}\left( X, \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} \right) = (0 \;\; 0.25 \;\; 0.25) A^T = \left( 0 \;\; \tfrac{1}{4} \;\; \tfrac{1}{6} \right). \]
(c) $a = \frac{\mathrm{Cov}(X, \tilde{Y}_1)}{E[\tilde{Y}_1^2]} = 0$, $\quad b = \frac{\mathrm{Cov}(X, \tilde{Y}_2)}{E[\tilde{Y}_2^2]} = \frac{1}{3}$, $\quad c = \frac{\mathrm{Cov}(X, \tilde{Y}_3)}{E[\tilde{Y}_3^2]} = \frac{1}{4}$.
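The matrix identities in Problem 3.20 are easy to verify with exact rational arithmetic. The following sketch uses only the numbers stated in the solution; the helper functions `matmul` and `transpose` are small illustrative utilities, not part of the notes.

```python
from fractions import Fraction as F

def matmul(P, R):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * R[k][j] for k in range(len(R)))
             for j in range(len(R[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

# Innovations transform A from part (a).
A = [[F(1), F(0), F(0)],
     [F(-1, 2), F(1), F(0)],
     [F(-1, 3), F(-1, 3), F(1)]]

# Covariance of (Y1, Y2, Y3) and cross covariance of X with (Y1, Y2, Y3).
half = F(1, 2)
K = [[F(1), half, half], [half, F(1), half], [half, half, F(1)]]
cov_x_y = [[F(0), F(1, 4), F(1, 4)]]

cov_innov = matmul(matmul(A, K), transpose(A))      # should be diagonal
cov_x_innov = matmul(cov_x_y, transpose(A))[0]      # Cov(X, innovations)
coeffs = [cov_x_innov[i] / cov_innov[i][i] for i in range(3)]  # a, b, c

print(cov_innov)
print(cov_x_innov)
print(coeffs)
```

Because the covariance of the innovations is diagonal, each coefficient is a single ratio, which is the point of working with the innovations sequence.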

3.22 A Kalman filtering example
(a) $\hat{x}_{k+1|k} = f \hat{x}_{k|k-1} + K_k (y_k - \hat{x}_{k|k-1})$,
\[ \sigma_{k+1}^2 = f^2 (\sigma_k^2 - \sigma_k^2 (\sigma_k^2 + 1)^{-1} \sigma_k^2) + 1 = \frac{\sigma_k^2 f^2}{1 + \sigma_k^2} + 1 \]
and $K_k = f \left( \frac{\sigma_k^2}{1 + \sigma_k^2} \right)$.
(b) Since $\sigma_k^2 \le 1 + f^2$ for $k \ge 1$, the sequence $(\sigma_k^2)$ is bounded for any value of $f$.

3.24 A variation of Kalman filtering
Equations (3.25) and (3.24) are still true: $\hat{x}_{k+1|k} = f \hat{x}_{k|k-1} + K_k (y_k - \hat{x}_{k|k-1})$ and $K_k = \mathrm{Cov}(x_{k+1} - f \hat{x}_{k|k-1}, \tilde{y}_k) \mathrm{Cov}(\tilde{y}_k)^{-1}$. The variable $v_k$ in (3.28) is replaced by $w_k$ to yield:
\[ \mathrm{Cov}(x_{k+1} - f \hat{x}_{k|k-1}, \tilde{y}_k) = \mathrm{Cov}(f(x_k - \hat{x}_{k|k-1}) + w_k, \; (x_k - \hat{x}_{k|k-1}) + w_k) = f \sigma_k^2 + 1. \]
As before, writing $\sigma_k^2$ for $\Sigma_{k|k-1}$,
\[ \mathrm{Cov}(\tilde{y}_k) = \mathrm{Cov}((x_k - \hat{x}_{k|k-1}) + w_k) = \sigma_k^2 + 1. \]
So now
\[ K_k = \frac{f \sigma_k^2 + 1}{1 + \sigma_k^2} \quad \text{and} \quad \sigma_{k+1}^2 = (f^2 \sigma_k^2 + 1) - \frac{(f \sigma_k^2 + 1)^2}{1 + \sigma_k^2} = \frac{(1 - f)^2 \sigma_k^2}{1 + \sigma_k^2}. \]
Note that if $f = 1$ then $\sigma_k^2 = 0$ for all $k \ge 1$, which makes sense because $x_k = y_{k-1}$ in that case.

3.26 An innovations problem
(a) $E[Y_n] = E[U_1 \cdots U_n] = E[U_1] \cdots E[U_n] = 2^{-n}$ and $E[Y_n^2] = E[U_1^2 \cdots U_n^2] = E[U_1^2] \cdots E[U_n^2] = 3^{-n}$, so $\mathrm{Var}(Y_n) = 3^{-n} - (2^{-n})^2 = 3^{-n} - 4^{-n}$.
(b) $E[Y_n|Y_0, \ldots, Y_{n-1}] = E[Y_{n-1} U_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1} E[U_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1} E[U_n] = Y_{n-1}/2$.
(c) Since the conditional expectation found in (b) is linear, it follows that $\widehat{E}[Y_n|Y_0, \ldots, Y_{n-1}] = E[Y_n|Y_0, \ldots, Y_{n-1}] = Y_{n-1}/2$.
(d) $\tilde{Y}_0 = Y_0 = 1$, and $\tilde{Y}_n = Y_n - Y_{n-1}/2$ (also equal to $U_1 \cdots U_{n-1} (U_n - \frac{1}{2})$) for $n \ge 1$.
(e) For $n \ge 1$, $\mathrm{Var}(\tilde{Y}_n) = E[(\tilde{Y}_n)^2] = E[U_1^2 \cdots U_{n-1}^2 (U_n - \frac{1}{2})^2] = 3^{-(n-1)}/12$ and
\[ \begin{aligned} \mathrm{Cov}(X_M, \tilde{Y}_n) &= E[(U_1 + \cdots + U_M) \tilde{Y}_n] = E[(U_1 + \cdots + U_M) U_1 \cdots U_{n-1} (U_n - \tfrac{1}{2})] \\ &= E[U_n (U_1 \cdots U_{n-1})(U_n - \tfrac{1}{2})] = 2^{-(n-1)} \mathrm{Var}(U_n) = 2^{-(n-1)}/12. \end{aligned} \]
Since $\tilde{Y}_0 = 1$ and all the other innovations variables are mean zero, we have
\[ \widehat{E}[X_M|Y_0, \ldots, Y_M] = \frac{M}{2} + \sum_{n=1}^M \frac{\mathrm{Cov}(X_M, \tilde{Y}_n) \tilde{Y}_n}{\mathrm{Var}(\tilde{Y}_n)} = \frac{M}{2} + \sum_{n=1}^M \frac{2^{-n+1}/12}{3^{-n+1}/12} \tilde{Y}_n = \frac{M}{2} + \sum_{n=1}^M \left( \frac{3}{2} \right)^{n-1} \tilde{Y}_n. \]

3.28 Linear innovations and orthogonal polynomials for the uniform distribution
(a)
\[ E[U^n] = \int_{-1}^1 \frac{u^n}{2} du = \frac{u^{n+1}}{2(n + 1)} \Big|_{-1}^1 = \begin{cases} \frac{1}{n+1} & n \text{ even} \\ 0 & n \text{ odd} \end{cases} \]
(b) The formula for the linear innovations sequence yields: $\tilde{Y}_1 = U$, $\tilde{Y}_2 = U^2 - \frac{1}{3}$, $\tilde{Y}_3 = U^3 - \frac{3U}{5}$, and
\[ \tilde{Y}_4 = U^4 - \frac{E[U^4 \cdot 1]}{E[1^2]} \cdot 1 - \frac{E[U^4 (U^2 - \frac{1}{3})]}{E[(U^2 - \frac{1}{3})^2]} \left( U^2 - \frac{1}{3} \right) = U^4 - \frac{1}{5} - \left( \frac{\frac{1}{7} - \frac{1}{15}}{\frac{1}{5} - \frac{2}{9} + \frac{1}{9}} \right) \left( U^2 - \frac{1}{3} \right) = U^4 - \frac{6}{7} U^2 + \frac{3}{35}. \]
Note: These mutually orthogonal (with respect to the uniform distribution on $[-1, 1]$) polynomials $1$, $U$, $U^2 - \frac{1}{3}$, $U^3 - \frac{3}{5} U$, $U^4 - \frac{6}{7} U^2 + \frac{3}{35}$ are (up to constant multiples) known as the Legendre polynomials.

3.30 Kalman filter for a rotating state
(a) Given a nonzero vector $x_k$, the vector $F x_k$ is obtained by rotating the vector $x_k$ one tenth revolution counter-clockwise about the origin, and then shrinking the vector towards zero by one percent. Thus, successive iterates $F^k x_o$ spiral in towards zero, with one-tenth revolution per time unit, and shrinking by about ten percent per revolution.
(b) Suppose $\Sigma_{k|k-1} = \begin{pmatrix} a & c \\ c & b \end{pmatrix}$. Note that $H_k^T \Sigma_{k|k-1} H_k$ is simply the scalar $a$. The equations for $\Sigma_{k+1|k}$ and $K_k$ can thus be written as
\[ \Sigma_{k+1|k} = F \left[ \Sigma_{k|k-1} - \begin{pmatrix} \frac{a^2}{1+a} & \frac{ac}{1+a} \\ \frac{ac}{1+a} & \frac{c^2}{1+a} \end{pmatrix} \right] F^T + Q, \qquad K_k = F \begin{pmatrix} \frac{a}{1+a} \\ \frac{c}{1+a} \end{pmatrix}. \]

4.2 Correlation function of a product
$R_X(s, t) = E[Y_s Z_s Y_t Z_t] = E[Y_s Y_t Z_s Z_t] = E[Y_s Y_t] E[Z_s Z_t] = R_Y(s, t) R_Z(s, t)$

4.4 Another sinusoidal random process
(a) Since $EX_1 = EX_2 = 0$, $EY_t \equiv 0$. The autocorrelation function is given by
\[ \begin{aligned} R_Y(s, t) &= E[X_1^2] \cos(2\pi s) \cos(2\pi t) - E[X_1 X_2] [\cos(2\pi s) \sin(2\pi t) + \sin(2\pi s) \cos(2\pi t)] + E[X_2^2] \sin(2\pi s) \sin(2\pi t) \\ &= \sigma^2 (\cos(2\pi s) \cos(2\pi t) + \sin(2\pi s) \sin(2\pi t)) \\ &= \sigma^2 \cos(2\pi (s - t)) \quad \text{(a function of } s - t \text{ only)} \end{aligned} \]
So $(Y_t : t \in \mathbb{R})$ is WSS.
(b) If $X_1$ and $X_2$ are each Gaussian random variables and are independent then $(Y_t : t \in \mathbb{R})$ is a real-valued Gaussian WSS random process and is hence stationary.
(c) A simple solution to this problem is to take $X_1$ and $X_2$ to be independent, mean zero, variance $\sigma^2$ random variables with different distributions. For example, $X_1$ could be $N(0, \sigma^2)$ and $X_2$ could be discrete with $P(X_2 = \sigma) = P(X_2 = -\sigma) = \frac{1}{2}$. Then $Y_0 = X_1$ and $Y_{3/4} = X_2$, so $Y_0$ and $Y_{3/4}$ do not have the same distribution, so that $Y$ is not stationary.

4.6 Brownian motion: Ascension and smoothing
(a) Since the increments of $W$ over nonoverlapping intervals are independent, mean zero Gaussian random variables,
\[ P\{W_r \le W_s \le W_t\} = P\{W_s - W_r \ge 0, W_t - W_s \ge 0\} = P\{W_s - W_r \ge 0\} P\{W_t - W_s \ge 0\} = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}. \]
(b) Since $W$ is a Gaussian process, the three random variables $W_r, W_s, W_t$ are jointly Gaussian. They also all have mean zero, so that
\[ \begin{aligned} E[W_s|W_r, W_t] = \widehat{E}[W_s|W_r, W_t] &= (\mathrm{Cov}(W_s, W_r), \mathrm{Cov}(W_s, W_t)) \begin{pmatrix} \mathrm{Var}(W_r) & \mathrm{Cov}(W_r, W_t) \\ \mathrm{Cov}(W_t, W_r) & \mathrm{Var}(W_t) \end{pmatrix}^{-1} \begin{pmatrix} W_r \\ W_t \end{pmatrix} \\ &= (r, s) \begin{pmatrix} r & r \\ r & t \end{pmatrix}^{-1} \begin{pmatrix} W_r \\ W_t \end{pmatrix} = \frac{(t - s) W_r + (s - r) W_t}{t - r}, \end{aligned} \]
where we use the fact $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$. Note that as $s$ varies from $r$ to $t$, $E[W_s|W_r, W_t]$ is obtained by linearly interpolating between $W_r$ and $W_t$.

4.8 Some Poisson process calculations
(a) (Method 1) Given $N_2 = 2$, the distribution of the locations of the first two jump points are as if they are independently and uniformly distributed over the interval $[0, 2]$. Thus, each falls in the second half of the interval with probability 1/2, and they both fall into the second half with probability $\left( \frac{1}{2} \right)^2 = \frac{1}{4}$. Thus, at least one of them falls in the first half of the interval with probability $1 - \frac{1}{4} = \frac{3}{4}$.
(Method 2) Since $N_1$ and $N_2 - N_1$ are independent, $\mathrm{Poi}(\lambda)$ random variables,
\[ \begin{aligned} P\{N_1 \ge 1, N_2 = 2\} &= P\{N_1 = 1, N_2 = 2\} + P\{N_1 = 2, N_2 = 2\} \\ &= P\{N_1 = 1, N_2 - N_1 = 1\} + P\{N_1 = 2, N_2 - N_1 = 0\} \\ &= (\lambda e^{-\lambda})(\lambda e^{-\lambda}) + \left(\frac{\lambda^2 e^{-\lambda}}{2}\right)e^{-\lambda} = \frac{3\lambda^2 e^{-2\lambda}}{2} \end{aligned} \]
and $P\{N_2 = 2\} = \frac{(2\lambda)^2 e^{-2\lambda}}{2}$. Thus, $P[N_1 \ge 1 \mid N_2 = 2] = \frac{3\lambda^2 e^{-2\lambda}}{2} \Big/ \frac{(2\lambda)^2 e^{-2\lambda}}{2} = \frac{3}{4}$.
(b) $P\{N_1 \ge 1\} = 1 - e^{-\lambda}$. Therefore, $P[N_2 = 2 \mid N_1 \ge 1] = \frac{3\lambda^2 e^{-2\lambda}}{2} \Big/ (1 - e^{-\lambda})$.
(c) Yes. The process $N$ is Markov (because $N_0$ is constant and $N$ has independent increments), and since $X_t$ and $N_t$ are functions of each other for each $t$, the process $X$ is also Markov. The state space for $X$ is $\mathcal{S} = \{0, 1, 4, 9, \ldots\}$. For $i, j \in \mathbb{Z}_+$ and $t, \tau \ge 0$,
\[ p_{i^2, j^2}(\tau) = P[X_{t+\tau} = j^2 \mid X_t = i^2] = \begin{cases} \frac{(\lambda\tau)^{j-i} e^{-\lambda\tau}}{(j-i)!} & 0 \le i \le j \\ 0 & \text{else} \end{cases} \]

4.10 MMSE prediction for a Gaussian process based on two observations
(a) Since $R_X(0) = 5$, $R_X(1) = 0$, and $R_X(2) = -\frac{5}{9}$, the covariance matrix is
\[ \begin{pmatrix} 5 & 0 & -\frac{5}{9} \\ 0 & 5 & 0 \\ -\frac{5}{9} & 0 & 5 \end{pmatrix}. \]
(b) As the variables are all mean zero, $E[X(4) \mid X(2)] = \frac{\mathrm{Cov}(X(4), X(2))}{\mathrm{Var}(X(2))}X(2) = -\frac{X(2)}{9}$.
(c) The variable $X(3)$ is uncorrelated with $(X(2), X(4))^T$. Since the variables are jointly Gaussian, $X(3)$ is also independent of $(X(2), X(4))^T$. So $E[X(4) \mid X(2)] = E[X(4) \mid X(2), X(3)] = -\frac{X(2)}{9}$.

4.12 Poisson process probabilities
(a) The numbers of arrivals in the disjoint intervals are independent, Poisson random variables with mean $\lambda$. Thus, the probability is $(\lambda e^{-\lambda})^3 = \lambda^3 e^{-3\lambda}$.
(b) The event is the same as the event that the numbers of counts in the intervals $[0,1]$, $[1,2]$, and $[2,3]$ are 020, 111, or 202. The probability is thus
\[ e^{-\lambda}\left(\frac{\lambda^2}{2}e^{-\lambda}\right)e^{-\lambda} + (\lambda e^{-\lambda})^3 + \left(\frac{\lambda^2}{2}e^{-\lambda}\right)e^{-\lambda}\left(\frac{\lambda^2}{2}e^{-\lambda}\right) = \left(\frac{\lambda^2}{2} + \lambda^3 + \frac{\lambda^4}{4}\right)e^{-3\lambda}. \]
(c) This is the same as the probability the counts are 020, divided by the answer to part (b), or $\frac{\lambda^2}{2} \Big/ \left(\frac{\lambda^2}{2} + \lambda^3 + \frac{\lambda^4}{4}\right) = \frac{2}{2 + 4\lambda + \lambda^2}$.
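The algebra in parts (b) and (c) of Problem 4.12 is easy to get wrong by a power of $\lambda$, so a numerical check is useful. The sketch below is illustrative only (the helper names `pattern_prob`, `part_b`, and `part_c` are not from the notes); it evaluates the three count patterns directly from the Poisson pmf and compares them with the closed forms.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P{N = k} for a Poisson random variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


def pattern_prob(counts, lam):
    """Probability that three disjoint unit intervals have the given counts."""
    prob = 1.0
    for c in counts:
        prob *= poisson_pmf(c, lam)
    return prob


def part_b(lam):
    """Probability that the counts are 020, 111, or 202."""
    patterns = [(0, 2, 0), (1, 1, 1), (2, 0, 2)]
    return sum(pattern_prob(p, lam) for p in patterns)


def part_c(lam):
    """Conditional probability of the pattern 020 given the event of part (b)."""
    return pattern_prob((0, 2, 0), lam) / part_b(lam)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 3.0):
        closed_b = (lam**2 / 2 + lam**3 + lam**4 / 4) * exp(-3 * lam)
        closed_c = 2 / (2 + 4 * lam + lam**2)
        print(lam, part_b(lam), closed_b, part_c(lam), closed_c)
```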
4.14 Adding jointly stationary Gaussian processes
(a) $R_Z(s,t) = E\left[\left(\frac{X(s)+Y(s)}{2}\right)\left(\frac{X(t)+Y(t)}{2}\right)\right] = \frac{1}{4}\left[R_X(s-t) + R_Y(s-t) + R_{XY}(s-t) + R_{YX}(s-t)\right]$. So $R_Z(s,t)$ is a function of $s-t$. Also, $R_{YX}(s,t) = R_{XY}(t,s)$. Thus,
\[ R_Z(\tau) = \frac{1}{4}\left[2e^{-|\tau|} + \frac{e^{-|\tau-3|}}{2} + \frac{e^{-|\tau+3|}}{2}\right]. \]
(b) Yes, the mean function of $Z$ is constant ($\mu_Z \equiv 0$) and $R_Z(s,t)$ is a function of $s-t$ only, so $Z$ is WSS. Moreover, $Z$ is obtained from the jointly Gaussian processes $X$ and $Y$ by linear operations, so $Z$ is a Gaussian process. Since $Z$ is Gaussian and WSS, it is stationary.
(c) $P[X(1) < 5Y(2) + 1] = P\left[\frac{X(1) - 5Y(2)}{\sigma} \le \frac{1}{\sigma}\right] = \Phi\left(\frac{1}{\sigma}\right)$, where
\[ \sigma^2 = \mathrm{Var}(X(1) - 5Y(2)) = R_X(0) - 10R_{XY}(1-2) + 25R_Y(0) = 1 - \frac{10e^{-4}}{2} + 25 = 26 - 5e^{-4}. \]

4.16 A linear evolution equation with random coefficients
(a) $P_{k+1} = E[(A_kX_k + B_k)^2] = E[A_k^2X_k^2] + 2E[A_kX_k]E[B_k] + E[B_k^2] = \sigma_A^2P_k + \sigma_B^2$.
(b) Yes. Think of $n$ as the present time. The future values $X_{n+1}, X_{n+2}, \ldots$ are all functions of $X_n$ and $(A_k, B_k : k \ge n)$. But the variables $(A_k, B_k : k \ge n)$ are independent of $X_0, X_1, \ldots, X_n$. Thus, the future is conditionally independent of the past, given the present.
(c) No. For example, $X_1 - X_0 = X_1 = B_1$, and $X_2 - X_1 = A_2B_1 + B_2$, and clearly $B_1$ and $A_2B_1 + B_2$ are not independent. (Given $B_1 = b$, the conditional distribution of $A_2B_1 + B_2$ is $N(0, \sigma_A^2b^2 + \sigma_B^2)$, which depends on $b$.)
(d) Suppose $s$ and $t$ are integer times with $s < t$. Then $R_Y(s,t) = E[Y_s(A_{t-1}Y_{t-1} + B_{t-1})] = E[A_{t-1}]E[Y_sY_{t-1}] + E[Y_s]E[B_{t-1}] = 0$. Thus,
\[ R_Y(s,t) = \begin{cases} P_k & \text{if } s = t = k \\ 0 & \text{else.} \end{cases} \]
(e) The variables $Y_1, Y_2, \ldots$ are already orthogonal by part (d) (and the fact the variables have mean zero). Thus, $\widetilde{Y}_k = Y_k$ for all $k \ge 1$.

4.18 A fly on a cube
(a)-(b) See the figures. For part (a), each two-headed line represents a directed edge in each direction and all directed edges have probability 1/3.
[Figure: (a) transition diagram on the cube vertices 000, 010, 001, 110, 100, 101, 011, 111; (b) transition diagram on the states 0, 1, 2, 3 with transition probabilities 1, 2/3, and 1/3.]
(c) Let $a_i$ be the mean time for $Y$ to first reach state zero starting in state $i$. Conditioning on the first time step yields $a_1 = 1 + \frac{2}{3}a_2$, $a_2 = 1 + \frac{2}{3}a_1 + \frac{1}{3}a_3$, $a_3 = 1 + a_2$. Using the first and third of these equations to eliminate $a_1$ and $a_3$ from the second equation yields $a_2$, and then $a_1$ and $a_3$. The solution is $(a_1, a_2, a_3) = (7, 9, 10)$. Therefore, $E[\tau] = 1 + a_1 = 8$.

4.20 A random process created by interpolation
(a) [Figure: sketch of a sample path of $X_t$ versus $t$, with the times $n$ and $n+1$ marked.]
(b) For $t = n + a$, $X_t$ is the sum of two random variables, $(1-a)U_n$, which is uniformly distributed on the interval $[0, 1-a]$, and $aU_{n+1}$, which is uniformly distributed on the interval $[0, a]$. Thus, the density of $X_t$ is the convolution of the densities of these two variables.
[Figure: the uniform density on $[0, 1-a]$ convolved with the uniform density on $[0, a]$, giving a trapezoidal density on $[0, 1]$.]
(c) $C_X(t,t) = \frac{a^2 + (1-a)^2}{12}$ for $t = n + a$. Since this depends on $t$, $X$ is not WSS.
(d) $P\{\max_{0 \le t \le 10} X_t \le 0.5\} = P\{U_k \le 0.5 \text{ for } 0 \le k \le 10\} = (0.5)^{11}$.

4.22 Restoring samples
(a) Yes. The possible values of $X_k$ are $\{1, \ldots, k-1\}$. Given $X_k$, $X_{k+1}$ is equal to $X_k$ with probability $\frac{X_k}{k}$ and is equal to $X_k + 1$ with probability $1 - \frac{X_k}{k}$. Another way to say this is that the one-step transition probabilities for the transition from $X_k$ to $X_{k+1}$ are given by
\[ p_{ij} = \begin{cases} \frac{i}{k} & \text{for } j = i \\ 1 - \frac{i}{k} & \text{for } j = i + 1 \\ 0 & \text{else} \end{cases} \]
(b) $E[X_{k+1} \mid X_k] = X_k\left(\frac{X_k}{k}\right) + (X_k + 1)\left(1 - \frac{X_k}{k}\right) = X_k + 1 - \frac{X_k}{k}$.
(c) The Markov property of $X$, the information equivalence of $X_k$ and $M_k$, and part (b) imply that $E[M_{k+1} \mid M_2, \ldots, M_k] = E[M_{k+1} \mid M_k] = \frac{1}{k+1}\left(X_k + 1 - \frac{X_k}{k}\right) \ne M_k$ in general, so that $(M_k)$ does not form a martingale sequence.
(d) Using the transition probabilities mentioned in part (a) again yields (with some tedious algebra steps not shown)
\[ \begin{aligned} E[D_{k+1}^2 \mid X_k] &= \left(\frac{X_k}{k+1} - \frac{1}{2}\right)^2\left(\frac{X_k}{k}\right) + \left(\frac{X_k + 1}{k+1} - \frac{1}{2}\right)^2\left(\frac{k - X_k}{k}\right) \\ &= \frac{1}{4k(k+1)^2}\left\{(4k-8)X_k^2 - (4k-8)kX_k + k(k-1)^2\right\} \\ &= \frac{1}{(k+1)^2}\left\{k(k-2)D_k^2 + \frac{1}{4}\right\} \end{aligned} \]
(e) Since, by the tower property of conditional expectations, $v_{k+1} = E[D_{k+1}^2] = E[E[D_{k+1}^2 \mid X_k]]$, taking the expectation on each side of the equation found in part (d) yields
\[ v_{k+1} = \frac{1}{(k+1)^2}\left\{k(k-2)v_k + \frac{1}{4}\right\} \]
and the initial condition $v_2 = 0$ holds. The desired inequality, $v_k \le \frac{1}{4k}$, is thus true for $k = 2$. For the purpose of proof by induction, suppose that $v_k \le \frac{1}{4k}$ for some $k \ge 2$. Then,
\[ v_{k+1} \le \frac{1}{(k+1)^2}\left\{k(k-2)\frac{1}{4k} + \frac{1}{4}\right\} = \frac{1}{4(k+1)^2}\{k - 2 + 1\} \le \frac{1}{4(k+1)}. \]
So the desired inequality is true for $k+1$. Therefore, by proof by induction, $v_k \le \frac{1}{4k}$ for all $k$. Hence, $v_k \to 0$ as $k \to \infty$. By definition, this means that $M_k \xrightarrow{m.s.} \frac{1}{2}$ as $k \to \infty$. (We could also note that, since $M_k$ is bounded, the convergence also holds in probability, and also it holds in distribution.)

4.24 An M/M/1/B queueing system
(a)
\[ Q = \begin{pmatrix} -\lambda & \lambda & 0 & 0 & 0 \\ 1 & -(1+\lambda) & \lambda & 0 & 0 \\ 0 & 1 & -(1+\lambda) & \lambda & 0 \\ 0 & 0 & 1 & -(1+\lambda) & \lambda \\ 0 & 0 & 0 & 1 & -1 \end{pmatrix}. \]
(b) The equilibrium vector $\pi = (\pi_0, \pi_1, \ldots, \pi_B)$ solves $\pi Q = 0$. Thus, $\lambda\pi_0 = \pi_1$. Also, $\lambda\pi_0 - (1+\lambda)\pi_1 + \pi_2 = 0$, which with the first equation yields $\lambda\pi_1 = \pi_2$. Continuing this way yields that $\pi_n = \lambda\pi_{n-1}$ for $1 \le n \le B$. Thus, $\pi_n = \lambda^n\pi_0$. Since the probabilities must sum to one, $\pi_n = \lambda^n/(1 + \lambda + \cdots + \lambda^B)$.

4.26 Identification of special properties of two discrete-time processes (version 2)
(a) (yes, yes, no). The process is Markov by its description. Think of a time $k$ as the present time. Given the number of cells alive at the present time $k$ (i.e. given $X_k$) the future evolution does not depend on the past.
To check for the martingale property in discrete time, it suffices to check that $E[X_{k+1} \mid X_1, \ldots, X_k] = X_k$. But this equality is true because for each cell alive at time $k$, the expected number of cells alive at time $k+1$ is one ($= 0.5 \cdot 0 + 0.5 \cdot 2$). The process does not have independent increments, because, for example, $P[X_2 - X_1 = 0 \mid X_1 - X_0 = -1] = 1$ and $P[X_2 - X_1 = 0 \mid X_1 - X_0 = 1] = 1/2$. So $X_2 - X_1$ is not independent of $X_1 - X_0$.
(b) (yes, yes, no). Let $k$ be the present time. Given $Y_k$, the future values are all determined by $Y_k, U_{k+1}, U_{k+2}, \ldots$. Since $U_{k+1}, U_{k+2}, \ldots$ is independent of $Y_0, \ldots, Y_k$, the future of $Y$ is conditionally independent of the past, given the present value $Y_k$. So $Y$ is Markov. The process $Y$ is a martingale because
\[ E[Y_{k+1} \mid Y_1, \ldots, Y_k] = E[U_{k+1}Y_k \mid Y_1, \ldots, Y_k] = Y_kE[U_{k+1} \mid Y_1, \ldots, Y_k] = Y_kE[U_{k+1}] = Y_k. \]
The process $Y$ does not have independent increments, because, for example, $Y_1 - Y_0 = U_1 - 1$ is clearly not independent of $Y_2 - Y_1 = U_1(U_2 - 1)$. (To argue this further we could note that the conditional density of $Y_2 - Y_1$ given $Y_1 - Y_0 = y - 1$ is the uniform distribution over the interval $[-y, y]$, which depends on $y$.)

4.28 Identification of special properties of two continuous-time processes (version 2)
(a) (yes, no, no) $Z$ is Markov because $W$ is Markov and the mapping from $W_t$ to $Z_t$ is invertible. So $W_t$ and $Z_t$ have the same information. To see if $W^3$ is a martingale we suppose $s \le t$ and use the independent increment property of $W$ to get:
\[ E[W_t^3 \mid W_u, 0 \le u \le s] = E[W_t^3 \mid W_s] = E[(W_t - W_s + W_s)^3 \mid W_s] = 3E[(W_t - W_s)^2]W_s + W_s^3 = 3(t-s)W_s + W_s^3 \ne W_s^3. \]
Therefore, $W^3$ is not a martingale. If the increments of $Z = W^3$ were independent, then since $Z_s$ is the increment $Z_s - Z_0$, it would have to be that $E[Z_t - Z_s \mid W_s] = E[(W_t - W_s + W_s)^3 \mid W_s] - W_s^3$ doesn't depend on $W_s$. But it does. So the increments are not independent.
(b) (no, no, no) (i) $R$ is not Markov because knowing $R_t$ for a fixed $t$ doesn't quite determine $\Theta$; it only narrows $\Theta$ down to one of two values. For one of these values $R$ has a positive derivative at $t$, and for the other $R$ has a negative derivative at $t$. If the past of $R$ just before $t$ were also known, then $\Theta$ could be completely determined, which would give more information about the future of $R$. So $R$ is not Markov. (ii) $R$ is not a martingale. For example, observing $R$ on a finite interval totally determines $R$. So $E[R_t \mid (R_u, 0 \le u \le s)] = R_t$, and if $s - t$ is not an integer, $R_s \ne R_t$. (iii) $R$ does not have independent increments. For example, the increments $R(0.5) - R(0)$ and $R(1.5) - R(1)$ are identical random variables, not independent random variables.

4.30 Moving balls
(a) The states of the "relative-position process" can be taken to be 111, 12, and 21. The state 111 means that the balls occupy three consecutive positions, the state 12 means that one ball is in the leftmost occupied position and the other two balls are one position to the right of it, and the state 21 means there are two balls in the leftmost occupied position and one ball one position to the right of them. With the states in the order 111, 12, 21, the one-step transition probability matrix is given by
\[ P = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \end{pmatrix}. \]
(b) The equilibrium distribution $\pi$ of the process is the probability vector satisfying $\pi = \pi P$, from which we find $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. That is, all three states are equally likely in equilibrium.
(c) Over a long period of time, we expect the process to be in each of the states about a third of the time. After each visit to states 111 or 12, the leftmost position of the configuration advances one position to the right. After a visit to state 21, the next state will be 12, and the leftmost position of the configuration does not advance. Thus, after 2/3 of the slots there will be an advance. So the long-term speed of the balls is 2/3.
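The equilibrium distribution and the speed computed in parts (b) and (c) of Problem 4.30 can be verified mechanically. The following is a small sketch under the state ordering 111, 12, 21 used above; the names `step`, `iterate`, and `advance` are illustrative and not from the notes.

```python
from fractions import Fraction

# One-step transition matrix with the states ordered 111, 12, 21.
P = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(0), Fraction(0), Fraction(1)],
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
]


def step(dist, matrix):
    """Return the row vector dist * matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def iterate(dist, matrix, steps):
    """Apply the transition matrix repeatedly, starting from dist."""
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


# Candidate equilibrium distribution from part (b).
pi = [Fraction(1, 3)] * 3

# The leftmost position advances after a visit to 111 or 12, but not after 21.
advance = [1, 1, 0]
speed = sum(p * a for p, a in zip(pi, advance))

if __name__ == "__main__":
    print("pi P == pi:", step(pi, P) == pi)
    print("speed:", speed)
    # Starting in state 111, the distribution approaches pi.
    print([float(x) for x in iterate([Fraction(1), Fraction(0), Fraction(0)], P, 40)])
```

Exact rational arithmetic is used so that the fixed-point check `step(pi, P) == pi` is an equality test rather than a tolerance comparison.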
Another approach is to compute the mean distance the moved ball travels in each slot, and divide by three.
(d) The same states can be used to track the relative positions of the balls as in discrete time. The generator matrix is given by
\[ Q = \begin{pmatrix} -0.5 & 0.5 & 0 \\ 0 & -1 & 1 \\ 0.5 & 0.5 & -1 \end{pmatrix}. \]
(Note that if the state is 111 and if the leftmost ball is moved to the rightmost position, the state of the relative-position process is 111 the entire time. That is, the relative-position process misses such jumps in the actual configuration process.) The equilibrium distribution can be determined by solving the equation $\pi Q = 0$, and the solution is found to be $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ as before. When the relative-position process is in states 111 or 12, the leftmost position of the actual configuration advances one position to the right at rate one, while when the relative-position process is in state 21, the leftmost position of the actual configuration cannot directly move right. The long-term average speed is thus 2/3, as in the discrete-time case.

4.32 Mean hitting time for a continuous-time, discrete-space Markov process
\[ Q = \begin{pmatrix} -1 & 1 & 0 \\ 10 & -11 & 1 \\ 0 & 5 & -5 \end{pmatrix} \qquad \pi = \left(\frac{50}{56}, \frac{5}{56}, \frac{1}{56}\right) \]
Consider $X_h$ to get
\[ \begin{aligned} a_1 &= h + (1-h)a_1 + ha_2 + o(h) \\ a_2 &= h + 10h\,a_1 + (1-11h)a_2 + o(h) \end{aligned} \]
or equivalently $1 - a_1 + a_2 + \frac{o(h)}{h} = 0$ and $1 + 10a_1 - 11a_2 + \frac{o(h)}{h} = 0$. Let $h \to 0$ to get $1 - a_1 + a_2 = 0$ and $1 + 10a_1 - 11a_2 = 0$, or $a_1 = 12$ and $a_2 = 11$.

4.34 Poisson splitting
This is basically the previous problem in reverse. This solution is based directly on the definition of a Poisson process, but there are other valid approaches. Let $X$ be a Poisson random variable with mean $\lambda$, and let each of $X$ individuals be independently assigned a type, with type $i$ having probability $p_i$, for some probability distribution $p_1, \ldots, p_K$. Let $X_i$ denote the number assigned type $i$. Then,
$P(X_1 = i_1, X_2 = i_2, \ldots, X_K = i_K) = P(X = i_1 + \cdots + i_K)\,\frac{(i_1 + \cdots + i_K)!}{i_1!\,i_2!\cdots i_K!}$
$p_1^{i_1}\cdots p_K^{i_K} = \prod_{j=1}^{K} \frac{e^{-\lambda_j}\lambda_j^{i_j}}{i_j!}$
where $\lambda_i = \lambda p_i$. Thus, independent splitting of a Poisson number of individuals yields that the number of each type $i$ is Poisson, with mean $\lambda_i = \lambda p_i$, and they are independent of each other.
Now suppose that $N$ is a rate $\lambda$ Poisson process, and that $N_i$ is the process of type $i$ points, given independent splitting of $N$ with split distribution $p_1, \ldots, p_K$. By the definition of a Poisson process, the following random variables are independent, with the $i$th having the $\mathrm{Poi}(\lambda(t_{i+1} - t_i))$ distribution:
\[ N(t_1) - N(t_0) \qquad N(t_2) - N(t_1) \qquad \cdots \qquad N(t_p) - N(t_{p-1}) \tag{12.5} \]
Suppose each column of the following array is obtained by independent splitting of the corresponding variable in (12.5).
\[ \begin{array}{cccc} N_1(t_1) - N_1(t_0) & N_1(t_2) - N_1(t_1) & \cdots & N_1(t_p) - N_1(t_{p-1}) \\ N_2(t_1) - N_2(t_0) & N_2(t_2) - N_2(t_1) & \cdots & N_2(t_p) - N_2(t_{p-1}) \\ \vdots & \vdots & & \vdots \\ N_K(t_1) - N_K(t_0) & N_K(t_2) - N_K(t_1) & \cdots & N_K(t_p) - N_K(t_{p-1}) \end{array} \tag{12.6} \]
Then by the splitting property of Poisson random variables described above, we get that all elements of the array (12.6) are independent, with the appropriate means. By definition, the $i$th process $N_i$ is a rate $\lambda p_i$ Poisson process for each $i$, and because of the independence of the rows of the array, the $K$ processes $N_1, \ldots, N_K$ are mutually independent.

4.36 Some orthogonal martingales based on Brownian motion
Throughout the solution of this problem, let $0 < s < t$, and let $Y = W_t - W_s$. Note that $Y$ is independent of $\mathcal{F}_s$ and it has the $N(0, t-s)$ distribution.
(a) $E[M_t \mid \mathcal{F}_s] = M_sE\left[\frac{M_t}{M_s} \mid \mathcal{F}_s\right]$. Now $\frac{M_t}{M_s} = \exp\left(\theta Y - \frac{\theta^2(t-s)}{2}\right)$. Therefore, $E\left[\frac{M_t}{M_s} \mid \mathcal{F}_s\right] = E\left[\frac{M_t}{M_s}\right] = 1$. Thus $E[M_t \mid \mathcal{F}_s] = M_s$, so by the hint, $M$ is a martingale.
(b) $W_t^2 - t = (W_s + Y)^2 - s - (t-s) = W_s^2 - s + 2W_sY + Y^2 - (t-s)$, but $E[2W_sY \mid \mathcal{F}_s] = 2W_sE[Y \mid \mathcal{F}_s] = 2W_sE[Y] = 0$, and $E[Y^2 - (t-s) \mid \mathcal{F}_s] = E[Y^2 - (t-s)] = 0$.
It follows that $E[2W_sY + Y^2 - (t-s) \mid \mathcal{F}_s] = 0$, so the martingale property follows from the hint. Similarly,
\[ W_t^3 - 3tW_t = (Y + W_s)^3 - 3(s + t - s)(Y + W_s) = W_s^3 - 3sW_s + 3W_s^2Y + 3W_s(Y^2 - (t-s)) + Y^3 - 3tY. \]
Because $Y$ is independent of $\mathcal{F}_s$ and because $E[Y] = E[Y^2 - (t-s)] = E[Y^3] = 0$, it follows that $E[3W_s^2Y + 3W_s(Y^2 - (t-s)) + Y^3 - 3tY \mid \mathcal{F}_s] = 0$, so the martingale property follows from the hint.
(c) Fix distinct nonnegative integers $m$ and $n$. Then
\[ \begin{aligned} E[M_n(s)M_m(t)] &= E[E[M_n(s)M_m(t) \mid \mathcal{F}_s]] && \text{property of cond. expectation} \\ &= E[M_n(s)E[M_m(t) \mid \mathcal{F}_s]] && \text{property of cond. expectation} \\ &= E[M_n(s)M_m(s)] && \text{martingale property} \\ &= 0 && \text{orthogonality of variables at a fixed time} \end{aligned} \]

5.2 A variance estimation problem with Poisson observation
(a)
\[ P\{N = n\} = E[P[N = n \mid X]] = E\left[\frac{(X^2)^n e^{-X^2}}{n!}\right] = \int_{-\infty}^{\infty} \frac{x^{2n}e^{-x^2}}{n!}\,\frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\,dx \]
(b) To arrive at a simple answer, we could set the derivative of $P\{N = n\}$ with respect to $\sigma^2$ equal to zero either before or after simplifying. Here we simplify first, using the fact that if $X$ is a $N(0, \widetilde{\sigma}^2)$ random variable, then $E[X^{2n}] = \frac{\widetilde{\sigma}^{2n}(2n)!}{n!\,2^n}$. Let $\widetilde{\sigma}^2$ be such that $\frac{1}{2\widetilde{\sigma}^2} = 1 + \frac{1}{2\sigma^2}$, or equivalently, $\widetilde{\sigma}^2 = \frac{\sigma^2}{1 + 2\sigma^2}$. Then the above integral can be written as follows:
\[ P\{N = n\} = \frac{\widetilde{\sigma}}{\sigma}\int_{-\infty}^{\infty} \frac{x^{2n}}{n!}\,\frac{e^{-\frac{x^2}{2\widetilde{\sigma}^2}}}{\sqrt{2\pi\widetilde{\sigma}^2}}\,dx = \frac{c_1\widetilde{\sigma}^{2n+1}}{\sigma} = \frac{c_1\sigma^{2n}}{(1 + 2\sigma^2)^{\frac{2n+1}{2}}}, \]
where the constant $c_1$ depends on $n$ but not on $\sigma^2$. Taking the logarithm of $P\{N = n\}$ and calculating the derivative with respect to $\sigma^2$, we find that $P\{N = n\}$ is maximized at $\sigma^2 = n$. That is, $\widehat{\sigma}^2_{ML}(n) = n$.

5.4 Transformation of estimators and estimators of transformations
(a) Yes, because the transformation is invertible.
(b) Yes, because the transformation is invertible.
(c) Yes, because the transformation is linear, the pdf of $3 + 5\Theta$ is a scaled version of the pdf of $\Theta$.
(d) No, because the transformation is not linear.
(e) Yes, because the MMSE estimator is given by the conditional expectation, which is linear.
That is, $3 + 5E[\Theta \mid Y] = E[3 + 5\Theta \mid Y]$.
(f) No. Typically $E[\Theta^3 \mid Y] \ne E[\Theta \mid Y]^3$.

5.6 Finding a most likely path
Finding the path $z$ to maximize the posterior probability given the sequence 021201 is the same as maximizing $p_{cd}(y, z \mid \theta)$. Due to the form of the parameter $\theta = (\pi, A, B)$, for any path $z = (z_1, \ldots, z_6)$, $p_{cd}(y, z \mid \theta)$ has the form $c^6a^i$ for some $i \ge 0$. Similarly, the variable $\delta_j(t)$ has the form $c^ta^i$ for some $i \ge 0$. Since $a < 1$, larger values for $p_{cd}(y, z \mid \theta)$ and $\delta_j(t)$ correspond to smaller values of $i$. Rather than keeping track of products, such as $a^ia^j$, we keep track of the exponents of the products, which for $a^ia^j$ would be $i + j$. Thus, the problem at hand is equivalent to finding a path from left to right in the trellis indicated in Figure 12.3(a) with minimum weight, where the weight of a path is the sum of all the numbers indicated on the vertices and edges of the graph.
[Figure 12.3: Trellis diagram for finding a MAP path. Panels (a) and (b) show the states versus the observations 0, 2, 1, 2, 0, 1 for $t = 1, \ldots, 6$.]
Figure 12.3(b) shows the result of running the Viterbi algorithm. The value of $\delta_j(t)$ has the form $c^ta^i$, where $i$ is indicated by the numbers in boxes. Of the two paths reaching the final states of the trellis, the upper one, namely the path 000000, has the smaller exponent, 18, and therefore, the larger probability, namely $c^6a^{18}$. Therefore, 000000 is the MAP path.

5.8 Specialization of Baum-Welch algorithm for no hidden data
(a) Suppose the sequence $y = (y_1, \ldots, y_T)$ is observed. If $\theta^{(0)} = \theta = (\pi, A, B)$ is such that $B$ is the identity matrix, and all entries of $\pi$ and $A$ are nonzero, then directly by the definitions (without using the $\alpha$'s and $\beta$'s): $\gamma_i(t) \doteq P[Z_t = i \mid Y_1 = y_1, \ldots$
$, Y_T = y_T, \theta] = I_{\{y_t = i\}}$ and
\[ \xi_{ij}(t) \doteq P[Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta] = I_{\{(y_t, y_{t+1}) = (i, j)\}} \]
Thus, (5.27) - (5.29) for the first iteration, $t = 0$, become
\[ \begin{aligned} \pi_i^{(1)} &= I_{\{y_1 = i\}} \quad \text{i.e. the probability vector for } \mathcal{S} \text{ with all mass on } y_1 \\ a_{i,j}^{(1)} &= \frac{\text{number of } (i,j) \text{ transitions observed}}{\text{number of visits to } i \text{ up to time } T-1} \\ b_{il}^{(1)} &= \frac{\text{number of times the state is } i \text{ and the observation is } l}{\text{number of times the state is } i} \end{aligned} \]
It is assumed that $B$ is the identity matrix, so that each time the state is $i$ the observation should also be $i$. Thus, $b_{il}^{(1)} = I_{\{i = l\}}$ for any state $i$ that is visited. That is consistent with the assumption that $B$ is the identity matrix. (Alternatively, since $B$ is fixed to be the identity matrix, we could just work with estimating $\pi$ and $A$, and simply not consider $B$ as part of the parameter to be estimated.) The next iteration will give the same values of $\pi$ and $A$. Thus, the Baum-Welch algorithm converges in one iteration to the final value $\theta^{(1)} = (\pi^{(1)}, A^{(1)}, B^{(1)})$ already described. Note that, by Lemma 5.1.7, $\theta^{(1)}$ is the ML estimate.
(b) In view of part (a), the ML estimates are $\pi = (1, 0)$ and $A = \begin{pmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix}$. This estimator of $A$ results from the fact that, of the first 21 times, the state was zero 12 times, and 8 of those 12 times the next state was a zero. So $a_{00} = 8/12 = 2/3$ is the ML estimate. Similarly, the ML estimate of $a_{11}$ is 6/9, which simplifies to 2/3.

5.10 Baum-Welch saddlepoint
It turns out that $\pi^{(k)} = \pi^{(0)}$ and $A^{(k)} = A^{(0)}$, for each $k \ge 0$. Also, $B^{(k)} = B^{(1)}$ for each $k \ge 1$, where $B^{(1)}$ is the matrix with identical rows, such that each row of $B^{(1)}$ is the empirical distribution of the observation sequence. For example, if the observations are binary valued, and if there are $T = 100$ observations, of which 37 observations are zero and 63 are 1, then each row of $B^{(1)}$ would be $(0.37, 0.63)$.
Thus, the EM algorithm converges in one iteration, and unless $\theta^{(0)}$ happens to be a local maximum or local minimum, the EM algorithm converges to an inflection point of the likelihood function.
One intuitive explanation for this assertion is that since all the rows of $B^{(0)}$ are the same, the observation sequence is initially believed to be independent of the state sequence, and the state process is initially believed to be stationary. Hence, even if there is, for example, notable time variation in the observed data sequence, there is no way to change beliefs in a particular direction in order to increase the likelihood. In real computer experiments, the algorithm may still eventually reach a near maximum likelihood estimate, due to round-off errors in the computations which allow the algorithm to break away from the inflection point.
The assertion can be proved by use of the update equations for the Baum-Welch algorithm. It is enough to prove the assertion for the first iteration only, for then it follows for all iterations by induction.
Since the rows of $B^{(0)}$ are all the same, we write $b_l$ to denote $b_{il}^{(0)}$ for an arbitrary value of $i$. By induction on $t$, we find $\alpha_i(t) = b_{y_1}\cdots b_{y_t}\pi_i^{(0)}$ and $\beta_j(t) = b_{y_{t+1}}\cdots b_{y_T}$. In particular, $\beta_j(t)$ does not depend on $j$. So the vector $(\alpha_i\beta_i : i \in \mathcal{S})$ is proportional to $\pi^{(0)}$, and therefore $\gamma_i(t) = \pi_i^{(0)}$. Similarly, $\xi_{i,j}(t) = P[Z_t = i, Z_{t+1} = j \mid y, \theta^{(0)}] = \pi_i^{(0)}a_{i,j}^{(0)}$. By (5.27), $\pi^{(1)} = \pi^{(0)}$, and by (5.28), $A^{(1)} = A^{(0)}$. Finally, (5.29) gives
\[ b_{i,l}^{(1)} = \frac{\sum_{t=1}^{T} \pi_i I_{\{y_t = l\}}}{T\pi_i} = \frac{\text{number of times } l \text{ is observed}}{T}. \]

5.12 Constraining the Baum-Welch algorithm
A quite simple way to deal with this problem is to take the initial parameter $\theta^{(0)} = (\pi, A, B)$ in the Baum-Welch algorithm to be such that $a_{ij} > 0$ if and only if $\bar{a}_{ij} = 1$ and $b_{il} > 0$ if and only if $\bar{b}_{il} = 1$.
(These constraints are added in addition to the usual constraints that $\pi$, $A$, and $B$ have the appropriate dimensions, with $\pi$ and each row of $A$ and $B$ being probability vectors.) After all, it makes sense for the initial parameter value to respect the constraint. And if it does, then the same constraint will be satisfied after each iteration, and no changes are needed to the algorithm itself.

6.2 A two station pipeline in continuous time
(a) $\mathcal{S} = \{00, 01, 10, 11\}$
(b) [Figure: transition rate diagram on the states 00, 01, 10, 11 with rates $\lambda$, $\mu_1$, and $\mu_2$.]
(c)
\[ Q = \begin{pmatrix} -\lambda & 0 & \lambda & 0 \\ \mu_2 & -\mu_2 - \lambda & 0 & \lambda \\ 0 & \mu_1 & -\mu_1 & 0 \\ 0 & 0 & \mu_2 & -\mu_2 \end{pmatrix}. \]
(d) $\eta = (\pi_{00} + \pi_{01})\lambda = (\pi_{01} + \pi_{11})\mu_2 = \pi_{10}\mu_1$. If $\lambda = \mu_1 = \mu_2 = 1.0$ then $\pi = (0.2, 0.2, 0.4, 0.2)$ and $\eta = 0.4$.
(e) Let $\tau = \min\{t \ge 0 : X(t) = 00\}$, and define $h_s = E[\tau \mid X(0) = s]$, for $s \in \mathcal{S}$. We wish to find $h_{11}$.
\[ \begin{aligned} h_{00} &= 0 \\ h_{01} &= \frac{1}{\mu_2 + \lambda} + \frac{\mu_2h_{00}}{\mu_2 + \lambda} + \frac{\lambda h_{11}}{\mu_2 + \lambda} \\ h_{10} &= \frac{1}{\mu_1} + h_{01} \\ h_{11} &= \frac{1}{\mu_2} + h_{10} \end{aligned} \]
For $\lambda = \mu_1 = \mu_2 = 1.0$ this yields $(h_{00}, h_{01}, h_{10}, h_{11})^T = (0, 3, 4, 5)^T$. Thus, $h_{11} = 5$ is the required answer.

6.4 A simple Poisson process calculation
Suppose $0 < s < t$ and $0 \le i \le k$.
\[ \begin{aligned} P[N(s) = i \mid N(t) = k] &= \frac{P[N(s) = i, N(t) = k]}{P[N(t) = k]} \\ &= \left(\frac{e^{-\lambda s}(\lambda s)^i}{i!}\right)\left(\frac{e^{-\lambda(t-s)}(\lambda(t-s))^{k-i}}{(k-i)!}\right)\left(\frac{e^{-\lambda t}(\lambda t)^k}{k!}\right)^{-1} \\ &= \binom{k}{i}\left(\frac{s}{t}\right)^i\left(\frac{t-s}{t}\right)^{k-i} \end{aligned} \]
That is, given $N(t) = k$, the conditional distribution of $N(s)$ is binomial. This could have been deduced with no calculation, using the fact that given $N(t) = k$, the locations of the $k$ points are uniformly and independently distributed on the interval $[0, t]$.

6.6 A mean hitting time problem
(a) [Figure: transition rate diagram on the states 0, 1, 2.]
$\pi Q = 0$ implies $\pi = (\frac{2}{7}, \frac{2}{7}, \frac{3}{7})$.
(b) Clearly $a_1 = 0$. Condition on the first step. The initial holding time in state $i$ has mean $-\frac{1}{q_{ii}}$ and the next state is $j$ with probability $p_{ij}^J = \frac{-q_{ij}}{q_{ii}}$. Thus
\[ \begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} -\frac{1}{q_{00}} \\ -\frac{1}{q_{22}} \end{pmatrix} + \begin{pmatrix} 0 & p_{02}^J \\ p_{20}^J & 0 \end{pmatrix}\begin{pmatrix} a_0 \\ a_2 \end{pmatrix}. \]
Solving yields $\begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1.5 \end{pmatrix}$.
(c) Clearly $\alpha_2(t) = 0$ for all $t$.
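\[ \begin{aligned} \alpha_0(t+h) &= \alpha_0(t)(1 + q_{00}h) + \alpha_1(t)q_{10}h + o(h) \\ \alpha_1(t+h) &= \alpha_0(t)q_{01}h + \alpha_1(t)(1 + q_{11}h) + o(h) \end{aligned} \]
Subtract $\alpha_i(t)$ from each side, divide by $h$, and let $h \to 0$ to yield
\[ \left(\frac{\partial\alpha_0}{\partial t}, \frac{\partial\alpha_1}{\partial t}\right) = (\alpha_0, \alpha_1)\begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix} \]
with the initial condition $(\alpha_0(0), \alpha_1(0)) = (1, 0)$. (Note: the matrix involved here is the $Q$ matrix with the row and column for state 2 removed.)
(d) Similarly,
\[ \begin{aligned} \beta_0(t-h) &= (1 + q_{00}h)\beta_0(t) + q_{01}h\beta_1(t) + o(h) \\ \beta_1(t-h) &= q_{10}h\beta_0(t) + (1 + q_{11}h)\beta_1(t) + o(h) \end{aligned} \]
Subtract $\beta_i(t)$'s, divide by $h$ and let $h \to 0$ to get:
\[ \begin{pmatrix} -\frac{\partial\beta_0}{\partial t} \\ -\frac{\partial\beta_1}{\partial t} \end{pmatrix} = \begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \quad \text{with} \quad \begin{pmatrix} \beta_0(t_f) \\ \beta_1(t_f) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \]

6.8 Markov model for a link with resets
(a) Let $\mathcal{S} = \{0, 1, 2, 3\}$, where the state is the number of packets passed since the last reset.
[Figure: transition rate diagram on the states 0, 1, 2, 3 with arrival rate $\lambda$ and reset rate $\mu$.]
(b) By the PASTA property, the dropping probability is $\pi_3$. We can find the equilibrium distribution $\pi$ by solving the equation $\pi Q = 0$. The balance equation for state 0 is $\lambda\pi_0 = \mu(1 - \pi_0)$ so that $\pi_0 = \frac{\mu}{\lambda+\mu}$. The balance equation for state $i \in \{1, 2\}$ is $\lambda\pi_{i-1} = (\lambda + \mu)\pi_i$, so that $\pi_1 = \pi_0\left(\frac{\lambda}{\lambda+\mu}\right)$ and $\pi_2 = \pi_0\left(\frac{\lambda}{\lambda+\mu}\right)^2$. Finally, $\lambda\pi_2 = \mu\pi_3$ so that $\pi_3 = \pi_0\left(\frac{\lambda}{\lambda+\mu}\right)^2\frac{\lambda}{\mu} = \frac{\lambda^3}{(\lambda+\mu)^3}$. The dropping probability is $\pi_3 = \frac{\lambda^3}{(\lambda+\mu)^3}$. (This formula for $\pi_3$ can be deduced with virtually no calculation from the properties of merged Poisson processes. Fix a time $t$. Each event is a packet arrival with probability $\frac{\lambda}{\lambda+\mu}$ and is a reset otherwise. The types of different events are independent. Finally, $\pi_3(t)$ is the probability that the last three events before time $t$ were arrivals. The formula follows.)

6.10 A queue with decreasing service rate
(a) [Figure: transition rate diagram with arrival rate $\lambda$ in every state, service rate $\mu$ in states $1, \ldots, K$, and service rate $\mu/2$ in states $K+1, K+2, \ldots$; and a sketch of $X(t)$ versus $t$ with the level $K$ marked.]
(b) $S_2 = \sum_{k=0}^{\infty} \left(\frac{\mu}{2\lambda}\right)^k 2^{k \wedge K}$, where $k \wedge K = \min\{k, K\}$. Thus, if $\lambda < \frac{\mu}{2}$ then $S_2 = +\infty$ and the process is recurrent.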
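$S_1 = \sum_{k=0}^{\infty} \left(\frac{2\lambda}{\mu}\right)^k 2^{-k \wedge K}$, so if $\lambda < \frac{\mu}{2}$ then $S_1 < +\infty$ and the process is positive recurrent. In this case, $\pi_k = \left(\frac{2\lambda}{\mu}\right)^k 2^{-k \wedge K}\pi_0$, where
\[ \pi_0 = \frac{1}{S_1} = \left[\frac{1 - (\lambda/\mu)^K}{1 - (\lambda/\mu)} + \frac{(\lambda/\mu)^K}{1 - (2\lambda/\mu)}\right]^{-1}. \]
(c) If $\lambda = \frac{2\mu}{3}$, the queue appears to be stable until it fluctuates above $K$. Eventually the queue length will grow to infinity at rate $\lambda - \frac{\mu}{2} = \frac{\mu}{6}$. See figure above.

6.12 An M/M/1 queue with impatient customers
(a) [Figure: transition rate diagram on the states 0, 1, 2, 3, 4, ... with arrival rate $\lambda$ and departure rates $\mu$, $\mu+\alpha$, $\mu+2\alpha$, $\mu+3\alpha$, $\mu+4\alpha$, ...]
(b) The process is positive recurrent for all $\lambda, \mu$ if $\alpha > 0$, and $p_k = \frac{c\lambda^k}{\mu(\mu+\alpha)\cdots(\mu+(k-1)\alpha)}$ where $c$ is chosen so that the $p_k$'s sum to one.
(c) If $\alpha = \mu$, $p_k = \frac{c\lambda^k}{k!\mu^k} = \frac{c\rho^k}{k!}$. Therefore, $(p_k : k \ge 0)$ is the Poisson distribution with mean $\rho$. Furthermore, $p_D$ is the mean departure rate by defecting customers, divided by the mean arrival rate $\lambda$. Thus,
\[ p_D = \frac{1}{\lambda}\sum_{k=1}^{\infty} p_k(k-1)\alpha = \frac{\rho - 1 + e^{-\rho}}{\rho} \to \begin{cases} 1 & \text{as } \rho \to \infty \\ 0 & \text{as } \rho \to 0 \end{cases} \]
where l'Hôpital's rule can be used to find the limit as $\rho \to 0$.

6.14 A queue with blocking
(a) [Figure: transition rate diagram on the states 0, 1, 2, 3, 4, 5 with arrival rate $\lambda$ and service rate $\mu$.]
$\pi_k = \frac{\rho^k}{1 + \rho + \rho^2 + \rho^3 + \rho^4 + \rho^5} = \frac{\rho^k(1-\rho)}{1 - \rho^6}$ for $0 \le k \le 5$.
(b) $p_B = \pi_5$ by the PASTA property.
(c) $W = N_W/(\lambda(1 - p_B))$ where $N_W = \sum_{k=1}^{5} (k-1)\pi_k$. Alternatively, $W = N/(\lambda(1 - p_B)) - \frac{1}{\mu}$ (i.e. $W$ is equal to the mean time in system minus the mean time in service).
(d) $\pi_0 = \frac{1}{\lambda(\text{mean cycle time for visits to state zero})} = \frac{1}{\lambda(1/\lambda + \text{mean busy period duration})}$. Therefore, the mean busy period duration is given by $\frac{1}{\lambda}\left[\frac{1}{\pi_0} - 1\right] = \frac{\rho - \rho^6}{\lambda(1-\rho)} = \frac{1 - \rho^5}{\mu(1-\rho)}$.

6.16 On two distributions seen by customers
(a) [Figure: sketch of $N(t)$ versus $t$, with the levels $k$ and $k+1$ marked.]
As can be seen in the picture, between any two transitions from state $k$ to $k+1$ there is a transition from state $k+1$ to $k$, and vice versa. Thus, the number of transitions of one type is within one of the number of transitions of the other type. This establishes that $|D(k,t) - R(k,t)| \le 1$ for all $k$.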
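The interleaving argument of part (a) of Problem 6.16 can be illustrated on a simulated path. This is a sketch only: the path generator below is a hypothetical stand-in for a queue-length process (any path that moves by one unit at a time would do), and the counts are named generically as up-crossings ($k \to k+1$) and down-crossings ($k+1 \to k$) rather than tied to the symbols $D$ and $R$.

```python
import random


def crossing_counts(path, k):
    """Count transitions k -> k+1 (ups) and k+1 -> k (downs) along a path."""
    ups = sum(1 for a, b in zip(path, path[1:]) if (a, b) == (k, k + 1))
    downs = sum(1 for a, b in zip(path, path[1:]) if (a, b) == (k + 1, k))
    return ups, downs


def random_queue_path(steps, seed=0):
    """Hypothetical unit-step path on the nonnegative integers, started at zero."""
    rng = random.Random(seed)
    path = [0]
    for _ in range(steps):
        n = path[-1]
        # From zero the only possible move is up; otherwise move up or down.
        if n == 0 or rng.random() < 0.5:
            path.append(n + 1)
        else:
            path.append(n - 1)
    return path


if __name__ == "__main__":
    path = random_queue_path(2000, seed=1)
    for k in range(max(path)):
        ups, downs = crossing_counts(path, k)
        print(k, ups, downs, abs(ups - downs) <= 1)
```

Because up-crossings and down-crossings of a given level must alternate, the difference of the two counts never exceeds one, whatever the seed or path length.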
(b) Observe that
\[ \begin{aligned} \left|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\right| &\le \left|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\alpha_t}\right| + \left|\frac{R(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\right| \\ &\le \frac{1}{\alpha_t} + \frac{R(k,t)}{\alpha_t}\left|1 - \frac{\alpha_t}{\delta_t}\right| \\ &\le \frac{1}{\alpha_t} + \left|1 - \frac{\alpha_t}{\delta_t}\right| \to 0 \text{ as } t \to \infty \end{aligned} \]
Thus, $\frac{D(k,t)}{\alpha_t}$ and $\frac{R(k,t)}{\delta_t}$ have the same limits, if the limit of either exists.

6.18 Positive recurrence of reflected random walk with negative drift
Let $V(x) = \frac{1}{2}x^2$. Then
\[ PV(x) - V(x) = E\left[\frac{(x + B_n + L_n)^2}{2}\right] - \frac{x^2}{2} \le E\left[\frac{(x + B_n)^2}{2}\right] - \frac{x^2}{2} = x\overline{B} + \frac{\overline{B^2}}{2} \]
Therefore, the conditions of the combined Foster stability criteria and moment bound corollary apply, yielding that $X$ is positive recurrent, and $\overline{X} \le \frac{\overline{B^2}}{-2\overline{B}}$. (This bound is somewhat weaker than Kingman's moment bound, discussed later in the notes: $\overline{X} \le \frac{\mathrm{Var}(B)}{-2\overline{B}}$.)

6.20 An inadequacy of a linear potential function
Suppose $x$ is on the positive $x_2$ axis (i.e. $x_1 = 0$ and $x_2 > 0$). Then, given $X(t) = x$, during the slot, queue 1 will increase to 1 with probability $a(1 - d_1) = 0.42$, and otherwise stay at zero. Queue 2 will decrease by one with probability 0.4, and otherwise stay the same. Thus, the drift of $V$, $E[V(X(t+1)) - V(x) \mid X(t) = x]$, is equal to 0.02. Therefore, the drift is strictly positive for infinitely many states, whereas the Foster-Lyapunov condition requires that the drift be negative off of a finite set $C$. So, the linear choice for $V$ does not work for this example.

6.22 Opportunistic scheduling
(a) The left hand side of (6.37) is the arrival rate to the set of queues in $s$, and the right hand side is the probability that some queue in $s$ is eligible for service in a given time slot. The condition is necessary for the stability of the set of queues in $s$.
(b) Fix $\epsilon > 0$ so that for all $s \subset E$ with $s \ne \emptyset$,
\[ \sum_{i \in s} (a_i + \epsilon) \le \sum_{B : B \cap s \ne \emptyset} w(B) \]
Consider the flow graph shown.
[Figure: flow graph with source node $a$, queue nodes $q_1, q_2, \ldots, q_N$, subset nodes $s_1, s_2, \ldots, s_{2^N}$, and sink node $b$; the first-stage links have capacities $a_1 + \epsilon, \ldots, a_N + \epsilon$ and the third-stage links have capacities $w(s_1), \ldots, w(s_{2^N})$.]
In addition to the source node $a$ and sink node $b$, there are two columns of nodes in the graph. The first column of nodes corresponds to the $N$ queues, and the second column of nodes corresponds to the $2^N$ subsets of $E$. There are three stages of links in the graph. The capacity of a link $(a, q_i)$ in the first stage is $a_i + \epsilon$, there is a link $(q_i, s_j)$ in the second stage if and only if $q_i \in s_j$, and each such link has capacity greater than the sum of the capacities of all the links in the first stage, and the capacity of a link $(s_k, b)$ in the third stage is $w(s_k)$.
We claim that the minimum of the capacities of all $a$-$b$ cuts is $v^* = \sum_{i=1}^{N} (a_i + \epsilon)$. Here is a proof of the claim. The $a$-$b$ cut $(\{a\} : V - \{a\})$ (here $V$ is the set of nodes in the flow network) has capacity $v^*$, so to prove the claim, it suffices to show that any other $a$-$b$ cut has capacity greater than or equal to $v^*$. Fix any $a$-$b$ cut $(A : B)$. Let $\widetilde{A} = A \cap \{q_1, \ldots, q_N\}$, or in words, $\widetilde{A}$ is the set of nodes in the first column of the graph (i.e. set of queues) that are in $A$. If $q_i \in \widetilde{A}$ and $s_j \in B$ such that $(q_i, s_j)$ is a link in the flow graph, then the capacity of $(A : B)$ is greater than or equal to the capacity of link $(q_i, s_j)$, which is greater than $v^*$, so the required inequality is proved in that case. Thus, we can suppose that $A$ contains all the nodes $s_j$ in the second column such that $s_j \cap \widetilde{A} \ne \emptyset$. Therefore,
\[ \begin{aligned} C(A : B) &\ge \sum_{i \in \{q_1, \ldots, q_N\} - \widetilde{A}} (a_i + \epsilon) + \sum_{s \subset E : s \cap \widetilde{A} \ne \emptyset} w(s) \\ &\ge \sum_{i \in \{q_1, \ldots, q_N\} - \widetilde{A}} (a_i + \epsilon) + \sum_{i \in \widetilde{A}} (a_i + \epsilon) = v^*, \end{aligned} \tag{12.7} \]
where the inequality in (12.7) follows from the choice of $\epsilon$. The claim is proved.
Therefore there is an $a$-$b$ flow $f$ which saturates all the links of the first stage of the flow graph. Let $u(i, s) = f(q_i, s)/f(s, b)$ for all $i, s$ such that $f(s, b) > 0$. That is, $u(i, s)$ is the fraction of flow on link $(s, b)$ which comes from link $(q_i, s)$.
For those $s$ such that $f(s,b)=0$, define $u(i,s)$ in some arbitrary way, respecting the requirements $u(i,s)\ge 0$, $u(i,s)=0$ if $i\notin s$, and $\sum_{i\in E}u(i,s)=I_{\{s\neq\emptyset\}}$. Then
\[
a_i+\epsilon = f(a,q_i)=\sum_s f(q_i,s)=\sum_s f(s,b)\,u(i,s)\le \sum_s w(s)\,u(i,s)=\mu_i(u),
\]
as required.

(c) Let $V(x)=\frac{1}{2}\sum_{i\in E}x_i^2$. Let $\delta(t)$ denote the identity of the queue given a potential service at time $t$, with $\delta(t)=0$ if no queue is given potential service. Then $P[\delta(t)=i\,|\,S(t)=s]=u(i,s)$. The dynamics of queue $i$ are given by $X_i(t+1)=X_i(t)+A_i(t)-R_i(\delta(t))+L_i(t)$, where $R_i(\delta)=I_{\{\delta=i\}}$. Since
\[
\sum_{i\in E}\bigl(A_i(t)-R_i(\delta(t))\bigr)^2\le \sum_{i\in E}\bigl[(A_i(t))^2+(R_i(\delta(t)))^2\bigr]\le N+\sum_{i\in E}A_i(t)^2,
\]
we have
\[
PV(x)-V(x)\le \Bigl(\sum_{i\in E}x_i\,(a_i-\mu_i(u))\Bigr)+K \tag{12.8}
\]
\[
\le -\epsilon\Bigl(\sum_{i\in E}x_i\Bigr)+K, \tag{12.9}
\]
where $K=\frac{N}{2}+\sum_{i=1}^N K_i$. Thus, under the necessary stability conditions, we have that under the vector of scheduling probabilities $u$, the system is positive recurrent, and
\[
\sum_{i\in E}\overline{X}_i\le \frac{K}{\epsilon}. \tag{12.10}
\]

(d) If $u$ could be selected as a function of the state, $x$, then the right hand side of (12.8) would be minimized by taking $u(i,s)=1$ if $i$ is the smallest index in $s$ such that $x_i=\max_{j\in s}x_j$. This suggests using the longest connected first (LCF) policy, in which the longest connected queue is served in each time slot. If $P^{LCF}$ denotes the one-step transition probability matrix for the LCF policy, then (12.8) holds for any $u$, if $P$ is replaced by $P^{LCF}$. Therefore, under the necessary condition and with $\epsilon$ as in part (b), (12.9) also holds with $P$ replaced by $P^{LCF}$, and (12.10) holds for the LCF policy.

6.24 Stability of two queues with transfers

(a) The system is positive recurrent for some $u$ if and only if $\lambda_1<\mu_1+\nu$, $\lambda_2<\mu_2$, and $\lambda_1+\lambda_2<\mu_1+\mu_2$.
(b)
\[
QV(x)=\sum_{y:\,y\neq x}q_{xy}\,(V(y)-V(x))
=\frac{\lambda_1}{2}\bigl[(x_1+1)^2-x_1^2\bigr]+\frac{\lambda_2}{2}\bigl[(x_2+1)^2-x_2^2\bigr]
+\frac{\mu_1}{2}\bigl[(x_1-1)_+^2-x_1^2\bigr]+\frac{\mu_2}{2}\bigl[(x_2-1)_+^2-x_2^2\bigr]
+\frac{u\nu I_{\{x_1\ge 1\}}}{2}\bigl[(x_1-1)^2-x_1^2+(x_2+1)^2-x_2^2\bigr]. \tag{12.11}
\]

(c) If the right hand side of (12.11) is changed by dropping the positive part symbols and dropping the factor $I_{\{x_1\ge 1\}}$, then it is not decreased, so that
\[
QV(x)\le x_1(\lambda_1-\mu_1-u\nu)+x_2(\lambda_2+u\nu-\mu_2)+K
\le -(x_1+x_2)\min\{\mu_1+u\nu-\lambda_1,\ \mu_2-\lambda_2-u\nu\}+K, \tag{12.12}
\]
where $K=\frac{\lambda_1+\lambda_2+\mu_1+\mu_2+2\nu}{2}$. To get the best bound on $\overline{X}_1+\overline{X}_2$, we select $u$ to maximize the min term in (12.12). The two terms inside the min are equal when $u=\frac{(\mu_2-\lambda_2)-(\mu_1-\lambda_1)}{2\nu}$, so the maximizer is $u=u^*$, where $u^*$ is the point in $[0,1]$ nearest to that value. For $u=u^*$, we find $QV(x)\le -\epsilon(x_1+x_2)+K$, where
\[
\epsilon=\min\Bigl\{\mu_1+\nu-\lambda_1,\ \mu_2-\lambda_2,\ \frac{\mu_1+\mu_2-\lambda_1-\lambda_2}{2}\Bigr\}.
\]
Which of the three terms is smallest in the expression for $\epsilon$ corresponds to the three cases $u^*=1$, $u^*=0$, and $0<u^*<1$, respectively. It is easy to check that this same $\epsilon$ is the largest constant such that the stability conditions (with strict inequality relaxed to less than or equal) hold with $(\lambda_1,\lambda_2)$ replaced by $(\lambda_1+\epsilon,\lambda_2+\epsilon)$.

7.2 Lack of sample path continuity of a Poisson process

(a) The sample path of $N$ is continuous over $[0,T]$ if and only if it has no jumps in the interval, equivalently, if and only if $N(T)=0$. So $P[N \text{ is continuous over the interval } [0,T]]=\exp(-\lambda T)$. Since $\{N \text{ is continuous over } [0,+\infty)\}=\cap_{n=1}^{\infty}\{N \text{ is continuous over } [0,n]\}$, it follows that
\[
P[N \text{ is continuous over } [0,+\infty)]=\lim_{n\to\infty}P[N \text{ is continuous over } [0,n]]=\lim_{n\to\infty}e^{-\lambda n}=0.
\]

(b) Since $P[N \text{ is continuous over } [0,+\infty)]\neq 1$, $N$ is not a.s. sample continuous. However, $N$ is m.s. continuous. One proof is to simply note that the correlation function, given by $R_N(s,t)=\lambda(s\wedge t)+\lambda^2 st$, is continuous. A more direct proof is to note that for fixed $t$, $E[|N_s-N_t|^2]=\lambda|s-t|+\lambda^2|s-t|^2\to 0$ as $s\to t$.
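The two facts used in Problem 7.2 can be checked numerically. The sketch below is only an illustration: the function names and the rate $\lambda=2$ are assumptions chosen for the example, not part of the notes. It expands $E[|N_s-N_t|^2]$ as $R_N(s,s)-2R_N(s,t)+R_N(t,t)$ and compares the result with the closed form quoted in part (b).

```python
import math


def poisson_correlation(s, t, lam):
    """R_N(s, t) = lam * min(s, t) + lam^2 * s * t for a rate-lam Poisson process."""
    return lam * min(s, t) + lam ** 2 * s * t


def ms_increment_from_correlation(s, t, lam):
    """E[|N_s - N_t|^2] expanded as R_N(s,s) - 2 R_N(s,t) + R_N(t,t)."""
    return (poisson_correlation(s, s, lam)
            - 2.0 * poisson_correlation(s, t, lam)
            + poisson_correlation(t, t, lam))


def ms_increment_closed_form(s, t, lam):
    """Closed form lam*|s-t| + lam^2*|s-t|^2 quoted in part (b)."""
    d = abs(s - t)
    return lam * d + (lam * d) ** 2


def prob_continuous_on(T, lam):
    """P[N has no jump in [0, T]] = exp(-lam * T), from part (a)."""
    return math.exp(-lam * T)


if __name__ == "__main__":
    lam, t = 2.0, 1.0  # illustrative values only (assumed, not from the notes)
    for s in (1.5, 1.1, 1.01, 1.001):
        print(s, ms_increment_from_correlation(s, t, lam),
              ms_increment_closed_form(s, t, lam))
    for T in (1, 5, 10):
        print(T, prob_continuous_on(T, lam))
```

As $s\to t$ the mean square increment shrinks to zero even though the probability of a jump-free path on $[0,T]$ decays to zero as $T$ grows, which is the contrast between m.s. continuity and a.s. sample continuity.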
7.4 Some statements related to the basic calculus of random processes

(a) False. $\lim_{t\to\infty}\frac{1}{t}\int_0^t X_s\,ds=Z\neq E[Z]$ (except in the degenerate case that $Z$ has variance zero).

(b) False. One reason is that the function is continuous at zero, but not everywhere. For another, we would have $\mathrm{Var}(X_1-X_0-X_2)=3R_X(0)-4R_X(1)+2R_X(2)=3-4+0=-1$, which is impossible for a variance.

(c) True. In general, $R_{X'X}(\tau)=R_X'(\tau)$. Since $R_X$ is an even function, $R_X'(0)=0$. Thus, for any $t$, $E[X_t'X_t]=R_{X'X}(0)=R_X'(0)=0$. Since the process $X$ has mean zero, it follows that $\mathrm{Cov}(X_t',X_t)=0$ as well. Since $X$ is a Gaussian process, and differentiation is a linear operation, $X_t$ and $X_t'$ are jointly Gaussian. Summarizing, for $t$ fixed, $X_t'$ and $X_t$ are jointly Gaussian and uncorrelated, so they are independent. (Note: $X_s'$ is not necessarily independent of $X_t$ if $s\neq t$.)

7.6 Cross correlation between a process and its m.s. derivative

Fix $t,u\in\mathbb{T}$. By assumption, $\lim_{s\to t}\frac{X_s-X_t}{s-t}=X_t'$ m.s. Therefore, by Corollary 2.2.4, $E\left[\left(\frac{X_s-X_t}{s-t}\right)X_u\right]\to E[X_t'X_u]$ as $s\to t$. Equivalently,
\[
\frac{R_X(s,u)-R_X(t,u)}{s-t}\to R_{X'X}(t,u)\quad\text{as } s\to t.
\]
Hence $\partial_1R_X(t,u)$ exists, and $\partial_1R_X(t,u)=R_{X'X}(t,u)$.

7.8 A windowed Poisson process

(a) The sample paths of $X$ are piecewise constant, integer valued with initial value zero. They jump by $+1$ at each jump of $N$, and jump by $-1$ one time unit after each jump of $N$.

(b) Method 1: If $|s-t|\ge 1$ then $X_s$ and $X_t$ are increments of $N$ over disjoint intervals, and are therefore independent, so $C_X(s,t)=0$. If $|s-t|<1$, then there are three disjoint intervals, $I_0$, $I_1$, and $I_2$, with $I_0=[s,s+1]\cap[t,t+1]$, such that $[s,s+1]=I_0\cup I_1$ and $[t,t+1]=I_0\cup I_2$. Thus, $X_s=D_0+D_1$ and $X_t=D_0+D_2$, where $D_i$ is the increment of $N$ over the interval $I_i$. The three increments $D_0$, $D_1$, and $D_2$ are independent, and $D_0$ is a Poisson random variable with mean and variance equal to $\lambda$ times the length of $I_0$, which is $1-|s-t|$.
Therefore, $C_X(s,t)=\mathrm{Cov}(D_0+D_1,D_0+D_2)=\mathrm{Cov}(D_0,D_0)=\lambda(1-|s-t|)$. Summarizing,
\[
C_X(s,t)=\begin{cases}\lambda(1-|s-t|) & \text{if } |s-t|<1\\ 0 & \text{else.}\end{cases}
\]
Method 2: $C_X(s,t)=\mathrm{Cov}(N_{s+1}-N_s,\,N_{t+1}-N_t)=\lambda[\min(s+1,t+1)-\min(s+1,t)-\min(s,t+1)+\min(s,t)]$. This answer can be simplified to the one found by Method 1 by considering the cases $|s-t|>1$, $t<s<t+1$, and $s<t<s+1$ separately.

(c) No. $X$ has a $-1$ jump one time unit after each $+1$ jump, so the value $X_t$ for a "present" time $t$ tells less about the future, $(X_s : s\ge t)$, than the past, $(X_s : 0\le s\le t)$, tells about the future.

(d) Yes. Recall that $R_X(s,t)=C_X(s,t)+\mu_X(s)\mu_X(t)$. Since $C_X$ and $\mu_X$ are continuous functions, so is $R_X$, so that $X$ is m.s. continuous.

(e) Yes. Using the facts that $C_X(s,t)$ is a function of $s-t$ alone, and $C_X(s)\to 0$ as $s\to\infty$, we find as in the section on ergodicity,
\[
\mathrm{Var}\left(\frac{1}{t}\int_0^t X_s\,ds\right)=\frac{2}{t}\int_0^t\left(1-\frac{s}{t}\right)C_X(s)\,ds\to 0\quad\text{as } t\to\infty.
\]

7.10 A singular integral with a Brownian motion

(a) The integral $\int_\epsilon^1\frac{w_t}{t}\,dt$ exists in the m.s. sense for any $\epsilon>0$ because $w_t/t$ is m.s. continuous over $[\epsilon,1]$. To see if the limit exists we apply the correlation form of the Cauchy criteria (Proposition 2.2.2). Using different letters as variables of integration and the fact $R_w(s,t)=s\wedge t$ (the minimum of $s$ and $t$) yields that as $\epsilon,\epsilon'\to 0$,
\[
E\left[\int_\epsilon^1\frac{w_s}{s}\,ds\int_{\epsilon'}^1\frac{w_t}{t}\,dt\right]
=\int_{\epsilon'}^1\int_\epsilon^1\frac{s\wedge t}{st}\,ds\,dt
\to\int_0^1\int_0^1\frac{s\wedge t}{st}\,ds\,dt
=2\int_0^1\int_0^t\frac{s\wedge t}{st}\,ds\,dt
=2\int_0^1\int_0^t\frac{s}{st}\,ds\,dt
=2\int_0^1\int_0^t\frac{1}{t}\,ds\,dt
=2\int_0^1 1\,dt=2.
\]
Thus the m.s. limit defining the integral exists. The integral has the $N(0,2)$ distribution.

(b) As $a,b\to\infty$,
\[
E\left[\int_1^a\frac{w_s}{s}\,ds\int_1^b\frac{w_t}{t}\,dt\right]
=\int_1^b\int_1^a\frac{s\wedge t}{st}\,ds\,dt
\to\int_1^\infty\int_1^\infty\frac{s\wedge t}{st}\,ds\,dt
=2\int_1^\infty\int_1^t\frac{s\wedge t}{st}\,ds\,dt
=2\int_1^\infty\int_1^t\frac{1}{t}\,ds\,dt
=2\int_1^\infty\frac{t-1}{t}\,dt=\infty,
\]
so that the m.s. limit does not exist, and the integral is not well defined.

7.12 Recognizing m.s. properties

(a) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable since $R_X'(0)$ doesn't exist. Yes, m.s. integrable over finite intervals since m.s. continuous. Yes, mean ergodic in m.s. since $R_X(T)\to 0$ as $|T|\to\infty$.

(b) Yes, no, yes, for the same reasons as in part (a). Since $X$ is mean zero, $R_X(T)=C_X(T)$ for all $T$. Thus
\[
\lim_{|T|\to\infty}C_X(T)=\lim_{|T|\to\infty}R_X(T)=1.
\]
Since the limit of $C_X$ exists and is not zero, $X$ is not mean ergodic in the m.s. sense.

(c) Yes, no, yes, yes, for the same reasons as in (a).

(d) No, not m.s. continuous since $R_X$ is not continuous. No, not m.s. differentiable since $X$ is not even m.s. continuous. Yes, m.s. integrable over finite intervals, because the Riemann integral $\int_a^b\int_a^b R_X(s,t)\,ds\,dt$ exists and is finite, for the region of integration is a simple bounded region and the integrand is piecewise constant.

(e) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable. For example,
\[
E\left[\left(\frac{X_t-X_0}{t}\right)^2\right]=\frac{1}{t^2}\bigl[R_X(t,t)-R_X(t,0)-R_X(0,t)+R_X(0,0)\bigr]=\frac{1}{t^2}\bigl[\sqrt{t}-0-0+0\bigr]\to+\infty\quad\text{as } t\to 0.
\]
Yes, m.s. integrable over finite intervals since m.s. continuous.

7.14 A stationary Gaussian process

(a) No. All mean zero stationary, Gaussian Markov processes have autocorrelation functions of the form $R_X(t)=A\rho^{|t|}$, where $A\ge 0$ and $0\le\rho\le 1$ for continuous time (or $|\rho|\le 1$ for discrete time).

(b) $E[X_3|X_0]=\widehat{E}[X_3|X_0]=\frac{R_X(3)}{R_X(0)}X_0=\frac{X_0}{10}$. The error is Gaussian with mean zero and variance $\mathrm{MSE}=\mathrm{Var}(X_3)-\mathrm{Var}\left(\frac{X_0}{10}\right)=1-0.01=0.99$. So $P\{|X_3-E[X_3|X_0]|\ge 10\}=2Q\left(\frac{10}{\sqrt{0.99}}\right)$.

(c) $R_{X'}(\tau)=-R_X''(\tau)=\frac{2-6\tau^2}{(1+\tau^2)^3}$. In particular, since $-R_X''$ exists and is continuous, $X$ is continuously differentiable in the m.s. sense.

(d) The vector has a joint Gaussian distribution because $X$ is a Gaussian process and differentiation is a linear operation. $\mathrm{Cov}(X_\tau,X_0')=R_{XX'}(\tau)=-R_X'(\tau)=\frac{2\tau}{(1+\tau^2)^2}$.
In particular, $\mathrm{Cov}(X_0,X_0')=0$ and $\mathrm{Cov}(X_1,X_0')=\frac{2}{4}=0.5$. Also, $\mathrm{Var}(X_0')=R_{X'}(0)=2$. So $(X_0,X_0',X_1)^T$ has the
\[
N\left(\begin{pmatrix}0\\0\\0\end{pmatrix},\ \begin{pmatrix}1&0&0.5\\0&2&0.5\\0.5&0.5&1\end{pmatrix}\right)
\]
distribution.

7.16 Correlation ergodicity of Gaussian processes

Fix $h$ and let $Y_t=X_{t+h}X_t$. Clearly $Y$ is stationary with mean $\mu_Y=R_X(h)$. Observe that
\[
C_Y(\tau)=E[Y_\tau Y_0]-\mu_Y^2=E[X_{\tau+h}X_\tau X_hX_0]-R_X(h)^2
=R_X(h)^2+R_X(\tau)R_X(\tau)+R_X(\tau+h)R_X(\tau-h)-R_X(h)^2.
\]
Therefore, $C_Y(\tau)\to 0$ as $|\tau|\to\infty$. Hence $Y$ is mean ergodic, so $X$ is correlation ergodic.

7.18 Gaussian review question

(a) Since $X$ is Markovian, the best estimator of $X_2$ given $(X_0,X_1)$ is a function of $X_1$ alone. Since $X$ is Gaussian, such an estimator is linear in $X_1$. Since $X$ is mean zero, it is given by $\mathrm{Cov}(X_2,X_1)\mathrm{Var}(X_1)^{-1}X_1=e^{-1}X_1$. Thus $E[X_2|X_0,X_1]=e^{-1}X_1$. No function of $(X_0,X_1)$ is a better estimator! But $e^{-1}X_1$ is equal to $p(X_0,X_1)$ for the polynomial $p(x_0,x_1)=x_1/e$. This is the optimal polynomial. The resulting mean square error is given by $\mathrm{MMSE}=\mathrm{Var}(X_2)-\mathrm{Cov}(X_1,X_2)^2/\mathrm{Var}(X_1)=9(1-e^{-2})$.

(b) Given $(X_0=\pi,X_1=3)$, $X_2$ is $N\left(3e^{-1},\,9(1-e^{-2})\right)$, so
\[
P[X_2\ge 4\,|\,X_0=\pi,X_1=3]=P\left[\frac{X_2-3e^{-1}}{\sqrt{9(1-e^{-2})}}\ge\frac{4-3e^{-1}}{\sqrt{9(1-e^{-2})}}\right]=Q\left(\frac{4-3e^{-1}}{\sqrt{9(1-e^{-2})}}\right).
\]

7.20 KL expansion of a simple random process

(a) Yes, because $R_X(\tau)$ is twice continuously differentiable.

(b) No.
\[
\lim_{t\to\infty}\frac{2}{t}\int_0^t\left(\frac{t-\tau}{t}\right)C_X(\tau)\,d\tau=50+\lim_{t\to\infty}\frac{100}{t}\int_0^t\left(\frac{t-\tau}{t}\right)\cos(20\pi\tau)\,d\tau=50\neq 0.
\]
Thus, the necessary and sufficient condition for mean ergodicity in the m.s. sense does not hold.

(c) APPROACH ONE. Since $R_X(0)=R_X(1)$, the process $X$ is periodic with period one (actually, with period 0.1). Thus, by the theory of WSS periodic processes, the eigenfunctions can be taken to be $\phi_n(t)=e^{2\pi jnt}$ for $n\in\mathbb{Z}$. (Still have to identify the eigenvalues.)
APPROACH TWO. The identity $\cos(\theta)=\frac{1}{2}(e^{j\theta}+e^{-j\theta})$ yields
\[
R_X(s-t)=50+25e^{20\pi j(s-t)}+25e^{-20\pi j(s-t)}
=50+25e^{20\pi js}e^{-20\pi jt}+25e^{-20\pi js}e^{20\pi jt}
=50\phi_0(s)\phi_0^*(t)+25\phi_1(s)\phi_1^*(t)+25\phi_2(s)\phi_2^*(t)
\]
for the choice $\phi_0(t)\equiv 1$, $\phi_1(t)=e^{20\pi jt}$ and $\phi_2(t)=e^{-20\pi jt}$. The eigenvalues are thus 50, 25, and 25. The other eigenfunctions can be selected to fill out an orthonormal basis, and the other eigenvalues are zero.

APPROACH THREE. For $s,t\in[0,1]$ we have
\[
R_X(s,t)=50+50\cos(20\pi(s-t))=50+50\cos(20\pi s)\cos(20\pi t)+50\sin(20\pi s)\sin(20\pi t)
=50\phi_0(s)\phi_0^*(t)+25\phi_1(s)\phi_1^*(t)+25\phi_2(s)\phi_2^*(t)
\]
for the choice $\phi_0(t)\equiv 1$, $\phi_1(t)=\sqrt{2}\cos(20\pi t)$ and $\phi_2(t)=\sqrt{2}\sin(20\pi t)$. The eigenvalues are thus 50, 25, and 25. The other eigenfunctions can be selected to fill out an orthonormal basis, and the other eigenvalues are zero. (Note: the eigenspace for eigenvalue 25 is two dimensional, so the choice of eigenfunctions spanning that space is not unique.)

7.22 KL expansion for derivative process

(a) Since $\phi_n'(t)=(2\pi jn)\phi_n(t)$, the derivative of each $\phi_n$ is a constant times $\phi_n$ itself. Therefore, the equation given in the problem statement leads to
\[
X'(t)=\sum_n(X,\phi_n)\phi_n'(t)=\sum_n[(2\pi jn)(X,\phi_n)]\phi_n(t),
\]
which is a KL expansion, because the functions $\phi_n$ are orthonormal in $L^2[0,1]$ and the coordinates are orthogonal random variables. Thus,
\[
\psi_n(t)=\phi_n(t),\quad (X',\psi_n)=(2\pi jn)(X,\phi_n),\quad\text{and}\quad \mu_n=(2\pi n)^2\lambda_n\quad\text{for } n\in\mathbb{Z}.
\]
(Recall that the eigenvalues are equal to the means of the squared magnitudes of the coordinates.)

(b) Note that $\phi_1'=0$, $\phi_{2k}'(t)=-(2\pi k)\phi_{2k+1}(t)$ and $\phi_{2k+1}'(t)=(2\pi k)\phi_{2k}(t)$. This is similar to part (a). The same basis functions can be used for $X'$ as for $X$, but the $(2k)^{th}$ and $(2k+1)^{th}$ coordinates of $X'$ come from the $(2k+1)^{th}$ and $(2k)^{th}$ coordinates of $X$, respectively, for all $k\ge 1$.
Specifically, we can take $\psi_n(t)=\phi_n(t)$ for $n\ge 0$, with
\[
(X',\psi_0)=0,\quad \mu_0=0,
\]
\[
(X',\psi_{2k})=(2\pi k)(X,\phi_{2k+1}),\quad \mu_{2k}=(2\pi k)^2\lambda_{2k+1},
\]
\[
(X',\psi_{2k+1})=-(2\pi k)(X,\phi_{2k}),\quad \mu_{2k+1}=(2\pi k)^2\lambda_{2k},\quad\text{for } k\ge 1.
\]
(It would have been acceptable to not define $\psi_0$, because the corresponding eigenvalue is zero.)

(c) Note that $\phi_n'(t)=\frac{(2n+1)\pi}{2}\psi_n(t)$, where $\psi_n(t)=\sqrt{2}\cos\left(\frac{(2n+1)\pi t}{2}\right)$, $n\ge 0$. That is, $\psi_n$ is the same as $\phi_n$, but with sin replaced by cos. Or equivalently, by the hint, we discover that $\psi_n$ is obtained from $\phi_n$ by time-reversal: $\psi_n(t)=\phi_n(1-t)(-1)^n$. Thus, the functions $\psi_n$ are orthonormal. As in part (a), we also have $(X',\psi_n)=\frac{(2n+1)\pi}{2}(X,\phi_n)$, and therefore, $\mu_n=\left(\frac{(2n+1)\pi}{2}\right)^2\lambda_n$. (The set of eigenfunctions is not unique; for example, some could be multiplied by $-1$ to yield another valid set.)

(d) Differentiating the KL expansion of $X$ yields
\[
X_t'=(X,\phi_1)\phi_1'(t)+(X,\phi_2)\phi_2'(t)=(X,\phi_1)c_1\sqrt{3}-(X,\phi_2)c_2\sqrt{3}.
\]
That is, the random process $X'$ is constant in time. So its KL expansion involves only one nonzero term, with the eigenfunction $\psi_1(t)=1$ for $0\le t\le 1$. Then $(X',\psi_1)=(X,\phi_1)c_1\sqrt{3}-(X,\phi_2)c_2\sqrt{3}$, and therefore $\mu_1=3c_1^2\lambda_1+3c_2^2\lambda_2$.

7.24 Mean ergodicity of a periodic WSS random process

\[
\frac{1}{t}\int_0^t X_u\,du=\frac{1}{t}\int_0^t\sum_n\widehat{X}_ne^{2\pi jnu/T}\,du=\sum_{n\in\mathbb{Z}}a_{n,t}\widehat{X}_n,
\]
where $a_{0,t}=1$, and for $n\neq 0$,
\[
|a_{n,t}|=\left|\frac{1}{t}\int_0^te^{2\pi jnu/T}\,du\right|=\left|\frac{e^{2\pi jnt/T}-1}{2\pi jnt/T}\right|\le\frac{T}{\pi|n|t}.
\]
The $n\neq 0$ terms are not important as $t\to\infty$. Indeed,
\[
E\left[\Bigl|\sum_{n\in\mathbb{Z},\,n\neq 0}a_{n,t}\widehat{X}_n\Bigr|^2\right]=\sum_{n\in\mathbb{Z},\,n\neq 0}|a_{n,t}|^2p_X(n)\le\frac{T^2}{\pi^2t^2}\sum_{n\in\mathbb{Z},\,n\neq 0}p_X(n)\to 0\quad\text{as } t\to\infty.
\]
Therefore, $\frac{1}{t}\int_0^tX_u\,du\to\widehat{X}_0$ m.s. The limit has mean zero and variance $p_X(0)$. For mean ergodicity (in the m.s. sense), the limit should be zero with probability one, which is true if and only if $p_X(0)=0$. That is, the process should have no zero frequency, or DC, component.
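The coefficient bound in Problem 7.24 can be spot-checked numerically. The following is a minimal sketch; the helper names and the sample values of $n$, $t$, and $T$ are assumptions made for illustration. It evaluates $a_{n,t}$ from its closed form and compares its magnitude with $T/(\pi|n|t)$.

```python
import cmath
import math


def a_coeff(n, t, T):
    """a_{n,t} = (1/t) * integral_0^t exp(2*pi*j*n*u/T) du, in closed form."""
    if n == 0:
        return 1.0 + 0.0j
    w = 2.0 * math.pi * n / T
    return (cmath.exp(1j * w * t) - 1.0) / (1j * w * t)


def a_bound(n, t, T):
    """Upper bound T / (pi * |n| * t) on |a_{n,t}| for n != 0."""
    return T / (math.pi * abs(n) * t)


if __name__ == "__main__":
    T = 2.0  # assumed period, for illustration only
    for t in (1.0, 10.0, 100.0):
        for n in (1, -3, 7):
            print(t, n, abs(a_coeff(n, t, T)), a_bound(n, t, T))
```

The bound follows from $|e^{jx}-1|\le 2$, and it decays like $1/t$, which is why only the $n=0$ term survives in the limit.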
(Note: More generally, if $X$ were not assumed to be mean zero, then $X$ would be mean ergodic if and only if $\mathrm{Var}(\widehat{X}_0)=0$, or equivalently, $p_X(0)=\mu_X^2$, or equivalently, $\widehat{X}_0$ is a constant a.s.)

8.2 On the cross spectral density

Follow the hint. Let $U$ be the output if $X$ is filtered by $H^\epsilon$ and $V$ be the output if $Y$ is filtered by $H^\epsilon$. The Schwarz inequality applied to random variables $U_t$ and $V_t$ for $t$ fixed yields $|R_{UV}(0)|^2\le R_U(0)R_V(0)$, or equivalently,
\[
\left|\int_{J^\epsilon}S_{XY}(\omega)\frac{d\omega}{2\pi}\right|^2\le\int_{J^\epsilon}S_X(\omega)\frac{d\omega}{2\pi}\int_{J^\epsilon}S_Y(\omega)\frac{d\omega}{2\pi},
\]
which implies that
\[
|\epsilon S_{XY}(\omega_o)+o(\epsilon)|^2\le(\epsilon S_X(\omega_o)+o(\epsilon))(\epsilon S_Y(\omega_o)+o(\epsilon)).
\]
Dividing by $\epsilon^2$ and letting $\epsilon\to 0$ yields the desired conclusion.

8.4 Filtering a Gauss Markov process

(a) The process $Y$ is the output when $X$ is passed through the linear time-invariant system with impulse response function $h(\tau)=e^{-\tau}I_{\{\tau\ge 0\}}$. Thus, $X$ and $Y$ are jointly WSS, and
\[
R_{XY}(\tau)=R_X*\widetilde{h}(\tau)=\int_{-\infty}^{\infty}R_X(t)\widetilde{h}(\tau-t)\,dt=\int_{-\infty}^{\infty}R_X(t)h(t-\tau)\,dt=\begin{cases}\frac{1}{2}e^{-\tau} & \tau\ge 0\\ \left(\frac{1}{2}-\tau\right)e^{\tau} & \tau\le 0.\end{cases}
\]

(b) $X_5$ and $Y_5$ are jointly Gaussian, mean zero, with $\mathrm{Var}(X_5)=R_X(0)=1$, and $\mathrm{Cov}(Y_5,X_5)=R_{XY}(0)=\frac{1}{2}$, so $E[Y_5|X_5=3]=(\mathrm{Cov}(Y_5,X_5)/\mathrm{Var}(X_5))\cdot 3=3/2$.

(c) Yes, $Y$ is Gaussian, because $X$ is a Gaussian process and $Y$ is obtained from $X$ by linear operations.

(d) No, $Y$ is not Markov. For example, we see that $S_Y(\omega)=\frac{2}{(1+\omega^2)^2}$, which does not have the form required for a stationary mean zero Gaussian process to be Markov (namely $\frac{2A}{\alpha^2+\omega^2}$). Another explanation is that, if $t$ is the present time, given $Y_t$, the future of $Y$ is determined by $Y_t$ and $(X_s : s\ge t)$. The future could be better predicted by knowing something more about $X_t$ than $Y_t$ gives alone, which is provided by knowing the past of $Y$. (Note: the $\mathbb{R}^2$-valued process $((X_t,Y_t) : t\in\mathbb{R})$ is Markov.)

8.6 A stationary two-state Markov process

$\pi P=\pi$ implies $\pi=\left(\frac{1}{2},\frac{1}{2}\right)$ is the equilibrium distribution, so $P[X_n=1]=P[X_n=-1]=\frac{1}{2}$ for all $n$. Thus $\mu_X=0$.
For $n\ge 1$,
\[
R_X(n)=P[X_n=1,X_0=1]+P[X_n=-1,X_0=-1]-P[X_n=-1,X_0=1]-P[X_n=1,X_0=-1]
\]
\[
=\frac{1}{2}\left[\frac{1}{2}+\frac{1}{2}(1-2p)^n\right]+\frac{1}{2}\left[\frac{1}{2}+\frac{1}{2}(1-2p)^n\right]-\frac{1}{2}\left[\frac{1}{2}-\frac{1}{2}(1-2p)^n\right]-\frac{1}{2}\left[\frac{1}{2}-\frac{1}{2}(1-2p)^n\right]=(1-2p)^n.
\]
So in general, $R_X(n)=(1-2p)^{|n|}$. The corresponding power spectral density is given by
\[
S_X(\omega)=\sum_{n=-\infty}^{\infty}(1-2p)^{|n|}e^{-j\omega n}=\sum_{n=0}^{\infty}((1-2p)e^{-j\omega})^n+\sum_{n=0}^{\infty}((1-2p)e^{j\omega})^n-1
\]
\[
=\frac{1}{1-(1-2p)e^{-j\omega}}+\frac{1}{1-(1-2p)e^{j\omega}}-1=\frac{1-(1-2p)^2}{1-2(1-2p)\cos(\omega)+(1-2p)^2}.
\]

8.8 A linear estimation problem

\[
E[|X_t-Z_t|^2]=E[(X_t-Z_t)(X_t-Z_t)^*]=R_X(0)+R_Z(0)-R_{XZ}(0)-R_{ZX}(0)
\]
\[
=R_X(0)+h*\widetilde{h}*R_Y(0)-2\mathrm{Re}(\widetilde{h}*R_{XY}(0))
=\int_{-\infty}^{\infty}\bigl[S_X(\omega)+|H(\omega)|^2S_Y(\omega)-2\mathrm{Re}(H^*(\omega)S_{XY}(\omega))\bigr]\frac{d\omega}{2\pi}.
\]
The hint with $\sigma^2=S_Y(\omega)$, $z_o=S_{XY}(\omega)$, and $z=H(\omega)$ implies $H_{opt}(\omega)=\frac{S_{XY}(\omega)}{S_Y(\omega)}$.

8.10 The accuracy of approximate differentiation

(a) $S_{X'}(\omega)=S_X(\omega)|H(\omega)|^2=\omega^2S_X(\omega)$.

(b) $k(\tau)=\frac{1}{2a}(\delta(\tau+a)-\delta(\tau-a))$ and $K(\omega)=\int_{-\infty}^{\infty}k(\tau)e^{-j\omega\tau}\,d\tau=\frac{1}{2a}(e^{j\omega a}-e^{-j\omega a})=\frac{j\sin(a\omega)}{a}$. By l'Hôpital's rule, $\lim_{a\to 0}K(\omega)=\lim_{a\to 0}\frac{j\omega\cos(a\omega)}{1}=j\omega$.

(c) $D$ is the output of the linear system with input $X$ and transfer function $H(\omega)-K(\omega)$. The output thus has power spectral density $S_D(\omega)=S_X(\omega)|H(\omega)-K(\omega)|^2=S_X(\omega)\left|\omega-\frac{\sin(a\omega)}{a}\right|^2$.

(d) Or, $S_D(\omega)=S_{X'}(\omega)\left|1-\frac{\sin(a\omega)}{a\omega}\right|^2$. Suppose $0<a\le\frac{\sqrt{0.6}}{\omega_o}$ $\left(\approx\frac{0.77}{\omega_o}\right)$. Then by the bound given in the problem statement, if $|\omega|\le\omega_o$ then $0\le 1-\frac{\sin(a\omega)}{a\omega}\le\frac{(a\omega)^2}{6}\le\frac{(a\omega_o)^2}{6}\le 0.1$, so that $S_D(\omega)\le(0.01)S_{X'}(\omega)$ for $\omega$ in the base band. Integrating this inequality over the band yields that $E[|D_t|^2]\le(0.01)E[|X_t'|^2]$.

8.12 Filtering Poisson white noise

(a) Since $\mu_{N'}=\lambda$, $\mu_X=\lambda\int_{-\infty}^{\infty}h(t)\,dt$. Also, $C_X=h*\widetilde{h}*C_{N'}=\lambda h*\widetilde{h}$. (In particular, if $h(t)=I_{\{0\le t<1\}}$, then $C_X(\tau)=\lambda(1-|\tau|)_+$, as already found in Problem 4.17.)

(b) In the special case, in between arrival times of $N$, $X$ decreases exponentially, following the equation $X'=-X$.
At each arrival time of $N$, $X$ has an upward jump of size one. Formally, we can write $X'=-X+N'$. For a fixed time $t_o$, which we think of as the present time, the process after time $t_o$ is the solution of the above differential equation, where the future of $N'$ is independent of $X$ up to time $t_o$. Thus, the future evolution of $X$ depends only on the current value, and random variables independent of the past. Hence, $X$ is Markov.

8.14 Linear and nonlinear reconstruction from samples

(a) We first find the mean function and autocorrelation function of $X$. $E[X_t]=\sum_nE[g(t-n-U)]E[B_n]=0$ because $E[B_n]=0$ for all $n$.
\[
R_X(s,t)=E\left[\sum_{n=-\infty}^{\infty}g(s-n-U)B_n\sum_{m=-\infty}^{\infty}g(t-m-U)B_m\right]
=\sigma^2\sum_{n=-\infty}^{\infty}E[g(s-n-U)g(t-n-U)]
\]
\[
=\sigma^2\sum_{n=-\infty}^{\infty}\int_0^1g(s-n-u)g(t-n-u)\,du
=\sigma^2\sum_{n=-\infty}^{\infty}\int_n^{n+1}g(s-v)g(t-v)\,dv
=\sigma^2\int_{-\infty}^{\infty}g(s-v)g(t-v)\,dv
\]
\[
=\sigma^2\int_{-\infty}^{\infty}g(s-v)\widetilde{g}(v-t)\,dv=\sigma^2(g*\widetilde{g})(s-t).
\]
So $X$ is WSS with mean zero and $R_X=\sigma^2g*\widetilde{g}$.

(b) By part (a), the power spectral density of $X$ is $\sigma^2|G(\omega)|^2$. If $g$ is a baseband signal, so that $|G(\omega)|^2=0$ for $|\omega|\ge\omega_o$, then by the sampling theorem for WSS baseband random processes, $X$ can be recovered from the samples $(X(nT) : n\in\mathbb{Z})$ as long as $T\le\frac{\pi}{\omega_o}$.

(c) For this case, $G(2\pi f)=\mathrm{sinc}^2(f)$, which is not supported in any finite interval. So part (b) does not apply. The sample paths of $X$ are continuous and piecewise linear, and at least two sample points fall within each linear portion of $X$. Either all pairs of samples of the form $(X_n,X_{n+0.5})$ fall within linear regions (happens when $0.5\le U\le 1$), or all pairs of samples of the form $(X_{n+0.5},X_{n+1})$ fall within linear regions (happens when $0\le U\le 0.5$). We can try reconstructing $X$ using both cases. With probability one, only one of the cases will yield a reconstruction with change points having spacing one. That must be the correct reconstruction of $X$. The algorithm is illustrated in Figure 12.4.
Figure 12.4(a) shows a sample path of $B$ and a corresponding sample path of $X$, for $U=0.75$. Thus, the breakpoints of $X$ are at times of the form $n+0.75$ for integers $n$. Figure 12.4(b) shows the corresponding samples, taken at integer multiples of $T=0.5$. Figure 12.4(c) shows the result of connecting pairs of the form $(X_n,X_{n+0.5})$, and Figure 12.4(d) shows the result of connecting pairs of the form $(X_{n+0.5},X_{n+1})$. Of these two, only Figure 12.4(c) yields breakpoints with unit spacing. Thus, the dashed lines in Figure 12.4(c) are connected to reconstruct $X$.

[Figure 12.4: Nonlinear reconstruction of a signal from samples; panels (a) through (d).]

8.16 An approximation of white noise

(a) Since $E[B_kB_l^*]=\sigma^2I_{\{k=l\}}$,
\[
E\left[\left|\int_0^1N_t\,dt\right|^2\right]=E\left[\Bigl|A_TT\sum_{k=1}^KB_k\Bigr|^2\right]=(A_TT)^2E\left[\sum_{k=1}^KB_k\sum_{l=1}^KB_l^*\right]=(A_TT)^2\sigma^2K=A_T^2T\sigma^2.
\]

(b) The choice of scaling constant $A_T$ such that $A_T^2T\equiv 1$ is $A_T=\frac{1}{\sqrt{T}}$. Under this scaling the process $N$ approximates white noise with power spectral density $\sigma^2$ as $T\to 0$.

(c) If the constant scaling $A_T=1$ is used, then $E\left[\left|\int_0^1N_t\,dt\right|^2\right]=T\sigma^2\to 0$ as $T\to 0$.

8.18 Filtering to maximize signal to noise ratio

The problem is to select $H$ to minimize $\sigma^2\int_{-\infty}^{\infty}|H(\omega)|^2\frac{d\omega}{2\pi}$, subject to the constraints (i) $|H(\omega)|\le 1$ for all $\omega$, and (ii) $\int_{-\omega_o}^{\omega_o}|\omega|\,|H(\omega)|^2\frac{d\omega}{2\pi}\ge(\text{the power of } X)/2$. First, $H$ should be zero outside of the interval $[-\omega_o,\omega_o]$, for otherwise it would be contributing to the output noise power with no contribution to the output signal power. Furthermore, if $|H(\omega)|^2>0$ over a small interval of length $2\pi\epsilon$ contained in $[-\omega_o,\omega_o]$, then the contribution to the output signal power is $|\omega||H(\omega)|^2\epsilon$, whereas the contribution to the output noise is $\sigma^2|H(\omega)|^2\epsilon$. The ratio of these contributions is $|\omega|/\sigma^2$, which we would like to be as large as possible.
Thus, the optimal $H$ has the form $H(\omega)=I_{\{a\le|\omega|\le\omega_o\}}$, where $a$ is selected to meet the signal power constraint with equality: (power of $\widetilde{X}$) = (power of $X$)/2. This yields $a=\omega_o/\sqrt{2}$. In summary, the optimal choice is $|H(\omega)|^2=I_{\{\omega_o/\sqrt{2}\le|\omega|\le\omega_o\}}$.

8.20 Sampling a signal or process that is not band limited

(a) Evaluating the inverse transform of $\widehat{x}^o$ at $nT$, and using the fact that $2\omega_oT=2\pi$, yields
\[
x^o(nT)=\int_{-\infty}^{\infty}e^{j\omega nT}\widehat{x}^o(\omega)\frac{d\omega}{2\pi}=\int_{-\omega_o}^{\omega_o}e^{j\omega nT}\widehat{x}^o(\omega)\frac{d\omega}{2\pi}
=\sum_{m=-\infty}^{\infty}\int_{-\omega_o}^{\omega_o}e^{j\omega nT}\widehat{x}(\omega+2m\omega_o)\frac{d\omega}{2\pi}
\]
\[
=\sum_{m=-\infty}^{\infty}\int_{(2m-1)\omega_o}^{(2m+1)\omega_o}e^{j\omega nT}\widehat{x}(\omega)\frac{d\omega}{2\pi}=\int_{-\infty}^{\infty}e^{j\omega nT}\widehat{x}(\omega)\frac{d\omega}{2\pi}=x(nT).
\]

(b) The observation follows from the sampling theorem for the baseband signal $x^o$ and part (a).

(c) The fact $R_X(nT)=R_X^o(nT)$ follows from part (a) with $x(t)=R_X(t)$.

(d) Let $X^o$ be a WSS baseband random process with autocorrelation function $R_X^o$. Then by the sampling theorem for baseband random processes, $X_t^o=\sum_{n=-\infty}^{\infty}X_{nT}^o\,\mathrm{sinc}\left(\frac{t-nT}{T}\right)$. But the discrete time processes $(X_{nT} : n\in\mathbb{Z})$ and $(X_{nT}^o : n\in\mathbb{Z})$ have the same autocorrelation function by part (c). Thus $X^o$ and $Y$ have the same autocorrelation function. Also, $Y$ has mean zero. So $Y$ is WSS with mean zero and autocorrelation function $R_X^o$.

(e) For $0\le\omega\le\omega_o$,
\[
S_X^o(\omega)=\sum_{n=-\infty}^{\infty}\exp(-\alpha|\omega+2n\omega_o|)=\sum_{n=0}^{\infty}\exp(-\alpha(\omega+2n\omega_o))+\sum_{n=-\infty}^{-1}\exp(\alpha(\omega+2n\omega_o))
\]
\[
=\frac{\exp(-\alpha\omega)+\exp(\alpha(\omega-2\omega_o))}{1-\exp(-2\alpha\omega_o)}=\frac{\exp(-\alpha(\omega-\omega_o))+\exp(\alpha(\omega-\omega_o))}{\exp(\alpha\omega_o)-\exp(-\alpha\omega_o)}=\frac{\cosh(\alpha(\omega_o-\omega))}{\sinh(\alpha\omega_o)}.
\]
Thus, for any $\omega$,
\[
S_X^o(\omega)=I_{\{|\omega|\le\omega_o\}}\frac{\cosh(\alpha(\omega_o-|\omega|))}{\sinh(\alpha\omega_o)}.
\]
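The closed form in part (e) of Problem 8.20 can be compared against a direct truncated sum of the aliased terms. This is a rough numerical sketch: the function names, the truncation level, and the values of $\alpha$ and $\omega_o$ are assumptions picked for illustration.

```python
import math


def aliased_sum(omega, alpha, omega_o, terms=200):
    """Truncated sum over n of exp(-alpha * |omega + 2*n*omega_o|)."""
    return sum(math.exp(-alpha * abs(omega + 2 * n * omega_o))
               for n in range(-terms, terms + 1))


def closed_form(omega, alpha, omega_o):
    """I{|omega| <= omega_o} * cosh(alpha*(omega_o - |omega|)) / sinh(alpha*omega_o)."""
    if abs(omega) > omega_o:
        return 0.0
    return math.cosh(alpha * (omega_o - abs(omega))) / math.sinh(alpha * omega_o)


if __name__ == "__main__":
    alpha, omega_o = 0.7, 1.5  # assumed values, for illustration only
    for omega in (-1.5, -0.4, 0.0, 0.9, 1.5):
        print(omega, aliased_sum(omega, alpha, omega_o),
              closed_form(omega, alpha, omega_o))
```

The geometric decay of the terms means a few hundred terms already agree with the closed form to many digits for these parameter values.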
8.22 Another narrowband Gaussian process

(a) $\mu_X=\mu_R\int_{-\infty}^{\infty}h(t)\,dt=\mu_RH(0)=0$, and
\[
S_X(2\pi f)=|H(2\pi f)|^2S_R(2\pi f)=10^{-2}e^{-|f|/10^4}I_{\{5000\le|f|\le 6000\}}.
\]

(b) $R_X(0)=\int_{-\infty}^{\infty}S_X(2\pi f)\,df=\frac{2}{10^2}\int_{5000}^{6000}e^{-f/10^4}\,df=(200)(e^{-0.5}-e^{-0.6})=11.54$. So $X_{25}\sim N(0,11.54)$, and $P[X_{25}>6]=Q\left(\frac{6}{\sqrt{11.54}}\right)\approx Q(1.76)\approx 0.039$.

(c) For the narrowband representation about $f_c=5500$ (see Figure 12.5),
\[
S_U(2\pi f)=S_V(2\pi f)=10^{-2}\left[e^{-(f+5500)/10^4}+e^{-(-f+5500)/10^4}\right]I_{\{|f|\le 500\}}=\frac{e^{-.55}}{50}\cosh(f/10^4)\,I_{\{|f|\le 500\}},
\]
\[
S_{UV}(2\pi f)=10^{-2}\left[je^{-(f+5500)/10^4}-je^{-(-f+5500)/10^4}\right]I_{\{|f|\le 500\}}=\frac{-je^{-.55}}{50}\sinh(f/10^4)\,I_{\{|f|\le 500\}}.
\]

[Figure 12.5: Spectra of baseband equivalent signals for $f_c=5500$ and $f_c=5000$.]

(d) For the narrowband representation about $f_c=5000$ (see Figure 12.5),
\[
S_U(2\pi f)=S_V(2\pi f)=10^{-2}e^{-0.5}e^{-|f|/10^4}I_{\{|f|\le 1000\}},\qquad S_{UV}(2\pi f)=j\,\mathrm{sgn}(f)\,S_U(2\pi f).
\]

8.24 Declaring the center frequency for a given random process

(a) $S_U(\omega)=g(\omega+\omega_c)+g(-\omega+\omega_c)$ and $S_{UV}(\omega)=j(g(\omega+\omega_c)-g(-\omega+\omega_c))$.

(b) The integral becomes
\[
\int_{-\infty}^{\infty}(g(\omega+\omega_c)+g(-\omega+\omega_c))^2\frac{d\omega}{2\pi}=2\int_{-\infty}^{\infty}g(\omega)^2\frac{d\omega}{2\pi}+2\int_{-\infty}^{\infty}g(\omega+\omega_c)g(-\omega+\omega_c)\frac{d\omega}{2\pi}=2||g||^2+g*g(2\omega_c).
\]
Thus, select $\omega_c$ to maximize $g*g(2\omega_c)$.

9.2 A smoothing problem

Write $\widehat{X}_5=\int_0^3g(s)Y_s\,ds+\int_7^{10}g(s)Y_s\,ds$. The mean square error is minimized over all linear estimators if and only if $(X_5-\widehat{X}_5)\perp Y_u$ for $u\in[0,3]\cup[7,10]$, or equivalently
\[
R_{XY}(5,u)=\int_0^3g(s)R_Y(s,u)\,ds+\int_7^{10}g(s)R_Y(s,u)\,ds\quad\text{for } u\in[0,3]\cup[7,10].
\]

9.4 Interpolating a Gauss Markov process

(a) The constants must be selected so that $X_0-\widehat{X}_0\perp X_a$ and $X_0-\widehat{X}_0\perp X_{-a}$, or equivalently $e^{-a}-[c_1e^{-2a}+c_2]=0$ and $e^{-a}-[c_1+c_2e^{-2a}]=0$. Solving for $c_1$ and $c_2$ (one could begin by subtracting the two equations) yields $c_1=c_2=c$ where $c=\frac{e^{-a}}{1+e^{-2a}}=\frac{1}{e^a+e^{-a}}=\frac{1}{2\cosh(a)}$.
The corresponding minimum MSE is given by
\[
E[X_0^2]-E[\widehat{X}_0^2]=1-c^2E[(X_{-a}+X_a)^2]=1-c^2(2+2e^{-2a})=\frac{e^{2a}-e^{-2a}}{(e^a+e^{-a})^2}=\frac{(e^a-e^{-a})(e^a+e^{-a})}{(e^a+e^{-a})^2}=\tanh(a).
\]

(b) The claim is true if $(X_0-\widehat{X}_0)\perp X_u$ whenever $|u|\ge a$. If $u\ge a$ then
\[
E[(X_0-c(X_{-a}+X_a))X_u]=e^{-u}-\frac{1}{e^a+e^{-a}}(e^{-a-u}+e^{a-u})=0.
\]
Similarly, if $u\le -a$ then
\[
E[(X_0-c(X_{-a}+X_a))X_u]=e^{u}-\frac{1}{e^a+e^{-a}}(e^{a+u}+e^{-a+u})=0.
\]
The orthogonality condition is thus true whenever $|u|\ge a$, as required.

9.6 Proportional noise

(a) In order that $\kappa Y_t$ be the optimal estimator, by the orthogonality principle, it suffices to check two things:
1. $\kappa Y_t$ must be in the linear span of $(Y_u : a\le u\le b)$. This is true since $t\in[a,b]$ is assumed.
2. Orthogonality condition: $(X_t-\kappa Y_t)\perp Y_u$ for $u\in[a,b]$.
It remains to show that $\kappa$ can be chosen so that the orthogonality condition is true. The condition is equivalent to $E[(X_t-\kappa Y_t)Y_u^*]=0$ for $u\in[a,b]$, or equivalently $R_{XY}(t,u)=\kappa R_Y(t,u)$ for $u\in[a,b]$. The assumptions imply that $R_Y=R_X+R_N=(1+\gamma^2)R_X$ and $R_{XY}=R_X$, so the orthogonality condition becomes $R_X(t,u)=\kappa(1+\gamma^2)R_X(t,u)$ for $u\in[a,b]$, which is true for $\kappa=1/(1+\gamma^2)$. The form of the estimator is proved. The MSE is given by $E[|X_t-\widehat{X}_t|^2]=E[|X_t|^2]-E[|\widehat{X}_t|^2]=\frac{\gamma^2}{1+\gamma^2}R_X(t,t)$.

(b) Since $S_Y$ is proportional to $S_X$, the factors in the spectral factorization of $S_Y$ are proportional to the factors in the spectral factorization of $S_X$:
\[
S_Y=(1+\gamma^2)S_X=\underbrace{\left(\sqrt{1+\gamma^2}\,S_X^+\right)}_{S_Y^+}\underbrace{\left(\sqrt{1+\gamma^2}\,S_X^-\right)}_{S_Y^-}.
\]
That and the fact $S_{XY}=S_X$ imply that
\[
H(\omega)=\frac{1}{S_Y^+}\left[\frac{e^{j\omega T}S_{XY}}{S_Y^-}\right]_+=\frac{1}{\sqrt{1+\gamma^2}\,S_X^+}\left[\frac{e^{j\omega T}S_X^+}{\sqrt{1+\gamma^2}}\right]_+=\frac{\kappa}{S_X^+(\omega)}\left[e^{j\omega T}S_X^+(\omega)\right]_+.
\]
Therefore $H$ is simply $\kappa$ times the optimal filter for predicting $X_{t+T}$ from $(X_s : s\le t)$.
In particular, if $T<0$ then $H(\omega)=\kappa e^{j\omega T}$, and the estimator of $X_{t+T}$ is simply $\widehat{X}_{t+T|t}=\kappa Y_{t+T}$, which agrees with part (a).

(c) As already observed, if $T>0$ then the optimal filter is $\kappa$ times the prediction filter for $X_{t+T}$ given $(X_s : s\le t)$.

9.8 Short answer filtering questions

(a) The convolution of a causal function $h$ with itself is causal, and $h*h$ has transform $H^2$. So if $H$ is a positive type function then $H^2$ is positive type.

(b) Since the intervals of support of $S_X$ and $S_Y$ do not intersect, $S_X(2\pi f)S_Y(2\pi f)\equiv 0$. Since $|S_{XY}(2\pi f)|^2\le S_X(2\pi f)S_Y(2\pi f)$ (by the first problem in Chapter 6) it follows that $S_{XY}\equiv 0$. Hence the assertion is true.

(c) Since $\mathrm{sinc}(f)$ is the Fourier transform of $I_{[-\frac{1}{2},\frac{1}{2}]}$, it follows that
\[
[H]_+(2\pi f)=\int_0^{\frac{1}{2}}e^{-2\pi fjt}\,dt=\frac{1}{2}e^{-\pi jf/2}\,\mathrm{sinc}\left(\frac{f}{2}\right).
\]

9.10 A singular estimation problem

(a) $E[X_t]=E[A]e^{j2\pi f_ot}=0$, which does not depend on $t$. $R_X(s,t)=E[Ae^{j2\pi f_os}(Ae^{j2\pi f_ot})^*]=\sigma_A^2e^{j2\pi f_o(s-t)}$ is a function of $s-t$. Thus, $X$ is WSS with $\mu_X=0$ and $R_X(\tau)=\sigma_A^2e^{j2\pi f_o\tau}$. Therefore, $S_X(2\pi f)=\sigma_A^2\delta(f-f_o)$, or equivalently, $S_X(\omega)=2\pi\sigma_A^2\delta(\omega-\omega_o)$. (This makes $R_X(\tau)=\int_{-\infty}^{\infty}S_X(2\pi f)e^{j2\pi f\tau}\,df=\int_{-\infty}^{\infty}S_X(\omega)e^{j\omega\tau}\frac{d\omega}{2\pi}$.)

(b) $(h*X)_t=\int_{-\infty}^{\infty}h(\tau)X_{t-\tau}\,d\tau=\int_0^{\infty}\alpha e^{-(\alpha-j2\pi f_o)\tau}Ae^{j2\pi f_o(t-\tau)}\,d\tau=\int_0^{\infty}\alpha e^{-\alpha\tau}\,d\tau\,Ae^{j2\pi f_ot}=X_t$. Another way to see this is to note that $X$ is a pure tone sinusoid at frequency $f_o$, and $H(2\pi f_o)=1$.

(c) In view of part (b), the mean square error is the power of the output due to the noise, or
\[
\mathrm{MSE}=(h*\widetilde{h}*R_N)(0)=\int_{-\infty}^{\infty}(h*\widetilde{h})(t)R_N(0-t)\,dt=\sigma_N^2\,h*\widetilde{h}(0)=\sigma_N^2||h||^2=\sigma_N^2\int_0^{\infty}\alpha^2e^{-2\alpha t}\,dt=\frac{\sigma_N^2\alpha}{2}.
\]
The MSE can be made arbitrarily small by taking $\alpha$ small enough. That is, the minimum mean square error for estimation of $X_t$ from $(Y_s : s\le t)$ is zero.
Intuitively, the power of the signal $X$ is concentrated at a single frequency, while the noise power in a small interval around that frequency is small, so that perfect estimation is possible.

9.12 A prediction problem

The optimal prediction filter is given by $\frac{1}{S_X^+}\left[e^{j\omega T}S_X^+\right]_+$. Since $R_X(\tau)=e^{-|\tau|}$, the spectral factorization of $S_X$ is given by
\[
S_X(\omega)=\underbrace{\left(\frac{\sqrt{2}}{j\omega+1}\right)}_{S_X^+}\underbrace{\left(\frac{\sqrt{2}}{-j\omega+1}\right)}_{S_X^-},
\]
so $[e^{j\omega T}S_X^+]_+=e^{-T}S_X^+$ (see Figure 12.6). Thus the optimal prediction filter is $H(\omega)\equiv e^{-T}$, or in the time domain it is $h(t)=e^{-T}\delta(t)$, so that $\widehat{X}_{T+t|t}=e^{-T}X_t$.

[Figure 12.6: $\sqrt{2}\,e^{j\omega T}S_X^+$ in the time domain.]

This simple form can be explained and derived another way. Since linear estimation is being considered, only the means (assumed zero) and correlation functions of the processes matter. We can therefore assume without loss of generality that $X$ is a real valued Gaussian process. By the form of $R_X$ we recognize that $X$ is Markov, so the best estimate of $X_{T+t}$ given $(X_s : s\le t)$ is a function of $X_t$ alone. Since $X$ is Gaussian with mean zero, the optimal estimator of $X_{t+T}$ given $X_t$ is $E[X_{t+T}|X_t]=\frac{\mathrm{Cov}(X_{t+T},X_t)X_t}{\mathrm{Var}(X_t)}=e^{-T}X_t$.

9.14 Spectral decomposition and factorization

(a) Building up transform pairs by steps yields:
\[
\mathrm{sinc}(f)\leftrightarrow I_{\{-\frac{1}{2}\le t\le\frac{1}{2}\}}
\]
\[
\mathrm{sinc}(100f)\leftrightarrow 10^{-2}I_{\{-\frac{1}{2}\le\frac{t}{100}\le\frac{1}{2}\}}
\]
\[
\mathrm{sinc}(100f)e^{2\pi jfT}\leftrightarrow 10^{-2}I_{\{-\frac{1}{2}\le\frac{t+T}{100}\le\frac{1}{2}\}}
\]
\[
\left[\mathrm{sinc}(100f)e^{j2\pi fT}\right]_+\leftrightarrow 10^{-2}I_{\{-50-T\le t\le 50-T\}\cap\{t\ge 0\}}
\]
so
\[
||x||^2=10^{-4}\cdot\text{length of }([-50-T,50-T]\cap[0,+\infty))=\begin{cases}10^{-2} & T\le -50\\ 10^{-4}(50-T) & -50\le T\le 50\\ 0 & T\ge 50.\end{cases}
\]

(b) By the hint, $1+3j$ is a pole of $S$. (Without the hint, the poles can be found by first solving for values of $\omega^2$ for which the denominator of $S$ is zero.) Since $S$ is real valued, $1-3j$ must also be a pole of $S$. Since $S$ is an even function, i.e. $S(\omega)=S(-\omega)$, $-(1+3j)$ and $-(1-3j)$ must also be poles.
Indeed, we find
\[ S(\omega) = \frac{1}{(\omega - (1 + 3j))(\omega - (1 - 3j))(\omega + 1 + 3j)(\omega + 1 - 3j)}, \]
or, multiplying each term by $j$ (and using $j^4 = 1$) and rearranging terms:
\[ S(\omega) = \underbrace{\frac{1}{(j\omega + 3 + j)(j\omega + 3 - j)}}_{S^+(\omega)}\; \underbrace{\frac{1}{(-j\omega + 3 + j)(-j\omega + 3 - j)}}_{S^-(\omega)} \]
or $S^+(\omega) = \frac{1}{(j\omega)^2 + 6j\omega + 10}$. The choice of $S^+$ is unique up to multiplication by a unit magnitude constant.

9.16 Estimation of a random signal, using the KL expansion

(a) Note that $(Y, \phi_j) = (X, \phi_j) + (N, \phi_j)$ for all $j$, where the variables $(X, \phi_j)$, $j \ge 1$, and $(N, \phi_j)$, $j \ge 1$, are all mutually orthogonal, with $E[|(X, \phi_j)|^2] = \lambda_j$ and $E[|(N, \phi_j)|^2] = \sigma^2$. Observation of the process $Y$ is linearly equivalent to observation of $((Y, \phi_j) : j \ge 1)$. Since these random variables are orthogonal and all random variables are mean zero, the MMSE estimator is the sum of the projections onto the individual observations, $(Y, \phi_j)$. But for fixed $i$, only the $i$th observation, $(Y, \phi_i) = (X, \phi_i) + (N, \phi_i)$, is not orthogonal to $(X, \phi_i)$. Thus, the optimal linear estimator of $(X, \phi_i)$ given $Y$ is
\[ \frac{\mathrm{Cov}((X, \phi_i), (Y, \phi_i))}{\mathrm{Var}((Y, \phi_i))}\,(Y, \phi_i) = \frac{\lambda_i (Y, \phi_i)}{\lambda_i + \sigma^2}. \]
The mean square error is (using the orthogonality principle):
\[ E[|(X, \phi_i)|^2] - E\left[\left|\frac{\lambda_i (Y, \phi_i)}{\lambda_i + \sigma^2}\right|^2\right] = \lambda_i - \frac{\lambda_i^2 (\lambda_i + \sigma^2)}{(\lambda_i + \sigma^2)^2} = \frac{\lambda_i \sigma^2}{\lambda_i + \sigma^2}. \]

(b) Since $f(t) = \sum_j (f, \phi_j)\phi_j(t)$, we have $(X, f) = \sum_j (f, \phi_j)(X, \phi_j)$. That is, the random variable to be estimated is the sum of the random variables of the form treated in part (a). Thus, the best linear estimator of $(X, f)$ given $Y$ can be written as the corresponding weighted sum of linear estimators:
\[ (\text{MMSE estimator of } (X, f) \text{ given } Y) = \sum_i \frac{\lambda_i (Y, \phi_i)(f, \phi_i)}{\lambda_i + \sigma^2}. \]
The error in estimating $(X, f)$ is the sum of the errors for estimating the terms $(f, \phi_j)(X, \phi_j)$, and those errors are orthogonal. Thus, the mean square error for $(X, f)$ is the sum of the mean square errors of the individual terms:
\[ (\mathrm{MSE}) = \sum_i \frac{\lambda_i \sigma^2 |(f, \phi_i)|^2}{\lambda_i + \sigma^2}. \]
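The formulas of 9.16 reduce to coordinate-by-coordinate scalar operations, which the following Python sketch makes concrete for the real-valued case with finitely many KL coordinates. The function names and the sample eigenvalues, noise variance, and coefficients are hypothetical choices made for illustration; only the gain $\lambda_i/(\lambda_i + \sigma^2)$ and the error $\lambda_i\sigma^2/(\lambda_i + \sigma^2)$ come from the solution above.

```python
def kl_coordinate_gain(lam, sigma2):
    """Gain applied to (Y, phi_i) when estimating (X, phi_i)."""
    return lam / (lam + sigma2)


def kl_coordinate_mse(lam, sigma2):
    """Mean square error of the estimator of (X, phi_i)."""
    return lam * sigma2 / (lam + sigma2)


def kl_functional_estimate(y_coeffs, f_coeffs, lams, sigma2):
    """Estimator of (X, f): weighted sum of the per-coordinate estimators.

    y_coeffs[i] plays the role of (Y, phi_i) and f_coeffs[i] of (f, phi_i);
    real-valued coefficients are assumed.
    """
    return sum(
        kl_coordinate_gain(lam, sigma2) * y * f
        for y, f, lam in zip(y_coeffs, f_coeffs, lams)
    )


def kl_functional_mse(f_coeffs, lams, sigma2):
    """Mean square error for (X, f): sum of the orthogonal per-term errors."""
    return sum(
        kl_coordinate_mse(lam, sigma2) * abs(f) ** 2
        for f, lam in zip(f_coeffs, lams)
    )


if __name__ == "__main__":
    lams = [4.0, 1.0, 0.25]   # assumed eigenvalues lambda_i
    sigma2 = 1.0              # assumed noise variance
    f_coeffs = [1.0, 2.0, 0.0]
    y_coeffs = [5.0, 2.0, 1.0]
    print(kl_functional_estimate(y_coeffs, f_coeffs, lams, sigma2))
    print(kl_functional_mse(f_coeffs, lams, sigma2))
```

Coordinates with $\lambda_i \gg \sigma^2$ are passed nearly unchanged, while coordinates with $\lambda_i \ll \sigma^2$ are mostly suppressed, so the per-coordinate error never exceeds $\min(\lambda_i, \sigma^2)$.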
9.18 Linear innovations and spectral factorization

First approach: The first approach is motivated by the fact that $\frac{1}{\mathcal{S}_Y^+}$ is a whitening filter. Let $H(z) = \frac{\beta}{\mathcal{S}_X^+(z)}$ and let $Y$ be the output when $X$ is passed through a linear time-invariant system with z-transform $H(z)$. We prove that $Y$ is the innovations process for $X$. Since $H$ is positive type and $\lim_{|z| \to \infty} H(z) = 1$, it follows that
\[ Y_k = X_k + h(1) X_{k-1} + h(2) X_{k-2} + \cdots \]
Since $\mathcal{S}_Y(z) = H(z) H^*(1/z^*) \mathcal{S}_X(z) \equiv \beta^2$, it follows that $R_Y(k) = \beta^2 I_{\{k=0\}}$. In particular,
\[ Y_k \perp \text{linear span of } \{Y_{k-1}, Y_{k-2}, \cdots\}. \]
Since $H$ and $1/H$ both correspond to causal filters, the linear span of $\{Y_{k-1}, Y_{k-2}, \cdots\}$ is the same as the linear span of $\{X_{k-1}, X_{k-2}, \cdots\}$. Thus, the above orthogonality condition becomes
\[ X_k - \left(-h(1) X_{k-1} - h(2) X_{k-2} - \cdots\right) \perp \text{linear span of } \{X_{k-1}, X_{k-2}, \cdots\}. \]
Therefore $-h(1) X_{k-1} - h(2) X_{k-2} - \cdots$ must equal $\widehat{X}_{k|k-1}$, the one step predictor for $X_k$. Thus, $(Y_k)$ is the innovations sequence for $(X_k)$. The one step prediction error is $E[|Y_k|^2] = R_Y(0) = \beta^2$.

Second approach: The filter $\mathcal{K}$ for the optimal one-step linear predictor $(\widehat{X}_{k+1|k})$ is given by (take $T = 1$ in the general formula):
\[ \mathcal{K} = \frac{1}{\mathcal{S}_X^+}\left[z \mathcal{S}_X^+\right]_+. \]
The z-transform $z\mathcal{S}_X^+$ corresponds to a function in the time domain with value $\beta$ at time $-1$, and value zero at all other negative times, so $[z\mathcal{S}_X^+]_+ = z\mathcal{S}_X^+ - z\beta$. Hence $\mathcal{K}(z) = z - \frac{z\beta}{\mathcal{S}_X^+(z)}$. If $X$ is filtered using $\mathcal{K}$, the output at time $k$ is $\widehat{X}_{k+1|k}$. So if $X$ is filtered using $1 - \frac{\beta}{\mathcal{S}_X^+(z)}$, the output at time $k$ is $\widehat{X}_{k|k-1}$. So if $X$ is filtered using $H(z) = 1 - \left(1 - \frac{\beta}{\mathcal{S}_X^+(z)}\right) = \frac{\beta}{\mathcal{S}_X^+(z)}$ then the output at time $k$ is $X_k - \widehat{X}_{k|k-1} = \widetilde{X}_k$, the innovations sequence. The output $\widetilde{X}$ has $\mathcal{S}_{\widetilde{X}}(z) \equiv \beta^2$, so the prediction error is $R_{\widetilde{X}}(0) = \beta^2$.
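To see the whitening property of $H(z) = \beta/\mathcal{S}_X^+(z)$ in a concrete case, the Python sketch below works with an assumed example that is not taken from the notes: the AR(1) process $X_k = aX_{k-1} + \beta V_k$ with $|a| < 1$ and $V$ white with unit variance. For that process $\mathcal{S}_X^+(z) = \beta/(1 - a/z)$, so $H(z) = 1 - a/z$, that is, $h(0) = 1$ and $h(1) = -a$. The function names and parameter values are hypothetical.

```python
def ar1_autocorrelation(k, a, beta):
    """R_X(k) = beta^2 * a^|k| / (1 - a^2) for X_k = a*X_{k-1} + beta*V_k, |a| < 1."""
    return beta ** 2 * a ** abs(k) / (1.0 - a ** 2)


def output_autocorrelation(h, r_in, k):
    """R_Y(k) for Y_n = sum_i h[i] * X_{n-i}, with h real and supported on 0..len(h)-1.

    Uses R_Y(k) = sum_i sum_j h[i] * h[j] * R_X(k - i + j).
    """
    return sum(
        h[i] * h[j] * r_in(k - i + j)
        for i in range(len(h))
        for j in range(len(h))
    )


if __name__ == "__main__":
    a, beta = 0.6, 2.0          # assumed AR(1) parameters
    h = [1.0, -a]               # impulse response of H(z) = 1 - a/z

    def r_x(k):
        return ar1_autocorrelation(k, a, beta)

    for k in range(4):
        print(k, round(output_autocorrelation(h, r_x, k), 12))
```

In this example the filter output has $R_Y(0) = \beta^2$ and $R_Y(k) = 0$ for $k \ne 0$, and the one step predictor read off from $h$ is $\widehat{X}_{k|k-1} = aX_{k-1}$, in line with both approaches above.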
9.20 A discrete-time Wiener filtering problem

To begin,
\[ \frac{z^T \mathcal{S}_{XY}(z)}{\mathcal{S}_Y^-(z)} = \frac{z^T}{\beta(1 - \rho/z)(1 - z_o\rho)} + \frac{z^{T+1}}{\beta\left(\frac{1}{z_o} - \rho\right)(1 - z_o z)}. \]
The right hand side corresponds in the time domain to the sum of an exponential function supported on $-T, -T+1, -T+2, \ldots$ and an exponential function supported on $-T-1, -T-2, \ldots$. If $T \ge 0$ then only the first term contributes to the positive part, yielding
\[ \left[\frac{z^T \mathcal{S}_{XY}}{\mathcal{S}_Y^-}\right]_+ = \frac{z_o^T}{\beta(1 - \rho/z)(1 - z_o\rho)}, \qquad H(z) = \frac{z_o^T}{\beta^2 (1 - z_o\rho)(1 - z_o/z)}, \]
and
\[ h(n) = \frac{z_o^T}{\beta^2(1 - z_o\rho)}\, z_o^n\, I_{\{n \ge 0\}}. \]
On the other hand, if $T \le 0$ then
\[ \left[\frac{z^T \mathcal{S}_{XY}}{\mathcal{S}_Y^-}\right]_+ = \frac{z^T}{\beta(1 - \rho/z)(1 - z_o\rho)} + \frac{z(z^T - z_o^T)}{\beta\left(\frac{1}{z_o} - \rho\right)(1 - z_o z)} \]
so
\[ H(z) = \frac{z^T}{\beta^2(1 - z_o\rho)(1 - z_o/z)} + \frac{z(z^T - z_o^T)(1 - \rho/z)}{\beta^2\left(\frac{1}{z_o} - \rho\right)(1 - z_o z)(1 - z_o/z)}. \]
Inverting the z-transforms and arranging terms yields that the impulse response function for the optimal filter is given by
\[ h(n) = \frac{1}{\beta^2(1 - z_o^2)} \left\{ z_o^{|n+T|} - \left(\frac{z_o - \rho}{\frac{1}{z_o} - \rho}\right) z_o^{n+T} \right\} I_{\{n \ge 0\}}. \tag{12.13} \]
Graphically, $h$ is the sum of a two-sided symmetric exponential function, slid to the right by $-T$ and set to zero for negative times, minus a one sided exponential function on the nonnegative integers. (This structure can be deduced by considering that the optimal causal estimator of $X_{t+T}$ is the optimal causal estimator of the optimal noncausal estimator of $X_{t+T}$.) Going back to the z-transform domain, we find that $H$ can be written as
\[ H(z) = \left[\frac{z^T}{\beta^2(1 - z_o/z)(1 - z_o z)}\right]_+ - \frac{z_o^T (z_o - \rho)}{\beta^2(1 - z_o^2)\left(\frac{1}{z_o} - \rho\right)(1 - z_o/z)}. \tag{12.14} \]
Although it is helpful to think of the cases $T \ge 0$ and $T \le 0$ separately, interestingly enough, the expressions (12.13) and (12.14) for the optimal $h$ and $H$ hold for any integer value of $T$.

9.22 Estimation given a strongly correlated process

(a) $R_X = g * \widetilde{g} \leftrightarrow \mathcal{S}_X(z) = \mathcal{G}(z)\mathcal{G}^*(1/z^*)$,
$R_Y = k * \widetilde{k} \leftrightarrow \mathcal{S}_Y(z) = \mathcal{K}(z)\mathcal{K}^*(1/z^*)$,
$R_{XY} = g * \widetilde{k} \leftrightarrow \mathcal{S}_{XY}(z) = \mathcal{G}(z)\mathcal{K}^*(1/z^*)$.

(b) Note that $\mathcal{S}_Y^+(z) = \mathcal{K}(z)$ and $\mathcal{S}_Y^-(z) = \mathcal{K}^*(1/z^*)$.
By the formula for the optimal causal estimator,
\[ H(z) = \frac{1}{\mathcal{S}_Y^+}\left[\frac{\mathcal{S}_{XY}}{\mathcal{S}_Y^-}\right]_+ = \frac{1}{\mathcal{K}(z)}\left[\frac{\mathcal{G}(z)\mathcal{K}^*(1/z^*)}{\mathcal{K}^*(1/z^*)}\right]_+ = \frac{[\mathcal{G}]_+}{\mathcal{K}} = \frac{\mathcal{G}(z)}{\mathcal{K}(z)}. \]

(c) The power spectral density of the estimator process $\widehat{X}$ is given by $H(z)H^*(1/z^*)\mathcal{S}_Y(z) = \mathcal{S}_X(z)$. Therefore,
\[ \mathrm{MSE} = R_X(0) - R_{\widehat{X}}(0) = \int_{-\pi}^{\pi} \mathcal{S}_X(e^{j\omega})\,\frac{d\omega}{2\pi} - \int_{-\pi}^{\pi} \mathcal{S}_{\widehat{X}}(e^{j\omega})\,\frac{d\omega}{2\pi} = 0. \]
A simple explanation for why the MSE is zero is the following. Using $\frac{1}{\mathcal{K}}$ inverts $\mathcal{K}$, so that filtering $Y$ with $\frac{1}{\mathcal{K}}$ produces the process $W$. Filtering that with $\mathcal{G}$ then yields $X$. That is, filtering $Y$ with $H$ produces $X$, so the estimation error is zero.

10.2 A covering problem

(a) Let $X_i$ denote the location of the $i$th base station. Then $F = f(X_1, \ldots, X_m)$, where $f$ satisfies the Lipschitz condition with constant $(2r - 1)$. Thus, by the method of bounded differences based on the Azuma-Hoeffding inequality,
\[ P\{|F - E[F]| \ge \gamma\} \le 2\exp\left(-\frac{\gamma^2}{m(2r-1)^2}\right). \]

(b) Using the Poisson method and associated bound technique, we compare to the case that the number of stations has a Poisson distribution with mean $m$. Note that the mean number of stations that cover cell $i$ is $\frac{m(2r-1)}{n}$, unless cell $i$ is near one of the boundaries. If cells 1 and $n$ are covered, then all the other cells within distance $r$ of either boundary are covered. Thus,
\[
\begin{aligned}
P\{X \ge m\} &\le 2P\{\mathrm{Poi}(m) \text{ stations is not enough}\} \\
&\le 2n e^{-m(2r-1)/n} + P\{\text{cell 1 or cell } n \text{ is not covered}\} \\
&\to 0 \text{ as } n \to \infty \text{ if } m = \frac{(1+\epsilon)\, n \ln n}{2r - 1}.
\end{aligned}
\]
As to a bound going the other direction, note that if cells differ by $2r - 1$ or more then the events that they are covered are independent. Hence,
\[
\begin{aligned}
P\{X \le m\} &\le 2P\{\mathrm{Poi}(m) \text{ stations cover all cells}\} \\
&\le 2P\left\{\mathrm{Poi}(m) \text{ stations cover cells } 1 + (2r-1)j,\ 1 \le j \le \frac{n-1}{2r-1}\right\} \\
&\le 2\left(1 - e^{-\frac{m(2r-1)}{n}}\right)^{\frac{n-1}{2r-1}} \\
&\le 2\exp\left(-e^{-\frac{m(2r-1)}{n}} \cdot \frac{n-1}{2r-1}\right) \\
&\to 0 \text{ as } n \to \infty \text{ if } m = \frac{(1-\epsilon)\, n \ln n}{2r - 1}.
\end{aligned}
\]
Thus, in conclusion, we can take $g_1(r) = g_2(r) = \frac{1}{2r-1}$.

10.4 On uniform integrability

(a) Use the small set characterization of u.i.
First,
\[ \sup_i E[|X_i + Y_i|] \le \sup_i \left(E[|X_i|] + E[|Y_i|]\right) \le \left(\sup_i E[|X_i|]\right) + \left(\sup_i E[|Y_i|]\right) < \infty. \]
Second, given $\epsilon > 0$, there exists $\delta$ so small that $E[|X_i| I_A] \le \frac{\epsilon}{2}$ and $E[|Y_i| I_A] \le \frac{\epsilon}{2}$ for all $i$, whenever $A$ is an event with $P\{A\} \le \delta$. Therefore $E[|X_i + Y_i| I_A] \le E[|X_i| I_A] + E[|Y_i| I_A] \le \epsilon$ for all $i$, whenever $A$ is an event with $P\{A\} \le \delta$. Thus, $(X_i + Y_i : i \in I)$ is u.i.

(b) Suppose $(X_i : i \in I)$ is u.i., which by definition means that $\lim_{c \to \infty} K(c) = 0$, where $K(c) = \sup_{i \in I} E[|X_i| I_{\{|X_i| \ge c\}}]$. Let $c_n$ be a sequence of positive numbers monotonically converging to $+\infty$ so that $K(c_n) \le 2^{-n}$. Let $\varphi(u) = \sum_{n=1}^{\infty} (u - c_n)_+$. The sum is well defined, because for any $u$, only finitely many terms are nonzero. The function $\varphi$ is convex and increasing because each term in the sum is. The slope of $\varphi(u)$ is at least $n$ on the interval $(c_n, \infty)$ for all $n$, so that $\lim_{u \to \infty} \frac{\varphi(u)}{u} = +\infty$. Also, for any $i \in I$ and $c$, $E[(|X_i| - c)_+] = E[(|X_i| - c) I_{\{|X_i| \ge c\}}] \le K(c)$. Therefore,
\[ E[\varphi(|X_i|)] \le \sum_{n=1}^{\infty} E[(|X_i| - c_n)_+] \le \sum_{n=1}^{\infty} 2^{-n} \le 1. \]
Thus, the function $\varphi$ has the required properties.

The converse is now proved. Suppose $\varphi$ exists with the desired properties, and let $\epsilon > 0$. Select $c_o$ so large that $u \ge c_o$ implies that $u \le \epsilon\varphi(u)$. Then, for all $c \ge c_o$,
\[ E[|X_i| I_{\{|X_i| \ge c\}}] \le \epsilon E[\varphi(|X_i|) I_{\{|X_i| \ge c\}}] \le \epsilon E[\varphi(|X_i|)] \le \epsilon K. \]
Since this inequality holds for all $i \in I$, and $\epsilon$ is arbitrary, it follows that $(X_i : i \in I)$ is u.i.

10.6 A stopped random walk

(a) By the fact $S_0 = 0$ and the choice of $\tau$, $|S_n| \le c$ for $0 \le n \le \tau - 1$. Since the $W$'s are bounded by $D$, it follows that $|S_\tau| \le c + D$. Therefore, $S_{n \wedge \tau}$ is a bounded sequence of random variables, so $0 = E[S_{n \wedge \tau}] \to E[S_\tau]$ as $n \to \infty$, by the dominated convergence theorem. (Or we could invoke the first or third form of the optional stopping theorem discussed in class. Note here that we are taking it for granted that $\tau < \infty$ with probability one.
If we don't want to make that assumption, we could appeal to part (b) or some other method.)

(b) Let $\mathcal{F}_n = \sigma(W_1, \ldots, W_n)$. Then $M_{n+1} - M_n = 2\widetilde{W}_{n+1}\widetilde{S}_n + \left(\widetilde{W}_{n+1}^2 - \widetilde{\sigma}^2\right)$ so that $E[M_{n+1} - M_n \mid \mathcal{F}_n] = 2E[\widetilde{W}_{n+1} \mid \mathcal{F}_n]\widetilde{S}_n + E[\widetilde{W}_{n+1}^2 - \widetilde{\sigma}^2] = 0$, so that $M$ is a martingale. Therefore, $E[M_{\tau \wedge n}] = 0$ for all $n$, or $E[\tau \wedge n] = E[\widetilde{S}_{n \wedge \tau}^2]/\widetilde{\sigma}^2$. Arguing as in part (a), we see that $|\widetilde{S}_{n \wedge \tau}| \le c + a \wedge b$ for all $n$. Thus, $E[\tau \wedge n] \le \frac{(c + a \wedge b)^2}{\widetilde{\sigma}^2}$ for all $n$. But $E[\tau \wedge n] \to E[\tau]$ as $n \to \infty$ by the monotone convergence theorem, so that $E[\tau] \le \frac{(c + a \wedge b)^2}{\widetilde{\sigma}^2}$.

(c) We have that $E[|S_{n+1} - S_n| \mid \mathcal{F}_n] = C$ for all $n$ and $E[\tau] < \infty$ by part (b), so by the third version of the optional stopping theorem discussed in class, $E[S_\tau] = E[S_0] = 0$.

10.8 On the size of a maximum matching in a random bipartite graph

(a) Many bounds are possible. Generally the tighter bounds are more complex. If $d = 1$, then every vertex in $V$ of degree one or more can be included in a single matching, and such a matching is a maximum matching with expected size $n\left(1 - \left(1 - \frac{1}{n}\right)^n\right) \ge n(1 - e^{-1})$. The cardinality of a maximum matching cannot decrease as more edges are added, so $a = n(1 - e^{-1})$ is a lower bound for any $d$ and $n$. An upper bound is given by the expected number of vertices of $V$ with degree at least one, so $b = n\left(1 - \left(1 - \frac{d}{n}\right)^n\right) \approx n(1 - e^{-d})$ works.

(b) The variable $Z$ can be expressed as $Z = F(V_1, \ldots, V_n)$, where $V_i$ is the set of neighbors of $u_i$, and $F$ satisfies the Lipschitz condition with constant 1. We are thus considering a martingale associated with $Z$ by exposing the vertices of $U$ one at a time. The (improved version of the) Azuma-Hoeffding inequality yields that $P\{|Z - E[Z]| \ge \gamma\sqrt{n}\} \le 2e^{-2\gamma^2}$.

(c) Process the vertices of $U$ one at a time. Select an unprocessed vertex in $U$. If the vertex has positive degree, select one of the edges associated with it and add it to the matching.
Remove edges from the graph that become ineligible for the matching, and reduce the degrees of the unprocessed vertices correspondingly. (This algorithm can be improved by selecting the next vertex in $U$ from among those with smallest reduced degree, and also selecting outgoing edges with endpoint in $V$ having the smallest reduced degree in $V$, or both. If every matched edge has an endpoint with reduced degree one at the time of matching, the resulting matching has maximum cardinality.)

Bibliography

[1] S. Asmussen. Applied Probability and Queues. Springer, second edition, 2003.
[2] L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164–171, 1970.
[3] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39(1):1–38, 1977.
[4] F.G. Foster. On the stochastic matrices associated with certain queueing processes. Ann. Math. Statist., 24:355–360, 1953.
[5] J.F.C. Kingman. Some inequalities for the queue GI/G/1. Biometrika, 49(3/4):315–324, 1962.
[6] P.R. Kumar and S.P. Meyn. Stability of queueing networks and scheduling policies. IEEE Trans. on Automatic Control, 40(2):251–260, 1995.
[7] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.
[8] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand.
Achieving 100% throughput in an input-queued switch. IEEE Trans. Communications, 47(8):1260–1267, 1999.
[9] S.P. Meyn and R.L. Tweedie. Markov chains and stochastic stability. Springer-Verlag London, 1993.
[10] J.R. Norris. Markov Chains. Cambridge University Press, 1997.
[11] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.
[12] L. Tassiulas. Scheduling and performance limits of networks with constantly changing topology. IEEE Trans. Information Theory, 43(3):1067–1073, 1997.
[13] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Trans. on Automatic Control, 37(12):1936–1948, 1992.
[14] L. Tassiulas and A. Ephremides. Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Trans. Information Theory, 39(2):466–478, 1993.
[15] R.L. Tweedie. Existence of moments for stationary Markov chains. Journal of Applied Probability, 20(1):191–196, 1983.
[16] C. Wu. On the convergence property of the EM algorithm. The Annals of Statistics, 11:95–103, 1983.

Preface

From an applications viewpoint, the main reason to study the subject of these notes is to help deal with the complexity of describing random, time-varying functions. A random variable can be interpreted as the result of a single measurement. The distribution of a single random variable is fairly simple to describe. It is completely specified by the cumulative distribution function $F(x)$, a function of one variable. It is relatively easy to approximately represent a cumulative distribution function on a computer. The joint distribution of several random variables is much more complex, for in general, it is described by a joint cumulative probability distribution function, $F(x_1, x_2, \ldots, x_n)$, which is much more complicated than $n$ functions of one variable.
A random process, for example a model of time-varying fading in a communication channel, involves many, possibly infinitely many (one for each time instant $t$ within an observation interval) random variables. Woe the complexity!

These notes help prepare the reader to understand and use the following methods for dealing with the complexity of random processes:

• Work with moments, such as means and covariances.

• Use extensively processes with special properties. Most notably, Gaussian processes are characterized entirely by means and covariances, Markov processes are characterized by one-step transition probabilities or transition rates, and initial distributions. Independent increment processes are characterized by the distributions of single increments.

• Appeal to models or approximations based on limit theorems for reduced complexity descriptions, especially in connection with averages of independent, identically distributed random variables. The law of large numbers tells us that, in a certain context, a probability distribution can be characterized by its mean alone. The central limit theorem similarly tells us that a probability distribution can be characterized by its mean and variance. These limit theorems are analogous to, and in fact examples of, perhaps the most powerful tool ever discovered for dealing with the complexity of functions: Taylor's theorem, in which a function in a small interval can be approximated using its value and a small number of derivatives at a single point.

• Diagonalize. A change of coordinates reduces an arbitrary $n$-dimensional Gaussian vector into a Gaussian vector with $n$ independent coordinates. In the new coordinates the joint probability distribution is the product of $n$ one-dimensional distributions, representing a great reduction of complexity. Similarly, a random process on an interval of time is diagonalized by the Karhunen-Loève representation.
A periodic random process is diagonalized by a Fourier e series representation. Stationary random processes are diagonalized by Fourier transforms. • Sample. A narrowband continuous time random process can be exactly represented by its samples taken with sampling rate twice the highest frequency of the random process. The samples offer a reduced complexity representation of the original process. • Work with baseband equivalent. The range of frequencies in a typical radio transmission is much smaller than the center frequency, or carrier frequency, of the transmission. The signal could be represented directly by sampling at twice the largest frequency component. However, the sampling frequency, and hence the complexity, can be dramatically reduced by sampling a baseband equivalent random process. These notes were written for the first semester graduate course on random processes, offered by the Department of Electrical and Computer Engineering at the University of Illinois at UrbanaChampaign. Students in the class are assumed to have had a previous course in probability, which is briefly reviewed in the first chapter of these notes. Students are also expected to have some familiarity with real analysis and elementary linear algebra, such as the notions of limits, definitions of derivatives, Riemann integration, and diagonalization of symmetric matrices. These topics are reviewed in the appendix. Finally, students are expected to have some familiarity with transform methods and complex analysis, though the concepts used are reviewed in the relevant chapters. Each chapter represents roughly two weeks of lectures, and includes homework problems. Solutions to the even numbered problems without stars can be found at the end of the notes. Students are encouraged to first read a chapter, then try doing the even numbered problems before looking at the solutions. 
Problems with stars, for the most part, investigate additional theoretical issues, and solutions are not provided.

Hopefully some students reading these notes will find them useful for understanding the diverse technical literature on systems engineering, spanning control systems, image processing, communication theory, and communication network performance analysis. Hopefully some students will go on to design systems, and define and analyze stochastic models. Hopefully others will be motivated to continue study in probability theory, going on to learn measure theory and its applications to probability and analysis in general.

A brief comment is in order on the level of rigor and generality at which these notes are written. Engineers and scientists have great intuition and ingenuity, and routinely use methods that are not typically taught in undergraduate mathematics courses. For example, engineers generally have good experience and intuition about transforms, such as Fourier transforms, Fourier series, and z-transforms, and some associated methods of complex analysis. In addition, they routinely use generalized functions; in particular, the delta function is frequently used. The use of these concepts in these notes leverages this knowledge, and it is consistent with mathematical definitions, but full mathematical justification is not given in every instance. The mathematical background required for a full mathematically rigorous treatment of the material in these notes is roughly at the level of a second year graduate course in measure theoretic probability, pursued after a course on measure theory.

The author gratefully acknowledges the students and faculty (Todd Coleman, Christoforos Hadjicostis, Andrew Singer, R. Srikant, and Venu Veeravalli) in the past five years for their comments and corrections.

Bruce Hajek
July 2011

Organization

The first four chapters of the notes are used heavily in the remaining chapters, so that most readers should cover those chapters before moving on.

Chapter 1 is meant primarily as a review of concepts found in a typical first course on probability theory, with an emphasis on axioms and the definition of expectation.

Chapter 2 focuses on various ways in which a sequence of random variables can converge, and the basic limit theorems of probability: law of large numbers, central limit theorem, and the asymptotic behavior of large deviations.

Chapter 3 focuses on minimum mean square error estimation and the orthogonality principle.

Chapter 4 introduces the notion of random process, and briefly covers several examples and classes of random processes. Markov processes and martingales are introduced in this chapter, but are covered in greater depth in later chapters.

The following four additional topics can be covered independently of each other.

Chapter 5 describes the use of Markov processes for modeling and statistical inference. Applications include natural language processing.

Chapter 6 describes the use of Markov processes for modeling and analysis of dynamical systems. Applications include the modeling of queueing systems.

Chapter 7-9 These three chapters develop calculus for random processes based on mean square convergence, moving to linear filtering, orthogonal expansions, and ending with causal and noncausal Wiener filtering.

Chapter 10 This chapter explores martingales with respect to filtrations.

In previous semesters, Chapters 1-4 and 7-9 were covered in ECE 534 at Illinois. In Fall 2011, ECE 534 at Illinois Chapters 1-5, Sections 6.1-6.8, Chapter 7, Sections 8.1-8.4, and Section 9.1 are to be covered.

A number of background topics are covered in the appendix, including basic notation.

Chapter 1 Getting Started

This chapter reviews many of the main concepts in a first level course on probability theory with more emphasis on axioms and the definition of expectation than is typical of a first course.

1.1 The axioms of probability theory

Random processes are widely used to model systems in engineering and scientific applications. These notes adopt the most widely used framework of probability and random processes, namely the one based on Kolmogorov's axioms of probability. The idea is to assume a mathematically solid definition of the model. This structure encourages a modeler to have a consistent, if not accurate, model.

A probability space is a triplet $(\Omega, \mathcal{F}, P)$. The first component, $\Omega$, is a nonempty set. Each element $\omega$ of $\Omega$ is called an outcome and $\Omega$ is called the sample space. The second component, $\mathcal{F}$, is a set of subsets of $\Omega$ called events. The set of events $\mathcal{F}$ is assumed to be a σ-algebra, meaning it satisfies the following axioms (see Appendix 11.1 for set notation):

A.1 $\Omega \in \mathcal{F}$

A.2 If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$

A.3 If $A, B \in \mathcal{F}$ then $A \cup B \in \mathcal{F}$. Also, if $A_1, A_2, \ldots$ is a sequence of elements in $\mathcal{F}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$

If $\mathcal{F}$ is a σ-algebra and $A, B \in \mathcal{F}$, then $AB \in \mathcal{F}$ by A.2, A.3 and the fact $AB = (A^c \cup B^c)^c$. By the same reasoning, if $A_1, A_2, \ldots$ is a sequence of elements in a σ-algebra $\mathcal{F}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$.

Events $A_i$, $i \in I$, indexed by a set $I$ are called mutually exclusive if the intersection $A_i A_j = \emptyset$ for all $i, j \in I$ with $i \neq j$. The final component, $P$, of the triplet $(\Omega, \mathcal{F}, P)$ is a probability measure on $\mathcal{F}$ satisfying the following axioms:

P.1 $P[A] \geq 0$ for all $A \in \mathcal{F}$

P.2 If $A, B \in \mathcal{F}$ and if $A$ and $B$ are mutually exclusive, then $P[A \cup B] = P[A] + P[B]$. Also, if $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{F}$ then $P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i]$.

P.3 $P[\Omega] = 1$.

The axioms imply a host of properties including the following. For any events $A, B, C$ in $\mathcal{F}$:

• If $A \subset B$ then $P[A] \leq P[B]$
• $P[A \cup B] = P[A] + P[B] - P[AB]$
• $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[AB] - P[AC] - P[BC] + P[ABC]$
• $P[A] + P[A^c] = 1$
• $P[\emptyset] = 0$.
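The properties listed above can be checked exhaustively on a small finite probability space. The following is a minimal sketch, not part of the original notes: it assumes a hypothetical model (one roll of a fair die, with the set of events taken to be all subsets of the sample space), and the names `omega`, `prob`, `all_events`, and `check_properties` are introduced here for illustration only.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite model: one roll of a fair die.
# The sigma-algebra is taken to be the set of all subsets of omega.
omega = frozenset(range(1, 7))


def prob(event):
    """P[A] under the uniform measure on omega (an assumption of this sketch)."""
    return Fraction(len(event & omega), len(omega))


def all_events(sample_space):
    """Enumerate every subset of a finite sample space."""
    points = sorted(sample_space)
    return [frozenset(c)
            for r in range(len(points) + 1)
            for c in combinations(points, r)]


events = all_events(omega)


def check_properties():
    """Verify three consequences of the axioms for every pair of events."""
    for a in events:
        # P[A] + P[A^c] = 1
        assert prob(a) + prob(omega - a) == 1
        for b in events:
            # P[A u B] = P[A] + P[B] - P[AB]
            assert prob(a | b) == prob(a) + prob(b) - prob(a & b)
            # Monotonicity: A subset of B implies P[A] <= P[B]
            if a <= b:
                assert prob(a) <= prob(b)
    return True
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equalities are tested without floating-point tolerance. Calling `check_properties()` runs over all 64 events of this model.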
Example 1.1.1 (Toss of a fair coin) Using “H” for “heads” and “T” for “tails,” the toss of a fair coin is modelled by

$$\Omega = \{H, T\}, \qquad \mathcal{F} = \{\{H\}, \{T\}, \{H, T\}, \emptyset\},$$
$$P\{H\} = P\{T\} = \tfrac{1}{2}, \qquad P\{H, T\} = 1, \qquad P[\emptyset] = 0.$$

Note that, for brevity, we omitted the square brackets and wrote $P\{H\}$ instead of $P[\{H\}]$.

Example 1.1.2 (Standard unit-interval probability space) Take $\Omega = \{\theta : 0 \leq \theta \leq 1\}$. Imagine an experiment in which the outcome $\omega$ is drawn from $\Omega$ with no preference towards any subset. In particular, we want the set of events $\mathcal{F}$ to include intervals, and the probability of an interval $[a, b]$ with $0 \leq a \leq b \leq 1$ to be given by:

$$P[\,[a, b]\,] = b - a. \qquad (1.1)$$

Taking $a = b$, we see that $\mathcal{F}$ contains singleton sets $\{a\}$, and these sets have probability zero. Since $\mathcal{F}$ is to be a σ-algebra, it must also contain all the open intervals $(a, b)$ in $\Omega$, and for such an open interval, $P[\,(a, b)\,] = b - a$. Any open subset of $\Omega$ is the union of a finite or countably infinite set of open intervals, so that $\mathcal{F}$ should contain all open and all closed subsets of $\Omega$. Thus, $\mathcal{F}$ must contain any set that is the intersection of countably many open sets, and so on. The specification of the probability function $P$ must be extended from intervals to all of $\mathcal{F}$. It is not a priori clear how large $\mathcal{F}$ can be. It is tempting to take $\mathcal{F}$ to be the set of all subsets of $\Omega$. However, that idea doesn't work; see the starred homework showing that the length of all subsets of $\mathbb{R}$ can't be defined in a consistent way. The problem is resolved by taking $\mathcal{F}$ to be the smallest σ-algebra containing all the subintervals of $\Omega$, or equivalently, containing all the open subsets of $\Omega$. This σ-algebra is called the Borel σ-algebra for $[0, 1]$, and the sets in it are called Borel sets. While not every subset of $\Omega$ is a Borel subset, any set we are likely to encounter in applications is a Borel set. The existence of the Borel σ-algebra is discussed in an extra credit problem.
Furthermore, extension theorems of measure theory (see footnote 1) imply that $P$ can be extended from its definition (1.1) for interval sets to all Borel sets.

The smallest σ-algebra, $\mathcal{B}$, containing the open subsets of $\mathbb{R}$ is called the Borel σ-algebra for $\mathbb{R}$, and the sets in it are called Borel sets. Similarly, the Borel σ-algebra $\mathcal{B}^n$ of subsets of $\mathbb{R}^n$ is the smallest σ-algebra containing all sets of the form $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$. Sets in $\mathcal{B}^n$ are called Borel subsets of $\mathbb{R}^n$. The class of Borel sets includes not only rectangle sets and countable unions of rectangle sets, but all open sets and all closed sets. Virtually any subset of $\mathbb{R}^n$ arising in applications is a Borel set.

Example 1.1.3 (Repeated binary trials) Suppose we'd like to represent an infinite sequence of binary observations, where each observation is a zero or one with equal probability. For example, the experiment could consist of repeatedly flipping a fair coin, and recording a one for heads and a zero for tails, for each flip. Then an outcome $\omega$ would be an infinite sequence, $\omega = (\omega_1, \omega_2, \cdots)$, such that for each $i \geq 1$, $\omega_i \in \{0, 1\}$. Let $\Omega$ be the set of all such $\omega$'s. Certainly we'd want $\mathcal{F}$ to contain any set that can be defined in terms of only finitely many of the observations. In particular, for any binary sequence $(b_1, \cdots, b_n)$ of some finite length $n$, the set $\{\omega \in \Omega : \omega_i = b_i \text{ for } 1 \leq i \leq n\}$ should be in $\mathcal{F}$, and the probability of such a set should be $2^{-n}$. We take $\mathcal{F}$ to be the smallest σ-algebra of subsets of $\Omega$ containing all such sets. Again, the extension theorems of measure theory can be used to show that $P$ can be extended to a probability measure defined on all of $\mathcal{F}$.

For a specific example of a set in $\mathcal{F}$, consider $A = \{\omega \in \Omega : \omega_i = 1 \text{ for all even } i\}$. Let's check that $A \in \mathcal{F}$. The key is to express $A$ as $A = \bigcap_{n=1}^{\infty} A_n$, where $A_n = \{\omega \in \Omega : \omega_i = 1 \text{ for all even } i \text{ with } i \leq n\}$ for each $n$. For any $n$, $A_n$ is determined by only finitely many observations, so $A_n \in \mathcal{F}$.
Therefore, $A$ is the intersection of countably many sets in $\mathcal{F}$, so $A$ itself is in $\mathcal{F}$. Next, we identify $P[A]$, by first identifying $P[A_{2n}]$ for $n \geq 1$. For any sequence $b_1, b_2, \ldots, b_{2n}$ of length $2n$, the probability of the set $\{\omega \in \Omega : \omega_i = b_i \text{ for } 1 \leq i \leq 2n\}$ is $2^{-2n}$. The set $A_{2n}$ is the union of $2^n$ such sets, which are disjoint, due to the $2^n$ possible choices of $b_i$ for odd $i$ with $1 \leq i \leq 2n$. So $P[A_{2n}] = 2^n 2^{-2n} = 2^{-n}$. Furthermore, $A \subset A_{2n}$ so that $P[A] \leq P[A_{2n}] = 2^{-n}$. Since $n$ can be arbitrarily large, it follows that $P[A] = 0$.

The following lemma gives a continuity property of probability measures which is analogous to continuity of functions on $\mathbb{R}^n$, reviewed in Appendix 11.3. If $B_1, B_2, \ldots$ is a sequence of events such that $B_1 \subset B_2 \subset B_3 \subset \cdots$, then we can think that $B_j$ converges to the set $\bigcup_{i=1}^{\infty} B_i$ as $j \to \infty$. The lemma states that in this case, $P[B_j]$ converges to the probability of the limit set as $j \to \infty$.

Footnote 1: See, for example, H.L. Royden, Real Analysis, Third edition. Macmillan, New York, 1988, or S.R.S. Varadhan, Probability Theory Lecture Notes, American Mathematical Society, 2001. The σ-algebra $\mathcal{F}$ and $P$ can be extended somewhat further by requiring the following completeness property: if $B \subset A \in \mathcal{F}$ with $P[A] = 0$, then $B \in \mathcal{F}$ (and also $P[B] = 0$).

Lemma 1.1.4 (Continuity of Probability) Suppose $B_1, B_2, \ldots$ is a sequence of events.

(a) If $B_1 \subset B_2 \subset \cdots$ then $\lim_{j \to \infty} P[B_j] = P\left[\bigcup_{i=1}^{\infty} B_i\right]$

(b) If $B_1 \supset B_2 \supset \cdots$ then $\lim_{j \to \infty} P[B_j] = P\left[\bigcap_{i=1}^{\infty} B_i\right]$

Proof Suppose $B_1 \subset B_2 \subset \cdots$. Let $D_1 = B_1$, $D_2 = B_2 - B_1$, and, in general, let $D_i = B_i - B_{i-1}$ for $i \geq 2$, as shown in Figure 1.1.

[Figure 1.1: A sequence of nested sets, with $B_1 = D_1$ surrounded by the rings $D_2, D_3, \ldots$]

Then $P[B_j] = \sum_{i=1}^{j} P[D_i]$ for each $j \geq 1$, so

$$\lim_{j \to \infty} P[B_j] = \lim_{j \to \infty} \sum_{i=1}^{j} P[D_i] \overset{(a)}{=} \sum_{i=1}^{\infty} P[D_i] \overset{(b)}{=} P\left[\bigcup_{i=1}^{\infty} D_i\right] = P\left[\bigcup_{i=1}^{\infty} B_i\right]$$

where (a) is true by the definition of the sum of an infinite series, and (b) is true by axiom P.2. This proves Lemma 1.1.4(a).
Lemma 1.1.4(b) can be proved similarly, or can be derived by applying Lemma 1.1.4(a) to the sets $B_j^c$.

Example 1.1.5 (Selection of a point in a square) Take $\Omega$ to be the square region in the plane, $\Omega = \{(x, y) : 0 \leq x, y \leq 1\}$. Let $\mathcal{F}$ be the Borel σ-algebra for $\Omega$, which is the smallest σ-algebra containing all the rectangular subsets of $\Omega$ that are aligned with the axes. Take $P$ so that for any rectangle $R$, $P[R] = \text{area of } R$. (It can be shown that $\mathcal{F}$ and $P$ exist.) Let $T$ be the triangular region $T = \{(x, y) : 0 \leq y \leq x \leq 1\}$. Since $T$ is not rectangular, it is not immediately clear that $T \in \mathcal{F}$, nor is it clear what $P[T]$ is. That is where the axioms come in. For $n \geq 1$, let $T_n$ denote the region shown in Figure 1.2.

[Figure 1.2: Approximation of a triangular region. The region $T_n$ is a staircase of rectangles with horizontal axis marks at $0, \frac{1}{n}, \frac{2}{n}, \ldots, 1$.]

Since $T_n$ can be written as a union of finitely many mutually exclusive rectangles, it follows that $T_n \in \mathcal{F}$ and it is easily seen that $P[T_n] = \frac{1 + 2 + \cdots + n}{n^2} = \frac{n+1}{2n}$. Since $T_1 \supset T_2 \supset T_4 \supset T_8 \supset \cdots$ and $\bigcap_j T_{2^j} = T$, it follows that $T \in \mathcal{F}$ and $P[T] = \lim_{n \to \infty} P[T_n] = \frac{1}{2}$.

The reader is encouraged to show that if $C$ is the diameter one disk inscribed within $\Omega$ then $P[C] = (\text{area of } C) = \frac{\pi}{4}$.

1.2 Independence and conditional probability

Events $A_1$ and $A_2$ are defined to be independent if $P[A_1 A_2] = P[A_1] P[A_2]$. More generally, events $A_1, A_2, \ldots, A_k$ are defined to be independent if

$$P[A_{i_1} A_{i_2} \cdots A_{i_j}] = P[A_{i_1}] P[A_{i_2}] \cdots P[A_{i_j}]$$

whenever $j$ and $i_1, i_2, \ldots, i_j$ are integers with $j \geq 1$ and $1 \leq i_1 < i_2 < \cdots < i_j \leq k$. For example, events $A_1, A_2, A_3$ are independent if the following four conditions hold:

$$P[A_1 A_2] = P[A_1] P[A_2]$$
$$P[A_1 A_3] = P[A_1] P[A_3]$$
$$P[A_2 A_3] = P[A_2] P[A_3]$$
$$P[A_1 A_2 A_3] = P[A_1] P[A_2] P[A_3]$$

A weaker condition is sometimes useful: Events $A_1, \ldots, A_k$ are defined to be pairwise independent if $A_i$ is independent of $A_j$ whenever $1 \leq i < j \leq k$.
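The gap between pairwise independence and independence can be seen on a four-point sample space. The sketch below is an illustration chosen here, not an example from the notes: it assumes two tosses of a fair coin and three hypothetical events `A1`, `A2`, `A3`, and it checks the product rule over every subset of size at least two.

```python
from fractions import Fraction
from itertools import combinations, product

# Two tosses of a fair coin; each of the four outcomes has probability 1/4.
omega = list(product("HT", repeat=2))


def prob(event):
    """P[A] under the uniform measure on omega."""
    return Fraction(len(event), len(omega))


A1 = {w for w in omega if w[0] == "H"}   # first toss is heads
A2 = {w for w in omega if w[1] == "H"}   # second toss is heads
A3 = {w for w in omega if w[0] == w[1]}  # the two tosses agree


def product_rule_holds(events):
    """True if P[intersection of events] equals the product of their probabilities."""
    intersection = set(omega)
    product_of_probs = Fraction(1)
    for e in events:
        intersection &= e
        product_of_probs *= prob(e)
    return prob(intersection) == product_of_probs


def pairwise_independent(events):
    """Check the product rule for every pair of events."""
    return all(product_rule_holds(pair) for pair in combinations(events, 2))


def independent(events):
    """Check the product rule for every subset of size at least two."""
    return all(product_rule_holds(subset)
               for size in range(2, len(events) + 1)
               for subset in combinations(events, size))
```

For these three events every pair satisfies the product rule, but the triple does not: the intersection of all three is the single outcome (H, H), with probability 1/4 rather than 1/8.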
Independence of $k$ events requires that $2^k - k - 1$ equations hold: one for each subset of $\{1, 2, \ldots, k\}$ of size at least two. Pairwise independence only requires that $\binom{k}{2} = \frac{k(k-1)}{2}$ equations hold.

If $A$ and $B$ are events and $P[B] \neq 0$, then the conditional probability of $A$ given $B$ is defined by

$$P[A \mid B] = \frac{P[AB]}{P[B]}.$$

It is not defined if $P[B] = 0$, which has the following meaning. If you were to write a computer routine to compute $P[A \mid B]$ and the inputs are $P[AB] = 0$ and $P[B] = 0$, your routine shouldn't simply return the value 0. Rather, your routine should generate an error message such as “input error: conditioning on event of probability zero.” Such an error message would help you or others find errors in larger computer programs which use the routine.

As a function of $A$ for $B$ fixed with $P[B] \neq 0$, the conditional probability of $A$ given $B$ is itself a probability measure for $\Omega$ and $\mathcal{F}$. More explicitly, fix $B$ with $P[B] \neq 0$. For each event $A$ define $P'[A] = P[A \mid B]$. Then $(\Omega, \mathcal{F}, P')$ is a probability space, because $P'$ satisfies the axioms P.1 - P.3. (Try showing that.)

If $A$ and $B$ are independent then $A^c$ and $B$ are independent. Indeed, if $A$ and $B$ are independent then

$$P[A^c B] = P[B] - P[AB] = (1 - P[A]) P[B] = P[A^c] P[B].$$

Similarly, if $A$, $B$, and $C$ are independent events then $AB$ is independent of $C$. More generally, suppose $E_1, E_2, \ldots, E_n$ are independent events, suppose $n = n_1 + \cdots + n_k$ with $n_i > 1$ for each $i$, and suppose $F_1$ is defined by Boolean operations (intersections, complements, and unions) of the first $n_1$ events $E_1, \ldots, E_{n_1}$, $F_2$ is defined by Boolean operations on the next $n_2$ events, $E_{n_1+1}, \ldots, E_{n_1+n_2}$, and so on. Then $F_1, \ldots, F_k$ are independent.

Events $E_1, \ldots, E_k$ are said to form a partition of $\Omega$ if the events are mutually exclusive and $\Omega = E_1 \cup \cdots \cup E_k$. Of course for a partition, $P[E_1] + \cdots + P[E_k] = 1$.
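The computer routine for $P[A \mid B]$ described earlier in this section can be sketched as follows. This is one possible rendering, not the author's code: the function name `conditional_probability`, its argument names, and the extra range check on the inputs are assumptions made here for illustration.

```python
def conditional_probability(p_ab, p_b):
    """Return P[A | B] = P[AB] / P[B].

    p_ab is P[AB] and p_b is P[B]. The routine refuses to condition on an
    event of probability zero instead of silently returning 0.
    """
    # Sanity check (an addition of this sketch): AB is a subset of B,
    # so valid inputs satisfy 0 <= P[AB] <= P[B] <= 1.
    if not (0.0 <= p_ab <= p_b <= 1.0):
        raise ValueError("input error: need 0 <= P[AB] <= P[B] <= 1")
    if p_b == 0.0:
        raise ValueError("input error: conditioning on event of probability zero")
    return p_ab / p_b
```

Raising an exception, rather than returning a placeholder value, makes the undefined case visible to any larger program that calls the routine, which is the point made in the text.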
More generally, for any event $A$, the law of total probability holds because $A$ is the union of the mutually exclusive sets $AE_1, AE_2, \ldots, AE_k$:

$$P[A] = P[AE_1] + \cdots + P[AE_k].$$

If $P[E_i] \neq 0$ for each $i$, this can be written as

$$P[A] = P[A \mid E_1] P[E_1] + \cdots + P[A \mid E_k] P[E_k].$$

Figure 1.3 illustrates the condition of the law of total probability.

[Figure 1.3: Partitioning a set $A$ using a partition $E_1, E_2, E_3, E_4$ of $\Omega$.]

Judicious use of the definition of conditional probability and the law of total probability leads to Bayes' formula for $P[E_i \mid A]$ (if $P[A] \neq 0$) in simple form

$$P[E_i \mid A] = \frac{P[AE_i]}{P[A]} = \frac{P[A \mid E_i] P[E_i]}{P[A]},$$

or in expanded form:

$$P[E_i \mid A] = \frac{P[A \mid E_i] P[E_i]}{P[A \mid E_1] P[E_1] + \cdots + P[A \mid E_k] P[E_k]}.$$

The remainder of this section gives the Borel-Cantelli lemma. It is a simple result based on continuity of probability and independence of events, but it is not typically encountered in a first course on probability. Let $(A_n : n \geq 1)$ be a sequence of events for a probability space $(\Omega, \mathcal{F}, P)$.

Definition 1.2.1 The event $\{A_n \text{ infinitely often}\}$ is the set of $\omega \in \Omega$ such that $\omega \in A_n$ for infinitely many values of $n$.

Another way to describe $\{A_n \text{ infinitely often}\}$ is that it is the set of $\omega$ such that for any $k$, there is an $n \geq k$ such that $\omega \in A_n$. Therefore,

$$\{A_n \text{ infinitely often}\} = \bigcap_{k \geq 1} \left( \bigcup_{n \geq k} A_n \right).$$

For each $k$, the set $\bigcup_{n \geq k} A_n$ is a countable union of events, so it is an event, and $\{A_n \text{ infinitely often}\}$ is an intersection of countably many such events, so that $\{A_n \text{ infinitely often}\}$ is also an event.

Lemma 1.2.2 (Borel-Cantelli lemma) Let $(A_n : n \geq 1)$ be a sequence of events and let $p_n = P[A_n]$.

(a) If $\sum_{n=1}^{\infty} p_n < \infty$, then $P\{A_n \text{ infinitely often}\} = 0$.

(b) If $\sum_{n=1}^{\infty} p_n = \infty$ and $A_1, A_2, \cdots$ are mutually independent, then $P\{A_n \text{ infinitely often}\} = 1$.

Proof.
(a) Since $\{A_n \text{ infinitely often}\}$ is the intersection of the monotonically nonincreasing sequence of events $\bigcup_{n \geq k} A_n$, it follows from the continuity of probability for monotone sequences of events (Lemma 1.1.4) that $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left[\bigcup_{n \geq k} A_n\right]$. Lemma 1.1.4, the fact that the probability of a union of events is less than or equal to the sum of the probabilities of the events, and the definition of the sum of a sequence of numbers, yield that for any $k \geq 1$,

$$P\left[\bigcup_{n \geq k} A_n\right] = \lim_{m \to \infty} P\left[\bigcup_{n=k}^{m} A_n\right] \leq \lim_{m \to \infty} \sum_{n=k}^{m} p_n = \sum_{n=k}^{\infty} p_n.$$

Combining the above yields $P\{A_n \text{ infinitely often}\} \leq \lim_{k \to \infty} \sum_{n=k}^{\infty} p_n$. If $\sum_{n=1}^{\infty} p_n < \infty$, then $\lim_{k \to \infty} \sum_{n=k}^{\infty} p_n = 0$, which implies part (a) of the lemma.

(b) Suppose that $\sum_{n=1}^{\infty} p_n = +\infty$ and that the events $A_1, A_2, \ldots$ are mutually independent. For any $k \geq 1$, using the fact $1 - u \leq \exp(-u)$ for all $u$,

$$P\left[\bigcup_{n \geq k} A_n\right] = \lim_{m \to \infty} P\left[\bigcup_{n=k}^{m} A_n\right] = \lim_{m \to \infty} \left(1 - \prod_{n=k}^{m} (1 - p_n)\right)$$
$$\geq \lim_{m \to \infty} \left(1 - \exp\left(-\sum_{n=k}^{m} p_n\right)\right) = 1 - \exp\left(-\sum_{n=k}^{\infty} p_n\right) = 1 - \exp(-\infty) = 1.$$

Therefore, $P\{A_n \text{ infinitely often}\} = \lim_{k \to \infty} P\left[\bigcup_{n \geq k} A_n\right] = 1$.

Example 1.2.3 Consider independent coin tosses using biased coins, such that $P[A_n] = p_n = \frac{1}{n}$, where $A_n$ is the event of getting heads on the $n$th toss. Since $\sum_{n=1}^{\infty} \frac{1}{n} = +\infty$, the part of the Borel-Cantelli lemma for independent events implies that $P\{A_n \text{ infinitely often}\} = 1$.

Example 1.2.4 Let $(\Omega, \mathcal{F}, P)$ be the standard unit-interval probability space defined in Example 1.1.2, and let $A_n = [0, \frac{1}{n}]$. Then $p_n = \frac{1}{n}$ and $A_{n+1} \subset A_n$ for $n \geq 1$. The events are not independent, because for $1 < m < n$, $P[A_m A_n] = P[A_n] = \frac{1}{n} \neq P[A_m] P[A_n]$. Of course $0 \in A_n$ for all $n$. But for any $\omega \in (0, 1]$, $\omega \notin A_n$ for $n > \frac{1}{\omega}$. Therefore, $\{A_n \text{ infinitely often}\} = \{0\}$. The single point set $\{0\}$ has probability zero, so $P\{A_n \text{ infinitely often}\} = 0$. This conclusion holds even though $\sum_{n=1}^{\infty} p_n = +\infty$, illustrating the need for the independence assumption in Lemma 1.2.2(b).
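The contrast between Examples 1.2.3 and 1.2.4 can be made concrete with exact arithmetic. The sketch below is an illustration added here, with hypothetical function names: it computes $P[\bigcup_{n=k}^{m} A_n]$ for the independent events of Example 1.2.3 using the product formula from the proof of part (b), and counts how many of the nested intervals of Example 1.2.4 contain a given point.

```python
from fractions import Fraction


def prob_union_independent(k, m):
    """P[A_k u ... u A_m] for independent events with P[A_n] = 1/n.

    By independence, the probability that none of the events occurs is
    the product of (1 - 1/n) over n = k..m.
    """
    prob_none = Fraction(1)
    for n in range(k, m + 1):
        prob_none *= 1 - Fraction(1, n)
    return 1 - prob_none


def count_hits_nested(omega, n_max):
    """Number of n in 1..n_max with omega in A_n = [0, 1/n]."""
    return sum(1 for n in range(1, n_max + 1) if 0 <= omega <= Fraction(1, n))
```

For $k \geq 2$ the product telescopes to $(k-1)/m$, so the union probability is $1 - (k-1)/m$, which tends to 1 as $m \to \infty$ for every fixed $k$, in line with part (b) of the lemma. In the nested case, a point $\omega > 0$ lies in only finitely many $A_n$ (those with $n \leq 1/\omega$), while $\omega = 0$ lies in all of them.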
1.3 Random variables and their distribution

Let a probability space $(\Omega, \mathcal{F}, P)$ be given. By definition, a random variable is a function $X$ from $\Omega$ to the real line $\mathbb{R}$ that is $\mathcal{F}$ measurable, meaning that for any number $c$,

$$\{\omega : X(\omega) \leq c\} \in \mathcal{F}. \qquad (1.2)$$

If $\Omega$ is finite or countably infinite, then $\mathcal{F}$ can be the set of all subsets of $\Omega$, in which case any real-valued function on $\Omega$ is a random variable.

If $(\Omega, \mathcal{F}, P)$ is given as in the uniform phase example with $\mathcal{F}$ equal to the Borel subsets of $[0, 2\pi]$, then the random variables on $(\Omega, \mathcal{F}, P)$ are called the Borel measurable functions on $\Omega$. Since the Borel σ-algebra contains all subsets of $[0, 2\pi]$ that come up in applications, for practical purposes we can think of any function on $[0, 2\pi]$ as being a random variable. For example, any piecewise continuous or piecewise monotone function on $[0, 2\pi]$ is a random variable for the uniform phase example.

The cumulative distribution function (CDF) of a random variable $X$ is denoted by $F_X$. It is the function, with domain the real line $\mathbb{R}$, defined by

$$F_X(c) = P\{\omega : X(\omega) \leq c\} \qquad (1.3)$$
$$\phantom{F_X(c)} = P\{X \leq c\} \quad \text{(for short)} \qquad (1.4)$$

If $X$ denotes the outcome of the roll of a fair die (“die” is singular of “dice”) and if $Y$ is uniformly distributed on the interval $[0, 1]$, then $F_X$ and $F_Y$ are shown in Figure 1.4.

[Figure 1.4: Examples of CDFs. Left: $F_X$, a staircase rising to 1 with jumps at $1, 2, \ldots, 6$. Right: $F_Y$, rising linearly from 0 to 1 on $[0, 1]$.]

The CDF of a random variable $X$ determines $P\{X \leq c\}$ for any real number $c$. But what about $P\{X < c\}$ and $P\{X = c\}$? Let $c_1, c_2, \ldots$ be a monotone nondecreasing sequence that converges to $c$ from the left. This means $c_i \leq c_j < c$ for $i < j$ and $\lim_{j \to \infty} c_j = c$. Then the events $\{X \leq c_j\}$ are nested: $\{X \leq c_i\} \subset \{X \leq c_j\}$ for $i < j$, and the union of all such events is the event $\{X < c\}$. Thus, by Lemma 1.1.4,

$$P\{X < c\} = \lim_{i \to \infty} P\{X \leq c_i\} = \lim_{i \to \infty} F_X(c_i) = F_X(c-).$$

Therefore, $P\{X = c\} = F_X(c) - F_X(c-) = \Delta F_X(c)$, where $\Delta F_X(c)$ is defined to be the size of the jump of $F_X$ at $c$.
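The relations $P\{X < c\} = F_X(c-)$ and $P\{X = c\} = \Delta F_X(c)$ can be illustrated with the fair-die CDF mentioned above. The following is a minimal sketch with hypothetical function names; the left limit is computed directly by counting faces strictly below $c$, rather than by taking a numerical limit.

```python
from fractions import Fraction
import math


def die_cdf(c):
    """F_X(c) = P{X <= c} for one roll of a fair die."""
    faces_at_most_c = min(max(math.floor(c), 0), 6)
    return Fraction(faces_at_most_c, 6)


def die_cdf_left_limit(c):
    """F_X(c-) = P{X < c}: count the faces strictly below c."""
    faces_below_c = min(max(math.ceil(c) - 1, 0), 6)
    return Fraction(faces_below_c, 6)


def point_mass(c):
    """P{X = c} = F_X(c) - F_X(c-), the size of the jump of F_X at c."""
    return die_cdf(c) - die_cdf_left_limit(c)
```

At an integer face value such as $c = 3$ the jump is $1/6$, while at a point of continuity such as $c = 3.5$ the jump is zero.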
For example, if X has the CDF shown in Figure 1.5 then P {X = 0} = 2 . The requirement that FX be right continuous implies that for any number c (such as c = 0 for this example), if the value FX (c) is changed to any other value, the resulting function would no longer be a valid CDF. The collection of all events A such that P {X ∈ A} is determined by FX is a σ-algebra containing the intervals, and thus this collection contains all Borel sets. That is, P {X ∈ A} is determined by FX for any Borel set A. 1 0.5 −1 0 Figure 1.5: An example of a CDF. Proposition 1.3.1 A function F is the CDF of some random variable if and only if it has the following three properties: F.1 F is nondecreasing F.2 limx→+∞ F (x) = 1 and limx→−∞ F (x) = 0 F.3 F is right continuous Proof The “only if” part is proved first. Suppose that F is the CDF of some random variable X. Then if x < y, F (y) = P {X ≤ y} = P {X ≤ x} + P {x < X ≤ y} ≥ P {X ≤ x} = F (x) so that F.1 is true. Consider the events Bn = {X ≤ n}. Then Bn ⊂ Bm for n ≤ m. Thus, by Lemma 1.1.4, ∞ n→∞ lim F (n) = lim P [Bn ] = P n→∞ n=1 Bn = P [Ω] = 1. F (x) → 1 as x → +∞. so that F (x) → 0 as x → −∞. X has CDF F . Then An ⊂ Am for n ≥ m so 1 lim F (x + ) = lim P [An ] = P n→∞ n→∞ n ∞ Ak = P {X ≤ x} = FX (x). to be described next. The probability mass function (pmf) of a discrete random variable X. let F be a function satisfying properties F. Define P on intervals of the form (a. let X(ω) = ω for all ω ∈ Ω. It must be shown that there exists a random variable with CDF F .5) where the sum in (1. It can ˜ can be extended to all of F so that be shown by an extension theorem of measure theory that P ˜ the axioms of probability are satisfied. implies that F (x+) = F (x). Property F.3 is thus proved. FX (x) = y:y≤x pX (y). Fix an arbitrary real number x. A random variable X is a discrete random variable if there is a finite or countably infinite set of values {xi : i ∈ I} such that P {X ∈ {xi : i ∈ I}} = 1.2 is proved. That is. (1. 
Property F. The vast majority of random variables described in applications are one of two types. A random variable X is a continuous random variable if the CDF is the integral of a function: x FX (x) = −∞ fX (y)dy .3.3. If X is a discrete random variable with only finitely many mass points in any finite interval. Given any > 0. b]] = F (b) − F (a). b] by P [(a.3 is similar. together with the fact that F is nondecreasing. The proof of F. ˜ Therefore. b]] = P [(a. The proof of the “only if” portion of Proposition 1.1 is complete To prove the “if” part of Proposition 1. the pmf and CDF of a discrete random variable are related by pX (x) = FX (x) and conversely.1. then FX is a piecewise constant function.3. is defined by pX (x) = P {X = x}.5) is taken only over y such that pX (y) = 0. GETTING STARTED This and the fact F is nondecreasing imply the following. there exists N so large that F (x) ≥ 1 − for all x ≥ N . Then ˜ ˜ ˜ P [X ∈ (a. Finally. ∞ n→−∞ lim F (n) = n→∞ lim P [B−n ] = P n=1 B−n = P [∅] = 0. b]] = F (b) − F (a).1-F. k=1 1 Convergence along the sequence x + n . However. denoted pX (x). Let Ω = R and let F be the set ˜ ˜ B of Borel subsets of R.10 CHAPTER 1. Similarly. Typically the pmf of a discrete random variable is much more useful than the CDF. Define the sequence of events An 1 for n ≥ 1 by An = {X ≤ x + n }. We write Y = g(X). The distribution PX is determined uniquely by the CDF FX . b]] = P {X ∈ (a. (1. In case X is a continuous random variable with known distribution.5 and 11. The distribution is also determined by the pdf fX if X is continuous type. or pX . In general.6. b].1.6) can be understood as a Riemann integral if A is a finite union of intervals and f is piecewise continuous or monotone. PX [(a. In common usage. using FY (c) = P {Y ≤ c} = P {g(X) ≤ c}.4. or the pmf pX if X is discrete type. F. Specifically. 2 Lebesgue integration is defined in Sections 1. 
the response to the question “What is the distribution of X?” is answered by giving one or more of FX . then P {X ∈ A} = A fX (x)dx. then the value fX (x) has the following nice interpretation: fX (x) = lim 1 x+ε fX (y)dy ε→0 ε x 1 = lim P {x ≤ X ≤ x + ε}. suppose for any constant c that {x : g(x) ≤ c} is a Borel subset of R.5 . the following three step procedure works well: (1) Examine the ranges of possible values of X and Y . FUNCTIONS OF A RANDOM VARIABLE 11 The function fX is called the probability density function (pdf). We define the probability distribution of X to be the probability measure PX . (2) Find the CDF of Y . See Figure 1. If the pdf fX is continuous at a point x. 1. whichever is most convenient. fX is required to be Borel measurable and the integral is defined by Lebesgue integration. or possibly a transform of one of these.3.2 Any random variable X on an arbitrary probability space has a CDF FX . P ) is a function mapping Ω to the real line R . satisfying the condition {ω : X(ω) ≤ a} ∈ F for all a ∈ R. Then Y maps Ω to R and Y is a random variable.4 Functions of a random variable Recall that a random variable X on a probability space (Ω. ε→0 ε If A is any Borel subset of R. b]}. The idea is to express the event {g(X) ≤ c} as {X ∈ A} for some set A depending on c.6) The integral in (1. Sketch the function g. Suppose g is a function mapping R to R that is not too bizarre. fX . Let Y (ω) = g(X(ω)).1 there exists a probability measure PX (called P in the proof) on the Borel subsets of R such that for any interval (a. As noted in the proof ˜ of Proposition 1. Often we’d like to compute the distribution of Y from knowledge of g and the distribution of X. FB (c) = 0 for c ≤ −a and FB (c) = 1 for c ≥ a. (3) If FY has a piecewise continuous derivative. Then the effective speed of the vehicle in the random direction is B = a cos(Θ). and the range of g over this support is R+ . and if the pmf fY is desired. FY (c) = 0 for c < 0.4. 
If instead X is a discrete random variable then step 1 should be followed. subtending an angle Θ from the direction of travel which is uniformly distributed over the interval [0. See Figure 1. π] is the interval [−a. π]. differentiate FY . Since P {Y ≥ 0} = 1. because cos is monotone nonincreasing . √ √ FY (c) = P {X 2 ≤ c} = P {− c ≤ X ≤ c} √ √ − c−2 X −2 c−2 = P{ √ ≤ √ ≤ √ } 3 3 3 √ √ c−2 − c−2 = Φ( √ ) − Φ( √ ) 3 3 Differentiate with respect to c.4.1 Suppose X is a N (µ = 2.7) 2 fY (c) = + exp(−[ − √c−2 ]2 )} 6 √ 0 if y ≥ 0 if y < 0 Example 1. Next we find the CDF. Let us describe the density of Y . Let us find the pdf of B. After that the pmf of Y can be found from the pmf of X using pY (y) = P {g(X) = y} = x:g(x)=y pX (x) Example 1. Let now −a < c < a.6 for the definition) and Y = X 2 . Then. Note that Y = g(X) where g(x) = x2 . σ 2 = 3) random variable (see Section 1. FY . a]. using the chain rule and the fact. The support of the distribution of X is the whole real line. For c ≥ 0.7.6: A function of a random variable as a composition of mappings.12 X CHAPTER 1. Φ (s) = √ c−2 √ 1 {exp(−[ √ ]2 ) 24πc 6 √1 2π exp(− s2 ) to obtain (1. Therefore. The range of a cos(Θ) as θ ranges over [0. GETTING STARTED g Ω X(ω) g(X(ω)) Figure 1. and that a random direction is selected.2 Suppose a vehicle is traveling in a straight line at speed a. as illustrated in Figure 1.1. √ 1 π a2 −c2 0 | c |< a | c |> a fB −a 0 a Figure 1. Example 1. because cos−1 (y) has derivative.4. fB (c) = A sketch of the density is given in Figure 1. where Θ is uniformly distributed over the interval (− π . on the interval [0. π]. −(1 − y 2 )− 2 . FUNCTIONS OF A RANDOM VARIABLE 13 B Θ a Figure 1. π ) .4.8: The pdf of the effective speed in a uniformly distributed direction. FB (c) = P {a cos(Θ) ≤ c} = P {cos(Θ) ≤ c } a c = P {Θ ≥ cos−1 ( )} a c cos−1 ( a ) = 1− π 1 Therefore. The function tan(θ) increases from −∞ to ∞ 2 2 .8. 
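Example 1.4.3 Suppose $Y = \tan(\Theta)$, as illustrated in Figure 1.9, where $\Theta$ is uniformly distributed over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. Let us find the pdf of $Y$. The function $\tan(\theta)$ increases from $-\infty$ to $\infty$ as $\theta$ ranges over the interval $(-\frac{\pi}{2}, \frac{\pi}{2})$. For any real $c$,
$$F_Y(c) = P\{Y \le c\} = P\{\tan(\Theta) \le c\} = P\{\Theta \le \tan^{-1}(c)\} = \frac{\tan^{-1}(c) + \frac{\pi}{2}}{\pi}.$$
Differentiating the CDF with respect to $c$ yields that $Y$ has the Cauchy pdf:
$$f_Y(c) = \frac{1}{\pi(1 + c^2)}, \qquad -\infty < c < \infty.$$

Figure 1.9: A horizontal line, a fixed point at unit distance, and a line through the point with random direction.

Example 1.4.4 Given an angle $\theta$ expressed in radians, let $(\theta \bmod 2\pi)$ denote the equivalent angle in the interval $[0, 2\pi]$. Thus, $(\theta \bmod 2\pi)$ is equal to $\theta + 2\pi n$, where the integer $n$ is such that $0 \le \theta + 2\pi n < 2\pi$. Let $\Theta$ be uniformly distributed over $[0, 2\pi]$, let $h$ be a constant, and let
$$\tilde{\Theta} = (\Theta + h \bmod 2\pi).$$
Let us find the distribution of $\tilde{\Theta}$. Clearly $\tilde{\Theta}$ takes values in the interval $[0, 2\pi]$, so fix $c$ with $0 \le c < 2\pi$ and seek to find $P\{\tilde{\Theta} \le c\}$. Let $A$ denote the interval $[h, h + 2\pi]$. Thus, $\Theta + h$ is uniformly distributed over $A$. Let $B = \bigcup_n [2\pi n, 2\pi n + c]$. Thus $\tilde{\Theta} \le c$ if and only if $\Theta + h \in B$. Therefore,
$$P\{\tilde{\Theta} \le c\} = \int_{A \cap B} \frac{1}{2\pi}\, d\theta.$$
By sketching the set $B$, it is easy to see that $A \cap B$ is either a single interval of length $c$, or the union of two intervals with lengths adding to $c$. Therefore, $P\{\tilde{\Theta} \le c\} = \frac{c}{2\pi}$, so that $\tilde{\Theta}$ is itself uniformly distributed over $[0, 2\pi]$.

Example 1.4.5 Let $X$ be an exponentially distributed random variable with parameter $\lambda$. Let $Y = \lfloor X \rfloor$, which is the integer part of $X$, and let $R = X - \lfloor X \rfloor$, which is the remainder. We shall describe the distributions of $Y$ and $R$.

Clearly $Y$ is a discrete random variable with possible values $0, 1, 2, \ldots$, so it is sufficient to find the pmf of $Y$. For integers $k \ge 0$,
$$p_Y(k) = P\{k \le X < k + 1\} = \int_k^{k+1} \lambda e^{-\lambda x}\, dx = e^{-\lambda k}(1 - e^{-\lambda})$$
and $p_Y(k) = 0$ for other $k$.

Turn next to the distribution of $R$. Clearly $R$ takes values in the interval $[0, 1]$. So let $0 < c < 1$ and find $F_R(c)$:
$$F_R(c) = P\{X - \lfloor X \rfloor \le c\} = P\left\{X \in \bigcup_{k=0}^{\infty} [k, k + c]\right\} = \sum_{k=0}^{\infty} P\{k \le X \le k + c\} = \sum_{k=0}^{\infty} e^{-\lambda k}(1 - e^{-\lambda c}) = \frac{1 - e^{-\lambda c}}{1 - e^{-\lambda}},$$
where we used the fact $1 + \alpha + \alpha^2 + \cdots = \frac{1}{1 - \alpha}$ for $|\alpha| < 1$. Differentiating $F_R$ yields the pdf:
$$f_R(c) = \begin{cases} \dfrac{\lambda e^{-\lambda c}}{1 - e^{-\lambda}} & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases}$$
What happens to the density of $R$ as $\lambda \to 0$ or as $\lambda \to \infty$? By l'Hospital's rule,
$$\lim_{\lambda \to 0} f_R(c) = \begin{cases} 1 & 0 \le c \le 1 \\ 0 & \text{otherwise} \end{cases}$$
That is, in the limit as $\lambda \to 0$, the density of $X$ becomes more and more "evenly spread out," and $R$ becomes uniformly distributed over the interval $[0, 1]$. If $\lambda$ is very large then the factor $e^{-\lambda}$ is nearly zero, and the density of $R$ is nearly the same as the exponential density with parameter $\lambda$.

An important step in many computer simulations of random systems is to generate a random variable with a specified CDF, by applying a function to a random variable that is uniformly distributed on the interval $[0, 1]$. Let $F$ be a function satisfying the three properties required of a CDF, and let $U$ be uniformly distributed over the interval $[0, 1]$. The problem is to find a function $g$ so that $F$ is the CDF of $g(U)$. An appropriate function $g$ is given by the inverse function of $F$. Although $F$ may not be strictly increasing, a suitable version of $F^{-1}$ always exists, defined for $0 < u < 1$ by
$$F^{-1}(u) = \min\{x : F(x) \ge u\} \tag{1.8}$$
If the graphs of $F$ and $F^{-1}$ are closed up by adding vertical lines at jump points, then the graphs are reflections of each other about the $x = y$ line, as illustrated in Figure 1.10.

Figure 1.10: A CDF and its inverse.

It is not hard to check that for any real $x_o$ and $u_o$ with $0 < u_o < 1$,
$$F(x_o) \ge u_o \quad \text{if and only if} \quad x_o \ge F^{-1}(u_o).$$
Thus, if $X = F^{-1}(U)$ then
$$P\{F^{-1}(U) \le x\} = P\{U \le F(x)\} = F(x)$$
so that indeed $F$ is the CDF of $X$.

Example 1.4.6 Suppose $F(x) = 1 - e^{-x}$ for $x \ge 0$ and $F(x) = 0$ for $x < 0$. Since $F$ is continuously increasing in this case, we can identify its inverse by solving for $x$ as a function of $u$ so that $F(x) = u$. That is, for $0 < u < 1$, we'd like $1 - e^{-x} = u$ which is equivalent to $e^{-x} = 1 - u$, or $x = -\ln(1 - u)$. Thus, $F^{-1}(u) = -\ln(1 - u)$. So we can take $g(u) = -\ln(1 - u)$ for $0 < u < 1$. That is, if $U$ is uniformly distributed on the interval $[0, 1]$, then the CDF of $-\ln(1 - U)$ is $F$. The choice of $g$ is not unique in general. For example, $1 - U$ has the same distribution as $U$, so the CDF of $-\ln(U)$ is also $F$. To double check the answer, note that if $x \ge 0$, then
$$P\{-\ln(1 - U) \le x\} = P\{\ln(1 - U) \ge -x\} = P\{1 - U \ge e^{-x}\} = P\{U \le 1 - e^{-x}\} = F(x).$$

Example 1.4.7 Suppose $F$ is the CDF for the experiment of rolling a fair die, shown on the left half of Figure 1.10. One way to generate a random variable with CDF $F$ is to actually roll a die. To simulate that on a computer, we'd seek a function $g$ so that $g(U)$ has the same CDF. Using $g = F^{-1}$ and using (1.8) or the graphical method illustrated in Figure 1.10 to find $F^{-1}$, we get that for $0 < u < 1$, $g(u) = i$ for $\frac{i-1}{6} < u \le \frac{i}{6}$ for $1 \le i \le 6$. To double check the answer, note that if $1 \le i \le 6$, then
$$P\{g(U) = i\} = P\left\{\frac{i-1}{6} < U \le \frac{i}{6}\right\} = \frac{1}{6}$$
so that $g(U)$ has the correct pmf, and hence the correct CDF.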
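The inverse-CDF construction of Examples 1.4.6 and 1.4.7 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the seed, and the sample size are arbitrary choices, and Python's `random.Random.random()` stands in for the uniform random variable $U$.

```python
import math
import random

def inv_cdf_exponential(u):
    """F^{-1}(u) = -ln(1 - u) for F(x) = 1 - exp(-x), x >= 0 (Example 1.4.6)."""
    return -math.log(1.0 - u)

def inv_cdf_die(u):
    """F^{-1}(u) = i for (i-1)/6 < u <= i/6 (Example 1.4.7)."""
    return max(1, math.ceil(6.0 * u))  # max() guards the probability-zero case u == 0

def sample(inv_cdf, n, rng):
    """Draw n values of g(U) with U uniform on [0, 1)."""
    return [inv_cdf(rng.random()) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    n = 100_000
    xs = sample(inv_cdf_exponential, n, rng)
    print("exponential sample mean:", sum(xs) / n)  # the target CDF has mean 1
    rolls = sample(inv_cdf_die, n, rng)
    print("die frequencies:", [rolls.count(i) / n for i in range(1, 7)])
```

The sample mean should be close to 1 and each face frequency close to 1/6, consistent with the double checks carried out in the two examples.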
1.5 Expectation of a random variable

The expectation, alternatively called the mean, of a random variable $X$ can be defined in several different ways. Before giving a general definition, we shall consider a straightforward case. A random variable $X$ is called simple if there is a finite set $\{x_1, \ldots, x_m\}$ such that $X(\omega) \in \{x_1, \ldots, x_m\}$ for all $\omega$. The expectation of such a random variable is defined by
$$E[X] = \sum_{i=1}^{m} x_i P\{X = x_i\} \tag{1.9}$$
The definition (1.9) clearly shows that $E[X]$ for a simple random variable $X$ depends only on the pmf of $X$.

Like all random variables, $X$ is a function on a probability space $(\Omega, \mathcal{F}, P)$. Figure 1.11 illustrates that the sum defining $E[X]$ in (1.9) can be viewed as an integral over $\Omega$. This suggests writing
$$E[X] = \int_{\Omega} X(\omega) P[d\omega] \tag{1.10}$$

Figure 1.11: A simple random variable with three possible values.

Let $Y$ be another simple random variable on the same probability space as $X$, with $Y(\omega) \in \{y_1, \ldots, y_n\}$ for all $\omega$. Of course $E[Y] = \sum_{i=1}^{n} y_i P\{Y = y_i\}$. One learns in any elementary probability class that $E[X + Y] = E[X] + E[Y]$. Note that $X + Y$ is again a simple random variable, so that $E[X + Y]$ can be defined in the same way as $E[X]$ was defined. How would you prove $E[X + Y] = E[X] + E[Y]$? Is (1.9) helpful? We shall give a proof that $E[X + Y] = E[X] + E[Y]$ motivated by (1.10).

The sets $\{X = x_1\}, \ldots, \{X = x_m\}$ form a partition of $\Omega$. A refinement of this partition consists of another partition $C_1, \ldots, C_{m'}$ such that $X$ is constant over each $C_j$. If we let $x'_j$ denote the value of $X$ on $C_j$, then clearly
$$E[X] = \sum_j x'_j P[C_j].$$
Now, it is possible to select the partition $C_1, \ldots, C_{m'}$ so that both $X$ and $Y$ are constant over each $C_j$. For example, each $C_j$ could have the form $\{X = x_i\} \cap \{Y = y_k\}$ for some $i, k$. Let $y'_j$ denote the value of $Y$ on $C_j$. Then $x'_j + y'_j$ is the value of $X + Y$ on $C_j$. Therefore,
$$E[X + Y] = \sum_j (x'_j + y'_j) P[C_j] = \sum_j x'_j P[C_j] + \sum_j y'_j P[C_j] = E[X] + E[Y].$$

While the expression (1.10) is rather suggestive, it would be overly restrictive to interpret it as a Riemann integral over $\Omega$. For example, if $X$ is a random variable for the standard unit-interval probability space defined in Example 1.1.2, then it is tempting to define $E[X]$ by Riemann integration (see the appendix):
$$E[X] = \int_0^1 X(\omega)\, d\omega \tag{1.11}$$
However, suppose $X$ is the simple random variable such that $X(\omega) = 1$ for rational values of $\omega$ and $X(\omega) = 0$ otherwise. Since the set of rational numbers in $\Omega$ is countably infinite, such $X$ satisfies $P\{X = 0\} = 1$. Clearly we'd like $E[X] = 0$, but the Riemann integral (1.11) is not convergent for this choice of $X$.

The expression (1.10) can be used to define $E[X]$ in great generality if it is interpreted as a Lebesgue integral, defined as follows: Suppose $X$ is an arbitrary nonnegative random variable. Then there exists a sequence of simple random variables $X_1, X_2, \ldots$ such that for every $\omega \in \Omega$, $X_1(\omega) \le X_2(\omega) \le \cdots$ and $X_n(\omega) \to X(\omega)$ as $n \to \infty$. Then $E[X_n]$ is well defined for each $n$ and is nondecreasing in $n$, so the limit of $E[X_n]$ as $n \to \infty$ exists with values in $[0, +\infty]$. Furthermore it can be shown that the value of the limit depends only on $(\Omega, \mathcal{F}, P)$ and $X$, not on the particular choice of the approximating simple sequence. We thus define $E[X] = \lim_{n \to \infty} E[X_n]$. Thus, $E[X]$ is always well defined in this way, with possible value $+\infty$, if $X$ is a nonnegative random variable.

Suppose $X$ is an arbitrary random variable. Define the positive part of $X$ to be the random variable $X_+$ defined by $X_+(\omega) = \max\{0, X(\omega)\}$ for each value of $\omega$. Similarly define the negative part of $X$ to be the random variable $X_-(\omega) = \max\{0, -X(\omega)\}$. Then $X(\omega) = X_+(\omega) - X_-(\omega)$ for all $\omega$, and $X_+$ and $X_-$ are both nonnegative random variables. As long as at least one of $E[X_+]$ or $E[X_-]$ is finite, define $E[X] = E[X_+] - E[X_-]$. The expectation $E[X]$ is undefined if $E[X_+] = E[X_-] = +\infty$. This completes the definition of $E[X]$ using (1.10) interpreted as a Lebesgue integral.

We will prove that $E[X]$ defined by the Lebesgue integral (1.10) depends only on the CDF of $X$. It suffices to show this for a nonnegative random variable $X$. For such a random variable, and $n \ge 1$, define the simple random variable $X_n$ by
$$X_n(\omega) = \begin{cases} k 2^{-n} & \text{if } k 2^{-n} \le X(\omega) < (k + 1) 2^{-n},\ k = 0, 1, \ldots, 2^{2n} - 1 \\ 0 & \text{else} \end{cases}$$
Then
$$E[X_n] = \sum_{k=0}^{2^{2n} - 1} k 2^{-n} \left( F_X((k + 1) 2^{-n}) - F_X(k 2^{-n}) \right)$$
so that $E[X_n]$ is determined by the CDF $F_X$ for each $n$. Furthermore, the $X_n$'s are nondecreasing in $n$ and converge to $X$. Thus, $E[X] = \lim_{n \to \infty} E[X_n]$, and therefore the limit $E[X]$ is determined by $F_X$.

In Section 1.3 we defined the probability distribution $P_X$ of a random variable such that the canonical random variable $\tilde{X}(\omega) = \omega$ on $(\mathbb{R}, \mathcal{B}, P_X)$ has the same CDF as $X$. Therefore $E[X] = E[\tilde{X}]$, or
$$E[X] = \int_{-\infty}^{\infty} x\, P_X(dx) \quad \text{(Lebesgue)} \tag{1.12}$$
By definition, the integral (1.12) is the Lebesgue-Stieltjes integral of $x$ with respect to $F_X$, so that
$$E[X] = \int_{-\infty}^{\infty} x\, dF_X(x) \quad \text{(Lebesgue-Stieltjes)} \tag{1.13}$$
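The dyadic approximation used above can be evaluated numerically. The sketch below computes $E[X_n]$ from a CDF using the displayed sum, with an exponential CDF of rate 1 as an arbitrary illustrative choice (for a continuous CDF such as this one, the distinction between $<$ and $\le$ at the interval endpoints does not affect the probabilities). The function names are hypothetical.

```python
import math

def expectation_of_dyadic_approximation(cdf, n):
    """E[X_n] = sum_k k 2^-n (F((k+1) 2^-n) - F(k 2^-n)), k = 0..2^(2n)-1."""
    step = 2.0 ** (-n)
    total = 0.0
    for k in range(4 ** n):
        total += k * step * (cdf((k + 1) * step) - cdf(k * step))
    return total

def exp_cdf(x, lam=1.0):
    """CDF of the Exp(lam) distribution; lam = 1 is an arbitrary illustrative choice."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

if __name__ == "__main__":
    for n in range(1, 9):
        print(n, expectation_of_dyadic_approximation(exp_cdf, n))
```

The printed values should be nondecreasing in $n$ and approach the mean $1/\lambda = 1$ from below, in line with the definition $E[X] = \lim_{n \to \infty} E[X_n]$.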
Expectation has the following properties. Let $X$, $Y$ be random variables and $c$ a constant.

E.1 (Linearity) $E[cX] = cE[X]$. If $E[X]$, $E[Y]$ and $E[X] + E[Y]$ are well defined, then $E[X + Y]$ is well defined and $E[X + Y] = E[X] + E[Y]$.

E.2 (Preservation of order) If $P\{X \ge Y\} = 1$ and $E[Y]$ is well defined then $E[X]$ is well defined and $E[X] \ge E[Y]$.

E.3 If $X$ has pdf $f_X$ then
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx \quad \text{(Lebesgue)}$$

E.4 If $X$ has pmf $p_X$ then
$$E[X] = \sum_{x > 0} x p_X(x) + \sum_{x < 0} x p_X(x).$$

E.5 (Law of the unconscious statistician) If $g$ is Borel measurable,
$$E[g(X)] = \int_{\Omega} g(X(\omega)) P[d\omega] \quad \text{(Lebesgue)}$$
$$\phantom{E[g(X)]} = \int_{-\infty}^{\infty} g(x)\, dF_X(x) \quad \text{(Lebesgue-Stieltjes)}$$
and in case $X$ is a continuous type random variable
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx \quad \text{(Lebesgue)}$$

E.6 (Integration by parts formula)
$$E[X] = \int_0^{\infty} (1 - F_X(x))\, dx - \int_{-\infty}^{0} F_X(x)\, dx, \tag{1.14}$$
which is well defined whenever at least one of the two integrals in (1.14) is finite. There is a simple graphical interpretation of (1.14). Namely, $E[X]$ is equal to the area of the region between the horizontal line $\{y = 1\}$ and the graph of $F_X$ and contained in $\{x \ge 0\}$, minus the area of the region bounded by the $x$ axis and the graph of $F_X$ and contained in $\{x \le 0\}$, as long as at least one of these regions has finite area. See Figure 1.12.

Figure 1.12: $E[X]$ is the difference of two areas.

Properties E.1 and E.2 are true for simple random variables and they carry over to general random variables in the limit defining the Lebesgue integral (1.10). Properties E.3 and E.4 follow from the equivalent definition (1.12) and properties of Lebesgue-Stieltjes integrals. Property E.5 can be proved by approximating $g$ by piecewise constant functions. Property E.6 can be proved by integration by parts applied to (1.13). Alternatively, since $F_X^{-1}(U)$ has the same distribution as $X$, if $U$ is uniformly distributed on the interval $[0, 1]$, the law of the unconscious statistician yields that $E[X] = \int_0^1 F_X^{-1}(u)\, du$, and this integral can also be interpreted as the difference of the areas of the same two regions.

Sometimes, for brevity, we write $EX$ instead of $E[X]$. The variance of a random variable $X$ with $EX$ finite is defined by $\mathrm{Var}(X) = E[(X - EX)^2]$. By the linearity of expectation, if $EX$ is finite, the variance of $X$ satisfies the useful relation:
$$\mathrm{Var}(X) = E[X^2 - 2X(EX) + (EX)^2] = E[X^2] - (EX)^2.$$

The following two inequalities are simple and fundamental. The Markov inequality states that if $Y$ is a nonnegative random variable, then for $c > 0$,
$$P\{Y \ge c\} \le \frac{E[Y]}{c}.$$
To prove Markov's inequality, note that $I_{\{Y \ge c\}} \le \frac{Y}{c}$, and take expectations on each side. The Chebychev inequality states that if $X$ is a random variable with finite mean $\mu$ and variance $\sigma^2$, then for any $d > 0$,
$$P\{|X - \mu| \ge d\} \le \frac{\sigma^2}{d^2}.$$
The Chebychev inequality follows by applying the Markov inequality with $Y = |X - \mu|^2$ and $c = d^2$.

The characteristic function $\Phi_X$ of a random variable $X$ is defined by
$$\Phi_X(u) = E[e^{juX}]$$
for real values of $u$, where $j = \sqrt{-1}$. For example, if $X$ has pdf $f_X$, then
$$\Phi_X(u) = \int_{-\infty}^{\infty} \exp(jux) f_X(x)\, dx,$$
which is $2\pi$ times the inverse Fourier transform of $f_X$.

Two random variables have the same probability distribution if and only if they have the same characteristic function. If $E[X^k]$ exists and is finite for an integer $k \ge 1$, then the derivatives of $\Phi_X$ up to order $k$ exist and are continuous, and
$$\Phi_X^{(k)}(0) = j^k E[X^k].$$
For a nonnegative integer-valued random variable $X$ it is often more convenient to work with the $z$ transform of the pmf, defined by
$$\Psi_X(z) = E[z^X] = \sum_{k=0}^{\infty} z^k p_X(k)$$
for real or complex $z$ with $|z| \le 1$. Two such random variables have the same probability distribution if and only if their $z$ transforms are equal. If $E[X^k]$ is finite it can be found from the derivatives of $\Psi_X$ up to the $k$th order at $z = 1$,
$$\Psi_X^{(k)}(1) = E[X(X - 1) \cdots (X - k + 1)].$$
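Two of the facts above, the Chebychev inequality and the factorial-moment formula for the $z$ transform, can be checked exactly on a small example. The pmf below is a hypothetical one chosen only for illustration, and all function names are assumptions; exact rational arithmetic is used so that the comparisons involve no rounding.

```python
from fractions import Fraction

# Hypothetical pmf on {0, 1, 2, 3}, chosen only for illustration.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(g):
    """E[g(X)] computed from the pmf (law of the unconscious statistician)."""
    return sum(g(k) * p for k, p in pmf.items())

def z_transform(z):
    """Psi_X(z) = E[z^X]."""
    return expectation(lambda k: z ** k)

def second_derivative_at_one(h=Fraction(1, 1000)):
    """Central second difference of Psi_X at z = 1.

    Psi_X is a cubic polynomial here, so this difference equals Psi_X''(1) exactly.
    """
    return (z_transform(1 + h) - 2 * z_transform(1) + z_transform(1 - h)) / h ** 2

mean = expectation(lambda k: k)
variance = expectation(lambda k: k * k) - mean ** 2
second_factorial_moment = expectation(lambda k: k * (k - 1))

def chebychev_holds(d):
    """Check P{|X - mean| >= d} <= variance / d^2 for one value of d > 0."""
    tail = sum(p for k, p in pmf.items() if abs(k - mean) >= d)
    return tail <= variance / d ** 2

if __name__ == "__main__":
    print("Psi''(1) =", second_derivative_at_one(), " E[X(X-1)] =", second_factorial_moment)
    print("Chebychev holds for d = 1/4, ..., 3:",
          all(chebychev_holds(Fraction(n, 4)) for n in range(1, 13)))
```

A finite-difference derivative is used only because it keeps the sketch short; for a pmf with larger support the second difference would be an approximation rather than an exact value.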
1.6 Frequently used distributions

The following is a list of the most basic and frequently used probability distributions. For each distribution an abbreviation, if any, and valid parameter values are given, followed by either the CDF, pdf or pmf, then the mean, variance, a typical example and significance of the distribution.

The constants $p$, $\lambda$, $\mu$, $\sigma$, $a$, $b$, and $\alpha$ are real-valued, and $n$ and $i$ are integer-valued, except $n$ can be noninteger-valued in the case of the gamma distribution.

Bernoulli: $Be(p)$, $0 \le p \le 1$
pmf: $p(i) = \begin{cases} p & i = 1 \\ 1 - p & i = 0 \\ 0 & \text{else} \end{cases}$
z-transform: $1 - p + pz$
mean: $p$    variance: $p(1 - p)$
Example: Number of heads appearing in one flip of a coin. The coin is called fair if $p = \frac{1}{2}$ and biased otherwise.

Binomial: $Bi(n, p)$, $n \ge 1$, $0 \le p \le 1$
pmf: $p(i) = \binom{n}{i} p^i (1 - p)^{n - i}$, $0 \le i \le n$
z-transform: $(1 - p + pz)^n$
mean: $np$    variance: $np(1 - p)$
Example: Number of heads appearing in $n$ independent flips of a coin.

Poisson: $Poi(\lambda)$, $\lambda \ge 0$
pmf: $p(i) = \dfrac{\lambda^i e^{-\lambda}}{i!}$, $i \ge 0$
z-transform: $\exp(\lambda(z - 1))$
mean: $\lambda$    variance: $\lambda$
Example: Number of phone calls placed during a ten second interval in a large city.
Significance: The Poisson pmf is the limit of the binomial pmf as $n \to +\infty$ and $p \to 0$ in such a way that $np \to \lambda$.

Geometric: $Geo(p)$, $0 < p \le 1$
pmf: $p(i) = (1 - p)^{i - 1} p$, $i \ge 1$
z-transform: $\dfrac{pz}{1 - z + pz}$
mean: $\dfrac{1}{p}$    variance: $\dfrac{1 - p}{p^2}$
Example: Number of independent flips of a coin until heads first appears.
Significant property: If $X$ has the geometric distribution, $P\{X > i\} = (1 - p)^i$ for integers $i \ge 1$. So $X$ has the memoryless property:
$$P\{X > i + j \mid X > i\} = P\{X > j\} \quad \text{for } i, j \ge 1.$$
Any positive integer-valued random variable with this property has a geometric distribution.

Gaussian (also called Normal): $N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma \ge 0$
pdf (if $\sigma^2 > 0$): $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$
pmf (if $\sigma^2 = 0$): $p(x) = \begin{cases} 1 & x = \mu \\ 0 & \text{else} \end{cases}$
characteristic function: $\exp\left(ju\mu - \dfrac{u^2\sigma^2}{2}\right)$
mean: $\mu$    variance: $\sigma^2$
Example: Instantaneous voltage difference (due to thermal noise) measured across a resistor held at a fixed temperature.
Notation: The character $\Phi$ is often used to denote the CDF of a $N(0, 1)$ random variable (as noted earlier, $\Phi$ is also used to denote characteristic functions; the meaning should be clear from the context), and $Q$ is often used for the complementary CDF:
$$Q(c) = 1 - \Phi(c) = \int_c^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\, dx$$
Significant property (Central limit theorem): If $X_1, X_2, \ldots$ are independent and identically distributed with mean $\mu$ and nonzero variance $\sigma^2$, then for any constant $c$,
$$\lim_{n \to \infty} P\left\{\frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n\sigma^2}} \le c\right\} = \Phi(c)$$

Exponential: $Exp(\lambda)$, $\lambda > 0$
pdf: $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$
characteristic function: $\dfrac{\lambda}{\lambda - ju}$
mean: $\dfrac{1}{\lambda}$    variance: $\dfrac{1}{\lambda^2}$
Example: Time elapsed between noon sharp and the first telephone call placed in a large city, on a given day.
Significance: If $X$ has the $Exp(\lambda)$ distribution, $P\{X \ge t\} = e^{-\lambda t}$ for $t \ge 0$. So $X$ has the memoryless property:
$$P\{X \ge s + t \mid X \ge s\} = P\{X \ge t\}, \quad s, t \ge 0.$$
Any nonnegative random variable with this property is exponentially distributed.

Uniform: $U(a, b)$, $-\infty < a < b < \infty$
pdf: $f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{else} \end{cases}$
characteristic function: $\dfrac{e^{jub} - e^{jua}}{ju(b - a)}$
mean: $\dfrac{a + b}{2}$    variance: $\dfrac{(b - a)^2}{12}$
Example: The phase difference between two independent oscillators operating at the same frequency may be modelled as uniformly distributed over $[0, 2\pi]$.
Significance: Uniform is uniform.

Gamma: $Gamma(n, \alpha)$, $n, \alpha > 0$ ($n$ real valued)
pdf: $f(x) = \dfrac{\alpha^n x^{n - 1} e^{-\alpha x}}{\Gamma(n)}$, $x \ge 0$, where $\Gamma(n) = \int_0^{\infty} s^{n - 1} e^{-s}\, ds$
characteristic function: $\left(\dfrac{\alpha}{\alpha - ju}\right)^n$
mean: $\dfrac{n}{\alpha}$    variance: $\dfrac{n}{\alpha^2}$
Significance: If $n$ is a positive integer then $\Gamma(n) = (n - 1)!$ and a $Gamma(n, \alpha)$ random variable has the same distribution as the sum of $n$ independent, $Exp(\alpha)$ distributed random variables.

Rayleigh: $Rayleigh(\sigma^2)$, $\sigma^2 > 0$
pdf: $f(r) = \dfrac{r}{\sigma^2} \exp\left(-\dfrac{r^2}{2\sigma^2}\right)$, $r > 0$
CDF: $1 - \exp\left(-\dfrac{r^2}{2\sigma^2}\right)$
mean: $\sigma\sqrt{\dfrac{\pi}{2}}$    variance: $\sigma^2\left(2 - \dfrac{\pi}{2}\right)$
Example: Instantaneous value of the envelope of a mean zero, narrow band noise signal.
Significance: If $X$ and $Y$ are independent, $N(0, \sigma^2)$ random variables, then $(X^2 + Y^2)^{\frac{1}{2}}$ has the Rayleigh($\sigma^2$) distribution. Also notable is the simple form of the CDF.
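The stated significance of the Poisson distribution, as the limit of the binomial pmf when $n \to \infty$ with $np \to \lambda$, can be illustrated numerically. The sketch below is only an illustration: the function names, the choice $\lambda = 2$, and the cutoff `imax` are assumptions, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(i, n, p):
    """Bi(n, p) pmf at i."""
    return math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)

def poisson_pmf(i, lam):
    """Poi(lam) pmf at i."""
    return lam ** i * math.exp(-lam) / math.factorial(i)

def max_pmf_gap(n, lam, imax=20):
    """Largest |Bi(n, lam/n) pmf - Poi(lam) pmf| over i = 0..min(n, imax)."""
    p = lam / n
    return max(abs(binomial_pmf(i, n, p) - poisson_pmf(i, lam))
               for i in range(min(n, imax) + 1))

if __name__ == "__main__":
    lam = 2.0  # arbitrary illustrative value
    for n in (10, 100, 1000, 10000):
        print(n, max_pmf_gap(n, lam))
```

The printed gap should shrink roughly in proportion to $1/n$, which is consistent with the limit statement but is of course not a proof of it.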
1.7 Jointly distributed random variables

Let $X_1, X_2, \ldots, X_m$ be random variables on a single probability space $(\Omega, \mathcal{F}, P)$. The joint cumulative distribution function (CDF) is the function on $\mathbb{R}^m$ defined by
$$F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_m \le x_m\}$$
The CDF determines the probabilities of all events concerning $X_1, \ldots, X_m$. For example, if $R$ is the rectangular region $(a, b] \times (a', b']$ in the plane, then
$$P\{(X_1, X_2) \in R\} = F_{X_1 X_2}(b, b') - F_{X_1 X_2}(a, b') - F_{X_1 X_2}(b, a') + F_{X_1 X_2}(a, a')$$
We write $+\infty$ as an argument of $F_X$ in place of $x_i$ to denote the limit as $x_i \to +\infty$. By the countable additivity axiom of probability,
$$F_{X_1 X_2}(x_1, +\infty) = \lim_{x_2 \to \infty} F_{X_1 X_2}(x_1, x_2) = F_{X_1}(x_1)$$

The random variables are jointly continuous if there exists a function $f_{X_1 X_2 \cdots X_m}$, called the joint probability density function (pdf), such that
$$F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_m} f_{X_1 X_2 \cdots X_m}(u_1, \ldots, u_m)\, du_m \cdots du_1.$$
Note that if $X_1$ and $X_2$ are jointly continuous, then
$$F_{X_1}(x_1) = F_{X_1 X_2}(x_1, +\infty) = \int_{-\infty}^{x_1} \left[ \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\, du_2 \right] du_1,$$
so that $X_1$ has pdf given by
$$f_{X_1}(u_1) = \int_{-\infty}^{\infty} f_{X_1 X_2}(u_1, u_2)\, du_2.$$

If $X_1, X_2, \ldots, X_m$ are each discrete random variables, then they have a joint pmf $p_{X_1 X_2 \cdots X_m}$ defined by
$$p_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = P[\{X_1 = u_1\} \cap \{X_2 = u_2\} \cap \cdots \cap \{X_m = u_m\}]$$
The sum of the probability masses is one, and for any subset $A$ of $\mathbb{R}^m$
$$P\{(X_1, \ldots, X_m) \in A\} = \sum_{(u_1, \ldots, u_m) \in A} p_X(u_1, u_2, \ldots, u_m)$$
The joint pmf of subsets of $X_1, \ldots, X_m$ can be obtained by summing out the other coordinates of the joint pmf. For example,
$$p_{X_1}(u_1) = \sum_{u_2} p_{X_1 X_2}(u_1, u_2)$$

The joint characteristic function of $X_1, \ldots, X_m$ is the function on $\mathbb{R}^m$ defined by
$$\Phi_{X_1 X_2 \cdots X_m}(u_1, u_2, \ldots, u_m) = E[e^{j(X_1 u_1 + X_2 u_2 + \cdots + X_m u_m)}]$$

Random variables $X_1, \ldots, X_m$ are defined to be independent if for any Borel subsets $A_1, \ldots, A_m$ of $\mathbb{R}$, the events $\{X_1 \in A_1\}, \ldots, \{X_m \in A_m\}$ are independent. The random variables are independent if and only if the joint CDF factors.
$$F_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m) = F_{X_1}(x_1) \cdots F_{X_m}(x_m)$$
If the random variables are jointly continuous, independence is equivalent to the condition that the joint pdf factors. If the random variables are discrete, independence is equivalent to the condition that the joint pmf factors. Similarly, the random variables are independent if and only if the joint characteristic function factors.
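1.8 Cross moments of random variables

Let $X$ and $Y$ be random variables on the same probability space with finite second moments. Three important related quantities are:

the correlation: $E[XY]$
the covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
the correlation coefficient: $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$

A fundamental inequality is Schwarz's inequality:
$$|E[XY]| \le \sqrt{E[X^2] E[Y^2]} \tag{1.15}$$
Furthermore, if $E[Y^2] \ne 0$, equality holds if and only if $P[X = cY] = 1$ for some constant $c$. Schwarz's inequality (1.15) is equivalent to the $L^2$ triangle inequality for random variables:
$$E[(X + Y)^2]^{\frac{1}{2}} \le E[X^2]^{\frac{1}{2}} + E[Y^2]^{\frac{1}{2}} \tag{1.16}$$
Schwarz's inequality can be proved as follows. If $P\{Y = 0\} = 1$ the inequality is trivial, so suppose $E[Y^2] > 0$. By the inequality $(a + b)^2 \le 2a^2 + 2b^2$ it follows that $E[(X - \lambda Y)^2] < \infty$ for any constant $\lambda$. Take $\lambda = E[XY]/E[Y^2]$ and note that
$$0 \le E[(X - \lambda Y)^2] = E[X^2] - 2\lambda E[XY] + \lambda^2 E[Y^2] = E[X^2] - \frac{E[XY]^2}{E[Y^2]},$$
which is clearly equivalent to the Schwarz inequality. If $P[X = cY] = 1$ for some $c$ then equality holds in (1.15), and conversely, if equality holds in (1.15) then $P[X = cY] = 1$ for $c = \lambda$.

Application of Schwarz's inequality to $X - E[X]$ and $Y - E[Y]$ in place of $X$ and $Y$ yields that
$$|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$$
Furthermore, if $\mathrm{Var}(Y) \ne 0$ then equality holds if and only if $X = aY + b$ for some constants $a$ and $b$. Consequently, if $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ are not zero, so that the correlation coefficient $\rho_{XY}$ is well defined, then $|\rho_{XY}| \le 1$ with equality if and only if $X = aY + b$ for some constants $a$, $b$.

The following alternative expressions for $\mathrm{Cov}(X, Y)$ are often useful in calculations:
$$\mathrm{Cov}(X, Y) = E[X(Y - E[Y])] = E[(X - E[X])Y] = E[XY] - E[X]E[Y]$$
In particular, if either $X$ or $Y$ has mean zero then $E[XY] = \mathrm{Cov}(X, Y)$.

Random variables $X$ and $Y$ are called orthogonal if $E[XY] = 0$ and are called uncorrelated if $\mathrm{Cov}(X, Y) = 0$. If $X$ and $Y$ are independent then they are uncorrelated. The converse is far from true. Independence requires a large number of equations to be true, namely $F_{XY}(x, y) = F_X(x) F_Y(y)$ for every real value of $x$ and $y$. The condition of being uncorrelated involves only a single equation to hold.

Covariance generalizes variance, in that $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$. Covariance is linear in each of its two arguments:
$$\mathrm{Cov}(X + Y, U + V) = \mathrm{Cov}(X, U) + \mathrm{Cov}(X, V) + \mathrm{Cov}(Y, U) + \mathrm{Cov}(Y, V)$$
$$\mathrm{Cov}(aX + b, cY + d) = ac\,\mathrm{Cov}(X, Y)$$
for constants $a$, $b$, $c$, $d$. For example, consider the sum $S_m = X_1 + \cdots + X_m$, such that $X_1, \cdots, X_m$ are (pairwise) uncorrelated with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$ for $1 \le i \le m$. Then $E[S_m] = m\mu$ and
$$\mathrm{Var}(S_m) = \mathrm{Cov}(S_m, S_m) = \sum_i \mathrm{Var}(X_i) + \sum_{i, j : i \ne j} \mathrm{Cov}(X_i, X_j) = m\sigma^2.$$
Therefore, $\dfrac{S_m - m\mu}{\sqrt{m\sigma^2}}$ has mean zero and variance one.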
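The remark that uncorrelated random variables need not be independent can be made concrete with a small discrete example. The example below ($X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$) is a standard illustration and is not taken from the notes; the function names are assumptions.

```python
from fractions import Fraction

# Illustrative example: X uniform on {-1, 0, 1} and Y = X^2.
joint_pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expectation(g):
    """E[g(X, Y)] computed from the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

def covariance():
    """Cov(X, Y) = E[XY] - E[X]E[Y]."""
    ex = expectation(lambda x, y: x)
    ey = expectation(lambda x, y: y)
    return expectation(lambda x, y: x * y) - ex * ey

def marginal(index):
    """Marginal pmf of coordinate `index`, obtained by summing out the other one."""
    out = {}
    for point, p in joint_pmf.items():
        out[point[index]] = out.get(point[index], 0) + p
    return out

def is_independent():
    """True if the joint pmf equals the product of the marginals at every point."""
    px, py = marginal(0), marginal(1)
    return all(joint_pmf.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

if __name__ == "__main__":
    print("Cov(X, Y) =", covariance())
    print("independent:", is_independent())
```

Here the covariance is exactly zero, yet the joint pmf does not factor, since for instance $P\{X = -1, Y = 1\} = 1/3$ while $P\{X = -1\}P\{Y = 1\} = 2/9$.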
1.9 Conditional densities

Suppose that $X$ and $Y$ have a joint pdf $f_{XY}$. Recall that the pdf $f_Y$, the second marginal density of $f_{XY}$, is given by
$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx$$
The conditional pdf of $X$ given $Y$, denoted by $f_{X|Y}(x \mid y)$, is undefined if $f_Y(y) = 0$. It is defined for $y$ such that $f_Y(y) > 0$ by
$$f_{X|Y}(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)}, \qquad -\infty < x < +\infty$$
If $y$ is fixed and $f_Y(y) > 0$, then as a function of $x$, $f_{X|Y}(x \mid y)$ is itself a pdf.

The expectation of the conditional pdf is called the conditional expectation (or conditional mean) of $X$ given $Y = y$, written as
$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\, dx$$
If the deterministic function $E[X \mid Y = y]$ is applied to the random variable $Y$, the result is a random variable denoted by $E[X \mid Y]$.

Note that conditional pdf and conditional expectation were so far defined in case $X$ and $Y$ have a joint pdf. If instead, $X$ and $Y$ are both discrete random variables, the conditional pmf $p_{X|Y}$ and the conditional expectation $E[X \mid Y = y]$ can be defined in a similar way. More general notions of conditional expectation are considered in a later chapter.

1.10 Transformation of random vectors

A random vector $X$ of dimension $m$ has the form
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$$
where $X_1, \ldots, X_m$ are random variables. The joint distribution of $X_1, \ldots, X_m$ can be considered to be the distribution of the vector $X$. For example, if $X_1, \ldots, X_m$ are jointly continuous, the joint pdf $f_{X_1 X_2 \cdots X_m}(x_1, \ldots, x_m)$ can as well be written as $f_X(x)$, and be thought of as the pdf of the random vector $X$.

Let $X$ be a continuous type random vector on $\mathbb{R}^n$. Let $g$ be a one-to-one mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$. Think of $g$ as mapping $x$-space (here $x$ is lower case, representing a coordinate value) into $y$-space. As $x$ varies over $\mathbb{R}^n$, $y$ varies over the range of $g$. All the while, $y = g(x)$ or, equivalently, $x = g^{-1}(y)$.

Suppose that the Jacobian matrix of derivatives $\frac{\partial y}{\partial x}(x)$ is continuous in $x$ and nonsingular for all $x$. By the inverse function theorem of vector calculus, it follows that the Jacobian matrix of the inverse mapping (from $y$ to $x$) exists and satisfies $\frac{\partial x}{\partial y}(y) = \left(\frac{\partial y}{\partial x}(x)\right)^{-1}$. Use $|K|$ for a square matrix $K$ to denote $|\det(K)|$.
Proposition 1.10.1 Under the above assumptions, $Y = g(X)$ is a continuous type random vector and for $y$ in the range of $g$:
$$f_Y(y) = \frac{f_X(x)}{\left|\frac{\partial y}{\partial x}(x)\right|} = f_X(x) \left|\frac{\partial x}{\partial y}(y)\right|$$

Example 1.10.2 Let $U$, $V$ have the joint pdf:
$$f_{UV}(u, v) = \begin{cases} u + v & 0 \le u, v \le 1 \\ 0 & \text{else} \end{cases}$$
and let $X = U^2$ and $Y = U(1 + V)$. Let's find the pdf $f_{XY}$. The vector $(U, V)$ in the $u-v$ plane is transformed into the vector $(X, Y)$ in the $x-y$ plane under a mapping $g$ that maps $u, v$ to $x = u^2$ and $y = u(1 + v)$. The image in the $x-y$ plane of the square $[0, 1]^2$ in the $u-v$ plane is the set $A$ given by
$$A = \{(x, y) : 0 \le x \le 1, \text{ and } \sqrt{x} \le y \le 2\sqrt{x}\}$$
See Figure 1.13. The mapping from the square is one to one, for if $(x, y) \in A$ then $(u, v)$ can be recovered by $u = \sqrt{x}$ and $v = \frac{y}{\sqrt{x}} - 1$. The Jacobian determinant is
$$\begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} 2u & 0 \\ 1 + v & u \end{vmatrix} = 2u^2$$
Therefore, using the transformation formula and expressing $u$ and $v$ in terms of $x$ and $y$ yields
$$f_{XY}(x, y) = \begin{cases} \dfrac{\sqrt{x} + \left(\frac{y}{\sqrt{x}} - 1\right)}{2x} & \text{if } (x, y) \in A \\ 0 & \text{else} \end{cases}$$

Figure 1.13: Transformation from the $u-v$ plane to the $x-y$ plane.

Example 1.10.3 Let $U$ and $V$ be independent continuous type random variables. Let $X = U + V$ and $Y = V$. Let us find the joint density of $X$, $Y$ and the marginal density of $X$. The mapping
$$g : \begin{pmatrix} u \\ v \end{pmatrix} \to \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} u + v \\ v \end{pmatrix}$$
is invertible, with inverse given by $u = x - y$ and $v = y$. The absolute value of the Jacobian determinant is given by
$$\left| \begin{matrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{matrix} \right| = \left| \begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix} \right| = 1$$
Therefore
$$f_{XY}(x, y) = f_{UV}(u, v) = f_U(x - y) f_V(y)$$
The marginal density of $X$ is given by
$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dy = \int_{-\infty}^{\infty} f_U(x - y) f_V(y)\, dy$$
That is $f_X = f_U * f_V$.

Example 1.10.4 Let $X_1$ and $X_2$ be independent $N(0, \sigma^2)$ random variables, and let $X = (X_1, X_2)^T$ denote the two-dimensional random vector with coordinates $X_1$ and $X_2$. Any point of $x \in \mathbb{R}^2$ can be represented in polar coordinates by the vector $(r, \theta)^T$ such that $r = \|x\| = (x_1^2 + x_2^2)^{\frac{1}{2}}$ and $\theta = \tan^{-1}(\frac{x_2}{x_1})$ with values $r \ge 0$ and $0 \le \theta < 2\pi$. The inverse of this mapping is given by
$$x_1 = r\cos(\theta), \qquad x_2 = r\sin(\theta)$$
We endeavor to find the pdf of the random vector $(R, \Theta)^T$, the polar coordinates of $X$. The pdf of $X$ is given by
$$f_X(x) = f_{X_1}(x_1) f_{X_2}(x_2) = \frac{1}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}$$
The range of the mapping is the set $r > 0$ and $0 < \theta \le 2\pi$. On the range,
$$\left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \left| \begin{matrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta} \\ \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta} \end{matrix} \right| = \left| \begin{matrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{matrix} \right| = r$$
Therefore for $(r, \theta)^T$ in the range of the mapping,
$$f_{R, \Theta}(r, \theta) = f_X(x) \left| \frac{\partial x}{\partial \binom{r}{\theta}} \right| = \frac{r}{2\pi\sigma^2} e^{-\frac{r^2}{2\sigma^2}}$$
Of course $f_{R, \Theta}(r, \theta) = 0$ off the range of the mapping. The joint density factors into a function of $r$ and a function of $\theta$, so $R$ and $\Theta$ are independent. Moreover, $R$ has the Rayleigh density with parameter $\sigma^2$, and $\Theta$ is uniformly distributed on $[0, 2\pi]$.
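The conclusion of Example 1.10.4 can be checked by simulation. The following is a rough Monte Carlo sketch, not a derivation: the function names, the seed, the sample size, and the value $\sigma = 1.5$ are arbitrary assumptions, and `math.atan2` reduced modulo $2\pi$ is used to obtain an angle in $[0, 2\pi)$.

```python
import math
import random

def polar_samples(sigma, n, rng):
    """Draw n pairs of independent N(0, sigma^2) variables and return
    their polar coordinates (r, theta), with theta reduced modulo 2*pi."""
    out = []
    for _ in range(n):
        x1 = rng.gauss(0.0, sigma)
        x2 = rng.gauss(0.0, sigma)
        r = math.hypot(x1, x2)
        theta = math.atan2(x2, x1) % (2.0 * math.pi)
        out.append((r, theta))
    return out

def rayleigh_cdf(r, sigma):
    """CDF of the Rayleigh(sigma^2) distribution from Section 1.6."""
    return 1.0 - math.exp(-r * r / (2.0 * sigma * sigma)) if r > 0 else 0.0

def empirical_cdf(values, c):
    """Fraction of sample values that are <= c."""
    return sum(1 for v in values if v <= c) / len(values)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    sigma = 1.5             # illustrative value
    samples = polar_samples(sigma, 40_000, rng)
    radii = [r for r, _ in samples]
    angles = [t for _, t in samples]
    for c in (0.5, 1.5, 3.0):
        print(c, empirical_cdf(radii, c), rayleigh_cdf(c, sigma))
    print("fraction of angles <= pi:", empirical_cdf(angles, math.pi))
```

The empirical CDF of the radius should track the Rayleigh CDF, and about half of the angles should fall below $\pi$, as expected for an angle uniformly distributed over $[0, 2\pi]$.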
1.11 Problems

1.1 Simple events
A register contains 8 random binary digits which are mutually independent. Each digit is a zero or a one with equal probability.
(a) Describe an appropriate probability space $(\Omega, \mathcal{F}, P)$ corresponding to looking at the contents of the register.
(b) Express each of the following four events explicitly as subsets of $\Omega$, and find their probabilities:
E1 = "No two neighboring digits are the same"
E2 = "Some cyclic shift of the register contents is equal to 01100110"
E3 = "The register contains exactly four zeros"
E4 = "There is a run of at least six consecutive ones"
(c) Find $P[E_1 \mid E_3]$ and $P[E_2 \mid E_3]$.

1.2 Independent vs. mutually exclusive
(a) Suppose that an event $E$ is independent of itself. Show that either $P[E] = 0$ or $P[E] = 1$.
(b) Events $A$ and $B$ have probabilities $P[A] = 0.3$ and $P[B] = 0.4$. What is $P[A \cup B]$ if $A$ and $B$ are independent? What is $P[A \cup B]$ if $A$ and $B$ are mutually exclusive?
(c) Now suppose that $P[A] = 0.6$ and $P[B] = 0.8$. In this case, could the events $A$ and $B$ be independent? Could they be mutually exclusive?

1.3 Congestion at output ports
Consider a packet switch with some number of input ports and eight output ports. Suppose four packets simultaneously arrive on different input ports, and each is routed toward an output port. Assume the choices of output ports are mutually independent, and for each packet, each output port has equal probability.
(a) Specify a probability space $(\Omega, \mathcal{F}, P)$ to describe this situation.
(b) Let $X_i$ denote the number of packets routed to output port $i$ for $1 \le i \le 8$. Describe the joint pmf of $X_1, \ldots, X_8$.
(c) Find $\mathrm{Cov}(X_1, X_2)$.
(d) Find $P\{X_i \le 1 \text{ for all } i\}$.
(e) Find $P\{X_i \le 2 \text{ for all } i\}$.

1.4 Frantic search
At the end of each day Professor Plum puts her glasses in her drawer with probability .90, leaves them on the table with probability .06, leaves them in her briefcase with probability 0.03, and she actually leaves them at the office with probability 0.01. The next morning she has no recollection of where she left the glasses. She looks for them, but each time she looks in a place the glasses are actually located, she misses finding them with probability 0.1, whether or not she already looked in the same place. (After all, she doesn't have her glasses on and she is in a hurry.)
(a) Given that Professor Plum didn't find the glasses in her drawer after looking one time, what is the conditional probability the glasses are on the table?
(b) Given that she didn't find the glasses after looking for them in the drawer and on the table once each, what is the conditional probability they are in the briefcase?
(c) Given that she failed to find the glasses after looking in the drawer twice, on the table twice, and in the briefcase once, what is the conditional probability she left the glasses at the office?

1.5 Conditional probability of failed device given failed attempts
A particular webserver may be working or not working. If the webserver is not working, any attempt to access it fails. Even if the webserver is working, an attempt to access it can fail due to network congestion beyond the control of the webserver. Suppose that the a priori probability that the server is working is 0.8. Suppose that if the server is working, then each access attempt is successful with probability 0.9, independently of other access attempts. Find the following quantities.
(a) $P[\text{first access attempt fails}]$
(b) $P[\text{server is working} \mid \text{first access attempt fails}]$
(c) $P[\text{second access attempt fails} \mid \text{first access attempt fails}]$
(d) $P[\text{server is working} \mid \text{first and second access attempts fail}]$

1.6 Conditional probabilities - basic computations of iterative decoding
(a) Suppose $B_1, \ldots, B_n, Y_1, \ldots, Y_n$ are discrete random variables with joint pmf
$$p(b_1, \ldots, b_n, y_1, \ldots, y_n) = \begin{cases} 2^{-n} \prod_{i=1}^{n} q_i(y_i \mid b_i) & \text{if } b_i \in \{0, 1\} \text{ for } 1 \le i \le n \\ 0 & \text{else} \end{cases}$$
where $q_i(y_i \mid b_i)$ as a function of $y_i$ is a pmf for $b_i \in \{0, 1\}$. Finally, let $B = B_1 \oplus \cdots \oplus B_n$ represent the modulo two sum of $B_1, \cdots, B_n$. Thus, the ordinary sum of the $n + 1$ random variables $B_1, \ldots, B_n, B$ is even. Express $P[B = 1 \mid Y_1 = y_1, \cdots, Y_n = y_n]$ in terms of the $y_i$ and the functions $q_i$. Simplify your answer.
(b) Suppose $B$ and $Z_1, \ldots, Z_k$ are discrete random variables with joint pmf
$$p(b, z_1, \ldots, z_k) = \begin{cases} \frac{1}{2} \prod_{j=1}^{k} r_j(z_j \mid b) & \text{if } b \in \{0, 1\} \\ 0 & \text{else} \end{cases}$$
where $r_j(z_j \mid b)$ as a function of $z_j$ is a pmf for $b \in \{0, 1\}$ fixed. Express $P[B = 1 \mid Z_1 = z_1, \ldots, Z_k = z_k]$ in terms of the $z_j$ and the functions $r_j$.

1.7 Conditional lifetimes and the memoryless property of the geometric distribution
(a) Let $X$ represent the lifetime, rounded up to an integer number of years, of a certain car battery. Suppose that the pmf of $X$ is given by $p_X(k) = 0.2$ if $3 \le k \le 7$ and $p_X(k) = 0$ otherwise.
(i) Find the probability, $P\{X > 3\}$, that a three year old battery is still working.
(ii) Given that the battery is still working after five years, what is the conditional probability that the battery will still be working three years later? (i.e. what is $P[X > 8 \mid X > 5]$?)
(b) A certain Illini basketball player shoots the ball repeatedly from half court during practice. Each shot is a success with probability $p$ and a miss with probability $1 - p$, independently of the outcomes of previous shots. Let $Y$ denote the number of shots required for the first success.
(i) Express the probability that she needs more than three shots for a success, $P\{Y > 3\}$, in terms of $p$.
(ii) Given that she already missed the first five shots, what is the conditional probability that she will need more than three additional shots for a success? (i.e. what is $P[Y > 8 \mid Y > 5]$?)
(iii) What type of probability distribution does $Y$ have?

1.8 Blue corners
Suppose each corner of a cube is colored blue, independently of the other corners, with some probability $p$. Let $B$ denote the event that at least one face of the cube has all four corners colored blue.
(a) Find the conditional probability of $B$ given that exactly five corners of the cube are colored blue.
(b) Find $P[B]$, the unconditional probability of $B$.

1.9 Distribution of the flow capacity of a network
A communication network is shown. The link capacities in megabits per second (Mbps) are given by $C_1 = C_3 = 5$, $C_2 = C_5 = 10$ and $C_4 = 8$, and are the same in each direction. Information flow from the source to the destination can be split among multiple paths. For example, if all links are working, then the maximum communication rate is 10 Mbps: 5 Mbps can be routed over links 1 and 2, and 5 Mbps can be routed over links 3 and 5. Let $F_i$ be the event that link $i$ fails. Suppose that $F_1$, $F_2$, $F_3$, $F_4$ and $F_5$ are independent and $P[F_i] = 0.2$ for each $i$. Let $X$ be defined as the maximum rate (in Mbits per second) at which data can be sent from the source node to the destination node. Find the pmf $p_X$.

[Figure: network with a source node, a destination node, and links numbered 1 through 5.]

1.10 Recognizing cumulative distribution functions
Which of the following are valid CDF's? For each that is not valid, state at least one reason why. For each that is valid, find $P\{X^2 > 5\}$.
$$F_1(x) = \begin{cases} \frac{e^{-x^2}}{4} & x < 0 \\ 1 - \frac{e^{-x^2}}{4} & x \ge 0 \end{cases} \qquad F_2(x) = \begin{cases} 0 & x < 0 \\ 0.5 + e^{-x} & 0 \le x < 3 \\ 1 & x \ge 3 \end{cases} \qquad F_3(x) = \begin{cases} 0 & x \le 0 \\ 0.5 + \frac{x}{20} & 0 < x \le 10 \\ 1 & x \ge 10 \end{cases}$$

1.11 A CDF of mixed type
Let $X$ have the CDF shown.

[Figure: plot of the CDF $F_X$, with vertical-axis ticks at 0.5 and 1.0 and horizontal-axis ticks at 0, 1, 2.]

(a) Find $P\{X \le 0.8\}$.
(b) Find $E[X]$.
(c) Find $\mathrm{Var}(X)$.

1.12 CDF and characteristic function of a mixed type random variable
Let $X = (U - 0.5)_+$, where $U$ is uniformly distributed over the interval $[0, 1]$. That is, $X = U - 0.5$ if $U - 0.5 \ge 0$, and $X = 0$ if $U - 0.5 < 0$.
(a) Find and carefully sketch the CDF $F_X$. In particular, what is $F_X(0)$?
(b) Find the characteristic function $\Phi_X(u)$ for real values of $u$.

1.13 Poisson and geometric random variables with conditioning
Let $Y$ be a Poisson random variable with mean $\mu > 0$ and let $Z$ be a geometrically distributed random variable with parameter $p$ with $0 < p < 1$. Assume $Y$ and $Z$ are independent.
(a) Find $P\{Y < Z\}$. Express your answer as a simple function of $\mu$ and $p$.
(b) Find $P[Y < Z \mid Z = i]$ for $i \ge 1$. (Hint: This is a conditional probability for events.)
(c) Find $P[Y = i \mid Y < Z]$ for $i \ge 0$. Express your answer as a simple function of $p$, $\mu$ and $i$. (Hint: This is a conditional probability for events.)
(d) Find $E[Y \mid Y < Z]$, which is the expected value computed according to the conditional distribution found in part (c). Express your answer as a simple function of $\mu$ and $p$.

1.14 Conditional expectation for uniform density over a triangular region
Let $(X, Y)$ be uniformly distributed over the triangle with coordinates $(0, 0)$, $(1, 0)$, and $(2, 1)$.
(a) What is the value of the joint pdf inside the triangle?
(b) Find the marginal density of $X$, $f_X(x)$. Be sure to specify your answer for all real values of $x$.
(c) Find the conditional density function $f_{Y|X}(y \mid x)$. Be sure to specify which values of $x$ the conditional density is well defined for, and for such $x$ specify the conditional density for all $y$. Also, for such $x$ briefly describe the conditional density of $y$ in words.
(d) Find the conditional expectation $E[Y \mid X = x]$. Be sure to specify which values of $x$ this conditional expectation is well defined for.

1.15 Transformation of a random variable
Let $X$ be exponentially distributed with mean $\lambda^{-1}$. Find and carefully sketch the distribution functions for the random variables $Y = \exp(X)$ and $Z = \min(X, 3)$.
1.16 Density of a function of a random variable

Suppose $X$ is a random variable with probability density function
$$f_X(x) = \begin{cases} 2x & 0 \le x \le 1 \\ 0 & \text{else} \end{cases}$$

(a) Find $P[X \ge 0.4 \mid X \le 0.8]$.
(b) Find the density function of $Y$ defined by $Y = -\log(X)$.

1.17 Moments and densities of functions of a random variable

Suppose the length $L$ and width $W$ of a rectangle are independent and each uniformly distributed over the interval $[0, 1]$. Let $C = 2L + 2W$ (the length of the perimeter) and $A = LW$ (the area). Find the means, variances, and probability densities of $C$ and $A$.

1.18 Functions of independent exponential random variables

Let $X_1$ and $X_2$ be independent random variables, with $X_i$ being exponentially distributed with parameter $\lambda_i$.

(a) Find the pdf of $Z = \min\{X_1, X_2\}$.
(b) Find the pdf of $R = \frac{X_1}{X_2}$.

1.19 Using the Gaussian Q function

Express each of the given probabilities in terms of the standard Gaussian complementary CDF $Q$.

(a) $P\{X \ge 16\}$, where $X$ has the $N(10, 9)$ distribution.
(b) $P\{X^2 \ge 16\}$, where $X$ has the $N(10, 9)$ distribution.
(c) $P\{|X - 2Y| > 1\}$, where $X$ and $Y$ are independent, $N(0, 1)$ random variables. (Hint: Linear combinations of independent Gaussian random variables are Gaussian.)

1.20 Gaussians and the Q function

Let $X$ and $Y$ be independent, $N(0, 1)$ random variables.

(a) Find $\mathrm{Cov}(3X + 2Y, X + 5Y + 10)$.
(b) Express $P\{X + 4Y \ge 2\}$ in terms of the $Q$ function.
(c) Express $P\{(X - Y)^2 > 9\}$ in terms of the $Q$ function.

1.21 Correlation of histogram values

Suppose that $n$ fair dice are independently rolled. Let
$$X_i = \begin{cases} 1 & \text{if a 1 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases} \qquad Y_i = \begin{cases} 1 & \text{if a 2 shows on the } i\text{th roll} \\ 0 & \text{else} \end{cases}$$
Let $X$ denote the sum of the $X_i$'s, which is simply the number of 1's rolled. Let $Y$ denote the sum of the $Y_i$'s, which is simply the number of 2's rolled. Note that if a histogram is made recording the number of occurrences of each of the six numbers, then $X$ and $Y$ are the heights of the first two entries in the histogram.

(a) Find $E[X_1]$ and $\mathrm{Var}(X_1)$.
(b) Find $E[X]$ and $\mathrm{Var}(X)$.
(c) Find $\mathrm{Cov}(X_i, Y_j)$ if $1 \le i, j \le n$. (Hint: Does it make a difference if $i = j$?)
(d) Find $\mathrm{Cov}(X, Y)$ and the correlation coefficient $\rho(X, Y) = \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$.
(e) Find $E[Y \mid X = x]$ for any integer $x$ with $0 \le x \le n$. Note that your answer should depend on $x$ and $n$, but otherwise your answer is deterministic.

1.22 Working with a joint density

Suppose $X$ and $Y$ have joint density function $f_{X,Y}(x, y) = c(1 + xy)$ if $2 \le x \le 3$ and $1 \le y \le 2$, and $f_{X,Y}(x, y) = 0$ otherwise.

(a) Find $c$.
(b) Find $f_X$ and $f_Y$.
(c) Find $f_{X|Y}$.

1.23 A function of jointly distributed random variables

Suppose $(U, V)$ is uniformly distributed over the square with corners $(0,0)$, $(1,0)$, $(1,1)$, and $(0,1)$, and let $X = UV$. Find the CDF and pdf of $X$.

1.24 Density of a difference

Let $X$ and $Y$ be independent, exponentially distributed random variables with parameter $\lambda$, such that $\lambda > 0$. Find the pdf of $Z = |X - Y|$.

1.25 Working with a two dimensional density

Let the random variables $X$ and $Y$ be jointly uniformly distributed over the region shown.

[Figure: region in the plane; one axis marked at 0 and 1, the other at 0, 1, 2 and 3.]

(a) Determine the value of $f_{X,Y}$ on the region shown.
(b) Find $f_X$, the marginal pdf of $X$.
(c) Find the mean and variance of $X$.
(d) Find the conditional pdf of $Y$ given that $X = x$, for $0 \le x \le 1$.
(e) Find the conditional pdf of $Y$ given that $X = x$, for $1 \le x \le 2$.
(f) Find and sketch $E[Y \mid X = x]$ as a function of $x$. Be sure to specify which range of $x$ this conditional expectation is well defined for.

1.26 Some characteristic functions

Find the mean and variance of random variables with the following characteristic functions: (a) $\Phi(u) = \exp(-5u^2 + 2ju)$, (b) $\Phi(u) = (e^{ju} - 1)/ju$, and (c) $\Phi(u) = \exp(\lambda(e^{ju} - 1))$.

1.27 Uniform density over a union of two square regions

Let the random variables $X$ and $Y$ be jointly uniformly distributed on the region $\{0 \le u \le 1, 0 \le v \le 1\} \cup \{-1 \le u < 0, -1 \le v < 0\}$.

(a) Determine the value of $f_{XY}$ on the region shown.
(b) Find $f_X$, the marginal pdf of $X$.
(c) Find the conditional pdf of $Y$ given that $X = a$, for $0 < a \le 1$.
(d) Find the conditional pdf of $Y$ given that $X = a$, for $-1 \le a < 0$.
(e) Find $E[Y \mid X = a]$ for $|a| \le 1$.
(f) What is the correlation coefficient of $X$ and $Y$?
(g) Are $X$ and $Y$ independent?
(h) What is the pdf of $Z = X + Y$?
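1.28 A transformation of jointly continuous random variables

Suppose $(U, V)$ has joint pdf
$$f_{U,V}(u, v) = \begin{cases} 9u^2v^2 & \text{if } 0 \le u \le 1 \text{ and } 0 \le v \le 1 \\ 0 & \text{else} \end{cases}$$
Let $X = 3U$ and $Y = UV$.

(a) Find the joint pdf of $X$ and $Y$, being sure to specify where the joint pdf is zero.
(b) Using the joint pdf of $X$ and $Y$, find the conditional pdf, $f_{Y|X}(y|x)$, of $Y$ given $X$. (Be sure to indicate which values of $x$ the conditional pdf is well defined for, and for each such $x$ specify the conditional pdf for all real values of $y$.)

1.29 Transformation of densities

Let $U$ and $V$ have the joint pdf:
$$f_{UV}(u, v) = \begin{cases} c(u - v)^2 & 0 \le u, v \le 1 \\ 0 & \text{else} \end{cases}$$
for some constant $c$.

(a) Find the constant $c$.
(b) Suppose $X = U^2$ and $Y = U^2V^2$. Describe the joint pdf $f_{X,Y}(x, y)$ of $X$ and $Y$. Be sure to indicate where the joint pdf is zero.

1.30 Jointly distributed variables

Let $U$ and $V$ be independent random variables, such that $U$ is uniformly distributed over the interval $[0, 1]$, and $V$ has the exponential probability density function [...].

(a) Calculate $E\left[\frac{V^2}{1+U}\right]$.
(b) Calculate $P\{U \le V\}$.
(c) Find the joint probability density function of $Y$ and $Z$, where $Y = U^2$ and $Z = UV$.

1.31 * Why not every set has a length

Suppose a length (actually, "one-dimensional volume" would be a better name) of any subset $A \subset \mathbb{R}$ could be defined, so that the following axioms are satisfied:

L0: $0 \le \mathrm{length}(A) \le \infty$ for any $A \subset \mathbb{R}$
L1: $\mathrm{length}([a, b]) = b - a$ for $a < b$
L2: $\mathrm{length}(A) = \mathrm{length}(A + y)$, for any $A \subset \mathbb{R}$ and $y \in \mathbb{R}$, where $A + y$ represents the translation of $A$ by $y$, defined by $A + y = \{x + y : x \in A\}$
L3: If $A = \cup_{i=1}^{\infty} B_i$ such that $B_1, B_2, \cdots$ are disjoint, then $\mathrm{length}(A) = \sum_{i=1}^{\infty} \mathrm{length}(B_i)$.

The purpose of this problem is to show that the above supposition leads to a contradiction. Let $\mathbb{Q}$ denote the set of rational numbers, $\mathbb{Q} = \{p/q : p, q \in \mathbb{Z}, q \ne 0\}$.

(a) Show that the set of rational numbers can be expressed as $\mathbb{Q} = \{q_1, q_2, \ldots\}$, which means that $\mathbb{Q}$ is countably infinite.

Say that $x, y \in \mathbb{R}$ are equivalent, and write $x \sim y$, if $x - y \in \mathbb{Q}$.

(b) Show that $\sim$ is an equivalence relation, meaning it is reflexive ($a \sim a$ for all $a \in \mathbb{R}$), symmetric ($a \sim b$ implies $b \sim a$), and transitive ($a \sim b$ and $b \sim c$ implies $a \sim c$).

For any $x \in \mathbb{R}$, let $Q_x = \mathbb{Q} + x$.

(c) Show that for any $x, y \in \mathbb{R}$, either $Q_x = Q_y$ or $Q_x \cap Q_y = \emptyset$.

Sets of the form $Q_x$ are called equivalence classes of the equivalence relation $\sim$.

(d) Show that $Q_x \cap [0, 1] \ne \emptyset$ for all $x \in \mathbb{R}$, or in other words, each equivalence class contains at least one element from the interval $[0, 1]$.

Let $V$ be a set obtained by choosing exactly one element in $[0, 1]$ from each equivalence class (by accepting that $V$ is well defined, you'll be accepting what is called the Axiom of Choice). So $V$ is a subset of $[0, 1]$. Suppose $q_1', q_2', \ldots$ is an enumeration of all the rational numbers in the interval $[-1, 1]$, with no number appearing twice in the list. Let $V_i = V + q_i'$ for $i \ge 1$.

(e) Verify that the sets $V_i$ are disjoint, and $[0, 1] \subset \cup_{i=1}^{\infty} V_i \subset [-1, 2]$.

Since the $V_i$'s are translations of $V$, they should all have the same length as $V$. If the length of $V$ is defined to be zero, then $[0, 1]$ would be covered by a countable union of disjoint sets of length zero, so $[0, 1]$ would also have length zero. If the length of $V$ were strictly positive, then the countable union would have infinite length, and hence the interval $[-1, 2]$ would have infinite length. Either way there is a contradiction.

1.32 * On sigma-algebras, random variables, and measurable functions

Prove the seven statements lettered (a)-(g) in what follows.

Definition. Let $\Omega$ be an arbitrary set. A nonempty collection $\mathcal{F}$ of subsets of $\Omega$ is defined to be an algebra if: (i) $A^c \in \mathcal{F}$ whenever $A \in \mathcal{F}$ and (ii) $A \cup B \in \mathcal{F}$ whenever $A, B \in \mathcal{F}$.

(a) If $\mathcal{F}$ is an algebra then $\emptyset \in \mathcal{F}$, $\Omega \in \mathcal{F}$, and the union or intersection of any finite collection of sets in $\mathcal{F}$ is in $\mathcal{F}$.

Definition. $\mathcal{F}$ is called a $\sigma$-algebra if $\mathcal{F}$ is an algebra such that whenever $A_1, A_2, \ldots$ are each in $\mathcal{F}$, so is the union, $\cup A_i$.

(b) If $\mathcal{F}$ is a $\sigma$-algebra and $B_1, B_2, \ldots$ are in $\mathcal{F}$, then so is the intersection, $\cap B_i$.
(c) Let $U$ be an arbitrary nonempty set, and suppose that $\mathcal{F}_u$ is a $\sigma$-algebra of subsets of $\Omega$ for each $u \in U$. Then the intersection $\cap_{u \in U} \mathcal{F}_u$ is also a $\sigma$-algebra.
(d) The collection of all subsets of $\Omega$ is a $\sigma$-algebra.
(e) If $\mathcal{F}_o$ is any collection of subsets of $\Omega$ then there is a smallest $\sigma$-algebra containing $\mathcal{F}_o$. (Hint: use (c) and (d).)

Definitions. $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra of subsets of $\mathbb{R}$ which contains all sets of the form $(-\infty, a]$. Sets in $\mathcal{B}(\mathbb{R})$ are called Borel sets. A real-valued random variable on a probability space $(\Omega, \mathcal{F}, P)$ is a real-valued function $X$ on $\Omega$ such that $\{\omega : X(\omega) \le a\} \in \mathcal{F}$ for any $a \in \mathbb{R}$.

(f) If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $A \in \mathcal{B}(\mathbb{R})$ then $\{\omega : X(\omega) \in A\} \in \mathcal{F}$. (Hint: Fix a random variable $X$. Let $\mathcal{D}$ be the collection of all sets $A$ in $\mathcal{B}(\mathbb{R})$ for which the conclusion is true. It is enough (why?) to show that $\mathcal{D}$ contains all sets of the form $(-\infty, a]$ and that $\mathcal{D}$ is a $\sigma$-algebra of subsets of $\mathbb{R}$. You must use the fact that $\mathcal{F}$ is a $\sigma$-algebra.)

Remark. By (f), $P\{\omega : X(\omega) \in A\}$, or $P\{X \in A\}$ for short, is well defined for $A \in \mathcal{B}(\mathbb{R})$.

Definition. A function $g$ mapping $\mathbb{R}$ to $\mathbb{R}$ is called Borel measurable if $\{x : g(x) \in A\} \in \mathcal{B}(\mathbb{R})$ whenever $A \in \mathcal{B}(\mathbb{R})$.

(g) If $X$ is a real-valued random variable on $(\Omega, \mathcal{F}, P)$ and $g$ is a Borel measurable function, then $Y$ defined by $Y = g(X)$ is also a random variable on $(\Omega, \mathcal{F}, P)$.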
Chapter 2

Convergence of a Sequence of Random Variables

Convergence to limits is a central concept in the theory of calculus. Limits are used to define derivatives and integrals. We wish to consider derivatives and integrals of random functions, so it is natural to begin by examining what it means for a sequence of random variables to converge. See the Appendix for a review of the definition of convergence for a sequence of numbers.

2.1 Four definitions of convergence of random variables

Recall that a random variable $X$ is a function on $\Omega$ for some probability space $(\Omega, \mathcal{F}, P)$. A sequence of random variables $(X_n(\omega) : n \ge 1)$ is hence a sequence of functions. There are many possible definitions for convergence of a sequence of random variables. One idea is to require $X_n(\omega)$ to converge for each fixed $\omega$. However, at least intuitively, what happens on an event of probability zero is not important. Thus, we use the following definition.

Definition 2.1.1 A sequence of random variables $(X_n : n \ge 1)$ converges almost surely to a random variable $X$, if all the random variables are defined on the same probability space, and $P\{\lim_{n\to\infty} X_n = X\} = 1$. Almost sure convergence is denoted by $\lim_{n\to\infty} X_n = X$ a.s. or $X_n \stackrel{a.s.}{\to} X$.

Conceptually, to check almost sure convergence, one can first find the set $\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$ and then see if it has probability one.

We shall construct some examples using the standard unit-interval probability space defined in Example 1.1.2. This particular choice of $(\Omega, \mathcal{F}, P)$ is useful for generating examples, because random variables, being functions on $\Omega$, can be simply specified by their graphs. For example, consider the random variable $X$ pictured in Figure 2.1. The probability mass function for such $X$ is given by $P\{X = 1\} = P\{X = 2\} = \frac{1}{4}$ and $P\{X = 3\} = \frac{1}{2}$. Figure 2.1 is a bit sloppy, in that it is not clear what the values of $X$ are at the jump points, $\omega = 1/4$ or $\omega = 1/2$. However, each of these points has probability zero, so the distribution of $X$ is the same no matter how X is defined at those points.

[Figure 2.1: A random variable on $(\Omega, \mathcal{F}, P)$. The graph of $X(\omega)$ takes the values 1, 2 and 3, with $\omega$ marked at $\frac14$, $\frac12$, $\frac34$ and 1.]

Example 2.1.2 Let $(X_n : n \ge 1)$ be the sequence of random variables on the standard unit-interval probability space defined by $X_n(\omega) = \omega^n$, illustrated in Figure 2.2. This sequence converges for all $\omega \in \Omega$, with the limit
$$\lim_{n\to\infty} X_n(\omega) = \begin{cases} 0 & \text{if } 0 \le \omega < 1 \\ 1 & \text{if } \omega = 1. \end{cases}$$
The single point set $\{1\}$ has probability zero, so it is also true (and simpler to say) that $(X_n : n \ge 1)$ converges a.s. to zero. In other words, if we let $X$ be the zero random variable, defined by $X(\omega) = 0$ for all $\omega$, then $X_n \stackrel{a.s.}{\to} X$.

[Figure 2.2: $X_n(\omega) = \omega^n$ on the standard unit-interval probability space, shown for $n = 1, 2, 3, 4$.]

Example 2.1.3 (Moving, shrinking rectangles) Let $(X_n : n \ge 1)$ be the sequence of random variables on the standard unit-interval probability space, as shown in Figure 2.3. The variable $X_1$ is identically one. The variables $X_2$ and $X_3$ are one on intervals of length $\frac12$. The variables $X_4$, $X_5$, $X_6$, and $X_7$ are one on intervals of length $\frac14$. In general, each $n \ge 1$ can be written as $n = 2^k + j$ where $k = \lfloor \log_2 n \rfloor$ and $0 \le j < 2^k$. The variable $X_n$ is one on the length $2^{-k}$ interval $(j2^{-k}, (j + 1)2^{-k}]$.

To investigate a.s. convergence, fix an arbitrary value for $\omega$ with $0 < \omega \le 1$. Then for each $k \ge 1$, there is one value of $n$ with $2^k \le n < 2^{k+1}$ such that $X_n(\omega) = 1$, and $X_n(\omega) = 0$ for all other $n$ in that range. Therefore, $\lim_{n\to\infty} X_n(\omega)$ does not exist. That is, the set $\{\omega : \lim_{n\to\infty} X_n(\omega) \text{ exists}\}$ contains no point of $(0, 1]$, so of course, $P\{\lim_{n\to\infty} X_n \text{ exists}\} = 0$. Thus, $X_n$ does not converge in the a.s. sense.

However, for large $n$, $P\{X_n = 0\}$ is close to one. This suggests that $X_n$ converges to the zero random variable in some weaker sense.

[Figure 2.3: A sequence of random variables on $(\Omega, \mathcal{F}, P)$, showing $X_1$ through $X_7$.]
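The construction in Example 2.1.3 can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `block_index`, `x_n` and `prob_nonzero` are hypothetical, and the sample point `omega = 0.3` is an arbitrary choice. It lists, for each block $2^k \le n < 2^{k+1}$, the indices $n$ at which $X_n(\omega) = 1$, alongside the probability $2^{-k}$ that $X_n$ is nonzero.

```python
def block_index(n):
    """Return (k, j) with n = 2**k + j and 0 <= j < 2**k, for an integer n >= 1."""
    k = n.bit_length() - 1          # floor(log2(n)), computed exactly with integers
    return k, n - (1 << k)


def x_n(n, omega):
    """X_n(omega) for the moving, shrinking rectangles of Example 2.1.3."""
    k, j = block_index(n)
    width = 2.0 ** (-k)             # length of the interval on which X_n equals one
    return 1 if j * width < omega <= (j + 1) * width else 0


def prob_nonzero(n):
    """P{X_n = 1}, which is the length 2**(-k) of the supporting interval."""
    k, _ = block_index(n)
    return 2.0 ** (-k)


if __name__ == "__main__":
    omega = 0.3                     # arbitrary sample point in (0, 1]
    for k in range(6):
        hits = [n for n in range(2 ** k, 2 ** (k + 1)) if x_n(n, omega) == 1]
        # Exactly one hit per block, so X_n(omega) returns to one forever,
        # even though P{X_n = 1} = 2**(-k) shrinks to zero.
        print(k, hits, prob_nonzero(2 ** k))
```

Each block contributes exactly one index with $X_n(\omega) = 1$, so the sequence $X_n(\omega)$ takes the values 0 and 1 infinitely often, while the probability of being nonzero tends to zero.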
Example 2.1.3 motivates us to consider the following weaker notion of convergence of a sequence of random variables.

Definition 2.1.4 A sequence of random variables $(X_n)$ converges to a random variable $X$ in probability if all the random variables are defined on the same probability space, and for any $\epsilon > 0$, $\lim_{n\to\infty} P\{|X - X_n| \ge \epsilon\} = 0$. Convergence in probability is denoted by $\lim_{n\to\infty} X_n = X$ p., or $X_n \stackrel{p.}{\to} X$.

Convergence in probability requires that $|X - X_n|$ be small with high probability (to be precise, less than or equal to $\epsilon$ with probability that converges to one as $n \to \infty$), but on the small probability event that $|X - X_n|$ is not small, it can be arbitrarily large. For some applications that is unacceptable. Roughly speaking, the next definition of convergence requires that $|X - X_n|$ be small with high probability for large $n$, and even if it is not small, the average squared value has to be small enough.

Definition 2.1.5 A sequence of random variables $(X_n)$ converges to a random variable $X$ in the mean square sense if all the random variables are defined on the same probability space, $E[X_n^2] < +\infty$ for all $n$, and $\lim_{n\to\infty} E[(X_n - X)^2] = 0$. Mean square convergence is denoted by $\lim_{n\to\infty} X_n = X$ m.s. or $X_n \stackrel{m.s.}{\to} X$.

Although it isn't explicitly stated in the definition of m.s. convergence, the limit random variable must also have a finite second moment:

Proposition 2.1.6 If $X_n \stackrel{m.s.}{\to} X$, then $E[X^2] < +\infty$.

Proof. Suppose that $X_n \stackrel{m.s.}{\to} X$. By definition, $E[X_n^2] < \infty$ for all $n$. Also by definition, there exists some $n_o$ so that $E[(X - X_n)^2] < 1$ for all $n \ge n_o$. The $L^2$ triangle inequality for random variables, (1.16), yields
$$E[X^2]^{\frac12} \le E[(X - X_{n_o})^2]^{\frac12} + E[X_{n_o}^2]^{\frac12} < +\infty.$$

Example 2.1.7 (More moving, shrinking rectangles) This example is along the same lines as Example 2.1.3, using the standard unit-interval probability space. Each random variable of the sequence $(X_n : n \ge 1)$ is defined as indicated in Figure 2.4, where the value $a_n > 0$ is some constant depending on $n$. The graph of $X_n$ for $n \ge 1$ has height $a_n$ over some subinterval of $\Omega$ of length $\frac1n$. We don't explicitly identify the location of the interval, but we require that for any fixed $\omega$, $X_n(\omega) = a_n$ for infinitely many values of $n$, and $X_n(\omega) = 0$ for infinitely many values of $n$. Such a choice of the locations of the intervals is possible because the sum of the lengths of the intervals, $\sum_{n=1}^{\infty} \frac1n$, is infinite.

[Figure 2.4: A sequence of random variables corresponding to moving, shrinking rectangles. $X_n(\omega)$ has height $a_n$ over an interval of length $1/n$.]

Of course $X_n \stackrel{a.s.}{\to} 0$ if the deterministic sequence $(a_n)$ converges to zero. However, if there is a constant $\epsilon > 0$ such that $a_n \ge \epsilon$ for all $n$ (for example if $a_n = 1$ for all $n$), then $\{\omega : \lim_{n\to\infty} X_n(\omega) \text{ exists}\} = \emptyset$, just as in Example 2.1.3. The sequence converges to zero in probability for any choice of the constants $(a_n)$, because for any $\epsilon > 0$,
$$P\{|X_n - 0| \ge \epsilon\} \le P\{X_n \ne 0\} = \frac1n \to 0.$$
Finally, to investigate mean square convergence, note that $E[|X_n - 0|^2] = \frac{a_n^2}{n}$. Hence, $X_n \stackrel{m.s.}{\to} 0$ if and only if the sequence of constants $(a_n)$ is such that $\lim_{n\to\infty} \frac{a_n^2}{n} = 0$. For example, if $a_n = \ln(n)$ for all $n$, then $X_n \stackrel{m.s.}{\to} 0$, but if $a_n = \sqrt{n}$, then $(X_n)$ does not converge to zero in the m.s. sense. (Proposition 2.1.13 below shows that a sequence can have only one limit in the a.s., p., or m.s. senses, so the fact $X_n \stackrel{p.}{\to} 0$ implies that zero is the only possible limit in the m.s. sense. So if $\frac{a_n^2}{n} \not\to 0$, then $(X_n)$ doesn't converge to any random variable in the m.s. sense.)

Example 2.1.8 (Anchored, shrinking rectangles) Let $(X_n : n \ge 1)$ be a sequence of random variables defined on the standard unit-interval probability space, as indicated in Figure 2.5, where the value $a_n > 0$ is some constant depending on $n$. That is, $X_n(\omega)$ is equal to $a_n$ if $0 \le \omega \le 1/n$, and to zero otherwise. For any nonzero $\omega$ in $\Omega$, $X_n(\omega) = 0$ for all $n$ such that $n > 1/\omega$. Therefore, $X_n \stackrel{a.s.}{\to} 0$.

[Figure 2.5: A sequence of random variables corresponding to anchored, shrinking rectangles. $X_n(\omega)$ has height $a_n$ over the interval $[0, 1/n]$.]

Whether the sequence $(X_n)$ converges in p. or m.s. sense for this example is exactly the same as in Example 2.1.7. That is, for convergence in probability or mean square sense, the locations of the shrinking intervals of support don't matter. So $X_n \stackrel{p.}{\to} 0$. And $X_n \stackrel{m.s.}{\to} 0$ if and only if $\frac{a_n^2}{n} \to 0$.

It is shown in Proposition 2.1.13 below that either a.s. or m.s. convergence imply convergence in probability. Example 2.1.8 shows that a.s. convergence, like convergence in probability, can allow $|X_n(\omega) - X(\omega)|$ to be extremely large for $\omega$ in a small probability set. So neither convergence in probability, nor a.s. convergence, imply m.s. convergence, unless an additional assumption is made to control the difference $|X_n(\omega) - X(\omega)|$ everywhere on $\Omega$.
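The mean square criterion used in Examples 2.1.7 and 2.1.8 reduces to the behavior of the deterministic sequence $a_n^2/n$. As a small illustrative sketch (the function name `second_moment` and the three sample sequences are choices made here, not taken from the notes), the quantity $E[X_n^2] = a_n^2/n$ can be tabulated for $a_n = \ln n$, $a_n = \sqrt{n}$ and $a_n = n$:

```python
import math


def second_moment(a_n, n):
    """E[X_n^2] when X_n equals a_n on an interval of length 1/n and zero elsewhere."""
    return a_n ** 2 / n


if __name__ == "__main__":
    for n in (10, 10 ** 3, 10 ** 6):
        print(
            n,
            second_moment(math.log(n), n),   # a_n = ln(n): tends to zero
            second_moment(math.sqrt(n), n),  # a_n = sqrt(n): stays near one
            second_moment(n, n),             # a_n = n: grows without bound
        )
```

Only the first choice gives m.s. convergence to zero, even though all three sequences converge to zero in probability, and the anchored version converges to zero almost surely as well.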
Example 2.1.9 (Rearrangements of rectangles) Let $(X_n : n \ge 1)$ be a sequence of random variables defined on the standard unit-interval probability space. The first three random variables in the sequence are indicated in Figure 2.6. Suppose that the sequence is periodic, with period three, so that $X_{n+3} = X_n$ for all $n \ge 1$. Intuitively speaking, the sequence of random variables persistently jumps around. Obviously it does not converge in the a.s. sense. The sequence does not settle down to converge, even in the sense of convergence in probability, to any one random variable. This can be proved as follows. Suppose for the sake of contradiction that $X_n \stackrel{p.}{\to} X$ for some random variable $X$. Then for any $\epsilon > 0$ and $\delta > 0$, if $n$ is sufficiently large, $P\{|X_n - X| \ge \epsilon\} \le \delta$. But because the sequence is periodic, it must be that $P\{|X_n - X| \ge \epsilon\} \le \delta$ for $1 \le n \le 3$. Since $\delta$ is arbitrary it must be that $P\{|X_n - X| \ge \epsilon\} = 0$ for $1 \le n \le 3$. Since $\epsilon$ is arbitrary it must be that $P\{X = X_n\} = 1$ for $1 \le n \le 3$. Hence, $P\{X_1 = X_2 = X_3\} = 1$, which is a contradiction. Thus, the sequence does not converge in probability. A similar argument shows it does not converge in the m.s. sense, either.

[Figure 2.6: A sequence of random variables obtained by rearrangement of rectangles, showing $X_1$, $X_2$ and $X_3$.]

Even though the sequence fails to converge in a.s., m.s., or p. senses, it can be observed that all of the $X_n$'s have the same probability distribution. The variables are only different in that the places they take their possible values are rearranged.

Example 2.1.9 suggests that it would be useful to have a notion of convergence that just depends on the distributions of the random variables. One idea for a definition of convergence in distribution is to require that the sequence of CDFs $F_{X_n}(x)$ converge as $n \to \infty$ for all $x$. The following example shows such a definition could give unexpected results in some cases.

Example 2.1.10 Let $U$ be uniformly distributed on the interval $[0, 1]$, and for $n \ge 1$, let $X_n = \frac{(-1)^n U}{n}$. Let $X$ denote the random variable such that $X = 0$ for all $\omega$. It is easy to verify that $X_n \stackrel{a.s.}{\to} X$ and $X_n \stackrel{p.}{\to} X$. Does the CDF of $X_n$ converge to the CDF of $X$? The CDF of $X_n$ is graphed in Figure 2.7. The CDF $F_{X_n}(x)$ converges to 0 for $x < 0$ and to one for $x > 0$. However, $F_{X_n}(0)$ alternates between 0 and 1 and hence does not converge to anything. In particular, it doesn't converge to $F_X(0)$. Thus, $F_{X_n}(x)$ converges to $F_X(x)$ for all $x$ except $x = 0$.

[Figure 2.7: CDF of $X_n = \frac{(-1)^n U}{n}$, shown for $n$ even (rising from 0 to 1 over $[0, \frac1n]$) and for $n$ odd (rising from 0 to 1 over $[-\frac1n, 0]$).]

Recall that the distribution of a random variable $X$ has probability mass $\Delta$ at some value $x_o$, i.e. $P\{X = x_o\} = \Delta > 0$, if and only if the CDF has a jump of size $\Delta$ at $x_o$: $F(x_o) - F(x_o-) = \Delta$. Example 2.1.10 illustrates the fact that if the limit random variable $X$ has such a point mass, then even if $X_n$ is very close to $X$, the value $F_{X_n}(x)$ need not converge. To overcome this phenomenon, we adopt a definition of convergence in distribution which requires convergence of the CDFs only at the continuity points of the limit CDF. Continuity points are defined for general functions in Appendix 11.3. Since CDFs are right-continuous and nondecreasing, a point $x$ is a continuity point of a CDF $F$ if and only if there is no jump of $F$ at $x$: i.e. if $F(x) = F(x-)$.

Definition 2.1.11 A sequence $(X_n : n \ge 1)$ of random variables converges in distribution to a random variable $X$ if
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x) \quad \text{at all continuity points } x \text{ of } F_X.$$
Convergence in distribution is denoted by $\lim_{n\to\infty} X_n = X$ d. or $X_n \stackrel{d.}{\to} X$.

One way to investigate convergence in distribution is through the use of characteristic functions.

Proposition 2.1.12 Let $(X_n)$ be a sequence of random variables and let $X$ be a random variable. Then the following are equivalent:

(i) $X_n \stackrel{d.}{\to} X$
(ii) $E[f(X_n)] \to E[f(X)]$ for any bounded continuous function $f$.
(iii) $\Phi_{X_n}(u) \to \Phi_X(u)$ for each $u \in \mathbb{R}$ (i.e. pointwise convergence of characteristic functions)

The relationships among the four types of convergence discussed in this section are given in the following proposition, and are pictured in Figure 2.8. The definitions use differing amounts of information about the random variables $(X_n : n \ge 1)$ and $X$ involved. Convergence in the a.s. sense involves joint properties of all the random variables. Convergence in the p. or m.s. sense involves only pairwise joint distributions, namely those of $(X_n, X)$ for all $n$. Convergence in distribution involves only the individual distributions of the random variables to have a convergence property. Convergence in the a.s., m.s., and p. senses require the variables to all be defined on the same probability space. For convergence in distribution, the random variables need not be defined on the same probability space.

[Figure 2.8: Relationships among four types of convergence of random variables: a.s. implies p., m.s. implies p., p. implies d., and p. implies m.s. if the sequence is dominated by a single random variable with a finite second moment.]

Proposition 2.1.13
(a) If $X_n \stackrel{a.s.}{\to} X$ then $X_n \stackrel{p.}{\to} X$.
(b) If $X_n \stackrel{m.s.}{\to} X$ then $X_n \stackrel{p.}{\to} X$.
(c) If $P\{|X_n| \le Y\} = 1$ for all $n$ for some fixed random variable $Y$ with $E[Y^2] < \infty$, and if $X_n \stackrel{p.}{\to} X$, then $X_n \stackrel{m.s.}{\to} X$.
(d) If $X_n \stackrel{p.}{\to} X$ then $X_n \stackrel{d.}{\to} X$.
(e) Suppose $X_n \to X$ in the p., m.s., or a.s. sense and $X_n \to Y$ in the p., m.s., or a.s. sense. Then $P\{X = Y\} = 1$. That is, if differences on sets of probability zero are ignored, a sequence of random variables can have only one limit (if p., m.s., and/or a.s. senses are used).
(f) Suppose $X_n \stackrel{d.}{\to} X$ and $X_n \stackrel{d.}{\to} Y$. Then $X$ and $Y$ have the same distribution.

Proof. (a) Suppose $X_n \stackrel{a.s.}{\to} X$ and let $\epsilon > 0$. Define a sequence of events $A_n$ by
$$A_n = \{\omega : |X_n(\omega) - X(\omega)| < \epsilon\}.$$
We only need to show that $P[A_n] \to 1$. Define $B_n$ by
$$B_n = \{\omega : |X_k(\omega) - X(\omega)| < \epsilon \text{ for all } k \ge n\}.$$
Note that $B_n \subset A_n$ and $B_1 \subset B_2 \subset \cdots$ so $\lim_{n\to\infty} P[B_n] = P[B]$ where $B = \cup_{n=1}^{\infty} B_n$. Clearly
$$B \supset \{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\}$$
so $1 = P[B] = \lim_{n\to\infty} P[B_n]$. Since $P[A_n]$ is squeezed between $P[B_n]$ and 1, $\lim_{n\to\infty} P[A_n] = 1$, so $X_n \stackrel{p.}{\to} X$.

(b) Suppose $X_n \stackrel{m.s.}{\to} X$ and let $\epsilon > 0$. By the Markov inequality applied to $|X - X_n|^2$,
$$P\{|X - X_n| \ge \epsilon\} \le \frac{E[|X - X_n|^2]}{\epsilon^2} \qquad (2.1)$$
The right side of (2.1), and hence the left side of (2.1), converges to zero as $n$ goes to infinity. Therefore $X_n \stackrel{p.}{\to} X$ as $n \to \infty$.

(c) Suppose $X_n \stackrel{p.}{\to} X$. Then for any $\epsilon > 0$,
$$P\{|X| \ge Y + \epsilon\} \le P\{|X - X_n| \ge \epsilon\} \to 0$$
so that $P\{|X| \ge Y + \epsilon\} = 0$ for every $\epsilon > 0$. Thus, $P\{|X| \le Y\} = 1$, so that $P\{|X - X_n|^2 \le 4Y^2\} = 1$. Therefore, with probability one, for any $\epsilon > 0$,
$$|X - X_n|^2 \le 4Y^2 I_{\{|X - X_n| \ge \epsilon\}} + \epsilon^2$$
so
$$E[|X - X_n|^2] \le 4E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}] + \epsilon^2.$$
In the special case that $P\{Y = L\} = 1$ for a constant $L$, the term $E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}]$ is equal to $L^2 P\{|X - X_n| \ge \epsilon\}$, and by the hypotheses, $P\{|X - X_n| \ge \epsilon\} \to 0$. Even if $Y$ is random, since $E[Y^2] < \infty$ and $P\{|X - X_n| \ge \epsilon\} \to 0$, it still follows that $E[Y^2 I_{\{|X - X_n| \ge \epsilon\}}] \to 0$ as $n \to \infty$, by Corollary 11.6.5. So, for $n$ large enough, $E[|X - X_n|^2] \le 2\epsilon^2$. Since $\epsilon$ was arbitrary, $X_n \stackrel{m.s.}{\to} X$.

(d) Assume $X_n \stackrel{p.}{\to} X$. Select any continuity point $x$ of $F_X$. It must be proved that $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$. Let $\epsilon > 0$. Then there exists $\delta > 0$ so that $F_X(x) \le F_X(x - \delta) + \frac{\epsilon}{2}$. (See Figure 2.9.) Now
$$\{X \le x - \delta\} = \{X \le x - \delta, X_n \le x\} \cup \{X \le x - \delta, X_n > x\} \subset \{X_n \le x\} \cup \{|X - X_n| \ge \delta\}$$
so
$$F_X(x - \delta) \le F_{X_n}(x) + P\{|X_n - X| \ge \delta\}.$$
For all $n$ sufficiently large, $P\{|X_n - X| \ge \delta\} \le \frac{\epsilon}{2}$. This and the choice of $\delta$ yield, for all $n$ sufficiently large, $F_X(x) \le F_{X_n}(x) + \epsilon$. Similarly, for all $n$ sufficiently large, $F_X(x) \ge F_{X_n}(x) - \epsilon$. So for all $n$ sufficiently large, $|F_{X_n}(x) - F_X(x)| \le \epsilon$. Since $\epsilon$ was arbitrary, $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$.

[Figure 2.9: A CDF at a continuity point, showing $F_X(x)$ and $F_X(x - \delta)$.]

(e) By parts (a) and (b), already proved, we can assume that $X_n \stackrel{p.}{\to} X$ and $X_n \stackrel{p.}{\to} Y$. Let $\epsilon > 0$ and $\delta > 0$, and select $N$ so large that $P\{|X_n - X| \ge \epsilon\} \le \delta$ and $P\{|X_n - Y| \ge \epsilon\} \le \delta$ for all $n \ge N$. By the triangle inequality, $|X - Y| \le |X_N - X| + |X_N - Y|$. Thus,
$$\{|X - Y| \ge 2\epsilon\} \subset \{|X_N - X| \ge \epsilon\} \cup \{|X_N - Y| \ge \epsilon\}$$
so that
$$P\{|X - Y| \ge 2\epsilon\} \le P\{|X_N - X| \ge \epsilon\} + P\{|X_N - Y| \ge \epsilon\} \le 2\delta.$$
We've proved that $P\{|X - Y| \ge 2\epsilon\} \le 2\delta$. Since $\delta$ was arbitrary, it must be that $P\{|X - Y| \ge 2\epsilon\} = 0$. Since $\epsilon$ was arbitrary, it must be that $P\{|X - Y| = 0\} = 1$.

(f) Suppose $X_n \stackrel{d.}{\to} X$ and $X_n \stackrel{d.}{\to} Y$. Then $F_X(x) = F_Y(x)$ whenever $x$ is a continuity point of both $F_X$ and $F_Y$. Since $F_X$ and $F_Y$ are nondecreasing and bounded, they can have only finitely many discontinuities of size greater than $1/n$ for any $n$, so that the total number of discontinuities is at most countably infinite. Hence, in any nonempty interval, there is a point of continuity of both functions. So for any $x \in \mathbb{R}$, there is a strictly decreasing sequence of numbers $(x_n)$ converging to $x$, such that $x_n$ is a point of continuity of both $F_X$ and $F_Y$. So $F_X(x_n) = F_Y(x_n)$ for all $n$. Taking the limit as $n \to \infty$ and using the right-continuity of CDFs, we have $F_X(x) = F_Y(x)$.
Example 2.1.14 Suppose $X_0$ is a random variable with $P\{X_0 \ge 0\} = 1$. Suppose $X_n = 6 + \sqrt{X_{n-1}}$ for $n \ge 1$. For example, if for some $\omega$ it happens that $X_0(\omega) = 12$, then
$$X_1(\omega) = 6 + \sqrt{12} = 9.464\ldots$$
$$X_2(\omega) = 6 + \sqrt{9.464} = 9.076\ldots$$
$$X_3(\omega) = 6 + \sqrt{9.076} = 9.0127\ldots$$
Examining Figure 2.10, it is clear that for any $\omega$ with $X_0(\omega) > 0$, the sequence of numbers $X_n(\omega)$ converges to 9. Therefore, $X_n \stackrel{a.s.}{\to} 9$. The rate of convergence can be bounded as follows. Note that for each $x \ge 0$, $|6 + \sqrt{x} - 9| \le |6 + \frac{x}{3} - 9|$. Therefore,
$$|X_n(\omega) - 9| \le \left|6 + \frac{X_{n-1}(\omega)}{3} - 9\right| = \frac13 |X_{n-1}(\omega) - 9|$$
so that by induction on $n$,
$$|X_n(\omega) - 9| \le 3^{-n} |X_0(\omega) - 9| \qquad (2.2)$$

[Figure 2.10: Graph of the functions $6 + \sqrt{x}$ and $6 + \frac{x}{3}$, together with the line $x = y$; the curves meet at the point $(9, 9)$.]

Since $X_n \stackrel{a.s.}{\to} 9$ it follows that $X_n \stackrel{p.}{\to} 9$.

Finally, we investigate m.s. convergence under the assumption that $E[X_0^2] < +\infty$. By the inequality $(a + b)^2 \le 2a^2 + 2b^2$, it follows that
$$E[(X_0 - 9)^2] \le 2(E[X_0^2] + 81) \qquad (2.3)$$
Squaring and taking expectations on each side of (2.2) and using (2.3) thus yields
$$E[|X_n - 9|^2] \le 2 \cdot 3^{-2n} \{E[X_0^2] + 81\}.$$
Therefore, $X_n \stackrel{m.s.}{\to} 9$.

Example 2.1.15 Let $W_0, W_1, \ldots$ be independent, normal random variables with mean 0 and variance 1. Let $X_{-1} = 0$ and
$$X_n = (.9)X_{n-1} + W_n \qquad n \ge 0.$$
In what sense does $X_n$ converge as $n$ goes to infinity? For fixed $\omega$, the sequence of numbers $X_0(\omega), X_1(\omega), \ldots$ might appear as in Figure 2.11.

[Figure 2.11: A typical sample sequence of $X$; $X_k$ plotted against $k$.]

Intuitively speaking, $X_n$ persistently moves. We claim that $X_n$ does not converge in probability (so also not in the a.s. or m.s. senses). Here is a proof of the claim. Examination of a table for the normal distribution yields that $P\{W_n \ge 2\} = P\{W_n \le -2\} \ge 0.02$. Then
$$P\{|X_n - X_{n-1}| \ge 2\} \ge P\{X_{n-1} \ge 0, W_n \le -2\} + P\{X_{n-1} < 0, W_n \ge 2\}$$
$$= P\{X_{n-1} \ge 0\}P\{W_n \le -2\} + P\{X_{n-1} < 0\}P\{W_n \ge 2\} = P\{W_n \ge 2\} \ge 0.02.$$
Therefore, for any random variable $X$,
$$P\{|X_n - X| \ge 1\} + P\{|X_{n-1} - X| \ge 1\} \ge P\{|X_n - X| \ge 1 \text{ or } |X_{n-1} - X| \ge 1\} \ge P\{|X_n - X_{n-1}| \ge 2\} \ge 0.02$$
so $P\{|X_n - X| \ge 1\}$ does not converge to zero as $n \to \infty$. So $X_n$ does not converge in probability to any random variable $X$. The claim is proved.

Although $X_n$ does not converge in probability, or in the a.s. or m.s. senses, it nevertheless seems to asymptotically settle into an equilibrium. To probe this point further, let's find the distribution of $X_n$ for each $n$.

$X_0 = W_0$ is $N(0, 1)$
$X_1 = (.9)X_0 + W_1$ is $N(0, 1.81)$
$X_2 = (.9)X_1 + W_2$ is $N(0, (.81)(1.81) + 1)$

In general, $X_n$ is $N(0, \sigma_n^2)$ where the variances satisfy the recursion $\sigma_n^2 = (0.81)\sigma_{n-1}^2 + 1$, so $\sigma_n^2 \to \sigma_\infty^2$ where $\sigma_\infty^2 = \frac{1}{0.19} = 5.263$. Therefore, the CDF of $X_n$ converges everywhere to the CDF of any random variable $X$ which has the $N(0, \sigma_\infty^2)$ distribution. So $X_n \stackrel{d.}{\to} X$ for any such $X$.
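Both recursions in Examples 2.1.14 and 2.1.15 are easy to iterate numerically. The sketch below is an illustration added here rather than part of the original notes; the function names `iterate_sqrt_map` and `variance_recursion` are hypothetical, and the starting value 12 is the one used in Example 2.1.14. The first function follows $x_n = 6 + \sqrt{x_{n-1}}$ so the error can be compared with the bound $3^{-n}|x_0 - 9|$, and the second follows the variance recursion $\sigma_n^2 = 0.81\,\sigma_{n-1}^2 + 1$ starting from $\sigma_0^2 = 1$.

```python
import math


def iterate_sqrt_map(x0, steps):
    """Return [x_0, x_1, ..., x_steps] for the recursion x_n = 6 + sqrt(x_{n-1})."""
    xs = [x0]
    for _ in range(steps):
        xs.append(6 + math.sqrt(xs[-1]))
    return xs


def variance_recursion(steps):
    """Return [var_0, ..., var_steps] with var_0 = 1 and var_n = 0.81 * var_{n-1} + 1."""
    var = 1.0                       # variance of X_0 = W_0
    out = [var]
    for _ in range(steps):
        var = 0.81 * var + 1.0      # Var(0.9 X_{n-1} + W_n), using independence
        out.append(var)
    return out


if __name__ == "__main__":
    xs = iterate_sqrt_map(12.0, 6)
    for n, x in enumerate(xs):
        # actual error next to the geometric bound from the induction argument
        print(n, x, abs(x - 9), 3.0 ** (-n) * abs(xs[0] - 9))

    vs = variance_recursion(60)
    print(vs[1], vs[2], vs[-1], 1 / 0.19)
```

The printed errors sit below the geometric bound, and the variances approach the fixed point $1/0.19$ of the recursion.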
The previous example involved convergence in distribution of Gaussian random variables. The limit random variable was also Gaussian. In fact, we close this section by showing that limits of Gaussian random variables are always Gaussian. Recall that $X$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$ if either $\sigma^2 > 0$ and $F_X(c) = \Phi\left(\frac{c-\mu}{\sigma}\right)$ for all $c$, where $\Phi$ is the CDF of the standard $N(0,1)$ distribution, or $\sigma^2 = 0$, in which case $F_X(c) = I_{\{c \ge \mu\}}$ and $P\{X = \mu\} = 1$.

Proposition 2.1.16 Suppose $X_n$ is a Gaussian random variable for each $n$, and that $X_n \to X_\infty$ as $n \to \infty$, in any one of the four senses, a.s., m.s., p., or d. Then $X_\infty$ is also a Gaussian random variable.

Proof. Since convergence in the other senses implies convergence in distribution, we can assume that the sequence converges in distribution. Let $\mu_n$ and $\sigma_n^2$ denote the mean and variance of $X_n$. The first step is to show that the sequence $(\sigma_n^2)$ is bounded. Intuitively, if it weren't bounded, the distribution of $X_n$ would get too spread out to converge. Since $F_{X_\infty}$ is a valid CDF, there exists a value $L$ so large that $F_{X_\infty}(-L) < \frac{1}{3}$ and $F_{X_\infty}(L) > \frac{2}{3}$. By increasing $L$ if necessary, we can also assume that $L$ and $-L$ are continuity points of $F_{X_\infty}$. So there exists $n_o$ such that, whenever $n \ge n_o$, $F_{X_n}(-L) \le \frac{1}{3}$ and $F_{X_n}(L) \ge \frac{2}{3}$. Therefore, for $n \ge n_o$, $P\{|X_n| \le L\} \ge F_{X_n}(L) - F_{X_n}(-L) \ge \frac{1}{3}$. For $\sigma_n^2$ fixed, the probability $P\{|X_n| \le L\}$ is maximized by $\mu_n = 0$, so no matter what the value of $\mu_n$ is, $2\Phi\left(\frac{L}{\sigma_n}\right) - 1 \ge P\{|X_n| \le L\}$. Therefore, for $n \ge n_o$, $\Phi\left(\frac{L}{\sigma_n}\right) \ge \frac{2}{3}$, or equivalently, $\sigma_n \le L/\Phi^{-1}\left(\frac{2}{3}\right)$, where $\Phi^{-1}$ is the inverse of $\Phi$. The first $n_o - 1$ terms of the sequence $(\sigma_n^2)$ are finite. Therefore, the whole sequence $(\sigma_n^2)$ is bounded.
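Constant random variables are considered to be Gaussian random variables, namely degenerate ones with zero variance. So assume without loss of generality that $X_\infty$ is not a constant random variable. Then there exists a value $c_o$ so that $F_{X_\infty}(c_o)$ is strictly between zero and one. Since $F_{X_\infty}$ is right-continuous, the function must lie strictly between zero and one over some interval of positive length, with left endpoint $c_o$. The function can only have countably many points of discontinuity, so it has infinitely many points of continuity such that the function value is strictly between zero and one. Let $c_1$ and $c_2$ be two distinct such points, let $p_1$ and $p_2$ denote the values of $F_{X_\infty}$ at those two points, and let $b_i = \Phi^{-1}(p_i)$ for $i = 1, 2$. It follows that $\lim_{n\to\infty} \frac{c_i - \mu_n}{\sigma_n} = b_i$ for $i = 1, 2$. The limit of the difference of the sequences is the difference of the limits, so $\lim_{n\to\infty} \frac{c_1 - c_2}{\sigma_n} = b_1 - b_2$. Since $c_1 - c_2 \ne 0$ and the sequence $(\sigma_n)$ is bounded, it follows that $(\sigma_n)$ has a finite limit, $\sigma_\infty$, and therefore also $(\mu_n)$ has a finite limit, $\mu_\infty$. Therefore, the CDFs $F_{X_n}$ converge pointwise to the CDF for the $N(\mu_\infty, \sigma_\infty^2)$ distribution. Thus, $X_\infty$ has the $N(\mu_\infty, \sigma_\infty^2)$ distribution.

2.2 Cauchy criteria for convergence of random variables

It is important to be able to show that a limit exists even if the limit value is not known. For example, it is useful to determine if the sum of an infinite series of numbers is convergent without needing to know the value of the sum. One useful result for this purpose is that if $(x_n : n \ge 1)$ is monotone nondecreasing, i.e. $x_1 \le x_2 \le \cdots$, and if it satisfies $x_n \le L$ for all $n$ for some finite constant $L$, then the sequence is convergent. This result carries over immediately to random variables: if $(X_n : n \ge 1)$ is a sequence of random variables such that $P\{X_n \le X_{n+1}\} = 1$ for all $n$ and if there is a random variable $Y$ such that $P\{X_n \le Y\} = 1$ for all $n$, then $(X_n)$ converges a.s.

For deterministic sequences that are not monotone, the Cauchy criteria gives a simple yet general condition that implies convergence to a finite limit. A deterministic sequence $(x_n : n \ge 1)$ is said to be a Cauchy sequence if $\lim_{m,n\to\infty} |x_m - x_n| = 0$. This means that, for any $\epsilon > 0$, there exists $N$ sufficiently large, such that $|x_m - x_n| < \epsilon$ for all $m, n \ge N$. If the sequence $(x_n)$ has a finite limit $x_\infty$, then the triangle inequality for distances between numbers, $|x_m - x_n| \le |x_m - x_\infty| + |x_n - x_\infty|$, implies that the sequence is a Cauchy sequence. More useful is the converse statement, called the Cauchy criteria for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$ is a Cauchy sequence then $(x_n)$ converges to a finite limit as $n \to \infty$. The following proposition gives similar criteria for convergence of random variables.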
Proposition 2.2.1 (Cauchy criteria for random variables) Let $(X_n)$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$.

(a) $X_n$ converges a.s. to some random variable if and only if
$$P\{\omega : \lim_{m,n\to\infty} |X_m(\omega) - X_n(\omega)| = 0\} = 1.$$

(b) $X_n$ converges m.s. to some random variable if and only if $(X_n)$ is a Cauchy sequence in the m.s. sense, meaning $E[X_n^2] < +\infty$ for all $n$ and
$$\lim_{m,n\to\infty} E[(X_m - X_n)^2] = 0. \qquad (2.4)$$

(c) $X_n$ converges p. to some random variable if and only if for every $\epsilon > 0$,
$$\lim_{m,n\to\infty} P\{|X_m - X_n| \ge \epsilon\} = 0. \qquad (2.5)$$

Proof. (a) For any $\omega$ fixed, $(X_n(\omega) : n \ge 1)$ is a sequence of numbers. So by the Cauchy criterion for convergence of a sequence of numbers, the following equality of sets holds:
$$\{\omega : \lim_{n\to\infty} X_n(\omega) \text{ exists and is finite}\} = \{\omega : \lim_{m,n\to\infty} |X_m(\omega) - X_n(\omega)| = 0\}.$$
Thus, the set on the left has probability one (i.e. $X_n$ converges a.s. to a random variable) if and only if the set on the right has probability one. Part (a) is proved.

(b) First the "only if" part is proved. Suppose $X_n \xrightarrow{m.s.} X_\infty$. By the $L^2$ triangle inequality for random variables,
$$E[(X_n - X_m)^2]^{\frac{1}{2}} \le E[(X_m - X_\infty)^2]^{\frac{1}{2}} + E[(X_n - X_\infty)^2]^{\frac{1}{2}} \qquad (2.6)$$
Since $X_n \xrightarrow{m.s.} X_\infty$, the right side of (2.6) converges to zero as $m, n \to \infty$, so that (2.4) holds. The "only if" part of (b) is proved.
Moving to the proof of the "if" part, suppose (2.4) holds. Choose the sequence $k_1 < k_2 < \ldots$ recursively as follows. Let $k_1$ be so large that $E[(X_n - X_{k_1})^2] \le 1/2$ for all $n \ge k_1$. Once $k_1, \ldots, k_{i-1}$ are selected, let $k_i$ be so large that $k_i > k_{i-1}$ and $E[(X_n - X_{k_i})^2] \le 2^{-i}$ for all $n \ge k_i$. It follows from this choice of the $k_i$'s that $E[(X_{k_{i+1}} - X_{k_i})^2] \le 2^{-i}$ for all $i \ge 1$. Let $S_n = |X_{k_1}| + \sum_{i=1}^{n-1} |X_{k_{i+1}} - X_{k_i}|$. Note that $|X_{k_i}| \le S_n$ for $1 \le i \le n$ by the triangle inequality for differences of real numbers. By the $L^2$ triangle inequality for random variables (1.16),
$$E[S_n^2]^{\frac{1}{2}} \le E[X_{k_1}^2]^{\frac{1}{2}} + \sum_{i=1}^{n-1} E[(X_{k_{i+1}} - X_{k_i})^2]^{\frac{1}{2}} \le E[X_{k_1}^2]^{\frac{1}{2}} + \sum_{i=1}^{\infty} 2^{-i/2} \le E[X_{k_1}^2]^{\frac{1}{2}} + 3.$$
Since $S_n$ is monotonically increasing, it converges a.s. to a limit $S_\infty$. Note that $|X_{k_i}| \le S_\infty$ for all $i \ge 1$. By the monotone convergence theorem, $E[S_\infty^2] = \lim_{n\to\infty} E[S_n^2] \le (E[X_{k_1}^2]^{\frac{1}{2}} + 3)^2$. So, $S_\infty$ is in $L^2(\Omega, \mathcal{F}, P)$. In particular, $S_\infty$ is finite a.s., and for any $\omega$ such that $S_\infty(\omega)$ is finite, the sequence of numbers $(X_{k_i}(\omega) : i \ge 1)$ is a Cauchy sequence. (See Example 11.2.3 in the appendix.) By completeness of $\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists. Let $X_\infty(\omega) = 0$ on the zero probability event that $(X_{k_i}(\omega) : i \ge 1)$ does not converge. Summarizing, we have $\lim_{i\to\infty} X_{k_i} = X_\infty$ a.s. and $|X_{k_i}| \le S_\infty$ where $S_\infty \in L^2(\Omega, \mathcal{F}, P)$. It therefore follows from Proposition 2.1.13(c) that $X_{k_i} \xrightarrow{m.s.} X_\infty$.

The final step is to prove that the entire sequence $(X_n)$ converges in the m.s. sense to $X_\infty$. For this purpose, let $\epsilon > 0$. Select $i$ so large that $E[(X_n - X_{k_i})^2] < \epsilon^2$ for all $n \ge k_i$, and $E[(X_{k_i} - X_\infty)^2] \le \epsilon^2$. Then, by the $L^2$ triangle inequality, for any $n \ge k_i$,
$$E[(X_n - X_\infty)^2]^{\frac{1}{2}} \le E[(X_n - X_{k_i})^2]^{\frac{1}{2}} + E[(X_{k_i} - X_\infty)^2]^{\frac{1}{2}} \le 2\epsilon.$$
Since $\epsilon$ was arbitrary, $X_n \xrightarrow{m.s.} X_\infty$. The proof of (b) is complete.
(c) First the "only if" part is proved. Suppose $X_n \xrightarrow{p.} X_\infty$. Then for any $\epsilon > 0$,
$$P\{|X_m - X_n| \ge 2\epsilon\} \le P\{|X_m - X_\infty| \ge \epsilon\} + P\{|X_n - X_\infty| \ge \epsilon\} \to 0$$
as $m, n \to \infty$, so that (2.5) holds. The "only if" part is proved.

Moving to the proof of the "if" part, suppose (2.5) holds. Select an increasing sequence of integers $k_i$ so that $P\{|X_n - X_m| \ge 2^{-i}\} \le 2^{-i}$ for all $m, n \ge k_i$. It follows, in particular, that $P\{|X_{k_{i+1}} - X_{k_i}| \ge 2^{-i}\} \le 2^{-i}$. Since the sum of the probabilities of these events is finite, the probability that infinitely many of the events are true is zero, by the Borel-Cantelli lemma (specifically, Lemma 1.2.2(a)). Thus, $P\{|X_{k_{i+1}} - X_{k_i}| \le 2^{-i} \text{ for all large enough } i\} = 1$. Thus, for all $\omega$ in a set with probability one, $(X_{k_i}(\omega) : i \ge 1)$ is a Cauchy sequence of numbers. By completeness of $\mathbb{R}$, for $\omega$ in that set, the limit $X_\infty(\omega)$ exists. Let $X_\infty(\omega) = 0$ on the zero probability event that $(X_{k_i}(\omega) : i \ge 1)$ does not converge. Then, $X_{k_i} \xrightarrow{a.s.} X_\infty$. It follows that $X_{k_i} \xrightarrow{p.} X_\infty$ as well.

The final step is to prove that the entire sequence $(X_n)$ converges in the p. sense to $X_\infty$. For this purpose, let $\epsilon > 0$. Select $i$ so large that $P\{|X_n - X_{k_i}| \ge \epsilon\} < \epsilon$ for all $n \ge k_i$, and $P\{|X_{k_i} - X_\infty| \ge \epsilon\} < \epsilon$. Then $P\{|X_n - X_\infty| \ge 2\epsilon\} \le 2\epsilon$ for all $n \ge k_i$. Since $\epsilon$ was arbitrary, $X_n \xrightarrow{p.} X_\infty$. The proof of (c) is complete.

The following is a corollary of Proposition 2.2.1(c) and its proof.

Corollary 2.2.2 If $X_n \xrightarrow{p.} X_\infty$, then there is a subsequence $(X_{k_i} : i \ge 1)$ such that $\lim_{i\to\infty} X_{k_i} = X_\infty$ a.s.

Proof. By Proposition 2.2.1(c), the sequence satisfies (2.5). By the proof of Proposition 2.2.1(c) there is a subsequence $(X_{k_i})$ that converges a.s. By uniqueness of limits in the p. or a.s. senses, the limit of the subsequence is the same random variable, $X_\infty$ (up to differences on a set of measure zero).

Mean square convergence, together with the Cauchy criteria for mean square convergence, is used extensively in these notes. The remainder of this section concerns a more convenient form of the Cauchy criteria for m.s. convergence.

Proposition 2.2.3 (Correlation version of the Cauchy criterion for m.s. convergence) Let $(X_n)$ be a sequence of random variables with $E[X_n^2] < +\infty$ for each $n$. Then there exists a random variable $X$ such that $X_n \xrightarrow{m.s.} X$ if and only if the limit $\lim_{m,n\to\infty} E[X_n X_m]$ exists and is finite. Furthermore, if $X_n \xrightarrow{m.s.} X$, then $\lim_{m,n\to\infty} E[X_n X_m] = E[X^2]$.
Proof. The "if" part is proved first. Suppose $\lim_{m,n\to\infty} E[X_n X_m] = c$ for a finite constant $c$. Then
$$E[(X_n - X_m)^2] = E[X_n^2] - 2E[X_n X_m] + E[X_m^2] \to c - 2c + c = 0 \quad \text{as } m, n \to \infty.$$
Thus, $X_n$ is Cauchy in the m.s. sense, so $X_n \xrightarrow{m.s.} X$ for some random variable $X$.

To prove the "only if" part, suppose $X_n \xrightarrow{m.s.} X$. Observe next that
$$E[X_m X_n] = E[(X + (X_m - X))(X + (X_n - X))] = E[X^2 + (X_m - X)X + X(X_n - X) + (X_m - X)(X_n - X)].$$
By the Cauchy-Schwarz inequality,
$$E[|(X_m - X)X|] \le E[(X_m - X)^2]^{\frac{1}{2}} E[X^2]^{\frac{1}{2}} \to 0$$
$$E[|(X_m - X)(X_n - X)|] \le E[(X_m - X)^2]^{\frac{1}{2}} E[(X_n - X)^2]^{\frac{1}{2}} \to 0$$
and similarly $E[|X(X_n - X)|] \to 0$. Thus $E[X_m X_n] \to E[X^2]$. This establishes both the "only if" part of the proposition and the last statement of the proposition. The proof of the proposition is complete.

Corollary 2.2.4 Suppose $X_n \xrightarrow{m.s.} X$ and $Y_n \xrightarrow{m.s.} Y$. Then $E[X_n Y_n] \to E[XY]$.

Proof. By the inequality $(a+b)^2 \le 2a^2 + 2b^2$, it follows that $X_n + Y_n \xrightarrow{m.s.} X + Y$ as $n \to \infty$. Proposition 2.2.3 therefore implies that $E[(X_n + Y_n)^2] \to E[(X + Y)^2]$, $E[X_n^2] \to E[X^2]$, and $E[Y_n^2] \to E[Y^2]$. Since $X_n Y_n = ((X_n + Y_n)^2 - X_n^2 - Y_n^2)/2$, the corollary follows.

Corollary 2.2.5 Suppose $X_n \xrightarrow{m.s.} X$. Then $E[X_n] \to E[X]$.

Proof. Corollary 2.2.5 follows from Corollary 2.2.4 by taking $Y_n = 1$ for all $n$.

Example 2.2.6 This example illustrates the use of Proposition 2.2.3. Let $X_1, X_2, \ldots$ be mean zero random variables such that
$$E[X_i X_j] = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{else} \end{cases}$$
Does the series $\sum_{k=1}^{\infty} \frac{X_k}{k}$ converge in the mean square sense to a random variable with a finite second moment? Let $Y_n = \sum_{k=1}^{n} \frac{X_k}{k}$. The question is whether $Y_n$ converges in the mean square sense to a random variable with finite second moment. The answer is yes if and only if $\lim_{m,n\to\infty} E[Y_m Y_n]$ exists and is finite. Observe that
$$E[Y_m Y_n] = \sum_{k=1}^{\min(m,n)} \frac{1}{k^2} \to \sum_{k=1}^{\infty} \frac{1}{k^2} \quad \text{as } m, n \to \infty.$$
This sum is smaller than $1 + \int_1^\infty \frac{1}{x^2}\,dx = 2 < \infty$. (In fact, the sum is equal to $\frac{\pi^2}{6}$, but the technique of comparing the sum to an integral to show the sum is finite is the main point here.) Therefore, by Proposition 2.2.3, the series $\sum_{k=1}^{\infty} \frac{X_k}{k}$ indeed converges in the m.s. sense.
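The quantities in Example 2.2.6 depend only on the correlations $E[X_i X_j]$, so they can be computed exactly without simulation. The following is a small sketch under that observation; the function names `second_moment` and `ms_distance_sq` are hypothetical helpers, not part of the notes.

```python
import math

def second_moment(n):
    """E[Y_n^2] = sum_{k=1}^n 1/k^2 for Y_n = sum_{k=1}^n X_k / k."""
    return sum(1.0 / (k * k) for k in range(1, n + 1))

def ms_distance_sq(m, n):
    """E[(Y_m - Y_n)^2]: the sum of 1/k^2 over k strictly between min and max index."""
    lo, hi = min(m, n), max(m, n)
    return sum(1.0 / (k * k) for k in range(lo + 1, hi + 1))

for n in (10, 100, 1000, 10000):
    print(n, second_moment(n), ms_distance_sq(n, 2 * n))
print("integral bound:", 2.0, " pi^2/6:", math.pi ** 2 / 6)
```

The second column stays below the integral bound 2 and approaches $\pi^2/6$, while the third column, the mean square distance between $Y_n$ and $Y_{2n}$, shrinks toward zero, which is the Cauchy property in the m.s. sense.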
2.3 Limit theorems for sequences of independent random variables

Sums of many independent random variables often have distributions that can be characterized by a small number of parameters. For engineering applications, this represents a low complexity method for describing the random variables. An analogous tool is the Taylor series approximation. A continuously differentiable function $f$ can be approximated near zero by the first order Taylor's approximation
$$f(x) \approx f(0) + x f'(0).$$
A second order approximation, in case $f$ is twice continuously differentiable, is
$$f(x) \approx f(0) + x f'(0) + \frac{x^2}{2} f''(0).$$
Bounds on the approximation error are given by Taylor's theorem, found in Appendix 11.4. In essence, Taylor's approximation lets us represent the function by the numbers $f(0)$, $f'(0)$ and $f''(0)$. We shall see that the law of large numbers and central limit theorem can be viewed not just as analogies of the first and second order Taylor's approximations, but actually as consequences of them.

Lemma 2.3.1 If $x_n \to x$ as $n \to \infty$ then $\left(1 + \frac{x_n}{n}\right)^n \to e^x$ as $n \to \infty$.

Proof. The basic idea is to note that $(1+s)^n = \exp(n \ln(1+s))$, and apply Taylor's theorem to $\ln(1+s)$ about the point $s = 0$. The details are given next. Since $\ln(1+s)|_{s=0} = 0$, $\ln(1+s)'|_{s=0} = 1$, and $\ln(1+s)'' = -\frac{1}{(1+s)^2}$, the mean value form of Taylor's Theorem (see the appendix) yields that if $s > -1$, then $\ln(1+s) = s - \frac{s^2}{2(1+y)^2}$, where $y$ lies in the closed interval with endpoints 0 and $s$. Thus, if $s \ge 0$, then $y \ge 0$, so that
$$s - \frac{s^2}{2} \le \ln(1+s) \le s \quad \text{if } s \ge 0.$$
More to the point, if it is only known that $s \ge -\frac{1}{2}$, then $y \ge -\frac{1}{2}$, so that
$$s - 2s^2 \le \ln(1+s) \le s \quad \text{if } s \ge -\tfrac{1}{2}.$$
Letting $s = \frac{x_n}{n}$, multiplying through by $n$, and applying the exponential function, yields that
$$\exp\left(x_n - \frac{2x_n^2}{n}\right) \le \left(1 + \frac{x_n}{n}\right)^n \le \exp(x_n) \quad \text{if } x_n \ge -\frac{n}{2}.$$
If $x_n \to x$ as $n \to \infty$ then the condition $x_n > -\frac{n}{2}$ holds for all large enough $n$, and $x_n - \frac{2x_n^2}{n} \to x$, yielding the desired result.
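The two-sided bound in the proof of Lemma 2.3.1 can be observed numerically. The sketch below assumes the illustrative sequence $x_n = 1 + 1/n$ (so $x = 1$), which is not taken from the notes; the helper name `sandwich` is likewise hypothetical.

```python
import math

def sandwich(x_n, n):
    """Return (lower, middle, upper) for
    exp(x_n - 2*x_n^2/n) <= (1 + x_n/n)^n <= exp(x_n), valid when x_n >= -n/2."""
    if x_n < -n / 2:
        raise ValueError("the bound requires x_n >= -n/2")
    lower = math.exp(x_n - 2.0 * x_n * x_n / n)
    middle = (1.0 + x_n / n) ** n
    upper = math.exp(x_n)
    return lower, middle, upper

for n in (10, 100, 1000, 10000):
    x_n = 1.0 + 1.0 / n  # assumed example sequence converging to x = 1
    print(n, sandwich(x_n, n), math.e)
```

As $n$ grows, the lower and upper bounds both approach $e$, squeezing the middle term between them.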
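A sequence of random variables $(X_n)$ is said to be independent and identically distributed (iid) if the $X_i$'s are mutually independent and identically distributed.

Proposition 2.3.2 (Law of large numbers) Suppose that $X_1, X_2, \ldots$ is a sequence of random variables such that each $X_i$ has finite mean $m$. Let $S_n = X_1 + \cdots + X_n$. Then

(a) $\frac{S_n}{n} \xrightarrow{m.s.} m$ (hence also $\frac{S_n}{n} \xrightarrow{p.} m$ and $\frac{S_n}{n} \xrightarrow{d.} m$) if for some constant $c$, $\mathrm{Var}(X_i) \le c$ for all $i$, and $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$ (i.e. if the variances are bounded and the $X_i$'s are uncorrelated).

(b) $\frac{S_n}{n} \xrightarrow{p.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the weak law of large numbers.)

(c) $\frac{S_n}{n} \xrightarrow{a.s.} m$ if $X_1, X_2, \ldots$ are iid. (This version is the strong law of large numbers.)

We give a proof of (a) and (b), but prove (c) only under an extra condition. Suppose the conditions of (a) are true. Then
$$E\left[\left(\frac{S_n}{n} - m\right)^2\right] = \mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) = \frac{1}{n^2}\sum_i \sum_j \mathrm{Cov}(X_i, X_j) = \frac{1}{n^2}\sum_i \mathrm{Var}(X_i) \le \frac{c}{n}.$$
Therefore $\frac{S_n}{n} \xrightarrow{m.s.} m$.

Turn next to part (b). If in addition to the conditions of (b) it is assumed that $\mathrm{Var}(X_1) < +\infty$, then the conditions of part (a) are true. Since mean square convergence implies convergence in probability, the conclusion of part (b) follows. An extra credit problem shows how to use the same approach to verify (b) even if $\mathrm{Var}(X_1) = +\infty$.

Here a second approach to proving (b) is given. The characteristic function of $\frac{X_i}{n}$ is given by
$$E\left[\exp\left(\frac{juX_i}{n}\right)\right] = E\left[\exp\left(j\left(\frac{u}{n}\right)X_i\right)\right] = \Phi_X\left(\frac{u}{n}\right),$$
where $\Phi_X$ denotes the characteristic function of $X_1$. Since the characteristic function of the sum of independent random variables is the product of the characteristic functions,
$$\Phi_{\frac{S_n}{n}}(u) = \left(\Phi_X\left(\frac{u}{n}\right)\right)^n.$$
Since $E(X_1) = m$ it follows that $\Phi_X$ is differentiable with $\Phi_X(0) = 1$, $\Phi_X'(0) = jm$ and $\Phi_X'$ is continuous. By Taylor's theorem, for any $u$ fixed,
$$\Phi_X\left(\frac{u}{n}\right) = 1 + \frac{u\,\Phi_X'(u_n)}{n}$$
for some $u_n$ between 0 and $\frac{u}{n}$ for all $n$. Since $\Phi_X'(u_n) \to jm$ as $n \to \infty$, Lemma 2.3.1 yields $\Phi_X\left(\frac{u}{n}\right)^n \to \exp(jum)$ as $n \to \infty$. Note that $\exp(jum)$ is the characteristic function of a random variable equal to $m$ with probability one. Since pointwise convergence of characteristic functions to a valid characteristic function implies convergence in distribution, it follows that $\frac{S_n}{n} \xrightarrow{d.} m$. However, convergence in distribution to a constant implies convergence in probability, so (b) is proved.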
Part (c) is proved under the additional assumption that $E[X_1^4] < +\infty$. Without loss of generality we assume that $E[X_1] = 0$. Consider expanding $S_n^4$. There are $n$ terms of the form $X_i^4$ and $3n(n-1)$ terms of the form $X_i^2 X_j^2$ with $1 \le i, j \le n$ and $i \ne j$. The other terms have the form $X_i^3 X_j$, $X_i^2 X_j X_k$ or $X_i X_j X_k X_l$ for distinct $i, j, k, l$, and these terms have mean zero. Thus,
$$E[S_n^4] = nE[X_1^4] + 3n(n-1)E[X_1^2]^2.$$
Let $Y = \sum_{n=1}^{\infty} \left(\frac{S_n}{n}\right)^4$. The value of $Y$ is well defined but it is a priori possible that $Y(\omega) = +\infty$ for some $\omega$. However, by the monotone convergence theorem, the expectation of the sum of nonnegative random variables is the sum of the expectations, so that
$$E[Y] = \sum_{n=1}^{\infty} E\left[\left(\frac{S_n}{n}\right)^4\right] = \sum_{n=1}^{\infty} \frac{nE[X_1^4] + 3n(n-1)E[X_1^2]^2}{n^4} < +\infty.$$
Therefore, $P\{Y < +\infty\} = 1$. However, $\{Y < +\infty\}$ is a subset of the event of convergence $\{\omega : \frac{S_n(\omega)}{n} \to 0 \text{ as } n \to \infty\}$, so the event of convergence also has probability one. Thus, part (c) under the extra fourth moment condition is proved.

Proposition 2.3.3 (Central Limit Theorem) Suppose that $X_1, X_2, \ldots$ are i.i.d., each with mean $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then the normalized sum
$$\frac{S_n - n\mu}{\sqrt{n}}$$
converges in distribution to the $N(0, \sigma^2)$ distribution as $n \to \infty$.

Proof. Without loss of generality, assume that $\mu = 0$. Then the characteristic function of the normalized sum $\frac{S_n}{\sqrt{n}}$ is given by $\Phi_X\left(\frac{u}{\sqrt{n}}\right)^n$, where $\Phi_X$ denotes the characteristic function of $X_1$. Since $X_1$ has mean 0 and finite second moment $\sigma^2$, it follows that $\Phi_X$ is twice differentiable with $\Phi_X(0) = 1$, $\Phi_X'(0) = 0$, $\Phi_X''(0) = -\sigma^2$, and $\Phi_X''$ is continuous. By Taylor's theorem, for any $u$ fixed,
$$\Phi_X\left(\frac{u}{\sqrt{n}}\right) = 1 + \frac{u^2}{2n}\Phi_X''(u_n)$$
for some $u_n$ between 0 and $\frac{u}{\sqrt{n}}$ for all $n$. Since $\Phi_X''(u_n) \to -\sigma^2$ as $n \to \infty$, Lemma 2.3.1 yields $\Phi_X\left(\frac{u}{\sqrt{n}}\right)^n \to \exp\left(-\frac{u^2\sigma^2}{2}\right)$ as $n \to \infty$. Since pointwise convergence of characteristic functions to a valid characteristic function implies convergence in distribution, the proposition is proved.
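The convergence of characteristic functions used in the proof of the central limit theorem can be seen in a concrete case. The sketch below assumes, purely for illustration, that each $X_i$ equals $+1$ or $-1$ with probability $1/2$, so that $\mu = 0$, $\sigma^2 = 1$ and $\Phi_X(u) = \cos(u)$; this choice and the function names are assumptions made for the example.

```python
import math

def phi_normalized_sum(u, n):
    """Characteristic function of S_n/sqrt(n) when X_i = +1 or -1 with probability 1/2:
    Phi_X(u/sqrt(n))^n with Phi_X(u) = cos(u)."""
    return math.cos(u / math.sqrt(n)) ** n

def phi_gaussian(u, sigma_sq=1.0):
    """Characteristic function of the N(0, sigma_sq) distribution."""
    return math.exp(-0.5 * sigma_sq * u * u)

for n in (1, 10, 100, 10000):
    print(n, [round(phi_normalized_sum(u, n), 6) for u in (0.5, 1.0, 2.0)])
print("limit", [round(phi_gaussian(u), 6) for u in (0.5, 1.0, 2.0)])
```

For each fixed $u$ the rows approach the Gaussian values in the last line, which is the pointwise convergence the proof relies on.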
2.4 Convex functions and Jensen's inequality

Let $\varphi$ be a function on $\mathbb{R}$ with values in $\mathbb{R} \cup \{+\infty\}$ such that $\varphi(x) < \infty$ for at least one value of $x$. Then $\varphi$ is said to be convex if for any $a$, $b$ and $\lambda$ with $a < b$ and $0 \le \lambda \le 1$,
$$\varphi(a\lambda + b(1-\lambda)) \le \lambda\varphi(a) + (1-\lambda)\varphi(b).$$
This means that the graph of $\varphi$ on any interval $[a, b]$ lies below the line segment equal to $\varphi$ at the endpoints of the interval.

Proposition 2.4.1 Suppose $f$ is twice continuously differentiable on $\mathbb{R}$. Then $f$ is convex if and only if $f''(v) \ge 0$ for all $v$.

Proof. Suppose that $f$ is twice continuously differentiable. Given $a < b$, define $D_{ab} = \lambda f(a) + (1-\lambda)f(b) - f(\lambda a + (1-\lambda)b)$. For any $x \ge a$,
$$f(x) = f(a) + \int_a^x f'(u)\,du = f(a) + (x-a)f'(a) + \int_a^x \int_a^u f''(v)\,dv\,du = f(a) + (x-a)f'(a) + \int_a^x (x-v)f''(v)\,dv.$$
Applying this to $D_{ab}$ yields
$$D_{ab} = \int_a^b \min\{\lambda(v-a),\ (1-\lambda)(b-v)\}\,f''(v)\,dv \qquad (2.7)$$
On one hand, if $f''(v) \ge 0$ for all $v$, then $D_{ab} \ge 0$ whenever $a < b$, so that $f$ is convex. On the other hand, if $f''(v_o) < 0$ for some $v_o$, then by the assumed continuity of $f''$, there is a small open interval $(a, b)$ containing $v_o$ so that $f''(v) < 0$ for $a < v < b$. But then $D_{ab} < 0$ for $0 < \lambda < 1$, implying that $f$ is not convex.

Examples of convex functions include:
$ax^2 + bx + c$ for constants $a, b, c$ with $a \ge 0$,
$e^{\lambda x}$ for $\lambda$ constant,
$$\varphi(x) = \begin{cases} -\ln x & x > 0 \\ +\infty & x \le 0, \end{cases} \qquad \varphi(x) = \begin{cases} x \ln x & x > 0 \\ 0 & x = 0 \\ +\infty & x < 0. \end{cases}$$

Theorem 2.4.2 (Jensen's inequality) Let $\varphi$ be a convex function and let $X$ be a random variable such that $E[X]$ is finite. Then $E[\varphi(X)] \ge \varphi(E[X])$.
For example, Jensen's inequality implies that $E[X^2] \ge E[X]^2$, which also follows from the fact $\mathrm{Var}(X) = E[X^2] - E[X]^2$.

Proof. Since $\varphi$ is convex, there is a tangent to the graph of $\varphi$ at $E[X]$, meaning there is a function $L$ of the form $L(x) = a + bx$ such that $\varphi(x) \ge L(x)$ for all $x$ and $\varphi(E[X]) = L(E[X])$. See the illustration in Figure 2.12. Therefore $E[\varphi(X)] \ge E[L(X)] = L(E[X]) = \varphi(E[X])$, which establishes the theorem.

[Figure 2.12: A convex function and a tangent linear function.]

A function $\varphi$ is called concave if $-\varphi$ is convex. If $\varphi$ is concave then $E[\varphi(X)] \le \varphi(E[X])$.

2.5 Chernoff bound and large deviations theory

Let $X_1, X_2, \ldots$ be an iid sequence of random variables with finite mean $\mu$, and let $S_n = X_1 + \cdots + X_n$. The weak law of large numbers implies that for fixed $a$ with $a > \mu$, $P\{\frac{S_n}{n} \ge a\} \to 0$ as $n \to \infty$. In case the $X_i$'s have finite variance, the central limit theorem offers a refinement of the law of large numbers, by identifying the limit of $P\{\frac{S_n}{n} \ge a_n\}$, where $(a_n)$ is a sequence that converges to $\mu$ in the particular manner: $a_n = \mu + \frac{c}{\sqrt{n}}$. For fixed $c$, the limit is not zero. One can think of the central limit theorem, therefore, to concern "normal" deviations of $S_n$ from its mean. Large deviations theory, by contrast, addresses $P\{\frac{S_n}{n} \ge a\}$ for $a$ fixed, and in particular it identifies how quickly $P\{\frac{S_n}{n} \ge a\}$ converges to zero as $n \to \infty$. We shall first describe the Chernoff bound, which is a simple upper bound on $P\{\frac{S_n}{n} \ge a\}$. Then Cramér's theorem, to the effect that the Chernoff bound is in a certain sense tight, is stated.

The moment generating function of $X_1$ is defined by $M(\theta) = E[e^{\theta X_1}]$, and $\ln M(\theta)$ is called the log moment generating function. Since $e^{\theta X_1}$ is a positive random variable, the expectation, and hence $M(\theta)$ itself, is well-defined for all real values of $\theta$, with possible value $+\infty$. The Chernoff bound is simply given as
$$P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n[\theta a - \ln M(\theta)]) \quad \text{for } \theta \ge 0 \qquad (2.8)$$
The bound (2.8), like the Chebychev inequality, is a consequence of Markov's inequality applied to an appropriate function.
For $\theta \ge 0$:
$$P\left\{\frac{S_n}{n} \ge a\right\} = P\{e^{\theta(X_1 + \cdots + X_n - na)} \ge 1\} \le E[e^{\theta(X_1 + \cdots + X_n - na)}] = E[e^{\theta X_1}]^n e^{-n\theta a} = \exp(-n[\theta a - \ln M(\theta)]).$$

To make the best use of the Chernoff bound we can optimize the bound by selecting the best $\theta$. Thus, we wish to select $\theta \ge 0$ to maximize $a\theta - \ln M(\theta)$.

In general the log moment generating function $\ln M$ is convex. Note that $\ln M(0) = 0$. Let us suppose that $M(\theta)$ is finite for some $\theta > 0$. Then
$$\left.\frac{d \ln M(\theta)}{d\theta}\right|_{\theta=0} = \left.\frac{E[X_1 e^{\theta X_1}]}{E[e^{\theta X_1}]}\right|_{\theta=0} = E[X_1].$$
The sketch of a typical case is shown in Figure 2.13. Figure 2.13 also shows the line of slope $a$. Because of the assumption that $a > E[X_1]$, the line lies strictly above $\ln M(\theta)$ for small enough positive $\theta$ and below $\ln M(\theta)$ for all $\theta < 0$. Therefore, the maximum value of $\theta a - \ln M(\theta)$ over $\theta \ge 0$ is equal to $l(a)$, defined by
$$l(a) = \sup_{-\infty < \theta < \infty} \theta a - \ln M(\theta) \qquad (2.9)$$

[Figure 2.13: A log moment generating function and a line of slope $a$.]

Thus, the Chernoff bound in its optimized form is
$$P\left\{\frac{S_n}{n} \ge a\right\} \le \exp(-n\,l(a)), \qquad a > E[X_1].$$
There does not exist such a clean lower bound on the large deviation probability $P\{\frac{S_n}{n} \ge a\}$, but by the celebrated theorem of Cramér stated next, the Chernoff bound gives the right exponent.
Theorem 2.5.1 (Cramér's theorem) Suppose $E[X_1]$ is finite, and that $E[X_1] < a$. Then for $\epsilon > 0$ there exists a number $n_\epsilon$ such that
$$P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp(-n(l(a) + \epsilon)) \qquad (2.10)$$
for all $n \ge n_\epsilon$. Combining this bound with the Chernoff inequality yields
$$\lim_{n\to\infty} \frac{1}{n} \ln P\left\{\frac{S_n}{n} \ge a\right\} = -l(a).$$
In particular, if $l(a)$ is finite (equivalently if $P\{X_1 \ge a\} > 0$) then
$$P\left\{\frac{S_n}{n} \ge a\right\} = \exp(-n(l(a) + \epsilon_n))$$
where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n\to\infty} \epsilon_n = 0$. Similarly, if $a < E[X_1]$ and $l(a)$ is finite, then
$$P\left\{\frac{S_n}{n} \le a\right\} = \exp(-n(l(a) + \epsilon_n))$$
where $(\epsilon_n)$ is a sequence with $\epsilon_n \ge 0$ and $\lim_{n\to\infty} \epsilon_n = 0$. Informally, we can write for $n$ large:
$$P\left\{\frac{S_n}{n} \in da\right\} \approx e^{-n\,l(a)}\,da \qquad (2.11)$$

Proof. The lower bound (2.10) is proved here under the additional assumption that $X_1$ is a bounded random variable: $P\{|X_1| \le C\} = 1$ for some constant $C$; this assumption can be removed by a truncation argument covered in a homework problem. Also, to avoid trivialities, suppose $P\{X_1 > a\} > 0$. The assumption that $X_1$ is bounded and the monotone convergence theorem imply that the function $M(\theta)$ is finite and infinitely differentiable over $\theta \in \mathbb{R}$. Given $\theta \in \mathbb{R}$, let $P_\theta$ denote a new probability measure on the same probability space that $X_1, X_2, \ldots$ are defined on such that for any $n$ and any event of the form $\{(X_1, \ldots, X_n) \in B\}$,
$$P_\theta\{(X_1, \ldots, X_n) \in B\} = \frac{E\left[I_{\{(X_1, \ldots, X_n) \in B\}}\,e^{\theta S_n}\right]}{M(\theta)^n}.$$
In particular, if $X_i$ has pdf $f$ for each $i$ under the original probability measure $P$, then under the new probability measure $P_\theta$, each $X_i$ has pdf $f_\theta$ defined by $f_\theta(x) = \frac{f(x)e^{\theta x}}{M(\theta)}$, and the random variables $X_1, X_2, \ldots$ are independent under $P_\theta$. The pdf $f_\theta$ is called the tilted version of $f$ with parameter $\theta$, and $P_\theta$ is similarly called the tilted version of $P$ with parameter $\theta$. It is not difficult to show that the mean and variance of the $X_i$'s under $P_\theta$ are given by:
$$E_\theta[X_1] = \frac{E[X_1 e^{\theta X_1}]}{M(\theta)} = (\ln M(\theta))'$$
$$\mathrm{Var}_\theta[X_1] = E_\theta[X_1^2] - E_\theta[X_1]^2 = (\ln M(\theta))''$$
Under the assumptions we've made, $X_1$ has strictly positive variance under $P_\theta$ for all $\theta$, so that $\ln M(\theta)$ is strictly convex.

The assumption $P\{X_1 > a\} > 0$ implies that $(a\theta - \ln M(\theta)) \to -\infty$ as $\theta \to \infty$. Together with the fact that $\ln M(\theta)$ is differentiable and strictly convex, there thus exists a unique value $\theta^*$ of $\theta$ that maximizes $a\theta - \ln M(\theta)$. So $l(a) = a\theta^* - \ln M(\theta^*)$. Also, the derivative of $a\theta - \ln M(\theta)$ at $\theta = \theta^*$ is zero, so that $E_{\theta^*}[X_1] = (\ln M(\theta))'\big|_{\theta=\theta^*} = a$. Observe that for any $b$ with $b > a$,
$$P\left\{\frac{S_n}{n} \ge a\right\} = \int_{\{\omega : na \le S_n\}} 1\,dP = \int_{\{\omega : na \le S_n\}} M(\theta^*)^n e^{-\theta^* S_n}\,\frac{e^{\theta^* S_n}\,dP}{M(\theta^*)^n}$$
$$= M(\theta^*)^n \int_{\{\omega : na \le S_n\}} e^{-\theta^* S_n}\,dP_{\theta^*} \ge M(\theta^*)^n \int_{\{\omega : na \le S_n \le nb\}} e^{-\theta^* S_n}\,dP_{\theta^*} \ge M(\theta^*)^n e^{-\theta^* nb}\,P_{\theta^*}\{na \le S_n \le nb\}.$$
Now $M(\theta^*)^n e^{-\theta^* nb} = \exp(-n(l(a) + \theta^*(b-a)))$, and by the central limit theorem, $P_{\theta^*}\{na \le S_n \le nb\} \to \frac{1}{2}$ as $n \to \infty$, so $P_{\theta^*}\{na \le S_n \le nb\} \ge 1/3$ for $n$ large enough. Therefore, for $n$ large enough,
$$P\left\{\frac{S_n}{n} \ge a\right\} \ge \exp\left(-n\left(l(a) + \theta^*(b-a) + \frac{\ln 3}{n}\right)\right).$$
Taking $b$ close enough to $a$ implies (2.10) for large enough $n$.

Example 2.5.2 Let $X_1, X_2, \ldots$ be independent and exponentially distributed with parameter $\lambda = 1$. Then
$$\ln M(\theta) = \ln \int_0^\infty e^{\theta x} e^{-x}\,dx = \begin{cases} -\ln(1-\theta) & \theta < 1 \\ +\infty & \theta \ge 1 \end{cases}$$
See Figure 2.14. Therefore, for any $a \in \mathbb{R}$,
$$l(a) = \max_\theta \{a\theta - \ln M(\theta)\} = \max_{\theta < 1} \{a\theta + \ln(1-\theta)\}.$$
If $a \le 0$ then $l(a) = +\infty$. On the other hand, if $a > 0$ then setting the derivative of $a\theta + \ln(1-\theta)$ to 0 yields the maximizing value $\theta = 1 - \frac{1}{a}$, and therefore
$$l(a) = \begin{cases} a - 1 - \ln(a) & a > 0 \\ +\infty & a \le 0 \end{cases}$$
The function $l$ is shown in Figure 2.14.

[Figure 2.14: $\ln M(\theta)$ and $l(a)$ for an Exp(1) random variable.]
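The closed form for $l(a)$ in Example 2.5.2 can be compared against a direct numerical maximization of $a\theta - \ln M(\theta)$. The following is a rough sketch: the grid search, its range $[-50, 1)$ and resolution, and the function names are all assumptions chosen for illustration, not a recommended optimization method.

```python
import math

def log_mgf_exp1(theta):
    """ln M(theta) for an Exp(1) random variable: -ln(1 - theta) if theta < 1, else +inf."""
    return -math.log(1.0 - theta) if theta < 1.0 else math.inf

def rate_closed_form(a):
    """l(a) = a - 1 - ln(a) for a > 0, and +inf for a <= 0."""
    return a - 1.0 - math.log(a) if a > 0 else math.inf

def rate_numeric(a, grid=200001, lo=-50.0, hi=1.0):
    """Approximate the sup of a*theta - ln M(theta) over a uniform grid in [lo, hi)."""
    best = -math.inf
    for i in range(grid):
        theta = lo + (hi - lo) * i / grid  # stays strictly below hi = 1
        best = max(best, a * theta - log_mgf_exp1(theta))
    return best

for a in (0.5, 2.0, 3.0):
    print(a, rate_closed_form(a), rate_numeric(a))
    # Optimized Chernoff bound exp(-n*l(a)) on P{S_n/n >= a}, shown for n = 20 and a > 1
    if a > 1.0:
        print("  Chernoff bound, n=20:", math.exp(-20 * rate_closed_form(a)))
```

The grid value never exceeds the closed form, since it is a maximum over a subset of the allowed $\theta$, and for these values of $a$ the two agree to several decimal places.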
Example 2.5.3 Let $X_1, X_2, \ldots$ be independent Bernoulli random variables with parameter $p$ satisfying $0 < p < 1$. Thus $S_n$ has the binomial distribution. Then $\ln M(\theta) = \ln(pe^\theta + (1-p))$, which has asymptotic slope 1 as $\theta \to +\infty$ and converges to a constant as $\theta \to -\infty$. Therefore, $l(a) = +\infty$ if $a > 1$ or if $a < 0$. For $0 \le a \le 1$, we find $a\theta - \ln M(\theta)$ is maximized by $\theta = \ln\left(\frac{a(1-p)}{p(1-a)}\right)$, leading to
$$l(a) = \begin{cases} a \ln\left(\frac{a}{p}\right) + (1-a)\ln\left(\frac{1-a}{1-p}\right) & 0 \le a \le 1 \\ +\infty & \text{else} \end{cases}$$
See Figure 2.15.

[Figure 2.15: $\ln M(\theta)$ and $l(a)$ for a Bernoulli distribution.]

2.6 Problems

2.1 Limits and infinite sums for deterministic sequences
(a) Using the definition of a limit, show that $\lim_{\theta \to 0} \theta(1 + \cos(\theta)) = 0$.
(b) Using the definition of a limit, show that $\lim_{\theta \to 0,\, \theta > 0} \frac{1 + \cos(\theta)}{\theta} = +\infty$.
(c) Determine whether the following sum is finite, and justify your answer: $\sum_{n=1}^{\infty} \frac{1 + \sqrt{n}}{1 + n^2}$.
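2.2 The limit of the product is the product of the limits
Consider two (deterministic) sequences with finite limits: $\lim_{n\to\infty} x_n = x$ and $\lim_{n\to\infty} y_n = y$.
(a) Prove that the sequence $(y_n)$ is bounded.
(b) Prove that $\lim_{n\to\infty} x_n y_n = xy$. (Hint: Note that $x_n y_n - xy = (x_n - x)y_n + x(y_n - y)$ and use part (a)).

2.3 The reciprocal of the limit is the limit of the reciprocal
Using the definition of convergence for deterministic sequences, prove that if $(x_n)$ is a sequence with a nonzero finite limit $x_\infty$, then the sequence $(1/x_n)$ converges to $1/x_\infty$.

2.4 Limits of some deterministic series
Determine which of the following series are convergent (i.e. have partial sums converging to a finite limit). Justify your answers.
(a) $\sum_{n=0}^{\infty} \frac{3^n}{n!}$
(b) $\sum_{n=1}^{\infty} \frac{(n+2)\ln n}{(n+5)^3}$
(c) $\sum_{n=1}^{\infty} \frac{1}{(\ln(n+1))^5}$

2.5 On convergence of deterministic sequences and functions
(a) Let $x_n = \frac{8n^2 + n}{3n^2}$ for $n \ge 1$. Prove that $\lim_{n\to\infty} x_n = \frac{8}{3}$.
(b) Suppose $f_n$ is a function on some set $D$ for each $n \ge 1$, and suppose $f$ is also a function on $D$. Then $f_n$ is defined to converge to $f$ uniformly if for any $\epsilon > 0$, there exists an $n_\epsilon$ such that $|f_n(x) - f(x)| \le \epsilon$ for all $x \in D$ whenever $n \ge n_\epsilon$. A key point is that $n_\epsilon$ does not depend on $x$. Show that the functions $f_n(x) = x^n$ on the semi-open interval $[0, 1)$ do not converge uniformly to the zero function.
(c) The supremum of a function $f$ on $D$, written $\sup_D f$, is the least upper bound of $f$. Equivalently, $\sup_D f$ satisfies $\sup_D f \ge f(x)$ for all $x \in D$, and given any $c < \sup_D f$, there is an $x \in D$ such that $f(x) \ge c$. Show that $|\sup_D f - \sup_D g| \le \sup_D |f - g|$. Conclude that if $f_n$ converges to $f$ uniformly on $D$, then $\sup_D f_n$ converges to $\sup_D f$.

2.6 Convergence of sequences of random variables
Let $\Theta$ be uniformly distributed on the interval $[0, 2\pi]$. In which of the four senses (a.s., m.s., p., d.) do each of the following two sequences converge? Identify the limits, if they exist, and justify your answers.
(a) $(X_n : n \ge 1)$ defined by $X_n = \cos(n\Theta)$.
(b) $(Y_n : n \ge 1)$ defined by $Y_n = \left|1 - \frac{\Theta}{\pi}\right|^n$.

2.7 Convergence of random variables on (0,1]
Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. For the following two sequences of random variables on $(\Omega, \mathcal{F}, P)$, find and sketch the distribution function of $X_n$ for typical $n$, and decide in which sense(s) (if any) each of the two sequences converges.
(a) $X_n(\omega) = n\omega - \lfloor n\omega \rfloor$, where $\lfloor x \rfloor$ is the largest integer less than or equal to $x$.
(b) $X_n(\omega) = n^2\omega$ if $0 < \omega < 1/n$, and $X_n(\omega) = 0$ otherwise.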
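2.8 Convergence of random variables on (0,1], version 2
Let $\Omega = (0, 1]$, let $\mathcal{F}$ be the Borel $\sigma$ algebra of subsets of $(0, 1]$, and let $P$ be the probability measure on $\mathcal{F}$ such that $P([a, b]) = b - a$ for $0 < a \le b \le 1$. Determine in which of the four senses (a.s., m.s., p., d.), if any, each of the following three sequences of random variables converges. Justify your answers.
(a) $X_n(\omega) = \frac{(-1)^n}{n\sqrt{\omega}}$.
(b) $X_n(\omega) = n\omega^n$.
(c) $X_n(\omega) = \omega \sin(2\pi n\omega)$. (Try at least for a heuristic justification.)

2.9 On the maximum of a random walk with negative drift
Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean $E[X_i] = -1$. Let $S_0 = 0$, and for $n \ge 1$, let $S_n = X_1 + \cdots + X_n$. Let $Z = \max\{S_n : n \ge 0\}$.
(a) Show that $Z$ is well defined with probability one, and $P\{Z < +\infty\} = 1$.
(b) Does there exist a finite constant $L$, depending only on the above assumptions, such that $E[Z] \le L$? Justify your answer. (Hint: $Z \ge \max\{S_0, S_1\} = \max\{0, X_1\}$.)

2.10 Convergence of a sequence of discrete random variables
Let $X_n = X + (1/n)$ where $P[X = i] = 1/6$ for $i = 1, 2, 3, 4, 5$ or 6, and let $F_n$ denote the distribution function of $X_n$.
(a) For what values of $x$ does $F_n(x)$ converge to $F(x)$ as $n$ tends to infinity?
(b) At what values of $x$ is $F_X(x)$ continuous?
(c) Does the sequence $(X_n)$ converge in distribution to $X$?

2.11 Convergence in distribution to a nonrandom limit
Let $(X_n, n \ge 1)$ be a sequence of random variables and let $X$ be a random variable such that $P[X = c] = 1$ for some constant $c$. Prove that if $\lim_{n\to\infty} X_n = X$ d., then $\lim_{n\to\infty} X_n = X$ p. That is, prove that convergence in distribution to a constant implies convergence in probability to the same constant.

2.12 Convergence of a minimum
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly distributed over the interval $[0, 1]$, and let $X_n = \min\{U_1, \ldots, U_n\}$ for $n \ge 1$.
(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answers.
(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta X_n$ converges in distribution as $n \to \infty$ to a nonzero limit, and identify the limit distribution.

2.13 Convergence of a product
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, with each variable being uniformly distributed over the interval $[0, 2]$, and let $X_n = U_1 U_2 \cdots U_n$ for $n \ge 1$.
(a) Determine in which of the senses (a.s., m.s., p., d.) the sequence $(X_n)$ converges as $n \to \infty$, and identify the limit, if any. Justify your answers.
(b) Determine the value of the constant $\theta$ so that the sequence $(Y_n)$ defined by $Y_n = n^\theta \ln(X_n)$ converges in distribution as $n \to \infty$ to a nonzero limit.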
(b) Show that (Zn ) converges in distribution to a constant. let X1. if any.s. Give a yes or no answer to each of the four questions below.i.4. and otherwise she gets nothing back.n = 0] = 1 − (λ/n).n .i. necessarily exist? (b) If limn→∞ Xn = X m.15 Sums of i.i. necessarily exist? (d) If limn→∞ Xn = X m.18 On the growth of the maximum of n independent exponentials Suppose that X1 . II Let X1 . (a) Compute the characteristic function of the following random variables: X1 .s. Roughly what is the probability that she is ahead after playing the game one hundred times? 2.d. √ and Vn = Sn / n.. ln(n) (a) Find a simple expression for the CDF of Zn . (Note: It follows immediately that Zn converges in p. . does the sequence (Yn ) converge? 2.s. are independent random variables. For each no answer. random variables.. . X2 . (b) Find the pointwise limit of the characteristic functions as n → ∞ tends.6. .n + · · · + Xn. and that X is also a random variable.s.s.. random variables. then does limn→∞ h(Xn ) a. then does limn→∞ h(Xn ) m.s.n be independent random variables such that P [Xi.n . to the same constant.1. . ...14 Limits of functions of random variables Let g and h be functions defined as follows: −1 if x ≤ −1 x if − 1 ≤ x ≤ 1 g(x) = 1 if x ≥ 1 67 h(x) = −1 if x ≤ 0 1 if x > 0.d. Let Yn = X1.17 Sums of i.. . For each integer n > λ. For n ≥ 2. X2 . all on the same underlying probability space. The limit is the characteristic function of what probability distribution? (c) In what sense(s). she gets just the one dollar back with probability 0.n = 1] = λ/n and P [Xi. . Sn = X1 + · · · + Xn . and find the constant. senses to the same constant. For each yes answer. (c) In what sense(s). identify the limit and give a justification. then does limn→∞ g(Xn ) a. X2.2.s.. and m. It can also be shown that (Zn ) converges in the a. (a) Compute the characteristic function of Yn for each n. III Fix λ > 0. let Zn = max{X1 . do the sequences (Sn ) and (Vn ) converge? 2. 
(a) If limn→∞ Xn = X a. if any. I A gambler repeatedly plays the following game: She bets one dollar and then there are three possible outcomes: she wins two dollars back with probability 0. necessarily exist? 2.n . Suppose that (Xn : n ≥ 0) is a sequence of random variables.) . . give a counterexample. .5.16 Sums of i. necessarily exist? (c) If limn→∞ Xn = X a. then does limn→∞ g(Xn ) m. g represents a clipper and h represents a hard limiter.s. be independent random variable with P [Xi = 1] = P [Xi = −1] = 0.d. (c) D(f |g) ≥ 0. and for n ≥ 1 let Xn+1 = (1 − Un )Xn + Un Xn−1 .s. Specifically. (b) E[X 4 ] ≥ E[X 2 ]2 . . 0. d. (c) Show that Xn converges in the a.1]. be independent random variables.. X2 . a million dollars). Show that X1 + X2 + · · · exists in the mean square sense if and only if the sum of the variances. the variable Xn+1 is uniformly distributed on the interval with endpoints Xn−1 and Xn . where f and g are positive probability densities on a set A. Using the normal approximation suggested by the central limit theorem. Show your work. (b) Find E[Xn ] for all n. Suppose that the Xi are mutually independent with mean zero.19 Normal approximation for quantization error Suppose each of 100 real numbers are rounded to the nearest integer and then added. X1 + X2 + · · · is defined as the limit as n tends to infinity of the partial sums X1 + X2 + · · · + Xn .22 Convergence analysis of successive averaging Let U1 . etc. Note that given Xn−1 and Xn . (a) Derive an upper bound on P [|X| ≥ 10]. .) the sequence (Xn ) converges as n → ∞. . if any.s. The limit can be taken in the usual senses (in probability. 2. m. (Hint: Let Dn = |Xn − Xn−1 |.20 Limit behavior of a stochastic dynamical system Let W1 . 1 1 (a) E[ X ] ≥ E[X] . be a sequence of independent. Then Dn+1 = Un Dn . 2. U2 .) 2.). 0.24 Mean square convergence of a random series The sum of infinitely many random variables. and D is the divergence distance defined by D(f |g) = A f (x) ln f (x) dx. . 
in distribution.) g(x) 2. Each day you bet . each uniformly distributed on the interval [0. (The base used in the logarithm is not relevant. . recursively by Xk+1 = Xk + Wk . Assume the individual roundoff errors are independent and uniformly distributed over the interval [−0.. for a positive random variable X with finite mean.68 CHAPTER 2. Find a distribution for X such that the bound you found in part (a) holds with equality.23 Understanding the Markov inequality Suppose X is a random variable with E[X 4 ] = 30.5) random variables. . Let X0 = 0 and X1 = 1.s.5.5]. (a) Sketch a typical sample realization of the first few variables in the sequence. is finite. find the approximate probability that the absolute value of the sum of the errors is greater than 5..21 Applications of Jensen’s inequality Explain how each of the inequalties below follows from Jensen’s inequality. sense as n goes to infinity. p. and define 2 X1 . Explain your reasoning.. . Determine in which of the senses (a. 2. and identify the limit. Let X0 = 0. N (0.. Var(X1 ) + Var(X2 ) + · · · . CONVERGENCE OF A SEQUENCE OF RANDOM VARIABLES 2. for a random variable X with finite second moment. identify the convex function and random variable used. (Hint: Apply the Cauchy criteria for mean square convergence. and if m > n then |Xm − Xn | ≤ Dn . (b) (Your bound in (a) must be the best possible in order to get both parts (a) and (b) correct). Justify your answer.25 Portfolio allocation Suppose that you are given one unit of money (for example. W2 .) 2. 1) random variables. whereas if you lose. 2. P [Sn ≥ b n] → Q(b/σ) as n → ∞. and if the answer is yes. be independent. You needn’t give details. The standard 1 Cauchy density and its characteristic function are given by f (x) = π(1+x2 ) and Φ(u) = exp(−|u|). (a) Show that for θ ∈ I. Thus. . X2 ..26 A large deviation Let X1 . Find the constant b such that 2 2 2 P {X1 + X2 + . be independent. you get half of your money back.2. N(0. 
Suppose the moment generating function M (θ) is finite for θ in an open interval I containing zero. variance σ 2 . (d) What value of α maximizes the expected wealth. 2. . (a) Find the characteristic function of Sn for a constant θ. since (ln M (θ)) is nonnegative.. and if the answer is yes. and if the answer is yes. (The interchange of expectation and differentiation with respect to θ can be justified for θ ∈ I. Q(b/σ) ≈ 2 2 σ √ e−b /2σ b 2π if b/σ is moderately large. . PROBLEMS 69 a fraction α of it on a coin toss. . . What is the numerical value of the approximation exp(−nb) if n = 100.6. ln M is a convex function. identify the value of the limit (a) for α = 0 (pure banking). S (d) Does √n converge in distribution as n → ∞? Justify your answer.28 A rapprochement between the central limit theorem and large deviations Let X1 .27 Sums of independent Cauchy random variables Let X1 . . n identify the limiting distribution.) 2. Let Sn = X1 + X2 + · · · + Xn . (ln M (θ)) is the variance for the “tilted” density function fθ defined by fθ (x) = f (x) exp(θx − ln M (θ)). be independent. nθ (b) Does Sn converge in distribution as n → ∞? Justify your answer.) Let b > 0 and let Sn = X1 + · · · + Xn for n any positive integer. (c) for general α. n identify the limiting distribution. and probability density function f . Identify in what sense(s) the limit limn→∞ Wn exists. By the central limit the√ orem. Let Wn denote the wealth you have accumulated (or have left) after n days. If you win. (b) for α = 1 (pure betting). (c) Does Sn converge in distribution as n → ∞? Justify your answer. . n2 identify the limiting distribution. + Xn ≥ 2n} = exp(−n(b + n )) where n → 0 as n → ∞. X2 . . you get double your money back. X2 . E[Wn ]? Would you recommend using that value of α? (e) What value of α maximizes the long term growth rate of Wn (Hint: Consider ln(Wn ) and apply the LLN. This bound is a good approximation 2π if u is moderately large. In particular. 
each with the standard Cauchy density function. identically distributed random variables with mean zero. and when it does. An upper bound on the Q function is given by 2 2 2 ∞ ∞ s 1 Q(u) = u √1 e−s /2 ds ≤ u u√2π e−s /2 ds = u√2π e−u /2 . . . 29 Chernoff bound for Gaussian and Poisson random variables (a) Let X have the N (µ.31 Large deviation exponent for a mixture distribution Problem 2. . Find the optimized Chernoff bound on P {Y ≥ E[Y ] + c} for c ≥ 0. (Note: Y has variance λ. where ψ(u) = 2g(1 + u)/u2 . with g(s) = s(ln s − 1) + 1. . and suppose Sn = X1 + · · · + 1 Xnf + Y1 + · · · + Y(1−f )n . (Hint: Approximate ln M near zero by its second order Taylor’s approximation.) 2. a) can be computed by replacing the probability P { Sn ≥ a} by its n optimized Chernoff bound. . Define l(f. . with ψ(−1) = 2 and ψ(0) = 1.i. each with CDF given by FX (c) = f FY (c) + (1 − f )FZ (c). Suppose all these random variables are mutually independent. . and compare with the approximation given by the central limit theorem.d. 3 .30 Large deviations of a mixed sum Let X1 .30 concerns an example such that 0 < f < 1 and Sn is the sum of n independent random variables. X n + Y n .) c2 c Show that your answer to part (b) can be expressed as P {Y ≥ E[Y ]+c} ≤ exp(− 2λ ψ( λ )) for c ≥ 0. and generating Xi using CDF FY if heads shows and generating Xi with CDF FZ if tails shows. Equivalently. +∞). X2 . 3 2 2 2. Y2 . n (a) Express the function l in terms of f . such that a fraction f of the random variables have a CDF FY and a fraction 1 − f have a CDF FZ . ) 2. . 2 1 random variables. . X2 . . a) for f ∈ {0. Let 0 ≤ f ≤ 1. Cram´rs theorem can e n be extended to show that l(f. It is shown in the solutions that the large deviations exponent for Sn is given by: n l(a) = max {θa − f MY (θ) − (1 − f )MZ (θ)} θ where MY (θ) and MZ (θ) are the log moment generating functions for FY and FZ respectively. Find the optimized Chernoff bound on P {X ≥ E[X] + c} for c ≥ 0. +∞). X1 + Y1 . 
and let l e denote the large deviations exponent for Sn . and identically distributed. have the Exp(1) distribution. . and MZ . Let Sn = X1 + · · · + Xn . each Xi can be generated by flipping a biased coin with probability of heads equal to f . The function ψ is strictly positive and strictly decreasing on the interval [−1. CONVERGENCE OF A SEQUENCE OF RANDOM VARIABLES √ √ (b) The large deviations upper bound yields P [Sn ≥ b n] ≤ exp(−n (b/ n)). (For example. Consider the following variation. (b) Let Y have the P oi(λ) distribution.32 The limit of a sum of cumulative products of a sequence of uniform random variables . MY . have the P oi(1) distribution. Can you also offer an intuitive explanation? 2. . . if f = 1/2. uψ(u) is strictly increasing in u over the interval [−1. (c) (The purpose of this problem is to highlight the similarity of the answers to parts (a) and (b). a) = limn→∞ n ln P { Sn ≥ a} for a > 1. and Y1 . 2 . so the essential difference between the normal and Poisson bounds is the ψ term. . Also. Identify the limit of the large deviations upper bound as n → ∞. . Let X1 . or l(a) ≥ l(a) for all a. . (b) Determine which is true and give a proof: l(a) ≤ l(a) for all a. σ 2 ) distribution.) Compute l(f. 1} and a = 4.70 CHAPTER 2. Xn be independent. we simply view Sn as the sum of the n i. You might also find the previous problem helpful. Y )(1 + )/ . X2 . (ii) in distribution if and only if d2 (X. define d1 (X. For this to be true we must think of the metric as being defined on equivalence classes of random variables.34 * Weak Law of Large Numbers Let X1 . with 1 P [Ai = 1] = P [Ai = 1 ] = 2 for all i. . Show that Xn converges to Y (i) in probability if and only if d1 (X. 2 or 3. Taking that result as a starting point. Y ) = min{ ≥ 0 : FX (x + ) + ≥ FY (x) and FY (x + ) + ≥ FX (x) for all x} d3 (X. Y ) − /(1 + ) ≤ P {| X − Y |≥ } ≤ d1 (X.6. X) = 0 and di (X. 
then Chebychev’s inequality easily establishes that (X1 + · · · + Xn )/n converges in probability to zero. 2.s. (a) Show that di is a metric for i = 1. (d) Find the mean and variance of the limit random variable. X2 . The metric d2 is called the Levy metric. . . Verify in addition the triangle inequality. Let Bk = A1 · · · Ak . show that the convergence still holds even if Var(Xi ) is infinite. PROBLEMS 71 Let A1 . A2 . Assume that E[Xi ] exists and is equal to zero for all i. sense? Justify your anwswer. Y ) it is assumed that E[X 2 ] and E[Y 2 ] are finite. . sense? Justify your anwswer. The “only if” part of (ii) is a little tricky. . . .s. E[| Uk |] and E[Vk ] don’t depend on k and converge to zero as c tends to infinity. be a sequence of random variables which are independent and identically distributed. which implies that limn→∞ Sn 3 exists in the m. 2. . 2 (a) Does limk→∞ Bk exist in the m. If Var(Xi ) is finite. (Hint for (i): It helps to establish that d1 (X. . Y ) = (E[(X − Y )2 ])1/2 .s. Y ) converges to zero. Y ) = E[| X − Y | /(1+ | X − Y |)] d2 (X. (e) Does limn→∞ Sn exist in the a. Y ) converges to zero. Y ) = di (Y. where in defining d3 (X. (b) Does limk→∞ Bk exist in the a.n→∞ E[Sm Sn ] = 35 . . (iii) in the mean square sense if and only if d3 (X. be a sequence of independent random variables.s. (Hint: Use “truncation” by defining Uk = Xk I{| Xk |≥ c} and Vk = Xk I{| Xk |< c} for some constant c. . . Y ) = 0 only if X = Y .) (b) Let X1 . Show that limm. be a sequence of random variables and let Y be a random variable.33 * Distance measures (metrics) for random variables For random variables X and Y . (The only other requirement of a metric is that di (X. Y ) converges to zero (assume E[Y 2 ] < ∞). X). (c) Let Sn = B1 + . Clearly di (X. sense.2. sense? Justify your anwswer. + Bn . 1 without the assumption that the random variables are bounded. Let Sn = X1 + · · · + Xn and let l denote the large deviations exponent for Xi . 
2.35 * Completing the proof of Cramér's theorem
Prove Theorem 2.5.1 without the assumption that the random variables are bounded. To begin, select a large constant $C$ and let $\widetilde{X}_i$ denote a random variable with the conditional distribution of $X_i$ given that $|X_i| \le C$. Let $\widetilde{S}_n = \widetilde{X}_1 + \cdots + \widetilde{X}_n$ and let $\widetilde{l}$ denote the large deviations exponent for $\widetilde{X}_i$. Then
\[
P\left\{ \frac{S_n}{n} \ge a \right\} \ge P\{|X_1| \le C\}^n \, P\left\{ \frac{\widetilde{S}_n}{n} \ge a \right\}.
\]
One step is to show that $\widetilde{l}(a)$ converges to $l(a)$ as $C \to \infty$. It is equivalent to showing that if a pointwise monotonically increasing sequence of convex functions converges pointwise to a nonnegative convex function, then the minima of the convex functions converge to a nonnegative value.

Chapter 3

Random Vectors and Minimum Mean Squared Error Estimation

The reader is encouraged to review the section on matrices in the appendix before reading this chapter.

3.1 Basic definitions and properties

A random vector $X$ of dimension $m$ has the form
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix},
\]
where the $X_i$'s are random variables all on the same probability space. The expectation of $X$ (also called the mean of $X$) is the vector $EX$ (or $E[X]$) defined by
\[
EX = \begin{pmatrix} EX_1 \\ EX_2 \\ \vdots \\ EX_m \end{pmatrix}.
\]
Suppose $Y$ is another random vector on the same probability space as $X$, with dimension $n$. The cross correlation matrix of $X$ and $Y$ is the $m \times n$ matrix $E[XY^T]$, which has $ij$th entry $E[X_i Y_j]$. The cross covariance matrix of $X$ and $Y$, denoted by $\mathrm{Cov}(X, Y)$, is the matrix with $ij$th entry $\mathrm{Cov}(X_i, Y_j)$. Note that the correlation matrix is the matrix of correlations, and the covariance matrix is the matrix of covariances.

In the particular case that $n = m$ and $Y = X$, the cross correlation matrix of $X$ with itself is simply called the correlation matrix of $X$. It is written as $E[XX^T]$, and it has $ij$th entry $E[X_i X_j]$.
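To make the definitions above concrete, here is a minimal Python sketch that computes a mean vector, a cross correlation matrix, and a cross covariance matrix exactly. It is an illustration only: the toy model (two independent fair bits $X_1, X_2$ and the single observation $Y_1 = X_1 + X_2$) and all function names are assumptions made for this example, not part of the notes.

```python
from fractions import Fraction

QUARTER = Fraction(1, 4)
# Hypothetical toy model: (probability, value of X, value of Y).
# X = (X1, X2) are independent fair bits and Y = (Y1,) with Y1 = X1 + X2.
PMF = [
    (QUARTER, (0, 0), (0,)),
    (QUARTER, (0, 1), (1,)),
    (QUARTER, (1, 0), (1,)),
    (QUARTER, (1, 1), (2,)),
]

def expectation(f):
    """E[f(X, Y)] for a scalar-valued function f of the two vectors."""
    return sum(p * f(x, y) for p, x, y in PMF)

def mean_vector(pick, dim):
    """Mean of the random vector selected by `pick` (entry by entry)."""
    return [expectation(lambda x, y, i=i: pick(x, y)[i]) for i in range(dim)]

def cross_correlation(pick_a, dim_a, pick_b, dim_b):
    """Matrix whose ij-th entry is E[A_i B_j]."""
    return [[expectation(lambda x, y, i=i, j=j: pick_a(x, y)[i] * pick_b(x, y)[j])
             for j in range(dim_b)] for i in range(dim_a)]

def cross_covariance(pick_a, dim_a, pick_b, dim_b):
    """Matrix whose ij-th entry is Cov(A_i, B_j) = E[(A_i - EA_i)(B_j - EB_j)]."""
    ma = mean_vector(pick_a, dim_a)
    mb = mean_vector(pick_b, dim_b)
    return [[expectation(lambda x, y, i=i, j=j:
                         (pick_a(x, y)[i] - ma[i]) * (pick_b(x, y)[j] - mb[j]))
             for j in range(dim_b)] for i in range(dim_a)]

pick_x = lambda x, y: x
pick_y = lambda x, y: y

EX = mean_vector(pick_x, 2)                    # mean of X
R_XY = cross_correlation(pick_x, 2, pick_y, 1)  # 2 x 1 matrix E[X Y^T]
K_XY = cross_covariance(pick_x, 2, pick_y, 1)   # 2 x 1 matrix Cov(X, Y)
R_XX = cross_correlation(pick_x, 2, pick_x, 2)  # correlation matrix E[X X^T]
```

Exact rational arithmetic is used so that the entries can be compared with hand calculations; for instance, $E[X_1 Y_1] = E[X_1^2] + E[X_1 X_2] = 1/2 + 1/4$ in this toy model.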
The cross covariance matrix of $X$ with itself, $\mathrm{Cov}(X, X)$, has $ij$th entry $\mathrm{Cov}(X_i, X_j)$. This matrix is called the covariance matrix of $X$, and it is also denoted by $\mathrm{Cov}(X)$. So the notations $\mathrm{Cov}(X)$ and $\mathrm{Cov}(X, X)$ are interchangeable. While the notation $\mathrm{Cov}(X)$ is more concise, the notation $\mathrm{Cov}(X, X)$ is more suggestive of the way the covariance matrix scales when $X$ is multiplied by a constant.

Elementary properties of expectation, correlation, and covariance for vectors follow immediately from similar properties for ordinary scalar random variables. These properties include the following (here $A$ and $C$ are nonrandom matrices and $b$ and $d$ are nonrandom vectors).

1. $E[AX + b] = AE[X] + b$
2. $\mathrm{Cov}(X, Y) = E[X(Y - EY)^T] = E[(X - EX)Y^T] = E[XY^T] - (EX)(EY)^T$
3. $E[(AX)(CY)^T] = AE[XY^T]C^T$
4. $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^T$
5. $\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T$
6. $\mathrm{Cov}(W + X, Y + Z) = \mathrm{Cov}(W, Y) + \mathrm{Cov}(W, Z) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)$

The second property above shows the close connection between correlation matrices and covariance matrices. In particular, if the mean vector of either $X$ or $Y$ is zero, then the cross correlation and cross covariance matrices are equal.

Not every square matrix is a correlation matrix. For example, the diagonal elements must be nonnegative. Also, Schwarz's inequality (see Section 1.8) must be respected, so that $|\mathrm{Cov}(X_i, X_j)| \le \sqrt{\mathrm{Cov}(X_i, X_i)\,\mathrm{Cov}(X_j, X_j)}$. Additional inequalities arise for consideration of three or more random variables at a time. Of course a square diagonal matrix is a correlation matrix if and only if its diagonal entries are nonnegative, because only vectors with independent entries need be considered. But if an $m \times m$ matrix is not diagonal, it is not a priori clear whether there are $m$ random variables with all $m(m + 1)/2$ correlations matching the entries of the matrix. The following proposition neatly resolves these issues.

Proposition 3.1.1 Correlation matrices and covariance matrices are positive semidefinite. Conversely, if $K$ is a positive semidefinite matrix, then $K$ is the covariance matrix and correlation matrix for some mean zero random vector $X$.

Proof. If $K$ is a correlation matrix, then $K = E[XX^T]$ for some random vector $X$. Given any vector $\alpha$, $\alpha^T X$ is a scalar random variable, so
\[
\alpha^T K \alpha = E[\alpha^T X X^T \alpha] = E[(\alpha^T X)(X^T \alpha)] = E[(\alpha^T X)^2] \ge 0.
\]
Similarly, if $K = \mathrm{Cov}(X, X)$ then for any vector $\alpha$,
\[
\alpha^T K \alpha = \alpha^T \mathrm{Cov}(X, X)\, \alpha = \mathrm{Cov}(\alpha^T X, \alpha^T X) = \mathrm{Var}(\alpha^T X) \ge 0.
\]
The first part of the proposition is proved. For the converse part, suppose that $K$ is an arbitrary symmetric positive semidefinite matrix. Let $\lambda_1, \ldots, \lambda_m$ and $U$ be the corresponding set of eigenvalues and orthonormal matrix formed by the eigenvectors. (See Section 11.7 in the appendix.) Let $Y_1, \ldots, Y_m$ be independent, mean 0 random variables with $\mathrm{Var}(Y_i) = \lambda_i$, and let $Y$ be the random vector $Y = (Y_1, \ldots, Y_m)^T$. Then $\mathrm{Cov}(Y, Y) = \Lambda$, where $\Lambda$ is the diagonal matrix with the $\lambda_i$'s on the diagonal. Let $X = UY$. Then $EX = 0$ and
\[
\mathrm{Cov}(X, X) = \mathrm{Cov}(UY, UY) = U \Lambda U^T = K.
\]
Therefore, $K$ is both the covariance matrix and the correlation matrix of $X$.

The characteristic function $\Phi_X$ of $X$ is the function on $\mathbb{R}^m$ defined by
\[
\Phi_X(u) = E[\exp(j u^T X)].
\]

3.2 The orthogonality principle for minimum mean square error estimation

Let $X$ be a random variable with some known distribution. Suppose $X$ is not observed but that we wish to estimate $X$. If we use a constant $b$ to estimate $X$, the estimation error will be $X - b$. The mean square error (MSE) is $E[(X - b)^2]$. Since $E[X - EX] = 0$ and $EX - b$ is constant,
\begin{align*}
E[(X - b)^2] &= E[((X - EX) + (EX - b))^2] \\
&= E[(X - EX)^2 + 2(X - EX)(EX - b) + (EX - b)^2] \\
&= \mathrm{Var}(X) + (EX - b)^2.
\end{align*}
From this expression it is easy to see that the mean square error is minimized with respect to $b$ if and only if $b = EX$. The minimum possible value is $\mathrm{Var}(X)$.

Random variables $X$ and $Y$ are called orthogonal if $E[XY] = 0$. Orthogonality is denoted by "$X \perp Y$." The essential fact $E[X - EX] = 0$ is equivalent to the following condition: $X - EX$ is orthogonal to constants, that is, $(X - EX) \perp c$ for any constant $c$. Therefore, the choice of constant $b$ yielding the minimum mean square error is the one that makes the error $X - b$ orthogonal to all constants. This result is generalized by the orthogonality principle, stated next.

Fix some probability space and let $L^2(\Omega, \mathcal{F}, P)$ be the set of all random variables on the probability space with finite second moments. Let $X$ be a random variable in $L^2(\Omega, \mathcal{F}, P)$, and let $V$ be a collection of random variables on the same probability space as $X$ such that

V.1 $V \subset L^2(\Omega, \mathcal{F}, P)$
V.2 $V$ is a linear class: If $Z_1 \in V$ and $Z_2 \in V$ and $a_1, a_2$ are constants, then $a_1 Z_1 + a_2 Z_2 \in V$
V.3 $V$ is closed in the mean square sense: If $Z_1, Z_2, \ldots$ is a sequence of elements of $V$ and if $Z_n \to Z_\infty$ m.s. for some random variable $Z_\infty$, then $Z_\infty \in V$.

That is, $V$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$. The problem of interest is to find $Z^*$ in $V$ to minimize the mean square error, $E[(X - Z)^2]$, over all $Z \in V$. That is, $Z^*$ is the random variable in $V$ that is closest to $X$ in the minimum mean square error (MMSE) sense. We call it the projection of $X$ onto $V$ and denote it as $\Pi_V(X)$.

Estimating a random variable by a constant corresponds to the case that $V$ is the set of constant random variables: the projection of a random variable $X$ onto the set of constant random variables is $EX$. The orthogonality principle stated next is illustrated in Figure 3.1.

Figure 3.1: Illustration of the orthogonality principle (labels in the figure: $X$, $Z$, $Z^*$, $0$, $e$, $V$).

Theorem 3.2.1 (The orthogonality principle) Let $V$ be a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$, and let $X \in L^2(\Omega, \mathcal{F}, P)$, for some probability space $(\Omega, \mathcal{F}, P)$.

(a) (Existence and uniqueness) There exists a unique element $Z^*$ (also denoted by $\Pi_V(X)$) in $V$ so that $E[(X - Z^*)^2] \le E[(X - Z)^2]$ for all $Z \in V$. (Here, we consider two elements $Z$ and $Z'$ of $V$ to be the same if $P\{Z = Z'\} = 1$.)

(b) (Characterization) Let $W$ be a random variable. Then $W = Z^*$ if and only if the following two conditions hold:
(i) $W \in V$
(ii) $(X - W) \perp Z$ for all $Z$ in $V$.

(c) (Error expression) The minimum mean square error (MMSE) is given by $E[(X - Z^*)^2] = E[X^2] - E[(Z^*)^2]$.

Proof. The proof of part (a) is given in an extra credit homework problem. The technical condition V.3 on $V$ is essential for the proof of existence. Here parts (b) and (c) are proved.

To establish the "if" half of part (b), suppose $W$ satisfies (i) and (ii) and let $Z$ be an arbitrary element of $V$. Then $W - Z \in V$ because $V$ is a linear class. Therefore, $(X - W) \perp (W - Z)$, which implies that
\begin{align*}
E[(X - Z)^2] &= E[(X - W + W - Z)^2] \\
&= E[(X - W)^2 + 2(X - W)(W - Z) + (W - Z)^2] \\
&= E[(X - W)^2] + E[(W - Z)^2].
\end{align*}
Thus $E[(X - W)^2] \le E[(X - Z)^2]$. Since $Z$ is an arbitrary element of $V$, it follows that $W = Z^*$, and the "if" half of (b) is proved.

To establish the "only if" half of part (b), note that $Z^* \in V$ by the definition of $Z^*$. Let $Z \in V$ and let $c \in \mathbb{R}$. Then $Z^* + cZ \in V$, so that $E[(X - (Z^* + cZ))^2] \ge E[(X - Z^*)^2]$. But
\[
E[(X - (Z^* + cZ))^2] = E[((X - Z^*) - cZ)^2] = E[(X - Z^*)^2] - 2cE[(X - Z^*)Z] + c^2 E[Z^2],
\]
so that
\[
-2cE[(X - Z^*)Z] + c^2 E[Z^2] \ge 0. \tag{3.1}
\]
As a function of $c$ the left side of (3.1) is a parabola with value zero at $c = 0$. Hence its derivative with respect to $c$ at 0 must be zero, which yields that $(X - Z^*) \perp Z$. The "only if" half of (b) is proved.

The expression of part (c) is proved as follows. Since $X - Z^*$ is orthogonal to all elements of $V$, including $Z^*$ itself,
\[
E[X^2] = E[((X - Z^*) + Z^*)^2] = E[(X - Z^*)^2] + E[(Z^*)^2].
\]
This proves part (c).

The following propositions give some properties of the projection mapping $\Pi_V$, with proofs based on the orthogonality principle.

Proposition 3.2.2 (Linearity of projection) Suppose $V$ is a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$, $X_1$ and $X_2$ are in $L^2(\Omega, \mathcal{F}, P)$, and $a_1$ and $a_2$ are constants. Then
\[
\Pi_V(a_1 X_1 + a_2 X_2) = a_1 \Pi_V(X_1) + a_2 \Pi_V(X_2). \tag{3.2}
\]

Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_V(a_1 X_1 + a_2 X_2)$ is characterized by two properties. So, to prove (3.2), it suffices to show that $a_1 \Pi_V(X_1) + a_2 \Pi_V(X_2)$ satisfies these two properties. First, we must check that $a_1 \Pi_V(X_1) + a_2 \Pi_V(X_2) \in V$. This follows immediately from the fact that $\Pi_V(X_i) \in V$, for $i = 1, 2$, and $V$ is a linear subspace, so the first property is checked. Second, we must check that $e \perp Z$, where $e = a_1 X_1 + a_2 X_2 - (a_1 \Pi_V(X_1) + a_2 \Pi_V(X_2))$, and $Z$ is an arbitrary element of $V$. Now $e = a_1 e_1 + a_2 e_2$, where $e_i = X_i - \Pi_V(X_i)$ for $i = 1, 2$, and $e_i \perp Z$ for $i = 1, 2$. So $E[eZ] = a_1 E[e_1 Z] + a_2 E[e_2 Z] = 0$, or equivalently, $e \perp Z$. Thus, the second property is also checked, and the proof is complete.

Proposition 3.2.3 (Projections onto nested subspaces) Suppose $V_1$ and $V_2$ are closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $V_2 \subset V_1$. Then for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_{V_2}(X) = \Pi_{V_2} \Pi_{V_1}(X)$. (In words, the projection of $X$ onto $V_2$ can be found by first projecting $X$ onto $V_1$, and then projecting the result onto $V_2$.) Furthermore,
\[
E[(X - \Pi_{V_2}(X))^2] = E[(X - \Pi_{V_1}(X))^2] + E[(\Pi_{V_1}(X) - \Pi_{V_2}(X))^2]. \tag{3.3}
\]
In particular, $E[(X - \Pi_{V_2}(X))^2] \ge E[(X - \Pi_{V_1}(X))^2]$.

Proof. By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_{V_2}(X)$ is characterized by two properties. So, to prove $\Pi_{V_2}(X) = \Pi_{V_2} \Pi_{V_1}(X)$, it suffices to show that $\Pi_{V_2} \Pi_{V_1}(X)$ satisfies the two properties. First, we must check that $\Pi_{V_2} \Pi_{V_1}(X) \in V_2$. This follows immediately from the fact that $\Pi_{V_2}$ maps into $V_2$, so the first property is checked. Second, we must check that $e \perp Z$, where $e = X - \Pi_{V_2} \Pi_{V_1}(X)$, and $Z$ is an arbitrary element of $V_2$. Now $e = e_1 + e_2$, where $e_1 = X - \Pi_{V_1}(X)$ and $e_2 = \Pi_{V_1}(X) - \Pi_{V_2} \Pi_{V_1}(X)$. By the characterization of $\Pi_{V_1}(X)$, $e_1$ is perpendicular to any random variable in $V_1$. In particular, $e_1 \perp Z$, because $Z \in V_2 \subset V_1$. The characterization of the projection of $\Pi_{V_1}(X)$ onto $V_2$ implies that $e_2 \perp Z$. Since $e_i \perp Z$ for $i = 1, 2$, it follows that $e \perp Z$. Thus, the second property is also checked, so it is proved that $\Pi_{V_2}(X) = \Pi_{V_2} \Pi_{V_1}(X)$.

As mentioned above, $e_1$ is perpendicular to any random variable in $V_1$, which implies that $e_1 \perp e_2$. Thus, $E[e^2] = E[e_1^2] + E[e_2^2]$, which is equivalent to (3.3). Therefore, (3.3) is proved. The last inequality of the proposition follows, of course, from (3.3). The inequality is also equivalent to the inequality $\min_{W \in V_2} E[(X - W)^2] \ge \min_{W \in V_1} E[(X - W)^2]$, and this inequality is true because the minimum of a set of numbers cannot increase if more numbers are added to the set.

The following proposition is closely related to the use of linear innovations sequences, discussed in Sections 3.5 and 3.6.

Proposition 3.2.4 (Projection onto the span of orthogonal subspaces) Suppose $V_1$ and $V_2$ are closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$ such that $V_1 \perp V_2$, which means that $E[Z_1 Z_2] = 0$ for any $Z_1 \in V_1$ and $Z_2 \in V_2$. Let $V = V_1 \oplus V_2 = \{Z_1 + Z_2 : Z_i \in V_i\}$ denote the span of $V_1$ and $V_2$. Then for any $X \in L^2(\Omega, \mathcal{F}, P)$, $\Pi_V(X) = \Pi_{V_1}(X) + \Pi_{V_2}(X)$. The minimum mean square error satisfies
\[
E[(X - \Pi_V(X))^2] = E[X^2] - E[(\Pi_{V_1}(X))^2] - E[(\Pi_{V_2}(X))^2].
\]

Proof. The space $V$ is also a closed linear subspace of $L^2(\Omega, \mathcal{F}, P)$ (see a starred homework problem). By the characterization part of the orthogonality principle (part (b) of Theorem 3.2.1), the projection $\Pi_V(X)$ is characterized by two properties. So to prove $\Pi_V(X) = \Pi_{V_1}(X) + \Pi_{V_2}(X)$, it suffices to show that $\Pi_{V_1}(X) + \Pi_{V_2}(X)$ satisfies these two properties. First, we must check that $\Pi_{V_1}(X) + \Pi_{V_2}(X) \in V$. This follows immediately from the fact that $\Pi_{V_i}(X) \in V_i$, for $i = 1, 2$, so the first property is checked. Second, we must check that $e \perp Z$, where $e = X - (\Pi_{V_1}(X) + \Pi_{V_2}(X))$, and $Z$ is an arbitrary element of $V$. Now any such $Z$ can be written as $Z = Z_1 + Z_2$ where $Z_i \in V_i$ for $i = 1, 2$. Observe that $\Pi_{V_2}(X) \perp Z_1$ because $\Pi_{V_2}(X) \in V_2$ and $Z_1 \in V_1$. Therefore,
\[
E[eZ_1] = E[(X - (\Pi_{V_1}(X) + \Pi_{V_2}(X)))Z_1] = E[(X - \Pi_{V_1}(X))Z_1] = 0,
\]
where the last equality follows from the characterization of $\Pi_{V_1}(X)$. Thus, $e \perp Z_1$, and similarly $e \perp Z_2$, so $e \perp Z$. Thus, the second property is also checked, so $\Pi_V(X) = \Pi_{V_1}(X) + \Pi_{V_2}(X)$ is proved.

Since $\Pi_{V_i}(X) \in V_i$ for $i = 1, 2$, $\Pi_{V_1}(X) \perp \Pi_{V_2}(X)$. Therefore, $E[(\Pi_V(X))^2] = E[(\Pi_{V_1}(X))^2] + E[(\Pi_{V_2}(X))^2]$, and the expression for the MMSE in the proposition follows from the error expression in the orthogonality principle.

3.3 Conditional expectation and linear estimators

In many applications, a random variable $X$ is to be estimated based on observation of a random variable $Y$. Thus, an estimator is a function of $Y$. In applications, the two most frequently considered classes of functions of $Y$ used in this context are essentially all functions, leading to the best unconstrained estimator, or all linear functions, leading to the best linear estimator. These two possibilities are discussed in this section.

3.3.1 Conditional expectation as a projection

Suppose a random variable $X$ is to be estimated using an observed random vector $Y$ of dimension $m$. Suppose $E[X^2] < +\infty$. Consider the most general class of estimators based on $Y$, by setting
\[
V = \{g(Y) : g : \mathbb{R}^m \to \mathbb{R},\ E[g(Y)^2] < +\infty\}. \tag{3.4}
\]
There is also the implicit condition that $g$ is Borel measurable so that $g(Y)$ is a random variable. The projection of $X$ onto this class $V$ is the unconstrained minimum mean square error (MMSE) estimator of $X$ given $Y$.

Let us first proceed to identify the optimal estimator by conditioning on the value of $Y$, thereby reducing this example to the estimation of a random variable by a constant, as discussed at the beginning of Section 3.2. For technical reasons we assume for now that $X$ and $Y$ have a joint pdf. Then, conditioning on $Y$,
\[
E[(X - g(Y))^2] = \int_{\mathbb{R}^m} E[(X - g(Y))^2 \mid Y = y]\, f_Y(y)\, dy
\]
where
\[
E[(X - g(Y))^2 \mid Y = y] = \int_{-\infty}^{\infty} (x - g(y))^2 f_{X|Y}(x \mid y)\, dx.
\]
Since the mean is the MMSE estimator of a random variable among all constants, for each fixed $y$, the minimizing choice for $g(y)$ is
\[
g^*(y) = E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\, dx. \tag{3.5}
\]
Therefore, the optimal estimator in $V$ is $g^*(Y)$ which, by definition, is equal to the random variable $E[X \mid Y]$.

What does the orthogonality principle imply for this example? It implies that there exists an optimal estimator $g^*(Y)$ which is the unique element of $V$ such that $(X - g^*(Y)) \perp g(Y)$
because the MSE for estimation of a random vector is the sum of the MSEs of the coordinates: m (x − E[X | Y = y])g(y)fX|Y (x | y)fY (y)dxdy (x − E[X | Y = y])fX|Y (x | y)dx g(y)fY (y)dy E[ X − g(Y ) 2 ] = i=1 E[(Xi − gi (Y ))2 ] Therefore. Since Eei = 0. If X. CONDITIONAL EXPECTATION AND LINEAR ESTIMATORS 81 for all g(Y ) ∈ V. note that E[Xj | Y ] is in V for each j.4). we define the conditional expectation E[X | Y ] to be the MMSE estimator of X given Y.) The mean of the error is given by Ee = 0. defined in (3. Assume E[ X 2 ] < +∞ and seek an estimator to minimize the MSE. (Even though e is a random vector we use lower case for it for an obvious reason. A beauty of the MSE criteria is that it easily extends to estimation of random vectors. so ei ⊥ E[Xj | Y ] for each i.3. Estimation of a random variable has been discussed. if X and Y have a joint pdf (and similarly if they have a joint pmf) then the MMSE estimator of X given Y is E[X | Y ]. As for the covariance of the error. .3. By the orthogonality principle E[X | Y ] exists as long as E[X 2 ] < ∞. To identify the optimal linear estimator we shall apply the orthogonality principle for each coordinate of X with V = {c0 + c1 Y1 + c2 Y2 + . cn ∈ R} Let e denote the estimation error e = X − (AY + b). we find that the covariance of the error vector satisfies Cov(e) = Cov(X) − Cov(E[X | Y ]) which by (3. e) + Cov(e. X). using (3. we must select A and b so that ei ⊥ 1 e i ⊥ Yj all i all i. denoted by E[X | Y ].5) in case a joint pdf exists) may be too complex or may require more information about the joint distribution of X and Y than is available. Equivalently. Y )Cov(Y. E[X | Y ]) = 0. Y )Cov(Y. Cov(e) = Cov(X) − Cov(E[X | Y ]). j. j. Y )−1 (Y − EY ) (3. Y )−1 Cov(Y. Y ) = 0. . Y ) − ACov(Y. E[X|Y ]) = Cov(E[X | Y ]) + Cov(e) Thus. For both of these reasons. A widely used set is the set of all linear functions. 
3.3.2 Linear estimators

Let $X$ and $Y$ be random vectors with $E[\|X\|^2] < +\infty$ and $E[\|Y\|^2] < +\infty$. Seek estimators of the form $AY + b$ to minimize the MSE. Such estimators are called linear estimators because each coordinate of $AY + b$ is a linear combination of $Y_1, Y_2, \ldots, Y_n$ and 1. Here "1" stands for the random variable that is always equal to 1.

To identify the optimal linear estimator we shall apply the orthogonality principle for each coordinate of $X$ with
$$\mathcal{V} = \{ c_0 + c_1 Y_1 + c_2 Y_2 + \cdots + c_n Y_n : c_0, c_1, \ldots, c_n \in \mathbb{R} \}.$$
Let $e$ denote the estimation error $e = X - (AY + b)$. We must select $A$ and $b$ so that $e_i \perp Z$ for all $Z \in \mathcal{V}$. Equivalently, we must select $A$ and $b$ so that
$$e_i \perp 1 \ \text{ for all } i, \qquad e_i \perp Y_j \ \text{ for all } i, j.$$
The condition $e_i \perp 1$, which means $Ee_i = 0$, implies that $E[e_i Y_j] = \mathrm{Cov}(e_i, Y_j)$. Thus, the required orthogonality conditions on $A$ and $b$ become $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$. The condition $Ee = 0$ requires that $b = EX - A\,EY$, so we can restrict our attention to estimators of the form $EX + A(Y - EY)$, so that $e = X - EX - A(Y - EY)$. The condition $\mathrm{Cov}(e, Y) = 0$ becomes $\mathrm{Cov}(X, Y) - A\,\mathrm{Cov}(Y, Y) = 0$. If $\mathrm{Cov}(Y, Y)$ is not singular, then $A$ must be given by $A = \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}$. In this case the optimal linear estimator, denoted by $\widehat{E}[X \mid Y]$, is given by
$$\widehat{E}[X \mid Y] = E[X] + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}(Y - EY). \tag{3.6}$$
Proceeding as in the case of unconstrained estimators of a random vector, we find that the covariance of the error vector satisfies
$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(\widehat{E}[X \mid Y]),$$
which by (3.6) yields
$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y, Y)^{-1}\mathrm{Cov}(Y, X). \tag{3.7}$$
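3.3.3 Discussion of the estimators

As seen above, the expectation $E[X]$, the MMSE linear estimator $\widehat{E}[X \mid Y]$, and the conditional expectation $E[X \mid Y]$ are all instances of projection mappings $\Pi_{\mathcal{V}}$, for $\mathcal{V}$ consisting of constants, linear estimators based on $Y$, or unconstrained estimators based on $Y$, respectively. Hence, the orthogonality principle and Propositions 3.2.2-3.2.4 all apply to these estimators.

Proposition 3.2.2 implies that these estimators are linear functions of $X$. In particular, $E[a_1 X_1 + a_2 X_2 \mid Y] = a_1 E[X_1 \mid Y] + a_2 E[X_2 \mid Y]$, and the same is true with "$E$" replaced by "$\widehat{E}$."

Proposition 3.2.3, regarding projections onto nested subspaces, implies an ordering of the mean square errors:
$$E[(X - E[X \mid Y])^2] \le E[(X - \widehat{E}[X \mid Y])^2] \le \mathrm{Var}(X).$$
Furthermore, it implies that the best linear estimator of $X$ based on $Y$ is equal to the best linear estimator of the estimator $E[X \mid Y]$: that is, $\widehat{E}[X \mid Y] = \widehat{E}[E[X \mid Y] \mid Y]$. It follows, in particular, that $E[X \mid Y] = \widehat{E}[X \mid Y]$ if and only if $E[X \mid Y]$ has the linear form, $AY + b$. Similarly, $E[X]$, the best constant estimator of $X$, is also the best constant estimator of $\widehat{E}[X \mid Y]$ or of $E[X \mid Y]$. That is, $E[X] = E[\widehat{E}[X \mid Y]] = E[E[X \mid Y]]$. In fact, $E[X] = E[\widehat{E}[E[X \mid Y] \mid Y]]$.

Proposition 3.2.3 also implies relations among estimators based on different sets of observations. For example, suppose $X$ is to be estimated and $Y_1$ and $Y_2$ are both possible observations. The space of unrestricted estimators based on $Y_1$ alone is a subspace of the space of unrestricted estimators based on both $Y_1$ and $Y_2$. Therefore, Proposition 3.2.3 implies that $E[E[X \mid Y_1, Y_2] \mid Y_1] = E[X \mid Y_1]$, a property that is sometimes called the tower property of conditional expectation. The same relation holds true for the same reason for the best linear estimators: $\widehat{E}[\widehat{E}[X \mid Y_1, Y_2] \mid Y_1] = \widehat{E}[X \mid Y_1]$.

Example 3.3.1 Let $X, Y$ be jointly continuous random variables with the pdf
$$f_{XY}(x, y) = \begin{cases} x + y & 0 \le x, y \le 1 \\ 0 & \text{else.} \end{cases}$$
Let us find $E[X \mid Y]$ and $\widehat{E}[X \mid Y]$. To find $E[X \mid Y]$ we first identify $f_Y(y)$ and $f_{X|Y}(x \mid y)$:
$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \begin{cases} \frac{1}{2} + y & 0 \le y \le 1 \\ 0 & \text{else.} \end{cases}$$
Therefore, $f_{X|Y}(x \mid y)$ is defined only for $0 \le y \le 1$, and for such $y$ it is given by
$$f_{X|Y}(x \mid y) = \begin{cases} \dfrac{x + y}{\frac{1}{2} + y} & 0 \le x \le 1 \\ 0 & \text{else.} \end{cases}$$
So for $0 \le y \le 1$,
$$E[X \mid Y = y] = \int_0^1 x f_{X|Y}(x \mid y)\, dx = \frac{2 + 3y}{3 + 6y}.$$
Therefore, $E[X \mid Y] = \frac{2 + 3Y}{3 + 6Y}$. To find $\widehat{E}[X \mid Y]$ we compute $EX = EY = \frac{7}{12}$, $\mathrm{Var}(Y) = \frac{11}{144}$ and $\mathrm{Cov}(X, Y) = -\frac{1}{144}$, so $\widehat{E}[X \mid Y] = \frac{7}{12} - \frac{1}{11}\left(Y - \frac{7}{12}\right)$.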
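The numbers in Example 3.3.1 can be checked with a short script. The sketch below is only an illustration: the function names (`cond_mean`, `closed_form`, `linear_estimate`) and the midpoint-rule integration are choices made here, not part of the notes. It compares a numerical value of $E[X \mid Y = y]$ with the closed form $(2+3y)/(3+6y)$ and recomputes the linear estimator from exact moments.

```python
from fractions import Fraction


def cond_mean(y, steps=2000):
    """Midpoint-rule approximation of E[X | Y = y] for f_XY(x, y) = x + y on [0, 1]^2."""
    h = 1.0 / steps
    num = 0.0  # approximates the integral of x * f_XY(x, y) over x
    den = 0.0  # approximates f_Y(y), the integral of f_XY(x, y) over x
    for i in range(steps):
        x = (i + 0.5) * h
        num += x * (x + y) * h
        den += (x + y) * h
    return num / den


def closed_form(y):
    """Closed-form conditional mean from Example 3.3.1."""
    return (2 + 3 * y) / (3 + 6 * y)


# Exact moments of the joint pdf x + y on the unit square.
EX = EY = Fraction(7, 12)
EY2 = Fraction(5, 12)   # E[Y^2] = 1/6 + 1/4
EXY = Fraction(1, 3)    # E[XY] = 1/6 + 1/6
var_y = EY2 - EY ** 2
cov_xy = EXY - EX * EY
slope = cov_xy / var_y


def linear_estimate(y):
    """Best linear estimator E-hat[X | Y = y] = EX + (Cov(X,Y)/Var(Y)) (y - EY)."""
    return EX + slope * (Fraction(y) - EY)
```

Using exact fractions for the moments makes it easy to see that the slope is exactly $-1/11$, so the linear estimator is nearly flat, in agreement with the weak dependence of $(2+3y)/(3+6y)$ on $y$.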
Example 3.3.2 Suppose that $Y = XU$, where $X$ and $U$ are independent random variables, $X$ has the Rayleigh density
$$f_X(x) = \begin{cases} \dfrac{x}{\sigma^2} e^{-x^2/2\sigma^2} & x \ge 0 \\ 0 & \text{else} \end{cases}$$
and $U$ is uniformly distributed on the interval $[0, 1]$. We find $\widehat{E}[X \mid Y]$ and $E[X \mid Y]$. To compute $\widehat{E}[X \mid Y]$ we find
$$EX = \int_0^{\infty} \frac{x^2}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \frac{1}{\sigma}\sqrt{\frac{\pi}{2}} \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}\, dx = \sigma\sqrt{\frac{\pi}{2}}$$
$$EY = EX\,EU = \frac{\sigma}{2}\sqrt{\frac{\pi}{2}}$$
$$E[X^2] = 2\sigma^2$$
$$\mathrm{Var}(Y) = E[Y^2] - E[Y]^2 = E[X^2]E[U^2] - E[X]^2E[U]^2 = \sigma^2\left(\frac{2}{3} - \frac{\pi}{8}\right)$$
$$\mathrm{Cov}(X, Y) = E[U]E[X^2] - E[U]E[X]^2 = \frac{1}{2}\mathrm{Var}(X) = \sigma^2\left(1 - \frac{\pi}{4}\right).$$
Thus
$$\widehat{E}[X \mid Y] = \sigma\sqrt{\frac{\pi}{2}} + \frac{1 - \frac{\pi}{4}}{\frac{2}{3} - \frac{\pi}{8}}\left(Y - \frac{\sigma}{2}\sqrt{\frac{\pi}{2}}\right).$$
To find $E[X \mid Y]$ we first find the joint density and then the conditional density. Now
$$f_{XY}(x, y) = f_X(x) f_{Y|X}(y \mid x) = \begin{cases} \dfrac{1}{\sigma^2} e^{-x^2/2\sigma^2} & 0 \le y \le x \\ 0 & \text{else} \end{cases}$$
$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\, dx = \begin{cases} \displaystyle\int_y^{\infty} \frac{1}{\sigma^2} e^{-x^2/2\sigma^2}\, dx = \frac{\sqrt{2\pi}}{\sigma} Q\!\left(\frac{y}{\sigma}\right) & y \ge 0 \\ 0 & y < 0 \end{cases}$$
where $Q$ is the complementary CDF for the standard normal distribution. So for $y \ge 0$
$$E[X \mid Y = y] = \frac{\int_{-\infty}^{\infty} x f_{XY}(x, y)\, dx}{f_Y(y)} = \frac{\int_y^{\infty} \frac{x}{\sigma^2} e^{-x^2/2\sigma^2}\, dx}{\frac{\sqrt{2\pi}}{\sigma} Q\!\left(\frac{y}{\sigma}\right)} = \frac{\sigma \exp(-y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\!\left(\frac{y}{\sigma}\right)}.$$
Thus,
$$E[X \mid Y] = \frac{\sigma \exp(-Y^2/2\sigma^2)}{\sqrt{2\pi}\, Q\!\left(\frac{Y}{\sigma}\right)}.$$

Example 3.3.3 Suppose that $Y$ is a random variable and $f$ is a Borel measurable function such that $E[f(Y)^2] < \infty$. Let us show that $E[f(Y) \mid Y] = f(Y)$. By definition, $E[f(Y) \mid Y]$ is the random variable of the form $g(Y)$ which is closest to $f(Y)$ in the mean square sense. If we take $g(Y) = f(Y)$, then the mean square error is zero. No other estimator can have a smaller mean square error. Thus, $E[f(Y) \mid Y] = f(Y)$. Similarly, if $Y$ is a random vector with $E[\|Y\|^2] < \infty$, and if $A$ is a matrix and $b$ a vector, then $\widehat{E}[AY + b \mid Y] = AY + b$.
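To get a feel for how the two estimators of Example 3.3.2 differ, they can be evaluated numerically. The following is a minimal sketch under stated assumptions: the helper names `q_function`, `mmse_estimate`, and `lmmse_estimate` are introduced here for illustration, and $Q$ is computed from the standard library's `math.erfc` through the identity $Q(t) = \tfrac{1}{2}\,\mathrm{erfc}(t/\sqrt{2})$.

```python
import math


def q_function(t):
    """Complementary CDF of the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def mmse_estimate(y, sigma):
    """E[X | Y = y] for y >= 0 in Example 3.3.2 (Rayleigh X, uniform U, Y = XU)."""
    numerator = sigma * math.exp(-y * y / (2.0 * sigma ** 2))
    return numerator / (math.sqrt(2.0 * math.pi) * q_function(y / sigma))


def lmmse_estimate(y, sigma):
    """Best linear estimator of X given Y = y, built from the moments in the example."""
    mean_x = sigma * math.sqrt(math.pi / 2.0)
    mean_y = mean_x / 2.0
    var_y = sigma ** 2 * (2.0 / 3.0 - math.pi / 8.0)
    cov_xy = sigma ** 2 * (1.0 - math.pi / 4.0)
    return mean_x + (cov_xy / var_y) * (y - mean_y)
```

Since $Y = XU \le X$, the unconstrained estimate should never fall below the observed value $y$; evaluating `mmse_estimate(y, sigma)` for a few values of `y` is a quick sanity check of the formula.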
3.4 Joint Gaussian distribution and Gaussian random vectors

Recall that a random variable $X$ is Gaussian (or normal) with mean $\mu$ and variance $\sigma^2 > 0$ if $X$ has pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
As a degenerate case, we say $X$ is Gaussian with mean $\mu$ and variance 0 if $P\{X = \mu\} = 1$. Equivalently, $X$ is Gaussian with mean $\mu$ and variance $\sigma^2 \ge 0$ if its characteristic function is given by
$$\Phi_X(u) = \exp\left(-\frac{u^2\sigma^2}{2} + j\mu u\right).$$

Lemma 3.4.1 Suppose $X_1, X_2, \ldots, X_n$ are independent Gaussian random variables. Then any linear combination $a_1 X_1 + \cdots + a_n X_n$ is a Gaussian random variable.

Proof. By an induction argument on $n$, it is sufficient to prove the lemma for $n = 2$. Also, if $X$ is a Gaussian random variable, then so is $aX$ for any constant $a$, so we can assume without loss of generality that $a_1 = a_2 = 1$. It remains to prove that if $X_1$ and $X_2$ are independent Gaussian random variables, then the sum $X = X_1 + X_2$ is also a Gaussian random variable. Let $\mu_i = E[X_i]$ and $\sigma_i^2 = \mathrm{Var}(X_i)$. Then the characteristic function of $X$ is given by
$$\begin{aligned} \Phi_X(u) &= E[e^{juX}] = E[e^{juX_1} e^{juX_2}] = E[e^{juX_1}]\,E[e^{juX_2}] \\ &= \exp\left(-\frac{u^2\sigma_1^2}{2} + j\mu_1 u\right) \exp\left(-\frac{u^2\sigma_2^2}{2} + j\mu_2 u\right) = \exp\left(-\frac{u^2\sigma^2}{2} + j\mu u\right), \end{aligned}$$
where $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Thus, $X$ is a $N(\mu, \sigma^2)$ random variable.

Let $(X_i : i \in I)$ be a collection of random variables indexed by some set $I$, which possibly has infinite cardinality. A finite linear combination of $(X_i : i \in I)$ is a random variable of the form
$$a_1 X_{i_1} + a_2 X_{i_2} + \cdots + a_n X_{i_n}$$
where $n$ is finite, $i_k \in I$ for each $k$, and $a_k \in \mathbb{R}$ for each $k$.
Definition 3.4.2 A collection $(X_i : i \in I)$ of random variables has a joint Gaussian distribution (and the random variables $X_i : i \in I$ themselves are said to be jointly Gaussian) if every finite linear combination of $(X_i : i \in I)$ is a Gaussian random variable. A random vector $X$ is called a Gaussian random vector if its coordinate random variables are jointly Gaussian. A collection of random vectors is said to have a joint Gaussian distribution if all of the coordinate random variables of all of the vectors are jointly Gaussian.

We write that $X$ is a $N(\mu, K)$ random vector if $X$ is a Gaussian random vector with mean vector $\mu$ and covariance matrix $K$.

Proposition 3.4.3
(a) If $(X_i : i \in I)$ has a joint Gaussian distribution, then each of the random variables itself is Gaussian.
(b) If the random variables $X_i : i \in I$ are each Gaussian and if they are independent, which means that $X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ are independent for any finite number of indices $i_1, i_2, \ldots, i_n$, then $(X_i : i \in I)$ has a joint Gaussian distribution.
(c) (Preservation of joint Gaussian property under linear combinations and limits) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Let $(Y_j : j \in J)$ denote a collection of random variables such that each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, and let $(Z_k : k \in K)$ denote a set of random variables such that each $Z_k$ is a limit in probability (or in the m.s. or a.s. senses) of a sequence from $(Y_j : j \in J)$. Then $(Y_j : j \in J)$ and $(Z_k : k \in K)$ each have a joint Gaussian distribution.
(c′) (Alternative version of (c)) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Let $\mathcal{Z}$ denote the smallest set of random variables that contains $(X_i : i \in I)$, is a linear class, and is closed under taking limits in probability. Then $\mathcal{Z}$ has a joint Gaussian distribution.
(d) The characteristic function of a $N(\mu, K)$ random vector is given by
$$\Phi_X(u) = E[e^{ju^T X}] = e^{ju^T\mu - \frac{1}{2}u^T K u}.$$
(e) If $X$ is a $N(\mu, K)$ random vector and $K$ is a diagonal matrix (i.e. $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$, or equivalently, the coordinates of $X$ are uncorrelated) then the coordinates $X_1, \ldots, X_m$ are independent.
(f) A $N(\mu, K)$ random vector $X$ such that $K$ is nonsingular has a pdf given by
$$f_X(x) = \frac{1}{(2\pi)^{\frac{m}{2}} |K|^{\frac{1}{2}}} \exp\left(-\frac{(x - \mu)^T K^{-1} (x - \mu)}{2}\right). \tag{3.8}$$
Any random vector $X$ such that $\mathrm{Cov}(X)$ is singular does not have a pdf.
(g) If $X$ and $Y$ are jointly Gaussian vectors, then they are independent if and only if $\mathrm{Cov}(X, Y) = 0$.
Proof. (a) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution, so that all finite linear combinations of the $X_i$'s are Gaussian random variables. Each $X_i$ for $i \in I$ is itself a finite linear combination of all the variables (with only one term). So each $X_i$ is a Gaussian random variable.

(b) Suppose the variables $X_i : i \in I$ are mutually independent, and each is Gaussian. Then any finite linear combination of $(X_i : i \in I)$ is the sum of finitely many independent Gaussian random variables (by Lemma 3.4.1), and is hence also a Gaussian random variable. So $(X_i : i \in I)$ has a joint Gaussian distribution.

(c) Suppose the hypotheses of (c) are true. Let $V$ be a finite linear combination of $(Y_j : j \in J)$: $V = b_1 Y_{j_1} + b_2 Y_{j_2} + \cdots + b_n Y_{j_n}$. Each $Y_j$ is a finite linear combination of $(X_i : i \in I)$, so $V$ can be written as a finite linear combination of $(X_i : i \in I)$:
$$V = b_1(a_{11} X_{i_{11}} + a_{12} X_{i_{12}} + \cdots + a_{1k_1} X_{i_{1k_1}}) + \cdots + b_n(a_{n1} X_{i_{n1}} + \cdots + a_{nk_n} X_{i_{nk_n}}).$$
Therefore $V$ is a Gaussian random variable. Thus, any finite linear combination of $(Y_j : j \in J)$ is Gaussian, so that $(Y_j : j \in J)$ has a joint Gaussian distribution.

Let $W$ be a finite linear combination of $(Z_k : k \in K)$: $W = a_1 Z_{k_1} + \cdots + a_m Z_{k_m}$. By assumption, for $1 \le l \le m$, there is a sequence $(j_{l,n} : n \ge 1)$ of indices from $J$ such that $Y_{j_{l,n}} \to Z_{k_l}$ in probability as $n \to \infty$. Let $W_n = a_1 Y_{j_{1,n}} + \cdots + a_m Y_{j_{m,n}}$. Each $W_n$ is a Gaussian random variable, because it is a finite linear combination of $(Y_j : j \in J)$. Also,
$$|W - W_n| \le \sum_{l=1}^{m} |a_l|\, |Z_{k_l} - Y_{j_{l,n}}|. \tag{3.9}$$
Since each term on the right-hand side of (3.9) converges to zero in probability, it follows that $W_n \to W$ in probability as $n \to \infty$. Since limits in probability of Gaussian random variables are also Gaussian random variables (Proposition 2.1.16), it follows that $W$ is a Gaussian random variable. Thus, an arbitrary finite linear combination $W$ of $(Z_k : k \in K)$ is Gaussian, so, by definition, $(Z_k : k \in K)$ has a joint Gaussian distribution.

(c′) Suppose $(X_i : i \in I)$ has a joint Gaussian distribution. Using the notation of part (c), let $(Y_j : j \in J)$ denote the set of all finite linear combinations of $(X_i : i \in I)$ and let $(Z_k : k \in K)$ denote the set of all random variables that are limits in probability of random variables in $(Y_j : j \in J)$. We will show that $\mathcal{Z} = (Z_k : k \in K)$, which together with part (c) already proved, will establish (c′). We begin by establishing that $(Z_k : k \in K)$ satisfies the three properties required of $\mathcal{Z}$:
(i) $(Z_k : k \in K)$ contains $(X_i : i \in I)$,
(ii) $(Z_k : k \in K)$ is a linear class,
(iii) $(Z_k : k \in K)$ is closed under taking limits in probability.

Property (i) follows from the fact that for any $i_o \in I$, the random variable $X_{i_o}$ is trivially a finite linear combination of $(X_i : i \in I)$, and it is trivially the limit in probability of the sequence with all entries equal to itself. Property (ii) is true because a linear combination of the form $a_1 Z_{k_1} + a_2 Z_{k_2}$ is the limit in probability of a sequence of random variables of the form $a_1 Y_{j_{n,1}} + a_2 Y_{j_{n,2}}$, and, since $(Y_j : j \in J)$ is a linear class, $a_1 Y_{j_{n,1}} + a_2 Y_{j_{n,2}}$ is a random variable from $(Y_j : j \in J)$ for each $n$. To prove (iii), suppose $Z_{k_n} \to Z_\infty$ in probability as $n \to \infty$ for some sequence $k_1, k_2, \ldots$ from $K$. By passing to a subsequence if necessary, it can be assumed that $P\{|Z_\infty - Z_{k_n}| \ge 2^{-(n+1)}\} \le 2^{-(n+1)}$ for all $n \ge 1$. Since each $Z_{k_n}$ is the limit in probability of a sequence of random variables from $(Y_j : j \in J)$, for each $n$ there is a $j_n \in J$ so that $P\{|Z_{k_n} - Y_{j_n}| \ge 2^{-(n+1)}\} \le 2^{-(n+1)}$. Since $|Z_\infty - Y_{j_n}| \le |Z_\infty - Z_{k_n}| + |Z_{k_n} - Y_{j_n}|$, it follows that $P\{|Z_\infty - Y_{j_n}| \ge 2^{-n}\} \le 2^{-n}$. So $Y_{j_n} \to Z_\infty$ in probability. Therefore, $Z_\infty$ is a random variable in $(Z_k : k \in K)$, so $(Z_k : k \in K)$ is closed under convergence in probability. In summary, $(Z_k : k \in K)$ has properties (i)-(iii). Any set of random variables with these three properties must contain $(Y_j : j \in J)$, and hence must contain $(Z_k : k \in K)$. So $(Z_k : k \in K)$ is indeed the smallest set of random variables with properties (i)-(iii). That is, $(Z_k : k \in K) = \mathcal{Z}$, as claimed.

(d) Let $X$ be a $N(\mu, K)$ random vector. Then for any vector $u$ with the same dimension as $X$, the random variable $u^T X$ is Gaussian with mean $u^T\mu$ and variance given by
$$\mathrm{Var}(u^T X) = \mathrm{Cov}(u^T X, u^T X) = u^T K u.$$
Thus, we already know the characteristic function of $u^T X$. But the characteristic function of the vector $X$ evaluated at $u$ is the characteristic function of $u^T X$ evaluated at 1:
$$\Phi_X(u) = E[e^{ju^T X}] = E[e^{j(u^T X)}] = \Phi_{u^T X}(1) = e^{ju^T\mu - \frac{1}{2}u^T K u},$$
which establishes part (d) of the proposition.

(e) If $X$ is a $N(\mu, K)$ random vector and $K$ is a diagonal matrix, then
$$\Phi_X(u) = \prod_{i=1}^{m} \exp\left(ju_i\mu_i - \frac{k_{ii}u_i^2}{2}\right) = \prod_i \Phi_i(u_i)$$
where $k_{ii}$ denotes the $i$th diagonal element of $K$, and $\Phi_i$ is the characteristic function of a $N(\mu_i, k_{ii})$ random variable. By uniqueness of joint characteristic functions, it follows that $X_1, \ldots, X_m$ are independent random variables.

(f) Let $X$ be a $N(\mu, K)$ random vector. Since $K$ is positive semidefinite it can be written as $K = U\Lambda U^T$ where $U$ is orthonormal (so $UU^T = U^TU = I$) and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ of $K$ along the diagonal. (See Section 11.7 of the appendix.) Let $Y = U^T(X - \mu)$. Then $Y$ is a Gaussian vector with mean 0 and covariance matrix given by $\mathrm{Cov}(Y, Y) = \mathrm{Cov}(U^T X, U^T X) = U^T K U = \Lambda$. In summary, we have $X = UY + \mu$, and $Y$ is a vector of independent Gaussian random variables, the $i$th one being $N(0, \lambda_i)$. Suppose further that $K$ is nonsingular, meaning $\det(K) \ne 0$. Since $\det(K) = \lambda_1\lambda_2\cdots\lambda_m$ this implies that $\lambda_i > 0$ for each $i$, so that $Y$ has the joint pdf
$$f_Y(y) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{y_i^2}{2\lambda_i}\right) = \frac{1}{(2\pi)^{\frac{m}{2}}\sqrt{\det(K)}} \exp\left(-\frac{y^T\Lambda^{-1}y}{2}\right).$$
Since $|\det(U)| = 1$ and $U\Lambda^{-1}U^T = K^{-1}$, the joint pdf for the $N(\mu, K)$ random vector $X$ is given by
$$f_X(x) = f_Y(U^T(x - \mu)) = \frac{1}{(2\pi)^{\frac{m}{2}} |K|^{\frac{1}{2}}} \exp\left(-\frac{(x - \mu)^T K^{-1} (x - \mu)}{2}\right).$$
Now suppose, instead, that $X$ is any random vector with some mean $\mu$ and a singular covariance matrix $K$. That means that $\det K = 0$, or equivalently that $\lambda_i = 0$ for one of the eigenvalues of $K$, or equivalently, that there is a vector $\alpha$ such that $\alpha^T K \alpha = 0$ (such an $\alpha$ is an eigenvector of $K$ for eigenvalue zero). But then $0 = \alpha^T K \alpha = \alpha^T \mathrm{Cov}(X, X)\alpha = \mathrm{Cov}(\alpha^T X, \alpha^T X) = \mathrm{Var}(\alpha^T X)$. Therefore, $P\{\alpha^T X = \alpha^T\mu\} = 1$. That is, with probability one, $X$ is in the subspace $\{x \in \mathbb{R}^m : \alpha^T(x - \mu) = 0\}$. Therefore, $X$ does not have a pdf.

(g) Suppose $X$ and $Y$ are jointly Gaussian vectors and uncorrelated (so $\mathrm{Cov}(X, Y) = 0$). Let $Z$ denote the dimension $m + n$ vector with coordinates $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. Since $\mathrm{Cov}(X, Y) = 0$, the covariance matrix of $Z$ is block diagonal:
$$\mathrm{Cov}(Z) = \begin{pmatrix} \mathrm{Cov}(X) & 0 \\ 0 & \mathrm{Cov}(Y) \end{pmatrix}.$$
Therefore, for $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$,
$$\Phi_Z\!\begin{pmatrix} u \\ v \end{pmatrix} = \exp\left(-\frac{1}{2}\begin{pmatrix} u \\ v \end{pmatrix}^T \mathrm{Cov}(Z) \begin{pmatrix} u \\ v \end{pmatrix} + j\begin{pmatrix} u \\ v \end{pmatrix}^T EZ\right) = \Phi_X(u)\Phi_Y(v).$$
Such factorization implies that $X$ and $Y$ are independent. The if part of part (g) is proved. Conversely, if $X$ and $Y$ are jointly Gaussian and independent of each other, then the characteristic function of the joint density must factor, which implies that $\mathrm{Cov}(Z)$ is block diagonal as above. That is, $\mathrm{Cov}(X, Y) = 0$.

Recall that in general, if $X$ and $Y$ are two random vectors on the same probability space, then the mean square error for the MMSE linear estimator $\widehat{E}[X \mid Y]$ is greater than or equal to the mean square error for the best unconstrained estimator, $E[X \mid Y]$. The tradeoff, however, is that $E[X \mid Y]$ can be much more difficult to compute than $\widehat{E}[X \mid Y]$, which is determined entirely by first and second moments. As shown in the next proposition, if $X$ and $Y$ are jointly Gaussian, the two estimators coincide. That is, the MMSE unconstrained estimator of $X$ is linear. We also know that $E[X \mid Y = y]$ is the mean of the conditional distribution of $X$ given $Y = y$. The proposition identifies not only the conditional mean, but the entire conditional distribution of $X$ given $Y = y$, for the case $X$ and $Y$ are jointly Gaussian.

Proposition 3.4.4 Let $X$ and $Y$ be jointly Gaussian vectors. Given $Y = y$, the conditional distribution of $X$ is $N(\widehat{E}[X \mid Y = y], \mathrm{Cov}(e))$. In particular, the conditional mean $E[X \mid Y = y]$ is equal to $\widehat{E}[X \mid Y = y]$. That is, if $X$ and $Y$ are jointly Gaussian, then $E[X \mid Y] = \widehat{E}[X \mid Y]$.

If $\mathrm{Cov}(Y)$ is nonsingular,
$$E[X \mid Y = y] = \widehat{E}[X \mid Y = y] = EX + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}(y - E[Y]) \tag{3.10}$$
$$\mathrm{Cov}(e) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, X), \tag{3.11}$$
and if $\mathrm{Cov}(e)$ is nonsingular,
$$f_{X|Y}(x \mid y) = \frac{1}{(2\pi)^{\frac{m}{2}} |\mathrm{Cov}(e)|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}\left(x - \widehat{E}[X \mid Y = y]\right)^T \mathrm{Cov}(e)^{-1} \left(x - \widehat{E}[X \mid Y = y]\right)\right). \tag{3.12}$$

Proof. Consider the MMSE linear estimator $\widehat{E}[X \mid Y]$ of $X$ given $Y$, and let $e$ denote the corresponding error vector: $e = X - \widehat{E}[X \mid Y]$. Recall that, by the orthogonality principle, $Ee = 0$ and $\mathrm{Cov}(e, Y) = 0$. Since $Y$ and $e$ are obtained from $X$ and $Y$ by linear transformations, they are jointly Gaussian. Since $\mathrm{Cov}(e, Y) = 0$, the random vectors $e$ and $Y$ are also independent. For the next part of the proof, the reader should keep in mind that if $a$ is a deterministic vector of some dimension $m$, and $Z$ is a $N(0, K)$ random vector, for a matrix $K$ that is not a function of $a$, then $Z + a$ has the $N(a, K)$ distribution.

Focus on the following rearrangement of the definition of $e$:
$$X = e + \widehat{E}[X \mid Y]. \tag{3.13}$$
(Basically, the whole proof of the proposition hinges on (3.13).) Since $\widehat{E}[X \mid Y]$ is a function of $Y$ and since $e$ is independent of $Y$ with distribution $N(0, \mathrm{Cov}(e))$, the following key observation can be made. Given $Y = y$, the conditional distribution of $e$ is the $N(0, \mathrm{Cov}(e))$ distribution, which does not depend on $y$, while $\widehat{E}[X \mid Y = y]$ is completely determined by $y$. So, given $Y = y$, $X$ can be viewed as the sum of the $N(0, \mathrm{Cov}(e))$ vector $e$ and the determined vector $\widehat{E}[X \mid Y = y]$. So the conditional distribution of $X$ given $Y = y$ is $N(\widehat{E}[X \mid Y = y], \mathrm{Cov}(e))$. In particular, $E[X \mid Y = y]$, which in general is the mean of the conditional distribution of $X$ given $Y = y$, is therefore the mean of the $N(\widehat{E}[X \mid Y = y], \mathrm{Cov}(e))$ distribution. Hence $E[X \mid Y = y] = \widehat{E}[X \mid Y = y]$. Since this is true for all $y$, $E[X \mid Y] = \widehat{E}[X \mid Y]$.

Equations (3.10) and (3.11), respectively, are just the equations (3.6) and (3.7) derived for the MMSE linear estimator, $\widehat{E}[X \mid Y]$, and its associated covariance of error. Equation (3.12) is just the formula (3.8) for the pdf of a $N(\mu, K)$ vector, with $\mu = \widehat{E}[X \mid Y = y]$ and $K = \mathrm{Cov}(e)$.

Example 3.4.5 Suppose $X$ and $Y$ are jointly Gaussian mean zero random variables such that the vector $\begin{pmatrix} X \\ Y \end{pmatrix}$ has covariance matrix $\begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}$. Let us find simple expressions for the two random variables $E[X^2 \mid Y]$ and $P[X \ge c \mid Y]$. Note that if $W$ is a random variable with the $N(\mu, \sigma^2)$ distribution, then $E[W^2] = \mu^2 + \sigma^2$ and $P\{W \ge c\} = Q\!\left(\frac{c - \mu}{\sigma}\right)$, where $Q$ is the standard Gaussian complementary CDF. The idea is to apply these facts to the conditional distribution of $X$ given $Y$. Given $Y = y$, the conditional distribution of $X$ is $N\!\left(\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}\, y,\ \mathrm{Cov}(X) - \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Var}(Y)}\right)$, or $N\!\left(\frac{y}{3}, 3\right)$. Therefore, $E[X^2 \mid Y = y] = \left(\frac{y}{3}\right)^2 + 3$ and $P[X \ge c \mid Y = y] = Q\!\left(\frac{c - (y/3)}{\sqrt{3}}\right)$. Applying these two functions to the random variable $Y$ yields $E[X^2 \mid Y] = \left(\frac{Y}{3}\right)^2 + 3$ and $P[X \ge c \mid Y] = Q\!\left(\frac{c - (Y/3)}{\sqrt{3}}\right)$.
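For scalar jointly Gaussian $X$ and $Y$, equations (3.10) and (3.11) reduce to a few arithmetic operations. The sketch below is an illustration only (the function names `conditional_gaussian` and `second_moment_and_tail` are hypothetical, and $\mathrm{Var}(Y) > 0$ and a positive conditional variance are assumed); it reproduces the numbers of Example 3.4.5.

```python
import math


def q_function(t):
    """Complementary CDF of the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def conditional_gaussian(mean_x, mean_y, var_x, var_y, cov_xy, y):
    """Mean and variance of X given Y = y for scalar jointly Gaussian X, Y.

    Scalar versions of (3.10) and (3.11); assumes var_y > 0.
    """
    gain = cov_xy / var_y
    cond_mean = mean_x + gain * (y - mean_y)
    cond_var = var_x - gain * cov_xy
    return cond_mean, cond_var


def second_moment_and_tail(cond_mean, cond_var, c):
    """E[X^2 | Y = y] and P[X >= c | Y = y] for a N(cond_mean, cond_var) law."""
    second_moment = cond_mean ** 2 + cond_var
    tail = q_function((c - cond_mean) / math.sqrt(cond_var))
    return second_moment, tail


# Example 3.4.5: Var(X) = 4, Var(Y) = 9, Cov(X, Y) = 3, both means zero.
m, v = conditional_gaussian(0.0, 0.0, 4.0, 9.0, 3.0, y=3.0)
```

Note that the conditional variance does not depend on the observed value `y`, which is the feature of the Gaussian case that the proof of Proposition 3.4.4 exploits.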
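3.5 Linear Innovations Sequences

Let $X, Y_1, \ldots, Y_n$ be random vectors with finite second moments, all on the same probability space. In general, computation of the joint projection $\widehat{E}[X \mid Y_1, \ldots, Y_n]$ is considerably more complicated than computation of the individual projections $\widehat{E}[X \mid Y_i]$, because it requires inversion of the covariance matrix of all the $Y$'s. However, if $E[Y_i] = 0$ for all $i$ and $E[Y_i Y_j^T] = 0$ for $i \ne j$ (i.e., all coordinates of $Y_i$ are orthogonal to constants and to all coordinates of $Y_j$ for $i \ne j$), then
$$\widehat{E}[X \mid Y_1, \ldots, Y_n] = \overline{X} + \sum_{i=1}^{n} \widehat{E}[X - \overline{X} \mid Y_i], \tag{3.14}$$
where we write $\overline{X}$ for $EX$. The orthogonality principle can be used to prove (3.14) as follows. It suffices to prove that the right side of (3.14) satisfies the two properties that together characterize the left side of (3.14). First, the right side is a linear combination of $1, Y_1, \ldots, Y_n$. Secondly, let $e$ denote the error when the right side of (3.14) is used to estimate $X$:
$$e = X - \overline{X} - \sum_{i=1}^{n} \widehat{E}[X - \overline{X} \mid Y_i].$$
It must be shown that $E[e(Y_1^T c_1 + Y_2^T c_2 + \cdots + Y_n^T c_n + b)] = 0$ for any constant vectors $c_1, \ldots, c_n$ and constant $b$. It is enough to show that $E[e] = 0$ and $E[eY_j^T] = 0$ for all $j$. But $\widehat{E}[X - \overline{X} \mid Y_i]$ has the form $B_i Y_i$, because $X - \overline{X}$ and $Y_i$ have mean zero. Thus, $E[e] = 0$. Furthermore,
$$E[eY_j^T] = E\left[\left(X - \widehat{E}[X \mid Y_j]\right) Y_j^T\right] - \sum_{i : i \ne j} E[B_i Y_i Y_j^T].$$
Each term on the right side of this equation is zero, so $E[eY_j^T] = 0$, and (3.14) is proved.

If $1, Y_1, Y_2, \ldots, Y_n$ have finite second moments but are not orthogonal, then (3.14) doesn't directly apply. However, by orthogonalizing this sequence we can obtain a sequence $1, \widetilde{Y}_1, \widetilde{Y}_2, \ldots, \widetilde{Y}_n$ that can be used instead. Let $\widetilde{Y}_1 = Y_1 - E[Y_1]$, and for $k \ge 2$ let
$$\widetilde{Y}_k = Y_k - \widehat{E}[Y_k \mid Y_1, \ldots, Y_{k-1}]. \tag{3.15}$$
Then $E[\widetilde{Y}_i] = 0$ for all $i$ and $E[\widetilde{Y}_i \widetilde{Y}_j^T] = 0$ for $i \ne j$. In addition, by induction on $k$, we can prove that the set of all random variables obtained by linear transformation of $1, Y_1, \ldots, Y_k$ is equal to the set of all random variables obtained by linear transformation of $1, \widetilde{Y}_1, \ldots, \widetilde{Y}_k$.

Thus, for any random variable $X$ with finite second moments,
$$\widehat{E}[X \mid Y_1, \ldots, Y_n] = \widehat{E}[X \mid \widetilde{Y}_1, \ldots, \widetilde{Y}_n] = \overline{X} + \sum_{i=1}^{n} \widehat{E}[X - \overline{X} \mid \widetilde{Y}_i] = \overline{X} + \sum_{i=1}^{n} \frac{E[X\widetilde{Y}_i]\,\widetilde{Y}_i}{E[(\widetilde{Y}_i)^2]},$$
where the last expression is written for scalar $\widetilde{Y}_i$. Moreover, this same result can be used to compute the innovations sequence recursively: $\widetilde{Y}_1 = Y_1 - E[Y_1]$, and
$$\widetilde{Y}_k = Y_k - E[Y_k] - \sum_{i=1}^{k-1} \frac{E[Y_k\widetilde{Y}_i]\,\widetilde{Y}_i}{E[(\widetilde{Y}_i)^2]}, \qquad k \ge 2.$$
The sequence $\widetilde{Y}_1, \widetilde{Y}_2, \ldots, \widetilde{Y}_n$ is called the linear innovations sequence for $Y_1, Y_2, \ldots, Y_n$.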
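The recursive construction of the innovations sequence is a Gram-Schmidt procedure carried out on second moments. As a minimal sketch, assume the observations are scalar and mean zero and that only their covariance matrix is known; the function name `innovations` and the representation of each innovation by its coefficients on $Y_1, \ldots, Y_n$ are choices made here for illustration.

```python
def innovations(cov):
    """Linear innovations for mean-zero scalar observations Y_1, ..., Y_n.

    cov[i][j] = E[Y_i Y_j] (0-based indices). Returns (coeffs, variances), where
    innovation k equals sum_j coeffs[k][j] * Y_j and variances[k] is its second moment.
    """
    n = len(cov)
    coeffs = []
    variances = []
    for k in range(n):
        c = [0.0] * n
        c[k] = 1.0  # start from Y_k itself
        for i in range(k):
            if variances[i] == 0.0:
                continue  # a degenerate innovation carries no information
            # E[Y_k * innovation_i], computed from the covariance matrix
            cross = sum(coeffs[i][j] * cov[k][j] for j in range(n))
            weight = cross / variances[i]
            for j in range(n):
                c[j] -= weight * coeffs[i][j]  # subtract the projection onto innovation_i
        var = sum(c[a] * cov[a][b] * c[b] for a in range(n) for b in range(n))
        coeffs.append(c)
        variances.append(var)
    return coeffs, variances


coeffs, variances = innovations([[1.0, 0.5], [0.5, 1.0]])
```

For the two-variable input above, the second innovation is $Y_2 - 0.5\,Y_1$ with second moment $0.75$, which is the error of the best linear estimate of $Y_2$ from $Y_1$.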
3.6 Discrete-time Kalman filtering

Kalman filtering is a state-space approach to the problem of estimating one random sequence from another. Recursive equations are found that are useful in many real-time applications. For notational convenience, because there are so many matrices in this section, lower case letters are used for random vectors. All the random variables involved are assumed to have finite second moments. The state sequence $x_0, x_1, \ldots$ is to be estimated from an observed sequence $y_0, y_1, \ldots$. These sequences of random vectors are assumed to satisfy the following state and observation equations.

State: $x_{k+1} = F_k x_k + w_k$, $k \ge 0$
Observation: $y_k = H_k^T x_k + v_k$, $k \ge 0$.

It is assumed that
• $x_0, v_0, v_1, \ldots, w_0, w_1, \ldots$ are pairwise uncorrelated.
• $Ex_0 = \overline{x}_0$, $\mathrm{Cov}(x_0) = P_0$, $Ew_k = 0$, $\mathrm{Cov}(w_k) = Q_k$, $Ev_k = 0$, $\mathrm{Cov}(v_k) = R_k$.
• $F_k$, $H_k$, $Q_k$, $R_k$ for $k \ge 0$; $P_0$ are known matrices.
• $\overline{x}_0$ is a known vector.

See Figure 3.2 for a block diagram of the state and observation equations. The evolution of the state sequence $x_0, x_1, \ldots$ is driven by the random vectors $w_0, w_1, \ldots$, while the random vectors $v_0, v_1, \ldots$ represent observation noise.

[Figure 3.2: Block diagram of the state and observations equations.]
Let $\overline{x}_k = E[x_k]$ and $P_k = \mathrm{Cov}(x_k)$. These quantities are recursively determined for $k \ge 0$ by
$$\overline{x}_{k+1} = F_k\overline{x}_k \quad \text{and} \quad P_{k+1} = F_k P_k F_k^T + Q_k, \tag{3.16}$$
where the initial conditions $\overline{x}_0$ and $P_0$ are given as part of the state model. The idea of the Kalman filter equations is to recursively compute conditional expectations in a similar way.

Let $y^k = (y_0, y_1, \ldots, y_k)$ represent the observations up to time $k$. Define for nonnegative integers $i, j$
$$\widehat{x}_{i|j} = \widehat{E}[x_i \mid y^j]$$
and the associated covariance of error matrices
$$\Sigma_{i|j} = \mathrm{Cov}(x_i - \widehat{x}_{i|j}).$$
The goal is to compute $\widehat{x}_{k+1|k}$ for $k \ge 0$. The Kalman filter equations will first be stated, then briefly discussed, and then derived. The Kalman filter equations are given by
$$\begin{aligned} \widehat{x}_{k+1|k} &= \left[F_k - K_k H_k^T\right]\widehat{x}_{k|k-1} + K_k y_k \\ &= F_k\widehat{x}_{k|k-1} + K_k\left[y_k - H_k^T\widehat{x}_{k|k-1}\right] \end{aligned} \tag{3.17}$$
with the initial condition $\widehat{x}_{0|-1} = \overline{x}_0$, where the gain matrix $K_k$ is given by
$$K_k = F_k\Sigma_{k|k-1}H_k\left[H_k^T\Sigma_{k|k-1}H_k + R_k\right]^{-1} \tag{3.18}$$
and the covariance of error matrices are recursively computed by
$$\Sigma_{k+1|k} = F_k\left[\Sigma_{k|k-1} - \Sigma_{k|k-1}H_k\left(H_k^T\Sigma_{k|k-1}H_k + R_k\right)^{-1}H_k^T\Sigma_{k|k-1}\right]F_k^T + Q_k \tag{3.19}$$
with the initial condition $\Sigma_{0|-1} = P_0$. See Figure 3.3 for the block diagram.

[Figure 3.3: Block diagram of the Kalman filter.]

We comment briefly on the Kalman filter equations, before deriving them. First, observe what happens if $H_k$ is the zero matrix, $H_k = 0$, for all $k$. Then the Kalman filter equations reduce to (3.16) with $\widehat{x}_{k|k-1} = \overline{x}_k$, $\Sigma_{k|k-1} = P_k$ and $K_k = 0$. Taking $H_k = 0$ for all $k$ is equivalent to having no observations available.
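To make the recursion concrete, here is a minimal sketch of one step of (3.17)-(3.19) for the special case of a scalar state and a scalar observation, so that all matrices become numbers. The function name `kalman_step` and its argument names are assumptions made for this illustration, and it is assumed that $H_k^2\Sigma_{k|k-1} + R_k > 0$ so the division is defined.

```python
def kalman_step(x_pred, sigma_pred, y, f, h, q, r):
    """One step of the Kalman filter equations (3.17)-(3.19), scalar case.

    x_pred     : estimate of x_k given observations up to time k-1
    sigma_pred : error variance Sigma_{k|k-1}
    y          : new observation y_k
    f, h, q, r : scalar versions of F_k, H_k, Q_k, R_k
    Returns (estimate of x_{k+1} given observations up to time k, Sigma_{k+1|k}, K_k).
    """
    innovation_var = h * sigma_pred * h + r          # H^T Sigma H + R
    gain = f * sigma_pred * h / innovation_var       # (3.18)
    x_next = f * x_pred + gain * (y - h * x_pred)    # (3.17)
    sigma_next = f * (sigma_pred - sigma_pred * h * h * sigma_pred / innovation_var) * f + q  # (3.19)
    return x_next, sigma_next, gain


# With h = 0 the observation is ignored: the gain is zero and the step reduces to (3.16).
no_obs = kalman_step(1.0, 2.0, 5.0, f=0.5, h=0.0, q=1.0, r=1.0)

# A direct noisy observation of a constant state (f = 1, q = 0) halves the error variance.
with_obs = kalman_step(0.0, 1.0, 2.0, f=1.0, h=1.0, q=0.0, r=1.0)
```

Because the gain and the error variance do not depend on the observed values, calling the function with any `y` yields the same second and third outputs; in that sense the gains can be computed before any data arrive.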
In many applications, the sequence of gain matrices can be computed ahead of time according to (3.18) and (3.19). Then as the observations become available, the estimates can be computed using only (3.17). In some applications the matrices involved in the state and observation models, including the covariance matrices of the $v_k$'s and $w_k$'s, do not depend on $k$. The gain matrices $K_k$ could still depend on $k$ due to the initial conditions, but if the model is stable in some sense, then the gains converge to a constant matrix $K$, so that in steady state the filter equation (3.17) becomes time invariant: $\widehat{x}_{k+1|k} = (F - KH^T)\widehat{x}_{k|k-1} + Ky_k$.

In other applications, particularly those involving feedback control, the matrices in the state and/or observation equations might not be known until just before they are needed.

The Kalman filter equations are now derived. Roughly speaking, there are two considerations for computing $\widehat{x}_{k+1|k}$ once $\widehat{x}_{k|k-1}$ is computed: (1) the time update, accounting for the change in state from $x_k$ to $x_{k+1}$, and (2) the information update, accounting for the availability of the new observation $y_k$. Indeed, for $k \ge 0$, we can write
$$\widehat{x}_{k+1|k} = \widehat{x}_{k+1|k-1} + \left[\widehat{x}_{k+1|k} - \widehat{x}_{k+1|k-1}\right], \tag{3.20}$$
where the first term on the right of (3.20), namely $\widehat{x}_{k+1|k-1}$, represents the result of modifying $\widehat{x}_{k|k-1}$ to take into account the passage of time on the state, but with no new observation. The difference in square brackets in (3.20) is the contribution of the new observation, $y_k$, to the estimation of $x_{k+1}$.

Time update: In view of the state update equation and the fact that $w_k$ is uncorrelated with the random variables of $y^{k-1}$ and has mean zero,
$$\begin{aligned} \widehat{x}_{k+1|k-1} &= \widehat{E}[F_k x_k + w_k \mid y^{k-1}] \\ &= F_k\widehat{E}[x_k \mid y^{k-1}] + \widehat{E}[w_k \mid y^{k-1}] \\ &= F_k\widehat{x}_{k|k-1}. \end{aligned} \tag{3.21}$$
Thus, the time update consists of simply multiplying the previous estimate by $F_k$. If there were no new observation, then this would be the entire Kalman filter. Furthermore, the covariance of error matrix for predicting $x_{k+1}$ by $\widehat{x}_{k+1|k-1}$ is given by
$$\begin{aligned} \Sigma_{k+1|k-1} &= \mathrm{Cov}(x_{k+1} - \widehat{x}_{k+1|k-1}) \\ &= \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}) + w_k) \\ &= F_k\Sigma_{k|k-1}F_k^T + Q_k. \end{aligned} \tag{3.22}$$
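Information update: Consider next the new observation $y_k$. The observation $y_k$ is not totally new, for it can be predicted in part from the previous observations, or simply by its mean in the case $k = 0$. Specifically, we can consider $\widetilde{y}_k = y_k - \widehat{E}[y_k \mid y^{k-1}]$ to be the new part of the observation $y_k$. Here, $\widetilde{y}_0, \widetilde{y}_1, \ldots$ is the linear innovation sequence for the observation sequence $y_0, y_1, \ldots$, as defined in Section 3.5 (with the minor difference that here the vectors are indexed from time $k = 0$ on, rather than from time $k = 1$). Let $\widetilde{y}^{k-1} = (\widetilde{y}_0, \widetilde{y}_1, \ldots, \widetilde{y}_{k-1})$. Since the linear span of the random variables in $(1, y^{k-1}, y_k)$ is the same as the linear span of the random variables in $(1, \widetilde{y}^{k-1}, \widetilde{y}_k)$, for the purposes of incorporating the new observation we can pretend that $\widetilde{y}_k$ is the new observation rather than $y_k$. By the observation equation and the facts $E[v_k] = 0$ and $E[y^{k-1}v_k^T] = 0$, it follows that
$$\widehat{E}[y_k \mid y^{k-1}] = \widehat{E}\left[H_k^T x_k + v_k \mid y^{k-1}\right] = H_k^T\widehat{x}_{k|k-1},$$
so
$$\widetilde{y}_k = y_k - H_k^T\widehat{x}_{k|k-1}. \tag{3.23}$$
Since $(1, y^{k-1}, y_k)$ and $(1, \widetilde{y}^{k-1}, \widetilde{y}_k)$ have the same span and the random variables in $\widetilde{y}^{k-1}$ are orthogonal to the random variables in $\widetilde{y}_k$, and all these random variables have mean zero,
$$\begin{aligned} \widehat{x}_{k+1|k} &= \widehat{E}\left[x_{k+1} \mid \widetilde{y}^{k-1}, \widetilde{y}_k\right] \\ &= \overline{x}_{k+1} + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1} \mid \widetilde{y}^{k-1}\right] + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1} \mid \widetilde{y}_k\right] \\ &= \widehat{x}_{k+1|k-1} + \widehat{E}\left[x_{k+1} - \overline{x}_{k+1} \mid \widetilde{y}_k\right]. \end{aligned}$$
Therefore, the term to be added for the information update (in square brackets in (3.20)) is $\widehat{E}[x_{k+1} - \overline{x}_{k+1} \mid \widetilde{y}_k]$. Since $x_{k+1} - \overline{x}_{k+1}$ and $\widetilde{y}_k$ both have mean zero, the information update term can be simplified to
$$\widehat{E}\left[x_{k+1} - \overline{x}_{k+1} \mid \widetilde{y}_k\right] = K_k\widetilde{y}_k,$$
where (use the fact $\mathrm{Cov}(x_{k+1} - \overline{x}_{k+1}, \widetilde{y}_k) = \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)$ because $E[\widetilde{y}_k] = 0$)
$$K_k = \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)\mathrm{Cov}(\widetilde{y}_k)^{-1}. \tag{3.24}$$

Putting it all together: Equation (3.20), with the time update equation (3.21) and the fact the information update term is $K_k\widetilde{y}_k$, yields the main Kalman filter equation:
$$\widehat{x}_{k+1|k} = F_k\widehat{x}_{k|k-1} + K_k\widetilde{y}_k. \tag{3.25}$$
Taking into account the new observation $\widetilde{y}_k$, which is orthogonal to the previous observations, yields a reduction in the covariance of error:
$$\Sigma_{k+1|k} = \Sigma_{k+1|k-1} - \mathrm{Cov}(K_k\widetilde{y}_k). \tag{3.26}$$
The Kalman filter equations (3.17), (3.18), and (3.19) follow easily from (3.25), (3.24), and (3.26), as follows. Equation (3.17) is (3.25) with $\widetilde{y}_k$ given by (3.23). To convert (3.24) into (3.18), use
$$\begin{aligned} \mathrm{Cov}(x_{k+1}, \widetilde{y}_k) &= \mathrm{Cov}(F_k x_k + w_k,\ H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \\ &= \mathrm{Cov}(F_k x_k,\ H_k^T(x_k - \widehat{x}_{k|k-1})) \\ &= \mathrm{Cov}(F_k(x_k - \widehat{x}_{k|k-1}),\ H_k^T(x_k - \widehat{x}_{k|k-1})) \\ &= F_k\Sigma_{k|k-1}H_k \end{aligned} \tag{3.27}$$
and
$$\begin{aligned} \mathrm{Cov}(\widetilde{y}_k) &= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1}) + v_k) \\ &= \mathrm{Cov}(H_k^T(x_k - \widehat{x}_{k|k-1})) + \mathrm{Cov}(v_k) \\ &= H_k^T\Sigma_{k|k-1}H_k + R_k. \end{aligned}$$
To convert (3.26) into (3.19) use (3.22) and
$$\begin{aligned} \mathrm{Cov}(K_k\widetilde{y}_k) &= K_k\mathrm{Cov}(\widetilde{y}_k)K_k^T \\ &= \mathrm{Cov}(x_{k+1}, \widetilde{y}_k)\,\mathrm{Cov}(\widetilde{y}_k)^{-1}\,\mathrm{Cov}(x_{k+1}, \widetilde{y}_k)^T. \end{aligned}$$
This completes the derivation of the Kalman filtering equations.

3.7 Problems

3.1 Rotation of a joint normal distribution yielding independence
Let $X$ be a Gaussian vector with
$$E[X] = \begin{pmatrix} 10 \\ 5 \end{pmatrix} \qquad \mathrm{Cov}(X) = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}.$$
(a) Write an expression for the pdf of $X$ that does not use matrix notation.
(b) Find a vector $b$ and orthonormal matrix $U$ such that the vector $Y$ defined by $Y = U^T(X - b)$ is a mean zero Gaussian vector such that $Y_1$ and $Y_2$ are independent.

3.2 Linear approximation of the cosine function over an interval
Let $\Theta$ be uniformly distributed on the interval $[0, \pi]$ (yes, $[0, \pi]$, not $[0, 2\pi]$). Suppose $Y = \cos(\Theta)$ is to be estimated by an estimator of the form $a + b\Theta$. What numerical values of $a$ and $b$ minimize the mean square error?

3.3 Calculation of some minimum mean square error estimators
Let $Y = X + N$, where $X$ has the exponential distribution with parameter $\lambda$, and $N$ is Gaussian with mean 0 and variance $\sigma^2$. The variables $X$ and $N$ are independent, and the parameters $\lambda$ and $\sigma^2$ are strictly positive. (Recall that $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.)
(a) Find $\widehat{E}[X \mid Y]$ and also find the mean square error for estimating $X$ by $\widehat{E}[X \mid Y]$.
(b) Does $E[X \mid Y] = \widehat{E}[X \mid Y]$? Justify your answer. (Hint: Answer is yes if and only if there is no estimator for $X$ of the form $g(Y)$ with a smaller MSE than $\widehat{E}[X \mid Y]$.)

3.4 Valid covariance matrix
For what real values of $a$ and $b$ is the following matrix the covariance matrix of some real-valued random vector?
$$K = \begin{pmatrix} 2 & 1 & b \\ a & 1 & 0 \\ b & 0 & 1 \end{pmatrix}.$$
Hint: A symmetric $n \times n$ matrix is positive semidefinite if and only if the determinant of every matrix obtained by deleting a set of rows and the corresponding set of columns is nonnegative.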
and (1.3.5 Conditional probabilities with joint Gaussians I X 1 ρ Let be a mean zero Gaussian vector with correlation matrix Y ρ 1 (a) Express P [X ≤ 1|Y ] in terms of ρ.1). (b) For what joint distribution of X and Y (consistent with the given information) is E[e2 ] maximized? Is your answer unique? 3. Y . (b) Find E[X|U ] and calculate the MSE.2). 3. and the standard normal CDF. Compute the mean square error for each estimator.8 An MMSE estimation problem (a) Let X and Y be jointly uniformly distributed over the triangular region in the x − y plane with corners (0. PROBLEMS 97 Hint: An symmetric n × n matrix is positive semidefinite if and only if the determinant of every matrix obtained by deleting a set of rows and the corresponding set of columns. 3.0).10 Conditional Gaussian comparison Suppose that X and Y are jointly Gaussian. E[(X − E[X|U ])2 ]. (c) Find P [|X − E[X|Y ]| ≥ 1]. What percentage reduction in MSE does the MMSE estimator provide over the LMMSE? (b) Repeat part (a) assuming Y is a N (0. Y be jointly Gaussian random variables with mean zero and covariance matrix Cov X Y = 4 6 6 18 . u 2 You may express your answers in terms of the Φ function defined by Φ(u) = −∞ √1 e−s /2 ds. or describe it as a well known density with specified parameter values. E[(X − E[X|U ])2 ]. 3. (a) If possible. (b) What is the conditional density of X given that Y = 3? You can either write out the density in full. (b) Find E[(X − Y )2 |Y = y] for real values of y. where |ρ| < 1. is nonnegative. Φ.7. with Var(X) = Var(Y ) = 10 and . (0. If not. mean zero. where U is uniformly distributed over the interval [0. E[A2 B] = 0. and suppose X is a random variable with finite second moment. determine if the statement is true. 
Y ) = 8.14 Some identities for estimators Let X and Y be random variables with E[X 2 ] < ∞.13 Projections onto nested linear subspaces (a) Use the Orthogonality Principle to prove the following statement: Suppose V0 and V1 are two closed linear spaces of second order random variables. If yes. 3. (ii) E[X|Y1 ] = E[ E[X|Y1 . You need not carry out the integration. Find an orthonormal 2 by 2 matrix U such that X = U Y for a Gaussian vector Y1 Y = such that Y1 is independent of Y2 . If no. Let Zi∗ be the random variable in Vi with the ∗ minimum mean square distance from X. Also. from smallest to largest. and pc are ordered. B. 3. Express the following probabilities in terms of the Q function. find the variances of Y1 and Y2 . Also. For each of the following statements.12 An estimator of an estimator Let X and Y be square integrable random variables and let Z = E[X | Y ]. (b) Suppose that X. For each of the following three statements. in the m.11 Diagonalizing a two-dimensional Gaussian distribution X1 1 ρ Let X = be a mean zero Gaussian random vector with correlation matrix . 3. give a counter example. Show that the LMMSE estimator of X given Y is also the LMMSE estimator of Z given Y . (a) E[X cos(Y )|Y ] = E[X|Y ] cos(Y ) . (b) pb = P [X ≥ 2|Y = 3]. RANDOM VECTORS AND MINIMUM MEAN SQUARED ERROR ESTIMATION Cov(X.s. identify the choice of subspace V0 and V1 such that the statement follows from part (a): (i) E[X|Y1 ] = E[ E[X|Y1 .”) (iii) E[X] = E[E[X|Y1 ]]. If A. then E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]. Y2 ] |Y1 ]. (Can you generalize this result?). and Cov(A2 . Var(A2 ) = 2E[A2 ].98CHAPTER 3. Y2 ] |Y1 ].) (d) Indicate how pa . Y . 3. This implies that E[A4 ] = 3E[A2 ]2 . so Z is the MMSE estimator of X given Y . B)2 . C. (Note: pc can be expressed as an integral. (Sometimes called the “tower property. give a justification using the orthogonality principle. (a) pa = P {X ≥ 2}. pb . 
and Y are random variables with finite second square distance from Z0 1 2 moments. and D are jointly Gaussian and mean zero. (c) pc = P [X ≥ 2|Y ≥ 3]. B 2 ) = 2Cov(A. sense. such that V0 ⊃ V1 . (Think of the expectation of a random variable as the constant closest to the random variable. X2 ρ 1 where |ρ| < 1. Then Z1 is the variable in V1 with the minimum mean ∗ . Y2 Note: The following identity might be useful for some of the problems that follow. and compute the mean square error. but all three together are not independent. If false. then BB T is positive semidefinite.6. then there exists a symmetric positive semidefinite matrix S over the reals such that K = S 2 .19 A quadratic estimator Suppose Y has the N (0. (b) E[(X − E[X|Y ])2 ] = E[(X − E[X|Y.16 Some simple examples Give an example of each of the following. (Hint: What if K is also diagonal?) 3. . version 2 Let X. Y 2 ])2 ]. Y 2 ]2 ] if X and Y are jointly Gaussian. 1) distribution and that X = |Y |. For each of the following.15 Some identities for estimators. (c) Find the resulting minimum mean square error. where |ρ| < 1. 3. Y ρ 1 (a) Find E[X 2 |Y ]. and X and Y are not jointly Gaussian. (b) Find the estimator X. (b) True or false? If K is a symmetric positive semidefinite matrix over the reals. if true. then X and Y are independent. which are pairwise independent. and such that E[X|Y | is not simply constant. (a) E[(X − E[X|Y ])2 ] ≤ E[(X − E[X|Y.3. and Z be random variables with finite second moments and suppose X is to be estimated.8. (c) Three random variables X. Y. (c) E[ (X − E[E[X|Z] |Y ])2 ] ≤ E[(X − E[X|Y ])2 ].18 Estimating a quadratic X 1 ρ Let be a mean zero Gaussian vector with correlation matrix . explain your reasoning. give a counter example. E[Y 4 ] = 3. (d) If E[(X − E[X|Y ])2 ] = Var(X). the best estimator of X 2 given Y. 3. and c. (a) Two random variables X and Y such that E[X|Y ] = E[X|Y ].) (a) Use the orthogonality principle to derive equations for a. 
(b) A pair of random variables X and Y on some probability space such that X is Gaussian. the best linear (actually.7. and Z. 3.17 The square root of a positive-semidefinite matrix (a) True or false? If B is a square matrix over the reals. and in each case. Y. (You can use the following numerical values: E[|Y |] = 0. PROBLEMS (b) E[X|Y ] = E[X|Y 3 ] (c) E[X 3 |Y ] = E[X|Y ]3 (d) E[X|Y ] = E[X|Y 2 ] (e) E[X|Y ] = E[X|Y 3 ] 99 3. E[|Y |Y 2 ] = 1. (b) Compute the mean square error for the estimator E[X 2 |Y ]. (c) Find E[X 2 |Y ]. give a brief explanation. affine) estimator of X 2 given Y. Y is Gaussian. but X and Y are not jointly Gaussian. b. Find the estimator for X of the form X = a + bY + cY 2 which minimizes the mean square error. (b) Express the limit. in terms of f .) 3. and c to minimize E[(X − aY1 − bY2 − cY3 )2 ]. Write down the Kalman filter equations for recursively computing the estimates xk|k−1 . Y2 .21 Estimation for an additive Gaussian noise model Assume x and n are independent Gaussian vectors with means x. the (scaler) gains Kk . Y3 Y3 Y1 Y1 (b) Find the correlation matrix of Y2 and cross covariance matrix Cov(X. σ∞ 2 (c) Explain why σ∞ = 1 if f = 0. (b) For what values of f is the sequence of error variances bounded? 3.5 0. Y3 denote the innovations sequence. . v1 . Find the matrix A so that Y2 = A Y2 . x n (Assume that the various inverses exist. n and covariance matrices Σx ¯ ¯ and Σn . and let x0 denote a N (0. and ˆ 2 the sequence of the variances of the errors (for brevity write σk for the covariance or error instead of Σk|k−1 ).25 1 Y1 Y1 (a) Let Y1 . Show that the conditional covariance matrix of x given y is given by any of the three expressions: Σx − Σx (Σx + Σn )−1 Σx = Σx (Σx + Σn )−1 Σn = (Σ−1 + Σ−1 )−1 . σ 2 ) random variable. w2 .5 1 0.5 0. b. 1) random variables. .100CHAPTER 3. 3. let f be a real constant. 2 .20 An innovations sequence and its application Y1 1 0. ¯ ¯ (b). . Let y = x + n.25 X 0 0. . 
Y3 Y3 (c) Find the constants a.5 0 Y 0. (a) Show that E[x|y] is given by either x + Σx (Σx + Σn )−1 (y − (¯ + n)) ¯ x ¯ or Σn (Σx + Σn )−1 x + Σx (Σx + Σn )−1 (y − n). Let 2 be a mean zero random vector with correlation matrix Y3 0. Consider the state and observation sequences defined by: (state) xk+1 = f xk + wk yk = xk + vk (observation) where w1 . Then x and y are jointly Gaussian. RANDOM VECTORS AND MINIMUM MEAN SQUARED ERROR ESTIMATION 3. . . 2 (a) Show that limk→∞ σk exists. v2 .5 0. are mutually independent N (0.25 0.22 A Kalman filtering example (a) Let σ 2 > 0. . Y2 ).25 .5 1 0. .23 Steady state gains for one-dimensional Kalman filter This is a continuation of the previous problem. 3.7. PROBLEMS 101 3.24 A variation of Kalman filtering (a) Let σ 2 > 0, let f be a real constant, and let x0 denote a N (0, σ 2 ) random variable. Consider the state and observation sequences defined by: (state) xk+1 = f xk + wk y k = x k + wk (observation) where w1 , w2 , . . . are mutually independent N (0, 1) random variables. Note that the state and observation equations are driven by the same sequence, so that some of the Kalman filtering equations derived in the notes do not apply. Derive recursive equations needed to compute xk|k−1 , including ˆ recursive equations for any needed gains or variances of error. (Hints: What modifications need to be made to the derivation for the standard model? Check that your answer is correct for f = 1.) 3.25 The Kalman filter for xk|k Suppose in a given application a Kalman filter has been implemented to recursively produce xk+1|k for k ≥ 0, as in class. Thus by time k, xk+1|k , Σk+1|k , xk|k−1 , and Σk|k−1 are already computed. Suppose that it is desired to also compute xk|k at time k. Give additional equations that can be used to compute xk|k . (You can assume as given the equations in the class notes, and don’t need to write them all out. Only the additional equations are asked for here. 
Be as explicit as you can, expressing any matrices you use in terms of the matrices already given in the class notes.)

3.26 An innovations problem
Let $U_1, U_2, \ldots$ be a sequence of independent random variables, each uniformly distributed on the interval $[0, 1]$. Let $Y_0 = 1$, and $Y_n = U_1 U_2 \cdots U_n$ for $n \geq 1$.
(a) Find the variance of $Y_n$ for each $n \geq 1$.
(b) Find $E[Y_n \mid Y_0, \ldots, Y_{n-1}]$ for $n \geq 1$.
(c) Find $\widehat{E}[Y_n \mid Y_0, \ldots, Y_{n-1}]$ for $n \geq 1$.
(d) Find the linear innovations sequence $\tilde{Y} = (\tilde{Y}_0, \tilde{Y}_1, \ldots)$.
(e) Fix a positive integer $M$ and let $X_M = U_1 + \cdots + U_M$. Using the answer to part (d), find $\widehat{E}[X_M \mid Y_0, \ldots, Y_M]$, the best linear estimator of $X_M$ given $(Y_0, \ldots, Y_M)$.

3.27 Linear innovations and orthogonal polynomials for the normal distribution
(a) Let $X$ be a $N(0, 1)$ random variable. Show that for integers $n \geq 0$,
$$E[X^n] = \begin{cases} \dfrac{n!}{(n/2)!\, 2^{n/2}} & n \text{ even} \\ 0 & n \text{ odd.} \end{cases}$$
Hint: One approach is to apply the power series expansion for $e^x$ on each side of the identity $E[e^{uX}] = e^{u^2/2}$, and identify the coefficients of $u^n$.
(b) Let $X$ be a $N(0, 1)$ random variable, and let $Y_n = X^n$ for integers $n \geq 1$. Express the first four terms, $\tilde{Y}_1$ through $\tilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $X$.

3.28 Linear innovations and orthogonal polynomials for the uniform distribution
(a) Let $U$ be uniformly distributed on the interval $[-1, 1]$. Show that for integers $n \geq 0$,
$$E[U^n] = \begin{cases} \dfrac{1}{n+1} & n \text{ even} \\ 0 & n \text{ odd.} \end{cases}$$
(b) Let $Y_n = U^n$ for integers $n \geq 0$. Note that $Y_0 \equiv 1$. Express the first four terms, $\tilde{Y}_1$ through $\tilde{Y}_4$, of the linear innovations sequence of $Y$ in terms of $U$.

3.29 Representation of three random variables with equal cross covariances
Let $K$ be a matrix of the form
$$K = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix},$$
where $a \in \mathbb{R}$.
(a) For what values of $a$ is $K$ the covariance matrix of some random vector?
(b) Let $a$ have one of the values found in part (a).
Fill in the missing entries of the matrix $U$,
$$U = \begin{pmatrix} * & * & \frac{1}{\sqrt{3}} \\ * & * & \frac{1}{\sqrt{3}} \\ * & * & \frac{1}{\sqrt{3}} \end{pmatrix},$$
to yield an orthonormal matrix, and find a diagonal matrix $\Lambda$ with nonnegative entries, so that if $Z$ is a three dimensional random vector with $\mathrm{Cov}(Z) = I$, then $U \Lambda^{1/2} Z$ has covariance matrix $K$. (Hint: It happens that the matrix $U$ can be selected independently of $a$. Also, $1 + 2a$ is an eigenvalue of $K$.)

3.30 Kalman filter for a rotating state
Consider the Kalman state and observation equations for the following matrices, where $\theta_o = 2\pi/10$ (the matrices don't depend on time, so the subscript $k$ is omitted):
$$F = (0.99) \begin{pmatrix} \cos(\theta_o) & -\sin(\theta_o) \\ \sin(\theta_o) & \cos(\theta_o) \end{pmatrix}, \qquad H = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad R = 1.$$
(a) Explain in words what successive iterates $F^n x_o$ are like, for a nonzero initial state $x_o$ (this is the same as the state equation, but with the random term $w_k$ left off).
(b) Write out the Kalman filter equations for this example, simplifying as much as possible (but no more than possible! The equations don't simplify all that much.)

3.31 * Proof of the orthogonality principle
Prove the seven statements lettered (a)-(g) in what follows.
Let $X$ be a random variable and let $V$ be a collection of random variables on the same probability space such that
(i) $E[Z^2] < +\infty$ for each $Z \in V$.
(ii) $V$ is a linear class, i.e., if $Z, Z' \in V$ then so is $aZ + bZ'$ for any real numbers $a$ and $b$.
(iii) $V$ is closed in the sense that if $Z_n \in V$ for each $n$ and $Z_n$ converges to a random variable $Z$ in the mean square sense, then $Z \in V$.
The Orthogonality Principle is that there exists a unique element $Z^* \in V$ so that $E[(X - Z^*)^2] \leq E[(X - Z)^2]$ for all $Z \in V$. Furthermore, a random variable $W \in V$ is equal to $Z^*$ if and only if $(X - W) \perp Z$ for all $Z \in V$. ($(X - W) \perp Z$ means $E[(X - W)Z] = 0$.)
The remainder of this problem is aimed at a proof. Let $d = \inf\{E[(X - Z)^2] : Z \in V\}$. By definition of infimum there exists a sequence $Z_n \in V$ so that $E[(X - Z_n)^2] \to d$ as $n \to +\infty$.
(a) The sequence $Z_n$ is Cauchy in the mean square sense.
(Hint: Use the "parallelogram law": $E[(U - V)^2] + E[(U + V)^2] = 2(E[U^2] + E[V^2])$.) Thus, by the Cauchy criteria, there is a random variable $Z^*$ such that $Z_n$ converges to $Z^*$ in the mean square sense.
(b) $Z^*$ satisfies the conditions advertised in the first sentence of the principle.
(c) The element $Z^*$ satisfying the condition in the first sentence of the principle is unique. (Consider two random variables that are equal to each other with probability one to be the same.) This completes the proof of the first sentence.
(d) ("if" part of second sentence). If $W \in V$ and $(X - W) \perp Z$ for all $Z \in V$, then $W = Z^*$.
(The "only if" part of second sentence is divided into three parts:)
(e) $E[(X - Z^* - cZ)^2] \geq E[(X - Z^*)^2]$ for any real constant $c$.
(f) $-2cE[(X - Z^*)Z] + c^2 E[Z^2] \geq 0$ for any real constant $c$.
(g) $(X - Z^*) \perp Z$, and the principle is proved.

3.32 * The span of two closed subspaces is closed
Check that the span, $V_1 \oplus V_2$, of two closed linear spaces (defined in Proposition 3.2.4) is also a closed linear space. A hint for showing that $V$ is closed is to use the fact that if $(Z_n)$ is a m.s. convergent sequence of random variables in $V$, then each variable in the sequence can be represented as $Z_n = Z_{n,1} + Z_{n,2}$, where $Z_{n,i} \in V_i$, and $E[(Z_n - Z_m)^2] = E[(Z_{n,1} - Z_{m,1})^2] + E[(Z_{n,2} - Z_{m,2})^2]$.

3.33 * Von Neumann's alternating projections algorithm
Let $V_1$ and $V_2$ be closed linear subspaces of $L^2(\Omega, \mathcal{F}, P)$, and let $X \in L^2(\Omega, \mathcal{F}, P)$. Define a sequence $(Z_n : n \geq 0)$ recursively, by alternating projections onto $V_1$ and $V_2$, as follows. Let $Z_0 = X$, and for $k \geq 0$, let $Z_{2k+1} = \Pi_{V_1}(Z_{2k})$ and $Z_{2k+2} = \Pi_{V_2}(Z_{2k+1})$. The goal of this problem is to show that $Z_n \to \Pi_{V_1 \cap V_2}(X)$ in the m.s. sense. The approach will be to establish that $(Z_n)$ converges in the m.s. sense, by verifying the Cauchy criteria, and then use the orthogonality principle to identify the limit. Define $D(i, j) = E[(Z_i - Z_j)^2]$ for $i \geq 0$ and $j \geq 0$, and let $\epsilon_i = D(i + 1, i)$ for $i \geq 0$.
(a) Show that $\epsilon_i = E[(Z_i)^2] - E[(Z_{i+1})^2]$.
(b) Show that $\sum_{i=0}^{\infty} \epsilon_i \leq E[Z_0^2] < \infty$.
(c) Use the orthogonality principle to show that for $n \geq 1$ and $k \geq 0$:
$$\begin{aligned}
D(n, n + 2k + 1) &= \epsilon_n + D(n + 1, n + 2k + 1) \\
D(n, n + 2k + 2) &= D(n, n + 2k + 1) - \epsilon_{n+2k+1}.
\end{aligned}$$
(d) Use the above equations to show that for $n \geq 1$ and $k \geq 0$,
$$\begin{aligned}
D(n, n + 2k + 1) &= \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k}) \\
D(n, n + 2k + 2) &= \epsilon_n + \cdots + \epsilon_{n+k} - (\epsilon_{n+k+1} + \cdots + \epsilon_{n+2k+1}).
\end{aligned}$$
Consequently, $D(n, m) \leq \sum_{i=n}^{m-1} \epsilon_i$ for $1 \leq n < m$, and therefore $(Z_n : n \geq 0)$ is a Cauchy sequence, so $Z_n \to Z_\infty$ in the m.s. sense for some random variable $Z_\infty$.
(e) Verify that $Z_\infty \in V_1 \cap V_2$.
(f) Verify that $(X - Z_\infty) \perp Z$ for any $Z \in V_1 \cap V_2$. (Hint: Explain why $(X - Z_n) \perp Z$ for all $n$, and let $n \to \infty$.)
By the orthogonality principle, (e) and (f) imply that $Z_\infty = \Pi_{V_1 \cap V_2}(X)$.

Chapter 4 Random Processes

4.1 Definition of a random process

A random process $X$ is an indexed collection $X = (X_t : t \in T)$ of random variables, all on the same probability space $(\Omega, \mathcal{F}, P)$. In many applications the index set $T$ is a set of times. If $T = \mathbb{Z}$, or more generally, if $T$ is a set of consecutive integers, then $X$ is called a discrete-time random process. If $T = \mathbb{R}$ or if $T$ is an interval of $\mathbb{R}$, then $X$ is called a continuous-time random process. Three ways to view a random process $X = (X_t : t \in T)$ are as follows:
• For each $t$ fixed, $X_t$ is a function on $\Omega$.
• $X$ is a function on $T \times \Omega$ with value $X_t(\omega)$ for given $t \in T$ and $\omega \in \Omega$.
• For each $\omega$ fixed with $\omega \in \Omega$, $X_t(\omega)$ is a function of $t$, called the sample path corresponding to $\omega$.

Example 4.1.1 Suppose $W_1, W_2, \ldots$ are independent random variables with $P\{W_k = 1\} = P\{W_k = -1\} = \frac{1}{2}$ for each $k$, and suppose $X_0 = 0$ and $X_n = W_1 + \cdots + W_n$ for positive integers $n$. Let $W = (W_k : k \geq 1)$ and $X = (X_n : n \geq 0)$. Then $W$ and $X$ are both discrete-time random processes. The index set $T$ for $X$ is $\mathbb{Z}_+$. A sample path of $W$ and a corresponding sample path of $X$ are shown in Figure 4.1.
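To make the sample-path viewpoint of Example 4.1.1 concrete, here is a minimal simulation sketch. It is an illustration only, not part of the notes: the function name `random_walk_path` and the `seed` parameter are hypothetical choices. One call corresponds to fixing a single outcome $\omega$ and reading off the paths $k \mapsto W_k(\omega)$ and $n \mapsto X_n(\omega)$.

```python
import random


def random_walk_path(n_steps, seed=None):
    """Return (W, X) for one simulated outcome.

    W[k-1] holds the step W_k, drawn from {-1, +1} with equal probability.
    X[n] holds X_n = W_1 + ... + W_n, with X[0] = 0.
    """
    rng = random.Random(seed)
    W = [rng.choice((-1, 1)) for _ in range(n_steps)]
    X = [0]
    for w in W:
        X.append(X[-1] + w)  # X_n = X_{n-1} + W_n
    return W, X


if __name__ == "__main__":
    W, X = random_walk_path(10, seed=0)
    print("W:", W)
    print("X:", X)
```

Calling the function again with a different seed plays the role of choosing a different $\omega$, which gives a different sample path of the same process.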
The following notation is used:
$$\begin{aligned}
\mu_X(t) &= E[X_t] \\
R_X(s, t) &= E[X_s X_t] \\
C_X(s, t) &= \mathrm{Cov}(X_s, X_t) \\
F_{X,n}(x_1, t_1; \ldots; x_n, t_n) &= P\{X_{t_1} \leq x_1, \ldots, X_{t_n} \leq x_n\}
\end{aligned}$$
and $\mu_X$ is called the mean function, $R_X$ is called the correlation function, $C_X$ is called the covariance function, and $F_{X,n}$ is called the $n$th order CDF. Sometimes the prefix "auto," meaning "self," is added to the words correlation and covariance, to emphasize that only one random process is involved.

Figure 4.1: Typical sample paths ($W_k(\omega)$ versus $k$ and $X_k(\omega)$ versus $k$).

Definition 4.1.2 A second order random process is a random process $(X_t : t \in T)$ such that $E[X_t^2] < +\infty$ for all $t \in T$.

The mean, correlation, and covariance functions of a second order random process are all well-defined and finite.
If $X_t$ is a discrete random variable for each $t$, then the $n$th order pmf of $X$ is defined by
$$p_{X,n}(x_1, t_1; \ldots; x_n, t_n) = P\{X_{t_1} = x_1, \ldots, X_{t_n} = x_n\}.$$
Similarly, if $X_{t_1}, \ldots, X_{t_n}$ are jointly continuous random variables for any distinct $t_1, \ldots, t_n$ in $T$, then $X$ has an $n$th order pdf $f_{X,n}$, such that for $t_1, \ldots, t_n$ fixed, $f_{X,n}(x_1, t_1; \ldots; x_n, t_n)$ is the joint pdf of $X_{t_1}, \ldots, X_{t_n}$.

Example 4.1.3 Let $A$ and $B$ be independent, $N(0, 1)$ random variables. Suppose $X_t = A + Bt + t^2$ for all $t \in \mathbb{R}$. Let us describe the sample functions, the mean, correlation, and covariance functions, and the first and second order pdf's of $X$. Each sample function corresponds to some fixed $\omega$ in $\Omega$. For $\omega$ fixed, $A(\omega)$ and $B(\omega)$ are numbers. The sample paths all have the same shape: they are parabolas with constant second derivative equal to 2. The sample path for $\omega$ fixed has $t = 0$ intercept $A(\omega)$, and minimum value $A(\omega) - \frac{B(\omega)^2}{4}$ achieved at $t = -\frac{B(\omega)}{2}$. Three typical sample paths are shown in Figure 4.2. The various moment functions are given by
$$\begin{aligned}
\mu_X(t) &= E[A + Bt + t^2] = t^2 \\
R_X(s, t) &= E[(A + Bs + s^2)(A + Bt + t^2)] = 1 + st + s^2 t^2 \\
C_X(s, t) &= R_X(s, t) - \mu_X(s)\mu_X(t) = 1 + st.
\end{aligned}$$
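As an informal numerical check of the moment functions just derived for $X_t = A + Bt + t^2$, the following sketch compares the closed forms with Monte Carlo estimates. It is only a sanity check under stated assumptions: the helper names (`mu`, `R`, `C`, `estimate_moments`), the sample size, and the seed are illustrative choices and do not come from the notes.

```python
import random


def mu(t):
    """Mean function from the example: mu_X(t) = t^2."""
    return t * t


def R(s, t):
    """Correlation function from the example: R_X(s, t) = 1 + st + s^2 t^2."""
    return 1 + s * t + (s * t) ** 2


def C(s, t):
    """Covariance function from the example: C_X(s, t) = 1 + st."""
    return 1 + s * t


def estimate_moments(s, t, n_samples=100_000, seed=0):
    """Monte Carlo estimates of E[X_s], E[X_t], E[X_s X_t], and Cov(X_s, X_t)."""
    rng = random.Random(seed)
    sum_s = sum_t = sum_st = 0.0
    for _ in range(n_samples):
        a = rng.gauss(0.0, 1.0)  # one draw of A
        b = rng.gauss(0.0, 1.0)  # one draw of B, independent of A
        xs = a + b * s + s * s
        xt = a + b * t + t * t
        sum_s += xs
        sum_t += xt
        sum_st += xs * xt
    mean_s = sum_s / n_samples
    mean_t = sum_t / n_samples
    corr = sum_st / n_samples
    return mean_s, mean_t, corr, corr - mean_s * mean_t


if __name__ == "__main__":
    s, t = 0.5, 1.0
    print("closed form:", mu(s), mu(t), R(s, t), C(s, t))
    print("estimates:  ", estimate_moments(s, t))
```

The estimates fluctuate from seed to seed, so agreement should only be expected up to Monte Carlo error.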
As for the densities, for each $t$ fixed, $X_t$ is a linear combination of two independent Gaussian random variables, and $X_t$ has mean $\mu_X(t) = t^2$ and variance $\mathrm{Var}(X_t) = C_X(t, t) = 1 + t^2$. Thus, $X_t$ is a $N(t^2, 1 + t^2)$ random variable. That specifies the first order pdf $f_{X,1}$ well enough, but if one insists on writing it out in all detail it is given by
$$f_{X,1}(x, t) = \frac{1}{\sqrt{2\pi(1 + t^2)}} \exp\left(-\frac{(x - t^2)^2}{2(1 + t^2)}\right).$$

Figure 4.2: Typical sample paths.

For $s$ and $t$ fixed distinct numbers, $X_s$ and $X_t$ are jointly Gaussian and their covariance matrix is given by
$$\mathrm{Cov}\begin{pmatrix} X_s \\ X_t \end{pmatrix} = \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}.$$
The determinant of this matrix is $(s - t)^2$, which is nonzero. Thus $X$ has a second order pdf $f_{X,2}$. For most purposes, we have already written enough about $f_{X,2}$ for this example, but in full detail it is given by
$$f_{X,2}(x, s; y, t) = \frac{1}{2\pi|s - t|} \exp\left(-\frac{1}{2} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix}^T \begin{pmatrix} 1 + s^2 & 1 + st \\ 1 + st & 1 + t^2 \end{pmatrix}^{-1} \begin{pmatrix} x - s^2 \\ y - t^2 \end{pmatrix}\right).$$
The $n$th order distributions of $X$ for this example are joint Gaussian distributions, but densities don't exist for $n \geq 3$ because $X_{t_1}$, $X_{t_2}$, and $X_{t_3}$ are linearly dependent for any $t_1, t_2, t_3$.

A random process $(X_t : t \in T)$ is said to be Gaussian if the random variables $X_t : t \in T$ comprising the process are jointly Gaussian. The process $X$ in the example just discussed is Gaussian. All the finite order distributions of a Gaussian random process $X$ are determined by the mean function $\mu_X$ and autocorrelation function $R_X$. Indeed, for any finite subset $\{t_1, t_2, \ldots, t_n\}$ of $T$, $(X_{t_1}, \ldots, X_{t_n})^T$ is a Gaussian vector with mean $(\mu_X(t_1), \ldots, \mu_X(t_n))^T$ and covariance matrix with $ij$th element $C_X(t_i, t_j) = R_X(t_i, t_j) - \mu_X(t_i)\mu_X(t_j)$. Two or more random processes are said to be jointly Gaussian if all the random variables comprising the processes are jointly Gaussian.
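The statement that a Gaussian process is specified at finitely many times by a mean vector and a covariance matrix can be illustrated with the process $X_t = A + Bt + t^2$ of Example 4.1.3, for which $C_X(t_i, t_j) = 1 + t_i t_j$. The sketch below builds that covariance matrix for a chosen set of times and evaluates its determinant in exact rational arithmetic. For two distinct times it should return $(s - t)^2$, and for three or more times it should return zero, consistent with the remark that densities do not exist for $n \geq 3$. The function names `cov_matrix` and `det` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction


def cov_matrix(times):
    """Covariance matrix whose (i, j) entry is C_X(t_i, t_j) = 1 + t_i * t_j."""
    ts = [Fraction(t) for t in times]
    return [[1 + ti * tj for tj in ts] for ti in ts]


def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        # Minor: delete row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


if __name__ == "__main__":
    print("n = 2, times (1, 3):", det(cov_matrix([1, 3])))
    print("n = 3, times (0, 1, 2):", det(cov_matrix([0, 1, 2])))
```

Exact fractions are used instead of floating point so that a singular covariance matrix shows up as a determinant of exactly zero rather than a small rounding residue.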
RANDOM PROCESSES Example 4.1.4 Let U = (Uk : k ∈ Z) be a random process such that the random variables 1 Uk : k ∈ Z are independent, and P [Uk = 1] = P [Uk = −1] = 2 for all k. Let X = (Xt : t ∈ R) be the random process obtained by letting Xt = Un for n ≤ t < n + 1 for any n. Equivalently, Xt = U t . A sample path of U and a corresponding sample path of X are shown in Figure 4.3. Both random processes have zero mean, so their covariance functions are equal to their correlation Uk Xt k t Figure 4.3: Typical sample paths. function and are given by RU (k, l) = 1 if k = l 0 else RX (s, t) = 1 if s = t 0 else . The random variables of U are discrete, so the nth order pmf of U exists for all n. It is given by pU,n (x1 , k1 ; . . . ; xn , kn ) = 2−n if (x1 , . . . , xn ) ∈ {−1, 1}n 0 else for distinct integers k1 , . . . , kn . The nth order pmf of X exists for the same reason, but it is a bit more difficult to write down. In particular, the joint pmf of Xs and Xt depends on whether s = t . If s = t then Xs = Xt and if s = t then Xs and Xt are independent. Therefore, the second order pmf of X is given as follows: 1 2 if t1 = t2 and either x1 = x2 = 1 or x1 = x2 = −1 1 if t1 = t2 and x1 , x2 ∈ {−1, 1} fX,2 (x1 , t1 ; x2 , t2 ) = 4 0 else. 4.2 Random walks and gambler’s ruin Suppose p is given with 0 < p < 1. Let W1 , W2 , . . . be independent random variables with P {Wi = 1} = p and P {Wi = −1} = 1 − p for i ≥ 1. Suppose X0 is an integer valued random variable independent of (W1 , W2 , . . .), and for n ≥ 1, define Xn by Xn = X0 + W1 + · · · + Wn . A sample path of X = (Xn : n ≥ 0) is shown in Figure 4.4. The random process X is called a random walk. Write Pk and Ek for conditional probabilities and conditional expectations given that X0 = k. For example, Pk [A] = P [A | X0 = k] for any event A. Let us summarize some of the basic properties of X. While the random walk (Xn : n ≥ 0) continues on forever. • Ek [Xn ] = k + n(2p − 1).s. 
A simple idea allows us to compute the ruin probability. In fact. Xn −n(2p−1) √ 4np(1−p) • limn→∞ Pk ≤c = Φ(c). Interpret Xn as the number of units of money a gambler has at time n. and m. • Vark (Xn ) = Var(k + W1 + · · · + Wn ) = 4np(1 − p). and then to recognize that after the first step is taken.s. The so-called Gambler’s Ruin problem is a nice example of the calculation of a probability involving the joint distributions of the random walk X. • limn→∞ Xn n = 2p − 1 (a. only the strong law of large numbers. under Pk . . and suppose the gambler has a goal of accumulating b units of money for some positive integer b ≥ k. k fixed). The gambler’s ruin probability is Pk [Rb ]. depends on the joint distribution of the Xn ’s.4.4: A typical sample path.s.2. Let Rb denote the event that the gambler is eventually ruined. convergence in the third property listed. The idea is to condition on the value of the first step W1 . the conditional probability of ruin is the same as the unconditional probability of ruin for initial wealth k + W1 . RANDOM WALKS AND GAMBLER’S RUIN 109 Xn (ω) b k n Figure 4. meaning the random walk reaches zero without first reaching b. n j • Pk {Xn = k + j − (n − j)} = pj (1 − p)n−j for 0 ≤ j ≤ n. giving the a. we are only interested in it until it hits either 0 or b. Almost all the properties listed are properties of the one dimensional distributions of X. Assume that the initial wealth k satisfies k ≥ 0. . as n → ∞. . Xtn ] = Xtn or. Clearly r0 = 1 and rb = 0. as n → ∞. equivalently. This yields b − 1 linear equations for the b − 1 unknowns r1 . . quadratic equation θ = pθ The roots are 1 and 1−p . b 2 Note that. Let R be the event that the gambler is eventually ruined. we seek a solution of the form rk = Aθ1 +Bθ2 . that Xn → +∞ a.s. where θ1 and θ2 are the two roots of the 2 2 + (1 − p) and A. 2 n This implies. Xn → 2p − 1 a. now. For 1 ≤ k ≤ b − 1. k k If p = 1 . . E[Xtn+1 | Xt1 . Focus. 
The events Rb increase with b because if b is larger the gambler has more possibilities to be ruined before accumulating b units of money: Rb ⊂ Rb+1 ⊂ · · · and R = ∪∞ Rb . condition on W1 to yield rk = Pk {W1 = 1}Pk [Rb | W1 = 1] + Pk {W1 = −1}Pk [Rb | W1 = −1] or rk = prk+1 + (1 − p)rk−1 . . RANDOM PROCESSES Let rk = Pk [Rb ] for 0 ≤ k ≤ b. . . that if p = 1 p 2 1−p p k − 1−p p rk = 1− 1−p p b b 0 ≤ k ≤ b. .s. he’ll have b units with probability k b and zero units otherwise. Xtn ] = 0. on the case that p > 1 . so rk is the ruin probability for the gambler with initial wealth k and target wealth b. Thus. . Using the boundary conditions r0 = 1 and rb = 0. . we find that rk = 1 − k in case p = 1 . unless the gambler is ruined in finite time. in particular. interestingly enough.3 Processes with independent increments and martingales The increment of a random process X = (Xt : t ∈ T) over an interval [a. E[Xtn+1 − Xtn | Xt1 . after the gambler stops playing. Xtn − Xtn−1 are mutually independent. Therefore by b=k the countable additivity of probability. k. Pk [R] = lim Pk [Rb ] = lim rk = 1−p p k b→∞ b→∞ . . Thus. . the increments Xt1 − Xt0 . A random process (Xt : t ∈ T) is called a martingale if E[Xt ] is finite for all t and for any positive integer n and t1 < t2 < · · · < tn < tn+1 . . . B are selected to meet the two boundary conditions. b] is the random variable Xb − Xa . the probability of eventual ruin decreases geometrically with the initial wealth k. Thus. his expected wealth after completing the game is equal to his initial capital. 1 If p = 2 the equations become rk = 1 {rk−1 + rk+1 } so that rk = A + Bk for some constants A 2 and B. his capital converges to infinity. . . rb−1 . A random process is said to have independent increments if for any positive integer n and any t0 < t1 < · · · < tn in T.110 CHAPTER 4. 4. . and finding A and B yields. By the law of large numbers. . Xtn − Xtn−1 .1 W has independent increments. . . E 0≤k≤n max Xk 4. . X1 . 
.2 Wt − Ws has the N (0. such that. Suppose the gambler makes bets with various odds. X2 . . BROWNIAN MOTION 111 If tn is interpreted as the present time.3 P [Wt is a continuous function of t] = 1. without proof. . . . . .4. .4. also called a Wiener process. Proposition 4. and if p = 2 it is also a martingale. The following proposition is stated. or in other words.3. the bets made are all for fair games in which the expected net gains are zero. . . Then if Xt denotes the wealth of the gambler at any time t ≥ 0. . . .0 P {W0 = 0} = 1. . given the past and present values of the process. with X0 equal to a constant and with mean zero increments. . then tn+1 is a future time and the value of (Xt1 . . . be nonnegative random variables such that E[Xk+1 | X0 . Xtn ) represents information about the past and present values of X. as far as the past history of X can determine. Xtn ) is a function of the increments Xt1 − X0 . as we now show. Let t1 < · · · < tn+1 be in T. Then (Xt1 . The random walk (Xn : n ≥ 0) arising in the gambler’s ruin problem is an independent increment 1 process. to give an indication of some of the useful deductions that follow from the martingale property. Thus E[Xtn+1 − Xtn | Xt1 . σ 2 (t − s)) distribution for t ≥ s. then (Xt : t ≥ 0) is a martingale. . W is sample path continuous with probability one. B.1 (a) Let X0 . Then 2 2 ≤ 4E[Xn ]. Then X is a martingale. Suppose a gambler has initial wealth X0 . An example of a martingale is the following. . . Xt2 − Xt1 . Xk ] ≤ Xk for k ≥ 0 (such X is a nonnegative supermartingale). be a martingale sequence with E[Xn ] < +∞ for some n. Xtn ] = E[Xtn+1 − Xtn ] = 0.4 Brownian motion A Brownian motion. the martingale property is that the future increments of X have conditional mean zero. and hence it is independent of the increment Xtn+1 − Xtn . is a random process W = (Wt : t ≥ 0) such that B. . B. X1 . With this interpretation. B. 
Suppose (Xt ) is an independent increment process with index set T = R+ or T = Z+ . γ 0≤k≤n 2 (b) (Doob’s L2 Inequality) Let X0 . . . Then P max Xk ≥γ ≤ E[X0 ] . 3.3 does not come automatically. Wtn ) is a linear combination of the n independent Gaussian random variables (Wti − Wti−1 : 1 ≤ i ≤ n). . then each coordinate of the vector (Wt1 . correlation. W is not a Brownian motion. . Then P {Wt = Wt } = 1 for each t ≥ 0 and W also satisfies B. t) = σ 2 (s ∧ t).0–B. A typical sample path of a Brownian motion is shown in Figure 4. RW (s. The difference between W and W is significant if events involving uncountably many values of t are investigated. . A Brownian motion is Gaussian. and covariance functions of a Brownian motion W are given by µW (t) = E[Wt ] = E[Wt − W0 ] = 0 and. The mean.112 CHAPTER 4. then B. t) = σ 2 (s ∧ t).2 imply that W is a Gaussian random process with µW = 0 and RW (s. t) = RW (s. P {Wt ≤ 1 for 0 ≤ t ≤ 1} = P {Wt ≤ 1 for 0 ≤ t ≤ 1}.0–B. For example. t) = E[Ws Wt ] = E[(Ws − W0 )(Ws − W0 + Wt − Ws )] = E[(Ws − W0 )2 ] = σ 2 s so that. . if W is a Brownian motion and if U is a Unif(0. being a mean zero independent increment process with P {W0 = 0} = 1. but W fails to satisfy B. because if 0 = t0 ≤ t1 ≤ · · · ≤ tn . CW (s. in general. For example. Thus.5. In fact. t) = σ 2 (s ∧ t). Property B. .1) distributed random variable independent of W . If W = (Wt : t ≥ 0) is a Gaussian random process with mean zero and RW (s. let W be defined by Wt = Wt + I{U =t} . RANDOM PROCESSES X (ω ) t t Figure 4. is a martingale. properties B. Thus. the converse is also true. A Brownian motion. for s ≤ t.2 are true.0–B.2.5: A typical sample path of Brownian motion. The interpretation is that f (t) is the number of “counts” observed during the interval (0.4. A random process is called a counting process if with probability one its sample path is a counting function. a Poisson process with rate λ is a random process N = (Nt : t ≥ 0) such that N.5. 
f is right continuous.5. if u1 = t1 and ui = ti − ti−1 for i ≥ 2. . then f can be described by the sequence (ti ). See Figure 4.1 Let λ ≥ 0. If ti denotes the time of the ith count for i ≥ 1. the intercount times. b].2 Let N be a counting process and let λ > 0. The numbers t1 . Or.2 N has independent increments. Proposition 4. defined next. f is nondecreasing.6. A counting process has two corresponding random sequences.6: A counting function. . t]. The following equations clearly hold: ∞ f (t) = n=1 I{t≥tn } tn = min{t : f (t) ≥ n} tn = u1 + · · · + un . u2 . are called f(t) 3 2 1 0 0 u1 t1 u2 t2 u3 t3 t Figure 4. t2 . The following are equivalent: . then f can be described by the sequence (ui ). . and f is integer valued.5 Counting processes and the Poisson process A function f on R+ is called a counting function if f (0) = 0. the sequence of count times and the sequence of intercount times. N.5.1 N is a counting process. By definition. COUNTING PROCESSES AND THE POISSON PROCESS 113 4. The most widely used example of a counting process is a Poisson process. N.3 N (t) − N (s) has the P oi(λ(t − s)) distribution for t ≥ s. are called the count times and the numbers u1 . An increment f (b) − f (a) is the number of counts in the interval (a. Definition 4. . . . 3) factors into the product of n pdfs. . Tn ). . . . and given {Nτ = n}. tn ] are disjoint intervals of R+ . . Unif[0. The volume of the cube is n. (c) For each τ > 0. . Tn ) is in the n-dimensional cube with upper corner t1 . . t1 ]. . t2 ]. . . Let 0 < t1 < t2 < · · · < tn . . . U2 . . . . it has range Rn . tn and sides of length is given by P {Ti ∈ (ti − . So (a) implies (b). . .. because tk = u1 + · · · + uk for 1 ≤ k ≤ n. . Un ) is the image of (T1 . by the formula for the transformation of random vectors (see Section 1. (T1 . −t1 ) ) · · · (λ e−λ ) Therefore (T1 . . . CHAPTER 4. and (c) implies (a). Suppose N is a Poisson process. . Tn ) has the pdf λn e−λtn 0 if 0 < t1 < · · · < tn else. 
Proof. It will be shown that (a) implies (b), (b) implies (c), and (c) implies (a).

(a) implies (b). Suppose $N$ is a Poisson process. The joint pdf of the first $n$ count times $T_1, \ldots, T_n$ can be found as follows. Let $0 < t_1 < t_2 < \cdots < t_n$. Select $\epsilon > 0$ so small that $(t_1 - \epsilon, t_1], (t_2 - \epsilon, t_2], \ldots, (t_n - \epsilon, t_n]$ are disjoint intervals of $\mathbb{R}_+$. Then the probability that $(T_1, \ldots, T_n)$ is in the $n$-dimensional cube with upper corner $t_1, \ldots, t_n$ and sides of length $\epsilon$ is given by
\[ \begin{aligned} P\{T_i \in (t_i - \epsilon, t_i] \text{ for } 1 \le i \le n\} &= P\{N_{t_1 - \epsilon} = 0,\ N_{t_1} - N_{t_1 - \epsilon} = 1,\ N_{t_2 - \epsilon} - N_{t_1} = 0,\ \ldots,\ N_{t_n} - N_{t_n - \epsilon} = 1\} \\ &= (e^{-\lambda(t_1 - \epsilon)})(\lambda\epsilon e^{-\lambda\epsilon})(e^{-\lambda(t_2 - \epsilon - t_1)}) \cdots (\lambda\epsilon e^{-\lambda\epsilon}) \\ &= (\lambda\epsilon)^n e^{-\lambda t_n}. \end{aligned} \]
The volume of the cube is $\epsilon^n$. Therefore $(T_1, \ldots, T_n)$ has the pdf
\[ f_{T_1 \cdots T_n}(t_1, \ldots, t_n) = \begin{cases} \lambda^n e^{-\lambda t_n} & \text{if } 0 < t_1 < \cdots < t_n \\ 0 & \text{else.} \end{cases} \tag{4.2} \]
The vector $(U_1, \ldots, U_n)$ is the image of $(T_1, \ldots, T_n)$ under the mapping $(t_1, \ldots, t_n) \to (u_1, \ldots, u_n)$ defined by $u_1 = t_1$, $u_k = t_k - t_{k-1}$ for $k \ge 2$. The mapping is invertible, because $t_k = u_1 + \cdots + u_k$ for $1 \le k \le n$, it has range $\mathbb{R}^n_+$, and the Jacobian
\[ \frac{\partial u}{\partial t} = \begin{pmatrix} 1 & & & & \\ -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \]
has unit determinant. Therefore, by the formula for the transformation of random vectors (see Section 1.10),
\[ f_{U_1 \ldots U_n}(u_1, \ldots, u_n) = \begin{cases} \lambda^n e^{-\lambda(u_1 + \cdots + u_n)} & u \in \mathbb{R}^n_+ \\ 0 & \text{else.} \end{cases} \tag{4.3} \]
The joint pdf in (4.3) factors into the product of $n$ pdfs, with each pdf being for an $Exp(\lambda)$ random variable. Thus the intercount times $U_1, U_2, \ldots$ are independent and each is exponentially distributed with parameter $\lambda$. So (a) implies (b).
.τ . . nk be nonnegative integers. n ≥ 1. . and scaling it up by the factor 1/P {Nτ = n} on An. then the density in (4. is pn1 · · · pnk . given that {Nτ = n}. . is obtained for each (t1 . . . . by (c). Thus. Exp(λ) random variables. . Equivalently. . . The number of ways to assign n counts to the intervals such that there are ki counts 1 k n n! in the ith interval is n1 ··· nk = n1 !···nk ! . Thus. . corresponding to one of the orderings. . . . . . ∞).2). . tn ) by integrating the density in (4. . Suppose that N is a counting process such that the intercount times U1 .4) The joint density of (T1 . .4. U2 . Since the + density must integrate to one. . ni particular counts fall into the ith interval. namely λn+1 e−λ(tn+1 ) on {t ∈ Rn+1 : 0 < t1 < · · · < tn+1 }. . has volume τ n /n!.5) is constant over the set {t ∈ Rn : 0 < t1 < · · · < tn ≤ τ }. This thus gives rise to what is known as a multinomial . Fix τ > 0 and an integer n ≥ 1. the pdf of the first n count times. given that {Nτ = n}. . the density in (4. . Tn+1 ) ∈ An. Integrating (4.4) with respect to tn+1 over R. P {Nτ = 0} = P {U1 > τ } = e τ n −λτ (c) implies (a).τ . Set n = n1 + . . If 0 < t1 < · · · < tn ≤ τ . . . the first n intercount times have the joint pdf given in (4. . . tn+1 |Nτ = n) = 0 0 < t1 < · · · < tn ≤ τ < tn+1 else (4. The event {Nτ = n} is equivalent to the event (T1 . . (T1 . It can be partitioned into n! equal volume subsets corresponding to the n! possible orderings of the numbers t1 . is given by (4. . . τ ]n in Rn has volume τ n . The probability that. The conditional pdf of (T1 . . . COUNTING PROCESSES AND THE POISSON PROCESS 115 (b) implies (c). If 0 < t1 < · · · < tn ≤ τ does not hold. Hence. Given there are n counts in the interval [0. These implications are for −λτ . falling into the ith subinterval with probability pi . .4) with respect to tn+1 over (τ. . + nk and pi = (ti − ti−1 )/tk for 1 ≤ i ≤ k. for n ≥ 1. . Suppose (c) is true. Tn ). 
(c) implies (a). Suppose $0 = t_0 < t_1 < \cdots < t_k$ and let $n_1, \ldots, n_k$ be nonnegative integers. Set $n = n_1 + \cdots + n_k$ and $p_i = (t_i - t_{i-1})/t_k$ for $1 \le i \le k$. Suppose (c) is true. Given there are $n$ counts in the interval $[0, t_k]$, by (c) applied with $\tau = t_k$, the distribution of the numbers of counts in each subinterval is as if each of the $n$ counts is thrown into a subinterval at random, falling into the $i$th subinterval with probability $p_i$. The probability that, for $1 \le i \le k$, $n_i$ particular counts fall into the $i$th interval, is $p_1^{n_1} \cdots p_k^{n_k}$. The number of ways to assign $n$ counts to the intervals such that there are $n_i$ counts in the $i$th interval is $\binom{n}{n_1 \cdots n_k} = \frac{n!}{n_1! \cdots n_k!}$. This thus gives rise to what is known as a multinomial distribution for the numbers of counts per interval. We have
\[ \begin{aligned} P\{N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \le i \le k\} &= P\{N(t_k) = n\}\, P[N(t_i) - N(t_{i-1}) = n_i \text{ for } 1 \le i \le k \mid N(t_k) = n] \\ &= \frac{(\lambda t_k)^n e^{-\lambda t_k}}{n!} \binom{n}{n_1 \cdots n_k} p_1^{n_1} \cdots p_k^{n_k} \\ &= \prod_{i=1}^{k} \frac{(\lambda(t_i - t_{i-1}))^{n_i} e^{-\lambda(t_i - t_{i-1})}}{n_i!} \end{aligned} \]
Therefore the increments $N(t_i) - N(t_{i-1})$, $1 \le i \le k$, are independent, with $N(t_i) - N(t_{i-1})$ being a Poisson random variable with mean $\lambda(t_i - t_{i-1})$, for $1 \le i \le k$. So (a) is proved.

A Poisson process is not a martingale. However, if $\widetilde{N}$ is defined by $\widetilde{N}_t = N_t - \lambda t$, then $\widetilde{N}$ is an independent increment process with mean 0 and $\widetilde{N}_0 = 0$. Thus, $\widetilde{N}$ is a martingale. Note that $\widetilde{N}$ has the same mean and covariance function as a Brownian motion with $\sigma^2 = \lambda$, which shows how little one really knows about a process from its mean function and correlation function alone.
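The last step of the proof that (c) implies (a) rests on an algebraic identity: a Poisson probability for the total count times a multinomial probability equals a product of Poisson probabilities. A small numerical check of that identity is sketched below; the function names and the particular values of $\lambda$, $t_i$, and $n_i$ are illustrative assumptions.

```python
import math

def poisson_pmf(mean, n):
    """P{Poi(mean) = n}."""
    return mean ** n * math.exp(-mean) / math.factorial(n)

def joint_via_conditioning(rate, times, counts):
    """P{N(t_k) = n} times the multinomial probability of the per-interval counts."""
    t_k = times[-1]
    n = sum(counts)
    multinomial = math.factorial(n)
    prob = 1.0
    prev = 0.0                                  # t_0 = 0
    for t_i, n_i in zip(times, counts):
        multinomial //= math.factorial(n_i)     # builds n! / (n_1! ... n_k!)
        prob *= ((t_i - prev) / t_k) ** n_i     # p_i ** n_i
        prev = t_i
    return poisson_pmf(rate * t_k, n) * multinomial * prob

def joint_via_independence(rate, times, counts):
    """Product of Poisson pmfs with means rate * (t_i - t_{i-1})."""
    prob = 1.0
    prev = 0.0
    for t_i, n_i in zip(times, counts):
        prob *= poisson_pmf(rate * (t_i - prev), n_i)
        prev = t_i
    return prob

lhs = joint_via_conditioning(1.5, [0.5, 2.0, 3.0], [1, 3, 0])
rhs = joint_via_independence(1.5, [0.5, 2.0, 3.0], [1, 3, 0])
```

The two values agree up to floating point rounding, which mirrors the cancellation of $n!$ and the factorization of $e^{-\lambda t_k}$ in the displayed computation.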
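4.6 Stationarity

Consider a random process $X = (X_t : t \in T)$ such that either $T = \mathbb{Z}$ or $T = \mathbb{R}$. Then $X$ is said to be stationary if for any $t_1, \ldots, t_n$ and $s$ in $T$, the random vectors $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+s}, \ldots, X_{t_n+s})$ have the same distribution. In other words, the joint statistics of $X$ of all orders are unaffected by a shift in time. The condition of stationarity of $X$ can also be expressed in terms of the CDF's of $X$: $X$ is stationary if for any $n \ge 1$, $s, t_1, \ldots, t_n \in T$, and $x_1, \ldots, x_n \in \mathbb{R}$,
\[ F_{X,n}(x_1, t_1; \ldots; x_n, t_n) = F_{X,n}(x_1, t_1 + s; \ldots; x_n, t_n + s). \]

Suppose $X$ is a stationary second order random process. (Recall that second order means that $E[X_t^2] < \infty$ for all $t$.) Then by the $n = 1$ part of the definition of stationarity, $X_t$ has the same distribution for all $t$. In particular, $\mu_X(t)$ and $E[X_t^2]$ do not depend on $t$. Moreover, by the $n = 2$ part of the definition $E[X_{t_1} X_{t_2}] = E[X_{t_1+s} X_{t_2+s}]$ for any $s \in T$. If $E[X_t^2] < +\infty$ for all $t$, then $E[X_{t+s}]$ and $R_X(t_1 + s, t_2 + s)$ are finite and both do not depend on $s$.

A second order random process $(X_t : t \in T)$ with $T = \mathbb{Z}$ or $T = \mathbb{R}$ is called wide sense stationary (WSS) if
\[ \mu_X(t) = \mu_X(s + t) \quad \text{and} \quad R_X(t_1, t_2) = R_X(t_1 + s, t_2 + s) \]
for all $t, s, t_1, t_2 \in T$. As shown above, a stationary second order random process is WSS. Wide sense stationarity means that $\mu_X(t)$ is a finite number, not depending on $t$, and $R_X(t_1, t_2)$ depends on $t_1, t_2$ only through the difference $t_1 - t_2$. By a convenient and widely accepted abuse of notation, if $X$ is WSS, we use $\mu_X$ to be the constant and $R_X$ to be the function of one real variable such that
\[ E[X_t] = \mu_X \qquad t \in T \]
\[ E[X_{t_1} X_{t_2}] = R_X(t_1 - t_2) \qquad t_1, t_2 \in T. \]
The dual use of the notation $R_X$ if $X$ is WSS leads to the identity $R_X(t_1, t_2) = R_X(t_1 - t_2)$. As a practical matter, this means replacing a comma by a minus sign. Since one interpretation of $R_X$ requires it to have two arguments, and the other interpretation requires only one argument, the interpretation is clear from the number of arguments. Some brave authors even skip mentioning that $X$ is WSS when they write: "Suppose $(X_t : t \in \mathbb{R})$ has mean $\mu_X$ and correlation function $R_X(\tau)$," because it is implicit in this statement that $X$ is WSS.

Since the covariance function $C_X$ of a random process $X$ satisfies
\[ C_X(t_1, t_2) = R_X(t_1, t_2) - \mu_X(t_1)\mu_X(t_2), \]
if $X$ is WSS then $C_X(t_1, t_2)$ is a function of $t_1 - t_2$. The notation $C_X$ is also used to denote the function of one variable such that $C_X(t_1 - t_2) = \mathrm{Cov}(X_{t_1}, X_{t_2})$. Therefore, if $X$ is WSS then $C_X(t_1 - t_2) = C_X(t_1, t_2)$. Also, $C_X(\tau) = R_X(\tau) - \mu_X^2$, where in this equation $\tau$ should be thought of as the difference of two times, $t_1 - t_2$.

In general, there is much more to know about a random vector or a random process than the first and second moments. Therefore, one can mathematically define WSS processes that are spectacularly different in appearance from any stationary random process. For example, any random process $(X_k : k \in \mathbb{Z})$ such that the $X_k$ are independent with $E[X_k] = 0$ and $\mathrm{Var}(X_k) = 1$ for all $k$ is WSS. To be specific, we could take the $X_k$ to be independent, with $X_k$ being $N(0, 1)$ for $k \le 0$ and with $X_k$ having pmf
\[ p_{X,1}(x, k) = P\{X_k = x\} = \begin{cases} \dfrac{1}{2k^2} & x \in \{k, -k\} \\ 1 - \dfrac{1}{k^2} & \text{if } x = 0 \\ 0 & \text{else} \end{cases} \]
for $k \ge 1$. A typical sample path of this WSS random process is shown in Figure 4.7.

[Figure 4.7: A typical sample path.]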
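That the pmf above has mean zero and unit variance for every $k \ge 1$, even though $X_k$ is nonzero with probability only $1/k^2$, can be confirmed with exact rational arithmetic. This is a small illustrative sketch; the function names are not from the notes.

```python
from fractions import Fraction

def pmf(k):
    """pmf of X_k for k >= 1: mass 1/(2k^2) at each of +k and -k, the rest at 0."""
    tail = Fraction(1, 2 * k * k)
    return {k: tail, -k: tail, 0: 1 - 2 * tail}

def mean_and_variance(k):
    p = pmf(k)
    mean = sum(x * px for x, px in p.items())
    var = sum(x * x * px for x, px in p.items()) - mean ** 2
    return mean, var

# Exact moments for k = 1, ..., 5, and the probability that X_k is nonzero.
moments = [mean_and_variance(k) for k in range(1, 6)]
prob_nonzero = [1 - pmf(k)[0] for k in range(1, 6)]
```

Every entry of `moments` equals `(0, 1)`, while `prob_nonzero` decays like $1/k^2$: the rare values $\pm k$ grow just fast enough to keep the variance at one, so the sample paths look nothing like those of a stationary process.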
The situation is much different if $X$ is a Gaussian process. Indeed, suppose $X$ is Gaussian and WSS. Then for any $t_1, t_2, \ldots, t_n, s \in T$, the random vector $(X_{t_1+s}, X_{t_2+s}, \ldots, X_{t_n+s})^T$ is Gaussian with mean $(\mu, \mu, \ldots, \mu)^T$ and covariance matrix with $ij$th entry $C_X((t_i + s) - (t_j + s)) = C_X(t_i - t_j)$. This mean and covariance matrix do not depend on $s$. Thus, the distribution of the vector does not depend on $s$. Therefore, $X$ is stationary.

In summary, if $X$ is stationary then $X$ is WSS, and if $X$ is both Gaussian and WSS, then $X$ is stationary.

Example 4.6.1 Let $X_t = A\cos(\omega_c t + \Theta)$, where $\omega_c$ is a nonzero constant, $A$ and $\Theta$ are independent random variables with $P\{A > 0\} = 1$ and $E[A^2] < +\infty$. Each sample path of the random process $(X_t : t \in \mathbb{R})$ is a pure sinusoidal function at frequency $\omega_c$ radians per unit time, with amplitude $A$ and phase $\Theta$.

We address two questions. First, what additional assumptions, if any, are needed on the distributions of $A$ and $\Theta$ to imply that $X$ is WSS? Second, we consider two distributions for $\Theta$ which each make $X$ WSS, and see if they make $X$ stationary.

To address whether $X$ is WSS, the mean and correlation functions can be computed as follows. Since $A$ and $\Theta$ are independent and since $\cos(\omega_c t + \Theta) = \cos(\omega_c t)\cos(\Theta) - \sin(\omega_c t)\sin(\Theta)$,
\[ \mu_X(t) = E[A]\left(E[\cos(\Theta)]\cos(\omega_c t) - E[\sin(\Theta)]\sin(\omega_c t)\right). \]
Thus, the function $\mu_X(t)$ is a linear combination of $\cos(\omega_c t)$ and $\sin(\omega_c t)$. The only way such a linear combination can be independent of $t$ is if the coefficients of both $\cos(\omega_c t)$ and $\sin(\omega_c t)$ are zero (in fact, it is enough to equate the values of $\mu_X(t)$ at $\omega_c t = 0$, $\frac{\pi}{2}$, and $\pi$). Therefore, $\mu_X(t)$ does not depend on $t$ if and only if $E[\cos(\Theta)] = E[\sin(\Theta)] = 0$.

Turning next to $R_X$, using the trigonometric identity $\cos(a)\cos(b) = (\cos(a - b) + \cos(a + b))/2$ yields
\[ \begin{aligned} R_X(s, t) &= E[A^2]\,E[\cos(\omega_c s + \Theta)\cos(\omega_c t + \Theta)] \\ &= \frac{E[A^2]}{2}\left\{\cos(\omega_c(s - t)) + E[\cos(\omega_c(s + t) + 2\Theta)]\right\}. \end{aligned} \]
Since $s + t$ can be arbitrary for $s - t$ fixed, in order that $R_X(s, t)$ be a function of $s - t$ alone it is necessary that $E[\cos(\omega_c(s + t) + 2\Theta)]$ be a constant, independent of the value of $s + t$. Arguing just as in the case of $\mu_X$, with $\Theta$ replaced by $2\Theta$, yields that $R_X(s, t)$ is a function of $s - t$ if and only if $E[\cos(2\Theta)] = E[\sin(2\Theta)] = 0$.

Combining the findings for $\mu_X$ and $R_X$ yields that $X$ is WSS if and only if
\[ E[\cos(\Theta)] = E[\sin(\Theta)] = E[\cos(2\Theta)] = E[\sin(2\Theta)] = 0. \]
There are many distributions for $\Theta$ in $[0, 2\pi]$ such that the four moments specified are zero. Two possibilities are (a) $\Theta$ is uniformly distributed on the interval $[0, 2\pi]$, or, (b) $\Theta$ is a discrete random variable, taking the four values $0, \frac{\pi}{2}, \pi, \frac{3\pi}{2}$ with equal probability. Is $X$ stationary for either possibility?

We shall show that $X$ is stationary if $\Theta$ is uniformly distributed over $[0, 2\pi]$. Stationarity means that for any fixed constant $s$, the random processes $(X_t : t \in \mathbb{R})$ and $(X_{t+s} : t \in \mathbb{R})$ have the same finite order distributions. For this example,
\[ X_{t+s} = A\cos(\omega_c(t + s) + \Theta) = A\cos(\omega_c t + \widetilde{\Theta}) \]
where $\widetilde{\Theta} = ((\omega_c s + \Theta) \bmod 2\pi)$. By Example 1.4.4, $\widetilde{\Theta}$ is again uniformly distributed on the interval $[0, 2\pi]$. Thus $(A, \Theta)$ and $(A, \widetilde{\Theta})$ have the same joint distribution, so $A\cos(\omega_c t + \Theta)$ and $A\cos(\omega_c t + \widetilde{\Theta})$ have the same finite order distributions. Hence, $X$ is indeed stationary if $\Theta$ is uniformly distributed over $[0, 2\pi]$.

Assume now that $\Theta$ takes on each of the values of $0, \frac{\pi}{2}, \pi$, and $\frac{3\pi}{2}$ with equal probability. Is $X$ stationary? If $X$ were stationary then, in particular, $X_t$ would have the same distribution for all $t$. On one hand, $P\{X_0 = 0\} = P\{\Theta = \frac{\pi}{2} \text{ or } \Theta = \frac{3\pi}{2}\} = \frac{1}{2}$. On the other hand, if $\omega_c t$ is not an integer multiple of $\frac{\pi}{2}$, then $\omega_c t + \Theta$ cannot be an integer multiple of $\frac{\pi}{2}$, so $P\{X_t = 0\} = 0$. Hence $X$ is not stationary.

(With more work it can be shown that $X$ is stationary, if and only if, $(\Theta \bmod 2\pi)$ is uniformly distributed over the interval $[0, 2\pi]$.)
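For the four-point phase distribution of Example 4.6.1, both halves of the argument are easy to check numerically: the four trigonometric moments vanish (so $X$ is WSS), yet $P\{X_t = 0\}$ depends on $t$ (so $X$ is not stationary). The sketch below is illustrative only; the helper names and the tolerance used to detect a zero of the cosine are assumptions.

```python
import math

# Possibility (b): Theta takes these four values with equal probability.
phases = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]

def expect(g):
    """E[g(Theta)] under the equiprobable four-point distribution."""
    return sum(g(theta) for theta in phases) / len(phases)

# The four moments that must vanish for X to be WSS.
moments = [
    expect(math.cos),
    expect(math.sin),
    expect(lambda th: math.cos(2 * th)),
    expect(lambda th: math.sin(2 * th)),
]

def prob_zero(omega_t, tol=1e-9):
    """P{X_t = 0} when A > 0: probability that cos(omega_c * t + Theta) vanishes."""
    hits = sum(1 for th in phases if abs(math.cos(omega_t + th)) < tol)
    return hits / len(phases)

p_at_0 = prob_zero(0.0)                  # omega_c * t = 0
p_at_quarter = prob_zero(math.pi / 4)    # omega_c * t = pi / 4
```

Here `p_at_0` is 0.5 and `p_at_quarter` is 0.0, so the first order distribution of $X_t$ changes with $t$ even though the mean and correlation functions are shift invariant.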
4.7 Joint properties of random processes

Two random processes $X$ and $Y$ are said to be jointly stationary if their parameter set $T$ is either $\mathbb{Z}$ or $\mathbb{R}$, and if for any $t_1, \ldots, t_n, s \in T$, the distribution of the random vector
\[ (X_{t_1+s}, X_{t_2+s}, \ldots, X_{t_n+s}, Y_{t_1+s}, Y_{t_2+s}, \ldots, Y_{t_n+s}) \]
does not depend on $s$.

The random processes $X$ and $Y$ are said to be jointly Gaussian if all the random variables comprising $X$ and $Y$ are jointly Gaussian.

If $X$ and $Y$ are second order random processes on the same probability space, the cross correlation function, $R_{XY}$, is defined by $R_{XY}(s, t) = E[X_s Y_t]$, and the cross covariance function, $C_{XY}$, is defined by $C_{XY}(s, t) = \mathrm{Cov}(X_s, Y_t)$.

The random processes $X$ and $Y$ are said to be jointly WSS, if $X$ and $Y$ are each WSS, and if $R_{XY}(s, t)$ is a function of $s - t$. If $X$ and $Y$ are jointly WSS, we use $R_{XY}(\tau)$ for $R_{XY}(s, t)$ where $\tau = s - t$, and similarly $C_{XY}(s - t) = C_{XY}(s, t)$. Note that $C_{XY}(s, t) = C_{YX}(t, s)$, so $C_{XY}(\tau) = C_{YX}(-\tau)$.

4.8 Conditional independence and Markov processes

Markov processes are naturally associated with the state space approach for modeling a system. The idea of a state space model for a given system is to define the state of the system at any given time $t$. The state of the system at time $t$ should summarize everything about the system up to and including time $t$ that is relevant to the future of the system. For example, the state of an aircraft at time $t$ could consist of the position, velocity, and remaining fuel at time $t$. Think of $t$ as the present time. The state at time $t$ determines the possible future part of the aircraft trajectory. For example, it determines how much longer the aircraft can fly and where it could possibly land. The state at time $t$ does not completely determine the entire past trajectory of the aircraft. Rather, the state summarizes enough about the system up to the present so that if the state is known, no more information about the past is relevant to the future possibilities. The concept of state is inherent in the Kalman filtering model discussed in Chapter 3. The notion of state is captured for random processes using the notion of conditional independence and the Markov property, which are discussed next.
Let $X, Y, Z$ be random vectors. We shall define the condition that $X$ and $Z$ are conditionally independent given $Y$. Such condition is denoted by $X - Y - Z$. If $X, Y, Z$ are discrete, then $X - Y - Z$ is defined to hold if
\[ P[X = i, Z = k \mid Y = j] = P[X = i \mid Y = j]\,P[Z = k \mid Y = j] \tag{4.6} \]
for all $i, j, k$ with $P\{Y = j\} > 0$. Equivalently, $X - Y - Z$ if
\[ P\{X = i, Y = j, Z = k\}\,P\{Y = j\} = P\{X = i, Y = j\}\,P\{Z = k, Y = j\} \tag{4.7} \]
for all $i, j, k$. Equivalently again, $X - Y - Z$ if
\[ P[Z = k \mid X = i, Y = j] = P[Z = k \mid Y = j] \tag{4.8} \]
for all $i, j, k$ with $P\{X = i, Y = j\} > 0$. The forms (4.6) and (4.7) make it clear that the condition $X - Y - Z$ is symmetric in $X$ and $Z$: thus $X - Y - Z$ is the same condition as $Z - Y - X$. The form (4.7) does not involve conditional probabilities, so no requirement about conditioning events having positive probability is needed. The form (4.8) shows that $X - Y - Z$ means that knowing $Y$ alone is as informative as knowing both $X$ and $Y$, for the purpose of determining conditional probabilities of $Z$. Intuitively, the condition $X - Y - Z$ means that the random variable $Y$ serves as a state.

If $X$, $Y$, and $Z$ have a joint pdf, then the condition $X - Y - Z$ can be defined using the pdfs and conditional pdfs in a similar way. For example, the conditional independence condition $X - Y - Z$ holds by definition if
\[ f_{XZ|Y}(x, z \mid y) = f_{X|Y}(x \mid y)\,f_{Z|Y}(z \mid y) \quad \text{whenever } f_Y(y) > 0. \]
An equivalent condition is
\[ f_{Z|XY}(z \mid x, y) = f_{Z|Y}(z \mid y) \quad \text{whenever } f_{XY}(x, y) > 0. \tag{4.9} \]
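Form (4.7) is convenient for checking conditional independence of a discrete joint pmf by machine, because it needs no division. The following sketch is illustrative and not from the notes: the function `is_cond_indep`, the bit-flipping chain, and the exclusive-or counterexample are all assumed examples.

```python
from fractions import Fraction
from itertools import product

def is_cond_indep(joint):
    """Check X - Y - Z via form (4.7); `joint` maps (i, j, k) to P{X=i, Y=j, Z=k}."""
    xs = {i for i, _, _ in joint}
    ys = {j for _, j, _ in joint}
    zs = {k for _, _, k in joint}

    def p(i=None, j=None, k=None):
        # Marginal probability: sum over the coordinates left as None.
        return sum(pr for (a, b, c), pr in joint.items()
                   if (i is None or a == i)
                   and (j is None or b == j)
                   and (k is None or c == k))

    return all(p(i, j, k) * p(j=j) == p(i=i, j=j) * p(j=j, k=k)
               for i, j, k in product(xs, ys, zs))

def flip(q):
    """Transition pmf of a bit that is flipped with probability q."""
    return {(a, b): (1 - q if a == b else q) for a in (0, 1) for b in (0, 1)}

# A chain X -> Y -> Z: X is a fair bit, Y is a noisy copy of X, Z is a noisy copy of Y.
f_xy, f_yz = flip(Fraction(1, 4)), flip(Fraction(1, 3))
chain = {(i, j, k): Fraction(1, 2) * f_xy[i, j] * f_yz[j, k]
         for i, j, k in product((0, 1), repeat=3)}

# A counterexample: X and Z are independent fair bits and Y = X xor Z.
xor = {(i, i ^ k, k): Fraction(1, 4) for i, k in product((0, 1), repeat=2)}

chain_ok = is_cond_indep(chain)
xor_ok = is_cond_indep(xor)
```

In the counterexample $X$ and $Z$ are independent, but given $Y$ each determines the other, so $X - Y - Z$ fails; this shows that independence and conditional independence are different conditions.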
Example 4.8.1 Suppose $X, Y, Z$ are jointly Gaussian vectors. Let us see what the condition $X - Y - Z$ means in terms of the covariance matrices. Assume without loss of generality that the vectors have mean zero. Because $X$, $Y$, and $Z$ are jointly Gaussian, the condition (4.9) is equivalent to the condition that $E[Z \mid X, Y] = E[Z \mid Y]$ (because given $X, Y$, or just given $Y$, the conditional distribution of $Z$ is Gaussian, and in the two cases the mean and covariance of the conditional distribution of $Z$ is the same). The idea of linear innovations applied to the length two sequence $(Y, X)$ yields $E[Z \mid X, Y] = E[Z \mid Y] + E[Z \mid \widetilde{X}]$ where $\widetilde{X} = X - E[X \mid Y]$. Thus $X - Y - Z$ if and only if $E[Z \mid \widetilde{X}] = 0$, or equivalently, if and only if $\mathrm{Cov}(\widetilde{X}, Z) = 0$. Since $\widetilde{X} = X - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1} Y$, it follows that
\[ \mathrm{Cov}(\widetilde{X}, Z) = \mathrm{Cov}(X, Z) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z). \]
Therefore, $X - Y - Z$ if and only if
\[ \mathrm{Cov}(X, Z) = \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, Z). \tag{4.10} \]
In particular, if $X$, $Y$, and $Z$ are jointly Gaussian random variables with nonzero variances, the condition $X - Y - Z$ holds if and only if the correlation coefficients satisfy $\rho_{XZ} = \rho_{XY}\rho_{YZ}$.
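A scalar instance of (4.10) can be worked out by hand and confirmed in a few lines. The construction below is an assumed illustration, not taken from the notes: $X$, $N_1$, $N_2$ are independent mean zero Gaussian random variables, $Y = X + N_1$, and $Z = Y + N_2$, so $X - Y - Z$ holds by construction; the variances are arbitrary.

```python
import math

# Hypothetical variances of the independent mean zero variables X, N1, N2.
var_x, var_n1, var_n2 = 1.0, 1.0, 2.0

# Y = X + N1 and Z = Y + N2.
var_y = var_x + var_n1
var_z = var_y + var_n2
cov_xy = var_x   # Cov(X, X + N1)
cov_yz = var_y   # Cov(Y, Y + N2)
cov_xz = var_x   # Cov(X, X + N1 + N2)

# Right side of (4.10) in the scalar case.
rhs_410 = cov_xy * (1.0 / var_y) * cov_yz

# Correlation coefficients.
rho_xy = cov_xy / math.sqrt(var_x * var_y)
rho_yz = cov_yz / math.sqrt(var_y * var_z)
rho_xz = cov_xz / math.sqrt(var_x * var_z)
```

With these numbers $\rho_{XY} = \rho_{YZ} = 1/\sqrt{2}$ and $\rho_{XZ} = 1/2$, in agreement with $\rho_{XZ} = \rho_{XY}\rho_{YZ}$.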
A general definition of conditional probabilities and conditional independence, based on the general definition of conditional expectation given in Chapter 3, is given next. Recall that $P[F] = E[I_F]$ for any event $F$, where $I_F$ denotes the indicator function of $F$. If $Y$ is a random vector, we define $P[F \mid Y]$ to equal $E[I_F \mid Y]$. This means that $P[F \mid Y]$ is the unique (in the sense that any two versions are equal with probability one) random variable such that

(1) $P[F \mid Y]$ is a function of $Y$ and it has finite second moments, and
(2) $E[g(Y) P[F \mid Y]] = E[g(Y) I_F]$ for any $g(Y)$ with finite second moment.

Given arbitrary random vectors, we define $X$ and $Z$ to be conditionally independent given $Y$ (written $X - Y - Z$) if for any Borel sets $A$ and $B$,
\[ P[\{X \in A\}\{Z \in B\} \mid Y] = P[X \in A \mid Y]\,P[Z \in B \mid Y]. \]
Equivalently, $X - Y - Z$ holds if for any Borel set $B$, $P[Z \in B \mid X, Y] = P[Z \in B \mid Y]$.

Definition 4.8.2 A random process $X = (X_t : t \in T)$ is said to be a Markov process if for any $t_1, \ldots, t_{n+1}$ in $T$ with $t_1 < \cdots < t_{n+1}$, the following conditional independence condition holds:
\[ (X_{t_1}, \cdots, X_{t_n}) - X_{t_n} - X_{t_{n+1}} \tag{4.11} \]
It turns out that the Markov property is equivalent to the following conditional independence property: For any $t_1, \ldots, t_{n+m}$ in $T$ with $t_1 < \cdots < t_{n+m}$,
\[ (X_{t_1}, \cdots, X_{t_n}) - X_{t_n} - (X_{t_n}, \cdots, X_{t_{n+m}}) \tag{4.12} \]
The definition (4.11) is easier to check than condition (4.12), but (4.12) is appealing because it is symmetric in time. In words, thinking of $t_n$ as the present time, the Markov property means that the past and future of $X$ are conditionally independent given the present state $X_{t_n}$.

Example 4.8.3 (Markov property of independent increment processes) Let $(X_t : t \ge 0)$ be an independent increment process such that $X_0$ is a constant. Then for any $t_1, \ldots, t_{n+1}$ with $0 \le t_1 \le \cdots \le t_{n+1}$, the vector $(X_{t_1}, \ldots, X_{t_n})$ is a function of the $n$ increments $X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_n} - X_{t_{n-1}}$, and is thus independent of the increment $V = X_{t_{n+1}} - X_{t_n}$. But $X_{t_{n+1}}$ is determined by $V$ and $X_{t_n}$. Thus, $X$ is a Markov process. In particular, random walks, Brownian motions, and Poisson processes are Markov processes.
Example 4.8.4 (Gaussian Markov processes) Suppose $X = (X_t : t \in T)$ is a Gaussian random process with $\mathrm{Var}(X_t) > 0$ for all $t$. By the characterization of conditional independence for jointly Gaussian vectors (4.10), the Markov property (4.11) is equivalent to
\[ \mathrm{Cov}\left( \begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_{n+1}} \right) = \mathrm{Cov}\left( \begin{pmatrix} X_{t_1} \\ X_{t_2} \\ \vdots \\ X_{t_n} \end{pmatrix}, X_{t_n} \right) \mathrm{Var}(X_{t_n})^{-1}\,\mathrm{Cov}(X_{t_n}, X_{t_{n+1}}) \]
which, letting $\rho(s, t)$ denote the correlation coefficient between $X_s$ and $X_t$, is equivalent to the requirement
\[ \begin{pmatrix} \rho(t_1, t_{n+1}) \\ \rho(t_2, t_{n+1}) \\ \vdots \\ \rho(t_n, t_{n+1}) \end{pmatrix} = \begin{pmatrix} \rho(t_1, t_n) \\ \rho(t_2, t_n) \\ \vdots \\ \rho(t_n, t_n) \end{pmatrix} \rho(t_n, t_{n+1}). \]
Therefore a Gaussian process $X$ is Markovian if and only if
\[ \rho(r, t) = \rho(r, s)\rho(s, t) \quad \text{whenever } r, s, t \in T \text{ with } r < s < t. \tag{4.13} \]

If $X = (X_k : k \in \mathbb{Z})$ is a discrete-time stationary Gaussian process, then $\rho(s, t)$ may be written as $\rho(k)$, where $k = s - t$. Note that $\rho(k) = \rho(-k)$. Such a process is Markovian if and only if $\rho(k_1 + k_2) = \rho(k_1)\rho(k_2)$ for all positive integers $k_1$ and $k_2$. Therefore, $X$ is Markovian if and only if $\rho(k) = b^{|k|}$ for all $k$, for some constant $b$ with $|b| \le 1$. Equivalently, a stationary Gaussian process $X = (X_k : k \in \mathbb{Z})$ with $\mathrm{Var}(X_k) > 0$ for all $k$ is Markovian if and only if the covariance function has the form $C_X(k) = A b^{|k|}$ for some constants $A$ and $b$ with $A > 0$ and $|b| \le 1$.

Similarly, if $(X_t : t \in \mathbb{R})$ is a continuous-time stationary Gaussian process with $\mathrm{Var}(X_t) > 0$ for all $t$, $X$ is Markovian if and only if $\rho(s + t) = \rho(s)\rho(t)$ for all $s, t \ge 0$. The only bounded real-valued functions satisfying such a multiplicative condition are exponential functions. Therefore, a stationary Gaussian process $X$ with $\mathrm{Var}(X_t) > 0$ for all $t$ is Markovian if and only if $\rho$ has the form $\rho(\tau) = \exp(-\alpha|\tau|)$, for some constant $\alpha \ge 0$, or equivalently, if and only if $C_X$ has the form $C_X(\tau) = A\exp(-\alpha|\tau|)$ for some constants $A > 0$ and $\alpha \ge 0$.
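The multiplicative condition in Example 4.8.4 is simple to test for a given stationary correlation function. The sketch below is illustrative only: the function `satisfies_413`, the value $b = 0.8$, and the triangular correlation (which arises, for example, from the moving average $(Z_k + Z_{k-1} + Z_{k-2})/\sqrt{3}$ of independent unit variance variables) are assumed examples, not taken from the notes.

```python
def satisfies_413(rho, max_lag=6, tol=1e-12):
    """Check rho(k1 + k2) == rho(k1) * rho(k2) for positive integer lags up to max_lag."""
    return all(abs(rho(a + b) - rho(a) * rho(b)) <= tol
               for a in range(1, max_lag + 1)
               for b in range(1, max_lag + 1))

def geometric(k):
    """rho(k) = b^|k| with b = 0.8, the Markovian form."""
    return 0.8 ** abs(k)

def triangular(k):
    """rho(k) = 1 - |k|/3 for |k| <= 3 and 0 otherwise."""
    return max(0.0, 1.0 - abs(k) / 3.0)

geometric_is_markov = satisfies_413(geometric)
triangular_is_markov = satisfies_413(triangular)
```

The triangular correlation fails already at lag 2, since $\rho(2) = 1/3$ while $\rho(1)^2 = 4/9$, so a stationary Gaussian process with that correlation function is not Markovian.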
4.9 Discrete-state Markov processes

This section delves further into the theory of Markov processes in the technically simplest case of a discrete state space. Let $S$ be a finite or countably infinite set, called the state space. Given a probability space $(\Omega, \mathcal{F}, P)$, an $S$ valued random variable is defined to be a function $Y$ mapping $\Omega$ to $S$ such that $\{\omega : Y(\omega) = s\} \in \mathcal{F}$ for each $s \in S$. Assume that the elements of $S$ are ordered so that $S = \{a_1, a_2, \ldots, a_n\}$ in case $S$ has finite cardinality, or $S = \{a_1, a_2, a_3, \ldots\}$ in case $S$ has infinite cardinality. Given the ordering, an $S$ valued random variable is equivalent to a positive integer valued random variable, so it is nothing exotic. Think of the probability distribution of an $S$ valued random variable $Y$ as a row vector of possibly infinite dimension, called a probability vector: $p_Y = (P\{Y = a_1\}, P\{Y = a_2\}, \ldots)$. Similarly think of a deterministic function $g$ on $S$ as a column vector, $g = (g(a_1), g(a_2), \ldots)^T$. Since the elements of $S$ may not even be numbers, it might not make sense to speak of the expected value of an $S$ valued random variable. However, if $g$ is a function mapping $S$ to the reals, then $g(Y)$ is a real-valued random variable and its expectation is given by the inner product of the probability vector $p_Y$ and the column vector $g$: $E[g(Y)] = \sum_{i \in S} p_Y(i) g(i) = p_Y g$. A random process $X = (X_t : t \in T)$ is said to have state space $S$ if $X_t$ is an $S$ valued random variable for each $t \in T$, and the Markov property of such a random process is defined just as it is for a real valued random process.
Let $(X_t : t \in T)$ be a Markov process with state space $S$. For brevity we denote the first order pmf of $X$ at time $t$ as $\pi(t) = (\pi_i(t) : i \in S)$. That is, $\pi_i(t) = p_X(i, t) = P\{X(t) = i\}$. The following notation is used to denote conditional probabilities:
\[ P[X_{t_1} = j_1, \ldots, X_{t_n} = j_n \mid X_{s_1} = i_1, \ldots, X_{s_m} = i_m] = p_X(j_1, t_1; \ldots; j_n, t_n \mid i_1, s_1; \ldots; i_m, s_m). \]
For brevity, conditional probabilities of the form $P[X_t = j \mid X_s = i]$ are written as $p_{ij}(s, t)$, and are called the transition probabilities of $X$.

The first order pmfs $\pi(t)$ and the transition probabilities $p_{ij}(s, t)$ determine all the finite order distributions of the Markov process as follows. Given $t_1 < t_2 < \cdots < t_n$ in $T$ and $i_1, i_2, \ldots, i_n \in S$, one writes
\[ \begin{aligned} p_X(i_1, t_1; \cdots; i_n, t_n) &= p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_X(i_n, t_n \mid i_1, t_1; \cdots; i_{n-1}, t_{n-1}) \\ &= p_X(i_1, t_1; \cdots; i_{n-1}, t_{n-1})\,p_{i_{n-1} i_n}(t_{n-1}, t_n). \end{aligned} \tag{4.14} \]
Application of this operation $n - 2$ more times yields that
\[ p_X(i_1, t_1; \cdots; i_n, t_n) = \pi_{i_1}(t_1)\,p_{i_1 i_2}(t_1, t_2) \cdots p_{i_{n-1} i_n}(t_{n-1}, t_n) \tag{4.15} \]
which shows that the finite order distributions of $X$ are indeed determined by the first order pmfs and the transition probabilities. Equation (4.15) can be used to easily verify that the form (4.12) of the Markov property holds.

Given $s < t$, the collection $H(s, t)$ defined by $H(s, t) = (p_{ij}(s, t) : i, j \in S)$ should be thought of as a matrix, and it is called the transition probability matrix for the interval $[s, t]$. Let $e$ denote the column vector with all ones, indexed by $S$. Since $\pi(t)$ and the rows of $H(s, t)$ are probability vectors, it follows that $\pi(t)e = 1$ and $H(s, t)e = e$. Computing the distribution of $X_t$ by summing over all possible values of $X_s$ yields that $\pi_j(t) = \sum_i P[X_s = i, X_t = j] = \sum_i \pi_i(s) p_{ij}(s, t)$, which in matrix form yields that $\pi(t) = \pi(s)H(s, t)$ for $s, t \in T$, $s \le t$. Similarly, given $s < \tau < t$, computing the conditional distribution of $X_t$ given $X_s$ by summing over all possible values of $X_\tau$ yields
\[ H(s, t) = H(s, \tau)H(\tau, t) \qquad s, \tau, t \in T,\ s < \tau < t. \tag{4.16} \]
The relations (4.16) are known as the Chapman-Kolmogorov equations.
A Markov process is time-homogeneous if the transition probabilities $p_{ij}(s, t)$ depend on $s$ and $t$ only through $t - s$. In that case we write $p_{ij}(t - s)$ instead of $p_{ij}(s, t)$, and $H_{ij}(t - s)$ instead of $H_{ij}(s, t)$. If the Markov process is time-homogeneous, then $\pi(s + \tau) = \pi(s)H(\tau)$ for $s, s + \tau \in T$ and $\tau \ge 0$. A probability distribution $\pi$ is called an equilibrium (or invariant) distribution if $\pi H(\tau) = \pi$ for all $\tau \ge 0$.

Recall that a random process is stationary if its finite order distributions are invariant with respect to translation in time. On one hand, referring to (4.15), we see that a time-homogeneous Markov process is stationary if and only if $\pi(t) = \pi$ for all $t$ for some equilibrium distribution $\pi$. On the other hand, a Markov random process that is stationary is time homogeneous.

Repeated application of the Chapman-Kolmogorov equations yields that $p_{ij}(s, t)$ can be expressed in terms of transition probabilities for $s$ and $t$ close together. For example, consider Markov processes with index set the integers. Then $H(n, k + 1) = H(n, k)P(k)$ for $n \le k$, where $P(k) = H(k, k + 1)$ is the one-step transition probability matrix. Fixing $n$ and using forward recursion starting with $H(n, n) = I$, $H(n, n + 1) = P(n)$, $H(n, n + 2) = P(n)P(n + 1)$, and so forth yields
\[ H(n, l) = P(n)P(n + 1) \cdots P(l - 1). \]
In particular, if the chain is time-homogeneous then $H(k) = P^k$ for all $k$, where $P$ is the time independent one-step transition probability matrix, and $\pi(l) = \pi(k)P^{l-k}$ for $l \ge k$. In this case a probability distribution $\pi$ is an equilibrium distribution if and only if $\pi P = \pi$.
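For a time-homogeneous chain the relations above reduce to statements about matrix powers, which can be checked directly. The sketch below is illustrative: the two-state matrix `P`, the helper functions, and the candidate equilibrium vector are assumptions chosen for the example.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matpow(p, k):
    """H(k) = P^k, computed by the forward recursion H(k + 1) = H(k) P from H(0) = I."""
    result = [[1.0 if i == j else 0.0 for j in range(len(p))] for i in range(len(p))]
    for _ in range(k):
        result = matmul(result, p)
    return result

# Hypothetical one-step transition probability matrix on two states.
P = [[0.9, 0.1],
     [0.4, 0.6]]

H5 = matpow(P, 5)
H2_H3 = matmul(matpow(P, 2), matpow(P, 3))   # Chapman-Kolmogorov: H(5) = H(2) H(3)

# Candidate equilibrium distribution: pi P should equal pi.
pi = [0.8, 0.2]
pi_P = matmul([pi], P)[0]
```

Here `H5` and `H2_H3` agree entry by entry, each row of `H5` sums to one, and `pi_P` reproduces `pi`, so $(0.8, 0.2)$ is an equilibrium distribution for this particular $P$.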
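Example 4.9.1 Consider a two-stage pipeline through which packets flow, as pictured in Figure 4.8. Some assumptions about the pipeline will be made in order to model it as a simple discrete-time Markov process. Each stage has a single buffer. Normalize time so that in one unit of time a packet can make a single transition. Call the time interval between $k$ and $k + 1$ the $k$th "time slot," and assume that the pipeline evolves in the following way during a given slot.

[Figure 4.8: A two-stage pipeline, with arrival probability $a$ and departure probabilities $d_1$ and $d_2$.]

If at the beginning of the slot, there are no packets in stage one, then a new packet arrives to stage one with probability $a$, independently of the past history of the pipeline and of the outcome at stage two.

If at the beginning of the slot, there is a packet in stage one and no packet in stage two, then the packet is transferred to stage two with probability $d_1$.

If at the beginning of the slot, there is a packet in stage two, then the packet departs from the stage and leaves the system with probability $d_2$, independently of the state or outcome of stage one.

These assumptions lead us to model the pipeline as a discrete-time Markov process with the state space $S = \{00, 01, 10, 11\}$, transition probability diagram shown in Figure 4.9 (using the notation $\bar{x} = 1 - x$) and one-step transition probability matrix $P$ given by
\[ P = \begin{pmatrix} \bar{a} & 0 & a & 0 \\ \bar{a} d_2 & \bar{a}\bar{d}_2 & a d_2 & a \bar{d}_2 \\ 0 & d_1 & \bar{d}_1 & 0 \\ 0 & 0 & d_2 & \bar{d}_2 \end{pmatrix} \]

[Figure 4.9: One-step transition probability diagram.]

The rows of $P$ are probability vectors. For example, the first row is the probability distribution of the state at the end of a slot, given that the state is 00 at the beginning of a slot. Now that the model is specified, let us determine the throughput rate of the pipeline.

The equilibrium probability distribution $\pi = (\pi_{00}, \pi_{01}, \pi_{10}, \pi_{11})$ is the probability vector satisfying the linear equation $\pi = \pi P$. Once $\pi$ is found, the throughput rate $\eta$ can be computed as follows. It is defined to be the rate (averaged over a long time) that packets transit the pipeline. Since at most two packets can be in the pipeline at a time, the following three quantities are all clearly the same, and can be taken to be the throughput rate.

The rate of arrivals to stage one
The rate of departures from stage one (or rate of arrivals to stage two)
The rate of departures from stage two

Focus on the first of these three quantities to obtain
\[ \begin{aligned} \eta &= P[\text{an arrival at stage 1}] \\ &= P[\text{an arrival at stage 1} \mid \text{stage 1 empty at slot beginning}]\,P[\text{stage 1 empty at slot beginning}] \\ &= a(\pi_{00} + \pi_{01}). \end{aligned} \]
Similarly, by focusing on departures from stage 1, obtain $\eta = d_1 \pi_{10}$. Finally, by focusing on departures from stage 2, obtain $\eta = d_2(\pi_{01} + \pi_{11})$. These three expressions for $\eta$ must agree.

Consider the numerical example $a = d_1 = d_2 = 0.5$. The equation $\pi = \pi P$ yields that $\pi$ is proportional to the vector $(1, 2, 3, 1)$. Applying the fact that $\pi$ is a probability distribution yields that $\pi = (1/7, 2/7, 3/7, 1/7)$. Therefore $\eta = 3/14 = 0.214\ldots$.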
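The numerical example can be reproduced by iterating $\pi \leftarrow \pi P$ until it settles, and then evaluating the three expressions for $\eta$. The sketch below is one way to do it, not the method used in the notes; the function names and the number of iterations are assumptions, and power iteration is used only because it needs no linear algebra library.

```python
def pipeline_matrix(a, d1, d2):
    """One-step transition matrix of Example 4.9.1, states ordered 00, 01, 10, 11."""
    na, nd1, nd2 = 1 - a, 1 - d1, 1 - d2
    return [
        [na,      0.0,      a,      0.0],       # from 00
        [na * d2, na * nd2, a * d2, a * nd2],   # from 01
        [0.0,     d1,       nd1,    0.0],       # from 10
        [0.0,     0.0,      d2,     nd2],       # from 11
    ]

def equilibrium(P, iterations=2000):
    """Approximate the solution of pi = pi P by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

a = d1 = d2 = 0.5
pi = equilibrium(pipeline_matrix(a, d1, d2))
pi00, pi01, pi10, pi11 = pi

eta_arrivals = a * (pi00 + pi01)        # arrivals to stage one
eta_stage1 = d1 * pi10                  # departures from stage one
eta_stage2 = d2 * (pi01 + pi11)         # departures from stage two
```

All three throughput expressions come out equal to $3/14$, and `pi` matches $(1/7, 2/7, 3/7, 1/7)$ to within rounding.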
In the remainder of this section we assume that $X$ is a continuous-time, finite-state Markov process. The transition probabilities for arbitrary time intervals can be described in terms of the transition probabilities over arbitrarily short time intervals. By saving only a linearization of the transition probabilities, the concept of generator matrix arises naturally, as we describe next.

Let $S$ be a finite set. A pure-jump function for a finite state space $S$ is a function $x : \mathbb{R}_+ \to S$ such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \cdots$ with $\lim_{i \to \infty} \tau_i = \infty$, and a sequence of states with $s_i \ne s_{i+1}$, $i \ge 0$, such that $x(t) = s_i$ for $\tau_i \le t < \tau_{i+1}$. A pure-jump Markov process is an $S$ valued Markov process such that, with probability one, the sample functions are pure-jump functions.

Let $Q = (q_{ij} : i, j \in S)$ be such that
\[ \begin{aligned} q_{ij} &\ge 0 & & i, j \in S,\ i \ne j \\ q_{ii} &= -\sum_{j \in S, j \ne i} q_{ij} & & i \in S. \end{aligned} \tag{4.17} \]
An example for state space $S = \{1, 2, 3\}$ is
\[ Q = \begin{pmatrix} -1 & 0.5 & 0.5 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{pmatrix}, \]
and this matrix $Q$ can be represented by the transition rate diagram shown in Figure 4.10. A pure-jump, time-homogeneous Markov process $X$ has generator matrix $Q$ if the transition probabilities $(p_{ij}(\tau))$ satisfy
\[ \lim_{h \searrow 0} (p_{ij}(h) - I_{\{i=j\}})/h = q_{ij} \qquad i, j \in S \tag{4.18} \]
or equivalently
\[ p_{ij}(h) = I_{\{i=j\}} + h q_{ij} + o(h) \qquad i, j \in S \tag{4.19} \]
where $o(h)$ represents a quantity such that $\lim_{h \to 0} o(h)/h = 0$. For the example this means that the transition probability matrix for a time interval of duration $h$ is given by
\[ \begin{pmatrix} 1 - h & 0.5h & 0.5h \\ h & 1 - 2h & h \\ 0 & h & 1 - h \end{pmatrix} + \begin{pmatrix} o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \\ o(h) & o(h) & o(h) \end{pmatrix}. \]
For small enough $h$, the rows of the first matrix are probability distributions, owing to the assumptions on the generator matrix $Q$.

[Figure 4.10: Transition rate diagram for a continuous-time Markov process.]
Proposition 4.9.2 Given a matrix $Q$ satisfying (4.17), and a probability distribution $\pi(0) = (\pi_i(0) : i \in S)$, there is a pure-jump, time-homogeneous Markov process with generator matrix $Q$ and initial distribution $\pi(0)$. The finite order distributions of the process are uniquely determined by $\pi(0)$ and $Q$.

The first order distributions and the transition probabilities can be derived from $Q$ and an initial distribution $\pi(0)$ by solving differential equations, derived as follows. Fix $t > 0$ and let $h$ be a small positive number. The Chapman-Kolmogorov equations imply that
\[ \frac{\pi_j(t + h) - \pi_j(t)}{h} = \sum_{i \in S} \pi_i(t) \left( \frac{p_{ij}(h) - I_{\{i=j\}}}{h} \right). \tag{4.20} \]
Letting $h$ converge to zero yields the differential equation:
\[ \frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in S} \pi_i(t) q_{ij} \tag{4.21} \]
or, in matrix notation, $\frac{\partial \pi(t)}{\partial t} = \pi(t)Q$. These equations, known as the Kolmogorov forward equations, can be rewritten as
\[ \frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in S, i \ne j} \pi_i(t) q_{ij} - \sum_{i \in S, i \ne j} \pi_j(t) q_{ji}, \tag{4.22} \]
which shows that the rate change of the probability of being at state $j$ is the rate of "probability flow" into state $j$ minus the rate of probability flow out of state $j$.
4.10 Space-time structure of discrete-state Markov processes

The previous section showed that the distribution of a time-homogeneous, discrete-state Markov process can be specified by an initial probability distribution, and either a one-step transition probability matrix $P$ (for discrete-time processes) or a generator matrix $Q$ (for continuous-time processes). Another way to describe these processes is to specify the space-time structure, which is simply the sequences of states visited and how long each state is visited. The space-time structure is discussed first for discrete-time processes, and then for continuous-time processes. One benefit is to show how little difference there is between discrete-time and continuous-time processes.

Let $(X_k : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with one-step transition probability matrix $P$. Let $T_k$ denote the time that elapses between the $k$th and $(k+1)$th jumps of $X$, and let $X^J(k)$ denote the state after $k$ jumps. See Fig. 4.12 for illustration. More precisely, the holding times are defined by

\[
T_0 = \min\{t \geq 0 : X(t) \neq X(0)\}
\tag{4.23}
\]
\[
T_k = \min\{t \geq 0 : X(T_0 + \cdots + T_{k-1} + t) \neq X(T_0 + \cdots + T_{k-1})\}
\tag{4.24}
\]

and the jump process $X^J = (X^J(k) : k \geq 0)$ is defined by

\[
X^J(0) = X(0) \quad \text{and} \quad X^J(k) = X(T_0 + \cdots + T_{k-1}).
\tag{4.25}
\]

Figure 4.12: Illustration of jump process and holding times.

Clearly the holding times and jump process contain all the information needed to construct $X$, and vice versa.
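The definitions (4.23)-(4.25) can be applied mechanically to a finite stretch of a discrete-time sample path. The following Python sketch does so. The helper names and the example path are hypothetical, and because a finite path is cut off before the next jump is observed, the last reported holding time is only a lower bound on the true one.

```python
def holding_times_and_jumps(path):
    """Split a finite sample path into (jump_states, holding_times).

    jump_states[k] plays the role of X^J(k) and holding_times[k] that of T_k.
    The final holding time is truncated by the end of the observed path.
    """
    if not path:
        return [], []
    jump_states = [path[0]]
    holding_times = []
    run = 1
    for prev, cur in zip(path, path[1:]):
        if cur == prev:
            run += 1                      # still holding in the same state
        else:
            holding_times.append(run)     # a jump ends the current holding time
            jump_states.append(cur)
            run = 1
    holding_times.append(run)
    return jump_states, holding_times


def rebuild_path(jump_states, holding_times):
    """Inverse map: reconstruct the sample path from X^J and the T_k."""
    path = []
    for state, duration in zip(jump_states, holding_times):
        path.extend([state] * duration)
    return path


path = ["s1", "s1", "s1", "s2", "s3", "s3", "s1"]
states, times = holding_times_and_jumps(path)
print(states)   # ['s1', 's2', 's3', 's1']
print(times)    # [3, 1, 2, 1]
print(rebuild_path(states, times) == path)   # True
```

The round trip through `rebuild_path` mirrors the remark that the holding times and the jump process together determine $X$, and vice versa.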
The following description of the joint distribution of the holding times and the jump process therefore characterizes the distribution of $X$.

Proposition 4.10.1 Let $X = (X(k) : k \in \mathbb{Z}_+)$ be a time-homogeneous Markov process with one-step transition probability matrix $P$.

(a) The jump process $X^J$ is itself a time-homogeneous Markov process, and its one-step transition probabilities are given by $p^J_{ij} = p_{ij}/(1 - p_{ii})$ for $i \neq j$, and $p^J_{ii} = 0$, $i, j \in S$.

(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.

(c) Given $(X^J(0), \ldots, X^J(n)) = (j_0, \ldots, j_n)$, the variables $T_0, \ldots, T_n$ are conditionally independent, and the conditional distribution of $T_l$ is geometric with parameter $p_{j_l j_l}$:

\[
P[T_l = k \mid X^J(0) = j_0, \ldots, X^J(n) = j_n] = p_{j_l j_l}^{k-1} (1 - p_{j_l j_l}), \qquad 0 \leq l \leq n,\ k \geq 1.
\]

Proof. Observe that if $X(0) = i$, then

\[
\{T_0 = k, X^J(1) = j\} = \{X(1) = i, X(2) = i, \ldots, X(k-1) = i, X(k) = j\},
\]

so

\[
P[T_0 = k, X^J(1) = j \mid X(0) = i] = p_{ii}^{k-1} p_{ij} = \left[ (1 - p_{ii}) p_{ii}^{k-1} \right] p^J_{ij}.
\tag{4.26}
\]

Because for $i$ fixed the last expression in (4.26) displays the product of two probability distributions, conclude that given $X(0) = i$,

$T_0$ has distribution $((1 - p_{ii}) p_{ii}^{k-1} : k \geq 1)$, the geometric distribution of mean $1/(1 - p_{ii})$,
$X^J(1)$ has distribution $(p^J_{ij} : j \in S)$ ($i$ fixed),
$T_0$ and $X^J(1)$ are independent.

More generally, check that

\[
P[X^J(1) = j_1, \ldots, X^J(n) = j_n, T_0 = k_0, \ldots, T_n = k_n \mid X^J(0) = i]
= p^J_{i j_1} p^J_{j_1 j_2} \cdots p^J_{j_{n-1} j_n} \prod_{l=0}^{n} \left( p_{j_l j_l}^{k_l - 1} (1 - p_{j_l j_l}) \right),
\]

where $j_0 = i$. This establishes the proposition.
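Part (a) of Proposition 4.10.1 and the factorization (4.26) are easy to check numerically. The sketch below computes the jump-chain matrix $p^J_{ij} = p_{ij}/(1 - p_{ii})$ and the geometric holding-time pmf for a made-up three-state matrix $P$. The matrix and the function names are assumptions for illustration only, and states with $p_{ii} = 1$ are rejected because no jump out of them ever occurs.

```python
def jump_chain_matrix(P):
    """p^J_ij = p_ij / (1 - p_ii) for i != j, and p^J_ii = 0."""
    n = len(P)
    PJ = []
    for i in range(n):
        stay = P[i][i]
        if stay >= 1.0:
            raise ValueError("state %d is absorbing; it has no jump distribution" % i)
        PJ.append([0.0 if j == i else P[i][j] / (1.0 - stay) for j in range(n)])
    return PJ


def holding_time_pmf(P, i, k):
    """P[T_0 = k | X(0) = i] = (1 - p_ii) p_ii^(k-1), for k >= 1."""
    return (1.0 - P[i][i]) * P[i][i] ** (k - 1)


# Hypothetical one-step transition matrix (rows sum to one).
P = [[0.5, 0.25, 0.25],
     [0.2, 0.6, 0.2],
     [0.0, 0.3, 0.7]]

PJ = jump_chain_matrix(P)
for row in PJ:
    print([round(x, 6) for x in row])

# Mean holding time in state 0, truncating the series at 200 terms.
mean_T0 = sum(k * holding_time_pmf(P, 0, k) for k in range(1, 200))
print("mean holding time in state 0:", round(mean_T0, 6))   # close to 1/(1 - 0.5)

# Factorization (4.26) for i = 0, j = 1, k = 3.
lhs = P[0][0] ** 2 * P[0][1]
rhs = holding_time_pmf(P, 0, 3) * PJ[0][1]
print(abs(lhs - rhs) < 1e-12)
```

Each row of `PJ` sums to one with a zero on the diagonal, as expected for a process that changes state at every step.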
Next we consider the space-time structure of time-homogeneous continuous-time pure-jump Markov processes. Essentially the only difference between the discrete- and continuous-time Markov processes is that the holding times for the continuous-time processes are exponentially distributed rather than geometrically distributed. Indeed, define the holding times $T_k$, $k \geq 0$ and the jump process $X^J$ using (4.23)-(4.25) as before.

Proposition 4.10.2 Let $X = (X(t) : t \in \mathbb{R}_+)$ be a time-homogeneous, pure-jump Markov process with generator matrix $Q$. Then

(a) The jump process $X^J$ is a discrete-time, time-homogeneous Markov process, and its one-step transition probabilities are given by

\[
p^J_{ij} = \begin{cases} -q_{ij}/q_{ii} & \text{for } i \neq j \\ 0 & \text{for } i = j. \end{cases}
\tag{4.27}
\]

(b) Given $X(0)$, $X^J(1)$ is conditionally independent of $T_0$.

(c) Given $X^J(0) = j_0, \ldots, X^J(n) = j_n$, the variables $T_0, \ldots, T_n$ are conditionally independent, and the conditional distribution of $T_l$ is exponential with parameter $-q_{j_l j_l}$:

\[
P[T_l \geq c \mid X^J(0) = j_0, \ldots, X^J(n) = j_n] = \exp(c\, q_{j_l j_l}), \qquad 0 \leq l \leq n.
\]
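Proposition 4.10.2 suggests a direct way to generate sample paths: hold in state $i$ for an exponentially distributed time with parameter $-q_{ii}$, then jump to $j \neq i$ with probability $-q_{ij}/q_{ii}$. The Python sketch below follows that recipe using only the standard `random` module. It is an illustration under stated assumptions, not a construction given in the notes: the function name, the 0-based state indices, the seed, and the treatment of a state with $q_{ii} = 0$ as absorbing are all choices made here.

```python
import random


def simulate_pure_jump(Q, start, t_end, rng):
    """Return [(state, entry_time), ...] for one sample path on [0, t_end)."""
    n = len(Q)
    state, t = start, 0.0
    path = [(state, t)]
    while True:
        rate = -Q[state][state]
        if rate <= 0.0:
            break                                  # absorbing state: no more jumps
        t += rng.expovariate(rate)                 # holding time ~ Exp(-q_ii)
        if t >= t_end:
            break
        others = [j for j in range(n) if j != state]
        weights = [Q[state][j] for j in others]    # proportional to -q_ij / q_ii
        state = rng.choices(others, weights=weights)[0]
        path.append((state, t))
    return path


# Three-state generator matrix with rows summing to zero.
Q = [[-1.0, 0.5, 0.5],
     [1.0, -2.0, 1.0],
     [0.0, 1.0, -1.0]]

rng = random.Random(2011)          # fixed seed so the run is repeatable
path = simulate_pure_jump(Q, start=0, t_end=10.0, rng=rng)
for state, entry_time in path[:5]:
    print("entered state %d at time %.3f" % (state, entry_time))
```

Consecutive entries of `path` always have different states, so the list is exactly a pure-jump function in the sense used above: the state sequence is a realization of $X^J$ and the gaps between entry times are the holding times.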
Proof. Fix $h > 0$ and define the "sampled" process $X^{(h)}$ by $X^{(h)}(k) = X(hk)$ for $k \geq 0$. See Fig. 4.13. Then $X^{(h)}$ is a discrete-time Markov process with one-step transition probabilities $p_{ij}(h)$ (the transition probabilities for the original process for an interval of length $h$). Let $(T_k^{(h)} : k \geq 0)$ denote the sequence of holding times and $(X^{J,h}(k) : k \geq 0)$ the jump process for the process $X^{(h)}$.

Figure 4.13: Illustration of sampling of a pure-jump function.

The assumption that with probability one the sample paths of $X$ are pure-jump functions, implies that with probability one:

\[
\lim_{h \to 0} \left( X^{J,h}(0), X^{J,h}(1), \ldots, X^{J,h}(n), hT_0^{(h)}, hT_1^{(h)}, \ldots, hT_n^{(h)} \right)
= \left( X^J(0), X^J(1), \ldots, X^J(n), T_0, T_1, \ldots, T_n \right).
\tag{4.28}
\]

Since convergence with probability one implies convergence in distribution, the goal of identifying the distribution of the random vector on the righthand side of (4.28) can be accomplished by identifying the limit of the distribution of the vector on the left.

First, the limiting distribution of the process $X^{J,h}$ is identified. Since $X^{(h)}$ has one-step transition probabilities $p_{ij}(h)$, the formula for the jump process probabilities for discrete-time processes (see Proposition 4.10.1, part a) yields that the one step transition probabilities $p^{J,h}_{ij}$ for $X^{J,h}$ are given by

\[
p^{J,h}_{ij} = \frac{p_{ij}(h)}{1 - p_{ii}(h)} = \frac{p_{ij}(h)/h}{(1 - p_{ii}(h))/h} \to \frac{q_{ij}}{-q_{ii}} \quad \text{as } h \to 0
\tag{4.29}
\]

for $i \neq j$, where the limit indicated in (4.29) follows from the definition (4.18) of the generator matrix $Q$. Thus, the limiting distribution of $X^{J,h}$ is that of a Markov process with one-step transition probabilities given by (4.27), establishing part (a) of the proposition. The conditional independence properties stated in (b) and (c) of the proposition follow in the limit from the corresponding properties for the jump process $X^{J,h}$ guaranteed by Proposition 4.10.1.
4.h (0) = j0 .4 Another sinusoidal random process Suppose that X1 and X2 are random variables such that EX1 = EX2 = EX1 X2 = 0 and Var(X1 ) = Var(X2 ) = σ 2 . (c) Find P [Xt ≥ 0 for all t]. 2π].3 A sinusoidal random process Let Xt = A cos(2πV t + Θ) where the amplitude A has mean 2 and variance 4. . V and Θ are independent. Define Yt = X1 cos(2πt) − X2 sin(2πt). and the phase Θ is uniform on [0. Furthermore.2 Correlation function of a product Let Y and Z be independent random processes with RY (s. t) = 2 exp(−|s − t|) cos(2πf (s − t)) and RZ (s. and the proposition is proved.11 Problems 4. having the Rayleigh distribution with positive parameters σR and 2 . respectively. we have for all c ≥ 0 that P [hTl (h) CHAPTER 4. RANDOM PROCESSES > c|X J. Describe in words the set of possible sample paths of X. Simplify your answer as much as possible. (b) Is X a Markov process? Why or why not? (c) Does X have independent increments? Why or why not? (d) Let A denote the area of the triangle bounded by portions of the coordinate axes and the graph of X. (c) Let Xt = Nt2 . Wt ]. explain.) 4. N (0. (b) Is X stationary? (c) Is X Markov? (d) Describe the first order distributions of X. and compute the mean square error. (a) Find the 2 covariance matrix of the random vector (X(2).) 133 4. the linear minimum mean square error (LMMSE) estimator of X5 given X1 . . The process B = (Bt : 0 ≤ t ≤ 1) is called a Brownian bridge process.e. (c) Is B a Markov process? (d) Show that B is independent of the random variable W1 .) Let Bt = Wt − tW1 for 0 ≤ t ≤ 1. (a) Find P {Wr ≤ Ws ≤ Wt }. Can you draw any conclusions? 4. X1 ] and compute the mean square error. . X(4))T . and compute the mean square error. Find the autocorrelation function of X. Let X = (Xk : k ∈ Z} be defined by Xk = max{Uk−1 .) (e) (Due to J. (b) Find E[X(4)|X(2)].11 A simple discrete-time random process Let U = (Un : n ∈ Z) consist of independent random variables. each uniformly distributed on the interval [0. 
Is X = (Xt : t ≥ 0) a time-homogeneous Markov process? If so. t1 . tn ∈ [0. (Hint: Can do by inspection. . (a) Sketch a typical sample path of the process X. (b) Give a simple expression for P [N2 = 2|N1 ≥ 1] in terms of λ. where A and B are independent. X(3)]. Like W .) Let Xt = (1 − t)W t . (a) Give a simple expression for P [N1 ≥ 1|N2 = 2] in terms of λ. Let X denote the 1−t random process X = (Xt : 0 ≤ t ≤ 1). . Like W . (This part is unrelated to part (a). (c) Find E[X5 |X0 .9 A random process corresponding to a random parabola Define a random process X by Xt = A + Bt + t2 . a Brownian motion with paramter σ 2 = 1. 1]. (a) Sketch a typical sample path of W . 4. Doob. Uk }. (b) Find the MMSE (possibly nonlinear) estimator ˆ of X5 given X1 . (b) Find the autocorrelation function of B. (c) Find E[X(4)|X(2). .11. 1) random ˆ variables. 4. . (b) Find E[Ws |Wr .10 MMSE prediction for a Gaussian process based on two observations Let X be a stationary Gaussian process with mean zero and RX (τ ) = 5 cos( πτ )3−|τ | . Btn )T is independent of W1 .6 Brownian motion: Ascension and smoothing Let W be a Brownian motion process and suppose 0 ≤ r < s < t. If not. Gaussian random process. . B is a mean zero Gaussian random process.4. (This means that for any finite collection.8 Some Poisson process calculations Let N = (Nt : t ≥ 0) be a Poisson process with rate λ > 0. 1]. the random vector (Bt1 .7 Brownian bridge Let W = (Wt : t ≥ 0) be a standard Brownian motion (i. for 0 ≤ t < 1 and let X1 = 0. give the transition probabilities pij (τ ). X(3). (e) Describe the second order distributions of X. X is a mean zero. (a) Find E[X5 |X1 ]. .L. PROBLEMS 4. and the corresponding sample path of B. . such that Y0 = 0 and Yk+1 = Ak Yk + Bk for k ≥ 0. (f) If X is wide sense stationary then Y is wide sense stationary.3]. (a) If X is Markov then Y is Markov. (c) If Y is Markov then X is Markov.3]. [1. (e) If Y is stationary then X is stationary. (d) If X is stationary then Y is stationary. 
RANDOM PROCESSES 4. jointly Gaussian random processes with mean zero. 4.2] and two counts in the the interval [1. (c) Is Y a stationary random process? Justify your answer. Determine whether each of the following statements is always true. give a justification. (a) Show that Yk is a Poisson random variable with parameter 2λ for each k. (b) Find the probability that there are two counts in the interval [0. Poisson sequence Let X = (Xk : k ∈ Z) be a random process such that the Xi are independent. You may express your answer in terms of the standard normal cumulative distribution function Φ. and cross-correlation function RXY (t) = (0.5) exp(−|t − 3|). Define a discrete-time random process Y by and let Bk have variance σB Y = (Yk : k ≥ 0). (h) If X is a martingale then Z is a martingale.i. Y = (Yn : n ∈ Z). If not. (a) Find the probability that there is (exactly) one count in each of the three intervals [0. autocorrelation functions RX (t) = RY (t) = exp(−|t|). (g) If X has independent increments then Y has independent increments. k ≥ 0 be mutually independent with mean zero. (a) Let Z(t) = (X(t) + Y (t))/2 for all t. (b) Show that X is a stationary random process. (b) Is Z a stationary random process? Explain. 4. and Z = (Zn : n ∈ Z) be random processes such that 2 3 Yn = Xn for all n and Zn = Xn for all n. give a simple counter example. Poisson random variables with mean λ.16 A linear evolution equation with random coefficients 2 Let the variables Ak .12 Poisson process probabilities Consider a Poisson process with rate λ > 0. Let Y = (Yk : k ∈ Z) be the random process defined by Yk = Xk + Xk+1 . Find the autocorrelation function of Z. 4. given that there are two counts in the interval [0.15 Invariance of properties under transformations Let X = (Xn : n ∈ Z). Bk .134 CHAPTER 4.2].2]. 3]. for some λ > 0. (Note: your answer to part (b) should be larger than your answer to part (a)).d. If true. Let Ak have variance σA 2 for all k.1]. 
2] and two counts in the interval [1.14 Adding jointly stationary Gaussian processes Let X and Y be jointly stationary. (c) Find P [X(1) ≤ 5Y (2) + 1]. 4. (b) If X is Markov then Z is Markov. and [2. . (c) Find the probability that there are two counts in the interval [1.13 Sliding function of an i. Let τ be the first time that X returns to vertex 000 after time 0. let Xt denote which vertex the fly is at at time t. and describe the one-step transition probabilities. (a) Sketch the one step transition probability diagram for X. of some part takes place at time k. between vertex 000 and Xt . The number of customers in the system at time t is given by Xt = N (t − 1. In words. The process Y is a Markov process with states 0.19 Time elapsed since Bernoulli renewals Let U = (Uk : k ∈ Z) be such that for some p ∈ (0. 4. (d) Find the autocorrelation function of Y . N (a. Find E[τ ]. Interpret Uk = 1 to mean that a renewal. (a) Find the mean and autocovariance function of X. b]. (c) Does Y have independent increments? Explain. or equivalently. the first time that Y returns to 0 after time 0. and because the customers are served simultaneously. 4. let Xk = min{i ≥ 1 : Uk−i = 1}.18 A fly on a cube Consider a cube with vertices 000. (c) Suppose the fly begins at vertex 000 at time zero. Suppose a fly walks along edges of the cube from vertex to vertex. with each having the Bernoulli distribution with parameter p. For k ∈ Z. and the numbers of arrivals in disjoint intervals are independent. Sketch the one-step transition probability diagram for Y . and 3. has the Poisson distribution with mean λ(b − a).2.4. 110. 4. Assume X = (Xt : t ≥ 0) is a discrete-time Markov process. (b) Let Yt denote the distance of Xt .11. For example. 100. measured in number of hops.17 On an M/D/infinity system Suppose customers enter a service system according to a Poisson point process on R of rate λ. (e) Find a simple expression for P {Xt > 0 for t ∈ [0. (b) Is Y a Markov process? Explain. 
(b) Find the distribution of Xk for k fixed. (a) The process X is a time-homogeneous Markov process. if Xt = 101. Suppose each customer stays in the system for one unit of time.) (e) Find the corresponding linear innovations sequence (Yk : k ≥ 1). 1]} in terms of λ. 111.1. this queueing system is called an M/D/∞ queueing system. ( You can use the second moments (Pk ) in expressing your answer. Indicate a suitable state space. independently of other customers. corresponding to infinitely many servers. the random variables Uk are independent. 010. . 1). Because the arrival process is memoryless. meaning that the number of arrivals. the next state Xt+1 is equally likely to be any one of the three vertices neighboring Xt . t]. in an interval (a. b]. PROBLEMS 135 (a) Find a recursive method for computing Pk = E[(Yk )2 ] for k ≥ 0. 001. or replacement. because the service times are deterministic. such that given Xt . 011. (b) Is X stationary? Is X wide sense stationary? (c) Is X a Markov process? (d) Find a simple expression for P {Xt = 0 for t ∈ [0. Xk is the time elapsed since the last renewal strictly before time k. 1]} in terms of λ. and for any integer t ≥ 0. then Yt = 2. 101. Xn = Un for any n ∈ Z. 4.1 (x. and Xt is affine on each interval of the form [n. It follows that Var(limk→∞ Mk ) = 12 . During each time step between k and k + 1.5. and therefore that (k+1)2 1 E[Vk ] = (k+1) . Specifically. describe the one-step transition probabilities.5 and 0. are both put into the bag. and then identify the distribution of the limit random variable. Show that 2 E[Dk+1 |Xk ] = 1 (k + 1)2 2 k(k − 2)Dk + 1 4 . at time k there are k balls in the bag. you need only consider the case 0 ≤ a ≤ 0. Mk is the fraction of balls in the bag at time k that are blue. Let Xk denote the number of blue balls in the bag at time k. to some random variable M∞ . Thus. there is a bag with two balls in it. It’s helpful to consider the cases 0 ≤ a ≤ 0. (Hint: Let n = t and a = t − n. 
M∞ .20 A random process created by interpolation Let U = (Uk : k ∈ Z) such that the Uk are independent. Show that E[Vk+1 |Vk ] = k(k+2) Vk . and a new ball of the other color. find the distribution of Mk for each k. there is a bag with two balls in it. 1]. Let Xk denote the number of blue balls in the bag at time k. fX. for all k ≥ 2. one of the balls is selected from the bag at random.s. so that t = n + a. are both put into the bag.21 Reinforcing samples (Due to G. Mk is the fraction of balls in the bag at time k that are blue. Thus. one of the balls is selected from the bag at random. (c) By the theory of martingales.5}. Polya) Suppose at time k = 2. it converges a. since M is a bounded martingale. 6k (d) More concretely. Thus. Determine k whether M = (Mk : k ≥ 2) is a martingale. with all balls in the bag having equal probability. one orange and one blue. (a) Is X = (Xk : k ≥ 2) a Markov process? If so. RANDOM PROCESSES (c) Is X a stationary random process? Explain. (b) Let t ∈ R.5 < a < 1 separately. Let Vk = Mk (1 − Mk ).22 Restoring samples Suppose at time k = 2. Find and sketch the first order marginal density. During each time step between k and k + 1. 4. (a) Is X = (Xk : k ≥ 2) a Markov process? (b) Let Mk = Xk . with all balls in the bag having equal probability. t). (d) Find the k-step transition probabilities. That ball. That ball. and each is uniformly distributed on the interval [0. (c) Let Mk = Xk . one orange and one blue. 4. For brevity. n + 1] for n ∈ Z. (d) Find P {max0≤t≤10 Xt ≤ 0. (a) Sketch a sample path of U and a corresponding sample path of X. Determine k whether M = (Mk : k ≥ 2) is a martingale. Let X = (Xt : t ∈ R) denote the continuous time random process obtained by linearly interpolating between the U ’s. 1 (d) Let Dk = Mk − 2 .j (k) = P {Xn+k = j|Xn = i}. Then Xt = (1 − a)Un + aUn+1 .) (c) Is the random process X WSS? Justify your answer. Thus.136 CHAPTER 4. (b) Compute E[Xk+1 |Xk ] for k ≥ 2. for all k ≥ 2. pi. 
at time k there are k balls in the bag. and a new ball of the same color. Justify your answers. t) in terms of the function ρ for all s ≥ 0 and t ≥ 0.23 A space-time transformation of Brownian motion Suppose X = (Xt : t ≥ 0) is a real-valued. . t}. independent increment process. where U0 .) 4. (d) Define a process Z in terms of a standard Brownian motion W by Z0 = 0 and Zt = tW ( 1 ) for t t > 0.) 4. What can you conclude about the limit of Mk as k → ∞? (Be sure to specify what sense(s) of limit you mean. U1 .11. of X for B = 4. (b) Find the equilibrium probability distribution. Assume ρt < ∞ for all t. for a positive integer B and positive constant λ.. where W is a standard Brownian motion with RW (s. where Vk : k ∈ Z are independent Gaussian random variables with mean zero and variance one. Does Z have independent increments? Justify your answer. t) = min{s. Explain why Y is an independent increment process with E[Yt2 ] = ρt for all t ≥ 0. (b) Express the autocorrelation function RX (s. and Yk = Vk−2 + Vk−1 + Vk for k ≥ 2. ∞) is given. (a) Show that ρ must be nonnegative and nondecreasing over [0.4.25 Identification of special properties of two discrete-time processes Determine which of the properties: (i) Markov property (ii) martingale property (iii) independent increment property are possessed by the following two random processes. . 4. (Note: The process X models the number of customers in a queueing system with a Poisson arrival process. PROBLEMS 137 1 2 (e) Let vk = E[Dk ]. (b) Y = (Yk : k ≥ 0) defined by Y0 = V0 . are independent random variables. exponential service times. ∞).24 An M/M/1/B queueing system Suppose X is a continuous-time Markov process with the transition rate diagram shown. ! 0 1 1 1 ! 2 ! ! ! B!1 1 1 B . each uniformly distributed on the interval [0. nondecreasing function ρ on [0. suppose a nonnegative. (c) Conversely. Q. Justify your answers. mean zero. (a) X = (Xk : k ≥ 0) defined recursively by X0 = 1 and Xk+1 = (1 + Xk )Uk for k ≥ 0. 
1 (a) Find the generator matrix. and a finite buffer. Y1 = V0 + V1 . .26 Identification of special properties of two discrete-time processes (version 2) Determine which of the properties: (i) Markov property (ii) martingale property (iii) independent increment property are possessed by the following two random processes.. 1]. one server. 4. and let 2 E[Xt ] = ρt for t ≥ 0. Prove by induction on k that vk ≤ 4k . Let Yt = W (ρt ) for t ≥ 0. . where W is a Brownian motion with parameter σ 2 .) (b) R = (Rt : t ≥ 0) defined by Rt = D1 + D2 + · · · + DNt .28 Identification of special properties of two continuous-time processes (version 2) Answer as in the previous problem. (a) Is Y = (Yk : k ≥ 0) a Markov process? Briefly explain your answer. Suppose the number of offspring of any individual has the probability distribution p. and for k ≥ 1 let Yk denote the number of individuals in the k th generation. The offspring of the initial individual comprise the first generation. each possibility having probability one half. indexed by the positive integers. k (b) Find constants ck so that Yk is a martingale. (Hint: Observe that E[Zt ] = 1 for all t. Initially. defined by Zt = Wt3 . and note that the Y1 subpopulations beginning with the Y1 individuals in generation one are independent and statistically identical to the whole population. Uk .138 CHAPTER 4. one ball at position two. so X0 = 1.30 Moving balls Consider the motion of three indistinguishable balls on a linear array of positions. independently of how many offspring other individuals have. where U1 . Express am+1 in terms of the distribution p and am (Hint: condition on the value of Y1 . the probability of extinction by the mth generation. (b) (Yk : k ≥ 0). U2 . c (c) Let am = P [Ym = 0]. Let Y0 = 1. each having mean 0 and variance σ 2 . for the following two random processes: 2 (a) Z = (Zt : t ≥ 0). Consider a population beginning with a single individual. . 2π]. each cell either dies or splits into two new cells. 
each uniformly distributed over the interval [0.29 A branching process Let p = (pi : i ≥ 0) be a probability distribution on the nonnegative integers with mean m. 4. defined by Zt = exp(Wt − σ2 t ). such that Y0 = 1 and. comprising generation zero. and it has mean m = 1−θ . are independent random variables. Let Xk denote the number of cells alive at time k. Given the positions of the balls at some integer time t. 4. where W is a Brownian motion with parameter σ 2 . RANDOM PROCESSES (a) (Xk : k ≥ 0). for k ≥ 1. 2] 4. where Xk is the number of cells alive at time k in a colony that evolves as follows. such that one or more balls can occupy the same position. . in terms of the distribution p.) (d) Express the probability of eventual extinction. (This distribution is θ similar to the geometric distribution. Suppose cells die or split independently. and one ball at position three. . defined by Rt = cos(2πt + Θ).27 Identification of special properties of two continuous-time processes Answer as in the previous problem. for the following two random processes: (a) Z = (Zt : t ≥ 0). . there is one cell. in general. the offspring of the kth generation comprise the k + 1st generation. (b) R = (Rt : t ≥ 0). the positions at time t + 1 are determined . Yk = U1 U2 .) 4. During each discrete time step. where Θ is uniformly distributed on the interval [0. . and. where N is a Poisson process with rate λ > 0 and Di : i ≥ 1 is an iid sequence of random variables. Under what condition is a∞ = 1? (e) Find a∞ in terms of θ in case pk = θk (1 − θ) for k ≥ 0 and 0 ≤ θ < 1. Suppose that at time t = 0 there is one ball at position one. a∞ = limm→∞ am . 1 1 10 2 1 5 3 (a) Write down the rate matrix Q.4 0. The ball that was picked up is then placed one position to the right of the selected ball.8 2 0.32 Mean hitting time for a continuous-time. PROBLEMS 139 as follows. (c) As time progresses. (b) Find the equilibrium probability distribution π.2 0. 
and give the one-step transition probability matrix for your process. . discrete-state Markov process Let (Xk : k ≥ 0) be a time-homogeneous Markov process with the one-step transition probability diagram shown.6 1 0. with probability one. the fraction of time the process is in a state i up to time t converges a.11. in a way similar to the analysis of the gambler’s ruin problem. Try to use a small number of states. and one of the other two balls is selected at random (but not moved). Give the generator matrix Q. Clearly a3 = 0. Solve the equations to find a1 and a2 . 4. find its equilibrium distribution. 0.4.31 Mean hitting time for a discrete-time. 4. a move as described above happens in the interval [t. (You can use the fact that for a finite-state Markov process in which any state can eventually be reached from any other. (Hint: Take the balls to be indistinguishable. t + h] with probability h + o(h). discrete-space Markov process Let (Xt : t ≥ 0) be a time-homogeneous Markov process with the transition rate diagram shown. Derive equations for a1 and a2 by considering the possible values of X1 . (c) Let τ = min{t ≥ 0 : Xt = 3} and let ai = E[τ |X0 = i] for 1 ≤ i ≤ 3. (b) Find the equilibrium distribution of your process. Given the current state at time t. with each choice having probability one half. (d) Consider the following continuous time version of the problem. One of the balls in the left most occupied position is picked up. (a) Define a finite-state Markov process that tracks the relative positions of the balls. and don’t include the position numbers. Find that limiting value. the balls all move to the right. (c) Let τ = min{k ≥ 0 : Xk = 3} and let ai = E[τ |X0 = i] for 1 ≤ i ≤ 3.s. and identify the long term average speed of the balls. Derive equations for a1 and a2 by considering the possible values of Xt (h) for small values of h > 0 and taking the limit as h → 0.6 (a) Write down the one step transition probability matrix P . 
(b) Find the equilibrium probability distribution π. to the equilibrium probability for state i as t → ∞. and the average speed has a limiting value. Clearly a3 = 0.) Describe the significance of each state.4 3 0. Solve the equations to find a1 and a2 . is a Poisson process with rate λ1 + . specifically. and suppose there are k types of coupons.33 Poisson merger Summing counting processes corresponds to “merging” point processes. each possibility having probability k . . and suppose each customer is one of K types.) Let W = (Wt : t ≥ 0) be a Brownian motion with σ 2 = 1 (called a standard 2 Brownian motion). the coupons could be pieces of a file to be distributed by some sort of gossip algorithm. . k ln k + kc). find the limit of the probability that the collection is complete at time t = k ln k + kc. and the types of different coupons are mutually independent.) Show that for any k. 4.35 Poisson method for coupon collector’s problem (a) Suppose a stream of coupons arrives according to a Poisson process (A(t) : t ≥ 0) with rate λ = 1. Let p(k.36 Some orthogonal martingales based on Brownian motion (This problem is related to the problem on linear innovations and orthogonal polynomials in the previous problem set. and then think about what else is needed to get the result for Poisson processes. Show that the sum of K independent Poisson processes. (The letter “d” is used here because the number of coupons.5.) (c) The rest of this problem shows that the limit found in part (b) also holds if the total number of coupons is deterministic. You can use the definition of a Poisson process or one of the equivalent descriptions given by Proposition 4. Show that the stream of customers of a given type i is again a Poisson stream. t) in terms of k and t. (The letter “p” is used here because the number of coupons arriving by time t has the Poisson distribution). (Hint for parts (a) and (b): For notational brevity. and suppose that for each k. 
Don’t forget to check required independence properties. . (Hint: First formulate and prove a similar result for sums of random variables. . n)P [A(t) ≥ n] ≤ p(k. n) denote the probability that the collection is complete after n coupon arrivals. Let (p1 . having rates λ1 . t) be the probability that at least one coupon of each type arrives by time t. (In network applications. n.34 Poisson splitting Consider a stream of customers modeled by a Poisson process. is deterministic.140 CHAPTER 4. . That is. and n fixed. One idea is that if t is large. . λK . RANDOM PROCESSES 4. d(k. rather than Poisson distributed.) The type of each coupon in the 1 stream is randomly drawn from the k types. (e) Combine parts (c) and (d) to identify limk→∞ d(k. then A(t) is not too far from its mean with high probability. . t. (b) Find limk→∞ p(k. (Same hint as in the previous problem applies. . and let Mt = exp(θWt − θ2 t ) for an arbitrary constant θ. the k th customer is type i with probability pi . Show. Express p(k. + λK .) Show furthermore that the K substreams are mutually independent. then (1 + ak )k → k ea .) 4. and that its rate is λpi . let Ws represent (Wu : 0 ≤ u ≤ s) for the purposes of conditioning. k ln k+kc) for an arbitrary constant c. pK ) be a probability vector. t) ≤ P [A(t) ≥ n] + P [A(t) ≤ n]d(k. that 0 if c < c limk→∞ P [A(k ln k + kc) ≥ k ln k + kc ] = 1 if c > c (d) Let d(k. If Zt is a function of Wt for each . n). . 4.2 in the notes. (a) Show that (Mt : t ≥ 0) is a martingale. . The types of the customers are mutually independent and also independent of the arrival times of the customers. respectively. (Hint: If ak → a as k → ∞. 6 0. P [Xk = ρn ]). P m . PROBLEMS 141 t. t ≥ 0.8 0. .0 0. Xk−1 = s1 .1 0. because then E[Zt |Zu ... 
(b) Identify a function f on S so that f (s) = a for two choices of s and f (s) = b for the third choice of s.37 A state space reduction preserving the Markov property Consider a time-homogeneous.38 * Autocorrelation function of a stationary Markov process Let X = (Xk : k ∈ Z) be a Markov process such that the state space.. π3 ). and give the one-step transition probability matrix of Y . . .2 P = 0.. ρ2 . Verify directly that Wt2 − t and Wt3 − 3tWt are martingales. Xk−m = sm ] = pij . Let e be the column vector of all ones. 4. is a finite subset of the real numbers. where a = b. {ρ1 . . Let P = (pij ) denote the matrix of one-step transition probabilities. 0. (a) Show that P e = e and π(k + 1) = π(k)P . . . then Mn (s) ⊥ Mm (t). because it is the linear innovations sequence for the variables 1. . and one-step transition probability matrix 0. where pij is the i. exp(θWt − θ2 t θ2 θ3 ) = 1 + θWt + (Wt2 − t) + (Wt3 − 3tWt ) + · · · 2 2 3! ∞ n θ = Mn (t) n! n=0 and Hn is the nth Hermite polynomial.2 0. .4. then a sufficient condition for Z to be a martingale is that E[Zt |Ws ] = Zs whenever 0 < s < t. and s1 . . (Mn (t) : n ≥ 0) is a sequence of orthogonal random variables. Briefly explain your answer... (c) For fixed t. √t tn/2 Hn ( Wt ). 3}. (e) Assume that X is stationary.0 (a) Sketch the transition probability diagram and find the equilibrium probability distribution π = (π1 . . jth element of the mth power of P . Wt . 2. Wt2 . and let π(k) be the row vector π(k) = (P [X k = ρ1 ].11. discrete-time Markov process X = (Xk : k ≥ 0) with state space S = {1. ρn }. and the vector π of parts (b) and (c). Use this fact and the martingale property of the Mn processes to show that if n = m and s.8 0. 0 ≤ u ≤ s] = E[E[Zt |Ws ]|Zu . (b) Show that if the Markov chain X is a stationary random process then π(k) = π for all k. (ρi )... 0 ≤ u ≤ s] = Zs ).. initial state X0 = 3. . (m) (m) (d) Show that P [Xk+m = ρj |Xk = ρi . Express RX (k) in terms of P . sm are arbitrary states. 
The fact that M is a martinwhere Mn (t) = gale for any value of θ can be used to show that Mn is a martingale for each n (you don’t need to supply details). 4..3 . where π is a vector such that π = πP . (c) Prove the converse of part (b). π2 . 0 ≤ u ≤ s] = E[Zs |Zu . (b) By the power series expansion of the exponential function. such that the process Y = (Yk : k ≥ 0) defined by Yk = f (Xk ) is a Markov process with only two states. 142 CHAPTER 4. RANDOM PROCESSES . Definition 5. the maximum likelihood estimator is defined by selecting the value of θ that maximizes the likelihood of the observation.Chapter 5 Inference for Markov Models This chapter gives a glimpse of the theory of iterative algorithms for graphical models. In fact. or θM L (y) = arg maxθ fY (y|θ). It begins with a brief introduction to estimation theory: maximum likelihood and Bayesian estimators are introduced. Rather. as well as an introduction to statistical estimation theory.1. 143 . which for each possible observed value y. and an iterative algorithm. The maximum likelihood estimate of θ given Y = y for a particular y is the value of θ that maximizes the likelihood of y. the maximum likelihood estimator θM L is given by θM L (y) = arg maxθ pY (y|θ). based on observation of a random variable Y . That is. if Y is continuous type. gives the estimate θ(y). beginning with the ML method. This general background is then focused on three inference problems posed using Markov models. The ML method is based on the assumption that Y has a pmf pY (y|θ) (if Y is discrete type) or a pdf fY (y|θ) (if Y is continuous type).1 For a particular value y and parameter value θ. for computation of maximum likelihood estimators in certain contexts. is known. Note that the maximum likelihood estimator is not defined as one maximizing the likelihood of the parameter θ to be estimated. or fY (y|θ). the likelihood of y for θ is pY (y|θ). if Y is discrete type. is described. known as the expectation-maximization algorithm. 
5.1 A bit of estimation theory

The two most commonly used methods for producing estimates of unknown quantities are the maximum likelihood (ML) and Bayesian methods. These two methods are briefly described in this section, beginning with the ML method.

Suppose a parameter $\theta$ is to be estimated, based on observation of a random variable $Y$. An estimator of $\theta$ based on $Y$ is a function $\hat\theta$, which for each possible observed value $y$, gives the estimate $\hat\theta(y)$. The ML method is based on the assumption that $Y$ has a pmf $p_Y(y|\theta)$ (if $Y$ is discrete type) or a pdf $f_Y(y|\theta)$ (if $Y$ is continuous type), where $\theta$ is the unknown parameter to be estimated, and the family of functions $p_Y(y|\theta)$ or $f_Y(y|\theta)$ is known.

Definition 5.1.1 For a particular value $y$ and parameter value $\theta$, the likelihood of $y$ for $\theta$ is $p_Y(y|\theta)$, if $Y$ is discrete type, or $f_Y(y|\theta)$, if $Y$ is continuous type. The maximum likelihood estimate of $\theta$ given $Y = y$ for a particular $y$ is the value of $\theta$ that maximizes the likelihood of $y$. That is, the maximum likelihood estimator $\hat\theta_{ML}$ is given by $\hat\theta_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$, or $\hat\theta_{ML}(y) = \arg\max_\theta f_Y(y|\theta)$.

Note that the maximum likelihood estimator is not defined as one maximizing the likelihood of the parameter $\theta$ to be estimated. Rather, the maximum likelihood estimator is defined by selecting the value of $\theta$ that maximizes the likelihood of the observation.

Example 5.1.2 Suppose $Y$ is assumed to be a $N(\theta, \sigma^2)$ random variable, where $\sigma^2$ is known. Equivalently, we can write $Y = \theta + W$, where $W$ is a $N(0, \sigma^2)$ random variable. Given that a value $y$ is observed, the ML estimator is obtained by maximizing $f_Y(y|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\theta)^2}{2\sigma^2}\right)$ with respect to $\theta$. By inspection, $\hat\theta_{ML}(y) = y$.

Example 5.1.3 Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable, for some $\theta > 0$. Given the observation $Y = k$ for some fixed $k \ge 0$, the ML estimator is obtained by maximizing $p_Y(k|\theta) = \frac{e^{-\theta}\theta^k}{k!}$ with respect to $\theta$. Equivalently, dropping the constant $k!$ and taking the logarithm, $\theta$ is to be selected to maximize $-\theta + k \ln\theta$. The derivative is $-1 + k/\theta$, which is positive for $\theta < k$ and negative for $\theta > k$. Hence, $\hat\theta_{ML}(k) = k$.

Note that in the ML method, the quantity to be estimated, $\theta$, is not assumed to be random. In fact, $\theta$ need not even be a random variable. This has the advantage that the modeler does not have to come up with a probability distribution for $\theta$, and can still impose hard constraints on $\theta$. But the ML method does not permit incorporation of soft probabilistic knowledge the modeler may have about $\theta$ before any observation is used.

The Bayesian method is based on estimating a random quantity. Thus, in the end, the variable to be estimated, say $Z$, and the observation, say $Y$, are jointly distributed random variables.

Definition 5.1.4 The Bayesian estimator of $Z$ given $Y$, for jointly distributed random variables $Z$ and $Y$, and cost function $C(z, \hat z)$, is the function $\hat Z = g(Y)$ of $Y$ which minimizes the average cost, $E[C(Z, \hat Z)]$.

The assumed distribution of $Z$ is called the prior or a priori distribution, whereas the conditional distribution of $Z$ given $Y$ is called the posterior or a posteriori distribution. In particular, if $Z$ is discrete, there is a prior pmf, $p_Z$, and a posterior pmf, $p_{Z|Y}$, or if $Z$ and $Y$ are jointly continuous, there is a prior pdf, $f_Z$, and a posterior pdf, $f_{Z|Y}$.

One of the most common choices of the cost function is the squared error, $C(z, \hat z) = (z - \hat z)^2$, for which the Bayesian estimators are the minimum mean squared error (MMSE) estimators, examined in Chapter 3. Recall that the MMSE estimators are given by the conditional expectation, $g(y) = E[Z|Y = y]$, which, given the observation $Y = y$, is the mean of the posterior distribution of $Z$ given $Y = y$.

A commonly used choice of $C$ in case $Z$ is a discrete random variable is $C(z, \hat z) = I_{\{z \ne \hat z\}}$. In this case, the Bayesian objective is to select $\hat Z$ to minimize $P\{Z \ne \hat Z\}$, or equivalently, to maximize $P\{Z = \hat Z\}$. For an estimator $\hat Z = g(Y)$,
$$P\{Z = \hat Z\} = \sum_y P[Z = g(y)|Y = y]\, p_Y(y) = \sum_y p_{Z|Y}(g(y)|y)\, p_Y(y).$$
So a Bayesian estimator for $C(z, \hat z) = I_{\{z \ne \hat z\}}$ is one such that $g(y)$ maximizes $P[Z = g(y)|Y = y]$ for each $y$. That is, for each $y$, $g(y)$ is a maximizer of the posterior pmf of $Z$. The estimator, called the maximum a posteriori probability (MAP) estimator, can be written concisely as
$$\hat Z_{MAP}(y) = \arg\max_z p_{Z|Y}(z|y).$$

Suppose there is a parameter $\theta$ to be estimated based on an observation $Y$, and suppose that the pmf of $Y$, $p_Y(y|\theta)$, is known for each $\theta$. This is enough to determine the ML estimator, but determination of a Bayesian estimator requires, in addition, a choice of cost function $C$ and a prior probability distribution (i.e. a distribution for $\theta$). For example, if $\theta$ is a discrete variable, the Bayesian method would require that a prior pmf for $\theta$ be selected. In that case, we can view the parameter to be estimated as a random variable, which we might denote by the upper case symbol $\Theta$, and the prior pmf could be denoted by $p_\Theta(\theta)$. Then, as required by the Bayesian method, the variable to be estimated, $\Theta$, and the observation, $Y$, would be jointly distributed random variables. The joint pmf would be given by $p_{\Theta,Y}(\theta, y) = p_\Theta(\theta) p_Y(y|\theta)$. The posterior probability distribution can be expressed as a conditional pmf, by Bayes’ formula:
$$p_{\Theta|Y}(\theta|y) = \frac{p_\Theta(\theta)\, p_Y(y|\theta)}{p_Y(y)} \tag{5.1}$$
where $p_Y(y) = \sum_{\theta'} p_{\Theta,Y}(\theta', y)$. Given $y$, the value of the MAP estimator is a value of $\theta$ that maximizes $p_{\Theta|Y}(\theta|y)$ with respect to $\theta$. For that purpose, the denominator in the right-hand side of (5.1) can be ignored, so that the MAP estimator is given by
$$\hat\Theta_{MAP}(y) = \arg\max_\theta p_{\Theta|Y}(\theta|y) = \arg\max_\theta p_\Theta(\theta)\, p_Y(y|\theta). \tag{5.2}$$
The expression (5.2) for $\hat\Theta_{MAP}(y)$ is rather similar to the expression for the ML estimator, $\hat\theta_{ML}(y) = \arg\max_\theta p_Y(y|\theta)$. In fact, the two estimators agree if the prior $p_\Theta(\theta)$ is uniform, meaning it is the same for all $\theta$.

The MAP criterion for selecting estimators can be extended to the case that $Y$ and $\theta$ are jointly continuous variables, leading to the following:
$$\hat\Theta_{MAP}(y) = \arg\max_\theta f_{\Theta|Y}(\theta|y) = \arg\max_\theta f_\Theta(\theta)\, f_Y(y|\theta). \tag{5.3}$$
In this case, the probability that any estimator is exactly equal to $\theta$ is zero, but taking $\hat\Theta_{MAP}(y)$ to maximize the posterior pdf maximizes the probability that the estimator is within $\epsilon$ of the true value of $\theta$, in an asymptotic sense as $\epsilon \to 0$.

Example 5.1.5 Suppose $Y$ is assumed to be a $N(\theta, \sigma^2)$ random variable, where the variance $\sigma^2$ is known and $\theta$ is to be estimated. Using the Bayesian method, suppose the prior density of $\theta$ is the $N(0, b^2)$ density for some known parameter $b^2$. Equivalently, we can write $Y = \Theta + W$, where $\Theta$ is a $N(0, b^2)$ random variable and $W$ is a $N(0, \sigma^2)$ random variable, independent of $\Theta$. By the properties of joint Gaussian densities given in Chapter 3, given $Y = y$, the posterior distribution (i.e. the conditional distribution of $\Theta$ given $y$) is the normal distribution with mean $E[\Theta|Y = y] = \frac{b^2 y}{b^2+\sigma^2}$ and variance $\frac{b^2\sigma^2}{b^2+\sigma^2}$. The mean and maximizing value of this conditional density are both equal to $E[\Theta|Y = y]$. Therefore, $\hat\Theta_{MMSE}(y) = \hat\Theta_{MAP}(y) = E[\Theta|Y = y]$. It is interesting to compare this example to Example 5.1.2. The Bayesian estimators (MMSE and MAP) are both smaller in magnitude than $\hat\theta_{ML}(y) = y$, by the factor $\frac{b^2}{b^2+\sigma^2}$. If $b^2$ is small compared to $\sigma^2$, the prior information indicates that $|\theta|$ is believed to be small, resulting in the Bayes estimators being smaller in magnitude than the ML estimator. As $b^2 \to \infty$, the prior distribution gets increasingly uniform, and the Bayes estimators converge to the ML estimator.

Example 5.1.6 Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian method, suppose the prior distribution for $\theta$ is the uniform distribution over the interval $[0, \theta_{max}]$, for some known value $\theta_{max}$. Given the observation $Y = k$ for some fixed $k \ge 0$, the MAP estimator is obtained by maximizing
$$p_Y(k|\theta) f_\Theta(\theta) = \frac{e^{-\theta}\theta^k}{k!} \cdot \frac{I_{\{0 \le \theta \le \theta_{max}\}}}{\theta_{max}}$$
with respect to $\theta$. As seen in Example 5.1.3, the term $\frac{e^{-\theta}\theta^k}{k!}$ is increasing in $\theta$ for $\theta < k$ and decreasing in $\theta$ for $\theta > k$. Therefore, $\hat\Theta_{MAP}(k) = \min\{k, \theta_{max}\}$. It is interesting to compare this example to Example 5.1.3. Intuitively, the prior probability distribution indicates knowledge that $\theta \le \theta_{max}$, but no more than that, because the prior restricted to $\theta \le \theta_{max}$ is uniform. If $\theta_{max}$ is less than $k$, the MAP estimator is strictly smaller than $\hat\theta_{ML}(k) = k$. As $\theta_{max} \to \infty$, the MAP estimator converges to the ML estimator. Actually, deterministic prior knowledge, such as $\theta \le \theta_{max}$, can also be incorporated into ML estimation as a hard constraint.

The next example makes use of the following lemma.

Lemma 5.1.7 Suppose $c_i \ge 0$ for $1 \le i \le n$ and that $c = \sum_{i=1}^n c_i > 0$. Then $\sum_{i=1}^n c_i \log p_i$ is maximized over all probability vectors $p = (p_1, \ldots, p_n)$ by $p_i = c_i/c$.

Proof. If $c_j = 0$ for some $j$, then clearly $p_j = 0$ for the maximizing probability vector. By eliminating such terms from the sum, we can assume without loss of generality that $c_i > 0$ for all $i$. The function to be maximized is a strictly concave function of $p$ over a region with linear constraints. The positivity constraints, namely $p_i \ge 0$, will be satisfied with strict inequality. The remaining constraint is the equality constraint, $\sum_{i=1}^n p_i = 1$. We thus introduce a Lagrange multiplier $\lambda$ for the equality constraint and seek the stationary point of the Lagrangian $L(p, \lambda) = \sum_{i=1}^n c_i \log p_i - \lambda\left(\left(\sum_{i=1}^n p_i\right) - 1\right)$. By definition, the stationary point is the point at which the partial derivatives with respect to the variables $p_i$ are all zero. Setting $\frac{\partial L}{\partial p_i} = \frac{c_i}{p_i} - \lambda = 0$ yields that $p_i = \frac{c_i}{\lambda}$ for all $i$. To satisfy the linear constraint, $\lambda$ must equal $c$.

Example 5.1.8 Suppose $b = (b_1, b_2, \ldots, b_n)$ is a probability vector to be estimated by observing $Y = (Y_1, \ldots, Y_T)$. Assume $Y_1, \ldots, Y_T$ are independent, with each $Y_t$ having probability distribution $b$: $P\{Y_t = i\} = b_i$ for $1 \le t \le T$ and $1 \le i \le n$. We shall determine the maximum likelihood estimate, $\hat b_{ML}(y)$, given a particular observation $y = (y_1, \ldots, y_T)$. The likelihood to be maximized with respect to $b$ is $p(y|b) = b_{y_1} \cdots b_{y_T} = \prod_{i=1}^n b_i^{k_i}$, where $k_i = |\{t : y_t = i\}|$. The log likelihood is $\ln p(y|b) = \sum_{i=1}^n k_i \ln(b_i)$. By Lemma 5.1.7, this is maximized by the empirical distribution of the observations, namely $b_i = \frac{k_i}{T}$ for $1 \le i \le n$. That is, $\hat b_{ML} = \left(\frac{k_1}{T}, \ldots, \frac{k_n}{T}\right)$.

Example 5.1.9 This is a Bayesian version of the previous example. Suppose $b = (b_1, b_2, \ldots, b_n)$ is a probability vector to be estimated by observing $Y = (Y_1, \ldots, Y_T)$, and assume $Y_1, \ldots, Y_T$ are independent, with each $Y_t$ having probability distribution $b$. For the Bayesian method, a distribution of the unknown distribution $b$ must be assumed. That is right, a distribution of the distribution is needed. A convenient choice is the following. Suppose for some known numbers $d_i \ge 1$ that $(b_1, \ldots, b_{n-1})$ has the prior density:
$$f_B(b) = \begin{cases} \frac{\prod_{i=1}^n b_i^{d_i - 1}}{Z(d)} & \text{if } b_i \ge 0 \text{ for } 1 \le i \le n-1, \text{ and } \sum_{i=1}^{n-1} b_i \le 1 \\ 0 & \text{else} \end{cases}$$
where $b_n = 1 - b_1 - \cdots - b_{n-1}$, and $Z(d)$ is a constant chosen so that $f_B$ integrates to one. A larger value of $d_i$ for a fixed $i$ expresses an a priori guess that the corresponding value $b_i$ may be larger. It can be shown, in particular, that if $B$ has this prior distribution, then $E[B_i] = \frac{d_i}{d_1 + \cdots + d_n}$. The MAP estimate, $\hat b_{MAP}(y)$, for a given observation vector $y$, is given by:
$$\hat b_{MAP}(y) = \arg\max_b \ln\left(f_B(b)\, p(y|b)\right) = \arg\max_b \left\{ -\ln(Z(d)) + \sum_{i=1}^n (d_i - 1 + k_i) \ln(b_i) \right\}$$
By Lemma 5.1.7, $\hat b_{MAP}(y) = \left(\frac{d_1 - 1 + k_1}{\tilde T}, \ldots, \frac{d_n - 1 + k_n}{\tilde T}\right)$, where $\tilde T = \sum_{i=1}^n (d_i - 1 + k_i) = T - n + \sum_{i=1}^n d_i$.

Comparison with Example 5.1.8 shows that the MAP estimate is the same as the ML estimate, except that $d_i - 1$ is added to $k_i$ for each $i$. If the $d_i$’s are integers, the MAP estimate is the ML estimate with some prior observations mixed in, namely, $d_i - 1$ prior observations of outcome $i$ for each $i$. A prior distribution such that the MAP estimate has the same algebraic form as the ML estimate is called a conjugate prior, and the specific density $f_B$ for this example is called the Dirichlet density with parameter vector $d$.

Example 5.1.10 Suppose that $Y = (Y_1, \ldots, Y_T)$ is observed, and that it is assumed that the $Y_t$ are independent, with the binomial distribution with parameters $n$ and $q$. Suppose $n$ is known, and $q$ is an unknown parameter to be estimated from $Y$. Let us find the maximum likelihood estimate, $\hat q_{ML}(y)$, for a particular observation $y = (y_1, \ldots, y_T)$. The likelihood is
$$p(y|q) = \prod_{t=1}^T \left[ \binom{n}{y_t} q^{y_t} (1-q)^{n - y_t} \right] = c\, q^s (1-q)^{nT - s},$$
where $s = y_1 + \cdots + y_T$, and $c$ depends on $y$ but not on $q$. The log likelihood is $\ln c + s \ln(q) + (nT - s) \ln(1 - q)$. Maximizing over $q$ yields $\hat q_{ML} = \frac{s}{nT}$. An alternative way to think about this is to realize that each $Y_t$ can be viewed as the sum of $n$ independent Bernoulli($q$) random variables, and $s$ can be viewed as the observed sum of $nT$ independent Bernoulli($q$) random variables.
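The estimates for a probability vector in Examples 5.1.8 and 5.1.9 can be checked numerically. The following is a minimal sketch, not part of the notes: the function names `ml_estimate` and `map_estimate` and the sample data are illustrative assumptions, outcomes are labeled 1 to n as in the examples, and T is assumed to be at least 1.

```python
from collections import Counter

def ml_estimate(y, n):
    """Empirical distribution b_i = k_i / T, as in Example 5.1.8."""
    T = len(y)
    k = Counter(y)  # k[i] = number of observations equal to i
    return [k.get(i, 0) / T for i in range(1, n + 1)]

def map_estimate(y, d):
    """MAP estimate under a Dirichlet prior with parameters d_i >= 1 (Example 5.1.9)."""
    n = len(d)
    k = Counter(y)
    # d_i - 1 acts like a count of prior observations of outcome i.
    weights = [d[i - 1] - 1 + k.get(i, 0) for i in range(1, n + 1)]
    total = sum(weights)  # equals T - n + sum(d)
    return [w / total for w in weights]

# Hypothetical data: T = 6 observations of outcomes in {1, 2, 3}.
y = [1, 3, 3, 2, 3, 1]
b_ml = ml_estimate(y, 3)
b_map = map_estimate(y, [2, 2, 2])      # one prior observation of each outcome
b_flat = map_estimate(y, [1, 1, 1])     # d_i = 1 gives a uniform prior
print(b_ml, b_map, b_flat)
```

With all d_i equal to 1 the prior is uniform and the MAP estimate coincides with the ML estimate, in line with the remark following (5.2).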
5.2 The expectation-maximization (EM) algorithm

The expectation-maximization algorithm is a computational method for computing maximum likelihood estimates in contexts where there are hidden random variables, in addition to observed data and unknown parameters. The following notation will be used.

$\theta$, a parameter to be estimated
$X$, the complete data
$p_{cd}(x|\theta)$, the pmf of the complete data, which is a known function for each value of $\theta$
$Y = h(X)$, the observed random vector
$Z$, the unobserved data (This notation is used in the common case that $X$ has the form $X = (Y, Z)$.)

We write $p(y|\theta)$ to denote the pmf of $Y$ for a given value of $\theta$. It can be expressed in terms of the pmf of the complete data by:
$$p(y|\theta) = \sum_{\{x : h(x) = y\}} p_{cd}(x|\theta) \tag{5.4}$$
In some applications, there can be a very large number of terms in the sum in (5.4), making it difficult to numerically maximize $p(y|\theta)$ with respect to $\theta$ (i.e. to compute $\hat\theta_{ML}(y)$).

Algorithm 5.2.1 (Expectation-maximization (EM) algorithm) An observation $y$ is given, along with an initial estimate $\theta^{(0)}$. The algorithm is iterative. Given $\theta^{(k)}$, the next value $\theta^{(k+1)}$ is computed in the following two steps:
(Expectation step) Compute $Q(\theta|\theta^{(k)})$ for all $\theta$, where
$$Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(X|\theta) \mid y, \theta^{(k)}]. \tag{5.5}$$
(Maximization step) Compute $\theta^{(k+1)} \in \arg\max_\theta Q(\theta|\theta^{(k)})$. In other words, find a value $\theta^{(k+1)}$ of $\theta$ that maximizes $Q(\theta|\theta^{(k)})$ with respect to $\theta$.

Some intuition behind the algorithm is the following. If a vector of complete data $x$ could be observed, it would be reasonable to estimate $\theta$ by maximizing the pmf of the complete data, $p_{cd}(x|\theta)$, with respect to $\theta$. This plan is not feasible if the complete data is not observed. The idea is to estimate $\log p_{cd}(X|\theta)$ by its conditional expectation, $Q(\theta|\theta^{(k)})$, and then find $\theta$ to maximize this conditional expectation. The conditional expectation is well defined if some value of the parameter $\theta$ is fixed. For each iteration of the algorithm, the expectation step is completed using the latest value of $\theta$, $\theta^{(k)}$, in computing the expectation of $\log p_{cd}(X|\theta)$.

In most applications there is some additional structure that helps in the computation of $Q(\theta|\theta^{(k)})$. This typically happens when $p_{cd}$ factors into simple terms, such as in the case of hidden Markov models discussed in this chapter, or when $p_{cd}$ has the form of an exponential raised to a low degree polynomial, such as the Gaussian or exponential distribution. In some cases there are closed form expressions for $Q(\theta|\theta^{(k)})$. In others, there may be an algorithm that generates samples of $X$ with the desired pmf $p_{cd}(x|\theta^{(k)})$ using random number generators, and then $\log p_{cd}(X|\theta)$ is used as an approximation to $Q(\theta|\theta^{(k)})$.

Example 5.2.2 (Estimation of the variance of a signal) An observation $Y$ is modeled as $Y = S + N$, where the signal $S$ is assumed to be a $N(0, \theta)$ random variable, where $\theta$ is an unknown parameter, assumed to satisfy $\theta \ge 0$, and the noise $N$ is a $N(0, \sigma^2)$ random variable where $\sigma^2$ is known and strictly positive. Suppose it is desired to estimate $\theta$, the variance of the signal. Let $y$ be a particular observed value of $Y$. We consider two approaches to finding $\hat\theta_{ML}$: a direct approach, and the EM algorithm.

For the direct approach, note that for $\theta$ fixed, $Y$ is a $N(0, \theta + \sigma^2)$ random variable. Therefore, the pdf of $Y$ evaluated at $y$, or likelihood of $y$, is given by
$$f(y|\theta) = \frac{\exp\left(-\frac{y^2}{2(\theta+\sigma^2)}\right)}{\sqrt{2\pi(\theta+\sigma^2)}}.$$
The natural log of the likelihood is given by
$$\log f(y|\theta) = -\frac{\log(2\pi)}{2} - \frac{\log(\theta+\sigma^2)}{2} - \frac{y^2}{2(\theta+\sigma^2)}.$$
Maximizing over $\theta$ yields $\hat\theta_{ML} = (y^2 - \sigma^2)_+$. While this one-dimensional case is fairly simple, the situation is different in higher dimensions, as explored in Problem 5.5. Thus, we examine use of the EM algorithm for this example.

To apply the EM algorithm for this example, take $X = (S, N)$ as the complete data. The observation is only the sum, $Y = S + N$, so the complete data is not observed. For given $\theta$, $S$ and $N$ are independent, so the log of the joint pdf of the complete data is given as follows:
$$\log p_{cd}(s, n|\theta) = -\frac{\log(2\pi\theta)}{2} - \frac{s^2}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{n^2}{2\sigma^2}.$$
For the expectation step, we find
$$Q(\theta|\theta^{(k)}) = E[\, \log p_{cd}(S, N|\theta) \mid y, \theta^{(k)}] = -\frac{\log(2\pi\theta)}{2} - \frac{E[S^2|y, \theta^{(k)}]}{2\theta} - \frac{\log(2\pi\sigma^2)}{2} - \frac{E[N^2|y, \theta^{(k)}]}{2\sigma^2}.$$
For the maximization step, we find
$$\frac{\partial Q(\theta|\theta^{(k)})}{\partial\theta} = -\frac{1}{2\theta} + \frac{E[S^2|y, \theta^{(k)}]}{2\theta^2}$$
from which we see that $\theta^{(k+1)} = E[S^2|y, \theta^{(k)}]$. Computation of $E[S^2|y, \theta^{(k)}]$ is an exercise in conditional Gaussian distributions, similar to Example 3.4.5. The conditional second moment is the sum of the square of the conditional mean and the variance of the estimation error. Thus, the EM algorithm becomes the following recursion:
$$\theta^{(k+1)} = \left(\frac{\theta^{(k)}}{\theta^{(k)} + \sigma^2}\right)^2 y^2 + \frac{\theta^{(k)}\sigma^2}{\theta^{(k)} + \sigma^2} \tag{5.6}$$
Problem 5.3 shows that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat\theta_{ML}$ as $k \to \infty$.
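The recursion (5.6) of Example 5.2.2 is easy to iterate numerically. The sketch below is an illustration only: the function names and the values y = 2, σ² = 1, θ(0) = 1 are assumptions chosen for the demonstration, not taken from the notes. It runs the recursion, compares the result with the direct formula (y² − σ²)₊, and evaluates the log likelihood along the way.

```python
import math

def em_variance(y, sigma2, theta0, iters=200):
    """Iterate recursion (5.6); return the list theta(0), theta(1), ..."""
    theta = theta0
    history = [theta]
    for _ in range(iters):
        gain = theta / (theta + sigma2)  # theta(k) / (theta(k) + sigma^2)
        # (gain^2) y^2 is the squared conditional mean of S;
        # gain * sigma2 is the variance of the estimation error.
        theta = gain ** 2 * y ** 2 + gain * sigma2
        history.append(theta)
    return history

def log_likelihood(y, theta, sigma2):
    """log f(y|theta) for Y ~ N(0, theta + sigma^2)."""
    v = theta + sigma2
    return -0.5 * math.log(2 * math.pi) - 0.5 * math.log(v) - y * y / (2 * v)

# Illustrative values (assumed for this sketch).
y, sigma2 = 2.0, 1.0
history = em_variance(y, sigma2, theta0=1.0)
theta_ml = max(y * y - sigma2, 0.0)  # direct approach: (y^2 - sigma^2)_+
loglik = [log_likelihood(y, th, sigma2) for th in history]
print(history[-1], theta_ml)
```

In this run the iterates approach the direct ML value and the log likelihood does not decrease from one iteration to the next. When y² < σ² the limit is 0 and convergence can be expected to be much slower, since the recursion then has slope 1 at θ = 0.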
Proposition 5.2.3 shows that, under a condition, the likelihood $p(y|\theta^{(k)})$ is nondecreasing in $k$. In the ideal case, the likelihood converges to the maximum possible value of the likelihood, and $\lim_{k\to\infty} \theta^{(k)} = \hat\theta_{ML}(y)$. However, the sequence could converge to a local, but not global, maximizer of the likelihood, or possibly even to an inflection point of the likelihood. This behavior is typical of gradient type nonlinear optimization algorithms, which the EM algorithm is similar to. Note that even if the parameter set is convex (as it is for the case of hidden Markov models), the corresponding sets of probability distributions on $Y$ are not convex. It is the geometry of the set of probability distributions that really matters for the EM algorithm, rather than the geometry of the space of the parameters.

Proposition 5.2.3 (Convergence of the EM algorithm) Suppose that the complete data pmf can be factored as $p_{cd}(x|\theta) = p(y|\theta)\, k(x|y, \theta)$ such that
(i) $\log p(y|\theta)$ is differentiable in $\theta$
(ii) $E[\, \log k(X|y, \bar\theta) \mid y, \bar\theta\, ]$ is finite for all $\bar\theta$
(iii) $D(k(\cdot|y, \theta) \| k(\cdot|y, \theta'))$ is differentiable with respect to $\theta'$ for any $\theta$ fixed,
and suppose that $p(y|\theta^{(0)}) > 0$. Then the likelihood $p(y|\theta^{(k)})$ is nondecreasing in $k$, and any limit point $\theta^*$ of the sequence $(\theta^{(k)})$ is a stationary point of the objective function $p(y|\theta)$, which by definition means
$$\left.\frac{\partial p(y|\theta)}{\partial\theta}\right|_{\theta = \theta^*} = 0. \tag{5.7}$$

The proof of Proposition 5.2.3 is given after the notion of divergence between probability vectors is introduced.

Definition 5.2.4 The divergence between probability vectors $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$, denoted by $D(p\|q)$, is defined by $D(p\|q) = \sum_i p_i \log(p_i/q_i)$, with the understanding that $p_i \log(p_i/q_i) = 0$ if $p_i = 0$ and $p_i \log(p_i/q_i) = +\infty$ if $p_i > q_i = 0$.

Lemma 5.2.5 (Basic properties of divergence)
(i) $D(p\|q) \ge 0$, with equality if and only if $p = q$
(ii) $D$ is a convex function of the pair $(p, q)$.

Proof. Property (i) follows from Lemma 5.1.7. Here is another proof. In proving (i), we can assume that $q_i > 0$ for all $i$. The function $\phi(u) = u \log u$ for $u > 0$, with $\phi(0) = 0$, is convex. Thus, by Jensen’s inequality,
$$D(p\|q) = \sum_i \phi\left(\frac{p_i}{q_i}\right) q_i \ge \phi\left(\sum_i \frac{p_i}{q_i} \cdot q_i\right) = \phi(1) = 0,$$
so (i) is proved.

The proof of (ii) is based on the log-sum inequality, which is the fact that for nonnegative numbers $a_1, \ldots, a_n, b_1, \ldots, b_n$:
$$\sum_i a_i \log\frac{a_i}{b_i} \ge a \log\frac{a}{b}, \tag{5.8}$$
where $a = \sum_i a_i$ and $b = \sum_i b_i$. To verify (5.8), note that it is true if and only if it is true with each $a_i$ replaced by $c a_i$, for any strictly positive constant $c$. So it can be assumed that $a = 1$. Similarly, it can be assumed that $b = 1$. For $a = b = 1$, (5.8) is equivalent to the fact $D(a\|b) \ge 0$, already proved. So (5.8) is proved.

Let $0 < \alpha < 1$. Suppose $p^j = (p^j_1, \ldots, p^j_n)$ and $q^j = (q^j_1, \ldots, q^j_n)$ are probability distributions for $j = 1, 2$, and let $p_i = \alpha p^1_i + (1-\alpha) p^2_i$ and $q_i = \alpha q^1_i + (1-\alpha) q^2_i$, for $1 \le i \le n$. That is, $(p^1, q^1)$ and $(p^2, q^2)$ are two pairs of probability distributions, and $(p, q) = \alpha(p^1, q^1) + (1-\alpha)(p^2, q^2)$. For $i$ fixed with $1 \le i \le n$, the log-sum inequality (5.8) with $(a_1, a_2, b_1, b_2) = (\alpha p^1_i, (1-\alpha) p^2_i, \alpha q^1_i, (1-\alpha) q^2_i)$ yields
$$\alpha p^1_i \log\frac{p^1_i}{q^1_i} + (1-\alpha) p^2_i \log\frac{p^2_i}{q^2_i} = \alpha p^1_i \log\frac{\alpha p^1_i}{\alpha q^1_i} + (1-\alpha) p^2_i \log\frac{(1-\alpha) p^2_i}{(1-\alpha) q^2_i} \ge p_i \log\frac{p_i}{q_i}.$$
Summing each side of this inequality over $i$ yields $\alpha D(p^1\|q^1) + (1-\alpha) D(p^2\|q^2) \ge D(p\|q)$, so that $D(p\|q)$ is a convex function of the pair $(p, q)$.

Proof of Proposition 5.2.3 Using the factorization $p_{cd}(x|\theta) = p(y|\theta)\, k(x|y, \theta)$,
$$\begin{aligned} Q(\theta|\theta^{(k)}) &= E[\log p_{cd}(X|\theta) \mid y, \theta^{(k)}] \\ &= \log p(y|\theta) + E[\, \log k(X|y, \theta) \mid y, \theta^{(k)}] \\ &= \log p(y|\theta) + E\left[\, \log\frac{k(X|y, \theta)}{k(X|y, \theta^{(k)})} \,\Big|\, y, \theta^{(k)}\right] + R \\ &= \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) + R, \end{aligned} \tag{5.9}$$
where $R = E[\, \log k(X|y, \theta^{(k)}) \mid y, \theta^{(k)}]$. By assumption (ii), $R$ is finite, and it depends on $y$ and $\theta^{(k)}$, but not on $\theta$. Therefore, the maximization step of the EM algorithm is equivalent to:
$$\theta^{(k+1)} = \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) \right] \tag{5.10}$$
Thus, at each step, the EM algorithm attempts to maximize the log likelihood $\log p(y|\theta)$ itself, minus a term which penalizes large differences between $\theta$ and $\theta^{(k)}$.

The definition of $\theta^{(k+1)}$ implies that $Q(\theta^{(k+1)}|\theta^{(k)}) \ge Q(\theta^{(k)}|\theta^{(k)})$. Therefore, using (5.9) and the fact $D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k)})) = 0$, yields
$$\log p(y|\theta^{(k+1)}) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta^{(k+1)})) \ge \log p(y|\theta^{(k)}) \tag{5.11}$$
In particular, since the divergence is nonnegative, $p(y|\theta^{(k)})$ is nondecreasing in $k$. Therefore, $\lim_{k\to\infty} \log p(y|\theta^{(k)})$ exists.

Suppose now that the sequence $(\theta^{(k)})$ has a limit point, $\theta^*$. By continuity, implied by the differentiability assumption (i), $\lim_{k\to\infty} p(y|\theta^{(k)}) = p(y|\theta^*) < \infty$. For each $k$,
$$0 \le \max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^{(k)}) \| k(\cdot|y, \theta)) \right] - \log p(y|\theta^{(k)}) \tag{5.12}$$
$$\le \log p(y|\theta^{(k+1)}) - \log p(y|\theta^{(k)}) \to 0 \text{ as } k \to \infty, \tag{5.13}$$
where (5.12) follows from the fact that $\theta^{(k)}$ is a possible value of $\theta$ in the maximization, and the inequality in (5.13) follows from (5.10) and the fact that the divergence is always nonnegative. Thus, the quantity on the right-hand side of (5.12) converges to zero as $k \to \infty$. So by continuity, for any limit point $\theta^*$ of the sequence $(\theta^{(k)})$,
$$\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta)) \right] - \log p(y|\theta^*) = 0$$
and therefore,
$$\theta^* \in \arg\max_\theta \left[ \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta)) \right].$$
So the derivative of $\log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ is zero. The same is true of the term $D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ alone, because this term is nonnegative, it has value 0 at $\theta = \theta^*$, and it is assumed to be differentiable in $\theta$. Therefore, the derivative of the first term, $\log p(y|\theta)$, must be zero at $\theta^*$.

Remark 5.2.6 In the above proposition and proof, we assume that $\theta^*$ is unconstrained. If there are inequality constraints on $\theta$ and if some of them are tight for $\theta^*$, then we still find that if $\theta^*$ is a limit point of $\theta^{(k)}$, then it is a maximizer of $f(\theta) = \log p(y|\theta) - D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$. Thus, under regularity conditions implying the existence of Lagrange multipliers, the Kuhn-Tucker optimality conditions are satisfied for the problem of maximizing $f(\theta)$. Since the derivatives of $D(k(\cdot|y, \theta^*) \| k(\cdot|y, \theta))$ with respect to $\theta$ at $\theta = \theta^*$ are zero, and since the Kuhn-Tucker optimality conditions only involve the first derivatives of the objective function, those conditions for the problem of maximizing the true log likelihood function, $\log p(y|\theta)$, also hold at $\theta^*$.

5.3 Hidden Markov models

A popular model of one-dimensional sequences with dependencies, explored especially in the context of speech processing, is the hidden Markov model. Suppose that

$X = (Y, Z)$, where $Z$ is unobserved data and $Y$ is the observed data.

$Z = (Z_1, \ldots, Z_T)$ is a time-homogeneous Markov process, with one-step transition probability matrix $A = (a_{i,j})$, and with $Z_1$ having the initial distribution $\pi$. Here, $T$, with $T \ge 1$, denotes the total number of observation times. The state-space of $Z$ is denoted by $S$, and the number of states of $S$ is denoted by $N_s$.

$Y = (Y_1, \ldots, Y_T)$ is the observed data. It is such that given $Z = z$, for some $z = (z_1, \ldots, z_T)$, the variables $Y_1, \ldots, Y_T$ are conditionally independent with $P[Y_t = l | Z = z] = b_{z_t, l}$, for a given observation generation matrix $B = (b_{i,l})$. The observations are assumed to take values in a set of size $N_o$, so that $B$ is an $N_s \times N_o$ matrix and each row of $B$ is a probability vector.

The parameter for this model is $\theta = (\pi, A, B)$. The model is illustrated in Figure 5.1. The pmf of the complete data, for a given choice of $\theta$, is
$$p_{cd}(y, z|\theta) = \pi_{z_1} \prod_{t=1}^{T-1} a_{z_t, z_{t+1}} \prod_{t=1}^{T} b_{z_t, y_t}. \tag{5.14}$$

[Figure 5.1: Structure of hidden Markov model. The hidden states $Z_1, Z_2, Z_3, \ldots, Z_T$ form a chain with each transition labeled $A$, and each $Z_t$ is joined to its observation $Y_t$ by an edge labeled $B$.]

The correspondence between the pmf and the graph shown in Figure 5.1 is that each term on the right-hand side of (5.14) corresponds to an edge in the graph.

In what follows we consider the following three estimation tasks associated with this model:
1. Given the observed data and $\theta$, compute the conditional distribution of the state (solved by the forward-backward algorithm)
2. Given the observed data and $\theta$, compute the most likely sequence for hidden states (solved by the Viterbi algorithm)
3. Given the observed data, compute the maximum likelihood (ML) estimate of $\theta$ (solved by the Baum-Welch/EM algorithm).
The intial value is αi (1) = πi biy1 . . Zt = i. where the conventions for subscripts is similar to that used for Kalman filtering: “t|T ” denotes that the state is to be estimated at time t based on the observations up to time T .5. Zt = i. Zt+1 = j|θ] P [Y1 = y1 . . · · · . . · · · . Yt = yt . we have: Zt|tM AP Zt|T M AP (Zt . YT = yT . Zt = i|θ]. .1 Posterior state probabilities and the forward-backward algorithm In this subsection we assume that the parameter θ = (π. . . Y1 = y1 . θ] i∈S = arg max (i. . . . . . . Yt = yt . P [Zt = i|Y1 = y1 .j)∈S×S P [Zt = i. θ] = = P [Zt = i. B) of the hidden Markov model is known and fixed. based on past observations (case of causal filtering) or based on past and future observations (case of smoothing). Yt+1 = yt+1 . . θ].3. HIDDEN MARKOV MODELS 155 These problems are addressed in the next three subsections. A. . . We shall describe computationally efficient methods for computing posterior probabilites for the state at a given time t. The key to efficient computation is to recursively compute certain quantities through a recursion forward in time.18) . Yt = yt . Zt+1 )|TM AP = arg max P [Zt = i|Y1 = y1 . Yt = yt |θ] αi (t) j∈S αj (t) (5. . . Yt = yt . θ] = i∈S αi (t)aij bjyt+1 . 5. Zt+1 = j|Y1 = y1 . of the hidden Markov process. As we will see. . the first of these problems arises in solving the third problem. For example. These posterior probabilities would allow us to compute. Yt = yt . The second problem has some similarities to the first problem. but it can be addressed separately. .15) (5. θ] i∈S (5. Yt+1 = yt+1 |Y1 = y1 . and others through a recursion backward in time. YT = yT . · · · . Yt = yt |θ] P [Y1 = y1 .3. for i ∈ S and 1 ≤ t ≤ T. . .16) (5. . Zt = i|θ] i∈S = · P [Zt+1 = j. . the update rule is: αj (t + 1) = i∈S P [Y1 = y1 . or for a transition at a given pair of times t to t + 1. By the law of total probability. The right-hand side of (5. · · · . for example. 
.15) can be expressed in terms of the α's as follows. We begin by deriving a forward recursion for the variables $\alpha_i(t)$ defined as follows:

\[ \alpha_i(t) = P[Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i \mid \theta], \]

for $i \in S$ and $1 \le t \le T$. The initial value is $\alpha_i(1) = \pi_i b_{i y_1}$. By the law of total probability, the update rule is

\begin{align*}
\alpha_j(t+1) &= \sum_{i \in S} P[Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}, Z_t = i, Z_{t+1} = j \mid \theta] \\
&= \sum_{i \in S} P[Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i \mid \theta] \cdot P[Z_{t+1} = j, Y_{t+1} = y_{t+1} \mid Y_1 = y_1, \ldots, Y_t = y_t, Z_t = i, \theta] \\
&= \sum_{i \in S} \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}.
\end{align*}

In terms of the α's,

\[ P[Z_t = i \mid Y_1 = y_1, \ldots, Y_t = y_t, \theta] = \frac{P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta]}{P[Y_1 = y_1, \ldots, Y_t = y_t \mid \theta]} = \frac{\alpha_i(t)}{\sum_{j \in S} \alpha_j(t)}. \tag{5.18} \]

The computation of the α's and the use of (5.18) is an alternative to, and very similar to, the Kalman filtering equations. The difference is that for the Kalman filtering equations, the distributions involved are all Gaussian, so it suffices to compute means and variances. Also, the normalization in (5.18), which is done once after the α's are computed, is more or less done at each step in the Kalman filtering equations.

To express the posterior probabilities involving both past and future observations used in (5.16), the following β variables are introduced:

\[ \beta_i(t) = P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid Z_t = i, \theta], \]

for $i \in S$ and $1 \le t \le T$. The definition is not quite the time reversal of the definition of the α's, because the event $Z_t = i$ is being conditioned upon in the definition of $\beta_i(t)$. This asymmetry is introduced because the presentation of the model itself is not symmetric in time. The backward equation for the β's is as follows. The initial condition for the backward equations is $\beta_i(T) = 1$ for all $i$. By the law of total probability, the update rule is

\begin{align*}
\beta_i(t-1) &= \sum_{j \in S} P[Y_t = y_t, \ldots, Y_T = y_T, Z_t = j \mid Z_{t-1} = i, \theta] \\
&= \sum_{j \in S} P[Y_t = y_t, Z_t = j \mid Z_{t-1} = i, \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid Z_t = j, Y_t = y_t, Z_{t-1} = i, \theta] \\
&= \sum_{j \in S} a_{ij}\, b_{j y_t}\, \beta_j(t).
\end{align*}

Note that

\begin{align*}
P[Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]
&= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t] \\
&= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta] \cdot P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid \theta, Z_t = i] \\
&= \alpha_i(t)\beta_i(t),
\end{align*}

from which we derive the smoothing equation for the conditional distribution of the state at a time $t$, given all the observations:

\begin{align*}
\gamma_i(t) &= P[Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta] \\
&= \frac{P[Z_t = i, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]}{P[Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]} \\
&= \frac{\alpha_i(t)\beta_i(t)}{\sum_{j \in S} \alpha_j(t)\beta_j(t)}.
\end{align*}

The variable $\gamma_i(t)$ defined here is the same as the probability in the right-hand side of (5.16), so that we have an efficient way to find the MAP smoothing estimator defined in (5.16). For later use, we note from the above that for any $i$ such that $\gamma_i(t) > 0$,

\[ P[Y_1 = y_1, \ldots, Y_T = y_T \mid \theta] = \frac{\alpha_i(t)\beta_i(t)}{\gamma_i(t)}. \tag{5.19} \]

Similarly,

\begin{align*}
P[Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]
&= P[Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta] \\
&\quad \cdot P[Z_{t+1} = j, Y_{t+1} = y_{t+1} \mid \theta, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t] \\
&\quad \cdot P[Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T \mid \theta, Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_{t+1} = y_{t+1}] \\
&= \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1),
\end{align*}

from which we derive the smoothing equation for the conditional distribution of a state transition for some pair of consecutive times $t$ and $t+1$, given all the observations:

\begin{align*}
\xi_{ij}(t) &= P[Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta] \\
&= \frac{P[Z_t = i, Z_{t+1} = j, Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]}{P[Y_1 = y_1, \ldots, Y_T = y_T \mid \theta]} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t)\, a_{i'j'}\, b_{j' y_{t+1}}\, \beta_{j'}(t+1)} \\
&= \frac{\gamma_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\beta_i(t)},
\end{align*}

where the final expression is derived using (5.19). The variable $\xi_{ij}(t)$ defined here is the same as the probability in the right-hand side of (5.17), so that we have an efficient way to find the MAP smoothing estimator of a state transition, defined in (5.17).

Summarizing, the forward-backward or α-β algorithm for computing the posterior distribution of the state or a transition is given by:

Algorithm 5.3.1 (The forward-backward algorithm) The α's can be recursively computed forward in time, and the β's recursively computed backward in time, using:

\begin{align*}
\alpha_j(t+1) &= \sum_{i \in S} \alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}, \quad \text{with initial condition } \alpha_i(1) = \pi_i b_{i y_1}, \\
\beta_i(t-1) &= \sum_{j \in S} a_{ij}\, b_{j y_t}\, \beta_j(t), \quad \text{with initial condition } \beta_i(T) = 1.
\end{align*}

Then the posterior probabilities can be found:

\begin{align*}
P[Z_t = i \mid Y_1 = y_1, \ldots, Y_t = y_t, \theta] &= \frac{\alpha_i(t)}{\sum_{j \in S} \alpha_j(t)} \tag{5.20} \\
\gamma_i(t) = P[Z_t = i \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta] &= \frac{\alpha_i(t)\beta_i(t)}{\sum_{j \in S} \alpha_j(t)\beta_j(t)} \tag{5.21} \\
\xi_{ij}(t) = P[Z_t = i, Z_{t+1} = j \mid Y_1 = y_1, \ldots, Y_T = y_T, \theta] &= \frac{\alpha_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\sum_{i', j'} \alpha_{i'}(t)\, a_{i'j'}\, b_{j' y_{t+1}}\, \beta_{j'}(t+1)} \tag{5.22} \\
&= \frac{\gamma_i(t)\, a_{ij}\, b_{j y_{t+1}}\, \beta_j(t+1)}{\beta_i(t)}. \tag{5.23}
\end{align*}

Remark 5.3.2 If the number of observations runs into the hundreds or thousands, the α's and β's can become so small that underflow problems can be encountered in numerical computation. However, the formulas (5.20), (5.21), and (5.22) for the posterior probabilities in the forward-backward algorithm are still valid if the α's and β's are multiplied by time dependent (but state independent) constants (for this purpose, (5.22) is more convenient than (5.23), because (5.23) involves β's at two different times). Then, the α's and β's can be renormalized after each time step of computation to have sum equal to one. Moreover, the sum of the logarithms of the normalization factors for the α's can be stored in order to recover the log of the likelihood, $\log p(y \mid \theta) = \log \sum_{i=0}^{N_s - 1} \alpha_i(T)$.
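The forward-backward recursions, together with the renormalization suggested in Remark 5.3.2, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `forward_backward`, the list-of-lists layout for $A$ and $B$, and the 0-based time index are assumptions made for the example. States and observation symbols are assumed to be coded as integers $0, 1, \ldots$

```python
import math

def forward_backward(pi, A, B, y):
    """Scaled forward-backward pass for a finite-state HMM (illustrative sketch).

    pi[i]   : initial probabilities, A[i][j] : a_ij, B[i][l] : b_il,
    y       : observed symbols (column indices of B), time index 0..T-1.
    Returns (gamma, xi, loglik), following (5.21), (5.22) and Remark 5.3.2.
    """
    n, T = len(pi), len(y)
    alpha, loglik = [], 0.0
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[j][y[0]] for j in range(n)]
        else:
            a = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][y[t]]
                 for j in range(n)]
        c = sum(a)                 # c == 0 only if y is impossible under theta
        loglik += math.log(c)      # logs of the normalizers add up to log p(y|theta)
        alpha.append([v / c for v in a])

    # Backward pass; beta at the final time is all ones, earlier rows are
    # rescaled by a state-independent constant.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 1, 0, -1):
        b = [sum(A[i][j] * B[j][y[t]] * beta[t][j] for j in range(n))
             for i in range(n)]
        c = sum(b)
        beta[t - 1] = [v / c for v in b]

    # Smoothed state probabilities, equation (5.21).
    gamma = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(n)]
        s = sum(g)
        gamma.append([v / s for v in g])

    # Smoothed transition probabilities, equation (5.22).
    xi = []
    for t in range(T - 1):
        x = [[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j]
              for j in range(n)] for i in range(n)]
        s = sum(map(sum, x))
        xi.append([[v / s for v in row] for row in x])
    return gamma, xi, loglik
```

Because (5.21) and (5.22) are ratios, the state-independent scale factors cancel, which is why the rescaled α's and β's give the same γ's and ξ's as the unscaled ones.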
5.3.2 Most likely state sequence – Viterbi algorithm

Suppose the parameter $\theta = (\pi, A, B)$ is known, and that $Y = (Y_1, \ldots, Y_T)$ is observed. In some applications one wishes to have an estimate of the entire sequence $Z$. Since $\theta$ is known, $Y$ and $Z$ can be viewed as random vectors with a known joint pmf, namely $p_{cd}(y, z \mid \theta)$. For the remainder of this section, let $y$ denote a fixed observed sequence, $y = (y_1, \ldots, y_T)$. We will seek the MAP estimate, $\hat{Z}_{MAP}(y, \theta)$, of the entire state sequence $Z = (Z_1, \ldots, Z_T)$, given $Y = y$. By definition, it is the $z$ that maximizes the posterior pmf $p(z \mid y, \theta)$, and as shown in Section 5.1, it is also equal to the maximizer of the joint pmf of $Y$ and $Z$:

\[ \hat{Z}_{MAP}(y, \theta) = \arg\max_z p_{cd}(y, z \mid \theta). \]

The Viterbi algorithm (a special case of dynamic programming), described next, is a computationally efficient algorithm for simultaneously finding the maximizing sequence $z^* \in S^T$ and computing $p_{cd}(y, z^* \mid \theta)$. It uses the variables:

\[ \delta_i(t) = \max_{(z_1, \ldots, z_{t-1}) \in S^{t-1}} P[Z_1 = z_1, \ldots, Z_{t-1} = z_{t-1}, Z_t = i, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta]. \]

The δ's can be computed by a recursion forward in time, using the initial values $\delta_i(1) = \pi(i) b_{i y_1}$ and the recursion derived as follows:

\begin{align*}
\delta_j(t) &= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P[Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Z_t = j, Y_1 = y_1, \ldots, Y_t = y_t \mid \theta] \\
&= \max_i \max_{\{z_1, \ldots, z_{t-2}\}} P[Z_1 = z_1, \ldots, Z_{t-2} = z_{t-2}, Z_{t-1} = i, Y_1 = y_1, \ldots, Y_{t-1} = y_{t-1} \mid \theta]\, a_{i,j}\, b_{j y_t} \\
&= \max_i \{\delta_i(t-1)\, a_{i,j}\, b_{j y_t}\}.
\end{align*}

Note that $\delta_i(T) = \max_{z : z_T = i} p_{cd}(y, z \mid \theta)$. Thus, the following algorithm correctly finds $\hat{Z}_{MAP}(y, \theta)$.

Algorithm 5.3.3 (Viterbi algorithm) Compute the δ's and associated back pointers by a recursion forward in time:

\begin{align*}
\text{(initial condition)} \quad & \delta_i(1) = \pi(i)\, b_{i y_1} \\
\text{(recursive step)} \quad & \delta_j(t) = \max_i \{\delta_i(t-1)\, a_{ij}\, b_{j, y_t}\} \tag{5.24} \\
\text{(storage of back pointers)} \quad & \phi_j(t) = \arg\max_i \{\delta_i(t-1)\, a_{i,j}\, b_{j, y_t}\}
\end{align*}

Then $z^* = \hat{Z}_{MAP}(y, \theta)$ satisfies $p_{cd}(y, z^* \mid \theta) = \max_i \delta_i(T)$, and $z^*$ is given by tracing backward in time:

\[ z^*_T = \arg\max_i \delta_i(T) \quad \text{and} \quad z^*_{t-1} = \phi_{z^*_t}(t) \ \text{ for } 2 \le t \le T. \tag{5.25} \]
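As a rough sketch of how the Viterbi recursion (5.24) and the back-tracing step (5.25) might look in code, consider the following Python function. The name `viterbi`, the list-based parameter layout, and the 0-based time index are assumptions of this example. It works with raw probabilities, so for long observation sequences one would likely switch to logarithms to avoid underflow.

```python
def viterbi(pi, A, B, y):
    """Most likely state sequence for a finite-state HMM (illustrative sketch).

    Returns (z_star, p_star) where p_star = p_cd(y, z_star | theta).
    """
    n, T = len(pi), len(y)
    delta = [pi[i] * B[i][y[0]] for i in range(n)]   # initial condition
    back = []                                         # back pointers for each later time
    for t in range(1, T):
        phi, new_delta = [], []
        for j in range(n):
            # best predecessor of state j at this time step
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            phi.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][y[t]])
        back.append(phi)
        delta = new_delta

    # Trace backward in time from the best final state.
    z = [max(range(n), key=lambda i: delta[i])]
    for phi in reversed(back):
        z.append(phi[z[-1]])
    z.reverse()
    return z, max(delta)
```

Ties in the maximization are broken in favor of the smallest state index here; any tie-breaking rule yields a valid maximizer.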
5.3.3 The Baum-Welch algorithm, or EM algorithm for HMM

The EM algorithm, introduced in Section 5.2, can be usefully applied to many parameter estimation problems with hidden data. This section shows how to apply it to the problem of estimating the parameter of a hidden Markov model from an observed output sequence. This results in the Baum-Welch algorithm, which was developed earlier than the EM algorithm, in the particular context of HMMs.

The parameter to be estimated is $\theta = (\pi, A, B)$. The complete data consists of $(Y, Z)$ whereas the observed, incomplete data consists of $Y$ alone. The initial parameter $\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$ should have all entries strictly positive, because any entry that is zero will remain zero at the end of an iteration. Suppose $\theta^{(k)}$ is given. The first half of an iteration of the EM algorithm is to compute, or determine in closed form, $Q(\theta \mid \theta^{(k)})$. Taking logarithms in the expression (5.14) for the pmf of the complete data yields

\[ \log p_{cd}(y, z \mid \theta) = \log \pi_{z_1} + \sum_{t=1}^{T-1} \log a_{z_t, z_{t+1}} + \sum_{t=1}^{T} \log b_{z_t, y_t}. \]

Taking the expectation yields

\begin{align*}
Q(\theta \mid \theta^{(k)}) &= E[\log p_{cd}(y, Z \mid \theta) \mid y, \theta^{(k)}] \\
&= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j} \xi_{ij}(t) \log a_{i,j} + \sum_{t=1}^{T} \sum_{i \in S} \gamma_i(t) \log b_{i, y_t},
\end{align*}

where the variables $\gamma_i(t)$ and $\xi_{i,j}(t)$ are defined using the model with parameter $\theta^{(k)}$. In view of this closed form expression for $Q(\theta \mid \theta^{(k)})$, the expectation step of the EM algorithm essentially comes down to computing the γ's and the ξ's. This computation can be done using the forward-backward algorithm, Algorithm 5.3.1, with $\theta = \theta^{(k)}$.

The second half of an iteration of the EM algorithm is to find the value of $\theta$ that maximizes $Q(\theta \mid \theta^{(k)})$, and set $\theta^{(k+1)}$ equal to that value. The parameter $\theta = (\pi, A, B)$ for this problem can be viewed as a set of probability vectors. Namely, $\pi$ is a probability vector, and, for each $i$ fixed, $a_{ij}$ as $j$ varies, and $b_{il}$ as $l$ varies, are probability vectors. Therefore, Example 5.1.8 and Lemma 5.1.7 will be of use. Motivated by these, we rewrite the expression found for $Q(\theta \mid \theta^{(k)})$ to get

\begin{align*}
Q(\theta \mid \theta^{(k)}) &= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{i,j} \sum_{t=1}^{T-1} \xi_{ij}(t) \log a_{i,j} + \sum_{i \in S} \sum_{t=1}^{T} \gamma_i(t) \log b_{i, y_t} \\
&= \sum_{i \in S} \gamma_i(1) \log \pi_i + \sum_{i,j} \left( \sum_{t=1}^{T-1} \xi_{ij}(t) \right) \log a_{i,j} + \sum_{i \in S} \sum_{l} \left( \sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}} \right) \log b_{i,l}. \tag{5.26}
\end{align*}

The first summation in (5.26) has the same form as the sum in Lemma 5.1.7. Similarly, for each $i$ fixed, the sum over $j$ involving $a_{i,j}$, and the sum over $l$ involving $b_{i,l}$, also have the same form as the sum in Lemma 5.1.7. Therefore, the maximization step of the EM algorithm can be written in the following form:

\begin{align*}
\pi_i^{(k+1)} &= \gamma_i(1) \tag{5.27} \\
a_{i,j}^{(k+1)} &= \frac{\sum_{t=1}^{T-1} \xi_{i,j}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \tag{5.28} \\
b_{i,l}^{(k+1)} &= \frac{\sum_{t=1}^{T} \gamma_i(t) I_{\{y_t = l\}}}{\sum_{t=1}^{T} \gamma_i(t)} \tag{5.29}
\end{align*}

The update equations (5.27)-(5.29) have a natural interpretation. Equation (5.27) means that the new value of the distribution of the initial state, $\pi^{(k+1)}$, is simply the posterior distribution of the initial state, computed assuming $\theta^{(k)}$ is the true parameter value. The other two update equations are similar, but are more complicated because the transition matrix $A$ and observation generation matrix $B$ do not change with time. The denominator of (5.28) is the posterior expected number of times the state is equal to $i$ up to time $T - 1$, and the numerator is the posterior expected number of times two consecutive states are $i, j$. Thus, if we think of the time of a jump as being random, the right-hand side of (5.28) is the time-averaged posterior conditional probability that, given the state at the beginning of a transition is $i$ at a typical time, the next state will be $j$. Similarly, the right-hand side of (5.29) is the time-averaged posterior conditional probability that, given the state is $i$ at a typical time, the observation will be $l$.

Algorithm 5.3.4 (Baum-Welch algorithm, or EM algorithm for HMM) Select the state space $S$, and in particular, the cardinality, $N_s$, of the state space, and let $\theta^{(0)}$ denote a given initial choice of parameter. Given $\theta^{(k)}$, compute $\theta^{(k+1)}$ by using the forward-backward algorithm (Algorithm 5.3.1) with $\theta = \theta^{(k)}$ to compute the γ's and ξ's. Then use (5.27)-(5.29) to compute $\theta^{(k+1)} = (\pi^{(k+1)}, A^{(k+1)}, B^{(k+1)})$.
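One iteration of the Baum-Welch algorithm can be sketched as follows. The helper `e_step` is a compact scaled forward-backward pass and `baum_welch_step` applies the updates (5.27)-(5.29). Both names, the list-based data layout, and the 0-based time index are assumptions made for this illustration. The sketch also assumes the initial parameter has strictly positive entries and $T \ge 2$, so that the denominators are nonzero.

```python
import math

def e_step(pi, A, B, y):
    """Scaled forward-backward pass; returns (gamma, xi, log p(y|theta))."""
    n, T = len(pi), len(y)
    alpha, loglik = [], 0.0
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[j][y[0]] for j in range(n)]
        else:
            a = [sum(alpha[-1][i] * A[i][j] for i in range(n)) * B[j][y[t]]
                 for j in range(n)]
        c = sum(a)
        loglik += math.log(c)
        alpha.append([v / c for v in a])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 1, 0, -1):
        b = [sum(A[i][j] * B[j][y[t]] * beta[t][j] for j in range(n))
             for i in range(n)]
        c = sum(b)
        beta[t - 1] = [v / c for v in b]
    gamma = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(n)]
        s = sum(g)
        gamma.append([v / s for v in g])
    xi = []
    for t in range(T - 1):
        x = [[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j]
              for j in range(n)] for i in range(n)]
        s = sum(map(sum, x))
        xi.append([[v / s for v in row] for row in x])
    return gamma, xi, loglik

def baum_welch_step(pi, A, B, y):
    """One EM iteration: returns (new_pi, new_A, new_B, loglik of the old theta)."""
    n, m, T = len(pi), len(B[0]), len(y)
    gamma, xi, loglik = e_step(pi, A, B, y)
    new_pi = gamma[0][:]                                     # (5.27)
    new_A, new_B = [], []
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))      # denominator of (5.28)
        new_A.append([sum(xi[t][i][j] for t in range(T - 1)) / visits
                      for j in range(n)])
        total = sum(gamma[t][i] for t in range(T))           # denominator of (5.29)
        new_B.append([sum(gamma[t][i] for t in range(T) if y[t] == l) / total
                      for l in range(m)])
    return new_pi, new_A, new_B, loglik
```

Calling `baum_welch_step` repeatedly, feeding each output parameter back in, gives the iteration of Algorithm 5.3.4. The returned log-likelihood can be monitored to decide when to stop, since it should not decrease from one iteration to the next.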
5.4 Notes

The EM algorithm is due to A.P. Dempster, N.M. Laird, and D.B. Rubin [3]. The paper includes examples and a proof that the likelihood is increased with each iteration of the algorithm. An article on the convergence of the EM algorithm is given in [16]. Earlier related work includes that of Baum et al. [2], giving the Baum-Welch algorithm. A tutorial on inference for HMMs and applications to speech recognition is given in [11].

5.5 Problems

5.1 Estimation of a Poisson parameter
Suppose $Y$ is assumed to be a $Poi(\theta)$ random variable. Using the Bayesian method, suppose the prior distribution of $\theta$ is the exponential distribution with some known parameter $\lambda > 0$.
(a) Find $\hat{\Theta}_{MAP}(k)$, the MAP estimate of $\theta$ given that $Y = k$ is observed, for some $k \ge 0$.
(b) For what values of $\lambda$ is $\hat{\Theta}_{MAP}(k) \approx \hat{\theta}_{ML}(k)$? (The ML estimator was found in an earlier example.) Why should that be expected?

5.2 A variance estimation problem with Poisson observation
The input voltage to an optical device is $X$ and the number of photons observed at a detector is $N$. Suppose $X$ is a Gaussian random variable with mean zero and variance $\sigma^2$, and that given $X$, the random variable $N$ has the Poisson distribution with mean $X^2$. (Recall that the Poisson distribution with mean $\lambda$ has probability mass function $\lambda^n e^{-\lambda}/n!$ for $n \ge 0$.)
(a) Express $P\{N = n\}$ in terms of $\sigma^2$. You can express this as an integral. You do not have to perform the integration.
(b) Find the maximum likelihood estimator of $\sigma^2$ given $N$. (Caution: Estimate $\sigma^2$, not $X$. Be as explicit as possible; the final answer has a simple form. Hint: You can first simplify your answer to part (a) by using the fact that if $X$ is a $N(0, \tilde{\sigma}^2)$ random variable, then $E[X^{2n}] = \frac{\tilde{\sigma}^{2n}(2n)!}{n!\,2^n}$.)

5.3 Convergence of the EM algorithm for an example
The purpose of this exercise is to verify for Example 5.2.2 that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat{\theta}_{ML}$ as $k \to \infty$. As shown in the example, $\hat{\theta}_{ML} = (y^2 - \sigma^2)_+$. Let
\[ F(\theta) = \left( \frac{\theta}{\theta + \sigma^2} \right)^2 y^2 + \frac{\theta \sigma^2}{\theta + \sigma^2}, \]
so that the recursion (5.6) has the form $\theta^{(k+1)} = F(\theta^{(k)})$. Clearly, over $\mathbb{R}_+$, $F$ is increasing and bounded.
(a) Show that 0 is the only nonnegative solution of $F(\theta) = \theta$ if $y^2 \le \sigma^2$ and that 0 and $y^2 - \sigma^2$ are the only nonnegative solutions of $F(\theta) = \theta$ if $y^2 > \sigma^2$.
(b) Show that for small $\theta > 0$, $F(\theta) = \theta + \frac{\theta^2 (y^2 - \sigma^2)}{\sigma^4} + O(\theta^3)$. (Hint: For $0 < \theta < \sigma^2$, $\frac{\theta}{\theta + \sigma^2} = \frac{\theta}{\sigma^2} \cdot \frac{1}{1 + \theta/\sigma^2} = \frac{\theta}{\sigma^2}\left(1 - \frac{\theta}{\sigma^2} + \left(\frac{\theta}{\sigma^2}\right)^2 - \cdots\right)$.)
(c) Sketch $F$ and argue, using the above properties of $F$, that if $\theta^{(0)} > 0$, then $\theta^{(k)} \to \hat{\theta}_{ML}$.

5.4 Transformation of estimators and estimators of transformations
Consider estimating a parameter $\theta \in [0, 1]$ from an observation $Y$. A prior density of $\theta$ is available for the Bayesian estimators, MAP and MMSE, and the conditional density of $Y$ given $\theta$ is known. Answer the following questions and briefly explain your answers.
(a) Does $3 + 5\hat{\theta}_{ML} = \widehat{(3 + 5\theta)}_{ML}$?
(b) Does $(\hat{\theta}_{ML})^3 = \widehat{(\theta^3)}_{ML}$?
(c) Does $3 + 5\hat{\theta}_{MAP} = \widehat{(3 + 5\theta)}_{MAP}$?
(d) Does $(\hat{\theta}_{MAP})^3 = \widehat{(\theta^3)}_{MAP}$?
(e) Does $3 + 5\hat{\theta}_{MMSE} = \widehat{(3 + 5\theta)}_{MMSE}$?
(f) Does $(\hat{\theta}_{MMSE})^3 = \widehat{(\theta^3)}_{MMSE}$?

5.5 Using the EM algorithm for estimation of a signal variance
This problem generalizes Example 5.2.2 to vector observations. Suppose the observation is $Y = S + N$, such that the signal $S$ and noise $N$ are independent random vectors in $\mathbb{R}^d$. Assume that $S$ is $N(0, \theta I)$, and $N$ is $N(0, \Sigma_N)$, where $\theta$, with $\theta > 0$, is the parameter to be estimated, $I$ is the identity matrix, and $\Sigma_N$ is known.
(a) Suppose $\theta$ is known. Find the MMSE estimate of $S$, $\hat{S}_{MMSE}$, and find an expression for the covariance matrix of the error vector, $S - \hat{S}_{MMSE}$.
(b) Suppose now that $\theta$ is unknown. Describe a direct approach to computing $\hat{\theta}_{ML}(Y)$.
(c) Describe how $\hat{\theta}_{ML}(Y)$ can be computed using the EM algorithm.
(d) Consider how your answers to parts (b) and (c) simplify in case $d = 2$ and the covariance matrix of the noise, $\Sigma_N$, is the identity matrix.

5.6 Finding a most likely path
Consider an HMM with state space $S = \{0, 1\}$, observation space $\{0, 1, 2\}$, and parameter $\theta = (\pi, A, B)$ given by:
\[ \pi = (a, a^3), \qquad A = \begin{pmatrix} a & a^3 \\ a^3 & a \end{pmatrix}, \qquad B = \begin{pmatrix} ca & ca^2 & ca^3 \\ ca^2 & ca^3 & ca \end{pmatrix}. \]
Here $a$ and $c$ are positive constants. Their actual numerical values aren't important, other than the fact that $a < 1$. Find the MAP state sequence for the observation sequence 021201, using the Viterbi algorithm. Show your work.

5.7 An underconstrained estimation problem
Suppose the parameter $\theta = (\pi, A, B)$ for an HMM is unknown, but that it is assumed that the number of states $N_s$ in the state space $S$ for $(Z_t)$ is equal to the number of observations, $T$. Describe a trivial choice of the ML estimator $\hat{\theta}_{ML}(y)$ for a given observation sequence $y = (y_1, \ldots, y_T)$. What is the likelihood of $y$ for this choice of $\theta$?

5.8 Specialization of Baum-Welch algorithm for no hidden data
(a) Determine how the Baum-Welch algorithm simplifies in the special case that $B$ is the identity matrix, so that $Z_t = Y_t$ for all $t$.
(b) Still assuming that $B$ is the identity matrix, suppose that $S = \{0, 1\}$ and the observation sequence is 0001110001110001110001. Find the ML estimator for $\pi$ and $A$.

5.9 Free energy and the Boltzmann distribution
Let $S$ denote a finite set of possible states of a physical system, and suppose the (internal) energy of any state $s \in S$ is given by $V(s)$ for some function $V$ on $S$. Let $T > 0$. The Helmholtz free energy of a probability distribution $Q$ on $S$ is defined to be the average (internal) energy minus the temperature times entropy: $F(Q) = \sum_i Q(i)V(i) + T \sum_i Q(i) \log Q(i)$. Note that $F$ is a convex function of $Q$. (We're assuming Boltzmann's constant is normalized to one, so that $T$ should actually be in units of energy, but by abuse of notation we will call $T$ the temperature.)
(a) Use the method of Lagrange multipliers to show that the Boltzmann distribution defined by $B_T(i) = \frac{1}{Z(T)} \exp(-V(i)/T)$ minimizes $F(Q)$. Here $Z(T)$ is the normalizing constant required to make $B_T$ a probability distribution.
(b) Describe the limit of the Boltzmann distribution as $T \to \infty$.
(c) Describe the limit of the Boltzmann distribution as $T \to 0$. If it is possible to simulate a random variable with the Boltzmann distribution, does this suggest an application?
(d) Show that $F(Q) = T D(Q \| B_T) + (\text{term not depending on } Q)$. Therefore, given an energy function $V$ on $S$ and temperature $T > 0$, minimizing free energy over $Q$ in some set is equivalent to minimizing the divergence $D(Q \| B_T)$ over $Q$ in the same set.

5.10 Baum-Welch saddlepoint
Suppose that the Baum-Welch algorithm is run on a given data set with initial parameter $\theta^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$ such that $\pi^{(0)} = \pi^{(0)} A^{(0)}$ (i.e., the initial distribution of the state is an equilibrium distribution of the state) and every row of $B^{(0)}$ is identical. Explain what happens, assuming an ideal computer with infinite precision arithmetic is used.

5.11 Inference for a mixture model
(a) An observed random vector $Y$ is distributed as a mixture of Gaussian distributions in $d$ dimensions. The parameter of the mixture distribution is $\theta = (\theta_1, \ldots, \theta_J)$, where $\theta_j$ is a $d$-dimensional vector for $1 \le j \le J$. Specifically, to generate $Y$ a random variable $Z$, called the class label for the observation, is generated. The variable $Z$ is uniformly distributed on $\{1, \ldots, J\}$, and the conditional distribution of $Y$ given $(\theta, Z)$ is Gaussian with mean vector $\theta_Z$ and covariance the $d \times d$ identity matrix. The class label $Z$ is not observed. Assuming that $\theta$ is known, find the posterior pmf $p(z \mid y, \theta)$. Give a geometrical interpretation of the MAP estimate $\hat{Z}$ for a given observation $Y = y$.
(b) Suppose now that the parameter $\theta$ is random with the uniform prior over a very large region and suppose that given $\theta$, $n$ random variables are each generated as in part (a), independently, to produce $(Z^{(1)}, Y^{(1)}, Z^{(2)}, Y^{(2)}, \ldots, Z^{(n)}, Y^{(n)})$. Give an explicit expression for the joint distribution $P(\theta, z^{(1)}, y^{(1)}, z^{(2)}, y^{(2)}, \ldots, z^{(n)}, y^{(n)})$.
(c) The iterative conditional modes (ICM) algorithm for this example corresponds to taking turns maximizing $P(\theta, z^{(1)}, y^{(1)}, z^{(2)}, y^{(2)}, \ldots, z^{(n)}, y^{(n)})$ with respect to $\theta$ for $z$ fixed and with respect to $z$ for $\theta$ fixed. Give a simple geometric description of how the algorithm works and suggest a method to initialize the algorithm (there is no unique answer for the latter).
(d) Derive the EM algorithm for this example, in an attempt to compute the maximum likelihood estimate of $\theta$ given $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$.

5.12 Constraining the Baum-Welch algorithm
The Baum-Welch algorithm as presented placed no prior assumptions on the parameters $\pi$, $A$, $B$, other than the number of states $N_s$ in the state space of $(Z_t)$. Suppose matrices $\bar{A}$ and $\bar{B}$ are given with the same dimensions as the matrices $A$ and $B$ to be estimated, with all elements of $\bar{A}$ and $\bar{B}$ having values 0 and 1. Suppose that $A$ and $B$ are constrained to satisfy $A \le \bar{A}$ and $B \le \bar{B}$, in the element-by-element ordering (for example, $a_{ij} \le \bar{a}_{ij}$ for all $i, j$). Explain how the Baum-Welch algorithm can be adapted to this situation.

5.13 * Implementation of algorithms
Write a computer program to
(a) simulate an HMM on a computer for a specified value of the parameter $\theta = (\pi, A, B)$,
(b) run the forward-backward algorithm and compute the α's, β's, γ's, and ξ's,
(c) run the Baum-Welch algorithm.
Experiment a bit and describe your results. For example, if $T$ observations are generated, and then the Baum-Welch algorithm is used to estimate the parameter, how large does $T$ need to be to ensure that the estimates of $\theta$ are reasonably accurate?
Chapter 6

Dynamics of Countable-State Markov Models

Markov processes are useful for modeling a variety of dynamical systems. Often questions involving the long-time behavior of such systems are of interest, such as whether the process has a limiting distribution, or whether time-averages constructed using the process are asymptotically the same as statistical averages.

6.1 Examples with finite state space

Recall that a probability distribution $\pi$ on $S$ is an equilibrium probability distribution for a time-homogeneous Markov process $X$ if $\pi = \pi H(t)$ for all $t$. In the discrete-time case, this condition reduces to $\pi = \pi P$. We shall see in this section that under certain natural conditions, the existence of an equilibrium probability distribution is related to whether the distribution of $X(t)$ converges as $t \to \infty$. Existence of an equilibrium distribution is also connected to the mean time needed for $X$ to return to its starting state. To motivate the conditions that will be imposed, we begin by considering four examples of finite state processes. Then the relevant definitions are given for finite or countably-infinite state space, and propositions regarding convergence are presented.

Example 6.1.1 Consider the discrete-time Markov process with the one-step probability diagram shown in Figure 6.1. Note that the process can't escape from the set of states $S_1 = \{a, b, c, d, e\}$, so that if the initial state $X(0)$ is in $S_1$ with probability one, then the limiting distribution is supported by $S_1$. Similarly if the initial state $X(0)$ is in $S_2 = \{f, g, h\}$ with probability one, then the limiting distribution is supported by $S_2$. Thus, the limiting distribution is not unique for this process. The natural way to deal with this problem is to decompose the original problem into two problems. That is, consider a Markov process on $S_1$, and then consider a Markov process on $S_2$.

Figure 6.1: A one-step transition probability diagram with eight states.

Does the distribution of $X(k)$ necessarily converge if $X(0) \in S_1$ with probability one? The answer is no. For example, note that if $X(0) = a$, then $X(k) \in \{a, c, e\}$ for all even values of $k$, whereas $X(k) \in \{b, d\}$ for all odd values of $k$. That is, $\pi_a(k) + \pi_c(k) + \pi_e(k)$ is one if $k$ is even and is zero if $k$ is odd. Therefore, if $\pi_a(0) = 1$, then $\pi(k)$ does not converge as $k \to \infty$.

Basically speaking, the Markov process of Example 6.1.1 fails to have a unique limiting distribution independent of the initial state for two reasons: (i) the process is not irreducible, and (ii) the process is not aperiodic.

Example 6.1.2 Consider the two-state, continuous time Markov process with the transition rate diagram shown in Figure 6.2 for some positive constants $\alpha$ and $\beta$. This was already considered in Example 4.9.3, where we found that for any initial distribution $\pi(0)$,

\[ \lim_{t \to \infty} \pi(t) = \lim_{t \to \infty} \pi(0)H(t) = \left( \frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta} \right). \]

The rate of convergence is exponential, with rate parameter $\alpha + \beta$, which happens to be the magnitude of the nonzero eigenvalue of $Q$. Note that the limiting distribution is the unique probability distribution satisfying $\pi Q = 0$. The periodicity problem of Example 6.1.1 does not arise for continuous-time processes.

Figure 6.2: A transition rate diagram with two states.

Example 6.1.3 Consider the continuous-time Markov process with the transition rate diagram in Figure 6.3. The $Q$ matrix is the block-diagonal matrix given by

\[ Q = \begin{pmatrix} -\alpha & \alpha & 0 & 0 \\ \beta & -\beta & 0 & 0 \\ 0 & 0 & -\alpha & \alpha \\ 0 & 0 & \beta & -\beta \end{pmatrix}. \]

This process is not irreducible, but rather the transition rate diagram can be decomposed into two parts, each equivalent to the diagram for Example 6.1.2. The equilibrium probability distributions are the probability distributions of the form $\pi = \left( \lambda \frac{\beta}{\alpha+\beta}, \lambda \frac{\alpha}{\alpha+\beta}, (1-\lambda) \frac{\beta}{\alpha+\beta}, (1-\lambda) \frac{\alpha}{\alpha+\beta} \right)$, where $\lambda$ is the probability placed on the subset $\{1, 2\}$.

Figure 6.3: A transition rate diagram with four states.

Example 6.1.4 Consider the discrete-time Markov process with the transition probability diagram in Figure 6.4. The one-step transition probability matrix $P$ is given by

\[ P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}. \]

Solving the equation $\pi = \pi P$ we find there is a unique equilibrium probability vector, namely $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. On the other hand, if $\pi(0) = (1, 0, 0)$, then

\[ \pi(k) = \pi(0)P^k = \begin{cases} (1, 0, 0) & \text{if } k \equiv 0 \bmod 3 \\ (0, 1, 0) & \text{if } k \equiv 1 \bmod 3 \\ (0, 0, 1) & \text{if } k \equiv 2 \bmod 3. \end{cases} \]

Therefore, $\pi(k)$ does not converge as $k \to \infty$.

Figure 6.4: A one-step transition probability diagram with three states.
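The behavior in Example 6.1.4 is easy to check numerically. The short Python sketch below iterates $\pi(k+1) = \pi(k)P$ for the three-state cyclic chain; the helper names `step` and `orbit` are chosen only for this illustration.

```python
def step(pi, P):
    """One step of the distribution recursion pi(k+1) = pi(k) P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def orbit(pi0, P, steps):
    """Return [pi(0), pi(1), ..., pi(steps)]."""
    out = [list(pi0)]
    for _ in range(steps):
        out.append(step(out[-1], P))
    return out

# One-step transition matrix of Example 6.1.4 (a deterministic 3-cycle).
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

cycle = orbit([1.0, 0.0, 0.0], P, 6)   # point mass rotates with period 3
uniform = [1.0 / 3] * 3                # candidate equilibrium vector
```

Starting from the point mass, the distribution keeps rotating and never settles, while `step(uniform, P)` returns the uniform vector again. So an equilibrium distribution exists and is unique even though $\pi(k)$ need not converge to it.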
6.2 Classification and convergence of discrete-time Markov processes

The following definition applies for either discrete time or continuous time.

Definition 6.2.1 Let $X$ be a time-homogeneous Markov process on the countable state space $S$. The process is said to be irreducible if for all $i, j \in S$, there exists $s > 0$ so that $p_{ij}(s) > 0$.

The next definition is relevant only for discrete-time processes.

Definition 6.2.2 The period of a state $i$ is defined to be $GCD\{k \ge 0 : p_{ii}(k) > 0\}$, where "GCD" stands for greatest common divisor. The set $\{k \ge 0 : p_{ii}(k) > 0\}$ is closed under addition, which by a result in elementary algebra (such as the Euclidean algorithm, Chinese remainder theorem, or Bezout theorem) implies that the set contains all sufficiently large integer multiples of the period. The Markov process is called aperiodic if the period of all the states is one.

Proposition 6.2.3 If $X$ is irreducible, all states have the same period.

Proof. Let $i$ and $j$ be two states. By irreducibility, there are integers $k_1$ and $k_2$ so that $p_{ij}(k_1) > 0$ and $p_{ji}(k_2) > 0$. For any integer $n$, $p_{ii}(n + k_1 + k_2) \ge p_{ij}(k_1) p_{jj}(n) p_{ji}(k_2)$, so the set $\{k \ge 0 : p_{ii}(k) > 0\}$ contains the set $\{k \ge 0 : p_{jj}(k) > 0\}$ translated up by $k_1 + k_2$. Thus the period of $i$ is less than or equal to the period of $j$. Since $i$ and $j$ were arbitrary states, the proposition follows.

For a fixed state $i$, define $\tau_i = \min\{k \ge 1 : X(k) = i\}$, where we adopt the convention that the minimum of an empty set of numbers is $+\infty$. Let $M_i = E[\tau_i \mid X(0) = i]$. If $P[\tau_i < +\infty \mid X(0) = i] < 1$, then state $i$ is called transient (and by convention, $M_i = +\infty$). Otherwise $P[\tau_i < +\infty \mid X(0) = i] = 1$, and $i$ is then said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if $M_i = +\infty$.

Proposition 6.2.4 Suppose $X$ is irreducible and aperiodic.
(a) All states are transient, or all are positive recurrent, or all are null recurrent.
(b) For any initial distribution $\pi(0)$, $\lim_{t \to \infty} \pi_i(t) = 1/M_i$, with the understanding that the limit is zero if $M_i = +\infty$.
(c) An equilibrium probability distribution $\pi$ exists if and only if all states are positive recurrent.
(d) If it exists, the equilibrium probability distribution $\pi$ is given by $\pi_i = 1/M_i$. (In particular, if it exists, the equilibrium probability distribution is unique).

Proof. (a) Suppose state $i$ is recurrent. Given $X(0) = i$, after leaving $i$ the process returns to state $i$ at time $\tau_i$. The process during the time interval $\{0, \ldots, \tau_i\}$ is the first excursion of $X$ from state $i$. From time $\tau_i$ onward, the process behaves just as it did initially. Thus there is a second excursion from $i$, third excursion from $i$, and so on. Let $T_k$ for $k \ge 1$ denote the length of the $k$th excursion. Then the $T_k$'s are independent, and each has the same distribution as $T_1 = \tau_i$. Let $j$ be another state and let $\epsilon$ denote the probability that $X$ visits state $j$ during one excursion from $i$. Since $X$ is irreducible, $\epsilon > 0$. The excursions are independent, so state $j$ is visited during the $k$th excursion with probability $\epsilon$, independently of whether $j$ was visited in earlier excursions. Thus, the number of excursions needed until state $j$ is reached has the geometric distribution with parameter $\epsilon$, which has mean $1/\epsilon$. In particular, state $j$ is eventually visited with probability one. After $j$ is visited the process eventually returns to state $i$, and then within an average of $1/\epsilon$ additional excursions, it will return to state $j$ again. Thus, state $j$ is also recurrent. Hence, if one state is recurrent, all states are recurrent.

The same argument shows that if $i$ is positive recurrent, then $j$ is positive recurrent. Given $X(0) = i$, the mean time needed for the process to visit $j$ and then return to $i$ is $M_i/\epsilon$, since on average $1/\epsilon$ excursions of mean length $M_i$ are needed. Thus, the mean time to hit $j$ starting from $i$, and the mean time to hit $i$ starting from $j$, are both finite. Thus, $j$ is positive recurrent. Hence, if one state is positive recurrent, all states are positive recurrent.

(b) Part (b) of the proposition follows by an application of the renewal theorem, which can be found in [1].

(c) Suppose all states are positive recurrent. By the law of large numbers, for any state $j$, the long run fraction of time the process is in state $j$ is $1/M_j$ with probability one. Similarly, for any states $i$ and $j$, the long run fraction of time the process is in state $j$ is $\gamma_{ij}/M_i$, where $\gamma_{ij}$ is the mean number of visits to $j$ in an excursion from $i$. Therefore $1/M_j = \gamma_{ij}/M_i$. This implies that $\sum_i 1/M_i = 1$. That is, $\pi$ defined by $\pi_i = 1/M_i$ is a probability distribution. The convergence for each $i$ separately given in part (b), together with the fact that $\pi$ is a probability distribution, imply that $\sum_i |\pi_i(t) - \pi_i| \to 0$. Thus, taking $s$ to infinity in the equation $\pi(s)H(t) = \pi(s + t)$ yields $\pi H(t) = \pi$, so that $\pi$ is an equilibrium probability distribution.

Conversely, if there is an equilibrium probability distribution $\pi$, consider running the process with initial distribution $\pi$. Then $\pi(t) = \pi$ for all $t$. So by part (b), for any state $i$, $\pi_i = 1/M_i$. Taking a state $i$ such that $\pi_i > 0$, it follows that $M_i < \infty$. So state $i$ is positive recurrent. By part (a), all states are positive recurrent.

(d) Part (d) was proved in the course of proving part (c).

We conclude this section by describing a technique to establish a rate of convergence to the equilibrium distribution for finite-state Markov processes. Define $\delta(P)$ for a one-step transition probability matrix $P$ by

\[ \delta(P) = \min_{i,k} \sum_j p_{ij} \wedge p_{kj}, \]

where $a \wedge b = \min\{a, b\}$. The number $\delta(P)$ is known as Dobrushin's coefficient of ergodicity. Since $a + b - 2(a \wedge b) = |a - b|$ for $a, b \ge 0$, and the rows of $P$ sum to one, we also have

\[ 2 - 2\delta(P) = \max_{i,k} \sum_j |p_{ij} - p_{kj}|. \]

Let $\|\mu\|$ for a vector $\mu$ denote the $L_1$ norm: $\|\mu\| = \sum_i |\mu_i|$.

Proposition 6.2.5 For any probability vectors $\pi$ and $\sigma$, $\|\pi P - \sigma P\| \le (1 - \delta(P)) \|\pi - \sigma\|$. Furthermore, if $\delta(P) > 0$ then there is a unique equilibrium distribution $\pi^\infty$, and for any other probability distribution $\pi$ on $S$, $\|\pi P^l - \pi^\infty\| \le 2(1 - \delta(P))^l$.

Proof. Let $\tilde{\pi}_i = \pi_i - \pi_i \wedge \sigma_i$ and $\tilde{\sigma}_i = \sigma_i - \pi_i \wedge \sigma_i$. Note that if $\pi_i \ge \sigma_i$ then $\tilde{\pi}_i = \pi_i - \sigma_i$ and $\tilde{\sigma}_i = 0$, and if $\pi_i \le \sigma_i$ then $\tilde{\sigma}_i = \sigma_i - \pi_i$ and $\tilde{\pi}_i = 0$. Also, $\|\tilde{\pi}\|$ and $\|\tilde{\sigma}\|$ are both equal to $1 - \sum_i \pi_i \wedge \sigma_i$. Therefore, $\|\pi - \sigma\| = \|\tilde{\pi} - \tilde{\sigma}\| = 2\|\tilde{\pi}\| = 2\|\tilde{\sigma}\|$. Furthermore,

\begin{align*}
\|\pi P - \sigma P\| &= \|\tilde{\pi} P - \tilde{\sigma} P\| \\
&= \sum_j \Big| \sum_i \tilde{\pi}_i P_{ij} - \sum_k \tilde{\sigma}_k P_{kj} \Big| \\
&= (1/\|\tilde{\pi}\|) \sum_j \Big| \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k (P_{ij} - P_{kj}) \Big| \\
&\le (1/\|\tilde{\pi}\|) \sum_{i,k} \tilde{\pi}_i \tilde{\sigma}_k \sum_j |P_{ij} - P_{kj}| \\
&\le \|\tilde{\pi}\| (2 - 2\delta(P)) = \|\pi - \sigma\| (1 - \delta(P)),
\end{align*}

which proves the first part of the proposition. Iterating the inequality just proved yields that

\[ \|\pi P^l - \sigma P^l\| \le (1 - \delta(P))^l \|\pi - \sigma\| \le 2(1 - \delta(P))^l. \tag{6.1} \]

This inequality for $\sigma = \pi P^n$ yields that $\|\pi P^l - \pi P^{l+n}\| \le 2(1 - \delta(P))^l$. Thus the sequence $\pi P^l$ is a Cauchy sequence and has a limit $\pi^\infty$, and $\pi^\infty P = \pi^\infty$. Finally, taking $\sigma$ in (6.1) equal to $\pi^\infty$ yields the last part of the proposition.

Proposition 6.2.5 typically does not yield the exact asymptotic rate that $\|\pi P^l - \pi^\infty\|$ tends to zero. The asymptotic behavior can be investigated by computing $(I - zP)^{-1}$, and then matching powers of $z$ in the identity $(I - zP)^{-1} = \sum_{n=0}^{\infty} z^n P^n$.
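To make the bound of Proposition 6.2.5 concrete, the following Python sketch computes Dobrushin's coefficient for a small transition matrix and compares the actual $L_1$ distance to equilibrium with the bound $2(1 - \delta(P))^l$. The function names and the particular matrix `P` are assumptions made for this example, not taken from the notes. The sketch also evaluates $\max_{i,k} \sum_j |p_{ij} - p_{kj}|$, which equals $2 - 2\delta(P)$ because $a + b - 2(a \wedge b) = |a - b|$ and the rows of $P$ sum to one.

```python
def dobrushin(P):
    """Dobrushin's coefficient: min over row pairs of sum_j min(p_ij, p_kj)."""
    n = len(P)
    return min(sum(min(P[i][j], P[k][j]) for j in range(n))
               for i in range(n) for k in range(n))

def max_row_l1(P):
    """max over row pairs of sum_j |p_ij - p_kj|; should equal 2 - 2*delta(P)."""
    n = len(P)
    return max(sum(abs(P[i][j] - P[k][j]) for j in range(n))
               for i in range(n) for k in range(n))

def step(pi, P):
    """One step of pi(k+1) = pi(k) P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical example: a lazy nearest-neighbor walk on three states.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi_inf = [0.25, 0.50, 0.25]          # satisfies pi_inf = pi_inf P
delta = dobrushin(P)

# Distance to equilibrium after l steps, paired with the bound 2(1-delta)^l.
pi, gaps = [1.0, 0.0, 0.0], []
for l in range(1, 11):
    pi = step(pi, P)
    gaps.append((l1(pi, pi_inf), 2 * (1 - delta) ** l))
```

For this matrix the bound is conservative by a constant factor, consistent with the comment that Proposition 6.2.5 typically does not give the exact asymptotic rate.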
and then matching powers of $z$ in the identity $(I - zP)^{-1} = \sum_{n=0}^{\infty} z^n P^n$.

$\|\pi - \sigma\| = \|\tilde{\pi} - \tilde{\sigma}\| = 2\|\tilde{\pi}\| = 2\|\tilde{\sigma}\|$, $1 - \sum_i \pi_i \wedge \sigma_i$. (6.1)

This inequality for $\sigma = \pi P^n$ yields that $\|\pi P^l - \pi P^{l+n}\| \le 2(1 - \delta(P))^l$, and $\pi^\infty P = \pi^\infty$. (6.1) equal to $\pi^\infty$ yields the last part of the proposition.

6.3 Classification and convergence of continuous-time Markov processes

Here we extend the coverage of continuous-time Markov processes to include countably infinitely many states. For example, the state of a simple queue could be the number of customers in the queue, and if there is no upper bound on the number of customers that can be waiting in the queue, the state space is $\mathbb{Z}_+$. One possible complication, that rarely arises in practice, is that a continuous-time process can make infinitely many jumps in a finite amount of time.

Let $S$ be a finite or countably infinite set, and let $\triangle \notin S$. A pure-jump function is a function $x : \mathbb{R}_+ \to S \cup \{\triangle\}$ such that there is a sequence of times, $0 = \tau_0 < \tau_1 < \cdots$, and a sequence of states, $s_0, s_1, \ldots$, with $s_i \in S$, and $s_i \ne s_{i+1}$, $i \ge 0$, so that

$$x(t) = \begin{cases} s_i & \text{if } \tau_i \le t < \tau_{i+1},\ i \ge 0 \\ \triangle & \text{if } t \ge \tau^* \end{cases} \tag{6.2}$$

where $\tau^* = \lim_{i \to \infty} \tau_i$. If $\tau^*$ is finite it is said to be the explosion time of the function $x$, and if $\tau^* = +\infty$ the function is said to be nonexplosive. The example corresponding to $S = \{0, 1, \ldots\}$, $\tau_i = i/(i+1)$ and $s_i = i$ is pictured in Fig. 6.5. Note that $\tau^* = 1$ for this example.

Figure 6.5: A pure-jump function with an explosion time.

Definition 6.3.1 A pure-jump Markov process $(X_t : t \ge 0)$ is a Markov process such that, with probability one, its sample paths are pure-jump functions. Such a process is said to be nonexplosive if its sample paths are nonexplosive, with probability one.

Generator matrices are defined for countable-state Markov processes just as they are for finite-state Markov processes. A pure-jump, time-homogeneous Markov process $X$ has generator matrix $Q = (q_{ij} : i, j \in S)$ if

$$\lim_{h \searrow 0} \big(p_{ij}(h) - I_{\{i=j\}}\big)/h = q_{ij}, \quad i, j \in S \tag{6.3}$$

or equivalently

$$p_{ij}(h) = I_{\{i=j\}} + h q_{ij} + o(h), \quad i, j \in S \tag{6.4}$$

where $o(h)$ represents a quantity such that $\lim_{h \to 0} o(h)/h = 0$.

The space-time properties for continuous-time Markov processes with a countably infinite number of states are the same as for a finite number of states. There is a discrete-time jump process, and the holding times, given the jump process, are exponentially distributed. Also, the following holds.

Proposition 6.3.2 Given a matrix $Q = (q_{ij} : i, j \in S)$ satisfying $q_{ij} \ge 0$ for distinct states $i$ and $j$, and $q_{ii} = -\sum_{j \in S, j \ne i} q_{ij}$ for each state $i$, and a probability distribution $\pi(0) = (\pi_i(0) : i \in S)$, there is a pure-jump, time-homogeneous Markov process with generator matrix $Q$ and initial distribution $\pi(0)$. The finite-dimensional distributions of the process are uniquely determined by $\pi(0)$ and $Q$. The Chapman-Kolmogorov equations, $H(s, t) = H(s, \tau)H(\tau, t)$, and the Kolmogorov forward equations, $\frac{\partial \pi_j(t)}{\partial t} = \sum_{i \in S} \pi_i(t) q_{ij}$, hold.

Example 6.3.3 (Birth-death processes) A useful class of countable-state Markov processes is the set of birth-death processes. A (continuous-time) birth-death process with parameters $(\lambda_0, \lambda_1, \ldots)$ and $(\mu_1, \mu_2, \ldots)$ (also set $\lambda_{-1} = \mu_0 = 0$) is a pure-jump Markov process with state space $S = \mathbb{Z}_+$ and generator matrix $Q$ defined by $q_{k,k+1} = \lambda_k$, $q_{kk} = -(\mu_k + \lambda_k)$, and $q_{k,k-1} = \mu_k$ for $k \ge 0$, and $q_{ij} = 0$ if $|i - j| \ge 2$. The transition rate diagram is shown in Fig. 6.6. The space-time structure, as defined in Section 4.10, of such a process is as follows. Given the process is in state $k$ at time $t$, the next state visited is $k+1$ with probability $\lambda_k/(\lambda_k + \mu_k)$ and $k-1$ with probability $\mu_k/(\lambda_k + \mu_k)$. The holding time of state $k$ is exponential with parameter $\lambda_k + \mu_k$. The Kolmogorov forward equations for birth-death processes are

$$\frac{\partial \pi_k(t)}{\partial t} = \lambda_{k-1}\pi_{k-1}(t) - (\lambda_k + \mu_k)\pi_k(t) + \mu_{k+1}\pi_{k+1}(t) \tag{6.5}$$

Figure 6.6: Transition rate diagram of a birth-death process.

Example 6.3.4 (Description of a Poisson process as a Markov process) Let $\lambda > 0$ and consider a birth-death process $N$ with $\lambda_k = \lambda$ and $\mu_k = 0$ for all $k$, with initial state zero with probability one. The space-time structure of this Markov process is then rather simple. Each transition is an upward jump of size one, so the jump process is deterministic: $N^J(k) = k$ for all $k$. Ordinarily, the holding times are only conditionally independent given the jump process, but since the jump process is deterministic, the holding times are independent. Also, since $q_{k,k} = -\lambda$ for all $k$, each holding time is exponentially distributed with parameter $\lambda$. Therefore, $N$ satisfies condition (b) of Proposition 4.5.2, so that $N$ is a Poisson process with rate $\lambda$.

Define, for $i \in S$, $\tau_i^o = \min\{t > 0 : X(t) \ne i\}$, and $\tau_i = \min\{t > \tau_i^o : X(t) = i\}$. Thus, if $X(0) = i$, $\tau_i$ is the first time the process returns to state $i$, with the exception that $\tau_i = +\infty$ if the process never returns to state $i$. The following definitions are the same as when $X$ is a discrete-time process. Let $M_i = E[\tau_i \mid X(0) = i]$. If $P[\tau_i < +\infty] < 1$, then state $i$ is called transient. Otherwise $P[\tau_i < +\infty] = 1$, and $i$ is then said to be positive recurrent if $M_i < +\infty$ and to be null recurrent if $M_i = +\infty$.

The following propositions are analogous to those for discrete-time Markov processes. Proofs can be found in [1, 10].

Proposition 6.3.5 Suppose $X$ is irreducible.
(a) All states are transient, or all are positive recurrent, or all are null recurrent.
(b) For any initial distribution $\pi(0)$, $\lim_{t \to +\infty} \pi_i(t) = 1/(-q_{ii} M_i)$, with the understanding that the limit is zero if $M_i = +\infty$.

Proposition 6.3.6 Suppose $X$ is irreducible and nonexplosive.
(a) A probability distribution $\pi$ is an equilibrium distribution if and only if $\pi Q = 0$.
(b) An equilibrium probability distribution exists if and only if all states are positive recurrent.
(c) If all states are positive recurrent, the equilibrium probability distribution is given by $\pi_i = 1/(-q_{ii} M_i)$. (In particular, if it exists, the equilibrium probability distribution is unique.)

The assumption that $X$ be nonexplosive is needed for Proposition 6.3.6(a) (per Problem 6.9), but the following proposition shows that the Markov processes encountered in most applications are nonexplosive.

Proposition 6.3.7 Suppose $X$ is irreducible. Fix a state $i_o$ and for $k \ge 1$ let $S_k$ denote the set of states reachable from $i_o$ in $k$ jumps. Suppose for each $k \ge 1$ there is a constant $\gamma_k$ so that the jump intensities on $S_k$ are bounded by $\gamma_k$, that is, suppose $-q_{ii} \le \gamma_k$ for $i \in S_k$. If $\sum_{k=1}^{\infty} \frac{1}{\gamma_k} = +\infty$, then the process $X$ is nonexplosive.
6.4 Classification of birth-death processes

The classification of birth-death processes, introduced in Example 6.3.3, is relatively simple. To avoid trivialities, consider a birth-death process such that the birth rates, $(\lambda_i : i \ge 0)$, and death rates, $(\mu_i : i \ge 1)$, are all strictly positive. Then the process is irreducible.

First, investigate whether the process is nonexplosive, because this is a necessary condition for both recurrence and positive recurrence. This is usually a simple matter, because if the rates are bounded or grow at most linearly, the process is nonexplosive by Proposition 6.3.7. In some cases, even if Proposition 6.3.7 doesn't apply, it can be shown by some other means that the process is nonexplosive. For example, a test is given below for the process to be recurrent, and if it is recurrent, it is not explosive.

Next, investigate whether $X$ is positive recurrent. Suppose we already know that the process is nonexplosive. Then the process is positive recurrent if and only if $\pi Q = 0$ for some probability distribution $\pi$, and if it is positive recurrent, $\pi$ is the equilibrium distribution. Now $\pi Q = 0$ if and only if flow balance holds for any state $k$:

$$(\lambda_k + \mu_k)\pi_k = \lambda_{k-1}\pi_{k-1} + \mu_{k+1}\pi_{k+1}. \tag{6.6}$$

Equivalently, flow balance must hold for all sets of the form $\{0, \ldots, n-1\}$ (just sum each side of (6.6) over $k \in \{1, \ldots, n-1\}$). Therefore, $\pi Q = 0$ if and only if $\pi_{n-1}\lambda_{n-1} = \pi_n \mu_n$ for $n \ge 1$, which holds if and only if there is a probability distribution $\pi$ with $\pi_n = \pi_0 \lambda_0 \cdots \lambda_{n-1}/(\mu_1 \cdots \mu_n)$ for $n \ge 1$. Thus, a probability distribution $\pi$ with $\pi Q = 0$ exists if and only if $S_1 < +\infty$, where

$$S_1 = \sum_{i=0}^{\infty} \frac{\lambda_0 \cdots \lambda_{i-1}}{\mu_1 \cdots \mu_i}, \tag{6.7}$$

with the understanding that the $i = 0$ term in the sum defining $S_1$ is one. Thus, under the assumption that $X$ is nonexplosive, $X$ is positive recurrent if and only if $S_1 < \infty$, and if $X$ is positive recurrent, the equilibrium distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1 \mu_1 \cdots \mu_n)$.

Finally, investigate whether $X$ is recurrent. This step is not necessary if we already know that $X$ is positive recurrent, because a positive recurrent process is recurrent. The following test for recurrence is valid whether or not $X$ is explosive. Since all states have the same classification, the process is recurrent if and only if state 0 is recurrent. Thus, the process is recurrent if the probability the process never hits 0, for initial state 1, is zero. We shall first find the probability of never hitting state zero for a modified process, which stops upon reaching a large state $n$, and then let $n \to \infty$ to find the probability the original process never reaches state 0. Let $b_{in}$ denote the probability, for initial state $i$, the process does not reach zero before reaching $n$. Set the boundary conditions, $b_{0n} = 0$ and $b_{nn} = 1$. Fix $i$ with $1 \le i \le n-1$, and derive an expression for $b_{in}$ by first conditioning on the state reached by the first jump of the process, starting from state $i$. By the space-time structure, the probability the first jump is up is $\lambda_i/(\lambda_i + \mu_i)$ and the probability the first jump is down is $\mu_i/(\lambda_i + \mu_i)$. Thus,

$$b_{in} = \frac{\lambda_i}{\lambda_i + \mu_i} b_{i+1,n} + \frac{\mu_i}{\lambda_i + \mu_i} b_{i-1,n},$$

which can be rewritten as $\mu_i(b_{in} - b_{i-1,n}) = \lambda_i(b_{i+1,n} - b_{i,n})$. In particular, $b_{2n} - b_{1n} = b_{1n}\mu_1/\lambda_1$ and $b_{3n} - b_{2n} = b_{1n}\mu_1\mu_2/(\lambda_1\lambda_2)$, and so on, which upon summing yields the expression

$$b_{kn} = b_{1n} \sum_{i=0}^{k-1} \frac{\mu_1 \mu_2 \cdots \mu_i}{\lambda_1 \lambda_2 \cdots \lambda_i},$$

with the convention that the $i = 0$ term in the sum is one. Finally, the condition $b_{nn} = 1$ yields the solution

$$b_{1n} = \frac{1}{\sum_{i=0}^{n-1} \frac{\mu_1 \mu_2 \cdots \mu_i}{\lambda_1 \lambda_2 \cdots \lambda_i}}. \tag{6.8}$$

Note that $b_{1n}$ is the probability, for initial state 1, of the event $B_n$ that state $n$ is reached without an earlier visit to state 0. Since $B_{n+1} \subset B_n$ for all $n \ge 1$,

$$P[\cap_{n \ge 1} B_n \mid X(0) = 1] = \lim_{n \to \infty} b_{1n} = 1/S_2 \tag{6.9}$$

where

$$S_2 = \sum_{i=0}^{\infty} \frac{\mu_1 \mu_2 \cdots \mu_i}{\lambda_1 \lambda_2 \cdots \lambda_i},$$

with the understanding that the $i = 0$ term in the sum defining $S_2$ is one. Due to the definition of pure jump processes used, whenever $X$ visits a state in $S$ the number of jumps up until that time is finite. Thus, on the event $\cap_{n \ge 1} B_n$, state zero is never reached. Conversely, if state zero is never reached, then either the process remains bounded (which has probability zero) or $\cap_{n \ge 1} B_n$ is true. Thus, $P[\text{zero is never reached} \mid X(0) = 1] = 1/S_2$. Consequently, $X$ is recurrent if and only if $S_2 = \infty$.

In summary, the following proposition is proved.

Proposition 6.4.1 Suppose $X$ is a continuous-time birth-death process with strictly positive birth rates and death rates. If $X$ is nonexplosive (for example, if the rates are bounded or grow at most linearly with $n$, or if $S_2 = \infty$) then $X$ is positive recurrent if and only if $S_1 < +\infty$. If $X$ is positive recurrent the equilibrium probability distribution is given by $\pi_n = (\lambda_0 \cdots \lambda_{n-1})/(S_1 \mu_1 \cdots \mu_n)$. The process $X$ is recurrent if and only if $S_2 = \infty$.

Discrete-time birth-death processes have a similar characterization. They are discrete-time, time-homogeneous Markov processes with state space equal to the set of nonnegative integers. Let nonnegative birth probabilities $(\lambda_k : k \ge 0)$ and death probabilities $(\mu_k : k \ge 1)$ satisfy $\lambda_0 \le 1$, and $\lambda_k + \mu_k \le 1$ for $k \ge 1$. The one-step transition probability matrix $P = (p_{ij} : i, j \ge 0)$ is given by

$$p_{ij} = \begin{cases} \lambda_i & \text{if } j = i+1 \\ \mu_i & \text{if } j = i-1 \\ 1 - \lambda_i - \mu_i & \text{if } j = i \ge 1 \\ 1 - \lambda_0 & \text{if } j = i = 0 \\ 0 & \text{else.} \end{cases} \tag{6.10}$$

Implicit in the specification of $P$ is that births and deaths can't happen simultaneously. If the birth and death probabilities are strictly positive, Proposition 6.4.1 holds as before, with the exception that the discrete-time process cannot be explosive. (If in addition $\lambda_i + \mu_i = 1$ for all $i$, then the discrete-time process has period 2.)
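The classification test for birth-death processes can be explored numerically through partial sums of $S_1$ and $S_2$. The following is a minimal sketch, not part of the original notes: the function names and the convention of passing the rates as callables `lam(k)` and `mu(k)` are assumptions made for illustration. A partial sum can only suggest divergence, never prove it.

```python
def birth_death_terms(lam, mu, n_terms):
    """Return the first n_terms summands of S1 and of S2.

    lam(k): birth rate in state k (k >= 0); mu(k): death rate in state k (k >= 1).
    S1 summand i = lam(0)...lam(i-1) / (mu(1)...mu(i)); the i = 0 summand is 1.
    S2 summand i = mu(1)...mu(i) / (lam(1)...lam(i)); the i = 0 summand is 1.
    """
    s1_terms, s2_terms = [1.0], [1.0]
    for i in range(1, n_terms):
        s1_terms.append(s1_terms[-1] * lam(i - 1) / mu(i))
        s2_terms.append(s2_terms[-1] * mu(i) / lam(i))
    return s1_terms, s2_terms


def truncated_equilibrium(lam, mu, n_terms):
    """Approximate pi_n = (S1 summand n) / S1 using a truncated sum for S1."""
    s1_terms, _ = birth_death_terms(lam, mu, n_terms)
    total = sum(s1_terms)
    return [term / total for term in s1_terms]


if __name__ == "__main__":
    # Constant rates lam = 1, mu = 2: the S1 summands decay geometrically,
    # while the S2 summands grow, which is consistent with positive recurrence.
    pi = truncated_equilibrium(lambda k: 1.0, lambda k: 2.0, 60)
    print(pi[:4])
```

For constant rates $\lambda_k = 1$ and $\mu_k = 2$ the summands of $S_1$ are $2^{-i}$, so the truncated equilibrium is close to the geometric distribution $\pi_n = 2^{-(n+1)}$.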
TIME AVERAGES VS. Above it is noted that limt→∞ πi (t) = πi = 1/(−qii Mi ). . To be definite. where the empirical distribution is the distribution observed over a (usually large) time interval. then the discrete-time process has period 2. 1 lim Tk /(kMi ) = lim k→∞ k→∞ kMi = 1 2 k−1 (Tl+1 − Tl ) l=0 If in addition λi + µi = 1 for all i. If the birth and death probabilities are strictly positive. j ≥ 0) is given by λi if j = i + 1 µi if j = i − 1 1 − λi − µi if j = i ≥ 1 pij = (6. by the law of large numbers. . with pure-jump sample paths and generator matrix Q.10) 1 − λ0 if j = i = 0 0 else. If X is positive recurrent the equilibrium probability distribution is given by πn = (λ0 . . Discrete-time birth-death processes have a similar characterization. Implicit in the specification of P is that births and deaths can’t happen simultaneously. STATISTICAL AVERAGES 175 Proposition 6. λn−1 )/(S1 µ1 . if the rates are bounded or grow at most linearly with n.5. time-homogeneous Markov processes with state space equal to the set of nonnegative integers. with the exception that the discrete-time process cannot be explosive. For a fixed state i. Therefore. If X is nonexplosive (for example. The cycle times Tk+1 − Tk . time-homogeneous Markov process with equilibrium probability distribution π.1 Suppose X is a continuous-time birth-death process with strictly positive birth rates and death rates. with probability one.11) is expected. since state i is arbitrary. So for any nonnegative function φ on S. 1 t→∞ t lim t 0 φ(X(s))ds = Tk 1 φ(X(s))ds k→∞ kMi 0 T1 1 = E φ(X(s))ds Mi T0 T1 1 φ(j) I{X(s)=j} ds = E Mi T0 lim j∈S = = 1 Mi T1 φ(j)E j∈S T0 I{X(s)=j} ds (6. because the process spends on average −1/qii time units in state i per cycle from state i. Thus. and the cycle rate is 1/Mi .12) By considering how the time in state j is distributed among the cycles from state i. it follows that the mean time spent in state j per cycle from state i is Mi πj . the limit (6. 
1 lim k→∞ kMi Tk 0 1 I{X(s)=i} ds = lim k→∞ kMi = k−1 l=0 Tl+1 I{X(s)=i} ds Tl T1 1 E I{X(s)=i} ds Mi T0 = 1/(−qii Mi ) Combining these two observations yields that 1 t→∞ t lim t I{X(s)=i} ds = 1/(−qii Mi ) = πi 0 (6. Tk+1 ). Of course. In short. if j is any other state then 1 t→∞ t lim t I{X(s)=j} ds = 1/(−qjj Mj ) = πj 0 (6. during the kth cycle interval [Tk . the amount of time spent by the process in state i is exponentially distributed with mean −1/qii .13) holds for both φ+ and φ− . j∈S φ− (j)πj < ∞.11) with probability one. then . and the time spent in the state during disjoint cycles is independent.176 CHAPTER 6.13) φ(j)πj j∈S Finally. if φ is a function of S such that either j∈S φ+ (j)πj < ∞ or since (6. it must hold for φ itself. DYNAMICS OF COUNTABLE-STATE MARKOV MODELS Furthermore. if any. where b denotes the maximum number of customers that can be in the system. Thus. M/M/1 queue and Little’s law Some basic terminology of queueing theory will now be explained. M/M/1 QUEUE AND LITTLE’S LAW 177 queue server system Figure 6.7. there is a potential departure. QUEUEING SYSTEMS. and that the service times of the customers are independent and exponentially distributed with parameter µ.6 Queueing systems. given positive parameters λ and µ. When the service of a customer is complete it departs from the server and then another customer from the queue. and any other customers in the system are waiting in the queue.7: A single server queueing system. and there is a single server. starting from some initial value N (0). the service times are memoryless (i.e. a Poisson arrival process). Each time there is a jump of D.6. the system just described is called an M/M/1 queueing system. Process A marks customer arrival times and process D marks potential customer departure times. meaning that if there is a customer in the server at the time of the jump then the customer departs. are exponentially distributed). . there is a customer in the server. 
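The agreement between time averages and the equilibrium distribution can be checked by simulation. The following is a rough sketch, not taken from the notes: it simulates a birth-death process through its jump chain and exponential holding times, and records the fraction of time spent in each state. The function name, the callable rate arguments, and the fixed seed are all illustrative assumptions.

```python
import random


def empirical_distribution(lam, mu, n_jumps, seed=0):
    """Simulate a birth-death process started in state 0 for n_jumps jumps.

    lam(k) and mu(k) are the birth and death rates in state k (mu is ignored
    in state 0). Returns a dict mapping state -> fraction of elapsed time.
    """
    rng = random.Random(seed)
    state, total_time = 0, 0.0
    time_in_state = {}
    for _ in range(n_jumps):
        up = lam(state)
        down = mu(state) if state > 0 else 0.0
        rate = up + down
        hold = rng.expovariate(rate)  # holding time ~ Exp(lam_k + mu_k)
        time_in_state[state] = time_in_state.get(state, 0.0) + hold
        total_time += hold
        # Jump up with probability lam_k / (lam_k + mu_k), otherwise down.
        state += 1 if rng.random() < up / rate else -1
    return {k: v / total_time for k, v in time_in_state.items()}


if __name__ == "__main__":
    fractions = empirical_distribution(lambda k: 1.0, lambda k: 2.0, 200_000)
    for k in range(4):
        # For these constant rates the equilibrium value is 2**-(k+1).
        print(k, round(fractions.get(k, 0.0), 4), 2.0 ** -(k + 1))
```

With constant rates 1 and 2 the observed fractions should be close to the geometric equilibrium probabilities, with agreement improving as the number of simulated jumps grows.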
6.6 Queueing systems, M/M/1 queue and Little's law

Some basic terminology of queueing theory will now be explained. A simple type of queueing system is pictured in Figure 6.7. Notice that the system is comprised of a queue and a server. Ordinarily whenever the system is not empty, there is a customer in the server, and any other customers in the system are waiting in the queue. When the service of a customer is complete it departs from the server and then another customer from the queue, if any, immediately enters the server. The choice of which customer to be served next depends on the service discipline. Common service disciplines are first-come first-served (FCFS) in which customers are served in the order of their arrival, or last-come first-served (LCFS) in which the customer that arrived most recently is served next. Some of the more complicated service disciplines involve priority classes, or the notion of "processor sharing" in which all customers present in the system receive equal attention from the server.

Figure 6.7: A single server queueing system.

Often models of queueing systems involve a stochastic description. For example, given positive parameters $\lambda$ and $\mu$, we may declare that the arrival process is a Poisson process with rate $\lambda$, and that the service times of the customers are independent and exponentially distributed with parameter $\mu$. Many queueing systems are given labels of the form A/B/s, where "A" is chosen to denote the type of arrival process, "B" is used to denote the type of departure process, and s is the number of servers in the system. In particular, the system just described is called an M/M/1 queueing system, so-named because the arrival process is memoryless (i.e. a Poisson arrival process), the service times are memoryless (i.e. are exponentially distributed), and there is a single server. Other labels for queueing systems have a fourth descriptor and thus have the form A/B/s/b, where b denotes the maximum number of customers that can be in the system. Thus, an M/M/1 system is also an M/M/1/$\infty$ system, because there is no finite bound on the number of customers in the system.

A second way to specify an M/M/1 queueing system with parameters $\lambda$ and $\mu$ is to let $A(t)$ and $D(t)$ be independent Poisson processes with rates $\lambda$ and $\mu$ respectively. Process $A$ marks customer arrival times and process $D$ marks potential customer departure times. The number of customers in the system, starting from some initial value $N(0)$, evolves as follows. Each time there is a jump of $A$, a customer arrives to the system. Each time there is a jump of $D$, there is a potential departure, meaning that if there is a customer in the server at the time of the jump then the customer departs. If a potential departure occurs when the system is empty then the potential departure has no effect on the system. The number of customers in the system $N$ can thus be expressed as

$$N(t) = N(0) + A(t) - \int_0^t I_{\{N(s-) \ge 1\}}\, dD(s)$$

It is easy to verify that the resulting process $N$ is Markov, which leads to the third specification of an M/M/1 queueing system.

A third way to specify an M/M/1 queueing system is that the number of customers in the system $N(t)$ is a birth-death process with $\lambda_k = \lambda$ and $\mu_k = \mu$ for all $k$, for some parameters $\lambda$ and $\mu$. Let $\rho = \lambda/\mu$. Using the classification criteria derived for birth-death processes, it is easy to see that the system is recurrent if and only if $\rho \le 1$, and that it is positive recurrent if and only if $\rho < 1$. Moreover, if $\rho < 1$ then the equilibrium distribution for the number of customers in the system is given by $\pi_k = (1 - \rho)\rho^k$ for $k \ge 0$. This is the geometric distribution with zero as a possible value, and with mean

$$\overline{N} = \sum_{k=0}^{\infty} k\pi_k = (1-\rho)\rho \sum_{k=1}^{\infty} \rho^{k-1} k = (1-\rho)\rho \left(\frac{1}{1-\rho}\right)' = \frac{\rho}{1-\rho}$$

The probability the server is busy, which is also the mean number of customers in the server, is $1 - \pi_0 = \rho$. The mean number of customers in the queue is thus given by $\rho/(1-\rho) - \rho = \rho^2/(1-\rho)$. This third specification is the most commonly used way to define an M/M/1 queueing process.

Since the M/M/1 process $N(t)$ is positive recurrent, the Markov ergodic convergence theorem implies that the statistical averages just computed, such as $\overline{N}$, are also equal to the limit of the time-averaged number of customers in the system as the averaging interval tends to infinity.

An important performance measure for a queueing system is the mean time spent in the system or the mean time spent in the queue. Little's law, described next, is a quite general and useful relationship that aids in computing mean transit time.

Little's law can be applied in a great variety of circumstances involving flow through a system with delay. In the context of queueing systems we speak of a flow of customers, but the same principle applies to a flow of water through a pipe. Little's law is that $\overline{\lambda}\,\overline{T} = \overline{N}$ where $\overline{\lambda}$ is the mean flow rate, $\overline{T}$ is the mean delay in the system, and $\overline{N}$ is the mean content of the system. For example, if water flows through a pipe with volume one cubic meter at the rate of two cubic meters per minute, then the mean time (averaged over all drops of water) that water spends in the pipe is $\overline{T} = \overline{N}/\overline{\lambda} = 1/2$ minute. This is clear if water flows through the pipe without mixing, for then the transit time of each drop of water is 1/2 minute. However, mixing within the pipe does not affect the average transit time.

Little's law is actually a set of results, each with somewhat different mathematical assumptions. The following version is quite general. Figure 6.8 pictures the cumulative number of arrivals ($\alpha(t)$) and the cumulative number of departures ($\delta(t)$) versus time, for a queueing system assumed to be initially empty. Note that the number of customers in the system at any time $s$ is given by the difference $N(s) = \alpha(s) - \delta(s)$, which is the vertical distance between the arrival and departure graphs in the figure. On the other hand, assuming that customers are served in first-come first-served order, the horizontal distance between the graphs gives the times in system for the customers. Given a (usually large) $t > 0$, let $\gamma_t$ denote the area of the region between the two graphs over the interval $[0, t]$. This is the shaded region indicated in the figure. It is then natural to define the time-averaged values of arrival rate and system content as

$$\overline{\lambda}_t = \alpha(t)/t \quad \text{and} \quad \overline{N}_t = \frac{1}{t} \int_0^t N(s)\, ds = \gamma_t/t$$

Finally, the average, over the $\alpha(t)$ customers that arrive during the interval $[0, t]$, of the time spent in the system up to time $t$, is given by

$$\overline{T}_t = \gamma_t/\alpha(t).$$

Figure 6.8: A single server queueing system.

Once these definitions are accepted, we have the following obvious proposition.

Proposition 6.6.1 (Little's law, expressed using averages over time) For any $t > 0$,

$$\overline{N}_t = \overline{\lambda}_t \overline{T}_t \tag{6.14}$$

Furthermore, if any two of the three variables in (6.14) converge to a positive finite limit as $t \to \infty$, then so does the third variable, and the limits satisfy $\overline{N}_\infty = \overline{\lambda}_\infty \overline{T}_\infty$.

For example, the number of customers in an M/M/1 queue is a positive recurrent Markov process so that

$$\lim_{t \to \infty} \overline{N}_t = \overline{N} = \rho/(1-\rho)$$

where calculation of the statistical mean $\overline{N}$ was previously discussed. Also, by the law of large numbers applied to interarrival times, we have that the Poisson arrival process for an M/M/1 queue satisfies $\lim_{t \to \infty} \overline{\lambda}_t = \lambda$ with probability one. Thus, with probability one,

$$\lim_{t \to \infty} \overline{T}_t = \overline{N}/\lambda = \frac{1}{\mu - \lambda}.$$

In this sense, the average time a customer spends in an M/M/1 system (waiting plus service) is $1/(\mu - \lambda)$. The average time in service is $1/\mu$ (this follows from the third description of an M/M/1 queue, or also from Little's law applied to the server alone) so that the average waiting time in queue is given by $W = 1/(\mu - \lambda) - 1/\mu = \rho/(\mu - \lambda)$. This final result also follows from Little's law applied to the queue alone.
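Little's law in its time-average form can be illustrated on a simulated sample path. The sketch below is an illustration under stated assumptions rather than the notes' own construction: it generates a first-come first-served M/M/1 path using the standard recursion "departure = max(arrival, previous departure) + service time", takes the horizon $t$ to be the last arrival time, and computes the area $\gamma_t$ between the arrival and departure curves. The function name and seed are hypothetical.

```python
import random


def mm1_time_averages(lam, mu, n_customers, seed=1):
    """Return (lambda_bar_t, N_bar_t, T_bar_t) for a simulated FCFS M/M/1 path.

    The horizon t is the arrival time of the last simulated customer, so
    alpha(t) equals n_customers.
    """
    rng = random.Random(seed)
    arrival, prev_departure = 0.0, 0.0
    arrivals, departures = [], []
    for _ in range(n_customers):
        arrival += rng.expovariate(lam)            # Poisson arrivals, rate lam
        start = max(arrival, prev_departure)       # wait if the server is busy
        prev_departure = start + rng.expovariate(mu)  # Exp(mu) service time
        arrivals.append(arrival)
        departures.append(prev_departure)
    t = arrivals[-1]
    # gamma_t: total customer-time in system accumulated during [0, t].
    gamma = sum(min(d, t) - a for a, d in zip(arrivals, departures))
    alpha = n_customers
    return alpha / t, gamma / t, gamma / alpha


if __name__ == "__main__":
    lam_bar, n_bar, t_bar = mm1_time_averages(1.0, 2.0, 100_000)
    # Identity (6.14) holds exactly on every sample path; with lam = 1, mu = 2
    # the averages should also be near 1, rho/(1-rho) = 1, and 1/(mu-lam) = 1.
    print(lam_bar, n_bar, t_bar, lam_bar * t_bar)
```

The product of the first and third returned values equals the second up to rounding, which is the content of (6.14); the closeness of each value to its M/M/1 limit is a separate, statistical statement.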
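6.7 Mean arrival rate, distributions seen by arrivals, and PASTA

The mean arrival rate for the M/M/1 system is $\lambda$, the parameter of the Poisson arrival process. However for some queueing systems the arrival rate depends on the number of customers in the system. In such cases the mean arrival rate is still typically meaningful, and it can be used in Little's law.

Suppose the number of customers in a queueing system is modeled by a birth-death process with arrival rates $(\lambda_k)$ and departure rates $(\mu_k)$. Suppose in addition that the process is positive recurrent. Intuitively, the process spends a fraction of time $\pi_k$ in state $k$ and while in state $k$ the arrival rate is $\lambda_k$. Therefore, the average arrival rate is

$$\overline{\lambda} = \sum_{k=0}^{\infty} \pi_k \lambda_k$$

Similarly the average departure rate is

$$\overline{\mu} = \sum_{k=1}^{\infty} \pi_k \mu_k$$

and of course $\overline{\lambda} = \overline{\mu}$ because both are equal to the throughput of the system.

Often the distribution of a system at particular system-related sampling times is more important than the distribution in equilibrium. For example, the distribution seen by arriving customers may be the most relevant distribution, as far as the customers are concerned. If the arrival rate depends on the number of customers in the system then the distribution seen by arrivals need not be the same as the equilibrium distribution. Intuitively, $\pi_k \lambda_k$ is the long-term frequency of arrivals which occur when there are $k$ customers in the system, so that the fraction of customers that see $k$ customers in the system upon arrival is given by

$$r_k = \frac{\pi_k \lambda_k}{\overline{\lambda}}.$$

The following is an example of a system with variable arrival rate.

Example 6.7.1 (Single-server, discouraged arrivals) Suppose $\lambda_k = \alpha/(k+1)$ and $\mu_k = \mu$ for all $k$, where $\mu$ and $\alpha$ are positive constants. Then

$$S_2 = \sum_{k=0}^{\infty} \frac{(k+1)!\,\mu^k}{\alpha^k} = \infty \quad \text{and} \quad S_1 = \sum_{k=0}^{\infty} \frac{\alpha^k}{k!\,\mu^k} = \exp\left(\frac{\alpha}{\mu}\right) < \infty$$

so that the number of customers in the system is a positive recurrent Markov process, with no additional restrictions on $\alpha$ and $\mu$. Moreover, the equilibrium probability distribution is given by $\pi_k = (\alpha/\mu)^k \exp(-\alpha/\mu)/k!$, which is the Poisson distribution with mean $\overline{N} = \alpha/\mu$. The mean arrival rate is

$$\overline{\lambda} = \sum_{k=0}^{\infty} \frac{\pi_k \alpha}{k+1} = \mu \exp\left(-\frac{\alpha}{\mu}\right) \sum_{k=0}^{\infty} \frac{(\alpha/\mu)^{k+1}}{(k+1)!} = \mu \exp\left(-\frac{\alpha}{\mu}\right)\left(\exp\left(\frac{\alpha}{\mu}\right) - 1\right) = \mu\left(1 - \exp\left(-\frac{\alpha}{\mu}\right)\right). \tag{6.15}$$

This expression derived for $\overline{\lambda}$ is clearly equal to $\overline{\mu}$, because the departure rate is $\mu$ with probability $1 - \pi_0$ and zero otherwise. The distribution of the number of customers in the system seen by arrivals, $(r_k)$, is given by

$$r_k = \frac{\pi_k \alpha}{(k+1)\overline{\lambda}} = \frac{(\alpha/\mu)^{k+1} \exp(-\alpha/\mu)}{(k+1)!\,(1 - \exp(-\alpha/\mu))} \quad \text{for } k \ge 0$$

which in words can be described as the result of removing the probability mass at zero in the Poisson distribution, shifting the distribution down by one, and then renormalizing. The mean number of customers in the system seen by a typical arrival is therefore $(\alpha/\mu)/(1 - \exp(-\alpha/\mu)) - 1$. This mean is somewhat less than $\overline{N}$ because, roughly speaking, the customer arrival rate is higher when the system is more lightly loaded.

The equivalence of time-averages and statistical averages for computing the mean arrival rate and the distribution seen by arrivals can be shown by application of ergodic properties of the processes involved. The associated formal approach is described next, in slightly more generality. Let $X$ denote an irreducible, positive-recurrent pure-jump Markov process. If the process makes a jump from state $i$ to state $j$ at time $t$, say that a transition of type $(i, j)$ occurs. The sequence of transitions of $X$ forms a new Markov process, $Y$. The process $Y$ is a discrete-time Markov process with state space $\{(i, j) \in S \times S : q_{ij} > 0\}$, and it can be described in terms of the jump process for $X$, by $Y(k) = (X^J(k-1), X^J(k))$ for $k \ge 0$. (Let $X^J(-1)$ be defined arbitrarily.)

The one-step transition probability matrix of the jump process $X^J$ is given by $p^J_{ij} = q_{ij}/(-q_{ii})$, and $X^J$ is recurrent because $X$ is recurrent. Its equilibrium distribution $\pi^J$ (if it exists) is proportional to $-\pi_i q_{ii}$ (see Problem 6.3), and $X^J$ is positive recurrent if and only if this distribution can be normalized to make a probability distribution, i.e. if and only if $R = -\sum_i \pi_i q_{ii} < \infty$. Assume for simplicity that $X^J$ is positive recurrent. Then $\pi^J_i = -\pi_i q_{ii}/R$ is the equilibrium probability distribution of $X^J$. Furthermore, $Y$ is positive recurrent and its equilibrium distribution is given by

$$\pi^Y_{ij} = \pi^J_i p^J_{ij} = \frac{-\pi_i q_{ii}}{R} \cdot \frac{q_{ij}}{-q_{ii}} = \frac{\pi_i q_{ij}}{R}$$

Since limiting time averages equal statistical averages for $Y$,

$$\lim_{n \to \infty} (\text{number of first } n \text{ transitions of } X \text{ that are type } (i, j))/n = \pi_i q_{ij}/R$$

with probability one. Therefore, if $A \subset S \times S$, and if $(i, j) \in A$, then

$$\lim_{n \to \infty} \frac{\text{number of first } n \text{ transitions of } X \text{ that are type } (i, j)}{\text{number of first } n \text{ transitions of } X \text{ with type in } A} = \frac{\pi_i q_{ij}}{\sum_{(i', j') \in A} \pi_{i'} q_{i'j'}} \tag{6.16}$$

To apply this setup to the special case of a queueing system in which the number of customers in the system is a Markov birth-death process, let the set $A$ be the set of transitions of the form $(i, i+1)$. Then deduce that the fraction of the first $n$ arrivals that see $i$ customers in the system upon arrival converges to $\pi_i \lambda_i / \sum_j \pi_j \lambda_j$ with probability one.

Note that if $\lambda_i = \lambda$ for all $i$, then $\overline{\lambda} = \lambda$ and $\pi = r$. The condition $\lambda_i = \lambda$ also implies that the arrival process is Poisson. This situation is called "Poisson Arrivals See Time Averages" (PASTA).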
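The quantities in the discouraged-arrivals example can be evaluated directly from their defining sums. The following is a small numerical sketch, with a hypothetical function name and an assumed truncation level `k_max`: it builds the Poisson equilibrium distribution, the mean arrival rate $\sum_k \pi_k \lambda_k$, the distribution $(r_k)$ seen by arrivals, and the mean of $(r_k)$.

```python
import math


def discouraged_arrivals(alpha, mu, k_max=100):
    """Return (lam_bar, r, mean_seen) for lam_k = alpha/(k+1), mu_k = mu.

    pi is the Poisson(alpha/mu) distribution truncated at k_max terms;
    r[k] is the fraction of arrivals that find k customers in the system.
    """
    a = alpha / mu
    pi, term = [], math.exp(-a)
    for k in range(k_max):
        pi.append(term)
        term *= a / (k + 1)  # next Poisson probability, avoiding large factorials
    lam_bar = sum(p * alpha / (k + 1) for k, p in enumerate(pi))
    r = [p * alpha / ((k + 1) * lam_bar) for k, p in enumerate(pi)]
    mean_seen = sum(k * rk for k, rk in enumerate(r))
    return lam_bar, r, mean_seen


if __name__ == "__main__":
    lam_bar, r, mean_seen = discouraged_arrivals(alpha=2.0, mu=1.0)
    print(lam_bar, 1.0 * (1 - math.exp(-2.0)))  # mean arrival rate vs. (6.15)
    print(sum(r))                               # r is a probability distribution
    print(mean_seen)                            # mean number found by an arrival
```

For $\alpha = 2$ and $\mu = 1$ the computed mean arrival rate agrees with $\mu(1 - \exp(-\alpha/\mu))$, and the mean of $(r_k)$, computed from the sum $\sum_k k\, r_k$, comes out below the equilibrium mean $\alpha/\mu = 2$.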
6.8 More examples of queueing systems modeled as Markov birth-death processes

For each of the four examples of this section it is assumed that new customers are offered to the system according to a Poisson process with rate $\lambda$, so that the PASTA property holds. Also, when there are $k$ customers in the system then the service rate is $\mu_k$ for some given numbers $\mu_k$. The number of customers in the system is a Markov birth-death process with $\lambda_k = \lambda$ for all $k$. Since the number of transitions of the process up to any given time $t$ is at most twice the number of customers that arrived by time $t$, the Markov process is not explosive. Therefore the process is positive recurrent if and only if $S_1$ is finite, where

$$S_1 = \sum_{k=0}^{\infty} \frac{\lambda^k}{\mu_1 \mu_2 \cdots \mu_k}$$

Special cases of this example are presented in the next four examples.

Example 6.8.1 (M/M/m systems) An M/M/m queueing system consists of a single queue and $m$ servers. The arrival process is Poisson with some rate $\lambda$ and the customer service times are independent and exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. The total number of customers in the system is a birth-death process with $\mu_k = \mu \min(k, m)$. Let $\rho = \lambda/(m\mu)$. Since $\mu_k = m\mu$ for all $k$ large enough it is easy to check that the process is positive recurrent if and only if $\rho < 1$. Assume now that $\rho < 1$. Then the equilibrium distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1 k!} \quad \text{for } 0 \le k \le m$$

$$\pi_{m+j} = \pi_m \rho^j \quad \text{for } j \ge 1$$

where $S_1$ is chosen to make the probabilities sum to one (use the fact $1 + \rho + \rho^2 + \cdots = 1/(1-\rho)$):

$$S_1 = \left(\sum_{k=0}^{m-1} \frac{(\lambda/\mu)^k}{k!}\right) + \frac{(\lambda/\mu)^m}{m!\,(1-\rho)}.$$

An arriving customer must join the queue (rather than go directly to a server) if and only if the system has $m$ or more customers in it. By the PASTA property, this is the same as the equilibrium probability of having $m$ or more customers in the system:

$$P_Q = \sum_{j=0}^{\infty} \pi_{m+j} = \pi_m/(1-\rho)$$

This formula is called the Erlang C formula for probability of queueing.

Example 6.8.2 (M/M/m/m systems) An M/M/m/m queueing system consists of $m$ servers. The arrival process is Poisson with some rate $\lambda$ and the customer service times are independent and exponentially distributed with parameter $\mu$ (that is, with mean $1/\mu$) for some $\mu > 0$. Since there is no queue, if a customer arrives when there are already $m$ customers in the system, then the arrival is blocked and cleared from the system. The total number of customers in the system is a birth-death process, but with the state space reduced to $\{0, 1, \ldots, m\}$, and with $\mu_k = k\mu$ for $1 \le k \le m$. The unique equilibrium distribution is given by

$$\pi_k = \frac{(\lambda/\mu)^k}{S_1 k!} \quad \text{for } 0 \le k \le m$$

where $S_1$ is chosen to make the probabilities sum to one. An arriving customer is blocked and cleared from the system if and only if the system already has $m$ customers in it. By the PASTA property, this is the same as the equilibrium probability of having $m$ customers in the system:

$$P_B = \pi_m = \frac{(\lambda/\mu)^m/m!}{\sum_{j=0}^{m} (\lambda/\mu)^j/j!}$$

This formula is called the Erlang B formula for probability of blocking.

Example 6.8.3 (A system with a discouraged server) The number of customers in this system is a birth-death process with constant birth rate $\lambda$ and death rates $\mu_k = 1/k$. It is easy to check that all states are transient for any positive value of $\lambda$ (to verify this it suffices to check that $S_2 < \infty$). It is not difficult to show that $N(t)$ converges to $+\infty$ with probability one as $t \to \infty$.

Example 6.8.4 (A barely stable system) The number of customers in this system is a birth-death process with constant birth rate $\lambda$ and death rates $\mu_k = \frac{\lambda(1 + k^2)}{1 + (k-1)^2}$ for all $k \ge 1$. Since the departure rates are barely larger than the arrival rates, this system is near the borderline between recurrence and transience. However, we see that

$$S_1 = \sum_{k=0}^{\infty} \frac{1}{1 + k^2} < \infty$$

so that $N(t)$ is positive recurrent with equilibrium distribution $\pi_k = 1/(S_1(1 + k^2))$. Note that the mean number of customers in the system is

$$\overline{N} = \sum_{k=0}^{\infty} k/(S_1(1 + k^2)) = \infty$$

By Little's law the mean time customers spend in the system is also infinity. It is debatable whether this system should be thought of as "stable" even though all states are positive recurrent and all waiting times are finite with probability one.
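The Erlang B and Erlang C probabilities are easy to evaluate from the equilibrium distributions of the M/M/m/m and M/M/m systems. Below is a minimal sketch, assuming the offered load $\lambda/\mu$ is passed as a single number; the function names are illustrative and the direct use of factorials is only suitable for moderate $m$.

```python
import math


def erlang_b(offered_load, m):
    """Blocking probability P_B for an M/M/m/m system, offered_load = lambda/mu."""
    terms = [offered_load ** j / math.factorial(j) for j in range(m + 1)]
    return terms[m] / sum(terms)


def erlang_c(offered_load, m):
    """Queueing probability P_Q for an M/M/m system; requires rho = load/m < 1."""
    rho = offered_load / m
    if rho >= 1:
        raise ValueError("the M/M/m system is positive recurrent only if rho < 1")
    head = sum(offered_load ** k / math.factorial(k) for k in range(m))
    tail = offered_load ** m / (math.factorial(m) * (1 - rho))
    # S1 = head + tail, and P_Q = pi_m / (1 - rho) = tail / S1.
    return tail / (head + tail)


if __name__ == "__main__":
    print(erlang_b(2.0, 2))  # two servers, offered load 2
    print(erlang_c(1.0, 2))  # two servers, offered load 1, rho = 1/2
```

As a sanity check, for a single server `erlang_c(rho, 1)` reduces to $\rho$, the probability that the M/M/1 server is busy.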
a recurring theme of dynamical systems theory is that stability questions can often be settled by studying the potential of a system for some nonnegative potential function V . As noted in [9].1 Stability criteria for discrete-time processes Consider an irreducible discrete-time Markov process X on a countable state space S. Let V be a nonnegative function on S.9. 6. values of i.1 discusses the discrete-time tools. Aleksandr Mikhailovich Lyapunov (1857-1918) contributed significantly to the theory of stability of dynamical systems. Although a dynamical system may evolve on a complicated. with one-step transition probability matrix P . the moment bound method is closely related to Dynkin’s formula. then P f represents the function obtained by multiplication of the vector f by the matrix P : P f (i) = j∈S pij f (j). An important property of P f is that P f (i) = E[f (X(t + 1)|X(t) = i]. and input queued crossbar switches. Below we present the well known criteria due to Foster [4] for recurrence and positive recurrence. or all. Potential functions used for stability analysis are widely called Lyapunov functions. Similar stability conditions have been developed by many authors for stochastic systems. and many others.9.2 presents the continuous time tools. using quadratic Lyapunov functions. which are expectations of some functions on the state space. In addition we present associated bounds on the moments. 6.3 Subsection 6. 3 . and if P V − V ≤ 0 on S − C. f . (In particular.9. Then f ≤ g. FOSTER-LYAPUNOV STABILITY CRITERION AND MOMENT BOUNDS 185 defined by d(i) = E[V (X(t + 1))|X(t) = i] − V (i). d2 ≤ 1. (b) If > 0 and b is a constant such that P V − V ≤ − + bIC .9.9. f .18) In addition. Then V.9. then g is finite. queue 1 and queue 2. the long term average drift of V (X(t)) is nonnegative. (6.1 (Foster-Lyapunov stability criterion) Suppose V : S → R+ and C is a finite subset of S. (a) If {i : V (i) ≤ K} is finite for all K. has a departure with probability di . 
as pictured in Figure 6. 0 ≤ a. if the value +∞ is permitted. g. . and g are nonnegative functions on S such that P V (i) − V (i) ≤ −f (i) + g(i) for all i ∈ S (6. Then X is positive recurrent and f ≤ g. (In particular. then X is positive recurrent.) Proof. d1 .2 are satisfied. and therefore f is finite. Proposition 6. The state space for the process is S = Z2 .9. suppose for some > 0 that the set C defined by C = {i : f (i) < g(i)+ } is finite. The assumptions in Propositions 6. x2 ) denotes x1 packets in queue 1 and x2 packets in queue 2. b.9. d = P V − V .6. so that X is positive recurrent. That is. and then is routed to queue 1 with probability u and to queue 2 with probability u. Let b = max{g(i) + − f (i) : i ∈ C}.19) In addition.1(b).9.4 (Probabilistic routing to two queues) Consider the routing scenario with two queues. and therefore f is finite. f. suppose X is positive recurrent. In each time slot. Example 6. The drift vector is also given by d(i) = j:j=i pij (V (j) − V (i)).9. Note that d(i) is always well-defined. Here. and g are nonnegative functions on S and suppose P V (i) − V (i) ≤ −f (i) + g(i) for all i ∈ S (6.9. f = πf and g = πg are well-defined.2 and Corollary 6. if g is bounded. then g is finite. This gives an intuitive reason why the mean downward part of the drift.2 (Moment bound) Suppose V . Even so. Therefore the hypotheses of Proposition 6.17) Proposition 6.) Corollary 6. and u = 1 − u. for a given initial state X(0). Then each queue i. since V is nonnegative.1 and 6. and satisfy the hypotheses of Proposition 6.9. then X is recurrent. a new arrival is generated with probability a.9.3 do not imply that V is finite. fed by a single stream of packets. so that f ≤ g. u. must be less than or equal to the mean upward part of the drift. where the state + x = (x1 .3 (Combined Foster-Lyapunov stability criterion and moment bound) Suppose V. C. if not empty. if g is bounded. so that the means. 186 CHAPTER 6. 
Example 6.9.4 (Probabilistic routing to two queues) Consider the routing scenario with two queues, queue 1 and queue 2, fed by a single stream of packets, as pictured in Figure 6.9. Here, $0 \le a, u, d_1, d_2 \le 1$, and $\bar u = 1 - u$. The state space for the process is $S = \mathbb{Z}_+^2$, where the state $x = (x_1, x_2)$ denotes $x_1$ packets in queue 1 and $x_2$ packets in queue 2. In each time slot, a new arrival is generated with probability $a$, and then is routed to queue 1 with probability $u$ and to queue 2 with probability $\bar u$. Then each queue $i$, if not empty, has a departure with probability $d_i$. Note that we allow a packet to arrive and depart in the same slot.

[Figure 6.9: Two queues fed by a single arrival stream.]

Thus, if $X_i(t)$ is the number of packets in queue $i$ at the beginning of slot $t$, then the system dynamics can be described as follows:
\[ X_i(t+1) = X_i(t) + A_i(t) - D_i(t) + L_i(t) \quad \text{for } i \in \{1, 2\} \tag{6.20} \]
where
• $A(t) = (A_1(t), A_2(t))$ is equal to $(1, 0)$ with probability $au$, $(0, 1)$ with probability $a\bar u$, and $A(t) = (0, 0)$ otherwise.
• $D_i(t) : t \ge 0$, are Bernoulli($d_i$) random variables, for $i \in \{1, 2\}$.
• All the $A(t)$'s, $D_1(t)$'s, and $D_2(t)$'s are mutually independent.
• $L_i(t) = (-(X_i(t) + A_i(t) - D_i(t)))_+$ (see explanation next).

If $X_i(t) + A_i(t) = 0$, there can be no actual departure from queue $i$. However, we still allow $D_i(t)$ to equal one. To keep the queue length process from going negative, we add the random variable $L_i(t)$ in (6.20). Thus, $D_i(t)$ is the potential number of departures from queue $i$ during the slot, and $D_i(t) - L_i(t)$ is the actual number of departures. This completes the specification of the one-step transition probabilities of the Markov process.

A necessary condition for positive recurrence is, for any routing policy, $a < d_1 + d_2$, because the total arrival rate must be less than the total departure rate. We seek to show that this necessary condition is also sufficient, under the random routing policy.

Let us calculate the drift of $V(X(t))$ for the choice $V(x) = (x_1^2 + x_2^2)/2$. Note that $(X_i(t+1))^2 = (X_i(t) + A_i(t) - D_i(t) + L_i(t))^2 \le (X_i(t) + A_i(t) - D_i(t))^2$, because addition of the variable $L_i(t)$ can only push $X_i(t) + A_i(t) - D_i(t)$ closer to zero. Thus, using $|A_i(t) - D_i(t)| \le 1$ for the second inequality,
\begin{align*}
PV(x) - V(x) &= E[V(X(t+1)) \mid X(t) = x] - V(x) \\
&\le \frac{1}{2} \sum_{i=1}^{2} E[(x_i + A_i(t) - D_i(t))^2 - x_i^2 \mid X(t) = x] \\
&= \sum_{i=1}^{2} \left( x_i E[A_i(t) - D_i(t) \mid X(t) = x] + \frac{1}{2} E[(A_i(t) - D_i(t))^2 \mid X(t) = x] \right) \tag{6.21} \\
&\le \left( \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] \right) + 1 \tag{6.22} \\
&= -\left( x_1 (d_1 - au) + x_2 (d_2 - a\bar u) \right) + 1.
\end{align*}
Under the necessary condition $a < d_1 + d_2$, there are choices of $u$ so that $au < d_1$ and $a\bar u < d_2$, and for such $u$ the conditions of Corollary 6.9.3 are satisfied, with $f(x) = x_1(d_1 - au) + x_2(d_2 - a\bar u)$, $g(x) = 1$, and any $\epsilon > 0$, implying that the Markov process is positive recurrent. In addition, the first moments under the equilibrium distribution satisfy:
\[ (d_1 - au)\bar X_1 + (d_2 - a\bar u)\bar X_2 \le 1. \tag{6.23} \]
In order to deduce an upper bound on $\bar X_1 + \bar X_2$, we select $u^*$ to maximize the minimum of the two coefficients in (6.23). Intuitively, this entails selecting $u$ to minimize the absolute value of the difference between the two coefficients. We find:
\[ \epsilon = \max_{0 \le u \le 1} \min\{d_1 - au,\; d_2 - a\bar u\} = \min\left\{ d_1,\; d_2,\; \frac{d_1 + d_2 - a}{2} \right\} \]
and the corresponding value $u^*$ of $u$ is given by
\[ u^* = \begin{cases} 0 & \text{if } d_1 - d_2 < -a \\ \frac{1}{2} + \frac{d_1 - d_2}{2a} & \text{if } |d_1 - d_2| \le a \\ 1 & \text{if } d_1 - d_2 > a. \end{cases} \]
For the system with $u = u^*$, (6.23) yields
\[ \bar X_1 + \bar X_2 \le \frac{1}{\epsilon}. \tag{6.24} \]
We remark that, in fact,
\[ \bar X_1 + \bar X_2 \le \frac{2}{d_1 + d_2 - a}. \tag{6.25} \]
If $|d_1 - d_2| \le a$ then the bounds (6.24) and (6.25) coincide, and otherwise, the bound (6.25) is strictly tighter. If $d_1 - d_2 < -a$ then $u^* = 0$, so that $\bar X_1 = 0$, and (6.23) becomes $(d_2 - a)\bar X_2 \le 1$, which implies (6.25). Similarly, if $d_1 - d_2 > a$, then $u^* = 1$, so that $\bar X_2 = 0$, and (6.23) becomes $(d_1 - a)\bar X_1 \le 1$, which implies (6.25). Thus, (6.25) is proved.
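The choice of $u^*$ and the two bounds (6.24) and (6.25) of Example 6.9.4 are easy to evaluate numerically. The following Python sketch is illustrative only: the function names `optimal_split` and `mean_total_bounds` and the sample parameter values are assumptions, not part of the notes.

```python
def optimal_split(a, d1, d2):
    """Return (u_star, eps) for probabilistic routing to two queues.

    u_star maximizes min{d1 - a*u, d2 - a*(1 - u)} over 0 <= u <= 1,
    and eps is the resulting maximum, following the case analysis of
    Example 6.9.4.
    """
    if not (d1 > 0 and d2 > 0 and 0 < a < d1 + d2):
        raise ValueError("need d1, d2 > 0 and 0 < a < d1 + d2")
    diff = d1 - d2
    if diff < -a:
        u = 0.0
    elif diff > a:
        u = 1.0
    else:
        u = 0.5 + diff / (2 * a)
    eps = min(d1 - a * u, d2 - a * (1 - u))
    return u, eps

def mean_total_bounds(a, d1, d2):
    """Return the bounds (6.24) and (6.25) on the mean total queue length."""
    _, eps = optimal_split(a, d1, d2)
    return 1.0 / eps, 2.0 / (d1 + d2 - a)

# Symmetric case: |d1 - d2| <= a, so the two bounds coincide.
print(optimal_split(0.7, 0.4, 0.4), mean_total_bounds(0.7, 0.4, 0.4))

# Asymmetric case: d1 - d2 < -a, so u* = 0 and (6.25) is strictly tighter.
print(optimal_split(0.2, 0.1, 0.9), mean_total_bounds(0.2, 0.1, 0.9))
```

In the asymmetric case all traffic is sent to the faster queue, and the gap between the two bounds reflects the fact that queue 1 is then empty in equilibrium.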
Example 6.9.5 (Route-to-shorter policy) Consider a variation of the previous example such that when a packet arrives, it is routed to the shorter queue. To be definite, in case of a tie, the packet is routed to queue 1. Then the evolution equation (6.20) still holds, but with the description of the arrival variables changed to the following:
• Given $X(t) = (x_1, x_2)$, $A(t) = (I_{\{x_1 \le x_2\}}, I_{\{x_1 > x_2\}})$ with probability $a$, and $A(t) = (0, 0)$ otherwise.

Let $P^{RS}$ denote the one-step transition probability matrix when the route-to-shorter policy is used. Proceeding as in (6.21) yields:
\begin{align*}
P^{RS}V(x) - V(x) &\le \sum_{i=1}^{2} x_i E[A_i(t) - D_i(t) \mid X(t) = x] + 1 \\
&= a\left( x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \right) - d_1 x_1 - d_2 x_2 + 1.
\end{align*}
Note that $x_1 I_{\{x_1 \le x_2\}} + x_2 I_{\{x_1 > x_2\}} \le u x_1 + \bar u x_2$ for any $u \in [0, 1]$, with equality for $u = I_{\{x_1 \le x_2\}}$. Therefore, the drift bound for $V$ under the route-to-shorter policy is less than or equal to the drift bound (6.22), for $V$ for any choice of probabilistic splitting. In fact, route-to-shorter routing can be viewed as a controlled version of the independent splitting model, for which the control policy is selected to minimize the bound on the drift of $V$ in each state. It follows that the route-to-shorter process is positive recurrent as long as $a < d_1 + d_2$, and (6.23) holds for any value of $u$ such that $au < d_1$ and $a\bar u \le d_2$. In particular, (6.24) holds for the route-to-shorter process.

We remark that the stronger bound (6.25) is not always true for the route-to-shorter policy. The problem is that even if $d_1 - d_2 < -a$, the route-to-shorter policy can still route to queue 1, and so $\bar X_1 \neq 0$. In fact, if $a$ and $d_2$ are fixed with $0 < a < d_2 < 1$, then $\bar X_1 \to \infty$ as $d_1 \to 0$ for the route-to-shorter policy. Intuitively, that is because occasionally there will be a large number of customers in the system due to statistical fluctuations, and then there will be many customers in queue 1. But if $d_1 \ll 1$, those customers will remain in queue 1 for a very long time.
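The key inequality of Example 6.9.5, that the route-to-shorter drift bound never exceeds the drift bound (6.22) for any fixed splitting probability $u$, can be spot-checked on a grid of states. The Python sketch below is a numerical illustration under assumed parameter values; the function names `split_drift_bound` and `route_to_shorter_drift_bound` are hypothetical.

```python
def split_drift_bound(x1, x2, a, d1, d2, u):
    """Right-hand side of the drift bound (6.22) for fixed splitting probability u."""
    return a * (u * x1 + (1 - u) * x2) - d1 * x1 - d2 * x2 + 1

def route_to_shorter_drift_bound(x1, x2, a, d1, d2):
    """Drift bound under route-to-shorter: the same expression with
    u = 1 if x1 <= x2 (ties go to queue 1) and u = 0 otherwise."""
    u = 1.0 if x1 <= x2 else 0.0
    return split_drift_bound(x1, x2, a, d1, d2, u)

a, d1, d2 = 0.7, 0.4, 0.4   # assumed sample parameters with a < d1 + d2

# Largest value of (route-to-shorter bound) - (fixed-u bound) over a grid
# of states and splitting probabilities; it should not be positive,
# apart from floating-point rounding.
worst_gap = max(
    route_to_shorter_drift_bound(x1, x2, a, d1, d2)
    - split_drift_bound(x1, x2, a, d1, d2, k / 10)
    for x1 in range(30)
    for x2 in range(30)
    for k in range(11)
)
print(worst_gap)
```

The check only confirms the pointwise comparison of drift bounds; positive recurrence itself follows from Corollary 6.9.3, not from this computation.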
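Example 6.9.6 (An input queued switch with probabilistic switching) (Footnote 4: Tassiulas [12] originally developed the results of Examples 6.9.6 and 6.9.7, in the context of wireless networks. The paper [8] presents similar results in the context of a packet switch.) Consider a packet switch with $N$ inputs and $N$ outputs, as pictured in Figure 6.10. Suppose there are $N^2$ queues, $N$ at each input, with queue $i, j$ containing packets that arrived at input $i$ and are destined for output $j$, for $i, j \in E$, where $E = \{1, \cdots, N\}$. Suppose the packets are all the same length, and adopt a discrete-time model, so that during one time slot, a transfer of packets can occur, such that at most one packet can be transferred from each input, and at most one packet can be transferred to each output. A permutation $\sigma$ of $E$ has the form $\sigma = (\sigma_1, \ldots, \sigma_N)$, where $\sigma_1, \ldots, \sigma_N$ are distinct elements of $E$. Let $\Pi$ denote the set of all $N!$ such permutations. Given $\sigma \in \Pi$, let $R(\sigma)$ be the $N \times N$ switching matrix defined by $R_{ij}(\sigma) = I_{\{\sigma_i = j\}}$. Thus, $R_{ij}(\sigma) = 1$ means that under permutation $\sigma$, input $i$ is connected to output $j$, or, equivalently, a packet in queue $i, j$ is to depart, if there is any such packet. A state $x$ of the system has the form $x = (x_{ij} : i, j \in E)$, where $x_{ij}$ denotes the number of packets in queue $i, j$.

[Figure 6.10: A 4 × 4 input queued switch.]

The evolution of the system over a time slot $[t, t+1)$ is described as follows:
\[ X_{ij}(t+1) = X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t)) + L_{ij}(t) \]
where
• $A_{ij}(t)$ is the number of packets arriving at input $i$, destined for output $j$, in the slot. Assume that the variables $(A_{ij}(t) : i, j \in E, t \ge 0)$ are mutually independent, and for each $i, j$, the random variables $(A_{ij}(t) : t \ge 0)$ are independent, identically distributed, with mean $\lambda_{ij}$ and $E[A_{ij}^2] \le K_{ij}$, for some constants $\lambda_{ij}$ and $K_{ij}$. Let $\Lambda = (\lambda_{ij} : i, j \in E)$.
• $\sigma(t)$ is the switch state used during the slot.
• $L_{ij}(t) = (-(X_{ij}(t) + A_{ij}(t) - R_{ij}(\sigma(t))))_+$, which takes value one if there was an unused potential departure at queue $ij$ during the slot, and is zero otherwise.

The number of packets at input $i$ at the beginning of the slot is given by the row sum $\sum_{j \in E} X_{ij}(t)$, its mean is given by the row sum $\sum_{j \in E} \lambda_{ij}$, and at most one packet at input $i$ can be served in a time slot. Similarly, the set of packets waiting for output $j$, called the virtual queue for output $j$, has size given by the column sum $\sum_{i \in E} X_{ij}(t)$. The mean number of arrivals to the virtual queue for output $j$ is $\sum_{i \in E} \lambda_{ij}$, and at most one packet in the virtual queue can be served in a time slot. These considerations lead us to impose the following restrictions on $\Lambda$:
\[ \sum_{j \in E} \lambda_{ij} < 1 \text{ for all } i \quad \text{and} \quad \sum_{i \in E} \lambda_{ij} < 1 \text{ for all } j. \tag{6.26} \]
Except for trivial cases involving deterministic arrival sequences, the conditions (6.26) are necessary for stable operation, for any choice of the switch schedule $(\sigma(t) : t \ge 0)$.

Let's first explore random, independent and identically distributed (i.i.d.) switching. That is, given a probability distribution $u$ on $\Pi$, let $(\sigma(t) : t \ge 0)$ be independent with common probability distribution $u$. Once the distributions of the $A_{ij}$'s and $u$ are fixed, we have a discrete-time Markov process model. Given $\Lambda$ satisfying (6.26), we wish to determine a choice of $u$ so that the process with i.i.d. switch selection is positive recurrent.

Some standard background from switching theory is given in this paragraph. A line sum of a matrix $M$ is either a row sum, $\sum_j M_{ij}$, or a column sum, $\sum_i M_{ij}$. A square matrix $M$ is called doubly stochastic if it has nonnegative entries and if all of its line sums are one. Birkhoff's theorem, celebrated in the theory of switching, states that any doubly stochastic matrix $M$ is a convex combination of switching matrices. That is, such an $M$ can be represented as $M = \sum_{\sigma \in \Pi} R(\sigma) u(\sigma)$, where $u = (u(\sigma) : \sigma \in \Pi)$ is a probability distribution on $\Pi$. If $\widetilde M$ is a nonnegative matrix with all line sums less than or equal to one, then if some of the entries of $\widetilde M$ are increased appropriately, a doubly stochastic matrix can be obtained. That is, there exists a doubly stochastic matrix $M$ so that $\widetilde M_{ij} \le M_{ij}$ for all $i, j$. Applying Birkhoff's theorem to $M$ yields that there is a probability distribution $u$ so that $\widetilde M_{ij} \le \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma)$ for all $i, j$.

Suppose $\Lambda$ satisfies the necessary conditions (6.26). That is, suppose that all the line sums of $\Lambda$ are less than one. Then with $\epsilon$ defined by
\[ \epsilon = \frac{1 - (\text{maximum line sum of } \Lambda)}{N}, \]
each line sum of $(\lambda_{ij} + \epsilon : i, j \in E)$ is less than or equal to one. Thus, by the observation at the end of the previous paragraph, there is a probability distribution $u^*$ on $\Pi$ so that $\lambda_{ij} + \epsilon \le \mu_{ij}(u^*)$, where
\[ \mu_{ij}(u) = \sum_{\sigma \in \Pi} R_{ij}(\sigma) u(\sigma). \]
We consider the system using probability distribution $u^*$ for the switch states. That is, let $(\sigma(t) : t \ge 0)$ be independent, each with distribution $u^*$. Then for each $ij$, the random variables $R_{ij}(\sigma(t))$ are independent, Bernoulli($\mu_{ij}(u^*)$) random variables.

Consider the quadratic Lyapunov function $V$ given by $V(x) = \frac{1}{2} \sum_{i,j} x_{ij}^2$. As in (6.21),
\[ PV(x) - V(x) \le \sum_{i,j} x_{ij} E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] + \frac{1}{2} \sum_{i,j} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x]. \]
Now
\[ E[A_{ij}(t) - R_{ij}(\sigma(t)) \mid X(t) = x] = E[A_{ij}(t) - R_{ij}(\sigma(t))] = \lambda_{ij} - \mu_{ij}(u^*) \le -\epsilon \]
and
\[ \frac{1}{2} \sum_{i,j} E[(A_{ij}(t) - R_{ij}(\sigma(t)))^2 \mid X(t) = x] \le \frac{1}{2} \sum_{i,j} E[(A_{ij}(t))^2 + (R_{ij}(\sigma(t)))^2] \le K \]
where $K = \frac{1}{2}\left(N + \sum_{i,j} K_{ij}\right)$. Thus,
\[ PV(x) - V(x) \le -\epsilon \left( \sum_{i,j} x_{ij} \right) + K. \tag{6.27} \]
Therefore, by Corollary 6.9.3, the process is positive recurrent, and
\[ \sum_{i,j} \bar X_{ij} \le \frac{K}{\epsilon}. \tag{6.28} \]
That is, the necessary condition (6.26) is also sufficient for positive recurrence and finite mean queue length in equilibrium, under i.i.d. random switching, for an appropriate probability distribution $u^*$ on the set of permutations.

Example 6.9.7 (An input queued switch with maximum weight switching) The random switching policy used in Example 6.9.6 depends on the arrival rate matrix $\Lambda$, which may be unknown a priori. Also, the policy allocates potential departures to a given queue $ij$, whether or not the queue is empty, even if other queues could be served instead. This suggests using a dynamic switching policy, such as the maximum weight switching policy, defined by $\sigma(t) = \sigma^{MW}(X(t))$, where for a state $x$,
\[ \sigma^{MW}(x) = \arg\max_{\sigma \in \Pi} \sum_{i,j} x_{ij} R_{ij}(\sigma). \tag{6.29} \]
The use of "arg max" here means that $\sigma^{MW}(x)$ is selected to be a value of $\sigma$ that maximizes the sum on the right hand side of (6.29), which is the weight of permutation $\sigma$ with edge weights $x_{ij}$. In order to obtain a particular Markov model, we assume that the set of permutations $\Pi$ is numbered from 1 to $N!$ in some fashion, and in case there is a tie between two or more permutations for having the maximum weight, the lowest numbered permutation is used. Let $P^{MW}$ denote the one-step transition probability matrix when the maximum weight policy is used.

Letting $V$ and $K$ be as in Example 6.9.6, we find under the maximum weight policy that
\[ P^{MW}V(x) - V(x) \le \sum_{i,j} x_{ij} \left( \lambda_{ij} - R_{ij}(\sigma^{MW}(x)) \right) + K. \]
The maximum of a function is greater than or equal to the average of the function, so that for any probability distribution $u$ on $\Pi$
\[ \sum_{i,j} x_{ij} R_{ij}(\sigma^{MW}(x)) \ge \sum_{\sigma} u(\sigma) \sum_{i,j} x_{ij} R_{ij}(\sigma) = \sum_{i,j} x_{ij} \mu_{ij}(u) \tag{6.30} \]
with equality in (6.30) if and only if $u$ is concentrated on the set of maximum weight permutations. In particular, the choice $u = u^*$ shows that
\[ \sum_{i,j} x_{ij} R_{ij}(\sigma^{MW}(x)) \ge \sum_{i,j} x_{ij} \mu_{ij}(u^*) \ge \sum_{i,j} x_{ij} (\lambda_{ij} + \epsilon). \]
Therefore, if $P$ is replaced by $P^{MW}$, (6.27) still holds. Therefore, by Corollary 6.9.3, the process is positive recurrent, and the same moment bound, (6.28), holds, as for the randomized switching strategy of Example 6.9.6. On one hand, implementing the maximum weight algorithm does not require knowledge of the arrival rates, but on the other hand, it requires that queue length information be shared, and that a maximization problem be solved for each time slot. Much recent work has gone towards reduced complexity dynamic switching algorithms.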
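The maximum weight selection rule (6.29) can be sketched by brute force for small $N$. In the Python sketch below, the numbering of permutations is taken to be lexicographic order, which is an assumption (the notes only require some fixed numbering), and the function names and the sample queue-length matrix are hypothetical. The search is over all $N!$ permutations, so it is meant only as an illustration; the same maximization is an assignment problem, for which polynomial-time methods such as the Hungarian algorithm exist.

```python
from itertools import permutations

def permutation_weight(x, sigma):
    """Weight of permutation sigma with edge weights x: sum of x[i][sigma[i]]."""
    return sum(x[i][sigma[i]] for i in range(len(x)))

def max_weight_permutation(x):
    """Return (sigma, weight) for a maximum weight permutation.

    Permutations are enumerated in lexicographic order, and the strict
    comparison keeps the first (lowest numbered) maximizer on ties.
    Inputs and outputs are indexed from 0 here.
    """
    best, best_w = None, -1
    for sigma in permutations(range(len(x))):
        w = permutation_weight(x, sigma)
        if w > best_w:
            best, best_w = sigma, w
    return best, best_w

def average_weight(x):
    """Average weight over all permutations (uniform distribution u on Pi)."""
    weights = [permutation_weight(x, s) for s in permutations(range(len(x)))]
    return sum(weights) / len(weights)

# Hypothetical queue lengths for a 3 x 3 switch.
x = [[3, 0, 1],
     [0, 2, 5],
     [4, 1, 0]]
sigma_mw, w_mw = max_weight_permutation(x)
print(sigma_mw, w_mw, average_weight(x))
```

For this matrix two permutations tie for the maximum weight, and the lexicographically first one is returned. The comparison with `average_weight` is an instance of (6.30) with $u$ uniform on $\Pi$.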
6.9.2 Stability criteria for continuous time processes

Here is a continuous time version of the Foster-Lyapunov stability criteria and the moment bounds. Suppose $X$ is a time-homogeneous, irreducible, continuous-time Markov process with generator matrix $Q$. The drift vector of $V(X(t))$ is the vector $QV$. This definition is motivated by the fact that the mean drift of $V(X(t))$ for an interval of duration $h$ is given by
\begin{align*}
d_h(i) &= \frac{E[V(X(t+h)) \mid X(t) = i] - V(i)}{h} \\
&= \sum_{j \in S} \left( \frac{p_{ij}(h) - \delta_{ij}}{h} \right) V(j) \\
&= \sum_{j \in S} \left( q_{ij} + \frac{o(h)}{h} \right) V(j), \tag{6.31}
\end{align*}
so that if the limit as $h \to 0$ can be taken inside the summation in (6.31), then $d_h(i) \to QV(i)$ as $h \to 0$. The following useful expression for $QV$ follows from the fact that the row sums of $Q$ are zero:
\[ QV(i) = \sum_{j : j \neq i} q_{ij} (V(j) - V(i)). \tag{6.32} \]
Formula (6.32) is quite similar to the formula (6.17) for the drift vector for a discrete-time process.

Proposition 6.9.8 (Foster-Lyapunov stability criterion, continuous time) Suppose $V : S \to \mathbb{R}_+$ and $C$ is a finite subset of $S$.
(a) If $QV \le 0$ on $S - C$, and $\{i : V(i) \le K\}$ is finite for all $K$, then $X$ is recurrent.
(b) Suppose for some $b > 0$ and $\epsilon > 0$ that
\[ QV(i) \le -\epsilon + b I_C(i) \quad \text{for all } i \in S. \tag{6.33} \]
Suppose further that $\{i : V(i) \le K\}$ is finite for all $K$, or that $X$ is nonexplosive. Then $X$ is positive recurrent.

Example 6.9.9 Suppose $X$ has state space $S = \mathbb{Z}_+$, with $q_{i0} = \mu$ for all $i \ge 1$, $q_{i,i+1} = \lambda_i$ for all $i \ge 0$, and all other off-diagonal entries of the rate matrix $Q$ equal to zero, where $\mu > 0$ and $\lambda_i > 0$ such that $\sum_{i \ge 0} \frac{1}{\lambda_i} < +\infty$. Let $C = \{0\}$, $V(0) = 0$, and $V(i) = 1$ for $i \ge 1$. Then $QV = -\mu + (\lambda_0 + \mu) I_C$, so that (6.33) is satisfied with $\epsilon = \mu$ and $b = \lambda_0 + \mu$. However, $X$ is not positive recurrent. In fact, $X$ is explosive. To see this, note that $p^J_{i,i+1} = \frac{\lambda_i}{\mu + \lambda_i} \ge \exp\left(-\frac{\mu}{\lambda_i}\right)$. Let $\delta$ be the probability that, starting from state 0, the jump process does not return to zero. Then $\delta = \prod_{i=0}^{\infty} p^J_{i,i+1} \ge \exp\left(-\mu \sum_{i=0}^{\infty} \frac{1}{\lambda_i}\right) > 0$. Thus, $X^J$ is transient. After the last visit to state zero, all the jumps of $X^J$ are up one. The corresponding mean holding times of $X$ are $\frac{1}{\lambda_i + \mu}$, which have a finite sum, so that the process $X$ is explosive. This example illustrates the need for the assumption just after (6.33) in Proposition 6.9.8.

As for the case of discrete time, the drift conditions imply moment bounds.

Proposition 6.9.10 (Moment bound, continuous time) Suppose $V$, $f$, and $g$ are nonnegative functions on $S$, and suppose $QV(i) \le -f(i) + g(i)$ for all $i \in S$. In addition, suppose $X$ is positive recurrent, so that the means, $\bar f = \pi f$ and $\bar g = \pi g$, are well-defined. Then $\bar f \le \bar g$.

Corollary 6.9.11 (Combined Foster-Lyapunov stability criterion and moment bound, continuous time) Suppose $V$, $f$, and $g$ are nonnegative functions on $S$ such that $QV(i) \le -f(i) + g(i)$ for all $i \in S$, and, for some $\epsilon > 0$, the set $C$ defined by $C = \{i : f(i) < g(i) + \epsilon\}$ is finite. Suppose also that $\{i : V(i) \le K\}$ is finite for all $K$. Then $X$ is positive recurrent and $\bar f \le \bar g$.
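The drift computation of Example 6.9.9 can be reproduced numerically, which makes it concrete that the drift condition (6.33) can hold for an explosive process. The Python sketch below builds a truncated version of the generator with the assumed rates $\lambda_i = 2^i$ and $\mu = 1$ (any rates with $\sum_i 1/\lambda_i < \infty$ would serve). The function names, the truncation level `n`, and these rate choices are assumptions made for illustration.

```python
import numpy as np

def example_generator(lam, mu):
    """Generator of Example 6.9.9 truncated to states 0..n-1.

    q_{i,0} = mu for i >= 1 and q_{i,i+1} = lam[i]; in the last row the
    upward transition is dropped because of the truncation.
    """
    n = len(lam)
    Q = np.zeros((n, n))
    for i in range(n):
        if i >= 1:
            Q[i, 0] = mu
        if i + 1 < n:
            Q[i, i + 1] = lam[i]
        Q[i, i] = -Q[i].sum()   # row sums of a generator are zero
    return Q

def generator_drift(Q, V):
    """Drift vector QV of the Lyapunov function V."""
    return Q @ V

n, mu = 20, 1.0
lam = 2.0 ** np.arange(n)          # assumed rates with summable reciprocals
Q = example_generator(lam, mu)

V = np.ones(n)
V[0] = 0.0                         # V(0) = 0 and V(i) = 1 for i >= 1
QV = generator_drift(Q, V)

# Sum of the mean holding times 1 / (lam_i + mu) along the upward path.
holding_sum = float((1.0 / (lam + mu)).sum())
print(QV[0], QV[1:4], holding_sum)
```

The drift equals $\lambda_0$ at state 0 and $-\mu$ elsewhere, matching $QV = -\mu + (\lambda_0 + \mu) I_C$, while the partial sums of the mean holding times stay bounded, which is the mechanism behind the explosion.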
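Example 6.9.12 (Random server allocation with two servers) Consider the system shown in Figure 6.11. Suppose that each queue $i$ is fed by a Poisson arrival process with rate $\lambda_i$, and suppose there are two potential departure processes, $D_1$ and $D_2$, which are Poisson processes with rates $m_1$ and $m_2$, respectively. The five Poisson processes are assumed to be independent. No matter how the potential departures are allocated to the permitted queues, the following conditions are necessary for stability:
\[ \lambda_1 < m_1, \quad \lambda_3 < m_2, \quad \text{and} \quad \lambda_1 + \lambda_2 + \lambda_3 < m_1 + m_2. \tag{6.34} \]
That is because server 1 is the only one that can serve queue 1, server 2 is the only one that can serve queue 3, and the sum of the potential service rates must exceed the sum of the potential arrival rates for stability. A vector $x = (x_1, x_2, x_3) \in \mathbb{Z}_+^3$ corresponds to $x_i$ packets in queue $i$ for each $i$.

[Figure 6.11: A system of three queues with two servers.]

Let us consider random selection, so that when $D_i$ has a jump, the queue served is chosen at random, with the probabilities determined by $u = (u_1, u_2)$. As indicated in Figure 6.11, a potential service by server 1 is given to queue 1 with probability $u_1$, and to queue 2 with probability $\bar u_1$. Similarly, a potential service by server 2 is given to queue 2 with probability $u_2$, and to queue 3 with probability $\bar u_2$. The rates of potential service at the three stations are given by
\begin{align*}
\mu_1(u) &= u_1 m_1 \\
\mu_2(u) &= \bar u_1 m_1 + u_2 m_2 \\
\mu_3(u) &= \bar u_2 m_2.
\end{align*}
Let $V(x) = \frac{1}{2}(x_1^2 + x_2^2 + x_3^2)$. Using (6.32), we find that the drift vector $QV$ is given by
\[ QV(x) = \frac{1}{2} \left( \sum_{i=1}^{3} ((x_i + 1)^2 - x_i^2) \lambda_i \right) + \frac{1}{2} \left( \sum_{i=1}^{3} (((x_i - 1)_+)^2 - x_i^2) \mu_i(u) \right). \]
Now $((x_i - 1)_+)^2 \le (x_i - 1)^2$, so that
\[ QV(x) \le \left( \sum_{i=1}^{3} x_i (\lambda_i - \mu_i(u)) \right) + \frac{\gamma}{2} \tag{6.35} \]
where $\gamma$ is the total rate of events, given by $\gamma = \lambda_1 + \lambda_2 + \lambda_3 + \mu_1(u) + \mu_2(u) + \mu_3(u)$, or equivalently, $\gamma = \lambda_1 + \lambda_2 + \lambda_3 + m_1 + m_2$. Suppose that the necessary condition (6.34) holds. Then there exists some $\epsilon > 0$ and choice of $u$ so that
\[ \lambda_i + \epsilon \le \mu_i(u) \quad \text{for } 1 \le i \le 3 \]
and the largest such choice of $\epsilon$ is $\epsilon = \min\left\{ m_1 - \lambda_1,\; m_2 - \lambda_3,\; \frac{m_1 + m_2 - \lambda_1 - \lambda_2 - \lambda_3}{3} \right\}$. (See exercise.) So $QV(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$, so Corollary 6.9.11 implies that $X$ is positive recurrent and $\bar X_1 + \bar X_2 + \bar X_3 \le \frac{\gamma}{2\epsilon}$.

Example 6.9.13 (Longer first server allocation with two servers) This is a continuation of Example 6.9.12, concerned with the system shown in Figure 6.11. Examine the right hand side of (6.35). Rather than taking a fixed value of $u$, suppose that the choice of $u$ could be specified as a function of the state $x$. The maximum of a function is greater than or equal to the average of the function, so that for any probability distribution $u$,
\begin{align*}
\sum_{i=1}^{3} x_i \mu_i(u) &\le \max_{u'} \sum_{i} x_i \mu_i(u') \tag{6.36} \\
&= \max_{u'} \; m_1 (x_1 u_1' + x_2 \bar u_1') + m_2 (x_2 u_2' + x_3 \bar u_2') \\
&= m_1 (x_1 \vee x_2) + m_2 (x_2 \vee x_3)
\end{align*}
with equality in (6.36) for a given state $x$ if and only if a longer first policy is used: each service opportunity is allocated to the longer queue connected to the server. Let $Q^{LF}$ denote the transition rate matrix when the longer first policy is used. Then (6.35) continues to hold for any fixed $u$, when $Q$ is replaced by $Q^{LF}$. Therefore if the necessary condition (6.34) holds, $\epsilon$ can be taken as in Example 6.9.12, and $Q^{LF}V(x) \le -\epsilon(x_1 + x_2 + x_3) + \frac{\gamma}{2}$ for all $x$. So Corollary 6.9.11 implies that $X$ is positive recurrent under the longer first policy, and $\bar X_1 + \bar X_2 + \bar X_3 \le \frac{\gamma}{2\epsilon}$. (Note: We see that
\[ Q^{LF}V(x) \le \left( \sum_{i=1}^{3} x_i \lambda_i \right) - m_1 (x_1 \vee x_2) - m_2 (x_2 \vee x_3) + \frac{\gamma}{2}, \]
but for obtaining a bound on $\bar X_1 + \bar X_2 + \bar X_3$ it was simpler to compare to the case of random service allocation.)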
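The quantities in Examples 6.9.12 and 6.9.13 can be evaluated for concrete rates. The Python sketch below computes the stated value of $\epsilon$, one allocation $u$ that attains it, and the resulting bound $\gamma/(2\epsilon)$. The particular construction of $u$ in `allocation_for_eps` (give queues 1 and 3 exactly rate $\lambda_i + \epsilon$ and let queue 2 take the remainder) is an assumption made for illustration, not a construction given in the notes, and the code is a numerical check rather than a proof of the claim about the largest $\epsilon$. Function names and sample rates are hypothetical.

```python
def largest_eps(lam, m1, m2):
    """Value of eps stated in Example 6.9.12 for arrival rates lam = (l1, l2, l3)."""
    l1, l2, l3 = lam
    return min(m1 - l1, m2 - l3, (m1 + m2 - l1 - l2 - l3) / 3.0)

def allocation_for_eps(lam, m1, m2, eps):
    """One allocation u = (u1, u2) with lam_i + eps <= mu_i(u) for all i.

    Assumed construction: queue 1 and queue 3 receive exactly rate
    lam_i + eps, and queue 2 receives whatever capacity is left.
    """
    l1, _, l3 = lam
    u1 = (l1 + eps) / m1            # server 1 serves queue 1 w.p. u1
    u2 = 1.0 - (l3 + eps) / m2      # server 2 serves queue 2 w.p. u2
    return u1, u2

def service_rates(u1, u2, m1, m2):
    """Potential service rates (mu_1(u), mu_2(u), mu_3(u))."""
    return (u1 * m1, (1 - u1) * m1 + u2 * m2, (1 - u2) * m2)

def longer_first_weight(x, m1, m2):
    """Right-hand side of (6.36): m1 (x1 v x2) + m2 (x2 v x3)."""
    return m1 * max(x[0], x[1]) + m2 * max(x[1], x[2])

lam, m1, m2 = (1.0, 1.0, 1.0), 2.0, 3.0     # assumed rates satisfying (6.34)
eps = largest_eps(lam, m1, m2)
u1, u2 = allocation_for_eps(lam, m1, m2, eps)
mu = service_rates(u1, u2, m1, m2)
gamma = sum(lam) + m1 + m2
mean_bound = gamma / (2 * eps)               # bound on the mean total queue length
print(eps, (u1, u2), mu, mean_bound)
```

The same bound applies under the longer first policy, since for every state the weight `longer_first_weight(x, m1, m2)` is at least $\sum_i x_i \mu_i(u)$ for any fixed $u$.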
6.10 Problems

6.1 Mean hitting time for a simple Markov process
Let $(X(n) : n \ge 0)$ denote a discrete-time, time-homogeneous Markov chain with state space $\{0, 1, 2, 3\}$ and one-step transition probability matrix
\[ P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1-a & 0 & a & 0 \\ 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 1 & 0 \end{pmatrix} \]
for some constant $a$ with $0 \le a \le 1$. (a) Sketch the transition probability diagram for $X$ and give the equilibrium probability vector. If the equilibrium vector is not unique, describe all the equilibrium probability vectors. (b) Compute $E[\min\{n \ge 1 : X(n) = 3\} \mid X(0) = 0]$.

6.2 A two station pipeline in continuous time
This is a continuous-time version of Example 4.9.1. Consider a pipeline consisting of two single-buffer stages in series. Model the system as a continuous-time Markov process. Suppose new packets are offered to the first stage according to a rate $\lambda$ Poisson process. A new packet is accepted at stage one if the buffer in stage one is empty at the time of arrival. Otherwise the new packet is lost. If at a fixed time $t$ there is a packet in stage one and no packet in stage two, then the packet is transferred during $[t, t+h)$ to stage two with probability $h\mu_1 + o(h)$. Similarly, if at time $t$ the second stage has a packet, then the packet leaves the system during $[t, t+h)$ with probability $h\mu_2 + o(h)$, independently of the state of stage one. Finally, the probability of two or more arrival, transfer, or departure events during $[t, t+h)$ is $o(h)$. (a) What is an appropriate state-space for this model? (b) Sketch a transition rate diagram. (c) Write down the $Q$ matrix. (d) Derive the throughput, assuming that $\lambda = \mu_1 = \mu_2 = 1$. (e) Still assuming $\lambda = \mu_1 = \mu_2 = 1$, suppose the system starts with one packet in each stage. What is the expected time until both buffers are empty?

6.3 Equilibrium distribution of the jump chain
Suppose that $\pi$ is the equilibrium distribution for a time-homogeneous Markov process with transition rate matrix $Q$. Suppose that $B^{-1} = \sum_i -q_{ii}\pi_i$, where the sum is over all $i$ in the state space, is finite. Show that the equilibrium distribution for the jump chain $(X^J(k) : k \ge 0)$ (defined in Section 4.10) is given by $\pi^J_i = -B q_{ii} \pi_i$. (So $\pi$ and $\pi^J$ are identical if and only if $q_{ii}$ is the same for all $i$.)

6.4 A simple Poisson process calculation
Let $(N(t) : t \ge 0)$ be a Poisson random process with rate $\lambda > 0$. Compute $P[N(s) = i \mid N(t) = k]$ where $0 < s < t$ and $i$ and $k$ are nonnegative integers. (Caution: note order of $s$ and $t$ carefully.)

6.5 A simple question of periods
Consider a discrete-time Markov process with the nonzero one-step transition probabilities indicated by the following graph.

[Figure: transition graph on states 1 through 8.]

(a) What is the period of state 4? (b) What is the period of state 6?

6.6 A mean hitting time problem
Let $(X(t) : t \ge 0)$ be a time-homogeneous, pure-jump Markov process with state space $\{0, 1, 2\}$ and $Q$ matrix
\[ Q = \begin{pmatrix} -4 & 2 & 2 \\ 1 & -2 & 1 \\ 2 & 0 & -2 \end{pmatrix}. \]
(a) Write down the state transition diagram and compute the equilibrium distribution. (b) Compute $a_i = E[\min\{t \ge 0 : X(t) = 1\} \mid X(0) = i]$ for $i = 0, 1, 2$. If possible, use an approach that can be applied to larger state spaces. (c) Derive a variation of the Kolmogorov forward differential equations for the quantities: $\alpha_i(t) = P[X(s) \neq 2 \text{ for } 0 \le s \le t \text{ and } X(t) = i \mid X(0) = 0]$ for $0 \le i \le 2$. (You need not solve the equations.) (d) The forward Kolmogorov equations describe the evolution of an initial probability distribution going forward in time, given an initial condition. In other problems, a boundary condition is given at a final time, and a differential equation working backwards in time from a final condition is called for (called Kolmogorov backward equations). Derive a backward differential equation for: $\beta_j(t) = P[X(s) \neq 2 \text{ for } t \le s \le t_f \mid X(t) = j]$, for $0 \le j \le 2$ and $t \le t_f$ for some fixed time $t_f$. (Hint: Express $\beta_i(t-h)$ in terms of the $\beta_j(t)$'s for $t \le t_f$, and let $h \to 0$. You need not solve the equations.)

6.7 A birth-death process with periodic rates
Consider a single server queueing system in which the number in the system is modeled as a continuous time birth-death process with the transition rate diagram shown, where $\lambda_a$, $\lambda_b$, $\mu_a$, and $\mu_b$ are strictly positive constants.

[Figure: transition rate diagram on states 0, 1, 2, 3, 4, ..., with rates alternating between $\lambda_a$, $\mu_a$ and $\lambda_b$, $\mu_b$.]

(a) Under what additional assumptions on these four parameters is the process positive recurrent? (b) Assuming the system is positive recurrent, under what conditions on $\lambda_a$, $\lambda_b$, $\mu_a$, and $\mu_b$ is it true that the distribution of the number in the system at the time of a typical arrival is the same as the equilibrium distribution of the number in the system?

6.8 Markov model for a link with resets
Suppose that a regulated communication link resets at a sequence of times forming a Poisson process with rate $\mu$. Packets are offered to the link according to a Poisson process with rate $\lambda$. Suppose the link shuts down after three packets pass in the absence of resets. Once the link is shut down, additional offered packets are dropped, until the link is reset again, at which time the process begins anew.

[Figure: link with packet arrival rate $\lambda$ and reset rate $\mu$.]

(a) Sketch a transition rate diagram for a finite state Markov process describing the system state. (b) Express the dropping probability (same as the long term fraction of packets dropped) in terms of $\lambda$ and $\mu$.

6.9 An unusual birth-death process
Consider the birth-death process $X$ with arrival rates $\lambda_k = (p/(1-p))^k / a_k$ and death rates $\mu_k = (p/(1-p))^{k-1} / a_k$, where $.5 < p < 1$, and $a = (a_0, a_1, \ldots)$ is a probability distribution on the nonnegative integers with $a_k > 0$ for all $k$. (a) Classify the states for the process $X$ as transient, null recurrent or positive recurrent. (b) Check that $aQ = 0$. Is $a$ an equilibrium distribution for $X$? Explain. (c) Find the one-step transition probabilities for the jump-chain, $X^J$. (d) Classify the states for the process $X^J$ as transient, null recurrent or positive recurrent.

6.10 A queue with decreasing service rate
Consider a queueing system in which the arrival process is a Poisson process with rate $\lambda$. Suppose the instantaneous completion rate is $\mu$ when there are $K$ or fewer customers in the system, and $\mu/2$ when there are $K+1$ or more customers in the system. The number in the system is modeled as a birth-death Markov process. (a) Sketch the transition rate diagram. (b) Under what condition on $\lambda$ and $\mu$ are all states positive recurrent? Under this condition, give the equilibrium distribution. (c) Suppose that $\lambda = (2/3)\mu$. Describe in words the typical behavior of the system, given that it is initially empty.

6.11 Limit of a discrete time queueing system
Model a queue by a discrete-time Markov chain by recording the queue state after intervals of $q$ seconds each. Assume the queue evolves during one of the atomic intervals as follows: There is an arrival during the interval with probability $\alpha q$, and no arrival otherwise. If there is a customer in the queue at the beginning of the interval then a single departure will occur during the interval with probability $\beta q$. Otherwise no departure occurs. Suppose that it is impossible to have an arrival and a departure in a single atomic interval. (a) Find $a_k = P[\text{an interarrival time is } kq]$ and $b_k = P[\text{a service time is } kq]$. (b) Find the equilibrium distribution, $p = (p_k : k \ge 0)$, of the number of customers in the system at the end of an atomic interval. What happens as $q \to 0$?

6.12 An M/M/1 queue with impatient customers
Consider an M/M/1 queue with parameters $\lambda$ and $\mu$ with the following modification. Each customer in the queue will defect (i.e. depart without service) with probability $\alpha h + o(h)$ in an interval of length $h$, independently of the other customers in the queue. Once a customer makes it to the server it no longer has a chance to defect and simply waits until its service is completed and then departs from the system. Let $N(t)$ denote the number of customers in the system (queue plus server) at time $t$. (a) Give the transition rate diagram and generator matrix $Q$ for the Markov chain $N = (N(t) : t \ge 0)$. (b) Under what conditions are all states positive recurrent? Under this condition, find the equilibrium distribution for $N$. (You need not explicitly sum the series.) (c) Suppose that $\alpha = \mu$. Find an explicit expression for $p_D$, the probability that a typical arriving customer defects instead of being served. Does your answer make sense as $\lambda/\mu$ converges to zero or to infinity?

6.13 Statistical multiplexing
Consider the following scenario regarding a one-way link in a store-and-forward packet communication network. Suppose that the link supports eight connections, each generating traffic at 5 kilobits per second (kbps). The data for each connection is assumed to be in packets exponentially distributed in length with mean packet size 1 kilobit. The packet lengths are assumed mutually independent and the packets for each stream arrive according to a Poisson process. Packets are queued at the beginning of the link if necessary, and queue space is unlimited. Compute the mean delay (queueing plus transmission time; neglect propagation delay) for each of the following three scenarios. Compare your answers. (a) (Full multiplexing) The link transmit speed is 50 kbps. (b) The link is replaced by two 25 kbps links, and each of the two links carries four sessions. (Of course the delay would be larger if the sessions were not evenly divided.) (c) (Multiplexing over two links) The link is replaced by two 25 kbps links. Each packet is transmitted on one link or the other, and neither link is idle whenever a packet from any session is waiting.

6.14 A queue with blocking (M/M/1/5 system)
Consider an M/M/1 queue with service rate $\mu$, arrival rate $\lambda$, and the modification that at any time, at most five customers can be in the system (including the one in service, if any). If a customer arrives and the system is full (i.e. already has five customers in it) then the customer is dropped, and is said to be blocked. Let $N(t)$ denote the number of customers in the system at time $t$. Then $(N(t) : t \ge 0)$ is a Markov chain. (a) Indicate the transition rate diagram of the chain and find the equilibrium probability distribution. (b) What is the probability, $p_B$, that a typical customer is blocked? (c) What is the mean waiting time in queue, $W$, of a typical customer that is not blocked? (d) Give a simple method to numerically calculate, or give a simple expression for, the mean length of a busy period of the system. (A busy period begins with the arrival of a customer to an empty system and ends when the system is again empty.)

6.15 Three queues and an autonomously traveling server
Consider three stations that are served by a single rotating server, as pictured.

[Figure: stations 1, 2, and 3 with their arrival streams, and dashed lines indicating the rotation of the server among them.]

Customers arrive to station $i$ according to a Poisson process of rate $\lambda_i$ for $1 \le i \le 3$, and the total service requirement of each customer is exponentially distributed, with mean one. The rotation of the server is modelled by a three state Markov process with the transition rates $\alpha$, $\beta$, and $\gamma$ as indicated by the dashed lines. When at a station, the server works at unit rate, or is idle if the station is empty. If the service to a customer is interrupted because the server moves to the next station, the service is resumed when the server returns. (a) Under what condition is the system stable? Briefly justify your answer. (b) Identify a method for computing the mean customer waiting time at station one.

6.16 On two distributions seen by customers
Consider a queueing system in which the number in the system only changes in steps of plus one or minus one. Let $D(k, t)$ denote the number of customers that depart in the interval $[0, t]$ that leave behind exactly $k$ customers, and let $R(k, t)$ denote the number of customers that arrive in the interval $[0, t]$ to find exactly $k$ customers already in the system. (a) Show that $|D(k, t) - R(k, t)| \le 1$ for all $k$ and $t$. (b) Let $\alpha_t$ (respectively $\delta_t$) denote the number of arrivals (departures) up to time $t$. Suppose that $\alpha_t \to \infty$ and $\alpha_t/\delta_t \to 1$ as $t \to \infty$. Show that if the following two limits exist for a given value $k$, then they are equal: $r_k = \lim_{t \to \infty} R(k, t)/\alpha_t$ and $d_k = \lim_{t \to \infty} D(k, t)/\delta_t$.

6.17 Recurrence of mean zero random walks
Suppose $B_1, B_2, \ldots$ is a sequence of independent, mean zero, integer valued random variables, which are bounded, i.e. $P[|B_i| \le M] = 1$ for some $M$. (a) Let $X_0 = 0$ and $X_n = B_1 + \cdots + B_n$ for $n \ge 0$. Show that $X$ is recurrent. (b) Suppose $Y_0 = 0$ and $Y_{n+1} = Y_n + B_n + L_n$, where $L_n = (-(Y_n + B_n))_+$. The process $Y$ is a reflected version of $X$. Show that $Y$ is recurrent.

6.18 Positive recurrence of reflected random walk with negative drift
Suppose $B_1, B_2, \ldots$ is a sequence of independent, integer valued random variables, each with mean $\bar B < 0$ and second moment $\overline{B^2} < +\infty$. Suppose $X_0 = 0$ and $X_{n+1} = X_n + B_n + L_n$, where $L_n = (-(X_n + B_n))_+$. Show that $X$ is positive recurrent, and give an upper bound on the mean under the equilibrium distribution, $\bar X$. (Note, it is not assumed that the $B$'s are bounded.)

6.19 Routing with two arrival streams
(a) Generalize Example 6.9.4 to the scenario shown.

[Figure: two arrival streams, with rates $a_1$ and $a_2$, routed according to $u_1$ and $u_2$ to three queues with departure probabilities $d_1$, $d_2$, and $d_3$.]

where $a_i, d_j \in (0, 1)$ for $1 \le i \le 2$ and $1 \le j \le 3$. In particular, determine conditions on $a_1$ and $a_2$ that insure there is a choice of $u = (u_1, u_2)$ which makes the system positive recurrent. Under those conditions, find an upper bound on $\bar X_1 + \bar X_2 + \bar X_3$, and select $u$ to minimize the bound. (b) Generalize Example 6.9.5 to the scenario shown. In particular, can you find a version of route-to-shorter routing so that the bound found in part (a) still holds?

6.20 An inadequacy of a linear potential function
Consider the system of Example 6.9.5 (a discrete-time model, using the route to shorter policy, with ties broken in favor of queue 1, so $u = I_{\{x_1 \le x_2\}}$):

[Figure: two queues fed by a single arrival stream, as in Figure 6.9.]

Assume $a = 0.7$ and $d_1 = d_2 = 0.4$. The system is positive recurrent. Explain why the function $V(x) = x_1 + x_2$ does not satisfy the Foster-Lyapunov stability criteria for positive recurrence, for any choice of the constant $b$ and the finite set $C$.

6.21 Allocation of service
Prove the claim in Example 6.9.12 about the largest value of $\epsilon$.

6.22 Opportunistic scheduling (Based on [14])
Suppose $N$ queues are in parallel, and suppose the arrivals to a queue $i$ form an independent, identically distributed sequence, with the number of arrivals in a given slot having mean $a_i > 0$ and finite second moment $K_i$. Let $S(t)$ for each $t \ge 0$ be a subset of $E = \{1, \ldots, N\}$. The random sets $S(t) : t \ge 0$ are assumed to be independent with common distribution $w$. The interpretation is that there is a single server, and in slot $t$, it can serve one packet from one of the queues in $S(t)$. For example, the queues might be in the base station of a wireless network with packets queued for $N$ mobile users, and $S(t)$ denotes the set of mobile users that have working channels for time slot $[t, t+1)$. See the illustration:

[Figure: $N$ parallel queues with arrival rates $a_1, \ldots, a_N$, served through a fading channel with state $s$.]

(a) Explain why the following condition is necessary for stability: For all $s \subset E$ with $s \neq \emptyset$,
\[ \sum_{i \in s} a_i < \sum_{B : B \cap s \neq \emptyset} w(B). \tag{6.37} \]
(b) Consider $u$ of the form $u = (u(i, s) : i \in E, s \subset E)$, with $u(i, s) \ge 0$, $u(i, s) = 0$ if $i \notin s$, and $\sum_{i \in E} u(i, s) = I_{\{s \neq \emptyset\}}$. Suppose that given $S(t) = s$, the queue that is given a potential service opportunity has probability distribution $(u(i, s) : i \in E)$. Then the probability of a potential service at queue $i$ is given by $\mu_i(u) = \sum_s u(i, s) w(s)$ for $i \in E$. Show that under the condition (6.37), for some $\epsilon > 0$, $u$ can be selected so that $a_i + \epsilon \le \mu_i(u)$ for $i \in E$. (Hint: Apply the min-cut, max-flow theorem to an appropriate graph.) (c) Show that using the $u$ found in part (b) that the process is positive recurrent.
s) : i ∈ E). s) = I{s=∅} .21 Allocation of service Prove the claim in Example 6. (Hint: Apply the min-cut. for any choice of the constant b and the finite set C. s)w(s) for i ∈ E. and S(t) denotes the set of mobile users that have working channels for time slot [t. The random sets S(t) : t ≥ 0 are assumed to be independent with common distribution w.7 and d1 = d2 = 0. max-flow theorem to an appropriate graph. . .12 about the largest value of .4. See the illustration: a 1 queue 1 a 2 queue 2 . the queues might be in the base station of a wireless network with packets queued for N mobile users. . . .22 Opportunistic scheduling (Based on [14]) Suppose N queues are in parallel. . Then the probability of a potential service at queue i is given by µi (u) = s u(i. and suppose the arrivals to a queue i form an independent. The system is positive recurrent. and i∈E u(i. s) : i ∈ E. The interpretation is that there is a single server. 6.9.6. For example. for some > 0. Suppose that given S(t) = s. the queue that is given a potential service opportunity has probability distribution (u(i. (The smaller the bound the better. x2 ) = 21 + 22 . µ2 ) is there a choice of the constant u such that the Markov process describing the system is positive recurrent? x2 x2 (b) Let V be the quadratic Lyapunov function. (You don’t need to write out Q. V (x1 .25 Stability of a system with two queues and modulated server Consider two queues. 6.) + (a) Under what condition on (λ1 . ν. where u is a constant with u ∈ U = [0.23 Routing to two queues – continuous time model Give a continuous time analog of Examples 6. ν. In particular. (c) Under the condition of part (a). queue 1 and queue 2.) The system is described by a continuous-time Markov process on Z2 with some transition rate matrix Q. and using the moment bound associated with the FosterLyapunov criteria. λ2 . " 1 station 1 u! µ 1 " station 2 2 µ 2 Station i has Possion arrivals at rate λi and an exponential type server. µ1 . 
customers are transferred from station 1 to station 2 at rate uν. Compute the drift vector QV . find an upper bound on the mean number in the system in equilibrium. DYNAMICS OF COUNTABLE-STATE MARKOV MODELS (d) Suggest a dynamic scheduling method which does not require knowledge of the arrival rates or the distribution w. which yields the same bound on the mean sum of queue lengths found in part (b).202 CHAPTER 6. and consider a system of two service stations with transfers as pictured. such that in each time slot.9. with rate µi .4 and 6. queue i receives a new packet with probability ai . λ2 . where 0 < a1 < 1 and 0 < a2 < 1. Suppose the server is described by a three state Markov process.9. 6. In addition. 1]. suppose that the arrival process is Poisson with rate λ and the potential departure processes are Poisson with rates µ1 and µ2 .) 6. µ2 ) be a vector of strictly positve parameters. X1 +X2 .5.24 Stability of two queues with transfers Let (λ1 . a 1 queue 1 1 0 ! server longer a 2 queue 2 2 . as shown. we will apply the method of FosterLyapunov stability theory in continuous time. µ1 . (Rather than applying dynamic programming here. 2} at the beginning of a slot.6. Then during the slot. then a potential service is given to station i. If the server process is in state 0 at the beginning of a slot. then a potential service is given to the longer queue (with ties broken in favor of queue 1).) For what values of a1 and a2 is the process stable? Briefly explain your answer (but rigorous proof is not required). . PROBLEMS 203 If the server process is in state i for i ∈ {1. (Note that a packet can arrive and depart in one time slot. the server state jumps with the probabilities indicated.10. DYNAMICS OF COUNTABLE-STATE MARKOV MODELS .204 CHAPTER 6. and in distribution (d. Chapter 2 covers convergence of sequences. convergence.3 and Corollaries 2. it suffices to prove the contrapositive: if (1) is not true then (2) is not true. 
Chapter 7

Basic Calculus of Random Processes

The calculus of deterministic functions revolves around continuous functions, derivatives, and integrals. These concepts all involve the notion of limits. See the appendix for a review of continuity, differentiation and integration. In this chapter the same concepts are treated for random processes. We've seen four different senses in which a sequence of random variables can converge: almost surely (a.s.), in probability (p.), in mean square (m.s.), and in distribution (d.). Of these senses, we will use the mean square sense of convergence the most, and make use of the correlation version of the Cauchy criterion for m.s. convergence, and the associated facts that for m.s. convergence, the means of the limits are the limits of the means, and correlations of the limits are the limits of correlations (Proposition 2.2.3 and Corollaries 2.2.4 and 2.2.5). As an application of integration of random processes, ergodicity and the Karhunen-Loève expansion are discussed. In addition, notation for complex-valued random processes is introduced.

7.1 Continuity of random processes

The topic of this section is the definition of continuity of a continuous-time random process, with a focus on continuity defined using m.s. convergence. Chapter 2 covers convergence of sequences. Limits for deterministic functions of a continuous variable can be defined in either of two equivalent ways. Specifically, a function $f$ on $\mathbb{R}$ has a limit $y$ at $t_o$, written as $\lim_{s \to t_o} f(s) = y$, if either of the two equivalent conditions is true:

(1) (Definition based on $\epsilon$ and $\delta$) Given $\epsilon > 0$, there exists $\delta > 0$ so that $|f(s) - y| \le \epsilon$ whenever $|s - t_o| \le \delta$.

(2) (Definition based on sequences) $f(s_n) \to y$ for any sequence $(s_n)$ such that $s_n \to t_o$.

Let's check that (1) and (2) are equivalent. Suppose (1) is true, and let $(s_n)$ be such that $s_n \to t_o$. Let $\epsilon > 0$ and then let $\delta$ be as in condition (1). Since $s_n \to t_o$, it follows that there exists $n_o$ so that $|s_n - t_o| \le \delta$ for all $n \ge n_o$. But then $|f(s_n) - y| \le \epsilon$ by the choice of $\delta$. Thus, $f(s_n) \to y$. That is, (1) implies (2).

For the converse direction, it suffices to prove the contrapositive: if (1) is not true then (2) is not true. Suppose (1) is not true. Then there exists an $\epsilon > 0$ so that, for any $n \ge 1$, there exists a value $s_n$ with $|s_n - t_o| \le \frac{1}{n}$ such that $|f(s_n) - y| > \epsilon$. But then $s_n \to t_o$, and yet $f(s_n) \not\to y$, so (2) is false. That is, not (1) implies not (2). This completes the proof that (1) and (2) are equivalent.

Similarly, and by essentially the same reasons, convergence for a continuous-time random process can be defined using either $\epsilon$ and $\delta$, or using sequences, at least for limits in the p., m.s., or d. senses. As we will see, the situation is slightly different for a.s. limits. Let $X = (X_t : t \in T)$ be a random process such that the index set $T$ is equal to either all of $\mathbb{R}$, or an interval in $\mathbb{R}$, and fix $t_o \in T$.

Definition 7.1.1 (Limits for continuous-time random processes.) The process $(X_t : t \in T)$ has limit $Y$ at $t_o$:

(i) in the m.s. sense, written $\lim_{s \to t_o} X_s = Y$ m.s., if for any $\epsilon > 0$, there exists $\delta > 0$ so that $E[(X_s - Y)^2] < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \to Y$ m.s. as $n \to \infty$, whenever $s_n \to t_o$.

(ii) in probability, written $\lim_{s \to t_o} X_s = Y$ p., if given any $\epsilon > 0$, there exists $\delta > 0$ so that $P\{|X_s - Y| \ge \epsilon\} < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \to Y$ p. as $n \to \infty$, whenever $s_n \to t_o$.

(iii) in distribution, written $\lim_{s \to t_o} X_s = Y$ d., if given any continuity point $c$ of $F_Y$ and any $\epsilon > 0$, there exists $\delta > 0$ so that $|F_{X,1}(c, s) - F_Y(c)| < \epsilon$ whenever $s \in T$ and $|s - t_o| < \delta$. An equivalent condition is $X_{s_n} \to Y$ d. as $n \to \infty$, whenever $s_n \to t_o$. (Recall that $F_{X,1}(c, s) = P\{X_s \le c\}$.)

(iv) almost surely, written $\lim_{s \to t_o} X_s = Y$ a.s., if there is an event $F_{t_o}$ having probability one such that $F_{t_o} \subset \{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$. (See footnote 1.)

Footnote 1: This definition is complicated by the fact that the set $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ involves uncountably many random variables, and it is not necessarily an event. There is a way to simplify the definition as follows, but it requires an extra assumption. A probability space $(\Omega, \mathcal{F}, P)$ is complete, if whenever $N$ is an event having probability zero, all subsets of $N$ are events. If $(\Omega, \mathcal{F}, P)$ is complete, the definition of $\lim_{s \to t_o} X_s = Y$ a.s. is equivalent to the requirement that $\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\}$ be an event and have probability one.

The relationship among the above four types of convergence in continuous time is the same as the relationship among the four types of convergence of sequences, illustrated in Figure 2.8. That is, the following is true:

Proposition 7.1.2 The following statements hold as $s \to t_o$ for a fixed $t_o$ in $T$: If either $X_s \to Y$ a.s. or $X_s \to Y$ m.s., then $X_s \to Y$ p. If $X_s \to Y$ p., then $X_s \to Y$ d. Also, if there is a random variable $Z$ with $E[Z^2] < \infty$ and $|X_t| \le Z$ for all $t$, and if $X_s \to Y$ p., then $X_s \to Y$ m.s.

Proof. As indicated in Definition 7.1.1, the first three types of convergence are equivalent to convergence along sequences, in the corresponding senses. The fourth type of convergence, namely a.s. convergence as $s \to t_o$, implies convergence along sequences (Example 7.1.3 shows that the converse is not true). That is true because if $(s_n)$ is a sequence converging to $t_o$,

$$\{\omega : \lim_{s \to t_o} X_s(\omega) = Y(\omega)\} \subset \{\omega : \lim_{n \to \infty} X_{s_n}(\omega) = Y(\omega)\}.$$

Therefore, if the first of these sets contains an event which has probability one, the second of these sets is an event which has probability one. The proposition then follows from the same relations for convergence of sequences. In particular, a.s. convergence for continuous time implies a.s. convergence along sequences (as just shown), which implies convergence in p. along sequences, which is the same as convergence in probability. The other implications of the proposition follow directly from the same implications for sequences, and the fact the first three definitions of convergence for continuous time have a form based on sequences.
The following example shows that a.s. convergence as $s \to t_o$ is strictly stronger than a.s. convergence along sequences.

Example 7.1.3 Let $U$ be uniformly distributed on the interval $[0, 1]$. Let $X_t = 1$ if $t - U$ is a rational number, and $X_t = 0$ otherwise. Each sample path of $X$ takes values zero and one in any finite interval, so that $X$ is not a.s. convergent at any $t_o$. However, for any fixed $t$, $P\{X_t = 0\} = 1$. Therefore, for any sequence $s_n$, since there are only countably many terms, $P\{X_{s_n} = 0 \text{ for all } n\} = 1$ so that $X_{s_n} \to 0$ a.s.

Definition 7.1.4 (Four types of continuity at a point for a random process) For each $t_o \in T$ fixed, the random process $X = (X_t : t \in T)$ is continuous at $t_o$ in any one of the four senses: m.s., p., a.s., or d., if $\lim_{s \to t_o} X_s = X_{t_o}$ in the corresponding sense.

The following is immediately implied by Proposition 7.1.2. It shows that for convergence of a random process at a single point, the relations illustrated in Figure 2.8 again hold.

Corollary 7.1.5 If $X$ is continuous at $t_o$ in either the a.s. or m.s. sense, then $X$ is continuous at $t_o$ in probability. If $X$ is continuous at $t_o$ in probability, then $X$ is continuous at $t_o$ in distribution. Also, if there is a random variable $Z$ with $E[Z^2] < \infty$ and $|X_t| \le Z$ for all $t$, and if $X$ is continuous at $t_o$ in probability, then it is continuous at $t_o$ in the m.s. sense.

A deterministic function $f$ on $\mathbb{R}$ is simply called continuous if it is continuous at all points. Since we have four senses of continuity at a point for a random process, this gives four types of continuity for random processes. Before stating them formally, we describe a fifth type of continuity of random processes, which is often used in applications. Recall that for a fixed $\omega \in \Omega$, the random process $X$ gives a sample path, which is a function on $T$. Continuity of a sample path is thus defined as it is for any deterministic function. The subset of $\Omega$, $\{\omega : X_t(\omega) \text{ is a continuous function of } t\}$, or more concisely, $\{X_t \text{ is a continuous function of } t\}$, is the set of $\omega$ such that the sample path for $\omega$ is continuous. The fifth type of continuity requires that the sample paths be continuous, if a set of probability zero is ignored.

Definition 7.1.6 (Five types of continuity for a whole random process) A random process $X = (X_t : t \in T)$ is said to be

m.s. continuous if it is m.s. continuous at each $t$,
continuous in p. if it is continuous in p. at each $t$,
continuous in d. if it is continuous in d. at each $t$,
a.s. continuous at each $t$, if, as the phrase indicates, it is a.s. continuous at each $t$ (see footnote 2),
a.s. sample-path continuous, if $F \subset \{X_t \text{ is continuous in } t\}$ for some event $F$ with $P(F) = 1$.

The relationship among the five types of continuity for a whole random process is pictured in Figure 7.1, and is summarized in the following proposition.

Proposition 7.1.7 If a process is a.s. sample-path continuous it is a.s. continuous at each $t$. If a process is a.s. continuous at each $t$ or m.s. continuous, it is continuous in p. If a process is continuous in p. it is continuous in d. Also, if there is a random variable $Y$ with $E[Y^2] < \infty$ and $|X_t| \le Y$ for all $t$, and if $X$ is continuous in p., then $X$ is m.s. continuous.

Proof. Suppose $X$ is a.s. sample-path continuous. Then for any $t_o \in T$,

$$\{\omega : X_t(\omega) \text{ is continuous at all } t \in T\} \subset \{\omega : X_t(\omega) \text{ is continuous at } t_o\}. \qquad (7.1)$$

Since $X$ is a.s. sample-path continuous, the set on the left-hand side of (7.1) contains an event $F$ with $P(F) = 1$ and $F$ is also a subset of the set on the right-hand side of (7.1). Thus, $X$ is a.s. continuous at $t_o$. Since $t_o$ was an arbitrary element of $T$, it follows that $X$ is a.s. continuous at each $t$. The remaining implications of the proposition follow from Corollary 7.1.5.
[Figure 7.1: Relationships among five types of continuity of random processes.]

Footnote 2: We avoid using the terminology "a.s. continuous" for the whole random process, because such terminology could too easily be confused with a.s. sample-path continuous.

Example 7.1.8 (Shows a.s. sample-path continuity is strictly stronger than a.s. continuity at each $t$.) Let $X = (X_t : 0 \le t \le 1)$ be given by $X_t = I_{\{t \ge U\}}$ for $0 \le t \le 1$, where $U$ is uniformly distributed over $[0, 1]$. Thus, each sample path of $X$ has a single upward jump of size one, at a random time $U$ uniformly distributed over $[0, 1]$. So every sample path is discontinuous, and therefore $X$ is not a.s. sample-path continuous. For any fixed $t$ and $\omega$, if $U(\omega) \ne t$ (i.e. if the jump of $X$ is not exactly at time $t$) then $X_s(\omega) \to X_t(\omega)$ as $s \to t$. Since $P\{U = t\} = 0$, it follows that $X$ is a.s. continuous at each $t$. Therefore $X$ is also continuous in the p. and d. senses. Finally, since $|X_t| \le 1$ for all $t$ and $X$ is continuous in p., it is also m.s. continuous.

The remainder of this section focuses on m.s. continuity. Recall that the definition of m.s. convergence of a sequence of random variables requires that the random variables have finite second moments, and consequently the limit also has a finite second moment. Thus, in order for a random process $X = (X_t : t \in T)$ to be continuous in the m.s. sense, it must be a second order process: $E[X_t^2] < \infty$ for all $t \in T$. Whether $X$ is m.s. continuous depends only on the correlation function $R_X$, as shown in the following proposition.

Proposition 7.1.9 Suppose $(X_t : t \in T)$ is a second order process. The following are equivalent:

(i) $R_X$ is continuous at all points of the form $(t, t)$. (This condition involves $R_X$ for points in and near the set of points of the form $(t, t)$. It is stronger than requiring $R_X(t, t)$ to be continuous in $t$; see Example 7.1.10.)
(ii) $X$ is m.s. continuous.
(iii) $R_X$ is continuous over $T \times T$.

If $X$ is m.s. continuous, then the mean function, $\mu_X(t)$, is continuous. If $X$ is wide sense stationary, the following are equivalent:

(i') $R_X(\tau)$ is continuous at $\tau = 0$.
(ii') $X$ is m.s. continuous.
(iii') $R_X(\tau)$ is continuous over all of $\mathbb{R}$.

Proof. ((i) implies (ii)) Fix $t \in T$ and suppose that $R_X$ is continuous at the point $(t, t)$. Then $R_X(s, s)$, $R_X(s, t)$, and $R_X(t, s)$ all converge to $R_X(t, t)$ as $s \to t$. Therefore,

$$\lim_{s \to t} E[(X_s - X_t)^2] = \lim_{s \to t} \left( R_X(s, s) - R_X(s, t) - R_X(t, s) + R_X(t, t) \right) = 0.$$

So $X$ is m.s. continuous at $t$. Therefore if $R_X$ is continuous at all points of the form $(t, t) \in T \times T$, then $X$ is m.s. continuous at all $t \in T$. Therefore (i) implies (ii).

((ii) implies (iii)) Suppose condition (ii) is true. Let $(s, t) \in T \times T$, and suppose $(s_n, t_n) \in T \times T$ for all $n \ge 1$ such that $\lim_{n \to \infty} (s_n, t_n) = (s, t)$. Therefore, $s_n \to s$ and $t_n \to t$ as $n \to \infty$. By m.s. continuity, it follows that $X_{s_n} \to X_s$ m.s. and $X_{t_n} \to X_t$ m.s. as $n \to \infty$. Since the limit of the correlations is the correlation of the limit for a pair of m.s. convergent sequences (Corollary 2.2.4), it follows that $R_X(s_n, t_n) \to R_X(s, t)$ as $n \to \infty$. Thus, $R_X$ is continuous at $(s, t)$, where $(s, t)$ was an arbitrary point of $T \times T$. Therefore $R_X$ is continuous over $T \times T$, proving that (ii) implies (iii).

Obviously (iii) implies (i), so the proof of the equivalence of (i)-(iii) is complete.

If $X$ is m.s. continuous, then, by definition, for any $t \in T$, $X_s \to X_t$ m.s. as $s \to t$. It thus follows that $\mu_X(s) \to \mu_X(t)$, because the limit of the means is the mean of the limit, for a m.s. convergent sequence (Corollary 2.2.5). Thus, m.s. continuity of $X$ implies that the deterministic mean function, $\mu_X$, is continuous.

Finally, if $X$ is WSS, then $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$, and the three conditions (i)-(iii) become (i')-(iii'), so the equivalence of (i)-(iii) implies the equivalence of (i')-(iii').

Example 7.1.10 Let $X = (X_t : t \in \mathbb{R})$ be defined by $X_t = U$ for $t < 0$ and $X_t = V$ for $t \ge 0$, where $U$ and $V$ are independent random variables with mean zero and variance one. Let $t_n$ be a sequence of strictly negative numbers converging to 0. Then $X_{t_n} = U$ for all $n$ and $X_0 = V$. Since $P\{|U - V| \ge \epsilon\} \ne 0$ for $\epsilon$ small enough, $X_{t_n}$ does not converge to $X_0$ in the p. sense. So $X$ is not continuous in probability at zero. It is thus not continuous in the m.s. or a.s. sense at zero either. The only one of the five senses in which the whole process could be continuous is continuity in distribution. The process $X$ is continuous in distribution if and only if $U$ and $V$ have the same distribution. Finally, let us check the continuity properties of the autocorrelation function. The autocorrelation function is given by $R_X(s, t) = 1$ if either $s, t < 0$ or if $s, t \ge 0$, and $R_X(s, t) = 0$ otherwise. So $R_X$ is not continuous at $(0, 0)$, because $R_X(\frac{1}{n}, -\frac{1}{n}) = 0$ for all $n \ge 1$, so $R_X(\frac{1}{n}, -\frac{1}{n}) \not\to R_X(0, 0) = 1$ as $n \to \infty$. However, it is true that $R_X(t, t) = 1$ for all $t$, so that $R_X(t, t)$ is a continuous function of $t$. This illustrates the fact that continuity of the function of two variables, $R_X(s, t)$, at a particular fixed point $(t_o, t_o)$, is a stronger requirement than continuity of the function of one variable, $R_X(t, t)$, at $t = t_o$.

Example 7.1.11 Let $W = (W_t : t \ge 0)$ be a Brownian motion with parameter $\sigma^2$. Then $E[(W_t - W_s)^2] = \sigma^2 |t - s| \to 0$ as $s \to t$. Therefore $W$ is m.s. continuous. Another way to show $W$ is m.s. continuous is to observe that the autocorrelation function, $R_W(s, t) = \sigma^2 (s \wedge t)$, is continuous. Since $W$ is m.s. continuous, it is also continuous in the p. and d. senses. As we stated in defining $W$, it is a.s. sample-path continuous, and therefore a.s. continuous at each $t \ge 0$, as well.

Example 7.1.12 Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$. Then for fixed $t$, $E[(N_t - N_s)^2] = \lambda |t - s| + (\lambda (t - s))^2 \to 0$ as $s \to t$. Therefore $N$ is m.s. continuous. As required, $R_N$, given by $R_N(s, t) = \lambda (s \wedge t) + \lambda^2 st$, is continuous. Since $N$ is m.s. continuous, it is also continuous in the p. and d. senses. $N$ is also a.s. continuous at any fixed $t$, because the probability of a jump at exactly time $t$ is zero for any fixed $t$. However, $N$ is not a.s. sample-path continuous. In fact, $P[N \text{ is continuous on } [0, a]] = e^{-\lambda a}$ and so $P[N \text{ is continuous on } \mathbb{R}_+] = 0$.
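The two quantitative claims about the Poisson process in Example 7.1.12 can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes only that the increment $N_t - N_s$ is Poisson with mean $\lambda |t - s|$, and the function names (`poisson_second_moment`, `ms_increment`, `prob_no_jump`) are hypothetical helpers introduced for illustration. It sums the Poisson pmf directly to obtain $E[(N_t - N_s)^2]$, compares the result with $\lambda |t - s| + (\lambda (t - s))^2$, and evaluates $P[N \text{ is continuous on } [0, a]] = e^{-\lambda a}$ for growing $a$.

```python
import math

def poisson_second_moment(mu, kmax=200):
    """E[K^2] for K ~ Poisson(mu), summed directly from the pmf."""
    total = 0.0
    pmf = math.exp(-mu)          # P{K = 0}
    for k in range(kmax + 1):
        total += k * k * pmf
        pmf *= mu / (k + 1)      # P{K = k+1} obtained from P{K = k}
    return total

def ms_increment(lam, s, t):
    """E[(N_t - N_s)^2], using that the increment is Poisson(lam * |t - s|)."""
    return poisson_second_moment(lam * abs(t - s))

def prob_no_jump(lam, a):
    """P[N is continuous on [0, a]] = P{N_a = 0} = exp(-lam * a)."""
    return math.exp(-lam * a)

lam, t = 2.0, 1.0
for gap in (0.1, 0.01, 0.001):
    closed_form = lam * gap + (lam * gap) ** 2
    # The mean square increment shrinks to zero as s -> t (m.s. continuity).
    print(gap, ms_increment(lam, t - gap, t), closed_form)

for a in (1.0, 10.0, 100.0):
    # The probability of a jump-free path on [0, a] decays to zero as a grows.
    print(a, prob_no_jump(lam, a))
```

The first loop shows the mean square increment vanishing as the gap shrinks, while the second shows why the sample paths are nonetheless discontinuous with probability one over the whole half line.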
Definition 7.1.13 A random process $(X_t : t \in T)$, such that $T$ is a bounded interval (open, closed, or mixed) in $\mathbb{R}$ with endpoints $a < b$, is piecewise m.s. continuous, if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $X$ is m.s. continuous over $(t_{k-1}, t_k)$ and has m.s. limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $X$ is piecewise m.s. continuous over $T$ if it is piecewise m.s. continuous over every bounded subinterval of $T$.

7.2 Mean square differentiation of random processes

Before considering the m.s. derivative of a random process, we review the definition of the derivative of a function (also, see Appendix 11.4). Let the index set $T$ be either all of $\mathbb{R}$ or an interval in $\mathbb{R}$. Suppose $f$ is a deterministic function on $T$. Recall that for a fixed $t$ in $T$, $f$ is differentiable at $t$ if $\lim_{s \to t} \frac{f(s) - f(t)}{s - t}$ exists and is finite, and if $f$ is differentiable at $t$, the value of the limit is the derivative, $f'(t)$. The whole function $f$ is called differentiable if it is differentiable at all $t$. The function $f$ is called continuously differentiable if $f$ is differentiable, and the derivative function $f'$ is continuous.

In many applications of calculus, it is important that a function $f$ be not only differentiable, but continuously differentiable. In much of the applied literature, when there is an assumption that a function is differentiable, it is understood that the function is continuously differentiable. For example, by the fundamental theorem of calculus,

$$f(b) - f(a) = \int_a^b f'(s)\,ds \qquad (7.2)$$

holds if $f$ is a continuously differentiable function with derivative $f'$. Example 11.4.2 shows that (7.2) might not hold if $f$ is simply assumed to be differentiable.

Let $X = (X_t : t \in T)$ be a second order random process such that the index set $T$ is equal to either all of $\mathbb{R}$ or an interval in $\mathbb{R}$. The following definition for m.s. derivatives is analogous to the definition of derivatives for deterministic functions.

Definition 7.2.1 For each $t$ fixed, the random process $X = (X_t : t \in T)$ is mean square (m.s.) differentiable at $t$ if the following limit exists:

$$\lim_{s \to t} \frac{X_s - X_t}{s - t} \quad \text{m.s.}$$

The limit, if it exists, is the m.s. derivative of $X$ at $t$, denoted by $X'_t$. The whole random process $X$ is said to be m.s. differentiable if it is m.s. differentiable at each $t$, and it is said to be m.s. continuously differentiable if it is m.s. differentiable and the derivative process $X'$ is m.s. continuous.

Let $\partial_i$ denote the operation of taking the partial derivative with respect to the $i$th argument. For example, if $f(x, y) = x^2 y^3$ then $\partial_2 f(x, y) = 3x^2 y^2$ and $\partial_1 \partial_2 f(x, y) = 6xy^2$. The partial derivative of a function is the same as the ordinary derivative with respect to one variable, with the other variables held fixed. We shall be applying $\partial_1$ and $\partial_2$ to an autocorrelation function $R_X = (R_X(s, t) : (s, t) \in T \times T)$, which is a function of two variables.

Proposition 7.2.2 (a) (The derivative of the mean is the mean of the derivative) If $X$ is m.s. differentiable, then the mean function $\mu_X$ is differentiable, and $\mu_X'(t) = \mu_{X'}(t)$. (That is, the operations of (i) taking expectation, which basically involves integrating over $\omega$, and (ii) differentiation with respect to $t$, can be done in either order.)

(b) If $X$ is m.s. differentiable, the cross correlation functions are given by $R_{X'X} = \partial_1 R_X$ and $R_{XX'} = \partial_2 R_X$, and the autocorrelation function of $X'$ is given by $R_{X'} = \partial_1 \partial_2 R_X = \partial_2 \partial_1 R_X$. (In particular, the indicated partial derivatives exist.)

(c) $X$ is m.s. differentiable at $t$ if and only if the following limit exists and is finite:

$$\lim_{s, s' \to t} \frac{R_X(s, s') - R_X(s, t) - R_X(t, s') + R_X(t, t)}{(s - t)(s' - t)}. \qquad (7.3)$$

(Therefore, the whole process $X$ is m.s. differentiable if and only if the limit in (7.3) exists and is finite for all $t \in T$.)

(d) $X$ is m.s. continuously differentiable if and only if $R_X$, $\partial_2 R_X$, and $\partial_1 \partial_2 R_X$ exist and are continuous. (By symmetry, if $X$ is m.s. continuously differentiable, then also $\partial_1 R_X$ is continuous.)

(e) (Specialization of (d) for WSS case) Suppose $X$ is WSS. Then $X$ is m.s. continuously differentiable if and only if $R_X(\tau)$, $R_X'(\tau)$, and $R_X''(\tau)$ exist and are continuous functions of $\tau$. If $X$ is m.s. continuously differentiable then $X$ and $X'$ are jointly WSS, $X'$ has mean zero (i.e. $\mu_{X'} = 0$) and autocorrelation function given by $R_{X'}(\tau) = -R_X''(\tau)$, and the cross correlation functions are given by $R_{X'X}(\tau) = R_X'(\tau)$ and $R_{XX'}(\tau) = -R_X'(\tau)$.

(f) (A necessary condition for m.s. differentiability) If $X$ is WSS and m.s. differentiable, then $R_X'(0)$ exists and $R_X'(0) = 0$.

(g) If $X$ is a m.s. differentiable Gaussian process, then $X$ and its derivative process $X'$ are jointly Gaussian.

Proof. (a) Suppose $X$ is m.s. differentiable. Then for any $t$ fixed,

$$\frac{X_s - X_t}{s - t} \to X'_t \quad \text{m.s. as } s \to t.$$

It thus follows that

$$\frac{\mu_X(s) - \mu_X(t)}{s - t} \to \mu_{X'}(t) \quad \text{as } s \to t, \qquad (7.4)$$

because the limit of the means is the mean of the limit, for a m.s. convergent sequence (Corollary 2.2.5). But (7.4) is just the definition of the statement that the derivative of $\mu_X$ at $t$ is equal to $\mu_{X'}(t)$. That is, $\frac{d\mu_X}{dt}(t) = \mu_{X'}(t)$ for all $t$, or more concisely, $\mu_X' = \mu_{X'}$.

(b) Suppose $X$ is m.s. differentiable. Since the limit of the correlations is the correlation of the limits for m.s. convergent sequences (Corollary 2.2.4), for $t, t' \in T$,

$$R_{X'X}(t, t') = \lim_{s \to t} E\left[ \left( \frac{X(s) - X(t)}{s - t} \right) X(t') \right] = \lim_{s \to t} \frac{R_X(s, t') - R_X(t, t')}{s - t} = \partial_1 R_X(t, t').$$

Thus, $R_{X'X} = \partial_1 R_X$, and in particular, the partial derivative $\partial_1 R_X$ exists. Similarly, $R_{XX'} = \partial_2 R_X$. Also, by the same reasoning,

$$R_{X'}(t, t') = \lim_{s' \to t'} E\left[ X'(t) \left( \frac{X(s') - X(t')}{s' - t'} \right) \right] = \lim_{s' \to t'} \frac{R_{X'X}(t, s') - R_{X'X}(t, t')}{s' - t'} = \partial_2 R_{X'X}(t, t') = \partial_2 \partial_1 R_X(t, t'),$$

so that $R_{X'} = \partial_2 \partial_1 R_X$. Similarly, $R_{X'} = \partial_1 \partial_2 R_X$.

(c) By the correlation form of the Cauchy criterion (Proposition 2.2.3), $X$ is m.s. differentiable at $t$ if and only if the following limit exists and is finite:

$$\lim_{s, s' \to t} E\left[ \left( \frac{X(s) - X(t)}{s - t} \right) \left( \frac{X(s') - X(t)}{s' - t} \right) \right]. \qquad (7.5)$$

Multiplying out the terms in the numerator in the right side of (7.5) and using $E[X(s)X(s')] = R_X(s, s')$, $E[X(s)X(t)] = R_X(s, t)$, and so on, shows that (7.5) is equivalent to (7.3). So part (c) is proved.

(d) The numerator in (7.3) involves $R_X$ evaluated at the four corners of the rectangle $[t, s] \times [t, s']$, shown in Figure 7.2.

[Figure 7.2: Sampling points of $R_X$.]

Suppose $R_X$, $\partial_2 R_X$ and $\partial_1 \partial_2 R_X$ exist and are continuous functions. Then by the fundamental theorem of calculus,

$$(R_X(s, s') - R_X(s, t)) - (R_X(t, s') - R_X(t, t)) = \int_t^{s'} \partial_2 R_X(s, v)\,dv - \int_t^{s'} \partial_2 R_X(t, v)\,dv$$
$$= \int_t^{s'} [\partial_2 R_X(s, v) - \partial_2 R_X(t, v)]\,dv = \int_t^{s'} \int_t^{s} \partial_1 \partial_2 R_X(u, v)\,du\,dv. \qquad (7.6)$$

Therefore, the ratio in (7.3) is the average value of $\partial_1 \partial_2 R_X$ over the rectangle $[t, s] \times [t, s']$. Since $\partial_1 \partial_2 R_X$ is assumed to be continuous, the limit in (7.3) exists and it is equal to $\partial_1 \partial_2 R_X(t, t)$. Therefore, by part (c) already proved, $X$ is m.s. differentiable. By part (b), the autocorrelation function of $X'$ is $\partial_1 \partial_2 R_X$. Since this is assumed to be continuous, it follows that $X'$ is m.s. continuous. Thus, $X$ is m.s. continuously differentiable.

(e) If $X$ is WSS, then $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$. Suppose $R_X(\tau)$, $R_X'(\tau)$ and $R_X''(\tau)$ exist and are continuous functions of $\tau$. Then

$$\partial_1 R_X(s, t) = R_X'(\tau) \quad \text{and} \quad \partial_2 \partial_1 R_X(s, t) = -R_X''(\tau). \qquad (7.7)$$

The minus sign in (7.7) appears because $R_X(s, t) = R_X(\tau)$ where $\tau = s - t$, and the derivative of $\tau$ with respect to $t$ is $-1$. So, the hypotheses of part (d) hold, so that $X$ is m.s. differentiable. Since $X$ is WSS, its mean function $\mu_X$ is constant, which has derivative zero, so $X'$ has mean zero. Also by part (b) and (7.7), $R_{X'X}(\tau) = R_X'(\tau)$ and $R_{X'} = -R_X''$. Similarly, $R_{XX'}(\tau) = -R_X'(\tau)$. Note that $X$ and $X'$ are each WSS and the cross correlation functions depend on $\tau$ alone, so $X$ and $X'$ are jointly WSS.

(f) If $X$ is WSS then

$$E\left[ \left( \frac{X(t) - X(0)}{t} \right)^2 \right] = -\frac{2 (R_X(t) - R_X(0))}{t^2}. \qquad (7.8)$$

Therefore, if $X$ is m.s. differentiable then the right side of (7.8) must converge to a finite limit as $t \to 0$, so in particular it is necessary that $(R_X(t) - R_X(0))/t \to 0$ as $t \to 0$. Therefore $R_X'(0) = 0$.

(g) The derivative process $X'$ is obtained by taking linear combinations and m.s. limits of random variables in $X = (X_t : t \in T)$. Therefore, (g) follows from the fact that the joint Gaussian property is preserved under linear combinations and limits (Proposition 3.4.3(c)).

Example 7.2.3 Let $f(t) = t^2 \sin(1/t^2)$ for $t \ne 0$ and $f(0) = 0$ as in Example 11.4.2, and let $X = (X_t : t \in \mathbb{R})$ be the deterministic random process such that $X(t) = f(t)$ for all $t \in \mathbb{R}$. Since $X$ is differentiable as an ordinary function, it is also m.s. differentiable, and its m.s. derivative $X'$ is equal to $f'$. Since $X'$, as a deterministic function, is not continuous at zero, it is also not continuous at zero in the m.s. sense. We have $R_X(s, t) = f(s)f(t)$ and $\partial_2 R_X(s, t) = f(s)f'(t)$, which is not continuous. So indeed the conditions of Proposition 7.2.2(d) do not hold, as required.
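The necessary condition in Proposition 7.2.2(f) rests on the identity $E[((X(t) - X(0))/t)^2] = 2(R_X(0) - R_X(t))/t^2$ for a WSS process, which lends itself to a quick numerical check. The sketch below is an illustration added here, not part of the original notes. It assumes two autocorrelation functions chosen for illustration, $R_X(\tau) = 1/(1 + \tau^2)$ and $R_X(\tau) = e^{-\alpha|\tau|}$ with $\alpha = 1$, and the helper names are hypothetical. For the first, the second moment of the difference quotient settles near $-R_X''(0) = 2$; for the second, it grows roughly like $2\alpha/t$, so no finite m.s. limit can exist.

```python
import math

def second_moment_of_quotient(R, t):
    """E[((X_t - X_0)/t)^2] = 2 (R(0) - R(t)) / t^2 for a real WSS process."""
    return 2.0 * (R(0.0) - R(t)) / (t * t)

def R_smooth(tau):
    """Illustrative autocorrelation R_X(tau) = 1 / (1 + tau^2)."""
    return 1.0 / (1.0 + tau * tau)

def R_exp(tau, alpha=1.0):
    """Illustrative autocorrelation R_X(tau) = exp(-alpha |tau|)."""
    return math.exp(-alpha * abs(tau))

for t in (1e-1, 1e-2, 1e-3):
    # First column stays bounded (tends to 2); second column blows up like 2/t.
    print(t, second_moment_of_quotient(R_smooth, t),
          second_moment_of_quotient(R_exp, t))
```

A bounded second moment of the difference quotient is only a necessary sign of m.s. differentiability; the full criterion is the existence of the limit (7.3) in part (c).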
(f ) If X is WSS then E X(t) − X(0) t 2 = − 2(RX (t) − RX (0)) t2 (7. W is not m. we could appeal to Proposition 7. Since this is assumed to be continuous. then RX (s − t) = RX (τ ) where τ = s − t. Similarly. If it were.2. Since X is WSS.7) appears because RX (s. the hypotheses of part (d) hold. Example 7.7) The minus sign in (7.2.5). or equivalently (7.s.2.2. X(s)f (s) − X(t)f (t) s−t = m.s.3). integration is that it relies much less on properties of sample paths of random processes. differentiable. such that. and its derivative process X is WSS with mean 0 and covariance function RX (τ ) = − 1 1 + τ2 = 2 − 6τ 2 . sense. is not a finite limit. then a random process X = (Xt : t ∈ T) is continuous and piecewise continuously differentiable in the m.3.s. differentiable because RX (0) does not exist. namely mean square (m.s. if T is all of R or a subinterval of R.s.s. if X is m.5 Suppose X is a m. Hence (7. INTEGRATION OF RANDOM PROCESSES 215 restricted to be equal. closed. b]. Definition 7. continuously differentiable over (tk−1 .s.s. tk ) and X has finite limits at the endpoints of (tk−1 . continuous over the interval.s. Then for each s = t. differentiable random process and f is a differentiable function. A WSS process X with RX (τ ) = 1 e−α|τ | is not m.s. Riemann integrals are based on Riemann sums. and so the integral can be defined as the integral of a deterministic function for each ω. differentiable at any t. We shall focus on another approach.10) One approach is to note that for each fixed ω. Then the product Xf = (X(t)f (t) : t ∈ R) is mean square differentiable and (Xf ) = X f + Xf .) integration.3 Integration of random processes Let X = (Xt : a ≤ t ≤ b) be a random process and let h be a function on a finite interval [a. (1 + τ 2 )3 Proposition 7. or mixed) with endpoints a < b is continuous and piecewise continuously differentiable in the m.s. An advantage of m. for 1 ≤ k ≤ n: X is m. Xt (ω) is a deterministic function of time. 
How shall we define the following integral? b a Xt h(t)dt. defined as follows.s. → (X(s) − X(t))f (s) X(t)(f (s) − f (t)) + s−t s−t X (t)f (t) + X(t)f (t) as s → t. Proof: Fix t.s. 7. More generally. Similarly. tk ). and if there exists n ≥ 1 and a = t0 < t1 < · · · < tn = b. implying that W is not differentiable at t.7. (7.5). Given: . As for integration of deterministic functions. sense if its restriction to any bounded interval is continuous and piecewise continuously differentiable in the m. sense. a Poisson process is not m.6 A random process X on a bounded interval (open. A WSS process X with RX (τ ) = 1+τ 2 is m. the m. b] and h is piecewise continuous over [a. . If Xt h(t) is m.s. tm . k=1 The norm of the partition is defined to be maxk |tk − tk−1 |. b jointly converge to +∞.s. This definition is equivalent to the following condition. b]. with corresponding sampling points ((v1 .5.s. If the sampling points for the Riemann sums are required to be horizontally and vertically alligned. t2 ].1 The Riemann integral a Xt h(t)dt is said to exist in the m.s. The condition involves Riemann integration of a deterministic function of two variables. expressed using convergence of sequences. . tk ].11) b b a a RX (s. Riemann integrable over every bounded interval [a. tm ) : m ≥ 1). BASIC CALCULUS OF RANDOM PROCESSES • A partition of (a. for 1 ≤ k ≤ n. The m. in particular. t1 ]. if X is m. piecewise continuous over [a. . sense and its value is the random variable I if the following is true. (t1 . b].s. provided that the indicated limit exist as a. a two-dimensional Riemann integral over a bounded rectangle is defined as the limit of Riemann sums corresponding to a partition of the rectangle into subrectangles and choices of sampling points within the subrectangles. there is a δ > 0 so that E[( n Xvk h(vk )(tk − tk−1 ) − I)2 ] ≤ whenever the norm of the partition is less than or equal to k=1 δ. . 
Whether an integral exists in the m.s. sense is determined by the autocorrelation function of the random process involved, as shown next. The condition involves Riemann integration of a deterministic function of two variables. As reviewed in Appendix 11.5, a two-dimensional Riemann integral over a bounded rectangle is defined as the limit of Riemann sums corresponding to a partition of the rectangle into subrectangles and choices of sampling points within the subrectangles. If the sampling points for the Riemann sums are required to be horizontally and vertically aligned, then we say the two-dimensional Riemann integral exists with aligned sampling.

Proposition 7.3.2 The integral $\int_a^b X_t h(t)\,dt$ exists in the m.s. Riemann sense if and only if
\[ \int_a^b \int_a^b R_X(s,t) h(s) h(t)\,ds\,dt \tag{7.11} \]
exists as a two-dimensional Riemann integral with aligned sampling. The m.s. integral exists, in particular, if $X$ is m.s. piecewise continuous over $[a, b]$ and $h$ is piecewise continuous over $[a, b]$.

Proof. By definition, the m.s. integral of $X_t h(t)$ exists if and only if the Riemann sums converge in the m.s. sense for an arbitrary sequence of partitions and sampling points, such that the norms of the partitions converge to zero. So consider an arbitrary sequence of partitions of $(a, b]$ into intervals specified by the collection of endpoints, $((t_0^m, t_1^m, \ldots, t_{n_m}^m) : m \ge 1)$, with corresponding sampling point $v_k^m \in (t_{k-1}^m, t_k^m]$ for each $m$ and $1 \le k \le n_m$, such that the norm of the $m$th partition converges to zero as $m \to \infty$. For each $m \ge 1$, let $S_m$ denote the corresponding Riemann sum:
\[ S_m = \sum_{k=1}^{n_m} X_{v_k^m} h(v_k^m)(t_k^m - t_{k-1}^m). \]
By the correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), $(S_m : m \ge 1)$ converges in the m.s. sense if and only if $\lim_{m, m' \to \infty} E[S_m S_{m'}]$ exists and is finite. Now
\[ E[S_m S_{m'}] = \sum_{j=1}^{n_m} \sum_{k=1}^{n_{m'}} R_X(v_j^m, v_k^{m'})\, h(v_j^m) h(v_k^{m'})\, (t_j^m - t_{j-1}^m)(t_k^{m'} - t_{k-1}^{m'}), \tag{7.12} \]
and the right-hand side of (7.12) is the Riemann sum for the integral (7.11), for the partition of $(a, b] \times (a, b]$ into rectangles of the form $(t_{j-1}^m, t_j^m] \times (t_{k-1}^{m'}, t_k^{m'}]$ and the sampling points $(v_j^m, v_k^{m'})$. Note that the $n_m n_{m'}$ sampling points are aligned, in that they are determined by the $n_m + n_{m'}$ numbers $v_1^m, \ldots, v_{n_m}^m, v_1^{m'}, \ldots, v_{n_{m'}}^{m'}$. Moreover, any Riemann sum for the integral (7.11) with aligned sampling can arise in this way. Further, as $m, m' \to \infty$, the norm of this partition, which is the maximum length or width of any rectangle of the partition, converges to zero. Thus, the limit $\lim_{m, m' \to \infty} E[S_m S_{m'}]$ exists for any sequence of partitions and sampling points if and only if the integral (7.11) exists as a two-dimensional Riemann integral with aligned sampling.

Finally, if $X$ is piecewise m.s. continuous over $[a, b]$ and $h$ is piecewise continuous over $[a, b]$, then there is a partition of $[a, b]$ into intervals of the form $(s_{k-1}, s_k]$ such that $X$ is m.s. continuous over $(s_{k-1}, s_k)$ with m.s. limits at the endpoints, and $h$ is continuous over $(s_{k-1}, s_k)$ with finite limits at the endpoints. Therefore, $R_X(s,t)h(s)h(t)$ restricted to each rectangle of the form $(s_{j-1}, s_j) \times (s_{k-1}, s_k)$ is the restriction of a continuous function on $[s_{j-1}, s_j] \times [s_{k-1}, s_k]$. Thus $R_X(s,t)h(s)h(t)$ is Riemann integrable over $[a, b] \times [a, b]$.

Proposition 7.3.3 Suppose $X_t h(t)$ and $Y_t k(t)$ are both m.s. integrable over $[a, b]$. Then
\[ E\left[\int_a^b X_t h(t)\,dt\right] = \int_a^b \mu_X(t) h(t)\,dt \tag{7.13} \]
\[ E\left[\left(\int_a^b X_t h(t)\,dt\right)^2\right] = \int_a^b \int_a^b R_X(s,t) h(s) h(t)\,ds\,dt \tag{7.14} \]
\[ \mathrm{Var}\left(\int_a^b X_t h(t)\,dt\right) = \int_a^b \int_a^b C_X(s,t) h(s) h(t)\,ds\,dt \tag{7.15} \]
\[ E\left[\left(\int_a^b X_s h(s)\,ds\right)\left(\int_a^b Y_t k(t)\,dt\right)\right] = \int_a^b \int_a^b R_{XY}(s,t) h(s) k(t)\,ds\,dt \tag{7.16} \]
\[ \mathrm{Cov}\left(\int_a^b X_s h(s)\,ds,\ \int_a^b Y_t k(t)\,dt\right) = \int_a^b \int_a^b C_{XY}(s,t) h(s) k(t)\,ds\,dt \tag{7.17} \]
\[ \int_a^b \left(X_t h(t) + Y_t k(t)\right)dt = \int_a^b X_t h(t)\,dt + \int_a^b Y_t k(t)\,dt \tag{7.18} \]

Proof. Let $(S_m)$ denote the sequence of Riemann sums appearing in the proof of Proposition 7.3.2. Since the mean of a m.s. convergent sequence of random variables is the limit of the means (Corollary 2.2.5),
\[ E\left[\int_a^b X_t h(t)\,dt\right] = \lim_{m \to \infty} E[S_m] = \lim_{m \to \infty} \sum_{k=1}^{n_m} \mu_X(v_k^m) h(v_k^m)(t_k^m - t_{k-1}^m). \tag{7.19} \]
The right-hand side of (7.19) is a limit of Riemann sums for the integral $\int_a^b \mu_X(t) h(t)\,dt$. Since this limit exists and is equal to $E\left[\int_a^b X_t h(t)\,dt\right]$ for any sequence of partitions and sample points, it follows that $\int_a^b \mu_X(t) h(t)\,dt$ exists as a Riemann integral, and is equal to $E\left[\int_a^b X_t h(t)\,dt\right]$, so (7.13) is proved.

The second moment of the m.s. limit of $(S_m : m \ge 0)$ is equal to $\lim_{m, m' \to \infty} E[S_m S_{m'}]$, by the correlation form of the Cauchy criterion for m.s. convergence (Proposition 2.2.3), which implies (7.14). It follows from (7.13) that
\[ \left(E\left[\int_a^b X_t h(t)\,dt\right]\right)^2 = \int_a^b \int_a^b \mu_X(s)\mu_X(t) h(s) h(t)\,ds\,dt. \]
Subtracting each side of this from the corresponding side of (7.14) yields (7.15). The proofs of (7.16) and (7.17) are similar to the proofs of (7.14) and (7.15), and are left to the reader.

For any partition of $[a, b]$ and choice of sampling points, the Riemann sums for the three integrals appearing in (7.18) satisfy the corresponding additivity condition, implying (7.18).
The fundamental theorem of calculus, stated in Appendix 11.5, states that the increments of a continuous, piecewise continuously differentiable function are equal to integrals of the derivative of the function. The following is the generalization of the fundamental theorem of calculus to the m.s. calculus.

Theorem 7.3.4 (Fundamental Theorem of m.s. Calculus) Let $X$ be a m.s. continuously differentiable random process. Then for $a < b$,
\[ X_b - X_a = \int_a^b X_t'\,dt \quad \text{(m.s. Riemann integral)} \tag{7.20} \]
More generally, if $X$ is continuous and piecewise continuously differentiable, (7.20) holds with $X_t'$ replaced by the right-hand derivative, $D_+X_t$. (Note that $D_+X_t = X_t'$ whenever $X_t'$ is defined.)

Proof. The m.s. Riemann integral in (7.20) exists because $X'$ is assumed to be m.s. continuous. Let $B = X_b - X_a - \int_a^b X_t'\,dt$, and let $Y$ be an arbitrary random variable with a finite second moment. It suffices to show that $E[YB] = 0$, because a possible choice of $Y$ is $B$ itself. Let $\varphi(t) = E[Y X_t]$. Then for $s \neq t$,
\[ \frac{\varphi(s) - \varphi(t)}{s - t} = E\left[Y\,\frac{X_s - X_t}{s - t}\right]. \]
Taking a limit as $s \to t$ and using the fact that the correlation of a limit is the limit of the correlations for m.s. convergent sequences, it follows that $\varphi$ is differentiable and $\varphi'(t) = E[Y X_t']$. Since $X'$ is m.s. continuous, it similarly follows that $\varphi'$ is continuous.

Next, we use the fact that the integral in (7.20) is the m.s. limit of Riemann sums, with each Riemann sum corresponding to a partition of $(a, b]$ specified by some $n \ge 1$ and $a = t_0 < \cdots < t_n = b$ and sampling points $v_k \in (t_{k-1}, t_k]$ for $1 \le k \le n$. Since the limit of the correlation is the correlation of the limit for m.s. convergence,
\[ E\left[Y \int_a^b X_t'\,dt\right] = \lim_{|t_k - t_{k-1}| \to 0} E\left[Y \sum_{k=1}^{n} X_{v_k}'(t_k - t_{k-1})\right] = \lim_{|t_k - t_{k-1}| \to 0} \sum_{k=1}^{n} \varphi'(v_k)(t_k - t_{k-1}) = \int_a^b \varphi'(t)\,dt. \]
Therefore, $E[YB] = \varphi(b) - \varphi(a) - \int_a^b \varphi'(t)\,dt$, which is equal to zero by the fundamental theorem of calculus for deterministic continuously differentiable functions. This establishes (7.20) in case $X$ is m.s. continuously differentiable. If $X$ is m.s. continuous and only piecewise continuously differentiable, we can use essentially the same proof, observing that $\varphi$ is continuous and piecewise continuously differentiable, so that $E[YB] = \varphi(b) - \varphi(a) - \int_a^b \varphi'(t)\,dt = 0$ by the fundamental theorem of calculus for deterministic continuous, piecewise continuously differentiable functions.

Proposition 7.3.5 Suppose $X$ is a Gaussian random process. Then $X$, together with all mean square derivatives of $X$ that exist, and all m.s. Riemann integrals of $X$ of the form $I(a, b) = \int_a^b X_t h(t)\,dt$ that exist, are jointly Gaussian.

Proof. The m.s. derivatives and integrals of $X$ are obtained by taking m.s. limits of linear combinations of $X = (X_t : t \in \mathbb{T})$. Therefore, the proposition follows from the fact that the joint Gaussian property is preserved under linear combinations and limits (Proposition 3.4.3(c)).

Theoretical Exercise Suppose $X = (X_t : t \ge 0)$ is a random process such that $R_X$ is continuous. Let $Y_t = \int_0^t X_s\,ds$. Show that $Y$ is m.s. differentiable, and $P[Y_t' = X_t] = 1$ for $t \ge 0$.

Example 7.3.6 Let $(W_t : t \ge 0)$ be a Brownian motion with $\sigma^2 = 1$, and let $X_t = \int_0^t W_s\,ds$ for $t \ge 0$. Let us find $R_X$ and $P[|X_t| \ge t]$ for $t > 0$. Since $R_W(u, v) = u \wedge v$,
\[ R_X(s,t) = E\left[\int_0^s W_u\,du \int_0^t W_v\,dv\right] = \int_0^s \int_0^t (u \wedge v)\,dv\,du. \]
To proceed, consider first the case $s \ge t$ and partition the region of integration into three parts as shown in Figure 7.3. The contributions from the two triangular subregions are the same, so
\[ R_X(s,t) = 2\int_0^t \int_0^u v\,dv\,du + \int_t^s \int_0^t v\,dv\,du = \frac{t^3}{3} + \frac{t^2(s - t)}{2} = \frac{t^2 s}{2} - \frac{t^3}{6}. \]

[Figure 7.3: Partition of region of integration (axes $u$ and $v$; triangular regions $u < v$ and $u > v$).]

Still assuming that $s \ge t$, this expression can be rewritten as
\[ R_X(s,t) = \frac{st(s \wedge t)}{2} - \frac{(s \wedge t)^3}{6}. \tag{7.21} \]
Although we have found (7.21) only for $s \ge t$, both sides are symmetric in $s$ and $t$. Thus (7.21) holds for all $s, t$.

Since $W$ is a Gaussian process, $X$ is a Gaussian process. Also, $E[X_t] = 0$ (because $W$ is mean zero) and $E[X_t^2] = R_X(t,t) = \frac{t^3}{3}$. Thus,
\[ P[|X_t| \ge t] = 2P\left[\frac{X_t}{\sqrt{t^3/3}} \ge \frac{t}{\sqrt{t^3/3}}\right] = 2Q\left(\sqrt{\frac{3}{t}}\right). \]
Note that $P[|X_t| \ge t] \to 1$ as $t \to +\infty$.

Example 7.3.7 Let $N = (N_t : t \ge 0)$ be a second order process with a continuous autocorrelation function $R_N$ and let $x_0$ be a constant. Consider the problem of finding a m.s. differentiable random process $X = (X_t : t \ge 0)$ satisfying the linear differential equation
\[ X_t' = -X_t + N_t, \qquad X_0 = x_0. \tag{7.22} \]
Guided by the case that $N_t$ is a smooth nonrandom function, we write
\[ X_t = x_0 e^{-t} + \int_0^t e^{-(t-v)} N_v\,dv \tag{7.23} \]
or
\[ X_t = x_0 e^{-t} + e^{-t} \int_0^t e^{v} N_v\,dv. \tag{7.24} \]
Using Proposition 7.2.5, it is not difficult to check that (7.24) indeed gives the solution to (7.22).

Next, let us find the mean and autocovariance functions of $X$ in terms of those of $N$. Taking the expectation on each side of (7.23) yields
\[ \mu_X(t) = x_0 e^{-t} + \int_0^t e^{-(t-v)} \mu_N(v)\,dv. \tag{7.25} \]
A different way to derive (7.25) is to take expectations in (7.22) to yield the deterministic linear differential equation:
\[ \mu_X'(t) = -\mu_X(t) + \mu_N(t); \qquad \mu_X(0) = x_0, \]
which can be solved to yield (7.25). To summarize, we found two methods to start with the stochastic differential equation (7.22) to derive (7.25), thereby expressing the mean function of the solution $X$ in terms of the mean function of the driving process $N$. The first is to solve (7.22) to obtain (7.23) and then take expectations; the second is to take expectations first and then solve the deterministic differential equation for $\mu_X$.

The same two methods can be used to express the covariance function of $X$ in terms of the covariance function of $N$. For the first method, we use (7.23) to obtain
\[ C_X(s,t) = \mathrm{Cov}\left(x_0 e^{-s} + \int_0^s e^{-(s-u)} N_u\,du,\ x_0 e^{-t} + \int_0^t e^{-(t-v)} N_v\,dv\right) = \int_0^s \int_0^t e^{-(s-u)} e^{-(t-v)} C_N(u,v)\,dv\,du. \tag{7.26} \]
The second method is to derive deterministic differential equations. To begin, note that
\[ \partial_1 C_X(s,t) = \mathrm{Cov}(X_s', X_t) = \mathrm{Cov}(-X_s + N_s, X_t) \]
so
\[ \partial_1 C_X(s,t) = -C_X(s,t) + C_{NX}(s,t). \tag{7.27} \]
For $t$ fixed, this is a differential equation in $s$. Also, $C_X(0,t) = 0$. If somehow the cross covariance function $C_{NX}$ is found, (7.27) and the boundary condition $C_X(0,t) = 0$ can be used to find $C_X$. So we turn next to finding a differential equation for $C_{NX}$:
\[ \partial_2 C_{NX}(s,t) = \mathrm{Cov}(N_s, X_t') = \mathrm{Cov}(N_s, -X_t + N_t) \]
so
\[ \partial_2 C_{NX}(s,t) = -C_{NX}(s,t) + C_N(s,t). \tag{7.28} \]
For $s$ fixed, this is a differential equation in $t$ with initial condition $C_{NX}(s,0) = 0$. Solving (7.28) yields
\[ C_{NX}(s,t) = \int_0^t e^{-(t-v)} C_N(s,v)\,dv. \tag{7.29} \]
Using (7.29) to replace $C_{NX}$ in (7.27) and solving (7.27) yields (7.26).
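The closed form (7.21) for the integrated Brownian motion of Example 7.3.6 can be sanity-checked numerically. The sketch below is only an illustration, not part of the notes: the function names and the midpoint-rule grid size are assumptions chosen here. It approximates the double integral of $u \wedge v$ directly and evaluates $P[|X_t| \ge t] = 2Q(\sqrt{3/t})$ using the standard library's complementary error function.

```python
import math

def r_x_closed_form(s, t):
    """Closed form (7.21): s*t*min(s,t)/2 - min(s,t)**3/6."""
    m = min(s, t)
    return s * t * m / 2.0 - m ** 3 / 6.0

def r_x_numeric(s, t, n=400):
    """Midpoint-rule approximation of the double integral of min(u, v)
    over [0, s] x [0, t], where min(u, v) is R_W(u, v) for sigma^2 = 1."""
    du, dv = s / n, t / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        for j in range(n):
            v = (j + 0.5) * dv
            total += min(u, v)
    return total * du * dv

def q_function(x):
    """Standard normal tail probability Q(x) = P[Z >= x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_probability(t):
    """P[|X_t| >= t] = 2 Q(sqrt(3 / t)) for t > 0."""
    return 2.0 * q_function(math.sqrt(3.0 / t))

if __name__ == "__main__":
    for s, t in [(2.0, 1.0), (1.0, 2.0), (1.5, 1.5)]:
        print(s, t, r_x_closed_form(s, t), r_x_numeric(s, t))
    for t in [1.0, 10.0, 1000.0]:
        print(t, tail_probability(t))
```

The two columns printed for $R_X$ should agree to several digits, and the tail probability should increase toward one as $t$ grows, consistent with the remark at the end of the example.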
t) + CN X (s. Xt ) For t fixed. we use (7. Xt ) = so ∂1 CX (s. Ergodicity generally means that certain time averages are asymptotically equal to certain statistical averages. t). The mean µX is defined as a statistical average: µX = E[Xt ] for any t ∈ R.222 CHAPTER 7.28) yields t CN X (s. t) = −CN X (s.26) = 0 e−(s−u) e−(t−v) CN (u. To begin. If somehow the cross covariance function CN X is found. t) = 0 e−(t−v) CN (s.s. The second method is to derive deterministic differential equations.27) Cov (−Xs + Ns . v)dv. suppose X = (Xt : t ∈ R) is WSS and m. 7. 0) = 0. . t) + CN (s. Solving (7. t) = Cov (Xs .27) and the boundary condition CX (0. (7. t) = Cov s t 0 x0 e−s + 0 s e−(s−u) Nu du. s. if CX (n) = 1 for all n. ergodic in the m.s. CX . if X = (Xn : n ∈ Z) is WSS with CX (n) = I{n=0} (so that the Xi ’s are uncorrelated) then (7. Proof. A discrete time WSS random process X is similarly called mean ergodic (in the m. sense) if and only if 2 t→∞ t lim Sufficient conditions are (a) limτ →∞ CX (τ ) = 0. the time average is a random variable. (7.s. the process X is not ergodic if CX (n) = 1 for all n. convergence. version of the law of large numbers.s. (This condition is also necessary if limτ →∞ CX (τ ) exists. whether X is m. (7.31) < +∞.7. by the m. sense is determined by the autocovariance function.s. it means that X0 has variance one and P {Xk = X0 } = 1 for all k (because equality holds in the Schwarz inequality: CX (n) ≤ CX (0)). (7.4. Then X is mean ergodic (in the m. k=1 Since X0 has variance one. and is typically not equal to the statistical average µX . By the definition of m. (d) ∞ −∞ |RX (τ )|dτ < +∞. (c) limτ →∞ RX (τ ) = 0. continuous random process.s. for t fixed. X is mean ergodic if and only if lim E 1 t t 2 t→∞ Xu du − µX 0 = 0. In general.1 Let X be a real-valued. Then for all n ≥ 1.s.30) For example. sense) if 1 n→∞ n lim n Xi = µX i=1 m.s. The random process X is called mean ergodic (in the m.4. and the discrete-time version is true as well. 
Proposition 7.s.s. m. For another example. ERGODICITY 223 Of course. sense) if 1 t→∞ t lim t Xu du = µX 0 m. 1 n n Xk = X0 . The result is stated and proved next for continuous time.30) is true.) (b) ∞ −∞ |CX (τ )|dτ t 0 t−τ t CX (τ )dτ = 0. WSS.32) . The integral in (7. and both can be viewed as a weighted average of CX (τ ). the average of CX (u − v) over the square will be small.34) 0 t CX (τ )dτ dv −v t−τ CX (τ )dvdτ + 0 −t −τ CX (τ )dvdτ (7. For the remainder of the proof. it is important to keep in mind that the integral in (7.33). (7. BASIC CALCULUS OF RANDOM PROCESSES 1 t t 0 Xu du Since E 1 0 Xu du = 1 0 µX du = µX . where for v fixed the variable τ = u − v was introduced.224 t t CHAPTER 7. Indeed.s. We claim the left side of (7.31) is equivalent to the integral in (7. given ε > 0 there exists L > 0 so that |CX (τ ) − c| ≤ ε . 0 1 t t Xv dv 0 CX (u − v)dudv t−v (7. with a triangular weighting function.4. v) is small for u − v larger than some constant.31) is equal to c.33) (7.34) and (7.32) is the same as Var t t the properties of m. integrals. if CX (u. This establishes the first ! t t v !t Figure 7. Suppose CX (τ ) → c as τ → ∞. statement of the proposition. and we use the fact that in both (7. t]. By Xu du 0 = Cov = = = = = 1 t2 1 t2 1 t2 1 t 2 t t 0 t 0 t 0 t −t t 0 1 t t 0 t Xu du. if t is large.35). the pair (v. The function CX (u − v) is equal to CX (0) along the diagonal of the square.4: Region of integration for (7. τ ) ranges over the region pictured in Figure 7. and the magnitude of the function is bounded by CX (0) everywhere in the square. Thus.35).33) is simply the average of CX (u − v) over the square [0. Var 1 t t → 0 as t → ∞.34) and (7.35) t − |τ | t t−τ t CX (τ )dτ CX (τ )dτ. t] × [0. It remains to prove the assertions regarding (a)-(d). it is nonnegative. and CX (τ ) = RX (τ ) − µ2 . Hence if limτ →∞ CX (τ ) = c. (7. Let X = (Xt : t ∈ R) be defined by Xt = A cos(2πfc t + Θ).2 Let fc be a nonzero constant. (7. Since the integral in (7. 
X 0 ≤ 2 t t 0 t−τ t CX (τ )dt = −µ2 + X 2 t t 0 t−τ t RX (τ )dτ → −µ2 as t → ∞. In addition. By the same arguments applied to CX for parts (a) and (b). X Thus.31) holds. Suppose condition (b) holds.31) is equal to c. sin(Θ). it follows that 2 t t 0 t−τ t RX (τ )dτ → 0 as t → ∞. For 0 ≤ τ ≤ L we can use the Schwarz inequality to bound CX (τ ). and sin(2Θ) have mean zero. so that X is mean ergodic in the m. ERGODICITY 225 whenever τ ≥ L.4. (c) and (d) each imply (7.31). as claimed. we see that conditions (c) and (d) also each imply that µX = 0.7. cos(2Θ). Therefore.31) holds if and only if c = 0. Suppose either condition (c) or condition (d) holds. Then X is WSS with . namely |CX (τ )| ≤ CX (0). Also.s.31) is the variance of a random variable. Then 2 t t 0 t−τ t CX (τ )dτ ≤ ≤ 2 t 1 t t |CX (τ )|dτ 0 ∞ |CX (τ )|dτ → 0 as t → ∞ −∞ so that (7. sense. let Θ be a random variable such that cos(Θ). and let A be a random variable independent of Θ such that E[A2 ] < +∞. Therefore for t ≥ L. It remains to prove that (b). 2 t t 0 t−τ t CX (τ )dτ − c = ≤ ≤ 2 t 2 t t 0 t 0 t−τ (CX (τ ) − c) dτ t t−τ |CX (τ ) − c| dτ t 2ε t t − τ 2 L (CX (0) + |c|) dτ + dτ t 0 t L t 2L (CX (0) + |c|) 2ε t t − τ 2L (CX (0) + |c|) ≤ + dτ = +ε t L 0 t t ≤ 2ε for t large enough Thus the left side of (7.31) holds. the integral is a weighted average of CX (τ ). Example 7.4. Consider a random process (Wk : k ∈ Z) formed as follows. Clearly µW = E[Wk ] = E[Wk |S = 0]P [S = 0] + E[Wk | S = 1]P [S = 1] = 3 1 1 1 · + · 4 2 4 2 = 1 . BASIC CALCULUS OF RANDOM PROCESSES 2 µX = 0 and RX (τ ) = CX (τ ) = E[A ] cos(2πfc τ ) . Then the selected coin is used to generate the Wk ’s — the other coin is not used at all. πfc t Example 7. The variable S can be thought of as a switch state.5. Clearly W is stationary.4.5: A composite binary source. and hence also WSS. each with a zero on one side and a one on the other. the student selects one of the coins. 2 . Condition (7. 
Example 7.4.3 (Composite binary source) A student has two biased coins, each with a zero on one side and a one on the other. Whenever the first coin is flipped the outcome is a one with probability $\frac{3}{4}$. Whenever the second coin is flipped the outcome is a one with probability $\frac{1}{4}$. Consider a random process $(W_k : k \in \mathbb{Z})$ formed as follows. First, the student selects one of the coins, each coin being selected with equal probability. Then the selected coin is used to generate the $W_k$'s; the other coin is not used at all.

This scenario can be modelled as in Figure 7.5, using the following random variables:

[Figure 7.5: A composite binary source (a switch $S$ selects $U_k$ when $S = 0$ and $V_k$ when $S = 1$ to produce $W_k$).]

• $(U_k : k \in \mathbb{Z})$ are independent $Be\left(\frac{3}{4}\right)$ random variables
• $(V_k : k \in \mathbb{Z})$ are independent $Be\left(\frac{1}{4}\right)$ random variables
• $S$ is a $Be\left(\frac{1}{2}\right)$ random variable
• The above random variables are all independent
• $W_k = (1 - S)U_k + SV_k$.

The variable $S$ can be thought of as a switch state. Value $S = 0$ corresponds to using the coin with probability of heads equal to $\frac{3}{4}$ for each flip.

Clearly $W$ is stationary, and hence also WSS. Is $W$ mean ergodic? One approach to answering this is the direct one. Clearly
\[ \mu_W = E[W_k] = E[W_k \mid S = 0]P[S = 0] + E[W_k \mid S = 1]P[S = 1] = \frac{3}{4}\cdot\frac{1}{2} + \frac{1}{4}\cdot\frac{1}{2} = \frac{1}{2}. \]
So the question is whether
\[ \frac{1}{n}\sum_{k=1}^{n} W_k \;\overset{?}{\to}\; \frac{1}{2} \quad \text{m.s.} \]
But by the strong law of large numbers
\[ \frac{1}{n}\sum_{k=1}^{n} W_k = \frac{1}{n}\sum_{k=1}^{n} \left((1 - S)U_k + SV_k\right) = (1 - S)\left(\frac{1}{n}\sum_{k=1}^{n} U_k\right) + S\left(\frac{1}{n}\sum_{k=1}^{n} V_k\right) \;\to\; (1 - S)\frac{3}{4} + S\frac{1}{4} = \frac{3}{4} - \frac{S}{2} \quad \text{m.s.} \]
Thus, the limit is a random variable, rather than the constant $\frac{1}{2}$. Intuitively, the process $W$ has such strong memory due to the switch mechanism that even averaging over long time intervals does not diminish the randomness due to the switch.

Another way to show that $W$ is not mean ergodic is to find the covariance function $C_W$ and use the necessary and sufficient condition (7.31) for mean ergodicity. Note that for $k$ fixed, $W_k^2 = W_k$ with probability one, so $E[W_k^2] = \frac{1}{2}$. If $k \neq l$, then
\[ E[W_k W_l] = E[W_k W_l \mid S = 0]P[S = 0] + E[W_k W_l \mid S = 1]P[S = 1] = \frac{1}{2}E[U_k U_l] + \frac{1}{2}E[V_k V_l] = \frac{1}{2}E[U_k]E[U_l] + \frac{1}{2}E[V_k]E[V_l] = \frac{1}{2}\left(\frac{3}{4}\right)^2 + \frac{1}{2}\left(\frac{1}{4}\right)^2 = \frac{5}{16}. \]
Therefore,
\[ C_W(n) = \begin{cases} \frac{1}{4} & \text{if } n = 0 \\ \frac{1}{16} & \text{if } n \neq 0. \end{cases} \]
Since $\lim_{n \to \infty} C_W(n)$ exists and is not zero, $W$ is not mean ergodic.
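The composite binary source of Example 7.4.3 is easy to experiment with. The following is a minimal sketch, assuming the model $W_k = (1 - S)U_k + SV_k$ with the probabilities given in the example; the function names, the seed, and the run length are illustrative choices and not part of the notes. It computes $C_W(n)$ exactly with rational arithmetic and then draws a few independent realizations of the time average, each of which should settle near $\frac{3}{4} - \frac{S}{2}$ rather than near $\mu_W = \frac{1}{2}$.

```python
import random
from fractions import Fraction

P_U = Fraction(3, 4)   # P[U_k = 1], first coin
P_V = Fraction(1, 4)   # P[V_k = 1], second coin
P_S = Fraction(1, 2)   # P[S = 1], i.e. the switch selects the second coin

def cov_w(n):
    """Exact covariance C_W(n) for W_k = (1 - S) U_k + S V_k."""
    mean = (1 - P_S) * P_U + P_S * P_V
    if n == 0:
        second = mean                      # W_k**2 == W_k for a 0/1 variable
    else:
        second = (1 - P_S) * P_U ** 2 + P_S * P_V ** 2
    return second - mean ** 2

def time_average(n, rng):
    """Draw the switch S once, then average n flips of the selected coin."""
    s = 1 if rng.random() < float(P_S) else 0
    p = float(P_V) if s == 1 else float(P_U)
    ones = sum(1 for _ in range(n) if rng.random() < p)
    return s, ones / n

if __name__ == "__main__":
    print("C_W(0) =", cov_w(0), " C_W(n != 0) =", cov_w(1))
    rng = random.Random(0)
    for _ in range(5):
        s, avg = time_average(20000, rng)
        print("S =", s, " time average =", avg, " 3/4 - S/2 =", 0.75 - 0.5 * s)
```

Because the switch is drawn only once per realization, lengthening the run sharpens the time average around $\frac{3}{4}$ or $\frac{1}{4}$ but never moves it toward $\frac{1}{2}$, which is the failure of mean ergodicity described above.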
In many applications, we are interested in averages of functions that depend on multiple random variables. We discuss this topic for a discrete time stationary random process, $(X_n : n \in \mathbb{Z})$. Let $h$ be a bounded, Borel measurable function on $\mathbb{R}^k$ for some $k$. What time average would we expect to be a good approximation to the statistical average $E[h(X_1, \ldots, X_k)]$? A natural choice is
\[ \frac{1}{n}\sum_{j=1}^{n} h(X_j, X_{j+1}, \ldots, X_{j+k-1}). \]

We define a stationary random process $(X_n : n \in \mathbb{Z})$ to be ergodic if
\[ \lim_{n \to \infty} \frac{1}{n}\sum_{j=1}^{n} h(X_j, \ldots, X_{j+k-1}) = E[h(X_1, \ldots, X_k)] \]
for every $k \ge 1$ and for every bounded Borel measurable function $h$ on $\mathbb{R}^k$, where the limit is taken in any of the three senses a.s., p. or m.s. (see footnote 3). An interpretation of the definition is that if $X$ is ergodic then all of its finite dimensional distributions are determined as time averages.

As an example, suppose
\[ h(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 > 0 \ge x_2 \\ 0 & \text{else.} \end{cases} \]
Then $h(X_1, X_2)$ is one if the process $(X_k)$ makes a "down crossing" of level 0 between times one and two. If $X$ is ergodic then with probability 1,
\[ \lim_{n \to \infty} \frac{1}{n}\,(\text{number of down crossings between times } 1 \text{ and } n + 1) = P[X_1 > 0 \ge X_2]. \tag{7.36} \]
Equation (7.36) relates quantities that are quite different in nature. The left hand side of (7.36) is the long time-average downcrossing rate, whereas the right hand side of (7.36) involves only the joint statistics of two consecutive values of the process.

Ergodicity is a strong property. Two types of ergodic random processes are the following:

• a process $X = (X_k)$ such that the $X_k$'s are iid.
• a stationary Gaussian random process $X$ such that $\lim_{n \to \infty} R_X(n) = 0$ or $\lim_{n \to \infty} C_X(n) = 0$.

7.5 Complexification, Part I

In some application areas, primarily in connection with spectral analysis as we shall see, complex valued random variables naturally arise. Vectors and matrices over $\mathbb{C}$ are reviewed in the appendix.
A complex random variable $X = U + jV$ can be thought of as essentially a two dimensional random variable with real coordinates $U$ and $V$. Similarly, a random complex $n$-dimensional vector $X$ can be written as $X = U + jV$, where $U$ and $V$ are each $n$-dimensional real vectors. As far as distributions are concerned, a random vector in $n$-dimensional complex space $\mathbb{C}^n$ is equivalent to a random vector with $2n$ real dimensions. For example, if the $2n$ real variables in $U$ and $V$ are jointly continuous, then $X$ is a continuous type complex random vector and its density is given by a function $f_X(x)$ for $x \in \mathbb{C}^n$. The density $f_X$ is related to the joint density of $U$ and $V$ by $f_X(u + jv) = f_{UV}(u, v)$ for all $u, v \in \mathbb{R}^n$.

As far as moments are concerned, all the second order analysis covered in the notes up to this point can be easily modified to hold for complex random variables, simply by inserting complex conjugates in appropriate places. To begin, if $X$ and $Y$ are complex random variables, we define their correlation by $E[XY^*]$ and similarly their covariance as $E[(X - E[X])(Y - E[Y])^*]$. The Schwarz inequality becomes $|E[XY^*]| \le \sqrt{E[|X|^2]E[|Y|^2]}$ and its proof is essentially the same as for real valued random variables. The cross correlation matrix for two complex random vectors $X$ and $Y$ is given by $E[XY^*]$, and similarly the cross covariance matrix is given by $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])^*]$. As before, $\mathrm{Cov}(X) = \mathrm{Cov}(X, X)$. The various formulas for covariance still apply. For example, if $A$ and $C$ are complex matrices and $b$ and $d$ are complex vectors, then $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^*$.

Footnote 3 (to the definition of ergodicity in Section 7.4): The mathematics literature uses a different definition of ergodicity for stationary processes, which is equivalent. There are also definitions of ergodicity that do not require stationarity.
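The identity $\mathrm{Cov}(AX + b, CY + d) = A\,\mathrm{Cov}(X, Y)\,C^*$ holds for sample covariances as well as for true covariances, since both are built from the same bilinear form with a conjugate on the second argument. The sketch below checks this numerically with small complex matrices. Everything in it (helper names, the matrices $A$, $C$, the vectors $b$, $d$, and the way the samples are generated) is an illustrative assumption made here, and only the Python standard library is used.

```python
import random

def conj_transpose(m):
    """Hermitian transpose of a matrix stored as a list of rows."""
    return [[m[i][j].conjugate() for i in range(len(m))] for j in range(len(m[0]))]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def affine(a, x, b):
    """Return A x + b for a matrix A and vectors x, b given as plain lists."""
    return [sum(a[i][k] * x[k] for k in range(len(x))) + b[i] for i in range(len(a))]

def sample_cov(xs, ys):
    """Sample version of Cov(X, Y) = E[(X - E X)(Y - E Y)^*]."""
    n = len(xs)
    mx = [sum(x[i] for x in xs) / n for i in range(len(xs[0]))]
    my = [sum(y[j] for y in ys) / n for j in range(len(ys[0]))]
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]).conjugate()
                 for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

def max_abs_diff(p, q):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(p[i][j] - q[i][j]) for i in range(len(p)) for j in range(len(p[0])))

def demo(num_samples=200, seed=0):
    rng = random.Random(seed)
    cg = lambda: complex(rng.gauss(0, 1), rng.gauss(0, 1))
    xs = [[cg(), cg()] for _ in range(num_samples)]
    # Y is built from X plus noise so that Cov(X, Y) is not trivially zero.
    ys = [[x[0] + cg(), x[1] - 2j * x[0] + cg()] for x in xs]
    A = [[1 + 2j, 0.5], [-1j, 3]]
    C = [[2, 1j], [1 - 1j, -1]]
    b, d = [1j, 2], [-3, 4 + 1j]
    lhs = sample_cov([affine(A, x, b) for x in xs], [affine(C, y, d) for y in ys])
    rhs = matmul(matmul(A, sample_cov(xs, ys)), conj_transpose(C))
    return max_abs_diff(lhs, rhs)

if __name__ == "__main__":
    print("max entrywise difference:", demo())
```

The reported difference should be at the level of floating-point rounding. Replacing `conj_transpose(C)` by a plain transpose breaks the agreement, which is the point of the conjugate in the complex definition.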
Just as in the case of real valued random variables, a matrix $K$ is a valid covariance matrix (in other words, there exists some random vector $X$ such that $K = \mathrm{Cov}(X)$) if and only if $K$ is Hermitian symmetric and positive semidefinite.

Complex valued random variables $X$ and $Y$ with finite second moments are said to be orthogonal if $E[XY^*] = 0$, and with this definition the orthogonality principle holds for complex valued random variables. If $X$ and $Y$ are complex random vectors, then again $E[X|Y]$ is the MMSE estimator of $X$ given $Y$, and the covariance matrix of the error vector is given by $\mathrm{Cov}(X) - \mathrm{Cov}(E[X|Y])$. The MMSE estimator for $X$ of the form $AY + b$ (i.e. the best linear estimator of $X$ based on $Y$) and the covariance of the corresponding error vector are given just as for vectors made of real random variables:
\[ \hat{E}[X|Y] = E[X] + \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}(Y - E[Y]) \]
\[ \mathrm{Cov}(X - \hat{E}[X|Y]) = \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\mathrm{Cov}(Y)^{-1}\mathrm{Cov}(Y, X) \]

By definition, a sequence $X_1, X_2, \ldots$ of complex valued random variables converges in the m.s. sense to a random variable $X$ if $E[|X_n|^2] < \infty$ for all $n$ and if $\lim_{n \to \infty} E[|X_n - X|^2] = 0$. The various Cauchy criteria still hold with minor modification. A sequence with $E[|X_n|^2] < \infty$ for all $n$ is a Cauchy sequence in the m.s. sense if $\lim_{m, n \to \infty} E[|X_n - X_m|^2] = 0$. As before, a sequence converges in the m.s. sense if and only if it is a Cauchy sequence. In addition, a sequence $X_1, X_2, \ldots$ of complex valued random variables with $E[|X_n|^2] < \infty$ for all $n$ converges in the m.s. sense if and only if $\lim_{m, n \to \infty} E[X_m X_n^*]$ exists and is a finite constant $c$. If the m.s. limit exists, then the limiting random variable $X$ satisfies $E[|X|^2] = c$.

Let $X = (X_t : t \in \mathbb{T})$ be a complex random process. We can write $X_t = U_t + jV_t$ where $U$ and $V$ are each real valued random processes. The process $X$ is defined to be a second order process if $E[|X_t|^2] < \infty$ for all $t$.
Since $|X_t|^2 = U_t^2 + V_t^2$ for each $t$, $X$ being a second order process is equivalent to both $U$ and $V$ being second order processes. The correlation function of a second order complex random process $X$ is defined by $R_X(s,t) = E[X_s X_t^*]$. The covariance function is given by $C_X(s,t) = \mathrm{Cov}(X_s, X_t)$ where the definition of Cov for complex random variables is used. The definitions and results given for m.s. continuity, m.s. differentiation, and m.s. integration all carry over to the case of complex processes, because they are based on the use of the Cauchy criteria for m.s. convergence which also carries over. For example, a complex valued random process is m.s. continuous if and only if its correlation function $R_X$ is continuous. Similarly the cross correlation function for two second order random processes $X$ and $Y$ is defined by $R_{XY}(s,t) = E[X_s Y_t^*]$. Note that $R_{XY}(s,t) = R_{YX}^*(t,s)$.

Let $X = (X_t : t \in \mathbb{T})$ be a complex random process such that $\mathbb{T}$ is either the real line or the set of integers, and write $X_t = U_t + jV_t$ where $U$ and $V$ are each real valued random processes. By definition, $X$ is stationary if and only if for any $t_1, \ldots, t_n \in \mathbb{T}$, the joint distribution of $(X_{t_1+s}, \ldots, X_{t_n+s})$ is the same for all $s \in \mathbb{T}$. Equivalently, $X$ is stationary if and only if $U$ and $V$ are jointly stationary. The process $X$ is defined to be WSS if $X$ is a second order process such that $E[X_t]$ does not depend on $t$, and $R_X(s,t)$ is a function of $s - t$ alone. If $X$ is WSS we use $R_X(\tau)$ to denote $R_X(s,t)$, where $\tau = s - t$. A pair of complex-valued random processes $X$ and $Y$ are defined to be jointly WSS if both $X$ and $Y$ are WSS and if the cross correlation function $R_{XY}(s,t)$ is a function of $s - t$. If $X$ and $Y$ are jointly WSS then $R_{XY}(-\tau) = R_{YX}^*(\tau)$.

In summary, everything we've discussed in this section regarding complex random variables, vectors, and processes can be considered a simple matter of notation.
One simply needs to add lines to indicate complex conjugates, to use $|X|^2$ instead of $X^2$, and to use a star "$*$" for Hermitian transpose in place of "$T$" for transpose. We shall begin using the notation at this point, and return to a discussion of the topic of complex valued random processes in a later section. In particular, we will examine complex normal random vectors and their densities, and we shall see that there is somewhat more to complexification than just notation.

7.6 The Karhunen-Loève expansion

We've seen that under a change of coordinates, an $n$-dimensional random vector $X$ is transformed into a vector $Y = U^*X$ such that the coordinates of $Y$ are orthogonal random variables. Here $U$ is the unitary matrix such that $E[XX^*] = U\Lambda U^*$. The columns of $U$ are eigenvectors of the Hermitian symmetric matrix $E[XX^*]$ and the corresponding nonnegative eigenvalues of $E[XX^*]$ comprise the diagonal of the diagonal matrix $\Lambda$. The columns of $U$ form an orthonormal basis for $\mathbb{C}^n$. The Karhunen-Loève expansion gives a similar change of coordinates for a random process on a finite interval, using an orthonormal basis of functions instead of an orthonormal basis of vectors.

Fix an interval $[a, b]$. The $L^2$ norm of a real or complex valued function $f$ on the interval $[a, b]$ is defined by
\[ \|f\| = \sqrt{\int_a^b |f(t)|^2\,dt}. \]
We write $L^2[a, b]$ for the set of all functions on $[a, b]$ which have finite $L^2$ norm. The inner product of two functions $f$ and $g$ in $L^2[a, b]$ is defined by
\[ (f, g) = \int_a^b f(t) g^*(t)\,dt. \]
The functions $f$ and $g$ are said to be orthogonal if $(f, g) = 0$. Note that $\|f\| = \sqrt{(f, f)}$ and the Schwarz inequality holds: $|(f, g)| \le \|f\| \cdot \|g\|$. An orthonormal basis for $L^2[a, b]$ is a sequence of functions $\varphi_1, \varphi_2, \ldots$ which are mutually orthogonal and have norm one, or in other words, $(\varphi_i, \varphi_j) = I_{\{i=j\}}$ for all $i$ and $j$. In many applications it is useful to use representations of the form
\[ f(t) = \sum_{n=1}^{N} c_n \varphi_n(t). \tag{7.37} \]
In such a case, we think of $(c_1, \ldots, c_N)$ as the coordinates of $f$ relative to the basis $(\varphi_n)$, and we might write $f \leftrightarrow (c_1, \ldots, c_N)$. For example, transmitted signals in many digital communication systems have this form, where the coordinate vector $(c_1, \ldots, c_N)$ represents a data symbol. The geometry of the space of all functions $f$ of the form (7.37) for the fixed orthonormal basis $(\varphi_n)$ is equivalent to the geometry of the coordinate vectors. For example, if $g$ has a similar representation,
\[ g(t) = \sum_{n=1}^{N} d_n \varphi_n(t), \]
or equivalently $g \leftrightarrow (d_1, \ldots, d_N)$, then $f + g \leftrightarrow (c_1, \ldots, c_N) + (d_1, \ldots, d_N)$ and
\[ (f, g) = \int_a^b \left(\sum_{m=1}^{N} c_m \varphi_m(t)\right)\left(\sum_{n=1}^{N} d_n^* \varphi_n^*(t)\right)dt = \sum_{m=1}^{N}\sum_{n=1}^{N} c_m d_n^* \int_a^b \varphi_m(t)\varphi_n^*(t)\,dt = \sum_{m=1}^{N}\sum_{n=1}^{N} c_m d_n^* (\varphi_m, \varphi_n) = \sum_{m=1}^{N} c_m d_m^*. \]
That is, the inner product of the functions, $(f, g)$, is equal to the inner product of their coordinate vectors. Note that for $1 \le n \le N$, $\varphi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$, such that the one is in the $n$th position. If $f \leftrightarrow (c_1, \ldots, c_N)$, the coordinates of $f$ can be obtained by computing inner products between $f$ and the basis functions:
\[ (f, \varphi_n) = \int_a^b \left(\sum_{m=1}^{N} c_m \varphi_m(t)\right)\varphi_n^*(t)\,dt = \sum_{m=1}^{N} c_m (\varphi_m, \varphi_n) = c_n. \]
Another way to derive that $(f, \varphi_n) = c_n$ is to note that $f \leftrightarrow (c_1, \ldots, c_N)$ and $\varphi_n \leftrightarrow (0, \ldots, 0, 1, 0, \ldots, 0)$, so $(f, \varphi_n)$ is the inner product of $(c_1, \ldots, c_N)$ and $(0, \ldots, 0, 1, 0, \ldots, 0)$, or $c_n$. Thus, the coordinate vector for $f$ is given by $f \leftrightarrow ((f, \varphi_1), \ldots, (f, \varphi_N))$.

The dimension of the space $L^2[a, b]$ is infinite, so there are orthonormal bases $(\varphi_n : n \ge 1)$ with infinitely many functions (i.e. $N = \infty$).
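The correspondence between functions and coordinate vectors can be checked numerically. The sketch below is an illustration under assumptions made here: the interval is taken to be $[0, 1]$, the three basis functions are one convenient orthonormal choice, the coefficient vectors `c` and `d` are arbitrary, and integrals are approximated by a midpoint rule. It compares $(f, g)$ computed as an integral with $\sum_m c_m d_m^*$, and recovers the coordinates of $f$ as $(f, \varphi_n)$.

```python
import math

def inner(f, g, a=0.0, b=1.0, n=2000):
    """Midpoint-rule approximation of (f, g) = integral over [a, b] of f(t) g(t)^*."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) * complex(g(a + (k + 0.5) * h)).conjugate()
               for k in range(n)) * h

# Three orthonormal functions on [0, 1] (an illustrative choice of basis).
basis = [
    lambda t: 1.0,
    lambda t: math.sqrt(2.0) * math.cos(2 * math.pi * t),
    lambda t: math.sqrt(2.0) * math.sin(2 * math.pi * t),
]

def combine(coeffs):
    """Return the function t -> sum_n coeffs[n] * basis[n](t)."""
    return lambda t: sum(c * phi(t) for c, phi in zip(coeffs, basis))

c = [1 + 1j, 2.0, -0.5j]      # coordinates of f
d = [0.5, 1j, 3 - 1j]         # coordinates of g
f, g = combine(c), combine(d)

lhs = inner(f, g)                                                  # (f, g) as an integral
rhs = sum(cm * complex(dm).conjugate() for cm, dm in zip(c, d))    # coordinate inner product
recovered = [inner(f, phi) for phi in basis]                       # (f, phi_n), expected to match c

if __name__ == "__main__":
    print("integral:", lhs, " coordinates:", rhs)
    print("recovered coordinates:", recovered)
```

Note the conjugate on the second argument in both computations; dropping it in either place makes the two inner products disagree for complex coefficients.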
For an orthonormal basis $(\phi_n : n \ge 1)$ with infinitely many functions, a function $f$ can have the representation
$$f(t) = \sum_{n=1}^{\infty} c_n\phi_n(t). \tag{7.38}$$
In many instances encountered in practice, the sum (7.38) converges for each $t$, but in general what is meant is that the convergence is in the sense of the $L^2[a, b]$ norm:
$$\lim_{N\to\infty} \int_a^b \left|f(t) - \sum_{n=1}^{N} c_n\phi_n(t)\right|^2 dt = 0,$$
or equivalently,
$$\lim_{N\to\infty} \left\|f - \sum_{n=1}^{N} c_n\phi_n\right\| = 0.$$
An orthonormal basis $(\phi_n)$ is defined to be complete if any function $f$ in $L^2[a, b]$ can be written as in (7.38). A commonly used example of a complete orthonormal basis for an interval $[0, T]$ is given by
$$\phi_1(t) = \frac{1}{\sqrt{T}}, \quad \phi_2(t) = \sqrt{\frac{2}{T}}\cos\left(\frac{2\pi t}{T}\right), \quad \phi_3(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{2\pi t}{T}\right),$$
$$\phi_{2k}(t) = \sqrt{\frac{2}{T}}\cos\left(\frac{2\pi kt}{T}\right), \quad \phi_{2k+1}(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{2\pi kt}{T}\right) \quad \text{for } k \ge 1. \tag{7.39}$$

What happens if we replace $f$ by a random process $X = (X_t : a \le t \le b)$? Suppose $(\phi_n : 1 \le n \le N)$ is an orthonormal basis consisting of continuous functions, with $N \le \infty$. (The basis does not have to be complete, and we allow $N$ to be finite or to equal infinity. Of course, if the basis is complete, then $N = \infty$.) Suppose that $X$ is m.s. continuous, or equivalently, that $R_X$ is continuous as a function on $[a, b] \times [a, b]$. In particular, $R_X$ is bounded. Then $E\left[\int_a^b |X_t|^2 dt\right] = \int_a^b R_X(t, t)\,dt < \infty$, so that $\int_a^b |X_t|^2 dt$ is finite with probability one. Suppose that $X$ can be represented as
$$X_t = \sum_{n=1}^{N} C_n\phi_n(t). \tag{7.40}$$
Such a representation exists if the basis $(\phi_n)$ is complete, but some random processes have the form (7.40) even if $N$ is finite or if $N$ is infinite but the basis is not complete. The representation (7.40) reduces the description of the continuous-time random process to the description of the coefficients, $(C_n)$. This representation of $X$ is much easier to work with if the coordinate random variables are orthogonal.
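The $L^2$ convergence in (7.38) can be illustrated with the trigonometric basis (7.39). The sketch below makes several assumptions that are not in the notes: the test function is $f(t) = t$ on $[0, T]$ with $T = 2$, integrals are approximated by a midpoint Riemann sum, and the helper names `basis` and `l2_error` are hypothetical.

```python
import numpy as np

# Midpoint grid on [0, T]; T = 2 and M are illustrative choices.
T = 2.0
M = 4000
t = (np.arange(M) + 0.5) * T / M
dt = T / M

def basis(n):
    """Samples of phi_n from (7.39), for n = 1, 2, 3, ..."""
    if n == 1:
        return np.ones(M) / np.sqrt(T)
    k = n // 2                          # frequency index: phi_{2k} and phi_{2k+1}
    trig = np.cos if n % 2 == 0 else np.sin
    return np.sqrt(2 / T) * trig(2 * np.pi * k * t / T)

f = t.copy()  # test function f(t) = t (real valued, so no conjugates are needed)

def l2_error(N):
    """L2 norm of f minus its projection onto phi_1, ..., phi_N."""
    approx = np.zeros(M)
    for n in range(1, N + 1):
        phi = basis(n)
        c_n = np.sum(f * phi) * dt      # coordinate (f, phi_n)
        approx += c_n * phi
    return np.sqrt(np.sum((f - approx) ** 2) * dt)

errors = [l2_error(N) for N in (1, 5, 21, 81)]
print(errors)
```

The errors should decrease as more terms are included. The decrease is slow for this $f$ because its periodic extension has a jump, which is one reason convergence is stated in the $L^2$ sense rather than pointwise.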
Definition 7.6.1 A Karhunen-Loève (KL) expansion for a random process $X = (X_t : a \le t \le b)$ is a representation of the form (7.40) such that:
(1) the functions $(\phi_n)$ are orthonormal: $(\phi_m, \phi_n) = I_{\{m=n\}}$, and
(2) the coordinate random variables $C_n$ are mutually orthogonal: $E[C_m C_n^*] = 0$ for $m \ne n$.

Example 7.6.2 Let $X_t = A$ for $0 \le t \le T$, where $A$ is a random variable with $0 < E[A^2] < \infty$. Then $X$ has the form in (7.40) for $N = 1$, $C_1 = A\sqrt{T}$, and $\phi_1(t) = \frac{I_{\{0 \le t \le T\}}}{\sqrt{T}}$. This is trivially a KL expansion, with only one term.

Example 7.6.3 Let $X_t = A\cos(2\pi t/T + \Theta)$ for $0 \le t \le T$, where $A$ is a real-valued random variable with $0 < E[A^2] < \infty$, and $\Theta$ is a random variable uniformly distributed on $[0, 2\pi]$ and independent of $A$. By the cosine angle addition formula, $X_t = A\cos(\Theta)\cos(2\pi t/T) - A\sin(\Theta)\sin(2\pi t/T)$. Then $X$ has the form in (7.40) for $N = 2$,
$$C_1 = A\sqrt{\tfrac{T}{2}}\cos(\Theta), \quad C_2 = -A\sqrt{\tfrac{T}{2}}\sin(\Theta), \quad \phi_1(t) = \sqrt{\tfrac{2}{T}}\cos(2\pi t/T), \quad \phi_2(t) = \sqrt{\tfrac{2}{T}}\sin(2\pi t/T).$$
In particular, $\phi_1$ and $\phi_2$ form an orthonormal basis with $N = 2$ elements. To check whether this is a KL expansion, we see if $E[C_1 C_2^*] = 0$. Since $E[C_1 C_2^*] = -\frac{T}{2}E[A^2]E[\cos(\Theta)\sin(\Theta)] = -\frac{T}{4}E[A^2]E[\sin(2\Theta)] = 0$, this is indeed a KL expansion, with two terms.

Proposition 7.6.4 Suppose $X = (X_t : a \le t \le b)$ is m.s. continuous and $(\phi_n)$ is an orthonormal basis of continuous functions. If (7.40) holds for some random variables $(C_n)$, it is a KL expansion (i.e., the coordinate random variables are orthogonal) if and only if the basis functions are eigenfunctions of $R_X$:
$$R_X\phi_n = \lambda_n\phi_n, \tag{7.41}$$
where for a function $\phi \in L^2[a, b]$, $R_X\phi$ denotes the function $(R_X\phi)(s) = \int_a^b R_X(s, t)\phi(t)\,dt$. In case (7.40) is a KL expansion, the eigenvalues are given by $\lambda_n = E[|C_n|^2]$.

Proof. Suppose (7.40) holds. Then $C_n = (X, \phi_n) = \int_a^b X_t\phi_n^*(t)\,dt$, so that
$$E[C_m C_n^*] = E\left[(X, \phi_m)(X, \phi_n)^*\right] = E\left[\left(\int_a^b X_s\phi_m^*(s)\,ds\right)\left(\int_a^b X_t\phi_n^*(t)\,dt\right)^*\right] = \int_a^b\int_a^b R_X(s, t)\phi_m^*(s)\phi_n(t)\,ds\,dt = (R_X\phi_n, \phi_m). \tag{7.42}$$
Now, if the basis functions $(\phi_n)$ are eigenfunctions of $R_X$, then $E[C_m C_n^*] = (R_X\phi_n, \phi_m) = (\lambda_n\phi_n, \phi_m) = \lambda_n(\phi_n, \phi_m) = \lambda_n I_{\{m=n\}}$. In particular, $E[C_m C_n^*] = 0$ if $n \ne m$, so that (7.40) is a KL expansion. Also, taking $m = n$ yields $E[|C_n|^2] = \lambda_n$. Conversely, suppose (7.40) is a KL expansion. Without loss of generality, suppose that the basis $(\phi_n)$ is complete. (If it weren't complete, it could be extended to a complete basis by augmenting it with functions from a complete basis and applying the Gram-Schmidt method of orthogonalizing.) Then for $n$ fixed, $(R_X\phi_n, \phi_m) = 0$ for all $m \ne n$. By completeness of the $(\phi_n)$, the function $R_X\phi_n$ has an expansion of the form (7.38), but all terms except possibly the $n$th are zero. Hence, $R_X\phi_n = \lambda_n\phi_n$ for some constant $\lambda_n$, so the eigenrelations (7.41) hold. Again, $E[|C_n|^2] = \lambda_n$ by the computation above.

The following theorem is stated without proof.

Theorem 7.6.5 (Mercer's theorem) Suppose $X = (X_t : a \le t \le b)$ is m.s. continuous, or, equivalently, that $R_X$ is continuous. Then there exists a complete orthonormal basis $(\phi_n : n \ge 1)$ of continuous eigenfunctions and corresponding nonnegative eigenvalues $(\lambda_n : n \ge 1)$ for $R_X$, and $R_X$ is given by the following series expansion:
$$R_X(s, t) = \sum_{n=1}^{\infty} \lambda_n\phi_n(s)\phi_n^*(t). \tag{7.43}$$
The series converges uniformly in $s, t$, meaning that
$$\lim_{N\to\infty} \max_{a \le s, t \le b} \left|R_X(s, t) - \sum_{n=1}^{N} \lambda_n\phi_n(s)\phi_n^*(t)\right| = 0.$$

Theorem 7.6.6 (Karhunen-Loève expansion) Any process $X$ satisfying the conditions of Mercer's theorem has a KL expansion,
$$X_t = \sum_{n=1}^{\infty} \phi_n(t)(X, \phi_n),$$
and the series converges in the m.s. sense, uniformly in $t$.

Proof. Use the complete orthonormal basis $(\phi_n)$ guaranteed by Mercer's theorem. By (7.42), $E[(X, \phi_m)(X, \phi_n)^*] = (R_X\phi_n, \phi_m) = \lambda_n I_{\{n=m\}}$.
Also,
$$E[X_t(X, \phi_n)^*] = E\left[X_t\int_a^b X_s^*\phi_n(s)\,ds\right] = \int_a^b R_X(t, s)\phi_n(s)\,ds = \lambda_n\phi_n(t).$$
These facts imply that for finite $N$,
$$E\left[\left|X_t - \sum_{n=1}^{N} \phi_n(t)(X, \phi_n)\right|^2\right] = R_X(t, t) - \sum_{n=1}^{N} \lambda_n|\phi_n(t)|^2, \tag{7.44}$$
which, since the series on the right side of (7.44) converges uniformly in $t$ as $N \to \infty$ by Mercer's theorem, implies the stated convergence property for the representation of $X$.

Remarks
(1) The means of the coordinates of $X$ in a KL expansion can be expressed using the mean function $\mu_X(t) = E[X_t]$ as follows:
$$E[(X, \phi_n)] = \int_a^b \mu_X(t)\phi_n^*(t)\,dt = (\mu_X, \phi_n).$$
Thus, the mean of the $n$th coordinate of $X$ is the $n$th coordinate of the mean function of $X$.
(2) Symbolically, mimicking matrix notation, we can write the representation (7.43) of $R_X$ as
$$R_X(s, t) = \left[\phi_1(s)\,|\,\phi_2(s)\,|\,\cdots\right] \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \lambda_3 & \\ & & & \ddots \end{bmatrix} \begin{bmatrix} \phi_1^*(t) \\ \phi_2^*(t) \\ \vdots \end{bmatrix}.$$
(3) If $f \in L^2[a, b]$ and $f(t)$ represents a voltage or current across a resistor, then the energy dissipated during the interval $[a, b]$ is, up to a multiplicative constant, given by
$$(\text{Energy of } f) = \|f\|^2 = \int_a^b |f(t)|^2 dt = \sum_{n=1}^{\infty} |(f, \phi_n)|^2.$$
The mean total energy of $(X_t : a < t < b)$ is thus given by
$$E\left[\int_a^b |X_t|^2 dt\right] = \int_a^b R_X(t, t)\,dt = \int_a^b \sum_{n=1}^{\infty} \lambda_n|\phi_n(t)|^2 dt = \sum_{n=1}^{\infty} \lambda_n.$$
(4) If $(X_t : a \le t \le b)$ is a real valued mean zero Gaussian process and if the orthonormal basis functions are real valued, then the coordinates $(X, \phi_n)$ are uncorrelated, real valued, jointly Gaussian random variables, and therefore are independent.

Example 7.6.7 Let $W = (W_t : t \ge 0)$ be a Brownian motion with parameter $\sigma^2$. Let us find the KL expansion of $W$ over the interval $[0, T]$. Substituting $R_X(s, t) = \sigma^2(s \wedge t)$ into the eigenrelation (7.41) yields
$$\int_0^t \sigma^2 s\,\phi_n(s)\,ds + \int_t^T \sigma^2 t\,\phi_n(s)\,ds = \lambda_n\phi_n(t). \tag{7.45}$$
Differentiating (7.45) with respect to $t$ yields
$$\sigma^2 t\phi_n(t) - \sigma^2 t\phi_n(t) + \int_t^T \sigma^2\phi_n(s)\,ds = \lambda_n\phi_n'(t), \tag{7.46}$$
and differentiating a second time yields that the eigenfunctions satisfy the differential equation $\lambda\phi'' = -\sigma^2\phi$.
Also, setting $t = 0$ in (7.45) yields the boundary condition $\phi_n(0) = 0$, and setting $t = T$ in (7.46) yields the boundary condition $\phi_n'(T) = 0$. Solving yields that the eigenvalue and eigenfunction pairs for $W$ are
$$\lambda_n = \frac{4\sigma^2T^2}{(2n+1)^2\pi^2}, \qquad \phi_n(t) = \sqrt{\frac{2}{T}}\sin\left(\frac{(2n+1)\pi t}{2T}\right), \qquad n \ge 0.$$
These functions form a complete orthonormal basis for $L^2[0, T]$.

Example 7.6.8 Let $X$ be a white noise process. Such a process is not a random process as defined in these notes, but can be defined as a generalized process in the same way that a delta function can be defined as a generalized function. Generalized random processes, just like generalized functions, only make sense when multiplied by a suitable function and then integrated. For example, the delta function $\delta$ is defined by the requirement that for any function $f$ that is continuous at $t = 0$,
$$\int_{-\infty}^{\infty} f(t)\delta(t)\,dt = f(0).$$
A white noise process $X$ is such that integrals of the form $\int_{-\infty}^{\infty} f(t)X(t)\,dt$ exist for functions $f$ with finite $L^2$ norm $\|f\|$. The integrals are random variables with finite second moments, mean zero and correlations given by
$$E\left[\left(\int_{-\infty}^{\infty} f(s)X_s\,ds\right)\left(\int_{-\infty}^{\infty} g(t)X_t\,dt\right)^*\right] = \sigma^2\int_{-\infty}^{\infty} f(t)g^*(t)\,dt.$$
In a formal or symbolic sense, this means that $X$ is a WSS process with mean $\mu_X = 0$ and autocorrelation function $R_X(s, t) = E[X_sX_t^*]$ given by $R_X(\tau) = \sigma^2\delta(\tau)$.

What would the KL expansion be for a white noise process over some fixed interval $[a, b]$? The eigenrelation (7.41) becomes simply $\sigma^2\phi(t) = \lambda_n\phi(t)$ for all $t$ in the interval. Thus, all the eigenvalues of a white noise process are equal to $\sigma^2$, and any function $f$ with finite norm is an eigenfunction. Thus, if $(\phi_n : n \ge 1)$ is an arbitrary complete orthonormal basis for the square integrable functions on $[a, b]$, then the coordinates of the white noise process $X$, formally given by $X_n = (X, \phi_n)$, satisfy
$$E[X_nX_m^*] = \sigma^2 I_{\{n=m\}}. \tag{7.47}$$
This offers a reasonable interpretation of white noise. It is a generalized random process such that its coordinates $(X_n : n \ge 1)$ relative to an arbitrary orthonormal basis for a finite interval have mean zero and satisfy (7.47).

7.7 Periodic WSS random processes

Let $X = (X_t : t \in \mathbb{R})$ be a WSS random process and let $T$ be a positive constant.

Proposition 7.7.1 The following three conditions are equivalent:
(a) $R_X(T) = R_X(0)$
(b) $P[X_{T+\tau} = X_\tau] = 1$ for all $\tau \in \mathbb{R}$
(c) $R_X(T + \tau) = R_X(\tau)$ for all $\tau \in \mathbb{R}$ (i.e. $R_X(\tau)$ is periodic with period $T$.)

Proof. Suppose (a) is true. Since $R_X(0)$ is real valued, so is $R_X(T)$, yielding
$$E[|X_{T+\tau} - X_\tau|^2] = E[X_{T+\tau}X_{T+\tau}^* - X_{T+\tau}X_\tau^* - X_\tau X_{T+\tau}^* + X_\tau X_\tau^*] = R_X(0) - R_X(T) - R_X^*(T) + R_X(0) = 0.$$
Therefore, (a) implies (b). Next, suppose (b) is true and let $\tau \in \mathbb{R}$. Since two random variables that are equal with probability one have the same expectation, (b) implies that
$$R_X(T + \tau) = E[X_{T+\tau}X_0^*] = E[X_\tau X_0^*] = R_X(\tau).$$
Therefore (b) implies (c). Trivially (c) implies (a), so the equivalence of (a) through (c) is proved.

Definition 7.7.2 We call $X$ a periodic, WSS process of period $T$ if $X$ is WSS and any of the three equivalent properties (a), (b), or (c) of Proposition 7.7.1 hold.

Property (b) almost implies that the sample paths of $X$ are periodic. However, for each $\tau$ it can be that $X_\tau \ne X_{\tau+T}$ on an event of probability zero, and since there are uncountably many real numbers $\tau$, the sample paths need not be periodic. However, suppose (b) is true and define a process $Y$ by $Y_t = X_{(t \bmod T)}$. (Recall that by definition, $(t \bmod T)$ is equal to $t + nT$, where $n$ is selected so that $0 \le t + nT < T$.) Then $Y$ has periodic sample paths, and $Y$ is a version of $X$, which by definition means that $P[X_t = Y_t] = 1$ for any $t \in \mathbb{R}$. Thus, the properties (a) through (c) are equivalent to the condition that $X$ is WSS and there is a version of $X$ with periodic sample paths of period $T$.

Suppose $X$ is a m.s. continuous, periodic, WSS random process. Due to the periodicity of $X$, it is natural to consider the restriction of $X$ to the interval $[0, T]$. The Karhunen-Loève expansion of $X$ restricted to $[0, T]$ is described next. Let $\phi_n$ be the function on $[0, T]$ defined by
$$\phi_n(t) = \frac{e^{2\pi jnt/T}}{\sqrt{T}}.$$
The functions $(\phi_n : n \in \mathbb{Z})$ form a complete orthonormal basis for $L^2[0, T]$.$^4$ In addition, for any $n$ fixed, both $R_X(\tau)$ and $\phi_n$ are periodic with period dividing $T$, so
$$\int_0^T R_X(s, t)\phi_n(t)\,dt = \int_0^T R_X(s - t)\phi_n(t)\,dt = \int_{s-T}^{s} R_X(t)\phi_n(s - t)\,dt = \int_0^T R_X(t)\phi_n(s - t)\,dt = \frac{1}{\sqrt{T}}\int_0^T R_X(t)e^{2\pi jns/T}e^{-2\pi jnt/T}\,dt = \lambda_n\phi_n(s),$$
where $\lambda_n$ is given by
$$\lambda_n = \int_0^T R_X(t)e^{-2\pi jnt/T}\,dt = \sqrt{T}\,(R_X, \phi_n). \tag{7.48}$$
Therefore $\phi_n$ is an eigenfunction of $R_X$ with eigenvalue $\lambda_n$. The Karhunen-Loève expansion (7.40) of $X$ over the interval $[0, T]$ can be written as
$$X_t = \sum_{n=-\infty}^{\infty} \hat{X}_n e^{2\pi jnt/T}, \tag{7.49}$$
where $\hat{X}_n$ is defined by
$$\hat{X}_n = \frac{1}{\sqrt{T}}(X, \phi_n) = \frac{1}{T}\int_0^T X_t e^{-2\pi jnt/T}\,dt.$$
Note that
$$E[\hat{X}_m\hat{X}_n^*] = \frac{1}{T}E[(X, \phi_m)(X, \phi_n)^*] = \frac{\lambda_n}{T}I_{\{m=n\}}.$$
Although the representation (7.49) has been derived only for $0 \le t \le T$, both sides of (7.49) are periodic with period $T$. Therefore, the representation (7.49) holds for all $t$. It is called the spectral representation of the periodic, WSS process $X$.

By (7.48), the series expansion (7.38) applied to the function $R_X$ over the interval $[0, T]$ can be written as
$$R_X(t) = \sum_{n=-\infty}^{\infty} \frac{\lambda_n}{T}e^{2\pi jnt/T} = \sum_{\omega} p_X(\omega)e^{j\omega t}, \tag{7.50}$$
where $p_X$ is the function on the real line $\mathbb{R} = (\omega : -\infty < \omega < \infty)$,$^5$ defined by
$$p_X(\omega) = \begin{cases} \lambda_n/T & \omega = \frac{2\pi n}{T} \text{ for some integer } n \\ 0 & \text{else} \end{cases}$$
and the sum in (7.50) is only over $\omega$ such that $p_X(\omega) \ne 0$. The function $p_X$ is called the power spectral mass function of $X$. It is similar to a probability mass function, in that it is positive for at most a countable infinity of values. The value $p_X(\frac{2\pi n}{T})$ is equal to the power of the $n$th term in the representation (7.49):
$$E[|\hat{X}_n e^{2\pi jnt/T}|^2] = E[|\hat{X}_n|^2] = p_X\left(\frac{2\pi n}{T}\right),$$
and the total mass of $p_X$ is the total power of $X$, $R_X(0) = E[|X_t|^2]$.

Periodicity is a rather restrictive assumption to place on a WSS process. In the next chapter we shall further investigate spectral properties of WSS processes. We shall see that many WSS random processes have a power spectral density. A given random variable might have a pmf or a pdf, and it definitely has a CDF. In the same way, a given WSS process might have a power spectral mass function or a power spectral density function, and it definitely has a cumulative power spectral distribution function. The periodic WSS processes of period $T$ are precisely those WSS processes that have a power spectral mass function that is concentrated on the integer multiples of $\frac{2\pi}{T}$.

$^4$ Here it is more convenient to index the functions by the integers, rather than by the nonnegative integers. Sums of the form $\sum_{n=-\infty}^{\infty}$ should be interpreted as limits of $\sum_{n=-N}^{N}$ as $N \to \infty$.

$^5$ The Greek letter $\omega$ is used here as it is traditionally used for frequency measured in radians per second. It is related to the frequency $f$ measured in cycles per second by $\omega = 2\pi f$. Here $\omega$ is not the same as a typical element of the underlying space of all outcomes, $\Omega$. The meaning of $\omega$ should be clear from the context.

7.8 Problems

7.1 Calculus for a simple Gaussian random process
Define $X = (X_t : t \in \mathbb{R})$ by $X_t = A + Bt + Ct^2$, where $A$, $B$, and $C$ are independent, $N(0, 1)$ random variables.
(a) Verify directly that $X$ is m.s. differentiable.
(b) Express $P\{\int_0^1 X_s\,ds \ge 1\}$ in terms of $Q$, the standard normal complementary CDF.

7.2 Lack of sample path continuity of a Poisson process
Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$.
(a) Find the following two probabilities, explaining your reasoning: $P\{N \text{ is continuous over the interval } [0, T]\}$ for a fixed $T > 0$, and $P\{N \text{ is continuous over the interval } [0, \infty)\}$.
(b) Is $N$ sample path continuous a.s.? Is $N$ m.s. continuous?

7.3 Properties of a binary valued process
Let $Y = (Y_t : t \ge 0)$ be given by $Y_t = (-1)^{N_t}$, where $N$ is a Poisson process with rate $\lambda > 0$.
(a) Is $Y$ a Markov process? If so, find the transition probability function $p_{i,j}(s, t)$ and the transition rate matrix $Q$.
(b) Is $Y$ mean square continuous?
(c) Is $Y$ mean square differentiable?
(d) Does $\lim_{T\to\infty} \frac{1}{T}\int_0^T Y_t\,dt$ exist in the m.s. sense? If so, identify the limit.

7.4 Some statements related to the basic calculus of random processes
Classify each of the following statements as either true (meaning always holds) or false, and justify your answers.
(a) Let $X_t = Z$, where $Z$ is a Gaussian random variable. Then $X = (X_t : t \in \mathbb{R})$ is mean ergodic in the m.s. sense.
(b) The function $R_X$ defined by $R_X(\tau) = \begin{cases} \sigma^2 & |\tau| \le 1 \\ 0 & |\tau| > 1 \end{cases}$ is a valid autocorrelation function.
(c) Suppose $X = (X_t : t \in \mathbb{R})$ is a mean zero stationary Gaussian random process, and suppose $X$ is m.s. differentiable. Then for any fixed time $t$, $X_t$ and $X_t'$ are independent.

7.5 Differentiation of the square of a Gaussian random process
(a) Show that if random variables $(A_n : n \ge 0)$ are mean zero and jointly Gaussian and if $\lim_{n\to\infty} A_n = A$ m.s., then $\lim_{n\to\infty} A_n^2 = A^2$ m.s. (Hint: If $A$, $B$, $C$, and $D$ are mean zero and jointly Gaussian, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)
(b) Show that if random variables $(A_n, B_n : n \ge 0)$ are jointly Gaussian and $\lim_{n\to\infty} A_n = A$ m.s. and $\lim_{n\to\infty} B_n = B$ m.s., then $\lim_{n\to\infty} A_nB_n = AB$ m.s. (Hint: Use part (a) and the identity $ab = \frac{(a+b)^2 - a^2 - b^2}{2}$.)
(c) Let $X$ be a mean zero, m.s. differentiable Gaussian random process, and let $Y_t = X_t^2$ for all $t$. Is $Y$ m.s. differentiable? If so, justify your answer and express the derivative in terms of $X_t$ and $X_t'$.

7.6 Cross correlation between a process and its m.s. derivative
Suppose $X$ is a m.s. differentiable random process. Show that $R_{X'X} = \partial_1 R_X$. (It follows, in particular, that $\partial_1 R_X$ exists.)

7.7 Fundamental theorem of calculus for m.s. calculus
Suppose $X = (X_t : t \ge 0)$ is a m.s. continuous random process. Let $Y$ be the process defined by $Y_t = \int_0^t X_u\,du$ for $t \ge 0$. Show that $X$ is the m.s. derivative of $Y$. (It follows, in particular, that $Y$ is m.s. differentiable.)

7.8 A windowed Poisson process
Let $N = (N_t : t \ge 0)$ be a Poisson process with rate $\lambda > 0$, and let $X = (X_t : t \ge 0)$ be defined by $X_t = N_{t+1} - N_t$. Thus, $X_t$ is the number of counts of $N$ during the time window $(t, t + 1]$.
(a) Sketch a typical sample path of $N$, and the corresponding sample path of $X$.
(b) Find the mean function $\mu_X(t)$ and covariance function $C_X(s, t)$ for $s, t \ge 0$. Express your answer in a simple form.
(c) Is $X$ Markov? Why or why not?
(d) Is $X$ mean-square continuous? Why or why not?
(e) Determine whether $\frac{1}{t}\int_0^t X_s\,ds$ converges in the mean square sense as $t \to \infty$.

7.9 An integral of white noise times an exponential
Let $X_t = \int_0^t Z_u e^{-u}\,du$, for $t \ge 0$, where $Z$ is white Gaussian noise with autocorrelation function $\delta(\tau)\sigma^2$, for some $\sigma^2 > 0$.
(a) Find the autocorrelation function, $R_X(s, t)$ for $s, t \ge 0$.
(b) Is $X$ mean square differentiable? Justify your answer.
(c) Does $X_t$ converge in the mean square sense as $t \to \infty$? Justify your answer.

7.10 A singular integral with a Brownian motion
Consider the integral $\int_0^1 \frac{w_t}{t}\,dt$, where $w = (w_t : t \ge 0)$ is a standard Brownian motion. Since $\mathrm{Var}(\frac{w_t}{t}) = \frac{1}{t}$ diverges as $t \to 0$, we define the integral as $\lim_{\epsilon\to 0} \int_\epsilon^1 \frac{w_t}{t}\,dt$ m.s. if the limit exists.
(a) Does the limit exist? If so, what is the probability distribution of the limit?
(b) Similarly, we define $\int_1^\infty \frac{w_t}{t}\,dt$ to be $\lim_{T\to\infty} \int_1^T \frac{w_t}{t}\,dt$ m.s. if the limit exists. Does the limit exist? If so, what is the probability distribution of the limit?

7.11 An integrated Poisson process
Let $N = (N_t : t \ge 0)$ denote a Poisson process with rate $\lambda > 0$, and let $Y_t = \int_0^t N_s\,ds$ for $t \ge 0$.
(a) Sketch a typical sample path of $Y$.
(b) Compute the mean function, $\mu_Y(t)$, for $t \ge 0$.
(c) Compute $\mathrm{Var}(Y_t)$ for $t \ge 0$.
(d) Determine the value of the limit, $\lim_{t\to\infty} P[Y_t < t]$.

7.12 Recognizing m.s. properties
Suppose $X$ is a mean zero random process. For each choice of autocorrelation function shown, indicate which of the following properties $X$ has: m.s. continuous, m.s. differentiable, m.s. integrable over finite length intervals, and mean ergodic in the m.s. sense.
(a) $X$ is WSS with $R_X(\tau) = (1 - |\tau|)_+$.
(b) $X$ is WSS with $R_X(\tau) = 1 + (1 - |\tau|)_+$.
(c) $X$ is WSS with $R_X(\tau) = \cos(20\pi\tau)\exp(-10|\tau|)$.
(d) $R_X(s, t) = \begin{cases} 1 & \text{if } \lfloor s \rfloor = \lfloor t \rfloor \\ 0 & \text{else} \end{cases}$ (not WSS, you don't need to check for mean ergodic property)
(e) $R_X(s, t) = \sqrt{s \wedge t}$ for $s, t \ge 0$. (not WSS, you don't need to check for mean ergodic property)

7.13 A random Taylor's approximation
Suppose $X$ is a mean zero WSS random process such that $R_X$ is twice continuously differentiable. Guided by Taylor's approximation for deterministic functions, we might propose the following estimator of $X_t$ given $X_0$ and $X_0'$: $\hat{X}_t = X_0 + tX_0'$.
(a) Express the covariance matrix for the vector $(X_0, X_0', X_t)^T$ in terms of the function $R_X$ and its derivatives.
(b) Express the mean square error $E[(X_t - \hat{X}_t)^2]$ in terms of the function $R_X$ and its derivatives.
(c) Express the optimal linear estimator $\hat{E}[X_t | X_0, X_0']$ in terms of $X_0$, $X_0'$, and the function $R_X$ and its derivatives.
(d) (This part is optional, not required.) Compute and compare $\lim_{t\to 0} (\text{mean square error})/t^4$ for the two estimators, under the assumption that $R_X$ is four times continuously differentiable.

7.14 A stationary Gaussian process
Let $X = (X_t : t \in \mathbb{Z})$ be a real stationary Gaussian process with mean zero and $R_X(t) = \frac{1}{1+t^2}$. Answer the following unrelated questions.
(a) Is $X$ a Markov process? Justify your answer.
(b) Find $E[X_3 | X_0]$ and express $P\{|X_3 - E[X_3 | X_0]| \ge 10\}$ in terms of $Q$, the standard Gaussian complementary cumulative distribution function.
(c) Find the autocorrelation function of $X'$, the m.s. derivative of $X$.
(d) Describe the joint probability density of $(X_0, X_0', X_1)^T$. You need not write it down in detail.

7.15 Integral of a Brownian bridge
A standard Brownian bridge $B$ can be defined by $B_t = W_t - tW_1$ for $0 \le t \le 1$, where $W$ is a Brownian motion with parameter $\sigma^2 = 1$. A Brownian bridge is a mean zero, Gaussian random process which is a.s. sample path continuous, and has autocorrelation function $R_B(s, t) = s(1 - t)$ for $0 \le s \le t \le 1$.
(a) Why is the integral $X = \int_0^1 B_t\,dt$ well defined in the m.s. sense?
(b) Describe the joint distribution of the random variables $X$ and $W_1$.

7.16 Correlation ergodicity of Gaussian processes
A WSS random process $X$ is called correlation ergodic (in the m.s. sense) if for any constant $h$,
$$\lim_{t\to\infty} \text{m.s.}\ \frac{1}{t}\int_0^t X_{s+h}X_s\,ds = E[X_{s+h}X_s].$$
Suppose $X$ is a mean zero, real-valued Gaussian process such that $R_X(\tau) \to 0$ as $|\tau| \to \infty$. Show that $X$ is correlation ergodic. (Hints: Let $Y_t = X_{t+h}X_t$. Then correlation ergodicity of $X$ is equivalent to mean ergodicity of $Y$. If $A$, $B$, $C$, and $D$ are mean zero, jointly Gaussian random variables, then $E[ABCD] = E[AB]E[CD] + E[AC]E[BD] + E[AD]E[BC]$.)

7.17 A random process which changes at a random time
Let $Y = (Y_t : t \in \mathbb{R})$ and $Z = (Z_t : t \in \mathbb{R})$ be stationary Gaussian Markov processes with mean zero and autocorrelation functions $R_Y(\tau) = R_Z(\tau) = e^{-|\tau|}$. Let $U$ be a real-valued random variable and suppose $Y$, $Z$, and $U$ are mutually independent. Finally, let $X = (X_t : t \in \mathbb{R})$ be defined by
$$X_t = \begin{cases} Y_t & t < U \\ Z_t & t \ge U. \end{cases}$$
(a) Sketch a typical sample path of $X$.
(b) Find the first order distributions of $X$.
(c) Express the mean and autocorrelation function of $X$ in terms of the CDF, $F_U$, of $U$.
(d) Under what condition on $F_U$ is $X$ m.s. continuous?
(e) Under what condition on $F_U$ is $X$ a Gaussian random process?

7.18 Gaussian review question
Let $X = (X_t : t \in \mathbb{R})$ be a real-valued stationary Gauss-Markov process with mean zero and autocorrelation function $C_X(\tau) = 9\exp(-|\tau|)$.
(a) A fourth degree polynomial of two variables is given by $p(x, y) = a + bx + cy + dxy + ex^2y + fxy^2 + \ldots$ such that all terms have the form $cx^iy^j$ with $i + j \le 4$. Suppose $X_2$ is to be estimated by an estimator of the form $p(X_0, X_1)$. Find the fourth degree polynomial $p$ to minimize the MSE: $E[(X_2 - p(X_0, X_1))^2]$ and find the resulting MMSE. (Hint: Think! Very little computation is needed.)
(b) Find $P[X_2 \ge 4 | X_0 = \frac{1}{\pi}, X_1 = 3]$. You can express your answer using the Gaussian $Q$ function $Q(c) = \int_c^\infty \frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$. (Hint: Think! Very little computation is needed.)

7.19 First order differential equation driven by Gaussian white noise
Let $X$ be the solution of the ordinary differential equation $X' = -X + N$, with initial condition $x_0$, where $N = (N_t : t \ge 0)$ is a real valued Gaussian white noise with $R_N(\tau) = \sigma^2\delta(\tau)$ for some constant $\sigma^2 > 0$. Although $N$ is not an ordinary random process, we can interpret this as the condition that $N$ is a Gaussian random process with mean $\mu_N = 0$ and correlation function $R_N(\tau) = \sigma^2\delta(\tau)$.
(a) Find the mean function $\mu_X(t)$ and covariance function $C_X(s, t)$.
(b) Verify that $X$ is a Markov process by checking the necessary and sufficient condition: $C_X(r, s)C_X(s, t) = C_X(r, t)C_X(s, s)$ whenever $r < s < t$. (Note: The very definition of $X$ also suggests that $X$ is a Markov process, because if $t$ is the "present time," the future of $X$ depends only on $X_t$ and the future of the white noise. The future of the white noise is independent of the past $(X_s : s \le t)$. Thus, the present value $X_t$ contains all the information from the past of $X$ that is relevant to the future of $X$. This is the continuous-time analog of the discrete-time Kalman state equation.)
(c) Find the limits of $\mu_X(t)$ and $R_X(t + \tau, t)$ as $t \to \infty$. (Because these limits exist, $X$ is said to be asymptotically WSS.)

7.20 KL expansion of a simple random process
Let $X$ be a WSS random process with mean zero and autocorrelation function $R_X(\tau) = 100(\cos(10\pi\tau))^2 = 50 + 50\cos(20\pi\tau)$.
(a) Is $X$ mean square differentiable? (Justify your answer.)
(b) Is $X$ mean ergodic in the m.s. sense? (Justify your answer.)
(c) Describe a set of eigenfunctions and corresponding eigenvalues for the Karhunen-Loève expansion of $(X_t : 0 \le t \le 1)$.

7.21 KL expansion of a finite rank process
Suppose $Z = (Z_t : 0 \le t \le T)$ has the form $Z_t = \sum_{n=1}^{N} X_n\xi_n(t)$ such that the functions $\xi_1, \ldots, \xi_N$ are orthonormal over the interval $[0, T]$, and the vector $X = (X_1, \ldots, X_N)^T$ has a correlation matrix $K$ with $\det(K) \ne 0$. The process $Z$ is said to have rank $N$. Suppose $K$ is not diagonal. Describe the Karhunen-Loève expansion of $Z$. That is, describe an orthonormal basis $(\phi_n : n \ge 1)$, and eigenvalues for the KL expansion of $Z$, in terms of the given functions $(\xi_n)$ and correlation matrix $K$. Also, describe how the coordinates $(Z, \phi_n)$ are related to $X$.

7.22 KL expansion for derivative process
Suppose that $X = (X_t : 0 \le t \le 1)$ is a m.s. continuously differentiable random process on the interval $[0, 1]$. Differentiating the KL expansion of $X$ yields $X'(t) = \sum_n (X, \phi_n)\phi_n'(t)$, which looks similar to a KL expansion for $X'$, but it may be that the functions $\phi_n'$ are not orthonormal. For some cases it is not difficult to identify the KL expansion for $X'$. To explore this, let $(\phi_n(t))$, $((X, \phi_n))$, and $(\lambda_n)$ denote the eigenfunctions, coordinate random variables, and eigenvalues, for the KL expansion of $X$ over the interval $[0, 1]$. Let $(\psi_k(t))$, $((X', \psi_k))$, and $(\mu_k)$ denote the corresponding quantities for $X'$. For each of the following choices of $(\phi_n(t))$, express the eigenfunctions, coordinate random variables, and eigenvalues for $X'$ in terms of those for $X$:
(a) $\phi_n(t) = e^{2\pi jnt}$, $n \in \mathbb{Z}$
(b) $\phi_1(t) = 1$, $\phi_{2k}(t) = \sqrt{2}\cos(2\pi kt)$, and $\phi_{2k+1}(t) = \sqrt{2}\sin(2\pi kt)$ for $k \ge 1$.
(c) $\phi_n(t) = \sqrt{2}\sin\left(\frac{(2n+1)\pi t}{2}\right)$, $n \ge 0$. (Hint: Sketch $\phi_n$ and $\phi_n'$ for $n = 1, 2, 3$.)
(d) $\phi_1(t) = c_1(1 + \sqrt{3}t)$ and $\phi_2(t) = c_2(1 - \sqrt{3}t)$. (Suppose $\lambda_n = 0$ for $n \notin \{1, 2\}$. The constants $c_n$ should be selected so that $\|\phi_n\| = 1$ for $n = 1, 2$, but there is no need to calculate the constants for this problem.)

7.23 An infinitely differentiable process
Let $X = (X_t : t \in \mathbb{R})$ be WSS with autocorrelation function $R_X(\tau) = e^{-\tau^2/2}$.
(a) Show that $X$ is $k$-times differentiable in the m.s. sense, for all $k \ge 1$.
(b) Let $X^{(k)}$ denote the $k$th derivative process of $X$, for $k \ge 1$. Is $X^{(k)}$ mean ergodic in the m.s. sense for each $k$? Justify your answer.

7.24 Mean ergodicity of a periodic WSS random process
Let $X$ be a mean zero periodic WSS random process with period $T > 0$. Recall that $X$ has a power spectral representation
$$X_t = \sum_{n\in\mathbb{Z}} \hat{X}_n e^{2\pi jnt/T},$$
where the coefficients $\hat{X}_n$ are orthogonal random variables. The power spectral mass function of $X$ is the discrete mass function $p_X$ supported on frequencies of the form $\frac{2\pi n}{T}$, such that $E[|\hat{X}_n|^2] = p_X(\frac{2\pi n}{T})$. Under what conditions on $p_X$ is the process $X$ mean ergodic in the m.s. sense? Justify your answer.

7.25 Application of the KL expansion to estimation
Let $X = (X_t : 0 \le t \le T)$ be a random process given by $X_t = AB\sin(\frac{\pi t}{T})$, where $A$ and $T$ are positive constants and $B$ is a $N(0, 1)$ random variable. Think of $X$ as an amplitude modulated random signal.
(a) What is the expected total energy of $X$?
(b) What are the mean and covariance functions of $X$?
(c) Describe the Karhunen-Loève expansion of $X$. (Hint: Only one eigenvalue is nonzero, call it $\lambda_1$. What are $\lambda_1$, the corresponding eigenfunction $\phi_1$, and the first coordinate $X_1 = (X, \phi_1)$? You don't need to explicitly identify the other eigenfunctions $\phi_2, \phi_3, \ldots$. They can simply be taken to fill out a complete orthonormal basis.)
(d) Let $N = (N_t : 0 \le t \le T)$ be a real-valued Gaussian white noise process independent of $X$ with $R_N(\tau) = \sigma^2\delta(\tau)$, and let $Y = X + N$. Think of $Y$ as a noisy observation of $X$. The same basis functions used for $X$ can be used for the Karhunen-Loève expansions of $N$ and $Y$. Let $N_1 = (N, \phi_1)$ and $Y_1 = (Y, \phi_1)$. Note that $Y_1 = X_1 + N_1$. Find $E[B | Y_1]$ and the resulting mean square error. (Remark: The other coordinates $Y_2, Y_3, \ldots$ are independent of both $X$ and $Y_1$, and are thus useless for the purpose of estimating $B$. Thus, $E[B | Y_1]$ is equal to $E[B | Y]$, the MMSE estimate of $B$ given the entire observation process $Y$.)

7.26 * An autocorrelation function or not?
Let $R_X(s, t) = \cosh(a(|s - t| - 0.5))$ for $-0.5 \le s, t \le 0.5$, where $a$ is a positive constant. Is $R_X$ the autocorrelation function of a random process of the form $X = (X_t : -0.5 \le t \le 0.5)$? If not, explain why not. If so, give the Karhunen-Loève expansion for $X$.

7.27 * On the conditions for m.s. differentiability
(a) Let $f(t) = \begin{cases} t^2\sin(1/t^2) & t \ne 0 \\ 0 & t = 0 \end{cases}$. Sketch $f$ and show that $f$ is differentiable over all of $\mathbb{R}$, and find the derivative function $f'$. Note that $f'$ is not continuous, and $\int_{-1}^{1} f'(t)\,dt$ is not well defined, whereas this integral would equal $f(1) - f(-1)$ if $f'$ were continuous.
(b) Let $X_t = Af(t)$, where $A$ is a random variable with mean zero and variance one. Show that $X$ is m.s. differentiable.
(c) Find $R_X$. Show that $\partial_1 R_X$ and $\partial_2\partial_1 R_X$ exist but are not continuous.

Chapter 8

Random Processes in Linear Systems and Spectral Analysis

Random processes can be passed through linear systems in much the same way as deterministic signals can. A time-invariant linear system is described in the time domain by an impulse response function, and in the frequency domain by the Fourier transform of the impulse response function. In a sense we shall see that Fourier transforms provide a diagonalization of WSS random processes, just as the Karhunen-Loève expansion allows for the diagonalization of a random process defined on a finite interval. While a m.s. continuous random process on a finite interval has a finite average energy, a WSS random process has a finite mean average energy per unit time, called the power.

Nearly all the definitions and results of this chapter can be carried through in either discrete time or continuous time. The set of frequencies relevant for continuous-time random processes is all of $\mathbb{R}$, while the set of frequencies relevant for discrete-time random processes is the interval $[-\pi, \pi]$. For ease of notation we shall primarily concentrate on continuous-time processes and systems in the first two sections, and give the corresponding definition for discrete time in the third section.

Representations of baseband random processes and narrowband random processes are discussed in Sections 8.4 and 8.5. Roughly speaking, baseband random processes are those which have power only in low frequencies. A baseband random process can be recovered from samples taken at a sampling frequency that is at least twice as large as the largest frequency component of the process. Thus, operations and statistical calculations for a continuous-time baseband process can be reduced to considerations for the discrete time sampled process. Roughly speaking, narrowband random processes are those processes which have power only in a band (i.e. interval) of frequencies. A narrowband random process can be represented as a baseband random process that is modulated by a deterministic sinusoid. Complex random processes naturally arise as baseband equivalent processes for real-valued narrowband random processes. A related discussion of complex random processes is given in the last section of the chapter.

8.1 Basic definitions

The output $(Y_t : t \in \mathbb{R})$ of a linear system with impulse response function $h(s, t)$ and a random process input $(X_t : t \in \mathbb{R})$ is defined by
$$Y_s = \int_{-\infty}^{\infty} h(s, t)X_t\,dt. \tag{8.1}$$
See Figure 8.1.

[Figure 8.1: A linear system with input $X$, impulse response function $h$, and output $Y$.]

For example, the linear system could be a simple integrator from time zero, defined by
$$Y_s = \begin{cases} \int_0^s X_t\,dt & s \ge 0 \\ 0 & s < 0, \end{cases}$$
in which case the impulse response function is
$$h(s, t) = \begin{cases} 1 & s \ge t \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
The integral (8.1) defining the output $Y$ will be interpreted in the m.s. sense. Thus, the integral defining $Y_s$ for $s$ fixed exists if and only if the following Riemann integral exists and is finite:
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h^*(s, \tau)h(s, t)R_X(t, \tau)\,dt\,d\tau. \tag{8.2}$$
A sufficient condition for $Y_s$ to be well defined is that $R_X$ is a bounded continuous function, and $h(s, t)$ is continuous in $t$ with $\int_{-\infty}^{\infty} |h(s, t)|\,dt < \infty$. The mean function of the output is given by
$$\mu_Y(s) = E\left[\int_{-\infty}^{\infty} h(s, t)X_t\,dt\right] = \int_{-\infty}^{\infty} h(s, t)\mu_X(t)\,dt. \tag{8.3}$$
As illustrated in Figure 8.2, the mean function of the output is the result of passing the mean function of the input through the linear system.

[Figure 8.2: A linear system with input $\mu_X$ and impulse response function $h$.]

The cross correlation function between the output and input processes is given by
$$R_{YX}(s, \tau) = E\left[\int_{-\infty}^{\infty} h(s, t)X_t\,dt\,X_\tau^*\right] = \int_{-\infty}^{\infty} h(s, t)R_X(t, \tau)\,dt, \tag{8.4}$$
and the correlation function of the output is given by
$$R_Y(s, u) = E\left[Y_s\left(\int_{-\infty}^{\infty} h(u, \tau)X_\tau\,d\tau\right)^*\right] = \int_{-\infty}^{\infty} h^*(u, \tau)R_{YX}(s, \tau)\,d\tau \tag{8.5}$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h^*(u, \tau)h(s, t)R_X(t, \tau)\,dt\,d\tau. \tag{8.6}$$
Recall that $Y_s$ is well defined as a m.s. integral if and only if the integral (8.2) is well defined and finite. Comparing with (8.6), it means that $Y_s$ is well defined if and only if the right side of (8.6) with $u = s$ is well defined and gives a finite value for $E[|Y_s|^2]$.

The linear system is time invariant if $h(s, t)$ depends on $s, t$ only through $s - t$. If the system is time invariant we write $h(s - t)$ instead of $h(s, t)$, and with this substitution the defining relation (8.1) becomes a convolution: $Y = h * X$.

A linear system is called bounded input bounded output (bibo) stable if the output is bounded whenever the input is bounded. In case the system is time invariant, bibo stability is equivalent to the condition
$$\int_{-\infty}^{\infty} |h(\tau)|\,d\tau < \infty. \tag{8.7}$$
In particular, if (8.7) holds and if an input signal $x$ satisfies $|x_s| < L$ for all $s$, then the output signal $y = x * h$ satisfies
$$|y(t)| \le \int_{-\infty}^{\infty} |h(t - s)|L\,ds = L\int_{-\infty}^{\infty} |h(\tau)|\,d\tau$$
for all $t$. If $X$ is a WSS random process then by the Schwarz inequality, $R_X$ is bounded by $R_X(0)$. Thus, if $X$ is WSS and m.s. continuous, and if the linear system is time-invariant and bibo stable, the integral in (8.2) exists and is bounded by
$$R_X(0)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} |h(s - \tau)||h(s - t)|\,dt\,d\tau = R_X(0)\left(\int_{-\infty}^{\infty} |h(\tau)|\,d\tau\right)^2 < \infty.$$
Thus, the output of a linear, time-invariant bibo stable system is well defined in the m.s. sense if the input is a stationary, m.s. continuous process.

A paragraph about convolutions is in order. It is useful to be able to recognize convolution integrals in disguise. If $f$ and $g$ are functions on $\mathbb{R}$, the convolution is the function $f * g$ defined by
$$f * g(t) = \int_{-\infty}^{\infty} f(s)g(t - s)\,ds$$
or equivalently
$$f * g(t) = \int_{-\infty}^{\infty} f(t - s)g(s)\,ds$$
or equivalently, for any real $a$ and $b$,
$$f * g(a + b) = \int_{-\infty}^{\infty} f(a + s)g(b - s)\,ds.$$
A simple change of variable shows that the above three expressions are equivalent. However, in order to immediately recognize a convolution, the salient feature is that the convolution is the integral of the product of $f$ and $g$, with the arguments of both $f$ and $g$ ranging over $\mathbb{R}$ in such a way that the sum of the two arguments is held constant. The value of the constant is the value at which the convolution is being evaluated. Convolution is commutative: $f * g = g * f$ and associative: $(f * g) * k = f * (g * k)$ for three functions $f, g, k$. We simply write $f * g * k$ for $(f * g) * k$. The convolution $f * g * k$ is equal to a double integral of the product of $f$, $g$, and $k$, with the arguments of the three functions ranging over all triples in $\mathbb{R}^3$ with a constant sum. The value of the constant is the value at which the convolution is being evaluated. For example,
$$f * g * k(a + b + c) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(a + s + t)g(b - s)k(c - t)\,ds\,dt.$$

Suppose that $X$ is WSS and that the linear system is time invariant. Then (8.3) becomes
$$\mu_Y(s) = \int_{-\infty}^{\infty} h(s - t)\mu_X\,dt = \mu_X\int_{-\infty}^{\infty} h(t)\,dt.$$
Observe that $\mu_Y(s)$ does not depend on $s$. Equation (8.4) becomes
$$R_{YX}(s, \tau) = \int_{-\infty}^{\infty} h(s - t)R_X(t - \tau)\,dt = h * R_X(s - \tau), \tag{8.8}$$
which in particular means that $R_{YX}(s, \tau)$ is a function of $s - \tau$ alone. Equation (8.5) becomes
$$R_Y(s, u) = \int_{-\infty}^{\infty} h^*(u - \tau)R_{YX}(s - \tau)\,d\tau. \tag{8.9}$$
The right side of (8.9) looks nearly like a convolution, but as $\tau$ varies the sum of the two arguments is $u - \tau + s - \tau$, which is not constant as $\tau$ varies. To arrive at a true convolution, define the new function $\tilde{h}$ by $\tilde{h}(v) = h^*(-v)$. Using the definition of $\tilde{h}$ and (8.8) in (8.9) yields
$$R_Y(s, u) = \int_{-\infty}^{\infty} \tilde{h}(\tau - u)(h * R_X)(s - \tau)\,d\tau = h * \tilde{h} * R_X(s - u), \tag{8.10}$$
which in particular means that $R_Y(s, u)$ is a function of $s - u$ alone.
The value of the constant is the value at which the convolution is being evaluated. g. A simple change of variable shows that the above three expressions are equivalent.9) The right side of (8.4) becomes ∞ RY X (s.3) becomes ∞ ∞ µY (s) = −∞ h(s − t)µX dt = µX −∞ h(t)dt Observe that µY (s) does not depend on s. τ ) = −∞ h(s − t)RX (t − τ )dt (8. We simply write f ∗ g ∗ k for (f ∗ g) ∗ k. The convolution f ∗ g ∗ k is equal to a double integral of the product of f . which in particular means that RY X (s. .8) in (8. τ ) is a function of s − τ alone. However.5) becomes ∞ RY (s. u) = −∞ h(τ − u)(h ∗ RX )(s − τ )dτ = h ∗ (h ∗ RX )(s − u) = h ∗ h ∗ RX (s − u) which in particular means that RY (s. The value of the constant is the value at which the convolution is being evaluated. define the new function h by h(v) = h∗ (−v). ∞ ∞ f ∗ g ∗ k(a + b + c) = −∞ −∞ f (a + s + t)g(b − s)k(c − t)dsdt.9) yields ∞ RY (s. but as τ varies the sum of the two arguments is u − τ + s − τ . u) = −∞ h∗ (u − τ )RY X (s − τ )dτ. for any real a and b ∞ f ∗ g(a + b) = −∞ f (a + s)g(b − s)ds. Equation (8. The derivations are the same except that covariances rather than correlations are computed. TRANSFER FUNCTIONS AND POWER SPECTRAL DENSITIES249 To summarize. RY X and RY also hold for the covariance functions CX . CY X . so this is a good point to begin using Fourier transforms.12) Some important properties of Fourier transforms are stated next.8. then X and Y are jointly WSS with ∞ h(t)dt RY X = h ∗ RX RY = h ∗ h ∗ RX . if X is WSS and the system is linear and time invariant.11) = −∞ The expression shows that h ∗ h(t) is the correlation between h and h∗ translated by t from the origin. if X is WSS and if the linear system is time invariant.2. 8. In particular. then CY X = h ∗ CX and CY = h ∗ h ∗ CX . can also be written as ∞ h ∗ h(t) = −∞ ∞ h(s)h(t − s)ds h(s)h∗ (s − t)ds (8.2 Fourier transforms. 
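The time-invariant relation $R_Y = h * \tilde{h} * R_X$ can be sanity-checked numerically. The sketch below is only an illustration under assumed choices that are not part of the notes at this point: a rectangular impulse response $h(t) = 1/T$ on $[0, T]$, an input correlation $R_X(\tau) = e^{-|\tau|}$, and a simple Riemann-sum discretization with step `dt`. It computes $h * \tilde{h}$ from (8.11) and compares $R_Y(0)$ with the closed form $2(T - 1 + e^{-T})/T^2$ obtained by evaluating the double integral (8.6) directly for these choices.

```python
import numpy as np

# Illustrative (assumed) choices, not taken from the notes:
T = 2.0      # support of the rectangular impulse response
dt = 1e-3    # grid spacing used for the Riemann sums

n = int(round(T / dt))
h = np.full(n, 1.0 / T)          # h(t) = 1/T sampled on [0, T)

# (h * h~)(t) = integral of h(s) h*(s - t) ds, cf. (8.11); h is real here.
hh = np.correlate(h, h, mode="full") * dt
lags = (np.arange(hh.size) - (n - 1)) * dt   # lags from -(T - dt) to T - dt


def R_X(tau):
    """Assumed input autocorrelation R_X(tau) = exp(-|tau|)."""
    return np.exp(-np.abs(tau))


# R_Y(0) = (h * h~ * R_X)(0). Both factors are even in the lag for this
# example, so the sign convention of the lag does not matter.
R_Y0 = float(np.sum(hh * R_X(lags)) * dt)

# Closed form of the double integral (8.6) with u = s for this h and R_X.
R_Y0_exact = 2.0 * (T - 1.0 + np.exp(-T)) / T**2

print(R_Y0, R_Y0_exact)
```

The array `hh` is a triangle with peak value $1/T$ at lag zero, which is the shape one expects from correlating a rectangle with itself.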
8.2 Fourier transforms, transfer functions and power spectral densities

Fourier transforms convert convolutions into products, so this is a good point to begin using Fourier transforms. The Fourier transform of a function $g$ mapping $\mathbb{R}$ to the complex numbers $\mathbb{C}$ is formally defined by
$$ \widehat{g}(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t} g(t) \, dt \tag{8.12} $$
Some important properties of Fourier transforms are stated next.

Linearity: $\widehat{ag + bh} = a\widehat{g} + b\widehat{h}$
Inversion: $g(t) = \int_{-\infty}^{\infty} e^{j\omega t} \widehat{g}(\omega) \frac{d\omega}{2\pi}$
Convolution to multiplication: $\widehat{g * h} = \widehat{g}\,\widehat{h}$ and $\widehat{g} * \widehat{h} = 2\pi \widehat{gh}$
Parseval's identity: $\int_{-\infty}^{\infty} g(t) h^*(t) \, dt = \int_{-\infty}^{\infty} \widehat{g}(\omega) \widehat{h}^*(\omega) \frac{d\omega}{2\pi}$
Transform of time reversal: $\widehat{\tilde{h}} = \widehat{h}^*$, where $\tilde{h}(t) = h^*(-t)$
Differentiation to multiplication by $j\omega$: $\widehat{\frac{dg}{dt}}(\omega) = (j\omega)\widehat{g}(\omega)$
Pure sinusoid to delta function: For $\omega_o$ fixed: $\widehat{e^{j\omega_o t}}(\omega) = 2\pi\delta(\omega - \omega_o)$
Delta function to pure sinusoid: For $t_o$ fixed: $\widehat{\delta(t - t_o)}(\omega) = e^{-j\omega t_o}$

The inversion formula above shows that a function $g$ can be represented as an integral (basically a limiting form of linear combination) of sinusoidal functions of time $e^{j\omega t}$, and $\widehat{g}(\omega)$ is the coefficient in the representation for each $\omega$. Parseval's identity applied with $g = h$ yields that the total energy of $g$ (the square of the $L^2$ norm) can be computed in either the time or frequency domain:
$$ \|g\|^2 = \int_{-\infty}^{\infty} |g(t)|^2 \, dt = \int_{-\infty}^{\infty} |\widehat{g}(\omega)|^2 \frac{d\omega}{2\pi}. $$
The factor $2\pi$ in the formulas can be attributed to the use of frequency $\omega$ in radians. If $\omega = 2\pi f$, then $f$ is the frequency in Hertz (Hz) and $\frac{d\omega}{2\pi}$ is simply $df$.

The Fourier transform can be defined for a very large class of functions, including generalized functions such as delta functions. In these notes we won't attempt a systematic treatment, but will use Fourier transforms with impunity. In applications, one is often forced to determine in what senses the transform is well defined on a case-by-case basis. Two sufficient conditions for the Fourier transform of $g$ to be well defined are mentioned in the remainder of this paragraph. The relation (8.12) defining a Fourier transform of $g$ is well defined if, for example, $g$ is a continuous function which is integrable: $\int_{-\infty}^{\infty} |g(t)| \, dt < \infty$, and in this case the dominated convergence theorem implies that $\widehat{g}$ is a continuous function. The Fourier transform can also be naturally defined whenever $g$ has a finite $L^2$ norm, through the use of Parseval's identity. The idea is that if $g$ has finite $L^2$ norm, then it is the limit in the $L^2$ norm of a sequence of functions $g_n$ which are integrable. Owing to Parseval's identity, the Fourier transforms $\widehat{g}_n$ form a Cauchy sequence in the $L^2$ norm, and hence have a limit, which is defined to be $\widehat{g}$.

Return now to consideration of a linear time-invariant system with an impulse response function $h = (h(\tau) : \tau \in \mathbb{R})$. The Fourier transform of $h$ is used so often that a special name and notation is used: it is called the transfer function and is denoted by $H(\omega)$.

The output signal $y = (y_t : t \in \mathbb{R})$ for an input signal $x = (x_t : t \in \mathbb{R})$ is given in the time domain by the convolution $y = x * h$. In the frequency domain this becomes $\widehat{y}(\omega) = H(\omega)\widehat{x}(\omega)$. For example, given $a < b$ let $H_{[a,b]}(\omega)$ be the ideal bandpass transfer function for frequency band $[a,b]$, defined by
$$ H_{[a,b]}(\omega) = \begin{cases} 1 & a \le \omega \le b \\ 0 & \text{otherwise.} \end{cases} \tag{8.13} $$
If $x$ is the input and $y$ is the output of a linear system with transfer function $H_{[a,b]}$, then the relation $\widehat{y}(\omega) = H_{[a,b]}(\omega)\widehat{x}(\omega)$ shows that the frequency components of $x$ in the frequency band $[a,b]$ pass through the filter unchanged, and the frequency components of $x$ outside of the band are completely nulled. The total energy of the output function $y$ can therefore be interpreted as the energy of $x$ in the frequency band $[a,b]$. Therefore,
$$ \text{Energy of } x \text{ in frequency interval } [a,b] = \|y\|^2 = \int_{-\infty}^{\infty} |H_{[a,b]}(\omega)|^2 |\widehat{x}(\omega)|^2 \frac{d\omega}{2\pi} = \int_a^b |\widehat{x}(\omega)|^2 \frac{d\omega}{2\pi}. $$
Consequently, it is appropriate to call $|\widehat{x}(\omega)|^2$ the energy spectral density of the deterministic signal $x$.

Given a WSS random process $X = (X_t : t \in \mathbb{R})$, the Fourier transform of its correlation function $R_X$ is denoted by $S_X$. For reasons that we will soon see, the function $S_X$ is called the power spectral density of $X$. Similarly, if $Y$ and $X$ are jointly WSS, then the Fourier transform of $R_{YX}$ is denoted by $S_{YX}$, called the cross power spectral density function of $Y$ and $X$. The Fourier transform of the time reverse complex conjugate function $\tilde{h}$ is equal to $H^*$, so $|H(\omega)|^2$ is the Fourier transform of $h * \tilde{h}$. With the above notation, the second moment relationships in (8.10) become:
$$ S_{YX}(\omega) = H(\omega) S_X(\omega) \qquad S_Y(\omega) = |H(\omega)|^2 S_X(\omega) $$

Let us examine some of the properties of the power spectral density, $S_X$. If $\int_{-\infty}^{\infty} |R_X(t)| \, dt < \infty$ then $S_X$ is well defined and is a continuous function. Because $R_{YX} = \tilde{R}_{XY}$, it follows that $S_{YX} = S_{XY}^*$. In particular, taking $Y = X$ yields $R_X = \tilde{R}_X$ and $S_X = S_X^*$, meaning that $S_X$ is real-valued.

The Fourier inversion formula applied to $S_X$ yields that $R_X(\tau) = \int_{-\infty}^{\infty} e^{j\omega\tau} S_X(\omega) \frac{d\omega}{2\pi}$. In particular,
$$ E[|X_t|^2] = R_X(0) = \int_{-\infty}^{\infty} S_X(\omega) \frac{d\omega}{2\pi}. \tag{8.14} $$
The expectation $E[|X_t|^2]$ is called the power (or total power) of $X$, because if $X_t$ is a voltage or current across a resistor, $|X_t|^2$ is the instantaneous rate of dissipation of heat energy. Therefore, (8.14) means that the total power of $X$ is the integral of $S_X$ over $\mathbb{R}$. This is the first hint that the name power spectral density for $S_X$ is justified.

Let $a < b$ and let $Y$ denote the output when the WSS process $X$ is passed through the linear time-invariant system with transfer function $H_{[a,b]}$ defined by (8.13). The process $Y$ represents the part of $X$ in the frequency band $[a,b]$. By the relation $S_Y = |H_{[a,b]}|^2 S_X$ and the power relationship (8.14) applied to $Y$, we have
$$ \text{Power of } X \text{ in frequency interval } [a,b] = E[|Y_t|^2] = \int_{-\infty}^{\infty} S_Y(\omega) \frac{d\omega}{2\pi} = \int_a^b S_X(\omega) \frac{d\omega}{2\pi} \tag{8.15} $$
Two observations can be made concerning (8.15). First, the integral of $S_X$ over any interval $[a,b]$ is nonnegative. If $S_X$ is continuous, this implies that $S_X$ is nonnegative. Even if $S_X$ is not continuous, we can conclude that $S_X$ is nonnegative except possibly on a set of zero measure. The second observation is that (8.15) fully justifies the name "power spectral density of X" given to $S_X$.

Example 8.2.1 Suppose $X$ is a WSS process and that $Y$ is a moving average of $X$ with averaging window duration $T$ for some $T > 0$:
$$ Y_t = \frac{1}{T} \int_{t-T}^{t} X_s \, ds $$
Equivalently, $Y$ is the output of the linear time-invariant system with input $X$ and impulse response function $h$ given by
$$ h(\tau) = \begin{cases} \frac{1}{T} & 0 \le \tau \le T \\ 0 & \text{else.} \end{cases} $$
The output correlation function is given by $R_Y = h * \tilde{h} * R_X$. Using (8.11) and referring to Figure 8.3 we find that $h * \tilde{h}$ is a triangular shaped waveform:
$$ h * \tilde{h}(\tau) = \frac{1}{T} \left( 1 - \frac{|\tau|}{T} \right)_+ $$
Similarly, $C_Y = h * \tilde{h} * C_X$. Let's find in particular an expression for the variance of $Y_t$ in terms of the function $C_X$:
$$ \mathrm{Var}(Y_t) = C_Y(0) = \int_{-\infty}^{\infty} (h * \tilde{h})(0 - \tau) C_X(\tau) \, d\tau = \frac{1}{T} \int_{-T}^{T} \left( 1 - \frac{|\tau|}{T} \right) C_X(\tau) \, d\tau \tag{8.16} $$
The expression in (8.16) arose earlier in these notes, in the section on mean ergodicity.

Figure 8.3: Convolution of two rectangle functions.

Let's see the effect of the linear system on the power spectral density of the input. Observe that
$$ H(\omega) = \int_{-\infty}^{\infty} e^{-j\omega t} h(t) \, dt = \frac{1}{T} \left[ \frac{e^{-j\omega T} - 1}{-j\omega} \right] = \frac{2 e^{-j\omega T/2}}{T\omega} \left[ \frac{e^{j\omega T/2} - e^{-j\omega T/2}}{2j} \right] = e^{-j\omega T/2} \left[ \frac{\sin(\omega T/2)}{\omega T/2} \right] $$
Equivalently, using the substitution $\omega = 2\pi f$,
$$ H(2\pi f) = e^{-j\pi f T} \, \mathrm{sinc}(fT) $$
where in these notes the sinc function is defined by
$$ \mathrm{sinc}(u) = \begin{cases} \frac{\sin(\pi u)}{\pi u} & u \ne 0 \\ 1 & u = 0. \end{cases} \tag{8.17} $$
(Some authors use somewhat different definitions for the sinc function.) Therefore $|H(2\pi f)|^2 = |\mathrm{sinc}(fT)|^2$, so that the output power spectral density is given by $S_Y(2\pi f) = S_X(2\pi f) |\mathrm{sinc}(fT)|^2$. See Figure 8.4.

Figure 8.4: The sinc function and the impulse response function.

Example 8.2.2 Consider two linear time-invariant systems in parallel as shown in Figure 8.5. The first has input $X$, impulse response function $h$, and output $U$. The second has input $Y$, impulse response function $k$, and output $V$. Suppose that $X$ and $Y$ are jointly WSS. We can find $R_{UV}$ as follows. The main trick is notational: to use enough different variables of integration so that none are used twice.
$$ R_{UV}(t,\tau) = E\left[ \int_{-\infty}^{\infty} h(t-s) X_s \, ds \left( \int_{-\infty}^{\infty} k(\tau - v) Y_v \, dv \right)^* \right] $$
$$ = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(t-s) R_{XY}(s-v) k^*(\tau - v) \, ds \, dv $$
$$ = \int_{-\infty}^{\infty} \{ h * R_{XY}(t-v) \} k^*(\tau - v) \, dv $$
$$ = h * \tilde{k} * R_{XY}(t - \tau). $$
Note that $R_{UV}(t,\tau)$ is a function of $t - \tau$ alone. Together with the fact that $U$ and $V$ are individually WSS, this implies that $U$ and $V$ are jointly WSS, and $R_{UV} = h * \tilde{k} * R_{XY}$. The relationship is expressed in the frequency domain as $S_{UV} = H K^* S_{XY}$, where $K$ is the Fourier transform of $k$. Special cases of this example include the case that $X = Y$ or $h = k$.

Figure 8.5: Parallel linear systems.

Example 8.2.3 Consider the circuit with a resistor and a capacitor shown in Figure 8.6. Take as the input signal the voltage difference on the left side, and as the output signal the voltage across the capacitor. Also, let $q_t$ denote the charge on the upper side of the capacitor. Let us first identify the impulse response function by assuming a deterministic input $x$ and a corresponding output $y$. The elementary equations for resistors and capacitors yield
$$ \frac{dq}{dt} = \frac{1}{R}(x_t - y_t) \quad \text{and} \quad y_t = \frac{q_t}{C}. $$

Figure 8.6: An RC circuit modeled as a linear system.

Therefore
$$ \frac{dy}{dt} = \frac{1}{RC}(x_t - y_t) $$
which in the frequency domain is
$$ j\omega \widehat{y}(\omega) = \frac{1}{RC}(\widehat{x}(\omega) - \widehat{y}(\omega)) $$
so that $\widehat{y} = H\widehat{x}$ for the system transfer function $H$ given by
$$ H(\omega) = \frac{1}{1 + RCj\omega} $$
Suppose, for example, that the input $X$ is a real-valued, stationary Gaussian Markov process, so that its autocorrelation function has the form $R_X(\tau) = A^2 e^{-\alpha|\tau|}$ for some constants $A^2$ and $\alpha > 0$. Then
$$ S_X(\omega) = \frac{2A^2\alpha}{\omega^2 + \alpha^2} $$
and
$$ S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{2A^2\alpha}{(\omega^2 + \alpha^2)(1 + (RC\omega)^2)} $$

Example 8.2.4 A random signal, modeled by the input random process $X$, is passed into a linear time-invariant system with feedback and with noise modeled by the random process $N$, as shown in Figure 8.7. The output is denoted by $Y$. Assume that $X$ and $N$ are jointly WSS and that the random variables comprising $X$ are orthogonal to the random variables comprising $N$: $R_{XN} = 0$. Assume also, for the sake of system stability, that the magnitude of the gain around the loop satisfies $|H_3(\omega)H_1(\omega)H_2(\omega)| < 1$ for all $\omega$ such that $S_X(\omega) > 0$ or $S_N(\omega) > 0$. We shall express the output power spectral density $S_Y$ in terms of the power spectral densities of $X$ and $N$, and the three transfer functions $H_1$, $H_2$, and $H_3$. An expression for the signal-to-noise power ratio at the output will also be computed.

Figure 8.7: A feedback system.

Under the assumed stability condition, the linear system can be written in the equivalent form shown in Figure 8.8. The process $\tilde{X}$ is the output due to the input signal $X$, and $\tilde{N}$ is the output due to the input noise $N$. The structure in Figure 8.8 is the same as considered in Example 8.2.2. Since $R_{XN} = 0$ it follows that $R_{\tilde{X}\tilde{N}} = 0$, so that $S_Y = S_{\tilde{X}} + S_{\tilde{N}}$. Consequently,
$$ S_Y(\omega) = S_{\tilde{X}}(\omega) + S_{\tilde{N}}(\omega) = \frac{|H_2(\omega)|^2 \left[ |H_1(\omega)|^2 S_X(\omega) + S_N(\omega) \right]}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2} $$
The output signal-to-noise ratio is the ratio of the power of the signal at the output to the power of the noise at the output. For this example it is given by
$$ \frac{E[|\tilde{X}_t|^2]}{E[|\tilde{N}_t|^2]} = \frac{\int_{-\infty}^{\infty} \frac{|H_2(\omega)H_1(\omega)|^2 S_X(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2} \frac{d\omega}{2\pi}}{\int_{-\infty}^{\infty} \frac{|H_2(\omega)|^2 S_N(\omega)}{|1 - H_3(\omega)H_1(\omega)H_2(\omega)|^2} \frac{d\omega}{2\pi}} $$

Figure 8.8: An equivalent representation. The input $X_t$ passes through $\frac{H_2(\omega)H_1(\omega)}{1 - H_3(\omega)H_1(\omega)H_2(\omega)}$ to give $\tilde{X}_t$, the input $N_t$ passes through $\frac{H_2(\omega)}{1 - H_3(\omega)H_1(\omega)H_2(\omega)}$ to give $\tilde{N}_t$, and $Y_t = \tilde{X}_t + \tilde{N}_t$.

Example 8.2.5 Consider the linear time-invariant system defined as follows. For input signal $x$ the output signal $y$ is defined by $y''' + y' + y = x + x'$. We seek to find the power spectral density of the output process if the input is a white noise process $X$ with $R_X(\tau) = \sigma^2\delta(\tau)$ and $S_X(\omega) = \sigma^2$ for all $\omega$. To begin, we identify the transfer function of the system. In the frequency domain, the system is described by $((j\omega)^3 + j\omega + 1)\widehat{y}(\omega) = (1 + j\omega)\widehat{x}(\omega)$, so that
$$ H(\omega) = \frac{1 + j\omega}{1 + j\omega + (j\omega)^3} = \frac{1 + j\omega}{1 + j(\omega - \omega^3)}. $$
Hence,
$$ S_Y(\omega) = S_X(\omega)|H(\omega)|^2 = \frac{\sigma^2(1 + \omega^2)}{1 + (\omega - \omega^3)^2} = \frac{\sigma^2(1 + \omega^2)}{1 + \omega^2 - 2\omega^4 + \omega^6}. $$
Observe that
$$ \text{output power} = \int_{-\infty}^{\infty} S_Y(\omega) \frac{d\omega}{2\pi} < \infty. $$

8.3 Discrete-time processes in linear systems

The basic definitions and use of Fourier transforms described above carry over naturally to discrete time. In particular, if the random process $X = (X_k : k \in \mathbb{Z})$ is the input of a linear, discrete-time system with impulse response function $h$, then the output $Y$ is the random process given by
$$ Y_k = \sum_{n=-\infty}^{\infty} h(k,n) X_n. $$
The equations in Section 8.1 can be modified to hold for discrete time simply by replacing integration over $\mathbb{R}$ by summation over $\mathbb{Z}$. In particular, if $X$ is WSS and if the linear system is time-invariant then (8.10) becomes
$$ \mu_Y = \mu_X \sum_{n=-\infty}^{\infty} h(n) \qquad R_{YX} = h * R_X \qquad R_Y = h * \tilde{h} * R_X, \tag{8.18} $$
where the convolution in (8.18) is defined for functions $g$ and $h$ on $\mathbb{Z}$ by
$$ g * h(n) = \sum_{k=-\infty}^{\infty} g(n-k) h(k) $$
Again, Fourier transforms can be used to convert convolution to multiplication. The Fourier transform of a function $g = (g(n) : n \in \mathbb{Z})$ is the function $\widehat{g}$ on $[-\pi, \pi]$ defined by
$$ \widehat{g}(\omega) = \sum_{n=-\infty}^{\infty} e^{-j\omega n} g(n). $$
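For a finitely supported sequence, the discrete-time transform defined above is a finite sum and can be evaluated directly. The following sketch is an illustration only. The helper `dtft` and the length-4 averaging filter are assumptions made for this example. It evaluates the transform on a frequency grid and checks that the transform of $h * \tilde{h}$ equals $|H(\omega)|^2$, the discrete-time counterpart of the fact used in Section 8.2 and the reason the relation $R_Y = h * \tilde{h} * R_X$ in (8.18) turns into a product in the frequency domain.

```python
import numpy as np


def dtft(g, n, omega):
    """g_hat(omega) = sum over n of exp(-j*omega*n) * g(n), for finitely supported g."""
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    return np.exp(-1j * np.outer(omega, n)) @ np.asarray(g, dtype=complex)


# Assumed example: averaging filter of length 4, h(n) = 1/4 for n = 0, 1, 2, 3.
n_h = np.arange(4)
h = np.full(4, 0.25)

# (h * h~)(k) = sum over m of h(m + k) h*(m), supported on lags k = -3, ..., 3.
hh = np.correlate(h, h, mode="full")
n_hh = np.arange(-3, 4)

omega = np.linspace(-np.pi, np.pi, 201)
H = dtft(h, n_h, omega)
hh_hat = dtft(hh, n_hh, omega)

# The transform of h * h~ should agree with |H(omega)|^2 at every frequency.
max_gap = float(np.max(np.abs(hh_hat - np.abs(H) ** 2)))
print(max_gap)
```

`np.correlate` conjugates its second argument, so the same line would also be valid for a complex-valued `h`.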
Some of the most basic properties are: Linearity: ag + bh = ag + bh Inversion: g(n) = π jωn g(ω) dω 2π −π e 1 2π gh Convolution to multiplication: g ∗ h = g h and g ∗ h = Parseval’s identity: ∞ ∗ n=−∞ g(n)h (n) = π −π g(ω)h∗ (ω) dω 2π . That is. and for −π < a < b ≤ π we have b dω Power of X in frequency interval [a. Recall that when we considered the Karhunen-Lo`ve e expansion for a periodic WSS random process of period T . 2π a so it is appropriate to call |x(ω)|2 the energy spectral density of the deterministic. b] = |x(ω)|2 . The Fourier transform and its inversion formula for discrete-time functions are equivalent to the Fourier series representation of functions in L2 [−π. π] fixed: ejωo n (ω) = 2πδ(ω − ωo ) Delta function to pure sinusoid: For no fixed: I{n=no } (ω) = e−jωno 257 The inversion formula above shows that a function g on Z can be represented as an integral (basically a limiting form of linear combination) of sinusoidal functions of time ejωn . then the Fourier transform of RY X is denoted by SY X . For −π ≤ a < b ≤ π. if Y and X are jointly WSS. The transforms used are essentially the same. DISCRETE-TIME PROCESSES IN LINEAR SYSTEMS Transform of time reversal: h = h∗ . The defining relation for the system in the time domain. and is called the power spectral density of X. the Fourier transform of h is denoted by H(ω). called the cross power spectral density function of Y and X. the second moment relationships in (8. functions on a time interval were 1 important and the power was distributed on the integers Z scaled by T . and g(ω) is the coefficient in the representation for each ω. Similarly. X 2π −π e In partic- The expectation E[|Xn |2 ] is called the power (or total power) of X. 2π −π π jωn S (ω) dω . Z is considered to be the time domain and the power is distributed over an interval. π]. becomes y(ω) = H(ω)x(ω) in the frequency domain. b] = SX (ω) 2π a . The e functions in this basis all have norm 2π. 
Given a linear time-invariant system in discrete time with an impulse response function h = (h(τ ) : τ ∈ Z). b dω Energy of x in frequency interval [a. In this section. discrete-time signal x. the Fourier transform of its correlation function RX is denoted by SX . where h(t) = h(−t)∗ Pure sinusoid to delta function: For ωo ∈ [−π. With the above notation. π dω E[|Xn |2 ] = RX (0) = SX (ω) . π] using the complete orthogonal basis (ejωn : n ∈ Z) for L2 [−π.8. y = h ∗ x. Paresval’s identity applied with g = h yields that the total energy of g (the square of the L2 norm) can be computed in either the time or frequency π 2 dω 2 domain: ||g||2 = ∞ n=−∞ |g(n)| = −π |g(ω)| 2π . but with j replaced by −j.3.18) become: SY X (ω) = H(ω)SX (ω) SY (ω) = |H(ω)|2 SX (ω) The Fourier inversion formula applied to SX yields that RX (n) = ular. the role of Z and a finite interval are interchanged. Given a WSS random process X = (Xn : n ∈ Z). as discussed in connection with the Karhunen-Lo`ve expansion. because for t = mT the only nonzero term in the sum is the one indexed by n = m. The weights are given by the Fourier transform x(w). with one-sided band limit fo Hz. define T by T = 2fo . before considering its extension to random processes. or equivalently ωo radians/second.20) for deterministic signals.20) goes as follows. and −ωo T |e−jωnT |2 dω = 1.17). (8. of sinusoidal functions of time. x over the interval [−ωo . which is essentially a sum. Taking t = nT in (8. Let fo > 0 and let ωo = 2πfo . The signal x is called a baseband signal. is finite. For such a signal. ωo ] has the 2π following Fourier series representation: ∞ x(ω) = T n=−∞ e−jωnT x(nT ) ω ∈ [−ωo . if x(ω) = 0 for |ω| ≥ ωo .23) . −∞ |x(t)|2 dt. Henceforth we will use ωo more often than fo .19) Equation (8. a ∞ function on R) such that its energy. Specifically. Nyquist’s equation (8. Let x be a continuous-time signal (i.21) = −ωo x(ω)(e−jωnT )∗ Equation (8.20) is indeed elegant. 
RANDOM PROCESSES IN LINEAR SYSTEMS AND SPECTRAL ANALYSIS 8.258CHAPTER 8. the Fourier inversion formula becomes ωo x(t) = −ωo ejωt x(ω) dω 2π (8.e. A celebrated theorem of Nyquist states that the baseband signal x is completely determined by 1 its samples taken at sampling frequency 2fo .20) where the sinc function is defined by (8. ejωt . ωo ] (8.19) displays the baseband signal x as a linear combination of the functions ejωt indexed by ω ∈ [−ωo . so it is worth remembering that ωo T = π.19) yields ωo x(nT ) = −ωo ωo ejωnT x(ω) dω 2π dω 2π (8. The functions ˆ e−jωnT . ωo ]. form a complete orthogonal ωo basis for L2 [−ωo .19) yields ∞ ωo x(t) = n=−∞ x(nT )T −ωo ejωt e−jωnT dω . Therefore. The equation shows that the sinc function gives the correct interpolation of the narrowband signal x for times in between the integer multiples of T . the signal x is an integral. We shall give a proof of (8.4 Baseband random processes Deterministic baseband signals are considered first. By the Fourier inversion formula. It obviously holds by inspection if t = mT for some integer m.22) Plugging (8. Then ∞ x(t) = n=−∞ x(nT ) sinc t − nT T . considered on the interval −ωo < ω < ωo and indexed by n ∈ Z. A proof of (8.22) into (8.21) shows that x(nT ) is given by an inner product of x and e−jωnT . ωo ]. 2π (8. It must be shown that zero as N → ∞: εN = E Xt − defined by the following expectation converges to 2 N XnT sinc n=−N t − nT t ∗ When the square is expanded.20) as desired. (8. Proposition 8.8. −∞ ∞ jωnT −∞ dω 2π t − nT T .26) Proof. terms of the form E[Xa Xb ] arise.23) can be simplified using ωo 259 T −ωo ejωτ dω τ = sinc 2π T .24)) ∞ ejωt = T = e ejωnT sinc ωo −ωo ejωt e−jωnT . εN can be expressed as an integration over ω rather than as an expectation: ∞ N εN = −∞ e jωt − n=−N e jωnT sinc t − nT T 2 SX (ω) dω . 2π Therefore.27) For t fixed. 
The sampling theorem extends naturally to WSS random processes.1 Suppose X is a WSS baseband random process with one-sided band limit ωo and let T = π/ωo .25) If B is the process of samples defined by Bn = XnT .4. But ∗ E[Xa Xb ] = RX (a − b) = ∞ −∞ ejωa (ejωb )∗ SX (ω) dω . Then for each t ∈ R ∞ Xt = n=−∞ XnT sinc t − nT T m.s. where a and b take on the values t or nT for some n. (8.24) with τ = t − nT to yield (8.4. Fix t ∈ R. BASEBAND RANDOM PROCESSES The integral in (8. then the power spectral densities of B and X are related by SB (ω) = 1 ω SX T T N for | ω |≤ π (8. 2π (8. A WSS random process X with spectral density SX is said to be a baseband random process with one-sided band limit ωo if SX (ω) = 0 for | ω |≥ ωo . the function (ejωt : −ωo < ω < ωo ) has a Fourier series representation (use (8. using a change of variable ν = T ω and the fact T = π yields (8. The representation (8.26) become if X is W SS and has a power spectral density.2 If µX = 0 and the spectral density SX of X is constant over the interval [−ωo . Therefore RB (n) = CB (n) = 0 for n = 0.4. then µB = 0 and SB (ω) is constant over the interval [−π. so the processes have the same total power.28) RB (n) = −π ejnν ν dν 1 SX ( ) . As a check on (8.260CHAPTER 8. π]. uncorrelated random variables.27) is the approximation error for the N th partial Fourier series sum for ejωt . we note that B(0) = X(0). it must be that π SB (ω) −π dω 2π ∞ = −∞ SX (ω) dω . but X is not a baseband signal? . RANDOM PROCESSES IN LINEAR SYSTEMS AND SPECTRAL ANALYSIS so that the quantity inside the absolute value signs in (8.25) is proved.4.26). 2π −ωo ejnT ω SX (ω) π ωo ∞ so. a basic result in the theory of Fourier series yields that the Fourier approximation error is bounded by a single constant for all N and ω.29) which is indeed consistent with (8. and as N → ∞ the Fourier approximation error converges to 0 uniformly on sets of the form | ω |≤ ωo − ε.26). and the samples (B(n)) are mean zero. 
Since ejωt is continuous in ω. 2π (8. Clearly B is a WSS discrete time random process with µB = µX and RB (n) = RX (nT ) = = dω 2π −∞ ωo dω ejnT ω SX (ω) . Thus εN → 0 as N → ∞ by the dominated convergence theorem. Theoretical Exercise What does (8. π] such that π RB (n) = −π ejnω SB (ω) dω 2π so (8. ωo ]. T T 2π But SB (ω) is the unique function on [−π.26) holds. The proof of Proposition 8. Thus. Example 8.1 is complete. (ωc − ωo . A convenient alternative expression for x is obtained by defining a complex valued baseband signal z by z(t) = u(t) + jv(t). The reader is encouraged to verify from (8. 2 v to the right by 2 −j ωc . an example at the end of the section is given which describes how a narrowband random process (to be defined) can be simulated using a sampling rate equal to twice the one-sided width of a frequency band of a signal. A narrowband signal arises when a sinusoidal signal is modulated by a narrowband signal. −ωc + ωo ). Then z varies slowly compared to the jωc t .8. A narrowband signal (relative to ωo and ωc ) is a signal x such that x(ω) = 0 unless ω is in the union of two intervals: the upper band. Define a signal x by x(t) = u(t) cos(ωc t) − v(t) sin(ωc t). and the development for random processes follows a similar approach. It is a good idea to keep in mind the case that ωc is much larger than ωo (written ωc ωo ). 1 u to the left by ωc .30) becomes x(ω) = 1 {u(ω − ωc ) + u(ω + ωc ) + jv(ω − ωc ) − jv(ω + ωc )} 2 (8. For example.31) that x(ω) = x∗ (−ω). x is approximately a sinusoid complex sinusoid e with frequency ωc . As an application. NARROWBAND RANDOM PROCESSES 261 8. the signal is nearly as simple as the original baseband signal. modulating a complex sinusoidal . and 2 v to the left by ωc . (8.30) j 1 Graphically. Since cos(ωc t) = (ejωc t + e−jωc t )/2 and − sin(ωc t) = (jejωc t − je−jωc t )/2. the highest frequency of the resulting product is about 200. 
The signal z is called the complex envelope of x and |z(t)| is called the real envelope of x.31) (8. rather than twice the highest frequency of the signal. If such a signal is multiplied by a signal with frequency 109 Hz.000 times larger than that of the original signal. Fortunately. Of course x is real-valued by its definition. (−ωc − ωo . because the energy or power of such a modulated signal is concentrated in a narrow band. each with one-sided bandwidth less than ωo . x(ω) = 0 if || ω | −ωc | ≥ ωo . and then adding. and phase given by the argument of z(t). the effects of filtering can be analyzed using baseband equivalent filters. In a small neighborhood of a fixed time t. x is obtained by sliding 2 u to the right by ωc . Equation (8. For example.5 Narrowband random processes As noted in the previous section. ωc + ωo ).31) shows that indeed x is a narrowband signal. Na¨ application ıve of the sampling theorem would mean that the sampling rate would have to increase by the same factor. a typical voice signal may have highest frequency 5 KHz. a signal – modeled as either a deterministic finite energy signal or a WSS random process – can be reconstructed from samples taken at a sampling rate twice the highest frequency of the signal. and the lower band.5. as defined at the beginning of the previous section. as shown next. or equivalently. Let u and v be real-valued baseband signals. peak amplitude |z(t)|. Then x(t) = Re(z(t)ejωc t ). Let ωc > ωo > 0. So far we have shown that a real-valued narrowband signal x results from modulating sinusoidal functions by a pair of real-valued baseband signals. More compactly. The motivation of this section is to see how signals and random processes with narrow spectral ranges can be analyzed in terms of equivalent baseband signals. Deterministic narrowband signals are considered first. One attempt to obtain a baseband signal from x is to consider e−jωc t x(t). the output signal x is again real-valued. time-invariant system. 
Since this transfer function satisfies H ∗ (ω) = H(−ω).32) becomes (8. so ˇ x −j sgn(ω) x Figure 8. that shifts the portion of x in the upper band to the baseband interval (−ωo .30) or (8. Note ˇ that x(t) = Re(x(t)) = Re(x(t) + j x(t)).33) and u is the Hermetian symmetric part of z and v is −j times the Hermetian antisymmetric part of z: u(ω) = 1 (z(ω) + z ∗ (−ω)) 2 v(ω) = −j (z(ω) − z ∗ (−ω)) 2 . the portion of x in the lower band gets shifted to an interval centered about −2ωc .262CHAPTER 8. In addition. denoted by x. then u and v are real-valued baseband signals such that z(t) = u(t) + jv(t). ˇ Consider the Fourier transform of x + j x. ωo ). time-invariant system with ˇ transfer function −jsgn(ω) as pictured in Figure 8. and (8. where ˇ 1 ω>0 0 ω=0 sgn(ω) = −1 ω < 0 Therefore x can be viewed as the result of passing x through a linear.9: The Hilbert transform as a linear. As desired. In summary.32). so that e−jωc t x(t) is not a baseband signal. that the Fourier transforms of x and x have the same magnitude for all nonzero ω. and the graph of this transform is obtained by sliding the graph of x(ω) to the left by ωc . This has Fourier transform x(ω + ωc ). It is equal to 2x(ω) in the upper band and it is zero ˇ elsewhere. Does every real-valued narrowband signal have such a representation? The answer is yes. z defined by z(t) = (x(t) + j x(t))e−jωc t is a baseband complex valued signal. Thus. ˇ x and x have equal energies. An elegant solution to this problem is to use the Hilbert transform of x. where z(t) = u(t) + jv(t). RANDOM PROCESSES IN LINEAR SYSTEMS AND SPECTRAL ANALYSIS function by a complex-valued baseband signal. any finite energy real-valued narrowband signal x can be represented as (8. The Fourier transform z can be expressed in terms of x by z(ω) = 2x(ω + ωc ) |ω| ≤ ωo 0 else.30). |H(ω)| = 1 for all ω.32) If we let u(t) = Re(z(t)) and v(t) = Im(z(t)). (8.9. or equivalently ˇ x(t) = Re z(t)ejωc t (8. 
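If $x_1$ and $x_2$ are each narrowband signals with corresponding complex envelopes $z_1$ and $z_2$, then the convolution $x = x_1 * x_2$ is again a narrowband signal, and the corresponding complex envelope is $\frac{1}{2}z_1 * z_2$. To see this, note that the Fourier transform, $\widehat{z}$, of the complex envelope $z$ for $x$ is given by (8.33). Similar equations hold for $\widehat{z}_i$ in terms of $\widehat{x}_i$ for $i = 1, 2$. Using these equations and the fact $\widehat{x}(\omega) = \widehat{x}_1(\omega)\widehat{x}_2(\omega)$, it is readily seen that $\widehat{z}(\omega) = \frac{1}{2}\widehat{z}_1(\omega)\widehat{z}_2(\omega)$ for all $\omega$, establishing the claim. Thus, the analysis of linear, time-invariant filtering of narrowband signals can be carried out in the baseband equivalent setting.

A similar development is considered next for WSS random processes. Let $U$ and $V$ be jointly WSS real-valued baseband random processes, and let $X$ be defined by

$$X_t = U_t\cos(\omega_c t) - V_t\sin(\omega_c t) \tag{8.34}$$

or equivalently, defining $Z_t$ by $Z_t = U_t + jV_t$,

$$X_t = \mathrm{Re}\left(Z_t e^{j\omega_c t}\right). \tag{8.35}$$

In some sort of generalized sense, we expect that $X$ is a narrowband process. However, such an $X$ need not even be WSS. Let us find the conditions on $U$ and $V$ that make $X$ WSS. First, in order that $\mu_X(t)$ not depend on $t$, it must be that $\mu_U = \mu_V = 0$.

Using the notation $c_t = \cos(\omega_c t)$, $s_t = \sin(\omega_c t)$, and $\tau = a - b$,

$$R_X(a, b) = R_U(\tau)c_a c_b - R_{UV}(\tau)c_a s_b - R_{VU}(\tau)s_a c_b + R_V(\tau)s_a s_b.$$

Using trigonometric identities such as $c_a c_b = (c_{a-b} + c_{a+b})/2$, this can be rewritten as

$$R_X(a, b) = \left(\frac{R_U(\tau) + R_V(\tau)}{2}\right)c_{a-b} + \left(\frac{R_U(\tau) - R_V(\tau)}{2}\right)c_{a+b} + \left(\frac{R_{UV}(\tau) - R_{VU}(\tau)}{2}\right)s_{a-b} - \left(\frac{R_{UV}(\tau) + R_{VU}(\tau)}{2}\right)s_{a+b}.$$

Therefore, in order that $R_X(a, b)$ is a function of $a - b$, it must be that $R_U = R_V$ and $R_{UV} = -R_{VU}$. Since in general $R_{UV}(\tau) = R_{VU}(-\tau)$, the condition $R_{UV} = -R_{VU}$ means that $R_{UV}$ is an odd function: $R_{UV}(\tau) = -R_{UV}(-\tau)$.

We summarize the results as a proposition.

Proposition 8.5.1 Suppose $X$ is given by (8.34) or (8.35), where $U$ and $V$ are jointly WSS. Then $X$ is WSS if and only if $U$ and $V$ are mean zero with $R_U = R_V$ and $R_{UV} = -R_{VU}$. Equivalently, $X$ is WSS if and only if $Z = U + jV$ is mean zero and $E[Z_a Z_b] = 0$ for all $a, b$. If $X$ is WSS then

$$R_X(\tau) = R_U(\tau)\cos(\omega_c\tau) + R_{UV}(\tau)\sin(\omega_c\tau)$$

$$S_X(\omega) = \frac{1}{2}\left[S_U(\omega - \omega_c) + S_U(\omega + \omega_c) - jS_{UV}(\omega - \omega_c) + jS_{UV}(\omega + \omega_c)\right]$$

and, with $R_Z(\tau)$ defined by $R_Z(a - b) = E[Z_a Z_b^*]$,

$$R_X(\tau) = \frac{1}{2}\mathrm{Re}\left(R_Z(\tau)e^{j\omega_c\tau}\right).$$

The functions $S_X$, $S_U$, and $S_V$ are nonnegative, even functions, and $S_{UV}$ is a purely imaginary odd function (i.e. $S_{UV}(\omega) = j\,\mathrm{Im}(S_{UV}(\omega)) = -S_{UV}(-\omega)$).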
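The two expressions for $R_X(\tau)$ in Proposition 8.5.1 can be checked against each other in a few lines. The following worked step is a sketch that uses only the conditions $R_U = R_V$ and $R_{VU} = -R_{UV}$ stated in the proposition, together with the conventions $R_{UV}(a-b) = E[U_a V_b]$ and $R_{VU}(a-b) = E[V_a U_b]$.

```latex
% Step 1: impose R_U = R_V and R_{VU} = -R_{UV} in the four-term expansion
% of R_X(a,b); the c_{a+b} and s_{a+b} terms vanish.
\begin{align*}
R_X(a,b) &= R_U(\tau)\, c_{a-b} + R_{UV}(\tau)\, s_{a-b}
          = R_U(\tau)\cos(\omega_c \tau) + R_{UV}(\tau)\sin(\omega_c \tau).
\end{align*}
% Step 2: compute R_Z for Z = U + jV under the same conditions.
\begin{align*}
R_Z(\tau) &= E[(U_a + jV_a)(U_b - jV_b)]
           = R_U(\tau) + R_V(\tau) + j\bigl(R_{VU}(\tau) - R_{UV}(\tau)\bigr)
           = 2R_U(\tau) - 2jR_{UV}(\tau).
\end{align*}
% Step 3: take the real part after multiplying by e^{j\omega_c\tau}.
\begin{align*}
\tfrac{1}{2}\,\mathrm{Re}\bigl(R_Z(\tau)e^{j\omega_c\tau}\bigr)
  &= \mathrm{Re}\bigl((R_U(\tau) - jR_{UV}(\tau))
     (\cos(\omega_c\tau) + j\sin(\omega_c\tau))\bigr) \\
  &= R_U(\tau)\cos(\omega_c\tau) + R_{UV}(\tau)\sin(\omega_c\tau).
\end{align*}
```

Both routes give the same function of $\tau$, which is the consistency one expects between the real and complex forms of the representation.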
Let $X$ be any WSS real-valued random process with a spectral density $S_X$, and continue to let $\omega_c > \omega_o > 0$. Then $X$ is defined to be a narrowband random process if $S_X(\omega) = 0$ whenever $||\omega| - \omega_c| \ge \omega_o$. Equivalently, $X$ is a narrowband random process if $R_X(t)$ is a narrowband function. We've seen how such a process can be obtained by modulating a pair of jointly WSS baseband random processes $U$ and $V$. We show next that all narrowband random processes have such a representation.

To proceed as in the case of deterministic signals, we first wish to define the Hilbert transform of $X$, denoted by $\check{X}$. A slight concern about defining $\check{X}$ is that the function $-j\,\mathrm{sgn}(\omega)$ does not have finite energy. However, we can replace this function by the function given by

$$H(\omega) = -j\,\mathrm{sgn}(\omega)I_{\{|\omega| \le \omega_o + \omega_c\}},$$

which has finite energy and has a real-valued inverse transform $h$. Define $\check{X}$ as the output when $X$ is passed through the linear system with impulse response $h$. Since $X$ and $h$ are real valued, the random process $\check{X}$ is also real valued. As in the deterministic case, define random processes $Z$, $U$, and $V$ by $Z_t = (X_t + j\check{X}_t)e^{-j\omega_c t}$, $U_t = \mathrm{Re}(Z_t)$, and $V_t = \mathrm{Im}(Z_t)$.

Proposition 8.5.2 Let $X$ be a narrowband WSS random process, with spectral density $S_X$ satisfying $S_X(\omega) = 0$ unless $\omega_c - \omega_o \le |\omega| \le \omega_c + \omega_o$, where $\omega_o < \omega_c$. Then $\mu_X = 0$ and the following representations hold

$$X_t = \mathrm{Re}(Z_t e^{j\omega_c t}) = U_t\cos(\omega_c t) - V_t\sin(\omega_c t)$$

where $Z_t = U_t + jV_t$, and $U$ and $V$ are jointly WSS real-valued random processes with mean zero and

$$S_U(\omega) = S_V(\omega) = \left[S_X(\omega - \omega_c) + S_X(\omega + \omega_c)\right]I_{\{|\omega| \le \omega_o\}} \tag{8.36}$$

and

$$S_{UV}(\omega) = j\left[S_X(\omega + \omega_c) - S_X(\omega - \omega_c)\right]I_{\{|\omega| \le \omega_o\}}. \tag{8.37}$$

Equivalently,

$$R_U(\tau) = R_V(\tau) = R_X(\tau)\cos(\omega_c\tau) + \check{R}_X(\tau)\sin(\omega_c\tau) \tag{8.38}$$

and

$$R_{UV}(\tau) = R_X(\tau)\sin(\omega_c\tau) - \check{R}_X(\tau)\cos(\omega_c\tau). \tag{8.39}$$
Proof. To show that $\mu_X = 0$, consider passing $X$ through a linear, time-invariant system with transfer function $K(\omega) = 1$ if $\omega$ is in either the upper band or lower band, and $K(\omega) = 0$ otherwise. Let $Y$ denote the output and let $k$ denote the impulse response function of this system. Then $\mu_Y = \mu_X\int_{-\infty}^{\infty}k(\tau)\,d\tau = \mu_X K(0) = 0$. Since $K(\omega) = 1$ for all $\omega$ such that $S_X(\omega) > 0$, it follows that $R_X = R_Y = R_{XY} = R_{YX}$. Therefore $E[|X_t - Y_t|^2] = 0$, so that $X_t$ has the same mean as $Y_t$, namely zero, as claimed.

By the definitions of the processes $Z$, $U$, and $V$, using the notation $c_t = \cos(\omega_c t)$ and $s_t = \sin(\omega_c t)$, we have

$$U_t = X_t c_t + \check{X}_t s_t, \qquad V_t = -X_t s_t + \check{X}_t c_t.$$

The remainder of the proof consists of computing $R_U$, $R_V$, and $R_{UV}$ as functions of two variables, because it is not yet clear that $U$ and $V$ are jointly WSS.

By the fact $X$ is WSS and the definition of $\check{X}$, the processes $X$ and $\check{X}$ are jointly WSS, and the various spectral densities are given by

$$S_{\check{X}X} = HS_X, \qquad S_{X\check{X}} = H^*S_X = -HS_X, \qquad S_{\check{X}} = |H|^2S_X = S_X.$$

Therefore,

$$R_{\check{X}X} = \check{R}_X, \qquad R_{X\check{X}} = -\check{R}_X, \qquad R_{\check{X}} = R_X.$$

Thus, for real numbers $a$ and $b$,

$$\begin{aligned} R_U(a, b) &= E\left[\left(X(a)c_a + \check{X}(a)s_a\right)\left(X(b)c_b + \check{X}(b)s_b\right)\right] \\ &= R_X(a - b)(c_a c_b + s_a s_b) + \check{R}_X(a - b)(s_a c_b - c_a s_b) \\ &= R_X(a - b)c_{a-b} + \check{R}_X(a - b)s_{a-b}. \end{aligned}$$

Thus, $R_U(a, b)$ is a function of $a - b$, and $R_U(\tau)$ is given by the right side of (8.38). The proof that $R_V$ also satisfies (8.38), and the proof of (8.39), are similar. Finally, it is a simple matter to derive (8.36) and (8.37) from (8.38) and (8.39), respectively.

Equations (8.36) and (8.37) have simple graphical interpretations, as illustrated in Figure 8.10. Equation (8.36) means that $S_U$ and $S_V$ are each equal to the sum of the upper lobe of $S_X$ shifted to the left by $\omega_c$ and the lower lobe of $S_X$ shifted to the right by $\omega_c$. Similarly, equation (8.37) means that $S_{UV}$ is equal to the sum of $j$ times the upper lobe of $S_X$ shifted to the left by $\omega_c$ and $-j$ times the lower lobe of $S_X$ shifted to the right by $\omega_c$. Equivalently, $S_U$ and $S_V$ are each twice the symmetric part of the upper lobe of $S_X$, and $S_{UV}$ is $j$ times the antisymmetric part of the upper lobe of $S_X$. Since $R_{UV}$ is an odd function of $\tau$, it follows that $R_{UV}(0) = 0$. Thus, for any fixed time $t$, $U_t$ and $V_t$ are uncorrelated. That does not imply that $U_s$ and $V_t$ are uncorrelated for all $s$ and $t$, for the cross correlation function $R_{UV}$ is identically zero if and only if the upper lobe of $S_X$ is symmetric about $\omega_c$.
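The lobe-shifting description of (8.36) and (8.37) can be illustrated with a small numerical sketch. Everything specific below is an assumption made for the illustration: the values of `OMEGA_C` and `OMEGA_O`, and the two example densities, one whose upper lobe is a triangle symmetric about $\omega_c$ and one whose upper lobe is a ramp. The symmetric lobe should give $S_{UV} = 0$, while the ramp should give a purely imaginary, odd $S_{UV}$.

```python
OMEGA_C = 10.0   # assumed carrier frequency
OMEGA_O = 2.0    # assumed one-sided baseband bandwidth


def in_band(w):
    """True if w lies in the upper or lower band."""
    return abs(abs(w) - OMEGA_C) < OMEGA_O


def s_x_symmetric(w):
    """Example S_X whose upper lobe is a triangle symmetric about OMEGA_C."""
    return 1.0 - abs(abs(w) - OMEGA_C) / OMEGA_O if in_band(w) else 0.0


def s_x_ramp(w):
    """Example S_X whose upper lobe rises linearly (not symmetric)."""
    return (abs(w) - (OMEGA_C - OMEGA_O)) / (2 * OMEGA_O) if in_band(w) else 0.0


def baseband_densities(s_x, w):
    """Return (S_U(w), S_UV(w)) computed from (8.36) and (8.37)."""
    if abs(w) > OMEGA_O:
        return 0.0, 0j
    upper = s_x(w + OMEGA_C)   # upper lobe shifted left by OMEGA_C
    lower = s_x(w - OMEGA_C)   # lower lobe shifted right by OMEGA_C
    return upper + lower, 1j * (upper - lower)


su_sym, suv_sym = baseband_densities(s_x_symmetric, 0.5)
su_ramp, suv_ramp = baseband_densities(s_x_ramp, 1.0)
_, suv_ramp_neg = baseband_densities(s_x_ramp, -1.0)
print(su_sym, suv_sym, su_ramp, suv_ramp, suv_ramp_neg)
```

For the ramp example the cross spectral density changes sign when $\omega$ does, matching the statement that $S_{UV}$ is an odd, purely imaginary function.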
Example 8.5.3 (Baseband equivalent filtering of a random process) As noted above, filtering of narrowband deterministic signals can be described using equivalent baseband signals, namely the complex envelopes. The same is true for filtering of narrowband random processes. Suppose $X$ is a narrowband WSS random process, suppose $g$ is a finite energy narrowband signal, and suppose $Y$ is the output process when $X$ is filtered using impulse response function $g$. Then $Y$ is also a WSS narrowband random process. Let $Z$ denote the complex envelope of $X$, given in Proposition 8.5.2, and let $z_g$ denote the complex envelope signal of $g$, meaning that $z_g$ is the complex baseband signal such that $g(t) = \mathrm{Re}(z_g(t)e^{j\omega_c t})$. It can be shown that the complex envelope process of $Y$ is $\frac{1}{2}z_g * Z$ (see the footnote below). Thus, the filtering of $X$ by $g$ is equivalent to the filtering of $Z$ by $\frac{1}{2}z_g$.

Footnote: An elegant proof of this fact is based on spectral representation theory for WSS random processes, covered for example in Doob, Stochastic Processes, Wiley, 1953. The basic idea is to define the Fourier transform of a WSS random process, which, like white noise, is a generalized random process. Then essentially the same method we described for filtering of deterministic narrowband signals works.
[Figure 8.10: A narrowband power spectral density and associated baseband spectral densities. The panels show $S_X$, the construction of $S_U = S_V$ as a sum of shifted lobes, and the construction of $S_{UV}$ as a sum of shifted lobes scaled by $j$ and $-j$.]

Example 8.5.4 (Simulation of a narrowband random process) Let $\omega_o$ and $\omega_c$ be positive numbers with $0 < \omega_o < \omega_c$. Suppose $S_X$ is a nonnegative function which is even (i.e. $S_X(\omega) = S_X(-\omega)$ for all $\omega$) with $S_X(\omega) = 0$ if $||\omega| - \omega_c| \ge \omega_o$. We discuss briefly the problem of writing a computer simulation to generate a real-valued WSS random process $X$ with power spectral density $S_X$.

By Proposition 8.5.1, it suffices to simulate baseband random processes $U$ and $V$ with the power spectral densities specified by (8.36) and cross power spectral density specified by (8.37). For increased tractability, we impose an additional assumption on $S_X$, namely that the upper lobe of $S_X$ is symmetric about $\omega_c$. This assumption is equivalent to the assumption that $S_{UV}$ vanishes, and therefore that the processes $U$ and $V$ are uncorrelated with each other. Thus, the processes $U$ and $V$ can be generated independently.

In turn, the processes $U$ and $V$ can be simulated by first generating sequences of random variables $U_{nT}$ and $V_{nT}$ for sampling frequency $\frac{1}{T} = 2f_o = \frac{\omega_o}{\pi}$. A discrete-time random process with power spectral density $S_U$ can be generated by passing a discrete-time white noise sequence with unit variance through a discrete-time linear time-invariant system with real-valued impulse response function such that the transfer function $H$ satisfies $S_U = |H|^2$. For example, taking $H(\omega) = \sqrt{S_U(\omega)}$ works, though it might not be the most well behaved linear system.
(The problem of finding a transfer function $H$ with additional properties such that $S_U = |H|^2$ is called the problem of spectral factorization, which we shall return to in the next chapter.) The samples $V_{kT}$ can be generated similarly.

For a specific example, suppose that (using kHz for kilohertz, or thousands of Hertz)

$$S_X(2\pi f) = \begin{cases} 1 & 9{,}000 \text{ kHz} < |f| < 9{,}020 \text{ kHz} \\ 0 & \text{else.} \end{cases} \tag{8.40}$$

Notice that the parameters $\omega_o$ and $\omega_c$ are not uniquely determined by $S_X$. They must simply be positive numbers with $\omega_o < \omega_c$ such that

$$(9{,}000 \text{ kHz},\ 9{,}020 \text{ kHz}) \subset (f_c - f_o,\ f_c + f_o).$$

However, only the choice $f_c = 9{,}010$ kHz makes the upper lobe of $S_X$ symmetric around $f_c$. Therefore we take $f_c = 9{,}010$ kHz. We take the minimum allowable value for $f_o$, namely $f_o = 10$ kHz. For this choice, (8.36) yields

$$S_U(2\pi f) = S_V(2\pi f) = \begin{cases} 2 & |f| < 10 \text{ kHz} \\ 0 & \text{else} \end{cases} \tag{8.41}$$

and (8.37) yields $S_{UV}(2\pi f) = 0$ for all $f$. The processes $U$ and $V$ are continuous-time baseband random processes with one-sided bandwidth limit 10 kHz. To simulate these processes it is therefore enough to generate samples of them with sampling period $T = 0.5 \times 10^{-4}$, and then use the Nyquist sampling representation described in Section 8.4. The processes of samples will, according to (8.26), have power spectral density equal to $4 \times 10^4$ over the interval $[-\pi, \pi]$. Consequently, the samples can be taken to be uncorrelated with $E[|A_k|^2] = E[|B_k|^2] = 4 \times 10^4$. For example, these variables can be taken to be independent real Gaussian random variables. Putting the steps together, we find the following representation for $X$:

$$X_t = \cos(\omega_c t)\left(\sum_{n=-\infty}^{\infty}A_n\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)\right) - \sin(\omega_c t)\left(\sum_{n=-\infty}^{\infty}B_n\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)\right).$$
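The representation for $X_t$ can be turned into a short simulation sketch. The numerical constants below ($f_c$, $f_o$, $T$, and the sample variance $4 \times 10^4$) are the ones from the example; everything else is an assumption made for illustration: the function names, the fixed seed, the use of `random.gauss` for the independent Gaussian samples, the convention $\mathrm{sinc}(u) = \sin(\pi u)/(\pi u)$, and above all the truncation of the infinite sums to $|n| \le$ `n_max`, which makes the output only an approximation for $t$ well inside the sampled window.

```python
import math
import random

F_C = 9.010e6          # carrier frequency in Hz (from the example)
F_O = 10.0e3           # one-sided baseband bandwidth in Hz (from the example)
T = 1.0 / (2 * F_O)    # sampling period, 0.5e-4 seconds
SIGMA2 = 4.0e4         # variance of each sample A_n and B_n


def sinc(u):
    """Normalized sinc, sin(pi u)/(pi u), with sinc(0) = 1 (assumed convention)."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)


def make_samples(n_max, seed=0):
    """Independent real Gaussian samples A_n, B_n for |n| <= n_max."""
    rng = random.Random(seed)
    sd = math.sqrt(SIGMA2)
    a = {n: rng.gauss(0.0, sd) for n in range(-n_max, n_max + 1)}
    b = {n: rng.gauss(0.0, sd) for n in range(-n_max, n_max + 1)}
    return a, b


def x_of_t(t, a, b):
    """Truncated version of the sinc-series representation of X_t."""
    u_t = sum(a[n] * sinc((t - n * T) / T) for n in a)
    v_t = sum(b[n] * sinc((t - n * T) / T) for n in b)
    w = 2 * math.pi * F_C * t
    return math.cos(w) * u_t - math.sin(w) * v_t


a, b = make_samples(n_max=200)
# At a sampling instant t = 3T only the n = 3 terms survive (up to rounding).
t3 = 3 * T
w3 = 2 * math.pi * F_C * t3
expected_at_t3 = math.cos(w3) * a[3] - math.sin(w3) * b[3]
value_at_t3 = x_of_t(t3, a, b)
print(value_at_t3, expected_at_t3)
```

Only $2 f_o = 20{,}000$ baseband samples per second are generated for each of $U$ and $V$, even though the carrier is near 9 MHz; the carrier enters only through the final multiplication by $\cos(\omega_c t)$ and $\sin(\omega_c t)$.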
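8.6 Complexification, Part II

A complex random variable $Z$ is said to be circularly symmetric if $Z$ has the same distribution as $e^{j\theta}Z$ for every real value of $\theta$. If $Z$ has a pdf $f_Z$, circular symmetry of $Z$ means that $f_Z(z)$ is invariant under rotations about zero, or, equivalently, $f_Z(z)$ depends on $z$ only through $|z|$. A collection of random variables $(Z_i : i \in I)$ is said to be jointly circularly symmetric if for every real value of $\theta$, the collection $(Z_i : i \in I)$ has the same finite dimensional distributions as the collection $(Z_i e^{j\theta} : i \in I)$. Note that if $(Z_i : i \in I)$ is jointly circularly symmetric, and if $(Y_j : j \in J)$ is another collection of random variables such that each $Y_j$ is a linear combination of $Z_i$'s (with no constants added in), then the collection $(Y_j : j \in J)$ is also jointly circularly symmetric.

Recall that a complex random vector $Z$, expressed in terms of real random vectors $U$ and $V$ as $Z = U + jV$, has mean $EZ = EU + jEV$ and covariance matrix $\mathrm{Cov}(Z) = E[(Z - EZ)(Z - EZ)^*]$. The pseudo-covariance matrix of $Z$ is defined by $\mathrm{Cov}^p(Z) = E[(Z - EZ)(Z - EZ)^T]$, and it differs from the covariance of $Z$ in that a transpose, rather than a Hermitian transpose, is involved. Note that $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$ are readily expressed in terms of $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U, V)$ as:

$$\mathrm{Cov}(Z) = \mathrm{Cov}(U) + \mathrm{Cov}(V) + j\left(\mathrm{Cov}(V, U) - \mathrm{Cov}(U, V)\right)$$

$$\mathrm{Cov}^p(Z) = \mathrm{Cov}(U) - \mathrm{Cov}(V) + j\left(\mathrm{Cov}(V, U) + \mathrm{Cov}(U, V)\right)$$

where $\mathrm{Cov}(V, U) = \mathrm{Cov}(U, V)^T$. Conversely,

$$\mathrm{Cov}(U) = \mathrm{Re}\left(\mathrm{Cov}(Z) + \mathrm{Cov}^p(Z)\right)/2, \qquad \mathrm{Cov}(V) = \mathrm{Re}\left(\mathrm{Cov}(Z) - \mathrm{Cov}^p(Z)\right)/2,$$

and

$$\mathrm{Cov}(U, V) = \mathrm{Im}\left(-\mathrm{Cov}(Z) + \mathrm{Cov}^p(Z)\right)/2.$$

The vector $Z$ is defined to be Gaussian if the random vectors $U$ and $V$ are jointly Gaussian. Suppose that $Z$ is a complex Gaussian random vector. Then its distribution is fully determined by its mean and the matrices $\mathrm{Cov}(U)$, $\mathrm{Cov}(V)$, and $\mathrm{Cov}(U, V)$, or equivalently by its mean and the matrices $\mathrm{Cov}(Z)$ and $\mathrm{Cov}^p(Z)$. Therefore, for a real value of $\theta$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if they have the same mean, covariance matrix, and pseudo-covariance matrix. Since $E[e^{j\theta}Z] = e^{j\theta}EZ$, $\mathrm{Cov}(e^{j\theta}Z) = \mathrm{Cov}(Z)$, and $\mathrm{Cov}^p(e^{j\theta}Z) = e^{j2\theta}\mathrm{Cov}^p(Z)$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if $(e^{j\theta} - 1)EZ = 0$ and $(e^{j2\theta} - 1)\mathrm{Cov}^p(Z) = 0$. Hence, if $\theta$ is not a multiple of $\pi$, $Z$ and $e^{j\theta}Z$ have the same distribution if and only if $EZ = 0$ and $\mathrm{Cov}^p(Z) = 0$. Consequently, a Gaussian random vector $Z$ is circularly symmetric if and only if its mean vector and pseudo-covariance matrix are zero.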
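The conversions between $(\mathrm{Cov}(U), \mathrm{Cov}(V), \mathrm{Cov}(U,V))$ and $(\mathrm{Cov}(Z), \mathrm{Cov}^p(Z))$ are purely algebraic, so they can be checked with a round trip. The sketch below is illustrative only: the function names and the example matrices are assumptions, and the matrices are used just to exercise the algebra (no check is made that they form a valid joint covariance). The second example sets $\mathrm{Cov}(U) = \mathrm{Cov}(V)$ and makes $\mathrm{Cov}(U,V)$ antisymmetric, in which case the pseudo-covariance should vanish.

```python
def complex_covs(cov_u, cov_v, cov_uv):
    """Return (Cov(Z), Cov^p(Z)) for Z = U + jV, using Cov(V,U) = Cov(U,V)^T."""
    n = len(cov_u)
    cov_z = [[cov_u[i][k] + cov_v[i][k] + 1j * (cov_uv[k][i] - cov_uv[i][k])
              for k in range(n)] for i in range(n)]
    pcov_z = [[cov_u[i][k] - cov_v[i][k] + 1j * (cov_uv[k][i] + cov_uv[i][k])
               for k in range(n)] for i in range(n)]
    return cov_z, pcov_z


def real_covs(cov_z, pcov_z):
    """Recover (Cov(U), Cov(V), Cov(U,V)) from Cov(Z) and Cov^p(Z)."""
    n = len(cov_z)
    rng = range(n)
    cov_u = [[(cov_z[i][k] + pcov_z[i][k]).real / 2 for k in rng] for i in rng]
    cov_v = [[(cov_z[i][k] - pcov_z[i][k]).real / 2 for k in rng] for i in rng]
    cov_uv = [[(-cov_z[i][k] + pcov_z[i][k]).imag / 2 for k in rng] for i in rng]
    return cov_u, cov_v, cov_uv


def max_diff(m1, m2):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(m1[i][k] - m2[i][k])
               for i in range(len(m1)) for k in range(len(m1)))


# Assumed example matrices, used only to exercise the algebra.
A = [[2.0, 0.5], [0.5, 1.0]]
B = [[1.0, 0.2], [0.2, 3.0]]
C = [[0.1, -0.3], [0.4, 0.2]]
cov_z, pcov_z = complex_covs(A, B, C)
A2, B2, C2 = real_covs(cov_z, pcov_z)
round_trip_error = max(max_diff(A, A2), max_diff(B, B2), max_diff(C, C2))

# Cov(U) = Cov(V) with antisymmetric Cov(U,V) gives zero pseudo-covariance.
C_anti = [[0.0, 0.3], [-0.3, 0.0]]
_, pcov_circ = complex_covs(A, A, C_anti)
pseudo_norm = max(abs(entry) for row in pcov_circ for entry in row)
print(round_trip_error, pseudo_norm)
```

The second case mirrors the conditions $R_U = R_V$ and $R_{UV} = -R_{VU}$ from Proposition 8.5.1, which is why the complex envelope of a WSS narrowband process has zero pseudo-covariance.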
The joint density function of a circularly symmetric complex Gaussian random vector $Z$ with $n$ complex dimensions and covariance matrix $K$, with $\det K \ne 0$, has the particularly elegant form:

$$f_Z(z) = \frac{\exp(-z^*K^{-1}z)}{\pi^n\det(K)}. \tag{8.42}$$

Equation (8.42) can be derived in the same way the density for Gaussian vectors with real components is derived. Namely, (8.42) is easy to verify if $K$ is diagonal. If $K$ is not diagonal, the Hermitian symmetric positive definite matrix $K$ can be expressed as $K = U\Lambda U^*$, where $U$ is a unitary matrix and $\Lambda$ is a diagonal matrix with strictly positive diagonal entries. The random vector $Y$ defined by $Y = U^*Z$ is Gaussian and circularly symmetric with covariance matrix $\Lambda$, and since $\det(\Lambda) = \det(K)$, it has pdf

$$f_Y(y) = \frac{\exp(-y^*\Lambda^{-1}y)}{\pi^n\det(K)}.$$

Since $|\det(U)| = 1$, $f_Z(z) = f_Y(U^*z)$, which yields (8.42).

Let us switch now to random processes. Let $Z$ be a complex-valued random process and let $U$ and $V$ be the real-valued random processes such that $Z_t = U_t + jV_t$. Recall that $Z$ is Gaussian if $U$ and $V$ are jointly Gaussian, and the covariance function of $Z$ is defined by $C_Z(s, t) = \mathrm{Cov}(Z_s, Z_t)$. The pseudo-covariance function of $Z$ is defined by $C_Z^p(s, t) = \mathrm{Cov}^p(Z_s, Z_t)$. As for covariance matrices of vectors, both $C_Z$ and $C_Z^p$ are needed to determine $C_U$, $C_V$, and $C_{UV}$.
Following the vast majority of the literature, we define $Z$ to be wide sense stationary (WSS) if $\mu_Z(t)$ is constant and if $C_Z(s, t)$ (or $R_Z(s, t)$) is a function of $s - t$ alone. Some authors use a stronger definition of WSS, by defining $Z$ to be WSS if either of the following two equivalent conditions is satisfied:

• $\mu_Z(t)$ is constant, and both $C_Z(s, t)$ and $C_Z^p(s, t)$ are functions of $s - t$

• $U$ and $V$ are jointly WSS

If $Z$ is Gaussian then it is stationary if and only if it satisfies the stronger definition of WSS.

A complex random process $Z = (Z_t : t \in \mathbf{T})$ is called circularly symmetric if the random variables of the process, $(Z_t : t \in \mathbf{T})$, are jointly circularly symmetric. If $Z$ is a complex Gaussian random process, it is circularly symmetric if and only if it has mean zero and $C_Z^p(s, t) = 0$ for all $s, t$. Proposition 8.5.2 shows that the baseband equivalent process $Z$ for a Gaussian real-valued narrowband WSS random process $X$ is circularly symmetric. Nearly all complex-valued random processes in applications arise in this fashion. For circularly symmetric complex random processes, the definition of WSS we adopted, and the stronger definition mentioned in the previous paragraph, are equivalent. A circularly symmetric complex Gaussian random process is stationary if and only if it is WSS.

The interested reader can find more related to the material in this section in Neeser and Massey, "Proper Complex Random Processes with Applications to Information Theory," IEEE Transactions on Information Theory, vol. 39, no. 4, July 1993.
8.7 Problems

8.1 On filtering a WSS random process
Suppose $Y$ is the output of a linear time-invariant system with WSS input $X$, impulse response function $h$, and transfer function $H$. Indicate whether the following statements are true or false. Justify your answers.
(a) If $|H(\omega)| \le 1$ for all $\omega$ then the power of $Y$ is less than or equal to the power of $X$.
(b) If $X$ is periodic (in addition to being WSS) then $Y$ is WSS and periodic.
(c) If $X$ has mean zero and strictly positive total power, and if $||h||^2 > 0$, then the output power is strictly positive.

8.2 On the cross spectral density
Suppose $X$ and $Y$ are jointly WSS such that the power spectral densities $S_X$, $S_Y$, and $S_{XY}$ are continuous. Show that for each $\omega$, $|S_{XY}(\omega)|^2 \le S_X(\omega)S_Y(\omega)$. Hint: Fix $\omega_o$, let $\epsilon > 0$, and let $J^\epsilon$ denote the interval of length $\epsilon$ centered at $\omega_o$. Consider passing both $X$ and $Y$ through a linear time-invariant system with transfer function $H^\epsilon(\omega) = I_{J^\epsilon}(\omega)$. Apply the Schwarz inequality to the output processes sampled at a fixed time, and let $\epsilon \to 0$.

8.3 Modulating and filtering a stationary process
Let $X = (X_t : t \in \mathbb{Z})$ be a discrete-time mean-zero stationary random process with power $E[X_0^2] = 1$. Let $Y$ be the stationary discrete-time random process obtained from $X$ by modulation as follows:
$$Y_t = X_t\cos(80\pi t + \Theta),$$
where $\Theta$ is independent of $X$ and is uniformly distributed over $[0, 2\pi]$. Let $Z$ be the stationary discrete-time random process obtained from $Y$ by the linear equations:
$$Z_{t+1} = (1 - a)Z_t + aY_{t+1}$$
for all $t$, where $a$ is a constant with $0 < a < 1$.
(a) Why is the random process $Y$ stationary?
(b) Express the autocorrelation function of $Y$, $R_Y(\tau) = E[Y_\tau Y_0]$, in terms of the autocorrelation function of $X$. Similarly, express the power spectral density of $Y$, $S_Y(\omega)$, in terms of the power spectral density of $X$, $S_X(\omega)$.
(c) Find and sketch the transfer function $H(\omega)$ for the linear system describing the mapping from $Y$ to $Z$.
(d) Can the power of $Z$ be arbitrarily large (depending on $a$)? Explain your answer.
(e) Describe an input $X$ satisfying the assumptions above so that the power of $Z$ is at least 0.5, for any value of $a$ with $0 < a < 1$.

8.4 Filtering a Gauss Markov process
Let $X = (X_t : -\infty < t < +\infty)$ be a stationary Gauss Markov process with mean zero and autocorrelation function $R_X(\tau) = \exp(-|\tau|)$. Define a random process $Y = (Y_t : t \in \mathbb{R})$ by the differential equation $\dot{Y}_t = X_t - Y_t$.
(a) Find the cross correlation function $R_{XY}$. Are $X$ and $Y$ jointly stationary?
(b) Find $E[Y_5 | X_5 = 3]$. What is the approximate numerical value?
(c) Is $Y$ a Gaussian random process? Justify your answer.
(d) Is $Y$ a Markov process? Justify your answer.
8.5 Slight smoothing
Suppose $Y$ is the output of the linear time-invariant system with input $X$ and impulse response function $h$, such that $X$ is WSS with $R_X(\tau) = \exp(-|\tau|)$, and $h(\tau) = \frac{1}{a}I_{\{|\tau| \le \frac{a}{2}\}}$ for $a > 0$. If $a$ is small, then $h$ approximates the delta function $\delta(\tau)$, and consequently $Y_t \approx X_t$. This problem explores the accuracy of the approximation.
(a) Find $R_{YX}(0)$, and use the power series expansion of $e^u$ to show that $R_{YX}(0) = 1 - \frac{a}{4} + o(a)$ as $a \to 0$. Here, $o(a)$ denotes any term such that $o(a)/a \to 0$ as $a \to 0$.
(b) Find $R_Y(0)$, and use the power series expansion of $e^u$ to show that $R_Y(0) = 1 - \frac{a}{3} + o(a)$ as $a \to 0$.
(c) Show that $E[|X_t - Y_t|^2] = \frac{a}{6} + o(a)$ as $a \to 0$.

8.6 A stationary two-state Markov process
Let $X = (X_k : k \in \mathbb{Z})$ be a stationary Markov process with state space $S = \{1, -1\}$ and one-step transition probability matrix
$$P = \begin{pmatrix} 1 - p & p \\ p & 1 - p \end{pmatrix},$$
where $0 < p < 1$. Find the mean, correlation function and power spectral density function of $X$. Hint: For nonnegative integers $k$:
$$P^k = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} + (1 - 2p)^k\begin{pmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{pmatrix}.$$

8.7 A stationary two-state Markov process in continuous time
Let $X = (X_t : t \in \mathbb{R})$ be a stationary Markov process with state space $S = \{1, -1\}$ and $Q$ matrix
$$Q = \begin{pmatrix} -\alpha & \alpha \\ \alpha & -\alpha \end{pmatrix},$$
where $\alpha > 0$. Find the mean, correlation function and power spectral density function of $X$. (Hint: Recall from the example in the chapter on Markov processes that for $s < t$, the matrix of transition probabilities $p_{ij}(s, t)$ is given by $H(\tau)$, where $\tau = t - s$ and
$$H(\tau) = \begin{pmatrix} \frac{1 + e^{-2\alpha\tau}}{2} & \frac{1 - e^{-2\alpha\tau}}{2} \\ \frac{1 - e^{-2\alpha\tau}}{2} & \frac{1 + e^{-2\alpha\tau}}{2} \end{pmatrix}.)$$
8.8 A linear estimation problem
Suppose $X$ and $Y$ are possibly complex-valued jointly WSS processes with known autocorrelation functions, cross-correlation function, and associated spectral densities. Suppose $Y$ is passed through a linear time-invariant system with impulse response function $h$ and transfer function $H$, and let $Z$ be the output. The mean square error of estimating $X_t$ by $Z_t$ is $E[|X_t - Z_t|^2]$.
(a) Express the mean square error in terms of $R_X$, $R_Y$, $R_{XY}$ and $h$.
(b) Express the mean square error in terms of $S_X$, $S_Y$, $S_{XY}$ and $H$.
(c) Using your answer to part (b), find the choice of $H$ that minimizes the mean square error. (Hint: Try working out the problem first assuming the processes are real valued. For the complex case, note that for $\sigma^2 > 0$ and complex numbers $z$ and $z_o$, $\sigma^2|z|^2 - 2\mathrm{Re}(z^*z_o)$ is equal to $\left|\sigma z - \frac{z_o}{\sigma}\right|^2 - \frac{|z_o|^2}{\sigma^2}$, which is minimized with respect to $z$ by $z = \frac{z_o}{\sigma^2}$.)

8.9 Linear time invariant, uncorrelated scattering channel
A signal transmitted through a scattering environment can propagate over many different paths on its way to a receiver. The channel gains along distinct paths are often modeled as uncorrelated. The paths may differ in length, causing a delay spread. Let $h = (h_u : u \in \mathbb{Z})$ consist of uncorrelated, possibly complex-valued random variables with mean zero and $E[|h_u|^2] = g_u$. Assume that $G = \sum_u g_u < \infty$. The variable $h_u$ is the random complex gain for delay $u$, and $g = (g_u : u \in \mathbb{Z})$ is the energy gain delay mass function with total gain $G$. Given a deterministic signal $x$, the channel output is the random signal $Y$ defined by $Y_i = \sum_{u=-\infty}^{\infty}h_u x_{i-u}$.
(a) Determine the mean and autocorrelation function for $Y$ in terms of $x$ and $g$.
(b) Express the average total energy of $Y$: $E[\sum_i Y_i^2]$, in terms of $x$ and $g$.
(c) Suppose instead that the input is a WSS random process $X$ with autocorrelation function $R_X$. The input $X$ is assumed to be independent of the channel $h$. Express the mean and autocorrelation function of the output $Y$ in terms of $R_X$ and $g$. Is $Y$ WSS?
(d) Since the impulse response function $h$ is random, so is its Fourier transform, $H = (H(\omega) : -\pi \le \omega \le \pi)$. Express the autocorrelation function of the random process $H$ in terms of $g$.
8.10 The accuracy of approximate differentiation
Let $X$ be a WSS baseband random process with power spectral density $S_X$, and let $\omega_o$ be the one-sided band limit of $X$. The process $X$ is m.s. differentiable and $X'$ can be viewed as the output of a time-invariant linear system with transfer function $H(\omega) = j\omega$.
(a) What is the power spectral density of $X'$?
(b) Let $Y_t = \frac{X_{t+a} - X_{t-a}}{2a}$, for some $a > 0$. We can also view $Y = (Y_t : t \in \mathbb{R})$ as the output of a time-invariant linear system, with input $X$. Find the impulse response function $k$ and transfer function $K$ of the linear system. Show that $K(\omega) \to j\omega$ as $a \to 0$.
(c) Let $D_t = X'_t - Y_t$. Find the power spectral density of $D$.
(d) Find a value of $a$, depending only on $\omega_o$, so that $E[|D_t|^2] \le (0.01)E[|X'_t|^2]$. In other words, for such $a$, the m.s. error of approximating $X'_t$ by $Y_t$ is less than one percent of $E[|X'_t|^2]$. You can use the fact that $0 \le 1 - \frac{\sin(u)}{u} \le \frac{u^2}{6}$ for all real $u$. (Hint: Find $a$ so that $S_D(\omega) \le (0.01)S_{X'}(\omega)$ for $|\omega| \le \omega_o$.)

8.11 Some linear transformations of some random processes
Let $U = (U_n : n \in \mathbb{Z})$ be a random process such that the variables $U_n$ are independent, identically distributed, with $E[U_n] = \mu$ and $\mathrm{Var}(U_n) = \sigma^2$, where $\mu \ne 0$ and $\sigma^2 > 0$. Please keep in mind that $\mu \ne 0$. Let $X = (X_n : n \in \mathbb{Z})$ be defined by $X_n = \sum_{k=0}^{\infty}U_{n-k}a^k$, for a constant $a$ with $0 < a < 1$.
(a) Is $X$ stationary? Find the mean function $\mu_X$ and autocovariance function $C_X$ for $X$.
(b) Is $X$ a Markov process? (Hint: $X$ is not necessarily Gaussian. Does $X$ have a state representation driven by $U$?)
(c) Is $X$ mean ergodic in the m.s. sense?
Let $U$ be as before, and let $Y = (Y_n : n \in \mathbb{Z})$ be defined by $Y_n = \sum_{k=0}^{\infty}U_{n-k}A^k$, where $A$ is a random variable distributed on the interval $(0, 0.5)$ (the exact distribution is not specified), and $A$ is independent of the random process $U$.
(d) Is $Y$ stationary? Find the mean function $\mu_Y$ and autocovariance function $C_Y$ for $Y$. (Your answer may include expectations involving $A$.)
(e) Is $Y$ a Markov process? (Give a brief explanation.)
(f) Is $Y$ mean ergodic in the m.s. sense?
8.12 Filtering Poisson white noise
A Poisson random process $N = (N_t : t \ge 0)$ has independent increments. The derivative of $N$, written $N'$, does not exist as an ordinary random process, but it does exist as a generalized random process. Graphically, picture $N'$ as a superposition of delta functions, one at each arrival time of the Poisson process. As a generalized random process, $N'$ is stationary with mean and autocovariance functions given by $E[N'_t] = \lambda$, and $C_{N'}(s, t) = \lambda\delta(s - t)$, respectively, because, when integrated, these functions give the correct values for the mean and covariance of $N$: $E[N_t] = \int_0^t\lambda\,ds$ and $C_N(s, t) = \int_0^s\int_0^t\lambda\delta(u - v)\,dv\,du$. The random process $N'$ can be extended to be defined for negative times by augmenting the original random process $N$ by another rate $\lambda$ Poisson process for negative times. Then $N'$ can be viewed as a stationary random process, and its integral over intervals gives rise to a process $N(a, b]$ as described in Problem 4.17. (The process $N' - \lambda$ is a white noise process, in that it is a generalized random process which is stationary, mean zero, and has autocorrelation function $\lambda\delta(\tau)$. Both $N'$ and $N' - \lambda$ are called Poisson shot noise processes. One application for such processes is modeling noise in small electronic devices, in which effects of single electrons can be registered. For the remainder of this problem, $N'$ is used instead of the mean zero version.) Let $X$ be the output when $N'$ is passed through a linear time-invariant filter with an impulse response function $h$, such that $\int_{-\infty}^{\infty}|h(t)|\,dt$ is finite. (Remark: In the special case that $h(t) = I_{\{0 \le t < 1\}}$, $X$ is the M/D/∞ process of Problem 4.17.)
(a) Find the mean function and covariance functions of $X$.
(b) Consider the special case that $h(t) = e^{-t}I_{\{t \ge 0\}}$. Explain why $X$ is a Markov process in this case. (Hint: What is the behavior of $X$ between the arrival times of the Poisson process? What does $X$ do at the arrival times?)
8.13 A linear system with a feedback loop
The system with input $X$ and output $Y$ involves feedback with the loop transfer function shown.

[Figure: block diagram of a feedback loop with input $X$, a summing junction, a block labeled $1 + j\omega$, and output $Y$.]

(a) Find the transfer function $K$ of the system describing the mapping from $X$ to $Y$.
(b) Find the corresponding impulse response function.
(c) The power of $Y$ divided by the power of $X$ depends on the power spectral density, $S_X$. Find the supremum of this ratio, over all choices of $S_X$, and describe what choice of $S_X$ achieves this supremum.

8.14 Linear and nonlinear reconstruction from samples
Suppose $X_t = \sum_{n=-\infty}^{\infty}g(t - n - U)B_n$, where the $B_n$'s are independent with mean zero and variance $\sigma^2 > 0$, $g$ is a function with finite energy $\int|g(t)|^2\,dt$ and Fourier transform $G(\omega)$, and $U$ is a random variable which is independent of $B$ and uniformly distributed on the interval $[0, 1]$. The process $X$ is a typical model for a digital baseband signal, where the $B_n$'s are random data symbols.
(a) Show that $X$ is WSS, with mean zero and $R_X(t) = \sigma^2 g * \tilde{g}(t)$.
(b) Under what conditions on $G$ and $T$ can the sampling theorem be used to recover $X$ from its samples of the form $(X(nT) : n \in \mathbb{Z})$?
(c) Consider the particular case $g(t) = (1 - |t|)_+$ and $T = 0.5$. Although this falls outside the conditions found in part (b), show that by using nonlinear operations, the process $X$ can be recovered from its samples of the form $(X(nT) : n \in \mathbb{Z})$. (Hint: Consider a sample path of $X$.)

8.15 Sampling a cubed Gaussian process
Let $X = (X_t : t \in \mathbb{R})$ be a baseband mean zero stationary real Gaussian random process with one-sided band limit $f_o$ Hz. Thus, $X_t = \sum_{n=-\infty}^{\infty}X_{nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)$ where $\frac{1}{T} = 2f_o$. Let $Y_t = X_t^3$ for each $t$.
(a) Is $Y$ stationary? Express $R_Y$ in terms of $R_X$, and $S_Y$ in terms of $S_X$ and/or $R_X$. (Hint: If $A$, $B$ are jointly Gaussian and mean zero, $\mathrm{Cov}(A^3, B^3) = 6\mathrm{Cov}(A, B)^3 + 9E[A^2]E[B^2]\mathrm{Cov}(A, B)$.)
(b) At what rate $\frac{1}{T'}$ should $Y$ be sampled in order that $Y_t = \sum_{n=-\infty}^{\infty}Y_{nT'}\,\mathrm{sinc}\left(\frac{t - nT'}{T'}\right)$?
(c) Can $Y$ be recovered with fewer samples than in part (b)? Explain.

8.16 An approximation of white noise
White noise in continuous time can be approximated by a piecewise constant process as follows. Let $T$ be a small positive constant, let $A_T$ be a positive scaling constant depending on $T$, and let $(B_k : k \in \mathbb{Z})$ be a discrete-time white noise process with $R_B(k) = \sigma^2 I_{\{k=0\}}$. Define $(N_t : t \in \mathbb{R})$ by $N_t = A_T B_k$ for $t \in [kT, (k + 1)T)$.
(a) Sketch a typical sample path of $N$ and express $E\left[\left|\int_0^1 N_s\,ds\right|^2\right]$ in terms of $A_T$, $T$ and $\sigma^2$. For simplicity assume that $T = \frac{1}{K}$ for some large integer $K$.
(b) What choice of $A_T$ makes the expectation found in part (a) equal to $\sigma^2$? This choice makes $N$ a good approximation to a continuous-time white noise process with autocorrelation function $\sigma^2\delta(\tau)$.
(c) What happens to the expectation found in part (a) as $T \to 0$ if $A_T = 1$ for all $T$?

8.17 Simulating a baseband random process
Suppose a real-valued Gaussian baseband process $X = (X_t : t \in \mathbb{R})$ with mean zero and power spectral density
$$S_X(2\pi f) = \begin{cases} 1 & \text{if } |f| \le 0.5 \\ 0 & \text{else} \end{cases}$$
is to be simulated over the time interval $[-500, 500]$ through use of the sampling theorem with sampling time $T = 1$.
(a) What is the joint distribution of the samples, $X_n : n \in \mathbb{Z}$?
(b) Of course a computer cannot generate infinitely many random variables in a finite amount of time. Therefore, consider approximating $X$ by $X^{(N)}$ defined by
$$X_t^{(N)} = \sum_{n=-N}^{N}X_n\,\mathrm{sinc}(t - n).$$
Find a condition on $N$ to guarantee that $E[(X_t - X_t^{(N)})^2] \le 0.01$ for $t \in [-500, 500]$. (Hint: Use $|\mathrm{sinc}(\tau)| \le \frac{1}{\pi|\tau|}$ and bound the series by an integral. Your choice of $N$ should not depend on $t$ because the same $N$ should work for all $t$ in the interval $[-500, 500]$.)

8.18 Filtering to maximize signal to noise ratio
Let $X$ and $N$ be continuous time, mean zero WSS random processes.
Suppose that $X$ has power spectral density $S_X(\omega) = |\omega|I_{\{|\omega| \le \omega_o\}}$, and that $N$ has power spectral density $S_N(\omega) = \sigma^2$ for all $\omega$. Suppose also that $X$ and $N$ are uncorrelated with each other. Think of $X$ as a signal, and $N$ as noise. Suppose the signal plus noise $X + N$ is passed through a linear time-invariant filter with transfer function $H$, which you are to specify. Let $\tilde{X}$ denote the output signal and $\tilde{N}$ denote the output noise. What choice of $H$, subject to the constraints (i) $|H(\omega)| \le 1$ for all $\omega$, and (ii) (power of $\tilde{X}$) $\ge$ (power of $X$)/2, minimizes the power of $\tilde{N}$?

8.19 Finding the envelope of a deterministic signal
(a) Find the complex envelope $z(t)$ and real envelope $|z(t)|$ of $x(t) = \cos(2\pi(1000)t) + \cos(2\pi(1001)t)$, using the carrier frequency $f_c = 1000.5$ Hz. Simplify your answer as much as possible.
(b) Repeat part (a), using $f_c = 995$ Hz. (Hint: The real envelope should be the same as found in part (a).)
(c) Explain why, in general, the real envelope of a narrowband signal does not depend on which frequency $f_c$ is used to represent the signal (as long as $f_c$ is chosen so that the upper band of the signal is contained in an interval $[f_c - a, f_c + a]$ with $a \ll f_c$).

8.20 Sampling a signal or process that is not band limited
(a) Fix $T > 0$ and let $\omega_o = \pi/T$. Given a finite energy signal $x$, let $x^o$ be the band-limited signal with Fourier transform $\widehat{x}^o(\omega) = I_{\{|\omega| \le \omega_o\}}\sum_{n=-\infty}^{\infty}\widehat{x}(\omega + 2n\omega_o)$. Show that $x(nT) = x^o(nT)$ for all integers $n$.
(b) Explain why $x^o(t) = \sum_{n=-\infty}^{\infty}x(nT)\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)$.
(c) Let $X$ be a mean zero WSS random process, and let $R_X^o$ be the autocorrelation function for power spectral density $S_X^o(\omega)$ defined by $S_X^o(\omega) = I_{\{|\omega| \le \omega_o\}}\sum_{n=-\infty}^{\infty}S_X(\omega + 2n\omega_o)$. Show that $R_X(nT) = R_X^o(nT)$ for all integers $n$.
(d) Explain why the random process $Y$ defined by $Y_t = \sum_{n=-\infty}^{\infty}X_{nT}\,\mathrm{sinc}\left(\frac{t - nT}{T}\right)$ is WSS with autocorrelation function $R_X^o$.
(e) Find $S_X^o$ in case $S_X(\omega) = \exp(-\alpha|\omega|)$ for $\omega \in \mathbb{R}$.
8.21 A narrowband Gaussian process
Let $X$ be a real-valued stationary Gaussian process with mean zero and $R_X(\tau) = \cos(2\pi(30\tau))(\mathrm{sinc}(6\tau))^2$.
(a) Find and carefully sketch the power spectral density of $X$.
(b) Sketch a sample path of $X$.
(c) The process $X$ can be represented by $X_t = \mathrm{Re}(Z_t e^{2\pi j 30 t})$, where $Z_t = U_t + jV_t$ for jointly stationary narrowband real-valued random processes $U$ and $V$. Find the spectral densities $S_U$, $S_V$, and $S_{UV}$.
(d) Find $P[|Z_{33}| > 5]$. Note that $|Z_t|$ is the real envelope process of $X$.

8.22 Another narrowband Gaussian process
Suppose a real-valued Gaussian random process $R = (R_t : t \in \mathbb{R})$ with mean 2 and power spectral density $S_R(2\pi f) = e^{-|f|/10^4}$ is fed through a linear time-invariant system with transfer function
$$H(2\pi f) = \begin{cases} 0.1 & 5000 \le |f| \le 6000 \\ 0 & \text{else} \end{cases}$$
(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$.
(b) Find $P[X_{25} > 6]$.
(c) The random process $X$ is a narrowband random process. Find the power spectral densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$ and $V$ so that $X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t)$, using $f_c = 5500$.
(d) Repeat part (c) with $f_c = 5000$.

8.23 Another narrowband Gaussian process (version 2)
Suppose a real-valued Gaussian white noise process $N$ (we assume white noise has mean zero) with power spectral density $S_N(2\pi f) \equiv \frac{N_o}{2}$ for $f \in \mathbb{R}$ is fed through a linear time-invariant system with transfer function $H$ specified as follows, where $f$ represents the frequency in gigahertz (GHz) and a gigahertz is $10^9$ cycles per second.
$$H(2\pi f) = \begin{cases} 1 & 19.10 \le |f| \le 19.11 \\ \frac{19.12 - |f|}{0.01} & 19.11 \le |f| \le 19.12 \\ 0 & \text{else} \end{cases}$$
(a) Find the mean and power spectral density of the output process $X = (X_t : t \in \mathbb{R})$.
(b) Express $P[X_{25} > 2]$ in terms of $N_o$ and the standard normal complementary CDF function $Q$.
(c) The random process $X$ is a narrowband random process.
Find and sketch the power spectral densities $S_U$, $S_V$ and the cross spectral density $S_{UV}$ of jointly WSS baseband random processes $U$ and $V$ so that $X_t = U_t \cos(2\pi f_c t) - V_t \sin(2\pi f_c t)$, using $f_c = 19.11$ GHz.
(d) The complex envelope process is given by $Z = U + jV$ and the real envelope process is given by $|Z|$. Specify the distributions of $Z_t$ and $|Z_t|$ for $t$ fixed.

8.24 Declaring the center frequency for a given random process
Let $a > 0$ and let $g$ be a nonnegative function on $\mathbb{R}$ which is zero outside of the interval $[a, 2a]$. Suppose $X$ is a narrowband WSS random process with power spectral density function $S_X(\omega) = g(|\omega|)$, or equivalently, $S_X(\omega) = g(\omega) + g(-\omega)$. The process $X$ can thus be viewed as a narrowband signal for carrier frequency $\omega_c$, for any choice of $\omega_c$ in the interval $[a, 2a]$. Let $U$ and $V$ be the baseband random processes in the usual complex envelope representation: $X_t = \mathrm{Re}((U_t + jV_t)e^{j\omega_c t})$.
(a) Express $S_U$ and $S_{UV}$ in terms of $g$ and $\omega_c$.
(b) Describe which choice of $\omega_c$ minimizes $\int_{-\infty}^{\infty} |S_{UV}(\omega)|^2 \frac{d\omega}{2\pi}$. (Note: If $g$ is symmetric around some frequency $\nu$, then $\omega_c = \nu$. But what is the answer otherwise?)

8.25 * Cyclostationary random processes
A random process $X = (X_t : t \in \mathbb{R})$ is said to be cyclostationary with period $T$, if whenever $s$ is an integer multiple of $T$, $X$ has the same finite dimensional distributions as $(X_{t+s} : t \in \mathbb{R})$. This property is weaker than stationarity, because stationarity requires equality of finite dimensional distributions for all real values of $s$.
(a) What properties of the mean function $\mu_X$ and autocorrelation function $R_X$ does any second order cyclostationary process possess? A process with these properties is called a wide sense cyclostationary process.
(b) Suppose $X$ is cyclostationary and that $U$ is a random variable independent of $X$ that is uniformly distributed on the interval $[0, T]$. Let $Y = (Y_t : t \in \mathbb{R})$ be the random process defined by $Y_t = X_{t+U}$.
Argue that $Y$ is stationary, and express the mean and autocorrelation function of $Y$ in terms of the mean function and autocorrelation function of $X$. Although $X$ is not necessarily WSS, it is reasonable to define the power spectral density of $X$ to equal the power spectral density of $Y$.
(c) Suppose $B$ is a stationary discrete-time random process and that $g$ is a deterministic function. Let $X$ be defined by
$$X_t = \sum_{n=-\infty}^{\infty} g(t - nT)B_n.$$
Show that $X$ is a cyclostationary random process. Find the mean function and autocorrelation function of $X$ in terms of $g$, $T$, and the mean and autocorrelation function of $B$. If your answer is complicated, identify special cases which make the answer nice.
(d) Suppose $Y$ is defined as in part (b) for the specific $X$ defined in part (c). Express the mean $\mu_Y$, autocorrelation function $R_Y$, and power spectral density $S_Y$ in terms of $g$, $T$, $\mu_B$, and $S_B$.

8.26 * Zero crossing rate of a stationary Gaussian process
Consider a zero-mean stationary Gaussian random process $X$ with $S_X(2\pi f) = |f| - 50$ for $50 \le |f| \le 60$, and $S_X(2\pi f) = 0$ otherwise. Assume the process has continuous sample paths (it can be shown that such a version exists.) A zero crossing from above is said to occur at time $t$ if $X(t) = 0$ and $X(s) > 0$ for all $s$ in an interval of the form $[t - \epsilon, t)$ for some $\epsilon > 0$. Determine the mean rate of zero crossings from above for $X$. If you can find an analytical solution, great. Alternatively, you can estimate the rate (aim for three significant digits) by Monte Carlo simulation of the random process.

Chapter 9 Wiener filtering

9.1 Return of the orthogonality principle
Consider the problem of estimating a random process $X$ at some fixed time $t$ given observation of a random process $Y$ over an interval $[a, b]$. Suppose both $X$ and $Y$ are mean zero second order random processes and that the mean square error is to be minimized.
Let $\widehat{X}_t$ denote the best linear estimator of $X_t$ based on the observations $(Y_s : a \le s \le b)$. In other words, define
$$\mathcal{V}_o = \{c_1 Y_{s_1} + \cdots + c_n Y_{s_n} : \text{for some constants } n,\ a \le s_1, \ldots, s_n \le b, \text{ and } c_1, \ldots, c_n\}$$
and let $\mathcal{V}$ be the m.s. closure of $\mathcal{V}_o$, which includes $\mathcal{V}_o$ and any random variable that is the m.s. limit of a sequence of random variables in $\mathcal{V}_o$. Then $\widehat{X}_t$ is the random variable in $\mathcal{V}$ that minimizes the mean square error, $E[|X_t - \widehat{X}_t|^2]$. By the orthogonality principle, the estimator $\widehat{X}_t$ exists and it is unique in the sense that any two solutions to the estimation problem are equal with probability one.

Perhaps the most useful part of the orthogonality principle is that a random variable $W$ is equal to $\widehat{X}_t$ if and only if (i) $W \in \mathcal{V}$ and (ii) $(X_t - W) \perp Z$ for all $Z \in \mathcal{V}$. Equivalently, $W$ is equal to $\widehat{X}_t$ if and only if (i) $W \in \mathcal{V}$ and (ii) $(X_t - W) \perp Y_u$ for all $u \in [a, b]$. Furthermore, the minimum mean square error (i.e. the error for the optimal estimator $\widehat{X}_t$) is given by $E[|X_t|^2] - E[|\widehat{X}_t|^2]$.

Note that m.s. integrals of the form $\int_a^b h(t, s)Y_s\,ds$ are in $\mathcal{V}$, because m.s. integrals are m.s. limits of finite linear combinations of the random variables of $Y$. Typically the set $\mathcal{V}$ is larger than the set of all m.s. integrals of $Y$. For example, if $u$ is a fixed time in $[a, b]$ then $Y_u \in \mathcal{V}$. In addition, if $Y$ is m.s. differentiable, then $Y'_u$ is also in $\mathcal{V}$. Typically neither $Y_u$ nor $Y'_u$ can be expressed as a m.s. integral of $(Y_s : s \in \mathbb{R})$. However, $Y_u$ can be obtained as an integral of the process $Y$ multiplied by a delta function, though the integration has to be taken in a generalized sense.

The integral $\int_a^b h(t, s)Y_s\,ds$ is the linear MMSE estimator if and only if
$$X_t - \int_a^b h(t, s)Y_s\,ds \perp Y_u \quad \text{for } u \in [a, b]$$
or equivalently
$$E\left[\left(X_t - \int_a^b h(t, s)Y_s\,ds\right)Y_u^*\right] = 0 \quad \text{for } u \in [a, b]$$
or equivalently
$$R_{XY}(t, u) = \int_a^b h(t, s)R_Y(s, u)\,ds \quad \text{for } u \in [a, b].$$
Suppose now that the observation interval is the whole real line $\mathbb{R}$ and suppose that $X$ and $Y$ are jointly WSS.
Then for t and v fixed, the problem of estimating Xt from (Ys : s ∈ R) is the same as the problem of estimating Xt+v from (Ys+v : s ∈ R). Therefore, if h(t, s) for t fixed is the optimal function to use for estimating Xt from (Ys : s ∈ R), then it is also the optimal function to use for estimating Xt+v from (Ys+v : s ∈ R). Therefore, h(t, s) = h(t + v, s + v), so that h(t, s) is a function of t−s alone, meaning that the optimal impulse response function h corresponds to a time∞ ˆ invariant system. Thus, we seek to find an optimal estimator of the form Xt = −∞ h(t − s)Ys ds. The optimality condition becomes ∞ Xt − −∞ h(t − s)Ys ds ⊥ Yu for u ∈ R which is equivalent to the condition ∞ RXY (t − u) = −∞ h(t − s)RY (s − u)ds for u ∈ R or RXY = h∗RY . In the frequency domain the optimality condition becomes SXY (ω) = H(ω)SY (ω) for all ω. Consequently, the optimal filter H is given by H(ω) = SXY (ω) SY (ω) ∞ and the corresponding minimum mean square error is given by E[|Xt − Xt |2 ] = E[|Xt |2 ] − E[|Xt |2 ] = −∞ SX (ω) − |SXY (ω)|2 SY (ω) dω 2π Example 9.1.1 Consider estimating a random process from observation of the random process plus noise, as shown in Figure 9.1. Assume that X and N are jointly WSS with mean zero. Suppose X + N Y h X Figure 9.1: An estimator of a signal from signal plus noise, as the output of a linear filter. X and N have known autocorrelation functions and suppose that RXN ≡ 0, so the variables of the process X are uncorrelated with the variables of the process N . The observation process is given by Y = X + N . Then SXY = SX and SY = SX + SN , so the optimal filter is given by H(ω) = SXY (ω) SX (ω) = SY (ω) SX (ω) + SN (ω) . and suppose SXN ≡ 0. RN (τ ) = e−2|τ | and RXN ≡ 0. note that 8 12 12 + ω2 4 + ω2 1 5 5 = 5 + = + 8 + 5ω 2 8 + 5ω 2 8 + 5ω 2 5 8 + 5ω 2 1 We know that 5 δ(t) ↔ 1 .1. but seek the best linear estimator of Xt given (Ys : s ≤ t). Also. 5 e−α|t| ↔ so 1 = 8 + 5ω 2 1 5 8 5 2α .9. Equivalently. 
Seeking an estimator of the form ∞ −|τ | Xt = −∞ h(t − s)Ys ds we find from the previous example that the transform H of h is given by SX (ω) = H(ω) = SX (ω) + SN (ω) 1 1+ω 2 1 1+ω 2 + 4 4+ω 2 = 4 + ω2 8 + 5ω 2 We will find h by finding the inverse transform of H.1) + ω2 = 1 5·2 5 8 ( 8 + ω2) 5 3 √ 5 10 ↔ 1 √ 4 10 e q − 8 |t| 5 Therefore the optimal filter is given in the time domain by 1 h(t) = δ(t) + 5 e q − 8 |t| 5 The associated minimum mean square error is given by (one way to do the integration is to use the ∞ fact that if k ↔ K then −∞ K(ω) dω = k(0)): 2π ∞ E[|Xt − Xt |2 ] = −∞ SX (ω)SN (ω) dω = SX (ω) + SN (ω) 2π ∞ −∞ 4 dω =4 2 2π 8 + 5ω 1 √ 4 10 1 =√ 10 In an example later in this chapter we will return to the same random processes.1. We seek the optimal linear estimator of Xt given (Ys : s ∈ R).2 This example is a continuation of the previous example. ω 2 + α2 2 8 5 (9. First. for a particular choice of power spectral densities. where Y = X + N . RX (τ ) = e 2 . Suppose that the signal process X is WSS with mean zero and power 1 spectral density SX (ω) = 1+ω2 . suppose the noise process N is WSS with mean zero and power 4 spectral density 4+ω2 . RETURN OF THE ORTHOGONALITY PRINCIPLE The associated minimum mean square error is given by ∞ 281 E[|Xt − Xt |2 ] = −∞ ∞ = −∞ SX (ω)2 SX (ω) + SN (ω) SX (ω)SN (ω) dω SX (ω) + SN (ω) 2π SX (ω) − dω 2π Example 9. for any α > 0. A transfer function H is said to be of positive type if the corresponding impulse response function h is causal. The following properties hold: . Indeed.2) (9. In the case of a linear.3 Causal functions and spectral factorization A function h on R is said to be causal if h(τ ) = 0 for τ < 0. Then [H]+ is called the positive part of H and [H]− is called the negative part of H.282 CHAPTER 9. Any transfer function can be written as the sum of a positive type transfer function and a negative type transfer function. Note that if Y is the same process as X and T > 0. 
the orthogonality principle implies that the estimator is optimal if and only if it satisfies ∞ Xt+T − −∞ h(t − s)Ys ds ⊥ Yu for u ≤ t which is equivalent to the condition ∞ RXY (t + T − u) = −∞ h(t − s)RY (s − u)ds for u ≤ t or RXY (t + T − u) = h ∗ RY (t − u).3) Equations (9. We shall show how to solve them in the case the power spectral densities are rational functions by using the method of spectral factorization. Let Xt+T |t be the minimum mean square error linear estimate of Xt+T given (Ys : s ≤ t). a fixed parameter T is introduced. Simply let u(t) = I{t≥0} and notice that h(t) is the sum of the causal function u(t)h(t) and the anticausal function (1 − u(t))h(t). we have the representation h = uh + (1 − u)h. then we are addressing the problem of predicting Xt+T from (Xs : s ≤ t). The next section describes some of the tools needed for the solution.2) and (9. WIENER FILTERING 9. Suppose X and Y are mean zero and jointly WSS. s) = 0 for s > t. and H is said to be of negative type if the corresponding impulse response function is anticausal. Any function h on R can be expressed as the sum of a causal function and an anticausal function as follows. and it is said to be anticausal if h(τ ) = 0 for τ > 0.3) are called the Wiener-Hopf equations. ∞ An estimator of the form −∞ h(t − s)Ys ds is sought such that h corresponds to a causal system. More compactly. For convenience in applications. That is to say that the impulse response function satisfies h(t. time-invariant system. In this section we will consider estimates of X given Y obtained by passing Y through a causal linear time-invariant system. causality means that the impulse response function satisfies h(τ ) = 0 for τ < 0. the problem is to find an impulse response function h satisfying: RXY (τ + T ) = h ∗ RY (τ ) h(v) = 0 for τ ≥ 0 for v < 0 (9. 
Setting τ = t − u and combining this optimality condition with the constraint that h is a causal function.2 The causal Wiener filtering problem A linear system is causal if the value of the output at any given time does not depend on the future of the input. suppose H is the Fourier transform of an impulse response function h. 9. Define [H]+ to be the Fourier transform of uh and [H]− to be the Fourier transform of (1 − u)h. Once again. Up to this point in these notes. ug and (1 − u)g are g(t) t u(t)g(t) t (1−u(t))g(t) t Figure 9. and its continuation defined for complex ω.3. if k is any causal function. it follows that [H]+ is the positive type function that is closest to H in the L2 norm.9. By Parseval’s relation. However.3. [H]− is the projection of H onto the space of negative type functions. for the purposes of factorization to be covered later.2: Decomposition of a two-sided exponential function. then ∞ 0 ∞ |h(t) − k(t)|2 dt = −∞ −∞ 0 |h(t)|2 dt + 0 |h(t) − k(t)|2 dt (9. CAUSAL FUNCTIONS AND SPECTRAL FACTORIZATION • H = [H]+ + [H]− (because h = uh + (1 − u)h) • [H]+ = H if and only if H is positive type • [H]− = 0 if and only if H is positive type • [[H]+ ]− = 0 for any H • [[H]+ ]+ = [H]+ and [[H]− ]− = [H]− • [H + G]+ = [H]+ + [G]+ and [H + G]− = [H]− + [G]− 283 Note that uh is the casual function that is closest to h in the L2 norm. We use the same notation H(ω) for the function H defined for real values of ω only.4) if and only if k = uh (except possibly on a set of measure zero).1 Let g(t) = e−α|t| for a constant α > 0. Similarly. uh is the projection of h onto the space of causal functions. Fourier transforms have been defined for real values of ω only. and consideration of transforms for complex ω. it is useful to consider the analytic continuation of the Fourier transforms to larger sets in C. Equivalently. The functions g. The following examples illustrate the use of the projections [ ]+ and [ ]− .4) ≥ −∞ |h(t)|2 dt and equality holds in (9. 
[H]+ is the projection of H onto the space of positive type functions. That is. Indeed. . Example 9. WIENER FILTERING pictured in Figure 9.3. Equivalently. the pole of [G]+ is in the upper half plane. Thus γ1 = γ2 = γ3 = 1 − ω2 (jω + 3)(jω − 2) 1 − ω2 (jω + 1)(jω − 2) 1 − ω2 (jω + 1)(jω + 3) = jω=−1 1 + (−1)2 1 =− (−1 + 3)(−1 − 2) 3 1 + 32 =1 (−3 + 1)(−3 − 2) = jω=−3 = jω=2 1 + 22 1 = (2 + 1)(2 + 3) 3 . The other constants are found similarly.2. By the theory of partial fraction expansions in complex analysis. The numerator of G has no factors in common with the denominator. More generally.284 CHAPTER 9. and the degree of the numerator is smaller than the degree of the denominator. it therefore follows that G can be written as γ1 γ2 γ3 G(ω) = + + jω + 1 jω + 3 jω − 2 In order to identify γ1 . Then N1 [G]+ (ω) = n=1 γn jω + αn N [G]− (ω) = n=N1 +1 γn −jω + αn Example 9. so that the imaginary part of the pole of [G]+ is positive. suppose that G(ω) has the representation N1 ∞ G(ω) = n=1 γn + jω + αn N n=N1 +1 γn −jω + αn where Re(αn ) > 0 for all n.2 Let G be given by G(ω) = 1 − ω2 (jω + 1)(jω + 3)(jω − 2) Note that G has only three simple poles. The corresponding transforms are given by: 1 jω + α 0 0 1 [G]− (ω) = eαt e−jωt dt = −jω + α −∞ 2α G(ω) = [G]+ (ω) + [G]− (ω) = 2 ω + α2 [G]+ (ω) = e−αt e−jωt dt = Note that [G]+ has a pole at ω = jα. for example. multiply both expressions for G by (jω + 1) and then let jω = −1. 9.3. We can also find [G]− by computing the transform of (1 − u(t))g(t) (still assuming that T ≤ 0): 0 [G]− (ω) = T e α(T −t) −jωt e eαT −(α+jω)t dt = −(α + jω) 0 = t=T e−jωT − eαT (jω + α) Example 9. [G]+ (ω) = − 1 1 + 3(jω + 1) jω + 3 and [G]− (ω) = 1 3(jω − 2) 285 e Example 9.3. First. and therefore [G]+ = G and [G]− = 0. if T ≥ 0. then g is causal.3. Multiplication by e−jωT in the frequency domain represents a shift by T in the time domain. T>0: g(t) t T T<0: g(t) t T Figure 9. CAUSAL FUNCTIONS AND SPECTRAL FACTORIZATION Consequently. 
so that −jωT g(t) = e−α(t−T ) t ≥ T . Consider two cases. Let us unravel the notation and express ∞ 2 dω ejωT H(ω) + 2π −∞ . G is positive type. Second.3: Exponential function shifted by T. 0 t<T as pictured in Figure 9.3.4 Suppose H is the transfer function for impulse response function h. if T ≤ 0 then g(t)u(t) = αT eαT e−αt t ≥ 0 0 t<0 −jωT αT e −e so that [G]+ (ω) = jω+α and [G]− (ω) = G(ω) − [G]+ (ω) = e (jω+α) .3 Suppose that G(ω) = (jω+α) . . if and only if all the poles of H(ω) are in the upper half plane Im(ω) > 0. Without loss of generality. β1 . Similarly. Let us find [HK]− . Since HK is the transform of h ∗ k. βK are nonzero. The function SY . a positive type function H is minimum phase if and only if 1/H is also positive type. . (Note that the factor ejωT is used. α1 . being nonnegative.) Multiplication by ejωT in the frequency domain corresponds to shifting by −T in the time domain. we assume that {αi } ∩ {βj } = ∅. Thus. αN .5 Suppose [H]− = [K]− = 0. . The supposition implies that h and k are both causal functions. the definition of u. so SY = SY . Since polynomials can be factored over the complex numbers. Suppose that SY is the power spectral density of a WSS random process and that SY is a ∗ rational function. A function H is said to be rational if it can be written as the ratio of two polynomials. a rational function H can be expressed in the form H(ω) = γ (jω + β1 )(jω + β2 ) · · · (jω + βK ) (jω + α1 )(jω + α2 ) · · · (jω + αN ) for complex constants γ. if the denominator of SY has a factor of the form jω + α then the denominator must also have a factor of the form −jω + α∗ . Example 9. . let h denote the inverse transform of H. Equivalently. so that ejωT H(ω) ↔ h(t + T ) and thus ejωT H(ω) + ↔ u(t)h(t + T ) Applying Parseval’s identity. Next we turn to multiplicative decomposition.3. . . . The function H is positive type if and only if Re(αi ) > 0 for all i. . . Therefore the convolution h ∗ k is also a causal function.286 CHAPTER 9. αN . 
it follows that HK is a positive type function. concentrating on rational functions. The decomposition H = [H]+ + [H]− is an additive one. βK . . . As usual. or equivalently. . WIENER FILTERING in terms of h and T . rather than e−jωT as in the previous example. We also assume that the real parts of the constants α1 . . A positive type function H is said to have minimum phase if Re(βi ) > 0 for all i. is also real-valued. and a change of variables yields ∞ ejωT H(ω) −∞ 2 + dω 2π ∞ = −∞ ∞ |u(t)h(t + T )|2 dt |h(t + T )|2 dt 0 ∞ = = T |h(t)|2 dt The integral decreases from the energy of h to zero as T ranges from −∞ to ∞. β1 . Thus. . and k denote the inverse transform of K. . . if the numerator of SY has a factor of the form jω + β then the numerator must also have a factor of the form −jω + β ∗ . . [HK]− = 0. formally.6) is called the cepstrum method. Use of (9. At least formally. minimum phase function and SY is a negative type function with + ∗ − SY = (SY ) .6) is − indeed a positive type function. whereas spectral factorization has to do with products. there is a host of problems. SOLUTION OF THE CAUSAL WIENER FILTERING PROBLEM FOR RATIONAL POWER SPECTRAL D Example 9. Note that the operators [ ]+ and [ ]− give us an additive decomposition of a function H into the sum of a positive type and a negative type function. both numerical and analytical.9. 1+h+ h∗h h∗h∗h H2 H2 + · · · ↔ exp(H) = 1 + H + + ··· 2! 3! 2! 3! + so that if H is positive type. in using the method.7) + − + Suppose SY is factored as SY = SY SY such that SY is a minimum phase. and then exponentiating: SX (ω) = exp([ln SX (ω)]+ ) exp([ln SX (ω)]− ) . so that it will not be used further in these notes. Thus. and the factor SX is a negative type function. Since the product of − Y .4. Then SY and S1 are negative type functions.3. the factor SX in (9. doing an additive decomposition. then exp(H) is also positive type. the factorization can be accomplished by taking a logarithm. 
positive type transfer − + − function and SY = (SY )∗ . 9.6 The function SY given by SY (ω) = can be factored as SY (ω) = √ (jω + 5 8 5) 8 + 5ω 2 (1 + ω 2 )(4 + ω 2 ) √ (−jω + 5 8 5) (jω + 2)(jω + 1) + SY (ω) (−jω + 2)(−jω + 1) − SY (ω) (9.2) and ( 9. Unfortunately.3) can be formulated in the frequency domain as follows: Find a positive type transfer function H such that ejωT SXY − HSY + =0 (9. + SX (ω) − SX (ω) (9.6) Notice that if h ↔ H then.4 Solution of the causal Wiener filtering problem for rational power spectral densities The Wiener-Hopf equations (9.5) − + where SY is a positive type. The power spectral density of Y is given by SY (ω) = SX (ω) + SN (ω) = 8 + 5ω 2 (1 + ω 2 )(4 + ω 2 ) . were X is WSS with 1 mean zero and power spectral density SX (ω) = 1+ω2 . being the product of two positive type functions. We seek the optimal casual linear estimator of Xt given (Ys : s ≤ t). yielding the equivalent problem: − Find a positive type transfer function H such that ejωT SXY + − HSY − SY =0 + Y (9. WIENER FILTERING two negative type functions is again negative type. t + T ) b ∞ = RX (0) − −∞ ∗ h(s)RXY (s + T )ds.8) becomes ejωT SXY + − HSY = 0 − SY + Solving for H yields that the optimal transfer function is given by H= 1 ejωT SXY + − SY SY (9. but here a causal estimator is sought. which involves the optimal filter h. is the following: MMSE = E[(Xt+T − Xt+T |t )(Xt+T − Xt+T |t )∗ ] ∗ = E[(Xt+T − Xt+T |t )Xt+T ] = RX (0) − RXX (t.7) is equivalent to the equation obtained by multiplying the quantity within square brackets in (9. Exercise Evaluate the limit as T → −∞ and the limit as T → ∞ in (9.1.7) by S1 .10). (9. and SXN = 0.4.288 CHAPTER 9.8) + The function HSY . is itself positive type. + dω 2π (9. Thus (9.10) Another expression for the MMSE. The observed random process is Y = X + N .1 This example involves the same model as in an example in Section 9. Example 9. 
N is WSS with mean zero and power spectral 4 density SN (ω) = 4+ω2 .9) + The orthogonality principle yields that the mean square error satisfies E[|Xt+T − Xt+T |t |2 ] = E[|Xt+T |2 ] − E[|Xt+T |t |2 ] ∞ = RX (0) − −∞ ∞ |H(ω)|2 SY (ω) ejωT SXY − SY dω 2π 2 = RX (0) − −∞ + where we used the fact that |SY |2 = SY . (jω + 1)(−jω + 1) (−jω + 2) √ 5(jω + 1)(−jω + γ1 γ2 + jω + 1 −jω + 8 5 8 5) where γ1 = −jω + 2 √ 5(−jω + 8 5 ) jω=−1 =√ 8 5 3 √ 5+ 8 γ2 = Therefore −jω + 2 √ 5(jω + 1) − = √ q jω= 8 5 +2 √ 5+ 8 SXY (ω) − SY (ω) = + γ1 jω + 1 2− 8 5 8 5 (9.9. the minimum mean square error is given by ∞ E[|Xt − Xt |2 ] = RX (0) − −∞ 2 γ1 dω 1 γ2 = − 1 ≈ 0. and (9.5). (9. 3 . yielding SY and SY .11). by (9.1). which is Xt .11) and thus γ1 (jω + 2) 3 √ 1 + H(ω) = √ = 8 5 + 2 10 5(jω + 5 ) jω + so that the optimal causal filter is h(t) = 3 √ 5 + 2 10 δ(t) + (2 − 8 −t )u(t)e 5 q 8 5 Finally.4. Since RXN = 0 it follows that SXY (ω) = SX (ω) = Therefore SXY (ω) − SY (ω) = = 1 .3246 1 + ω 2 2π 2 2 which is slightly larger than √1 10 ≈ 0.10) with T = 0. and slightly smaller than 1 . the MMSE for the best “instantaneous” 3 estimator of Xt given Yt . SOLUTION OF THE CAUSAL WIENER FILTERING PROBLEM FOR RATIONAL POWER SPECTRAL D + − and its spectral factorization is given by (9. the MMSE found for the best noncausal estimator (see the example in Section 9.3162.1). This leads to the pure prediction problem. and SY = SX ): H= 1 + jωT + S e SX X + (9. 
yields that the factorization of SX is given by SX (ω) = 1 1 (jω + (1 + j))(jω + (1 − j)) (−jω + (1 + j))(−jω + (1 − j)) + SX (ω) − SX (ω) so that + SX (ω) = = where γ1 = γ2 = 1 (jω + (1 + j))(jω + (1 − j)) γ1 γ2 + jω + (1 + j) jω + (1 − j) 1 jω + (1 − j) 1 jω + (1 + j) = jω=−(1+j) j 2 = jω=−1+j −j 2 + yielding that the inverse Fourier transform of SX is given by + SX ↔ j −(1+j)t j e u(t) − e−(1−j)t u(t) 2 2 j −(1+j)(t+T ) 2e j − 2 e−(1−j)(t+T ) t ≥ −T 0 else Hence + SX (ω)ejωT ↔ so that + SX (ω)ejωT + = je−(1+j)T je−(1−j)T − 2(jω + (1 + j)) 2(jω + (1 − j)) .2 A special case of the causal filtering problem formulated above is when the observed process Y is equal to X itself. we have (ω 2 + 2j) = (ω + 1 + j)(ω − 1 − j). suppose that SX (ω) = ω41 . Since +4 2j = (1 + j)2 . Then the optimal linear predictor of Xt+T given (Xs : s ≤ t) corresponds to a linear time-invariant system with transfer function H given by + + − − (because SXY = SX . Factoring the term (ω 2 − 2j) in a similar way.290 CHAPTER 9. SY = SX . WIENER FILTERING Example 9.12) To be more specific. SY = SX .4. and rearranging terms as needed. Let X be a WSS mean zero random process and let T > 0. Observe that ω 4 + 4 = (ω 2 + 2j)(ω 2 − 2j). the z-transform of RX is denoted by SX .5.5 Discrete time Wiener filtering Causal Wiener filtering for discrete-time random processes can be handled in much the same way that it is handled for continuous time random processes. the z-transform H restricted to the unit circle in C is equivalent to the Fourier transform H on [0. 2π]. The z-transform H is said to be positive type if h is causal. Thus. if X and Y are jointly WSS then the z-transform of RXY is . The projection [H]+ is defined as it was for Fourier transforms–it is the z transform of the function u(k)h(k). Setting z = ejω yields the Fourier transform: H(ω) = H(ejω ) for 0 ≤ ω ≤ 2π. then lim|z|→∞ H(z) = h(0). and H(z) for other z ∈ C is an analytic continuation of its values on the unit circle. 
but first the topic of spectral factorization for discrete-time processes is discussed. Spectral factorization for discrete time processes naturally involves z-transforms. Let h(k) = h∗ (−k) as before.12) for the optimal transfer function yields H(ω) = je−(1+j)T (jω + (1 − j)) je−(1−j)T (jω + (1 + j)) − 2 2 ejT (1 + j) − e−jT (1 − j) jω(ejT − e−jT ) + = e−T 2j 2j = e−T [cos(T ) + sin(T ) + jω sin(T )] so that the optimal predictor for this example is given by Xt+T |t = Xt e−T (cos(T ) + sin(T )) + Xt e−T sin(T ) 291 9.) If X is a discrete-time WSS random process with correlation function RX . Then the z-transform of h is related to the z-transform H of h as follows: ∞ ∞ ∞ ∞ ∗ h(k)z k=−∞ −k = k=−∞ h (−k)z ∗ −k = l=−∞ h (l)z = l=−∞ ∗ l h(l)(1/z ) ∗ −l = H∗ (1/z ∗ ) The impulse response function h is called causal if h(k) = 0 for k < 0. Note that if H is positive type. where u(k) = I{k≥0} . Both of these approaches will be discussed in this section.9. Similarly. The z transform of a function (hk : k ∈ Z) is given by ∞ H(z) = k=−∞ h(k)z −k for z ∈ C. An alternative approach can be based on the use of whitening filters and linear innovations sequences. DISCRETE TIME WIENER FILTERING The formula (9. (We will not need to define or use [ ]− for discrete time functions. Taking the later approach. Then SY (z) = H(z)H∗ (1/z ∗ )SX (z) = 1 (1 − ρ/z)(1 − ρ∗ z) The autocorrelation function RY can be found either in the time domain using RY = h ∗ h ∗ RW or by inverting the z-transform SY . We assume that the numerator and denominator have no zeros . factor out z and use the method of partial fraction expansion to obtain SY (z) = z (z − ρ)(1 − ρ∗ z) 1 1 = z + 2 )(z − ρ) ∗ ) − ρ)(1 − ρ∗ z) (1 − |ρ| ((1/ρ 1 1 zρ∗ = + (1 − |ρ|2 ) 1 − ρ/z 1 − ρ∗ z ρk 1−|ρ|2 (ρ∗ )−k 1−|ρ|2 which is the z-transform of RY (k) = k≥0 k<0 The z-transform SY of RY converges absolutely for |ρ| < z < 1/|ρ|. WIENER FILTERING denoted by SXY . 
Recall that if $Y$ is the output random process when $X$ is passed through a linear time-invariant system with impulse response function $h$, then $X$ and $Y$ are jointly WSS and

$$R_{YX} = h * R_X \qquad R_{XY} = \widetilde{h} * R_X \qquad R_Y = h * \widetilde{h} * R_X$$

(where $\widetilde{h}(k) = h^*(-k)$), which in the z-transform domain becomes:

$$S_{YX}(z) = H(z)S_X(z) \qquad S_{XY}(z) = H^*(1/z^*)S_X(z) \qquad S_Y(z) = H(z)H^*(1/z^*)S_X(z)$$

Example 9.5.1 Suppose $Y$ is the output process when white noise $W$ with $R_W(k) = I_{\{k=0\}}$ is passed through a linear time invariant system with impulse response function $h(k) = \rho^k I_{\{k \ge 0\}}$, where $\rho$ is a complex constant with $|\rho| < 1$. Let us find $H$, $S_Y$, and $R_Y$. To begin,

$$H(z) = \sum_{k=0}^{\infty} (\rho/z)^k = \frac{1}{1 - \rho/z}$$

and the z-transform of $\widetilde{h}$ is $\frac{1}{1 - \rho^* z}$. Note that the z-transform for $h$ converges absolutely for $|z| > |\rho|$, whereas the z-transform for $\widetilde{h}$ converges absolutely for $|z| < 1/|\rho|$.

Suppose that $H(z)$ is a rational function of $z$, meaning that it is a ratio of two polynomials of $z$ with complex coefficients. Assume that the numerator and denominator polynomials have no zeros in common, and that neither has a root on the unit circle. The function $H$ is positive type (the z-transform of a causal function) if its poles (the zeros of its denominator polynomial) are inside the unit circle in the complex plane. If $H$ is positive type and if its zeros are also inside the unit circle, then $h$ and $H$ are said to be minimum phase functions (in the time domain and z-transform domain, respectively). A positive-type, minimum phase function $H$ has the property that both $H$ and its inverse $1/H$ are causal functions. Two linear time-invariant systems in series, one with transfer function $H$ and one with transfer function $1/H$, pass all signals. Thus if $H$ is positive type and minimum phase, we say that $H$ is causal and causally invertible.

Assume that $S_Y$ corresponds to a WSS random process $Y$ and that $S_Y$ is a rational function with no poles or zeros on the unit circle in the complex plane. We shall investigate the symmetries of $S_Y$, with an eye towards its factorization. First,

$$R_Y = \widetilde{R}_Y \quad \text{so that} \quad S_Y(z) = S_Y^*(1/z^*) \qquad (9.13)$$

Therefore, if $z_0$ is a pole of $S_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a pole. Similarly, if $z_0$ is a zero of $S_Y$ with $z_0 \ne 0$, then $1/z_0^*$ is also a zero of $S_Y$. These observations imply that $S_Y$ can be uniquely factored as

$$S_Y(z) = S_Y^+(z)S_Y^-(z)$$

such that for some constant $\beta > 0$:

• $S_Y^+$ is a minimum phase, positive type z-transform
• $S_Y^-(z) = (S_Y^+(1/z^*))^*$
• $\lim_{|z| \to \infty} S_Y^+(z) = \beta$

There is an additional symmetry if $R_Y$ is real-valued:

$$S_Y(z) = \sum_{k=-\infty}^{\infty} R_Y(k)z^{-k} = \sum_{k=-\infty}^{\infty} \left(R_Y(k)(z^*)^{-k}\right)^* = S_Y^*(z^*) \quad \text{(for real-valued } R_Y\text{)} \qquad (9.14)$$

Therefore, if $R_Y$ is real and if $z_0$ is a nonzero pole of $S_Y$, then $z_0^*$ is also a pole. Combining (9.13) and (9.14) yields that if $R_Y$ is real then the real-valued nonzero poles of $S_Y$ come in pairs: $z_0$ and $1/z_0$, and the other nonzero poles of $S_Y$ come in quadruples: $z_0$, $z_0^*$, $1/z_0$, and $1/z_0^*$. A similar statement concerning the zeros of $S_Y$ also holds true. Some example factorizations are as follows (where $|\rho| < 1$ and $\beta > 0$):

$$S_Y(z) = \underbrace{\frac{\beta}{1 - \rho/z}}_{S_Y^+(z)} \; \underbrace{\frac{\beta}{1 - \rho^* z}}_{S_Y^-(z)}$$

$$S_Y(z) = \underbrace{\frac{\beta(1 - .8/z)}{(1 - .6/z)(1 - .7/z)}}_{S_Y^+(z)} \; \underbrace{\frac{\beta(1 - .8z)}{(1 - .6z)(1 - .7z)}}_{S_Y^-(z)}$$

$$S_Y(z) = \underbrace{\frac{\beta}{(1 - \rho/z)(1 - \rho^*/z)}}_{S_Y^+(z)} \; \underbrace{\frac{\beta}{(1 - \rho z)(1 - \rho^* z)}}_{S_Y^-(z)}$$

An important application of spectral factorization is the generation of a discrete-time WSS random process with a specified correlation function $R_Y$. The idea is to start with a discrete-time white noise process $W$ with $R_W(k) = I_{\{k=0\}}$, or equivalently, with $S_W(z) \equiv 1$, and then pass it through an appropriate linear, time-invariant system. The appropriate filter is given by taking $H(z) = S_Y^+(z)$, for then the spectral density of the output is indeed given by

$$H(z)H^*(1/z^*)S_W(z) = S_Y^+(z)S_Y^-(z) = S_Y(z)$$

The spectral factorization can be used to solve the causal filtering problem in discrete time. Arguing just as in the continuous time case, we find that if $X$ and $Y$ are jointly WSS random processes, then the best estimator of $X_{n+T}$ given $(Y_k : k \le n)$ having the form

$$\widehat{X}_{n+T|n} = \sum_{k=-\infty}^{\infty} Y_k h(n - k)$$

for a causal function $h$ is the function $h$ satisfying the Wiener-Hopf equations (9.2) and (9.3), and the z transform of the optimal $h$ is given by

$$H = \frac{1}{S_Y^+}\left[\frac{z^T S_{XY}}{S_Y^-}\right]_+ \qquad (9.15)$$

Finally, an alternative derivation of (9.15) is given, based on the use of a whitening filter. The idea is the same as the idea of linear innovations sequence considered in Chapter 3. The first step is to notice that the causal estimation problem is particularly simple if the observation process is white noise. Indeed, if the observed process $Y$ is white noise with $R_Y(k) = I_{\{k=0\}}$ then for each $k \ge 0$ the choice of $h(k)$ is simply made to minimize the mean square error when $X_{n+T}$ is estimated by the single term $h(k)Y_{n-k}$. This gives $h(k) = R_{XY}(T + k)I_{\{k \ge 0\}}$. Another way to get the same result is to solve the Wiener-Hopf equations (9.2) and (9.3) in discrete time in case $R_Y(k) = I_{\{k=0\}}$. In general, of course, the observation process $Y$ is not white, but the idea is to replace $Y$ by an equivalent observation process $Z$ that is white.

Let $Z$ be the result of passing $Y$ through a filter with transfer function $G(z) = 1/S_Y^+(z)$. Since $S_Y^+(z)$ is a minimum phase function, $G$ is a positive type function and the system is causal. Thus, any random variable in the m.s. closure of the linear span of $(Z_k : k \le n)$ is also in the m.s. closure of the linear span of $(Y_k : k \le n)$. Conversely, since $Y$ can be recovered from $Z$ by passing $Z$ through the causal linear time-invariant system with transfer function $S_Y^+(z)$, any random variable in the m.s. closure of the linear span of $(Y_k : k \le n)$ is also in the m.s. closure of the linear span of $(Z_k : k \le n)$. Hence, the optimal causal linear estimator of $X_{n+T}$ based on $(Y_k : k \le n)$ is equal to the optimal causal linear estimator of $X_{n+T}$ based on $(Z_k : k \le n)$. By the previous paragraph, such an estimator is obtained by passing $Z$ through the linear time-invariant system with impulse response function $R_{XZ}(T + k)I_{\{k \ge 0\}}$, which has z transform $[z^T S_{XZ}]_+$. See Figure 9.4.

Figure 9.4: Optimal filtering based on whitening first. [Block diagram: $Y$ → $1/S_Y^+(z)$ → $Z$ → $[z^T S_{XZ}(z)]_+$ → $\widehat{X}_{t+T|t}$]

The transfer function for two linear, time-invariant systems in series is the product of their z-transforms. In addition,

$$S_{XZ}(z) = G^*(1/z^*)S_{XY}(z) = \frac{S_{XY}(z)}{S_Y^-(z)}$$

Hence, the series system shown in Figure 9.4 is indeed equivalent to passing $Y$ through the linear time invariant system with $H(z)$ given by (9.15).

Example 9.5.2 Suppose that $X$ and $N$ are discrete-time mean zero WSS random processes such that $R_{XN} = 0$. Suppose $S_X(z) = \frac{1}{(1 - \rho/z)(1 - \rho z)}$ where $0 < \rho < 1$, and suppose that $N$ is a discrete-time white noise with $S_N(z) \equiv \sigma^2$ and $R_N(k) = \sigma^2 I_{\{k=0\}}$. Let the observed process $Y$ be given by $Y = X + N$. Let us find the minimum mean square error linear estimator of $X_n$ based on $(Y_k : k \le n)$. We begin by factoring $S_Y$.

$$S_Y(z) = S_X(z) + S_N(z) = \frac{z}{(z - \rho)(1 - \rho z)} + \sigma^2 = \frac{-\sigma^2\rho\left\{z^2 - \left(\frac{1+\rho^2}{\rho} + \frac{1}{\sigma^2\rho}\right)z + 1\right\}}{(z - \rho)(1 - \rho z)}$$

The quadratic expression in braces can be expressed as $(z - z_0)(z - 1/z_0)$, where $z_0$ is the smaller root of the expression in braces, yielding the factorization

$$S_Y(z) = \underbrace{\frac{\beta(1 - z_0/z)}{(1 - \rho/z)}}_{S_Y^+(z)} \; \underbrace{\frac{\beta(1 - z_0 z)}{(1 - \rho z)}}_{S_Y^-(z)} \qquad \text{where } \beta^2 = \frac{\sigma^2\rho}{z_0}$$

Using the fact $S_{XY} = S_X$, and appealing to a partial fraction expansion yields

$$\frac{S_{XY}(z)}{S_Y^-(z)} = \frac{1}{\beta(1 - \rho/z)(1 - z_0 z)} = \frac{1}{\beta(1 - \rho/z)(1 - z_0\rho)} + \frac{z}{\beta((1/z_0) - \rho)(1 - z_0 z)} \qquad (9.16)$$

The first term in (9.16) is positive type, and the second term in (9.16) is the z transform of a function that is supported on the negative integers. Thus, the first term is equal to $\left[\frac{S_{XY}}{S_Y^-}\right]_+$. Finally, dividing by $S_Y^+$ yields that the z-transform of the optimal filter is given by

$$H(z) = \frac{1}{\beta^2(1 - z_0\rho)(1 - z_0/z)}$$

or in the time domain

$$h(n) = \frac{z_0^n I_{\{n \ge 0\}}}{\beta^2(1 - z_0\rho)}$$
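The closed form obtained in Example 9.5.2 can be checked numerically. The following Python sketch computes $z_0$, $\beta^2$, and $h$ from the formulas of the example, and then evaluates the orthogonality condition $\sum_{k \ge 0} h(k)R_Y(l - k) = R_{XY}(l)$ for lags $l \ge 0$ (lead time $T = 0$). The parameter values $\rho = 0.5$ and $\sigma^2 = 1$, the truncation of the sum to 400 terms, and all function names are assumptions made for illustration only. The check uses $R_X(k) = \rho^{|k|}/(1 - \rho^2)$, which is the correlation function whose z-transform is the stated $S_X$.

```python
import math

def example_filter(rho, sigma2):
    """Return (z0, beta2, h) for the causal filter of Example 9.5.2 (lead time T = 0)."""
    b = (1 + rho**2) / rho + 1 / (sigma2 * rho)   # quadratic in braces: z^2 - b z + 1
    z0 = (b - math.sqrt(b * b - 4)) / 2           # smaller root; the other root is 1/z0
    beta2 = sigma2 * rho / z0                     # beta^2 = sigma^2 rho / z0
    gain = 1 / (beta2 * (1 - z0 * rho))

    def h(n):
        # h(n) = z0^n I{n >= 0} / (beta^2 (1 - z0 rho))
        return gain * z0**n if n >= 0 else 0.0

    return z0, beta2, h

def r_x(k, rho):
    """R_X(k) = rho^|k| / (1 - rho^2), the inverse z-transform of the stated S_X."""
    return rho ** abs(k) / (1 - rho**2)

def r_y(k, rho, sigma2):
    """R_Y = R_X + R_N, since R_XN = 0 and N is white with variance sigma2."""
    return r_x(k, rho) + (sigma2 if k == 0 else 0.0)

def wiener_hopf_residual(rho, sigma2, lag, terms=400):
    """sum_k h(k) R_Y(lag - k) - R_XY(lag), using R_XY = R_X; near zero for lag >= 0."""
    _, _, h = example_filter(rho, sigma2)
    total = sum(h(k) * r_y(lag - k, rho, sigma2) for k in range(terms))
    return total - r_x(lag, rho)

if __name__ == "__main__":
    z0, beta2, h = example_filter(0.5, 1.0)
    print("z0 =", z0, "beta^2 =", beta2, "h(0) =", h(0))
    print([wiener_hopf_residual(0.5, 1.0, lag) for lag in range(5)])
```

The residuals should be zero up to floating point error for every nonnegative lag, which is what the orthogonality principle requires of the optimal causal filter.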
9.6 Problems

9.1 A quadratic predictor
Suppose $X$ is a mean zero, stationary discrete-time random process and that $n$ is an integer with $n \ge 1$. Consider estimating $X_{n+1}$ by a nonlinear one-step predictor of the form

$$\widehat{X}_{n+1} = h_0 + \sum_{k=1}^{n} h_1(k)X_k + \sum_{j=1}^{n}\sum_{k=1}^{j} h_2(j, k)X_j X_k$$

(a) Find equations in terms of the moments (second and higher, if needed) of $X$ for the triple $(h_0, h_1, h_2)$ to minimize the one step prediction error: $E[(X_{n+1} - \widehat{X}_{n+1})^2]$.
(b) Explain how your answer to part (a) simplifies if $X$ is a Gaussian random process.

9.2 A smoothing problem
Suppose $X$ and $Y$ are mean zero, second order random processes in continuous time. Suppose the MMSE estimator of $X_5$ is to be found based on observation of $(Y_u : u \in [0, 3] \cup [7, 10])$. Assuming the estimator takes the form of an integral, derive the optimality conditions that must be satisfied by the kernel function (the function that $Y$ is multiplied by before integrating). Use the orthogonality principle.

9.3 A simple, noncausal estimation problem
Let $X = (X_t : t \in \mathbb{R})$ be a real valued, stationary Gaussian process with mean zero and autocorrelation function $R_X(t) = A^2\,\mathrm{sinc}(f_o t)$, where $A$ and $f_o$ are positive constants. Let $N = (N_t : t \in \mathbb{R})$ be a real valued Gaussian white noise process with $R_N(\tau) = \sigma^2\delta(\tau)$, which is independent of $X$. Define the random process $Y = (Y_t : t \in \mathbb{R})$ by $Y_t = X_t + N_t$. Let $\widehat{X}_t = \int_{-\infty}^{\infty} h(t - s)Y_s\,ds$, where the impulse response function $h$, which can be noncausal, is chosen to minimize $E[D_t^2]$ for each $t$, where $D_t = X_t - \widehat{X}_t$.
(a) Find $h$.
(b) Identify the probability distribution of $D_t$, for $t$ fixed.
(c) Identify the conditional distribution of $D_t$ given $Y_t$, for $t$ fixed.
(d) Identify the autocorrelation function, $R_D$, of the error process $D$, and the cross correlation function, $R_{DY}$.

9.4 Interpolating a Gauss Markov process
Let $X$ be a real-valued, mean zero stationary Gaussian process with $R_X(\tau) = e^{-|\tau|}$. Let $a > 0$. Suppose $X_0$ is estimated by $\widehat{X}_0 = c_1 X_{-a} + c_2 X_a$ where the constants $c_1$ and $c_2$ are chosen to minimize the mean square error (MSE).
(a) Use the orthogonality principle to find $c_1$, $c_2$, and the resulting minimum MSE, $E[(X_0 - \widehat{X}_0)^2]$. (Your answers should depend only on $a$.)
(b) Use the orthogonality principle again to show that $\widehat{X}_0$ as defined above is the minimum MSE estimator of $X_0$ given $(X_s : |s| \ge a)$. (This implies that $X$ has a two-sided Markov property.)

9.5 Estimation of a filtered narrowband random process in noise
Suppose $X$ is a mean zero real-valued stationary Gaussian random process with the spectral density shown.
[Figure: plot of $S_X(2\pi f)$ versus $f$; the surviving labels are a height of 1, a width of 8 Hz (twice), and $10^4$ Hz (twice) on the $f$ axis.]
(a) Explain how $X$ can be simulated on a computer using a pseudo-random number generator that generates standard normal random variables. Try to use the minimum number per unit time. How many normal random variables does your construction require per simulated unit time?
(b) Suppose $X$ is passed through a linear time-invariant system with approximate transfer function $H(2\pi f) = 10^7/(10^7 + f^2)$. Find an approximate numerical value for the power of the output.
(c) Let $Z_t = X_t + W_t$ where $W$ is a Gaussian white noise random process, independent of $X$, with $R_W(\tau) = \delta(\tau)$. Find $h$ to minimize the mean square error $E[(X_t - \widehat{X}_t)^2]$, where $\widehat{X} = h * Z$.
(d) Find the mean square error for the estimator of part (c).

9.6 Proportional noise
Suppose $X$ and $N$ are second order, mean zero random processes such that $R_{XN} \equiv 0$, and let $Y = X + N$. Suppose the correlation functions $R_X$ and $R_N$ are known, and that $R_N = \gamma^2 R_X$ for some nonnegative constant $\gamma^2$. Consider the problem of estimating $X_t$ using a linear estimator based on $(Y_u : a \le u \le b)$, where $a$, $b$, and $t$ are given times with $a < b$.
(a) Use the orthogonality principle to show that if $t \in [a, b]$, then the optimal estimator is given by $\widehat{X}_t = \kappa Y_t$ for some constant $\kappa$, and identify the constant $\kappa$ and the corresponding MSE.
(b) Suppose in addition that $X$ and $N$ are WSS and that $X_{t+T}$ is to be estimated from $(Y_s : s \le t)$. Show how the equation for the optimal causal filter reduces to your answer to part (a) in case $T \le 0$.
(c) Continue under the assumptions of part (b), except consider $T > 0$. How is the optimal filter for estimating $X_{t+T}$ from $(Y_s : s \le t)$ related to the problem of predicting $X_{t+T}$ from $(X_s : s \le t)$?

9.7 Predicting the future of a simple WSS process
Let $X$ be a mean zero, WSS random process with power spectral density $S_X(\omega) = \frac{1}{\omega^4 + 13\omega^2 + 36}$.
(a) Find the positive type, minimum phase rational function $S_X^+$ such that $S_X(\omega) = |S_X^+(\omega)|^2$.
(b) Let $T$ be a fixed known constant with $T \ge 0$. Find $\widehat{X}_{t+T|t}$, the MMSE linear estimator of $X_{t+T}$ given $(X_s : s \le t)$. Be as explicit as possible. (Hint: Check that your answer is correct in case $T = 0$ and in case $T \to \infty$.)
(c) Find the MSE for the optimal estimator of part (b).

9.8 Short answer filtering questions
(a) Prove or disprove: If $H$ is a positive type function then so is $H^2$.
(b) Prove or disprove: Suppose $X$ and $Y$ are jointly WSS, mean zero random processes with continuous spectral densities such that $S_X(2\pi f) = 0$ unless $|f| \in$ [9012 MHz, 9015 MHz] and $S_Y(2\pi f) = 0$ unless $|f| \in$ [9022 MHz, 9025 MHz]. Then the best linear estimate of $X_0$ given $(Y_t : t \in \mathbb{R})$ is 0.
(c) Let $H(2\pi f) = \mathrm{sinc}(f)$. Find $[H]_+$.

9.9 On the MSE for causal estimation
Recall that if $X$ and $Y$ are jointly WSS and have power spectral densities, and if $S_Y$ is rational with a spectral factorization, then the mean square error for linear estimation of $X_{t+T}$ using $(Y_s : s \le t)$ is given by

$$(\mathrm{MSE}) = R_X(0) - \int_{-\infty}^{\infty} \left| \left[ \frac{e^{j\omega T} S_{XY}}{S_Y^-} \right]_+ \right|^2 \frac{d\omega}{2\pi}.$$

Evaluate and interpret the limits of this expression as $T \to -\infty$ and as $T \to \infty$.

9.10 A singular estimation problem
Let $X_t = Ae^{j2\pi f_o t}$, where $f_o > 0$ and $A$ is a mean zero complex valued random variable with $E[A^2] = 0$ and $E[|A|^2] = \sigma_A^2$. Let $N$ be a white noise process with $R_N(\tau) = \sigma_N^2\delta(\tau)$. Let $Y_t = X_t + N_t$. Let $\widehat{X}$ denote the output process when $Y$ is filtered using the impulse response function $h(\tau) = \alpha e^{-(\alpha - j2\pi f_o)\tau} I_{\{\tau \ge 0\}}$.
(a) Verify that $X$ is a WSS periodic process, and find its power spectral density (the power spectral density only exists as a generalized function, i.e. there is a delta function in it).
(b) Give a simple expression for the output of the linear system when the input is $X$.
(c) Find the mean square error, $E[|X_t - \widehat{X}_t|^2]$. How should the parameter $\alpha$ be chosen to approximately minimize the MSE?

9.11 Filtering a WSS signal plus noise
Suppose $X$ and $N$ are jointly WSS, mean zero, continuous time random processes with $R_{XN} \equiv 0$. The processes are the inputs to a system with the block diagram shown, for some transfer functions $K_1(\omega)$ and $K_2(\omega)$:
[Block diagram: $X$ → $K_1$ and $N$ → $K_2$, with the two outputs summed to give $Y = X_{out} + N_{out}$.]
Suppose that for every value of $\omega$, $K_i(\omega) \ne 0$ for $i = 1$ and $i = 2$. Because the two subsystems are linear, we can view the output process $Y$ as the sum of two processes, $X_{out}$, due to the input $X$, plus $N_{out}$, due to the input $N$. Your answers to the first four parts should be expressed in terms of $K_1$, $K_2$, and the power spectral densities $S_X$ and $S_N$.
(a) What is the power spectral density $S_Y$?
(b) Find the signal-to-noise ratio at the output (the power of $X_{out}$ divided by the power of $N_{out}$).
(c) Suppose $Y$ is passed into a linear system with transfer function $H$, designed so that the output at time $t$ is $\widehat{X}_t$, the best linear estimator of $X_t$ given $(Y_s : s \in \mathbb{R})$. Find $H$.
(d) Find the resulting minimum mean square error.
(e) The correct answer to part (d) (the minimum MSE) does not depend on the filter $K_2$. Why?

9.12 A prediction problem
Let $X$ be a mean zero WSS random process with correlation function $R_X(\tau) = e^{-|\tau|}$. Using the Wiener filtering equations, find the optimal linear MMSE estimator (i.e. predictor) of $X_{t+T}$ based on $(X_s : s \le t)$, for a constant $T > 0$. Explain why your answer takes such a simple form.

9.13 Properties of a particular Gaussian process
Let $X$ be a zero-mean, wide-sense stationary Gaussian random process in continuous time with autocorrelation function $R_X(\tau) = (1 + |\tau|)e^{-|\tau|}$ and power spectral density $S_X(\omega) = (2/(1 + \omega^2))^2$. Answer the following questions, being sure to provide justification.
(a) Is $X$ mean ergodic in the m.s. sense?
(b) Is $X$ a Markov process?
(c) Is $X$ differentiable in the m.s. sense?
(d) Find the causal, minimum phase filter $h$ (or its transform $H$) such that if white noise with autocorrelation function $\delta(\tau)$ is filtered using $h$ then the output autocorrelation function is $R_X$.
(e) Express $X$ as the solution of a stochastic differential equation driven by white noise.

9.14 Spectral decomposition and factorization
(a) Let $x$ be the signal with Fourier transform given by $\widehat{x}(2\pi f) = \left[\mathrm{sinc}(100f)e^{j2\pi fT}\right]_+$. Find the energy of $x$ for all real values of the constant $T$.
(b) Find the spectral factorization of the power spectral density $S(\omega) = \frac{1}{\omega^4 + 16\omega^2 + 100}$. (Hint: $1 + 3j$ is a pole of $S$.)

9.15 A continuous-time Wiener filtering problem
Let $(X_t)$ and $(N_t)$ be uncorrelated, mean zero random processes with $R_X(t) = \exp(-2|t|)$ and $S_N(\omega) \equiv N_o/2$ for a positive constant $N_o$. Suppose that $Y_t = X_t + N_t$.
(a) Find the optimal (noncausal) filter for estimating $X_t$ given $(Y_s : -\infty < s < +\infty)$ and find the resulting mean square error. Comment on how the MMSE depends on $N_o$.
(b) Find the optimal causal filter with lead time $T$, that is, the Wiener filter for estimating $X_{t+T}$ given $(Y_s : -\infty < s \le t)$, and find the corresponding MMSE. For simplicity you can assume that $T \ge 0$. Comment on the limiting value of the MMSE as $T \to \infty$, as $N_o \to \infty$, or as $N_o \to 0$.

9.16 Estimation of a random signal, using the KL expansion
Suppose that $X$ is a m.s. continuous, mean zero process over an interval $[a, b]$, and suppose $N$ is a white noise process, with $R_{XN} \equiv 0$ and $R_N(s, t) = \sigma^2\delta(s - t)$. Let $(\varphi_k : k \ge 1)$ be a complete orthonormal basis for $L^2[a, b]$ consisting of eigenfunctions of $R_X$, and let $(\lambda_k : k \ge 1)$ denote the corresponding eigenvalues. Suppose that $Y = (Y_t : a \le t \le b)$ is observed.
(a) Fix an index $i$. Express the MMSE estimator of $(X, \varphi_i)$ given $Y$ in terms of the coordinates, $(Y, \varphi_1), (Y, \varphi_2), \ldots$ of $Y$, and find the corresponding mean square error.
(b) Now suppose $f$ is a function in $L^2[a, b]$. Express the MMSE estimator of $(X, f)$ given $Y$ in terms of the coordinates $((f, \varphi_j) : j \ge 1)$ of $f$, the coordinates of $Y$, the $\lambda$'s, and $\sigma$. Also, find the mean square error.

9.17 Noiseless prediction of a baseband random process
Fix positive constants $T$ and $\omega_o$, suppose $X = (X_t : t \in \mathbb{R})$ is a baseband random process with one-sided frequency limit $\omega_o$, and let $H^{(n)}(\omega) = \sum_{k=0}^{n} \frac{(j\omega T)^k}{k!}$, which is a partial sum of the power series of $e^{j\omega T}$. Let $\widehat{X}^{(n)}_{t+T|t}$ denote the output at time $t$ when $X$ is passed through the linear time invariant system with transfer function $H^{(n)}$. As the notation suggests, $\widehat{X}^{(n)}_{t+T|t}$ is an estimator (not necessarily optimal) of $X_{t+T}$ given $(X_s : s \le t)$.
(a) Describe $\widehat{X}^{(n)}_{t+T|t}$ in terms of $X$ in the time domain. Verify that the linear system is causal.
(b) Show that $\lim_{n \to \infty} a_n = 0$, where $a_n = \max_{|\omega| \le \omega_o} |e^{j\omega T} - H^{(n)}(\omega)|$. (This means that the power series converges uniformly for $\omega \in [-\omega_o, \omega_o]$.)
(c) Show that the mean square error can be made arbitrarily small by taking $n$ sufficiently large. In other words, show that $\lim_{n \to \infty} E[|X_{t+T} - \widehat{X}^{(n)}_{t+T|t}|^2] = 0$.
(d) Thus, the future of a narrowband random process $X$ can be predicted perfectly from its past. What is wrong with the following argument for general WSS processes? If $X$ is an arbitrary WSS random process, we could first use a bank of (infinitely many) narrowband filters to split $X$ into an equivalent set of narrowband random processes (call them "subprocesses") which sum to $X$. By the above, we can perfectly predict the future of each of the subprocesses from its past. So adding together the predictions would yield a perfect prediction of $X$ from its past.

9.18 Linear innovations and spectral factorization
Suppose $X$ is a discrete time WSS random process with mean zero. Suppose that the z-transform version of its power spectral density has the factorization as described in the notes: $S_X(z) = S_X^+(z)S_X^-(z)$ such that $S_X^+(z)$ is a minimum phase, positive type function, $S_X^-(z) = (S_X^+(1/z^*))^*$, and $\lim_{|z| \to \infty} S_X^+(z) = \beta$ for some $\beta > 0$. The linear innovations sequence of $X$ is the sequence $\widetilde{X}$ such that $\widetilde{X}_k = X_k - \widehat{X}_{k|k-1}$, where $\widehat{X}_{k|k-1}$ is the MMSE predictor of $X_k$ given $(X_l : l \le k - 1)$. Note that there is no constant multiplying $X_k$ in the definition of $\widetilde{X}_k$. You should use $S_X(z)$, $S_X^+(z)$, $S_X^-(z)$, and/or $\beta$ in giving your answers.
(a) Show that $\widetilde{X}$ can be obtained by passing $X$ through a linear time-invariant filter, and identify the corresponding value of $H$.
(b) Identify the mean square prediction error, $E[|X_k - \widehat{X}_{k|k-1}|^2]$.

9.19 A singular nonlinear estimation problem
Suppose $X$ is a standard Brownian motion with parameter $\sigma^2 = 1$ and suppose $N$ is a Poisson random process with rate $\lambda = 10$, which is independent of $X$. Let $Y = (Y_t : t \ge 0)$ be defined by $Y_t = X_t + N_t$.
(a) Find the optimal estimator of $X_1$ among the estimators that are linear functions of $(Y_t : 0 \le t \le 1)$ and the constants, and find the corresponding mean square error. Your estimator can include a constant plus a linear combination, or limits of linear combinations, of $Y_t : 0 \le t \le 1$. (Hint: There is a closely related problem elsewhere in this problem set.)
(b) Find the optimal possibly nonlinear estimator of $X_1$ given $(Y_t : 0 \le t \le 1)$, and find the corresponding mean square error. (Hint: No computation is needed. Draw sample paths of the processes.)

9.20 A discrete-time Wiener filtering problem
Extend the discrete-time Wiener filtering problem considered at the end of the notes to incorporate a lead time $T$. Assume $T$ to be integer valued. Identify the optimal filter in both the z-transform domain and in the time domain. (Hint: Treat the case $T \le 0$ separately. You need not identify the covariance of error.)

9.21 Causal estimation of a channel input process
Let $X = (X_t : t \in \mathbb{R})$ and $N = (N_t : t \in \mathbb{R})$ denote WSS random processes with $R_X(\tau) = \frac{3}{2}e^{-|\tau|}$ and $R_N(\tau) = \delta(\tau)$. Think of $X$ as an input signal and $N$ as noise, and suppose $X$ and $N$ are orthogonal to each other. Let $k$ denote the impulse response function given by $k(\tau) = 2e^{-3\tau}I_{\{\tau \ge 0\}}$, and suppose an output process $Y$ is generated according to the block diagram shown:
[Block diagram: $X$ → $k$ → summing junction → $Y$, with $N$ added at the summing junction.]
That is, $Y = X * k + N$. Suppose $X_t$ is to be estimated by passing $Y$ through a causal filter with impulse response function $h$, and transfer function $H$. Find the choice of $H$ and $h$ in order to minimize the mean square error.

9.22 Estimation given a strongly correlated process
Suppose $g$ and $k$ are minimum phase causal functions in discrete-time, with $g(0) = k(0) = 1$, and z-transforms $G$ and $K$. Let $W = (W_k : k \in \mathbb{Z})$ be a mean zero WSS process with $S_W(\omega) \equiv 1$, let $X_n = \sum_{i=-\infty}^{\infty} g(n - i)W_i$ and $Y_n = \sum_{i=-\infty}^{\infty} k(n - i)W_i$.
(a) Express $R_X$, $R_Y$, $R_{XY}$, $S_X$, $S_Y$, and $S_{XY}$ in terms of $g$, $k$, $G$, $K$.
(b) Find $h$ so that $\widehat{X}_{n|n} = \sum_{i=-\infty}^{\infty} Y_i h(n - i)$ is the MMSE linear estimator of $X_n$ given $(Y_i : i \le n)$.
(c) Find the resulting mean square error. Give an intuitive reason for your answer.

9.23 Estimation of a process with raised cosine spectrum
Suppose $Y = X + N$, where $X$ and $N$ are independent, mean zero, WSS random processes with

$$S_X(\omega) = \frac{(1 + \cos(\frac{\pi\omega}{\omega_o}))}{2} I_{\{|\omega| \le \omega_o\}} \quad \text{and} \quad S_N(\omega) = \frac{N_o}{2}$$

where $N_o > 0$ and $\omega_o > 0$.
(a) Find the transfer function $H$ for the filter such that if the input process is $Y$, the output process, $\widehat{X}$, is such that $\widehat{X}_t$ is the optimal linear estimator of $X_t$ based on $(Y_s : s \in \mathbb{R})$.
(b) Express the mean square error, $\sigma_e^2 = E[(\widehat{X}_t - X_t)^2]$, as an integral in the frequency domain. (You needn't carry out the integration.)
(c) Describe the limits of your answers to (a) and (b) as $N_o \to 0$.
(d) Describe the limits of your answers to (a) and (b) as $N_o \to \infty$.
9.24 * Resolution of Wiener and Kalman filtering
Consider the state and observation models:

$$X_n = FX_{n-1} + W_n$$
$$Y_n = H^T X_n + V_n$$

where $(W_n : -\infty < n < +\infty)$ and $(V_n : -\infty < n < +\infty)$ are independent vector-valued random sequences of independent, identically distributed mean zero random variables. Let $\Sigma_W$ and $\Sigma_V$ denote the respective covariance matrices of $W_n$ and $V_n$. ($F$, $H$ and the covariance matrices must satisfy a stability condition. Can you find it?)
(a) What are the autocorrelation function $R_X$ and crosscorrelation function $R_{XY}$?
(b) Use the orthogonality principle to derive conditions for the causal filter $h$ that minimizes $E[\| X_{n+1} - \sum_{j=0}^{\infty} h(j)Y_{n-j} \|^2]$. (i.e. derive the basic equations for the Wiener-Hopf method.)
(c) Write down and solve the equations for the Kalman predictor in steady state to derive an expression for $h$, and verify that it satisfies the orthogonality conditions.

Chapter 10 Martingales

10.1 Conditional expectation revisited

The general definition of a martingale requires the general definition of conditional expectation. We begin by reviewing the definitions we have given so far. In Chapter 1 we reviewed the following elementary definition of $E[X|Y]$. If $X$ and $Y$ are both discrete random variables,

$$E[X|Y = i] = \sum_j jP[X = j|Y = i],$$

which is well defined if $P\{Y = i\} > 0$ and either the sum restricted to $j > 0$ or to $j < 0$ is convergent. That is, $E[X|Y = i]$ is the mean of the conditional pmf of $X$ given $Y = i$. Note that $g(i) = E[X|Y = i]$ is a function of $i$. Let $E[X|Y]$ be the random variable defined by $E[X|Y] = g(Y)$. Similarly, if $X$ and $Y$ have a joint pdf, $E[X|Y = y] = \int x f_{X|Y}(x|y)\,dx = g(y)$ and $E[X|Y] = g(Y)$.

Chapter 3 showed that $E[X|Y]$ could be defined whenever $E[X^2] < \infty$, even if $X$ and $Y$ are neither discrete random variables nor have a joint pdf. The definition is based on a projection, characterized by the orthogonality principle. Specifically, if $E[X^2] < \infty$, then $E[X|Y]$ is the unique random variable such that:

• it has the form $g(Y)$ for some (Borel measurable) function $g$ such that $E[(g(Y))^2] < \infty$, and
• $E[(X - E[X|Y])f(Y)] = 0$ for all (Borel measurable) functions $f$ such that $E[(f(Y))^2] < \infty$.

That is, $E[X|Y]$ is an unconstrained estimator based on $Y$, such that the error $X - E[X|Y]$ is orthogonal to all other unconstrained estimators based on $Y$. By the orthogonality principle, $E[X|Y]$ exists and is unique, if differences on a set of probability zero are ignored. This second definition of $E[X|Y]$ is more general than the elementary definition, because it doesn't require $X$ and $Y$ to be discrete or to have a joint pdf, but it is less general because it requires that $E[X^2] < \infty$.

The definition of $E[X|Y]$ given next generalizes the previously given definition in two ways. First, the definition applies as long as $E[|X|] < \infty$, which is a weaker requirement than $E[X^2] < \infty$. Second, the definition is based on having information represented by a σ-algebra, rather than by a random variable. Recall that, by definition, a σ-algebra $\mathcal{D}$ for a set $\Omega$ is a set of subsets of $\Omega$ such that:

(a) $\Omega \in \mathcal{D}$,
(b) if $A \in \mathcal{D}$ then $A^c \in \mathcal{D}$,
(c) if $A, B \in \mathcal{D}$ then $AB \in \mathcal{D}$, and more generally, if $A_1, A_2, \ldots$ is such that $A_i \in \mathcal{D}$ for $i \ge 1$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{D}$.

In particular, the set of events, $\mathcal{F}$, in a probability space $(\Omega, \mathcal{F}, P)$, is required to be a σ-algebra. The original motivation for introducing $\mathcal{F}$ in this context was a technical one, related to the impossibility of extending $P$ to be defined on all subsets of $\Omega$, for important examples such as $\Omega = [0, 1]$ and $P\{(a, b)\} = b - a$ for all intervals $(a, b)$. However, σ-algebras are also useful for representing information available to an observer. We call $\mathcal{D}$ a sub-σ-algebra of $\mathcal{F}$ if $\mathcal{D}$ is a σ-algebra such that $\mathcal{D} \subset \mathcal{F}$. A sub-σ-algebra of $\mathcal{F}$ represents knowledge about the probability experiment modeled by the probability space $(\Omega, \mathcal{F}, P)$.

A random variable $Z$ is said to be $\mathcal{D}$-measurable if $\{Z \le c\} \in \mathcal{D}$ for all $c$. By definition, random variables are functions on $\Omega$ that are $\mathcal{F}$-measurable. The smaller the σ-algebra $\mathcal{D}$ is, the smaller the set of $\mathcal{D}$-measurable random variables. In practice, sub-σ-algebras are usually generated by collections of random variables:

Definition 10.1.1 The σ-algebra generated by a collection of random variables $(Y_i : i \in I)$, denoted by $\sigma(Y_i : i \in I)$, is the smallest σ-algebra containing all sets of the form $\{Y_i \le c\}$. (Footnote 1: The smallest one exists; it is equal to the intersection of all σ-algebras which contain all sets of the form $\{Y_i \le c\}$.) The σ-algebra generated by a single random variable $Y$ is denoted by $\sigma(Y)$, and sometimes as $\mathcal{F}^Y$.

An equivalent definition would be that $\sigma(Y_i : i \in I)$ is the smallest σ-algebra such that each $Y_i$ is measurable with respect to it.

Example 10.1.2 (The trivial σ-algebra) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Suppose $X$ is a random variable such that, for some constant $c_o$, $X(\omega) = c_o$ for all $\omega \in \Omega$. Then $X$ is measurable with respect to the trivial σ-algebra $\mathcal{D}$ defined by $\mathcal{D} = \{\emptyset, \Omega\}$. That is, constant random variables are $\{\emptyset, \Omega\}$-measurable.
Conversely, suppose $Y$ is a $\{\emptyset, \Omega\}$-measurable random variable. Select an arbitrary $\omega_o \in \Omega$ and let $c_o = Y(\omega_o)$. On one hand, $\{\omega : Y(\omega) \le c\}$ can't be empty for $c \ge c_o$, so $\{\omega : Y(\omega) \le c\} = \Omega$ for $c \ge c_o$. On the other hand, $\{\omega : Y(\omega) \le c\}$ doesn't contain $\omega_o$ for $c < c_o$, so $\{\omega : Y(\omega) \le c\} = \emptyset$ for $c < c_o$. That is, $Y(\omega) = c_o$ for all $\omega$. Therefore, $\{\emptyset, \Omega\}$-measurable random variables are constant.

In Chapter 3, the information gained from observing a random variable $Y$ was modeled by requiring estimators to be random variables of the form $g(Y)$, for a Borel measurable function $g$. An equivalent condition would be to allow any estimator that is a $\sigma(Y)$-measurable random variable. That is, as shown in a starred homework problem, if $Y$ and $Z$ are random variables on the same probability space, then $Z = g(Y)$ for some Borel measurable function $g$ if and only if $Z$ is $\sigma(Y)$ measurable. Using sub-σ-algebras is more general, because some σ-algebras on some probability spaces are not generated by random variables. Using σ-algebras to represent information also works better when there is an uncountably infinite number of observations, such as observation of a continuous random process over an interval of time. But in engineering practice, the main difference between the two ways to model information is simply a matter of notation.

Definition 10.1.3 If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ with finite mean and $\mathcal{D}$ is a sub-σ-algebra of $\mathcal{F}$, the conditional expectation of $X$ given $\mathcal{D}$, $E[X|\mathcal{D}]$, is the unique (two versions equal with probability one are considered to be the same) random variable on $(\Omega, \mathcal{F}, P)$ such that
(i) $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable
(ii) $E[(X - E[X|\mathcal{D}])I_D] = 0$ for all $D \in \mathcal{D}$. (Here $I_D$ is the indicator function of $D$.)

We remark that a possible choice of $D$ in property (ii) of the definition is $D = \Omega$, so $E[X|\mathcal{D}]$ should satisfy $E[X - E[X|\mathcal{D}]] = 0$, or equivalently, since $E[X]$ is assumed to be finite, $E[X] = E[E[X|\mathcal{D}]]$. In particular, an implication of the definition is that $E[X|\mathcal{D}]$ also has a finite mean.

Proposition 10.1.4 Definition 10.1.3 is well posed. Specifically, there exists a random variable satisfying conditions (i) and (ii), and it is unique.

Proof. (Uniqueness) Suppose $U$ and $V$ are each $\mathcal{D}$-measurable random variables such that $E[(X - U)I_D] = 0$ and $E[(X - V)I_D] = 0$ for all $D \in \mathcal{D}$. It follows that $E[(U - V)I_D] = E[(X - V)I_D] - E[(X - U)I_D] = 0$ for any $D \in \mathcal{D}$. A possible choice of $D$ is $\{U > V\}$, so $E[(U - V)I_{\{U > V\}}] = 0$. Since $(U - V)I_{\{U > V\}}$ is nonnegative and is strictly positive on the event $\{U > V\}$, it must be that $P\{U > V\} = 0$. Similarly, $P\{U < V\} = 0$. So $P\{U = V\} = 1$.

(Existence) Existence is first proved under the added assumption that $P\{X \ge 0\} = 1$. Let $L^2(\mathcal{D})$ be the space of $\mathcal{D}$-measurable random variables with finite second moments. Then $L^2(\mathcal{D})$ is a closed, linear subspace of $L^2(\Omega, \mathcal{F}, P)$, so the orthogonality principle can be applied. For any $n \ge 0$, the random variable $X \wedge n$ is bounded and thus has a finite second moment. Let $\widehat{X}_n$ be the projection of $X \wedge n$ onto $L^2(\mathcal{D})$. Then by the orthogonality principle, $X \wedge n - \widehat{X}_n$ is orthogonal to any random variable in $L^2(\mathcal{D})$. In particular, $X \wedge n - \widehat{X}_n$ is orthogonal to $I_D$ for any $D \in \mathcal{D}$. Therefore, $E[(X \wedge n - \widehat{X}_n)I_D] = 0$ for all $D \in \mathcal{D}$. Equivalently,

$$E[(X \wedge n)I_D] = E[\widehat{X}_n I_D]. \qquad (10.1)$$

The next step is to take a limit as $n \to \infty$. Since $E[(X \wedge n)I_D]$ is nondecreasing in $n$ for each $D \in \mathcal{D}$, the same is true of $E[\widehat{X}_n I_D]$. Thus, for any $n \ge 0$, $E[(\widehat{X}_{n+1} - \widehat{X}_n)I_D] \ge 0$ for any $D \in \mathcal{D}$. Taking $D = \{\widehat{X}_{n+1} - \widehat{X}_n < 0\}$ implies that $P\{\widehat{X}_{n+1} \ge \widehat{X}_n\} = 1$. Therefore, the sequence $(\widehat{X}_n)$ converges a.s., and we denote the limit by $X_\infty$. We show that $X_\infty$ satisfies the two properties, (i) and (ii), required of $E[X|\mathcal{D}]$. First, $X_\infty$ is $\mathcal{D}$-measurable because it is the limit of a sequence of $\mathcal{D}$-measurable random variables. Secondly, for any $D \in \mathcal{D}$, the sequences of random variables $(X \wedge n)I_D$ and $\widehat{X}_n I_D$ are a.s. nondecreasing and nonnegative, so by the monotone convergence theorem (Theorem 11.6.6) and (10.1):

$$E[XI_D] = \lim_{n \to \infty} E[(X \wedge n)I_D] = \lim_{n \to \infty} E[\widehat{X}_n I_D] = E[X_\infty I_D].$$

So property (ii), $E[(X - X_\infty)I_D] = 0$, is also satisfied. Existence is proved in case $P\{X \ge 0\} = 1$.

For the general case, $X$ can be represented as $X = X_+ - X_-$, where $X_+$ and $X_-$ are nonnegative with finite means. By the case already proved, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ exist, and, of course, they satisfy conditions (i) and (ii) in Definition 10.1.3. Therefore, with $E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$, it is a simple matter to check that $E[X|\mathcal{D}]$ also satisfies conditions (i) and (ii), as required.

Proposition 10.1.5 Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{A}$ and $\mathcal{D}$ be sub-σ-algebras of $\mathcal{F}$.
1. (Consistency with definition based on projection) If $E[X^2] < \infty$ and $\mathcal{V} = \{g(Y) : g \text{ is Borel measurable such that } E[g(Y)^2] < \infty\}$, then $E[X|Y]$, defined as the MMSE projection of $X$ onto $\mathcal{V}$ (also written as $\Pi_{\mathcal{V}}(X)$) is equal to $E[X|\sigma(Y)]$.
2. (Linearity) If $E[X]$ and $E[Y]$ are finite, then $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.
3. (Tower property) If $E[X]$ is finite and $\mathcal{A} \subset \mathcal{D} \subset \mathcal{F}$, then $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$. (In particular, $E[E[X|\mathcal{D}]] = E[X]$.)
4. (Positivity preserving) If $E[X]$ is finite and $X \ge 0$ a.s. then $E[X|\mathcal{D}] \ge 0$ a.s.
5. (L1 contraction property) $E[|E[X|\mathcal{D}]|] \le E[|X|]$.
6. (L1 continuity) If $E[X_n]$ is finite for all $n$ and $E[|X_n - X_\infty|] \to 0$, then $E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] \to 0$.
7. (Pull out property) If $X$ is $\mathcal{D}$-measurable and $E[XY]$ and $E[Y]$ are finite, then $E[XY|\mathcal{D}] = XE[Y|\mathcal{D}]$.

Proof. (Consistency with definition based on projection) Suppose $X$ and $\mathcal{V}$ are as in part 1. Then, by definition, $E[X|Y] \in \mathcal{V}$ and $E[(X - E[X|Y])Z] = 0$ for any $Z \in \mathcal{V}$. As mentioned above, a random variable has the form $g(Y)$ if and only if it is $\sigma(Y)$-measurable. In particular, $\mathcal{V}$ is simply the set of $\sigma(Y)$-measurable random variables $Z$ such that $E[Z^2] < \infty$. Thus, $E[X|Y]$ is $\sigma(Y)$ measurable, and $E[(X - E[X|Y])Z] = 0$ for any $\sigma(Y)$-measurable random variable $Z$ such that $E[Z^2] < \infty$. As a special case, $E[(X - E[X|Y])I_D] = 0$ for any $D \in \sigma(Y)$. Thus, $E[X|Y]$ satisfies conditions (i) and (ii) in Definition 10.1.3 of $E[X|\sigma(Y)]$. So $E[X|Y] = E[X|\sigma(Y)]$.

The proofs of linearity and the tower property are nearly identical to the proofs of the same properties for projections (Propositions 3.2.2 and 3.2.3).

(Linearity Property) (This is similar to the proof of linearity for projections, Proposition 3.2.2.) It suffices to check that the linear combination $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}]$ satisfies the two conditions that define $E[aX + bY|\mathcal{D}]$. First, $E[X|\mathcal{D}]$ and $E[Y|\mathcal{D}]$ are both $\mathcal{D}$ measurable, so their linear combination is also $\mathcal{D}$-measurable. Secondly, if $D \in \mathcal{D}$, then $E[(X - E[X|\mathcal{D}])I_D] = E[(Y - E[Y|\mathcal{D}])I_D] = 0$, from which it follows that

$$E[(aX + bY - (aE[X|\mathcal{D}] + bE[Y|\mathcal{D}]))I_D] = aE[(X - E[X|\mathcal{D}])I_D] + bE[(Y - E[Y|\mathcal{D}])I_D] = 0.$$

Therefore, $aE[X|\mathcal{D}] + bE[Y|\mathcal{D}] = E[aX + bY|\mathcal{D}]$.

(Tower Property) (This is similar to the proof of Proposition 3.2.3, about projections onto nested subspaces.) It suffices to check that $E[E[X|\mathcal{D}]|\mathcal{A}]$ satisfies the two conditions that define $E[X|\mathcal{A}]$. First, $E[E[X|\mathcal{D}]|\mathcal{A}]$ itself is a conditional expectation given $\mathcal{A}$, so it is $\mathcal{A}$ measurable. Second, let $D \in \mathcal{A}$. Now $X - E[E[X|\mathcal{D}]|\mathcal{A}] = (X - E[X|\mathcal{D}]) + (E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}])$, and (use the fact $D \in \mathcal{D}$): $E[(X - E[X|\mathcal{D}])I_D] = 0$ and $E[(E[X|\mathcal{D}] - E[E[X|\mathcal{D}]|\mathcal{A}])I_D] = 0$. Adding these last two equations yields $E[(X - E[E[X|\mathcal{D}]|\mathcal{A}])I_D] = 0$. Therefore, $E[E[X|\mathcal{D}]|\mathcal{A}] = E[X|\mathcal{A}]$.

(Positivity preserving) Suppose $E[X]$ is finite and $X \ge 0$ a.s. Let $D = \{E[X|\mathcal{D}] < 0\}$. Then $D \in \mathcal{D}$ because $E[X|\mathcal{D}]$ is $\mathcal{D}$-measurable. So $E[E[X|\mathcal{D}]I_D] = E[XI_D] \ge 0$, while $P\{E[X|\mathcal{D}]I_D \le 0\} = 1$. Hence, $P\{E[X|\mathcal{D}]I_D = 0\} = 1$, which is to say that $E[X|\mathcal{D}] \ge 0$ a.s.

(L1 contraction property) (This property is a special case of the conditional version of Jensen's inequality, established in a starred homework problem. Here a different proof is given.) The variable $X$ can be represented as $X = X_+ - X_-$, where $X_+$ is the positive part of $X$ and $X_-$ is the negative part of $X$, given by $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$. Since $X$ is assumed to have a finite mean, the same is true of $X_\pm$. Moreover, $E[E[X_\pm|\mathcal{D}]] = E[X_\pm]$, and by the linearity property, $E[X|\mathcal{D}] = E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]$. By the positivity preserving property, $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are both nonnegative a.s., so $E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}] \ge |E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|$ a.s. (The inequality is strict for $\omega$ such that both $E[X_+|\mathcal{D}]$ and $E[X_-|\mathcal{D}]$ are strictly positive.) Therefore,

$$E[|X|] = E[X_+] + E[X_-] = E[E[X_+|\mathcal{D}] + E[X_-|\mathcal{D}]] \ge E[|E[X_+|\mathcal{D}] - E[X_-|\mathcal{D}]|] = E[|E[X|\mathcal{D}]|],$$

and the L1 contraction property is proved.

(L1 continuity) Since for any $n$, $|X_\infty| \le |X_n| + |X_n - X_\infty|$, the hypotheses imply that $X_\infty$ has a finite mean. By linearity and the L1 contraction property, $E[|E[X_n|\mathcal{D}] - E[X_\infty|\mathcal{D}]|] = E[|E[X_n - X_\infty|\mathcal{D}]|] \le E[|X_n - X_\infty|]$, which implies the L1 continuity property.

(Pull out property) The pull out property will be proved first under the added assumption that $X$ and $Y$ are nonnegative random variables. Clearly $XE[Y|\mathcal{D}]$ is $\mathcal{D}$ measurable. Let $D \in \mathcal{D}$. It remains to show that

$$E[XYI_D] = E[XE[Y|\mathcal{D}]I_D]. \qquad (10.2)$$

If $X$ has the form $I_{D_1}$ for $D_1 \in \mathcal{D}$ then (10.2) becomes $E[YI_{D \cap D_1}] = E[E[Y|\mathcal{D}]I_{D \cap D_1}]$, which holds by the definition of $E[Y|\mathcal{D}]$ and the fact $D \cap D_1 \in \mathcal{D}$. Equation (10.2) is thus also true if $X$ is a finite linear combination of random variables of the form $I_{D_1}$, that is, if $X$ is a simple $\mathcal{D}$-measurable random variable. In general, $X$ is the a.s. limit of a nondecreasing sequence of nonnegative simple random variables $X_n$. Therefore, (10.2) holds for $X$ replaced by $X_n$:

$$E[X_n YI_D] = E[X_n E[Y|\mathcal{D}]I_D]. \qquad (10.3)$$

Also, $X_n YI_D$ is a nondecreasing sequence converging to $XYI_D$ a.s., and $X_n E[Y|\mathcal{D}]I_D$ is a nondecreasing sequence converging to $XE[Y|\mathcal{D}]I_D$ a.s. By the monotone convergence theorem, taking $n \to \infty$ on both sides of (10.3) yields (10.2). This proves the pull out property under the added assumption that $X$ and $Y$ are nonnegative.

In the general case, $X = X_+ - X_-$, where $X_+ = X \vee 0$ and $X_- = (-X) \vee 0$, and similarly $Y = Y_+ - Y_-$. The hypotheses imply $E[X_\pm Y_\pm]$ and $E[Y_\pm]$ are finite so that $E[X_\pm Y_\pm|\mathcal{D}] = X_\pm E[Y_\pm|\mathcal{D}]$, and therefore

$$E[X_\pm Y_\pm I_D] = E[X_\pm E[Y_\pm|\mathcal{D}]I_D], \qquad (10.4)$$

where in these equations, the sign on both appearances of $X$ should be the same, and the sign on both appearances of $Y$ should be the same. The left side of (10.2) can be expressed as a linear combination of terms of the form $E[X_\pm Y_\pm I_D]$:

$$E[XYI_D] = E[X_+Y_+I_D] - E[X_+Y_-I_D] - E[X_-Y_+I_D] + E[X_-Y_-I_D].$$

Similarly, the right side of (10.2) can be expressed as a linear combination of terms of the form $E[X_\pm E[Y_\pm|\mathcal{D}]I_D]$. Therefore, (10.2) follows from (10.4).
and Y is a supermartingale relative to F if (i) and (ii) are true and E[Yn+1 |Fn ] ≤ Yn a. Note the condition (ii) in the definition of a margingale implies condition (0). where X+ = X ∨ 0 and X− = (−X) ∨ 0. If the filtration is generated by a random process. X = X+ − X− .2) can be expressed as a linear combination of terms of the form E[X± E[Y± |D]ID ].2 Martingales with respect to filtrations A filtration of a σ-algebra F is a sequence of sub-σ-algebras F = (Fn : n ≥ 0) of F. Similarly. Similarly.2. F. and similarly Y = Y+ − Y− . for each n. we take F Y to be the trivial σ-algebra. Some comments are in order.s. the sign on both appearances of X should be the same.1 Let (Ω. and the sign on both appearances of Y should be the same. Therefore. (10. . A random process (Xn : n ≥ 0) is adapted to a filtration F if Xn is Fn measurable for each n ≥ 0. P ) be a probability space with a filtration F = (Fn : n ≥ 0).308 CHAPTER 10. representing no observations.2) can be expressed as a linear combination of terms of the form E[X± Y± ID ]: E[XY ID ] = E[X+ Y+ ID ] − E[X+ Y− ID ] − E[X− Y+ ID ] + E[X− Y− ID ].4). then the information available at time n is represents observation of the random process up to time n. (If there is no variable Y defined.s. Note that if Y = (Yn : n ≥ 0) is a martingale with respect to a filtration F = (Fn : n ≥ 0). the process Y is adapted to F ) (i) E[|Yn |] < ∞.. 10. The hypotheses imply E[X± Y± ] and E[Y± ] are finite so that E[X± Y± |D] = X± E[Y± |D]. and therefore E[X± Y± ID ] = E[X± E[Y± |D]ID ]. The left side of (10. Indeed. Y is a submartingale relative to F if (i) and (ii) are true and E[Yn+1 |Fn ] ≥ Yn . P ). (ii) E[Yn+1 |Fn ] = Yn a. the right side of (10.s. such that Fn ⊂ Fn+1 for n ≥ 0. Let Y = (Yn : n ≥ 0) be a sequence of random variables adapted to F . Ω}. F. is defined by letting Y = σ(Y : k ≤ n). 2. for n ≥ 1. k ≥ 0. (10.2. Since Sn+1 = Sn + Un . Suppose the number of offspring per individual be represented by a random variable X. . 
F is the filtration generated by (Ui : i ≥ 1): F0 = {∅. Sn ).5 (Branching process) Let Gn denote the number of individuals in the nth generation. Then S = (Sn : n ≥ 0) is a martingale with respect to F : E[Sn+1 |Fn ] = E[Un+1 |Fn ] + E[Sn+1 |Fn ] = 0 + Sn = Sn .4 M (n) = eθSn /E[eθX1 ]n in case X’s are iid. . let Mn = Sn − nσ 2 for n ≥ 0.2. k ≥ 0. Equivalently. the fact Y is a margtingale so Fn n Y with respect to F . the tower property of conditional expectation. Then. Note that if Y is a martingale with respect to a filtration F . and the fact Yn is Fn measurable.2 in terms of a sequence of independent random variables U = (Ui : i ≥ 1). . Let F = (Fn : n ≥ 0) denote the filtration i=1 generated by S: Fn = σ(S0 . MARTINGALES WITH RESPECT TO FILTRATIONS 309 Y Yn is Fn measurable. . Suppose in 2 addition that Var(Ui ) = σ 2 for some finite constant σ 2 . Thus. Indeed. Sn = n Ui . by induction on k for n fixed: E[Xn+k |Fn ] = Xn .2. aGn is a martingale.2. M is adapted to F . .5) Example 10.2 Suppose (Ui : i ≥ 1) is a collection of independent random variables. Example 10. Sn ). . Example 10. .10. Select a > 0 so that E[aX ] = 1. then for any n. in practice. . each with mean zero. whereas Fn is the smallest σ-algebra with respect to which Yn is measurable. Finally.2. Then M = (Mn : n ≥ 0) is a martingale relative to F . E[Yn+k+1 |Fn ] = E[E[Yn+k+1 |Fn+k |Fn ] = E[Yn+k |Fn ] Therefore. . Y ⊂ F . Ω} and Fn = σ(S0 . imply Y Y Y E[Yn+1 |Fn ] = E[E[Yn+1 |Fn ]|Fn ] = E[Yn |Fn ] = Yn . Let S0 = 0 and for n ≥ 1. if Y is said to be a martingale and no filtration F is specified. for n. 2 we have Mn+1 = Mn + 2Sn Un+1 + Un+1 − σ 2 so that 2 E[Mn+1 |Fn ] = E[Mn |Fn ] + 2Sn E[Un |Fn ]] + E[Un − σ 2 |Fn ]] 2 = Mn + 2Sn E[Un ] + E[Un − σ 2 ] = Mn Example 10. Therefore.3 Suppose S = (Sn : n ≥ 0) and F = (Fn : n ≥ 0) are defined as in Example 10. at least Y is a martingale with respect to the filtration it generates. Definition 10. Example 10.7 (Doob martingale) Yn = E[Φ|Fn ].2. 
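The martingale identities in Examples 10.2.2 and 10.2.3 can be checked exhaustively for a small case. The following is a minimal sketch, assuming the steps $U_i$ are iid and uniform on a finite set of values with mean zero; the function name `check_martingales` is a hypothetical helper, not something from the notes. Each path of the first $n$ steps is one atom of $\mathcal{F}_n$, so the conditional expectation given $\mathcal{F}_n$ is just an average over the next step.

```python
from fractions import Fraction
from itertools import product

def check_martingales(n_max, values=(-1, 1)):
    """Check E[S_{n+1}|F_n] = S_n and E[M_{n+1}|F_n] = M_n exactly,
    where M_n = S_n^2 - n*sigma^2, for iid steps uniform on `values`.
    Assumption (illustrative): `values` has mean zero."""
    p = Fraction(1, len(values))
    assert sum(p * u for u in values) == 0, "steps must have mean zero"
    sigma2 = sum(p * u * u for u in values)      # Var(U_i), since the mean is 0
    for n in range(n_max):
        for path in product(values, repeat=n):   # one atom of F_n
            s = sum(path)
            m = s * s - n * sigma2
            # Conditioning on the first n steps: average over U_{n+1} only.
            e_s_next = sum(p * (s + u) for u in values)
            e_m_next = sum(p * ((s + u) ** 2 - (n + 1) * sigma2) for u in values)
            if e_s_next != s or e_m_next != m:
                return False
    return True

print(check_martingales(6))               # fair +/-1 steps
print(check_martingales(4, (-2, 0, 2)))   # another mean-zero step distribution
```

Exact rational arithmetic is used so that the equalities are tested without floating-point tolerance.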
Definition 10.2.8 A martingale difference sequence $(D_n : n \ge 1)$ relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ is a sequence of random variables $(D_n : n \ge 1)$ such that

(0) $(D_n : n \ge 1)$ is adapted to $\boldsymbol{\mathcal{F}}$ (i.e. $D_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$),
(i) $E[|D_n|] < \infty$ for $n \ge 1$,
(ii) $E[D_{n+1}|\mathcal{F}_n] = 0$ a.s. for all $n \ge 0$.

Equivalently, $(D_n : n \ge 1)$ has the form $D_n = M_n - M_{n-1}$ for $n \ge 1$, for some $(M_n : n \ge 0)$ which is a martingale with respect to $\boldsymbol{\mathcal{F}}$.

Definition 10.2.9 A random process $(H_n : n \ge 1)$ is said to be predictable with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $H_n$ is $\mathcal{F}_{n-1}$ measurable for all $n \ge 1$. (Sometimes this is called "one-step" predictable, because $\mathcal{F}_n$ determines $H$ one step ahead.)

Example 10.2.10 Suppose $(D_n : n \ge 1)$ is a martingale difference sequence and $(H_k : k \ge 1)$ is predictable, both relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. We claim that the new process $\widetilde{D} = (\widetilde{D}_n : n \ge 1)$ defined by $\widetilde{D}_n = H_nD_n$ is also a martingale difference sequence with respect to $\boldsymbol{\mathcal{F}}$. Indeed, it is adapted, has finite means (assuming, for example, that each $H_n$ is bounded), and
$$E[H_{n+1}D_{n+1}|\mathcal{F}_n] = H_{n+1}E[D_{n+1}|\mathcal{F}_n] = 0,$$
where we pulled out the $\mathcal{F}_n$ measurable random variable $H_{n+1}$ from the conditional expectation given $\mathcal{F}_n$. An interpretation is that $D_n$ is the net gain to a gambler if one dollar is staked on the outcome of a fair game in round $n$, and so $H_nD_n$ is the net gain if $H_n$ dollars are staked on round $n$. The requirement that $(H_k : k \ge 1)$ be predictable means that the gambler must decide how much to stake in round $n$ based only on information available at the end of round $n - 1$. It would be an unfair advantage if the gambler already knew $D_n$ when deciding how much money to stake in round $n$.

If the initial reserves of the gambler were some constant $M_0$, then the reserves of the gambler after $n$ rounds would be given by:
$$M_n = M_0 + \sum_{k=1}^n H_kD_k.$$
Then $(M_n : n \ge 0)$ is a martingale with respect to $\boldsymbol{\mathcal{F}}$. The random variables $H_kD_k$, $1 \le k \le n$, are orthogonal. Also, $E[(H_kD_k)^2] = E[E[(H_kD_k)^2|\mathcal{F}_{k-1}]] = E[H_k^2E[D_k^2|\mathcal{F}_{k-1}]]$. Therefore,
$$E[(M_n - M_0)^2] = \sum_{k=1}^n E[H_k^2E[D_k^2|\mathcal{F}_{k-1}]].$$

10.3 Azuma-Hoeffding inequality

One of the most simple inequalities for martingales is the Azuma-Hoeffding inequality. It is proven in this section, and applications to prove concentration inequalities for some combinatorial problems are given. (See McDiarmid's survey paper.)

Lemma 10.3.1 Suppose $D$ is a random variable with $E[D] = 0$ and $P\{|D - b| \le d\} = 1$ for some constant $b$. Then for any $\alpha \in \mathbb{R}$, $E[e^{\alpha D}] \le e^{(\alpha d)^2/2}$.

Proof. Since $D$ has mean zero and $D$ lies in the interval $[b - d, b + d]$ with probability one, the interval must contain zero, so $|b| \le d$. To avoid trivial cases we assume that $|b| < d$. Since $e^{\alpha x}$ is convex in $x$, the value of $e^{\alpha x}$ for $x \in [b - d, b + d]$ is bounded above by the linear function that is equal to $e^{\alpha x}$ at the endpoints, $x = b \pm d$, of the interval:
$$e^{\alpha x} \le \frac{x - b + d}{2d}e^{\alpha(b+d)} + \frac{b + d - x}{2d}e^{\alpha(b-d)}. \tag{10.6}$$
Since $D$ lies in that interval with probability one, (10.6) remains true if $x$ is replaced by the random variable $D$. Taking expectations on both sides and using $E[D] = 0$ yields
$$E[e^{\alpha D}] \le \frac{d - b}{2d}e^{\alpha(b+d)} + \frac{b + d}{2d}e^{\alpha(b-d)}. \tag{10.7}$$
The proof is completed by showing that the right side of (10.7) is less than or equal to $e^{(\alpha d)^2/2}$ for any $|b| < d$. Letting $u = \alpha d$ and $\theta = b/d$, the inequality to be proved becomes $f(u) \le u^2/2$, for $u \in \mathbb{R}$ and $|\theta| < 1$, where
$$f(u) = \ln\left(\frac{(1 - \theta)e^{u(1+\theta)} + (1 + \theta)e^{u(-1+\theta)}}{2}\right).$$
Taylor's formula implies that $f(u) = f(0) + f'(0)u + \frac{f''(v)u^2}{2}$ for some $v$ in the interval with endpoints $0$ and $u$. Elementary, but somewhat tedious, calculations show that
$$f'(u) = \frac{(1 - \theta^2)(e^u - e^{-u})}{(1 - \theta)e^u + (1 + \theta)e^{-u}}$$
and
$$f''(u) = \frac{4(1 - \theta^2)}{[(1 - \theta)e^u + (1 + \theta)e^{-u}]^2} = \frac{1}{\cosh^2(u + \beta)},$$
where $\beta = \frac{1}{2}\ln\left(\frac{1-\theta}{1+\theta}\right)$. Note that $f(0) = f'(0) = 0$, and $f''(u) \le 1$ for all $u \in \mathbb{R}$. Thus, $f(u) \le u^2/2$ for all $u \in \mathbb{R}$, as was to be shown.

Definition 10.3.2 A random process $(B_n : n \ge 0)$ is said to be predictable with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$ if $B_0$ is a constant and $B_n$ is $\mathcal{F}_{n-1}$ measurable for all $n \ge 1$.

Proposition 10.3.3 (Azuma-Hoeffding inequality with centering) Let $(Y_n : n \ge 0)$ be a martingale and $(B_n : n \ge 0)$ be a predictable process, both with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$, such that $P[|Y_{n+1} - B_{n+1}| \le d_{n+1}] = 1$ for all $n \ge 0$. Then
$$P\{|Y_n - Y_0| \ge \lambda\} \le 2\exp\left(-\frac{\lambda^2}{2\sum_{i=1}^n d_i^2}\right).$$

Proof. Let $n \ge 0$. The idea is to write $Y_n = Y_n - Y_{n-1} + Y_{n-1}$, to use the tower property of conditional expectation, and to apply Lemma 10.3.1 to the random variable $Y_n - Y_{n-1}$ for $d = d_n$ (conditioned on $\mathcal{F}_{n-1}$, it has mean zero and lies within $d_n$ of the $\mathcal{F}_{n-1}$ measurable quantity $B_n - Y_{n-1}$). This yields:
$$E[e^{\alpha(Y_n - Y_0)}] = E[E[e^{\alpha(Y_n - Y_{n-1} + Y_{n-1} - Y_0)}|\mathcal{F}_{n-1}]] = E[e^{\alpha(Y_{n-1} - Y_0)}E[e^{\alpha(Y_n - Y_{n-1})}|\mathcal{F}_{n-1}]] \le E[e^{\alpha(Y_{n-1} - Y_0)}]e^{(\alpha d_n)^2/2}.$$
Thus, by induction on $n$,
$$E[e^{\alpha(Y_n - Y_0)}] \le e^{(\alpha^2/2)\sum_{i=1}^n d_i^2}.$$
The remainder of the proof is essentially the Chernoff inequality:
$$P\{Y_n - Y_0 \ge \lambda\} \le E[e^{\alpha(Y_n - Y_0 - \lambda)}] \le e^{(\alpha^2/2)\sum_{i=1}^n d_i^2 - \alpha\lambda}.$$
Finally, taking $\alpha$ to make this bound as tight as possible, i.e. $\alpha = \frac{\lambda}{\sum_{i=1}^n d_i^2}$, yields
$$P\{Y_n - Y_0 \ge \lambda\} \le \exp\left(-\frac{\lambda^2}{2\sum_{i=1}^n d_i^2}\right).$$
Similarly, $P\{Y_n - Y_0 \le -\lambda\}$ satisfies the same bound because the previous bound applies for $(Y_n)$ replaced by $(-Y_n)$, yielding the proposition.

Definition 10.3.4 A function $f$ of $n$ variables $x_1, \ldots, x_n$ is said to satisfy the Lipschitz condition with constant $c$ if $|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)| \le c$ for any $x_1, \ldots, x_n$, $i$, and $y_i$. (Equivalently, $f(x) - f(y) \le c\,d_H(x, y)$, where $d_H(x, y)$ denotes the Hamming distance, which is the number of coordinates in which $x$ and $y$ differ. In the analysis of functions of a continuous variable, the Euclidean distance is used instead of the Hamming distance.)

Proposition 10.3.5 (McDiarmid's inequality) Suppose $F = f(X_1, \ldots, X_n)$, where $f$ satisfies the Lipschitz condition with constant $c$, and $X_1, \ldots, X_n$ are independent random variables. Then
$$P\{|F - E[F]| \ge \lambda\} \le 2\exp\left(-\frac{2\lambda^2}{nc^2}\right).$$

Proof. Let $(Z_k : 0 \le k \le n)$ denote the Doob martingale defined by $Z_k = E[F|\mathcal{F}^X_k]$, where, as usual, $\mathcal{F}^X_k = \sigma(X_j : 1 \le j \le k)$ is the filtration generated by $(X_k)$. Note that $\mathcal{F}^X_0$ is the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$, corresponding to no observations, so $Z_0 = E[F]$. Also, $Z_n = F$. In words, $Z_k$ is the expected value of $F$, given that the first $k$ $X$'s are revealed.

For $0 \le k \le n - 1$, let
$$g_k(x_1, \ldots, x_k, x_{k+1}) = E[f(x_1, \ldots, x_{k+1}, X_{k+2}, \ldots, X_n)].$$
Note that $Z_{k+1} = g_k(X_1, \ldots, X_{k+1})$. Since $f$ satisfies the Lipschitz condition with constant $c$, the same is true of $g_k$. In particular, for $x_1, \ldots, x_k$ fixed, the set of possible values (i.e. range) of $g_k(x_1, \ldots, x_{k+1})$ as $x_{k+1}$ varies, lies within some interval (depending on $x_1, \ldots, x_k$) with length at most $c$. We define $m_k(x_1, \ldots, x_k)$ to be the midpoint of the smallest such interval:
$$m_k(x_1, \ldots, x_k) = \frac{\sup_{x_{k+1}} g_k(x_1, \ldots, x_{k+1}) + \inf_{x_{k+1}} g_k(x_1, \ldots, x_{k+1})}{2}$$
and let $B_{k+1} = m_k(X_1, \ldots, X_k)$. Then $B$ is a predictable process and $|Z_{k+1} - B_{k+1}| \le \frac{c}{2}$ with probability one. Thus, the Azuma-Hoeffding inequality with centering can be applied with $d_i = \frac{c}{2}$ for all $i$, giving the desired result.

Example 10.3.6 Let $V = \{v_1, \ldots, v_n\}$ be a finite set of cardinality $n \ge 1$. For each $i, j$ with $1 \le i < j \le n$, suppose that $Z_{i,j}$ is a Bernoulli random variable with parameter $p$, where $0 \le p \le 1$. Suppose that the $Z$'s are mutually independent. Let $G = (V, E)$ be a random graph, such that for $i < j$, there is an undirected edge between vertices $v_i$ and $v_j$ (i.e. $v_i$ and $v_j$ are neighbors) if and only if $Z_{i,j} = 1$. Equivalently, the set of edges is $E = \{\{i, j\} : i < j \text{ and } Z_{i,j} = 1\}$. An independent set in the graph is a set of vertices, no two of which are neighbors. Let $I = I(G)$ denote the maximum of the cardinalities of all independent sets for $G$. Note that $I$ is a random variable, because the graph is random. We shall apply McDiarmid's inequality to find a concentration bound for $I(G)$.

Note that $I(G) = F((Z_{i,j} : 1 \le i < j \le n))$, for an appropriate function $F$. We could write a computer program for computing $F$, for example by cycling through all subsets of $V$, seeing which ones are independent sets, and reporting the largest cardinality of the independent sets. However, there is no need to be so explicit about what $F$ is. Observe next that changing any one of the $Z$'s would change $I(G)$ by at most one. In particular, if there is an independent set in a graph, and if one edge is added to the graph, then at most one vertex would have to be removed from the independent set for the original graph to obtain one for the new graph. Thus, $F$ satisfies the Lipschitz condition with constant $c = 1$. Thus, by McDiarmid's inequality with $c = 1$ and $m = n(n - 1)/2$ variables,
$$P\{|I - E[I]| \ge \lambda\} \le 2\exp\left(-\frac{4\lambda^2}{n(n - 1)}\right).$$

More thought yields a tighter bound. For $1 \le i \le n$, let $X_i = (Z_{i,i+1}, Z_{i,i+2}, \ldots, Z_{i,n})$. In words, for each $i$, $X_i$ determines which vertices with index larger than $i$ are neighbors of vertex $v_i$. Of course $I$ is also determined by $X_1, \ldots, X_n$. Moreover, if any one of the $X$'s changes, $I$ changes by at most one. That is, $I$ can be expressed as a function of the $n$ variables $X_1, \ldots, X_n$, such that the function satisfies the Lipschitz condition with constant $c = 1$. Therefore, by McDiarmid's inequality with $c = 1$ and $n$ variables (since $X_n$ is degenerate, we could use $n - 1$ instead of $n$, but it makes little difference),
$$P\{|I - E[I]| \ge \lambda\} \le 2\exp\left(-\frac{2\lambda^2}{n}\right).$$
For example, if $\lambda = a\sqrt{n}$, we have
$$P\{|I - E[I]| \ge a\sqrt{n}\} \le 2\exp(-2a^2)$$
whenever $n \ge 1$, $0 \le p \le 1$, and $a > 0$.

McDiarmid's inequality can similarly be applied to obtain concentration inequalities for many other numbers associated with graphs, such as the size of a maximum matching (a matching is a set of edges, no two of which have a node in common), chromatic index (number of colors needed to color all edges so that all edges containing a single vertex are different colors), chromatic number (number of colors needed to color all vertices so that neighbors are different colors), minimum number of edges that need to be cut to break the graph into two equal size components, and so on.
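To make the bounds of Propositions 10.3.3 and 10.3.5 and the Lipschitz claim of Example 10.3.6 concrete, here is a small sketch. All function names are hypothetical helpers chosen for illustration. The brute-force search over vertex subsets is the "computer program" mentioned in the example and is only practical for very small graphs; vertices are labelled 0 to n-1 as an assumption of the sketch.

```python
import math
from itertools import combinations

def azuma_hoeffding_bound(lam, ds):
    """Two-sided bound 2*exp(-lam^2 / (2*sum d_i^2)) from Proposition 10.3.3."""
    return 2.0 * math.exp(-lam ** 2 / (2.0 * sum(d * d for d in ds)))

def mcdiarmid_bound(lam, n, c=1.0):
    """Two-sided bound 2*exp(-2*lam^2 / (n*c^2)) from Proposition 10.3.5."""
    return 2.0 * math.exp(-2.0 * lam ** 2 / (n * c * c))

def independence_number(n, edges):
    """Brute-force I(G) for vertices 0..n-1: size of a largest independent set."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if not any(frozenset(pair) in edge_set
                       for pair in combinations(subset, 2)):
                return size
    return 0

def max_change_from_one_edge(n, edges):
    """Largest |I(G) - I(G')| over graphs G' differing from G in exactly one edge."""
    edge_set = {frozenset(e) for e in edges}
    base = independence_number(n, edge_set)
    worst = 0
    for pair in combinations(range(n), 2):
        toggled = edge_set ^ {frozenset(pair)}   # add or remove this one edge
        worst = max(worst, abs(independence_number(n, toggled) - base))
    return worst

cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(independence_number(5, cycle5))         # independence number of a 5-cycle
print(max_change_from_one_edge(5, cycle5))    # Lipschitz check: at most 1
print(mcdiarmid_bound(2 * math.sqrt(5), 5))   # lambda = a*sqrt(n) with a = 2
```

With $d_i = c/2$ for all $i$ the two bound functions agree, which is exactly the last step of the proof of McDiarmid's inequality.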
10.4 Stopping times and the optional sampling theorem

Let $X = (X_k : k \ge 0)$ be a martingale with respect to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$. Note that $E[X_{k+1}] = E[E[X_{k+1}|\mathcal{F}_k]] = E[X_k]$. So, by induction on $n$, $E[X_n] = E[X_0]$ for all $n \ge 0$.

A useful interpretation of a martingale $X = (X_k : k \ge 0)$ is that $X_k$ is the reserve (amount of money on hand) that a gambler playing a fair game at each time step has after $k$ time steps, if $X_0$ is the initial reserve. (If the gambler is allowed to go into debt, the reserve can be negative.) The condition $E[X_{k+1}|\mathcal{F}_k] = X_k$ means that, given the knowledge that is observable up to time $k$, the expected reserve after the next game is equal to the reserve at time $k$. The equality $E[X_n] = E[X_0]$ has the natural interpretation that the expected reserve of the gambler after $n$ games have been played is equal to the initial reserve $X_0$.

What happens if the gambler stops after a random number, $T$, of games? Is it true that $E[X_T] = E[X_0]$?

Example 10.4.1 Suppose that $X_n = W_1 + \cdots + W_n$ where $P\{W_k = 1\} = P\{W_k = -1\} = 0.5$ for all $k$, and the $W$'s are independent. Let $T$ be the random time:
$$T = \begin{cases} 3 & \text{if } W_1 + W_2 + W_3 = 3 \\ 0 & \text{else.} \end{cases}$$
Then $X_T = 3$ with probability $1/8$, and $X_T = 0$ otherwise. Hence, $E[X_T] = 3/8$.

Does Example 10.4.1 give a realistic strategy for a gambler to obtain a strictly positive expected payoff from a fair game? To implement the strategy, the gambler should stop gambling after $T$ games. However, the event $\{T = 0\}$ depends on the outcomes $W_1$, $W_2$, and $W_3$. Thus, at time zero, the gambler is required to make a decision about whether to stop before any games are played based on the outcomes of the first three games. Unless the gambler can somehow predict the future, the gambler will be unable to implement the strategy of stopping play after $T$ games.

Intuitively, a random time corresponds to an implementable stopping strategy if the gambler has enough information after $n$ games to tell whether to play future games. That type of condition is captured by the notion of optional stopping time, defined as follows.

Definition 10.4.2 An optional stopping time $T$ relative to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$ is a random variable with values in $\mathbb{Z}_+$ such that for any $n \ge 0$, $\{T \le n\} \in \mathcal{F}_n$.

The intuitive interpretation of the condition $\{T \le n\} \in \mathcal{F}_n$ is that the gambler should have enough information by time $n$ to know whether to stop by time $n$. Since $\sigma$-algebras are closed under set complements, the condition in the definition of an optional stopping time is equivalent to requiring that, for any $n \ge 0$, $\{T > n\} \in \mathcal{F}_n$. This means that the gambler should have enough information by time $n$ to know whether to continue gambling strictly beyond time $n$.

Example 10.4.3 Let $(X_n : n \ge 0)$ be a random process adapted to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_n : n \ge 0)$. Let $A$ be some fixed (Borel measurable) set, and let $T = \min\{n \ge 0 : X_n \in A\}$. Then $T$ is a stopping time relative to $\boldsymbol{\mathcal{F}}$. Indeed, $\{T \le n\} = \{X_k \in A \text{ for some } k \text{ with } 0 \le k \le n\}$. So $\{T \le n\}$ is an event determined by $(X_0, \ldots, X_n)$, which is in $\mathcal{F}_n$ because $X$ is adapted to the filtration.

Example 10.4.4 Suppose $W_1, W_2, \ldots$ are independent Bernoulli random variables with $p = 0.5$, modeling fair coin flips. Suppose that if a gambler stakes some money at the beginning of the $n$th round, then if $W_n = 1$, the gambler wins back the stake and an additional equal amount. If $W_n = 0$, the gambler loses the money staked. Let $X_n$ denote the reserve of the gambler after $n$ rounds. For simplicity, we assume that the gambler can borrow money as needed, and that the initial reserve of the gambler is zero. So $X_0 = 0$. Suppose the gambler adopts the following strategy. The gambler continues playing until the first win, and in each round until stopping, the gambler stakes the amount of money needed to ensure that the reserve at the time the gambler stops is one dollar. For example, the gambler initially borrows one dollar, and stakes it on the first outcome. If $W_1 = 1$ the gambler's reserve (money in hand minus the amount borrowed) is one dollar, and the gambler stops, so $T = 1$ and $X_T = 1$. If the gambler loses in the first round (i.e. $W_1 = 0$), then $X_1 = -1$. In that case, the gambler keeps playing, and, next, borrows two more dollars and stakes them on the second outcome. If $W_2 = 1$ the gambler's reserve is one dollar, and the gambler stops. So $T = 2$ and $X_T = 1$. If the gambler loses in the second round (i.e. $W_2 = 0$), then $X_2 = -3$. In that case, the gambler keeps playing, and, next, borrows four more dollars and stakes them on the third outcome, and so on.

The random process $(X_n : n \ge 0)$ is a martingale. For this strategy, the number of rounds, $T$, that the gambler plays has the geometric distribution with parameter $p = 0.5$. Thus, $E[T] = 2$. In particular, $T$ is finite with probability one. Since no money is staked after time $T$, $X_k = X_T$ for all $k \ge T$. Thus, $X_T = 1$ a.s., while $X_0 = 0$. Thus, $E[X_T] \ne E[X_0]$. This strategy does not require the gambler to be able to predict the future, and the gambler is always up one dollar after stopping.

But don't run out and start playing this strategy, expecting to make money for sure. There is a catch: the amount borrowed can be very large. Indeed, let us compute the expectation of $B$, the total amount borrowed before the final win. If $T = 1$ then $B = 1$ (only the dollar borrowed in the first round is counted). If $T = 2$ then $B = 3$ (the first dollar in the first round, and two more in the second). In general, $B = 2^T - 1$. Thus,
$$E[B] = \sum_{n=1}^\infty (2^n - 1)P\{T = n\} = \sum_{n=1}^\infty (2^n - 1)2^{-n} = \sum_{n=1}^\infty (1 - 2^{-n}) = +\infty.$$
That is, the expected amount of money the gambler will need to borrow is infinite.

Proposition 10.4.5 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, then $E[X_{T \wedge n}] = E[X_0]$ for any $n$.

Proof. Note that
$$X_{T \wedge (n+1)} - X_{T \wedge n} = \begin{cases} 0 & \text{if } T \le n \\ X_{n+1} - X_n & \text{if } T > n \end{cases} = (X_{n+1} - X_n)I_{\{T > n\}}.$$
Using this and the tower property of conditional expectation yields
$$E[X_{T \wedge (n+1)} - X_{T \wedge n}] = E[E[(X_{n+1} - X_n)I_{\{T > n\}}|\mathcal{F}_n]] = E[E[(X_{n+1} - X_n)|\mathcal{F}_n]I_{\{T > n\}}] = 0,$$
where $I_{\{T > n\}}$ can be pulled out because $\{T > n\} \in \mathcal{F}_n$, and because $E[(X_{n+1} - X_n)|\mathcal{F}_n] = 0$. Therefore, $E[X_{T \wedge (n+1)}] = E[X_{T \wedge n}]$ for all $n \ge 0$. So by induction on $n$, $E[X_{T \wedge n}] = E[X_0]$ for all $n \ge 0$.

The following corollary follows immediately from Proposition 10.4.5.

Corollary 10.4.6 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, then $E[X_0] = \lim_{n \to \infty} E[X_{T \wedge n}]$. In particular, if
$$\lim_{n \to \infty} E[X_{T \wedge n}] = E[X_T] \tag{10.8}$$
then $E[X_T] = E[X_0]$.

By Corollary 10.4.6, the trick to establishing $E[X_T] = E[X_0]$ comes down to proving (10.8). Note that $X_{T \wedge n} \to X_T$ a.s. as $n \to \infty$, so (10.8) is simply requiring the convergence of the means to the mean of the limit, for an a.s. convergent sequence of random variables. There are several different sufficient conditions for this to happen, involving conditions on the martingale $X$, the stopping time $T$, or both. For example:

Corollary 10.4.7 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, and if $T$ is bounded (so $P\{T \le n\} = 1$ for some $n$) then $E[X_T] = E[X_0]$.

Proof. If $P\{T \le n\} = 1$ then $T \wedge n = T$ with probability one, so $E[X_{T \wedge n}] = E[X_T]$. Therefore, the corollary follows from Proposition 10.4.5.

Corollary 10.4.8 If $X$ is a martingale and $T$ is an optional stopping time, relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, and if there is a random variable $Z$ such that $|X_n| \le Z$ a.s. for all $n$, and $E[Z] < \infty$, then $E[X_T] = E[X_0]$.

Proof. Let $\epsilon > 0$. Since $E[Z] < \infty$, there exists $\delta > 0$ so that if $A$ is any set with $P[A] < \delta$, then $E[ZI_A] < \epsilon$. Since $X_{T \wedge n} \to X_T$ a.s., we also have $X_{T \wedge n} \to X_T$ in probability, so if $n$ is sufficiently large, $P\{|X_{T \wedge n} - X_T| \ge \epsilon\} < \delta$. For such $n$,
$$|X_{T \wedge n} - X_T| \le \epsilon + |X_{T \wedge n} - X_T|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}} \le \epsilon + 2|Z|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}. \tag{10.9}$$
Now $E[|Z|I_{\{|X_{T \wedge n} - X_T| > \epsilon\}}] < \epsilon$ by the choice of $\delta$ and $n$. So taking expectations of each side of (10.9) yields $E[|X_{T \wedge n} - X_T|] \le 3\epsilon$. Both $X_{T \wedge n}$ and $X_T$ have finite means, because both have absolute values less than or equal to $Z$, so
$$|E[X_{T \wedge n}] - E[X_T]| = |E[X_{T \wedge n} - X_T]| \le E[|X_{T \wedge n} - X_T|] \le 3\epsilon.$$
Since $\epsilon$ was an arbitrary positive number, (10.8) holds, and the corollary is proved.

Corollary 10.4.9 Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$. Suppose

(i) there is a constant $c$ such that $E[\,|X_{n+1} - X_n|\,|\mathcal{F}_n] \le c$ for $n \ge 0$,
(ii) $T$ is a stopping time such that $E[T] < \infty$.

Then $E[X_T] = E[X_0]$. If, instead, $(X_n : n \ge 0)$ is a submartingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying (i) and (ii), then $E[X_T] \ge E[X_0]$.

Proof. Suppose $(X_n : n \ge 0)$ is a martingale relative to $(\Omega, \boldsymbol{\mathcal{F}}, P)$, satisfying (i) and (ii). Looking at the proof of Corollary 10.4.8, we see that it is enough to show that there is a random variable $Z$ such that $E[Z] < +\infty$ and $|X_{T \wedge n}| \le Z$ for all $n \ge 0$. Let
$$Z = |X_0| + |X_1 - X_0| + \cdots + |X_T - X_{T-1}|.$$
Obviously, $|X_{T \wedge n}| \le Z$ for all $n \ge 0$, so it remains to show that $E[Z] < \infty$. But
$$\begin{aligned}
E[Z] &= E[|X_0|] + E\left[\sum_{i=1}^\infty |X_i - X_{i-1}|I_{\{i \le T\}}\right] \\
&= E[|X_0|] + E\left[\sum_{i=1}^\infty E\left[|X_i - X_{i-1}|I_{\{i \le T\}}\,\big|\,\mathcal{F}_{i-1}\right]\right] \\
&= E[|X_0|] + E\left[\sum_{i=1}^\infty I_{\{i \le T\}}E\left[|X_i - X_{i-1}|\,\big|\,\mathcal{F}_{i-1}\right]\right] \\
&\le E[|X_0|] + c\sum_{i=1}^\infty P\{i \le T\} \\
&= E[|X_0|] + cE[T] < \infty.
\end{aligned}$$
The first statement of the corollary is proved. If instead $X$ is a submartingale, then a minor variation of Proposition 10.4.5 yields that $E[X_{T \wedge n}] \ge E[X_0]$. The proof for the first part of the corollary, already given, shows that conditions (i) and (ii) imply that $E[X_{T \wedge n}] \to E[X_T]$ as $n \to \infty$. Therefore, $E[X_T] \ge E[X_0]$.
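The contrast between Example 10.4.4 and Proposition 10.4.5 can be checked by direct enumeration. The sketch below uses hypothetical helper names and assumes the doubling strategy exactly as described in the example. It computes $E[X_{T \wedge n}]$ exactly for small $n$ and the partial sums of the series for $E[B]$. The stopped reserve has mean zero for every fixed $n$ even though $X_T = 1$ a.s., because with probability $2^{-n}$ the gambler is still down $2^n - 1$ dollars after $n$ rounds.

```python
from fractions import Fraction
from itertools import product

def stopped_reserve(flips):
    """Reserve X_{T ^ n} of the doubling gambler after the rounds in `flips`
    (1 = win, 0 = loss); play stops at the first win."""
    reserve, stake = 0, 1
    for w in flips:
        if w == 1:
            return reserve + stake   # first win: reserve becomes +1
        reserve -= stake             # after k straight losses: -(2^k - 1)
        stake *= 2                   # next stake recovers all losses plus one dollar
    return reserve                   # no win within the n rounds

def expected_stopped_reserve(n):
    """Exact E[X_{T ^ n}] by enumerating all 2^n equally likely flip sequences."""
    total = sum(stopped_reserve(flips) for flips in product((0, 1), repeat=n))
    return Fraction(total, 2 ** n)

def expected_borrowed_up_to(n):
    """Partial sum sum_{k=1}^{n} (2^k - 1) * 2^{-k} of the series for E[B]."""
    return sum(Fraction(2 ** k - 1, 2 ** k) for k in range(1, n + 1))

for n in (1, 5, 10):
    print(n, expected_stopped_reserve(n), float(expected_borrowed_up_to(n)))
```

The partial sums for $E[B]$ grow roughly like $n$, consistent with the series diverging.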
Martingale inequalities offer a way to provide upper and lower bounds on the completion times of algorithms. The following example shows how a lower bound can be found for a particular game.

Example 10.4.10 Consider the following game. There is an urn, initially with $k_1$ red marbles and $k_2$ blue marbles. A player takes turns until the urn is empty, and the goal of the player is to minimize the expected number of turns required. At the beginning of each turn, the player can remove a set of marbles, and the set must be one of four types: one red, one blue, two reds, or two blues. After removing the set of marbles, a fair coin is flipped. If tails appears, the turn is over. If heads appears, then some marbles are added back to the urn, according to Table 10.1.

Table 10.1: Rules of the marble game

    Set removed    Set returned to urn on "heads"
    one red        one red and one blue
    one blue       one red and one blue
    two reds       three blues
    two blues      three reds

Our goal will be to find a lower bound on $E[T]$, where $T$ is the number of turns needed by the player until the urn is empty. The bound should hold for any strategy the player adopts. Let $X_n$ denote the total number of marbles in the urn after $n$ turns. If the player elects to remove only one marble during a turn (either red or blue) then with probability one half, two marbles are put back. Hence, for either set with one marble, the expected change in the total number of marbles in the urn is zero. If the player elects to remove two reds or two blues, then with probability one half, three marbles are put back into the urn. For these turns, the expected change in the number of marbles in the urn is $-0.5$. Hence, for any choice of $u_n$ (representing the decision of the player for the $(n+1)$th turn),
$$E[X_{n+1}|X_n, u_n] \ge X_n - 0.5 \quad \text{on } \{T > n\}.$$
That is, the drift of $X_n$ towards zero is at most $0.5$ in magnitude, so we suspect that no strategy can empty the urn in average time less than $(k_1 + k_2)/0.5$. In fact, this result is true, and it is now proved. Let $M_n = X_{n \wedge T} + \frac{n \wedge T}{2}$. By the observations above, $M$ is a submartingale. Furthermore, $|M_{n+1} - M_n| \le 2$. Either $E[T] = +\infty$ or $E[T] < \infty$. If $E[T] = +\infty$ then the inequality to be proved, $E[T] \ge 2(k_1 + k_2)$, is trivially true, so suppose $E[T] < \infty$. Then by Corollary 10.4.9, $E[M_T] \ge E[M_0] = k_1 + k_2$. Also, $M_T = \frac{T}{2}$ with probability one. Thus, $E[T] \ge 2(k_1 + k_2)$, as claimed.
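The lower bound of Example 10.4.10 can be compared with a simulation. The sketch below follows the rules of Table 10.1 but uses one particular strategy chosen purely for illustration (remove two marbles of the more plentiful colour whenever possible); it is an assumption of this sketch, not a strategy proposed in the notes, and nothing here claims it is optimal. The function name and seed are likewise arbitrary.

```python
import random

def play_marble_game(red, blue, rng):
    """Simulate one game under a simple, not necessarily optimal, strategy.
    Returns the number of turns T until the urn is empty."""
    turns = 0
    while red + blue > 0:
        turns += 1
        heads = rng.random() < 0.5
        if red >= 2 and red >= blue:    # remove two reds
            red -= 2
            if heads:
                blue += 3               # three blues come back
        elif blue >= 2:                 # remove two blues
            blue -= 2
            if heads:
                red += 3                # three reds come back
        else:                           # at most one marble of each colour is left
            if red >= 1:
                red -= 1
            else:
                blue -= 1
            if heads:                   # one red and one blue come back
                red += 1
                blue += 1
    return turns

rng = random.Random(2011)               # fixed seed so the run is repeatable
k1, k2 = 3, 2
games = [play_marble_game(k1, k2, rng) for _ in range(2000)]
print("sample mean of T:", sum(games) / len(games))
print("lower bound 2(k1 + k2):", 2 * (k1 + k2))
```

Since the bound $E[T] \ge 2(k_1 + k_2)$ holds for every strategy, the sample mean should typically come out above it, up to sampling noise.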
10.5 Notes

Material on Azuma-Hoeffding inequality and McDiarmid's method can be found in McDiarmid's tutorial article [7].

10.6 Problems

10.1 Two martingales associated with a simple branching process
Let $Y = (Y_n : n \ge 0)$ denote a simple branching process. Thus, $Y_n$ is the number of individuals in the $n$th generation, $Y_0 = 1$, the numbers of offspring of different individuals are independent, and each has the same distribution as a random variable $X$.
(a) Identify a constant $\theta$ so that $G_n = \frac{Y_n}{\theta^n}$ is a martingale.
(b) Let $E$ denote the event of eventual extinction, and let $\alpha = P\{E\}$. Show that $P[E|Y_0, \ldots, Y_n] = \alpha^{Y_n}$. Thus, $M_n = \alpha^{Y_n}$ is a martingale.
(c) Using the fact $E[M_1] = E[M_0]$, find an equation for $\alpha$. (Note: Problem 4.29 shows that $\alpha$ is the smallest positive solution to the equation, and $\alpha < 1$ if and only if $E[X] > 1$.)

10.2 A covering problem
Consider a linear array of $n$ cells. Suppose that $m$ base stations are randomly placed among the cells, such that the locations of the base stations are independent, and uniformly distributed among the $n$ cell locations. Let $r$ be a positive integer. Call a cell $i$ covered if there is at least one base station at some cell $j$ with $|i - j| \le r - 1$. Thus, each base station (except those near the edge of the array) covers $2r - 1$ cells. Note that there can be more than one base station at a given cell, and interference between base stations is ignored.
(a) Let $F$ denote the number of cells covered. Apply the method of bounded differences based on the Azuma-Hoeffding inequality to find an upper bound on $P[|F - E[F]| \ge \gamma]$.
(b) (This part is related to the coupon collector problem and may not have anything to do with martingales.) Rather than fixing the number of base stations, $m$, let $X$ denote the number of base stations needed until all cells are covered. In case $r = 1$ we have seen that $P[X \ge n\ln n + cn] \to \exp(-e^{-c})$ (the coupon collector's problem). For general $r \ge 1$, find $g_1(r)$ and $g_2(r)$ so that for any $\epsilon > 0$, $P[X \ge (g_2(r) + \epsilon)n\ln n] \to 0$ and $P[X \le (g_1(r) - \epsilon)n\ln n] \to 0$. (Ideally you can find $g_1 = g_2$, but if not, it'd be nice if they are close.)

10.3 Doob decomposition
Suppose $X = (X_k : k \ge 0)$ is an integrable (meaning $E[|X_k|] < \infty$ for each $k$) sequence adapted to a filtration $\boldsymbol{\mathcal{F}} = (\mathcal{F}_k : k \ge 0)$.
(a) Show that there is a sequence $B = (B_k : k \ge 0)$ which is predictable relative to $\boldsymbol{\mathcal{F}}$ (which means that $B_0$ is a constant and $B_k$ is $\mathcal{F}_{k-1}$ measurable for $k \ge 1$) and a mean zero martingale $M = (M_k : k \ge 0)$, such that $X_k = B_k + M_k$ for all $k$.
(b) Are the sequences $B$ and $M$ uniquely determined by $X$ and $\boldsymbol{\mathcal{F}}$?

10.4 On uniform integrability
(a) Show that if $(X_i : i \in I)$ and $(Y_i : i \in I)$ are both uniformly integrable collections of random variables with the same index set $I$, then $(Z_i : i \in I)$, where $Z_i = X_i + Y_i$ for all $i$, is also a uniformly integrable collection.
(b) Show that a collection of random variables $(X_i : i \in I)$ is uniformly integrable if and only if there exists a convex increasing function $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ with $\lim_{c \to \infty} \frac{\varphi(c)}{c} = +\infty$ and a constant $K$, such that $E[\varphi(|X_i|)] \le K$ for all $i \in I$.

10.5 Stopping time properties
(a) Show that if $S$ and $T$ are stopping times for some filtration $\boldsymbol{\mathcal{F}}$, then $S \wedge T$, $S \vee T$, and $S + T$ are also stopping times.
(b) Show that if $\boldsymbol{\mathcal{F}}$ is a filtration and $X = (X_k : k \ge 0)$ is the random sequence defined by $X_k = I_{\{T \le k\}}$ for some random time $T$ with values in $\mathbb{Z}_+$, then $T$ is a stopping time if and only if $X$ is $\boldsymbol{\mathcal{F}}$-adapted.
(c) If $T$ is a stopping time for a filtration $\boldsymbol{\mathcal{F}}$, recall that $\mathcal{F}_T$ is the set of events $A$ such that $A \cap \{T \le n\} \in \mathcal{F}_n$ for all $n$. (Or, for discrete time, the set of events $A$ such that $A \cap \{T = n\} \in \mathcal{F}_n$ for all $n$.) Show that (i) $\mathcal{F}_T$ is a $\sigma$-algebra, (ii) $T$ is $\mathcal{F}_T$ measurable, and (iii) if $X$ is an adapted process then $X_T$ is $\mathcal{F}_T$ measurable.

10.6 A stopped random walk
Let $W_1, W_2, \ldots$ be a sequence of independent, identically distributed mean zero random variables. To avoid triviality, assume $P\{W_1 = 0\} = 0$. Let $S_0 = 0$ and $S_n = W_1 + \cdots + W_n$ for $n \ge 1$. Fix a constant $c > 0$ and let $\tau = \min\{n \ge 0 : |S_n| \ge c\}$. The goal of this problem is to show that $E[S_\tau] = 0$.
(a) Show that $E[S_\tau] = 0$ if there is a constant $D$ so that $P[|W_i| > D] = 0$. (Hint: Invoke a version of the optional stopping theorem.)
(b) In view of part (a), we need to address the case that the $W$'s are not bounded. Let
$$\widetilde{W}_n = \begin{cases} W_n & \text{if } |W_n| \le 2c \\ a & \text{if } W_n > 2c \\ -b & \text{if } W_n < -2c \end{cases}$$
where the constants $a$ and $b$ are selected so that $a \ge 2c$, $b \ge 2c$, and $E[\widetilde{W}_i] = 0$. Let $\widetilde{S}_n = \widetilde{W}_1 + \cdots + \widetilde{W}_n$ for $n \ge 0$. Note that if $\tau \ge n$ and if $\widetilde{W}_n \ne W_n$, then $\tau = n$. Thus, $\tau$ defined above also satisfies $\tau = \min\{n \ge 0 : |\widetilde{S}_n| \ge c\}$. Let $\widetilde{\sigma}^2 = \mathrm{Var}(\widetilde{W}_i)$ and let $\widetilde{M}_n = \widetilde{S}_n^2 - n\widetilde{\sigma}^2$. Show that $\widetilde{M}$ is a martingale. Hence, $E[\widetilde{M}_{\tau \wedge n}] = 0$ for all $n$. Conclude that $E[\tau] < \infty$.
(c) Show that $E[S_\tau] = 0$. (Hint: Use part (b) and invoke a version of the optional stopping theorem.)

10.7 Bounding the value of a game
Consider the following game. Initially a jar has $a_o$ red marbles and $b_o$ blue marbles. On each turn, the player removes a set of marbles, consisting of either one or two marbles of the same color, and then flips a fair coin. If heads appears on the coin, then if one marble was removed, one of each color is added to the jar, and if two marbles were removed, then three marbles of the other color are added back to the jar. If tails appears, no marbles are added back to the jar. The turn is then over. Play continues until the jar is empty after a turn, and then the game ends. Let $\tau$ be the number of turns in the game. The goal of the player is to minimize $E[\tau]$. A strategy is a rule to decide what set of marbles to remove at the beginning of each turn.
(a) Find a lower bound on $E[\tau]$ that holds no matter what strategy the player selects.
(b) Suggest a strategy that approximately minimizes $E[\tau]$, and for that strategy, find an upper bound on $E[\tau]$.

10.8 On the size of a maximum matching in a random bipartite graph
Given $1 \le d < n$, let $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ be disjoint sets of cardinality $n$, and let $G$ be a bipartite random graph with vertex set $U \cup V$, such that if $V_i$ denotes the set of neighbors of $u_i$, then $V_1, \ldots, V_n$ are independent, and each is uniformly distributed over the set of all $\binom{n}{d}$ subsets of $V$ of cardinality $d$. A matching for $G$ is a subset of edges $M$ such that no two edges in $M$ have a common vertex. Let $Z$ denote the maximum of the cardinalities of the matchings for $G$.
(a) Find bounds $a$ and $b$, with $0 < a \le b < n$, so that $a \le E[Z] \le b$.
(b) Give an upper bound on $P\{|Z - E[Z]| \ge \gamma\sqrt{n}\}$, for $\gamma > 0$, showing that for fixed $d$, the distribution of $Z$ is concentrated about its mean as $n \to \infty$.
(c) Suggest a greedy algorithm for finding a large cardinality matching.

10.9 * Equivalence of having the form $g(Y)$ and being measurable relative to the sigma algebra generated by $Y$
Let $Y$ and $Z$ be random variables on the same probability space. The purpose of this problem is to establish that $Z = g(Y)$ for some Borel measurable function $g$ if and only if $Z$ is $\sigma(Y)$ measurable.
("only if" part) Suppose $Z = g(Y)$ for a Borel measurable function $g$, and let $c \in \mathbb{R}$. It must be shown that $\{Z \le c\} \in \sigma(Y)$. Since $g$ is a Borel measurable function, by definition, $A = \{y : g(y) \le c\}$ is a Borel subset of $\mathbb{R}$.
(a) Show that $\{Z \le c\} = \{Y \in A\}$.
(b) Using the definition of Borel sets, show that $\{Y \in A\} \in \sigma(Y)$ for any Borel set $A$. The "only if" part follows.
("if" part) Suppose $Z$ is $\sigma(Y)$ measurable. It must be shown that $Z$ has the form $g(Y)$ for some Borel measurable function $g$.
(c) Prove this first in the special case that $Z$ has the form of an indicator function: $Z = I_B$, for some event $B$, which satisfies $B \in \sigma(Y)$.
(d) Prove the "if" part in general. (Hint: $Z$ can be written as the supremum of a countable set of random variables, with each being a constant times an indicator function: $Z = \sup_n q_n I_{\{Z \ge q_n\}}$, where $q_1, q_2, \ldots$ is an enumeration of the rational numbers.)
find an upper bound on E[τ ]. Let Sn = W1 + . A strategy is a rule to decide what set of marbles to remove at the beginning of each turn. show that {Y ∈ A} ∈ σ(Y ) for any Borel set A. Hence. and −b if Wn < −2c ˜ E[Wi ] = 0.10. and if two marbles were removed. and let G be a bipartite random graph with vertex set U ∪ V . . with each being a constant times an indicator function: Z = supn qn I{Z≤qn } . but any two such versions are equal with probability one. q2 . the result is a valid CDF (i. . . let Φ(q) = P [X ≤ q|D]. (c) Prove this first in the special case that Z has the form of an indicator function: Z = IB . (Hint: Z can be written as the supremum of a countable set of random variables. +∞}. ω) for ω ∈ N by letting Φ(q. P ) and let D be a sub-σ-algebra of F. fixed CDF. is a function of (c. . It must be shown that Z has the form g(Y ) for some Borel measurable function g. (2) for each ω fixed. which satisfies B ∈ σ(Y ). . defined next.m:qn <qm {Φ(qn ) > Φ(qm ). Show that. Let FX|D (c|ω) be a regular conditional CDF of X given D. we pick Φ(q) to be one particular version of P [X ≤ q|D]. Symbollically: E[X|D](ω) = R cFX|D (dc|ω).e. in the special case that E[|X|] < ∞. The purpose of this problem is to prove the existence of a regular conditional CDF. listed in some order. A regular conditional CDF of X given D.322 CHAPTER 10.) 10. for some event B. For each rational number q. FX|D (c|ω) is a valid CDF. Then for any c ∈ IR and ω ∈ Ω. P ) and let D be a sub-σ-algebra of F. Then for each ω.10 * Regular conditional distributions Let X be a random variable on (Ω. The event N defined by N = ∩n. as a function of c for ω fixed. Thus. P {Φ(q) > Φ(q )} = 0 if q < q .} thus has probability zero. FX|D (c|ω) is a D measurable function of ω.} denote the set of rational numbers. condtional CDF of X given D. denoted by FX|D (c|ω). A conditional probability such as P [X ≤ c|D] for a fixed constant c can sometimes have different versions.) (d) Prove the “if” part in general. F. 
the idea of regular conditional distributions. FX|D (c|ω) is a version of P [X ≤ c|D]. right-continuous. Roughly speaking. is an enumeration of the set of rational numbers. Φ(q) is a random variable.) The difficulty is that there are uncountably many choices of c. for each rational number q. (3) for any c ∈ R. which is contained in the extended real line R ∪ {−∞. ω) = Fo (q) for ω ∈ N and all rational q. where q1 . Let {q1 . MARTINGALES (“if” part) Suppose Z is σ(Y ) measurable. ω) q≥c Show that Φ so defined is a regular. let Φ(c. is to select a version of P [X ≤ c|D] for every real number c so that. and so we can also write it at as Φ(q.11 * An even more general definition of conditional expectation. nondecreasing. with limit zero at −∞ and limit one at +∞. as a function of c. Here is the definition. 10. where Fo is an arbitrary. . By the positivity preserving property of conditional expectations. and the conditional version of Jensen’s inequality Let X be a random variable on (Ω. F. (Hint: Appeal to the definition of σ(Y ). ω) ∈ R × Ω such that: (1) for each c ∈ R fixed. we can define E[X|D] at ω to equal the mean for the CDF FX|D (c|ω) : c ∈ R}. ω) = inf Φ(q. That is. q2 . this . Modify Φ(q. ω) to make explicit the dependence on ω. . PROBLEMS 323 definition is consistent with the one given previously. the following conditional version of Jensen’s inequality holds: If φ is a convex function on R. The proof is given by applying the ordinary Jensen’s inequality for each ω fixed.10. then E[φ(X)|D] ≥ φ(E[X|D]) a.s.6. for the regular conditional CDF of X given D evaluated at ω. As an application. MARTINGALES .324 CHAPTER 10. Ac AB A−B ∞ = = = = = = = = = complement of A A∩B AB c {a : a ∈ Ai for some i} {a : a ∈ Ai for all i} a if a ≥ b b if a < b min{a. b} a ∨ 0 = max{a. b] = {x : a < x ≤ b} [a. b) = {x : a < x < b} [a. 0} 1 if x ∈ A 0 else (a. b) = {x : a ≤ x < b} Z Z+ R R+ C − − − − = . 
Chapter 11
Appendix

11.1 Some notation
The following notational conventions are used in these notes.

$A^c$ = complement of $A$
$AB$ = $A \cap B$
$A \subset B$ ↔ any element of $A$ is also an element of $B$
$A - B$ = $AB^c$
$\bigcup_{i=1}^{\infty} A_i$ = $\{a : a \in A_i \text{ for some } i\}$
$\bigcap_{i=1}^{\infty} A_i$ = $\{a : a \in A_i \text{ for all } i\}$
$a \vee b$ = $\max\{a, b\}$ = $a$ if $a \ge b$, $b$ if $a < b$
$a \wedge b$ = $\min\{a, b\}$
$a_+$ = $a \vee 0 = \max\{a, 0\}$
$I_A(x)$ = 1 if $x \in A$, 0 else
$(a, b) = \{x : a < x < b\}$, $(a, b] = \{x : a < x \le b\}$
$[a, b) = \{x : a \le x < b\}$, $[a, b] = \{x : a \le x \le b\}$
$\mathbb{Z}$ = set of integers
$\mathbb{Z}_+$ = set of nonnegative integers
$\mathbb{R}$ = set of real numbers
$\mathbb{R}_+$ = set of nonnegative real numbers
$\mathbb{C}$ = set of complex numbers
$A_1 \times \cdots \times A_n$ = $\{(a_1, \ldots, a_n)^T : a_i \in A_i \text{ for } 1 \le i \le n\}$
$A^n$ = $A \times \cdots \times A$ ($n$ times)
$\lfloor t \rfloor$ = greatest integer $n$ such that $n \le t$
$\lceil t \rceil$ = least integer $n$ such that $n \ge t$
$A \triangleq$ expression: denotes that $A$ is defined by the expression

All the trigonometric identities required in these notes can be easily derived from the two identities:
$$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$$
$$\sin(a + b) = \sin(a)\cos(b) + \cos(a)\sin(b)$$
and the facts $\cos(-a) = \cos(a)$ and $\sin(-b) = -\sin(b)$.

A set of numbers is countably infinite if the numbers in the set can be listed in a sequence $x_i : i = 1, 2, \ldots$. For example, the set of rational numbers is countably infinite, but the set of all real numbers in any interval of positive length is not countably infinite.

11.2 Convergence of sequences of numbers
We begin with some basic definitions. Let $(x_n) = (x_1, x_2, \ldots)$ and $(y_n) = (y_1, y_2, \ldots)$ be sequences of numbers and let $x$ be a number. By definition, $x_n$ converges to $x$ as $n$ goes to infinity if for each $\varepsilon > 0$ there is an integer $n_\varepsilon$ so that $|x_n - x| < \varepsilon$ for every $n \ge n_\varepsilon$. We write $\lim_{n\to\infty} x_n = x$ to denote that $x_n$ converges to $x$.

Example 11.2.1 Let $x_n = \frac{2n+4}{n^2+1}$. Let us verify that $\lim_{n\to\infty} x_n = 0$. The inequality $|x_n| < \varepsilon$ holds if $2n + 4 < \varepsilon(n^2 + 1)$. Therefore it holds if $2n + 4 \le \varepsilon n^2$. Therefore it holds if both $2n \le \frac{\varepsilon}{2} n^2$ and $4 \le \frac{\varepsilon}{2} n^2$. So if $n_\varepsilon = \lceil \max\{\frac{4}{\varepsilon}, \sqrt{\frac{8}{\varepsilon}}\} \rceil$ then $n \ge n_\varepsilon$ implies that $|x_n| < \varepsilon$. So $\lim_{n\to\infty} x_n = 0$.

By definition, $(x_n)$ converges to $+\infty$ as $n$ goes to infinity if for every $K > 0$ there is an integer $n_K$ so that $x_n \ge K$ for every $n \ge n_K$. Convergence to $-\infty$ is defined in a similar way.$^1$ For example, $n^3 \to \infty$ as $n \to \infty$ and $n^3 - 2n^4 \to -\infty$ as $n \to \infty$.

$^1$ Some authors reserve the word "convergence" for convergence to a finite limit. When we say a sequence converges to $+\infty$ some would say the sequence diverges to $+\infty$.

Occasionally a two-dimensional array of numbers $(a_{m,n} : m \ge 1, n \ge 1)$ is considered. By definition, $a_{m,n}$ converges to a number $a^*$ as $m$ and $n$ jointly go to infinity if for each $\varepsilon > 0$ there is $n_\varepsilon > 0$ so that $|a_{m,n} - a^*| < \varepsilon$ for every $m, n \ge n_\varepsilon$. We write $\lim_{m,n\to\infty} a_{m,n} = a$ to denote that $a_{m,n}$ converges to $a$ as $m$ and $n$ jointly go to infinity.

Theoretical Exercise Let $a_{m,n} = 1$ if $m = n$ and $a_{m,n} = 0$ if $m \ne n$. Show that $\lim_{n\to\infty}(\lim_{m\to\infty} a_{m,n}) = \lim_{m\to\infty}(\lim_{n\to\infty} a_{m,n}) = 0$ but that $\lim_{m,n\to\infty} a_{m,n}$ does not exist.

Theoretical Exercise Let $a_{m,n} = \frac{(-1)^{m+n}}{\min(m,n)}$. Show that $\lim_{m\to\infty} a_{m,n}$ does not exist for any $n$ and $\lim_{n\to\infty} a_{m,n}$ does not exist for any $m$, but $\lim_{m,n\to\infty} a_{m,n} = 0$.

Theoretical Exercise If $\lim_{m,n\to\infty} a_{m,n} = a^*$ and $\lim_{m\to\infty} a_{m,n} = b_n$ for each $n$, then $\lim_{n\to\infty} b_n = a^*$.

The condition $\lim_{m,n\to\infty} a_{m,n} = a^*$ can be expressed in terms of convergence of sequences depending on only one index (as can all the other limits discussed in these notes) as follows. Namely, $\lim_{m,n\to\infty} a_{m,n} = a^*$ is equivalent to the following: $\lim_{k\to\infty} a_{m_k,n_k} = a^*$ whenever $((m_k, n_k) : k \ge 1)$ is a sequence of pairs of positive integers such that $m_k \to \infty$ and $n_k \to \infty$ as $k \to \infty$. The condition that the limit $\lim_{m,n\to\infty} a_{m,n}$ exists is equivalent to the condition that the limit $\lim_{k\to\infty} a_{m_k,n_k}$ exists whenever $((m_k, n_k) : k \ge 1)$ is a sequence of pairs of positive integers such that $m_k \to \infty$ and $n_k \to \infty$ as $k \to \infty$.$^2$

$^2$ We could add here the condition that the limit should be the same for all choices of sequences, but it is automatically true. If two sequences were to yield different limits of $a_{m_k,n_k}$, a third sequence could be constructed by interleaving the first two, and $a_{m_k,n_k}$ wouldn't be convergent for that sequence.

A sequence $a_1, a_2, \ldots$ is said to be nondecreasing if $a_i \le a_j$ for $i < j$. Similarly a function $f$ on the real line is nondecreasing if $f(x) \le f(y)$ whenever $x < y$. The sequence is called strictly increasing if $a_i < a_j$ for $i < j$ and the function is called strictly increasing if $f(x) < f(y)$ whenever $x < y$.$^3$ A strictly increasing or strictly decreasing sequence is said to be strictly monotone, and a nondecreasing or nonincreasing sequence is said to be monotone.

$^3$ We avoid simply saying "increasing," because for some authors it means strictly increasing and for other authors it means nondecreasing. While inelegant, our approach is safer.

The sum of an infinite sequence is defined to be the limit of the partial sums. That is, by definition,
$$\sum_{k=1}^{\infty} y_k = x \quad \text{means that} \quad \lim_{n\to\infty} \sum_{k=1}^{n} y_k = x.$$
Often we want to show that a sequence converges even if we don't explicitly know the value of the limit. A sequence $(x_n)$ is bounded if there is a number $L$ so that $|x_n| \le L$ for all $n$. Any sequence that is bounded and monotone converges to a finite number.

Example 11.2.2 Consider the sum $\sum_{k=1}^{\infty} k^{-\alpha}$ for a constant $\alpha > 1$. For each $n$ the $n$th partial sum can be bounded by comparison to an integral, based on the fact that for $k \ge 2$, the $k$th term of the sum is less than the integral of $x^{-\alpha}$ over the interval $[k-1, k]$:
$$\sum_{k=1}^{n} k^{-\alpha} \le 1 + \int_1^n x^{-\alpha}\,dx = 1 + \frac{1 - n^{1-\alpha}}{\alpha - 1} \le 1 + \frac{1}{\alpha - 1} = \frac{\alpha}{\alpha - 1}.$$
The partial sums are also monotone nondecreasing (in fact, strictly increasing). Therefore the sum $\sum_{k=1}^{\infty} k^{-\alpha}$ exists and is finite.
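The comparison in Example 11.2.2 can be checked numerically. The following is only an illustrative sketch, not part of the original notes; the function names `partial_sum` and `integral_bound` are hypothetical and chosen for this illustration. It computes partial sums of $\sum k^{-\alpha}$ and compares them with the bound $\alpha/(\alpha - 1)$ obtained from the integral.

```python
def partial_sum(alpha, n):
    """n-th partial sum of sum_{k>=1} k^(-alpha)."""
    return sum(k ** (-alpha) for k in range(1, n + 1))


def integral_bound(alpha):
    """Upper bound alpha / (alpha - 1) from the integral comparison (needs alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the bound requires alpha > 1")
    return alpha / (alpha - 1.0)


if __name__ == "__main__":
    for alpha in (1.5, 2.0, 3.0):
        sums = [partial_sum(alpha, n) for n in (10, 100, 1000)]
        # The partial sums increase with n and stay below the bound.
        print(alpha, sums, integral_bound(alpha))
```

The partial sums are bounded and monotone, which is the property the example uses to conclude convergence without knowing the value of the limit.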
A sequence $(x_n)$ is a Cauchy sequence if $\lim_{m,n\to\infty} |x_m - x_n| = 0$. It is not hard to show that if $x_n$ converges to a finite limit $x$ then $(x_n)$ is a Cauchy sequence. More useful is the converse statement, called the Cauchy criteria for convergence, or the completeness property of $\mathbb{R}$: If $(x_n)$ is a Cauchy sequence then $x_n$ converges to a finite limit as $n$ goes to infinity.

Example 11.2.3 Suppose $(x_n : n \ge 1)$ is a sequence such that $\sum_{i=1}^{\infty} |x_{i+1} - x_i| < \infty$. The Cauchy criteria can be used to show that the sequence $(x_n : n \ge 1)$ is convergent. Suppose $1 \le m < n$. Then by the triangle inequality for absolute values:
$$|x_n - x_m| \le \sum_{i=m}^{n-1} |x_{i+1} - x_i|$$
or, equivalently,
$$|x_n - x_m| \le \left| \sum_{i=1}^{n-1} |x_{i+1} - x_i| - \sum_{i=1}^{m-1} |x_{i+1} - x_i| \right|. \qquad (11.1)$$
Inequality (11.1) also holds if $1 \le n \le m$. By the definition of the sum $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$, both sums on the right side of (11.1) converge to $\sum_{i=1}^{\infty} |x_{i+1} - x_i|$ as $m, n \to \infty$, so the right side of (11.1) converges to zero as $m, n \to \infty$. Thus, $(x_n)$ is a Cauchy sequence, and it is hence convergent.

Theoretical Exercise
1. Show that if $\lim_{n\to\infty} x_n = x$ and $\lim_{n\to\infty} y_n = y$ then $\lim_{n\to\infty} x_n y_n = xy$.
2. Find the limits and prove convergence as $n \to \infty$ for the following sequences:
(a) $x_n = \frac{\cos(n^2)}{n^2+1}$, (b) $y_n = \frac{n^2}{\log n}$, (c) $z_n = \sum_{k=2}^{n} \frac{1}{k \log k}$

The minimum of a set of numbers, $A$, written $\min A$, is the smallest number in the set, if there is one. For example, $\min\{3, 5, 19, -2\} = -2$. Of course, $\min A$ is well defined if $A$ is finite (i.e. has finite cardinality). Some sets fail to have a minimum, for example neither $\{1, 1/2, 1/3, 1/4, \ldots\}$ nor $\{0, -1, -2, \ldots\}$ have a smallest number. The infimum of a set of numbers $A$, written $\inf A$, is the greatest lower bound for $A$. If $A$ is bounded below, then $\inf A = \max\{c : c \le a \text{ for all } a \in A\}$. For example, $\inf\{1, 1/2, 1/3, 1/4, \ldots\} = 0$. If there is no finite lower bound, the infimum is $-\infty$. For example, $\inf\{0, -1, -2, \ldots\} = -\infty$. By convention, the infimum of the empty set is $+\infty$. With these conventions, if $A \subset B$ then $\inf A \ge \inf B$. The infimum of any subset of $\mathbb{R}$ exists, and if $\min A$ exists, then $\min A = \inf A$, so the notion of infimum extends the notion of minimum to all subsets of $\mathbb{R}$.

Similarly, the maximum of a set of numbers $A$, written $\max A$, is the largest number in the set, if there is one. The supremum of a set of numbers $A$, written $\sup A$, is the least upper bound for $A$. We have $\sup A = -\inf\{-a : a \in A\}$. In particular, $\sup A = +\infty$ if $A$ is not bounded above, and $\sup \emptyset = -\infty$. The supremum of any subset of $\mathbb{R}$ exists, and if $\max A$ exists, then $\max A = \sup A$, so the notion of supremum extends the notion of maximum to all subsets of $\mathbb{R}$.

The notions of infimum and supremum of a set of numbers are useful because they exist for any set of numbers. There is a pair of related notions that generalizes the notion of limit. Not every sequence has a limit, but the following terminology is useful for describing the limiting behavior of a sequence, whether or not the sequence has a limit.

Definition 11.2.4 The liminf (also called limit inferior) of a sequence $(x_n : n \ge 1)$ is defined by
$$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \left[ \inf\{x_k : k \ge n\} \right], \qquad (11.2)$$
and the limsup (also called limit superior) is defined by
$$\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \left[ \sup\{x_k : k \ge n\} \right]. \qquad (11.3)$$
The possible values of the liminf and limsup of a sequence are $\mathbb{R} \cup \{-\infty, +\infty\}$.

The limit on the right side of (11.2) exists because the infimum inside the square brackets is monotone nondecreasing in $n$. Similarly, the limit on the right side of (11.3) exists. So every sequence of numbers has a liminf and limsup.

Definition 11.2.5 A subsequence of a sequence $(x_n : n \ge 1)$ is a sequence of the form $(x_{k_i} : i \ge 1)$, where $k_1, k_2, \ldots$ is a strictly increasing sequence of integers. The set of limit points of a sequence is the set of all limits of convergent subsequences. The values $-\infty$ and $+\infty$ are possible limit points.

Example 11.2.6 Suppose $y_n = 121 - 25n^2$ for $n \le 100$ and $y_n = 1/n$ for $n \ge 101$. The liminf and limsup of a sequence do not depend on any finite number of terms of the sequence, so the values of $y_n$ for $n \le 100$ are irrelevant. For all $n \ge 101$, $\inf\{y_k : k \ge n\} = \inf\{1/n, 1/(n+1), \ldots\} = 0$, which trivially converges to 0 as $n \to \infty$. So the liminf of $(y_n)$ is zero. For all $n \ge 101$, $\sup\{y_k : k \ge n\} = \sup\{1/n, 1/(n+1), \ldots\} = 1/n$, which also converges to 0 as $n \to \infty$. So the limsup of $(y_n)$ is also zero. Zero is also the only limit point of $(y_n)$.

Example 11.2.7 Consider the sequence of numbers $(2, -3/2, 4/3, -5/4, 6/5, \ldots)$, which we also write as $(x_n : n \ge 1)$ such that $x_n = \frac{(n+1)(-1)^{n+1}}{n}$. The maximum (and supremum) of the sequence is 2, and the minimum (and infimum) of the sequence is $-3/2$. But for large $n$, the sequence alternates between numbers near one and numbers near minus one. More precisely, the subsequence of odd numbered terms, $(x_{2i-1} : i \ge 1)$, has limit $+1$, and the subsequence of even numbered terms, $(x_{2i} : i \ge 1)$, converges to $-1$. Thus, both 1 and $-1$ are limit points of the sequence, and there aren't any other limit points. The overall sequence itself does not converge (i.e. does not have a limit) but $\liminf_{n\to\infty} x_n = -1$ and $\limsup_{n\to\infty} x_n = +1$.

Some simple facts about the limit, liminf, limsup, and limit points of a sequence are collected in the following proposition. The proof is left to the reader.

Proposition 11.2.8 Let $(x_n : n \ge 1)$ denote a sequence of numbers.
1. The condition $\liminf_{n\to\infty} x_n = x_\infty$ is equivalent to the following: for any $\gamma < x_\infty$, $x_n \ge \gamma$ for all sufficiently large $n$.
2. The condition $\limsup_{n\to\infty} x_n = x_\infty$ is equivalent to the following: for any $\gamma > x_\infty$, $x_n \le \gamma$ for all sufficiently large $n$.
3. $\liminf_{n\to\infty} x_n \le \limsup_{n\to\infty} x_n$.
4. $\lim_{n\to\infty} x_n$ exists if and only if the liminf equals the limsup, and if the limit exists, then the limit, liminf, and limsup are equal.
5. $\lim_{n\to\infty} x_n$ exists if and only if the sequence has exactly one limit point, $x^*$, and if the limit exists, it is equal to that one limit point.
6. Both the liminf and limsup of the sequence are limit points. The liminf is the smallest limit point and the limsup is the largest limit point (keep in mind that $-\infty$ and $+\infty$ are possible values of the liminf, limsup, or a limit point).

Theoretical Exercise
1. Prove Proposition 11.2.8.
2. Here's a more challenging one. Let $r$ be an irrational constant, and let $x_n = nr - \lfloor nr \rfloor$ for $n \ge 1$. Show that every point in the interval $[0, 1]$ is a limit point of $(x_n : n \ge 1)$. (P. Bohl, W. Sierpinski, and H. Weyl independently proved a stronger result in 1909-1910: namely, the fraction of the first $n$ values falling into a subinterval converges to the length of the subinterval.)
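The tail infimum and tail supremum appearing in (11.2) and (11.3) can be tabulated for the sequence of Example 11.2.7. This is a sketch for illustration only, with hypothetical helper names `x` and `tail_inf_sup`. It truncates each tail at a finite horizon, which is an assumption that happens to be harmless for this particular sequence, because the tail infimum is attained at the first even index and the tail supremum at the first odd index of the tail.

```python
def x(n):
    """Term x_n = (n + 1) * (-1)**(n + 1) / n of Example 11.2.7 (n >= 1)."""
    return (n + 1) / n * (-1) ** (n + 1)


def tail_inf_sup(start, horizon):
    """Min and max of x_k over start <= k <= horizon.

    For this sequence they equal inf{x_k : k >= start} and
    sup{x_k : k >= start} as soon as horizon >= start + 1.
    """
    tail = [x(k) for k in range(start, horizon + 1)]
    return min(tail), max(tail)


if __name__ == "__main__":
    for start in (1, 10, 100, 1000):
        lo, hi = tail_inf_sup(start, start + 50)
        # lo increases toward -1 (the liminf); hi decreases toward +1 (the limsup).
        print(start, lo, hi)
```

The tail infimum is nondecreasing in the starting index and the tail supremum is nonincreasing, which is why the limits in (11.2) and (11.3) always exist.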
11.3 Continuity of functions
Let $f$ be a function on $\mathbb{R}^n$ for some $n$, and let $x_o \in \mathbb{R}^n$. The function has a limit $y$ at $x_o$, and such situation is denoted by $\lim_{x \to x_o} f(x) = y$, if the following is true. Given $\varepsilon > 0$, there exists $\delta > 0$ so that $|f(x) - y| \le \varepsilon$ whenever $0 < \|x - x_o\| < \delta$. This convergence condition can also be expressed in terms of convergence of sequences, as follows. The condition $\lim_{x \to x_o} f(x) = y$ is equivalent to the condition $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $\mathbb{R}^n - x_o$ such that $x_n \to x_o$.

The function $f$ is said to be continuous at $x_o$, or equivalently, $x_o$ is said to be a continuity point of $f$, if $\lim_{x \to x_o} f(x) = f(x_o)$. In terms of sequences, $f$ is continuous at $x_o$ if $f(x_n) \to f(x_o)$ whenever $x_1, x_2, \ldots$ is a sequence converging to $x_o$. The function $f$ is simply said to be continuous if it is continuous at every point in $\mathbb{R}^n$.

Let $n = 1$, so consider a function $f$ on $\mathbb{R}$, and let $x_o \in \mathbb{R}$. The function has a right-hand limit $y$ at $x_o$, and such situation is denoted by $f(x_o+) = y$ or $\lim_{x \searrow x_o} f(x) = y$, if the following is true. Given $\varepsilon > 0$, there exists $\delta > 0$ so that $|f(x) - y| \le \varepsilon$ whenever $0 < x - x_o < \delta$. Equivalently, $f(x_o+) = y$ if $f(x_n) \to y$ for any sequence $x_1, x_2, \ldots$ from $(x_o, +\infty)$ such that $x_n \to x_o$. The left-hand limit $f(x_o-) = \lim_{x \nearrow x_o} f(x)$ is defined similarly. If $f$ is monotone nondecreasing, then the left-hand and right-hand limits exist, and $f(x_o-) \le f(x_o) \le f(x_o+)$ for all $x_o$.

A function $f$ is called right-continuous at $x_o$ if $f(x_o) = f(x_o+)$. A function $f$ is simply called right-continuous if it is right-continuous at all points.

Definition 11.3.1 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints $a < b$ is piecewise continuous, if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $f$ is continuous over $(t_{k-1}, t_k)$ and has finite limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or an interval in $\mathbb{R}$, $f$ is piecewise continuous over $T$ if it is piecewise continuous over every bounded subinterval of $T$.

11.4 Derivatives of functions
Let $f$ be a function on $\mathbb{R}$ and let $x_o \in \mathbb{R}$. Suppose $f$ is defined on an open interval containing $x_o$. Then $f$ is differentiable at $x_o$ if the following limit exists and is finite:
$$\lim_{x \to x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$
The value of the limit is the derivative of $f$ at $x_o$, written as $f'(x_o)$. In more detail, this condition that $f$ is differentiable at $x_o$ means there is a finite value $f'(x_o)$ so that, for any $\varepsilon > 0$, there exists $\delta > 0$, so that
$$\left| \frac{f(x) - f(x_o)}{x - x_o} - f'(x_o) \right| \le \varepsilon$$
whenever $0 < |x - x_o| < \delta$. Alternatively, in terms of convergence of sequences, it means there is a finite value $f'(x_o)$ so that
$$\lim_{n\to\infty} \frac{f(x_n) - f(x_o)}{x_n - x_o} = f'(x_o)$$
whenever $(x_n : n \ge 1)$ is a sequence with values in $\mathbb{R} - \{x_o\}$ converging to $x_o$. The function $f$ is differentiable if it is differentiable at all points.

The right-hand derivative of $f$ at a point $x_o$, denoted by $D_+ f(x_o)$, is defined the same way as $f'(x_o)$, except the limit is taken using only $x$ such that $x > x_o$. The extra condition $x > x_o$ is indicated by using a slanting arrow in the limit notation:
$$D_+ f(x_o) = \lim_{x \searrow x_o} \frac{f(x) - f(x_o)}{x - x_o}.$$
Similarly, the left-hand derivative of $f$ at a point $x_o$ is $D_- f(x_o) = \lim_{x \nearrow x_o} \frac{f(x) - f(x_o)}{x - x_o}$.

Theoretical Exercise
1. Suppose $f$ is defined on an open interval containing $x_o$, then $f'(x_o)$ exists if and only if $D_+ f(x_o) = D_- f(x_o)$. If $f'(x_o)$ exists then $D_+ f(x_o) = D_- f(x_o) = f'(x_o)$.

We write $f'$ for the derivative of $f$. For an integer $n \ge 0$ we write $f^{(n)}$ to denote the result of differentiating $f$ $n$ times.

Theorem 11.4.1 (Mean value form of Taylor's theorem) Let $f$ be a function on an interval $(a, b)$ such that its $n$th derivative $f^{(n)}$ exists on $(a, b)$. Then for $a < x, x_0 < b$,
$$f(x) = \sum_{k=0}^{n-1} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k + \frac{f^{(n)}(y)(x - x_0)^n}{n!}$$
for some $y$ between $x$ and $x_0$.

Clearly differentiable functions are continuous. But they can still have rather odd properties, as indicated by the following example.

Example 11.4.2 Let $f(t) = t^2 \sin(1/t^2)$ for $t \ne 0$ and $f(0) = 0$. This function $f$ is a classic example of a differentiable function with a derivative function that is not continuous. To check the derivative at zero, note that $\left| \frac{f(s) - f(0)}{s} \right| \le |s| \to 0$ as $s \to 0$, so $f'(0) = 0$. The usual calculus can be used to compute $f'(t)$ for $t \ne 0$, yielding
$$f'(t) = \begin{cases} 2t \sin\left(\frac{1}{t^2}\right) - \frac{2}{t} \cos\left(\frac{1}{t^2}\right) & t \ne 0 \\ 0 & t = 0 \end{cases}$$
The derivative $f'$ is not even close to being continuous at zero. As $t$ approaches zero, the cosine term dominates, and $f'$ reaches both positive and negative values with arbitrarily large magnitude.

Even though the function $f$ of Example 11.4.2 is differentiable, it does not satisfy the fundamental theorem of calculus (stated in the next section). One way to rule out the wild behavior of Example 11.4.2 is to assume that $f$ is continuously differentiable, which means that $f$ is differentiable and its derivative function is continuous. For some applications, it is useful to work with functions more general than continuously differentiable ones, but for which the fundamental theorem of calculus still holds. A possible approach is to use the following condition.

Definition 11.4.3 A function $f$ on a bounded interval (open, closed, or mixed) with endpoints $a < b$ is continuous and piecewise continuously differentiable, if $f$ is continuous over the interval, and if there exist $n \ge 1$ and $a = t_0 < t_1 < \cdots < t_n = b$, such that, for $1 \le k \le n$: $f$ is continuously differentiable over $(t_{k-1}, t_k)$ and $f'$ has finite limits at the endpoints of $(t_{k-1}, t_k)$. More generally, if $T$ is all of $\mathbb{R}$ or a subinterval of $\mathbb{R}$, then a function $f$ on $T$ is continuous and piecewise continuously differentiable if its restriction to any bounded interval is continuous and piecewise continuously differentiable.

Example 11.4.4 Two examples of continuous, piecewise continuously differentiable functions on $\mathbb{R}$ are: $f(t) = \min\{t^2, 1\}$ and $g(t) = |\sin(t)|$.

Example 11.4.5 The function given in Example 11.4.2 is not considered to be piecewise continuously differentiable because the derivative does not have finite limits at zero.

Theoretical Exercise
1. Suppose $f$ is a continuously differentiable function on an open bounded interval $(a, b)$. Show that if $f'$ has finite limits at the endpoints, then so does $f$.
2. Suppose $f$ is a continuous function on a closed, bounded interval $[a, b]$ such that $f'$ exists and is continuous on the open subinterval $(a, b)$. Show that if the right-hand limit of the derivative at $a$, $f'(a+) = \lim_{x \searrow a} f'(x)$, exists, then the right-hand derivative at $a$, defined by
$$D_+ f(a) = \lim_{x \searrow a} \frac{f(x) - f(a)}{x - a}$$
also exists, and the two limits are equal.

Let $g$ be a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Thus for each vector $x \in \mathbb{R}^n$, $g(x)$ is an $m$ vector. The derivative matrix of $g$ at a point $x$, $\frac{\partial g}{\partial x}(x)$, is the $m \times n$ matrix with $ij$th entry $\frac{\partial g_i}{\partial x_j}(x)$. Sometimes for brevity we write $y = g(x)$ and think of $y$ as a variable depending on $x$, and we write the derivative matrix as $\frac{\partial y}{\partial x}(x)$.

Theorem 11.4.6 (Implicit function theorem) If $m = n$ and if $\frac{\partial y}{\partial x}$ is continuous in a neighborhood of $x_0$ and if $\frac{\partial y}{\partial x}(x_0)$ is nonsingular, then the inverse mapping $x = g^{-1}(y)$ is defined in a neighborhood of $y_0 = g(x_0)$ and
$$\frac{\partial x}{\partial y}(y_0) = \left( \frac{\partial y}{\partial x}(x_0) \right)^{-1}$$
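The behavior described in Example 11.4.2 can be observed numerically. The sketch below is only an illustration under stated assumptions: the helper names `t_cos_plus` and `t_cos_minus` are hypothetical, and they pick points where $\cos(1/t^2)$ equals $+1$ or $-1$, so that the closed-form derivative is approximately $-2/t$ or $+2/t$ there.

```python
import math


def f(t):
    """f(t) = t^2 sin(1/t^2) for t != 0, and f(0) = 0."""
    return 0.0 if t == 0 else t * t * math.sin(1.0 / (t * t))


def f_prime(t):
    """Derivative of f: closed form for t != 0, and 0 at t = 0."""
    if t == 0:
        return 0.0
    u = 1.0 / (t * t)
    return 2.0 * t * math.sin(u) - (2.0 / t) * math.cos(u)


def t_cos_plus(k):
    """Point with 1/t^2 = 2*pi*k, where cos(1/t^2) = 1 and f'(t) is about -2/t."""
    return 1.0 / math.sqrt(2.0 * math.pi * k)


def t_cos_minus(k):
    """Point with 1/t^2 = (2k+1)*pi, where cos(1/t^2) = -1 and f'(t) is about +2/t."""
    return 1.0 / math.sqrt((2.0 * k + 1.0) * math.pi)


if __name__ == "__main__":
    for k in (1, 100, 10000):
        # The magnitude of f' grows without bound as these points approach zero.
        print(k, f_prime(t_cos_plus(k)), f_prime(t_cos_minus(k)))
    # The difference quotient at zero is bounded by |s|, consistent with f'(0) = 0.
    print(abs(f(1e-3) / 1e-3))
```

Along both families of points $t \to 0$, yet the derivative values have opposite signs and growing magnitude, so $f'$ has no limit at zero.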
b2 ]. or just piecewise continuous. k=1 This definition is equivalent to the following condition. . Next. t1 ] × (t2 . . If for every interval (a. is Riemann integrable over any bounded interval. Given: • A partition of (a1 .1 (Fundamental theorem of calculus) Let f be a continuously differentiable function on R. APPENDIX b The norm of the partition is defined to be maxk |tk − tk−1 |. Given any > 0. (11. if f is continuous and piecewise continuously differentiable. tm . suppose g is defined over the whole real line. The Riemann integral exists and is equal to I. if for any sequence of partitions. t2 ] for 1 ≤ j ≤ n1 and 1 ≤ k ≤ n2 . . . t1 ] × (t2 . specified by m m ((tm .b1 ]×(a2 . vnm ) : m ≥ 1). . b]. b2 ]. b] and Riemann integrable over (a. where j j−1 k k−1 ni ≥ 1 and ai = ti < ti < · · · < ti i = bi for i = 1. (Note that D+ f (x) = f (x) whenever f (x) is defined.k )(t1 − t1 )(t2 − t2 ).) We will have occasion to use Riemann integrals in two dimensions.b→∞ −a lim g(x)dx provided that the indicated limit exist as a. x2 )dsdt = I. b2 ] into n1 × n2 rectangles of the form (t1 . . D+ f (x).334 CHAPTER 11. the following is true for Riemann integration: Theorem 11. The function g is said to be Reimann integrable over (a. vjk ) inside (t1 . 2 n 0 1 1 2 • A sampling point (vjk . The Riemann integral a g(x)dx is said to exist and its value is I if the following is true. b] if the b integral a g(x)dx exists and is finite. b]. b f (b) − f (a) = a f (x)dx. if the value . with corresponding sampling points ((v1 .b2 ] g(x1 .k .2} maxk | ti − ti |. g is said k k−1 to be Riemann integrable over (a1 . j j−1 k k−1 j=1 k=1 The norm of the partition is maxi∈{1. vj. . tm ) : m ≥ 1).4) More generally. Let g be a bounded function on a bounded rectangle of the form (a1 . then the Riemann integral of g over R is defined by ∞ b g(x)dx = −∞ a. b1 ] × (a2 . . t2 ]. the corresponding sequence of Riemann norm of the m sums converges to I as m → ∞. 
such that nm 1 2 th partition converges to zero as m → ∞. g is bounded over [a. b1 ] × (a2 .5. b jointly converge to +∞. As in the case of one dimension. Then for a < b. Moreover. expressed using convergence of sequences. The values +∞ or −∞ are possible. there is a δ > 0 so that | n g(vk )(tk − tk−1 ) − I| ≤ whenever the norm of the partition is less than or equal to δ. A function that is continuous.4) holds with f (x) replaced by the right-hand derivative. j j−1 k k−1 the corresponding Riemann sum for g is n1 n2 1 2 g(vj. b1 ] × (a2 . b2 ] into finitely many subrectangles of the form (t1 . INTEGRATION 335 of the Riemann sum converges to I for any sequence of partitions and sampling point pairs. . If.11.2 A sufficient condition for g to be Riemann integrable (and hence Riemann integrable with aligned sampling) over (a1 . We define a function g on [a. b2 ] of a continuous function on [a1 . b] that is Riemann integrable is also Riemann integrable with aligned sampling. b] × [a. vn1 .5. we say the corresponding Riemann sum uses aligned sampling. b] to be Riemann integrable with aligned sampling in the same way as we defined g to be Riemann integrable. t1 ] × (t2 . b2 ] is that g be the restriction to (a1 . instead. . The same approach can be used to define the Lebesgue integral ∞ g(ω)dω −∞ for Borel measurable functions g on R. then for nonnegative random variables. Proposition 11. with the norms of the partitions converging to zero. b1 ] × (a2 . The Cauchy criteria for convergence of sequences of numbers is also used. t2 ]. More generally. the sampling points are restricted to have the 1 2 1 1 2 2 form (vj .5. such that g on (t1 . t2 ]. 11. . there is a δ > 0 so that the Riemann sums for any two partitions with norm less than or equal to δ differ by most . It’s proof uses the fact that continuous functions on bounded. b] × [a. except the family of Riemann sums used are the ones using aligned sampling. . b1 ]×(a2 . vn2 . t1 ] × (t2 . ∞ −∞ g+ (ω)dω < +∞ . 
closed sets are uniformly continuous. t1 ] × [t2 . vk ). . and then for general random variables by E[X] = E[X+ ] − E[X− ].2 Lebesgue integration Lebesgue integration with respect to a probability measure is defined in the section defining the expectation of a random variable X and is written as E[X] = Ω X(ω)P (dω) The idea is to first define the expectation for simple random variables. b2 ]. j j−1 k k−1 Proposition 11. t2 ] j j−1 j j−1 j j−1 k k−1 k k−1 k k−1 of a continuous function on [t1 . from which if follows that. g is Riemann integrable over (a1 . The above definition of a Riemann sum allows the n1 × n2 sampling points to be selected arbitrarily from the n1 × n2 rectangles. . for any > 0. b1 ] × [a2 .2 is a standard result in real analysis. Since the set of sequences that must converge is more restricted for aligned sampling. v1 . t2 ] is the restriction to (t1 . b2 ] if there is a partition of (a1 . .5. b1 ] × (a2 . for n1 + n2 numbers v1 . Such an integral is well defined if either ∞ or −∞ g− (ω)dω < +∞. a function g on [a. t1 ] × (t2 . b1 ]×(a2 .5. only the LS interpretation is used. APPENDIX 11. then ∞ ∞ (Lebesgue-Stieltjes) −∞ g(x)dF (x) = −∞ g(x)f (x)dx (Lebesgue) for any Borel measurable function g. Given a Borel measurable function g on R. An ∞ alternative definition of −∞ g(x)dF (x). except that the Riemann sums are changed to n g(vk )(F (tk ) − F (tk−1 )) k=1 Extension of the integral over the whole real line is done as it is for Riemann integration. If g is continuous and the LS integral is finite. If F has a corresponding pdf f . The Riemann-Stieltjes integral b g(x)dF (x) a (Riemann-Stieltjes) is defined the same way as the Riemann integral. the Lebesgue-Stieltjes integral of g ˜ with respect to F is defined to be the Lebesgue integral of g with respect to P : ∞ ∞ (Lebesgue-Stieltjes) −∞ ∞ g(x)dF (x) = −∞ ˜ g(x)P (dx) (Lebesgue) The same notation −∞ g(x)dF (x) is used for both Riemann-Stieltjes (RS) and Lebesgue-Stieltjes (LS) integration. 
Hence.336 CHAPTER 11. b] and let F be a nondecreasing function on [a. is given next. in these notes. then the integrals agree. .3 Riemann-Stieltjes integration Let g be a bounded function on a closed interval [a.4 Lebesgue-Stieltjes integration ˜ Let F be a CDF. and RS integration is not needed.3. for equivalence of the integrals ∞ g(X(ω))P (dω) and Ω −∞ g(x)dF (x). it is essential that the integral on the right be understood as an LS integral. In ∞ particular. However. −∞ xdF (x) is identical as either an LS or RS integral. As seen in Section 1. b]. even for continuous functions g. there is a corresponding probability measure P on the Borel subsets of R.5. preferred in the context of these notes. 11.5. The result is that E[Xn ] for large n can still be much larger than E[X∞ ].6. Since is arbitrary. . X2 . and let b1 . but the variables are assumed to be bounded only on one side: specifically. very large values of |Xn − X∞ | is to require the sequence (Xn ) to be bounded. and such that Xn → X as n → ∞.1. and on the complement of this event. Thus. The theorems in this section address the question of whether E[Xn ] → E[X∞ ]. Then E[Xn ] → E[X]. The following result is a good start–it is generalized to yield the dominated convergence theorem further below. so that P {| X |≥ L + } = 0. Thus. The restriction to nonnegative Xn ’s would rule out using negative constants bn with arbitrarily large magnitudes in Example 11.5) so that |E[X] − E[Xn ]| = |E[X − Xn ]| ≤ E[|X − Xn |] ≤ + 2LP |X − Xn | ≥ }. ON THE CONVERGENCE OF THE MEAN 337 11.6. Suppose (Xn : n ≥ 1) is a sequence of random variables such that Xn → X∞ . b2 .” which is defined in Appendix 11. hypothesis Xn → X∞ means that for any > 0 and δ > 0. the difference |X − Xn | is still bounded so that its contribution is small for n large enough. used to establish the dominated convergence theorem. A2 . But the mean of Xn can differ greatly from the mean of X if. P {| X |≥ L + } ≤ P {| X − Xn |≥ } → 0. 
Example 11.6.1 Suppose $U$ is a random variable with a finite mean, and suppose $A_1, A_2, \ldots$ is a sequence of events, each with positive probability, but such that $P[A_n] \to 0$, and let $b_1, b_2, \ldots$ be a sequence of nonzero numbers. Let $X_n = U + b_n I_{A_n}$ for $n \ge 1$. Then for any $\epsilon > 0$, $P\{|X_n - U| \ge \epsilon\} \le P\{X_n \ne U\} = P[A_n] \to 0$ as $n \to \infty$, so $X_n \to U$ p. However, $E[X_n] = E[U] + b_n P[A_n]$. Thus, if the $b_n$ have very large magnitude, the mean $E[X_n]$ can be far larger or far smaller than $E[U]$, for all large $n$.

The simplest way to rule out the very, very large values of $|X_n - X_\infty|$ is to require the sequence $(X_n)$ to be bounded. That would rule out using constants $b_n$ with arbitrarily large magnitudes in Example 11.6.1. The following result is a good start; it is generalized to yield the dominated convergence theorem further below.

Theorem 11.6.2 (Bounded convergence theorem) Let $X_1, X_2, \ldots$ be a sequence of random variables such that for some finite $L$, $P\{|X_n| \le L\} = 1$ for all $n \ge 1$, and such that $X_n \to X$ p. as $n \to \infty$. Then $E[X_n] \to E[X]$.

Proof. For any $\epsilon > 0$, $P\{|X| \ge L + \epsilon\} \le P\{|X - X_n| \ge \epsilon\} \to 0$, so that $P\{|X| \ge L + \epsilon\} = 0$. Since $\epsilon$ is arbitrary, $P\{|X| \le L\} = 1$. Therefore, $P\{|X - X_n| \le 2L\} = 1$ for all $n \ge 1$. Again let $\epsilon > 0$. Then
\[ |X - X_n| \le \epsilon + 2L\, I_{\{|X - X_n| \ge \epsilon\}}, \tag{11.5} \]
so that
\[ |E[X] - E[X_n]| = |E[X - X_n]| \le E[|X - X_n|] \le \epsilon + 2L\, P\{|X - X_n| \ge \epsilon\}. \]
By the hypotheses, $P\{|X - X_n| \ge \epsilon\} \to 0$ as $n \to \infty$. Thus, for $n$ large enough, $|E[X] - E[X_n]| < 2\epsilon$. Since $\epsilon$ was arbitrary, $E[X_n] \to E[X]$.

Equation (11.5) is central to the proof just given. It bounds the difference $|X - X_n|$ by $\epsilon$ on the event $\{|X - X_n| < \epsilon\}$, which has probability close to one for $n$ large, and on the complement of this event, the difference $|X - X_n|$ is still bounded so that its contribution is small for $n$ large enough.
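The effect described in Example 11.6.1 can be made concrete with exact arithmetic. The sketch below is only an illustration under assumed values that do not come from the notes: $E[U] = 1/2$, $P[A_n] = 1/n$ and $b_n = n^2$. The function name `example_11_6_1` is likewise hypothetical.

```python
from fractions import Fraction


def example_11_6_1(n, mean_u=Fraction(1, 2)):
    """Return (P[A_n], E[X_n]) for X_n = U + b_n * I_{A_n}.

    Assumed for illustration only: E[U] = 1/2, P[A_n] = 1/n, b_n = n**2.
    """
    p_an = Fraction(1, n)        # bounds P{|X_n - U| >= eps}; tends to 0
    b_n = n * n                  # unbounded constants
    mean_xn = mean_u + b_n * p_an  # E[X_n] = E[U] + b_n P[A_n]
    return p_an, mean_xn


for n in (1, 10, 100, 1000):
    p_an, mean_xn = example_11_6_1(n)
    print(n, p_an, mean_xn)
```

With these choices the probability that $X_n$ differs from $U$ shrinks like $1/n$ while $E[X_n] = 1/2 + n$ grows without bound, so convergence in probability alone does not control the mean. Capping $|b_n|$ by a constant would bring the sequence within reach of the bounded convergence theorem whenever $U$ is bounded.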
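The following lemma, used to establish the dominated convergence theorem, is similar to the bounded convergence theorem, but the variables are assumed to be bounded only on one side: specifically, the random variables are restricted to be greater than or equal to zero. The result is that $E[X_n]$ for large $n$ can still be much larger than $E[X_\infty]$, but cannot be much smaller. The restriction to nonnegative $X_n$'s would rule out using negative constants $b_n$ with arbitrarily large magnitudes in Example 11.6.1. The statement of the lemma uses "liminf," which is defined in Appendix 11.2.

Lemma 11.6.3 (Fatou's lemma) Suppose $(X_n)$ is a sequence of nonnegative random variables such that $X_n \to X_\infty$ p. Then $\liminf_{n \to \infty} E[X_n] \ge E[X_\infty]$. (Equivalently, for any $\gamma < E[X_\infty]$, $E[X_n] \ge \gamma$ for all sufficiently large $n$.)

Proof. We shall prove the equivalent form of the conclusion given in the lemma, so let $\gamma$ be any constant with $\gamma < E[X_\infty]$. By the definition of $E[X_\infty]$, there is a simple random variable $Z$ with $Z \le X_\infty$ such that $E[Z] > \gamma$. Since $Z = X_\infty \wedge Z$,
\[ |X_n \wedge Z - Z| = |X_n \wedge Z - X_\infty \wedge Z| \le |X_n - X_\infty| \to 0 \ \text{p.}, \]
so that $X_n \wedge Z \to Z$ p. Therefore, by the bounded convergence theorem, $\lim_{n \to \infty} E[X_n \wedge Z] = E[Z] > \gamma$. Since $E[X_n] \ge E[X_n \wedge Z]$, it follows that $E[X_n] \ge \gamma$ for all sufficiently large $n$.

Theorem 11.6.4 (Dominated convergence theorem) If $X_1, X_2, \ldots$ is a sequence of random variables and $X_\infty$ and $Y$ are random variables such that the following three conditions hold:
(i) $X_n \to X_\infty$ p. as $n \to \infty$
(ii) $P\{|X_n| \le Y\} = 1$ for all $n$
(iii) $E[Y] < +\infty$
then $E[X_n] \to E[X_\infty]$.

Proof. The hypotheses imply that $(X_n + Y : n \ge 1)$ is a sequence of nonnegative random variables which converges in probability to $X_\infty + Y$. So Fatou's lemma implies that $\liminf_{n \to \infty} E[X_n + Y] \ge E[X_\infty + Y]$, or equivalently, subtracting $E[Y]$ from both sides, $\liminf_{n \to \infty} E[X_n] \ge E[X_\infty]$. Similarly, since $(-X_n + Y : n \ge 1)$ is a sequence of nonnegative random variables which converges in probability to $-X_\infty + Y$, Fatou's lemma implies that $\liminf_{n \to \infty} E[-X_n + Y] \ge E[-X_\infty + Y]$, or equivalently, $\limsup_{n \to \infty} E[X_n] \le E[X_\infty]$. Summarizing,
\[ \limsup_{n \to \infty} E[X_n] \le E[X_\infty] \le \liminf_{n \to \infty} E[X_n]. \]
In general, the liminf of a sequence is less than or equal to the limsup, and if the liminf is equal to the limsup, then the limit exists and is equal to both the liminf and limsup. Thus, $E[X_n] \to E[X_\infty]$.

Corollary 11.6.5 (A consequence of integrability) If $Z$ has a finite mean, then given any $\epsilon > 0$, there exists a $\delta > 0$ so that if $P[A] < \delta$, then $|E[Z I_A]| \le \epsilon$.

Proof. If not, there would exist an $\epsilon > 0$ and a sequence of events $A_n$ with $P\{A_n\} \to 0$ and $|E[Z I_{A_n}]| > \epsilon$ for all $n$. But $Z I_{A_n} \to 0$ p., and $Z I_{A_n}$ is dominated by the integrable random variable $|Z|$ for all $n$, so the dominated convergence theorem implies that $E[Z I_{A_n}] \to 0$, which would result in a contradiction.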
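Theoretical Exercise 1. Work backwards, and deduce the dominated convergence theorem from Corollary 11.6.5.

The following theorem is based on a different way to control the difference between $E[X_n]$ for large $n$ and $E[X_\infty]$. Rather than a domination condition, it is assumed that the sequence is monotone in $n$.

Theorem 11.6.6 (Monotone convergence theorem) Let $X_1, X_2, \ldots$ be a sequence of random variables such that $E[X_1] > -\infty$ and such that $X_1(\omega) \le X_2(\omega) \le \cdots$. Then the limit $X_\infty$ given by $X_\infty(\omega) = \lim_{n \to \infty} X_n(\omega)$ for all $\omega$ is an extended random variable (with possible value $\infty$) and $E[X_n] \to E[X_\infty]$ as $n \to \infty$.

Proof. By adding $\max\{0, -X_1\}$ to all the random variables involved if necessary, we can assume without loss of generality that $X_1, X_2, \ldots$, and therefore also $X_\infty$, are nonnegative. Recall that $E[X_\infty]$ is equal to the supremum of the expectation of simple random variables that are less than or equal to $X_\infty$. So let $\gamma$ be any number such that $\gamma < E[X_\infty]$. Then, there is a simple random variable $\tilde{X}$ less than or equal to $X_\infty$ with $E[\tilde{X}] > \gamma$. The simple random variable $\tilde{X}$ takes only finitely many possible values. Let $L$ be the largest. Then $\tilde{X} \le X_\infty \wedge L$, so that $E[X_\infty \wedge L] > \gamma$. By the bounded convergence theorem, $E[X_n \wedge L] \to E[X_\infty \wedge L]$. Therefore, $E[X_n \wedge L] > \gamma$ for all large enough $n$. Since $E[X_n \wedge L] \le E[X_n] \le E[X_\infty]$, it follows that $\gamma < E[X_n] \le E[X_\infty]$ for all large enough $n$. Since $\gamma$ is an arbitrary constant with $\gamma < E[X_\infty]$, the desired conclusion, $E[X_n] \to E[X_\infty]$, follows.

11.7 Matrices

An $m \times n$ matrix over the reals $\mathbb{R}$ has the form
\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]
where $a_{ij} \in \mathbb{R}$ for all $i, j$. This matrix has $m$ rows and $n$ columns. A matrix over the complex numbers $\mathbb{C}$ has the same form, with $a_{ij} \in \mathbb{C}$ for all $i, j$. The transpose of an $m \times n$ matrix $A = (a_{ij})$ is the $n \times m$ matrix $A^T = (a_{ji})$. For example,
\[ \begin{pmatrix} 1 & 0 & 3 \\ 2 & 1 & 1 \end{pmatrix}^T = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 3 & 1 \end{pmatrix}. \]
The matrix $A$ is symmetric if $A = A^T$. Symmetry requires that the matrix $A$ be square: $m = n$. The diagonal of a matrix consists of the entries of the form $a_{ii}$. A square matrix $A$ is called diagonal if the entries off of the diagonal are zero. The $n \times n$ identity matrix is the $n \times n$ diagonal matrix with ones on the diagonal. We write $I$ to denote an identity matrix of some dimension $n$.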
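If $A$ is an $m \times k$ matrix and $B$ is a $k \times n$ matrix, then the product $AB$ is the $m \times n$ matrix with $ij$th element $\sum_{l=1}^{k} a_{il} b_{lj}$. A vector $x$ is an $m \times 1$ matrix, where $m$ is the dimension of the vector. Thus, vectors are written in column form:
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}. \]
The set of all dimension $m$ vectors over $\mathbb{R}$ is the $m$ dimensional Euclidean space $\mathbb{R}^m$. The inner product of two vectors $x$ and $y$ of the same dimension $m$ is the number $x^T y$, equal to $\sum_{i=1}^{m} x_i y_i$. The vectors $x$ and $y$ are orthogonal if $x^T y = 0$. The Euclidean length or norm of a vector $x$ is given by $\|x\| = (x^T x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to each other and $\|\varphi_i\| = 1$ for all $i$.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{R}^m$ is said to span $\mathbb{R}^m$ if any vector in $\mathbb{R}^m$ can be expressed as a linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. An orthonormal set of vectors $\varphi_1, \ldots, \varphi_n$ in $\mathbb{R}^m$ spans $\mathbb{R}^m$ if and only if $n = m$. An orthonormal basis for $\mathbb{R}^m$ is an orthonormal set of $m$ vectors in $\mathbb{R}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for $\mathbb{R}^m$. Given a vector $v$ in $\mathbb{R}^m$, the coordinates of $v$ relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^T v$. The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix $U$ is called orthonormal if any of the following three equivalent conditions is satisfied:
1. $U^T U = I$
2. $U U^T = I$
3. the columns of $U$ form an orthonormal basis.
Given an $m \times m$ orthonormal matrix $U$ and a vector $v \in \mathbb{R}^m$, the coordinates of $v$ relative to $U$ are given by the vector $U^T v$. Given a square matrix $A$, a vector $\varphi$ is an eigenvector of $A$ and $\lambda$ is an eigenvalue of $A$ if the eigen relation $A\varphi = \lambda\varphi$ is satisfied.

A permutation $\pi$ of the numbers $1, \ldots, m$ is a one-to-one mapping of $\{1, 2, \ldots, m\}$ onto itself. That is, $(\pi(1), \ldots, \pi(m))$ is a reordering of $(1, 2, \ldots, m)$. Any permutation is either even or odd. A permutation is even if it can be obtained by an even number of transpositions of two elements. Otherwise a permutation is odd. We write
\[ (-1)^\pi = \begin{cases} 1 & \text{if } \pi \text{ is even} \\ -1 & \text{if } \pi \text{ is odd.} \end{cases} \]
The determinant of a square matrix $A$, written $\det(A)$, is defined by
\[ \det(A) = \sum_{\pi} (-1)^\pi \prod_{i=1}^{m} a_{i\pi(i)}. \]
The absolute value of the determinant of a matrix $A$ is denoted by $|A|$. Thus $|A| = |\det(A)|$.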
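The permutation definition of the determinant can be evaluated directly for small matrices. The following is a minimal sketch, not an efficient algorithm (it sums over all $m!$ permutations). The helper names `sign` and `det` are illustrative, and the parity of a permutation is computed from its inversion count, which has the same parity as the number of transpositions.

```python
from itertools import permutations


def sign(perm):
    """Return (-1)^pi: +1 for an even permutation, -1 for an odd one."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def det(a):
    """Determinant of a square matrix (list of rows) via the permutation sum."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        prod = 1
        for i in range(m):
            prod *= a[i][perm[i]]  # the factor a_{i, pi(i)}
        total += sign(perm) * prod
    return total


print(det([[1, 2], [3, 4]]))
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))
```

For matrices of more than a handful of rows, elimination-based methods are the practical choice; the sum over permutations is shown here only because it mirrors the definition term by term.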
Some important properties of determinants are the following. Let $A$ and $B$ be $m \times m$ matrices.
1. If $B$ is obtained from $A$ by multiplication of a row or column of $A$ by a scalar constant $c$, then $\det(B) = c \det(A)$.
2. If $\mathcal{U}$ is a subset of $\mathbb{R}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined by $A$: $\mathcal{V} = \{Ax : x \in \mathcal{U}\}$, then (the volume of $\mathcal{V}$) $= |A| \times$ (the volume of $\mathcal{U}$).
3. $\det(AB) = \det(A) \det(B)$
4. $\det(A) = \det(A^T)$
5. $|U| = 1$ if $U$ is orthonormal.
6. The columns of $A$ span $\mathbb{R}^m$ if and only if $\det(A) \ne 0$.
7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial $p$ of degree $m$ called the characteristic polynomial of $A$.
8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of $A$, repeated according to multiplicity, are the eigenvalues of $A$, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex valued with nonzero imaginary parts.

If $K$ is a symmetric $m \times m$ matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not necessarily distinct) and there exists an orthonormal basis consisting of the corresponding eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let $U$ be the orthonormal matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be the diagonal matrix with diagonal entries given by the eigenvalues:
\[ \Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}. \]
Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore $K = U \Lambda U^T$ and $\Lambda = U^T K U$. A symmetric $m \times m$ matrix $A$ is positive semidefinite if $\alpha^T A \alpha \ge 0$ for all $m$-dimensional vectors $\alpha$. A symmetric matrix is positive semidefinite if and only if its eigenvalues are nonnegative.

The remainder of this section deals with matrices over $\mathbb{C}$. The Hermitian transpose of a matrix $A$ is the matrix $A^*$, obtained from $A^T$ by taking the complex conjugate of each element of $A^T$. For example,
\[ \begin{pmatrix} 1 & 0 & 3 + 2j \\ 2 & j & 1 \end{pmatrix}^* = \begin{pmatrix} 1 & 2 \\ 0 & -j \\ 3 - 2j & 1 \end{pmatrix}. \]
The set of all dimension $m$ vectors over $\mathbb{C}$ is the $m$-complex dimensional space $\mathbb{C}^m$. The inner product of two vectors $x$ and $y$ of the same dimension $m$ is the complex number $y^* x$, equal to $\sum_{i=1}^{m} x_i y_i^*$. The vectors $x$ and $y$ are orthogonal if $x^* y = 0$. The length or norm of a vector $x$ is given by $\|x\| = (x^* x)^{1/2}$. A set of vectors $\varphi_1, \ldots, \varphi_n$ is orthonormal if the vectors are orthogonal to each other and $\|\varphi_i\| = 1$ for all $i$.

A set of vectors $v_1, \ldots, v_n$ in $\mathbb{C}^m$ is said to span $\mathbb{C}^m$ if any vector in $\mathbb{C}^m$ can be expressed as a linear combination $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{C}$. An orthonormal set of vectors $\varphi_1, \ldots, \varphi_n$ in $\mathbb{C}^m$ spans $\mathbb{C}^m$ if and only if $n = m$. An orthonormal basis for $\mathbb{C}^m$ is an orthonormal set of $m$ vectors in $\mathbb{C}^m$. An orthonormal basis $\varphi_1, \ldots, \varphi_m$ corresponds to a coordinate system for $\mathbb{C}^m$. Given a vector $v$ in $\mathbb{C}^m$, the coordinates of $v$ relative to $\varphi_1, \ldots, \varphi_m$ are given by $\alpha_i = \varphi_i^* v$. The coordinates $\alpha_1, \ldots, \alpha_m$ are the unique numbers such that $v = \alpha_1 \varphi_1 + \cdots + \alpha_m \varphi_m$.

A square matrix $U$ over $\mathbb{C}$ is called unitary (rather than orthonormal) if any of the following three equivalent conditions is satisfied:
1. $U^* U = I$
2. $U U^* = I$
3. the columns of $U$ form an orthonormal basis.
Given an $m \times m$ unitary matrix $U$ and a vector $v \in \mathbb{C}^m$, the coordinates of $v$ relative to $U$ are given by the vector $U^* v$. Eigenvectors, eigenvalues, and determinants of square matrices over $\mathbb{C}$ are defined just as they are for matrices over $\mathbb{R}$. The absolute value of the determinant of a matrix $A$ is denoted by $|A|$. Thus $|A| = |\det(A)|$.

Some important properties of determinants of matrices over $\mathbb{C}$ are the following. Let $A$ and $B$ be $m \times m$ matrices.
1. If $B$ is obtained from $A$ by multiplication of a row or column of $A$ by a constant $c \in \mathbb{C}$, then $\det(B) = c \det(A)$.
2. If $\mathcal{U}$ is a subset of $\mathbb{C}^m$ and $\mathcal{V}$ is the image of $\mathcal{U}$ under the linear transformation determined by $A$: $\mathcal{V} = \{Ax : x \in \mathcal{U}\}$, then (the volume of $\mathcal{V}$) $= |A|^2 \times$ (the volume of $\mathcal{U}$).
3. $\det(AB) = \det(A) \det(B)$
4. $\det^*(A) = \det(A^*)$
5. $|U| = 1$ if $U$ is unitary.
6. The columns of $A$ span $\mathbb{C}^m$ if and only if $\det(A) \ne 0$.
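7. The equation $p(\lambda) = \det(\lambda I - A)$ defines a polynomial $p$ of degree $m$ called the characteristic polynomial of $A$.
8. The zeros $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the characteristic polynomial of $A$, repeated according to multiplicity, are the eigenvalues of $A$, and $\det(A) = \prod_{i=1}^{m} \lambda_i$. The eigenvalues can be complex valued with nonzero imaginary parts.

A matrix $K$ is called Hermitian symmetric if $K = K^*$. If $K$ is a Hermitian symmetric $m \times m$ matrix, then the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real-valued (not necessarily distinct) and there exists an orthonormal basis consisting of the corresponding eigenvectors $\varphi_1, \varphi_2, \ldots, \varphi_m$. Let $U$ be the unitary matrix with columns $\varphi_1, \ldots, \varphi_m$ and let $\Lambda$ be the diagonal matrix with diagonal entries given by the eigenvalues:
\[ \Lambda = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_m \end{pmatrix}. \]
Then the relations among the eigenvalues and eigenvectors may be written as $KU = U\Lambda$. Therefore $K = U \Lambda U^*$ and $\Lambda = U^* K U$. A Hermitian symmetric $m \times m$ matrix $A$ is positive semidefinite if $\alpha^* A \alpha \ge 0$ for all $\alpha \in \mathbb{C}^m$. A Hermitian symmetric matrix is positive semidefinite if and only if its eigenvalues are nonnegative.

Many questions about matrices over $\mathbb{C}$ can be addressed using matrices over $\mathbb{R}$. If $Z$ is an $m \times m$ matrix over $\mathbb{C}$, then $Z$ can be expressed as $Z = A + Bj$, for some $m \times m$ matrices $A$ and $B$ over $\mathbb{R}$. Similarly, if $x$ is a vector in $\mathbb{C}^m$ then it can be written as $x = u + jv$ for vectors $u, v \in \mathbb{R}^m$. Then $Zx = (Au - Bv) + j(Bu + Av)$. There is a one-to-one and onto mapping from $\mathbb{C}^m$ to $\mathbb{R}^{2m}$ defined by $u + jv \mapsto \begin{pmatrix} u \\ v \end{pmatrix}$. Multiplication of $x$ by the matrix $Z$ is thus equivalent to multiplication of $\begin{pmatrix} u \\ v \end{pmatrix}$ by
\[ \tilde{Z} = \begin{pmatrix} A & -B \\ B & A \end{pmatrix}. \]
We will show that
\[ |Z|^2 = \det(\tilde{Z}) \tag{11.6} \]
so that Property 2 for determinants of matrices over $\mathbb{C}$ follows from Property 2 for determinants of matrices over $\mathbb{R}$.

It remains to prove (11.6). Suppose that $A^{-1}$ exists and examine the two $2m \times 2m$ matrices
\[ \begin{pmatrix} A & -B \\ B & A \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} A & 0 \\ B & A + BA^{-1}B \end{pmatrix}. \tag{11.7} \]
The second matrix is obtained from the first by right multiplying each sub-block in the left column of the first matrix by $A^{-1}B$, and adding the result to the right column. Equivalently, the second matrix is obtained by right multiplying the first matrix by $\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}$. But $\det \begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix} = 1$, so that the two matrices in (11.7) have the same determinant.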
Equating the determinants of the two matrices in (11.7) yields $\det(\tilde{Z}) = \det(A) \det(A + BA^{-1}B)$. Similarly, the following four matrices have the same determinant:
\[ \begin{pmatrix} A + Bj & 0 \\ 0 & A - Bj \end{pmatrix} \quad \begin{pmatrix} A + Bj & 0 \\ A - Bj & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & A - Bj \\ A - Bj & A - Bj \end{pmatrix} \quad \begin{pmatrix} 2A & A - Bj \\ 0 & \frac{A + BA^{-1}B}{2} \end{pmatrix} \tag{11.8} \]
Equating the determinants of the first and last of the matrices in (11.8) yields that $|Z|^2 = \det(Z) \det^*(Z) = \det(A + Bj) \det(A - Bj) = \det(A) \det(A + BA^{-1}B)$. Combining these observations yields that (11.6) holds if $A^{-1}$ exists. Since each side of (11.6) is a continuous function of $A$, (11.6) holds in general.

Chapter 12

Solutions to Problems

1.2 Independent vs. mutually exclusive
(a) If $E$ is an event independent of itself, then $P[E] = P[E \cap E] = P[E]P[E]$. This can happen if $P[E] = 0$. If $P[E] \ne 0$ then cancelling a factor of $P[E]$ on each side yields $P[E] = 1$. In summary, either $P[E] = 0$ or $P[E] = 1$.
(b) In general, we have $P[A \cup B] = P[A] + P[B] - P[AB]$. If the events $A$ and $B$ are independent, then $P[A \cup B] = P[A] + P[B] - P[A]P[B] = 0.3 + 0.4 - (0.3)(0.4) = 0.58$. On the other hand, if the events $A$ and $B$ are mutually exclusive, then $P[AB] = 0$ and therefore $P[A \cup B] = 0.3 + 0.4 = 0.7$.
(c) If $P[A] = 0.6$ and $P[B] = 0.8$, then the two events could be independent. However, if $A$ and $B$ were mutually exclusive, then $P[A] + P[B] = P[A \cup B] \le 1$, so it would not be possible for $A$ and $B$ to be mutually exclusive if $P[A] = 0.6$ and $P[B] = 0.8$.

1.4 Frantic search
Let $D$, $T$, $B$, and $O$ denote the events that the glasses are in the drawer, on the table, in the briefcase, or in the office, respectively. These four events partition the probability space.
(a) Let $E$ denote the event that the glasses were not found in the first drawer search.
\[ P[T|E] = \frac{P[TE]}{P[E]} = \frac{P[E|T]P[T]}{P[E|D]P[D] + P[E|D^c]P[D^c]} = \frac{(1)(0.06)}{(0.1)(0.9) + (1)(0.1)} = \frac{0.06}{0.19} \approx 0.316 \]
(b) Let $F$ denote the event that the glasses were not found after the first drawer search and first table search.
\[ P[B|F] = \frac{P[BF]}{P[F]} = \frac{P[F|B]P[B]}{P[F|D]P[D] + P[F|T]P[T] + P[F|B]P[B] + P[F|O]P[O]} = \frac{(1)(0.03)}{(0.1)(0.9) + (0.1)(0.06) + (1)(0.03) + (1)(0.01)} \approx 0.22 \]
(c) Let $G$ denote the event that the glasses were not found after the two drawer searches, two table searches, and one briefcase search.
\[ P[O|G] = \frac{P[OG]}{P[G]} = \frac{P[G|O]P[O]}{P[G|D]P[D] + P[G|T]P[T] + P[G|B]P[B] + P[G|O]P[O]} = \frac{(1)(0.01)}{(0.1)^2(0.9) + (0.1)^2(0.06) + (0.1)(0.03) + (1)(0.01)} \approx 0.4425 \]
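The three Bayes computations in Solution 1.4 share one pattern: weight each location's prior by the probability that every search of that location failed, then normalize. The sketch below encodes that pattern. The function name `posterior` and the dictionary layout are illustrative choices, while the priors (0.9, 0.06, 0.03, 0.01) and the per-search miss probability 0.1 are the numbers used in the solution. The office is never searched, so its miss probability is set to 1 as a placeholder.

```python
def posterior(prior, miss_prob, searches):
    """Posterior over locations given that all searches so far have failed.

    prior: location -> prior probability
    miss_prob: location -> P(a single search there fails | glasses are there)
    searches: location -> number of failed searches of that location
    """
    weights = {
        loc: prior[loc] * miss_prob[loc] ** searches.get(loc, 0)
        for loc in prior
    }
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


prior = {"D": 0.9, "T": 0.06, "B": 0.03, "O": 0.01}
miss = {"D": 0.1, "T": 0.1, "B": 0.1, "O": 1.0}

print(posterior(prior, miss, {"D": 1})["T"])                   # part (a)
print(posterior(prior, miss, {"D": 1, "T": 1})["B"])           # part (b)
print(posterior(prior, miss, {"D": 2, "T": 2, "B": 1})["O"])   # part (c)
```

Under these inputs the three posteriors evaluate to $0.06/0.19$, $0.03/0.136$ and $0.01/0.0226$, that is, roughly 0.316, 0.221 and 0.4425.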
1.6 Conditional probabilities–basic computations of iterative decoding
(a) Here is one of several approaches to this problem. Note that the $n$ pairs $(B_1, Y_1), \ldots, (B_n, Y_n)$ are mutually independent, and
\[ \lambda_i(b_i) \stackrel{\text{def}}{=} P[B_i = b_i | Y_i = y_i] = \frac{q_i(y_i|b_i)}{q_i(y_i|0) + q_i(y_i|1)}. \]
Therefore
\[ P[B = 1 | Y_1 = y_1, \ldots, Y_n = y_n] = \sum_{b_1, \ldots, b_n :\, b_1 \oplus \cdots \oplus b_n = 1} P[B_1 = b_1, \ldots, B_n = b_n | Y_1 = y_1, \ldots, Y_n = y_n] = \sum_{b_1, \ldots, b_n :\, b_1 \oplus \cdots \oplus b_n = 1} \ \prod_{i=1}^{n} \lambda_i(b_i). \]
(b) Using the definitions,
\[ P[B = 1 | Z_1 = z_1, \ldots, Z_k = z_k] = \frac{p(1, z_1, \ldots, z_k)}{p(0, z_1, \ldots, z_k) + p(1, z_1, \ldots, z_k)} = \frac{\frac{1}{2} \prod_{j=1}^{k} r_j(1|z_j)}{\frac{1}{2} \prod_{j=1}^{k} r_j(0|z_j) + \frac{1}{2} \prod_{j=1}^{k} r_j(1|z_j)} = \frac{\eta}{1 + \eta}, \quad \text{where } \eta = \prod_{j=1}^{k} \frac{r_j(1|z_j)}{r_j(0|z_j)}. \]

1.8 Blue corners
(a) There are 24 ways to color 5 corners so that at least one face has four blue corners (there are 6 choices of the face, and for each face there are four choices for which additional corner to color blue). Since there are $\binom{8}{5} = 56$ ways to select 5 out of 8 corners, $P[B \,|\, \text{exactly 5 corners colored blue}] = 24/56 = 3/7$.
(b) By counting the number of ways that $B$ can happen for different numbers of blue corners we find $P[B] = 6p^4(1 - p)^4 + 24p^5(1 - p)^3 + 24p^6(1 - p)^2 + 8p^7(1 - p) + p^8$.
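The counts 6, 24, 24, 8, 1 that appear as coefficients in Solution 1.8(b) can be confirmed by brute-force enumeration over the 8 corners of a cube. This is a small sketch; representing corners as 0/1 coordinate triples and the helper name `count_with_blue_face` are choices made here for illustration.

```python
from itertools import combinations, product

# Corners of the unit cube and its six faces (a face fixes one coordinate).
corners = list(product((0, 1), repeat=3))
faces = [
    frozenset(c for c in corners if c[axis] == side)
    for axis in range(3)
    for side in (0, 1)
]


def count_with_blue_face(k):
    """Number of ways to color exactly k corners blue so some face is all blue."""
    return sum(
        1
        for blue in combinations(corners, k)
        if any(face <= set(blue) for face in faces)
    )


for k in range(4, 9):
    print(k, count_with_blue_face(k))
```

The value for $k = 5$ divided by $\binom{8}{5} = 56$ gives the conditional probability $3/7$ from part (a).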
1.10 Recognizing cumulative distribution functions
(a) Valid (draw a sketch). $P[X^2 > 5] = P[X < -\sqrt{5}] + P[X > \sqrt{5}] = F_1(-\sqrt{5}) + 1 - F_1(\sqrt{5}) = \frac{e^{-5}}{2}$.
(b) Invalid. $F(0) > 1$. Another reason is that $F$ is not nondecreasing.
(c) Invalid, not right continuous at 0.

1.12 CDF and characteristic function of a mixed type random variable
(a) The range of $X$ is $[0, 0.5]$. For $0 \le c \le 0.5$, $P[X \le c] = P[U \le c + 0.5] = c + 0.5$. Thus,
\[ F_X(c) = \begin{cases} 0 & c < 0 \\ c + 0.5 & 0 \le c \le 0.5 \\ 1 & c \ge 0.5 \end{cases} \]
(b) $\Phi_X(u) = 0.5 + \int_0^{0.5} e^{jux}\,dx = 0.5 + \dfrac{e^{ju/2} - 1}{ju}$

1.14 Conditional expectation for uniform density over a triangular region
(a) The triangle has base and height one, so the area of the triangle is 0.5. Thus the joint pdf is 2 inside the triangle.
(b)
\[ f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy = \begin{cases} \int_0^{x/2} 2\,dy = x & \text{if } 0 < x < 1 \\ \int_{x-1}^{x/2} 2\,dy = 2 - x & \text{if } 1 < x < 2 \\ 0 & \text{else} \end{cases} \]
(c) In view of part (b), the conditional density $f_{Y|X}(y|x)$ is not well defined unless $0 < x < 2$. In general we have
\[ f_{Y|X}(y|x) = \begin{cases} \frac{2}{x} & \text{if } 0 < x \le 1 \text{ and } y \in [0, \frac{x}{2}] \\ 0 & \text{if } 0 < x \le 1 \text{ and } y \notin [0, \frac{x}{2}] \\ \frac{2}{2 - x} & \text{if } 1 < x < 2 \text{ and } y \in [x - 1, \frac{x}{2}] \\ 0 & \text{if } 1 < x < 2 \text{ and } y \notin [x - 1, \frac{x}{2}] \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases} \]
Thus, for $0 < x \le 1$, the conditional distribution of $Y$ is uniform over the interval $[0, \frac{x}{2}]$. For $1 < x < 2$, the conditional distribution of $Y$ is uniform over the interval $[x - 1, \frac{x}{2}]$.
(d) Finding the midpoints of the intervals that $Y$ is conditionally uniformly distributed over, or integrating $y$ against the conditional density found in part (c), yields:
\[ E[Y | X = x] = \begin{cases} \frac{x}{4} & \text{if } 0 < x \le 1 \\ \frac{3x - 2}{4} & \text{if } 1 < x < 2 \\ \text{not defined} & \text{if } x \le 0 \text{ or } x \ge 2 \end{cases} \]
1.16 Density of a function of a random variable
(a) $P[X \ge 0.4 | X \le 0.8] = P[0.4 \le X \le 0.8 | X \le 0.8] = (0.8^2 - 0.4^2)/0.8^2 = \frac{3}{4}$.
(b) The range of $Y$ is the interval $[0, +\infty)$. For $c \ge 0$,
\[ P\{-\ln(X) \le c\} = P\{\ln(X) \ge -c\} = P\{X \ge e^{-c}\} = \int_{e^{-c}}^{1} 2x\,dx = 1 - e^{-2c} \]
so
\[ f_Y(c) = \begin{cases} 2\exp(-2c) & c \ge 0 \\ 0 & \text{else} \end{cases} \]
That is, $Y$ is an exponential random variable with parameter 2.

1.18 Functions of independent exponential random variables
(a) $Z$ takes values in the positive real line. So let $z \ge 0$.
\[ P[Z \le z] = P[\min\{X_1, X_2\} \le z] = P[X_1 \le z \text{ or } X_2 \le z] = 1 - P[X_1 > z \text{ and } X_2 > z] = 1 - P[X_1 > z]P[X_2 > z] = 1 - e^{-\lambda_1 z} e^{-\lambda_2 z} = 1 - e^{-(\lambda_1 + \lambda_2)z} \]
Differentiating yields that
\[ f_Z(z) = \begin{cases} (\lambda_1 + \lambda_2) e^{-(\lambda_1 + \lambda_2)z} & z \ge 0 \\ 0 & z < 0 \end{cases} \]
That is, $Z$ has the exponential distribution with parameter $\lambda_1 + \lambda_2$.
(b) $R$ takes values in the positive real line and by independence the joint pdf of $X_1$ and $X_2$ is the product of their individual densities. So for $r \ge 0$,
\[ P[R \le r] = P\left[\frac{X_1}{X_2} \le r\right] = P[X_1 \le r X_2] = \int_0^\infty \int_0^{r x_2} \lambda_1 e^{-\lambda_1 x_1} \lambda_2 e^{-\lambda_2 x_2}\,dx_1\,dx_2 = \int_0^\infty (1 - e^{-r \lambda_1 x_2}) \lambda_2 e^{-\lambda_2 x_2}\,dx_2 = 1 - \frac{\lambda_2}{r \lambda_1 + \lambda_2}. \]
Differentiating yields that
\[ f_R(r) = \begin{cases} \frac{\lambda_1 \lambda_2}{(\lambda_1 r + \lambda_2)^2} & r \ge 0 \\ 0 & r < 0 \end{cases} \]
1.20 Gaussians and the Q function
(a) $\mathrm{Cov}(3X + 2Y, X + 5Y + 10) = 3\,\mathrm{Cov}(X, X) + 10\,\mathrm{Cov}(Y, Y) = 3\,\mathrm{Var}(X) + 10\,\mathrm{Var}(Y) = 13$.
(b) $X + 4Y$ is $N(0, 17)$, so $P\{X + 4Y \ge 2\} = P\left\{\frac{X + 4Y}{\sqrt{17}} \ge \frac{2}{\sqrt{17}}\right\} = Q\left(\frac{2}{\sqrt{17}}\right)$.
(c) $X - Y$ is $N(0, 2)$, so $P\{(X - Y)^2 > 9\} = P\{X - Y \ge 3 \text{ or } X - Y \le -3\} = 2P\left\{\frac{X - Y}{\sqrt{2}} \ge \frac{3}{\sqrt{2}}\right\} = 2Q\left(\frac{3}{\sqrt{2}}\right)$.

1.22 Working with a joint density
(a) The density must integrate to one, so $c = 4/19$.
(b)
\[ f_X(x) = \begin{cases} \frac{4}{19} \int_1^2 (1 + xy)\,dy = \frac{4}{19}\left[1 + \frac{3x}{2}\right] & 2 \le x \le 3 \\ 0 & \text{else} \end{cases} \]
\[ f_Y(y) = \begin{cases} \frac{4}{19} \int_2^3 (1 + xy)\,dx = \frac{4}{19}\left[1 + \frac{5y}{2}\right] & 1 \le y \le 2 \\ 0 & \text{else} \end{cases} \]
Therefore $f_{X|Y}(x|y)$ is well defined only if $1 \le y \le 2$. For $1 \le y \le 2$:
\[ f_{X|Y}(x|y) = \begin{cases} \frac{1 + xy}{1 + \frac{5}{2} y} & 2 \le x \le 3 \\ 0 & \text{for other } x \end{cases} \]

1.24 Density of a difference
(a) (Method 1) The joint density is the product of the marginals, and for any $c \ge 0$, the probability $P\{|X - Y| \le c\}$ is the integral of the joint density over the region of the positive quadrant such that $\{|x - y| \le c\}$, which by symmetry is one minus twice the integral of the density over the region $\{y \ge 0 \text{ and } x \ge y + c\}$. Thus,
\[ P\{|X - Y| \le c\} = 1 - 2 \int_0^\infty \exp(-\lambda(y + c))\, \lambda \exp(-\lambda y)\,dy = 1 - \exp(-\lambda c). \]
Thus,
\[ f_Z(c) = \begin{cases} \lambda \exp(-\lambda c) & c \ge 0 \\ 0 & \text{else} \end{cases} \]
That is, $Z$ has the exponential distribution with parameter $\lambda$.
(Method 2) The problem can be solved without calculation by the memoryless property of the exponential distribution, as follows. Suppose $X$ and $Y$ are lifetimes of identical lightbulbs which are turned on at the same time. One of them will burn out first. At that time, the other lightbulb will be the same as a new light bulb, and $|X - Y|$ is equal to how much longer that lightbulb will last.
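The conclusion of Solution 1.24, that $|X - Y|$ is again exponential with parameter $\lambda$ when $X$ and $Y$ are independent exponentials with parameter $\lambda$, can be spot-checked by simulation. The sketch below is a seeded Monte Carlo estimate, not a proof. The values $\lambda = 2$, $c = 0.5$, the trial count, and the function name `prob_abs_diff_at_most` are arbitrary choices made for illustration.

```python
import math
import random


def prob_abs_diff_at_most(c, lam, trials=200_000, seed=0):
    """Monte Carlo estimate of P{|X - Y| <= c} for independent Exp(lam) X, Y."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = sum(
        1
        for _ in range(trials)
        if abs(rng.expovariate(lam) - rng.expovariate(lam)) <= c
    )
    return hits / trials


lam, c = 2.0, 0.5
estimate = prob_abs_diff_at_most(c, lam)
exact = 1 - math.exp(-lam * c)  # CDF of an Exp(lam) variable at c
print(estimate, exact)
```

The estimate should agree with $1 - e^{-\lambda c}$ to within sampling error, which is on the order of $10^{-3}$ for this many trials.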
1.26 Some characteristic functions
(a) Differentiation is straightforward, yielding $jE[X] = \Phi'(0) = 2j$ or $E[X] = 2$, and $j^2 E[X^2] = \Phi''(0) = -14$, so $\mathrm{Var}(X) = 14 - 2^2 = 10$. In fact, this is the characteristic function of a Gaussian random variable with mean 2 and variance 10.
(b) Evaluation of the derivatives at zero requires l'Hospital's rule, and is a little tedious. A simpler way is to use the Taylor series expansion $\exp(ju) = 1 + (ju) + (ju)^2/2! + (ju)^3/3! + \cdots$. The result is $E[X] = 0.5$ and $\mathrm{Var}(X) = 1/12$. In fact, this is the characteristic function of a $U(0, 1)$ random variable.
(c) Differentiation is straightforward, yielding $E[X] = \mathrm{Var}(X) = \lambda$. In fact, this is the characteristic function of a $\mathrm{Poi}(\lambda)$ random variable.

1.28 A transformation of jointly continuous random variables
(a) We are using the mapping, from the square region $\{(u, v) : 0 \le u, v \le 1\}$ in the $u$-$v$ plane to the triangular region with corners $(0, 0)$, $(3, 0)$, and $(3, 1)$ in the $x$-$y$ plane, given by
\[ x = 3u, \qquad y = uv. \]
The mapping is one-to-one, meaning that for any $(x, y)$ in the range we can recover $(u, v)$. Indeed, the inverse mapping is given by
\[ u = \frac{x}{3}, \qquad v = \frac{3y}{x}. \]
The Jacobian determinant of the transformation is
\[ J(u, v) = \det \begin{pmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{pmatrix} = \det \begin{pmatrix} 3 & 0 \\ v & u \end{pmatrix} = 3u \ne 0, \quad \text{for all } (u, v) \in (0, 1)^2. \]
Therefore the required pdf is
\[ f_{X,Y}(x, y) = \frac{f_{U,V}(u, v)}{|J(u, v)|} = \frac{9u^2 v^2}{|3u|} = 3uv^2 = \frac{9y^2}{x} \]
within the triangle with corners $(0, 0)$, $(3, 0)$, and $(3, 1)$, and $f_{X,Y}(x, y) = 0$ elsewhere.
(b) Integrating out $y$ from the joint pdf yields
\[ f_X(x) = \begin{cases} \int_0^{x/3} \frac{9y^2}{x}\,dy = \frac{x^2}{9} & \text{if } 0 \le x \le 3 \\ 0 & \text{else} \end{cases} \]
Therefore the conditional density $f_{Y|X}(y|x)$ is well defined only if $0 < x \le 3$. For $0 < x \le 3$,
\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \begin{cases} \frac{81 y^2}{x^3} & \text{if } 0 \le y \le \frac{x}{3} \\ 0 & \text{else} \end{cases} \]
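1.30 Jointly distributed variables
(a) $E\left[\frac{V^2}{1 + U}\right] = E[V^2]\, E\left[\frac{1}{1 + U}\right] = \int_0^\infty v^2 \lambda e^{-\lambda v}\,dv \int_0^1 \frac{1}{1 + u}\,du = \left(\frac{2}{\lambda^2}\right)(\ln 2) = \frac{2 \ln 2}{\lambda^2}$.
(b) $P\{U \le V\} = \int_0^1 \int_u^\infty \lambda e^{-\lambda v}\,dv\,du = \int_0^1 e^{-\lambda u}\,du = (1 - e^{-\lambda})/\lambda$.
(c) The support of both $f_{UV}$ and $f_{YZ}$ is the strip $[0, 1] \times [0, \infty)$, and the mapping $(u, v) \to (y, z)$ defined by $y = u^2$ and $z = uv$ is one-to-one. Indeed, the inverse mapping is given by $u = y^{1/2}$ and $v = z y^{-1/2}$. The absolute value of the Jacobian determinant of the forward mapping is
\[ \left|\frac{\partial(y, z)}{\partial(u, v)}\right| = \left|\det \begin{pmatrix} 2u & 0 \\ v & u \end{pmatrix}\right| = 2u^2 = 2y. \]
Thus,
\[ f_{Y,Z}(y, z) = \begin{cases} \frac{\lambda}{2y} e^{-\lambda z y^{-1/2}} & (y, z) \in [0, 1] \times [0, \infty) \\ 0 & \text{otherwise.} \end{cases} \]

2.2 The limit of the product is the product of the limits
(a) There exists $n_1$ so large that $|y_n - y| \le 1$ for $n \ge n_1$. Thus, $|y_n| \le L$ for all $n$, where $L = \max\{|y_1|, |y_2|, \ldots, |y_{n_1 - 1}|, |y| + 1\}$.
(b) Given $\epsilon > 0$, there exists $n_\epsilon$ so large that $|x_n - x| \le \frac{\epsilon}{2L}$ and $|y_n - y| \le \frac{\epsilon}{2(|x| + 1)}$ for $n \ge n_\epsilon$. Thus, for $n \ge n_\epsilon$,
\[ |x_n y_n - xy| \le |(x_n - x) y_n| + |x(y_n - y)| \le |x_n - x| L + |x| |y_n - y| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} \le \epsilon. \]
So $x_n y_n \to xy$ as $n \to \infty$.

2.4 Limits of some deterministic series
(a) Convergent. This is the power series expansion for $e^x$, which is everywhere convergent, evaluated at $x = 3$. The value of the sum is thus $e^3$. Another way to show the series is convergent is to notice that for $n \ge 3$ the $n$th term can be bounded above by $\frac{3^n}{n!} = \frac{3^3}{3!} \cdot \frac{3}{4} \cdot \frac{3}{5} \cdots \frac{3}{n} \le (4.5)\left(\frac{3}{4}\right)^{n-3}$. Thus, the sum is bounded by a constant plus a geometric series, so it is convergent.
(b) Convergent. Let $0 < \eta < 1$. Then $\ln n < n^\eta$ for all large enough $n$. Also, $n + 2 \le 2n$ for all large enough $n$, and $n + 5 \ge n$ for all $n$. Therefore, the $n$th term in the series is bounded above, for all sufficiently large $n$, by $\frac{2n \cdot n^\eta}{n^3} = 2n^{\eta - 2}$. Therefore, the sum in (b) is bounded above by finitely many terms of the sum, plus $2 \sum_{n=1}^{\infty} n^{\eta - 2}$, which is finite, because, for $\alpha > 1$, $\sum_{n=1}^{\infty} n^{-\alpha} < 1 + \int_1^\infty x^{-\alpha}\,dx = \frac{\alpha}{\alpha - 1}$, as shown in an example in the appendix of the notes.
(c) Not convergent. Let $0 < \eta < 0.2$. Then $\log(n + 1) \le n^\eta$ for all $n$ large enough, so for $n$ large enough the $n$th term in the series is greater than or equal to $n^{-5\eta}$. The series is therefore divergent. We used the fact that $\sum_{n=1}^{\infty} n^{-\alpha}$ is infinite for any $0 \le \alpha \le 1$, because it is greater than or equal to the integral $\int_1^\infty x^{-\alpha}\,dx$, which is infinite for $0 \le \alpha \le 1$.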
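2.6 Convergence of sequences of random variables
(a) The distribution of $X_n$ is the same for all $n$, so the sequence converges in distribution to any random variable with the distribution of $X_1$. To check for mean square convergence, use the fact $\cos(a)\cos(b) = (\cos(a + b) + \cos(a - b))/2$ to calculate that $E[X_n X_m] = \frac{1}{2}$ if $n = m$ and $E[X_n X_m] = 0$ if $n \ne m$. Therefore, $\lim_{n, m \to \infty} E[X_n X_m]$ does not exist, so the sequence $(X_n)$ does not satisfy the Cauchy criteria for m.s. convergence, so it doesn't converge in the m.s. sense. Since it is a bounded sequence, it therefore does not converge in the p. sense either. (Because for bounded sequences, convergence p. implies convergence m.s.) Therefore the sequence doesn't converge in the a.s. sense either. In summary, the sequence converges in distribution but not in the other three senses. (Another approach is to note that the distribution of $X_n - X_{2n}$ is the same for all $n$, so that the sequence doesn't satisfy the Cauchy criteria for convergence in probability.)
(b) If $\omega$ is such that $0 < \Theta(\omega) < 2\pi$, then $\left|1 - \frac{\Theta(\omega)}{\pi}\right| < 1$ so that $\lim_{n \to \infty} Y_n(\omega) = 0$ for such $\omega$. Since $P[0 < \Theta(\omega) < 2\pi] = 1$, it follows that $(Y_n)$ converges to zero in the a.s. sense, and hence also in the p. and d. senses. Since the sequence is bounded, it also converges to zero in the m.s. sense.

2.8 Convergence of random variables on (0,1], version 2
(a) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, the deterministic sequence $X_n(\omega)$ converges to zero. So $X_n \to 0$ a.s. The sequence thus also converges in p. and d. If the sequence converged in the m.s. sense, the limit would also have to be zero, but
\[ E[|X_n - 0|^2] = E[|X_n|^2] = \frac{1}{n^2} \int_0^1 \frac{1}{\omega}\,d\omega = +\infty \not\to 0. \]
The sequence thus does not converge in the m.s. sense.
(b) (a.s., p., d., not m.s.) For any $\omega \in \Omega$ fixed, except the single point 1 which has zero probability, the deterministic sequence $X_n(\omega)$ converges to zero. So $X_n \to 0$ a.s. The sequence also converges in p. and d. If the sequence converged in the m.s. sense, the limit would also have to be zero, but
\[ E[|X_n - 0|^2] = E[|X_n|^2] = n^2 \int_0^1 \omega^{2n}\,d\omega = \frac{n^2}{2n + 1} \not\to 0. \]
The sequence thus does not converge in the m.s. sense.
(c) (d. only) For $\omega$ fixed and irrational, the sequence does not even come close to settling down, so intuitively we expect the sequence does not converge in any of the three strongest senses: a.s., m.s., or p. To prove this, it suffices to prove that the sequence doesn't converge in p. Since the sequence is bounded, convergence in probability is equivalent to convergence in the m.s. sense, so it also would suffice to prove the sequence does not converge in the m.s. sense. The Cauchy criteria for m.s. convergence would be violated if $E[(X_n - X_{2n})^2] \not\to 0$ as $n \to \infty$.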
By the double angle formula, $X_{2n}(\omega) = 2\omega \sin(2\pi n \omega) \cos(2\pi n \omega)$ so that
\[ E[(X_n - X_{2n})^2] = \int_0^1 \omega^2 (\sin(2\pi n \omega))^2 (1 - 2\cos(2\pi n \omega))^2\,d\omega \]
and this integral clearly does not converge to zero as $n \to \infty$. In fact, following the heuristic reasoning below, the limit can be shown to equal $E[\sin^2(\Theta)(1 - 2\cos(\Theta))^2]/3$, where $\Theta$ is uniformly distributed over the interval $[0, 2\pi]$. So the sequence $(X_n)$ does not converge in m.s., p., or a.s. senses.

The sequence does converge in the distribution sense. We shall give a heuristic derivation of the limiting CDF. Note that the CDF of $X_n$ is given by
\[ F_{X_n}(c) = \int_0^1 I_{\{f(\omega) \sin(2\pi n \omega) \le c\}}\,d\omega \tag{12.1} \]
where $f$ is the function defined by $f(\omega) = \omega$. As $n \to \infty$, the integrand in (12.1) jumps between zero and one more and more frequently. For any small $\epsilon > 0$, we can imagine partitioning $[0, 1]$ into intervals of length $\epsilon$. The number of oscillations of the integrand within each interval converges to infinity, and the factor $f(\omega)$ is roughly constant over each interval. The fraction of a small interval for which the integrand is one nearly converges to $P\{f(\omega) \sin(\Theta) \le c\}$, where $\Theta$ is a random variable that is uniformly distributed over the interval $[0, 2\pi]$, and $\omega$ is a fixed point in the small interval. So the CDF of $X_n$ converges for all constants $c$ to:
\[ \int_0^1 P\{f(\omega) \sin(\Theta) \le c\}\,d\omega. \tag{12.2} \]
(Note: The following observations can be used to make the above argument rigorous. The integrals in (12.1) and (12.2) would be equal if $f$ were constant within each interval of the form $(\frac{i}{n}, \frac{i+1}{n})$. If $f$ is continuous on $[0, 1]$, it can be approximated by such step functions with maximum approximation error converging to zero as $n \to \infty$. Details are left to the reader.)

2.10 Convergence of a sequence of discrete random variables
(a) The CDF of $X_n$ is shown in Figure 12.1. Since $F_n(x) = F_X\left(x - \frac{1}{n}\right)$ it follows that $\lim_{n \to \infty} F_n(x) = F_X(x-)$ for all $x$. So $\lim_{n \to \infty} F_n(x) = F_X(x)$ unless $F_X(x) \ne F_X(x-)$, i.e., unless $x = 1, 2, 3, 4, 5$, or $6$.

[Figure 12.1: CDF of $X_n$.]

(b) $F_X$ is continuous at $x$ unless $x \in \{1, 2, 3, 4, 5, 6\}$.
(c) Yes, $\lim_{n \to \infty} X_n = X$ d. by definition.
2.12 Convergence of a minimum
(a) The sequence $(X_n)$ converges to zero in all four senses. Here is one proof, and there are others. For any $\epsilon$ with $0 < \epsilon < 1$, $P[|X_n - 0| \ge \epsilon] = P[U_1 \ge \epsilon, \ldots, U_n \ge \epsilon] = (1 - \epsilon)^n$, which converges to zero as $n \to \infty$. Thus, by definition, $X_n \to 0$ p. Thus, the sequence converges to zero in d. sense and, since it is bounded, in the m.s. sense. For each $\omega$, as a function of $n$, the sequence of numbers $X_1(\omega), X_2(\omega), \ldots$ is a nonincreasing sequence of numbers bounded below by zero. Thus, the sequence $X_n$ converges in the a.s. sense to some limit random variable. If a limit of random variables exists in different senses, the limit random variable has to be the same, so the sequence $(X_n)$ converges a.s. to zero.
(b) For $n$ fixed, the variable $Y_n$ is distributed over the interval $[0, n^\theta]$, so let $c$ be a number in that interval. Then $P[Y_n \le c] = P[X_n \le c n^{-\theta}] = 1 - P[X_n > c n^{-\theta}] = 1 - (1 - c n^{-\theta})^n$. Thus, if $\theta = 1$, $\lim_{n \to \infty} P[Y_n \le c] = 1 - \lim_{n \to \infty} \left(1 - \frac{c}{n}\right)^n = 1 - \exp(-c)$ for any $c \ge 0$. Therefore, if $\theta = 1$, the sequence $(Y_n)$ converges in distribution, and the limit distribution is the exponential distribution with parameter one.

2.14 Limits of functions of random variables
(a) Yes. Since $g$ is a continuous function, if a sequence of numbers $a_n$ converges to a limit $a$, then $g(a_n)$ converges to $g(a)$. Therefore, for any $\omega$ such that $\lim_{n \to \infty} X_n(\omega) = X(\omega)$, it holds that $\lim_{n \to \infty} g(X_n(\omega)) = g(X(\omega))$. If $X_n \to X$ a.s., then the set of all such $\omega$ has probability one, so $g(X_n) \to g(X)$ a.s.
(b) Yes. A direct proof is to first note that $|g(b) - g(a)| \le |b - a|$ for any numbers $a$ and $b$. So, if $X_n \to X$ m.s., then $E[|g(X_n) - g(X)|^2] \le E[|X - X_n|^2] \to 0$ as $n \to \infty$. Therefore $g(X_n) \to g(X)$ m.s. A slightly more general proof would be to use the continuity of $g$ (implying uniform continuity on bounded intervals) to show that $g(X_n) \to g(X)$ p., and then, since $g$ is bounded, use the fact that convergence in probability for a bounded sequence implies convergence in the m.s. sense.
(c) No. For a counterexample, let $X_n = (-1)^n/n$. Then $X_n \to 0$ deterministically, and hence in the a.s. sense. But $h(X_n) = (-1)^n$, which converges with probability zero, not with probability one.
(d) No. For a counterexample, let $X_n = (-1)^n/n$. Then $X_n \to 0$ deterministically, and hence in the m.s. sense. But $h(X_n) = (-1)^n$ does not converge in the m.s. sense. (For a proof, note that $E[h(X_m)h(X_n)] = (-1)^{m+n}$, which does not converge as $m, n \to \infty$. Thus, $h(X_n)$ does not satisfy the necessary Cauchy criteria for m.s. convergence.)
If a limit of random variables exists in different senses. For any with 0 < < 1. (b) For n fixed. But h(Xn ) = (−1)n . let Xn = (−1)n /n.s. sense to some limit random variable.) (c) No. For each ω. sense. so let c be a number in that interval.. which does not converge as m. A slightly more general proof would be to use the continuity of g (implying uniform continuity on bounded intervals) to show that g(Xn ) → g(X) p. Then Xn → 0 deterministically. For a counter example. Xn → 0 p. P [|Xn − 0| ≥ ] = P [U1 ≥ . Therefore. for any ω such that limn→∞ Xn (ω) = X(ω). is a nonincreasing sequence of numbers bounded below by zero.. .d. 2. not with probability one. A direct proof is to first note that |g(b) − g(a)| ≤ |b − a| for any numbers a and b. So. let Xn = (−1)n /n. if θ = 1.12 Convergence of a minimum (a) The sequence (Xn ) converges to zero in all four senses. and then. 2. so the sequence (Xn ) converges a.s. Therefore. (b) if u is an even multiple of π 1 does not exist if u is an odd multiple of π lim ΦSn (u) = n→∞ 0 if u is not a multiple of π.s. or p. so by Lemma 2. Xn } ≤ c ln n} = P {X1 ≤ c ln n. limn→∞ P (|V2n − Vn | > ) = 0 so by the Cauchy criteria for convergence in probability. FZn (c) = P {max{X1 .s. Hence Vn does not converge in the a. . V2n − Vn = W.1 (and the monotonicity of the function ex to extend to the case x = −∞). The proof below looks at the case m = 2n. . Therefore. a. 1) distribution. because.s. X2 ≤ c ln n. Clearly FZn (c) = 0 for c ≤ 0. 2 Thus. and by the central limit theorem. −∞ c < 1 −1 c = 1 xn → 0 c > 1. then most of the random variables in the sum defining Vm are independent of the variables defining Vn . at c = 1) of FZ∞ .. For c > 0. Xn ≤ c ln n} = P {X1 ≤ c ln n}P {X2 ≤ c ln n} · · · P {Xn ≤ c ln n} = (1 − e−c ln n )n = (1 − n−c )n (b) Or. limn→∞ ΦSn (π) = limn→∞ (−1)n does not exist. Observe that as n → ∞. sense or m. . then FZn (c) → FZ∞ (c) at the continuity points (i. 
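The two limits used in the solution to Problem 2.16 are easy to check numerically. The following is a minimal sketch, assuming an arbitrarily chosen test point `u = 1.5`; the helper name `phi_V` is introduced here for illustration only.

```python
import math

def phi_V(u, n):
    """Characteristic function cos(u / sqrt(n))**n of V_n = S_n / sqrt(n)."""
    return math.cos(u / math.sqrt(n)) ** n

u = 1.5                                   # arbitrary test point (an assumption)
limit = math.exp(-u * u / 2)              # N(0, 1) characteristic function at u
errors = [abs(phi_V(u, n) - limit) for n in (10, 100, 1000, 10000)]

# Variance of the limit W of V_{2n} - V_n: sum of the two squared coefficients.
var_W = ((math.sqrt(2) - 2) / 2) ** 2 + (1 / math.sqrt(2)) ** 2

print(errors)                             # shrinks as n grows
print(var_W, 2 - math.sqrt(2))            # the two values agree
```

The gap to $e^{-u^2/2}$ shrinks roughly like $1/n$, while the variance of $V_{2n} - V_n$ stays at $2 - \sqrt{2}$, which is why convergence in distribution holds but convergence in probability fails.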
2.18 On the growth of the maximum of n independent exponentials

(a) Let $n \ge 2$. Clearly $F_{Z_n}(c) = 0$ for $c \le 0$. For $c > 0$,
\[ F_{Z_n}(c) = P\{\max\{X_1, \ldots, X_n\} \le c \ln n\} = P\{X_1 \le c \ln n, X_2 \le c \ln n, \ldots, X_n \le c \ln n\} \]
\[ = P\{X_1 \le c \ln n\} P\{X_2 \le c \ln n\} \cdots P\{X_n \le c \ln n\} = (1 - e^{-c \ln n})^n = (1 - n^{-c})^n . \]

(b) Or, $F_{Z_n}(c) = (1 + \frac{x_n}{n})^n$, where $x_n = -n^{1-c}$. Observe that as $n \to \infty$,
\[ x_n \to \begin{cases} -\infty & c < 1 \\ -1 & c = 1 \\ 0 & c > 1 , \end{cases} \]
so by Lemma 2.3.1 (and the monotonicity of the function $e^x$ to extend to the case $x = -\infty$),
\[ F_{Z_n}(c) \to \begin{cases} 0 & c < 1 \\ e^{-1} & c = 1 \\ 1 & c > 1 . \end{cases} \]
Therefore, if $Z_\infty$ is the random variable that is equal to one with probability one, then $F_{Z_n}(c) \to F_{Z_\infty}(c)$ at the continuity points (i.e., for $c \ne 1$) of $F_{Z_\infty}$. So the sequence $(Z_n)$ converges to one in distribution.

2.20 Limit behavior of a stochastic dynamical system

Due to the persistent noise, just as for the example following Theorem 2.1.5 in the notes, the sequence does not converge to an ordinary random variable in the a.s., p., or m.s. senses. To gain some insight, imagine (or simulate on a computer) a typical sample path of the process. A typical sample sequence hovers around zero for a while, but eventually, since the Gaussian variables can be arbitrarily large, some value of $X_n$ will cross above any fixed threshold with probability one. After that, $X_n$ would probably converge to infinity quickly. For example, if $X_n = 3$ for some $n$, and if the noise were ignored from that time forward, then $X$ would go through the sequence $9, 81, 6561, 43046721, 1853020188851841, 3.43 \times 10^{30}, \ldots$, and one suspects the noise terms would not stop the growth. This suggests that $X_n \to +\infty$ in the a.s. sense (and hence in the p. and d. senses as well. Convergence to $+\infty$ in the m.s. sense is not well defined.) Of course, then, $X_n$ does not converge in any sense to an ordinary random variable.

We shall follow the above intuition to prove that $X_n \to \infty$ a.s. If $W_{n-1} \ge 3$ for some $n$, then $X_n \ge 3$. Thus, the sequence $X_n$ will eventually cross above the threshold 3. We say that $X$ diverges nicely from time $n$ if the event $E_n = \{X_{n+k} \ge 3 \cdot 2^k \text{ for all } k \ge 0\}$ is true. Note that if $X_{n+k} \ge 3 \cdot 2^k$ and $W_{n+k} \ge -3 \cdot 2^k$, then $X_{n+k+1} \ge (3 \cdot 2^k)^2 - 3 \cdot 2^k = 3 \cdot 2^k (3 \cdot 2^k - 1) \ge 3 \cdot 2^{k+1}$. Therefore, $E_n \supset \{X_n \ge 3 \text{ and } W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\}$. Thus, using a union bound and the bound $Q(u) \le \frac{1}{2}\exp(-u^2/2)$ for $u \ge 0$:
\[ P[E_n \mid X_n \ge 3] \ge P\{W_{n+k} \ge -3 \cdot 2^k \text{ for all } k \ge 0\} = 1 - P\left[ \cup_{k=0}^\infty \{W_{n+k} \le -3 \cdot 2^k\} \right] \]
\[ \ge 1 - \sum_{k=0}^\infty P\{W_{n+k} \le -3 \cdot 2^k\} = 1 - \sum_{k=0}^\infty Q(3 \cdot 2^k \cdot \sqrt{2}) \]
\[ \ge 1 - \frac{1}{2} \sum_{k=0}^\infty \exp(-(3 \cdot 2^k)^2) \ge 1 - \frac{1}{2} \sum_{k=0}^\infty (e^{-9})^{k+1} = 1 - \frac{e^{-9}}{2(1 - e^{-9})} \ge 0.9999 . \]

The pieces are put together as follows. Let $N_1$ be the smallest time such that $X_{N_1} \ge 3$. Then $N_1$ is finite with probability one, as explained above. Then $X$ diverges nicely from time $N_1$ with probability at least 0.9999. However, if $X$ does not diverge nicely from time $N_1$, then there is some first time of the form $N_1 + k$ such that $X_{N_1 + k} < 3 \cdot 2^k$. Note that the future of the process beyond that time has the same evolution as the original process. Let $N_2$ be the first time after that such that $X_{N_2} \ge 3$. Then $X$ again has chance at least 0.9999 to diverge nicely to infinity. And so on. Thus, $X$ will have arbitrarily many chances to diverge nicely to infinity, with each chance having probability at least 0.9999. The number of chances needed until success is a.s. finite (in fact it has the geometric distribution), so that $X$ diverges nicely to infinity from some time, with probability one.
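The two numerical claims in the solution to Problem 2.20 (the noise-free squaring sequence started at 3, and the 0.9999 lower bound) can be reproduced with a few lines. This is only a sketch; it takes $Q(u) = \frac{1}{2}\mathrm{erfc}(u/\sqrt{2})$, so that $Q(3 \cdot 2^k \sqrt{2}) = \frac{1}{2}\mathrm{erfc}(3 \cdot 2^k)$, and truncates the infinite sum after six terms since later terms are negligible.

```python
import math

# Noise-free recursion x -> x**2 started at 3, in exact integer arithmetic.
x, path = 3, []
for _ in range(6):
    x = x * x
    path.append(x)

# Closed-form lower bound on P[E_n | X_n >= 3] from the union bound.
bound = 1 - math.exp(-9) / (2 * (1 - math.exp(-9)))

# Direct evaluation of sum_k Q(3 * 2**k * sqrt(2)), truncated at k = 5.
tail = sum(0.5 * math.erfc(3 * 2 ** k) for k in range(6))

print(path)
print(bound, 1 - tail)
```

The exact tail sum is smaller than the geometric-series bound, so the probability of diverging nicely is in fact somewhat larger than the bound suggests.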
just as for the example following Theorem 2.43 × 1030 . so that X diverges nicely to infinity from some time.s. then. To gain some insight. Then X diverges nicely from time N1 with probability at least 0. En ⊃ {Xn ≥ 3 and Wn+k ≥ −3 · 2k for all k ≥ 0}.9999.) Of course. senses. Thus. 81. and if the noise were ignored from that time forward. Then N1 is finite with probability one. µ1 = 1. 2. then X would go through the sequence 9. then Xn ≥ 3. or m. But E[Ym Yn ] = n∧m σk which converges to ∞ σk as n. (b) In particular. for e a > 0. sense if and only if k=1 σk < ∞. (Yn ) k=1 k=1 ∞ 2 converges in the m..26 A large deviation 2 Since E[X1 ] = 2 > 1.s.. sense. limn→∞ Dn = 0 a. which we now compute. = 2. The set of such ω has probability one.s.s. and M (θ) = E[X 2 exp(θX)]E[exp(θX)] − E[X exp(θX)]2 /E[exp(θX)]2 . = E[X expectation yields (dθ)k Interchanging differen- . Differentiating again dθ yields M (θ) = tiation and M (θ) = E[X exp(θX)]/E[exp(θX)].24 Mean square convergence of a random series Let Yn = X1 + · · · + Xn .2. M (0) = 0 and M (0) = Var(X) = σ 2 .s. By the hint. so the second order Taylor’s approximation d2 E[X exp(θX)] E[exp(θX)] − ( dE[exp(θX)] )2 /E[exp(θX)]2 .18 × 10−7 . which in turn implies limn→∞ ln Dn = −∞ a. Note that Dn = U1 · · · Un−1 . Therefore. m → ∞..28 A rapprochement between the central limit theorem and large deviations (a) Differentiating with respect to θ yields M (θ) = ( dE[exp(θX)] )/E[exp(θX)]. By Proposition 2. for each ω such that Dn (ω) converges to zero. Note. 2. 2. So ∞ M (θ) = ln E[eθx ] = ln −∞ 1 1 2 1 √ e−x ( 2 −θ) dx = − ln(1 − 2θ) 2 2π 1 ln(1 − 2θ) 2 (a) = max θa + θ 1 1 1− 2 a 1 1 θ∗ = (1 − ) 2 a 1 b = (2) = (1 − ln 2) = 0. the sequence Xn (ω) is a Cauchy sequence of numbers. and hence has a limit.s.s. for the tilted density fθ .s. which was to be proved. which is the second moment. the m. Thus. We are interested in determining whether limn→∞ Yn exists in the m. which is the mean for the tilted distribution fθ . 
limit exists if and only if the limit limm. the strong law of large 0 numbers implies that limn→∞ ln Dn = −1 a.s. SOLUTIONS TO PROBLEMS where θ1 and θ2 are the solutions to the equation θ2 = (1 + θ)/2. dθ (dθ)2 dk E[exp(θX)] k exp(θX)].1534 2 e−100b = 2. This yields µn = 2 (1 − (− 1 )n ). Since ln Dn = 1 ln(U1 ) + · · · ln(Un−1 ) and E[ln Ui ] = 0 ln(u)du = (x ln x − x)|1 = −1.3. ∞ −ax2 dx −∞ e = x ∞ − 2( 1 ) e 2a dx −∞ 2 2 = π a. minus the first moment squared. Cram´r’s theorem implies that b = (2). 3 2 (c) It is first proved that limn→∞ Dn = 0 a. or n−1 equivalently.n→∞ E[Ym Yn ] exists 2 2 and is finite.356 CHAPTER 12. so Xn converges a.. or simply the variance. 30 Large deviations of a mixed sum Modifying the derivation for iid random variables. we find that for θ ≥ 0: P Sn ≥a n ≤ E[eθ(Sn −an) ] = E[eθX1 ]nf E[eθY1 ]n(1−f ) e−nθa = exp(−n[θa − f MX (θ) − (1 − f )MY (θ)]) where MX and MY are the log moment generating functions of X1 and Y1 respectively. √ √ so as n → ∞. (a) Yes. The result is f 0 0+ 1/3 2/3 1 l(f.s. Each sample path of the sequence Bk is monotone nonincreasing and bounded below by zero.) 5 (c) If j ≤ k. E[(Bk − 0)2 ] = E[A2 ]k = ( 5 )k → 0 as k → ∞. the central limit theorem approximation and large deviations bound/approximation are consistent with each other. f ) is discontinuous in f at f = 0. adding only one exponentially distributed random variable to a sum of Poisson random variables can change the large deviations exponent.32 The limit of a sum of cumulative products of a sequence of uniform random variables m.876 1. Thus. 2. The exponent is the same as in the bound/approximation to the central limit approximation described in the problem statement. (a) for small a satisfies (a) ≈ maxθ (aθ− θ 2 ) = 2σ2 . 1) = a − 1 − ln(a) (large deviations exponent for the Exp(1) distribution). Therefore.282 1. a) by numerical optimization. 0) = a ln a + 1 − a (large deviations exponent for the P oi(1) distribution) and l(a. Bk → 0.545 2.s. 4) 2. 
For 0 < f < 1 we compute l(f. Thus. the large deviations upper bound behaves as P [Sn ≥ b n] ≤ exp(−n (b/ n)) ≈ b2 b2 exp(−n 2σ2 n ) = exp(− 2σ2 ). k! Note that l(a. and is hence convergent. Therefore. (The limit has to be the same as the m. a) = max θa − f MX (θ) − (1 − f )MY (θ) θ where MX (θ) = − ln(1 − θ) θ < 1 +∞ θ≥1 ∞ MY (θ) = ln k=0 eθk e−1 θ = ln(ee −1 ) = eθ − 1.614 Note: l(4. In fact. so Bk converges to zero almost surely. then E[Bj Bk ] = E[A2 · · · A2 Aj+1 · · · Ak ] = ( 8 )j ( 3 )k−j .719 1. exists.s. Thus. for moderately large b. l(f. limk→∞ Bk a.357 σ a for M near zero is M (θ) = θ2 σ 2 /2. 1 8 (b) Yes. 1 j 4 . Therefore. 2 2 2 2. limit. E[Θ] = π . . Thus.s.4). and hence an ordinary random variable. E[Y |Θ] = − π3 (Θ − π ). 3 3 (e) Yes. where E[Y ] = π 0 cos(θ)dθ = 0.Y ) (Θ − E[Θ]). so the optimal choice is a = π2 and b = − π3 . The limit has to be the same as the m. Each sample path of the sequence Sn is monotone nondecreasing and is hence convergent. 2 2 3 3 3 5 5 5 ··· 8 4 8 4 8 5 3 5 2 5 2 3 ··· 8 4 8 8 4 5 3 5 3 2 5 ··· 8 8 4 8 4 5 2 ∞ 3 Therefore. Choose b so that the determinants of the following seven matrices are nonnegative: (2) (1) (1) 2 1 1 1 2 b b 1 1 0 0 1 K itself .2 Linear approximation of the cosine function over an interval 1 π E[Y |Θ] = E[Y ] + Cov(Θ.4 Valid covariance matrix Set a = 1 to make K symmetric. l=1 4 plus all terms directly above or directly to the right of that term. . . 8 + 1 is readily seen to be the sum of the j th term on the diagonal. . . namely 3 .s. The second moment of the limit is the k=1 k=1 35 limit of the second moments. (d) Mean square convergence implies convergence of the mean.358 CHAPTER 12. limn→∞ Sn a. . . . SOLUTIONS TO PROBLEMS n m n m ∞ ∞ E[Sn Sm ] = E[ = 2 Bj Bk ] = E[Bj Bk ] j=1 j=1 k=1 k=1 ∞ ∞ ∞ j k−j 5 8 3 4 + j=1 ∞ → j=1 k=1 j E[Bj Bk ] (12. Thus. so the variance of the limit is 35 − 32 = 8 . is to note that (12.·. limit. (To show that the limit is finite with probability one. . 
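The numerical optimization behind the table of $l(f, 4)$ in Problem 2.30 can be sketched as follows. This is one possible approach, not necessarily the one used for the notes: the objective is concave in $\theta$, so a ternary search suffices. The function name `rate`, the search cap `theta <= 10` for the pure Poisson case, and the value `f = 1e-9` standing in for $f = 0+$ are all assumptions made for this illustration.

```python
import math

def rate(f, a, iters=200):
    """l(f, a): maximize theta*a - f*M_X(theta) - (1-f)*M_Y(theta) over theta >= 0."""
    def g(theta):
        m_x = -math.log(1 - theta) if f > 0 else 0.0   # Exp(1) term, finite for theta < 1
        m_y = math.exp(theta) - 1                       # Poi(1) term
        return theta * a - f * m_x - (1 - f) * m_y

    # For f > 0 the objective is -infinity at theta >= 1, so search just below 1.
    lo, hi = 0.0, (1 - 1e-12 if f > 0 else 10.0)
    for _ in range(iters):                              # ternary search on a concave function
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1
        else:
            hi = m2
    return g((lo + hi) / 2)

for label, f in [("0", 0.0), ("0+", 1e-9), ("1/3", 1 / 3), ("2/3", 2 / 3), ("1", 1.0)]:
    print(label, round(rate(f, 4.0), 3))
```

The jump between the rows `0` and `0+` comes entirely from the constraint $\theta < 1$ that appears as soon as a single exponential term is present.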
2.32 The limit of a sum of cumulative products of a sequence of uniform random variables

(a) Yes. $E[(B_k - 0)^2] = E[A_1^2]^k = (\frac{5}{8})^k \to 0$ as $k \to \infty$. Thus, $B_k \to 0$ m.s.

(b) Yes. Each sample path of the sequence $B_k$ is monotone nonincreasing and bounded below by zero, and is hence convergent. Thus, $\lim_{k\to\infty} B_k$ a.s. exists. (The limit has to be the same as the m.s. limit, so $B_k$ converges to zero almost surely.)

(c) If $j \le k$, then $E[B_j B_k] = E[A_1^2 \cdots A_j^2 A_{j+1} \cdots A_k] = (\frac{5}{8})^j (\frac{3}{4})^{k-j}$. Therefore,
\[ E[S_n S_m] = E\left[ \sum_{j=1}^n \sum_{k=1}^m B_j B_k \right] = \sum_{j=1}^n \sum_{k=1}^m E[B_j B_k] \to \sum_{j=1}^\infty \sum_{k=1}^\infty E[B_j B_k] \qquad (12.3) \]
\[ = 2 \sum_{j=1}^\infty \sum_{k=j+1}^\infty \left(\frac{5}{8}\right)^j \left(\frac{3}{4}\right)^{k-j} + \sum_{j=1}^\infty \left(\frac{5}{8}\right)^j = 2 \sum_{j=1}^\infty \sum_{l=1}^\infty \left(\frac{5}{8}\right)^j \left(\frac{3}{4}\right)^l + \sum_{j=1}^\infty \left(\frac{5}{8}\right)^j \]
\[ = \sum_{j=1}^\infty \left(\frac{5}{8}\right)^j \left( 2 \sum_{l=1}^\infty \left(\frac{3}{4}\right)^l + 1 \right) \qquad (12.4) \]
\[ = \frac{5}{3}(2 \cdot 3 + 1) = \frac{35}{3} . \]
A visual way to derive (12.4) is to note that (12.3) is the sum of all entries in the infinite 2-d array:
\[ \begin{array}{cccc} \vdots & \vdots & \vdots & \\ \frac{5}{8}\left(\frac{3}{4}\right)^2 & \left(\frac{5}{8}\right)^2 \frac{3}{4} & \left(\frac{5}{8}\right)^3 & \cdots \\ \frac{5}{8}\cdot\frac{3}{4} & \left(\frac{5}{8}\right)^2 & \left(\frac{5}{8}\right)^2 \frac{3}{4} & \cdots \\ \frac{5}{8} & \frac{5}{8}\cdot\frac{3}{4} & \frac{5}{8}\left(\frac{3}{4}\right)^2 & \cdots \end{array} \]
Therefore, $\left(\frac{5}{8}\right)^j \left( 2 \sum_{l=1}^\infty \left(\frac{3}{4}\right)^l + 1 \right)$ is readily seen to be the sum of the $j$th term on the diagonal, plus all terms directly above or directly to the right of that term.

(d) Mean square convergence implies convergence of the mean. Thus, the mean of the limit is $\lim_{n\to\infty} E[S_n] = \lim_{n\to\infty} \sum_{k=1}^n E[B_k] = \sum_{k=1}^\infty (\frac{3}{4})^k = 3$. The second moment of the limit is the limit of the second moments, namely $\frac{35}{3}$, so the variance of the limit is $\frac{35}{3} - 3^2 = \frac{8}{3}$.

(e) Yes. Each sample path of the sequence $S_n$ is monotone nondecreasing and is hence convergent. Thus, $\lim_{n\to\infty} S_n$ a.s. exists. The limit has to be the same as the m.s. limit. (To show that the limit is finite with probability one, and hence an ordinary random variable, the monotone convergence theorem could be applied, yielding $E[\lim_{n\to\infty} S_n] = \lim_{n\to\infty} E[S_n] = 3$.)

3.2 Linear approximation of the cosine function over an interval

$\hat{E}[Y|\Theta] = E[Y] + \frac{\mathrm{Cov}(\Theta, Y)}{\mathrm{Var}(\Theta)}(\Theta - E[\Theta])$, where $E[Y] = \frac{1}{\pi}\int_0^\pi \cos(\theta)\,d\theta = 0$, $E[\Theta] = \frac{\pi}{2}$, $\mathrm{Var}(\Theta) = \frac{\pi^2}{12}$,
\[ E[\Theta Y] = \int_0^\pi \frac{\theta\cos(\theta)}{\pi}\,d\theta = \frac{\theta\sin(\theta)}{\pi}\Big|_0^\pi - \int_0^\pi \frac{\sin(\theta)}{\pi}\,d\theta = -\frac{2}{\pi} , \]
and $\mathrm{Cov}(\Theta, Y) = E[\Theta Y] - E[\Theta]E[Y] = -\frac{2}{\pi}$. Therefore, $\hat{E}[Y|\Theta] = -\frac{24}{\pi^3}(\Theta - \frac{\pi}{2})$, so the optimal choice is $a = \frac{12}{\pi^2}$ and $b = -\frac{24}{\pi^3}$.

3.4 Valid covariance matrix

Set $a = 1$ to make $K$ symmetric. Choose $b$ so that the determinants of the following seven matrices are nonnegative:
\[ (2) \quad (1) \quad (1) \quad \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \quad \begin{pmatrix} 2 & b \\ b & 1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad K \text{ itself.} \]
The fifth matrix has determinant $2 - b^2$ and $\det(K) = 2 - 1 - b^2 = 1 - b^2$. Hence $K$ is a valid covariance matrix (i.e., symmetric and positive semidefinite) if and only if $a = 1$ and $-1 \le b \le 1$.

3.6 Conditional probabilities with joint Gaussians II

(a) $P[|X - 1| \ge 2] = P[X \le -1 \text{ or } X \ge 3] = P[\frac{X}{2} \le -\frac{1}{2}] + P[\frac{X}{2} \ge \frac{3}{2}] = \Phi(-\frac{1}{2}) + 1 - \Phi(\frac{3}{2})$.

(b) Given $Y = 3$, the conditional density of $X$ is Gaussian with mean $E[X] + \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}(3 - E[Y]) = 1$ and variance $\mathrm{Var}(X) - \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Var}(Y)} = 4 - \frac{6^2}{18} = 2$.

(c) The estimation error $X - E[X|Y]$ is Gaussian, has mean zero and variance 2, and is independent of $Y$. (The variance of the error was calculated to be 2 in part (b).) Thus the probability is $\Phi(-\frac{1}{\sqrt{2}}) + 1 - \Phi(\frac{1}{\sqrt{2}})$, which can also be written as $2\Phi(-\frac{1}{\sqrt{2}})$ or $2(1 - \Phi(\frac{1}{\sqrt{2}}))$.
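The coefficients found in Problem 3.2 can be double-checked by computing the moments numerically instead of in closed form. The sketch below uses a plain midpoint rule; the grid size `N` and the helper name `mean` are arbitrary choices made for this illustration.

```python
import math

N = 20_000                                    # midpoint-rule cells on [0, pi]
h = math.pi / N
thetas = [(i + 0.5) * h for i in range(N)]

def mean(fn):
    """E[fn(Theta)] for Theta uniform on [0, pi], by the midpoint rule."""
    return sum(fn(t) for t in thetas) * h / math.pi

e_theta = mean(lambda t: t)
e_y = mean(math.cos)
cov = mean(lambda t: t * math.cos(t)) - e_theta * e_y
var = mean(lambda t: t * t) - e_theta ** 2

b = cov / var                 # slope of the best linear estimator a + b * Theta
a = e_y - b * e_theta         # intercept

print(a, 12 / math.pi ** 2)
print(b, -24 / math.pi ** 3)
```

Each printed pair should agree to several decimal places, confirming $a = 12/\pi^2$ and $b = -24/\pi^3$.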
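3.8 An MMSE estimation problem

(a) $E[XY] = 2\int_0^1 \int_{2x}^{1+x} xy\,dy\,dx = \frac{5}{12}$. The other moments can be found in a similar way. Alternatively, note that the marginal densities are given by
\[ f_X(x) = \begin{cases} 2(1 - x) & 0 \le x \le 1 \\ 0 & \text{else} \end{cases} \qquad f_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 \le y \le 2 \\ 0 & \text{else} \end{cases} \]
so that $EX = \frac{1}{3}$, $\mathrm{Var}(X) = \frac{1}{18}$, $EY = 1$, $\mathrm{Var}(Y) = \frac{1}{6}$, $\mathrm{Cov}(X, Y) = \frac{5}{12} - \frac{1}{3} = \frac{1}{12}$. So
\[ \hat{E}[X|Y] = \frac{1}{3} + \frac{1}{12}\left(\frac{1}{6}\right)^{-1}(Y - 1) = \frac{1}{3} + \frac{Y - 1}{2} , \]
\[ E[e^2] = \frac{1}{18} - \frac{1}{12}\left(\frac{1}{6}\right)^{-1}\frac{1}{12} = \frac{1}{72} = \text{the MMSE for } \hat{E}[X|Y] . \]
Inspection of Figure 12.2 shows that for $0 \le y \le 2$, the conditional distribution of $X$ given $Y = y$ is the uniform distribution over the interval $[0, y/2]$ if $0 \le y \le 1$ and over the interval $[y - 1, y/2]$ if $1 \le y \le 2$. The conditional mean of $X$ given $Y = y$ is thus the midpoint of that interval, yielding:
\[ E[X|Y] = \begin{cases} \frac{Y}{4} & 0 \le Y \le 1 \\ \frac{3Y - 2}{4} & 1 \le Y \le 2 . \end{cases} \]

[Figure 12.2: Sketch of $E[X|Y = y]$ and $\hat{E}[X|Y = y]$, plotted in the $(x, y)$ plane for $0 \le x \le 1$, $0 \le y \le 2$.]

To find the corresponding MSE, note that given $Y$, the conditional distribution of $X$ is uniform over some interval. Let $L(Y)$ denote the length of the interval. Then
\[ E[e^2] = E[E[e^2|Y]] = E\left[\frac{1}{12}L(Y)^2\right] = 2 \cdot \frac{1}{12}\int_0^1 y\left(\frac{y}{2}\right)^2 dy = \frac{1}{96} . \]
For this example, the MSE for the best estimator is 25% smaller than the MSE for the best linear estimator.

(b)
\[ EX = \int_{-\infty}^\infty |y| \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy = 2\int_0^\infty \frac{y}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy = \sqrt{\frac{2}{\pi}} \quad \text{and} \quad EY = 0 , \]
$\mathrm{Var}(Y) = 1$, $\mathrm{Cov}(X, Y) = E[|Y|Y] = 0$, so $\hat{E}[X|Y] = \sqrt{\frac{2}{\pi}} + \frac{0}{1}Y \equiv \sqrt{\frac{2}{\pi}}$. That is, the best linear estimator is the constant $EX$. The corresponding MSE is $\mathrm{Var}(X) = E[X^2] - (EX)^2 = E[Y^2] - \frac{2}{\pi} = 1 - \frac{2}{\pi}$. Note that $|Y|$ is a function of $Y$ with mean square error $E[(X - |Y|)^2] = 0$. Nothing can beat that, so $|Y|$ is the MMSE estimator of $X$ given $Y$. So $|Y| = E[X|Y]$. The corresponding MSE is 0, or 100% smaller than the MSE for the best linear estimator.

3.10 Conditional Gaussian comparison

(a) $p_a = P\{X \ge 2\} = P\{\frac{X}{\sqrt{10}} \ge \frac{2}{\sqrt{10}}\} = Q(\frac{2}{\sqrt{10}}) = Q(0.6324)$.

(b) By the theory of conditional distributions for jointly Gaussian random variables, the conditional distribution of $X$ given $Y = y$ is Gaussian, with mean $\hat{E}[X|Y = y]$ and variance $\sigma_e^2$, which is the MSE for estimation of $X$ by $\hat{E}[X|Y]$. Since $X$ and $Y$ are mean zero and $\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} = 0.8$, we have $\hat{E}[X|Y = y] = 0.8y$, and $\sigma_e^2 = \mathrm{Var}(X) - \frac{\mathrm{Cov}(X, Y)^2}{\mathrm{Var}(Y)} = 3.6$. Hence, given $Y = y$, the conditional distribution of $X$ is $N(0.8y, 3.6)$. Therefore, $P[X \ge 2|Y = y] = Q(\frac{2 - (0.8)y}{\sqrt{3.6}})$. In particular, $p_b = P[X \ge 2|Y = 3] = Q(\frac{2 - (0.8)3}{\sqrt{3.6}}) = Q(-0.2108)$.

(c) Given the event $\{Y \ge 3\}$, the conditional pdf of $Y$ is obtained by setting the pdf of $Y$ to zero on the interval $(-\infty, 3)$, and then renormalizing by $P\{Y \ge 3\} = Q(\frac{3}{\sqrt{10}})$ to make the density integrate to one. We can write this as
\[ f_{Y|Y \ge 3}(y) = \begin{cases} \frac{f_Y(y)}{1 - F_Y(3)} = \frac{e^{-y^2/20}}{Q(\frac{3}{\sqrt{10}})\sqrt{20\pi}} & y \ge 3 \\ 0 & \text{else.} \end{cases} \]
Using this density, by considering the possible values of $Y$, we have
\[ p_c = P[X \ge 2|Y \ge 3] = \int_3^\infty P[X \ge 2, Y \in dy|Y \ge 3] = \int_3^\infty P[X \ge 2|Y = y]P[Y \in dy|Y \ge 3] = \int_3^\infty Q\left(\frac{2 - (0.8)y}{\sqrt{3.6}}\right) f_{Y|Y \ge 3}(y)\,dy . \]
(ALTERNATIVE) The same expression can be derived in a more conventional fashion as follows:
\[ p_c = P[X \ge 2|Y \ge 3] = \frac{P\{X \ge 2, Y \ge 3\}}{P\{Y \ge 3\}} = \int_3^\infty \left[ \int_2^\infty f_{X|Y}(x|y)\,dx \right] f_Y(y)\,dy \Big/ P\{Y \ge 3\} \]
\[ = \int_3^\infty Q\left(\frac{2 - (0.8)y}{\sqrt{3.6}}\right) f_Y(y)\,dy \Big/ (1 - F_Y(3)) = \int_3^\infty Q\left(\frac{2 - (0.8)y}{\sqrt{3.6}}\right) f_{Y|Y \ge 3}(y)\,dy . \]

(d) We will show that $p_a < p_b < p_c$. The inequality $p_a < p_b$ follows from parts (a) and (b) and the fact the function $Q$ is decreasing. By part (c), $p_c$ is an average of $Q(\frac{2 - (0.8)y}{\sqrt{3.6}})$ with respect to $y$ over the region $y \in [3, \infty)$ (using the pdf $f_{Y|Y \ge 3}$). But everywhere in that region with $y > 3$, $Q(\frac{2 - (0.8)y}{\sqrt{3.6}}) > p_b$, showing that $p_c > p_b$.

3.12 An estimator of an estimator

To show that $\hat{E}[X|Y]$ is the LMMSE estimator of $E[X|Y]$, it suffices by the orthogonality principle to note that $\hat{E}[X|Y]$ is linear in $(1, Y)$ and to prove that $E[X|Y] - \hat{E}[X|Y]$ is orthogonal to 1 and to $Y$. However $E[X|Y] - \hat{E}[X|Y]$ can be written as the difference of two random variables $(X - \hat{E}[X|Y])$ and $(X - E[X|Y])$, which are each orthogonal to 1 and to $Y$. Thus, $E[X|Y] - \hat{E}[X|Y]$ is also orthogonal to 1 and to $Y$, and the result follows.

Here is a generalization, which can be proved in the same way. Suppose $V_0$ and $V_1$ are two closed linear subspaces of random variables with finite second moments, such that $V_0 \supset V_1$. Let $X$ be a random variable with finite second moment, and let $X_i^*$ be the variable in $V_i$ with the minimum mean square distance to $X$, for $i = 0$ or $i = 1$. Then $X_1^*$ is the variable in $V_1$ with the minimum mean square distance to $X_0^*$.

Another solution to the original problem can be obtained by using the formula for $\hat{E}[Z|Y]$ applied to $Z = E[X|Y]$:
\[ \hat{E}[E[X|Y]|Y] = E[E[X|Y]] + \mathrm{Cov}(Y, E[X|Y])\mathrm{Var}(Y)^{-1}(Y - EY) \]
which can be simplified using $E[E[X|Y]] = EX$ and
\[ \mathrm{Cov}(Y, E[X|Y]) = E[Y(E[X|Y] - EX)] = E[Y E[X|Y]] - EY \cdot EX = E[E[XY|Y]] - EY \cdot EX = E[XY] - EX \cdot EY = \mathrm{Cov}(X, Y) \]
to yield the desired result.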
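3.14 Some identities for estimators

(a) True. The random variable $E[X|Y]\cos(Y)$ has the following two properties:
• It is a function of $Y$ with finite second moments (because $E[X|Y]$ is a function of $Y$ with finite second moment and $\cos(Y)$ is a bounded function of $Y$)
• $(X\cos(Y) - E[X|Y]\cos(Y)) \perp g(Y)$ for any $g$ with $E[g(Y)^2] < \infty$ (because for any such $g$, $E[(X\cos(Y) - E[X|Y]\cos(Y))g(Y)] = E[(X - E[X|Y])\tilde{g}(Y)] = 0$, where $\tilde{g}(Y) = g(Y)\cos(Y)$.)
Thus, by the orthogonality principle, $E[X|Y]\cos(Y)$ is equal to $E[X\cos(Y)|Y]$.

(b) True. The left hand side is the projection of $X$ onto the space $\{g(Y) : E[g(Y)^2] < \infty\}$ and the right hand side is the projection of $X$ onto the space $\{f(Y^3) : E[f(Y^3)^2] < \infty\}$. But these two spaces are the same, because for each function $g$ there is the function $f(u) = g(u^{1/3})$. The point is that the function $y^3$ is an invertible function, so any function of $Y$ can also be written as a function of $Y^3$.

(c) False. For example, let $X$ be uniform on the interval $[0, 1]$ and let $Y$ be identically zero. Then $E[X^3|Y] = E[X^3] = \frac{1}{4}$ and $E[X|Y]^3 = E[X]^3 = \frac{1}{8}$.

(d) False. For example, let $P\{X = Y = 1\} = P\{X = Y = -1\} = 0.5$. Then $E[X|Y] = Y$ while $E[X|Y^2] = 0$. The point is that the function $y^2$ is not invertible, so that not every function of $Y$ can be written as a function of $Y^2$. Equivalently, $Y^2$ can give less information than $Y$.

(e) False. For example, let $X$ be uniformly distributed on $[-1, 1]$, and let $Y = X$. Then $\hat{E}[X|Y] = Y$ while $\hat{E}[X|Y^3] = E[X] + \frac{\mathrm{Cov}(X, Y^3)}{\mathrm{Var}(Y^3)}(Y^3 - E[Y^3]) = \frac{E[X^4]}{E[X^6]}Y^3 = \frac{7}{5}Y^3$.

3.16 Some simple examples

Of course there are many valid answers for this problem; we only give one.

(a) Let $X$ denote the outcome of a roll of a fair die, and let $Y = 1$ if $X$ is odd and $Y = 2$ if $X$ is even. Then $E[X|Y]$ has to be linear. In fact, since $Y$ has only two possible values, any function of $Y$ can be written in the form $a + bY$. That is, any function of $Y$ is linear. (There is no need to even calculate $E[X|Y]$ here, but we note that it is given by $E[X|Y] = Y + 2$.)

(b) Let $X$ be a $N(0, 1)$ random variable, and let $W$ be independent of $X$, with $P\{W = 1\} = P\{W = -1\} = \frac{1}{2}$. Finally, let $Y = XW$. The conditional distribution of $Y$ given $W$ is $N(0, 1)$, for either possible value of $W$, so the unconditional distribution of $Y$ is also $N(0, 1)$. However, $P\{X - Y = 0\} = 0.5$, so that $X - Y$ is not a Gaussian random variable, so $X$ and $Y$ are not jointly Gaussian.

(c) Let the triplet $(X, Y, Z)$ take on the four values $(0, 0, 0)$, $(1, 1, 0)$, $(1, 0, 1)$, $(0, 1, 1)$ with equal probability. Then any pair of these variables takes the values $(0, 0)$, $(0, 1)$, $(1, 0)$, $(1, 1)$ with equal probability, indicating pairwise independence. But $P\{(X, Y, Z) = (0, 0, 1)\} = 0 \ne P\{X = 0\}P\{Y = 0\}P\{Z = 1\} = \frac{1}{8}$. So the three random variables are not independent.

3.18 Estimating a quadratic

(a) Recall the fact that $E[Z^2] = E[Z]^2 + \mathrm{Var}(Z)$ for any second order random variable $Z$. The idea is to apply the fact to the conditional distribution of $X$ given $Y$. Given $Y$, the conditional distribution of $X$ is Gaussian with mean $\rho Y$ and variance $1 - \rho^2$. Thus, $E[X^2|Y] = (\rho Y)^2 + 1 - \rho^2$.

(b) The MSE $= E[(X^2)^2] - E[(E[X^2|Y])^2] = E[X^4] - \rho^4 E[Y^4] - 2\rho^2 E[Y^2](1 - \rho^2) - (1 - \rho^2)^2 = 2(1 - \rho^4)$.

(c) Since $\mathrm{Cov}(X^2, Y) = E[X^2 Y] = 0$, it follows that $\hat{E}[X^2|Y] = E[X^2] = 1$. That is, the best linear estimator in this case is just the constant estimator equal to 1.

3.20 An innovations sequence and its application

(a) $\tilde{Y}_1 = Y_1$. (Note: $E[\tilde{Y}_1^2] = 1$.) $\tilde{Y}_2 = Y_2 - \frac{E[Y_2\tilde{Y}_1]}{E[\tilde{Y}_1^2]}\tilde{Y}_1 = Y_2 - 0.5Y_1$. (Note: $E[\tilde{Y}_2^2] = 0.75$.)
\[ \tilde{Y}_3 = Y_3 - \frac{E[Y_3\tilde{Y}_1]}{E[\tilde{Y}_1^2]}\tilde{Y}_1 - \frac{E[Y_3\tilde{Y}_2]}{E[\tilde{Y}_2^2]}\tilde{Y}_2 = Y_3 - (0.5)\tilde{Y}_1 - \frac{1}{3}\tilde{Y}_2 = Y_3 - \frac{1}{3}Y_1 - \frac{1}{3}Y_2 . \]
Summarizing,
\[ \begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \quad \text{where} \quad A = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{1}{2} & 1 & 0 \\ -\frac{1}{3} & -\frac{1}{3} & 1 \end{pmatrix} . \]

(b) Since
\[ \mathrm{Cov}\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} \quad \text{and} \quad \mathrm{Cov}\left(X, \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}\right) = (0 \;\; 0.25 \;\; 0.25) , \]
it follows that
\[ \mathrm{Cov}\begin{pmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \end{pmatrix} = A \begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix} A^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{3}{4} & 0 \\ 0 & 0 & \frac{2}{3} \end{pmatrix} , \]
and that $\mathrm{Cov}\left(X, (\tilde{Y}_1, \tilde{Y}_2, \tilde{Y}_3)^T\right) = (0 \;\; 0.25 \;\; 0.25)A^T = (0 \;\; \frac{1}{4} \;\; \frac{1}{6})$.

(c) $a = \frac{\mathrm{Cov}(X, \tilde{Y}_1)}{E[\tilde{Y}_1^2]} = 0$, $\quad b = \frac{\mathrm{Cov}(X, \tilde{Y}_2)}{E[\tilde{Y}_2^2]} = \frac{1}{3}$, $\quad c = \frac{\mathrm{Cov}(X, \tilde{Y}_3)}{E[\tilde{Y}_3^2]} = \frac{1}{4}$.

3.22 A Kalman filtering example

(a) $\hat{x}_{k+1|k} = f\hat{x}_{k|k-1} + K_k(y_k - \hat{x}_{k|k-1})$,
\[ \sigma_{k+1}^2 = f^2(\sigma_k^2 - \sigma_k^2(\sigma_k^2 + 1)^{-1}\sigma_k^2) + 1 = \frac{\sigma_k^2 f^2}{1 + \sigma_k^2} + 1 \quad \text{and} \quad K_k = f\left(\frac{\sigma_k^2}{1 + \sigma_k^2}\right) . \]

(b) Since $\sigma_k^2 \le 1 + f^2$ for $k \ge 1$, the sequence $(\sigma_k^2)$ is bounded for any value of $f$.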
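The innovations calculation of Problem 3.20 is a Gram-Schmidt procedure driven by the covariance matrix, and it can be carried out in exact rational arithmetic. The following sketch reproduces the matrix $A$, the innovation variances, and the coefficients $a$, $b$, $c$; the variable and function names are introduced here for illustration only.

```python
from fractions import Fraction as F

cov_Y = [[F(1), F(1, 2), F(1, 2)],
         [F(1, 2), F(1), F(1, 2)],
         [F(1, 2), F(1, 2), F(1)]]
cov_XY = [F(0), F(1, 4), F(1, 4)]              # Cov(X, Y_i), i = 1, 2, 3

def inner(u, v):
    """E[(u . Y)(v . Y)] for coefficient vectors u, v (all variables mean zero)."""
    return sum(u[i] * cov_Y[i][j] * v[j] for i in range(3) for j in range(3))

A = []                                          # row k: k-th innovation as a combination of Y_1..Y_3
for k in range(3):
    e_k = [F(int(i == k)) for i in range(3)]    # coefficients of Y_k itself
    row = e_k[:]
    for prev in A:                              # subtract projections onto earlier innovations
        c = inner(e_k, prev) / inner(prev, prev)
        row = [r - c * p for r, p in zip(row, prev)]
    A.append(row)

variances = [inner(r, r) for r in A]
cov_X_innov = [sum(r[i] * cov_XY[i] for i in range(3)) for r in A]
coeffs = [c / v for c, v in zip(cov_X_innov, variances)]

print(A)
print(variances, cov_X_innov, coeffs)
```

Because the innovations are orthogonal, each coefficient is a simple ratio of a covariance to a variance, with no matrix inversion needed.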
.26 An innovations problem 2 2 2 2 2 (a) E[Yn ] = E[U1 · · · Un ] = E[U1 ] · · · E[Un ] = 2−n and E[Yn ] = E[U1 · · · Un ] = E[U1 ] · · · E[Un ] = 3−n . YM ] = M + 2 M + 2 M + 2 M n=1 M n=1 M n=1 Cov(XM . Yn−1 ] = E[Yn−1 Un |Y0 . . . . CHAPTER 12. . Yn−1 ] = Yn−1 E[Un |Y0 . which makes sense because xk = yk−1 in that case. . (c) Since the conditional expectation found in (b) is linear. (b) E[Yn |Y0 . Yn−1 ] = Yn−1 E[Un ] = Yn−1 /2. Yn−1 ] = E[Yn |Y0 . 3. . . Var(Yn ) = E[(Yn )2 ] = E[U1 · · · Un−1 (Un − 1 )2 ] = 3−(n−1) /12 and 2 1 Cov(XM . . and Yn = Yn − Yn−1 /2 (also equal to U1 · · · Un−1 (Un − 2 )) for n ≥ 1. Yn )Yn Var(Yn ) 2−n+1 /12 Yn 3−n+1 /12 = = 3 ( )n−1 Yn 2 3. writing σk for Σk|k−1 . . . so Var(Yn ) = 3−n − (2−n )2 = 3−n − 4−n . 2 2 (e) For n ≥ 1. . it follows that E[Yn |Y0 . . . . . . . we have E[XM |Y0 . . . . . . Since Y0 = 1 and all the other innovations variables are mean zero. Yn−1 ] = Yn−1 /2. successive iterates F k xo spiral in towards zero. 1 6 3 U 2 − 3 . mean zero Gaussian . mean zero. Thus.30 Kalman filter for a rotating state (a) Given a nonzero vector xk . The autocorrelation function is given by 2 2 RY (s. t) = E[Ys Zs Yt Zt ] = E[Ys Yt Zs Zt ] = E[Ys Yt ]E[Zs Zt ] = RY (s. U . 5 3. Then Y0 = X1 and Y3/4 = X2 . (c) A simple solution to this problem is to take X1 and X2 to be independent. and then shrinking the vector towards zero by one percent. Y2 = U 2 − 3 .2 Correlation function of a product RX (s. X1 could be N (0. (b) If X1 and X2 are each Gaussian random variables and are independent then (Yt : t ∈ R) is a real-valued Gaussian WSS random process and is hence stationary. EYt ≡ 0.6 Brownian motion: Ascension and smoothing (a) Since the increments of W over nonoverlapping intervals are independent. t) 4. Note: These mutually orthogonal (with respect to the uniform distribution on [-1. with one-tenth revolution per time unit. t)RZ (s. a c T (b) Suppose Σk|k−1 = . 
t) = E[X1 ] cos(2πs) cos(2πt) − 2E[X1 X2 ] cos(2πs) sin(2πt) + E[X2 ] sin(2πs) sin(2πt) = σ 2 (cos(2πs) cos(2πt) + sin(2πs) sin(2πt)] = σ 2 cos(2π(s − t)) (a function of s − t only) So (Yt : t ∈ R) is WSS.4 Another sinusoidal random process (a) Since EX1 = EX2 = 0.1] ) polynomials 1. the vector F xk is obtained by rotating the vector xk one tenth revolution counter-clockwise about the origin. so that Y is not stationary. For example. 4. and 1 − 3) = U4 − 1 − 5 1 −1 7 5 1 − 2 +1 5 3 Y4 = U4 − E[U 4 ·1] E[12 ] ·1− 1 E[U 4 (U 2 − 3 )] 2 − 1 )2 ] E[(U 3 (U 2 6 (U 2 − 1) = U 4 − 7 U 2 + 3 35 . The equations for c b Σk+1|k and Kk can thus be written as Σk+1|k = F Σk|k−1 − Kk = F a 1+a c 1+a a2 1+a ac 1+a ac 1+a c2 1+a FT + Q .365 1 Y1 = U . Note that Hk Σk|k−1 Hk is simply the scaler a. and shrinking by about ten percent per revolution. Y3 = U 3 − 3U 5 . U 3 − 3 U . 4. U 4 − 7 U 2 + 35 are (up to constant multiples) known as the Legendre polynomials. so Y0 and Y3/4 do not have the same distribution. σ 2 ) and X2 could 1 be discrete with P (X1 = σ) = P (X1 = −σ) = 2 . variance σ 2 random variables with different distributions. so that E[Ws |Wr . 9. Wt − Ws ≥ 0} = P {Ws − Wr ≥ 0}P {Wt − Ws ≥ 0} = 1 1 1 · = . Xr ) Var(Xt ) −1 Wr Wt 4. Wt are jointly Gaussian. 4 (Method 2) Since N1 and N2 − N1 are independent. Note that as s varies from r to t. The process N is Markov (because N0 is constant and N has independent increments). Thus.j 2 (τ ) = P [Xt+τ = j 2 |Xt = i2 ] = (λτ )j−i e−λτ (j−i)! 2 −2λ 2 −2λ 2 −2λ 0 0≤i≤j else . and since Xt and Nt are functions of each other for each t. They also all have mean zero. P [N2 = 2|N1 ≥ 1] = 3λ e /( (2λ) 2e ) = 3 . Thus. Wt ] is obtained by linearly interpolating between Wr and Wt . Poi(λ) random variables. The state space for X is S = {0. j ∈ Z and t. . 2]. CHAPTER 12. 1. Ws . at least one of them falls in the first half of the interval with probabilty 4 3 1 − 1 = 4. −1 −1 Var(Xr ) Cov(Xr . N2 = 2} = P {N1 = 1. Wt ] = E[Ws |Wr . Wr ). 
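The orthogonal polynomials in Problem 3.28 can be regenerated mechanically from the moment formula in part (a). The sketch below runs Gram-Schmidt on the monomials $1, U, \ldots, U^4$ in exact rational arithmetic; polynomials are stored as coefficient lists with the lowest degree first, which is a representation chosen here for convenience.

```python
from fractions import Fraction as F

DEG = 4

def moment(n):
    """E[U^n] for U uniform on [-1, 1]."""
    return F(1, n + 1) if n % 2 == 0 else F(0)

def inner(p, q):
    """E[p(U) q(U)] for coefficient lists p, q (lowest degree first)."""
    return sum(a * b * moment(i + j) for i, a in enumerate(p) for j, b in enumerate(q))

basis = []                                       # orthogonalized versions of 1, U, U^2, ...
for n in range(DEG + 1):
    mono = [F(int(i == n)) for i in range(DEG + 1)]   # the monomial U^n
    poly = mono[:]
    for prev in basis:                           # subtract projection onto each earlier polynomial
        c = inner(mono, prev) / inner(prev, prev)
        poly = [a - c * b for a, b in zip(poly, prev)]
    basis.append(poly)

for poly in basis:
    print(poly)
```

The last list printed holds the coefficients of $U^4 - \frac{6}{7}U^2 + \frac{3}{35}$, and every pair of distinct polynomials in `basis` has zero inner product.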
Therefore.366 random variables. N2 − N1 = 1} + P {N1 = 2. 2 4 2 −2λ (b) P {N1 ≥ 1} = 1 − e−λ . Wt ] = (Cov(Ws . SOLUTIONS TO PROBLEMS P {Wr ≤ Ws ≤ Wt } = P {Ws − Wr ≥ 0. = t−r where we use the fact a b d −b 1 = ad−bc .8 Some Poisson process calculations (a) (Method 1) Given N2 = 2. Cov(Ws . 2 (c) Yes. Thus. the distribution of the locations of the first two jump points are as if they are independently and uniformly distributed over the interval [0. the process X is also Markov. each falls in the second half of the interval with probability 1/2. P [N2 = 2|N1 ≥ 1] = 3λ e /(1 − e−λ ). the three random variables Wr . Wt )) r r Wr = (r. . c d −c a E[Ws |Wr . P {N1 ≥ 1. 4. s) r t Wt (t − s)Wr + (s − r)Wt . 2 2 4 (b) Since W is a Gaussian process. Xt ) Cov(Xt . N2 = 2} + P {N1 = 2.}. and they both fall into the second interval with 1 probability ( 2 )2 = 1 . N2 − N1 = 0} λ2 e−λ −λ 3λ2 e−2λ = (λe−λ )(λe−λ ) + ( )e = 2 2 and P {N2 } = (2λ) 2e . τ ≥ 0. pi2 . For i. N2 = 2} = P {N1 = 1. . . RX (1) = 0. X(4))T . Xn+2 .2]. Think of n as the present time. 4. t) = E[Ys (At−1 Yt−1 + Bt−1 )] = Pk if s = t = k E[At−1 ]E[Ys Yt−1 ] + E[Ys ]E[Bt−1 ] = 0. . 2 2 4 So RZ (s. given the present. −5 0 5 9 (b) As the variables are all mean zero. (e) The variables Y1 . 9 4. Thus. E[X(4)|X(2)] = Cov(X(4). Xn . Z is obtained from the jointly Gaussian processes X and Y by linear operations.X(2)) X(2) = − X(2) . X1 −X0 = X1 = B1 .367 4.1]. Thus. (b) The event is the same as the event that the numbers of counts in the intervals [0. 111. X(4)) X(3) is also independent of (X(2). However. (Given B1 = b. the future is conditionally independent of the past. [1. where σ −4 σ 2 = Var(X(1) − 5Y (2)) = RX (0) − 10RXY (1 − 2) + 25RY (0) = 1 − 10e + 25 = 26 − 5e−4 . Also. Poisson random variables with mean λ. Thus. t) = 0 else. Bk : k ≥ n). (c) The variable X(3) is uncorrelated with (X(2). or 202. are already orthogonal by part (d) (and the fact the variables have . For example. 
the mean function of Z is constant (µZ ≡ 0) and RZ (s. s).10 MMSE prediction for a Gaussian process based on two observations 5 0 −5 9 5 (a) Since RX (0) = 5. are all functions of Xn and (Ak . Then RY (s. Bk : k ≥ n) are independent of X0 .16 A linear evolution equation with random coefficients 2 2 2 2 (a) Pk+1 = E[(Ak Xk + Bk )2 ] = E[A2 Xk ] + 2E[Ak Xk ]E[Bk ] + E[Bk ] = σA Pk + σB . X1 . divided by the answer to part (b). The future values Xn+1 . Since Z is Gaussian and WSS. t) is a function of s − t. 2 4. t) = RXY (t. and RX (2) = − 9 . . (b) Yes. (c) No. and X2 −X1 = A2 B1 +B2 .14 Adding jointly stationary Gaussian processes X(t)+Y (t) (a) RZ (s. But the variables (Ak . so Z is WSS. . −|τ −3| −|τ +3| 1 RZ (τ ) = 4 [2e−|τ | + e 2 + e 2 ]. 9 Var(X(2) T .12 Poisson process probabilities (a) The numbers of arrivals in the disjoint intervals are independent. Since the variables are jointly Gaussian. The probability is thus e−λ ( λ e−λ )e−λ + (λe−λ )3 + ( λ e−λ )e−λ ( λ e−λ ) = 2 2 2 4 2 ( λ + λ3 + λ )e−3λ .) (d) Suppose s and t are integer times with s < t. the probability is (λe−λ )3 = λ3 e−3λ . k (b) Yes. . . t) is a function of s − t only. σA b2 + σB ). RY X (s. X(3)] = − X(2) . so Z is a Gaussian process. t) = E[ X(s)+Y (s) ] = 1 [RX (s − t) + RY (s − t) + RXY (s − t) + RY X (s − t)]. the conditional distribution of A2 B1 + B2 is N (0. and clearly B1 and A2 B1 +B2 2 2 are not independent. RY (s. Thus.3] are 020. it is stationary. Y2 . and 2 2 2 [2. which depends on b. . . 1 1 (c) P [X(1) < 5Y (2) + 1] = P [ X(1)−5Y (2) ≤ σ ] = Φ( σ ). the covariance matrix is 0 5 0 . So E[X(4)|X(2)] = E[X(4)|X(2). or λ2 λ2 λ4 3 2 2 2 /( 2 + λ + 4 ) = 2λ /(2 + 4λ + λ ). 2 4 (c) This is the same as the probability the counts are 020. . E[τ ] = 1 + a1 = 8. For part (a). a3 = 1 + a2 .5} = P {Uk ≤ 0. Thus.20 A random process created by interpolation (a) X t n n+1 t (b) Xt is the sum of two random variables. 
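The closed forms in parts (b) and (c) of Problem 4.12 can be checked against a direct enumeration of the three count patterns. The rate `lam = 1.7` below is an arbitrary value chosen for the check, and `pois` is a helper introduced here.

```python
import math

def pois(k, lam):
    """Poisson(lam) probability mass at k."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 1.7                                       # arbitrary rate for the check
patterns = [(0, 2, 0), (1, 1, 1), (2, 0, 2)]    # counts in [0,1], [1,2], [2,3]
probs = [pois(a, lam) * pois(b, lam) * pois(c, lam) for a, b, c in patterns]

p_b = sum(probs)                                # part (b)
p_c = probs[0] / p_b                            # part (c): pattern 020 given the event

print(p_b, (lam ** 2 / 2 + lam ** 3 + lam ** 4 / 4) * math.exp(-3 * lam))
print(p_c, 2 / (2 + 4 * lam + lam ** 2))
```

Independence of the counts over disjoint intervals is what allows each pattern probability to be written as a product of three Poisson masses.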
Using the first and third 3 3 of these equations to eliminate a1 and a3 from the second equation yields a2 .368 mean zero). . which is uniformly distributed on the interval [0. a2 = 1 + 3 a1 + 1 a3 . Since this depends on 12 P {max0≤t≤10 Xt ≤ 0.5)11 . t) = a +(1−a) for t = n + a. Given Xk . each two-headed line represents a directed edge in each direction and all directed edges have probability 1/3. Therefore.18 A fly on a cube (a)-(b) See the figures.5 for 0 ≤ k ≤ 10} = t. 9. Xk+1 is equal to Xk with probability Xk Xk k and is equal to Xk + 1 with probability 1 − k . which is uniformly distributed on the interval [0. 10). X is not WSS. Yk = Yk for all k ≥ 1. a3 ) = (7. (0. The possible values of Xk are {1. CHAPTER 12. k−1}. 4. . Conditioning on the 2 first time step yields a1 = 1 + 2 a2 . 4. 1 − a]. Thus. Another way to say this is that the one-step transition probabilities for the transition from Xk to Xk+1 are given by i for j = i k i pij = 1 − k for j = i + 1 0 else . a]. (a) 100 110 (b) 1 000 010 101 111 0 1/3 1 2/3 2/3 2 1 1/3 3 001 011 (c) Let ai be the mean time for Y to first reach state zero starting in state i. . . and aUn+1 . and then a1 and a3 .22 Restoring samples (a) Yes. a2 . (1 − a)Ut . The solution is (a1 . the density of Xt is the convolution of the densities of these two variables: 1 a 1 1!a 1 1!a * = 0 1!a 0 a 0 a 1!a 1 (c) (d) 2 2 CX (t. SOLUTIONS TO PROBLEMS 4. 1 vk → 0 as k → ∞. is thus true for k = 2. Hence. so that (Mk ) does not form a k martingale sequence. by the tower property of conditional expectations. πn = λn /(1 + λ + · · · + λB ). the information equivalence of Xk and Mk . . Then. vk ≤ 4k for all k.s. Therefore. and also it holds in distribution. taking the expectation on each side of the equation found in part (d) yields vk+1 = 1 (k + 1)2 k(k − 2)vk + 1 4 1 and the initial condition v2 = 0 holds. πn = λn π0 . 
yields (with some tedious algebra steps not shown) 2 E[Dk+1 |Xk ] = = = Xk + 1 1 2 k − Xk Xk 1 2 Xk + − − k+1 2 k k+1 2 k 1 2 (4k − 8)Xk − (4k − 8)kXk + k(k − 1)2 4k(k + 1)2 1 1 2 k(k − 2)Dk + 2 (k + 1) 4 2 2 (e) Since. πB ) solves πQ = 0. . since Mk is bounded. . suppose that vk ≤ 4k for some k ≥ 2. inply that 1 E[Mk+1 |M2 . λπ0 = π1 . 4(k + 1)2 4(k + 1) 1 So the desired inequality is true for k +1. the convergence also holds in probability. By definition. Mk ] = E[Mk+1 |Mk ] = k+1 (Xk + 1 − Xk ) = Mk . For 1 the purpose of proof by induction. . (b) The equilibrium vector π = (π0 . Thus.24 An M/M/1/B queueing system (a) Q = −λ 1 0 0 0 λ 0 0 0 −(1 + λ) λ 0 0 1 −(1 + λ) λ 0 0 1 −(1 + λ) λ 0 0 1 −1 . Also. and part (b).369 (b) E[Xk+1 |Xk ] = Xk ( Xk ) + (Xk + 1) 1 − Xk = Xk + 1 − Xk . m. . π1 . (d) Using the transition probabilities mentioned in part (a) again. The desired inequality. Since the probabilities must sum to one. k k k (c) The Markov property of X. . Continuing this way yields that πn = λπn−1 for 1 ≤ n ≤ B. . vk ≤ 4k . . (We could also note that. λπ0 − (1 + λ)π1 + π2 = 0. by proof by induction. this means that Mk → 2 as k → ∞.) 4. which with the first equation yields λπ1 = π2 . Thus. vk+1 ≤ = 1 1 1 k(k − 2) + 2 (k + 1) 4k 4 1 1 {k − 2 + 1} ≤ . vk+1 = E[Dk+1 ] = E[E[Dk+1 |Xk ]]. . (iii) R does not have independent increments. then θ could be completely determined. Yk ] = Yk E[Uk+1 ] = Yk . So the increments are not independent. 0 ≤ u ≤ s] = E[Wt3 |Ws ] = E[(Wt − Ws + Ws )3 |Ws ] = 3 3 3 3E[(Wt − Ws )2 ]Ws + Ws = 3(t − s)Ws + Ws = Ws . . P [X2 − X1 = 0|X1 − X0 = −1] = 1 and P [X2 − X1 = 0|X1 − X0 = 1] = 1/2. which would give more information about the future of R. Given Yk . . The process is Markov by its description. because. . For example the increments R(0. Yk ] = Yk E[Uk+1 |Y1 . (ii)R is not a martingale. To check for the martingale property in discrete time.5 × 0 + 0. for example. is independent of Y0 . But it does. . 
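The equilibrium formula of Problem 4.24(b) can be checked with exact arithmetic. The following is a minimal sketch, not part of the original solution: the function names `mm1b_generator`, `mm1b_equilibrium`, and `vec_mat`, and the sample values $\lambda = 3/2$, $B = 4$, are assumptions chosen for illustration. It builds the generator with unit service rate and confirms that $\pi_n = \lambda^n/(1 + \lambda + \cdots + \lambda^B)$ satisfies $\pi Q = 0$.

```python
from fractions import Fraction


def mm1b_generator(lam, B):
    """Generator matrix of an M/M/1/B queue with arrival rate lam, service rate 1."""
    Q = [[Fraction(0)] * (B + 1) for _ in range(B + 1)]
    for n in range(B + 1):
        if n < B:
            Q[n][n + 1] = lam          # arrival: n -> n + 1
        if n > 0:
            Q[n][n - 1] = Fraction(1)  # service completion: n -> n - 1
        Q[n][n] = -sum(Q[n])           # each row of a generator sums to zero
    return Q


def mm1b_equilibrium(lam, B):
    """Candidate equilibrium vector pi_n = lam^n / (1 + lam + ... + lam^B)."""
    weights = [lam ** n for n in range(B + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def vec_mat(pi, Q):
    """Row vector times matrix."""
    return [sum(pi[i] * Q[i][j] for i in range(len(pi))) for j in range(len(Q[0]))]


# Hypothetical sample values (the solution leaves lam general and uses B = 4 in part (a)).
lam = Fraction(3, 2)
B = 4
Q = mm1b_generator(lam, B)
pi = mm1b_equilibrium(lam, B)
residual = vec_mat(pi, Q)  # should be the zero vector
```

Exact fractions are used so that the check $\pi Q = 0$ is an equality test rather than a floating-point tolerance test.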
4.26 Identification of special properties of two discrete-time processes (version 2)
(a) (yes, yes, no). The process is Markov by its description. Think of a time $k$ as the present time. Given the number of cells alive at the present time $k$ (i.e., given $X_k$) the future evolution does not depend on the past. To check for the martingale property in discrete time, it suffices to check that $E[X_{k+1}|X_1, \ldots, X_k] = X_k$. But this equality is true because for each cell alive at time $k$, the expected number of cells alive at time $k+1$ is one ($= 0.5 \times 0 + 0.5 \times 2$). The process does not have independent increments, because, for example, $P[X_2 - X_1 = 0|X_1 - X_0 = -1] = 1$ and $P[X_2 - X_1 = 0|X_1 - X_0 = 1] = 1/2$. So $X_2 - X_1$ is not independent of $X_1 - X_0$.
(b) (yes, yes, no). Let $k$ be the present time. Given $Y_k$, the future values are all determined by $Y_k, U_{k+1}, U_{k+2}, \ldots$. Since $U_{k+1}, U_{k+2}, \ldots$ is independent of $Y_0, \ldots, Y_k$, the future of $Y$ is conditionally independent of the past, given the present value $Y_k$. So $Y$ is Markov. The process $Y$ is a martingale because
\[ E[Y_{k+1}|Y_1, \ldots, Y_k] = E[U_{k+1}Y_k|Y_1, \ldots, Y_k] = Y_kE[U_{k+1}|Y_1, \ldots, Y_k] = Y_kE[U_{k+1}] = Y_k. \]
The process $Y$ does not have independent increments, because, for example, $Y_1 - Y_0 = U_1 - 1$ is clearly not independent of $Y_2 - Y_1 = U_1(U_2 - 1)$. (To argue this further we could note that the conditional density of $Y_2 - Y_1$ given $Y_1 - Y_0 = y - 1$ is the uniform distribution over the interval $[-y, y]$, which depends on $y$.)

4.28 Identification of special properties of two continuous-time processes (version 2)
(a) (yes, no, no). $Z$ is Markov because $W$ is Markov and the mapping from $W_t$ to $Z_t$ is invertible. So $W_t$ and $Z_t$ have the same information. To see if $W^3$ is a martingale we suppose $s \le t$ and use the independent increment property of $W$ to get:
\[ E[W_t^3|W_u, 0 \le u \le s] = E[W_t^3|W_s] = E[(W_t - W_s + W_s)^3|W_s] = 3E[(W_t - W_s)^2]W_s + W_s^3 = 3(t-s)W_s + W_s^3 \ne W_s^3. \]
Therefore, $W^3$ is not a martingale. If the increments were independent, then, since $W_s^3$ is the increment $W_s^3 - W_0^3$, the conditional mean $E[W_t^3 - W_s^3|W_s] = 3(t-s)W_s$ could not depend on $W_s$. But it does. So the increments are not independent.
(b) (no, no, no). (i) $R$ is not Markov because knowing $R_t$ for a fixed $t$ does not quite determine $\Theta$; it only narrows $\Theta$ down to one of two values. But for one of these values $R$ has a positive derivative at $t$, and for the other $R$ has a negative derivative at $t$. If the past of $R$ just before $t$ were also known, then $\Theta$ could be completely determined, which would give more information about the future of $R$. So $R$ is not Markov. (ii) $R$ is not a martingale. For example, observing $R$ on a finite interval totally determines $R$. So $E[R_t|(R_u, 0 \le u \le s)] = R_t$, and if $s - t$ is not an integer, $R_s \ne R_t$. (iii) $R$ does not have independent increments. For example the increments $R(0.5) - R(0)$ and $R(1.5) - R(1)$ are identical random variables, not independent random variables.

4.30 Moving balls
(a) The states of the "relative-position process" can be taken to be 111, 12, and 21. The state 111 means that the balls occupy three consecutive positions, the state 12 means that one ball is in the left most occupied position and the other two balls are one position to the right of it, and the state 21 means there are two balls in the leftmost occupied position and one ball one position to the right of them. With the states in the order 111, 12, 21, the one-step transition probability matrix is given by
\[ P = \begin{pmatrix} 0.5 & 0.5 & 0 \\ 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \end{pmatrix}. \]
(b) The equilibrium distribution $\pi$ of the process is the probability vector satisfying $\pi = \pi P$, from which we find $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. That is, all three states are equally likely in equilibrium.
(c) Over a long period of time, we expect the process to be in each of the states about a third of the time. After each visit to states 111 or 12, the left-most position of the configuration advances one position to the right. After a visit to state 21, the next state is 111 or 12, and the left-most position of the configuration does not advance. Thus, after 2/3 of the slots there will be an advance. So the long-term speed of the balls is 2/3. Another approach is to compute the mean distance the moved ball travels in each slot, and divide by three.
(d) The same states can be used to track the relative positions of the balls as in discrete time. The generator matrix is given by
\[ Q = \begin{pmatrix} -0.5 & 0.5 & 0 \\ 0 & -1 & 1 \\ 0.5 & 0.5 & -1 \end{pmatrix}. \]
(Note that if the state is 111 and if the leftmost ball is moved to the rightmost position, the state of the relative-position process is 111 the entire time. That is, the relative-position process misses such jumps in the actual configuration process.) The equilibrium distribution can be determined by solving the equation $\pi Q = 0$, and the solution is found to be $\pi = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ as before. When the relative-position process is in states 111 or 12, the leftmost position of the actual configuration advances one position to the right at rate one, while when the relative-position process is in state 21, the leftmost position of the actual configuration cannot directly move right. The long-term average speed is thus 2/3, as in the discrete-time case.

4.32 Mean hitting time for a continuous-time, discrete-space Markov process
\[ Q = \begin{pmatrix} -1 & 1 & 0 \\ 10 & -11 & 1 \\ 0 & 5 & -5 \end{pmatrix} \]
and $\pi Q = 0$ for $\pi = (\frac{50}{56}, \frac{5}{56}, \frac{1}{56})$. Let $a_i$ denote the mean time to reach state 3 starting from state $i$. Consider $X_h$ to get
\[ a_1 = h + (1-h)a_1 + ha_2 + o(h) \]
\[ a_2 = h + 10ha_1 + (1-11h)a_2 + o(h) \]
or equivalently $1 - a_1 + a_2 + \frac{o(h)}{h} = 0$ and $1 + 10a_1 - 11a_2 + \frac{o(h)}{h} = 0$. Let $h \to 0$ to get $1 - a_1 + a_2 = 0$ and $1 + 10a_1 - 11a_2 = 0$, or $a_1 = 12$ and $a_2 = 11$.

4.34 Poisson splitting
This is basically the previous problem in reverse. This solution is based directly on the definition of a Poisson process, but there are other valid approaches. Let $X$ be a Poisson random variable with mean $\lambda$, and let each of $X$ individuals be independently assigned a type, with type $i$ having probability $p_i$, for some probability distribution $p_1, \ldots, p_K$. Let $X_i$ denote the number assigned type $i$. Then,
\[ P(X_1 = i_1, X_2 = i_2, \ldots, X_K = i_K) = P(X = i_1 + \cdots + i_K)\frac{(i_1 + \cdots + i_K)!}{i_1!\,i_2! \cdots i_K!}p_1^{i_1} \cdots p_K^{i_K} = \prod_{j=1}^K \frac{e^{-\lambda_j}\lambda_j^{i_j}}{i_j!} \]
where $\lambda_i = \lambda p_i$. Thus, independent splitting of a Poisson number of individuals yields that the number of each type $i$ is Poisson, with mean $\lambda_i = \lambda p_i$, and they are independent of each other.
Now suppose that $N$ is a rate $\lambda$ Poisson process, and that $N_i$ is the process of type $i$ points, given independent splitting of $N$ with split distribution $p_1, \ldots, p_K$. By the definition of a Poisson process, the following random variables are independent, with the $i$th having the $Poi(\lambda(t_i - t_{i-1}))$ distribution:
\[ N(t_1) - N(t_0) \quad N(t_2) - N(t_1) \quad \cdots \quad N(t_p) - N(t_{p-1}) \qquad (12.5) \]
Suppose each column of the following array is obtained by independent splitting of the corresponding variable in (12.5).
\[ \begin{array}{cccc} N_1(t_1) - N_1(t_0) & N_1(t_2) - N_1(t_1) & \cdots & N_1(t_p) - N_1(t_{p-1}) \\ N_2(t_1) - N_2(t_0) & N_2(t_2) - N_2(t_1) & \cdots & N_2(t_p) - N_2(t_{p-1}) \\ \vdots & \vdots & \cdots & \vdots \\ N_K(t_1) - N_K(t_0) & N_K(t_2) - N_K(t_1) & \cdots & N_K(t_p) - N_K(t_{p-1}) \end{array} \qquad (12.6) \]
Then by the splitting property of Poisson random variables described above, we get that all elements of the array (12.6) are independent, with the appropriate means. By definition, the $i$th process $N_i$ is a rate $\lambda p_i$ Poisson process for each $i$, and because of the independence of the rows of the array, the $K$ processes $N_1, \ldots, N_K$ are mutually independent.

4.36 Some orthogonal martingales based on Brownian motion
Throughout the solution of this problem, let $0 < s < t$, and let $Y = W_t - W_s$. Note that $Y$ is independent of $W_s$ and it has the $N(0, t-s)$ distribution.
(a) $E[M_t|W_s] = M_sE\big[\frac{M_t}{M_s}\big|W_s\big]$. Now $\frac{M_t}{M_s} = \exp\big(\theta Y - \frac{\theta^2(t-s)}{2}\big)$. Therefore, $E\big[\frac{M_t}{M_s}\big|W_s\big] = E\big[\frac{M_t}{M_s}\big] = 1$. Thus $E[M_t|W_s] = M_s$, so by the hint, $M$ is a martingale.
(b) $W_t^2 - t = (W_s + Y)^2 - s - (t-s) = W_s^2 - s + 2W_sY + Y^2 - (t-s)$, but $E[2W_sY|W_s] = 2W_sE[Y|W_s] = 2W_sE[Y] = 0$, and $E[Y^2 - (t-s)|W_s] = E[Y^2 - (t-s)] = 0$. It follows that $E[2W_sY + Y^2 - (t-s)|W_s] = 0$, so the martingale property follows from the hint.
Similarly,
\[ W_t^3 - 3tW_t = (Y + W_s)^3 - 3(s + t - s)(Y + W_s) = W_s^3 - 3sW_s + 3W_s^2Y + 3W_s(Y^2 - (t-s)) + Y^3 - 3tY. \]
Because $Y$ is independent of $W_s$ and because $E[Y] = E[Y^2 - (t-s)] = E[Y^3] = 0$, it follows that $E[3W_s^2Y + 3W_s(Y^2 - (t-s)) + Y^3 - 3tY|W_s] = 0$, so the martingale property follows from the hint.
(c) Fix distinct nonnegative integers $m$ and $n$. Then
\[ E[M_n(s)M_m(t)] = E[E[M_n(s)M_m(t)|W_s]] \quad \text{(property of cond. expectation)} \]
\[ = E[M_n(s)E[M_m(t)|W_s]] \quad \text{(property of cond. expectation)} \]
\[ = E[M_n(s)M_m(s)] \quad \text{(martingale property)} \]
\[ = 0 \quad \text{(orthogonality of variables at a fixed time).} \]
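The factorization at the heart of the Poisson splitting argument in Problem 4.34 can be spot-checked numerically. The sketch below is illustrative only: the helper names `poisson_pmf`, `split_joint_pmf`, and `product_pmf`, and the sample values for $\lambda$, the split distribution, and the counts, are assumptions rather than anything taken from the notes. It evaluates the joint pmf as "Poisson total times multinomial assignment" and compares it with the product of Poisson pmfs with means $\lambda p_i$.

```python
import math


def poisson_pmf(k, mean):
    """P{X = k} for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def split_joint_pmf(counts, lam, probs):
    """P(X_1 = i_1, ..., X_K = i_K): Poisson(lam) total, then multinomial typing."""
    total = sum(counts)
    multinomial = math.factorial(total)
    for i in counts:
        multinomial //= math.factorial(i)  # stays an integer at every step
    weight = 1.0
    for i, p in zip(counts, probs):
        weight *= p ** i
    return poisson_pmf(total, lam) * multinomial * weight


def product_pmf(counts, lam, probs):
    """Product of independent Poisson(lam * p_i) pmfs."""
    result = 1.0
    for i, p in zip(counts, probs):
        result *= poisson_pmf(i, lam * p)
    return result


# Hypothetical sample values for illustration.
lam = 2.5
probs = [0.2, 0.5, 0.3]
counts = (1, 3, 2)
joint = split_joint_pmf(counts, lam, probs)
product = product_pmf(counts, lam, probs)
```

The two quantities agree up to floating-point rounding for any choice of counts, which is the identity used to conclude that the split counts are independent Poisson variables.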
5.2 A variance estimation problem with Poisson observation
(a)
\[ P\{N = n\} = E[P[N = n|X]] = E\Big[\frac{(X^2)^ne^{-X^2}}{n!}\Big] = \int_{-\infty}^{\infty}\frac{x^{2n}e^{-x^2}}{n!}\,\frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\,dx. \]
(b) To arrive at a simple answer, we could set the derivative of $P\{N = n\}$ with respect to $\sigma^2$ equal to zero either before or after simplifying. Here we simplify first, using the fact that if $X$ is a $N(0, \tilde\sigma^2)$ random variable, then $E[X^{2n}] = \frac{\tilde\sigma^{2n}(2n)!}{n!\,2^n}$. Let $\tilde\sigma^2$ be such that $\frac{1}{2\tilde\sigma^2} = 1 + \frac{1}{2\sigma^2}$, or equivalently, $\tilde\sigma^2 = \frac{\sigma^2}{1 + 2\sigma^2}$. Then the above integral can be written as follows:
\[ P\{N = n\} = \frac{\tilde\sigma}{\sigma}\int_{-\infty}^{\infty}\frac{x^{2n}}{n!}\,\frac{e^{-\frac{x^2}{2\tilde\sigma^2}}}{\sqrt{2\pi\tilde\sigma^2}}\,dx = \frac{c_1\tilde\sigma^{2n+1}}{\sigma} = \frac{c_1\sigma^{2n}}{(1 + 2\sigma^2)^{\frac{2n+1}{2}}}, \]
where the constant $c_1$ depends on $n$ but not on $\sigma^2$. Taking the logarithm of $P\{N = n\}$ and calculating the derivative with respect to $\sigma^2$, we find that $P\{N = n\}$ is maximized at $\sigma^2 = n$. That is, $\hat\sigma^2_{ML}(n) = n$.

5.4 Transformation of estimators and estimators of transformations
(a) Yes, because the transformation is invertible.
(b) Yes, because the transformation is invertible.
(c) Yes, because the transformation is linear, the pdf of $3 + 5\Theta$ is a scaled version of the pdf of $\Theta$.
(d) No, because the transformation is not linear.
(e) Yes, because the MMSE estimator is given by the conditional expectation, which is linear. That is, $3 + 5E[\Theta|Y] = E[3 + 5\Theta|Y]$.
(f) No. Typically $E[\Theta^3|Y] \ne E[\Theta|Y]^3$.

5.6 Finding a most likely path
Finding the path $z$ to maximize the posterior probability given the sequence 021201 is the same as maximizing $p_{cd}(y, z|\theta)$. Due to the form of the parameter $\theta = (\pi, A, B)$, for any path $z = (z_1, \ldots, z_6)$, $p_{cd}(y, z|\theta)$ has the form $c^6a^i$ for some $i \ge 0$. Similarly, the variable $\delta_j(t)$ has the form $c^ta^i$ for some $i \ge 0$. Since $a < 1$, larger values for $p_{cd}(y, z|\theta)$ and $\delta_j(t)$ correspond to smaller values of $i$. Rather than keeping track of products, such as $a^ia^j$, we keep track of the exponents of the products, which for $a^ia^j$ would be $i + j$. Thus, the problem at hand is equivalent to finding a path from left to right in the trellis indicated in Figure 12.3(a) with minimum weight, where the weight of a path is the sum of all the numbers indicated on the vertices and edges of the graph. Figure 12.3(b) shows the result of running the Viterbi algorithm. The value of $\delta_j(t)$ has the form $c^ta^i$, where $i$ is indicated by the numbers in boxes. Of the two paths reaching the final states of the trellis, the upper one, namely the path 000000, has the smaller exponent, 18, and therefore, the larger probability, namely $c^6a^{18}$. Therefore, 000000 is the MAP path.
[Figure 12.3: Trellis diagram for finding a MAP path. (a) The trellis, with weights on the vertices and edges, for the observations from $t = 1$ to $t = 6$. (b) The result of the Viterbi algorithm, with the exponents for the $\delta$'s shown in boxes.]

5.8 Specialization of Baum-Welch algorithm for no hidden data
(a) Suppose the sequence $y = (y_1, \ldots, y_T)$ is observed. If $\theta^{(0)} = \theta = (\pi, A, B)$ is such that $B$ is the identity matrix, and all entries of $\pi$ and $A$ are nonzero, then directly by the definitions (without using the $\alpha$'s and $\beta$'s):
\[ \gamma_i(t) = P[Z_t = i|Y_1 = y_1, \ldots, Y_T = y_T, \theta] = I_{\{y_t = i\}} \]
\[ \xi_{ij}(t) = P[Z_t = i, Z_{t+1} = j|Y_1 = y_1, \ldots, Y_T = y_T, \theta] = I_{\{(y_t, y_{t+1}) = (i,j)\}} \]
Thus, (5.27)-(5.29) for the first iteration, $t = 0$, become
\[ \pi_i^{(1)} = I_{\{y_1 = i\}}, \text{ i.e. the probability vector for } S \text{ with all mass on } y_1, \]
\[ a_{i,j}^{(1)} = \frac{\text{number of } (i,j) \text{ transitions observed}}{\text{number of visits to } i \text{ up to time } T-1}, \]
\[ b_{il}^{(1)} = \frac{\text{number of times the state is } i \text{ and the observation is } l}{\text{number of times the state is } i}. \]
It is assumed that $B$ is the identity matrix, so that each time the state is $i$ the observation should also be $i$. Thus, $b_{il}^{(1)} = I_{\{i = l\}}$ for any state $i$ that is visited. That is consistent with the assumption that $B$ is the identity matrix. (Alternatively, since $B$ is fixed to be the identity matrix, we could just work with estimating $\pi$ and $A$, and simply not consider $B$ as part of the parameter to be estimated.) The next iteration will give the same values of $\pi$ and $A$. Thus, the Baum-Welch algorithm converges in one iteration to the final value $\theta^{(1)} = (\pi^{(1)}, A^{(1)}, B^{(1)})$ already described. Note that, by Lemma 5.1.7, $\theta^{(1)}$ is the ML estimate.
(b) In view of part (a), the ML estimates are $\pi = (1, 0)$ and
\[ A = \begin{pmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix}. \]
This estimator of $A$ results from the fact that, of the first 21 times, the state was zero 12 times, and 8 of those 12 times the next state was a zero. So $a_{00} = 8/12 = 2/3$ is the ML estimate. Similarly, the ML estimate of $a_{11}$ is 6/9, which simplifies to 2/3.

5.10 Baum-Welch saddlepoint
It turns out that $\pi^{(k)} = \pi^{(0)}$ and $A^{(k)} = A^{(0)}$, for each $k \ge 0$. Also, $B^{(k)} = B^{(1)}$ for each $k \ge 1$, where $B^{(1)}$ is the matrix with identical rows, such that each row of $B^{(1)}$ is the empirical distribution of the observation sequence. For example, if the observations are binary valued, and if there are $T = 100$ observations, of which 37 observations are zero and 63 are 1, then each row of $B^{(1)}$ would be $(0.37, 0.63)$. Thus, the EM algorithm converges in one iteration, and unless $\theta^{(0)}$ happens to be a local maximum or local minimum, the EM algorithm converges to an inflection point of the likelihood function.
One intuitive explanation for this assertion is that since all the rows of $B^{(0)}$ are the same, then the observation sequence is initially believed to be independent of the state sequence, and the state process is initially believed to be stationary. Hence, even if there is, for example, notable time variation in the observed data sequence, there is no way to change beliefs in a particular direction in order to increase the likelihood. In real computer experiments, the algorithm may still eventually reach a near maximum likelihood estimate, due to round-off errors in the computations which allow the algorithm to break away from the inflection point.
The assertion can be proved by use of the update equations for the Baum-Welch algorithm. It is enough to prove the assertion for the first iteration only, for then it follows for all iterations by induction.
Since the rows of $B^{(0)}$ are all the same, we write $b_l$ to denote $b_{il}^{(0)}$ for an arbitrary value of $i$. By induction on $t$, we find $\alpha_i(t) = b_{y_1} \cdots b_{y_t}\pi_i^{(0)}$ and $\beta_j(t) = b_{y_{t+1}} \cdots b_{y_T}$. In particular, $\beta_j(t)$ does not depend on $j$. So the vector $(\alpha_i\beta_i : i \in S)$ is proportional to $\pi^{(0)}$, and therefore $\gamma_i(t) = \pi_i^{(0)}$. Similarly, $\xi_{i,j}(t) = P[Z_t = i, Z_{t+1} = j|y, \theta^{(0)}] = \pi_i^{(0)}a_{i,j}^{(0)}$. By (5.27), $\pi^{(1)} = \pi^{(0)}$, and by (5.28), $A^{(1)} = A^{(0)}$. Finally, (5.29) gives
\[ b_{i,l}^{(1)} = \frac{\sum_{t=1}^T\pi_iI_{\{y_t = l\}}}{T\pi_i} = \frac{\text{number of times } l \text{ is observed}}{T}. \]

5.12 Constraining the Baum-Welch algorithm
A quite simple way to deal with this problem is to take the initial parameter $\theta^{(0)} = (\pi, A, B)$ in the Baum-Welch algorithm to be such that $a_{ij} > 0$ if and only if $\bar{a}_{ij} = 1$ and $b_{il} > 0$ if and only if $\bar{b}_{il} = 1$, where $\bar{A}$ and $\bar{B}$ denote the given zero-one constraint matrices. (These constraints are added in addition to the usual constraints that $\pi$, $A$, and $B$ have the appropriate dimensions, with $\pi$ and each row of $A$ and $B$ being probability vectors.) After all, it makes sense for the initial parameter value to respect the constraint. And if it does, then the same constraint will be satisfied after each iteration, and no changes are needed to the algorithm itself.
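The maximization step in Problem 5.2(b) can be confirmed numerically. The sketch below is an illustration under stated assumptions, not the method of the notes: it drops the constant $\log c_1$, writes $v$ for $\sigma^2$, and uses a plain grid search (the names `log_likelihood` and `grid_argmax` and the grid spacing 0.01 are hypothetical choices). The function being maximized is $n\log v - (n + \frac{1}{2})\log(1 + 2v)$, which is the logarithm of $\sigma^{2n}/(1 + 2\sigma^2)^{(2n+1)/2}$.

```python
import math


def log_likelihood(v, n):
    """log P{N = n} as a function of v = sigma^2, up to an additive constant."""
    return n * math.log(v) - (n + 0.5) * math.log(1.0 + 2.0 * v)


def grid_argmax(n, step_count=5000, scale=100):
    """Maximize over the grid v = 1/scale, 2/scale, ..., step_count/scale."""
    grid = [k / scale for k in range(1, step_count + 1)]
    return max(grid, key=lambda v: log_likelihood(v, n))


# Grid maximizers for n = 1, ..., 5; each should sit at v = n.
estimates = {n: grid_argmax(n) for n in range(1, 6)}
```

The case $n = 0$ is left out because the likelihood is then decreasing in $\sigma^2$, so the maximizer is at the boundary rather than at an interior grid point.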
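6.2 A two station pipeline in continuous time
(a) $S = \{00, 01, 10, 11\}$.
(b) [Figure: transition rate diagram on the states 00, 01, 10, 11, with rates $\lambda$, $\mu_1$, and $\mu_2$.]
(c)
\[ Q = \begin{pmatrix} -\lambda & 0 & \lambda & 0 \\ \mu_2 & -\mu_2 - \lambda & 0 & \lambda \\ 0 & \mu_1 & -\mu_1 & 0 \\ 0 & 0 & \mu_2 & -\mu_2 \end{pmatrix}. \]
(d) $\eta = (\pi_{00} + \pi_{01})\lambda = (\pi_{01} + \pi_{11})\mu_2 = \pi_{10}\mu_1$. If $\lambda = \mu_1 = \mu_2 = 1.0$ then $\pi = (0.2, 0.2, 0.4, 0.2)$ and $\eta = 0.4$.
(e) Let $\tau = \min\{t \ge 0 : X(t) = 00\}$, and define $h_s = E[\tau|X(0) = s]$, for $s \in S$. We wish to find $h_{11}$.
\[ h_{00} = 0, \qquad h_{01} = \frac{1}{\mu_2 + \lambda} + \frac{\mu_2h_{00}}{\mu_2 + \lambda} + \frac{\lambda h_{11}}{\mu_2 + \lambda}, \qquad h_{10} = \frac{1}{\mu_1} + h_{01}, \qquad h_{11} = \frac{1}{\mu_2} + h_{10}. \]
For $\lambda = \mu_1 = \mu_2 = 1.0$ this yields $(h_{00}, h_{01}, h_{10}, h_{11}) = (0, 3, 4, 5)$. Thus, $h_{11} = 5$ is the required answer.

6.4 A simple Poisson process calculation
Suppose $0 < s < t$ and $0 \le i \le k$.
\[ P[N(s) = i|N(t) = k] = \frac{P[N(s) = i, N(t) = k]}{P[N(t) = k]} = \Big(\frac{e^{-\lambda s}(\lambda s)^i}{i!}\Big)\Big(\frac{e^{-\lambda(t-s)}(\lambda(t-s))^{k-i}}{(k-i)!}\Big)\Big(\frac{e^{-\lambda t}(\lambda t)^k}{k!}\Big)^{-1} = \binom{k}{i}\Big(\frac{s}{t}\Big)^i\Big(\frac{t-s}{t}\Big)^{k-i}. \]
That is, given $N(t) = k$, the conditional distribution of $N(s)$ is binomial. This could have been deduced with no calculation, using the fact that given $N(t) = k$, the locations of the $k$ points are uniformly and independently distributed on the interval $[0, t]$.

6.6 A mean hitting time problem
(a) [Figure: transition rate diagram on the states 0, 1, 2.]
$\pi Q = 0$ implies $\pi = (\frac{2}{7}, \frac{2}{7}, \frac{3}{7})$.
(b) Clearly $a_1 = 0$. Condition on the first step. The initial holding time in state $i$ has mean $-\frac{1}{q_{ii}}$ and the next state is $j$ with probability $p^J_{ij} = \frac{-q_{ij}}{q_{ii}}$. Thus
\[ \begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} -\frac{1}{q_{00}} \\ -\frac{1}{q_{22}} \end{pmatrix} + \begin{pmatrix} 0 & p^J_{02} \\ p^J_{20} & 0 \end{pmatrix}\begin{pmatrix} a_0 \\ a_2 \end{pmatrix}. \]
Solving yields $\begin{pmatrix} a_0 \\ a_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1.5 \end{pmatrix}$.
(c) Clearly $\alpha_2(t) = 0$ for all $t$.
\[ \alpha_0(t+h) = \alpha_0(t)(1 + q_{00}h) + \alpha_1(t)q_{10}h + o(h) \]
\[ \alpha_1(t+h) = \alpha_0(t)q_{01}h + \alpha_1(t)(1 + q_{11}h) + o(h) \]
Subtract $\alpha_i(t)$ from each side, divide by $h$, and let $h \to 0$ to yield
\[ \Big(\frac{\partial\alpha_0}{\partial t}, \frac{\partial\alpha_1}{\partial t}\Big) = (\alpha_0, \alpha_1)\begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix} \]
with the initial condition $(\alpha_0(0), \alpha_1(0)) = (1, 0)$. (Note: the matrix involved here is the $Q$ matrix with the row and column for state 2 removed.)
(d) Similarly,
\[ \beta_0(t-h) = (1 + q_{00}h)\beta_0(t) + q_{01}h\beta_1(t) + o(h) \]
\[ \beta_1(t-h) = q_{10}h\beta_0(t) + (1 + q_{11}h)\beta_1(t) + o(h) \]
Subtract $\beta_i(t)$'s, divide by $h$ and let $h \to 0$ to get:
\[ \begin{pmatrix} -\frac{\partial\beta_0}{\partial t} \\ -\frac{\partial\beta_1}{\partial t} \end{pmatrix} = \begin{pmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \quad \text{with} \quad \begin{pmatrix} \beta_0(t_f) \\ \beta_1(t_f) \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]

6.8 Markov model for a link with resets
(a) Let $S = \{0, 1, 2, 3\}$, where the state is the number of packets passed since the last reset.
[Figure: transition rate diagram on the states 0, 1, 2, 3, with rate $\lambda$ from each state $i < 3$ to $i+1$ and rate $\mu$ from each of the states 1, 2, 3 to state 0.]
(b) By the PASTA property, the dropping probability is $\pi_3$. We can find the equilibrium distribution $\pi$ by solving the equation $\pi Q = 0$. The balance equation for state 0 is $\lambda\pi_0 = \mu(1 - \pi_0)$ so that $\pi_0 = \frac{\mu}{\lambda + \mu}$. The balance equation for state $i \in \{1, 2\}$ is $\lambda\pi_{i-1} = (\lambda + \mu)\pi_i$, so that $\pi_1 = \pi_0\big(\frac{\lambda}{\lambda + \mu}\big)$ and $\pi_2 = \pi_0\big(\frac{\lambda}{\lambda + \mu}\big)^2$. Finally, $\lambda\pi_2 = \mu\pi_3$ so that $\pi_3 = \pi_0\big(\frac{\lambda}{\lambda + \mu}\big)^2\frac{\lambda}{\mu} = \frac{\lambda^3}{(\lambda + \mu)^3}$. The dropping probability is $\pi_3 = \frac{\lambda^3}{(\lambda + \mu)^3}$. (This formula for $\pi_3$ can be deduced with virtually no calculation from the properties of merged Poisson processes. Fix a time $t$. Each event is a packet arrival with probability $\frac{\lambda}{\lambda + \mu}$ and is a reset otherwise. The types of different events are independent. Finally, $\pi_3(t)$ is the probability that the last three events before time $t$ were arrivals. The formula follows.)

6.10 A queue with decreasing service rate
(a) [Figure: transition rate diagram with arrival rate $\lambda$ in every state, service rate $\mu$ in states $1, \ldots, K$, and service rate $\mu/2$ in states above $K$; sketch of a sample path $X(t)$ versus $t$ relative to the level $K$.]
(b) $S_2 = \sum_{k=0}^{\infty}\big(\frac{\mu}{2\lambda}\big)^k2^{k \wedge K}$, where $k \wedge K = \min\{k, K\}$. Thus, if $\lambda < \frac{\mu}{2}$ then $S_2 = +\infty$ and the process is recurrent. $S_1 = \sum_{k=0}^{\infty}\big(\frac{2\lambda}{\mu}\big)^k2^{-k \wedge K}$, so if $\lambda < \frac{\mu}{2}$ then $S_1 < +\infty$ and the process is positive recurrent. In this case, $\pi_k = \big(\frac{2\lambda}{\mu}\big)^k2^{-k \wedge K}\pi_0$, where
\[ \pi_0 = \frac{1}{S_1} = \Big[\frac{1 - (\lambda/\mu)^K}{1 - (\lambda/\mu)} + \frac{(\lambda/\mu)^K}{1 - (2\lambda/\mu)}\Big]^{-1}. \]
(c) If $\lambda = \frac{2\mu}{3}$, the queue appears to be stable until it fluctuates above $K$. Eventually the queue length will grow to infinity at rate $\lambda - \frac{\mu}{2} = \frac{\mu}{6}$. See the figure above.

6.12 An M/M/1 queue with impatient customers
(a) [Figure: transition rate diagram with arrival rate $\lambda$ in every state and departure rate $\mu + (k-1)\alpha$ from state $k \ge 1$.]
$p_k = \frac{c\lambda^k}{\mu(\mu + \alpha) \cdots (\mu + (k-1)\alpha)}$ where $c$ is chosen so that the $p_k$'s sum to one.
(b) The process is positive recurrent for all $\lambda$, $\mu$ if $\alpha > 0$, and $p_D$ is the mean departure rate by defecting customers, divided by the mean arrival rate $\lambda$. Thus,
\[ p_D = \frac{1}{\lambda}\sum_{k=1}^{\infty}p_k(k-1)\alpha. \]
(c) If $\alpha = \mu$, $p_k = \frac{c\lambda^k}{k!\,\mu^k} = \frac{c\rho^k}{k!}$. Therefore, $(p_k : k \ge 0)$ is the Poisson distribution with mean $\rho$. Furthermore,
\[ p_D = \frac{1}{\lambda}\sum_{k=1}^{\infty}p_k(k-1)\alpha = \frac{\rho - 1 + e^{-\rho}}{\rho} \to \begin{cases} 1 & \text{as } \rho \to \infty \\ 0 & \text{as } \rho \to 0 \end{cases} \]
where l'Hôpital's rule can be used to find the limit as $\rho \to 0$.

6.14 A queue with blocking
(a) [Figure: transition rate diagram on the states 0, 1, ..., 5 with arrival rate $\lambda$ and service rate $\mu$.]
$\pi_k = \frac{\rho^k}{1 + \rho + \rho^2 + \rho^3 + \rho^4 + \rho^5} = \frac{\rho^k(1 - \rho)}{1 - \rho^6}$ for $0 \le k \le 5$.
(b) $p_B = \pi_5$ by the PASTA property.
(c) $W = N_W/(\lambda(1 - p_B))$ where $N_W = \sum_{k=1}^5(k-1)\pi_k$. Alternatively, $W = N/(\lambda(1 - p_B)) - \frac{1}{\mu}$ (i.e. $W$ is equal to the mean time in system minus the mean time in service).
(d) $\pi_0 = \frac{1}{\lambda(\text{mean cycle time for visits to state zero})} = \frac{1}{\lambda(1/\lambda + \text{mean busy period duration})}$. Therefore, the mean busy period duration is given by $\frac{1}{\lambda}\big[\frac{1}{\pi_0} - 1\big] = \frac{\rho - \rho^6}{\lambda(1 - \rho)} = \frac{1 - \rho^5}{\mu(1 - \rho)}$.

6.16 On two distributions seen by customers
[Figure: sample path of $N(t)$ versus $t$, crossing back and forth between the levels $k$ and $k+1$.]
(a) As can be seen in the picture, between any two transitions from state $k$ to $k+1$ there is a transition from state $k+1$ to $k$, and vice versa. Thus, the number of transitions of one type is within one of the number of transitions of the other type. This establishes that $|D(k,t) - R(k,t)| \le 1$ for all $k$.
(b) Observe that
\[ \Big|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\Big| \le \Big|\frac{D(k,t)}{\alpha_t} - \frac{R(k,t)}{\alpha_t}\Big| + \Big|\frac{R(k,t)}{\alpha_t} - \frac{R(k,t)}{\delta_t}\Big| \le \frac{1}{\alpha_t} + \frac{R(k,t)}{\alpha_t}\Big|1 - \frac{\alpha_t}{\delta_t}\Big| \le \frac{1}{\alpha_t} + \Big|1 - \frac{\alpha_t}{\delta_t}\Big| \to 0 \text{ as } t \to \infty. \]
Thus, $\frac{D(k,t)}{\alpha_t}$ and $\frac{R(k,t)}{\delta_t}$ have the same limits, if the limits of either exists.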
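The two expressions for $W$ in Problem 6.14(c) and the two forms of the mean busy period in part (d) can be compared with exact arithmetic. This is a minimal sketch under assumptions: the function name `blocking_queue_metrics` and the sample rates $\lambda = 2$, $\mu = 3$ are hypothetical, and the inputs are expected to be `Fraction` objects so that the comparisons are exact.

```python
from fractions import Fraction


def blocking_queue_metrics(lam, mu, capacity=5):
    """Equilibrium quantities for the single-server queue holding at most `capacity` customers."""
    rho = lam / mu
    weights = [rho ** k for k in range(capacity + 1)]
    total = sum(weights)
    pi = [w / total for w in weights]          # pi_k proportional to rho^k
    p_block = pi[capacity]                     # blocking probability, by PASTA
    accepted_rate = lam * (1 - p_block)        # rate of customers actually admitted
    mean_waiting = sum((k - 1) * pi[k] for k in range(1, capacity + 1))  # N_W
    mean_in_system = sum(k * pi[k] for k in range(capacity + 1))         # N
    wait_direct = mean_waiting / accepted_rate
    wait_via_system = mean_in_system / accepted_rate - 1 / mu
    busy_period = (1 / pi[0] - 1) / lam
    return pi, p_block, wait_direct, wait_via_system, busy_period


# Hypothetical sample rates, given as exact fractions.
lam, mu = Fraction(2), Fraction(3)
pi, p_block, w1, w2, busy = blocking_queue_metrics(lam, mu)
rho = lam / mu
```

The agreement of `w1` and `w2` reflects the balance $\mu(1 - \pi_0) = \lambda(1 - \pi_5)$ between the departure rate and the rate of admitted arrivals.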
6.18 Positive recurrence of reflected random walk with negative drift
Let $V(x) = \frac{1}{2}x^2$. Then
\[ PV(x) - V(x) = E\Big[\frac{(x + B_n + L_n)^2}{2}\Big] - \frac{x^2}{2} \le E\Big[\frac{(x + B_n)^2}{2}\Big] - \frac{x^2}{2} = x\overline{B} + \frac{\overline{B^2}}{2}. \]
Therefore, the conditions of the combined Foster stability criteria and moment bound corollary apply, yielding that $X$ is positive recurrent, and $\overline{X} \le \frac{\overline{B^2}}{-2\overline{B}}$. (This bound is somewhat weaker than Kingman's moment bound, discussed later in the notes: $\overline{X} \le \frac{\mathrm{Var}(B)}{-2\overline{B}}$.)

6.20 An inadequacy of a linear potential function
Suppose $x$ is on the positive $x_2$ axis (i.e. $x_1 = 0$ and $x_2 > 0$). Then, given $X(t) = x$, during the slot, queue 1 will increase to 1 with probability $a(1 - d_1) = 0.42$, and otherwise stay at zero. Queue 2 will decrease by one with probability 0.4, and otherwise stay the same. Thus, the drift of $V$, $E[V(X(t+1)) - V(x)|X(t) = x]$, is equal to 0.02. Therefore, the drift is strictly positive for infinitely many states, whereas the Foster-Lyapunov condition requires that the drift be negative off of a finite set $C$. So, the linear choice for $V$ does not work for this example.

6.22 Opportunistic scheduling
(a) The left hand side of (6.37) is the arrival rate to the set of queues in $s$, and the righthand side is the probability that some queue in $s$ is eligible for service in a given time slot. The condition is necessary for the stability of the set of queues in $s$.
(b) Fix $\epsilon > 0$ so that for all $s \subset E$ with $s \ne \emptyset$,
\[ \sum_{i \in s}(a_i + \epsilon) \le \sum_{B : B \cap s \ne \emptyset}w(B). \]
Consider the flow graph shown.
[Figure: flow graph with source node $a$, a first column of nodes $q_1, \ldots, q_N$ reached from $a$ by links of capacity $a_i + \epsilon$, a second column of nodes $s_1, \ldots, s_{2^N}$, and links of capacity $w(s_k)$ from the second column to the sink node $b$.]
In addition to the source node $a$ and sink node $b$, there are two columns of nodes in the graph. The first column of nodes corresponds to the $N$ queues, and the second column of nodes corresponds to the $2^N$ subsets of $E$. There are three stages of links in the graph. The capacity of a link $(a, q_i)$ in the first stage is $a_i + \epsilon$, there is a link $(q_i, s_j)$ in the second stage if and only if $q_i \in s_j$, and each such link has capacity greater than the sum of the capacities of all the links in the first stage, and the weight of a link $(s_k, b)$ in the third stage is $w(s_k)$.
We claim that the minimum of the capacities of all $a$-$b$ cuts is $v^* = \sum_{i=1}^N(a_i + \epsilon)$. Here is a proof of the claim. The $a$-$b$ cut $(\{a\} : V - \{a\})$ (here $V$ is the set of nodes in the flow network) has capacity $v^*$, so to prove the claim, it suffices to show that any other $a$-$b$ cut has capacity greater than or equal to $v^*$. Fix any $a$-$b$ cut $(A : B)$. Let $\tilde{A} = A \cap \{q_1, \ldots, q_N\}$, or in words, $\tilde{A}$ is the set of nodes in the first column of the graph (i.e. set of queues) that are in $A$. If $q_i \in \tilde{A}$ and $s_j \in B$ such that $(q_i, s_j)$ is a link in the flow graph, then the capacity of $(A : B)$ is greater than or equal to the capacity of link $(q_i, s_j)$, which is greater than $v^*$, so the required inequality is proved in that case. Thus, we can suppose that $A$ contains all the nodes $s_j$ in the second column such that $s_j \cap \tilde{A} \ne \emptyset$. Therefore,
\[ C(A : B) \ge \sum_{i \in \{q_1, \ldots, q_N\} - \tilde{A}}(a_i + \epsilon) + \sum_{s \subset E : s \cap \tilde{A} \ne \emptyset}w(s) \ge \sum_{i \in \{q_1, \ldots, q_N\} - \tilde{A}}(a_i + \epsilon) + \sum_{i \in \tilde{A}}(a_i + \epsilon) = v^*, \qquad (12.7) \]
where the inequality in (12.7) follows from the choice of $\epsilon$. The claim is proved.
Therefore there is an $a$-$b$ flow $f$ which saturates all the links of the first stage of the flow graph. Let $u(i,s) = f(q_i, s)/f(s, b)$ for all $i$, $s$ such that $f(s,b) > 0$. That is, $u(i,s)$ is the fraction of flow on link $(s,b)$ which comes from link $(q_i, s)$. For those $s$ such that $f(s,b) = 0$, define $u(i,s)$ in some arbitrary way, respecting the requirements $u(i,s) \ge 0$, $u(i,s) = 0$ if $i \notin s$, and $\sum_{i \in E}u(i,s) = I_{\{s \ne \emptyset\}}$. Then $a_i + \epsilon = f(a, q_i) = \sum_sf(q_i, s) = \sum_sf(s,b)u(i,s) \le \sum_sw(s)u(i,s) = \mu_i(u)$, as required.
(c) Let $V(x) = \frac{1}{2}\sum_{i \in E}x_i^2$. Let $\delta(t)$ denote the identity of the queue given a potential service at time $t$, with $\delta(t) = 0$ if no queue is given potential service. Then $P[\delta(t) = i|S(t) = s] = u(i,s)$. The dynamics of queue $i$ are given by $X_i(t+1) = X_i(t) + A_i(t) - R_i(\delta(t)) + L_i(t)$, where $R_i(\delta) = I_{\{\delta = i\}}$. Since $\sum_{i \in E}(A_i(t) - R_i(\delta(t)))^2 \le \sum_{i \in E}\big[(A_i(t))^2 + (R_i(\delta(t)))^2\big] \le N + \sum_{i \in E}A_i(t)^2$ we have
\[ PV(x) - V(x) \le \Big(\sum_{i \in E}x_i(a_i - \mu_i(u))\Big) + K \qquad (12.8) \]
\[ \le -\epsilon\Big(\sum_{i \in E}x_i\Big) + K \qquad (12.9) \]
where $K = \frac{N}{2} + \sum_{i=1}^NK_i$. Thus, under the necessary stability conditions we have that under the vector of scheduling probabilities $u$, the system is positive recurrent, and
\[ \sum_{i \in E}\overline{X_i} \le \frac{K}{\epsilon}. \qquad (12.10) \]
(d) If $u$ could be selected as a function of the state, $x$, then the right hand side of (12.8) would be minimized by taking $u(i,s) = 1$ if $i$ is the smallest index in $s$ such that $x_i = \max_{j \in s}x_j$. This suggests using the longest connected first (LCF) policy, in which the longest connected queue is served in each time slot. If $P^{LCF}$ denotes the one-step transition probability matrix for the LCF policy, then (12.8) holds for any $u$, if $P$ is replaced by $P^{LCF}$. Therefore, under the necessary condition and $\epsilon$ as in part (b), (12.9) also holds with $P$ replaced by $P^{LCF}$, and (12.10) holds for the LCF policy.

6.24 Stability of two queues with transfers
(a) System is positive recurrent for some $u$ if and only if $\lambda_1 < \mu_1 + \nu$, $\lambda_2 < \mu_2$, and $\lambda_1 + \lambda_2 < \mu_1 + \mu_2$.
(b)
\[ QV(x) = \sum_{y : y \ne x}q_{xy}(V(y) - V(x)) = \frac{\lambda_1}{2}[(x_1 + 1)^2 - x_1^2] + \frac{\lambda_2}{2}[(x_2 + 1)^2 - x_2^2] + \frac{\mu_1}{2}[(x_1 - 1)_+^2 - x_1^2] \]
\[ + \frac{\mu_2}{2}[(x_2 - 1)_+^2 - x_2^2] + \frac{u\nu I_{\{x_1 \ge 1\}}}{2}[(x_1 - 1)^2 - x_1^2 + (x_2 + 1)^2 - x_2^2] \qquad (12.11) \]
(c) If the righthand side of (12.11) is changed by dropping the positive part symbols and dropping the factor $I_{\{x_1 \ge 1\}}$, then it is not increased, so that
\[ QV(x) \le x_1(\lambda_1 - \mu_1 - u\nu) + x_2(\lambda_2 + u\nu - \mu_2) + K \le -(x_1 + x_2)\min\{\mu_1 + u\nu - \lambda_1, \mu_2 - \lambda_2 - u\nu\} + K \qquad (12.12) \]
where $K = \frac{\lambda_1 + \lambda_2 + \mu_1 + \mu_2 + 2\nu}{2}$. To get the best bound on $\overline{X_1} + \overline{X_2}$, we select $u$ to maximize the min term in (12.12), or $u = u^*$, where $u^*$ is the point in $[0,1]$ nearest to $\frac{\lambda_1 - \mu_1 + \mu_2 - \lambda_2}{2\nu}$. For $u = u^*$, we find $QV(x) \le -\epsilon(x_1 + x_2) + K$ where $\epsilon = \min\{\mu_1 + \nu - \lambda_1, \mu_2 - \lambda_2, \frac{\mu_1 + \mu_2 - \lambda_1 - \lambda_2}{2}\}$. Which of the three terms is smallest in the expression for $\epsilon$ corresponds to the three cases $u^* = 1$, $u^* = 0$, and $0 < u^* < 1$, respectively. It is easy to check that this same $\epsilon$ is the largest constant such that the stability conditions (with strict inequality relaxed to less than or equal) hold with $(\lambda_1, \lambda_2)$ replaced by $(\lambda_1 + \epsilon, \lambda_2 + \epsilon)$.
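The choice of $u^*$ in Problem 6.24(c) and the resulting value of $\epsilon$ can be illustrated with a small calculation. The sketch below is an assumption-laden illustration: the function names `best_transfer_probability` and `epsilon_closed_form` and the three sets of rates are hypothetical, each picked to satisfy the stability conditions of part (a) and to land in one of the three cases $u^* = 1$, $u^* = 0$, and $0 < u^* < 1$.

```python
from fractions import Fraction


def best_transfer_probability(lam1, lam2, mu1, mu2, nu):
    """Return (u_star, eps): the u in [0, 1] maximizing the min term in (12.12), and that maximum."""
    unclipped = (lam1 - mu1 + mu2 - lam2) / (2 * nu)  # equalizes the two terms of the min
    u_star = min(1, max(0, unclipped))                # nearest point of [0, 1]
    eps = min(mu1 + u_star * nu - lam1, mu2 - lam2 - u_star * nu)
    return u_star, eps


def epsilon_closed_form(lam1, lam2, mu1, mu2, nu):
    """The three-term expression for eps given in the solution."""
    return min(mu1 + nu - lam1, mu2 - lam2, (mu1 + mu2 - lam1 - lam2) / 2)


F = Fraction
# Hypothetical rate tuples (lam1, lam2, mu1, mu2, nu), one per case.
case_u_one = (F(5, 2), F(1), F(2), F(5), F(1))   # unclipped value above 1
case_u_zero = (F(1), F(2), F(3), F(3), F(1))     # unclipped value below 0
case_interior = (F(2), F(1), F(2), F(3), F(2))   # unclipped value inside (0, 1)

results = {
    name: (best_transfer_probability(*rates), epsilon_closed_form(*rates))
    for name, rates in [("one", case_u_one), ("zero", case_u_zero), ("interior", case_interior)]
}
```

In each case the value of the min term at $u^*$ coincides with the three-term minimum, matching the correspondence between the smallest term and the value of $u^*$ described in the solution.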
7.2 Lack of sample path continuity of a Poisson process
(a) The sample path of $N$ is continuous over $[0,T]$ if and only if it has no jumps in the interval, equivalently, if and only if $N(T) = 0$. So $P[N \text{ is continuous over the interval } [0,T]] = \exp(-\lambda T)$. Since $\{N \text{ is continuous over } [0,+\infty)\} = \cap_{n=1}^{\infty}\{N \text{ is continuous over } [0,n]\}$, it follows that $P[N \text{ is continuous over } [0,+\infty)] = \lim_{n \to \infty}P[N \text{ is continuous over } [0,n]] = \lim_{n \to \infty}e^{-\lambda n} = 0$.
(b) Since $P[N \text{ is continuous over } [0,+\infty)] \ne 1$, $N$ is not a.s. sample continuous. However $N$ is m.s. continuous. One proof is to simply note that the correlation function, given by $R_N(s,t) = \lambda(s \wedge t) + \lambda^2st$, is continuous. A more direct proof is to note that for fixed $t$, $E[|N_s - N_t|^2] = \lambda|s-t| + \lambda^2|s-t|^2 \to 0$ as $s \to t$.

7.4 Some statements related to the basic calculus of random processes
(a) False. $\lim_{t \to \infty}\frac{1}{t}\int_0^tX_s\,ds = Z \ne E[Z]$ (except in the degenerate case that $Z$ has variance zero).
(b) False. One reason is that the function is continuous at zero, but not everywhere. For another, we would have $\mathrm{Var}(X_1 - X_0 - X_2) = 3R_X(0) - 4R_X(1) + 2R_X(2) = 3 - 4 + 0 = -1$.
(c) True. In general, $R_{X'X}(\tau) = R_X'(\tau)$. Since $R_X$ is an even function, $R_X'(0) = 0$. Thus, for any $t$, $E[X_t'X_t] = R_{X'X}(0) = R_X'(0) = 0$. Since the process $X$ has mean zero, it follows that $\mathrm{Cov}(X_t', X_t) = 0$ as well. Since $X$ is a Gaussian process, and differentiation is a linear operation, $X_t$ and $X_t'$ are jointly Gaussian. Summarizing, for $t$ fixed, $X_t'$ and $X_t$ are jointly Gaussian and uncorrelated, so they are independent. (Note: $X_s'$ is not necessarily independent of $X_t$ if $s \ne t$.)

7.6 Cross correlation between a process and its m.s. derivative
Fix $t, u \in T$. By assumption, $\lim_{s \to t}\frac{X_s - X_t}{s - t} = X_t'$ m.s. Therefore, by Corollary 2.2.3, $E\big[\big(\frac{X_s - X_t}{s-t}\big)X_u\big] \to E[X_t'X_u]$ as $s \to t$. Equivalently,
\[ \frac{R_X(s,u) - R_X(t,u)}{s - t} \to R_{X'X}(t,u) \quad \text{as } s \to t. \]
Hence $\partial_1R_X(s,u)$ exists, and $\partial_1R_X(t,u) = R_{X'X}(t,u)$.

7.8 A windowed Poisson process
(a) The sample paths of $X$ are piecewise constant, integer valued with initial value zero. They jump by +1 at each jump of $N$, and jump by -1 one time unit after each jump of $N$.
(b) Method 1: If $|s-t| \ge 1$ then $X_s$ and $X_t$ are increments of $N$ over disjoint intervals, and are therefore independent, so $C_X(s,t) = 0$. If $|s-t| < 1$, then there are three disjoint intervals, $I_0$, $I_1$, and $I_2$, with $I_0 = [s,s+1] \cap [t,t+1]$, such that $[s,s+1] = I_0 \cup I_1$ and $[t,t+1] = I_0 \cup I_2$. Thus, $X_s = D_0 + D_1$ and $X_t = D_0 + D_2$, where $D_i$ is the increment of $N$ over the interval $I_i$. The three increments $D_0$, $D_1$, and $D_2$ are independent, and $D_0$ is a Poisson random variable with mean and variance equal to $\lambda$ times the length of $I_0$, which is $1 - |s-t|$. Therefore, $C_X(s,t) = \mathrm{Cov}(D_0 + D_1, D_0 + D_2) = \mathrm{Cov}(D_0, D_0) = \lambda(1 - |s-t|)$. Summarizing,
\[ C_X(s,t) = \begin{cases} \lambda(1 - |s-t|) & \text{if } |s-t| < 1 \\ 0 & \text{else.} \end{cases} \]
Method 2: $C_X(s,t) = \mathrm{Cov}(N_{s+1} - N_s, N_{t+1} - N_t) = \lambda[\min(s+1,t+1) - \min(s+1,t) - \min(s,t+1) + \min(s,t)]$. This answer can be simplified to the one found by Method 1 by considering the cases $|s-t| > 1$, $t < s < t+1$, and $s < t < s+1$ separately.
(c) No. $X$ has a -1 jump one time unit after each +1 jump, so the value $X_t$ for a "present" time $t$ tells less about the future, $(X_s : s \ge t)$, than the past, $(X_s : 0 \le s \le t)$, tells about the future.
(d) Yes. Recall that $R_X(s,t) = C_X(s,t) + \mu_X(s)\mu_X(t)$. Since $C_X$ and $\mu_X$ are continuous functions, so is $R_X$, so that $X$ is m.s. continuous.
(e) Yes. Using the facts that $C_X(s,t)$ is a function of $s-t$ alone, and $C_X(s) \to 0$ as $s \to \infty$, we find as in the section on ergodicity, $\mathrm{Var}\big(\frac{1}{t}\int_0^tX_s\,ds\big) = \frac{2}{t}\int_0^t\big(1 - \frac{s}{t}\big)C_X(s)\,ds \to 0$ as $t \to \infty$.

7.10 A singular integral with a Brownian motion
(a) The integral $\int_\epsilon^1\frac{w_t}{t}\,dt$ exists in the m.s. sense for any $\epsilon > 0$ because $w_t/t$ is m.s. continuous over $[\epsilon, 1]$. To see if the limit exists we apply the correlation form of the Cauchy criteria (Proposition 2.2.2). Using different letters as variables of integration and the fact $R_w(s,t) = s \wedge t$ (the minimum of $s$ and $t$), yields that as $\epsilon, \epsilon' \to 0$,
\[ E\Big[\int_\epsilon^1\frac{w_s}{s}\,ds\int_{\epsilon'}^1\frac{w_t}{t}\,dt\Big] = \int_\epsilon^1\int_{\epsilon'}^1\frac{s \wedge t}{st}\,ds\,dt \to \int_0^1\int_0^1\frac{s \wedge t}{st}\,ds\,dt \]
\[ = 2\int_0^1\int_0^t\frac{s \wedge t}{st}\,ds\,dt = 2\int_0^1\int_0^t\frac{s}{st}\,ds\,dt = 2\int_0^1\int_0^t\frac{1}{t}\,ds\,dt = 2\int_0^11\,dt = 2. \]
Thus the m.s. limit defining the integral exists. The integral has the $N(0,2)$ distribution.
(b) As $a, b \to \infty$,
\[ E\Big[\int_1^a\frac{w_s}{s}\,ds\int_1^b\frac{w_t}{t}\,dt\Big] = \int_1^a\int_1^b\frac{s \wedge t}{st}\,ds\,dt \to \int_1^\infty\int_1^\infty\frac{s \wedge t}{st}\,ds\,dt \]
\[ = 2\int_1^\infty\int_1^t\frac{s \wedge t}{st}\,ds\,dt = 2\int_1^\infty\int_1^t\frac{s}{st}\,ds\,dt = 2\int_1^\infty\int_1^t\frac{1}{t}\,ds\,dt = 2\int_1^\infty\frac{t-1}{t}\,dt = \infty, \]
so that the m.s. limit does not exist, and the integral is not well defined.

7.12 Recognizing m.s. properties
(a) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable since $R_X'(0)$ doesn't exist. Yes, m.s. integrable over finite intervals since m.s. continuous. Yes, mean ergodic in the m.s. sense since $R_X(T) \to 0$ as $|T| \to \infty$.
(b) Yes, no, yes, for the same reasons as in part (a). Since $X$ is mean zero, $R_X(T) = C_X(T)$ for all $T$. Thus
\[ \lim_{|T| \to \infty}C_X(T) = \lim_{|T| \to \infty}R_X(T) = 1. \]
Since the limit of $C_X$ exists and is not zero, $X$ is not mean ergodic in the m.s. sense.
(c) Yes, no, yes, yes, for the same reasons as in (a).
(d) No, not m.s. continuous since $R_X$ is not continuous. No, not m.s. differentiable since $X$ is not even m.s. continuous. Yes, m.s. integrable over finite intervals, because the Riemann integral $\int_a^b\int_a^bR_X(s,t)\,ds\,dt$ exists and is finite, for the region of integration is a simple bounded region and the integrand is piece-wise constant.
(e) Yes, m.s. continuous since $R_X$ is continuous. No, not m.s. differentiable. For example,
\[ E\Big[\Big(\frac{X_t - X_0}{t}\Big)^2\Big] = \frac{1}{t^2}[R_X(t,t) - R_X(t,0) - R_X(0,t) + R_X(0,0)] = \frac{1}{t^2}\big[\sqrt{t} - 0 - 0 + 0\big] \to +\infty \text{ as } t \to 0. \]
Yes, m.s. integrable over finite intervals since m.s. continuous.

7.14 A stationary Gaussian process
(a) No. All mean zero stationary Gaussian Markov processes have autocorrelation functions of the form $R_X(t) = A\rho^{|t|}$, and $R_X$ here is not of that form.
(b) $E[X_3|X_0] = \frac{R_X(3)}{R_X(0)}X_0 = \frac{X_0}{10}$. The error is Gaussian with mean zero and variance $MSE = \mathrm{Var}(X_3) - \mathrm{Var}\big(\frac{X_0}{10}\big) = 1 - 0.01 = 0.99$. So $P\{|X_3 - E[X_3|X_0]| \ge 10\} = 2Q\big(\frac{10}{\sqrt{0.99}}\big)$.
(c) $R_{X'}(\tau) = -R_X''(\tau) = \frac{2 - 6\tau^2}{(1 + \tau^2)^3}$. In particular, since $-R_X''$ exists and is continuous, $X$ is continuously differentiable in the m.s. sense.
(d) The vector has a joint Gaussian distribution because $X$ is a Gaussian process and differentiation is a linear operation. $\mathrm{Cov}(X_\tau, X_0') = R_{XX'}(\tau) = -R_X'(\tau) = \frac{2\tau}{(1 + \tau^2)^2}$. In particular, $\mathrm{Cov}(X_0, X_0') = 0$ and $\mathrm{Cov}(X_1, X_0') = \frac{2}{4} = 0.5$. Also, $\mathrm{Var}(X_0') = R_{X'}(0) = 2$. So $(X_0, X_0', X_1)^T$ has the
\[ N\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0.5 \\ 0 & 2 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix}\right) \]
distribution.

7.16 Correlation ergodicity of Gaussian processes
Fix $h$ and let $Y_t = X_{t+h}X_t$.
integrable over finite intervals since m. 0) − RX (0.s. m. (d) No. not m. Clearly Y is stationary with mean µY = RX (h). (c) Yes. m.01 = 0. Observe that CY (τ ) = E[Yτ Y0 ] − µ2 Y = E[Xτ +h Xτ Xh X0 ] − RX (h)2 = RX (h)2 + RX (τ )RX (τ ) + RX (τ + h)RX (τ − h) − RX (h)2 . X is continuously differentiable in the m. Yes. because the Riemann integral b b a a RX (s. differentiable. Gaussian Markov processes have autocorrelation functions of the form RX (t) = Aρ|t| . differentiable since X is not even m.s. where A ≥ 0 and 0 ≤ ρ ≤ 1 for continuous time (or |ρ| ≤ 1 for discrete time). The eigenvalues are thus 50. φ1 (t) = 2 cos(20πt) and φ2 = 2 sin(20πt). The eigenvalues are thus 50. The resulting mean square error is given by MMSE = Var(X2 ) − (Cov(X1 X2 )2 )/Var(X1 ) = 9(1 − e−2 ) (b) Given (X0 = π. it is given by Cov(X2 . φn ). Since X is Gaussian. so X is correlation ergodic. and the other eigenvalues are zero. φn )φn (t) = n [(2πjn)(X. The other eigenfunctions can be selected to fill out an orthonormal basis. t t (b) No. Therefore. Hence Y is mean ergodic.s. t) = 50 + 50 cos(20π(s − t)) = 50 + 50 cos(20πs) cos(20πt) + 50 sin(20πs) sin(20πt) = 50φ0 (s)φ∗ (t) + 25φ1 (s)φ∗ (t) + 25φ2 (s)φ∗ (t) 0 1 2 √ √ for the choice φ0 (t) ≡ 1. the equation given in the problem statement leads to: X (t) = n (X.) 1 APPROACH TWO The identity cos(θ) = 2 (ejθ + e−jθ ). and µn = (2πn)2 λn for n ∈ Z . the best estimator of X2 given (X0 . Since X is mean zero. (c) APPROACH ONE Since RX (0) = RX (1). Thus.) 7. Thus E[X2 |X0 . the t t t t necessary and sufficient condition for mean ergodicity in the m.386 CHAPTER 12. X1 )Var(X1 )−1 X1 = e−1 X1 . 25. (X . No function of (X0 . The other eigenfunctions can be selected to fill out an orthonormal basis. limt→∞ 2 0 ( t−τ )CX (τ )dτ = 50 + limt→∞ 100 0 ( t−τ ) cos(20πτ )dτ = 50 = 0.22 KL expansion for derivative process (a) Since φn (t) = (2πjn)φn (t). 1] we have RX (s. the eigen-functions can be taken to be φn (t) = e2πjnt for n ∈ Z. 
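The eigenfunction claims in Problem 7.20 (Approach Three) can be checked numerically. The following is a minimal sketch, not part of the original solution: it approximates the integral operator with kernel $R_X(s,t) = 50 + 50\cos(20\pi(s-t))$ on $[0,1]$ by a midpoint rule and confirms that applying it to each candidate function returns the stated eigenvalue times that function. The function names and the choice of test points are illustrative assumptions.

```python
import math

def autocorr(s, t):
    """R_X(s, t) = 50 + 50 cos(20 pi (s - t)), as in Problem 7.20."""
    return 50.0 + 50.0 * math.cos(20.0 * math.pi * (s - t))

def apply_kernel(phi, s, n=2000):
    """Midpoint-rule approximation of the integral of R_X(s, t) phi(t) over t in [0, 1]."""
    h = 1.0 / n
    return h * sum(autocorr(s, (k + 0.5) * h) * phi((k + 0.5) * h) for k in range(n))

# Candidate (eigenfunction, eigenvalue) pairs taken from Approach Three.
CANDIDATES = [
    (lambda t: 1.0, 50.0),
    (lambda t: math.sqrt(2.0) * math.cos(20.0 * math.pi * t), 25.0),
    (lambda t: math.sqrt(2.0) * math.sin(20.0 * math.pi * t), 25.0),
]

def max_residual(phi, lam, points=(0.0, 0.13, 0.5, 0.77)):
    """Largest |(R phi)(s) - lam * phi(s)| over a few hypothetical test points s."""
    return max(abs(apply_kernel(phi, s) - lam * phi(s)) for s in points)

if __name__ == "__main__":
    for phi, lam in CANDIDATES:
        print(lam, max_residual(phi, lam))
```

The residuals should be at the level of floating-point rounding, because the midpoint rule integrates these trigonometric products over whole periods essentially exactly.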
7.22 KL expansion for derivative process
(a) Since $\phi_n'(t) = (2\pi jn)\phi_n(t)$, the derivative of each $\phi_n$ is a constant times $\phi_n$ itself. Therefore, the equation given in the problem statement leads to
$$X'(t) = \sum_n (X, \phi_n)\phi_n'(t) = \sum_n [(2\pi jn)(X, \phi_n)]\phi_n(t),$$
which is a KL expansion, because the functions $\phi_n$ are orthonormal in $L^2[0,1]$ and the coordinates are orthogonal random variables. Thus, $\psi_n(t) = \phi_n(t)$, $(X', \psi_n) = (2\pi jn)(X, \phi_n)$, and $\mu_n = (2\pi n)^2\lambda_n$ for $n \in \mathbb{Z}$. (Recall that the eigenvalues are equal to the means of the squared magnitudes of the coordinates.)
(b) Note that $\phi_1' = 0$, $\phi_{2k}'(t) = -(2\pi k)\phi_{2k+1}(t)$ and $\phi_{2k+1}'(t) = (2\pi k)\phi_{2k}(t)$. This is similar to part (a). The same basis functions can be used for $X'$ as for $X$, but the $(2k)$th and $(2k+1)$th coordinates of $X'$ come from the $(2k+1)$th and $(2k)$th coordinates of $X$, respectively. Specifically, we can take $\psi_n(t) = \phi_n(t)$ for $n \ge 1$, with $(X', \psi_1) = 0$, $\mu_1 = 0$, and, for $k \ge 1$, $(X', \psi_{2k}) = (2\pi k)(X, \phi_{2k+1})$, $(X', \psi_{2k+1}) = -(2\pi k)(X, \phi_{2k})$, $\mu_{2k} = (2\pi k)^2\lambda_{2k+1}$, and $\mu_{2k+1} = (2\pi k)^2\lambda_{2k}$. (It would have been acceptable to not define $\psi_1$, because the corresponding eigenvalue is zero.)
(c) Note that $\phi_n'(t) = \frac{(2n+1)\pi}{2}\psi_n(t)$, where $\psi_n(t) = \sqrt{2}\cos\left(\frac{(2n+1)\pi t}{2}\right)$, $n \ge 0$. That is, $\psi_n$ is the same as $\phi_n$, but with sin replaced by cos. Or equivalently, by the hint, we discover that $\psi_n$ is obtained from $\phi_n$ by time-reversal: $\psi_n(t) = \phi_n(1-t)(-1)^n$. Thus, the functions $\psi_n$ are orthonormal. As in part (a), we also have $(X', \psi_n) = \frac{(2n+1)\pi}{2}(X, \phi_n)$, and therefore, $\mu_n = \left(\frac{(2n+1)\pi}{2}\right)^2\lambda_n$. (The set of eigenfunctions is not unique: for example, some could be multiplied by $-1$ to yield another valid set.)
(d) Differentiating the KL expansion of $X$ yields
$$X_t' = (X, \phi_1)\phi_1'(t) + (X, \phi_2)\phi_2'(t) = (X, \phi_1)c_1\sqrt{3} - (X, \phi_2)c_2\sqrt{3}.$$
That is, the random process $X'$ is constant in time. So its KL expansion involves only one nonzero term, with the eigenfunction $\psi_1(t) = 1$ for $0 \le t \le 1$. Then $(X', \psi_1) = (X, \phi_1)c_1\sqrt{3} - (X, \phi_2)c_2\sqrt{3}$, and therefore $\mu_1 = 3c_1^2\lambda_1 + 3c_2^2\lambda_2$.

7.24 Mean ergodicity of a periodic WSS random process
$$\frac{1}{t}\int_0^t X_u\,du = \frac{1}{t}\int_0^t \sum_n \hat{X}_n e^{2\pi jnu/T}\,du = \sum_{n\in\mathbb{Z}} a_{n,t}\hat{X}_n,$$
where $a_{0,t} = 1$, and for $n \neq 0$,
$$|a_{n,t}| = \left|\frac{1}{t}\int_0^t e^{2\pi jnu/T}\,du\right| = \left|\frac{e^{2\pi jnt/T} - 1}{2\pi jnt/T}\right| \le \frac{T}{\pi |n| t}.$$
The $n \neq 0$ terms are not important as $t \to \infty$. Indeed,
$$E\left[\Big|\sum_{n\in\mathbb{Z},\,n\neq 0} a_{n,t}\hat{X}_n\Big|^2\right] = \sum_{n\in\mathbb{Z},\,n\neq 0} |a_{n,t}|^2 p_X(n) \le \frac{T^2}{\pi^2 t^2}\sum_{n\in\mathbb{Z},\,n\neq 0} p_X(n) \to 0 \quad \text{as } t \to \infty.$$
Therefore, $\frac{1}{t}\int_0^t X_u\,du \to \hat{X}_0$ m.s. The limit has mean zero and variance $p_X(0)$. For mean ergodicity (in the m.s. sense), the limit should be zero with probability one, which is true if and only if $p_X(0) = 0$. That is, the process should have no zero frequency, or DC, component. (Note: More generally, if $X$ were not assumed to be mean zero, then $X$ would be mean ergodic if and only if $\mathrm{Var}(\hat{X}_0) = 0$, or equivalently, $p_X(0) = \mu_X^2$, or equivalently, $\hat{X}_0$ is a constant a.s.)

8.2 On the cross spectral density
Follow the hint. Let $U$ be the output if $X$ is filtered by $H^{\epsilon}$ and $V$ be the output if $Y$ is filtered by $H^{\epsilon}$. The Schwarz inequality applied to random variables $U_t$ and $V_t$ for $t$ fixed yields $|R_{UV}(0)|^2 \le R_U(0)R_V(0)$, or equivalently,
$$\left|\int_{J^{\epsilon}} S_{XY}(\omega)\frac{d\omega}{2\pi}\right|^2 \le \int_{J^{\epsilon}} S_X(\omega)\frac{d\omega}{2\pi}\int_{J^{\epsilon}} S_Y(\omega)\frac{d\omega}{2\pi},$$
which implies that
$$|\epsilon S_{XY}(\omega_o) + o(\epsilon)|^2 \le (\epsilon S_X(\omega_o) + o(\epsilon))(\epsilon S_Y(\omega_o) + o(\epsilon)).$$
Letting $\epsilon \to 0$ yields the desired conclusion.

8.4 Filtering a Gauss Markov process
(a) The process $Y$ is the output when $X$ is passed through the linear time-invariant system with impulse response function $h(\tau) = e^{-\tau}I_{\{\tau\ge 0\}}$. Thus, $X$ and $Y$ are jointly WSS, and
$$R_{XY}(\tau) = R_X * \tilde{h}(\tau) = \int_{-\infty}^{\infty} R_X(t)\tilde{h}(\tau - t)\,dt = \int_{-\infty}^{\infty} R_X(t)h(t - \tau)\,dt = \begin{cases} \frac{1}{2}e^{-\tau} & \tau \ge 0 \\ \left(\frac{1}{2} - \tau\right)e^{\tau} & \tau \le 0. \end{cases}$$
(b) $X_5$ and $Y_5$ are jointly Gaussian, mean zero, with $\mathrm{Var}(X_5) = R_X(0) = 1$, and $\mathrm{Cov}(Y_5, X_5) = R_{XY}(0) = \frac{1}{2}$, so $E[Y_5|X_5 = 3] = (\mathrm{Cov}(Y_5, X_5)/\mathrm{Var}(X_5))\cdot 3 = 3/2$.
(c) Yes, $Y$ is Gaussian, because $X$ is a Gaussian process and $Y$ is obtained from $X$ by linear operations.
(d) No, $Y$ is not Markov. For example, we see that $S_Y(\omega) = \frac{2}{(1+\omega^2)^2}$, which does not have the form required for a stationary mean zero Gaussian process to be Markov (namely $\frac{2A}{\alpha^2+\omega^2}$). Another explanation is that, if $t$ is the present time, given $Y_t$, the future of $Y$ is determined by $Y_t$ and $(X_s : s \ge t)$. The future could be better predicted by knowing something more about $X_t$ than $Y_t$ gives alone, which is provided by knowing the past of $Y$. (Note: the $\mathbb{R}^2$-valued process $((X_t, Y_t) : t \in \mathbb{R})$ is Markov.)

8.6 A stationary two-state Markov process
$\pi P = \pi$ implies $\pi = (\frac{1}{2}, \frac{1}{2})$ is the equilibrium distribution, so $P[X_n = 1] = P[X_n = -1] = \frac{1}{2}$ for all $n$. Thus $\mu_X = 0$. For $n \ge 1$,
$$R_X(n) = P[X_n = 1, X_0 = 1] + P[X_n = -1, X_0 = -1] - P[X_n = -1, X_0 = 1] - P[X_n = 1, X_0 = -1]$$
$$= \tfrac{1}{2}\left[\tfrac{1}{2} + \tfrac{1}{2}(1-2p)^n\right] + \tfrac{1}{2}\left[\tfrac{1}{2} + \tfrac{1}{2}(1-2p)^n\right] - \tfrac{1}{2}\left[\tfrac{1}{2} - \tfrac{1}{2}(1-2p)^n\right] - \tfrac{1}{2}\left[\tfrac{1}{2} - \tfrac{1}{2}(1-2p)^n\right] = (1-2p)^n.$$
So in general, $R_X(n) = (1-2p)^{|n|}$. The corresponding power spectral density is given by:
$$S_X(\omega) = \sum_{n=-\infty}^{\infty} (1-2p)^{|n|}e^{-j\omega n} = \sum_{n=0}^{\infty} ((1-2p)e^{-j\omega})^n + \sum_{n=0}^{\infty} ((1-2p)e^{j\omega})^n - 1$$
$$= \frac{1}{1 - (1-2p)e^{-j\omega}} + \frac{1}{1 - (1-2p)e^{j\omega}} - 1 = \frac{1 - (1-2p)^2}{1 - 2(1-2p)\cos(\omega) + (1-2p)^2}.$$
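As a sanity check on the closed form for the power spectral density in Problem 8.6, the sketch below compares it with a truncated version of the defining sum $\sum_n (1-2p)^{|n|}e^{-j\omega n} = 1 + 2\sum_{n\ge 1}(1-2p)^n\cos(n\omega)$. This is only an illustration: the function names and the truncation length are assumptions, not part of the original solution.

```python
import math

def psd_closed_form(omega, p):
    """Closed-form S_X(omega) for R_X(n) = (1 - 2p)^|n|."""
    r = 1.0 - 2.0 * p
    return (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(omega) + r * r)

def psd_truncated_sum(omega, p, terms=2000):
    """Truncated two-sided sum 1 + 2 * sum_{n>=1} r^n cos(n omega), with r = 1 - 2p."""
    r = 1.0 - 2.0 * p
    return 1.0 + 2.0 * sum(r ** n * math.cos(n * omega) for n in range(1, terms))

if __name__ == "__main__":
    for p in (0.2, 0.8):
        for omega in (0.0, 1.0, math.pi):
            print(p, omega, psd_closed_form(omega, p), psd_truncated_sum(omega, p))
```

The two columns should agree to many digits whenever $0 < p < 1$, since the geometric tail of the sum decays like $|1-2p|^n$.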
8.8 A linear estimation problem
$$E[|X_t - Z_t|^2] = E[(X_t - Z_t)(X_t - Z_t)^*] = R_X(0) + R_Z(0) - R_{XZ}(0) - R_{ZX}(0) = R_X(0) + h * \tilde{h} * R_Y(0) - 2\mathrm{Re}(\tilde{h} * R_{XY}(0))$$
$$= \int_{-\infty}^{\infty} \left[S_X(\omega) + |H(\omega)|^2 S_Y(\omega) - 2\mathrm{Re}(H^*(\omega)S_{XY}(\omega))\right]\frac{d\omega}{2\pi}.$$
The hint with $\sigma^2 = S_Y(\omega)$, $z_o = S_{XY}(\omega)$, and $z = H(\omega)$ implies $H_{opt}(\omega) = \frac{S_{XY}(\omega)}{S_Y(\omega)}$.

8.10 The accuracy of approximate differentiation
(a) $S_{X'}(\omega) = S_X(\omega)|H(\omega)|^2 = \omega^2 S_X(\omega)$.
(b) $k(\tau) = \frac{1}{2a}(\delta(\tau + a) - \delta(\tau - a))$ and $K(\omega) = \int_{-\infty}^{\infty} k(\tau)e^{-j\omega\tau}\,d\tau = \frac{1}{2a}(e^{j\omega a} - e^{-j\omega a}) = \frac{j\sin(a\omega)}{a}$. By l'Hôpital's rule, $\lim_{a\to 0} K(\omega) = \lim_{a\to 0}\frac{j\omega\cos(a\omega)}{1} = j\omega$.
(c) $D$ is the output of the linear system with input $X$ and transfer function $H(\omega) - K(\omega)$. The output thus has power spectral density $S_D(\omega) = S_X(\omega)|H(\omega) - K(\omega)|^2 = S_X(\omega)\left|\omega - \frac{\sin(a\omega)}{a}\right|^2$.
(d) Or, $S_D(\omega) = S_{X'}(\omega)\left|1 - \frac{\sin(a\omega)}{a\omega}\right|^2$. Suppose $0 < a \le \frac{\sqrt{0.6}}{\omega_o}$ $(\approx \frac{0.77}{\omega_o})$. Then by the bound given in the problem statement, if $|\omega| \le \omega_o$ then $0 \le 1 - \frac{\sin(a\omega)}{a\omega} \le \frac{(a\omega)^2}{6} \le \frac{(a\omega_o)^2}{6} \le 0.1$, so that $S_D(\omega) \le (0.01)S_{X'}(\omega)$ for $\omega$ in the base band. Integrating this inequality over the band yields that $E[|D_t|^2] \le (0.01)E[|X_t'|^2]$.

8.12 Filtering Poisson white noise
(a) Since $\mu_{N'} = \lambda$, $\mu_X = \lambda\int_{-\infty}^{\infty} h(t)\,dt$. Also, $C_X = h * \tilde{h} * C_{N'} = \lambda h * \tilde{h}$. (In particular, if $h(t) = I_{\{0\le t<1\}}$, then $C_X(\tau) = \lambda(1 - |\tau|)_+$, as already found in Problem 4.17.)
(b) In the special case, in between arrival times of $N$, $X$ decreases exponentially, following the equation $X' = -X$. At each arrival time of $N$, $X$ has an upward jump of size one. Formally, we can write $X' = -X + N'$. For a fixed time $t_o$, which we think of as the present time, the process after time $t_o$ is the solution of the above differential equation, where the future of $N'$ is independent of $X$ up to time $t_o$. Thus, the future evolution of $X$ depends only on the current value, and random variables independent of the past. Hence, $X$ is Markov.

8.14 Linear and nonlinear reconstruction from samples
(a) We first find the mean function and autocorrelation function of $X$. $E[X_t] = \sum_n E[g(t - n - U)]E[B_n] = 0$ because $E[B_n] = 0$ for all $n$.
$$R_X(s,t) = E\left[\sum_{n=-\infty}^{\infty} g(s-n-U)B_n \sum_{m=-\infty}^{\infty} g(t-m-U)B_m\right] = \sigma^2\sum_{n=-\infty}^{\infty} E[g(s-n-U)g(t-n-U)]$$
$$= \sigma^2\sum_{n=-\infty}^{\infty}\int_0^1 g(s-n-u)g(t-n-u)\,du = \sigma^2\sum_{n=-\infty}^{\infty}\int_n^{n+1} g(s-v)g(t-v)\,dv = \sigma^2\int_{-\infty}^{\infty} g(s-v)g(t-v)\,dv$$
$$= \sigma^2\int_{-\infty}^{\infty} g(s-v)\tilde{g}(v-t)\,dv = \sigma^2(g * \tilde{g})(s-t).$$
So $X$ is WSS with mean zero and $R_X = \sigma^2 g * \tilde{g}$.
(b) By part (a), the power spectral density of $X$ is $\sigma^2|G(\omega)|^2$. If $g$ is a baseband signal, so that $|G(\omega)|^2 = 0$ for $|\omega| \ge \omega_o$, then by the sampling theorem for WSS baseband random processes, $X$ can be recovered from the samples $(X(nT) : n \in \mathbb{Z})$ as long as $T \le \frac{\pi}{\omega_o}$.
(c) For this case, $G(2\pi f) = \mathrm{sinc}^2(f)$, which is not supported in any finite interval. So part (b) does not apply. The sample paths of $X$ are continuous and piecewise linear, and at least two sample points fall within each linear portion of $X$. Either all pairs of samples of the form $(X_n, X_{n+0.5})$ fall within linear regions (happens when $0.5 \le U \le 1$), or all pairs of samples of the form $(X_{n+0.5}, X_{n+1})$ fall within linear regions (happens when $0 \le U \le 0.5$). We can try reconstructing $X$ using both cases. With probability one, only one of the cases will yield a reconstruction with change points having spacing one. That must be the correct reconstruction of $X$. The algorithm is illustrated in Figure 12.4. Figure 12.4(a) shows a sample path of $B$ and a corresponding sample path of $X$, for $U = 0.75$. Thus, the breakpoints of $X$ are at times of the form $n + 0.75$ for integers $n$. Figure 12.4(b) shows the corresponding samples, taken at integer multiples of $T = 0.5$. Figure 12.4(c) shows the result of connecting pairs of the form $(X_n, X_{n+0.5})$, and Figure 12.4(d) shows the result of connecting pairs of the form $(X_{n+0.5}, X_{n+1})$. Of these two, only Figure 12.4(c) yields breakpoints with unit spacing. Thus, the dashed lines in Figure 12.4(c) are connected to reconstruct $X$.

[Figure 12.4: Nonlinear reconstruction of a signal from samples. Panels (a) to (d).]

8.16 An approximation of white noise
(a) Since $E[B_k B_l^*] = \sigma^2 I_{\{k=l\}}$,
$$E\left[\Big|\int_0^1 N_t\,dt\Big|^2\right] = E\left[\Big|A_T T\sum_{k=1}^{K} B_k\Big|^2\right] = (A_T T)^2 E\left[\sum_{k=1}^{K} B_k\sum_{l=1}^{K} B_l^*\right] = (A_T T)^2\sigma^2 K = A_T^2 T\sigma^2.$$
(b) The choice of scaling constant $A_T$ such that $A_T^2 T \equiv 1$ is $A_T = \frac{1}{\sqrt{T}}$. Under this scaling the process $N$ approximates white noise with power spectral density $\sigma^2$ as $T \to 0$.
(c) If the constant scaling $A_T = 1$ is used, then $E[|\int_0^1 N_t\,dt|^2] = T\sigma^2 \to 0$ as $T \to 0$.

8.18 Filtering to maximize signal to noise ratio
The problem is to select $H$ to minimize $\sigma^2\int_{-\infty}^{\infty} |H(\omega)|^2\frac{d\omega}{2\pi}$, subject to the constraints (i) $|H(\omega)| \le 1$ for all $\omega$, and (ii) $\int_{-\omega_o}^{\omega_o} |\omega||H(\omega)|^2\frac{d\omega}{2\pi} \ge$ (the power of $X$)/2. First, $H$ should be zero outside of the interval $[-\omega_o, \omega_o]$, for otherwise it would be contributing to the output noise power with no contribution to the output signal power. Furthermore, if $|H(\omega)|^2 > 0$ over a small interval of length $2\pi\epsilon$ contained in $[-\omega_o, \omega_o]$, then the contribution to the output signal power is $|\omega||H(\omega)|^2\epsilon$, whereas the contribution to the output noise is $\sigma^2|H(\omega)|^2\epsilon$. The ratio of these contributions is $|\omega|/\sigma^2$, which we would like to be as large as possible. Thus, the optimal $H$ has the form $H(\omega) = I_{\{a\le|\omega|\le\omega_o\}}$, where $a$ is selected to meet the signal power constraint with equality: (power of $\tilde{X}$) = (power of $X$)/2. This yields $a = \omega_o/\sqrt{2}$. In summary, the optimal choice is $|H(\omega)|^2 = I_{\{\omega_o/\sqrt{2}\le|\omega|\le\omega_o\}}$.

8.20 Sampling a signal or process that is not band limited
(a) Evaluating the inverse transform of $\hat{x}^o$ at $nT$, and using the fact that $2\omega_o T = 2\pi$ yields
$$x^o(nT) = \int_{-\infty}^{\infty} e^{j\omega nT}\hat{x}^o(\omega)\frac{d\omega}{2\pi} = \int_{-\omega_o}^{\omega_o} e^{j\omega nT}\hat{x}^o(\omega)\frac{d\omega}{2\pi} = \sum_{m=-\infty}^{\infty}\int_{-\omega_o}^{\omega_o} e^{j\omega nT}\hat{x}(\omega + 2m\omega_o)\frac{d\omega}{2\pi}$$
$$= \sum_{m=-\infty}^{\infty}\int_{(2m-1)\omega_o}^{(2m+1)\omega_o} e^{j\omega nT}\hat{x}(\omega)\frac{d\omega}{2\pi} = \int_{-\infty}^{\infty} e^{j\omega nT}\hat{x}(\omega)\frac{d\omega}{2\pi} = x(nT).$$
(b) The observation follows from the sampling theorem for the narrowband signal $x^o$ and part (a).
(c) The fact $R_X(nT) = R_X^o(nT)$ follows from part (a) with $x(t) = R_X(t)$.
(d) Let $X^o$ be a WSS baseband random process with autocorrelation function $R_X^o$. Then by the sampling theorem for baseband random processes, $X_t^o = \sum_{n=-\infty}^{\infty} X_{nT}^o\,\mathrm{sinc}\left(\frac{t-nT}{T}\right)$. But the discrete time processes $(X_{nT} : n \in \mathbb{Z})$ and $(X_{nT}^o : n \in \mathbb{Z})$ have the same autocorrelation function by part (c). Thus $X^o$ and $Y$ have the same autocorrelation function. Also, $Y$ has mean zero. So $Y$ is WSS with mean zero and autocorrelation function $R_X^o$.
(e) For $0 \le \omega \le \omega_o$,
$$S_X^o(\omega) = \sum_{n=-\infty}^{\infty}\exp(-\alpha|\omega + 2n\omega_o|) = \sum_{n=0}^{\infty}\exp(-\alpha(\omega + 2n\omega_o)) + \sum_{n=-\infty}^{-1}\exp(\alpha(\omega + 2n\omega_o))$$
$$= \frac{\exp(-\alpha\omega) + \exp(\alpha(\omega - 2\omega_o))}{1 - \exp(-2\alpha\omega_o)} = \frac{\exp(-\alpha(\omega - \omega_o)) + \exp(\alpha(\omega - \omega_o))}{\exp(\alpha\omega_o) - \exp(-\alpha\omega_o)} = \frac{\cosh(\alpha(\omega_o - \omega))}{\sinh(\alpha\omega_o)}.$$
Thus, for any $\omega$, $S_X^o(\omega) = I_{\{|\omega|\le\omega_o\}}\frac{\cosh(\alpha(\omega_o - |\omega|))}{\sinh(\alpha\omega_o)}$.

8.22 Another narrowband Gaussian process
(a) $\mu_X = \mu_R\int_{-\infty}^{\infty} h(t)\,dt = \mu_R H(0) = 0$, and $S_X(2\pi f) = |H(2\pi f)|^2 S_R(2\pi f) = 10^{-2}e^{-|f|/10^4}I_{\{5000\le|f|\le 6000\}}$.
(b) $R_X(0) = \int_{-\infty}^{\infty} S_X(2\pi f)\,df = \frac{2}{10^2}\int_{5000}^{6000} e^{-f/10^4}\,df = (200)(e^{-0.5} - e^{-0.6}) = 11.54$. So $X_{25} \sim N(0, 11.54)$ and $P[X_{25} > 6] = Q\left(\frac{6}{\sqrt{11.54}}\right) \approx Q(1.76) \approx 0.039$.
(c) For the narrowband representation about $f_c = 5500$ (see Figure 12.5),
$$S_U(2\pi f) = S_V(2\pi f) = 10^{-2}\left[e^{-(f+5500)/10^4} + e^{-(-f+5500)/10^4}\right]I_{\{|f|\le 500\}} = \frac{e^{-.55}}{50}\cosh(f/10^4)I_{\{|f|\le 500\}},$$
$$S_{UV}(2\pi f) = 10^{-2}\left[je^{-(f+5500)/10^4} - je^{-(-f+5500)/10^4}\right]I_{\{|f|\le 500\}} = \frac{-je^{-.55}}{50}\sinh(f/10^4)I_{\{|f|\le 500\}}.$$
(d) For the narrowband representation about $f_c = 5000$ (see Figure 12.5),
$$S_U(2\pi f) = S_V(2\pi f) = 10^{-2}e^{-0.5}e^{-|f|/10^4}I_{\{|f|\le 1000\}}, \qquad S_{UV}(2\pi f) = j\,\mathrm{sgn}(f)S_U(2\pi f).$$

[Figure 12.5: Spectra of baseband equivalent signals for $f_c = 5500$ and $f_c = 5000$.]
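The arithmetic in Problem 8.22(b) is easy to reproduce. The sketch below, offered only as an illustration, evaluates the power $R_X(0)$ both from the closed form $200(e^{-0.5} - e^{-0.6})$ and by a midpoint-rule integration of the stated spectral density, then evaluates the Gaussian tail probability using the standard identity $Q(x) = \frac{1}{2}\mathrm{erfc}(x/\sqrt{2})$. The helper names and the number of integration steps are assumptions.

```python
import math

def q_function(x):
    """Standard normal tail probability Q(x) = P[N(0,1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def psd(f):
    """S_X(2 pi f) = 1e-2 * exp(-|f| / 1e4) on 5000 <= |f| <= 6000, zero elsewhere."""
    return 1e-2 * math.exp(-abs(f) / 1e4) if 5000.0 <= abs(f) <= 6000.0 else 0.0

def power_numeric(n=10000):
    """Midpoint rule for the integral of psd over f; the factor 2 covers negative f."""
    h = 1000.0 / n
    return 2.0 * h * sum(psd(5000.0 + (k + 0.5) * h) for k in range(n))

POWER_CLOSED = 200.0 * (math.exp(-0.5) - math.exp(-0.6))  # R_X(0) in closed form
TAIL_PROB = q_function(6.0 / math.sqrt(POWER_CLOSED))     # P[X_25 > 6]

if __name__ == "__main__":
    print(POWER_CLOSED, power_numeric(), TAIL_PROB)
```

Both power values should round to 11.54, and the tail probability should round to 0.039, matching the figures quoted in the solution.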
8.24 Declaring the center frequency for a given random process
(a) $S_U(\omega) = g(\omega + \omega_c) + g(-\omega + \omega_c)$ and $S_{UV}(\omega) = j(g(\omega + \omega_c) - g(-\omega + \omega_c))$.
(b) The integral becomes:
$$\int_{-\infty}^{\infty} (g(\omega + \omega_c) + g(-\omega + \omega_c))^2\frac{d\omega}{2\pi} = 2\int_{-\infty}^{\infty} g(\omega)^2\frac{d\omega}{2\pi} + 2\int_{-\infty}^{\infty} g(\omega + \omega_c)g(-\omega + \omega_c)\frac{d\omega}{2\pi} = 2\|g\|^2 + g * g(2\omega_c).$$
Thus, select $\omega_c$ to maximize $g * g(2\omega_c)$.

9.2 A smoothing problem
Write $\hat{X}_5 = \int_0^3 g(s)Y_s\,ds + \int_7^{10} g(s)Y_s\,ds$. The mean square error is minimized over all linear estimators if and only if $(X_5 - \hat{X}_5) \perp Y_u$ for $u \in [0,3]\cup[7,10]$, or equivalently
$$R_{XY}(5,u) = \int_0^3 g(s)R_Y(s,u)\,ds + \int_7^{10} g(s)R_Y(s,u)\,ds \quad \text{for } u \in [0,3]\cup[7,10].$$

9.4 Interpolating a Gauss Markov process
(a) The constants must be selected so that $X_0 - \hat{X}_0 \perp X_a$ and $X_0 - \hat{X}_0 \perp X_{-a}$, or equivalently $e^{-a} - [c_1e^{-2a} + c_2] = 0$ and $e^{-a} - [c_1 + c_2e^{-2a}] = 0$. Solving for $c_1$ and $c_2$ (one could begin by subtracting the two equations) yields $c_1 = c_2 = c$ where $c = \frac{e^{-a}}{1+e^{-2a}} = \frac{1}{e^a+e^{-a}} = \frac{1}{2\cosh(a)}$. The corresponding minimum MSE is given by
$$E[X_0^2] - E[\hat{X}_0^2] = 1 - c^2E[(X_{-a} + X_a)^2] = 1 - c^2(2 + 2e^{-2a}) = \frac{e^{2a} - e^{-2a}}{(e^a+e^{-a})^2} = \frac{(e^a - e^{-a})(e^a+e^{-a})}{(e^a+e^{-a})^2} = \tanh(a).$$
(b) The claim is true if $(X_0 - \hat{X}_0) \perp X_u$ whenever $|u| \ge a$. If $u \ge a$ then
$$E[(X_0 - c(X_{-a} + X_a))X_u] = e^{-u} - \frac{1}{e^a+e^{-a}}(e^{-a-u} + e^{a-u}) = 0.$$
Similarly if $u \le -a$ then
$$E[(X_0 - c(X_{-a} + X_a))X_u] = e^{u} - \frac{1}{e^a+e^{-a}}(e^{a+u} + e^{-a+u}) = 0.$$
The orthogonality condition is thus true whenever $|u| \ge a$ as required.

9.6 Proportional noise
(a) In order that $\kappa Y_t$ be the optimal estimator, by the orthogonality principle, it suffices to check two things:
1. $\kappa Y_t$ must be in the linear span of $(Y_u : a \le u \le b)$. This is true since $t \in [a,b]$ is assumed.
2. Orthogonality condition: $(X_t - \kappa Y_t) \perp Y_u$ for $u \in [a,b]$.
It remains to show that $\kappa$ can be chosen so that the orthogonality condition is true. The condition is equivalent to $E[(X_t - \kappa Y_t)Y_u^*] = 0$ for $u \in [a,b]$, or equivalently $R_{XY}(t,u) = \kappa R_Y(t,u)$ for $u \in [a,b]$. The assumptions imply that $R_Y = R_X + R_N = (1+\gamma^2)R_X$ and $R_{XY} = R_X$, so the orthogonality condition becomes $R_X(t,u) = \kappa(1+\gamma^2)R_X(t,u)$ for $u \in [a,b]$, which is true for $\kappa = 1/(1+\gamma^2)$. The form of the estimator is proved. The MSE is given by $E[|X_t - \hat{X}_t|^2] = E[|X_t|^2] - E[|\hat{X}_t|^2] = \frac{\gamma^2}{1+\gamma^2}R_X(t,t)$.
(b) Since $S_Y$ is proportional to $S_X$, the factors in the spectral factorization of $S_Y$ are proportional to the factors in the spectral factorization of $S_X$:
$$S_Y = (1+\gamma^2)S_X = \underbrace{\sqrt{1+\gamma^2}\,S_X^+}_{S_Y^+}\;\underbrace{\sqrt{1+\gamma^2}\,S_X^-}_{S_Y^-}.$$
That and the fact $S_{XY} = S_X$ imply that
$$H(\omega) = \frac{1}{S_Y^+}\left[\frac{e^{j\omega T}S_{XY}}{S_Y^-}\right]_+ = \frac{1}{\sqrt{1+\gamma^2}\,S_X^+}\left[\frac{e^{j\omega T}S_X^+}{\sqrt{1+\gamma^2}}\right]_+ = \frac{\kappa}{S_X^+(\omega)}\left[e^{j\omega T}S_X^+(\omega)\right]_+.$$
Therefore $H$ is simply $\kappa$ times the optimal filter for predicting $X_{t+T}$ from $(X_s : s \le t)$. In particular, if $T < 0$ then $H(\omega) = \kappa e^{j\omega T}$, and the estimator of $X_{t+T}$ is simply $\hat{X}_{t+T|t} = \kappa Y_{t+T}$, which agrees with part (a).
(c) As already observed, if $T > 0$ then the optimal filter is $\kappa$ times the prediction filter for $X_{t+T}$ given $(X_s : s \le t)$.

9.8 Short answer filtering questions
(a) The convolution of a causal function $h$ with itself is causal, and $H^2$ has transform $h * h$. So if $H$ is a positive type function then $H^2$ is positive type.
(b) Since the intervals of support of $S_X$ and $S_Y$ do not intersect, $S_X(2\pi f)S_Y(2\pi f) \equiv 0$. Since $|S_{XY}(2\pi f)|^2 \le S_X(2\pi f)S_Y(2\pi f)$ (see Problem 8.2, on the cross spectral density) it follows that $S_{XY} \equiv 0$. Hence the assertion is true.
(c) Since $\mathrm{sinc}(f)$ is the Fourier transform of $I_{[-\frac{1}{2},\frac{1}{2}]}$, it follows that
$$[H]_+(2\pi f) = \int_0^{\frac{1}{2}} e^{-2\pi fjt}\,dt = \frac{1}{2}e^{-\pi jf/2}\,\mathrm{sinc}\left(\frac{f}{2}\right).$$

9.10 A singular estimation problem
(a) $E[X_t] = E[A]e^{j2\pi f_o t} = 0$, which does not depend on $t$. $R_X(s,t) = E[Ae^{j2\pi f_o s}(Ae^{j2\pi f_o t})^*] = \sigma_A^2 e^{j2\pi f_o(s-t)}$ is a function of $s-t$. Thus, $X$ is WSS with $\mu_X = 0$ and $R_X(\tau) = \sigma_A^2 e^{j2\pi f_o\tau}$. Therefore, $S_X(2\pi f) = \sigma_A^2\delta(f - f_o)$, or equivalently, $S_X(\omega) = 2\pi\sigma_A^2\delta(\omega - \omega_o)$. (This makes $R_X(\tau) = \int_{-\infty}^{\infty} S_X(2\pi f)e^{j2\pi f\tau}\,df = \int_{-\infty}^{\infty} S_X(\omega)e^{j\omega\tau}\frac{d\omega}{2\pi}$.)
(b) $(h * X)_t = \int_{-\infty}^{\infty} h(\tau)X_{t-\tau}\,d\tau = \int_0^{\infty} \alpha e^{-(\alpha - j2\pi f_o)\tau}Ae^{j2\pi f_o(t-\tau)}\,d\tau = \int_0^{\infty} \alpha e^{-\alpha\tau}\,d\tau\,Ae^{j2\pi f_o t} = X_t$. Another way to see this is to note that $X$ is a pure tone sinusoid at frequency $f_o$, and $H(2\pi f_o) = 1$.
(c) In view of part (b), the mean square error is the power of the output due to the noise, or
$$\mathrm{MSE} = (h * \tilde{h} * R_N)(0) = \int_{-\infty}^{\infty} (h * \tilde{h})(t)R_N(0-t)\,dt = \sigma_N^2\,h * \tilde{h}(0) = \sigma_N^2\|h\|^2 = \sigma_N^2\int_0^{\infty} \alpha^2 e^{-2\alpha t}\,dt = \frac{\sigma_N^2\alpha}{2}.$$
The MSE can be made arbitrarily small by taking $\alpha$ small enough. That is, the minimum mean square error for estimation of $X_t$ from $(Y_s : s \le t)$ is zero. Intuitively, the power of the signal $X$ is concentrated at a single frequency, while the noise power in a small interval around that frequency is small, so that perfect estimation is possible.

9.12 A prediction problem
The optimal prediction filter is given by $\frac{1}{S_X^+}\left[e^{j\omega T}S_X^+\right]_+$. Since $R_X(\tau) = e^{-|\tau|}$, the spectral factorization of $S_X$ is given by
$$S_X(\omega) = \underbrace{\left(\frac{\sqrt{2}}{j\omega + 1}\right)}_{S_X^+}\underbrace{\left(\frac{\sqrt{2}}{-j\omega + 1}\right)}_{S_X^-},$$
so $[e^{j\omega T}S_X^+]_+ = e^{-T}S_X^+$ (see Figure 12.6). Thus the optimal prediction filter is $H(\omega) \equiv e^{-T}$, or in the time domain it is $h(t) = e^{-T}\delta(t)$, so that $\hat{X}_{T+t|t} = e^{-T}X_t$. This simple form can be explained and derived another way. Since linear estimation is being considered, only the means (assumed zero) and correlation functions of the processes matter. We can therefore assume without loss of generality that $X$ is a real valued Gaussian process. By the form of $R_X$ we recognize that $X$ is Markov, so the best estimate of $X_{T+t}$ given $(X_s : s \le t)$ is a function of $X_t$ alone. Since $X$ is Gaussian with mean zero, the optimal estimator of $X_{t+T}$ given $X_t$ is $E[X_{t+T}|X_t] = \frac{\mathrm{Cov}(X_{t+T}, X_t)X_t}{\mathrm{Var}(X_t)} = e^{-T}X_t$.

[Figure 12.6: $\sqrt{2}\,e^{j\omega T}S_X^+$ in the time domain.]

9.14 Spectral decomposition and factorization
(a) Building up transform pairs by steps yields:
$$\mathrm{sinc}(f) \leftrightarrow I_{\{-\frac{1}{2}\le t\le\frac{1}{2}\}}$$
$$\mathrm{sinc}(100f) \leftrightarrow 10^{-2}I_{\{-\frac{1}{2}\le\frac{t}{100}\le\frac{1}{2}\}}$$
$$\mathrm{sinc}(100f)e^{2\pi jfT} \leftrightarrow 10^{-2}I_{\{-\frac{1}{2}\le\frac{t+T}{100}\le\frac{1}{2}\}}$$
$$\left[\mathrm{sinc}(100f)e^{j2\pi fT}\right]_+ \leftrightarrow 10^{-2}I_{\{-50-T\le t\le 50-T\}\cap\{t\ge 0\}}$$
so
$$\|x\|^2 = 10^{-4}\cdot\text{length of } ([-50-T, 50-T]\cap[0,+\infty)) = \begin{cases} 10^{-2} & T \le -50 \\ 10^{-4}(50 - T) & -50 \le T \le 50 \\ 0 & T \ge 50. \end{cases}$$
(b) By the hint, $1 + 3j$ is a pole of $S$. (Without the hint, the poles can be found by first solving for values of $\omega^2$ for which the denominator of $S$ is zero.) Since $S$ is real valued, $1 - 3j$ must also be a pole of $S$. Since $S$ is an even function, i.e. $S(\omega) = S(-\omega)$, $-(1+3j)$ and $-(1-3j)$ must also be poles. Indeed, we find
$$S(\omega) = \frac{1}{(\omega - (1+3j))(\omega - (1-3j))(\omega + 1 + 3j)(\omega + 1 - 3j)},$$
or, multiplying each term by $j$ (and using $j^4 = 1$) and rearranging terms:
$$S(\omega) = \underbrace{\frac{1}{(j\omega + 3 + j)(j\omega + 3 - j)}}_{S^+(\omega)}\;\underbrace{\frac{1}{(-j\omega + 3 + j)(-j\omega + 3 - j)}}_{S^-(\omega)},$$
or $S^+(\omega) = \frac{1}{(j\omega)^2 + 6j\omega + 10}$. The choice of $S^+$ is unique up to a multiplication by a unit magnitude constant.
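The factorization in Problem 9.14(b) can be checked with a few lines of complex arithmetic. The sketch below is illustrative only (the function names are assumptions): it evaluates $S(\omega)$ from the four poles $\pm 1 \pm 3j$ and compares it with $|S^+(\omega)|^2$ for $S^+(\omega) = 1/((j\omega)^2 + 6j\omega + 10)$. On the real axis both should equal $1/(\omega^4 + 16\omega^2 + 100)$.

```python
def s_plus(omega):
    """Causal factor S^+(omega) = 1 / ((j omega)^2 + 6 j omega + 10)."""
    jw = 1j * omega
    return 1.0 / (jw * jw + 6.0 * jw + 10.0)

def s_from_poles(omega):
    """S(omega) written as a product over the four poles +-1 +-3j."""
    poles = (1 + 3j, 1 - 3j, -1 - 3j, -1 + 3j)
    product = 1.0 + 0.0j
    for pole in poles:
        product *= (omega - pole)
    return 1.0 / product

def s_polynomial(omega):
    """Expanded real form 1 / (omega^4 + 16 omega^2 + 100)."""
    return 1.0 / (omega ** 4 + 16.0 * omega ** 2 + 100.0)

if __name__ == "__main__":
    for omega in (0.0, 0.5, 2.0, 7.3):
        print(omega, abs(s_plus(omega)) ** 2, s_from_poles(omega).real, s_polynomial(omega))
```

The three printed values in each row should agree up to rounding, which confirms that $S = S^+ S^-$ with $S^-(\omega) = (S^+(\omega))^*$ for real $\omega$.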
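9.16 Estimation of a random signal, using the KL expansion
(a) Note that $(Y, \phi_j) = (X, \phi_j) + (N, \phi_j)$ for all $j$, where the variables $(X, \phi_j)$, $j \ge 1$ and $(N, \phi_j)$, $j \ge 1$ are all mutually orthogonal, with $E[|(X, \phi_j)|^2] = \lambda_j$ and $E[|(N, \phi_j)|^2] = \sigma^2$. Observation of the process $Y$ is linearly equivalent to observation of $((Y, \phi_j) : j \ge 1)$. Since these random variables are orthogonal and all random variables are mean zero, the MMSE estimator is the sum of the projections onto the individual observations, $(Y, \phi_j)$. But for fixed $i$, only the $i$th observation, $(Y, \phi_i) = (X, \phi_i) + (N, \phi_i)$, is not orthogonal to $(X, \phi_i)$. Thus, the optimal linear estimator of $(X, \phi_i)$ given $Y$ is
$$\frac{\mathrm{Cov}((X, \phi_i), (Y, \phi_i))}{\mathrm{Var}((Y, \phi_i))}(Y, \phi_i) = \frac{\lambda_i(Y, \phi_i)}{\lambda_i + \sigma^2}.$$
The mean square error is (using the orthogonality principle):
$$E[|(X, \phi_i)|^2] - E\left[\left|\frac{\lambda_i(Y, \phi_i)}{\lambda_i + \sigma^2}\right|^2\right] = \lambda_i - \frac{\lambda_i^2(\lambda_i + \sigma^2)}{(\lambda_i + \sigma^2)^2} = \frac{\lambda_i\sigma^2}{\lambda_i + \sigma^2}.$$
(b) Since $f(t) = \sum_j (f, \phi_j)\phi_j(t)$, we have $(X, f) = \sum_j (f, \phi_j)(X, \phi_j)$. That is, the random variable to be estimated is the sum of the random variables of the form treated in part (a). Thus, the best linear estimator of $(X, f)$ given $Y$ can be written as the corresponding weighted sum of linear estimators:
$$(\text{MMSE estimator of } (X, f) \text{ given } Y) = \sum_i \frac{\lambda_i(Y, \phi_i)(f, \phi_i)}{\lambda_i + \sigma^2}.$$
The error in estimating $(X, f)$ is the sum of the errors for estimating the terms $(f, \phi_j)(X, \phi_j)$, and those errors are orthogonal. Thus, the mean square error for $(X, f)$ is the sum of the mean square errors of the individual terms:
$$(\text{MSE}) = \sum_i \frac{\lambda_i\sigma^2|(f, \phi_i)|^2}{\lambda_i + \sigma^2}.$$

9.18 Linear innovations and spectral factorization
First approach: The first approach is motivated by the fact that $\frac{1}{S_Y^+}$ is a whitening filter. Let $H(z) = \frac{\beta}{S_X^+(z)}$ and let $Y$ be the output when $X$ is passed through a linear time-invariant system with z-transform $H(z)$. We prove that $Y$ is the innovations process for $X$. Since $H$ is positive type and $\lim_{|z|\to\infty} H(z) = 1$, it follows that $Y_k = X_k + h(1)X_{k-1} + h(2)X_{k-2} + \cdots$. Since $S_Y(z) = H(z)H^*(1/z^*)S_X(z) \equiv \beta^2$, it follows that $R_Y(k) = \beta^2 I_{\{k=0\}}$. In particular,
$$Y_k \perp \text{linear span of } \{Y_{k-1}, Y_{k-2}, \cdots\}.$$
Since $H$ and $1/H$ both correspond to causal filters, the linear span of $\{Y_{k-1}, Y_{k-2}, \cdots\}$ is the same as the linear span of $\{X_{k-1}, X_{k-2}, \cdots\}$. Thus, the above orthogonality condition becomes
$$X_k - (-h(1)X_{k-1} - h(2)X_{k-2} - \cdots) \perp \text{linear span of } \{X_{k-1}, X_{k-2}, \cdots\}.$$
Therefore $-h(1)X_{k-1} - h(2)X_{k-2} - \cdots$ must equal $\hat{X}_{k|k-1}$, the one step predictor for $X_k$. Thus, $(Y_k)$ is the innovations sequence for $(X_k)$. The one step prediction error is $E[|Y_k|^2] = R_Y(0) = \beta^2$.
Second approach: The filter $K$ for the optimal one-step linear predictor $(\hat{X}_{k+1|k})$ is given by (take $T = 1$ in the general formula):
$$K = \frac{1}{S_X^+}\left[zS_X^+\right]_+.$$
The z-transform $zS_X^+$ corresponds to a function in the time domain with value $\beta$ at time $-1$, and value zero at all other negative times, so $[zS_X^+]_+ = zS_X^+ - z\beta$. Hence $K(z) = z - \frac{z\beta}{S_X^+(z)}$. If $X$ is filtered using $K$, the output at time $k$ is $\hat{X}_{k+1|k}$. So if $X$ is filtered using $1 - \frac{\beta}{S_X^+(z)}$, the output at time $k$ is $\hat{X}_{k|k-1}$. So if $X$ is filtered using $H(z) = 1 - \left(1 - \frac{\beta}{S_X^+(z)}\right) = \frac{\beta}{S_X^+(z)}$ then the output at time $k$ is $X_k - \hat{X}_{k|k-1} = \tilde{X}_k$, the innovations sequence. The output $\tilde{X}$ has $S_{\tilde{X}}(z) \equiv \beta^2$, so the prediction error is $R_{\tilde{X}}(0) = \beta^2$.

9.20 A discrete-time Wiener filtering problem
To begin,
$$\frac{z^T S_{XY}(z)}{S_Y^-(z)} = \frac{z^T}{\beta(1 - \rho/z)(1 - z_o\rho)} + \frac{z^{T+1}}{\beta(\frac{1}{z_o} - \rho)(1 - z_o z)}.$$
The right hand side corresponds in the time domain to the sum of an exponential function supported on $-T, -T+1, -T+2, \ldots$ and an exponential function supported on $-T-1, -T-2, \ldots$. If $T \ge 0$ then only the first term contributes to the positive part, yielding
$$\left[\frac{z^T S_{XY}}{S_Y^-}\right]_+ = \frac{z_o^T}{\beta(1 - \rho/z)(1 - z_o\rho)}, \qquad H(z) = \frac{z_o^T}{\beta^2(1 - z_o\rho)(1 - z_o/z)}, \qquad \text{and} \qquad h(n) = \frac{z_o^T}{\beta^2(1 - z_o\rho)}z_o^n I_{\{n\ge 0\}}.$$
On the other hand if $T \le 0$ then
$$\left[\frac{z^T S_{XY}}{S_Y^-}\right]_+ = \frac{z^T}{\beta(1 - \rho/z)(1 - z_o\rho)} + \frac{z(z^T - z_o^T)}{\beta(\frac{1}{z_o} - \rho)(1 - z_o z)},$$
so
$$H(z) = \frac{z^T}{\beta^2(1 - z_o\rho)(1 - z_o/z)} + \frac{z(z^T - z_o^T)(1 - \rho/z)}{\beta^2(\frac{1}{z_o} - \rho)(1 - z_o z)(1 - z_o/z)}.$$
Inverting the z-transforms and arranging terms yields that the impulse response function for the optimal filter is given by
$$h(n) = \frac{1}{\beta^2(1 - z_o^2)}\left\{z_o^{|n+T|} - \left(\frac{z_o - \rho}{\frac{1}{z_o} - \rho}\right)z_o^{n+T}\right\}I_{\{n\ge 0\}}. \qquad (12.13)$$
Graphically, $h$ is the sum of a two-sided symmetric exponential function, slid to the right by $-T$ and set to zero for negative times, minus a one sided exponential function on the nonnegative integers. (This structure can be deduced by considering that the optimal causal estimator of $X_{t+T}$ is the optimal causal estimator of the optimal noncausal estimator of $X_{t+T}$.) Going back to the z-transform domain, we find that $H$ can be written as
$$H(z) = \left[\frac{z^T}{\beta^2(1 - z_o/z)(1 - z_o z)}\right]_+ - \frac{z_o^T(z_o - \rho)}{\beta^2(1 - z_o^2)(\frac{1}{z_o} - \rho)(1 - z_o/z)}. \qquad (12.14)$$
Although it is helpful to think of the cases $T \ge 0$ and $T \le 0$ separately, interestingly enough, the expressions (12.13) and (12.14) for the optimal $h$ and $H$ hold for any integer value of $T$.

9.22 Estimation given a strongly correlated process
(a) $R_X = g * \tilde{g} \leftrightarrow S_X(z) = G(z)G^*(1/z^*)$, $R_Y = k * \tilde{k} \leftrightarrow S_Y(z) = K(z)K^*(1/z^*)$, and $R_{XY} = g * \tilde{k} \leftrightarrow S_{XY}(z) = G(z)K^*(1/z^*)$.
(b) Note that $S_Y^+(z) = K(z)$ and $S_Y^-(z) = K^*(1/z^*)$. By the formula for the optimal causal estimator,
$$H(z) = \frac{1}{S_Y^+}\left[\frac{S_{XY}}{S_Y^-}\right]_+ = \frac{1}{K(z)}\left[\frac{G(z)K^*(1/z^*)}{K^*(1/z^*)}\right]_+ = \frac{[G]_+}{K} = \frac{G(z)}{K(z)}.$$
(c) The power spectral density of the estimator process $\hat{X}$ is given by $H(z)H^*(1/z^*)S_Y(z) = S_X(z)$. Therefore,
$$\mathrm{MSE} = R_X(0) - R_{\hat{X}}(0) = \int_{-\pi}^{\pi} S_X(e^{j\omega})\frac{d\omega}{2\pi} - \int_{-\pi}^{\pi} S_{\hat{X}}(e^{j\omega})\frac{d\omega}{2\pi} = 0.$$
A simple explanation for why the MSE is zero is the following. Using $\frac{1}{K}$ inverts $K$, so that filtering $Y$ with $\frac{1}{K}$ produces the process $W$. Filtering that with $G$ then yields $X$. That is, filtering $Y$ with $H$ produces $X$, so the estimation error is zero.

10.2 A covering problem
(a) Let $X_i$ denote the location of the $i$th base station. Then $F = f(X_1, \ldots, X_m)$, where $f$ satisfies the Lipschitz condition with constant $(2r-1)$. Thus, by the method of bounded differences based on the Azuma-Hoeffding inequality, $P\{|F - E[F]| \ge \gamma\} \le 2\exp\left(-\frac{\gamma^2}{m(2r-1)^2}\right)$.
(b) Using the Poisson method and associated bound technique, we compare to the case that the number of stations has a Poisson distribution with mean $m$. Note that the mean number of stations that cover cell $i$ is $\frac{m(2r-1)}{n}$, unless cell $i$ is near one of the boundaries. If cells 1 and $n$ are covered, then all the other cells within distance $r$ of either boundary are covered. Thus,
$$P\{X \ge m\} \le 2P\{\mathrm{Poi}(m) \text{ stations is not enough}\} \le 2ne^{-m(2r-1)/n} + P\{\text{cell 1 or cell } n \text{ is not covered}\} \to 0 \text{ as } n \to \infty$$
if $m = \frac{(1+\epsilon)n\ln n}{2r-1}$. As to a bound going the other direction, note that if cells differ by $2r-1$ or more then the events that they are covered are independent. Hence,
$$P\{X \le m\} \le 2P\{\mathrm{Poi}(m) \text{ stations cover all cells}\} \le 2P\left\{\mathrm{Poi}(m) \text{ stations cover cells } 1 + (2r-1)j,\ 1 \le j \le \tfrac{n-1}{2r-1}\right\}$$
$$\le 2\left(1 - e^{-\frac{m(2r-1)}{n}}\right)^{\frac{n-1}{2r-1}} \le 2\exp\left(-e^{-\frac{m(2r-1)}{n}}\cdot\frac{n-1}{2r-1}\right) \to 0 \text{ as } n \to \infty$$
if $m = \frac{(1-\epsilon)n\ln n}{2r-1}$. Thus, in conclusion, we can take $g_1(r) = g_2(r) = \frac{1}{2r-1}$.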
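To see how the two bounds in Problem 10.2(b) behave, it can help to plug in numbers. The following sketch is a hedged illustration with hypothetical function names: it evaluates the leading term $2ne^{-m(2r-1)/n}$ of the upper-tail bound at $m = (1+\epsilon)n\ln n/(2r-1)$, where it simplifies to $2n^{-\epsilon}$, and the lower-tail bound $2\exp(-e^{-m(2r-1)/n}(n-1)/(2r-1))$ at $m = (1-\epsilon)n\ln n/(2r-1)$. The boundary term $P\{\text{cell 1 or cell } n \text{ is not covered}\}$ is not modeled here.

```python
import math

def upper_tail_term(n, r, eps):
    """2 n exp(-m (2r-1) / n) at m = (1 + eps) n ln(n) / (2r - 1); equals 2 n^(-eps)."""
    m = (1.0 + eps) * n * math.log(n) / (2 * r - 1)
    return 2.0 * n * math.exp(-m * (2 * r - 1) / n)

def lower_tail_term(n, r, eps):
    """2 exp(-exp(-m (2r-1) / n) (n-1) / (2r-1)) at m = (1 - eps) n ln(n) / (2r - 1)."""
    m = (1.0 - eps) * n * math.log(n) / (2 * r - 1)
    blocks = (n - 1) / (2 * r - 1)  # number of cells spaced 2r - 1 apart
    return 2.0 * math.exp(-math.exp(-m * (2 * r - 1) / n) * blocks)

if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 4, 10 ** 6):
        print(n, upper_tail_term(n, r=3, eps=0.5), lower_tail_term(n, r=3, eps=0.5))
```

Both columns shrink as $n$ grows for any fixed $\epsilon > 0$, which is the content of the claim that $m$ concentrates around $n\ln n/(2r-1)$.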
Therefore E[(|Xi +Yi |IA ] ≤ E[|Xi |IA ]+E[|Yi |IA ] ≤ for all i. u Also. so that limu→∞ ϕ(u) = +∞. Thus. (Xi + Yi : i ∈ I) is u. 1 ≤ j ≤ 2 1−e − m(2r−1) n m(2r−1) n n−1 2r−1 n−1 } 2r − 1 2 exp(−e− → 0 as n → ∞ n−1 ) 2r − 1 (1 − )n ln n if m = 2r − 1 · 1 2r−1 . ∞) for all n. were K(c) = supi∈I E[|Xi |I{|Xi |≥c} ]. Thus. we can take g1 (r) = g2 (r) = 10.i. supi (E[|Xi |] + E[|Yi |]) ≤ (supi E[|Xi |]) + (supi E[|Yi |]) < ∞ Second. Select co so large that u ≥ co implies that u ≤ ϕ(u). The function ϕ is convex and increasing because each term in the sum is. The converse is now proved. and let > 0. P {X ≤ m} ≤ ≤ ≤ ≤ 2P {Poi(m) stations cover all cells} 2P {Poi(m) stations cover cells 1 + (2r − 1)j. Let cn be a sequence of positive numbers monotonically converging to +∞ so that K(cn ) ≤ 2−n .. the function ϕ has the n=1 n=1 required properties. whenever A is an event with P {A} ≤ .i.i. for any i ∈ I and c. Then. E[|Xi |I{|Xi |≥c} ] ≤ E[ϕ(|Xi |)I{|Xi |≥c} ] ≤ E[ϕ(|Xi |)] ≤ K . unless cell i is near one of the boundaries. n then all the other cells within distance r of either boundary are covered. in conclusion. P {|F − E[F ]| ≥ γ} ≤ 2 exp(− m(2r−1)2 ).399 γ on the Azuma-Hoeffding inequality. Note that the mean number of stations that cover cell i is m(2r−1) . If cells 1 and n are covered. (b) Using the Possion method and associated bound technique. given > 0. only n=1 finitely many terms are nonzero. (b) Suppose (Xi : i ∈ I) is u. Hence. Thus.4 On uniform integrability (a) Use the small set characterization of u. note that if cells differ by 2r − 1 or more then the events that they are covered are independent. The sum is well defined. which by definition means that limc→∞ K(c) = 0. E[ϕ(|Xi |)] ≤ ∞ E[(|Xi | − cn )+ ] ≤ ∞ 2−n ≤ 1. E[Sτ ] = E[S0 ] = 0. so b = n(1 − (1 − n )n ) ≈ n(1 − e−d ) works. so 0 = E[Sn∧τ ] → E[Sτ ] as n → ∞. An upper bound is given by the expected number of vertices of V with degree at d least one.i. 
The (improved version of the) √ 2 Azuma-Hoeffding inequality yields that P {|Z − E[Z]| ≥ γ n} ≤ 2e−2γ . so that E[τ ] ≤ (c+a∧b) . 2 ]/σ 2 .6 A stopped random walk (a) By the fact S0 = 0 and the choice of τ . Since the W ’s are bounded by D. If d = 1. (c) Process the vertices of U one at a time. Therefore. (b) The variable Z can be expressed as Z = F (V1 . and such a matching 1 is a maximum matching with expected size n(1 − (1 − n )n ) ≥ n(1 − e−1 ). Note here that we are taking it for granted that T < ∞ with probability one.8 On the size of a maximum matching in a random bipartite graph (a) Many bounds are possible. 10. Remove edges from the graph that become ineligible for the matching. or E[τ ∧ n] = E[Sn∧τ 2 |Sn∧τ | ≤ c + a ∧ b for all n. Then Mn+1 − Mn = 2Wn+1 Sn + Wn+1 − σ 2 so that 2 E[Mn+1 − Mn |Fn ] = 2E[Wn+1 |Fn ]Sn + E[Wn+1 − σ 2 ] = 0. |Sn | ≤ c for 0 ≤ n ≤ τ − 1. or both. and F satisfies the Lipschitz condition with constant 1. Generally the tighter bounds are more complex. (This algorithm can be improved by selecting the next vertex in U from among those with smallest reduced degree. Select an unprocessed vertex in U . . by the dominated convergence theorem. (Or we could invoke the first or third form of the optional stopping theorem discussed in class. ) 2 (b) Let Fn = σ(W1 . But E[τ ∧ n] → E[τ ] as n → ∞ by σ2 e the monotone convergence theorem. the resulting matching has maximum cardinality. 2 10. select one of the edges associated with it and add it to the matching. E[τ ∧ n] ≤ (c+a∧b) for all n. . If every matched edge has an endpoint with reduced degree one at the time of matching.400 Since this inequality holds for all i ∈ I. we see that the sequence E[Mτ ∧n ] = 0 for all n. and also selecting outgoing edges with endpoint in V having the smallest reduced degree in V . so a = n(1 − e−1 ) is a lower bound for any d and n. Therefore. . then every vertex in V of degree one or more can be included in a single matching. it follows that |Sτ | ≤ c + D. 
Vn ). We are thus considering a martingale associated with Z by exposing the vertices of U one at a time. Wn ). so by the third version of the optional stopping theorem discussed in class. ) . SOLUTIONS TO PROBLEMS is arbitrary. Sn∧τ is a bounded sequence of random variables. and reduce the degrees of the unprocessed vertices correspondingly. The cardinality of a maximum matching cannot decrease as more edges are added. . it follows that (Xi : i ∈ I) is u. σ2 e (c) We have that E[|Sn+1 − Sn ||Fn ] = C for all n and E[τ ] < ∞ by part (b). where Vi is the set of neighbors of ui . . we could appeal to part (b) or some other method. Arguing as in part a. Thus. . If we don’t want to make that assumption. . and CHAPTER 12. . so that M is a martingale. If the vertex has positive degree. 6 of a random process. 53. 328 criterion for m. 8. convergence in correlation correlation coefficient. 47 247 in probability. 208 Bernoulli distribution. 73 characteristic function count times. 232 pseudo-covariance matrix. 326 of a random vector. 58 matrix. 7 almost sure. 43 Brownian motion. 211 Borel sets. see covariance function 401 . 330 Baum-Welch algorithm.s. 73 joint. 54 cross correlation matrix. 247 criterion for convergence. 258 of a function. 267 cross covariance matrix. 52. in distribution. 328 function. 25. 62 e conditional expectation. 269 completeness matrix. 207 binomial distribution. 269 of an orthogonal basis. 268 of the real numbers. 43. 41 bounded convergence theorem. 59 Cauchy convolution. 28 probability. 113 Chebychev inequality. 161 of a function at a point. 75 counting process. autocorrelation function. 206 pseudo-covariance function. 74 of a probability space.s. 26 form. 21 countably infinite. see correlation function autocovariance function. 113 of a random variable. 21 of a random process at a point. 20 covariance circular symmetry. 5 baseband conjugate prior. 3 convergence of sequences Borel-Cantelli lemma. 28.. 259 continuity signal. 330 Bayes’ formula. 
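The basic greedy procedure of part (c) of 10.8 can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from the notes: the function names `random_bipartite_neighbors` and `greedy_matching` are hypothetical, vertices of U and V are labeled 0, ..., n-1, and the rule of taking the smallest-numbered eligible neighbor is an arbitrary tie-breaking choice. The improved variant based on reduced degrees is not implemented.

```python
import random


def random_bipartite_neighbors(n, d, rng):
    """Return [V_0, ..., V_{n-1}], where V_i is the neighbor set of u_i.

    Each V_i is a uniformly random d-element subset of {0, ..., n-1},
    chosen independently (hypothetical helper for illustration).
    """
    return [set(rng.sample(range(n), d)) for _ in range(n)]


def greedy_matching(neighbors):
    """Process the vertices of U one at a time and build a matching.

    A vertex of V that is already matched makes every edge into it
    ineligible; skipping such vertices plays the role of removing those
    edges and reducing the degrees of the unprocessed vertices.
    """
    matched_v = set()
    matching = []
    for u, nbrs in enumerate(neighbors):
        # Edges of u that are still eligible for the matching.
        eligible = [v for v in sorted(nbrs) if v not in matched_v]
        if eligible:  # u has positive reduced degree
            v = eligible[0]  # arbitrary choice among eligible edges
            matching.append((u, v))
            matched_v.add(v)
    return matching


if __name__ == "__main__":
    rng = random.Random(1)
    n, d = 1000, 3
    size = len(greedy_matching(random_bipartite_neighbors(n, d, rng)))
    print(f"greedy matching size for n={n}, d={d}: {size}")
```

For d = 1 this procedure matches every vertex of V that has degree at least one, which is the maximum matching described in part (a).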