
Stat 111 – Introduction to Statistical Inference
Section 8: Bayesian
Mar 26 - Mar 30, 2018
Kin Wai Chan and Sanqian Zhang
Department of Statistics, Harvard University

1 Sufficiency

Definition. Let $Y$ be a sample from the model $F_Y(y \mid \theta)$. The statistic $T(Y)$ is a sufficient statistic for $\theta$ if the conditional distribution of $Y$ given $T(Y)$ is free of $\theta$.

Example 1.1. Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Po}(\lambda)$. Using the definition of sufficiency, show that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\lambda$.

Solution. Based on the definition, we want to show that $P(X_1 = x_1, \dots, X_n = x_n \mid T = t)$ is free of $\lambda$. By definition,
$$P(X_1 = x_1, \dots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, \dots, X_n = x_n, T = t)}{P(T = t)}.$$
Note that
i) $T \sim \mathrm{Po}(n\lambda)$;
ii) $P(X_1 = x_1, \dots, X_n = x_n, T = t) > 0$ only if $x_i \ge 0$ for all $i$ and $\sum_{i=1}^n x_i = t$.
For $x_i \ge 0$ and $\sum_{i=1}^n x_i = t$, we have
$$P(X_1 = x_1, \dots, X_n = x_n \mid T = t) = \frac{\prod_{i=1}^n e^{-\lambda} \lambda^{x_i} / x_i!}{e^{-n\lambda} (n\lambda)^t / t!} = \frac{t!\, e^{-n\lambda} \prod_{i=1}^n \lambda^{x_i}}{\prod_{i=1}^n x_i!\; e^{-n\lambda} (n\lambda)^{t}} = \frac{t!}{\prod_{i=1}^n x_i!} \prod_{i=1}^n \left(\frac{1}{n}\right)^{x_i}.$$
Hence $(X_1, \dots, X_n \mid T) \sim \mathrm{Multinomial}\big(T, (n^{-1}, \dots, n^{-1})\big)$. This conditional distribution does not depend on $\lambda$, so by the definition of sufficiency, $T$ is a sufficient statistic for $\lambda$. When $n = 2$, this is a special case of the Chicken and Egg story (Thm 7.1.10 in Blitzstein and Hwang).

Factorization Theorem. If we can write $f_Y(y \mid \theta) = g(\theta, T(y))\, h(y)$ for non-negative functions $g$ and $h$, then $T(y)$ is sufficient.

Example 1.2. Let $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(\theta - 1, \theta + 1)$. Find the sufficient statistic for $\theta$.

Solution. The pdf is
$$f_X(x \mid \theta) = \tfrac{1}{2}\, \mathbf{1}(\theta - 1 \le x \le \theta + 1).$$
As a result, the likelihood is
$$L(\theta; x) = \prod_{i=1}^n \tfrac{1}{2}\, \mathbf{1}(\theta - 1 \le x_i \le \theta + 1) = \left(\tfrac{1}{2}\right)^n \mathbf{1}(\theta - 1 \le x_1, \dots, x_n \le \theta + 1) = \left(\tfrac{1}{2}\right)^n \mathbf{1}(\theta - 1 \le x_{(1)})\, \mathbf{1}(\theta + 1 \ge x_{(n)}).$$
By the factorization theorem, $(X_{(1)}, X_{(n)})$ is a sufficient statistic for $\theta$. From this example, we also see that the dimension of a sufficient statistic can be larger than the dimension of the parameter $\theta$.
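A quick way to see Example 1.1 numerically is to simulate: under two different values of $\lambda$, generate samples, keep those with $\sum_i x_i = t$, and compare the conditional frequencies of $X_1$ with the $\mathrm{Bin}(t, 1/n)$ marginal implied by the multinomial result. A minimal R sketch (the values of n, t_obs and lambda below are arbitrary illustrations, not part of the notes):

    # Check: distribution of X1 given T = t does not depend on lambda (Example 1.1)
    set.seed(111)
    n <- 5; t_obs <- 10

    cond_freq_X1 <- function(lambda, reps = 2e5) {
      X <- matrix(rpois(reps * n, lambda), nrow = reps)   # reps draws of (X1, ..., Xn)
      keep <- rowSums(X) == t_obs                         # condition on T = t_obs
      table(factor(X[keep, 1], levels = 0:t_obs)) / sum(keep)
    }

    round(rbind(lambda_1    = cond_freq_X1(1),
                lambda_3    = cond_freq_X1(3),
                theoretical = dbinom(0:t_obs, t_obs, 1/n)), 3)

Both simulated rows should be close to the theoretical $\mathrm{Bin}(t, 1/n)$ row, regardless of which $\lambda$ generated the data.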
2 Bayesian Inference

2.1 Prior Distribution

A key feature of Bayesian inference is that we assume a prior distribution $f(\theta)$ on the parameter(s). Here are some types of priors:

• Proper vs. improper priors
  – A proper prior is a prior on $\theta$ that is a probability distribution. A proper prior always leads to a proper posterior.
  – An improper prior is a prior on $\theta$ that is not a probability distribution, for example $f(\theta) \propto 1$ on $\theta \in \mathbb{R}$, or $f(\theta) \propto \theta^{-1}$ on $\theta > 0$. When you use an improper prior, it is important to check that the posterior is proper.
• Conjugate priors
  – Under a model $f(y \mid \theta)$, a family of prior distributions $f(\theta)$ is a conjugate prior if the posterior distribution belongs to the same family as the prior distribution. Common examples include Normal-Normal, Gamma-Poisson, and Beta-Binomial.
• Informative vs. uninformative priors
  – An informative prior gives explicit information about the parameter. In applied studies, this may come from previous literature.
  – An uninformative prior is a prior that gives little information about the parameter (so inference is driven by the likelihood). However, there is no consensus on what exactly an uninformative prior means or what counts as one. An example of an uninformative prior is Jeffreys' prior.

2.2 Posterior Distribution

Under the Bayesian framework, the key object of interest is $f(\theta \mid y)$, the posterior distribution. Inferential tasks related to $\theta$ can be completed by making reference to the posterior distribution. By Bayes' rule, the posterior distribution is
$$f(\theta \mid y) = \frac{f(y \mid \theta) f(\theta)}{f(y)}, \qquad \text{where } f(y) = \int f(y \mid \theta) f(\theta)\, d\theta.$$
Note that the denominator $f(y)$ does not depend on $\theta$; when $y$ is fixed, it can be treated as a constant. We typically express the posterior distribution up to a proportionality constant:
$$f(\theta \mid y) \propto f(y \mid \theta) f(\theta).$$

Example 2.1. We have $y_1, \dots, y_n \overset{\text{iid}}{\sim} \mathrm{Bin}(N, \theta)$. Consider the following inference problems:
(a) $N$ known but $\theta$ unknown. Assume a prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Derive $f(\theta \mid y_{1:n})$.
(b) $\theta$ known but $N$ unknown. Assume an improper prior $f(N) \propto N^{-1}$. Derive $f(N \mid y_{1:n})$.

Solution.
(a)
$$f(\theta \mid y_{1:n}) \propto f(y_{1:n} \mid \theta) f(\theta) \propto \left( \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{N - y_i} \right) \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \propto \theta^{\alpha + \sum y_i - 1} (1 - \theta)^{\beta + nN - \sum y_i - 1},$$
so $\theta \mid y_{1:n} \sim \mathrm{Beta}\big(\alpha + \sum y_i,\; \beta + nN - \sum y_i\big)$.
(b)
$$f(N \mid y_{1:n}) \propto \left( \prod_{i=1}^n \frac{N!}{y_i!\,(N - y_i)!}\, \theta^{y_i} (1 - \theta)^{N - y_i}\, \mathbf{1}(N \ge y_i) \right) \frac{1}{N} \propto \left[ \prod_{i=1}^n \frac{N!}{(N - y_i)!} \right] \big((1 - \theta)^n\big)^N\, N^{-1}\, \mathbf{1}(N \ge y_{(n)}).$$

2.3 Predictive Distributions

2.3.1 Prior Predictive Distribution

Before we observe any data, our prediction about the unobserved data $y$ is based on the model and the prior. The marginal distribution of $y$, or the prior predictive distribution of $y$, is
$$f(y) = \int f(y \mid \theta) f(\theta)\, d\theta.$$

2.3.2 Posterior Predictive Distribution

After observing data $y$, we can make predictions about potential new observations $\tilde{y}$ from the same process. The posterior predictive distribution is
$$f(\tilde{y} \mid y) = \int f(\tilde{y} \mid \theta, y) f(\theta \mid y)\, d\theta = \int f(\tilde{y} \mid \theta) f(\theta \mid y)\, d\theta, \quad \text{if } \tilde{y} \text{ and } y \text{ are conditionally independent given } \theta.$$

Example 2.2 (Normal-Normal model with known variance). Suppose we observe data $y \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, and assume the prior $\mu \sim N(\mu_0, \tau_0^2)$. Derive the prior predictive distribution $p(y)$, the posterior distribution $p(\mu \mid y)$, and the posterior predictive distribution $p(\tilde{y} \mid y)$.

Solution.

Method 1: using a representation and the m.g.f. Using the representation $y = \mu + \sigma Z$, where $\mu \sim N(\mu_0, \tau_0^2)$ and $Z \sim N(0, 1)$ with $Z \perp \mu$, we have
$$\begin{pmatrix} \mu \\ y \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_0 \\ \mu_0 \end{pmatrix}, \begin{pmatrix} \tau_0^2 & \tau_0^2 \\ \tau_0^2 & \tau_0^2 + \sigma^2 \end{pmatrix} \right).$$
Using properties of the multivariate normal distribution, it follows that $y \sim N(\mu_0, \tau_0^2 + \sigma^2)$ and $\mu \mid y \sim N(\mu_1, \tau_1^2)$, where
$$B = \frac{\sigma^2}{\tau_0^2 + \sigma^2}, \qquad \mu_1 = B\mu_0 + (1 - B)y, \qquad \tau_1^2 = B\tau_0^2.$$
Consider the m.g.f. of $\tilde{y}$ given $y$:
$$E\big[e^{t\tilde{y}} \mid y\big] = E\big[E\big[e^{t\tilde{y}} \mid y, \mu\big] \mid y\big] = E\big[E\big[e^{t\tilde{y}} \mid \mu\big] \mid y\big] = E\big[e^{t\mu + 0.5\sigma^2 t^2} \mid y\big] = e^{t\mu_1 + 0.5 t^2 \tau_1^2 + 0.5\sigma^2 t^2}.$$
This implies $\tilde{y} \mid y \sim N(\mu_1, \tau_1^2 + \sigma^2)$.

Method 2: working with densities directly. For the prior predictive distribution of $y$,
$$\begin{aligned}
p(y) &= \int_{-\infty}^{\infty} p(y \mid \mu)\, p(\mu)\, d\mu \\
&\propto \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2\sigma^2}(y - \mu)^2 \right\} \exp\left\{ -\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 \right\} d\mu \\
&\propto \exp\left\{ -\frac{y^2}{2\sigma^2} \right\} \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\left[ \mu^2\left( \frac{1}{\sigma^2} + \frac{1}{\tau_0^2} \right) - 2\mu\left( \frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2} \right) \right] \right\} d\mu \\
&\propto \exp\left\{ -\frac{y^2}{2\sigma^2} \right\} \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2B\tau_0^2}\Big[ \mu^2 - 2\mu\big(y(1 - B) + B\mu_0\big) \Big] \right\} d\mu \\
&\propto \exp\left\{ -\frac{y^2}{2\sigma^2} \right\} \exp\left\{ \frac{\big(y(1 - B) + B\mu_0\big)^2}{2B\tau_0^2} \right\} \qquad \text{(complete the square in } \mu \text{ and integrate out the normal kernel)} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \frac{y^2}{\sigma^2} - \frac{y^2(1 - B)^2}{B\tau_0^2} - \frac{2y(1 - B)\mu_0 B}{B\tau_0^2} \right] \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \frac{y^2}{\sigma^2 + \tau_0^2} - \frac{2y\mu_0}{\sigma^2 + \tau_0^2} \right] \right\} \\
&\propto \exp\left\{ -\frac{1}{2(\sigma^2 + \tau_0^2)}(y - \mu_0)^2 \right\}. \qquad \text{(complete the square in } y\text{)}
\end{aligned}$$
Hence $y \sim N(\mu_0, \sigma^2 + \tau_0^2)$.

For the posterior distribution $p(\mu \mid y)$,
$$\begin{aligned}
p(\mu \mid y) &\propto p(y \mid \mu)\, p(\mu) \\
&\propto \exp\left\{ -\frac{1}{2\sigma^2}(y - \mu)^2 \right\} \exp\left\{ -\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \frac{\mu^2 - 2\mu y + y^2}{\sigma^2} + \frac{\mu^2 - 2\mu\mu_0 + \mu_0^2}{\tau_0^2} \right] \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \mu^2\left( \frac{1}{\sigma^2} + \frac{1}{\tau_0^2} \right) - 2\mu\left( \frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2} \right) \right] \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left( \frac{1}{\sigma^2} + \frac{1}{\tau_0^2} \right)\left[ \mu^2 - 2\mu\left( \frac{1}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1}\left( \frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2} \right) \right] \right\} \\
&\propto \exp\left\{ -\frac{1}{2B\tau_0^2}\big[ \mu - (B\mu_0 + (1 - B)y) \big]^2 \right\}, \qquad \text{(complete the square)}
\end{aligned}$$
so $\mu \mid y \sim N(\mu_1, \tau_1^2)$ with $\mu_1$ and $\tau_1^2$ as in Method 1.

For the posterior predictive distribution $p(\tilde{y} \mid y)$,
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \mu)\, p(\mu \mid y)\, d\mu \propto \int \exp\left\{ -\frac{1}{2\sigma^2}(\tilde{y} - \mu)^2 \right\} \exp\left\{ -\frac{1}{2\tau_1^2}(\mu - \mu_1)^2 \right\} d\mu \qquad \text{(the calculus is the same as above)},$$
which gives $\tilde{y} \mid y \sim N(\mu_1, \tau_1^2 + \sigma^2)$.
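The closed forms in Example 2.2 can be checked by simulation: draw $\mu$ from the posterior $N(\mu_1, \tau_1^2)$, then $\tilde{y}$ from $N(\mu, \sigma^2)$, and compare the Monte Carlo mean and variance of $\tilde{y}$ with $N(\mu_1, \tau_1^2 + \sigma^2)$. A rough R sketch, with illustrative values of mu0, tau0, sigma and y (not taken from the notes):

    # Monte Carlo check of the posterior predictive in Example 2.2
    set.seed(8)
    mu0 <- 0; tau0 <- 2; sigma <- 1; y <- 1.5

    B    <- sigma^2 / (tau0^2 + sigma^2)      # shrinkage factor
    mu1  <- B * mu0 + (1 - B) * y             # posterior mean
    tau1 <- sqrt(B) * tau0                    # posterior sd, so tau1^2 = B * tau0^2

    m      <- 1e6
    mu_s   <- rnorm(m, mu1, tau1)             # draws from the posterior of mu
    ytilde <- rnorm(m, mu_s, sigma)           # draws from the posterior predictive

    c(mc_mean = mean(ytilde), mc_var = var(ytilde))       # Monte Carlo estimates
    c(theory_mean = mu1, theory_var = tau1^2 + sigma^2)   # N(mu1, tau1^2 + sigma^2)

The two printed lines should agree up to Monte Carlo error.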
2.4 Bayesian Inference as a Sampling Problem

Bayesian inference is performed based on the posterior distribution $p(\theta \mid y)$. From the posterior distribution we can construct point estimators and credible intervals for $\theta$. The typical point estimators that we look at are:

• MAP: the value of $\theta$ that maximizes the posterior distribution of $\theta$;
• $\mathrm{median}(\theta \mid y)$: the estimator that minimizes the posterior expected loss under the loss function $L(\theta, c) = |\theta - c|$;
• $E[\theta \mid y]$: the estimator that minimizes the posterior expected loss under the squared loss function $L(\theta, c) = (\theta - c)^2$.

Our posterior distribution looks like
$$f(\theta \mid y) = \frac{1}{C} f(y \mid \theta) f(\theta), \qquad \text{where } C = \int f(y \mid \theta) f(\theta)\, d\theta$$
is a normalizing constant. This normalizing constant may not be easy to compute, yet computing estimators such as the posterior median and posterior mean directly would typically require knowing $C$. To bypass the problem of calculating $C$, we can frame our inference problem as a sampling problem. That is, if we can sample $s_1, \dots, s_m$ from $f(\theta \mid y)$ (without knowing $C$), how can we construct point estimators for $\theta$? How about credible intervals?

Some nice theoretical results (which are out of our scope!) tell us that, under certain appropriate sampling procedures,
$$\frac{1}{m}\sum_{j=1}^{m} h(s_j) \to E[h(\theta) \mid y], \qquad m \to \infty.$$
The implication is that if we want to use $E[\theta \mid y]$ as a point estimator, we can estimate it by $\frac{1}{m}\sum_{j=1}^m s_j$. Similarly, we can use $\mathrm{median}(s_1, \dots, s_m)$ to estimate $\mathrm{median}(\theta \mid y)$, and we can estimate credible intervals from appropriate quantiles of the sample.

Example 2.3. $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(\theta - 1, \theta + 1)$. We assume a flat prior on $\theta$. What is the posterior distribution of $\theta$? Now suppose we have a sample $s_1, \dots, s_m$ from the posterior distribution of $\theta$. Describe how you would construct:
i) a point estimate of $\theta$;
ii) a 95% credible interval for $\theta$ (what is the interpretation of this credible interval?);
iii) a point estimate of $\theta^2$;
iv) a 95% credible interval for $\theta^2$.

Solution. From the previous exercise, we know that $L(\theta; x) \propto \mathbf{1}(\theta - 1 \le x_{(1)})\, \mathbf{1}(\theta + 1 \ge x_{(n)})$. The question specifies a flat prior on $\theta$; since $\theta \in \mathbb{R}$, this is the improper prior $f(\theta) \propto 1$. The posterior is
$$f(\theta \mid x) \propto 1 \times \mathbf{1}(\theta - 1 \le x_{(1)})\, \mathbf{1}(\theta + 1 \ge x_{(n)}) \propto \mathbf{1}(\theta \le x_{(1)} + 1)\, \mathbf{1}(\theta \ge x_{(n)} - 1) \propto \mathbf{1}(x_{(n)} - 1 \le \theta \le x_{(1)} + 1).$$
Hence $\theta \mid x \sim \mathrm{Unif}(x_{(n)} - 1, x_{(1)} + 1)$.

i) We can use either the posterior mean or the posterior median as the estimator. If we use the posterior mean, our estimator based on the samples is $\frac{1}{m}\sum_{j=1}^m s_j$; if we use the posterior median, our estimator is the median of the $m$ samples.

ii) A 95% credible interval can be constructed as $\big(s_{(\lceil 0.025m \rceil)}, s_{(\lceil 0.975m \rceil)}\big)$, the empirical 2.5% and 97.5% quantiles of the sample. In R, this would be quantile(posteriorsamples, c(0.025, 0.975)). This interval can be interpreted as: given the data, there is a 95% probability that the parameter $\theta$ falls in this region.

iii) Our estimator can be $\frac{1}{m}\sum_{j=1}^m s_j^2$ or the empirical median of $s_1^2, \dots, s_m^2$, depending on which estimator is chosen.

iv) We can take the empirical 2.5% and 97.5% quantiles of $s_1^2, \dots, s_m^2$.
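The recipes in Example 2.3 translate directly into R. In this sketch the data-generating value of $\theta$ and the sizes n and m are illustrative choices (not from the notes); since the posterior here is a known Uniform, we can draw $s_1, \dots, s_m$ from it directly:

    # Example 2.3: posterior sampling for theta | x ~ Unif(x_(n) - 1, x_(1) + 1)
    set.seed(2018)
    theta_true <- 3                                  # illustrative true value
    n <- 50; m <- 1e5
    x <- runif(n, theta_true - 1, theta_true + 1)    # observed data

    s <- runif(m, max(x) - 1, min(x) + 1)            # m posterior draws s_1, ..., s_m

    mean(s); median(s)                               # i)   point estimates of theta
    quantile(s, c(0.025, 0.975))                     # ii)  95% credible interval for theta
    mean(s^2); median(s^2)                           # iii) point estimates of theta^2
    quantile(s^2, c(0.025, 0.975))                   # iv)  95% credible interval for theta^2

With an MCMC sampler in place of runif(), the last four lines would be unchanged: the estimators depend only on the posterior draws, not on the normalizing constant $C$.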