171:290 Model Selection
Lecture V: The Bayesian Information Criterion

Joseph E. Cavanaugh
Department of Biostatistics
Department of Statistics and Actuarial Science
The University of Iowa

September 25, 2012

Introduction

BIC, the Bayesian information criterion, was introduced by Schwarz (1978) as a competitor to the Akaike (1973, 1974) information criterion. Schwarz derived BIC to serve as an asymptotic approximation to a transformation of the Bayesian posterior probability of a candidate model.

In large-sample settings, the fitted model favored by BIC ideally corresponds to the candidate model which is a posteriori most probable, i.e., the model rendered most plausible by the data at hand. The computation of BIC is based on the empirical log-likelihood and does not require the specification of priors.

In Bayesian applications, pairwise comparisons between models are often based on Bayes factors. Assuming two candidate models are regarded as equally probable a priori, a Bayes factor represents the ratio of the posterior probabilities of the models. The model which is a posteriori most probable is determined by whether the Bayes factor is less than or greater than one.

In certain settings, model selection based on BIC is roughly equivalent to model selection based on Bayes factors (Kass and Raftery, 1995; Kass and Wasserman, 1995). Thus, BIC has appeal in many Bayesian modeling problems where priors are hard to set precisely.

Outline:
- Overview of BIC
- Derivation of BIC
- BIC and Bayes Factors
- BIC versus AIC
- Use of BIC

Overview of BIC

Key constructs:
- True or generating model: g(y).
- Candidate or approximating model: f(y | θ_k).
- Fitted model: f(y | θ̂_k).
- Candidate class: F(k) = { f(y | θ_k) | θ_k ∈ Θ(k) }.

Akaike information criterion:

    \mathrm{AIC} = -2 \ln f(y \mid \hat{\theta}_k) + 2k.

Bayesian (Schwarz) information criterion:

    \mathrm{BIC} = -2 \ln f(y \mid \hat{\theta}_k) + k \ln n.

AIC and BIC feature the same goodness-of-fit term, but the penalty term of BIC is more stringent than the penalty term of AIC. (For n ≥ 8, k ln n exceeds 2k.) Consequently, BIC tends to favor smaller models than AIC.

AIC provides an asymptotically unbiased estimator of the expected Kullback discrepancy between the generating model and the fitted approximating model.
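To make the two criteria concrete, the following minimal sketch fits a Gaussian linear regression to simulated data by least squares and computes AIC and BIC from the maximized log-likelihood. The data, the model, and the convention of counting the error variance in k (so k = 3 here) are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

# Simulated data for illustration (hypothetical)
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Least-squares fit of y on (1, x): the ML fit under Gaussian errors
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid**2)            # ML estimate of the error variance

# Maximized log-likelihood ln f(y | theta_hat)
loglik = (-0.5 * n * np.log(2 * np.pi * sigma2_hat)
          - 0.5 * np.sum(resid**2) / sigma2_hat)

k = X.shape[1] + 1                        # 2 coefficients + 1 variance (assumed convention)
aic = -2 * loglik + 2 * k                 # AIC penalty: 2k
bic = -2 * loglik + k * np.log(n)         # BIC penalty: k ln n
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

For n = 100, ln n ≈ 4.6, so each additional parameter costs more than twice as much under BIC as under AIC.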
BIC, in contrast, provides a large-sample estimator of a transformation of the Bayesian posterior probability associated with the approximating model. By choosing the fitted candidate model corresponding to the minimum value of BIC, one is attempting to select the candidate model corresponding to the highest Bayesian posterior probability.

The Bayesian information criterion is often called the Schwarz information criterion. Common acronyms: BIC, SC, SIC, SBC.

BIC was justified by Schwarz (1978) "for the case of independent, identically distributed observations, and linear models," under the assumption that the likelihood is from the regular exponential family. Generalizations of Schwarz's derivation are presented by Stone (1979), Leonard (1982), Kashyap (1982), Haughton (1988), and Cavanaugh and Neath (1999). We will consider a justification which is general, yet informal.

Derivation of BIC

Let y denote the observed data. Assume that y is to be described using a model M_k selected from a set of candidate models M_{k_1}, M_{k_2}, ..., M_{k_L}. Assume that each M_k is uniquely parameterized by a vector θ_k, where θ_k is an element of the parameter space Θ(k) (k ∈ {k_1, k_2, ..., k_L}).

Let L(θ_k | y) denote the likelihood for y based on M_k. Note: L(θ_k | y) = f(y | θ_k). Let θ̂_k denote the maximum likelihood estimate of θ_k obtained by maximizing L(θ_k | y) over Θ(k). We assume that derivatives of L(θ_k | y) up to order two exist with respect to θ_k, and are continuous and suitably bounded for all θ_k ∈ Θ(k).

The motivation behind BIC can be seen through a Bayesian development of the model selection problem. Let π(k) (k ∈ {k_1, k_2, ..., k_L}) denote a discrete prior over the models M_{k_1}, M_{k_2}, ..., M_{k_L}, and let g(θ_k | k) denote a prior on θ_k given the model M_k.

Applying Bayes' Theorem, the joint posterior of M_k and θ_k can be written as

    h((k, \theta_k) \mid y) = \frac{\pi(k)\, g(\theta_k \mid k)\, L(\theta_k \mid y)}{m(y)},

where m(y) denotes the marginal distribution of y. The posterior probability for M_k is given by

    P(k \mid y) = m(y)^{-1}\, \pi(k) \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.

A Bayesian model selection rule might aim to choose the model M_k which is a posteriori most probable.
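As a small illustration of these quantities, the sketch below assumes binomial data and two hypothetical candidate models, with equal prior model probabilities π(1) = π(2) = 1/2 and a uniform parameter prior; the posterior model probabilities are computed exactly as in the formula above, by integrating the likelihood against g(θ_k | k) and normalizing by m(y). All numbers are made up for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

# Hypothetical data: 100 Bernoulli trials, 62 successes
n, s = 100, 62

# M1: success probability fixed at 0.5, so the "integral" is just L(0.5 | y)
int_m1 = binom.pmf(s, n, 0.5)

# M2: free success probability p, with uniform prior g(p | 2) = 1 on (0, 1)
int_m2, _ = quad(lambda p: binom.pmf(s, n, p), 0.0, 1.0)

# Equal prior model probabilities pi(1) = pi(2) = 1/2; m(y) is the normalizer
m_y = 0.5 * int_m1 + 0.5 * int_m2
print(f"P(1 | y) = {0.5 * int_m1 / m_y:.3f}")
print(f"P(2 | y) = {0.5 * int_m2 / m_y:.3f}")
```

With these assumed counts, the model with the free parameter receives roughly two-thirds of the posterior mass.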
Now consider minimizing −2 ln P(k | y) as opposed to maximizing P(k | y). We have

    -2 \ln P(k \mid y) = 2 \ln\{m(y)\} - 2 \ln\{\pi(k)\}
        - 2 \ln \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.

The term involving m(y) is constant with respect to k; thus, for the purpose of model selection, this term can be discarded. We obtain

    -2 \ln P(k \mid y) \propto -2 \ln\{\pi(k)\}
        - 2 \ln \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k
        \equiv S(k \mid y).

Now consider the integral which appears above:

    \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k.

In order to obtain an approximation to this term, we take a second-order Taylor series expansion of the log-likelihood about θ̂_k. We have

    \ln L(\theta_k \mid y) \approx \ln L(\hat{\theta}_k \mid y)
        + (\theta_k - \hat{\theta}_k)' \frac{\partial \ln L(\hat{\theta}_k \mid y)}{\partial \theta_k}
        + \frac{1}{2} (\theta_k - \hat{\theta}_k)'
          \left[ \frac{\partial^2 \ln L(\hat{\theta}_k \mid y)}{\partial \theta_k\, \partial \theta_k'} \right]
          (\theta_k - \hat{\theta}_k)
      = \ln L(\hat{\theta}_k \mid y)
        - \frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k),

where

    \bar{I}(\hat{\theta}_k, y) = -\frac{1}{n}
        \left. \frac{\partial^2 \ln L(\theta_k \mid y)}{\partial \theta_k\, \partial \theta_k'} \right|_{\theta_k = \hat{\theta}_k}

is the average observed Fisher information matrix. (The first-order term vanishes because θ̂_k maximizes ln L(θ_k | y).)

Thus,

    L(\theta_k \mid y) \approx L(\hat{\theta}_k \mid y)
        \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\}.

We therefore have the following approximation for our integral:

    \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k \approx
        L(\hat{\theta}_k \mid y) \int
        \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\}
        g(\theta_k \mid k)\, d\theta_k.

Now consider the evaluation of the remaining integral using the noninformative prior g(θ_k | k) = 1. In this case, recognizing the kernel of a multivariate normal density, we obtain

    \int \exp\left\{ -\frac{1}{2} (\theta_k - \hat{\theta}_k)'\, n \bar{I}(\hat{\theta}_k, y)\, (\theta_k - \hat{\theta}_k) \right\} d\theta_k
        = (2\pi)^{k/2}\, |n \bar{I}(\hat{\theta}_k, y)|^{-1/2}.

We therefore have

    \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k
        \approx L(\hat{\theta}_k \mid y)\, (2\pi)^{k/2}\, |n \bar{I}(\hat{\theta}_k, y)|^{-1/2}
        = L(\hat{\theta}_k \mid y)\, (2\pi)^{k/2}\, n^{-k/2}\, |\bar{I}(\hat{\theta}_k, y)|^{-1/2}
        = L(\hat{\theta}_k \mid y) \left( \frac{2\pi}{n} \right)^{k/2} |\bar{I}(\hat{\theta}_k, y)|^{-1/2}.

The preceding can be viewed as a variation on the Laplace method of approximating the integral ∫ L(θ_k | y) g(θ_k | k) dθ_k. (See Tierney and Kadane, 1986; Kass and Raftery, 1995.) The prior g(θ_k | k) need not be noninformative, although the choice of g(θ_k | k) = 1 makes our derivation more tractable. The approximation is valid in large-sample settings provided that the prior g(θ_k | k) is "flat" over the neighborhood of θ̂_k where L(θ_k | y) is dominant.

We can now write

    S(k \mid y) = -2 \ln\{\pi(k)\}
        - 2 \ln \int L(\theta_k \mid y)\, g(\theta_k \mid k)\, d\theta_k
      \approx -2 \ln\{\pi(k)\}
        - 2 \ln \left[ L(\hat{\theta}_k \mid y) \left( \frac{2\pi}{n} \right)^{k/2} |\bar{I}(\hat{\theta}_k, y)|^{-1/2} \right]
      = -2 \ln\{\pi(k)\} - 2 \ln L(\hat{\theta}_k \mid y)
        + k \ln\left( \frac{n}{2\pi} \right) + \ln |\bar{I}(\hat{\theta}_k, y)|.
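The accuracy of this Laplace-type approximation is easy to check numerically. Assuming the same kind of binomial setup as before (62 successes in 100 trials, flat prior, k = 1), the sketch below compares the quadrature value of the integral with L(θ̂_k | y)(2π/n)^{k/2} |Ī(θ̂_k, y)|^{−1/2}.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

# Hypothetical data: 100 Bernoulli trials, 62 successes
n, s = 100, 62
p_hat = s / n                             # MLE of the success probability

def lik(p):
    return binom.pmf(s, n, p)             # L(p | y)

# "Exact" integral of L(p | y) against the flat prior g(p | k) = 1
exact, _ = quad(lik, 0.0, 1.0)

# Average observed Fisher information at the MLE:
# -(1/n) d^2 ln L / dp^2 at p_hat equals 1 / (p_hat (1 - p_hat))
I_bar = 1.0 / (p_hat * (1.0 - p_hat))

# Laplace approximation with k = 1: L(p_hat | y) (2 pi / n)^{1/2} I_bar^{-1/2}
laplace = lik(p_hat) * np.sqrt(2 * np.pi / n) / np.sqrt(I_bar)

print(f"quadrature: {exact:.6f}, Laplace: {laplace:.6f}")
```

For these assumed numbers the quadrature value is exactly 1/(n + 1) = 1/101 ≈ 0.00990, and the Laplace value agrees with it to within roughly one percent at n = 100.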
Returning to the derivation: ignoring the terms in S(k | y) that are bounded as the sample size grows to infinity (namely −2 ln{π(k)}, −k ln(2π), and ln|Ī(θ̂_k, y)|), we obtain

    S(k \mid y) \approx -2 \ln L(\hat{\theta}_k \mid y) + k \ln n.

With this motivation, the Bayesian (Schwarz) information criterion is defined as follows:

    \mathrm{BIC} = -2 \ln L(\hat{\theta}_k \mid y) + k \ln n
                 = -2 \ln f(y \mid \hat{\theta}_k) + k \ln n.

BIC and Bayes Factors

Consider two candidate models M_{k_1} and M_{k_2} in a Bayesian analysis. To choose between these models, a Bayes factor is often used. The Bayes factor, B_{12}, is defined as the ratio of the posterior odds of M_{k_1}, P(k_1 | y)/P(k_2 | y), to the prior odds of M_{k_1}, π(k_1)/π(k_2). If B_{12} > 1, model M_{k_1} is favored by the data; if B_{12} < 1, model M_{k_2} is favored by the data.

Kass and Raftery (1995) write: "The Bayes factor is a summary of the evidence provided by the data in favor of one scientific theory, represented by a statistical model, as opposed to another."

Let BIC(k_1) denote BIC for model M_{k_1}, and let BIC(k_2) denote BIC for model M_{k_2}. Kass and Raftery (1995) argue that as n → ∞,

    \frac{-2 \ln B_{12} - \{\mathrm{BIC}(k_1) - \mathrm{BIC}(k_2)\}}{-2 \ln B_{12}} \to 0.

Thus, (BIC(k_1) − BIC(k_2)) can be viewed as a rough approximation to −2 ln B_{12}. Kass and Raftery (1995) write: "The Schwarz criterion (or BIC) gives a rough approximation to [−2 times] the logarithm of the Bayes factor, which is easy to use and does not require evaluation of prior distributions. It is well suited for summarizing results in scientific communication."
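The quality of this approximation can be examined in a toy setting. Assuming binomial data and two hypothetical models, M_{k_1} with p fixed at 0.5 (k = 0) and M_{k_2} with p free under a uniform prior (k = 1), the sketch below computes −2 ln B_{12} exactly and compares it with BIC(k_1) − BIC(k_2).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

# Hypothetical data: 100 Bernoulli trials, 62 successes
n, s = 100, 62
p_hat = s / n

# Marginal likelihoods under equal prior model probabilities
marg1 = binom.pmf(s, n, 0.5)                             # M1: p = 0.5, k = 0
marg2, _ = quad(lambda p: binom.pmf(s, n, p), 0.0, 1.0)  # M2: uniform prior, k = 1

b12 = marg1 / marg2                       # Bayes factor for M1 versus M2
exact = -2 * np.log(b12)

bic1 = -2 * np.log(binom.pmf(s, n, 0.5))  # k = 0: no penalty term
bic2 = -2 * np.log(binom.pmf(s, n, p_hat)) + 1 * np.log(n)

print(f"-2 ln B12         = {exact:.2f}")
print(f"BIC(k1) - BIC(k2) = {bic1 - bic2:.2f}")
```

The two printed values differ by an amount attributable to the O(1) terms discarded in the derivation; the point of the Kass and Raftery result is that this gap becomes negligible relative to −2 ln B_{12} as n grows.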
BIC versus AIC

Recall the definitions of consistency and asymptotic efficiency. Suppose that the generating model is of a finite dimension, and that this model is represented in the candidate collection under consideration. A consistent criterion will asymptotically select the fitted candidate model having the correct structure with probability one. On the other hand, suppose that the generating model is of an infinite dimension, and therefore lies outside of the candidate collection under consideration. An asymptotically efficient criterion will asymptotically select the fitted candidate model which minimizes the mean squared error of prediction.

BIC is consistent yet not asymptotically efficient; AIC is asymptotically efficient yet not consistent.

AIC and BIC share the same goodness-of-fit term, but the penalty term of BIC (k ln n) is potentially much more stringent than the penalty term of AIC (2k). Thus, BIC tends to choose fitted models that are more parsimonious than those favored by AIC. The differences in the selected models may be especially pronounced in large-sample settings.

Intuitively, why is the complexity penalization so much greater for BIC than for AIC? In Bayesian analyses, the strength of evidence required to favor additional complexity is generally greater than in frequentist analyses. Why? The Bayesian analytical paradigm incorporates both estimation uncertainty and parameter uncertainty; the frequentist analytical paradigm incorporates only estimation uncertainty.

From a practical perspective, AIC could be advocated when the primary goal of the modeling application is predictive, i.e., to build a model that will effectively predict new outcomes. BIC could be advocated when the primary goal of the modeling application is descriptive, i.e., to build a model that will feature the most meaningful factors influencing the outcome, based on an assessment of relative importance.

As the sample size grows, predictive accuracy improves as subtle effects are admitted to the model. AIC will increasingly favor the inclusion of such effects; BIC will not.

Use of BIC

BIC can be used to compare non-nested models. BIC can also be used to compare models based on different probability distributions. However, when the criterion values are computed, no constants should be discarded from the goodness-of-fit term −2 ln f(y | θ̂_k).

In a model selection application, the optimal fitted model is identified by the minimum value of BIC. However, as with the application of any model selection criterion, the criterion values are important: models with similar values should receive the same "ranking" in assessing criterion preferences.

Question: What constitutes a substantial difference in criterion values? For BIC, Kass and Raftery (1995, p. 777) feature the following table (slightly revised for presentation).

    BIC_i − BIC_min    Evidence against model i
    0–2                Not worth more than a bare mention
    2–6                Positive
    6–10               Strong
    > 10               Very strong
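The table translates directly into a small helper. The function below grades each model's BIC difference from the minimum on the Kass and Raftery (1995) scale; the four criterion values in the example call are hypothetical.

```python
def bic_evidence(bic_values):
    """Grade BIC differences from the minimum on the descriptive
    scale of Kass and Raftery (1995)."""
    bic_min = min(bic_values)
    graded = []
    for b in bic_values:
        d = b - bic_min
        if d <= 2:
            grade = "not worth more than a bare mention"
        elif d <= 6:
            grade = "positive"
        elif d <= 10:
            grade = "strong"
        else:
            grade = "very strong"
        graded.append((d, grade))
    return graded

# Hypothetical BIC values for four candidate models
for d, grade in bic_evidence([412.3, 413.1, 419.8, 430.0]):
    print(f"BIC_i - BIC_min = {d:5.1f}: {grade}")
```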
The use of BIC seems justifiable for model screening in large-sample Bayesian analyses. BIC is also often employed in frequentist analyses, and some frequentist practitioners prefer BIC to AIC, since BIC tends to choose fitted models that are more parsimonious than those favored by AIC. However, given the Bayesian justification of BIC, is the use of the criterion in frequentist analyses defensible?

References

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Neath, A. A. and Cavanaugh, J. E. (2012). The Bayesian information criterion: background, derivation, and applications. WIREs Computational Statistics 4, 199–203.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.