FACULTY WORKING PAPERS
College of Commerce and Business Administration
University of Illinois at Urbana-Champaign

January 5, 1978

#455

Information Criteria for Discriminating Among Alternative Regression Models

Takamitsu Sawa*, Professor, Department of Economics

First Draft January 17, 1977
Second Draft June 28, 1977
Final Draft October 24, 1977

*The author is indebted to Professors H. Akaike, P. Dhrymes, E. Leamer, and A. Zellner, and also to anonymous referees of Econometrica, for their helpful comments on an earlier version of this paper. Any errors are my sole responsibility. Research was supported by National Science Foundation Grant SOC 76-22232 at the University of Illinois.

ABSTRACT

Some decision rules for discriminating among alternative regression models are proposed and mutually compared. They are essentially based on the Akaike Information Criterion as well as the Kullback-Leibler Information Criterion (KLIC): namely, the distance between a postulated model and the true unknown structure is measured by the KLIC. The proposed criteria combine the parsimony of parameters with the goodness of fit. Their relationships with conventional criteria are discussed in terms of a new concept of unbiasedness.

1.
Introduction

In most statistical analyses it is taken for granted that the family of probability distribution functions, say $F(y|\theta)$, can be correctly specified on a priori grounds. Uncertainty exists, therefore, only with reference to the values of the parameters $\theta$ involved in the specified family of probability distribution functions (p.d.f.). In practice, however, we are seldom in such an ideal situation; that is, we are more or less uncertain about the family to which the true p.d.f. might belong. It may be very likely that the true distribution is in fact too complicated to be represented by a simple mathematical function such as is given in ordinary textbooks.

In practice we approximate the true distribution by one of the alternative p.d.f.'s listed in textbooks. Needless to say, we try to choose the most adequate p.d.f. with due thought to a priori considerations. A p.d.f. specified by a convenient mathematical function is usually termed a model. For further analysis a postulated model is identified, at least tentatively, with the true distribution. To put it differently, in the process of conventional statistical analysis a sharp distinction is seldom drawn between the postulated model and the true distribution.

To avoid the arbitrariness that inevitably occurs in the process of model building, nonparametric statistical methods have been extensively developed in the past two decades. It seems to me, however, that these methods have not been used very successfully in practical data analysis. In fact, most statistical inferences are based on some specific parametric model, very often on the model of the normal distribution.

In recent years, however, more and more emphasis has been placed on the problem of model identification;1/ that is, how to identify the model when it cannot be completely specified from a priori knowledge.
The main purpose of the present paper is to propose and analyze statistical criteria for model identification in regression analysis.

Our basic attitude toward the problem is to recognize the fact that a certain amount of discrepancy inevitably exists between the true distribution and the model. The best we can do in trying to cope with this sort of situation is to identify the relatively most adequate model among a given set of alternatives. The adequacy of a model needs to be quantified by defining a suitable measure of the distance of the model from the unknown true distribution.

It is expected intuitively that the more complicated model will provide the better approximation to reality. But, on the contrary, in most practical situations the less complicated model is likely to be preferred if we wish to pursue the accuracy of estimation. To illustrate this point, let us consider the situation where two alternative density functions, $f_1(\cdot|\theta)$ and $f_2(\cdot|\zeta)$, are given as possible models of the density $g(\cdot)$ of the unknown true distribution, where $\theta$ and $\zeta$ are finite-dimensional vectors of unknown parameters. Even if $f_1(\cdot|\theta)$ is the better approximation to the true density $g(\cdot)$ in the sense that

$\inf_{\theta} \| f_1(\cdot|\theta) - g(\cdot) \| < \inf_{\zeta} \| f_2(\cdot|\zeta) - g(\cdot) \|$,

where $\|\cdot\|$ is a suitably defined distance measuring the difference between two p.d.f.'s, it is quite likely that

$E_{\hat\theta} \| f_1(\cdot|\hat\theta) - g(\cdot) \| > E_{\hat\zeta} \| f_2(\cdot|\hat\zeta) - g(\cdot) \|$

if $\dim \theta > \dim \zeta$, where $\hat\theta$ and $\hat\zeta$ are some reasonable estimates for $\theta$ and $\zeta$, respectively.

The above consideration leads us naturally to the so-called principle of parsimony. That is, more parsimonious use of parameters should be pursued so as to raise the accuracy of estimates for unknown parameters in a model. In general, closeness to the true distribution is incompatible with parsimony of parameters. These two criteria form a trade-off: if one pursues one of the criteria, the other must necessarily be sacrificed.
The multiple correlation coefficient adjusted for the degrees of freedom may be the most commonly used statistic that incorporates the two incompatible criteria into a single statistic.

Akaike [1] has proposed a more general as well as more widely applicable statistic that ingeniously incorporates the above two criteria. As it is based on the Kullback-Leibler Information Criterion, Akaike's statistic is called the Akaike Information Criterion and is abbreviated as the AIC. Indeed, the procedure developed here is also based on the Kullback-Leibler Information Criterion, but the criterion for the choice of the most adequate regression model implied by our procedure is considerably different from that implied by the AIC. The disagreement stems from, among other things, a difference between Akaike's and our views on the true distribution.

Some readers may feel that it is useless to study the preliminary test any more because the resultant estimator has been proved to be inadmissible. To avoid this criticism in advance, we point out that what we are proposing is not an estimation procedure but a procedure for model identification. More precisely, in the present context we aim to develop a procedure for identifying the most adequate model from a given set of alternatives rather than estimating unknown parameters involved in a given true model.

In Section 2 we briefly review the Kullback-Leibler Information Criterion and the Akaike Information Criterion. In Section 3 we develop a criterion for the choice of the most adequate regression model and compare it with a criterion implied by the Akaike Criterion. In Section 4 a different criterion is derived on the basis of the minimum attainable Bayes risk. The biases of those criteria are discussed in Section 5.

2. Information Criterion

Suppose that we are concerned with the probabilistic structure of a vector random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$. Let $G(y)$ be the true joint distribution of $Y$.
On the basis of a priori knowledge we postulate a model $F(y|\theta)$ to approximate the unknown true distribution $G(y)$, where $\theta$ is a finite-dimensional vector of unknown parameters. The adequacy of a postulated model may be appropriately measured by the Kullback-Leibler Information Criterion (KLIC),

(2.1)  $I(G : F(\cdot|\theta)) = E_G\left[\log \frac{g(Y)}{f(Y|\theta)}\right] = \int \log \frac{g(y)}{f(y|\theta)} \, dG(y)$,

where $g$ and $f$ are the density (or probability) functions of, respectively, $G$ and $F$; $E_G(\cdot)$ stands for expectation with respect to the true distribution $G$; and the integration is over the entire range of $Y$. It can be easily shown that the KLIC is nonnegative,

(2.2)  $I(G : F(\cdot|\theta)) \ge 0$,

with equality only when $F(y|\theta) = G(y)$ almost everywhere in the possible range of $Y$; namely, only when the model is essentially correct. (See, for instance, Rao [7], pp. 58-59.) Incidentally, the negative value of the KLIC is termed the entropy of a probability distribution $G(y)$ with respect to $F(y|\theta)$. Noting the inequality (2.2) as well as the obvious equality

(2.3)  $I(G : F(\cdot|\theta)) = \int \log g(y) \, dG(y) - \int \log f(y|\theta) \, dG(y)$,

we are led to propose the following rule for a comparison of alternative models or estimates.2/

Rule 2.1:

(i) A model $F_1(\cdot|\theta)$ is regarded as the better approximation to the true distribution $G(\cdot)$, i.e., the more adequate model, than an alternative model $F_2(\cdot|\zeta)$ if and only if

(2.4)  $\inf_{\theta} I(G : F_1(\cdot|\theta)) < \inf_{\zeta} I(G : F_2(\cdot|\zeta))$,

or equivalently

(2.5)  $\sup_{\theta} E_G[\log f_1(Y|\theta)] > \sup_{\zeta} E_G[\log f_2(Y|\zeta)]$.

(ii) Given a model $F(\cdot|\theta)$, an estimate $\hat\theta_1$ is regarded as a better estimate than $\hat\theta_2$ if and only if

(2.6)  $E_{\hat\theta_1}\{E_G[\log f(Y|\hat\theta_1) \,|\, \hat\theta_1]\} > E_{\hat\theta_2}\{E_G[\log f(Y|\hat\theta_2) \,|\, \hat\theta_2]\}$,

where $E_{\hat\theta_1}$ and $E_{\hat\theta_2}$ stand for expectations with respect to the sampling distributions of $\hat\theta_1$ and $\hat\theta_2$, respectively. (Note that when we first take an expectation with respect to $G$, the estimate $\hat\theta_1$ or $\hat\theta_2$ should be treated as if it were a constant.)

In words, the adequacy of a postulated model is measured by the minimum possible KLIC distance between the model and the true distribution.
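The definition (2.1) and the nonnegativity property (2.2) can be checked numerically. The following is a minimal sketch, assuming a scalar $Y$, a midpoint-rule integration over a finite range, and normal densities for both the true distribution and the model; the function names `normal_logpdf` and `klic` and all numerical settings are illustrative choices, not part of the paper.

```python
import math

def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def klic(log_g, log_f, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule approximation of I(G:F) = integral of g(y) * log(g(y)/f(y)).

    log_g and log_f are the log densities of the true distribution G and of
    the model F. The range [lo, hi] is assumed to carry essentially all of
    the probability mass of G.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        lg = log_g(y)
        total += math.exp(lg) * (lg - log_f(y)) * h
    return total

# Hypothetical setting: the true distribution G is N(0, 1).
log_g = lambda y: normal_logpdf(y, 0.0, 1.0)

# Model N(1, 1): a mis-specified mean gives a strictly positive KLIC.
kl_shifted = klic(log_g, lambda y: normal_logpdf(y, 1.0, 1.0))

# Model identical to G: the KLIC vanishes, the equality case of (2.2).
kl_correct = klic(log_g, log_g)
```

For two normal distributions with a common unit variance whose means differ by one, the integral has the closed-form value 1/2, which gives a simple check on the approximation.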
It was pointed out by Akaike [1] that if the $Y_j$'s are independent and identically distributed, the maximum likelihood estimate may be regarded as an estimate that minimizes the estimated KLIC, or equivalently maximizes the estimated entropy, because the log likelihood function divided by the sample size $n$,

(2.7)  $\frac{1}{n} \sum_{j=1}^{n} \log f(y_j|\theta)$,

may be regarded as a reasonable estimate for $E_G\{\log f(Y|\theta)\}$ whatever $G(y)$ is.

Apparently, the above rule for a comparison of models is not directly applicable in practice, because the criteria are totally dependent on the unknown true probability distribution. To establish a practically usable criterion for model identification on the basis of the KLIC, we need to replace the unknowns in (2.5) by their reasonable estimates. In fact, the Akaike Information Criterion (AIC) has been derived as an approximately unbiased estimate for the KLIC, neglecting its irrelevant constant terms and based implicitly on a fairly strong assumption that will be stated later.

For the sake of convenience in developing our argument we give the following definition.

Definition: Given a model $F(\cdot|\theta)$, a parameter value $\theta_0$ such that

(2.8)  $I(G : F(\cdot|\theta_0)) \le I(G : F(\cdot|\theta))$

for any possible $\theta$ in the admissible parameter space is called a pseudo-true parameter value; $F(\cdot|\theta_0)$ is called a pseudo-true model.

If the true distribution $G(y)$ and a model $F(y|\theta)$ satisfy due regularity conditions, the pseudo-true parameter $\theta_0$ must satisfy

(2.9)  $E_G\left[\frac{\partial}{\partial\theta} \log f(Y|\theta_0)\right] = 0$.

The model $F(y|\theta_0)$ may be regarded as the relatively most adequate one within the family of models $F(y|\theta)$ in the sense that the KLIC for $F(y|\theta)$ is minimized by $F(y|\theta_0)$. We note that Rule 2.1 is based on the comparison of the KLIC distances between the pseudo-true models and the true model.
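To make the notion of a pseudo-true parameter value concrete, the sketch below searches a grid for the member of a normal family that maximizes $E_G[\log f(Y|\theta)]$, which by (2.3) is the same as minimizing the KLIC in (2.8). The true distribution is assumed, purely for illustration, to be a Laplace distribution with location 0 and unit scale (so that $E_G[Y] = 0$ and $E_G[Y^2] = 2$); the function name `expected_loglik` and the grid are hypothetical.

```python
import math

def expected_loglik(mean, var, m1, m2):
    """E_G[log f(Y | mean, var)] for a normal model f.

    Only the first two moments of the true distribution G enter:
    m1 = E_G[Y] and m2 = E_G[Y^2].
    """
    expected_sq_dev = m2 - 2.0 * mean * m1 + mean ** 2   # E_G[(Y - mean)^2]
    return -0.5 * math.log(2 * math.pi * var) - expected_sq_dev / (2.0 * var)

# Assumed true distribution: Laplace(location 0, scale 1), which is not normal.
m1, m2 = 0.0, 2.0

# Grid search over the normal family N(mean, var).
grid_means = [i / 100 for i in range(-100, 101)]
grid_vars = [i / 100 for i in range(50, 401)]
pseudo_true_mean, pseudo_true_var = max(
    ((mu, v) for mu in grid_means for v in grid_vars),
    key=lambda pair: expected_loglik(pair[0], pair[1], m1, m2),
)
```

Under these assumptions the search settles on the mean and variance of the true distribution, even though no normal distribution coincides with it; this is the sense in which the pseudo-true model is the best available member of a family that does not contain the truth.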
Assuming that $I(G : F(\cdot|\theta_0)) = O(n^{-1})$, i.e., that the pseudo-true model is nearly true, Akaike [1] derives his criterion

(2.10)  $\mathrm{AIC}(F(\cdot|\hat\theta)) = -2 \log f(y|\hat\theta) + 2k$

as an almost unbiased estimate for $-2 E_G[\log f(Y|\hat\theta)]$, where $\hat\theta$ is the maximum likelihood estimate for $\theta$ based on observations $y$ and $k$ is the number of unknown parameters, i.e., the dimension of $\theta$. The procedure of choosing a model that minimizes the AIC is called the Minimum AIC (MAIC) procedure.

The first term of the AIC measures the goodness of fit of the model to a given set of data, because $f(y|\hat\theta)$ is the maximized likelihood function. The second term is interpreted as representing a penalty that should be paid for increasing the number of parameters. In this sense the AIC may be regarded as an explicit formulation of the so-called principle of parsimony in model building.

Indeed, the assumption that

(2.11)  $I(G : F(\cdot|\theta_0)) = O(n^{-1})$

for every model $F$ simplifies the derivation substantially, but there is no denying that this simplifying assumption lessens the plausibility of the AIC to some extent. To see this point in more detail, let us consider the case where we have to choose one from two alternatives, say $F_1$ and $F_2$. The AIC for $F_1$ is evaluated assuming that $F_1$ with its pseudo-true parameter value is true, while the AIC for $F_2$ is evaluated assuming that $F_2$ with its pseudo-true parameter value is true. Thereafter, the two AIC's are numerically compared.

In the next section, confining ourselves to linear regression, we derive another criterion called the BIC on the basis of weaker assumptions than (2.11) and compare it with the AIC to see how much difference might arise depending on whether or not we assume (2.11).

3. Identification of a Regression Model

We are interested in investigating a joint distribution of a vector random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$.
Each of the $Y_i$'s may be an observation on a certain characteristic of a randomly chosen individual; or the $Y_i$'s may constitute a sequence of observed time series. The distribution function $G(y)$ is unknown, but each $Y_i$ is assumed to possess finite variance. We denote the mean vector and the variance-covariance matrix, respectively, by $\mu$ and $\Omega$, where $\mu$ is a vector of $n$ components and $\Omega$ is an $n \times n$ positive definite matrix. Unless we place more a priori restrictions on the elements of $\mu$ and $\Omega$, we can make no inference at all about the joint distribution of $Y$. What we usually do is to assume that $\mu$ belongs to a linear subspace of lower dimension than $n$ and that the $Y_i$'s are mutually uncorrelated. Then we have a familiar linear regression model

(3.1)  $E(Y) = X\beta, \quad V(Y) = \sigma^2 I_n$,

where $X$ is an $n \times k$ matrix of known constants, the $k$ columns of which constitute a basis of the subspace to which $\mu$ is assumed to belong; $\beta$ is a vector of $k$ unknown parameters; $\sigma^2$ is an unknown positive constant; and $I_n$ is an identity matrix of order $n$. In most practical situations the columns of $X$ are vectors of observations on certain characteristics considered to be associated with $Y$. Then the model implies that the $i$-th mean $\mu_i$ is represented as a linear function of $k$ explanatory variables, i.e., $\mu_i = \sum_{j=1}^{k} \beta_j x_{ij}$, where $x_{ij}$ is the $(i,j)$-th element of $X$. By assuming a regression model we can reduce the number of unknown parameters from $n + n(n+1)/2$ to $k + 1$.

In addition to (3.1) we often assume the normal distribution for $Y$ and postulate a model

(3.2)  $Y \sim N(X\beta, \sigma^2 I_n)$,

or $Y = X\beta + u$, $u$ [...] $I_n$, then $\hat\beta$ and $\hat\sigma^2$ are uncorrelated.

Along the lines of the previous section, one can measure the loss incurred by modelling $G(y)$ by $F(y|\hat\theta)$, with some estimate $\hat\theta$ in place of the unknown $\theta_0$, by the quantity

(3.10)  $W(F(\cdot|\hat\theta)) = -\frac{2}{n} E_G[\log f(Y|\hat\theta) \,|\, \hat\theta]$,

where $f(y|\theta_0)$ is the density function of the pseudo-true model $N(X\beta_0, \sigma_0^2 I)$, i.e., the likelihood function of the model.
It should be noted that the expectation on the right-hand side of (3.10) refers only to the argument $Y$ of the density function; i.e., $\hat\theta$ is taken as a fixed constant.

Lemma 3.3: The loss incurred by modelling the distribution of $Y$ by $F(y|\hat\theta)$, with an estimated value $\hat\theta$ substituted for $\theta_0$, is evaluated as

(3.11)  $W(F(\cdot|\hat\theta)) = \log(2\pi) + \log(\hat\sigma^2) + \frac{\sigma_0^2}{\hat\sigma^2} + \frac{1}{n\hat\sigma^2} \| X(\hat\beta - \beta_0) \|^2$,

where $\|\cdot\|$ is the Euclidean norm.

In this section we adhere to the sampling theory approach, and hence we base our decision about model selection on the risk function derived by integrating the loss function with respect to the sampling distribution of the estimate $\hat\theta$. Since the ML estimate $\hat\theta$ possesses the nice property in Lemma 3.2 even when a postulated model is incorrect, we define the risk of postulating a model $F(y|\theta)$ by an integral of the loss function of $F(y|\hat\theta)$ with respect to the sampling distribution of the ML estimate $\hat\theta$.

Theorem 3.1: Suppose that $\Omega = \omega^2 I$ and each $Y_i$ is symmetrically distributed with the same kurtosis as a normal distribution.3/ Then the risk of a model $F(\cdot|\hat\theta)$, i.e., the expected value of $W(F(\cdot|\hat\theta))$, is evaluated to order $O(n^{-1})$ as

(3.12)  $R(F(\cdot|\hat\theta)) = \log(2\pi) + \log(\sigma_0^2) + 1 + \frac{k+2}{n}\left(\frac{\omega^2}{\sigma_0^2}\right) - \frac{1}{n}\left(\frac{\omega^2}{\sigma_0^2}\right)^2 + O(n^{-2})$.

The proof is given in the Appendix. It should be noted that $\sigma_0^2$ decreases along with the successive addition of explanatory variables, i.e., with the increase of $k$.

To develop a practical and useful criterion for model identification, the risk function involving unknown parameters needs to be somehow estimated from a given set of observations.

Theorem 3.2: Suppose that we have an estimate, say $\hat\omega^2$, for $\omega^2$ such that $\hat\omega^2 = \omega^2 + O_p(n^{-1/2})$, where $O_p(n^{-1/2})$ stands for a term of stochastic order $n^{-1/2}$ and with finite second order moment.4/ Then

(3.13)  $\mathrm{BIC}(F(\cdot|\hat\theta)) = -2 \log f(y|\hat\theta) + 2(k+2)\left(\frac{\hat\omega^2}{\hat\sigma^2}\right) - 2\left(\frac{\hat\omega^2}{\hat\sigma^2}\right)^2$

is an asymptotically unbiased estimate of $nR(F(\cdot|\hat\theta))$.
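The criterion (3.13) is straightforward to evaluate once the ML residual variance of the candidate model and an estimate of $\omega^2$ are available. The sketch below assumes the normal regression model, for which the maximized log likelihood satisfies $-2 \log f(y|\hat\theta) = n \log(2\pi) + n \log \hat\sigma^2 + n$, and counts $k + 1$ parameters in the AIC (the $k$ regression coefficients and the variance). The function names are hypothetical, and `sawa_bic` is named so as to avoid confusion with the Schwarz criterion, which commonly goes by the same abbreviation.

```python
import math

def neg2_loglik(n, sigma2_hat):
    """-2 log f(y | theta_hat) for a normal regression fitted by ML.

    sigma2_hat is the ML residual variance (residual sum of squares / n).
    """
    return n * (math.log(2 * math.pi) + math.log(sigma2_hat) + 1.0)

def aic(n, k, sigma2_hat):
    """AIC with k regression coefficients plus one variance parameter."""
    return neg2_loglik(n, sigma2_hat) + 2.0 * (k + 1)

def sawa_bic(n, k, sigma2_hat, omega2_hat):
    """The BIC of (3.13); omega2_hat is a separate estimate of the true variance."""
    ratio = omega2_hat / sigma2_hat
    return neg2_loglik(n, sigma2_hat) + 2.0 * (k + 2) * ratio - 2.0 * ratio ** 2

# Illustrative numbers only: n = 30 observations and k = 2 regressors, with
# omega2_hat = 1.2 taken, say, from a larger regression.
same_variance_gap = sawa_bic(30, 2, 1.2, 1.2) - aic(30, 2, 1.2)
poorer_fit_gap = sawa_bic(30, 2, 1.5, 1.2) - aic(30, 2, 1.5)
```

When the two variance estimates coincide the penalty terms agree, and when the fitted model has the larger residual variance the penalty in (3.13) is the smaller of the two.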
"2 "2 5/ If we equate ui to a , the BIC is identical with the AIC— ' As was pointed out in the preceding section, the AIC is based on the assump- tion that the true distribution defers from the pseudo-true model only -1 2 2 in the order of n ; hence it is justifiable to equate a n to u> in "2 "2 (3.12) or to equate o to a in (3.13). '2 '2 The variance ratio w la increases with successive addition of explanatory variables, and possibly it approaches one as long as the ~2 "2 degrees of freedom are sufficiently large. Its reciprocal o /w (>_ 1) -13- may be interpreted as a discounting factor for the penalty that has to be paid for increasing the number of parameters. Therefore, the favor to parsimonious models is more pronounced in the minimum BIC procedure. When we compare two regression models, one with less explanatory variables and poorer fit, the other with more explanatory variables and better fit, the BIC is rather more favorable to the former model than the AIC. The following numerical evaluations show that the difference between the two criteria is far from negligible. Let us develop a decision rule to choose one from two nested alternative regression models V Y * N(x i B i> a i 2 V > (3.14) F 9 : Y 1.510(.133) 1.709 (.116) 1.836C096) 2.758(.040) 20 1.504C151) 1,558(.139) 1.625(.125) 1.707C.110) 2.494(.034) 30 1.469 (.153) 1.500 (.146) 1.536(.137) 1.576(.128) 1.912(.071) 50 1.445 (.155)' 1.462(.151) 1.480(.146) 1.449(.154) 1.625(.112) 100 1.429(.156) 1.437(.154) 1.445 (.152) 1.453(.150) 1.499(.138) 200 1.421(.156) 1.425(.156) 1.429(.154) 1.433(.154) 1.453(.148) 500 1.417 (.158) 1.419(.156) 1.420(.156) 1.42K.156) 1.429(.154) 1000 1.416(.158) 1.416 (.158) 1.417(.156) 1.418(.156) 1.42K.156) n is the sample size and p is the number of the explanatory variables already included in the model. 
The decision rule is described as follows: if the t-value for an optional variable exceeds the MBIC critical point, we decide to augment the model by the optional variable, and vice versa. Note that the MBIC critical point approaches $\sqrt{2}$ slowly as $n$ tends to infinity for every $p$.

Table 3.2: The MAIC Critical Points and Significance Levels for the Preliminary t-Test

   n      p = 2         p = 3         p = 4         p = 5         p = 10
   10     1.245(.253)   1.153(.293)   1.052(.341)    .941(.400)
   12     1.278(.233)   1.205(.263)   1.127(.297)   1.043(.337)
   16     1.316(.211)   1.264(.230)   1.210(.252)   1.154(.275)    .816(.452)
   20     1.337(.199)   1.297(.213)   1.256(.228)   1.213(.245)    .973(.356)
   30     1.364(.184)   1.339(.192)   1.313(.201)   1.286(.211)   1.144(.267)
   50     1.385(.173)   1.370(.177)   1.355(.182)   1.340(.187)   1.262(.214)
  100     1.400(.164)   1.393(.166)   1.385(.170)   1.378(.172)   1.341(.184)
  200     1.407(.160)   1.404(.162)   1.400(.164)   1.396(.164)   1.378(.170)
  500     1.411(.158)   1.410(.160)   1.409(.160)   1.407(.160)   1.400(.162)
 1000     1.413(.158)   1.412(.158)   1.411(.158)   1.411(.158)   1.407(.160)

See the footnote to Table 3.1.

[...] less than 10. Indeed it is difficult to see a clear-cut connection between the two basically different approaches, but it would be worth noting that if a loss function is specified in terms of the prediction error, the more prodigal model is likely to be preferred.10/

We often encounter a situation where we have to choose one of two unnested alternatives: $Y \sim N(X_1\beta_1, \sigma_1^2 I)$ and $Y$ [...] the unknown true variance $\omega^2$ may be reasonably estimated from a regression of $y$ on all the explanatory variables $X_1 \cup X_2$. Another reasonable estimate of $\omega^2$ may be the smallest value of the "unbiased" estimates, instead of the maximum likelihood estimates, of variances for all possible regressions of $y$ on a subset of $X_1 \cup X_2$.

The difficulty in estimating $\omega^2$ does admittedly place a serious limitation on the practical usefulness of the MBIC procedure.
However, it should be noted that the same difficulty is shared by Mallows' [5] procedure, which is based on what he calls the $C_p$ statistic. Incidentally, Mallows' procedure gives a decision rule essentially similar to the AIC.11/ It is worth noting that according to Akaike's procedure $\omega^2$ is estimated by $\hat\sigma_1^2$ when we evaluate the AIC for the model $F_1$ and by $\hat\sigma_2^2$ when we evaluate the AIC for the model $F_2$. This means that, given a class of nested alternative models, the AIC for each model is evaluated assuming it is nearly true in the sense that the difference of the error variance in the model from the true variance $\omega^2$ tends to zero as $n$ tends to infinity. (See equation (2.11).) On the other hand, the BIC for each model is evaluated assuming that the most complex model within the class would be nearly true but the rest are not necessarily so.

4. A Decision Rule Based on Bayes Risk

In this section we look at the problem another way. Given a model $F(\cdot|\theta)$ coupled with a prior distribution $P(\theta)$, we define the Bayes risk, say $B(\hat\theta|F)$, for an estimate $\hat\theta$ as the expectation of the loss function (3.10) or (3.11) with respect to the posterior distribution, that is,

(4.1)  $B(\hat\theta|F) = \int W(F(\cdot|\hat\theta)) \, dP(\theta|y)$,

where $P(\theta|y)$ is the posterior distribution for $\theta$ given an observation $y$. If there exists an estimate $\theta^*$ such that

(4.2)  $B(\theta^*|F) = \min_{\hat\theta} B(\hat\theta|F)$,

then it is called the Bayes estimate of $\theta$ with respect to the loss function (3.10). Recalling that $W(F(\cdot|\hat\theta))$ measures the discrepancy of a model $F(\cdot|\hat\theta)$ from the true distribution $G(\cdot)$, we take $B(\theta^*|F)$ as a measure of the adequacy of a postulated model $F(\cdot|\theta)$ associated with a prior distribution $P(\theta)$. That is, along the lines of the previous sections, if we compare two alternative models, say, $F_1(\cdot|\theta)$ with $P_1(\theta)$ and $F_2(\cdot|\zeta)$ with $P_2(\zeta)$, then we decide to choose $F_1$ or $F_2$ according to whether or not $B(\theta^*|F_1) < B(\zeta^*|F_2)$.
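For the normal linear regression model with a diffuse prior, Lemma 4.1 gives the minimum attainable Bayes risk in closed form, so the comparison just described can be sketched in a few lines. The code below assumes that closed form, written with $-\frac{2}{n}\log f(y|\hat\theta) = \log(2\pi) + \log\hat\sigma^2 + 1$ for the ML residual variance $\hat\sigma^2$; the function names and the numerical inputs are hypothetical.

```python
import math

def min_bayes_risk(n, k, sigma2_hat):
    """Minimum attainable Bayes risk of a normal regression with k regressors.

    Uses -(2/n) log f(y | theta_hat) + log((n + k) / (n - k - 2)), where
    sigma2_hat is the ML residual variance. Requires n - k - 2 > 0.
    """
    if n - k - 2 <= 0:
        raise ValueError("need n - k - 2 > 0")
    fit_term = math.log(2 * math.pi) + math.log(sigma2_hat) + 1.0
    return fit_term + math.log((n + k) / (n - k - 2))

def choose_model(n, p, q, sigma2_hat_1, sigma2_hat_2):
    """Return 1 if the smaller model (p regressors) has the smaller Bayes risk,
    and 2 if the larger model (p + q regressors) does."""
    b1 = min_bayes_risk(n, p, sigma2_hat_1)
    b2 = min_bayes_risk(n, p + q, sigma2_hat_2)
    return 1 if b1 < b2 else 2

# Illustrative inputs: with no gain in fit the smaller model is kept, while
# a halving of the residual variance favours the larger one.
no_gain = choose_model(30, 3, 1, 1.0, 1.0)
large_gain = choose_model(30, 3, 1, 2.0, 1.0)

# Equivalent cut-off for the F statistic when q variables are added.
def bayes_critical_point(n, p, q=1):
    return 2.0 * (n - 1) * (n - p - q) / ((n + p) * (n - p - q - 2))

t_cut = math.sqrt(bayes_critical_point(10, 3))
```

The helper `bayes_critical_point` is obtained by rewriting the inequality between the two risks in terms of the F statistic, under the assumption that $\hat\sigma_1^2/\hat\sigma_2^2 = 1 + qW/(n-p-q)$; for $n = 10$, $p = 3$, $q = 1$ its square root agrees with the corresponding entry of Table 4.1.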
In what follows let us be specific to a linear normal regression model for a vector random variable $Y$:

(4.3)  $F: \; Y \sim N(X\beta, \sigma^2 I_n)$,

where $Y$ is $n \times 1$, $X$ is $n \times k$, $\beta$ is $k \times 1$, and $u$ is $n \times 1$; the true distribution of $Y$ is $N(\mu, \omega^2 I_n)$ with unknowns $\mu$ and $\omega^2$. If we assume a diffuse prior for $\beta$ and $\sigma^2$, the minimum attainable Bayes risk is evaluated as follows.

Lemma 4.1. Given a model $F$ with a diffuse prior, the minimum attainable Bayes risk is

(4.4)  $B(\beta^*, \sigma^{2*}|F) = -\frac{2}{n} \log f(y|\hat\beta, \hat\sigma^2) + \log\left(\frac{n+k}{n-k-2}\right)$,

where $\hat\beta$ and $\hat\sigma^2$ are the ML estimates for $\beta$ and $\sigma^2$, $\beta^*$ and $\sigma^{2*}$ are the Bayes estimates, and $f$ is the density function of $N(X\beta, \sigma^2 I_n)$.

Let us make a comparison of the two nested alternatives $F_1$ and $F_2$ given in (3.14). The Bayes decision rule, based on the magnitude of the minimum attainable Bayes risk, leads us to the following decision rule, which is again described in terms of a familiar F-statistic.12/

Theorem 4.1. A decision rule based on the minimum attainable Bayes risk is equivalent to: choose $F_1$ if

(4.5)  $W < \frac{2(n-1)(n-p-q)}{(n+p)(n-p-q-2)}$;

choose $F_2$ otherwise, where $W$, defined by (3.15), is an F-statistic conventionally employed to test the hypothesis that $\beta_2 = 0$.

We call the right-hand side of (4.5) the Bayes critical point, which tends to 2 asymptotically, increases with $q$, and decreases with $p$ if $n$ is moderately large. Limiting ourselves to the case of $q = 1$, we tabulate the numerical values of the square root of the Bayes critical point in Table 4.1, which is comparable to Tables 3.1 and 3.2.

Table 4.1. Bayes Critical Points and Significance Levels for the Preliminary t-Test

   n      p = 2         p = 3         p = 4         p = 5         p = 10
   10     1.499(.178)   1.441(.200)   1.464(.203)   1.549(.196)
   12     1.421(.189)   1.398(.200)   1.387(.208)   1.393(.213)
   16     1.403(.184)   1.376(.194)   1.354(.203)   1.336(.211)   1.387(.224)
   20     1.399(.180)   1.374(.188)   1.352(.196)   1.332(.204)   1.276(.234)
   30     1.399(.173)   1.380(.179)   1.362(.185)   1.345(.191)   1.270(.220)
   50     1.403(.167)   1.390(.171)   1.378(.175)   1.366(.179)   1.312(.197)
  100     1.408(.162)   1.401(.164)   1.395(.166)   1.388(.168)   1.357(.175)
  200     1.411(.160)   1.407(.162)   1.404(.162)   1.401(.162)   1.384(.166)
 1000     1.414(.158)   1.413(.158)   1.412(.158)   1.411(.158)   1.408(.159)

See the footnote to Table 3.1.

It is interesting to note that the Bayes critical point varies quite little with changes in the values of $n$ and $p$. Also, it is very close to the minimax regret critical point in Sawa and Hiromatsu [8].

5. Bias of Decision Rules

Now we return to Section 3 and reconsider the problem from the viewpoint of sampling theory. When we compare the two nested alternative models given in (3.14), our decision rule should in principle be based on the risk function given in Theorem 3.1. That is, we should choose $F_1$ if $R(F_1(\cdot|\hat\theta_1)) < R(F_2(\cdot|\hat\theta_2))$ and vice versa.

Lemma 5.1. If $\delta = \sigma_1^2 - \sigma_2^2 = O(n^{-1})$, then

(5.1)  $R(F_1(\cdot|\hat\theta_1)) - R(F_2(\cdot|\hat\theta_2)) = \frac{\delta}{\sigma_2^2} - \frac{q\,\omega^2}{n\sigma_2^2} + O(n^{-2})$.

The proof is given in the Appendix. It should be recalled that when we derived the BIC the terms of $O(n^{-2})$ were neglected. It is, therefore, consistent that we evaluate the difference of risk only to order $O(n^{-1})$. The difference $\delta$ between the pseudo-variances is assumed to be $O(n^{-1})$. This assumption may seem to be somewhat uncomfortable. However, it may be justified by the fact that the model discrimination procedure would be unnecessary unless the difference between the two alternatives is as small as the reciprocal of the sample size. Incidentally, starting from a Mallows-type risk function, Sawa and Takeuchi [9] have arrived at essentially the same result as (5.1). This reflects the asymptotic equivalence of the two different approaches.
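The content of Lemma 5.1 can be illustrated by evaluating the risk of Theorem 3.1 for two nested models and comparing the difference with the two leading terms of (5.1). This is a sketch under stated assumptions: the $O(n^{-2})$ remainder of (3.12) is simply dropped, the smaller model has $p$ regressors and the larger $p + q$, and all numerical values are hypothetical.

```python
import math

def risk(n, k, sigma2_0, omega2):
    """Risk (3.12) of a model with k regressors, omitting the O(n^-2) remainder.

    sigma2_0 is the pseudo-true variance of the model and omega2 the true variance.
    """
    ratio = omega2 / sigma2_0
    return (math.log(2 * math.pi) + math.log(sigma2_0) + 1.0
            + (k + 2) / n * ratio - ratio ** 2 / n)

def risk_difference(n, p, q, sigma2_1, sigma2_2, omega2):
    """R(F1) - R(F2) computed from the truncated risk function."""
    return risk(n, p, sigma2_1, omega2) - risk(n, p + q, sigma2_2, omega2)

def risk_difference_leading(n, q, sigma2_1, sigma2_2, omega2):
    """Leading terms of (5.1): delta / sigma2_2 - q * omega2 / (n * sigma2_2)."""
    delta = sigma2_1 - sigma2_2
    return delta / sigma2_2 - q * omega2 / (n * sigma2_2)

n, p, q, omega2, sigma2_2 = 200, 3, 1, 1.0, 1.0

# Case n * delta / omega2 = 0.5 < q: the smaller model F1 has the lower risk.
small_gap = risk_difference(n, p, q, sigma2_2 + 0.5 / n, sigma2_2, omega2)
small_gap_lead = risk_difference_leading(n, q, sigma2_2 + 0.5 / n, sigma2_2, omega2)

# Case n * delta / omega2 = 2 > q: the larger model F2 has the lower risk.
large_gap = risk_difference(n, p, q, sigma2_2 + 2.0 / n, sigma2_2, omega2)
large_gap_lead = risk_difference_leading(n, q, sigma2_2 + 2.0 / n, sigma2_2, omega2)
```

In both cases the sign of the difference is governed by whether $n\delta/\omega^2$ lies below or above $q$, and the leading terms track the truncated risk difference closely at this sample size.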
We can legitimately define a correct decision rule as follows: choose the model $F_1$ if $n\delta/\omega^2 \le q$ and choose $F_2$ if $n\delta/\omega^2 > q$.

Based on the preceding consideration, we introduce the notion of unbiasedness of a decision rule: a decision rule is said to be unbiased if the probability of choosing $F_1$ is greater than 1/2 when $n\delta/\omega^2 < q$ and less than 1/2 when $n\delta/\omega^2 > q$. If the probability decreases continuously with the increase of $n\delta/\omega^2$, the condition of unbiasedness is simply described as follows: the probability of choosing $F_1$ (or $F_2$) is 1/2 when $n\delta/\omega^2 = q$. Note that when $n\delta/\omega^2 = q$ the two alternative models are equally desirable. If the probability of choosing $F_1$ at that point exceeds 1/2, then the decision rule is said to be biased toward a simpler model; if it falls below 1/2, then the decision rule is biased toward a more complex model.

All decision rules considered so far are based on whether or not an observed value of $W$, given by (3.15), exceeds a constant which changes with $n$, $p$, and $q$. Under the assumption that $Y \sim N(\mu, \omega^2 I_n)$, $W$ is distributed as a doubly noncentral F with $(q, n-p-q)$ degrees of freedom and the noncentrality parameters

(5.2)  $\delta_1^2 = \frac{\mu' X_2^* (X_2^{*\prime} X_2^*)^{-1} X_2^{*\prime} \mu}{\omega^2}$,

(5.3)  $\delta_2^2 = \frac{\mu' [\, I - X_1 (X_1' X_1)^{-1} X_1' - X_2^* (X_2^{*\prime} X_2^*)^{-1} X_2^{*\prime} \,] \mu}{\omega^2}$,

where $X_2^* = X_2 - X_1 (X_1' X_1)^{-1} X_1' X_2$. It would be worth noting here that a decision is correct if we decide to choose $F_1$ when the noncentrality parameter of the numerator is less than its degrees of freedom, and vice versa.

In Table 5.1 we tabulate the probability that $W$ falls below the BIC critical point when $n\delta/\omega^2 = q$, i.e., when $F_1$ and $F_2$ are indifferent. It can be observed from the table that the BIC procedure is considerably biased toward a simpler model.

Table 5.1. Bias of the BIC Decision Rule

 noncentrality    n = 10    n = 20    n = 30    n = 40    n = 50
      .0           .696      .671      .664      .661      .659
      .2           .742      .720      .714      .711      .709
      .4           .781      .762      .756      .753      .752
      .6           .814      .797      .791      .789      .788
      .8           .842      .827      .822      .820      .818
     1.0           .866      .852      .848      .846      .844

      .0           .738      .689      .675      .669      .666
      .2           .781      .738      .725      .719      .715
      .4           .817      .779      .767      .761      .758
      .6           .848      .813      .802      .796      .793
      .8           .873      .842      .832      .827      .824
     1.0           .894      .866      .857      .852      .850

Each entry in the table is the probability that a doubly noncentral F variate, with noncentrality parameters $(\delta_1^2, \delta_2^2)$ and $(1, n-p-1)$ degrees of freedom, falls below the BIC critical point when $\delta_1^2 = 1$. The noncentrality is $\delta_2^2/(n-p-1)$, i.e., the normalized noncentrality parameter of the denominator in F, where $\delta_2^2$ is given by (5.3).

The unbiased decision rule has been considered in more detail by Sawa and Takeuchi [9].

Appendix

Proof of Lemma 3.1

The log likelihood function is

(A.1)  $\log f(y|\theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \| y - X\beta \|^2$,

where $\theta' = (\beta', \sigma^2)$ and $\|\cdot\|$ stands for the Euclidean norm. Differentiating it with respect to $\beta$ and $\sigma^2$, we have

(A.2)  $\frac{\partial \log f(y|\theta)}{\partial \beta} = \frac{1}{\sigma^2} X'(y - X\beta)$,

(A.3)  $\frac{\partial \log f(y|\theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \| y - X\beta \|^2$.

[...] $- 2(Y - X\beta_0)'X(\hat\beta - \beta_0)] + \| X(\hat\beta - \beta_0) \|^2 = n\sigma_0^2 - 2\mu' P_X X(\hat\beta - \beta_0) + \| X(\hat\beta - \beta_0) \|^2 = n\sigma_0^2 + \| X(\hat\beta - \beta_0) \|^2$ therein, we obtain (3.11).

Proof of Theorem 3.1

The risk function is

(A.11)  $R(F(\cdot|\hat\theta)) = E[W(F(\cdot|\hat\theta))] = \log(2\pi) + \log(\sigma^2) - E\left[\log\left(\frac{\sigma^2}{\hat\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) + \frac{1}{n\sigma^2} E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E\| X(\hat\beta - \beta) \|^2$,

where use is made of the independence of $\hat\sigma^2$ and $\hat\beta$, and the suffix of $\sigma_0^2$ and $\beta_0$ is dropped. We have the following power series expansions:

(A.12)  $\log\left(\frac{\hat\sigma^2}{\sigma^2}\right) = \log(1 + \Delta) = \Delta - \frac{1}{2}\Delta^2 + \cdots$,

(A.13)  $\frac{\sigma^2}{\hat\sigma^2} = \frac{1}{1 + \Delta} = 1 - \Delta + \Delta^2 + \cdots$,

where

(A.14)  $\Delta = \frac{\hat\sigma^2 - \sigma^2}{\sigma^2}$.

Note that under the assumptions stated in the Theorem the expectations of the higher order terms in the expansions are of order $O(n^{-2})$.

(A.15)  $\Delta = \frac{1}{n\sigma^2} [\, Y' P_X Y - n\omega^2 - \mu' P_X \mu \,] = \frac{1}{n\sigma^2} [\, V' P_X V + 2\mu' P_X V \,] - \frac{\omega^2}{\sigma^2}$,

where $V = Y - \mu$.
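Under the assumptions in the Theorem,

(A.16)  $E(V' P_X V) = \omega^2 \,\mathrm{tr}\, P_X = (n - k)\omega^2$,

(A.17)  $E(V' P_X V)^2 = \omega^4 [(\mathrm{tr}\, P_X)^2 + 2\,\mathrm{tr}\, P_X] = [(n - k)^2 + 2(n - k)]\omega^4$,

(A.18)  $E(\mu' P_X V) = E[V' P_X V \, \mu' P_X V] = 0$,

(A.19)  $E(\mu' P_X V)^2 = \omega^2 \mu' P_X \mu = n\omega^2 (\sigma^2 - \omega^2)$.

Hence, rearranging the terms, we obtain

(A.20)  $E(\Delta) = -\frac{k}{n}\left(\frac{\omega^2}{\sigma^2}\right)$,

(A.21)  $E(\Delta^2) = \frac{4}{n}\left(\frac{\omega^2}{\sigma^2}\right) - \frac{2}{n}\left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$.

Also, we have

(A.22)  $E\| X(\hat\beta - \beta) \|^2 = E\| X(X'X)^{-1} X' V \|^2 = \omega^2 \,\mathrm{tr}\, X(X'X)^{-1} X' = k\omega^2$.

Therefore,

(A.23)  $E\left[\log\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) = 1 + \frac{1}{2} E(\Delta^2) + O(n^{-2}) = 1 + \frac{2}{n}\left(\frac{\omega^2}{\sigma^2}\right) - \frac{1}{n}\left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$,

(A.24)  $E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E\| X(\hat\beta - \beta) \|^2 = k\omega^2 + O(n^{-1})$.

Substituting (A.23) and (A.24) into (A.11), we finally obtain (3.12).

Proof of Theorem 3.2

From (A.12), (A.20) and (A.21),

(A.25)  $E(\log \hat\sigma^2) = \log \sigma^2 + E(\Delta) - \frac{1}{2} E(\Delta^2) = \log \sigma^2 - \frac{k}{n}\left(\frac{\omega^2}{\sigma^2}\right) - \frac{2}{n}\left(\frac{\omega^2}{\sigma^2}\right) + \frac{1}{n}\left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$.

Moreover, as $\hat\omega^2 = \omega^2 + O_p(n^{-1/2})$ by assumption and $\hat\sigma^2 = \sigma^2 + O_p(n^{-1/2})$, we have

(A.26)  $E\left(\frac{\hat\omega^2}{\hat\sigma^2}\right) = \frac{\omega^2}{\sigma^2} [1 + O(n^{-1})]$

and

(A.27)  $E\left(\frac{\hat\omega^2}{\hat\sigma^2}\right)^2 = \left(\frac{\omega^2}{\sigma^2}\right)^2 [1 + O(n^{-1})]$.

Noting that

(A.28)  $-2 \log f(y|\hat\theta) = n \log(2\pi) + n \log \hat\sigma^2 + n$

and combining the above expectations, we obtain

(A.29)  $E[\mathrm{BIC}(F(\cdot|\hat\theta))] = nR(F(\cdot|\hat\theta)) + O(n^{-1})$.

Proof of Lemma 4.1

If we assume a linear normal regression model $Y \sim N(X\beta, \sigma^2 I)$ with a diffuse prior for $\beta$ and $\sigma^2$, the conditional posterior distribution of $\beta$, given $\sigma^2$, is $N(\hat\beta, \sigma^2 (X'X)^{-1})$, where $\hat\beta = (X'X)^{-1} X'y$ is the maximum likelihood estimate, and the marginal posterior distribution for $\sigma^2$ is the inverse gamma distribution with the density function

(A.30)  $\frac{1}{\Gamma(v/2)} \left(\frac{v s^2}{2}\right)^{v/2} \frac{1}{(\sigma^2)^{v/2 + 1}} \exp\left(-\frac{v s^2}{2\sigma^2}\right)$,

where $v = n - k$ and $s^2 = n\hat\sigma^2/(n - k)$. The proof is given by Zellner [10].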
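The conditional expectation of $\| X(\beta - \tilde\beta) \|^2$ with respect to the posterior distribution is

(A.31)  $E_{\beta|y,\sigma} \| X(\beta - \tilde\beta) \|^2 = E_{\beta|y,\sigma} \| X(\beta - \hat\beta) \|^2 + \| X(\hat\beta - \tilde\beta) \|^2 \ge E_{\beta|y,\sigma} \| X(\beta - \hat\beta) \|^2 = k\sigma^2$,

where the lower bound is attainable when $\tilde\beta = \hat\beta$; i.e., the Bayes estimate $\beta^*$ of $\beta$ is nothing but the ML estimate. A straightforward integration yields

(A.32)  $E_{\sigma^2|y}(\sigma^2) = \frac{v}{v - 2} s^2 = \frac{n - k}{n - k - 2} s^2$

as long as $v > 2$. Hence

(A.33)  $E_{\theta|y}[W(F(\cdot|\tilde\theta))] \ge \log(2\pi) + \log \tilde\sigma^2 + \frac{n}{n - k - 2}\left(1 + \frac{k}{n}\right) \frac{\hat\sigma^2}{\tilde\sigma^2}$.

The Bayes estimate of $\sigma^2$ is the $\tilde\sigma^2$ that minimizes the right-hand side of the above inequality; i.e.,

(A.34)  $\sigma^{2*} = \frac{n + k}{n - k - 2} \hat\sigma^2$,

where $\hat\sigma^2$ is the ML estimate of $\sigma^2$. Substituting this into the right-hand side of (A.33), the minimum attainable Bayes risk is evaluated as follows:

(A.35)  $B(\beta^*, \sigma^{2*}|F) = \log 2\pi + \log \sigma^{2*} + 1 = \log 2\pi + \log \hat\sigma^2 + 1 + \log\left(\frac{n + k}{n - k - 2}\right) = -\frac{2}{n} \log f(y|\hat\theta) + \log\left(\frac{n + k}{n - k - 2}\right)$.

Proof of Theorem 4.1

Let $B_1$ and $B_2$ be the minimum attainable Bayes risks, respectively, for $F_1$ and $F_2$ with diffuse priors for the parameters. The difference between $B_1$ and $B_2$ is

(A.36)  $B_1 - B_2 = \log\left(\frac{\hat\sigma_1^2}{\hat\sigma_2^2}\right) + \log\left[\frac{(n + p)(n - p - q - 2)}{(n - p - 2)(n + p + q)}\right]$.

If this is negative, we should choose $F_1$, and vice versa. By the monotonicity of the logarithmic transformation, $B_1 - B_2 < 0$ is equivalent to

(A.37)  $\frac{\hat\sigma_1^2}{\hat\sigma_2^2} < 1 + \frac{2q(n - 1)}{(n + p)(n - p - q - 2)}$,

which is again equivalent to (4.5).

Proof of Lemma 5.1

(A.38)  $R(F_1(\cdot|\hat\theta_1)) - R(F_2(\cdot|\hat\theta_2)) = \log\left(\frac{\sigma_1^2}{\sigma_2^2}\right) + \frac{p + 2}{n}\left(\frac{\omega^2}{\sigma_1^2} - \frac{\omega^2}{\sigma_2^2}\right) - \frac{1}{n}\left[\left(\frac{\omega^2}{\sigma_1^2}\right)^2 - \left(\frac{\omega^2}{\sigma_2^2}\right)^2\right] - \frac{q}{n}\left(\frac{\omega^2}{\sigma_2^2}\right) + O(n^{-2})$.

If we assume that

(A.39)  $\delta = \sigma_1^2 - \sigma_2^2 = O(n^{-1})$,

we have an expansion

(A.40)  $\log\left(\frac{\sigma_1^2}{\sigma_2^2}\right) = \log\left(1 + \frac{\delta}{\sigma_2^2}\right) = \frac{\delta}{\sigma_2^2} + O(n^{-2})$.

Also, it follows that the second and third terms on the right-hand side of (A.38) are of order $O(n^{-2})$. Hence, if we neglect the terms of order $O(n^{-2})$, we can assert that $R(F_1(\cdot|\hat\theta_1)) < R(F_2(\cdot|\hat\theta_2))$ if and only if

(A.41)  $\frac{n\delta}{\omega^2} < q$,

and vice versa.

FOOTNOTES

1.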
Regarding the importance of model identification in econometrics, readers should refer to the excellent comprehensive survey papers by Gaver and Geisel [3] and Ramsey [6]. In particular, Section 2 of Ramsey [6] gives a very illuminating as well as profound discussion of the concept of a model.

2. In what follows, for simplicity of exposition, $F(\cdot\,|\,\theta)$ will simply be called a model, instead of a family of models, except in cases where a sharp distinction needs to be drawn between a family of models and a particular element of it.

3. It is fair to say that the assumption here is nearly equivalent to assuming the normal distribution.

4. The precise meaning of $O_p(n^{-\alpha})$ is as follows: Given $\epsilon > 0$, if there exists a positive number $\lambda_\epsilon$ such that
$\Pr\{|X_n| < \lambda_\epsilon n^{-\alpha}\} > 1 - \epsilon$,
then we say that $X_n = O_p(n^{-\alpha})$. Note that
(i) $O_p(n^{-\alpha})\,O_p(n^{-\gamma}) = O_p(n^{-\alpha-\gamma})$
(ii) $O_p(n^{-\alpha}) + O_p(n^{-\alpha}) = O_p(n^{-\alpha})$.
Also, if $E|X_n|^k < \infty$, then $E|X_n|^k = O(n^{-k\alpha})$.

5. Note that the number of unknown parameters is k + 1, i.e., k regression coefficients and the variance.

6. It should be emphasized here that the difference between the AIC and BIC decision rules stems from the following: the AIC for $F_1$ is evaluated assuming that $\omega^2 - \sigma^2 = o(1)$, whereas the BIC for $F_1$ is evaluated without assuming that $\omega^2 - \sigma^2 = o(1)$. See the last paragraph of Section 3.

7. It is impossible to write down the BIC critical point explicitly as a function of n, p and q. However, for each combination of n, p and q, we can evaluate the BIC critical point numerically. Note that the inequality $\mathrm{BIC}(F_1) < \mathrm{BIC}(F_2)$ is equivalent to the inequality that the F-statistic is less than a critical point determined by n, p and q.

8. It should be noted here that the decision based on the adjusted multiple correlation coefficient is also equivalent to a decision based on the F-statistic with a constant critical point equalling one. Also,
Mallows' $C_p$ statistic leads us to a decision based on the F-statistic with a critical point equalling two, irrespective of n, p and q.

9. The difference between the AIC and the BIC is more substantial for a larger value of q. In his personal correspondence Dr. Akaike pointed out that the two criteria give almost identical critical points for cases when p/n < 0.1. An implication may be that the simplifying assumption made by Akaike is virtually harmless if the sample size is large enough to satisfy the above condition.

10. A decision rule based on $\bar R$, the multiple correlation coefficient adjusted for the degrees of freedom, is equivalent to a decision based on the F-statistic with critical point unity regardless of the degrees of freedom. (The proof is quite straightforward.) This decision rule is perhaps the one most often used in practical regression analysis. The implied significance level is a little greater than 30%. Presumably this is the most prodigal decision rule.

11. Mallows' $C_p$ statistic is $C_p = \mathrm{RSS} + 2p'\hat\omega^2$, where RSS is the residual sum of squares, $p'$ is the number of explanatory variables, and $\hat\omega^2$ is an estimate of the common variance of the $Y_i$'s. It is straightforward to show that a decision based on $C_p$ is equivalent to a decision based on the F-statistic with a constant critical point equalling two. Therefore, the AIC and BIC decision rules are asymptotically equivalent to Mallows' decision rule.

12. In his personal correspondence Dr. Akaike noticed the following: since
$\log\left(\frac{n+k}{n-k-2}\right) \approx \frac{2(k+1)}{n}$,
a decision rule based on Bayes risk is almost equivalent to the MAIC decision rule. This may provide another justification for the MAIC procedure. It is fair to note that the decision rule derived in this section is considerably different from the orthodox Bayesian approach.

REFERENCES

[1] Akaike, H. (1972) "Information Theory and an Extension of the Maximum Likelihood Principle," in Proc. 2nd Int. Symp. on Information Theory, pp. 267-281.
[2] Akaike, H. (1974) "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, Vol. AC-19, No. 6, pp. 716-723.
[3] Gaver, Kenneth M. and Martin S. Geisel (1974) "Discriminating among Alternative Models: Bayesian and Non-Bayesian Methods," in Frontiers in Econometrics, pp. 49-77.
[4] Kullback, S. (1959) Information Theory and Statistics, New York, John Wiley and Sons.
[5] Mallows, C. L. (1973) "Some Comments on $C_p$," Technometrics, Vol. 15, pp. 661-675.
[6] Ramsey, James B. (1974) "Classical Model Selection through Specification Error Tests," in Frontiers in Econometrics, pp. 13-47.
[7] Rao, C. R. (1973) Linear Statistical Inference and Its Applications, 2nd ed., New York, John Wiley and Sons.
[8] Sawa, T. and T. Hiromatsu (1973) "Minimax Regret Significance Points for a Preliminary Test in Regression Analysis," Econometrica, Vol. 41, pp. 1093-1101.
[9] Sawa, T. and Kei Takeuchi (1977) "Unbiased Decision Rule for the Choice of Regression Models," Working Paper No. 400, College of Commerce and Business Administration, University of Illinois at Urbana-Champaign.
[10] Zellner, A. (1973) An Introduction to Bayesian Inference in Econometrics, New York, John Wiley and Sons.
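The reduction of the conventional criteria to F-tests with fixed critical points, noted in footnotes 8, 10 and 11, can be illustrated with a short Python sketch. It rests on several assumptions that are not spelled out in the footnotes themselves: $F_1$ has p regressors and $F_2$ has p + q; the AIC is taken as $n\log\hat\sigma^2 + 2(k+1)$, following the parameter count in footnote 5; and the variance estimate in Mallows' statistic is taken from the larger model. All function names are hypothetical.

```python
import math
import random


def f_statistic(rss1, rss2, n, p, q):
    """F-statistic for the q extra regressors of the larger model F2."""
    return ((rss1 - rss2) / q) / (rss2 / (n - p - q))


def adj_r2_prefers_f1(rss1, rss2, n, p, q):
    """True when F1 has the larger adjusted R^2. The total sum of squares
    is common to both models, so only RSS / (degrees of freedom) matters."""
    return rss1 / (n - p) < rss2 / (n - p - q)


def mallows_prefers_f1(rss1, rss2, n, p, q):
    """True when F1 has the smaller RSS + 2 p' omega^2."""
    omega2 = rss2 / (n - p - q)  # assumption: variance estimated from F2
    return rss1 + 2 * p * omega2 < rss2 + 2 * (p + q) * omega2


def aic_critical_point(n, p, q):
    """F critical point implied by AIC(F1) < AIC(F2), i.e. by
    n * log(rss1 / rss2) < 2 * q."""
    return (n - p - q) * (math.exp(2.0 * q / n) - 1.0) / q


if __name__ == "__main__":
    rng = random.Random(0)
    n, p, q = 40, 3, 2
    agree = 0
    for _ in range(1000):
        rss2 = rng.uniform(1.0, 10.0)
        rss1 = rss2 * (1.0 + rng.uniform(0.0, 0.3))
        f = f_statistic(rss1, rss2, n, p, q)
        agree += adj_r2_prefers_f1(rss1, rss2, n, p, q) == (f < 1.0)
        agree += mallows_prefers_f1(rss1, rss2, n, p, q) == (f < 2.0)
    print("agreements out of 2000:", agree)
    for size in (40, 400, 4000):
        print(size, aic_critical_point(size, p, q))
```

Under these assumptions the adjusted-$\bar R$ rule coincides with F < 1 and the Mallows rule with F < 2, while the AIC critical point depends on n, p and q and approaches two as n grows, in line with the asymptotic equivalence stated in footnote 11.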