Model Complexity, Goodness of Fit and 
Diminishing Returns 
Igor V. Cadez 
Information and Computer Science 
University of California 
Irvine, CA 92697-3425, U.S.A. 
Padhraic Smyth
Information and Computer Science 
University of California 
Irvine, CA 92697-3425, U.S.A. 
Abstract 
We investigate a general characteristic of the trade-off in learning 
problems between goodness-of-fit and model complexity. Specifi- 
cally we characterize a general class of learning problems where the 
goodness-of-fit function can be shown to be convex within first- 
order as a function of model complexity. This general property 
of "diminishing returns" is illustrated on a number of real data 
sets and learning problems, including finite mixture modeling and 
multivariate linear regression. 
1 Introduction, Motivation, and Related Work
Assume we have a data set $D = \{x_1, x_2, \ldots, x_n\}$, where the $x_i$ could be vectors,
sequences, etc. We consider modeling the data set D using models indexed by a
complexity index k, $1 \le k \le k_{max}$. For example, the models could be finite mixture
probability density functions (PDFs) for vector xi's where model complexity is 
indexed by the number of components k in the mixture. Alternatively, the modeling 
task could be to fit a conditional regression model $y = g(z_k) + e$, where now y is
one of the variables in the vector x and $z_k$ is some subset of size k of the remaining
components in the x vector. 
Such learning tasks can typically be characterized by the existence of a model and 
a loss function. A fitted model of complexity k is a function of the data points D 
and depends on a specific set of fitted parameters $\theta$. The loss function (goodness-of-fit)
is a functional of the model and maps each specific model to a scalar used
to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for 
regression. 
Figure 1 illustrates a typical empirical curve for loss function versus complexity, for
mixtures of Markov models fitted to a large data set of 900,000 sequences. The 
complexity k is the number of Markov models being used in the mixture (see Cadez 
et al. (2000) for further details on the model and the data set). The empirical 
curve has a distinctly concave appearance, with large relative gains in fit for low 
complexity models and much more modest relative gains for high complexity models. 
A natural question is whether this concavity characteristic can be viewed as a
general phenomenon in learning and under what assumptions on model classes and
loss functions the concavity can be shown to hold. The goal of this paper is to
illustrate that in fact it is a natural characteristic for a broad range of problems in
mixture modeling and linear regression.
Figure 1: Log-likelihood scores for a Markov mixtures data set (horizontal axis: number of mixture components k).
We note of course that, for the purpose of generalization, using goodness-of-fit alone will lead
to the selection of the most complex model under consideration and will not in 
general select the model which generalizes best to new data. Nonetheless our pri- 
mary focus of interest in this paper is how goodness-of-fit loss functions (such as 
likelihood and squared error, defined on the training data D) behave in general as a 
function of model complexity k. Our concavity results have a number of interesting 
implications. For example, for model selection methods which add a penalty term
to the goodness-of-fit (e.g., BIC), the resulting score function will be unimodal as a
function of complexity k within first order.
Li and Barron (1999) have shown that for finite mixture models the expected value 
of the log-likelihood for any k is bounded below by a function of the form -C/k 
where C is a constant which is independent of k. The results presented here are 
complementary in the sense that we show that the actual maximizing log-likelihood 
itself is concave to first-order as a function of k. Furthermore, we obtain a more 
general principle of "diminishing returns," including both finite mixtures and subset 
selection in regression. 
2 Notation 
We define y = y(x) as a scalar function of x, namely a prediction at x. In linear 
regression y = y(x) is a linear function of the components in x while in density 
estimation y = y(x) is the value of the density function at x. Although the goals 
of regression and density estimation are quite different, we can view them both as 
simply techniques for approximating an unknown true function for different values 
of x. We denote the prediction of a model of complexity k as $y_k(x|\theta)$ where the
subscript indicates the model complexity and $\theta$ is the associated set of fitted parameters.
Since different choices of parameters in general yield different models, we will
typically abbreviate the notation somewhat and use different letters for different 
parameterizations of the same functional form (i.e., the same complexity), e.g., we 
may use yk(x),g(x),h(x) to refer to models of complexity k instead of specifying 
y (xlOt), y (x102), yk (x103), etc. Furthermore, since all models under discussion are 
functions of x, we sometimes omit the explicit dependence on x and use a compact 
notation y, g, h. 
We focus on classes of models that can be characterized by more complex models 
having a linear dependence on simpler models within the class. More formally, any 
model of complexity k can be decomposed as: 
Yk = clgl + c2hl + ... + ckwl. (1) 
In PDF mixture modeling we have y = p(x) and each model g, h,..., z is a basis 
PDF (e.g., a single Gaussian) but with different parameters. In multivariate linear 
regression each model g, h,..., w represents a regression on a single variable, e.g., 
g (x) above is g (x) - 7pXp where Xp is the p-th variable in the set and "/p is the 
corresponding coefficient one would obtain if regressing on Xp alone. One of the 
g, h,..., w can be a dummy constant variable to account for the intercept term. 
Note that the total parameters for the model y in both cases can be viewed as 
consisting of both the mixing proportions (the cFs) and the parameters for each 
individual component model. 
The loss function is a functional on models and we write it as E(y). For simplicity, 
we use the notation E to specify the value of the loss function for the best k- 
component model. This way, E _ E(y) for any model yk. For example, the loss 
function in PDF mixture modeling is the negative log likelihood. In linear regression 
we use empirical mean squared error (MSE) as the loss function. The loss functions 
of general interest in this context are those that decompose into a sum of functions 
over data points in the data set D (equivalently an independence assumption in a 
likelihood framework), i.e., 
$$
E(y_k) = \sum_{i=1}^{n} f(y_k(x_i)). \tag{2}
$$
For example, in PDF mixture modeling $f(y_k) = -\ln y_k$, while in regression modeling
$f(y_k) = (y - y_k)^2$ where y is a known target value.
3 Necessary Conditions on Models and Loss Functions 
We consider models that satisfy several conditions that are commonly met in real 
data analysis applications and are satisfied by both PDF mixture models and linear 
regression models: 
1. As k increases we have a nested model class, i.e., each model of complexity 
k contains each model of complexity k' < k as a special case (i.e., it reduces 
to a simpler model for a special choice of the parameters). 
2. Any two models of complexities $k_1$ and $k_2$ can be combined as a weighted
sum in any proportion to yield a valid model of complexity $k = k_1 + k_2$.
3. Each model of complexity $k = k_1 + k_2$ can be decomposed into a weighted
sum of two valid models of complexities $k_1$ and $k_2$ respectively for each
valid choice of $k_1$ and $k_2$.
The first condition guarantees that the loss function is a non-increasing function of 
k for optimal models of complexity k (in the sense of minimizing the loss function E),
the second condition prevents artificial correlation between the component models, 
while the third condition guarantees that all components are of equal expressive 
power. As an example, the standard Gaussian mixture model satisfies all three 
properties whether the covariance matrices are unconstrained or individually con- 
strained. As a counter-example, a Gaussian mixture model where the covariance 
matrices are constrained to be equal across all components does not satisfy the 
second property. 
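As an illustration of the second condition, the sketch below combines two one-dimensional Gaussian mixtures into a single mixture. The representation of a mixture as a list of (weight, mean, standard deviation) triples and the name `combine` are assumptions made for this example.

```python
def combine(mix_a, mix_b, eps):
    """Return the weighted sum (1 - eps) * mix_a + eps * mix_b.

    Each mixture is a list of (weight, mean, std) triples whose weights sum
    to 1, so the result is a valid mixture with len(mix_a) + len(mix_b)
    components.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    scaled_a = [((1.0 - eps) * w, m, s) for (w, m, s) in mix_a]
    scaled_b = [(eps * w, m, s) for (w, m, s) in mix_b]
    return scaled_a + scaled_b

two_component = [(0.5, 0.0, 1.0), (0.5, 3.0, 2.0)]
one_component = [(1.0, -1.0, 0.5)]
three_component = combine(two_component, one_component, eps=0.25)
```

The components keep their own standard deviations after combination. If all components were required to share one covariance, combining two mixtures with different shared values would leave the constrained class, which is why the equal-covariance model fails the second condition.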
We assume the learning task consists of minimization of the loss function. If maxi- 
mization is more appropriate, we can just consider minimization of the negative of the loss 
function. 
4 Theoretical Results on Loss Function Convexity 
We formulate and prove the following theorem: 
Theorem 1: In a learning problem that satisfies the properties from Section 3, the
loss function is first order convex in model complexity k, meaning that
$E^*_{k+1} - 2E^*_k + E^*_{k-1} \ge 0$ within first order (as defined in the proof). The quantities $E^*_k$ and $E^*_{k \pm 1}$
are the values of the loss function for the best k and k ± 1-component models.
Proof: In the first part of the proof we analyze a general difference of loss functions 
and write it in a convenient form. Consider two arbitrary models, g and h and 
the corresponding loss functions E(g) and E(h) (g and h need not have the same 
complexity). The difference in loss functions can be expressed as: 
$$
\begin{aligned}
E(g) - E(h) &= \sum_{i=1}^{n} \left\{ f[g(x_i)] - f[h(x_i)] \right\} \\
&= \sum_{i=1}^{n} \left\{ f[h(x_i)(1 + \delta_{g,h}(x_i))] - f[h(x_i)] \right\} \\
&= \alpha \sum_{i=1}^{n} h(x_i) f'(h(x_i)) \delta_{g,h}(x_i),
\end{aligned} \tag{3}
$$
where the last equation comes from a first order Taylor series expansion around each
$\delta_{g,h}(x_i) = 0$, $\alpha$ is an unknown constant of proportionality (to make the equation
exact) and
$$
\delta_{g,h}(x) = \frac{g(x) - h(x)}{h(x)} \tag{4}
$$
represents the relative difference in models g and h at point x. For example, Equation 3
reduces to a first order Taylor series approximation for $\alpha = 1$. If f(y) is a
convex function we also have:
$$
E(g) - E(h) \ge \sum_{i=1}^{n} h(x_i) f'(h(x_i)) \delta_{g,h}(x_i), \tag{5}
$$
since the remainder in the Taylor series expansion, $R_2 = \frac{1}{2} f''(h(1 + \theta\delta))\, h^2 \delta^2$ with $0 < \theta < 1$, is non-negative.
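As a worked check of Equation 5 for the two loss functions used in this paper, consider a single data point, with $\delta$ the relative difference of Equation 4 and $t$ a known regression target (the symbol $t$ is introduced only for this illustration). For the mixture loss $f(y) = -\ln y$ we have $h f'(h) = -1$, and for the squared error $f(y) = (t - y)^2$ we have $h f'(h)\,\delta = -2(t - h)(g - h)$:

$$
\begin{aligned}
f(y) = -\ln y: \quad & f(g) - f(h) = -\ln(1 + \delta) \;\ge\; -\delta = h f'(h)\,\delta, \\
f(y) = (t - y)^2: \quad & f(g) - f(h) = -2(t - h)(g - h) + (g - h)^2 \;\ge\; -2(t - h)(g - h) = h f'(h)\,\delta .
\end{aligned}
$$

The first line is the elementary inequality $\ln(1 + \delta) \le \delta$, and the second holds because $(g - h)^2 \ge 0$. Summing either line over the data points gives Equation 5.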
In the second part of the proof we use Equation 5 to derive an appropriate condition
on loss functions. Consider the best k and k ± 1-component models and the
appropriate difference of the corresponding loss functions $E^*_{k+1} - 2E^*_k + E^*_{k-1}$, which
we can write using the notation from Equation 3 and Equation 5 (since we consider
convex functions $f(y) = -\ln y$ for PDF modeling and $f(y) = (y - y_i)^2$ for best
subset regression) as:
$$
\begin{aligned}
E^*_{k+1} - 2E^*_k + E^*_{k-1}
&= \sum_{i=1}^{n} \left[ f(y^*_{k+1}(x_i)) - f(y^*_k(x_i)) \right] + \sum_{i=1}^{n} \left[ f(y^*_{k-1}(x_i)) - f(y^*_k(x_i)) \right] \\
&\ge \sum_{i=1}^{n} y^*_k(x_i) f'(y^*_k(x_i)) \delta_{y^*_{k+1}, y^*_k}(x_i) + \sum_{i=1}^{n} y^*_k(x_i) f'(y^*_k(x_i)) \delta_{y^*_{k-1}, y^*_k}(x_i) \\
&= \sum_{i=1}^{n} y^*_k(x_i) f'(y^*_k(x_i)) \left[ \delta_{y^*_{k+1}, y^*_k}(x_i) + \delta_{y^*_{k-1}, y^*_k}(x_i) \right].
\end{aligned} \tag{6}
$$
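According to the requirements on models in Section 3, the best k + 1-component
model can be decomposed as
$$
y^*_{k+1} = (1 - \epsilon) g_k + \epsilon g_1
$$
where $g_k$ is a k-component model and $g_1$ is a 1-component model. Similarly, an
artificial k-component model can be constructed from the best k - 1-component model:
$$
\xi_k = (1 - \epsilon) y^*_{k-1} + \epsilon g_1.
$$
Upon subtracting $y^*_k$ from each of the equations and dividing by $y^*_k$, using notation
from Equation 4, we get:
$$
\begin{aligned}
\delta_{y^*_{k+1}, y^*_k} &= (1 - \epsilon)\, \delta_{g_k, y^*_k} + \epsilon\, \delta_{g_1, y^*_k} \\
\delta_{\xi_k, y^*_k} &= (1 - \epsilon)\, \delta_{y^*_{k-1}, y^*_k} + \epsilon\, \delta_{g_1, y^*_k}
\end{aligned}
$$
which upon subtraction and rearrangement of terms yields:
$$
\delta_{y^*_{k+1}, y^*_k} + \delta_{y^*_{k-1}, y^*_k} = (1 - \epsilon)\, \delta_{g_k, y^*_k} + \delta_{\xi_k, y^*_k} + \epsilon\, \delta_{y^*_{k-1}, y^*_k}. \tag{7}
$$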
If we evaluate this equation at each of the data points xi and substitute the result 
back into equation 6 we get: 
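$$
E^*_{k+1} - 2E^*_k + E^*_{k-1} \ge \sum_{i=1}^{n} y^*_k(x_i) f'(y^*_k(x_i)) \left[ (1 - \epsilon)\, \delta_{g_k, y^*_k}(x_i) + \delta_{\xi_k, y^*_k}(x_i) + \epsilon\, \delta_{y^*_{k-1}, y^*_k}(x_i) \right]. \tag{8}
$$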
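The rearrangement behind Equation 8 can be checked numerically at a single data point. The sketch below is an illustration only: the function names, the symbol `xi_k` for the artificial k-component model, and the numeric values are assumptions chosen for this example.

```python
def rel_diff(g, h):
    """Relative difference (g - h) / h of two model values at one point (Equation 4)."""
    return (g - h) / h

def rearrangement_gap(y_k, y_km1, g_k, g_1, eps):
    """Left side minus right side of the bracketed identity used in Equation 8."""
    # Decomposition of the best (k+1)-component model and the artificial model.
    y_kp1 = (1 - eps) * g_k + eps * g_1
    xi_k = (1 - eps) * y_km1 + eps * g_1
    lhs = rel_diff(y_kp1, y_k) + rel_diff(y_km1, y_k)
    rhs = ((1 - eps) * rel_diff(g_k, y_k)
           + rel_diff(xi_k, y_k)
           + eps * rel_diff(y_km1, y_k))
    return lhs - rhs

# Hypothetical model values at one data point.
gap = rearrangement_gap(y_k=0.40, y_km1=0.30, g_k=0.45, g_1=0.10, eps=0.2)
```

The gap vanishes up to rounding for any positive model values and any mixing weight in [0, 1], since the identity is purely algebraic.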
In the third part of the proof we analyze each of the terms in Equation 8 using 
Equation 3. Consider the first term: 
$$
\Delta_{g_k, y^*_k} = \sum_{i=1}^{n} y^*_k(x_i) f'(y^*_k(x_i))\, \delta_{g_k, y^*_k}(x_i) \tag{9}
$$
that depends on the relative difference of models $g_k$ and $y^*_k$ at each of the data points
$x_i$. According to Equation 3, for small $\delta_{g_k, y^*_k}(x_i)$ (which is presumably true), we can
set $\alpha \approx 1$ to get a first order Taylor expansion. Since $y^*_k$ is the best k-component
model, we have $E(g_k) \ge E(y^*_k) = E^*_k$ and consequently
$$
E(g_k) - E(y^*_k) = \alpha\, \Delta_{g_k, y^*_k} \approx \Delta_{g_k, y^*_k} \ge 0. \tag{10}
$$
Note that in order to have the last inequality hold, we do not require that $\alpha \approx 1$,
but only that
$$
\alpha > 0 \tag{11}
$$
which is a weaker condition that we refer to as the first order approximation.
In other words, we only require that the sign is preserved when making the Taylor
expansion while the actual value need not be very accurate. Similarly, each of
the three terms on the right hand side of Equation 8 is first order positive since
$E(y^*_k) \le E(g_k), E(\xi_k), E(y^*_{k-1})$. This shows that
$$
E^*_{k+1} - 2E^*_k + E^*_{k-1} \ge 0
$$
within first order, concluding the proof.
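The quantity bounded in Theorem 1 is a discrete second difference, so the property can be checked directly on an empirical loss curve. The following is a minimal sketch; the function names, the tolerance parameter, and the loss values are hypothetical and serve only to show the computation.

```python
def second_differences(losses):
    """Return E[k+1] - 2*E[k] + E[k-1] for every interior complexity k.

    losses[j] is assumed to hold the best loss found at complexity k = j + 1.
    """
    return [losses[j + 1] - 2 * losses[j] + losses[j - 1]
            for j in range(1, len(losses) - 1)]

def is_convex(losses, tol=0.0):
    """True if every second difference is non-negative up to a tolerance."""
    return all(d >= -tol for d in second_differences(losses))

# Hypothetical loss curves indexed by k = 1, 2, 3, ...
diminishing = [100.0, 60.0, 40.0, 30.0, 25.0]
bumpy = [100.0, 60.0, 55.0, 30.0]

diffs = second_differences(diminishing)   # [20.0, 10.0, 5.0]
```

Because the theorem holds only within first order, and fitted models may be local rather than global optima, a small positive `tol` may be appropriate when applying such a check to real curves.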
5 Convexity in Common Learning Problems 
In this section we specialize Theorem 1 to several well-known learning situations.
Each proof consists of merely selecting the appropriate loss function E(y) and model 
family y. 
5.1 Concavity of Mixture Model Log-Likelihoods 
Theorem 2: In mixture model learning, using log-likelihood as the loss function and 
using unconstrained mixture components, the in-sample log likelihood is a first- 
order concave function of the complexity k. 
Proof: By using $f(y) = -\ln y$ in Theorem 1 the loss function E(y) becomes the
negative of the in-sample log likelihood, hence it is a first-order convex function of 
complexity k, i.e., the log likelihood is first-order concave. 
Corollary 1: If a linear or convex penalty term in k is subtracted from the in-sample 
log likelihood in Theorem 2, using the mixture models as defined in Theorem 2, then 
the penalized likelihood can have at most one maximum to within first order. The 
BIC criterion, for example, satisfies this condition.
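The reasoning behind Corollary 1 can be spelled out in one line. The symbols $L_k$ (best in-sample log-likelihood at complexity k), $P(k)$ (penalty) and $S_k$ (penalized score) are introduced here only for illustration:

$$
S_k = L_k - P(k), \qquad
S_{k+1} - 2S_k + S_{k-1}
= \underbrace{\left(L_{k+1} - 2L_k + L_{k-1}\right)}_{\le\, 0 \text{ within first order}}
- \underbrace{\left(P(k+1) - 2P(k) + P(k-1)\right)}_{\ge\, 0 \text{ for linear or convex } P}
\;\le\; 0 .
$$

Hence the increments $S_{k+1} - S_k$ are non-increasing in k: once the penalized score starts to decrease it cannot increase again, so it has at most one maximum (possibly a plateau). For BIC the penalty is $P(k) = \frac{d_k}{2} \ln n$, where $d_k$ is the number of free parameters; assuming every component contributes the same number of parameters, $d_k$ is linear in k and the second difference of $P$ is zero.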
5.2 Convexity of Mean-Square-Error for Subset Selection in Linear 
Regression 
Theorem 3: In linear regression learning where $y_k$ represents the best linear regression
defined over all possible subsets of k regression variables, the mean squared
error (MSE) is first-order convex as a function of the complexity k. 
Proof: We use $f(y_k(x_i)) = (y_i - y_k(x_i))^2$ which is a convex function of $y_k$. The
corresponding loss function E(y) becomes the mean-square-error and is first-order 
convex as a function of the complexity k by the proof of Theorem 1. 
Corollary 2: If a convex or linear penalty term in k is added to the mean squared
error as defined in Theorem 3, then the resulting penalized mean-square-error can
have at most one minimum to within first order. Such penalty terms include Mallows'
$C_p$ criterion, AIC, BIC, predicted squared error, etc. (e.g., see Bishop (1995)).
6 Experimental Results 
In this section we demonstrate empirical evidence of the approximate concavity 
property on three different data sets with model families and loss functions which 
satisfy the assumptions stated earlier: 
1. Mixtures of Gaussians: 3962 data points in 2 dimensions, representing the first 
two principal components of historical geopotential data from upper-atmosphere 
data records, were fit with a mixture of k Gaussian components, k varying from 1 
to 20 (see Smyth, Ide, and Ghil (1999) for more discussion of this data). Figure 2(a)
illustrates that the log-likelihood is approximately concave as a function of k. Note
that it is not completely concave. This could be a result of either local maxima in the 
fitting process (the maximum likelihood solutions in the interior of parameter space 
were selected as the best obtained by EM from 10 different randomly chosen initial 
conditions), or may indicate that concavity cannot be proven beyond a first-order 
characterization in the general case. 
2. Mixtures of Markov Chains: Page-request sequences logged at the msnbc.com
Web site over a 24-hour period from over 900,000 individuals were fit with mixtures
of first-order Markov chains (see Cadez et al. (2000) for further details). Figure 1
again clearly shows a concave characteristic for the log-likelihood as a function of 
k, the number of Markov components in the model. 
3. Subset Selection in Linear Regression: Autoregressive (AR) linear models were
fit (closed form solutions for the optimal model parameters) to a monthly financial
time series with 307 observations, for all possible combinations of lags (all possible
subsets) from order k = 1 to order k = 12. For example, the k = 1 model represents
the best model with a single predictor from the previous 12 months, not necessarily
the AR(1) model. Again the goodness-of-fit curve is almost convex in k (Figure
2(b)), except at k = 9 where there is a slight non-convexity: this could again be
either a numerical estimation effect or a fundamental characteristic indicating that
convexity is only true to first-order.
Figure 2: (a) In-sample log-likelihood for mixture modeling of the atmospheric data
set (horizontal axis: number of mixture components k), (b) mean-squared error for
regression using the financial data set (horizontal axis: number of regression variables k).
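The best-subset procedure described in item 3 can be sketched in a few lines. This is not the authors' code: the helper names `solve`, `subset_mse` and `best_subset_curve`, the use of normal equations, and the toy data are all assumptions for illustration, and the sketch assumes the chosen columns are linearly independent.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def subset_mse(rows, targets, subset):
    """MSE of the least-squares fit using an intercept plus the columns in `subset`."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    p = len(design[0])
    gram = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    moment = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    coef = solve(gram, moment)                        # normal equations
    residuals = [t - sum(c * v for c, v in zip(coef, d))
                 for d, t in zip(design, targets)]
    return sum(r * r for r in residuals) / len(targets)

def best_subset_curve(rows, targets):
    """Best MSE over all subsets of size k, for k = 1 .. number of variables."""
    n_vars = len(rows[0])
    return [min(subset_mse(rows, targets, s)
                for s in combinations(range(n_vars), k))
            for k in range(1, n_vars + 1)]

# Toy data (hypothetical): the target depends exactly on variables 0 and 1.
rows = [[1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [3.0, 1.0, 4.0],
        [4.0, 0.0, 1.0], [5.0, 2.0, 0.0], [6.0, 1.0, 2.0]]
targets = [1.0 + 2.0 * r[0] - 3.0 * r[1] for r in rows]
curve = best_subset_curve(rows, targets)
```

In this toy case the curve drops sharply from k = 1 to k = 2 and is flat afterwards, a simple instance of diminishing returns. The exhaustive search visits every subset, so its cost grows exponentially with the number of candidate variables; for the 12 lags in the experiment above that is 4095 non-empty subsets.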
7 Discussion and Conclusions 
Space does not permit a full discussion of the various implications of the results 
derived here. The main implication is that for at least two common learning sce- 
narios the maximizing/minimizing value of the loss function is strongly constrained 
as model complexity is varied. Thus, for example, when performing model selection 
using penalized goodness-of-fit (as in the Corollaries above) variants of binary search 
may be quite useful in problems where k is very large (in the mixtures of Markov 
chains above it is not necessary to fit the model for all values of k, i.e., we can simply 
interpolate within first-order). Extensions to model selection using loss-functions 
defined on out-of-sample test data sets can also be derived, and can be carried over 
under appropriate assumptions to cross-validation. Note that the results described 
here do not have an obvious extension to non-linear models (such as feed-forward 
neural networks) or loss-functions such as the 0/1 loss for classification. 
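One possible variant of the binary search mentioned above is sketched below. It assumes the penalized score has non-increasing increments in k, which the corollaries guarantee only within first order, so on real data the result should be treated as a candidate rather than a certified optimum. The function name `argmax_concave` and the toy score are hypothetical.

```python
def argmax_concave(score, k_min, k_max):
    """Return a k in [k_min, k_max] maximizing score(k).

    Assumes the increments score(k + 1) - score(k) are non-increasing in k,
    so the sign of one increment tells which side the maximum lies on.
    """
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if score(mid + 1) > score(mid):
            lo = mid + 1      # still improving: the maximum is to the right
        else:
            hi = mid          # no longer improving: the maximum is at mid or to the left
    return lo

def toy_score(k):
    """Hypothetical penalized score: a concave fit term minus a linear penalty."""
    return 100.0 * (1.0 - 1.0 / k) - 2.0 * k

best_k = argmax_concave(toy_score, 1, 200)
```

Each iteration evaluates the score at two adjacent complexities, so the number of model fits grows with the logarithm of the range of k instead of linearly, which matters when each fit is expensive.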
References 
Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press, 
1995, pp. 376-377. 
Cadez, I., D. Heckerman, C. Meek, P. Smyth, and S. White, 'Visualization of
navigation patterns on a Web site using model-based clustering,' Technical
Report MSR-TR-00-18, Microsoft Research, Redmond, WA, 2000.
Li, Jonathan Q., and Barron, Andrew R., 'Mixture density estimation,' presented
at NIPS 99.
Smyth, P., K. Ide, and M. Ghil, 'Multiple regimes in Northern hemisphere height 
fields via mixture model clustering,' Journal of the Atmospheric Sciences, 
vol. 56, no. 21, 3704-3723, 1999. 
