Algebraic Information Geometry for 
Learning Machines with Singularities 
Sumio Watanabe 
Precision and Intelligence Laboratory 
Tokyo Institute of Technology 
4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan 
swatahab pi. titech. ac.jp 
Abstract 
Algebraic geometry is essential to learning theory. In hierarchical
learning machines such as layered neural networks and Gaussian
mixtures, asymptotic normality does not hold, since the Fisher
information matrices are singular. In this paper, the rigorous
asymptotic form of the stochastic complexity is clarified based on
resolution of singularities, and two different problems are studied.
(1) If the prior is positive, then the stochastic complexity is far
smaller than BIC, resulting in a smaller generalization error than
that of regular statistical models, even when the true distribution
is not contained in the parametric model. (2) If Jeffreys' prior,
which is coordinate free and equal to zero at singularities, is
employed, then the stochastic complexity has the same form as BIC.
It is useful for model selection, but not for generalization.
1 Introduction 
The Fisher information matrix determines a metric of the set of all parameters of
a learning machine [2]. If it is positive definite, then a learning machine can be
understood as a Riemannian manifold. However, almost all learning machines such as
layered neural networks, Gaussian mixtures, and Boltzmann machines have singular
Fisher metrics. For example, in a three-layer perceptron, the Fisher information
matrix $I(w)$ for a parameter $w$ is singular ($\det I(w) = 0$) if and only if $w$
represents a small model which can be realized with fewer hidden units than the
learning model. Therefore, when the learning machine is in an almost redundant
state, any method in statistics and physics that uses a quadratic approximation of
the loss function cannot be applied. In fact, the maximum likelihood estimator is
not subject to the asymptotic normal distribution [4]. The Bayesian posterior
probability converges to a distribution which is quite different from the normal
one [8]. To construct a mathematical foundation for such learning machines, we
clarified the essential relation between algebraic geometry and Bayesian statistics
[9,10]. In this paper, we show that the asymptotic form of the Bayesian stochastic
complexity is rigorously obtained by resolution of singularities. The Bayesian
method gives powerful tools for both generalization and model selection; however,
the appropriate prior for each purpose is quite different.
2 Stochastic Complexity 
Let $p(x|w)$ be a learning machine, where $x$ is a pair of an input and an output,
and $w \in \mathbb{R}^d$ is a parameter. We prepare a prior distribution $\varphi(w)$ on
$\mathbb{R}^d$. Training samples $X^n = (X_1, X_2, \ldots, X_n)$ are independently taken from
the true distribution $q(x)$, which is not contained in $p(x|w)$ in general. The
stochastic complexity $F(X^n)$ and its average $F(n)$ are defined by
$$F(X^n) = -\log \int \prod_{i=1}^{n} p(X_i|w)\, \varphi(w)\, dw$$
and $F(n) = E_{X^n}\{F(X^n)\}$, respectively, where $E_{X^n}\{\cdot\}$ denotes the expectation
value over all training sets. The stochastic complexity plays a central role in
Bayesian statistics. Firstly, $F(n+1) - F(n) - S$, where $S = -\int q(x) \log q(x)\,dx$, is
equal to the average Kullback distance from $q(x)$ to the Bayes predictive
distribution $p(x|X^n)$, which is called the generalization error and denoted by
$G(n)$. Secondly, $\exp(-F(X^n))$ is in proportion to the posterior probability of the
model; hence, the best model is selected by minimization of $F(X^n)$ [7]. And lastly,
if the prior distribution has a hyperparameter $\theta$, that is to say,
$\varphi(w) = \varphi(w|\theta)$, then it is optimized by minimization of $F(X^n)$ [1].
We define a function $F_0(n)$ using the Kullback distance $H(w)$,
$$F_0(n) = -\log \int \exp(-nH(w))\, \varphi(w)\, dw, \qquad H(w) = \int q(x) \log \frac{q(x)}{p(x|w)}\, dx.$$
Then by Jensen's inequality, $F(n) - Sn \le F_0(n)$. Moreover, we assume that
$L(x, w) = \log q(x) - \log p(x|w)$ is an analytic function from $w$ to the Hilbert space
of all square integrable functions with the measure $q(x)dx$, and that the support of
the prior $W = \mathrm{supp}\, \varphi$ is compact. Then $H(w)$ is an analytic function on $W$, and
there exists a constant $c > 0$ such that, for an arbitrary $n$,
$$F_0\!\left(\frac{n}{2}\right) - c \le F(n) - Sn \le F_0(n). \qquad (1)$$
3 General Learning Machines 
In this section, we study a case when the true distribution is contained in the
parametric model, that is to say, there exists a parameter $w_0 \in W$ such that
$q(x) = p(x|w_0)$. Let us introduce a zeta function $J(z)$ ($z \in \mathbb{C}$) of $H(w)$ and a
state density function $v(t)$ by
$$J(z) = \int H(w)^z \varphi(w)\, dw, \qquad v(t) = \int \delta(t - H(w))\, \varphi(w)\, dw.$$
Then, $J(z)$ and $F_0(n)$ are represented by the Mellin and the Laplace transform of
$v(t)$, respectively,
$$J(z) = \int_0^h t^z v(t)\, dt, \qquad F_0(n) = -\log \int_0^h \exp(-nt)\, v(t)\, dt,$$
where $h = \max_{w \in W} H(w)$. Therefore $F_0(n)$, $v(t)$, and $J(z)$ are mathematically
connected. It is obvious that $J(z)$ is a holomorphic function in $\mathrm{Re}(z) > 0$. Moreover,
by using the existence of Sato-Bernstein's b-function [6], it can be analytically
continued to a meromorphic function on the entire complex plane, whose poles are
real, negative, and rational numbers. Let $-h_1 > -h_2 > -h_3 > \cdots$ be the poles of
$J(z)$ and $m_k$ be the order of $-h_k$. Then, by using the inverse Mellin transform, it
follows that $v(t)$ has an asymptotic expansion with coefficients $\{c_{km}\}$,
$$v(t) \cong \sum_{k=1}^{\infty} \sum_{m=1}^{m_k} c_{km}\, t^{h_k - 1} (-\log t)^{m-1} \qquad (t \to +0).$$
Therefore, $F_0(n)$ also has an asymptotic expansion; by putting $h = h_1$ and $m = m_1$,
$$F_0(n) = h \log n - (m - 1) \log \log n + O(1),$$
which ensures the asymptotic expansion of $F(n)$ by eq.(1),
$$F(n) = Sn + h \log n - (m - 1) \log \log n + O(1).$$
The Kullback distance $H(w)$ depends on the analytic set $W_0 = \{w \in W;\ H(w) = 0\}$,
so that both $h$ and $m$ depend on $W_0$. Note that, if the Bayes generalization
error $G(n) = F(n+1) - F(n) - S$ has an asymptotic expansion, it should be
$h/n - (m - 1)/(n \log n)$. The following lemma is proven using the definition of
$F_0(n)$ and its asymptotic expansion.
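The expansion $F_0(n) = h \log n - (m-1) \log \log n + O(1)$ can be checked numerically on a small singular example. The sketch below is an illustration under stated assumptions, not a computation from the paper: it takes the toy function $H(w) = w_1^2 w_2^2$ with a uniform prior on $[0,1]^2$, for which $J(z) = 1/(2z+1)^2$ has its largest pole at $z = -1/2$ with order 2, so $h = 1/2$ and $m = 2$. The names `F0` and `leading_terms` are hypothetical.

```python
import math

def F0(n, num=20001):
    # Toy setting (an assumption, not from the paper): H(w) = w1^2 * w2^2 with a
    # uniform prior on [0, 1]^2, so F0(n) = -log of the integral of exp(-n w1^2 w2^2).
    # The w1-integral has the closed form sqrt(pi) * erf(sqrt(n) w2) / (2 sqrt(n) w2).
    # Substituting s = sqrt(n) * w2 = exp(t) turns the w2-integral into the
    # integral of erf(exp(t)) over t <= log(sqrt(n)), truncated below at t = -40.
    t_min, t_max = -40.0, 0.5 * math.log(n)
    dt = (t_max - t_min) / (num - 1)
    f = [math.erf(math.exp(t_min + i * dt)) for i in range(num)]
    integral = dt * (sum(f) - 0.5 * (f[0] + f[-1]))  # trapezoid rule
    Z = math.sqrt(math.pi) / (2.0 * math.sqrt(n)) * integral
    return -math.log(Z)

def leading_terms(n, h=0.5, m=2):
    # h log n - (m - 1) log log n, with (h, m) read off from J(z) = 1 / (2z + 1)^2.
    return h * math.log(n) - (m - 1) * math.log(math.log(n))

sample_sizes = [10**2, 10**3, 10**4, 10**5, 10**6]
values = [F0(n) for n in sample_sizes]
residuals = [F - leading_terms(n) for F, n in zip(values, sample_sizes)]
for n, F, r in zip(sample_sizes, values, residuals):
    print(n, round(F, 4), round(r, 4))
```

In this toy case the residual stays bounded while $\log n$ grows, and $F_0(n)$ stays below the regular value $(d/2) \log n = \log n$ for $d = 2$. The residual settles slowly because the correction to the leading terms is of order $1/\log n$.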
Lemma 1 (1) Let $(h_i, m_i)$ $(i = 1, 2)$ be constants corresponding to $(H_i(w), \varphi_i(w))$
$(i = 1, 2)$. If $H_1(w) \le H_2(w)$ and $\varphi_1(w) \ge \varphi_2(w)$, then '$h_1 < h_2$' or
'$h_1 = h_2$ and $m_1 \ge m_2$'.
(2) Let $(h_i, m_i)$ $(i = 1, 2)$ be constants corresponding to $(H_i(w_i), \varphi_i(w_i))$ $(i = 1, 2)$.
Let $w = (w_1, w_2)$, $H(w) = H_1(w_1) + H_2(w_2)$, and $\varphi(w) = \varphi_1(w_1) \varphi_2(w_2)$. Then the
constants of $(H(w), \varphi(w))$ are $h = h_1 + h_2$ and $m = m_1 + m_2 - 1$.
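As a small worked check of part (2) of the lemma, consider the regular toy case below. The one-dimensional functions and the uniform priors on $[-1, 1]$ are assumptions chosen for illustration.

```latex
H_i(w_i) = w_i^2, \quad \varphi_i(w_i) = \tfrac{1}{2} \ \ (|w_i| \le 1)
\;\Longrightarrow\;
J_i(z) = \int_{-1}^{1} |w_i|^{2z}\, \tfrac{1}{2}\, dw_i = \frac{1}{2z + 1},
\qquad (h_i, m_i) = \left(\tfrac{1}{2},\, 1\right).

H(w) = w_1^2 + w_2^2, \quad \varphi(w) = \varphi_1(w_1)\, \varphi_2(w_2)
\;\Longrightarrow\;
h = h_1 + h_2 = 1 = \tfrac{d}{2}, \qquad m = m_1 + m_2 - 1 = 1.
```

This agrees with a direct computation: the integral of $\exp(-n(w_1^2 + w_2^2))\varphi(w)$ is of order $1/n$, so $F_0(n) = \log n + O(1)$ in this example.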
The concrete values of $h$ and $m$ can be algorithmically obtained by the following
theorem. Let $W^i$ be the open kernel of $W$ (the maximal open set contained in $W$).
Theorem 1 (Resolution of Singularities, Hironaka [5]) Let $H(w) \ge 0$ be a real
analytic function on $W^i$. Then there exist both a real $d$-dimensional manifold $U$ and
a real analytic function $g: U \to W^i$ such that, in a neighborhood of an arbitrary
$u \in U$,
$$H(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}, \qquad (2)$$
where $a(u) > 0$ is an analytic function and $\{s_i\}$ are non-negative integers. Moreover,
for an arbitrary compact set $K \subset W$, $g^{-1}(K) \subset U$ is a compact set. Such a
function $g(u)$ can be found by finite blowing-ups.
Remark. By applying eq.(2) to the definition of $J(z)$, one can see that the integral
in $J(z)$ is decomposed into a direct product of the integrals over each variable [3].
Applications to learning theory are shown in [9,10]. In general it is not easy to
find $g(u)$ that gives the complete resolution of singularities; however, in this paper,
we show that even a partial resolution mapping gives an upper bound of $h$.
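To illustrate how a resolution map turns the zeta function into a product of one-variable integrals, here is a hedged worked example. The function $H(w) = w_1^2 (w_1^2 + w_2^2)$, taken with a uniform prior near the origin, is a toy choice that does not appear in the paper; one blowing-up at the origin, described by two coordinate charts, is enough to bring it to the normal crossing form of eq.(2).

```latex
% Chart 1: w_1 = u_1, \; w_2 = u_1 u_2, \; dw = |u_1|\, du
H(g(u)) = u_1^{4}\, (1 + u_2^2), \qquad
\int |u_1|^{4z}\, |u_1|\, du_1
\;\Rightarrow\; \text{pole at } z = -\tfrac{1}{2} \ \text{(order 1)}.

% Chart 2: w_1 = u_1 u_2, \; w_2 = u_2, \; dw = |u_2|\, du
H(g(u)) = u_1^{2}\, u_2^{4}\, (1 + u_1^2), \qquad
\int |u_1|^{2z}\, du_1 \int |u_2|^{4z}\, |u_2|\, du_2
\;\Rightarrow\; \text{poles at } z = -\tfrac{1}{2} \ \text{from both factors (order 2)}.

% Largest pole of J(z) and its order:
h = \tfrac{1}{2} < \tfrac{d}{2} = 1, \qquad m = 2.
```

In both charts the factor $a(u)$ is strictly positive, so the poles come only from the monomial part and the Jacobian, as the Remark describes.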
Definition. We introduce two different priors.
(1) The prior distribution $\varphi(w)$ is called positive if $\varphi(w) > 0$ for an arbitrary
$w \in W^i$ ($W = \mathrm{supp}\, \varphi$).
(2) The prior distribution $\varphi(w)$ is called Jeffreys' one if
$$\varphi(w) = \frac{1}{Z} \sqrt{\det I(w)}, \qquad I_{ij}(w) = \int \frac{\partial L}{\partial w_i} \frac{\partial L}{\partial w_j}\, p(x|w)\, dx,$$
where $Z$ is a normalizing constant and $I(w)$ is the Fisher information matrix. In
neural networks and Gaussian mixtures, Jeffreys' prior is not positive, since
$\det I(w) = 0$ on the parameters which represent the smaller models.
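The vanishing of Jeffreys' prior on smaller models can be seen in a one-hidden-unit toy regression. The sketch below is an illustration under assumptions that are not in the paper: the model is $y = a \tanh(bx) + $ unit-variance Gaussian noise with $x \sim N(0,1)$, for which the Fisher information reduces to the input average of the outer product of the gradient of $a \tanh(bx)$. The function names are hypothetical.

```python
import math

def fisher_matrix(a, b, num=4001, x_max=8.0):
    """Fisher information of p(y|x, a, b) = N(y; a*tanh(b*x), 1) with x ~ N(0, 1).

    For unit-variance Gaussian noise, I_ij = E_x[(df/dw_i)(df/dw_j)] with
    f = a*tanh(b*x); the expectation is approximated by the trapezoid rule
    on [-x_max, x_max]. Returns the entries (I_aa, I_ab, I_bb).
    """
    dx = 2.0 * x_max / (num - 1)
    I_aa = I_ab = I_bb = 0.0
    for i in range(num):
        x = -x_max + i * dx
        weight = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) * dx
        if i == 0 or i == num - 1:
            weight *= 0.5
        df_da = math.tanh(b * x)
        df_db = a * x / math.cosh(b * x) ** 2
        I_aa += weight * df_da * df_da
        I_ab += weight * df_da * df_db
        I_bb += weight * df_db * df_db
    return I_aa, I_ab, I_bb

def jeffreys_unnormalized(a, b):
    """sqrt(det I(w)), i.e. Jeffreys' prior up to the normalizing constant Z."""
    I_aa, I_ab, I_bb = fisher_matrix(a, b)
    return math.sqrt(max(I_aa * I_bb - I_ab * I_ab, 0.0))

for a, b in [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print((a, b), jeffreys_unnormalized(a, b))
```

At $a = 0$ or $b = 0$ the hidden unit contributes nothing to the output, one gradient component vanishes identically, and the density is zero; at a generic point such as $(1, 1)$ it is strictly positive.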
Theorem 2 Assume that there exists a parameter $w_0 \in W^i$ such that
$q(x) = p(x|w_0)$. Then the following hold.
(1) If the prior is positive, then $0 < h \le d/2$ and $1 \le m \le d$. If $p(x|w)$ satisfies the
condition of the asymptotic normality, then $h = d/2$ and $m = 1$.
(2) If Jeffreys' prior is applied, then '$h > d/2$' or '$h = d/2$ and $m = 1$'.
(Outline of the Proof) (1) In order to examine the poles of $J(z)$, we can divide the
parameter space into the sum of neighborhoods. Since $H(w)$ is an analytic function,
in an arbitrary neighborhood of $w_0$ that satisfies $H(w_0) = 0$, we can find a positive
definite quadratic form which is larger than $H(w)$. The positive definite quadratic
form satisfies $h = d/2$ and $m = 1$. By using Lemma 1 (1), we obtain the first half.
(2) Because Jeffreys' prior is coordinate free, we can study the problem on the
parameter space $U$ instead of $W^i$ in eq.(2). Hence, there exists an analytic function
$t(x, u)$ such that, in each local coordinate,
$$L(x, g(u)) = t(x, u)\, u_1^{s_1} \cdots u_d^{s_d}.$$
For simplicity, we assume that $s_i > 0$ $(i = 1, 2, \ldots, d)$. Then
$$\frac{\partial L}{\partial u_i} = \left( \frac{\partial t}{\partial u_i}\, u_i + s_i t \right) u_1^{s_1} \cdots u_i^{s_i - 1} \cdots u_d^{s_d}.$$
By using blowing-ups $u_i = v_1 v_2 \cdots v_i$ $(i = 1, 2, \ldots, d)$ and a notation
$\sigma_p = s_p + s_{p+1} + \cdots + s_d$, it is easy to show
$$\det I(v) \le \prod_{p=1}^{d} v_p^{2d\sigma_p + p - d - 2}, \qquad du = \Bigl( \prod_{p=1}^{d} |v_p|^{d-p} \Bigr) dv. \qquad (3)$$
By using $H(g(u))^z = \prod_p v_p^{2\sigma_p z}$ and Lemma 1 (1), in order to prove the latter half
of the theorem, it is sufficient to prove that
$$J(z) = \prod_{p=1}^{d} \int_{|v_p| < 1} |v_p|^{2\sigma_p z}\, |v_p|^{d\sigma_p - 1 + (d-p)/2}\, dv_p$$
has a pole $z = -d/2$ with the order $m = 1$. Direct calculation of integrals in $J(z)$
completes the theorem. (Q.E.D.)
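Two steps of this outline are easy to check numerically: the Jacobian of the blowing-up $u_i = v_1 v_2 \cdots v_i$, which should equal $\prod_p |v_p|^{d-p}$, and the pole of each one-variable integral. The sketch below assumes the integrand $|v_p|^{2\sigma_p z + d\sigma_p - 1 + (d-p)/2}$, whose pole lies at $z_p = -d/2 - (d-p)/(4\sigma_p)$. The helper names and the sample values of $v$ and $s$ are hypothetical.

```python
def blow_up(v):
    """u_i = v_1 * v_2 * ... * v_i."""
    u, prod = [], 1.0
    for x in v:
        prod *= x
        u.append(prod)
    return u

def jacobian(v, eps=1e-6):
    """Central-difference Jacobian J[i][p] = du_i / dv_p."""
    d = len(v)
    J = [[0.0] * d for _ in range(d)]
    for p in range(d):
        plus, minus = list(v), list(v)
        plus[p] += eps
        minus[p] -= eps
        up, um = blow_up(plus), blow_up(minus)
        for i in range(d):
            J[i][p] = (up[i] - um[i]) / (2.0 * eps)
    return J

def determinant(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, det = len(A), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det
        det *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= factor * A[c][k]
    return det

def pole_locations(s):
    """z_p = -d/2 - (d - p) / (4 sigma_p), with sigma_p = s_p + ... + s_d."""
    d = len(s)
    return [-d / 2.0 - (d - p) / (4.0 * sum(s[p - 1:])) for p in range(1, d + 1)]

v = [0.7, 1.3, 0.9, 1.1]          # arbitrary test point
d = len(v)
expected = 1.0
for p in range(1, d + 1):
    expected *= abs(v[p - 1]) ** (d - p)
print(determinant(jacobian(v)), expected)
print(pole_locations([1, 2, 1, 3]))  # hypothetical exponents s_1, ..., s_d
```

Only the factor with $p = d$ attains $z = -d/2$; every other factor has a strictly smaller pole because $\sigma_p > 0$, which is why the largest pole has order one.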
4 Three-Layer Perceptron 
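In this section, we study some cases when the learner is a three-layer perceptron
and the true distribution is either contained or not contained in the model. We
define the three-layer perceptron $p(x, y|w)$ with $M$ input units, $K$ hidden units,
and $N$ output units, where $x$ is an input, $y$ is an output, and $w$ is a parameter,
$$p(x, y|w) = \frac{r(x)}{(2\pi\sigma^2)^{N/2}} \exp\Bigl( -\frac{1}{2\sigma^2} \| y - f_K(x, w) \|^2 \Bigr), \qquad f_K(x, w) = \sum_{k=1}^{K} a_k \tanh(b_k \cdot x + c_k),$$
where $w = \{(a_k, b_k, c_k);\ a_k \in \mathbb{R}^N,\ b_k \in \mathbb{R}^M,\ c_k \in \mathbb{R}\}$, $r(x)$ is the probability
density on the input, and $\sigma^2$ is the variance of the output (neither $r(x)$ nor
$\sigma$ is estimated).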
Theorem 3 If the true distribution is represented by the three-layer perceptron with
$K_0 \le K$ hidden units, and if a positive prior is employed, then
$$h \le \frac{1}{2} \{ K_0 (M + N + 1) + (K - K_0) \min(M + 1, N) \}. \qquad (4)$$
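(Outline of Proof) Firstly, we consider the case when the true regression function
is $g(x) = 0$. Then,
$$H(w) = \frac{1}{2\sigma^2} \int \Bigl\| \sum_{k=1}^{K} a_k \tanh(b_k \cdot x + c_k) \Bigr\|^2 r(x)\, dx. \qquad (5)$$
Let $a_k = (a_{k1}, \ldots, a_{kN})$ and $b_k = (b_{k1}, \ldots, b_{kM})$. Let us consider a blowing-up,
$$a_{11} = \alpha, \quad a_{kj} = \alpha a'_{kj} \ ((k, j) \ne (1, 1)), \quad b_{kl} = b'_{kl}, \quad c_k = c'_k.$$
Then $da\, db\, dc = \alpha^{KN-1}\, d\alpha\, da'\, db'\, dc'$ and there exists an analytic function
$H_1(a', b', c')$ such that $H(a, b, c) = \alpha^2 H_1(a', b', c')$. Therefore $J(z)$ has a pole at
$z = -KN/2$. Also by using another blowing-up,
$$a_{kj} = a''_{kj}, \quad b_{11} = \alpha, \quad b_{kl} = \alpha b''_{kl} \ ((k, l) \ne (1, 1)), \quad c_k = \alpha c''_k,$$
then $da\, db\, dc = \alpha^{(M+1)K-1}\, d\alpha\, da''\, db''\, dc''$ and there exists an analytic
function $H_2(a'', b'', c'')$ such that $H(a, b, c) = \alpha^2 H_2(a'', b'', c'')$, which shows that
$J(z)$ has a pole at $z = -K(M + 1)/2$. By combining both results, we obtain
$h \le (K/2) \min(M + 1, N)$. Secondly, we prove the general case, $0 < K_0 \le K$.
Then,
$$H(w) \le \frac{1}{\sigma^2} \int \Bigl\| \sum_{k=1}^{K_0} a_k \tanh(b_k \cdot x + c_k) - g(x) \Bigr\|^2 r(x)\, dx + \frac{1}{\sigma^2} \int \Bigl\| \sum_{k=K_0+1}^{K} a_k \tanh(b_k \cdot x + c_k) \Bigr\|^2 r(x)\, dx. \qquad (6)$$
By combining Lemma 1 (2) and the above result, we obtain the theorem. (Q.E.D.)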
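To get a feeling for the size of the bound in Theorem 3, the following sketch evaluates $\frac{1}{2}\{K_0(M+N+1) + (K-K_0)\min(M+1, N)\}$ next to the regular value $d/2$ with $d = K(M+N+1)$. The function names and the network sizes are hypothetical choices for illustration.

```python
def parameter_dimension(M, N, K):
    """Number of parameters d = K (M + N + 1) of the three-layer perceptron."""
    return K * (M + N + 1)

def h_upper_bound(M, N, K, K0):
    """Upper bound of Theorem 3 on h for a true network with K0 <= K hidden units."""
    if not 0 <= K0 <= K:
        raise ValueError("K0 must satisfy 0 <= K0 <= K")
    return 0.5 * (K0 * (M + N + 1) + (K - K0) * min(M + 1, N))

M, N, K = 3, 1, 5  # hypothetical sizes: 3 inputs, 1 output, 5 hidden units
for K0 in range(K + 1):
    print(K0, h_upper_bound(M, N, K, K0), parameter_dimension(M, N, K) / 2.0)
```

The bound equals $d/2$ only when $K_0 = K$; each redundant hidden unit contributes $\min(M+1, N)/2$ instead of $(M+N+1)/2$.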
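If the true regression function $g(x)$ is not contained in the learning model, we assume
that, for each $0 \le k \le K$, there exists a parameter $w_0^{(k)} \in W$ that minimizes the
square error
$$E_k(w) = \int \| g(x) - f_k(x, w) \|^2 r(x)\, dx,$$
where $f_k(x, w)$ is the regression function of the perceptron with $k$ hidden units.
We use notations $E(k) \equiv E_k(w_0^{(k)})$ and
$h(k) = \frac{1}{2} \{ k(M + N + 1) + (K - k) \min(M + 1, N) \}$.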
Theorem 4 If the true regression function is not contained in the learning model
and a positive prior is applied, then
$$F(n) \le \min_{0 \le k \le K} \Bigl[ \frac{n}{\sigma^2} E(k) + h(k) \log n \Bigr] + O(1).$$
(Outline of Proof) This theorem can be shown by the same procedure as eq.(6) in
the preceding theorem. (Q.E.D.)
If $G(n)$ has an asymptotic expansion $G(n) = \sum_{q=1}^{Q} a_q f_q(n)$, where $f_q(n)$ is a
decreasing function of $n$ that satisfies $f_{q+1}(n) = o(f_q(n))$ and $f_Q(n) = 1/n$, then
$$G(n) \le \min_{0 \le k \le K} \Bigl[ \frac{E(k)}{\sigma^2} + \frac{h(k)}{n} \Bigr],$$
which shows that the generalization error of the layered network is smaller than
that of regular statistical models even when the true distribution is not contained
in the learning model. It should be emphasized that the optimal $k$ that minimizes
$G(n)$ is smaller than the size of the learning model when $n$ is not so large, and it
becomes larger as $n$ increases. This fact shows that the positive prior is useful for
generalization but not appropriate for model selection. Under the condition that
the true distribution is contained in the parametric model, Jeffreys' prior may
enable us to find the true model with higher probability.
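The way the effective model size grows with the sample size can be sketched by minimizing the bound $E(k)/\sigma^2 + h(k)/n$ over $k$, with $h(k) = \frac{1}{2}\{k(M+N+1) + (K-k)\min(M+1, N)\}$. The approximation errors in `E_hypothetical`, the network sizes, and the function names below are invented for illustration only and are not results from the paper.

```python
def h_of_k(k, M, N, K):
    """h(k) = (1/2) { k (M + N + 1) + (K - k) min(M + 1, N) }."""
    return 0.5 * (k * (M + N + 1) + (K - k) * min(M + 1, N))

def generalization_bound(k, n, E, sigma2, M, N):
    """E(k) / sigma^2 + h(k) / n, where E[k] is the square error with k hidden units."""
    K = len(E) - 1
    return E[k] / sigma2 + h_of_k(k, M, N, K) / n

def best_k(n, E, sigma2=1.0, M=3, N=1):
    """Index k in {0, ..., K} that minimizes the bound for sample size n."""
    K = len(E) - 1
    return min(range(K + 1), key=lambda k: generalization_bound(k, n, E, sigma2, M, N))

# Hypothetical square errors E(0), ..., E(4) for a learner with K = 4 hidden units.
E_hypothetical = [0.5, 0.1, 0.02, 0.004, 0.0]
for n in (10, 100, 1000):
    print(n, best_k(n, E_hypothetical))
```

With these made-up numbers the minimizing $k$ is 1, 2, and 4 for $n = 10$, 100, and 1000: small samples favor a coarse approximation with small $h(k)$, and the preferred $k$ increases as $n$ grows.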
Theorem 5 If the true regression function is contained in the three-layer perceptron
and Jeffreys' prior is applied, then $h = d/2$ and $m = 1$, even if the Fisher metric is
degenerate at the true parameter.
(Outline of Proof) For simplicity, we prove the theorem for the case $g(x) = 0$. The
general cases can be proven by the same method. By direct calculation of the Fisher
information matrix, there exists an analytic function $D(b, c) \ge 0$ such that
$$\det I(w) = \prod_{k=1}^{K} \Bigl( \sum_{p=1}^{N} a_{kp}^2 \Bigr)^{(M+1)} D(b, c).$$
By using a blowing-up
$$a_{11} = \alpha, \quad a_{kj} = \alpha a'_{kj}, \quad b_{kl} = b'_{kl}, \quad c_k = c'_k,$$
we obtain $H(w) = \alpha^2 H_1(a', b', c')$, the same as for eq.(5), $\det I(w) \propto \alpha^{2(M+1)K}$, and
$da\, db\, dc = \alpha^{NK-1}\, d\alpha\, da'\, db'\, dc'$. The integral
$$J(z) = \int \alpha^{2z}\, \alpha^{(M+1)K + NK - 1}\, d\alpha$$
has a pole at $z = -(M + N + 1)K/2$. By combining this result with Theorem 3,
we obtain Theorem 5. (Q.E.D.)
5 Discussion 
In many applications of neural networks, rather complex machines are employed
compared with the number of training samples. In such cases, the set of optimal
parameters is not one point but an analytic set with singularities, and the set
of almost optimal parameters $\{w;\ H(w) < \epsilon\}$ is not an ellipsoid. Hence the
Kullback distance cannot be approximated by any quadratic form, nor can the saddle
point approximation be used in integration on the parameter space. The zeta
function of the Kullback distance clarifies the behavior of the stochastic complexity,
and resolution of singularities enables us to calculate the learning efficiency.
6 Conclusion 
The relation between algebraic geometry and learning theory is clarified, and two 
different facts are proven. 
(1) If the true distribution is not contained in a hierarchical learning model, then
by using a positive prior, the generalization error is made smaller than that of
regular statistical models.
(2) If the true distribution is contained in the learning model and if Jeffreys' prior 
is used, then the average Bayesian factor has the same form as BIC. 
Acknowledgments 
This research was partially supported by the Ministry of Education, Science, Sports 
and Culture in Japan, Grant-in-Aid for Scientific Research 12680370. 
References 
[1] Akaike, H. (1980) Likelihood and Bayes procedure. Bayesian Statistics, (Bernardo, J. M.,
ed.) University Press, Valencia, Spain, 143-166.
[2] Amari, S. (1985) Differential-geometrical methods in Statistics. Lecture Notes in
Statistics, Springer.
[3] Atiyah, M. F. (1970) Resolution of singularities and division of distributions. Comm.
Pure and Appl. Math., 23, 145-150.
[4] Dacunha-Castelle, D., & Gassiat, E. (1997) Testing in locally conic models, and
application to mixture models. Probability and Statistics, 1, 285-317.
[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of
characteristic zero. Annals of Math., 79, 109-326.
[6] Kashiwara, M. (1976) B-functions and holonomic systems. Inventiones Math., 38, 33-53.
[7] Schwarz, G. (1978) Estimating the dimension of a model. Ann. of Stat., 6 (2), 461-464.
[8] Watanabe, S. (1998) On the generalization error by a layered statistical model with
Bayesian estimation. IEICE Transactions, J81-A (10), 1442-1452. English version:
(2000) Electronics and Communications in Japan, Part 3, 83 (6), 95-104.
[9] Watanabe, S. (2000) Algebraic analysis for non-regular learning machines. Advances
in Neural Information Processing Systems, 12, 356-362.
[10] Watanabe, S. (2001) Algebraic analysis for non-identifiable learning machines. Neural
Computation, to appear.
