Direct Classification with Indirect Data 
Timothy X Brown 
Interdisciplinary Telecommunications Program 
Dept. of Electrical and Computer Engineering 
University of Colorado, Boulder, 80309-0530 
timxb@colorado.edu
Abstract 
We classify an input space according to the outputs of a real-valued
function. The function itself is not given, only examples of the
function. We contribute a consistent classifier that avoids the
unnecessary complexity of estimating the function.
1 Introduction
In this paper, we consider a learning problem that combines elements of regression
and classification. Suppose there exists an unknown real-valued property of the
feature space, $p(\phi)$, that maps from the feature space, $\phi \in R^n$, to $R$. The property
function and a positive set $A \subset R$ define the desired classifier as follows:
$$C^*(\phi) = \begin{cases} +1 & \text{if } p(\phi) \in A \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
Though $p(\phi)$ is unknown, measurements, $\mu$, associated with $p(\phi)$ at different
features, $\phi$, are available in a data set $X = \{(\phi_i, \mu_i)\}$ of size $|X| = N$. Each sample
is i.i.d. with unknown distribution $f(\phi, \mu)$. This data is indirect in that $\mu$ may be
an input to a sufficient statistic for estimating $p(\phi)$ but in itself does not directly
indicate $C^*(\phi)$ in (1). Figure 1 gives a schematic of the problem.
Let $C_X(\phi)$ be a decision function mapping from $R^n$ to $\{-1, +1\}$ that is estimated
from the data $X$. The estimator $C_X(\phi)$ is consistent if
$$\lim_{|X| \to \infty} P\{C_X(\phi) \neq C^*(\phi)\} = 0, \qquad (2)$$
where the probabilities are taken over the distribution $f$.
This problem arises in controlling data networks that provide quality of service
guarantees such as a maximum packet loss rate [1]-[8]. A data network occasionally
drops packets due to congestion. The loss rate depends on the traffic carried by the
network (i.e. the network state). The network cannot measure the loss rate directly,
but can collect data on the observed number of packets sent and lost at different
network states. Thus, the feature space, $\phi$, is the network state; the property
function, $p(\phi)$, is the underlying loss rate; the measurements, $\mu$, are the observed
packet losses; the positive set, $A$, is the set of loss rates less than the maximum loss
rate; and the distribution, $f$, follows from the arrival and departure processes of the
traffic sources. In words, this application seeks a consistent estimator of when the
network can and cannot meet the packet loss rate guarantee based on observations
of the network losses. Over time, the network can automatically collect a large set
of observations so that consistency guarantees the classifier will be accurate.
Figure 1: The classification problem. The classifier indicates whether an unknown
function, $p(\phi)$, is within a set of interest, $A$. The learner is only given the data "x".
Previous authors have approached this problem. In [6, 7], the authors estimate the
property function from $X$ as $\hat{p}(\phi)$ and then classify via
$$C(\phi) = \begin{cases} +1 & \text{if } \hat{p}(\phi) \in A \\ -1 & \text{otherwise.} \end{cases} \qquad (3)$$
The approach suffers two related disadvantages. First, an accurate estimate of
the property function may require many more parameters than the corresponding
classifier in which only the decision boundary is important. Second, the regression
requires many samples over the entire range of $\phi$ to be accurate, while the fewer
parameters in the classifier may require fewer samples for the same accuracy.
A second approach, used in [4, 5, 8], makes a single-sample estimate, $\hat{p}(\phi_i)$, from
$\mu_i$ and estimates the desired output class as
$$o_i = \begin{cases} +1 & \text{if } \hat{p}(\phi_i) \in A \\ -1 & \text{otherwise.} \end{cases} \qquad (4)$$
This forms a training set $Y = \{(\phi_i, o_i)\}$ for standard classification. This was shown
in [1] to lead to an inconsistent estimator in the data network application.
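To see why labels of the form (4) can mislead a learner, the following minimal sketch computes how often a single-sample estimate falls below the threshold at one fixed feature. It is an illustration only, not the analysis of [1]: the function name prob_labeled_positive is hypothetical, the trial count and threshold are borrowed from Section 4, and the true loss rate of 2e-6 is an assumed value chosen to lie above the threshold.

```python
import math

def prob_labeled_positive(p, trials, tau):
    """Probability that the single-sample estimate s/trials lies below tau,
    where s ~ Binomial(trials, p); summed exactly from the binomial pmf."""
    total = 0.0
    s = 0
    while s <= trials and s / trials < tau:
        total += math.comb(trials, s) * p**s * (1.0 - p) ** (trials - s)
        s += 1
    return total

# Hypothetical single feature whose true loss rate is twice the threshold,
# so the correct class is -1.
share_positive = prob_labeled_positive(p=2e-6, trials=10**5, tau=1e-6)
print(f"share of samples labeled +1: {share_positive:.3f}")
```

Under these assumptions roughly four out of five samples observe zero losses and are labeled +1, so a learner that minimizes the error on these labels would favor +1 at this feature however many samples it collects.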
This paper builds on earlier results by the author specific to the packet network 
problem [1, 2, 3] and defines a general framework for mapping the indirect data 
into a standard supervised learning task. It defines conditions on the training set, 
classifier, and learning objective to yield consistency. The paper defines specific 
methods based on these results and provides examples of their application. 
2 Estimator at a Single Feature 
In this section, we consider a single feature vector $\phi$ and imagine that we can collect
as much monitoring data as we like at $\phi$. We show that a consistent estimator of the
property function, $p(\phi)$, yields a consistent estimator of the optimal classification,
$C^*(\phi)$, without directly estimating the property function. These results are a basis
for the next section where we develop a consistent classifier over the entire feature
space even if every $\phi_i$ in the data set is distinct.
Given the data set $X = \{(\phi_i, \mu_i)\}$, we hypothesize that there is a mapping from data
set to training set $Y = \{(\phi_i, w_i, o_i)\}$ such that $|X| = |Y|$ and
$$C_X(\phi) = \mathrm{sign}\left(\sum_{i=1}^{|X|} w_i o_i\right) \qquad (5)$$
is consistent in the sense of (2). The $w_i$ and $o_i$ are both functions of $\mu_i$, but for
simplicity we will not explicitly denote this.
Do any mappings from $X$ to $Y$ yield consistent estimators of the form (5)? We
consider only thresholds on $p(\phi)$. That is, sets $A$ in the form $A = [-\infty, \tau)$ (or
similarly $A = (\tau, \infty]$) for some threshold $\tau$. Since most practical sets can be formed
from finite unions, intersections, and complements of sets in this form, this is sufficient.
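Consider an estimator $\hat{p}_X$ that has the form
$$\hat{p}_X = \frac{\sum_{i=1}^{|X|} \beta(\mu_i)}{\sum_{i=1}^{|X|} \alpha(\mu_i)} \qquad (6)$$
for some functions $\alpha > 0$ and $\beta$. Suppose that $\hat{p}_X$ is a consistent estimator
of $p(\phi)$, i.e. for every $\epsilon > 0$:
$$\lim_{|X| \to \infty} P\{|\hat{p}_X - p(\phi)| > \epsilon\} = 0. \qquad (7)$$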
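For threshold sets such as $A = [-\infty, \tau)$, we can use (6) to construct the classifier:
$$C_X(\phi) = \mathrm{sign}(\tau - \hat{p}_X(\phi)) = \mathrm{sign}\left(\sum_{i=1}^{|X|} \left(\alpha(\mu_i)\tau - \beta(\mu_i)\right)\right) = \mathrm{sign}\left(\sum_{i=1}^{|X|} w_i o_i\right) \qquad (8)$$
where
$$w_i = |\alpha(\mu_i)\tau - \beta(\mu_i)| \qquad (9)$$
$$o_i = \mathrm{sign}(\alpha(\mu_i)\tau - \beta(\mu_i)) \qquad (10)$$
If $|\tau - p(\phi)| = \epsilon$ then the above estimator can be incorrect only if $|\hat{p}_X - p(\phi)| > \epsilon$.
The consistency in (7) therefore guarantees that (8)-(10) is consistent if $\epsilon > 0$, that is,
whenever $p(\phi)$ does not lie exactly at the threshold.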
The simplest example of (6) is when $\mu_i$ is a noisy unbiased sample of $p(\phi_i)$. The
natural estimator is just the average of all the $\mu_i$, i.e. $\alpha(\mu_i) = 1$ and $\beta(\mu_i) = \mu_i$. In
this case, $w_i = |\tau - \mu_i|$ and $o_i = \mathrm{sign}(\tau - \mu_i)$. A less trivial example will be given
later in the application section of the paper.
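The mapping from measurements to weighted training outputs can be sketched in a few lines. This is a minimal illustration of (5), (9), and (10) under stated assumptions: the function names training_pair and classify are hypothetical, the measurements are made-up numbers, and the tie-breaking rule is a choice of the sketch rather than something the paper specifies.

```python
def training_pair(alpha, beta, tau):
    """Map one measurement to (w_i, o_i) as in (9) and (10).

    alpha and beta stand for alpha(mu_i) and beta(mu_i)."""
    margin = alpha * tau - beta
    weight = abs(margin)
    output = (margin > 0) - (margin < 0)  # sign, with sign(0) = 0
    return weight, output

def classify(pairs):
    """Classifier (5): sign of the weighted sum of desired outputs.
    Ties are broken toward -1, which is a choice of this sketch."""
    total = sum(w * o for w, o in pairs)
    return 1 if total > 0 else -1

# Simplest case from the text: alpha(mu_i) = 1 and beta(mu_i) = mu_i.
samples = [0.8, 1.3, 0.9, 1.1, 0.7]  # made-up noisy measurements
pairs = [training_pair(1.0, mu, tau=1.0) for mu in samples]
label = classify(pairs)  # sample mean 0.96 is below tau = 1.0
```

Because each product of weight and output recovers the signed margin, the weighted vote agrees with thresholding the pooled average, even though two of the five individual samples vote the other way.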
We now describe a range of objective functions for evaluating a classifier $C(\phi; \theta)$
parameterized by $\theta$ and show a correspondence between the objective minimum and
(5). Consider the class of weighted L-norm objective functions ($L > 0$):
$$J(X, \theta) = \sum_{i=1}^{|X|} w_i |C(\phi_i; \theta) - o_i|^L \qquad (11)$$
Let the $\theta$ that minimizes this be denoted $\theta(X)$. Let
$$C_X(\phi) = C(\phi; \theta(X)) \qquad (12)$$
For a single $\phi$, $C(\phi; \theta)$ is a constant +1 or -1. We can simply try each value and
see which is the minimum to find $C_X(\phi)$. This is carried out in [3] where we show:
Theorem 1 When $C(\phi; \theta)$ is a constant over $X$ then the $C_X(\phi)$ defined by (11)
and (12) is equal to the $C_X(\phi)$ defined by (5).
The definition in (5) is independent of L. So, we can choose any L-norm as conve- 
nient without changing the solution. This follows since (11) is essentially a weighted 
count of the errors. The L-norm has no significant effect. 
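The independence from L can be checked numerically at a single feature. The sketch below is an illustration, not the proof in [3]: it evaluates the weighted objective for the two constant classifiers and compares the minimizer with the sign rule of (5). The function names and the example weights are hypothetical.

```python
def objective(c, weights, outputs, L):
    """Weighted objective for a classifier that is the constant c in {+1, -1}."""
    return sum(w * abs(c - o) ** L for w, o in zip(weights, outputs))

def best_constant(weights, outputs, L):
    """Constant output (+1 or -1) with the smaller objective value."""
    return min((+1, -1), key=lambda c: objective(c, weights, outputs, L))

def sign_rule(weights, outputs):
    """Sign of the weighted sum of desired outputs, as in (5)."""
    return 1 if sum(w * o for w, o in zip(weights, outputs)) > 0 else -1

weights = [0.2, 0.3, 0.1, 0.1, 0.3]   # made-up weights
outputs = [+1, -1, +1, -1, +1]        # made-up desired outputs
agree = all(
    best_constant(weights, outputs, L) == sign_rule(weights, outputs)
    for L in (0.5, 1, 2, 3)
)
```

Each term of the objective is either zero or the weight times 2 to the power L, so changing L rescales both candidate costs by the same factor and leaves the minimizer unchanged.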
This section has shown how regression estimators such as (6) can be mapped via 
(9) and (10) and the objective (11) to a consistent classifier at a single feature. The 
next section considers general classifiers. 
3 Classification over All Features 
This section addresses the question of whether there exists any general approach to
supervised learning that leads to a consistent estimator across the feature space.
Several considerations are important. First, not all feature vectors, $\phi$, are relevant.
Some $\phi$ may have zero probability associated with them from the distribution
$f(\phi, \mu)$. Such $\phi$ we denote as unsupported. The optimal and learned classifier can
differ on unsupported feature vectors without affecting consistency. Second, the
classifier function $C(\phi; \theta)$ may not be able to represent the consistent estimator.
For instance, a linear classifier may never yield a consistent estimator if the decision
boundary of the optimal classifier, $C^*(\phi)$, is non-linear. Classifier functions that
can represent the optimal classifier for all supported feature vectors we denote as
representative. Third, the optimal classifier is discontinuous at the decision boundary.
A classifier that considers any small region around a feature on the decision
boundary will have both positive and negative samples. In general, the resulting
classifier could be +1 or -1 without regard to the underlying optimal classifier at
these points and consistency cannot be guaranteed. These considerations are made
more precise in Appendix A. Taking these considerations into account and defining
$w_i$ and $o_i$ as in (9) and (10) we get the following theorem:
Theorem 2 If the classifier (5) is a consistent estimator for every supported
non-boundary $\phi$, and $C(\phi; \theta)$ is representative, then the $\theta(X)$ that minimizes (11) yields
a consistent classifier over all supported $\phi$ not on the decision boundary.
Theorem 2 tells us that we can get consistency across the feature space. This result 
is proved in Appendix A. 
4 Application 
This section provides an application of the results to better illustrate the method- 
ology. For brevity, we include only a simple stylized example (see [3] for a more 
realistic application). We describe first how the data is created, then the form of 
the consistent estimator, and then the actual application of the learning method. 
The feature space is one dimensional with $\phi$ uniformly distributed in (3, 9). The
underlying property function is $p(\phi) = 10^{-\phi}$. The measurement data is generated
as follows. For a given $\phi_i$, $s_i$ is the number of successes in $T_i = 10^5$ Bernoulli trials
with success probability $p(\phi_i)$. The monitoring data is thus $\mu_i = (s_i, T_i)$. The
positive set is $A = (0, \tau)$ with $\tau = 10^{-6}$, and $|X| = 1000$ samples.
As described in Section 1, this kind of data appears in packet networks where the 
underlying packet loss rate is unknown and the only monitoring data is the number 
of packets dropped out of Ti trials. The Bernoulli trial successes correspond to 
dropped packets. The feature vector represents data collected concurrently that 
indicates the network state. Thus the classifier can decide when the network will 
and will not meet a packet loss rate guarantee. 
0.001 +. , , , , , 
sample loss rate + 
[_ ' true loss rate ......... 
0.0001 t *+ threshold 
I :'..:.: ',,,, :+ 
- IIIIIII l,,,-Jllll 4--I-M-44- 4- 4- 
le-05 .- +'"':',',',',:',:',:',',',', ',', ',', - + 
 - sample-based .. 
 le-06 ................................. 7: ::c_ _o_ _n_ _s_i_ _s_t_e_ _n_ _t ................... 
 le-07 
le-08 '"-.. 
le-09 i III III I- 
3 4 5 6 7 8 9 
Feature 
Figure 2: Monitoring data, true property function, and learned classifiers in the
loss-rate classification application. The monitoring data is shown as sample loss
rate as a function of feature vector. Sample loss rates of zero are arbitrarily set to
a small positive value for display purposes. The true loss rate is the underlying
property function. The consistent and sample-based classifier results are shown as a
range of thresholds on the feature. An x and y error range is plotted as a box. The x
error range is the 10th and 90th percentile of 1000 experiments. This is mapped via the
underlying property function to a y error range. The consistent classifier finds
thresholds around the true value. The sample-based classifier is off by a factor of 7.
Figure 2 shows a sample of data. A consistent estimator in the form of (6) is:
$$\hat{p}_X = \frac{\sum_i s_i}{\sum_i T_i}, \qquad (13)$$
that is, $\alpha(\mu_i) = T_i$ and $\beta(\mu_i) = s_i$.
Defining $w_i$ and $o_i$ as in (9) and (10), the classifier for our data set is the threshold
on the feature space that minimizes (11). This classifier is representative since $p(\phi)$
is monotonic.
The results are shown in Figure 2 and labeled "consistent". This paper's methods
find a threshold on the feature that closely corresponds to the $\tau = 10^{-6}$ threshold.
As a comparison we also include a classifier that uses $w_i = 1$ for all $i$ and sets
$o_i$ from the single-sample estimate, $\hat{p}(\phi_i) = s_i/T_i$, as in (4). The results are labeled
"sample-based". This method misses the desired threshold by a factor of 7.
This application shows the features of the paper's methods. The classifier is a simple
threshold with one parameter. Estimating $p(\phi)$ to derive a classifier required tens
of parameters in [6, 7]. The results are consistent, unlike the approaches in [4, 5, 8].
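The stylized experiment can be approximated with the following sketch. It is not the code used to produce Figure 2. It assumes the settings stated in this section (features uniform on (3, 9), loss rate 10 to the power of minus the feature, 10^5 trials per sample, threshold 1e-6), a one-parameter classifier that outputs +1 when the feature exceeds theta, and L = 1 in the objective. All function names are hypothetical, and the binomial sampler based on geometric gaps is a choice of this sketch that avoids third-party libraries.

```python
import math
import random

TAU = 1e-6        # loss-rate threshold tau from this section
TRIALS = 10**5    # Bernoulli trials per sample, T_i

def draw_binomial(trials, p, rng):
    """Exact Binomial(trials, p) draw by summing geometric gaps between successes."""
    log_q = math.log1p(-p)
    successes, position = 0, 0
    while True:
        gap = int(math.log(1.0 - rng.random()) / log_q) + 1
        position += gap
        if position > trials:
            return successes
        successes += 1

def make_data(n_samples, rng):
    """Samples (phi_i, s_i, T_i) with phi uniform on (3, 9) and p(phi) = 10**-phi."""
    data = []
    for _ in range(n_samples):
        phi = rng.uniform(3.0, 9.0)
        data.append((phi, draw_binomial(TRIALS, 10.0 ** (-phi), rng), TRIALS))
    return data

def fit_threshold(features, weights, outputs):
    """Theta minimizing sum_i w_i |C(phi_i; theta) - o_i|,
    where C(phi; theta) = +1 if phi > theta else -1."""
    order = sorted(range(len(features)), key=features.__getitem__)
    # Start with theta below every feature, so every sample is classified +1.
    cost = sum(2.0 * weights[i] for i in order if outputs[i] < 0)
    best_cost, best_theta = cost, -math.inf
    for i in order:
        # Raising theta past feature i flips that sample's class to -1.
        cost += 2.0 * weights[i] * outputs[i]
        if cost < best_cost:
            best_cost, best_theta = cost, features[i]
    return best_theta

def run_experiment(n_samples=1000, seed=0):
    """Return (consistent threshold, sample-based threshold) on the feature."""
    rng = random.Random(seed)
    data = make_data(n_samples, rng)
    phis = [phi for phi, _, _ in data]
    margins = [t * TAU - s for _, s, t in data]          # alpha*tau - beta
    w_cons = [abs(m) for m in margins]                   # weights as in (9)
    o_cons = [1 if m > 0 else -1 for m in margins]       # outputs as in (10)
    w_samp = [1.0] * len(data)                           # unit weights
    o_samp = [1 if s / t < TAU else -1 for _, s, t in data]  # labels as in (4)
    return fit_threshold(phis, w_cons, o_cons), fit_threshold(phis, w_samp, o_samp)
```

Calling run_experiment() returns the two feature thresholds for one simulated data set. Under these assumptions the true boundary is at a feature value of 6, and the sample-based labels flip where the chance of observing zero losses passes one half, which occurs at a loss rate of roughly 7e-6; with only 1000 samples both estimates vary noticeably from run to run.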
5 Conclusion 
This paper has shown that using indirect data we can define a classifier that directly 
uses the data without any intermediate estimate of the underlying property function. 
The classifier is consistent and yields a simpler learning problem. The approach 
was demonstrated on a problem from telecommunications. Practical details such 
as choosing the form of the parametric classifier, $C(\phi; \theta)$, or how to find the global
minimum of the objective function (11) are outside the scope of this paper.
[Figure 3 diagram: "Two Dimensional Feature Space", with labeled regions including
"Both Classifiers Positive" and a decision boundary.]
Figure 3: A classifier $C(\phi; \theta)$ and the optimal classifier $C^*(\phi)$ create four
different sets in feature space: where they agree and are positive; where they agree
and are negative; where they disagree and $C^*(\phi) = +1$ (false negatives); and
where they disagree and $C^*(\phi) = -1$ (false positives).
A Appendix: Consistency of Supervised Learning 
This appendix proves that certain natural conditions on a supervised learner lead to a
consistent classifier (Theorem 2). First we need to formally define several concepts.
Since the feature space is real, it is a metric space with measure m. 
A feature vector $\phi$ is supported by the distribution $f$ if every neighborhood around
$\phi$ has positive probability.
A feature vector $\phi$ is on the decision boundary if in every neighborhood around $\phi$
there exist supported $\phi'$, $\phi''$ such that $C^*(\phi') \neq C^*(\phi'')$.
A classifier function, $C(\phi; \theta)$, is representative if there exists a $\theta^*$ such that
$C(\phi; \theta^*) = C^*(\phi)$ for all supported, non-boundary $\phi$.
Parameters $\theta$ and $\theta'$ are equivalent if for all supported, non-boundary $\phi$,
$C(\phi; \theta) = C(\phi; \theta')$.
Given a $\theta$, it is either equivalent to $\theta^*$ or there are supported, non-boundary $\phi$ where
$C(\phi; \theta)$ is not equal to the optimal classifier as in Figure 3. We will show that for
any $\theta$ not equivalent to $\theta^*$,
$$\lim_{|X| \to \infty} P\{J(X, \theta) \leq J(X, \theta^*)\} = 0 \qquad (14)$$
In other words, such a $\theta$ cannot be the minimum of the objective in (11) and so
only a $\theta$ equivalent to $\theta^*$ is a possible minimum.
To prove Theorem 2, we need to introduce a further condition. An estimator of the
form (5) has uniformly bounded variance if $\mathrm{Var}(w_i) < B$ for some fixed $B < \infty$ for
all $\phi$.
Let $E[w(\phi)o(\phi)] = e(\phi)$ be the expected weighted desired output for independent
samples at $\phi$ where the expectation is from $f(\mu \mid \phi)$. To start, we note that if (5) is
consistent, then:
$$\mathrm{sign}(e(\phi)) = C^*(\phi) \qquad (15)$$
for all non-boundary states. Looking at Figure 3, let us focus on the false negative
set minus the optimal decision boundary; call this set $\Phi$. From (15), $e(\phi)$ is positive for
every $\phi \in \Phi$. Let $x$ be the probability measure of $\Phi$. Define the set
$$\Phi_\epsilon = \{\phi \mid \phi \in \Phi \text{ and } e(\phi) \geq \epsilon\}.$$
Let $x_\epsilon$ be the probability measure of $\Phi_\epsilon$. Choose $\epsilon > 0$ so that $x_\epsilon > 0$.
The proof is straightforward from here and we omit some details. With $\theta$, $C(\phi; \theta) =
-1$ for all $\phi \in \Phi$. With $\theta^*$, $C(\phi; \theta^*) = +1$ for all $\phi \in \Phi$. Since the minimum of a
constant objective function satisfies (5), we would incorrectly choose $\theta$ if
$$\lim_{|X| \to \infty} \sum_{i=1}^{|X|} w_i o_i < 0.$$
For the false negatives the expected number of examples in $\Phi$ and $\Phi_\epsilon$ is $x|X|$ and
$x_\epsilon|X|$. By the definition of $\Phi_\epsilon$ and the bounded variance of the weight, we get that
$$E\left[\sum_{i=1}^{|X|} w_i o_i\right] \geq \epsilon x_\epsilon |X| \qquad (16)$$
$$\mathrm{Var}\left[\sum_{i=1}^{|X|} w_i o_i\right] < B x |X|. \qquad (17)$$
Since the expected value grows linearly with the sample size and the standard
deviation with the square root of the sample size, as $|X| \to \infty$ the weighted sum will
with probability one be positive. Thus, as the sample size grows, +1 will minimize
the objective function for the set of false negative samples and the decision boundary
from $\theta^*$ will minimize the objective.
The same argument applied to the false positives shows that $\theta^*$ will minimize the
objective on the false positives with probability one. Thus $\theta^*$ will be chosen with
probability one and the theorem is shown.
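One way to fill in the omitted step for the false negatives is Chebyshev's inequality; this is an assumed route, since the appendix does not say which argument is intended. Write $S$ for the weighted sum of $w_i o_i$ over the false negative samples, $\epsilon$ for the lower bound on the expected weighted output, and $x$, $x_\epsilon$ for the two probability measures appearing in (16) and (17).

```latex
% Assumes the bounds (16)-(17): E[S] >= eps * x_eps * |X| > 0 and Var[S] < B * x * |X|.
\begin{align*}
P\{S \le 0\}
  &\le P\{|S - E[S]| \ge E[S]\} \\
  &\le \frac{\mathrm{Var}[S]}{E[S]^2} \\
  &< \frac{B\,x\,|X|}{\epsilon^2 x_\epsilon^2 |X|^2}
   = \frac{B\,x}{\epsilon^2 x_\epsilon^2\,|X|}
   \;\longrightarrow\; 0 \quad \text{as } |X| \to \infty .
\end{align*}
```

This bound gives convergence in probability, which is the form of the limit required in (14).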
Acknowledgments 
This work was supported by NSF CAREER Award NCR-9624791. 
References 
[1] Brown, T. X (1995). Classifying loss rates with small samples, Proc. Inter. Workshop
on Appl. of NN to Telecom (pp. 153-161). Hillsdale, NJ: Erlbaum.
[2] Brown, T. X (1997). Adaptive access control applied to ethernet data, Advances
in Neural Information Processing Systems, 9 (pp. 932-938). MIT Press.
[3] Brown, T. X (1999). Classifying loss rates in broadband networks, INFOCOM
'99 (v. 1, pp. 361-370). Piscataway, NJ: IEEE.
[4] Estrella, A.D., et al. (1994). New training pattern selection method for ATM
call admission neural control, Elec. Let., v. 30, n. 7, pp. 577-579.
[5] Hiramatsu, A. (1990). ATM communications network control by neural networks,
IEEE T. on Neural Networks, v. 1, n. 1, pp. 122-130.
[6] Hiramatsu, A. (1995). Training techniques for neural network applications in
ATM, IEEE Comm. Mag., October, pp. 58-67.
[7] Tong, H., Brown, T. X (1998). Estimating Loss Rates in an Integrated Services
Network by Neural Networks, Proc. of Global Telecommunications Conference
(GLOBECOM 98) (v. 1, pp. 19-24). Piscataway, NJ: IEEE.
[8] Tran-Gia, P., Gropp, O. (1992). Performance of a neural net used as admission
controller in ATM systems, Proc. GLOBECOM 92 (pp. 1303-1309). Piscataway,
NJ: IEEE.
