Bayes Classifier and Naive Bayes
Idea: Estimate $\hat{P}(y|\vec{x})$ from the data, then use the Bayes Classifier on $\hat{P}(y|\vec{x})$.
So how can we estimate $\hat{P}(y|\vec{x})$?
One way to do this would be to use the MLE method. Assuming that y is discrete,
$$\hat{P}(y|\vec{x}) = \frac{\sum_{i=1}^{n} I(\vec{x}_i = \vec{x} \wedge y_i = y)}{\sum_{i=1}^{n} I(\vec{x}_i = \vec{x})}$$
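From the above diagram, it is clear that, using the MLE method, we can estimate $\hat{P}(y|\vec{x})$ as
$$\hat{P}(y|\vec{x}) = \frac{|C|}{|B|}$$
where $B$ is the set of training points with $\vec{x}_i = \vec{x}$ and $C \subseteq B$ is the subset of those that also have $y_i = y$.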
But there is a big problem with this method.
Problem: The MLE estimate is only good if there are many training vectors with exactly the same features as $\vec{x}$!
In high dimensional spaces (or with continuous $\vec{x}$), this almost never happens! So $|B| \to 0$ and $|C| \to 0$.
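To make the problem concrete, here is a minimal sketch of the counting estimate on synthetic binary data. The helper names (`mle_conditional`, `fraction_undefined`) and the random data are assumptions made for illustration only. It measures how often a query vector has no exact match in the training set as the dimension grows.

```python
import random

def mle_conditional(X, y, x_query, label):
    """Counting estimate of P(y = label | x = x_query); None if |B| = 0."""
    # B: labels of the training points whose features match x_query exactly
    B = [yi for xi, yi in zip(X, y) if xi == x_query]
    if not B:
        return None  # no matching training vector, so the estimate is undefined
    # C: the part of B that also carries the requested label
    C = [yi for yi in B if yi == label]
    return len(C) / len(B)

def fraction_undefined(d, n=1000, n_queries=200, seed=0):
    """Fraction of random d-bit queries with no exact match among n samples."""
    rng = random.Random(seed)
    X = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(n)]
    y = [rng.choice([-1, +1]) for _ in range(n)]
    queries = [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(n_queries)]
    missing = sum(mle_conditional(X, y, q, +1) is None for q in queries)
    return missing / n_queries

# Synthetic data only: compare a low, a medium and a high dimension.
for d in (2, 10, 30):
    print(d, fraction_undefined(d))
```

With 1000 training points, two binary features leave every query covered, while 30 binary features give roughly $10^9$ possible vectors, so almost no query finds a match.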
To get around this issue, we can make a 'naive' assumption.
Naive Bayes
We can approach this dilemma with a simple trick, and an additional assumption. The trick part is to estimate $P(y)$ and $P(\vec{x}|y)$ instead, since, by Bayes rule,
$$P(y|\vec{x}) = \frac{P(\vec{x}|y)P(y)}{P(\vec{x})}.$$
Recall from "Estimating Probabilities from Data" that estimating $P(y)$ and $P(\vec{x}|y)$ is called generative learning.
Estimating P(y) is easy. If Y takes on discrete binary values, for example, this just becomes coin tossing. We simply need to count how many times we observe each outcome (in this case each class):
$$P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat{\pi}_c$$
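As a small illustration, the class prior estimate is just a normalized label count. The function name `estimate_class_priors` and the toy labels are hypothetical.

```python
from collections import Counter

def estimate_class_priors(labels):
    """Return {c: pi_hat_c}, the fraction of training labels equal to c."""
    n = len(labels)
    counts = Counter(labels)
    return {c: count / n for c, count in counts.items()}

# Toy labels, invented for this example.
priors = estimate_class_priors(["spam", "ham", "ham", "spam", "ham"])
print(priors)
```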
Estimating $P(\vec{x}|y)$, however, is not easy!
The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes Assumption:
$$P(\vec{x}|y) = \prod_{\alpha=1}^{d} P(x_\alpha|y), \quad \text{where } x_\alpha = [\vec{x}]_\alpha \text{ is the value for feature } \alpha$$
i.e., feature values are independent given the label! This is a very bold assumption.
For example, a setting where the Naive Bayes classifier is often used is spam filtering. Here, the data is emails and the label is spam or not-spam. The Naive Bayes assumption implies that the words in an email are conditionally independent, given that you know whether the email is spam or not. Clearly this is not true. Neither the words of spam emails nor those of not-spam emails are drawn independently at random. However, the resulting classifiers can work well in practice even if this assumption is violated.
Illustration of the idea behind the Naive Bayes algorithm. We estimate $P(x_\alpha|y)$ independently in each dimension (middle two images) and then obtain an estimate of the full data distribution by assuming conditional independence $P(\vec{x}|y) = \prod_\alpha P(x_\alpha|y)$ (rightmost image).
So, for now, let's pretend the Naive Bayes assumption holds.
Then the Bayes Classifier can be defined as
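$$\begin{align}
h(\vec{x}) &= \operatorname*{argmax}_y P(y|\vec{x}) \\
&= \operatorname*{argmax}_y \frac{P(\vec{x}|y)P(y)}{P(\vec{x})} \\
&= \operatorname*{argmax}_y P(\vec{x}|y)P(y) && \text{($P(\vec{x})$ does not depend on $y$)} \\
&= \operatorname*{argmax}_y \prod_{\alpha=1}^{d} P(x_\alpha|y)P(y) && \text{(by the naive assumption)} \\
&= \operatorname*{argmax}_y \sum_{\alpha=1}^{d} \log(P(x_\alpha|y)) + \log(P(y)) && \text{(as log is a monotonic function)}
\end{align}$$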
Estimating $\log(P(x_\alpha|y))$ is easy as we only need to consider one dimension. And estimating $P(y)$ is not affected by the assumption.
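Besides being convenient for the derivation, the log form also matters numerically. The sketch below uses made-up per-feature likelihoods (the values 0.01 and 0.02 and the dimension 2000 are assumptions) to show that the direct product underflows to zero in floating point while the sum of logs still ranks the classes.

```python
import math

def product_score(prior, likelihoods):
    """P(y) * prod_alpha P(x_alpha | y), computed directly."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

def log_score(prior, likelihoods):
    """log P(y) + sum_alpha log P(x_alpha | y)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical per-feature likelihoods for two classes over d = 2000 features.
d = 2000
likelihoods_a = [0.01] * d
likelihoods_b = [0.02] * d

# Both direct products underflow, so the argmax cannot tell the classes apart.
print(product_score(0.5, likelihoods_a), product_score(0.5, likelihoods_b))
# The log scores stay finite and still prefer class b.
print(log_score(0.5, likelihoods_a) < log_score(0.5, likelihoods_b))
```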
Estimating $P([\vec{x}]_\alpha|y)$
Now we know how we can use our assumption to make the estimation of $P(y|\vec{x})$ tractable.
There are 3 notable cases in which we can use our naive Bayes classifier.
Case #1: Categorical features
Illustration of categorical NB. For $d$ dimensional data, there exist $d$ independent dice for each class. Each feature has one die per class. We assume training samples were generated by rolling one die after another. The value in dimension $i$ corresponds to the outcome that was rolled with the $i$th die.
Features:
$$[\vec{x}]_\alpha \in \{f_1, f_2, \cdots, f_{K_\alpha}\}$$
Each feature $\alpha$ falls into one of $K_\alpha$ categories.
(Note that the case with binary features is just a specific case of this, where $K_\alpha = 2$.) An example of such a setting may be medical data where one feature could be gender (male / female) or marital status (single / married / widowed).
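Model $P(x_\alpha \mid y)$:
$$P(x_\alpha = j \mid y = c) = [\theta_{jc}]_\alpha \quad \text{and} \quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_\alpha = 1$$
where $[\theta_{jc}]_\alpha$ is the probability of feature $\alpha$ having the value $j$, given that the label is $c$.
And the constraint indicates that $x_\alpha$ must have one of the categories $\{1, \dots, K_\alpha\}$.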
Parameter estimation:
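$$[\hat{\theta}_{jc}]_\alpha = \frac{\sum_{i=1}^{n} I(y_i = c) I(x_{i\alpha} = j) + l}{\sum_{i=1}^{n} I(y_i = c) + l K_\alpha},$$
where $x_{i\alpha} = [\vec{x}_i]_\alpha$ and $l$ is a smoothing parameter. By setting $l = 0$ we get an MLE estimator, $l > 0$ leads to MAP. If we set $l = +1$ we get Laplace smoothing.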
In words (without the $l$ hallucinated samples) this means
$$\frac{\#\text{ of samples with label } c \text{ that have feature } \alpha \text{ with value } j}{\#\text{ of samples with label } c}.$$
Essentially, the categorical feature model associates a special coin with each feature and label. The generative model that we are assuming is that the data was generated by first choosing the label (e.g. "healthy person"). That label comes with a set of $d$ "coins", one for each dimension. The generator picks each coin, tosses it, and fills in the feature value with the outcome of the coin toss. So if there are $C$ possible labels and $d$ dimensions, we are estimating $d \cdot C$ "coins" from the data, but per example only $d$ of them are tossed. Coin $\alpha$ (for any label) has $K_\alpha$ possible "sides". Of course this is not how the data is generated in reality, but it is a modeling assumption that we make. We then learn these models from the data and at test time see which model is more likely given the sample.
Prediction:
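$$\operatorname*{argmax}_c P(y = c \mid \vec{x}) = \operatorname*{argmax}_c \; \hat{\pi}_c \prod_{\alpha=1}^{d} [\hat{\theta}_{jc}]_\alpha, \quad \text{where } j = x_\alpha$$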
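The estimation and prediction rules for categorical features can be sketched in plain Python as follows. This is one possible implementation, not a reference one. The function names, the encoding of categories as integers $0, \dots, K_\alpha - 1$, and the toy "healthy"/"sick" data are all assumptions made for this example.

```python
import math
from collections import Counter

def fit_categorical_nb(X, y, n_categories, l=1.0):
    """Estimate pi_hat_c and [theta_hat_jc]_alpha with smoothing parameter l.

    X: list of feature tuples, feature alpha taking values 0..K_alpha-1
    n_categories: list with K_alpha for each feature alpha
    """
    n, d = len(y), len(n_categories)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    # counts[c][alpha][j] = number of samples with label c and x_alpha = j
    counts = {c: [Counter() for _ in range(d)] for c in class_counts}
    for xi, yi in zip(X, y):
        for alpha, j in enumerate(xi):
            counts[yi][alpha][j] += 1
    theta = {
        c: [
            [(counts[c][alpha][j] + l) / (class_counts[c] + l * n_categories[alpha])
             for j in range(n_categories[alpha])]
            for alpha in range(d)
        ]
        for c in class_counts
    }
    return priors, theta

def predict_categorical_nb(priors, theta, x):
    """Return the label maximizing log pi_c + sum_alpha log theta[c][alpha][x_alpha]."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(theta[c][alpha][j]) for alpha, j in enumerate(x))
    return max(priors, key=log_score)

# Toy data (invented): feature 0 has 2 categories, feature 1 has 3.
X = [(0, 0), (0, 1), (1, 0), (1, 2), (1, 2), (0, 2)]
y = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
priors, theta = fit_categorical_nb(X, y, n_categories=[2, 3], l=1.0)
print(predict_categorical_nb(priors, theta, (1, 2)))
print(predict_categorical_nb(priors, theta, (0, 0)))
```

With $l = 0$ a category never seen for some class gets probability zero and `math.log` raises an error, which is one practical reason to keep $l > 0$.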
Case #2: Multinomial features
Illustration of multinomial NB. There are only as many dice as classes. Each die has $d$ sides. The value of the $i$th feature shows how many times this particular side was rolled.
If feature values don't represent categories (e.g. male/female) but counts, we need to use a different model. E.g. in text document categorization, feature value $x_\alpha = j$ means that in this particular document $\vec{x}$ the $\alpha$th word in my dictionary appears $j$ times. Let us consider the example of spam filtering. Imagine the $\alpha$th word is indicative of "spam". Then $x_\alpha = 10$ means that this email is likely spam (as word $\alpha$ appears 10 times in it). And another email with $x'_\alpha = 20$ should be even more likely to be spam (as the spammy word appears twice as often). With categorical features this is not guaranteed. It could be that the training set does not contain any email that contains word $\alpha$ exactly 20 times. In this case you would simply get the hallucinated smoothing values for both spam and not-spam, and the signal is lost. We need a model that incorporates our knowledge that features are counts. This will help us during estimation (you don't have to see a training email with exactly the same number of word occurrences) and during inference/testing (as you will obtain the monotonicities that one might expect). The multinomial distribution does exactly that.
Features:
$$x_\alpha \in \{0, 1, 2, \dots, m\} \quad \text{and} \quad m = \sum_{\alpha=1}^{d} x_\alpha$$
Each feature $\alpha$ represents a count and $m$ is the length of the sequence.
An example of this could be the count of a specific word $\alpha$ in a document of length $m$, where $d$ is the size of the vocabulary.
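Model $P(\vec{x} \mid y)$:
Use the multinomial distribution
$$P(\vec{x} \mid m, y = c) = \frac{m!}{x_1! \cdot x_2! \cdot \dots \cdot x_d!} \prod_{\alpha=1}^{d} (\theta_{\alpha c})^{x_\alpha}$$
where $\theta_{\alpha c}$ is the probability of selecting word $\alpha$ (and thereby incrementing $x_\alpha$) given class $c$, and $\sum_{\alpha=1}^{d} \theta_{\alpha c} = 1$.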
So, we can use this to generate a spam email, i.e., a document →x of class y=spam by picking m words independently at random from the vocabulary of d words using P(→x∣y=spam).
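This generative story can be sketched directly: draw $m$ word indices independently from $\theta_{\cdot c}$ and count them. The function name `sample_document` and the three-word probability vector are hypothetical.

```python
import random
from collections import Counter

def sample_document(theta_c, m, rng):
    """Draw m words i.i.d. from theta_c and return the count vector x."""
    d = len(theta_c)
    words = rng.choices(range(d), weights=theta_c, k=m)
    counts = Counter(words)
    return [counts[alpha] for alpha in range(d)]

rng = random.Random(0)
theta_spam = [0.6, 0.1, 0.3]  # hypothetical word probabilities for class spam
x = sample_document(theta_spam, m=50, rng=rng)
print(x, sum(x))
```

The resulting vector always sums to $m$, matching the constraint $m = \sum_\alpha x_\alpha$ on the features.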
Parameter estimation:
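$$\hat{\theta}_{\alpha c} = \frac{\sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c) \sum_{\beta=1}^{d} x_{i\beta} + l \cdot d}$$
where the numerator sums up all counts for feature $x_\alpha$ over the data points with label $c$, and the denominator sums up all counts of all features across all data points with label $c$. E.g.,
$$\frac{\#\text{ of times word } \alpha \text{ appears in all spam emails}}{\#\text{ of words in all spam emails combined}}.$$
Again, $l$ is the smoothing parameter.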
Prediction:
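$$\operatorname*{argmax}_c P(y = c \mid \vec{x}) = \operatorname*{argmax}_c \; \hat{\pi}_c \prod_{\alpha=1}^{d} \hat{\theta}_{\alpha c}^{\,x_\alpha}$$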
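A minimal sketch of multinomial estimation and prediction follows, again in plain Python. The function names and the three-word toy corpus are assumptions. The multinomial coefficient is dropped because it does not depend on the class. The last lines check the monotonicity discussed earlier: doubling the count of a spam-indicative word increases the log-score margin in favor of spam.

```python
import math

def fit_multinomial_nb(X, y, l=1.0):
    """X: list of count vectors of length d. Returns (priors, theta)."""
    n, d = len(y), len(X[0])
    classes = sorted(set(y))
    priors = {c: sum(yi == c for yi in y) / n for c in classes}
    theta = {}
    for c in classes:
        # total count of each word over all documents with label c
        word_totals = [0] * d
        for xi, yi in zip(X, y):
            if yi == c:
                for alpha, count in enumerate(xi):
                    word_totals[alpha] += count
        denom = sum(word_totals) + l * d
        theta[c] = [(total + l) / denom for total in word_totals]
    return priors, theta

def log_scores(priors, theta, x):
    """log pi_c + sum_alpha x_alpha * log theta_{alpha c} for every class c."""
    return {
        c: math.log(priors[c]) + sum(
            x_a * math.log(t) for x_a, t in zip(x, theta[c]))
        for c in priors
    }

def predict_multinomial_nb(priors, theta, x):
    scores = log_scores(priors, theta, x)
    return max(scores, key=scores.get)

# Toy corpus (invented); columns are counts of three vocabulary words.
X = [(3, 0, 2), (2, 0, 3), (0, 3, 0), (0, 2, 1)]
y = ["spam", "spam", "ham", "ham"]
priors, theta = fit_multinomial_nb(X, y, l=1.0)

s10 = log_scores(priors, theta, (10, 0, 0))
s20 = log_scores(priors, theta, (20, 0, 0))
margin10 = s10["spam"] - s10["ham"]
margin20 = s20["spam"] - s20["ham"]
print(predict_multinomial_nb(priors, theta, (10, 0, 0)), margin10 < margin20)
```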
Case #3: Continuous features (Gaussian Naive Bayes)
Illustration of Gaussian NB. Each class conditional feature distribution $P(x_\alpha|y)$ is assumed to originate from an independent Gaussian distribution with its own mean $\mu_{\alpha,y}$ and variance $\sigma^2_{\alpha,y}$.
Features:
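$$x_\alpha \in \mathbb{R} \quad \text{(each feature takes on a real value)}$$
Model $P(x_\alpha \mid y)$: Use Gaussian distribution
$$P(x_\alpha \mid y = c) = \mathcal{N}(\mu_{\alpha c}, \sigma^2_{\alpha c}) = \frac{1}{\sqrt{2\pi}\,\sigma_{\alpha c}} e^{-\frac{1}{2}\left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2}$$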
Note that the model specified above is based on our assumption about the data: that each feature $\alpha$ comes from a class-conditional Gaussian distribution. The full distribution is $P(\vec{x}|y) \sim \mathcal{N}(\vec{\mu}_y, \Sigma_y)$, where $\Sigma_y$ is a diagonal covariance matrix with $[\Sigma_y]_{\alpha,\alpha} = \sigma^2_{\alpha,y}$.
Parameter estimation:
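As always, we estimate the parameters of the distributions for each dimension and class independently. Gaussian distributions only have two parameters, the mean and variance. The mean $\mu_{\alpha,y}$ is estimated by the average feature value of dimension $\alpha$ from all samples with label $y$. The (squared) standard deviation is simply the variance of those feature values around this mean.
$$\begin{align}
\mu_{\alpha c} &\leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} && \text{where } n_c = \sum_{i=1}^{n} I(y_i = c) \\
\sigma^2_{\alpha c} &\leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)(x_{i\alpha} - \mu_{\alpha c})^2
\end{align}$$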
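These two update rules translate almost line by line into code. The following is a minimal sketch with hypothetical function names and invented two-dimensional data. It assumes every class has a nonzero variance in every dimension, since a zero variance would cause a division by zero in the log-density.

```python
import math

def fit_gaussian_nb(X, y):
    """Per-class prior, per-feature mean and (biased, 1/n_c) variance."""
    n, d = len(y), len(X[0])
    params = {}
    for c in sorted(set(y)):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        n_c = len(rows)
        mu = [sum(r[a] for r in rows) / n_c for a in range(d)]
        var = [sum((r[a] - mu[a]) ** 2 for r in rows) / n_c for a in range(d)]
        params[c] = {"prior": n_c / n, "mu": mu, "var": var}
    return params

def log_gaussian(x, mu, var):
    """Log density of N(mu, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def predict_gaussian_nb(params, x):
    def log_score(c):
        p = params[c]
        return math.log(p["prior"]) + sum(
            log_gaussian(xa, m, v) for xa, m, v in zip(x, p["mu"], p["var"]))
    return max(params, key=log_score)

# Toy data (invented): two well separated classes in two dimensions.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (7.0, 8.0), (8.0, 10.0), (9.0, 9.0)]
y = [-1, -1, -1, +1, +1, +1]
params = fit_gaussian_nb(X, y)
print(params[-1]["mu"], params[+1]["mu"])
print(predict_gaussian_nb(params, (2.5, 3.5)), predict_gaussian_nb(params, (8.0, 8.0)))
```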
Naive Bayes is a linear classifier

Naive Bayes leads to a linear decision boundary in many common cases. Illustrated here is the case where $P(x_\alpha|y)$ is Gaussian and where $\sigma_{\alpha,c}$ is identical for all $c$ (but can differ across dimensions $\alpha$). The boundaries of the ellipsoids indicate regions of equal probabilities $P(\vec{x}|y)$. The red decision line indicates the decision boundary where $P(y=1|\vec{x}) = P(y=2|\vec{x})$.
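1. Suppose that $y_i \in \{-1, +1\}$ and features are multinomial
We can show that
$$h(\vec{x}) = \operatorname*{argmax}_y P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y) = \operatorname{sign}(\vec{w}^\top \vec{x} + b)$$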
That is,
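$$\vec{w}^\top \vec{x} + b > 0 \iff h(\vec{x}) = +1.$$
As before, we define $P(x_\alpha|y=+1) \propto \theta_{\alpha+}^{x_\alpha}$ and $P(Y=+1) = \pi_+$:
$$\begin{align}
[\vec{w}]_\alpha &= \log(\theta_{\alpha+}) - \log(\theta_{\alpha-}) \\
b &= \log(\pi_+) - \log(\pi_-)
\end{align}$$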
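If we use the above to do classification, we can compute $\vec{w}^\top \vec{x} + b$.
Simplifying this further leads to
$$\begin{align}
\vec{w}^\top \vec{x} + b > 0 &\iff \sum_{\alpha=1}^{d} [\vec{x}]_\alpha \left(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\right) + \log(\pi_+) - \log(\pi_-) > 0 \\
&\iff \frac{\prod_{\alpha=1}^{d} P([\vec{x}]_\alpha \mid Y=+1)\,\pi_+}{\prod_{\alpha=1}^{d} P([\vec{x}]_\alpha \mid Y=-1)\,\pi_-} > 1 \\
&\iff P(Y=+1 \mid \vec{x}) > P(Y=-1 \mid \vec{x}) && \text{(by our naive Bayes assumption)} \\
&\iff h(\vec{x}) = +1 && \text{(by definition of $h(\vec{x})$)}
\end{align}$$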
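The equivalence can also be checked numerically. The sketch below compares the Naive Bayes decision with $\operatorname{sign}(\vec{w}^\top \vec{x} + b)$ on a few count vectors. The parameter values and function names are invented for the check, and ties are arbitrarily mapped to $-1$ in both functions.

```python
import math

def nb_predict(pi, theta, x):
    """Pick the label in {+1, -1} with the larger log pi_y + sum x_a log theta_{a,y}."""
    def log_score(label):
        return math.log(pi[label]) + sum(
            xa * math.log(t) for xa, t in zip(x, theta[label]))
    return +1 if log_score(+1) > log_score(-1) else -1

def linear_predict(pi, theta, x):
    """sign(w^T x + b) with w and b defined from theta and pi as in the text."""
    w = [math.log(tp) - math.log(tm) for tp, tm in zip(theta[+1], theta[-1])]
    b = math.log(pi[+1]) - math.log(pi[-1])
    return +1 if sum(wa * xa for wa, xa in zip(w, x)) + b > 0 else -1

# Hypothetical parameters for a three-word vocabulary.
pi = {+1: 0.4, -1: 0.6}
theta = {+1: [0.5, 0.1, 0.4], -1: [0.2, 0.6, 0.2]}
documents = [(3, 0, 1), (0, 4, 1), (1, 1, 1), (0, 0, 0), (2, 5, 0)]

agree = all(nb_predict(pi, theta, x) == linear_predict(pi, theta, x)
            for x in documents)
print(agree)
```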
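2. In the case of continuous features (Gaussian Naive Bayes with $\sigma_{\alpha c}$ identical across classes $c$), we can show that
$$P(y \mid \vec{x}) = \frac{1}{1 + e^{-y(\vec{w}^\top \vec{x} + b)}}$$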
This model is also known as logistic regression. NB and LR produce asymptotically the same model if the Naive Bayes assumption holds.
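A numerical check of the logistic form, assuming $y \in \{-1, +1\}$ and per-feature variances shared across the two classes. The expressions for $\vec{w}$ and $b$ used here are not given in the text; they are worked out for this sketch by expanding the Gaussian log-odds, giving $[\vec{w}]_\alpha = (\mu_{\alpha+} - \mu_{\alpha-})/\sigma_\alpha^2$ and $b = \log(\pi_+/\pi_-) + \sum_\alpha (\mu_{\alpha-}^2 - \mu_{\alpha+}^2)/(2\sigma_\alpha^2)$. All parameter values are invented.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior_bayes(y, x, pi, mu, var):
    """P(y | x) via Bayes rule with per-feature Gaussians and shared variances."""
    def joint(label):
        p = pi[label]
        for xa, m, v in zip(x, mu[label], var):
            p *= gaussian_pdf(xa, m, v)
        return p
    return joint(y) / (joint(+1) + joint(-1))

def posterior_logistic(y, x, pi, mu, var):
    """1 / (1 + exp(-y (w^T x + b))) with w, b derived under shared variances."""
    w = [(mp - mm) / v for mp, mm, v in zip(mu[+1], mu[-1], var)]
    b = math.log(pi[+1] / pi[-1]) + sum(
        (mm ** 2 - mp ** 2) / (2 * v) for mp, mm, v in zip(mu[+1], mu[-1], var))
    z = sum(wa * xa for wa, xa in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-y * z))

# Hypothetical parameters: two features, variances shared by both classes.
pi = {+1: 0.3, -1: 0.7}
mu = {+1: [1.0, 2.0], -1: [-1.0, 0.5]}
var = [1.5, 0.8]
x = (0.2, 1.0)

for label in (+1, -1):
    print(label, posterior_bayes(label, x, pi, mu, var),
          posterior_logistic(label, x, pi, mu, var))
```

The two columns agree up to floating point error. If the variances differed between classes, quadratic terms in $\vec{x}$ would remain and the posterior would no longer take this linear-logistic form.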