An Introduction to Generative Models - Naive Bayes for Binary Features
Generative models represent a powerful class of machine learning techniques. Unlike methods that directly map inputs to outputs by modeling the conditional distribution $p(y \mid x)$, generative models learn the joint distribution $p(x, y)$ of inputs and labels, which can then be turned into predictions via Bayes’ Rule.
Generalized Linear Models vs. Generative Models
To recall, generalized linear models focus on the conditional distribution $p(y \mid x)$: they learn a direct mapping from an input $x$ to a prediction for the label $y$. Generative models instead model the joint distribution $p(x, y) = p(x \mid y)\,p(y)$, describing how inputs and labels are generated together.

This prediction process connects naturally to conditional distributions. Using Bayes’ Rule, we can rewrite the posterior as

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{p(x)}.$$

In practice, we often bypass computing the marginal $p(x)$, since it does not depend on $y$; to make a prediction we simply pick the label that maximizes $p(x \mid y)\,p(y)$:

$$\hat{y} = \arg\max_{y} \; p(x \mid y)\,p(y).$$
With this foundation, let us explore one of the most straightforward and widely used generative models: Naive Bayes (NB).
If you’re unable to follow the above formulation, here’s a quick refresher on Bayes’ Rule to help you out.
Bayes’ Rule relates conditional probabilities to joint and marginal probabilities. It can be expressed as:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{p(x)}$$

where:

- $p(y \mid x)$: Posterior probability of $y$ given $x$.
- $p(x, y)$: Joint probability of $x$ and $y$.
- $p(x \mid y)$: Likelihood of $x$ given $y$.
- $p(y)$: Prior probability of $y$.
- $p(x)$: Marginal probability of $x$, which ensures proper normalization.
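To make the rule concrete, here is a tiny numeric sketch in Python; the probabilities are made up purely for illustration and are not taken from any real dataset.

```python
# A quick numeric illustration of Bayes' Rule with made-up numbers:
# suppose 30% of reviews are fake (prior), and a particular word
# appears in 80% of fake reviews but only 40% of genuine ones.
p_y = 0.3                 # p(y = fake)
p_x_given_y = 0.8         # p(x = word present | y = fake)
p_x_given_not_y = 0.4     # p(x = word present | y = genuine)

# Marginal p(x) via the law of total probability.
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior p(y = fake | x) by Bayes' Rule.
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 3))  # ~0.462
```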
Naive Bayes: A Simple and Effective Generative Model
To understand Naive Bayes, consider a simple yet practical problem: binary text classification. Imagine we want to classify a document as either a fake review or a genuine review. This setup offers a clear context to explore the mechanics of generative modeling.
Representing Documents as Features
To make this task computationally feasible, we use a bag-of-words representation. A document is expressed as a binary vector $x = (x_1, x_2, \dots, x_d) \in \{0, 1\}^d$, where $d$ is the size of the vocabulary.

Here, $x_j = 1$ if the $j$-th vocabulary word appears in the document and $x_j = 0$ otherwise. Word order and word counts are discarded; only the presence or absence of each word is recorded.
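As a minimal sketch of this encoding, the following Python snippet maps a raw document to its binary feature vector; the vocabulary and the example sentence are hypothetical.

```python
import string

# Hypothetical 5-word vocabulary.
vocab = ["amazing", "refund", "great", "terrible", "product"]

def encode(document: str) -> list[int]:
    """Return a binary vector x with x[j] = 1 iff vocabulary word j occurs."""
    cleaned = document.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = set(cleaned.split())
    return [1 if word in tokens else 0 for word in vocab]

print(encode("Amazing product, simply amazing!"))  # [1, 0, 0, 0, 1]
```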
Modeling the Joint Probability of Documents and Labels
For a document $x$ with label $y$ (fake or genuine), a generative model describes the joint probability $p(x, y) = p(x \mid y)\,p(y)$.

However, modeling the dependencies between features ($x_1, \dots, x_d$) directly is intractable: an unrestricted joint distribution over $d$ binary features requires on the order of $2^d$ parameters per class, far too many to estimate from any realistic dataset.
The Naive Bayes Assumption
Naive Bayes simplifies the problem by assuming that features are conditionally independent given the label $y$:

$$p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y).$$
This assumption significantly reduces computational complexity while often delivering excellent results in practice. While the assumption of conditional independence may not hold in all cases, it is surprisingly effective in many real-world applications.
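As a rough illustration of the savings (the vocabulary size here is a made-up but typical number): with $d = 10{,}000$ words, an unrestricted distribution over binary feature vectors would need on the order of $2^{10000}$ probabilities per class, whereas under the Naive Bayes assumption we only need one Bernoulli parameter per word and class, i.e. $2d = 20{,}000$ parameters plus a single class prior.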
Parameterizing the Naive Bayes Model
To make predictions, we need to parameterize the probabilities $p(x_j \mid y)$ and $p(y)$, i.e., choose a family of distributions with a small number of learnable parameters.

Why? Parameterizing these distributions allows us to learn the necessary values (e.g., the Bernoulli parameters introduced below) directly from the training data.
Binary Features
For simplicity, let us assume the features are binary, $x_j \in \{0, 1\}$. A natural choice is to model each feature with a Bernoulli distribution, one per feature and class:

$$p(x_j = 1 \mid y = 1) = \theta_{j,1}, \qquad p(x_j = 1 \mid y = 0) = \theta_{j,0}.$$

Similarly, the label distribution is modeled as a Bernoulli distribution with a single parameter $\pi$:

$$p(y) = \pi^{y} (1 - \pi)^{1 - y}, \qquad \text{where } \pi = p(y = 1).$$
How do we arrive at these definitions? These definitions arise from the following assumptions and modeling principles:
- Binary Nature of Features: Since the features $x_j$ are binary ($x_j \in \{0, 1\}$), we need a probability distribution that models the likelihood of binary outcomes. The Bernoulli distribution is a natural choice for this.
- Parameterization with Bernoulli Distributions:
  - For $p(x_j \mid y)$, the Bernoulli distribution models the probability that $x_j = 1$ for each possible value of $y$.
  - We introduce parameters $\theta_{j,1}$ and $\theta_{j,0}$, which represent the probability of $x_j = 1$ given $y = 1$ and $y = 0$, respectively.
- Label Distribution $p(y)$:
  - The label $y$ is also binary ($y \in \{0, 1\}$), so we model it using a Bernoulli distribution with a single parameter $\pi$, where $\pi = p(y = 1)$.
  - This parameter reflects the prior probability of the positive class.
- Learning from Data: These parameters ($\theta_{j,1}$, $\theta_{j,0}$, and $\pi$) are learned from data using methods like Maximum Likelihood Estimation (MLE), ensuring that the model reflects the observed distribution of features and labels in the dataset.
Thus, the definitions provide a straightforward and interpretable way to model binary features and labels within the Naive Bayes framework.
With these definitions, the joint probability $p(x, y)$ factorizes as:

$$p(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y).$$

Substituting the probabilities for binary features:

$$p(x, y) = \pi^{y} (1 - \pi)^{1 - y} \prod_{j=1}^{d} \theta_{j,y}^{\,x_j} \big(1 - \theta_{j,y}\big)^{\,1 - x_j}.$$

Here, $\theta_{j,y}$ denotes $p(x_j = 1 \mid y)$ (so $\theta_{j,1}$ for $y = 1$ and $\theta_{j,0}$ for $y = 0$), and $\pi$ denotes $p(y = 1)$.
How to intuitively understand this equation? This equation represents the joint probability $p(x, y)$ of observing a document $x$ together with its label $y$:
- For each feature $x_j$, the term $\theta_{j,y}^{\,x_j}$ ensures that the corresponding parameter $\theta_{j,y}$ is used if $x_j = 1$, while $\big(1 - \theta_{j,y}\big)^{\,1 - x_j}$ ensures that $1 - \theta_{j,y}$ is used if $x_j = 0$.
- The product $\prod_{j=1}^{d}$ combines the contributions of all features under the Naive Bayes assumption of conditional independence.
- Finally, multiplying by $\pi^{y}(1 - \pi)^{1 - y}$ incorporates the prior belief about the label $y$, providing the full joint distribution $p(x, y)$.
By this decomposition, we can efficiently compute $p(x, y)$ for any document and label using only the $2d$ feature parameters and the prior $\pi$.
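Here is a minimal Python sketch of this computation; the parameters below are hypothetical, and the calculation is done in log space, which is the usual way to avoid numerical underflow when many small probabilities are multiplied.

```python
import math

def log_joint(x, y, theta, pi):
    """Log of p(x, y) for the binary Naive Bayes model.

    x     : list of 0/1 feature values, length d
    y     : class label, 0 or 1
    theta : theta[y][j] = p(x_j = 1 | y)
    pi    : p(y = 1)
    """
    # Log prior: y*log(pi) + (1 - y)*log(1 - pi).
    log_p = math.log(pi) if y == 1 else math.log(1 - pi)
    # Add the log-probability of each feature (conditional independence).
    for j, x_j in enumerate(x):
        p_j = theta[y][j]
        log_p += math.log(p_j) if x_j == 1 else math.log(1 - p_j)
    return log_p

# Hypothetical parameters for a 3-word vocabulary.
theta = {0: [0.40, 0.05, 0.50],   # p(x_j = 1 | y = 0)
         1: [0.80, 0.30, 0.45]}   # p(x_j = 1 | y = 1)
pi = 0.3
print(log_joint([1, 0, 1], y=1, theta=theta, pi=pi))
```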
Learning Parameters with Maximum Likelihood Estimation (MLE)
The parameters $\theta_{j,1}$, $\theta_{j,0}$, and $\pi$ are learned from a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ by maximizing the likelihood of the observed data:

$$L(\theta, \pi) = \prod_{i=1}^{n} p\big(x^{(i)}, y^{(i)}\big).$$

Taking the logarithm of the likelihood to simplify optimization, we obtain the log-likelihood:

$$\ell(\theta, \pi) = \sum_{i=1}^{n} \log p\big(x^{(i)}, y^{(i)}\big).$$

For binary features, substituting the joint probability $p\big(x^{(i)}, y^{(i)}\big)$ derived above gives:

$$\ell(\theta, \pi) = \sum_{i=1}^{n} \Big[ y^{(i)} \log \pi + \big(1 - y^{(i)}\big) \log(1 - \pi) + \sum_{j=1}^{d} \Big( x_j^{(i)} \log \theta_{j, y^{(i)}} + \big(1 - x_j^{(i)}\big) \log\big(1 - \theta_{j, y^{(i)}}\big) \Big) \Big].$$

Focusing on a specific feature $j$ and the class $y = 1$, i.e., the parameter $\theta_{j,1}$, only the examples with $y^{(i)} = 1$ contribute terms involving this parameter:

$$\ell_{j,1}(\theta_{j,1}) = \sum_{i:\, y^{(i)} = 1} \Big[ x_j^{(i)} \log \theta_{j,1} + \big(1 - x_j^{(i)}\big) \log\big(1 - \theta_{j,1}\big) \Big].$$
Step 1: Derivative of the Log-Likelihood
Taking the derivative of the log-likelihood with respect to $\theta_{j,1}$:

$$\frac{\partial \ell}{\partial \theta_{j,1}} = \sum_{i:\, y^{(i)} = 1} \left[ \frac{x_j^{(i)}}{\theta_{j,1}} - \frac{1 - x_j^{(i)}}{1 - \theta_{j,1}} \right].$$
Did you follow the derivative? You might be wondering how this derivative was obtained, so let’s unpack it.

The transition from the restricted log-likelihood above to its derivative involves differentiating each logarithmic term with respect to $\theta_{j,1}$.
Here, the derivative is applied to the logarithm terms. Using the chain rule, we first compute the derivative of a logarithm, which is:

$$\frac{d}{d\theta} \log f(\theta) = \frac{f'(\theta)}{f(\theta)},$$

where $f(\theta)$ denotes the expression inside the logarithm.
For a single training example $i$ with $y^{(i)} = 1$, the relevant term is $x_j^{(i)} \log \theta_{j,1} + \big(1 - x_j^{(i)}\big) \log\big(1 - \theta_{j,1}\big)$.

The derivative of $\log \theta_{j,1}$ with respect to $\theta_{j,1}$ is $\frac{1}{\theta_{j,1}}$, and the derivative of $\log\big(1 - \theta_{j,1}\big)$ is $-\frac{1}{1 - \theta_{j,1}}$ (the minus sign comes from the inner derivative of $1 - \theta_{j,1}$).

Using the chain rule:

$$\frac{\partial}{\partial \theta_{j,1}} \Big[ x_j^{(i)} \log \theta_{j,1} + \big(1 - x_j^{(i)}\big) \log\big(1 - \theta_{j,1}\big) \Big] = \frac{x_j^{(i)}}{\theta_{j,1}} - \frac{1 - x_j^{(i)}}{1 - \theta_{j,1}}.$$

Applying this to the summation over all examples with $y^{(i)} = 1$ yields:

$$\frac{\partial \ell}{\partial \theta_{j,1}} = \sum_{i:\, y^{(i)} = 1} \left[ \frac{x_j^{(i)}}{\theta_{j,1}} - \frac{1 - x_j^{(i)}}{1 - \theta_{j,1}} \right].$$
This is exactly what the derivative above represents, showing the decomposition into two terms: one active when $x_j^{(i)} = 1$ and one active when $x_j^{(i)} = 0$.

The simplification uses the indicator functions $\mathbb{1}[x_j^{(i)} = 1]$ and $\mathbb{1}[x_j^{(i)} = 0]$, which for binary features are simply $x_j^{(i)}$ and $1 - x_j^{(i)}$.
Key Insight:
At each step of the derivation, we are dealing with a single term inside the logarithm. As a result, when we take the derivative of the logarithm, the result is simply the reciprocal of that term (times the derivative of what is inside it):

- If $x_j^{(i)} = 1$, the term inside the log is $\theta_{j,1}$, and its derivative is $\frac{1}{\theta_{j,1}}$.
- If $x_j^{(i)} = 0$, the term inside the log is $1 - \theta_{j,1}$, and its derivative is $-\frac{1}{1 - \theta_{j,1}}$.

Thus, at each step only one of the two terms is active for a given example, which is why the derivative takes such a simple form.
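If you want to sanity-check the derivative yourself, here is a small Python sketch (the toy data and the value of $\theta_{j,1}$ are made up) comparing the analytic expression with a finite-difference approximation:

```python
import math

# Hypothetical toy data: the value of feature j for the examples with y = 1.
x_j = [1, 0, 1, 1, 0]
theta = 0.5  # an arbitrary value for theta_{j,1}

def restricted_log_likelihood(t):
    """Sum of x*log(t) + (1 - x)*log(1 - t) over the class-1 examples."""
    return sum(x * math.log(t) + (1 - x) * math.log(1 - t) for x in x_j)

# Analytic derivative from the derivation above.
analytic = sum(x / theta - (1 - x) / (1 - theta) for x in x_j)

# Central finite-difference approximation of the same derivative.
eps = 1e-6
numeric = (restricted_log_likelihood(theta + eps)
           - restricted_log_likelihood(theta - eps)) / (2 * eps)

print(analytic, numeric)  # both are approximately 2.0 for this data
# At theta = 3/5 (the fraction of ones, i.e. the MLE derived next),
# this derivative is exactly zero.
```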
I hope that makes sense now. Let’s continue.
Step 2: Setting the Derivative to Zero
To find the maximum likelihood estimate, we set the derivative to zero:

$$\sum_{i:\, y^{(i)} = 1} \left[ \frac{x_j^{(i)}}{\theta_{j,1}} - \frac{1 - x_j^{(i)}}{1 - \theta_{j,1}} \right] = 0.$$

Simplifying:

$$\big(1 - \theta_{j,1}\big) \sum_{i:\, y^{(i)} = 1} x_j^{(i)} = \theta_{j,1} \sum_{i:\, y^{(i)} = 1} \big(1 - x_j^{(i)}\big),$$

which collapses to

$$\sum_{i:\, y^{(i)} = 1} x_j^{(i)} = \theta_{j,1} \sum_{i:\, y^{(i)} = 1} 1.$$
The above simplification is quite straightforward. I encourage you to write it out for yourself and work through the steps. Simply multiply both sides by $\theta_{j,1}\big(1 - \theta_{j,1}\big)$ and collect the terms.

Note: this multiplication is valid because $0 < \theta_{j,1} < 1$ at the maximum; at $\theta_{j,1} = 0$ or $1$ the log-likelihood becomes $-\infty$ whenever both feature values occur among the class-1 examples.
Step 3: Solving for $\theta_{j,1}$

Rearranging to isolate $\theta_{j,1}$:

$$\hat{\theta}_{j,1} = \frac{\sum_{i:\, y^{(i)} = 1} x_j^{(i)}}{\sum_{i=1}^{n} \mathbb{1}\big[y^{(i)} = 1\big]}.$$

Interpretation: This estimate corresponds to the fraction of examples with label $y = 1$ in which feature $j$ is present ($x_j^{(i)} = 1$); it is a simple count ratio.
Next Steps:
- Compute the other $\theta$ values: You should calculate the corresponding parameters for all other features and classes in the model (for example, $\theta_{j,0}$ for the class $y = 0$, and $\theta_{j,1}$ for every remaining feature $j$). These values represent the probability of a feature being present given a class, so you continue by maximizing the likelihood for each feature $j$ and class $y$ to estimate these parameters.
- Estimate $\pi$: You’ll also need to compute the class prior probability $\pi = p(y = 1)$, which is simply the proportion of each class in the training data. This can be done by counting how many times each class label appears and normalizing by the total number of examples:
$$\hat{\pi} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[y^{(i)} = 1\big].$$
$\hat{\pi}$ is the proportion of samples in the dataset that belong to the class $y = 1$, and it serves as the prior probability of that class. A minimal estimation sketch is shown below.
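The following Python sketch computes these counting-based estimates on a hypothetical toy dataset; the matrix X and labels y are made up for illustration.

```python
import numpy as np

# Hypothetical toy training set: n = 5 documents, d = 3 binary features.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 1, 0])

# Class prior: fraction of examples labeled y = 1.
pi_hat = y.mean()                                   # 0.6

# Bernoulli parameters: theta_hat[c, j] is the fraction of class-c examples
# in which feature j is present -- exactly the MLE derived above.
theta_hat = np.vstack([X[y == c].mean(axis=0) for c in (0, 1)])

print("pi_hat =", pi_hat)
print("theta_hat =", theta_hat)
# Note: theta_hat[1, 0] comes out as exactly 1.0 on this tiny dataset;
# in practice Laplace smoothing is usually added so no estimate is 0 or 1.
```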
Substituting the Probabilities for Binary Features:
The likelihood of the joint probability $p(x, y)$, with the estimated parameters plugged in, is:

$$p(x, y) = \hat{\pi}^{\,y} \big(1 - \hat{\pi}\big)^{\,1 - y} \prod_{j=1}^{d} \hat{\theta}_{j,y}^{\,x_j} \big(1 - \hat{\theta}_{j,y}\big)^{\,1 - x_j},$$

where $\hat{\theta}_{j,y}$ and $\hat{\pi}$ are the maximum likelihood estimates computed above.
Remember this equation; it’s the one we started with.
Once all parameters are estimated, you will have a fully parameterized Naive Bayes model. The model can then be used for prediction by computing the posterior probabilities for each class $y$ and selecting the most probable one:

$$\hat{y} = \arg\max_{y \in \{0, 1\}} \; p(y) \prod_{j=1}^{d} p(x_j \mid y),$$

where $p(x_j \mid y)$ is the per-feature Bernoulli probability with the estimated parameters and $p(y)$ is the estimated class prior. The marginal $p(x)$ is dropped because it is the same for every class.
So, the fundamental idea is:
You are estimating the parameters $\theta_{j,y}$ and $\pi$ of the joint distribution $p(x, y)$ from the training data, and then using Bayes’ Rule to score each class for a new document.

This gives you the predicted class $\hat{y}$ for any new input $x$. A minimal prediction sketch follows.
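Here is a small Python sketch of the prediction step; the (already smoothed) parameters below are hypothetical, and the scoring is done in log space for numerical stability.

```python
import numpy as np

def predict(x, theta, pi):
    """Return the class in {0, 1} maximizing log p(y) + sum_j log p(x_j | y)."""
    x = np.asarray(x)
    log_prior = np.log(np.array([1.0 - pi, pi]))
    # theta has shape (2, d): theta[c, j] = p(x_j = 1 | y = c).
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1.0 - theta)).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))

# Hypothetical (smoothed) parameters for a 3-word vocabulary.
theta = np.array([[0.20, 0.40, 0.70],   # p(x_j = 1 | y = 0)
                  [0.85, 0.30, 0.40]])  # p(x_j = 1 | y = 1)
pi = 0.6

print(predict([1, 0, 1], theta, pi))  # -> 1
```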
Recipe for Learning a Naive Bayes Model:
- Choose $p(x_j \mid y)$: Select an appropriate distribution for the features, e.g., a Bernoulli distribution for binary features $x_j \in \{0, 1\}$.
- Choose $p(y)$: Typically, use a categorical distribution (a Bernoulli distribution in the two-class case) for the class labels.
- Estimate Parameters by MLE: Use Maximum Likelihood Estimation (MLE) to estimate the parameters, following the same strategy used in conditional models. An end-to-end sketch of this recipe is shown below.
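For comparison, scikit-learn ships a Bernoulli Naive Bayes implementation that follows the same recipe; the sketch below assumes scikit-learn is installed and reuses the made-up toy data from earlier. Note that its default `alpha` adds Laplace smoothing, so the fitted parameters are smoothed counts rather than the plain MLE ratios derived above.

```python
# An end-to-end sketch of the recipe using scikit-learn's BernoulliNB.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy training set (same as before).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 1, 0])

# alpha is the Laplace smoothing strength.
model = BernoulliNB(alpha=1.0)
model.fit(X, y)

print(model.predict([[1, 0, 1]]))        # predicted class label
print(model.predict_proba([[1, 0, 1]]))  # posterior probabilities per class
```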
Where Do We Go From Here?
So far, we have focused on modeling binary features. However, many real-world datasets involve continuous features. How can Naive Bayes be extended to handle such cases? In the next blog, we’ll explore Naive Bayes for continuous features and see how this simple model adapts to more complex data types. See you there!