Bayesian Decision Theory - Concepts and Recap

Bayesian decision theory is a powerful framework for making decisions under uncertainty. It provides a principled way to combine prior knowledge with observed data to make optimal choices. In this post, we’ll take a closer look at its key components, revisit Bayesian point estimation, and connect these ideas to classical probability modeling. Let’s dive in!


Ingredients of Bayesian Decision Theory

At the heart of Bayesian decision theory lie several key components. First, we have the parameter space, denoted as \(\Theta\), which represents all possible values of the unknown parameter we aim to estimate or make decisions about. Next, we have the prior distribution, \(p(\theta)\), which encodes our beliefs about \(\theta\) before observing any data. This prior serves as a starting point in the Bayesian framework.

Equally important is the action space, \(A\), which includes all possible actions we might take. To evaluate these actions, we rely on a loss function, \(\ell : A \times \Theta \to \mathbb{R}\), which quantifies the cost of taking a specific action \(a \in A\) when the true parameter value is \(\theta \in \Theta\).

With these components, we can define the posterior risk of an action \(a \in A\), which represents the expected loss under the posterior distribution:

\[r(a) = \mathbb{E}[\ell(a, \theta) \mid D] = \int \ell(a, \theta) \, p(\theta \mid D) \, d\theta\]

The goal is to minimize this risk. The action that achieves this minimization is called the Bayes action, \(a^*\), which satisfies:

\[r(a^*) = \min_{a \in A} r(a)\]
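
To make this concrete, here is a minimal numerical sketch of picking a Bayes action. Everything in it (the parameter grid, the posterior weights, the candidate actions) is made up for illustration:

```python
import numpy as np

# Illustrative discrete posterior over a small grid of theta values.
thetas = np.array([0.2, 0.5, 0.8])       # parameter values (made up)
posterior = np.array([0.3, 0.5, 0.2])    # p(theta | D), sums to 1

# Candidate actions and a squared-error loss l(a, theta).
actions = np.linspace(0.0, 1.0, 101)
loss = (actions[:, None] - thetas[None, :]) ** 2

# Posterior risk r(a) = sum over theta of l(a, theta) * p(theta | D).
risk = loss @ posterior

a_star = actions[np.argmin(risk)]
print(a_star)  # 0.47, the posterior mean, as expected under squared loss
```
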
Bayesian Point Estimation

Bayesian point estimation builds upon this foundation. Imagine we have data \(D\) generated from some distribution \(p(y \mid \theta)\), where \(\theta \in \Theta\) is unknown. Our task is to find a single point estimate, \(\hat{\theta}\), that best represents \(\theta\).

To do this, we first specify a prior distribution \(p(\theta)\) over \(\Theta\), which reflects our beliefs about \(\theta\) before observing the data. We then define a loss function, \(\ell(\hat{\theta}, \theta)\), to measure the cost of estimating \(\theta\) with \(\hat{\theta}\). Finally, we seek the point estimate \(\hat{\theta} \in \Theta\) that minimizes the posterior risk:

\[r(\hat{\theta}) = \mathbb{E}[\ell(\hat{\theta}, \theta) \mid D] = \int \ell(\hat{\theta}, \theta) \, p(\theta \mid D) \, d\theta.\]

Important Loss Functions and Their Role

The choice of loss function significantly influences the optimal estimate. Here are three commonly used loss functions and the corresponding Bayes actions:

  • Squared Loss: \(\ell(\hat{\theta}, \theta) = (\theta - \hat{\theta})^2\).
    For squared loss, the Bayes action is the posterior mean, \(\mathbb{E}[\theta \mid D]\).

  • Zero-One Loss: \(\ell(\hat{\theta}, \theta) = 1[\theta \neq \hat{\theta}]\).
    For zero-one loss, the Bayes action is the posterior mode, the most probable value of \(\theta\) under the posterior.

  • Absolute Loss: \(\ell(\hat{\theta}, \theta) = |\theta - \hat{\theta}|\).
    For absolute loss, the Bayes action is the posterior median, the value that splits the posterior distribution into two equal halves.

Example: Card Drawing

To see this in action, consider drawing a card uniformly at random from a deck with the values \(\{2, 3, 3, 4, 4, 5, 5, 5\}\), and suppose you are asked to guess the value of the card. Treating this uniform distribution over the deck as our posterior:

  • The mean of the distribution is \(3.875\).
  • The mode (most frequent value) is \(5\).
  • The median (middle value) is \(4\).

This simple example highlights how different loss functions lead to different optimal estimates.
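
Here is a quick numpy check of those three numbers, again treating the uniform draw from the deck as the posterior:

```python
import numpy as np

# Each of the eight cards is equally likely, so the empirical
# distribution of the deck plays the role of the posterior.
deck = np.array([2, 3, 3, 4, 4, 5, 5, 5])

print(deck.mean())        # 3.875 -> Bayes action under squared loss
print(np.median(deck))    # 4.0   -> Bayes action under absolute loss

values, counts = np.unique(deck, return_counts=True)
print(values[np.argmax(counts)])  # 5 -> Bayes action under zero-one loss
```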

Bayesian Point Estimation with Squared Loss

We seek an action \(\hat{\theta}\) that minimizes the posterior risk, given by:

\[r(\hat{\theta}) = \int (\theta - \hat{\theta})^2 \, p(\theta \mid D) \, d\theta\]

To find the optimal \(\hat{\theta}\), we differentiate:

\[\frac{d r(\hat{\theta})}{d\hat{\theta}} = -\int 2 (\theta - \hat{\theta}) \, p(\theta \mid D) \, d\theta\]

Rearranging,

\[= -2 \int \theta \, p(\theta \mid D) \, d\theta + 2\hat{\theta} \int p(\theta \mid D) \, d\theta\]

Since the total probability integrates to 1,

\[\int p(\theta \mid D) \, d\theta = 1,\]

this simplifies to:

\[\frac{d r(\hat{\theta})}{d\hat{\theta}} = -2 \int \theta \, p(\theta \mid D) \, d\theta + 2\hat{\theta}\]

Setting the derivative to zero,

\[-2 \int \theta \, p(\theta \mid D) \, d\theta + 2\hat{\theta} = 0\]

Solving for \(\hat{\theta}\),

\[\hat{\theta} = \int \theta \, p(\theta \mid D) \, d\theta = \mathbb{E}[\theta \mid D]\]

Thus, under squared loss, the Bayes action is the posterior mean.
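
We can also verify this numerically. The sketch below reuses the card deck from earlier as a discrete posterior and scans candidate estimates; the minimizer of the expected squared loss comes out at the posterior mean:

```python
import numpy as np

deck = np.array([2, 3, 3, 4, 4, 5, 5, 5])  # uniform "posterior" from above
candidates = np.linspace(2.0, 5.0, 3001)   # candidate estimates theta_hat

# r(theta_hat) = E[(theta - theta_hat)^2 | D] under the uniform deck distribution.
risk = np.array([np.mean((deck - c) ** 2) for c in candidates])

print(candidates[np.argmin(risk)])  # 3.875, the posterior mean
```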


Recap and Interpretation

Bayesian Decision Theory is built on a few core ideas that tie together probability, decision-making, and inference. Let’s revisit these concepts and unpack their meaning once more, all at once, to get the full picture.

Note: If you feel this isn’t necessary, feel free to skip it. However, I believe it’s helpful to reinforce these concepts periodically to build strong intuition and apply them effectively when needed.

The Prior (\(p(\theta)\))

The prior represents our initial beliefs about the unknown parameter \(\theta\) before observing any data. It encapsulates what we know (or assume) about \(\theta\) based on prior knowledge, expert opinion, or historical data.

For example, if \(\theta\) represents the probability of success in a coin toss, a reasonable prior might be a Beta distribution centered around 0.5, reflecting our belief that the coin is fair.

The Posterior (\(p(\theta \mid D)\))

The posterior is the updated belief about \(\theta\) after observing the data \(D\). It combines the prior \(p(\theta)\) with the likelihood of the data \(p(D \mid \theta)\) using Bayes’ theorem:

\[p(\theta \mid D) = \frac{p(D \mid \theta) p(\theta)}{p(D)}\]

The posterior is the foundation of all Bayesian inference. It reflects how the data has rationally updated our initial beliefs.
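
As a concrete (made-up) example, here is the coin-toss update from above in code, using a Beta prior. The Beta distribution is conjugate to the Bernoulli likelihood, so the posterior stays in the Beta family:

```python
import numpy as np
from scipy import stats

a0, b0 = 2.0, 2.0                             # Beta(2, 2) prior, centered at 0.5
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical observed tosses

heads = tosses.sum()
tails = len(tosses) - heads

# Conjugate update: the posterior is Beta(a0 + heads, b0 + tails) = Beta(8, 4).
posterior = stats.beta(a0 + heads, b0 + tails)

print(posterior.mean())    # 8/12 ~ 0.667, the Bayes estimate under squared loss
print(posterior.median())  # the Bayes estimate under absolute loss
```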

Inferences and Actions

In the Bayesian framework, all inferences (e.g., estimating \(\theta\)) and actions (e.g., making decisions) are based on the posterior distribution. This is because the posterior contains all the information we have about \(\theta\), combining both prior knowledge and observed data.

For example, if we want to estimate \(\theta\), we might compute the posterior mean, median, or mode, depending on the loss function we choose.

No Need to Justify an Estimator

In classical statistics, we often need to justify why a particular estimator (e.g., the sample mean) is a good choice. In Bayesian statistics, this issue doesn’t arise because the estimator is derived directly from the posterior distribution, which is fully determined by the prior and the data.

The only choices we need to make are:

  1. The family of distributions (e.g., Gaussian, Beta) that model the data.
  2. The prior distribution on the parameter space \(\Theta\).

Role of the Loss Function

The loss function \(\ell(a, \theta)\) quantifies the cost of taking action \(a\) when the true parameter is \(\theta\). It bridges the gap between inference and decision-making.

The optimal action \(a^*\) is the one that minimizes the posterior risk, which is the expected loss under the posterior distribution:

\[r(a) = \mathbb{E}[\ell(a, \theta) \mid D] = \int \ell(a, \theta) p(\theta \mid D) \, d\theta\]

Different loss functions lead to different optimal actions. For example:

  • Squared loss leads to the posterior mean.
  • Absolute loss leads to the posterior median.
  • Zero-one loss leads to the posterior mode.

Philosophical Interpretation

Bayesian Decision Theory is fundamentally about rational decision-making under uncertainty. It provides a coherent framework for updating beliefs and making decisions that minimize expected loss.

Unlike frequentist methods, which focus on long-run properties of estimators, Bayesian methods focus on the current state of knowledge, as represented by the posterior distribution.

Why Does This Matter?

Understanding these concepts is crucial because they form the backbone of Bayesian thinking. Here’s why:

  1. Flexibility: The Bayesian approach allows us to incorporate prior knowledge into our analysis, which can be especially useful when data is limited.
  2. Transparency: All assumptions (e.g., the choice of prior) are explicitly stated, making the analysis transparent and interpretable.
  3. Decision-Oriented: By focusing on minimizing expected loss, Bayesian Decision Theory directly addresses the practical goal of making optimal decisions.

Example: Estimating the Mean of a Normal Distribution

Suppose we want to estimate the mean \(\theta\) of a normal distribution based on observed data \(D\). Here’s how the Bayesian approach works:

  1. Prior: We choose a normal prior \(p(\theta) = N(\mu_0, \sigma_0^2)\), where \(\mu_0\) and \(\sigma_0^2\) reflect our initial beliefs about \(\theta\).
  2. Likelihood: The data \(D\) is modeled as \(p(D \mid \theta) = N(\theta, \sigma^2)\).
  3. Posterior: Using Bayes’ theorem, the posterior \(p(\theta \mid D)\) is also a normal distribution, with updated mean and variance that balance the prior and the data.
  4. Decision: If we use squared loss, the optimal estimate \(\hat{\theta}\) is the posterior mean.

This example illustrates how the Bayesian approach seamlessly integrates prior knowledge with observed data to produce a rational and optimal estimate.
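
Here is what that update looks like in code, using the standard conjugate Normal-Normal formulas (all numbers are made up for illustration):

```python
import numpy as np

mu0, sigma0_2 = 0.0, 4.0                  # prior N(mu0, sigma0^2)
sigma2 = 1.0                              # known observation variance
data = np.array([1.2, 0.8, 1.5, 1.1])     # hypothetical observations

n, ybar = len(data), data.mean()

# Precisions add; the posterior mean is a precision-weighted average
# of the prior mean and the sample mean.
post_prec = 1.0 / sigma0_2 + n / sigma2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / sigma0_2 + n * ybar / sigma2)

print(post_mean, post_var)  # the posterior mean is the estimate under squared loss
```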

Example: Estimating the Mean of a Normal Distribution (Frequentist Approach)

  1. Model Assumption:
    Assume the data \(D\) comes from a normal distribution: \(p(D \mid \theta) = N(\theta, \sigma^2),\)
    where \(\theta\) is the unknown mean and \(\sigma^2\) is the known variance.

  2. Estimator:
    Use the sample mean: \(\bar{D} = \frac{1}{n} \sum_{i=1}^n D_i\)
    as the estimator for \(\theta\). This is derived as the maximum likelihood estimator (MLE) because it maximizes the likelihood function:

    \[L(\theta) = \prod_{i=1}^n p(D_i \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(D_i - \theta)^2}{2\sigma^2}\right).\]
  3. Properties of the Estimator:
    • The sample mean \(\bar{D}\) is an unbiased estimator of \(\theta\), meaning: \(E[\bar{D}] = \theta.\)
    • Its variance is: \(\text{Var}(\bar{D}) = \frac{\sigma^2}{n},\)
      which decreases as the sample size \(n\) increases.
  4. Decision:
    The sample mean \(\bar{D}\) is reported as the estimate of \(\theta\). There is no explicit loss function in the frequentist framework; instead, the sample mean is justified by its desirable properties (e.g., unbiasedness, efficiency, and consistency).

Note: If you’re unsure about this derivation from MLE, check out this link for clarification.
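
For comparison, the frequentist computation on the same made-up data is just the sample mean and its standard error:

```python
import numpy as np

sigma2 = 1.0                              # known variance, as assumed above
data = np.array([1.2, 0.8, 1.5, 1.1])     # same hypothetical observations

theta_hat = data.mean()                   # the MLE for theta
std_err = np.sqrt(sigma2 / len(data))     # sqrt of Var(D-bar) = sigma^2 / n

print(theta_hat, std_err)
```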

Key Takeaways

  1. The prior represents initial beliefs, the posterior represents updated beliefs, and the loss function guides decision-making.
  2. Bayesian methods are inherently decision-oriented, focusing on minimizing expected loss.
  3. The only choices we need to make are the family of distributions and the prior—everything else follows logically from these choices.

A few follow-up questions you might have:

Explanation of “Actions/Decision-Making”

In the Bayesian framework, actions or decision-making refer to the choices or decisions we make based on the information encoded in the posterior distribution. These decisions could range from estimating a parameter to choosing between different courses of action based on the expected outcomes.

For example:

  • If you’re estimating a parameter \(\theta\), the action could be selecting the posterior mean, median, or mode as your estimate.
  • If you’re deciding whether to launch a product, the action could involve calculating the probability of success using the posterior and deciding based on a predefined threshold.
  • In medical diagnostics, the action could be choosing a treatment plan based on the likelihood of a disease inferred from the posterior.

In essence, actions are the outcomes of the decision-making process, guided by the posterior distribution and a loss function that quantifies the cost of making an incorrect decision.

What is an “Estimator” in the frequentist approach?

An estimator is a statistical function or rule used to estimate an unknown parameter \(\theta\) based on observed data. In frequentist statistics, estimators are often chosen based on their theoretical properties, such as:

  • Unbiasedness: The estimator’s expected value equals the true parameter value.
  • Efficiency: The estimator has the smallest possible variance among all unbiased estimators.
  • Consistency: The estimator converges to the true parameter value as the sample size increases.

For example:

  • The sample mean is a common estimator for the population mean.
  • The sample variance is an estimator for the population variance.

In Bayesian statistics, however, the estimator is derived directly from the posterior distribution. For instance:

  • The posterior mean minimizes squared error loss.
  • The posterior median minimizes absolute error loss.
  • The posterior mode corresponds to the most likely value of \(\theta\).

The key difference is that Bayesian methods do not require separate justification for an estimator because the posterior distribution naturally incorporates both the prior beliefs and observed data, making the choice of estimator a consequence of the decision-making process.


We’ve covered most of it, right? Now, let’s revisit the foundational concepts that underpin everything we’ve discussed so far, including conditional probability, the likelihood function, and MLE in a general sense.

Conditional Probability Modeling

In this context, we have:

  • An input space, \(X\), which represents the features or predictors.
  • An outcome space, \(Y\), which represents the possible outputs.
  • An action space, \(A\), consisting of probability distributions on \(Y\).

A prediction function \(f : X \to A\) maps each input \(x \in X\) to a distribution on \(Y\). This setup allows us to model the relationship between inputs and outputs probabilistically.

In a parametric framework, we define a family of conditional densities:

\[\{p(y \mid x, \theta) : \theta \in \Theta\},\]

where \(p(y \mid x, \theta)\) is a density on \(Y\) for each \(x \in X\), and \(\theta\) is a parameter in the finite-dimensional space \(\Theta\). This is the common starting point for either classical or Bayesian regression.

Classical Treatment: Likelihood Function

In the classical approach, we begin with data \(D = (y_1, \dots, y_n)\) and assume it is generated by the conditional density \(p(y \mid x, \theta)\). The probability of the data is:

\[p(D \mid x_1, \dots, x_n, \theta) = \prod_{i=1}^n p(y_i \mid x_i, \theta)\]

For fixed \(D\), the likelihood function is defined as:

\[L_D(\theta) = p(D \mid x, \theta),\]

where \(x = (x_1, \dots, x_n)\).

Maximum Likelihood Estimator (MLE)

The Maximum Likelihood Estimator (MLE) for \(\theta\) is the value that maximizes the likelihood function:

\[\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \Theta} L_D(\theta)\]

Interestingly, MLE corresponds to Empirical Risk Minimization (ERM) if we set the loss function to the negative log-likelihood. The resulting prediction function is:

\[\hat{f}(x) = p(y \mid x, \hat{\theta}_{\text{MLE}})\]
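
As a small illustration of the MLE-as-ERM connection, here is a sketch for a Gaussian conditional model \(p(y \mid x, \theta) = N(\theta^\top x, \sigma^2)\), where minimizing the negative log-likelihood reduces to least squares (the data below is simulated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data: y = theta^T x + Gaussian noise.
X = rng.normal(size=(100, 2))
true_theta = np.array([1.5, -0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# For Gaussian noise, the negative log-likelihood is (up to constants)
# the sum of squared residuals, so the MLE is the least-squares solution.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_mle)  # close to [1.5, -0.5]
```
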
Conclusion

So, we’ve reached the end of this blog. I know it can be confusing at times—I’ve been confused too—but take your time to get a solid grasp on it. This is the foundation of Bayesian ML. In the past few blogs, we’ve discussed these concepts, but do you recall if we’ve applied them to any prediction tasks yet? The answer is no. In the next one, we’ll put them into practice through Bayesian conditional models.
