Readings on Machine Learning: Bayesian Statistics

The following material is based on Chapter 5 of Machine Learning: A Probabilistic Perspective and some examples from (MacKay, 2003)¹ and (Bishop, 2006)².

Discussion highlights

Posterior distribution (page 149): Using the posterior distribution \( p(\theta\mid\mathcal{D}) \) to summarize everything we know about a set of unknown variables is at the core of Bayesian statistics.
- Prior of \( \theta \): \( p(\theta) \)
- Likelihood of \( \theta \): \( p(\mathcal{D}\mid\theta) \)
- Posterior: \[ p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}\mid\theta)p(\theta)}{\int_{\theta} p(\mathcal{D}\mid\theta)p(\theta) d\theta} \]
- The main challenge in computing \( p(\theta\mid\mathcal{D}) \) is evaluating the normalizing integral \( \int_{\theta} p(\mathcal{D}\mid\theta)~p(\theta) d\theta \), which is often intractable.
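
To make the role of the normalizing integral concrete, here is a minimal sketch that approximates the posterior on a grid. It assumes a toy problem that is not part of the reading: a coin with unknown bias \( \theta \), a uniform prior, and an observed sequence with 7 heads and 3 tails. The function names `integrate` and `grid_posterior` are illustrative only.

```python
import numpy as np

def integrate(values, grid):
    """Trapezoidal rule for the integral of `values` over `grid`."""
    return float(np.sum((values[1:] + values[:-1]) / 2.0 * np.diff(grid)))

def grid_posterior(heads, tails, num_points=1001):
    """Approximate p(theta | D) on a grid for a coin with unknown bias theta."""
    theta = np.linspace(0.0, 1.0, num_points)
    prior = np.ones_like(theta)                       # p(theta): uniform on [0, 1] (assumed)
    likelihood = theta**heads * (1.0 - theta)**tails  # p(D | theta) for one observed sequence
    unnormalized = likelihood * prior                 # numerator of Bayes' rule
    evidence = integrate(unnormalized, theta)         # p(D), the normalizing integral
    return theta, unnormalized / evidence, evidence

theta, posterior, evidence = grid_posterior(heads=7, tails=3)
posterior_mean = integrate(theta * posterior, theta)
```

A grid works here only because \( \theta \) is one-dimensional; the number of grid points grows exponentially with the dimension of \( \theta \), which is why the integral is the hard part in realistic models.
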
Maximum a posteriori estimation (pages 149-150): The MAP estimate identifies the mode of the posterior \[ \hat{\theta} = \text{argmax}_{\theta} p(\theta\mid\mathcal{D}) = \text{argmax}_{\theta} p(\mathcal{D}\mid\theta) p(\theta) \]
- Efficient algorithms exist: (1) MAP estimation is essentially an optimization problem; (2) there is no need to compute \( p(\mathcal{D}) \), because it does not depend on \( \theta \)
- The connection with regularization. For example, in linear regression, \( \ell_2 \) regularization is equivalent to MAP estimation with a zero-mean isotropic Gaussian prior on \( \theta \), \( \mathcal{N}(\theta\mid 0, \sigma^2I) \).
- Issues of MAP: (1) it provides no measure of uncertainty; (2) without uncertainty over the model parameters, predictive distributions can be over-confident; (3) the mode can be an atypical point: the mode is a point of measure zero, whereas the mean and median take the volume of the space into account.
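
The link between the Gaussian prior and \( \ell_2 \) regularization can be checked numerically. The sketch below assumes a linear model with Gaussian noise of known variance `noise_var` and the prior \( \mathcal{N}(\theta\mid 0, \sigma^2 I) \) with \( \sigma^2 \) given by `prior_var`; under these assumptions the MAP estimate is the ridge solution with penalty `noise_var / prior_var`. The synthetic data and all function names are illustrative, not taken from the reading.

```python
import numpy as np

def map_estimate(X, y, noise_var, prior_var):
    """MAP estimate for linear regression with prior N(0, prior_var * I).

    This is the ridge regression solution with penalty noise_var / prior_var.
    """
    penalty = noise_var / prior_var
    num_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(num_features), X.T @ y)

def neg_log_posterior_grad(theta, X, y, noise_var, prior_var):
    """Gradient of -log p(D | theta) - log p(theta), up to constants."""
    return X.T @ (X @ theta - y) / noise_var + theta / prior_var

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

theta_map = map_estimate(X, y, noise_var=0.25, prior_var=1.0)
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)  # no prior, i.e. ordinary least squares
grad_at_map = neg_log_posterior_grad(theta_map, X, y, noise_var=0.25, prior_var=1.0)
```

The gradient of the negative log posterior vanishes at `theta_map`, confirming it is the posterior mode, and `theta_map` has a smaller norm than `theta_mle`, which is the shrinkage effect of the prior. Note that the result is still a single point and carries no uncertainty, as issue (1) above points out.
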
Model selection (page 155):
- Model selection problem: when faced with a set of models (i.e., families of parametric distributions) of different complexity, how should we choose the best one?
- Example: fitting data generated from the function \( \sin(2\pi x) \) with polynomials of degree \( M \)
  
  Functions with high complexity (e.g., \( M=9 \)) have the potential to fit more complex data but can over-fit simple datasets (figure reprinted from Chapter 1 of (Bishop, 2006)²)
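- Cross-validation: split the data into \( K \) folds; fit each model on \( K-1 \) folds, evaluate it on the held-out fold, and average the validation performance over the \( K \) folds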
- Bayesian model selection: \[ p(m\mid \mathcal{D})\propto p(\mathcal{D}\mid m) p(m). \] If the prior over models is uniform, \( p(m)\propto 1 \), model selection can be done with only \[ p(\mathcal{D}\mid m) = \int_{\theta} p(\mathcal{D}\mid\theta, m)p(\theta\mid m)d\theta, \] which is called the evidence (or marginal likelihood) for model \( m \).
- Justification of Bayesian model selection: Consider two models \( m_1 \) and \( m_2 \), where model \( m_2 \) has higher complexity than \( m_1 \). For the same data \( \mathcal{D} \), if both \( m_1 \) and \( m_2 \) can fit \( \mathcal{D} \), Bayesian model selection will favor \( m_1 \). The more complex model \( m_2 \) must spread \( p(\mathcal{D}\mid m_2) \) over a wider range of possible datasets, so it assigns less probability to any particular one. This is illustrated in the plot from Chapter 28 of (MacKay, 2003)¹, where \( H_1 \) and \( H_2 \) represent the models.
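
A small worked example may help with the evidence comparison. The sketch below is a hypothetical setup, not one from the reading: \( m_1 \) is a fair coin with no free parameter (\( \theta = 0.5 \)), and \( m_2 \) is a coin with unknown bias and a uniform prior on \( \theta \). For a specific sequence with \( h \) heads and \( t \) tails, the evidence of \( m_2 \) is the Beta function \( B(h+1, t+1) \). The prior over models is assumed uniform.

```python
from math import exp, lgamma, log

def log_evidence_fair(heads, tails):
    """log p(D | m1): fair coin, theta fixed at 0.5, no free parameter."""
    return (heads + tails) * log(0.5)

def log_evidence_uniform(heads, tails):
    """log p(D | m2): integral of theta^h (1-theta)^t under a uniform prior.

    The integral equals the Beta function B(h + 1, t + 1).
    """
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def posterior_prob_simple(heads, tails):
    """p(m1 | D), assuming a uniform prior p(m1) = p(m2) = 0.5."""
    log_ratio = log_evidence_uniform(heads, tails) - log_evidence_fair(heads, tails)
    return 1.0 / (1.0 + exp(log_ratio))

balanced = posterior_prob_simple(heads=5, tails=5)  # both models can explain this
skewed = posterior_prob_simple(heads=9, tails=1)    # only m2 explains this well
```

With 5 heads and 5 tails both models fit, and the simpler \( m_1 \) receives the higher posterior probability. With 9 heads and 1 tail the flexible \( m_2 \) wins, because \( m_1 \) cannot account for the imbalance.
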
Non-informative priors (page 165):
- If we don’t have strong beliefs about what \( \theta \) should be, it is common to use an uninformative or non-informative prior, and to “let the data speak for itself”.
- If \( \theta \) is discrete and has \( K\) states, then \( p(\theta) = \frac{1}{K} \) is a non-informative prior.
Bayesian decision theory (page 176):
- Action: \( a\in\mathcal{A} \), where \( \mathcal{A} \) is the action space
- (Ground-truth) label \( y\in\mathcal{Y} \), where \( \mathcal{Y} \) is the label space
- Loss \( L(y,a) \): the loss incurred if we pick action \( a \) when the ground-truth label is \( y \)
- Example: in classification problems, we often assume \( \mathcal{A}=\mathcal{Y} \) and treat \( a \) as the predicted label \( \hat{y} \). The popular 0-1 loss function is defined as \[ L(y,\hat{y}) = 1 \] if \( \hat{y}\not= y \); \( L(y,\hat{y})=0 \) when \( \hat{y}=y \).
- In general, Bayesian decision theory picks the action with the minimal expected loss \[ \delta(x) = \text{argmin}_{a\in\mathcal{A}} E[L(y,a)] \] where \( x \) is the input.
- With a predictive distribution of \( y \) given \( x \), the expected loss in the previous item is defined as \[ E[L(y,a)] = \sum_{y\in\mathcal{Y}} L(y,a)p(y\mid x). \] With the 0-1 loss, the expected loss of predicting \( \hat{y} \) is \( 1 - p(\hat{y}\mid x) \), so the decision rule reduces to the simple classification rule \( \delta(x) = \text{argmax}_{y\in\mathcal{Y}} p(y\mid x) \).
- Reject option: In classification problems where \( p(y\mid x) \) is very uncertain, we may prefer to choose a reject action, in which we refuse to classify the example as any of the specified classes, and instead say “don’t know”.
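
The decision rule and the reject option can be sketched with a loss matrix. The code below assumes a discrete label space and represents \( L(y,a) \) as a matrix with one row per label and one column per action. The reject action is modeled as an extra column with a constant cost `reject_cost`; that cost, the example posteriors, and the function names are assumptions made for illustration.

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Pick the action with minimal expected loss.

    posterior: shape (num_labels,), the predictive distribution p(y | x)
    loss:      shape (num_labels, num_actions), loss[y, a] = L(y, a)
    """
    expected_loss = posterior @ loss  # entry a is sum_y L(y, a) p(y | x)
    return int(np.argmin(expected_loss)), expected_loss

def zero_one_loss(num_labels):
    """0-1 loss: 0 on the diagonal (correct label), 1 elsewhere."""
    return 1.0 - np.eye(num_labels)

def zero_one_loss_with_reject(num_labels, reject_cost):
    """0-1 loss plus a final 'reject' action with constant cost (assumed)."""
    reject_column = np.full((num_labels, 1), reject_cost)
    return np.hstack([zero_one_loss(num_labels), reject_column])

confident = np.array([0.9, 0.05, 0.05])
uncertain = np.array([0.4, 0.35, 0.25])

action_plain, _ = bayes_decision(uncertain, zero_one_loss(3))
action_confident, _ = bayes_decision(confident, zero_one_loss_with_reject(3, 0.3))
action_uncertain, losses = bayes_decision(uncertain, zero_one_loss_with_reject(3, 0.3))
```

Under the plain 0-1 loss the chosen action is the most probable label (index 0 for `uncertain`). With the reject column, action index 3 means “don’t know”: it is chosen for `uncertain` because the largest class probability, 0.4, is below `1 - reject_cost = 0.7`, while `confident` is still classified as label 0.
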

¹ MacKay. Information Theory, Inference, and Learning Algorithms. 2003
² Bishop. Pattern Recognition and Machine Learning. 2006