Readings on Machine Learning: Bayesian Statistics
The following material is based on Chapter 05 of Machine Learning: A Probabilistic Perspective, with some examples from (MacKay, 2003)^{1} and (Bishop, 2006)^{2}.
Discussion highlights
 Posterior distribution (page 149): Using the posterior distribution \( p(\theta\mid\mathcal{D}) \) to summarize everything we know about a set of unknown variables is at the core of Bayesian statistics.
 Prior of \( \theta \): \( p(\theta) \)
 Likelihood of \( \theta \): \( p(\mathcal{D}\mid\theta) \)
 Posterior: \[ p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}\mid\theta)p(\theta)}{\int_{\theta} p(\mathcal{D}\mid\theta)p(\theta) d\theta} \]
 The challenge of estimating \( p(\theta\mid\mathcal{D}) \) is the computation of \( \int_{\theta} p(\mathcal{D}\mid\theta)~p(\theta) d\theta \)
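For conjugate prior-likelihood pairs the normalizing integral has a closed form and the challenge disappears. A minimal sketch with a Beta prior and Bernoulli likelihood (the counts and hyperparameters below are made-up illustrations, not from the text):

```python
# Sketch: with a Bernoulli likelihood and a conjugate Beta(a, b) prior,
# the integral over theta has a closed form, so the posterior is simply
# Beta(a + heads, b + tails) with no numerical integration required.
def beta_bernoulli_posterior(heads, tails, a=1.0, b=1.0):
    """Return the parameters of the Beta posterior p(theta | D)."""
    return a + heads, b + tails

a_post, b_post = beta_bernoulli_posterior(heads=7, tails=3)
posterior_mean = a_post / (a_post + b_post)  # E[theta | D] = 8/12
print(a_post, b_post, posterior_mean)
```

For non-conjugate models the integral must instead be approximated, e.g., by sampling or variational methods.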
 Maximum a posteriori estimation (pages 149-150): the MAP estimate identifies the mode of the posterior
\[ \hat{\theta} = \text{argmax}_{\theta} p(\theta\mid\mathcal{D}) = \text{argmax}_{\theta} p(\mathcal{D}\mid\theta) p(\theta) \]
 Why efficient algorithms exist: (1) essentially, this is an optimization problem; (2) there is no need to compute \( p(\mathcal{D}) \), since it does not depend on \( \theta \) and drops out of the argmax
 The connection with regularization. For example, in linear regression, \( \ell_2 \) regularization is equivalent to MAP estimation in Bayesian linear regression with a zero-mean Gaussian prior on \( \theta \), \( \mathcal{N}(\theta\mid 0, \sigma^2 I) \).
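The correspondence can be made concrete: with Gaussian noise of variance \( \sigma^2 \) and prior variance \( \tau^2 \), the MAP estimate equals the ridge solution with penalty \( \lambda = \sigma^2/\tau^2 \). A minimal sketch on synthetic data (the variances, weights, and sample size are illustrative assumptions):

```python
import numpy as np

# Sketch of the MAP <-> l2-regularization correspondence for linear
# regression: Gaussian noise (variance sigma2) plus a zero-mean Gaussian
# prior N(0, tau2 * I) on the weights gives ridge with lam = sigma2/tau2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])          # made-up ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy observations

sigma2, tau2 = 0.1 ** 2, 1.0
lam = sigma2 / tau2  # equivalent ridge penalty

# MAP estimate = ridge solution: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)
```

A stronger prior (smaller \( \tau^2 \)) shrinks the weights more, exactly as a larger ridge penalty would.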
 Issues of MAP: (1) no measure of uncertainty; (2) without uncertainty in the model parameters, predictive distributions can be overconfident; (3) the mode can be an untypical point: the mode is a point of measure zero, whereas the mean and median take the volume of the space into account.
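Issue (2) shows up even in the simplest setting. A sketch with one coin toss and a uniform Beta(1,1) prior (an illustrative example, not from the text): the MAP plug-in predictive is certain of heads, while the full Bayesian predictive hedges.

```python
# Sketch: after observing 1 head in 1 toss under a Beta(1, 1) prior,
# the posterior is Beta(2, 1).  Plugging in the MAP estimate (the mode)
# predicts heads with probability 1, while the full Bayesian predictive
# (the posterior mean, i.e., Laplace's rule of succession) gives 2/3.
heads, n = 1, 1
a, b = 1 + heads, 1 + (n - heads)    # Beta posterior parameters
theta_map = (a - 1) / (a + b - 2)    # posterior mode
p_plugin = theta_map                 # plug-in predictive: overconfident
p_bayes = a / (a + b)                # posterior-mean predictive
print(p_plugin, p_bayes)
```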
 Model selection (page 155):
 Model selection problem: when faced with a set of models (i.e., families of parametric distributions) of different complexity, how should we choose the best one?
 Example: fitting data sampled from the function \( \sin(2\pi x) \) with polynomial functions of degree \( M \); polynomials of high complexity (e.g., \( M=9 \)) have the potential to fit more complex data but can exhibit overfitting on simple datasets (figure reprinted from Chapter 01 of (Bishop, 2006)^{2})
 Cross validation: estimate each model's error with \( K \)-fold cross validation on held-out data
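The cross-validation procedure can be sketched for the polynomial example above; the sample size, noise level, and candidate degrees below are arbitrary choices for illustration:

```python
import numpy as np

# Sketch of K-fold cross-validation for choosing a polynomial degree M
# when fitting noisy samples of sin(2*pi*x).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

def kfold_mse(x, y, degree, K=5):
    """Average held-out mean-squared error over K folds."""
    folds = np.array_split(np.arange(len(x)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

scores = {M: kfold_mse(x, y, M) for M in (1, 3, 9)}
print(scores)  # degree 1 underfits, so its held-out error is largest
```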
 Bayesian model selection: \[ p(m\mid \mathcal{D})\propto p(\mathcal{D}\mid m) p(m); \] with a uniform prior \( p(m)\propto 1 \), model selection can be done with only \[ p(\mathcal{D}\mid m) = \int_{\theta} p(\mathcal{D}\mid\theta)p(\theta\mid m)\,d\theta, \] which is called the evidence for model \( m \).
 Justification of Bayesian model selection: Consider two models \( m_1 \) and \( m_2 \), where model \( m_2 \) has higher complexity than \( m_1 \). For the same data \( \mathcal{D} \), if both \( m_1 \) and \( m_2 \) can fit \( \mathcal{D} \), Bayesian model selection will suggest picking \( m_1 \): the more complex model must spread its predictive probability mass over many possible datasets, so it assigns less evidence to any one simple dataset (the Bayesian Occam's razor), as illustrated in the plot reprinted from Chapter 28 of (MacKay, 2003)^{1}, where \( H_1 \) and \( H_2 \) represent the models
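A closed-form illustration of this effect for coin-toss data: compare a simple model with \( \theta \) fixed at \( 0.5 \) against a flexible model with a uniform prior on \( \theta \). The two models and the counts are illustrative assumptions, not from the text.

```python
import math

# Sketch of evidence-based model comparison for h heads in n tosses.
#   m1: theta fixed at 0.5         ->  p(D | m1) = 0.5 ** n
#   m2: theta ~ Uniform(0, 1)      ->  p(D | m2) = B(h + 1, n - h + 1),
# since the integral of theta^h (1 - theta)^(n - h) is the Beta function.
def evidence_m1(h, n):
    return 0.5 ** n

def evidence_m2(h, n):
    # B(h + 1, n - h + 1) via log-gamma for numerical stability
    return math.exp(math.lgamma(h + 1) + math.lgamma(n - h + 1)
                    - math.lgamma(n + 2))

# Balanced data: the simpler model m1 has higher evidence.
print(evidence_m1(5, 10) > evidence_m2(5, 10))   # True
# Heavily skewed data: the flexible model m2 has higher evidence.
print(evidence_m1(9, 10) > evidence_m2(9, 10))   # False
```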
 Noninformative priors (page 165):
 If we don’t have strong beliefs about what \( \theta \) should be, it is common to use an uninformative or noninformative prior, and to “let the data speak for itself”.
 If \( \theta \) is discrete and has \( K\) states, then \( p(\theta) = \frac{1}{K} \) is a noninformative prior.
 Bayesian decision theory (page 176)
 Action: \( a\in\mathcal{A} \), where \( \mathcal{A} \) is the action space
 (Ground-truth) label \( y\in\mathcal{Y} \), where \( \mathcal{Y} \) is the label space
 Loss \( L(y,a) \): the loss incurred if we pick action \( a \) when the ground-truth label is \( y \)
 Example: in classification problems, we often assume \( \mathcal{A}=\mathcal{Y} \) and take \( a \) to be the predicted label \( \hat{y} \). The popular 0-1 loss function is defined as \[ L(\hat{y},y) = \begin{cases} 1 & \text{if } \hat{y}\not= y \\ 0 & \text{if } \hat{y}= y \end{cases} \]
 In general, Bayesian decision theory picks the action with the minimal expected loss \[ \delta(x) = \text{argmin}_{a\in\mathcal{A}} E[L(y,a)] \] where \( x \) is the input.
 With a predictive distribution of \( y \) given \( x \), the expected loss in the previous item is defined as \[ E[L(y,a)] = \sum_{y\in\mathcal{Y}} L(y,a)p(y\mid x); \] when we use the 0-1 loss, the decision rule reduces to the simple classification rule \( \delta(x) = \text{argmax}_{y\in\mathcal{Y}} p(y\mid x) \), i.e., picking the most probable label.
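With an asymmetric loss, the minimum-expected-loss action can differ from the most probable label. A sketch with a made-up spam-filtering loss matrix and a made-up posterior (all numbers are illustrative assumptions):

```python
# Sketch of picking the action that minimizes posterior expected loss.
# Deleting legitimate mail is assumed far costlier than keeping spam.
loss = {                                  # loss[y][a]
    "spam": {"keep": 10.0, "delete": 0.0},
    "ham":  {"keep": 0.0,  "delete": 100.0},
}
posterior = {"spam": 0.8, "ham": 0.2}     # p(y | x), assumed given

def bayes_action(posterior, loss):
    """Return argmin over actions of the posterior expected loss."""
    actions = next(iter(loss.values())).keys()
    return min(actions,
               key=lambda a: sum(posterior[y] * loss[y][a] for y in loss))

# Expected losses: keep = 0.8*10 = 8, delete = 0.2*100 = 20,
# so "keep" wins even though spam is the more probable label.
print(bayes_action(posterior, loss))
```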
 Reject option: In classification problems where \( p(y\mid x) \) is very uncertain, we may prefer to choose a reject action, in which we refuse to classify the example as any of the specified classes, and instead say “don’t know”.
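Under 0-1 loss with a fixed reject cost \( \lambda\in(0,1) \), choosing a class costs \( 1-\max_y p(y\mid x) \) in expectation while rejecting always costs \( \lambda \), so the optimal rule rejects whenever \( \max_y p(y\mid x) < 1-\lambda \). A minimal sketch (the reject cost and posteriors are illustrative assumptions):

```python
# Sketch of the reject option: classify only when the posterior is
# confident enough, otherwise answer "reject" at cost lam.
def classify_with_reject(posterior, lam=0.3):
    """Return the best label, or 'reject' if max p(y|x) < 1 - lam."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= 1 - lam else "reject"

print(classify_with_reject({"cat": 0.9, "dog": 0.1}))    # confident: cat
print(classify_with_reject({"cat": 0.55, "dog": 0.45}))  # uncertain: reject
```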

MacKay. Information Theory, Inference, and Learning Algorithms. 2003

Bishop. Pattern Recognition and Machine Learning. 2006