__Chapter 04 Discussion Highlights__

- Page 103: learning as an optimization problem \( \hat{\theta}=\text{argmin}_{\theta}\mathcal{L}(\theta) \)
- Page 103 - 104: maximum (log-)likelihood estimation (MLE)
- The IID assumption - independent and identically distributed
- NLL as the objective function
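The NLL objective follows directly from the IID assumption: the likelihood factorizes over examples, so the objective is a sum of per-example log-probabilities,

```latex
\mathrm{NLL}(\theta)
  = -\log p(\mathcal{D}\mid\theta)
  = -\sum_{n=1}^{N} \log p(y_n \mid \theta),
\qquad
\hat{\theta} = \operatorname*{argmin}_{\theta} \mathrm{NLL}(\theta)
```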

- Page 104 - 105: Justification of MLE
- MLE finds the distribution \( p(\cdot\mid\theta) \) that minimizes the KL divergence to the empirical distribution \( p_{\mathcal{D}} \)
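Expanding the KL divergence makes the equivalence explicit: the entropy term does not depend on \( \theta \), so minimizing the KL divergence reduces to maximizing the average log-likelihood,

```latex
D_{\mathrm{KL}}\left(p_{\mathcal{D}} \,\|\, p(\cdot \mid \theta)\right)
  = \sum_{y} p_{\mathcal{D}}(y) \log \frac{p_{\mathcal{D}}(y)}{p(y \mid \theta)}
  = \underbrace{-\mathbb{H}(p_{\mathcal{D}})}_{\text{const.\ w.r.t.\ } \theta}
    - \frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid \theta)
```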

- Page 107: MLE for the categorical distribution
- Using the Lagrange multipliers to incorporate the constraint \( \sum_{k}\theta_k=1 \)
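The constrained problem can be sketched as follows: with \( N_k \) the count of observations in class \( k \), the Lagrangian and its stationarity condition yield the intuitive frequency estimate,

```latex
\mathcal{L}(\theta, \lambda)
  = \sum_{k} N_k \log \theta_k + \lambda \Bigl(1 - \sum_{k} \theta_k\Bigr),
\qquad
\frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0
\;\Rightarrow\;
\hat{\theta}_k = \frac{N_k}{\sum_{k'} N_{k'}} = \frac{N_k}{N}
```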

- Page 110: MLE for linear regression
- When the observation noise follows a Gaussian distribution, minimizing the mean squared error is equivalent to maximizing the log-likelihood
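A NumPy sketch of this equivalence on toy data (dimensions, noise scale, and seed are illustrative assumptions): the least-squares weights also minimize the Gaussian negative log-likelihood.

```python
import numpy as np

# Toy data: a noisy line, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])  # bias + feature
w_true = np.array([1.0, 2.0])
y = X @ w_true + 0.1 * rng.standard_normal(50)

# Least-squares (MSE-minimizing) solution.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def gaussian_nll(w, sigma2=0.1**2):
    """NLL of y under y_n ~ N(x_n^T w, sigma2); the residual term is
    the squared error scaled by 1/(2 sigma2), plus a w-independent constant."""
    r = y - X @ w
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(r**2) / sigma2

# Perturbing the OLS weights in any direction increases the Gaussian NLL.
print(gaussian_nll(w_ols), gaussian_nll(w_ols + np.array([0.05, 0.0])))
```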

- Page 111: empirical risk minimization (ERM)
- For classification problems, the empirical 0-1 loss (risk) is non-differentiable
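A quick numerical illustration (toy 1-D data and a hypothetical threshold classifier): the empirical 0-1 risk is a step function of the parameter, so its gradient is zero almost everywhere and undefined at the jumps, which rules out gradient-based minimization.

```python
import numpy as np

# Toy 1-D classification data: predict sign(x - b) for a threshold b.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, 1, 1, 1, -1])  # note one point breaks separability

def empirical_01_risk(b):
    preds = np.sign(x - b)
    return np.mean(preds != y)

# Sweeping b shows the risk takes only a handful of distinct values:
# a piecewise-constant (step) function of the parameter.
risks = [empirical_01_risk(b) for b in np.linspace(-3, 3, 601)]
print(sorted(set(risks)))
```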

- Page 112: surrogate loss
- Because the 0-1 loss is non-differentiable, we need surrogate loss functions that satisfy two conditions: (1) it should be a tight convex upper bound on the original loss; (2) it should be easy to minimize
- Hinge loss: an aggressive loss function with a margin — even correctly classified points with margin less than 1 are penalized — as shown in Figure 4.2
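A minimal check of the upper-bound property, writing both losses as functions of the margin \( z = y f(x) \) (the grid is an arbitrary choice):

```python
import numpy as np

z = np.linspace(-2.0, 3.0, 501)       # margin z = y * f(x)
loss01 = (z <= 0).astype(float)       # 0-1 loss: error iff margin <= 0
hinge = np.maximum(0.0, 1.0 - z)      # hinge loss: max(0, 1 - z)

# The hinge loss is convex, upper-bounds the 0-1 loss everywhere, and
# is tight (both zero) once the margin reaches 1.
print(float(np.min(hinge - loss01)))
```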

- Page 116: definition of overfitting
- the learning procedure picks parameters by minimizing the empirical loss (or a surrogate) on the training set, which may not minimize the loss on future data
- the root cause is the gap between the empirical distribution and the true data distribution
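A classic illustration of this definition (the polynomial degree, noise scale, and seed are arbitrary choices): a degree-9 polynomial drives the empirical loss on 10 training points to essentially zero, yet does worse on fresh samples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)

# A degree-9 polynomial interpolates the 10 training points almost exactly,
# so the empirical (training) loss is ~0 ...
coeffs = np.polyfit(x_train, y_train, deg=9)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ... but that does not minimize the loss on unseen data.
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(10)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_mse, test_mse)
```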

- Page 116 - 117: regularization — add a penalty term on the loss function
- Example: introducing a prior distribution on \( \theta \), which is equivalent to maximum a posteriori (MAP) estimation
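With a prior \( p(\theta) \), the MAP estimate adds a log-prior term to the log-likelihood; a zero-mean Gaussian prior turns this term into an \( \ell_2 \) penalty,

```latex
\hat{\theta}_{\text{map}}
  = \operatorname*{argmax}_{\theta}
    \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right],
\qquad
p(\theta) = \mathcal{N}(\theta \mid 0, \tau^2 I)
\;\Rightarrow\;
\log p(\theta) = -\frac{\|\theta\|_2^2}{2\tau^2} + \text{const}
```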

- Page 119 - 120: \( \ell_2 \) regularization and weight decay
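A NumPy sketch connecting the two views (data, regularization strength, and learning rate are illustrative assumptions): the ridge closed form shrinks the weights relative to OLS, and one gradient step on the \( \ell_2 \)-penalized loss equals the "weight decay" update.

```python
import numpy as np

# Toy regression data with a fixed seed.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
lam, eta = 0.1, 0.01  # l2 strength and learning rate (assumed values)

# Ridge (l2-regularized) closed form shrinks the weights relative to OLS.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def grad_mse(w):
    """Gradient of the mean squared error (2/n) * X^T (Xw - y)."""
    return 2 * X.T @ (X @ w - y) / len(y)

# One gradient step on the penalized loss  L(w) + (lam/2) * ||w||^2  is the
# same as shrinking w by (1 - eta*lam) and stepping on L alone: weight decay.
w = rng.standard_normal(3)
step_regularized = w - eta * (grad_mse(w) + lam * w)
step_decay = (1 - eta * lam) * w - eta * grad_mse(w)
print(np.allclose(step_regularized, step_decay))
```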