Discussion highlights

• Page 103: learning as an optimization problem $$\hat{\theta}=\text{argmin}_{\theta}\mathcal{L}(\theta)$$
• Page 103 - 104: maximum (log-)likelihood estimation (MLE)
• The IID assumption - independent and identically distributed
• NLL as the objective function
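The NLL objective can be illustrated with a minimal sketch (hypothetical Bernoulli data; the grid search stands in for a generic optimizer): under the IID assumption the NLL is a sum of per-example terms, and its minimizer recovers the closed-form MLE, the sample mean.

```python
import numpy as np

# Hedged sketch: MLE for a Bernoulli parameter by minimizing the
# negative log-likelihood (NLL) over a grid. Under IID data the NLL
# is a sum of per-example negative log-probabilities.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical coin flips

thetas = np.linspace(0.01, 0.99, 981)
# NLL(theta) = -sum_i [ y_i log(theta) + (1 - y_i) log(1 - theta) ]
nll = -(data.sum() * np.log(thetas)
        + (len(data) - data.sum()) * np.log(1 - thetas))
theta_hat = thetas[np.argmin(nll)]

print(theta_hat)  # close to the closed-form MLE, data.mean() = 0.75
```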
• Page 104 - 105: Justification of MLE
• Find a distribution that can minimize the KL divergence with the empirical distribution $$p_{\mathcal{D}}$$
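The link from KL minimization to MLE can be made explicit. Taking $$p_{\mathcal{D}}$$ to be the empirical distribution over the $$N$$ training points, the KL divergence decomposes as:

```latex
\mathrm{KL}\big(p_{\mathcal{D}} \,\|\, p_{\theta}\big)
  = \underbrace{-\,\mathbb{H}(p_{\mathcal{D}})}_{\text{constant in }\theta}
  \;-\; \frac{1}{N}\sum_{n=1}^{N} \log p_{\theta}(x_n)
```

Since the entropy term does not depend on $$\theta$$, minimizing the KL divergence is the same as minimizing the average NLL, i.e. performing MLE.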
• Page 107: MLE for the categorical distribution
• Using the Lagrange multipliers to incorporate the constraint $$\sum_{k}\theta_k=1$$
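The Lagrange-multiplier solution reduces to normalized counts, $$\hat{\theta}_k = N_k / N$$. A minimal numerical check on hypothetical labels:

```python
import numpy as np

# Hedged sketch: the categorical MLE is the vector of normalized
# counts, theta_k = N_k / N, which automatically satisfies the
# constraint sum_k theta_k = 1 enforced by the Lagrange multiplier.
data = np.array([0, 2, 1, 0, 0, 2, 1, 0])  # hypothetical labels in {0,1,2}
counts = np.bincount(data, minlength=3)     # N_k for each class k
theta_mle = counts / counts.sum()

print(theta_mle)  # [0.5, 0.25, 0.25]; sums to 1
```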
• Page 110: MLE for linear regression
• When the observation noise follows a Gaussian distribution, minimizing the mean squared error is equivalent to maximizing the log-likelihood
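The equivalence can be checked directly on hypothetical 1-D data: with fixed-variance Gaussian noise, the log-likelihood of a candidate slope differs from the negative MSE only by constants, so both criteria select the same weight.

```python
import numpy as np

# Hedged sketch: for linear regression with Gaussian noise of fixed
# variance, log-likelihood = -0.5 * sum of squared residuals + const,
# so minimizing MSE and maximizing log-likelihood pick the same w.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)  # hypothetical data, slope 2

ws = np.linspace(0.0, 4.0, 4001)             # candidate slopes
residuals = y[None, :] - ws[:, None] * x[None, :]
mse = (residuals ** 2).mean(axis=1)
loglik = -0.5 * (residuals ** 2).sum(axis=1)  # sigma = 1, constants dropped

w_mse = ws[np.argmin(mse)]
w_mle = ws[np.argmax(loglik)]
print(w_mse)  # same grid point under both criteria, near the true slope 2
```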
• Page 111: empirical risk minimization (ERM)
• For classification problems, the empirical 0-1 loss (risk) is non-differentiable
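A tiny sketch (hypothetical 1-D data, threshold classifier) makes the non-differentiability concrete: the empirical 0-1 risk is piecewise constant in the parameter, so its gradient is zero almost everywhere and gradient-based minimization stalls.

```python
import numpy as np

# Hedged sketch: empirical 0-1 risk of the classifier sign(w * x).
# The risk only changes when a point crosses the decision boundary,
# so it is piecewise constant in w (gradient zero almost everywhere).
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, 1])  # hypothetical labels

def zero_one_risk(w):
    preds = np.sign(w * x)
    return np.mean(preds != y)

print(zero_one_risk(1.0))   # 0.0: every point correctly classified
print(zero_one_risk(-1.0))  # 1.0: every prediction flipped
```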
• Page 112: surrogate loss
• Because the 0-1 loss is non-differentiable, we need surrogate loss functions that satisfy two conditions: (1) it should be a tight convex upper bound of the original loss; (2) it should be easy to minimize
• Hinge loss: an aggressive loss function with a margin, as shown in Figure 4.2
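The two properties can be verified numerically. Writing the margin as $$z = y f(x)$$, the hinge loss $$\max(0, 1 - z)$$ is convex, upper-bounds the 0-1 loss $$\mathbb{I}[z \le 0]$$, and (the "aggressive" part) still penalizes correctly classified points whose margin is below 1:

```python
import numpy as np

# Hedged sketch: compare the 0-1 loss I[z <= 0] with the hinge loss
# max(0, 1 - z) over a grid of margins z = y * f(x).
z = np.linspace(-2.0, 2.0, 401)          # candidate margins
zero_one = (z <= 0).astype(float)
hinge = np.maximum(0.0, 1.0 - z)

assert np.all(hinge >= zero_one)         # convex upper bound everywhere
loss_at_half = hinge[np.argmin(np.abs(z - 0.5))]
print(loss_at_half)  # about 0.5: correct prediction, but inside the margin
```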
• Page 116: definition of overfitting
• the learning procedure picks parameters by minimizing the empirical loss (or its surrogates) on a training set, which may not minimize the loss on future data
• the difference between empirical distribution and true distribution
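Both points above can be seen in a small sketch (hypothetical data, fixed seed): a flexible model can drive the empirical loss on a small training set to nearly zero, yet this need not lower the loss on fresh samples from the same distribution.

```python
import numpy as np

# Hedged sketch of overfitting: a degree-9 polynomial can interpolate
# 10 noisy training points (training MSE near zero), while a degree-1
# fit matches the true model class.
rng = np.random.default_rng(1)
def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, x + 0.3 * rng.standard_normal(n)  # linear truth + noise

x_tr, y_tr = sample(10)     # small training set
x_te, y_te = sample(1000)   # stand-in for "future data"

def fit_and_score(deg):
    coeffs = np.polyfit(x_tr, y_tr, deg)        # least-squares fit
    train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return train, test

train1, test1 = fit_and_score(1)   # degree matched to the truth
train9, test9 = fit_and_score(9)   # interpolates the training set
print(train9 < train1)             # lower empirical loss ...
print(test1, test9)                # ... typically much higher test loss
```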
• Page 116 - 117: regularization — add a penalty term on the loss function
• Example: introducing a prior distribution on $$\theta$$, which is equivalent to maximum a posteriori (MAP) estimation
• Page 119 - 120: $$\ell_2$$ regularization and weight decay
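A minimal sketch ties the last two points together (hypothetical data and weights): adding the penalty $$\lambda\|w\|_2^2$$ to the squared error gives ridge regression, whose closed form $$(X^\top X + \lambda I)^{-1} X^\top y$$ is the MAP estimate under a zero-mean Gaussian prior on $$w$$, and larger $$\lambda$$ shrinks the weights toward zero.

```python
import numpy as np

# Hedged sketch: l2-regularized least squares (ridge regression).
# Closed form: w = (X^T X + lambda * I)^{-1} X^T y.
# lambda = 0 recovers plain MLE; lambda > 0 is MAP with Gaussian prior.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical true weights
y = X @ w_true + 0.1 * rng.standard_normal(100)

def ridge(lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_mle = ridge(0.0)    # unregularized least squares
w_map = ridge(10.0)   # l2-regularized (MAP) estimate
print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # shrinkage
```

The name weight decay comes from the gradient view of the same penalty: its gradient is proportional to $$w$$, so each SGD step shrinks the current weights by a constant factor before applying the data gradient.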