Discussion highlights

  • Page 103: learning as an optimization problem \( \hat{\theta}=\text{argmin}_{\theta}\mathcal{L}(\theta) \)
  • Page 103 - 104: maximum (log-)likelihood estimation (MLE)
    • The IID assumption - independent and identically distributed
    • NLL (negative log-likelihood) as the objective function (see the NLL sketch below)
  • Page 104 - 105: Justification of MLE
    • MLE finds the distribution that minimizes the KL divergence to the empirical distribution \( p_{\mathcal{D}} \) (derivation sketched below)
  • Page 107: MLE for the categorical distribution
    • Using Lagrange multipliers to incorporate the constraint \( \sum_{k}\theta_k=1 \) (derivation sketched below)
  • Page 110: MLE for linear regression
    • When the observation noise follows a Gaussian distribution, minimizing the mean squared error is equivalent to maximizing the log-likelihood (see the sketch below)
  • Page 111: empirical risk minimization (ERM)
    • For classification problems, the empirical 0-1 loss (risk) is non-differentiable
  • Page 112: surrogate loss
    • Because the 0-1 loss is non-differentiable, we need surrogate loss functions that satisfy two conditions: (1) the surrogate should be a tight convex upper bound of the original loss; (2) it should be easy to minimize
    • Hinge loss: an aggressive loss function with a margin, as shown in Figure 4.2 (compared with the 0-1 loss in the sketch below)
  • Page 116: definition of overfitting
    • the learning procedure picks parameters by minimizing the empirical loss (or its surrogates) on a training set, which may not minimize the loss on future data
    • the root cause is the difference between the empirical distribution and the true (data-generating) distribution
  • Page 116 - 117: regularization — add a penalty term on the loss function
    • Example: introducing a prior distribution on \( \theta \), which is equivalent to maximum a posteriori (MAP) estimation (derivation sketched below)
  • Page 119 - 120: \( \ell_2 \) regularization and weight decay (see the weight-decay sketch below)
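
A minimal sketch (my own toy example, not code from the book) of MLE as NLL minimization for the items on pages 103 - 104: under the IID assumption the NLL is a sum over samples, and a generic numerical minimizer should recover the closed-form Gaussian MLE (sample mean and the \( 1/N \) sample standard deviation). The function name `nll` and the synthetic data are illustrative.

```python
# Sketch: MLE as NLL minimization for a univariate Gaussian (toy data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # IID toy data

def nll(params):
    mu, log_sigma = params                        # parameterize sigma on a log scale
    sigma = np.exp(log_sigma)
    # NLL(theta) = -sum_n log N(x_n | mu, sigma^2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

theta_hat = minimize(nll, x0=np.array([0.0, 0.0])).x
print("numerical MLE:", theta_hat[0], np.exp(theta_hat[1]))
print("closed form  :", x.mean(), x.std())        # np.std uses the (1/N) MLE variance
```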
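
Sketch of the justification on pages 104 - 105 (standard argument, paraphrased): writing the empirical distribution as \( p_{\mathcal{D}}(x)=\frac{1}{N}\sum_{n}\delta(x-x_n) \), the KL divergence splits into a term that does not depend on \( \theta \) plus the scaled NLL, so minimizing the KL divergence is the same as minimizing the NLL:

\[
\mathrm{KL}\left(p_{\mathcal{D}} \,\|\, p(\cdot\mid\theta)\right)
= \underbrace{\sum_{x} p_{\mathcal{D}}(x)\log p_{\mathcal{D}}(x)}_{\text{const. in }\theta}
+ \frac{1}{N}\,\mathrm{NLL}(\theta),
\qquad
\mathrm{NLL}(\theta) = -\sum_{n=1}^{N}\log p(x_n\mid\theta),
\]

hence \( \operatorname{argmin}_{\theta}\mathrm{KL}\left(p_{\mathcal{D}} \,\|\, p(\cdot\mid\theta)\right) = \operatorname{argmin}_{\theta}\mathrm{NLL}(\theta) \), i.e. the MLE.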
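
Sketch for the categorical MLE on page 107 (standard derivation; here \( N_k \) denotes the count of category \( k \) among the \( N \) observations): maximize \( \sum_k N_k\log\theta_k \) subject to \( \sum_k\theta_k=1 \) via the Lagrangian \( L(\theta,\lambda) \):

\[
L(\theta,\lambda) = \sum_{k} N_k \log\theta_k + \lambda\Big(1 - \sum_{k}\theta_k\Big),
\qquad
\frac{\partial L}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \theta_k = \frac{N_k}{\lambda}.
\]

Enforcing \( \sum_k\theta_k=1 \) gives \( \lambda = \sum_k N_k = N \), so \( \hat{\theta}_k = N_k/N \), the empirical frequencies.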
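
A minimal sketch (my own toy data, with an assumed known noise scale `sigma`) for the linear-regression item on page 110: the Gaussian NLL is a constant plus the sum of squared residuals scaled by \( 1/(2\sigma^2) \), so its minimizer over the weights coincides with the least-squares solution.

```python
# Sketch: Gaussian-noise MLE for linear regression equals least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ w_true + sigma * rng.normal(size=N)      # y = Xw + Gaussian noise

def nll(w):
    resid = y - X @ w
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

w_mle = minimize(nll, x0=np.zeros(D)).x          # minimizes the NLL
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # minimizes the squared error
print(np.allclose(w_mle, w_ols, atol=1e-4))      # expected: True (same minimizer)
```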
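
A minimal sketch for the 0-1 loss and surrogate-loss items on pages 111 - 112 (labels assumed to be \( \pm 1 \), margin \( z = y\,f(x) \); the grid of margins is illustrative): the hinge loss is a convex upper bound on the non-differentiable 0-1 loss, and the log loss is another smooth surrogate.

```python
# Sketch: surrogate losses as functions of the margin z = y * f(x).
import numpy as np

z = np.linspace(-2.0, 2.0, 9)                    # margins y * f(x)
zero_one = (z <= 0).astype(float)                # non-differentiable step
hinge    = np.maximum(0.0, 1.0 - z)              # convex, upper-bounds the 0-1 loss
log_loss = np.log2(1.0 + np.exp(-z))             # smooth convex surrogate

assert np.all(hinge >= zero_one)                 # upper-bound property
for row in zip(z, zero_one, hinge, log_loss):
    print("z=%+.2f  0-1=%.0f  hinge=%.2f  log=%.2f" % row)
```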
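
Sketch for the regularization/MAP item on pages 116 - 117 (standard argument, assuming a zero-mean Gaussian prior with variance \( \tau^2 \)):

\[
\hat{\theta}_{\text{MAP}}
= \operatorname{argmax}_{\theta}\big[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\big]
= \operatorname{argmin}_{\theta}\big[\mathrm{NLL}(\theta) - \log p(\theta)\big],
\]

and with \( p(\theta)=\mathcal{N}(\theta\mid 0,\tau^{2}I) \) the penalty is \( -\log p(\theta)=\frac{1}{2\tau^{2}}\lVert\theta\rVert_2^{2}+\text{const} \), i.e. an \( \ell_2 \) penalty whose strength is set by the prior precision \( 1/\tau^{2} \).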
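
A minimal weight-decay sketch for the item on pages 119 - 120 (my own toy example; the step size `lr` and penalty `lam` are hypothetical): for plain gradient descent, one step on the \( \ell_2 \)-regularized loss equals shrinking the weights by \( 1-\eta\lambda \) and then stepping on the unregularized loss. The equivalence holds for vanilla (S)GD; adaptive optimizers such as Adam break it.

```python
# Sketch: explicit l2 penalty vs. weight decay for one gradient-descent step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)
lr, lam = 0.1, 0.01

def grad_mse(w):
    return 2.0 * X.T @ (X @ w - y) / len(y)      # gradient of the data loss only

# (a) explicit l2 penalty added to the gradient
w_a = w - lr * (grad_mse(w) + lam * w)
# (b) weight decay: shrink the weights first, then step on the data loss
w_b = (1.0 - lr * lam) * w - lr * grad_mse(w)

print(np.allclose(w_a, w_b))                     # expected: True
```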