(13th-June-2020)
• Overfitting can occur when the training data contains regularities that do not hold in the test data, and the learner exploits those regularities for prediction.
Maximum A Posteriori Probability and Minimum Description Length
One way to trade off model complexity and fit to the data is to choose the model that is most likely, given the data. That is, choose the model that maximizes the probability of the model given the data, P(model|data). The model that maximizes P(model|data) is called the maximum a posteriori probability model, or the MAP model.
The probability of a model (or a hypothesis) given some data is obtained by using Bayes' rule:
P(model|data) = (P(data|model)×P(model))/(P(data)) .     (7.5.1)
The likelihood, P(data|model), is the probability that this model would have produced this data set; it is high when the model is a good fit to the data, and low when the model would have predicted different data. The prior, P(model), encodes the learning bias: it specifies which models are a priori more likely, and it is what biases the learning toward simpler models, which are typically given higher prior probability. The denominator, P(data), is a normalizing constant that makes the posterior probabilities sum to 1.
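As a concrete sketch of Bayes' rule at work, the following made-up example scores two hypothetical coin-bias models against an observed sequence of flips. The models, priors, and data are invented for illustration only; the computation follows the formula above.

```python
# Candidate models: each is the probability a flip lands heads.
models = {"fair": 0.5, "biased": 0.8}

# Prior P(model): the simpler "fair" model is a priori more likely.
prior = {"fair": 0.7, "biased": 0.3}

# Observed data: 8 heads and 2 tails (illustrative).
heads, tails = 8, 2

def likelihood(p_heads):
    """P(data | model): probability the model assigns to the observed flips."""
    return p_heads ** heads * (1 - p_heads) ** tails

# Numerator of Bayes' rule: P(data | model) * P(model).
joint = {m: likelihood(p) * prior[m] for m, p in models.items()}

# Denominator P(data): normalizing constant summed over all models.
p_data = sum(joint.values())

# Posterior P(model | data).
posterior = {m: j / p_data for m, j in joint.items()}

map_model = max(posterior, key=posterior.get)
```

Here the data (8 heads in 10 flips) fits the biased model so much better that it overcomes the fair model's higher prior, so the MAP model is "biased".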
Because the denominator of Equation (7.5.1) is independent of the model, it can be ignored when choosing the most likely model. Thus, the MAP model is the model that maximizes
P(data|model)×P(model) .
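Taking negative base-2 logarithms connects this to the minimum description length principle named in the section heading: maximizing P(data|model)×P(model) is equivalent to minimizing −log₂ P(data|model) − log₂ P(model), the number of bits needed to describe the data given the model plus the bits to describe the model. The sketch below checks this equivalence on made-up unnormalized scores (the model names and numbers are hypothetical):

```python
import math

# Hypothetical unnormalized scores P(data|model) * P(model)
# for three made-up candidate models.
scores = {"m1": 0.004, "m2": 0.012, "m3": 0.0005}

# MAP: maximize the product P(data|model) * P(model).
map_model = max(scores, key=scores.get)

# MDL view: minimize the description length in bits,
# -log2 P(data|model) - log2 P(model) = -log2(product).
description_length = {m: -math.log2(s) for m, s in scores.items()}
mdl_model = min(description_length, key=description_length.get)
```

Because −log₂ is strictly decreasing, the model with the largest product always has the shortest description length, so the two criteria select the same model.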