(26th-May-2020)
For many of the prediction measures, the optimal prediction on the training data is the empirical frequency. Thus, making a point estimate can be interpreted as learning a probability. However, the empirical frequency is typically not a good estimate of the probability of new cases; just because an agent has not observed some value of a variable does not mean that the value should be assigned a probability of zero. A probability of zero means that the value is impossible.
We rarely have data without any prior knowledge. There is typically a great deal of knowledge about a domain, either in the meaning of the symbols or in experience with similar examples, that can be used to improve predictions.
A standard way both to solve the zero-probability problem and to take prior knowledge into account is to use a pseudocount or prior count for each value, which is added to the counts observed in the training data.
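As a minimal sketch of this idea in Python (the function name and the choice of a single shared pseudocount for all values are illustrative assumptions, not from the original; pseudocount=1 gives Laplace smoothing):

```python
from collections import Counter

def estimate_distribution(observations, values, pseudocount=1.0):
    """Estimate P(Y) from observed values, adding a pseudocount to each value.

    No value is assigned probability zero, even if it never appears
    in the training data.
    """
    counts = Counter(observations)
    total = len(observations) + pseudocount * len(values)
    return {v: (counts[v] + pseudocount) / total for v in values}

# Example: 'c' is never observed, but still gets nonzero probability.
print(estimate_distribution(['a', 'a', 'b'], values=['a', 'b', 'c']))
# {'a': 0.5, 'b': 0.333..., 'c': 0.166...}
```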
• Conditional Probability Distribution
The same idea can be used to learn a conditional probability distribution. To estimate a conditional distribution, P(Y|X), of variable Y conditioned on variable(s) X, the agent can maintain a count for each pair of a value for Y and a value for X. Suppose cij is a non-negative number that will be used as a pseudocount for Y=yi ∧ X=xj, and nij is the number of observed cases of Y=yi ∧ X=xj. The agent can use
P(Y=yi|X=xj) = (cij + nij) / ∑i' (ci'j + ni'j),
but this does not work well when the denominator is small, which occurs when some values of X are rare. When X has structure - for example, when it is composed of other variables - some assignments to X may be very rare or may not appear in the training data at all.
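A sketch of the conditional version, with illustrative names and a single shared pseudocount standing in for the per-pair cij (an assumption for brevity; in general each (yi, xj) pair can have its own count):

```python
from collections import Counter

def estimate_conditional(pairs, y_values, pseudocount=1.0):
    """Estimate P(Y | X=x) from (x, y) training pairs.

    The pseudocount is added to the observed count nij for every
    (value of Y, value of X) pair, so a rarely observed value of X
    still yields a proper distribution over Y.
    """
    counts = Counter(pairs)  # nij: observed (x, y) counts
    xs = {x for x, _ in pairs}
    cpd = {}
    for x in xs:
        denom = sum(counts[(x, y)] + pseudocount for y in y_values)
        cpd[x] = {y: (counts[(x, y)] + pseudocount) / denom for y in y_values}
    return cpd

pairs = [('sunny', 'play'), ('sunny', 'play'), ('rain', 'stay')]
print(estimate_conditional(pairs, y_values=['play', 'stay']))
# x='rain' was seen only once, yet P(play|rain) = 1/3 rather than 0.
```

Note that when x is rare the pseudocounts dominate the denominator, which is exactly the failure mode described above: the estimate falls back toward the prior rather than the data.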