(1st-June-2020)
A linear function does not work well for classification tasks. When there are only two values, say 0 and 1, a learner should never predict a value greater than 1 or less than 0. However, a linear function could make a prediction of, say, 3 for one example just to fit other examples better.
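The out-of-range behaviour is easy to reproduce on a tiny data set. The sketch below fits a straight line to four made-up examples with a 0/1 target; the data and the choice of ordinary least squares as the fitting method are assumptions for illustration, not something taken from the text.

```python
# Hypothetical toy data: one input feature x, binary target y in {0, 1}.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w0 + w1 * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

w0, w1 = fit_line(xs, ys)
predictions = [w0 + w1 * x for x in xs]
for x, p in zip(xs, predictions):
    print(f"x={x:.0f}  prediction={p:.2f}")
```

On this data the fitted line predicts a value below 0 for the smallest input and a value above 1 for the largest, even though every target is 0 or 1.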
Initially let's consider binary classification, where the domain of the target variable is {0,1}. If multiple binary target variables exist, they can be learned separately.
For classification, we often use a squashed linear function of the form
$$f^{w}(X_1, \ldots, X_n) = f(w_0 + w_1 \times X_1 + \cdots + w_n \times X_n),$$
where f is an activation function, which is a function from real numbers into [0,1]. Using a squashed linear function to predict a value for the target feature means that the prediction for example e for target feature Y is
$$\mathit{pval}^{w}(e, Y) = f(w_0 + w_1 \times \mathit{val}(e, X_1) + \cdots + w_n \times \mathit{val}(e, X_n)).$$
A simple activation function is the step function, f(x), defined by
$$f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0. \end{cases}$$
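As a minimal sketch, the squashed linear prediction with the step function as its activation might be written as follows. The function names `step` and `predict`, the list layout of the weights, and the example numbers are all assumptions made for illustration.

```python
def step(x):
    """Step activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def predict(weights, features, f=step):
    """Squashed linear prediction f(w0 + w1*X1 + ... + wn*Xn).

    weights = [w0, w1, ..., wn]; features = [X1, ..., Xn].
    """
    if len(weights) != len(features) + 1:
        raise ValueError("expected one weight per feature plus the bias w0")
    linear = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return f(linear)

# Hypothetical weights and example values, chosen only for illustration.
weights = [-1.0, 0.5, 2.0]
print(predict(weights, [1.0, 1.0]))  # linear part is 1.5, so the step gives 1
print(predict(weights, [1.0, 0.0]))  # linear part is -0.5, so the step gives 0
```

Passing the activation as a parameter keeps the linear part separate from the squashing, so another function from the real numbers into [0,1] could be substituted for `step`.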
A step function was the basis for the perceptron [Rosenblatt (1958)], which was one of the first methods developed for learning. It is difficult to adapt gradient descent to step functions because gradient descent relies on derivatives: a step function is not differentiable at 0, and its derivative is 0 everywhere else, so the gradient gives no indication of how the weights should change.
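The problem with derivatives can be checked numerically. The sketch below estimates the slope of the step function with a central difference; the helper name `central_difference` and the sample points are assumptions chosen for illustration.

```python
def step(x):
    """Step activation: 1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def central_difference(f, x, h=1e-6):
    """Numerical estimate of f'(x) using a symmetric difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Away from 0 the step function is flat, so the estimated slope is 0.
slopes = [central_difference(step, x) for x in (-2.0, -0.5, 0.5, 2.0)]
print(slopes)

# At 0 the estimate is 1 / (2h): it keeps growing as h shrinks
# instead of settling on a value.
at_zero = [central_difference(step, 0.0, h) for h in (1e-1, 1e-3, 1e-6)]
print(at_zero)
```

A slope of 0 almost everywhere means a small change to the weights usually leaves the prediction, and therefore the error, unchanged, which is why a gradient-based update has nothing to follow.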