(19-Nov-2020)
• In many natural language applications, we often want our models to produce words (rather than characters) as the fundamental unit of the output. For large vocabularies, representing an output distribution over the choice of a word can be very computationally expensive: in many applications, the vocabulary V contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|. This imposes a high memory cost to store the matrix and a high computational cost to multiply by it. Because the softmax is normalized across all |V| outputs, the full matrix multiplication must be performed at training time as well as at test time; we cannot compute only the dot product with the weight vector for the correct output. The high computational cost of the output layer thus arises both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses many difficulties.
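A minimal sketch of the naive approach described above, assuming illustrative sizes (hidden_dim, vocab_size) and names that are not from the text. It shows why the weight matrix dominates memory and why computing the probability of even a single correct word still requires the full |V|-sized matrix multiplication for the softmax normalizer.

```python
import numpy as np

hidden_dim = 512          # size of the hidden representation h (assumed)
vocab_size = 200_000      # |V|: hundreds of thousands of words (assumed)

# Affine transformation from hidden space to vocabulary space.
# The weight matrix alone has hidden_dim * vocab_size entries
# (~102 million floats here), which dominates the memory cost.
W = (np.random.randn(vocab_size, hidden_dim) * 0.01).astype(np.float32)
b = np.zeros(vocab_size, dtype=np.float32)

def naive_softmax_output(h):
    """Probability over all |V| words for one hidden vector h."""
    logits = W @ h + b                    # full vocab_size x hidden_dim multiply
    logits -= logits.max()                # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # normalization touches all |V| outputs

# Even to score one correct word y, the softmax denominator needs all |V|
# dot products, so training cannot skip the full matrix multiplication.
h = np.random.randn(hidden_dim).astype(np.float32)
y = 12345
p = naive_softmax_output(h)
print(f"P(word {y} | h) = {p[y]:.3e}, parameters in W: {W.size:,}")
```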