Importance Sampling

(21st-Nov-2020)


• One way to speed up the training of neural language models is to avoid explicitly computing the gradient contribution from all of the words that do not appear in the next position. Every incorrect word should have low probability under the model, but it can be computationally costly to enumerate all of these words. Instead, it is possible to sample only a subset of the words. Writing a_i for the score the model assigns to word i given the context C, the gradient can be written as follows:
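The display that should follow appears to have been dropped in formatting. In terms of the scores a_i, the standard log-softmax gradient being described is

\frac{\partial \log P(y \mid C)}{\partial \theta}
  = \frac{\partial a_y}{\partial \theta}
  - \sum_i P(i \mid C)\,\frac{\partial a_i}{\partial \theta}.

The first term involves only the observed word y. The second term is an expectation under the model's own distribution over the whole vocabulary, and it is this expectation that importance sampling approximates using a small number of words drawn from a cheaper proposal distribution q, with weights that correct for the mismatch between q and the model.

• As a rough illustration (none of this code is from the post; the vocabulary size, proposal distribution, and sample count are made up), the numpy sketch below compares the exact gradient of log P(y | C) with respect to the scores against a biased importance-sampling estimate that only touches a handful of sampled words:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10_000                       # hypothetical vocabulary size
a = rng.normal(size=V)           # scores a_i for every word given the context C
y = 42                           # index of the observed next word

# Exact gradient of log P(y | C) with respect to the scores:
# +1 on the observed word, minus the softmax probability on every word.
# Computing it requires a full pass over the vocabulary.
p = np.exp(a - a.max())
p /= p.sum()
exact = -p
exact[y] += 1.0

# Proposal distribution q (a random distribution standing in for, say, a
# unigram model) and a small set of words sampled from it.
q = rng.random(V)
q /= q.sum()
S = rng.choice(V, size=64, p=q)

# Biased importance sampling: self-normalized weights exp(a_i) / q_i over the
# sampled words approximate the model's expectation over the full vocabulary.
w = np.exp(a[S] - a.max()) / q[S]
w /= w.sum()

approx = np.zeros(V)
approx[y] += 1.0
np.add.at(approx, S, -w)         # repeated samples accumulate correctly

print("max |exact - approx|:", np.abs(exact - approx).max())
```

In a real model only the sampled rows of the output weight matrix would need to be updated, which is where the speed-up described above comes from.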


• Noise-Contrastive Estimation and Ranking Loss

• Other approaches based on sampling have been proposed to reduce the computational cost of training neural language models with large vocabularies. An early example is the ranking loss proposed by Collobert and Weston (2008a), which views the output of the neural language model for each word as a score and tries to rank the score of the correct word, a_y, higher than the other scores a_i. The ranking loss they proposed is
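The loss itself seems to have been lost in formatting; given the gradient behaviour described just below (zero once a_y exceeds a_i by a margin of 1), it is the hinge-style criterion

L = \sum_i \max\left(0,\; 1 - a_y + a_i\right),

summed over the words i of the vocabulary.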


The gradient is zero for the i-th term if the score of the observed word, a_y, exceeds the score of the negative word a_i by a margin of 1. One issue with this criterion is that it does not provide estimated conditional probabilities, which are useful in some applications, including speech recognition and text generation (including conditional text generation tasks such as translation). A more recently used training objective for neural language models is noise-contrastive estimation, introduced in section 18.6. This approach has been successfully applied to neural language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013).
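• For concreteness, here is a tiny numpy check of that zero-gradient condition (the scores and the four-word vocabulary are invented for illustration):

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5, 3.2])    # hypothetical scores a_i for a 4-word vocabulary
y = 3                                   # index of the observed word

margins = 1.0 - a[y] + a                # 1 - a_y + a_i for each word i
loss = np.maximum(0.0, margins).sum()   # the i = y term contributes a constant 1

# Terms with a_y > a_i + 1 are clipped to zero and contribute no gradient.
print(loss, margins > 0)
```

Here the observed word outscores every alternative by more than the margin, so the loss reduces to the constant contribution of the i = y term and the gradient vanishes.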
