
Use of a short list

(20th-Nov-2020)


Hierarchical Softmax

• A classical approach (Goodman, 2001) to reducing the computational burden of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically. Instead of requiring a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach in the context of neural language models. One can think of this hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, and so on. These nested categories form a tree, with words at the leaves. In a balanced tree, the tree has depth O(log |V|). The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every node on the path from the root of the tree to the leaf containing the word. Figure 12.4 illustrates a simple example; a code sketch of this path-product computation appears after this list. Mnih and Hinton (2009) also describe how to use multiple paths to identify a single word, in order to better model words that have multiple meanings. Computing the probability of a word then involves a summation over all of the paths that lead to that word.

• Figure 12.4: A simple hierarchy of word categories, with 8 words w0, ..., w7 organized into a three-level hierarchy.
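To make the path-product computation concrete, below is a minimal NumPy sketch of hierarchical softmax over a balanced binary tree, using the 8-word vocabulary (w0, ..., w7) from the figure. It assumes one logistic (two-way) classifier per internal node; the class name HierarchicalSoftmax, the heap-style node layout, and all parameter names are illustrative choices, not taken from the cited papers.

```python
import numpy as np

class HierarchicalSoftmax:
    """Sketch: word probabilities from a balanced binary tree of logistic classifiers."""

    def __init__(self, vocab_size, n_hidden, seed=0):
        # A complete binary tree over vocab_size leaves has vocab_size - 1
        # internal nodes; each one holds a single binary (left/right) classifier.
        rng = np.random.default_rng(seed)
        self.depth = int(np.log2(vocab_size))          # O(log |V|) levels
        self.weights = rng.normal(scale=0.1, size=(vocab_size - 1, n_hidden))
        self.biases = np.zeros(vocab_size - 1)

    def word_prob(self, h, word):
        # Walk from the root to the leaf holding `word`, multiplying the
        # probability of the branch taken at each internal node on the path.
        prob, node = 1.0, 0                            # node 0 is the root
        for level in reversed(range(self.depth)):
            bit = (word >> level) & 1                  # 0 = go left, 1 = go right
            logit = self.weights[node] @ h + self.biases[node]
            p_right = 1.0 / (1.0 + np.exp(-logit))     # sigmoid branch probability
            prob *= p_right if bit else (1.0 - p_right)
            node = 2 * node + 1 + bit                  # heap-style child index
        return prob

# Because the two branch probabilities at every node sum to one, the leaf
# probabilities form a valid distribution over the whole vocabulary:
hs = HierarchicalSoftmax(vocab_size=8, n_hidden=16)
h = np.random.default_rng(1).normal(size=16)           # a hidden-state vector
print(sum(hs.word_prob(h, w) for w in range(8)))       # ~1.0
```

Note that scoring a single word touches only the log |V| classifiers on its root-to-leaf path rather than all |V| output units, which is where the savings over a flat softmax come from; Mnih and Hinton's multiple-path variant would instead sum word_prob over every path assigned to the word.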


