(19th-December-2020)
• Neural auto-regressive networks (Bengio and Bengio, 2000a,b) have the same left-to-right graphical model as logistic auto-regressive networks (figure 20.8), but employ a different parametrization of the conditional distributions within that graphical model structure. The new parametrization is more powerful in the sense that its capacity can be increased as much as needed, allowing approximation of any joint distribution. The new parametrization can also improve generalization by introducing a parameter sharing and feature sharing principle common to deep learning in general. The models were motivated by the objective of avoiding the curse of dimensionality arising out of traditional tabular graphical models sharing the same structure as figure 20.8. In tabular discrete probabilistic models, each conditional distribution is represented by a table of probabilities, with one entry and one parameter for each possible configuration of the variables involved. By using a neural network instead, two advantages are obtained:
1. The parametrization of each P(x_i | x_{i-1}, ..., x_1) by a neural network with (i − 1)×k inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) allows one to estimate the conditional probability without requiring an exponential number of parameters (and examples), yet still is able to capture high-order dependencies between the random variables (a minimal sketch of a single such conditional follows this list).
2. Instead of having a different neural network for the prediction of each x_i, a left-to-right connectivity illustrated in figure 20.9 allows one to merge all the neural networks into one. Equivalently, it means that the hidden-layer features computed for predicting x_i can be reused for predicting x_{i+k} (k > 0). The hidden units are thus organized in groups that have the particularity that all the units in the i-th group only depend on the input values x_1, ..., x_i. The parameters used to compute these hidden units are jointly optimized to improve the prediction of all the variables in the sequence. This is an instance of the reuse principle that recurs throughout deep learning, in scenarios ranging from recurrent and convolutional network architectures to multi-task and transfer learning (see the second sketch after this list).
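To make point 1 concrete, here is a minimal sketch of a single conditional P(x_i | x_{i-1}, ..., x_1) parameterized by a small network with (i − 1)×k one-hot inputs and k softmax outputs. All names, sizes, and the use of NumPy with untrained random weights are my own illustrative assumptions, not the original authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(value, k):
    """Encode an integer in {0, ..., k-1} as a length-k one-hot vector."""
    v = np.zeros(k)
    v[value] = 1.0
    return v

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: k discrete values per variable, predicting x_i from x_1 .. x_{i-1}.
k, i, n_hidden = 3, 5, 16

# (i-1)*k inputs -> n_hidden -> k outputs.
W_in = rng.normal(scale=0.1, size=(n_hidden, (i - 1) * k))
b_in = np.zeros(n_hidden)
W_out = rng.normal(scale=0.1, size=(k, n_hidden))
b_out = np.zeros(k)

def conditional(prev_values):
    """P(x_i | x_{i-1}, ..., x_1) for one setting of the i-1 previous variables."""
    x = np.concatenate([one_hot(v, k) for v in prev_values])  # (i-1)*k inputs
    h = np.tanh(W_in @ x + b_in)
    return softmax(W_out @ h + b_out)                          # k output probabilities

print(conditional([0, 2, 1, 2]))   # distribution over x_5 given x_1 .. x_4
```

The parameter count here grows roughly as (i − 1)·k·n_hidden + n_hidden·k, i.e. linearly in i, whereas a tabular conditional would need on the order of k^(i−1) entries.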
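For point 2, a sketch of merging all the conditionals into one network with left-to-right connectivity, in the spirit of the figure 20.9 architecture: the running hidden pre-activation for step i sees only x_1, ..., x_{i-1}, and the same input-to-hidden weights feed every later prediction, so features are shared across conditionals. Again, this is a hedged NADE-style reading with hypothetical names and untrained weights, not the exact wiring of Bengio and Bengio (2000a,b):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: d variables, each taking one of k values, n_hidden shared hidden units.
d, k, n_hidden = 4, 3, 16

# Shared input-to-hidden weights (one slice per position/value pair) and one output
# head per position; every shared parameter helps predict all later variables.
W = rng.normal(scale=0.1, size=(d, k, n_hidden))   # input -> hidden (shared features)
c = np.zeros(n_hidden)                             # hidden bias
V = rng.normal(scale=0.1, size=(d, k, n_hidden))   # hidden -> output, one head per x_i
b = np.zeros((d, k))                               # output biases

def joint_log_prob(x):
    """log P(x_1, ..., x_d) = sum_i log P(x_i | x_{<i}), reusing hidden features."""
    a = c.copy()                       # running pre-activation; depends only on x_{<i}
    logp = 0.0
    for i in range(d):
        h = np.tanh(a)                 # hidden group for step i sees only x_1 .. x_{i-1}
        p = softmax(V[i] @ h + b[i])   # P(x_i | x_{<i})
        logp += np.log(p[x[i]])
        a = a + W[i, x[i]]             # fold x_i in; the same features serve x_{i+1}, ...
    return logp

print(joint_log_prob([0, 2, 1, 1]))
```

Note how the single accumulator a is updated once per variable and reused for every later conditional, which is the feature-sharing idea described in item 2.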