
Walk-Back Training Procedure

(23rd-December-2020)


• The walk-back training procedure was proposed by Bengio et al. (2013c) as a way to accelerate the convergence of generative training of denoising autoencoders. Instead of performing a one-step encode-decode reconstruction, this procedure consists of alternating multiple stochastic encode-decode steps (as in the generative Markov chain), initialized at a training example (just as with the contrastive divergence algorithm, described in section 18.2), and penalizing the last probabilistic reconstruction (or all of the reconstructions along the way). Training with k steps is equivalent (in the sense of achieving the same stationary distribution) to training with one step, but in practice it has the advantage that spurious modes farther from the data can be removed more efficiently. A minimal sketch of this procedure appears below.
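The following is a minimal PyTorch sketch of the k-step encode-decode loop described above, not the exact implementation from Bengio et al. (2013c). The callables `encoder`, `decoder` and `corrupt` are hypothetical placeholders for the learned networks and the corruption process, and the decoder is assumed to output Bernoulli logits over binary visible units.

```python
import torch
import torch.nn.functional as F

def walk_back_loss(x, encoder, decoder, corrupt, k=3, penalize_all=True):
    """Sketch of walk-back training for a denoising autoencoder.

    corrupt(x) injects noise, encoder maps a corrupted input to a hidden
    code, and decoder(h) returns Bernoulli logits over the visible units.
    All three are placeholder callables, not a specific published model.
    """
    losses = []
    x_current = x
    for _ in range(k):
        # One stochastic encode-decode step of the generative Markov chain,
        # initialized at the training example x.
        h = encoder(corrupt(x_current))
        logits = decoder(h)
        # Penalize the probabilistic reconstruction of the training example.
        losses.append(F.binary_cross_entropy_with_logits(logits, x))
        # Sample the next visible state so the chain can walk away from the
        # data and visit (and then unlearn) spurious modes.
        x_current = torch.bernoulli(torch.sigmoid(logits)).detach()
    # Either penalize every reconstruction along the way or only the last one.
    return torch.stack(losses).mean() if penalize_all else losses[-1]
```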


Generative Stochastic Networks

Generative stochastic networks, or GSNs (Bengio et al., 2014), are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables (usually denoted x). A GSN is parametrized by two conditional probability distributions that specify one step of the Markov chain (a minimal sampling sketch follows the list below):

• 1. p(x^(k) | h^(k)) tells how to generate the next visible variable given the current latent state. Such a "reconstruction distribution" is also found in denoising autoencoders, RBMs, DBNs and DBMs.

• 2. p(h^(k) | h^(k-1), x^(k-1)) tells how to update the latent state variable, given the previous latent state and visible variable.
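The sketch below shows how these two conditionals compose into one step of the GSN Markov chain. The callables `sample_h_given_hx` and `sample_x_given_h` are hypothetical placeholders for learned samplers of p(h^(k) | h^(k-1), x^(k-1)) and p(x^(k) | h^(k)); this illustrates the structure of the chain rather than reproducing code from the cited paper.

```python
def run_gsn_chain(x0, h0, sample_h_given_hx, sample_x_given_h, n_steps=10):
    """Run n_steps of the GSN Markov chain from an initial (x, h) pair.

    sample_h_given_hx(h, x): draws h^(k) ~ p(h^(k) | h^(k-1), x^(k-1))
    sample_x_given_h(h):     draws x^(k) ~ p(x^(k) | h^(k))
    Both are placeholder callables standing in for learned networks.
    """
    x, h = x0, h0
    visible_samples = []
    for _ in range(n_steps):
        h = sample_h_given_hx(h, x)   # update the latent state
        x = sample_x_given_h(h)       # generate the next visible variable
        visible_samples.append(x)
    return visible_samples
```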


Discriminant GSNs

• The original formulation of GSNs (Bengio et al., 2014) was meant for unsupervised learning, implicitly modeling p(x) for observed data x, but it is possible to modify the framework to optimize p(y | x). For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by back-propagating the reconstruction log-probability over the output variables only, keeping the input variables fixed. They applied this successfully to model sequences (protein secondary structure) and introduced a (one-dimensional) convolutional structure in the transition operator of the Markov chain. It is important to remember that, for each step of the Markov chain, one generates a new sequence for each layer, and that sequence is the input for computing other layer values (say the one below and the one above) at the next time step. Hence the Markov chain is really over the output variable (and the associated higher-level hidden layers), and the input sequence only serves to condition that chain, with back-propagation allowing the model to learn how the input sequence can condition the output distribution implicitly represented by the Markov chain. It is therefore a case of using the GSN in the context of structured outputs. A rough sketch of this conditioning is given below.
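As a rough illustration of that idea, the sketch below runs the chain over the output variable y while the input x stays fixed and only conditions the transitions, and back-propagates only the reconstruction loss on the outputs. The callables `encoder`, `decoder` and `corrupt_y` are hypothetical placeholders, and the categorical output is just an example; this is not the architecture of Zhou and Troyanskaya (2014).

```python
import torch
import torch.nn.functional as F

def discriminant_gsn_loss(x_input, y_target, encoder, decoder, corrupt_y, k=3):
    """Sketch of a discriminant-GSN training loss.

    The Markov chain runs over the output variable y (class indices here),
    while x_input stays fixed and only conditions the latent update. Only
    the reconstruction log-probability of the outputs is back-propagated.
    encoder(x, y) -> latent state, decoder(h) -> output logits, and
    corrupt_y(y) -> noisy copy of y are placeholder callables.
    """
    losses = []
    y_current = y_target
    for _ in range(k):
        # Latent update conditioned on the fixed input and the current output.
        h = encoder(x_input, corrupt_y(y_current))
        logits = decoder(h)
        # Penalize only the reconstruction of the output variables.
        losses.append(F.cross_entropy(logits, y_target))
        # Continue the chain over the output variable only.
        y_current = torch.distributions.Categorical(logits=logits).sample()
    return torch.stack(losses).mean()
```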
