(11th-Nov-2020)
• In many commercial applications, it is much more important for the time and memory cost of running inference in a machine learning model to be low than for the cost of training to be low. For applications that do not require personalization, a model can be trained once and then deployed to billions of users. In many cases, the end user is more resource-constrained than the developer: for example, one might train a speech recognition network on a powerful computer cluster and then deploy it on mobile phones. A key strategy for reducing the cost of inference is model compression (Buciluǎ et al., 2006). The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate. Model compression is applicable when the size of the original model is driven primarily by the need to prevent overfitting. In most cases, the model with the lowest generalization error is an ensemble of several independently trained models, and evaluating all n ensemble members is expensive. Sometimes even a single model generalizes better if it is large (for example, if it is regularized with dropout). These large models learn some function f(x), but do so using many more parameters than the task requires; their size is necessary only because of the limited number of training examples. As soon as we have fit this function f(x), we can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points x, and then train the new, smaller model to match f(x) on these points. To use the capacity of the new, small model most efficiently, it is best to sample the new x points from a distribution resembling the actual test inputs that will be supplied to the model later. This can be done by corrupting training examples or by drawing points from a generative model trained on the original training set. Alternatively, one can train the smaller model only on the original training points, but train it to copy other features of the large model, such as its posterior distribution over the incorrect classes (Hinton et al., 2014, 2015). A minimal distillation sketch appears after these notes.
• One strategy for accelerating data processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input. Data processing systems can dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic structure internally, determining which subset of features (hidden units) to compute given information from the input. This form of dynamic structure inside neural networks is sometimes called conditional computation (Bengio, 2013; Bengio et al., 2013b). Since many components of the architecture may be relevant only for a small fraction of possible inputs, the system can run faster by computing these features only when they are needed. Dynamic structure of computations is a basic computer science principle applied throughout the software engineering discipline. The simplest version of dynamic structure applied to neural networks is to determine which subset of some group of neural networks (or other machine learning models) should be applied to a particular input. A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers. The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event). To know for sure that the object is present, we must use a sophisticated classifier with high capacity that is expensive to run. However, because the object is rare, we can usually use much less computation to reject inputs as not containing the object. In these situations, we can train a sequence of classifiers. The first classifiers in the sequence have low capacity and are trained to have high recall; in other words, they are trained to make sure we do not wrongly reject an input when the object is present. The final classifier is trained to have high precision. At test time, we run inference by running the classifiers in sequence, abandoning an example as soon as any one element in the cascade rejects it (see the cascade sketch below). Overall, this allows us to verify the presence of objects with high confidence using a high-capacity model, without paying the cost of full inference for every example. There are two different ways that the cascade can achieve high capacity: the later members of the cascade can individually have high capacity, or every individual member can have low capacity while the system as a whole has high capacity because it combines many small models.
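The following is a minimal sketch of the soft-target approach to model compression from the first note, in the spirit of Hinton et al.'s distillation: a small student is trained to match the softened output distribution of a large, already-trained teacher. It is not taken from the notes above; the network sizes, the temperature T, and the noise-corruption of the input batch are illustrative assumptions.

```python
# Sketch of model compression by distillation (assumptions: a trained
# `teacher` exists; sizes, temperature, and input corruption are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(  # stand-in for the large, expensive model f(x)
    nn.Linear(784, 1200), nn.ReLU(),
    nn.Linear(1200, 1200), nn.ReLU(),
    nn.Linear(1200, 10),
)
student = nn.Sequential(  # smaller model intended for deployment
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # softmax temperature; higher T exposes more of the teacher's soft structure

def distillation_step(x):
    """One training step: match the student's soft predictions to the teacher's."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)   # teacher's posterior over classes
    log_probs = F.log_softmax(student(x) / T, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Inputs can be original training points, corrupted copies, or samples from a
# generative model; here a batch is simply corrupted with noise as a placeholder.
x_batch = torch.rand(64, 784)
x_batch = x_batch + 0.1 * torch.randn_like(x_batch)
print(distillation_step(x_batch))
```

Because the teacher supplies a full distribution over classes rather than a single hard label, the student receives more information per example, which is what lets a much smaller model approximate f(x) well.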
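Below is a minimal sketch of cascade inference as described in the second note. The `Stage` class, its `score` callable, and the thresholds are hypothetical stand-ins for real classifiers; the point is only the control flow, in which cheap, high-recall stages reject most inputs early and only surviving inputs reach the expensive, high-precision final stage.

```python
# Sketch of a classifier cascade for rare-object detection
# (hypothetical Stage interface; scores and thresholds are illustrative).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Stage:
    score: Callable[[object], float]  # estimated probability that the object is present
    threshold: float                  # reject the input if the score falls below this

def cascade_predict(stages: Sequence[Stage], x) -> bool:
    """Run stages in order, abandoning x as soon as any stage rejects it.

    Early stages are low-capacity and tuned for high recall (low thresholds),
    so most negatives are rejected cheaply; only inputs that survive every
    stage pay the cost of the final high-precision classifier.
    """
    for stage in stages:
        if stage.score(x) < stage.threshold:
            return False  # rejected early: object judged absent
    return True  # accepted by every stage, including the expensive last one

# Hypothetical usage with toy scoring functions standing in for real models.
stages = [
    Stage(score=lambda x: 0.9 if sum(x) > 1.0 else 0.1, threshold=0.05),  # cheap, high recall
    Stage(score=lambda x: 0.8 if max(x) > 0.7 else 0.2, threshold=0.5),   # costly, high precision
]
print(cascade_predict(stages, [0.3, 0.9]))
```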