Dataset Augmentation
- DR.GEEK
- Nov 16, 2020
- 3 min read
• As described in section 7.4, it is easy to improve the generalization of a classifier by increasing the size of the training set with extra copies of the training examples that have been modified by transformations that do not change the class. Object recognition is a classification task especially amenable to this form of dataset augmentation because the class is invariant to so many transformations and the input can easily be transformed with many geometric operations. As described before, classifiers can benefit from random translations, rotations, and, in some cases, flips of the input to augment the dataset. In specialized computer vision applications, more advanced transformations are commonly used for dataset augmentation, including random perturbation of the colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of the input.
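As a rough illustration, the sketch below composes a few such label-preserving transformations with torchvision. The specific angles, shift fractions, and jitter strengths are illustrative assumptions, not values prescribed in the text.

```python
# A minimal sketch of label-preserving augmentation for object recognition.
# The transformation parameters below are arbitrary illustrative choices.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),          # flip (only when the class is flip-invariant)
    transforms.RandomAffine(degrees=10,              # small random rotation
                            translate=(0.1, 0.1)),   # random translation up to 10% of width/height
    transforms.ColorJitter(brightness=0.2,           # random perturbation of image colors
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
])

# Applied on the fly during training, each epoch sees a differently
# transformed copy of every image, effectively enlarging the training set.
```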
• The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker. Let X = (x(1), x(2), ..., x(T)) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20ms frames). Most speech recognition systems preprocess the input using specialized hand-designed features, but some deep learning systems (Jaitly and Hinton, 2011) learn features from raw input. Let y = (y1, y2, ..., yN) denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function f*_ASR that computes the most probable linguistic sequence y given the acoustic sequence X:

f*_ASR(X) = argmax_y P(y | X)
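As a toy sketch of this decision rule, the snippet below picks the highest-scoring transcript from a candidate set. The `acoustic_model` score function and the explicit `candidates` list are hypothetical placeholders: a real recognizer searches an enormous space of word sequences with a decoder (e.g. beam search) rather than enumerating them.

```python
# Toy illustration of f*_ASR(X) = argmax_y P(y | X).
# `acoustic_model(y, X)` is assumed to return a score proportional to P(y | X);
# both it and `candidates` are hypothetical stand-ins for a real decoder.

def f_asr(X, candidates, acoustic_model):
    """Return the candidate word sequence y with the highest P(y | X)."""
    return max(candidates, key=lambda y: acoustic_model(y, X))
```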
• Since the 1980s and until about 2009–2012, state-of-the-art speech recognition systems primarily combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs). GMMs modeled the association between acoustic features and phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes. The GMM-HMM model family treats acoustic waveforms as being generated by the following process: first an HMM generates a sequence of phonemes and discrete sub-phonemic states (such as the beginning, middle, and end of each phoneme), then a GMM transforms each discrete symbol into a brief segment of audio waveform. Although GMM-HMM systems dominated ASR until recently, speech recognition was actually one of the first areas where neural networks were applied, and numerous ASR systems from the late 1980s and early 1990s used neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the performance of ASR based on neural nets approximately matched that of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved a 26% phoneme error rate on the TIMIT corpus (Garofolo et al., 1993), with 39 phonemes to discriminate between, which was better than or comparable to HMM-based systems. Since then, TIMIT has been a benchmark for phoneme recognition, playing a role similar to the one MNIST plays for object recognition. However, because of the complex engineering involved in software systems for speech recognition and the effort that had been invested in building these systems on the basis of GMM-HMMs, the industry did not see a compelling argument for switching to neural networks. As a consequence, until the late 2000s, both academic and industrial research in using neural nets for speech recognition mostly focused on using neural nets to learn extra features for GMM-HMM systems.