Studies in Computational Intelligence | 2021
Recognition of Isolated English Words of E-Lecture Video Using Convolutional Neural Network
Abstract
Speech recognition has been gaining importance because of the tremendous growth in its applications, such as generating subtitles for e-lecture videos and transcribing recorded speech for people with physical disabilities. Speech recognition is a complex task because it must cope with strong accents, run-over words, varying rates of speech and background noise. Earlier speech recognition systems were speaker dependent, and constructing their acoustic models, which have non-linear boundaries, was challenging. This paper presents the recognition of isolated English words in three steps: pre-processing, segmentation, and extraction of Mel Frequency Cepstral Coefficients (MFCCs) from the audio signal of a video. A Convolutional Neural Network (CNN) is then trained on these features to classify the audio signals. We describe various types of input features, which play a vital role in the recognition process, and observe that MFCC features computed with different analysis window lengths (Winlen) and window steps (Winstep) produce different results. The feature set with the higher Winlen and Winstep yields the better result (97%).
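To illustrate how Winlen and Winstep change the MFCC representation, the following is a minimal pure-Python sketch of MFCC extraction; it is not the authors' implementation. The pre-emphasis coefficient (0.97), Hamming window, 26 mel filters, 13 cepstral coefficients, naive DFT (used instead of an FFT to stay dependency-free), and the helper names `frame_signal`, `mel_filterbank` and `mfcc` are all illustrative assumptions. The lowercase names `winlen` and `winstep` match the parameters of the `mfcc` function in the python_speech_features library, but the abstract does not say which toolkit the authors used.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_signal(signal, sample_rate, winlen, winstep):
    """Split a signal into frames of winlen seconds, one starting every winstep seconds."""
    frame_len = int(round(winlen * sample_rate))
    frame_step = int(round(winstep * sample_rate))
    if len(signal) <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + math.ceil((len(signal) - frame_len) / frame_step)
    # Zero-pad so that the last frame is complete.
    pad = (num_frames - 1) * frame_step + frame_len - len(signal)
    padded = list(signal) + [0.0] * pad
    return [padded[i * frame_step:i * frame_step + frame_len] for i in range(num_frames)]

def mel_filterbank(nfilt, nfft, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (nfilt + 1) for i in range(nfilt + 2)]
    bins = [int(math.floor((nfft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for j in range(1, nfilt + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (nfft // 2 + 1)
        for k in range(left, center):   # rising edge of the triangle
            weights[k] = (k - left) / (center - left)
        for k in range(center, right):  # falling edge of the triangle
            weights[k] = (right - k) / (right - center)
        bank.append(weights)
    return bank

def mfcc(signal, sample_rate, winlen=0.025, winstep=0.01, numcep=13, nfilt=26, preemph=0.97):
    # Pre-emphasis (assumed pre-processing step): y[n] = x[n] - a * x[n-1].
    emph = [signal[0]] + [signal[n] - preemph * signal[n - 1] for n in range(1, len(signal))]
    frames = frame_signal(emph, sample_rate, winlen, winstep)
    frame_len = len(frames[0])
    nfft = 1
    while nfft < frame_len:  # smallest power of two that holds one frame
        nfft *= 2
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    cos_t = [math.cos(2 * math.pi * i / nfft) for i in range(nfft)]
    sin_t = [math.sin(2 * math.pi * i / nfft) for i in range(nfft)]
    bank = mel_filterbank(nfilt, nfft, sample_rate)
    features = []
    for frame in frames:
        x = [s * w for s, w in zip(frame, window)]
        # Naive DFT power spectrum for bins 0..nfft/2 (frame implicitly zero-padded to nfft).
        power = []
        for k in range(nfft // 2 + 1):
            re = im = 0.0
            for n, s in enumerate(x):
                idx = (k * n) % nfft
                re += s * cos_t[idx]
                im += s * sin_t[idx]
            power.append((re * re + im * im) / nfft)
        # Log mel-filterbank energies, floored to avoid log(0).
        log_e = [math.log(max(sum(w * p for w, p in zip(f, power)), 1e-10)) for f in bank]
        # Orthonormal DCT-II; keep the first numcep cepstral coefficients.
        coeffs = []
        for i in range(numcep):
            scale = math.sqrt((1.0 if i == 0 else 2.0) / nfilt)
            coeffs.append(scale * sum(e * math.cos(math.pi * i * (2 * n + 1) / (2 * nfilt))
                                      for n, e in enumerate(log_e)))
        features.append(coeffs)
    return features

if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 4)]  # 0.25 s test tone
    short = mfcc(tone, sr, winlen=0.025, winstep=0.01)
    long_ = mfcc(tone, sr, winlen=0.05, winstep=0.025)
    print(len(short), len(long_))  # 24 9
```

With the larger window and step, the same 0.25 s signal yields 9 frames instead of 24, so a downstream CNN receives a feature matrix of a different shape and temporal resolution. This is one plausible reason the two settings lead to different recognition results.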