• Julian Fritsch
  • 15.01.2014
  • 13.06.2014

The task of automatic speech recognition (ASR) can be described as the estimation of a word sequence given a recorded speech signal. For this purpose, the relevant information in the time-frequency pattern of the speech signal is extracted to obtain a feature representation of the lowest possible dimensionality. These features are supposed to contain all information necessary to classify phonemes, which are defined as the smallest distinctive speech units. The estimation process is based on an acoustic model and a language model, whose parameters are determined in a training phase with known utterances. The training data generally consists of undistorted speech signals, which results in a mismatch between training and deployment environments for distant-talking speech recognition. The reverberation caused by a large speaker-to-microphone distance degrades the recognition performance and makes medical applications, such as hands-free interfaces for computer interaction, very challenging.
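
The estimation described above is commonly written as a maximum a posteriori decision. The following is a sketch in standard textbook notation, not taken from the thesis description: the symbols X (the sequence of feature vectors) and W (a candidate word sequence) are assumptions introduced here for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W)
```

In this formulation, p(X | W) corresponds to the acoustic model and P(W) to the language model. The denominator p(X) can be dropped because it does not depend on W and therefore does not change which word sequence maximizes the expression.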

A widely used open-source toolkit for speech recognition research is the Hidden Markov Model Toolkit (HTK). However, this toolkit has disadvantages with respect to flexibility and math support. The main goal of this thesis is to apply the open-source Kaldi software to the task of ASR. The student is asked to set up a speech recognition system for the Wall Street Journal (WSJ) corpus of November 1992 using the “trunk” version of the Kaldi toolkit.