Abstract:
The diachronic nature of broadcast news causes frequent variations in linguistic content and vocabulary, leading to the problem of Out-Of-Vocabulary (OOV) words in automatic speech recognition. Most OOV words turn out to be proper names, and proper names are important both for the automatic indexing of audio-video content and for obtaining reliable automatic transcriptions.
Open-vocabulary systems based on word and sub-word units are an interesting solution to the OOV words encountered by automatic speech recognition, but these systems are still unable to produce reliable automatic transcriptions.
Instead, new proper names missed by a speech recognition system can be recovered with a dynamic-vocabulary, multi-pass recognition approach, in which new proper names are added to the recognition vocabulary based on the context of the spoken content.
Existing methods for vocabulary selection rely on web search engines and adaptation corpora, and choose the new vocabulary words using term-document frequency and co-occurrence features. Instead of relying on such ad-hoc methods and count-based, hand-crafted features, we adopt unsupervised and theoretically well-defined methods for document-specific vocabulary selection.
The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those which are relevant to the spoken content of the audio document. The motivation to explore semantic/topic context models for addressing the OOV problem in speech recognition comes from a more fundamental question: how can we leverage semantic and topic context to improve automatic speech transcription systems?
Training semantic/topic models is challenging in this task because (a) many new proper names come with very little data from which to model their context, and (b) the context must be inferred from the automatic transcription, which contains word errors. These two issues are a central focus throughout the thesis.
Probabilistic topic models and word embeddings from neural network models are explored for the task of retrieving relevant proper names. The proposed retrieval methodologies are first introduced using topic representations from Latent Dirichlet Allocation (LDA) (in Chapter 4). They are then extended to semantic vectors from the Latent Semantic Analysis (LSA) model, and to Entity-Topic models, which use LDA topic spaces to model relationships between in-vocabulary words and OOV proper names.
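As a minimal sketch of the topic-based retrieval idea (the 4-topic distributions and candidate names below are purely illustrative, not data from the thesis), a document's LDA topic distribution can be compared by cosine similarity against the topic distributions of each candidate name's training contexts, and the most similar names retrieved:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic distributions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_proper_names(doc_topics, name_topics, top_k=2):
    """Rank candidate OOV proper names by topic similarity to the document.

    doc_topics  : LDA topic distribution of the automatic transcription
    name_topics : dict mapping each candidate name to the topic
                  distribution of its contexts in the diachronic corpus
    """
    scores = {name: cosine(doc_topics, vec) for name, vec in name_topics.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical 4-topic distributions (say: politics, sport, economy, weather).
doc = np.array([0.7, 0.1, 0.15, 0.05])        # a politics-heavy broadcast
names = {
    "Hollande": np.array([0.8, 0.05, 0.1, 0.05]),
    "Zidane":   np.array([0.05, 0.85, 0.05, 0.05]),
    "Draghi":   np.array([0.3, 0.05, 0.6, 0.05]),
}
print(rank_proper_names(doc, names))  # → ['Hollande', 'Draghi']
```

In practice the topic distributions would be inferred by a trained LDA model rather than written by hand; the ranking step itself is the part sketched here.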
Methods to handle infrequent OOV proper names are also proposed.
The proposed retrieval methodologies are evaluated and the performance of the different representations is compared.
A discussion and experiments on the selection of a diachronic text corpus from the internet, which is essential both for training the context models and for retrieving OOV proper names, are also presented. The proposed retrieval methodologies are then extended to word embeddings obtained from the Skip-gram and Continuous Bag-Of-Words models (in Chapter 5). Analysis of the retrieval performance of all these representations reveals their inadequacies and their sensitivity to hyper-parameters.
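In the word-embedding variants, a context vector for the transcription can be built by averaging the Skip-gram (or CBOW) embeddings of its in-vocabulary words; candidate names are then ranked by cosine similarity against similarly averaged vectors of their contexts. A minimal sketch, with a hypothetical toy embedding table standing in for pre-trained embeddings:

```python
import numpy as np

# Hypothetical pre-trained Skip-gram embeddings (3-dimensional for illustration).
embeddings = {
    "election": np.array([0.9, 0.1, 0.0]),
    "minister": np.array([0.8, 0.2, 0.1]),
    "match":    np.array([0.1, 0.9, 0.0]),
    "goal":     np.array([0.0, 0.8, 0.2]),
}

def context_vector(words, embeddings):
    """Bag-of-words context representation: average the embeddings of
    the in-vocabulary words of the automatic transcription, silently
    skipping words (including OOVs) that have no embedding."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0)

doc_vec = context_vector(["election", "minister", "unknownword"], embeddings)
print(doc_vec)  # → [0.85 0.15 0.05]
```

Note that this representation weights every in-vocabulary word equally, which is one of the weaknesses the later discriminative models address.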
Following this thorough evaluation of contextual representations from topic models and word embeddings, it is argued that these representations are not well suited to the proper name retrieval task. Neural network context models trained with the objective of maximising retrieval performance are therefore proposed (in Chapter 6). A Neural Bag-of-Words (NBOW) model is first presented to learn discriminative context vector representations at the document level.
A Neural Bag-of-Weighted-Words (NBOW2) model is then proposed to improve the learning of these discriminative context representations. Techniques to train these models effectively are presented, and it is shown that they outperform the generic topic-space representations and word embeddings. Analysis reveals that the NBOW2 model successfully assigns a degree of importance to input words and is able to capture task-specific keywords.
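The contrast between the two models can be sketched as a forward pass (the dimensions and random parameters below are illustrative; in the thesis the parameters are learned by backpropagation to maximise retrieval performance): NBOW averages the word vectors of a document uniformly, while NBOW2 additionally learns a scalar importance weight per word and takes an importance-weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_classes = 10, 4, 3

E = rng.normal(size=(vocab_size, dim))   # word embedding matrix (learned)
a = rng.normal(size=dim)                 # NBOW2 importance vector (learned)
W = rng.normal(size=(dim, n_classes))    # output classifier weights (learned)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nbow_forward(word_ids):
    """NBOW: uniform average of the word vectors, then softmax."""
    z = E[word_ids].mean(axis=0)
    return softmax(z @ W)

def nbow2_forward(word_ids):
    """NBOW2: each word gets a scalar importance weight sigmoid(a . e_w);
    the document vector is the importance-weighted average."""
    vecs = E[word_ids]
    w = 1.0 / (1.0 + np.exp(-vecs @ a))   # per-word importance in (0, 1)
    z = (w[:, None] * vecs).sum(axis=0) / w.sum()
    return softmax(z @ W)

doc = [1, 3, 3, 7]                       # a document as word indices
p_nbow, p_nbow2 = nbow_forward(doc), nbow2_forward(doc)
```

The learned weights `w` are what make the NBOW2 representation inspectable: words with high weight are the task-specific keywords mentioned above.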
A proposed combination of the NBOW and NBOW2 models learns even faster and reaches the best performance obtained with the individual NBOW and NBOW2 models. Experiments on automatic speech recognition of French broadcast news videos demonstrate the effectiveness of the proposed NBOW models, with improvements both in the recovery of OOV proper names and in proper name error rates. Further evaluation of the NBOW2 model on standard text classification tasks, including movie review sentiment classification and newsgroup topic classification, shows that it learns interesting information about the task and achieves the best classification accuracies among state-of-the-art bag-of-words models.