JEP oral session - O4

Speech and speaker recognition

Thursday, June 12 - 14:00 to 16:00

Paper 1574 Enrichissement dynamique du vocabulaire à partir du Web (Dynamic vocabulary enrichment from the Web)

Stanislas Oger (Université d'Avignon)

Georges Linarès (Université d'Avignon)

Frédéric Béchet (Université d'Avignon)

Pascal Nocéra (Université d'Avignon)

Abstract: Most Web-based methods for lexicon augmentation capture global semantic features of the target domain in order to collect relevant documents from the Web. We suggest that the local context of out-of-vocabulary (OOV) words contains relevant information about those words. Using this information, we propose to use the Web to build locally augmented lexicons, which are used in a final local decoding pass. We first demonstrate the relevance of the Web for OOV word retrieval. Then, different methods are proposed to retrieve the hypothesis words. Finally, we present the integration of new words into the transcription process, based on part-of-speech models. This technique recovers 7.6% of the significant OOV words, and the accuracy of the system is slightly improved.

article

Paper 1579 Segmentation et regroupement en locuteurs pour la parole conversationnelle (Speaker segmentation and clustering for conversational speech)

Elie El-Khoury (Université Paul Sabatier)

Sylvain Meignier (Université du Maine)

Christine Sénac (Université Paul Sabatier)

Abstract: In the context of the ANR EPAC project, which aims to process conversational speech, we present a hybrid speaker diarization system based on the combination of the LIUM and IRIT systems. It consists of speech detection followed by GLR/BIC segmentation. We then apply BIC clustering followed by CLR clustering. In addition, we improve the system by optimizing the clustering thresholds and by purifying the BIC clustering using the F0 feature. Results show that this hybrid system is suitable both for traditional corpora such as ESTER and for conversational data such as that used in the EPAC project.

article
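
The BIC segmentation criterion named in the abstract of paper 1579 can be illustrated with a minimal sketch. It is not the LIUM or IRIT implementation: it assumes one-dimensional features modelled by a single Gaussian per segment, whereas real systems work on multi-dimensional acoustic vectors. The function name `delta_bic`, the parameter `penalty_weight`, and the toy data are all hypothetical.

```python
import math
from statistics import pvariance


def delta_bic(left, right, penalty_weight=1.0):
    """Delta-BIC for splitting one 1-D Gaussian segment into two.

    A positive value suggests a change point between `left` and `right`.
    Assumes every segment has non-zero variance (log(0) is undefined).
    """
    merged = left + right
    n, n1, n2 = len(merged), len(left), len(right)
    # Likelihood gain from modelling the two halves with separate Gaussians.
    gain = 0.5 * (n * math.log(pvariance(merged))
                  - n1 * math.log(pvariance(left))
                  - n2 * math.log(pvariance(right)))
    # Penalty for the extra parameters: one mean and one variance in 1-D.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


# Toy data (hypothetical): two segments with the same statistics,
# then two segments whose means differ strongly.
same_a = [0.0, 1.0] * 50
same_b = [0.0, 1.0] * 50
shifted = [10.0, 11.0] * 50

no_change = delta_bic(same_a, same_b)
change = delta_bic(same_a, shifted)
```

With this sign convention, `no_change` is negative (the segments are merged) and `change` is positive (a boundary is hypothesized). The weight on the penalty term acts as a tunable threshold, in the same spirit as the clustering thresholds the abstract says were optimized.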

Paper 1604 Modèles discriminants pour la prédiction d'erreur dans les réseaux de confusion (Discriminative models for error prediction in confusion networks)

Alexandre Allauzen (LIMSI-CNRS, Université Paris-Sud)

Abstract: This article addresses error detection for a broadcast news transcription system as a post-processing stage. To estimate the probability of errors, we introduce the use of linear-chain conditional random fields based on features extracted from confusion networks. The linear chain is a discriminative alternative to hidden Markov models for sequence classification. The linear-chain configuration is tested with both real-valued and binarized features, showing a slight impact of binarization on classification performance. To improve our models, the linear chain is then augmented to include dependencies on adjacent feature vectors. Our best model yields an absolute reduction of the classification error rate of 9% compared with the standard ASR output (from 13.9% to 4.7%) and of 6% compared with a logistic regression model trained under the same conditions.

article

Paper 1626 La reconnaissance du locuteur : un problème résolu ? (Speaker recognition: a solved problem?)

Jean-François Bonastre (Université d'Avignon)

Driss Matrouf (Université d'Avignon)

Abstract: This article presents a short summary of the progress made in recent years in speaker recognition. It attempts to show that, despite the impressive gains recorded in terms of error rate reduction, several questions remain open. The paper concludes by opening up a series of research directions for speaker recognition.

article

Paper 1636 Etude de la cohabitation entre la bande large et la bande étroite en reconnaissance automatique de la parole (A study of the coexistence of wideband and narrowband in automatic speech recognition)

Mohamed-Ali Ben-Salah (Orange Labs)

Jean Monné (Orange Labs)

Denis Jouvet (Orange Labs)

Régine André-Obrecht (IRIT-Université Paul Sabatier)

Abstract: In this article we address the coexistence of wideband and narrowband in automatic speech recognition, with the aim of guaranteeing an optimal response from ASR systems to the various types of speech data: data sampled at 8 kHz (narrowband), data sampled at 16 kHz (wideband), and above all false-wideband data, where data presented as wideband (WB) actually result from narrowband coding or transcoding.

article

Paper 1659 Mesures de confiance locales et trame-synchrones (Local and frame-synchronous confidence measures)

Joseph Razik (LORIA)

Odile Mella (LORIA)

Dominique Fohr (LORIA)

Jean-Paul Haton (LORIA)

Abstract: This paper presents several new confidence measures whose major advantage is that they can be evaluated without waiting for the recognition process to complete: either synchronously with the frame being processed by the engine or with a slight delay. Such measures are useful for driving the recognition process by modifying the likelihood score, or for validating recognized words in on-the-fly applications such as keyword spotting and online automatic speech transcription for deaf people. The EER evaluation on a French broadcast news corpus shows performance close to that of the batch version of these measures (23.0% versus 22.0% EER) with only 0.84 s of data before and after the word being analyzed.

article
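
The equal error rate (EER) used to evaluate the confidence measures of paper 1659 can be sketched as follows. This is a generic illustration, not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means a more trustworthy word, and the toy scores are all assumptions.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) at the point where the false-acceptance rate
    and the false-rejection rate are closest.

    scores: confidence values, higher meaning "more likely correct" (assumed).
    labels: True when the recognized word is actually correct.
    Assumes at least one correct and one incorrect word.
    """
    n_correct = sum(labels)
    n_wrong = len(labels) - n_correct
    best = None
    for threshold in sorted(set(scores)):
        # A word is accepted when its confidence reaches the threshold.
        false_accepts = sum(1 for s, ok in zip(scores, labels)
                            if s >= threshold and not ok)
        false_rejects = sum(1 for s, ok in zip(scores, labels)
                            if s < threshold and ok)
        far = false_accepts / n_wrong
        frr = false_rejects / n_correct
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy data (hypothetical): eight recognized words with their confidence
# scores and whether each word was correct.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]

eer, threshold = equal_error_rate(scores, labels)
```

On this toy data the two error rates meet at a threshold of 0.6, where one of four incorrect words is accepted and one of four correct words is rejected, giving an EER of 25%. A lower EER means the confidence measure separates correct from incorrect words better.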