Session JEP orale - O3
Parole spontanée et interaction
Mercredi 11 juin, 10h30–12h30
Caractérisation et détection de parole spontanée dans de larges collections de documents audio
- Vincent Jousse (Laboratoire d'Informatique de l'Université du Maine (LIUM))
- Yannick Estève (Laboratoire d'Informatique de l'Université du Maine (LIUM))
- Frédéric Béchet (Laboratoire d'Informatique d'Avignon (LIA))
- Thierry Bazillon (Laboratoire d'Informatique de l'Université du Maine (LIUM))
- Georges Linarès (Laboratoire d'Informatique d'Avignon (LIA))
- Résumé : Processing spontaneous speech is one of the many challenges that ASR systems have to deal with. The main phenomena characterizing spontaneous speech are disfluencies (filled pauses, repetitions, repairs and false starts), and many studies have focused on the detection and correction of these disfluencies. In this study we define spontaneous speech as unprepared speech, as opposed to prepared speech, where utterances contain well-formed sentences close to those found in written documents. This paper proposes a set of acoustic and linguistic features that can be used for characterizing and detecting spontaneous speech segments in large audio databases.
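The kind of linguistic-feature-based detection described above can be sketched as follows. This is a minimal illustration, not the authors' system: the filler inventory, feature set and threshold are all invented for demonstration.

```python
# Illustrative sketch (not the paper's method): flagging a transcript
# segment as spontaneous vs. prepared from simple disfluency features.
# The filler list and threshold below are assumptions for demonstration.

FILLERS = {"uh", "um", "euh", "ben"}  # hypothetical filled-pause inventory

def disfluency_features(tokens):
    """Return (filled-pause rate, immediate-repetition rate) for a segment."""
    n = len(tokens)
    filler_rate = sum(t in FILLERS for t in tokens) / n
    repeat_rate = sum(a == b for a, b in zip(tokens, tokens[1:])) / n
    return filler_rate, repeat_rate

def is_spontaneous(tokens, threshold=0.05):
    """Flag the segment as spontaneous if disfluency markers are frequent."""
    filler_rate, repeat_rate = disfluency_features(tokens)
    return filler_rate + repeat_rate > threshold

prepared = "the government announced a new budget this morning".split()
spontaneous = "uh i i mean uh we could could maybe go there".split()
print(is_spontaneous(prepared), is_spontaneous(spontaneous))  # → False True
```

A real detector would combine such linguistic cues with the acoustic features the paper also proposes, and learn the decision boundary from data rather than fix a threshold.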
Penser tout haut. Analyse multimodale de fins de séquences
- Gaëlle Ferré (Université de Nantes)
- Résumé : This paper proposes a multimodal analysis of thoughts spoken out loud, by which speakers show some degree of inattention to the current conversation. The sequences under study were detected in the video files of the conversational corpus CID, recorded at the LPL, mainly thanks to the unfocused and fixed gaze of speakers. An analysis showed that all the sequences under study share prosodic properties (in terms of F0 and intensity range and span, and the presence of pauses before and after the sequences). Gesturally speaking, the speakers' fixedness of gaze is paralleled by a completely relaxed attitude of the body and the absence of any hand gesture. A discourse analysis of the utterances shows that they are all post-closing sequences (rather than closing sequences proper) and that they appear before a topic change.
Perception de la communication expressive : Icônes Gestuelles statiques vs. dynamiques du « Feeling of Thinking »
- Anne Vanpé (GIPSA-lab, Département Parole et Cognition (ex-ICP), UMR 5216 CNRS/INPG/UJF/Stendhal)
- Résumé : Most studies concerning expressive communication concentrate on the (visual/vocal/auditory) expressions of the speaker while he is talking. But information about what a speaker is doing while he is not talking is also important. We first built an empirical methodology of ethograms for information about the (non)talker's mental or affective states, which we called the 'Feeling of Thinking'. We then confronted some of the identified Gestural Icons with a perceptual validation of their relevance, in an association task with the subject's self-annotation labels. We tested: (1) the static and dynamic forms of the Icons; (2) three presentation conditions: whole face, upper part of the face only, and lower part of the face only. The Icons were globally well identified, and can consequently be considered relevant. Moreover, our results showed the importance of dynamics for the perception of the 'Feeling of Thinking' and called into question the additivity of the upper and lower parts of the face in terms of affective information.
Composition sémantique pour la compréhension de la parole dans un cadre de dialogue
- Frédéric Duvert (Laboratoire d'Informatique d'Avignon)
- Marie-Jean Meurs (Laboratoire d'Informatique d'Avignon)
- Christophe Servan (Laboratoire d'Informatique d'Avignon)
- Frédéric Béchet (Laboratoire d'Informatique d'Avignon)
- Fabrice Lefèvre (Laboratoire d'Informatique d'Avignon)
- Résumé : A knowledge representation formalism for SLU is introduced. It is used for incremental and partially automated annotation of the MEDIA corpus in terms of semantic structures. An automatic interpretation process is described for composing semantic structures from basic semantic constituents, using patterns involving constituents and words. The process has procedures for obtaining semantic compositions and for generating Frame hypotheses by inference. This process is evaluated on a dialogue corpus manually annotated at the word and semantic constituent levels.
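The composition step described above can be sketched as pattern matching over detected constituents. This is a hedged toy illustration; the frame name, concept labels and pattern format are invented and do not reflect the actual MEDIA annotation scheme.

```python
# Hypothetical sketch of pattern-based semantic composition: basic concept
# constituents detected in an utterance are grouped into a frame hypothesis
# when a composition pattern matches. All names below are illustrative.

PATTERNS = {
    # frame name -> set of constituent concepts that must all be present
    "HOTEL_BOOKING": {"command", "object-hotel", "date"},
}

def compose_frames(constituents):
    """Return frame hypotheses whose required concepts are all instantiated."""
    concepts = {concept for concept, _ in constituents}
    return [frame for frame, required in PATTERNS.items()
            if required <= concepts]

utterance = [("command", "je voudrais réserver"),
             ("object-hotel", "un hôtel"),
             ("date", "le douze juin")]
print(compose_frames(utterance))  # → ['HOTEL_BOOKING']
```

The paper's process is richer (patterns involve words as well as constituents, and hypotheses are generated by inference), but the same match-then-compose structure applies.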
Approche non supervisée pour la gestion stochastique du dialogue
- Fabrice Lefèvre (LIA - Université d'Avignon)
- Renato De-Mori (LIA - Université d'Avignon)
- Résumé : Following recent studies in stochastic dialog management, an unsupervised approach is proposed to reduce the cost and complexity of setting up a probabilistic POMDP-based dialog manager. A first decoding step derives basic semantic constituents from user utterances. Then the isolated units, along with some relevant context features (previous system actions, previous user utterances...), are combined into vectors representing the current dialog situation. A clustering step is performed on these vectors, and each partition of the derived space represents a particular dialog state. Any new utterance can be classified according to these automatic states, and the system belief can be updated before the POMDP-based dialog manager takes a decision on the best next action to perform. The approach is applied to the challenging French MEDIA task (tourist information and hotel booking). The MEDIA training corpus is semantically rich (over 80 basic concepts) and its 10k utterances are segmentally annotated in terms of basic concepts. Some insights on the method's effectiveness are obtained by analyzing the convergence of POMDP model training using user simulation.
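The cluster-then-classify pipeline described above can be sketched as follows. This is a minimal, self-contained illustration under invented assumptions (toy 2-D vectors, simple deterministic k-means initialization), not the paper's actual feature layout or clustering setup.

```python
# Sketch of unsupervised dialog-state construction: dialog-situation vectors
# are clustered, each cluster standing for an automatic dialog state, and a
# new utterance's vector is assigned to the nearest cluster centroid.
# All vectors and dimensions below are invented for illustration.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Plain k-means with a simple deterministic initialization."""
    centroids = vectors[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist(v, centroids[i]))].append(v)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else old
                     for g, old in zip(groups, centroids)]
    return centroids

def dialog_state(vector, centroids):
    """Classify a new dialog-situation vector into an automatic state."""
    return min(range(len(centroids)), key=lambda i: dist(vector, centroids[i]))

# toy dialog-situation vectors, e.g. (concept count, previous action id)
vectors = [(1.0, 0.0), (1.2, 0.1), (5.0, 4.0), (5.2, 4.1)]
centroids = kmeans(vectors, k=2)
print(dialog_state((5.1, 4.0), centroids))
```

In the approach described, the state returned by this classification step would then drive the POMDP belief update before the manager selects its next action.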
Emotions actées vs. spontanées : variabilité des compétences perceptives
- Nicolas Audibert (GIPSA-lab Parole & Cognition (ICP), CNRS UMR 5216/U. Stendhal/INPG)
- Véronique Aubergé (GIPSA-lab Parole & Cognition (ICP), CNRS UMR 5216/U. Stendhal/INPG)
- Albert Rilliard (LIMSI-CNRS)
- Résumé : This paper reports the results of a discrimination experiment on acted vs. spontaneous expressive speech by naive listeners. Monoword utterances of 4 French-speaking actors, first trapped in a Wizard of Oz setting and then recorded in an acting protocol supposed to be optimal for them, were extracted from the Sound Teacher/E-Wiz multimodal corpus. Pairs of acted vs. spontaneous stimuli, expressing affective states related to anxiety, irritation and satisfaction, were discriminated by 33 French listeners in audio-only (A), visual-only (V) and audiovisual (AV) conditions. 70% of listeners were able to identify acted vs. spontaneous pairs above chance in V, 78% in A and 85% in AV. A strong listener effect confirms the hypothesis of variable competence in separating involuntary from simulated affects. Perceived differences in emotional intensity appear to be a strong cue for discrimination, but cannot account for the whole variability.