
Stimulation strategies to prompt natural voice interactions with the TV

4 Experimental procedures and results

4.2 Phase two: simulated lab interaction with the system via a TV remote with a microphone

Based on the information gathered in the previous phase, another set of sessions was carried out with groups of 2 to 3 people in a lab context. Prompted by images presented on a monitor, the participants used a TV remote control with a microphone to record the sentence they wanted to say. Although the system was unable to react to the participants’ requests, this pseudo-interaction had the advantage of bringing them closer to a realistic scenario than the setup used in phase one. It also helped to reduce the level of noise and the number of overlapping phrases.

The sessions, of approximately 30 minutes each, took place between October and December 2018. To allow broader coverage of the various domains and their related intents, presented in 3.1, three sets of tasks (A, B and C) were defined, each including 15 images related to the following intents: search content by theme (T1 and T2); open an app by saying its name or by asking for some content (T3 and T4); search TV apps of a certain category (T5); access the EPG of a channel (T6); access the contents of a channel from the last few days (T7); search for content by name (T8); access the TV STB menus (T9 and T10); ask for trending content (T11); search channels by theme (T12); see contents of a certain type on VideoClub (T13); search content of a pre-existing category (T14); change to a specific channel (T15). For example, T1 involved asking for action movies in set A, for musicals in set B and for romance movies in set C.

For each set, two versions of some of the associated tasks were created. In this way, 6 groups of tasks were prepared: A1, A2, B1, B2, C1 and C2, where 1 indicates that all the stimuli presented were based on frames of the user interface of the iTV system and 2 indicates that an alternative stimulus was used for those tasks where we considered it a reliable option. As a result, T1, T2, T3, T4, T6, T12, T14 and T15 had a pictogram alternative, and T8 and T11 had as an alternative frames presenting only video content. This means that T5, T7, T9, T10 and T13 were presented in the same way in the groups numbered 1 and 2, that is, as TV system frames. Given the closer-to-reality simulation, in this phase the dialogue balloons were removed, because they were not considered relevant for understanding the images and their goal (Fig. 2).

Fig. 2. Examples of stimuli presented in task 6 for A1 (left) and A2 (right)
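The design of the six task groups can be written down as a small lookup structure. The following is only an illustrative sketch of the assignment described above; the names `TASKS`, `PICTOGRAM_ALT`, `VIDEO_ALT`, `stimulus_for` and `GROUPS`, as well as the stimulus labels, are hypothetical and were not part of the study materials.

```python
# Illustrative encoding of the task groups A1..C2 (all names are hypothetical).
TASKS = [f"T{i}" for i in range(1, 16)]  # T1 .. T15

# Tasks that had an alternative stimulus in version 2, as listed in the text.
PICTOGRAM_ALT = {"T1", "T2", "T3", "T4", "T6", "T12", "T14", "T15"}
VIDEO_ALT = {"T8", "T11"}


def stimulus_for(task: str, version: int) -> str:
    """Return the stimulus type shown for a task in version 1 or 2 of a set."""
    if version not in (1, 2):
        raise ValueError("version must be 1 or 2")
    if version == 1:
        return "tv_frame"  # version 1 uses TV system frames for every task
    if task in PICTOGRAM_ALT:
        return "pictogram"
    if task in VIDEO_ALT:
        return "video_frame"
    return "tv_frame"  # T5, T7, T9, T10 and T13 had no alternative


# Six groups: three sets (A, B, C) times two versions (1, 2).
GROUPS = {
    f"{task_set}{version}": {task: stimulus_for(task, version) for task in TASKS}
    for task_set in "ABC"
    for version in (1, 2)
}

if __name__ == "__main__":
    print(sorted(GROUPS))       # group labels
    print(GROUPS["A2"]["T6"])   # stimulus type for task 6 in group A2
```

Under this encoding, the version 2 groups keep TV system frames for exactly five tasks, which matches the description of T5, T7, T9, T10 and T13.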

The sample consisted of 18 groups (16 composed of 2 participants and 2 of 3 participants), and each group of tasks was repeated 3 times. In total, 38 participants took part in the study, ranging in age from 17 to 46 years old; 55.3% were male and 44.7% were female. Of the 38 participants, 89.5% reported having a pay-TV service.
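The figures describing the sample can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes the reported percentages are rounded to one decimal place; the variable names and the `headcount` helper are hypothetical.

```python
# Consistency check of the sample description (hypothetical names).
groups_of_two, groups_of_three = 16, 2
participants = groups_of_two * 2 + groups_of_three * 3   # 16*2 + 2*3

task_groups = 3 * 2        # sets A, B, C in versions 1 and 2
repetitions = 3            # each group of tasks was repeated 3 times
sessions = task_groups * repetitions


def headcount(percentage: float, total: int) -> int:
    """Convert a reported percentage back into a whole number of people."""
    return round(percentage / 100 * total)


male = headcount(55.3, participants)
female = headcount(44.7, participants)
pay_tv = headcount(89.5, participants)

if __name__ == "__main__":
    print(participants, sessions)   # participants and sessions implied by the text
    print(male, female, pay_tv)     # head counts implied by the percentages
```

The number of sessions implied by the design (six task groups, three repetitions each) equals the 18 participant groups reported, and the male and female head counts add up to the full sample.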

4.2.1 Experimental procedure

The contextualization of the study, the explanation of the procedure and the presentation of the tasks were carried out with the support of a multimedia presentation that also included demonstration videos. The sessions followed these steps: 1 - Welcoming the participants with the offer of a snack and a drink; 2 - Presentation and contextualization of the study; 3 - Explanation of the general procedure (participants were told that the dialogue with the TV should be as natural as possible and that they could discuss ideas with each other; that the phrases should start with “Television ...”; and that there were no right or wrong answers and creativity would be valued); 4 - Completion of the tasks (each session consisted of a set of 15 images, and participants had to say “next” to move to the next image); 5 - Filling in a characterization questionnaire. In addition, the person organizing the test session was responsible for recording the participants’ opinions on what they considered important in an ILN context, which domains/intents they would use most often, and other observations or failures during the test.

4.2.2 Results and discussion

In total, 1882 sentences (49.53 sentences per person) were collected, of which 1862 (49 sentences per person) were unique. The TV system frames yielded 1342 sentences, the pictogram-based stimuli 403 sentences and the video frames 137 sentences.
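The totals above can be reproduced from the per-stimulus counts. The sketch below only re-derives figures from the numbers reported in this paragraph and from the sample size of 38 participants; the dictionary keys and variable names are hypothetical.

```python
# Re-deriving the reported totals from the per-stimulus counts (hypothetical names).
sentences_by_stimulus = {
    "tv_frame": 1342,
    "pictogram": 403,
    "video_frame": 137,
}
participants = 38
unique_sentences = 1862

total = sum(sentences_by_stimulus.values())
repeated = total - unique_sentences              # sentences said more than once
per_person = total / participants
unique_per_person = unique_sentences / participants

# Share of all collected sentences contributed by each stimulus type.
shares = {name: count / total for name, count in sentences_by_stimulus.items()}

if __name__ == "__main__":
    print(total, repeated)
    print(f"{per_person:.2f} sentences per person, {unique_per_person:.0f} unique")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Note that the stimulus types were not shown the same number of times (TV system frames appear in every task of the version 1 groups), so the raw shares describe the composition of the collected corpus rather than the relative effectiveness of each stimulus type.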

Task 8 generated the highest number of sentences (157), with the stimuli presented as TV system frames (81 sentences) performing better than the stimuli presented as video frames (76 sentences). The stimulus based on TV system frames that generated the most sentences was presented in task 15, and the stimulus based on pictograms that generated the most sentences was presented in task 3. The stimuli that generated the most phrases, and phrases with a higher level of complexity, were those whose subjects (football, YouTube, movies) evoked the participants’ interest and/or personal knowledge. In turn, the stimuli that generated fewer sentences were described by participants as covering topics that were unknown or uninteresting to them.

Comparing the performance in the tasks where different stimuli were presented, it was possible to conclude that in some cases pictograms performed better than TV system frames, while in others the opposite happened. Overall, however, TV system frames yielded a higher number of sentences, and it was clear that the abstraction inherent in the pictograms sometimes limited the participants’ reactions. After the first participant said a sentence, the other participant (or participants) tended to feel that the right answer had already been given (a Pictionary game-like effect) and was usually unable to generate more sentences. However, this “Pictionary effect” did not occur when participants had deep knowledge associated with the pictograms shown (e.g. a football club logo generated a large number of sentences from a fan, and a “baby” pictogram generated sentences about movies and children’s channels). In contrast, the TV system frames offered more visual affordances. Even when the most obvious word in the interface was said in the first sentence, there were usually other words or images that other participants could use to generate sentences. This may also explain why, in the tasks where video content frames were used as an alternative to TV system frames, the video frames performed worse.

5 Conclusions

The collection procedures carried out in this study provided important insights for similar future activities. First, images can often be adopted as objects that convey, clearly and unambiguously, the context in which the user is expected to interact, while presenting scenarios open enough to generate a great diversity of phrases (our study showed that the number of repeated sentences in phase two was very small). Regarding the visual stimuli, pictograms proved less effective when participants lacked deep knowledge of the subject that the stimuli represented, although in other cases their higher level of abstraction allowed them to perform better than other stimuli. Participants were also more creative when the content was familiar and they had the information needed to create multiple phrases.

In addition, the simulated interaction environment proved to be a more dynamic way to elicit reactions from participants than the vox pop. It should be emphasized that the phases presented here were carried out with small groups in order to identify general aspects, as part of a broader process of optimizing a methodology. The points discussed are therefore not intended to be representative of a sample, but rather to gather a set of assumptions that ultimately aim at improving the training process, covering the whole process from the analysis of training methodologies to the simultaneous identification of general traits and specifications that must be taken into account.



In document Libro de aplicaciones y usabilidad de la televisión digital interactiva : VIII Conferencia Iberoamericana de Aplicaciones t Usabilidad de la TV Interactiva (jAUTI 2019) (Page 169-173)