Chapter 5 Support in the Form of Guidance: A Novel Interface for Multi-
5.10 Discussion
5.10.2 Jargon vs Accuracy
As mentioned before, the machine seems to do better in the Stripe and Rust files than in the Ubuntu-Meeting and MediaWiki files. One potential explanation is that people might not be familiar with the jargon, and so might take more time and/or be more confused when reading these messages. Since our machine models were trained on messages from the #Ubuntu channel, they might be better at disentangling channels whose messages resemble those from the #Ubuntu channel. However, if people are less distracted or confused by the jargon, they can better disentangle the conversations using other relevant context, such as the natural language around the jargon itself. If we can better profile and pre-analyze the entangled text, perhaps we can create human-machine pairings that achieve even better accuracy.
5.10.3 Annotator Feedback
In Section 5.8.2, we saw a few themes emerge from the post-task surveys. Here, we address potential ways to ameliorate annotator concerns.
Task Difficulty
• Understanding task rules: One way to lower the barrier to entry for this task is to make the interactive tutorial longer, or to have annotators complete a tutorial on more than one sample file, so that the “rules” become clearer. Because many edge cases can come into play, another option is to simplify the rules (e.g., if certain edge cases do not impact the disentanglement outputs, then relaxing the corresponding rule reduces the number of rules annotators need to remember during the task).
• Parsing technical jargon: A few annotators mentioned being unfamiliar with the jargon in the text. One way to help is to provide examples or a dictionary where annotators can look up unfamiliar terms. In a large-scale annotation effort, letting annotators ask the channel owner follow-up questions can also be helpful.
• Picking up the conversational context: This theme is hard to overcome immediately, as gaining context involves spending some time with the conversations to start figuring out topics, number of threads, etc. One way to make this effort easier is to ask annotators to focus on just one conversation; that way, they can scan only for the particular conversation they’ve been assigned and will not need to keep track of other threads. Another approach, as suggested by P6, is to provide short summaries of the files before annotators start the task, to help them form a mental model. As P6 wrote, “I wish I could’ve had a sense of a general topic of the conversation (e.g., this is about meetings or Ubuntu or whatever) just to have a better sense of the terminology to look for as I was about to start the task.”
• One-to-many responses: Similar to the context theme, it is difficult to disentangle messages from users who speak to many other users. One way to mitigate this burden is to highlight related messages based not just on the users, but also on the content within those messages.
• Bots created additional confusion: Annotators were confused as to how to treat bot messages. Although the instructions included examples of what to do with “ubottu,” an Ubuntu bot, it is clear that annotators could have benefited from more examples.
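The content-based highlighting suggested under one-to-many responses could be prototyped along the following lines. This is a minimal sketch, not the interface's actual highlighting logic: the function names, the token-overlap (Jaccard) measure, and the 0.2 threshold are all illustrative assumptions.

```python
import re


def tokenize(text):
    """Lowercase a message and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_messages(target, candidates, threshold=0.2):
    """Return (index, score) pairs for candidate messages whose token
    overlap with the target meets the threshold, highest score first.

    The threshold is a hypothetical default, not a tuned value.
    """
    target_tokens = tokenize(target)
    scored = []
    for i, text in enumerate(candidates):
        score = jaccard(target_tokens, tokenize(text))
        if score >= threshold:
            scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A real deployment would likely combine such a content score with the existing user-based cues rather than replace them, since short chat messages often share few tokens even when they belong to the same thread.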
Interface Difficulty
Annotators using the IS interface had positive views of the interactional slingshots present in it, but the usability of the interface could still be improved. In particular, we restricted undo to the most recent annotation only, rather than allowing annotators to undo any prior annotation. This was a design decision to remove freedom from the annotator’s side (to avoid the ill effects that arose when annotators had more freedom, as seen in Chapter 3). However, since these annotators have domain knowledge, perhaps relaxing this constraint can lead to better outcomes. Moreover, improving the instructions and providing video examples can also help annotators better navigate the interface.
Cognitive Load
Even given the difficulties faced, we find that our interface helps annotators to better disentangle files in channels that are harder in nature (e.g., MediaWiki). In the harder tasks, there are more undirected messages, fewer users, and more errors made by the machine suggestions; nevertheless, we find higher accuracies. One potential explanation is that, although people took more time to do the annotations, this slowdown may have helped annotators think more critically about the task itself.
This leads us to posit that the expert annotators were managing two sources of cognitive load [15]: intrinsic cognitive load from the difficulty of the task itself, and extrinsic cognitive load from how the interface was guiding their interactions. Not only is the task itself difficult, but the annotators also had to use the interactional slingshot support to evaluate whether the guidance was accurate or not, and then make an annotation. P16 expresses a sentiment shared by a few other annotators, namely the issue of trusting the machine’s suggestions. They write, “I think the highlighting was done well, that helped a lot. The highlights actually I found to be wrong most of the time. I honestly relied on the time stamps, those disentagled [sic] it for me more than anything else.”
Managing both the intrinsic and extrinsic sources of cognitive load can also help explain why the experts found less success in the jargon-heavy files, since the annotators have to expend more effort to understand the different threads and content in the file. This makes them more prone to errors, something that can also be explained by viewing this slowdown through the Expert Reversal effect [50]. Experts have the externally provided guidance from the interactional slingshots, but also pull their domain knowledge into working memory. By relying on their store of domain knowledge, the expert annotators were slowed down more when they had to double-check the system’s predictions. As a result, the effort to combine these two knowledge structures causes cognitive overload and can also cause slowdowns.
On the other hand, there was no indication from the non-expert annotators that they were being slowed down by the extrinsic cognitive load, apart from the start of the task, when most workers were unsure what to do. As one of the key points of Chapter 2 in How People Learn states, “Experts notice features and meaningful patterns of information that are not noticed by novices” [14]. Because non-expert annotators do not know the subject matter as well, perhaps they were less affected by erroneous system predictions, and so their cognitive load was less impacted as they did this novel task.
A design point to take away from this is that, if experts are working on the task, interactional slingshots that provide support in the form of guidance could throttle back their support to be less guidance-driven and closer to the nudging type of support.