Data Collection - Design and Evaluation of User-Centered Explanations for Machine Learning Model Predictions in Healthcare

6.0 Evaluation

6.1.4 Data Collection

Table 13 summarizes the data collected for each study task. It should be noted that the original scale items for the key UTAUT constructs in the subjective assessment task (performance expectancy, effort expectancy) were experimentally selected from the scale items of constructs from other models of technology acceptance and use (called root constructs). Only a few scale items were originally selected for each key construct and not all scale items were relevant to assess in the context of the proposed experiment (e.g., performance expectancy scale item of “If I use the system, I will increase my chances of getting a raise”). Therefore, for each key construct, I selected a set of scale items from each respective root construct that were relevant to assess in the context of the proposed experiment.

Table 13. Data collected for each study task

Study Task: Background Questionnaire
Data Collected:
- Current clinical position (e.g., resident)
- Length of time in current position (e.g., <1 year)

Study Task: Patient Case Review
Data Collected (for each patient case):
- Time-stamped interactions with application interface (e.g., tab selections, lab tests viewed)
- List of information selected to discuss during rounds
- Urgency decision accuracy (see Figure 20)
- Urgency decision confidence, rated from 1 (not confident at all) to 5 (extremely confident)
- Free-text rationale for urgency decision
- Time (in seconds) to review patient case (excludes verbal case presentation, see Figure 20)
- Audio-recording of verbal patient case presentation
- Moderator notes on interesting comments or behavior during case review

Study Task: Subjective Assessments
Data Collected (for “prediction only” and “explanation” displays):
- Selected UTAUT Root Construct Scale Items for Performance Expectancy [38] (Likert scale agreement):
  1. Using the system would enable me to accomplish tasks more quickly.
  2. Using the system would make it easier to do my job.
  3. Using the system would increase my productivity.
  4. I would find the system useful in my job.
- Selected UTAUT Root Construct Scale Items for Effort Expectancy [38] (Likert scale agreement):
  1. My interaction with the system would be clear and understandable.
  2. I would find the system easy to use.
  3. It would be easy for me to become skillful at using the system.
- Free-text feedback on the display (optional)

6.1.5 Data Analysis

Audio recordings of all verbal case presentations were transcribed verbatim and compiled with urgency decision rationales and moderator notes for each case. Answers to background questionnaires were summarized in a contingency table. Based on the background questionnaire responses, two levels of clinical experience (residents and fellows/attendings) were defined for use in analyses. Primary outcomes of interest included the impact of the user-centered explanation display on decision accuracy, decision confidence, case review efficiency, and provider perceptions of the pediatric ICU in-hospital mortality risk model. Analyses for each outcome are summarized in Table 14 and described in the next few sections. P-values of <0.05 were considered significant for all statistical analyses, which were carried out using Stata version 15 [133]. Plots were generated using the Python packages seaborn version 0.9.0 [134] and matplotlib version 3.0.3 [135].

Table 14. Summary of analyses examining the impact of the user-centered explanation display on outcomes

| Outcome | Display Comparison Groups | Metrics | Analytic approach |
|---|---|---|---|
| Decision accuracy | “No model”, “Prediction only” | Urgency decision accuracy | Proportion of correct decisions with 95% CI; logistic mixed effects analysis |
| | | Precision and recall in selecting relevant information | Visual review of violin plots |
| | | Mentions of predictive model in rationales, transcripts, or notes | Qualitative review to assist in interpretation of quantitative results |
| Decision confidence | “No model”, “Prediction only” | Urgency decision confidence | Visual review of stacked bar charts; ordinal logistic mixed effects analysis |
| | | Mentions of predictive model in rationales, transcripts, or notes | Qualitative review to assist in interpretation of quantitative results |
| Case review efficiency | “No model”, “Prediction only” | Time to review patient case | Descriptive statistics; log-linear mixed effects analysis |
| | | Number of unique items viewed (computed from interactions data) | Descriptive statistics; Poisson mixed effects analysis |
| | | Total number of items viewed (computed from interactions data) | Descriptive statistics; negative binomial mixed effects analysis |
| Provider perceptions | “Prediction only” | UTAUT questionnaire responses | Visual review of stacked bar charts |
| | | Free-text feedback on displays and moderator notes | Qualitative review for insights about participant perceptions of predictive model |

Analysis of decision accuracy

Decision accuracy included participant accuracy in urgency decisions (i.e., identifying patients who need to be seen urgently) as well as in selecting relevant information to discuss with the rounding team. To evaluate urgency decision accuracy, the proportion of correct decisions with a 95% CI was calculated for each of the three displays, and a logistic mixed effects analysis of the relationship between urgency decision accuracy and display was performed. Display, case urgency (urgent, non-urgent), and participant experience (resident, attending/fellow) were included as fixed effects in the model (no interaction terms), and an intercept for participant was included as a random effect. To assess accuracy in selecting relevant information, participant precision and recall in selecting ‘relevant’ items were calculated, where information items selected by a senior pediatric ICU attending using the “explanation” display served as the gold standard. Precision and recall scores for each display were visualized using violin plots. Urgency decision rationales, case presentation transcripts, and moderator notes were reviewed for mentions of the predictive model tool and to assist in interpretation of the results.
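The two accuracy metrics described here can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study (which was run in Stata). The source does not say which method produced the 95% CIs, so the Wilson score interval below is an assumption, and the item names and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, total, level=0.95):
    """Wilson score interval for a proportion (assumed CI method, not stated in the source)."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    p_hat = correct / total
    denom = 1.0 + z * z / total
    centre = (p_hat + z * z / (2.0 * total)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)) / denom
    return centre - half_width, centre + half_width


def precision_recall(selected, gold):
    """Precision and recall of a participant's selected items against a gold-standard set."""
    selected, gold = set(selected), set(gold)
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical counts: 24 correct urgency decisions out of 30 responses for one display.
ci_low, ci_high = wilson_interval(24, 30)
print(f"proportion correct = {24 / 30:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")

# Hypothetical item names; the gold standard stands in for the senior attending's selections.
gold_items = {"lactate", "heart_rate", "creatinine", "ventilator_settings"}
participant_items = {"lactate", "heart_rate", "temperature"}
precision, recall = precision_recall(participant_items, gold_items)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision penalizes selecting items the gold standard did not include, while recall penalizes missing items the gold standard did include, which is why both are reported per display.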

Analysis of decision confidence

To assess the relationship between the display shown and participant-reported confidence in their urgency decision, confidence ratings for each of the displays were visualized in a stacked bar chart and an ordinal logistic mixed effects analysis was performed. Display, case urgency (urgent, non-urgent), and participant experience (resident, attending/fellow) were included as fixed effects in the model (no interaction terms), and an intercept for participant was included as a random effect. Urgency decision rationales, case presentation transcripts, and moderator notes were reviewed for mentions of the predictive model tool and to assist in interpretation of the results.
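For reference, a random-intercept ordinal logistic model of this kind is commonly written in proportional-odds form. The notation below is an assumption for illustration; the source does not state the exact parameterization used. Here $Y_{ij}$ is the confidence rating (1 to 5) given by participant $j$ on case $i$, $\mathbf{x}_{ij}$ holds the indicators for display, case urgency, and participant experience, $\kappa_k$ are the cutpoints, and $u_j$ is the participant-level random intercept.

```latex
\operatorname{logit}\,\Pr\left(Y_{ij} \le k \mid u_j\right)
  = \kappa_k - \left(\mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j\right),
  \qquad k = 1, \dots, 4,
  \qquad u_j \sim \mathcal{N}\left(0, \sigma_u^2\right)
```

Under this sign convention, a positive coefficient in $\boldsymbol{\beta}$ shifts probability toward higher confidence ratings.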

Analysis of case review efficiency

Case review efficiency consisted of the time it took participants to review each patient case and the amount of information being viewed, which was measured by the number of items (e.g., lab test, vital sign) viewed during the case. Descriptive statistics were used to summarize the case review time, number of unique items viewed, and the total number of items viewed. To assess the relationship between the display shown and case review time, a log-linear mixed effects analysis was performed after it was determined that case review time followed a log-normal distribution.
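Because the outcome in a log-linear model is the logarithm of review time, a fixed-effect coefficient is read as a multiplicative change in time rather than a difference in seconds. The sketch below shows that conversion; the coefficient value is hypothetical and is not a result from this study.

```python
import math


def ratio_from_log_coef(beta):
    """Multiplicative change in the untransformed outcome for a one-unit change in the predictor."""
    return math.exp(beta)


def percent_change_from_log_coef(beta):
    """The same effect expressed as a percent change in the untransformed outcome."""
    return (math.exp(beta) - 1.0) * 100.0


# Hypothetical coefficient for one display relative to the reference display.
hypothetical_beta = -0.10
print(f"time ratio = {ratio_from_log_coef(hypothetical_beta):.3f}")
print(f"percent change = {percent_change_from_log_coef(hypothetical_beta):.1f}%")
```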

To assess the relationship between the display shown and the number of unique items viewed, a Poisson mixed effects analysis was performed. To assess the relationship between the display shown and the total number of items viewed, a negative binomial mixed effects analysis was performed after it was determined that the distribution of the total number of items was over-dispersed (mean=33.0; variance=206.3). For all three models, display, case urgency (urgent, non-urgent), participant experience (resident, attending/fellow), and case order (i.e., the order in which the case was seen by a participant) were included as fixed effects (no interaction terms) and an intercept for participant was included as a random effect.
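The over-dispersion reasoning can be illustrated with the reported summary statistics: a Poisson distribution has variance equal to its mean, so a variance-to-mean ratio well above 1 argues for a negative binomial model. The sketch below is a rough marginal check only. It ignores covariates and the participant random effect, the method-of-moments dispersion estimate assumes the common NB2 form (variance = mean + alpha * mean^2), and the raw counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(mean_count, variance_count):
    """Variance-to-mean ratio; roughly 1 for Poisson data, above 1 when over-dispersed."""
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    return variance_count / mean_count


def nb2_alpha_moments(mean_count, variance_count):
    """Method-of-moments alpha under Var = mu + alpha * mu**2 (NB2 form, illustrative only)."""
    return (variance_count - mean_count) / mean_count ** 2


# Summary statistics reported in the text for the total number of items viewed.
reported_mean, reported_variance = 33.0, 206.3
ratio = dispersion_ratio(reported_mean, reported_variance)
alpha = nb2_alpha_moments(reported_mean, reported_variance)
print(f"variance/mean = {ratio:.2f}")  # about 6.25, well above the Poisson value of 1
print(f"moment-based alpha = {alpha:.3f}")

# The same check starting from raw counts (hypothetical values).
hypothetical_counts = [12, 20, 25, 31, 33, 40, 48, 55]
sample_ratio = dispersion_ratio(mean(hypothetical_counts), variance(hypothetical_counts))
print(f"sample variance/mean = {sample_ratio:.2f}")
```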

Analysis of provider perceptions

Responses to the UTAUT scale items for the “explanation” and “prediction only” displays were visualized and compared using stacked bar charts. Free-text feedback on displays and moderator notes were qualitatively reviewed to assist in the interpretation of the UTAUT questionnaire responses and to identify additional insights about participant perceptions of the pediatric ICU in-hospital mortality risk model and the displays.

6.2 Results

A total of 15 participants were recruited for this study. Responses to the background questionnaire on clinical experience are summarized in Table 15. As per the study design, each participant reviewed and provided responses for 6 patient cases. Due to a technical error, one participant was unable to complete one of their assigned cases. Thus, there were a total of 89 participant responses for the patient cases. The breakdown of case responses by display and case urgency is shown in Table 16. In Sections 6.2.1 to 6.2.3, I describe the results from the analyses on decision accuracy and confidence, case review efficiency, and provider perceptions of the model, respectively.

Table 15. Summary of participant clinical experience

| Position | Time in current position: <1 year | 1 to <2 years | 2 to <3 years | Total |
|---|---|---|---|---|
| Attending | 1 | 0 | 0 | 1 |
| Fellow | 1 | 5 | 1 | 7 |
| Resident | 0 | 2 | 5 | 7 |

Table 16. Participant responses by case urgency and display

| Display | Case urgency: Non-urgent | Urgent | Total |
|---|---|---|---|
| No model | 14 | 15 | 29 |
| Prediction only | 15 | 15 | 30 |
| Explanation | 15 | 15 | 30 |
| Total | 44 | 45 | 89 |
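As a quick consistency check, the margins of Table 16 can be recomputed from its cells and compared with the response count implied by the design (15 participants, 6 cases each, one case not completed). The dictionary layout is only an illustrative way to hold the table.

```python
# Cell counts copied from Table 16 (display by case urgency).
counts = {
    "No model": {"Non-urgent": 14, "Urgent": 15},
    "Prediction only": {"Non-urgent": 15, "Urgent": 15},
    "Explanation": {"Non-urgent": 15, "Urgent": 15},
}

# Row totals (per display) and column totals (per case urgency).
row_totals = {display: sum(cells.values()) for display, cells in counts.items()}
col_totals = {
    urgency: sum(cells[urgency] for cells in counts.values())
    for urgency in ("Non-urgent", "Urgent")
}
grand_total = sum(row_totals.values())

# Responses implied by the study design: 15 participants x 6 cases, minus 1 incomplete case.
expected_total = 15 * 6 - 1

print(row_totals)
print(col_totals)
print(grand_total == expected_total)
```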

In document Design and Evaluation of User-Centered Explanations for Machine Learning Model Predictions in Healthcare (Page 109-114)