In this section, we re-evaluate the Graph Inference model using the new relevance assessments. For clarity, the original relevance assessments from TREC MedTrack are denoted “TREC” qrels, while the new relevance assessments provided by University of Queensland medical students are denoted “UQ” qrels.⁹
7.4.1 Retrieval Results
Table 7.5 presents the retrieval results of the GIN (lvl1, lvl2) and the Bag-of-concepts baseline (lvl0) using the old qrels (TREC) and the new qrels (TREC + UQ). The percentages indicate how each measure changed between the old and new qrels.
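As a minimal sketch of how such percentages can be derived, the snippet below computes the change relative to the TREC value. The function name `relative_change` is illustrative and not from the original; the assumption that the change is expressed relative to the old (TREC) score is consistent with the values reported in Table 7.5.

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change from `old` to `new`, relative to `old`.

    Assumption: the reported percentages are relative to the TREC score.
    """
    return 100.0 * (new - old) / old


# lvl1 precision @ 10, TREC versus TREC + UQ (values from Table 7.5)
lvl1_p10 = relative_change(0.4481, 0.5037)
print(f"lvl1 P@10 change: {lvl1_p10:+.0f}%")  # prints "lvl1 P@10 change: +12%"
```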
⁹ In TREC, the term “qrels” is often used to denote relevance assessments; henceforth we adopt this terminology.

Qrel set    System   Bpref          P@10             P@20
TREC        lvl0     0.4309         0.5123           0.4389
TREC        lvl1     0.4294         0.4481           0.4086
TREC        lvl2     0.4208         0.4247           0.3630
TREC + UQ   lvl0     0.4252 (-1%)   0.5415 (+6%)†    0.4732 (+8%)†
TREC + UQ   lvl1     0.4264 (0%)    0.5037 (+12%)†   0.4604 (+12%)†
TREC + UQ   lvl2     0.4113 (-2%)   0.4878 (+15%)†   0.4220 (+16%)†

Table 7.5: Retrieval results using old (TREC) and combined (TREC + UQ) qrels. The percentages indicate how each measure changed using the combined qrels. † indicates statistically significant differences between the TREC and TREC + UQ qrel sets (paired t-test, p < 0.05).

[Plot: precision @ 20 (y-axis, 0.0 to 1.0) against queries (x-axis, 0 to 80); series: lvl1 − TREC and lvl1 − TREC + UQ]
Figure 7.5: Graph Inference model performance on individual queries under the old (TREC) and new (TREC + UQ) qrels. A greater number of improvements was observed for hard queries.

Considering bpref, there was little change in overall effectiveness using the new qrels. This is not surprising: bpref considers only judged documents, so the large number of unjudged documents in the TREC qrels did not significantly affect this evaluation measure. However, for precision @ 10 and precision @ 20, all three systems were deemed more effective when evaluated with the new qrels. The percentages indicate by how much the effectiveness of each system was underestimated using only the TREC qrels. Effectiveness was underestimated for all three systems, but markedly more so for the GIN. Furthermore, lvl2, which leverages more of the GIN inference mechanism, was underestimated more than lvl1. This indicates that lvl2 was returning a larger number of unjudged but relevant documents.
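The contrast between the two kinds of measure can be illustrated with a small sketch. It assumes the common convention that precision @ k treats unjudged documents as non-relevant, and it uses one widely used formulation of bpref (the penalty is capped at R and divided by min(R, N), where R and N are the numbers of judged relevant and judged non-relevant documents). The document identifiers and judgements are hypothetical and not taken from the evaluation.

```python
def precision_at_k(ranking, qrels, k):
    """Precision @ k; documents absent from `qrels` (unjudged) count as non-relevant."""
    return sum(1 for doc in ranking[:k] if qrels.get(doc, 0) > 0) / k


def bpref(ranking, qrels):
    """bpref over judged documents only; unjudged documents are skipped entirely."""
    num_rel = sum(1 for label in qrels.values() if label > 0)
    num_nonrel = sum(1 for label in qrels.values() if label == 0)
    if num_rel == 0:
        return 0.0
    denom = min(num_rel, num_nonrel)
    nonrel_seen = 0  # judged non-relevant documents ranked so far
    total = 0.0
    for doc in ranking:
        if doc not in qrels:
            continue  # unjudged: no effect on bpref
        if qrels[doc] > 0:
            penalty = min(nonrel_seen, num_rel) / denom if denom > 0 else 0.0
            total += 1.0 - penalty
        else:
            nonrel_seen += 1
    return total / num_rel


# Hypothetical ranking: d2 and d3 are unjudged in the old qrels.
ranking = ["d1", "d2", "d3", "d4"]
trec_qrels = {"d1": 1, "d4": 0}
combined_qrels = {"d1": 1, "d2": 1, "d3": 0, "d4": 0}

print(precision_at_k(ranking, trec_qrels, 4))      # 0.25
print(precision_at_k(ranking, combined_qrels, 4))  # 0.5
print(bpref(ranking, trec_qrels))                  # 1.0
print(bpref(ranking, combined_qrels))              # 1.0
```

In this toy case, judging the previously unjudged documents doubles precision @ 4 while leaving bpref unchanged, which mirrors the pattern described above.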
Considering only precision @ 20, Figure 7.5 shows how the performance of individual queries changed between the old and new qrels. A significant number of queries showed improved performance using the new qrels, with only a handful showing degradation. Additionally, a greater number of improvements was observed for hard queries (those with poor performance using the TREC qrels; right-hand side of the plot). This highlights that hard queries were the ones whose performance was most underestimated.
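A per-query comparison of this kind can be sketched as follows. The query identifiers and scores are hypothetical, and ordering queries by their score under the old qrels is only one possible way to separate hard from easy queries.

```python
def per_query_changes(old_scores, new_scores):
    """Return (query, old, new, delta) rows, hardest queries (lowest old score) first."""
    rows = [
        (query, old_scores[query], new_scores[query], new_scores[query] - old_scores[query])
        for query in old_scores
        if query in new_scores
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical precision @ 20 values per query under the old and new qrels.
old_scores = {"q1": 0.80, "q2": 0.10, "q3": 0.45, "q4": 0.05}
new_scores = {"q1": 0.80, "q2": 0.30, "q3": 0.40, "q4": 0.25}

rows = per_query_changes(old_scores, new_scores)
improved = [query for query, _, _, delta in rows if delta > 0]
degraded = [query for query, _, _, delta in rows if delta < 0]
print(improved)  # ['q4', 'q2']
print(degraded)  # ['q3']
```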
7.4.2 Analysis and Discussion
Besides the quantitative relevance assessments, assessors also provided substantial qualitative comments regarding their relevance choices. This feedback highlighted how the notion of relevance within medical IR can be complex and subjective.
Assessors worked together in the same room and at times discussed their decisions regarding relevance assessments. Although they were confident in their assessments, they stated that the interpretation of the query was subjective and often required careful consideration of different possible interpretations. For control query 101: Patient with Hearing Loss, assessors debated whether a patient born deaf could be considered as exhibiting hearing loss. (Technically, if they never had any hearing, then they never had a loss of hearing.) One assessor marked such a document as relevant, while another assessor marked it as not relevant. A medical encyclopaedia was consulted and assessors agreed to treat patients born deaf as relevant. This disagreement could be identified and resolved for the control queries, where assessors judged the same documents, but not for the actual queries, where there was no overlap.

The task description given to assessors (recruitment of patients matching certain inclusion criteria for clinical trials) also affected their decisions regarding relevance. Certain documents described patients who had hearing loss on admission, but whose hearing loss was treated and resolved by discharge. In this case, assessors decided these patients would not be eligible for the clinical trial and the documents were therefore not relevant to the query. For other tasks, for example finding how hearing loss is treated, these documents may have been highly relevant. These cases highlight the complex and often subjective information needs of clinical information retrieval.
Queries with multiple dependent aspects prompted more debate among assessors and were also among the hardest queries (in terms of lower performance in the empirical evaluation). The second control query (query 102: Patients with complicated GERD who receive endoscopy) was one example. Gastroesophageal reflux disease (GERD) occurs when stomach acid comes up from the stomach into the esophagus. It is a common condition and is therefore found in many patients’ records. The difficulty in interpreting this query was whether the endoscopy was performed because of the GERD or for some other, unrelated condition. There were a number of documents where patients had GERD but received the endoscopy for another reason; these were marked as not relevant. A similar case was query 103: Hospitalized patients treated for methicillin resistant Staphylococcus aureus (MRSA) endocarditis. Here, some documents mentioned both endocarditis and MRSA, but the MRSA was not the cause of the endocarditis. Again, these documents were marked as not relevant. These queries all have multiple dependent aspects; even if both aspects are present in a document, that document may still not be relevant unless the dependence between them can be determined.
Temporality also played a significant role in relevance assessments. The most common situation was when information pertaining to the query was found in the patient’s past medical history section. Assessors had to decide whether the information was still valid. Some conditions are ongoing, for example GERD, so the fact that the condition was stated in past medical history does not affect the relevance of the document; others are temporary and are unlikely to still be valid. In certain cases, assessors consulted the actual dates of the past medical history information to determine how recent the information was and whether it might still apply.
Simulated Precision Revisited
In Section 7.2.1 we provided a simulated precision @ 20 measure, estimating the result that would be obtained if complete judgements were available for the top 20 rank positions. We revisit that analysis here in light of the actual results obtained.
The correlation coefficient between the simulated performance estimate and the actual performance estimate was 0.92, whereas the correlation coefficient between the original performance estimate and the actual performance estimate was 0.89. This indicates that the simulated estimate was more accurate than the original estimate. A plot comparing the three estimates (original, simulated and actual) for individual queries is shown in Figure 7.6. The simulated estimate generally follows the trend of the actual estimate, except for a few cases where the actual estimate was lower than the original estimate. Although the simulated estimate diverges from the actual estimate in these cases, it does provide a more accurate estimate of retrieval effectiveness than the original estimate that used the relevance judgements from TREC MedTrack. It can, therefore, be used as one possible indicator of retrieval effectiveness when large numbers of unjudged documents are retrieved by a system.
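The text does not state which correlation coefficient was used. As an illustration only, the sketch below assumes Pearson's r computed over per-query precision @ 20 values; the function name `pearson_r` and the example scores are hypothetical and do not reproduce the reported figures.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-query precision @ 20 under each estimate.
actual = [0.20, 0.50, 0.90, 0.40, 0.65]
simulated = [0.15, 0.45, 0.85, 0.45, 0.60]
original = [0.05, 0.40, 0.80, 0.20, 0.60]

print(f"simulated vs actual: {pearson_r(simulated, actual):.2f}")
print(f"original vs actual:  {pearson_r(original, actual):.2f}")
```

A higher coefficient for the simulated estimate would correspond to the pattern reported above, although correlation measures agreement in trend rather than agreement in absolute value.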
[Plot: precision @ 20 (y-axis, 0.0 to 1.0) against queries (x-axis, 0 to 80); series: Original (TREC), Simulated, and Actual (TREC + UQ)]
Figure 7.6: Per-query precision @ 20 retrieval effectiveness comparing the original qrels from TREC, simulated performance and actual performance using TREC + UQ qrels.