
We perform a focused study

Posted: Sun May 25, 2025 4:55 am
by rochona
We perform a focused study on the most recent BART and BRIO models, which are supervised language models, and on the zero-shot LLMs T0 and GPT-3.5 (davinci-002). The results are shown in the diagram below.


We find that GPT-3.5 does better on protocols where the annotator does not see the reference, as it is not trained to produce such summaries. Furthermore, the references have known issues, noted in prior work, and annotators do not favor them.

We also analyze potential confounding factors in our protocol comparison. We find that an annotator's prior score is a better predictor of their own reference-free score than the scores of other annotators: the correlation between an annotator's prior and reference-free scores is 0.404, while the correlation between the reference-free score and other annotators' scores is only 0.188. This suggests that the prior score influences the annotator's judgment even when they have access to the source documents.
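
To make the confound check concrete, here is a minimal sketch of how such a comparison could be computed. The post does not specify the correlation measure, so Pearson's r is assumed, and the arrays (prior_scores, ref_free_scores, other_scores) are hypothetical synthetic stand-ins for the real annotator data:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for per-summary annotator scores (hypothetical):
#   prior_scores    - an annotator's earlier, reference-based scores
#   ref_free_scores - the same annotator's reference-free scores
#   other_scores    - mean reference-free scores from the other annotators
rng = np.random.default_rng(0)
prior_scores = rng.normal(3.5, 1.0, size=100)
ref_free_scores = 0.4 * prior_scores + rng.normal(0.0, 1.0, size=100)
other_scores = rng.normal(3.5, 1.0, size=100)

# Compare how well each candidate predictor correlates with the
# annotator's own reference-free score.
r_prior, _ = pearsonr(prior_scores, ref_free_scores)
r_other, _ = pearsonr(other_scores, ref_free_scores)
print(f"prior vs. ref-free:  r = {r_prior:.3f}")
print(f"others vs. ref-free: r = {r_other:.3f}")
```

If the first correlation is consistently higher than the second, as reported above (0.404 vs. 0.188), that is evidence the annotator's prior exposure, rather than properties of the summary itself, is driving part of the reference-free score.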