We take the reference summary and ask experts to extract simple facts, or atomic content units (ACUs), from it. This first stage is done by experts because writing the units is the more difficult task. Then, for each system-generated summary, we check whether each unit is present in it. This second matching stage is done by crowdsourced workers. The process is illustrated in the figure below.
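As a rough illustration, the summary-level score under this protocol can be thought of as the fraction of reference units judged present in the system summary (the full protocol also uses a length-normalized variant). The sketch below is a minimal version with hypothetical field names, not the exact scoring code.

```python
from dataclasses import dataclass

@dataclass
class ACUAnnotation:
    """Hypothetical container for one system summary's annotation."""
    units: list[str]      # atomic units extracted from the reference
    matched: list[bool]   # crowd workers' presence judgments, one per unit

def acu_score(ann: ACUAnnotation) -> float:
    """Unnormalized ACU score: fraction of reference units found in the summary."""
    if not ann.units:
        return 0.0
    return sum(ann.matched) / len(ann.units)

# Example: 3 of 4 reference units were judged present in the system summary.
ann = ACUAnnotation(
    units=["The storm hit Florida.", "Two people were injured.",
           "Power was lost.", "Schools were closed."],
    matched=[True, True, True, False],
)
print(acu_score(ann))  # 0.75
```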
We collect data following this protocol on three common summarization datasets covering the news and dialogue domains. The resulting benchmark, which we call RoSE (Robust Summarization Evaluation), contains three test sets and a validation set on the CNN/DailyMail dataset. We achieve high inter-annotator agreement, with a Krippendorff's alpha of 0.75. Dataset statistics are shown below.
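For reference, agreement of this kind can be computed with the open-source `krippendorff` Python package; the matrix below is made-up toy data (one row per annotator, one column per judged unit), shown only to illustrate the calculation.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Toy reliability data: 3 annotators x 8 presence/absence judgments (nan = missing).
reliability_data = np.array([
    [1, 0, 1, 1, 0, 1, np.nan, 0],
    [1, 0, 1, 1, 0, 1, 1,      0],
    [1, 0, 0, 1, 0, 1, 1,      np.nan],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```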
Our benchmark consists of 22,000 summary-level annotations over 28 top-performing systems on three datasets. Standard human evaluation datasets for summarization typically include around 100 documents; ours covers five times as many, which allows us to draw stronger conclusions about differences in system performance.
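To see why document count matters, consider a paired significance test between two systems scored on the same documents: with more documents, smaller mean differences become statistically detectable. The sketch below uses synthetic scores purely for illustration and is not taken from the benchmark itself.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples (over documents) in which system A
    has a higher mean score than system B."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample documents with replacement
        if scores_a[idx].mean() > scores_b[idx].mean():
            wins += 1
    return wins / n_resamples

# Toy example with hypothetical per-document ACU scores for two systems.
rng = np.random.default_rng(1)
a = rng.normal(0.42, 0.15, size=500)
b = rng.normal(0.40, 0.15, size=500)
print(paired_bootstrap(a, b))
```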
To better understand our ACU protocol, we compare it against three other common human evaluation protocols. These fall into two groups: protocols that do not take the reference summary into account (Prior and Ref-free) and protocols that compare against the reference (Ref-based, like our ACU protocol). To evaluate them, we collect annotations on 100 examples from our benchmark under each of the three other protocols. The diagram below shows what the annotator sees when annotating under each protocol.
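One simple way to compare protocols, assuming each yields a score per system, is to correlate the system-level rankings they induce. The snippet below is a generic sketch with invented numbers, using Kendall's tau from SciPy rather than any specific analysis from the paper.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical system-level scores (one value per system) under two protocols.
acu_scores      = np.array([0.41, 0.38, 0.45, 0.33, 0.40, 0.36])
ref_free_scores = np.array([3.9,  4.1,  4.0,  3.5,  3.8,  4.2])

tau, p_value = kendalltau(acu_scores, ref_free_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```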