To be more concrete, we ask GPT-3.5-turbo to edit the seed summary by modifying a few words to create additional consistent and inconsistent examples. The edited summaries have an average of 3.6 words inserted and 3.5 words deleted. We ask the model to generate about 30 modified summaries per document, so the annotator does not have to read multiple input documents. This results in a cost of about $300 for 500 annotated samples. We hire a professional editor for this annotation step.
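To make the edit-generation step concrete, here is a minimal sketch of how one might prompt GPT-3.5-turbo for ~30 lightly edited variants of a seed summary. This assumes the OpenAI Python client (v1); the prompt wording and the `generate_edited_summaries` helper are illustrative, not the exact prompt or code used in the paper.

```python
# Sketch of the edit-generation step (assumed prompt, not the paper's exact one).
from openai import OpenAI

client = OpenAI()

EDIT_PROMPT = (
    "Here is a summary of a document:\n\n{seed_summary}\n\n"
    "Produce {n} edited versions of this summary. In each version, insert or delete "
    "only a few words. Some edits should keep the summary factually consistent with "
    "the original, and some should introduce a subtle factual inconsistency. "
    "Return one edited summary per line."
)

def generate_edited_summaries(seed_summary: str, n: int = 30) -> list[str]:
    """Ask GPT-3.5-turbo for about n minimally edited variants of a seed summary."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": EDIT_PROMPT.format(seed_summary=seed_summary, n=n),
        }],
        temperature=1.0,  # encourage diverse edits
    )
    # One edited summary per line; drop any blank lines.
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Because all 30 or so edits share the same source document, an annotator can judge a whole batch after reading that single document, which is what keeps the annotation cost low.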
SummEdits Benchmark
We apply this approach to 10 domains ranging from news to dialogue, scientific text, and sales calls. For 5 of the domains, we automatically generate a seed summary with GPT-3.5-turbo due to the lack of high-quality existing reference summaries.
In the final benchmark, 37% of summaries are consistent, approaching our objective of a balanced benchmark to facilitate robust evaluation.
The authors of the paper annotated 20% of the benchmark samples as factually consistent or inconsistent and achieved high inter-annotator agreement, which was even higher when borderline cases were removed from the initial annotations. See the table below for an overview of the domains covered, dataset sizes, and agreement.
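For readers who want to reproduce the agreement check on their own annotations, the sketch below computes Cohen's kappa over binary consistent/inconsistent labels. The choice of Cohen's kappa, the label encoding, and the toy labels are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative only: inter-annotator agreement on binary consistency labels
# via Cohen's kappa (assumed metric; labels below are toy data).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 0, 1, 0, 1, 1]  # 1 = consistent, 0 = inconsistent
annotator_b = [1, 1, 0, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```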