as a binary classification to improve interpretability; a yes/no classification of whether a summary is factually consistent with the input is more interpretable than a score between 1 and 5.
Our focus is factual consistency, so we do not want factors such as grammaticality or fluency to influence annotations. Previous benchmarks include summaries that may have imperfections in fluency or formatting; we remove such summaries so that annotators focus only on the factual consistency label.
We want high inter-annotator agreement to improve the reproducibility of the protocol, and we also want a task that humans can perform reliably but that models may struggle with (one common way to quantify such agreement is sketched after these principles).
We want the benchmark to be diverse across a wide range of error types and domains.
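As a rough illustration of how such agreement could be quantified, the snippet below computes Cohen's kappa over binary factual-consistency labels from two annotators; the label lists are hypothetical placeholders, not data from the benchmark.

```python
# A minimal sketch of measuring inter-annotator agreement with Cohen's kappa.
# The label lists below are hypothetical placeholders, not SummEdits data.
from sklearn.metrics import cohen_kappa_score

# Binary factual-consistency labels (1 = consistent, 0 = inconsistent)
# assigned by two annotators to the same set of edited summaries.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement
```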
SummEdits Annotation Protocol
We incorporate the above principles into our annotation schema for factual consistency.
In the first step, we select a document and a seed summary (which may come from the original dataset or be LLM-generated) and validate that they do not contain errors. We then use an LLM to generate many minor edits of the seed summary, producing additional candidate summaries, some consistent with the document and some inconsistent. Finally, we manually classify each edited summary as consistent, inconsistent, or borderline/unsure, and remove the borderline cases so that the dataset contains only high-quality, unambiguous samples. An overview of our annotation pipeline is shown below.
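To make the pipeline concrete, here is a minimal sketch of the three stages: seeding with a validated (document, summary) pair, LLM-based editing, and manual filtering of borderline cases. The function names, labels, and the `llm_edit` / `annotate` callables are illustrative assumptions, not the actual implementation or prompts used for SummEdits.

```python
# A minimal sketch of a SummEdits-style annotation pipeline, assuming an
# external `llm_edit` callable that returns minor rewrites of a seed summary.
# All names, labels, and the filtering logic are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

CONSISTENT, INCONSISTENT, BORDERLINE = "consistent", "inconsistent", "borderline"

@dataclass
class Sample:
    document: str
    summary: str
    label: str  # one of CONSISTENT, INCONSISTENT, BORDERLINE

def build_benchmark(
    document: str,
    seed_summary: str,
    llm_edit: Callable[[str, int], List[str]],  # returns n minor edits of the seed
    annotate: Callable[[str, str], str],        # human judgment: (document, summary) -> label
    n_edits: int = 30,
) -> List[Sample]:
    # Step 1: the (document, seed summary) pair is assumed to be validated
    # as error-free before any edits are generated.
    candidates = llm_edit(seed_summary, n_edits)

    # Step 2: each edited summary is manually labeled, and borderline/unsure
    # cases are dropped so only unambiguous samples remain.
    samples = []
    for summary in candidates:
        label = annotate(document, summary)
        if label != BORDERLINE:
            samples.append(Sample(document, summary, label))
    return samples
```

In practice, `annotate` stands in for a human labeling interface rather than code; the sketch is only meant to show where the borderline filter sits in the pipeline.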