Trusting Benchmark Evaluation

Post by **rochona** » Sun May 25, 2025 10:55 am

This section follows our work SUMMEDITS: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization. As mentioned in the previous part, we want to perform a targeted evaluation of additional quality dimensions, and in this work, we focus on factual consistency.

Motivation
Prior work has pointed to low inter-annotator agreement and variations in how different papers have annotated factuality categories. This is unfortunate given that factuality should be one of the more objective categories to annotate. Another factor is that, unlike a quality dimension such as coherence or our ACU annotation, evaluating factuality generally requires reading the entire input, which can be very costly when only annotating several examples per document.

Guiding Principles for Factual Consistency Benchmarking
We design a benchmark that embodies several principles drawn from our analysis of existing work on factual consistency. Additional details on this analysis can be found in the paper.