Case Studies

Post by rochona »

Several of our benchmarking efforts have centered on summarization. Our AggreFact benchmark aggregates recent summarization systems and proposes a method to align their error types for a more comprehensive analysis. SummEdits introduces an efficient way to leverage LLMs in annotating factual errors and benchmarks the most recent LLMs as evaluation models. Also aiming to refine evaluation protocols, our RoSE benchmark introduces a protocol for human evaluation of summary relevance that achieves high inter-annotator agreement among crowd-sourced workers, and our DiverseSumm work builds on the insights of our earlier papers on evaluation and on decomposing tasks into simpler components, applied here to multi-document summarization.
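Inter-annotator agreement is typically reported with a chance-corrected statistic. As a minimal sketch (not the exact statistic used in the RoSE protocol), Cohen's kappa for two annotators judging binary summary relevance can be computed as:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters labeling the same items.

    Illustrative only: kappa corrects raw agreement for the agreement
    expected by chance given each rater's label distribution.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of the raters' marginal label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgments (1 = relevant, 0 = not relevant).
kappa = cohen_kappa([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])
```

Values near 1 indicate agreement well above chance; values near 0 indicate agreement no better than chance.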


Below we select two representative works that highlight our efforts to improve evaluation and to understand model performance along the quality dimensions of relevance and factual consistency. Our case studies focus on summarization, but the approaches could be applied to other generation tasks.

Case Study 1: Trusting Human Evaluation
This section follows our work Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation.