In the table below, we visualize the average performance across the competing models. As can be seen, there still remains a gap between the top model and human performance, as well as a sizable gap between the top model and the other models. We therefore believe that SummEdits provides a robust, challenging benchmark for comparing LLMs on factual consistency evaluation.
In this work, we benchmark LLMs on prior data, but also use them to take a critical look at previous benchmarks and potential errors in their creation. We call for more reproducible and challenging benchmarks like the one we introduce here. Clear, atomic edits and a decomposed annotation process, similar to the RoSE benchmark discussed above, simplify data collection and allow for better annotator agreement.
Overall, given the improvements in LLMs, we are able to leverage them to scale annotation processes. In this work, we found that only GPT-3.5 or GPT-4 were able to generate sufficiently refined edits without completely rewriting the summaries. The LLM chosen could well have an effect on which models are preferred, and ideally we would leverage multiple LLMs in this process.
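To make the edit-generation step concrete, the sketch below shows one way an LLM could be prompted to introduce a single factual inconsistency into a summary while changing as few words as possible. The prompt wording, model name, and helper function are illustrative assumptions (written against the OpenAI Python client), not the exact setup used to build SummEdits.

```python
# Sketch: prompting an LLM for minimal ("atomic") edits to a reference summary,
# in the spirit of the SummEdits protocol. Prompt text and model are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EDIT_PROMPT = (
    "Below is a document and a summary that is consistent with it.\n"
    "Produce a lightly edited version of the summary that introduces exactly one "
    "factual inconsistency. Change as few words as possible and do not rewrite "
    "the summary or alter its style.\n\n"
    "Document:\n{document}\n\nSummary:\n{summary}\n\nEdited summary:"
)

def propose_inconsistent_edit(document: str, summary: str, model: str = "gpt-4") -> str:
    """Ask the LLM for a single atomic edit that makes the summary inconsistent."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EDIT_PROMPT.format(document=document, summary=summary)}],
        temperature=1.0,  # some diversity so repeated calls yield different edits
    )
    return response.choices[0].message.content.strip()

# Each edited summary would then be shown to annotators for a binary
# consistent / inconsistent judgment, keeping the labeling task atomic.
```

Because the quality of these edits depends on the generating model, swapping in a different LLM (or an ensemble of them) at this step is a natural way to check that the benchmark does not favor the model family that produced it.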