The benchmark creation process
Posted: Sun May 25, 2025 4:57 am
As noted, the benchmark creation process is quite efficient, costing roughly $300 per domain. We compare this to the estimated cost of FRANK, a previous factual consistency protocol, which would cost about $6,000 to produce a dataset of similar size for a single new domain, a figure that must then be multiplied by 10 to match our overall benchmark size.
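To make the cost gap concrete, here is a minimal back-of-the-envelope sketch. The per-domain figures come from the estimates above; the domain count of 10 matches the multiplier just mentioned.

```python
# Back-of-the-envelope cost comparison, using the estimates quoted above.
SUMMEDITS_COST_PER_DOMAIN = 300  # USD per domain with our protocol
FRANK_COST_PER_DOMAIN = 6_000    # USD per domain, estimated for FRANK
NUM_DOMAINS = 10                 # overall benchmark size

summedits_total = SUMMEDITS_COST_PER_DOMAIN * NUM_DOMAINS  # $3,000
frank_total = FRANK_COST_PER_DOMAIN * NUM_DOMAINS          # $60,000

print(f"SummEdits: ${summedits_total:,} vs. FRANK: ${frank_total:,} "
      f"({frank_total // summedits_total}x more expensive)")
```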
Analysis
In the table below, we show accuracy results on SummEdits for top LLMs at the time of submission, averaged across domains.
[Table image: per-model accuracy on SummEdits, averaged across domains]
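As a small sketch of how such a cross-domain average can be computed (assuming each domain is weighted equally; the function and variable names are ours, not the paper's code):

```python
from statistics import mean

def cross_domain_accuracy(per_domain_accuracy: dict[str, float]) -> float:
    """Aggregate per-domain accuracies into a single score by weighting
    each domain equally (a macro-average). Illustrative sketch only."""
    return mean(per_domain_accuracy.values())
```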
We find that QAFactEval performs well, outperformed overall by only 4 LLMs. The non-LLM metrics perform best on the news domain, which makes sense given that they were largely developed and tested on news data. On the legal and Shakespeare dialogue domains, most models performed worst. These differences point to the necessity of developing and testing factual consistency methods across multiple domains.
At the bottom of the table, we show estimated human performance as well as an oracle setting. In the oracle setting, we append the seed summary to the article and the modified summary. The seed summary serves as an information scaffold, and the improved performance confirms that high model performance is attainable; the challenge in the benchmark lies in aligning the facts of the edited summary with the document when the model doesn't know what has been edited.
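As a rough illustration of the oracle setting, the sketch below shows one way the model input could be assembled. The prompt wording and function name are assumptions for illustration, not the exact template used in the benchmark:

```python
def build_oracle_input(article: str, seed_summary: str, edited_summary: str) -> str:
    """Assemble the oracle-setting input: the seed summary is appended alongside
    the article and the (possibly edited) summary, giving the model a scaffold
    to diff against. Prompt wording is illustrative, not the verbatim template."""
    return (
        f"Document:\n{article}\n\n"
        f"Original (seed) summary:\n{seed_summary}\n\n"
        f"Candidate summary:\n{edited_summary}\n\n"
        "Is the candidate summary factually consistent with the document? "
        "Answer Yes or No."
    )
```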