ACU Protocol
Posted: Sun May 25, 2025 4:54 am
Motivation
To properly evaluate a summarization system, the model's output summary is compared against the information in the source document, assessed for its intrinsic characteristics, and potentially compared to reference summaries. These comparisons are made using both automatic evaluation metrics and human evaluation. This setup is shown in the figure below.
Human evaluation is regarded as the gold standard both for evaluating summarization systems and for evaluating the automatic metrics themselves. However, simply running a human evaluation does not automatically make it a 'gold' standard. In fact, it is very difficult to conduct a human evaluation study properly: annotators may disagree on what a good summary is, and it may be hard to draw statistically significant conclusions from the current evaluation set sizes.
These difficulties motivate us to perform an in-depth analysis of human evaluation of text summarization. Our first focus is a better protocol and benchmark.
The goal of this protocol is to allow our annotators to objectively judge whether the summary contains salient information from the reference summary.
To do so, inspired by the pyramid evaluation protocol, we dissect the evaluation task into finer-grained subtasks with the notion of atomic content units.
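To make the idea concrete, below is a minimal sketch of how an ACU-based score could be computed once annotators have marked, for each atomic content unit extracted from the reference summary, whether it is supported by the system summary. The function name and the simple fraction-of-matched-units scoring are illustrative assumptions for this sketch, not the exact formulation of the protocol.

```python
from typing import List


def acu_score(acu_matched: List[bool]) -> float:
    """Illustrative ACU-based score: the fraction of reference ACUs
    that annotators judged as present in the system summary.
    (Any length normalization is omitted in this sketch.)"""
    if not acu_matched:
        return 0.0
    return sum(acu_matched) / len(acu_matched)


# Example: annotators found 3 of the 4 reference ACUs in the system summary.
judgments = [True, True, False, True]
print(f"ACU score: {acu_score(judgments):.2f}")  # -> 0.75
```

Breaking the judgment down to individual ACUs means each annotation decision is a near-binary check ("is this fact present?"), which is easier to make consistently than a single holistic quality rating.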