Post by **rochona** » Sun May 25, 2025 4:55 am

When evaluating system performance, we should have clearly defined targets. Otherwise, as seen above, the annotator’s prior preference can play a large role in evaluation results.

We also want to note the difference between reference-free and reference-based evaluation and how to choose between them. Reference-based evaluation can be more objective and easier to perform. However, it can also be more restrictive, and some quality aspects, such as factual consistency and coherence, are by definition reference-free. Furthermore, reference-free evaluation aligns with training techniques such as RLHF; however, it may be noisier and more subjective.

We stress that fine-grained human evaluation protocols can lead to more robust and objective results. The same principle has been applied to the evaluation of several summary qualities such as factual consistency and coherence. We believe extending our ACU protocol to additional, reference-free evaluation dimensions is a promising direction.

Overall, human evaluation is becoming even more important with the current progress of LLMs and the introduction of training techniques like RLHF, and there is much room left for improvement, such as proposing more targeted human evaluation protocols and improving the reliability and reproducibility of human evaluation practices.