
Benchmarking Long-Form Factuality in Large Language Models


by Language Models (dot tech) April 9th, 2025

This paper introduces SAFE, an automatic evaluation method for long-form factuality that outperforms human annotators while being cheaper and more scalable. Future research will focus on improving LLM factuality through better pretraining and the use of external tools.
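SAFE's pipeline, as the paper describes it, splits a long-form response into individual self-contained facts, filters out facts irrelevant to the prompt, and rates each remaining fact against search results. Below is a minimal sketch of that flow, assuming hypothetical llm and web_search callables; the names safe_rate, FactVerdict, and max_queries are illustrative placeholders, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class FactVerdict:
    fact: str
    label: str  # "supported", "not_supported", or "irrelevant"

def safe_rate(prompt: str, response: str, llm, web_search, max_queries: int = 5):
    """Sketch of a SAFE-style rater: split a response into facts,
    check each fact's relevance, then verify it against search evidence."""
    verdicts = []
    # 1. Use the LLM to split the response into self-contained facts.
    facts = llm(f"Split into individual self-contained facts:\n{response}").splitlines()
    for fact in filter(None, (f.strip() for f in facts)):
        # 2. Drop facts that are not relevant to answering the prompt.
        relevant = llm(f"Is this fact relevant to answering '{prompt}'? "
                       f"Answer yes or no.\nFact: {fact}").lower().startswith("yes")
        if not relevant:
            verdicts.append(FactVerdict(fact, "irrelevant"))
            continue
        # 3. Iteratively issue search queries and accumulate evidence.
        evidence = []
        for _ in range(max_queries):
            query = llm(f"Write a search query to verify: {fact}\n"
                        f"Evidence so far: {evidence}")
            evidence.append(web_search(query))
        # 4. Ask the LLM for a final verdict grounded in the evidence.
        supported = llm(f"Given this evidence: {evidence}\n"
                        f"Is the fact supported? Answer yes or no.\nFact: {fact}"
                        ).lower().startswith("yes")
        verdicts.append(FactVerdict(fact, "supported" if supported else "not_supported"))
    return verdicts

Passing the model and search engine in as callables keeps the sketch agnostic to any particular LLM or search API.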


Table of Links

Abstract and 1 Introduction

2 LongFact: Using LLMs to generate a multi-topic benchmark for long-form factuality

3 SAFE: LLM agents as factuality autoraters

4 LLM agents can be better factuality annotators than humans

5 F1@K: Extending F1 with recall from human-preferred length (formula restated after this table of links)

6 Larger LLMs are more factual

7 Related Work

8 Limitations

9 Conclusion, Acknowledgments, Author Contribution, and References

Appendix

A. Frequently asked questions

B. LongFact details

C. SAFE details

D. Metric details

E. Further analysis
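One item above is worth expanding before the conclusion: the F1@K metric of Section 5. As the paper defines it, for a response y with S(y) supported facts and N(y) not-supported facts, where K is the human-preferred number of facts, factual precision and recall combine into an F1 score (a restatement of the paper's definitions, not new notation):

\[
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right),
\]
\[
F_1@K(y) =
\begin{cases}
\dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0, \\
0 & \text{if } S(y) = 0.
\end{cases}
\]

Because recall saturates once a response contains K supported facts, a model is not rewarded for padding its answer beyond the length a human would prefer.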

9 CONCLUSION

In this paper, we examined how to thoroughly benchmark long-form factuality in large language models. To do so, we first used GPT-4 ...

