Benchmarking Long-Form Factuality in Large Language Models
This paper introduces SAFE, an automatic evaluation method for long-form factuality, outperforming human annotators and ...