Patronus AI Launches Industry-first LLM Benchmark for Finance to Address Hallucinations
Model evaluation shows state-of-the-art systems fail spectacularly on finance-related questions
NEW YORK, Nov. 16, 2023 /PRNewswire/ -- Patronus AI today launched "FinanceBench", the industry's first benchmark for testing how LLMs perform on financial questions.
Developed by AI researchers at Patronus AI and 15 financial industry domain experts, FinanceBench is a high-quality, large-scale set of 10,000 question-and-answer pairs based on publicly available financial documents such as SEC 10-Ks, 10-Qs, and 8-Ks, earnings reports, and earnings call transcripts. It is presented as a first line of evaluation for LLMs on financial questions, with more advanced tests to be released in the future.
Initial analysis by Patronus AI shows that state-of-the-art LLM retrieval systems fail spectacularly on a sample set of questions from FinanceBench.
- GPT-4 Turbo with a retrieval system fails 81% of the time
- Llama 2 with a retrieval system fails 81% of the time
Patronus AI also evaluated LLMs with long context windows, noting that they perform better but are less practical for use in a production setting. In particular:
- GPT-4 Turbo with long context fails 21% of the time
- Anthropic's Claude-2 with long context fails 24% of the time
Patronus AI notes that LLM retrieval systems are commonly used by enterprises today for multiple reasons. LLMs with long context windows are not only much slower and more expensive to use, but their context windows are also still not large enough to hold the long documents analysts typically work with.
"While LLMs show promise in analyzing mass volumes of financial data, most models out in the market need a lot of refinement and steering to work properly," said Anand Kannappan, CEO and co-founder of Patronus AI. "And based on our specific evaluation of GPT-4 Turbo and other models, the margin of error is just too big for financial applications."
"Analysts are spending valuable time creating prompt test sets to evaluate LLM retrieval systems and manually inspecting outputs to identify hallucinations," said Rebecca Qian, CTO and co-founder of Patronus AI. "And there exist no benchmarks that can help identify exactly where LLMs fail in real world financial use cases. This is precisely why we developed FinanceBench."
The new benchmark spans several LLM capabilities in finance:
- Numerical reasoning: Finance metrics requiring numerical calculations, e.g., EBITDA, P/E ratio, CAGR.
- Information retrieval: Specific details extracted directly from the documents.
- Logical reasoning: Questions involving financial recommendations, which require interpretation and a degree of subjectivity.
- World knowledge: Basic accounting and finance questions that analysts are familiar with.
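To make the "numerical reasoning" category concrete, the sketch below implements the three metrics named in the list. These are standard textbook formulas, not part of FinanceBench itself; the function names and the sample figures are illustrative only.

```python
# Standard definitions of the finance metrics cited in the benchmark's
# numerical-reasoning category. Illustrative only; FinanceBench questions
# require deriving inputs like these from real SEC filings.

def cagr(begin_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end_value / begin_value) ** (1 / years) - 1

def pe_ratio(price_per_share: float, earnings_per_share: float) -> float:
    """Price-to-earnings ratio."""
    return price_per_share / earnings_per_share

def ebitda(net_income: float, interest: float, taxes: float,
           depreciation: float, amortization: float) -> float:
    """Earnings before interest, taxes, depreciation, and amortization."""
    return net_income + interest + taxes + depreciation + amortization

# Example: revenue growing from $100M to $161.05M over 5 years is a 10% CAGR.
print(round(cagr(100.0, 161.051, 5), 3))
```

An evaluator can grade a model's answer to such a question by recomputing the metric from the filing's reported figures and comparing within a tolerance.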
As a part of this release, customers can now evaluate their LLM system against FinanceBench on the Patronus AI platform. The platform can also detect hallucinations and other unexpected LLM behavior on financial questions in a scalable way. Several financial services companies are piloting Patronus AI in the coming months.
About Patronus AI
Patronus AI is the first automated evaluation and security platform that helps companies use large language models (LLMs) confidently. The company was founded by machine learning experts Anand Kannappan and Rebecca Qian, formerly of Meta AI and Meta Reality Labs. For more information, please visit https://www.patronus.ai/.
SOURCE Patronus AI