Technology and Innovation Community

LLM BENCHMARK UPDATE

Carlos Salas

Community Champion
Posted 09-03-2026 11:38

Quick benchmark comparison data for GP (General Purpose) models and finance-specialized ones (KPIs explained at the end) as of March 2026:

Finance LLMs

KPI Descriptions

SA-FPB (micro F1): Measures how accurately a model classifies sentiment (positive, negative, neutral) in financial sentences from the Financial PhraseBank dataset, using the micro-averaged F1 score.

SA-FiQA (weighted F1): Evaluates sentiment detection in financial microblog posts and news headlines from the FiQA dataset, using a weighted F1 score in which each class's F1 is weighted by its share of the examples to account for class imbalance.

SA-FOMC (micro F1): Measures classification of monetary policy stance (hawkish, dovish, neutral) in central bank communications (e.g., Federal Reserve statements), using micro-averaged F1.
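To make the difference between the micro and weighted averages concrete, here is a minimal pure-Python sketch. The function names and the toy labels are my own illustrations, not taken from any benchmark harness, and actual leaderboards may compute these with a library instead.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(y_true, y_pred):
    """Pool the counts over all classes, then compute a single F1."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

def weighted_f1(y_true, y_pred):
    """Compute F1 per class, then average weighted by class frequency."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    total = len(y_true)
    return sum(support[c] / total * f1(tp[c], fp[c], fn[c]) for c in support)

# Hypothetical toy labels, for illustration only.
y_true = ["pos", "pos", "pos", "pos", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "pos"]
print(micro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

For single-label classification, micro F1 equals plain accuracy, whereas the weighted score drops when a whole class (here "neu") is missed.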

HC (Avg F1): Average F1 score for classifying financial news headlines against a set of predefined labels (e.g., corporate actions, market movements), with the F1 computed per label and then averaged.

NER (Entity-level F1): Evaluates how accurately a model identifies and classifies financial entities (e.g., companies, dates, monetary values) in text using entity-level F1 score.
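As a rough illustration of entity-level scoring, the sketch below counts a predicted entity as correct only if its span and type both match a gold entity exactly. The (start, end, type) tuple representation and the function name are assumptions for this example; real evaluation scripts may use different matching rules.

```python
def entity_f1(gold, pred):
    """Entity-level F1 over sets of (start, end, type) tuples, exact match only."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # same span boundaries and same entity type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical token spans, for illustration only.
gold = {(0, 2, "ORG"), (5, 6, "PER"), (9, 10, "LOC")}
pred = {(0, 2, "ORG"), (5, 6, "ORG"), (9, 10, "LOC")}  # second entity mistyped
print(entity_f1(gold, pred))
```

Note that the mistyped entity counts as both a false positive and a false negative, which is why entity-level F1 is stricter than token-level accuracy.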

QA-FinQA (EmAcc): Exact match accuracy for answering financial reasoning questions from the FinQA dataset, often requiring calculations based on tables or reports.

QA-ConvFinQA (EmAcc): Exact match accuracy for multi-turn financial question answering in the ConvFinQA dataset, where the model must maintain context across questions.
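Exact match accuracy for the two QA tasks can be sketched as below. The normalization step (lowercasing, dropping thousands separators, rounding numbers to two decimals) is my own assumption for illustration; the official FinQA and ConvFinQA scripts may normalize answers differently.

```python
def normalize(answer):
    """Hypothetical normalization: trim, lowercase, strip commas, round numbers."""
    text = answer.strip().lower().replace(",", "")
    try:
        return str(round(float(text), 2))
    except ValueError:
        return text  # not a number, compare as plain text

def exact_match_accuracy(preds, golds):
    """Share of predictions that equal the gold answer after normalization."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# Hypothetical answers, for illustration only.
preds = ["1,250.0", "yes", "3.14"]
golds = ["1250", "Yes", "3.15"]
print(exact_match_accuracy(preds, golds))
```

Because there is no partial credit, a small arithmetic slip (3.14 vs 3.15) scores the same as a completely wrong answer.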

BigData22 (Acc/MCC): Measures stock movement prediction (whether a stock's price will rise or fall) from tweets and historical prices, using Accuracy and the Matthews Correlation Coefficient on the BigData22 dataset.

ACL18 (Acc/MCC): Evaluates stock movement prediction from tweets and historical prices, using Accuracy and the Matthews Correlation Coefficient on a dataset introduced at ACL 2018.

CIKM18 (Acc/MCC): Measures stock movement prediction from tweets and historical prices, using Accuracy and the Matthews Correlation Coefficient on a dataset introduced at CIKM 2018.
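Accuracy and MCC are reported together because accuracy alone can look acceptable for a model that always predicts one class. A minimal sketch for a binary task follows; the 0/1 label encoding, the function name and the toy data are assumptions for illustration.

```python
import math

def accuracy_and_mcc(y_true, y_pred):
    """Return (accuracy, Matthews correlation coefficient) for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
print(accuracy_and_mcc(y_true, [1, 0, 1, 0, 1, 0, 1, 0]))  # a mixed predictor
print(accuracy_and_mcc(y_true, [1] * 8))                   # always predicts 1
```

In this toy data the constant predictor still reaches 50% accuracy but gets an MCC of 0, which is the behaviour that makes MCC a useful companion metric on near-balanced binary labels.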

ECTSum (Rouge-1): Evaluates summarization quality of earnings call transcripts using ROUGE-1, which measures overlap of words between generated and reference summaries.

EDTSum (Rouge-1): Measures summarization quality on financial news articles, where the model generates a headline-style summary, using ROUGE-1.
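For readers unfamiliar with ROUGE-1, here is a simplified sketch of the unigram-overlap calculation. It uses naive whitespace tokenization with no stemming, which is an assumption on my part; published scores are usually produced by a dedicated ROUGE package and may differ slightly.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Return (precision, recall, F1) of unigram overlap with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical summaries, for illustration only.
reference = "revenue rose ten percent in the quarter"
candidate = "revenue rose in the quarter"
print(rouge1(candidate, reference))
```

The candidate above has perfect precision but misses two reference words, so its recall and F1 are lower. ROUGE-1 rewards word overlap only, not factual correctness, which matters when comparing summaries of financial text.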

Feel free to reply should any of the scores be outdated.

------------------------------
Carlos Salas
Portfolio Manager & Freelance Investment Research Consultant
------------------------------
