Quick benchmark comparison of general-purpose (GP) and finance-specialized models as of March 2026 (KPIs explained at the end):

Finance LLMs

KPI Descriptions
- SA-FPB micro F1: Measures how accurately a model classifies sentiment (positive, negative, neutral) in financial sentences from the Financial PhraseBank dataset, using the micro-averaged F1 score.
- SA-FiQA weighted F1: Evaluates sentiment detection in financial tweets and news headlines from the FiQA dataset, using a weighted F1 score that accounts for class imbalance.
- SA-FOMC micro F1: Measures sentiment classification performance on central bank communications (e.g., Federal Reserve statements), using micro-averaged F1.
- HC Avg F1: Average F1 score for categorizing financial news headlines into predefined topic classes (e.g., corporate actions, market movements).
- NER (Entity-level F1): Evaluates how accurately a model identifies and classifies financial entities (e.g., companies, dates, monetary values) in text, using entity-level F1 score.
- QA-FinQA (EmAcc): Exact match accuracy for answering financial reasoning questions from the FinQA dataset, often requiring calculations based on tables or reports.
- QA-ConvFinQA (EmAcc): Exact match accuracy for multi-turn financial question answering in the ConvFinQA dataset, where the model must maintain context across questions.
- BigData22 (Acc/MCC): Measures stock movement prediction (whether a stock's price will rise or fall, given tweets and historical prices) using Accuracy and Matthews Correlation Coefficient (MCC) on the BigData22 dataset.
- ACL18 (Acc/MCC): Evaluates the same stock movement prediction task using Accuracy and MCC on a dataset introduced at ACL 2018.
- CIKM18 (Acc/MCC): Evaluates the same stock movement prediction task using Accuracy and MCC on a dataset introduced at CIKM 2018.
- ECTSum (Rouge-1): Evaluates summarization quality of earnings call transcripts using ROUGE-1, which measures single-word (unigram) overlap between generated and reference summaries.
- EDTSum (Rouge-1): Measures summarization quality on financial news articles from the EDT dataset using ROUGE-1.
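
For anyone who wants to sanity-check how these metrics behave, here is a minimal pure-Python sketch of the scoring formulas named above (micro F1, weighted F1, MCC, exact match accuracy, ROUGE-1). It is a simplified illustration, not the official scorer of any of these benchmarks: the function names are my own, labels are assumed to be single-label, MCC assumes binary 0/1 labels, exact match is a plain string comparison, and ROUGE-1 uses lowercase whitespace tokens with no stemming.

```python
from collections import Counter
from math import sqrt


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Micro F1; for single-label classification this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = sum(per_class_f1(y_true, y_pred, c) * n for c, n in support.items())
    return total / len(y_true)


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = up, 0 = down)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def exact_match_accuracy(golds, preds):
    """Share of predictions that match the gold answer after trimming spaces."""
    return sum(g.strip() == p.strip() for g, p in zip(golds, preds)) / len(golds)


def rouge1_f(reference, candidate):
    """ROUGE-1 F-score from clipped unigram overlap (simplified tokenization)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up toy labels, only to show the calls; not benchmark data.
    y_true = ["pos", "pos", "pos", "neg"]
    y_pred = ["pos", "pos", "neg", "neg"]
    print("micro F1:", micro_f1(y_true, y_pred))
    print("weighted F1:", weighted_f1(y_true, y_pred))
    print("MCC:", mcc([1, 1, 0, 0], [1, 0, 0, 0]))
    print("EmAcc:", exact_match_accuracy(["12.5", "yes"], ["12.5 ", "no"]))
    print("ROUGE-1:", rouge1_f("revenue rose ten percent", "revenue rose five percent"))
```

One practical takeaway: on imbalanced data, micro F1 and weighted F1 can diverge, and MCC can sit close to zero even when accuracy looks decent, which is why the stock movement benchmarks report both Acc and MCC.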
Feel free to reply should any of the scores be outdated.
------------------------------
Carlos Salas
Portfolio Manager & Freelance Investment Research Consultant
------------------------------