Technology and Innovation Community

  • 1.  LLM BENCHMARK UPDATE

    Posted yesterday

    Quick benchmark comparison of general-purpose (GP) and finance-specialized LLMs as of March 2026 (the KPIs are explained at the end):

    Finance LLMs

    KPI Descriptions

    • SA-FPB (micro F1): Measures how accurately a model classifies sentiment (positive, negative, neutral) in financial sentences from the Financial PhraseBank dataset using the micro-averaged F1 score.

    • SA-FiQA (weighted F1): Evaluates sentiment detection in financial microblog posts and news headlines from the FiQA dataset using a weighted F1 score, which weights each class's F1 by its share of the examples to account for class imbalance.

    • SA-FOMC (micro F1): Measures classification of the monetary policy stance (hawkish, dovish, or neutral) of central bank communications (e.g., Federal Reserve FOMC statements) using micro-averaged F1.

    • HC (Avg F1): Average F1 score across labels for classifying financial news headlines into predefined tags (as far as I know, the Headlines dataset covers gold-market news tagged with attributes such as price direction).

    • NER (Entity-level F1): Evaluates how accurately a model identifies and classifies named entities (e.g., persons, organizations, locations) in financial text using the entity-level F1 score, where a prediction counts only if the whole entity span and its type are correct.

    • QA-FinQA (EmAcc): Exact match accuracy for answering financial reasoning questions from the FinQA dataset, often requiring calculations based on tables or reports.

    • QA-ConvFinQA (EmAcc): Exact match accuracy for multi-turn financial question answering in the ConvFinQA dataset, where the model must maintain context across questions.

    • BigData22 (Acc/MCC): Measures stock movement prediction (whether a stock's price will rise or fall, based on tweets and historical prices) using Accuracy and Matthews Correlation Coefficient on the BigData22 dataset.

    • ACL18 (Acc/MCC): Evaluates stock movement prediction using Accuracy and Matthews Correlation Coefficient on the StockNet dataset introduced at ACL 2018.

    • CIKM18 (Acc/MCC): Measures stock movement prediction using Accuracy and Matthews Correlation Coefficient on a dataset introduced at CIKM 2018.

    • ECTSum (ROUGE-1): Evaluates summarization quality of earnings call transcripts using ROUGE-1, which measures unigram (single-word) overlap between generated and reference summaries.

    • EDTSum (ROUGE-1): Measures how well a model condenses financial news articles from the EDT dataset into short, headline-style summaries using ROUGE-1.
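
    For anyone who wants to sanity-check how the metrics above behave, here is a minimal Python sketch of micro F1, weighted F1, MCC, exact-match accuracy, and ROUGE-1. It is an illustration under simplifying assumptions, not the scoring code used by any of these benchmarks: the function names are my own, MCC is shown for the binary case only, and ROUGE-1 uses plain lowercase whitespace tokenization with no stemming.

```python
from collections import Counter
from math import sqrt


def f1_scores(y_true, y_pred):
    """Return (micro_f1, weighted_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    tp_all = fp_all = fn_all = 0
    weighted = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
        denom = 2 * tp + fp + fn
        class_f1 = 2 * tp / denom if denom else 0.0
        # Weighted F1: each class's F1 counts in proportion to its support.
        weighted += class_f1 * support[label] / len(y_true)
    # Micro F1: pool the counts over all classes before computing F1.
    denom_all = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / denom_all if denom_all else 0.0
    return micro, weighted


def mcc_binary(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient for binary labels (e.g., rise vs. fall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def exact_match_accuracy(answers, predictions):
    """Share of predictions that equal the reference answer after trimming."""
    hits = sum(a.strip() == p.strip() for a, p in zip(answers, predictions))
    return hits / len(answers)


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 from clipped unigram overlap (whitespace tokens, lowercased)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["pos", "neg", "neu", "neu"]
    pred = ["pos", "neu", "neu", "neu"]
    print(f1_scores(gold, pred))
    print(mcc_binary([1, 1, 0, 0], [1, 0, 0, 0]))
    print(exact_match_accuracy(["12.5", "yes"], ["12.5", "no"]))
    print(rouge1_f1("revenue rose ten percent", "revenue rose sharply"))
```

    One thing the toy example makes visible: with one label per sentence, micro F1 equals plain accuracy, while weighted F1 drops when a minority class is missed entirely. That is part of why Accuracy and MCC are reported side by side for the stock movement datasets, since MCC stays near zero for a model that always predicts the majority class.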

    Feel free to reply should any of the scores be outdated.



    ------------------------------
    Carlos Salas
    Portfolio Manager & Freelance Investment Research Consultant
    ------------------------------