Technology and Innovation Community

LLMs Research - Lookahead Bias for Prediction Tasks

    Posted 4 days ago

    The next paper covers lookahead bias in LLMs during prediction tasks, a topic related to the upcoming CFA UK webinar on LLM biases.

    Paper Key Takeaways:

    • Paper Gist: LLM forecast accuracy can be inflated by memorization of training data rather than genuine reasoning or inference.

    • Lookahead Propensity (LAP) provides a practical proxy for detecting training-data overlap.

    • A positive interaction between LAP and the LLM's prediction, in a regression of the realized outcome, is formal evidence of lookahead bias.

    • In stock return prediction, a one-standard-deviation increase in LAP amplifies the LLM's predictive effect by roughly 37% of the baseline effect.

    • In earnings call CapEx forecasting, the same increase in LAP amplifies the predictive effect by about 19%.

    • Model confidence measures do not explain away LAP effects - memorization operates independently.

    • The LAP interaction disappears in true out-of-sample tests, supporting the bias interpretation.

    • LLM-based financial backtests may overstate alpha unless lookahead bias is explicitly tested and controlled.

    Paper Summary

    1. Introduction of Lookahead Propensity (LAP)

    • LAP measures how likely a prompt was seen during training.
    • Constructed using the Min-K% token probability method (based on the 20% least likely tokens in the prompt).
    • Higher LAP → greater model familiarity → higher likelihood of memorization.
    • Requires no retraining or access to proprietary training data.
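
    A minimal sketch of a Min-K%-style score may make the construction more concrete. It assumes per-token log-probabilities for the prompt are already available from the model; the function name `min_k_lap`, the sample values, and the final exponentiation to a 0-1 scale are illustrative assumptions, not details taken from the paper.

```python
import math

def min_k_lap(token_logprobs, k=0.20):
    """Average the k fraction of lowest token log-probs, then exponentiate.

    The exponentiation (mapping the score into (0, 1]) is an assumption made
    for readability; the summary does not state the paper's exact scaling.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, math.ceil(k * len(token_logprobs)))
    lowest = sorted(token_logprobs)[:n]  # the n least likely tokens
    return math.exp(sum(lowest) / n)

# Hypothetical per-token log-probabilities for two 10-token prompts.
familiar = [-0.1, -0.2, -0.05, -0.3, -0.1, -0.4, -0.2, -0.1, -0.15, -0.25]
novel = [-0.1, -2.5, -0.05, -4.0, -0.1, -3.2, -0.2, -0.1, -1.8, -0.25]

lap_familiar = min_k_lap(familiar)
lap_novel = min_k_lap(novel)
print(lap_familiar > lap_novel)  # the familiar prompt scores higher
```

    The idea is that a prompt seen during training rarely contains very surprising tokens, so even its least likely tokens keep a relatively high probability.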

    2. Theoretical Contribution

    • The paper formalizes lookahead bias as contamination:

    • If forecast accuracy increases with LAP, the predictive power is partly driven by memorization.
    • The key test: include an interaction term between prediction and LAP.
    • A positive interaction coefficient (β₃ > 0) implies lookahead bias.
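
    Written out, the interaction test described above could take the following form. This is a sketch: the summary only names β₃ as the interaction coefficient, so the outcome symbol y, the prediction symbol ŷ, the absence of further controls, and the ordering of β₁ and β₂ are assumptions.

```latex
y_{i,t+1} = \beta_0 + \beta_1 \hat{y}_{i,t} + \beta_2 \mathrm{LAP}_{i,t}
          + \beta_3 \left( \hat{y}_{i,t} \times \mathrm{LAP}_{i,t} \right) + \varepsilon_{i,t+1}
```

    Under this form the marginal effect of the prediction is β₁ + β₃ · LAP, so β₃ > 0 means the prediction looks more accurate exactly where the prompt looks more familiar to the model.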

    3. Empirical Test 1: News → Stock Returns

    Using Bloomberg headlines (2012–2023):

    • Baseline: LLM predictions significantly forecast next-day returns.
    • Adding LAP interaction:
      • Predictive power increases significantly with LAP.
      • A 1 SD increase in LAP boosts the marginal LLM effect by ~37% of the baseline effect.
    • Small-cap predictability is largely driven by high-LAP amplification.
    • In true out-of-sample tests (post-model release), the interaction becomes insignificant.
    • A bootstrap test confirms that in-sample predictability differs from the out-of-sample distribution (p = 0.033).
    • Implication: A substantial portion of "alpha" reflects memorized event–outcome pairs.
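
    To illustrate how a figure like "~37% of the baseline effect" can be read off an interaction regression, here is a small sketch on synthetic data. Everything in it is hypothetical: the data are simulated, the coefficients are chosen so that the ratio equals 0.37 by construction, and plain least squares stands in for whatever estimator the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

pred = rng.choice([-1.0, 1.0], size=n)   # hypothetical LLM signal: sell / buy
lap = rng.standard_normal(n)             # LAP, standardized to mean 0 and SD 1

# Synthetic outcome in arbitrary units, built so that beta3 / beta1 = 0.37.
true_beta1, true_beta3 = 0.10, 0.037
ret = true_beta1 * pred + true_beta3 * pred * lap + 0.5 * rng.standard_normal(n)

# Columns: intercept, prediction, LAP, prediction x LAP.
X = np.column_stack([np.ones(n), pred, lap, pred * lap])
beta, *_ = np.linalg.lstsq(X, ret, rcond=None)

# With LAP standardized, beta[3] is the change in the prediction's marginal
# effect per 1 SD of LAP; dividing by beta[1] expresses it relative to baseline.
amplification = beta[3] / beta[1]
print(f"beta1={beta[1]:.3f}  beta3={beta[3]:.3f}  amplification={amplification:.2f}")
```

    A real test would also need standard errors for the interaction coefficient before calling it significant; this sketch only shows where the ratio comes from.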

    4. Empirical Test 2: Earnings Calls → CapEx

    Using 2006–2020 transcripts:

    • Baseline: LLM predicts future capital expenditures.
    • LAP interaction is positive and highly significant.
    • A 1 SD increase in LAP increases the marginal LLM effect by ~19%.
    • Indicates memorization also drives apparent foresight in corporate investment forecasting.

    5. Confidence ≠ Memorization

    • LAP effects remain even after controlling for:
      • First-token conditional probability
      • Self-reported model confidence
    • Memorization operates independently of model "confidence."
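
    One way to picture this robustness check is to add a confidence measure and its own interaction with the prediction to the regression, then see whether the LAP interaction survives. The sketch below uses simulated data in which confidence is correlated with LAP but has no effect of its own; the variable `conf` is a hypothetical stand-in for either control listed above, and the specification is an assumption rather than the paper's exact one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

pred = rng.choice([-1.0, 1.0], size=n)   # hypothetical LLM signal
lap = rng.standard_normal(n)             # standardized LAP
# Hypothetical confidence score, correlated with LAP (correlation about 0.5).
conf = 0.5 * lap + np.sqrt(0.75) * rng.standard_normal(n)

# Synthetic outcome: the LAP interaction matters, confidence does not.
y = 0.10 * pred + 0.04 * pred * lap + 0.5 * rng.standard_normal(n)

# Columns: intercept, pred, LAP, pred x LAP, conf, pred x conf.
X = np.column_stack([np.ones(n), pred, lap, pred * lap, conf, pred * conf])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

b_lap_interaction, b_conf_interaction = beta[3], beta[5]
print(f"pred x LAP: {b_lap_interaction:.3f}   pred x conf: {b_conf_interaction:.3f}")
```

    If confidence were the real driver, the pred x conf term would absorb the effect and the pred x LAP term would shrink toward zero; in this simulated setup it does not.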

    6. Practical Contribution

    The LAP test acts as a leakage detector for LLM forecasting tasks, and is:

    • Model-agnostic
    • Cost-efficient
    • Usable without retraining
    • Applicable case-by-case
    • Suitable for backtesting diagnostics

    7. Broader Implication

    • Lookahead bias is task-specific, not universal.
    • LLM forecasts can appear superior due to training-period overlap.
    • Backtests using historical text may overstate real predictive ability.
    • Distinguishing reasoning from memorization is essential for credible empirical finance research.


    ------------------------------
    Carlos Salas
    Portfolio Manager & Freelance Investment Research Consultant
    ------------------------------