The next paper covers lookahead bias in LLMs during prediction tasks, a topic related to the upcoming CFA UK Webinar on LLM biases.
Paper Key Takeaways:
- Paper Gist: LLM forecast accuracy can be inflated by memorization of training data rather than genuine reasoning/inference.
- Lookahead Propensity (LAP) provides a practical proxy for detecting training-data overlap.
- A positive interaction between LAP and forecast accuracy is formal evidence of lookahead bias.
- In stock return prediction, a 1 SD increase in LAP amplifies the LLM's predictive effect by roughly 37% of the baseline effect.
- In earnings call CapEx forecasting, a 1 SD increase in LAP raises the LLM's marginal predictive effect by about 19%.
- Model confidence measures do not explain away LAP effects: memorization operates independently.
- The LAP interaction disappears in true out-of-sample tests, supporting the bias interpretation.
- LLM-based financial backtests may overstate alpha unless lookahead bias is explicitly tested and controlled.
Paper Summary
1. Introduction of Lookahead Propensity (LAP)
- LAP measures how likely a prompt was seen during training.
- Constructed using a MIN-K% token probability method (bottom 20% least likely tokens).
- Higher LAP → greater model familiarity → higher likelihood of memorization.
- Requires no retraining or access to proprietary training data.
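As a rough illustration of the MIN-K% idea described above, the sketch below averages the log-probabilities of the 20% least likely tokens in a prompt. The function name `min_k_score`, the rounding rule for how many tokens to keep, and the two lists of log-probabilities are all hypothetical; the paper's exact LAP construction (for example, any rescaling of this average) may differ.

```python
import math


def min_k_score(token_logprobs, k=0.20):
    """Average log-probability of the k fraction of least likely tokens.

    A score closer to zero means even the prompt's most surprising tokens
    were fairly predictable to the model, which is read as higher familiarity.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Keep at least one token; rounding up is an assumption of this sketch.
    n_keep = max(1, math.ceil(k * len(token_logprobs)))
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep


# Made-up per-token log-probabilities for two prompts (illustration only).
familiar = [-0.1, -0.2, -0.05, -0.3, -0.15, -0.1, -0.25, -0.2, -0.1, -0.05]
novel = [-0.1, -4.0, -0.05, -6.5, -0.15, -0.1, -3.2, -0.2, -0.1, -0.05]

score_familiar = min_k_score(familiar)
score_novel = min_k_score(novel)
print(score_familiar, score_novel)
```

In practice the log-probabilities would come from the same model that produces the forecast, which is why no retraining or access to the training corpus is needed.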
2. Theoretical Contribution
- The paper formalizes lookahead bias as contamination: the forecast partly reflects memorized training data rather than inference from the prompt.
- If forecast accuracy increases with LAP, the predictive power is partly driven by memorization.
- The key test: include an interaction term between prediction and LAP.
- A positive interaction coefficient (β₃ > 0) implies lookahead bias.
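One way to write the interaction test described above is sketched below. Only the interaction coefficient β₃ is named in this summary; the outcome symbol y, the LLM prediction ŷ, and the ordering of β₀, β₁ and β₂ are assumptions made for illustration, and the paper's actual specification may include further controls and fixed effects.

```latex
% Hypothetical notation: y = realized outcome, \hat{y} = LLM prediction
y_{i,t+1} = \beta_0 + \beta_1 \hat{y}_{i,t} + \beta_2 \mathrm{LAP}_{i,t}
          + \beta_3 \left( \hat{y}_{i,t} \times \mathrm{LAP}_{i,t} \right) + \varepsilon_{i,t+1}

% Marginal effect of the LLM prediction on the outcome
\frac{\partial y_{i,t+1}}{\partial \hat{y}_{i,t}} = \beta_1 + \beta_3 \mathrm{LAP}_{i,t}
```

Under this reading, β₃ = 0 means the prediction is equally informative whether or not the prompt looks familiar to the model, while β₃ > 0 means accuracy rises with familiarity.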
3. Empirical Test 1: News → Stock Returns
Using Bloomberg headlines (2012–2023):
- Baseline: LLM predictions significantly forecast next-day returns.
- Adding LAP interaction:
  - Predictive power increases significantly with LAP.
  - A 1 SD increase in LAP boosts the marginal LLM effect by ~37% of the baseline effect.
  - Small-cap predictability is largely driven by high-LAP amplification.
- In true out-of-sample tests (post-model release), the interaction becomes insignificant.
- Bootstrap confirms in-sample predictability differs from OOS distribution (p = 0.033).
- Implication: A substantial portion of "alpha" reflects memorized event–outcome pairs.
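To make the "~37% of the baseline effect" reading concrete, here is a minimal sketch that fits a regression with a prediction × LAP interaction on synthetic data. Everything in it is an assumption for illustration: the data are simulated (not Bloomberg headlines), LAP is standardized so the interaction coefficient is the effect of a 1 SD increase, and the true coefficients are set to 1.0 and 0.37 purely to mirror the headline figure. The helper `ols` is a hypothetical pure-Python solver, not the paper's estimator.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


rng = random.Random(0)  # fixed seed so the synthetic sample is reproducible
rows, outcome = [], []
for _ in range(5000):
    pred = rng.gauss(0, 1)  # synthetic LLM prediction signal
    lap = rng.gauss(0, 1)   # synthetic LAP, already standardized (SD = 1)
    rows.append([1.0, pred, lap, pred * lap])
    # Synthetic outcome: baseline effect 1.0, interaction 0.37, unit noise
    outcome.append(1.0 * pred + 0.37 * pred * lap + rng.gauss(0, 1))

b0, b1, b2, b3 = ols(rows, outcome)
amplification = b3 / b1  # extra marginal effect per 1 SD of LAP, relative to baseline
print(round(b1, 3), round(b3, 3), round(amplification, 3))
```

The ratio `b3 / b1` is the quantity the summary describes: how much a 1 SD increase in LAP adds to the prediction's marginal effect, expressed as a share of the baseline effect.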
4. Empirical Test 2: Earnings Calls → CapEx
Using 2006–2020 transcripts:
- Baseline: LLM predicts future capital expenditures.
- LAP interaction is positive and highly significant.
- A 1 SD increase in LAP increases the marginal LLM effect by ~19%.
- Indicates memorization also drives apparent foresight in corporate investment forecasting.
5. Confidence ≠ Memorization
- LAP effects remain even after controlling for:
  - First-token conditional probability
  - Self-reported model confidence
- Memorization operates independently of model "confidence."
6. Practical Contribution
The LAP test acts as a leakage detector for LLM forecasting tasks, and is:
- Model-agnostic
- Cost-efficient
- Usable without retraining
- Applicable case-by-case
- Suitable for backtesting diagnostics
7. Broader Implication
- Lookahead bias is task-specific, not universal.
- LLM forecasts can appear superior due to training-period overlap.
- Backtests using historical text may overstate real predictive ability.
- Distinguishing reasoning from memorization is essential for credible empirical finance research.
------------------------------
Carlos Salas
Portfolio Manager & Freelance Investment Research Consultant
------------------------------