Technology and Innovation Community


GROKKING AKA Delayed Generalization

  • 1.  GROKKING AKA Delayed Generalization

    Posted 05-03-2025 09:31
    Edited by Carlos Salas 05-03-2025 09:32

    Great research paper on Grokking and its underlying causes, with the full report and code publicly available:

    • Title: "Grokking at the Edge of Numerical Stability"
    • Gist: Authors provide new insights into why grokking is delayed and why regularization is crucial for it.
    • Link: Grokking at the Edge of Numerical Stability (arXiv)
    • Authors: Lucas Prieto, Melih Barsbey, Pedro A.M. Mediano, Tolga Birdal (Department of Computing, Imperial College London)
    • Code: https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability

    Please see below a short introduction to Grokking for those of you unfamiliar with the concept:

    • Grokking is a phenomenon where a model memorizes the training data early on but only later learns the underlying structure needed for generalization. In this way, the model generalizes suddenly after prolonged overfitting rather than improving gradually (a minimal training sketch follows this list).
    • Discovery: First reported by OpenAI researchers (Power et al.) in January 2022 while studying how neural networks perform calculations on small algorithmic datasets.
    • Phase Transition: Grokking resembles a phase transition in training, where models shift from memorization to generalization.
    • Regularization's Role: Weight decay (a form of regularization) may help by favoring general solutions, which are harder to find but have smaller weight norms. Research shows that regularization techniques (e.g. weight decay, L2 regularization, dropout) are crucial for achieving generalization and for preventing models from overfitting.
    • Lazy vs. Rich Training: The transition from "lazy training" (minimal weight movement) to "rich training" (task-relevant weight updates) is a key theoretical explanation for grokking: models may initially train inefficiently (lazy regime) before transitioning to meaningful learning (rich regime).
    • Softmax Collapse (SC): the paper's finding that floating-point errors in the softmax can drive gradients to exactly zero and stall learning. Numerical stability issues can thus hinder learning itself, a key consideration when developing robust trading algorithms (a toy illustration of SC also follows below).
    • Active Research: Grokking has been observed in deep and non-neural models, and ongoing studies suggest it arises from optimizer properties, weight decay, and initialization norms.
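
    For anyone who wants to see the effect first-hand, here is a minimal sketch of the classic grokking experiment: a small network trained on modular addition (a + b mod p) with AdamW weight decay. This is my own illustrative setup assuming PyTorch; the architecture, hyperparameters, and 40% training split are assumptions in the spirit of the original OpenAI experiments, not the exact configuration from the paper above.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    p = 97  # modulus of the toy task: learn (a + b) mod p

    # All p*p input pairs and their labels.
    pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
    labels = (pairs[:, 0] + pairs[:, 1]) % p

    # Small training fraction: memorization is easy, generalization is delayed.
    perm = torch.randperm(len(pairs))
    n_train = int(0.4 * len(pairs))
    train_idx, test_idx = perm[:n_train], perm[n_train:]

    class ToyNet(nn.Module):
        def __init__(self, p, d=128):
            super().__init__()
            self.embed = nn.Embedding(p, d)
            self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(),
                                     nn.Linear(256, p))

        def forward(self, x):
            e = self.embed(x)              # (batch, 2, d)
            return self.mlp(e.flatten(1))  # logits over the p residues

    model = ToyNet(p)
    # Weight decay is the knob the grokking literature flags as crucial.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(20001):
        opt.zero_grad()
        loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
        loss.backward()
        opt.step()
        if step % 1000 == 0:
            with torch.no_grad():
                tr = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
                te = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
            # Expect train accuracy to hit ~1.0 long before test accuracy moves.
            print(f"step {step:6d}  train {tr:.2f}  test {te:.2f}")

    Setting weight_decay to 0 in this sketch is the quickest way to test the regularization point above: the network still memorizes the training set, but the late jump in test accuracy typically arrives much later or not at all.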
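
    And to make the Softmax Collapse point concrete, here is a toy illustration (my own construction, not code from the paper) of how a growing logit scale pushes a float32 softmax to an exactly one-hot output, at which point the cross-entropy gradient underflows to exactly zero and learning stalls:

    import torch

    base = torch.tensor([1.0, 0.5, -0.5])  # arbitrary logits; class 0 is correct
    target = torch.tensor([0])

    for scale in [10.0, 100.0, 1000.0]:
        logits = (base * scale).detach().requires_grad_(True)
        probs = torch.softmax(logits, dim=0)
        loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
        loss.backward()
        # As scale grows, probs becomes exactly [1, 0, 0] in float32, so the
        # gradient (probs - one_hot) is exactly zero and no learning occurs.
        print(f"scale {scale:6.0f}  probs {probs.detach().tolist()}  "
              f"grad norm {logits.grad.norm():.2e}")

    Note that the loss itself stays well-defined; the failure is purely in the gradient, which is why the problem can go unnoticed while training silently stalls.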

    Why is this relevant for investment professionals?

    • Many ML models in finance show high performance in backtests but fail in live trading.
    • Market regimes change: grokking-style dynamics may help explain why models that seem to work initially fail later.
    • Overfitting vs. generalization in financial ML: Over-optimized strategies often appear strong in historical data but do not generalize to real-world conditions.


    Feel free to reply with any additional insights on Grokking.



    ------------------------------
    Carlos Salas
    Portfolio Manager & Freelance Investment Research Consultant
    ------------------------------