Latest post from Andrej Karpathy discussing the nanochat miniseries on GitHub (also available on the Discord channel).
A few notes from Karpathy below:
- Why a miniseries - The correct way to think about LLMs is that you are not optimizing a single specific model but a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will hold and your money will be well spent. (A minimal sketch of such an extrapolation follows these notes.)
- Top-level comparison to the GPT-2/GPT-3 miniseries
- Details included on scaling laws, hyperparameter sweeps, GPT-2/GPT-3 CORE scores, and Miniseries v1 CORE scores
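To make the "single dial" idea concrete, here is a minimal sketch of fitting a power-law scaling curve to a miniseries of small runs and extrapolating to a larger compute budget. This is not Karpathy's actual code; the numbers are invented and plain NumPy is assumed.

```python
import numpy as np

# Hypothetical (compute, validation loss) pairs from small runs in a
# miniseries. The numbers are invented for illustration only.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])  # e.g. training FLOPs
loss = np.array([3.90, 3.55, 3.25, 3.02, 2.84])

# A power law loss = a * C^(-b) is a straight line in log-log space,
# so an ordinary least-squares fit on the logs recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate along the fitted line to the budget of "the big run".
big_c = 1e21
predicted = a * big_c ** (-b)
print(f"fit: loss ~ {a:.3g} * C^(-{b:.3g})")
print(f"predicted loss at C={big_c:.0e}: {predicted:.3f}")
```

If the small runs sit on a straight line in log-log space, the extrapolated point is exactly the kind of prediction that justifies spending on the big run.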
- Todor
------------------------------
Todor Kostov
Director
------------------------------