GPT-2 (124M) reproduction time discrepancy #75
Replies: 1 comment
-
My understanding is that the model used in the "build-nanogpt" repo is the one Andrej built during his GPT-2 reproduction video on YouTube. He did indeed train that for about 1.7 hours (which could be reduced by compiling the model and skipping generation and HellaSwag evaluation during training), and that model "beat" the OpenAI GPT-2 124M checkpoint on HellaSwag after those ~2 hours of training. The model in nanoGPT is a more refined version and was used as the template for the llm.c implementation. It was trained on OpenWebText and apparently for much longer. My intuition, though, is that the nanoGPT model will perform significantly better, especially considering the model from the build-nanogpt repo was put together more ad hoc for the YouTube video.
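On the "compile the model and skip generation / HellaSwag evaluation" point, here is a minimal, self-contained sketch of where those two speed-ups enter a PyTorch 2.x training loop. It is not the repo's actual training script: the toy model, random batches, and the `RUN_EVALS` flag are illustrative assumptions; only the `torch.compile` call and the eval guard are the pattern being described.

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for the GPT model in the repo (hypothetical, just to keep this runnable).
vocab_size = 50304
model = nn.Sequential(nn.Embedding(vocab_size, 128), nn.Linear(128, vocab_size)).to(device)
model = torch.compile(model)  # PyTorch 2.x: graph capture + kernel fusion, noticeably faster steps on GPU

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
RUN_EVALS = False  # True would restore the periodic sampling / HellaSwag-style eval passes

t0 = time.time()
for step in range(20):
    # Fake token batch and targets in place of the real data loader (illustrative only).
    x = torch.randint(0, vocab_size, (4, 32), device=device)
    y = torch.randint(0, vocab_size, (4, 32), device=device)
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)                                    # (4, 32, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    loss.backward()
    optimizer.step()

    # In the video's script, text generation and HellaSwag evaluation run every N steps;
    # guarding those blocks behind a flag is one of the easy wall-clock savings mentioned above.
    if RUN_EVALS and step % 10 == 0:
        pass  # e.g. run_hellaswag_eval(model); sample_from_model(model)  -- hypothetical helpers
print(f"20 steps in {time.time() - t0:.1f}s")
```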
-
Hi there,
I noticed a small discrepancy between the README files of two repositories by @karpathy, and I'm hoping to get some clarification.
The README of the karpathy/build-nanogpt repository suggests the GPT-2 (124M) model can be trained in about an hour, whereas the README of the karpathy/nanoGPT repository indicates that training takes approximately four days. These statements seem to be at odds with each other, particularly regarding the training time.
Could someone shed some light on the difference in these training times? Is it due to different datasets, model configurations, or perhaps something else?
Thanks in advance for any clarification!