
# Hands-on Large(?) Language Model from Scratch

Tutorial: *LLM basics from scratch* provides step-by-step explanations.


## How to run

### Download the dataset

`cd` into the `data` folder:

```shell
cd data
```

Initialize Git LFS for the large files:

```shell
git lfs install
```

Clone the dataset:

```shell
git clone https://huggingface.co/datasets/Skylion007/openwebtext
```

Unzip the dataset:

```shell
bash unzip.sh
```
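The contents of `unzip.sh` aren't shown here. As a rough, hypothetical sketch of what an extraction step like this typically does (assuming the cloned repo keeps `.tar` archives under `openwebtext/subsets/`; `extract_subsets` is a made-up name, not the script's actual contents):

```shell
#!/usr/bin/env bash
# Hypothetical sketch only -- the real unzip.sh may differ.
# Assumes the cloned dataset keeps .tar archives under subsets/.
extract_subsets() {
  local dir="$1"
  for f in "$dir"/*.tar; do
    [ -e "$f" ] || continue   # glob matched nothing; skip
    tar -xf "$f" -C "$dir"    # unpack next to the archive
  done
}

extract_subsets "openwebtext/subsets"
```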

### Convert the data

Back in the root folder, run:

```shell
python convert_data.py
```

It converts all the `.xz` files in `data/openwebtext/subsets` and puts the converted `.txt` files in `data/extracted`.
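A minimal sketch of the per-file work `convert_data.py` presumably does, using the standard-library `lzma` module (the function name `extract_xz` is an assumption for illustration, not the script's actual API):

```python
import lzma
from pathlib import Path

def extract_xz(src: Path, dst_dir: Path) -> Path:
    """Decompress one .xz file into dst_dir as a .txt file (sketch)."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".txt")  # e.g. foo.xz -> foo.txt
    with lzma.open(src, "rt", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.write(fin.read())
    return dst
```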

We use neetbox for monitoring. Open `localhost:20202` (neetbox's default port) in your browser to check the progress. If you are working on a remote server, run `ssh -L 20202:localhost:20202 user@remotehost` to forward the port to your local machine, or access the server's IP address directly with that port number; either way you will see all the running processes.


Optionally, the script will ask whether you'd like to delete the original `.xz` files to save disk space. If you want to keep them, type `n` and press Enter.

## Train

```shell
python train.py --config config/gptv1_s.toml
```

Since we use neetbox for monitoring, open `localhost:20202` (neetbox's default port) in your browser to check the training progress.


## Predict

```shell
python inference.py --config config/gptv1_s.toml
```

Open `localhost:20202` (neetbox's default port) in your browser and feed text to your model via the action button.
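Behind the action button, inference presumably runs an autoregressive generation loop. A toy sketch of such a loop with greedy decoding (`model` here is any callable returning next-token scores; none of these names come from the repo's actual code):

```python
# Sketch of greedy autoregressive generation (hypothetical, not
# the repo's inference.py): repeatedly score the sequence so far,
# append the highest-scoring token, stop at eos or the length cap.
def generate(model, prompt_tokens, max_new_tokens=8, eos_token=None):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = model(tokens)  # scores over the vocabulary
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens

def toy(tokens):  # toy "model": always scores token 2 highest
    return [0.1, 0.2, 0.7]

print(generate(toy, [0, 1], max_new_tokens=3))  # -> [0, 1, 2, 2, 2]
```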



## Further reading

For more information, see also *LLM basics from scratch*.