Documentation Improvements #745

aman-17 · 2024-11-12T16:00:09Z

Documentation Improvements

Changes Made

- Fixed grammar and improved documentation clarity throughout.
- Restructured training instructions for better readability.
- Enhanced checkpoint download documentation.
  - Added new script scripts/download_checkpoints.py to automate checkpoint downloads.
  - Removed manual URL conversion by automating R2 to public URL conversion.
- Fixed bug for loading unsharded checkpoints in scripts/train.py.
- Improved data inspection instructions.

New Features

The new scripts/download_checkpoints.py script:

- Automatically handles URL conversions between R2 and public formats.
- Downloads checkpoint files with progress tracking.
- Supports specific step selection and directory listing.

.gitignore

README.md

scripts/download_checkpoints.py

scripts/train.py

aman-17

@dirkgr Can you review this?

2015aroras · 2024-11-20T23:51:57Z

README.md

@@ -58,7 +55,7 @@ The core models in the OLMo family released so far are (all trained on the [Dolm
 URLs to checkpoints at intermediate steps of the models' trainings can be found in the csv files under [`checkpoints/official/`](https://github.com/allenai/OLMo/blob/main/checkpoints/official). These 'directory' URLs cannot currently be directly accessed, but files within the directory are publicly accessible. These URLs can also be provided to the training script to resume training from the checkpoint (see [Training](#training)). Each checkpoint directory consists of:

 - `config.yaml`: the config at that training step.
-- `model.pt`, `optim.pt`, `train.pt`: model, optimizer and training state at that training step.
+- `model.safetensors`, `optim.safetensors`, `train.pt`: model, optimizer and training state at that training step.


Should this be `train.safetensors`? Also, the original models only have `*.pt` files, so that format should be mentioned somewhere.

We are going to save in `.safetensors` starting from OLMo-2.

Sure, but people might still try to use older OLMo models. Shouldn't the documentation be backwards-compatible?

No, it is not backwards-compatible.

scripts/download_checkpoints.py

dirkgr · 2024-11-26T01:20:56Z

scripts/download_checkpoints.py

+        except requests.exceptions.HTTPError as e:
+            print(f"HTTP error for {pattern}: {e}")
+        except requests.exceptions.RequestException as e:
+            print(f"Connection error for {pattern}: {e}")


This is still swallowing exceptions: it only prints them. This part of the code should check whether the exception is a 404 error, in which case it's fine, and otherwise let the exception propagate.
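The behaviour requested here can be sketched roughly as follows. This is only an illustration, not the code that was merged: `file_exists` is a hypothetical helper name, and `session` is assumed to be any object with a requests-style `head(url)` method (for example a `requests.Session()`), whose response exposes `status_code` and `raise_for_status()`.

```python
def file_exists(session, url):
    """Return True if `url` exists, False on a 404; raise on any other error.

    `session` is assumed to offer a requests-style `head(url)` method,
    e.g. a `requests.Session()`. The helper name is hypothetical.
    """
    # No try/except here: connection errors propagate to the caller.
    response = session.head(url)
    if response.status_code == 404:
        # A missing file is an expected outcome while probing, not a failure.
        return False
    # Any other 4XX/5XX status raises here instead of being printed and ignored.
    response.raise_for_status()
    return True
```

Because nothing is caught, a failed probe stops the script with a traceback rather than silently producing an incomplete file list.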

dirkgr · 2024-11-26T01:21:25Z

scripts/download_checkpoints.py

+   print(f"Saving to: {base_path}")
+   available_files = try_get_directory_listing(public_url)
+
+   if not available_files:


Suggested change

-   if not available_files:
+   if len(available_files) <= 0:

How does this make it any better? The length of found_files can never be negative, right? `if not available_files:` is the same as `if len(available_files) == 0:`.

dirkgr · 2024-11-26T01:22:06Z

scripts/download_checkpoints.py

+        try:
+            test_url = urljoin(url.rstrip('/') + '/', pattern)
+            response = requests.head(test_url)
+            # response.raise_for_status()


Leftover debugging code?

dirkgr · 2024-11-26T01:24:20Z

scripts/download_checkpoints.py

+       try:
+           print(f"\nDownloading: {file}")
+           download_file(file_url, file_path)
+       except requests.exceptions.Timeout:
+           print(f"Timeout error for {file}, retrying once...")
+           try:
+               download_file(file_url, file_path)
+           except requests.exceptions.RequestException as e:
+               failed_files.append(file)
+               print(f"Failed to download {file}: {e}")
+       except requests.exceptions.RequestException as e:
+           failed_files.append(file)
+           print(f"Failed to download {file}: {e}")


Don't retry in this way. The Python requests library already supports retries. In this code we should just call download_file(). If a file fails, we fail the whole process (i.e., we let the exception propagate).

dirkgr · 2024-11-26T01:36:14Z

scripts/download_checkpoints.py

+    response = requests.get(url, stream=True)
+    response.raise_for_status()
+    total_size = int(response.headers.get('content-length', 0))
+    save_path.parent.mkdir(parents=True, exist_ok=True)
+
+    with open(save_path, 'wb') as f:
+        with tqdm(total=total_size, unit='B', unit_scale=True, desc=save_path.name) as pbar:
+            for chunk in response.iter_content(chunk_size=chunk_size):
+                if chunk:
+                    f.write(chunk)
+                    pbar.update(len(chunk))


Retries should be handled in here. If the request fails with a recoverable status code (any 5XX, or 408, 409, 429), it should wait one second and then try again. Try up to 5 times, and then give up. When giving up, it must make sure that the file at save_path does not exist.
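A minimal sketch of the retry policy described in this comment, under stated assumptions: `fetch` stands in for one download attempt (the body of `download_file` above), and `DownloadError`, `is_recoverable` and `download_with_retries` are hypothetical names introduced only for illustration. The real script would inspect the status code of the `requests` response instead.

```python
import time
from pathlib import Path

MAX_ATTEMPTS = 5
RETRY_DELAY_SECONDS = 1.0


class DownloadError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed attempt."""

    def __init__(self, status_code):
        super().__init__(f"download failed with HTTP status {status_code}")
        self.status_code = status_code


def is_recoverable(status_code):
    """True for any 5XX status and for 408, 409 and 429."""
    return 500 <= status_code <= 599 or status_code in (408, 409, 429)


def download_with_retries(fetch, save_path, sleep=time.sleep):
    """Call `fetch(save_path)` up to MAX_ATTEMPTS times.

    `fetch` is assumed to write to `save_path` and to raise DownloadError
    when the server answers with an error status.
    """
    save_path = Path(save_path)
    try:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                fetch(save_path)
                return
            except DownloadError as e:
                # Unrecoverable status, or out of attempts: give up.
                if not is_recoverable(e.status_code) or attempt == MAX_ATTEMPTS:
                    raise
                sleep(RETRY_DELAY_SECONDS)
    except BaseException:
        # Giving up: make sure no partial file is left at save_path.
        save_path.unlink(missing_ok=True)
        raise
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets the one-second wait be replaced in tests.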

dirkgr · 2024-11-26T01:39:29Z

scripts/download_checkpoints.py

+            print(f"Error: Step {args.step} not found in the CSV file.")
+            print("Use list argument to see available step numbers.")
+            return


Don't print and then return. If you do this, the process will return normally, and whoever called it will think it succeeded. If you want to fail, raise an exception and let it crash the process.
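One way to follow this advice, sketched with a hypothetical helper `find_checkpoint_url` that takes the `(step, url)` pairs read from the CSV. The choice of `ValueError` is an assumption; the point is only that the failure is raised rather than printed.

```python
def find_checkpoint_url(urls, step):
    """Return the URL for `step` from (step, url) pairs, or raise if missing."""
    for candidate_step, url in urls:
        if candidate_step == step:
            return url
    # Raising makes the process exit with a non-zero status, so callers
    # (shell scripts, CI jobs) can tell that the download did not happen.
    raise ValueError(
        f"Step {step} not found in the CSV file. "
        "Use the list command to see available step numbers."
    )
```

An uncaught exception ends the Python process with a traceback and a non-zero exit code, which is what a calling script needs in order to detect the failure.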

dirkgr · 2024-11-26T01:41:17Z

scripts/download_checkpoints.py

+        reader = csv.DictReader(f)
+        urls = [(row['Step'], row['Checkpoint Directory']) for row in reader]
+
+    if args.command == 'list':


This should be an if ladder:

    if args.command == 'foo':
        do_foo_things()
        do_other_things()
    elif args.command == 'bar':
        do_bar_things()
        do_more_things()
    else:
        raise RuntimeError(f"Unexpected command: {args.command}")
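Applied to the `list` and `download` subcommands that appear in this thread, the ladder might look like the sketch below. The help strings and the returned messages are placeholders, and `build_parser` and `dispatch` are hypothetical names; only `argparse` from the standard library is used.

```python
import argparse


def build_parser():
    """Build a parser with the two subcommands discussed in this review."""
    parser = argparse.ArgumentParser(prog="download_checkpoints.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    list_parser = subparsers.add_parser("list", help="List available steps")
    list_parser.add_argument("csv_file", type=str)

    download_parser = subparsers.add_parser("download", help="Download a checkpoint")
    download_parser.add_argument("csv_file", type=str)
    download_parser.add_argument("--step", type=str, required=True)
    return parser


def dispatch(args):
    """Run exactly one branch per command; unknown commands fail loudly."""
    if args.command == "list":
        return f"listing steps in {args.csv_file}"
    elif args.command == "download":
        return f"downloading step {args.step} from {args.csv_file}"
    else:
        raise RuntimeError(f"Unexpected command: {args.command}")
```

A call such as `dispatch(build_parser().parse_args(["list", "checkpoints/official/OLMo-1B.csv"]))` takes the first branch, and the final `else` guards against a subcommand that was added to the parser but never handled.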

dirkgr · 2024-11-26T01:41:56Z

scripts/download_checkpoints.py

+            print(f"Step {step}")
+        return
+
+    if args.step:


Suggested change

-   if args.step:
+   if args.step is not None:
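The two checks differ for values that are present but falsy. A small illustration, using a hypothetical helper `describe_step`:

```python
def describe_step(step):
    """Contrast the truthiness check with the explicit None check."""
    return {"truthy": bool(step), "is_not_none": step is not None}


# None fails both checks; an empty string or the integer 0 only fails
# the truthiness check, so `if args.step:` would treat them as "not given".
print(describe_step(None))  # {'truthy': False, 'is_not_none': False}
print(describe_step(""))    # {'truthy': False, 'is_not_none': True}
print(describe_step(0))     # {'truthy': False, 'is_not_none': True}
print(describe_step("0"))   # {'truthy': True, 'is_not_none': True}
```

Since `--step` is declared with `type=str` in this script, the string `"0"` is truthy either way; the explicit `is not None` check mainly protects against an empty string or a later change to `type=int`.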

dirkgr · 2024-11-26T01:42:22Z

scripts/download_checkpoints.py

+                                          help='Download checkpoints from CSV file')
+    download_parser.add_argument('csv_file', type=str, 
+                               help='Path to the CSV file containing checkpoint URLs')
+    download_parser.add_argument('--step', type=str, required=True,


In the code it seems like step is not required if you want to download everything?

No, --step is a required argument. The script throws this error when the user omits --step:

    python scripts/download_checkpoints.py download checkpoints/official/OLMo-1B.csv --save-dir ./new_checkpoints
    usage: download_checkpoints.py download [-h] --step STEP [--save-dir SAVE_DIR] csv_file
    download_checkpoints.py download: error: the following arguments are required: --step

dirkgr · 2024-11-26T01:43:26Z

scripts/download_checkpoints.py

+        r2_url = convert_to_r2_url(url)
+        public_url = convert_to_public_url(r2_url)


The URLs in the CSV are already public URLs? Why are we doing this change?

I took these functions from another file of mine, where the CSV will contain R2 URLs while uploading. I missed removing them here.

docs: improve documentation

a622fb0

aman-17 added the type/documentation label Nov 12, 2024

aman-17 requested a review from dirkgr November 12, 2024 16:00

aman-17 assigned dirkgr and aman-17 Nov 12, 2024

dirkgr requested changes Nov 14, 2024

aman-17 added 2 commits November 20, 2024 11:06

updated code after Dirk's review

8aac2ea

added scripts/convert_pt_to_safetensors.py

c21087d

aman-17 commented Nov 20, 2024

2015aroras reviewed Nov 20, 2024

aman-17 and others added 2 commits November 25, 2024 16:43

updated arguments to subcommands and readme accordingly

4e256a9

Merge branch 'main' into improve-documentation

71abc2c

dirkgr requested changes Nov 26, 2024

dirkgr added 17 commits November 25, 2024 17:49

isort

c904429

Removing non-peteish configs

36ba37a

Removing some more configs

2448127

Merge remote-tracking branch 'origin/main' into improve-documentation

ccfb06d

Keep only the anneals we actually used

930daaa

Merge remote-tracking branch 'origin/main' into improve-documentation

e1e54d9

Remove even more anneals

b2f7ffc

Rename the old official configs

e4786af

Delete a bunch of unused scripts

46cfcce

Formatting

5d2fbb7

Official configs for stage 1 training

796de60

Update model table

206da7c

Checkpoints aren't ready anyways

889aaaa

Removing section about checkpoints that don't exist

d867ced

Update references to model

973b34d

Remove mentioning of checkpoints that don't exist

b3324b5

Remove reproducibility

dc3cfe1

dirkgr added 6 commits November 26, 2024 12:27

use, don't utilize

8c34f59

More references to non-existing checkpoints

4fdc829

Make the example match the model card

5da6e3d

Link to data

a40d46e

Fix link to eval

3b0139d

Adds link to instruct variants

d520823

dirkgr merged commit 0d14158 into main Nov 26, 2024
10 of 11 checks passed

dirkgr deleted the improve-documentation branch November 26, 2024 20:51
