Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation Improvements #745

Merged
merged 28 commits into from
Nov 26, 2024
Merged

Documentation Improvements #745

merged 28 commits into from
Nov 26, 2024

Conversation

aman-17
Copy link
Member

@aman-17 aman-17 commented Nov 12, 2024

Documentation Improvements

Changes Made

  • Fixed grammar and improved documentation clarity throughout.
  • Restructured training instructions for better readability.
  • Enhanced checkpoint download documentation.
    • Added new script scripts/download_checkpoints.py to automate checkpoint downloads.
    • Removed manual URL conversion by automating R2 to public URL conversion.
  • Fixed bug for loading unsharded checkpoints in scripts/train.py.
  • Improved data inspection instructions.

New Features

The new scripts/download_checkpoints.py script:

  • Automatically handles URL conversions between R2 and public formats.
  • Downloads checkpoint files with progress tracking.
  • Supports specific step selection and directory listing.

@aman-17 aman-17 added the type/documentation An issue or pull request related to documentation label Nov 12, 2024
@aman-17 aman-17 requested a review from dirkgr November 12, 2024 16:00
.gitignore Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
scripts/download_checkpoints.py Outdated Show resolved Hide resolved
scripts/download_checkpoints.py Outdated Show resolved Hide resolved
scripts/download_checkpoints.py Outdated Show resolved Hide resolved
scripts/download_checkpoints.py Outdated Show resolved Hide resolved
scripts/train.py Outdated Show resolved Hide resolved
scripts/train.py Outdated Show resolved Hide resolved
Copy link
Member Author

@aman-17 aman-17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dirkgr Can you review this?

README.md Outdated
@@ -58,7 +55,7 @@ The core models in the OLMo family released so far are (all trained on the [Dolm
URLs to checkpoints at intermediate steps of the models' trainings can be found in the csv files under [`checkpoints/official/`](https://github.com/allenai/OLMo/blob/main/checkpoints/official). These 'directory' URLs cannot currently be directly accessed, but files within the directory are publicly accessible. These URLs can also be provided to the training script to resume training from the checkpoint (see [Training](#training)). Each checkpoint directory consists of:

- `config.yaml`: the config at that training step.
- `model.pt`, `optim.pt`, `train.pt`: model, optimizer and training state at that training step.
- `model.safetensors`, `optim.safetensors`, `train.pt`: model, optimizer and training state at that training step.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

train.safetensors? Also, for the original model we just have *.pt so we should have that format mentioned somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to save in .safetensors starting from OLMo-2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but people might still try to use older OLMo models. The documentation should be backwards-compatible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not backwards compatible

scripts/download_checkpoints.py Outdated Show resolved Hide resolved
Comment on lines 52 to 55
except requests.exceptions.HTTPError as e:
print(f"HTTP error for {pattern}: {e}")
except requests.exceptions.RequestException as e:
print(f"Connection error for {pattern}: {e}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still swallowing exceptions. It just prints about them. This part of the code should check whether the exception is a 404 error, in which case it's fine, and otherwise let the exception propagate.

print(f"Saving to: {base_path}")
available_files = try_get_directory_listing(public_url)

if not available_files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not available_files:
if len(available_files) <= 0:

Copy link
Member Author

@aman-17 aman-17 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this make any better? found_files can never be negative right? if not available_files: is same as if len(available_files) == 0:.

try:
test_url = urljoin(url.rstrip('/') + '/', pattern)
response = requests.head(test_url)
# response.raise_for_status()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover debugging code?

Comment on lines 74 to 86
try:
print(f"\nDownloading: {file}")
download_file(file_url, file_path)
except requests.exceptions.Timeout:
print(f"Timeout error for {file}, retrying once...")
try:
download_file(file_url, file_path)
except requests.exceptions.RequestException as e:
failed_files.append(file)
print(f"Failed to download {file}: {e}")
except requests.exceptions.RequestException as e:
failed_files.append(file)
print(f"Failed to download {file}: {e}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't retry in this way. The Python requests library already supports retries. In this code we should just call download_file(). If a file fails, we fail the whole process (i.e., we let the exception propagate).

Comment on lines 23 to 33
response = requests.get(url, stream=True)
response.raise_for_status()
total_size = int(response.headers.get('content-length', 0))
save_path.parent.mkdir(parents=True, exist_ok=True)

with open(save_path, 'wb') as f:
with tqdm(total=total_size, unit='B', unit_scale=True, desc=save_path.name) as pbar:
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
pbar.update(len(chunk))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries should be handled in here. If the request fails with a recoverable error code (anything that starts with 5XX, 408, 409, 429), it should wait one second, and then try again. Try up to 5 times, and then give up. When giving up, it must make sure that the file at save_path does not exist.

Comment on lines 122 to 124
print(f"Error: Step {args.step} not found in the CSV file.")
print("Use list argument to see available step numbers.")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't print and then return. If you do this, the process will return normally, and whoever called it will think it succeeded. If you want to fail, raise an exception and let it crash the process.

reader = csv.DictReader(f)
urls = [(row['Step'], row['Checkpoint Directory']) for row in reader]

if args.command == 'list':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an if ladder:

if args.command == 'foo':
    do_foo_things()
    do_other_things()
elif args.command == 'bar':
    do_bar_things()
    do_more_things()
else:
    raise RuntimeException(f"Unexpected command: {args.command}")

print(f"Step {step}")
return

if args.step:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if args.step:
if args.step is not None:

help='Download checkpoints from CSV file')
download_parser.add_argument('csv_file', type=str,
help='Path to the CSV file containing checkpoint URLs')
download_parser.add_argument('--step', type=str, required=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the code it seems like step is not required if you want to download everything?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, step is required argument. It throws this error when user fails to add --step.

python scripts/download_checkpoints.py download checkpoints/official/OLMo-1B.csv  --save-dir ./new_checkpoints 
usage: download_checkpoints.py download [-h] --step STEP [--save-dir SAVE_DIR] csv_file
download_checkpoints.py download: error: the following arguments are required: --step

Comment on lines 128 to 129
r2_url = convert_to_r2_url(url)
public_url = convert_to_public_url(r2_url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URLs in the CSV are already public URLs? Why are we doing this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken these functions from my another file where I will have r2 urls in csv while uploading. I missed this here.

@dirkgr dirkgr merged commit 0d14158 into main Nov 26, 2024
10 of 11 checks passed
@dirkgr dirkgr deleted the improve-documentation branch November 26, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/documentation An issue or pull request related to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants