Add support for NeMo Run to ASR #10933

Open
wants to merge 29 commits into
base: main

Collaborator

@titu1994 titu1994 commented Oct 17, 2024

What does this PR do ?

Adds NeMo Run support to ASR, and adds shared NeMo Run utilities to the common collection.

Collection: [ASR, Common]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

Local Execution

conf/run_local.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

num_runs: 1
num_tasks_per_node: 1

########################################################################################################################

executor: local

containers:
  asr: nvcr.io/nvidia/nemo:24.07  # or nvcr.io/nvidia/nemo:dev

mounts:
  - "~/.cache/torch/NeMo:/cache/torch/NeMo"  # To mount your nemo cache dir (if needed for pretrained models)

Call run_helper.py

python run_helper.py --config-path "conf" --config-name "run_local.yaml" \
  script=asr_ctc/speech_to_text_ctc_bpe.py \
  script_config=conf/conformer/conformer_ctc_bpe.yaml \
  results_dir=$PWD/results \
  ++model.train_ds.manifest_filepath=/manifests/train_clean_5.json \
  ++model.validation_ds.manifest_filepath="/manifests/dev_clean_2.json" \
  ++model.tokenizer.dir=/manifests/librispeech_tokenizer_spe_unigram_v1024 \
  ++mount_1="<Path to Manifests>/librispeech/manifests:/manifests" \
  ++mount_2="<Data Path>:/data"

Cluster Execution

conf/run_slurm.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

# Optional arguments
num_runs: 1
num_tasks_per_node: 8
max_runtime: "00:03:45:00"

########################################################################################################################

executor: slurm

ssh_tunnel:
  host: <CLUSTER HOST>
  # ------------------------------- Fill this up! -------------------------------
  user: "${USER}"  # your username, or resolved from the ${USER} environment variable
  job_dir: <DIRECTORY TO STORE NEMO RUN JOB INFO>
  identity: "${CLUSTER_SSH_IDENTITY}"
  # -----------------------------------------------------------------------------

account: <SLURM ACCOUNT>
partition: <SLURM PARTITIONS>
job_name_prefix: <JOB PREFIX NAMES>

containers:
  asr:  <CONTAINER NAME>

# These env vars are propagated to slurm runtime
env_vars:
  - 'TOKENIZERS_PARALLELISM=false'
  - 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE=0.3'
  - 'TORCH_CUDNN_V8_API_ENABLED=1'
  - 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'
  - 'HYDRA_FULL_ERROR=1'

# These env vars are propagated to slurm runtime
required_env_vars:
  - 'HF_TOKEN'

mounts:
  # Replace with your own paths in your cluster config
  - <DATA PATH>:/data
  - <CHECKPOINTS PATH>:/asr_checkpoints

timeouts:
  interactive: 04:00:00

########################################################################################################################

IMPORTANT NOTE

NOTE: Be very careful when using ${} syntax inside your Hydra overrides. Inside double quotes ("), the shell expands ${...} from your environment variables before Hydra ever sees the value. If you want to pass "Hydra placeholders" through unchanged, use SINGLE QUOTES (') as shown below for ++name and results_dir.
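The quoting behaviour in the note can be checked without launching a job. The sketch below asks a POSIX `sh` to print the same argument under both quoting styles; the helper name `shell_passes_on` is made up for this illustration, and the example assumes `sh` is available on the machine.

```python
import os
import subprocess


def shell_passes_on(quoted_arg: str, env: dict) -> str:
    """Return the text a POSIX shell hands to a program for `quoted_arg`."""
    result = subprocess.run(
        ["sh", "-c", f"printf '%s' {quoted_arg}"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Environment in which exp_name is NOT defined as a shell variable.
env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}

# Double quotes: the shell expands ${exp_name} (to an empty string here).
double_quoted = shell_passes_on('"results_dir=/results/${exp_name}"', env)

# Single quotes: the placeholder survives and can be resolved by Hydra later.
single_quoted = shell_passes_on("'results_dir=/results/${exp_name}'", env)

print(double_quoted)
print(single_quoted)
```

With double quotes the placeholder is already gone by the time run_helper.py parses its overrides, which is why the single-quoted form is the one to use for values such as '${exp_name}'.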

Call run_helper.py

python run_helper.py --config-path conf/ --config-name run_slurm \
  script=speech_multitask/speech_to_text_aed.py \
  script_config=conf/aed_config.yaml \
  exp_name=<JOB NAME> \
  results_dir='/results/${exp_name}' \
  num_runs=2 \
  ++trainer.num_nodes=2 \
  ++name='${exp_name}' \
  ++exp_manager.wandb_logger_kwargs.project="nemo_asr" \
  ++USER=$USER \
  ++CLUSTER_SSH_IDENTITY=$CLUSTER_SSH_IDENTITY

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

unmounted_path = run_utils.get_unmounted_filepath(cluster_cfg, v)
run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)

# elif "ais://" in v and ais_endpoint is not None: # if the value is a string, check if its an ais path
Collaborator Author

@pzelasko How does one check the existence of a dataset given by a path in AIS?

Collaborator

First, if 'ais://' in v will not cut it: the path could start with s3:// or other prefixes, so you need to check whether it's a valid URL. See this helper function for reference: https://github.com/lhotse-speech/lhotse/blob/41269ff1f86e2fab6831d9b638ea922409b6b166/lhotse/utils.py#L132-L137

Once you have an AIS client, you can run the following snippet, which will raise a 404 error if the object is not found:

client.fetch_object_by_url(url).head()

Is this check being run directly on the cluster?
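For illustration, a scheme-agnostic URL check along the lines suggested above could look like the sketch below. It uses only the standard library; the function name `looks_like_url` is hypothetical, and it is not claimed to match the linked lhotse helper line for line.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """True when the string has both a scheme and a network location."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)


# Any scheme is accepted, not only ais://.
print(looks_like_url("ais://bucket/train/shard_000.tar"))
print(looks_like_url("s3://bucket/train/shard_000.tar"))
# Plain filesystem paths have no scheme, so they are rejected.
print(looks_like_url("/manifests/train_clean_5.json"))
```

Checking for a scheme plus a network location, rather than for one literal prefix, keeps the condition valid if datasets are later referenced through other URL schemes.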

Collaborator Author

The check is run locally via an SSH tunnel.

Collaborator

If it's through a tunnel, then it should be OK as the actual command would be executed on the cluster.

Collaborator

Isn't it better to rename this file to nemo-run_helper.py?

Collaborator Author

No need IMO; it makes the name unnecessarily longer.

Collaborator

I still think it's better to name it nemo-run_helper.py (still shorter than the speech_to_text_finetune.py name, for comparison). Or we could update our README.md to describe what each script does at a high level.

Collaborator

@pzelasko Nov 18, 2024

IMO the sweetest syntactic sugar here would be to register a CLI using setuptools, like this:

https://github.com/lhotse-speech/lhotse/blob/86b7e79431bdd49a62189cc007f6cd4a7b180cb9/setup.py#L220-L224

under the name nemo-run. When your Python environment is active, this creates a shell command that points to the right Python function for you:

$ nemo-run my_script.py args

(Note: lhotse uses click for its CLI, but the setuptools mechanism is agnostic to the actual CLI parser.)
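A minimal sketch of that mechanism is shown below. The `console_scripts` entry-point group is standard setuptools; everything else (the module name `nemo_run_cli`, the functions `parse_cli` and `main`, and the argument layout) is hypothetical and only illustrates the shape such a command could take.

```python
import argparse
from typing import List, Optional, Tuple

# Hypothetical entry-point table, to be passed as
# setuptools.setup(..., entry_points=ENTRY_POINTS) in setup.py.
# "nemo_run_cli" is an assumed module name, not one that exists in NeMo.
ENTRY_POINTS = {
    "console_scripts": [
        "nemo-run = nemo_run_cli:main",
    ],
}


def parse_cli(argv: Optional[List[str]] = None) -> Tuple[str, List[str]]:
    """Split the command line into the script path and its Hydra-style overrides."""
    parser = argparse.ArgumentParser(prog="nemo-run")
    parser.add_argument("script", help="Path of the script to launch")
    parser.add_argument("overrides", nargs="*", help="key=value overrides")
    args = parser.parse_args(argv)
    return args.script, args.overrides


def main(argv: Optional[List[str]] = None) -> int:
    """Function the generated `nemo-run` command would call."""
    script, overrides = parse_cli(argv)
    print(f"would launch {script} with overrides {overrides}")
    return 0  # the return value becomes the process exit status
```

After installing the package into the active environment, running `nemo-run my_script.py args` would call `main()` without the user having to know where the helper script lives.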

Collaborator

@pzelasko left a comment

Is it possible to test this script somehow?

# recursively walk all values of the script_config, checking if its a path-like string and if so, check if the path is a mounted path
# if it is not, raise an error

if 'AIS_ENDPOINT' in os.environ:
Collaborator

nitpick: ais_endpoint = os.environ.get("AIS_ENDPOINT")
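As a rough sketch of the recursive walk described in the code comment, combined with the `os.environ.get` form from the nitpick: the version below assumes the script config has already been converted to plain dicts and lists, and both helper names and the "starts with /" heuristic are hypothetical, not taken from the PR.

```python
import os
from typing import Any, Iterator


def iter_string_values(node: Any) -> Iterator[str]:
    """Yield every string leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_string_values(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_string_values(item)
    elif isinstance(node, str):
        yield node


def looks_like_local_path(value: str) -> bool:
    """Hypothetical heuristic: treat absolute POSIX paths as path-like."""
    return value.startswith("/")


# os.environ.get returns None when the variable is unset, so a separate
# `'AIS_ENDPOINT' in os.environ` test is not needed.
ais_endpoint = os.environ.get("AIS_ENDPOINT")

script_config = {
    "model": {
        "train_ds": {"manifest_filepath": "/manifests/train_clean_5.json", "batch_size": 16},
        "validation_ds": {"manifest_filepath": ["/manifests/dev_clean_2.json"]},
        "tokenizer": {"type": "bpe"},
    }
}

# Values that a mount check would then need to validate.
paths_to_check = [v for v in iter_string_values(script_config) if looks_like_local_path(v)]
print(paths_to_check)
```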

# Create the command to run the script
cmd = """
nvidia-smi && \
export PYTHONPATH=$PYTHONPATH:/nemo_run/code && \
export HF_TOKEN={HF_TOKEN} && \
export WANDB_API_KEY={WANDB} && \
find /results/ -name '*-unfinished' -type f -delete && \
Collaborator

Is this really needed?

Collaborator Author

It was part of the scripts you sent. I think it's pretty safe to do as long as exp manager doesn't crash? I can remove it though.

Collaborator

Got it - hopefully we don't need it any longer after fixing checkpoint issues.
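For readers unfamiliar with the `find` invocation under discussion, the sketch below is a rough Python equivalent of `find <root> -name '*-unfinished' -type f -delete`. The function name is made up for this example and the code is not part of the PR.

```python
from pathlib import Path
from typing import List


def delete_unfinished_markers(root: str) -> List[Path]:
    """Delete regular files under `root` whose names end in '-unfinished'."""
    deleted = []
    # Materialize the matches first so that deleting does not disturb the walk.
    for path in list(Path(root).rglob("*-unfinished")):
        # Mirror `-type f`: skip directories and symlinks.
        if path.is_file() and not path.is_symlink():
            path.unlink()
            deleted.append(path)
    return deleted
```

Only files matching the suffix are touched; directories and everything else under the results directory are left in place.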

@titu1994
Collaborator Author

@pzelasko I've added the AIS check, as well as a flag in the cluster config, check_ais_paths, which can be disabled (via config or command line) in case folks want to preserve speed instead of doing path checks.

Contributor

[🤖]: Hi @titu1994 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot, so I'll leave it to you to decide what to do next.

//cc @pablo-garay @ko3n1g

Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Nov 12, 2024
Collaborator

Why is this created in the first place?

@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import datetime
Collaborator

How are you planning to test NeMo Run for ASR? If as part of CI, can you add that too?

++mount_<anything>='/src:/dest'

Args:
cluster_cfg: Cluster config dictionary
Collaborator

Please add the keys of the dictionary here, along with their suggested values.

Collaborator

Actually, I'm wondering if the usage of the tools in this PR deserves its own doc page or tutorial.
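To make the ++mount_<anything>='/src:/dest' convention from the docstring concrete, here is a hedged sketch of how such overrides could be folded into the mounts list. It assumes the cluster config is a plain dict with an optional `mounts` key; the function name `collect_mounts` and the validation rule are hypothetical and may differ from what the PR's utilities do.

```python
from typing import Any, Dict, List


def collect_mounts(cluster_cfg: Dict[str, Any]) -> List[str]:
    """Combine cfg['mounts'] with any extra 'mount_<anything>' overrides."""
    mounts = list(cluster_cfg.get("mounts") or [])
    # Sort the keys so the resulting order is deterministic.
    for key in sorted(cluster_cfg):
        if key.startswith("mount_"):
            value = cluster_cfg[key]
            if ":" not in value:
                raise ValueError(f"{key} must look like '/src:/dest', got {value!r}")
            mounts.append(value)
    return mounts


cluster_cfg = {
    "executor": "local",
    "mounts": ["~/.cache/torch/NeMo:/cache/torch/NeMo"],
    "mount_2": "/raw_data:/data",
    "mount_1": "/librispeech/manifests:/manifests",
}
print(collect_mounts(cluster_cfg))
```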

Collaborator

Why is this file being deleted?


# Get the execution script
cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)
Collaborator

Why pass cluster_cfg when merged_config already contains cluster_cfg?


# Copy the merged config file to remote location's /results/configs directory
config_dir = os.path.join(results_dir, 'configs')
run_utils.create_remote_config(merged_config, config_name, config_dir, cluster_cfg)
Collaborator

Same here: doesn't merged_config already contain cluster_cfg?


# Get the execution script
cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)
Collaborator

Minor: maybe rename to get_execution_script_cmd?

Returns:
Task: The task object added to the NeMo Run experiment.
"""
# Checj if dependencies are provided
Collaborator

Suggested change
# Checj if dependencies are provided
# Check if dependencies are provided

@github-actions github-actions bot removed the stale label Nov 16, 2024
run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)

elif (
check_ais_paths and "ais://" in v and ais_endpoint is not None
Collaborator

In case my earlier comment was missed: reiterating that AIStore handles all kinds of URL schemes, so the condition that leads to the AIStore code branch here should:

  1. check whether a path is a URL/URI
  2. check whether ais_endpoint is not None

ais_client.fetch_object_by_url(v).head()

except ImportError:
logging.warning("\nais module is not installed. Please install it to use ais paths.\n")
Collaborator

Suggested change
logging.warning("\nais module is not installed. Please install it to use ais paths.\n")
logging.warning("\nais module is not installed. Please 'pip install aistore' to use ais paths.\n")

Collaborator

@pzelasko left a comment

Left a few more comments and responses. How can I run something using this as a final check before approving?
