Add support for NeMo Run to ASR #10933

titu1994 · 2024-10-17T20:52:34Z

What does this PR do ?

Adds NeMo Run support to ASR, plus shared NeMo Run utilities in the common collection.

Collection: [ASR, Common]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

Local Execution

conf/run_local.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

num_runs: 1
num_tasks_per_node: 1

########################################################################################################################

executor: local

containers:
  asr: nvcr.io/nvidia/nemo:24.07  # or nvcr.io/nvidia/nemo:dev

mounts:
  - "~/.cache/torch/NeMo:/cache/torch/NeMo"  # To mount your nemo cache dir (if needed for pretrained models)

Call run_helper.py

python run_helper.py --config-path "conf" --config-name "run_local.yaml" \
  script=asr_ctc/speech_to_text_ctc_bpe.py \
  script_config=conf/conformer/conformer_ctc_bpe.yaml \
  results_dir=$PWD/results \
  ++model.train_ds.manifest_filepath=/manifests/train_clean_5.json \
  ++model.validation_ds.manifest_filepath="/manifests/dev_clean_2.json" \
  ++model.tokenizer.dir=/manifests/librispeech_tokenizer_spe_unigram_v1024 \
  ++mount_1="<Path to Manifests>/librispeech/manifests:/manifests" \
  ++mount_2="<Data Path>:/data"
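
The ++mount_1 and ++mount_2 overrides above add container mounts on top of the mounts list in the config. A minimal sketch of how such keys could be folded into one mount list follows; the function name collect_mounts and the plain-dict config are assumptions made for illustration and are not taken from run_helper.py.

```python
def collect_mounts(cluster_cfg: dict) -> list:
    """Merge the base 'mounts' list with any 'mount_<suffix>' override keys."""
    mounts = list(cluster_cfg.get("mounts", []))
    for key in sorted(cluster_cfg):
        if not key.startswith("mount_"):
            continue
        value = cluster_cfg[key]
        # Split on the last ':' so the source side may itself contain colons.
        src, sep, dest = value.rpartition(":")
        if not sep or not src or not dest:
            raise ValueError(f"{key} must look like '/src:/dest', got {value!r}")
        mounts.append(value)
    return mounts


# Hypothetical config, mirroring the local-execution example.
cfg = {
    "mounts": ["~/.cache/torch/NeMo:/cache/torch/NeMo"],
    "mount_1": "/home/me/librispeech/manifests:/manifests",
    "mount_2": "/home/me/data:/data",
}
print(collect_mounts(cfg))
```

Validating the '/src:/dest' shape up front means a malformed override fails before anything is submitted.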

Cluster Execution

conf/run_slurm.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

# Optional arguments
num_runs: 1
num_tasks_per_node: 8
max_runtime: "00:03:45:00"

########################################################################################################################

executor: slurm

ssh_tunnel:
  host: <CLUSTER HOST>
  # ------------------------------- Fill this up! -------------------------------
  user: "${USER}"  # your username; or resolved from ${USER} environment variable 
  job_dir: <DIRECTORY TO STORE NEMO RUN JOB INFO>
  identity: "${CLUSTER_SSH_IDENTITY}"
  # -----------------------------------------------------------------------------

account: <SLURM ACCOUNT>
partition: <SLURM PARTITIONS>
job_name_prefix: <JOB PREFIX NAMES>

containers:
  asr:  <CONTAINER NAME>

# These env vars are propagated to slurm runtime
env_vars:
  - 'TOKENIZERS_PARALLELISM=false'
  - 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE=0.3'
  - 'TORCH_CUDNN_V8_API_ENABLED=1'
  - 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'
  - 'HYDRA_FULL_ERROR=1'

# These env vars are propagated to slurm runtime
required_env_vars:
  - 'HF_TOKEN'

mounts:
  # Replace with your own paths in your cluster config
  - <DATA PATH>:/data
  - <CHECKPOINTS PATH>:/asr_checkpoints

timeouts:
  interactive: 04:00:00

########################################################################################################################

IMPORTANT NOTE

NOTE: Be very careful when using ${} syntax inside your Hydra overrides. If you wrap an override in double quotes ("), your shell will try to expand it from your environment variables before Hydra ever sees it. If you want to pass "Hydra placeholders", use SINGLE QUOTES (') as shown below for results_dir and ++name.

Call run_helper.py

python run_helper.py --config-path conf/ --config-name run_slurm \
  script=speech_multitask/speech_to_text_aed.py \
  script_config=conf/aed_config.yaml \
  exp_name=<JOB NAME> \
  results_dir='/results/${exp_name}' \
  num_runs=2 \
  ++trainer.num_nodes=2 \
  ++name='${exp_name}' \
  ++exp_manager.wandb_logger_kwargs.project="nemo_asr" \
  ++USER=$USER \
  ++CLUSTER_SSH_IDENTITY=$CLUSTER_SSH_IDENTITY
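
The quoting rule from the note can be checked without touching a cluster. The sketch below asks /bin/sh what a program would receive for each quoting style; the helper name as_seen_by_program is hypothetical, and a POSIX shell at /bin/sh is assumed.

```python
import subprocess


def as_seen_by_program(arg_as_typed: str) -> str:
    """Return the argument a program receives after /bin/sh processes the quoting."""
    result = subprocess.run(
        ["/bin/sh", "-c", f"printf '%s' {arg_as_typed}"],
        env={"PATH": "/usr/bin:/bin"},  # exp_name is deliberately NOT set here
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Double quotes: the shell expands ${exp_name} itself (to an empty string, as it is unset).
print(as_seen_by_program('"results_dir=/results/${exp_name}"'))
# Single quotes: the placeholder is passed through untouched for Hydra to resolve.
print(as_seen_by_program("'results_dir=/results/${exp_name}'"))
```

With double quotes the placeholder is gone before Hydra sees the override, which is why the command above single-quotes results_dir and ++name.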

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

titu1994 · 2024-10-17T21:47:01Z

examples/asr/run_helper.py

+            unmounted_path = run_utils.get_unmounted_filepath(cluster_cfg, v)
+            run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)
+
+        # elif "ais://" in v and ais_endpoint is not None:  # if the value is a string, check if its an ais path


@pzelasko How does one check the existence of a dataset given by a path in AIS?

First, if 'ais://' in v will not cut it: the path could start with s3:// or other prefixes, so you need to check whether it is a valid URL. See this helper function for reference: https://github.com/lhotse-speech/lhotse/blob/41269ff1f86e2fab6831d9b638ea922409b6b166/lhotse/utils.py#L132-L137

Once you have AIS client, you can run the following snippet which will raise a 404 error if the object is not found:

client.fetch_object_by_url(url).head()

Is this check being run directly on the cluster?

The check is run locally via an SSH tunnel.

If it's through a tunnel, then it should be OK as the actual command would be executed on the cluster.
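
The two steps discussed in this thread (validate that the value is a URL, then issue a HEAD request through the AIS client) could be sketched as follows. The function names are hypothetical, and the URL test is only modelled on the idea of the linked lhotse helper (scheme and network location both present), not copied from it. The client is passed in, so the sketch does not assume how it is constructed; the only client call used is the fetch_object_by_url(url).head() chain quoted above.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """True when the string has both a scheme and a network location."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return bool(parsed.scheme) and bool(parsed.netloc)


def check_remote_object(client, url: str) -> None:
    """Raise if `url` is not a URL or the object cannot be reached with a HEAD request."""
    if not is_valid_url(url):
        raise ValueError(f"Not a URL: {url!r}")
    try:
        client.fetch_object_by_url(url).head()
    except Exception as err:  # the SDK's concrete error type is not assumed here
        raise FileNotFoundError(f"Could not find remote object: {url}") from err


for candidate in ("ais://bucket/train.tar", "s3://bucket/train.tar", "/manifests/train.json"):
    print(candidate, is_valid_url(candidate))
```

Checking the scheme generically instead of matching 'ais://' keeps s3:// and other prefixes on the same code path.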

nithinraok · 2024-10-18T14:21:11Z

examples/asr/run_helper.py

Isn't it better to rename this file to nemo-run_helper.py?

No need IMO, it makes the name unnecessarily longer.

I still think it's better to name it nemo-run_helper.py (still shorter than speech_to_text_finetune.py, for comparison). Or we could update our README.md to describe what each script does at a high level.

IMO the sweetest syntactic sugar here would be to register a CLI using setuptools like this

https://github.com/lhotse-speech/lhotse/blob/86b7e79431bdd49a62189cc007f6cd4a7b180cb9/setup.py#L220-L224

under the name nemo-run. With the Python environment active, this creates a shell command that points to the right Python function for you:

$ nemo-run my_script.py args

(note: lhotse uses click for CLI but the setuptools mechanism is agnostic to what is the actual CLI parser)
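
A minimal sketch of that setuptools mechanism, using argparse instead of click. The module name nemo_run_cli and the argument layout are assumptions for illustration; only the console_scripts entry-point format itself is standard setuptools behaviour.

```python
import argparse

# In setup.py this dict would be passed as setup(..., entry_points=ENTRY_POINTS).
# "nemo_run_cli" is a hypothetical module name; "main" is the function defined below.
ENTRY_POINTS = {"console_scripts": ["nemo-run=nemo_run_cli:main"]}


def main(argv=None):
    """Function that the generated nemo-run command would call."""
    parser = argparse.ArgumentParser(prog="nemo-run")
    parser.add_argument("script", help="training script to launch")
    parser.add_argument("overrides", nargs="*", help="Hydra-style key=value overrides")
    args = parser.parse_args(argv)
    # A real entry point would hand these over to the launcher; this sketch only reports them.
    print(f"would launch {args.script} with overrides {args.overrides}")
    return 0


# Equivalent of typing: nemo-run my_script.py exp_name=demo
main(["my_script.py", "exp_name=demo"])
```

After pip install, setuptools generates the nemo-run executable in the environment's bin directory, so no file rename is needed to make the command discoverable.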

pzelasko

Is it possible to test this script somehow?

pzelasko · 2024-10-22T13:43:43Z

examples/asr/run_helper.py

    # recursively walk all values of the script_config, checking if its a path-like string and if so, check if the path is a mounted path
    # if it is not, raise an error

+    if 'AIS_ENDPOINT' in os.environ:


nitpick: ais_endpoint = os.environ.get("AIS_ENDPOINT")
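
The recursive walk described in the code comment above can be illustrated in isolation. This is a sketch under stated assumptions: a nested dict/list config, mounts given as 'src:dest' strings, and "path-like" meaning a string that starts with '/'. The helper names are hypothetical; the PR itself goes through run_utils.get_unmounted_filepath, whose behaviour is not reproduced here.

```python
def iter_string_values(node, prefix=""):
    """Yield (dotted_key, value) for every string leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_string_values(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, (list, tuple)):
        for idx, child in enumerate(node):
            yield from iter_string_values(child, f"{prefix}[{idx}]")
    elif isinstance(node, str):
        yield prefix, node


def find_unmounted_paths(script_config, mounts):
    """Return (key, path) pairs for absolute paths outside every mount destination."""
    dests = [m.rpartition(":")[2].rstrip("/") for m in mounts]
    problems = []
    for key, value in iter_string_values(script_config):
        if not value.startswith("/"):
            continue  # only absolute paths count as path-like in this sketch
        if not any(value == d or value.startswith(d + "/") for d in dests):
            problems.append((key, value))
    return problems


config = {
    "model": {
        "train_ds": {"manifest_filepath": "/manifests/train_clean_5.json"},
        "tokenizer": {"dir": "/tokenizers/spe_unigram_v1024"},
    }
}
mounts = ["/home/me/manifests:/manifests", "/home/me/data:/data"]
print(find_unmounted_paths(config, mounts))
```

A caller could raise an error when the returned list is non-empty, which matches the "if it is not, raise an error" behaviour described in the comment.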

pzelasko · 2024-10-22T13:49:02Z

examples/asr/run_helper.py

    # Create the command to run the script
    cmd = """
 nvidia-smi && \
 export PYTHONPATH=$PYTHONPATH:/nemo_run/code && \
 export HF_TOKEN={HF_TOKEN} && \
 export WANDB_API_KEY={WANDB} && \
+find /results/ -name '*-unfinished' -type f -delete && \ 


is this really needed?

It was part of the scripts you sent. I think it's pretty safe to do as long as exp manager doesn't crash. I can remove it, though.

Got it - hopefully we don't need it any longer after fixing checkpoint issues.

titu1994 · 2024-10-23T06:17:03Z

@pzelasko I've added the AIS check, as well as a check_ais_paths flag in the cluster config. It can be disabled (via config or command line) in case folks prefer speed over path checks.

github-actions · 2024-10-29T01:12:18Z

[🤖]: Hi @titu1994 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it you what to do next.

//cc @pablo-garay @ko3n1g

github-actions · 2024-11-12T01:56:55Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

nithinraok · 2024-11-15T23:52:44Z

examples/asr/_temp/config.yaml

Why is this created in the first place?

nithinraok · 2024-11-15T23:55:58Z

examples/asr/run_helper.py

@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import datetime


How are you planning to test NeMo Run for ASR? If it is going to be part of CI, can you add that too?

nithinraok · 2024-11-15T23:58:15Z

examples/asr/run_helper.py

+    ++mount_<anything>='/src:/dest'
+
+    Args:
+        cluster_cfg: Cluster config dictionary


Please add the keys of the dictionary here, along with their suggested values.

Actually, I wonder whether the usage of the tools in this PR deserves its own doc page or tutorial.

nithinraok · 2024-11-16T00:07:46Z

examples/asr/conf/run_local.yaml

Why is this file being deleted?

nithinraok · 2024-11-16T00:23:57Z

examples/asr/run_helper.py


+        # Get the execution script
+        cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)


Why pass cluster_cfg when merged_config already contains cluster_cfg?

nithinraok · 2024-11-16T00:24:33Z

examples/asr/run_helper.py

+
+        # Copy the merged config file to remote location's /results/configs directory
+        config_dir = os.path.join(results_dir, 'configs')
+        run_utils.create_remote_config(merged_config, config_name, config_dir, cluster_cfg)


Same here: doesn't merged_config already contain cluster_cfg?

nithinraok · 2024-11-16T00:26:38Z

examples/asr/run_helper.py


+        # Get the execution script
+        cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)


Minor: maybe rename to get_execution_script_cmd?

nithinraok · 2024-11-16T00:33:00Z

nemo/collections/common/parts/run_utils.py

+    Returns:
+        Task: The task object added to the NeMo Run experiment.
+    """
+    # Checj if dependencies are provided


Suggested change

-    # Checj if dependencies are provided
+    # Check if dependencies are provided

pzelasko · 2024-11-18T15:21:30Z

examples/asr/run_helper.py

+            run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)
+
+        elif (
+            check_ais_paths and "ais://" in v and ais_endpoint is not None


In case my earlier comment was missed: reiterating that AIStore handles all kinds of URL schemes, so the condition that leads to the AIStore code branch here should:

- check whether the path is a URL/URI
- check whether ais_endpoint is not None
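
Put together with the check_ais_paths flag, the branch condition might look like the sketch below. The function name should_check_with_ais is hypothetical, and the URL test (scheme and network location both present) is an assumption about what "is a URL/URI" should mean here.

```python
import os
from urllib.parse import urlparse


def should_check_with_ais(value, ais_endpoint, check_ais_paths=True):
    """Decide whether a config value should take the AIStore branch."""
    if not check_ais_paths or ais_endpoint is None:
        return False
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Any scheme qualifies (ais://, s3://, ...), not just the literal 'ais://'.
    return bool(parsed.scheme) and bool(parsed.netloc)


ais_endpoint = os.environ.get("AIS_ENDPOINT")  # None when the variable is unset
print(should_check_with_ais("s3://bucket/shard-000.tar", ais_endpoint))
```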

pzelasko · 2024-11-18T15:21:59Z

examples/asr/run_helper.py

+                ais_client.fetch_object_by_url(v).head()
+
+            except ImportError:
+                logging.warning("\nais module is not installed. Please install it to use ais paths.\n")


Suggested change

-                logging.warning("\nais module is not installed. Please install it to use ais paths.\n")
+                logging.warning("\nais module is not installed. Please 'pip install aistore' to use ais paths.\n")

pzelasko

Left a few more comments and responses. How can I run something using this as a final check before approving?

github-actions bot added ASR common labels Oct 17, 2024

github-advanced-security bot found potential problems Oct 17, 2024

titu1994 force-pushed the asr_run branch from c117894 to d870a88 Compare October 17, 2024 21:46

titu1994 commented Oct 17, 2024

ericharper requested a review from hemildesai October 17, 2024 23:23

nithinraok reviewed Oct 18, 2024

pzelasko reviewed Oct 22, 2024

titu1994 added the Run CICD label Oct 23, 2024

titu1994 requested review from pablo-garay and ko3n1g as code owners October 28, 2024 22:10

github-actions bot added core Changes to NeMo Core TTS NLP CI Multi Modal audio labels Oct 28, 2024

titu1994 added 12 commits October 28, 2024 15:10

Remove local run config

8757de1

Signed-off-by: smajumdar <[email protected]>

Remove local run config

48b842c

Signed-off-by: smajumdar <[email protected]>

Fix num gpus resolution

9453b44

Signed-off-by: smajumdar <[email protected]>

Fix num gpus resolution

b34d2d9

Signed-off-by: smajumdar <[email protected]>

Fix num gpus resolution

b5c1af1

Signed-off-by: smajumdar <[email protected]>

Fix resolution of the cluster config in a copy

7f6dfed

Signed-off-by: smajumdar <[email protected]>

Fix resolution of the cluster config in a copy

92facc5

Signed-off-by: smajumdar <[email protected]>

Fix resolution of the cluster config in a copy

28fe5bb

Signed-off-by: smajumdar <[email protected]>

Fix resolution of the cluster config in a copy

f82bab8

Signed-off-by: smajumdar <[email protected]>

Make config name unique with date time stamp

2efe4ae

Signed-off-by: smajumdar <[email protected]>

Make config name unique with date time stamp

db25c75

Signed-off-by: smajumdar <[email protected]>

Make config name unique with date time stamp

5273301

Signed-off-by: smajumdar <[email protected]>

titu1994 and others added 14 commits October 28, 2024 15:12

Fix bug in dependency launcher

96e5036

Signed-off-by: smajumdar <[email protected]>

Fix bug in dependency launcher

9e0b538

Signed-off-by: smajumdar <[email protected]>

Fix bug in dependency launcher

6bc72a8

Signed-off-by: smajumdar <[email protected]>

Fix bug in dependency launcher

ddcbcd0

Signed-off-by: smajumdar <[email protected]>

Fix bug in dependency launcher

9f60c9f

Signed-off-by: smajumdar <[email protected]>

Cleanup

1dfaf18

Signed-off-by: smajumdar <[email protected]>

Apply isort and black reformatting

fd36af3

Signed-off-by: titu1994 <[email protected]>

Correct task ids implementation inside run utils

4c7cc8f

Signed-off-by: smajumdar <[email protected]>

Apply isort and black reformatting

4bc64af

Signed-off-by: titu1994 <[email protected]>

Update ais check

ce2e16a

Signed-off-by: smajumdar <[email protected]>

Add the ability to disable ais checks optionally

af083c4

Signed-off-by: smajumdar <[email protected]>

Apply isort and black reformatting

784aab4

Signed-off-by: titu1994 <[email protected]>

fix: Resolve mutable default issue in MultiModalSampleConfig dataclass (

89fa080

#11061)

Add the ability to disable ais checks optionally

042542e

Signed-off-by: smajumdar <[email protected]>

titu1994 force-pushed the asr_run branch from a91ee2e to 042542e Compare October 28, 2024 22:12

github-actions bot removed core Changes to NeMo Core TTS NLP CI Multi Modal labels Oct 28, 2024

titu1994 added Run CICD and removed Run CICD labels Oct 28, 2024

github-actions bot added the stale label Nov 12, 2024

nithinraok reviewed Nov 16, 2024

github-actions bot removed the stale label Nov 16, 2024

pzelasko reviewed Nov 18, 2024
