Add support for NeMo Run to ASR #10933
base: main
examples/asr/run_helper.py
Outdated
unmounted_path = run_utils.get_unmounted_filepath(cluster_cfg, v)
run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)

# elif "ais://" in v and ais_endpoint is not None:  # if the value is a string, check if its an ais path
@pzelasko How does one check the existence of a dataset given by a path in AIS?
First, `if 'ais://' in v` will not cut it: it could start with `s3://` or other prefixes, so you need to check if it's a valid URL; see this helper function for reference: https://github.com/lhotse-speech/lhotse/blob/41269ff1f86e2fab6831d9b638ea922409b6b166/lhotse/utils.py#L132-L137
Once you have an AIS client, you can run the following snippet, which will raise a 404 error if the object is not found:
client.fetch_object_by_url(url).head()
Is this check being run directly on the cluster?
The check is run locally via an SSH tunnel.
If it's through a tunnel, then it should be OK as the actual command would be executed on the cluster.
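The URL check and existence probe discussed in this thread could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `is_valid_url` is written in the spirit of the linked lhotse helper, `check_remote_object` is a hypothetical name, and `client` is assumed to be any object exposing the `fetch_object_by_url(url).head()` call quoted above.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Return True if `value` has both a scheme and a network location."""
    try:
        parsed = urlparse(value)
    except (AttributeError, TypeError, ValueError):
        return False
    return bool(parsed.scheme) and bool(parsed.netloc)


def check_remote_object(client, url: str) -> None:
    """Probe `url` through an AIS-style client (hypothetical helper).

    Any error raised by `head()` (for example a 404 for a missing object)
    is deliberately left to propagate to the caller.
    """
    if not is_valid_url(url):
        raise ValueError(f"Not a URL: {url!r}")
    # This is the call quoted in the review comment.
    client.fetch_object_by_url(url).head()
```

Checking for a scheme plus a network location accepts `ais://`, `s3://` and other prefixes alike, while plain filesystem paths are rejected before any remote call is made.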
Isn't it better to rename this file to nemo-run_helper.py?
No need IMO; it makes the name unnecessarily longer.
I still think it's better to name it nemo-run_helper.py (still shorter than the speech_to_text_finetune.py name, for comparison). Or we could update our README.md to describe what each script does at a high level.
IMO the sweetest syntactic sugar here would be to register a CLI using setuptools like this under `nemo-run`; what it does is that, when you have a Python env active, it creates a bash command that points to the right Python function for you:
$ nemo-run my_script.py args
(note: lhotse uses `click` for its CLI, but the setuptools mechanism is agnostic to the actual CLI parser)
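A minimal sketch of the setuptools mechanism described above, assuming a hypothetical module `nemo_run_cli` with a `main` function; the module name, argument layout, and printed message are illustrative and not part of this PR. `argparse` is used here only to keep the example dependency-free.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser for a hypothetical `nemo-run <script> [overrides...]` command."""
    parser = argparse.ArgumentParser(prog="nemo-run")
    parser.add_argument("script", help="path to the script to launch")
    parser.add_argument("overrides", nargs="*", help="arguments passed to the script")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Entry point; setuptools generates a `nemo-run` executable that calls this."""
    args = build_parser().parse_args(argv)
    print(f"would launch {args.script} with {args.overrides}")
    return 0


# Registration, e.g. in setup.py (illustrative, assuming the module above):
#
#   setup(
#       ...,
#       entry_points={"console_scripts": ["nemo-run=nemo_run_cli:main"]},
#   )
```

With such an entry point installed in the active environment, `nemo-run my_script.py args` would resolve to `main()` without the user needing to know where the helper script lives.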
Is it possible to test this script somehow?
# recursively walk all values of the script_config, checking if its a path-like string and if so, check if the path is a mounted path
# if it is not, raise an error

if 'AIS_ENDPOINT' in os.environ:
nitpick: ais_endpoint = os.environ.get("AIS_ENDPOINT")
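The nitpick above amounts to replacing the membership test with a single lookup that yields `None` when the variable is unset. A small sketch, where `get_ais_endpoint` is a hypothetical wrapper added only to make the behaviour easy to see:

```python
import os
from typing import Mapping, Optional


def get_ais_endpoint(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the AIS endpoint, or None if AIS_ENDPOINT is not set."""
    return env.get("AIS_ENDPOINT")


ais_endpoint = get_ais_endpoint()
if ais_endpoint is not None:
    print(f"AIS endpoint configured: {ais_endpoint}")
```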
# Create the command to run the script
cmd = """
nvidia-smi && \
export PYTHONPATH=$PYTHONPATH:/nemo_run/code && \
export HF_TOKEN={HF_TOKEN} && \
export WANDB_API_KEY={WANDB} && \
find /results/ -name '*-unfinished' -type f -delete && \
is this really needed?
It was part of the scripts you sent; I think it's pretty safe to do as long as exp manager doesn't crash? I can remove it though.
Got it - hopefully we don't need it any longer after fixing checkpoint issues.
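For readers unfamiliar with the `find ... -delete` line under discussion: it removes every regular file under `/results/` whose name ends in `-unfinished`. A rough Python equivalent, given purely as an illustration (`delete_unfinished_markers` is a hypothetical name, not something in this PR):

```python
from pathlib import Path
from typing import List


def delete_unfinished_markers(results_dir: str) -> List[Path]:
    """Delete regular files named '*-unfinished' anywhere under results_dir."""
    removed: List[Path] = []
    # Materialize the matches first so deletion does not race the directory walk.
    for path in list(Path(results_dir).rglob("*-unfinished")):
        # Mirror `find -type f`: regular files only, no symlinks or directories.
        if path.is_file() and not path.is_symlink():
            path.unlink()
            removed.append(path)
    return removed
```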
@pzelasko I've added the AIS check, as well as a flag in the cluster config.
Signed-off-by: smajumdar <[email protected]>
Signed-off-by: titu1994 <[email protected]>
[🤖]: Hi @titu1994 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
Why is this created in the first place?
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import datetime
How are you planning to test NeMo Run for ASR? If as part of CI, can you add that too?
++mount_<anything>='/src:/dest'

Args:
cluster_cfg: Cluster config dictionary
Please add the keys of the dictionary here, along with their suggested values.
Actually, I'm wondering whether the usage of the tools in this PR deserves its own doc page or tutorial.
Why is this file being deleted?
# Get the execution script
cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)
Why pass `cluster_cfg` when `merged_config` already contains `cluster_cfg`?
# Copy the merged config file to remote location's /results/configs directory
config_dir = os.path.join(results_dir, 'configs')
run_utils.create_remote_config(merged_config, config_name, config_dir, cluster_cfg)
Same here: doesn't `merged_config` already contain `cluster_cfg`?
# Get the execution script
cmd = get_execution_script(cluster_script_path, config_name, merged_config, cluster_cfg)
Minor: maybe rename to `get_execution_script_cmd`?
Returns:
Task: The task object added to the NeMo Run experiment.
"""
# Checj if dependencies are provided
Suggested change:
- # Checj if dependencies are provided
+ # Check if dependencies are provided
run_utils.check_remote_mount_directories(unmounted_path, cluster_cfg)

elif (
check_ais_paths and "ais://" in v and ais_endpoint is not None
In case my earlier comment was missed: re-iterating that AIStore handles all kinds of URL schemes, so the condition that leads to the AIStore code branch here should:
- check whether the path is a URL/URI
- check whether `ais_endpoint is not None`
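The two-part condition requested above might look roughly like the sketch below. `should_check_with_ais` and `is_url` are hypothetical names; `check_ais_paths` and `ais_endpoint` mirror the variables in the diff excerpt, and the URL test uses only the standard library rather than any particular helper from the PR.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def is_url(value: str) -> bool:
    """True for strings with a scheme and a network location (ais://, s3://, ...)."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)


def should_check_with_ais(
    value: Any, ais_endpoint: Optional[str], check_ais_paths: bool = True
) -> bool:
    """Decide whether a config value should go down the AIStore branch."""
    return bool(
        check_ais_paths
        and ais_endpoint is not None
        and isinstance(value, str)
        and is_url(value)
    )
```

Compared with `"ais://" in v`, this does not depend on a specific prefix, so an `s3://` path is routed to the same branch whenever an endpoint is configured.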
ais_client.fetch_object_by_url(v).head()

except ImportError:
logging.warning("\nais module is not installed. Please install it to use ais paths.\n")
Suggested change:
- logging.warning("\nais module is not installed. Please install it to use ais paths.\n")
+ logging.warning("\nais module is not installed. Please 'pip install aistore' to use ais paths.\n")
Left a few more comments and responses; how can I run something using this as a final check before approving?
What does this PR do?
Adds NeMo Run support to ASR, and adds common utilities for Run to the common collection.
Collection: [ASR, Common]
Changelog
Usage
Local Execution
conf/run_local.yaml
Call run_helper.py
Cluster Execution
conf/run_slurm.yaml
IMPORTANT NOTE
NOTE: Be very careful with using `${}` syntax inside of your hydra overrides - it will try to resolve using your env variables if you use double quotes ("). If you want to provide "hydra placeholders" - use SINGLE QUOTES (') as shown below for `++name` and `++results_dir`.
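The quoting behaviour described in this note comes from the shell, not from Hydra, and can be demonstrated with a small sketch. It assumes a POSIX `sh` is available; the environment variable `name=from_env` and the override value `/results/${name}` are made up for illustration.

```python
import os
import subprocess


def as_seen_by_program(quoted_arg: str) -> str:
    """Pass `quoted_arg` (quotes included) through the shell and return the
    argument text that the launched program actually receives."""
    env = {"PATH": os.environ.get("PATH", ""), "name": "from_env"}
    result = subprocess.run(
        f"printf '%s' {quoted_arg}",
        shell=True,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Double quotes: the shell substitutes ${name} before Hydra ever sees it.
double = as_seen_by_program('"++results_dir=/results/${name}"')
# Single quotes: the placeholder reaches Hydra untouched.
single = as_seen_by_program("'++results_dir=/results/${name}'")

print(double)  # ++results_dir=/results/from_env
print(single)  # ++results_dir=/results/${name}
```

If `name` were unset in the environment, the double-quoted form would silently collapse to `++results_dir=/results/`, which is why single quotes are the safer default for placeholders.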
Call run_helper.py
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.