Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds explicit requirements to ensure correct dependencies are installed #9890

Closed

Conversation

richardkmichael
Copy link

What does this PR do ?

Numpy 1.x is required, but 2 (latest) will be installed if it is not present. IPython is also required, but not listed as a requirement. Explicit requirements also make it possible to use pip install nemo_toolkit without the entire Anaconda distribution installed.

This change is similar to #9805 , which upper-bounded huggingface_hub.

Also include a few trivial typos in the tutorial notebooks.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@@ -2,6 +2,7 @@ braceexpand
editdistance
einops
g2p_en
IPython
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this one. IPython is optional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is. I recently got this error on main after a clean install:

Traceback (most recent call last):
  File "/home/dgalvez/scratch/code/asr/open_asr_leaderboard/nemo_asr/run_eval.py", line 12, in <module>
    from nemo.collections.asr.models import ASRModel
  File "/home/dgalvez/scratch/miniconda3/envs/open_asr_leaderboard/lib/python3.10/site-packages/nemo/collections/asr/__init__.py", line 15, in <module>
    from nemo.collections.asr import data, losses, models, modules
  File "/home/dgalvez/scratch/miniconda3/envs/open_asr_leaderboard/lib/python3.10/site-packages/nemo/collections/asr/models/__init__.py", line 22, in <module>
    from nemo.collections.asr.models.clustering_diarizer import ClusteringDiarizer
  File "/home/dgalvez/scratch/miniconda3/envs/open_asr_leaderboard/lib/python3.10/site-packages/nemo/collections/asr/models/clustering_diarizer.py", line 43, in <module>
    from nemo.collections.asr.parts.utils.vad_utils import (
  File "/home/dgalvez/scratch/miniconda3/envs/open_asr_leaderboard/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/vad_utils.py", line 26, in <module>
    import IPython.display as ipd
ModuleNotFoundError: No module named 'IPython'

We are unconditionally importing IPython in vad_utils.py. IPython is not intended as a required dependency, but in practice it currently is.

Copy link
Author

@richardkmichael richardkmichael Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. I did consider leaving out IPython, assuming it was intended to be optional. However, I wanted to point this out and discuss it, so I left it in.

$ python -c 'from nemo.collections.asr.models.clustering_diarizer import ClusteringDiarizer'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/code/nemo/t/lib/python3.10/site-packages/nemo/collections/asr/__init__.py", line 15, in <module>
    from nemo.collections.asr import data, losses, models, modules
  File "/code/nemo/t/lib/python3.10/site-packages/nemo/collections/asr/models/__init__.py", line 22, in <module>
    from nemo.collections.asr.models.clustering_diarizer import ClusteringDiarizer
  File "/code/nemo/t/lib/python3.10/site-packages/nemo/collections/asr/models/clustering_diarizer.py", line 43, in <module>
    from nemo.collections.asr.parts.utils.vad_utils import (
  File "/code/nemo/t/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/vad_utils.py", line 985, in <module>
    ) -> ipd.Audio:
NameError: name 'ipd' is not defined. Did you mean: 'pd'?

In vad_utils.py, there are two plot-related functions which make use of IPython, as return values and in one case, as a declared return type in the function signature. plot() and plot_sample_from_rttm().

I took a quick look..

Those functions not used in many locations. plot_sample_from_rttm() only once. plot() is a bit harder to track down since its name matches plotlib.plot().

$ grep -I -l -r -F plot_sample_from_rttm .
./nemo/collections/asr/parts/utils/vad_utils.py
./examples/asr/speech_classification/README.md
./scripts/speaker_tasks/multispeaker_data_analysis.py

I have never created a Python package, nor dealt with organizing conditional imports.

I wondered about moving those two functions to a new file, vad_utils_visualization.py or ..._plotting.py, and changing the usage sites. This could remove the import in vad_utils.py.

I also wondered about importing inside the functions. That would mean removing the ipd use in the return type signature, and it will still blow up, since IPython isn't a stated requirement for pip.

But, users still need to be able to install IPython if they want to run notebooks or scripts, so it must be declared somewhere for pip. Would a new requirements group solve this? e.g., pip install nemo_toolkit[extras]?

If you offer some guidance, I could make the changes on this PR. Or, I can drop the commit for now, to merge and deal with it in another PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericharper I took out IPython, so this can go into rc1 if you like. Let me know about reorganizing the code, and I can take a stab in another PR.

@richardkmichael
Copy link
Author

richardkmichael commented Aug 5, 2024

There was a another comment asking about the numpy < 2 requirement. It's due to the use of numpy.sctypes() which has been removed from NumPy 2. (As have the scalar type predicates, issctype() and issubsctype().)

This occurs inside the ClusteringDiarizer:

  File "/code/nemo/t/lib/python3.10/site-packages/nemo/collections/asr/parts/preprocessing/segment.py", line 258, in _convert_samples_to_float32
    if samples.dtype in np.sctypes['int']:
  File "/code/nemo/t/lib/python3.10/site-packages/numpy/__init__.py", line 397, in __getattr__
    raise AttributeError(
AttributeError: `np.sctypes` was removed in the NumPy 2.0 release. Access dtypes explicitly instead.. Did you mean: 'dtypes'?

It does not appear to be used in many locations:

$ grep -r 'in .*\.sctypes' .
./nemo/collections/asr/parts/preprocessing/feature_loader.py:        if samples.dtype in np.sctypes['int']:
./nemo/collections/asr/parts/preprocessing/feature_loader.py:        elif samples.dtype in np.sctypes['float']:
./nemo/collections/asr/parts/preprocessing/segment.py:        if samples.dtype in np.sctypes['int']:
./nemo/collections/asr/parts/preprocessing/segment.py:        elif samples.dtype in np.sctypes['float']:

I experimented, and I believe this could be changed to use issubdtype() which is present in both NumPy 1.x and 2. I could look more closely at NumPy 2 in another PR. It might be nice to start moving NeMo toward it, as it seems NumPy 2 has significant improvements. LMK.

Aside from NeMo code, NumPy 2 will at least need Tensorboard >= 2.18, for its update to NumPy 2 (done but not released).

Thanks for the feedback on this PR!

Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Aug 20, 2024
Copy link
Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants