Add StragglerDetection and FTlauncher to NeMo2.0 #11117

Closed
wants to merge 38 commits into from

Commits on Nov 15, 2024

  1. init draft

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    0e8dd86
  2. Fix FaultTolerancePlugin

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    58b1d42
  3. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    e350f03
  4. Add StragglerDetection callback to all NeMo2.0 recipes

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    3fd001b
  5. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    f0ea714
  6. Add missing and remove unused imports

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    a637494
  7. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    849872d
  8. Add ft launcher test

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    1af154f
  9. fix typo

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    874a879
  10. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    8f4a22b
  11. fix more typos

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    6c5df44
  12. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    4f13a42
  13. add ft launcher using nemo-run for llama3 test

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    6f80cce
  14. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    e3d29e9
  15. fix serialization errors

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    140ebbf
  16. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    1f99d40
  17. create separate ft test

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    160def9
  18. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    f79d8a6
  19. change github actions test

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    d0aa5d9
  20. draft crash simulation

    Signed-off-by: Shriya Balaji Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    8508edb
  21. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    16cebf3
  22. Simulate a crash using step, disable checkpointing

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    42ca51b
  23. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    dc11d21
  24. Add a straggler detection test as well

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    9519c3c
  25. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    8829909
  26. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    00aa2e1
  27. Revert enabling straggler_detection by default in all recipes

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    0aaa9e0
  28. Remove unused imports

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    3fdcdce
  29. Remove extra check in ConfigValidationPlugin

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    b26bd70
  30. Address pylinter issues

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    98a533f
  31. Improve straggler detection testing and add doc string

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    166048f
  32. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    3d0a03d
  33. fix paths

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    bff35c4
  34. Add assert for crash

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    e6bbb27
  35. Apply isort and black reformatting

    Signed-off-by: ShriyaPalsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    ba065bd
  36. Append run logs to a file after a crash

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 15, 2024
    7f27da9

Commits on Nov 18, 2024

  1. Set FAULT_TOL_FINISHED_FLAG_FILE and FAULT_TOL_CFG_PATH

    Signed-off-by: Shriya Palsamudram <[email protected]>
    ShriyaPalsamudram committed Nov 18, 2024
    6c0857f

Commits on Nov 19, 2024

  1. Merge branch 'main' into shriya/resiliency

    Signed-off-by: Shriya Rishab <[email protected]>
    ShriyaPalsamudram authored Nov 19, 2024
    fd55915