Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: multicast design: adding helper.h, progress.h, and mcast.h #838

Merged
merged 1 commit into from
Oct 25, 2023

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Sep 11, 2023

TL/MLX5: multicast design: adding helper.h, progress.h, and mcast.h

@MamziB MamziB requested a review from samnordmann September 11, 2023 22:00
@MamziB MamziB self-assigned this Sep 11, 2023
Copy link
Contributor

@Sergei-Lebedev Sergei-Lebedev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix CI

src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 12, 2023

@Sergei-Lebedev I fixed the comment. Thanks for the feedback. I see that all the errors in testing system comes from failure in building UCX, which has nothing to do with this PR. @manjugv FYI

core/ucp_mm.c:980:74: error: too few arguments to function call, expected 8, have 7
                          UCS_BIT(md_index), UCT_MD_MEM_ACCESS_ALL, &memh);
                                                                         ^
core/ucp_mm.inl:35:1: note: 'ucp_memh_get' declared here
ucp_memh_get(ucp_context_h context, void *address, size_t length,
^
1 error generated.
make[1]: *** [Makefile:1199: core/libucp_la-ucp_mm.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/tmp/ucx/src/ucp'
make: *** [Makefile:782: install-recursive] Error 1

@janjust
Copy link
Collaborator

janjust commented Sep 13, 2023

@Sergei-Lebedev I fixed the comment. Thanks for the feedback. I see that all the errors in testing system comes from failure in building UCX, which has nothing to do with this PR. @manjugv FYI

core/ucp_mm.c:980:74: error: too few arguments to function call, expected 8, have 7
                          UCS_BIT(md_index), UCT_MD_MEM_ACCESS_ALL, &memh);
                                                                         ^
core/ucp_mm.inl:35:1: note: 'ucp_memh_get' declared here
ucp_memh_get(ucp_context_h context, void *address, size_t length,
^
1 error generated.
make[1]: *** [Makefile:1199: core/libucp_la-ucp_mm.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/tmp/ucx/src/ucp'
make: *** [Makefile:782: install-recursive] Error 1

This was fixed in UCX @Sergei-Lebedev @manjugv I really think the UCC CI should build against UCX release branch, and not master, we faced the same issue briefly in OMPI.

@manjugv
Copy link
Contributor

manjugv commented Sep 13, 2023

@Sergei-Lebedev I fixed the comment. Thanks for the feedback. I see that all the errors in testing system comes from failure in building UCX, which has nothing to do with this PR. @manjugv FYI

core/ucp_mm.c:980:74: error: too few arguments to function call, expected 8, have 7
                          UCS_BIT(md_index), UCT_MD_MEM_ACCESS_ALL, &memh);
                                                                         ^
core/ucp_mm.inl:35:1: note: 'ucp_memh_get' declared here
ucp_memh_get(ucp_context_h context, void *address, size_t length,
^
1 error generated.
make[1]: *** [Makefile:1199: core/libucp_la-ucp_mm.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/tmp/ucx/src/ucp'
make: *** [Makefile:782: install-recursive] Error 1

This was fixed in UCX @Sergei-Lebedev @manjugv I really think the UCC CI should build against UCX release branch, and not master, we faced the same issue briefly in OMPI.

Regarding CI, let's convey that to the CI team. Between, there is a commit title error as well.

src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_progress.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_progress.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_progress.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_progress.h Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from 7813a10 to dc6721c Compare September 18, 2023 23:18
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 18, 2023

@samnordmann @Sergei-Lebedev Thank you so much guys for the constructive comments. I have resolved/responded to your comments and questions. Then I pushed a single commit.

@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch 2 times, most recently from 49d4720 to a3dc4aa Compare September 19, 2023 23:04
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 19, 2023

Thanks @samnordmann for the new comments. I have resolved them. I updated the commit. clang-tidy failure is not related to this PR:

dpkg: error processing archive /tmp/apt-dpkg-install-D3JNOM/89-rocthrust-dev_2.18.0.50700-63~20.04_amd64.deb (--unpack):
 cannot copy extracted data for './opt/rocm-5.7.0/include/thrust/detail/preprocessor.h' to '/opt/rocm-5.7.0/include/thrust/detail/preprocessor.h.dpkg-new': failed to write (No space left on device)
No apport report written because the error message indicates a disk full error
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package rocwmma-dev.
Preparing to unpack .../90-rocwmma-dev_1.2.0.50700-63~20.04_amd64.deb ...
Unpacking rocwmma-dev (1.2.0.50700-63~20.04) ...
Selecting previously unselected package rocm-hip-sdk.
Preparing to unpack .../91-rocm-hip-sdk_5.7.0.50700-63~20.04_amd64.deb ...
Unpacking rocm-hip-sdk (5.7.0.50700-63~20.04) ...
dpkg: error: failed to write status database record about 'libjson-perl' to '/var/lib/dpkg/status': No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)
Error: Process completed with exit code 100.

@samnordmann samnordmann self-requested a review September 20, 2023 08:47
@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from a3dc4aa to 3d4ff9e Compare September 21, 2023 04:02
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 21, 2023

@samnordmann Thanks for the detailed comments. I have resolved them, please take a look. I have also updated the commit.

@samnordmann samnordmann self-requested a review September 21, 2023 11:06
@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from 3d4ff9e to d078846 Compare September 21, 2023 17:21
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 21, 2023

@samnordmann Thank you for the new comments. Please take a look. I also updated the commit.

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 29, 2023

@manjugv any thoughts why ucc checks are failing?

ucc — Test FAILed. No test results found.

src/components/tl/mlx5/mcast/tl_mlx5_mcast_progress.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from 15a6025 to fa4c325 Compare October 4, 2023 19:33
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 4, 2023

@Sergei-Lebedev thanks for the constructive comments. I have resolved them and updated the PR's commit.

@MamziB MamziB requested a review from Sergei-Lebedev October 9, 2023 17:48
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from fa4c325 to ad190ff Compare October 18, 2023 21:54
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 18, 2023

@Sergei-Lebedev Thank you very much for the detailed and constructive comments. Please see my answers. I have also updated the commit. Thank you.

@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch 2 times, most recently from fd2231c to d8daca8 Compare October 18, 2023 23:01
@Sergei-Lebedev
Copy link
Contributor

@MamziB pls rebase

@MamziB MamziB force-pushed the mamzi/mcast-merge-5 branch from d8daca8 to c9cafbf Compare October 24, 2023 17:27
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 24, 2023

@Sergei-Lebedev Thanks for approving the PR. I have rebased the branch on the top of the laster master branch.

@Sergei-Lebedev Sergei-Lebedev merged commit 483b91b into openucx:master Oct 25, 2023
7 of 9 checks passed
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants