[wip] group op #1202
base: chao/xccl
Conversation
},
[](at::xpu::XPUStream&,
   c10::intrusive_ptr<ProcessGroupXCCL::WorkXCCL>&) {
  ccl::group_end();
I think groupStart/groupEnd wrap ccl::group_start/ccl::group_end. Should you then call the wrapped API here?
xcclActiveGroupCounter_ affects the batchP2P choice. Let's use the original API, as NCCL does.
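For context, a rough sketch of how such wrappers are commonly structured, mirroring ProcessGroupNCCL's groupStart/groupEnd; aside from ccl::group_start/ccl::group_end and xcclActiveGroupCounter_, the names below are assumptions for illustration, not this PR's implementation:

```cpp
// Sketch only: NCCL-style wrappers; the surrounding class layout is assumed.
void ProcessGroupXCCL::groupStart() {
  ccl::group_start();          // start batching subsequent collectives
  ++xcclActiveGroupCounter_;   // counter later consulted for the batchP2P path
}

void ProcessGroupXCCL::groupEnd() {
  ccl::group_end();            // submit the batched collectives
  --xcclActiveGroupCounter_;
}
```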
src/xccl/ProcessGroupXCCL.cpp
Outdated
  return true;
}

void check_xpu_single_tensor(
Should you follow the same naming format, e.g. checkSingleTensor?
modified
src/xccl/ProcessGroupXCCL.cpp
Outdated
    }
  }
}

int64_t check_xpu_tensors_same_device(const std::vector<at::Tensor>& tensors) {
Should you follow the same naming format, e.g. checkTensorOnSameDevice?
modified
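For reference, a minimal sketch of what a same-device check of this shape typically does; only the signature appears in the diff, the body below is an assumption:

```cpp
#include <ATen/core/Tensor.h>
#include <c10/util/Exception.h>
#include <vector>

// Sketch only: the signature comes from the diff, the body is assumed.
int64_t check_xpu_tensors_same_device(const std::vector<at::Tensor>& tensors) {
  TORCH_CHECK(!tensors.empty(), "Tensor list must be nonempty");
  const auto device = tensors.front().device();
  for (const auto& t : tensors) {
    TORCH_CHECK(t.is_xpu() && t.device() == device,
        "Tensors must all be on the same XPU device");
  }
  return device.index();
}
```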
@@ -62,6 +109,10 @@ ccl::reduction getXcclReduceOp(const ReduceOp& reduceOp, at::Tensor& input) {
    // Map sum to max for bool tensors to avoid overflow issues with sum.
    return ccl::reduction::max;
  }
  // WA due to oneCCL not support AVG
  if (reduceOp == ReduceOp::AVG) {
The WA does not mean simply replacing avg with sum; it uses a sum collective plus a div SYCL kernel to simulate avg. Please update your comment.
modified
Please also add a comment that oneCCL is expected to support avg in the Basekit 2025.2 release.
done
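To make the intended workaround concrete, here is an illustrative sketch (helper and member names are assumptions): the reduction is issued as SUM, and a follow-up division on the same stream simulates AVG until oneCCL adds native support (expected in the Basekit 2025.2 release).

```cpp
// Illustrative sketch of the AVG workaround; names are assumptions.
// oneCCL has no native AVG yet (expected in the Basekit 2025.2 release),
// so the collective itself is issued as a SUM...
ccl::reduction xcclOp = ccl::reduction::sum;
// ... ccl::allreduce(..., xcclOp, ...) runs here ...
// ...and a separate division on the same stream turns the SUM into an AVG.
if (reduceOp == ReduceOp::AVG) {
  output.div_(static_cast<double>(size_));  // size_ = world size (assumed name)
}
```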
src/xccl/ProcessGroupXCCL.cpp
Outdated
@@ -31,22 +31,69 @@ const std::map<at::ScalarType, ccl::datatype> xcclDatatypes = {
    {at::kFloat8_e5m2fnuz, ccl::datatype::uint8},
};

void checkXPUTensor(at::Tensor& tensor) {
bool check_same_size(const std::vector<at::Tensor>& input_tensors) {
Please refine the API name.
done