forked from openucx/ucc
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NEWS
212 lines (153 loc) · 6.84 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
/**
* @copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
* See file LICENSE for terms.
*/
## Current
## 1.2.0 (June 6th, 2023)
## New Features and Enhancements
## CL/HIER
- Fixed single proc on node issue in alltoall ([#658](https://github.com/openucx/ucc/pull/658))
- Implemented allreduce rab pipelined ([#608](https://github.com/openucx/ucc/pull/608))
- Added bcast 2step algorithm ([#620](https://github.com/openucx/ucc/pull/620))
- Fixed allreduce rab pipeline ([#759](https://github.com/openucx/ucc/pull/759))
## TL/CUDA
- Support for CUDA 12
- Fixed cache unmap issue ([#642](https://github.com/openucx/ucc/pull/642))
- Implemented reduce scatter linear ([#669](https://github.com/openucx/ucc/pull/669))
- Added algorithm selection based on topology ([#688](https://github.com/openucx/ucc/pull/688))
- Fixed linear algorithms ([#751](https://github.com/openucx/ucc/pull/751))
- Fixed pipelining in linear rs ([#770](https://github.com/openucx/ucc/pull/770))
## TL/UCP
- Added special service worker ([#560](https://github.com/openucx/ucc/pull/560))
- Added scatterv ([#663](https://github.com/openucx/ucc/pull/663))
- Added gatherv ([#664](https://github.com/openucx/ucc/pull/664))
- Fixed running with npolls 0 ([#695](https://github.com/openucx/ucc/pull/695))
- Added knomial allgather ([#729](https://github.com/openucx/ucc/pull/729))
- Fixed bug for triggered colls ([#757](https://github.com/openucx/ucc/pull/757))
- Added bruck alltoall ([#756](https://github.com/openucx/ucc/pull/756))
- Added SLOAV alltoallv ([#687](https://github.com/openucx/ucc/pull/687))
- Large message broadcast optimizations ([#738](https://github.com/openucx/ucc/pull/738))
- Ranks reordering in ring allgather for better locality([#69](https://github.com/openucx/ucc/pull/698))
## TL/SHARP
- Fixed memory type check in allreduce ([#662](https://github.com/openucx/ucc/pull/662))
- Added support for sharpv3 dt ([#661](https://github.com/openucx/ucc/pull/661))
- Fixed assert check ([#686](https://github.com/openucx/ucc/pull/686))
- Implemented SHARP OOB fixes ([#746](https://github.com/openucx/ucc/pull/746))
- Fixed local rank when NODE SBGP not enabled ([#760](https://github.com/openucx/ucc/pull/760))
- Prevented sharp team with team max ppn > 1 ([#761](https://github.com/openucx/ucc/pull/761))
## CORE
- Fixed memory type score update ([#650](https://github.com/openucx/ucc/pull/650))
- Fixed ucc parser build ([#666](https://github.com/openucx/ucc/pull/666))
- Implemented ucc_pipeline_params ([#675](https://github.com/openucx/ucc/pull/675))
- Changed log level of config_modify ([#667](https://github.com/openucx/ucc/pull/667))
- Fixed timeout handle for triggered post ([#679](https://github.com/openucx/ucc/pull/679))
## DOCS
- Added User Guide ([#720](https://github.com/openucx/ucc/pull/720))
## 1.1.0 (October 7th, 2022)
## Features
## API
- Added float 128 and float 32, 64, 128 (complex) data types
- Added Active Sets based collectives to support dynamic groups as well as
point-to-point messaging
- Added ucc_team_get_attr interface
## Core
- Config file support
- Fixed component search
## CL
- Added split rail allreduce collective implementation
- Enable hierarchical alltoallv and barrier
- Fixed cleanup bugs
## TL
- Added SELF TL supporting team size one
### UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
### SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
### GPU Collectives (CUDA, NCCL TL and RCCL TL)
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv
multiring in CUDA TL
- Added topo based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather, scatter and its vector variant
- Enable using multiple streams for collectives
- Added support for RCCL gather (v), scatter (v), broadcast, allgather (v),
barrier, alltoall (v) and all reduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
### Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
### Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
## 1.0.0 (April 19th, 2022)
### Features
#### API
- Added Avg reduce operation
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarify semantics of core abstractions including teams and context
- Added timeout option
#### Core
- Added coll scoring and selection support
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
#### CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
#### TL
- Added SHARP TL
##### UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Making allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
##### SHARP
- Added support for switch based hardware collectives (SHARP)
#### NCCL
- Add support for NCCL allreduce, alltoall, alltoallv, barrier, reduce, reduce
scatter, bcast, allgather and allgatherv
#### Tests
- Updated tests to test the newly added algorithms and operations
## 0.1.0 (TBD)
### Features
#### API
- UCC API to support library, contexts, teams, collective operations, execution
engine, memory types, and triggered operations
#### Core
- Added implementation for UCC abstractions - library, context, team,
collective operations, execution engine, memory types, and triggered
operations
- Added support for memory types - CUDA, and CPU
- Added support for configuring UCC library and contexts
#### CL
- Added support for collectives, while the source and destination is either in
CPU or device (GPU)
- Added support for UCC_THREAD_MULTIPLE
- Added support for CUDA stream-based collectives
#### TL
- Added support for send/receive based collectives using UCX/UCP as a transport
layer
- Support for basic collectives types including barrier, alltoall, alltoallv,
broadcast, allgather, allgatherv, allreduce was added in the UCP TL
- Added support using NCCL as a transport layer
- Support for collectives types including alltoall, alltoallv, allgather,
allgatherv, allreduce, and broadcast
#### Tests
- Added support for unit testing (gtest) infrastructure
- Added support for MPI tests