Releases · openucx/ucc
1.4.4
New Features and Enhancements
Core
- Implemented asymmetric memory support {PR #1000}
- Enhanced error handling and resource cleanup {PR #960, #951}
- Improved service team handling {PR #1046}
- Fixed triggered post for zero size collectives {PR #960}
CL/HIER
- Added allgatherv support {PR #1111}
- Implemented node subgroup unpacking {PR #1103}
- Added reduce to supported collectives {PR #997}
- Fixed integer overflow in alltoall {PR #944}
TL/UCP
- Split single and multithreaded send/receive operations {PR #1109}
- Added knomial allgather with CUDA memory support {PR #1095}
- Implemented reduce SRG knomial algorithm {PR #1058}
- Added radix selection to knomial operations {PR #1072}
- Added sliding window allreduce implementation {PR #958}
- Added knomial allgatherv support {PR #1008}
- Added sparbit algorithm for allgather {PR #940}
- Extended broadcast active set support for size > 2 {PR #926}
- Added knomial algorithm for reduce-scatter {PR #970} (algorithm selection sketch after this list)
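Note: the algorithms above are selectable at runtime through UCC's per-TL tuning variables. A minimal sketch, assuming the "collective:@algorithm" token format of UCC_TL_UCP_TUNE and an allgather token named knomial (the token name is an assumption; ucc_info -A lists the algorithms registered in a given build):

    /* select a TL/UCP allgather algorithm before UCC is initialized */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* must be set before ucc_lib_config_read()/ucc_init() runs */
        setenv("UCC_TL_UCP_TUNE", "allgather:@knomial", 1);
        printf("UCC_TL_UCP_TUNE=%s\n", getenv("UCC_TL_UCP_TUNE"));
        /* ...continue with the usual lib/context/team setup... */
        return 0;
    }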
TL/MLX5
- Added multicast-based zero-copy broadcast {PR #1087}
- Implemented mcast multi-group support {PR #1060}
- Added non-blocking CUDA memory copy support {PR #1040}
- Added device memory multicast broadcast {PR #989}
- Enhanced mcast allgather staging-based algorithm {PR #994}
- Improved one-sided mcast reliability initialization {PR #980}
- Various performance optimizations in alltoall {PR #1067}
- Fixed fences in alltoall WQEs {PR #1069}
- Added context option to disable alltoall operations {PR #1062}
- Improved error handling and device checks {PR #1102}
- Disabled mcast for thread multiple mode {PR #961}
TL/SHARP
- Added support for allgather operation {PR #1081}
- Enabled reduce-scatter with SAT support {PR #1084}
- Added SHARP multi-channel support {PR #1049}
- Fixed service team OOB handling {PR #1001}
- Improved internal OOB usage {PR #986}
CUDA
- Added linear broadcast implementation {PR #948}
- Batched CUDA stream memory operations, reducing CPU and GPU execution overhead {PR #1093}
- Enhanced error handling for CUDA context operations {PR #1025}
- Fixed context cleanup in CUDA operations {PR #954}
Build and Test
- Added support for specific GPU architectures with ROCm {PR #987}
- Added UCC pkg-config support {PR #1036} (usage sketch after this list)
- Fixed build compatibility with NVC compiler {PR #1052}
- Enhanced config parser functionality {PR #1092}
- Enhanced ASAN/LSAN memory leak detection {PR #1074}
- Added error checking and exit handling in gtests {PR #1083}
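Note: with pkg-config support, downstream builds can take UCC's compile and link flags from pkg-config instead of hard-coding install paths. A minimal sketch, assuming the installed package is named ucc (check the name of the generated .pc file):

    /* build: cc version.c $(pkg-config --cflags --libs ucc) */
    #include <stdio.h>
    #include <ucc/api/ucc.h>

    int main(void)
    {
        /* ucc_get_version_string() reports the linked library version */
        printf("linked against UCC %s\n", ucc_get_version_string());
        return 0;
    }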
Documentation
- Updated README with UCC publication information {PR #1028}
- Added DOCA_UROM documentation {PR #999}
- Fixed Doxygen documentation issues {PR #1038}
- Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
- Implemented new DOCA UROM plugin {PR #978}
- Added support for offloading collective operations to DPUs
- Implemented allreduce collective
1.3.0 (April 18, 2024)
New Features and Enhancements
CL/HIER
- Disabled one-sided alltoallv {PR #875}
TL/CUDA
- Initialized remote CUDA scratch to NULL {PR #911}
TL/UCP
- Enabled hybrid alltoallv {PR #781}
- Avoided a copy in knomial scatter {PR #771}
- Enabled rank reordering for reduce-scatter, knomial allreduce, and ring allgather/allgatherv {PR #819}
- Removed memcpy in the last SRA step {PR #743}
- Fixed sparse pack in hybrid alltoallv {PR #825}
- Fixed recycling in hybrid alltoallv {PR #827}
- Reordered ranks for SRA {PR #834}
- Used ring allgather when rank reordering is needed {PR #879}
- Used pipelining in SRA allreduce for CUDA {PR #873}
- Polled for one-sided alltoall completion {PR #876}
- Added support for non-host buffers in Bruck alltoall {PR #852}
- Added neighbor exchange allgather {PR #822}
TL/SHARP
- Enabled bcast for any predefined datatype {PR #774}
- Suppressed the team-create error message {PR #777}
- Added a check that the data size is supported {PR #776}
- Fixed SHARP context cleanup {PR #843}
API
- Removed duplicate get_version_string {PR #933}
TL/NCCL
- Made team init non-blocking {PR #772}
- Added CUDA managed memory to score {PR #793}
- Made ncclGroupEnd non-blocking {PR #798}
- Made NCCL communicator initialization lazy {PR #851}
TL/MLX5
- Shared ib_ctx and pd {PR #749}
- Added registration cache (rcache) support {PR #753}
- Added device memory and topology initialization {PR #780}
- Added mcast interface {PR #784}
- Added alltoall collective init (part 1) {PR #790}
- Added the full alltoall collective (part 2) {PR #802}
- Reworked team and context init {PR #815}
- Fixed context create hang {PR #887}
- Added librdmacm linkage {PR #910}
CORE
- Fixed score update when only the score is given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fixed error handling in the context create epilog {PR #818}
- Skipped zero-size collectives {PR #787}
DOCS
- Updated NEWS for v1.2 {PR #791}
UCC v1.2.0
This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:
New Features and Enhancements
CL/HIER
- Fixed single-process-per-node issue in alltoall (#658)
- Implemented pipelined RAB allreduce (#608)
- Added bcast 2step algorithm (#620)
- Fixed RAB allreduce pipeline (#759)
TL/CUDA
- Added support for CUDA 12
- Fixed cache unmap issue (#642)
- Implemented reduce scatter linear (#669)
- Added algorithm selection based on topology (#688)
- Fixed linear algorithms (#751)
- Fixed pipelining in linear reduce-scatter (#770)
TL/UCP
- Added special service worker (#560)
- Added scatterv (#663)
- Added gatherv (#664)
- Fixed running with npolls set to 0 (#695)
- Added knomial allgather (#729)
- Fixed bug for triggered colls (#757)
- Added Bruck alltoall (#756)
- Added SLOAV alltoallv (#687)
- Large message broadcast optimizations (#738)
- Reordered ranks in ring allgather for better locality (#69)
TL/SHARP
- Fixed memory type check in allreduce (#662)
- Added support for SHARPv3 datatypes (#661)
- Fixed assert check (#686)
- Implemented SHARP OOB fixes (#746)
- Fixed local rank when NODE SBGP not enabled (#760)
- Prevented SHARP team creation when team max ppn > 1 (#761)
CORE
- Fixed memory type score update (#650)
- Fixed ucc parser build (#666)
- Implemented ucc_pipeline_params (#675)
- Changed log level of config_modify (#667)
- Fixed timeout handling for triggered post (#679)
DOCS
- Added User Guide (#720)
UCC Version 1.1.0
Features
API
- Added float128 and complex float32/64/128 data types
- Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging
- Added the ucc_team_get_attr interface (see the sketch after this list)
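Note: a minimal sketch of the new ucc_team_get_attr interface. The call itself is part of the public API; the endpoint attribute names used here (UCC_TEAM_ATTR_FIELD_EP, attr.ep) are recalled from the v1.1.0-era ucc.h and should be verified against the installed header:

    #include <stdio.h>
    #include <ucc/api/ucc.h>

    /* team comes from the usual ucc_team_create_post/_test sequence */
    static void print_my_ep(ucc_team_h team)
    {
        ucc_team_attr_t attr;

        attr.mask = UCC_TEAM_ATTR_FIELD_EP;  /* select which field to query */
        if (ucc_team_get_attr(team, &attr) == UCC_OK) {
            printf("endpoint in team: %llu\n", (unsigned long long)attr.ep);
        }
    }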
Core
- Added config file support (see the sketch after this list)
- Fixed component search
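Note: a minimal sketch of what config file support enables, assuming UCC follows the UCX-style one-VAR=value-per-line format (the file name and lookup path are assumptions; UCC_TLS and UCC_LOG_LEVEL are existing UCC options):

    # ucc.conf -- any UCC_* option can be set here instead of the environment
    UCC_TLS=ucp,nccl
    UCC_LOG_LEVEL=warn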
CL
- Added split rail allreduce collective implementation
- Enabled hierarchical alltoallv and barrier
- Fixed cleanup bugs
TL
- Added SELF TL supporting team size one
UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
GPU Collectives (CUDA, NCCL TL and RCCL TL)
- Added RCCL TL to support RCCL collectives
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv in CUDA TL
- Added topology-based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather and scatter, including their vector variants
- Enabled using multiple streams for collectives
- Added support for RCCL gather(v), scatter(v), broadcast, allgather(v), barrier, alltoall(v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
Unified Collective Communication, Version 1.0.0
Features
API
- Added Avg reduce operation (see the sketch after this list)
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarified semantics of core abstractions, including teams and contexts
- Added timeout option
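Note: a minimal sketch of the new AVG reduction, run through the standard init/post/test/finalize collective flow. The ucc_coll_args_t field layout follows the 1.x API; the timeout option is configured through additional ucc_coll_args fields whose exact names should be taken from ucc.h rather than this sketch:

    #include <ucc/api/ucc.h>

    /* average `count` floats across a team; ctx/team come from normal setup */
    static ucc_status_t avg_allreduce(ucc_context_h ctx, ucc_team_h team,
                                      float *sbuf, float *rbuf, size_t count)
    {
        ucc_coll_req_h  req;
        ucc_coll_args_t args = {0};
        ucc_status_t    st;

        args.coll_type         = UCC_COLL_TYPE_ALLREDUCE;
        args.op                = UCC_OP_AVG;          /* new in v1.0.0 */
        args.src.info.buffer   = sbuf;
        args.src.info.count    = count;
        args.src.info.datatype = UCC_DT_FLOAT32;
        args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
        args.dst.info          = args.src.info;
        args.dst.info.buffer   = rbuf;

        st = ucc_collective_init(&args, &req, team);
        if (st != UCC_OK) {
            return st;
        }
        ucc_collective_post(req);
        while (ucc_collective_test(req) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);  /* drive progress until completion */
        }
        return ucc_collective_finalize(req);
    }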
Core
- Added coll scoring and selection support
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
TL
- Added SHARP TL
UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Made allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
SHARP
- Added support for switch-based hardware collectives (SHARP)
NCCL
- Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce, reduce-scatter, bcast, allgather, and allgatherv
Tests
- Updated tests to test the newly added algorithms and operations