Releases · openucx/ucc
1.4.4
New Features and Enhancements
Core
- Implemented asymmetric memory support {PR #1000}
- Enhanced error handling and resource cleanup {PR #960, #951}
- Improved service team handling {PR #1046}
- Fixed triggered post for zero size collectives {PR #960}
CL/HIER
- Added allgatherv support {PR #1111}
- Implemented node subgroup unpacking {PR #1103}
- Added reduce to supported collectives {PR #997}
- Fixed integer overflow in alltoall {PR #944}
TL/UCP
- Split single and multithreaded send/receive operations {PR #1109}
- Added knomial allgather with CUDA memory support {PR #1095}
- Implemented reduce SRG knomial algorithm {PR #1058}
- Added radix selection to knomial operations {PR #1072}
- Added sliding window allreduce implementation {PR #958}
- Added knomial allgatherv support {PR #1008}
- Added sparbit algorithm for allgather {PR #940}
- Extended broadcast active set support for size > 2 {PR #926}
- Added knomial algorithm for reduce-scatter {PR #970} (algorithm selection sketch after this list)
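Note: the algorithms above are selectable at runtime through UCC's per-TL tuning variables. A minimal sketch, assuming the "collective:@algorithm" token format of UCC_TL_UCP_TUNE and an allgather token named knomial (the token name is an assumption; ucc_info -A lists the algorithms registered in a given build):

    /* select a TL/UCP allgather algorithm before UCC is initialized */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* must be set before ucc_lib_config_read()/ucc_init() runs */
        setenv("UCC_TL_UCP_TUNE", "allgather:@knomial", 1);
        printf("UCC_TL_UCP_TUNE=%s\n", getenv("UCC_TL_UCP_TUNE"));
        /* ...continue with the usual lib/context/team setup... */
        return 0;
    }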
TL/MLX5
- Added multicast-based zero-copy broadcast {PR #1087}
- Implemented mcast multi-group support {PR #1060}
- Added non-blocking CUDA memory copy support {PR #1040}
- Added device memory multicast broadcast {PR #989}
- Enhanced mcast allgather staging-based algorithm {PR #994}
- Improved one-sided mcast reliability initialization {PR #980}
- Various performance optimizations in alltoall {PR #1067}
- Fixed fences in alltoall WQEs {PR #1069}
- Added context option to disable alltoall operations {PR #1062}
- Improved error handling and device checks {PR #1102}
- Disabled mcast for thread multiple mode {PR #961}
TL/SHARP
- Added support for allgather operation {PR #1081}
- Enabled reduce-scatter with SAT support {PR #1084}
- Added SHARP multi-channel support {PR #1049}
- Fixed service team OOB handling {PR #1001}
- Improved internal OOB usage {PR #986}
CUDA
- Added linear broadcast implementation {PR #948}
- Batched CUDA stream memory operations, reducing CPU and GPU execution overhead {PR #1093}
- Enhanced error handling for CUDA context operations {PR #1025}
- Fixed context cleanup in CUDA operations {PR #954}
Build and Test
- Added support for specific GPU architectures with ROCm {PR #987}
- Added UCC pkg-config support {PR #1036} (usage sketch after this list)
- Fixed build compatibility with NVC compiler {PR #1052}
- Enhanced config parser functionality {PR #1092}
- Enhanced ASAN/LSAN memory leak detection {PR #1074}
- Added error checking and exit handling in gtests {PR #1083}
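Note: with pkg-config support, downstream builds can take UCC's compile and link flags from pkg-config instead of hard-coding install paths. A minimal sketch, assuming the installed package is named ucc (check the name of the generated .pc file):

    /* build: cc version.c $(pkg-config --cflags --libs ucc) */
    #include <stdio.h>
    #include <ucc/api/ucc.h>

    int main(void)
    {
        /* ucc_get_version_string() reports the linked library version */
        printf("linked against UCC %s\n", ucc_get_version_string());
        return 0;
    }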
Documentation
- Updated README with UCC publication information {PR #1028}
- Added DOCA_UROM documentation {PR #999}
- Fixed Doxygen documentation issues {PR #1038}
- Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
- Implemented new DOCA UROM plugin {PR #978}
- Added support for offloading collective operations to DPUs
- Implemented allreduce collective
1.3.0 (April 18, 2024)
New Features and Enhancements
CL/HIER
- Disabled one-sided alltoallv {PR #875}
TL/CUDA
- Initialized remote CUDA scratch to NULL {PR #911}
TL/UCP
- Enabled hybrid alltoallv {PR #781}
- Avoided a copy in knomial scatter {PR #771}
- Enabled rank reordering for reduce-scatter, knomial allreduce, and ring allgather/allgatherv {PR #819}
- Removed memcpy in the last SRA step {PR #743}
- Fixed sparse pack in hybrid alltoallv {PR #825}
- Fixed recycling in hybrid alltoallv {PR #827}
- Reordered ranks for SRA {PR #834}
- Used ring allgather when rank reordering is needed {PR #879}
- Used pipelining in SRA allreduce for CUDA {PR #873}
- Polled for one-sided alltoall completion {PR #876}
- Added support for non-host buffers in Bruck alltoall {PR #852}
- Added neighbor exchange allgather {PR #822}
TL/SHARP
- Enabled bcast for any predefined datatype {PR #774}
- Suppressed the team-create error message {PR #777}
- Added a check that the data size is supported {PR #776}
- Fixed SHARP context cleanup {PR #843}
API
- Removed duplicate get_version_string {PR #933}
TL/NCCL
- Made team init non-blocking {PR #772}
- Added CUDA managed memory to score {PR #793}
- Made ncclGroupEnd non-blocking {PR #798}
- Made NCCL communicator initialization lazy {PR #851}
TL/MLX5
- Shared ib_ctx and pd {PR #749}
- Added registration cache (rcache) support {PR #753}
- Added device memory and topology initialization {PR #780}
- Added mcast interface {PR #784}
- Added alltoall collective init (part 1) {PR #790}
- Added the full alltoall collective (part 2) {PR #802}
- Reworked team and context init {PR #815}
- Fixed context create hang {PR #887}
- Added librdmacm linkage {PR #910}
CORE
- Fixed score update when only the score is given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fixed error handling in the context create epilog {PR #818}
- Skipped zero-size collectives {PR #787}
DOCS
- Updated NEWS for v1.2 {PR #791}
UCC v1.2.0
This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:
New Features and Enhancements
CL/HIER
- Fixed single-process-per-node issue in alltoall (#658)
- Implemented pipelined RAB allreduce (#608)
- Added bcast 2step algorithm (#620)
- Fixed RAB allreduce pipeline (#759)
TL/CUDA
- Added support for CUDA 12
- Fixed cache unmap issue (#642)
- Implemented reduce scatter linear (#669)
- Added algorithm selection based on topology (#688)
- Fixed linear algorithms (#751)
- Fixed pipelining in linear reduce-scatter (#770)
TL/UCP
- Added special service worker (#560)
- Added scatterv (#663)
- Added gatherv (#664)
- Fixed running with npolls set to 0 (#695)
- Added knomial allgather (#729)
- Fixed bug for triggered colls (#757)
- Added Bruck alltoall (#756)
- Added SLOAV alltoallv (#687)
- Large message broadcast optimizations (#738)
- Reordered ranks in ring allgather for better locality (#69)
TL/SHARP
- Fixed memory type check in allreduce (#662)
- Added support for SHARPv3 datatypes (#661)
- Fixed assert check (#686)
- Implemented SHARP OOB fixes (#746)
- Fixed local rank when NODE SBGP not enabled (#760)
- Prevented SHARP team creation when team max ppn > 1 (#761)
CORE
- Fixed memory type score update (#650)
- Fixed ucc parser build (#666)
- Implemented ucc_pipeline_params (#675)
- Changed log level of config_modify (#667)
- Fixed timeout handling for triggered post (#679)
DOCS
- Added User Guide (#720)
UCC Version 1.1.0
Features
API
- Added float128 and complex float32/64/128 data types
- Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging
- Added the ucc_team_get_attr interface (see the sketch after this list)
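Note: a minimal sketch of the new ucc_team_get_attr interface. The call itself is part of the public API; the endpoint attribute names used here (UCC_TEAM_ATTR_FIELD_EP, attr.ep) are recalled from the v1.1.0-era ucc.h and should be verified against the installed header:

    #include <stdio.h>
    #include <ucc/api/ucc.h>

    /* team comes from the usual ucc_team_create_post/_test sequence */
    static void print_my_ep(ucc_team_h team)
    {
        ucc_team_attr_t attr;

        attr.mask = UCC_TEAM_ATTR_FIELD_EP;  /* select which field to query */
        if (ucc_team_get_attr(team, &attr) == UCC_OK) {
            printf("endpoint in team: %llu\n", (unsigned long long)attr.ep);
        }
    }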
Core
- Added config file support (see the sketch after this list)
- Fixed component search
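Note: a minimal sketch of what config file support enables, assuming UCC follows the UCX-style one-VAR=value-per-line format (the file name and lookup path are assumptions; UCC_TLS and UCC_LOG_LEVEL are existing UCC options):

    # ucc.conf -- any UCC_* option can be set here instead of the environment
    UCC_TLS=ucp,nccl
    UCC_LOG_LEVEL=warn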
CL
- Added split rail allreduce collective implementation
- Enabled hierarchical alltoallv and barrier
- Fixed cleanup bugs
TL
- Added SELF TL supporting team size one
UCP
- Added service broadcast
- Added reduce_scatterv ring algorithm
- Added k-nomial based gather collective implementation
- Added one-sided get based algorithms
SHARP
- Fixed SHARP OOB
- Added SHARP broadcast
GPU Collectives (CUDA, NCCL TL and RCCL TL)
- Added RCCL TL to support RCCL collectives
- Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
- Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv in CUDA TL
- Added topology-based ring construction in CUDA TL to maximize bandwidth
- Added NCCL gather and scatter, including their vector variants
- Enabled using multiple streams for collectives
- Added support for RCCL gather(v), scatter(v), broadcast, allgather(v), barrier, alltoall(v), and allreduce collectives
- Added ROCm memory component
- Adapted all GPU collectives to executor design
Tests
- Added tests for triggered collectives in perftests
- Fixed bugs in multi-threading tests
Utils
- Added CPU model and vendor detection
- Several bug fixes in all components
Unified Collective Communication, Version 1.0.0
Features
API
- Added Avg reduce operation (see the sketch after this list)
- Added nonblocking team destroy option
- Added user-defined datatype definitions
- Added Bfloat16 type
- Clarified semantics of core abstractions, including teams and contexts
- Added timeout option
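Note: a minimal sketch of the new AVG reduction, run through the standard init/post/test/finalize collective flow. The ucc_coll_args_t field layout follows the 1.x API; the timeout option is configured through additional ucc_coll_args fields whose exact names should be taken from ucc.h rather than this sketch:

    #include <ucc/api/ucc.h>

    /* average `count` floats across a team; ctx/team come from normal setup */
    static ucc_status_t avg_allreduce(ucc_context_h ctx, ucc_team_h team,
                                      float *sbuf, float *rbuf, size_t count)
    {
        ucc_coll_req_h  req;
        ucc_coll_args_t args = {0};
        ucc_status_t    st;

        args.coll_type         = UCC_COLL_TYPE_ALLREDUCE;
        args.op                = UCC_OP_AVG;          /* new in v1.0.0 */
        args.src.info.buffer   = sbuf;
        args.src.info.count    = count;
        args.src.info.datatype = UCC_DT_FLOAT32;
        args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
        args.dst.info          = args.src.info;
        args.dst.info.buffer   = rbuf;

        st = ucc_collective_init(&args, &req, team);
        if (st != UCC_OK) {
            return st;
        }
        ucc_collective_post(req);
        while (ucc_collective_test(req) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);  /* drive progress until completion */
        }
        return ucc_collective_finalize(req);
    }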
Core
- Added coll scoring and selection support
- Added support for Triggered collectives
- Added support for timeouts in collectives
- Added support for team create without ep in post
- Added support for multithreaded context progress
- Added support for nonblocking team destroy
CL
- Added support for hierarchical collectives
- Added support for hierarchical allreduce collective operation
- Added support for collectives based on one-sided communication routines
TL
- Added SHARP TL
UCP
- Added Bcast SAG algorithm for large messages
- Added Knomial based reduce algorithm
- Made allgather and alltoall agree with the API
- Added SRA knomial allreduce algorithm
- Added pairwise alltoall and alltoallv algorithms
- Added allgather and allgatherv ring algorithms
- Added support for collective operations based on one-sided semantics
- Added support for alltoall with one-sided transfer semantics
- Bug fixes
SHARP
- Added support for switch-based hardware collectives (SHARP)
NCCL
- Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce, reduce-scatter, bcast, allgather, and allgatherv
Tests
- Updated tests to test the newly added algorithms and operations