Releases: openucx/ucc

1.4.4

09 May 07:55
2c77074

New Features and Enhancements

Core

  • Implemented asymmetric memory support {PR #1000}
  • Enhanced error handling and resource cleanup {PR #960, #951}
  • Improved service team handling {PR #1046}
  • Fixed triggered post for zero size collectives {PR #960}

CL/HIER

  • Added allgatherv support {PR #1111}
  • Implemented node subgroup unpacking {PR #1103}
  • Added reduce to supported collectives {PR #997}
  • Fixed integer overflow in alltoall {PR #944}

TL/UCP

  • Split single and multithreaded send/receive operations {PR #1109}
  • Added knomial allgather with CUDA memory support {PR #1095}
  • Implemented reduce SRG knomial algorithm {PR #1058}
  • Added radix selection to knomial operations {PR #1072}
  • Added sliding window allreduce implementation {PR #958}
  • Added knomial allgatherv support {PR #1008}
  • Added sparbit algorithm for allgather {PR #940}
  • Extended broadcast active set support for size > 2 {PR #926}
  • Added knomial algorithm for reduce-scatter {PR #970}

TL/MLX5

  • Added multicast-based zero-copy broadcast {PR #1087}
  • Implemented mcast multi-group support {PR #1060}
  • Added non-blocking CUDA memory copy support {PR #1040}
  • Added device memory multicast broadcast {PR #989}
  • Enhanced mcast allgather staging-based algorithm {PR #994}
  • Improved one-sided mcast reliability initialization {PR #980}
  • Various performance optimizations in alltoall {PR #1067}
  • Fixed fences in all-to-all WQEs {PR #1069}
  • Added context option to disable all-to-all operations {PR #1062}
  • Improved error handling and device checks {PR #1102}
  • Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

  • Added support for allgather operation {PR #1081}
  • Enabled reduce-scatter with SAT support {PR #1084}
  • Added SHARP multi-channel support {PR #1049}
  • Fixed service team OOB handling {PR #1001}
  • Improved internal OOB usage {PR #986}

CUDA

  • Added linear broadcast implementation {PR #948}
  • Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
  • Enhanced error handling for CUDA context operations {PR #1025}
  • Fixed context cleanup in CUDA operations {PR #954}

Build and Test

  • Added support for specific GPU architectures with ROCm {PR #987}
  • Added UCC pkg-config support {PR #1036}
  • Fixed build compatibility with NVC compiler {PR #1052}
  • Enhanced config parser functionality {PR #1092}
  • Enhanced ASAN/LSAN memory leak detection {PR #1074}
  • Added error checking and exit handling in gtests {PR #1083}

Documentation

  • Updated README with UCC publication information {PR #1028}
  • Added DOCA_UROM documentation {PR #999}
  • Fixed Doxygen documentation issues {PR #1038}
  • Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

  • Implemented new DOCA UROM plugin {PR #978}
  • Added support for offloading collective operations to DPUs
  • Implemented allreduce collective

v1.4.4-rc1

15 Apr 08:06
Pre-release

New Features and Enhancements

Core

  • Implemented asymmetric memory support {PR #1000}
  • Enhanced error handling and resource cleanup {PR #960, #951}
  • Improved service team handling {PR #1046}
  • Fixed triggered post for zero size collectives {PR #960}

CL/HIER

  • Added allgatherv support {PR #1111}
  • Implemented node subgroup unpacking {PR #1103}
  • Added reduce to supported collectives {PR #997}
  • Fixed integer overflow in alltoall {PR #944}

TL/UCP

  • Split single and multithreaded send/receive operations {PR #1109}
  • Added knomial allgather with CUDA memory support {PR #1095}
  • Implemented reduce SRG knomial algorithm {PR #1058}
  • Added radix selection to knomial operations {PR #1072}
  • Added sliding window allreduce implementation {PR #958}
  • Added knomial allgatherv support {PR #1008}
  • Added sparbit algorithm for allgather {PR #940}
  • Extended broadcast active set support for size > 2 {PR #926}
  • Added knomial algorithm for reduce-scatter {PR #970}

TL/MLX5

  • Added multicast-based zero-copy broadcast {PR #1087}
  • Implemented mcast multi-group support {PR #1060}
  • Added non-blocking CUDA memory copy support {PR #1040}
  • Added device memory multicast broadcast {PR #989}
  • Enhanced mcast allgather staging-based algorithm {PR #994}
  • Improved one-sided mcast reliability initialization {PR #980}
  • Various performance optimizations in alltoall {PR #1067}
  • Fixed fences in all-to-all WQEs {PR #1069}
  • Added context option to disable all-to-all operations {PR #1062}
  • Improved error handling and device checks {PR #1102}
  • Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

  • Added support for allgather operation {PR #1081}
  • Enabled reduce-scatter with SAT support {PR #1084}
  • Added SHARP multi-channel support {PR #1049}
  • Fixed service team OOB handling {PR #1001}
  • Improved internal OOB usage {PR #986}

CUDA

  • Added linear broadcast implementation {PR #948}
  • Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
  • Enhanced error handling for CUDA context operations {PR #1025}
  • Fixed context cleanup in CUDA operations {PR #954}

Build and Test

  • Added support for specific GPU architectures with ROCm {PR #987}
  • Added UCC pkg-config support {PR #1036}
  • Fixed build compatibility with NVC compiler {PR #1052}
  • Enhanced config parser functionality {PR #1092}
  • Enhanced ASAN/LSAN memory leak detection {PR #1074}
  • Added error checking and exit handling in gtests {PR #1083}

Documentation

  • Updated README with UCC publication information {PR #1028}
  • Added DOCA_UROM documentation {PR #999}
  • Fixed Doxygen documentation issues {PR #1038}
  • Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

  • Implemented new DOCA UROM plugin {PR #978}
  • Added support for offloading collective operations to DPUs
  • Implemented allreduce collective

1.3.0 (April 18th, 2024)

18 Apr 18:10
1522ccf

1.3.0 (April 18, 2024)

New Features and Enhancements

CL/HIER

  • Disable onesided alltoallv {PR #875}

TL/CUDA

  • Initialize remote CUDA scratch to NULL {PR #911}

TL/UCP

  • Enable hybrid alltoallv {PR #781}
  • Avoid copy in knomial scatter {PR #771}
  • Enable reorder ranks to reduce_scatter, Knomial Allreduce, Ring Allgather/v {PR #819}
  • Remove memcpy in last SRA step {PR #743}
  • Fix sparse pack in hybrid a2av {PR #825}
  • Fix recycle in hybrid a2av {PR #827}
  • Reorder ranks for SRA {PR #834}
  • Use ring allgather when reordering needed {PR #879}
  • Use pipelining in SRA allreduce for CUDA {PR #873}
  • Poll for onesided alltoall completion {PR #876}
  • Add support for non-host buffers in bruck alltoall {PR #852}
  • Added Neighbor Exchange Allgather {PR #822}

TL/SHARP

  • Enable bcast for any predefined dt {PR #774}
  • Don't print team create error {PR #777}
  • Check datasize supported {PR #776}
  • Fix sharp context cleanup {PR #843}

API

  • Remove duplicate get_version_string {PR #933}

TL/NCCL

  • Make team init non-blocking {PR #772}
  • Add CUDA managed to score {PR #793}
  • Make ncclGroupEnd nb {PR #798}
  • Lazy init nccl comm {PR #851}

TL/MLX5

  • Share ib_ctx and pd {PR #749}
  • Rcache {PR #753}
  • Device memory and topo init {PR #780}
  • Adding mcast interface {PR #784}
  • A2A part 1 -- coll init {PR #790}
  • A2A part 2 -- full collective {PR #802}
  • Revisit team and ctx init {PR #815}
  • Fix context create hang {PR #887}
  • Add librdmacm linkage {PR #910}

CORE

  • Fix score update when only score given {PR #779}
  • Coverity fixes {PR #809}
  • Additional Coverity fixes {PR #813}
  • Fix error handling for ctx create epilog {PR #818}
  • Skip zero size collectives {PR #787}

DOCS

  • Updating NEWS for v1.2 {PR #791}
  • Updating NEWS for v1.3 {PR #937}

BUILD and TEST

  • Updated build system to enable UCC with ROCm 6.x {PR #906, #917}
  • Check op and dt compatibility {PR #773}
  • Fix barrier test {PR #799}
  • Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}

v1.3.0-rc1

04 Mar 20:19
484f69a
Pre-release

1.3.0 (TBD)

New Features and Enhancements

CL/HIER

  • Disable onesided alltoallv {PR #875}

TL/CUDA

  • Initialize remote CUDA scratch to NULL {PR #911}

TL/UCP

  • Enable hybrid alltoallv {PR #781}
  • Avoid copy in knomial scatter {PR #771}
  • Enable reorder ranks to reduce_scatter, Knomial Allreduce, Ring Allgather/v {PR #819}
  • Remove memcpy in last SRA step {PR #743}
  • Fix sparse pack in hybrid a2av {PR #825}
  • Fix recycle in hybrid a2av {PR #827}
  • Reorder ranks for SRA {PR #834}
  • Use ring allgather when reordering needed {PR #879}
  • Use pipelining in SRA allreduce for CUDA {PR #873}
  • Poll for onesided alltoall completion {PR #876}
  • Add support for non-host buffers in bruck alltoall {PR #852}
  • Added Neighbor Exchange Allgather {PR #822}

TL/SHARP

  • Enable bcast for any predefined dt {PR #774}
  • Don't print team create error {PR #777}
  • Check datasize supported {PR #776}
  • Fix sharp context cleanup {PR #843}

API

  • Remove duplicate get_version_string {PR #933}

TL/NCCL

  • Make team init non-blocking {PR #772}
  • Add CUDA managed to score {PR #793}
  • Make ncclGroupEnd nb {PR #798}
  • Lazy init nccl comm {PR #851}

TL/MLX5

  • Share ib_ctx and pd {PR #749}
  • Rcache {PR #753}
  • Device memory and topo init {PR #780}
  • Adding mcast interface {PR #784}
  • A2A part 1 -- coll init {PR #790}
  • A2A part 2 -- full collective {PR #802}
  • Revisit team and ctx init {PR #815}
  • Fix context create hang {PR #887}
  • Add librdmacm linkage {PR #910}

CORE

  • Fix score update when only score given {PR #779}
  • Coverity fixes {PR #809}
  • Additional Coverity fixes {PR #813}
  • Fix error handling for ctx create epilog {PR #818}
  • Skip zero size collectives {PR #787}

DOCS

  • Updating NEWS for v1.2 {PR #791}

TEST

  • Check op and dt compatibility {PR #773}
  • Fix barrier test {PR #799}
  • Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}

UCC v1.2.0

13 Jun 13:27
20fc186

This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:

New Features and Enhancements

CL/HIER

  • Fixed single proc on node issue in alltoall (#658)
  • Implemented allreduce rab pipelined (#608)
  • Added bcast 2step algorithm (#620)
  • Fixed allreduce rab pipeline (#759)

TL/CUDA

  • Support for CUDA 12
  • Fixed cache unmap issue (#642)
  • Implemented reduce scatter linear (#669)
  • Added algorithm selection based on topology (#688)
  • Fixed linear algorithms (#751)
  • Fixed pipelining in linear rs (#770)

TL/UCP

  • Added special service worker (#560)
  • Added scatterv (#663)
  • Added gatherv (#664)
  • Fixed running with npolls 0 (#695)
  • Added knomial allgather (#729)
  • Fixed bug for triggered colls (#757)
  • Added bruck alltoall (#756)
  • Added SLOAV alltoallv (#687)
  • Large message broadcast optimizations (#738)
  • Ranks reordering in ring allgather for better locality (#69)

TL/SHARP

  • Fixed memory type check in allreduce (#662)
  • Added support for sharpv3 dt (#661)
  • Fixed assert check (#686)
  • Implemented SHARP OOB fixes (#746)
  • Fixed local rank when NODE SBGP not enabled (#760)
  • Prevented sharp team with team max ppn > 1 (#761)

CORE

  • Fixed memory type score update (#650)
  • Fixed ucc parser build (#666)
  • Implemented ucc_pipeline_params (#675)
  • Changed log level of config_modify (#667)
  • Fixed timeout handle for triggered post (#679)

DOCS

  • Added User Guide (#720)

v1.2.0-rc1

25 May 16:24
c0b5d1f
Pre-release

This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:

New Features and Enhancements

CL/HIER

  • Fixed single proc on node issue in alltoall (#658)
  • Implemented allreduce rab pipelined (#608)
  • Added bcast 2step algorithm (#620)
  • Fixed allreduce rab pipeline (#759)

TL/CUDA

  • Fixed cache unmap issue (#642)
  • Implemented reduce scatter linear (#669)
  • Added algorithm selection based on topology (#688)
  • Fixed linear algorithms (#751)
  • Fixed pipelining in linear rs (#770)

TL/UCP

  • Added special service worker (#560)
  • Added scatterv (#663)
  • Added gatherv (#664)
  • Fixed running with npolls 0 (#695)
  • Added knomial allgather (#729)
  • Fixed bug for triggered colls (#757)
  • Added bruck alltoall (#756)

TL/SHARP

  • Fixed memory type check in allreduce (#662)
  • Added support for sharpv3 dt (#661)
  • Fixed assert check (#686)
  • Implemented SHARP OOB fixes (#746)
  • Fixed local rank when NODE SBGP not enabled (#760)
  • Prevented sharp team with team max ppn > 1 (#761)

CORE

  • Fixed memory type score update (#650)
  • Fixed ucc parser build (#666)
  • Implemented ucc_pipeline_params (#675)
  • Changed log level of config_modify (#667)
  • Fixed timeout handle for triggered post (#679)

DOCS

  • Added User Guide (#720)

UCC Version 1.1.0

07 Oct 14:02
cd3fce9

Features

API

  • Added float 128 and complex float 32, 64, and 128 data types
  • Added Active Sets based collectives to support dynamic groups as well as
    point-to-point messaging
  • Added ucc_team_get_attr interface

Core

  • Config file support
  • Fixed component search

CL

  • Added split rail allreduce collective implementation
  • Enable hierarchical alltoallv and barrier
  • Fixed cleanup bugs

TL

  • Added SELF TL supporting team size one

UCP

  • Added service broadcast
  • Added reduce_scatterv ring algorithm
  • Added k-nomial based gather collective implementation
  • Added one-sided get based algorithms

SHARP

  • Fixed SHARP OOB
  • Added SHARP broadcast

GPU Collectives (CUDA, NCCL TL and RCCL TL)

  • Added RCCL TL to support RCCL collectives
  • Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
  • Added multiring allgatherv, alltoall, reduce-scatter, and reduce-scatterv
    in CUDA TL
  • Added topo based ring construction in CUDA TL to maximize bandwidth
  • Added NCCL gather, scatter, and their vector variants
  • Enable using multiple streams for collectives
  • Added support for RCCL gather (v), scatter (v), broadcast, allgather (v),
    barrier, alltoall (v) and allreduce collectives
  • Added ROCm memory component
  • Adapted all GPU collectives to executor design

Tests

  • Added tests for triggered collectives in perftests
  • Fixed bugs in multi-threading tests

Utils

  • Added CPU model and vendor detection
  • Several bug fixes in all components

UCC Version 1.1.0 - RC1

07 Sep 15:40
9f22d78
Pre-release

1.1.0

Features

API

  • Added float 128 and complex float 32, 64, and 128 data types
  • Added Active Sets based collectives to support dynamic groups as well as point-to-point messaging

Core

  • Config file support
  • Fixed component search

CL

  • Added split rail allreduce collective implementation
  • Enable hierarchical alltoallv
  • Fixed cleanup bugs

TL

  • Added SELF TL supporting team size one

UCP

  • Added service broadcast
  • Added reduce_scatterv ring algorithm
  • Added k-nomial based gather collective implementation
  • Added one-sided get based algorithms

SHARP

  • Fixed SHARP OOB
  • Added SHARP broadcast

GPU Collectives (CUDA, NCCL TL and RCCL TL)

  • Added support for CUDA TL (intranode collectives for NVIDIA GPUs)
  • Added multiring allgatherv, alltoall in CUDA TL
  • Added NCCL gather, scatter, and their vector variants
  • Enable using multiple streams for collectives
  • Added support for RCCL gather (v), scatter (v), broadcast, allgather (v), barrier, alltoall (v) and allreduce collectives
  • Added ROCm memory component
  • Adapted all GPU collectives to executor design

Tests

  • Added tests for triggered collectives in perftests
  • Fixed bugs in multi-threading tests

Utils

  • Added CPU model and vendor detection
  • Several bug fixes in all components

Unified Collective Communication, Version 1.0.0

19 Apr 21:57
c69c53b

1.0.0

Features

API

  • Added Avg reduce operation
  • Added nonblocking team destroy option
  • Added user-defined datatype definitions
  • Added Bfloat16 type
  • Clarified semantics of core abstractions including teams and context
  • Added timeout option

Core

  • Added coll scoring and selection support
  • Added support for Triggered collectives
  • Added support for timeouts in collectives
  • Added support for team create without ep in post
  • Added support for multithreaded context progress
  • Added support for nonblocking team destroy

CL

  • Added support for hierarchical collectives
  • Added support for hierarchical allreduce collective operation
  • Added support for collectives based on one-sided communication routines

TL

  • Added SHARP TL

UCP

  • Added Bcast SAG algorithm for large messages
  • Added Knomial based reduce algorithm
  • Made allgather and alltoall agree with the API
  • Added SRA knomial allreduce algorithm
  • Added pairwise alltoall and alltoallv algorithms
  • Added allgather and allgatherv ring algorithms
  • Added support for collective operations based on one-sided semantics
  • Added support for alltoall with one-sided transfer semantics
  • Bug fixes
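
As an illustration of the ring pattern named in the UCP list above, the following is a minimal Python sketch that simulates a ring allgather inside a single process. It is not UCC code: the function name ring_allgather and the list-based buffers are hypothetical, chosen only to show the communication schedule, and the actual TL/UCP implementation may differ.

```python
def ring_allgather(blocks):
    """Simulate a ring allgather over len(blocks) ranks (illustrative only).

    blocks[r] is the data contributed by rank r.
    Returns one fully gathered buffer per rank.
    """
    n = len(blocks)
    # Each rank starts with only its own block, stored at its own index.
    bufs = [[None] * n for _ in range(n)]
    for r in range(n):
        bufs[r][r] = blocks[r]

    # n - 1 steps: in step s, rank r forwards block (r - s) mod n
    # to its right neighbor (r + 1) mod n.
    for step in range(n - 1):
        # Collect all sends first so the ranks exchange "simultaneously".
        sends = []
        for r in range(n):
            idx = (r - step) % n
            sends.append((r, idx, bufs[r][idx]))
        for src, idx, data in sends:
            dst = (src + 1) % n
            bufs[dst][idx] = data
    return bufs


if __name__ == "__main__":
    for rank, buf in enumerate(ring_allgather(["a", "b", "c", "d"])):
        print(rank, buf)
```

Each rank sends and receives exactly one block per step, so the exchange finishes in n - 1 steps with every rank holding all n blocks in rank order.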

SHARP

  • Added support for switch-based hardware collectives (SHARP)

NCCL

  • Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce,
    reduce-scatter, bcast, allgather and allgatherv

Tests

  • Updated tests to test the newly added algorithms and operations

Unified Collective Communication, Version 1.0.0 - RC2

27 Jan 19:04
c5d3ee5

1.0.0

Features

API

  • Added Avg reduce operation
  • Added nonblocking team destroy option
  • Added user-defined datatype definitions
  • Added Bfloat16 type
  • Clarified semantics of core abstractions including teams and context
  • Added timeout option

Core

  • Added coll scoring and selection support
  • Added support for Triggered collectives
  • Added support for timeouts in collectives
  • Added support for team create without ep in post
  • Added support for multithreaded context progress
  • Added support for nonblocking team destroy

CL

  • Added support for hierarchical collectives
  • Added support for hierarchical allreduce collective operation
  • Added support for collectives based on one-sided communication routines

TL

  • Added SHARP TL

UCP

  • Added Bcast SAG algorithm for large messages
  • Added Knomial based reduce algorithm
  • Made allgather and alltoall agree with the API
  • Added SRA knomial allreduce algorithm
  • Added pairwise alltoall and alltoallv algorithms
  • Added allgather and allgatherv ring algorithms
  • Added support for collective operations based on one-sided semantics
  • Added support for alltoall with one-sided transfer semantics
  • Bug fixes

SHARP

  • Added support for switch-based hardware collectives (SHARP)

NCCL

  • Added support for NCCL allreduce, alltoall, alltoallv, barrier, reduce,
    reduce-scatter, bcast, allgather and allgatherv

Tests

  • Updated tests to test the newly added algorithms and operations