[Perf] Reduce allocations in basic circuit operations #2620

ljedrz · 2025-02-26T11:20:56Z

These performance improvements were found via profiling of the large_program, which is included in the 1st commit. It is clear that a lot of the time is being spent on allocations (which go into millions), many of which are linked to some of the basic circuit operations.

This PR is not the end of this endeavor, but I decided to isolate these changes in order for the diff to not spiral out of control. While it is already considerable, note that almost 20% is the benchmark, and most of the new additions are almost exact duplicates of the existing code, but with unnecessary clone operations removed. The current code reimplements many operations using the "lowest common denominator", which reduces the amount of code, but results in performance penalties. I believe this amount of duplication is acceptable considering how impactful these operations are, and how unlikely they are to change in the future.

Based on the new large_program (which involves the synthesis of 1_721_211 constraints) benchmark, these changes result in a boost to runtime performance:

CheckDeployment for do  time:   [1.1380 s 1.1509 s 1.1668 s]
                        change: [-10.272% -8.5848% -6.9016%] (p = 0.00 < 0.05)
                        Performance has improved.

(updated results)
According to additional heap profiling, these changes reduce the total number of allocations involved by over 22%.

As suggested by @vicsn, I ran the credits.sh Canary script with these changes, and saw no change to the diff, suggesting that there is no impact to logic.

ljedrz · 2025-02-26T11:40:38Z

Note: we can probably remove some of the repeated code comments and instead refer to a single reference implementation that has them.

vicsn · 2025-02-26T12:51:54Z

Analysis regarding impact on pricing and deployment limits.

For simplicity, we assume the number of variables equals the number of constraints, which is roughly the case for most deployed programs. And recall that compute is currently priced at 100 credits / second.

Currently a 2**20 constraint program will cost around 25 (SYNTHESIS_FEE_MULTIPLIER) * (2**20 (MAX_DEPLOYMENT_CONSTRAINTS) + 2**20 (MAX_DEPLOYMENT_VARIABLES)) = 52'428'800 microcredits. Deployments are verified in batches of 5 (MAX_PARALLEL_DEPLOY_VERIFICATIONS), implying verification time per program of around 5*0.52=2.6 seconds.

This PR's benchmark implies 5-threaded runtime of (5x2x1'721'211)/1.15=14'967'052 (constraints+variables) per second.

So:

SYNTHESIS_FEE_MULTIPLIER can be 100'000'000 / 14'967'052 = 7.
MAX_DEPLOYMENT_CONSTRAINTS and MAX_DEPLOYMENT_VARIABLES can remain unchanged, as 1'496'705 rounds down to 2**20. In the current model, individual transactions should not take longer than 1 second to verify.

ljedrz · 2025-02-27T08:15:51Z

Since I'm now switching to a different optimization, I'll post the final bits related to arithmetic circuit operations that I've found since the filing of this PR.

This is the end result of these changes:

CheckDeployment for do  time:   [1.0422 s 1.0508 s 1.0656 s]
                        change: [-18.630% -17.220% -15.937%] (p = 0.00 < 0.05)
                        Performance has improved.

circuit/types/integers/src/xor.rs

circuit/types/boolean/src/xor.rs

kpandl

Nice optimization!
Related to what Victor said, I was thinking if it is possible to remove code duplications here with a new helper function (without cloning)?

I see the bitxor operation 5 times:

let output = Boolean(
    E::new_variable(Mode::Private, match self.eject_value() ^ other.eject_value() {
        true => E::BaseField::one(),
        false => E::BaseField::zero(),
    })
    .into(),
);

and also the bitand operation 5 times:

let output = Boolean(
    E::new_variable(Mode::Private, match self.eject_value() & other.eject_value() {
        true => E::BaseField::one(),
        false => E::BaseField::zero(),
    })
    .into(),
);

ljedrz · 2025-03-10T13:24:48Z

I'll see what I can do to compress these 👍.

vicsn · 2025-03-10T13:36:57Z

I was thinking if it is possible to remove code duplications here with a new helper function (without cloning)?

Good idea!

ljedrz · 2025-03-10T15:19:39Z

I rebased and applied the first round of "compressions", which already greatly improved the diff:

 circuit/environment/src/helpers/linear_combination.rs | 190 +++++++++++--------------------------------------------------------------------------------------------
 circuit/environment/src/helpers/variable.rs           |  10 +++---
 circuit/types/boolean/src/and.rs                      | 186 ++++++++++++++++++++++++++---------------------------------------------------------------------------
 circuit/types/boolean/src/helpers/bits_are_zero.rs    |   4 ++-
 circuit/types/boolean/src/not.rs                      |   8 -----
 circuit/types/boolean/src/xor.rs                      | 276 ++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------
 circuit/types/field/src/add.rs                        |  18 +++++-----
 circuit/types/integers/src/lib.rs                     |   3 +-
 8 files changed, 136 insertions(+), 559 deletions(-)

I'm going to do another pass in case I can find any substitutions that can be made without a drop in performance. Also, these new changes have actually resulted in a further small improvement:

CheckDeployment for do  time:   [1.0344 s 1.0367 s 1.0395 s]
                        change: [-19.915% -18.836% -18.093%] (p = 0.00 < 0.05)
                        Performance has improved.

ljedrz · 2025-03-10T16:38:16Z

I managed to squeeze the impls a bit further:

 circuit/environment/benches/linear_combination.rs     |   3 ++-
 circuit/environment/src/helpers/linear_combination.rs | 146 +++++++++++---------------------------------------------------------------------------------------------------------------------------------------
 synthesizer/process/benches/check_deployment.rs       |   2 +-
 3 files changed, 14 insertions(+), 137 deletions(-)

The perf improvement was reduced a little bit, but it's still a win:

CheckDeployment for do  time:   [1.0559 s 1.0693 s 1.0864 s]
                        change: [-15.954% -14.786% -13.414%] (p = 0.00 < 0.05)
                        Performance has improved.

It might still be possible to reduce the amount of duplication a bit more.

vicsn · 2025-03-10T18:43:28Z

I managed to squeeze the impls a bit further:

Thank you. The Cow stuff is very helpful. Please revert the bit sections to their original state so we spare reviewer's time. We can eat the performance cost, the cumulative efforts of the three allocation PRs will help to increase the constraint limit for 2**20 to 2**21, which is an amazing improvement.

ljedrz · 2025-03-11T10:10:46Z

Done. I was worried that it would wipe more of the performance win, but actually the impact was pretty benign:

CheckDeployment for do  time:   [1.1149 s 1.1309 s 1.1513 s]
                        change: [-14.565% -12.904% -10.988%] (p = 0.00 < 0.05)
                        Performance has improved.

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

…d Variables Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

…nation Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

kpandl

LGTM

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>

kaimast

I left some suggestions on this PR ages ago but never finished the review... Hopefully we can merge this soon?

kaimast · 2025-03-12T18:40:47Z