Skip to content

[Perf] Reduce allocations in basic circuit operations #2620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: staging
Choose a base branch
from

Conversation

ljedrz
Copy link
Collaborator

@ljedrz ljedrz commented Feb 26, 2025

These performance improvements were found via profiling of the large_program, which is included in the 1st commit. It is clear that a lot of the time is being spent on allocations (which go into millions), many of which are linked to some of the basic circuit operations.

This PR is not the end of this endeavor, but I decided to isolate these changes in order for the diff to not spiral out of control. While it is already considerable, note that almost 20% is the benchmark, and most of the new additions are almost exact duplicates of the existing code, but with unnecessary clone operations removed. The current code reimplements many operations using the "lowest common denominator", which reduces the amount of code, but results in performance penalties. I believe this amount of duplication is acceptable considering how impactful these operations are, and how unlikely they are to change in the future.

Based on the new large_program (which involves the synthesis of 1_721_211 constraints) benchmark, these changes result in a boost to runtime performance:

CheckDeployment for do  time:   [1.1380 s 1.1509 s 1.1668 s]
                        change: [-10.272% -8.5848% -6.9016%] (p = 0.00 < 0.05)
                        Performance has improved.

(updated results)
According to additional heap profiling, these changes reduce the total number of allocations involved by over 22%.

As suggested by @vicsn, I ran the credits.sh Canary script with these changes, and saw no change to the diff, suggesting that there is no impact to logic.

@ljedrz ljedrz requested a review from vicsn February 26, 2025 11:20
@ljedrz
Copy link
Collaborator Author

ljedrz commented Feb 26, 2025

Note: we can probably remove some of the repeated code comments and instead refer to a single reference implementation that has them.

@vicsn
Copy link
Collaborator

vicsn commented Feb 26, 2025

Analysis regarding impact on pricing and deployment limits.

For simplicity, we assume the number of variables equals the number of constraints, which is roughly the case for most deployed programs. And recall that compute is currently priced at 100 credits / second.

Currently a 2**20 constraint program will cost around 25 (SYNTHESIS_FEE_MULTIPLIER) * (2**20 (MAX_DEPLOYMENT_CONSTRAINTS) + 2**20 (MAX_DEPLOYMENT_VARIABLES)) = 52'428'800 microcredits. Deployments are verified in batches of 5 (MAX_PARALLEL_DEPLOY_VERIFICATIONS), implying verification time per program of around 5*0.52=2.6 seconds.

This PR's benchmark implies 5-threaded runtime of (5x2x1'721'211)/1.15=14'967'052 (constraints+variables) per second.

So:

  • SYNTHESIS_FEE_MULTIPLIER can be 100'000'000 / 14'967'052 = 7.
  • MAX_DEPLOYMENT_CONSTRAINTS and MAX_DEPLOYMENT_VARIABLES can remain unchanged, as 1'496'705 rounds down to 2**20. In the current model, individual transactions should not take longer than 1 second to verify.

@ljedrz
Copy link
Collaborator Author

ljedrz commented Feb 27, 2025

Since I'm now switching to a different optimization, I'll post the final bits related to arithmetic circuit operations that I've found since the filing of this PR.

This is the end result of these changes:

CheckDeployment for do  time:   [1.0422 s 1.0508 s 1.0656 s]
                        change: [-18.630% -17.220% -15.937%] (p = 0.00 < 0.05)
                        Performance has improved.

@vicsn vicsn requested a review from kpandl March 8, 2025 20:00
Copy link
Collaborator

@kpandl kpandl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization!
Related to what Victor said, I was thinking if it is possible to remove code duplications here with a new helper function (without cloning)?

I see the bitxor operation 5 times:

let output = Boolean(
    E::new_variable(Mode::Private, match self.eject_value() ^ other.eject_value() {
        true => E::BaseField::one(),
        false => E::BaseField::zero(),
    })
    .into(),
);

and also the bitand operation 5 times:

let output = Boolean(
    E::new_variable(Mode::Private, match self.eject_value() & other.eject_value() {
        true => E::BaseField::one(),
        false => E::BaseField::zero(),
    })
    .into(),
);

@ljedrz
Copy link
Collaborator Author

ljedrz commented Mar 10, 2025

I'll see what I can do to compress these 👍.

@vicsn
Copy link
Collaborator

vicsn commented Mar 10, 2025

I was thinking if it is possible to remove code duplications here with a new helper function (without cloning)?

Good idea!

@ljedrz ljedrz force-pushed the perf/circuit_ops_allocs branch from 1dfb988 to cd70ab9 Compare March 10, 2025 15:16
@ljedrz
Copy link
Collaborator Author

ljedrz commented Mar 10, 2025

I rebased and applied the first round of "compressions", which already greatly improved the diff:

 circuit/environment/src/helpers/linear_combination.rs | 190 +++++++++++--------------------------------------------------------------------------------------------
 circuit/environment/src/helpers/variable.rs           |  10 +++---
 circuit/types/boolean/src/and.rs                      | 186 ++++++++++++++++++++++++++---------------------------------------------------------------------------
 circuit/types/boolean/src/helpers/bits_are_zero.rs    |   4 ++-
 circuit/types/boolean/src/not.rs                      |   8 -----
 circuit/types/boolean/src/xor.rs                      | 276 ++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------
 circuit/types/field/src/add.rs                        |  18 +++++-----
 circuit/types/integers/src/lib.rs                     |   3 +-
 8 files changed, 136 insertions(+), 559 deletions(-)

I'm going to do another pass in case I can find any substitutions that can be made without a drop in performance. Also, these new changes have actually resulted in a further small improvement:

CheckDeployment for do  time:   [1.0344 s 1.0367 s 1.0395 s]
                        change: [-19.915% -18.836% -18.093%] (p = 0.00 < 0.05)
                        Performance has improved.

@ljedrz
Copy link
Collaborator Author

ljedrz commented Mar 10, 2025

I managed to squeeze the impls a bit further:

 circuit/environment/benches/linear_combination.rs     |   3 ++-
 circuit/environment/src/helpers/linear_combination.rs | 146 +++++++++++---------------------------------------------------------------------------------------------------------------------------------------
 synthesizer/process/benches/check_deployment.rs       |   2 +-
 3 files changed, 14 insertions(+), 137 deletions(-)

The perf improvement was reduced a little bit, but it's still a win:

CheckDeployment for do  time:   [1.0559 s 1.0693 s 1.0864 s]
                        change: [-15.954% -14.786% -13.414%] (p = 0.00 < 0.05)
                        Performance has improved.

It might still be possible to reduce the amount of duplication a bit more.

@vicsn
Copy link
Collaborator

vicsn commented Mar 10, 2025

I managed to squeeze the impls a bit further:

Thank you. The Cow stuff is very helpful. Please revert the bit sections to their original state so we spare reviewer's time. We can eat the performance cost, the cumulative efforts of the three allocation PRs will help to increase the constraint limit for 2**20 to 2**21, which is an amazing improvement.

@ljedrz
Copy link
Collaborator Author

ljedrz commented Mar 11, 2025

Done. I was worried that it would wipe more of the performance win, but actually the impact was pretty benign:

CheckDeployment for do  time:   [1.1149 s 1.1309 s 1.1513 s]
                        change: [-14.565% -12.904% -10.988%] (p = 0.00 < 0.05)
                        Performance has improved.

@ljedrz ljedrz force-pushed the perf/circuit_ops_allocs branch 2 times, most recently from 2a32e60 to c325d9d Compare March 11, 2025 10:48
@ljedrz ljedrz requested review from kpandl and vicsn March 11, 2025 10:52
@ljedrz ljedrz force-pushed the perf/circuit_ops_allocs branch from c325d9d to 15e3b79 Compare March 11, 2025 12:22
ljedrz added 7 commits March 11, 2025 13:23
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
…d Variables

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
ljedrz added 14 commits March 11, 2025 13:23
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
…nation

Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Copy link
Collaborator

@kpandl kpandl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vicsn vicsn requested a review from kaimast March 12, 2025 14:27
ljedrz added 2 commits March 12, 2025 18:17
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@vicsn vicsn removed the v3.5.0 label Mar 21, 2025
Copy link
Collaborator

@kaimast kaimast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some suggestions on this PR ages ago but never finished the review... Hopefully we can merge this soon?

};
use smallvec::SmallVec;
use std::borrow::Cow;

// Before high level program operations are converted into constraints, they are first tracked as linear combinations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change this to a doc comment (/// instead of //) and use Markdown formatting for the list?

false => {
match output.terms.binary_search_by(|(v, _)| v.cmp(variable)) {
match output.terms.binary_search_by(|(v, _)| v.cmp(&variable)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be slightly nicer to do

output.terms.binary_search_by_key(&variable, |(v, _)| v)

@@ -249,7 +227,7 @@ impl<F: PrimeField> Add<Variable<F>> for LinearCombination<F> {

#[allow(clippy::op_ref)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this decorator, because the &other part is gone. (See here what the lint does)

}
}

impl<F: PrimeField> Add<LinearCombination<F>> for LinearCombination<F> {
type Output = Self;

fn add(self, other: Self) -> Self::Output {
self + &other
if self.terms.len() > other.terms.len() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that we expand the larger terms to reduce heap allocations? Might be good to add a comment.

self + &other
if self.terms.len() > other.terms.len() {
let mut output = self;
output += Cow::Owned(other);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is AddAssign more efficient than Add? Again, it might be good to document that with a comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants