
Make multi-threading a feature #3237


Closed
wants to merge 1 commit into from

Conversation

WorksButNotTested
Collaborator

Description

Add single-threaded and multi-threaded configuration options to librasan that switch between using the spin crate for synchronization and the newly created nospin crate (which doesn't provide any thread safety).
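For context, a minimal sketch of what such a dummy lock might look like; this assumes nospin mirrors spin::Mutex's API, which may not match the actual crate:

```rust
use core::cell::UnsafeCell;
use core::ops::{Deref, DerefMut};

// Hypothetical sketch of a "nospin" mutex: same shape as spin::Mutex, but
// lock() performs no atomic operation at all, so it is only sound when the
// target is genuinely single-threaded.
pub struct Mutex<T> {
    data: UnsafeCell<T>,
}

// This is the unsafe part: we promise Sync without providing any exclusion.
unsafe impl<T: Send> Sync for Mutex<T> {}

pub struct MutexGuard<'a, T> {
    data: &'a mut T,
}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self { data: UnsafeCell::new(data) }
    }

    pub fn lock(&self) -> MutexGuard<'_, T> {
        // No compare_exchange, no fence: just hand out the data.
        MutexGuard { data: unsafe { &mut *self.data.get() } }
    }
}

impl<T> Deref for MutexGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.data
    }
}

impl<T> DerefMut for MutexGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.data
    }
}
```

The point is that lock() carries no atomic operation at all, which is exactly why such a feature would be unsound on multi-threaded targets.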

Checklist

  • I have run ./scripts/precommit.sh and addressed all comments

@WorksButNotTested
Collaborator Author

@wfdewith Any thoughts?

@@ -42,6 +43,10 @@ libc = ["dep:libc"]
linux = ["dep:rustix"]
## Enable the `baby_mimalloc` allocator
mimalloc = ["dep:baby-mimalloc"]
## Support multi-threaded targets by using real synchronization primitives
multi-threaded = ["dep:spin"]
Member

should be default

## Support multi-threaded targets by using real synchronization primitives
multi-threaded = ["dep:spin"]
## Support single-threaded targets by using dummy synchronization primitives
single-threaded = ["dep:nospin"]
Member

unsafe-single-threaded?

Member

Maybe actually just make this a feature and have the other not be a feature? Or just make threadsafe a feature and leave the other one as the default. In Rust, features should be additive if possible.
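Purely as an illustration of the additive-feature approach (the names and paths here are hypothetical, not the actual librasan code), the selection could live behind a single cfg switch:

```rust
// Illustrative sketch only: one additive feature selects the lock
// implementation, with the thread-safe spin-based path as the default.
#[cfg(not(feature = "unsafe-single-threaded"))]
pub use spin::Mutex;

#[cfg(feature = "unsafe-single-threaded")]
pub use nospin::Mutex; // assumes nospin exposes a spin-compatible Mutex

// Call sites stay identical either way:
// static COUNTER: Mutex<u64> = Mutex::new(0);
```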

@domenukk
Member

You could also take a look at the critical_section crate; it's more or less the default on no_std (at least on embedded), and you can also define your own synchronisation handlers (like... do nothing).

Probably quite a bit more work though
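A rough sketch of what a do-nothing backend for critical-section could look like; this assumes a configuration where RawRestoreState is () and is not taken from any actual librasan or LibAFL code:

```rust
use critical_section::RawRestoreState;

// Hypothetical single-threaded backend: acquiring the critical section does
// nothing at all, because there is no other thread (or interrupt) to exclude.
// Assumes a restore-state configuration where RawRestoreState is `()`.
struct SingleThreaded;
critical_section::set_impl!(SingleThreaded);

unsafe impl critical_section::Impl for SingleThreaded {
    unsafe fn acquire() -> RawRestoreState {
        // Nothing to disable.
    }

    unsafe fn release(_restore_state: RawRestoreState) {
        // Nothing to restore.
    }
}

// Library code then stays backend-agnostic:
// critical_section::with(|_cs| { /* touch shared state */ });
```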

@wfdewith
Contributor

I agree with @domenukk on creating one feature for unsafe-single-threaded, because in my experience, mutually exclusive features are annoying to work with and not really what they are intended for.

However, I'm going to be annoying again and ask if this really improves performance measurably. The reason I'm asking is that mutexes typically try to lock opportunistically before falling back to a more expensive syscall, or in the case of spin, before falling back to actually spinning. See here. This means that in single-threaded scenarios, you should always successfully lock the mutex through the opportunistic path, which should be extremely fast.
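For illustration, the opportunistic path being described looks roughly like this; a simplified generic spinlock sketch, not the actual spin-rs source:

```rust
use core::hint;
use core::sync::atomic::{AtomicBool, Ordering};

// Simplified spinlock acquire path: when there is no contention, the single
// compare_exchange succeeds immediately; only contended acquires spin
// (an OS mutex would make a futex/syscall on the slow path instead).
pub struct RawSpinLock {
    locked: AtomicBool,
}

impl RawSpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        loop {
            // Fast path: uncontended, this CAS succeeds on the first try.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // Slow path: wait until the lock looks free, then retry the CAS.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```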

@domenukk
Member

domenukk commented May 15, 2025

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

@wfdewith
Contributor

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

I do agree that the atomics have overhead, but consider that this code will only run in QEMU. Since QEMU JITs instructions during execution (except in system mode with KVM acceleration enabled), it can optimize memory barriers out at runtime if it is running in a single-threaded context. Looking at the source and documentation here and here, that is exactly what QEMU does.

@WorksButNotTested
Collaborator Author

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

I do agree that the atomics have overhead, but consider that this code will only run in QEMU. Since QEMU JITs instructions during execution (except in system mode with KVM acceleration enabled), it can optimize memory barriers out at runtime if it is running in a single-threaded context. Looking at the source and documentation here and here, that is exactly what QEMU does.

Cool. I didn't know it did that. Obviously the opportunistic locking would otherwise still have a cache coherency overhead.

I'm running qemu_launcher and using top to find the busy process (as that's probably the one doing the fuzzing, right?). Looking at /proc/<pid>/task, it looks like there's a second thread though?!?

@WorksButNotTested
Collaborator Author

Also interestingly, if a process creates a second thread, then the TCG is flushed. That could cause performance overhead for fork-server implementations?

@domenukk
Member

Also interestingly, if a process creates a second thread, then the TCG is flushed. That could cause performance overhead for fork-server implementations?

CC @rmalmain

@WorksButNotTested
Collaborator Author

Seems we don't benefit from QEMU dropping the memory barriers right now, as the process is multi-threaded (that needs investigating). But the PR may still be useful if librasan is to be reused in FRIDA, or otherwise outside of QEMU.

@WorksButNotTested
Collaborator Author

WorksButNotTested commented May 19, 2025

Seems if the process makes any shared mappings, then the optimizations have to be disabled too (which makes sense). https://github.com/qemu/qemu/blob/757a34115e7491744a63dfc3d291fd1de5297ee2/linux-user/mmap.c#L1011. So this optimization still stands, as presumably this is how the fuzzing process communicates with the fuzzer itself?

@wfdewith
Contributor

Seems if the process makes any shared mappings, then the optimizations have to be disabled too (which makes sense). https://github.com/qemu/qemu/blob/757a34115e7491744a63dfc3d291fd1de5297ee2/linux-user/mmap.c#L1011. So this optimization still stands, as presumably this is how the fuzzing process communicates with the fuzzer itself?

Well spotted! The only remaining question for me, then, is what the overhead of the single atomic operation in spin-rs would be: https://github.com/zesterer/spin-rs/blob/2798933e6842968e44e23fa35d283caea4d3517c/src/mutex/spin.rs#L253. I imagine that it could be quite significant, as the mutexes are locked for most memory accesses.
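One way to put a rough number on that uncontended cost is a host-side micro-benchmark like the sketch below (purely illustrative; the interesting numbers would have to come from the target running under QEMU's TCG):

```rust
use std::time::Instant;

fn main() {
    // Rough host-side micro-benchmark of an uncontended spin::Mutex
    // lock/unlock pair; numbers under QEMU will differ.
    let counter = spin::Mutex::new(0u64);
    const ITERS: u64 = 100_000_000;

    let start = Instant::now();
    for _ in 0..ITERS {
        *counter.lock() += 1; // one compare_exchange plus one release store
    }
    let elapsed = start.elapsed();

    assert_eq!(*counter.lock(), ITERS);
    println!(
        "uncontended lock/unlock: {:.2} ns",
        elapsed.as_nanos() as f64 / ITERS as f64
    );
}
```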

@WorksButNotTested
Collaborator Author

I'd expect that it must be noticeable; otherwise, QEMU wouldn't have gone to the effort of removing them. I'd expect it would be worse as you run more parallel instances?

@WorksButNotTested
Collaborator Author

I'm not sure this adds much performance, so I'll close it for now.
