
Make multi-threading a feature #3237


Closed
wants to merge 1 commit into from

Conversation

WorksButNotTested
Collaborator

Description

Add single-threaded and multi-threaded configuration options to librasan that switch between using the spin crate for synchronization and the newly created nospin crate (which doesn't provide any thread safety).
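For context, a minimal sketch of what such a dummy lock might look like; this assumes nospin mirrors spin::Mutex's API, which may not match the actual crate:

```rust
use core::cell::UnsafeCell;
use core::ops::{Deref, DerefMut};

// Hypothetical sketch of a "nospin" mutex: same shape as spin::Mutex, but
// lock() performs no atomic operation at all, so it is only sound when the
// target is genuinely single-threaded.
pub struct Mutex<T> {
    data: UnsafeCell<T>,
}

// This is the unsafe part: we promise Sync without providing any exclusion.
unsafe impl<T: Send> Sync for Mutex<T> {}

pub struct MutexGuard<'a, T> {
    data: &'a mut T,
}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self { data: UnsafeCell::new(data) }
    }

    pub fn lock(&self) -> MutexGuard<'_, T> {
        // No compare_exchange, no fence: just hand out the data.
        MutexGuard { data: unsafe { &mut *self.data.get() } }
    }
}

impl<T> Deref for MutexGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.data
    }
}

impl<T> DerefMut for MutexGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.data
    }
}
```

The point is that lock() carries no atomic operation at all, which is exactly why such a feature would be unsound on multi-threaded targets.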

Checklist

  • I have run ./scripts/precommit.sh and addressed all comments

@WorksButNotTested
Collaborator Author

@wfdewith Any thoughts?

@@ -42,6 +43,10 @@ libc = ["dep:libc"]
linux = ["dep:rustix"]
## Enable the `baby_mimalloc` allocator
mimalloc = ["dep:baby-mimalloc"]
## Support multi-threaded targets by using real synchronization primitives
multi-threaded = ["dep:spin"]
Member

should be default

## Support multi-threaded targets by using real synchronization primitives
multi-threaded = ["dep:spin"]
## Support single-threaded targets by using dummy synchronization primitives
single-threaded = ["dep:nospin"]
Member

unsafe-single-threaded?

Member

Maybe actually just make this a feature and have the other not be a feature? Or just make threadsafe a feature and leave the other one as the default. In Rust, features should be additive if possible.
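Purely as an illustration of the additive-feature approach (the names and paths here are hypothetical, not the actual librasan code), the selection could live behind a single cfg switch:

```rust
// Illustrative sketch only: one additive feature selects the lock
// implementation, with the thread-safe spin-based path as the default.
#[cfg(not(feature = "unsafe-single-threaded"))]
pub use spin::Mutex;

#[cfg(feature = "unsafe-single-threaded")]
pub use nospin::Mutex; // assumes nospin exposes a spin-compatible Mutex

// Call sites stay identical either way:
// static COUNTER: Mutex<u64> = Mutex::new(0);
```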

@domenukk
Member

You could also take a look at the critical_section crate; it's more or less the default on no_std (at least on embedded), and you can also define your own synchronisation handlers (like... do nothing).

Probably quite a bit more work though
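A rough sketch of what a do-nothing backend for critical-section could look like; this assumes a configuration where RawRestoreState is () and is not taken from any actual librasan or LibAFL code:

```rust
use critical_section::RawRestoreState;

// Hypothetical single-threaded backend: acquiring the critical section does
// nothing at all, because there is no other thread (or interrupt) to exclude.
// Assumes a restore-state configuration where RawRestoreState is `()`.
struct SingleThreaded;
critical_section::set_impl!(SingleThreaded);

unsafe impl critical_section::Impl for SingleThreaded {
    unsafe fn acquire() -> RawRestoreState {
        // Nothing to disable.
    }

    unsafe fn release(_restore_state: RawRestoreState) {
        // Nothing to restore.
    }
}

// Library code then stays backend-agnostic:
// critical_section::with(|_cs| { /* touch shared state */ });
```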

@wfdewith
Contributor

I agree with @domenukk on creating one feature for unsafe-single-threaded, because in my experience, mutually exclusive features are annoying to work with and not really what they are intended for.

However, I'm going to be annoying again and ask if this really improves performance measurably. The reason I'm asking is that mutexes typically try to lock opportunistically before falling back to a more expensive syscall, or in the case of spin, before falling back to actually spinning. See here. This means that in single-threaded scenarios, you should always successfully lock the mutex through the opportunistic path, which should be extremely fast.
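For illustration, the opportunistic path being described looks roughly like this; a simplified generic spinlock sketch, not the actual spin-rs source:

```rust
use core::hint;
use core::sync::atomic::{AtomicBool, Ordering};

// Simplified spinlock acquire path: when there is no contention, the single
// compare_exchange succeeds immediately; only contended acquires spin
// (an OS mutex would make a futex/syscall on the slow path instead).
pub struct RawSpinLock {
    locked: AtomicBool,
}

impl RawSpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        loop {
            // Fast path: uncontended, this CAS succeeds on the first try.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // Slow path: wait until the lock looks free, then retry the CAS.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```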

@domenukk
Member

domenukk commented May 15, 2025

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

@wfdewith
Contributor

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

I do agree that the atomics have overhead, but consider that this code will only run in QEMU. Since QEMU JITs instructions during execution (except in system mode with KVM acceleration enabled), it can optimize memory barriers out at runtime if it is running in a single-threaded context. Looking at the source and documentation here and here, that is exactly what QEMU does.

@WorksButNotTested
Collaborator Author

There are still some atomics, so the cores cannot run free - if you know that your target is single-threaded, there's not really any reason to pay for the potential overhead (think 128 cores at 100%, etc.). However, measurements for this on real-world software are always nice to have; if it's only a few nanoseconds extra, why bother?

I do agree that the atomics have overhead, but consider that this code will only run in QEMU. Since QEMU JITs instructions during execution (except in system mode with KVM acceleration enabled), it can optimize memory barriers out at runtime if it is running in a single-threaded context. Looking at the source and documentation here and here, that is exactly what QEMU does.

Cool. I didn't know it did that. Obviously the opportunistic locking would otherwise still have a cache coherency overhead.

I'm running qemu_launcher and using top to find the busy process (as that's probably the one doing the fuzzing, right?). Looking at /proc/<pid>/task, it looks like there's a second thread though?!?

@WorksButNotTested
Collaborator Author

Also interestingly, if a process creates a second thread, then the TCG is flushed. That could cause performance overhead for fork-server implementations?

@domenukk
Member

Also interestingly, if a process creates a second thread, then the TCG is flushed. That could cause performance overhead for fork-server implementations?

CC @rmalmain

@WorksButNotTested
Collaborator Author

Seems we don't benefit from QEMU dropping the memory barriers right now, as the process is multi-threaded (that needs investigating). But the PR may still be useful if librasan is to be reused in FRIDA, or otherwise outside of QEMU.

@WorksButNotTested
Collaborator Author

WorksButNotTested commented May 19, 2025

Seems if the process makes any shared mappings, then the optimizations have to be disabled too (which makes sense). https://github.com/qemu/qemu/blob/757a34115e7491744a63dfc3d291fd1de5297ee2/linux-user/mmap.c#L1011. So this optimization still stands, as presumably this is how the fuzzing process communicates with the fuzzer itself?

@wfdewith
Contributor

Seems if the process makes any shared mappings, then the optimizations have to be disabled too (which makes sense). https://github.com/qemu/qemu/blob/757a34115e7491744a63dfc3d291fd1de5297ee2/linux-user/mmap.c#L1011. So this optimization still stands, as presumably this is how the fuzzing process communicates with the fuzzer itself?

Well spotted! The only remaining question for me, then, is what the overhead of the single atomic operation in spin-rs would be: https://github.com/zesterer/spin-rs/blob/2798933e6842968e44e23fa35d283caea4d3517c/src/mutex/spin.rs#L253. I imagine that it could be quite significant, as the mutexes are locked for most memory accesses.
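One way to put a rough number on that uncontended cost is a host-side micro-benchmark like the sketch below (purely illustrative; the interesting numbers would have to come from the target running under QEMU's TCG):

```rust
use std::time::Instant;

fn main() {
    // Rough host-side micro-benchmark of an uncontended spin::Mutex
    // lock/unlock pair; numbers under QEMU will differ.
    let counter = spin::Mutex::new(0u64);
    const ITERS: u64 = 100_000_000;

    let start = Instant::now();
    for _ in 0..ITERS {
        *counter.lock() += 1; // one compare_exchange plus one release store
    }
    let elapsed = start.elapsed();

    assert_eq!(*counter.lock(), ITERS);
    println!(
        "uncontended lock/unlock: {:.2} ns",
        elapsed.as_nanos() as f64 / ITERS as f64
    );
}
```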

@WorksButNotTested
Collaborator Author

I'd expect that it must be noticeable; otherwise, QEMU wouldn't have gone to the effort of removing them. I'd expect it would be worse as you run more parallel instances?

@WorksButNotTested
Collaborator Author

I'm not sure this adds much performance, so I'll close it for now.
