Date: Thu, 4 Apr 2024 09:39:48 +1300
From: Michael Clark <michael@...aparadigm.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Jens Axboe <axboe@...nel.dk>, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: user-space concurrent pipe buffer scheduler interactions

On 4/4/24 05:56, Linus Torvalds wrote:
> On Tue, 2 Apr 2024 at 13:54, Michael Clark <michael@...aparadigm.com> wrote:
>>
>> I am working on a low latency cross-platform concurrent pipe buffer
>> using C11 threads and atomics.
> 
> You will never get good performance doing spinlocks in user space
> unless you actually tell the scheduler about the spinlocks, and have
> some way to actually sleep on contention.
> 
> Which I don't see you as having.

We can work on this.

So maybe it is possible to look at how many LOCK instructions were 
retired in the last scheduler quantum, ideally split into 
retired-success and retired-failed for interlocked compare-and-swap. 
Maybe it is just a performance counter and doesn't require perf 
tracing to be switched on?
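
As a rough sketch, that could look like perf_event_open over a raw 
event. The 0x0 config below is a placeholder, not a real event code; 
a real counter would need a model-specific code for something like 
"locked CAS retired", if the part exposes one at all:

/* rough sketch: count a raw hardware event around a critical section.
 * attr.config = 0x0 is a placeholder, not a real event code. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x0;            /* placeholder event code */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0 (self), cpu = -1 (any), group_fd = -1, flags = 0 */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... contended section under test ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("events retired: %llu\n", (unsigned long long)count);
    return 0;
}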

Then you can probably make a queue of processes in lock contention, 
but the hard part is deducing who had contention with whom. I will 
need to think about this for a while. We know the latency when things 
are not contended, because these critical sections are usually small: 
about 200-400ns, and you can get these numbers in a loop at boot up.
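
For example, the uncontended number could come from a calibration loop 
as simple as this C11 sketch (single-threaded, so every CAS succeeds; 
it times the raw LOCK CMPXCHG rather than a whole critical section):

/* minimal C11 sketch: per-op latency of an uncontended CAS loop */
#include <stdio.h>
#include <stdatomic.h>
#include <time.h>

int main(void)
{
    enum { N = 1000000 };
    _Atomic long word = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        long expect = i;
        /* compiles to LOCK CMPXCHG on x86; always succeeds here */
        atomic_compare_exchange_strong(&word, &expect, i + 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per uncontended CAS\n", ns / N);
    return 0;
}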

But I don't know how we can see spinning on acquires. It makes me 
think that the early relaxed/acquire comparison before the LOCK op is 
bad. It bought me a very minor performance boost, but it would break 
the strategy I just mentioned, because a spinning waiter wouldn't have 
a LOCK CMPXCHG in its spin loop. Without the early check we would know 
for certain that a spinning process had a failed LOCK CMPXCHG.

So I would need to delete this line and other lines like it:

https://github.com/michaeljclark/cpipe/blob/13c0ad1a865b9cc0174fc8f61d76f37bdbf11d4d/include/buffer.h#L317
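
For illustration, the two loop shapes look roughly like this minimal 
C11 sketch (not the actual buffer.h code):

/* lock_ttas spins on plain loads between CAS attempts, so a waiter
 * retires almost no LOCK ops; lock_tas retires a LOCK CMPXCHG on
 * every iteration, so a failed CAS is visible to a counter. */
#include <stdatomic.h>

static void lock_ttas(atomic_int *l)    /* with the early check */
{
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;   /* spinning here is invisible to a LOCK counter */
        int expect = 0;
        if (atomic_compare_exchange_weak_explicit(l, &expect, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
    }
}

static void lock_tas(atomic_int *l)     /* early check deleted */
{
    int expect;
    do {
        expect = 0;   /* every iteration retires a LOCK CMPXCHG */
    } while (!atomic_compare_exchange_weak_explicit(l, &expect, 1,
                memory_order_acquire, memory_order_relaxed));
}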

I also want a user-space wrapper for futexes: a waitlist_t that 
rechecks conditions and falls back to cond_timeout on old broken POSIX 
systems, so that we won't deadlock due to a missed wake-up. FreeBSD, 
macOS and Windows are starting to look like they might have something 
we can use.
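
Roughly this shape, sketched against plain pthreads (waitlist_t, 
waitlist_wait and waitlist_wake are names I am making up here, and the 
1ms timeout is arbitrary):

/* sketch: recheck the condition under the mutex and bound every sleep
 * with pthread_cond_timedwait, so a lost wake-up costs at most one
 * timeout instead of a deadlock. initialize the fields with
 * PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. */
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
} waitlist_t;

/* block until *addr == value */
static void waitlist_wait(waitlist_t *w, atomic_int *addr, int value)
{
    struct timespec ts;
    pthread_mutex_lock(&w->mutex);
    while (atomic_load_explicit(addr, memory_order_acquire) != value) {
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_nsec += 1000000;    /* 1ms bound on a lost wake-up */
        if (ts.tv_nsec >= 1000000000) {
            ts.tv_sec += 1;
            ts.tv_nsec -= 1000000000;
        }
        pthread_cond_timedwait(&w->cond, &w->mutex, &ts);
    }
    pthread_mutex_unlock(&w->mutex);
}

/* call after storing the new value to *addr */
static void waitlist_wake(waitlist_t *w)
{
    pthread_mutex_lock(&w->mutex);
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}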

WaitOnAddress on Windows has compare-not-equals semantics and 
supports 8-, 16-, 32-, and 64-bit words. But when it is used to 
construct equals, which is what I need, or less-than or greater-than, 
it could suffer from thundering herd if used in a decentralized way in 
user-space. Maybe we would need a per-address waiter list stashed in a 
CPU struct for the lead waiter, then centrally recheck the condition 
and, when appropriate, reschedule those sleeping in the queue for 
events on that address?
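
For example, building wait-until-equals out of WaitOnAddress needs a 
recheck loop like this sketch (link with Synchronization.lib). Every 
waiter that wakes on a change rechecks and may go back to sleep, which 
is the herd I mean:

/* sketch: WaitOnAddress sleeps while *addr still equals the captured
 * value, so wait-until-equals is a recheck loop around it. */
#include <windows.h>

static void wait_until_equals(volatile LONG *addr, LONG desired)
{
    LONG observed = *addr;
    while (observed != desired) {
        /* returns when *addr != observed (or spuriously) */
        WaitOnAddress((volatile VOID *)addr, &observed,
                      sizeof(observed), INFINITE);
        observed = *addr;
    }
}

/* a writer publishes a new value and wakes all waiters on the address */
static void publish(volatile LONG *addr, LONG value)
{
    InterlockedExchange(addr, value);
    WakeByAddressAll((PVOID)addr);
}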

Sorry I don't know how the Linux scheduler and futexes work internally.
I just want to use this stuff in user-space. I want a POSIX waitlist_t.

I am working on a tiny emulator for the Windows Hypervisor, which is 
how I justify the present "embedded" version that spins. This pipe 
buffer is for a tiny test kernel, to get rid of a janky lock around 
printf.

- https://github.com/michaeljclark/emu

Thanks,
Michael.
