Message-ID: <1a116a43-fd2e-4f03-8a17-75816fc62717@metaparadigm.com>
Date: Thu, 4 Apr 2024 09:39:48 +1300
From: Michael Clark <michael@...aparadigm.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Jens Axboe <axboe@...nel.dk>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: user-space concurrent pipe buffer scheduler interactions
On 4/4/24 05:56, Linus Torvalds wrote:
> On Tue, 2 Apr 2024 at 13:54, Michael Clark <michael@...aparadigm.com> wrote:
>>
>> I am working on a low latency cross-platform concurrent pipe buffer
>> using C11 threads and atomics.
>
> You will never get good performance doing spinlocks in user space
> unless you actually tell the scheduler about the spinlocks, and have
> some way to actually sleep on contention.
>
> Which I don't see you as having.
We can work on this.
So maybe it is possible to look at how many LOCK instructions were
retired in the last scheduler quantum, ideally with separate
retired-success and retired-failure counts for interlocked
compare-and-swap. Maybe it is just a performance counter and doesn't
require perf tracing to be switched on?
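Something like this untested sketch is what I have in mind. The raw
config 0x21d0 is my guess at MEM_INST_RETIRED.LOCK_LOADS on recent
Intel cores; AMD and older parts would need a different event code:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x21d0;   /* assumed: MEM_INST_RETIRED.LOCK_LOADS */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* count for the calling process (pid 0) on any CPU */
        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the contended critical section here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("locked loads retired: %llu\n", (unsigned long long)count);
        return 0;
    }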
Then you could probably build a queue of processes in lock contention,
but the hard part is deducing who had contention with whom. I will need
to think about this for a while. We know the latency when things are
not contended, because these critical sections are usually small. It's
about ~200-400ns, and you can get these numbers in a loop at boot-up.
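A minimal C11 loop like this could calibrate the uncontended CAS cost
at boot (iteration count and single-threaded setup are placeholders;
the CAS always succeeds here, so it times the fast path only):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        _Atomic long word = 0;
        enum { iters = 1000000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            long expect = i;
            /* uncontended LOCK CMPXCHG on x86 */
            atomic_compare_exchange_strong(&word, &expect, i + 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("uncontended CAS: %.1f ns/op\n", ns / iters);
        return 0;
    }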
But I don't know how we can see spinning on acquires. It makes me think
that the early relaxed/acquire comparison before the LOCK op is bad. It
got me a very minor performance boost, but it would break the strategy
I just mentioned, because we wouldn't have a LOCK CMPXCHG in our spin
loop. Without the early comparison, we would know for certain *that*
process had a failed LOCK CMPXCHG.
So I would need to delete this line and other lines like it:
https://github.com/michaeljclark/cpipe/blob/13c0ad1a865b9cc0174fc8f61d76f37bdbf11d4d/include/buffer.h#L317
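For context, the pattern is roughly this (a paraphrase of the shape of
the loop, not the exact code in buffer.h):

    #include <stdatomic.h>

    static inline void spin_acquire(_Atomic int *lock)
    {
        for (;;) {
            /* the early relaxed comparison I would delete: it avoids
             * issuing LOCK CMPXCHG while the lock is visibly held, but
             * it also hides failed CAS attempts from any counter of
             * retired LOCK instructions */
            if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                continue;
            int expect = 0;
            if (atomic_compare_exchange_weak_explicit(lock, &expect, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }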
I also want a user-space wrapper for futexes, a waitlist_t that
rechecks conditions and falls back to a timed condition wait
(pthread_cond_timedwait) on old broken POSIX systems, so that we won't
deadlock due to a missed wake-up. FreeBSD, macOS, and Windows are
starting to look like they might have something we can use.
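Something like this rough sketch is what I mean by waitlist_t. The 1ms
recheck interval is an arbitrary choice, and a real version would sit
directly on futexes where they exist:

    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
    } waitlist_t;

    #define WAITLIST_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }

    /* block until pred(arg) is true; the timed wait papers over a
     * missed wake-up on systems with broken condvar signalling */
    void waitlist_wait(waitlist_t *wl, bool (*pred)(void *), void *arg)
    {
        pthread_mutex_lock(&wl->lock);
        while (!pred(arg)) {
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_nsec += 1000000;      /* 1ms recheck interval */
            if (ts.tv_nsec >= 1000000000) {
                ts.tv_sec += 1;
                ts.tv_nsec -= 1000000000;
            }
            pthread_cond_timedwait(&wl->cond, &wl->lock, &ts);
        }
        pthread_mutex_unlock(&wl->lock);
    }

    void waitlist_wake(waitlist_t *wl)
    {
        pthread_mutex_lock(&wl->lock);
        pthread_cond_broadcast(&wl->cond);
        pthread_mutex_unlock(&wl->lock);
    }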
WaitOnAddress in Windows has compare-not-equal semantics and supports
8-, 16-, 32-, and 64-bit words. But when it is used to construct
wait-for-equals, which is what I need, or less-than or greater-than, it
could suffer from thundering herd if used in a decentralized way in
user space. Maybe we would need a per-address waiter list, with the
address stashed in a CPU struct for the lead waiter, so that the
condition can be rechecked centrally and those sleeping in the queue
rescheduled when events arrive on that address?
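For example, wait-for-equals has to be built as a loop over the
not-equal primitive, so every wake on the word wakes every waiter, even
those waiting for other values (untested sketch, shape of the problem):

    #include <windows.h>
    #pragma comment(lib, "Synchronization.lib")

    /* sleep until *addr == want, built on WaitOnAddress, which only
     * gives us "sleep while *addr still equals the captured value" */
    void wait_for_equals(volatile LONG *addr, LONG want)
    {
        for (;;) {
            LONG seen = *addr;
            if (seen == want)
                return;
            /* every WakeByAddressAll on addr wakes all of these
             * waiters, regardless of the value they want */
            WaitOnAddress((volatile VOID *)addr, &seen, sizeof(seen),
                          INFINITE);
        }
    }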
Sorry, I don't know how the Linux scheduler and futexes work
internally. I just want to use this stuff in user space. I want a POSIX
waitlist_t.
I am working on a tiny emulator for the Windows Hypervisor, which is
how I justify the present "embedded" version that spins. This pipe
buffer is for a tiny test kernel, to get rid of a janky lock around
printf.
- https://github.com/michaeljclark/emu
Thanks,
Michael.