Message-ID: <ab59863f-25f0-4635-8408-4aaec39ec6c2@igalia.com>
Date: Thu, 28 Sep 2023 17:05:59 +0200
From: André Almeida <andrealmeid@...lia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
"H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
David Laight <David.Laight@...LAB.COM>,
Christian Brauner <brauner@...nel.org>,
Florian Weimer <fw@...eb.enyo.de>,
"carlos@...hat.com" <carlos@...hat.com>,
Peter Oskolkov <posk@...k.io>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
'Peter Zijlstra' <peterz@...radead.org>,
Chris Kennelly <ckennelly@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>,
"libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
Steven Rostedt <rostedt@...dmis.org>,
Jonathan Corbet <corbet@....net>,
Noah Goldstein <goldstein.w.n@...il.com>,
Daniel Colascione <dancol@...gle.com>,
"longman@...hat.com" <longman@...hat.com>,
Florian Weimer <fweimer@...hat.com>
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptive mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state
>>>> of the
>>>> mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detrimental.
>
> That's a good question. What should we do in these three situations
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction.
> But then if other threads are queued on sys_futex and did not manage
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here as logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to
> grab the lock with an atomic, or we go through sys_futex nevertheless
> to allow futex to guarantee some fairness across threads.
About the fairness part:
Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of the wakeups. The current implementation
tries to be fair, but I don't think it works for every case. I wouldn't
be too concerned about fairness here, given that it's an inherent
limitation of futexes anyway.
From the man page:
"No guarantee is provided about which waiters are awoken"
>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much
> required.
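
To make the trade-off concrete, here is a rough sketch of what an
acquire path covering those three cases could look like. To be clear,
this is only an illustration: the struct layout, the
RSEQ_SCHED_STATE_ON_CPU flag name and the way the owner publishes its
sched_state pointer are made up based on this RFC, the syscall wrapper
is simplified, and the unlock/wake side is omitted.

#include <stdatomic.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical layout, loosely modeled on this RFC; not a stable ABI. */
struct rseq_sched_state {
	uint32_t state;	/* RSEQ_SCHED_STATE_ON_CPU set while the thread runs */
};
#define RSEQ_SCHED_STATE_ON_CPU	(1U << 0)

struct adaptive_lock {
	_Atomic uint32_t owner_tid;		/* 0 == unlocked */
	struct rseq_sched_state *owner_state;	/* published by the owner */
};

static int futex_wait(_Atomic uint32_t *addr, uint32_t val)
{
	return syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void adaptive_lock_acquire(struct adaptive_lock *lock, uint32_t self_tid)
{
	for (;;) {
		uint32_t owner = 0;

		/* 1) No owner: grab the lock with an atomic. */
		if (atomic_compare_exchange_strong(&lock->owner_tid, &owner,
						   self_tid))
			return;

		/* 2) Owner is on-cpu (hint only, may be stale): busy-wait. */
		if (lock->owner_state &&
		    (lock->owner_state->state & RSEQ_SCHED_STATE_ON_CPU))
			continue;

		/* 3) Owner is (probably) sleeping: release the CPU. */
		futex_wait(&lock->owner_tid, owner);
	}
}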
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be
> so bad? I wonder if we could get away with skipping futex entirely in
> this scenario and still guarantee fairness by implementing MCS locking
> or ticket locks in userspace. Basically, if userspace queues itself on
> the lock through either MCS locking or ticket locks, it could
> guarantee fairness on its own.
>
> Of course things are more complicated with PI-futex; is that what you
> have in mind?
>
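
Regarding the ticket lock idea: the userspace side is indeed small. A
minimal sketch of the idea (again just an illustration, with made-up
names and a plain broadcast wake on unlock rather than anything
clever):

#include <stdatomic.h>
#include <stdint.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct ticket_lock {
	_Atomic uint32_t next;		/* next ticket to hand out */
	_Atomic uint32_t serving;	/* ticket currently allowed in */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	uint32_t me = atomic_fetch_add(&l->next, 1);
	uint32_t cur;

	while ((cur = atomic_load(&l->serving)) != me) {
		/* This is where the on-cpu hint of the current holder could
		 * decide between spinning and sleeping. */
		syscall(SYS_futex, &l->serving, FUTEX_WAIT, cur,
			NULL, NULL, 0);
	}
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->serving, 1);
	/* Wake everyone; only the thread holding the next ticket proceeds,
	 * so the wakeup order is FIFO by construction. */
	syscall(SYS_futex, &l->serving, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}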
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache,
>>>> and therefore the scheduler may very well fail to clear the on-cpu
>>>> state on preemption. This is expected to be rare though, and is
>>>> resolved as soon as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution,
>>>> preventing statistical guesses from doing a good job at choosing
>>>> between busy-wait and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation', which is what makes system calls
>> expensive, is only a compile-time option, so it is likely to be
>> enabled on any 'distro' kernel.
>> But a lot of other mitigations (eg RSB stuffing) are also pretty
>> detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose
>> whatever. All that needs to happen is for an ethernet interrupt to
>> decide to discard completed transmits and refill the rx ring, and
>> then for the softint code to free a load of stuff deferred by rcu
>> while you've grabbed the mutex. No matter how short the user-space
>> code path is, the mutex won't be released for absolutely ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring mutexes.
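
For what it's worth, the pattern David describes is basically replacing
"lock; update a shared structure; unlock" with lock-free per-slot
updates, so nobody can stall behind a holder that got preempted. A
made-up example:

#include <stdatomic.h>
#include <stdint.h>

#define NSTATS 64

/* One atomic slot per event instead of a mutex-protected structure. */
static _Atomic uint64_t stats[NSTATS];

static inline void stat_inc(unsigned int idx)
{
	/* Never blocks, even if another writer was just preempted. */
	atomic_fetch_add_explicit(&stats[idx], 1, memory_order_relaxed);
}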
>
> That's good input, thanks! I mostly defer to André Almeida on the
> use-case motivation. I provided this POC patch to show that it
> _can_ be done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>> David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
>> MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>>
>