Message-ID: <ab59863f-25f0-4635-8408-4aaec39ec6c2@igalia.com>
Date: Thu, 28 Sep 2023 17:05:59 +0200
From: André Almeida <andrealmeid@...lia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
"H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
David Laight <David.Laight@...LAB.COM>,
Christian Brauner <brauner@...nel.org>,
Florian Weimer <fw@...eb.enyo.de>,
"carlos@...hat.com" <carlos@...hat.com>,
Peter Oskolkov <posk@...k.io>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
'Peter Zijlstra' <peterz@...radead.org>,
Chris Kennelly <ckennelly@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>,
"libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
Steven Rostedt <rostedt@...dmis.org>,
Jonathan Corbet <corbet@....net>,
Noah Goldstein <goldstein.w.n@...il.com>,
Daniel Colascione <dancol@...gle.com>,
"longman@...hat.com" <longman@...hat.com>,
Florian Weimer <fweimer@...hat.com>
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptive mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state
>>>> of the
>>>> mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detrimental.
>
> That's a good question. What should we do in these three situations
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction.
> But then if other threads are queued on sys_futex and did not manage
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here as logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to
> grab the lock with an atomic, or we go through sys_futex nevertheless
> to allow futex to guarantee some fairness across threads.
About the fairness part:
Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of the wakeups. The current implementation
tries to be fair, but I don't think it works for every case. I wouldn't
be too concerned about fairness here, given that it's an inherent
limitation of futexes anyway.
From the man page:
"No guarantee is provided about which waiters are awoken"
>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much
> required.
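
To make the trade-off concrete, here is a rough sketch of what an
acquire path covering those three cases could look like. To be clear,
this is only an illustration: the struct layout, the
RSEQ_SCHED_STATE_ON_CPU flag name and the way the owner publishes its
sched_state pointer are made up based on this RFC, the syscall wrapper
is simplified, and the unlock/wake side is omitted.

#include <stdatomic.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical layout, loosely modeled on this RFC; not a stable ABI. */
struct rseq_sched_state {
	uint32_t state;	/* RSEQ_SCHED_STATE_ON_CPU set while the thread runs */
};
#define RSEQ_SCHED_STATE_ON_CPU	(1U << 0)

struct adaptive_lock {
	_Atomic uint32_t owner_tid;		/* 0 == unlocked */
	struct rseq_sched_state *owner_state;	/* published by the owner */
};

static int futex_wait(_Atomic uint32_t *addr, uint32_t val)
{
	return syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void adaptive_lock_acquire(struct adaptive_lock *lock, uint32_t self_tid)
{
	for (;;) {
		uint32_t owner = 0;

		/* 1) No owner: grab the lock with an atomic. */
		if (atomic_compare_exchange_strong(&lock->owner_tid, &owner,
						   self_tid))
			return;

		/* 2) Owner is on-cpu (hint only, may be stale): busy-wait. */
		if (lock->owner_state &&
		    (lock->owner_state->state & RSEQ_SCHED_STATE_ON_CPU))
			continue;

		/* 3) Owner is (probably) sleeping: release the CPU. */
		futex_wait(&lock->owner_tid, owner);
	}
}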
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be
> so bad? I wonder if we could get away with skipping futex entirely in
> this scenario and still guarantee fairness by implementing MCS locking
> or ticket locks in userspace. Basically, if userspace queues itself on
> the lock through either MCS locking or ticket locks, it could
> guarantee fairness on its own.
>
> Of course things are more complicated with PI-futex; is that what you
> have in mind?
>
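
Regarding the ticket lock idea: the userspace side is indeed small. A
minimal sketch of the idea (again just an illustration, with made-up
names and a plain broadcast wake on unlock rather than anything
clever):

#include <stdatomic.h>
#include <stdint.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct ticket_lock {
	_Atomic uint32_t next;		/* next ticket to hand out */
	_Atomic uint32_t serving;	/* ticket currently allowed in */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	uint32_t me = atomic_fetch_add(&l->next, 1);
	uint32_t cur;

	while ((cur = atomic_load(&l->serving)) != me) {
		/* This is where the on-cpu hint of the current holder could
		 * decide between spinning and sleeping. */
		syscall(SYS_futex, &l->serving, FUTEX_WAIT, cur,
			NULL, NULL, 0);
	}
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->serving, 1);
	/* Wake everyone; only the thread holding the next ticket proceeds,
	 * so the wakeup order is FIFO by construction. */
	syscall(SYS_futex, &l->serving, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}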
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache,
>>>> and therefore the scheduler may very well fail to clear the on-cpu
>>>> state on preemption. This is expected to be rare though, and is
>>>> resolved as soon as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution,
>>>> preventing statistical guesses from doing a good job at choosing
>>>> between busy-wait and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation', which is what makes system calls
>> expensive, is only a compile-time option, so it is likely to be
>> enabled on any 'distro' kernel.
>> But a lot of other mitigations (eg RSB stuffing) are also pretty
>> detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose
>> whatever. All that needs to happen is for an ethernet interrupt to
>> decide to discard completed transmits and refill the rx ring, and
>> then for the softint code to free a load of stuff deferred by rcu
>> while you've grabbed the mutex. No matter how short the user-space
>> code path is, the mutex won't be released for absolutely ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring mutexes.
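
For what it's worth, the pattern David describes is basically replacing
"lock; update a shared structure; unlock" with lock-free per-slot
updates, so nobody can stall behind a holder that got preempted. A
made-up example:

#include <stdatomic.h>
#include <stdint.h>

#define NSTATS 64

/* One atomic slot per event instead of a mutex-protected structure. */
static _Atomic uint64_t stats[NSTATS];

static inline void stat_inc(unsigned int idx)
{
	/* Never blocks, even if another writer was just preempted. */
	atomic_fetch_add_explicit(&stats[idx], 1, memory_order_relaxed);
}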
>
> That's good input, thanks! I mostly defer to André Almeida on the
> use-case motivation. I provided this POC patch to show that it
> _can_ be done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>> David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
>> MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>>
>