Message-ID: <06ee47e0-99e0-4b6a-ab67-239fccf2777d@efficios.com>
Date: Fri, 19 May 2023 10:15:11 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Boqun Feng <boqun.feng@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
"H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api@...r.kernel.org, Christian Brauner <brauner@...nel.org>,
Florian Weimer <fw@...eb.enyo.de>, David.Laight@...lab.com,
carlos@...hat.com, Peter Oskolkov <posk@...k.io>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
Chris Kennelly <ckennelly@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>,
André Almeida <andrealmeid@...lia.com>,
libc-alpha@...rceware.org, Steven Rostedt <rostedt@...dmis.org>,
Jonathan Corbet <corbet@....net>,
Florian Weimer <fweimer@...hat.com>
Subject: Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
On 2023-05-18 17:49, Boqun Feng wrote:
> On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
[...]
>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>> index c233aae5eac9..c6d8537e23ca 100644
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>> };
>>
>> +enum rseq_sched_state {
>> + /*
>> + * Task is currently running on a CPU if bit is set.
>> + */
>> + RSEQ_SCHED_STATE_ON_CPU = (1U << 0),
>> +};
[...]
>>
>> + /*
>> + * Restartable sequences sched_state field. Updated by the kernel. Read
>> + * by user-space with single-copy atomicity semantics. This field can
>> + * be read by any userspace thread. Aligned on 32-bit. Contains a
>
> Maybe this is a premature optimization, but since most of the time the
> bit would be read by another thread, does it make sense putting the
> "sched_state" into a different cache line to avoid false sharing?
I'm puzzled by your optimization proposal, so I'll say it outright: I'm
probably missing something.
I agree that false sharing would be an issue if various threads were
contending to update fields within this cache line.
But the only thread responsible for updating this cache line's fields is
the current thread, either from userspace (stores to rseq_abi->rseq_cs)
or from the kernel (usually on return to userspace, except for the new
ON_CPU bit, which is cleared on context switch).
The other threads busy-waiting on the content of this sched_state field
will only load from it, never store. And they will only busy-wait on it
as long as the current task runs. When that task gets preempted, other
threads will notice the flag change and use sys_futex instead.
So the very worst pattern I can think of for causing cache-coherency
traffic due to false sharing is if the lock owner happens to have lots
of rseq critical sections as well, causing it to repeatedly store to
the rseq_abi->rseq_cs field, which is in the same cache line.
But even then I'm wondering if this really matters, because each of
those stores to rseq_cs would only slow down loads from other threads,
which will need to retry the busy-wait anyway if the on-cpu flag is
still set.
So, what am I missing? Is this heavy use of rseq critical sections
while the lock is held the scenario you are concerned about?
Note that the heavy cache-line bouncing in my test-case happens on the
lock structure (cmpxchg expecting NULL, setting the current thread
rseq_get_abi() pointer on success). There are probably better ways to
implement that part, it is currently just a simple prototype showcasing
the approach.
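For completeness, a minimal sketch of that acquisition step
(hypothetical lock structure name; rseq_get_abi() is the librseq
accessor returning a pointer to the current thread's rseq area, whose
type librseq spells struct rseq_abi):

#include <stdbool.h>
#include <rseq/rseq.h>

/*
 * The lock word holds NULL when free, or the owner's rseq pointer when
 * held, so waiters know whose sched_state to busy-wait on.
 */
struct adaptive_lock {
	struct rseq_abi *owner;
};

static bool adaptive_lock_try_acquire(struct adaptive_lock *lock)
{
	struct rseq_abi *expected = NULL;

	/*
	 * cmpxchg expecting NULL; publish our rseq pointer on success.
	 * This shared-word cmpxchg is where the bouncing comes from.
	 */
	return __atomic_compare_exchange_n(&lock->owner, &expected,
					   rseq_get_abi(), false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}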
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com