linux-kernel - Re: [RFC PATCH v2 1/4] rseq: Add sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b0416e8c-9b8f-9b25-dd0c-3b7882e5746f@efficios.com>
Date:   Tue, 30 May 2023 10:25:57 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Florian Weimer <fweimer@...hat.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        "H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
        linux-api@...r.kernel.org, Christian Brauner <brauner@...nel.org>,
        David.Laight@...LAB.COM, carlos@...hat.com,
        Peter Oskolkov <posk@...k.io>,
        Alexander Mikhalitsyn <alexander@...alicyn.com>,
        Chris Kennelly <ckennelly@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Darren Hart <dvhart@...radead.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        André Almeida <andrealmeid@...lia.com>,
        libc-alpha@...rceware.org, Steven Rostedt <rostedt@...dmis.org>,
        Jonathan Corbet <corbet@....net>,
        Noah Goldstein <goldstein.w.n@...il.com>, longman@...hat.com
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq

On 5/30/23 04:20, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>>> I don't see why we can't stick this directly into struct rseq because
>>> it's all public anyway.
>>
>> The motivation for moving this to a different cache line is to handle
>> the prior comment from Boqun, who is concerned that busy-waiting
>> repeatedly loading a field from struct rseq will cause false-sharing
>> and make other stores to that cache line slower, especially stores to
>> rseq_cs to begin rseq critical sections, thus slightly increasing the
>> overhead of rseq critical sections taken while mutexes are held.
> 
> Hmm.  For context, in glibc, we have to place struct rseq on a fixed
> offset from the start of a page (or even some larger alignment) for all
> threads.  In the future (once we move the thread control block off the
> top of the userspace stack, where it resides since the LinuxThreads
> days), it is likely that the pointer difference between different
> threads will also be a multiple of a fairly large power of two
> (something like 2**20 might be common).  Maybe this will make caching
> even more difficult?
> 
>> If we want to embed this field into struct rseq with its own cache
>> line, then we need to add a lot of padding, which is inconvenient.
>>
>> That being said, perhaps this is premature optimization, what do you
>> think ?
> 
> Maybe?  I don't know how the access patterns will look like.  But I
> suspect that once we hit this case, performance will not be great
> anyway, so the optimization is perhaps unnecessary?

What I dislike though is that contention for any lock which busy-waits 
on the rseq sched_state would slow down all rseq critical sections of 
that thread, which is a side-effect we want to avoid.

I've done some more additional benchmarks on my 8-core AMD laptop, and I 
notice that things get especially bad whenever the store to 
rseq_abi->rseq_cs is surrounded by other instructions that need to be 
ordered with that store, e.g. a for loop doing 10 stores to a local 
variables. If it's surrounded by instructions that don't need to be 
ordered wrt that store (e.g. a for loop of 10 iterations issuing 
barrier() "memory" asm clobbers), then the overhead cannot be noticed 
anymore.

> 
> The challenge is that once we put stuff at fixed offsets, we can't
> transparently fix it later.  It would need more auxv entries with
> further offsets, or accessing this data through some indirection,
> perhaps via vDSO helpers.

Perhaps this is more flexibility/complexity than we really need. One 
possible approach would be to split struct rseq into sub-structures, e.g.:

rseq_len = overall size of all sub-structures.
auxv AT_RSEQ_ALIGN = 256

auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq,
                             at most 256 bytes, meant to contain fields
                             stored/loaded from the thread doing the
                             registration.
auxv AT_RSEQ_SHARED_FEATURE_SIZE =
                             size of 2nd portion of struct rseq,
                             starts at offset 256, at most 256 bytes,
                             meant to contain fields stored/loaded by
                             any thread.

Then we have this layout:

struct rseq {
   struct rseq_local {
     /* Fields accessed from local thread. */

   } __attribute__((aligned((256));
   struct rseq_shared {
     /* Shared fields. */

   } __attribute__((aligned(256));
} __attribute__((aligned(256));

And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes
(32 * u64), we can still extend with a new auxv entry after the "shared"
features.


> 
>>> The TID field would be useful in its own right.
>>
>> Indeed, good point.
>>
>> While we are there, I wonder if we should use the thread_pointer() as
>> lock identifier, or if the address of struct rseq is fine ?
> 
> Hard to tell until we'll see what the futex integration looks like, I
> think.

Good point. I can choose one way or another for the prototype, and then 
we'll see how things go with futex integration.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com