Message-ID: <b0416e8c-9b8f-9b25-dd0c-3b7882e5746f@efficios.com>
Date: Tue, 30 May 2023 10:25:57 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Florian Weimer <fweimer@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
"H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api@...r.kernel.org, Christian Brauner <brauner@...nel.org>,
David.Laight@...LAB.COM, carlos@...hat.com,
Peter Oskolkov <posk@...k.io>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
Chris Kennelly <ckennelly@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>,
André Almeida <andrealmeid@...lia.com>,
libc-alpha@...rceware.org, Steven Rostedt <rostedt@...dmis.org>,
Jonathan Corbet <corbet@....net>,
Noah Goldstein <goldstein.w.n@...il.com>, longman@...hat.com
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
On 5/30/23 04:20, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>>> I don't see why we can't stick this directly into struct rseq because
>>> it's all public anyway.
>>
>> The motivation for moving this to a different cache line is to handle
>> the prior comment from Boqun, who is concerned that busy-waiting
>> repeatedly loading a field from struct rseq will cause false-sharing
>> and make other stores to that cache line slower, especially stores to
>> rseq_cs to begin rseq critical sections, thus slightly increasing the
>> overhead of rseq critical sections taken while mutexes are held.
>
> Hmm. For context, in glibc, we have to place struct rseq at a fixed
> offset from the start of a page (or even some larger alignment) for all
> threads. In the future (once we move the thread control block off the
> top of the userspace stack, where it has resided since the LinuxThreads
> days), it is likely that the pointer difference between different
> threads will also be a multiple of a fairly large power of two
> (something like 2**20 might be common). Maybe this will make caching
> even more difficult?
>
>> If we want to embed this field into struct rseq with its own cache
>> line, then we need to add a lot of padding, which is inconvenient.
>>
>> That being said, perhaps this is premature optimization, what do you
>> think ?
>
> Maybe? I don't know what the access patterns will look like. But I
> suspect that once we hit this case, performance will not be great
> anyway, so the optimization is perhaps unnecessary?
What I dislike, though, is that contention for any lock which busy-waits
on the rseq sched_state would slow down all rseq critical sections of
that thread, which is a side-effect we want to avoid.
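To make the trade-off concrete, here is a minimal sketch of the kind of
adaptive lock this enables: spin only while the lock owner is on-CPU,
otherwise block. The sched_state field, the RSEQ_SCHED_STATE_ON_CPU flag
and futex_wait() are placeholders based on this RFC; the actual names
and layout may differ:

#include <stdatomic.h>
#include <stdint.h>

#define RSEQ_SCHED_STATE_ON_CPU (1U << 0)  /* hypothetical flag name */

struct rseq_sched_state {
        _Atomic uint32_t state;            /* RSEQ_SCHED_STATE_* flags */
};

struct adaptive_lock {
        _Atomic(struct rseq_sched_state *) owner;  /* NULL when unlocked */
};

extern void futex_wait(struct adaptive_lock *lock);  /* placeholder */

static void adaptive_lock_acquire(struct adaptive_lock *lock,
                                  struct rseq_sched_state *self)
{
        struct rseq_sched_state *expected = NULL;

        while (!atomic_compare_exchange_weak(&lock->owner, &expected, self)) {
                /*
                 * This repeated load of the owner's sched_state is
                 * exactly the false-sharing concern discussed above.
                 */
                if (expected &&
                    !(atomic_load(&expected->state) & RSEQ_SCHED_STATE_ON_CPU))
                        futex_wait(lock);  /* owner preempted: block */
                expected = NULL;
        }
}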
I've done some additional benchmarks on my 8-core AMD laptop, and I
notice that things get especially bad whenever the store to
rseq_abi->rseq_cs is surrounded by other instructions that need to be
ordered with that store, e.g. a for loop doing 10 stores to local
variables. If it is surrounded by instructions that do not need to be
ordered with that store (e.g. a for loop of 10 iterations issuing
barrier() "memory" asm clobbers), then the overhead is no longer
noticeable.
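In pseudo-C, the two variants look like this (illustrative sketch, not
the exact benchmark; rseq_cs_store() stands for the store to
rseq_abi->rseq_cs that begins a critical section):

extern void rseq_cs_store(void);  /* stand-in for the rseq_cs store */

static volatile int local[10];

static void variant_ordered(void)
{
        /* Real stores, which need to be ordered with the rseq_cs
         * store: here the extra overhead shows up clearly. */
        for (int i = 0; i < 10; i++)
                local[i] = i;
        rseq_cs_store();
}

static void variant_unordered(void)
{
        /* Compiler-only barriers issue no stores; the overhead is
         * no longer noticeable. */
        for (int i = 0; i < 10; i++)
                __asm__ __volatile__ ("" ::: "memory");
        rseq_cs_store();
}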
>
> The challenge is that once we put stuff at fixed offsets, we can't
> transparently fix it later. It would need more auxv entries with
> further offsets, or access to this data through some indirection,
> perhaps via vDSO helpers.
Perhaps this is more flexibility/complexity than we really need. One
possible approach would be to split struct rseq into sub-structures, e.g.:
rseq_len = overall size of all sub-structures.

auxv AT_RSEQ_ALIGN = 256
auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq,
                            at most 256 bytes, meant to contain fields
                            stored/loaded by the thread doing the
                            registration.
auxv AT_RSEQ_SHARED_FEATURE_SIZE =
                            size of second portion of struct rseq,
                            starts at offset 256, at most 256 bytes,
                            meant to contain fields stored/loaded by
                            any thread.
Then we have this layout:
struct rseq {
        struct rseq_local {
                /* Fields accessed from local thread. */
        } local __attribute__((aligned(256)));

        struct rseq_shared {
                /* Shared fields. */
        } shared __attribute__((aligned(256)));
} __attribute__((aligned(256)));
And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes
(32 * u64), we can still extend it with a new auxv entry after the
"shared" features.
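Userspace would then locate the shared portion through auxv, along
these lines (sketch: AT_RSEQ_ALIGN and AT_RSEQ_FEATURE_SIZE exist since
Linux 6.3, AT_RSEQ_SHARED_FEATURE_SIZE is only the entry proposed
above, with a made-up value):

#include <sys/auxv.h>

#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN                28  /* since Linux 6.3 */
#endif
#define AT_RSEQ_SHARED_FEATURE_SIZE  30  /* proposed; value made up */

/* Return the start of the shared sub-structure, or NULL if the
 * kernel does not provide one. */
static void *rseq_shared_area(void *rseq_abi)
{
        unsigned long align = getauxval(AT_RSEQ_ALIGN);  /* e.g. 256 */
        unsigned long shared_size = getauxval(AT_RSEQ_SHARED_FEATURE_SIZE);

        if (!align || !shared_size)
                return NULL;
        /* The shared portion starts at the second "align" slot. */
        return (char *)rseq_abi + align;
}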
>
>>> The TID field would be useful in its own right.
>>
>> Indeed, good point.
>>
>> While we are there, I wonder if we should use the thread_pointer() as
>> lock identifier, or if the address of struct rseq is fine ?
>
> Hard to tell until we'll see what the futex integration looks like, I
> think.
Good point. I can choose one way or another for the prototype, and then
we'll see how things go with futex integration.
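For illustration, the two candidate identifiers would be something like
this (sketch; thread_pointer() is glibc-internal, approximated here
with the __builtin_thread_pointer() compiler builtin):

#include <stdint.h>

struct rseq;  /* registered rseq area; layout irrelevant here */

/* Option A: identify the lock owner by its struct rseq address. */
static inline uintptr_t lock_id_from_rseq(struct rseq *rs)
{
        return (uintptr_t)rs;
}

/* Option B: identify it by the thread pointer (TLS base). */
static inline uintptr_t lock_id_from_tp(void)
{
        return (uintptr_t)__builtin_thread_pointer();
}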
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com