[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1305865358.10354.1594665620975.JavaMail.zimbra@efficios.com>
Date: Mon, 13 Jul 2020 14:40:20 -0400 (EDT)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Christian Brauner <christian.brauner@...ntu.com>
Cc: Florian Weimer <fw@...eb.enyo.de>,
Linus Torvalds <torvalds@...ux-foundation.org>,
carlos <carlos@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
linux-kernel <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
paulmck <paulmck@...ux.ibm.com>,
Boqun Feng <boqun.feng@...il.com>,
"H. Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api <linux-api@...r.kernel.org>,
Dmitry Vyukov <dvyukov@...gle.com>,
Neel Natu <neelnatu@...gle.com>
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce
RSEQ_FLAG_RELIABLE_CPU_ID
----- On Jul 11, 2020, at 11:54 AM, Christian Brauner christian.brauner@...ntu.com wrote:
>
> The registration is a thread-group property I'll assume, right, i.e. all
> threads will have rseq TLS or no thread will have it?
No, rseq registration is a per-thread property, but it would arguably be
a bit weird for a thread-group to observe different registration states
for different threads.
> Some things I seem to be able to assume (correct me if I'm wrong):
> - rseq registration is expected to be done at thread creation time
True.
> - rseq registration _should_ only be done once, i.e. if a caller detects
> that rseq is already registered for a thread, then they could
> technically re-register rseq - risking a feature mismatch if doing so
> - but they shouldn't re-register but simply assume that someone else
> is in control of rseq. If they violate that assumption than you're
> hosed anyway.
Right.
> So it seems as long as callers leave rseq registration alone whenever
> they detect that it is already registered then one can assume that rseq
> is under consistent control of a single library/program. If that's the
> case it should safe to assume that the library will use the same rseq
> registration for all threads bounded by the available kernel features or
> by the set of features it is aware of.
The rseq registration is per-thread. But yes, as soon as one user registers
rseq, other users for that thread are expected to piggy-back on that
registration.
> I proposed that specific scheme because I was under the impression that
> you are in need of a mechanism for a caller (at thread creation I
> assume) to check what feature set is supported by rseq _without_
> issung a system call. If you were to record the requested flags in
> struct rseq or in some other way, then another library which tries to
> register rseq for a thread but detects it has already been registered
> can look at e.g. whether the reliable cpu feature is around and then
> adjust it's behavior accordingly.
> Another reason why this seems worthwhile is because of how rseq works in
> general. Since it registers a piece of userspace memory which userspace
> can trivially access enforcing that userspace issue a syscall to get at
> the feature list seems odd when you can just record it in the struct.
> But that's a matter of style, I guess.
Good points.
>
> Also, I'm thinking about the case of adding one or two new features that
> introduce mutually exclusive behavior for rseq, i.e. if you register
> rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and
> FEAT2 would lead to incompatible behavior for an aspect or all of rseq.
> Even if you had a way to query the kernel for FEAT1 and FEAT2 in the
> rseq syscall it still be a problem since a caller wouldn't know at rseq
> registration time whether the library registered rseq with FEAT1 or
> FEAT2. If you record the behavior somewhere - kernel_flags or whatever -
> then the caller could check at registration time "ah, rseq is registered
> with this behavior" I need to adjust my behavior.
I think what we want here is to be able to extend the feature set, but not
"pick and choose" different incompatible features.
[...]
>>
>> One additional thing to keep in mind: the application can itself choose
>> to define the __rseq_abi TLS, which AFAIU (please let me know if I am
>> wrong) would take precedence over glibc's copy. So extending the
>
> You mean either that an application could simply choose to ignore that e.g.
> glibc has already registered rseq and e.g. unregister it and register
> it's own or that it registers it's own rseq before glibc comes into the
> play?
No quite.
I mean that an application binary or a preloaded library is allowed to
interpose with glibc and expose a __rseq_abi symbol with a size smaller
than glibc's __rseq_abi. The issue is that glibc (or other library
responsible for rseq registration) is unaware of that symbol's size.
This means that extending __rseq_abi cannot be done by means of additional
flags or parameters passed directly from the registration owner to the
rseq system call.
> I mean, if I interpreted what you're trying to say correctly, I think
> one needs to work under the assumption that if someone else has already
> registered rseq than it becomes the single source of truth. I think that
> basically needs to be the contract. Same way you expect a user of
> pthreads to not suddenly go and call clone3() with CLONE_THREAD |
> CLONE_VM | [...] and assume that this won't mess with glibc's internal
> state.
Right. The issue is not about which library owns the registration, but
rather making sure everyone agree on the size of __rseq_abi, including
interposition scenarios.
[...]
>>
>> For both approaches, we could either pass them as parameters with rseq
>> registration, and make rseq registration success conditional on the
>> kernel implementing those feature/fix-version, or validate the flag/version
>> separately from registration.
>>
>> If this is done on registration, it means glibc will eventually have to
>> handle this. This prevents user libraries with specific needs to query
>> whether their features are available. Doing the feature/version validation
>> separately from registration allows each user library to make its own
>> queries and take advantage of new kernel features before glibc is
>> upgraded to be made aware of them.
>
> Why isn't there a "dual scheme"? I.e. you record the features somewhere
> in struct rseq or some other place so userspace can query an already
> registered thread for the features it was registered with and have a way
> to query the kernel for the supported features through the system call
> (See what I suggested above with the feature checking flags.).
This discussion got my head into gears over the weekend, and I think
I came up with something that could elegantly solve all the "rseq extensibility"
problem. More below.
[...]
> I really think this is not an rseq specific problem. This seems to be a
> generic problem any interface has when it e.g. makes use of an extended
> struct and e.g. decides to make its own syscalls without going through
> the glibc wrappers (If there are any...). That's the reality right now
> and will likely always be that way short of us blocking non-libc
> syscalls similar to some bsds at which point we need a 1:1 kernel + libc
> development. :) That's not going to happen. The problem ranges from
> struct statx to struct clone_args and struct open_how and so on.
AFAIU the only "special" thing about rseq is that its __rseq_abi is
a TLS symbol shared between application/libraries, and interposition is
allowed.
>
> But one question. Musn't the assumption be that all threads in a
> thread-group if they have registered rseq then the same
> application/library has done that registration?
No, __rseq_abi is a TLS, and the registration is per-thread.
> And if that's the case
> then the application/library doing the registration is what defines the
> layout for that thread-group and becomes the one source of truth.
> Basically, if an application uses it's own rseq then glibc must be out
> of the picture. If that's not part of the contract then it feels to me
> that rseq cannot be extended (for now).
Indeed, the new scheme I have in mind for rseq extensibility would allow
new features to be used between new application/library and kernel even
with an older glibc which would contain the rseq registration code, but
be unaware of those new features.
[...]
>
> Wouldn't the non-updated application just access fields and methods of
> the smaller struct? The way struct extensions work is that we only
> extend them after the last member and always correctly 64 bit aligned.
> And as long as you only extend the struct at the end wouldn't that be
> ok? An application with a non-updated/smaller struct rseq would just
> access fields that the larger struct supports, no?
The issue is symbol interposition, as discussed above.
So here is the idea which emerged through the weekend as I was thinking
about your email:
* Current technical constraints
- struct rseq __rseq_abi can be interposed by preloaded libraries and
application, without knowledge from the registration "owner" (typically
glibc).
* Objectives:
- Allow extending the size of struct rseq to add additional features,
such as node_id field, signal-disabling fields, and so on.
- Allow extending this size without requiring an upgrade of the library
performing rseq registration. Simply upgrading the rseq "user" application
or library and the kernel should suffice.
* Proposed ABI
- Reserve a bit in the field (struct rseq *)->flags of the TLS __rseq_abi,
named e.g.: RSEQ_CS_FLAG_SIZE = (1U << 3).
- A definition wishing to extend struct rseq would be required to initialize
__rseq_abi with this bit set in the flags field.
- When RSEQ_CS_FLAG_SIZE is set, struct rseq is guaranteed to have two new
fields after the flags field: a __u32 user_size and a __u32 kernel_size field.
- The user_size field is meant to be initialized to sizeof(struct rseq) by the
__rseq_abi definition. In an interposition scenario, a kernel supporting this
size feature will know about the size of the symbol by checking both the
RSEQ_CS_FLAG_SIZE flag and the user_size field.
- On registration, if the __rseq_abi.flags RSEQ_CS_FLAG_SIZE flag is set (and this
flag is supported by the kernel), the kernel updates the kernel_size field to
min(sizeof(struct rseq), __rseq_abi.user_size), effectively the subset of size
supported by both the user-space __rseq_abi definition and by the kernel.
- Users wishing to use additional fields beyond __rseq_abi.flags would need to check
that __rseq_abi->flags & RSEQ_CS_FLAG_SIZE is true, and that
__rseq_abi.kernel_size >= offsetof(struct rseq, feature_field) + sizeof(__rseq_abi.feature_field)
This would ensure the fields are only used if the symbol is large enough to
hold them *and* the kernel supports them.
With this kind of scheme, we could then easily extend struct rseq to cover additional
use-cases such as:
- adding a new "node_id" field to speed up getcpu(3), user-space locking on NUMA,
and possibly memory allocators,
- adding fields allowing to quickly disable/enable signals,
- adding a "__u64 features" field, which would contain for instance
RSEQ_FEATURE_RELIABLE_CPU_ID.
I'm not sure why I did not think of this earlier, but it all seems to fit nicely.
Any comments ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Powered by blists - more mailing lists