[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1685571268.31788.1643903619178.JavaMail.zimbra@efficios.com>
Date: Thu, 3 Feb 2022 10:53:39 -0500 (EST)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Florian Weimer <fw@...eb.enyo.de>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
paulmck <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
"H. Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api <linux-api@...r.kernel.org>,
Christian Brauner <christian.brauner@...ntu.com>,
David Laight <David.Laight@...LAB.COM>,
carlos <carlos@...hat.com>, Peter Oskolkov <posk@...k.io>
Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group
vcpu id
----- On Feb 1, 2022, at 8:32 PM, Mathieu Desnoyers mathieu.desnoyers@...icios.com wrote:
> ----- On Feb 1, 2022, at 4:30 PM, Florian Weimer fw@...eb.enyo.de wrote:
>
>> * Mathieu Desnoyers:
>>
>>> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer fw@...eb.enyo.de wrote:
>>> [...]
>>>>
>>>>>> Is the switch really useful? I suspect it's faster to just write as
>>>>>> much as possible all the time. The switch should be well-predictable
>>>>>> if running uniform userspace, but still …
>>>>>
>>>>> The switch ensures the kernel don't try to write to a memory area beyond
>>>>> the rseq size which has been registered by user-space. So it seems to be
>>>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>>>> point ?
>>>>
>>>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>>>> now?
>>>
>>> Yes, but I would expect the rseq registration arguments to have a rseq_len
>>> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
>>> feature to be supported (but not the following features).
>>
>> But if rseq is managed by libc, it really has to use the full size
>> unconditionally. I would even expect that eventually, the kernel only
>> supports the initial 32, maybe 64 for a few early extension, and the
>> size indicated by the auxiliary vector.
>>
>> Not all of that area would be ABI, some of it would be used by the
>> vDSO only and opaque to userspace application (with applications/libcs
>> passing __rseq_offset as an argument to these functions).
>>
>
> I think one aspect leading to our misunderstanding here is the distinction
> between the size of the rseq area _allocation_, and the offset after the last
> field supported by the given kernel.
>
> With this in mind, let's state a bit more clearly our expected aux. vector
> extensibility scheme.
>
> With CONFIG_RSEQ=y, the kernel would pass the following information through
> the ELF auxv:
>
> - rseq allocation size (auxv_rseq_alloc_size),
> - rseq allocation alignment (auxv_rseq_alloc_align),
> - offset after the end of the last rseq field supported by this kernel
> (auxv_rseq_offset_end),
>
> We always have auxv_rseq_alloc_size >= auxv_rseq_offset_end.
>
> I would expect libc to use this information to allocate a memory area
> at least auxv_rseq_alloc_size in size, with an alignment respecting
> auxv_rseq_alloc_align. It would use a value >= auvx_rseq_alloc_size
> as rseq_len argument for the rseq registration.
>
> But I would expect libc to use the auxv_rseq_offset_end value to populate
> __rseq_size,
> so rseq users can rely on this to check whether the fields they are trying to
> access
> is indeed populated by the kernel.
>
> Of course, the kernel would still allow the original 32-byte rseq_len argument
> for the rseq registration, so the original ABI still works. It would however
> reject any rseq registration with size smaller than auxv_rseq_alloc_size (other
> than the 32-byte special-case).
>
> Is that in line with what you have in mind ? Do we really need to expose those 3
> auxv variables independently or can we somehow remove auxv_rseq_alloc_size and
> use auxv_rseq_offset_end as a min value for allocation instead ?
After giving all this some more thoughts, I think we can extend struct rseq
cleanly without adding any "padding1" fields at the end of the existing structure
(which would be effectively wasting very useful hot cache line space).
Here is what I have in mind:
We consider separately the "size" and the "feature_size" of the rseq structure.
- The "size" is really the size for memory allocation (includes padding),
- The "feature_size" is the offset after the last supported feature field.
So for the original struct rseq, size=32 bytes and feature_size=20 bytes
(offsetofend(struct rseq, flags)).
The kernel can expose this "supported rseq feature size" value through the ELF auxiliary
vector (with a new AT_RSEQ_FEATURE_SIZE). It would also expose a new AT_RSEQ_ALIGN for the
minimal allocation alignment required by the kernel. Those can be queried by user-space
through getauxval(3).
glibc can add a new const unsigned int __rseq_feature_size symbol in a future release which
will support extended rseq structures. This is the symbol I expect rseq users to check
at least once in the program's lifetime before they try to access rseq fields beyond
the originally populated 20 bytes.
The rseq_len argument passed to sys_rseq would really be the allocated size for the rseq
area (as it is today, considering that the kernel expects sizeof(struct rseq)). Typically,
I would expect glibc to take the rseq feature_size value and round it up to a value which
is a multiple of the AT_RSEQ_ALIGN. That allocation size would be used to populate
__rseq_size.
Am I missing something ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Powered by blists - more mailing lists