Message-Id: <D9FOMMGOGOZS.FN9LKYJAB9PD@ventanamicro.com>
Date: Fri, 25 Apr 2025 13:27:34 +0200
From: Radim Krčmář <rkrcmar@...tanamicro.com>
To: "Deepak Gupta" <debug@...osinc.com>
Cc: "Thomas Gleixner" <tglx@...utronix.de>, "Ingo Molnar"
<mingo@...hat.com>, "Borislav Petkov" <bp@...en8.de>, "Dave Hansen"
<dave.hansen@...ux.intel.com>, <x86@...nel.org>, "H. Peter Anvin"
<hpa@...or.com>, "Andrew Morton" <akpm@...ux-foundation.org>, "Liam R.
Howlett" <Liam.Howlett@...cle.com>, "Vlastimil Babka" <vbabka@...e.cz>,
"Lorenzo Stoakes" <lorenzo.stoakes@...cle.com>, "Paul Walmsley"
<paul.walmsley@...ive.com>, "Palmer Dabbelt" <palmer@...belt.com>, "Albert
Ou" <aou@...s.berkeley.edu>, "Conor Dooley" <conor@...nel.org>, "Rob
Herring" <robh@...nel.org>, "Krzysztof Kozlowski" <krzk+dt@...nel.org>,
"Arnd Bergmann" <arnd@...db.de>, "Christian Brauner" <brauner@...nel.org>,
"Peter Zijlstra" <peterz@...radead.org>, "Oleg Nesterov" <oleg@...hat.com>,
"Eric Biederman" <ebiederm@...ssion.com>, "Kees Cook" <kees@...nel.org>,
"Jonathan Corbet" <corbet@....net>, "Shuah Khan" <shuah@...nel.org>, "Jann
Horn" <jannh@...gle.com>, "Conor Dooley" <conor+dt@...nel.org>,
<linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-mm@...ck.org>, <linux-riscv@...ts.infradead.org>,
<devicetree@...r.kernel.org>, <linux-arch@...r.kernel.org>,
<linux-doc@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
<alistair.francis@....com>, <richard.henderson@...aro.org>,
<jim.shu@...ive.com>, <andybnac@...il.com>, <kito.cheng@...ive.com>,
<charlie@...osinc.com>, <atishp@...osinc.com>, <evan@...osinc.com>,
<cleger@...osinc.com>, <broonie@...nel.org>, <rick.p.edgecombe@...el.com>,
"Zong Li" <zong.li@...ive.com>, "linux-riscv"
<linux-riscv-bounces@...ts.infradead.org>
Subject: Re: [PATCH v12 05/28] riscv: usercfi state for task and
save/restore of CSR_SSP on trap entry/exit
2025-04-24T10:56:34-07:00, Deepak Gupta <debug@...osinc.com>:
> On Thu, Apr 24, 2025 at 01:52:43PM +0200, Radim Krčmář wrote:
>>2025-04-23T17:00:29-07:00, Deepak Gupta <debug@...osinc.com>:
>>> On Thu, Apr 10, 2025 at 01:04:39PM +0200, Radim Krčmář wrote:
>>>>2025-03-14T14:39:24-07:00, Deepak Gupta <debug@...osinc.com>:
>>>>> diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
>>>>> @@ -62,6 +62,9 @@ struct thread_info {
>>>>> long user_sp; /* User stack pointer */
>>>>> int cpu;
>>>>> unsigned long syscall_work; /* SYSCALL_WORK_ flags */
>>>>> +#ifdef CONFIG_RISCV_USER_CFI
>>>>> + struct cfi_status user_cfi_state;
>>>>> +#endif
>>>>
>>>>I don't think it makes sense to put all the data in thread_info.
>>>>kernel_ssp and user_ssp is more than enough and the rest can comfortably
>>>>live elsewhere in task_struct.
>>>>
>>>>thread_info is supposed to be as small as possible -- just spanning
>>>>multiple cache-lines could be noticeable.
>>>
>>> I can change it to include only `user_ssp`, base and size.
>>
>>No need for base and size either -- we don't touch that in the common
>>exception code.
>
> got it.
>
>>
>>> But before we go there, see below:
>>>
>>> $ pahole -C thread_info kbuild/vmlinux
>>> struct thread_info {
>>> long unsigned int flags; /* 0 8 */
>>> int preempt_count; /* 8 4 */
>>>
>>> /* XXX 4 bytes hole, try to pack */
>>>
>>> long int kernel_sp; /* 16 8 */
>>> long int user_sp; /* 24 8 */
>>> int cpu; /* 32 4 */
>>>
>>> /* XXX 4 bytes hole, try to pack */
>>>
>>> long unsigned int syscall_work; /* 40 8 */
>>> struct cfi_status user_cfi_state; /* 48 32 */
>>> /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
>>> long unsigned int a0; /* 80 8 */
>>> long unsigned int a1; /* 88 8 */
>>> long unsigned int a2; /* 96 8 */
>>>
>>> /* size: 104, cachelines: 2, members: 10 */
>>> /* sum members: 96, holes: 2, sum holes: 8 */
>>> /* last cacheline: 40 bytes */
>>> };
>>>
>>> If we were to remove entire `cfi_status`, it would still be 72 bytes (88 bytes
>>> if shadow call stack were enabled) and already spans across two cachelines.
>>
>>It has only 64 bytes of data without shadow call stack, but it wasted 8
>>bytes on the holes.
>>a2 is somewhat of an outlier that is not used in most exception paths and
>>excluding it makes everything fit nicely even now.
>
> But we can't exclude shadow call stack. It'll lead to increased size if that
> config is selected. A solution has to work for all the cases and not be a
> half-hearted effort.

We could drop a0 or user_sp and place the two ints next to each other,
saving at least 16 bytes.
(user_sp, a0, a1, and a2 are just temporary storage. I think we would be
fine with just two temporaries + kernel_sp, to provide three registers
for new_vmalloc_check, and we never need more.)
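
To put numbers on the packing: a minimal sketch, assuming 64-bit longs and
32-bit ints as in the pahole output above. ti_current and ti_packed are
hypothetical names, not kernel structures, and the sketch leaves out
user_cfi_state and the shadow call stack fields. Dropping user_sp is just one
of the options mentioned above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Layout from the pahole output, minus user_cfi_state (hypothetical name). */
struct ti_current {
	uint64_t flags;
	int32_t  preempt_count;   /* 4-byte hole follows */
	int64_t  kernel_sp;
	int64_t  user_sp;
	int32_t  cpu;             /* 4-byte hole follows */
	uint64_t syscall_work;
	uint64_t a0, a1, a2;
};

/* Assumed repacking: the two ints share one 8-byte slot, user_sp is dropped. */
struct ti_packed {
	uint64_t flags;
	int32_t  preempt_count;
	int32_t  cpu;
	int64_t  kernel_sp;
	uint64_t syscall_work;
	uint64_t a0, a1, a2;
};

static_assert(sizeof(struct ti_current) == 72, "64 bytes of data + 8 bytes of holes");
static_assert(sizeof(struct ti_packed) == 56, "fits in one 64-byte cache-line");

/* Bytes saved by closing both holes and dropping one temporary. */
static inline size_t ti_bytes_saved(void)
{
	return sizeof(struct ti_current) - sizeof(struct ti_packed);
}
```

The 16 bytes come from the two 4-byte holes plus one 8-byte temporary.
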
>>> if shadow call stack were enabled) and already spans across two cachelines. I
>>> did see the comment above that it should fit inside a cacheline, although I
>>> assumed it is stale, given that the structure already spans two cachelines
>>> and I didn't see any special mention in the commit messages of the changes
>>> that grew this structure above one cacheline.
>>>
>>> On the other hand, whenever enable/lock bits are checked, there is a high
>>> likelihood that user_ssp and other fields are going to be accessed and
>>> thus it actually might be helpful to have it all in one cacheline during
>>> runtime.
>>
>>Yes, although accessing enable/lock bits will be relatively rare.
>>It seems better to have the overhead during thread setup, rather than on
>>every trap.
>>
>>> So I am not sure it's helpful to stick to a comment which is already stale.
>>
>>We could fix the holes and also use sp instead of a0 in the
>>new_vmalloc_check, so everything would fit better.
>>
>>We are really close to fitting into a single cache-line, so I'd prefer
>>if shadow stack only filled thread_info with data that is used very
>>often in the exception handling code.
>
> I don't get what's the big deal if it results in two cachelines. We can
> (re)organize the data structure so that the most frequently accessed members are
> together in a single cacheline. We just need to find those members.

Yes, and because this patch is reorganizing the structure, I thought it
would be better to do the analysis now, rather than to incur additional
debt.

thread_info members are accessed during the first instructions after a
trap. We want to maximize the chance that the execution doesn't stall
until uarch has time to engage its crystal ball.

> In the hot path of exception handling, I see accesses to pt_regs on stack as
> well. These are definitely in a different cacheline than thread_info.

Right, and we also access cache-lines for the code.

I don't know how well each uarch keeps the early trap data/code in
caches, but it doesn't seem like a bad idea to minimize the number of
cache-lines that are accessed early after a trap.

> I understand the argument of one member field crossing into two cachelines can
> have undesired perf effects. I do not understand the reasoning that thread_info
> has to fit exactly inside one cacheline.

I agree that we could probably lift the constraint for some values --
it's a lot of performance modeling and convincing, though...

In this series, I think it would be good to avoid splitting kernel_sp
and a0/a1 into two cache-lines. kernel_sp and a0/a1 are accessed within
the first few instructions.
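
A compile-time check along these lines could guard that property. This is
only a sketch: ti_sketch is a hypothetical layout (holes closed, a single
user_ssp added), not what the series does. The 64-byte line size is taken
from the pahole output, and the structure is assumed to start on a
cache-line boundary.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumed line size, as in the pahole output */

/* Hypothetical layout: ints packed together, only user_ssp added. */
struct ti_sketch {
	uint64_t flags;
	int32_t  preempt_count;
	int32_t  cpu;
	int64_t  kernel_sp;
	int64_t  user_sp;
	uint64_t syscall_work;
	uint64_t user_ssp;
	uint64_t a0, a1, a2;
};

/* Index of the cache-line a member starts in, relative to the struct start. */
#define TI_LINE(member) (offsetof(struct ti_sketch, member) / CACHE_LINE_BYTES)

/* The members touched by the first trap instructions must share a line. */
static_assert(TI_LINE(kernel_sp) == TI_LINE(a0), "kernel_sp and a0 split");
static_assert(TI_LINE(kernel_sp) == TI_LINE(a1), "kernel_sp and a1 split");
```

In this layout only a2 spills into the second cache-line. With the full
32-byte user_cfi_state at offset 48, a0 and a1 start at offsets 80 and 88,
i.e. in a different line than kernel_sp.
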
> If this was always supposed to fit in a single cacheline, clearly this
> invariant isn't/wasn't maintained as changes trickled in. I would like to see
> what maintainers have to say or someone who did data analysis on this.

I don't think it is necessary to fix the rest, just not making things
worse is already great.

>>I think we could do without user_sp in thread_info as well, so there are
>>other packing options.
>
> Sure, probably somewhere in task_struct. But the fact of the matter is that it
> has to be saved/restored during exception entry/exit. A load/store to
> task_struct then essentially touches a different cacheline. Not sure what we
> will achieve here?

user_sp is only temporary storage space in thread_info.
The sp register is restored from pt_regs, so we could refactor the code
to drop user_sp from thread_info,
e.g. use a0, a1, or a2 for the temporary storage: user_sp is not even
the userspace sp, it is the sp of the previous sp "user", which might have
been the kernel.

>>Btw. could ssp be added to pt_regs?
>
> I had that earlier. It breaks user abi. And it was a no go.

Thanks, I was afraid of that. :)

We might want to eventually push ssp to the stack to follow the same
design for trap nesting as sp has, but that can happen when implementing
ssp for the kernel. Squeezing into thread_info should work for now.