[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87ikexhbah.ffs@tglx>
Date: Wed, 26 Nov 2025 01:45:10 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Kevin Brodsky <kevin.brodsky@....com>, Dmitry Vyukov
<dvyukov@...gle.com>, mathieu.desnoyers@...icios.com,
peterz@...radead.org, boqun.feng@...il.com, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
aruna.ramakrishna@...cle.com, elver@...gle.com
Cc: "Paul E. McKenney" <paulmck@...nel.org>, x86@...nel.org,
linux-kernel@...r.kernel.org, Florian Weimer <fweimer@...hat.com>
Subject: Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
On Mon, Oct 20 2025 at 15:46, Kevin Brodsky wrote:
> +Florian Weimer
>> * A single struct rseq per thread is allowed.
>> + *
>> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
>> + * then the assigned pkey should either be accessible whenever these structs
>> + * are registered/installed, or they should be protected with pkey 0.
>
> I think it's worth pointing out that every case of async uaccess seems
> to be handled differently w.r.t. pkeys. Some interesting cases:
>
> * During signal delivery, the pkey register is reset to permissive
> before writing to the signal stack (this is the logic that patch 2
> touches in fact). This is handled in the same way on x86 [1] and arm64 [2].
>
> * AFAICT io_uring worker threads inherit the user thread's pkey register
> on x86, since they are forked without setting PF_KTHREAD. OTOH on arm64
> the pkey register is reset to the init value (RW access to pkey 0 only)
> for all non-user threads [3].
>
> * Now with this patch accesses to struct rseq are made with whatever the
> pkey register is set to when the thread is interrupted + RW access for
> pkey 0.
>
> This is clearly a difficult problem, since no approach will satisfy all
> users. I believe a new API would be desirable to allow users to
> configure this; see the thread at [4] regarding signal handlers.
>
> In any case, it seems best to ignore the value of the pkey register at
> the point where the thread is interrupted, because userspace has no
> control on that value and it could lead to spurious faults (depending on
> when the interruption happens). For rseq, maybe the most logical option
> would be to set the pkey register to permissive in
> __rseq_handle_notify_resume(), because there is nothing sensible to
> check at that point. Checking could be done at the point of registration
> instead.
That's all broken. Assume:
1) process starts with pkey 0 (default)
2) glibc creates TLS (protected by pkey 0)
3) main() switches to protection pkey 1
If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is
accessible by pkey 1, then how is userspace able to survive?
You then do not even need the help of the kernel to die. If the process
accesses TLS it dies on it's own.
As each map can only be attached to a single pkey, the only solutions for
this are:
1) enable both pkey0 and pkey1
or
2) ensure that everything which was mapped _before_ main() changes
the protection key is covered by pkey1 when it disables pkey0.
For that to pull off you need #1 temporary to finish the
transition of _all_ affected mappings via pkey_mprotect()
No?
And no matter how many PKEY hacks you sprinkle into the kernel none of
them will solve all of the related problems.
If user space registers shared memory with the kernel (that's what RSEQ
is), user space has to make sure that it is accessible under it's own
PKEY constraints. If it fails to do so, it dies rightfully whether in
user space or in kernel space does not matter.
The only exception is the alternative signal stack, which can be set up
as an catch all sink which is not covered by the currently active PKEY
set on purpose.
And looking at how x86 handles this it gets it actually wrong because it
enables all keys independent of the stack target it delivers to while
this should be strictly restricted to the alternative stack, but that's
a different discussion to have.
So no, these patches are going nowhere because they try to paper over a
fundamental bug in the application and the understanding of how all of
this protection key mechanism actualy can work.
Just let me quote the cover letter to prove my point:
"If an application registers rseq, and ever switches to another pkey
protection (such that the rseq becomes inaccessible), then any
context switch will cause failure in __rseq_handle_notify_resume()
attempting to read/write struct rseq and/or rseq_cs. Since context
switches are asynchronous and are outside of the application control
(not part of the restricted code scope), temporarily enable access
to 0 (default) PKEY to read/write rseq/rseq_cs.
0 is the only PKEY supported for rseq for now."
Clearly that does not even notice the fact that switching to pkey 1
prevents access to the TLS. Fact is that RSEQ is part of the TLS whether
you like it or not. See above.
"Theoretically other PKEYs can be supported, but it's unclear
how/if that can work. So for now we don't support that to simplify
code."
Clearly tells me that this is attempt to scratch an itch which was not
understood at all.
There are exactly two options to solve this in user space and there is
zero justification to inflict any of this nonsense on the kernel.
Thanks,
tglx
Powered by blists - more mailing lists