linux-kernel - Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys

Message-ID: <5d2f73f9-2aa5-4cbd-ba6e-e22f82b95884@arm.com>
Date: Tue, 2 Dec 2025 20:19:26 +0100
From: Kevin Brodsky <kevin.brodsky@....com>
To: Thomas Gleixner <tglx@...utronix.de>, Florian Weimer <fweimer@...hat.com>
Cc: Dmitry Vyukov <dvyukov@...gle.com>, mathieu.desnoyers@...icios.com,
 peterz@...radead.org, boqun.feng@...il.com, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, hpa@...or.com, aruna.ramakrishna@...cle.com,
 elver@...gle.com, "Paul E. McKenney" <paulmck@...nel.org>, x86@...nel.org,
 linux-kernel@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
 Catalin Marinas <catalin.marinas@....com>
Subject: Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys

+ Catalin

On 26/11/2025 18:56, Thomas Gleixner wrote:
> [...]
>
>> In the other direction, code that sets a restrictive access mask is
>> already not allowed to call into arbitrary code.  For example, we could
>> use protection keys internally within glibc in the dynamic linker and
>> require that a key that we allocated retains read access.
>>
>> Unfortunately, there's a use case for singleton access rights that does
>> not include key 0: validate that a pointer points to memory colored in a
>> specific way (e.g., for vtables, or for bytecode).
> Fair enough.

This goes beyond this singleton pattern: if one wants to isolate
untrusted (e.g. jitted) code by giving it its own pkey, then preventing
access to the default/privileged pkey 0 is essential while executing the
sandboxed code.
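
To make that concrete, here is a minimal sketch of the register value such a sandbox would run with, assuming the x86 PKRU layout (two bits per key: access-disable at bit 2*k, write-disable at bit 2*k+1). The helper names are hypothetical and purely illustrative; they are not taken from the kernel or glibc.

```c
#include <stdint.h>

/* x86 PKRU layout: two bits per key. A set bit removes a permission.
 * Bit 2*k is access-disable (AD), bit 2*k+1 is write-disable (WD). */
#define PKRU_AD(k) (UINT32_C(1) << (2 * (k)))
#define PKRU_WD(k) (UINT32_C(2) << (2 * (k)))
#define PKRU_DENY_ALL UINT32_C(0x55555555) /* AD set for all 16 keys */

/* Hypothetical helper: PKRU value granting read/write to exactly one key.
 * For a sandbox key other than 0, this leaves pkey 0 inaccessible. */
static inline uint32_t pkru_allow_only(unsigned int pkey)
{
    return PKRU_DENY_ALL & ~(PKRU_AD(pkey) | PKRU_WD(pkey));
}

/* Returns 1 if data accesses to memory tagged with pkey are permitted. */
static inline int pkru_can_access(uint32_t pkru, unsigned int pkey)
{
    return !(pkru & PKRU_AD(pkey));
}
```

With the sandbox on pkey 1, pkru_allow_only(1) denies pkey 0, which is exactly the situation where the kernel's asynchronous accesses to the stack, TLS or rseq area start to fail.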

> [...]
>
>> On the other hand, I get the idea that protection keys are pretty dead.
>> So far, I couldn't get the x86 signal issue fixed in the kernel, so we
>> can't use them for glibc hardening.
> Then let's sit down and fix it once and forever.

That's all I've been hoping for :)

>> AArch64 duplicated the x86 behavior, too.  And POWER removed
>> protection key support with the switch to the radix MMU.
> I'm not sure whether we should declare them dead.
>
> They definitely have a value, but none of this PKEY muck has been really
> thought through and we just ended up with a cobbled together ABI and a
> hard (impossible for some stuff) to use programming model.

Agreed, all that has been done so far in terms of async uaccess (such as
writing the signal frame) is working around the problem rather than
providing an actual solution.

> So we really need to sit down and actually define a proper programming
> model first instead of trying to duct tape the current ill defined mess
> forever.
>
> What do we have to take into account:
>
>    1) signals
>
>       Broken as we know already.

And unnecessarily complicated, as we use two different pkey register
values: one to write the signal frame (no restriction), one to invoke
the signal handler (pkru_init / pkey 0 on arm64). A big piece of duct
tape :)

>       IMO, the proper solution is to provide a mechanism to register a
>       set of permissions which are used for signal delivery. The
>       resulting hardware value should expand the permission, but keep
>       the current active ones enabled.

It feels like there are really two situations to consider, as you
mentioned in your first reply:

a. Delivering the signal on the regular stack. This should work without
intervention: the interrupted context should be able to write to its own
stack pointer.

b. Delivering on the alternate signal stack. There is no expectation or
requirement that the interrupted context is able to access it, so we
need to update the pkey register.

In this specific case I feel that signal handlers should be able to rely
on a consistent ABI regardless of SA_ONSTACK. Preserving the register
value and expanding it with user-defined permissions seems like a
reasonable compromise.
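
A rough sketch of what "preserve and expand" could mean for the register value, again assuming the x86 PKRU encoding where a set bit denies a permission. The function name and the notion of a registered signal_perms_pkru value are assumptions made for illustration; no such interface exists today.

```c
#include <stdint.h>

/* In PKRU a set bit denies a permission, so the union of two permission
 * sets is the bitwise AND of the two register values.
 *
 * interrupted_pkru:  register value of the interrupted context
 * signal_perms_pkru: hypothetical user-registered signal permissions,
 *                    expressed as a PKRU value (default: only pkey 0)
 */
static inline uint32_t signal_delivery_pkru(uint32_t interrupted_pkru,
                                            uint32_t signal_perms_pkru)
{
    return interrupted_pkru & signal_perms_pkru;
}
```

For example, a context restricted to pkey 1 (0x55555551) combined with default signal permissions covering only pkey 0 (0x55555554) would enter the handler with 0x55555550, i.e. both keys accessible. On arm64, where POR_EL0 bits grant rather than deny permissions, the same union would presumably be an OR instead.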

>       That can be kinda kept backwards compatible as the signal perms
>       would default to PKEY0.

If anyone uses the current cobbled-together mechanism [1], i.e. an
assembly trampoline that switches the pkey register before invoking the
actual signal handler on the alternate signal stack (pkey != 0), then it
would break. I doubt there are many such users, if any, though.

[1]
https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oKUfHRw@mail.gmail.com/

>    2) rseq
>
>       The option of having a separate key which needs to be always
>       enabled is definitely simple, but it wastes a key just for
>       that. There are only 16 of them :(

And even fewer on arm64, just 8 :/ I don't think reserving a pkey just
for rseq is a reasonable option.

>       If we solve the signal case with an explicit permission set, we
>       can just reuse those signal permissions. They are maybe wider than
>       what's required to access RSEQ, but the signal permissions have to
>       include the TLS/RSEQ area to actually work.

But as you mentioned further up, nothing prevents the user from
bypassing glibc and using rseq directly. It seems rather strange to bake
such an assumption into the kernel.

Because accesses to the rseq struct are truly asynchronous and unrelated
to the interrupted context, I do not think that inheriting the
interrupted pkey register makes sense in that case. We could have a
separate mechanism to set that value, or maybe use the same value as
when the struct was registered (surely the context that called
rseq(&rseq_struct) must have access to rseq_struct).
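
As a toy model of the second option (reusing the registration-time value), assuming the x86 PKRU encoding; the struct and helpers below are hypothetical and only illustrate why the interrupted value is the wrong one to check against:

```c
#include <stdint.h>

#define PKRU_AD(k) (UINT32_C(1) << (2 * (k)))
#define PKRU_WD(k) (UINT32_C(2) << (2 * (k)))

/* Hypothetical per-task record, filled in when rseq() registers the area. */
struct rseq_reg_model {
    unsigned int area_pkey;        /* pkey tagging the page holding struct rseq */
    uint32_t pkru_at_registration; /* register value when rseq() was called */
};

/* Returns 1 if writes to memory tagged with pkey are permitted by pkru. */
static inline int pkru_can_write(uint32_t pkru, unsigned int pkey)
{
    return !(pkru & (PKRU_AD(pkey) | PKRU_WD(pkey)));
}

/* Would the kernel's asynchronous rseq update succeed under pkru? */
static inline int rseq_update_allowed(const struct rseq_reg_model *reg,
                                      uint32_t pkru)
{
    return pkru_can_write(pkru, reg->area_pkey);
}
```

If the rseq area is tagged with pkey 0 and the task is interrupted while sandboxed on pkey 1 (0x55555551), the check fails with the interrupted value but succeeds with pkru_at_registration, since the registering context necessarily had access to the struct.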

>    3) io-uring
>
>       "Works" on x86 as the worker inherits the permissions of the task
>       which creates the worker unless the user memory is not accessible
>       with the task's current permissions. That's pretty much preventing
>       a full isolation of the memory which that worker can access
>       because the task which creates it must have the keys to access
>       stack, rseq and whatever enabled to survive the syscall.
>
>       Fails on ARM64 if the user memory is not accessible via the
>       default key, which enforces that stack, rseq and the worker memory
>       is accessible via the default key. Again no isolation of the
>       worker memory possible.
>
>       I think it should have a mechanism to set the required permissions
>       explicitly and default to the current behaviour, but that's
>       solvable within the io uring space I think. Jens?

This case seems very similar to rseq to me, but we may want more control
as io_uring allows a wide range of accesses to be performed.


There are other situations where async uaccess occurs, and I think those
should be considered too. Users of __copy_from_user_inatomic() and
functions based on it such as copy_from_user_nofault() often fall in
that category.

From what I've gathered so far, most of these situations don't look too
concerning:

- Functions that inspect the stack (perf_callchain_user(),
arch_stack_walk_user()). The access is asynchronous, but like in the
signal delivery case it seems safe to assume that the stack is
accessible in any context.

- uprobe tracing allows reading memory at some arbitrary address. This
feels like a synchronous situation to me: when a particular function is
called, read some memory. The access will fail if prevented by the pkey
register, but that is exactly the same as if the function itself
accessed the memory.

One situation looks a lot more concerning:

- There are BPF helpers to read/write user memory (bpf_copy_from_user,
bpf_probe_read_user, etc.). These accesses do not necessarily occur in
any given user context AFAIU. Does it really make sense to apply pkey
restrictions in that case?

> Thoughts?

Thanks for getting the ball rolling! I really hope we can settle on a
consistent way to handle the pkey register for all those uaccess calls
that are not directly initiated by userspace.

- Kevin