linux-kernel - Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys

Message-ID: <87a508he4h.ffs@tglx>
Date: Wed, 26 Nov 2025 18:56:14 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Florian Weimer <fweimer@...hat.com>
Cc: Kevin Brodsky <kevin.brodsky@....com>, Dmitry Vyukov
 <dvyukov@...gle.com>, mathieu.desnoyers@...icios.com,
 peterz@...radead.org, boqun.feng@...il.com, mingo@...hat.com,
 bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 aruna.ramakrishna@...cle.com, elver@...gle.com, "Paul E. McKenney"
 <paulmck@...nel.org>, x86@...nel.org, linux-kernel@...r.kernel.org, Jens
 Axboe <axboe@...nel.dk>
Subject: Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys

On Wed, Nov 26 2025 at 10:32, Florian Weimer wrote:
> * Thomas Gleixner:
>
>> That's all broken. Assume:
>>
>>   1) process starts with pkey 0 (default)
>>   2) glibc creates TLS (protected by pkey 0)
>>   3) main() switches to protection pkey 1
>>
>> If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is
>> accessible by pkey 1, then how is userspace able to survive?
>>
>> You then do not even need the help of the kernel to die. If the process
>> accesses TLS it dies on its own.
>
> Signals have the same problem.  With the x86 approach to disable all
> access, protection keys are not really usable without tight control over
> all code in the process.  This behavior breaks encapsulation.

Signal delivery on x86 enables PKEY0, but I agree that the signal muck
is broken. See below.

> I'm less concerned about the impact on restart of restartable sequences
> because by design, it's a non-modular feature: syscalls and function
> calls are already banned.  If the code wants to restart, it has to make
> sure that the access rights at the restart point are correct.  But
> that's like any other register contents, I think.

It's not only restart. RSEQ is also accessed by the kernel for storing
CPUID, NODEID, CID. Some of that is used in glibc today, no?
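
For reference, a minimal sketch of the per-thread area in question. The
struct name rseq_sketch is made up for illustration; the field layout is
assumed to mirror struct rseq from the uapi header, and fields added by
later kernels are omitted. The point is that cpu_id_start, cpu_id,
node_id and mm_cid are stored by the kernel, so the kernel needs write
access to this memory even when no restart happens:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical mirror of the uapi struct rseq, for illustration only.
 * The aligned attribute is a GCC/Clang extension. */
struct rseq_sketch {
    uint32_t cpu_id_start; /* stored by the kernel: current CPU number */
    uint32_t cpu_id;       /* stored by the kernel: current CPU number */
    uint64_t rseq_cs;      /* pointer to the active critical section descriptor */
    uint32_t flags;
    uint32_t node_id;      /* stored by the kernel: NUMA node ID */
    uint32_t mm_cid;       /* stored by the kernel: concurrency ID */
} __attribute__((aligned(32)));

/* Compile-time checks of the layout assumed in this sketch. */
static_assert(offsetof(struct rseq_sketch, node_id) == 20, "node_id offset");
static_assert(offsetof(struct rseq_sketch, mm_cid) == 24, "mm_cid offset");
```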

> In the other direction, code that sets a restrictive access mask is
> already not allowed to call into arbitrary code.  For example, we could
> use protection keys internally within glibc in the dynamic linker and
> require that a key that we allocated retains read access.
>
> Unfortunately, there's a use case for singleton access rights that does
> not include key 0: validate that a pointer points to memory colored in a
> specific way (e.g., for vtables, or for bytecode).

Fair enough.

> If the kernel/scheduler cannot bypass restrictions on access key 0, then
> supporting this kind of memory color check is rather difficult because
> userspace would always have to put key 0 into the accessible set.

Right, but blindly bypassing restrictions on key 0 is not a really good
solution either. It's just another piece of duct tape.

> Would it help to allocate a dedicated key for rseq and specify that
> userspace must always include this access in the accessible set?

That would definitely be helpful to avoid switching PKRU in rseq
handling code on exit to user space.

Though with the reworked RSEQ code the extra overhead might not be
horrible. See below.

But as with signals, blindly enabling key 0 and hoping that it works is
not really a solution. Nothing prevents me from disabling RSEQ in glibc,
installing my own RSEQ page and protecting it with a key of my own via
pkey_mprotect(). When that key becomes disabled in PKRU and the code
section is interrupted, then exit to user space will fault and die in
exactly the same way as today. That's progress...
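
To make that failure concrete, here is a minimal sketch of the x86 PKRU
encoding (two bits per key: access-disable at bit 2*k, write-disable at
bit 2*k+1). The helper names are hypothetical and not taken from the
kernel or from the patch series; the sketch only models the permission
check, it does not touch the real register:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKEY_COUNT  16u
#define PKRU_ALL_AD 0x55555555u  /* access-disable bit set for every key */

/* Access-disable (AD) bit of a key in the x86 PKRU layout. */
static uint32_t pkru_ad_bit(unsigned key)
{
    assert(key < PKEY_COUNT);
    return 1u << (2 * key);
}

/* Write-disable (WD) bit of a key in the x86 PKRU layout. */
static uint32_t pkru_wd_bit(unsigned key)
{
    assert(key < PKEY_COUNT);
    return 1u << (2 * key + 1);
}

/* Hypothetical helper: PKRU value which grants full access to one key only. */
static uint32_t pkru_only(unsigned key)
{
    return PKRU_ALL_AD & ~pkru_ad_bit(key);
}

/* Hypothetical helper: would an access to memory tagged with 'key' pass? */
static bool pkru_allows(uint32_t pkru, unsigned key, bool write)
{
    if (pkru & pkru_ad_bit(key))
        return false;
    if (write && (pkru & pkru_wd_bit(key)))
        return false;
    return true;
}
```

With an RSEQ page tagged with, say, key 2 and a code section running
with pkru_only(1), pkru_allows(pkru_only(1), 2, true) is false, which is
the store the kernel has to perform on exit to user space.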

> In glibc, we cannot easily set a different key for the TLS area today
> because it's not necessarily on an isolated page on which we could call
> pkey_mprotect.  We plan to fix this next year, but it's not a trivial
> change.

I understand.

> On the other hand, I get the idea that protection keys are pretty dead.
> So far, I couldn't get the x86 signal issue fixed in the kernel, so we
> can't use them for glibc hardening.

Then let's sit down and fix it once and forever.

> AArch64 duplicated the x86 behavior, too.  And POWER removed
> protection key support with the switch to the radix MMU.

I'm not sure whether we should declare them dead.

They definitely have a value, but none of this PKEY muck has been really
thought through and we just ended up with a cobbled-together ABI and a
programming model which is hard (for some stuff impossible) to use.

So we really need to sit down and actually define a proper programming
model first instead of trying to duct tape the current ill-defined mess
forever.

What do we have to take into account:

   1) signals

      Broken as we know already.

      IMO, the proper solution is to provide a mechanism to register a
      set of permissions which are used for signal delivery. The
      resulting hardware value should expand the permissions, but keep
      the currently active ones enabled.

      That can be kinda kept backwards compatible as the signal perms
      would default to PKEY0.

   2) rseq

      The option of having a separate key which needs to be always
      enabled is definitely simple, but it wastes a key just for
      that. There are only 16 of them :(

      If we solve the signal case with an explicit permission set, we
      can just reuse those signal permissions. They are maybe wider than
      what's required to access RSEQ, but the signal permissions have to
      include the TLS/RSEQ area to actually work.

   3) io_uring

      "Works" on x86 as the worker inherits the permissions of the task
      which creates the worker, unless the user memory is not accessible
      with the task's current permissions. That pretty much prevents
      a full isolation of the memory which that worker can access,
      because the task which creates it must have the keys to access
      stack, rseq and whatever else enabled to survive the syscall.

      Fails on ARM64 if the user memory is not accessible via the
      default key, which enforces that stack, rseq and the worker memory
      are all accessible via the default key. Again no isolation of the
      worker memory possible.

      I think it should have a mechanism to set the required permissions
      explicitly and default to the current behaviour, but that's
      solvable within the io_uring space. Jens?
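
As a sketch of the permission set idea from 1) and 2), using the x86
PKRU encoding only: because PKRU holds disable bits, expanding the
currently active permissions by a registered set is a bitwise AND of the
two values. All names below are hypothetical; no such registration
interface exists today, and the default of key 0 only is the backwards
compatible assumption stated above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKRU_ALL_AD 0x55555555u  /* access-disable bit set for every key */

/* PKRU value which grants access to exactly one key (x86 encoding). */
static uint32_t pkru_only(unsigned key)
{
    assert(key < 16);
    return PKRU_ALL_AD & ~(1u << (2 * key));
}

/* Hypothetical default for the registered signal permission set: PKEY0. */
static uint32_t sig_pkru_default(void)
{
    return pkru_only(0);
}

/*
 * Hypothetical: value used for signal delivery and for the rseq update.
 * A key is enabled in the result if it is enabled in either input, so
 * the currently active permissions are kept and the registered ones are
 * added on top.
 */
static uint32_t pkru_expand(uint32_t current, uint32_t registered)
{
    return current & registered;
}

/* True if the access-disable bit of 'key' is clear. */
static bool pkru_can_access(uint32_t pkru, unsigned key)
{
    return !(pkru & (1u << (2 * key)));
}
```

For a task running with only key 1 enabled, pkru_expand(pkru_only(1),
sig_pkru_default()) enables keys 0 and 1 and nothing else. A task which
keeps TLS/RSEQ under a different key would register that key instead of
relying on the default.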

Thoughts?

Thanks,

        tglx