Message-ID: <aRxXkSx3WbgAPp_Q@J2N7QTR9R3>
Date: Tue, 18 Nov 2025 11:25:05 +0000
From: Mark Rutland <mark.rutland@....com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Kees Cook <kees@...nel.org>, Arnd Bergmann <arnd@...db.de>,
Ard Biesheuvel <ardb@...nel.org>,
Jeremy Linton <jeremy.linton@....com>,
Will Deacon <will@...nel.org>,
Catalin Marinas <Catalin.Marinas@....com>,
"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance
On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
> On 17/11/2025 20:27, Kees Cook wrote:
> > On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >> On 17/11/2025 11:30, Ryan Roberts wrote:
> The original rationale for a separate choose_random_kstack_offset() at the end
> of the syscall is described as:
>
> * This position in the syscall flow is done to
> * frustrate attacks from userspace attempting to learn the next offset:
> * - Maximize the timing uncertainty visible from userspace: if the
> * offset is chosen at syscall entry, userspace has much more control
> * over the timing between choosing offsets. "How long will we be in
> * kernel mode?" tends to be more difficult to predict than "how long
> * will we be in user mode?"
> * - Reduce the lifetime of the new offset sitting in memory during
> * kernel mode execution. Exposure of "thread-local" memory content
> * (e.g. current, percpu, etc) tends to be easier than arbitrary
> * location memory exposure.
>
> I'm not totally convinced by the first argument; for arches that use the tsc,
> sampling the tsc at syscall entry would mean that userspace can figure out the
> random value that will be used for syscall N by sampling the tsc and adding a
> bit just before calling syscall N. Sampling the tsc at syscall exit would mean
> that userspace can figure out the random value that will be used for syscall N
> by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
> don't really see any difference in protection?
>
> If you're trying to force the kernel-sampled tsc to be a specific value, then for
> the sample-on-exit case, userspace can just make a syscall with an invalid id as
> its syscall N-1; in that case the duration between entry and exit is tiny and
> fixed, so it's still pretty simple to force the value.
FWIW, I agree. I don't think we're gaining much based on the placement
of choose_random_kstack_offset() at the start/end of the entry/exit
sequences.
As an aside, it looks like x86 calls choose_random_kstack_offset() for
*any* return to userspace, including non-syscall returns (e.g. from
IRQ), in arch_exit_to_user_mode_prepare(). That will cause some
additional randomness/perturbation, but logically it's not necessary to
do that for *all* returns to userspace.
> So what do you think of this approach? :
>
> #define add_random_kstack_offset(rand) do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> u32 offset = raw_cpu_read(kstack_offset); \
> u8 *ptr; \
> \
> offset = ror32(offset, 5) ^ (rand); \
> raw_cpu_write(kstack_offset, offset); \
> ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
> /* Keep allocation even after "ptr" loses scope. */ \
> asm volatile("" :: "r"(ptr) : "memory"); \
> } \
> } while (0)
>
> This ignores "Maximize the timing uncertainty" (but that's ok because the
> current version doesn't really do that either), but strengthens "Reduce the
> lifetime of the new offset sitting in memory".
Is this assuming that 'rand' can be generated in a non-preemptible
context? If so (and this is non-preemptible), that's fine.
I'm not sure whether that was the intent, or this was ignoring the
rescheduling problem.
If we do this per-task, then that concern disappears, and this can all
be preemptible.
Mark.