Message-ID: <b71ce209-80d7-4cfa-aa77-4f6c8999a187@arm.com>
Date: Tue, 18 Nov 2025 12:16:35 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Mark Rutland <mark.rutland@....com>
Cc: Kees Cook <kees@...nel.org>, Arnd Bergmann <arnd@...db.de>,
 Ard Biesheuvel <ardb@...nel.org>, Jeremy Linton <jeremy.linton@....com>,
 Will Deacon <will@...nel.org>, Catalin Marinas <Catalin.Marinas@....com>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance

On 18/11/2025 11:25, Mark Rutland wrote:
> On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
>> On 17/11/2025 20:27, Kees Cook wrote:
>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>> The original rationale for a separate choose_random_kstack_offset() at the end
>> of the syscall is described as:
>>
>>  * This position in the syscall flow is done to
>>  * frustrate attacks from userspace attempting to learn the next offset:
>>  * - Maximize the timing uncertainty visible from userspace: if the
>>  *   offset is chosen at syscall entry, userspace has much more control
>>  *   over the timing between choosing offsets. "How long will we be in
>>  *   kernel mode?" tends to be more difficult to predict than "how long
>>  *   will we be in user mode?"
>>  * - Reduce the lifetime of the new offset sitting in memory during
>>  *   kernel mode execution. Exposure of "thread-local" memory content
>>  *   (e.g. current, percpu, etc) tends to be easier than arbitrary
>>  *   location memory exposure.
>>
>> I'm not totally convinced by the first argument; for arches that use the tsc,
>> sampling the tsc at syscall entry would mean that userspace can figure out the
>> random value that will be used for syscall N by sampling the tsc and adding a
>> bit just before calling syscall N. Sampling the tsc at syscall exit would mean
>> that userspace can figure out the random value that will be used for syscall N
>> by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
>> don't really see any difference in protection?
>>
>> If you're trying to force the kernel-sampled tsc to be a specific value, then
>> for the sample-on-exit case, userspace can just make a syscall with an invalid
>> id as its syscall N-1; in that case the duration between entry and exit is
>> tiny and fixed, so it's still pretty simple to force the value.
> 
> FWIW, I agree. I don't think we're gaining much based on the placement
> of choose_random_kstack_offset() at the start/end of the entry/exit
> sequences.
> 
> As an aside, it looks like x86 calls choose_random_kstack_offset() for
> *any* return to userspace, including non-syscall returns (e.g. from
> IRQ), in arch_exit_to_user_mode_prepare(). That'll cause some additional
> randomness/perturbation, but logically it's not necessary to do it for
> *all* returns to userspace.

(as does s390)

Hmm, that's interesting; that will defeat the attack where a task is migrated
away from the cpu mid-syscall, since any future return to user space will still
stir the per-cpu pot.

So getting rid of choose_random_kstack_offset() would likely reduce security for
x86 and s390 because we would only be sampling the tsc on entry to syscalls.

But similarly, I think switching to a per-task offset has the potential to
reduce security too, because each return to user space would only mix in the
current task's value.

Perhaps this is actually an argument to keep it per-cpu.

For arm64 performance, the interrupt timing provides a somewhat-random source
that an attacker can't control or guess. So could we take a similar route:
fold the timer value into per-cpu state at every return to user (or even at
every interrupt), then combine that with Jeremy's per-cpu prng, which is
seeded from the crng only once, at cpu-online time?
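
Very roughly, something like this (sketch only: kstack_rand and
kstack_fold_timer() are names I've made up, the ror32 mixing is illustrative,
and Jeremy's actual prng would sit on top of this state):

	/* Per-cpu entropy pool, seeded from the crng once at cpu-online time. */
	DEFINE_PER_CPU(u32, kstack_rand);

	/* Called on every return to user (or every interrupt): fold the
	 * arm64 virtual counter into the per-cpu pool. */
	static __always_inline void kstack_fold_timer(void)
	{
		u32 r = raw_cpu_read(kstack_rand);

		r = ror32(r, 5) ^ (u32)read_sysreg(cntvct_el0);
		raw_cpu_write(kstack_rand, r);
	}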

I know you don't like relying on the timer, Mark, but sampling it per interrupt
feels a bit stronger to me?

> 
>> So what do you think of this approach?
>>
>> #define add_random_kstack_offset(rand) do {				\
>> 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
>> 				&randomize_kstack_offset)) {		\
>> 		u32 offset = raw_cpu_read(kstack_offset);		\
>> 		u8 *ptr;						\
>> 									\
>> 		offset = ror32(offset, 5) ^ (rand);			\
>> 		raw_cpu_write(kstack_offset, offset);			\
>> 		ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
>> 		/* Keep allocation even after "ptr" loses scope. */	\
>> 		asm volatile("" :: "r"(ptr) : "memory");		\
>> 	}								\
>> } while (0)
>>
>> This ignores "Maximize the timing uncertainty" (but that's ok because the
>> current version doesn't really do that either), but strengthens "Reduce the
>> lifetime of the new offset sitting in memory".
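
For context, the call site would look something like this on arm64
(invoke_syscall() is where add_random_kstack_offset() is called today;
passing rand as an argument is the new part, and get_random_u16() as the
source is just one option):

	/* arch/arm64/kernel/syscall.c, sketch only */
	static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
				   unsigned int sc_nr,
				   const syscall_fn_t syscall_table[])
	{
		add_random_kstack_offset(get_random_u16());
		/* ... dispatch the syscall as before ... */
	}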
> 
> Is this assuming that 'rand' can be generated in a non-preemptible
> context? If so (and this is non-preemptible), that's fine.

Yes, this needs to be called with preemption disabled, and yes, it assumes
that rand can be generated in a non-preemptible context - which is true for
all arches' rand sources today.

Although given the cost of get_random_u16() (or u8) for arm64 in the case
where it has to call into the crng, I was considering that we would first
"try to get rand" while preemption is still disabled, and if that failed,
enable preemption and then do the slow path, as sketched below. That would at
least allow preemption for RT kernels (due to get_random_u16()'s local_lock).
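
Roughly this shape (try_get_random_u16() is hypothetical - there's no
non-blocking variant of get_random_u16() today - so this is only to show the
structure):

	u16 rand;

	preempt_disable();
	if (!try_get_random_u16(&rand)) {	/* hypothetical fast path */
		preempt_enable();
		rand = get_random_u16();	/* slow path: may refill from the crng */
		preempt_disable();
	}
	add_random_kstack_offset(rand);
	preempt_enable();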

But given the above discussion about return-to-user, perhaps this is not the
way to go anyway. I suspect it makes sense to keep the entropy collection
separate from its usage.

> 
> I'm not sure whether that was the intent, or this was ignoring the
> rescheduling problem.
> 
> If we do this per-task, then that concern disappears, and this can all
> be preemptible.
> 
> Mark.

