Message-ID: <202511171221.517FC4F@keescook>
Date: Mon, 17 Nov 2025 12:27:36 -0800
From: Kees Cook <kees@...nel.org>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Arnd Bergmann <arnd@...db.de>, Ard Biesheuvel <ardb@...nel.org>,
	Jeremy Linton <jeremy.linton@....com>,
	Will Deacon <will@...nel.org>,
	Catalin Marinas <Catalin.Marinas@....com>,
	Mark Rutland <mark.rutland@....com>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance

On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> Sorry; forgot to add Mark and the lists!
> 
> 
> On 17/11/2025 11:30, Ryan Roberts wrote:
> > Hi All,
> > 
> > Over the last few years we have had a few complaints that syscall performance on
> > arm64 is slower than on x86. Most recently, it was observed that a certain Java
> > benchmark that does a lot of fstat and lseek spends ~10% of its time in
> > get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> > ideas about how performance could be improved.
> > 
> > But I'll get to the performance angle in a bit. First, I want to discuss some
> > bugs that I believe I have uncovered during code review...
> > 
> > 
> > Bug 1: We have the following pattern:
> > 
> > add_random_kstack_offset()
> >   enable_interrupts_and_preemption()
> >     do_syscall()
> >   disable_interrupts_and_preemption()
> > choose_random_kstack_offset(random)
> > 
> > Here add_random_kstack_offset() adds an offset to the stack that was chosen by
> > a previous call to choose_random_kstack_offset() and stored in a per-cpu
> > variable. But since preemption is enabled during the syscall, surely an attacker
> > could defeat this by arranging for the thread to be preempted or migrated while
> > executing the syscall? That way the new offset is buffered for a different CPU,
> > and a subsequent syscall on the original CPU will reuse the original offset.

Oh, interesting -- I hadn't considered that the entire thread would be
moved between CPUs while it is *IN* the syscall. Yeah, that would
effectively cause the offset to never change for the moved-from CPU. Ew.

> > I think we could just pass the random seed to add_random_kstack_offset() so that
> > we consume the old offset and buffer the new one atomically. We would still buffer
> > it across syscalls to avoid the guessability issue that's documented. Then
> > choose_random_kstack_offset() could be removed. Or we could store the offset
> > per task_struct, given it is only 32 bits?

I had wanted to avoid both growing task_struct and tying the randomness
to a given task -- that way unpredictability could be higher (barring
the bug above), and could only be reduced to per-thread by pinning a
thread exclusively to a single CPU.
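
For illustration, here is a minimal sketch of the combined consume-and-buffer
step described above, reusing the existing kstack_offset per-cpu variable,
__kstack_alloca(), KSTACK_OFFSET_MAX() and the ror32() mixing from
choose_random_kstack_offset(). The macro name is hypothetical, the static-key
check is omitted for brevity, and it assumes the caller still has preemption
and interrupts disabled:

/*
 * Hypothetical combined helper: consume the previously buffered offset
 * and buffer the next one in a single step, so both per-cpu accesses
 * happen on the same CPU, closing the migration window from bug 1.
 */
#define add_and_choose_random_kstack_offset(rand) do {			\
	u32 old = raw_cpu_read(kstack_offset);				\
	u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(old));		\
	/* Keep the allocation live after "ptr" goes out of scope. */	\
	asm volatile("" :: "r"(ptr) : "memory");			\
	/* Mix fresh bits in for the *next* syscall on this CPU. */	\
	raw_cpu_write(kstack_offset, ror32(old, 5) ^ (u32)(rand));	\
} while (0)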

> > Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
> > document their requirement to be called with interrupts and preemption disabled.
> > They use raw_cpu_*(), which requires this. But on arm64 they are called from
> > invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
> > don't think this will cause functional harm for arm64's implementations of
> > raw_cpu_*(), but it means that the wrong per-cpu structure may be
> > referenced. Perhaps there is a way for user code to exploit this to
> > defeat the purpose of the feature.
> > 
> > This should be straightforward to fix; if we take the task_struct approach for
> > bug 1, that would fix this issue too, because the requirement to be in
> > atomic context goes away. Otherwise the calls can be moved earlier in the call
> > chain, before interrupts are enabled.

I can't speak to these internals, just that I'd hope to avoid forcing
the randomness down to the thread-level.
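
As an illustration of why the current placement is fragile, this is roughly
what choose_random_kstack_offset() expands to (static-key check omitted); with
preemption enabled, the task can migrate between the two per-cpu accesses, so
they may touch different CPUs' kstack_offset slots:

	u32 offset = raw_cpu_read(kstack_offset);   /* may execute on CPU A */
	/* ...preemption / migration possible here... */
	offset = ror32(offset, 5) ^ rand;
	raw_cpu_write(kstack_offset, offset);       /* may execute on CPU B */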

> > 
> > 
> > Then we get to the performance aspect...
> > 
> > arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
> > originally came from the crng. get_random_u16() does
> > local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
> > fast path and the slow path). It turns out that this locking/unlocking accounts
> > for 30%-50% of the total cost of kstack offset randomization. By introducing a
> > new raw_try_get_random_uX() helper that's called from a context where irqs are
> > disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
> > costing so much).
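
A minimal sketch of what such a helper might look like, assuming it sits next
to the existing batched-entropy code in drivers/char/random.c (struct batch_u8,
batched_entropy_u8, base_crng); the name follows the raw_try_get_random_uX()
idea above, shown here for u8 (a u16 variant would be analogous), and the exact
shape is speculative:

/*
 * Grab one byte from the per-cpu entropy batch without taking the
 * local lock. Only valid with interrupts disabled. Returns false if
 * the batch is empty or stale so the caller can fall back to the
 * regular get_random_u8() slow path.
 */
static bool raw_try_get_random_u8(u8 *ret)
{
	struct batch_u8 *batch = raw_cpu_ptr(&batched_entropy_u8);

	lockdep_assert_irqs_disabled();

	if (batch->position >= ARRAY_SIZE(batch->entropy) ||
	    READ_ONCE(base_crng.generation) != batch->generation)
		return false;

	*ret = batch->entropy[batch->position];
	batch->entropy[batch->position] = 0;	/* don't leave bits behind */
	batch->position++;
	return true;
}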
> > 
> > Furthermore, given we are actually only using 6 bits of entropy per syscall, we
> > could request a u8 rather than a u16 and throw away only 2 bits
> > instead of 10. This means we drain the entropy buffer half as quickly and
> > make half as many slow calls into the crng:
> > 
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
> > |           |    (baseline) |              |                 |                 |
> > +===========+===============+==============+=================+=================+
> > | getpid    |          0.19 |  (R) -11.43% |      (R) -8.41% |      (R) -5.97% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid   |          0.19 |  (R) -13.81% |      (R) -7.83% |      (R) -6.14% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid   |          0.18 |  (R) -12.22% |      (R) -5.55% |      (R) -3.70% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > 
> > I expect we could even choose to re-buffer and save those 2 bits so we call the
> > slow path even less often.
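
One way to picture the re-buffering idea: a small per-cpu reservoir that pulls
a byte at a time and hands out 6 bits per syscall, so the 2 spare bits are kept
rather than discarded. All names here are hypothetical, it is fed by
get_random_u8() purely for simplicity (the locking question above is separate),
and the caller is assumed to have preemption disabled:

#include <linux/percpu.h>
#include <linux/random.h>

struct kstack_rand_pool {
	u32 bits;	/* buffered random bits */
	u8  count;	/* number of valid bits in 'bits' */
};
static DEFINE_PER_CPU(struct kstack_rand_pool, kstack_rand_pool);

/* Hand out 6 random bits, topping the pool up 8 bits at a time. */
static u32 kstack_rand_bits(void)
{
	struct kstack_rand_pool *p = raw_cpu_ptr(&kstack_rand_pool);
	u32 ret;

	if (p->count < 6) {
		p->bits |= (u32)get_random_u8() << p->count;
		p->count += 8;
	}
	ret = p->bits & 0x3f;
	p->bits >>= 6;
	p->count -= 6;
	return ret;
}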
> > 
> > I believe this helps the mean latency significantly without sacrificing any
> > strength. But it doesn't reduce the tail latency because we still have to call
> > into the crng eventually.
> > 
> > So here's another idea: could we use siphash to generate some random bits? We
> > would generate the secret key at boot using the crng, then generate a 64-bit
> > siphash of (cntvct_el0 ^ tweak), where tweak increments every time we generate a
> > new hash. As long as the key remains secret, the hash is unpredictable.
> > (Perhaps we don't even need the timer value.) Every hash gives us 64 bits, which
> > lasts for 10 syscalls at 6 bits per call. We would still have to call siphash
> > every 10 syscalls, so there would still be a tail, but from my
> > experiments it's much smaller than with the crng:
> > 
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on |         siphash |   Jeremy's prng |
> > |           |    (baseline) |              |                 |                 |
> > +===========+===============+==============+=================+=================+
> > | getpid    |          0.19 |  (R) -11.43% |      (R) -5.74% |      (R) -2.06% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid   |          0.19 |  (R) -13.81% |      (R) -3.39% |      (R) -2.59% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid   |          0.18 |  (R) -12.22% |      (R) -2.43% |          -1.31% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > 
> > Could this give us a middle ground between strong-crng and
> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
> > secret key for a long period?

All these ideas seem fine to me. I agree about only needing 6 bits, so a
u8 is good. Since we already use siphash extensively, that also seems
fine, though it'd be nice if we could find a solution that avoided
intermittent reseeding.
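
For reference, a sketch of how the siphash variant might look. Everything here
is hypothetical: a per-cpu state keyed by a boot-time secret, refilled by
hashing (counter ^ tweak) and drained 6 bits per syscall, so siphash itself
only runs about once every 10 syscalls. get_cycles() stands in for cntvct_el0,
and the initcall placement is illustrative (in practice the key would want to
be generated once the crng is seeded):

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/random.h>
#include <linux/siphash.h>
#include <linux/timex.h>

static siphash_key_t kstack_rand_key __ro_after_init;

struct kstack_rand_state {
	u64 pool;	/* buffered SipHash output */
	u64 tweak;	/* incremented on every refill */
	u8  avail;	/* valid bits left in 'pool' */
};
static DEFINE_PER_CPU(struct kstack_rand_state, kstack_rand_state);

static int __init kstack_rand_init(void)
{
	get_random_bytes(&kstack_rand_key, sizeof(kstack_rand_key));
	return 0;
}
core_initcall(kstack_rand_init);

/* Caller runs with interrupts and preemption disabled. */
static u32 kstack_siphash_bits(void)
{
	struct kstack_rand_state *s = raw_cpu_ptr(&kstack_rand_state);
	u32 ret;

	if (s->avail < 6) {
		s->pool = siphash_1u64(get_cycles() ^ s->tweak++,
				       &kstack_rand_key);
		s->avail = 64;
	}
	ret = s->pool & 0x3f;
	s->pool >>= 6;
	s->avail -= 6;
	return ret;
}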

> > Anyway, I plan to work up a series with the bug fixes and performance
> > improvements. I'll add the siphash approach as an experimental addition and get
> > some more detailed numbers for all the options. But I wanted to raise it all here
> > first to get any early feedback.

Thanks for tackling this!

-Kees

-- 
Kees Cook
