linux-kernel - Re: [DISCUSSION] kstack offset randomization: bugs and performance

Message-ID: <aRysurZNqV6H8Tgc@zx2c4.com>
Date: Tue, 18 Nov 2025 18:28:26 +0100
From: "Jason A. Donenfeld" <Jason@...c4.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Arnd Bergmann <arnd@...db.de>, Kees Cook <kees@...nel.org>,
	Ard Biesheuvel <ardb@...nel.org>,
	Jeremy Linton <jeremy.linton@....com>,
	Will Deacon <will@...nel.org>,
	Catalin Marinas <Catalin.Marinas@....com>,
	Mark Rutland <mark.rutland@....com>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	ebiggers@...nel.org
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance

On Tue, Nov 18, 2025 at 05:21:17PM +0000, Ryan Roberts wrote:
> On 18/11/2025 17:15, Jason A. Donenfeld wrote:
> > On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
> >> On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
> >>> On 17/11/2025 11:30, Ryan Roberts wrote:
> >>>> Hi All,
> >>>>
> >>>> Over the last few years we had a few complaints that syscall performance on
> >>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
> >>>> benchmark that does a lot of fstat and lseek is spending ~10% of its time in
> >>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> >>>> ideas about how performance could be improved.
> >>
> >>
> >>>> I believe this helps the mean latency significantly without sacrificing any
> >>>> strength. But it doesn't reduce the tail latency because we still have to call
> >>>> into the crng eventually.
> >>>>
> >>>> So here's another idea: Could we use siphash to generate some random bits? We
> >>>> would generate the secret key at boot using the crng. Then generate a 64 bit
> >>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> >>>> new hash). As long as the key remains secret, the hash is unpredictable.
> >>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> >>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
> >>>> call siphash every 10 syscalls, so there would still be a tail, but from my
> >>>> experiments, it's much less than the crng:
> >>
> >> IIRC, Jason argued against creating another type of prng inside of the
> >> kernel for a special purpose. 
> > 
> > Yes indeed... I'm really not a fan of adding bespoke crypto willy-nilly
> > like that. Let's make get_random_u*() faster. If you're finding that the
> > issue with it is the locking, and that you're calling this from irq
> > context anyway, then your proposal (if I read this discussion correctly)
> > to add a raw_get_random_u*() seems like it could be sensible. Those
> > functions are generated via macro anyway, so it wouldn't be too much to
> > add the raw overloads. Feel free to send a patch to my random.git tree
> > if you'd like to give that a try.
> 
> Thanks Jason; that's exactly what I did, and it helps. But I think ultimately
> the get_random_uXX() slow path is too slow; that's the part that causes the tail
> latency problem. I doubt there are options for speeding that up?
> 
> Anyway, I'm currently prototyping a few options and getting clear performance
> numbers. I'll be back in a couple of days and we can continue the discussion in
> light of the data.
 
Interesting... I would be curious to see what sorts of stable numbers
you find. Because most of the time, get_random_uXX() should just be
copying memory. Does the unlikely slower case really matter that much? I
suspect it doesn't matter for anything real. On the other hand, it's
probably possible to improve the slow path on ARM a bit by using the
pure-ARM assembly chacha implementation that we use in the vDSO:
https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/tree/arch/arm64/kernel/vdso/vgetrandom-chacha.S
Or by having random.c use the non-generic code already provided by
libcrypto.
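
To make the "just copying memory" point concrete, here is a minimal
userspace sketch of a batched generator. It is not the code in random.c:
the names (struct batch, batch_refill, batch_get_u16), the 64-byte batch
size, and the splitmix64 refill are all assumptions for illustration, and
the refill is not cryptographically secure. It only shows the shape of the
cost: most calls are a small memcpy, and one call per batch pays for the
refill.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BATCH_BYTES 64	/* hypothetical batch size */

struct batch {
	uint8_t buf[BATCH_BYTES];
	size_t pos;		/* next unread byte; BATCH_BYTES means empty */
	uint64_t state;		/* toy generator state, NOT cryptographic */
	unsigned long refills;	/* how often the slow path ran */
};

/* Slow path stand-in: fill the buffer with splitmix64 output. */
static void batch_refill(struct batch *b)
{
	for (size_t i = 0; i < BATCH_BYTES; i += 8) {
		uint64_t z = (b->state += 0x9e3779b97f4a7c15ULL);

		z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
		z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
		z ^= z >> 31;
		memcpy(&b->buf[i], &z, 8);
	}
	b->pos = 0;
	b->refills++;
}

/* Start with an empty buffer so the first request triggers a refill. */
static void batch_init(struct batch *b, uint64_t seed)
{
	memset(b, 0, sizeof(*b));
	b->state = seed;
	b->pos = BATCH_BYTES;
}

/* Fast path: copy two bytes out of the buffer; refill only when empty. */
static uint16_t batch_get_u16(struct batch *b)
{
	uint16_t v;

	assert(b->pos <= BATCH_BYTES);
	if (b->pos + sizeof(v) > BATCH_BYTES)
		batch_refill(b);
	memcpy(&v, &b->buf[b->pos], sizeof(v));
	b->pos += sizeof(v);
	return v;
}
```

With these made-up sizes, 32 consecutive batch_get_u16() calls share one
refill, so the slow path runs on 1 call in 32. Whether that tail matters
in practice is exactly what the numbers should tell us.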

Jason