Message-ID: <b0c4bfc1-1070-48ae-a51a-d6a32f534022@arm.com>
Date: Mon, 17 Nov 2025 14:56:12 -0600
From: Jeremy Linton <jeremy.linton@....com>
To: Ryan Roberts <ryan.roberts@....com>, Kees Cook <kees@...nel.org>,
 Arnd Bergmann <arnd@...db.de>, Ard Biesheuvel <ardb@...nel.org>,
 Will Deacon <will@...nel.org>, Catalin Marinas <Catalin.Marinas@....com>,
 Mark Rutland <mark.rutland@....com>
Cc: "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance

Hi,

On 11/17/25 5:31 AM, Ryan Roberts wrote:
> Sorry; forgot to add Mark and the lists!
> 
> 
> On 17/11/2025 11:30, Ryan Roberts wrote:
>> Hi All,
>>
>> Over the last few years we have had a few complaints that syscall performance on
>> arm64 is slower than on x86. Most recently, it was observed that a certain Java
>> benchmark that does a lot of fstat and lseek is spending ~10% of its time in
>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>> ideas about how performance could be improved.
>>
>> But I'll get to the performance angle in a bit. First, I want to discuss some
>> bugs that I believe I have uncovered during code review...
>>
>>
>> Bug 1: We have the following pattern:
>>
>> add_random_kstack_offset()
>>    enable_interrupts_and_preemption()
>>      do_syscall()
>>    disable_interrupts_and_preemption()
>> choose_random_kstack_offset(random)
>>
>> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
>> a previous call to choose_random_kstack_offset() and stored in a per-cpu
>> variable. But since preemption is enabled during the syscall, surely an attacker
>> could defeat this by arranging for the thread to be preempted or migrated while
>> executing the syscall? That way the new offset is calculated for a different CPU
>> and a subsequent syscall on the original CPU will use the original offset?
>>
>> I think we could just pass the random seed to add_random_kstack_offset() so that
>> we consume the old and buffer the new atomically? We would still buffer it
>> across syscalls to avoid the guessability issue that's documented. Then
>> choose_random_kstack_offset() could be removed. Or we could store it
>> per-task_struct given it is only 32 bits?
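
Something along these lines, perhaps (rough sketch only, untested; the
per-task ->kstack_offset field is hypothetical):

#define add_random_kstack_offset(rand) do {				\
	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,\
				&randomize_kstack_offset)) {		\
		u32 offset = current->kstack_offset;			\
		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
		/* Keep the allocation alive after ptr loses scope. */	\
		asm volatile("" :: "r"(ptr) : "memory");		\
		/* Consume the old offset and buffer the next in one step. */ \
		current->kstack_offset = ror32(offset, 5) ^ (rand);	\
	}								\
} while (0)

That keeps the existing ror32/xor mixing but removes the window where
preemption or migration can split the consume and buffer halves.
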
>>
>>
>> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
>> document their requirement to be called with interrupts and preemption disabled.
>> They use raw_cpu_*(), which require this. But on arm64, they are called from
>> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
>> don't think this will cause functional harm for arm64's implementations of
>> raw_cpu_*(), but it means that it's possible that the wrong per-cpu structure is
>> being referred to. Perhaps there is a way for user code to exploit this to
>> defeat the purpose of the feature.
>>
>> This should be straightforward to fix; if we take the task_struct approach for
>> bug 1, then that would fix this issue too because the requirement to be in
>> atomic context goes away. Otherwise it can be moved earlier in the call chain,
>> before interrupts are enabled.
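
For the "move it earlier" alternative, the ordering would be roughly as
below; the exact placement in the arm64 entry path is illustrative only,
and next_rand is a placeholder for entropy gathered while interrupts are
still masked:

	/* still running with interrupts/preemption masked here */
	add_random_kstack_offset();
	choose_random_kstack_offset(next_rand);

	local_daif_restore(DAIF_PROCCTX);	/* interrupts enabled from here */
	invoke_syscall(regs, scno, sc_nr, syscall_table);
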
>>
>>
>> Then we get to the performance aspect...
>>
>> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
>> originally came from the crng. get_random_u16() does
>> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
>> fastpath and the slow path). It turns out that this locking/unlocking accounts
>> for 30%-50% of the total cost of kstack offset randomization. By introducing a
>> new raw_try_get_random_uX() helper that's called from a context where irqs are
>> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
>> costing so much).

IIRC it was because there is another thread filling the batch of random numbers.
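
At the call site that would presumably look something like this (the
helper's signature is a guess; it should only take the per-cpu batch fast
path, with no local_lock and no refill):

	u8 rand;

	/* only valid while irqs/preemption are still disabled */
	if (raw_try_get_random_u8(&rand))
		choose_random_kstack_offset(rand);
	/* else: keep the previous offset; refill the batch from a safer context */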


>>
>> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
>> could request a u8 rather than a u16 and only throw away 2 bits instead of
>> 10. This means we drain the entropy buffer half as quickly and
>> make half as many slow calls into the crng:
>>
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
>> |           |    (baseline) |              |                 |                 |
>> +===========+===============+==============+=================+=================+
>> | getpid    |          0.19 |  (R) -11.43% |      (R) -8.41% |      (R) -5.97% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | getppid   |          0.19 |  (R) -13.81% |      (R) -7.83% |      (R) -6.14% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | invalid   |          0.18 |  (R) -12.22% |      (R) -5.55% |      (R) -3.70% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>>
>> I expect we could even choose to re-buffer and save those 2 bits so we call the
>> slow path even less often.
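
That could be a small bit bucket along these lines (names hypothetical;
per-task or per-cpu as per bug 1):

struct kstack_rnd {
	u32	bits;	/* buffered random bits */
	u8	nbits;	/* number of valid bits */
};

/* Return 6 random bits, topping up from the crng batch only when needed. */
static inline u8 kstack_rnd_get6(struct kstack_rnd *r)
{
	u8 out;

	if (r->nbits < 6) {
		r->bits |= (u32)get_random_u8() << r->nbits;
		r->nbits += 8;
	}
	out = r->bits & 0x3f;
	r->bits >>= 6;
	r->nbits -= 6;
	return out;
}
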
>>
>> I believe this helps the mean latency significantly without sacrificing any
>> strength. But it doesn't reduce the tail latency because we still have to call
>> into the crng eventually.
>>
>> So here's another idea: Could we use siphash to generate some random bits? We
>> would generate the secret key at boot using the crng. Then generate a 64-bit
>> siphash of (cntvct_el0 ^ tweak), where tweak increments every time we generate a
>> new hash. As long as the key remains secret, the hash is unpredictable.
>> (Perhaps we don't even need the timer value.) Every hash gives us 64 bits, which
>> would last for 10 syscalls at 6 bits per call. So we would still have to call
>> siphash every 10 syscalls, meaning there would still be a tail, but from my
>> experiments it's much less than with the crng:
>>
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | Benchmark | randomize=off | randomize=on |         siphash |   Jeremy's prng |
>> |           |    (baseline) |              |                 |                 |
>> +===========+===============+==============+=================+=================+
>> | getpid    |          0.19 |  (R) -11.43% |      (R) -5.74% |      (R) -2.06% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | getppid   |          0.19 |  (R) -13.81% |      (R) -3.39% |      (R) -2.59% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | invalid   |          0.18 |  (R) -12.22% |      (R) -2.43% |          -1.31% |
>> +-----------+---------------+--------------+-----------------+-----------------+
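
Concretely, something like the below (names hypothetical; per-cpu/per-task
placement and locking elided):

#include <linux/siphash.h>
#include <linux/random.h>

/* filled once at boot: get_random_bytes(&kstack_rnd_key, sizeof(kstack_rnd_key)) */
static siphash_key_t kstack_rnd_key __ro_after_init;

struct kstack_rnd_state {
	u64	pool;	/* remaining hash output */
	u32	nbits;	/* valid bits left in pool */
	u64	tweak;	/* incremented per rehash */
};

static u8 kstack_rnd_next6(struct kstack_rnd_state *s)
{
	u8 out;

	if (s->nbits < 6) {
		/* one rehash roughly every 10 syscalls at 6 bits per call;
		 * cntvct_el0 could be xor'd into the tweak here if wanted */
		s->pool = siphash_1u64(s->tweak++, &kstack_rnd_key);
		s->nbits = 64;
	}
	out = s->pool & 0x3f;
	s->pool >>= 6;
	s->nbits -= 6;
	return out;
}
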
>>
>> Could this give us a middle ground between strong-crng and
>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>> secret key for a long period?

This is sort of fundamental to (C)PRNGs. There is an estimated cycle length
(2^112, it says, for prandom), and the more state you keep hidden to increase
that value, the slower the generation becomes.

If you're not worried about tail latency, then just use prandom per the patch
below and xor some data from get_random_* into the local state at some
interval, say every million calls. One might argue that a couple of bits from a
counter at some interval would be enough if the TSC is good enough for x86, but
frankly any counter must be independent of the core CPU clocks for those values
to have any hope of being unpredictable. Or just reseed it only on a subset of
syscalls known to have a lot of variance, or during some other per-CPU work;
that at least will hide when it is being reseeded from someone measuring the
latency variance.
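
In outline that might look like the below (independent of the actual patch;
the interval and mixing are illustrative, and the __seed() guards keep each
component above prandom's minimum values):

#include <linux/prandom.h>
#include <linux/random.h>
#include <linux/bitops.h>

#define KSTACK_RESEED_INTERVAL	(1 << 20)	/* ~every million calls */

struct kstack_prng {
	struct rnd_state	state;
	u32			calls;
};

static u32 kstack_prng_next(struct kstack_prng *p)
{
	if (unlikely(++p->calls >= KSTACK_RESEED_INTERVAL)) {
		u32 fresh = get_random_u32();

		p->calls = 0;
		/* xor fresh crng output into the existing state */
		p->state.s1 = __seed(p->state.s1 ^ fresh, 2);
		p->state.s2 = __seed(p->state.s2 ^ rol32(fresh, 7), 8);
		p->state.s3 = __seed(p->state.s3 ^ rol32(fresh, 13), 16);
		p->state.s4 = __seed(p->state.s4 ^ rol32(fresh, 23), 128);
	}

	return prandom_u32_state(&p->state);
}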

I guess it depends on how strong you believe the existing system is. My
perspective is that reseeding it is probably a waste of time since it's being
quantized down to 2^5 bits anyway.

>>
>>
>> Anyway, I plan to work up a series with the bugfixes and performance
>> improvements. I'll add the siphash approach as an experimental addition and get
>> some more detailed numbers for all the options. But I wanted to raise it all
>> here first to get early feedback.
>>
>>
>> [1]
>> https://lore.kernel.org/linux-arm-kernel/20240305221824.3300322-1-jeremy.linton@arm.com/
>>
>> Thanks,
>> Ryan
>>
> 

