Message-ID: <ee19b0eb-8039-4598-ac94-09fd0417fd84@arm.com>
Date: Tue, 18 Nov 2025 11:05:45 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Jeremy Linton <jeremy.linton@....com>, Kees Cook <kees@...nel.org>,
Arnd Bergmann <arnd@...db.de>, Ard Biesheuvel <ardb@...nel.org>,
Will Deacon <will@...nel.org>, Catalin Marinas <Catalin.Marinas@....com>,
Mark Rutland <mark.rutland@....com>
Cc: "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance
On 17/11/2025 20:56, Jeremy Linton wrote:
> Hi,
>
> On 11/17/25 5:31 AM, Ryan Roberts wrote:
>> Sorry; forgot to add Mark and the lists!
>>
>>
>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Over the last few years we had a few complaints that syscall performance on
>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>>> benchmark that does a lot of fstat and lseek is spending ~10% of its time in
>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>>> ideas about how performance could be improved.
>>>
>>> But I'll get to the performance angle in a bit. First, I want to discuss some
>>> bugs that I believe I have uncovered during code review...
>>>
>>>
>>> Bug 1: We have the following pattern:
>>>
>>> add_random_kstack_offset()
>>> enable_interrupts_and_preemption()
>>> do_syscall()
>>> disable_interrupts_and_preemption()
>>> choose_random_kstack_offset(random)
>>>
>>> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
>>> a previous call to choose_random_kstack_offset() and stored in a per-cpu
>>> variable. But since preemption is enabled during the syscall, surely an attacker
>>> could defeat this by arranging for the thread to be preempted or migrated while
>>> executing the syscall? That way the new offset would be buffered on a different
>>> CPU, and a subsequent syscall on the original CPU would reuse the old offset.
>>>
>>> I think we could just pass the random seed to add_random_kstack_offset() so that
>>> we consume the old value and buffer the new one atomically? We would still buffer
>>> it across syscalls to avoid the guessability issue that's documented. Then
>>> choose_random_kstack_offset() could be removed. Or we could store the value in
>>> task_struct, given it is only 32 bits?
>>>
>>>
>>> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
>>> document their requirement to be called with interrupts and preemption disabled.
>>> They use raw_cpu_*(), which require this. But on arm64, they are called from
>>> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
>>> don't think this will cause functional harm for arm64's implementations of
>>> raw_cpu_*(), but it does mean that the wrong CPU's per-cpu structure may be
>>> referenced. Perhaps there is a way for user code to exploit this to defeat the
>>> purpose of the feature.
>>>
>>> This should be straightforward to fix; if we take the task_struct approach for
>>> bug 1, that would also fix this issue, because the requirement to be in atomic
>>> context goes away. Otherwise the calls can be moved earlier in the callchain,
>>> before interrupts are enabled.
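
To be concrete about the task_struct variant (which would address both bugs), I'm
imagining something like this minimal sketch; the new task_struct field and the
single-argument form of the helper are hypothetical, and the static key check is
dropped for brevity:

/*
 * Minimal sketch: "kstack_offset" is a hypothetical new u32 field in
 * task_struct. Consuming the previously buffered offset and buffering the
 * new seed happen in one place, so preemption or migration during the
 * syscall can no longer split the two, and no per-cpu access (and hence
 * no need for a non-preemptible context) remains.
 */
#define add_random_kstack_offset(rand) do {                             \
        u32 old = current->kstack_offset;                               \
        u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(old));              \
        /* Keep the allocation even after "ptr" loses scope. */         \
        asm volatile("" :: "r"(ptr) : "memory");                        \
        current->kstack_offset = (rand);                                \
} while (0)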
>>>
>>>
>>> Then we get to the performance aspect...
>>>
>>> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
>>> originally came from the crng. get_random_u16() does
>>> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
>>> fastpath and the slow path). It turns out that this locking/unlocking accounts
>>> for 30%-50% of the total cost of kstack offset randomization. By introducing a
>>> new raw_try_get_random_uX() helper that's called from a context where irqs are
>>> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
>>> costing so much).
>
> IIRC it was because there is another thread filling the batch of random numbers.
Err, why should that cause a slowdown? The lock is a local_lock, so it shouldn't
cause contention between 2 CPUs. Conceptually each CPU has its own lock, right?
For non-RT:

#define local_lock_irqsave(lock, flags) \
        __local_lock_irqsave(this_cpu_ptr(lock), flags)

#define __local_lock_irqsave(lock, flags) \
        do { \
                local_irq_save(flags); \
                __local_lock_acquire(lock); \
        } while (0)
local_irq_save() resolves to these 2 functions, which don't "look" expensive:
static __always_inline unsigned long __daif_local_save_flags(void)
{
        return read_sysreg(daif);
}

static __always_inline void __daif_local_irq_disable(void)
{
        barrier();
        asm volatile("msr daifset, #3");
        barrier();
}
__local_lock_acquire() resolves to the following; the lock is a local_lock_t, so
this effectively does nothing:
#define __local_lock_acquire(lock) \
        do { \
                local_trylock_t *tl; \
                local_lock_t *l; \
                \
                l = (local_lock_t *)(lock); \
                tl = (local_trylock_t *)l; \
                _Generic((lock), \
                        local_trylock_t *: ({ \
                                lockdep_assert(tl->acquired == 0); \
                                WRITE_ONCE(tl->acquired, 1); \
                        }), \
                        local_lock_t *: (void)0); \
                local_lock_acquire(l); \
        } while (0)
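
For reference, the raw_try_get_random_uX() helper I mentioned above is roughly
this shape (a sketch only: the batch structure just mirrors, in simplified form,
the batched entropy in drivers/char/random.c, and the generation check and refill
are left to the regular slow path):

/* Simplified stand-in for the per-cpu batch in drivers/char/random.c. */
struct batch_u8 {
        u8              entropy[96];    /* size illustrative */
        unsigned int    position;
};
static DEFINE_PER_CPU(struct batch_u8, batched_entropy_u8);

/*
 * Fast-path-only variant for callers that already run with interrupts
 * disabled, so no local_lock_irqsave() is needed. Returns false if the
 * batch is empty; the caller then falls back to get_random_u8().
 */
static bool raw_try_get_random_u8(u8 *ret)
{
        struct batch_u8 *batch = raw_cpu_ptr(&batched_entropy_u8);

        lockdep_assert_irqs_disabled();

        if (batch->position >= ARRAY_SIZE(batch->entropy))
                return false;

        *ret = batch->entropy[batch->position];
        batch->entropy[batch->position++] = 0;
        return true;
}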
>
>
>>>
>>> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
>>> could instead just request a u8 instead of a u16 and only throw away 2 bits
>>> instead of 10 bits. This means we drain the entropy buffer half as quickly and
>>> make half as many slow calls into the crng:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> I expect we could even choose to re-buffer and save those 2 bits so we call the
>>> slow path even less often.
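
For reference, the arithmetic as I understand it: KSTACK_OFFSET_MAX() masks the
value to 10 bits and the 16-byte stack alignment then throws away the bottom 4,
leaving 6 usable bits. So on the arm64 side the change would just be something
like:

        /* Sketch: only 6 bits survive the mask + alignment, so a u8 is plenty. */
        choose_random_kstack_offset(get_random_u8());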
>>>
>>> I believe this helps the mean latency significantly without sacrificing any
>>> strength. But it doesn't reduce the tail latency because we still have to call
>>> into the crng eventually.
>>>
>>> So here's another idea: Could we use siphash to generate some random bits? We
>>> would generate the secret key at boot using the crng. Then generate a 64 bit
>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>>> new hash). As long as the key remains secret, the hash is unpredictable.
>>> (perhaps we don't even need the timer value). Every hash gives us 64 bits,
>>> which lasts 10 syscalls at 6 bits per call. We would still have to call
>>> siphash every 10 syscalls, so there would still be a tail, but from my
>>> experiments it's much smaller than the crng's:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> Could this give us a middle ground between strong-crng and
>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>> secret key for a long period?
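
For clarity, the siphash experiment in the table above is roughly the following
shape (a sketch only: the names are mine, the key would be filled from the crng
at boot, and it assumes it's called with preemption disabled per the bug fixes
above):

#include <linux/siphash.h>

/* Filled once at boot with get_random_bytes(); must stay secret. */
static siphash_key_t kstack_key __ro_after_init;

static DEFINE_PER_CPU(u64, kstack_bits);
static DEFINE_PER_CPU(unsigned int, kstack_bits_left);
static DEFINE_PER_CPU(u64, kstack_tweak);

/* Hand out 6 bits per syscall; re-hash roughly every 10 calls. */
static u32 kstack_rand6(void)
{
        u64 *bits = this_cpu_ptr(&kstack_bits);
        unsigned int *left = this_cpu_ptr(&kstack_bits_left);
        u32 ret;

        if (*left < 6) {
                u64 in = read_sysreg(cntvct_el0) ^ this_cpu_inc_return(kstack_tweak);

                *bits = siphash_1u64(in, &kstack_key);
                *left = 64;
        }

        ret = *bits & 0x3f;
        *bits >>= 6;
        *left -= 6;
        return ret;
}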
>
> This is sorta fundamental with the (C)PRNGs. There is an estimated cycle
> length (2^112 it says for prandom), and the more state you keep hidden to
> increase that value, the slower the generation becomes.
>
> If you're not worried about tail latency, then just use prandom per the patch
> below and xor in some data from the get_random_* into the local state on some
> interval, say every million calls. One might argue that a couple of bits of a
> counter on some interval would be enough if the tsc is good enough for x86, but
> frankly any counter must be independent of the core cpu clocks to have any
> hope of those values being unpredictable. Or just reseed it only on a subset of
> syscalls known to have a lot of variance, or during some other per-cpu work;
> that will at least hide when it's being reseeded from someone measuring the
> latency variance.
All sound reasonable to me. Lemme get some data for different options.
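
To check we're thinking of the same thing, the reseed-on-an-interval variant I'd
benchmark is roughly this (a sketch; the interval is a placeholder, it re-seeds
rather than xor-mixing into the existing state as you suggest, and it assumes the
caller runs with preemption disabled per the bug fixes above):

#include <linux/prandom.h>

static DEFINE_PER_CPU(struct rnd_state, kstack_rnd);
static DEFINE_PER_CPU(u32, kstack_rnd_calls);

static u32 kstack_prandom(void)
{
        struct rnd_state *state = this_cpu_ptr(&kstack_rnd);

        /* Periodically pull fresh crng output into the local state. */
        if ((this_cpu_inc_return(kstack_rnd_calls) & ((1 << 20) - 1)) == 0)
                prandom_seed_state(state, get_random_u64());

        return prandom_u32_state(state);
}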
>
> I guess it depends on how strong you believe the existing system is. My
> perspective is that reseeding it is probably a waste of time since it's being
> quantized down to 2^5 bits anyway.
nit: it's 6 bits so 2^6 = 64 values.
I tend to agree; if you have some stack attack where you need to know the
offset, I guess you can just pick one offset and repeat the attack until it
succeeds, and on average it will succeed 1 in 64 times. It doesn't sound to me
like we should be paying such a performance price for that level of protection.
Thanks,
Ryan
>
>>>
>>>
>>> Anyway, I plan to work up a series with the bugfixes and performance
>>> improvements. I'll add the siphash approach as an experimental addition and get
>>> some more detailed numbers for all the options. But wanted to raise it all here
>>> first to get any early feedback.
>>>
>>>
>>> [1]
>>> https://lore.kernel.org/linux-arm-kernel/20240305221824.3300322-1-
>>> jeremy.linton@....com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>