Message-ID: <cc9cd056-3786-42db-8e40-bb0425dfe142@arm.com>
Date: Thu, 27 Nov 2025 14:09:04 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Ard Biesheuvel <ardb@...nel.org>
Cc: Kees Cook <kees@...nel.org>, Will Deacon <will@...nel.org>,
 Arnd Bergmann <arnd@...db.de>, Jeremy Linton <jeremy.linton@....com>,
 Catalin Marinas <Catalin.Marinas@....com>,
 Mark Rutland <mark.rutland@....com>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance

On 27/11/2025 12:19, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 12:50, Ryan Roberts <ryan.roberts@....com> wrote:
>>
>> On 27/11/2025 08:00, Kees Cook wrote:
>>> On Wed, Nov 26, 2025 at 11:58:40PM +0100, Ard Biesheuvel wrote:
> ...
>>>> the tail latency issue, but I'm not sure I understand why that is a
>>>> problem to begin with if it occurs sufficiently rarely. Is that a
>>>> PREEMPT_RT issue?
>>
>> Yes; RT was Jeremy's original motivation for looking at the prng approach.
>>
>> For the issue I see, improving the mean would be sufficient, but improving the
>> tail too is a bonus.
>>
>>>> Would it be better if the refill of the per-CPU
>>>> batched entropy buffers was relegated to some kind of kthread so it
>>>> can be scheduled independently? (Those buffers are all the same size
>>>> so we could easily keep a few hot spares)
>>
>> That came up in Jeremy's thread last year. My understanding was that this would
>> not help because either the thread is lower priority, in which case you can't
>> guarantee it will run, or it is higher priority, in which case the RT thread
>> still takes the glitch. (But I'm hand waving - I'm no expert on the details.)
>>
> 
> PREEMPT_RT is generally more concerned about the worst case latency
> being bounded rather than being as low as possible.

Sure, but if you can reduce the tail, that's still "better", right?

> 
> The get_random fallback runs a few rounds of chacha20, which takes
> more time than just reading the next value and bumping the position
> counter. But that does not imply it fails to meet RT constraints.
> 
> And if a thread running ChaCha20 in the background fails to get enough
> cycles, it is not an RT problem, it is an ordinary starvation problem,
> which can only be addressed by doing less work in total. But cranking
> prandom_u32_state() on every syscall is not free either.

Indeed, but it's a lot cheaper than get_random. See:

https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
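
To make that concrete, here's a rough sketch of the kind of thing I mean (not
the actual patch from the series above; the function names and the reseed hook
are illustrative): a per-CPU prandom state consumed on syscall entry and
reseeded from the CRNG off the fast path. Mainline arm64 feeds
choose_random_kstack_offset() from get_random_u16(), IIRC.

/*
 * Illustrative sketch only -- not the patch in the linked series.
 * Feed kstack offset randomization from a cheap per-CPU PRNG and
 * reseed it from the CRNG off the syscall fast path.
 */
#include <linux/percpu.h>
#include <linux/prandom.h>
#include <linux/random.h>
#include <linux/randomize_kstack.h>

static DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);

/* Syscall entry path. */
static __always_inline void kstack_choose_offset_prng(void)
{
	/* A handful of shifts/xors on per-CPU state, no locking. */
	choose_random_kstack_offset(
		prandom_u32_state(raw_cpu_ptr(&kstack_rnd_state)));
}

/* Occasional reseed, e.g. from a timer or workqueue (details elided). */
static void kstack_reseed_prng(void)
{
	struct rnd_state *state = get_cpu_ptr(&kstack_rnd_state);

	prandom_seed_state(state, get_random_u64());
	put_cpu_ptr(&kstack_rnd_state);
}

prandom_u32_state() is just a short LFSR update on per-CPU state, which is
where the cost difference relative to the chacha-backed get_random path comes
from.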

> 
> In summary, it would be good to have a better problem statement wrt RT
> constraints before assuming that 99% tail latency is something to
> obsess about, especially given the fact that getpid() is not that
> representative a syscall to begin with.

I think that's a fair point. But I also think the results I link above show very
clearly that one approach is more performant than the other, in terms of the
overhead of syscall entry and exit. And as I said when starting this thread,
that is something we have had complaints about from partners.

Personally, based on that data, I think we could reduce it to this decision tree:

is a prng good enough for kstack offset randomization?
  yes: is 3% syscall entry/exit overhead a reasonable price?
    yes: Land my series
    no: rip out kstack offset randomization
  no: is 10% syscall entry/exit overhead a reasonable price?
    yes: Land Ard's series
    no: rip out kstack offset randomization

For the avoidance of doubt, my opinion is that prng is good enough for 6 bits.

By the way, my sense is that we won't get much below 3% no matter what we do. It
looks to me like it could be bottlenecked on __alloca(), which forces any
speculation using the incorrect stack address to be abandoned. So I don't think
offloading to a thread will end up helping us much. I don't have data that shows
that conclusively, but that's my intuition from some earlier benchmarking.
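
For context, the offset is consumed roughly like this (a simplified paraphrase
of the generic helper in include/linux/randomize_kstack.h; the real macro also
has a static branch and per-arch details that I've trimmed):

#define add_random_kstack_offset_sketch() do {				\
	u32 offset = raw_cpu_read(kstack_offset);			\
	u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));		\
	/* Keep the allocation live after ptr goes out of scope. */	\
	asm volatile("" :: "r"(ptr) : "memory");			\
} while (0)

The stack pointer adjustment depends on a runtime value either way, so that
speculation cost is paid regardless of how cheaply we generate the value.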

Thanks,
Ryan

