Message-ID: <CAMj1kXGyYMy2xhcdNicHkMfWBnEjyhc+xg8ciuR-6WXDxDpZxg@mail.gmail.com>
Date: Thu, 27 Nov 2025 16:03:31 +0100
From: Ard Biesheuvel <ardb@...nel.org>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Ard Biesheuvel <ardb+git@...gle.com>, linux-hardening@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Kees Cook <kees@...nel.org>, Will Deacon <will@...nel.org>, Arnd Bergmann <arnd@...db.de>,
Jeremy Linton <jeremy.linton@....com>, Catalin Marinas <Catalin.Marinas@....com>,
Mark Rutland <mark.rutland@....com>, "Jason A. Donenfeld" <Jason@...c4.com>
Subject: Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack

On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@....com> wrote:
>
> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@....com> wrote:
> >>
> >> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <ardb@...nel.org>
> >>>
> >>> Ryan reports that get_random_u16() is dominant in the performance
> >>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>
> >>> This is the reason many architectures rely on a counter instead, and
> >>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>> is gathered and recorded in a per-CPU variable.
> >>>
> >>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>> making the logic at syscall exit redundant.
> >>
> >> I ran the same set of syscall benchmarks for this series as I've done for my
> >> series.
> >>
> >
> > Thanks!
> >
> >
> >> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >> performance cost of turning it on without any changes to the implementation,
> >> then the reduced performance cost of turning it on with my changes applied, and
> >> finally cost of turning it on with Ard's changes applied:
> >>
> >> arm64 (AWS Graviton3):
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | Benchmark | Result Class | v6.18-rc5 | per-task-prng | fast-get-random |
> >> | | | rndstack-on | | |
> >> +=================+==============+=============+===============+=================+
> >> | syscall/getpid | mean (ns) | (R) 15.62% | (R) 3.43% | (R) 11.93% |
> >> | | p99 (ns) | (R) 155.01% | (R) 3.20% | (R) 11.00% |
> >> | | p99.9 (ns) | (R) 156.71% | (R) 2.93% | (R) 11.39% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/getppid | mean (ns) | (R) 14.09% | (R) 2.12% | (R) 10.44% |
> >> | | p99 (ns) | (R) 152.81% | 1.55% | (R) 9.94% |
> >> | | p99.9 (ns) | (R) 153.67% | 1.77% | (R) 9.83% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/invalid | mean (ns) | (R) 13.89% | (R) 3.32% | (R) 10.39% |
> >> | | p99 (ns) | (R) 165.82% | (R) 3.51% | (R) 10.72% |
> >> | | p99.9 (ns) | (R) 168.83% | (R) 3.77% | (R) 11.03% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >>
> >
> > What does the (R) mean?
> >
> >> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >> that fully explains it though.
> >>
> >> But it's still a 10% cost on average.
> >>
> >> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>
> >
> > Interesting!
> >
> > So the only thing that get_random_u8() does that could explain the
> > delta is calling into the scheduler on preempt_enable(), given that it
> > does very little beyond that.
> >
> > Would you mind repeating this experiment after changing the
> > put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>
> This has no impact on performance.
>

Thanks. But this is really rather surprising: what else could be
taking up that time, given that the fast path consists only of some
loads and stores to the buffer and a cmpxchg64_local()? Could it be
the latter that is causing so much latency? I suppose local cmpxchg()
semantics don't really exist on arm64, and this ends up using the
exact same LSE instruction that would be used for an ordinary
cmpxchg(), unlike on x86, where the LOCK prefix appears to be omitted.
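
(To make sure we are talking about the same thing: the fast path I
have in mind is roughly the shape below. This is only a sketch; the
struct layout, the batch size and the refill_and_get() slow-path
helper are illustrative, not lifted from the actual series.)

  #include <linux/types.h>
  #include <linux/percpu.h>
  #include <linux/atomic.h>

  struct u8_batch {
          u8      bytes[64];      /* pre-generated random bytes */
          u64     pos;            /* next unconsumed byte */
  };

  static DEFINE_PER_CPU(struct u8_batch, u8_batch);

  static u8 refill_and_get(struct u8_batch *b);  /* hypothetical slow path */

  u8 sketch_get_random_u8(void)
  {
          struct u8_batch *b = &get_cpu_var(u8_batch);
          u64 pos = READ_ONCE(b->pos);
          u8 ret;

          /* fast path: a couple of loads/stores plus one cmpxchg64_local() */
          if (likely(pos < sizeof(b->bytes)) &&
              cmpxchg64_local(&b->pos, pos, pos + 1) == pos)
                  ret = b->bytes[pos];
          else
                  ret = refill_and_get(b);

          put_cpu_var(u8_batch);
          return ret;
  }
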
In any case, there is no debate that your code is faster on arm64. I
also think that using prandom for this purpose is perfectly fine, even
without reseeding: with a 2^113 period and only 6 observable bits per
32-bit sample, predicting the next value reliably is maybe not
impossible, but hardly worth the extensive effort, given that we're
not generating cryptographic keys here.
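
(For reference, my mental model of the per-task prandom variant is
roughly the following; the function name is mine, purely for
illustration, and not taken from your series:)

  #include <linux/prandom.h>
  #include <linux/randomize_kstack.h>

  static inline u32 kstack_rand(struct rnd_state *state)
  {
          /*
           * One 32-bit sample per syscall entry. KSTACK_OFFSET_MAX()
           * keeps 10 bits of it, and the 16-byte stack alignment means
           * only ~6 of those are observable, out of a generator with a
           * ~2^113 period.
           */
          return KSTACK_OFFSET_MAX(prandom_u32_state(state));
  }
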
So the question is really whether we want to dedicate 16 bytes per
task for this. I wouldn't mind personally, but it is something our
internal QA engineers tend to obsess over.
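
(Concretely, the 16 bytes are a struct rnd_state, i.e. four u32 words,
seeded once per task; the struct and function names below are just my
shorthand, not necessarily how your series spells it:)

  #include <linux/prandom.h>
  #include <linux/random.h>

  struct kstack_prng {
          struct rnd_state state;         /* 4 x u32 == 16 bytes per task */
  };

  static void kstack_prng_seed(struct kstack_prng *p)
  {
          /* seed once, e.g. at fork; never reseeded afterwards */
          prandom_seed_state(&p->state, get_random_u64());
  }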