[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250221193124.GA3790599@google.com>
Date: Fri, 21 Feb 2025 19:31:24 +0000
From: Eric Biggers <ebiggers@...nel.org>
To: Xiao Liang <shaw.leon@...il.com>
Cc: x86@...nel.org, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org, Ard Biesheuvel <ardb@...nel.org>,
Ben Greear <greearb@...delatech.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
"Jason A . Donenfeld" <Jason@...c4.com>
Subject: Re: [RFC PATCH 1/2] x86/fpu: make kernel-mode FPU reliably usable in
softirqs
Hi Xiao,
On Fri, Feb 21, 2025 at 03:38:27PM +0800, Xiao Liang wrote:
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs (except when hardirqs are disabled).
> >
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections, as that prevents the nesting that was problematic.
> >
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if they
> > tried to do any en/decryption. Now these softirqs will just run at the
> > end of the task context kernel-mode FPU section (since local_bh_enable()
> > runs pending softirqs) and will no longer take the slow fallback path.
>
> I think this will delay all softirqs, including those that don't use
> FPU. Will there be a performance impact?
> (I guess you've noticed the patch I submitted last year. And this is
> the main reason why it was implemented in the way you mentioned as
> the second alternative.)
Thanks for taking a look at this patch! It's true that this patch makes all
softirqs on the same CPU be delayed until the end of the current kernel-mode FPU
section. But, I'm a bit skeptical that it actually matters enough on x86 to go
with a more complex solution that would allow nested kernel-mode FPU.
Kernel-mode FPU sections are generally short; the usual cases are en/decrypting
disk sectors or network packets that are 4 KiB or less.
Even if a longer buffer is passed in, most of the x86 SIMD-optimized code
already divides the buffer into chunks of at most 4 KiB and uses a separate
kernel-mode FPU section for each chunk. This happens either explicitly, or
implicitly via the skcipher_walk_* functions which never return more than
PAGE_SIZE (i.e. 4 KiB on x86) in a single step. There is some code that does
not do this, e.g. the CRC code, but that could easily be fixed.
The commonly-used x86 SIMD-optimized code is also super fast these days, and is
only getting faster. For example, on an AMD desktop processor from this year I
get roughly 35 GB/s AES-256-XTS, 25 GB/s AES-256-GCM, or 80 GB/s any CRC,
courtesy of VAES and VPCLMULQDQ (measuring single-threaded throughput at max
frequency). That works out to 50-165 nanoseconds per 4 KiB. Increasingly these
algorithms can be thought of as similar to memcpy() in speed.
Of course, the worst case is probably about 100x slower -- consider a CPU that
is much older, and from a low-voltage product line (e.g. Intel Atom), and not
running at its max frequency, and computing a much slower crypto algorithm that
lacks hardware acceleration like Serpent-XTS, or even AES-something if the CPU
is so old (over 15 years) as to lack AES-NI.
But, the super slow crypto algorithms are becoming increasingly rare. The
crypto algorithms in use these days tend to have hardware acceleration on x86
(via AES-NI, PCLMULQDQ, or SHA extensions) or at least be fast with SSE / AVX.
So while the worst case is likely about 20 microseconds on certain systems where
everything lines up the wrong way, realistically the worst case on most systems
based on what's actually being used is probably less than 1 microsecond.
That seems probably short enough to be acceptable? Remember that preemption was
already being disabled during this time. And this is only on one CPU.
I think it's also important to note that when local_bh_enable() re-enables
softirq processing (when called from kernel_fpu_end()), it also immediatelly
runs any pending softirqs. Thus there would be no additional delay; the CPU
will *immediately* run any pending softirqs.
As for supporting nested kernel-mode FPU if we wanted to go that way: yes, your
patch from last year
https://lore.kernel.org/lkml/20240403140138.393825-1-shaw.leon@gmail.com/
ostensibly did that. However, I found some bugs in it; e.g., it didn't take
into account that struct fpu is variable-length. So it didn't turn out as
simple as that patch made it seem. Just extending fpregs_{lock,unlock}() to
kernel-mode FPU is a simpler solution with fewer edge cases, and it avoids
increasing the memory usage of the kernel. So I thought I'd propose that first.
- Eric
Powered by blists - more mailing lists