Message-ID: <20250305173925.GA4014401@google.com>
Date: Wed, 5 Mar 2025 17:39:25 +0000
From: Eric Biggers <ebiggers@...nel.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: x86@...nel.org, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org, Ard Biesheuvel <ardb@...nel.org>,
Ben Greear <greearb@...delatech.com>,
Xiao Liang <shaw.leon@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
"Jason A . Donenfeld" <Jason@...c4.com>
Subject: Re: [RFC PATCH v2] x86/fpu: make kernel-mode FPU reliably usable in
softirqs
On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
>
> * Eric Biggers <ebiggers@...nel.org> wrote:
>
> > From: Eric Biggers <ebiggers@...gle.com>
> >
> > Currently kernel-mode FPU is not always usable in softirq context on
> > x86, since softirqs can nest inside a kernel-mode FPU section in task
> > context, and nested use of kernel-mode FPU is not supported.
> >
> > Therefore, x86 SIMD-optimized code that can be called in softirq context
> > has to sometimes fall back to non-SIMD code. There are two options for
> > the fallback, both of which are pretty terrible:
> >
> > (a) Use a scalar fallback. This can be 10-100x slower than vectorized
> > code because it cannot use specialized instructions like AES, SHA,
> > or carryless multiplication.
> >
> > (b) Execute the request asynchronously using a kworker. In other
> > words, use the "crypto SIMD helper" in crypto/simd.c.
> >
> > Currently most of the x86 en/decryption code (skcipher and aead
> > algorithms) uses option (b), since this avoids the slow scalar fallback
> > and it is easier to wire up. But option (b) is still really bad for its
> > own reasons:
> >
> > - Punting the request to a kworker is bad for performance too.
> >
> > - It forces the algorithm to be marked as asynchronous
> > (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
> > users who request a synchronous algorithm. That's another huge
> > performance problem, which is especially unfortunate for users who
> > don't even do en/decryption in softirq context.
> >
> > - It makes all en/decryption operations take a detour through
> > crypto/simd.c. That involves additional checks and an additional
> > indirect call, which slow down en/decryption for *everyone*.
> >
> > Fortunately, the skcipher and aead APIs are only usable in task and
> > softirq context in the first place. Thus, if kernel-mode FPU were to
> > be reliably usable in softirq context, no fallback would be needed.
> > Indeed, other architectures such as arm, arm64, and riscv have
> > already done this.
> >
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs.
> >
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections (when hardirqs are not already disabled), as that prevents the
> > nesting that was problematic.
> >
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if they
> > tried to do any en/decryption. Now these softirqs will just run at the
> > end of the task context kernel-mode FPU section (since local_bh_enable()
> > runs pending softirqs) and will no longer take the slow fallback path.
> >
> > Alternatives considered:
> >
> > - Make kernel-mode FPU sections fully preemptible. This would require
> > growing task_struct by another struct fpstate which is more than 2K.
>
> So that's something that will probably happen once the kernel is built
> using APX anyway?
The APX state is just 16 GPRs, for 128 bytes total. That's about 5% of the size
of the fpstate (assuming AVX512 is supported). As Dave mentioned, for in-kernel
use of APX it probably will make more sense to treat the new GPRs like the
existing ones, instead of using XSTATE and integrating it with kernel-mode FPU.
I.e., they will be saved/restored using plain moves to/from a dedicated buffer.
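To make the size comparison concrete (rough numbers; the exact fpstate size varies by CPU, and the struct name below is purely illustrative, not from any real patch):

	/* The new APX GPRs are r16..r31: 16 registers * 8 bytes = 128 bytes. */
	struct apx_extended_regs {
		u64 r[16];
	};

	/* A full fpstate with AVX-512 is roughly 2.5K, so 128 / ~2560 is ~5%. */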
>
> > - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
> > struct fpstate when nested use is detected. Somewhat interesting, but
> > seems unnecessary when a simpler solution exists.
>
> So:
>
> > void kernel_fpu_begin_mask(unsigned int kfpu_mask)
> > {
> > - preempt_disable();
> > + if (!irqs_disabled())
> > + fpregs_lock();
>
> > + if (!irqs_disabled())
> > + fpregs_unlock();
>
> So why is the irqs_disabled() check needed here? (On x86 it can be a
> bit expensive at times, because the IRQ flag has to be loaded,
> including all flags, so basically it's a soft synchronization point of
> a sort.)
>
> Ie. why cannot we simply do a local_bh_disable()/enable() pair (on
> !RT), ie. fpregs_lock()/fpregs_unlock()?
>
> local_bh_disable() is very similar in cost to preempt_disable(), both
> are increasing the preempt_count.
It's to keep kernel_fpu_begin()/end() working when hardirqs are disabled, since
local_bh_disable()/enable() require that hardirqs be enabled. See the changelog
and https://lore.kernel.org/r/20250228035924.GC5588@sol.localdomain/. There are
other directions we could go, but this seems to be the simplest solution. If we
forbid kernel_fpu_begin() with hardirqs disabled (as v1 of this patch did), then a call to
irqs_disabled() is still needed in irq_fpu_usable(). To avoid irqs_disabled()
entirely, we'd need to avoid disabling softirqs, which would mean supporting
nested kernel-mode FPU in softirqs. I can send out a patch that does that using
a per-CPU buffer, if you'd like to see it. I wasn't super happy with the
extra edge cases and memory usage, but we could go in that direction.
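
For reference, this is roughly the shape the v2 approach takes (a sketch, not the literal patch; the elided parts are the existing begin/end logic, and fpregs_lock() is local_bh_disable() on !PREEMPT_RT and preempt_disable() on PREEMPT_RT):

	void kernel_fpu_begin_mask(unsigned int kfpu_mask)
	{
		/*
		 * With hardirqs disabled, neither preemption nor softirq
		 * processing can occur, so nothing extra is needed.
		 * Otherwise, disable softirqs for the whole section so a
		 * softirq can't nest inside it.
		 */
		if (!irqs_disabled())
			fpregs_lock();

		/* ... existing checks and state saving, unchanged ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... existing state handling, unchanged ... */

		/*
		 * On !PREEMPT_RT this ends up in local_bh_enable(), which
		 * also runs any softirqs that became pending during the
		 * kernel-mode FPU section.
		 */
		if (!irqs_disabled())
			fpregs_unlock();
	}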
- Eric