Message-ID: <20251130190601.GC1395@sol>
Date: Sun, 30 Nov 2025 11:06:01 -0800
From: Eric Biggers <ebiggers@...nel.org>
To: "Becker, Hanno" <beckphan@...zon.co.uk>
Cc: "Jason A. Donenfeld" <Jason@...c4.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
David Howells <dhowells@...hat.com>,
Herbert Xu <herbert@...dor.apana.org.au>,
Luis Chamberlain <mcgrof@...nel.org>,
Petr Pavlu <petr.pavlu@...e.com>,
Daniel Gomez <da.gomez@...nel.org>,
Sami Tolvanen <samitolvanen@...gle.com>,
Ard Biesheuvel <ardb@...nel.org>,
Stephan Mueller <smueller@...onox.de>,
Lukas Wunner <lukas@...ner.de>,
Ignat Korchagin <ignat@...udflare.com>,
"keyrings@...r.kernel.org" <keyrings@...r.kernel.org>,
"linux-modules@...r.kernel.org" <linux-modules@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"matthias@...nwischer.eu" <matthias@...nwischer.eu>
Subject: Re: [PATCH 1/4] lib/crypto: Add ML-DSA verification support
On Sun, Nov 30, 2025 at 07:15:22AM +0000, Becker, Hanno wrote:
> > - Vector registers (e.g. AVX) can be used in the kernel only in some
> > contexts, and only when they are explicitly saved and restored. So
> > we have to do our own integration of any code that uses them anyway.
> > There is also more overhead to each vector-optimized function than
> > there is in userspace, so very fine-grained optimization (e.g. as is
> > used in the Dilithium reference code) doesn't work too well.
>
> That's very useful, can you say more? Would one want some sort of
> configurable preamble/postamble in the top-level API which takes care of
> the necessary save/restore logic?
>
> What is the per-function overhead?
It varies by architecture, but usually it looks something like:
	if (irq_fpu_usable()) {
		kernel_fpu_begin();
		avx_function();
		kernel_fpu_end();
	} else {
		generic_function();
	}
The overhead varies significantly by CPU, kernel config options, and
whether it's the first use since the current task last entered the
kernel. But it can be up to a few hundred cycles.
> > Note that the kernel already has optimized Keccak code. That already
> > covers the most performance-critical part of ML-DSA.
>
> No, this would need _batched_ Keccak. An ML-DSA implementation using
> only 1x-Keccak will never have competitive performance. See
> https://github.com/pq-code-package/mldsa-native/pull/754 for the
> performance loss from using unbatched Keccak only, on a variety of
> platforms; it's >2x for some.
>
> In turn, if you want to integrate batched Keccak -- but perhaps only on
> some platforms? -- you need to rewrite your entire code to make use of
> it. That's not a simple change, and part of what I mean when I say that
> the challenges are just deferred. Note that the official reference and
> AVX2 implementations duck this problem by duplicating the code and
> adjusting it, rather than looking for a common structure that could host
> both 'plain' and batched Keccak. I assume the amount of code duplication
> this brings would be unacceptable.
At least in my code, only the matrix expansion code would need to change
to take advantage of interleaved Keccak. The fact that other
implementations apparently are having trouble with this actually
suggests to me that perhaps they're not good implementations to use.
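To make that concrete, here is a rough sketch of the shape such a change
could take. Everything below (shake128x4_*, rej_uniform, struct mldsa_poly,
the constants) is hypothetical and only illustrates that a 4-way interleaved
SHAKE128 can stay confined to the expand-A loop while the rest of the code
keeps using 1x Keccak:

	#include <linux/types.h>

	#define MLDSA_K		6	/* rows of A (ML-DSA-65) */
	#define MLDSA_L		5	/* columns of A (ML-DSA-65) */
	#define SHAKE128_RATE	168

	struct mldsa_poly { s32 coeffs[256]; };

	static void mldsa_expand_a(struct mldsa_poly a[MLDSA_K][MLDSA_L],
				   const u8 rho[32])
	{
		unsigned int e;

		for (e = 0; e + 4 <= MLDSA_K * MLDSA_L; e += 4) {
			struct shake128x4_state st;
			u8 buf[4][5 * SHAKE128_RATE];
			u16 nonces[4];
			int lane;

			/* Nonce identifies the (row, column) entry, as in the
			 * reference code's (i << 8) + j. */
			for (lane = 0; lane < 4; lane++)
				nonces[lane] = (((e + lane) / MLDSA_L) << 8) |
					       ((e + lane) % MLDSA_L);

			/* One batched call: 4 independent XOF streams. */
			shake128x4_absorb(&st, rho, 32, nonces);
			shake128x4_squeezeblocks(&st, buf, 5);

			for (lane = 0; lane < 4; lane++)
				rej_uniform(&a[(e + lane) / MLDSA_L]
					      [(e + lane) % MLDSA_L],
					    buf[lane], sizeof(buf[lane]));
		}
		/*
		 * Leftover entries (K*L isn't always a multiple of 4) and the
		 * rare case where rejection sampling needs more squeezed
		 * blocks fall back to the plain 1x SHAKE128 path.
		 */
	}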
In any case, no one has said they want this particular optimization in the
kernel. And hopefully the future is native Keccak support anyway; s390
already has it, and (at least) RISC-V is working on it.
- Eric