linux-kernel - RE: [PATCH v4 10/24] crypto: x86/poly

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <MW5PR84MB1842C2D1EA00D5EF65784E25AB179@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM>
Date:   Fri, 2 Dec 2022 06:21:02 +0000
From:   "Elliott, Robert (Servers)" <elliott@....com>
To:     Herbert Xu <herbert@...dor.apana.org.au>,
        "Jason A. Donenfeld" <Jason@...c4.com>
CC:     "davem@...emloft.net" <davem@...emloft.net>,
        "tim.c.chen@...ux.intel.com" <tim.c.chen@...ux.intel.com>,
        "ap420073@...il.com" <ap420073@...il.com>,
        "ardb@...nel.org" <ardb@...nel.org>,
        "David.Laight@...lab.com" <David.Laight@...lab.com>,
        "ebiggers@...nel.org" <ebiggers@...nel.org>,
        "linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption



> -----Original Message-----
> From: Herbert Xu <herbert@...dor.apana.org.au>
> Sent: Friday, November 25, 2022 2:41 AM
> To: Jason A. Donenfeld <Jason@...c4.com>
> Cc: Elliott, Robert (Servers) <elliott@....com>; davem@...emloft.net;
> tim.c.chen@...ux.intel.com; ap420073@...il.com; ardb@...nel.org;
> David.Laight@...lab.com; ebiggers@...nel.org; linux-
> crypto@...r.kernel.org; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> 
> On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> > On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > > +
> >
> > Use an enum for constants like this:
> >
> >     enum { BYTES_PER_FPU = ... };
> >
> > You can even make it function-local, so it's near the code that uses it,
> > which will better justify its existence.
> >
> > Also, where did you get this number? Seems kind of weird.
> 
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
> 
> Perhaps we should try a different approach.  How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
> 
> 	if (!len)
> 		return 0;
> 
> 	kernel_fpu_begin();
> 	for (;;) {
> 		unsigned int chunk = min(len, 4096);
> 
> 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> 
> 		len -= chunk;
> 		data += chunk;
> 
> 		if (!len)
> 			break;
> 
> 		if (need_resched()) {
> 			kernel_fpu_end();
> 			cond_resched();
> 			kernel_fpu_begin();
> 		}
> 	}
> 	kernel_fpu_end();
> 

I implemented that conditional approach in the sha algorithms. 

The results of a boot (using sha512 for module signatures, with
crypto extra tests enabled, comparing to sha512 with a 20 KiB
fixed limit) are:

  sha1  cond:  14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
sha256  cond:  26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
sha512  cond:  26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles

NOTE: I didn't have a patch in place to isolate the counts for each variation
(ssse3 vs. avx vs. avx2) and
- for sha512: sha512 vs. sha384
- for sha256: sha256 vs. sha224
so the numbers include sha256 and sha512 running twice as many tests
as sha1.

This approach looks very good:
- 16% of the number of begin/end calls
- 10% of the CPU cycles spent making the calls
- the FPU context is held for a long time (77 ms) but only while
  it's not needed.

That's much more efficient than releasing it every 30 us just in case.

I'll keep testing this to make sure RCU stalls stay away, and apply
the approach to the other algorithms.

In x86, need_resched() has to deal with a PER_CPU variable, so I'm
not sure it's worth the hassle to figure out how to do that from
assembly code.