Message-ID: <FuRP6eq1TdvRdmbeIUM2jGr9qmB2CptLqZsWq_hVq3Bqur-pL9HX3xBLPC3iC1F6F4FT1Moi9GhPssy8MGEKO7LUKH6sA34TbV0S8K3Gn8Q=@n8pjl.ca>
Date: Tue, 06 Dec 2022 23:06:27 +0000
From: Peter Lafreniere <peter@...jl.ca>
To: "Elliott, Robert (Servers)" <elliott@....com>
Cc: Herbert Xu <herbert@...dor.apana.org.au>,
"Jason A. Donenfeld" <Jason@...c4.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"tim.c.chen@...ux.intel.com" <tim.c.chen@...ux.intel.com>,
"ap420073@...il.com" <ap420073@...il.com>,
"ardb@...nel.org" <ardb@...nel.org>,
"David.Laight@...lab.com" <David.Laight@...lab.com>,
"ebiggers@...nel.org" <ebiggers@...nel.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> >
> > 	if (!len)
> > 		return 0;
> >
> > 	kernel_fpu_begin();
> > 	for (;;) {
> > 		unsigned int chunk = min(len, 4096);
> >
> > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> >
> > 		len -= chunk;
> > 		data += chunk;
> >
> > 		if (!len)
> > 			break;
> >
> > 		if (need_resched()) {
> > 			kernel_fpu_end();
> > 			cond_resched();
> > 			kernel_fpu_begin();
> > 		}
> > 	}
> > 	kernel_fpu_end();
>
>
> I implemented that conditional approach in the sha algorithms.
>
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
>
> sha1 cond: 14479 calls; 784256 cycles doing begin/end; longest FPU context 35828 cycles
> sha256 cond: 26763 calls; 1273570 cycles doing begin/end; longest FPU context 118612 cycles
> sha512 cond: 26957 calls; 1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context 4049644 cycles
>
> NOTE: I didn't have a patch in place to isolate the counts for each variation
> (ssse3 vs. avx vs. avx2) and
> - for sha512: sha512 vs. sha384
> - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
>
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context is held for a long time (77 ms) but only while
> it's not needed.
>
> That's much more efficient than releasing it every 30 us just in case.
How recently did you make this change? I implemented this conditional
approach for ecb_cbc_helpers.h, but saw no change at all in performance
on serpent-avx2 or twofish-avx.
kernel_fpu_{begin,end} (after the first call to begin) don't do anything
more than enable/disable preemption and make a few writes to the mxcsr.
It's likely that the above approach has slightly less overhead, and it
will preempt on non-CONFIG_PREEMPT kernels, but nothing suggests a
performance uplift.
This brings us back to this question: should crypto routines be
preempted under PREEMPT_VOLUNTARY or not?
> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.
I missed the earlier discussions. Have you seen issues with RCU
stalls/latency spikes because of crypto routines? If so, what preemption
model were you running?
> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.
Leave it in C. It'll be more maintainable that way.
Cheers,
Peter Lafreniere <peter@...jl.ca>