linux-kernel - RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

Message-ID: <MW4PR15MB4762B9F48EA56F477EDA63A1BD749@MW4PR15MB4762.namprd15.prod.outlook.com>
Date:   Mon, 13 Dec 2021 19:23:27 +0000
From:   Alexander Duyck <alexanderduyck@...com>
To:     Eric Dumazet <edumazet@...gle.com>,
        David Laight <David.Laight@...lab.com>
CC:     Noah Goldstein <goldstein.w.n@...il.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        X86 ML <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        open list <linux-kernel@...r.kernel.org>,
        netdev <netdev@...r.kernel.org>
Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
 multiples of 8 bytes.



> -----Original Message-----
> From: Eric Dumazet <edumazet@...gle.com>
> Sent: Monday, December 13, 2021 10:45 AM
> To: David Laight <David.Laight@...lab.com>
> Cc: Noah Goldstein <goldstein.w.n@...il.com>; tglx@...utronix.de;
> mingo@...hat.com; Borislav Petkov <bp@...en8.de>;
> dave.hansen@...ux.intel.com; X86 ML <x86@...nel.org>; hpa@...or.com;
> peterz@...radead.org; Alexander Duyck <alexanderduyck@...com>; open
> list <linux-kernel@...r.kernel.org>; netdev <netdev@...r.kernel.org>
> Subject: Re: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
> multiples of 8 bytes.
> 
> On Mon, Dec 13, 2021 at 10:00 AM David Laight <David.Laight@...lab.com>
> wrote:
> >
> >
> > Add in the trailing bytes first so that there is no need to worry
> > about the sum exceeding 64 bits.
> >
> > Signed-off-by: David Laight <david.laight@...lab.com>
> > ---
> >
> > This ought to be faster - because of all the removed 'adc $0'.
> > Guessing how fast x86 code will run is hard!
> > There are other ways of handling buffers that are shorter than 8 bytes,
> > but I'd rather hope they don't happen in any hot paths.
> >
> > Note - I've not even compile tested it.
> > (But have tested an equivalent change before.)
> >
> >  arch/x86/lib/csum-partial_64.c | 55 ++++++++++++----------------------
> >  1 file changed, 19 insertions(+), 36 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index abf819dd8525..fbcc073fc2b5 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -37,6 +37,24 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> >         u64 temp64 = (__force u64)sum;
> >         unsigned result;
> >
> > +       if (len & 7) {
> > +               if (unlikely(len < 8)) {
> > +                       /* Avoid falling off the start of the buffer */
> > +                       if (len & 4) {
> > +                               temp64 += *(u32 *)buff;
> > +                               buff += 4;
> > +                       }
> > +                       if (len & 2) {
> > +                               temp64 += *(u16 *)buff;
> > +                               buff += 2;
> > +                       }
> > +                       if (len & 1)
> > +                               temp64 += *(u8 *)buff;
> > +                       goto reduce_to32;
> > +               }
> > +               temp64 += *(u64 *)(buff + len - 8) << (8 - (len & 7)) * 8;
> 
> This is reading far away (end of buffer).
> 
> Maybe instead read the first bytes and adjust @buff, to allow for better
> hardware prefetching ?

Wouldn't that cause the wsum to be aligned to the length instead of to
buff? If so, we would need an extra rotation at the end to realign
odd-length sections, wouldn't we?
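
To illustrate the realignment concern, here is a minimal userspace sketch.
It is not kernel code, and fold16(), ref_sum(), rot8_16() and
sum_head_then_rest() are names made up for this example. It assumes the
usual 16-bit ones' complement sum with the byte at an even offset in the
low lane. If the leading (len & 7) bytes are summed first and that count is
odd, the sum of the remainder comes out with its two byte lanes swapped and
has to be rotated by 8 bits before the two sums are combined:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference sum: a byte at an even offset is the low byte of its word. */
static uint16_t ref_sum(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += (uint64_t)p[i] << ((i & 1) * 8);
    return fold16(sum);
}

/* Swap the two bytes of a 16-bit sum (a rotate by 8). */
static uint16_t rot8_16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/*
 * Sum the first `head` bytes, then sum the rest as if it started at
 * offset 0. When `head` is odd the second sum has its bytes in the
 * wrong lanes, so it is rotated before the two are combined.
 */
static uint16_t sum_head_then_rest(const uint8_t *p, size_t len, size_t head)
{
    uint16_t h = ref_sum(p, head);
    uint16_t r = ref_sum(p + head, len - head);

    if (head & 1)
        r = rot8_16(r);
    return fold16((uint64_t)h + r);
}
```

For an 11-byte buffer, sum_head_then_rest(buf, 11, 3) matches
ref_sum(buf, 11) only because of the rotation; dropping it gives a
different value for odd splits.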

Since our only concern here is large buffers, would it make sense to just
run the loop in csum_partial() in reverse if we make this change? That way
we would still be accessing the same cache line once we start the loop. For
smaller buffers I would expect the overhead to be minimal, since the end of
the buffer would likely still be in cache; it would only be something like
40B over anyway.
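
A rough userspace sketch of the reverse-loop idea, for discussion only. It
is not the kernel implementation, and add64_carry(), fold64to16(),
load_le64(), csum_forward() and csum_reverse() are hypothetical names. It
adds the trailing (len & 7) bytes first and then walks the full 8-byte
words from the end of the buffer back towards the start, so the first word
the loop touches is next to the bytes just read. Words are assembled
byte-by-byte in little-endian order to keep the sketch portable; real code
would use plain 64-bit loads:

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit ones' complement add: wrap the carry back into bit 0. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
    a += b;
    return a + (a < b);
}

/* Fold a 64-bit accumulator down to 16 bits. */
static uint16_t fold64to16(uint64_t s)
{
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffffffff) + (s >> 32);
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Assemble 8 bytes into a word, lowest address in the lowest byte. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Plain forward byte-wise sum, used only as a reference. */
static uint16_t csum_forward(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += (uint64_t)buf[i] << ((i & 1) * 8);
    return fold64to16(sum);
}

/* Trailing bytes first, then full words from the end to the start. */
static uint16_t csum_reverse(const uint8_t *buf, size_t len)
{
    size_t tail = len & 7;
    size_t pos = len - tail;    /* multiple of 8, so byte lanes line up */
    uint64_t sum = 0;

    /* At most 7 bytes, so this cannot overflow 64 bits. */
    for (size_t i = 0; i < tail; i++)
        sum += (uint64_t)buf[pos + i] << (i * 8);

    while (pos >= 8) {
        pos -= 8;
        sum = add64_carry(sum, load_le64(buf + pos));
    }
    return fold64to16(sum);
}
```

Because ones' complement addition is commutative, csum_reverse() returns the
same value as csum_forward() for any length; only the memory access order
changes. No rotation is needed here because the trailing bytes start at an
offset that is a multiple of 8.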