Date:	Mon, 24 Mar 2014 14:38:35 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Eric Dumazet' <eric.dumazet@...il.com>
CC:	David Miller <davem@...emloft.net>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	"hkchu@...gle.com" <hkchu@...gle.com>,
	"mwdalton@...gle.com" <mwdalton@...gle.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: RE: [PATCH net-next] net: optimize csum_replace2()

From: Eric Dumazet 
> On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> > From: Eric Dumazet <edumazet@...gle.com>
> > >
> > > When changing one 16bit value by another in IP header, we can adjust the
> > > IP checksum by doing a simple operation described in RFC 1624,
> > > as reminded by David.
> > >
> > > csum_partial() is a complex function on x86_64, not really suited
> > > for small number of checksummed bytes.
> > >
> > > I spotted csum_partial() being in the top 20 most consuming
> > > functions (more than 1 %) in a GRO workload, which was rather
> > > unexpected.
> > >
> > > The caller was inet_gro_complete() doing a csum_replace2() when
> > > building the new IP header for the GRO packet.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > ---
> > >  include/net/checksum.h |   23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > > index 37a0e24adbe7..a28f4e0f6251 100644
> > > --- a/include/net/checksum.h
> > > +++ b/include/net/checksum.h
> > > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > >  	return csum_add(csum, ~addend);
> > >  }
> > >
> > > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > > +{
> > > +	u16 res = (__force u16)csum;
> >
> > Shouldn't that be u32 ?
> 
> Why ? We compute 16bit checksums here.
> 
> >
> > > +	res += (__force u16)addend;
> > > +	return (__force __sum16)(res + (res < (__force u16)addend));
> 
> Note how carry is propagated, and how we return 16 bit anyway.

Not enough coffee - my brain didn't parse that correctly at all :-(
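
(Working through csum16_add() with concrete numbers, purely as a sanity
check on my part - kernel-style u16, not part of the patch:)

	u16 res = 0xfffe;		/* csum */
	res += 0x0003;			/* addend: wraps to 0x0001, the carry out of bit 15 is lost */
	res += (res < 0x0003);		/* res < addend means it wrapped: fold the carry back in */
	/* res == 0x0002, the 16bit ones-complement sum of 0xfffe and 0x0003 */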

> Using u32 would force to use a fold, which is more expensive.

Try compiling for something other than x86/amd64
(ie something without 16bit arithmetic).
The ppc compiler I have (not the latest) generates 3 masks with 0xffff and
a conditional branch for the above function (without the inline).

> > > +}
> > > +
> > > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > > +{
> > > +	return csum16_add(csum, ~addend);
> > > +}
> > > +
> > >  static inline __wsum
> > >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> > >  {
> > > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > >  }
> > >
> > > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > > + * 3. Discussion states :
> > > + *     HC' = ~(~HC + ~m + m')
> > > + *  m : old value of a 16bit field
> > > + *  m' : new value of a 16bit field
> > > + */
> > > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > >  {
> > > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > >  }
> >
> > It might be clearer to just say:
> > 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new);
> > or even:
> > 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new);
> > which might remove some mask instructions - especially if all the
> > intermediate values are left larger than 16 bits.
> 
> We have csum_add() and csum_sub(), I added csum16_add() and
> csum16_sub(), for analogy and completeness.
> 
> For linux guys, it's quite common stuff to use csum_sub(x, y) instead of
> csum_add(c, ~y).
> 
> You can use whatever code matching your taste.
> 
> RFC 1624 mentions ~m, not m ^ 0xffff

But C only has 32bit maths, so all the 16bit values keep
needing masking.
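
(What I mean, as a sketch - a hypothetical add16() with the __force casts
stripped, showing where the masks come from on a target without 16bit
registers:)

static inline u16 add16(u16 csum, u16 addend)
{
	/* the addition is promoted to int, so keeping res to 16 bits costs a mask */
	u16 res = csum + addend;
	/* the compare and the returned value each need masking back to 16 bits */
	return res + (res < addend);
}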

> But again if you prefer m ^ 0xffff, thats up to you ;)

On ppc I get much better code for (retyped):
static inline u32 csum_add(u32 c, u32 a)
{
	/* add, then fold the carry back into the low 16 bits */
	c += a;
	return (c + (c >> 16)) & 0xffff;
}
void csum_repl(u16 *c16, u16 o16, u16 n)
{
	/* complement just the low 16 bits and keep all values 32bit */
	u32 c = *c16, o = o16;
	*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
}
13 instructions and no branches against 21 instructions
and 2 branches.
The same is probably true of arm.
For amd64 (where the compiler can do a 16bit compare)
my version is one instruction longer.
I've not tried to guess at the actual cycle counts!
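
(And for completeness, a throwaway user-space check that the incremental
RFC 1624 update matches a full recompute - made-up names, just a sketch,
not meant for the tree:)

#include <stdio.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* full 16bit ones-complement sum over an array of 16bit words */
static u16 sum16(const u16 *p, int n)
{
	u32 s = 0;
	while (n--)
		s += *p++;
	while (s >> 16)
		s = (s & 0xffff) + (s >> 16);
	return s;
}

/* 16bit ones-complement add with end-around carry */
static u16 csum_add16(u16 c, u16 a)
{
	u32 s = (u32)c + a;
	return (s + (s >> 16)) & 0xffff;
}

int main(void)
{
	u16 hdr[4] = { 0x4500, 0x0054, 0x1c46, 0x4000 };
	u16 check = ~sum16(hdr, 4);	/* "checksum" over the toy header */
	u16 m = hdr[3], m2 = 0x2000;

	hdr[3] = m2;			/* change one 16bit field */
	/* RFC 1624: HC' = ~(~HC + ~m + m') */
	u16 incr = ~csum_add16(csum_add16(~check, ~m), m2);

	printf("recomputed %04x, incremental %04x\n",
	       (u16)~sum16(hdr, 4), incr);	/* should print the same value twice */
	return 0;
}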

	David
