netdev - RE: [net-next v2 02/19] i40e: simplify txd use count calculation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <B1C1DF2ACD01FD4881736AA51731BAB2A54ADC@ORSMSX107.amr.corp.intel.com>
Date:   Thu, 8 Dec 2016 00:35:06 +0000
From:   "Duyck, Alexander H" <alexander.h.duyck@...el.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        "Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>
CC:     "davem@...emloft.net" <davem@...emloft.net>,
        "Williams, Mitch A" <mitch.a.williams@...el.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "nhorman@...hat.com" <nhorman@...hat.com>,
        "sassmann@...hat.com" <sassmann@...hat.com>,
        "jogreene@...hat.com" <jogreene@...hat.com>,
        "guru.anbalagane@...cle.com" <guru.anbalagane@...cle.com>
Subject: RE: [net-next v2 02/19] i40e: simplify txd use count calculation

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@...il.com]
> Sent: Wednesday, December 7, 2016 4:16 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@...el.com>
> Cc: davem@...emloft.net; Williams, Mitch A <mitch.a.williams@...el.com>;
> netdev@...r.kernel.org; nhorman@...hat.com; sassmann@...hat.com;
> jogreene@...hat.com; guru.anbalagane@...cle.com; Duyck, Alexander H
> <alexander.h.duyck@...el.com>
> Subject: Re: [net-next v2 02/19] i40e: simplify txd use count calculation
> 
> On Wed, 2016-12-07 at 14:19 -0800, Jeff Kirsher wrote:
> > From: Mitch Williams <mitch.a.williams@...el.com>
> >
> > The i40e_txd_use_count function was fast but confusing. In the
> > comments, it even admits that it's ugly. So replace it with a new
> > function that is
> > (very) slightly faster and has extensive commenting to help the
> > thicker among us (including the author, who will forget in a week)
> > understand how it works.
> >
> > Change-ID: Ifb533f13786a0bf39cb29f77969a5be2c83d9a87
> > Signed-off-by: Mitch Williams <mitch.a.williams@...el.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@...el.com>
> > Tested-by: Andrew Bowers <andrewx.bowers@...el.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@...el.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h   | 45 +++++++++++++++++---------
> -
> >  drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 45
> > +++++++++++++++++----------
> >  2 files changed, 56 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index de8550f..e065321 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union
> > i40e_rx_desc *rx_desc,  #define I40E_MAX_DATA_PER_TXD_ALIGNED \
> >  	(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
> >
> > -/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where
> > X is
> > - * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the
> > fact
> > - * that 12K is not a power of 2 and division is expensive.  It is
> > used to
> > - * approximate the number of descriptors used per linear buffer.
> > Note
> > - * that this will overestimate in some cases as it doesn't account
> > for the
> > - * fact that we will add up to 4K - 1 in aligning the 12K buffer,
> > however
> > - * the error should not impact things much as large buffers usually
> > mean
> > - * we will use fewer descriptors then there are frags in an skb.
> > +/**
> > + * i40e_txd_use_count  - estimate the number of descriptors needed
> > +for Tx
> > + * @size: transmit request size in bytes
> > + *
> > + * Due to hardware alignment restrictions (4K alignment), we need to
> > + * assume that we can have no more than 12K of data per descriptor,
> > +even
> > + * though each descriptor can take up to 16K - 1 bytes of aligned memory.
> > + * Thus, we need to divide by 12K. But division is slow! Instead,
> > + * we decompose the operation into shifts and one relatively cheap
> > + * multiply operation.
> > + *
> > + * To divide by 12K, we first divide by 4K, then divide by 3:
> > + *     To divide by 4K, shift right by 12 bits
> > + *     To divide by 3, multiply by 85, then divide by 256
> > + *     (Divide by 256 is done by shifting right by 8 bits)
> > + * Finally, we add one to round up. Because 256 isn't an exact
> > +multiple of
> > + * 3, we'll underestimate near each multiple of 12K. This is actually
> > +more
> > + * accurate as we have 4K - 1 of wiggle room that we can fit into the
> > +last
> > + * segment.  For our purposes this is accurate out to 1M which is
> > +orders of
> > + * magnitude greater than our largest possible GSO size.
> > + *
> > + * This would then be implemented as:
> > + *     return (((size >> 12) * 85) >> 8) + 1;
> > + *
> > + * Since multiplication and division are commutative, we can reorder
> > + * operations into:
> > + *     return ((size * 85) >> 20) + 1;
> >   */
> >  static inline unsigned int i40e_txd_use_count(unsigned int size)  {
> > -	const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
> > -	const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
> > -	unsigned int adjust = ~(u32)0;
> > -
> > -	/* if we rounded up on the reciprocal pull down the adjustment */
> > -	if ((max * reciprocal) > adjust)
> > -		adjust = ~(u32)(reciprocal - 1);
> > -
> > -	return (u32)((((u64)size * reciprocal) + adjust) >> 32);
> > +	return ((size * 85) >> 20) + 1;
> >  }
> 
> But...
> 
> I thought gcc already implemented reciprocal divides ?
> 
> 
> $ cat div.c
> unsigned int foo(unsigned int size)
> {
> 	return size / 0x3000;
> }
> $ gcc -O2 -S div.c && cat div.s
> foo:
> .LFB0:
> 	.cfi_startproc
> 	movl	%edi, %eax
> 	movl	$-1431655765, %edx  // 0xaaaaaaab
> 	mull	%edx
> 	shrl	$13, %edx
> 	movl	%edx, %eax
> 	ret
> 

Well there ends up being a few aspects to it.  First we don't need the precision of a full 64b inverse multiplication, that is why we can get away with multiple by 85 and shift.  The assumption is we should never see a buffer larger than 64K for a TSO frame.  That being the case we can do the same thing without having to use a 64b value which isn't an option on 32b architectures.

So basically what it comes down to is dealing with the "optimized for size" kernel option, and 32b architectures not being able to do this.  Arguably both are corner cases but better to deal with them than take a performance hit we don't have to.

- Alex