lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <02874ECE860811409154E81DA85FBB5882AC15CA@ORSMSX115.amr.corp.intel.com>
Date:   Mon, 9 Oct 2017 20:10:16 +0000
From:   "Keller, Jacob E" <jacob.e.keller@...el.com>
To:     Alexander Duyck <alexander.duyck@...il.com>,
        David Laight <David.Laight@...lab.com>
CC:     "Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "nhorman@...hat.com" <nhorman@...hat.com>,
        "sassmann@...hat.com" <sassmann@...hat.com>,
        "jogreene@...hat.com" <jogreene@...hat.com>
Subject: RE: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to
 set RS bit

> -----Original Message-----
> From: netdev-owner@...r.kernel.org [mailto:netdev-owner@...r.kernel.org]
> On Behalf Of Alexander Duyck
> Sent: Monday, October 09, 2017 9:28 AM
> To: David Laight <David.Laight@...lab.com>
> Cc: Kirsher, Jeffrey T <jeffrey.t.kirsher@...el.com>; davem@...emloft.net;
> Keller, Jacob E <jacob.e.keller@...el.com>; netdev@...r.kernel.org;
> nhorman@...hat.com; sassmann@...hat.com; jogreene@...hat.com
> Subject: Re: [net-next 14/15] i40e: ignore skb->xmit_more when deciding to set
> RS bit
> 
> On Mon, Oct 9, 2017 at 2:07 AM, David Laight <David.Laight@...lab.com> wrote:
> > From: Jeff Kirsher
> >> Sent: 06 October 2017 18:57
> >> From: Jacob Keller <jacob.e.keller@...el.com>
> >>
> >> Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and
> >> disable force WB workaround") we've tried to "optimize" setting the
> >> RS bit based around skb->xmit_more. This same logic was refactored
> >> in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"),
> >> but ultimately was not functionally changed.
> >
> > I've tried to understand the above, but without definitions of WB
> > and RS and the '4-ness' of something it is quite difficult.
> >
> > I THINK this is all about telling the hardware there is a packet
> > in the ring to transmit?
> >
> > I don't understand the 4-ness. Linux requires that the hardware
> > be notified of a single packet transmit, and that the 'transmit
> > complete' also be notified in 'a timely manner' for a single packet.
> 
> So to clarify some of this. The RS is short for Report Status. It
> tells the hardware that when it has completed the descriptor it should
> trigger a write back, aka WB.
> 
> The 4-ness is because each descriptor is 16 bytes, so if we write back
> once every 4 descriptors that is one 64B cache line which is much
> better for most platforms performance wise.
> 
> > skb->xmit_more ought to be usable to optimise both of these
> > (assuming it isn't a lie).
> 
> Actually it leads to issues if we hold off for too long. If we are
> busy dequeueing a long list, and the Tx queue has gone empty then we
> are stalling Tx without need to do so. We need to be regularly
> notifying the device that it has buffers available to transmit which
> is what xmit_more is good for. However in addition the hardware needs
> to notify us regularly that there are buffers ready to clean which it
> isn't necessarily so useful for, especially if  are filling the entire
> ring and only providing one notification to the driver that there are
> buffers ready since we can defer the Tx interrupt for too long.
> 
> What Jake is trying to fix here is to prevent us from generating what
> looks like a square wave in terms of throughput. Essentially the
> current behavior that is being seen is that all the buffers in the
> ring either belong to software, or belong to hardware. It is best for
> us to have this split half and half so that the hardware has buffers
> to transmit and software has buffers to clean and enqueue new buffers
> onto.
> 

Yes this sums it up pretty well. The test case which found this bug is outlined in the description. Essentially we could see up to dozens or even almost a hundred packets in a row bunched up if we had a bunch of virtual devices layered on top of the network driver. This led to us not indicating to get status from the hardware about finished packets, which results in an increased delay to finish Tx.

> > The driver would need to ensure a ring full of messages isn't
> > generated without either wakeup - but that might be 128 frames.
> 
> That is what is happening here. Basically we are guaranteeing
> writebacks are occurring on a regular basis instead of in large
> bursts.
> 
> > FWIW on the systems I have PCIe writes are relatively cheap
> > (reads are slow). So different counts would be appropriate
> > for delaying doorbell writes and requesting completion interrupts.
> >
> >         David
> 
> The doorbell writes are still being delayed the same amount with this
> patch. The only thing that was split out was the device write backs
> are now occurring on a more regular basis instead of being held off
> until a much larger set is handled.
> 

Exactly. The tail bumps are still functioning the same, but it's the hardware report status requests which are not delayed. Note that there's some further quirks in how our hardware performs write backs which exacerbate the problem because of how the cachelines are setup so that the delay is even worse than we would expect.

Thanks,
Jake

> - Alex

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ