Date:   Thu, 18 Oct 2018 16:28:35 +1030
From:   Jonathan Woithe <jwoithe@...ad.com.au>
To:     Francois Romieu <romieu@...zoreil.com>
Cc:     Holger Hoffstätte 
        <holger@...lied-asynchrony.com>,
        Heiner Kallweit <hkallweit1@...il.com>,
        David Miller <davem@...emloft.net>,
        Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load

On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
> Holger Hoffstätte <holger@...lied-asynchrony.com> :
> [...]
> > I continued to use the BQL patch in my private tree after it was reverted
> > and also had occasional timeouts, but *only* after I started playing
> > with ethtool to change offload settings. Without offloads or the BQL patch
> > everything has been rock-solid since then.
> > The other weird problem was that timeouts would occur on an otherwise
> > *completely idle* system. Since that occasionally borked my NFS server
> > over night I ultimately removed BQL as well. Rock-solid since then.
> 
> The bug will induce delayed rx processing when a spike of "load" is
> followed by an idle period.

If this is the case, I wonder whether this bug might also be the cause of
the long reception delays we've observed at times when a period of high
network load is followed by almost nothing[1].  That thread[2] details the
investigation that was subsequently done.  A git bisect identified commit
da78dbff2e05630921c551dbbc70a4b7981a8fff as the origin of the misbehaviour
we were observing.
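
For context, my understanding of the failure mode Francois describes
is roughly this: the poll handler only does rx/tx work when the
corresponding event bits are set, but those bits are acked before the
work is done, so when a poll exhausts its budget and NAPI reschedules
it, the rerun finds no bits set and skips the leftover packets until
the next hardware interrupt.  A minimal sketch of that pattern
(hypothetical names, not the actual r8169 code):

struct example_priv {
	struct napi_struct napi;
	u16 events;	/* latched and acked by the irq handler */
};

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_priv *p = container_of(napi, struct example_priv, napi);
	u16 events = p->events;
	int work_done = 0;

	/*
	 * Buggy pattern: if the previous poll returned work_done == budget,
	 * NAPI reschedules us, but the event bits were already cleared, so
	 * leftover rx frames sit in the ring until the next interrupt,
	 * which can be a long time once the link goes idle.  The robust
	 * pattern is to call example_rx()/example_tx() unconditionally and
	 * let them return early when there is nothing to do.
	 */
	if (events & EXAMPLE_RX)
		work_done = example_rx(p, budget);
	if (events & EXAMPLE_TX)
		example_tx(p);

	if (work_done < budget && napi_complete_done(napi, work_done))
		example_irq_enable(p);

	return work_done;
}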

We still see the problem when testing with recent kernels.  It would be
great if the underlying cause has now been identified.

If there is interest, I can probably scrape some hardware together to test
any proposed fix under our workload.

Regards
  jonathan

[1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
[2] https://marc.info/?t=136281339500002&r=1&w=2
