Date:   Thu, 18 Oct 2018 16:28:35 +1030
From:   Jonathan Woithe <jwoithe@...ad.com.au>
To:     Francois Romieu <romieu@...zoreil.com>
Cc:     Holger Hoffstätte 
        <holger@...lied-asynchrony.com>,
        Heiner Kallweit <hkallweit1@...il.com>,
        David Miller <davem@...emloft.net>,
        Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load

On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
> Holger Hoffstätte <holger@...lied-asynchrony.com> :
> [...]
> > I continued to use the BQL patch in my private tree after it was reverted
> > and also had occasional timeouts, but *only* after I started playing
> > with ethtool to change offload settings. Without offloads or the BQL patch
> > everything has been rock-solid since then.
> > The other weird problem was that timeouts would occur on an otherwise
> > *completely idle* system. Since that occasionally borked my NFS server
> > over night I ultimately removed BQL as well. Rock-solid since then.
> 
> The bug will induce delayed rx processing when a spike of "load" is
> followed by an idle period.

If this is the case, I wonder whether this bug might also be the cause of
the long reception delays we've observed at times when a period of high
network load is followed by almost nothing[1].  That thread[2] details the
investigation that was subsequently done.  A git bisect identified commit
da78dbff2e05630921c551dbbc70a4b7981a8fff as the origin of the misbehaviour
we were observing.
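
For context, my understanding of the failure mode Francois describes
is roughly this: the poll handler only does rx/tx work when the
corresponding event bits are set, but those bits are acked before the
work is done, so when a poll exhausts its budget and NAPI reschedules
it, the rerun finds no bits set and skips the leftover packets until
the next hardware interrupt.  A minimal sketch of that pattern
(hypothetical names, not the actual r8169 code):

struct example_priv {
	struct napi_struct napi;
	u16 events;	/* latched and acked by the irq handler */
};

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_priv *p = container_of(napi, struct example_priv, napi);
	u16 events = p->events;
	int work_done = 0;

	/*
	 * Buggy pattern: if the previous poll returned work_done == budget,
	 * NAPI reschedules us, but the event bits were already cleared, so
	 * leftover rx frames sit in the ring until the next interrupt,
	 * which can be a long time once the link goes idle.  The robust
	 * pattern is to call example_rx()/example_tx() unconditionally and
	 * let them return early when there is nothing to do.
	 */
	if (events & EXAMPLE_RX)
		work_done = example_rx(p, budget);
	if (events & EXAMPLE_TX)
		example_tx(p);

	if (work_done < budget && napi_complete_done(napi, work_done))
		example_irq_enable(p);

	return work_done;
}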

We still see the problem when testing with recent kernels.  It would be
great if the underlying cause has now been identified.

If there is interest, I can probably scrape some hardware together to test
any proposed fix under our workload.

Regards
  jonathan

[1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
[2] https://marc.info/?t=136281339500002&r=1&w=2
