netdev - Re: [PATCH net] r8169: fix NAPI handling under high load

Message-ID: <8beda4fa-5d04-49e6-eb3e-5656897a301f@gmail.com>
Date:   Thu, 18 Oct 2018 08:03:32 +0200
From:   Heiner Kallweit <hkallweit1@...il.com>
To:     Jonathan Woithe <jwoithe@...ad.com.au>
Cc:     Francois Romieu <romieu@...zoreil.com>,
        Holger Hoffstätte <holger@...lied-asynchrony.com>,
        David Miller <davem@...emloft.net>,
        Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load

On 18.10.2018 07:58, Jonathan Woithe wrote:
> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>> Holger Hoffstätte <holger@...lied-asynchrony.com> :
>> [...]
>>> I continued to use the BQL patch in my private tree after it was reverted
>>> and also had occasional timeouts, but *only* after I started playing
>>> with ethtool to change offload settings. Without offloads or the BQL patch
>>> everything has been rock-solid since then.
>>> The other weird problem was that timeouts would occur on an otherwise
>>> *completely idle* system. Since that occasionally borked my NFS server
>>> overnight I ultimately removed BQL as well. Rock-solid since then.
>>
>> The bug will induce delayed rx processing when a spike of "load" is
>> followed by an idle period.
> 
> If this is the case, I wonder whether this bug might also be the cause of
> the long reception delays we've observed at times when a period of high
> network load is followed by almost nothing[1].  That thread[2] details the
> investigations subsequently done.  A git bisect showed that commit
> da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour
> we were observing.
> 
> We still see the problem when we test with recent kernels.  It would be
> great if the underlying problem has now been identified.
> 
> I can possibly scrape some hardware together to test any proposed fix under
> our workload if there was interest.
> 
The proposed fix is here:
https://patchwork.ozlabs.org/patch/985014/
It would be good if you could test it. Thanks!

Heiner

> Regards
>   jonathan
> 
> [1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
> [2] https://marc.info/?t=136281339500002&r=1&w=2
>