Message-ID: <20220525085647.6dfb7ed0@kernel.org>
Date:   Wed, 25 May 2022 08:56:47 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     David Laight <David.Laight@...LAB.COM>
Cc:     'Pavan Chebbi' <pavan.chebbi@...adcom.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Michael Chan <michael.chan@...adcom.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "mchan@...adcom.com" <mchan@...adcom.com>,
        David Miller <davem@...emloft.net>
Subject: Re: tg3 dropping packets at high packet rates

On Wed, 25 May 2022 07:28:42 +0000 David Laight wrote:
> > As the trace below shows, I think the underlying problem
> > is that the NAPI callbacks aren't being made in a timely manner.
> 
> Further investigation has shown that this is actually
> a generic problem with the way NAPI callbacks are called
> from the softirq handler.
> 
> The underlying problem is the effect of this code
> in __do_softirq():
> 
>         pending = local_softirq_pending();
>         if (pending) {
>                 if (time_before(jiffies, end) && !need_resched() &&
>                     --max_restart)
>                         goto restart;
> 
>                 wakeup_softirqd();
>         }
> 
> The NAPI processing can loop through here and needs to take
> the 'goto restart' path - not doing so will drop packets.
> The need_resched() test is particularly troublesome.
> I've also had to increase the limit for 'max_restart' from
> its (hard-coded) 10 to 1000 (100 isn't enough).
> I'm not sure whether I'm hitting the jiffies limit,
> but that is hard-coded at 2ms.
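
For reference, the limits being hit here are the hard-coded constants
near the top of kernel/softirq.c (values as of recent kernels; they
may differ across versions):

        /* kernel/softirq.c: bounds on the __do_softirq() restart loop.
         * MAX_SOFTIRQ_TIME caps total time spent looping (~2ms),
         * MAX_SOFTIRQ_RESTART caps the number of 'goto restart' passes.
         */
        #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
        #define MAX_SOFTIRQ_RESTART 10
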
> 
> I'm going to start another thread.

If you share a core between the application and NAPI, try prefer busy
polling (SO_PREFER_BUSY_POLL) and manage polling from user space.
If you have separate cores, use threaded NAPI and either isolate the
core running NAPI or give it high prio.
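
A minimal sketch of the prefer-busy-poll setup (assumes a 5.11+
kernel and headers that define SO_PREFER_BUSY_POLL; the timeout and
budget values below are illustrative, not tuned):

        #include <stdio.h>
        #include <sys/socket.h>

        /* Opt the socket in to prefer busy poll: the NIC irq is kept
         * deferred while user space drives NAPI by busy polling
         * (e.g. via recv/epoll on this socket).
         */
        static int setup_busy_poll(int fd)
        {
                int one = 1;       /* enable prefer busy poll */
                int usecs = 200;   /* busy-poll timeout, microseconds */
                int budget = 64;   /* packets per poll from this context */

                if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                               &one, sizeof(one)) ||
                    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                               &usecs, sizeof(usecs)) ||
                    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                               &budget, sizeof(budget))) {
                        perror("setsockopt");
                        return -1;
                }
                return 0;
        }

For the separate-core case, threaded NAPI can be switched on per
device (5.12+) with 'echo 1 > /sys/class/net/<dev>/threaded', after
which the napi/<dev>-<id> kthreads can be pinned or given elevated
prio with taskset/chrt.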

YMMV, but I've spent more time than I'd like to admit looking at the
softirq yielding conditions; they are hard to beat :( If you control
the app, arranging busy polling or pinning things is a much better
use of your time.
