Message-ID: <20200930085809.58eee328@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date: Wed, 30 Sep 2020 08:58:09 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Wei Wang <weiwan@...gle.com>, Eric Dumazet <edumazet@...gle.com>,
"David S . Miller" <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Felix Fietkau <nbd@....name>, Luigi Rizzo <lrizzo@...gle.com>
Subject: Re: [RFC PATCH net-next 0/6] implement kthread based napi poll
On Wed, 30 Sep 2020 10:58:00 +0200 Paolo Abeni wrote:
> On Tue, 2020-09-29 at 14:48 -0700, Jakub Kicinski wrote:
> > On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:
> > > On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > > On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:
> > > > > Wei, this is a very nice work.
> > > > >
> > > > > Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.
> > > >
> > > > The problem is that, for the application I'm testing with, this
> > > > implementation is significantly slower (in terms of RPS) than Felix's code:
> > > >
> > > >          |  RPS  |       L  A  T  E  N  C  Y        |  App   |     C P U     |
> > > >          |       |  AVG   |  P50  |  P99   |  P999  | Overld | busy  |  PSI  |
> > > >   thread |  1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
> > > >   work q |  4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
> > > >   TAPI   |  4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |
> > > >
> > > > thread is this code, "work q" is Felix's code, TAPI is my hacks.
> > > >
> > > > The numbers are comparing performance to normal NAPI.
> > > >
> > > > In all cases (but not the baseline) I configured timer-based polling
> > > > (defer_hard_irqs) with an ~100us timeout. Without deferring hard IRQs,
> > > > threaded NAPI is actually slower for this app. I'm also not modifying
> > > > niceness, since that again causes an application performance regression
> > > > here.
> > > >
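For reference, that timer-based polling setup is just the two per-netdev
knobs (napi_defer_hard_irqs and gro_flush_timeout). A rough sketch of the
configuration - the interface name and the exact values below are
illustrative assumptions, not the config used for the numbers above:

/*
 * Sketch only: enable timer-based NAPI polling on a device by writing
 * the per-netdev knobs.  "eth0" and the values are placeholders.
 */
#include <stdio.h>

static void write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* don't re-enable the device IRQ for up to 2 "empty" polls... */
        write_knob("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
        /* ...and re-arm polling from a timer after ~100us (value is in ns) */
        write_knob("/sys/class/net/eth0/gro_flush_timeout", "100000");
        return 0;
}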
> > >
> > > If I remember correctly, Felix's workqueue code uses the WQ_HIGHPRI flag,
> > > which by default uses -20 as the nice value for the workqueue threads.
> > > But the kthread implementation leaves the nice level at its default of 0.
> > > This could be one difference.
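AFAICT that is the mechanism: WQ_HIGHPRI worker pools run their workers at
HIGHPRI_NICE_LEVEL (-20), while a plain kthread starts at the default nice 0
unless the code calls set_user_nice() on it. A rough in-kernel sketch of the
difference - the names below are made up for illustration, this is not the
code from this series:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* Illustrative only: a NAPI-style poll loop in a dedicated kthread. */
static int example_napi_kthread(void *data)
{
        while (!kthread_should_stop()) {
                /* ... run the NAPI poll for this queue here ... */
                cond_resched();
        }
        return 0;
}

static struct task_struct *example_start_poller(void)
{
        struct task_struct *t;

        t = kthread_run(example_napi_kthread, NULL, "example-napi");
        if (IS_ERR(t))
                return t;

        /*
         * Not done by default: the kthread starts at nice 0.  Matching the
         * WQ_HIGHPRI behaviour would need an explicit bump:
         */
        set_user_nice(t, MIN_NICE);     /* MIN_NICE == -20 */

        return t;
}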
> >
> > FWIW this is the data based on which I concluded the nice -20 actually
> > makes things worse here:
> >
> > threaded: -1.50%
> > threaded p-20: -5.67%
> > thr poll: 2.93%
> > thr poll p-20: 2.22%
> >
> > Annoyingly, the relative performance change varies day to day, and this
> > test was run a while back (over the weekend I was getting < 2% improvement
> > with this set).
>
> I'm assuming your application uses UDP as the transport protocol - raw
> IP or packet sockets should behave in the same way. I have observed
> similar behavior - that is, unstable figures and an end-to-end tput
> decrease when the network stack gets more cycles (or becomes faster) -
> when the bottleneck was in user-space processing[1].
>
> You can double-check that you are hitting the same scenario by observing
> the UDP protocol stats (you should see higher drop figures with threaded,
> and even more with threaded p-20, compared to the other implementations).
>
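For reference, the counters in question show up as the InErrors /
RcvbufErrors fields on the Udp: lines of /proc/net/snmp (nstat and
netstat -su read the same counters). A trivial sketch to dump them:

/* Minimal sketch: print the UDP counter names and values from /proc/net/snmp. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[1024];
        FILE *f = fopen("/proc/net/snmp", "r");

        if (!f) {
                perror("/proc/net/snmp");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "Udp:", 4))
                        fputs(line, stdout);    /* header line, then values */
        }
        fclose(f);
        return 0;
}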
> If you are hitting such a scenario, you should be able to improve things
> by setting nice -20 on the user-space process, increasing the UDP socket
> receive buffer size, or enabling socket busy polling
> (/proc/sys/net/core/busy_poll, I mean).
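For completeness, all three of those are plain userspace settings. A rough
sketch with arbitrary example values - using the per-socket SO_BUSY_POLL
option here in place of the global sysctl:

/*
 * Sketch of the suggested tuning, with arbitrary example values:
 * nice -20 for the receiving process, a larger UDP receive buffer and
 * per-socket busy polling (SO_BUSY_POLL takes microseconds).
 */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/socket.h>

static void tune_udp_receiver(int sock)
{
        int rcvbuf = 4 * 1024 * 1024;   /* example: ask for 4MB (kernel doubles it) */
        int busy_usecs = 50;            /* example: busy-read for up to 50us */

        if (setpriority(PRIO_PROCESS, 0, -20))
                perror("setpriority");  /* needs CAP_SYS_NICE */

        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)))
                perror("SO_RCVBUF");

        if (setsockopt(sock, SOL_SOCKET, SO_BUSY_POLL, &busy_usecs,
                       sizeof(busy_usecs)))
                perror("SO_BUSY_POLL");
}

int main(void)
{
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        if (sock < 0) {
                perror("socket");
                return 1;
        }
        tune_udp_receiver(sock);
        return 0;
}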
It's not UDP. The application has some logic to tell the load balancer
to back off whenever it feels it's not processing requests fast enough
("App Overld" in the table two emails back). That statistic is higher
with p-20, and application latency suffers, too.