Message-ID: <CAJ8uoz30afXpbn+RXwN5BNMwrLAcW0Cn8tqP502oCLaKH0+kZg@mail.gmail.com>
Date: Fri, 25 Sep 2020 15:48:35 +0200
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Wei Wang <weiwan@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>,
Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Felix Fietkau <nbd@....name>,
Björn Töpel <bjorn.topel@...el.com>
Subject: Re: [RFC PATCH net-next 0/6] implement kthread based napi poll
On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@...gle.com> wrote:
>
> The idea of moving the napi poll process out of softirq context into a
> kernel thread based context is not new.
> Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi
> poll to a kthread back in 2016. And Felix Fietkau also proposed
> patches with a similar idea, using a workqueue to process napi poll,
> just a few weeks ago.
>
> The main reason we'd like to push forward with this idea is that the
> scheduler has poor visibility into cpu cycles spent in softirq context,
> and is not able to make optimal scheduling decisions for the user threads.
> For example, in one application benchmark where network load is high,
> the CPUs handling network softirqs have ~80% cpu utilization, yet user
> threads are still scheduled on those CPUs despite more idle cpus being
> available in the system, and we see very high tail latencies. In this
> case, we have to explicitly pin user threads away from the CPUs handling
> network softirqs to ensure good performance.
> With napi poll moved to a kthread, the scheduler is in charge of
> scheduling both the kthreads handling network load and the user threads,
> and is able to make better decisions. In the previous benchmark, if we
> do this and pin the kthreads processing napi poll to specific CPUs, the
> scheduler is able to schedule user threads away from these CPUs automatically.
>
> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> entity per host, is that a kthread is more configurable than a workqueue,
> and we could leverage existing tuning tools for threads, such as taskset
> and chrt, to tune the scheduling class, cpu affinity, and so on. Another
> reason is that if we eventually want to provide a busy poll feature using
> kernel threads for napi poll, kthreads seem more suitable than a workqueue.
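
(As a side note, this is the kind of tuning I rely on in my experiments
below. What taskset and chrt do to such a kthread boils down to two
syscalls, roughly as in the untested sketch here; the napi kthread pid
is taken from the command line, and the CPU number and priority are
made-up placeholders.)

/* Equivalent of "taskset -pc 2 <pid>" followed by "chrt -f -p 50 <pid>":
 * pin the given kthread to CPU 2 and give it SCHED_FIFO priority 50. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        pid_t pid;
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 50 };

        if (argc < 2)
                return 1;
        pid = atoi(argv[1]);    /* pid of the napi kthread */

        CPU_ZERO(&set);
        CPU_SET(2, &set);
        if (sched_setaffinity(pid, sizeof(set), &set))
                perror("sched_setaffinity");

        if (sched_setscheduler(pid, SCHED_FIFO, &sp))
                perror("sched_setscheduler");

        return 0;
}
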
>
> In this patch series, I revived Paolo and Hannes's patches from 2016 and
> kept them as the first 2 patches. Then there are changes proposed by
> Felix, Jakub, Paolo and myself on top of those, with suggestions from
> Eric Dumazet.
>
> In terms of performance, I ran tcp_rr tests with 1000 flows with
> various request/response sizes, with RFS/RPS disabled, and compared
> performance between softirq and kthread. The host has 56 hyperthreads
> and a 100Gbps NIC.
>
>           req/resp    QPS      50%tile   90%tile   99%tile   99.9%tile
> softirq   1B/1B       2.19M    284us     987us     1.1ms     1.56ms
> kthread   1B/1B       2.14M    295us     987us     1.0ms     1.17ms
>
> softirq   5KB/5KB     1.31M    869us     1.06ms    1.28ms    2.38ms
> kthread   5KB/5KB     1.32M    878us     1.06ms    1.26ms    1.66ms
>
> softirq   1MB/1MB     10.78K   84ms      166ms     234ms     294ms
> kthread   1MB/1MB     10.83K   82ms      173ms     262ms     320ms
>
> I also ran one application benchmark where the user threads have more
> work to do. We do see a good amount of tail latency reduction with the
> kthread model.

I really like this RFC and would encourage you to submit it as a
patch. I would love to see it make it into the kernel.
I see the same positive effects as you when trying it out with AF_XDP
sockets. I made some simple experiments where I sent 64-byte packets to
a single AF_XDP socket. I have not managed to figure out how to get
percentiles out of my load generator, so this is going to be min, avg
and max only. The application using the AF_XDP socket just performs a
mac swap on the packet and sends it back to the load generator, which
then measures the round-trip latency. The kthread is taskset to the
same core as ksoftirqd would run on, so in each experiment they always
run on the same core id (which is not the same core as the application
runs on).
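
In case it helps anyone reproduce this, the per-packet work in the
application is essentially just the mac swap below (a sketch; the
AF_XDP rx/tx and fill/completion ring handling around it is left out).

/* Swap the Ethernet source and destination addresses of one received
 * frame in place before it is sent back out on the tx ring. */
#include <stdint.h>
#include <string.h>
#include <linux/if_ether.h>

void swap_mac_addresses(void *data)
{
        struct ethhdr *eth = data;
        uint8_t tmp[ETH_ALEN];

        memcpy(tmp, eth->h_source, ETH_ALEN);
        memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
        memcpy(eth->h_dest, tmp, ETH_ALEN);
}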

Rate 12 Mpps with 0% loss.

            Latencies (us)            Delay variation between packets
           min     avg     max            avg      max
softirq   11.0    17.1    78.4          0.116     63.0
kthread   11.2    17.1    35.0          0.116     20.9

Rate ~58 Mpps (line rate at 40 Gbit/s) with substantial loss.

            Latencies (us)            Delay variation between packets
           min     avg     max            avg      max
softirq   87.6   194.9   282.6          0.062     25.9
kthread   86.5   185.2   271.8          0.061     22.5

For the last experiment, I also get 1.5% to 2% higher throughput with
your kthread approach. Moreover, just from the per-second throughput
printouts from my application, I can see that the kthread numbers are
more stable. The softirq numbers can vary quite a lot from one second
to the next, around +-3%, but with the kthread approach they are nice
and stable. I have not examined why.

One thing I noticed though, and I do not know if this is an issue, is
that the switch between the two modes does not occur at high packet
rates. I have to lower the packet rate to something that keeps the
core below 100% utilization for it to switch from ksoftirqd to kthread
and vice versa. They just seem too busy to switch at 100% load when
changing the "threaded" sysfs variable.
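
For reference, this is roughly how I flip the mode from user space. It
is only a sketch: I am assuming the attribute is exposed as a
per-device /sys/class/net/<dev>/threaded file, and "eth0" is just a
placeholder interface name, so adjust the path to whatever the sysfs
patch in this series actually creates.

/* Toggle napi threaded mode for one interface by writing 0 or 1 to
 * its "threaded" sysfs attribute (path and interface name assumed). */
#include <stdio.h>

static int set_napi_threaded(const char *dev, int on)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", dev);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%d\n", on);
        fclose(f);
        return 0;
}

int main(void)
{
        return set_napi_threaded("eth0", 1);
}
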
Thank you for working on this feature.
/Magnus
> Paolo Abeni (2):
>   net: implement threaded-able napi poll loop support
>   net: add sysfs attribute to control napi threaded mode
> Felix Fietkau (1):
>   net: extract napi poll functionality to __napi_poll()
> Jakub Kicinski (1):
>   net: modify kthread handler to use __napi_poll()
> Paolo Abeni (1):
>   net: process RPS/RFS work in kthread context
> Wei Wang (1):
>   net: improve napi threaded config
>
> include/linux/netdevice.h | 6 ++
> net/core/dev.c | 146 +++++++++++++++++++++++++++++++++++---
> net/core/net-sysfs.c | 99 ++++++++++++++++++++++++++
> 3 files changed, 242 insertions(+), 9 deletions(-)
>
> --
> 2.28.0.618.gf4bc123cb7-goog
>