Message-ID: <CAEA6p_CKXMzqqWK0Mo5ppA4vV7bKqV=2toDxmumCJwFeWtq4gQ@mail.gmail.com>
Date:   Wed, 18 Nov 2020 12:14:09 -0800
From:   Wei Wang <weiwan@...gle.com>
To:     David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Cc:     Eric Dumazet <edumazet@...gle.com>, Felix Fietkau <nbd@....name>,
        Paolo Abeni <pabeni@...hat.com>,
        Hannes Frederic Sowa <hannes@...essinduktion.org>,
        Hillf Danton <hdanton@...a.com>
Subject: Re: [PATCH net-next v3 0/5] implement kthread based napi poll

On Wed, Nov 18, 2020 at 12:07 PM Wei Wang <weiwan@...gle.com> wrote:
>
> The idea of moving the napi poll process out of softirq context and
> into a kernel-thread-based context is not new.
> Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi poll
> into a kthread back in 2016, and Felix Fietkau proposed patches along
> similar lines, using a workqueue to process napi poll, just a few weeks
> ago.
>
> The main reason we'd like to push forward with this idea is that the
> scheduler has poor visibility into cpu cycles spent in softirq context,
> and so cannot make optimal scheduling decisions for user threads.
> For example, in one of our application benchmarks with high network
> load, the CPUs handling network softirqs reach ~80% cpu utilization,
> yet user threads are still scheduled on those CPUs despite more idle
> cpus being available elsewhere in the system, and we see very high tail
> latencies. In this case, we have to explicitly pin user threads away
> from the CPUs handling network softirqs to ensure good performance.
> With napi poll moved to a kthread, the scheduler is in charge of
> scheduling both the kthreads handling network load and the user
> threads, and can make better decisions. In the previous benchmark, if
> we do this and pin the kthreads processing napi poll to specific CPUs,
> the scheduler automatically schedules user threads away from these
> CPUs.
>
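
To give a concrete picture of what "napi poll in a kthread" means, the
loop each per-napi kthread runs looks roughly like the sketch below.
This is illustrative only, not the exact code in this series: it assumes
the __napi_poll() helper factored out in patch 3 (the signature shown is
an assumption), and it simplifies the wakeup handling that the real
patches do with NAPI state bits.

    #include <linux/kthread.h>
    #include <linux/netdevice.h>
    #include <linux/sched.h>

    /* From patch 3; the exact signature here is an assumption. */
    int __napi_poll(struct napi_struct *n, bool *repoll);

    /* Illustrative per-napi poll kthread, not the actual patch. */
    static int napi_threaded_poll(void *data)
    {
            struct napi_struct *napi = data;

            while (!kthread_should_stop()) {
                    bool repoll = false;

                    local_bh_disable();
                    __napi_poll(napi, &repoll);     /* budget-limited poll */
                    local_bh_enable();

                    if (repoll) {
                            /* more packets pending: yield, then poll again */
                            cond_resched();
                            continue;
                    }

                    /* Idle: sleep until the device irq handler wakes us.
                     * (The real code must avoid the lost-wakeup race here.)
                     */
                    set_current_state(TASK_INTERRUPTIBLE);
                    schedule();
                    __set_current_state(TASK_RUNNING);
            }
            return 0;
    }
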
> The reason we prefer one kthread per napi instance, rather than one
> workqueue entity per host, is that a kthread is more configurable than
> a workqueue: we can leverage existing tuning tools for threads, such as
> taskset and chrt, to adjust the scheduling class, the cpu set, and so
> on. Another reason is that, if we eventually want to provide a busy
> poll feature using kernel threads for napi poll, a kthread seems more
> suitable than a workqueue. Furthermore, on large platforms with 2 NICs
> attached to 2 sockets, kthreads are more flexible to pin to different
> sets of CPUs.
>
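
As a concrete example of the tuning referred to above, the userspace
sketch below does for a napi kthread what taskset -cp and chrt -f -p do
on the command line: pin it to a cpu set and change its scheduling
class. The pid is supplied by the caller (e.g. looked up by the
kthread's name); whether SCHED_FIFO is actually a good policy for napi
kthreads is a separate question, this only shows the mechanism.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
            cpu_set_t set;
            struct sched_param sp = { .sched_priority = 1 };
            pid_t pid;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <napi-kthread-pid>\n", argv[0]);
                    return 1;
            }
            pid = (pid_t)atoi(argv[1]);

            /* Pin the kthread to CPUs 0 and 1, like `taskset -cp 0,1 <pid>`. */
            CPU_ZERO(&set);
            CPU_SET(0, &set);
            CPU_SET(1, &set);
            if (sched_setaffinity(pid, sizeof(set), &set))
                    perror("sched_setaffinity");

            /* Give it SCHED_FIFO priority 1, like `chrt -f -p 1 <pid>`. */
            if (sched_setscheduler(pid, SCHED_FIFO, &sp))
                    perror("sched_setscheduler");

            return 0;
    }
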
> In this patch series, I revived Paolo and Hannes's patches from 2016
> and kept them as the first 2 patches. On top of those are changes
> proposed by Felix, Jakub, Paolo and myself, with suggestions from
> Eric Dumazet.
>
> In terms of performance, I ran tcp_rr tests with 1000 flows and various
> request/response sizes, with RFS/RPS disabled, and compared performance
> between the softirq, kthread, and workqueue implementations (the
> patchset proposed by Felix Fietkau).
> The host has 56 hyperthreads and a 100Gbps NIC with 8 rx queues, and
> only 1 NUMA node. All threads are unpinned.
>
>           req/resp   QPS     50%tile   90%tile   99%tile   99.9%tile
> softirq   1B/1B      2.75M   337us     376us     1.04ms    3.69ms
> kthread   1B/1B      2.67M   371us     408us     455us     550us
> workq     1B/1B      2.56M   384us     435us     673us     822us
>
> softirq   5KB/5KB    1.46M   678us     750us     969us     2.78ms
> kthread   5KB/5KB    1.44M   695us     789us     891us     1.06ms
> workq     5KB/5KB    1.34M   720us     905us     1.06ms    1.57ms
>
> softirq   1MB/1MB    11.0K   79ms      166ms     306ms     630ms
> kthread   1MB/1MB    11.0K   75ms      177ms     303ms     596ms
> workq     1MB/1MB    11.0K   79ms      180ms     303ms     587ms
>
> When running the workqueue implementation, I found that the number of
> threads used is usually about twice that of the kthread implementation.
> This probably introduces higher scheduling cost, which results in
> higher tail latencies in most cases.
>
> I also ran an application benchmark that performs fixed-qps remote SSD
> read/write operations with various sizes, again with RFS/RPS disabled.
> The results are as follows:
>           op_size  QPS      50%tile  95%tile  99%tile  99.9%tile
> softirq   4K       572.6K   385us    1.5ms    3.16ms   6.41ms
> kthread   4K       572.6K   390us    803us    2.21ms   6.83ms
> workq     4K       572.6K   384us    763us    3.12ms   6.87ms
>
> softirq   64K      157.9K   736us    1.17ms   3.40ms   13.75ms
> kthread   64K      157.9K   745us    1.23ms   2.76ms   9.87ms
> workq     64K      157.9K   746us    1.23ms   2.76ms   9.96ms
>
> softirq   1M       10.98K   2.03ms   3.10ms   3.7ms    11.56ms
> kthread   1M       10.98K   2.13ms   3.21ms   4.02ms   13.3ms
> workq     1M       10.98K   2.13ms   3.20ms   3.99ms   14.12ms
>
> In this set of tests, the latency is dominated by the SSD operation.
> Also, the user threads are much busier than in the tcp_rr tests. We
> have to pin the kthreads/workqueue threads to a few CPUs so that they
> do not disturb the user threads, and to provide some isolation.
>
>
> Changes since v2:
> Corrected a typo in patch 1, and updated the cover letter with more
> detailed and up-to-date test results.
>

Hi everyone,

We thought it was a good time to re-post this patch series for another
round of evaluation, several weeks after the last version. The patch
series itself has not changed much, but I updated the cover letter with
more detailed and up-to-date test results, hoping to give more context.

Thanks for reviewing!
Wei

> Changes since v1:
> Replaced kthread_create() with kthread_run() in patch 5 as suggested by
> Felix Fietkau.
>
> Changes since RFC:
> Renamed the kthreads to napi/<dev>-<napi_id> in patch 5, as suggested
> by Hannes Frederic Sowa.
>
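
For reference, the creation and naming mentioned in the two changelog
notes above could look roughly like the sketch below. The
napi_kthread_create()/napi_threaded_poll() names and the ->thread field
are illustrative, not necessarily what patch 5 uses; the point is that
kthread_run() (unlike kthread_create()) also wakes the thread, and that
the name format gives each thread a recognizable napi/<dev>-<napi_id>
identity in tools like ps and top.

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/netdevice.h>

    static int napi_threaded_poll(void *data);  /* the poll loop sketched earlier */

    /* Illustrative only: create and immediately start a per-napi kthread
     * named napi/<dev>-<napi_id>.  ->thread stands in for whatever field
     * the series adds to struct napi_struct.
     */
    static int napi_kthread_create(struct napi_struct *napi)
    {
            napi->thread = kthread_run(napi_threaded_poll, napi, "napi/%s-%d",
                                       napi->dev->name, napi->napi_id);
            if (IS_ERR(napi->thread)) {
                    int err = PTR_ERR(napi->thread);

                    napi->thread = NULL;
                    return err;
            }
            return 0;
    }
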
> Paolo Abeni (2):
>   net: implement threaded-able napi poll loop support
>   net: add sysfs attribute to control napi threaded mode
> Felix Fietkau (1):
>   net: extract napi poll functionality to __napi_poll()
> Jakub Kicinski (1):
>   net: modify kthread handler to use __napi_poll()
> Wei Wang (1):
>   net: improve napi threaded config
>
>  include/linux/netdevice.h |   5 ++
>  net/core/dev.c            | 143 +++++++++++++++++++++++++++++++++++---
>  net/core/net-sysfs.c      | 100 ++++++++++++++++++++++++++
>  3 files changed, 239 insertions(+), 9 deletions(-)
>
> --
> 2.29.2.454.gaff20da3a2-goog
>
