Message-ID: <CAEA6p_DWVGV9hOh3CcuWPcxSDmOSb94qHMft-o+Ts8KNoKqxxQ@mail.gmail.com>
Date: Thu, 1 Oct 2020 18:44:40 -0700
From: Wei Wang <weiwan@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Eric Dumazet <edumazet@...gle.com>,
"David S . Miller" <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Paolo Abeni <pabeni@...hat.com>, Felix Fietkau <nbd@....name>
Subject: Re: [PATCH net-next 0/5] implement kthread based napi poll
On Thu, Oct 1, 2020 at 4:46 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 1 Oct 2020 15:12:20 -0700 Wei Wang wrote:
> > Yes. I did a round of testing with workqueue as well. The "real
> > workload" I mentioned is a Google-internal application benchmark
> > which involves networking as well as disk ops.
> > There are 2 types of tests there.
> > One is sustained tests, where ops/s is pushed very high, keeping
> > the overall CPU usage above 80%, with various payload sizes.
> > In this type of test case, I see a better result with the kthread
> > model compared to workqueue in the latency metrics, and similar CPU
> > savings, with some tuning of the kthreads. (e.g., we limit the
> > kthreads to a pool of CPUs to run on, to avoid mixing with
> > application threads. I did the same for workqueue as well to be fair.)
>
> Can you share the relative performance delta of this benchmark?
>
> Could you explain why threads are slower than ksoftirqd if you pin the
> application away? From your cover letter it sounded like you want the
> scheduler to see the NAPI load, but then you say you pinned the
> application away from the NAPI cores for the test, so I'm confused.
>
No. We did not explicitly pin the application threads away.
Application threads are free to run anywhere. What we do is restrict
the NAPI kthreads to only those CPUs handling rx interrupts. (For us,
8 CPUs out of 56.) So the load on those CPUs is very high when
running the test, and the scheduler is smart enough to avoid using
those CPUs for the application threads automatically.
Here are the results of one representative test:
            cpu/op   50%tile   95%tile   99%tile
base        71.47    417us     1.01ms    2.9ms
kthread     67.84    396us     976us     2.4ms
workqueue   69.68    386us     791us     1.9ms
Actually, I remembered it wrong. It does seem workqueue is doing
better on latencies. But cpu/op-wise, kthread seems to be a bit
better.
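
To make the CPU restriction above concrete, here is a minimal sketch
of the setup (not the actual patch; napi_threaded_poll() and
rx_irq_cpumask are placeholder names, while kthread_run() and
set_cpus_allowed_ptr() are the existing kernel APIs):

	/*
	 * Hedged sketch, not the actual patch: spawn one kthread per
	 * NAPI instance and restrict it to the CPUs that handle rx
	 * interrupts. napi_threaded_poll() and rx_irq_cpumask are
	 * placeholders for whatever the series actually uses.
	 */
	struct task_struct *t;

	t = kthread_run(napi_threaded_poll, napi, "napi/%s-%u",
			napi->dev->name, napi->napi_id);
	if (!IS_ERR(t))
		set_cpus_allowed_ptr(t, rx_irq_cpumask); /* e.g. the 8 rx CPUs */
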
> > The other is trace-based tests, where the load is based on actual
> > traces taken from real servers. This kind of test has lower load
> > and ops/s overall. (~25% total CPU usage on the host.)
> > In this test case, I observe a similar amount of latency savings
> > with both kthread and workqueue, but workqueue seems to have better
> > CPU savings here, possibly due to fewer threads being woken up to
> > process the load.
> >
> > And one reason we would like to push forward with 1 kthread per
> > NAPI is that we are also trying to do busy polling with the
> > kthread. And it seems a good model to have 1 kthread dedicated to
> > 1 NAPI to begin with.
>
> And you'd pin those busy polling threads to a specific, single CPU, too?
> 1 cpu : 1 thread : 1 NAPI?
Yes. That is my thought.
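Roughly along these lines (again only a sketch on top of existing
kthread APIs; napi_threaded_poll() is a placeholder):

	/*
	 * Hedged sketch of the 1 cpu : 1 thread : 1 NAPI idea: create
	 * the per-NAPI thread and bind it to a dedicated CPU before it
	 * starts running. kthread_create(), kthread_bind() and
	 * wake_up_process() are existing kernel APIs.
	 */
	struct task_struct *t;

	t = kthread_create(napi_threaded_poll, napi, "napi/%s-%u",
			   napi->dev->name, napi->napi_id);
	if (!IS_ERR(t)) {
		kthread_bind(t, cpu);	/* dedicated CPU for this NAPI */
		wake_up_process(t);
	}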