Date:   Thu, 1 Oct 2020 15:12:20 -0700
From:   Wei Wang <weiwan@...gle.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Hannes Frederic Sowa <hannes@...essinduktion.org>,
        Paolo Abeni <pabeni@...hat.com>, Felix Fietkau <nbd@....name>
Subject: Re: [PATCH net-next 0/5] implement kthread based napi poll

On Thu, Oct 1, 2020 at 1:26 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> > On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> > > > With napi poll moved to a kthread, the scheduler is in charge of
> > > > scheduling both the kthreads handling network load and the user
> > > > threads, and is able to make better decisions. In the previous
> > > > benchmark, if we do this and pin the kthreads processing napi poll
> > > > to specific CPUs, the scheduler is able to schedule user threads
> > > > away from these CPUs automatically.
> > > >
> > > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > > > entity per host, is that a kthread is more configurable than a
> > > > workqueue, and we could leverage existing tuning tools for threads,
> > > > like taskset, chrt, etc., to tune the scheduling class, CPU set, and
> > > > so on. Another reason is that if we eventually want to provide a
> > > > busy poll feature using kernel threads for napi poll, a kthread
> > > > seems more suitable than a workqueue.
> > >
> > > As I said in my reply to the RFC, I see better performance with the
> > > workqueue implementation, so I would hold off until we have more
> > > conclusive results there, as this set adds fairly strong uAPI that
> > > we'll have to support forever.
> >
> > We can make incremental changes; the kthread implementation looks much
> > nicer to us.
>
> Having done two implementations of something more wq-like now,
> I can say with some confidence that it's quite likely not a
> simple extension of this model. And since we'll likely need
> to support switching at runtime, there will be a fast-path
> synchronization overhead.
>
> > A single work queue is a problem on server-class platforms with
> > NUMA placement.
> > We now have servers with NICs on different NUMA nodes.
>
> Are you saying that the wq code is less NUMA-friendly than unpinned
> threads?
>
> > We cannot introduce a new model that will make all workloads better
> > without any tuning.
> > If you really think you can do that, think again.
>
> Has Wei tested the wq implementation with real workloads?
>
> All the cover letter has is some basic netperf runs and a vague
> sentence saying "real workload also improved".
>

Yes. I did a round of testing with workqueue as well. The "real
workload" I mentioned is a Google-internal application benchmark which
involves networking as well as disk ops.
There are 2 types of tests there.
One is a sustained test, where the ops/s is pushed very high and the
overall CPU usage is kept above 80%, with various payload sizes.
In this type of test case, I see better latency metrics with the
kthread model compared to workqueue, and similar CPU savings, with
some tuning of the kthreads. (e.g., we limit the kthreads to a pool of
CPUs to run on, to avoid mixing with application threads. I did the
same for workqueue as well to be fair.)
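
To make the tuning concrete, here is a minimal userspace sketch of the
kind of pinning I mean: restricting a napi kthread, by its TID, to a
pool of CPUs and optionally moving it into a real-time class, which is
essentially what taskset and chrt do from the shell. The TID, CPU list
and priority below are placeholders, not values from the benchmark.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Pin a thread (e.g. a napi kthread's TID) to a pool of CPUs and,
 * optionally, switch it to SCHED_FIFO -- roughly `taskset -pc` plus
 * `chrt -f -p`. All command-line values are examples only. */
int main(int argc, char **argv)
{
        if (argc < 3) {
                fprintf(stderr, "usage: %s <tid> <cpu> [cpu ...]\n", argv[0]);
                return 1;
        }

        pid_t tid = (pid_t)atoi(argv[1]);

        /* Build the CPU pool the thread is allowed to run on. */
        cpu_set_t pool;
        CPU_ZERO(&pool);
        for (int i = 2; i < argc; i++)
                CPU_SET(atoi(argv[i]), &pool);

        if (sched_setaffinity(tid, sizeof(pool), &pool)) {
                perror("sched_setaffinity");
                return 1;
        }

        /* Optional: give the poll thread a real-time class, as chrt would.
         * Priority 50 is an arbitrary example. */
        struct sched_param sp = { .sched_priority = 50 };
        if (sched_setscheduler(tid, SCHED_FIFO, &sp))
                perror("sched_setscheduler");

        return 0;
}

The same restriction can of course be applied with plain taskset/chrt;
the point is that any existing per-thread tool works on the kthreads,
which is part of why we find this model easier to tune than a
workqueue.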
The other is a trace-based test, where the load is replayed from
actual traces taken from real servers. This kind of test has lower
load and ops/s overall (~25% total CPU usage on the host).
In this test case, I observe a similar amount of latency savings with
both kthread and workqueue, but workqueue seems to have better CPU
savings here, possibly because fewer threads are woken up to process
the load.

And one reason we would like to push forward with 1 kthread per NAPI
is that we are also trying to do busy polling with the kthread, and it
seems a good model to have 1 kthread dedicated to 1 NAPI to begin
with.
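
For reference, the busy polling we have today is an opt-in on each
socket via SO_BUSY_POLL, and the spinning happens in the context of
the thread doing the read. The sketch below only shows that existing
per-socket knob (the UDP socket and the 200us budget are arbitrary
examples); the idea with a dedicated kthread per NAPI is that this
kind of polling could be done by the poll thread instead of the
application thread.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Today's per-socket busy poll opt-in: blocking reads on this socket
 * spin on the device queue for up to 'busy_poll_usecs' before sleeping.
 * Setting SO_BUSY_POLL may require CAP_NET_ADMIN. */
int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        int busy_poll_usecs = 200;      /* example value */
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_poll_usecs, sizeof(busy_poll_usecs)))
                perror("setsockopt(SO_BUSY_POLL)");

        /* ... bind()/recv() as usual; reads now busy poll briefly. */
        close(fd);
        return 0;
}

With 1 kthread per NAPI there is an obvious place for that polling to
live, which is why the per-NAPI thread looks like the right starting
point to us.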

> I think it's possible to get something that will be a better default
> for 90% of workloads. Our current model predates SMP by two decades.
> It's pretty bad.
>
> I'm talking about upstream defaults, obviously; maybe you're starting
> from a different baseline configuration than the rest of the world...
>
> > Even the old 'fix' (commit 4cd13c21b207e80ddb1144c576500098f2d5f882,
> > "softirq: Let ksoftirqd do its job")
> > had severe issues for latency-sensitive jobs.
> >
> > We need to be able to opt in to threads, and let the process
> > scheduler make decisions.
> > If we believe the process scheduler makes bad decisions, they should
> > be reported to scheduler experts.
>
> I wouldn't expect that the scheduler will learn all by itself how to
> group processes that run identical code for cache efficiency, and how
> to schedule at a 10us scale. I hope I'm wrong.
>
> > I fully support this implementation; I do not want to wait for yet
> > another 'work queue' model or scheduler classes.
>
> I can't sympathize. I don't understand why you're trying to rush this.
> And you're not giving me enough info about your target config to be able
> to understand your thinking.
