Message-ID: <e8371549-30e9-b6cb-7d44-3325f9311c24@gmail.com>
Date: Fri, 25 Sep 2020 19:30:19 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Wei Wang <weiwan@...gle.com>,
Magnus Karlsson <magnus.karlsson@...il.com>
Cc: "David S . Miller" <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>,
Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Felix Fietkau <nbd@....name>,
Björn Töpel <bjorn.topel@...el.com>
Subject: Re: [RFC PATCH net-next 0/6] implement kthread based napi poll
On 9/25/20 7:15 PM, Wei Wang wrote:
> On Fri, Sep 25, 2020 at 6:48 AM Magnus Karlsson
> <magnus.karlsson@...il.com> wrote:
>>
>> On Mon, Sep 14, 2020 at 7:26 PM Wei Wang <weiwan@...gle.com> wrote:
>>>
>>> The idea of moving the napi poll process out of softirq context into a
>>> kernel thread based context is not new.
>>> Paolo Abeni and Hannes Frederic Sowa proposed patches to move napi
>>> poll to a kthread back in 2016. And Felix Fietkau proposed patches
>>> along similar lines just a few weeks ago, using a workqueue to process
>>> napi poll.
>>>
>>> The main reason we'd like to push forward with this idea is that the
>>> scheduler has poor visibility into cpu cycles spent in softirq context,
>>> and is not able to make optimal scheduling decisions for the user threads.
>>> For example, in one application benchmark with high network load, the
>>> CPUs handling network softirqs have ~80% cpu utilization, yet user
>>> threads are still scheduled on those CPUs despite other, more idle
>>> cpus being available in the system, and we see very high tail
>>> latencies. In this case, we have to explicitly pin user threads away
>>> from the CPUs handling network softirqs to ensure good performance.
>>> With napi poll moved to a kthread, the scheduler is in charge of
>>> scheduling both the kthreads handling network load and the user
>>> threads, and is able to make better decisions. In the same benchmark,
>>> if we do this and pin the kthreads processing napi poll to specific
>>> CPUs, the scheduler schedules user threads away from these CPUs
>>> automatically.
>>>
>>> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
>>> entity per host, is that a kthread is more configurable than a
>>> workqueue: we can leverage existing thread tuning tools, like taskset
>>> and chrt, to adjust the scheduling class, cpu set, and so on (see the
>>> sketch below). Another reason is that if we eventually want to provide
>>> a busy poll feature using kernel threads for napi poll, a kthread
>>> seems more suitable than a workqueue.
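
(For illustration: a minimal userspace sketch of the per-thread tuning
referred to above, doing roughly what "taskset -pc 2 <pid>" and
"chrt -f -p 10 <pid>" do under the hood. How the napi kthread's PID is
looked up is an assumption here, not something this series defines.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	pid_t pid;
	struct sched_param sp = { .sched_priority = 10 };
	cpu_set_t set;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);		/* PID of the napi kthread */

	CPU_ZERO(&set);
	CPU_SET(2, &set);		/* restrict the kthread to CPU 2 */
	if (sched_setaffinity(pid, sizeof(set), &set))
		perror("sched_setaffinity");

	/* give it a real-time scheduling class, as chrt would */
	if (sched_setscheduler(pid, SCHED_FIFO, &sp))
		perror("sched_setscheduler");
	return 0;
}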
>>>
>>> In this patch series, I revived Paolo's and Hannes's patches from 2016
>>> and left them as the first 2 patches. On top of those are changes
>>> proposed by Felix, Jakub, Paolo and myself, with suggestions from
>>> Eric Dumazet.
>>>
>>> In terms of performance, I ran tcp_rr tests with 1000 flows and
>>> various request/response sizes, with RFS/RPS disabled, and compared
>>> performance between the softirq and kthread modes. The host has 56
>>> hyperthreads and a 100Gbps NIC.
>>>
>>> mode     req/resp  QPS     50%tile  90%tile  99%tile  99.9%tile
>>> softirq  1B/1B     2.19M   284us    987us    1.1ms    1.56ms
>>> kthread  1B/1B     2.14M   295us    987us    1.0ms    1.17ms
>>>
>>> softirq  5KB/5KB   1.31M   869us    1.06ms   1.28ms   2.38ms
>>> kthread  5KB/5KB   1.32M   878us    1.06ms   1.26ms   1.66ms
>>>
>>> softirq  1MB/1MB   10.78K  84ms     166ms    234ms    294ms
>>> kthread  1MB/1MB   10.83K  82ms     173ms    262ms    320ms
>>>
>>> I also ran an application benchmark where the user threads have more
>>> work to do. We do see a good amount of tail latency reduction with the
>>> kthread model.
>>
>> I really like this RFC and would encourage you to submit it as a
>> patch. Would love to see it make it into the kernel.
>>
>
> Thanks for the feedback! I am preparing an official patchset for this
> and will send them out soon.
>
>> I see the same positive effects as you when trying it out with AF_XDP
>> sockets. I made some simple experiments where I sent 64-byte packets to
>> a single AF_XDP socket. I have not managed to figure out how to get
>> percentiles out of my load generator, so this is going to be min, avg
>> and max only. The application using the AF_XDP socket just performs a
>> mac swap on the packet (sketched below) and sends it back to the load
>> generator, which then measures the round-trip latency. The kthread is
>> pinned with taskset to the same core that ksoftirqd would run on, so in
>> each experiment they always run on the same core id (which is not the
>> core the application runs on).
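
(For concreteness, a minimal sketch of the mac-swap step described
above, leaving out all of the AF_XDP plumbing -- socket creation, umem
registration, and the fill/completion/rx/tx rings. `pkt` is assumed to
point at the start of an Ethernet frame in the umem.)

#include <linux/if_ether.h>
#include <string.h>

/* Swap source and destination MAC addresses in place. */
static void mac_swap(void *pkt)
{
	struct ethhdr *eth = pkt;
	unsigned char tmp[ETH_ALEN];

	memcpy(tmp, eth->h_source, ETH_ALEN);
	memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
	memcpy(eth->h_dest, tmp, ETH_ALEN);
}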
>>
>> Rate 12 Mpps with 0% loss.
>>
>>            Latencies (us)       Delay variation between packets
>>            min   avg   max      avg      max
>> softirq   11.0  17.1  78.4      0.116    63.0
>> kthread   11.2  17.1  35.0      0.116    20.9
>>
>> Rate ~58 Mpps (line rate at 40 Gbit/s) with substantial loss.
>>
>>            Latencies (us)         Delay variation between packets
>>            min    avg    max      avg      max
>> softirq   87.6  194.9  282.6      0.062    25.9
>> kthread   86.5  185.2  271.8      0.061    22.5
>>
>> For the last experiment, I also get 1.5% to 2% higher throughput with
>> your kthread approach. Moreover, just from the per-second throughput
>> printouts from my application, I can see that the kthread numbers are
>> more stable. The softirq numbers can vary quite a lot from second to
>> second, around +-3%, but the kthread numbers are nice and stable. I
>> have not examined why.
>>
>
> Thanks for sharing the results!
>
>> One thing I noticed though, and I do not know if this is an issue, is
>> that the switch between the two modes does not occur at high packet
>> rates. I have to lower the packet rate to something that keeps the
>> core below 100% utilization for it to switch from ksoftirqd to kthread
>> and vice versa. They just seem too busy to switch at 100% load when
>> the "threaded" sysfs variable is changed (see below for the knob).
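
(For reference, the mode switch being discussed is driven by a
per-device sysfs write. Assuming the knob lives at
/sys/class/net/<dev>/threaded -- its location in later revisions of
this work; the exact path in this RFC may differ -- toggling it looks
like this:)

#include <fcntl.h>
#include <unistd.h>

/* Enable (on=1) or disable (on=0) threaded napi for a device by
 * writing its sysfs attribute, e.g. "/sys/class/net/eth0/threaded".
 */
static int set_threaded(const char *path, int on)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, on ? "1" : "0", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}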
>>
>
> I think the reason for this is that when load is high, napi_poll()
> probably always exhausts the predefined napi->weight, so it keeps
> re-polling in the current context. The switch can only happen the next
> time ___napi_schedule() is called (roughly sketched below).
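
(To make the deferral concrete, a simplified sketch of that dispatch
point, based on the shape this code takes in later versions of this
work rather than the exact patch text. The softirq-vs-kthread decision
is made only here, so a napi that keeps exhausting its weight and
re-polling in place never passes back through it until load drops.)

static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
		/* threaded mode: hand this napi to its dedicated kthread */
		wake_up_process(napi->thread);
		return;
	}
	/* legacy mode: queue the napi for NET_RX_SOFTIRQ processing */
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}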

A similar problem happens when /proc/irq/{..}/smp_affinity is changed.
A few drivers actually detect that the affinity has changed (and no
longer includes the current cpu) and force a napi poll complete/exit,
so that a new hardware interrupt is allowed and routed to another cpu
(see the sketch below).
Presumably the softirq -> kthread transition could be enforced in the
same way if really needed.
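
(As an illustration of that driver pattern -- mlx4 does something along
these lines -- a simplified poll function. drv_process_rx(),
drv_irq() and drv_enable_irq() are hypothetical driver helpers, not
real kernel API.)

static int drv_napi_poll(struct napi_struct *napi, int budget)
{
	int done = drv_process_rx(napi, budget);	/* hypothetical rx loop */

	if (done == budget) {
		/* Still busy: normally return budget and keep getting
		 * re-polled on this cpu. Bail out instead if this cpu was
		 * dropped from the interrupt's affinity mask.
		 */
		if (likely(cpumask_test_cpu(smp_processor_id(),
				irq_get_affinity_mask(drv_irq(napi)))))
			return budget;
	}

	napi_complete_done(napi, done);
	drv_enable_irq(napi);	/* re-arm; next irq lands on the new cpu */
	return done;
}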