netdev - Re: [PATCH v3 1/2] net: add support for threaded NAPI polling


Message-ID: <48e3082b-7d89-d0ad-f256-b1fa1dca0a45@gmail.com>
Date:   Sat, 22 Aug 2020 09:22:06 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Jakub Kicinski <kuba@...nel.org>, Felix Fietkau <nbd@....name>
Cc:     netdev@...r.kernel.org, Eric Dumazet <eric.dumazet@...il.com>,
        Hillf Danton <hdanton@...a.com>
Subject: Re: [PATCH v3 1/2] net: add support for threaded NAPI polling



On 8/21/20 6:49 PM, Jakub Kicinski wrote:
> On Fri, 21 Aug 2020 21:01:50 +0200 Felix Fietkau wrote:
>> For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
>> poll function does not perform well. Since NAPI poll is bound to the CPU it
>> was scheduled from, we can easily end up with a few very busy CPUs spending
>> most of their time in softirq/ksoftirqd and some idle ones.
>>
>> Introduce threaded NAPI for such drivers based on a workqueue. The API is the
>> same except for using netif_threaded_napi_add instead of netif_napi_add.
>>
>> In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
>> improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
>> NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
>> thread.
>>
>> With threaded NAPI, throughput seems stable and consistent (and higher than
>> the best results I got without it).
>>
>> Based on a patch by Hillf Danton
> 
> I've tested this patch on a non-NUMA system with a moderately
> high-network workload (roughly 1:6 network to compute cycles)
> - and it provides ~2.5% speedup in terms of RPS but 1/6/10% worse
> P50/P99/P999 latency.
> 
> I started working on a counter-proposal which uses a pool of threads
> dedicated to NAPI polling. It's not unlike the workqueue code but
> trying to be a little more clever. It gives me ~6.5% more RPS but at
> the same time lowers the p99 latency by 35% without impacting other
> percentiles. (I only started testing this afternoon, so hopefully the
> numbers will improve further).
> 
> I'm happy for this patch to be merged, it's quite nice, but I wanted 
> to give the heads up that I may have something that would replace it...
> 
> The extremely rough PoC, less than half-implemented code which is really
> too broken to share:
> https://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git/log/?h=tapi
> 

Yes, the idea of sharing a single napi_workq without the ability to perform
some per-queue tuning is probably okay for the class of devices Felix is interested in.

I vote for waiting a bit to see what you can achieve, since Felix has shown no
intent to work on using kthreads instead of workqueues.

Having one kthread per queue gives us existing instrumentation (sched stats)
and the ability to choose optimal affinities and priorities.
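
To illustrate the point about per-queue affinities, here is a rough userspace
analogy using pthreads. It is not kernel code and is not taken from either
patch: struct queue_worker, poll_loop(), NUM_QUEUES and POLLS_PER_WORKER are
hypothetical names, and the "poll" is just a counter. The only thing it is
meant to show is that with one thread per queue, each queue can get its own
CPU placement, which a single shared workqueue does not offer.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NUM_QUEUES       4      /* hypothetical number of RX queues */
#define POLLS_PER_WORKER 1000   /* hypothetical amount of work per thread */

struct queue_worker {
	int queue_id;
	int cpu;            /* CPU this worker was pinned to */
	int polls_done;     /* stand-in for per-queue statistics */
	pthread_t thread;
};

/* Stand-in for a per-queue poll loop: only counts iterations. */
static void *poll_loop(void *arg)
{
	struct queue_worker *w = arg;

	for (int i = 0; i < POLLS_PER_WORKER; i++)
		w->polls_done++;
	return NULL;
}

/* Return the n-th CPU (wrapping around) among the CPUs we may run on. */
static int pick_allowed_cpu(const cpu_set_t *allowed, int n)
{
	int count = CPU_COUNT(allowed);
	int target;

	if (count == 0)
		return -1;
	target = n % count;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, allowed))
			continue;
		if (target == 0)
			return cpu;
		target--;
	}
	return -1;
}

static void join_workers(struct queue_worker *w, int n)
{
	for (int i = 0; i < n; i++)
		pthread_join(w[i].thread, NULL);
}

/* Start one pinned thread per queue. Returns 0 on success, -1 on error. */
static int start_workers(struct queue_worker *w, int n)
{
	cpu_set_t allowed;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	for (int i = 0; i < n; i++) {
		pthread_attr_t attr;
		cpu_set_t one;
		int err;

		w[i].queue_id = i;
		w[i].polls_done = 0;
		w[i].cpu = pick_allowed_cpu(&allowed, i);
		if (w[i].cpu < 0) {
			join_workers(w, i);
			return -1;
		}

		/* Per-queue tuning: each thread gets its own affinity mask. */
		CPU_ZERO(&one);
		CPU_SET(w[i].cpu, &one);
		pthread_attr_init(&attr);
		pthread_attr_setaffinity_np(&attr, sizeof(one), &one);
		err = pthread_create(&w[i].thread, &attr, poll_loop, &w[i]);
		pthread_attr_destroy(&attr);
		if (err != 0) {
			join_workers(w, i);   /* clean up threads already started */
			return -1;
		}
	}
	return 0;
}
```

A caller would declare an array of NUM_QUEUES workers, call start_workers()
and then join_workers(). Priorities could be set per thread in the same way
through the attribute object, but real-time policies usually need privileges,
so that part is left out of the sketch.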

Thanks!