netdev - Re: [PATCH v3 1/2] net: add support for threaded NAPI polling


Message-ID: <bb520bbe-3e6e-949f-e534-53f1425f91eb@dd-wrt.com>
Date:   Sun, 30 Aug 2020 10:46:09 +0200
From:   Sebastian Gottschall <s.gottschall@...wrt.com>
To:     Jakub Kicinski <kuba@...nel.org>, Felix Fietkau <nbd@....name>
Cc:     netdev@...r.kernel.org, Eric Dumazet <eric.dumazet@...il.com>,
        Hillf Danton <hdanton@...a.com>
Subject: Re: [PATCH v3 1/2] net: add support for threaded NAPI polling


On 22.08.2020 at 03:49, Jakub Kicinski wrote:
> On Fri, 21 Aug 2020 21:01:50 +0200 Felix Fietkau wrote:
>> For some drivers (especially 802.11 drivers), doing a lot of work in the NAPI
>> poll function does not perform well. Since NAPI poll is bound to the CPU it
>> was scheduled from, we can easily end up with a few very busy CPUs spending
>> most of their time in softirq/ksoftirqd and some idle ones.
>>
>> Introduce threaded NAPI for such drivers based on a workqueue. The API is the
>> same except for using netif_threaded_napi_add instead of netif_napi_add.
>>
>> In my tests with mt76 on MT7621 using threaded NAPI + a thread for tx scheduling
>> improves LAN->WLAN bridging throughput by 10-50%. Throughput without threaded
>> NAPI is wildly inconsistent, depending on the CPU that runs the tx scheduling
>> thread.
>>
>> With threaded NAPI, throughput seems stable and consistent (and higher than
>> the best results I got without it).
>>
>> Based on a patch by Hillf Danton
> I've tested this patch on a non-NUMA system with a moderately
> high-network workload (roughly 1:6 network to compute cycles)
> - and it provides ~2.5% speedup in terms of RPS but 1/6/10% worse
> P50/P99/P999 latency.
>
> I started working on a counter-proposal which uses a pool of threads
> dedicated to NAPI polling. It's not unlike the workqueue code but
> trying to be a little more clever. It gives me ~6.5% more RPS but at
> the same time lowers the p99 latency by 35% without impacting other
> percentiles. (I only started testing this afternoon, so hopefully the
> numbers will improve further).
>
> I'm happy for this patch to be merged, it's quite nice, but I wanted
> to give the heads up that I may have something that would replace it...
>
> The extremely rough PoC, less than half-implemented code which is really
> too broken to share:
> https://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git/log/?h=tapi

Looks interesting. Keep going.

Sebastian
