Message-ID: <b163bfe3-0b91-e2cf-f702-8ab08a30db0d@nbd.name>
Date: Fri, 17 Feb 2023 16:26:37 +0100
From: Felix Fietkau <nbd@....name>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC v2] net/core: add optional threading for rps backlog
processing
On 17.02.23 15:38, Eric Dumazet wrote:
> On Fri, Feb 17, 2023 at 2:40 PM Felix Fietkau <nbd@....name> wrote:
>>
>> On 17.02.23 13:57, Eric Dumazet wrote:
>> > On Fri, Feb 17, 2023 at 1:35 PM Felix Fietkau <nbd@....name> wrote:
>> >>
>> >> On 17.02.23 13:23, Eric Dumazet wrote:
>> >> > On Fri, Feb 17, 2023 at 11:06 AM Felix Fietkau <nbd@....name> wrote:
>> >> >>
>> >> >> When dealing with few flows or an imbalance in CPU utilization, static RPS
>> >> >> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>> >> >> for RPS backlog processing in order to allow the scheduler to better balance
>> >> >> processing. This helps spread the load across idle CPUs more evenly.
>> >> >>
>> >> >> Signed-off-by: Felix Fietkau <nbd@....name>
>> >> >> ---
>> >> >>
>> >> >> RFC v2:
>> >> >> - fix rebase error in rps locking
>> >> >
>> >> > Why only deal with RPS ?
>> >> >
>> >> > It seems you propose the softnet_data backlog be processed by a thread,
>> >> > rather than from softirq?
>> >> Right. I originally wanted to mainly improve RPS, but my patch does
>> >> cover backlog in general. I will update the description in the next
>> >> version. Does the approach in general make sense to you?
>> >>
>> >
>> > I do not know, this seems to lack some (perf) numbers, and
>> > descriptions of added max latencies and stuff like that :)
>> I just ran some tests using an MT7621 device (dual-core 800 MHz
>> MIPS, 4 threads) as a router doing NAT without flow offloading.
>>
>> Using the flent RRUL test between two PCs connected through the router,
>> I got these results:
>>
>> rps_threaded=0: (combined CPU idle time around 27%)
>>                          avg     median    99th %           # data pts
>>  Ping (ms) ICMP   :    26.08     28.70     54.74  ms              199
>>  Ping (ms) UDP BE :     1.96     24.12     37.28  ms              200
>>  Ping (ms) UDP BK :     1.88     15.86     27.30  ms              200
>>  Ping (ms) UDP EF :     1.98     31.77     54.10  ms              200
>>  Ping (ms) avg    :     1.94       N/A       N/A  ms              200
>>  TCP download BE  :    69.25     70.20    139.55  Mbits/s         200
>>  TCP download BK  :    95.15     92.51    163.93  Mbits/s         200
>>  TCP download CS5 :   133.64    129.10    292.46  Mbits/s         200
>>  TCP download EF  :   129.86    127.70    254.47  Mbits/s         200
>>  TCP download avg :   106.97       N/A       N/A  Mbits/s         200
>>  TCP download sum :   427.90       N/A       N/A  Mbits/s         200
>>  TCP totals       :   864.43       N/A       N/A  Mbits/s         200
>>  TCP upload BE    :    97.54     96.67    163.99  Mbits/s         200
>>  TCP upload BK    :   139.76    143.88    190.37  Mbits/s         200
>>  TCP upload CS5   :    97.52     94.70    206.60  Mbits/s         200
>>  TCP upload EF    :   101.71    106.72    147.88  Mbits/s         200
>>  TCP upload avg   :   109.13       N/A       N/A  Mbits/s         200
>>  TCP upload sum   :   436.53       N/A       N/A  Mbits/s         200
>>
>> rps_threaded=1: (combined CPU idle time around 16%)
>>                          avg     median    99th %           # data pts
>>  Ping (ms) ICMP   :    13.70     16.10     27.60  ms              199
>>  Ping (ms) UDP BE :     2.03     18.35     24.16  ms              200
>>  Ping (ms) UDP BK :     2.03     18.36     29.13  ms              200
>>  Ping (ms) UDP EF :     2.36     25.20     41.50  ms              200
>>  Ping (ms) avg    :     2.14       N/A       N/A  ms              200
>>  TCP download BE  :   118.69    120.94    160.12  Mbits/s         200
>>  TCP download BK  :   134.67    137.81    177.14  Mbits/s         200
>>  TCP download CS5 :   126.15    127.81    174.84  Mbits/s         200
>>  TCP download EF  :    78.36     79.41    143.31  Mbits/s         200
>>  TCP download avg :   114.47       N/A       N/A  Mbits/s         200
>>  TCP download sum :   457.87       N/A       N/A  Mbits/s         200
>>  TCP totals       :   918.19       N/A       N/A  Mbits/s         200
>>  TCP upload BE    :   112.20    111.55    164.38  Mbits/s         200
>>  TCP upload BK    :   144.99    139.24    205.12  Mbits/s         200
>>  TCP upload CS5   :    93.09     95.50    132.39  Mbits/s         200
>>  TCP upload EF    :   110.04    108.21    207.00  Mbits/s         200
>>  TCP upload avg   :   115.08       N/A       N/A  Mbits/s         200
>>  TCP upload sum   :   460.32       N/A       N/A  Mbits/s         200
>>
>> As you can see, both throughput and latency improve because load can be
>> better distributed across CPU cores.
>>
>
> What happens if user threads are competing with your kthreads ?
>
> It seems you are adding another variant of ksoftirqd.
>
> More threads might look better in some cases, if we accept being at
> the mercy of process scheduling decisions.
>
> NUMA affinities matter; I do not see how you are dealing with this.
>
> Your patch assumes all cpus can participate in network processing,
> but rps is fine-grained:
> Boxes with 128 or 256 cpus usually have a different rps_mask per receive
> queue, with 2 or 4 bits set in the per-rx-queue rps_mask.
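(For context, the per-rx-queue mask Eric refers to is the rps_cpus sysfs
attribute. Below is a minimal userspace sketch of setting a 2-bit mask;
the interface name, queue and CPU bits are placeholders, not values from
the patch:)

#include <stdio.h>

int main(void)
{
	/* mask 0xc = CPUs 2 and 3; eth0/rx-0 chosen only for illustration */
	FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("c\n", f);
	fclose(f);
	return 0;
}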
I'm assuming that in cases where this matters, user space can set the
affinities and priority of the rps threads. The number in the rps-%d
process name matches the bit number in rps_mask, so user space can
control things in a more fine-grained way, as sketched below.
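For illustration, pinning and prioritizing such a thread could look like
this from user space (a sketch only: the pid of the rps-0 kthread is
assumed to have been looked up by name beforehand, and the CPU and
priority values are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	pid_t pid = 1234;	/* assumed pid of the "rps-0" kthread */
	struct sched_param sp = { .sched_priority = 10 };
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* restrict the thread to CPU 2 */

	if (sched_setaffinity(pid, sizeof(set), &set))
		perror("sched_setaffinity");
	if (sched_setscheduler(pid, SCHED_FIFO, &sp))
		perror("sched_setscheduler");
	return 0;
}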
Would you prefer having support for selectively enabling threading on
individual backlogs with a mask?
> Then, process_backlog() has been designed to run only from the cpu
> tied to the per-cpu data (softnet_data).
> There are multiple comments about this assumption, and various things
> that would need to be changed
> (e.g. sd_has_rps_ipi_waiting() would be wrong in its current implementation).
That's why I added the NAPI_STATE_THREADED check in napi_schedule_rps(),
so that sd_has_rps_ipi_waiting() would always return false.
Or are you worried about a race when enabling threading?
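For reference, the idea is roughly the following (a simplified sketch of
where the check sits, not the literal diff; the surrounding function body
is paraphrased from the current net/core/dev.c):

static void napi_schedule_rps(struct softnet_data *sd)
{
	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);

#ifdef CONFIG_RPS
	if (sd != mysd) {
		/* Threaded backlog: wake the remote backlog's kthread
		 * directly and never chain sd onto rps_ipi_list, so
		 * sd_has_rps_ipi_waiting() keeps returning false. */
		if (test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
			__napi_schedule_irqoff(&sd->backlog);
			return;
		}

		sd->rps_ipi_next = mysd->rps_ipi_list;
		mysd->rps_ipi_list = sd;

		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
		return;
	}
#endif /* CONFIG_RPS */
	__napi_schedule_irqoff(&mysd->backlog);
}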
- Felix