Message-ID: <193501d0-094a-cc5a-c3ae-4553a56e3a3a@redhat.com>
Date: Wed, 29 Mar 2023 18:14:32 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Felix Fietkau <nbd@....name>, Jakub Kicinski <kuba@...nel.org>
Cc: brouer@...hat.com, netdev@...r.kernel.org,
Jonathan Corbet <corbet@....net>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, Dave Taht <dave.taht@...il.com>
Subject: Re: [PATCH net-next] net/core: add optional threading for backlog
processing

On 24/03/2023 18.35, Felix Fietkau wrote:
> On 24.03.23 18:20, Jakub Kicinski wrote:
>> On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
>>> When dealing with few flows or an imbalance on CPU utilization, static RPS
>>> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>>> for backlog processing in order to allow the scheduler to better balance
>>> processing. This helps better spread the load across idle CPUs.
>>
>> Can you explain the use case a little bit more?
>
> I'm primarily testing this on routers with 2 or 4 CPUs and limited
> processing power, handling routing/NAT. RPS is typically needed to
> properly distribute the load across all available CPUs. When there is
> only a small number of flows that are pushing a lot of traffic, a static
> RPS assignment often leaves some CPUs idle, whereas others become a
> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but
> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>
> Making backlog processing threaded helps split up the processing work
> even more and distribute it onto remaining idle CPUs.
>
> It can basically be used to make RPS a bit more dynamic and
> configurable, because you can assign multiple backlog threads to a set
> of CPUs and selectively steer packets from specific devices / rx queues
> to them and allow the scheduler to take care of the rest.
>
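
(For anyone wanting to experiment with the combination described above,
a rough configuration sketch follows. The rps_cpus sysfs file is
long-standing kernel ABI; the backlog_threaded sysctl name is my
reading of this patch and may change in later revisions, and eth0 plus
the CPU mask are placeholders.)

#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* Steer eth0 rx-0 into the backlog of CPUs 1-3 (mask 0xe). */
	write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "e");
	/* Run backlog processing in kthreads instead of softirq
	 * (knob added by this patch; name assumed). */
	write_str("/proc/sys/net/core/backlog_threaded", "1");
	return 0;
}
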
My experience with RPS was that it was too slow on the RX-CPU, meaning
that it doesn't really scale: the RX-CPU doing the steering becomes the
bottleneck. (My data is old, and it might scale differently on your ARM
boards.)

This is why I/we created the XDP "cpumap". It also creates a kernel
threaded model, via a kthread on each "map-configured" CPU. It scales
significantly better than RPS, but it doesn't handle flows and packet
Out-of-Order (OoO) situations automatically like RPS does. That is left
up to the BPF programmer. The kernel samples/bpf xdp_redirect_cpu[0]
has code that shows strategies for load-balancing flows.
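
To illustrate one such strategy (a minimal sketch, not the sample code
itself; NR_CPUS is a placeholder, and userspace still has to populate
the cpumap entries with a queue size before the redirect works):
hashing on the IPv4 address pair keeps every packet of a flow on one
CPU, which avoids OoO at the cost of imbalance when only a few flows
carry the traffic, i.e. the exact trade-off described above.

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NR_CPUS 4 /* placeholder: nr of cpumap slots userspace fills */

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, NR_CPUS);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_lb_flow_hash(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	__u32 cpu;

	if ((void *)(eth + 1) > data_end ||
	    eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Same address pair -> same CPU: preserves per-flow ordering. */
	cpu = (iph->saddr ^ iph->daddr) % NR_CPUS;

	return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";
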
The project xdp-cpumap-tc[1] runs in production (3 ISPs are using it)
and works in concert with netstack Traffic Control (TC) to scale
bandwidth shaping at the ISPs. OoO is avoided by redirecting all of a
customer's IPs to the same TX/egress CPU; a minimal sketch of that
lookup follows below. As the README[1] describes, it is recommended to
reduce the number of RX-CPUs processing packets and to have more
TX-CPUs that basically run netstack/TC. One ISP with 2x25Gbit/s is
using 2 CPUs for RX and 6 CPUs for TX.
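
The sketch (not the actual xdp-cpumap-tc code; ip_to_cpu is a
hypothetical map name - in the real project the mapping is generated
from the TC configuration, and the map sizes here are placeholders):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);   /* customer IPv4 address */
	__type(value, __u32); /* index into cpu_map */
} ip_to_cpu SEC(".maps");

SEC("xdp")
int xdp_pin_customer(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	__u32 *cpu;

	if ((void *)(eth + 1) > data_end ||
	    eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Downlink direction: the customer is the destination, so all
	 * of this customer's traffic lands on one TX/egress CPU. */
	cpu = bpf_map_lookup_elem(&ip_to_cpu, &iph->daddr);
	if (!cpu)
		return XDP_PASS;

	return bpf_redirect_map(&cpu_map, *cpu, 0);
}

char _license[] SEC("license") = "GPL";
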
--Jesper
[0] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu.bpf.c
[1] https://github.com/xdp-project/xdp-cpumap-tc