Message-ID: <CANn89iKQ3g2+nSWaV3BWarpbneRCSoGSXdGP90PF7ScDu4ULEQ@mail.gmail.com>
Date: Wed, 30 Oct 2024 13:55:07 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Caleb Sander <csander@...estorage.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU
On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
>
> On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > <csander@...estorage.com> wrote:
> > >
> > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > packets to the correct CPU, this additional queuing is unnecessary and
> > > the spinlocks involved incur significant CPU overhead.
> > >
> > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > current CPU.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> >
> > Current implementation was a conscious choice. This has been discussed
> > several times.
> >
> > By processing packets inline, you are actually increasing latencies of
> > packets queued to other cpus.
>
> Sorry, I wasn't aware of these prior discussions. I take it you are
> referring to threads like
> https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> ? I see what you mean about the latency penalty for packets that do
> require cross-CPU steering.
>
> Do you have an alternate suggestion for how to avoid the overhead of
> acquiring a spinlock for every packet? The atomic instruction in
> rps_lock_irq_disable() called from process_backlog() is consuming 5%
> of our CPU time. For our use case, we don't really want software RPS;
> we are expecting ARFS to steer all high-bandwidth traffic to the
> desired CPUs. We would happily turn off software RPS entirely if we
> could, which seems like it would avoid the concerns about higher
> latency for packets that need to be steered to a different CPU. But
> my understanding is that using ARFS requires RPS to be enabled
> (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> queue), which enables these rps_needed static branches. Is that
> correct? If so, would you be open to adding a sysctl that disables
> software RPS and relies upon ARFS to do the packet steering?
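
For context, the per-packet cost described above comes from a pattern
roughly like the following (a simplified sketch of the process_backlog()
path in net/core/dev.c, not the exact upstream code; helper and field
names vary a bit between kernel versions):

/* Simplified sketch -- not the exact upstream code. Every packet that
 * takes the RPS path is appended to the target CPU's
 * softnet_data.input_pkt_queue under its spinlock, and process_backlog()
 * later takes the same lock to splice the queue into its local
 * process_queue, even when producer and consumer are the same CPU.
 */
rps_lock_irq_disable(sd);
skb_queue_splice_tail_init(&sd->input_pkt_queue, &sd->process_queue);
rps_unlock_irq_enable(sd);

while ((skb = __skb_dequeue(&sd->process_queue)))
	__netif_receive_skb(skb);
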
A sysctl will not avoid the fundamental issue.
Why not instead address the past feedback?

Can you test the following?

diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
+#ifdef CONFIG_RPS
+static bool net_must_use_backlog(int tcpu)
+{
+	if (tcpu < 0)
+		return false;
+	if (tcpu != smp_processor_id())
+		return true;
+	/* target cpu is ourself. We must use our backlog
+	 * if we have deferred IPI or packets.
+	 */
+	return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
+	       this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
+}
+#endif
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -5857,7 +5872,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-		if (cpu >= 0) {
+		if (net_must_use_backlog(cpu)) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
 			return ret;
@@ -5890,7 +5905,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
 			struct rps_dev_flow voidflow, *rflow = &voidflow;
 			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-			if (cpu >= 0) {
+			if (net_must_use_backlog(cpu)) {
 				/* Will be handled, remove from list */
 				skb_list_del_init(skb);
 				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
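
With this approach, the common ARFS case (the packet is already on the
CPU that RPS would pick, the local backlog is empty and no RPS IPIs are
deferred) skips enqueue_to_backlog() and its input_pkt_queue lock
entirely, while cross-CPU packets and anything arriving behind
already-queued work still go through the backlog. A condensed,
illustrative view of the resulting path (not part of the patch):

/* Illustrative only -- condensed view of the receive path with the
 * helper above applied (RCU and error handling elided).
 */
int cpu = get_rps_cpu(skb->dev, skb, &rflow);

if (net_must_use_backlog(cpu)) {
	/* Remote target CPU, or local backlog/IPIs already pending:
	 * keep going through the per-CPU backlog queue.
	 */
	ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
} else {
	/* Already on the right CPU with nothing queued ahead of us:
	 * process the packet inline, no input_pkt_queue lock needed.
	 */
	ret = __netif_receive_skb(skb);
}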