Message-ID: <CADUfDZqEe--KodhfJLK065biSE__TQ-FZNRtyanfXTA+iPjn4Q@mail.gmail.com>
Date: Thu, 14 Nov 2024 13:35:32 -0800
From: Caleb Sander <csander@...estorage.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU
On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@...estorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steered to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.
> Why not instead address the past feedback ?
> Can you test the following ?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c682173a76424d7dadcc8374aa5b11dff44a4b46..7a5a7f1a4b7c3cbd105ecfc076377f25929729eb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5842,6 +5842,21 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
>  	return ret;
>  }
>
> +#ifdef CONFIG_RPS
> +static bool net_must_use_backlog(int tcpu)
> +{
> +	if (tcpu < 0)
> +		return false;
> +	if (tcpu != smp_processor_id())
> +		return true;
> +	/* target cpu is ourself. We must use our backlog
> +	 * if we have deferred IPI or packets.
> +	 */
> +	return this_cpu_read(softnet_data.rps_ipi_list) != NULL ||
> +	       this_cpu_read(softnet_data.input_pkt_queue.qlen) != 0;
> +}
> +#endif
Hi Eric,
Looking at this patch again, I am wondering why the tcpu < 0 case is
treated differently from the tcpu == smp_processor_id() case. If I
understand correctly, packets without a CPU flow hint are allowed to
be processed immediately on the current CPU. They will leapfrog
packets that other CPUs have queued onto this CPU via RPS and any RPS
IPIs waiting to be issued to other CPUs. I see this is the behavior in
the current code too, but why don't the same concerns about higher
latency for packets steered cross-CPU apply?
Thanks,
Caleb