[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADUfDZqcd_2+409_4GGhbRwW8gYHtZSU1vE1eNuE=jycoNMMJA@mail.gmail.com>
Date: Mon, 4 Nov 2024 10:42:01 -0800
From: Caleb Sander <csander@...estorage.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU
On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@...estorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > <csander@...estorage.com> wrote:
> > > > >
> > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > the spinlocks involved incur significant CPU overhead.
> > > > >
> > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > current CPU.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > > >
> > > > Current implementation was a conscious choice. This has been discussed
> > > > several times.
> > > >
> > > > By processing packets inline, you are actually increasing latencies of
> > > > packets queued to other cpus.
> > >
> > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > referring to threads like
> > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > ? I see what you mean about the latency penalty for packets that do
> > > require cross-CPU steering.
> > >
> > > Do you have an alternate suggestion for how to avoid the overhead of
> > > acquiring a spinlock for every packet? The atomic instruction in
> > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > of our CPU time. For our use case, we don't really want software RPS;
> > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > desired CPUs. We would happily turn off software RPS entirely if we
> > > could, which seems like it would avoid the concerns about higher
> > > latency for packets that need to be steering to a different CPU. But
> > > my understanding is that using ARFS requires RPS to be enabled
> > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > queue), which enables these rps_needed static branches. Is that
> > > correct? If so, would you be open to adding a sysctl that disables
> > > software RPS and relies upon ARFS to do the packet steering?
> >
> > A sysctl will not avoid the fundamental issue.
>
> Sorry if my suggestion was unclear. I mean that we would ideally like
> to use only hardware ARFS for packet steering, and disable software
> RPS.
> In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> application has long-lived TCP sockets, each processed on a single
> thread affinitized to one of the interrupt CPUs for the Ethernet
> device.) In the off chance that ARFS doesn't steer the packet to the
> correct CPU, we would rather just process it on the CPU that receives
> it instead of going through the RPS queues. If software RPS is never
> used, then there wouldn't be any concerns about higher latency for
> RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> computation is also not cheap, so it would be nice to skip it too.
> Basically, we want to program ARFS but skip these
> static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> way to do that currently. (Please let me know if it's already possible
> to do that.)
I see now that get_rps_cpu() still needs to be called even if software
RPS isn't going to be used, because it's also responsible for calling
set_rps_cpu() to program ARFS. So looks like avoiding the calls to
get_rps_cpu() entirely is not possible.
> > Why not instead address the past feedback ?
> > Can you test the following ?
>
> Sure, I will test the performance on our setup with this patch. Still,
> we would prefer to skip the get_rps_cpu() computation and these extra
> checks, and just process the packets on the CPUs they arrive on.
Hi Eric,
I tried out your patch and it seems to work equally well at
eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
added checks contribute a bit of overhead in
netif_receive_skb_list_internal(): it has increased from 0.15% to
0.28% of CPU time. But that's definitely worth it to remove the 5%
hotspot acquiring the RPS spinlock.
Feel free to add:
Tested-by: Caleb Sander Mateos <csander@...estorage.com>
Some minor comments on the patch:
- Probably should be using either skb_queue_len_lockless() or
skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
the RPS lock is not held at this point
- Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
a single this_cpu_ptr(&softnet_data) call
Thanks,
Caleb
Powered by blists - more mailing lists