[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CADUfDZr-xuruCjJwebWk2SUAq9-pCDLDQ1HHpQTOPnK3BAXg=g@mail.gmail.com>
Date: Mon, 4 Nov 2024 11:18:18 -0800
From: Caleb Sander <csander@...estorage.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU
On Mon, Nov 4, 2024 at 10:58 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@...estorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@...estorage.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > > <csander@...estorage.com> wrote:
> > > > > > >
> > > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > > >
> > > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > > current CPU.
> > > > > > >
> > > > > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > > > > >
> > > > > > Current implementation was a conscious choice. This has been discussed
> > > > > > several times.
> > > > > >
> > > > > > By processing packets inline, you are actually increasing latencies of
> > > > > > packets queued to other cpus.
> > > > >
> > > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > > referring to threads like
> > > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > > ? I see what you mean about the latency penalty for packets that do
> > > > > require cross-CPU steering.
> > > > >
> > > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > > could, which seems like it would avoid the concerns about higher
> > > > > latency for packets that need to be steering to a different CPU. But
> > > > > my understanding is that using ARFS requires RPS to be enabled
> > > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > > queue), which enables these rps_needed static branches. Is that
> > > > > correct? If so, would you be open to adding a sysctl that disables
> > > > > software RPS and relies upon ARFS to do the packet steering?
> > > >
> > > > A sysctl will not avoid the fundamental issue.
> > >
> > > Sorry if my suggestion was unclear. I mean that we would ideally like
> > > to use only hardware ARFS for packet steering, and disable software
> > > RPS.
> > > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > > application has long-lived TCP sockets, each processed on a single
> > > thread affinitized to one of the interrupt CPUs for the Ethernet
> > > device.) In the off chance that ARFS doesn't steer the packet to the
> > > correct CPU, we would rather just process it on the CPU that receives
> > > it instead of going through the RPS queues. If software RPS is never
> > > used, then there wouldn't be any concerns about higher latency for
> > > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > > computation is also not cheap, so it would be nice to skip it too.
> > > Basically, we want to program ARFS but skip these
> > > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > > way to do that currently. (Please let me know if it's already possible
> > > to do that.)
> >
> > I see now that get_rps_cpu() still needs to be called even if software
> > RPS isn't going to be used, because it's also responsible for calling
> > set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> > get_rps_cpu() entirely is not possible.
>
> Yes, this is the reason. Things would be different if ARFS was done at
> dev_queue_xmit() time.
>
> >
> > > > Why not instead address the past feedback ?
> > > > Can you test the following ?
> > >
> > > Sure, I will test the performance on our setup with this patch. Still,
> > > we would prefer to skip the get_rps_cpu() computation and these extra
> > > checks, and just process the packets on the CPUs they arrive on.
> >
> > Hi Eric,
> > I tried out your patch and it seems to work equally well at
> > eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> > added checks contribute a bit of overhead in
> > netif_receive_skb_list_internal(): it has increased from 0.15% to
> > 0.28% of CPU time. But that's definitely worth it to remove the 5%
> > hotspot acquiring the RPS spinlock.
> >
> > Feel free to add:
> > Tested-by: Caleb Sander Mateos <csander@...estorage.com>
> >
> > Some minor comments on the patch:
> > - Probably should be using either skb_queue_len_lockless() or
> > skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> > the RPS lock is not held at this point
>
> this_cpu_read() has implicit READ_ONCE() semantic.
Okay. Seems like it would be nicer not to reach into the sk_buff_head
internals, but up to you.
>
> > - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> > a single this_cpu_ptr(&softnet_data) call
>
> Double check the generated assembly first :)
>
> this_cpu_read() is faster than going through
> this_cpu_ptr(&softnet_data), at least on x86,
> thanks to %gs: prefix.
>
> No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).
Got it, thanks.
Powered by blists - more mailing lists