linux-kernel - Re: [PATCH] net: skip RPS if packet is already on target CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJEL7=q_tYXPbwUGF4MoX=W0AOoevMg31Y=nH7WyofaiA@mail.gmail.com>
Date: Mon, 4 Nov 2024 19:58:24 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Caleb Sander <csander@...estorage.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU

On Mon, Nov 4, 2024 at 7:42 PM Caleb Sander <csander@...estorage.com> wrote:
>
> On Wed, Oct 30, 2024 at 1:26 PM Caleb Sander <csander@...estorage.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
> > > >
> > > > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > > > <csander@...estorage.com> wrote:
> > > > > >
> > > > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > > > the spinlocks involved incur significant CPU overhead.
> > > > > >
> > > > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > > > current CPU.
> > > > > >
> > > > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > > > >
> > > > > Current implementation was a conscious choice. This has been discussed
> > > > > several times.
> > > > >
> > > > > By processing packets inline, you are actually increasing latencies of
> > > > > packets queued to other cpus.
> > > >
> > > > Sorry, I wasn't aware of these prior discussions. I take it you are
> > > > referring to threads like
> > > > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > > > ? I see what you mean about the latency penalty for packets that do
> > > > require cross-CPU steering.
> > > >
> > > > Do you have an alternate suggestion for how to avoid the overhead of
> > > > acquiring a spinlock for every packet? The atomic instruction in
> > > > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > > > of our CPU time. For our use case, we don't really want software RPS;
> > > > we are expecting ARFS to steer all high-bandwidth traffic to the
> > > > desired CPUs. We would happily turn off software RPS entirely if we
> > > > could, which seems like it would avoid the concerns about higher
> > > > latency for packets that need to be steering to a different CPU. But
> > > > my understanding is that using ARFS requires RPS to be enabled
> > > > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > > > queue), which enables these rps_needed static branches. Is that
> > > > correct? If so, would you be open to adding a sysctl that disables
> > > > software RPS and relies upon ARFS to do the packet steering?
> > >
> > > A sysctl will not avoid the fundamental issue.
> >
> > Sorry if my suggestion was unclear. I mean that we would ideally like
> > to use only hardware ARFS for packet steering, and disable software
> > RPS.
> > In our testing, ARFS reliably steers packets to the desired CPUs. (Our
> > application has long-lived TCP sockets, each processed on a single
> > thread affinitized to one of the interrupt CPUs for the Ethernet
> > device.) In the off chance that ARFS doesn't steer the packet to the
> > correct CPU, we would rather just process it on the CPU that receives
> > it instead of going through the RPS queues. If software RPS is never
> > used, then there wouldn't be any concerns about higher latency for
> > RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
> > computation is also not cheap, so it would be nice to skip it too.
> > Basically, we want to program ARFS but skip these
> > static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
> > way to do that currently. (Please let me know if it's already possible
> > to do that.)
>
> I see now that get_rps_cpu() still needs to be called even if software
> RPS isn't going to be used, because it's also responsible for calling
> set_rps_cpu() to program ARFS. So looks like avoiding the calls to
> get_rps_cpu() entirely is not possible.

Yes, this is the reason. Things would be different if ARFS was done at
dev_queue_xmit() time.

>
> > > Why not instead address the past feedback ?
> > > Can you test the following ?
> >
> > Sure, I will test the performance on our setup with this patch. Still,
> > we would prefer to skip the get_rps_cpu() computation and these extra
> > checks, and just process the packets on the CPUs they arrive on.
>
> Hi Eric,
> I tried out your patch and it seems to work equally well at
> eliminating the process_backlog() -> _raw_spin_lock_irq() hotspot. The
> added checks contribute a bit of overhead in
> netif_receive_skb_list_internal(): it has increased from 0.15% to
> 0.28% of CPU time. But that's definitely worth it to remove the 5%
> hotspot acquiring the RPS spinlock.
>
> Feel free to add:
> Tested-by: Caleb Sander Mateos <csander@...estorage.com>
>
> Some minor comments on the patch:
> - Probably should be using either skb_queue_len_lockless() or
> skb_queue_empty_lockless() to check input_pkt_queue.qlen != 0 because
> the RPS lock is not held at this point

this_cpu_read() has implicit READ_ONCE() semantic.

> - Could consolidate the 2 this_cpu_read(softnet_data...) lookups into
> a single this_cpu_ptr(&softnet_data) call

Double check the generated assembly first :)

this_cpu_read() is faster than going through
this_cpu_ptr(&softnet_data), at least on x86,
thanks to %gs: prefix.

No temporary register is needed to compute and hold this_cpu_ptr(&softnet_data).