linux-kernel - Re: [PATCH] net: skip RPS if packet is already on target CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADUfDZpeudTGP5UZt6QqbrYkA+Twei7gGQa6hJ+iYwuZfyp9gw@mail.gmail.com>
Date: Wed, 30 Oct 2024 13:26:49 -0700
From: Caleb Sander <csander@...estorage.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: skip RPS if packet is already on target CPU

On Wed, Oct 30, 2024 at 5:55 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, Oct 29, 2024 at 9:38 PM Caleb Sander <csander@...estorage.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 7:27 PM Caleb Sander Mateos
> > > <csander@...estorage.com> wrote:
> > > >
> > > > If RPS is enabled, all packets with a CPU flow hint are enqueued to the
> > > > target CPU's input_pkt_queue and process_backlog() is scheduled on that
> > > > CPU to dequeue and process the packets. If ARFS has already steered the
> > > > packets to the correct CPU, this additional queuing is unnecessary and
> > > > the spinlocks involved incur significant CPU overhead.
> > > >
> > > > In netif_receive_skb_internal() and netif_receive_skb_list_internal(),
> > > > check if the CPU flow hint get_rps_cpu() returns is the current CPU. If
> > > > so, bypass input_pkt_queue and immediately process the packet(s) on the
> > > > current CPU.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > >
> > > Current implementation was a conscious choice. This has been discussed
> > > several times.
> > >
> > > By processing packets inline, you are actually increasing latencies of
> > > packets queued to other cpus.
> >
> > Sorry, I wasn't aware of these prior discussions. I take it you are
> > referring to threads like
> > https://lore.kernel.org/netdev/20230322072142.32751-1-xu.xin16@zte.com.cn/T/
> > ? I see what you mean about the latency penalty for packets that do
> > require cross-CPU steering.
> >
> > Do you have an alternate suggestion for how to avoid the overhead of
> > acquiring a spinlock for every packet? The atomic instruction in
> > rps_lock_irq_disable() called from process_backlog() is consuming 5%
> > of our CPU time. For our use case, we don't really want software RPS;
> > we are expecting ARFS to steer all high-bandwidth traffic to the
> > desired CPUs. We would happily turn off software RPS entirely if we
> > could, which seems like it would avoid the concerns about higher
> > latency for packets that need to be steering to a different CPU. But
> > my understanding is that using ARFS requires RPS to be enabled
> > (rps_sock_flow_entries set globally and rps_flow_cnt set on each
> > queue), which enables these rps_needed static branches. Is that
> > correct? If so, would you be open to adding a sysctl that disables
> > software RPS and relies upon ARFS to do the packet steering?
>
> A sysctl will not avoid the fundamental issue.

Sorry if my suggestion was unclear. I mean that we would ideally like
to use only hardware ARFS for packet steering, and disable software
RPS.
In our testing, ARFS reliably steers packets to the desired CPUs. (Our
application has long-lived TCP sockets, each processed on a single
thread affinitized to one of the interrupt CPUs for the Ethernet
device.) In the off chance that ARFS doesn't steer the packet to the
correct CPU, we would rather just process it on the CPU that receives
it instead of going through the RPS queues. If software RPS is never
used, then there wouldn't be any concerns about higher latency for
RPS-steered vs. non-RPS-steered packets, right? The get_rps_cpu()
computation is also not cheap, so it would be nice to skip it too.
Basically, we want to program ARFS but skip these
static_branch_unlikely(&rps_needed) branches. But I'm not aware of a
way to do that currently. (Please let me know if it's already possible
to do that.)

> Why not instead address the past feedback ?
> Can you test the following ?

Sure, I will test the performance on our setup with this patch. Still,
we would prefer to skip the get_rps_cpu() computation and these extra
checks, and just process the packets on the CPUs they arrive on.

Thanks for your help,
Caleb