Message-ID: <CAHCEFEwToeQe_Ey8e=sf8fOmoobvrDCPsxw+hfUSoRawPX03+Q@mail.gmail.com>
Date: Wed, 23 Jul 2025 20:04:02 +0800
From: Hc Zheng <zhenghc00@...il.com>
To: andrew+netdev@...n.ch, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Leon Romanovsky <leon@...nel.org>, laoar.shao@...il.com, yc1082463@...il.com
Cc: netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [RFC] problems with RFS on bRPC applications
Hi all,
I have tried to enable ARFS on a Mellanox CX-6 Ethernet card. It
works fine for simple workloads and benchmarks, but when running a
BRPC (https://github.com/apache/brpc) workload on a two-NUMA-node
machine, performance degrades significantly. After some tracing, I
identified the following workload patterns that ARFS/RFS fails to
handle efficiently:
- The workload has multiple threads that use epoll and read from the
same socket, which may cause the flow entry in sock_flow_table to be
updated frequently.
- The threads reading from the socket also migrate frequently between CPUs.
With these patterns, the flow entry is updated very frequently, which
causes severe lock contention on arfs->arfs_lock in
mlx5e_rx_flow_steer(). As a result, network packets are not handled
in a timely manner.
Here are the cases where we want to enable ARFS/RFS:
- We want to ensure that flows belonging to different containers do
not interfere with each other. Our goal is for the flows to be steered
to the appropriate container’s CPUs.
- In the case of BRPC, the original RFS/ARFS logic does not help, so
we aim to steer the flows to the CPUs running the container, keeping
them as close and as balanced as possible.
One simple solution I came up with is to add another mode alongside
RFS, e.g. a variant of rps_record_sock_flow that records at a fixed
interval to avoid frequent updates, or an interface that allows
userspace to dynamically steer flows. This mode would steer flows to
CPUs within the target container's CPU set, providing some load
balancing and locality.
I have written some simple PoC code for this. After applying it in
production, we noticed the following performance changes:
- Cross NUMA memory bandwidth: 13GB → 9GB
- Pod system busy: 7.2% → 6.8%
- CPU PSI: 14ms → 12ms
However, we also noticed that some RX queues receive more flows than
others, since this code does not implement load balancing.
I am writing this email to request suggestions from netdev developers.
Additionally, for the Mellanox folks, are there any plans to refine
arfs->arfs_lock in mlx5e_rx_flow_steer()?
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a04fbf72476..1df7e125c61f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@
 #include <asm/byteorder.h>
 #include <asm/local.h>
+#include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/rculist.h>
 #include <linux/workqueue.h>
@@ -753,15 +754,21 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 	if (table && hash) {
 		unsigned int index = hash & table->mask;
 		u32 val = hash & ~rps_cpu_mask;
+		u32 old = READ_ONCE(table->ents[index]);
+
+		if (likely((old & ~rps_cpu_mask) == val &&
+			   cpumask_test_cpu(old & rps_cpu_mask, current->cpus_ptr))) {
+			return;
+		}
 
 		/* We only give a hint, preemption can change CPU under us */
 		val |= raw_smp_processor_id();
 
 		/* The following WRITE_ONCE() is paired with the READ_ONCE()
 		 * here, and another one in get_rps_cpu().
 		 */
-		if (READ_ONCE(table->ents[index]) != val)
+		if (old != val) {
 			WRITE_ONCE(table->ents[index], val);
+		}
 	}
 }
Best Regards
Huaicheng Zheng