Message-ID: <CAHCEFEwToeQe_Ey8e=sf8fOmoobvrDCPsxw+hfUSoRawPX03+Q@mail.gmail.com>
Date: Wed, 23 Jul 2025 20:04:02 +0800
From: Hc Zheng <zhenghc00@...il.com>
To: andrew+netdev@...n.ch, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Leon Romanovsky <leon@...nel.org>, laoar.shao@...il.com, yc1082463@...il.com
Cc: netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [RFC] problems with RFS on bRPC applications
Hi all,
I have tried to enable ARFS on a Mellanox CX-6 Ethernet card. It
works fine for simple workloads and benchmarks, but when running a
BRPC (https://github.com/apache/brpc) workload on a two-NUMA-node
machine, performance degrades significantly. After some tracing, I
identified the following workload patterns that ARFS/RFS fails to
handle efficiently:
- The workload has multiple threads that use epoll and read from the
same socket, which may cause the flow entry in sock_flow_table to be
updated frequently.
- The threads reading from the socket also migrate frequently between CPUs.
With these patterns, the flow entry is updated very frequently, which
causes severe lock contention on arfs->arfs_lock in
mlx5e_rx_flow_steer(). As a result, network packets are not handled
in a timely manner.
Here are the cases where we want to enable ARFS/RFS:
- We want to ensure that flows belonging to different containers do
not interfere with each other. Our goal is for the flows to be steered
to the appropriate container’s CPUs.
- In the case of BRPC, the original RFS/ARFS logic does not help, so
we aim to steer the flows to the CPUs running the container, keeping
them as close and as balanced as possible.
One simple solution I came up with is to add another mode alongside
RFS, e.g. a variant of rps_record_sock_flow that records at a fixed
interval to avoid frequent updates, or an interface that allows
userspace to dynamically steer flows. This mode would steer flows to
CPUs within the target container's CPU set, providing some load
balancing and locality.
I have written some simple PoC code for this. After applying it in
production, we noticed the following performance changes:
- Cross NUMA memory bandwidth: 13GB → 9GB
- Pod system busy: 7.2% → 6.8%
- CPU PSI: 14ms → 12ms
However, we also noticed that some RX queues receive more flows than
others, since this code does not implement load balancing.
I am writing this email to request suggestions from netdev developers.
Additionally, for the Mellanox folks, are there any plans to refine
arfs->arfs_lock in mlx5e_rx_flow_steer()?
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a04fbf72476..1df7e125c61f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@
 #include <asm/byteorder.h>
 #include <asm/local.h>
+#include <linux/cpumask.h>
 #include <linux/percpu.h>
 #include <linux/rculist.h>
 #include <linux/workqueue.h>
@@ -753,15 +754,21 @@ static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
 	if (table && hash) {
 		unsigned int index = hash & table->mask;
 		u32 val = hash & ~rps_cpu_mask;
+		u32 old = READ_ONCE(table->ents[index]);
+
+		if (likely((old & ~rps_cpu_mask) == val &&
+			   cpumask_test_cpu(old & rps_cpu_mask, current->cpus_ptr))) {
+			return;
+		}
 
 		/* We only give a hint, preemption can change CPU under us */
 		val |= raw_smp_processor_id();
 
 		/* The following WRITE_ONCE() is paired with the READ_ONCE()
 		 * here, and another one in get_rps_cpu().
 		 */
-		if (READ_ONCE(table->ents[index]) != val)
+		if (old != val) {
 			WRITE_ONCE(table->ents[index], val);
+		}
 	}
 }
Best Regards
Huaicheng Zheng