Message-ID: <20250521085851.GQ365796@horms.kernel.org>
Date: Wed, 21 May 2025 09:58:51 +0100
From: Simon Horman <horms@...nel.org>
To: Krishna Kumar <krikku@...il.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, anthony.l.nguyen@...el.com,
	przemyslaw.kitszel@...el.com, edumazet@...gle.com,
	intel-wired-lan@...ts.osuosl.org, andrew+netdev@...n.ch,
	kuba@...nel.org, pabeni@...hat.com, sridhar.samudrala@...el.com,
	ahmed.zaki@...el.com, krishna.ku@...pkart.com
Subject: Re: [PATCH v2 net] net: ice: Perform accurate aRFS flow match

On Tue, May 20, 2025 at 10:36:56PM +0530, Krishna Kumar wrote:
> This patch fixes an issue seen in a large-scale deployment under heavy
> incoming packet load, where aRFS wrongly matches a flow and reprograms the
> NIC with wrong settings. That mis-steering causes RX-path latency spikes
> and noisy-neighbor effects when many connections collide on the same
> hash (some of our production servers have 20-30K connections).
> 
> set_rps_cpu() calls ndo_rx_flow_steer() with a flow_id that is calculated
> by hashing the skb and masking the result to the per-rx-queue table size.
> This results in multiple connections (even across different rx-queues)
> getting the same hash value. The driver's steer function then modifies the
> wrong flow to use this rx-queue, e.g. Flow#1 is added first:
>     Flow#1:  <ip1, port1, ip2, port2>, Hash 'h', q#10
> 
> Later when a new flow needs to be added:
>     Flow#2:  <ip3, port3, ip4, port4>, Hash 'h', q#20
> 
> The driver finds the hash 'h' from Flow#1 and updates it to use q#20. This
> leaves both flows un-optimized: packets for Flow#1 go to q#20 and the flow
> is later reprogrammed back to q#10, and so on; Flow#2 is never programmed,
> as Flow#1 is matched first on every miss. Many flows may wrongly share the
> same hash, each reprogramming the original flow's rule with its own q#
> (see the sketch below).
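>
> To illustrate the collision, here is a minimal userspace sketch (not the
> driver code; the hash function, tuples, and table size are made up purely
> for illustration):
>
>     #include <stddef.h>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     struct tuple {
>             uint32_t saddr, daddr;
>             uint16_t sport, dport;
>     };
>
>     /* Stand-in for the skb hash; any 32-bit hash shows the effect. */
>     static uint32_t toy_hash(const struct tuple *t)
>     {
>             const uint8_t *p = (const uint8_t *)t;
>             uint32_t h = 2166136261u;
>             size_t i;
>
>             for (i = 0; i < sizeof(*t); i++)
>                     h = (h ^ p[i]) * 16777619u;
>             return h;
>     }
>
>     int main(void)
>     {
>             /* rps_flow_cnt = 512: flow_id is the hash masked to 9 bits,
>              * so unrelated flows easily end up with the same flow_id.
>              */
>             const uint32_t table_mask = 512 - 1;
>             struct tuple f1 = { 0x0a000001, 0x0a000002, 1111, 80 };
>             struct tuple f2 = { 0x0a000003, 0x0a000004, 2222, 443 };
>
>             printf("flow_id #1 = %u\n",
>                    (unsigned int)(toy_hash(&f1) & table_mask));
>             printf("flow_id #2 = %u\n",
>                    (unsigned int)(toy_hash(&f2) & table_mask));
>             return 0;
>     }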
> 
> Tested on two 144-core servers with 16K netperf sessions for 180s. Netperf
> clients are pinned to cores 0-71 sequentially (so that wrongly steered
> packets on q#s 72-143 can be measured). IRQs are mapped 1:1 from queues to
> CPUs, XPS is enabled, and aRFS is enabled (global value is 144 *
> rps_flow_cnt).
> 
> Test notes about results from ice_rx_flow_steer():
> ---------------------------------------------------
> 1. "Skip:" counter increments here:
>     if (fltr_info->q_index == rxq_idx ||
> 	arfs_entry->fltr_state != ICE_ARFS_ACTIVE)
> 	    goto out;
> 2. "Add:" counter increments here:
>     ret = arfs_entry->fltr_info.fltr_id;
>     INIT_HLIST_NODE(&arfs_entry->list_entry);
> 3. "Update:" counter increments here:
>     /* update the queue to forward to on an already existing flow */
> 
> Runtime comparison of the original code vs. the patched code for different
> rps_flow_cnt values (in each cell, the left value is the original code and
> the right value is with the patch).
> 
> +-------------------------------+--------------+--------------+
> | rps_flow_cnt                  |      512     |    2048      |
> +-------------------------------+--------------+--------------+
> | Ratio of Pkts on Good:Bad q's | 214 vs 822K  | 1.1M vs 980K |
> | Avoid wrong aRFS programming  | 0 vs 310K    | 0 vs 30K     |
> | CPU User                      | 216 vs 183   | 216 vs 206   |
> | CPU System                    | 1441 vs 1171 | 1447 vs 1320 |
> | CPU Softirq                   | 1245 vs 920  | 1238 vs 961  |
> | CPU Total                     | 29 vs 22.7   | 29 vs 24.9   |
> | aRFS Update                   | 533K vs 59   | 521K vs 32   |
> | aRFS Skip                     | 82M vs 77M   | 7.2M vs 4.5M |
> +-------------------------------+--------------+--------------+
> 
> Separate TCP_STREAM and TCP_RR tests with 1, 4, 8, 16, 64, 128, 256, and
> 512 connections showed no performance degradation.
> 
> Some points on the patch/aRFS behavior:
> 1. Enabling full tuple matching ensures flows are always correctly matched,
>    even with smaller hash sizes (see the sketch after this list).
> 2. 5-6% drop in CPU utilization, as packets arrive at the correct CPUs and
>    there are fewer calls to the driver for programming on misses.
> 3. Larger hash tables reduce mis-steering due to more unique flow hashes,
>    but clashes still occur. However, with a larger per-device rps_flow_cnt,
>    old flows take more time to expire, and new aRFS flows cannot be added
>    if h/w limits are reached (rps_may_expire_flow() succeeds only after
>    10*rps_flow_cnt packets that are not part of the flow have been
>    processed by this CPU).
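>
> A rough sketch of the idea behind full-tuple matching (illustrative only;
> the structure and field names here are hypothetical, not the actual ice
> driver code):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     struct flow_tuple {
>             uint32_t saddr, daddr;
>             uint16_t sport, dport;
>     };
>
>     struct arfs_entry {
>             uint32_t flow_id;          /* masked skb hash */
>             struct flow_tuple tuple;   /* keys the filter was added with */
>             uint16_t rxq_idx;          /* queue the filter steers to */
>     };
>
>     static bool tuple_eq(const struct flow_tuple *a,
>                          const struct flow_tuple *b)
>     {
>             return a->saddr == b->saddr && a->daddr == b->daddr &&
>                    a->sport == b->sport && a->dport == b->dport;
>     }
>
>     /* Reuse (and possibly re-queue) an existing entry only if both the
>      * flow_id and the full tuple match; a flow_id-only match may be a
>      * hash collision with an unrelated flow and must not be reprogrammed.
>      */
>     static bool arfs_entry_matches(const struct arfs_entry *e,
>                                    uint32_t flow_id,
>                                    const struct flow_tuple *pkt)
>     {
>             return e->flow_id == flow_id && tuple_eq(&e->tuple, pkt);
>     }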
> 
> Changes since v1:
>   - Added "Fixes:" tag and documented return values.
>   - Added @ for function parameters.
>   - Updated subject line to denote target tree (net)

Thanks for the updates, much appreciated.

I don't think it is necessary to repost because of this, but for future
reference, these days it is preferred to place change information, such as
the list immediately above, below the scissors ("---"). That way it is
still visible to reviewers and appears in mailing list archives and so on,
but it is omitted from git history, as there the commit message is
truncated at the scissors.
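
For reference, the usual layout of such a posting is roughly as follows
(purely illustrative, contents abridged):

    Subject: [PATCH vN net] net: ice: ...

    Commit message that should end up in git history.

    Fixes: ...
    Signed-off-by: ...
    ---
    Changes since vN-1:
      - ...

    diffstat and the diff itself follow here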

> Fixes: 28bf26724fdb0 ("ice: Implement aRFS")
> Signed-off-by: Krishna Kumar <krikku@...il.com>

Reviewed-by: Simon Horman <horms@...nel.org>

