netdev - Re: [PATCH] net: ice: Perform accurate aRFS flow match

Message-ID: <4068bd0c-d613-483f-8975-9cde1c6074d6@intel.com>
Date: Tue, 20 May 2025 08:47:20 -0600
From: Ahmed Zaki <ahmed.zaki@...el.com>
To: Krishna Kumar <krikku@...il.com>, <netdev@...r.kernel.org>
CC: <davem@...emloft.net>, <anthony.l.nguyen@...el.com>,
	<przemyslaw.kitszel@...el.com>, <edumazet@...gle.com>,
	<intel-wired-lan@...ts.osuosl.org>, <andrew+netdev@...n.ch>,
	<kuba@...nel.org>, <pabeni@...hat.com>, <sridhar.samudrala@...el.com>,
	<krishna.ku@...pkart.com>
Subject: Re: [PATCH] net: ice: Perform accurate aRFS flow match



On 2025-05-19 11:02 p.m., Krishna Kumar wrote:
> This patch fixes an issue seen in a large-scale deployment under heavy
> incoming pkts where the aRFS flow wrongly matches a flow and reprograms the
> NIC with wrong settings. That mis-steering causes RX-path latency spikes
> and noisy neighbor effects when many connections collide on the same hash
> (some of our production servers have 20-30K connections).
> 
> set_rps_cpu() calls ndo_rx_flow_steer() with flow_id that is calculated by
> hashing the skb sized by the per rx-queue table size. This results in
> multiple connections (even across different rx-queues) getting the same
> hash value. The driver steer function modifies the wrong flow to use this
> rx-queue, e.g.:
>      Flow#1 is first added:
> 	    Flow#1:  <ip1, port1, ip2, port2>, Hash 'h', q#10
>      Later when a new flow needs to be added:
> 	    Flow#2:  <ip3, port3, ip4, port4>, Hash 'h', q#20

Please add an empty line here.

> The driver finds the hash 'h' from Flow#1 and updates it to use q#20. This
> results in both flows getting un-optimized - packets for Flow#1 go to
> q#20, and it is then reprogrammed back to q#10 later and so on; and Flow#2
> programming is never done as Flow#1 is matched first for all misses. Many
> flows may wrongly share the same hash and reprogram rules of the original
> flow each with their own q#.
> 
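The collision described above can be reproduced in a small user-space sketch. It is only an illustration: the demo_* names are hypothetical, the table is a plain array rather than the driver's hash list, and it assumes the flow id is the packet hash masked to a power-of-two table size (the kernel's exact reduction may differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the ice driver. */
struct demo_filter {
	uint32_t flow_id;	/* packet hash reduced to the table size */
	uint16_t queue;		/* Rx queue this filter steers to */
	int in_use;
};

/* Assumption: flow id = hash masked to a power-of-two table size. */
static uint32_t demo_flow_id(uint32_t hash, uint32_t table_size)
{
	assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
	return hash & (table_size - 1);
}

/*
 * Steer by flow id only, modelling the pre-patch behaviour described in the
 * commit message: any in-use entry with the same id is retargeted, no matter
 * which connection created it.
 * Returns 1 if an existing entry was updated, 0 if a new entry was added,
 * -1 if the table is full.
 */
static int demo_steer_by_id(struct demo_filter *tbl, size_t n,
			    uint32_t flow_id, uint16_t queue)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].in_use && tbl[i].flow_id == flow_id) {
			tbl[i].queue = queue;	/* may clobber another flow */
			return 1;
		}
	}

	for (i = 0; i < n; i++) {
		if (!tbl[i].in_use) {
			tbl[i].flow_id = flow_id;
			tbl[i].queue = queue;
			tbl[i].in_use = 1;
			return 0;
		}
	}

	return -1;
}
```

With a table size of 512, the hashes 0x1234 and 0x5234 both reduce to flow id 0x34, so steering the second flow to q#20 retargets the first flow's entry instead of adding a new one.
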
> Tested on two 144-core servers with 16K netperf sessions for 180s. Netperf
> clients are pinned to cores 0-71 sequentially (so that wrong packets on q#s
> 72-143 can be measured). IRQs are set 1:1 for queues -> CPUs, with XPS and
> aRFS enabled (global value is 144 * rps_flow_cnt).
> 
> Test notes about results from ice_rx_flow_steer():
> ---------------------------------------------------
> 1. "Skip:" counter increments here:
>      if (fltr_info->q_index == rxq_idx ||
> 	arfs_entry->fltr_state != ICE_ARFS_ACTIVE)
> 	    goto out;
> 2. "Add:" counter increments here:
>      ret = arfs_entry->fltr_info.fltr_id;
>      INIT_HLIST_NODE(&arfs_entry->list_entry);
> 3. "Update:" counter increments here:
>      /* update the queue to forward to on an already existing flow */
> 
> - **rps_flow_cnt=512**
>    - Ratio of packets on good vs bad queues: 214 vs 822,392
>    - Avoid updating wrong aRFS filter: 0 vs 310,826
>    - CPU: User: 216 vs 183, System: 1441 vs 1171, Softirq: 1245 vs 920
>           Total: 29.02 vs 22.74
>    - aRFS Add: 6,078,551 vs 6,126,286, Update: 533,973 vs 59
>           Skip: 82,219,629 vs 77,088,191, Del: 6,078,409 vs 6,126,139
> 
> - **rps_flow_cnt=1024**
>    - Ratio of packets on good vs bad queues: 854 vs 1,003,176
>    - Avoid updating wrong aRFS filter: 0 vs 50,363
>    - CPU: User: 220 vs 206, System: 1460 vs 1322, Softirq: 1216 vs 1027
>           Total: 28.96 vs 25.55
>    - aRFS Add: 7,000,757 vs 7,499,586, Update: 504,371 vs 33
>           Skip: 27,455,269 vs 21,752,043, Del: 7,000,610 vs 7,499,438
> 
> - **rps_flow_cnt=2048**
>    - Ratio of packets on good vs bad queues: 1,173,756 vs 981,892
>    - Avoid updating wrong aRFS filter: 0 vs 30,145
>    - CPU: User: 216 vs 206, System: 1447 vs 1320, Softirq: 1238 vs 961
>           Total: 29.01 vs 24.87
>    - aRFS Add: 7,226,598 vs 6,960,991, Update: 521,264 vs 32
>           Skip: 7,236,716 vs 4,584,043, Del: 7,226,430 vs 6,960,798

Are these numbers with the patch applied? Can we get a w/o and with patch?

A table might be easier to read; also, maybe drop the
"rps_flow_cnt=1024" case. I think it is enough to show the min and max ones.

Also, please add instructions on how to get these values, so that the
validation team can replicate them.

> 
> A separate TCP_STREAM and TCP_RR with 1,4,8,16,64,128,256,512 connections
> showed no performance degradation.
> 
> Some points on the patch/aRFS behavior:
> 1. Enabling full tuple matching ensures flows are always correctly matched,
>     even with smaller hash sizes.
> 2. 5-6% drop in CPU utilization as the packets arrive at the correct CPUs
>     and fewer calls to driver for programming on misses.
> 3. Larger hash tables reduce mis-steering due to more unique flow hashes,
>     but still have clashes. However, with larger per-device rps_flow_cnt, old
>     flows take more time to expire and new aRFS flows cannot be added if h/w
>     limits are reached (rps_may_expire_flow() succeeds when 10*rps_flow_cnt
>     pkts have been processed by this cpu that are not part of the flow).
> 
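The expiry rule in point 3 can be written out as a short sketch. It assumes the threshold is exactly 10 * rps_flow_cnt packets, as stated above; demo_flow_may_expire() and its parameters are hypothetical names, not the kernel's rps_may_expire_flow().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper modelling the rule quoted above: a flow may expire
 * once the CPU's input queue head has advanced at least 10 * rps_flow_cnt
 * packets past the point where the flow was last seen.
 */
static int demo_flow_may_expire(uint32_t queue_head, uint32_t last_qtail,
				uint32_t rps_flow_cnt)
{
	int32_t processed;

	/* Keep the threshold representable as a signed 32-bit value. */
	assert(rps_flow_cnt <= INT32_MAX / 10);

	/* Unsigned subtraction, then a signed view, tolerates counter wrap. */
	processed = (int32_t)(queue_head - last_qtail);

	return processed >= (int32_t)(10 * rps_flow_cnt);
}
```

Under this assumption, rps_flow_cnt=512 needs 5,120 unrelated packets before an entry can be expired, while rps_flow_cnt=2048 needs 20,480, which is why stale filters linger longer with larger tables.
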
> Signed-off-by: Krishna Kumar <krikku@...il.com>
> ---
>   drivers/net/ethernet/intel/ice/ice_arfs.c | 45 +++++++++++++++++++++++
>   1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_arfs.c b/drivers/net/ethernet/intel/ice/ice_arfs.c
> index 2bc5c7f59844..b36bd189bd64 100644
> --- a/drivers/net/ethernet/intel/ice/ice_arfs.c
> +++ b/drivers/net/ethernet/intel/ice/ice_arfs.c
> @@ -377,6 +377,47 @@ ice_arfs_is_perfect_flow_set(struct ice_hw *hw, __be16 l3_proto, u8 l4_proto)
>   	return false;
>   }
>   
> +/**
> + * ice_arfs_cmp - Check if aRFS filter matches this flow.
> + * @fltr_info: filter info of the saved ARFS entry.
> + * @fk: flow dissector keys.
> + * @n_proto:  One of htons(IPv4) or htons(IPv6).
> + * @ip_proto: One of IPPROTO_TCP or IPPROTO_UDP.
> + *
> + * Since this function assumes limited values for n_proto and ip_proto, it
> + * is meant to be called only from ice_rx_flow_steer().
> + */
> +static bool
> +ice_arfs_cmp(const struct ice_fdir_fltr *fltr_info, const struct flow_keys *fk,
> +	     __be16 n_proto, u8 ip_proto)
> +{
> +	/*
> +	 * Determine if the filter is for IPv4 or IPv6 based on flow_type,
> +	 * which is one of ICE_FLTR_PTYPE_NONF_IPV{4,6}_{TCP,UDP}.
> +	 */
> +	bool is_v4 = fltr_info->flow_type == ICE_FLTR_PTYPE_NONF_IPV4_TCP ||
> +		     fltr_info->flow_type == ICE_FLTR_PTYPE_NONF_IPV4_UDP;
> +
> +	/* The checks below compare the cheapest and most discriminating
> +	 * fields first, so that a mismatch fails early.
> +	 */
> +	if (is_v4)
> +		return n_proto == htons(ETH_P_IP) &&
> +			fltr_info->ip.v4.src_port == fk->ports.src &&
> +			fltr_info->ip.v4.dst_port == fk->ports.dst &&
> +			fltr_info->ip.v4.src_ip == fk->addrs.v4addrs.src &&
> +			fltr_info->ip.v4.dst_ip == fk->addrs.v4addrs.dst &&
> +			fltr_info->ip.v4.proto == ip_proto;
> +
> +	return fltr_info->ip.v6.src_port == fk->ports.src &&
> +		fltr_info->ip.v6.dst_port == fk->ports.dst &&
> +		fltr_info->ip.v6.proto == ip_proto &&
> +		!memcmp(&fltr_info->ip.v6.src_ip, &fk->addrs.v6addrs.src,
> +			sizeof(struct in6_addr)) &&
> +		!memcmp(&fltr_info->ip.v6.dst_ip, &fk->addrs.v6addrs.dst,
> +			sizeof(struct in6_addr));
> +}
> +
>   /**
>    * ice_rx_flow_steer - steer the Rx flow to where application is being run
>    * @netdev: ptr to the netdev being adjusted
> @@ -448,6 +489,10 @@ ice_rx_flow_steer(struct net_device *netdev, const struct sk_buff *skb,
>   			continue;
>   
>   		fltr_info = &arfs_entry->fltr_info;
> +
> +		if (!ice_arfs_cmp(fltr_info, &fk, n_proto, ip_proto))
> +			continue;
> +
>   		ret = fltr_info->fltr_id;
>   
>   		if (fltr_info->q_index == rxq_idx ||


This seems similar to a patch I tried up-streaming before:

https://lore.kernel.org/netdev/20230407210820.3046220-1-anthony.l.nguyen@intel.com/

I am not sure why I did not pursue it further. If that is correct, then
obviously I have no objection to this patch.

The exact flow match will reduce (but not completely eliminate) the
chance that a packet may land on the wrong queue (since there is always a
chance of hash collisions in aRFS).
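As a rough user-space illustration of the exact match (IPv4 only; the demo_* types and functions are hypothetical and do not mirror the driver's structures), a lookup that also compares the tuple no longer returns another connection's entry when the ids collide:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4-only flow key; not the driver's ice_fdir_fltr layout. */
struct demo_tuple {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t ip_proto;
};

struct demo_entry {
	uint32_t flow_id;	/* reduced hash, may collide across flows */
	struct demo_tuple key;	/* full tuple saved when the filter was added */
	uint16_t queue;
};

/* Field-by-field compare, ports first so a mismatch fails early. */
static int demo_tuple_equal(const struct demo_tuple *a,
			    const struct demo_tuple *b)
{
	return a->src_port == b->src_port &&
	       a->dst_port == b->dst_port &&
	       a->src_ip == b->src_ip &&
	       a->dst_ip == b->dst_ip &&
	       a->ip_proto == b->ip_proto;
}

/* Return the entry whose id AND tuple both match, or NULL if none does. */
static struct demo_entry *demo_find(struct demo_entry *tbl, size_t n,
				    uint32_t flow_id,
				    const struct demo_tuple *key)
{
	size_t i;

	assert(key != NULL);

	for (i = 0; i < n; i++) {
		if (tbl[i].flow_id != flow_id)
			continue;
		if (!demo_tuple_equal(&tbl[i].key, key))
			continue;	/* same id, different connection */
		return &tbl[i];
	}

	return NULL;
}
```

A NULL result for a colliding id means the caller adds a separate entry for the new connection instead of retargeting the existing one.
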

Thank you.