Message-ID: <CANn89i+vhTeqgTPr+suupJNLMHp-RAX89aBrFhiQnu58233bAw@mail.gmail.com>
Date: Mon, 21 Jul 2025 01:14:20 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Krishna Kumar <krikku@...il.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, tom@...bertland.com,
kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, sdf@...ichev.me,
kuniyu@...gle.com, ahmed.zaki@...el.com, aleksander.lobakin@...el.com,
atenart@...nel.org, jdamato@...tly.com, krishna.ku@...pkart.com
Subject: Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for
active flows
On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@...il.com> wrote:
>
> This patch fixes an issue where two different flows on the same RX queue
> produce the same hash, resulting in continuous flow overwrites.
>
> Flow #1: A packet for Flow #1 comes in, and the kernel calls the
>          steering function. The driver returns a filter id, which the
>          kernel saves in the selected slot. Later, the driver's service
>          task checks whether any filters have expired and then installs
>          the rule for Flow #1.
> Flow #2: A packet for Flow #2 comes in. It goes through the same steps,
>          but this time the chosen slot is already being used by Flow #1.
>          The driver returns a new filter id and the kernel saves it in
>          the same slot. When the driver's service task runs, it walks
>          all the flows and asks whether Flow #1 can be expired; the
>          kernel returns true because the slot now holds a different
>          filter id (see the sketch after this walkthrough), so the
>          driver removes Flow #1's rule and installs the rule for Flow #2.
> Flow #1: Another packet for Flow #1 comes in and the same thing repeats:
>          the slot is overwritten with a new filter id for Flow #1.
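>
> For reference, this is roughly how rps_may_expire_flow() decides that a
> slot is stale (paraphrased and simplified from net/core/dev.c, not a
> verbatim copy): as soon as the slot's filter id no longer matches the
> one the driver passes in, the flow is reported as expired even though
> the connection that owned the old filter is still alive.
>
>     bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
>                              u32 flow_id, u16 filter_id)
>     {
>             struct netdev_rx_queue *rxqueue = dev->_rx + rxq_index;
>             struct rps_dev_flow_table *flow_table;
>             struct rps_dev_flow *rflow;
>             bool expire = true;
>             unsigned int cpu;
>
>             rcu_read_lock();
>             flow_table = rcu_dereference(rxqueue->rps_flow_table);
>             if (flow_table && flow_id <= flow_table->mask) {
>                     rflow = &flow_table->flows[flow_id];
>                     cpu = READ_ONCE(rflow->cpu);
>
>                     /* Still active only if the stored filter id matches
>                      * AND the target CPU's backlog has not advanced too
>                      * far past this flow's last enqueue point.
>                      */
>                     if (rflow->filter == filter_id && cpu < nr_cpu_ids &&
>                         ((int)(per_cpu(softnet_data, cpu).input_queue_head -
>                                READ_ONCE(rflow->last_qtail)) <
>                          (int)(10 * flow_table->mask)))
>                             expire = false;
>             }
>             rcu_read_unlock();
>             return expire;
>     }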
>
> This causes a repeated cycle of flow programming for missed packets,
> wasting CPU cycles while not improving performance. The problem happens
> more often when the RPS table is small, but tests show it still happens
> even with 12,000 connections and an RPS size of 16K per queue (global
> table of 144 queues x 16K entries).
>
> This patch prevents overwriting an rps_dev_flow entry while it is active.
> The idea is that it is better to keep aRFS working for the first flow
> than to hurt every flow that shares the hash. Without this, two (or
> more) flows on one RX queue with the same hash keep overwriting each
> other, causing the driver to reprogram the flow repeatedly.
>
> Changes:
> 1. Add a new 'hash' field to struct rps_dev_flow.
> 2. Add rps_flow_is_active(): a helper to check whether a flow is active,
>    extracted from rps_may_expire_flow() (sketched below, after this list).
> 3. In set_rps_cpu():
>    - Avoid overwrites: only program a new filter (and take over the
>      slot) if:
>      - the slot is not in use, or
>      - the slot is in use but the flow is not active, or
>      - the slot holds an active flow with the same hash but a different
>        target CPU.
>    - Save the hash in the rps_dev_flow entry.
> 4. rps_may_expire_flow(): use the rps_flow_is_active() helper extracted
>    in 2.
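>
> An illustrative sketch of items 2 and 3 (names as used in this
> changelog; variables such as next_cpu, rxq_index and rc come from
> set_rps_cpu()'s local context; the exact implementation lives in the
> net/core/dev.c hunks of this patch and may differ in detail):
>
>     /* Extracted from rps_may_expire_flow(): a slot is active if its
>      * target CPU is valid and that CPU's backlog has not advanced more
>      * than ~10 * table-size packets past the flow's last enqueue.
>      */
>     static bool rps_flow_is_active(struct rps_dev_flow *rflow,
>                                    struct rps_dev_flow_table *flow_table,
>                                    unsigned int cpu)
>     {
>             return cpu < nr_cpu_ids &&
>                    ((int)(per_cpu(softnet_data, cpu).input_queue_head -
>                           READ_ONCE(rflow->last_qtail)) <
>                     (int)(10 * flow_table->mask));
>     }
>
>     /* In set_rps_cpu(), before calling ndo_rx_flow_steer(): keep the
>      * already-programmed flow unless the slot is free, inactive, or
>      * holds this same flow (same hash) moving to another CPU.
>      */
>     u32 hash = skb_get_hash(skb);
>     u32 flow_id = hash & flow_table->mask;
>     struct rps_dev_flow *rflow = &flow_table->flows[flow_id];
>     u32 old_cpu = READ_ONCE(rflow->cpu);
>
>     if (rflow->filter != RPS_NO_FILTER &&
>         rps_flow_is_active(rflow, flow_table, old_cpu) &&
>         (hash != rflow->hash || next_cpu == old_cpu))
>             goto out;               /* do not overwrite the active flow */
>
>     rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb, rxq_index, flow_id);
>     if (rc >= 0) {
>             WRITE_ONCE(rflow->filter, rc);
>             rflow->hash = hash;
>     }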
>
> Testing & results:
> - Driver: ice (E810 NIC), Kernel: net-next
> - #CPUs = #RXq = 144 (1:1)
> - Number of flows: 12K
> - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
> it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
> global table slots)
> - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
> - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
> - Metrics captured on client
>
> Legend for following tables:
> Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
> Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
> Add: #times driver actually programmed aRFS (ice_arfs_build_entry())
> Del: #times driver deleted the flow (ice_arfs_del_flow_rules())
> Units: K = 1,000 times, M = 1 million times
>
> |-------|---------|------|     Org Code    |---------|---------|
> | RPS | Latency | CPU | Add | Del | Steer-C | Steer-L |
> |-------|---------|------|--------|--------|---------|---------|
> | 256 | 227.0 | 93.2 | 1.6M | 1.6M | 121.7M | 267.6M |
> | 512 | 225.9 | 94.1 | 11.5M | 11.2M | 65.7M | 199.6M |
> | 1024 | 223.5 | 95.6 | 16.5M | 16.5M | 27.1M | 187.3M |
> | 2048 | 222.2 | 96.3 | 10.5M | 10.5M | 12.5M | 115.2M |
> | 4096 | 223.9 | 94.1 | 5.5M | 5.5M | 7.2M | 65.9M |
> | 8192 | 224.7 | 92.5 | 2.7M | 2.7M | 3.0M | 29.9M |
> | 16384 | 223.5 | 92.5 | 1.3M | 1.3M | 1.4M | 13.9M |
> | 32768 | 219.6 | 93.2 | 838.1K | 838.1K | 965.1K | 8.9M |
> |-------|---------|------|     New Code    |---------|---------|
> | 256 | 201.5 | 99.1 | 13.4K | 5.0K | 13.7K | 75.2K |
> | 512 | 202.5 | 98.2 | 11.2K | 5.9K | 11.2K | 55.5K |
> | 1024 | 207.3 | 93.9 | 11.5K | 9.7K | 11.5K | 59.6K |
> | 2048 | 207.5 | 96.7 | 11.8K | 11.1K | 15.5K | 79.3K |
> | 4096 | 206.9 | 96.6 | 11.8K | 11.7K | 11.8K | 63.2K |
> | 8192 | 205.8 | 96.7 | 11.9K | 11.8K | 11.9K | 63.9K |
> | 16384 | 200.9 | 98.2 | 11.9K | 11.9K | 11.9K | 64.2K |
> | 32768 | 202.5 | 98.0 | 11.9K | 11.9K | 11.9K | 64.2K |
> |-------|---------|------|--------|--------|---------|---------|
>
> Some observations:
> 1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
> 2. Overall CPU increased: (777.32-751.49)/751.49*100 = 3.44%
> 3. Flow management (add/delete) stayed nearly constant at ~11K
>    operations, versus millions with the original code.
>
> Reported-by: kernel test robot <lkp@...el.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> Signed-off-by: Krishna Kumar <krikku@...il.com>
> ---
> include/net/rps.h | 5 +--
> net/core/dev.c | 82 ++++++++++++++++++++++++++++++++++++++++----
> net/core/net-sysfs.c | 4 ++-
> 3 files changed, 81 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/rps.h b/include/net/rps.h
> index d8ab3a08bcc4..8e33dbea9327 100644
> --- a/include/net/rps.h
> +++ b/include/net/rps.h
> @@ -25,13 +25,14 @@ struct rps_map {
>
> /*
> * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
> - * tail pointer for that CPU's input queue at the time of last enqueue, and
> - * a hardware filter index.
> + * tail pointer for that CPU's input queue at the time of last enqueue, a
> + * hardware filter index, and the hash of the flow.
> */
> struct rps_dev_flow {
> u16 cpu;
> u16 filter;
> unsigned int last_qtail;
> + u32 hash;

This is problematic, because it adds a potential extra cache line miss
in RPS.

Some of us do not use CONFIG_RFS_ACCEL; can you make sure this
configuration does not pay any extra cost?
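
Something like this (untested sketch, just to illustrate the idea) would
at least keep the new field out of !CONFIG_RFS_ACCEL builds, since only
aRFS needs the hash:

struct rps_dev_flow {
	u16 cpu;
	u16 filter;
	unsigned int last_qtail;
#ifdef CONFIG_RFS_ACCEL
	u32 hash;	/* only used by aRFS steering decisions */
#endif
};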