Message-ID: <CANn89i+vhTeqgTPr+suupJNLMHp-RAX89aBrFhiQnu58233bAw@mail.gmail.com>
Date: Mon, 21 Jul 2025 01:14:20 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Krishna Kumar <krikku@...il.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, tom@...bertland.com,
kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, sdf@...ichev.me,
kuniyu@...gle.com, ahmed.zaki@...el.com, aleksander.lobakin@...el.com,
atenart@...nel.org, jdamato@...tly.com, krishna.ku@...pkart.com
Subject: Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for
active flows
On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@...il.com> wrote:
>
> This patch fixes an issue where two different flows on the same RX queue
> produce the same hash, resulting in continuous flow overwrites.
>
> Flow #1: A packet for Flow #1 comes in, and the kernel calls the
>          steering function. The driver returns a filter id, which the
>          kernel saves in the selected slot. Later, the driver's service
>          task checks whether any filters have expired and then installs
>          the rule for Flow #1.
> Flow #2: A packet for Flow #2 comes in. It goes through the same steps,
>          but this time the chosen slot is already being used by Flow #1.
>          The driver returns a new filter id and the kernel saves it in
>          the same slot. When the driver's service task runs, it walks
>          all the flows and asks whether Flow #1 can be expired; the
>          kernel returns true because the slot now holds a different
>          filter id (see the sketch after this walkthrough), so the
>          driver removes Flow #1's rule and installs the rule for Flow #2.
> Flow #1: Another packet for Flow #1 comes in and the same thing repeats:
>          the slot is overwritten with a new filter id for Flow #1.
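>
> For reference, this is roughly how rps_may_expire_flow() decides that a
> slot is stale (paraphrased and simplified from net/core/dev.c, not a
> verbatim copy): as soon as the slot's filter id no longer matches the
> one the driver passes in, the flow is reported as expired even though
> the connection that owned the old filter is still alive.
>
>     bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
>                              u32 flow_id, u16 filter_id)
>     {
>             struct netdev_rx_queue *rxqueue = dev->_rx + rxq_index;
>             struct rps_dev_flow_table *flow_table;
>             struct rps_dev_flow *rflow;
>             bool expire = true;
>             unsigned int cpu;
>
>             rcu_read_lock();
>             flow_table = rcu_dereference(rxqueue->rps_flow_table);
>             if (flow_table && flow_id <= flow_table->mask) {
>                     rflow = &flow_table->flows[flow_id];
>                     cpu = READ_ONCE(rflow->cpu);
>
>                     /* Still active only if the stored filter id matches
>                      * AND the target CPU's backlog has not advanced too
>                      * far past this flow's last enqueue point.
>                      */
>                     if (rflow->filter == filter_id && cpu < nr_cpu_ids &&
>                         ((int)(per_cpu(softnet_data, cpu).input_queue_head -
>                                READ_ONCE(rflow->last_qtail)) <
>                          (int)(10 * flow_table->mask)))
>                             expire = false;
>             }
>             rcu_read_unlock();
>             return expire;
>     }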
>
> This causes a repeated cycle of flow programming for missed packets,
> wasting CPU cycles while not improving performance. The problem happens
> more often when the RPS table is small, but tests show it still happens
> even with 12,000 connections and an RPS size of 16K per queue (global
> table of 144 queues x 16K entries).
>
> This patch prevents overwriting an rps_dev_flow entry while it is active.
> The idea is that it is better to keep aRFS working for the first flow
> than to hurt every flow that shares the hash. Without this, two (or
> more) flows on one RX queue with the same hash keep overwriting each
> other, causing the driver to reprogram the flow repeatedly.
>
> Changes:
> 1. Add a new 'hash' field to struct rps_dev_flow.
> 2. Add rps_flow_is_active(): a helper to check whether a flow is active,
>    extracted from rps_may_expire_flow() (sketched below, after this list).
> 3. In set_rps_cpu():
>    - Avoid overwrites: only program a new filter (and take over the
>      slot) if:
>      - the slot is not in use, or
>      - the slot is in use but the flow is not active, or
>      - the slot holds an active flow with the same hash but a different
>        target CPU.
>    - Save the hash in the rps_dev_flow entry.
> 4. rps_may_expire_flow(): use the rps_flow_is_active() helper extracted
>    in 2.
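>
> An illustrative sketch of items 2 and 3 (names as used in this
> changelog; variables such as next_cpu, rxq_index and rc come from
> set_rps_cpu()'s local context; the exact implementation lives in the
> net/core/dev.c hunks of this patch and may differ in detail):
>
>     /* Extracted from rps_may_expire_flow(): a slot is active if its
>      * target CPU is valid and that CPU's backlog has not advanced more
>      * than ~10 * table-size packets past the flow's last enqueue.
>      */
>     static bool rps_flow_is_active(struct rps_dev_flow *rflow,
>                                    struct rps_dev_flow_table *flow_table,
>                                    unsigned int cpu)
>     {
>             return cpu < nr_cpu_ids &&
>                    ((int)(per_cpu(softnet_data, cpu).input_queue_head -
>                           READ_ONCE(rflow->last_qtail)) <
>                     (int)(10 * flow_table->mask));
>     }
>
>     /* In set_rps_cpu(), before calling ndo_rx_flow_steer(): keep the
>      * already-programmed flow unless the slot is free, inactive, or
>      * holds this same flow (same hash) moving to another CPU.
>      */
>     u32 hash = skb_get_hash(skb);
>     u32 flow_id = hash & flow_table->mask;
>     struct rps_dev_flow *rflow = &flow_table->flows[flow_id];
>     u32 old_cpu = READ_ONCE(rflow->cpu);
>
>     if (rflow->filter != RPS_NO_FILTER &&
>         rps_flow_is_active(rflow, flow_table, old_cpu) &&
>         (hash != rflow->hash || next_cpu == old_cpu))
>             goto out;               /* do not overwrite the active flow */
>
>     rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb, rxq_index, flow_id);
>     if (rc >= 0) {
>             WRITE_ONCE(rflow->filter, rc);
>             rflow->hash = hash;
>     }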
>
> Testing & results:
> - Driver: ice (E810 NIC), Kernel: net-next
> - #CPUs = #RXq = 144 (1:1)
> - Number of flows: 12K
> - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
> it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
> global table slots)
> - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
> - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
> - Metrics captured on client
>
> Legend for following tables:
> Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
> Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
> Add: #times driver actually programmed aRFS (ice_arfs_build_entry())
> Del: #times driver deleted the flow (ice_arfs_del_flow_rules())
> Units: K = 1,000 times, M = 1 million times
>
> |-------|---------|------|     Org Code    |---------|---------|
> | RPS | Latency | CPU | Add | Del | Steer-C | Steer-L |
> |-------|---------|------|--------|--------|---------|---------|
> | 256 | 227.0 | 93.2 | 1.6M | 1.6M | 121.7M | 267.6M |
> | 512 | 225.9 | 94.1 | 11.5M | 11.2M | 65.7M | 199.6M |
> | 1024 | 223.5 | 95.6 | 16.5M | 16.5M | 27.1M | 187.3M |
> | 2048 | 222.2 | 96.3 | 10.5M | 10.5M | 12.5M | 115.2M |
> | 4096 | 223.9 | 94.1 | 5.5M | 5.5M | 7.2M | 65.9M |
> | 8192 | 224.7 | 92.5 | 2.7M | 2.7M | 3.0M | 29.9M |
> | 16384 | 223.5 | 92.5 | 1.3M | 1.3M | 1.4M | 13.9M |
> | 32768 | 219.6 | 93.2 | 838.1K | 838.1K | 965.1K | 8.9M |
> |-------|---------|------|     New Code    |---------|---------|
> | 256 | 201.5 | 99.1 | 13.4K | 5.0K | 13.7K | 75.2K |
> | 512 | 202.5 | 98.2 | 11.2K | 5.9K | 11.2K | 55.5K |
> | 1024 | 207.3 | 93.9 | 11.5K | 9.7K | 11.5K | 59.6K |
> | 2048 | 207.5 | 96.7 | 11.8K | 11.1K | 15.5K | 79.3K |
> | 4096 | 206.9 | 96.6 | 11.8K | 11.7K | 11.8K | 63.2K |
> | 8192 | 205.8 | 96.7 | 11.9K | 11.8K | 11.9K | 63.9K |
> | 16384 | 200.9 | 98.2 | 11.9K | 11.9K | 11.9K | 64.2K |
> | 32768 | 202.5 | 98.0 | 11.9K | 11.9K | 11.9K | 64.2K |
> |-------|---------|------|--------|--------|---------|---------|
>
> Some observations:
> 1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
> 2. Overall CPU increased: (777.32-751.49)/751.49*100 = 3.44%
> 3. Flow management (add/delete) stayed nearly constant at ~11K
>    operations, versus millions with the original code.
>
> Reported-by: kernel test robot <lkp@...el.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> Signed-off-by: Krishna Kumar <krikku@...il.com>
> ---
> include/net/rps.h | 5 +--
> net/core/dev.c | 82 ++++++++++++++++++++++++++++++++++++++++----
> net/core/net-sysfs.c | 4 ++-
> 3 files changed, 81 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/rps.h b/include/net/rps.h
> index d8ab3a08bcc4..8e33dbea9327 100644
> --- a/include/net/rps.h
> +++ b/include/net/rps.h
> @@ -25,13 +25,14 @@ struct rps_map {
>
> /*
> * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
> - * tail pointer for that CPU's input queue at the time of last enqueue, and
> - * a hardware filter index.
> + * tail pointer for that CPU's input queue at the time of last enqueue, a
> + * hardware filter index, and the hash of the flow.
> */
> struct rps_dev_flow {
> u16 cpu;
> u16 filter;
> unsigned int last_qtail;
> + u32 hash;

This is problematic, because it adds a potential extra cache line miss
in RPS.

Some of us do not use CONFIG_RFS_ACCEL; can you make sure this
configuration does not pay any extra cost?
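
Something like this (untested sketch, just to illustrate the idea) would
at least keep the new field out of !CONFIG_RFS_ACCEL builds, since only
aRFS needs the hash:

struct rps_dev_flow {
	u16 cpu;
	u16 filter;
	unsigned int last_qtail;
#ifdef CONFIG_RFS_ACCEL
	u32 hash;	/* only used by aRFS steering decisions */
#endif
};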