Message-ID: <CACLgkEbcbWBs+ma3rq7d6B7nBtOzVVEogy0aEJX5hd6PV9790g@mail.gmail.com>
Date: Mon, 21 Jul 2025 16:42:41 +0530
From: Krishna Kumar <krikku@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, tom@...bertland.com, 
	kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, sdf@...ichev.me, 
	kuniyu@...gle.com, ahmed.zaki@...el.com, aleksander.lobakin@...el.com, 
	atenart@...nel.org, jdamato@...tly.com, krishna.ku@...pkart.com
Subject: Re: [PATCH v5 net-next 1/2] net: Prevent RPS table overwrite for
 active flows

Thanks Eric.

1. I will move "hash" under aRFS (addressing your next comment separately).
2. There's no point (it may actually harm) in moving the existing "filter"
   under aRFS, even though it is also used in the aRFS code - it is a u16
   and pairs well with the "u16 cpu".

I hope #1 is sufficient?
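
For reference, #1 would look roughly like this (illustrative sketch only,
not the final hunk; it simply wraps the new field in CONFIG_RFS_ACCEL):

    struct rps_dev_flow {
            u16             cpu;
            u16             filter;         /* u16, packs with 'cpu' */
            unsigned int    last_qtail;
    #ifdef CONFIG_RFS_ACCEL
            u32             hash;           /* only used by aRFS */
    #endif
    };

With CONFIG_RFS_ACCEL disabled, sizeof(struct rps_dev_flow) stays at its
current value, so plain-RPS users see no extra cache footprint.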

Thanks,
- Krishna

On Mon, Jul 21, 2025 at 1:44 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Sun, Jul 20, 2025 at 8:16 PM Krishna Kumar <krikku@...il.com> wrote:
> >
> > This patch fixes an issue where two different flows on the same RXq
> > produce the same hash, resulting in continuous flow overwrites.
> >
> > Flow #1: A packet for Flow #1 comes in and the kernel calls the steering
> >          function. The driver gives back a filter id. The kernel saves
> >          this filter id in the selected slot. Later, the driver's
> >          service task checks if any filters have expired and then
> >          installs the rule for Flow #1.
> > Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
> >          But this time, the chosen slot is being used by Flow #1. The
> >          driver gives a new filter id and the kernel saves it in the
> >          same slot. When the driver's service task runs, it walks all
> >          the flows and asks the kernel whether Flow #1 has expired; the
> >          kernel returns true because the slot now holds a different
> >          filter id, so the driver installs the rule for Flow #2.
> > Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
> >          The slot is overwritten with a new filter id for Flow #1.
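> >
> > (Background, simplified: the per-queue rps_dev_flow slot is selected by
> > masking the flow hash, roughly
> >
> >          rflow = &flow_table->flows[hash & flow_table->mask];
> >
> > so flows whose hashes collide under the mask map to the same slot and
> > keep evicting each other's filter id.)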
> >
> > This causes a repeated cycle of flow programming for missed packets,
> > wasting CPU cycles while not improving performance. This problem happens
> > at higher rates when the RPS table is small, but tests show it still
> > happens even with 12,000 connections and an RPS size of 16K per queue
> > (global table size = 144 x 16K ~= 2.3M entries).
> >
> > This patch prevents overwriting an rps_dev_flow entry if it is active.
> > The intention is that it is better to keep aRFS working for the first
> > flow than to hurt all flows that share the same hash. Without this, two
> > (or more) flows on one RX queue with the same hash keep overwriting each
> > other, causing the driver to reprogram the flows repeatedly.
> >
> > Changes:
> >   1. Add a new 'hash' field to struct rps_dev_flow.
> >   2. Add rps_flow_is_active(): a helper function to check if a flow is
> >      active or not, extracted from rps_may_expire_flow().
> >   3. In set_rps_cpu() (see the simplified sketch after this list):
> >      - Avoid overwriting by programming a new filter if:
> >         - The slot is not in use, or
> >         - The slot is in use but the flow is not active, or
> >         - The slot has an active flow with the same hash, but target CPU
> >           differs.
> >      - Save the hash in the rps_dev_flow entry.
> >   4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().
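> >
> > For readability, the shape of the new check in set_rps_cpu() is roughly
> > the following (simplified sketch, not the literal hunk; 'slot' names the
> > candidate rps_dev_flow entry and the helper's exact signature may differ):
> >
> >          slot = &flow_table->flows[flow_id];
> >
> >          /* Program a new hardware filter only if the slot is free, its
> >           * current flow has gone idle, or it already holds this flow
> >           * (same hash) but the flow should now run on a different CPU.
> >           */
> >          if (!rps_flow_is_active(slot, flow_table, slot->cpu) ||
> >              (slot->hash == hash && slot->cpu != next_cpu)) {
> >                  rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
> >                                                          rxq_index, flow_id);
> >                  ...
> >                  slot->hash = hash;
> >          }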
> >
> > Testing & results:
> >   - Driver: ice (E810 NIC), Kernel: net-next
> >   - #CPUs = #RXq = 144 (1:1)
> >   - Number of flows: 12K
> >   - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
> >     it is still sufficient to cover 12K flows (256 * 144 rx-queues =
> >     ~36K, effectively 64K global table slots)
> >   - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
> >   - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
> >   - Metrics captured on client
> >
> > Legend for following tables:
> > Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
> > Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
> > Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
> > Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
> > Units:   K = 1,000 times, M = 1 million times
> >
> >   |-------|---------|------|     Org Code    |---------|---------|
> >   | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
> >   |-------|---------|------|--------|--------|---------|---------|
> >   | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
> >   | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
> >   | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
> >   | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
> >   | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
> >   | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
> >   | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
> >   | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
> >   |-------|---------|------|   New Code      |---------|---------|
> >   | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
> >   | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
> >   | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
> >   | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
> >   | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
> >   | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
> >   | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
> >   | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
> >   |-------|---------|------|--------|--------|---------|---------|
> >
> > Some observations:
> >   1. Overall latency (summed over the eight RPS runs) improved:
> >      (1790.19 - 1634.94) / 1790.19 * 100 = 8.67%
> >   2. Overall CPU (summed) increased: (777.32 - 751.49) / 751.49 * 100 = 3.44%
> >   3. Flow management (add/delete) stayed almost constant at ~11K
> >      operations, versus the millions seen with the original code.
> >
> > Reported-by: kernel test robot <lkp@...el.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202507161125.rUCoz9ov-lkp@intel.com/
> > Signed-off-by: Krishna Kumar <krikku@...il.com>
> > ---
> >  include/net/rps.h    |  5 +--
> >  net/core/dev.c       | 82 ++++++++++++++++++++++++++++++++++++++++----
> >  net/core/net-sysfs.c |  4 ++-
> >  3 files changed, 81 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/net/rps.h b/include/net/rps.h
> > index d8ab3a08bcc4..8e33dbea9327 100644
> > --- a/include/net/rps.h
> > +++ b/include/net/rps.h
> > @@ -25,13 +25,14 @@ struct rps_map {
> >
> >  /*
> >   * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
> > - * tail pointer for that CPU's input queue at the time of last enqueue, and
> > - * a hardware filter index.
> > + * tail pointer for that CPU's input queue at the time of last enqueue, a
> > + * hardware filter index, and the hash of the flow.
> >   */
> >  struct rps_dev_flow {
> >         u16             cpu;
> >         u16             filter;
> >         unsigned int    last_qtail;
> > +       u32             hash;
>
> This is problematic, because it adds a potential extra cache line miss in RPS.
>
> Some of us do not use CONFIG_RFS_ACCEL; could you make sure not to add
> extra cost for this configuration?
