netdev - [PATCH AUTOSEL 6.17-5.4] net: When removing nexthops, don't call synchronize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20251025160905.3857885-94-sashal@kernel.org>
Date: Sat, 25 Oct 2025 11:55:25 -0400
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
	stable@...r.kernel.org
Cc: Christoph Paasch <cpaasch@...nai.com>,
	Ido Schimmel <idosch@...dia.com>,
	Nikolay Aleksandrov <razor@...ckwall.org>,
	Eric Dumazet <edumazet@...gle.com>,
	David Ahern <dsahern@...nel.org>,
	Jakub Kicinski <kuba@...nel.org>,
	Sasha Levin <sashal@...nel.org>,
	davem@...emloft.net,
	netdev@...r.kernel.org
Subject: [PATCH AUTOSEL 6.17-5.4] net: When removing nexthops, don't call synchronize_net if it is not necessary

From: Christoph Paasch <cpaasch@...nai.com>

[ Upstream commit b0ac6d3b56a2384db151696cfda2836a8a961b6d ]

When removing a nexthop, commit
90f33bffa382 ("nexthops: don't modify published nexthop groups") added a
call to synchronize_rcu() (later changed to _net()) to make sure
everyone sees the new nexthop-group before the rtnl-lock is released.

When one wants to delete a large number of groups and nexthops, it is
fastest to first flush the groups (ip nexthop flush groups) and then
flush the nexthops themselves (ip -6 nexthop flush). As that way the
groups don't need to be rebalanced.

However, `ip -6 nexthop flush` will still take a long time if there is
a very large number of nexthops because of the call to
synchronize_net(). Now, if there are no more groups, there is no point
in calling synchronize_net(). So, let's skip that entirely by checking
if nh->grp_list is empty.

This gives us a nice speedup:

BEFORE:
=======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real	1m45.345s
user	0m0.001s
sys	0m0.005s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real	3m10.430s
user	0m0.002s
sys	0m0.004s

AFTER:
======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real	0m17.545s
user	0m0.003s
sys	0m0.003s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real	0m35.823s
user	0m0.002s
sys	0m0.004s

Signed-off-by: Christoph Paasch <cpaasch@...nai.com>
Reviewed-by: Ido Schimmel <idosch@...dia.com>
Reviewed-by: Nikolay Aleksandrov <razor@...ckwall.org>
Reviewed-by: Eric Dumazet <edumazet@...gle.com>
Reviewed-by: David Ahern <dsahern@...nel.org>
Link: https://patch.msgid.link/20250816-nexthop_dump-v2-2-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@...nel.org>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

LLM Generated explanations, may be completely bogus:

YES

Why this is a good stable backport
- Fixes a real, user-facing regression in large-scale environments:
  deleting many nexthops pays an O(N) cost from an unnecessary RCU grace
  period per nexthop. The commit message shows dramatic runtime
  reductions (minutes → seconds) for “ip -6 nexthop flush” on millions
  of nexthops. This is an operational pain point, not a micro-
  optimization.
- Minimal, localized change with no functional/architectural impact: it
  only short-circuits a barrier when there is provably nothing to
  synchronize. No API changes, no behavior changes when cleanup is
  actually needed.
- Preserves correctness: the synchronize call was introduced to
  serialize readers after updating a published group array (commit
  90f33bffa382). Skipping it is safe when there were no group updates.

Specific code and history analysis
- Barrier origin and purpose:
  - 90f33bffa382 added a post-update grace period to “make sure all see
    the newly published array before releasing RTNL” by calling
    `synchronize_rcu()` (later became `synchronize_net()`).
  - See 90f33bffa382: net/ipv4/nexthop.c: the barrier was added after
    removing a nexthop from groups.
- Current code path (pre-patch):
  - `remove_nexthop_from_groups()` iterates `nh->grp_list`, potentially
    updating group arrays via `remove_nh_grp_entry()`, then
    unconditionally calls `synchronize_net()`; net/ipv4/nexthop.c:2085
    and net/ipv4/nexthop.c:2094.
  - This function runs for non-group nexthops during deletion; see call
    site in `__remove_nexthop()`: net/ipv4/nexthop.c:2166. The RTNL lock
    is held across deletion (rtnl lock in `rtm_del_nexthop()`);
    net/ipv4/nexthop.c:3310.
- The patch’s exact change:
  - Adds an early return when there is nothing to remove:
    - New check: `if (list_empty(&nh->grp_list)) return;`
    - This prevents the unconditional `synchronize_net()` when `nh`
      belongs to no groups.
  - The loop and the barrier still run when there are entries to remove,
    preserving the original safety guarantee.
- Why the early return is safe:
  - If `&nh->grp_list` is empty, no group arrays are modified; there is
    nothing to “publish” and thus no readers to wait out. The barrier is
    purely to serialize readers after `rcu_assign_pointer()` of a new
    group array (e.g., in `remove_nh_grp_entry()` which calls
    `rcu_assign_pointer(nhp->nh_grp, newg)`; net/ipv4/nexthop.c:around
    2020). With no modifications, the barrier is a no-op, only adding
    latency.
  - Concurrency context is correct: group membership modifications
    happen under RTNL, and `remove_nexthop_from_groups()` is called
    under RTNL; `list_empty()` on `nh->grp_list` is consistent. The list
    head is always initialized (`INIT_LIST_HEAD(&nh->grp_list)`;
    net/ipv4/nexthop.c:542).
  - Other RCU barriers in the file that protect real publications remain
    intact (e.g., in group replacement, `synchronize_net()` remains;
    net/ipv4/nexthop.c:2291).

Stable policy considerations
- Scope is tiny and self-contained (one function, one early return); no
  cross-subsystem impact.
- Not a feature; it is a performance fix for a behavior introduced by an
  earlier change (90f33bffa382) that added unconditional grace periods
  even when nothing changed.
- Risk of regression is very low: previously, the barrier was sometimes
  unnecessary. Now it remains when necessary and is skipped when
  provably unneeded. No change to notifier behavior or group update
  logic.

Practical backport notes
- Older stable trees may have `synchronize_rcu()` instead of
  `synchronize_net()` at the end of `remove_nexthop_from_groups()`. The
  early return remains valid and safe regardless; adapt the barrier name
  to the tree’s version if needed.
- The infrastructure used by the check (`nh->grp_list`) and usage
  context (RTNL held) are long-standing and present in stable kernels
  that have nexthop groups.

Conclusion
- This change is a classic stable backport candidate: important user-
  visible improvement, minimal risk, no semantics change, and tightly
  scoped to the nexthop cleanup path.

 net/ipv4/nexthop.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 34137768e7f9a..15acfb74fd238 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2087,6 +2087,12 @@ static void remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
 {
 	struct nh_grp_entry *nhge, *tmp;
 
+	/* If there is nothing to do, let's avoid the costly call to
+	 * synchronize_net()
+	 */
+	if (list_empty(&nh->grp_list))
+		return;
+
 	list_for_each_entry_safe(nhge, tmp, &nh->grp_list, nh_list)
 		remove_nh_grp_entry(net, nhge, nlinfo);
 
-- 
2.51.0