Message-ID: <aOPEYwnyGnMQCp-f@shredder>
Date: Mon, 6 Oct 2025 16:30:11 +0300
From: Ido Schimmel <idosch@...sch.org>
To: demetriousz@...ton.me
Cc: "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty
saddr before ECMP hash
On Sun, Oct 05, 2025 at 08:49:55PM +0000, Dmitry Z via B4 Relay wrote:
> From: Dmitry Z <demetriousz@...ton.me>
>
> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
>
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.
>
> A practical example is a Git-over-SSH session. When running `git fetch`,
> the initial control traffic uses TOS 0x48, but data transfer switches to
> TOS 0x20. This triggers a new hash computation, and at that time `saddr`
> is already populated. As a result, packets with TOS 0x20 may be sent via
> a different OIF, because `rt6_multipath_hash()` now produces a different
> result.
>
> This issue can happen even if the matched IPv6 route specifies a `src`
> (preferred source) address. The actual impact depends on the network
> topology. In my setup, the flow was redirected to a different switch and
> reached another host, leading to TCP RSTs from the host where the session
> was never established.
>
> Possible workarounds:
> 1. Use netfilter to normalize the DSCP field before route lookup.
> (breaks DSCP/TOS assignment set by the socket)
> 2. Exclude the source address from the ECMP hash via sysctl knobs.
> (excludes an important part from hash computation)
Two more options (which I didn't test):
3. Setting "IPQoS" in SSH config to a single value. It should prevent
OpenSSH from switching DSCP while the connection is alive. Switching
DSCP triggers a route lookup since commit 305e95bb893c ("net-ipv6:
changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset()"). To be
clear, I don't think this commit is problematic as there are other
events that can invalidate cached dst entries.
4. Setting "BindAddress" in SSH config. It should make sure that the
same source address is used for all route lookups.
> This patch uses the `fib6_prefsrc.addr` value from the selected route to
> populate `saddr` before ECMP hash computation, ensuring consistent path
> selection across the flow.
I'm not convinced the problem is in the kernel. As long as all the
packets are sent with the same 5-tuple, it's up to the network to
deliver them correctly. I don't know what your topology looks like, but
in the general case packets belonging to the same flow can be routed via
different paths over time. If multiple servers can service incoming SSH
connections, then there should be a stateful load balancer between them
and the clients so that packets belonging to the same flow are always
delivered to the same server. ECMP cannot be relied on to do load
balancing alone as it's stateless.
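
To illustrate the stateless point, here is a toy sketch in Python (not the
kernel's rt6_multipath_hash(); the SHA-256 based hash, the `Flow` tuple, the
addresses and the nexthop names are all assumptions made up for the example).
The nexthop is a pure function of the lookup key, so any change in the key
between two lookups, such as an unset versus a populated `saddr`, can pick a
different path. Nothing remembers the earlier choice.

```python
import hashlib
import ipaddress
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """Hypothetical lookup key; "::" models a source address not selected yet."""
    saddr: str
    daddr: str
    proto: int
    sport: int
    dport: int


def hash_key(flow: Flow) -> bytes:
    """Serialize the fields that feed the toy multipath hash."""
    return (
        ipaddress.IPv6Address(flow.saddr).packed
        + ipaddress.IPv6Address(flow.daddr).packed
        + bytes([flow.proto])
        + flow.sport.to_bytes(2, "big")
        + flow.dport.to_bytes(2, "big")
    )


def select_nexthop(flow: Flow, nexthops: Sequence[str]) -> str:
    """Stateless selection: the result depends only on the key, not on history."""
    digest = hashlib.sha256(hash_key(flow)).digest()
    return nexthops[int.from_bytes(digest[:4], "big") % len(nexthops)]


# Made-up nexthops and documentation-prefix addresses.
NEXTHOPS = ["fe80::1%eth0", "fe80::2%eth1"]

# First lookup: saddr is still unset. Later lookup: saddr has been filled in.
first_lookup = Flow("::", "2001:db8:2::10", 6, 40000, 22)
later_lookup = first_lookup._replace(saddr="2001:db8:1::5")

if __name__ == "__main__":
    print("first lookup ->", select_nexthop(first_lookup, NEXTHOPS))
    print("later lookup ->", select_nexthop(later_lookup, NEXTHOPS))
```

The two lookups hash different keys, so they may or may not land on the same
nexthop. Repeating a lookup with an identical key always gives the same
answer, which is all a stateless hash can promise.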