[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <willemdebruijn.kernel.3a8157265761f@gmail.com>
Date: Tue, 07 Oct 2025 18:25:47 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Ido Schimmel <idosch@...sch.org>,
Dmitry <demetriousz@...ton.me>,
willemdebruijn.kernel@...il.com
Cc: "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty
saddr before ECMP hash
Ido Schimmel wrote:
> On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> > If the 5-tuple is not changed, then both the hash and the outgoing interface
> > (OIF) should remain consistent, which is not the case.
With git fetch over SSH, the process apparenty explicitly changes DSCP
(by calling setsockopt IPV6_TCLASS?). Which triggers a dst reset,
which that may trigger a different path. That is WAI, right?
Policy routing can explicitly specify different egress devices for
different DSCP settings.
Is this the entire issue? The original message states
> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
>
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.
That sounds as if the OIF can change in between the rt6_multipath_hash
and ipv6_dev_get_saddr calls for a regular socket, without such
explicit DSCP changes. Does this happen?
> > Only with the fix does it
> > respect the configured SRC and produce a consistent, correct 5-tuple with the
> > proper hash.
> >
> > Therefore, in my opinion, this should be fixed.
>
> Note that even if the hash is consistent throughout the lifetime of the
> socket, it is still possible for packets to be routed out of different
> interfaces. This can happen, for example, if one of the nexthop devices
> loses its carrier. This will change the hash thresholds in the ECMP
> group and can cause packets to egress a different interface even if the
> current one is not the one that went down. Obviously packets can also
> change paths due to changes in other routers between you and the
> destination. A network design that results in connections being severed
> every time a flow is routed differently seems fragile to me.
>
> If you still want to address the issue, then I believe that the correct
> way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
> I'm not sure why they differ, but the IPv4 version will first do a route
> lookup to determine the source address, then allocate a source port and
> only when all the parameters are known it will do a final route lookup
> and cache the result in the socket. IPv6 on the other hand, does a
> single route lookup with an unknown source address and an unknown source
> port.
>
> This is explained in the comment above ip_route_connect_init() and
> Willem also explained it here:
>
> https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/
>
> Willem, do you happen to know why tcp_v6_connect() only performs a
> single route lookup?
I did not fully get to the historical reasons for the differences.
>From v1 of that patch:
"Side-quest: I wonder if the second route lookup in ip_route_connect
is vestigial since the introduction of the third route lookup with
ip_route_newports. IPv6 has neither second nor third lookup, which
hints that perhaps both can be removed. "
https://lore.kernel.org/netdev/20250420180537.2973960-2-willemdebruijn.kernel@gmail.com/
> Link to the original patch:
>
> https://lore.kernel.org/netdev/20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-v1-1-d43b6ef00035@proton.me/
>
> Thanks
Powered by blists - more mailing lists