Message-ID: <MZruGuax8jyrCcZTXAVhH0AaAMOZ-2Gcj5VeZO8xy8wS9FqwA3EMhPFpHLZs67FAKCu6z3GpEVeArSX2qGdSUqsysI-0o13dKK1ZmUhK_l0=@proton.me>
Date: Mon, 06 Oct 2025 18:31:10 +0000
From: Dmitry <demetriousz@...ton.me>
To: Ido Schimmel <idosch@...sch.org>
Cc: "David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org, "demetriousz@...ton.me" <demetriousz@...ton.me>
Subject: Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash
> Two more options (which I didn't test):
>
> 3. Setting "IPQoS" in SSH config to a single value. It should prevent
> OpenSSH from switching DSCP while the connection is alive. Switching
> DSCP triggers a route lookup since commit 305e95bb893c ("net-ipv6:
> changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset()"). To be
> clear, I don't think this commit is problematic as there are other
> events that can invalidate cached dst entries.
I haven't tested this, but I assume it should work, since the traffic class in
the IP header would then stay the same for the whole connection.
> 4. Setting "BindAddress" in SSH config. It should make sure that the
> same source address is used for all route lookups.
Yes, I've tested this one, and it works. I was focused on finding a system-level
solution and didn't think about application-level settings.
> As long as all the packets are sent with the same 5-tuple.
The problem is that at the beginning of the connection SADDR is still empty when
the hash is computed. It appears to be filled in only later, once the outgoing
interface (OIF) is determined.
Let's look at how to reproduce the issue:
Test lab topology:
+-----+ vlan=1          +-----+
|     +---------------->|     |
|HostA+---------------->|HostF|
|     |...              |     |
|     +---------------->|     |
+-----+ vlan=99         +-----+
HostA lo: 2001:db8:aaaa::
HostF lo: 2001:db8:ffff::
HostA has an ECMP route to 2001:db8:ffff:: with the source address
2001:db8:aaaa:: specified, distributed across all VLANs toward HostF. To
reproduce, I run git fetch on HostA to transfer data from HostF.
PCAP Without the fix:
16:34:40.875734 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 98: vlan 49, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 40)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [S], cksum 0x064b (incorrect
-> 0x5490), seq 827400610, win 64800, options [mss 1440,sackOK,TS val 1303683318
ecr 0,nop,wscale 7], length 0
<skipped>
16:34:41.566130 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 90: vlan 49, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 32)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [.], cksum 0x0643 (incorrect
-> 0xd980), seq 3570, ack 4031, win 509, options [nop,nop,TS val 1303684009 ecr
3265960348], length 0
16:34:41.567338 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 234: vlan 83, p 0, ethertype IPv6 (0x86dd), (class 0x20,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 176)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [P.], cksum 0x06d3 (incorrect
-> 0xce55), seq 3570:3714, ack 4031, win 509, options [nop,nop,TS val 1303684009
ecr 3265960348], length 144
As you can see, packets of the same connection leave through different
interfaces (vlan 49, then vlan 83). This is the symptom of the issue. The same
problem can be observed in a real environment with multiple physical links (up
to 6–8 interfaces).
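The VLAN switch in the capture coincides with the traffic class changing from
0x48 to 0x20, which matches the quoted point that a tclass change resets the
cached dst and triggers a new route lookup. As a small sketch for reading those
values, the traffic class byte can be split into DSCP (upper 6 bits) and ECN
(lower 2 bits). The helper name and the name table are illustrative, not taken
from any tool used above.

```python
# Sketch: split an IPv6 traffic class byte into DSCP and ECN.
# split_traffic_class and DSCP_NAMES are hypothetical names for this example.

DSCP_NAMES = {18: "AF21", 8: "CS1"}  # only the code points seen in the capture


def split_traffic_class(tclass: int) -> tuple[int, int]:
    """Return (dscp, ecn) for an 8-bit traffic class value."""
    if not 0 <= tclass <= 0xFF:
        raise ValueError("traffic class must fit in one byte")
    return tclass >> 2, tclass & 0x3


for tclass in (0x48, 0x20):
    dscp, ecn = split_traffic_class(tclass)
    name = DSCP_NAMES.get(dscp, "?")
    print(f"class {tclass:#04x}: DSCP {dscp} ({name}), ECN {ecn}")
# class 0x48: DSCP 18 (AF21), ECN 0
# class 0x20: DSCP 8 (CS1), ECN 0
```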
I added some debug prints around ip6_multipath_hash_policy():
PRINTS Without the fix:
Oct 06 16:34:40 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=:: dst=2001:db8:ffff:: proto=6 hash=2109163277
Oct 06 16:34:41 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3559450110
Oct 06 16:34:41 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3559450110
As you can see, saddr is empty (::) for the first lookup, so the initial hash
(2109163277) differs from the hash computed for the later lookups (3559450110).
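To illustrate why an empty saddr matters, here is a toy sketch of hashing a
5-tuple and picking a nexthop. It is not the kernel's algorithm: the hash
function (truncated SHA-256), the key layout, and the modulo selection are all
assumptions made for illustration, and flow_hash and pick_nexthop are
hypothetical names. It only shows that the same flow hashed once with src=::
and once with the real source gives two unrelated hash values.

```python
# Toy model of ECMP path selection; NOT the kernel implementation.
import hashlib
import ipaddress
import struct


def flow_hash(src: str, dst: str, proto: int, sport: int, dport: int) -> int:
    """Return a 32-bit hash over the 5-tuple (assumed key layout)."""
    key = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + struct.pack("!BHH", proto, sport, dport)
    )
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def pick_nexthop(hash32: int, nexthops: int) -> int:
    """Map a hash to a nexthop index (simple modulo, an assumption)."""
    return hash32 % nexthops


# Same connection, hashed before and after the source address is known.
early = flow_hash("::", "2001:db8:ffff::", 6, 44690, 22)
later = flow_hash("2001:db8:aaaa::", "2001:db8:ffff::", 6, 44690, 22)
print("hashes differ:", early != later)
print("nexthop early/later:", pick_nexthop(early, 99), pick_nexthop(later, 99))
```

Whether the two lookups land on different nexthops depends on the hash values,
but nothing ties them to the same one, which is what the capture shows.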
PCAP With the fix:
16:42:27.624160 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 98: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 40)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [S], cksum 0x064b (incorrect
-> 0x174e), seq 1032224426, win 64800, options [mss 1440,sackOK,TS val
3603754981 ecr 0,nop,wscale 10], length 0
<skipped>
16:42:28.328572 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 90: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 32)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [.], cksum 0x0643 (incorrect
-> 0xcd3f), seq 3570, ack 4031, win 66, options [nop,nop,TS val 3603755686 ecr
3266427110], length 0
16:42:28.329511 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 234: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x20,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 176)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [P.], cksum 0x06d3 (incorrect
-> 0x3fd6), seq 3570:3714, ack 4031, win 66, options [nop,nop,TS val 3603755686
ecr 3266427110], length 144
As you can see, all packets now use the same VLAN (70).
PRINTS With the fix:
Oct 06 16:42:27 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165
Oct 06 16:42:28 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165
Oct 06 16:42:28 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165
So, with the fix applied, we populate SADDR and calculate the hash correctly. I
think it's reasonable to respect the src field in the IPv6 route when computing
the hash.
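The idea can be sketched as follows. This is only a model of the behaviour
described above, not the patch itself: effective_saddr is a hypothetical helper,
and the route's source address is passed in as a plain string.

```python
# Sketch of "fill empty saddr from the route's src before hashing".
# effective_saddr is a hypothetical name; this is not kernel code.
import ipaddress
from typing import Optional


def effective_saddr(flow_saddr: str, route_src: Optional[str]) -> str:
    """Return the source address that should be fed into the ECMP hash."""
    saddr = ipaddress.IPv6Address(flow_saddr)
    if saddr.is_unspecified and route_src is not None:
        # Flow has no source yet: use the address configured on the route.
        return str(ipaddress.IPv6Address(route_src))
    # Source already set (or route has none): leave it untouched.
    return str(saddr)


print(effective_saddr("::", "2001:db8:aaaa::"))               # 2001:db8:aaaa::
print(effective_saddr("2001:db8:aaaa::", "2001:db8:aaaa::"))  # 2001:db8:aaaa::
print(effective_saddr("::", None))                            # ::
```

With this substitution the first lookup and all later lookups hash the same
source address, so they agree on the nexthop.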
> I'm not convinced the problem is in the kernel. As long as all the
> packets are sent with the same 5-tuple, it's up to the network to
> deliver them correctly. I don't know how your topology looks like, but
> in the general case packets belonging to the same flow can be routed via
> different paths over time. If multiple servers can service incoming SSH
> connections, then there should be a stateful load balancer between them
> and the clients so that packets belonging to the same flow are always
> delivered to the same server. ECMP cannot be relied on to do load
> balancing alone as it's stateless.
Well, it seems the current implementation doesn't handle the SRC field
consistently: it is ignored at the start of a session and only taken into
account once the session is established.
> as long as all the packets are sent with the same 5-tuple, it’s up to the
> network to deliver them correctly
If the 5-tuple does not change, then both the hash and the outgoing interface
(OIF) should stay the same, which is currently not the case. Only with the fix
is the configured SRC respected from the first lookup, so the 5-tuple and the
resulting hash are consistent for the whole connection.
Therefore, in my opinion, this should be fixed.