Message-ID: <AE9FC986-D052-4E3B-A98C-EAF2235F498D@akamai.com>
Date: Mon, 8 Sep 2025 06:52:48 +0000
From: "Banerjee, Debabrata" <dbanerje@...mai.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Question: Mixed MTU environment support, possible kernel patch?
Hi netdev,
We have a problem implementing jumbo frame support across generic hosts/networks. Perhaps we've missed some obvious solution, but this area is esoteric and complicated enough that maybe netdev can help, so please bear with me. In greenfield environments jumbo is not really a problem: you can simply set up your local network correctly and have your border router (or another router on the path) reply with ICMP PTB/Fragmentation Needed messages. Of course, it's not hard to overload your border router CPU with ICMP replies (various permutations of different packet types like UDP, or TCP MSS negotiation where both client and server have jumbo, etc). In that case you can add "mtu 1500" to your default route and create a map of new kernel routes to the places you expect jumbo paths to, which avoids overloading the router's limits. Keep in mind that the actual routes may flap upstream, moving your traffic back to a non-jumbo path between any 2 hosts, where you will need PTBs again. Since it's all new software, there's total control over anything you add and it's a straightforward solution.
However, the problem is that we do not have a greenfield environment. We have many systems in mixed networks with varying degrees of legacy software and multitenancy, and each piece of software may want to insert its own routes for its own purposes, or add additional devices on demand like macvlans/bridges/etc. That software likely doesn't know any better, and since we have jumbo set on our public interface and "mtu 1500" set only on our default route, a brand new route (or macvlan/etc) will default to jumbo. That probably doesn't work, and it may even blackhole forever if the path is in the local broadcast domain, as there isn't a router to send a PTB.
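To make the failure mode concrete, here is a deliberately simplified model of the behavior described above: an explicit route MTU metric wins, otherwise the route inherits the interface MTU. The `Route` class, `effective_mtu` function, and the 192.0.2.0/24 prefix are illustrative names and values only, not kernel structures, and the real lookup has more cases (locked metrics, cached PMTU exceptions) that this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    prefix: str
    dev_mtu: int                       # MTU of the output interface
    mtu_metric: Optional[int] = None   # explicit "mtu N" on the route, if any


def effective_mtu(route: Route) -> int:
    """An explicit route metric wins; otherwise the interface MTU is inherited."""
    if route.mtu_metric is not None:
        return route.mtu_metric
    return route.dev_mtu


# Public interface runs jumbo; only the default route is pinned to 1500.
default_route = Route("default", dev_mtu=9000, mtu_metric=1500)
# A route added later by software that knows nothing about jumbo.
legacy_route = Route("192.0.2.0/24", dev_mtu=9000)

print(effective_mtu(default_route))  # 1500
print(effective_mtu(legacy_route))   # 9000
```

The second route silently gets 9000, which is exactly the case that blackholes when the peer or the path cannot carry jumbo frames.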
I suspect that we are not the only ones fighting this issue today, given that /usr/bin/ping added a critical feature for working in this paradigm just last year: support for passing IP_PMTUDISC_PROBE. This lets us bypass the dst pmtu on the kernel's output route lookup (remember we've set "mtu 1500" on our default route above), so we can test which hosts are actually jumbo-reachable before enabling jumbo on that path.
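For reference, the same socket option can be set from any program, not just ping. The sketch below is a minimal, Linux-only illustration using a UDP socket rather than ICMP, which is an assumption made here to avoid needing raw-socket privileges. The constants are the values from <linux/in.h>, hard-coded because not every Python build exports them, and the function names are hypothetical.

```python
import socket
import sys

# Linux values from <linux/in.h>.
IP_MTU_DISCOVER = 10
IP_PMTUDISC_PROBE = 3   # set DF, but ignore the path/route MTU


def open_probe_socket() -> socket.socket:
    """Return a UDP socket that sets DF and ignores the route's MTU."""
    if not sys.platform.startswith("linux"):
        raise OSError("IP_PMTUDISC_PROBE is Linux-specific")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE)
    return sock


def probe_payload_size(link_mtu: int) -> int:
    """Largest UDP payload in one IPv4 datagram: MTU minus IP (20) and UDP (8) headers."""
    return link_mtu - 20 - 8


print(probe_payload_size(9000))  # 8972
print(probe_payload_size(1500))  # 1472
```

A caller would send `probe_payload_size(9000)` bytes from `open_probe_socket()` to a cooperating listener on the candidate host; a reply suggests the path carries jumbo frames, while silence is inconclusive without further checks.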
I've tried to brainstorm various ideas for how we could deal with this. Most seem imperfect, or may result in a kernel patch that is complicated or unacceptably hacky:
1. We could patch all kernel route adds to append "mtu 1500" conditionally if not otherwise specified. Instead of a kernel patch this could possibly be an rtnetlink notification listener or eBPF. This has the downside that any daemon rechecking its installed routes may choke on the additional metric. Also, after prototyping this with rtnetlink, I discovered that we have some very old programs that use SIOCADDRT or /sbin/route, which don't understand this stuff and will make duplicate routes instead of getting -EEXIST, as they blindly try to make sure their route is in the kernel (yeah, ick!). New devices like macvlans and bridges also need additional work, but could be fixed up in a similar fashion.
2. For every dst lookup on output we simply look at a global or per-interface sysctl (e.g. /proc/sys/net/ipv4/conf/eth0/default_mtu) to decide whether we should lie about what the MTU really is, i.e. we always return the sysctl value (1500) unless an MTU metric is attached to that route. This preserves behavior with any legacy programs looking at route tables, since the routes will not look different. Of course, this has the downside of now having to modify all programs or routes where we WANT jumbo, but maybe that has a much smaller surface area. However, we still have a problem for all things that inherit the interface MTU, like macvlan as above. At least this patch would be small, probably.
3. Patch struct net_device {} and subsequently the whole world (net drivers, iproute2, etc) to add "unsigned int max_mtu", which can be set differently than "unsigned int mtu" per net_device on RTM_SETLINK. When doing normal lookups "mtu" would be used, however, if a route or child device explicitly requests a larger mtu than set in struct net_device, that would be checked against max_mtu, which is the value we'd actually change to enable jumbo frames to/from the physical interface when in a mixed MTU environment. In greenfield environments you would set max_mtu=mtu=9000.
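The three ideas differ mainly in where the conservative default is applied. The following is a rough model for comparison only, not kernel code: `Device`, `Route`, `DEFAULT_MTU`, and every function name are hypothetical, and `max_mtu` here means the field proposed in option 3.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MTU = 1500  # conservative default assumed by all three options


@dataclass
class Device:
    mtu: int                       # value used for normal lookups
    max_mtu: Optional[int] = None  # option 3 only: ceiling for explicit requests


@dataclass
class Route:
    dev: Device
    mtu_metric: Optional[int] = None  # explicit "mtu N", if any


def current_lookup(route: Route) -> int:
    """Simplified existing behavior: metric if set, else the device MTU."""
    return route.mtu_metric if route.mtu_metric is not None else route.dev.mtu


def option1_add(route: Route) -> Route:
    """Option 1: rewrite the route at add time when it carries no MTU metric."""
    if route.mtu_metric is None:
        route.mtu_metric = DEFAULT_MTU  # the route now looks different to its owner
    return route


def option2_lookup(route: Route) -> int:
    """Option 2: leave the route untouched and apply the default at lookup time."""
    return route.mtu_metric if route.mtu_metric is not None else DEFAULT_MTU


def option3_add(route: Route) -> Route:
    """Option 3: accept an explicit metric only if it fits under the device max_mtu."""
    ceiling = route.dev.max_mtu if route.dev.max_mtu is not None else route.dev.mtu
    if route.mtu_metric is not None and route.mtu_metric > ceiling:
        raise ValueError("requested MTU exceeds device max_mtu")
    return route


jumbo_dev = Device(mtu=9000)                 # options 1 and 2: interface runs jumbo
split_dev = Device(mtu=1500, max_mtu=9000)   # option 3: mixed MTU environment

print(current_lookup(option1_add(Route(jumbo_dev))))        # 1500
print(option2_lookup(Route(jumbo_dev)))                     # 1500
print(option2_lookup(Route(jumbo_dev, mtu_metric=9000)))    # 9000
print(current_lookup(option3_add(Route(split_dev))))        # 1500
print(current_lookup(option3_add(Route(split_dev, 9000))))  # 9000
```

In this model, option 1 is the only one that changes what a daemon sees when it reads its route back, and option 3 is the only one where devices inheriting `mtu` (such as a macvlan) would get 1500 without extra work.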
Any feedback appreciated.
Thanks,
Debabrata Banerjee