linux-kernel - Re: [PATCH] udp: gso: fix MTU check for small packets

Message-ID: <CAO3-Pbpxt9K=mT3ozFqMHAQcy0B30snxq9Kg9xvP7pmzmXP5=w@mail.gmail.com>
Date: Wed, 29 Jan 2025 10:48:58 -0600
From: Yan Zhai <yan@...udflare.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>, 
	David Ahern <dsahern@...nel.org>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Shuah Khan <shuah@...nel.org>, Josh Hunt <johunt@...mai.com>, 
	Alexander Duyck <alexander.h.duyck@...ux.intel.com>, linux-kernel@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, kernel-team <kernel-team@...udflare.com>
Subject: Re: [PATCH] udp: gso: fix MTU check for small packets

On Wed, Jan 29, 2025 at 8:08 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Yan Zhai wrote:
> > On Tue, Jan 28, 2025 at 8:45 AM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> > >
> > > Yan Zhai wrote:
> > > > Hi Willem,
> > > >
> > > > Thanks for getting back to me.
> > > >
> > > > On Mon, Jan 27, 2025 at 8:33 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@...il.com> wrote:
> > > > >
> > > > > Yan Zhai wrote:
> > > > > > Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO
> > > > > > for small packets. But the kernel currently dismisses GSO requests only
> > > > > > after checking MTU on gso_size. This means any packets, regardless of
> > > > > > their payload sizes, would be dropped when MTU is smaller than requested
> > > > > > gso_size.
> > > > >
> > > > > Is this a realistic concern? How did you encounter this in practice?
> > > > >
> > > > > It *is* a misconfiguration to configure a gso_size larger than MTU.
> > > > >
> > > > > > Meanwhile, EINVAL would be returned in this case, making it
> > > > > > very misleading to debug.
> > > > >
> > > > > Misleading is subjective. I'm not sure what is misleading here. From
> > > > > my above comment, I believe this is correctly EINVAL.
> > > > >
> > > > > That said, if this impacts a real workload we could reconsider
> > > > > relaxing the check. I.e., allowing through packets even when an
> > > > > application has clearly misconfigured UDP_SEGMENT.
> > > > >
> > > > We did encounter a painful reliability issue in production last month.
> > > >
> > > > To simplify the scenario, we had these symptoms when the issue occurred:
> > > > 1. QUIC connections to host A started to fail, and new ones could not be established
> > > > 2. User space Wireguard to the exact same host worked 100% fine
> > > >
> > > > This happened rarely, like once or twice a day, lasting for a few
> > > > minutes usually, but it was quite visible since it is an office
> > > > network.
> > > >
> > > > Initially this suggested something was wrong at the protocol layer. But
> > > > after multiple rounds of digging, we finally figured the root cause
> > > > was:
> > > > 3. Something sometimes pings host B, which shares the same IP with
> > > > host A but different ports (thanks to limited IPv4 space), and its
> > > > PMTU was reduced to 1280 occasionally. This unexpectedly affected all
> > > > traffic to that IP including traffic toward host A. Our QUIC client
> > > > set gso_size to 1350, and that's why it got hit.
> > > >
> > > > I agree that configurations do matter a lot here. Given how broken the
> > > > PMTU was for the Internet, we might just turn off pmtudisc option on
> > > > our end to avoid this failure path. But for those who haven't yet, this
> > > > could still be confusing if it ever happens, because nothing seems to
> > > > point to PMTU in the first place:
> > > > * small packets also get dropped
> > > > * error code was EINVAL from sendmsg
> > > >
> > > > That said, I probably should have used PMTU in my commit message to be
> > > > more clear for our problem. But meanwhile I am also concerned about
> > > > newly added tunnels triggering the same issue, even with a static
> > > > device MTU. My proposal should make the error reason more clear:
> > > > EMSGSIZE itself is a direct signal pointing to MTU/PMTU. Larger
> > > > packets getting dropped would have a similar effect.
> > >
> > > Thanks for that context. Makes sense that this is a real issue.
> > >
> > > One issue is that with segmentation, the initial mtu checks are
> > > skipped, so they have to be enforced later. In __ip_append_data:
> > >
> > >     mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
> > >
> > You are right, if packet sizes are between (PMTU, gso_size), then they
> > should still be dropped. But instead of checking explicitly in
> > udp_send_skb, maybe we can leave them to be dropped in
> > ip_finish_output?
>
> Not sure how to do this, or whether it will be simpler than having all
> the UDP GSO checks in udp_send_skb.
>
> From a "don't add cost to the hot path" point of view, it's actually
> best to keep all these checks in one place only when UDP_SEGMENT is
> negotiated (where the hot path is the common case without GSO).
>
I mean ip_finish_output is already dropping packets with length larger
than dst MTU. But I guess it doesn't hurt to check it also in the GSO
branch. Let me send a V2 later to address it.
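
For illustration only, a minimal user-space sketch of the kind of check
being discussed for the GSO branch. It is not the kernel code and not
the V2 patch: the function name gso_mtu_check and its parameters are
hypothetical, and it assumes the check compares the smaller of the
payload length and gso_size (plus headers) against the current MTU,
returning EMSGSIZE on failure.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-alone model of a UDP GSO size check.
 *   hlen     - network header + UDP header length in bytes
 *   datalen  - UDP payload length of this send
 *   gso_size - requested segment size (0 means GSO was not requested)
 *   mtu      - current device or path MTU
 * Returns 0 if the send may proceed, or -EMSGSIZE if the largest
 * resulting packet would not fit in the MTU.
 */
static int gso_mtu_check(size_t hlen, size_t datalen, size_t gso_size,
                         size_t mtu)
{
    size_t seg;

    if (gso_size == 0)
        return 0; /* non-GSO sends are outside the scope of this sketch */

    /* A payload shorter than gso_size goes out as one small packet. */
    seg = datalen < gso_size ? datalen : gso_size;

    if (hlen + seg > mtu)
        return -EMSGSIZE; /* points directly at MTU/PMTU */

    return 0;
}
```

With the numbers from this thread (gso_size 1350, PMTU 1280, and an
assumed 28 bytes of IPv4 + UDP headers), a 100-byte payload passes,
while a 1300-byte payload, which falls between PMTU and gso_size, is
still rejected.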

> > This way there is no need to add an extra branch for
> > non GSO code paths. PMTU shrinking should be rare, so the overhead
> > should be minimal.
> >
> > > Also, might this make the debugging actually harder, as the
> > > error condition is now triggered intermittently.
> > Yes sendmsg may only return errors for a portion of packets now under
> > the same situation. But IMHO it's not trading debugging for
> > reliability. Consistent error is good news for engineers to reproduce
> > locally, but in production I find people (SREs, solution and
> > escalation engineers) rely on pcaps and errno a lot. The pattern in
> > pcaps (lack of large packets of certain sizes, since they are dropped
> > before dev_queue_xmit), and exact error reasons like EMSGSIZE are both
> > good indicators for root causes. EINVAL is more generic on the other
> > hand. For example, I remember we had another issue on UDP sendmsg,
> > which also returned a bunch of EINVAL. But that was due to some
> > attacker tricking us into replying with source port 0.
>
> Relying on error code is fraught anyway. For online analysis (which
> I think can be assumed when pcap is mentioned), function tracing and
> bpf trace are much more powerful.
>
Totally agree tracing is more powerful. Time and again we see issues
that lingered for a few months get addressed in a few days or even hours
when tracing is plugged in. Unfortunately at least for us, the number
of people who can trace properly is far behind the volume of problems.
I can only hope in the future more people will recognize this as a
golden skill, in addition to current standard skills like pcap
analysis.

Yan

> That said, no objections to returning EMSGSIZE instead of EINVAL. That
> is the same error UDP returns when sending a single datagram that exceeds
> MTU, after all.
>