Message-ID: <679a367198f13_132e0829467@willemb.c.googlers.com.notmuch>
Date: Wed, 29 Jan 2025 09:08:49 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Yan Zhai <yan@...udflare.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Shuah Khan <shuah@...nel.org>,
Josh Hunt <johunt@...mai.com>,
Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org,
kernel-team <kernel-team@...udflare.com>
Subject: Re: [PATCH] udp: gso: fix MTU check for small packets
Yan Zhai wrote:
> On Tue, Jan 28, 2025 at 8:45 AM Willem de Bruijn
> <willemdebruijn.kernel@...il.com> wrote:
> >
> > Yan Zhai wrote:
> > > Hi Willem,
> > >
> > > Thanks for getting back to me.
> > >
> > > On Mon, Jan 27, 2025 at 8:33 AM Willem de Bruijn
> > > <willemdebruijn.kernel@...il.com> wrote:
> > > >
> > > > Yan Zhai wrote:
> > > > > Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO
> > > > > for small packets. But the kernel currently dismisses GSO requests only
> > > > > after checking MTU on gso_size. This means any packets, regardless of
> > > > > their payload sizes, would be dropped when MTU is smaller than requested
> > > > > gso_size.
> > > >
> > > > Is this a realistic concern? How did you encounter this in practice.
> > > >
> > > > It *is* a misconfiguration to configure a gso_size larger than MTU.
> > > >
> > > > > Meanwhile, EINVAL would be returned in this case, making it
> > > > > very misleading to debug.
> > > >
> > > > Misleading is subjective. I'm not sure what is misleading here. From
> > > > my above comment, I believe this is correctly EINVAL.
> > > >
> > > > That said, if this impacts a real workload we could reconsider
> > > > relaxing the check. I.e., allowing through packets even when an
> > > > application has clearly misconfigured UDP_SEGMENT.
> > > >
> > > We did encounter a painful reliability issue in production last month.
> > >
> > > To simplify the scenario, we had these symptoms when the issue occurred:
> > > 1. QUIC connections to host A started to fail, and cannot establish new ones
> > > 2. User space Wireguard to the exact same host worked 100% fine
> > >
> > > This happened rarely, like once or twice a day, lasting for a few
> > > minutes usually, but it was quite visible since it is an office
> > > network.
> > >
> > > Initially this prompted something wrong at the protocol layer. But
> > > after multiple rounds of digging, we finally figured the root cause
> > > was:
> > > 3. Something sometimes pings host B, which shares the same IP with
> > > host A but different ports (thanks to limited IPv4 space), and its
> > > PMTU was reduced to 1280 occasionally. This unexpectedly affected all
> > > traffic to that IP including traffic toward host A. Our QUIC client
> > > set gso_size to 1350, and that's why it got hit.
> > >
> > > I agree that configurations do matter a lot here. Given how broken the
> > > PMTU was for the Internet, we might just turn off pmtudisc option on
> > > our end to avoid this failure path. But for those who haven't yet, this
> > > could still be confusing if it ever happens, because nothing seems to
> > > point to PMTU in the first place:
> > > * small packets also get dropped
> > > * error code was EINVAL from sendmsg
> > >
> > > That said, I probably should have used PMTU in my commit message to be
> > > more clear for our problem. But meanwhile I am also concerned about
> > > newly added tunnels to trigger the same issue, even if it has a static
> > > device MTU. My proposal should make the error reason more clear:
> > > EMSGSIZE itself is a direct signal pointing to MTU/PMTU. Larger
> > > packets getting dropped would have a similar effect.
> >
> > Thanks for that context. Makes sense that this is a real issue.
> >
> > One issue is that with segmentation, the initial mtu checks are
> > skipped, so they have to be enforced later. In __ip_append_data:
> >
> > mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
> >
> You are right, if packet sizes are between (PMTU, gso_size), then they
> should still be dropped. But instead of checking explicitly in
> udp_send_skb, maybe we can leave them to be dropped in
> ip_finish_output?

Not sure how to do this, or whether it would be simpler than having all
the UDP GSO checks in udp_send_skb.

From a "don't add cost to the hot path" point of view, it's actually
best to keep all these checks in one place, reached only when
UDP_SEGMENT is in use (the hot path being the common case without GSO).

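
To make the "one place" point concrete, here is a rough user-space
model of such a check. It is a sketch only, not the kernel code:
model_udp_gso_check() and its parameters are hypothetical names, and
the min(datalen, gso_size) comparison is an assumption about how the
check could cover small packets as well as ones that get segmented.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of a consolidated UDP GSO size check.
 * Not kernel code: the name and parameters are illustrative assumptions.
 *
 * hlen:     network + UDP header length in bytes
 * datalen:  UDP payload length of this send
 * gso_size: segment size requested via UDP_SEGMENT, 0 if GSO is unused
 * pmtu:     path MTU currently in effect for the destination
 */
static int model_udp_gso_check(size_t hlen, size_t datalen,
                               size_t gso_size, size_t pmtu)
{
        size_t seg;

        /* Sends without GSO take no extra check in this model. */
        if (gso_size == 0)
                return 0;

        /*
         * The largest packet that reaches the wire carries at most
         * min(datalen, gso_size) bytes of payload.
         */
        seg = datalen < gso_size ? datalen : gso_size;

        if (hlen + seg > pmtu)
                return -EMSGSIZE;

        return 0;
}
```

With hlen 28, gso_size 1350 and a PMTU of 1280 (the numbers from the
report), a 100 byte payload passes in this model, while a 1300 byte
payload fails with -EMSGSIZE.
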
> This way there is no need to add an extra branch for
> non GSO code paths. PMTU shrinking should be rare, so the overhead
> should be minimal.
>
> > Also, might this make the debugging actually harder, as the
> > error condition is now triggered intermittently.
> Yes sendmsg may only return errors for a portion of packets now under
> the same situation. But IMHO it's not trading debugging for
> reliability. Consistent error is good news for engineers to reproduce
> locally, but in production I find people (SREs, solution and
> escalation engineers) rely on pcaps and errno a lot. The pattern in
> pcaps (lack of large packets of certain sizes, since they are dropped
> before dev_queue_xmit), and exact error reasons like EMSGSIZE are both
> good indicators for root causes. EINVAL is more generic on the other
> hand. For example, I remembered we had another issue on UDP sendmsg,
> which also returned a bunch of EINVAL. But that was due to some
> attacker tricking us to reply with source port 0.

Relying on error codes is fraught anyway. For online analysis (which
I think can be assumed when pcap is mentioned), function tracing and
bpf trace are much more powerful.

That said, no objections to returning EMSGSIZE instead of EINVAL. That
is the same error UDP returns when sending a single datagram that
exceeds MTU, after all.
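
For what it is worth, a distinct EMSGSIZE is also something an
application could act on. The snippet below is only a sketch of one
possible reaction: next_gso_size() is a hypothetical helper, and it
assumes the caller already knows the current path MTU (for example
from getsockopt() with IP_MTU on a connected socket) and the header
length.

```c
#include <errno.h>

/*
 * Hypothetical application-side helper, not part of any kernel or
 * libc API. Given the errno from a failed send with UDP_SEGMENT,
 * pick the segment size to use for the next attempt.
 *
 * err:          errno value from the failed send (0 if none)
 * cur_gso_size: segment size currently configured
 * path_mtu:     path MTU as known to the caller (assumed input)
 * hlen:         network + UDP header length in bytes
 */
static int next_gso_size(int err, int cur_gso_size, int path_mtu, int hlen)
{
        int max_payload = path_mtu - hlen;

        /* Only EMSGSIZE points at MTU/PMTU; leave the size alone otherwise. */
        if (err != EMSGSIZE)
                return cur_gso_size;

        /* Nonsensical inputs: do not guess, keep the current size. */
        if (max_payload <= 0)
                return cur_gso_size;

        /* Clamp the segment size so header + segment fits in the path MTU. */
        return cur_gso_size < max_payload ? cur_gso_size : max_payload;
}
```

With EINVAL there is no comparably specific hint, so this helper leaves
the segment size unchanged in that case.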