Message-ID: <Z1JeePBN5f1YCmYd@zatzit>
Date: Fri, 6 Dec 2024 13:16:24 +1100
From: David Gibson <david@...son.dropbear.id.au>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Stefano Brivio <sbrivio@...hat.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
netdev@...r.kernel.org, Kuniyuki Iwashima <kuniyu@...zon.com>,
Mike Manning <mvrmanning@...il.com>,
Paul Holzinger <pholzing@...hat.com>,
Philo Lu <lulie@...ux.alibaba.com>,
Cambda Zhu <cambda@...ux.alibaba.com>,
Fred Chen <fred.cc@...baba-inc.com>,
Yubing Qiu <yubing.qiuyubing@...baba-inc.com>
Subject: Re: [PATCH net-next 2/2] datagram, udp: Set local address and rehash
socket atomically against lookup
On Thu, Dec 05, 2024 at 11:52:38PM +0100, Eric Dumazet wrote:
> On Thu, Dec 5, 2024 at 11:32 PM David Gibson
> <david@...son.dropbear.id.au> wrote:
> >
> > On Thu, Dec 05, 2024 at 05:35:52PM +0100, Eric Dumazet wrote:
> > > On Wed, Dec 4, 2024 at 11:12 PM Stefano Brivio <sbrivio@...hat.com> wrote:
> > [snip]
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 6a01905d379f..8490408f6009 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -639,18 +639,21 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
> > > > int sdif, struct udp_table *udptable, struct sk_buff *skb)
> > > > {
> > > > unsigned short hnum = ntohs(dport);
> > > > - struct udp_hslot *hslot2;
> > > > + struct udp_hslot *hslot, *hslot2;
> > > > struct sock *result, *sk;
> > > > unsigned int hash2;
> > > >
> > > > + hslot = udp_hashslot(udptable, net, hnum);
> > > > + spin_lock_bh(&hslot->lock);
> > >
> > > This is not acceptable.
> > > UDP is best effort, packets can be dropped.
> > > Please fix user application expectations.
> >
> > The packets aren't merely dropped, they're rejected with an ICMP Port
> > Unreachable.
>
> We made UDP stack scalable with RCU, it took years of work.
>
> And this patch is bringing back the UDP stack to horrible performance
> from more than a decade ago.
> Everybody will go back to DPDK.
It's reasonable to be concerned about the performance impact. But
this seems like premature hyperbole given no-one has numbers yet, or
has even suggested a specific benchmark to reveal the impact.
> I am pretty certain this can be solved without using a spinlock in the
> fast path.
Quite possibly. But Stefano has tried, and it certainly wasn't
trivial.
> Think about UDP DNS/QUIC servers, using SO_REUSEPORT and receiving
> 10,000,000 packets per second....
>
> Changing source address on an UDP socket is highly unusual, we are not
> going to slow down UDP for this case.
Changing in a general way is very rare, but one specific case is not.
Every time you connect() a socket that wasn't previously bound to a
specific address you get an implicit source address change from
0.0.0.0 or :: to something that depends on the routing table.
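
To make that concrete, here is a minimal userspace sketch of the implicit
change. It assumes Linux with the loopback interface up; the helper names
(local_addr, implicit_source_change) and the peer 127.0.0.1:9 are made up for
illustration and are not taken from the patch or from any reproducer.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fetch the socket's current local IPv4 address; returns 0 on success. */
static int local_addr(int fd, struct in_addr *addr)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);

	if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	*addr = sa.sin_addr;
	return 0;
}

/*
 * Hypothetical helper: bind a UDP socket to the wildcard address, then
 * connect() it to 127.0.0.1:9. Records the local address before and
 * after the connect(). Returns 0 on success, -1 on any failure.
 */
int implicit_source_change(struct in_addr *before, struct in_addr *after)
{
	struct sockaddr_in any, peer;
	int fd, ret = -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&any, 0, sizeof(any));
	any.sin_family = AF_INET;
	any.sin_addr.s_addr = htonl(INADDR_ANY);
	any.sin_port = 0;			/* let the kernel pick a port */

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	peer.sin_port = htons(9);		/* arbitrary; nothing is sent */

	if (bind(fd, (struct sockaddr *)&any, sizeof(any)) == 0 &&
	    local_addr(fd, before) == 0 &&
	    connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    local_addr(fd, after) == 0)
		ret = 0;

	close(fd);
	return ret;
}
```

No packet is sent here: connect() on a UDP socket only sets the peer and picks
a source address from the routing table, and that is the point at which the
socket gets rehashed.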
> Application could instead open another socket, and would probably work
> on old linux versions.
Possibly there's a procedure that would work here, but it's not at all
obvious:
* Clearly, you can't close the non-connected socket before opening
the connected one - that just introduces a new much wider race. It
doesn't even get rid of the existing one, because unless you can
independently predict what the correct bound address will be
for a given peer address, the second socket will still have an
address change when you connect().
* So, you must create the connected socket before closing the
unconnected one, meaning you have to use SO_REUSEADDR or
SO_REUSEPORT whether or not you otherwise wanted to.
* While both sockets are open, you need to handle the possibility
that packets could be delivered to either one. Doable, but a pain
in the arse.
* How do you know when the transition is completed and you can close
the unconnected socket? The fact that the rehashing has completed
and all the necessary memory barriers passed isn't something
userspace can directly discern.
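
For reference, the overlapping-socket step might look roughly like the sketch
below. This is only an illustration under assumptions: Linux behaviour where
two UDP sockets can share a port if both set SO_REUSEADDR, IPv4 only, and
made-up helper names (bound_udp_socket, local_port, open_overlapping_pair).
It covers only the second bullet; it does nothing about packets arriving on
either socket, and it does not answer when the unconnected socket can safely
be closed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: UDP socket with SO_REUSEADDR set, bound to
 * 0.0.0.0:port (host byte order, 0 = kernel picks). Returns fd or -1.
 */
static int bound_udp_socket(unsigned short port)
{
	struct sockaddr_in sa;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_ANY);
	sa.sin_port = htons(port);

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
	    bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Local port of fd in host byte order, or -1 on error. */
int local_port(int fd)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);

	if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	return ntohs(sa.sin_port);
}

/*
 * Hypothetical helper: open an unconnected socket, then a second socket
 * on the same port and connect() it to peer while the first is still
 * open. On success returns 0 with both fds filled in; the caller still
 * has to decide when closing *unconn is safe.
 */
int open_overlapping_pair(const struct sockaddr_in *peer,
			  int *unconn, int *conn)
{
	int port;

	*conn = -1;
	*unconn = bound_udp_socket(0);
	if (*unconn < 0)
		return -1;

	port = local_port(*unconn);
	if (port >= 0)
		*conn = bound_udp_socket((unsigned short)port);

	if (*conn < 0 ||
	    connect(*conn, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
		if (*conn >= 0)
			close(*conn);
		close(*unconn);
		*unconn = *conn = -1;
		return -1;
	}
	return 0;
}
```

Even in this form, the connected socket still starts out bound to the wildcard
address and changes its address at connect(), which is the problem described
in the first bullet.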
> If the regression was recent, this would be considered as a normal regression,
> but apparently nobody noticed for 10 years. This should be saying something...
It does. But so does the fact that it can be trivially reproduced.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson