netdev - Re: Delayed source port allocation for connected UDP sockets

Message-ID: <877e3fniep.fsf@cloudflare.com>
Date:   Mon, 02 Dec 2019 11:14:38 +0100
From:   Jakub Sitnicki <jakub@...udflare.com>
To:     netdev@...r.kernel.org
Cc:     kernel-team <kernel-team@...udflare.com>,
        Marek Majkowski <marek@...udflare.com>
Subject: Re: Delayed source port allocation for connected UDP sockets

On Wed, Nov 27, 2019 at 03:07 PM CET, Marek Majkowski wrote:
> In my applications I need something like a connectx()[1] syscall. On
> Linux I can get quite far with using bind-before-connect and
> IP_BIND_ADDRESS_NO_PORT. One corner case is missing though.
>
> For various UDP applications I'm establishing connected sockets from
> specific 2-tuple. This is working fine with bind-before-connect, but
> in UDP it creates a slight race condition. It's possible the socket
> will receive packet from arbitrary source after bind():
>
> s = socket(SOCK_DGRAM)
> s.bind((192.0.2.1, 1703))
> # here be dragons
> s.connect((198.18.0.1, 58910))
>
> For the short amount of time after bind() and before connect(), the
> socket may receive packets from any peer. For situations when I don't
> need to specify source port, IP_BIND_ADDRESS_NO_PORT flag solves the
> issue. This code is fine:
>
> s = socket(SOCK_DGRAM)
> s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s.bind((192.0.2.1, 0))
> s.connect((198.18.0.1, 58910))
>
> But the IP_BIND_ADDRESS_NO_PORT doesn't work when the source port is
> selected. It seems natural to expand the scope of
> IP_BIND_ADDRESS_NO_PORT flag. Perhaps this could be made to work:
>
> s = socket(SOCK_DGRAM)
> s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s.bind((192.0.2.1, 1703))
> s.connect((198.18.0.1, 58910))
>
> I would like such code to delay the binding to port 1703 up until the
> connect(). IP_BIND_ADDRESS_NO_PORT only makes sense for connected
> sockets anyway. This raises a couple of questions though:
>
>  - IP_BIND_ADDRESS_NO_PORT name is confusing - we specify the port
> number in the bind!
>
>  - Where to store the source port in __inet_bind. Neither
> inet->inet_sport nor inet->inet_num seem like correct places to store
> the user-passed source port hint. The alternative is to introduce
> yet-another field onto inet_sock struct, but that is wasteful.

We've been talking with Marek about it some more. I'll summarize for the
sake of keeping the discussion open.

1. inet->inet_sport as storage for port hint

   It seems inet->inet_sport could be used to hold the port passed to
   bind() when we're delaying port allocation with
   IP_BIND_ADDRESS_NO_PORT. As long as local port, inet->inet_num, is
   not set, connect() and sendmsg() will know the socket needs to be
   bound to a port first.

   We didn't do a detailed audit of all access sites to
   inet->inet_sport. Potentially we missed something.
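
To make the bookkeeping concrete, here is a toy user-space model of the idea. It is not kernel code: ToyInetSock, toy_bind(), toy_connect(), toy_get_port() and the used_ports set are all made up for illustration, and byte order, locking and the real hash tables are ignored. It only shows the hint sitting in inet_sport while inet_num stays 0, and connect() noticing that.

```python
import errno


class ToyInetSock:
    """Toy stand-in for the inet_sock fields discussed above (not kernel code)."""

    def __init__(self, bind_address_no_port=False):
        self.bind_address_no_port = bind_address_no_port
        self.inet_num = 0    # local port the socket is bound to; 0 means unbound
        self.inet_sport = 0  # normally mirrors inet_num; here it doubles as the hint
        self.peer = None


def toy_get_port(sk, port, used_ports):
    """Claim `port` (or the first free ephemeral port if 0) for the socket."""
    if port == 0:
        port = next(p for p in range(32768, 61000) if p not in used_ports)
    elif port in used_ports:
        raise OSError(errno.EADDRINUSE, "port already in use")
    used_ports.add(port)
    sk.inet_num = sk.inet_sport = port


def toy_bind(sk, port, used_ports):
    """bind(): with the flag set, only remember the requested port."""
    if sk.bind_address_no_port:
        sk.inet_sport = port  # hint only; inet_num stays 0
    else:
        toy_get_port(sk, port, used_ports)


def toy_connect(sk, peer, used_ports):
    """connect(): finish a delayed bind using the stored hint."""
    sk.peer = peer  # peer first, so the model is never bound-but-unconnected
    if sk.inet_num == 0:
        try:
            toy_get_port(sk, sk.inet_sport, used_ports)
        except OSError:
            sk.peer = None
            raise
```

In this model, toy_bind() with the flag set and port 1703 leaves inet_num at 0, and the port is only claimed once toy_connect() runs.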

2. Backward compatibility

   Changing the existing behavior to delay port allocation when
   IP_BIND_ADDRESS_NO_PORT is set but a port number was passed to
   bind() could break apps that set the sockopt but never connect()
   the socket for some reason.
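
For reference, a small sketch of what such apps see today on Linux. The helper names are made up, everything runs on loopback, and the fallback value 24 for IP_BIND_ADDRESS_NO_PORT is taken from <linux/in.h> in case the Python build does not export the constant. With port 0 the allocation is delayed until connect(); with an explicit port the sockopt has no effect and the unconnected socket receives from anyone.

```python
import socket

# Linux value from <linux/in.h>; some Python builds lack the named constant.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)


def udp_socket_no_port():
    """Create a UDP socket with IP_BIND_ADDRESS_NO_PORT enabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    return s


def pick_free_port():
    """Ask the kernel for a currently free UDP port on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tmp:
        tmp.bind(("127.0.0.1", 0))
        return tmp.getsockname()[1]


def port_before_and_after_connect(peer):
    """bind() to port 0 with the sockopt set; report the local port around connect()."""
    with udp_socket_no_port() as s:
        s.bind(("127.0.0.1", 0))
        before = s.getsockname()[1]
        s.connect(peer)
        after = s.getsockname()[1]
    return before, after


def receive_without_connect(port):
    """bind() to an explicit port with the sockopt set and never connect()."""
    with udp_socket_no_port() as s, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as stranger:
        s.settimeout(2.0)
        s.bind(("127.0.0.1", port))
        bound_port = s.getsockname()[1]
        # Any peer can reach the socket, since it is bound but not connected.
        stranger.sendto(b"hello", ("127.0.0.1", port))
        data, _ = s.recvfrom(64)
    return bound_port, data
```

An app relying on the second pattern would stop receiving anything if the explicit-port case started delaying the bind as well.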

3. Extend the sockopt? Add new one? Introduce connectx() syscall?

   Since IP_BIND_ADDRESS_NO_PORT cannot be reused as is, we need a way
   for the user-space to signal its desire to delay binding to a
   specific port.

   We could imagine an extended version of IP_BIND_ADDRESS_NO_PORT
   sockopt that takes an extra value apart from the int flag.

   Then there's the option of adding a new sockopt dedicated to this
   use case. However, we fear that two sockopts with a similar purpose
   will be confusing for users [0].

   Finally, we could go for the hard-core solution and take a stab at
   adding a connectx() syscall [1]. Were there any attempts or
   discussions about this before? A quick search didn't turn up
   anything, but the name is kind of a nightmare to google for.

   Question to the maintainers - which approach would be most welcome?

4. Why connected UDP sockets?

   We know that it's better to stick to receiving UDP sockets and
   demultiplex the client requests/sessions in user-space. Being hashed
   just by local address & port, connected UDP sockets don't scale well.

   We think there is one useful application, though. Service draining
   during restarts.

   When a service is being restarted, we would like the dying process to
   handle the ongoing L7 sessions until they come to an end. New UDP
   flows should go to a fresh service instance.

   To achieve that, for each ongoing session we would open a connected
   UDP socket. This way socket lookup logic would deliver just the flows
   we care about to the old process.
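
A rough single-process sketch of that setup, assuming a recent Linux kernel and port sharing via SO_REUSEPORT. The names are made up, everything runs on loopback, and one process plays both the old and the new service instance. Note the bound-but-not-connected window is still there between bind() and connect().

```python
import socket


def reuseport_udp(addr):
    """Create a UDP socket bound to addr with SO_REUSEPORT set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.settimeout(2.0)
    s.bind(addr)
    return s


def client_udp():
    """Create a client socket with its own loopback address and port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    return s


def drain_demo():
    # Fresh service instance: an unconnected receiving socket.
    listener = reuseport_udp(("127.0.0.1", 0))
    service_addr = listener.getsockname()

    old_client = client_udp()  # has an ongoing session with the old process
    new_client = client_udp()  # starts a new flow

    # Old process: one connected socket per ongoing session, same local 2-tuple.
    session = reuseport_udp(service_addr)
    # The window discussed in this thread is right here: bound, not yet connected.
    session.connect(old_client.getsockname())

    old_client.sendto(b"ongoing", service_addr)
    new_client.sendto(b"fresh", service_addr)

    got_session = session.recv(64)
    got_listener, _ = listener.recvfrom(64)

    for s in (listener, session, old_client, new_client):
        s.close()
    return got_session, got_listener
```

The connected socket matches the full 4-tuple, so it wins the lookup for the ongoing flow, while the new flow lands on the unconnected socket.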

5. reuseport BPF with SOCKARRAY to the rescue?

   Since we're talking about opening connected UDP sockets that share
   the local port with other receiving UDP sockets (owned by another
   process), we would need to opt for port sharing with REUSEPORT [2].

   If we don't want the connected UDP sockets to receive any traffic
   during the short window of opportunity when the socket is bound but
   not connected, we could exclude it from the reuseport group by
   controlling the socket set with BPF & SOCKARRAY.

Comments and thoughts more than welcome.

-Jakub

[0] Unless we call it IP_BIND_ADDRESS_NO_PORT_FOR_REAL... ;-)
[1] https://www.unix.com/man-page/mojave/2/connectx/
[2] Or REUSEADDR, whose semantics allow it for unicast UDP.