lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87sfg1vuqj.fsf@cloudflare.com>
Date:   Mon, 23 Jan 2023 21:48:06 +0100
From:   Jakub Sitnicki <jakub@...udflare.com>
To:     Leon Romanovsky <leon@...nel.org>
Cc:     netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Kuniyuki Iwashima <kuniyu@...zon.com>,
        Neal Cardwell <ncardwell@...gle.com>, selinux@...r.kernel.org,
        Paul Moore <paul@...l-moore.com>,
        Stephen Smalley <stephen.smalley.work@...il.com>,
        Eric Paris <eparis@...isplace.org>, kernel-team@...udflare.com,
        Marek Majkowski <marek@...udflare.com>
Subject: Re: [PATCH net-next v4 1/2] inet: Add IP_LOCAL_PORT_RANGE socket
 option

On Mon, Jan 23, 2023 at 07:47 PM +02, Leon Romanovsky wrote:
> On Mon, Jan 23, 2023 at 03:44:39PM +0100, Jakub Sitnicki wrote:
>> Users who want to share a single public IP address for outgoing connections
>> between several hosts traditionally reach for SNAT. However, SNAT requires
>> state keeping on the node(s) performing the NAT.
>> 
>> A stateless alternative exists, where a single IP address used for egress
>> can be shared between several hosts by partitioning the available ephemeral
>> port range. In such a setup:
>> 
>> 1. Each host gets assigned a disjoint range of ephemeral ports.
>> 2. Applications open connections from the host-assigned port range.
>> 3. Return traffic gets routed to the host based on both, the destination IP
>>    and the destination port.
>> 
>> An application which wants to open an outgoing connection (connect) from a
>> given port range today can choose between two solutions:
>> 
>> 1. Manually pick the source port by bind()'ing to it before connect()'ing
>>    the socket.
>> 
>>    This approach has a couple of downsides:
>> 
>>    a) Search for a free port has to be implemented in the user-space. If
>>       the chosen 4-tuple happens to be busy, the application needs to retry
>>       from a different local port number.
>> 
>>       Detecting if 4-tuple is busy can be either easy (TCP) or hard
>>       (UDP). In TCP case, the application simply has to check if connect()
>>       returned an error (EADDRNOTAVAIL). That is assuming that the local
>>       port sharing was enabled (REUSEADDR) by all the sockets.
>> 
>>         # Assume desired local port range is 60_000-60_511
>>         s = socket(AF_INET, SOCK_STREAM)
>>         s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>>         s.bind(("192.0.2.1", 60_000))
>>         s.connect(("1.1.1.1", 53))
>>         # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
>>         # Application must retry with another local port
>> 
>>       In case of UDP, the network stack allows binding more than one socket
>>       to the same 4-tuple, when local port sharing is enabled
>>       (REUSEADDR). Hence detecting the conflict is much harder and involves
>>       querying sock_diag and toggling the REUSEADDR flag [1].
>> 
>>    b) For TCP, bind()-ing to a port within the ephemeral port range means
>>       that no connecting sockets, that is those which leave it to the
>>       network stack to find a free local port at connect() time, can use
>>       the this port.
>> 
>>       IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
>>       will be skipped during the free port search at connect() time.
>> 
>> 2. Isolate the app in a dedicated netns and use the use the per-netns
>>    ip_local_port_range sysctl to adjust the ephemeral port range bounds.
>> 
>>    The per-netns setting affects all sockets, so this approach can be used
>>    only if:
>> 
>>    - there is just one egress IP address, or
>>    - the desired egress port range is the same for all egress IP addresses
>>      used by the application.
>> 
>>    For TCP, this approach avoids the downsides of (1). Free port search and
>>    4-tuple conflict detection is done by the network stack:
>> 
>>      system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
>> 
>>      s = socket(AF_INET, SOCK_STREAM)
>>      s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
>>      s.bind(("192.0.2.1", 0))
>>      s.connect(("1.1.1.1", 53))
>>      # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
>> 
>>   For UDP this approach has limited applicability. Setting the
>>   IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
>>   port being shared with other connected UDP sockets.
>> 
>>   Hence relying on the network stack to find a free source port, limits the
>>   number of outgoing UDP flows from a single IP address down to the number
>>   of available ephemeral ports.
>> 
>> To put it another way, partitioning the ephemeral port range between hosts
>> using the existing Linux networking API is cumbersome.
>> 
>> To address this use case, add a new socket option at the SOL_IP level,
>> named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
>> ephemeral port range for each socket individually.
>> 
>> The option can be used only to narrow down the per-netns local port
>> range. If the per-socket range lies outside of the per-netns range, the
>> latter takes precedence.
>> 
>> UAPI-wise, the low and high range bounds are passed to the kernel as a pair
>> of u16 values in host byte order packed into a u32. This avoids pointer
>> passing.
>> 
>>   PORT_LO = 40_000
>>   PORT_HI = 40_511
>> 
>>   s = socket(AF_INET, SOCK_STREAM)
>>   v = struct.pack("I", PORT_HI << 16 | PORT_LO)
>>   s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
>>   s.bind(("127.0.0.1", 0))
>>   s.getsockname()
>>   # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
>>   # if there is a free port. EADDRINUSE otherwise.
>> 
>> [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
>> 
>> v3 -> v4:
>>  * Clarify that u16 values are in host byte order (Neal)
>> 
>> v2 -> v3:
>>  * Make SCTP bind()/bind_add() respect IP_LOCAL_PORT_RANGE option (Eric)
>> 
>> v1 -> v2:
>>  * Fix the corner case when the per-socket range doesn't overlap with the
>>    per-netns range. Fallback correctly to the per-netns range. (Kuniyuki)
>
> Please put changelog after "---" trailer, so it will be stripped while
> applying patch.

I've put the changelog above the "---" on purpose. AFAIK, it is (was?)
preferred by netdev maintainers to keep the changelog in the
description.

Do you know if this convention is now a thing of the past? I might have
missed something.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ