Message-ID: <CANn89iLm_hRW3+MHsP8p5aTUStohz0nvWbKTGZU6K3EdRadrYw@mail.gmail.com>
Date: Tue, 8 Jul 2025 04:38:39 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...gle.com>, kernel-team@...udflare.com,
Lee Valentine <lvalentine@...udflare.com>
Subject: Re: [PATCH net-next v2 1/2] tcp: Consider every port when connecting
with IP_LOCAL_PORT_RANGE
On Thu, Jul 3, 2025 at 8:59 AM Jakub Sitnicki <jakub@...udflare.com> wrote:
>
> Situation
> ---------
>
> We observe the following scenario in production:
>
>                                          inet_bind_bucket
>                                          state for port 54321
>                                          --------------------
>
>                                          (bucket doesn't exist)
>
> // Process A opens a long-lived connection:
> s1 = socket(AF_INET, SOCK_STREAM)
> s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
> s1.bind(192.0.2.10, 0)
> s1.connect(198.51.100.1, 443)            tb->fastreuse = -1
>                                          tb->fastreuseport = -1
> s1.getsockname() -> 192.0.2.10:54321
> s1.send()
> s1.recv()
> // ... s1 stays open.
>
> // Process B opens a short-lived connection:
> s2 = socket(AF_INET, SOCK_STREAM)
> s2.setsockopt(SO_REUSEADDR)
> s2.bind(192.0.2.20, 0)                   tb->fastreuse = 0
>                                          tb->fastreuseport = 0
> s2.connect(198.51.100.2, 53)
> s2.getsockname() -> 192.0.2.20:54321
> s2.send()
> s2.recv()
> s2.close()
>
>                                          // bucket remains in this
>                                          // state even though port
>                                          // was released by s2
>                                          tb->fastreuse = 0
>                                          tb->fastreuseport = 0
>
> // Process A attempts to open another connection
> // when there is connection pressure from
> // 192.0.2.30:54000..54500 to 198.51.100.1:443.
> // Assume only port 54321 is still available.
>
> s3 = socket(AF_INET, SOCK_STREAM)
> s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
> s3.bind(192.0.2.30, 0)
> s3.connect(198.51.100.1, 443) -> EADDRNOTAVAIL (99)
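[Editor's sketch: the bucket lifecycle above condensed into a toy state machine. This is plain Python for intuition only, not kernel code; it reduces tb->fastreuse to the states that appear in this scenario.]

```python
# Toy model of the inet_bind_bucket state in the scenario above.
# -1   : port only ever picked at connect() time, still reusable
# >= 0 : someone explicitly bound the port; connect() may no longer
#        pick it. A deliberate simplification of tb->fastreuse.

class BindBucket:
    def __init__(self):
        self.fastreuse = None   # bucket does not exist yet
        self.owners = set()

    def connect_autobind(self, sock):
        # connect() with an ephemeral port keeps (or creates) the
        # bucket in the reusable-for-connect state, unless an
        # explicit bind() already moved it out of that state.
        if self.fastreuse is not None and self.fastreuse >= 0:
            raise OSError("EADDRNOTAVAIL")  # what s3 hits
        self.fastreuse = -1
        self.owners.add(sock)

    def explicit_bind(self, sock):
        # A plain bind() to the port flips the bucket out of -1 ...
        self.fastreuse = 0
        self.owners.add(sock)

    def release(self, sock):
        # ... and close() never flips it back, as long as another
        # owner (like s1) keeps the bucket alive.
        self.owners.discard(sock)


tb = BindBucket()
tb.connect_autobind("s1")      # long-lived connection
assert tb.fastreuse == -1
tb.explicit_bind("s2")         # short-lived, plain bind()
assert tb.fastreuse == 0
tb.release("s2")               # port released, state unchanged
assert tb.fastreuse == 0
try:
    tb.connect_autobind("s3")  # Process A tries again
except OSError as e:
    print(e)                   # EADDRNOTAVAIL
```

The model makes the same point as the Problem section: release() never restores the -1 state.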
>
> Problem
> -------
>
> We end up in a state where Process A can't reuse ephemeral port 54321 for
> as long as there are sockets, like s1, that keep the bind bucket alive. The
> bucket does not return to "reusable" state even when all sockets which
> blocked it from reuse, like s2, are gone.
>
> The ephemeral port becomes available for use again only after all sockets
> bound to it are gone and the bind bucket is destroyed.
>
> Programs which behave like Process B in this scenario - that is, binding to
> an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
> poorly written. However, the reality is that such implementations are not
> actually uncommon. Trying to fix each and every such program is like
> playing whack-a-mole.
>
> For instance, it could be any software using Golang's net.Dialer with
> LocalAddr provided:
>
>   dialer := &net.Dialer{
>           LocalAddr: &net.TCPAddr{IP: srcIP},
>   }
> conn, err := dialer.Dial("tcp4", dialTarget)
>
> Or even a ubiquitous tool like dig when using a specific local address:
>
> $ dig -b 127.1.1.1 +tcp +short example.com
>
> Hence, we are proposing a systematic fix in the network stack itself.
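[Editor's sketch: for reference, the well-behaved variant of the Process B pattern sets IP_BIND_ADDRESS_NO_PORT before bind(). A minimal Linux-only demonstration on loopback; the constant value 24 is a fallback in case the Python build does not export it.]

```python
# With IP_BIND_ADDRESS_NO_PORT, bind() records only the source IP;
# the port is chosen at connect() time, so no bind bucket is
# "poisoned" for other users of the port.
import socket

IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

# A throwaway listener to connect to.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
cli.bind(("127.0.0.1", 0))
# No port has been assigned yet at bind() time:
assert cli.getsockname()[1] == 0
cli.connect(srv.getsockname())
# The ephemeral port is picked during connect():
assert cli.getsockname()[1] != 0
cli.close()
srv.close()
```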
>
> Solution
> --------
>
> If there is no IP address conflict with any socket bound to a given local
> port, then from the protocol's perspective, the port can be safely shared.
>
> With that in mind, modify the port search during connect(), that is
> __inet_hash_connect, to consider all bind buckets (ports) when looking for
> a local port for egress.
>
> To achieve this, add an extra walk over bhash2 buckets for the port to
> check for IP conflicts. The additional walk is not free, so perform it only
> once per port - during the second phase of conflict checking, when the
> bhash bucket is locked.
>
> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
> option is set. The rationale is that users are likely to care about using
> every possible local port only when they have deliberately constrained the
> ephemeral port range.
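[Editor's sketch: in userspace terms, the predicate that the extra bhash2 walk evaluates looks roughly like this. IPv4 only and simplified; the real check_bound() in the patch also covers IPv6 and v4-mapped addresses.]

```python
# A port can be shared for egress iff no socket bound to it uses
# the wildcard address or the same source address.

INADDR_ANY = "0.0.0.0"

def conflicts(src_addr, bound_addrs):
    """True if src_addr conflicts with an address bound to the port."""
    return any(a == INADDR_ANY or a == src_addr for a in bound_addrs)

# Other sockets bound to different local addresses: no conflict,
# so connect() may still pick this port for 192.0.2.10.
assert not conflicts("192.0.2.10", ["192.0.2.20", "192.0.2.30"])
# Same source address already bound: conflict.
assert conflicts("192.0.2.10", ["192.0.2.10"])
# A wildcard bind blocks every source address.
assert conflicts("192.0.2.10", [INADDR_ANY])
```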
>
> Limitations
> -----------
>
> Sharing both the local IP and port with other established sockets, even when
> the remote address is unique, is still not possible if the bucket is in a
> non-reusable state (tb->{fastreuse,fastreuseport} >= 0) because of a socket
> explicitly bound to that port.
>
> Alternatives
> ------------
>
> * Update bind bucket state on port release
>
> A valid solution to the described problem would also be to walk the bind
> bucket owners when releasing the port and recalculate the
> tb->{fastreuse,fastreuseport} state.
>
> However, in comparison to the proposed solution, this alone would not allow
> sharing the local port with other sockets bound to non-conflicting IPs for
> as long as they exist.
>
> Another downside is that we'd pay the extra cost on each unbind (more
> frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
> set (less frequent). Due to that we would also likely need to guard it
> behind a sysctl (see below).
>
> * Run your service in a dedicated netns
>
> This would also solve the problem. While we don't rule out transitioning to
> this setup in the future, it comes at the cost of shifting the complexity
> elsewhere.
>
> Isolating your service in a netns requires assigning it dedicated IPs for
> egress. If the egress IPs must be shared with other processes, as in our
> case, then SNAT and connection tracking on egress are required - adding
> complexity.
>
> * Guard it behind a sysctl setting instead of a socket option
>
> Users are unlikely to encounter this problem unless their workload connects
> from a severely narrowed-down ephemeral port range. Hence, paying the bind
> bucket walk cost for each connect() call doesn't seem sensible. Whereas
> with a socket option, only a limited set of connections incur the
> performance overhead.
>
> Reported-by: Lee Valentine <lvalentine@...udflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
> ---
> To: Eric Dumazet <edumazet@...gle.com>
> To: Paolo Abeni <pabeni@...hat.com>
> To: David S. Miller <davem@...emloft.net>
> To: Jakub Kicinski <kuba@...nel.org>
> To: Neal Cardwell <ncardwell@...gle.com>
> To: Kuniyuki Iwashima <kuniyu@...gle.com>
> Cc: netdev@...r.kernel.org
> Cc: kernel-team@...udflare.com
> ---
> Changes in v2:
> - Fix unused var warning when CONFIG_IPV6=n
> - Convert INADDR_ANY to network byte order before comparison
> - Link to v1: https://lore.kernel.org/r/20250626120247.1255223-1-jakub@cloudflare.com
> ---
> net/ipv4/inet_hashtables.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ceeeec9b7290..b4c4caf3ff6c 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1005,6 +1005,43 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
> #define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
> static u32 *table_perturb;
>
> +/* True on source address conflict with another socket. False otherwise.
> + * Caller must hold hashbucket lock for this tb.
> + */
> +static inline bool check_bound(const struct sock *sk,
> + const struct inet_bind_bucket *tb)
> +{
> + const struct inet_bind2_bucket *tb2;
> +
> + hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
> +#if IS_ENABLED(CONFIG_IPV6)
> + const struct sock *sk2;
> +
> + if (sk->sk_family == AF_INET6) {
> + if (tb2->addr_type == IPV6_ADDR_ANY ||
> + ipv6_addr_equal(&tb2->v6_rcv_saddr,
> + &sk->sk_v6_rcv_saddr))
> + return true;
> + continue;
> + }
> +
> + /* Check for ipv6 non-v6only wildcard sockets */
> + if (tb2->addr_type == IPV6_ADDR_ANY)
> + sk_for_each_bound(sk2, &tb2->owners)
> + if (!sk2->sk_ipv6only)
> + return true;
> +
> + if (tb2->addr_type != IPV6_ADDR_MAPPED)
> + continue;
> +#endif
> + if (tb2->rcv_saddr == htonl(INADDR_ANY) ||
> + tb2->rcv_saddr == sk->sk_rcv_saddr)
> + return true;
> + }
> +
> + return false;
> +}
> +
> int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> struct sock *sk, u64 port_offset,
> u32 hash_port0,
> @@ -1070,6 +1107,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> if (!inet_bind_bucket_match(tb, net, port, l3mdev))
> continue;
> if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
> + if (unlikely(local_ports))
> + break; /* optimistic assumption */
I find this quite pessimistic :/
It seems you had some internal code before my recent change (86c2bc293b8130
"tcp: use RCU lookup in __inet_hash_connect()")?
Instead, make the RCU changes so that check_bound() can be called under RCU,
and call it here before deciding to break out of this loop.
> rcu_read_unlock();
> goto next_port;
> }
> @@ -1088,9 +1127,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> */
> inet_bind_bucket_for_each(tb, &head->chain) {
> if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
> - if (tb->fastreuse >= 0 ||
> - tb->fastreuseport >= 0)
> + if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
> + if (unlikely(local_ports &&
> + !check_bound(sk, tb)))
> + goto ok;
> goto next_port_unlock;
> + }
> WARN_ON(hlist_empty(&tb->bhash2));
> if (!check_established(death_row, sk,
> port, &tw, false,
>
> --
> 2.43.0
>