Message-ID: <CANn89iLm_hRW3+MHsP8p5aTUStohz0nvWbKTGZU6K3EdRadrYw@mail.gmail.com>
Date: Tue, 8 Jul 2025 04:38:39 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...gle.com>, kernel-team@...udflare.com,
Lee Valentine <lvalentine@...udflare.com>
Subject: Re: [PATCH net-next v2 1/2] tcp: Consider every port when connecting
with IP_LOCAL_PORT_RANGE
On Thu, Jul 3, 2025 at 8:59 AM Jakub Sitnicki <jakub@...udflare.com> wrote:
>
> Situation
> ---------
>
> We observe the following scenario in production:
>
>                                          inet_bind_bucket
>                                          state for port 54321
>                                          --------------------
>
>                                          (bucket doesn't exist)
>
> // Process A opens a long-lived connection:
> s1 = socket(AF_INET, SOCK_STREAM)
> s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
> s1.bind(192.0.2.10, 0)
> s1.connect(198.51.100.1, 443)            tb->fastreuse = -1
>                                          tb->fastreuseport = -1
> s1.getsockname() -> 192.0.2.10:54321
> s1.send()
> s1.recv()
> // ... s1 stays open.
>
> // Process B opens a short-lived connection:
> s2 = socket(AF_INET, SOCK_STREAM)
> s2.setsockopt(SO_REUSEADDR)
> s2.bind(192.0.2.20, 0)                   tb->fastreuse = 0
>                                          tb->fastreuseport = 0
> s2.connect(198.51.100.2, 53)
> s2.getsockname() -> 192.0.2.20:54321
> s2.send()
> s2.recv()
> s2.close()
>
>                                          // bucket remains in this
>                                          // state even though port
>                                          // was released by s2
>                                          tb->fastreuse = 0
>                                          tb->fastreuseport = 0
>
> // Process A attempts to open another connection
> // when there is connection pressure from
> // 192.0.2.30:54000..54500 to 198.51.100.1:443.
> // Assume only port 54321 is still available.
>
> s3 = socket(AF_INET, SOCK_STREAM)
> s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
> s3.bind(192.0.2.30, 0)
> s3.connect(198.51.100.1, 443) -> EADDRNOTAVAIL (99)
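[Editor's sketch: the bucket lifecycle above condensed into a toy state machine. This is plain Python for intuition only, not kernel code; it reduces tb->fastreuse to the states that appear in this scenario.]

```python
# Toy model of the inet_bind_bucket state in the scenario above.
# -1   : port only ever picked at connect() time, still reusable
# >= 0 : someone explicitly bound the port; connect() may no longer
#        pick it. A deliberate simplification of tb->fastreuse.

class BindBucket:
    def __init__(self):
        self.fastreuse = None   # bucket does not exist yet
        self.owners = set()

    def connect_autobind(self, sock):
        # connect() with an ephemeral port keeps (or creates) the
        # bucket in the reusable-for-connect state, unless an
        # explicit bind() already moved it out of that state.
        if self.fastreuse is not None and self.fastreuse >= 0:
            raise OSError("EADDRNOTAVAIL")  # what s3 hits
        self.fastreuse = -1
        self.owners.add(sock)

    def explicit_bind(self, sock):
        # A plain bind() to the port flips the bucket out of -1 ...
        self.fastreuse = 0
        self.owners.add(sock)

    def release(self, sock):
        # ... and close() never flips it back, as long as another
        # owner (like s1) keeps the bucket alive.
        self.owners.discard(sock)


tb = BindBucket()
tb.connect_autobind("s1")      # long-lived connection
assert tb.fastreuse == -1
tb.explicit_bind("s2")         # short-lived, plain bind()
assert tb.fastreuse == 0
tb.release("s2")               # port released, state unchanged
assert tb.fastreuse == 0
try:
    tb.connect_autobind("s3")  # Process A tries again
except OSError as e:
    print(e)                   # EADDRNOTAVAIL
```

The model makes the same point as the Problem section: release() never restores the -1 state.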
>
> Problem
> -------
>
> We end up in a state where Process A can't reuse ephemeral port 54321 for
> as long as there are sockets, like s1, that keep the bind bucket alive. The
> bucket does not return to "reusable" state even when all sockets which
> blocked it from reuse, like s2, are gone.
>
> The ephemeral port becomes available for use again only after all sockets
> bound to it are gone and the bind bucket is destroyed.
>
> Programs which behave like Process B in this scenario - that is, binding to
> an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
> poorly written. However, the reality is that such implementations are not
> actually uncommon. Trying to fix each and every such program is like
> playing whack-a-mole.
>
> For instance, it could be any software using Golang's net.Dialer with
> LocalAddr provided:
>
>   dialer := &net.Dialer{
>           LocalAddr: &net.TCPAddr{IP: srcIP},
>   }
> conn, err := dialer.Dial("tcp4", dialTarget)
>
> Or even a ubiquitous tool like dig when using a specific local address:
>
> $ dig -b 127.1.1.1 +tcp +short example.com
>
> Hence, we are proposing a systematic fix in the network stack itself.
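[Editor's sketch: for reference, the well-behaved variant of the Process B pattern sets IP_BIND_ADDRESS_NO_PORT before bind(). A minimal Linux-only demonstration on loopback; the constant value 24 is a fallback in case the Python build does not export it.]

```python
# With IP_BIND_ADDRESS_NO_PORT, bind() records only the source IP;
# the port is chosen at connect() time, so no bind bucket is
# "poisoned" for other users of the port.
import socket

IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

# A throwaway listener to connect to.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
cli.bind(("127.0.0.1", 0))
# No port has been assigned yet at bind() time:
assert cli.getsockname()[1] == 0
cli.connect(srv.getsockname())
# The ephemeral port is picked during connect():
assert cli.getsockname()[1] != 0
cli.close()
srv.close()
```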
>
> Solution
> --------
>
> If there is no IP address conflict with any socket bound to a given local
> port, then from the protocol's perspective, the port can be safely shared.
>
> With that in mind, modify the port search during connect(), that is
> __inet_hash_connect, to consider all bind buckets (ports) when looking for
> a local port for egress.
>
> To achieve this, add an extra walk over bhash2 buckets for the port to
> check for IP conflicts. The additional walk is not free, so perform it only
> once per port - during the second phase of conflict checking, when the
> bhash bucket is locked.
>
> We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
> option is set. The rationale is that users are likely to care about using
> every possible local port only when they have deliberately constrained the
> ephemeral port range.
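[Editor's sketch: in userspace terms, the predicate that the extra bhash2 walk evaluates looks roughly like this. IPv4 only and simplified; the real check_bound() in the patch also covers IPv6 and v4-mapped addresses.]

```python
# A port can be shared for egress iff no socket bound to it uses
# the wildcard address or the same source address.

INADDR_ANY = "0.0.0.0"

def conflicts(src_addr, bound_addrs):
    """True if src_addr conflicts with an address bound to the port."""
    return any(a == INADDR_ANY or a == src_addr for a in bound_addrs)

# Other sockets bound to different local addresses: no conflict,
# so connect() may still pick this port for 192.0.2.10.
assert not conflicts("192.0.2.10", ["192.0.2.20", "192.0.2.30"])
# Same source address already bound: conflict.
assert conflicts("192.0.2.10", ["192.0.2.10"])
# A wildcard bind blocks every source address.
assert conflicts("192.0.2.10", [INADDR_ANY])
```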
>
> Limitations
> -----------
>
> Sharing both the local IP and port with other established sockets, even when
> the remote address is unique, is still not possible if the bucket is in a
> non-reusable state (tb->{fastreuse,fastreuseport} >= 0) because of a socket
> explicitly bound to that port.
>
> Alternatives
> ------------
>
> * Update bind bucket state on port release
>
> A valid solution to the described problem would also be to walk the bind
> bucket owners when releasing the port and recalculate the
> tb->{fastreuse,fastreuseport} state.
>
> However, in comparison to the proposed solution, this alone would not allow
> sharing the local port with other sockets bound to non-conflicting IPs for
> as long as they exist.
>
> Another downside is that we'd pay the extra cost on each unbind (more
> frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
> set (less frequent). Due to that we would also likely need to guard it
> behind a sysctl (see below).
>
> * Run your service in a dedicated netns
>
> This would also solve the problem. While we don't rule out transitioning to
> this setup in the future, it comes at the cost of shifting the complexity
> elsewhere.
>
> Isolating your service in a netns requires assigning it dedicated IPs for
> egress. If the egress IPs must be shared with other processes, as in our
> case, then SNAT and connection tracking on egress are required - adding
> complexity.
>
> * Guard it behind a sysctl setting instead of a socket option
>
> Users are unlikely to encounter this problem unless their workload connects
> from a severely narrowed-down ephemeral port range. Hence, paying the bind
> bucket walk cost for each connect() call doesn't seem sensible. Whereas
> with a socket option, only a limited set of connections incur the
> performance overhead.
>
> Reported-by: Lee Valentine <lvalentine@...udflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
> ---
> To: Eric Dumazet <edumazet@...gle.com>
> To: Paolo Abeni <pabeni@...hat.com>
> To: David S. Miller <davem@...emloft.net>
> To: Jakub Kicinski <kuba@...nel.org>
> To: Neal Cardwell <ncardwell@...gle.com>
> To: Kuniyuki Iwashima <kuniyu@...gle.com>
> Cc: netdev@...r.kernel.org
> Cc: kernel-team@...udflare.com
> ---
> Changes in v2:
> - Fix unused var warning when CONFIG_IPV6=n
> - Convert INADDR_ANY to network byte order before comparison
> - Link to v1: https://lore.kernel.org/r/20250626120247.1255223-1-jakub@cloudflare.com
> ---
> net/ipv4/inet_hashtables.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ceeeec9b7290..b4c4caf3ff6c 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1005,6 +1005,43 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
> #define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
> static u32 *table_perturb;
>
> +/* True on source address conflict with another socket. False otherwise.
> + * Caller must hold hashbucket lock for this tb.
> + */
> +static inline bool check_bound(const struct sock *sk,
> + const struct inet_bind_bucket *tb)
> +{
> + const struct inet_bind2_bucket *tb2;
> +
> + hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
> +#if IS_ENABLED(CONFIG_IPV6)
> + const struct sock *sk2;
> +
> + if (sk->sk_family == AF_INET6) {
> + if (tb2->addr_type == IPV6_ADDR_ANY ||
> + ipv6_addr_equal(&tb2->v6_rcv_saddr,
> + &sk->sk_v6_rcv_saddr))
> + return true;
> + continue;
> + }
> +
> + /* Check for ipv6 non-v6only wildcard sockets */
> + if (tb2->addr_type == IPV6_ADDR_ANY)
> + sk_for_each_bound(sk2, &tb2->owners)
> + if (!sk2->sk_ipv6only)
> + return true;
> +
> + if (tb2->addr_type != IPV6_ADDR_MAPPED)
> + continue;
> +#endif
> + if (tb2->rcv_saddr == htonl(INADDR_ANY) ||
> + tb2->rcv_saddr == sk->sk_rcv_saddr)
> + return true;
> + }
> +
> + return false;
> +}
> +
> int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> struct sock *sk, u64 port_offset,
> u32 hash_port0,
> @@ -1070,6 +1107,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> if (!inet_bind_bucket_match(tb, net, port, l3mdev))
> continue;
> if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
> + if (unlikely(local_ports))
> + break; /* optimistic assumption */
I find this quite pessimistic :/
It seems you had some internal code before my recent change (86c2bc293b8130
"tcp: use RCU lookup in __inet_hash_connect()")?
Instead, make the RCU changes so that check_bound() can be called under RCU,
and call it here before deciding to break out of this loop.
> rcu_read_unlock();
> goto next_port;
> }
> @@ -1088,9 +1127,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> */
> inet_bind_bucket_for_each(tb, &head->chain) {
> if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
> - if (tb->fastreuse >= 0 ||
> - tb->fastreuseport >= 0)
> + if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
> + if (unlikely(local_ports &&
> + !check_bound(sk, tb)))
> + goto ok;
> goto next_port_unlock;
> + }
> WARN_ON(hlist_empty(&tb->bhash2));
> if (!check_established(death_row, sk,
> port, &tw, false,
>
> --
> 2.43.0
>