Message-ID: <CAL+tcoD9BA_Y26dSz+rkvi2_ZEc6D29zVEBhSQ5++OtOqJ3Xvw@mail.gmail.com>
Date: Mon, 19 Aug 2024 20:27:11 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, kernel-team@...udflare.com
Subject: Re: [PATCH RFC net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond

Hello Jakub,

On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@...udflare.com> wrote:
>
> [This patch needs a description. Please see the RFC cover letter below.]
>
> Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
> ---
> Can we shorten the TCP connection reincarnation period?
>
> Situation
> =========
>
> Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port)
> in the TIME-WAIT state to establish a new outgoing TCP connection after a period
> of 1 second. This period, during which the 4-tuple remains blocked from reuse,
> is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp
> timestamp, which presently uses a 1 Hz clock (ktime_get_seconds).
>
> The TIME-WAIT block is enforced by __{inet,inet6}_check_established ->
> tcp_twsk_unique, where we check if the timestamp clock has ticked since the last
> ts_recent_stamp update before allowing the 4-tuple to be reused.
>
> This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an
> already in use local port during connect") [1], protects the TCP receiver
> against segments from an earlier incarnation of the same connection (FIN
> retransmits), which could potentially corrupt the TCP stream, as described by
> RFC 7323 [2, 3].
>
> Problem
> =======
>
> The one-second reincarnation period did not pose a problem while we had a
> sufficiently large pool of ephemeral ports (tens of thousands per host).
> However, as we began sharing egress IPv4 addresses between hosts by partitioning
> the available port range [4], the ephemeral port pool size has shrunk
> significantly, down to hundreds of ports per host.
>
> This reduction in port pool size has made it clear that a one-second connection
> reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP
> connections, such as DNS queries, can complete in milliseconds, yet the TCP
> 4-tuple remains blocked for a period of time that is orders of magnitude longer.
>
> Solution
> ========
>
> We would like to propose shortening the period during which the 4-tuple is tied
> up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes
> nowadays to perform a short-lived TCP connection, from setup to teardown.
>
> The ts_recent_stamp protection is based on the same principle as PAWS but
> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
> (b):
>
>     An additional mechanism could be added to the TCP, a per-host
>     cache of the last timestamp received from any connection.  This
>     value could then be used in the PAWS mechanism to reject old
>     duplicate segments from earlier incarnations of the connection,
>     if the timestamp clock can be guaranteed to have ticked at least
>     once since the old connection was open.  This would require that
>     the TIME-WAIT delay plus the RTT together must be at least one
>     tick of the sender's timestamp clock.  Such an extension is not
>     part of the proposal of this RFC.
>
> Due to that, we would want to follow the same guidelines as for the TSval
> timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms
> to 1 sec per tick [5], when reconsidering the default setting.
>
> (Note that the Linux TCP stack has recently introduced even finer granularity
> with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for
> usec resolution in TCP TS values" [6] for use in private networks.)
>
> A simple implementation could be to switch from a second to a millisecond clock,
> as demonstrated by the following patch. However, this could also be a tunable
> option to allow administrators to adjust it based on their specific needs and
> risk tolerance.
>
> A tunable also opens the door to letting users set the TIME-WAIT reuse period
> beyond the RFC 7323 recommended range at their own risk.
>
> Workaround
> ==========
>
> Today, when an application has only a small ephemeral port pool available, we
> work around the 1-second reincarnation period by manually selecting the local
> port with an explicit bind().
>
> This has been possible since the introduction of the ts_recent_stamp protection
> mechanism [1]. However, it is unclear why this is allowed for egress
> connections.
>
> To guide readers to the relevant code: if the local port is selected by the
> user, we do not pass a TIME-WAIT socket to the check_established helper from
> __inet_hash_connect. This way we circumvent the timestamp check in
> tcp_twsk_unique [7] (as twp is NULL).
>
> However, relying on this workaround conflicts with our goal of delegating TCP
> local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8]
> and IP_LOCAL_PORT_RANGE [9] socket options.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
> [2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8
> [3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B
> [4] https://lpc.events/event/16/contributions/1349/
> [5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4
> [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93
> [7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156
> [8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT
> [9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE
> ---
>
> ---
>  drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +-
>  include/linux/tcp.h                                         | 4 ++--
>  net/ipv4/tcp_input.c                                        | 2 +-
>  net/ipv4/tcp_ipv4.c                                         | 5 ++---
>  net/ipv4/tcp_minisocks.c                                    | 9 ++++++---
>  5 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> index 6f6525983130..b15b26db8902 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> @@ -1866,7 +1866,7 @@ static void chtls_timewait(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>
>         tp->rcv_nxt++;
> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
>         tp->srtt_us = 0;
>         tcp_time_wait(sk, TCP_TIME_WAIT, 0);
>  }
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 6a5e08b937b3..174257114ee4 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -110,7 +110,7 @@ struct tcp_sack_block {
>
>  struct tcp_options_received {
>  /*     PAWS/RTTM data  */
> -       int     ts_recent_stamp;/* Time we stored ts_recent (for aging) */
> +       u32     ts_recent_stamp;/* Time we stored ts_recent (for aging) */
>         u32     ts_recent;      /* Time stamp to echo next              */
>         u32     rcv_tsval;      /* Time stamp value                     */
>         u32     rcv_tsecr;      /* Time stamp echo reply                */
> @@ -543,7 +543,7 @@ struct tcp_timewait_sock {
>         /* The time we sent the last out-of-window ACK: */
>         u32                       tw_last_oow_ack_time;
>
> -       int                       tw_ts_recent_stamp;
> +       u32                       tw_ts_recent_stamp;
>         u32                       tw_tx_delay;
>  #ifdef CONFIG_TCP_MD5SIG
>         struct tcp_md5sig_key     *tw_md5_key;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e37488d3453f..873a1cbd6d14 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk)
>  static void tcp_store_ts_recent(struct tcp_sock *tp)
>  {
>         tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
>  }
>
>  static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fd17f25ff288..47e2dcda4eae 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>         const struct inet_timewait_sock *tw = inet_twsk(sktw);
>         const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
>         struct tcp_sock *tp = tcp_sk(sk);
> -       int ts_recent_stamp;
> +       u32 ts_recent_stamp;
>
>         if (reuse == 2) {
>                 /* Still does not detect *everything* that goes through
> @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>          */
>         ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>         if (ts_recent_stamp &&
> -           (!twp || (reuse && time_after32(ktime_get_seconds(),
> -                                           ts_recent_stamp)))) {
> +           (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) {

At first glance, I wonder whether 1 ms is too short, at least for most
cases. If the RTT is 2-3 ms, which is quite common in production, we
may lose the opportunity to change the sub-state of the timewait
socket and finish the work that should be done as expected. One second
is safe for most cases, of course; I vaguely remember a paper (the one
on tuning the initial window to 10) saying that at Google RTTs
exceeding 100 ms are rare but do exist. So I still feel a fixed short
value is not that appropriate...

Like you said, how about converting the fixed value into a tunable one
and keeping 1 second as the default value?

After you submit the next version, I think I can try it and test it
locally :) It's interesting!

Thanks,
Jason
