Message-ID: <CAL+tcoD9BA_Y26dSz+rkvi2_ZEc6D29zVEBhSQ5++OtOqJ3Xvw@mail.gmail.com>
Date: Mon, 19 Aug 2024 20:27:11 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, kernel-team@...udflare.com
Subject: Re: [PATCH RFC net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond
Hello Jakub,
On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@...udflare.com> wrote:
>
> [This patch needs a description. Please see the RFC cover letter below.]
>
> Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
> ---
> Can we shorten the TCP connection reincarnation period?
>
> Situation
> =========
>
> Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port)
> in the TIME-WAIT state to establish a new outgoing TCP connection after a period
> of 1 second. This period, during which the 4-tuple remains blocked from reuse,
> is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp
> timestamp, which presently uses a 1 Hz clock (ktime_get_seconds).
>
> The TIME-WAIT block is enforced by __{inet,inet6}_check_established ->
> tcp_twsk_unique, where we check if the timestamp clock has ticked since the last
> ts_recent_stamp update before allowing the 4-tuple to be reused.
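>
> For illustration, the pre-patch check boils down to the following (a
> simplified excerpt of tcp_twsk_unique; see also the corresponding hunk in the
> patch below):
>
>     ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>     if (ts_recent_stamp &&
>         (!twp || (reuse && time_after32(ktime_get_seconds(),
>                                         ts_recent_stamp)))) {
>             /* Either the caller picked the local port itself (twp == NULL),
>              * or tcp_tw_reuse is enabled and the 1 Hz clock has ticked
>              * since the stamp was stored: the 4-tuple may be reused.
>              */
>             ...
>     }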
>
> This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an
> already in use local port during connect") [1], protects the TCP receiver
> against segments from an earlier incarnation of the same connection (FIN
> retransmits), which could potentially corrupt the TCP stream, as described by
> RFC 7323 [2, 3].
>
> Problem
> =======
>
> The one-second reincarnation period did not pose a problem while we had a
> sufficiently large pool of ephemeral ports (tens of thousands per host).
> However, since we began sharing egress IPv4 addresses between hosts by
> partitioning the available port range [4], the ephemeral port pool size has
> shrunk significantly, down to hundreds of ports per host.
>
> This reduction in port pool size has made it clear that a one-second connection
> reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP
> connections, such as DNS queries, can complete in milliseconds, yet the TCP
> 4-tuple remains blocked for a period of time that is orders of magnitude longer.
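>
> To put numbers on it: with, say, 500 ephemeral ports per host and a 1-second
> reuse block, outgoing connections to a single destination address and port
> are capped at roughly 500 per second, even if each individual connection
> completes within a few milliseconds.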
>
> Solution
> ========
>
> We would like to propose shortening the period during which the 4-tuple is
> tied up. The intention is to enable TIME-WAIT reuse at least as quickly as it
> takes nowadays to complete a short-lived TCP connection, from setup to
> teardown.
>
> The ts_recent_stamp protection is based on the same principle as PAWS but
> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
> (b):
>
> An additional mechanism could be added to the TCP, a per-host
> cache of the last timestamp received from any connection. This
> value could then be used in the PAWS mechanism to reject old
> duplicate segments from earlier incarnations of the connection,
> if the timestamp clock can be guaranteed to have ticked at least
> once since the old connection was open. This would require that
> the TIME-WAIT delay plus the RTT together must be at least one
> tick of the sender's timestamp clock. Such an extension is not
> part of the proposal of this RFC.
>
> Due to that, when reconsidering the default setting, we would want to follow
> the same guidelines as for the TSval timestamp clock, for which RFC 7323
> recommends a tick interval in the range of 1 ms to 1 sec [5].
>
> (Note that the Linux TCP stack has recently introduced even finer granularity
> with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for
> usec resolution in TCP TS values" [6] for use in private networks.)
>
> A simple implementation could be to switch from a second to a millisecond clock,
> as demonstrated by the following patch. However, this could also be a tunable
> option to allow administrators to adjust it based on their specific needs and
> risk tolerance.
>
> A tunable also opens the door to letting users set the TIME-WAIT reuse period
> beyond the RFC 7323 recommended range at their own risk.
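>
> As a rough sketch of what such a tunable could look like at the check site
> (the sysctl field name below is made up for illustration and is not part of
> this patch):
>
>     /* Hypothetical per-netns knob, in milliseconds. A value of 1000 would
>      * roughly match today's behaviour, a value of 1 the patch below.
>      */
>     u32 delay_ms = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tw_reuse_delay);
>
>     ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>     if (ts_recent_stamp &&
>         (!twp || (reuse && time_after32((u32)tcp_clock_ms(),
>                                         ts_recent_stamp + delay_ms)))) {
>             /* Enough time has passed: reuse the TIME-WAIT 4-tuple. */
>             ...
>     }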
>
> Workaround
> ==========
>
> Today, when an application has only a small ephemeral port pool available, we
> work around the 1-second reincarnation period by manually selecting the local
> port with an explicit bind().
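>
> For completeness, the workaround in userspace looks roughly like this (error
> handling and includes omitted):
>
>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>     int one = 1;
>
>     /* Typically needed so that bind() succeeds while a previous socket on
>      * this port is still in TIME-WAIT.
>      */
>     setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
>
>     /* Pick the local port ourselves instead of leaving it to connect(). */
>     struct sockaddr_in local = {
>             .sin_family = AF_INET,
>             .sin_port   = htons(40000),     /* example port */
>     };
>     bind(fd, (struct sockaddr *)&local, sizeof(local));
>
>     struct sockaddr_in remote = {
>             .sin_family = AF_INET,
>             .sin_port   = htons(53),        /* example destination */
>     };
>     inet_pton(AF_INET, "198.51.100.1", &remote.sin_addr);
>
>     /* With a user-selected port, connect() does not apply the TIME-WAIT
>      * timestamp check (see below).
>      */
>     connect(fd, (struct sockaddr *)&remote, sizeof(remote));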
>
> This has been possible since the introduction of the ts_recent_stamp protection
> mechanism [1]. However, it is unclear why this is allowed for egress
> connections.
>
> To guide readers to the relevant code: if the local port is selected by the
> user, we do not pass a TIME-WAIT socket to the check_established helper from
> __inet_hash_connect. This way we circumvent the timestamp check in
> tcp_twsk_unique [7] (as twp is NULL).
>
> However, relying on this workaround conflicts with our goal of delegating TCP
> local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8]
> and IP_LOCAL_PORT_RANGE [9] socket options.
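>
> For reference, the stack-driven flow we would like to rely on instead looks
> roughly like this (addresses and port numbers are examples only):
>
>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>     int one = 1;
>
>     /* Defer local port selection from bind() to connect() time. */
>     setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));
>
>     /* Constrain the ephemeral range for this socket; the low 16 bits carry
>      * the lower bound, the high 16 bits the upper bound (here 40000-40999).
>      */
>     uint32_t range = 40000u | (40999u << 16);
>     setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE, &range, sizeof(range));
>
>     /* Bind to the shared egress address with port 0. */
>     struct sockaddr_in local = { .sin_family = AF_INET, .sin_port = 0 };
>     inet_pton(AF_INET, "192.0.2.1", &local.sin_addr);  /* example egress IP */
>     bind(fd, (struct sockaddr *)&local, sizeof(local));
>
>     /* remote prepared as in the previous snippet; the kernel picks the
>      * local port here, within the configured range.
>      */
>     connect(fd, (struct sockaddr *)&remote, sizeof(remote));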
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
> [2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8
> [3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B
> [4] https://lpc.events/event/16/contributions/1349/
> [5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4
> [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93
> [7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156
> [8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT
> [9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE
> ---
>
> ---
> drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +-
> include/linux/tcp.h | 4 ++--
> net/ipv4/tcp_input.c | 2 +-
> net/ipv4/tcp_ipv4.c | 5 ++---
> net/ipv4/tcp_minisocks.c | 9 ++++++---
> 5 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> index 6f6525983130..b15b26db8902 100644
> --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> @@ -1866,7 +1866,7 @@ static void chtls_timewait(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>
>         tp->rcv_nxt++;
> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
>         tp->srtt_us = 0;
>         tcp_time_wait(sk, TCP_TIME_WAIT, 0);
> }
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 6a5e08b937b3..174257114ee4 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -110,7 +110,7 @@ struct tcp_sack_block {
>
> struct tcp_options_received {
> /* PAWS/RTTM data */
> -       int ts_recent_stamp;/* Time we stored ts_recent (for aging) */
> +       u32 ts_recent_stamp;/* Time we stored ts_recent (for aging) */
>         u32 ts_recent; /* Time stamp to echo next */
>         u32 rcv_tsval; /* Time stamp value */
>         u32 rcv_tsecr; /* Time stamp echo reply */
> @@ -543,7 +543,7 @@ struct tcp_timewait_sock {
>         /* The time we sent the last out-of-window ACK: */
>         u32 tw_last_oow_ack_time;
>
> -       int tw_ts_recent_stamp;
> +       u32 tw_ts_recent_stamp;
>         u32 tw_tx_delay;
> #ifdef CONFIG_TCP_MD5SIG
>         struct tcp_md5sig_key *tw_md5_key;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e37488d3453f..873a1cbd6d14 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk)
> static void tcp_store_ts_recent(struct tcp_sock *tp)
> {
>         tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
> }
>
> static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fd17f25ff288..47e2dcda4eae 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>         const struct inet_timewait_sock *tw = inet_twsk(sktw);
>         const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
>         struct tcp_sock *tp = tcp_sk(sk);
> -       int ts_recent_stamp;
> +       u32 ts_recent_stamp;
>
>         if (reuse == 2) {
>                 /* Still does not detect *everything* that goes through
> @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>          */
>         ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>         if (ts_recent_stamp &&
> -           (!twp || (reuse && time_after32(ktime_get_seconds(),
> -                                           ts_recent_stamp)))) {
> +           (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) {
At first glance, I wonder whether 1 ms might be too short, at least for most
cases? If the RTT is 2-3 ms, which is quite common in production, we may lose
the opportunity to change the sub-state of the timewait socket and finish the
work that should be done as expected. One second is safe for most cases, of
course; I vaguely remember a paper (the one arguing for raising the initial
window to 10) saying that at Google RTTs exceeding 100 ms are rare but do
exist. So I still feel a fixed short value is not that appropriate...

Like you said, how about converting the fixed value into a tunable one
and keeping 1 second as the default value?
After you submit the next version, I think I can try it and test it
locally :) It's interesting!
Thanks,
Jason