netdev - Re: [PATCH RFC net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond

Message-ID: <87ikvwr5iz.fsf@cloudflare.com>
Date: Mon, 19 Aug 2024 15:44:36 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org,  kernel-team@...udflare.com
Subject: Re: [PATCH RFC net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond

On Mon, Aug 19, 2024 at 01:59 PM +02, Eric Dumazet wrote:
> On Mon, Aug 19, 2024 at 1:31 PM Jakub Sitnicki <jakub@...udflare.com> wrote:
>>
>> [This patch needs a description. Please see the RFC cover letter below.]
>>
>> Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
>> ---
>> Can we shorten the TCP connection reincarnation period?
>>
>> Situation
>> =========
>>
>> Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port)
>> in the TIME-WAIT state to establish a new outgoing TCP connection after a period
>> of 1 second. This period, during which the 4-tuple remains blocked from reuse,
>> is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp
>> timestamp, which presently uses a 1 Hz clock (ktime_get_seconds).
>>
>> The TIME-WAIT block is enforced by __{inet,inet6}_check_established ->
>> tcp_twsk_unique, where we check if the timestamp clock has ticked since the last
>> ts_recent_stamp update before allowing the 4-tuple to be reused.
>>
>> This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an
>> already in use local port during connect") [1], protects the TCP receiver
>> against segments from an earlier incarnation of the same connection (FIN
>> retransmits), which could potentially corrupt the TCP stream, as described by
>> RFC 7323 [2, 3].
>>
>> Problem
>> =======
>>
>> The one-second reincarnation period has not posed a problem when we had a
>> sufficiently large pool of ephemeral ports (tens of thousands per host).
>
>
> We now have network namespaces, and still ~30,000 ephemeral ports per netns :)

It's just that we are short on public IPv4 addresses with certain traits
we need to proxy on egress (like ownership, reputation, geolocation).
Hence we had to share the addresses and divide the port space between
hosts :-/

>
>> However, as we began sharing egress IPv4 addresses between hosts by partitioning
>> the available port range [4], the ephemeral port pool size has shrunk
>> significantly, down to hundreds of ports per host.
>>
>> This reduction in port pool size has made it clear that a one-second connection
>> reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP
>> connections, such as DNS queries, can complete in milliseconds, yet the TCP
>> 4-tuple remains blocked for a period of time that is orders of magnitude longer.
>>
>> Solution
>> ========
>>
>> We would like to propose shortening the period during which the 4-tuple is tied
>> up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes
>> nowadays to perform a short-lived TCP connection, from setup to teardown.
>>
>> The ts_recent_stamp protection is based on the same principle as PAWS but
>> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
>> (b):
>>
>>     An additional mechanism could be added to the TCP, a per-host
>>     cache of the last timestamp received from any connection.  This
>>     value could then be used in the PAWS mechanism to reject old
>>     duplicate segments from earlier incarnations of the connection,
>>     if the timestamp clock can be guaranteed to have ticked at least
>>     once since the old connection was open.  This would require that
>>     the TIME-WAIT delay plus the RTT together must be at least one
>>     tick of the sender's timestamp clock.  Such an extension is not
>>     part of the proposal of this RFC.
>
> Note the RTT part here. I do not see this implemented in your patch.
>

Not sure I follow. I need to look into that more.

My initial thinking here was that as long as TW delay (1 msec) is not
shorter than one tick of the sender's TS clock (1 msec), then I can
ignore the RTT and the requirement is still met.
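
To spell that reasoning out, here is a minimal sketch of the condition
from the RFC 7323 text quoted above, in plain user-space C. The function
name and its parameters are hypothetical and exist only for
illustration; this is not code from the patch, and the sender's tick
length is simply passed in as an assumed value.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function.
 *
 * RFC 7323, Appendix B.2 (b): the TIME-WAIT delay plus the RTT together
 * must be at least one tick of the sender's timestamp clock.
 * All arguments are in microseconds.
 */
bool tw_reuse_meets_rfc7323(uint64_t tw_delay_us, uint64_t rtt_us,
			    uint64_t sender_tick_us)
{
	return tw_delay_us + rtt_us >= sender_tick_us;
}
```

With tw_delay_us = 1000 and sender_tick_us = 1000 the condition holds
for any RTT, including zero, which is the case I had in mind. It stops
holding once the tick passed in is longer than the delay plus the RTT.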

>>
>> Due to that, we would want to follow the same guidelines as for the TSval
>> timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms
>> to 1 sec per tick [5], when reconsidering the default setting.
>>
>> (Note that the Linux TCP stack has recently introduced even finer granularity
>> with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for
>> usec resolution in TCP TS values" [6] for use in private networks.)
>>
>> A simple implementation could be to switch from a second to a millisecond clock,
>> as demonstrated by the following patch. However, this could also be a tunable
>> option to allow administrators to adjust it based on their specific needs and
>> risk tolerance.
>>
>> A tunable also opens the door to letting users set the TIME-WAIT reuse period
>> beyond the RFC 7323 recommended range at their own risk.
>>

[...]

>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index e37488d3453f..873a1cbd6d14 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk)
>>  static void tcp_store_ts_recent(struct tcp_sock *tp)
>>  {
>>         tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
>> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
>> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
>
> Please do not abuse tcp_clock_ms().
>
> Instead use tcp_time_stamp_ms(tp)
>
> Same remark for other parts of the patch, try to reuse tp->tcp_mstamp
> if available.
>
> Also, (tcp_clock_ms() != ts_recent_stamp) can be true even after one
> usec has elapsed, due to rounding.
>
> The 'one second delay' was really: 'An average of 0.5 second delay'
>
> Solution : no longer use jiffies, but usec based timestamps, since we
> already have this infrastructure in TCP stack.

Thank you for the feedback, especially regarding the rounding bug. That
was eye-opening.

Will rework it to move away from jiffies.
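
To make the rounding issue concrete for myself, here is a minimal sketch
in plain user-space C, not kernel code. Both function names and the
microsecond inputs are hypothetical. The first function mimics a tick
comparison on millisecond-truncated values, in the spirit of
(tcp_clock_ms() != ts_recent_stamp); the second compares elapsed time on
usec timestamps instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define USEC_PER_MSEC 1000ULL

/*
 * Hypothetical tick comparison: both instants are truncated to whole
 * milliseconds, and reuse is allowed as soon as the values differ.
 */
bool reuse_ok_ms_tick(uint64_t stamp_us, uint64_t now_us)
{
	return (now_us / USEC_PER_MSEC) != (stamp_us / USEC_PER_MSEC);
}

/*
 * Hypothetical elapsed-time comparison: reuse is allowed only once at
 * least min_delay_us microseconds have passed since the stamp.
 * Assumes now_us >= stamp_us.
 */
bool reuse_ok_usec_delta(uint64_t stamp_us, uint64_t now_us,
			 uint64_t min_delay_us)
{
	return now_us - stamp_us >= min_delay_us;
}
```

With a stamp taken at 1999 usec and a check at 2000 usec, the tick
comparison already allows reuse although only 1 usec has elapsed, while
the elapsed-time comparison with min_delay_us = 1000 does not allow it
until 2999 usec. The same truncation effect with a 1 Hz clock is why the
one second delay averaged out to about half a second.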