Message-Id: <20241204-jakub-krn-909-poc-msec-tw-tstamp-v1-0-8b54467a0f34@cloudflare.com>
Date: Wed, 04 Dec 2024 19:53:21 +0100
From: Jakub Sitnicki <jakub@...udflare.com>
To: netdev@...r.kernel.org
Cc: Eric Dumazet <edumazet@...gle.com>,
Jason Xing <kerneljasonxing@...il.com>,
Adrien Vasseur <avasseur@...udflare.com>,
Lee Valentine <lvalentine@...udflare.com>, kernel-team@...udflare.com
Subject: [PATCH net-next 0/2] Make TIME-WAIT reuse delay deterministic and
configurable
This patch set is an effort to enable faster reuse of TIME-WAIT sockets.
We have recently talked about the motivation and the idea at Plumbers [1].
We now feel confident enough with these changes to repost them as a regular
patch set. There was no feedback on RFCv2 [2], so I'm working under the
assumption that the gradual, two-step approach of converting the TS.Recent
last-update timestamp to milliseconds is okay with everyone.
Experiment in production
------------------------
We've deployed these patches to a couple of nodes in production. They have
been soaking for more than a week now. One node was running with the
default setting for the new reuse delay sysctl (1 second), while another
had the reuse delay set to a radically shorter value (1 millisecond).
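
To make the "delay is not shorter than the set value" point concrete, here
is a minimal sketch of a reuse check on millisecond-truncated timestamps.
This is not the code from the patches. The function name, its parameters,
and the "one extra tick" rule are assumptions made for illustration only.

```python
def reuse_allowed(now_us, last_update_us, delay_ms):
    """Hypothetical TIME-WAIT reuse check on millisecond-truncated stamps.

    now_us / last_update_us are microsecond clock readings (illustrative
    inputs). Truncating both to milliseconds can overstate the elapsed
    time by almost one tick, so one extra tick is required before reuse.
    """
    now_ms = now_us // 1000            # truncate, as a ms-granular stamp would
    stamp_ms = last_update_us // 1000  # truncated last-update stamp
    elapsed_ms = now_ms - stamp_ms
    # Real elapsed time is strictly greater than (elapsed_ms - 1) ms, so
    # this guarantees at least delay_ms of real time has passed.
    return elapsed_ms >= delay_ms + 1
```

With a 1 ms delay, a stamp taken at 999 us and a check at 1000 us differ by
one millisecond tick but by only 1 us of real time. The extra tick rejects
that case and allows reuse once the check happens at 2000 us or later.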
We monitored ephemeral port search latency, measured as time from entry to
exit for __inet_hash_connect(), and skb drops due to failed PAWS check.
p95 port search latency was the same as or better than on the
baseline/control node, while average latency showed higher spikes. This
calls for a closer look at tail latency, which we want to address by
running synthetic stress tests with stress-ng --sockmany.
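
A small sketch of why an unchanged p95 with a spiking average points at the
tail. The samples are synthetic and made up for illustration. They are not
our production measurements, and p95() is a plain nearest-rank helper, not
what our monitoring uses.

```python
import statistics


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # ceil(0.95 * n) computed in integer math, then converted to an index
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic latencies in microseconds (illustrative values only).
baseline = [10] * 100
with_outliers = [10] * 98 + [5000] * 2  # 2% of calls hit a slow path

# p95 ignores the slowest 5% of samples, so both sets report the same p95,
# while the mean is pulled up by the two outliers.
same_p95 = p95(baseline) == p95(with_outliers)
mean_grew = statistics.mean(with_outliers) > statistics.mean(baseline)
```

Outliers that make up less than 5% of samples are invisible to p95 but
still move the mean, which is why the higher percentiles and the maximum
are the next thing to look at.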
When it comes to skb drops due to PAWS, we expected to see higher drop
count on the node with reduced TW delay, but TBH I would have to crunch the
raw data to be able to tell if there was a statistically significant
difference there from the control node.
Hopefully we can test PAWS rejects with packetdrill (see my note below on
what's holding us back), because in production we have to rely on TCP
retransmissions happening to observe PAWS-rejected packets.
I might be able to post the collected graphs somewhere on GH if people
would like to review those for themselves. If interested, please let me
know.
We are now expanding the experiment to more nodes in other PoPs to check
whether the patterns described above show up elsewhere as well.
Packetdrill tests
-----------------
The set of packetdrill tests has not grown since RFCv2. I've fixed up the
expected TS val so that it increases monotonically across connection
reincarnations. It is a cosmetic change so that the test scripts better
reflect reality. Packetdrill doesn't care, because it doesn't track
timestamp offsets across connections. This makes sense. TW reuse is a
special case where the random timestamp offset doesn't change.
However, it also gets in the way of adding tests for PAWS rejecting old
duplicate segments after TW reuse - packetdrill aborts because it can't
infer the offset for TS ecr. I'm looking at how we can address that
shortcoming so I can expand the test set.
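
For reference, the PAWS test that such packetdrill scripts would exercise
boils down to a wraparound-aware comparison of the segment's TS val against
TS.Recent, as in RFC 7323. Below is a simplified sketch. The function names
are made up for illustration, and it leaves out the exception for a
TS.Recent that is too old to trust as well as the special handling of RST
segments.

```python
def ts_before(a, b):
    """True if 32-bit timestamp a is earlier than b, modulo 2**32."""
    diff = (a - b) & 0xFFFFFFFF
    # Equivalent to treating (a - b) as a signed 32-bit value and
    # checking whether it is negative.
    return diff >= 0x80000000


def paws_reject(seg_tsval, ts_recent):
    """Simplified PAWS check: reject when SEG.TSval < TS.Recent."""
    return ts_before(seg_tsval, ts_recent)
```

An old duplicate segment carrying TS val 5 is rejected when TS.Recent is
10, while a segment with TS val 3 is accepted when TS.Recent is 0xFFFFFFF0,
because the timestamp clock has wrapped and 3 is the later value.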
The packetdrill tests for TIME-WAIT reuse are now posted as a draft PR [3].
Thanks,
-jkbs
[1] https://lpc.events/event/18/contributions/1962/
[2] https://lore.kernel.org/r/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
[3] https://github.com/google/packetdrill/pull/90
Signed-off-by: Jakub Sitnicki <jakub@...udflare.com>
---
Changes in RFCv2:
- Make TIME-WAIT reuse configurable through a per-netns sysctl.
- Account for timestamp rounding so delay is not shorter than set value.
- Use tcp_mstamp when we know it is fresh due to receiving a segment.
- Link to RFCv1: https://lore.kernel.org/r/20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com
---
Changes in v1:
- packetdrill: Adjust TS val for reused connection so the value keeps increasing
- Link to RFCv2: https://lore.kernel.org/r/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
---
Jakub Sitnicki (2):
tcp: Measure TIME-WAIT reuse delay with millisecond precision
tcp: Add sysctl to configure TIME-WAIT reuse delay
Documentation/networking/ip-sysctl.rst | 14 ++++++++++++++
.../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/linux/tcp.h | 9 ++++++++-
include/net/netns/ipv4.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
net/ipv4/tcp_ipv4.c | 9 ++++++---
net/ipv4/tcp_minisocks.c | 20 ++++++++++++++------
7 files changed, 54 insertions(+), 10 deletions(-)