Message-ID: <27090b7a-ab98-49b5-b612-f0d8471228f4@isc.org>
Date: Thu, 19 Sep 2024 13:02:35 +0200
From: Petr Špaček <pspacek@....org>
To: netdev@...r.kernel.org
Cc: Frederick Lawler <fred@...udflare.com>,
Jakub Sitnicki <jakub@...udflare.com>
Subject: Re: [RFC] Socket Pressure Stall Information / ephemeral port range
depletion info
On 13. 09. 24 13:57, Petr Špaček wrote:
> This RFC relates to "LPC 2023: connect() - why you so slow?" [1] by
> Frederick Lawler <fred@...udflare.com>.
...
> Problems
> ========
> - Userspace has no visibility into port range usage ratio.
> - Userspace can be blocked for an unknown amount of time on bind() or
> connect() when the port range has a high utilization rate.
> Millisecond-long blocking quoted on LPC slide 10 is observed in DNS
> land as well.
>
> Corollary: Hardcoded level of parallelism does not work well.
>
> Over time it gets worse because the port range is a fixed size but the
> number of CPUs and processing speeds improve. Today a good userspace
> DNS implementation can handle 130 k query/answer pairs per CPU core
> per second. Measured on a 64-core system with no bind() mid-flight [3].
This is an answer to an in-person request at LPC to clarify. Here's a
DNS example with real numbers, rounded:
- Assume DNS query-answer rate ~ 100 k / sec / CPU core
- Assume DNS resolver with cache hit rate ~ 95 %
- Assume DNS cache hit costs nothing to process
- Cache miss requires communication over the network on a random port =>
bind(0) is needed, possibly for UDP followed by TCP connect() (a
minimal sketch follows this list)
- Cache miss processing must not block event loop / cache hit answers
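
For clarity, here is a minimal sketch of the bind(0) call in question
(IPv4/UDP only, error handling trimmed; the setup details are
illustrative, not taken from any particular resolver):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);  /* 0 = kernel picks an ephemeral port */

    /* This is the call that slows down when the range is depleted. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);
    printf("kernel picked port %u\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}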
These assumptions give us a CPU time budget of
1 sec / (100 000 requests * 5 % cache miss)
= 0.2 ms per single cache miss request.
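
For the record, the same arithmetic spelled out in a trivial C snippet
(all rates are the assumptions stated above, not measurements):

#include <stdio.h>

int main(void)
{
    double qps = 100000.0;    /* assumed queries/sec per CPU core */
    double miss_rate = 0.05;  /* assumed 5 % cache misses */

    double misses_per_sec = qps * miss_rate;     /* 5000 misses/sec */
    double budget_ms = 1000.0 / misses_per_sec;  /* 0.2 ms per miss */

    printf("CPU time budget per cache miss: %.1f ms\n", budget_ms);
    return 0;
}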
This works fine on a system which has enough source ports. Trouble
starts when the port range has high utilization.
"perf trace --summary" from a system suffering port range depletion:
syscall    calls  errors     total     min     avg     max  stddev
                             (msec)  (msec)  (msec)  (msec)     (%)
-------    -----  ------  --------  ------  ------  ------  ------
bind        6301       0  6753.553   0.000   1.072   9.031   2.12%
With a 1 ms average per bind() we are 5x over the CPU time budget and
overall system throughput goes down the toilet. bind() blocks the event
loop and, depending on the resolver architecture, can block processing
of cache hits as well.
If the resolver knew that the port range is depleted, it could refuse
to process requests which result in a cache miss, and thus not waste
CPU cycles on vain bind() attempts, preserving throughput for cache hits.
In other words, it's about going from a constant value for the number
of requests processed in parallel to dynamic behavior / auto-tuning
depending on the workload. A rough userspace approximation of the
missing visibility is sketched below.
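
To illustrate what such visibility could look like even without kernel
help, here is a userspace sketch which estimates ephemeral port
utilization by scanning /proc/net/tcp (IPv4 TCP only; a real version
would also have to scan udp, tcp6 and udp6, and would still race with
the kernel, which is why a native counter would be preferable):

#include <stdio.h>

int main(void)
{
    unsigned lo = 0, hi = 0;
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
    if (!f || fscanf(f, "%u %u", &lo, &hi) != 2) {
        perror("ip_local_port_range");
        return 1;
    }
    fclose(f);

    f = fopen("/proc/net/tcp", "r");
    if (!f) {
        perror("/proc/net/tcp");
        return 1;
    }

    /* Count distinct local ports inside the ephemeral range. Lines
     * look like "  0: 0100007F:0277 ..." with the local port in hex. */
    char line[512];
    unsigned char seen[65536] = { 0 };
    unsigned used = 0, port;

    fgets(line, sizeof(line), f);  /* skip the header line */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, " %*d: %*x:%x", &port) == 1 &&
            port >= lo && port <= hi && !seen[port]) {
            seen[port] = 1;
            used++;
        }
    }
    fclose(f);

    printf("ephemeral range %u-%u: %u/%u ports in use (%.1f %%)\n",
           lo, hi, used, hi - lo + 1,
           100.0 * used / (hi - lo + 1));
    return 0;
}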
A different DNS use-case, zone transfers between authoritative DNS
servers, can suffer from the same problem as well, since it involves
lots of short-lived TCP transactions.
I'm happy to supply more details as needed.
--
Petr Špaček
Internet Systems Consortium