Message-ID: <27090b7a-ab98-49b5-b612-f0d8471228f4@isc.org>
Date: Thu, 19 Sep 2024 13:02:35 +0200
From: Petr Špaček <pspacek@....org>
To: netdev@...r.kernel.org
Cc: Frederick Lawler <fred@...udflare.com>,
Jakub Sitnicki <jakub@...udflare.com>
Subject: Re: [RFC] Socket Pressure Stall Information / ephemeral port range
depletion info
On 13. 09. 24 13:57, Petr Špaček wrote:
> This RFC relates to "LPC 2023: connect() - why you so slow?" [1] by
> Frederick Lawler <fred@...udflare.com>.
...
> Problems
> ========
> - Userspace has no visibility into port range usage ratio.
> - Userspace can be blocked for an unknown amount of time on bind() or
> connect() when the port range has a high utilization rate.
> Millisecond-long blocking quoted on LPC slide 10 is observed in DNS
> land as well.
>
> Corollary: Hardcoded level of parallelism does not work well.
>
> Over time it gets worse because the port range is a fixed size but the
> number of CPUs and processing speeds improve. Today a good userspace
> DNS implementation can handle 130 k query/answer pairs per CPU core
> per second. Measured on a 64-core system with no bind() mid-flight [3].
This is an answer to an in-person request at LPC to clarify. Here's a
DNS example with real numbers, rounded:
- Assume DNS query-answer rate ~ 100 k / sec / CPU core
- Assume DNS resolver with cache hit rate ~ 95 %
- Assume DNS cache hit costs nothing to process
- Cache miss requires communication over the network on a random port =>
bind(0) is needed, possibly for UDP followed by TCP connect() (a
minimal sketch follows this list)
- Cache miss processing must not block event loop / cache hit answers
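
For clarity, here is a minimal sketch of the bind(0) call in question
(IPv4/UDP only, error handling trimmed; the setup details are
illustrative, not taken from any particular resolver):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);  /* 0 = kernel picks an ephemeral port */

    /* This is the call that slows down when the range is depleted. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    socklen_t len = sizeof(addr);
    getsockname(fd, (struct sockaddr *)&addr, &len);
    printf("kernel picked port %u\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}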
These assumptions give us a CPU time budget of
1 sec / (100 000 requests * 5 % cache miss)
= 0.2 ms per single cache miss request.
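
For the record, the same arithmetic spelled out in a trivial C snippet
(all rates are the assumptions stated above, not measurements):

#include <stdio.h>

int main(void)
{
    double qps = 100000.0;    /* assumed queries/sec per CPU core */
    double miss_rate = 0.05;  /* assumed 5 % cache misses */

    double misses_per_sec = qps * miss_rate;     /* 5000 misses/sec */
    double budget_ms = 1000.0 / misses_per_sec;  /* 0.2 ms per miss */

    printf("CPU time budget per cache miss: %.1f ms\n", budget_ms);
    return 0;
}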
This works fine on a system which has enough source ports. Trouble
starts when the port range has high utilization.
"perf trace --summary" from a system suffering port range depletion:
syscall    calls  errors     total     min     avg     max  stddev
                             (msec)  (msec)  (msec)  (msec)     (%)
-------    -----  ------  --------  ------  ------  ------  ------
bind        6301       0  6753.553   0.000   1.072   9.031   2.12%
With a 1 ms average per bind() we are 5x over the CPU time budget and
overall system throughput goes down the toilet. bind() blocks the event
loop and, depending on the resolver architecture, can block processing
of cache hits as well.
If the resolver knew that the port range is depleted, it could refuse
to process requests which result in a cache miss, and thus not waste
CPU cycles on vain bind() attempts, preserving throughput for cache hits.
In other words, it's about going from a constant value for the number
of requests processed in parallel to dynamic behavior / auto-tuning
depending on the workload. A rough userspace approximation of the
missing visibility is sketched below.
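
To illustrate what such visibility could look like even without kernel
help, here is a userspace sketch which estimates ephemeral port
utilization by scanning /proc/net/tcp (IPv4 TCP only; a real version
would also have to scan udp, tcp6 and udp6, and would still race with
the kernel, which is why a native counter would be preferable):

#include <stdio.h>

int main(void)
{
    unsigned lo = 0, hi = 0;
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
    if (!f || fscanf(f, "%u %u", &lo, &hi) != 2) {
        perror("ip_local_port_range");
        return 1;
    }
    fclose(f);

    f = fopen("/proc/net/tcp", "r");
    if (!f) {
        perror("/proc/net/tcp");
        return 1;
    }

    /* Count distinct local ports inside the ephemeral range. Lines
     * look like "  0: 0100007F:0277 ..." with the local port in hex. */
    char line[512];
    unsigned char seen[65536] = { 0 };
    unsigned used = 0, port;

    fgets(line, sizeof(line), f);  /* skip the header line */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, " %*d: %*x:%x", &port) == 1 &&
            port >= lo && port <= hi && !seen[port]) {
            seen[port] = 1;
            used++;
        }
    }
    fclose(f);

    printf("ephemeral range %u-%u: %u/%u ports in use (%.1f %%)\n",
           lo, hi, used, hi - lo + 1,
           100.0 * used / (hi - lo + 1));
    return 0;
}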
A different DNS use-case, zone transfers between authoritative DNS
servers, can suffer from the same problem as well, since it involves
lots of short-lived TCP transactions.
I'm happy to supply more details as needed.
--
Petr Špaček
Internet Systems Consortium