Message-ID: <accaf70a-be01-4de9-9577-196ef5b06109@isc.org>
Date: Fri, 13 Sep 2024 13:57:03 +0200
From: Petr Špaček <pspacek@....org>
To: netdev@...r.kernel.org
Cc: Frederick Lawler <fred@...udflare.com>
Subject: [RFC] Socket Pressure Stall Information / ephemeral port range
depletion info
This RFC relates to "LPC 2023: connect() - why you so slow?" [1] by
Frederick Lawler <fred@...udflare.com>.
Background
==========
LPC quote
> 50k egress unicast connections to a single destination… Who does that?
Not only web proxies, it happens in large DNS server deployments too.
A DNS setup on a single machine often involves a multi-process
implementation (Knot Resolver) and/or proxies (e.g. BIND + dnsdist).
This makes the 'track ephemeral port usage inside the application'
approach unviable.
Problems
========
- Userspace has no visibility into the port range usage ratio.
- Userspace can be blocked for an unknown amount of time in bind() or
connect() when the port range has a high utilization rate.
The milliseconds-long blocking quoted on LPC slide 10 is observed in
DNS land as well.
Corollary: A hardcoded level of parallelism does not work well.
Over time it gets worse because the port range is a fixed size while
CPU counts and processing speeds keep improving. Today a good userspace
DNS implementation can handle 130k query/answer pairs per CPU core per
second, measured on a 64-core system with no bind() mid-flight [3].
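To illustrate the mismatch, some back-of-the-envelope arithmetic (the throughput figure is the one quoted above [3]; the port range and TIME_WAIT values are the Linux defaults, and the one-connection-per-query assumption is deliberately pessimistic):

```python
# Illustrative estimate: how fast could one host exhaust its ephemeral
# port range toward a single (address, protocol) destination?

PORTS = 60999 - 32768 + 1        # default net.ipv4.ip_local_port_range
QPS_PER_CORE = 130_000           # throughput figure quoted above [3]
CORES = 64
TIME_WAIT_SECS = 60              # default TCP TIME_WAIT hold-down

# If each query used a fresh connected socket, the range would last:
seconds_to_exhaust = PORTS / (QPS_PER_CORE * CORES)
print(f"range exhausted in {seconds_to_exhaust * 1000:.1f} ms")

# Ports stuck in TIME_WAIT make it worse: sustaining merely
# PORTS / TIME_WAIT_SECS new connections per second keeps the whole
# range permanently occupied.
print(f"steady-state ceiling: {PORTS / TIME_WAIT_SECS:.0f} conn/s")
```

Real DNS traffic reuses sockets, of course; the point is that the fixed-size range shrinks relative to achievable throughput every hardware generation.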
What can we do?
===============
What do the netdev masterminds suggest as the most tenable approach?
A couple of ideas as a kick-start:
A. Socket Pressure Stall Information
------------------------------------
Modeled after the PSI mechanism already present in the kernel [2].
Cooperating processes can detect contention and lower their level of
(attempted) parallelism when bind() becomes a bottleneck. PSI already
has a notification mechanism, which is handy for applications.
An obvious problem:
the port range is per (address, protocol) tuple. Would one number be
good enough? Well, the same applies to I/O, which is currently also
summarized into a single PSI metric.
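To make option A concrete, here is a sketch of how an application might consume such a metric through the existing PSI trigger interface [2]. The trigger syntax ("some <stall_us> <window_us>", written to the file, then poll() for POLLPRI) is how the current /proc/pressure/{cpu,io,memory} files work; the /proc/pressure/socket path is purely hypothetical:

```python
import select

def psi_trigger(stall_us: int, window_us: int) -> bytes:
    """Build a PSI trigger in the format used by /proc/pressure/* files:
    fire when tasks were stalled for stall_us out of every window_us."""
    return b"some %d %d" % (stall_us, window_us)

def watch_socket_pressure(path="/proc/pressure/socket"):
    # NOTE: /proc/pressure/socket does not exist today; it is the
    # hypothetical file from option A. The existing /proc/pressure/io
    # file is consumed the same way.
    with open(path, "r+b", buffering=0) as f:
        # e.g. notify when bind()/connect() stalled 150 ms per 1 s window
        f.write(psi_trigger(150_000, 1_000_000))
        poller = select.poll()
        poller.register(f, select.POLLPRI)
        while True:
            if poller.poll():          # blocks until the trigger fires
                yield "pressure"       # caller lowers its parallelism
```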
B. Expose state of port range
-----------------------------
Expose number of free ports within net.ipv4.ip_local_port_range for each
(address, protocol) tuple.
As an application developer I would like that, provided access to the
counter is damn cheap. But maybe the accuracy is not worth the complexity?
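For comparison, the best userspace can do today is approximate this itself, e.g. by parsing /proc/net/tcp and counting distinct local ports inside net.ipv4.ip_local_port_range. A rough sketch (the helper below is hypothetical, written for this post; the sample lines mimic the real /proc/net/tcp format with hex local_address:port fields):

```python
def used_ephemeral_ports(proc_net_tcp: str, lo: int, hi: int) -> set:
    """Return the distinct local ports within [lo, hi] found in
    /proc/net/tcp contents. Each data line looks like:
       0: 0100007F:0035 00000000:0000 0A ...
    where the local port is the hex number after the colon."""
    ports = set()
    for line in proc_net_tcp.splitlines()[1:]:   # skip header line
        fields = line.split()
        if len(fields) < 2:
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if lo <= port <= hi:
            ports.add(port)
    return ports

sample = """  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0035 00000000:0000 0A 00000000:00000000
   1: 0100007F:9C40 0100007F:0035 01 00000000:00000000
   2: 0100007F:9C41 0100007F:0035 01 00000000:00000000
"""
lo, hi = 32768, 60999            # default net.ipv4.ip_local_port_range
used = used_ephemeral_ports(sample, lo, hi)
free = (hi - lo + 1) - len(used)
print(f"{len(used)} ephemeral ports in use, {free} free")
```

This is racy, expensive on busy hosts, and ignores per-destination port reuse, which is exactly why a cheap in-kernel counter would be attractive.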
C. Non-blocking bind()
----------------------
My head is about to explode. I doubt it would be worth the overhead in
the typical situation without contention.
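A related, already-available mitigation: the IP_BIND_ADDRESS_NO_PORT socket option defers ephemeral port selection from bind() to connect(), so the kernel can pick a port that only needs to be unique per destination. A sketch (connect_from() is a hypothetical helper; the numeric fallback 24 is the Linux option value, used when the Python build does not export the constant):

```python
import socket

# Linux-only option; value 24 per <linux/in.h>. Note this does not make
# the port search itself asynchronous: even on a non-blocking socket,
# connect() still searches for a free port synchronously, which is
# exactly the stall that option A would expose.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def connect_from(src_addr: str, dst: tuple) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    s.bind((src_addr, 0))        # port 0: no port reserved at bind() time
    s.setblocking(False)
    try:
        s.connect(dst)           # ephemeral port is chosen here
    except BlockingIOError:
        pass                     # completion is reported via writability
    return s
```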
D. Your idea here
-----------------
Any other ideas how to tackle this?
Thank you for your time!
[1]
https://lpc.events/event/17/contributions/1593/attachments/1208/2472/lpc-2023-connect-why-you-so-slow.pdf
[2] https://www.kernel.org/doc/html/latest/accounting/psi.html
[3] https://www.knot-dns.cz/benchmark/
--
Petr Špaček
Internet Systems Consortium