Message-ID: <accaf70a-be01-4de9-9577-196ef5b06109@isc.org>
Date: Fri, 13 Sep 2024 13:57:03 +0200
From: Petr Špaček <pspacek@....org>
To: netdev@...r.kernel.org
Cc: Frederick Lawler <fred@...udflare.com>
Subject: [RFC] Socket Pressure Stall Information / ephemeral port range
depletion info
This RFC relates to "LPC 2023: connect() - why you so slow?" [1] by
Frederick Lawler <fred@...udflare.com>.
Background
==========
LPC quote
> 50k egress unicast connections to a single destination… Who does that?
Not only web proxies, it happens in large DNS server deployments too.
A DNS setup on a single machine often involves a multi-process
implementation (Knot Resolver) and/or proxies (e.g. BIND + dnsdist).
This makes the 'track ephemeral port usage inside the application'
approach unviable.
Problems
========
- Userspace has no visibility into the port range usage ratio.
- Userspace can be blocked for an unknown amount of time in bind() or
connect() when the port range has a high utilization rate.
The milliseconds-long blocking quoted on LPC slide 10 is observed in
DNS land as well.
Corollary: A hardcoded level of parallelism does not work well.
Over time it gets worse because the port range is a fixed size while
CPU counts and processing speeds keep improving. Today a good userspace
DNS implementation can handle 130k query/answer pairs per CPU core per
second, measured on a 64-core system with no bind() mid-flight [3].
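To illustrate the mismatch, some back-of-the-envelope arithmetic (the throughput figure is the one quoted above [3]; the port range and TIME_WAIT values are the Linux defaults, and the one-connection-per-query assumption is deliberately pessimistic):

```python
# Illustrative estimate: how fast could one host exhaust its ephemeral
# port range toward a single (address, protocol) destination?

PORTS = 60999 - 32768 + 1        # default net.ipv4.ip_local_port_range
QPS_PER_CORE = 130_000           # throughput figure quoted above [3]
CORES = 64
TIME_WAIT_SECS = 60              # default TCP TIME_WAIT hold-down

# If each query used a fresh connected socket, the range would last:
seconds_to_exhaust = PORTS / (QPS_PER_CORE * CORES)
print(f"range exhausted in {seconds_to_exhaust * 1000:.1f} ms")

# Ports stuck in TIME_WAIT make it worse: sustaining merely
# PORTS / TIME_WAIT_SECS new connections per second keeps the whole
# range permanently occupied.
print(f"steady-state ceiling: {PORTS / TIME_WAIT_SECS:.0f} conn/s")
```

Real DNS traffic reuses sockets, of course; the point is that the fixed-size range shrinks relative to achievable throughput every hardware generation.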
What can we do?
===============
What do the netdev masterminds suggest as the most tenable approach?
A couple of ideas as a kick-start:
A. Socket Pressure Stall Information
------------------------------------
Modeled after the PSI mechanism already present in the kernel [2].
Cooperating processes can detect contention and lower their level of
(attempted) parallelism when bind() becomes a bottleneck. PSI already
has a notification mechanism, which is handy for applications.
An obvious problem:
the port range is per (address, protocol) tuple. Would one number be
good enough? Well, the same applies to I/O, which is currently also
summarized into a single PSI metric.
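To make option A concrete, here is a sketch of how an application might consume such a metric through the existing PSI trigger interface [2]. The trigger syntax ("some <stall_us> <window_us>", written to the file, then poll() for POLLPRI) is how the current /proc/pressure/{cpu,io,memory} files work; the /proc/pressure/socket path is purely hypothetical:

```python
import select

def psi_trigger(stall_us: int, window_us: int) -> bytes:
    """Build a PSI trigger in the format used by /proc/pressure/* files:
    fire when tasks were stalled for stall_us out of every window_us."""
    return b"some %d %d" % (stall_us, window_us)

def watch_socket_pressure(path="/proc/pressure/socket"):
    # NOTE: /proc/pressure/socket does not exist today; it is the
    # hypothetical file from option A. The existing /proc/pressure/io
    # file is consumed the same way.
    with open(path, "r+b", buffering=0) as f:
        # e.g. notify when bind()/connect() stalled 150 ms per 1 s window
        f.write(psi_trigger(150_000, 1_000_000))
        poller = select.poll()
        poller.register(f, select.POLLPRI)
        while True:
            if poller.poll():          # blocks until the trigger fires
                yield "pressure"       # caller lowers its parallelism
```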
B. Expose state of port range
-----------------------------
Expose number of free ports within net.ipv4.ip_local_port_range for each
(address, protocol) tuple.
As an application developer I would like that, provided access to the
counter is damn cheap. But maybe the accuracy is not worth the complexity?
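For comparison, the best userspace can do today is approximate this itself, e.g. by parsing /proc/net/tcp and counting distinct local ports inside net.ipv4.ip_local_port_range. A rough sketch (the helper below is hypothetical, written for this post; the sample lines mimic the real /proc/net/tcp format with hex local_address:port fields):

```python
def used_ephemeral_ports(proc_net_tcp: str, lo: int, hi: int) -> set:
    """Return the distinct local ports within [lo, hi] found in
    /proc/net/tcp contents. Each data line looks like:
       0: 0100007F:0035 00000000:0000 0A ...
    where the local port is the hex number after the colon."""
    ports = set()
    for line in proc_net_tcp.splitlines()[1:]:   # skip header line
        fields = line.split()
        if len(fields) < 2:
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if lo <= port <= hi:
            ports.add(port)
    return ports

sample = """  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0035 00000000:0000 0A 00000000:00000000
   1: 0100007F:9C40 0100007F:0035 01 00000000:00000000
   2: 0100007F:9C41 0100007F:0035 01 00000000:00000000
"""
lo, hi = 32768, 60999            # default net.ipv4.ip_local_port_range
used = used_ephemeral_ports(sample, lo, hi)
free = (hi - lo + 1) - len(used)
print(f"{len(used)} ephemeral ports in use, {free} free")
```

This is racy, expensive on busy hosts, and ignores per-destination port reuse, which is exactly why a cheap in-kernel counter would be attractive.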
C. Non-blocking bind()
----------------------
My head is about to explode. I doubt it would be worth the overhead in
the typical situation without contention.
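A related, already-available mitigation: the IP_BIND_ADDRESS_NO_PORT socket option defers ephemeral port selection from bind() to connect(), so the kernel can pick a port that only needs to be unique per destination. A sketch (connect_from() is a hypothetical helper; the numeric fallback 24 is the Linux option value, used when the Python build does not export the constant):

```python
import socket

# Linux-only option; value 24 per <linux/in.h>. Note this does not make
# the port search itself asynchronous: even on a non-blocking socket,
# connect() still searches for a free port synchronously, which is
# exactly the stall that option A would expose.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def connect_from(src_addr: str, dst: tuple) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    s.bind((src_addr, 0))        # port 0: no port reserved at bind() time
    s.setblocking(False)
    try:
        s.connect(dst)           # ephemeral port is chosen here
    except BlockingIOError:
        pass                     # completion is reported via writability
    return s
```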
D. Your idea here
-----------------
Any other ideas how to tackle this?
Thank you for your time!
[1]
https://lpc.events/event/17/contributions/1593/attachments/1208/2472/lpc-2023-connect-why-you-so-slow.pdf
[2] https://www.kernel.org/doc/html/latest/accounting/psi.html
[3] https://www.knot-dns.cz/benchmark/
--
Petr Špaček
Internet Systems Consortium