Message-ID: <6851cb4dcdae7_2f713f294e4@willemb.c.googlers.com.notmuch>
Date: Tue, 17 Jun 2025 16:08:45 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Stanislav Fomichev <stfomichev@...il.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: bpf@...r.kernel.org,
netdev@...r.kernel.org,
ast@...nel.org,
daniel@...earbox.net,
john.fastabend@...il.com,
martin.lau@...ux.dev,
Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global
table starvation
Stanislav Fomichev wrote:
> On 06/16, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@...gle.com>
> >
> > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > map is full, due to percpu reservations and force shrink before
> > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > it will steal one element from a neighbor once, and from then on
> > repeatedly flush that one element to the global list and immediately
> > recycle it.
> >
> > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > with 79 CPUs. CPU 79 will observe this behavior even while its
> > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> >
> > CPUs need not be active concurrently. The issue can appear with
> > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > hold onto its 128 elements indefinitely.
> >
> > Avoid global list exhaustion by limiting aggregate percpu caches to
> > half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > This change has no effect on sufficiently large tables.
>
> The code and rationale look good to me!
Great :)
> There is also
> Documentation/bpf/map_lru_hash_update.dot which mentions
> LOCAL_FREE_TARGET, not sure if it's easy to convey these clamping
> details in there? Or, instead, maybe expand on it in
> Documentation/bpf/map_hash.rst?
Good catch. How about I replace LOCAL_FREE_TARGET with target_free in the
graph, and add something like the following diff to map_hash.rst:
- Attempt to use CPU-local state to batch operations
-- Attempt to fetch free nodes from global lists
+- Attempt to fetch ``target_free`` free nodes from global lists
- Attempt to pull any node from a global list and remove it from the hashmap
- Attempt to pull any node from any CPU's list and remove it from the hashmap
+The number of nodes to borrow from the global list in a batch, ``target_free``,
+depends on the size of the map. A larger batch size reduces lock contention,
+but may also exhaust the global structure. The value is computed at map init
+to avoid exhaustion, by limiting aggregate reservation by all CPUs to half the
+map size, bounded by a minimum of 1 and a maximum of 128 per batch.
Btw, there is also great documentation at
https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LRU_HASH/. It had a
small error in the order of the Attempt operations above, which I fixed
up this week. I'll also update the LOCAL_FREE_TARGET value there.
Since it explains the LRU mechanism well, should I link to it as well?
> This <size>/<nrcpu>/2 is a heuristic,
> so maybe we can give some guidance on the recommended fill level for
> small (size/nrcpu < 128) maps?
I don't know if we can suggest a size that works for all cases. It depends on
factors like the number of CPUs that actively update the map and how tolerant
the workload is of prematurely evicted elements.