linux-kernel - Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGis_TUNfUOD3+GdbJn1U33W8wW5pWmASxiMa5e5+5-BqJ-PKw@mail.gmail.com>
Date: Mon, 30 Jun 2025 14:28:06 +0100
From: Matt Fleming <mfleming@...udflare.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Matt Fleming <matt@...dmodwrite.com>, Ignat Korchagin <ignat@...udflare.com>, 
	Song Liu <song@...nel.org>, Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, 
	Eduard Zingerman <eddyz87@...il.com>, Yonghong Song <yonghong.song@...ux.dev>, 
	John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>, 
	Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>, 
	bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>, 
	kernel-team <kernel-team@...udflare.com>, Jesper Dangaard Brouer <hawk@...nel.org>
Subject: Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()

On Fri, 27 Jun 2025 at 20:36, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> Good. Now you see my point, right?
> The cond_resched() doesn't fix the issue.
> 1hr to free a trie of 100M elements is horrible.
> Try 100M kmalloc/kfree to see that slab is not the issue.
> trie_free() algorithm is to blame. It doesn't need to start
> from the root for every element. Fix the root cause.

It doesn't take an hour to free 100M entries, the table showed it
takes about a minute (67 or 62 seconds).

I never claimed that kmalloc/kfree was at fault. I said that the loop
in trie_free() has no preemption, and that's a problem with tries with
millions of entries.

Of course, rewriting the algorithm used in the lpm trie code would
make this less of an issue. But this would require a major rework.
It's not as simple as improving trie_free() alone. FWIW I tried using
a recursive algorithm in trie_free() and the results are slightly
better, but it still takes multiple seconds to free 10M entries (4.3s)
and under a minute for 100M (56.7s). To fix this properly it's
necessary to use more than two children per node to reduce the height
of the trie. And in the meantime, anyone who uses maps with millions
of entries is gonna have the kthread spin in a loop without
preemption.

Thanks,
Matt