netdev - Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQLnicTicjJhH8gUJK+mpngg5rVoJuQGMiypwtmyC01ZOw@mail.gmail.com>
Date: Thu, 31 Jul 2025 09:41:22 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Matt Fleming <matt@...dmodwrite.com>
Cc: Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Eduard Zingerman <eddyz87@...il.com>, Shuah Khan <shuah@...nel.org>, 
	kernel-team <kernel-team@...udflare.com>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	"open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>, 
	bpf <bpf@...r.kernel.org>, Martin KaFai Lau <martin.lau@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, Network Development <netdev@...r.kernel.org>, 
	Matt Fleming <mfleming@...udflare.com>
Subject: Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks

On Tue, Jul 29, 2025 at 6:56 AM Matt Fleming <matt@...dmodwrite.com> wrote:
>
> On Mon, Jul 28, 2025 at 3:35 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > Please make a full description of what the test does,
> > since it's not trivial to decipher from the code.
> > If I'm reading it correctly, first, the user space
> > makes the LPM completely full and then lookup/update
> > use random key-s within range.
>
> Yep, that's correct. I'll provide a more comprehensive description in v4.
>
> > But delete looks different. It seems the kernel delete
> > operation can interleave with user space refilling the LPM,
> > so ...
> >
> > >   lookup: throughput    7.423 ± 0.023 M ops/s (  7.423M ops/prod), latency  134.710 ns/op
> > >   update: throughput    2.643 ± 0.015 M ops/s (  2.643M ops/prod), latency  378.310 ns/op
> > >   delete: throughput    0.712 ± 0.008 M ops/s (  0.712M ops/prod), latency 1405.152 ns/op
> >
> > this comparison doesn't look like apples to apples,
> > since delete will include user space refill time.
>
> Yeah, you're right. Though we only measure the delete time from in the
> BPF prog, delete throughput is essentially blocked while the refill
> happens and because things are measured with a 1-second timer
> (bench.c:sigalarm_handler) the refill time gets accounted for anyway.
> I could try extrapolating the delete time like I've done for the free
> op, i.e. we calculate how many ops were completed in what fraction of
> a second.
>
> > >     free: throughput    0.574 ± 0.003 K ops/s (  0.574K ops/prod), latency    1.743 ms/op
> >
> > Does this measure the free-ing of full LPM ?
>
> Yes, this measures the total time to free every element in the trie.
>
> > > +static void __lpm_validate(void)
> >
> > why double underscore ?
> > Just lpm_validate() ?
>
> The double underscore is used for the common validation parts, but
> I'll rename this to include "_common()" (along with all other uses of
> __).

No. It's also wrong.
Double underscore or _common suffix are both meaningless.
The helper name should describe what it's for.

> > > +       /*
> > > +        * Ideally we'd have access to the map ID but that's already
> > > +        * freed before we enter trie_free().
> > > +        */
> > > +       BPF_CORE_READ_STR_INTO(&name, map, name);
> > > +       if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
> > > +               return 0;
> > > +
> > > +       val = bpf_ktime_get_ns();
> > > +       bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
> >
> > Looks like there is only one lpm map.
> > What's the point of that extra map ?
> > Just have a global var for start time ?
>
> Sure, I can make this a global variable for the start time instead.
>
> > bpf_get_prandom_u32() is not free
> > and modulo operation isn't free either.
> > The benchmark includes their time.
> > It's ok to have it, but add a mode where the bench
> > tests linear lookup/update too with simple key.data++
>
> Good idea.
>
> > Since the map suppose to full before we start all keys
> > should be there, right?
>
> Yes.
>
> > Let's add a sanity check that update() succeeds.
>
> Will do.
>
> > > +static int delete (__u32 index, bool *need_refill)
> > > +{
> > > +       struct trie_key key = {
> > > +               .data = deleted_entries,
> > > +               .prefixlen = prefixlen,
> > > +       };
> > > +
> > > +       bpf_map_delete_elem(&trie_map, &key);
> > > +
> > > +       /* Do we need to refill the map? */
> > > +       if (++deleted_entries == nr_entries) {
> > > +               /*
> > > +                * Atomicity isn't required because DELETE only supports
> > > +                * one producer running concurrently. What we need is a
> > > +                * way to track how many entries have been deleted from
> > > +                * the trie between consecutive invocations of the BPF
> > > +                * prog because a single bpf_loop() call might not
> > > +                * delete all entries, e.g. when NR_LOOPS < nr_entries.
> > > +                */
> > > +               deleted_entries = 0;
> > > +               *need_refill = true;
> > > +               return 1;
> >
> > This early break is another reason that makes
> > 'delete' op different from 'lookup/update'.
> > Pls make all 3 consistent, so they can be compared to each other.
>
> Hmm.. I'm not quite sure how to do that. lookup/update don't modify
> the number of entries in the map whereas delete does (update only
> updates existing entries, it doesn't create new ones). So when the map
> is empty it needs to be refilled. You're right that somehow the time
> to refill needs to be removed from the delete throughput/latency
> numbers but fundamentally these 3 ops are not the same and I don't see
> how to treat them as such.

well, random-key update when the map is full is also quite different from
random-key update when the map is empty.

Instead doing an update from user space do timed ops:
1 start with empty map, update (aka insert) all keys sequentially
2 lookup all sequentially
3 delete all sequentially
4 update (aka insert) all sequentially
5 lookup random
6 update random
7 delete all random

The elapsed time for 1 and 4 should be exactly the same.
While all others might have differences,
but all can be compared to each other and reasoned about.