linux-kernel - Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQLev2V-ARjPc9EPYaSssCev_87Lc0NWkLvL-5tuy=3Veg@mail.gmail.com>
Date: Wed, 26 Feb 2025 19:17:07 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Hou Tao <houtao@...weicloud.com>
Cc: Zvi Effron <zeffron@...tgames.com>, Toke Høiland-Jørgensen <toke@...nel.org>, 
	bpf <bpf@...r.kernel.org>, rcu@...r.kernel.org, 
	LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>, 
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, 
	Jiri Olsa <jolsa@...nel.org>, "Paul E . McKenney" <paulmck@...nel.org>, Cody Haas <chaas@...tgames.com>, 
	Hou Tao <hotforest@...il.com>
Subject: Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash
 map atomically

On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@...weicloud.com> wrote:
>
> >>
> >> lookup procedure A
> >> A: find the old element (instead of the new old)
> >>
> >>               update procedure B
> >>               B: delete the old element
> >>               update procedure C on the same CPU:
> >>               C: reuse the old element (overwrite its key and insert in
> >> the same bucket)
> >>
> >> A: the key is mismatched and return -ENOENT.
> > This is fine. It's just normal reuse.
> > Orthogonal to 'update as insert+delete' issue.
>
> OK. However, it will break the lookup procedure because it expects it
> will return an valid result instead of -ENOENT.

What do you mean 'breaks the lookup' ?
lookup_elem_raw() matches hash, and then it memcmp(key),
if the element is reused anything can happen.
Either it succeeds in memcmp() and returns an elem,
or miscompares in memcmp().
Both are expected, because elems are reused in place.

And this behavior is expected and not-broken,
because bpf prog that does lookup on one cpu and deletes
the same element on the other cpu is asking for trouble.
bpf infra guarantees the safety of the kernel.
It doesn't guarantee that bpf progs are sane.

> > It's been a long time since I looked into rcu_nulls details.
> > Pls help me understand that this new replace_rcu_nulls()
> > is correct from nulls pov,
> > If it is then this patch set may be the right answer to non-atomic update.
>
> If I understand correctly, only the manipulations of ->first pointer and
> ->next pointer need to take care of nulls pointer.

hmm. I feel we're still talking past each other.
See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
It's there because of reuse. The lookup can start in one bucket
and finish in another.

> >
> > And for the future, please please focus on "why" part in
> > the cover letter and commit logs instead of "what".
> >
> > Since the only thing I got from the log was:
> > "Currently, the update is not atomic
> > because the overwrite of existing element happens in a two-steps way,
> > but the support of atomic update is feasible".
> >
> > "is feasible" doesn't explain "why".
> >
> > Link to xdp-newbie question is nice for additional context,
> > but reviewers should not need to go and read some thread somewhere
> > to understand "why" part.
> > All of it should be in the commit log.
>
> OK. My original thought is that is a reported problem, so an extra link
> will be enough. Will try to add more context next time.
> >
> >> map may still be incorrect (as shown long time ago [1]), so I think
> >> maybe for other types of map, the atomic update doesn't matter too much.
> >>
> >> [1]:
> >> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> > A thread from 3 years ago ?! Sorry, it's not helpful to ask
> > people to page-in such an old context with lots of follow ups
> > that may or may not be relevant today.
> Let me reuse part of the diagram above to explain how does the lookup
> procedure may return a incorrect value:
>
> lookup procedure A
> A: find the old element (instead of the new element)
>
>
>               update procedure B
>               B: delete the old element
>               update procedure C on the same CPU:
>
>
> A: the key is matched and return the value in the element
>
>               C: reuse the old element (overwrite its key and value)
>
> A: read the value (it is incorrect, because it has been reused and
> overwritten)

... and it's fine. It's by design. It's an element reuse behavior.

Long ago hashmap had two modes: prealloc (default) and
NO_PREALLOC (call_rcu + kfree)

The call_rcu part was there to make things safe.
The memory cannot be kfree-ed to the kernel until RCU GP.
With bpf_mem_alloc hashmap elements are freed back to bpf_ma
right away. Hashmap is doing bpf_mem_cache_free()
(instead of bpf_mem_cache_free_rcu()) because users need speed.
So since 2022 both prealloc and no_prealloc reuse elements.
We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
that will use _rcu() flavor of freeing into bpf_ma,
but it has to have a strong reason.
And soon as we add it the default with prealloc would need
to use call_rcu() too, right?
and that becomes nightmare, since bpf prog can easily DoS the system.
Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
Unlike new things like bpf_obj_new/obj_drop the hashmap
is unpriv, so concerns are drastically different.