[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6b70aa4e-68c4-20de-b042-c549efd6a901@huaweicloud.com>
Date: Thu, 27 Feb 2025 12:08:26 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Zvi Effron <zeffron@...tgames.com>,
Toke Høiland-Jørgensen <toke@...nel.org>,
bpf <bpf@...r.kernel.org>, rcu@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
<eddyz87@...il.com>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>, "Paul E . McKenney" <paulmck@...nel.org>,
Cody Haas <chaas@...tgames.com>, Hou Tao <hotforest@...il.com>
Subject: Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash
map atomically
Hi,
On 2/27/2025 11:17 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@...weicloud.com> wrote:
>>>> lookup procedure A
>>>> A: find the old element (instead of the new old)
>>>>
>>>> update procedure B
>>>> B: delete the old element
>>>> update procedure C on the same CPU:
>>>> C: reuse the old element (overwrite its key and insert in
>>>> the same bucket)
>>>>
>>>> A: the key is mismatched and return -ENOENT.
>>> This is fine. It's just normal reuse.
>>> Orthogonal to 'update as insert+delete' issue.
>> OK. However, it will break the lookup procedure because it expects it
>> will return an valid result instead of -ENOENT.
> What do you mean 'breaks the lookup' ?
> lookup_elem_raw() matches hash, and then it memcmp(key),
> if the element is reused anything can happen.
> Either it succeeds in memcmp() and returns an elem,
> or miscompares in memcmp().
> Both are expected, because elems are reused in place.
>
> And this behavior is expected and not-broken,
> because bpf prog that does lookup on one cpu and deletes
> the same element on the other cpu is asking for trouble.
It seems I mislead you in the above example. It is also possible for
doing lookup in one CPU and updating the same element in other CPU. So
does such concurrence also look insane ?
lookup procedure A
A: find the old element (instead of the new old)
update procedure B
B: overwrite the same element and free the old element
update procedure C on the same CPU:
C: reuse the old element (overwrite its key and insert in
the same bucket)
A: the key is mismatched and return -ENOENT.
> bpf infra guarantees the safety of the kernel.
> It doesn't guarantee that bpf progs are sane.
>
>>> It's been a long time since I looked into rcu_nulls details.
>>> Pls help me understand that this new replace_rcu_nulls()
>>> is correct from nulls pov,
>>> If it is then this patch set may be the right answer to non-atomic update.
>> If I understand correctly, only the manipulations of ->first pointer and
>> ->next pointer need to take care of nulls pointer.
> hmm. I feel we're still talking past each other.
> See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
> It's there because of reuse. The lookup can start in one bucket
> and finish in another.
Yes. I noticed that. However, it happens when the deleted element is
reused and inserted to a different bucket, right ? For
replace_rcu_nulls(), the reuse of old element is possible only after
replace_rcu_nulls() completes, so I think it is irrelevant.
>If it is then this patch set may be the right answer to non-atomic update.
I also didn't follow that. Due to the immediate reuse, the lookup
procedure may still return -ENOENT when there is concurrent overwriting
of the same element as show above, so I think it only reduce the
possibility. I still prefer to implement a separated update procedure of
htab of maps to fix the atomic update problem completely for htab of
maps first.
>
>>> And for the future, please please focus on "why" part in
>>> the cover letter and commit logs instead of "what".
>>>
>>> Since the only thing I got from the log was:
>>> "Currently, the update is not atomic
>>> because the overwrite of existing element happens in a two-steps way,
>>> but the support of atomic update is feasible".
>>>
>>> "is feasible" doesn't explain "why".
>>>
>>> Link to xdp-newbie question is nice for additional context,
>>> but reviewers should not need to go and read some thread somewhere
>>> to understand "why" part.
>>> All of it should be in the commit log.
>> OK. My original thought is that is a reported problem, so an extra link
>> will be enough. Will try to add more context next time.
>>>> map may still be incorrect (as shown long time ago [1]), so I think
>>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>>> A thread from 3 years ago ?! Sorry, it's not helpful to ask
>>> people to page-in such an old context with lots of follow ups
>>> that may or may not be relevant today.
>> Let me reuse part of the diagram above to explain how does the lookup
>> procedure may return a incorrect value:
>>
>> lookup procedure A
>> A: find the old element (instead of the new element)
>>
>>
>> update procedure B
>> B: delete the old element
>> update procedure C on the same CPU:
>>
>>
>> A: the key is matched and return the value in the element
>>
>> C: reuse the old element (overwrite its key and value)
>>
>> A: read the value (it is incorrect, because it has been reused and
>> overwritten)
> ... and it's fine. It's by design. It's an element reuse behavior.
>
> Long ago hashmap had two modes: prealloc (default) and
> NO_PREALLOC (call_rcu + kfree)
>
> The call_rcu part was there to make things safe.
> The memory cannot be kfree-ed to the kernel until RCU GP.
> With bpf_mem_alloc hashmap elements are freed back to bpf_ma
> right away. Hashmap is doing bpf_mem_cache_free()
> (instead of bpf_mem_cache_free_rcu()) because users need speed.
> So since 2022 both prealloc and no_prealloc reuse elements.
> We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
> that will use _rcu() flavor of freeing into bpf_ma,
> but it has to have a strong reason.
> And soon as we add it the default with prealloc would need
> to use call_rcu() too, right?
> and that becomes nightmare, since bpf prog can easily DoS the system.
> Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
> Unlike new things like bpf_obj_new/obj_drop the hashmap
> is unpriv, so concerns are drastically different.
>
> .
I see. Thanks for the detailed explanation. And that is part of the
reason why I proposed adding a seq-counter in the htab-element and
checking the seq-counter during lookup, so at least the lookup will not
return -ENOENT for the concurrent overwriting procedure.
Powered by blists - more mailing lists