linux-kernel - Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e1b65f13-a426-d707-0319-f57e8b15575a@huaweicloud.com>
Date: Thu, 27 Feb 2025 10:43:07 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Zvi Effron <zeffron@...tgames.com>,
 Toke Høiland-Jørgensen <toke@...nel.org>,
 bpf <bpf@...r.kernel.org>, rcu@...r.kernel.org,
 LKML <linux-kernel@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
 <eddyz87@...il.com>, Song Liu <song@...nel.org>,
 Yonghong Song <yonghong.song@...ux.dev>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, "Paul E . McKenney" <paulmck@...nel.org>,
 Cody Haas <chaas@...tgames.com>, Hou Tao <hotforest@...il.com>
Subject: Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash
 map atomically

Hi,

On 2/27/2025 9:59 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@...weicloud.com> wrote:
>> Hi,
>>
>> On 2/27/2025 7:17 AM, Zvi Effron wrote:
>>> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
>>> <alexei.starovoitov@...il.com> wrote:
>>>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@...weicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
>>>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@...weicloud.com> wrote:
>>>>>>> Hi Toke,
>>>>>>>
>>>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> Hou Tao <houtao@...weicloud.com> writes:
>>>>>>>>
>>>>>>>>> +cc Cody Haas
>>>>>>>>>
>>>>>>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>>>>>>
>>>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>>>>>>> Currently, the update of existing element in hash map involves two
>>>>>>>>>> steps:
>>>>>>>>>> 1) insert the new element at the head of the hash list
>>>>>>>>>> 2) remove the old element
>>>>>>>>>>
>>>>>>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>>>>>>> either the old element or the new element if the lookup operation starts
>>>>>>>>>> before the addition and continues after the removal.
>>>>>>>>>>
>>>>>>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>>>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>>>>>>> operation: it will either find the old element or the new element.
>>>>>> I'm missing the point.
>>>>>> This "atomic" replacement doesn't really solve anything.
>>>>>> lookup will see one element.
>>>>>> That element could be deleted by another thread.
>>>>>> bucket lock and either two step update or single step
>>>>>> don't change anything from the pov of bpf prog doing lookup.
>>>>> The point is that overwriting an existed element may lead to concurrent
>>>>> lookups return ENOENT as demonstrated by the added selftest and the
>>>>> patch tried to "fix" that. However, it seems using
>>>>> hlist_nulls_replace_rcu() for the overwriting update is still not
>>>>> enough. Because when the lookup procedure found the old element, the old
>>>>> element may be reusing, the comparison of the map key may fail, and the
>>>>> lookup procedure may still return ENOENT.
>>>> you mean l_old == l_new ? I don't think it's possible
>>>> within htab_map_update_elem(),
>>>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>>>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>>>> and just deleted elem is reused to be inserted
>>>> into another bucket.
>> No. I mean the following procedure in which the lookup procedure finds
>> the old element and tries to match its key, one update procedure has
>> already deleted the old element, and another update procedure is reusing
>> the old element:
>>
>> lookup procedure A
>> A: find the old element (instead of the new old)
>>
>>               update procedure B
>>               B: delete the old element
>>               update procedure C on the same CPU:
>>               C: reuse the old element (overwrite its key and insert in
>> the same bucket)
>>
>> A: the key is mismatched and return -ENOENT.
> This is fine. It's just normal reuse.
> Orthogonal to 'update as insert+delete' issue.

OK. However, it will break the lookup procedure because it expects it
will return an valid result instead of -ENOENT.
>
>> It can be reproduced by increasing ctx.loop in the selftest.
>>>> I'm not sure whether this new hlist_nulls_replace_rcu()
>>>> primitive works with nulls logic.
>>>>
>>>> So back to the problem statement..
>>>> Are you saying that adding new to a head while lookup is in the middle
>>>> is causing it to miss an element that
>>>> is supposed to be updated assuming atomicity of the update?
>>>> While now update_elem() is more like a sequence of delete + insert?
>>>>
>>>> Hmm.
>>> Yes, exactly that. Because update_elem is actually a delete + insert (actually
>>> an insert + delete, I think?), it is possible for a concurrent lookup to see no
>>> element instead of either the old or new value.
>> Yep.
>>>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>>>> update procedure for htab of maps in which it will atomically overwrite
>>>>> the map_ptr just like array of maps does.
>>>> hold on. is this only for hash-of-maps ?
>>> I believe this was also replicated for hash as well as hash-of-maps. Cody can
>>> confirm, or use the reproducer he has to test that case.
>> The fix for hash-of-maps will be much simpler and the returned map
>> pointer will be correct. For other types of hash map, beside switching
>> to hlist_nulls_replace_rcu(),
> It's been a long time since I looked into rcu_nulls details.
> Pls help me understand that this new replace_rcu_nulls()
> is correct from nulls pov,
> If it is then this patch set may be the right answer to non-atomic update.

If I understand correctly, only the manipulations of ->first pointer and
->next pointer need to take care of nulls pointer.
>
> And for the future, please please focus on "why" part in
> the cover letter and commit logs instead of "what".
>
> Since the only thing I got from the log was:
> "Currently, the update is not atomic
> because the overwrite of existing element happens in a two-steps way,
> but the support of atomic update is feasible".
>
> "is feasible" doesn't explain "why".
>
> Link to xdp-newbie question is nice for additional context,
> but reviewers should not need to go and read some thread somewhere
> to understand "why" part.
> All of it should be in the commit log.

OK. My original thought is that is a reported problem, so an extra link
will be enough. Will try to add more context next time.
>
>> map may still be incorrect (as shown long time ago [1]), so I think
>> maybe for other types of map, the atomic update doesn't matter too much.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> A thread from 3 years ago ?! Sorry, it's not helpful to ask
> people to page-in such an old context with lots of follow ups
> that may or may not be relevant today.
Let me reuse part of the diagram above to explain how does the lookup
procedure may return a incorrect value:

lookup procedure A
A: find the old element (instead of the new element)


              update procedure B
              B: delete the old element
              update procedure C on the same CPU:


A: the key is matched and return the value in the element

              C: reuse the old element (overwrite its key and value)

A: read the value (it is incorrect, because it has been reused and
overwritten)