Message-ID: <f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev>
Date: Tue, 14 Oct 2025 19:40:13 +0800
From: luoxuanqiang <xuanqiang.luo@...ux.dev>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuniyu@...gle.com, "Paul E. McKenney" <paulmck@...nel.org>,
kerneljasonxing@...il.com, davem@...emloft.net, kuba@...nel.org,
netdev@...r.kernel.org, Xuanqiang Luo <luoxuanqiang@...inos.cn>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu()
and hlist_nulls_replace_init_rcu()
On 2025/10/14 18:02, Eric Dumazet wrote:
> On Tue, Oct 14, 2025 at 1:41 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>
>>> On 2025/10/14 16:09, Eric Dumazet wrote:
>>> On Tue, Oct 14, 2025 at 1:05 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>> On 2025/10/14 15:34, Eric Dumazet wrote:
>>>>> On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>>>> On 2025/10/13 17:49, Eric Dumazet wrote:
>>>>>>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>>>>>> On 2025/10/13 15:31, Eric Dumazet wrote:
>>>>>>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@...ux.dev> wrote:
>>>>>>>>>> From: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>>>>>>
>>>>>>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>>>>>>>
>>>>>>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>>>>>>>> mentioned in the patches below:
>>>>>>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>>>> rculist_nulls")
>>>>>>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>>>> hlist_nulls")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>>>>>> ---
>>>>>>>>>> include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 59 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>>>>>>>> index 89186c499dd4..c26cb83ca071 100644
>>>>>>>>>> --- a/include/linux/rculist_nulls.h
>>>>>>>>>> +++ b/include/linux/rculist_nulls.h
>>>>>>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>>>>>>> #define hlist_nulls_next_rcu(node) \
>>>>>>>>>> 	(*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>>>>>>>> + * @node: element of the list.
>>>>>>>>>> + */
>>>>>>>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>>>>>>>> +	(*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>>>>>>>> +
>>>>>>>>>> /**
>>>>>>>>>> * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>>>>>>>> * @n: the element to delete from the hash list.
>>>>>>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>>>>>>>> 	n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>>>>>>>> + * @old: the element to be replaced
>>>>>>>>>> + * @new: the new element to insert
>>>>>>>>>> + *
>>>>>>>>>> + * Description:
>>>>>>>>>> + * Replace the old entry with the new one in an RCU-protected hlist_nulls, while
>>>>>>>>>> + * permitting racing traversals.
>>>>>>>>>> + *
>>>>>>>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>>>>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>>>>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>>>>>>>> + * list. However, it is perfectly legal to run concurrently with the _rcu
>>>>>>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>>>>>>> + */
>>>>>>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>>>>>>>> +					    struct hlist_nulls_node *new)
>>>>>>>>>> +{
>>>>>>>>>> +	struct hlist_nulls_node *next = old->next;
>>>>>>>>>> +
>>>>>>>>>> +	WRITE_ONCE(new->next, next);
>>>>>>>>>> +	WRITE_ONCE(new->pprev, old->pprev);
>>>>>>>>> I do not think these two WRITE_ONCE() are needed.
>>>>>>>>>
>>>>>>>>> At this point new is not yet visible.
>>>>>>>>>
>>>>>>>>> The following rcu_assign_pointer() is enough to make sure prior
>>>>>>>>> writes are committed to memory.
>>>>>>>> Dear Eric,
>>>>>>>>
>>>>>>>> I’m quoting your more detailed explanation from the other patch [0];
>>>>>>>> thank you for that!
>>>>>>>>
>>>>>>>> However, regarding new->next, if the new object is allocated with
>>>>>>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>>>>>>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>> rculist_nulls”)?
>>>>>>>>
>>>>>>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>>>>>>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
>>>>>>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>>>>>>>
>>>>>>> It is a bit weird to annotate the writes, but not the lockless reads,
>>>>>>> even if apparently KCSAN is okay with that.
>>>>>>>
>>>>>> Dear Eric,
>>>>>>
>>>>>> I’m sorry, I still haven’t fully grasped the scenario you mentioned where
>>>>>> sk_unhashed()/sk_hashed() can be called from lockless contexts. It seems
>>>>>> similar to the race described in commit 860c8802ace1 (“rcu: Use
>>>>>> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g. [0].
>>>>>>
>>>>> inet_unhash() does a lockless sk_unhashed(sk) call, while no lock is
>>>>> held in some cases (look at tcp_done())
>>>>>
>>>>> void inet_unhash(struct sock *sk)
>>>>> {
>>>>> 	struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
>>>>>
>>>>> 	if (sk_unhashed(sk)) // Here no lock is held
>>>>> 		return;
>>>>>
>>>>> The relevant lock (which one depends on sk->sk_state == TCP_LISTEN) is
>>>>> acquired a few lines later.
>>>>>
>>>>> Then
>>>>>
>>>>> __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
>>>>>
>>>> Dear Eric,
>>>>
>>>> Thanks for the quick response!
>>>>
>>>> In the call path:
>>>>   tcp_retransmit_timer()
>>>>     tcp_write_err()
>>>>       tcp_done()
>>>>
>>>> tcp_retransmit_timer() already calls lockdep_sock_is_held(sk) to check the
>>>> socket lock state.
>>>>
>>>> void tcp_retransmit_timer(struct sock *sk)
>>>> {
>>>> 	struct tcp_sock *tp = tcp_sk(sk);
>>>> 	struct net *net = sock_net(sk);
>>>> 	struct inet_connection_sock *icsk = inet_csk(sk);
>>>> 	struct request_sock *req;
>>>> 	struct sk_buff *skb;
>>>>
>>>> 	req = rcu_dereference_protected(tp->fastopen_rsk,
>>>> 					lockdep_sock_is_held(sk)); // Check here
>>>>
>>>> Does that mean we’re already protected by lock_sock(sk) or
>>>> bh_lock_sock(sk)?
>>> But the socket lock is not protecting the ehash buckets; those are
>>> protected by other locks.
>>>
>>> Also, inet_unhash() can be called from other paths, without a socket
>>> lock being held.
>> Dear Eric,
>>
>> I understand the distinction now, but looking at the call stack in [0],
>> both CPUs reach inet_unhash() via the tcp_retransmit_timer() path, so only
>> one of them should pass the check, right?
>>
>> I’m still not clear how this race condition arises.
> Because those are two different sockets. This once again explains why
> holding the socket lock or not is not relevant.
>
> One of them is changing pointers in the chain, messing with
> surrounding pointers.
>
> The second one is reading sk->sk_node.pprev without using
> hlist_unhashed_lockless().
>
> I do not know how to explain this...
>
> Please look at the difference between hlist_unhashed_lockless() and
> hlist_unhashed().
Thank you very much for the patient explanation. :)
I think I understand now! To double-check, I have sketched my understanding below.
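
For reference, here are the two helpers as they appear in
include/linux/list.h (quoted from a recent mainline tree, so minor
details may differ in other versions):

static inline int hlist_unhashed(const struct hlist_node *h)
{
	return !h->pprev;
}

/* This variant of hlist_unhashed() must be used in lockless contexts
 * to avoid potential load-tearing. The READ_ONCE() is paired with the
 * various WRITE_ONCE() in hlist helpers that are defined below.
 */
static inline int hlist_unhashed_lockless(const struct hlist_node *h)
{
	return !READ_ONCE(h->pprev);
}

So, if I follow correctly: one CPU, holding the bucket lock, unlinks or
replaces a socket B and thereby rewrites the ->pprev of its neighbour A
in the same chain, while another CPU runs the unlocked sk_unhashed()
check on A in inet_unhash(). The plain h->pprev load then races with
that write, which is exactly what the READ_ONCE()/WRITE_ONCE() pairing
(and the WRITE_ONCE() on ->next/->pprev in my patch) annotates.
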
Best regards!
Xuanqiang.