Message-ID: <f27bb6dc-b0ac-45aa-b748-4156531ca825@linux.dev>
Date: Tue, 14 Oct 2025 19:40:13 +0800
From: luoxuanqiang <xuanqiang.luo@...ux.dev>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuniyu@...gle.com, "Paul E. McKenney" <paulmck@...nel.org>,
kerneljasonxing@...il.com, davem@...emloft.net, kuba@...nel.org,
netdev@...r.kernel.org, Xuanqiang Luo <luoxuanqiang@...inos.cn>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu()
and hlist_nulls_replace_init_rcu()
On 2025/10/14 18:02, Eric Dumazet wrote:
> On Tue, Oct 14, 2025 at 1:41 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>
>>> On 2025/10/14 16:09, Eric Dumazet wrote:
>>> On Tue, Oct 14, 2025 at 1:05 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>> On 2025/10/14 15:34, Eric Dumazet wrote:
>>>>> On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>>>> On 2025/10/13 17:49, Eric Dumazet wrote:
>>>>>>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>>>>>>> On 2025/10/13 15:31, Eric Dumazet wrote:
>>>>>>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@...ux.dev> wrote:
>>>>>>>>>> From: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>>>>>>
>>>>>>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>>>>>>>
>>>>>>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>>>>>>>> mentioned in the patches below:
>>>>>>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>>>> rculist_nulls")
>>>>>>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>>>> hlist_nulls")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>>>>>> ---
>>>>>>>>>> include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 59 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>>>>>>>> index 89186c499dd4..c26cb83ca071 100644
>>>>>>>>>> --- a/include/linux/rculist_nulls.h
>>>>>>>>>> +++ b/include/linux/rculist_nulls.h
>>>>>>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>>>>>>> #define hlist_nulls_next_rcu(node) \
>>>>>>>>>> 	(*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>>>>>>>> + * @node: element of the list.
>>>>>>>>>> + */
>>>>>>>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>>>>>>>> +	(*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>>>>>>>> +
>>>>>>>>>> /**
>>>>>>>>>> * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>>>>>>>> * @n: the element to delete from the hash list.
>>>>>>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>>>>>>>> 	n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>>>>>>>> + * @old: the element to be replaced
>>>>>>>>>> + * @new: the new element to insert
>>>>>>>>>> + *
>>>>>>>>>> + * Description:
>>>>>>>>>> + * Replace the old entry with the new one in an RCU-protected hlist_nulls, while
>>>>>>>>>> + * permitting racing traversals.
>>>>>>>>>> + *
>>>>>>>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>>>>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>>>>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>>>>>>>> + * list. However, it is perfectly legal to run concurrently with the _rcu
>>>>>>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>>>>>>> + */
>>>>>>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>>>>>>>> +					    struct hlist_nulls_node *new)
>>>>>>>>>> +{
>>>>>>>>>> +	struct hlist_nulls_node *next = old->next;
>>>>>>>>>> +
>>>>>>>>>> +	WRITE_ONCE(new->next, next);
>>>>>>>>>> +	WRITE_ONCE(new->pprev, old->pprev);
>>>>>>>>> I do not think these two WRITE_ONCE() are needed.
>>>>>>>>>
>>>>>>>>> At this point new is not yet visible.
>>>>>>>>>
>>>>>>>>> The following rcu_assign_pointer() is enough to make sure prior
>>>>>>>>> writes are committed to memory.
>>>>>>>> Dear Eric,
>>>>>>>>
>>>>>>>> I’m quoting your more detailed explanation from the other patch [0];
>>>>>>>> thank you for that!
>>>>>>>>
>>>>>>>> However, regarding new->next, if the new object is allocated with
>>>>>>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>>>>>>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>>>> rculist_nulls”)?
>>>>>>>>
>>>>>>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>>>>>>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
>>>>>>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>>>>>>>
>>>>>>> It is a bit weird to annotate the writes, but not the lockless reads,
>>>>>>> even if apparently KCSAN is okay with that.
>>>>>>>
>>>>>> Dear Eric,
>>>>>>
>>>>>> I’m sorry, I still haven’t fully grasped the scenario you mentioned where
>>>>>> sk_unhashed()/sk_hashed() can be called from lockless contexts. It seems
>>>>>> similar to the race described in commit 860c8802ace1 (“rcu: Use
>>>>>> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g. [0].
>>>>>>
>>>>> inet_unhash() does a lockless sk_unhashed(sk) call, while no lock is
>>>>> held in some cases (look at tcp_done())
>>>>>
>>>>> void inet_unhash(struct sock *sk)
>>>>> {
>>>>> 	struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
>>>>>
>>>>> 	if (sk_unhashed(sk)) // Here no lock is held
>>>>> 		return;
>>>>>
>>>>> The relevant lock (which one depends on sk->sk_state == TCP_LISTEN) is
>>>>> acquired a few lines later.
>>>>>
>>>>> Then
>>>>>
>>>>> __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
>>>>>
>>>> Dear Eric,
>>>>
>>>> Thanks for the quick response!
>>>>
>>>> In the call path:
>>>>   tcp_retransmit_timer()
>>>>     tcp_write_err()
>>>>       tcp_done()
>>>>
>>>> tcp_retransmit_timer() already calls lockdep_sock_is_held(sk) to check the
>>>> socket lock state.
>>>>
>>>> void tcp_retransmit_timer(struct sock *sk)
>>>> {
>>>> 	struct tcp_sock *tp = tcp_sk(sk);
>>>> 	struct net *net = sock_net(sk);
>>>> 	struct inet_connection_sock *icsk = inet_csk(sk);
>>>> 	struct request_sock *req;
>>>> 	struct sk_buff *skb;
>>>>
>>>> 	req = rcu_dereference_protected(tp->fastopen_rsk,
>>>> 					lockdep_sock_is_held(sk)); // Check here
>>>>
>>>> Does that mean we’re already protected by lock_sock(sk) or
>>>> bh_lock_sock(sk)?
>>> But the socket lock is not protecting the ehash buckets; those are
>>> protected by other locks.
>>>
>>> Also, inet_unhash() can be called from other paths, without a socket
>>> lock being held.
>> Dear Eric,
>>
>> I understand the distinction now, but looking at the call stack in [0],
>> both CPUs reach inet_unhash() via the tcp_retransmit_timer() path, so only
>> one of them should pass the check, right?
>>
>> I’m still not clear how this race condition arises.
> Because those are two different sockets. This once again explains why
> holding the socket lock or not is not relevant.
>
> One of them is changing pointers in the chain, messing with
> surrounding pointers.
>
> The second one is reading sk->sk_node.pprev without using
> hlist_unhashed_lockless().
>
> I do not know how to explain this...
>
> Please look at the difference between hlist_unhashed_lockless() and
> hlist_unhashed().
Thank you very much for the patient explanation. :)
I think I understand now! To double-check, I have sketched my understanding below.
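
For reference, here are the two helpers as they appear in
include/linux/list.h (quoted from a recent mainline tree, so minor
details may differ in other versions):

static inline int hlist_unhashed(const struct hlist_node *h)
{
	return !h->pprev;
}

/* This variant of hlist_unhashed() must be used in lockless contexts
 * to avoid potential load-tearing. The READ_ONCE() is paired with the
 * various WRITE_ONCE() in hlist helpers that are defined below.
 */
static inline int hlist_unhashed_lockless(const struct hlist_node *h)
{
	return !READ_ONCE(h->pprev);
}

So, if I follow correctly: one CPU, holding the bucket lock, unlinks or
replaces a socket B and thereby rewrites the ->pprev of its neighbour A
in the same chain, while another CPU runs the unlocked sk_unhashed()
check on A in inet_unhash(). The plain h->pprev load then races with
that write, which is exactly what the READ_ONCE()/WRITE_ONCE() pairing
(and the WRITE_ONCE() on ->next/->pprev in my patch) annotates.
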
Best regards!
Xuanqiang.