Message-ID: <CANn89iJtMVRhcdfaH3Qz0cLf30cV32jYpAA5YBcL1Auovccdug@mail.gmail.com>
Date: Tue, 14 Oct 2025 01:09:14 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: luoxuanqiang <xuanqiang.luo@...ux.dev>
Cc: kuniyu@...gle.com, "Paul E. McKenney" <paulmck@...nel.org>, kerneljasonxing@...il.com,
davem@...emloft.net, kuba@...nel.org, netdev@...r.kernel.org,
Xuanqiang Luo <luoxuanqiang@...inos.cn>, Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu()
and hlist_nulls_replace_init_rcu()
On Tue, Oct 14, 2025 at 1:05 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>
>
> On 2025/10/14 15:34, Eric Dumazet wrote:
> > On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
> >>
> >> On 2025/10/13 17:49, Eric Dumazet wrote:
> >>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
> >>>> On 2025/10/13 15:31, Eric Dumazet wrote:
> >>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@...ux.dev> wrote:
> >>>>>> From: Xuanqiang Luo <luoxuanqiang@...inos.cn>
> >>>>>>
> >>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
> >>>>>>
> >>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
> >>>>>> mentioned in the patch below:
> >>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
> >>>>>> rculist_nulls")
> >>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
> >>>>>> hlist_nulls")
> >>>>>>
> >>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@...inos.cn>
> >>>>>> ---
> >>>>>> include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
> >>>>>> 1 file changed, 59 insertions(+)
> >>>>>>
> >>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> >>>>>> index 89186c499dd4..c26cb83ca071 100644
> >>>>>> --- a/include/linux/rculist_nulls.h
> >>>>>> +++ b/include/linux/rculist_nulls.h
> >>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
> >>>>>>  #define hlist_nulls_next_rcu(node) \
> >>>>>>  	(*((struct hlist_nulls_node __rcu __force **)&(node)->next))
> >>>>>>
> >>>>>> +/**
> >>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
> >>>>>> + * @node: element of the list.
> >>>>>> + */
> >>>>>> +#define hlist_nulls_pprev_rcu(node) \
> >>>>>> +	(*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
> >>>>>> +
> >>>>>>  /**
> >>>>>>   * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
> >>>>>>   * @n: the element to delete from the hash list.
> >>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> >>>>>>  	n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
> >>>>>>  }
> >>>>>>
> >>>>>> +/**
> >>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
> >>>>>> + * @old: the element to be replaced
> >>>>>> + * @new: the new element to insert
> >>>>>> + *
> >>>>>> + * Description:
> >>>>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> >>>>>> + * permitting racing traversals.
> >>>>>> + *
> >>>>>> + * The caller must take whatever precautions are necessary (such as holding
> >>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> >>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> >>>>>> + * list. However, it is perfectly legal to run concurrently with the _rcu
> >>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> >>>>>> + */
> >>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
> >>>>>> +					   struct hlist_nulls_node *new)
> >>>>>> +{
> >>>>>> +	struct hlist_nulls_node *next = old->next;
> >>>>>> +
> >>>>>> +	WRITE_ONCE(new->next, next);
> >>>>>> +	WRITE_ONCE(new->pprev, old->pprev);
> >>>>> I do not think these two WRITE_ONCE() are needed.
> >>>>>
> >>>>> At this point new is not yet visible.
> >>>>>
> >>>>> The following rcu_assign_pointer() is enough to make sure prior
> >>>>> writes are committed to memory.
> >>>> Dear Eric,
> >>>>
> >>>> I’m quoting your more detailed explanation from the other patch [0], thank
> >>>> you for that!
> >>>>
> >>>> However, regarding new->next, if the new object is allocated with
> >>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
> >>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
> >>>> rculist_nulls”)?
> >>>>
> >>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
> >>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
> >>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
> >>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
> >>>
> >>> It is a bit weird to annotate the writes, but not the lockless reads,
> >>> even if apparently KCSAN
> >>> is okay with that.
> >>>
> >> Dear Eric,
> >>
> >> I'm sorry, I still haven't fully grasped the scenario you mentioned where
> >> sk_unhashed()/sk_hashed() can be called from lockless contexts. It seems
> >> similar to the race described in commit 860c8802ace1 (“rcu: Use
> >> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
> >>
> > inet_unhash() does a lockless sk_unhashed(sk) call, while no lock is
> > held in some cases (look at tcp_done())
> >
> > void inet_unhash(struct sock *sk)
> > {
> > 	struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
> >
> > 	if (sk_unhashed(sk)) // Here no lock is held
> > 		return;
> >
> > Relevant lock (depending on (sk->sk_state == TCP_LISTEN)) is acquired
> > a few lines later.
> >
> > Then
> >
> > __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
> >
> Dear Eric,
>
> Thanks for the quick response!
>
> In the call path:
> tcp_retransmit_timer()
> tcp_write_err()
> tcp_done()
>
> tcp_retransmit_timer() already uses lockdep_sock_is_held(sk) (via
> rcu_dereference_protected()) to assert the socket-lock state.
>
> void tcp_retransmit_timer(struct sock *sk)
> {
> 	struct tcp_sock *tp = tcp_sk(sk);
> 	struct net *net = sock_net(sk);
> 	struct inet_connection_sock *icsk = inet_csk(sk);
> 	struct request_sock *req;
> 	struct sk_buff *skb;
>
> 	req = rcu_dereference_protected(tp->fastopen_rsk,
> 					lockdep_sock_is_held(sk)); // Check here
>
> Does that mean we’re already protected by lock_sock(sk) or
> bh_lock_sock(sk)?
But the socket lock is not what protects the ehash buckets; those are
protected by other locks (the per-bucket spinlocks). Also, inet_unhash()
can be called from other paths without the socket lock being held.
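
To make the distinction concrete, the non-LISTEN branch of inet_unhash()
does roughly the following (paraphrased from memory rather than quoted
verbatim, so treat the exact helper calls as approximate):

	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);

	spin_lock_bh(lock);
	if (sk_unhashed(sk)) {
		/* Re-check under the per-bucket ehash spinlock: the first
		 * sk_unhashed() test ran with no lock held at all, and the
		 * socket lock would not have serialized it anyway.
		 */
		spin_unlock_bh(lock);
		return;
	}
	__sk_nulls_del_node_init_rcu(sk);
	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
	spin_unlock_bh(lock);

So the bucket itself is serialized by the spinlock returned by
inet_ehash_lockp(), not by lock_sock()/bh_lock_sock(), which is why the
WRITE_ONCE()/READ_ONCE() annotations on ->pprev matter for the lockless
sk_unhashed() readers.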