lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7d89cff3-1045-4901-bdd3-f669eecfee97@linux.dev>
Date: Tue, 14 Oct 2025 16:04:22 +0800
From: luoxuanqiang <xuanqiang.luo@...ux.dev>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuniyu@...gle.com, "Paul E. McKenney" <paulmck@...nel.org>,
 kerneljasonxing@...il.com, davem@...emloft.net, kuba@...nel.org,
 netdev@...r.kernel.org, Xuanqiang Luo <luoxuanqiang@...inos.cn>,
 Frederic Weisbecker <frederic@...nel.org>,
 Neeraj Upadhyay <neeraj.upadhyay@...nel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu()
 and hlist_nulls_replace_init_rcu()


在 2025/10/14 15:34, Eric Dumazet 写道:
> On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>
>> 在 2025/10/13 17:49, Eric Dumazet 写道:
>>> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@...ux.dev> wrote:
>>>> 在 2025/10/13 15:31, Eric Dumazet 写道:
>>>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@...ux.dev> wrote:
>>>>>> From: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>>
>>>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>>>
>>>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>>>> mentioned in the patch below:
>>>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>>>> rculist_nulls")
>>>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>>>> hlist_nulls")
>>>>>>
>>>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>>>> ---
>>>>>>     include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 59 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>>>> index 89186c499dd4..c26cb83ca071 100644
>>>>>> --- a/include/linux/rculist_nulls.h
>>>>>> +++ b/include/linux/rculist_nulls.h
>>>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>>>>     #define hlist_nulls_next_rcu(node) \
>>>>>>            (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>>>
>>>>>> +/**
>>>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>>>> + * @node: element of the list.
>>>>>> + */
>>>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>>>> +       (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>>>> +
>>>>>>     /**
>>>>>>      * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>>>>      * @n: the element to delete from the hash list.
>>>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>>>>            n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>>>>     }
>>>>>>
>>>>>> +/**
>>>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>>>> + * @old: the element to be replaced
>>>>>> + * @new: the new element to insert
>>>>>> + *
>>>>>> + * Description:
>>>>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>>>>>> + * permitting racing traversals.
>>>>>> + *
>>>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>>>> + * list.  However, it is perfectly legal to run concurrently with the _rcu
>>>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>>>> + */
>>>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>>>> +                                          struct hlist_nulls_node *new)
>>>>>> +{
>>>>>> +       struct hlist_nulls_node *next = old->next;
>>>>>> +
>>>>>> +       WRITE_ONCE(new->next, next);
>>>>>> +       WRITE_ONCE(new->pprev, old->pprev);
>>>>> I do not think these two WRITE_ONCE() are needed.
>>>>>
>>>>> At this point new is not yet visible.
>>>>>
>>>>> The following  rcu_assign_pointer() is enough to make sure prior
>>>>> writes are committed to memory.
>>>> Dear Eric,
>>>>
>>>> I’m quoting your more detailed explanation from the other patch [0], thank
>>>> you for that!
>>>>
>>>> However, regarding new->next, if the new object is allocated with
>>>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>>>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>>>> rculist_nulls”)?
>>>>
>>>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>>>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
>>> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>>>
>>> It is a bit weird to annotate the writes, but not the lockless reads,
>>> even if apparently KCSAN
>>> is okay with that.
>>>
>> Dear Eric,
>>
>> I’m sorry—I still haven’t fully grasped the scenario you mentioned where
>> sk_unhashed()/sk_hashed() can be called from lock‑less contexts. It seems
>> similar to the race described in commit 860c8802ace1 (“rcu: Use
>> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
>>
> inet_unhash() does a lockless sk_unhash(sk) call, while no lock is
> held in some cases (look at tcp_done())
>
> void inet_unhash(struct sock *sk)
> {
> struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
>
> if (sk_unhashed(sk))    // Here no lock is held
>      return;
>
> Relevant lock (depending on (sk->sk_state == TCP_LISTEN)) is acquired
> a few lines later.
>
> Then
>
> __sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.
>
Dear Eric,

Thanks for the quick response!

In the call path:
         tcp_retransmit_timer()
                 tcp_write_err()
                         tcp_done()

tcp_retransmit_timer() already calls lockdep_sock_is_held(sk) to check the
socket‑lock state.

void tcp_retransmit_timer(struct sock *sk)
{
         struct tcp_sock *tp = tcp_sk(sk);
         struct net *net = sock_net(sk);
         struct inet_connection_sock *icsk = inet_csk(sk);
         struct request_sock *req;
         struct sk_buff *skb;

         req = rcu_dereference_protected(tp->fastopen_rsk,
                                  lockdep_sock_is_held(sk)); // Check here

Does that mean we’re already protected by lock_sock(sk) or
bh_lock_sock(sk)?

Thanks!

>
>
>> Two CPUs invoke inet_unhash() from the tcp_retransmit_timer() path on the
>> same sk, causing a race even though tcp_retransmit_timer() checks
>> lockdep_sock_is_held(sk).
>>
>> How does this race happen? I can’t find more details to understand the
>> situation, so any hints would be greatly appreciated.
>>
>> My simple understanding is that hlist_nulls_replace_rcu() might have the
>> same call path as hlist_nulls_add_head_rcu(), so I keep using WRITE_ONCE().
>>
>> Finally, Kuniyuki Iwashima also raised a similar discussion in the v3
>> series; here’s the link [1].
>>
>> [0]:
>> ------------------------------------------------------------------------
>>
>> BUG: KCSAN: data-race in inet_unhash / inet_unhash
>>
>> write to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 1:
>>    __hlist_nulls_del include/linux/list_nulls.h:88 [inline]
>>    hlist_nulls_del_init_rcu include/linux/rculist_nulls.h:36 [inline]
>>    __sk_nulls_del_node_init_rcu include/net/sock.h:676 [inline]
>>    inet_unhash+0x38f/0x4a0 net/ipv4/inet_hashtables.c:612
>>    tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
>>    tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
>>    tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
>>    tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
>>    tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
>>    tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
>>    call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
>>    expire_timers kernel/time/timer.c:1449 [inline]
>>    __run_timers kernel/time/timer.c:1773 [inline]
>>    __run_timers kernel/time/timer.c:1740 [inline]
>>    run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
>>    __do_softirq+0x115/0x33f kernel/softirq.c:292
>>    invoke_softirq kernel/softirq.c:373 [inline]
>>    irq_exit+0xbb/0xe0 kernel/softirq.c:413
>>    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
>>    smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
>>    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
>>    native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
>>    arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
>>    default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
>>    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
>>    do_idle+0x1af/0x280 kernel/sched/idle.c:263
>>    cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
>>    start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
>>    secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
>>
>> read to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 0:
>>    sk_unhashed include/net/sock.h:607 [inline]
>>    inet_unhash+0x3d/0x4a0 net/ipv4/inet_hashtables.c:592
>>    tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
>>    tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
>>    tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
>>    tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
>>    tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
>>    tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
>>    call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
>>    expire_timers kernel/time/timer.c:1449 [inline]
>>    __run_timers kernel/time/timer.c:1773 [inline]
>>    __run_timers kernel/time/timer.c:1740 [inline]
>>    run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
>>    __do_softirq+0x115/0x33f kernel/softirq.c:292
>>    invoke_softirq kernel/softirq.c:373 [inline]
>>    irq_exit+0xbb/0xe0 kernel/softirq.c:413
>>    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
>>    smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
>>    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
>>    native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
>>    arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
>>    default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
>>    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
>>    do_idle+0x1af/0x280 kernel/sched/idle.c:263
>>    cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
>>    rest_init+0xec/0xf6 init/main.c:452
>>    arch_call_rest_init+0x17/0x37
>>    start_kernel+0x838/0x85e init/main.c:786
>>    x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
>>    x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
>>    secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
>>
>> Reported by Kernel Concurrency Sanitizer on:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc6+ #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine,
>> BIOS Google 01/01/2011
>>
>> ------------------------------------------------------------------------
>>
>> [1]: https://lore.kernel.org/all/CAAVpQUCoCizxTm6wRs0+n6_kPK+kgxwszsYKNds3YvuBfBvrhg@mail.gmail.com/
>>
>> Thanks!
>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ