netdev - Re: [PATCH net] inet: Avoid established lookup missing active sk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3ca7ad52-bfd6-4619-abef-93bf5be3969c@linux.dev>
Date: Wed, 3 Sep 2025 15:54:41 +0800
From: luoxuanqiang <xuanqiang.luo@...ux.dev>
To: Kuniyuki Iwashima <kuniyu@...gle.com>
Cc: edumazet@...gle.com, davem@...emloft.net, kuba@...nel.org,
 kernelxing@...cent.com, netdev@...r.kernel.org,
 Xuanqiang Luo <luoxuanqiang@...inos.cn>
Subject: Re: [PATCH net] inet: Avoid established lookup missing active sk


在 2025/9/3 13:48, Kuniyuki Iwashima 写道:
> On Tue, Sep 2, 2025 at 10:16 PM Kuniyuki Iwashima <kuniyu@...gle.com> wrote:
>> On Tue, Sep 2, 2025 at 7:45 PM Xuanqiang Luo <xuanqiang.luo@...ux.dev> wrote:
>>> From: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>>
>>> Since the lookup of sk in ehash is lockless, when one CPU is performing a
>>> lookup while another CPU is executing delete and insert operations
>>> (deleting reqsk and inserting sk), the lookup CPU may miss either of
>>> them, if sk cannot be found, an RST may be sent.
>>>
>>> The call trace map is drawn as follows:
>>>     CPU 0                           CPU 1
>>>     -----                           -----
>>>                                  spin_lock()
>>>                                  sk_nulls_del_node_init_rcu(osk)
>>> __inet_lookup_established()
>>>                                  __sk_nulls_add_node_rcu(sk, list)
>>>                                  spin_unlock()
>> This usually does not happen except for local communication, and
>> retrying on the client side is much better than penalising all lookups
>> for SYN.
>>
>>> We can try using spin_lock()/spin_unlock() to wait for ehash updates
>>> (ensuring all deletions and insertions are completed) after a failed
>>> lookup in ehash, then lookup sk again after the update. Since the sk
>>> expected to be found is unlikely to encounter the aforementioned scenario
>>> multiple times consecutively, we only need one update.
>>>
>>> Similarly, an issue occurs in tw hashdance. Try adjusting the order in
>>> which it operates on ehash: remove sk first, then add tw. If sk is missed
>>> during lookup, it will likewise wait for the update to find tw, without
>>> worrying about the skc_refcnt issue that would arise if tw were found
>>> first.
>>>
>>> Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@...inos.cn>
>>> ---
>>>   net/ipv4/inet_hashtables.c    | 12 ++++++++++++
>>>   net/ipv4/inet_timewait_sock.c |  9 ++++-----
>>>   2 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index ceeeec9b7290..4eb3a55b855b 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
>>>          unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
>>>          unsigned int slot = hash & hashinfo->ehash_mask;
>>>          struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
>>> +       bool try_lock = true;
>>>
>>>   begin:
>>>          sk_nulls_for_each_rcu(sk, node, &head->chain) {
>>> @@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
>>>           */
>>>          if (get_nulls_value(node) != slot)
>>>                  goto begin;
>>> +
>>> +       if (try_lock) {
>>> +               spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
>>> +
>>> +               try_lock = false;
>>> +               spin_lock(lock);
>>> +               /* Ensure ehash ops under spinlock complete. */
>>> +               spin_unlock(lock);
>>> +               goto begin;
>>> +       }
>>> +
>>>   out:
>>>          sk = NULL;
>>>   found:
>>> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
>>> index 875ff923a8ed..a91e02e19c53 100644
>>> --- a/net/ipv4/inet_timewait_sock.c
>>> +++ b/net/ipv4/inet_timewait_sock.c
>>> @@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>>
>>>          spin_lock(lock);
>>>
>>> -       /* Step 2: Hash TW into tcp ehash chain */
>>> -       inet_twsk_add_node_rcu(tw, &ehead->chain);
>> You are adding a new RST scenario where the corresponding
>> socket is not found and a listener or no socket is found.
>>
>> The try_lock part is not guaranteed to happen after twsk
>> insertion below.

I'm sorry, I didn't get it. If try_lock can search again, other places 
should have left the spinlock critical section. That means twsk should 
have been inserted already. Or maybe I missed some details. Could you 
please explain more clearly? Your guidance would be really helpful. :)

> Oh no, spin_lock() dance sychronises the threads but I still
> think this is rather harmful for normal cases; now sending
> an unmatched packet can trigger lock dance, which is easily
> abused for DDoS.
>
I agree this scenario is possible. It does make DDoS attacks more 
impactful. Thanks for pointing that out. However, this is the best 
approach I can think of right now. Thanks Xuanqiang

>>
>>> -
>>> -       /* Step 3: Remove SK from hash chain */
>>> +       /* Step 2: Remove SK from hash chain */
>>>          if (__sk_nulls_del_node_init_rcu(sk))
>>>                  sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>>>
>>> -
>>>          /* Ensure above writes are committed into memory before updating the
>>>           * refcount.
>>>           * Provides ordering vs later refcount_inc().
>>> @@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>>           */
>>>          refcount_set(&tw->tw_refcnt, 3);
>>>
>>> +       /* Step 3: Hash TW into tcp ehash chain */
>>> +       inet_twsk_add_node_rcu(tw, &ehead->chain);
>>> +
>>>          inet_twsk_schedule(tw, timeo);
>>>
>>>          spin_unlock(lock);
>>> --
>>> 2.25.1
>>>