netdev - Re: [PATCH] net: procfs: Fix RCU stall and soft lockup in ptype_seq_next()

Message-ID: <ylx2hagagfnvdkk6oyom55peoufxumumw4kph3dwozp5z2zkmi@gkcl7cw7vpz6>
Date: Mon, 2 Feb 2026 09:04:23 +0800
From: YinFengwei <fengwei_yin@...ux.alibaba.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
	Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org, 
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: procfs: Fix RCU stall and soft lockup in
 ptype_seq_next()


Hi Eric,
> On Sat, Jan 31, 2026 at 6:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Sat, Jan 31, 2026 at 6:41 PM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> > >
> > > Jakub Kicinski wrote:
> > > > On Wed, 28 Jan 2026 15:03:59 +0800 fengwei_yin@...ux.alibaba.com wrote:
> > > > > The root cause is in ptype_seq_next(): when iterating over packet
> > > > > types, it's possible that a packet type entry (pt) has been removed,
> > > > > its dev set to NULL, and pt->af_packet_net is not initialized.
> > > > > In that case, the function may return the same 'nxt' pointer indefinitely.
> > > > > This results in an infinite loop under RCU read-side critical section,
> > > > > causing an RCU stall and eventually a soft lockup.
> > > > >
> > > > > Fix the issue by properly handling the case where 'nxt' points to
> > > > > an empty list, ensuring forward progress in the iterator.
> > > >
> > > > > @@ -247,7 +247,7 @@ static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > > >
> > > > >     if (pt->af_packet_net) {
> > > > >  net_ptype_all:
> > > > > -           if (nxt != &net->ptype_all && nxt != &net->ptype_specific)
> > > > > +           if (!list_empty(nxt) && nxt != &net->ptype_all && nxt != &net->ptype_specific)
> > > > >                     goto found;
> > > > >
> > > > >             if (nxt == &net->ptype_all) {
> > > > > @@ -267,6 +267,9 @@ static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > > >                     return NULL;
> > > > >             nxt = ptype_base[hash].next;
> > > > >     }
> > > > > +
> > > > > +   if (list_empty(nxt))
> > > > > +           return NULL;
> > > > >  found:
> > > > >     return list_entry(nxt, struct packet_type, list);
> > > > >  }
> > > >
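The failure mode described in the patch, where the iterator keeps handing back the same 'nxt', can be illustrated with a deliberately simplified userspace model. This is not the kernel code: model_ptype, model_next(), del_rcu_like() and reader_steps() are hypothetical names, and the model only assumes what ptype_seq_next() does in spirit, namely that the end of a per-device list is recognized by comparing 'nxt' against a list head reached through pt->dev. If the entry is unlinked and its device pointer cleared while a reader still stands on it, the list head is misread as an entry whose next pointer leads back to itself.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list in the style of the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-in for struct packet_type. 'list' is the first member,
 * so a pointer to it converts back to the containing object. 'dev_list'
 * plays the role of &pt->dev->ptype_all and becomes NULL once "dev" is cleared. */
struct model_ptype {
	struct list_head list;
	struct list_head *dev_list;
};

static void init_list(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink like list_del_rcu(): n->next is left intact so a reader standing
 * on n can still advance; NULL stands in for the kernel's ->prev poison. */
static void del_rcu_like(struct list_head *n)
{
	assert(n->prev != NULL);  /* deleting twice would be a bug */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
}

/* Simplified iterator: the end of the per-device list is recognized only
 * by comparing against the head reached through pt->dev_list. */
static struct model_ptype *model_next(struct model_ptype *pt)
{
	struct list_head *nxt = pt->list.next;

	if (pt->dev_list != NULL && nxt == pt->dev_list)
		return NULL;                   /* end of list recognized */
	return (struct model_ptype *)nxt;  /* otherwise the head is misread as an entry */
}

/* Advance from 'start' until the iterator stops or 'limit' steps pass. */
static int walk(struct model_ptype *start, int limit)
{
	struct model_ptype *pt = start;
	int steps = 0;

	while (pt != NULL && steps < limit) {
		pt = model_next(pt);
		steps++;
	}
	return steps;
}

/* mode 0: entry still linked, device pointer intact
 * mode 1: writer unlinks the entry AND clears its device pointer while a
 *         reader is standing on it (the reported bug)
 * mode 2: writer unlinks the entry but leaves the device pointer alone
 *         until readers are done */
static int reader_steps(int mode, int limit)
{
	struct model_ptype head_obj;  /* sentinel; its 'list' is the per-device head */
	struct model_ptype pt;

	init_list(&head_obj.list);
	head_obj.dev_list = NULL;     /* a misread head yields arbitrary data; NULL in this model */
	pt.dev_list = &head_obj.list;
	add_tail(&pt.list, &head_obj.list);

	/* The reader has just been handed 'pt' by the iterator. */
	if (mode >= 1)
		del_rcu_like(&pt.list);
	if (mode == 1)
		pt.dev_list = NULL;

	return walk(&pt, limit);
}
```

In this model reader_steps(1, 100) runs into the step cap, which stands in for the RCU stall, while modes 0 and 2 stop after a single step.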
> > > > I'm not sure this fix works, TBH, we're dealing with an RCU list here.
> > > > The elements are not deleted with list_del_init(), so they won't
> > > > look "empty".
> > > >
> > > > If the pt entries are under RCU protection I think the issue is that
> > > > af_packet is clearing pt->dev before waiting for the grace period to
> > > > expire.
> > > >
> > > > Willem, is there a reason for that or just convenience?
> > >
> > > That would be wrong. Do we see it doing that somewhere?
> > >
> > > These handlers should get removed with dev_remove_pack. Or
> > > __dev_remove_pack and observe the RCU grace period some other way.
> > > I can review these, but was not aware of any abuses.
> > >
> >
> > packet_notifier()
> >
> > case NETDEV_DOWN:
> >         if (dev->ifindex == po->ifindex) {
> >                 spin_lock(&po->bind_lock);
> >                 if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) {
> >                         __unregister_prot_hook(sk, false);
> >                         /* removed without a synchronize_rcu() */
> >                         sk->sk_err = ENETDOWN;
> >                         if (!sock_flag(sk, SOCK_DEAD))
> >                                 sk_error_report(sk);
> >                 }
> >                 if (msg == NETDEV_UNREGISTER) {
> >                         packet_cached_dev_reset(po);
> >                         WRITE_ONCE(po->ifindex, -1);
> >                         netdev_put(po->prot_hook.dev,
> >                                    &po->prot_hook.dev_tracker);
> >                         po->prot_hook.dev = NULL;               // pointer set to NULL
Yes. This line is the main problem; it is what triggers the RCU stall.

> >                 }
> >                 spin_unlock(&po->bind_lock);
> >         }
> >         break;
> 
> And other places as well...
> 
> I would suggest adding proper RCU protection to prot_hook.dev
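A toy userspace model of what RCU protection of prot_hook.dev is meant to buy, assuming the pattern is: readers take a single snapshot of the pointer, the writer unpublishes it, and the device reference is dropped only after readers that might hold the old pointer have finished. This is not Eric's patch, whose contents are not shown in this message. model_hook, model_deref(), model_unpublish() and the shared reader counter are hypothetical stand-ins for an __rcu pointer, rcu_dereference() and synchronize_rcu(); real RCU does not track readers with a shared counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
	int ifindex;
};

/* Hypothetical stand-in for prot_hook with 'dev' treated as a published pointer. */
struct model_hook {
	_Atomic(struct model_dev *) dev;
};

/* Toy reader tracking: only models "the writer must wait for readers that
 * may still hold the old pointer". */
static atomic_int active_readers;

static void model_read_lock(void)
{
	atomic_fetch_add(&active_readers, 1);
}

static void model_read_unlock(void)
{
	int prev = atomic_fetch_sub(&active_readers, 1);
	assert(prev > 0);  /* unbalanced unlock */
	(void)prev;
}

/* Reader: one acquire load into a local (the rcu_dereference() role);
 * later decisions use this snapshot instead of re-reading hook->dev. */
static struct model_dev *model_deref(struct model_hook *hook)
{
	return atomic_load_explicit(&hook->dev, memory_order_acquire);
}

/* Writer, step 1: unpublish the pointer so new readers see NULL. */
static struct model_dev *model_unpublish(struct model_hook *hook)
{
	return atomic_exchange_explicit(&hook->dev, NULL, memory_order_acq_rel);
}

/* Writer, step 2: non-blocking stand-in for synchronize_rcu(); the old
 * device may be released only when this returns true. */
static bool model_readers_done(void)
{
	return atomic_load(&active_readers) == 0;
}

/* Hand-interleaved reader and writer. Returns true if the reader's snapshot
 * stayed valid and release was allowed only after the reader left. */
static bool interleaving_is_safe(void)
{
	struct model_dev eth = { .ifindex = 2 };
	struct model_hook hook;

	atomic_init(&hook.dev, &eth);

	model_read_lock();
	struct model_dev *snap = model_deref(&hook);       /* reader snapshot */

	struct model_dev *old = model_unpublish(&hook);    /* device going down */
	bool early_release_allowed = model_readers_done(); /* reader still inside */

	bool snapshot_ok = snap == &eth && snap->ifindex == 2;
	model_read_unlock();

	bool release_now_ok = model_readers_done();
	/* Only at this point would the kernel drop its reference (netdev_put()). */

	return old == &eth && snapshot_ok && !early_release_allowed &&
	       release_now_ok && model_deref(&hook) == NULL;
}
```

The point of the ordering is that clearing the pointer and dropping the device reference are split by a wait for readers, instead of happening while a reader may still be iterating, as in the packet_notifier() path quoted above.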

Agreed. Using RCU to protect prot_hook.dev is the best fix. I see
you have already sent a patch for it; I will give it a try and
report back. Thanks.


Regards
Yin, Fengwei