Message-ID: <0e9dbde7-07eb-45f1-a39c-6cf76f9c252f@paulmck-laptop>
Date: Wed, 19 Mar 2025 09:05:07 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Breno Leitao <leitao@...ian.org>, kuba@...nel.org, jhs@...atatu.com,
xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...zon.com,
rcu@...r.kernel.org, kasan-dev@...glegroups.com,
netdev@...r.kernel.org
Subject: Re: tc: network egress frozen during qdisc update with debug kernel
On Wed, Mar 19, 2025 at 04:08:48PM +0100, Eric Dumazet wrote:
> On Wed, Mar 19, 2025 at 4:04 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> > On Wed, Mar 19, 2025 at 07:56:40AM -0700, Breno Leitao wrote:
> > > On Wed, Mar 19, 2025 at 03:41:37PM +0100, Eric Dumazet wrote:
> > > > On Wed, Mar 19, 2025 at 2:09 PM Breno Leitao <leitao@...ian.org> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am experiencing an issue with the upstream kernel when compiled
> > > > > with debug capabilities: CONFIG_DEBUG_NET, CONFIG_KASAN, and
> > > > > CONFIG_LOCKDEP, plus a few others. You can find the full
> > > > > configuration at ....
> > > > >
> > > > > Basically when running a `tc replace`, it takes 13-20 seconds to
> > > > > finish:
> > > > >
> > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > > > > real 0m13.195s
> > > > > user 0m0.001s
> > > > > sys 0m2.746s
> > > > >
> > > > > While this is running, the machine loses network access completely
> > > > > for those 13 seconds, which is far from ideal.
> > > > >
> > > > > Upon investigation, I found that the host is getting stuck in the
> > > > > following call path:
> > > > >
> > > > > __qdisc_destroy
> > > > > mq_attach
> > > > > qdisc_graft
> > > > > tc_modify_qdisc
> > > > > rtnetlink_rcv_msg
> > > > > netlink_rcv_skb
> > > > > netlink_unicast
> > > > > netlink_sendmsg
> > > > >
> > > > > The big offender here is rtnetlink_rcv_msg(), which is called with
> > > > > rtnl_lock held, in the following path:
> > > > >
> > > > > static int tc_modify_qdisc() {
> > > > >         ...
> > > > >         netdev_lock_ops(dev);
> > > > >         err = __tc_modify_qdisc(skb, n, extack, dev, tca, tcm,
> > > > >                                 &replay);
> > > > >         netdev_unlock_ops(dev);
> > > > >         ...
> > > > > }
> > > > >
> > > > > So, the rtnl_lock is held for 13 seconds in the case above. I also
> > > > > traced that __qdisc_destroy() is called once per NIC queue, for a
> > > > > total of 250 calls for the cards I am using.
> > > > >
> > > > > Ftrace output:
> > > > >
> > > > > # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg \
> > > > >     /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep \\$
> > > > > 7) $ 4335849 us | } /* mq_init */
> > > > > 7) $ 4339715 us | } /* qdisc_create */
> > > > > 11) $ 15844438 us | } /* mq_attach */
> > > > > 11) $ 16129620 us | } /* qdisc_graft */
> > > > > 11) $ 20469368 us | } /* tc_modify_qdisc */
> > > > > 11) $ 20470448 us | } /* rtnetlink_rcv_msg */
> > > > >
> > > > > In this case, rtnetlink_rcv_msg() took 20 seconds, and, while it
> > > > > was running, the NIC was not able to send any packets.
> > > > >
> > > > > Going one step further, this matches what I described above:
> > > > >
> > > > > # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg \
> > > > >     /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep "\\@\|\\$"
> > > > >
> > > > >  7) $ 4335849 us  |  } /* mq_init */
> > > > >  7) $ 4339715 us  |  } /* qdisc_create */
> > > > > 14) @ 210619.0 us |  } /* schedule */
> > > > > 14) @ 210621.3 us |  } /* schedule_timeout */
> > > > > 14) @ 210654.0 us |  } /* wait_for_completion_state */
> > > > > 14) @ 210716.7 us |  } /* __wait_rcu_gp */
> > > > > 14) @ 210719.4 us |  } /* synchronize_rcu_normal */
> > > > > 14) @ 210742.5 us |  } /* synchronize_rcu */
> > > > > 14) @ 144455.7 us |  } /* __qdisc_destroy */
> > > > > 14) @ 144458.6 us |  } /* qdisc_put */
> > > > > <snip>
> > > > >  2) @ 131083.6 us |  } /* schedule */
> > > > >  2) @ 131086.5 us |  } /* schedule_timeout */
> > > > >  2) @ 131129.6 us |  } /* wait_for_completion_state */
> > > > >  2) @ 131227.6 us |  } /* __wait_rcu_gp */
> > > > >  2) @ 131231.0 us |  } /* synchronize_rcu_normal */
> > > > >  2) @ 131242.6 us |  } /* synchronize_rcu */
> > > > >  2) @ 152162.7 us |  } /* __qdisc_destroy */
> > > > >  2) @ 152165.7 us |  } /* qdisc_put */
> > > > > 11) $ 15844438 us |  } /* mq_attach */
> > > > > 11) $ 16129620 us |  } /* qdisc_graft */
> > > > > 11) $ 20469368 us |  } /* tc_modify_qdisc */
> > > > > 11) $ 20470448 us |  } /* rtnetlink_rcv_msg */
> > > > >
> > > > > From the stack trace, it appears that most of the time is spent
> > > > > waiting for the RCU grace period to free the qdisc (!?):
> > > > >
> > > > > static void __qdisc_destroy(struct Qdisc *qdisc)
> > > > > {
> > > > >         if (ops->destroy)
> > > > >                 ops->destroy(qdisc);
> > > > >
> > > > >         call_rcu(&qdisc->rcu, qdisc_free_cb);
> > > > >
> > > >
> > > > call_rcu() is asynchronous; this is very different from
> > > > synchronize_rcu().
> > >
> > > That is a good point. The offender here is synchronize_rcu().
> >
> > Should that be synchronize_net()?
> >
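(For reference, synchronize_net() uses an expedited grace period when the
RTNL lock is held -- roughly the following, quoting net/core/dev.c from
memory, so please double-check against your tree:

	void synchronize_net(void)
	{
		might_sleep();
		if (rtnl_is_locked())
			synchronize_rcu_expedited();
		else
			synchronize_rcu();
	}

so each per-queue wait in this path would become an expedited one.)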
>
> I think we should redesign lockdep_unregister_key() to work on a separately
> allocated piece of memory, then use kfree_rcu() in it.
>
> I.e., not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
>
> struct ... {
>         struct lock_class_key key;
>         struct rcu_head rcu;
> };
Works for me!
Thanx, Paul
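A rough sketch of what that could look like (every name below is made up
for illustration; this is not an actual patch):

	struct dyn_lock_key {
		struct lock_class_key key;
		struct rcu_head rcu;
	};

	/*
	 * struct Qdisc would then carry a "struct dyn_lock_key *" instead
	 * of embedding the key, and teardown could remove the key from
	 * lockdep's hash and defer the free rather than blocking:
	 */
	static void qdisc_release_lock_key(struct dyn_lock_key *dk)
	{
		/*
		 * Hypothetical variant of lockdep_unregister_key() that
		 * skips the synchronize_rcu(), relying on the caller to
		 * keep the memory around for a full grace period.
		 */
		lockdep_unregister_key_norcu(&dk->key);
		kfree_rcu(dk, rcu);	/* grace period handled here */
	}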
> > > > > }
> > > > >
> > > > > So, from my newbie PoV, the issue can be summarized as follows:
> > > > >
> > > > > netdev_lock_ops(dev);
> > > > >   __tc_modify_qdisc()
> > > > >     qdisc_graft()
> > > > >       for (i = 0; i < 255; i++)
> > > > >         qdisc_put()
> > > > >           __qdisc_destroy()
> > > > >             call_rcu()
> > > > > }
> > > > >
> > > > > Questions:
> > > > >
> > > > > 1) I assume the egress traffic is blocked because we are modifying
> > > > >    the qdisc, which makes sense. How is this achieved? Is it related
> > > > >    to rtnl_lock?
> > > > >
> > > > > 2) Would it be beneficial to attempt qdisc_put() outside of the
> > > > >    critical section (rtnl_lock?) to prevent this freeze?
> > > > >
> > > > >
> > > >
> > > > It is unclear to me why you have synchronize_rcu() calls.
> > >
> > > This is coming from:
> > >
> > > __qdisc_destroy() {
> > >   lockdep_unregister_key(&qdisc->root_lock_key) {
> > >     ...
> > >     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > >     synchronize_rcu();
> >
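For scale: the @-marked entries in the trace above show individual per-queue
synchronize_rcu() waits on the order of 100-200 ms on this debug kernel, and
even at an average of ~50-80 ms per queue, 250 queues * ~50-80 ms works out
to roughly 13-20 seconds, matching the observed stall.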