Message-ID: <CAM0EoMmQHsucU6n1O3XEd50zUB4TENkEH0+J-cZ=5Bbv9298mA@mail.gmail.com>
Date: Wed, 27 Mar 2024 18:57:49 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com, jiri@...nulli.us, 
	xiyou.wangcong@...il.com, netdev@...r.kernel.org, renmingshuai@...wei.com, 
	Victor Nogueira <victor@...atatu.com>
Subject: Re: [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion

On Wed, Mar 27, 2024 at 9:23 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, Mar 27, 2024 at 12:03 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >
> > When the mirred action is used on a classful egress qdisc and a packet is
> > mirrored or redirected to self, we hit a qdisc lock deadlock.
> > See the trace below.
> >
> > [..... other info removed for brevity....]
> > [   82.890906]
> > [   82.890906] ============================================
> > [   82.890906] WARNING: possible recursive locking detected
> > [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> > [   82.890906] --------------------------------------------
> > [   82.890906] ping/418 is trying to acquire lock:
> > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> > __dev_queue_xmit+0x1778/0x3550
> > [   82.890906]
> > [   82.890906] but task is already holding lock:
> > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> > __dev_queue_xmit+0x1778/0x3550
> > [   82.890906]
> > [   82.890906] other info that might help us debug this:
> > [   82.890906]  Possible unsafe locking scenario:
> > [   82.890906]
> > [   82.890906]        CPU0
> > [   82.890906]        ----
> > [   82.890906]   lock(&sch->q.lock);
> > [   82.890906]   lock(&sch->q.lock);
> > [   82.890906]
> > [   82.890906]  *** DEADLOCK ***
> > [   82.890906]
> > [..... other info removed for brevity....]
> >
> > Example setup (eth0->eth0) to recreate
> > tc qdisc add dev eth0 root handle 1: htb default 30
> > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> >      action mirred egress redirect dev eth0
> >
> > Another example (eth0->eth1->eth0) to recreate
> > tc qdisc add dev eth0 root handle 1: htb default 30
> > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> >      action mirred egress redirect dev eth1
> >
> > tc qdisc add dev eth1 root handle 1: htb default 30
> > tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
> >      action mirred egress redirect dev eth0
> >
> > We fix this by adding a per-cpu, per-qdisc recursion counter which is
> > incremented the first time a root qdisc is entered. On a second attempt
> > to enter the same root qdisc from the top, the packet is dropped to
> > break the loop.
> >
> > Reported-by: renmingshuai@...wei.com
> > Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
> > Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
> > Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
> > Co-developed-by: Victor Nogueira <victor@...atatu.com>
> > Signed-off-by: Victor Nogueira <victor@...atatu.com>
> > Signed-off-by: Jamal Hadi Salim <jhs@...atatu.com>
> > ---
> >  include/net/sch_generic.h |  2 ++
> >  net/core/dev.c            |  9 +++++++++
> >  net/sched/sch_api.c       | 12 ++++++++++++
> >  net/sched/sch_generic.c   |  2 ++
> >  4 files changed, 25 insertions(+)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index cefe0c4bdae3..f9f99df037ed 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -125,6 +125,8 @@ struct Qdisc {
> >         spinlock_t              busylock ____cacheline_aligned_in_smp;
> >         spinlock_t              seqlock;
> >
> > +       u16 __percpu            *xmit_recursion;
> > +
> >         struct rcu_head         rcu;
> >         netdevice_tracker       dev_tracker;
> >         /* private data */
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 9a67003e49db..2b712388c06f 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >         if (unlikely(contended))
> >                 spin_lock(&q->busylock);
>
> This could hang here (busylock)

Notice the goto free_skb_list has a spin_unlock(&q->busylock); in its
code vicinity. Am I missing something?
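To spell out what I mean, the tail of __dev_xmit_skb() with the patch
applied looks roughly like this (a paraphrased sketch, not the literal
hunk):

        if (__this_cpu_read(*q->xmit_recursion) > 0) {
                __qdisc_drop(skb, &to_free);
                rc = NET_XMIT_DROP;
                goto free_skb_list;
        }
        ...
free_skb_list:
        if (unlikely(to_free))
                kfree_skb_list_reason(to_free,
                                      tcf_get_drop_reason(to_free));
        if (unlikely(contended))
                spin_unlock(&q->busylock);
        return rc;

i.e. the drop path still releases busylock on its way out.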

>
> >
> > +       if (__this_cpu_read(*q->xmit_recursion) > 0) {
> > +               __qdisc_drop(skb, &to_free);
> > +               rc = NET_XMIT_DROP;
> > +               goto free_skb_list;
> > +       }
>
>
> I do not think we want to add yet another cache line miss and
> complexity in the tx fast path.
>

I empathize. Is the cache line miss due to it being a per-cpu
variable? Otherwise the new field seems to be in the vicinity of the
other fields already being accessed in __dev_xmit_skb().
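For completeness, since it is not quoted above: the sch_api.c part of
the patch only allocates and frees the per-cpu counter, along these
lines (a sketch, the error label is illustrative):

        /* in qdisc_create(), next to the other per-qdisc allocations */
        sch->xmit_recursion = alloc_percpu(u16);
        if (!sch->xmit_recursion) {
                err = -ENOMEM;
                goto err_out;   /* illustrative label */
        }

        /* with the matching free in the qdisc destroy path */
        free_percpu(qdisc->xmit_recursion);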

> I think that mirred should use a separate queue to kick a transmit
> from the top level.
>
> (Like netif_rx() does)
>

Eric, here's my concern: this would entail restructuring mirred
totally just to cater for one use case which is in itself _a bad
config_, and for the egress qdisc case only. Mirred is very heavily
used, and changing its behavior could introduce other corner cases for
some use cases which we would be chasing for a while. Not to forget
that we would now have to go via an extra transient queue.
If I understood correctly, you are suggesting we add an equivalent of
the backlog queue for the tx side? I am assuming something very
similar in nature to the rx backlog, meaning per cpu and fired by
softirq? Or is it something closer to qdisc->gso_skb?
For either of those cases, the amount of infrastructure code needed is
not a few lines. And then there's still the desire to break the loop,
etc.
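Just so we are talking about the same thing, the shape I have in mind
when I say "backlog-like" is roughly this (purely illustrative, all
names invented, locking and flow control elided):

        struct tx_backlog {
                struct sk_buff_head q;
        };
        static DEFINE_PER_CPU(struct tx_backlog, tx_backlog);

        /* mirred side: defer instead of recursing into the qdisc */
        static void tx_backlog_defer(struct sk_buff *skb)
        {
                __skb_queue_tail(&this_cpu_ptr(&tx_backlog)->q, skb);
                raise_softirq(NET_TX_SOFTIRQ);
        }

        /* softirq side: drain from the top level, no qdisc locks held */
        static void tx_backlog_drain(void)
        {
                struct sk_buff_head *q = &this_cpu_ptr(&tx_backlog)->q;
                struct sk_buff *skb;

                while ((skb = __skb_dequeue(q)) != NULL)
                        dev_queue_xmit(skb);
        }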

Some questions regarding your proposal, since there's something I am
not following. I may have misunderstood what you are suggesting, but I
am missing in what scenario mirred can directly call
tcf_dev_queue_xmit() (see my comment below).

> Using a softnet.xmit_qdisc_recursion (not a qdisc-per-cpu thing)
> would allow mirred to bypass this additional queue in most cases.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cb37817d6382c29117afd8ce54db6dba94f8c930..62ba5ef554860496ee928f7ed6b7c3ea46b8ee1d
> 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3217,7 +3217,8 @@ struct softnet_data {
>  #endif
>         /* written and read only by owning cpu: */
>         struct {
> -               u16 recursion;
> +               u8 recursion;
> +               u8 qdisc_recursion;
>                 u8  more;
>  #ifdef CONFIG_NET_EGRESS
>                 u8  skip_txqueue;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 9a67003e49db87f3f92b6c6296b3e7a5ca9d9171..7ac59835edef657e9558d4d4fc0a76b171aace93
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4298,7 +4298,9 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>
>         trace_net_dev_queue(skb);
>         if (q->enqueue) {
> +               __this_cpu_inc(softnet_data.xmit.qdisc_recursion);

This increments the count by 1..

>                 rc = __dev_xmit_skb(skb, q, dev, txq);
> +               __this_cpu_dec(softnet_data.xmit.qdisc_recursion);
>                 goto out;
>         }
>
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 5b38143659249e66718348e0ec4ed3c7bc21c13d..0f5f02e6744397d33ae2a72670ba7131aaa6942e
> 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -237,8 +237,13 @@ tcf_mirred_forward(bool at_ingress, bool
> want_ingress, struct sk_buff *skb)
>  {
>         int err;
>
> -       if (!want_ingress)
> -               err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> +       if (!want_ingress) {
> +               if (__this_cpu_read(softnet_data.xmit.qdisc_recursion)) {

Where does the deferred transmit actually happen? This will always
read 1 here, assuming the defer queue will have to be something like a
workqueue,

> +                       // Queue to top level, or drop
> +               } else {

and we'll never enter this else branch..

> +                       err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> +               }
> +       }
>         else if (!at_ingress)
>                 err = netif_rx(skb);
>         else
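To make the above concrete, this is the flow as I read your two diffs
together, for the egress->egress redirect case (illustrative):

        __dev_queue_xmit()
          __this_cpu_inc(softnet_data.xmit.qdisc_recursion)  /* 0 -> 1 */
          __dev_xmit_skb() -> qdisc enqueue -> tc classify -> act_mirred
            tcf_mirred_forward()
              __this_cpu_read(softnet_data.xmit.qdisc_recursion) /* == 1 */
              -> we always take the "queue to top level" branch and
                 never the direct tcf_dev_queue_xmit() one.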

cheers,
jamal
