Message-ID: <Ppi6ol0VaHrqJs9Rp0-SGp0J1Y0K8hki_jbNZ8sjNOmtEq0mD4f0IozBxxX-m4535QPJonGFYmiPmB643yd4SOpd1HDDYyMeGQuASuFHl-E=@willsroot.io>
Date: Thu, 22 May 2025 21:03:31 +0000
From: William Liu <will@...lsroot.io>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Savy <savy@...t3mfailure.io>, Cong Wang <xiyou.wangcong@...il.com>, Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>, Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>, Stephen Hemminger <stephen@...workplumber.org>, Davide Caratti <dcaratti@...hat.com>
Subject: Re: [BUG] net/sched: Soft Lockup/Task Hang and OOM Loop in netem_dequeue
On Thursday, May 22nd, 2025 at 12:34 PM, Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > If we do a per-cpu global variable approach like in act_mirred.c to track nesting, wouldn't this break when multiple normal qdisc setups run in parallel across multiple interfaces?
>
>
> A single skb cannot enter netem via multiple cpus. You can have
> multiple cpus entering the netem but they would be different skbs - am
> I missing something? Note mirred uses a per-cpu counter, which should
> suffice as a per-skb counter.
>
Ah right, you are correct. This approach will be fine then. We were originally concerned about another netem qdisc being interleaved during the enqueue operations, but that is not possible.
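For anyone following along, the guard we are borrowing looks roughly like the following (paraphrased from memory of net/sched/act_mirred.c, with the actual mirror/redirect work elided, so treat the details as approximate):

static DEFINE_PER_CPU(unsigned int, mirred_nest_level);
#define MIRRED_NEST_LIMIT	4

static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
			  struct tcf_result *res)
{
	/* One increment per enqueue on this cpu; recursion via the
	 * redirected path re-enters here and bumps the level further.
	 */
	unsigned int nest_level = __this_cpu_inc_return(mirred_nest_level);
	int ret = TC_ACT_PIPE;

	if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
		/* Loop detected - refuse to recurse any deeper */
		net_warn_ratelimited("mirred: nesting limit hit on %s\n",
				     netdev_name(skb->dev));
		ret = TC_ACT_SHOT;
		goto out;
	}

	/* ... the actual mirror/redirect work goes here ... */

out:
	__this_cpu_dec(mirred_nest_level);
	return ret;
}

Since the counter is per cpu and the nested calls all happen on the same cpu within one call chain, skbs on different cpus never see each other's level.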
> > This brings us back to the approach where we don't allow duplication in netem if a parent qdisc is a netem with duplication enabled. However, one issue we are worried about is with regard to qdisc_replace. This means the check would have to happen every time we want to duplicate something in enqueue, right? That isn't ideal either, but let me know if you know of a better place to add the check.
>
>
> I didn't follow - can you be more specific?
>
Oh, I meant that disabling duplication due to an ancestor netem in the qdisc tree isn't possible at class instantiation time, because
qdiscs can be replaced. So we would have to do the check at every skb enqueue, roughly like the sketch below, which would be far from ideal. The per-cpu approach is better.
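For concreteness, the per-enqueue check we wanted to avoid would have to climb the tree the same way qdisc_tree_reduce_backlog() does, e.g. something like this hypothetical helper (untested sketch; it also assumes netem_qdisc_ops is visible at this point in sch_netem.c):

/* Hypothetical: return true if any ancestor of sch is a netem with
 * duplication enabled. One hash lookup per ancestor, and it would
 * run for every single skb we enqueue.
 */
static bool netem_dup_ancestor(struct Qdisc *sch)
{
	u32 parentid = sch->parent;

	while (parentid != TC_H_ROOT) {
		struct Qdisc *p;

		p = qdisc_lookup_rcu(qdisc_dev(sch), TC_H_MAJ(parentid));
		if (!p)
			return false;
		if (p->ops == &netem_qdisc_ops) {
			struct netem_sched_data *q = qdisc_priv(p);

			if (q->duplicate)
				return true;
		}
		parentid = p->parent;
	}
	return false;
}

Even then the answer could be stale by the time we act on it if a qdisc_replace races with the enqueue - one more reason the per-cpu counter is preferable.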
Anyway, please let us know how the patch below looks - it passes all the netem category test cases in tc-testing and avoids the problem for me. Thank you for all the help so far!
From c96d94ab2155e18318a29510f0ee8f3983a14274 Mon Sep 17 00:00:00 2001
From: William Liu <will@...lsroot.io>
Date: Thu, 22 May 2025 13:08:30 -0700
Subject: [PATCH] net/sched: Fix duplication logic in netem_enqueue
netem_enqueue's duplication prevention logic is broken when multiple
netem qdiscs with duplication enabled are stacked together, as seen
in [1]: each duplicate is re-injected at the root and duplicated
again on the way down, so packet counts explode. Ensure that
duplication happens at most once per enqueue call chain, no matter
how many duplicating netems sit in the qdisc tree. Also move the
duplication logic to run after the initial packet has finished the
enqueue process, to preserve the accuracy of the limit check.
[1] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <will@...lsroot.io>
Reported-by: Savino Dicanosa <savy@...t3mfailure.io>
Signed-off-by: William Liu <will@...lsroot.io>
Signed-off-by: Savino Dicanosa <savy@...t3mfailure.io>
---
net/sched/sch_netem.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fdd79d3ccd8c..651fae1cd7d6 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -167,6 +167,8 @@ struct netem_skb_cb {
u64 time_to_send;
};
+static DEFINE_PER_CPU(unsigned int, enqueue_nest_level);
+
static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
{
/* we assume we can use skb next/prev/tstamp as storage for rb_node */
@@ -454,13 +456,21 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
struct sk_buff *skb2 = NULL;
struct sk_buff *segs = NULL;
unsigned int prev_len = qdisc_pkt_len(skb);
+ int nest_level = __this_cpu_inc_return(enqueue_nest_level);
+ int retval;
int count = 1;
/* Do not fool qdisc_drop_all() */
skb->prev = NULL;
- /* Random duplication */
- if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
+ /* Random duplication
+ *
+ * A duplicated packet must not be duplicated again, but there is
+ * no good way to track this per skb. Instead, only allow
+ * duplication in the outermost netem of this cpu's enqueue call
+ * chain, i.e. when nest_level is 1.
+ */
+ if (q->duplicate && nest_level == 1 &&
+ q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
++count;
/* Drop packet? */
@@ -473,7 +483,8 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (count == 0) {
qdisc_qstats_drop(sch);
__qdisc_drop(skb, to_free);
- return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ retval = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ goto dec_nest_level;
}
/* If a delay is expected, orphan the skb. (orphaning usually takes
@@ -528,7 +539,8 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
qdisc_drop_all(skb, sch, to_free);
if (skb2)
__qdisc_drop(skb2, to_free);
- return NET_XMIT_DROP;
+ retval = NET_XMIT_DROP;
+ goto dec_nest_level;
}
/*
@@ -642,9 +654,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
/* Parent qdiscs accounted for 1 skb of size @prev_len */
qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
} else if (!skb) {
- return NET_XMIT_DROP;
+ retval = NET_XMIT_DROP;
+ goto dec_nest_level;
}
- return NET_XMIT_SUCCESS;
+ retval = NET_XMIT_SUCCESS;
+
+dec_nest_level:
+ __this_cpu_dec(enqueue_nest_level);
+ return retval;
}
/* Delay the next round with a new future slot with a
--
2.43.0