lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <q7G0Z7oMR2x9TWwNHOiPNsZ8lHzAuXuVgrZgGmAgkH8lkIYyTgeqXwcDrelE_fdS9OdJ4TlfS96px6O9SvnmKigNKFkiaFlStvAGPIJ3b84=@willsroot.io>
Date: Sun, 25 May 2025 20:43:35 +0000
From: William Liu <will@...lsroot.io>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Savy <savy@...t3mfailure.io>, Cong Wang <xiyou.wangcong@...il.com>, Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>, Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>, Stephen Hemminger <stephen@...workplumber.org>, Davide Caratti <dcaratti@...hat.com>
Subject: Re: [BUG] net/sched: Soft Lockup/Task Hang and OOM Loop in netem_dequeue

I did some more testing with the percpu approach, and we realized the following problem caused now by netem_dequeue.

Recall that we increment the percpu variable on netem_enqueue entry and decrement it on exit. netem_dequeue calls enqueue on the child qdisc - if this child qdisc is a netem qdisc with duplication enabled, it could duplicate a previously duplicated packet from the parent back to the parent, causing the issue again. The percpu variable cannot protect against this case.

However, there is a hack to address this. We can add a field in netem_skb_cb called duplicated to track if a packet is involved in duplicated (both the original and duplicated packet should have it marked). Right before we call the child enqueue in netem_dequeue, we check for the duplicated value. If it is true, we increment the percpu variable before and decrement it after the child enqueue call.

This only works under the assumption that there aren't other qdiscs that call enqueue on their child during dequeue, which seems to be the case for now. And honestly, this is quite a fragile fix - there might be other edge cases that will cause problems later down the line.

Are you aware of other more elegant approaches we can try for us to track this required cross-qdisc state? We suggested adding a single bit to the skb, but we also see the problem with adding a field for a one-off use case to such a vital structure (but this would also completely stomp out this bug).

Anyways, below is a diff with our fix plus the test suites to catch for this bug as well as to ensure that a packet loss takes priority over a packet duplication event.

Please let me know of your thoughts - if this seems like a good enough fix, I will submit a formal patchset. If any others cc'd here have any good ideas, please chime in too!

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fdd79d3ccd8c..38bf85e24bbd 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -165,8 +165,11 @@ struct netem_sched_data {
  */
 struct netem_skb_cb {
        u64             time_to_send;
+       bool            duplicated;
 };
 
+static DEFINE_PER_CPU(unsigned int, enqueue_nest_level);
+
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
 {
        /* we assume we can use skb next/prev/tstamp as storage for rb_node */
@@ -448,32 +451,39 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
 static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                         struct sk_buff **to_free)
 {
+       int nest_level = __this_cpu_inc_return(enqueue_nest_level);
        struct netem_sched_data *q = qdisc_priv(sch);
-       /* We don't fill cb now as skb_unshare() may invalidate it */
-       struct netem_skb_cb *cb;
+       unsigned int prev_len = qdisc_pkt_len(skb);
        struct sk_buff *skb2 = NULL;
        struct sk_buff *segs = NULL;
-       unsigned int prev_len = qdisc_pkt_len(skb);
-       int count = 1;
+       /* We don't fill cb now as skb_unshare() may invalidate it */
+       struct netem_skb_cb *cb;
+       bool duplicate = false;
+       int retval;
 
        /* Do not fool qdisc_drop_all() */
        skb->prev = NULL;
 
-       /* Random duplication */
-       if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
-               ++count;
+       /*
+        * Random duplication
+        * We must avoid duplicating a duplicated packet, but there is no
+        * good way to track this. The nest_level check disables duplication
+        * if a netem qdisc duplicates the skb in the call chain already
+        */
+       if (q->duplicate && nest_level <=1 &&
+           q->duplicate >= get_crandom(&q->dup_cor, &q->prng)) {
+               duplicate = true;
+       }
 
        /* Drop packet? */
        if (loss_event(q)) {
-               if (q->ecn && INET_ECN_set_ce(skb))
-                       qdisc_qstats_drop(sch); /* mark packet */
-               else
-                       --count;
-       }
-       if (count == 0) {
-               qdisc_qstats_drop(sch);
-               __qdisc_drop(skb, to_free);
-               return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+               qdisc_qstats_drop(sch); /* mark packet */
+               if (!(q->ecn && INET_ECN_set_ce(skb))) {
+                       qdisc_qstats_drop(sch);
+                       __qdisc_drop(skb, to_free);
+                       retval = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+                       goto dec_nest_level;
+               }
        }
 
        /* If a delay is expected, orphan the skb. (orphaning usually takes
@@ -486,7 +496,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
         * If we need to duplicate packet, then clone it before
         * original is modified.
         */
-       if (count > 1)
+       if (duplicate)
                skb2 = skb_clone(skb, GFP_ATOMIC);
 
        /*
@@ -528,27 +538,15 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                qdisc_drop_all(skb, sch, to_free);
                if (skb2)
                        __qdisc_drop(skb2, to_free);
-               return NET_XMIT_DROP;
-       }
-
-       /*
-        * If doing duplication then re-insert at top of the
-        * qdisc tree, since parent queuer expects that only one
-        * skb will be queued.
-        */
-       if (skb2) {
-               struct Qdisc *rootq = qdisc_root_bh(sch);
-               u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
-
-               q->duplicate = 0;
-               rootq->enqueue(skb2, rootq, to_free);
-               q->duplicate = dupsave;
-               skb2 = NULL;
+               retval = NET_XMIT_DROP;
+               goto dec_nest_level;
        }
 
        qdisc_qstats_backlog_inc(sch, skb);
 
        cb = netem_skb_cb(skb);
+       if (duplicate)
+               cb->duplicated = true;
        if (q->gap == 0 ||              /* not doing reordering */
            q->counter < q->gap - 1 ||  /* inside last reordering gap */
            q->reorder < get_crandom(&q->reorder_cor, &q->prng)) {
@@ -613,6 +611,19 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                sch->qstats.requeues++;
        }
 
+       /*
+        * If doing duplication then re-insert at top of the
+        * qdisc tree, since parent queuer expects that only one
+        * skb will be queued.
+        */
+       if (skb2) {
+               struct Qdisc *rootq = qdisc_root_bh(sch);
+
+               netem_skb_cb(skb2)->duplicated = true;
+               rootq->enqueue(skb2, rootq, to_free);
+               skb2 = NULL;
+       }
+
 finish_segs:
        if (skb2)
                __qdisc_drop(skb2, to_free);
@@ -642,9 +653,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                /* Parent qdiscs accounted for 1 skb of size @prev_len */
                qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
        } else if (!skb) {
-               return NET_XMIT_DROP;
+               retval = NET_XMIT_DROP;
+               goto dec_nest_level;
        }
-       return NET_XMIT_SUCCESS;
+       retval = NET_XMIT_SUCCESS;
+
+dec_nest_level:
+       __this_cpu_dec(enqueue_nest_level);
+       return retval;
 }
 
 /* Delay the next round with a new future slot with a
@@ -743,8 +759,11 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
                                unsigned int pkt_len = qdisc_pkt_len(skb);
                                struct sk_buff *to_free = NULL;
                                int err;
-
+                               if (netem_skb_cb(skb)->duplicated)
+                                       __this_cpu_inc(enqueue_nest_level);
                                err = qdisc_enqueue(skb, q->qdisc, &to_free);
+                               if (netem_skb_cb(skb)->duplicated)
+                                       __this_cpu_dec(enqueue_nest_level);
                                kfree_skb_list(to_free);
                                if (err != NET_XMIT_SUCCESS) {
                                        if (net_xmit_drop_count(err))
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
index 3c4444961488..f5dd9c5cd9b2 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
@@ -336,5 +336,46 @@
         "teardown": [
             "$TC qdisc del dev $DUMMY handle 1: root"
         ]
+    },
+    {
+        "id": "d34d",
+        "name": "NETEM test qdisc duplication recursion limit",
+        "category": ["qdisc", "netem"],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            "$IP link set lo up || true",
+            "$TC qdisc add dev lo root handle 1: netem limit 1 duplicate 100%",
+            "$TC qdisc add dev lo parent 1: handle 2: netem gap 1 limit 1 duplicate 100% delay 1us reorder 100%"
+        ],
+        "cmdUnderTest": "ping -I lo -c1 127.0.0.1 > /dev/null",
+        "expExitCode": "0",
+        "verifyCmd": "$TC -s qdisc show dev lo",
+        "matchPattern": "qdisc netem",
+        "matchCount": "2",
+        "teardown": [
+            "$TC qdisc del dev lo handle 1:0 root"
+        ]
+    },
+    {
+        "id": "b33f",
+        "name": "NETEM test qdisc maximum duplication and loss",
+        "category": ["qdisc", "netem"],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            "$IP link set lo up || true",
+            "$TC qdisc add dev lo root handle 1: netem limit 1 duplicate 100% loss 100%"
+        ],
+        "cmdUnderTest": "ping -I lo -c1 127.0.0.1 > /dev/null",
+        "expExitCode": "1",
+        "verifyCmd": "$TC -s qdisc show dev lo",
+        "matchPattern": "Sent 0 bytes 0 pkt",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev lo handle 1:0 root"
+        ]
     }
 ]


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ