Message-ID: <CAM0EoMnmpjGVU2XyrH=p=-BY6JGU44qsqyfEik4g5E2M8rMMOQ@mail.gmail.com>
Date: Wed, 28 May 2025 17:59:57 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: William Liu <will@...lsroot.io>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Savy <savy@...t3mfailure.io>,
Cong Wang <xiyou.wangcong@...il.com>, Victor Nogueira <victor@...atatu.com>,
Pedro Tammela <pctammela@...atatu.com>, Paolo Abeni <pabeni@...hat.com>,
Jakub Kicinski <kuba@...nel.org>, Stephen Hemminger <stephen@...workplumber.org>,
Davide Caratti <dcaratti@...hat.com>
Subject: Re: [BUG] net/sched: Soft Lockup/Task Hang and OOM Loop in netem_dequeue
Hi,
Sorry for the latency..
On Sun, May 25, 2025 at 4:43 PM William Liu <will@...lsroot.io> wrote:
>
> I did some more testing with the percpu approach, and we realized there is a new problem, this time caused by netem_dequeue.
>
> Recall that we increment the percpu variable on netem_enqueue entry and decrement it on exit. netem_dequeue calls enqueue on the child qdisc - if this child qdisc is a netem qdisc with duplication enabled, it could duplicate a previously duplicated packet from the parent back to the parent, causing the issue again. The percpu variable cannot protect against this case.
>
I didn't follow why the "percpu variable cannot protect against this case"
- the enqueue and dequeue would be running on the same cpu, no?
Also, under what circumstances does the enqueue back to the root end up
calling dequeue? Did you test and hit this issue, or is it just theory?
Note: it doesn't matter what the source of the skb is, as long as it
hits the netem enqueue.
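
For anyone skimming the thread, this is roughly the path being discussed,
heavily abbreviated from the current sch_netem.c - everything not relevant
to the re-entry is elided, so treat it as a sketch rather than the actual
code:

/* netem_dequeue() hands the delayed skb to its child qdisc ... */
static struct sk_buff *netem_dequeue(struct Qdisc *sch)
{
        struct netem_sched_data *q = qdisc_priv(sch);
        struct sk_buff *skb;

        /* ... skb whose time_to_send has arrived ... */
        if (q->qdisc) {
                struct sk_buff *to_free = NULL;

                /* calls the child qdisc's ->enqueue() */
                qdisc_enqueue(skb, q->qdisc, &to_free);
                kfree_skb_list(to_free);
        }
        /* ... */
}

/* ... and if that child is a netem with duplication enabled, its
 * enqueue re-injects the clone at the root, i.e. above the
 * netem_dequeue() that just called it (this is the hunk the diff
 * below moves):
 */
static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                         struct sk_buff **to_free)
{
        /* ... */
        if (skb2) {
                struct Qdisc *rootq = qdisc_root_bh(sch);

                rootq->enqueue(skb2, rootq, to_free);
        }
        /* ... */
}
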
> However, there is a hack to address this. We can add a field in netem_skb_cb called duplicated to track whether a packet is involved in duplication (both the original and the duplicated packet should have it marked). Right before we call the child enqueue in netem_dequeue, we check the duplicated value. If it is true, we increment the percpu variable before the child enqueue call and decrement it after.
>
Is netem_skb_cb really safe for hierarchies? grep for qdisc_skb_cb in
net/sched/ to see what I mean.
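
For context, a paraphrased and abbreviated view of the cb layering involved
(see include/net/sch_generic.h and the accessors at the top of sch_netem.c
for the real definitions):

/* every qdisc level shares the same per-skb scratch area ... */
struct qdisc_skb_cb {
        /* pkt_len, tc_classid, ... */
        unsigned char data[QDISC_CB_PRIV_LEN];  /* shared private area */
};

/* ... and netem just overlays its private struct on data[] */
struct netem_skb_cb {
        u64 time_to_send;
        /* the proposed 'duplicated' flag would live here too */
};

static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
{
        return (struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
}

Other qdiscs overlay their own structs on the very same bytes (sfq and
choke do, for example), so once the skb passes through another qdisc in
the hierarchy, whatever netem stored there may already be gone by the
time netem reads it back.
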
> This only works under the assumption that there aren't other qdiscs that call enqueue on their child during dequeue, which seems to be the case for now. And honestly, this is quite a fragile fix - there might be other edge cases that will cause problems later down the line.
>
> Are you aware of other, more elegant approaches we could try to track this required cross-qdisc state? We suggested adding a single bit to the skb, but we also see the problem with adding a field to such a vital structure for a one-off use case (though it would completely stomp out this bug).
>
It sounds like quite a complicated approach - I don't know what the
dequeue thing brings to the table; and if we really have to dequeue to
reinject into enqueue, then I don't think we are looping anymore.
cheers,
jamal
> Anyway, below is a diff with our fix, plus test cases to catch this bug and to ensure that a packet loss takes priority over a packet duplication event.
>
> Please let me know your thoughts - if this seems like a good enough fix, I will submit a formal patchset. If anyone else cc'd here has good ideas, please chime in too!
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index fdd79d3ccd8c..38bf85e24bbd 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -165,8 +165,11 @@ struct netem_sched_data {
> */
> struct netem_skb_cb {
> u64 time_to_send;
> + bool duplicated;
> };
>
> +static DEFINE_PER_CPU(unsigned int, enqueue_nest_level);
> +
> static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
> {
> /* we assume we can use skb next/prev/tstamp as storage for rb_node */
> @@ -448,32 +451,39 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
> static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> struct sk_buff **to_free)
> {
> + int nest_level = __this_cpu_inc_return(enqueue_nest_level);
> struct netem_sched_data *q = qdisc_priv(sch);
> - /* We don't fill cb now as skb_unshare() may invalidate it */
> - struct netem_skb_cb *cb;
> + unsigned int prev_len = qdisc_pkt_len(skb);
> struct sk_buff *skb2 = NULL;
> struct sk_buff *segs = NULL;
> - unsigned int prev_len = qdisc_pkt_len(skb);
> - int count = 1;
> + /* We don't fill cb now as skb_unshare() may invalidate it */
> + struct netem_skb_cb *cb;
> + bool duplicate = false;
> + int retval;
>
> /* Do not fool qdisc_drop_all() */
> skb->prev = NULL;
>
> - /* Random duplication */
> - if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng))
> - ++count;
> + /*
> + * Random duplication
> + * We must avoid duplicating an already duplicated packet, but there is
> + * no good way to track this. The nest_level check disables duplication
> + * if a netem qdisc earlier in the call chain has already duplicated it.
> + */
> + if (q->duplicate && nest_level <= 1 &&
> + q->duplicate >= get_crandom(&q->dup_cor, &q->prng)) {
> + duplicate = true;
> + }
>
> /* Drop packet? */
> if (loss_event(q)) {
> - if (q->ecn && INET_ECN_set_ce(skb))
> - qdisc_qstats_drop(sch); /* mark packet */
> - else
> - --count;
> - }
> - if (count == 0) {
> - qdisc_qstats_drop(sch);
> - __qdisc_drop(skb, to_free);
> - return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> + /* account the loss event; with ECN the packet is marked, not dropped */
> + qdisc_qstats_drop(sch);
> + if (!(q->ecn && INET_ECN_set_ce(skb))) {
> + __qdisc_drop(skb, to_free);
> + retval = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> + goto dec_nest_level;
> + }
> }
>
> /* If a delay is expected, orphan the skb. (orphaning usually takes
> @@ -486,7 +496,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> * If we need to duplicate packet, then clone it before
> * original is modified.
> */
> - if (count > 1)
> + if (duplicate)
> skb2 = skb_clone(skb, GFP_ATOMIC);
>
> /*
> @@ -528,27 +538,15 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> qdisc_drop_all(skb, sch, to_free);
> if (skb2)
> __qdisc_drop(skb2, to_free);
> - return NET_XMIT_DROP;
> - }
> -
> - /*
> - * If doing duplication then re-insert at top of the
> - * qdisc tree, since parent queuer expects that only one
> - * skb will be queued.
> - */
> - if (skb2) {
> - struct Qdisc *rootq = qdisc_root_bh(sch);
> - u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
> -
> - q->duplicate = 0;
> - rootq->enqueue(skb2, rootq, to_free);
> - q->duplicate = dupsave;
> - skb2 = NULL;
> + retval = NET_XMIT_DROP;
> + goto dec_nest_level;
> }
>
> qdisc_qstats_backlog_inc(sch, skb);
>
> cb = netem_skb_cb(skb);
> + if (duplicate)
> + cb->duplicated = true;
> if (q->gap == 0 || /* not doing reordering */
> q->counter < q->gap - 1 || /* inside last reordering gap */
> q->reorder < get_crandom(&q->reorder_cor, &q->prng)) {
> @@ -613,6 +611,19 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> sch->qstats.requeues++;
> }
>
> + /*
> + * If doing duplication then re-insert at top of the
> + * qdisc tree, since parent queuer expects that only one
> + * skb will be queued.
> + */
> + if (skb2) {
> + struct Qdisc *rootq = qdisc_root_bh(sch);
> +
> + netem_skb_cb(skb2)->duplicated = true;
> + rootq->enqueue(skb2, rootq, to_free);
> + skb2 = NULL;
> + }
> +
> finish_segs:
> if (skb2)
> __qdisc_drop(skb2, to_free);
> @@ -642,9 +653,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> /* Parent qdiscs accounted for 1 skb of size @prev_len */
> qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len));
> } else if (!skb) {
> - return NET_XMIT_DROP;
> + retval = NET_XMIT_DROP;
> + goto dec_nest_level;
> }
> - return NET_XMIT_SUCCESS;
> + retval = NET_XMIT_SUCCESS;
> +
> +dec_nest_level:
> + __this_cpu_dec(enqueue_nest_level);
> + return retval;
> }
>
> /* Delay the next round with a new future slot with a
> @@ -743,8 +759,11 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
> unsigned int pkt_len = qdisc_pkt_len(skb);
> struct sk_buff *to_free = NULL;
> int err;
> -
> + if (netem_skb_cb(skb)->duplicated)
> + __this_cpu_inc(enqueue_nest_level);
> err = qdisc_enqueue(skb, q->qdisc, &to_free);
> + if (netem_skb_cb(skb)->duplicated)
> + __this_cpu_dec(enqueue_nest_level);
> kfree_skb_list(to_free);
> if (err != NET_XMIT_SUCCESS) {
> if (net_xmit_drop_count(err))
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
> index 3c4444961488..f5dd9c5cd9b2 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/netem.json
> @@ -336,5 +336,46 @@
> "teardown": [
> "$TC qdisc del dev $DUMMY handle 1: root"
> ]
> + },
> + {
> + "id": "d34d",
> + "name": "NETEM test qdisc duplication recursion limit",
> + "category": ["qdisc", "netem"],
> + "plugins": {
> + "requires": "nsPlugin"
> + },
> + "setup": [
> + "$IP link set lo up || true",
> + "$TC qdisc add dev lo root handle 1: netem limit 1 duplicate 100%",
> + "$TC qdisc add dev lo parent 1: handle 2: netem gap 1 limit 1 duplicate 100% delay 1us reorder 100%"
> + ],
> + "cmdUnderTest": "ping -I lo -c1 127.0.0.1 > /dev/null",
> + "expExitCode": "0",
> + "verifyCmd": "$TC -s qdisc show dev lo",
> + "matchPattern": "qdisc netem",
> + "matchCount": "2",
> + "teardown": [
> + "$TC qdisc del dev lo handle 1:0 root"
> + ]
> + },
> + {
> + "id": "b33f",
> + "name": "NETEM test qdisc maximum duplication and loss",
> + "category": ["qdisc", "netem"],
> + "plugins": {
> + "requires": "nsPlugin"
> + },
> + "setup": [
> + "$IP link set lo up || true",
> + "$TC qdisc add dev lo root handle 1: netem limit 1 duplicate 100% loss 100%"
> + ],
> + "cmdUnderTest": "ping -I lo -c1 127.0.0.1 > /dev/null",
> + "expExitCode": "1",
> + "verifyCmd": "$TC -s qdisc show dev lo",
> + "matchPattern": "Sent 0 bytes 0 pkt",
> + "matchCount": "1",
> + "teardown": [
> + "$TC qdisc del dev lo handle 1:0 root"
> + ]
> }
> ]
>