Message-ID: <0x7zdcWIGm0NWid6NxFLpYOtO0Z1g6UCzrNnyVZ6hRvWr5rU6b6hi5Yz8dD7_dyUOmvJfkR8LV2_TrDf7uACFgGshyfxiRWgxjWer41EZVY=@willsroot.io>
Date: Fri, 30 May 2025 14:49:25 +0000
From: William Liu <will@...lsroot.io>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Savy <savy@...t3mfailure.io>, Cong Wang <xiyou.wangcong@...il.com>, Victor Nogueira <victor@...atatu.com>, Pedro Tammela <pctammela@...atatu.com>, Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>, Stephen Hemminger <stephen@...workplumber.org>, Davide Caratti <dcaratti@...hat.com>
Subject: Re: [BUG] net/sched: Soft Lockup/Task Hang and OOM Loop in netem_dequeue
On Friday, May 30th, 2025 at 2:14 PM, Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
>
> On Thu, May 29, 2025 at 11:23 AM William Liu will@...lsroot.io wrote:
>
> > On Wednesday, May 28th, 2025 at 10:00 PM, Jamal Hadi Salim jhs@...atatu.com wrote:
> >
> > > Hi,
> > > Sorry for the latency..
> > >
> > > On Sun, May 25, 2025 at 4:43 PM William Liu will@...lsroot.io wrote:
> > >
> > > > I did some more testing with the percpu approach, and we realized the following problem caused now by netem_dequeue.
> > > >
> > > > Recall that we increment the percpu variable on netem_enqueue entry and decrement it on exit. netem_dequeue calls enqueue on the child qdisc - if this child qdisc is a netem qdisc with duplication enabled, it could duplicate a previously duplicated packet from the parent back to the parent, causing the issue again. The percpu variable cannot protect against this case.
> > >
> > > I didn't follow why "percpu variable cannot protect against this case"
> > > - the enqueue and dequeue would be running on the same cpu, no?
> > > Also, under what circumstances is the enqueue back to the root going to
> > > end up calling dequeue? Did you test and hit this issue or is it just
> > > theory? Note: It doesn't matter what the source of the skb is as long
> > > as it hits the netem enqueue.
> >
> > Yes, I meant that just using the percpu variable in enqueue will not protect against the case where dequeue calls enqueue on the child. Because the child netem has duplication enabled, packets already involved in duplication get sent back to the parent's tfifo queue, and the current dequeue then remains stuck in the loop until it hits an OOM - refer to the paragraph starting with "In netem_dequeue, the parent netem qdisc's t_len" in the first email for additional clarification. We need to know whether a packet we dequeue has been involved in duplication - if it has, we increment the percpu variable to inform the child netem qdiscs.
> >
> > Hopefully the following diagram can help elucidate the problem:
> >
> > Step 1: Initial enqueue of Packet A:
> >
> > +----------------------+
> > | Packet A |
> > +----------------------+
> > |
> > v
> > +-------------------------+
> > | netem_enqueue |
> > +-------------------------+
> > |
> > v
> > +-----------------------------------+
> > | Duplication Logic (percpu OK): |
> > | => Packet A, Packet B (dup) |
> > +-----------------------------------+
> > | <- percpu variable for netem_enqueue
> > v prevents duplication of B
> > +-------------+
> > | tfifo queue |
> > | [A, B] |
> > +-------------+
> >
> > Step 2: netem_dequeue processes Packet B (or A)
> >
> > +-------------+
> > | tfifo queue |
> > | [A] |
> > +-------------+
> > |
> > v
> > +----------------------------------------+
> > | netem_dequeue pops B in tfifo_dequeue |
> > +----------------------------------------+
> > |
> > v
> > +--------------------------------------------+
> > | netem_enqueue to child qdisc (netem w/ dup)|
> > +--------------------------------------------+
> > | <- percpu variable in netem_enqueue prologue
> > | and epilogue does not stop this dup,
> > v does not know about previous dup involvement
> > +--------------------------------------------------------+
> > | Child qdisc duplicates B to root (original netem) as C |
> > +--------------------------------------------------------+
> > |
> > v
> >
> > Step 3: Packet C enters original root netem again
> >
> > +-------------------------+
> > | netem_enqueue (again) |
> > +-------------------------+
> > |
> > v
> > +-------------------------------------+
> > | Duplication Logic (percpu OK again) |
> > | => Packet C, Packet D |
> > +-------------------------------------+
> > |
> > v
> > .....
> >
> > If you only increment a percpu variable in the enqueue prologue and decrement it in the enqueue epilogue, our original repro still triggers the loop because of the scenario I pointed out above - we have tested this.
> >
> > As the codebase currently stands, netem is the only qdisc that calls enqueue on its child from its dequeue. The check we propose only works as long as this invariant holds.
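
Roughly, the enqueue-only percpu guard described above would look something like the following. This is an untested sketch for illustration only - the counter name and exact placement are made up and this is not actual sch_netem.c code:

/* Illustrative only: a per-CPU depth counter held for the lifetime
 * of netem_enqueue(). */
static DEFINE_PER_CPU(unsigned int, netem_enq_depth);

static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                         struct sk_buff **to_free)
{
        struct netem_sched_data *q = qdisc_priv(sch);
        unsigned int depth = __this_cpu_inc_return(netem_enq_depth);
        int ret = NET_XMIT_SUCCESS;

        /* Only the outermost enqueue on this CPU is allowed to duplicate;
         * a nested enqueue (a duplicate re-entering the root) must not
         * fan out again. */
        if (q->duplicate && depth == 1) {
                /* ... clone skb and enqueue the copy to the root ... */
        }

        /* ... rest of enqueue unchanged ... */

        __this_cpu_dec(netem_enq_depth);
        return ret;
}

The problem in the diagram is exactly that by the time dequeue hands a packet to the child, the outer enqueue has long since returned and the counter is back at zero, so the child's duplication logic runs at depth 1 and re-creates the cycle.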
> >
> > > > However, there is a hack to address this. We can add a field in netem_skb_cb called duplicated to track whether a packet was involved in duplication (both the original and the duplicated packet should have it marked). Right before we call the child enqueue in netem_dequeue, we check the duplicated value. If it is true, we increment the percpu variable before and decrement it after the child enqueue call.
> > >
> > > is netem_skb_cb really safe for hierarchies? grep for qdisc_skb_cb in
> > > net/sched/ to see what I mean
> >
> > We are not using it for cross-qdisc hierarchy checking. We only use it to tell a netem dequeue whether the packet took part in duplication in its corresponding netem enqueue. That area appears to be private data for the sk_buff while it resides in the current qdisc, so my understanding is that this is ok.
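
As a rough sketch of what we have in mind - the duplicated field and its use are hypothetical and untested, added on top of the existing netem_skb_cb, and it reuses the illustrative netem_enq_depth counter from the sketch above:

struct netem_skb_cb {
        u64     time_to_send;
        bool    duplicated;     /* set on both the original and its copy
                                 * when netem_enqueue duplicates a packet */
};

/* In netem_dequeue(), wrap the child enqueue so a nested netem child
 * sees a non-zero depth for packets that were part of a duplication: */
        if (q->qdisc) {
                struct sk_buff *to_free = NULL;
                bool was_dup = netem_skb_cb(skb)->duplicated;
                int err;

                if (was_dup)
                        __this_cpu_inc(netem_enq_depth);
                err = qdisc_enqueue(skb, q->qdisc, &to_free);
                if (was_dup)
                        __this_cpu_dec(netem_enq_depth);
                kfree_skb_list(to_free);
                /* ... existing error handling on err ... */
        }

This is only a fragment, and it leans on the invariant mentioned above that netem is the only qdisc whose dequeue enqueues to its child.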
> >
> > > > This only works under the assumption that there aren't other qdiscs that call enqueue on their child during dequeue, which seems to be the case for now. And honestly, this is quite a fragile fix - there might be other edge cases that will cause problems later down the line.
> > > >
> > > > Are you aware of any more elegant approaches we could try to track this required cross-qdisc state? We suggested adding a single bit to the skb, but we also see the problem with adding a field to such a vital structure for a one-off use case (even though it would completely stamp out this bug).
> > >
> > > It sounds like quite a complicated approach - I don't know what the
> > > dequeue thing brings to the table; and if we really have to dequeue to
> >
> > Did what I say above help clarify what the problem is? Feel free to let me know if you have more questions; this bug is quite a nasty one.
>
>
> The text helped a bit, but please send a tc reproducer of the issue you
> described to help me better understand how you end up in the tfifo
> which then calls the enqueue, etc.
The reproducer is the same one we originally reported:
tc qdisc add dev lo root handle 1: netem limit 1 duplicate 100%
tc qdisc add dev lo parent 1: handle 2: netem gap 1 limit 1 duplicate 100% delay 1us reorder 100%
ping -I lo -f -c1 -s48 -W0.001 127.0.0.1
We walked through the problematic codepath in the first email of this thread, in the paragraph starting with "The root cause for this is complex. Because of the way we setup the parent qdisc" - please let me know if any part of it needs additional clarification.
Best,
Will