[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8e19395d-b6d6-47d4-9ce0-e2b59e109b2b@gmail.com>
Date: Mon, 30 Jun 2025 11:04:51 +0200
From: Lion Ackermann <nnamrec@...il.com>
To: Cong Wang <xiyou.wangcong@...il.com>, Jamal Hadi Salim <jhs@...atatu.com>
Cc: netdev@...r.kernel.org, Jiri Pirko <jiri@...nulli.us>,
Mingi Cho <mincho@...ori.io>
Subject: Re: Incomplete fix for recent bug in tc / hfsc
Hi,
On 6/29/25 9:50 PM, Cong Wang wrote:
> On Sun, Jun 29, 2025 at 10:29:44AM -0400, Jamal Hadi Salim wrote:
>>> On "What do you think the root cause is here?"
>>>
>>> I believe the root cause is that qdiscs like hfsc and qfq are dropping
>>> all packets in enqueue (mostly in relation to peek()) and that result
>>> is not being reflected in the return code returned to its parent
>>> qdisc.
>>> So, in the example you described in this thread, drr is oblivious to
>>> the fact that the child qdisc dropped its packet because the call to
>>> its child enqueue returned NET_XMIT_SUCCESS. This causes drr to
>>> activate a class that shouldn't have been activated at all.
>>>
>>> You can argue that drr (and other similar qdiscs) may detect this by
>>> checking the call to qlen_notify (as the drr patch was
>>> doing), but that seems really counter-intuitive. Imagine writing a new
>>> qdisc and having to check for that every time you call a child's
>>> enqueue. Sure your patch solves this, but it also seems like it's not
>>> fixing the underlying issue (which is drr activating the class in the
>>> first place). Your patch is simply removing all the classes from their
>>> active lists when you delete them. And your patch may seem ok for now,
>>> but I am worried it might break something else in the future that we
>>> are not seeing.
>>>
>>> And do note: All of the examples of the hierarchy I have seen so far,
>>> that put us in this situation, are nonsensical
>>>
>>
>> At this point my thinking is to apply your patch and then we discuss a
>> longer term solution. Cong?
>
> I agree. If Lion's patch works, it is certainly much better as a bug fix
> for both -net and -stable.
>
> Also for all of those ->qlen_notify() craziness, I think we need to
> rethink about the architecture, _maybe_ there are better architectural
> solutions.
>
> Thanks!
Just for the record, I agree with all your points and as was stated this
patch really only does damage prevention. Your proposal of preventing
hierarchies sounds useful in the long run to keep the backlogs sane.
I did run all the tdc tests on the latest net tree and they passed. Also
my HFSC reproducer does not trigger with the proposed patch. I do not have
a simple reproducer at hand for the QFQ tree case that you mentioned. So
please verify this too if you can.
Otherwise please feel free to go forward with the patch. If I can add
anything else to the discussion please let me know.
Thanks,
Lion
Powered by blists - more mailing lists