Message-ID: <OF2YXaY19FGNBLPjTD_cAIQim1BVjj7pzMkq8j5mXSQJr9Kd6N04zf2YkLCEpxnIz-zrljMlV0Ask-hlUDuc3rkzIKfF7MzY-jgVtyTi2Q4=@willsroot.io>
Date: Tue, 12 Aug 2025 02:10:02 +0000
From: William Liu <will@...lsroot.io>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, jhs@...atatu.com, xiyou.wangcong@...il.com, pabeni@...hat.com, jiri@...nulli.us, davem@...emloft.net, edumazet@...gle.com, horms@...nel.org, savy@...t3mfailure.io, victor@...atatu.com
Subject: Re: [PATCH net v4 1/2] net/sched: Fix backlog accounting in qdisc_dequeue_internal
On Tuesday, August 12th, 2025 at 12:51 AM, Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Mon, 11 Aug 2025 17:51:39 +0000 William Liu wrote:
>
> > > This sort of separation of logic is very error prone in general.
> > > If you're asking for a specific bug that would exist with your
> > > patch - I believe that two subsequent fq_codel_change() calls,
> > > first one reducing the limit, the other one not reducing (and
> > > therefore never invoking dequeue) will adjust the backlog twice.
> >
> > In that case, I think the code in the limit adjustment while loop
> > never runs, so the backlog reduction would only happen with arguments
> > of 0.
>
> True.
>
> > But yes, I agree that this approach is not ideal.
> >
> > > As I commented in the previous message - wouldn't counting the
> > > packets we actually dequeue solve this problem? smth like:
> > >
> > > pkts = 0;
> > > bytes = 0;
> > > while (sch->q.qlen > sch->limit ||
> > >        q->memory_usage > q->memory_limit) {
> > >         struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> > >
> > >         if (!skb)
> > >                 break;
> > >         pkts++;
> > >         bytes += qdisc_pkt_len(skb);
> > >         rtnl_kfree_skbs(skb, skb);
> > > }
> > > qdisc_tree_reduce_backlog(sch, pkts, bytes);
> > >
> > > ? "Conceptually" we are only responsible for adjusting the backlog
> > > for skbs we actually gave to kfree_skb().
> >
> > I think the issue here is that qdisc_dequeue_internal can call the
> > actual dequeue handler, and fq_codel_dequeue would have already made
> > a qdisc_tree_reduce_backlog call [1] when cstats.drop_count is
> > non-zero. Wouldn't we be double counting packets and bytes for
> > qdisc_tree_reduce_backlog after the limit adjustment loop with this
> > approach?
>
> AFAICT only if the backlog adjustment is using the prev_qlen,
> prev_backlog approach, which snapshots the backlog. In that case,
> yes, the "internal drops" will mess up the count.
Yep, that's why I added the dropped_qlen and dropped_backlog variables, though that is not a very pretty solution.
But even looking at the method you suggested (copy-pasting for reference):

pkts = 0;
bytes = 0;
while (sch->q.qlen > sch->limit ||
       q->memory_usage > q->memory_limit) {
        struct sk_buff *skb = qdisc_dequeue_internal(sch, false);

        if (!skb)
                break;
        pkts++;
        bytes += qdisc_pkt_len(skb);
        rtnl_kfree_skbs(skb, skb);
}
qdisc_tree_reduce_backlog(sch, pkts, bytes);
qdisc_dequeue_internal can fall through to fq_codel_dequeue, which can itself call qdisc_tree_reduce_backlog before returning (fq_codel is the only one of the affected qdiscs that does this in its dequeue handler).
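For reference, the helper looks roughly like this before this series (condensed from include/net/sch_generic.h; details trimmed):

static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch,
                                                     bool direct)
{
        struct sk_buff *skb;

        /* Deferred/peeked skbs are handed back first. Note that qlen
         * is adjusted here but qstats.backlog is not - the accounting
         * gap this series addresses. */
        skb = __skb_dequeue(&sch->gso_skb);
        if (skb) {
                sch->q.qlen--;
                return skb;
        }
        if (direct)
                return __qdisc_dequeue_head(&sch->q);
        else
                return sch->dequeue(sch); /* e.g. fq_codel_dequeue() */
}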
Let's say the limit only goes down by one, and packet A is at the front of the queue. qdisc_dequeue_internal takes the dequeue path, and fq_codel_dequeue triggers a qdisc_tree_reduce_backlog for that packet before returning the skb. Wouldn't the final qdisc_tree_reduce_backlog after the trimming loop then double count?
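The relevant tail of fq_codel_dequeue does roughly the following (condensed; see [1] for the full version):

        skb = codel_dequeue(sch, ...); /* may drop packets internally */
        ...
        /* Drops made inside codel_dequeue() are propagated up the
         * hierarchy here, before the skb is handed to the caller. */
        if (q->cstats.drop_count && sch->q.qlen) {
                qdisc_tree_reduce_backlog(sch, q->cstats.drop_count,
                                          q->cstats.drop_len);
                q->cstats.drop_count = 0;
                q->cstats.drop_len = 0;
        }
        return skb;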
> My naive mental model is that we're only responsible for adjusting
> the backlog for skbs we actually dequeued. IOW the skbs that
> qdisc_dequeue_internal() returned to the "limit trimming loop".
> Because we are "deleting" them, potentially in the middle of
> a qdisc hierarchy. If the qdisc decides to delete some more,
> its on the hook for adjusting for those portions.
>
Right, but in fq_codel the dequeue handler may already have adjusted the backlog for an skb via qdisc_tree_reduce_backlog before qdisc_dequeue_internal even returns. This is what makes the handling here messier.
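To make the two adjustment points concrete, a single pass through the trimming loop can look like this (hypothetical trace; packet names are made up for illustration):

        qdisc_dequeue_internal(sch, false)
          -> fq_codel_dequeue(sch)
               codel_dequeue() drops packet A internally
               qdisc_tree_reduce_backlog(sch, 1, len(A))   <- adjustment 1
          <- returns packet B
        pkts++; bytes += qdisc_pkt_len(B);
        ...
        qdisc_tree_reduce_backlog(sch, pkts, bytes);        <- adjustment 2

The open question is whether adjustments 1 and 2 can ever end up covering the same skb.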
I could be missing something, but the original code in fq_codel was very strange.
> I can't find any bugs in your prev_* snapshot approach, but it does
> feel like a layering violation to me.