lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 10 Nov 2023 13:07:49 +0200
From: Maxim Mikityanskiy <maxtram95@...il.com>
To: "Chittim, Madhu" <madhu.chittim@...el.com>
Cc: Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
	Jamal Hadi Salim <jhs@...atatu.com>,
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
	Gal Pressman <gal@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
	xuejun.zhang@...el.com, sridhar.samudrala@...el.com
Subject: Re: [PATCH net] net: sched: fix warn on htb offloaded class creation

On Thu, 09 Nov 2023 at 13:54:17 -0800, Chittim, Madhu wrote:
> 
> 
> On 10/31/2023 7:40 AM, Maxim Mikityanskiy wrote:
> > On Tue, 31 Oct 2023 at 10:11:14 +0100, Paolo Abeni wrote:
> > > Hi,
> > > 
> > > I'm sorry for the late reply.
> > > 
> > > On Fri, 2023-10-27 at 16:57 +0300, Maxim Mikityanskiy wrote:
> > > > I believe this is not the right fix.
> > > > 
> > > > On Thu, 26 Oct 2023 at 17:36:48 +0200, Paolo Abeni wrote:
> > > > > The following commands:
> > > > > 
> > > > > tc qdisc add dev eth1 handle 2: root htb offload
> > > > > tc class add dev eth1 parent 2: classid 2:1 htb rate 5mbit burst 15k
> > > > > 
> > > > > yeld to a WARN in the HTB qdisc:
> > > > 
> > > > Something is off here. These are literally the most basic commands one
> > > > could invoke with HTB offload, I'm sure they worked. Is it something
> > > > that broke recently? Tariq/Gal/Saeed, could you check them on a Mellanox
> > > > NIC?
> > > > 
> > > > > 
> > > > >   WARNING: CPU: 2 PID: 1583 at net/sched/sch_htb.c:1959
> > > > >   CPU: 2 PID: 1583 Comm: tc Kdump: loaded 6.6.0-rc2.mptcp_7895773e5235+ #59
> > > > >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> > > > >   RIP: 0010:htb_change_class+0x25c4/0x2e30 [sch_htb]
> > > > >   Code: 24 58 48 b8 00 00 00 00 00 fc ff df 48 89 ca 48 c1 ea 03 80 3c 02 00 0f 85 92 01 00 00 49 89 8c 24 b0 01 00 00 e9 77 fc ff ff <0f> 0b e9 15 ec ff ff 80 3d f8 35 00 00 00 0f 85 d4 f9 ff ff ba 32
> > > > >   RSP: 0018:ffffc900015df240 EFLAGS: 00010246
> > > > >   RAX: 0000000000000000 RBX: ffff88811b4ca000 RCX: ffff88811db42800
> > > > >   RDX: 1ffff11023b68502 RSI: ffffffffaf2e6a00 RDI: ffff88811db42810
> > > > >   RBP: ffff88811db45000 R08: 0000000000000001 R09: fffffbfff664bbc9
> > > > >   R10: ffffffffb325de4f R11: ffffffffb2d33748 R12: 0000000000000000
> > > > >   R13: ffff88811db43000 R14: ffff88811b4caaac R15: ffff8881252c0030
> > > > >   FS:  00007f6c1f126740(0000) GS:ffff88815aa00000(0000) knlGS:0000000000000000
> > > > >   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >   CR2: 000055dca8e5b4a8 CR3: 000000011bc7a006 CR4: 0000000000370ee0
> > > > >   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >   Call Trace:
> > > > >   <TASK>
> > > > >    tc_ctl_tclass+0x394/0xeb0
> > > > >    rtnetlink_rcv_msg+0x2f5/0xaa0
> > > > >    netlink_rcv_skb+0x12e/0x3a0
> > > > >    netlink_unicast+0x421/0x730
> > > > >    netlink_sendmsg+0x79e/0xc60
> > > > >    ____sys_sendmsg+0x95a/0xc20
> > > > >    ___sys_sendmsg+0xee/0x170
> > > > >    __sys_sendmsg+0xc6/0x170
> > > > >   do_syscall_64+0x58/0x80
> > > > >   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > > > 
> > > > > The first command creates per TX queue pfifo qdiscs in
> > > > > tc_modify_qdisc() -> htb_init() and grafts the pfifo to each dev_queue
> > > > > via tc_modify_qdisc() ->  qdisc_graft() -> htb_attach().
> > > > 
> > > > Not exactly; it grafts pfifo to direct queues only. htb_attach_offload
> > > > explicitly grafts noop to all the remaining queues.
> > > 
> > > num_direct_qdiscs == real_num_tx_queues:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/net/sched/sch_htb.c#L1101
> > > 
> > > pfifo will be configured on all the TX queues available at TC creation
> > > time, right?
> > 
> > Yes, all real TX queues will be used as direct queues (for unclassified
> > traffic). num_tx_queues should be somewhat bigger than
> > real_num_tx_queues - it should reserve a queue per potential leaf class.
> > 
> > pfifo is configured on direct queues, and the reserved queues have noop.
> > Then, when a new leaf class is added (TC_HTB_LEAF_ALLOC_QUEUE), the
> > driver allocates a new queue and increases real_num_tx_queues. HTB
> > assigns a pfifo qdisc to the newly allocated queue.
> > 
> > Changing the hierarchy (deleting a node or converting an inner node to a
> > leaf) may reorder the classful queues (with indexes >= the initial
> > real_num_tx_queues), so that there are no gaps.
> > 
> > > Lacking a mlx card with offload support I hack basic htb support in
> > > netdevsim and I observe the splat on top of such device. I can as well
> > > share the netdevsim patch - it will need some clean-up.
> > 
> > I will be happy to review the netdevsim patch, but I don't promise
> > prompt responsiveness.
> > 
> > > > 
> > > > > When the command completes, the qdisc_sleeping for each dev_queue is a
> > > > > pfifo one. The next class creation will trigger the reported splat.
> > > > > 
> > > > > Address the issue taking care of old non-builtin qdisc in
> > > > > htb_change_class().
> > > > > 
> > > > > Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> > > > > Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> > > > > ---
> > > > >   net/sched/sch_htb.c | 3 +--
> > > > >   1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> > > > > index 0d947414e616..dc682bd542b4 100644
> > > > > --- a/net/sched/sch_htb.c
> > > > > +++ b/net/sched/sch_htb.c
> > > > > @@ -1955,8 +1955,7 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
> > > > >   				qdisc_refcount_inc(new_q);
> > > > >   			}
> > > > >   			old_q = htb_graft_helper(dev_queue, new_q);
> > > > > -			/* No qdisc_put needed. */
> > > > > -			WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
> > > > > +			qdisc_put(old_q);
> > > > 
> > > > We can get here after one of two cases above:
> > > > 
> > > > 1. A new queue is allocated with TC_HTB_LEAF_ALLOC_QUEUE. It's supposed
> > > > to have a noop qdisc by default (after htb_attach_offload).
> > > 
> > > So most likely the trivial netdevsim implementation I used was not good
> > > enough.
> > > 
> > > Which constrains should respect TC_HTB_LEAF_ALLOC_QUEUE WRT the
> > > returned qid value? should it in the (real_num_tx_queues,
> > > num_tx_queues] range?
> > 
> > Let's say N is real_num_tx_queues as it was at the moment of attaching.
> > HTB queues should be allocated from [N, num_tx_queues), and
> > real_num_tx_queues should be increased accordingly. It should not return
> > queues number [0, N).
> > 
> > Deletions should fill the gaps: if queue X is being deleted, N <= X <
> > real_num_tx_queues - 1, then the gap should be filled with queue number
> > real_num_tx_queues - 1 by swapping the queues (real_num_tx_queues will
> > be decreased by 1 accordingly). Some care also needs to be taken when
> > converting inner-to-leaf (TC_HTB_LEAF_DEL_LAST) and leaf-to-inner (it's
> > better to get insights from [1], there are also some comments).
> > 
> > > Can HTB actually configure H/W shaping on
> > > real_num_tx_queues?
> > 
> > It will be on real_num_tx_queues, but after it's increased to add new
> > HTB queues. The original queues [0, N) are used for direct traffic, same
> > as the non-offloaded HTB's direct_queue (it's not shaped).
> > 
> > > I find no clear documentation WRT the above.
> > 
> > I'm sorry for the lack of documentation. All I have is the commit
> > message [2] and a netdev talk [3]. Maybe the slides could be of some
> > use...
> > 
> > I hope the above explanation clarifies something, and feel free to ask
> > further questions, I'll be glad to explain what hasn't been documented
> > properly.
> 
> We would like to enable Tx rate limiting using htb offload on all the
> existing queues.

I don't seem to understand how you see it possible with HTB.

1. Where would the unclassified traffic go? HTB uses a set of filters
that steer certain traffic to certain classes. Everything that doesn't
match the filter goes to the direct queue, which doesn't have shaping.
In the non-offloaded HTB it's struct htb_sched::direct_queue. With HTB
offload it's mapped to the standard set of queues.

2. How do you see the mapping of HTB classes to netdev queues if you use
a fixed set of queues? In the current implementation of HTB offload,
each new HTB class creates a new queue. If you are bound to using only N
standard queues, how would the allocation work when you add the second,
third, etc. HTB classes?

> We are able to do with the following set of commands with
> Paolo's patch
> 
> tc qdisc add dev enp175s0v0 handle 1: root htb offload
> tc class add dev enp175s0v0 parent 1: classid 1:1 htb rate 1000mbit ceil
> 2000mbit burst 100k

As long as you don't attach any filters, this set of commands is not
supposed to shape anything. It should TX full-speed, you should be able
to see this effect if you configure non-offloaded HTB. As long as there
are no filters, and the default classid doesn't match your only class,
all traffic will go to the direct queue.

What is your goal? If you need shaping for the whole netdev, maybe HTB
is not needed, and it's enough to attach a catchall filter with the
police action? Or use /sys/class/net/eth0/queues/tx-*/tx_maxrate
per-queue? Or tc mqprio mode channel, that allows to group existing
queues and assign rate limits?

> where
>   classid 1:1 is tx queue0
>   tx_minrate is rate 1000mbps
>   tx_maxrate is ceil 2000mbps
> 
> 
> In order to not break your implementation could bring in if condition
> instead WARN_ON, something like below
> 	if (!(old_q->flags & TCQ_F_BUILTIN))
> 		qdisc_put(old_q);
> 
> Would this work for you, please advise.

One of the reasons this WARN is there to prevent misuse of the API (the
other is to catch bugs that lead to inconsistencies in the internal
state). We can't just remove the WARN and start misusing the API. Your
ideas require more major design changes, but before that I need to
understand your approach to solving (1) and (2).

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ