lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <7003673d-3267-60d0-9340-b08e73f481fd@intel.com> Date: Fri, 23 Sep 2022 17:46:35 +0200 From: "Wilczynski, Michal" <michal.wilczynski@...el.com> To: Jakub Kicinski <kuba@...nel.org> CC: Edward Cree <ecree.xilinx@...il.com>, <netdev@...r.kernel.org>, <alexandr.lobakin@...el.com>, <dchumak@...dia.com>, <maximmi@...dia.com>, <jiri@...nulli.us>, <simon.horman@...igine.com>, <jacob.e.keller@...el.com>, <jesse.brandeburg@...el.com>, <przemyslaw.kitszel@...el.com> Subject: Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters On 9/23/2022 3:16 PM, Jakub Kicinski wrote: > On Fri, 23 Sep 2022 14:11:08 +0200 Wilczynski, Michal wrote: >> On 9/22/2022 10:29 PM, Jakub Kicinski wrote: >>> On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote: >>>> On 9/22/2022 2:50 PM, Jakub Kicinski wrote: >> I'm not sure whether this is allowed on mailing list, but I'm attaching >> a text file with an ASCII drawing representing a tree I've send >> previously as linear. Hope you'll find this easier to read. > That helps, thanks! So what I was saying was anything under the vport > layer should be configured by the policy local to the owner of the > function. My main concern is that there are no interfaces to do so. tc-htb in it's current state just doesn't work for us, as I noted before their whole implementation is about creating new queues. https://legacy.netdevconf.info/0x14/pub/papers/44/0x14-paper44-talk-paper.pdf Also reconfiguration from the VM, would need to be handled by the VF driver i.e iavf. So the solution would get much more complex I guess, since we would need to implement communication between ice-iavf, through virtchnl I guess. > >>>> We tried already tc-htb, and it doesn't work for a couple of reasons, >>>> even in this potential hybrid with devlink-rate. One of the problems >>>> with tc-htb offload is that it forces you to allocate a new >>>> queue, it doesn't allow for reassigning an existing queue to another >>>> scheduling node. This is our main use case. >>>> >>>> Here's a discussion about tc-htb: >>>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/ >>> This is a problem only for "SR-IOV case" or also for just the PF? >> The way tc-htb is coded it's NOT possible to reassign queues from one >> scheduling node to the other, this is a generic problem with this >> implementation, regardless of SR-IOV or PF. So even if we >> wanted to reassign queues only for PF's this wouldn't be possible. >> I feel like an example would help. So let's say I do this: >> >> tc qdisc replace dev ens785 root handle 1: htb offload >> tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000 >> tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000 >> tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000 >> tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000 >> tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000 >> >> 1: <-- root qdisc >> | >> 1:2 >> / \ >> / \ >> 1:3 1:4 >> | | >> | | >> 1:5 1:6 >> | | >> QID QID <---- here we'll have PFIFO qdiscs >> >> >> At this point I would have two additional queues in the system, and >> the kernel would enqueue packets to those new queues according to 'tc >> flower' configuration. > TBH I don't know what you mean by "reassign queues from one > scheduling node to the other", sorry I don't know this code well. > Neither the offload nor HTB itself. So using as an example parts of the drawing I made: Imagine you have a queue like this: pci/0000:4b:00.0/queue/103: type queue parent node_200 It has txq_id 103, that is assigned by hardware - it's uniquely id'd by it. Then I run commands like this: devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent vport_2 tx_share 100Mbps tx_max 500Mbps priority 5 devlink port function rate add pci/0000:4b:00.0/node_custom_2 parent node_custom_1 devlink port function rate add pci/0000:4b:00.0/node_custom_3 parent node_custom_3 And here I reassign the queue: devlink port function rate set pci/0000:4b:00.0/queue/103 parent node_custom_3 So now queue has completely different parent and the packets from that queue are scheduled using completely different parameters. > > My uneducated anticipation of how HTB offload would work is that > queue 0 of the NIC is a catch all for leftovers and all other queues > get assigned leaf nodes. So enabling an htb offload leaves all the existing queues in place, and in case the kernel can't classify a packet to one of the newly created queues, the driver is supposed to select one of the 'older' queues to do the job. (driver receives TC_HTB_LEAF_QUERY_QUEUE event) So basically after running tc qdisc replace dev ens785 root handle 1: htb offload you don't get any new queues yet, but if you create new classes: tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000 you'll get an TC_HTB_LEAF_ALLOC_QUEUE event in the driver, that means you're supposed to allocate new queue in the driver. This just doesn't work for our case. Also our scheduling tree is rather rigid. I can't just remove the whole subtree just because user decided to enable htb-offload. So as you can see in current ultimate devlink-rate implementation I'm exporting the whole tree, and just allow user to modify it. There are some constraints, like inability to remove nodes with queues, or any children really. > >> So theoretically we should create a new queue >> in a hardware and put it in a privileged position in the scheduling >> tree. And I would happily write it this >> way, but this is NOT what our customer want. He doesn't want any >> extra queues in the system, he just >> wants to make existing queues more privileged. And not just PF queues >> - he's mostly interested in VF queues. >> I'm not sure how to state use case more clearly. > The VF means controlling queue scheduling of another function > via the PF, right? Let's leave that out of the picture for now > so we don't have to worry about "architectural" concerns. Devlink works more on pci devices, but I guess you can call it PF still. >>>> So either I would have to invent a new offload type (?) for tc, or >>>> completely rewrite and >>>> probably break tc-htb that mellanox implemented. >>>> Also in our use case it's possible to create completely new >>>> branches from the root and >>>> reassigning queues there. This wouldn't be possible with the method >>>> you're proposing. >>>> >>>> So existing interface doesn't allow us to do what is required. >>> For some definition of "what is required" which was not really >>> disclosed clearly. Or I'm to slow to grasp. >> In most basic variant what we want is a way to make hardware queues >> more privileged, and modify hierarchy of nodes/queues freely. We >> don't want to create new queues, as required by tc-htb >> implementation. This is main reason why tc-htb and devlink-rate >> hybrid doesn't work for us. > Hm, when you say "privileged" do you mean higher quota or priority? Maybe the meaning wasn't clear, so queue privilegeness is determined by everything really: it's position in the tree, it's ancestors assigned parameters (tx_share, tx_max, priority, weight). And it's own assigned parameters. BR, Michał
Powered by blists - more mailing lists