Date: Wed, 16 Feb 2022 18:39:11 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
	Jonathan Lemon <jonathan.lemon@...il.com>, Eric Dumazet <edumazet@...gle.com>,
	Alexander Lobakin <alobakin@...me>, Paolo Abeni <pabeni@...hat.com>,
	Talal Ahmad <talalahmad@...gle.com>, Kevin Hao <haokexin@...il.com>,
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Kees Cook <keescook@...omium.org>,
	Kumar Kartikeya Dwivedi <memxor@...il.com>, Antoine Tenart <atenart@...nel.org>,
	Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue

On 2022-02-16 08:36, Tonghao Zhang wrote:
> On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:

[...]

The mapping to hardware made sense. Sorry I missed it earlier.

>> Can you paste a more complete example of a sample setup on some egress
>> port, including what the classifier would be looking at?
> Hi
>
>    +----+      +----+      +----+      +----+
>    | P1 |      | P2 |      | PN |      | PM |
>    +----+      +----+      +----+      +----+
>      |           |           |           |
>      +-----------+-----------+-----------+
>                        |
>                        | clsact/skbedit
>                        |       MQ
>                        v
>      +-----------+-----------+-----------+
>      | q0        | q1        | qn        | qm
>      v           v           v           v
>    HTB/FQ      HTB/FQ       ...        FIFO  FIFO
>

Below is still missing your MQ setup (if I understood your diagram
correctly). Can you please post that? Are your classids essentially
mapping to q0..qm?

"tc -s class show" after you run some traffic should help.

> NETDEV=eth0
> tc qdisc add dev $NETDEV clsact
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>    src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash n m
>

Have you observed a nice distribution here?
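[Editor's note: the thread never posts the MQ setup Jamal asks for. A
hypothetical sketch of what it might look like is below — the device
name, handles, rates, and the choice of 10 tx queues (bulk on queues
0-1, latency-sensitive FIFOs on queues 6-9, i.e. n..m) are all
illustrative assumptions, not taken from the thread.]

```shell
NETDEV=eth0

# Attach MQ as the root qdisc; it exposes one class per hardware tx
# queue (class 1:1 maps to tx queue 0, 1:2 to queue 1, and so on).
tc qdisc add dev $NETDEV root handle 1: mq

# Bulk queues 0-1 (classes 1:1 and 1:2): rate-limit with HTB.
# The 100mbit rate is made up for illustration.
tc qdisc add dev $NETDEV parent 1:1 handle 10: htb default 1
tc class add dev $NETDEV parent 10: classid 10:1 htb rate 100mbit
tc qdisc add dev $NETDEV parent 1:2 handle 20: htb default 1
tc class add dev $NETDEV parent 20: classid 20:1 htb rate 100mbit

# Latency-sensitive queues 6-9 (classes 1:7 to 1:10): plain FIFOs.
for i in 7 8 9 10; do
    tc qdisc add dev $NETDEV parent 1:$i pfifo
done

# clsact hook for the skbedit queue_mapping filters discussed below.
tc qdisc add dev $NETDEV clsact
```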
For the s/w side, "tc -s class show" after you run some traffic should
help; for the h/w side, "ethtool -S".

IIUC, the hash of the IP header with src_ip 192.168.122.100 (and dst IP)
is being distributed across queues n..m [because 192.168.122.100 is
talking to many destination IPs and/or ports?].

Is this correct if packets are being forwarded, as opposed to being
sourced from the host? E.g., who sets skb->hash (skb->l4_hash,
skb->sw_hash, etc.)?

> The packets from pod P1, whose IP is 192.168.122.100, will use tx
> queues n..m. P1 is the pod with latency-sensitive traffic, so P1 uses
> the FIFO qdisc.
>
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>    src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash 0 1
>
> The packets from pod P2, whose IP is 192.168.122.200, will use tx
> queues 0..1. P2 is the pod with bulk traffic, so P2 uses the HTB qdisc
> to limit its network rate, because we don't want P2 to use all the
> bandwidth and affect P1.
>

Understood.

>> Your diagram was unclear how the load balancing was going to be
>> achieved using the qdiscs (or was it the hardware?).
> Firstly, in the clsact hook, we select one tx queue from qn to qm for
> P1, and use the qdisc of this tx queue, for example FIFO.
> In the underlay driver, because we set skb->queue_mapping in skbedit,
> the hw tx queue from qn to qm will be selected too.
> Anyway, in the clsact hook, we can use skbedit queue_mapping to select
> both the software tx queue and the hw tx queue.

"ethtool -S" and "tc -s class" output would help if you have this
running somewhere.

> For doing the balancing, we can use the skbhash/cpuid/cgroup classid
> to select a tx queue from qn to qm for P1.
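[Editor's note: the per-queue counters Jamal keeps asking for could be
pulled with something like the following. The grep pattern is a guess —
per-tx-queue stat names in "ethtool -S" output are driver-specific.]

```shell
NETDEV=eth0

# Software side: per-MQ-class packet/byte counters, one class per queue.
tc -s class show dev $NETDEV

# Hardware side: per-tx-queue counters; naming varies by driver, so the
# pattern below may need adjusting for your NIC.
ethtool -S $NETDEV | grep -i 'tx.*queue'
```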
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>    src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n m
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
>    src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid n m
>

The skbhash should work fine if you have good entropy (varying dst IP
and dst port, mostly; the src IP/src port/protocol don't offer much
entropy unless you have a lot of pods on your system). I.e., if it works
correctly (forwarding vs host - see my question above), then you should
be able to pin a 5-tuple flow to a tx queue. If you have a large number
of flows/pods, then you could potentially get a nice distribution.

I may be missing something on the cpuid one - it seems there is a high
likelihood of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which I believe is not guaranteed to be
consistent). IOW, you could be sending packets out of order for the same
5-tuple flow (because they end up in different queues).

As for the classid variant - if these packets are already outside the
pod and into the host stack, is that field even valid?

> Why do we want to do the balancing? Because we don't want to pin the
> packets from a pod to one tx queue. (In k8s the pods are created or
> destroyed frequently, and the number of pods > the tx queue number.)
> Sharing the tx queues equally is more important.

As long as the same flow is pinned to the same queue (see my comment on
cpuid). Over a very long period what you describe may be true, but it
also seems to depend on many other variables. I think it would help to
actually show some data on how true the above statement is (for example,
the creation/destruction rate of the pods). Or collect data over a very
long period.

cheers,
jamal
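[Editor's note: the flow-pinning property Jamal describes comes from
offsetting into the configured queue range by hash modulo range size.
The toy sketch below illustrates this; cksum is only a stand-in for the
kernel's skb->hash, and the n=6, m=9 range is arbitrary.]

```shell
# Map a flow hash into tx queues n..m:
#   queue = n + (hash % (m - n + 1))
n=6; m=9

flow_queue() {
    # Stand-in hash: cksum of the 5-tuple string (NOT the kernel's
    # flow hash; just enough to show the pinning property).
    local h
    h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( n + h % (m - n + 1) ))
}

q1=$(flow_queue "192.168.122.100 10.0.0.1 tcp 55000 443")
q2=$(flow_queue "192.168.122.100 10.0.0.1 tcp 55000 443")

# The same 5-tuple always lands on the same queue in [n, m], so packets
# of one flow stay in order; different flows spread across the range.
echo "$q1 $q2"
```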