Message-ID: <7a6b7a74-82f5-53e7-07f4-2a995df9f349@mojatatu.com>
Date: Wed, 16 Feb 2022 18:39:11 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Alexander Lobakin <alobakin@...me>,
Paolo Abeni <pabeni@...hat.com>,
Talal Ahmad <talalahmad@...gle.com>,
Kevin Hao <haokexin@...il.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Kees Cook <keescook@...omium.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Antoine Tenart <atenart@...nel.org>,
Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid
selecting tx queue
On 2022-02-16 08:36, Tonghao Zhang wrote:
> On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
[...]
The mapping to hardware made sense. Sorry I missed it earlier.
>> Can you paste a more complete example of a sample setup on some egress
>> port including what the classifier would be looking at?
> Hi
>
> +----+ +----+ +----+ +----+
> | P1 | | P2 | | PN | | PM |
> +----+ +----+ +----+ +----+
> | | | |
> +-----------+-----------+-----------+
> |
> | clsact/skbedit
> | MQ
> v
> +-----------+-----------+-----------+
> | q0 | q1 | qn | qm
> v v v v
> HTB/FQ HTB/FQ ... FIFO FIFO
>
Below is still missing your MQ setup (if I understood your diagram
correctly). Can you please post that?
Are your classids essentially mapping to q0..m?
tc -s class show after you run some traffic should help.
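For reference, I would expect something roughly along these lines
(a hypothetical sketch; eth0, the handle numbers and the choice of
per-queue qdiscs are placeholders, not your actual config):

  # attach mq as root; it creates one class per hardware tx queue
  tc qdisc add dev eth0 root handle 100: mq
  # bulk queues q0/q1 get htb, latency-sensitive queues get pfifo
  tc qdisc add dev eth0 parent 100:1 handle 101: htb default 1
  tc qdisc add dev eth0 parent 100:2 handle 102: htb default 1
  tc qdisc add dev eth0 parent 100:3 handle 103: pfifo
  tc qdisc add dev eth0 parent 100:4 handle 104: pfifo
  # inspect the per-queue distribution
  tc -s class show dev eth0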
> NETDEV=eth0
> tc qdisc add dev $NETDEV clsact
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash
> n m
>
Have you observed a nice distribution here?
For the s/w side, tc -s class show after you run some traffic should help;
for the h/w side, ethtool -S (per-queue stats).
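Something like the following is what I have in mind (assuming your
driver exposes per-queue counters with names along the lines of
tx_queue_N_*, which varies by driver):

  tc -s class show dev eth0
  ethtool -S eth0 | grep -i tx_queue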
IIUC, the hash of the IP header with src_ip 192.168.122.100
(and dst IP/ports)
is being distributed across queues n..m
[because either 192.168.122.100 is talking to many destination
IPs and/or ports?]
Is this correct if packets are being forwarded, as opposed to
being sourced from the host?
e.g. who sets skb->hash (skb->l4_hash, skb->sw_hash etc)?
> The packets from pod P1, whose IP is 192.168.122.100, will use tx queues n..m.
> P1 is the pod with latency-sensitive traffic, so P1 uses the fifo qdisc.
>
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash
> 0 1
>
> The packets from pod P2, whose IP is 192.168.122.200, will use tx queues 0..1.
> P2 is the pod with bulk traffic, so P2 uses the htb qdisc to
> limit its network rate, because we don't want P2 to use all the
> bandwidth and affect P1.
>
Understood.
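(For completeness, I'd assume the htb side on q0/q1 then looks roughly
like the following; the rates, handles and classids here are placeholders,
following the sketch above:)

  tc class add dev eth0 parent 101: classid 101:1 htb rate 1gbit ceil 1gbit
  tc class add dev eth0 parent 102: classid 102:1 htb rate 1gbit ceil 1gbit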
>> Your diagram was unclear how the load balancing was going to be
>> achieved using the qdiscs (or was it the hardware?).
> Firstly, in the clsact hook, we select one tx queue from qn to qm for P1,
> and use the qdisc of this tx queue, for example FIFO.
> In the underlay driver, because we set skb->queue_mapping in
> skbedit, the hw tx queue from qn to qm will be selected too.
> Anyway, in the clsact hook, we can use skbedit queue_mapping to
> select both the software tx queue and the hw tx queue.
>
ethtool -S and tc -s class show if you have this running somewhere..
> For load balancing, we can use the skbhash/cpuid/cgroup classid to
> select a tx queue from qn to qm for P1.
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n
> m
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid
> n m
>
The skbhash should work fine if you have good entropy (varying dst IP
and dst port mostly; the src IP/src port/protocol don't offer much entropy
unless you have a lot of pods on your system).
I.e. if it works correctly (forwarding vs host - see my question above)
then you should be able to pin a 5-tuple flow to a tx queue.
If you have a large number of flows/pods then you could potentially
get a nice distribution.
I may be missing something on the cpuid one - there seems to be a high
likelihood of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which I believe is not guaranteed to be
consistent). IOW, you could be sending packets out of order for the
same 5-tuple flow (because they end up in different queues).
As for the classid variant - if these packets are already outside the
pod and into the host stack, is that field even valid?
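(I'm assuming the classid here is the net_cls cgroup classid, i.e. what
one would set on cgroup v1 via something like the following - correct me
if the patch reads it from somewhere else; pod1 and $POD1_PID are
placeholders:)

  mkdir /sys/fs/cgroup/net_cls/pod1
  echo 0x00100001 > /sys/fs/cgroup/net_cls/pod1/net_cls.classid
  echo $POD1_PID > /sys/fs/cgroup/net_cls/pod1/cgroup.procs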
> Why do we want to do the balancing? Because we don't want to pin the
> packets from a pod to one tx queue (in k8s the pods are created and
> destroyed frequently, and the number of pods > the number of tx queues).
> Sharing the tx queues equally is more important.
>
As long as the same flow is pinned to the same queue (see my comment
on cpuid).
Over a very long period what you describe may be true, but it also
seems to depend on many other variables.
I think it would help to actually show some data on how true the above
statement is (for example, the creation/destruction rate of the pods),
or to collect data over a very long period.
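Even a crude sampling loop along these lines would do (a sketch; the
interval and log file name are arbitrary):

  NETDEV=eth0
  while true; do
      date
      tc -s class show dev $NETDEV              # per-queue qdisc/class counters
      ethtool -S $NETDEV | grep -i tx_queue     # per-queue hw counters, if exposed
      sleep 60
  done >> mq-distribution.log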
cheers,
jamal