Message-ID: <bc0affeb-1d2e-3e1f-bc3f-43fc47736674@mojatatu.com>
Date: Sun, 20 Feb 2022 13:30:24 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Alexander Lobakin <alobakin@...me>,
Paolo Abeni <pabeni@...hat.com>,
Talal Ahmad <talalahmad@...gle.com>,
Kevin Hao <haokexin@...il.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Kees Cook <keescook@...omium.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Antoine Tenart <atenart@...nel.org>,
Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid
selecting tx queue
On 2022-02-18 07:43, Tonghao Zhang wrote:
> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>>
> Hi Jamal
>
> The setup commands are shown below:
> NETDEV=eth0
> ip li set dev $NETDEV up
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
>
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
>
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 2.2.2.100 action skbedit queue_mapping hash-type skbhash 2 6
>
> tc qdisc add dev $NETDEV handle 1: root mq
>
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
>
>
> use iperf3 to generate packets:
> ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10
>
> we use skbedit to select tx queues 2 - 6
> # ethtool -S eth0 | grep -i [tr]x_queue_[0-9]_bytes
> rx_queue_0_bytes: 442
> rx_queue_1_bytes: 60966
> rx_queue_2_bytes: 10440203
> rx_queue_3_bytes: 6083863
> rx_queue_4_bytes: 3809726
> rx_queue_5_bytes: 3581460
> rx_queue_6_bytes: 5772099
> rx_queue_7_bytes: 148
> rx_queue_8_bytes: 368
> rx_queue_9_bytes: 383
> tx_queue_0_bytes: 42
> tx_queue_1_bytes: 0
> tx_queue_2_bytes: 11442586444
> tx_queue_3_bytes: 7383615334
> tx_queue_4_bytes: 3981365579
> tx_queue_5_bytes: 3983235051
> tx_queue_6_bytes: 6706236461
> tx_queue_7_bytes: 42
> tx_queue_8_bytes: 0
> tx_queue_9_bytes: 0
>
> tx queues 2-6 map to classids 1:3 - 1:7
> # tc -s class show dev eth0
> class mq 1:1 root leaf 2:
> Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:2 root leaf 8001:
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:3 root leaf 8002:
> Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:4 root leaf 8003:
> Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:5 root leaf 8004:
> Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:6 root leaf 8005:
> Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:7 root leaf 8006:
> Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:8 root
> Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:9 root
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:a root
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class tbf 8001:1 parent 8001:
>
> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> lended: 0 borrowed: 0 giants: 0
> tokens: 2000000 ctokens: 2000000
>
> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> lended: 0 borrowed: 0 giants: 0
> tokens: 1000000 ctokens: 1000000
>
Yes, this is a good example (which should have been in the commit
message of 0/2 or 2/2 - it would have avoided the long discussion).
The byte count doesn't map exactly between the DMA side and the
qdisc side; you probably had some additional experiments running
prior to installing the mq qdisc.
Not a big deal - it is close enough.
To Cong's comments earlier - I don't think you could have correctly
picked the queue in user space for this specific policy (hash-type
skbhash). The reason is that you depend on the skb hash computation,
which is based on things like the ephemeral src port of the iperf3
client - and that cannot be determined in user space.
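[Purely illustrative - I have not cross-checked the patch, but the
mapping presumably boils down to something along these lines, where
tx_base and tx_count stand in for the "2 6" arguments and
reciprocal_scale()/skb_get_hash()/skb_set_queue_mapping() are the
existing kernel helpers:

	/* sketch only: pick a txq in [tx_base, tx_base + tx_count)
	 * from the flow hash
	 */
	u32 hash = skb_get_hash(skb);	/* flow hash, computed in-kernel at xmit time */
	u16 txq = tx_base + reciprocal_scale(hash, tx_count);
	skb_set_queue_mapping(skb, txq);

The hash input (ephemeral ports etc) is only known once the packet is
built, hence no way to precompute the queue from user space.]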
> Good question. For TCP, we set the ixgbe ntuple feature off:
>   ethtool -K ixgbe-dev ntuple off
> so in the underlying driver the hw records this flow and its tx
> queue, and when traffic comes back to the pod the hw delivers it
> to the rx queue corresponding to that tx queue.
>
> the code path is:
> ixgbe_xmit_frame/ixgbe_xmit_frame_ring -> ixgbe_atr() ->
> ixgbe_fdir_add_signature_filter_82599
> ixgbe_fdir_add_signature_filter_82599() installs the rule for the
> incoming packets.
>
>> ex: who sets the skb->hash (skb->l4_hash, skb->sw_hash etc)
> for tcp:
> __tcp_transmit_skb -> skb_set_hash_from_sk
>
> for udp
> udp_sendmsg -> ip_make_skb -> __ip_append_data -> sock_alloc_send_pskb
> -> skb_set_owner_w
That's a different use case than what you are presenting here,
i.e. the k8s pod scenario is purely a forwarding use case.
But it doesn't matter tbh since your data shows reasonable results.
[I didn't dig into the code, but it is likely (based on your experimental
data) that both skb->l4_hash and skb->sw_hash will _not be set_
and so skb_get_hash() will compute the skb->hash from scratch.]
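[For reference, the fast path of skb_get_hash() is roughly this, going
from memory of include/linux/skbuff.h:

	static inline __u32 skb_get_hash(struct sk_buff *skb)
	{
		/* recompute only when neither l4_hash nor sw_hash is set */
		if (!skb->l4_hash && !skb->sw_hash)
			__skb_get_hash(skb);

		return skb->hash;
	}

so in the pure forwarding case it falls through to flow dissection.]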
>> I may be missing something on the cpuid one - seems high likelihood
>> of having the same flow on multiple queues (based on what
>> raw_smp_processor_id() returns, which i believe is not guaranteed to be
>> consistent). IOW, you could be sending packets out of order for the
>> same 5 tuple flow (because they end up in different queues).
> Yes, but consider one case: we pin one pod to one cpu, so all the
> processes of the pod use the same cpu, and then all packets from
> this pod will use the same tx queue.
To Cong's point - if you already knew the pinned-to cpuid then you could
just as easily set that queue map from user space?
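I.e. with the existing static skbedit variant, something like the
following (same device/address as your example, the queue index simply
chosen by whatever component does the cpu pinning):

   tc filter add dev $NETDEV egress protocol ip prio 1 \
      flower skip_hw src_ip 2.2.2.100 \
      action skbedit queue_mapping 6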
>> As for the classid variant - if these packets are already outside the
>> pod and into the host stack, is that field even valid?
> Yes, ipvlan, macvlan and other virtual netdevs don't clear this field.
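[For context - I assume that classid would normally be set via the
net_cls cgroup on the sending side, along the lines of (group name n0
and the 3:1 value are only examples):

   mkdir /sys/fs/cgroup/net_cls/n0
   echo 0x00030001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid   # tc classid 3:1
   echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks
]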
>>> Why do we want to do the balancing? Because we don't want to pin the
>>> packets from a Pod to one tx queue. (In k8s, pods are created and
>>> destroyed frequently, and the number of Pods > the number of tx queues.)
>>> Sharing the tx queues equally is more important.
>>>
>>
>> As long as the same flow is pinned to the same queue (see my comment
>> on cpuid).
>> Over a very long period what you describe may be true, but it also
>> seems to depend on many other variables.
> NETDEV=eth0
>
> ip li set dev $NETDEV up
>
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
>
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
>
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
>
> tc qdisc add dev $NETDEV handle 1: root mq
>
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
>
> pin iperf3 to one cpu:
> # mkdir -p /sys/fs/cgroup/cpuset/n0
> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
>
> # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
> tx_queue_0_bytes: 7180
> tx_queue_1_bytes: 418
> tx_queue_2_bytes: 3015
> tx_queue_3_bytes: 4824
> tx_queue_4_bytes: 3738
> tx_queue_5_bytes: 716102781 # before pinning iperf3 to cpu 4
> tx_queue_6_bytes: 17989642640 # after pinning iperf3 to cpu 4,
> skbedit uses this tx queue and no longer uses tx queue 5
> tx_queue_7_bytes: 4364
> tx_queue_8_bytes: 42
> tx_queue_9_bytes: 3030
>
>
> # tc -s class show dev eth0
> class mq 1:1 root leaf 2:
> Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:2 root leaf 8001:
> Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:3 root leaf 8002:
> Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:4 root leaf 8003:
> Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:5 root leaf 8004:
> Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:6 root leaf 8005:
> Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:7 root leaf 8006:
> Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
> backlog 0b 0p requeues 18
> class mq 1:8 root
> Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:9 root
> Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class mq 1:a root
> Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> class tbf 8001:1 parent 8001:
>
> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> lended: 0 borrowed: 0 giants: 0
> tokens: 2000000 ctokens: 2000000
>
> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> lended: 0 borrowed: 0 giants: 0
> tokens: 1000000 ctokens: 1000000
>
Yes, if you pin a flow/process to a cpu - this is expected. See my
earlier comment. You could argue that you are automating things, but
it is not as strong as the hash setup (and it will have to be
documented that it works only if you pin the processes doing network
i/o to cpus).
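(Presumably the cpuid policy amounts to something like

	/* sketch only: spread over [tx_base, tx_base + tx_count)
	 * based on the current cpu
	 */
	u16 txq = tx_base + (raw_smp_processor_id() % tx_count);
	skb_set_queue_mapping(skb, txq);

which is exactly why it only behaves when the tasks doing network i/o
are pinned.)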
Could you also post an example on the cgroups classid?
cheers,
jamal