Message-ID: <46cfd7bc-5242-0a4c-b710-48fc2e69007c@mojatatu.com>
Date: Tue, 22 Feb 2022 06:44:34 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Alexander Lobakin <alobakin@...me>,
Paolo Abeni <pabeni@...hat.com>,
Talal Ahmad <talalahmad@...gle.com>,
Kevin Hao <haokexin@...il.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Kees Cook <keescook@...omium.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Antoine Tenart <atenart@...nel.org>,
Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid selecting tx queue
On 2022-02-20 20:43, Tonghao Zhang wrote:
> On Mon, Feb 21, 2022 at 2:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>>
>> On 2022-02-18 07:43, Tonghao Zhang wrote:
>>> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>>>>
>>
>>
>> Thats a different use case than what you are presenting here.
>> i.e the k8s pod scenario is purely a forwarding use case.
>> But it doesnt matter tbh since your data shows reasonable results.
>>
>> [i didnt dig into the code but it is likely (based on your experimental
>> data) that both skb->l4_hash and skb->sw_hash will _not be set_
>> and so skb_get_hash() will compute the skb->hash from scratch.]
> No, for example, for tcp, we have set hash in __tcp_transmit_skb which
> invokes the skb_set_hash_from_sk
> so in skbedit, skb_get_hash only gets skb->hash.
There is no TCP involved in the forwarding case, and your use case was
forwarding. I understand the local host tcp/udp variant.
>>>> I may be missing something on the cpuid one - seems high likelihood
>>>> of having the same flow on multiple queues (based on what
>>>> raw_smp_processor_id() returns, which i believe is not guaranteed to be
>>>> consistent). IOW, you could be sending packets out of order for the
>>>> same 5 tuple flow (because they end up in different queues).
>>> Yes, but think about one case, we pin one pod to one cpu, so all the
>>> processes of the pod will
>>> use the same cpu. then all packets from this pod will use the same tx queue.
>>
>> To Cong's point - if you already knew the pinned-to cpuid then you could
>> just as easily set that queue map from user space?
> Yes, we can set it from user space. If we can know the cpu which the
> pod uses, and select the one tx queue
> automatically in skbedit, that can make the things easy?
Yes, but you know the CPU - so Cong's point is valid. You knew the
CPU when you set up the cgroup for iperf by hand; you can use the
same hand to set the skbedit queue map.
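To make that concrete, here is a minimal sketch of doing by hand what the
cpuid mode would do. The helper name queue_for_cpu is hypothetical, and the
mapping min + cpu % (max - min + 1) is an assumption: it is consistent with
the numbers quoted below (CPU 4, range "2 6", traffic on tx queue 6), but it
is not spelled out in this thread. The tc command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: map a pinned CPU id into the queue range [min, max].
# Assumed formula (not confirmed in this thread): min + cpu % (max - min + 1).
queue_for_cpu() {
    cpu=$1; min=$2; max=$3
    echo $(( min + cpu % (max - min + 1) ))
}

NETDEV=${NETDEV:-eth0}
PINNED_CPU=4   # the CPU that was written to cpuset.cpus by hand
QUEUE=$(queue_for_cpu "$PINNED_CPU" 2 6)

# Dry run: print the static skbedit rule instead of installing it.
echo "tc filter add dev $NETDEV egress protocol ip prio 1" \
     "flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping $QUEUE"
```

If the pod is later pinned to a different CPU, the same script would have to
be rerun with the new PINNED_CPU, which is the automation argument in a
nutshell.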
>>> ip li set dev $NETDEV up
>>>
>>> tc qdisc del dev $NETDEV clsact 2>/dev/null
>>> tc qdisc add dev $NETDEV clsact
>>>
>>> ip link add ipv1 link $NETDEV type ipvlan mode l2
>>> ip netns add n1
>>> ip link set ipv1 netns n1
>>>
>>> ip netns exec n1 ip link set ipv1 up
>>> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>>>
>>> tc filter add dev $NETDEV egress protocol ip prio 1 \
>>> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
>>>
>>> tc qdisc add dev $NETDEV handle 1: root mq
>>>
>>> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
>>> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
>>> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>>>
>>> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
>>> tc qdisc add dev $NETDEV parent 1:3 pfifo
>>> tc qdisc add dev $NETDEV parent 1:4 pfifo
>>> tc qdisc add dev $NETDEV parent 1:5 pfifo
>>> tc qdisc add dev $NETDEV parent 1:6 pfifo
>>> tc qdisc add dev $NETDEV parent 1:7 pfifo
>>>
>>> set the iperf3 to one cpu
>>> # mkdir -p /sys/fs/cgroup/cpuset/n0
>>> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
>>> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
>>> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
>>> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
>>>
>>> # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
>>> tx_queue_0_bytes: 7180
>>> tx_queue_1_bytes: 418
>>> tx_queue_2_bytes: 3015
>>> tx_queue_3_bytes: 4824
>>> tx_queue_4_bytes: 3738
>>> tx_queue_5_bytes: 716102781 # before setting iperf3 to cpu 4
>>> tx_queue_6_bytes: 17989642640 # after setting iperf3 to cpu 4,
>>> skbedit use this tx queue, and don't use tx queue 5
>>> tx_queue_7_bytes: 4364
>>> tx_queue_8_bytes: 42
>>> tx_queue_9_bytes: 3030
>>>
>>>
>>> # tc -s class show dev eth0
>>> class mq 1:1 root leaf 2:
>>> Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:2 root leaf 8001:
>>> Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:3 root leaf 8002:
>>> Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:4 root leaf 8003:
>>> Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:5 root leaf 8004:
>>> Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:6 root leaf 8005:
>>> Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:7 root leaf 8006:
>>> Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
>>> backlog 0b 0p requeues 18
>>> class mq 1:8 root
>>> Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:9 root
>>> Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class mq 1:a root
>>> Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> class tbf 8001:1 parent 8001:
>>>
>>> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> lended: 0 borrowed: 0 giants: 0
>>> tokens: 2000000 ctokens: 2000000
>>>
>>> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
>>> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>> backlog 0b 0p requeues 0
>>> lended: 0 borrowed: 0 giants: 0
>>> tokens: 1000000 ctokens: 1000000
>>>
>>
>> Yes, if you pin a flow/process to a cpu - this is expected. See my
>> earlier comment. You could argue that you are automating things but
>> it is not as a strong as the hash setup (and will have to be documented
>> that it works only if you pin processes doing network i/o to cpus).
> Ok, it should be documented in iproute2. and we will doc this in
> commit message too.
I think this part is iffy. You could argue it from an automation point
of view, but I don't see much else.
>> Could you also post an example on the cgroups classid?
>
> The setup commands:
> NETDEV=eth0
> ip li set dev $NETDEV up
>
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
>
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
>
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type
> classid 2 6
>
> tc qdisc add dev $NETDEV handle 1: root mq
>
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
>
> setup classid
> # mkdir -p /sys/fs/cgroup/net_cls/n0
> # echo 0x100001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid
> # echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks
>
I would say the same thing here as well. You know the classid, you manually
set it above, so you could have said:
src_ip 2.2.2.100 action skbedit queue_mapping 1
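As a rough sketch of that manual route: the value written to net_cls.classid
has the form 0xAAAABBBB (major in the upper 16 bits, minor in the lower 16),
so it can be decoded and paired with a fixed queue by hand. The helper name
classid_to_handle is hypothetical, and queue 1 is just the example value from
the line above. The tc command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: decode a net_cls classid (0xAAAABBBB) into the
# tc handle form major:minor, both printed in hex.
classid_to_handle() {
    id=$(( $1 ))
    printf '%x:%x\n' $(( id >> 16 )) $(( id & 0xffff ))
}

NETDEV=${NETDEV:-eth0}
CLASSID=0x100001   # the value echoed into net_cls.classid by hand
HANDLE=$(classid_to_handle "$CLASSID")
QUEUE=1            # chosen by the operator, as in the example above

echo "classid $CLASSID is tc handle $HANDLE"
# Dry run: print the static skbedit rule instead of installing it.
echo "tc filter add dev $NETDEV egress protocol ip prio 1" \
     "flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping $QUEUE"
```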
cheers,
jamal