Message-ID: <46cfd7bc-5242-0a4c-b710-48fc2e69007c@mojatatu.com>
Date:   Tue, 22 Feb 2022 06:44:34 -0500
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexander Lobakin <alobakin@...me>,
        Paolo Abeni <pabeni@...hat.com>,
        Talal Ahmad <talalahmad@...gle.com>,
        Kevin Hao <haokexin@...il.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Kees Cook <keescook@...omium.org>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Antoine Tenart <atenart@...nel.org>,
        Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid
 selecting tx queue

On 2022-02-20 20:43, Tonghao Zhang wrote:
> On Mon, Feb 21, 2022 at 2:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>>
>> On 2022-02-18 07:43, Tonghao Zhang wrote:
>>> On Thu, Feb 17, 2022 at 7:39 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>>>>
>>
>> That's a different use case than what you are presenting here,
>> i.e. the k8s pod scenario is purely a forwarding use case.
>> But it doesn't matter, to be honest, since your data shows reasonable results.
>>
>> [I didn't dig into the code, but it is likely (based on your
>> experimental data) that both skb->l4_hash and skb->sw_hash will
>> _not be set_, and so skb_get_hash() will compute skb->hash from scratch.]
> No. For TCP, for example, we have already set the hash in
> __tcp_transmit_skb(), which invokes skb_set_hash_from_sk(),
> so in skbedit, skb_get_hash() only reads skb->hash.

There is nothing TCP-specific in the forwarding case, and your use case
was forwarding. I understand the local-host TCP/UDP variant.
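To make the disagreement above concrete: the kernel caches the flow hash
on the skb, and skb_get_hash() only recomputes when neither the l4_hash
nor the sw_hash flag is set. Here is a rough Python model of that caching
logic (field names mirror the kernel's, but this is an illustration, not
kernel code):

```python
# Toy model of skb_get_hash(): return the cached value when l4_hash or
# sw_hash is set; otherwise compute a flow hash from scratch (the
# forwarding case Jamal describes).

class Skb:
    def __init__(self):
        self.hash = 0
        self.l4_hash = False
        self.sw_hash = False

def flow_hash(skb):
    # Stand-in for the kernel's flow-dissector hash computation.
    h = (hash("5-tuple") & 0xFFFFFFFF) or 1
    skb.hash = h
    skb.sw_hash = True
    return h

def skb_get_hash(skb):
    if skb.l4_hash or skb.sw_hash:
        return skb.hash          # cached: no recomputation
    return flow_hash(skb)        # forwarding case: compute from scratch

# Local TCP transmit sets the hash up front (cf. skb_set_hash_from_sk),
# so skb_get_hash() is a cheap read:
skb = Skb()
skb.hash, skb.l4_hash = 0xDEADBEEF, True
assert skb_get_hash(skb) == 0xDEADBEEF
```

In pure forwarding, no local socket ever set the hash, so the first
skb_get_hash() call takes the compute-from-scratch branch.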

>>>> I may be missing something on the cpuid one - there seems to be a high
>>>> likelihood of having the same flow on multiple queues (based on what
>>>> raw_smp_processor_id() returns, which I believe is not guaranteed to be
>>>> consistent). IOW, you could be sending packets out of order for the
>>>> same 5-tuple flow (because they end up in different queues).
>>> Yes, but consider one case: we pin one pod to one CPU, so all the
>>> processes of the pod will use the same CPU, and then all packets
>>> from this pod will use the same tx queue.
>>
>> To Cong's point - if you already knew the pinned-to cpuid then you could
>> just as easily set that queue map from user space?
> Yes, we can set it from user space. But if skbedit can learn the CPU
> the pod uses and select the tx queue automatically, doesn't that make
> things easier?

Yes, but you know the CPU - so Cong's point is valid. You knew the
CPU when you set up the cgroup for iperf by hand; you can use the
same hand to set the queue map via skbedit.

>>> ip li set dev $NETDEV up
>>>
>>> tc qdisc del dev $NETDEV clsact 2>/dev/null
>>> tc qdisc add dev $NETDEV clsact
>>>
>>> ip link add ipv1 link $NETDEV type ipvlan mode l2
>>> ip netns add n1
>>> ip link set ipv1 netns n1
>>>
>>> ip netns exec n1 ip link set ipv1 up
>>> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
>>>
>>> tc filter add dev $NETDEV egress protocol ip prio 1 \
>>> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type cpuid 2 6
>>>
>>> tc qdisc add dev $NETDEV handle 1: root mq
>>>
>>> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
>>> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
>>> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
>>>
>>> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
>>> tc qdisc add dev $NETDEV parent 1:3 pfifo
>>> tc qdisc add dev $NETDEV parent 1:4 pfifo
>>> tc qdisc add dev $NETDEV parent 1:5 pfifo
>>> tc qdisc add dev $NETDEV parent 1:6 pfifo
>>> tc qdisc add dev $NETDEV parent 1:7 pfifo
>>>
>>> pin iperf3 to one CPU:
>>> # mkdir -p /sys/fs/cgroup/cpuset/n0
>>> # echo 4 > /sys/fs/cgroup/cpuset/n0/cpuset.cpus
>>> # echo 0 > /sys/fs/cgroup/cpuset/n0/cpuset.mems
>>> # ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 1000 -P 10 -u -b 10G
>>> # echo $(pidof iperf3) > /sys/fs/cgroup/cpuset/n0/tasks
>>>
>>> # ethtool -S eth0 | grep -i tx_queue_[0-9]_bytes
>>>        tx_queue_0_bytes: 7180
>>>        tx_queue_1_bytes: 418
>>>        tx_queue_2_bytes: 3015
>>>        tx_queue_3_bytes: 4824
>>>        tx_queue_4_bytes: 3738
>>>        tx_queue_5_bytes: 716102781 # before pinning iperf3 to cpu 4
>>>        tx_queue_6_bytes: 17989642640 # after pinning iperf3 to cpu 4,
>>> skbedit uses this tx queue and no longer uses tx queue 5
>>>        tx_queue_7_bytes: 4364
>>>        tx_queue_8_bytes: 42
>>>        tx_queue_9_bytes: 3030
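For what it's worth, the counters above are consistent with one plausible
reading of the two skbedit arguments ("... hash-type cpuid 2 6") as a queue
base and a queue count, i.e. txq = base + cpu % count. This is my
reconstruction from the experimental data, not the patch's documented
semantics:

```python
def select_txq_cpuid(cpu: int, base: int, count: int) -> int:
    """Hypothetical model of the cpuid hash-type: the tx queue is
    derived from the running CPU, not from the flow, which is why a
    sender that migrates between CPUs can hop between queues.

    Reconstructed from the experiment above, where 'queue_mapping
    hash-type cpuid 2 6' plus pinning iperf3 to CPU 4 steered traffic
    to tx queue 6 (2 + 4 % 6)."""
    return base + cpu % count

print(select_txq_cpuid(4, 2, 6))  # -> 6, matching tx_queue_6 above
```

Under this model, any unpinned flow whose sender is rescheduled across
CPUs changes queues mid-flow, which is exactly the reordering concern
raised earlier in the thread.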
>>>
>>>
>>> # tc -s class show dev eth0
>>> class mq 1:1 root leaf 2:
>>>    Sent 9874 bytes 63 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:2 root leaf 8001:
>>>    Sent 418 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:3 root leaf 8002:
>>>    Sent 3015 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:4 root leaf 8003:
>>>    Sent 4824 bytes 8 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:5 root leaf 8004:
>>>    Sent 4074 bytes 19 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:6 root leaf 8005:
>>>    Sent 716102781 bytes 480624 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:7 root leaf 8006:
>>>    Sent 18157071781 bytes 12186100 pkt (dropped 0, overlimits 0 requeues 18)
>>>    backlog 0b 0p requeues 18
>>> class mq 1:8 root
>>>    Sent 4364 bytes 26 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:9 root
>>>    Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class mq 1:a root
>>>    Sent 3030 bytes 13 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>> class tbf 8001:1 parent 8001:
>>>
>>> class htb 2:1 root prio 0 rate 100Kbit ceil 100Kbit burst 1600b cburst 1600b
>>>    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>>    lended: 0 borrowed: 0 giants: 0
>>>    tokens: 2000000 ctokens: 2000000
>>>
>>> class htb 2:2 root prio 0 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
>>>    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>    backlog 0b 0p requeues 0
>>>    lended: 0 borrowed: 0 giants: 0
>>>    tokens: 1000000 ctokens: 1000000
>>>
>>
>> Yes, if you pin a flow/process to a CPU, this is expected. See my
>> earlier comment. You could argue that you are automating things, but
>> it is not as strong as the hash setup (and it will have to be documented
>> that it works only if you pin processes doing network I/O to CPUs).
> OK, it should be documented in iproute2, and we will document this in
> the commit message too.

I think this part is iffy. You could argue for it from an automation
point of view, but I don't see much else.

>> Could you also post an example on the cgroups classid?
> 
> The setup commands:
> NETDEV=eth0
> ip li set dev $NETDEV up
> 
> tc qdisc del dev $NETDEV clsact 2>/dev/null
> tc qdisc add dev $NETDEV clsact
> 
> ip link add ipv1 link $NETDEV type ipvlan mode l2
> ip netns add n1
> ip link set ipv1 netns n1
> 
> ip netns exec n1 ip link set ipv1 up
> ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
> 
> tc filter add dev $NETDEV egress protocol ip prio 1 \
> flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping hash-type
> classid 2 6
> 
> tc qdisc add dev $NETDEV handle 1: root mq
> 
> tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
> tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
> tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
> 
> tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
> tc qdisc add dev $NETDEV parent 1:3 pfifo
> tc qdisc add dev $NETDEV parent 1:4 pfifo
> tc qdisc add dev $NETDEV parent 1:5 pfifo
> tc qdisc add dev $NETDEV parent 1:6 pfifo
> tc qdisc add dev $NETDEV parent 1:7 pfifo
> 
> set up the classid:
> # mkdir -p /sys/fs/cgroup/net_cls/n0
> # echo 0x100001 > /sys/fs/cgroup/net_cls/n0/net_cls.classid
> # echo $(pidof iperf3) > /sys/fs/cgroup/net_cls/n0/tasks
> 
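As context for the value written above: a net_cls classid packs a tc
major:minor handle as (major << 16) | minor, so 0x100001 corresponds to
handle 10:1 (tc prints handles in hex). A small illustrative decoder:

```python
def decode_classid(classid: int) -> str:
    """net_cls.classid packs the tc handle as (major << 16) | minor."""
    major = classid >> 16
    minor = classid & 0xFFFF
    return f"{major:x}:{minor:x}"  # tc prints handles in hex

print(decode_classid(0x100001))  # -> 10:1
```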


I would say the same thing here as well. You know the classid, you manually
set it above, so you could have said:

src_ip 2.2.2.100 action skbedit queue_mapping 1

cheers,
jamal
