Date:   Wed, 16 Feb 2022 18:39:11 -0500
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexander Lobakin <alobakin@...me>,
        Paolo Abeni <pabeni@...hat.com>,
        Talal Ahmad <talalahmad@...gle.com>,
        Kevin Hao <haokexin@...il.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Kees Cook <keescook@...omium.org>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Antoine Tenart <atenart@...nel.org>,
        Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid
 selecting tx queue

On 2022-02-16 08:36, Tonghao Zhang wrote:
> On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:


[...]
The mapping to hardware made sense. Sorry I missed it earlier.

>> Can you paste a more complete example of a sample setup on some egress
>> port including what the classifier would be looking at?
> Hi
> 
>    +----+      +----+      +----+     +----+
>    | P1 |      | P2 |      | PN |     | PM |
>    +----+      +----+      +----+     +----+
>      |           |           |           |
>      +-----------+-----------+-----------+
>                         |
>                         | clsact/skbedit
>                         |      MQ
>                         v
>      +-----------+-----------+-----------+
>      | q0        | q1        | qn        | qm
>      v           v           v           v
>    HTB/FQ      HTB/FQ  ...  FIFO        FIFO
> 

Below is still missing your MQ setup (if I understood your diagram
correctly). Can you please post that?
Are your classids essentially mapping to q0..qm?
Running tc -s class show after you push some traffic should also help.
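
For reference, something along these lines is what I am picturing for
the MQ side (purely illustrative - the queue counts, handles and rates
below are my assumptions, not from your setup):

  NETDEV=eth0
  # root mq qdisc: one class per hardware tx queue (1:1 -> queue 0, etc.)
  tc qdisc add dev $NETDEV root handle 1: mq
  # bulk queues 0-1 get HTB with a rate limit, as you describe for P2
  tc qdisc add dev $NETDEV parent 1:1 handle 10: htb default 1
  tc class add dev $NETDEV parent 10: classid 10:1 htb rate 1gbit
  tc qdisc add dev $NETDEV parent 1:2 handle 20: htb default 1
  tc class add dev $NETDEV parent 20: classid 20:1 htb rate 1gbit
  # latency-sensitive queues 6-7 get plain FIFO, as you describe for P1
  tc qdisc add dev $NETDEV parent 1:7 handle 70: pfifo
  tc qdisc add dev $NETDEV parent 1:8 handle 80: pfifo
  # then clsact + the skbedit filters you quote below
  tc qdisc add dev $NETDEV clsact

If that matches what you have, the per-class stats under the mq root
are what I am after.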

> NETDEV=eth0
> tc qdisc add dev $NETDEV clsact
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash
> n m
> 

Have you observed a nice distribution here?
For the s/w side, tc -s class show after you run some traffic should help;
for the h/w side, ethtool -S should show the per-queue counters.
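
Concretely (the ethtool stat names vary by driver, so the grep below is
just a guess at the pattern):

  # s/w side: per-queue qdisc/class counters under the mq root
  tc -s class show dev eth0
  tc -s qdisc show dev eth0
  # h/w side: per-queue tx counters exposed by the driver
  ethtool -S eth0 | grep -i tx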

IIUC, the hash over the IP header fields for src_ip 192.168.122.100
(plus dst IP/ports) is being distributed across queues n..m
[because either 192.168.122.100 is talking to many destination
IPs and/or ports?]
Is this still correct if packets are being forwarded, as opposed to
being sourced from the host?
E.g., who sets skb->hash (skb->l4_hash, skb->sw_hash, etc.) in that case?

> The packets from pod P1, whose IP is 192.168.122.100, will use tx queues n ~ m.
> P1 is the pod with latency-sensitive traffic, so P1 uses the FIFO qdisc.
> 
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash
> 0 1
> 
> The packets from pod P2, whose IP is 192.168.122.200, will use tx queues 0 ~ 1.
> P2 is the pod with bulk traffic, so P2 uses the HTB qdisc to
> limit its network rate, because we don't want P2 to use all the bandwidth and
> affect P1.
> 

Understood.

>> Your diagram was unclear how the load balancing was going to be
>> achieved using the qdiscs (or was it the hardware?).
> Firstly, in the clsact hook, we select one tx queue from qn to qm for P1,
> and use the qdisc of that tx queue, for example FIFO.
> In the underlying driver, because we set skb->queue_mapping in
> skbedit, the hw tx queue from qn to qm will be selected too.
> Anyway, in the clsact hook, we can use the skbedit queue_mapping to
> select both the software tx queue and the hw tx queue.
> 

Please share ethtool -S and tc -s class show output if you have this running somewhere..

> For doing the balancing, we can use the skbhash/cpuid/cgroup classid to
> select the tx queue from qn to qm for P1.
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n
> m
> tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw
> src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid
> n m
> 

The skbhash should work fine if you have good entropy (varying dst IP
and dst port, mostly; src IP/src port/protocol don't offer much entropy
unless you have a lot of pods on your system).
I.e., if it works correctly (forwarding vs host - see my question above)
then you should be able to pin a 5-tuple flow to a tx queue.
If you have a large number of flows/pods then you could potentially
get a nice distribution.
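
If you want to check that empirically, one cheap way (assuming the
client runs inside P1 and the peer address below is just a placeholder)
is to generate a bunch of distinct 5-tuples and re-read the counters:

  # peer side: iperf3 -s
  # 16 parallel TCP streams -> 16 distinct source ports, so skb->hash
  # should vary and spread the traffic across queues n..m
  iperf3 -c <peer-ip> -P 16 -t 30
  tc -s class show dev eth0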

I may be missing something on the cpuid one - there seems to be a high
likelihood of having the same flow on multiple queues (based on what
raw_smp_processor_id() returns, which I believe is not guaranteed to be
consistent across packets of a flow). IOW, you could be sending packets
out of order for the same 5-tuple flow (because they end up in different
queues).

As for the classid variant - if these packets are already outside the
pod and into the host stack, is that field even valid?
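
For context, I am assuming the classid here is the cgroup net_cls
classid that would have been configured on the pod's cgroup, something
like (cgroup v1; the group name and value are made up):

  mkdir /sys/fs/cgroup/net_cls/pod_p1
  # format is 0xAAAABBBB: major handle AAAA, minor handle BBBB
  echo 0x00100001 > /sys/fs/cgroup/net_cls/pod_p1/net_cls.classid
  # tasks in this cgroup tag their sockets with that classid
  echo $PID > /sys/fs/cgroup/net_cls/pod_p1/cgroup.procs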

> Why do we want to do the balancing? Because we don't want to pin the packets
> from a pod to one tx queue. (In k8s, pods are created and destroyed
> frequently, and the number of pods > the number of tx queues.)
> Sharing the tx queues equally is more important.
> 

That is fine as long as the same flow is pinned to the same queue (see my
comment on cpuid).
Over a very long period what you describe may be true, but it also
seems to depend on many other variables.
I think it would help to actually show some data on how true the above
statement is (for example, the creation/destruction rate of the pods),
or to collect data over a very long period.
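
If it helps, even something crude like periodically appending the class
counters to a file (the interval and path below are arbitrary) would show
how the distribution holds up as pods come and go:

  while true; do
          date >> /tmp/txq-dist.log
          tc -s class show dev eth0 >> /tmp/txq-dist.log
          sleep 300
  done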

cheers,
jamal
