Date:   Wed, 16 Feb 2022 21:36:35 +0800
From:   Tonghao Zhang <xiangxia.m.yue@...il.com>
To:     Jamal Hadi Salim <jhs@...atatu.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexander Lobakin <alobakin@...me>,
        Paolo Abeni <pabeni@...hat.com>,
        Talal Ahmad <talalahmad@...gle.com>,
        Kevin Hao <haokexin@...il.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Kees Cook <keescook@...omium.org>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Antoine Tenart <atenart@...nel.org>,
        Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v8 2/2] net: sched: support hash/classid/cpuid
 selecting tx queue

On Wed, Feb 16, 2022 at 8:17 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On 2022-02-14 20:40, Tonghao Zhang wrote:
> > On Tue, Feb 15, 2022 at 8:22 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >>
> >> On 2022-01-26 09:32, xiangxia.m.yue@...il.com wrote:
> >>> From: Tonghao Zhang <xiangxia.m.yue@...il.com>
> >>>
>
> >
> >> So while I don't agree that ebpf is the solution for reasons I mentioned
> >> earlier - after looking at the details I think I am confused by this change
> >> and maybe I didn't fully understand the use case.
> >>
> >> What is the driver that would work with this?
> >> You said earlier packets are coming out of some pods and then heading to
> >> the wire and you are looking to balance and isolate between bulk and
> >> latency sensitive traffic - how is any of this metadata useful for
> >> that? skb->priority seems more natural for that.
>
> Quote from your other email:
>
>  > In our production env, we use the ixgbe, i40e and mlx nic which
>  > support multi tx queue.
>
> Please bear with me.
> The part I was wondering about is how these drivers would use queue
> mapping to select their hardware queues.
Hi
For mlx5e, mlx5e_xmit() uses skb_get_queue_mapping() to pick the tx queue.
For ixgbe, __ixgbe_xmit_frame() uses skb_get_queue_mapping() to pick the
tx queue.
For i40e, i40e_lan_xmit_frame() uses skb->queue_mapping.

We can set skb->queue_mapping in skbedit.
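For reference, stock skbedit can already pin traffic to a single queue;
the hash-type option in this patch extends that to a range. A minimal
sketch (the IP, prio, and queue index here are illustrative):

# pin matching traffic to tx queue 3 (all values are examples only)
tc filter add dev eth0 egress protocol ip prio 2 flower skip_hw \
    src_ip 192.168.122.50 action skbedit queue_mapping 3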
> Maybe you meant the software queue (in the qdiscs?) - But even then
Yes. More importantly, we care about the software tx queues, which may
use the FIFO or HTB/FQ qdiscs.
> how does queue mapping select which queue is to be used?
We select the tx queue in clsact, so netdev_core_pick_tx() will not be
invoked to pick the tx queue, and we can then use the qdisc attached to
that tx queue to apply tc policy (FIFO/FQ/HTB enqueue/dequeue ...).
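To confirm which per-queue qdisc the packets actually hit, the stock
per-qdisc statistics can be watched while traffic flows:

# sent/backlog counters of the qdisc on the selected tx queue should grow
tc -s qdisc show dev eth0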

> > Hi
> > I try to explain. There are two tx-queue ranges, e.g. A(Q0-Qn) and
> > B(Qn+1-Qm). A is used for latency-sensitive traffic. B is used for
> > bulk traffic. A may be shared by Pods/Containers whose key requirement
> > is low latency. B may be shared by Pods/Containers whose key
> > requirement is high throughput. So we can do the balancing in range A
> > for latency-sensitive traffic.
>
> So far makes sense. I am not sure if you get better performance but
> that's unrelated to this discussion. Just trying to understand your
> setup first in order to understand the use case. IIUC:
> You have packets coming out of the pods and hitting the host stack
> where you are applying these rules on the egress qdisc of one of these
> ixgbe, i40e and mlx nics, correct?
> And that egress qdisc then ends up selecting a queue based on queue
> mapping?
>
> Can you paste a more complete example of a sample setup on some egress
> port including what the classifier would be looking at?
Hi

  +----+      +----+      +----+     +----+
  | P1 |      | P2 |      | PN |     | PM |
  +----+      +----+      +----+     +----+
    |           |           |           |
    +-----------+-----------+-----------+
                       |
                       | clsact/skbedit
                       |      MQ
                       v
    +-----------+-----------+-----------+
    | q0        | q1        | qn        | qm
    v           v           v           v
  HTB/FQ      HTB/FQ  ...  FIFO        FIFO

NETDEV=eth0
tc qdisc add dev $NETDEV clsact
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type skbhash n m

The packets from pod P1, whose IP is 192.168.122.100, will use tx queues
n ~ m. P1 is the pod with latency-sensitive traffic, so P1 uses the FIFO
qdisc.
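The HTB/FQ and FIFO qdiscs in the diagram are attached per tx queue under
an mq root. A minimal sketch, where the handles and the four-queue layout
are illustrative and mq class 100:N maps to tx queue N-1:

tc qdisc add dev $NETDEV root handle 100: mq
tc qdisc add dev $NETDEV parent 100:1 handle 101: htb default 1  # q0: bulk
tc qdisc add dev $NETDEV parent 100:2 handle 102: htb default 1  # q1: bulk
tc qdisc add dev $NETDEV parent 100:3 pfifo                      # qn: latency
tc qdisc add dev $NETDEV parent 100:4 pfifo                      # qm: latency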

tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.200 action skbedit queue_mapping hash-type skbhash 0 1

The packets from pod P2, whose IP is 192.168.122.200, will use tx queues
0 ~ 1. P2 is the pod with bulk traffic, so P2 uses the HTB qdisc to limit
its network rate, because we don't want P2 to use all the bandwidth and
affect P1.
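A minimal sketch of that rate limit, reusing the HTB handles from the mq
sketch above (the rates are illustrative):

tc class add dev $NETDEV parent 101: classid 101:1 htb rate 1gbit ceil 1gbit
tc class add dev $NETDEV parent 102: classid 102:1 htb rate 1gbit ceil 1gbit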

> Your diagram was unclear how the load balancing was going to be
> achieved using the qdiscs (or was it the hardware?).
Firstly, in the clsact hook, we select one tx queue from qn to qm for P1,
and the qdisc of that tx queue, for example FIFO, is used. In the
underlying driver, because we set skb->queue_mapping in skbedit, the
hardware tx queue from qn to qm will be selected too. In any case, in the
clsact hook, we can use skbedit queue_mapping to select both the software
tx queue and the hardware tx queue.

For doing the balancing, we can use the skb hash, CPU id, or cgroup
classid to select a tx queue from qn to qm for P1:

tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type cpuid n m
tc filter add dev $NETDEV egress protocol ip prio 1 flower skip_hw \
    src_ip 192.168.122.100 action skbedit queue_mapping hash-type classid n m
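For the classid hash-type, the per-pod classid would come from the
net_cls cgroup, e.g. (the cgroup path, classid value, and $POD_PID are
illustrative):

mkdir /sys/fs/cgroup/net_cls/pod_p1
echo 0x00100001 > /sys/fs/cgroup/net_cls/pod_p1/net_cls.classid
echo $POD_PID > /sys/fs/cgroup/net_cls/pod_p1/cgroup.procs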

Why do we want to do the balancing? Because we don't want to pin the
packets from a Pod to one tx queue (in k8s, Pods are created and
destroyed frequently, and the number of Pods > the number of tx queues).
Sharing the tx queues equally is more important.
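For reference, the number of tx queues the NIC exposes can be checked
with:

ls /sys/class/net/$NETDEV/queues/ | grep -c '^tx-'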

> > So we can use skb->hash, the CPU id, or the classid to classify the
> > packets into range A or B. The balance policies serve different use
> > cases.
> > For skb->hash, the packets from Pods/Containers will share the range.
> > Note that one Pod/Container may use multiple TCP/UDP flows; those
> > flows share the tx queue range.
> > For the CPU id, while a Pod/Container uses multiple flows, a pod
> > pinned to one CPU will use one tx queue in range A or B.
> > For the classid, a Pod may contain multiple containers.
> >
> > skb->priority may be used by applications; we can't require
> > application developers to change it.
>
> It can also be set by skbedit.
> Note also: Other than the user specifying via setsockopt and skbedit,
> DSCP/TOS/COS are all translated into skb->priority. Most of those
> L3/L2 fields are intended to map to either bulk or latency sensitive
> traffic.
> More importantly:
> From the s/w level - most if not _all_ classful qdiscs look at
> skb->priority to decide where to enqueue.
> From the h/w level - skb->priority is typically mapped to the hardware
> qos level (for example 802.1q).
> In fact skb->priority could be translated by the qdisc layer into a
> classid if you set the 32-bit value to be the major:minor number of
> a specific configured classid.
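> A minimal sketch of that (handles illustrative): with an HTB class 1:10
> configured, something like
>
>   tc filter add dev eth0 egress ... action skbedit priority 1:10
>
> lets HTB enqueue those packets straight into class 1:10, with no filter
> configured on the HTB qdisc itself.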
>
> cheers,
> jamal



-- 
Best regards, Tonghao
