lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iLdP061LMUN-gRA8z4=YgMpbxTt7=3_Ny9ZWfKHTA2cpg@mail.gmail.com>
Date:   Mon, 20 Dec 2021 05:57:55 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Tonghao Zhang <xiangxia.m.yue@...il.com>
Cc:     netdev <netdev@...r.kernel.org>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Alexander Lobakin <alobakin@...me>,
        Paolo Abeni <pabeni@...hat.com>,
        Talal Ahmad <talalahmad@...gle.com>,
        Kevin Hao <haokexin@...il.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Kees Cook <keescook@...omium.org>,
        Kumar Kartikeya Dwivedi <memxor@...il.com>,
        Antoine Tenart <atenart@...nel.org>,
        Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v5 1/2] net: sched: use queue_mapping to pick tx queue

On Mon, Dec 20, 2021 at 4:38 AM <xiangxia.m.yue@...il.com> wrote:
>
> From: Tonghao Zhang <xiangxia.m.yue@...il.com>
>
> This patch fixes issue:
> * If we install tc filters with act_skbedit in clsact hook.
>   It doesn't work, because netdev_core_pick_tx() overwrites
>   queue_mapping.
>
>   $ tc filter ... action skbedit queue_mapping 1
>
> And this patch is useful:
> * We can use FQ + EDT to implement efficient policies. Tx queues
>   are picked by xps, ndo_select_queue of netdev driver, or skb hash
>   in netdev_core_pick_tx(). In fact, the netdev driver, and skb
>   hash are _not_ under control. xps uses the CPUs map to select Tx
>   queues, but we can't figure out which task_struct of pod/containter
>   running on this cpu in most case. We can use clsact filters to classify
>   one pod/container traffic to one Tx queue. Why ?
>
>   In containter networking environment, there are two kinds of pod/
>   containter/net-namespace. One kind (e.g. P1, P2), the high throughput
>   is key in these applications. But avoid running out of network resource,
>   the outbound traffic of these pods is limited, using or sharing one
>   dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
>   (e.g. Pn), the low latency of data access is key. And the traffic is not
>   limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
>   This choice provides two benefits. First, contention on the HTB/FQ Qdisc
>   lock is significantly reduced since fewer CPUs contend for the same queue.
>   More importantly, Qdisc contention can be eliminated completely if each
>   CPU has its own FIFO Qdisc for the second kind of pods.
>
>   There must be a mechanism in place to support classifying traffic based on
>   pods/container to different Tx queues. Note that clsact is outside of Qdisc
>   while Qdisc can run a classifier to select a sub-queue under the lock.
>
>   In general recording the decision in the skb seems a little heavy handed.
>   This patch introduces a per-CPU variable, suggested by Eric.
>
>   The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
>   - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
>     is set in qdisc->enqueue() though tx queue has been selected in
>     netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
>     firstly in __dev_queue_xmit(), is useful:
>   - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
>     in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
>     For example, eth0, macvlan in pod, which root Qdisc install skbedit
>     queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
>     eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
>     because there is no filters in clsact or tx Qdisc of this netdev.
>     Same action taked in eth0, ixgbe in Host.
>   - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
>     in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
>     in __dev_queue_xmit when processing next packets.
>
>   +----+      +----+      +----+
>   | P1 |      | P2 |      | Pn |
>   +----+      +----+      +----+
>     |           |           |
>     +-----------+-----------+
>                 |
>                 | clsact/skbedit
>                 |      MQ
>                 v
>     +-----------+-----------+
>     | q0        | q1        | qn
>     v           v           v
>   HTB/FQ      HTB/FQ  ...  FIFO
>
> Cc: Jamal Hadi Salim <jhs@...atatu.com>
> Cc: Cong Wang <xiyou.wangcong@...il.com>
> Cc: Jiri Pirko <jiri@...nulli.us>
> Cc: "David S. Miller" <davem@...emloft.net>
> Cc: Jakub Kicinski <kuba@...nel.org>
> Cc: Jonathan Lemon <jonathan.lemon@...il.com>
> Cc: Eric Dumazet <edumazet@...gle.com>
> Cc: Alexander Lobakin <alobakin@...me>
> Cc: Paolo Abeni <pabeni@...hat.com>
> Cc: Talal Ahmad <talalahmad@...gle.com>
> Cc: Kevin Hao <haokexin@...il.com>
> Cc: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> Cc: Kees Cook <keescook@...omium.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>
> Cc: Antoine Tenart <atenart@...nel.org>
> Cc: Wei Wang <weiwan@...gle.com>
> Cc: Arnd Bergmann <arnd@...db.de>
> Suggested-by: Eric Dumazet <edumazet@...gle.com>

I have not suggested this patch, only to not add yet another bit in sk_buff.


> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@...il.com>
> ---
>  include/linux/netdevice.h | 19 +++++++++++++++++++
>  net/core/dev.c            |  7 ++++++-
>  net/sched/act_skbedit.c   |  4 +++-
>  3 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8b0bdeb4734e..8d02dafb32ba 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3009,6 +3009,7 @@ struct softnet_data {
>         /* written and read only by owning cpu: */
>         struct {
>                 u16 recursion;
> +               u8  skip_txqueue;
>                 u8  more;
>         } xmit;
>  #ifdef CONFIG_RPS
> @@ -4696,6 +4697,24 @@ static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_devi
>         return rc;
>  }
>
> +static inline void netdev_xmit_skip_txqueue(bool skip)
> +{
> +       __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> +}
> +
> +static inline bool netdev_xmit_txqueue_skipped(void)
> +{
> +       return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> +}
> +
> +static inline struct netdev_queue *
> +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> +{
> +       int qm = skb_get_queue_mapping(skb);
> +
> +       return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> +}
> +
>  int netdev_class_create_file_ns(const struct class_attribute *class_attr,
>                                 const void *ns);
>  void netdev_class_remove_file_ns(const struct class_attribute *class_attr,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a855e41bbe39..e3f548c54dda 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4048,6 +4048,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         skb_update_prio(skb);
>
>         qdisc_pkt_len_init(skb);
> +       netdev_xmit_skip_txqueue(false);
>  #ifdef CONFIG_NET_CLS_ACT
>         skb->tc_at_ingress = 0;
>  #endif
> @@ -4073,7 +4074,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         else
>                 skb_dst_force(skb);
>
> -       txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +       if (netdev_xmit_txqueue_skipped())
> +               txq = netdev_tx_queue_mapping(dev, skb);
> +       else
> +               txq = netdev_core_pick_tx(dev, skb, sb_dev);
> +

If we really need to add yet another conditional in fast path, I would
suggest using a static key.

Only hosts where SKBEDIT_F_QUEUE_MAPPING is requested would pay the price.


>         q = rcu_dereference_bh(txq->qdisc);
>
>         trace_net_dev_queue(skb);
> diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> index ceba11b198bb..48504ed3b280 100644
> --- a/net/sched/act_skbedit.c
> +++ b/net/sched/act_skbedit.c
> @@ -58,8 +58,10 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
>                 }
>         }
>         if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
> -           skb->dev->real_num_tx_queues > params->queue_mapping)
> +           skb->dev->real_num_tx_queues > params->queue_mapping) {
> +               netdev_xmit_skip_txqueue(true);
>                 skb_set_queue_mapping(skb, params->queue_mapping);
> +       }
>         if (params->flags & SKBEDIT_F_MARK) {
>                 skb->mark &= ~params->mask;
>                 skb->mark |= params->mark & params->mask;
> --
> 2.27.0
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ