[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMDZJNWLjvV_SfEmLr5FwLx2gcg4FrHqvp1q+fA2AXwAvTCLDg@mail.gmail.com>
Date: Mon, 20 Dec 2021 22:31:48 +0800
From: Tonghao Zhang <xiangxia.m.yue@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev <netdev@...r.kernel.org>,
Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Jonathan Lemon <jonathan.lemon@...il.com>,
Alexander Lobakin <alobakin@...me>,
Paolo Abeni <pabeni@...hat.com>,
Talal Ahmad <talalahmad@...gle.com>,
Kevin Hao <haokexin@...il.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Kees Cook <keescook@...omium.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Antoine Tenart <atenart@...nel.org>,
Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de>
Subject: Re: [net-next v5 1/2] net: sched: use queue_mapping to pick tx queue
On Mon, Dec 20, 2021 at 10:21 PM Tonghao Zhang
<xiangxia.m.yue@...il.com> wrote:
>
> On Mon, Dec 20, 2021 at 9:58 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Dec 20, 2021 at 4:38 AM <xiangxia.m.yue@...il.com> wrote:
> > >
> > > From: Tonghao Zhang <xiangxia.m.yue@...il.com>
> > >
> > > This patch fixes issue:
> > > * If we install tc filters with act_skbedit in clsact hook.
> > > It doesn't work, because netdev_core_pick_tx() overwrites
> > > queue_mapping.
> > >
> > > $ tc filter ... action skbedit queue_mapping 1
> > >
> > > And this patch is useful:
> > > * We can use FQ + EDT to implement efficient policies. Tx queues
> > > are picked by xps, ndo_select_queue of netdev driver, or skb hash
> > > in netdev_core_pick_tx(). In fact, the netdev driver, and skb
> > > hash are _not_ under control. xps uses the CPUs map to select Tx
> > > queues, but we can't figure out which task_struct of pod/containter
> > > running on this cpu in most case. We can use clsact filters to classify
> > > one pod/container traffic to one Tx queue. Why ?
> > >
> > > In containter networking environment, there are two kinds of pod/
> > > containter/net-namespace. One kind (e.g. P1, P2), the high throughput
> > > is key in these applications. But avoid running out of network resource,
> > > the outbound traffic of these pods is limited, using or sharing one
> > > dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
> > > (e.g. Pn), the low latency of data access is key. And the traffic is not
> > > limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
> > > This choice provides two benefits. First, contention on the HTB/FQ Qdisc
> > > lock is significantly reduced since fewer CPUs contend for the same queue.
> > > More importantly, Qdisc contention can be eliminated completely if each
> > > CPU has its own FIFO Qdisc for the second kind of pods.
> > >
> > > There must be a mechanism in place to support classifying traffic based on
> > > pods/container to different Tx queues. Note that clsact is outside of Qdisc
> > > while Qdisc can run a classifier to select a sub-queue under the lock.
> > >
> > > In general recording the decision in the skb seems a little heavy handed.
> > > This patch introduces a per-CPU variable, suggested by Eric.
> > >
> > > The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
> > > - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
> > > is set in qdisc->enqueue() though tx queue has been selected in
> > > netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
> > > firstly in __dev_queue_xmit(), is useful:
> > > - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
> > > in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
> > > For example, eth0, macvlan in pod, which root Qdisc install skbedit
> > > queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
> > > eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
> > > because there is no filters in clsact or tx Qdisc of this netdev.
> > > Same action taked in eth0, ixgbe in Host.
> > > - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
> > > in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
> > > in __dev_queue_xmit when processing next packets.
> > >
> > > +----+ +----+ +----+
> > > | P1 | | P2 | | Pn |
> > > +----+ +----+ +----+
> > > | | |
> > > +-----------+-----------+
> > > |
> > > | clsact/skbedit
> > > | MQ
> > > v
> > > +-----------+-----------+
> > > | q0 | q1 | qn
> > > v v v
> > > HTB/FQ HTB/FQ ... FIFO
> > >
> > > Cc: Jamal Hadi Salim <jhs@...atatu.com>
> > > Cc: Cong Wang <xiyou.wangcong@...il.com>
> > > Cc: Jiri Pirko <jiri@...nulli.us>
> > > Cc: "David S. Miller" <davem@...emloft.net>
> > > Cc: Jakub Kicinski <kuba@...nel.org>
> > > Cc: Jonathan Lemon <jonathan.lemon@...il.com>
> > > Cc: Eric Dumazet <edumazet@...gle.com>
> > > Cc: Alexander Lobakin <alobakin@...me>
> > > Cc: Paolo Abeni <pabeni@...hat.com>
> > > Cc: Talal Ahmad <talalahmad@...gle.com>
> > > Cc: Kevin Hao <haokexin@...il.com>
> > > Cc: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> > > Cc: Kees Cook <keescook@...omium.org>
> > > Cc: Kumar Kartikeya Dwivedi <memxor@...il.com>
> > > Cc: Antoine Tenart <atenart@...nel.org>
> > > Cc: Wei Wang <weiwan@...gle.com>
> > > Cc: Arnd Bergmann <arnd@...db.de>
> > > Suggested-by: Eric Dumazet <edumazet@...gle.com>
> >
> > I have not suggested this patch, only to not add yet another bit in sk_buff.
> sorry for that
> >
> > > Signed-off-by: Tonghao Zhang <xiangxia.m.yue@...il.com>
> > > ---
> > > include/linux/netdevice.h | 19 +++++++++++++++++++
> > > net/core/dev.c | 7 ++++++-
> > > net/sched/act_skbedit.c | 4 +++-
> > > 3 files changed, 28 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 8b0bdeb4734e..8d02dafb32ba 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -3009,6 +3009,7 @@ struct softnet_data {
> > > /* written and read only by owning cpu: */
> > > struct {
> > > u16 recursion;
> > > + u8 skip_txqueue;
> > > u8 more;
> > > } xmit;
> > > #ifdef CONFIG_RPS
> > > @@ -4696,6 +4697,24 @@ static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_devi
> > > return rc;
> > > }
> > >
> > > +static inline void netdev_xmit_skip_txqueue(bool skip)
> > > +{
> > > + __this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
> > > +}
> > > +
> > > +static inline bool netdev_xmit_txqueue_skipped(void)
> > > +{
> > > + return __this_cpu_read(softnet_data.xmit.skip_txqueue);
> > > +}
> > > +
> > > +static inline struct netdev_queue *
> > > +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
> > > +{
> > > + int qm = skb_get_queue_mapping(skb);
> > > +
> > > + return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
> > > +}
> > > +
> > > int netdev_class_create_file_ns(const struct class_attribute *class_attr,
> > > const void *ns);
> > > void netdev_class_remove_file_ns(const struct class_attribute *class_attr,
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index a855e41bbe39..e3f548c54dda 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4048,6 +4048,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > skb_update_prio(skb);
> > >
> > > qdisc_pkt_len_init(skb);
> > > + netdev_xmit_skip_txqueue(false);
> > > #ifdef CONFIG_NET_CLS_ACT
> > > skb->tc_at_ingress = 0;
> > > #endif
> > > @@ -4073,7 +4074,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > else
> > > skb_dst_force(skb);
> > >
> > > - txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > > + if (netdev_xmit_txqueue_skipped())
> > > + txq = netdev_tx_queue_mapping(dev, skb);
> > > + else
> > > + txq = netdev_core_pick_tx(dev, skb, sb_dev);
> > > +
> >
> > If we really need to add yet another conditional in fast path, I would
> > suggest using a static key.
> Thanks, I will add a static key for this patch.
> > Only hosts where SKBEDIT_F_QUEUE_MAPPING is requested would pay the price.
Hi Eirc
Do you mean we need a static key, such as, egress_needed_key ?
static DEFINE_STATIC_KEY_FALSE(egress_needed_key);
and we also should use the per-cpu var to decide to select tx from
netdev_tx_queue_mapping or netdev_core_pick_tx.
> >
> > > q = rcu_dereference_bh(txq->qdisc);
> > >
> > > trace_net_dev_queue(skb);
> > > diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> > > index ceba11b198bb..48504ed3b280 100644
> > > --- a/net/sched/act_skbedit.c
> > > +++ b/net/sched/act_skbedit.c
> > > @@ -58,8 +58,10 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
> > > }
> > > }
> > > if (params->flags & SKBEDIT_F_QUEUE_MAPPING &&
> > > - skb->dev->real_num_tx_queues > params->queue_mapping)
> > > + skb->dev->real_num_tx_queues > params->queue_mapping) {
> > > + netdev_xmit_skip_txqueue(true);
> > > skb_set_queue_mapping(skb, params->queue_mapping);
> > > + }
> > > if (params->flags & SKBEDIT_F_MARK) {
> > > skb->mark &= ~params->mask;
> > > skb->mark |= params->mark & params->mask;
> > > --
> > > 2.27.0
> > >
>
>
>
> --
> Best regards, Tonghao
--
Best regards, Tonghao
Powered by blists - more mailing lists