lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Mon, 20 Dec 2021 22:31:48 +0800 From: Tonghao Zhang <xiangxia.m.yue@...il.com> To: Eric Dumazet <edumazet@...gle.com> Cc: netdev <netdev@...r.kernel.org>, Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Jonathan Lemon <jonathan.lemon@...il.com>, Alexander Lobakin <alobakin@...me>, Paolo Abeni <pabeni@...hat.com>, Talal Ahmad <talalahmad@...gle.com>, Kevin Hao <haokexin@...il.com>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, Kees Cook <keescook@...omium.org>, Kumar Kartikeya Dwivedi <memxor@...il.com>, Antoine Tenart <atenart@...nel.org>, Wei Wang <weiwan@...gle.com>, Arnd Bergmann <arnd@...db.de> Subject: Re: [net-next v5 1/2] net: sched: use queue_mapping to pick tx queue On Mon, Dec 20, 2021 at 10:21 PM Tonghao Zhang <xiangxia.m.yue@...il.com> wrote: > > On Mon, Dec 20, 2021 at 9:58 PM Eric Dumazet <edumazet@...gle.com> wrote: > > > > On Mon, Dec 20, 2021 at 4:38 AM <xiangxia.m.yue@...il.com> wrote: > > > > > > From: Tonghao Zhang <xiangxia.m.yue@...il.com> > > > > > > This patch fixes issue: > > > * If we install tc filters with act_skbedit in clsact hook. > > > It doesn't work, because netdev_core_pick_tx() overwrites > > > queue_mapping. > > > > > > $ tc filter ... action skbedit queue_mapping 1 > > > > > > And this patch is useful: > > > * We can use FQ + EDT to implement efficient policies. Tx queues > > > are picked by xps, ndo_select_queue of netdev driver, or skb hash > > > in netdev_core_pick_tx(). In fact, the netdev driver, and skb > > > hash are _not_ under control. xps uses the CPUs map to select Tx > > > queues, but we can't figure out which task_struct of pod/containter > > > running on this cpu in most case. We can use clsact filters to classify > > > one pod/container traffic to one Tx queue. Why ? > > > > > > In containter networking environment, there are two kinds of pod/ > > > containter/net-namespace. One kind (e.g. P1, P2), the high throughput > > > is key in these applications. But avoid running out of network resource, > > > the outbound traffic of these pods is limited, using or sharing one > > > dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods > > > (e.g. Pn), the low latency of data access is key. And the traffic is not > > > limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc. > > > This choice provides two benefits. First, contention on the HTB/FQ Qdisc > > > lock is significantly reduced since fewer CPUs contend for the same queue. > > > More importantly, Qdisc contention can be eliminated completely if each > > > CPU has its own FIFO Qdisc for the second kind of pods. > > > > > > There must be a mechanism in place to support classifying traffic based on > > > pods/container to different Tx queues. Note that clsact is outside of Qdisc > > > while Qdisc can run a classifier to select a sub-queue under the lock. > > > > > > In general recording the decision in the skb seems a little heavy handed. > > > This patch introduces a per-CPU variable, suggested by Eric. > > > > > > The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit(). > > > - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag > > > is set in qdisc->enqueue() though tx queue has been selected in > > > netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared > > > firstly in __dev_queue_xmit(), is useful: > > > - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev > > > in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy: > > > For example, eth0, macvlan in pod, which root Qdisc install skbedit > > > queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of > > > eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping > > > because there is no filters in clsact or tx Qdisc of this netdev. > > > Same action taked in eth0, ixgbe in Host. > > > - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue > > > in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it > > > in __dev_queue_xmit when processing next packets. > > > > > > +----+ +----+ +----+ > > > | P1 | | P2 | | Pn | > > > +----+ +----+ +----+ > > > | | | > > > +-----------+-----------+ > > > | > > > | clsact/skbedit > > > | MQ > > > v > > > +-----------+-----------+ > > > | q0 | q1 | qn > > > v v v > > > HTB/FQ HTB/FQ ... FIFO > > > > > > Cc: Jamal Hadi Salim <jhs@...atatu.com> > > > Cc: Cong Wang <xiyou.wangcong@...il.com> > > > Cc: Jiri Pirko <jiri@...nulli.us> > > > Cc: "David S. Miller" <davem@...emloft.net> > > > Cc: Jakub Kicinski <kuba@...nel.org> > > > Cc: Jonathan Lemon <jonathan.lemon@...il.com> > > > Cc: Eric Dumazet <edumazet@...gle.com> > > > Cc: Alexander Lobakin <alobakin@...me> > > > Cc: Paolo Abeni <pabeni@...hat.com> > > > Cc: Talal Ahmad <talalahmad@...gle.com> > > > Cc: Kevin Hao <haokexin@...il.com> > > > Cc: Ilias Apalodimas <ilias.apalodimas@...aro.org> > > > Cc: Kees Cook <keescook@...omium.org> > > > Cc: Kumar Kartikeya Dwivedi <memxor@...il.com> > > > Cc: Antoine Tenart <atenart@...nel.org> > > > Cc: Wei Wang <weiwan@...gle.com> > > > Cc: Arnd Bergmann <arnd@...db.de> > > > Suggested-by: Eric Dumazet <edumazet@...gle.com> > > > > I have not suggested this patch, only to not add yet another bit in sk_buff. > sorry for that > > > > > Signed-off-by: Tonghao Zhang <xiangxia.m.yue@...il.com> > > > --- > > > include/linux/netdevice.h | 19 +++++++++++++++++++ > > > net/core/dev.c | 7 ++++++- > > > net/sched/act_skbedit.c | 4 +++- > > > 3 files changed, 28 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > > index 8b0bdeb4734e..8d02dafb32ba 100644 > > > --- a/include/linux/netdevice.h > > > +++ b/include/linux/netdevice.h > > > @@ -3009,6 +3009,7 @@ struct softnet_data { > > > /* written and read only by owning cpu: */ > > > struct { > > > u16 recursion; > > > + u8 skip_txqueue; > > > u8 more; > > > } xmit; > > > #ifdef CONFIG_RPS > > > @@ -4696,6 +4697,24 @@ static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_devi > > > return rc; > > > } > > > > > > +static inline void netdev_xmit_skip_txqueue(bool skip) > > > +{ > > > + __this_cpu_write(softnet_data.xmit.skip_txqueue, skip); > > > +} > > > + > > > +static inline bool netdev_xmit_txqueue_skipped(void) > > > +{ > > > + return __this_cpu_read(softnet_data.xmit.skip_txqueue); > > > +} > > > + > > > +static inline struct netdev_queue * > > > +netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb) > > > +{ > > > + int qm = skb_get_queue_mapping(skb); > > > + > > > + return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm)); > > > +} > > > + > > > int netdev_class_create_file_ns(const struct class_attribute *class_attr, > > > const void *ns); > > > void netdev_class_remove_file_ns(const struct class_attribute *class_attr, > > > diff --git a/net/core/dev.c b/net/core/dev.c > > > index a855e41bbe39..e3f548c54dda 100644 > > > --- a/net/core/dev.c > > > +++ b/net/core/dev.c > > > @@ -4048,6 +4048,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) > > > skb_update_prio(skb); > > > > > > qdisc_pkt_len_init(skb); > > > + netdev_xmit_skip_txqueue(false); > > > #ifdef CONFIG_NET_CLS_ACT > > > skb->tc_at_ingress = 0; > > > #endif > > > @@ -4073,7 +4074,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) > > > else > > > skb_dst_force(skb); > > > > > > - txq = netdev_core_pick_tx(dev, skb, sb_dev); > > > + if (netdev_xmit_txqueue_skipped()) > > > + txq = netdev_tx_queue_mapping(dev, skb); > > > + else > > > + txq = netdev_core_pick_tx(dev, skb, sb_dev); > > > + > > > > If we really need to add yet another conditional in fast path, I would > > suggest using a static key. > Thanks, I will add a static key for this patch. > > Only hosts where SKBEDIT_F_QUEUE_MAPPING is requested would pay the price. Hi Eirc Do you mean we need a static key, such as, egress_needed_key ? static DEFINE_STATIC_KEY_FALSE(egress_needed_key); and we also should use the per-cpu var to decide to select tx from netdev_tx_queue_mapping or netdev_core_pick_tx. > > > > > q = rcu_dereference_bh(txq->qdisc); > > > > > > trace_net_dev_queue(skb); > > > diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c > > > index ceba11b198bb..48504ed3b280 100644 > > > --- a/net/sched/act_skbedit.c > > > +++ b/net/sched/act_skbedit.c > > > @@ -58,8 +58,10 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a, > > > } > > > } > > > if (params->flags & SKBEDIT_F_QUEUE_MAPPING && > > > - skb->dev->real_num_tx_queues > params->queue_mapping) > > > + skb->dev->real_num_tx_queues > params->queue_mapping) { > > > + netdev_xmit_skip_txqueue(true); > > > skb_set_queue_mapping(skb, params->queue_mapping); > > > + } > > > if (params->flags & SKBEDIT_F_MARK) { > > > skb->mark &= ~params->mask; > > > skb->mark |= params->mark & params->mask; > > > -- > > > 2.27.0 > > > > > > > -- > Best regards, Tonghao -- Best regards, Tonghao
Powered by blists - more mailing lists