netdev - Re: [PATCH net] net: sched: refine software bypass handling in tc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADvbK_chW7898xRLXeJkis3dLnDjP72MZQ__5GB57R1OHW6Z3w@mail.gmail.com>
Date: Thu, 9 Jan 2025 10:47:30 -0500
From: Xin Long <lucien.xin@...il.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: network dev <netdev@...r.kernel.org>, davem@...emloft.net, kuba@...nel.org, 
	Eric Dumazet <edumazet@...gle.com>, Jamal Hadi Salim <jhs@...atatu.com>, 
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, 
	Marcelo Ricardo Leitner <marcelo.leitner@...il.com>, ast@...erby.net, Shuang Li <shuali@...hat.com>
Subject: Re: [PATCH net] net: sched: refine software bypass handling in tc_run

On Thu, Jan 9, 2025 at 5:46 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 1/6/25 4:08 PM, Xin Long wrote:
> > This patch addresses issues with filter counting in block (tcf_block),
> > particularly for software bypass scenarios, by introducing a more
> > accurate mechanism using useswcnt.
> >
> > Previously, filtercnt and skipswcnt were introduced by:
> >
> >   Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
> >   Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")
> >
> >   filtercnt tracked all tp (tcf_proto) objects added to a block, and
> >   skipswcnt counted tp objects with the skipsw attribute set.
> >
> > The problem is: a single tp can contain multiple filters, some with skipsw
> > and others without. The current implementation fails in the case:
> >
> >   When the first filter in a tp has skipsw, both skipswcnt and filtercnt
> >   are incremented, then adding a second filter without skipsw to the same
> >   tp does not modify these counters because tp->counted is already set.
> >
> >   This results in bypass software behavior based solely on skipswcnt
> >   equaling filtercnt, even when the block includes filters without
> >   skipsw. Consequently, filters without skipsw are inadvertently bypassed.
> >
> > To address this, the patch introduces useswcnt in block to explicitly count
> > tp objects containing at least one filter without skipsw. Key changes
> > include:
> >
> >   Whenever a filter without skipsw is added, its tp is marked with usesw
> >   and counted in useswcnt. tc_run() now uses useswcnt to determine software
> >   bypass, eliminating reliance on filtercnt and skipswcnt.
> >
> >   This refined approach prevents software bypass for blocks containing
> >   mixed filters, ensuring correct behavior in tc_run().
> >
> > Additionally, as atomic operations on useswcnt ensure thread safety and
> > tp->lock guards access to tp->usesw and tp->counted, the broader lock
> > down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
> > and this resolves a performance regression caused by the filter counting
> > mechanism during parallel filter insertions.
> >
> >   The improvement can be demonstrated using the following script:
> >
> >   # cat insert_tc_rules.sh
> >
> >     tc qdisc add dev ens1f0np0 ingress
> >     for i in $(seq 16); do
> >         taskset -c $i tc -b rules_$i.txt &
> >     done
> >     wait
> >
> >   Each of rules_$i.txt files above includes 100000 tc filter rules to a
> >   mlx5 driver NIC ens1f0np0.
> >
> >   Without this patch:
> >
> >   # time sh insert_tc_rules.sh
> >
> >     real    0m50.780s
> >     user    0m23.556s
> >     sys           4m13.032s
> >
> >   With this patch:
> >
> >   # time sh insert_tc_rules.sh
> >
> >     real    0m17.718s
> >     user    0m7.807s
> >     sys     3m45.050s
> >
> > Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
> > Reported-by: Shuang Li <shuali@...hat.com>
> > Signed-off-by: Xin Long <lucien.xin@...il.com>
> > Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@...il.com>
>
> Given the quite large scope of this change and the functional and
> performance implications, I think it's more suited for net-next.
>
Sounds fine to me.

> > ---
> >  include/net/pkt_cls.h     | 18 +++++++-------
> >  include/net/sch_generic.h |  5 ++--
> >  net/core/dev.c            | 11 ++-------
> >  net/sched/cls_api.c       | 52 +++++++++------------------------------
> >  net/sched/cls_bpf.c       |  2 ++
> >  net/sched/cls_flower.c    |  2 ++
> >  net/sched/cls_matchall.c  |  2 ++
> >  net/sched/cls_u32.c       |  2 ++
> >  8 files changed, 32 insertions(+), 62 deletions(-)
> >
> > diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> > index cf199af85c52..d66cb315a6b5 100644
> > --- a/include/net/pkt_cls.h
> > +++ b/include/net/pkt_cls.h
> > @@ -74,15 +74,6 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
> >       return block && block->index;
> >  }
> >
> > -#ifdef CONFIG_NET_CLS_ACT
> > -DECLARE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
>
> I think it would be better, if possible, to preserve this static key;
> will reduce the delta and avoid additional tests in fast-path for S/W
> only setup.
That's difficult. This static key will in/decrement according to
block->useswcnt, and we have to hold down_write(&block->cb_lock)
for its update when adding a filter, and the performance issue
will come back again.

>
> > -
> > -static inline bool tcf_block_bypass_sw(struct tcf_block *block)
> > -{
> > -     return block && block->bypass_wanted;
> > -}
> > -#endif
> > -
> >  static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> >  {
> >       WARN_ON(tcf_block_shared(block));
> > @@ -760,6 +751,15 @@ tc_cls_common_offload_init(struct flow_cls_common_offload *cls_common,
> >               cls_common->extack = extack;
> >  }
> >
> > +static inline void tcf_proto_update_usesw(struct tcf_proto *tp, u32 flags)
> > +{
> > +     if (tp->usesw)
> > +             return;
> > +     if (tc_skip_sw(flags) && tc_in_hw(flags))
> > +             return;
> > +     tp->usesw = true;
> > +}
>
> It looks like 'usesw' is never cleared. Can't user-space change the
> skipsw flag for an existing tp?
skipsw flag belongs to a tc rule/filter, and a tp may include multiple
rules/filters, and a tp doesn't have flags for skipsw directly.

Now we are adding a tp->usesw to reflect if any rule without skipsw flag
is ever added in this tp. And yes, this tp->usesw will NOT be cleared
even if this rule without the skipsw flag is deleted. I guess that's
all we can do now.

If we want to use a tp->skipswcnt to dynamically track the skipsw flag
for a tp, some common code/functions for rule adding and deleting will
be needed. However,there's no such code in tc for a rule deleting, so
it's not doable unless we touch all cls_*.c files.

>
> [...]
> > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> > index d3a03c57545b..5e8f191fd820 100644
> > --- a/net/sched/cls_u32.c
> > +++ b/net/sched/cls_u32.c
> > @@ -1164,6 +1164,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
> >               if (!tc_in_hw(n->flags))
> >                       n->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
> >
> > +             tcf_proto_update_usesw(tp, n->flags);
>
> Why don't you need to hook also in the 'key existing' branch on line 909?
>
Good catch, didn't notice that cls_u32 calls to replace hw filter in
two places.

Thanks.