netdev - Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message-ID: <87pm66wbgo.fsf@nvidia.com>
Date: Thu, 8 Jun 2023 12:17:27 +0300
From: Vlad Buslov <vladbu@...dia.com>
To: Peilin Ye <yepeilin.cs@...il.com>
CC: Jamal Hadi Salim <jhs@...atatu.com>, Jakub Kicinski <kuba@...nel.org>,
	Pedro Tammela <pctammela@...atatu.com>, "David S. Miller"
	<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Paolo Abeni
	<pabeni@...hat.com>, Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko
	<jiri@...nulli.us>, Peilin Ye <peilin.ye@...edance.com>, Daniel Borkmann
	<daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>, "Hillf
 Danton" <hdanton@...a.com>, <netdev@...r.kernel.org>, Cong Wang
	<cong.wang@...edance.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and
 clsact Qdiscs before grafting


On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@...il.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> >> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> >> be remedied by setting a new Qdisc->flags bit
>> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >> >
>> >> > Thanks for the suggestion!  I'll try this approach.
>> >> >
>> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> >> > even longer?).
>> >> 
>> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flag is
>> >> set is also fine? Would be much easier to implement as opposed to moving
>> >> rtnl_lock there.
>> >
>> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
>> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
>> > "request A") while it has a lot of ongoing filter requests, and here's the
>> > result:
>> >
>> >                         #1         #2         #3         #4
>> >   ----------------------------------------------------------
>> >    a. refcnt            89         93        230        571
>> >    b. replayed     167,568    196,450    336,291    878,027
>> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>> >
>> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>> >    b. is the number of times A has been replayed;
>> >    c. is the time(1) output for A.
>> >
>> > a. and b. are collected from printk() output.  This is better than before,
>> > but A could still be replayed hundreds of thousands of times and hang
>> > for a few seconds.
>> 
>> I don't get where the few seconds of waiting time come from. I'm probably
>> missing something obvious here, but the waiting time should be the
>> maximum latency of any new/get/del filter request that is already
>> in-flight (i.e. has already passed the qdisc_is_destroying() check), and
>> that should be several orders of magnitude shorter.
>
> Yeah I agree, here's what I did:
>
> In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
> loop:
>
>   $ echo "1 1 32" > /sys/bus/netdevsim/new_device
>   $ tc qdisc add dev eth1 ingress
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done
>
> When the loop is running, I delete the Qdisc in Terminal 2:
>
>   $ time tc qdisc delete dev eth1 ingress
>
> Which took seconds on average.  However, if I specify a unique "prio" when
> adding filters in that loop, e.g.:
>
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done
>                                              ^^^^^^^
>
> Then deleting the Qdisc in Terminal 2 becomes a lot faster:
>
>   real  0m0.712s
>   user  0m0.000s
>   sys   0m0.152s 
>
> In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
> yet another test [1], which looks a lot better.

That makes sense, thanks for explaining.

>
> When I didn't specify "prio", sometimes that
> rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> the same "prio" number, so they collided with each other?  Do you think
> this is related to why it's slow?

It is slow because when you create a filter without providing a priority, you
are basically measuring the latency of creating a whole flower
classifier instance (multiple memory allocations, initialization of all
kinds of idrs, hash tables and locks, updating the tp list in the chain, etc.),
not just a single filter, so significantly higher latency is expected.

But my point still stands: with the latest version of your fix, the maximum
time spent 'spinning' in sch_api is bounded by the latency of the slowest
concurrent tcf_{new|del|get}_tfilter op that has already obtained the qdisc.
Any concurrent filter API message arriving after the qdisc->flags
"DELETED-REJECT-NEW-FILTERS" bit has been set will fail, so it can't livelock
the concurrent qdisc del/replace.
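
The bound described above can be illustrated with a minimal user-space
sketch. It is a Python model, not kernel code: the Qdisc class, the helper
names and the flag value are hypothetical stand-ins, and it assumes one
in-flight request completes per replay, which is a simplification.

```python
EINVAL = 22  # errno value for "invalid argument" on Linux

# Placeholder flag bit; the thread only names it "DELETED-REJECT-NEW-FILTERS".
DELETED_REJECT_NEW_FILTERS = 0x1


class Qdisc:
    """Toy stand-in for struct Qdisc; the initial reference models the owner."""

    def __init__(self):
        self.refcnt = 1
        self.flags = 0


def qdisc_find(q):
    """Model of the lookup done by a filter request.

    Returns 0 and takes a reference, or -EINVAL if the Qdisc is gone
    or has been flagged as being deleted.
    """
    if q.refcnt == 0 or (q.flags & DELETED_REJECT_NEW_FILTERS):
        return -EINVAL
    q.refcnt += 1
    return 0


def qdisc_put(q):
    """A filter request finished and drops its reference."""
    q.refcnt -= 1


def delete_qdisc(q):
    """Flag the Qdisc, then replay until in-flight requests drain.

    Returns the number of replays under the one-completion-per-replay
    assumption stated above.
    """
    q.flags |= DELETED_REJECT_NEW_FILTERS
    replays = 0
    while q.refcnt > 1:
        # A late request arriving now is rejected, so it cannot extend the wait.
        assert qdisc_find(q) == -EINVAL
        qdisc_put(q)  # one already in-flight request completes
        replays += 1
    return replays


q = Qdisc()
for _ in range(3):
    qdisc_find(q)       # three requests get in before the flag is set
print(delete_qdisc(q))  # replays are bounded by the in-flight count
```

In this model the delete can only wait on references taken before the flag
was set, so the wait shrinks monotonically no matter how many new requests
keep arriving.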

>
> Thanks,
> Peilin Ye
>
> [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
> in -batch mode, each of which keeps adding a unique filter (with "prio" and
> "handle" specified) and then deleting it.  Again, while they are running I
> delete the ingress Qdisc, and here's the result:
>
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            64         63         64         64
>     b. replayed         169      5,630        887      3,442
>     c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
>             user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
>              sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s