Date:   Tue, 19 May 2020 10:04:18 +0200
From:   Václav Zindulka <vaclav.zindulka@...pnet.cz>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: iproute2: tc deletion freezes whole server

On Mon, May 18, 2020 at 8:22 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>
> On Mon, May 18, 2020 at 7:16 AM Václav Zindulka
> <vaclav.zindulka@...pnet.cz> wrote:
> >
> > On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> > > <vaclav.zindulka@...pnet.cz> wrote:
> > > > > > >
> > > > > > > I tried to emulate your test case in my VM, here is the script I use:
> > > > > > >
> > > > > > > ====
> > > > > > > ip li set dev dummy0 up
> > > > > > > tc qd add dev dummy0 root handle 1: htb default 1
> > > > > > > for i in `seq 1 1000`
> > > > > > > do
> > > > > > >   tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
> > > > > > >   tc qd add dev dummy0 parent 1:$i fq_codel
> > > > > > > done
> > > > > > >
> > > > > > > time tc qd del dev dummy0 root
> > > > > > > ====
> > > > > > >
> > > > > > > And this is the result:
> > > > > > >
> > > > > > >     Before my patch:
> > > > > > >      real   0m0.488s
> > > > > > >      user   0m0.000s
> > > > > > >      sys    0m0.325s
> > > > > > >
> > > > > > >     After my patch:
> > > > > > >      real   0m0.180s
> > > > > > >      user   0m0.000s
> > > > > > >      sys    0m0.132s
> > > > > >
> > > > > > My results with your test script.
> > > > > >
> > > > > > before patch:
> > > > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > > > real 1.63
> > > > > > user 0.00
> > > > > > sys 1.63
> > > > > >
> > > > > > after patch:
> > > > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > > > real 1.55
> > > > > > user 0.00
> > > > > > sys 1.54
> > > > > >
> > > > > > > This is an obvious improvement, so I have no idea why you didn't
> > > > > > > catch any difference.
> > > > > >
> > > > > > We use hfsc instead of htb. I don't know whether it may cause any
> > > > > > difference. I can provide you with my test scripts if necessary.
> > > > >
> > > > > Yeah, you can try to replace the htb with hfsc in my script,
> > > > > I didn't spend time to figure out hfsc parameters.
> > > >
> > > > class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
> > > > 13107200 ul m1 0 d 0 m2 13107200
> > > >
> > > > but it behaves the same as htb...
> > > >
> > > > > My point here is, if I can see the difference with merely 1000
> > > > > tc classes, you should see a bigger difference with hundreds
> > > > > of thousands classes in your setup. So, I don't know why you
> > > > > saw a relatively smaller difference.
> > > >
> > > > I saw a relatively big difference. It was about 1.5s faster on my huge
> > > > setup, which is a lot. Yet maybe the problem is caused by something
> > >
> > > What percentage? IIUC, without patch it took you about 11s, so
> > > 1.5s faster means 13% improvement for you?
> >
> > My whole setup needs 22.17 seconds to delete with an unpatched kernel.
> > With your patches applied it is 21.08, so the difference varies between
> > 1 and 1.5s. The improvement is about 5 - 6%.
>
> Good to know.
>
> >
> > > > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > > > queues. SFP+ interfaces have much higher limits: 8 or even 64 possible
> > > > queues. I've tried to increase the number of queues using ethtool from
> > > > 4 to 8 and to decrease it to 2. But there was no difference. It was about
> > > > 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> > > > patches applied. I've tried it for ifb and RJ45 interfaces where it
> > > > took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> > > > patches applied, which is strange, but it may be caused by the fact it
> > > > was very fast even before.
> > >
> > > That is odd. In fact, this is highly related to the number of TX queues,
> > > because the existing code resets the qdiscs once for each TX
> > > queue, so the more TX queues you have, the more resets the kernel
> > > will do, and the more time it will take.
> >
> > Couldn't the problem be caused by the reset being done on both active
> > and inactive queues every time? That would explain why decreasing and
> > increasing the number of active queues had no effect. Yet it doesn't
> > explain why an Intel card (82599ES) with 64 possible queues has exactly
> > the same problem as a Mellanox (ConnectX-4 LX) with 8 possible queues.
>
> Regardless of these queues, the qdisc's should be only reset once,
> because all of these queues point to the same instance of root
> qdisc in your case.
>
> [...]
> > With the attached patch I'm down to 1.7 seconds - more than 90%
> > improvement :-) Can you please check it and pass it to the proper places?
> > According to debugging printk messages it empties only active queues.
>
> You can't change netdev_for_each_tx_queue(), it would certainly at least
> break netif_alloc_netdev_queues().

You are right. I didn't check this. But that is the only occurrence of
netdev_for_each_tx_queue() outside sch_generic.c. I've duplicated
netdev_for_each_tx_queue() as netdev_for_each_active_tx_queue(), which
uses real_num_tx_queues as the loop bound, and altered sch_generic.c to
replace all the occurrences with the new helper. From my point of view
it is easier to fix the unnecessary excessive calls for unallocated
queues this way. Yet I totally understand your point of view, which
would mean a cleaner approach. I'm more than happy with that small fix.
I'll help you test all the patches necessary, but our production will
run on this dirty fix. You know, my first kernel patch. :-D If you
want, I'll send it to you.

>
> Let me think how to fix this properly, I have some ideas and will provide
> you some patch(es) to test soon.

Sure, I'll wait. I have plenty of time now with the main problem fixed :-)

Thank you.
