netdev - Re: iproute2: tc deletion freezes whole server

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANxWus8AqCM4Dk87TTXB3xxtQPqPYjs-KmzVv8hjZwaAqg2AYQ@mail.gmail.com>
Date:   Mon, 18 May 2020 16:16:38 +0200
From:   Václav Zindulka <vaclav.zindulka@...pnet.cz>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: iproute2: tc deletion freezes whole server

On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>
> On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> <vaclav.zindulka@...pnet.cz> wrote:
> > > > >
> > > > > I tried to emulate your test case in my VM, here is the script I use:
> > > > >
> > > > > ====
> > > > > ip li set dev dummy0 up
> > > > > tc qd add dev dummy0 root handle 1: htb default 1
> > > > > for i in `seq 1 1000`
> > > > > do
> > > > >   tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
> > > > >   tc qd add dev dummy0 parent 1:$i fq_codel
> > > > > done
> > > > >
> > > > > time tc qd del dev dummy0 root
> > > > > ====
> > > > >
> > > > > And this is the result:
> > > > >
> > > > >     Before my patch:
> > > > >      real   0m0.488s
> > > > >      user   0m0.000s
> > > > >      sys    0m0.325s
> > > > >
> > > > >     After my patch:
> > > > >      real   0m0.180s
> > > > >      user   0m0.000s
> > > > >      sys    0m0.132s
> > > >
> > > > My results with your test script.
> > > >
> > > > before patch:
> > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > real 1.63
> > > > user 0.00
> > > > sys 1.63
> > > >
> > > > after patch:
> > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > real 1.55
> > > > user 0.00
> > > > sys 1.54
> > > >
> > > > > This is an obvious improvement, so I have no idea why you didn't
> > > > > catch any difference.
> > > >
> > > > We use hfsc instead of htb. I don't know whether it may cause any
> > > > difference. I can provide you with my test scripts if necessary.
> > >
> > > Yeah, you can try to replace the htb with hfsc in my script,
> > > I didn't spend time to figure out hfsc parameters.
> >
> > class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
> > 13107200 ul m1 0 d 0 m2 13107200
> >
> > but it behaves the same as htb...
> >
> > > My point here is, if I can see the difference with merely 1000
> > > tc classes, you should see a bigger difference with hundreds
> > > of thousands classes in your setup. So, I don't know why you
> > > saw a relatively smaller difference.
> >
> > I saw a relatively big difference. It was about 1.5s faster on my huge
> > setup which is a lot. Yet maybe the problem is caused by something
>
> What percentage? IIUC, without patch it took you about 11s, so
> 1.5s faster means 13% improvement for you?

My whole setup needs 22.17 seconds to delete with an unpatched kernel.
With your patches applied it is 21.08. So it varies between 1 - 1.5s.
Improvement is about 5 - 6%.

> > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
> > queues. I've tried to increase the number of queues using ethtool from
> > 4 to 8 and decreased to 2. But there was no difference. It was about
> > 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> > patches applied. I've tried it for ifb and RJ45 interfaces where it
> > took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> > patches applied, which is strange, but it may be caused by the fact it
> > was very fast even before.
>
> That is odd. In fact, this is highly related to number of TX queues,
> because the existing code resets the qdisc's once for each TX
> queue, so the more TX queues you have, the more resets kernel
> will do, that is the more time it will take.

Can't the problem be caused that reset is done on active and inactive
queues every time? It would explain why it had no effect in decreasing
and increasing the number of active queues. Yet it doesn't explain why
Intel card (82599ES) with 64 possible queues has exactly the same
problem as Mellanox (ConnectX-4 LX) with 8 possible queues.

I've been playing with this problem today. Every deleted fq_codel
qdisc, root and non root, requires a network adapter to empty all
possible queues. But not just the active ones, but it cleared all
possible queues. Event those the adapter can't even use. For every
SFP+ I have tested it calls fq_codel_reset() and would call any other
reset function. 64 times for egress qdisc and 64 times for ingress
qdisc. Even when ingress is not defined. Solution to this whole
problem would be to let reset only activated queues. On the RJ45
interface I've tested there are 8 possible queues. So the reset is
done 8 times for every deleted qdisc. 16 in total, since ingress and
egress are processed all the time.

So a little bit of calculation. My initial setup contained 13706 qdisc
rules. For ifb it means 13706 for egress and 13706 for ingress. 27412
reset calls because there can be only one transmit queue for the ifb
interface. Average time spent between fq_codel_reset() according to my
initial perf reports is somewhat between 7 - 16 micro seconds. 27412 *
0.000008 = 0.219296 s. For RJ45 interface it does 8 calls for every
qdisc. 13706 * 8 * 2 = 219296 resets. 219296 * 0.000008 = 1.754368 s.
It is still ok. For SFP+ interface it is 64 resets per qdisc rule.
13706 * 64 * 2 = 1754368. So we are very close to the huge number I
noticed initially using printk. And now 1754368 * 0.000008 =
14.034944s. In case of slowest calls 1754368 * 0.000016 = 28.069888.
Woala. Gotcha. So my final judgement is - don't empty something we
don't use anyway. For Intel it may be reasonable, it can have all 64
queues defined. Mellanox has its limit at 8. But it still is being
reset 64 times. We mostly decrease the number of queues to 4.
Sometimes 2 according to the CPU used. Yet every CPU had to handle 64
resets.

With the attached patch I'm down to 1.7 seconds - more than 90%
improvement :-) Can you please check it and pass it to proper places?
According to debugging printk messages it empties only active queues.

Thank you for all your help and effort.


>
> I plan to address this later on top of the existing patches.
>
> Thanks.

View attachment "netdevice_num_tx_queues.patch" of type "text/x-patch" (300 bytes)