Message-ID: <877419667aa827cc5843f7ae22658686af22515f.camel@redhat.com>
Date: Tue, 16 May 2023 16:32:52 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Horatiu Vultur <horatiu.vultur@...rochip.com>
Cc: Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org
Subject: Re: Performance regression on lan966x when extracting frames
On Tue, 2023-05-16 at 16:11 +0200, Horatiu Vultur wrote:
> The 05/16/2023 12:16, Paolo Abeni wrote:
> > On Tue, 2023-05-16 at 11:27 +0200, Horatiu Vultur wrote:
> > > The 05/16/2023 10:04, Eric Dumazet wrote:
> > > >
> > > > On Tue, May 16, 2023 at 9:45 AM Horatiu Vultur
> > > > <horatiu.vultur@...rochip.com> wrote:
> > > > >
> > > > > The 05/15/2023 14:30, Eric Dumazet wrote:
> > > > > >
> > > > > > On Mon, May 15, 2023 at 11:12 AM Horatiu Vultur
> > > > > > <horatiu.vultur@...rochip.com> wrote:
> > > > >
> > > > > Hi Eric,
> > > > >
> > > > > Thanks for looking at this.
> > > > >
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have noticed that on the HEAD of net-next[0] there is a performance drop
> > > > > > > for lan966x when extracting frames towards the CPU. Lan966x has a Cortex
> > > > > > > A7 CPU. All the tests are done using an iperf3 command like this:
> > > > > > > 'iperf3 -c 10.97.10.1 -R'
> > > > > > >
> > > > > > > So on net-next, I can see the following:
> > > > > > > [ 5] 0.00-10.01 sec 473 MBytes 396 Mbits/sec 456 sender
> > > > > > > And it gets around 97000 interrupts.
> > > > > > >
> > > > > > > While going back to commit [1], I can see the following:
> > > > > > > [ 5] 0.00-10.02 sec 632 MBytes 529 Mbits/sec 11 sender
> > > > > > > And it gets around 1000 interrupts.
> > > > > > >
> > > > > > > I have done a little bit of searching and I have noticed that
> > > > > > > commit [2] introduced the regression.
> > > > > > > I have tried to revert this commit on net-next and tested again; then I
> > > > > > > can see much better results, but not exactly the same:
> > > > > > > [ 5] 0.00-10.01 sec 616 MBytes 516 Mbits/sec 0 sender
> > > > > > > And it gets around 700 interrupts.
> > > > > > >
> > > > > > > So my question is: was I supposed to change something in the lan966x driver,
> > > > > > > or is there a bug in the lan966x driver that popped up because of this change?
> > > > > > >
> > > > > > > Any advice would be great. Thanks!
> > > > > > >
> > > > > > > [0] befcc1fce564 ("sfc: fix use-after-free in efx_tc_flower_record_encap_match()")
> > > > > > > [1] d4671cb96fa3 ("Merge branch 'lan966x-tx-rx-improve'")
> > > > > > > [2] 8b43fd3d1d7d ("net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQ")
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Hmmm... thanks for the report.
> > > > > >
> > > > > > This seems related to softirq (k)scheduling.
> > > > > >
> > > > > > Have you tried to apply this recent commit ?
> > > > > >
> > > > > > Commit-ID: d15121be7485655129101f3960ae6add40204463
> > > > > > Gitweb: https://git.kernel.org/tip/d15121be7485655129101f3960ae6add40204463
> > > > > > Author: Paolo Abeni <pabeni@...hat.com>
> > > > > > AuthorDate: Mon, 08 May 2023 08:17:44 +02:00
> > > > > > Committer: Thomas Gleixner <tglx@...utronix.de>
> > > > > > CommitterDate: Tue, 09 May 2023 21:50:27 +02:00
> > > > > >
> > > > > > Revert "softirq: Let ksoftirqd do its job"
> > > > >
> > > > > I have tried to apply this patch but the results are the same:
> > > > > [ 5] 0.00-10.01 sec 478 MBytes 400 Mbits/sec 188 sender
> > > > > And it gets a slightly higher number of interrupts, ~11000.
> > > > >
> > > > > >
> > > > > >
> > > > > > An alternative would be to try this:
> > > > > >
> > > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > > index b3c13e0419356b943e90b1f46dd7e035c6ec1a9c..f570a3ca00e7aa0e605178715f90bae17b86f071 100644
> > > > > > --- a/net/core/dev.c
> > > > > > +++ b/net/core/dev.c
> > > > > > @@ -6713,8 +6713,8 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
> > > > > > list_splice(&list, &sd->poll_list);
> > > > > > if (!list_empty(&sd->poll_list))
> > > > > > __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> > > > > > - else
> > > > > > - sd->in_net_rx_action = false;
> > > > > > +
> > > > > > + sd->in_net_rx_action = false;
> > > > > >
> > > > > > net_rps_action_and_irq_enable(sd);
> > > > > > end:;
> > > > >
> > > > > I have also tried this change, with and without the previous patch,
> > > > > but the result is the same:
> > > > > [ 5] 0.00-10.01 sec 478 MBytes 401 Mbits/sec 256 sender
> > > > > And it is the same number of interrupts.
> > > > >
> > > > > Is there something else that I should try?
> > > >
> > > > A high number of interrupts for a saturated receiver seems wrong.
> > > > (Unless it is not saturating the CPU?)
> > >
> > > The CPU usage seems to be almost at 100%. This is the output of the top
> > > command:
> > > 149 132 root R 5032 0% 96% iperf3 -c 10.97.10.1 -R
> > > 12 2 root SW 0 0% 3% [ksoftirqd/0]
> > > 150 132 root R 2652 0% 1% top
> > > ...
> >
> > Sorry for the dumb question: is the above with fdma == false? (that is,
> > no NAPI?) Why can't lan966x_xtr_irq_handler() be converted to the NAPI
> > model regardless of fdma?!?
>
> No, this is with fdma == true, where we use NAPI.
>
> Would there be any advantage to using NAPI for lan966x_xtr_irq_handler()?
Using NAPI you will avoid extra queuing and will gain GRO. Should make
quite a difference.
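The conversion would look roughly like this - only a sketch; the interrupt
register, the write helper and the napi field below are placeholders, not
the real lan966x names:

/* hard irq: only mask the extraction interrupt and kick NAPI */
static irqreturn_t lan966x_xtr_irq_handler(int irq, void *args)
{
	struct lan966x *lan966x = args;

	/* XTR_INTR_ENA / lan966x_wr() are placeholders for whatever
	 * register and accessor mask the extraction interrupt
	 */
	lan966x_wr(lan966x, XTR_INTR_ENA, 0);

	/* all the frame reading is deferred to the poll callback */
	napi_schedule(&lan966x->xtr_napi);

	return IRQ_HANDLED;
}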
> Because for lan966x_xtr_irq_handler() we will still need to read each
> word of the frame, which I think will be a big drawback compared with
> lan966x_fdma_irq_handler().
I guess/hope all the lan966x_rx_frame_word() work could be moved into
the NAPI poll callback.
In any case the fdma == false code path will likely be quite a bit slower
than the fdma == true path - and hopefully faster than the current
code.
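Something like this for the poll side (again just a sketch:
lan966x_xtr_read_frame() stands in for the existing per-word extraction
based on lan966x_rx_frame_word(), and the interrupt-enable accessor is
made up):

static int lan966x_xtr_napi_poll(struct napi_struct *napi, int budget)
{
	struct lan966x *lan966x = container_of(napi, struct lan966x, xtr_napi);
	int work_done = 0;

	while (work_done < budget) {
		struct sk_buff *skb;

		/* placeholder for the per-word extraction loop moved
		 * out of hard irq context
		 */
		skb = lan966x_xtr_read_frame(lan966x);
		if (!skb)
			break;

		napi_gro_receive(napi, skb);
		work_done++;
	}

	/* stop polling and unmask the irq only once the queue is drained */
	if (work_done < budget && napi_complete_done(napi, work_done))
		lan966x_wr(lan966x, XTR_INTR_ENA, 1);

	return work_done;
}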
> Or did I misunderstand the question?
I think you didn't ;)
Cheers,
Paolo