Message-ID: <20190124172126.GA66944@dennisz-mbp.dhcp.thefacebook.com>
Date: Thu, 24 Jan 2019 12:21:26 -0500
From: Dennis Zhou <dennis@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>,
Vlad Buslov <vladbu@...lanox.com>
Cc: Tejun Heo <tj@...nel.org>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
Yossef Efraim <yossefe@...lanox.com>,
Maor Gottlieb <maorg@...lanox.com>
Subject: Re: tc filter insertion rate degradation

Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@...lanox.com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> > Children Self Co Shared Object Symbol
> > + 21.19% 3.38% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 3.45% 0.25% tc [kernel.vmlinux] [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> > Children Self Co Shared Object Symbol
> > + 44.67% 3.99% tc [kernel.vmlinux] [k] pcpu_alloc
> > + 19.25% 0.22% tc [kernel.vmlinux] [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad

Would you mind sharing a little more information with me:
1) the output of /sys/kernel/debug/percpu_stats before and after a run
2) a full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(). My
immediate guess is that we're constantly breaking the hint.

>
> Hi Vlad
>
> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
>

The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, forced alignment adds overhead both in percpu memory
itself and in allocation time, because of the stricter requirement.
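
To make the memory side of that overhead concrete, here is a rough
sketch using a toy bump allocator. This is not how the percpu allocator
works internally; the function names and the interleaved workload of
16-byte and 8-byte objects are assumptions chosen purely for
illustration. It only shows how a stricter alignment can force padding
between neighbouring objects, and it does not model allocation time.

```python
def align_up(offset, align):
    """Round offset up to the next multiple of align (a power of two)."""
    return (offset + align - 1) & ~(align - 1)


def bump_alloc(requests):
    """Place (size, align) requests back to back in one region.

    Returns (end_offset, padding_bytes), where padding_bytes counts the
    bytes skipped only to satisfy alignment.
    """
    offset = 0
    padding = 0
    for size, align in requests:
        start = align_up(offset, align)
        padding += start - offset
        offset = start + size
    return offset, padding


# Hypothetical workload: 16-byte objects interleaved with 8-byte objects.
relaxed = [(16, 8), (8, 8)] * 4    # 16-byte objects, 8-byte alignment
strict = [(16, 16), (8, 8)] * 4    # same objects, 16-byte alignment

print("8-byte alignment :", bump_alloc(relaxed))
print("16-byte alignment:", bump_alloc(strict))
```

In this toy case the relaxed layout ends at offset 96 with no padding,
while the strict layout ends at offset 120 with 24 bytes of padding. A
real allocator that searches for a suitably aligned free area also has
to do more work per allocation, which this sketch leaves out.
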
Thanks,
Dennis