netdev - Re: tc filter insertion rate degradation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190124172126.GA66944@dennisz-mbp.dhcp.thefacebook.com>
Date:   Thu, 24 Jan 2019 12:21:26 -0500
From:   Dennis Zhou <dennis@...nel.org>
To:     Eric Dumazet <edumazet@...gle.com>,
        Vlad Buslov <vladbu@...lanox.com>
Cc:     Tejun Heo <tj@...nel.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Yevgeny Kliteynik <kliteyn@...lanox.com>,
        Yossef Efraim <yossefe@...lanox.com>,
        Maor Gottlieb <maorg@...lanox.com>
Subject: Re: tc filter insertion rate degradation

Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@...lanox.com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples1: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad

Would you mind sharing a little more information with me:
1) output before and after a run of /sys/kernel/debug/percpu_stats
2) a full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(), but it
might be due to constantly breaking the hint as an immediate guess.

> 
> Hi Vlad
> 
> I guess this is more a question for per-cpu allocator experts / maintainers ?
> 
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
> 

The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, the overhead of forced alignment would be both in percpu
memory itself and in allocation time due to the stricter requirement.

Thanks,
Dennis