Message-ID: <vbfzhrj9smb.fsf@mellanox.com>
Date:   Tue, 29 Jan 2019 19:22:10 +0000
From:   Vlad Buslov <vladbu@...lanox.com>
To:     Dennis Zhou <dennis@...nel.org>
CC:     Eric Dumazet <edumazet@...gle.com>, Tejun Heo <tj@...nel.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Yevgeny Kliteynik <kliteyn@...lanox.com>,
        Yossef Efraim <yossefe@...lanox.com>,
        Maor Gottlieb <maorg@...lanox.com>
Subject: Re: tc filter insertion rate degradation


On Thu 24 Jan 2019 at 17:21, Dennis Zhou <dennis@...nel.org> wrote:
> Hi Vlad and Eric,
>
> On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
>> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>> >
>> > Hi Eric,
>> >
>> > I've been investigating significant tc filter insertion rate degradation
>> > and it seems it is caused by your commit 001c96db0181 ("net: align
>> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
>> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
>> > from file in tc batch mode on my machine.
>> >
>> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>> >
>> > 1) Before:
>> >
>> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > 2) After:
>> >
>> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > It seems that it takes much more work for pcpu allocator to perform
>> > allocation with new stricter alignment requirements. Not sure if it is
>> > expected behavior or not in this case.
>> >
>> > Regards,
>> > Vlad
>
> Would you mind sharing a little more information with me:
> 1) output before and after a run of /sys/kernel/debug/percpu_stats

Hi Dennis,

Some of these files are quite large, so I put them on Dropbox.

Output before:

Percpu Memory Statistics
Allocation Info:
----------------------------------------
  unit_size           :       262144
  static_size         :       139160
  reserved_size       :         8192
  dyn_size            :        28776
  atom_size           :      2097152
  alloc_size          :      2097152

Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752
  nr_cur_alloc        :         2591
  nr_max_alloc        :         2598
  nr_chunks           :            3
  nr_max_chunks       :            3
  min_alloc_size      :            4
  max_alloc_size      :         8208
  empty_pop_pages     :            3

Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  max_alloc_size      :          320
  empty_pop_pages     :            0
  first_bit           :         1002
  free_bytes          :         7448
  contig_bytes        :         7424
  sum_frag            :           24
  max_frag            :           24
  cur_min_alloc       :           16
  cur_med_alloc       :           64
  cur_max_alloc       :          320

Chunk: <- First Chunk
  nr_alloc            :          479
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :         8192
  free_bytes          :            0
  contig_bytes        :            0
  sum_frag            :            0
  max_frag            :            0
  cur_min_alloc       :            4
  cur_med_alloc       :           24
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :         1925
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :        63102
  free_bytes          :          852
  contig_bytes        :           12
  sum_frag            :          852
  max_frag            :           12
  cur_min_alloc       :            4
  cur_med_alloc       :            8
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :          182
  max_alloc_size      :          936
  empty_pop_pages     :            3
  first_bit           :           21
  free_bytes          :       256452
  contig_bytes        :       255120
  sum_frag            :         1332
  max_frag            :          368
  cur_min_alloc       :            8
  cur_med_alloc       :           20
  cur_max_alloc       :          320


After: https://www.dropbox.com/s/unyzhx4vgo2x30e/stats_after?dl=0
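
For comparing the two dumps side by side, a small parser along these lines may help. It is only a sketch: it assumes the layout shown above (section titles ending in a colon, "name : integer" fields, "Chunk:" headers), and the helper names parse_percpu_stats and diff_fields are made up for illustration, not part of any kernel tooling.

```python
import re

def parse_percpu_stats(text):
    """Parse percpu_stats text into (sections, chunks).

    sections maps a title such as "Global Stats" to {field: int};
    chunks holds one dict per "Chunk:" block, in file order.
    """
    sections = {}
    chunks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed separators
        if line.startswith("Chunk:"):
            # "Chunk: <- Reserved Chunk" carries a label, plain "Chunk:" does not
            label = line[len("Chunk:"):].lstrip("<- ").strip()
            current = {"label": label or "dynamic"}
            chunks.append(current)
            continue
        m = re.match(r"(\w+)\s*:\s*(\d+)$", line)
        if m and current is not None:
            current[m.group(1)] = int(m.group(2))
        elif line.endswith(":"):
            current = sections.setdefault(line[:-1], {})
    return sections, chunks

def diff_fields(before, after):
    """Return {field: after - before} for fields whose value changed."""
    return {k: after[k] - before[k]
            for k in before if k in after and after[k] != before[k]}

# Shortened excerpt of the "before" output, used only as demo input.
SAMPLE = """\
Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752

Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  free_bytes          :         7448
  contig_bytes        :         7424

Chunk:
  nr_alloc            :         1925
  free_bytes          :          852
  contig_bytes        :           12
"""

sections, chunks = parse_percpu_stats(SAMPLE)
for c in chunks:
    print(c["label"], c["free_bytes"], c["contig_bytes"])
```

Running parse_percpu_stats on both saved files and passing the two "Global Stats" dicts to diff_fields gives a quick view of which counters moved between the runs.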

> 2) a full perf output

https://www.dropbox.com/s/isfcxca3npn5slx/perf.data?dl=0

> 3) a reproducer

$ sudo tc -b add.0

Example batch file: https://www.dropbox.com/s/ey7cbl5nwu5p0tg/add.0?dl=0
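
For anyone who cannot fetch the file, a batch file of similar size could be generated with something like the sketch below. The contents of add.0 are not reproduced in this message, so everything here is an assumption: the flower rule shape, the MAC-based match keys, the drop action, the device name "eth0" and the output name "add.gen" are all placeholders chosen for illustration.

```python
def mac_from_int(prefix, n):
    """Format a MAC address from 2 fixed prefix bytes plus 4 bytes of n."""
    tail = n.to_bytes(4, "big")
    return ":".join(f"{b:02x}" for b in bytes(prefix) + tail)

def batch_lines(count, dev="eth0"):
    """Yield tc batch lines (no leading 'tc'), one unique rule per line.

    The rule layout is hypothetical and may differ from the real add.0.
    """
    for i in range(count):
        src = mac_from_int((0xe4, 0x11), i)
        dst = mac_from_int((0xe4, 0x12), i)
        yield (f"filter add dev {dev} parent ffff: protocol ip prio 1 "
               f"flower src_mac {src} dst_mac {dst} action drop")

def write_batch(path, count, dev="eth0"):
    """Write `count` rules to `path`, one per line."""
    with open(path, "w") as f:
        for line in batch_lines(count, dev):
            f.write(line + "\n")

if __name__ == "__main__":
    # 1m rules, matching the scale mentioned earlier in the thread.
    write_batch("add.gen", 1_000_000)
```

Rules with parent ffff: need an ingress qdisc on the device first (for example "tc qdisc add dev eth0 ingress"); after that the generated file can be loaded the same way, with "sudo tc -b add.gen".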

Thanks,
Vlad
