Message-ID: <CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com>
Date: Tue, 22 Jan 2019 09:33:10 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Vlad Buslov <vladbu@...lanox.com>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
Yossef Efraim <yossefe@...lanox.com>,
Maor Gottlieb <maorg@...lanox.com>
Subject: Re: tc filter insertion rate degradation
On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>
> Hi Eric,
>
> I've been investigating a significant tc filter insertion rate
> degradation, and it seems to be caused by your commit 001c96db0181
> ("net: align gnet_stats_basic_cpu struct"). With this commit, the
> insertion rate drops from ~65k rules/sec to ~43k rules/sec when
> inserting 1M rules from a file in tc batch mode on my machine.
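>
> For reference, the change essentially just forces 16-byte alignment on
> the per-cpu stats struct (my paraphrase of the commit, not a verbatim
> copy of the tree):
>
> struct gnet_stats_basic_cpu {
>         struct gnet_stats_basic_packed bstats;
>         struct u64_stats_sync syncp;
> } __aligned(2 * sizeof(u64));   /* i.e. 16-byte alignment */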
>
> A perf profile of tc indicates that the pcpu allocator now consumes 2x
> the CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Command  Shared Object     Symbol
> +   21.19%     3.38%  tc       [kernel.vmlinux]  [k] pcpu_alloc
> +    3.45%     0.25%  tc       [kernel.vmlinux]  [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Command  Shared Object     Symbol
> +   44.67%     3.99%  tc       [kernel.vmlinux]  [k] pcpu_alloc
> +   19.25%     0.22%  tc       [kernel.vmlinux]  [k] pcpu_alloc_area
>
> It seems that it takes much more work for the pcpu allocator to
> perform an allocation with the new, stricter alignment requirement.
> Not sure if this is expected behavior or not in this case.
>
> Regards,
> Vlad
Hi Vlad
I guess this is more a question for the per-cpu allocator experts /
maintainers?

16-byte alignment for 16-byte objects sounds quite reasonable. [1]

It also means that if your workload is mostly setting up / dismantling
tc filters, rather than really using them, you might go back to atomics
instead of expensive per-cpu storage.
(I.e. optimize the control path instead of the data path.)
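
Roughly, the tradeoff looks like this (just a sketch with made-up
variables, not the actual tc code):

/* Per-cpu storage: cheap, uncontended counter updates in the data
 * path, but every filter insert pays a pcpu_alloc() in the control
 * path.  netdev_alloc_pcpu_stats() also initializes each syncp.
 */
struct gnet_stats_basic_cpu __percpu *cpu_bstats =
        netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);

/* Atomics: nothing to allocate at insert time, at the price of
 * cache-line bouncing when many CPUs update the same counter.
 */
atomic64_t bytes = ATOMIC64_INIT(0);
atomic64_add(skb->len, &bytes);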
Thanks!
[1] We might even make this generic, as in:
diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	 */
 	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
 		align = PCPU_MIN_ALLOC_SIZE;
-
+	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+		if (size % (align << 1))
+			break;
+		align <<= 1;
+	}
 	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
 	bits = size >> PCPU_MIN_ALLOC_SHIFT;
 	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
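
To see what that loop does, here is a standalone user-space sketch (my
own toy code, assuming L1_CACHE_BYTES is 64, as on x86-64):

#include <stdio.h>
#include <stddef.h>

#define L1_CACHE_BYTES 64

/* Promote @align as long as it stays below the cache line size and
 * the doubled alignment still divides @size evenly.
 */
static size_t promote_align(size_t size, size_t align)
{
        while (align < L1_CACHE_BYTES && (align << 1) <= size) {
                if (size % (align << 1))
                        break;
                align <<= 1;
        }
        return align;
}

int main(void)
{
        /* 16-byte object with the 8-byte minimum -> promoted to 16 */
        printf("%zu\n", promote_align(16, 8));
        /* 24-byte object stays at 8, since 16 does not divide 24 */
        printf("%zu\n", promote_align(24, 8));
        return 0;
}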