Message-ID: <CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com>
Date: Tue, 22 Jan 2019 09:33:10 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Vlad Buslov <vladbu@...lanox.com>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
Yevgeny Kliteynik <kliteyn@...lanox.com>,
Yossef Efraim <yossefe@...lanox.com>,
Maor Gottlieb <maorg@...lanox.com>
Subject: Re: tc filter insertion rate degradation
On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@...lanox.com> wrote:
>
> Hi Eric,
>
> I've been investigating a significant tc filter insertion rate
> degradation, and it seems to be caused by your commit 001c96db0181
> ("net: align gnet_stats_basic_cpu struct"). With this commit, the
> insertion rate drops from ~65k rules/sec to ~43k rules/sec when
> inserting 1M rules from a file in tc batch mode on my machine.
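>
> For reference, the change essentially just forces 16-byte alignment on
> the per-cpu stats struct (my paraphrase of the commit, not a verbatim
> copy of the tree):
>
> struct gnet_stats_basic_cpu {
>         struct gnet_stats_basic_packed bstats;
>         struct u64_stats_sync syncp;
> } __aligned(2 * sizeof(u64));   /* i.e. 16-byte alignment */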
>
> A perf profile of tc indicates that the pcpu allocator now consumes 2x
> the CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Command  Shared Object     Symbol
> +   21.19%     3.38%  tc       [kernel.vmlinux]  [k] pcpu_alloc
> +    3.45%     0.25%  tc       [kernel.vmlinux]  [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Command  Shared Object     Symbol
> +   44.67%     3.99%  tc       [kernel.vmlinux]  [k] pcpu_alloc
> +   19.25%     0.22%  tc       [kernel.vmlinux]  [k] pcpu_alloc_area
>
> It seems that it takes much more work for the pcpu allocator to
> perform an allocation with the new, stricter alignment requirement.
> Not sure if this is expected behavior or not in this case.
>
> Regards,
> Vlad
Hi Vlad
I guess this is more a question for the per-cpu allocator experts /
maintainers?

16-byte alignment for 16-byte objects sounds quite reasonable. [1]

It also means that if your workload is mostly setting up / dismantling
tc filters, rather than really using them, you might go back to atomics
instead of expensive per-cpu storage.
(I.e. optimize the control path instead of the data path.)
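
Roughly, the tradeoff looks like this (just a sketch with made-up
variables, not the actual tc code):

/* Per-cpu storage: cheap, uncontended counter updates in the data
 * path, but every filter insert pays a pcpu_alloc() in the control
 * path.  netdev_alloc_pcpu_stats() also initializes each syncp.
 */
struct gnet_stats_basic_cpu __percpu *cpu_bstats =
        netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);

/* Atomics: nothing to allocate at insert time, at the price of
 * cache-line bouncing when many CPUs update the same counter.
 */
atomic64_t bytes = ATOMIC64_INIT(0);
atomic64_add(skb->len, &bytes);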
Thanks!
[1] We might even make this generic, as in:
diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	 */
 	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
 		align = PCPU_MIN_ALLOC_SIZE;
-
+	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+		if (size % (align << 1))
+			break;
+		align <<= 1;
+	}
 	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
 	bits = size >> PCPU_MIN_ALLOC_SHIFT;
 	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
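
To see what that loop does, here is a standalone user-space sketch (my
own toy code, assuming L1_CACHE_BYTES is 64, as on x86-64):

#include <stdio.h>
#include <stddef.h>

#define L1_CACHE_BYTES 64

/* Promote @align as long as it stays below the cache line size and
 * the doubled alignment still divides @size evenly.
 */
static size_t promote_align(size_t size, size_t align)
{
        while (align < L1_CACHE_BYTES && (align << 1) <= size) {
                if (size % (align << 1))
                        break;
                align <<= 1;
        }
        return align;
}

int main(void)
{
        /* 16-byte object with the 8-byte minimum -> promoted to 16 */
        printf("%zu\n", promote_align(16, 8));
        /* 24-byte object stays at 8, since 16 does not divide 24 */
        printf("%zu\n", promote_align(24, 8));
        return 0;
}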