Message-ID: <d35a2039-1755-b0be-6733-bb7ec19b2ea8@csgroup.eu>
Date: Thu, 1 Sep 2022 16:46:40 +0000
From: Christophe Leroy <christophe.leroy@...roup.eu>
To: Ard Biesheuvel <ardb@...nel.org>
CC: Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
"sv@...ux.ibm.com" <sv@...ux.ibm.com>,
"agust@...x.de" <agust@...x.de>,
Josh Poimboeuf <jpoimboe@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Jason Baron <jbaron@...mai.com>,
"Steven Rostedt (VMware)" <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)"
<linuxppc-dev@...ts.ozlabs.org>, X86 ML <x86@...nel.org>,
Chen Zhongjin <chenzhongjin@...wei.com>
Subject: Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
On 09/07/2022 at 08:52, Ard Biesheuvel wrote:
> Hello Christophe,
>
> On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
> <christophe.leroy@...roup.eu> wrote:
>>
>> This series applies on top of the series v3 "objtool: Enable and
>> implement --mcount option on powerpc" [1] rebased on powerpc-next branch
>>
>> A few modifications are done to core parts to enable powerpc
>> implementation:
>> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
>> redefined as R_PPC_REL32.
>> - A call to static_call_init() is added to start_kernel() so that
>> each architecture does not have to call it itself
>> - Trampoline address is provided to arch_static_call_transform() even
>> when setting a site to fallback on a call to the trampoline when the
>> target is too far.
>>
>> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>>
>> Christophe Leroy (7):
>> powerpc: Add missing asm/asm.h for objtool
>> objtool/powerpc: Activate objtool on PPC32
>> objtool: Add architecture specific R_REL32 macro
>> objtool/powerpc: Add necessary support for inline static calls
>> init: Call static_call_init() from start_kernel()
>> static_call_inline: Provide trampoline address when updating sites
>> powerpc/static_call: Implement inline static calls
>>
>
> Could you quantify the performance gains of moving from out-of-line,
> patched tail-call branch instructions to full-fledged inline static
> calls? On x86, the retpoline problem makes this glaringly obvious, but
> on other architectures, the complexity of supporting this model may
> outweigh the performance advantages.
Surprisingly, I get worse performance with inline static calls than with
out-of-line static calls:
No static call:
root@...ip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.186
Running with 1*40 (== 40) tasks.
Time: 16.738
Running with 1*40 (== 40) tasks.
Time: 16.579
Running with 1*40 (== 40) tasks.
Time: 16.838
Running with 1*40 (== 40) tasks.
Time: 16.652
Running with 1*40 (== 40) tasks.
Time: 17.380
Running with 1*40 (== 40) tasks.
Time: 16.630
Running with 1*40 (== 40) tasks.
Time: 16.850
Running with 1*40 (== 40) tasks.
Time: 17.161
Running with 1*40 (== 40) tasks.
Time: 16.722
Performance counter stats for './hackbench 1' (10 runs):

      17019.55 msec task-clock        #    0.980 CPUs utilized   ( +- 0.51% )
          4847      context-switches  #  282.280 /sec            ( +- 6.32% )
             0      cpu-migrations    #    0.000 /sec
          1249      page-faults       #   72.739 /sec            ( +- 0.49% )
    2245344976      cycles            #    0.131 GHz             ( +- 0.51% )
     727437072      instructions      #    0.32  insn per cycle  ( +- 0.40% )
   <not supported>  branches
   <not supported>  branch-misses

       17.3585 +- 0.0909 seconds time elapsed  ( +- 0.52% )
Out-of-line static call:
root@...ip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 15.892
Running with 1*40 (== 40) tasks.
Time: 15.731
Running with 1*40 (== 40) tasks.
Time: 15.507
Running with 1*40 (== 40) tasks.
Time: 16.269
Running with 1*40 (== 40) tasks.
Time: 15.934
Running with 1*40 (== 40) tasks.
Time: 16.048
Running with 1*40 (== 40) tasks.
Time: 15.700
Running with 1*40 (== 40) tasks.
Time: 16.063
Running with 1*40 (== 40) tasks.
Time: 15.852
Running with 1*40 (== 40) tasks.
Time: 15.941
Performance counter stats for './hackbench 1' (10 runs):

      16227.32 msec task-clock        #    0.992 CPUs utilized   ( +- 0.42% )
          3732      context-switches  #  230.525 /sec            ( +- 6.42% )
             0      cpu-migrations    #    0.000 /sec
          1244      page-faults       #   76.842 /sec            ( +- 0.11% )
    2141094288      cycles            #    0.132 GHz             ( +- 0.42% )
     712598441      instructions      #    0.33  insn per cycle  ( +- 0.29% )
   <not supported>  branches
   <not supported>  branch-misses

       16.3539 +- 0.0675 seconds time elapsed  ( +- 0.41% )
Inline static call:
root@...ip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.512
Running with 1*40 (== 40) tasks.
Time: 17.240
Running with 1*40 (== 40) tasks.
Time: 16.901
Running with 1*40 (== 40) tasks.
Time: 17.125
Running with 1*40 (== 40) tasks.
Time: 17.262
Running with 1*40 (== 40) tasks.
Time: 17.298
Running with 1*40 (== 40) tasks.
Time: 17.182
Running with 1*40 (== 40) tasks.
Time: 16.988
Running with 1*40 (== 40) tasks.
Time: 17.102
Running with 1*40 (== 40) tasks.
Time: 16.669
Performance counter stats for './hackbench 1' (10 runs):

      16976.76 msec task-clock        #    0.964 CPUs utilized   ( +- 0.44% )
          4760      context-switches  #  273.007 /sec            ( +- 4.93% )
             0      cpu-migrations    #    0.000 /sec
          1252      page-faults       #   71.808 /sec            ( +- 0.35% )
    2239986112      cycles            #    0.128 GHz             ( +- 0.44% )
     721540184      instructions      #    0.31  insn per cycle  ( +- 0.31% )
   <not supported>  branches
   <not supported>  branch-misses

       17.6126 +- 0.0762 seconds time elapsed  ( +- 0.43% )
Summary:
No static calls:
17.3585 +- 0.0909 seconds time elapsed ( +- 0.52% )
Out-of-line static calls:
16.3539 +- 0.0675 seconds time elapsed ( +- 0.41% )
Inline static calls:
17.6126 +- 0.0762 seconds time elapsed ( +- 0.43% )
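
To put the summary in relative terms, here is a minimal sketch that
computes the change in mean elapsed time against the "no static calls"
baseline. The dictionary keys and the helper name are illustrative only;
the numbers are the mean elapsed times reported by perf stat above.

```python
# Mean elapsed times in seconds, copied from the summary above.
elapsed = {
    "none": 17.3585,
    "out_of_line": 16.3539,
    "inline": 17.6126,
}


def relative_change(baseline, value):
    """Return the change of value relative to baseline, in percent.

    A negative result means the run was faster than the baseline.
    """
    return (value - baseline) / baseline * 100.0


for name in ("out_of_line", "inline"):
    delta = relative_change(elapsed["none"], elapsed[name])
    print(f"{name}: {delta:+.2f}% vs. no static calls")
```

By this calculation, out-of-line static calls are about 5.8% faster than
the baseline, while inline static calls are about 1.5% slower.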
Is there anything wrong with inline static calls?
Christophe