[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5ac33795-9402-43e6-9595-d6c07f3250bc@redhat.com>
Date: Fri, 24 Sep 2021 10:13:42 +0200
From: Denys Vlasenko <dvlasenk@...hat.com>
To: Feng Tang <feng.tang@...el.com>,
Josh Poimboeuf <jpoimboe@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, H Peter Anvin <hpa@...or.com>,
Borislav Petkov <bp@...en8.de>,
Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
linux-kernel@...r.kernel.org, Dave Hansen <dave.hansen@...el.com>,
Tony Luck <tony.luck@...el.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andy Lutomirski <luto@...nel.org>
Subject: Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data
sections aligned
On 9/23/21 4:57 PM, Feng Tang wrote:
> On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
>> Hi Feng,
>>
>> Thanks for the interesting LPC presentation about alignment-related
>> performance issues (which mentioned this patch).
>>
>> https://linuxplumbersconf.org/event/11/contributions/895/
>>
>> I wonder if we can look at enabling some kind of data section alignment
>> unconditionally instead of just making it a debug option. Have you done
>> any performance and binary size comparisons?
>
> Thanks for reviewing this!
>
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
>
> v5.14
> ------------------------
> text data bss dec hex filename
> 16010221 14971391 6098944 37080556 235cdec vmlinux
>
> v5.14 + 64B-function-align
> --------------------------
> text data bss dec hex filename
> 18107373 14971391 6098944 39177708 255cdec vmlinux
>
> v5.14 + data-align(THREAD_SIZE 16KB)
> --------------------------
> text data bss dec hex filename
> 16010221 57001791 6008832 79020844 4b5c32c vmlinux
>
> So for the text-align, we see 13.1% increase for text. And for data-align,
> there is 280.8% increase for data.
Page-size alignment of all data is WAY too much. At most, alignment
to cache line size should work to make timings stable.
(In your case with "adjacent cache line prefetcher",
it may need to be 128 bytes. But definitely not 4096 bytes).
> Performance wise, I have done some test with the force-32bytes-text-align
> option before (v5.8 time), for benchmark will-it-scale, fsmark, hackbench,
> netperf and kbuild:
> * no obvious change for will-it-scale/fsmark/kbuild
> * see both regression/improvement for different hackbench case
> * see both regression/improvement for netperf, from -20% to +98%
What usually happens here is that testcases are crafted to measure
how well some workloads scale, and to measure that efficiently,
testcases were intentionally written to cause congestion -
this way, benefits of better algorithms are easily seen.
However, this also means that in the congested scenario (e.g.
cache bouncing), small changes in CPU architecture are also
easily visible - including cases where optimizations are going awry.
In your presentation, you stumbled upon one such case:
the "adjacent cache line prefetcher" is counter-productive here,
it pulls unrelated cache into the CPU, not knowing that
this is in fact harmful - other CPUs will need this cache line,
not this one!
Since this particular case was a change in structure layout,
increasing alignment of .data sections won't help here.
My opinion is that we shouldn't worry about this too much.
Diagnose the observed slow downs, if they are "real"
(there is a way to improve), fix that, else if they are spurious,
just let them be.
Even when some CPU optimizations are unintentionally hurting some
benchmarks, on the average they are usually a win:
CPU makers have hundreds of people looking at that as their
full-time jobs. With your example of "adjacent cache line prefetcher",
CPU people might be looking at ways to detect when these
speculatively pulled-in cache lines are bouncing.
> For data-alignment, it has huge impact for the size, and occupies more
> cache/TLB, plus it hurts some normal function like dynamic-debug. So
> I'm afraid it can only be used as a debug option.
>
>> On a similar vein I think we should re-explore permanently enabling
>> cacheline-sized function alignment i.e. making something like
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default. Ingo did some
>> research on that a while back:
>>
>> https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
>
> Thanks for sharing this, from which I learned a lot, and I hope I
> knew this thread when we first check strange regressions in 2019 :)
>
>> At the time, the main reported drawback of -falign-functions=64 was that
>> even small functions got aligned. But now I think that can be mitigated
>> with some new options like -flimit-function-alignment and/or
>> -falign-functions=64,X (for some carefully-chosen value of X).
-falign-functions=64,7 should be about right, I guess.
http://lkml.iu.edu/hypermail/linux/kernel/1505.2/03292.html
"""
defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
65 48 c7 05 44 3c 00 7f 00 00 00 00
movq $0x0,%gs:0x7f003c44(%rip)
Thus ensuring that at least seven first bytes do not cross
64-byte boundary would cover >98% of all functions.
"""
Powered by blists - more mailing lists