linux-kernel - Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

Message-ID: <5ac33795-9402-43e6-9595-d6c07f3250bc@redhat.com>
Date:   Fri, 24 Sep 2021 10:13:42 +0200
From:   Denys Vlasenko <dvlasenk@...hat.com>
To:     Feng Tang <feng.tang@...el.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, H Peter Anvin <hpa@...or.com>,
        Borislav Petkov <bp@...en8.de>,
        Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
        linux-kernel@...r.kernel.org, Dave Hansen <dave.hansen@...el.com>,
        Tony Luck <tony.luck@...el.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andy Lutomirski <luto@...nel.org>
Subject: Re: [RFC PATCH] x86, vmlinux.lds: Add debug option to force all data
 sections aligned

On 9/23/21 4:57 PM, Feng Tang wrote:
> On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
>> Hi Feng,
>>
>> Thanks for the interesting LPC presentation about alignment-related
>> performance issues (which mentioned this patch).
>>   
>>    https://linuxplumbersconf.org/event/11/contributions/895/
>>
>> I wonder if we can look at enabling some kind of data section alignment
>> unconditionally instead of just making it a debug option.  Have you done
>> any performance and binary size comparisons?
>   
> Thanks for reviewing this!
> 
> For binary size, I just tested 5.14 kernel with a default desktop
> config from Ubuntu (I didn't use the normal rhel-8.3 config used
> by 0Day, which is more for server):
> 
> v5.14
> ------------------------
> text		data		bss	    dec		hex	filename
> 16010221	14971391	6098944	37080556	235cdec	vmlinux
> 
> v5.14 + 64B-function-align
> --------------------------
> text		data		bss	    dec		hex	filename
> 18107373	14971391	6098944	39177708	255cdec	vmlinux
> 
> v5.14 + data-align(THREAD_SIZE 16KB)
> --------------------------
> text		data		bss	    dec		hex	filename
> 16010221	57001791	6008832	79020844	4b5c32c	vmlinux
> 
> So for the text-align, we see a 13.1% increase in text. And for data-align,
> there is a 280.7% increase in data.

Page-size alignment of all data is WAY too much. At most, alignment
to cache line size should work to make timings stable.
(In your case with "adjacent cache line prefetcher",
it may need to be 128 bytes. But definitely not 4096 bytes).
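
To put rough numbers on that, here is a minimal sketch in C of how padding grows with the alignment. It only models "round every object up to the alignment"; the object sizes in the usage note are made up for illustration, and the function names are mine, not anything from the patch or the linker script.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/*
 * Total bytes occupied when each of the n objects is forced to start
 * on an align-byte boundary. Hypothetical model, not the real linker logic.
 */
static size_t aligned_total(const size_t *sizes, size_t n, size_t align)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += round_up(sizes[i], align);
    return total;
}
```

For four hypothetical objects of 8, 24, 100 and 200 bytes (332 bytes of payload), this gives 512 bytes at 64-byte alignment, 640 bytes at 128-byte alignment, and 16384 bytes at 4096-byte alignment.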


> Performance wise, I have done some tests with the force-32bytes-text-align
> option before (v5.8 time), for the benchmarks will-it-scale, fsmark,
> hackbench, netperf and kbuild:
> * no obvious change for will-it-scale/fsmark/kbuild
> * both regressions and improvements for different hackbench cases
> * both regressions and improvements for netperf, from -20% to +98%

What usually happens here is that testcases are crafted to measure
how well some workloads scale, and to measure that efficiently,
they are intentionally written to cause congestion -
this way, the benefits of better algorithms are easily seen.

However, this also means that in the congested scenario (e.g.
cache bouncing), small changes in CPU architecture are also
easily visible - including cases where optimizations are going awry.

In your presentation, you stumbled upon one such case:
the "adjacent cache line prefetcher" is counter-productive here,
it pulls the neighboring, unrelated cache line into the CPU, not
knowing that this is in fact harmful - other CPUs will need that
neighboring line, and this CPU does not!

Since this particular case was a change in structure layout,
increasing alignment of .data sections won't help here.
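
A minimal sketch of why this is a layout problem, assuming the prefetcher works on 128-byte pairs of 64-byte lines and that the object itself starts on a 128-byte boundary. The struct and field names are hypothetical, not taken from the kernel or from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_PAIR 128 /* assumed: two adjacent 64-byte cache lines */

/* Hypothetical layout: the two hot counters sit in different 64-byte
 * lines, but in the same 128-byte pair. */
struct stats_packed {
    uint64_t rx_count;  /* written by CPU A */
    char other[56];
    uint64_t tx_count;  /* written by CPU B, offset 64 */
};

/* Same fields, with the second counter pushed into the next pair. */
struct stats_split {
    uint64_t rx_count;
    char other[56];
    _Alignas(PREFETCH_PAIR) uint64_t tx_count;  /* offset 128 */
};

/* Index of the 128-byte pair that holds a given offset, relative to
 * an object that starts on a 128-byte boundary. */
static size_t pair_index(size_t offset)
{
    return offset / PREFETCH_PAIR;
}

static_assert(offsetof(struct stats_packed, tx_count) == 64,
              "packed: second counter is one cache line in");
static_assert(offsetof(struct stats_split, tx_count) == 128,
              "split: second counter starts the next 128-byte pair");
```

However the containing section is aligned, the distance between rx_count and tx_count inside stats_packed stays 64 bytes; only changing the structure itself (as in stats_split) moves them apart.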

My opinion is that we shouldn't worry about this too much.
Diagnose the observed slowdowns: if they are "real"
(there is a way to improve), fix them; if they are spurious,
just let them be.

Even when some CPU optimizations unintentionally hurt some
benchmarks, on average they are usually a win:
CPU makers have hundreds of people looking at that as their
full-time jobs. With your example of "adjacent cache line prefetcher",
CPU people might be looking at ways to detect when these
speculatively pulled-in cache lines are bouncing.


> For data-alignment, it has huge impact for the size, and occupies more
> cache/TLB, plus it hurts some normal function like dynamic-debug. So
> I'm afraid it can only be used as a debug option.
> 
>> In a similar vein, I think we should re-explore permanently enabling
>> cacheline-sized function alignment i.e. making something like
>> CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default.  Ingo did some
>> research on that a while back:
>>
>>     https://lkml.kernel.org/r/20150519213820.GA31688@gmail.com
> 
> Thanks for sharing this, from which I learned a lot. I wish I had
> known about this thread when we first checked strange regressions in 2019 :)
> 
>> At the time, the main reported drawback of -falign-functions=64 was that
>> even small functions got aligned.  But now I think that can be mitigated
>> with some new options like -flimit-function-alignment and/or
>> -falign-functions=64,X (for some carefully-chosen value of X).

-falign-functions=64,7 should be about right, I guess.
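
For reference, a minimal sketch of the rule I have in mind, assuming the semantics "align to N, but only if that needs at most M-1 bytes of padding". The helper names are mine; this models the decision only, not what any particular compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Padding inserted before a function that would otherwise start at addr:
 * pad up to the next n-byte boundary only if that costs at most m - 1 bytes.
 * n must be a power of two and 1 <= m <= n.
 */
static size_t padding_before(size_t addr, size_t n, size_t m)
{
    assert(n != 0 && (n & (n - 1)) == 0 && m >= 1 && m <= n);

    size_t to_boundary = (n - addr % n) % n; /* bytes to next boundary */

    return to_boundary <= m - 1 ? to_boundary : 0;
}

/* Nonzero if the first m bytes starting at addr straddle an n-byte boundary. */
static int first_bytes_cross(size_t addr, size_t n, size_t m)
{
    return addr / n != (addr + m - 1) / n;
}
```

With n = 64 and m = 7, a function that would start at offset 60 gets 4 bytes of padding, while one at offset 57 gets none, because its first 7 bytes (57..63) already fit in the line. Either way the first 7 bytes never cross a 64-byte boundary.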

http://lkml.iu.edu/hypermail/linux/kernel/1505.2/03292.html

"""
defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
   65 48 c7 05 44 3c 00 7f 00 00 00 00
   movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least seven first bytes do not cross
64-byte boundary would cover >98% of all functions.
"""
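
The >98% figure can be rechecked from the counts quoted above. This is a minimal sketch; the table is copied from the quoted mail, and the function name is mine.

```c
#include <assert.h>
#include <stddef.h>

#define TOTAL_FUNCS 42141u

/* Functions whose 1st insn is at least (index + 5) bytes long,
 * i.e. entries for lengths 5..12, copied from the quoted mail. */
static const unsigned at_least[] = {
    6923, 5841, 5095, 786, 548, 375, 73, 1
};

/*
 * Fraction of functions whose 1st insn fits entirely in the first m bytes,
 * i.e. all functions except those with a 1st insn of m + 1 bytes or more.
 * Valid for 4 <= m <= 12 (the range the table covers).
 */
static double covered_fraction(unsigned m)
{
    assert(m >= 4 && m <= 12);

    unsigned longer = (m == 12) ? 0 : at_least[m + 1 - 5];

    return (double)(TOTAL_FUNCS - longer) / (double)TOTAL_FUNCS;
}
```

covered_fraction(7) comes out at about 0.981, while covered_fraction(6) is only about 0.879: the count drops sharply between 7-byte and 8-byte first instructions, which is why 7 looks like the natural choice.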