Message-ID: <20250306203714.118ead69@pumpkin>
Date: Thu, 6 Mar 2025 20:37:14 +0000
From: David Laight <david.laight.linux@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Uros Bizjak <ubizjak@...il.com>, Peter Zijlstra <peterz@...radead.org>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...el.com>,
x86@...nel.org, linux-kernel@...r.kernel.org, Thomas Gleixner
<tglx@...utronix.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter
Anvin" <hpa@...or.com>, Linus Torvalds <torvalds@...uxfoundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>, Arnd Bergmann
<arnd@...db.de>
Subject: Re: kernel: Current status of CONFIG_CC_OPTIMIZE_FOR_SIZE=y (was:
Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic locking
insns)
On Thu, 6 Mar 2025 10:43:26 +0100
Ingo Molnar <mingo@...nel.org> wrote:
> * Uros Bizjak <ubizjak@...il.com> wrote:
...
> And this one by Linus, 14 years ago:
>
> =================>
> 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE")
> =================>
>
> From: Linus Torvalds <torvalds@...ux-foundation.org>
> Date: Sun, 22 May 2011 14:30:36 -0700
> Subject: [PATCH] Give up on pushing CC_OPTIMIZE_FOR_SIZE
>
> I still happen to believe that I$ miss costs are a major thing, but
> sadly, -Os doesn't seem to be the solution. With or without it, gcc
> will miss some obvious code size improvements, and with it enabled gcc
> will sometimes make choices that aren't good even with high I$ miss
> ratios.
>
> For example, with -Os, gcc on x86 will turn a 20-byte constant memcpy
> into a "rep movsl". While I sincerely hope that x86 CPU's will some day
> do a good job at that, they certainly don't do it yet, and the cost is
> higher than a L1 I$ miss would be.
Well, 'rep movsb' is a lot better than it was then.
Even on Sandy Bridge (IIRC) it is ~20 clocks for short transfers (whatever the length).
Unlike the P4 with its 140 clock overhead!
It is still slower for short fixed sizes, but probably good for anything variable
because of the costs of the function call and the conditionals needed to select the
'best' algorithm.
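
To make the fixed versus variable distinction concrete, here is a minimal C sketch. The struct and function names are hypothetical (they do not come from the kernel), and what the compiler actually emits for each function depends on the compiler version, the target and the optimization flags, so treat the comments as the expected tendency rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 20-byte record, matching the size in the quoted example. */
struct rec20 {
    unsigned char bytes[20];
};

/*
 * Length is a compile-time constant: the compiler is free to expand the
 * copy inline (a few moves, or a string instruction, depending on flags).
 */
void copy_fixed(struct rec20 *dst, const struct rec20 *src)
{
    assert(dst && src);
    memcpy(dst, src, sizeof(*dst));
}

/*
 * Length is only known at run time: this usually ends up as a call to the
 * library memcpy(), which then has to branch on the length to pick a
 * copy strategy.
 */
void copy_variable(void *dst, const void *src, size_t len)
{
    assert(dst && src);
    memcpy(dst, src, len);
}
```

Comparing the generated assembly for the two functions (for example with 'gcc -S' at -O2 and at -Os) is one way to see which choice the compiler made on a given target.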
OTOH if you know it is only a few bytes, an open-coded loop may be best - and gcc will
convert it to a memcpy() call for you!
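
The kind of loop meant here looks roughly like the sketch below. The function name is hypothetical, and 'restrict' is added as an assumption so that the non-overlap requirement of memcpy() is met. GCC's loop-idiom recognition may replace the loop body with a memcpy() call at the optimization levels where that pass is enabled; as far as I know the pass is controlled by -ftree-loop-distribute-patterns and can be switched off with -fno-tree-loop-distribute-patterns.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy a handful of bytes with an open-coded loop.
 * The caller expects a short inline loop, but the compiler may recognise
 * the pattern and emit a call to memcpy() instead.
 */
void copy_few(unsigned char *restrict dst,
              const unsigned char *restrict src, size_t n)
{
    assert(dst && src);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Whether the transformation happens for a particular build is best checked in the generated assembly rather than assumed.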
The really silly one was 'push immd_byte; pop reg' to get a sign-extended value.
But I do remember -O2 being smaller than -Oz!
Just changing the inlining thresholds and the code replication on loops
(and never unrolling loops) would probably be a good start.
David