linux-kernel - Re: [PATCH v2] noinstr: Use asm_inline() in instrumentation


Message-ID: <CAFULd4ZiaboD7zT5tfz4Bdjah68E3iuBRVzrBOW3qQMoaBT5+g@mail.gmail.com>
Date: Wed, 23 Apr 2025 09:07:52 +0200
From: Uros Bizjak <ubizjak@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Josh Poimboeuf <jpoimboe@...nel.org>, x86@...nel.org, linux-kernel@...r.kernel.org, 
	Peter Zijlstra <peterz@...radead.org>, Linus Torvalds <torvalds@...ux-foundation.org>, 
	"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2] noinstr: Use asm_inline() in instrumentation_{begin,end}()

On Tue, Apr 22, 2025 at 10:05 PM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > > That still doesn't make it clear where the apparently ~10
> > > instructions per inlining come from, right?
> >
> > The growth is actually from different inlining decisions, that cover
> > not only inlining of small functions, but other code blocks (hot vs.
> > cold, tail duplication, etc) too. The compiler uses certain
> > thresholds to estimate inlining gain (thresholds are different for
> > -Os and -O2). Artificially bloated functions that don't use
> > asm_inline() fall under this threshold (IOW, the inlining would
> > increase size too much), so they are not inlined; code blocks that
> > enclose unfixed asm clauses are treated differently than when they
> > use asm_inline() instead of asm(). When asm_inline() is introduced,
> > the size of the function (and consequently inlining gain) is
> > estimated more accurately, the estimated size is lower, so there is
> > more inlining happening.
> >
> > I'd again remark that the code size is not the right metric when
> > compiling with -O2.
>
> Understood, but still we somehow have to be able to measure whether the
> marking of these primitives with asm_inline() is beneficial in
> isolation - even if on a real build the noise of GCC's overall inlining
> decisions obscure the results - and may even reverse them.
>
> Is there a way to coax GCC into a mode of build where such changes can
> be measured in a better fashion?

There are several debug options that report details of inliner
decisions. You can add -fdump-ipa-inline or -fdump-ipa-inline-details
[1] to generate a debug file for interprocedural inlining.

[1] https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html
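
As a minimal sketch of how one might try these options outside the
kernel tree: the file name example.c and both functions below are
hypothetical, chosen only so that the dump has something to report.
Whether the helper actually gets inlined depends on the GCC version
and optimization level.

```c
/*
 * Hypothetical file example.c. One possible invocation:
 *
 *   gcc -O2 -fdump-ipa-inline-details -c example.c
 *
 * GCC should then write a dump file whose name ends in ".inline"
 * (the numeric part of the name varies between releases).
 */

/* Small helper that the inliner is likely, but not guaranteed, to inline. */
static int clamp_to_byte(int v)
{
	if (v < 0)
		return 0;
	if (v > 255)
		return 255;
	return v;
}

/* Caller with two call sites, so the dump has decisions to report. */
int sum_clamped(int a, int b)
{
	return clamp_to_byte(a) + clamp_to_byte(b);
}
```

The dump for such a file contains a per-function summary in the same
format as the pfmemalloc_match excerpt in this message.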

> For example would setting -finline-limit=1000 or -finline-limit=10 (or
> some other well-chosen inlining threshold value, or tweaking any of the
> inliner parameters via --param values?), just for the sake of
> measurement, give us more representative .text size change values?

I don't think so, because the inliner measures size in pseudo instructions [2], where:

    _Note:_ pseudo instruction represents, in this particular context,
    an abstract measurement of function's size.  In no way does it
    represent a count of assembly instructions and as such its exact
    meaning might change from one release to an another.

[2] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-finline-limit

OTOH, there are plenty of --param choices for tuning the inliner
besides the -finline-limit= option. Please see [3].

[3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-param

In the dump mentioned above, you will get e.g.:

IPA function summary for pfmemalloc_match/5736 inlinable
  global time:     8.200000
  self size:       11
  global size:     11
  min size:       7
  self stack:      0
  global stack:    0
    size:4.000000, time:4.000000
    size:3.000000, time:2.000000,  executed if:(not inlined)
    size:0.500000, time:0.500000,  executed if:(op0 not sra candidate) && (not inlined)
    size:0.500000, time:0.500000,  executed if:(op0 not sra candidate)
...

The estimator estimates size and execution time and decides whether
(and how) to inline the function.

> Because, ideally, if we do these decisions correctly at the asm()
> level, compilers will, eventually, after a few decades, catch up
> and do the right thing as well. ;-)

We (as in gcc developers) are eagerly waiting for better tuning
parameters that would satisfy everyone's needs. Rest assured that many
have tried to fine-tune the heuristics, with various levels of success
;)

Jokes aside, it is important to feed the estimator correct data, as
precise as possible. There are limitations with asm(), because the
compiler doesn't know what is inside the asm template. It estimates
one instruction for every instruction delimiter in the template, and
the size of "one instruction" is estimated at 16 bytes. In the case of
__ASM_ANNOTATE, the estimator assumes 5 instructions and 80 bytes in
total, vs. one instruction when asm_inline() is used. Based on this
fact, I think that changing asm() to asm_inline() for insns with
__ASM_ANNOTATE is beneficial.
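
The arithmetic above can be illustrated with a rough sketch. This is
not GCC's code; it only models the counting rule as described in this
message, under the assumption that newline and ';' are the instruction
delimiters. The function names and the template string are
hypothetical, and the template merely resembles an annotation with
four delimiters; it is not the kernel's exact __ASM_ANNOTATE text.

```c
#include <stddef.h>

/* Size this message attributes to one estimated instruction. */
#define EST_BYTES_PER_INSN 16

/* Assumption: newline and ';' separate instructions in a template. */
static int is_separator(char c)
{
	return c == '\n' || c == ';';
}

/*
 * Estimated instruction count for an asm template: one instruction,
 * plus one more for every delimiter. With the "inline" qualifier the
 * statement is counted as a single instruction regardless of its text.
 */
int estimate_asm_insns(const char *tmpl, int is_asm_inline)
{
	int count = 1;

	if (tmpl == NULL)
		return 0;
	if (is_asm_inline)
		return 1;

	for (; *tmpl != '\0'; tmpl++)
		if (is_separator(*tmpl))
			count++;

	return count;
}

/* Estimated size in bytes, using the 16-byte figure quoted above. */
int estimate_asm_bytes(const char *tmpl, int is_asm_inline)
{
	return estimate_asm_insns(tmpl, is_asm_inline) * EST_BYTES_PER_INSN;
}

/* Illustrative annotation-style template (hypothetical section name). */
const char example_annotate_template[] =
	".pushsection .discard.example\n\t"
	".long 1f - .\n\t"
	".long 0\n\t"
	".popsection\n\t";
```

With this template, the plain asm() case comes out at 5 estimated
instructions (80 bytes), while the asm_inline() case is counted as one
instruction (16 bytes), even though the directives emit no code into
.text at all.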

Thanks,
Uros.