Message-ID: <20200420074845.GA72554@gmail.com>
Date: Mon, 20 Apr 2020 09:48:45 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Josh Poimboeuf <jpoimboe@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Masahiro Yamada <masahiroy@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
the arch/x86 maintainers <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [GIT pull] perf/urgent for 5.7-rc2

* Josh Poimboeuf <jpoimboe@...hat.com> wrote:

> On Sun, Apr 19, 2020 at 11:56:51AM -0700, Linus Torvalds wrote:
>
> > So I'm wondering if there any way that objtool could be run at
> > link-time (and archive time) rather than force a re-build of all the
> > object files from source?
>
> We've actually been making progress in that direction. Peter added
> partial vmlinux.o support, for Thomas' noinstr validation. The problem
> is, linking is single-threaded so it ends up making the kernel build
> slower overall.
>
> So right now, we still do most things per compilation unit, and only do
> the noinstr validation at vmlinux.o link time. Eventually, especially
> with LTO, we'll probably end up moving everything over to link time.

Fortunately, I believe much of what objtool does against vmlinux.o can be
parallelized in a rather straightforward fashion, if we build with
-ffunction-sections.

Here are the main "objtool check" processing steps:

	int check(const char *_objname, bool orc)
	{
		...
		ret = decode_sections(&file);
		...
		ret = validate_functions(&file);
		...
		ret = validate_unwind_hints(&file);
		...
		ret = validate_reachable_instructions(&file);
		...
		ret = create_orc(&file);
		...
		ret = create_orc_sections(&file);
	}

The 'decode_sections()' step takes about 92% of the runtime against
vmlinux.o:

  $ taskset 1 perf stat --repeat 3 --sync --null tools/objtool/objtool check vmlinux.o

   Performance counter stats for 'tools/objtool/objtool check vmlinux.o' (3 runs):

          3.05757 +- 0.00247 seconds time elapsed  ( +- 0.08% )

  $ taskset 1 perf stat --repeat 3 --exit-after-decode --null tools/objtool/objtool check vmlinux.o

   Performance counter stats for 'tools/objtool/objtool check vmlinux.o' (3 runs):

          2.83132 +- 0.00272 seconds time elapsed  ( +- 0.10% )

(The --exit-after-decode hack makes it exit right after
decode_sections().)

Within decode_sections(), the main overhead is in decode_instructions()
(~75% of the total objtool overhead):

          2.31325 +- 0.00609 seconds time elapsed  ( +- 0.26% )

This goes through every executable section to decode the instructions:

	static int decode_instructions(struct objtool_file *file)
	{
		...
		for_each_sec(file, sec) {
			if (!(sec->sh.sh_flags & SHF_EXECINSTR))
				continue;

The distribution of function-section sizes is strongly biased towards
sections of 100 bytes or less: over 95% of all instructions in vmlinux.o
are in such sections.

In fact, over 99% of all decoded instructions are in a section of 500
bytes or smaller, so a threaded decoder, where each thread batch-decodes a
handful of sections in a single processing step and then batch-inserts
them into the (global) instructions hash, should do the trick.

The batch size could be driven by section byte size, i.e. the unit of
batching would be for a decoding thread to grab ~10k bytes worth of
sections from the list, build a local list of decoded instructions, and
then insert them into the global hash in a single go.

This would scale very well IMO: the defconfig already has almost 3
million instructions, and a distro or allmodconfig build a lot more.

I believe the 3.0 seconds total objtool runtime above could be reduced to
below 1.0 second on typical contemporary development systems - which
would IMHO make it a feasible model to run objtool only against the whole
kernel binary.

Is there any code generation disadvantage or other quirk of
-ffunction-sections, or any other complication I missed, that would make
this difficult?

Thanks,

	Ingo