lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 3 Jul 2020 17:47:24 -0700
From:   Saravana Kannan <saravanak@...gle.com>
To:     Ard Biesheuvel <ardb@...nel.org>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Android Kernel Team <kernel-team@...roid.com>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] arm64/module: Optimize module load time by optimizing
 PLT counting

On Thu, Jul 2, 2020 at 8:30 AM Ard Biesheuvel <ardb@...nel.org> wrote:
>
> On Tue, 23 Jun 2020 at 03:27, Saravana Kannan <saravanak@...gle.com> wrote:
> >
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, these were the results for a few downstream modules:
> >
> > Module          Size (MB)
> > wlan            14
> > video codec     3.8
> > drm             1.8
> > IPA             2.5
> > audio           1.2
> > gpu             1.8
> >
> > Without this patch:
> > Module          Number of entries sorted        Module load time (ms)
> > wlan            243739                          283
> > video codec     74029                           138
> > drm             53837                           67
> > IPA             42800                           90
> > audio           21326                           27
> > gpu             20967                           32
> >
> > Total time to load all these module: 637 ms
> >
> > With this patch:
> > Module          Number of entries sorted        Module load time (ms)
> > wlan            22454                           61
> > video codec     10150                           47
> > drm             13014                           40
> > IPA             8097                            63
> > audio           4606                            16
> > gpu             6527                            20
> >
> > Total time to load all these modules: 247
> >
> > Time saved during boot for just these 6 modules: 390 ms
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@...aro.org>
>
> [I am no longer at Linaro so please don't use my @linaro.org address]

Hmm... I'm pretty sure I got this using the get_maintainers script.
Maybe update the MAINTAINERS file if you haven't already (I didn't
check)?

But if I ever manually add in your email, I'll keep this in mind.

> > Signed-off-by: Saravana Kannan <saravanak@...gle.com>
> > ---
> >
> > v1 -> v2:
> > - Provided more details in the commit text
> > - Pulled in Will's comments on the coding style
> > - Pulled in Ard's suggestion about skipping jumps with the same section
> >   index (parts of Will's suggested code)
> >
> >  arch/arm64/kernel/module-plts.c | 46 ++++++++++++++++++++++++++++++---
> >  1 file changed, 43 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..0ce3a28e3347 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,40 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> >         return ret;
> >  }
> >
> > +static bool branch_rela_needs_plt(Elf64_Sym *syms, Elf64_Rela *rela,
> > +                                 Elf64_Word dstidx)
> > +{
> > +
> > +       Elf64_Sym *s = syms + ELF64_R_SYM(rela->r_info);
> > +
> > +       if (s->st_shndx == dstidx)
> > +               return false;
> > +
> > +       return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26 ||
> > +              ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
> > +
> > +/* Group branch PLT relas at the front end of the array. */
> > +static int partition_branch_plt_relas(Elf64_Sym *syms, Elf64_Rela *rela,
> > +                                     int numrels, Elf64_Word dstidx)
> > +{
> > +       int i = 0, j = numrels - 1;
> > +
> > +       if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> > +               return 0;
> > +
> > +       while (i < j) {
> > +               if (branch_rela_needs_plt(syms, &rela[i], dstidx))
> > +                       i++;
> > +               else if (branch_rela_needs_plt(syms, &rela[j], dstidx))
> > +                       swap(rela[i], rela[j]);
>
> Nit: would be slightly better to put
>
>   swap(rela[i++], rela[j]);
>
> here so the next iteration of the loop will not call
> branch_rela_needs_plt() on rela[i] redundantly. But the current code
> is also correct.

Oh yeah, I noticed that unnecessary repeat of branch_rela_needs_plt()
on rela[i] when j had to be decremented, but forgot to handle it after
I was done with all the testing.

But I did compare it to the code I had written in v1 that didn't have
this extra check for rela[i]. I couldn't find any measurable
difference in the module load time. Maybe 1ms for the worst case
module, but that could have been just run to run variation.

Anyway, maybe send this as another patch since Catalin has already
picked this mine?

Thanks,
Saravana

Powered by blists - more mailing lists