lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMe9rOryEsLHossVa=QmaKW1wbW67AOnewkB_RN_WkveBEwnZg@mail.gmail.com>
Date:   Thu, 28 Feb 2019 08:25:39 -0800
From:   "H.J. Lu" <hjl.tools@...il.com>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     David Woodhouse <dwmw2@...radead.org>,
        Ingo Molnar <mingo@...nel.org>, bjorn.topel@...el.com,
        David Miller <davem@...emloft.net>, brouer@...hat.com,
        magnus.karlsson@...el.com, Andy Lutomirski <luto@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...en8.de>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>, ast@...nel.org,
        linux-tip-commits@...r.kernel.org
Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating
 indirect calls from switch-case

On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 02/28/2019 01:53 PM, H.J. Lu wrote:
> > On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <dwmw2@...radead.org> wrote:
> >> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
> >>> Commit-ID:  ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> >>> Gitweb:     https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> >>> Author:     Daniel Borkmann <daniel@...earbox.net>
> >>> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
> >>> Committer:  Thomas Gleixner <tglx@...utronix.de>
> >>> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
> >>>
> >>> x86, retpolines: Raise limit for generating indirect calls from switch-case
> >>>
> >>> From networking side, there are numerous attempts to get rid of indirect
> >>> calls in fast-path wherever feasible in order to avoid the cost of
> >>> retpolines, for example, just to name a few:
> >>>
> >>>   * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
> >>>   * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
> >>>   * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
> >>>   * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
> >>>   * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
> >>>   * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
> >>>   [...]
> >>>
> >>> Recent work on XDP from Björn and Magnus additionally found that manually
> >>> transforming the XDP return code switch statement with more than 5 cases
> >>> into if-else combination would result in a considerable speedup in XDP
> >>> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
> >>> builds.
> >>
> >> +HJL
> >>
> >> This is a GCC bug, surely? It should know how expensive each
> >> instruction is, and choose which to use accordingly. That should be
> >> true even when the indirect branch "instruction" is a retpoline, and
> >> thus enormously expensive.
> >>
> >> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
> >> please at least reference that bug, and be prepared to turn this hack
> >> off when GCC is fixed.
> >
> > We couldn't find a testcase to show jump table with indirect branch
> > is slower than direct branches.
>
> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
> with the below on top.
>
>  Makefile | 6 +++---
>  switch.c | 2 +-
>  test.c   | 6 ++++--
>  3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index bd83233..ea81520 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,16 +1,16 @@
>  CC=gcc
>  CFLAGS=-g -I.
> -CFLAGS+=-O2 -mindirect-branch=thunk
> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does slowdown show up only with -mindirect-branch=thunk-inline?

>  ASFLAGS=-g
>
>  EXE=test
>
>  OBJS=test.o switch-no-table.o switch.o
>
> -switch-no-table.o switch-no-table.s: CFLAGS += -fno-jump-tables
> +switch-no-table.o switch-no-table.s: CFLAGS += --param=case-values-threshold=20
>
>  all: $(EXE)
> -       ./$(EXE)
> +       taskset 1 ./$(EXE)
>
>  $(EXE): $(OBJS)
>         $(CC) -o $@ $^
> diff --git a/switch.c b/switch.c
> index fe0a8b0..233ec14 100644
> --- a/switch.c
> +++ b/switch.c
> @@ -3,7 +3,7 @@ int global;
>  int
>  foo (int x)
>  {
> -  switch (x) {
> +  switch (x & 0xf) {
>      case 0:
>        return 11;
>      case 1:
> diff --git a/test.c b/test.c
> index 3d1e0da..7fc22a4 100644
> --- a/test.c
> +++ b/test.c
> @@ -15,21 +15,23 @@ main ()
>    unsigned long long start, end;
>    unsigned long long diff1, diff2;
>
> +  global = 0;
>    start = __rdtscp (&i);
>    for (i = 0; i < LOOP; i++)
>      foo_no_table (i);
>    end = __rdtscp (&i);
>    diff1 = end - start;
>
> -  printf ("no jump table: %lld\n", diff1);
> +  printf ("global:%d no jump table: %lld\n", global, diff1);
>
> +  global = 0;
>    start = __rdtscp (&i);
>    for (i = 0; i < LOOP; i++)
>      foo (i);
>    end = __rdtscp (&i);
>    diff2 = end - start;
>
> -  printf ("jump table   : %lld (%.2f%%)\n", diff2, 100.0f * diff2 / diff1);
> +  printf ("global:%d jump table   : %lld (%.2f%%)\n", global, diff2, 100.0f * diff2 / diff1);
>
>    return 0;
>  }
> --
> 2.17.1
>
> ** This basically iterates through the cases:
>
> Below I'm getting ~twice the time needed for jump table vs no jump table
> for the flags kernel is using:
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:50000000 no jump table: 6329361694
> global:50000000 jump table   : 13745181180 (217.17%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6328846466
> global:50000000 jump table   : 13746479870 (217.20%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6326922428
> global:50000000 jump table   : 13745139496 (217.25%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6327943506
> global:50000000 jump table   : 13744388354 (217.20%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6332503572
> global:50000000 jump table   : 13729817800 (216.82%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6328378006
> global:50000000 jump table   : 13747069902 (217.23%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6326481236
> global:50000000 jump table   : 13749345724 (217.33%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6329332628
> global:50000000 jump table   : 13745879704 (217.18%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6327734850
> global:50000000 jump table   : 13746412678 (217.24%)
>
> For comparison that both are 100% when raising limit is _not_ in use
> (which is expected of course but just to make sure):
>
> root@...t:~/microbenchmark# make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:50000000 no jump table: 13704083238
> global:50000000 jump table   : 13746838060 (100.31%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13753854740
> global:50000000 jump table   : 13746624470 (99.95%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707053714
> global:50000000 jump table   : 13746682002 (100.29%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13708843624
> global:50000000 jump table   : 13749733040 (100.30%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707365404
> global:50000000 jump table   : 13747683096 (100.29%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707014114
> global:50000000 jump table   : 13746444272 (100.29%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13709596158
> global:50000000 jump table   : 13750499176 (100.30%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13709484118
> global:50000000 jump table   : 13747952446 (100.28%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13708873570
> global:50000000 jump table   : 13748950096 (100.29%)
>
> ** Next case would be constantly hitting first switch case:
>
> diff --git a/switch.c b/switch.c
> index 233ec14..fe0a8b0 100644
> --- a/switch.c
> +++ b/switch.c
> @@ -3,7 +3,7 @@ int global;
>  int
>  foo (int x)
>  {
> -  switch (x & 0xf) {
> +  switch (x) {
>      case 0:
>        return 11;
>      case 1:
> diff --git a/test.c b/test.c
> index 7fc22a4..2849112 100644
> --- a/test.c
> +++ b/test.c
> @@ -5,6 +5,7 @@ extern int foo (int);
>  extern int foo_no_table (int);
>
>  int global = 20;
> +int j = 0;
>
>  #define LOOP 800000000
>
> @@ -18,7 +19,7 @@ main ()
>    global = 0;
>    start = __rdtscp (&i);
>    for (i = 0; i < LOOP; i++)
> -    foo_no_table (i);
> +    foo_no_table (j);
>    end = __rdtscp (&i);
>    diff1 = end - start;
>
> @@ -27,7 +28,7 @@ main ()
>    global = 0;
>    start = __rdtscp (&i);
>    for (i = 0; i < LOOP; i++)
> -    foo (i);
> +    foo (j);
>    end = __rdtscp (&i);
>    diff2 = end - start;
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:0 no jump table: 6098109200
> global:0 jump table   : 30717871980 (503.73%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097799330
> global:0 jump table   : 30727386270 (503.91%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097559796
> global:0 jump table   : 30715992452 (503.74%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6098532172
> global:0 jump table   : 30716423870 (503.67%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097429586
> global:0 jump table   : 30715774634 (503.75%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097813848
> global:0 jump table   : 30716476820 (503.73%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6096955736
> global:0 jump table   : 30715385478 (503.78%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6096820240
> global:0 jump table   : 30719682434 (503.86%)
>
> ** And next case would be constantly hitting default case:
>
> diff --git a/test.c b/test.c
> index 2849112..be9bfc1 100644
> --- a/test.c
> +++ b/test.c
> @@ -5,7 +5,7 @@ extern int foo (int);
>  extern int foo_no_table (int);
>
>  int global = 20;
> -int j = 0;
> +int j = 1000;
>
>  #define LOOP 800000000
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register   -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:0 no jump table: 6422890064
> global:0 jump table   : 6866072454 (106.90%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6423267608
> global:0 jump table   : 6866266176 (106.90%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424721624
> global:0 jump table   : 6866607842 (106.88%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424225664
> global:0 jump table   : 6866843372 (106.89%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424073830
> global:0 jump table   : 6866467050 (106.89%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6426515396
> global:0 jump table   : 6867031640 (106.85%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6425126656
> global:0 jump table   : 6866352988 (106.87%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6423040024
> global:0 jump table   : 6867233670 (106.92%)
> root@...t:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6422256136
> global:0 jump table   : 6865902094 (106.91%)
>
> I could also try different distributions perhaps for the case selector,
> but observations match in that direction with what Bjorn et al also have
> been seen in XDP case.
>
> Thanks,
> Daniel



-- 
H.J.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ