linux-kernel - Re: [PATCH v2 0/4] Static calls

Message-ID: <20181126155405.72b4f718@gandalf.local.home>
Date:   Mon, 26 Nov 2018 15:54:05 -0500
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Ard Biesheuvel <ard.biesheuvel@...aro.org>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Jason Baron <jbaron@...mai.com>, Jiri Kosina <jkosina@...e.cz>,
        David Laight <David.Laight@...LAB.COM>,
        Borislav Petkov <bp@...en8.de>,
        Julia Cartwright <julia@...com>, Jessica Yu <jeyu@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH v2 0/4] Static calls


Here's the test with the attached config (a Fedora distro config with
localmodconfig run against it), along with two patches that implement
tracepoints with static calls. The first makes a tracepoint call a
function pointer to a single callback if there's only one callback, or
an "iterator" that iterates over a list of callbacks (when more than
one callback is associated with the tracepoint).
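
To illustrate the dispatch rule described above, here is a minimal
model of it in Python. It is only a sketch and not the kernel code from
the attached patch: the class and method names are hypothetical, and a
plain attribute stands in for the function pointer that the tracepoint
site calls.

```python
from typing import Callable, List, Optional

Callback = Callable[[str], None]


class Tracepoint:
    """Toy model only; every name here is hypothetical, not kernel API."""

    def __init__(self) -> None:
        self.callbacks: List[Callback] = []
        # What the tracepoint site calls. None means the tracepoint is off.
        self.func: Optional[Callback] = None

    def _iterator(self, event: str) -> None:
        # Used only when more than one callback is attached.
        for cb in list(self.callbacks):
            cb(event)

    def _update(self) -> None:
        # Re-pick the call target whenever the callback list changes.
        if not self.callbacks:
            self.func = None
        elif len(self.callbacks) == 1:
            self.func = self.callbacks[0]   # call the single callback directly
        else:
            self.func = self._iterator      # fall back to walking the list

    def register(self, cb: Callback) -> None:
        self.callbacks.append(cb)
        self._update()

    def unregister(self, cb: Callback) -> None:
        self.callbacks.remove(cb)
        self._update()

    def fire(self, event: str) -> None:
        func = self.func
        if func is not None:
            func(event)
```

The point of keeping a single call target is that the tracepoint site
always makes exactly one call, which is the property that lets that
call be converted to a static call later.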

It adds printk()s where it enables and disables the tracepoints, so
expect to see a lot of output when you enable the tracepoints. This is
to verify that it's assigning the right code.

Here's what I did.

1) I first took the config and turned off CONFIG_RETPOLINE and built
v4.20-rc4 with that. I ran this to see what the effect was without
retpolines. I booted that kernel and did the following (which is also
what I did for every kernel):

 # trace-cmd start -e all

  To get the same effect you could also do:

    # echo 1 > /sys/kernel/debug/tracing/events/enable

 # perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:
  
# perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         10,727.44 msec task-clock                #    7.397 CPUs utilized            ( +-  0.95% )
           126,300      context-switches          # 11774.138 M/sec                   ( +- 13.80% )
            14,309      cpu-migrations            # 1333.973 M/sec                    ( +-  8.73% )
            44,073      page-faults               # 4108.652 M/sec                    ( +-  0.68% )
    39,484,799,554      cycles                    # 3680914.295 GHz                   ( +-  0.95% )
    28,470,896,143      stalled-cycles-frontend   #   72.11% frontend cycles idle     ( +-  0.95% )
    26,521,427,813      instructions              #    0.67  insn per cycle         
                                                  #    1.07  stalled cycles per insn  ( +-  0.85% )
     4,931,066,096      branches                  # 459691625.400 M/sec               ( +-  0.87% )
        19,063,801      branch-misses             #    0.39% of all branches          ( +-  2.05% )

            1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )
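
For anyone reading the counters, the ratio columns in the output above
follow from the raw numbers. The sketch below recomputes them from the
"No RETPOLINES" run; the variable and function names are mine, and the
formulas are the usual ratios rather than anything taken from the perf
source.

```python
# Raw counters copied from the "No RETPOLINES" perf stat output.
task_clock_ms = 10_727.44
cycles = 39_484_799_554
stalled_frontend = 28_470_896_143
instructions = 26_521_427_813
branches = 4_931_066_096
branch_misses = 19_063_801
elapsed_s = 1.4503


def derived_metrics() -> dict:
    """Recompute the ratio columns that perf stat prints next to the counters."""
    return {
        # CPU time consumed divided by wall-clock time.
        "cpus_utilized": (task_clock_ms / 1000.0) / elapsed_s,
        "insn_per_cycle": instructions / cycles,
        "stalled_cycles_per_insn": stalled_frontend / instructions,
        "frontend_idle_pct": 100.0 * stalled_frontend / cycles,
        "branch_miss_pct": 100.0 * branch_misses / branches,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name:26s} {value:8.3f}")
```

The branch-miss percentage is the number to watch across the runs that
follow, since it is the one that moves the most when retpolines are
turned on.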

Then I enabled CONFIG_RETPOLINE, built, booted, and ran it again:

baseline RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         11,162.24 msec task-clock                #    7.383 CPUs utilized            ( +-  1.11% )
           112,882      context-switches          # 10113.153 M/sec                   ( +- 15.86% )
            14,255      cpu-migrations            # 1277.103 M/sec                    ( +-  7.78% )
            43,067      page-faults               # 3858.393 M/sec                    ( +-  1.04% )
    41,076,270,559      cycles                    # 3680042.874 GHz                   ( +-  1.12% )
    29,669,137,584      stalled-cycles-frontend   #   72.23% frontend cycles idle     ( +-  1.21% )
    26,647,656,812      instructions              #    0.65  insn per cycle         
                                                  #    1.11  stalled cycles per insn  ( +-  0.81% )
     5,069,504,923      branches                  # 454179389.091 M/sec               ( +-  0.83% )
        99,135,413      branch-misses             #    1.96% of all branches          ( +-  0.87% )

            1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )


Then I applied the first tracepoint patch to make the change to call
directly (and be able to use static calls later), and tested that.

Added direct calls for trace_events:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         11,249.28 msec task-clock                #    7.382 CPUs utilized            ( +-  0.64% )
           112,058      context-switches          # 9961.721 M/sec                    ( +- 11.15% )
            15,535      cpu-migrations            # 1381.033 M/sec                    ( +- 10.34% )
            43,673      page-faults               # 3882.433 M/sec                    ( +-  1.14% )
    41,407,431,000      cycles                    # 3681020.455 GHz                   ( +-  0.63% )
    29,842,394,154      stalled-cycles-frontend   #   72.07% frontend cycles idle     ( +-  0.63% )
    26,669,867,181      instructions              #    0.64  insn per cycle         
                                                  #    1.12  stalled cycles per insn  ( +-  0.58% )
     5,085,122,641      branches                  # 452055102.392 M/sec               ( +-  0.60% )
       108,935,006      branch-misses             #    2.14% of all branches          ( +-  0.57% )

            1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )


Then I added patches 1 and 2, applied the second attached patch, and
ran that:

With static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         11,293.08 msec task-clock                #    7.390 CPUs utilized            ( +-  0.93% )
           125,343      context-switches          # 11099.462 M/sec                   ( +- 11.84% )
            15,587      cpu-migrations            # 1380.272 M/sec                    ( +-  8.21% )
            43,871      page-faults               # 3884.890 M/sec                    ( +-  1.06% )
    41,567,508,330      cycles                    # 3680918.499 GHz                   ( +-  0.94% )
    29,851,271,023      stalled-cycles-frontend   #   71.81% frontend cycles idle     ( +-  0.99% )
    26,878,085,513      instructions              #    0.65  insn per cycle         
                                                  #    1.11  stalled cycles per insn  ( +-  0.72% )
     5,125,816,911      branches                  # 453905346.879 M/sec               ( +-  0.74% )
       107,643,635      branch-misses             #    2.10% of all branches          ( +-  0.71% )

            1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

Then I applied patch 3 and tested that:

With static call trampolines:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         10,964.38 msec task-clock                #    7.392 CPUs utilized            ( +-  0.41% )
            75,986      context-switches          # 6930.527 M/sec                    ( +-  9.23% )
            12,464      cpu-migrations            # 1136.858 M/sec                    ( +-  7.93% )
            44,476      page-faults               # 4056.558 M/sec                    ( +-  1.12% )
    40,354,963,428      cycles                    # 3680712.468 GHz                   ( +-  0.42% )
    29,057,240,222      stalled-cycles-frontend   #   72.00% frontend cycles idle     ( +-  0.46% )
    26,171,883,339      instructions              #    0.65  insn per cycle         
                                                  #    1.11  stalled cycles per insn  ( +-  0.32% )
     4,978,193,830      branches                  # 454053195.523 M/sec               ( +-  0.33% )
        83,625,127      branch-misses             #    1.68% of all branches          ( +-  0.33% )

           1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

And finally I added patch 4 and tested that:

Full static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

         10,865.10 msec task-clock                #    7.373 CPUs utilized            ( +-  0.62% )
            88,718      context-switches          # 8165.823 M/sec                    ( +- 10.11% )
            13,463      cpu-migrations            # 1239.125 M/sec                    ( +-  8.42% )
            44,574      page-faults               # 4102.673 M/sec                    ( +-  0.60% )
    39,991,476,585      cycles                    # 3680897.280 GHz                   ( +-  0.63% )
    28,713,229,777      stalled-cycles-frontend   #   71.80% frontend cycles idle     ( +-  0.68% )
    26,289,703,633      instructions              #    0.66  insn per cycle         
                                                  #    1.09  stalled cycles per insn  ( +-  0.44% )
     4,983,099,105      branches                  # 458654631.123 M/sec               ( +-  0.45% )
        83,719,799      branch-misses             #    1.68% of all branches          ( +-  0.44% )

           1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )


In summary, we had this:

No RETPOLINES:
            1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

baseline RETPOLINES:
            1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

Added direct calls for trace_events:
            1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

With static calls:
            1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

With static call trampolines:
           1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

Full static calls:
           1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )


Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown

Trampolines reduced that to a 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown

Full static calls reduced it to a 1.47364 / 1.4503 = 1.0161 ( 1.6% ) slowdown

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth
the effort. I did not expect it to go to 0% as there are a lot of other
places where retpolines cause issues, but this shows that it does help
the tracing code.
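
The slowdown figures above can be recomputed from the elapsed times in
the summary. This is just a small sketch of that arithmetic; the
dictionary keys reuse the labels from the summary, and the function
name is made up for the example.

```python
# Mean elapsed seconds copied from the summary above.
elapsed = {
    "No RETPOLINES": 1.4503,
    "baseline RETPOLINES": 1.5120,
    "Added direct calls for trace_events": 1.5239,
    "With static calls": 1.5282,
    "With static call trampolines": 1.48328,
    "Full static calls": 1.47364,
}


def slowdown_pct(label: str, baseline: str = "No RETPOLINES") -> float:
    """Percent slowdown of a run relative to the baseline run."""
    return (elapsed[label] / elapsed[baseline] - 1.0) * 100.0


if __name__ == "__main__":
    for label in elapsed:
        print(f"{label:38s} {slowdown_pct(label):5.2f}%")
```

Swapping in the elapsed times from a different config only requires
changing the dictionary.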

I originally did the tests with the development config, which has a
bunch of debugging options enabled (hackbench usually takes over 9
seconds, not the 1.5 that was done here), and the slowdown was closer
to 9% with retpolines. If people want, I can redo this with that
config, or I can send them the config. Or better yet, the code is here,
just use your own configs.

-- Steve

Download attachment "config-distro" of type "application/octet-stream" (134859 bytes)

View attachment "0001-tracepoints-Add-a-direct-call-or-an-iterator.patch" of type "text/x-patch" (10980 bytes)

View attachment "0002-tracepoints-Implement-it-with-dynamic-functions.patch" of type "text/x-patch" (5374 bytes)