Date:	Thu, 18 Aug 2016 14:18:42 +0200
From:	Denys Vlasenko <dvlasenk@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Sara Sharon <sara.sharon@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Christian König <christian.koenig@....com>,
	Vinod Koul <vinod.koul@...el.com>,
	Alex Deucher <alexander.deucher@....com>,
	Johannes Berg <johannes.berg@...el.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Andy Lutomirski <luto@...nel.org>,
	the arch/x86 maintainers <x86@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Adrian Hunter <adrian.hunter@...el.com>
Subject: Re: RFC: Petition Intel/AMD to add POPF_IF insn

On 08/18/2016 11:21 AM, Denys Vlasenko wrote:
>>> Of course, if somebody uses native_restore_fl() to actually *disable*
>>> interrupts (when they weren't already disabled), then this untested
>>> patch will just not work. But why would you do something so stupid?
>>> Famous last words...
>>
>> Looking around, the vsmp code actually uses "native_restore_fl()" to
>> not just set the interrupt flag, but AC as well.
>>
>> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>>
>> So that trivial patch may (or may not - still not tested) work for
>> some quick testing, but needs more effort for any *real* use.
>
> I'm going to test the attached patch.
...
>
> +# CONFIG_UPROBES is not set
> +# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
> +# CONFIG_HYPERVISOR_GUEST is not set
> +# CONFIG_SYS_HYPERVISOR is not set
> +# CONFIG_FRAME_POINTER is not set
> +# CONFIG_KMEMCHECK is not set
> +# CONFIG_DEBUG_LOCK_ALLOC is not set
> +# CONFIG_PROVE_LOCKING is not set
> +# CONFIG_LOCK_STAT is not set
> +# CONFIG_PROVE_RCU is not set
> +# CONFIG_LATENCYTOP is not set
> +# CONFIG_FTRACE is not set
> +# CONFIG_BINARY_PRINTF is not set

Also need !CONFIG_DEBUG_SPINLOCK; with all of these off, the unpatched irqrestore compiles to:

ffffffff817115a0 <_raw_spin_unlock_irqrestore>:
ffffffff817115a0:       c6 07 00                movb   $0x0,(%rdi)
ffffffff817115a3:       56                      push   %rsi
ffffffff817115a4:       9d                      popfq
ffffffff817115a5:       65 ff 0d e4 ad 8f 7e    decl   %gs:__preempt_count
ffffffff817115ac:       c3                      retq
ffffffff817115ad:       0f 1f 00                nopl   (%rax)
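
For reference, the push;popf pair above comes from native_restore_fl() in
arch/x86/include/asm/irqflags.h, roughly (reconstructed here, not quoted
from the tree):

	static inline void native_restore_fl(unsigned long flags)
	{
		/* restore the whole flags word, including IF and AC */
		asm volatile("push %0 ; popf"
			     : /* no output */
			     : "g" (flags)
			     : "memory", "cc");
	}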

The patched one is:

ffffffff81711660 <_raw_spin_unlock_irqrestore>:
ffffffff81711660:       f7 c6 00 02 00 00       test   $0x200,%esi
ffffffff81711666:       c6 07 00                movb   $0x0,(%rdi)
ffffffff81711669:       74 01                   je     ffffffff8171166c <_raw_spin_unlock_irqrestore+0xc>
ffffffff8171166b:       fb                      sti
ffffffff8171166c:       65 ff 0d 1d ad 8f 7e    decl   %gs:__preempt_count
ffffffff81711673:       c3                      retq
ffffffff81711674:       66 90                   xchg   %ax,%ax
ffffffff81711676:       66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)


Ran the following twice on a quiesced machine:

taskset 1 perf stat -r60 perf bench sched messaging
taskset 1 perf stat -r60 perf bench sched pipe
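
(taskset 1 pins the benchmark to CPU 0; perf stat -r60 repeats it 60 times
and reports the mean, with the stddev shown as the "( +- ... )" percentages
below.)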

Out of these four runs, both "perf bench sched pipe" runs show improvements
("-" lines are the unpatched kernel, "+" lines the patched one):

-       2648.279829      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.24% )
+       2483.143469      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.31% )
-         2,000,002      context-switches          #    0.755 M/sec                    ( +-  0.00% )
+         2,000,013      context-switches          #    0.805 M/sec                    ( +-  0.00% )
-               547      page-faults               #    0.206 K/sec                    ( +-  0.04% )
+               546      page-faults               #    0.220 K/sec                    ( +-  0.04% )
-     8,723,284,926      cycles                    #    3.294 GHz                      ( +-  0.06% )
+     8,157,949,449      cycles                    #    3.285 GHz                      ( +-  0.07% )
-    12,286,937,344      instructions              #    1.41  insn per cycle           ( +-  0.03% )
+    12,255,616,405      instructions              #    1.50  insn per cycle           ( +-  0.05% )
-     2,588,839,023      branches                  #  977.555 M/sec                    ( +-  0.02% )
+     2,599,319,615      branches                  # 1046.786 M/sec                    ( +-  0.04% )
-         3,620,273      branch-misses             #    0.14% of all branches          ( +-  0.67% )
+         3,577,771      branch-misses             #    0.14% of all branches          ( +-  0.69% )
-       2.648799072 seconds time elapsed                                          ( +-  0.24% )
+       2.487452268 seconds time elapsed                                          ( +-  0.31% )

Good: we run more insns/cycle, as expected. However, there are also slightly more branches.

But of the two "perf bench sched messaging" runs, one was slower on the patched kernel,
and perf shows why: more branches, and the branch-miss percentage is also higher:

-        690.008697      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.45% )
+        699.526509      task-clock (msec)         #    0.994 CPUs utilized            ( +-  0.28% )
-            26,796      context-switches          #    0.039 M/sec                    ( +-  8.31% )
+            29,088      context-switches          #    0.042 M/sec                    ( +-  6.62% )
-            35,477      page-faults               #    0.051 M/sec                    ( +-  0.11% )
+            35,504      page-faults               #    0.051 M/sec                    ( +-  0.14% )
-     2,157,701,609      cycles                    #    3.127 GHz                      ( +-  0.35% )
+     2,143,781,407      cycles                    #    3.065 GHz                      ( +-  0.25% )
-     3,115,212,672      instructions              #    1.44  insn per cycle           ( +-  0.28% )
+     3,253,499,549      instructions              #    1.52  insn per cycle           ( +-  0.19% )
-       661,888,593      branches                  #  959.247 M/sec                    ( +-  0.36% )
+       707,862,655      branches                  # 1011.917 M/sec                    ( +-  0.20% )
-         2,793,203      branch-misses             #    0.42% of all branches          ( +-  1.04% )
+         3,453,397      branch-misses             #    0.49% of all branches          ( +-  0.32% )
-       0.693004918 seconds time elapsed                                          ( +-  0.45% )
+       0.703630988 seconds time elapsed                                          ( +-  0.27% )

This tipped the scales: despite the higher insns/cycle, run time is worse.

The other "perf bench sched messaging" run was luckier:

-        706.944245      task-clock (msec)         #    0.995 CPUs utilized            ( +-  0.32% )
+        687.038856      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.31% )
-            23,489      context-switches          #    0.033 M/sec                    ( +-  7.02% )
+            26,644      context-switches          #    0.039 M/sec                    ( +-  7.46% )
-            35,360      page-faults               #    0.050 M/sec                    ( +-  0.12% )
+            35,417      page-faults               #    0.052 M/sec                    ( +-  0.13% )
-     2,183,639,816      cycles                    #    3.089 GHz                      ( +-  0.35% )
+     2,123,086,753      cycles                    #    3.090 GHz                      ( +-  0.27% )
-     3,131,362,238      instructions              #    1.43  insn per cycle           ( +-  0.19% )
+     3,236,613,433      instructions              #    1.52  insn per cycle           ( +-  0.19% )
-       667,874,319      branches                  #  944.734 M/sec                    ( +-  0.21% )
+       703,677,908      branches                  # 1024.219 M/sec                    ( +-  0.20% )
-         2,859,521      branch-misses             #    0.43% of all branches          ( +-  0.56% )
+         3,462,063      branch-misses             #    0.49% of all branches          ( +-  0.33% )
-       0.710738536 seconds time elapsed                                          ( +-  0.32% )
+       0.691908533 seconds time elapsed                                          ( +-  0.31% )

However, it still had ~5% more branches, and a worse branch-miss percentage.

The patch seems to work. It also does not bloat the kernel:

    text	   data	    bss	    dec	    hex	filename
8199556	5026784	2924544	16150884 f67164	vmlinux
8199897	5026784	2924544	16151225 f672b9	vmlinux.patched
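
(These figures match the output format of the standard size(1) utility;
presumably something like "size vmlinux vmlinux.patched" produced the table.)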

However, a "conditional CLI/STI from r/m" insn could be better still.
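
For illustration only - the mnemonic and syntax below are invented here,
roughly what the subject line's POPF_IF would be. It would collapse the
patched three-insn, one-branch sequence into a single branchless insn:

	# patched kernel today:
	test   $0x200,%esi
	je     1f
	sti
1:
	# hypothetical conditional CLI/STI from r/m:
	popf_if %rsi	# IF := bit 9 of the operand; all other flags unchanged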

The patch is attached.

View attachment "z.diff" of type "text/x-patch" (3077 bytes)
