Date:	Sun, 3 May 2015 13:51:11 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	"H. Peter Anvin" <hpa@...or.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Brian Gerst <brgerst@...il.com>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	Ingo Molnar <mingo@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Alexei Starovoitov <ast@...mgrid.com>,
	Will Drewry <wad@...omium.org>,
	Kees Cook <keescook@...omium.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Mel Gorman <mgorman@...e.com>,
	Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>
Subject: Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor
 attribute issue

On Thu, Apr 30, 2015 at 02:39:07PM -0700, H. Peter Anvin wrote:
> This is the microbenchmark I used.
> 
> For the record, Intel's intention going forward is that 0F 1F will
> always be as fast or faster than any other alternative.

It looks like this is the case on AMD too.

So I took your benchmark and made it measure all sizes of K8 and P6
NOPs. Each measurement is 10^6 repetitions of a 1000-NOP block, repeated
30 times, taking the minimum. The results speak for themselves,
especially from 5-byte NOPs onwards, where we have to repeat the K8 NOP
but can still use a single P6 NOP.

And I'm going to move all relevant AMD hw to use the P6 NOPs for the
alternatives.

Unless I've done something wrong, of course. Please double-check, I'm
attaching the microbenchmark too.

Anyway, here's a patch:

---
From: Borislav Petkov <bp@...e.de>
Date: Sat, 2 May 2015 23:55:40 +0200
Subject: [PATCH] x86/alternatives: Switch AMD F15h and later to the P6 NOPs

Software optimization guides for both F15h and F16h cite those NOPs as
the optimal ones. A microbenchmark confirms that actually even older
families are better with the single-insn NOPs so switch to them for the
alternatives.

The cycle counts below include the loop overhead of the measurement, but
that overhead is the same for all runs.

F10h, revE:
-----------
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
                      90     288.212282 cycles
                   66 90     288.220840 cycles
                66 66 90     288.219447 cycles
             66 66 66 90     288.223204 cycles
          66 66 90 66 90     571.393424 cycles
       66 66 90 66 66 90     571.374919 cycles
    66 66 66 90 66 66 90     572.249281 cycles
 66 66 66 90 66 66 66 90     571.388651 cycles

P6:
                      90     288.214193 cycles
                   66 90     288.225550 cycles
                0f 1f 00     288.224441 cycles
             0f 1f 40 00     288.225030 cycles
          0f 1f 44 00 00     288.233558 cycles
       66 0f 1f 44 00 00     324.792342 cycles
    0f 1f 80 00 00 00 00     325.657462 cycles
 0f 1f 84 00 00 00 00 00     430.246643 cycles

F14h:
-----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
                      90     510.404890 cycles
                   66 90     510.432117 cycles
                66 66 90     510.561858 cycles
             66 66 66 90     510.541865 cycles
          66 66 90 66 90    1014.192782 cycles
       66 66 90 66 66 90    1014.226546 cycles
    66 66 66 90 66 66 90    1014.334299 cycles
 66 66 66 90 66 66 66 90    1014.381205 cycles

P6:
                      90     510.436710 cycles
                   66 90     510.448229 cycles
                0f 1f 00     510.545100 cycles
             0f 1f 40 00     510.502792 cycles
          0f 1f 44 00 00     510.589517 cycles
       66 0f 1f 44 00 00     510.611462 cycles
    0f 1f 80 00 00 00 00     511.166794 cycles
 0f 1f 84 00 00 00 00 00     511.651641 cycles

F15h:
-----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
                      90     243.128396 cycles
                   66 90     243.129883 cycles
                66 66 90     243.131631 cycles
             66 66 66 90     242.499324 cycles
          66 66 90 66 90     481.829083 cycles
       66 66 90 66 66 90     481.884413 cycles
    66 66 66 90 66 66 90     481.851446 cycles
 66 66 66 90 66 66 66 90     481.409220 cycles

P6:
                      90     243.127026 cycles
                   66 90     243.130711 cycles
                0f 1f 00     243.122747 cycles
             0f 1f 40 00     242.497617 cycles
          0f 1f 44 00 00     245.354461 cycles
       66 0f 1f 44 00 00     361.930417 cycles
    0f 1f 80 00 00 00 00     362.844944 cycles
 0f 1f 84 00 00 00 00 00     480.514948 cycles

F16h:
-----
Running NOP tests, 1000 NOPs x 1000000 repetitions

K8:
                      90     507.793298 cycles
                   66 90     507.789636 cycles
                66 66 90     507.826490 cycles
             66 66 66 90     507.859075 cycles
          66 66 90 66 90    1008.663129 cycles
       66 66 90 66 66 90    1008.696259 cycles
    66 66 66 90 66 66 90    1008.692517 cycles
 66 66 66 90 66 66 66 90    1008.755399 cycles

P6:
                      90     507.795232 cycles
                   66 90     507.794761 cycles
                0f 1f 00     507.834901 cycles
             0f 1f 40 00     507.822629 cycles
          0f 1f 44 00 00     507.838493 cycles
       66 0f 1f 44 00 00     507.908597 cycles
    0f 1f 80 00 00 00 00     507.946417 cycles
 0f 1f 84 00 00 00 00 00     507.954960 cycles

Signed-off-by: Borislav Petkov <bp@...e.de>
Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>
---
 arch/x86/kernel/alternative.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index aef653193160..b0932c4341b3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -227,6 +227,15 @@ void __init arch_init_ideal_nops(void)
 #endif
 		}
 		break;
+
+	case X86_VENDOR_AMD:
+		if (boot_cpu_data.x86 > 0xf) {
+			ideal_nops = p6_nops;
+			return;
+		}
+
+		/* fall through */
+
 	default:
 #ifdef CONFIG_X86_64
 		ideal_nops = k8_nops;
-- 
2.3.5

Modified benchmark:

---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdbool.h>
#include <sys/time.h>

typedef unsigned long long u64;

#define DECLARE_ARGS(val, low, high)    unsigned low, high
#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)

static inline __attribute__((always_inline)) unsigned long long rdtsc(void)
{
        DECLARE_ARGS(val, low, high);

        asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));

        return EAX_EDX_VAL(val, low, high);
}

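/* RDTSC bracketed by MFENCE, intended to keep the read from being reordered with the code under test. */
static inline u64 read_tsc(void)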
{
	u64 ret;

	asm volatile("mfence");
	ret = rdtsc();
	asm volatile("mfence");

	return ret;
}

#define __stringify_1(x...)     #x
#define __stringify(x...)       __stringify_1(x)

#define GENERIC_NOP1 0x90

#define K8_NOP1 GENERIC_NOP1
#define K8_NOP2 0x66,K8_NOP1
#define K8_NOP3 0x66,K8_NOP2
#define K8_NOP4 0x66,K8_NOP3
#define K8_NOP5 K8_NOP3,K8_NOP2
#define K8_NOP6 K8_NOP3,K8_NOP3
#define K8_NOP7 K8_NOP4,K8_NOP3
#define K8_NOP8 K8_NOP4,K8_NOP4

#define P6_NOP3 0x0f,0x1f,0x00
#define P6_NOP4 0x0f,0x1f,0x40,0
#define P6_NOP5	0x0f,0x1f,0x44,0x00,0
#define P6_NOP6 0x66,0x0f,0x1f,0x44,0x00,0
#define P6_NOP7 0x0f,0x1f,0x80,0,0,0,0
#define P6_NOP8 0x0f,0x1f,0x84,0x00,0,0,0,0

#define BUILD_NOP(func, nop)				\
static void func(void)					\
{							\
	asm volatile(".rept 1000\n"			\
		     ".byte " __stringify(nop) "\n"	\
		     ".endr");				\
}

/* single-byte NOP */
BUILD_NOP(k8_nop1, K8_NOP1)

/* 2-byte NOPs */
BUILD_NOP(k8_nop2, K8_NOP2)

/* 3-byte NOPs */
BUILD_NOP(k8_nop3, K8_NOP3)
BUILD_NOP(p6_nop3, P6_NOP3)

/* 4-byte NOPs */
BUILD_NOP(k8_nop4, K8_NOP4)
BUILD_NOP(p6_nop4, P6_NOP4)

/* 5-byte NOPs */
static void p6_nop5(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
	       ".endr");
}

static void nop_k8(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x66,0x66,0x66,0x66,0x90\n"
	       ".endr");
}

BUILD_NOP(k8_nop5, K8_NOP5)

static void nop_lea(void)
{
#ifdef __x86_64__
  asm volatile(".rept 1000\n"
	       ".byte 0x48,0x8d,0x74,0x26,0x00\n"
	       ".endr");
#else
  asm volatile(".rept 1000\n"
	       ".byte 0x3e,0x8d,0x74,0x26,0x00\n"
	       ".endr");
#endif
}

static void nop_jmp5(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0xe9,0,0,0,0\n"
	       ".endr");
}

static void nop_jmp2(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0xeb,3,0x90,0x90,0x90\n"
	       ".endr");
}

static void nop_xchg(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x66,0x66,0x66,0x87,0xc0\n"
	       ".endr");
}

static void nop_mov(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x66,0x66,0x66,0x89,0xc0\n"
	       ".endr");
}

static void nop_fdisi(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x66,0x66,0x66,0xdb,0xe1\n"
	       ".endr");
}
  
static void nop_feni(void)
{
  asm volatile(".rept 1000\n"
	       ".byte 0x66,0x66,0x66,0xdb,0xe0\n"
	       ".endr");
}

/* 6-byte NOPs */
BUILD_NOP(k8_nop6, K8_NOP6)
BUILD_NOP(p6_nop6, P6_NOP6)

/* 7-byte NOPs */
BUILD_NOP(k8_nop7, K8_NOP7)
BUILD_NOP(p6_nop7, P6_NOP7)

/* 8-byte NOPs */
BUILD_NOP(k8_nop8, K8_NOP8)
BUILD_NOP(p6_nop8, P6_NOP8)

struct test_list {
  const char *name;
  void (*func)(void);
};

static const struct test_list tests[] = {
  { "P6 NOPs (NOPL)", p6_nop5 },
  { "K8 NOPs (66 90)", nop_k8 },
  { "LEA", nop_lea },
  { "XCHG", nop_xchg },
  { "MOV", nop_mov },
  { "FDISI", nop_fdisi },
  { "FENI", nop_feni },
  { "E9 JMP", nop_jmp5 },
  { "EB JMP", nop_jmp2 },
  { NULL, NULL }
};

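#define TIMES 30	/* timed runs per test; the minimum is reported */
static void benchmark(const struct test_list *test, const int reps, bool warmup)
{
	u64 p1, p2, r;
	double min = 10000000000;	/* sentinel, assumed larger than any single run */
	int i, j;

	for (j = 0; j < TIMES; j++) {
		/* Time @reps back-to-back calls of the 1000-NOP function. */
		p1 = read_tsc();
		for (i = 0; i < reps; i++)
			test->func();
		p2 = read_tsc();

		r = (p2 - p1);

		/* Keep the fastest run to filter out interrupts and other noise. */
		if (r < min)
			min = r;
	}

	/* Cycles per call, i.e. per 1000 NOPs plus call and loop overhead. */
	if (!warmup)
		printf("%24s%15f cycles\n", test->name, min/reps);
}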

static const struct test_list k8_nops[] = {
	{ NULL, NULL },
	{ "90", k8_nop1 },
	{ "66 90", k8_nop2 },
	{ "66 66 90", k8_nop3 },
	{ "66 66 66 90", k8_nop4 },
	{ "66 66 90 66 90", k8_nop5 },
	{ "66 66 90 66 66 90", k8_nop6 },
	{ "66 66 66 90 66 66 90", k8_nop7 },
	{ "66 66 66 90 66 66 66 90", k8_nop8 },
	{ NULL, NULL },
};

static const struct test_list f16h_nops[] = {
	{ NULL, NULL },
	{ "90", k8_nop1 },
	{ "66 90", k8_nop2 },
	{ "0f 1f 00", p6_nop3 },
	{ "0f 1f 40 00", p6_nop4 },
	{ "0f 1f 44 00 00", p6_nop5 },
	{ "66 0f 1f 44 00 00", p6_nop6 },
	{ "0f 1f 80 00 00 00 00", p6_nop7 },
	{ "0f 1f 84 00 00 00 00 00", p6_nop8 },
	{ NULL, NULL },
};

int main(void)
{
	const int reps = 1000000;
	const struct test_list *test;
	int i;

	printf("Running NOP tests, 1000 NOPs x %d repetitions\n\n", reps);

#if 0
	for (test = tests; test->func; test++) {
		benchmark(test, reps, true);
		benchmark(test, reps, false);
	}
#endif

	printf("K8:\n");
	for (i = 1; i < 9; i++) {
		benchmark(&k8_nops[i], reps, true);
		benchmark(&k8_nops[i], reps, false);
	}
	printf("\n");

	printf("P6:\n");
	for (i = 1; i < 9; i++) {
		benchmark(&f16h_nops[i], reps, true);
		benchmark(&f16h_nops[i], reps, false);
	}
	printf("\n");

	return 0;
}

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.