From: Dave Hansen

Thanks to Andy Lutomirski for finding this.

Memory protection keys only affect data access. They do not affect
instruction fetches. So, an instruction may not be readable, while
it *is* executable. The fault prefetch checking code directly reads
instructions and might trip over an instruction made unreadable by
pkeys. Turn off pkeys temporarily so we can fetch the instruction.

===== Long explanation: =====

We have a long history of bugs with prefetch instructions faulting
when they should not. So, in the slow paths of our page fault code,
we go peeking at the instructions being run when the fault occurred
to see if the fault could have been caused by a prefetch
instruction.

To do the peeking, the kernel looks at the instruction pointer and
fetches the instruction, sometimes right out of userspace. But,
protection keys may get in the way and will make the kernel's data
access fault. The kernel will assume that the instruction pointer
was bogus and go angrily along on the way to killing the app instead
of _actually_ fetching a prefetch instruction.

Does this matter? Probably not. The hardware implementing
protection keys is not even publicly available yet. But, if it or
some future processor has a prefetch bug, this will help us properly
work around it.

This makes the instruction temporarily readable so that we can do a
data access and peek at its opcode. This operation is only visible
to the thread doing the fault.

I thought this might be painful to solve, requiring something akin
to a kernel_fpu_begin/end() pair. After thinking about it, I managed
to convince myself that we do not need that harsh of a solution and
this simple one is OK, mostly because we never do lazy save/restore
of PKRU.

This is slow-ish. The RDPKRU/WRPKRU/WRPKRU sequence probably costs
us dozens of cycles, plus the extra branches from the if(OSPKE)
checks. It also does that for each byte of the instructions that it
fetches, which is a bit naughty. But, this is a slow path already,
so I haven't tried to optimize it at all.

Signed-off-by: Dave Hansen
Cc: Andy Lutomirski
---

 b/arch/x86/mm/fault.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-900-fix-prefetch arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-900-fix-prefetch	2016-07-20 10:01:29.793625322 -0700
+++ b/arch/x86/mm/fault.c	2016-07-20 10:01:29.813626232 -0700
@@ -20,6 +20,7 @@
 #include <asm/pgalloc.h>	/* pgd_*(), ... */
 #include <asm/kmemcheck.h>	/* kmemcheck_*(), ... */
 #include <asm/fixmap.h>		/* VSYSCALL_ADDR */
+#include <asm/fpu/internal.h>	/* for use_eager_fpu() checks */
 #include <asm/vsyscall.h>	/* emulate_vsyscall */
 #include <asm/vm86.h>		/* struct vm86 */
 #include <asm/mmu_context.h>	/* vma_pkey() */
@@ -76,6 +77,52 @@ static nokprobe_inline int kprobes_fault
 }
 
 /*
+ * Memory protection keys only affect data access. They do not
+ * affect instruction fetches. So, an instruction may not be
+ * readable, while it *is* executable. This makes the
+ * instruction temporarily readable so that we can do a data
+ * access and peek at its opcode.
+ */
+static
+int probe_insn_opcode(void *insn_address, unsigned char *ret_opcode)
+{
+	int ret;
+	u32 saved_pkru = read_pkru();
+
+	/*
+	 * Clear out all of the access/write-disable bits in
+	 * PKRU. This ensures that pkeys will not block access
+	 * to @insn_address. If no keys are access-disabled
+	 * (saved_pkru==0) avoid the cost of the PKRU writes
+	 * and the continued cost of having taken it out of its
+	 * (XSAVE) init state.
+	 *
+	 * Note also that this only affects access to user
+	 * addresses. Kernel (supervisor) mappings are not
+	 * affected by this register.
+	 */
+	if (saved_pkru)
+		write_pkru(0);
+	/*
+	 * We normally have to be very careful with FPU registers
+	 * and preempt. But, PKRU is different. It is never
+	 * lazily saved/restored, so we don't have to be as
+	 * careful with this as normal FPU state. Enforce this
+	 * assumption with the WARN_ON().
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_OSPKE))
+		WARN_ON_ONCE(!use_eager_fpu());
+	ret = probe_kernel_address(insn_address, *ret_opcode);
+	/*
+	 * Restore PKRU to what it was.
+	 */
+	if (saved_pkru)
+		write_pkru(saved_pkru);
+
+	return ret;
+}
+
+/*
  * Prefetch quirks:
  *
  * 32-bit mode:
@@ -126,7 +173,7 @@ check_prefetch_opcode(struct pt_regs *re
 		return !instr_lo || (instr_lo>>1) == 1;
 	case 0x00:
 		/* Prefetch instruction is 0x0F0D or 0x0F18 */
-		if (probe_kernel_address(instr, opcode))
+		if (probe_insn_opcode(instr, &opcode))
 			return 0;
 
 		*prefetch = (instr_lo == 0xF) &&
@@ -160,7 +207,7 @@ is_prefetch(struct pt_regs *regs, unsign
 	while (instr < max_instr) {
 		unsigned char opcode;
 
-		if (probe_kernel_address(instr, opcode))
+		if (probe_insn_opcode(instr, &opcode))
 			break;
 		instr++;
_
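
Not part of the patch, but for anyone who wants to see the failure mode
from userspace: below is a minimal, hypothetical sketch that maps a page
executable and then access-disables it with a protection key. Jumping
into the page succeeds, because pkeys do not apply to instruction
fetches, while a plain data read of the very same byte takes a pkey
fault, which is exactly the situation probe_insn_opcode() has to
tolerate. The syscall numbers and the PKEY_DISABLE_ACCESS value are the
ones proposed for x86_64 by this series; treat them as assumptions, not
settled ABI, and note that all error handling is omitted for brevity.

/*
 * Hypothetical userspace demo (not part of the patch): an
 * executable-but-unreadable page courtesy of protection keys.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Syscall numbers proposed for x86_64 by this series (assumptions). */
#define SYS_pkey_mprotect	329
#define SYS_pkey_alloc		330
#define PKEY_DISABLE_ACCESS	0x1

int main(void)
{
	static const unsigned char ret_insn[] = { 0xc3 };	/* RET */
	unsigned char *page;
	int pkey;

	page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memcpy(page, ret_insn, sizeof(ret_insn));

	/*
	 * Allocate a key with all data access disabled and attach it
	 * to the page. PKRU for this thread now denies data reads and
	 * writes through that key.
	 */
	pkey = syscall(SYS_pkey_alloc, 0, PKEY_DISABLE_ACCESS);
	syscall(SYS_pkey_mprotect, page, 4096,
		PROT_READ | PROT_WRITE | PROT_EXEC, pkey);

	/* Instruction fetch: works, pkeys ignore ifetch. */
	((void (*)(void))page)();

	/* Data read of the same byte: pkey fault, SIGSEGV here. */
	printf("opcode: %#x\n", *page);
	return 0;
}

On a kernel with this series applied, the call through the function
pointer returns normally and the program only dies at the printf(),
showing that the data side alone is blocked.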