Date:   Thu, 30 Nov 2017 18:33:45 -0200
From:   Eduardo Habkost <ehabkost@...hat.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Wanpeng Li <kernellwp@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        kvm <kvm@...r.kernel.org>, yfu@...hat.com
Subject: Re: [PATCH] KVM: x86: inject exceptions produced by x86_decode_insn

On Wed, Nov 29, 2017 at 04:42:16PM -0200, Eduardo Habkost wrote:
> On Wed, Nov 29, 2017 at 12:44:42PM +0100, Paolo Bonzini wrote:
> > On 29/11/2017 12:44, Eduardo Habkost wrote:
> > > On Mon, Nov 13, 2017 at 09:32:09AM +0100, Paolo Bonzini wrote:
> > >> On 13/11/2017 08:15, Wanpeng Li wrote:
> > >>> 2017-11-10 17:49 GMT+08:00 Paolo Bonzini <pbonzini@...hat.com>:
> > >>>> Sometimes, a processor might execute an instruction while another
> > >>>> processor is updating the page tables for that instruction's code page,
> > >>>> but before the TLB shootdown completes.  The interesting case happens
> > >>>> if the page is in the TLB.
> > >>>>
> > >>>> In general, the processor will succeed in executing the instruction and
> > >>>> nothing bad happens.  However, what if the instruction is an MMIO access?
> > >>>> If *that* happens, KVM invokes the emulator, and the emulator gets the
> > >>>> updated page tables.  If the update side had marked the code page as non
> > >>>> present, the page table walk then will fail and so will x86_decode_insn.
> > >>>>
> > >>>> Unfortunately, even though kvm_fetch_guest_virt is correctly returning
> > >>>> X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
> > >>>> a fatal error if the instruction cannot simply be reexecuted (as is the
> > >>>> case for MMIO).  And this in fact happened sometimes when rebooting
> > >>>> Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
> > >>>> the exception if true is enough to fix the case.
> > >>>
> > >>> I found the only place which can set ctxt->have_exception is in the
> > >>> function x86_emulate_insn(), and x86_decode_insn() will not set
> > >>> ctxt->have_exception even if kvm_fetch_guest_virt() returns
> > >>> X86EMUL_PROPAGATE_FAULT.
> > >>
> > >> Hmm, you're right.  Looks like Yanan has been (un)lucky when trying out
> > >> this patch! :(
> > >>
> > >> Yanan, can you double check that you can reproduce the issue with an
> > >> unpatched kernel?  I will work on a kvm-unit-tests testcase.
> > > 
> > > We don't have a kvm-unit-tests reproducer for this yet, right?
> > > 
> > > I'm considering trying to write one, but I don't want to
> > > duplicate work.
> > 
> > No, I haven't written one yet.
> 
> The reproducer (not a full test case) is quite simple, see patch below.
> 
> Now, I've noticed something interesting when running the
> reproducer:

There's something else that makes the bug hard to reproduce: as
soon as I set RSP to a valid address in inregs before calling
trap_emulator(), the bug is not reproducible anymore.

But if I keep RSP=0, I won't be able to validate the bug fix
because I won't be able to configure a working #PF handler.

This alone makes the bug not reproducible anymore:

diff --git a/x86/emulator.c b/x86/emulator.c
index 72cb035..a7e61ff 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -1104,6 +1104,8 @@ static void test_illegal_movbe(void)

 static void test_fetch_failure(void *mem, void *alt_insn_page)
 {
+       void *stack = alloc_page();
+       inregs = (struct regs){ .rsp = (u64)stack+1024 };
        trap_emulator(mem, NULL, NULL);
 }


This is what I see:

When we don't have a stack (inregs.rsp=0),
reexecute_instruction() is preventing the emulation failure from
happening on the I/O instruction VM exits, and KVM keeps entering
the VM in a loop (getting thousands of I/O instruction VM exits)
until we finally get an EPT misconfig VM exit on GVA
0xfffffffffffffff8.
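
One plausible reading of that address, not confirmed anywhere in this
thread: a 64-bit push decrements RSP by 8 before writing, so with RSP=0
the first push wraps around to the top of the address space. The sketch
below only illustrates that arithmetic; push_target() is a hypothetical
helper, not something from kvm-unit-tests or KVM.

```c
#include <stdint.h>

/*
 * Hypothetical helper: address written by a single 64-bit push.
 * The CPU decrements RSP by 8 first; unsigned arithmetic wraps
 * modulo 2^64, exactly like the register does.
 */
uint64_t push_target(uint64_t rsp)
{
	return rsp - 8;
}
```

With rsp = 0 this yields 0xfffffffffffffff8, the GVA reported above.
With a stack pointer inside an allocated page, the result stays inside
that page instead.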

When we set up inregs.rsp, reexecute_instruction() also prevents
the emulation from failing on the I/O instruction VM exits, but
instead of an EPT misconfig VM exit, we get an EPT violation VM
exit after a few thousand iterations, and the page fault is
delivered to the VCPU.
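
The ordering discussed in this thread (retry via
reexecute_instruction() first, then inject a pending exception if the
patch is applied, otherwise report an emulation failure) can be
modelled as a toy decision function. This is a sketch under stated
assumptions, not KVM code: the enum, the function name and the boolean
inputs are all hypothetical, and whether have_exception is actually set
on a failed decode is exactly the open question raised earlier in the
thread.

```c
#include <stdbool.h>

/* Hypothetical outcome names for this sketch; not KVM identifiers. */
enum decode_fail_outcome {
	OUTCOME_REENTER_GUEST,     /* let the guest retry the instruction */
	OUTCOME_INJECT_EXCEPTION,  /* deliver the pending fault, e.g. #PF */
	OUTCOME_EMULATION_FAILURE, /* surfaces as "KVM internal error" */
};

/*
 * Toy model of what happens after instruction decode fails.
 * can_reexecute:  stands in for reexecute_instruction() succeeding
 * have_exception: stands in for ctxt->have_exception
 * patched:        whether the proposed have_exception check exists
 */
enum decode_fail_outcome
on_decode_failure(bool can_reexecute, bool have_exception, bool patched)
{
	if (can_reexecute)
		return OUTCOME_REENTER_GUEST;
	if (patched && have_exception)
		return OUTCOME_INJECT_EXCEPTION;
	return OUTCOME_EMULATION_FAILURE;
}
```

In this model the thousands of I/O instruction exits correspond to
repeated OUTCOME_REENTER_GUEST results; the patch only changes the
outcome once re-execution is no longer possible.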

I don't know why KVM loops so many times on I/O instruction VM
exits before finally getting an emulation failure (or finally
delivering a page fault, if a stack is available), but this might
explain why the bug is so hard to reproduce under normal
circumstances.
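
For reading the "Suberror: 3" output quoted further down: assuming
extra data[0] is the VMX IDT-vectoring information field, the layout
documented in the Intel SDM (bits 7:0 vector, bits 10:8 type, bit 11
error-code valid, bit 31 valid) can be unpacked as below. The struct
and function names are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields. */
struct vectoring_info {
	uint8_t vector;      /* bits 7:0, interrupt/exception vector */
	uint8_t type;        /* bits 10:8, 3 means hardware exception */
	bool has_error_code; /* bit 11 */
	bool valid;          /* bit 31 */
};

/* Split a raw 32-bit IDT-vectoring information value into fields. */
struct vectoring_info decode_vectoring_info(uint32_t raw)
{
	struct vectoring_info vi = {
		.vector = raw & 0xff,
		.type = (raw >> 8) & 0x7,
		.has_error_code = (raw >> 11) & 1,
		.valid = (raw >> 31) & 1,
	};
	return vi;
}
```

For 0x80000b0e this gives a valid hardware exception (type 3) with
vector 14, i.e. #PF, with an error code. If extra data[1] is the exit
reason printed in hex, 0x31 is 49, which the SDM lists as EPT
misconfiguration, consistent with the description above.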



> 
> If the test_fetch_failure() call happens before we touch
> pci-testdev through *mem (like in the patch below), we get an
> emulation failure like the one Yanan saw:
> 
>   $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.RCPjppRp8i
>   enabling apic
>   paging enabled
>   cr0 = 80010011
>   cr3 = 45e000
>   cr4 = 20
>   KVM internal error. Suberror: 1
>   emulation failure
>   RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
>   RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
>   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>   R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>   RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>   ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>   SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>   GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
>   LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>   TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
>   GDT=     000000000041100a 0000047f
>   IDT=     0000000000000000 00000fff
>   CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
>   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>   DR6=00000000ffff0ff0 DR7=0000000000000400
>   EFER=0000000000000500
>   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
> 
> but if I call test_fetch_failure() after touching *mem, like this:
> 
>     diff --git a/x86/emulator.c b/x86/emulator.c
>     index 977ec75..72cb035 100644
>     --- a/x86/emulator.c
>     +++ b/x86/emulator.c
>     @@ -1124,7 +1124,6 @@ int main()
>             alt_insn_page = alloc_page();
>             insn_ram = vmap(virt_to_phys(insn_page), 4096);
>     
>     -       test_fetch_failure(mem, alt_insn_page);
>     
>             // test mov reg, r/m and mov r/m, reg
>             t1 = 0x123456789abcdef;
>     @@ -1135,6 +1134,8 @@ int main()
>                          : "memory");
>             report("mov reg, r/m (1)", t2 == 0x123456789abcdef);
>     
>     +       test_fetch_failure(mem, alt_insn_page);
>     +
>             test_simplealu(mem);
>             test_cmps(mem);
>             test_scas(mem);
> 
> then I get a KVM_INTERNAL_ERROR_DELIVERY_EV:
> 
>     $ /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel ./x86/emulator.flat # -initrd /tmp/tmp.lmXZa46TEA
>     enabling apic
>     paging enabled
>     cr0 = 80010011
>     cr3 = 45e000
>     cr4 = 20
>     PASS: mov reg, r/m (1)
>     KVM internal error. Suberror: 3
>     extra data[0]: 80000b0e
>     extra data[1]: 31
>     extra data[2]: 182
>     extra data[3]: ff000ff8
>     RAX=0000000000000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000000
>     RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=0000000000000000
>     R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>     R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>     RIP=ffffffffffffc08a RFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>     ES =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     CS =0008 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>     SS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     DS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     FS =0010 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>     GS =0010 0000000000454d60 ffffffff 00c09300 DPL=0 DS   [-WA]
>     LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>     TR =0080 000000000041148a 0000ffff 00008b00 DPL=0 TSS64-busy
>     GDT=     000000000041100a 0000047f
>     IDT=     0000000000000000 00000fff
>     CR0=80010011 CR2=ffffffffffffc08a CR3=000000000045e000 CR4=00000020
>     DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>     DR6=00000000ffff0ff0 DR7=0000000000000400
>     EFER=0000000000000500
>     Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>     ^C
> 
> Also, if I run the reproducer using ept=0, it gets stuck in a
> loop re-entering the same "in (%dx),%al" instruction over and
> over again.  trace-cmd report output:
> 
>     qemu-system-x86-18185 [001] 1057573.830491: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830494: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830503: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830504: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830505: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830506: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830507: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830508: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830509: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830510: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830511: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830511: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830512: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830513: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830514: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830514: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830515: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830516: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830517: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830518: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830519: kvm_entry:            vcpu 0
>     qemu-system-x86-18185 [001] 1057573.830521: kvm_exit:             reason IO_INSTRUCTION rip 0xffffffffffffc08a info 8 0
>     qemu-system-x86-18185 [001] 1057573.830522: kvm_emulate_insn:     0:ffffffffffffc08a: 4d 89 2c 24
>     qemu-system-x86-18185 [001] 1057573.830523: kvm_entry:            vcpu 0
>     [...]
> 
> Signed-off-by: Eduardo Habkost <ehabkost@...hat.com>
> ---
>  x86/emulator.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/x86/emulator.c b/x86/emulator.c
> index e6f27cc..977ec75 100644
> --- a/x86/emulator.c
> +++ b/x86/emulator.c
> @@ -792,9 +792,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	extern u8 insn_page[], test_insn[];
>  
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
> -	memcpy(alt_insn_page, insn_page, 4096);
> -	memcpy(alt_insn_page + (test_insn - insn_page),
> -			(void *)(alt_insn->ptr), alt_insn->len);
> +	if (alt_insn_page) {
> +		memcpy(alt_insn_page, insn_page, 4096);
> +		memcpy(alt_insn_page + (test_insn - insn_page),
> +				(void *)(alt_insn->ptr), alt_insn->len);
> +	}
>  	save = inregs;
>  
>  	/* Load the code TLB with insn_page, but point the page tables at
> @@ -805,7 +807,11 @@ static void trap_emulator(uint64_t *mem, void *alt_insn_page,
>  	invlpg(insn_ram);
>  	/* Load code TLB */
>  	asm volatile("call *%0" : : "r"(insn_ram));
> -	install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	if (alt_insn_page) {
> +		install_page(cr3, virt_to_phys(alt_insn_page), insn_ram);
> +	} else {
> +		install_pte(cr3, 1, insn_ram, PT_USER_MASK, 0);
> +	}
>  	/* Trap, let hypervisor emulate at alt_insn_page */
>  	asm volatile("call *%0": : "r"(insn_ram+1));
>  
> @@ -1096,6 +1102,11 @@ static void test_illegal_movbe(void)
>  	handle_exception(UD_VECTOR, 0);
>  }
>  
> +static void test_fetch_failure(void *mem, void *alt_insn_page)
> +{
> +	trap_emulator(mem, NULL, NULL);
> +}
> +
>  int main()
>  {
>  	void *mem;
> @@ -1113,6 +1124,8 @@ int main()
>  	alt_insn_page = alloc_page();
>  	insn_ram = vmap(virt_to_phys(insn_page), 4096);
>  
> +	test_fetch_failure(mem, alt_insn_page);
> +
>  	// test mov reg, r/m and mov r/m, reg
>  	t1 = 0x123456789abcdef;
>  	asm volatile("mov %[t1], (%[mem]) \n\t"
> -- 
> 2.13.6
> 
> 
> -- 
> Eduardo

-- 
Eduardo
