linux-kernel - Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1b02a87e-3dd9-24eb-7960-2a2aa5a18515@kernel.org>
Date:   Sat, 20 May 2017 20:31:42 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Roman Penyaev <roman.penyaev@...fitbricks.com>,
        Mikhail Sennikovskii <mikhail.sennikovskii@...fitbricks.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Gleb Natapov <gleb@...nel.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>
Subject: Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

On 05/19/2017 09:14 AM, Roman Penyaev wrote:
> Hi folks,
> 
> After experiencing guest double faults (sometimes triple faults) on
> 3.16 guest kernels with the following common pattern:
> 
> [459395.776124] PANIC: double fault, error_code: 0x0
> [459395.776606] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
> [459395.776610] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
> [459395.776614] task: ffff880009ca06b0 ti: ffff88003cbc2000 task.ti:
> ffff88003cbc2000
> [459395.776617] RIP: 0010:[<ffffffff81048ecf>]  [<ffffffff81048ecf>]
> __do_page_fault+0x1f/0x540
> [459395.776628] RSP: 002b:00007ffe0bc9bfa8  EFLAGS: 00010012
> [459395.776631] RAX: 0000000081539927 RBX: 0000000000000000 RCX:
> ffffffff81539927
> [459395.776634] RDX: 0000000000000028 RSI: 0000000000000000 RDI:
> 00007ffe0bc9c0a8
> [459395.776637] RBP: 00007ffe0bc9c0a8 R08: 0001a1d1002e9400 R09:
> 0000000000063f1b
> [459395.776640] R10: 0000000033f8356c R11: 000029c8250c3103 R12:
> 0000000000000028
> [459395.776642] R13: 00007ff8c83e0000 R14: 0000000000000000 R15:
> 00007ffe0bc9c7c0
> [459395.776649] FS:  00007ff8d2aaa7c0(0000) GS:ffff88003f400000(0000)
> knlGS:0000000000000000
> [459395.776651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [459395.776656] CR2: 00007ffe0bc9bf98 CR3: 000000003ca46000 CR4:
> 00000000000407f0
> [459395.776658] Stack:
> [459395.776661]  0000000000000000 0000000000000000 0000000000000000
> 0000000000000000
> [459395.776666]  0000000000000000 0000000000000000 0000000000000000
> 0000000000000000
> [459395.776670]  0000000000000000 0000000000000000 0000000000000000
> [459395.776674] Call Trace:
> [459395.776676]  <UNK>
> [459395.776678] Code:
> [459395.776680] ad 8c 4e 00 be 04 00 03 00 eb a8 90 66 66 66 66 90 41
> 57 41 56 41 55 41 54 49 89 d4 55 53 48 89 fd 48 89 f3 48 81 ec c8 00
> 00 00 <65> 48 8b 04 25 28 00 00 00 48 89 84 24 c0 00 00 00 31 c0 65 48
> [459395.776716] Kernel panic - not syncing: Machine halted.
> [459395.777172] CPU: 0 PID: 36565 Comm: fio Not tainted 3.16.39kmemleak #4
> [459395.777673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.7.5-20140531_083029-gandalf 04/01/2014
> [459395.778373]  0000000000000086 00000000d85f6336 ffffffff81532ec9
> ffffffff8170203e
> [459395.779865]  ffff88003f402f18 ffffffff815318a1 00007ff800000008
> ffff88003f402f28
> [459395.780061]  ffff88003f402ec0 00000000d85f6336 00007fff00000008
> 0000000000000046
> [459395.780061] Call Trace:
> [459395.780061]  <#DF>  [<ffffffff81532ec9>] ? dump_stack+0x47/0x5a
> [459395.780061]  [<ffffffff815318a1>] ? panic+0xcf/0x206
> [459395.780061]  [<ffffffff8104234d>] ? df_debug+0x2d/0x30
> [459395.780061]  [<ffffffff81004f78>] ? do_double_fault+0x78/0xf0
> [459395.780061]  [<ffffffff8153a4f2>] ? double_fault+0x22/0x30
> [459395.780061]  [<ffffffff81539927>] ? native_iret+0x7/0x7
> [459395.780061]  [<ffffffff81048ecf>] ? __do_page_fault+0x1f/0x540
> 
> we found out that all kernel backtraces have userspace RSP, where
> userspace memory has normal timer, page fault or virtio interrupts
> trail:
> 
> (the following RSP pointer does not belong to this particular crash
>   above, but it does not matter, symptoms are always the same)
> 
> crash> rd -s 00007f6cb9556768 100
>      7f6cb9556768:  00007f6cfaa21270 00007f6cfaa21270
>      7f6cb9556778:  0000000000000000 00007f6cf9b8c6a0
>      7f6cb9556788:  00007f6cf983399a 0000000000000000
>      7f6cb9556798:  0000000000000000 00007f6cf98a1f2d
>      7f6cb95567a8:  00007f6cfaa21270 0000000000000000
>      7f6cb95567b8:  00007f6ca4031880
>                                      ffffffffffffff7e  IRQ,
> ~0xffffffffffffff7e = 0x81
>      7f6cb95567c8:  00007f6cfa817ae1 0000000000000033  RIP; CS
>      7f6cb95567d8:  0000000000000202 00007f6cb95567f0  EFLAGS; RSP
>      7f6cb95567e8:  000000000000002b                   SS
>                                      00007f6cb00318e0
>      7f6cb95567f8:  00007f6cfa817af5 00007f6cac0318e0
>      7f6cb9556808:  00007f6cfa817af5 00007f6cb4031880
>      7f6cb9556818:  00007f6cfa817af5 00007f6cfbe82340
>      7f6cb9556828:  00007f6cfa817af5 00007f6c940318c0
>      7f6cb9556838:  00007f6cfa817af5 00007f6ca00318c0
>      7f6cb9556848:  00007f6cfa817af5 00007f6c9c031920
>      7f6cb9556858:  00007f6cfa817af5 00007f6ca8031920
>      7f6cb9556868:  00007f6cfa817af5 00007f6ca4039df0
>      7f6cb9556878:  00007f6cfa817af5 00007f6cb0039e50
>      7f6cb9556888:  00007f6cfa817af5 00007f6cac039e50
> 
> It turned out to be that CPU does not change SS:RSP and took interrupt
> on userspace stack (BTW init_tss and gdb_page are not corrupted).
> That is completely weird.
> 
> Next step was to trace VMCB before and after VMRUN to understand the
> exact state seen by real CPU.  VMCB was traced when wrong CPU state is
> observed: RIP is to kernel and RSP is from userspace.  The following
> is a diff of VMCB, where
> --- is the state before VMRUN and
> +++ is the state after VMRUN:
> 
>   -      event_inj = 0x80000081,
>   +      event_inj = 0x0,
> 
>          ...
> 
>          cs = {
>   -        selector = 0x33,
>   -        attrib = 0xafb,
>   +        selector = 0x10,
>   +        attrib = 0x29b,
>            limit = 0xffffffff,
>            base = 0x0
>          },
>          ss = {
>            selector = 0x2b,
>            attrib = 0x0,
>            limit = 0xffffffff,
>            base = 0x0
>          },
> 
>          ...
> 
>          cpl = 0x0,              <<<<<< WTF?
> 
>          ...
> 
>   -      rip = 0x7f6cfa817ae1,
>   +      rip = 0xffffffff8152b690,
> 
>          ...
> 
>   -      rsp = 0x7f6cb95567f0,
>   +      rsp = 0x7f6cb9556768,
> 
> 
> The execution scenario is the following:
> 
>    1. Userspace code was suspended.
>    2. Virtio 0x80000081 interrupt was injected and VMRUN called again.
>    3. HW CPU took virtual interrupt, but did not switch the stack because
>       CPL *was* already set to 0.
>    4. KABOOOOOM.
> 
> So the stack switch does not happen because "the processor performs an
> automatic stack switch when a control transfer causes a change in
> privilege levels to occur" (AMD manual), and in this case privilege
> level was not changed.
> 
> The question remains who and why changed CPL to 0?
> 
> Further tracking of VMCB states before and after VMRUN showed, that
> CPL becomes 0 when VMEXIT happens with the following SS segment:
> 
>            ss = {
>              selector = 0x2b,
>              attrib = 0x400,
>              limit = 0xffffffff,
>              base = 0x0
>            },
> 
>            cpl = 0x3
> 
> Then on next VMRUN VMCB looks as the following:
> 
>            ss = {
>              selector = 0x2b,
>              attrib = 0x0,            <<< dropped to 0
>              limit = 0xffffffff,
>              base = 0x0
>            },
> 
>            cpl = 0x0,                 <<< dropped to 0
> 
> Obviously it was changed between VMRUN calls.  The following backtrace
> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
> 
>    CPU: 55 PID: 59531 Comm: kvm
>    [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>    [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>    [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>    [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>    [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
> 
> SS segment which came from QEMU had the following struct members:
> 
>         SS->base      = 0
>         SS->limit     = ffffffff
>         SS->selector  = 2b
>         SS->type      = 0
>         SS->present   = 0
>         SS->dpl       = 0
>         SS->db        = 0
>         SS->s         = 0
>         SS->l         = 0
>         SS->g         = 0
>         SS->avl       = 0
>         SS->unusable  = 1
> 
> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
> segment attributes:
> 
>     (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>     $1 = 0

Huh?  How is that even possible?  It should not be possible to actually 
run the vCPU with a non-NULL SS that isn't present.  How would you cause 
it to happen?

Unless... is this the sysret_ss_attrs issue?  This is a 3.16.something 
guest, and maybe the sysret_ss_attrs code never got backported.  I bet 
that the user code running when the virtio interrupt hits is in the vDSO.

> 
> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
> just decides to synchronize registers the following happens on QEMU
> side:
> 
>     kvm_cpu_synchronize_state():
>         kvm_arch_get_registers():
>             ...
>             get_seg():
>                if (rhs->unusable) {
>                    lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>                                       <<< all attributes are dropped to 0.
>                 }
>         cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
> 

Looks like the bug is in QEMU, then, right?  Couldn't you just fix this 
code in QEMU by, say, deleting it?  If it's actually needed for some 
reason, then at least sanitize the flags correctly rather than 
corrupting the DPL.  (I'm guessing the real issue here is that migration 
from AMD to Intel fails without this, but migration from AMD to Intel is 
highly dubious regardless.)

--Andy