linux-kernel - Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c553936f-ed09-577e-9f22-f249037d7186@kernel.org>
Date:   Sun, 21 May 2017 13:19:32 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Roman Penyaev <roman.penyaev@...fitbricks.com>,
        Andy Lutomirski <luto@...nel.org>
Cc:     Mikhail Sennikovskii <mikhail.sennikovskii@...fitbricks.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Gleb Natapov <gleb@...nel.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present

On 05/21/2017 12:53 AM, Roman Penyaev wrote:
> On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@...nel.org> wrote:
>> On 05/19/2017 09:14 AM, Roman Penyaev wrote:
>>>
>>> Hi folks,
>>>
>>> After experiencing guest double faults (sometimes triple faults) on
>>> 3.16 guest kernels with the following common pattern:
>>>
> 
> [cut]
> 
>>>
>>> Further tracking of VMCB states before and after VMRUN showed, that
>>> CPL becomes 0 when VMEXIT happens with the following SS segment:
>>>
>>>             ss = {
>>>               selector = 0x2b,
>>>               attrib = 0x400,
>>>               limit = 0xffffffff,
>>>               base = 0x0
>>>             },
>>>
>>>             cpl = 0x3
>>>
>>> Then on next VMRUN VMCB looks as the following:
>>>
>>>             ss = {
>>>               selector = 0x2b,
>>>               attrib = 0x0,            <<< dropped to 0
>>>               limit = 0xffffffff,
>>>               base = 0x0
>>>             },
>>>
>>>             cpl = 0x0,                 <<< dropped to 0
>>>
>>> Obviously it was changed between VMRUN calls.  The following backtrace
>>> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
>>>
>>>     CPU: 55 PID: 59531 Comm: kvm
>>>     [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>>>     [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>>>     [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>>>     [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>>>     [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
>>>
>>> SS segment which came from QEMU had the following struct members:
>>>
>>>          SS->base      = 0
>>>          SS->limit     = ffffffff
>>>          SS->selector  = 2b
>>>          SS->type      = 0
>>>          SS->present   = 0
>>>          SS->dpl       = 0
>>>          SS->db        = 0
>>>          SS->s         = 0
>>>          SS->l         = 0
>>>          SS->g         = 0
>>>          SS->avl       = 0
>>>          SS->unusable  = 1
>>>
>>> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
>>> segment attributes:
>>>
>>>      (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>>>      $1 = 0
>>
>>
>> Huh?  How is that even possible?  It should not be possible to actually run
>> the vCPU with a non-NULL SS that isn't present.
> 
> That is utterly good question :)  I do not know.  According to my shallow
> understanding (P) bit is only a hint for CPU that corresponding segment was
> read from gdt and now is cached in private CPU registers (attributes).
> Am I right?
> 
> At least what I see that it is quite often the case when we exit from VMRUN
> with segment not present then VMRUN is resumed and on next vmexit segment has
> correct attributes.
> 
>> How would you cause it to happen?
> 
> We run fio and iperf tests in guests for a couple of days.  Nothing more,
> nothing special.  Guests are bare debians with 3.16 kernels.
> 
>>
>> Unless... is this the sysret_ss_attrs issue?
> 
> What is the issue?  This one
> 
> https://lkml.org/lkml/2015/4/24/770

Yes.

But I was thinking about it wrong, since this is probably 64-bit 
userspace, not 32-bit userspace.  Here's my theory:

1. User task A does a syscall.  It's not in kernel mode with SS != 0.

2. The scheduler runs and switches to task B.  SS != 0.

2. Kernel enters user mode for task B.

3. User task B gets interrupted.  Kernel ends up running with SS = 0.

4. Kernel switches back to task A.  SS == 0.

5. Kernel does SYSRET.  SS == __USER_DS, but SS's attributes are messed up.

6. QEMU does whatever it does that inspires it to zap SS's attributes.

7. Boom.

If task B were 32-bit, then the vDSO would fix up SS, so there would 
only be a 1-instruction window for problems.

To check this theory, you could try backporting this to the guest and 
seeing if the problem goes away:

commit 61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
Author: Andy Lutomirski <luto@...nel.org>
Date:   Sun Apr 26 16:47:59 2015 -0700

     x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue


>>>
>>> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
>>> just decides to synchronize registers the following happens on QEMU
>>> side:
>>>
>>>      kvm_cpu_synchronize_state():
>>>          kvm_arch_get_registers():
>>>              ...
>>>              get_seg():
>>>                 if (rhs->unusable) {
>>>                     lhs->flags = 0;    <<< SS is unusable [(P) is not set)
>>>                                        <<< all attributes are dropped to 0.
>>>                  }
>>>          cpu->kvm_vcpu_dirty = true;   <<< Mark VCPU state as dirty
>>>
>>
>> Looks like the bug is in QEMU, then, right?
> 
> KVM SVM restores CPL from unusable selector, obviously this is not nice.

I would imagine that QEMU shouldn't be feeding KVM such a selector. 
Also, there's an invariant that SS.DPL == CPL, at least most of the 
time, although this SYSRET issue may be the exception.

Paolo, what's the intended behavior here?  Is the bug in KVM or in QEMU?

> 
> arch/x86/kvm/svm.c:svm_set_segment():
> 
>     if (var->unusable)
>        s->attrib = 0;
>     ...
>     if (seg == VCPU_SREG_SS)
>         svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
> 
> 
> Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
> from KVM) is correct.
> 
> So it is not clear what is the proper way to fix that.   What is clear is
> that CPL is set to 0 because of this game with registers on both sides.
> Now the question is what side to fix or probably both.
> 
> 
>> Couldn't you just fix this code
>> in QEMU by, say, deleting it?
> 
> Certainly, but would be nice to listen to KVM maintainers.  At least the issue
> is clear and what is left is a proper one-line fix :)

--Andy