[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c553936f-ed09-577e-9f22-f249037d7186@kernel.org>
Date: Sun, 21 May 2017 13:19:32 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Roman Penyaev <roman.penyaev@...fitbricks.com>,
Andy Lutomirski <luto@...nel.org>
Cc: Mikhail Sennikovskii <mikhail.sennikovskii@...fitbricks.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Gleb Natapov <gleb@...nel.org>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC] KVM: SVM: do not drop VMCB CPL to 0 if SS is not present
On 05/21/2017 12:53 AM, Roman Penyaev wrote:
> On Sun, May 21, 2017 at 5:31 AM, Andy Lutomirski <luto@...nel.org> wrote:
>> On 05/19/2017 09:14 AM, Roman Penyaev wrote:
>>>
>>> Hi folks,
>>>
>>> After experiencing guest double faults (sometimes triple faults) on
>>> 3.16 guest kernels with the following common pattern:
>>>
>
> [cut]
>
>>>
>>> Further tracking of VMCB states before and after VMRUN showed, that
>>> CPL becomes 0 when VMEXIT happens with the following SS segment:
>>>
>>> ss = {
>>> selector = 0x2b,
>>> attrib = 0x400,
>>> limit = 0xffffffff,
>>> base = 0x0
>>> },
>>>
>>> cpl = 0x3
>>>
>>> Then on next VMRUN VMCB looks as the following:
>>>
>>> ss = {
>>> selector = 0x2b,
>>> attrib = 0x0, <<< dropped to 0
>>> limit = 0xffffffff,
>>> base = 0x0
>>> },
>>>
>>> cpl = 0x0, <<< dropped to 0
>>>
>>> Obviously it was changed between VMRUN calls. The following backtrace
>>> shows that VMCB.SAVE.CPL was set to 0 by QEMU itself:
>>>
>>> CPU: 55 PID: 59531 Comm: kvm
>>> [<ffffffffa00a3a20>] kvm_arch_vcpu_ioctl_set_sregs+0x2e0/0x480 [kvm]
>>> [<ffffffffa008ddf0>] kvm_write_guest_cached+0x540/0xc00 [kvm]
>>> [<ffffffff8107d695>] ? finish_task_switch+0x185/0x240
>>> [<ffffffff8180097c>] ? __schedule+0x28c/0xa10
>>> [<ffffffff811a9aad>] do_vfs_ioctl+0x2cd/0x4a0
>>>
>>> SS segment which came from QEMU had the following struct members:
>>>
>>> SS->base = 0
>>> SS->limit = ffffffff
>>> SS->selector = 2b
>>> SS->type = 0
>>> SS->present = 0
>>> SS->dpl = 0
>>> SS->db = 0
>>> SS->s = 0
>>> SS->l = 0
>>> SS->g = 0
>>> SS->avl = 0
>>> SS->unusable = 1
>>>
>>> Indeed, on last VMEXIT SS segment does not have (P) present bit set in
>>> segment attributes:
>>>
>>> (gdb) p 0x400 & (1 << SVM_SELECTOR_P_SHIFT)
>>> $1 = 0
>>
>>
>> Huh? How is that even possible? It should not be possible to actually run
>> the vCPU with a non-NULL SS that isn't present.
>
> That is utterly good question :) I do not know. According to my shallow
> understanding (P) bit is only a hint for CPU that corresponding segment was
> read from gdt and now is cached in private CPU registers (attributes).
> Am I right?
>
> At least what I see that it is quite often the case when we exit from VMRUN
> with segment not present then VMRUN is resumed and on next vmexit segment has
> correct attributes.
>
>> How would you cause it to happen?
>
> We run fio and iperf tests in guests for a couple of days. Nothing more,
> nothing special. Guests are bare debians with 3.16 kernels.
>
>>
>> Unless... is this the sysret_ss_attrs issue?
>
> What is the issue? This one
>
> https://lkml.org/lkml/2015/4/24/770
Yes.
But I was thinking about it wrong, since this is probably 64-bit
userspace, not 32-bit userspace. Here's my theory:
1. User task A does a syscall. It's not in kernel mode with SS != 0.
2. The scheduler runs and switches to task B. SS != 0.
2. Kernel enters user mode for task B.
3. User task B gets interrupted. Kernel ends up running with SS = 0.
4. Kernel switches back to task A. SS == 0.
5. Kernel does SYSRET. SS == __USER_DS, but SS's attributes are messed up.
6. QEMU does whatever it does that inspires it to zap SS's attributes.
7. Boom.
If task B were 32-bit, then the vDSO would fix up SS, so there would
only be a 1-instruction window for problems.
To check this theory, you could try backporting this to the guest and
seeing if the problem goes away:
commit 61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
Author: Andy Lutomirski <luto@...nel.org>
Date: Sun Apr 26 16:47:59 2015 -0700
x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue
>>>
>>> So when on VMEXIT we have such SS state (P bit is not set) and QEMU
>>> just decides to synchronize registers the following happens on QEMU
>>> side:
>>>
>>> kvm_cpu_synchronize_state():
>>> kvm_arch_get_registers():
>>> ...
>>> get_seg():
>>> if (rhs->unusable) {
>>> lhs->flags = 0; <<< SS is unusable [(P) is not set)
>>> <<< all attributes are dropped to 0.
>>> }
>>> cpu->kvm_vcpu_dirty = true; <<< Mark VCPU state as dirty
>>>
>>
>> Looks like the bug is in QEMU, then, right?
>
> KVM SVM restores CPL from unusable selector, obviously this is not nice.
I would imagine that QEMU shouldn't be feeding KVM such a selector.
Also, there's an invariant that SS.DPL == CPL, at least most of the
time, although this SYSRET issue may be the exception.
Paolo, what's the intended behavior here? Is the bug in KVM or in QEMU?
>
> arch/x86/kvm/svm.c:svm_set_segment():
>
> if (var->unusable)
> s->attrib = 0;
> ...
> if (seg == VCPU_SREG_SS)
> svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
>
>
> Meanwhile QEMU resets attributes, despite the fact that DPL (which is passed
> from KVM) is correct.
>
> So it is not clear what is the proper way to fix that. What is clear is
> that CPL is set to 0 because of this game with registers on both sides.
> Now the question is what side to fix or probably both.
>
>
>> Couldn't you just fix this code
>> in QEMU by, say, deleting it?
>
> Certainly, but would be nice to listen to KVM maintainers. At least the issue
> is clear and what is left is a proper one-line fix :)
--Andy
Powered by blists - more mailing lists