[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <85ec3bfd-e46e-a6fa-b530-4dc87f0c7169@redhat.com>
Date: Thu, 8 Oct 2020 08:46:58 -0400
From: Cathy Avery <cavery@...hat.com>
To: Maxim Levitsky <mlevitsk@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Cc: vkuznets@...hat.com, wei.huang2@....com
Subject: Re: [PATCH] KVM: SVM: Use a separate vmcb for the nested L2 guest
On 10/8/20 6:54 AM, Maxim Levitsky wrote:
> On Thu, 2020-10-08 at 13:39 +0300, Maxim Levitsky wrote:
>> On Thu, 2020-10-08 at 13:23 +0300, Maxim Levitsky wrote:
>>> On Thu, 2020-10-08 at 07:52 +0200, Paolo Bonzini wrote:
>>>> On 08/10/20 00:14, Maxim Levitsky wrote:
>>>>>> + if (svm->vmcb01->control.asid == 0)
>>>>>> + svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
>>>>> I think that the above should be done always. The asid field is currently host
>>>>> controlled only (that is L2 value is ignored, selective ASID tlb flush is not
>>>>> advertized to the guest and lnvlpga is emulated as invlpg).
>>>> Yes, in fact I suggested that ASID should be in svm->asid and moved to
>>>> svm->vmcb->asid in svm_vcpu_run. Then there's no need to special case
>>>> it in nested code.
>>> This makes lot of sense!
>>>> This should be a patch coming before this one.
>>>>
>>>>> 1. Something wrong with memory types - like guest is using UC memory for everything.
>>>>> I can't completely rule that out yet
>>>> You can print g_pat and see if it is all zeroes.
>>> I don't even need to print it. I know that it is never set anywhere, unless guest writes it,
>>> but now that I look at it, we set it to a default value and there is no code to set it to
>>> default value for vmcb02. This is it. now my fedora guest boots just fine!
>>>
>>> I played a lot with g_pat, and yet this didn't occur to me . I was that close :-(
>>> I knew that it has to be something with memory types, but it never occured to me
>>> that guest just doesn't write IA32_PAT and uses our value which we set in init_vmcb
>>>
>>>
>>>> In general I think it's better to be explicit with vmcb01 vs. vmcb02,
>>>> like Cathy did, but I can see it's a matter of personal preference to
>>>> some extent.
>>> I also think so in general, but in the code that is outside 'is_guest_mode'
>>> IMHO it is better to refer to vmcb01 as vmcb, although now that I think of
>>> it, its not wrong to do so either.
>>>
>>>
>>> My windows hyper-v guest doesn't boot though and I know why.
>>>
>>> As we know the vmcb save area has extra state which vmrun/vmexit don't touch.
>>> Now suppose a nested hypervisor wants to enter a nested guest.
>>>
>>> It will do vmload, which will load the extra state from the nested vmcb (vmcb12
>>> or as I woudl say the vmcb that nested hypervisor thinks that it is using),
>>> to the CPU. This can cause some vmexits I think, but this doesn't matter much.
>>>
>>> Now the nested hypervisor does vmrun. The extra state of L2 guest is in CPU registers,
>>> and it is untouched. We do vmsave on vmcb01 to preserve that state, but later
>>> when we do vmload on vmcb02 prior to vmenter on it, which loads stale state from it.
>>> The same issue happens the other way around on nested vmexit.
>>>
>>> I fixed this by doing nested_svm_vmloadsave, but that should be probably be
>>> optimized with dirty bits. Now though I guess the goal it to make
>>> it work first.
>>>
>>> With this fixed HyperV boots fine, and even passes the 'works' test of booting
>>> the windows 10 with hyperv enabled nested itself and starting the vm inside,
>>> which makes that VM L3 (in addition to windows itself that runs as L3 in relation to hyper-v)
>>>
>>> https://i.imgur.com/sRYqtVV.png
>>>
>>> In summary this is the diff of fixes (just pasted to email, probably mangled):
>>>
>>>
>>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>>> index 0a06e62010d8c..7293ba23b3cbc 100644
>>> --- a/arch/x86/kvm/svm/nested.c
>>> +++ b/arch/x86/kvm/svm/nested.c
>>> @@ -436,6 +436,9 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
>>> WARN_ON(svm->vmcb == svm->nested.vmcb02);
>>>
>>> svm->nested.vmcb02->control = svm->vmcb01->control;
>>> +
>>> + nested_svm_vmloadsave(svm->vmcb01, svm->nested.vmcb02);
>>> +
>>> svm->vmcb = svm->nested.vmcb02;
>>> svm->vmcb_pa = svm->nested.vmcb02_pa;
>>> load_nested_vmcb_control(svm, &nested_vmcb->control);
>>> @@ -622,6 +625,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>> if (svm->vmcb01->control.asid == 0)
>>> svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
>>>
>>> + nested_svm_vmloadsave(svm->nested.vmcb02, svm->vmcb01);
>>> svm->vmcb = svm->vmcb01;
>>> svm->vmcb_pa = svm->nested.vmcb01_pa;
>>>
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index b66239b26885d..ee9f87fe611f2 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -1097,6 +1097,7 @@ static void init_vmcb(struct vcpu_svm *svm)
>>> clr_cr_intercept(svm, INTERCEPT_CR3_READ);
>>> clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
>>> save->g_pat = svm->vcpu.arch.pat;
>>> + svm->nested.vmcb02->save.g_pat = svm->vcpu.arch.pat;
I had noticed that g_pat was not set in vmcb02 and set it to vmcb01's
value which was not helpful but I did not try the current vcpu value.
I am getting a #UD which I suspected had something to do with cr3 but
I'll know more after I add your suggestions.
emu-system-x86-1647 [033] .... 3167.589402: kvm_nested_vmexit_inject:
reason: UD excp ext_inf1: 0x0000000000000000 ext_inf2:
0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>>> save->cr3 = 0;
>>> save->cr4 = 0;
>>> }
>>>
>>>
>>>
>>> Best regards,
>>> Maxim Levitsky
>>>
>>>> Paolo
>>>>
>> And another thing I spotted before I forget.
>>
>> If we setup a tlb flush in ctl.tlb_ctl of vmcb01, just prior to nested vmentry
>> then this field will be copied to vmcb02 but on next vmexit we clear it in current
>> (that is vmcb02) and that change will not propogate to vmcb01.
ctl.tlb_ctl is dependent on the value of save.cr4 which was not being
set in vmcb02.
>>
>> I am not sure if this is a theorerical issue or not. We probably should apply the same treatment to
>> it as what Paulo suggested to do with asid -
>> set it just prior to vmentry if tlb flush is needed, and clear it afterwards as we do.
> And yet another thing to note is that we curently ignore L2's g_pat. However it _seems_ that we practically
> ignore L1 PAT as well in regard to shadowing NPT mmu. I am not 100% sure about this.
> I'll dig that area eventually.
>
> Best regards,
> Maxim Levitsky
>
>> Best regards,
>> Maxim Levitsky
>
Thanks,
Cathy
Powered by blists - more mailing lists