lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 8 Oct 2020 08:46:58 -0400
From:   Cathy Avery <cavery@...hat.com>
To:     Maxim Levitsky <mlevitsk@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Cc:     vkuznets@...hat.com, wei.huang2@....com
Subject: Re: [PATCH] KVM: SVM: Use a separate vmcb for the nested L2 guest

On 10/8/20 6:54 AM, Maxim Levitsky wrote:
> On Thu, 2020-10-08 at 13:39 +0300, Maxim Levitsky wrote:
>> On Thu, 2020-10-08 at 13:23 +0300, Maxim Levitsky wrote:
>>> On Thu, 2020-10-08 at 07:52 +0200, Paolo Bonzini wrote:
>>>> On 08/10/20 00:14, Maxim Levitsky wrote:
>>>>>> +	if (svm->vmcb01->control.asid == 0)
>>>>>> +		svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
>>>>> I think that the above should be done always. The asid field is currently host
>>>>> controlled only (that is L2 value is ignored, selective ASID tlb flush is not
>>>>> advertized to the guest and lnvlpga is emulated as invlpg).
>>>> Yes, in fact I suggested that ASID should be in svm->asid and moved to
>>>> svm->vmcb->asid in svm_vcpu_run.  Then there's no need to special case
>>>> it in nested code.
>>> This makes lot of sense!
>>>> This should be a patch coming before this one.
>>>>
>>>>> 1. Something wrong with memory types - like guest is using UC memory for everything.
>>>>>      I can't completely rule that out yet
>>>> You can print g_pat and see if it is all zeroes.
>>> I don't even need to print it. I know that it is never set anywhere, unless guest writes it,
>>> but now that I look at it, we set it to a default value and there is no code to set it to
>>> default value for vmcb02. This is it. now my fedora guest boots just fine!
>>>
>>> I played a lot with g_pat, and yet this didn't occur to me . I was that close :-(
>>> I knew that it has to be something with memory types, but it never occured to me
>>> that guest just doesn't write IA32_PAT and uses our value which we set in init_vmcb
>>>
>>>
>>>> In general I think it's better to be explicit with vmcb01 vs. vmcb02,
>>>> like Cathy did, but I can see it's a matter of personal preference to
>>>> some extent.
>>> I also think so in general, but in the code that is outside 'is_guest_mode'
>>> IMHO it is better to refer to vmcb01 as vmcb, although now that I think of
>>> it, its not wrong to do so either.
>>>
>>>
>>> My windows hyper-v guest doesn't boot though and I know why.
>>>
>>> As we know the vmcb save area has extra state which vmrun/vmexit don't touch.
>>> Now suppose a nested hypervisor wants to enter a nested guest.
>>>
>>> It will do vmload, which will load the extra state from the nested vmcb (vmcb12
>>> or as I woudl say the vmcb that nested hypervisor thinks that it is using),
>>> to the CPU. This can cause some vmexits I think, but this doesn't matter much.
>>>
>>> Now the nested hypervisor does vmrun. The extra state of L2 guest is in CPU registers,
>>> and it is untouched. We do vmsave on vmcb01 to preserve that state, but later
>>> when we do vmload on vmcb02 prior to vmenter on it, which loads stale state from it.
>>> The same issue happens the other way around on nested vmexit.
>>>
>>> I fixed this by doing nested_svm_vmloadsave, but that should be probably be
>>> optimized with dirty bits. Now though I guess the goal it to make
>>> it work first.
>>>
>>> With this fixed HyperV boots fine, and even passes the 'works' test of booting
>>> the windows 10 with hyperv enabled nested itself and starting the vm inside,
>>> which makes that VM L3 (in addition to windows itself that runs as L3 in relation to hyper-v)
>>>
>>> https://i.imgur.com/sRYqtVV.png
>>>
>>> In summary this is the diff of fixes (just pasted to email, probably mangled):
>>>
>>>
>>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>>> index 0a06e62010d8c..7293ba23b3cbc 100644
>>> --- a/arch/x86/kvm/svm/nested.c
>>> +++ b/arch/x86/kvm/svm/nested.c
>>> @@ -436,6 +436,9 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
>>>          WARN_ON(svm->vmcb == svm->nested.vmcb02);
>>>   
>>>          svm->nested.vmcb02->control = svm->vmcb01->control;
>>> +
>>> +       nested_svm_vmloadsave(svm->vmcb01, svm->nested.vmcb02);
>>> +
>>>          svm->vmcb = svm->nested.vmcb02;
>>>          svm->vmcb_pa = svm->nested.vmcb02_pa;
>>>          load_nested_vmcb_control(svm, &nested_vmcb->control);
>>> @@ -622,6 +625,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>>          if (svm->vmcb01->control.asid == 0)
>>>                  svm->vmcb01->control.asid = svm->nested.vmcb02->control.asid;
>>>   
>>> +       nested_svm_vmloadsave(svm->nested.vmcb02, svm->vmcb01);
>>>          svm->vmcb = svm->vmcb01;
>>>          svm->vmcb_pa = svm->nested.vmcb01_pa;
>>>   
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index b66239b26885d..ee9f87fe611f2 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -1097,6 +1097,7 @@ static void init_vmcb(struct vcpu_svm *svm)
>>>                  clr_cr_intercept(svm, INTERCEPT_CR3_READ);
>>>                  clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
>>>                  save->g_pat = svm->vcpu.arch.pat;
>>> +               svm->nested.vmcb02->save.g_pat = svm->vcpu.arch.pat;

I had noticed that g_pat was not set in vmcb02 and set it to vmcb01's 
value which was not helpful but I did not try the current vcpu value.

I am getting a #UD which I suspected had something to do with cr3 but 
I'll know more after I add your suggestions.

emu-system-x86-1647  [033] ....  3167.589402: kvm_nested_vmexit_inject: 
reason: UD excp ext_inf1: 0x0000000000000000 ext_inf2: 
0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000


>>>                  save->cr3 = 0;
>>>                  save->cr4 = 0;
>>>          }
>>>
>>>
>>>
>>> Best regards,
>>> 	Maxim Levitsky
>>>
>>>> Paolo
>>>>
>> And another thing I spotted before I forget.
>>
>> If we setup a tlb flush in ctl.tlb_ctl of vmcb01, just prior to nested vmentry
>> then this field will be copied to vmcb02 but on next vmexit we clear it in current
>> (that is vmcb02) and that change will not propogate to vmcb01.
ctl.tlb_ctl is dependent on the value of save.cr4 which was not being 
set in vmcb02.
>>
>> I am not sure if this is a theorerical issue or not. We probably should apply the same treatment to
>> it as what Paulo suggested to do with asid -
>> set it just prior to vmentry if tlb flush is needed, and clear it afterwards as we do.
> And yet another thing to note is that we curently ignore L2's g_pat. However it _seems_ that we practically
> ignore L1 PAT as well in regard to shadowing NPT mmu. I am not 100% sure about this.
> I'll dig that area eventually.
>
> Best regards,
> 	Maxim Levitsky
>
>> Best regards,
>> 	Maxim Levitsky
>
Thanks,

Cathy

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ