linux-kernel - Re: [PATCH kernel v3] x86/compressed/64: reduce #VC nesting for intercepted CPUID for SEV-SNP guest

Message-ID: <1d9a3a05-fbcb-49b5-955a-a3686ab3efdd@amd.com>
Date:   Thu, 5 Oct 2023 20:36:37 +1100
From:   Alexey Kardashevskiy <aik@....com>
To:     Tom Lendacky <thomas.lendacky@....com>, x86@...nel.org
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        Sean Christopherson <seanjc@...gle.com>
Subject: Re: [PATCH kernel v3] x86/compressed/64: reduce #VC nesting for
 intercepted CPUID for SEV-SNP guest


On 5/10/23 00:53, Tom Lendacky wrote:
> On 10/3/23 18:22, Alexey Kardashevskiy wrote:
>>
>> On 4/10/23 04:21, Tom Lendacky wrote:
>>> On 10/3/23 02:31, Alexey Kardashevskiy wrote:
>>>> For certain intercepts an SNP guest uses the GHCB protocol to talk to
>>>> the hypervisor from the #VC handler. The protocol requires a shared 
>>>> page so
>>>> there is one per vCPU. In case NMI arrives in a middle of #VC or the 
>>>> NMI
>>>> handler triggers a #VC, there is another "backup" GHCB page which 
>>>> stores
>>>> the content of the first one while SVM_VMGEXIT_NMI_COMPLETE is sent.
>>>> The vc_raw_handle_exception() handler manages main and backup GHCB 
>>>> pages
>>>> via __sev_get_ghcb/__sev_put_ghcb.
>>>>
>>>> This works fine for #VC and occasional NMIs but not so fine when the 
>>>> #VC
>>>> handler causes intercept + another #VC. If NMI arrives during
>>>> the second #VC, there are no more pages for SVM_VMGEXIT_NMI_COMPLETE.
>>>> The problem place is the #VC CPUID handler which reads an MSR which
>>>> triggers another #VC and if "perf" was running, panic happens:
>>>>
>>>> Kernel panic - not syncing: Unable to handle #VC exception! GHCB and 
>>>> Backup GHCB are already in use
>>>>
>>>> Add a helper similar to native_read_msr_safe() for making a direct 
>>>> hypercall
>>>> in the SEV-ES environment. Use the new helper instead of the raw 
>>>> "rdmsr" to
>>>> avoid the extra #VC event.
>>>>
>>>> Fixes: ee0bfa08a345 ("x86/compressed/64: Add support for SEV-SNP 
>>>> CPUID table in #VC handlers")
>>>> Signed-off-by: Alexey Kardashevskiy <aik@....com>
>>>> ---
>>>>
>>>> Based on:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-x86-urgent
>>>> which top at the time was:
>>>> 62d5e970d022 "x86/sev: Change npages to unsigned long in 
>>>> snp_accept_memory()"
>>>>
>>>> ---
>>>> Changes:
>>>> v3:
>>>> * made it a function, mimic native_read_msr_safe() which 1) returns 
>>>> value 2) returns an error
>>>> * removed debug backtraces the commit log as these were added for 
>>>> debugging and never
>>>> appear with actual kernels
>>>>
>>>>
>>>> v2:
>>>> * de-uglify by defining rdmsr_safe_GHCB()
>>>> ---
>>>>   arch/x86/kernel/sev-shared.c | 27 +++++++++++++++++---
>>>>   1 file changed, 23 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/sev-shared.c 
>>>> b/arch/x86/kernel/sev-shared.c
>>>> index dcf325b7b022..494d92a71986 100644
>>>> --- a/arch/x86/kernel/sev-shared.c
>>>> +++ b/arch/x86/kernel/sev-shared.c
>>>> @@ -241,6 +241,25 @@ static enum es_result 
>>>> sev_es_ghcb_hv_call(struct ghcb *ghcb,
>>>>       return verify_exception_info(ghcb, ctxt);
>>>>   }
>>>> +
>>>> +/* Paravirt SEV-ES rdmsr which avoids extra #VC event */
>>>> +static unsigned long long ghcb_prot_read_msr(unsigned int msr, 
>>>> struct ghcb *ghcb,
>>>> +                         struct es_em_ctxt *ctxt, int *err)
>>>
>>> Alternatively you could return enum es_result and take xss as a 
>>> parameter... six of one, half dozen of another I guess.
>>
>> How do we decide on this? :)
>>
>> and yeah, I need to s/int/enum es_result/
>>
>>>> +{
>>>> +    unsigned long long ret = 0;
>>>> +
>>>> +    ghcb_set_rcx(ghcb, msr);
>>>> +
>>>> +    *err = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, 0, 0);
>>>> +    if (*err == ES_OK)
>>>> +        ret = (ghcb->save.rdx << 32) | ghcb->save.rax;
>>>
>>> You should check ghcb_rax_is_valid(ghcb) and ghcb_rdx_is_valid(ghcb) 
>>> before using the values.
>>
>> Huh. v4 is coming then. Although what are the chances of *err == ES_OK 
>> and !ghcb_rax_is_valid() at the same time? What if *err == ES_OK and 
>> ghcb_rdx_is_valid()==true but ghcb_rax_is_valid()==false?
>>
>> return ((ghcb_rdx_is_valid(ghcb) ? (ghcb->save.rdx << 32) : 0) |
>>      (ghcb_rax_is_valid(ghcb) ? ghcb->save.rax : 0));
>>
>> Or I can just drop *err, invalidate ghcb before sev_es_ghcb_hv_call() 
>> and only rely on (ghcb_rdx_is_valid() && ghcb_rax_is_valid())?
>>
>> Where should I stop with this? :)
> 
> No, you can't drop *err. The GHCB protocol specifically calls out how 
> errors can be returned and how register state is returned.
> 
> In this case, sev_es_ghcb_hv_call() will check for general errors being 
> returned from the hypervisor, e.g. non-zero SW_EXITINFO1[31:0] and that 
> is why you need to check *err.
> 
> Then you need to validate that the hypervisor set the proper registers, 
> hence the check for ghcb_rax/rdx_is_valid() (see __sev_cpuid_hv_ghcb() 
> as an example).
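
For reference, a minimal user-space sketch of that two-stage check, also
folding in the earlier suggestion to return enum es_result and pass the
value back through a pointer. Everything prefixed mock_ is hypothetical
and exists only so the sketch compiles on its own; the enum borrows two
names from the kernel's es_result but is not the kernel definition, and
the function-pointer hook stands in for sev_es_ghcb_hv_call().

```c
#include <stdbool.h>
#include <stdint.h>

/* Mock stand-ins; the real kernel definitions differ. */
enum es_result { ES_OK, ES_VMM_ERROR };

struct mock_ghcb {
	uint64_t rax, rdx, rcx;
	bool rax_valid, rdx_valid, rcx_valid;
};

/* Hypothetical hook that plays the role of the hypervisor call. */
typedef enum es_result (*mock_hv_call_t)(struct mock_ghcb *ghcb);

static void mock_ghcb_invalidate(struct mock_ghcb *ghcb)
{
	ghcb->rax_valid = ghcb->rdx_valid = ghcb->rcx_valid = false;
}

static enum es_result mock_ghcb_read_msr(struct mock_ghcb *ghcb, uint32_t msr,
					 mock_hv_call_t hv_call, uint64_t *val)
{
	enum es_result ret;

	/* Start from a clean page so stale valid bits cannot leak in. */
	mock_ghcb_invalidate(ghcb);
	ghcb->rcx = msr;
	ghcb->rcx_valid = true;

	/* Stage 1: general errors reported by the hypervisor. */
	ret = hv_call(ghcb);

	/* Stage 2: both result registers must have been marked valid. */
	if (ret == ES_OK) {
		if (ghcb->rax_valid && ghcb->rdx_valid)
			*val = (ghcb->rdx << 32) | (ghcb->rax & 0xffffffffu);
		else
			ret = ES_VMM_ERROR;
	}

	/* Leave the page clean for whatever GHCB call comes next. */
	mock_ghcb_invalidate(ghcb);
	return ret;
}

/* Mock hypervisor that fills in both halves of the MSR value. */
static enum es_result mock_hv_good(struct mock_ghcb *ghcb)
{
	ghcb->rax = 0x55667788u;
	ghcb->rdx = 0x11223344u;
	ghcb->rax_valid = ghcb->rdx_valid = true;
	return ES_OK;
}

/* Mock hypervisor that reports success but never marks RDX valid. */
static enum es_result mock_hv_forgets_rdx(struct mock_ghcb *ghcb)
{
	ghcb->rax = 0x55667788u;
	ghcb->rax_valid = true;
	return ES_OK;
}
```

With this shape the "ES_OK but a register is not valid" case does not
need a partial result: the caller gets an error and *val is untouched.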


After an offline discussion, it turns out that the rdmsr of XSS at this 
particular place in the guest (postprocessing of CPUID 0xd:0x1 bit 3 == 
"XSAVES, XRSTOR, and XSS are supported") should not have been 
intercepted in the first place: XSS is virtualized and swapped as 
type B, but the MSR is still intercepted because that is the default.
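
To spell out why this one extra intercept is fatal rather than just 
slow, here is a toy model of the main/backup GHCB bookkeeping described 
in the commit log. It is a sketch only: the toy_ names are made up, and 
the real __sev_get_ghcb/__sev_put_ghcb also save and restore the page 
contents, which is left out here.

```c
#include <stdbool.h>

/* Toy per-CPU state: one main GHCB and one backup, as in the commit log. */
struct toy_cpu {
	bool ghcb_active;
	bool backup_ghcb_active;
};

enum toy_slot { TOY_SLOT_MAIN, TOY_SLOT_BACKUP, TOY_SLOT_NONE };

/* Hand out the main page first, then the backup, then give up. */
static enum toy_slot toy_get_ghcb(struct toy_cpu *cpu)
{
	if (!cpu->ghcb_active) {
		cpu->ghcb_active = true;
		return TOY_SLOT_MAIN;
	}
	if (!cpu->backup_ghcb_active) {
		cpu->backup_ghcb_active = true;
		return TOY_SLOT_BACKUP;
	}
	return TOY_SLOT_NONE;	/* this is where the real kernel panics */
}

static void toy_put_ghcb(struct toy_cpu *cpu, enum toy_slot slot)
{
	if (slot == TOY_SLOT_BACKUP)
		cpu->backup_ghcb_active = false;
	else if (slot == TOY_SLOT_MAIN)
		cpu->ghcb_active = false;
}

/*
 * CPUID #VC, optionally a nested #VC for the XSS rdmsr, then an NMI
 * arriving at the deepest point. Returns true if the NMI still gets a
 * page for its SVM_VMGEXIT_NMI_COMPLETE.
 */
static bool toy_nmi_has_ghcb(bool msr_intercepted)
{
	struct toy_cpu cpu = { false, false };
	enum toy_slot outer, inner = TOY_SLOT_NONE, nmi;
	bool ok;

	outer = toy_get_ghcb(&cpu);		/* #VC for CPUID */
	if (msr_intercepted)
		inner = toy_get_ghcb(&cpu);	/* nested #VC for rdmsr */
	nmi = toy_get_ghcb(&cpu);		/* NMI needs a page too */
	ok = (nmi != TOY_SLOT_NONE);

	/* Unwind in reverse order. */
	toy_put_ghcb(&cpu, nmi);
	toy_put_ghcb(&cpu, inner);
	toy_put_ghcb(&cpu, outer);
	return ok;
}
```

In this model toy_nmi_has_ghcb(true) is the panic case and 
toy_nmi_has_ghcb(false) is what happens once the rdmsr no longer causes 
a nested #VC.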


Applying this to KVM fixes the guest crash:

--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4266,6 +4266,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
         vcpu->arch.xsaves_enabled = guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) &&
                                     boot_cpu_has(X86_FEATURE_XSAVE) &&
                                     boot_cpu_has(X86_FEATURE_XSAVES);
+       if (vcpu->arch.xsaves_enabled)
+               set_msr_interception(vcpu, svm->msrpm, MSR_IA32_XSS, 1, 1);
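
In case the "1, 1" looks backwards: as far as I can tell from svm.c, a 
non-zero read/write argument means "let the guest access the MSR 
directly", i.e. the matching intercept bit in the MSR permission map is 
cleared. A toy model of that, assuming two bits per MSR (read, then 
write) with a set bit meaning "intercept", and covering only MSRs 
0x0 to 0x1fff. The toy_ names are made up and the real MSRPM layout and 
KVM helpers are more involved.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MSR_IA32_XSS	0x00000da0u	/* same number as MSR_IA32_XSS */
#define TOY_MSR_LIMIT		0x2000u		/* toy covers MSRs 0x0..0x1fff */
#define TOY_MSRPM_BYTES		(TOY_MSR_LIMIT * 2 / 8)

struct toy_msrpm {
	uint8_t bits[TOY_MSRPM_BYTES];	/* 2 bits per MSR: read, write */
};

/* Default: every bit set, so every access is intercepted. */
static void toy_msrpm_init(struct toy_msrpm *pm)
{
	memset(pm->bits, 0xff, sizeof(pm->bits));
}

static void toy_assign_bit(struct toy_msrpm *pm, uint32_t bit, bool intercept)
{
	if (intercept)
		pm->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	else
		pm->bits[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* Non-zero allow_read/allow_write means direct access, no intercept. */
static void toy_set_msr_interception(struct toy_msrpm *pm, uint32_t msr,
				     int allow_read, int allow_write)
{
	if (msr >= TOY_MSR_LIMIT)
		return;		/* outside the range this toy models */

	toy_assign_bit(pm, 2 * msr, !allow_read);
	toy_assign_bit(pm, 2 * msr + 1, !allow_write);
}

static bool toy_read_intercepted(const struct toy_msrpm *pm, uint32_t msr)
{
	uint32_t bit = 2 * msr;

	return pm->bits[bit / 8] & (1u << (bit % 8));
}

static bool toy_write_intercepted(const struct toy_msrpm *pm, uint32_t msr)
{
	uint32_t bit = 2 * msr + 1;

	return pm->bits[bit / 8] & (1u << (bit % 8));
}
```

So after toy_set_msr_interception(&pm, TOY_MSR_IA32_XSS, 1, 1) a guest 
rdmsr of XSS in this model no longer exits, and with it the nested #VC 
in the CPUID handler goes away.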


So I guess we want to fix KVM, but at least for now the guest needs the 
fix too, doesn't it?

Also adding Sean to Cc.

Thanks,


> 
> Thanks,
> Tom
> 
>>
>>>> +
>>>> +    /* Invalidate qwords for likely another following GHCB call */
>>>> +    vc_ghcb_invalidate(ghcb);
>>>
>>> We should probably call this on entry to the function, too, right? 
>>> Not sure it really matters though.
>>
>> The SVM_EXIT_MSR's handler in SVM/KVM only cares if RCX is valid in 
>> sev_es_validate_vmgexit() and the guest's ghcb_set_rcx() does that. 
>> Nothing in SVM enforces that other (unused) registers are not valid 
>> though. Thanks,
>>
>>
>>>
>>> Thanks,
>>> Tom
>>>
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   static int __sev_cpuid_hv(u32 fn, int reg_idx, u32 *reg)
>>>>   {
>>>>       u64 val;
>>>> @@ -477,11 +496,11 @@ static int snp_cpuid_postprocess(struct ghcb 
>>>> *ghcb, struct es_em_ctxt *ctxt,
>>>>           if (leaf->subfn == 1) {
>>>>               /* Get XSS value if XSAVES is enabled. */
>>>>               if (leaf->eax & BIT(3)) {
>>>> -                unsigned long lo, hi;
>>>> +                int err = 0;
>>>> -                asm volatile("rdmsr" : "=a" (lo), "=d" (hi)
>>>> -                             : "c" (MSR_IA32_XSS));
>>>> -                xss = (hi << 32) | lo;
>>>> +                xss = ghcb_prot_read_msr(MSR_IA32_XSS, ghcb, ctxt, 
>>>> &err);
>>>> +                if (err != ES_OK)
>>>> +                    return -EINVAL;
>>>>               }
>>>>               /*
>>

-- 
Alexey