linux-kernel - Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f133d093-fb91-4dd6-b936-cd4a17c24ecf@linux.intel.com>
Date: Mon, 10 Feb 2025 23:29:34 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>,
 Rick P Edgecombe <rick.p.edgecombe@...el.com>, "Xu, Min M"
 <min.m.xu@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
 "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
 "thomas.lendacky@....com" <thomas.lendacky@....com>,
 "dionnaglaze@...gle.com" <dionnaglaze@...gle.com>,
 Binbin Wu <binbin.wu@...el.com>,
 "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "hpa@...or.com" <hpa@...or.com>, "vkuznets@...hat.com"
 <vkuznets@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
 "jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com"
 <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>,
 jiewen.yao@...el.com, Binbin Wu <binbin.wu@...ux.intel.com>
Subject: Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX



On 2/4/2025 8:27 AM, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
>>>> Since there is no upstream KVM TDX support yet, why isn't it an option to
>>>> still
>>>> revert the EDKII commit too? It was a relatively recent change.
>>> I'm fine with that route too, but it too is a band-aid.  Relying on the
>>> *untrusted*
>>> hypervisor to essentially communicate memory maps is not a winning strategy.
>>>
>>>> To me it seems that the normal KVM MTRR support is not ideal, because it is
>>>> still lying about what it is doing. For example, in the past there was an
>>>> attempt to use UC to prevent speculative execution accesses to sensitive
>>>> data.
>>>> The KVM MTRR support only happens to work with existing guests, but not all
>>>> possible MTRR usages.
>>>>
>>>> Since diverging from the architecture creates loose ends like that, we could
>>>> instead define some other way for EDKII to communicate the ranges to the
>>>> kernel.
>>>> Like some simple KVM PV MSRs that are for communication only, and not
>>> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
>>> through the hypervisor to communicate information within the guest is asinine,
>>> especially for CoCo VMs.
>> Hmm, right.
>>
>> So the other options could be:
>>
>> 1. Some TDX module feature to hold the ranges:
>>   - Con: Not shared with AMD
>>
>> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.
IIUC, 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest") was added
to avoid changing the setting of MTRR, which will trigger #VE by setting
CR0.CD=1.

And recently, 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always
return FALSE in TD-Guest") was added to avoid touching MTRR MSRs at all,
so that the MTRR MSRs support for TDX guests was dropped as described in
[PATCH 00/18] KVM: TDX: TDX "the rest" part [1].

If we want to revert the two commits, we need to:
1. Make sure that OVMF will not set CR0.CD to 1 for TDX guests, probably
    needs some kind of hack in OVMF.

2. Need to bring back the support of MTRR MSRs in KVM for TDX guests.
    TDX KVM basic support patch set v19 and earlier versions enforce default
    MTRR type as WB and disabled fixed/variable MTRR (by reporting MSR_MTRRcap
    as 0) for TDX guests, which was kind of half support and needed
    "clearcpuid=mtrr" to avoid toggling of CR0.CD.

    If we really want to rely on the OVMF to program the MTRR values, maybe we
    can treat MTRR MSRs for TDX guests as normal VMs, i.e., allow guests to
    read/write the values without any further virtualization.
    Of course, it needs to prompt the guest kernel to skip toggling CR0.CD for
    TDX guests.

[1] https://lore.kernel.org/kvm/20241210004946.3718496-1-binbin.wu@linux.intel.com


>    Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
>
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.

Oh, yea.

And if we drop the MTRR MSRs access in KVM for TDX guests, an old guest kernel
without commit 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX") will require "clearcpuid=mtrr". :(

>   
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
>
>>   - Con: Creating more half support, when it's technically not required
>>   - Con: Still bouncing through the hypervisor
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
>
>     - Con: Doesn't address the performance issue with TDX guests "using" UC
>            memory by default (unless there's yet more enabled).
>
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?
>
>>   - Pro: Design and code is clear
>>
>> 3. Create some new architectural definition, like a bit that means "MTRRs don't
>> actually work:
>>   - Con: Takes a long time, need to get agreement
>>   - Con: Still bouncing through the hypervisor
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
>
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
>
>>   - Pro: More pure solution
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.
>
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
>
>> 4. Do this series:
>>   - Pro: Looks ok to me
It looks OK to me too.
But as mentioned above, mismatch of OVMF, guest kernel, host kernel
version will cause problem.

>>   - Cons: As explained in the patches, it's a bit hacky.
Skipping toggling CR0.CD in guest kernel seems also a bit hacky.

>>   - Cons: Could there be more cases than the legacy PCI hole?
>>
>> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
>> ones if we want to resolve this soon.