[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aKYMQP5AEC2RkOvi@google.com>
Date: Wed, 20 Aug 2025 10:56:16 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Binbin Wu <binbin.wu@...ux.intel.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, Nikolay Borisov <nik.borisov@...e.com>,
Jianxiong Gao <jxgao@...gle.com>, "Borislav Petkov (AMD)" <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, Dionna Glaze <dionnaglaze@...gle.com>,
"H. Peter Anvin" <hpa@...or.com>, jgross@...e.com,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, pbonzini@...hat.com,
Peter Gonda <pgonda@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Tom Lendacky <thomas.lendacky@....com>, Vitaly Kuznetsov <vkuznets@...hat.com>, x86@...nel.org,
Rick Edgecombe <rick.p.edgecombe@...el.com>, jiewen.yao@...el.com
Subject: Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
On Wed, Aug 20, 2025, Binbin Wu wrote:
> On 8/20/2025 6:03 PM, Binbin Wu wrote:
> > > > > Presumably this an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's
> > > > > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
> >
> > Checked with Jiewen offline.
> >
> > He didn't think there was an existing interface to tell the OS to map a
> > OperationRegion of SystemMemory as UC via the ACPI table. He thought the
> > OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
> > alternative way.
> >
> > > > According to the ACPI spec 6.6, an operation region of SystemMemory has no
> > > > interface to specify the cacheable attribute.
> > > >
> > > > One solution could be using MTRRs to communicate the memory attribute of legacy
> > > > PCI hole to the kernel.
So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden
until now :-(
> > > > But during the PUCK meeting last week, Sean mentioned
> > > > that "long-term, firmware should not be using MTRRs to communicate anything to
> > > > the kernel." So this solution is not preferred.
> > > >
> > > > If not MTRRs, there should be an alternative way to do the job.
> > > > 1. ACPI table
> > > > According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
> > > > Range Descriptor can specify the cacheable attribute.
> > > > "Address Space Resource Descriptors" could be used to describe a memory range
> > > > and the they can specify the cacheable attribute via "Type Specific Flags".
> > > > One of the Address Space Resource Descriptors could be added to the ACPI
> > > > table as a hint when the kernel do the mapping for operation region.
> > > > (There is "System Physical Address (SPA) Range Structure", which also can
> > > > specify the cacheable attribute. But it's should be used for NVDIMMs.)
> > > > 2. EFI memory map descriptor
> > > > EFI memory descriptor can specify the cacheable attribute. Firmware can add
> > > > a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
> > > > the mapping for operation region.
> > > >
> > > > Operation region of SystemMemory is still needed if a "Control Method" of APCI
> > > > needs to access a field, e.g., the method _STA. Checking another descriptor for
> > > > cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
> > > > map descriptor" during the ACPI code doing the mapping for operation region
> > > > makes the code complicated.
> > > >
> > > > Another thing is if long-term firmware should not be using MTRRs to to
> > > > communicate anything to the kernel. It seems it's safer to use ioremap() instead
> > > > of ioremap_cache() for MMIO resource when the kernel do the mapping for the
> > > > operation region access?
> > > >
> > > Would it work if instead of doubling down on declaring the low memory
> > > above TOLUD as WB, guest kernel reserves the range as uncacheable by
> > > default i.e. effectively simulating a ioremap before ACPI tries to map
> > > the memory as WB?
> >
> > It seems as hacky as this patch set?
> >
> >
> Hi Sean,
>
> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
> hole range as UC?
>
> Is it less hacky?
Oh! That's a way better idea than my hack. I missed that the kernel would still
consult MTRRs.
Compile tested only, but something like this?
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8ae750cde0c6..45c8871cdda1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
static void __init kvm_init_platform(void)
{
+ u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
+ struct mtrr_var_range pci_hole = {
+ .base_lo = tolud | X86_MEMTYPE_UC,
+ .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
+ .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
+ };
+
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
unsigned long nr_pages;
@@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
kvmclock_init();
x86_platform.apic_post_init = kvm_apic_init;
- /* Set WB as the default cache mode for SEV-SNP and TDX */
- guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
+ /*
+ * Set WB as the default cache mode for SEV-SNP and TDX, with a single
+ * UC range for the legacy PCI hole, e.g. so that devices that expect
+ * to get UC/WC mappings don't get surprised with WB.
+ */
+ guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
}
#if defined(CONFIG_AMD_MEM_ENCRYPT)
Powered by blists - more mailing lists