[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGtprH9EL0=Cxu7f8tD6rEvnpC7uLAw6jKijHdFUQYvbyJgkzA@mail.gmail.com>
Date: Tue, 19 Aug 2025 20:07:58 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Binbin Wu <binbin.wu@...ux.intel.com>
Cc: Sean Christopherson <seanjc@...gle.com>, Nikolay Borisov <nik.borisov@...e.com>,
Jianxiong Gao <jxgao@...gle.com>, "Borislav Petkov (AMD)" <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, Dionna Glaze <dionnaglaze@...gle.com>,
"H. Peter Anvin" <hpa@...or.com>, jgross@...e.com,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, pbonzini@...hat.com,
Peter Gonda <pgonda@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Tom Lendacky <thomas.lendacky@....com>, Vitaly Kuznetsov <vkuznets@...hat.com>, x86@...nel.org,
Rick Edgecombe <rick.p.edgecombe@...el.com>, jiewen.yao@...el.com
Subject: Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@...ux.intel.com> wrote:
>
>
>
> On 7/28/2025 11:33 PM, Sean Christopherson wrote:
> > +Jiewen
>
> Jiewen is out of the office until August 4th.
Hi Jiewen, can we get some help in deciding the next steps here?
>
> >
> > Summary, with the questions at the end.
> >
> > Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
> > the TPM driver's ioremap (with UC) failing because the kernel has already mapped
> > the range using a cachaeable mapping (WB).
> >
> > ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
> > tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
> >
> > The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
> > SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
> > to WB. With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
> > thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
> > actually honors the memtypes programmed into the virtual MTRRs.
> >
> > It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
> > region (and potentially other regions) to be UC, so that the kernel ACPI driver's
> > attempts to map of SystemMemory entries as cacheable get forced to UC. With MTRRs
> > forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
> > causes the ioremap infrastructure to reject the TPM driver's UC mapping.
> >
> > IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
> > EDK2's TPM ASL. And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
> >
> > //
> > // Operational region for TPM access
> > //
> > OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
> >
> > generates the problematic SystemMemory entry that triggers the ACPI driver's
> > auto-mapping logic.
> >
> > QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
> > EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
> > Memory32Fixed entry.
> >
> > Presumably this an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's
> > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
> According to the ACPI spec 6.6, an operation region of SystemMemory has no
> interface to specify the cacheable attribute.
>
> One solution could be using MTRRs to communicate the memory attribute of legacy
> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
> that "long-term, firmware should not be using MTRRs to communicate anything to
> the kernel." So this solution is not preferred.
>
> If not MTRRs, there should be an alternative way to do the job.
> 1. ACPI table
> According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
> Range Descriptor can specify the cacheable attribute.
> "Address Space Resource Descriptors" could be used to describe a memory range
> and the they can specify the cacheable attribute via "Type Specific Flags".
> One of the Address Space Resource Descriptors could be added to the ACPI
> table as a hint when the kernel do the mapping for operation region.
> (There is "System Physical Address (SPA) Range Structure", which also can
> specify the cacheable attribute. But it's should be used for NVDIMMs.)
> 2. EFI memory map descriptor
> EFI memory descriptor can specify the cacheable attribute. Firmware can add
> a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
> the mapping for operation region.
>
> Operation region of SystemMemory is still needed if a "Control Method" of APCI
> needs to access a field, e.g., the method _STA. Checking another descriptor for
> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
> map descriptor" during the ACPI code doing the mapping for operation region
> makes the code complicated.
>
> Another thing is if long-term firmware should not be using MTRRs to to
> communicate anything to the kernel. It seems it's safer to use ioremap() instead
> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
> operation region access?
>
Would it work if instead of doubling down on declaring the low memory
above TOLUD as WB, guest kernel reserves the range as uncacheable by
default i.e. effectively simulating a ioremap before ACPI tries to map
the memory as WB?
Powered by blists - more mailing lists