lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <697aa804-b321-4dba-9060-7ac17e0a489f@linux.intel.com>
Date: Wed, 20 Aug 2025 19:13:33 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>,
 Vishal Annapurve <vannapurve@...gle.com>
Cc: Nikolay Borisov <nik.borisov@...e.com>, Jianxiong Gao <jxgao@...gle.com>,
 "Borislav Petkov (AMD)" <bp@...en8.de>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 Dionna Glaze <dionnaglaze@...gle.com>, "H. Peter Anvin" <hpa@...or.com>,
 jgross@...e.com, "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 Ingo Molnar <mingo@...hat.com>, pbonzini@...hat.com,
 Peter Gonda <pgonda@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
 Tom Lendacky <thomas.lendacky@....com>,
 Vitaly Kuznetsov <vkuznets@...hat.com>, x86@...nel.org,
 Rick Edgecombe <rick.p.edgecombe@...el.com>, jiewen.yao@...el.com
Subject: Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX



On 8/20/2025 6:03 PM, Binbin Wu wrote:
>
>
> On 8/20/2025 11:07 AM, Vishal Annapurve wrote:
>> On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@...ux.intel.com> wrote:
>>>
>>>
>>> On 7/28/2025 11:33 PM, Sean Christopherson wrote:
>>>> +Jiewen
>>> Jiewen is out of the office until August 4th.
>> Hi Jiewen, can we get some help in deciding the next steps here?
>
> Please see below.
>
>>
>>>> Summary, with the questions at the end.
>>>>
>>>> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
>>>> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
>>>> the range using a cachaeable mapping (WB).
>>>>
>>>>    ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>>>    tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>>>
>>>> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
>>>> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
>>>> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
>>>> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
>>>> actually honors the memtypes programmed into the virtual MTRRs.
>>>>
>>>> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
>>>> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
>>>> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
>>>> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
>>>> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>>>>
>>>> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
>>>> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>>>>
>>>>         //
>>>>         // Operational region for TPM access
>>>>         //
>>>>         OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>>>>
>>>> generates the problematic SystemMemory entry that triggers the ACPI driver's
>>>> auto-mapping logic.
>>>>
>>>> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
>>>> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
>>>> Memory32Fixed entry.
>>>>
>>>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
>
> Checked with Jiewen offline.
>
> He didn't think there was an existing interface to tell the OS to map a
> OperationRegion of SystemMemory as UC via the ACPI table. He thought the
> OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
> alternative way.
>
>
>>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>>> interface to specify the cacheable attribute.
>>>
>>> One solution could be using MTRRs to communicate the memory attribute of legacy
>>> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
>>> that "long-term, firmware should not be using MTRRs to communicate anything to
>>> the kernel." So this solution is not preferred.
>>>
>>> If not MTRRs, there should be an alternative way to do the job.
>>> 1. ACPI table
>>>      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>>      Range Descriptor can specify the cacheable attribute.
>>>      "Address Space Resource Descriptors" could be used to describe a memory range
>>>      and the they can specify the cacheable attribute via "Type Specific Flags".
>>>      One of the Address Space Resource Descriptors could be added to the ACPI
>>>      table as a hint when the kernel do the mapping for operation region.
>>>      (There is "System Physical Address (SPA) Range Structure", which also can
>>>      specify the cacheable attribute. But it's should be used for NVDIMMs.)
>>> 2. EFI memory map descriptor
>>>      EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>>      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>>      the mapping for operation region.
>>>
>>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>>> map descriptor" during the ACPI code doing the mapping for operation region
>>> makes the code complicated.
>>>
>>> Another thing is if long-term firmware should not be using MTRRs to to
>>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>>> operation region access?
>>>
>> Would it work if instead of doubling down on declaring the low memory
>> above TOLUD as WB, guest kernel reserves the range as uncacheable by
>> default i.e. effectively simulating a ioremap before ACPI tries to map
>> the memory as WB?
>
> It seems as hacky as this patch set?
>
>
Hi Sean,

Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
hole range as UC?

Is it less hacky?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ