[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d84b792e-8d26-49c2-9e7c-04093f554f8a@linux.intel.com>
Date: Thu, 21 Aug 2025 11:30:27 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>,
Nikolay Borisov <nik.borisov@...e.com>, Jianxiong Gao <jxgao@...gle.com>,
"Borislav Petkov (AMD)" <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Dionna Glaze <dionnaglaze@...gle.com>, "H. Peter Anvin" <hpa@...or.com>,
jgross@...e.com, "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...hat.com>, pbonzini@...hat.com,
Peter Gonda <pgonda@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Tom Lendacky <thomas.lendacky@....com>,
Vitaly Kuznetsov <vkuznets@...hat.com>, x86@...nel.org,
Rick Edgecombe <rick.p.edgecombe@...el.com>, jiewen.yao@...el.com
Subject: Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
On 8/21/2025 1:56 AM, Sean Christopherson wrote:
> On Wed, Aug 20, 2025, Binbin Wu wrote:
>> On 8/20/2025 6:03 PM, Binbin Wu wrote:
>>>>>> Presumably this an EDK2 bug? If it's not an EDK2 bug, then how is the kernel's
>>>>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
>>> Checked with Jiewen offline.
>>>
>>> He didn't think there was an existing interface to tell the OS to map a
>>> OperationRegion of SystemMemory as UC via the ACPI table. He thought the
>>> OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
>>> alternative way.
>>>
>>>>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>>>>> interface to specify the cacheable attribute.
>>>>>
>>>>> One solution could be using MTRRs to communicate the memory attribute of legacy
>>>>> PCI hole to the kernel.
> So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden
> until now :-(
>
>>>>> But during the PUCK meeting last week, Sean mentioned
>>>>> that "long-term, firmware should not be using MTRRs to communicate anything to
>>>>> the kernel." So this solution is not preferred.
>>>>>
>>>>> If not MTRRs, there should be an alternative way to do the job.
>>>>> 1. ACPI table
>>>>> According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>>>> Range Descriptor can specify the cacheable attribute.
>>>>> "Address Space Resource Descriptors" could be used to describe a memory range
>>>>> and the they can specify the cacheable attribute via "Type Specific Flags".
>>>>> One of the Address Space Resource Descriptors could be added to the ACPI
>>>>> table as a hint when the kernel do the mapping for operation region.
>>>>> (There is "System Physical Address (SPA) Range Structure", which also can
>>>>> specify the cacheable attribute. But it's should be used for NVDIMMs.)
>>>>> 2. EFI memory map descriptor
>>>>> EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>>>> a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>>>> the mapping for operation region.
>>>>>
>>>>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>>>>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>>>>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>>>>> map descriptor" during the ACPI code doing the mapping for operation region
>>>>> makes the code complicated.
>>>>>
>>>>> Another thing is if long-term firmware should not be using MTRRs to to
>>>>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>>>>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>>>>> operation region access?
>>>>>
>>>> Would it work if instead of doubling down on declaring the low memory
>>>> above TOLUD as WB, guest kernel reserves the range as uncacheable by
>>>> default i.e. effectively simulating a ioremap before ACPI tries to map
>>>> the memory as WB?
>>> It seems as hacky as this patch set?
>>>
>>>
>> Hi Sean,
>>
>> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
>> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
>> hole range as UC?
>>
>> Is it less hacky?
> Oh! That's a way better idea than my hack. I missed that the kernel would still
> consult MTRRs.
>
> Compile tested only, but something like this?
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8ae750cde0c6..45c8871cdda1 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
>
> static void __init kvm_init_platform(void)
> {
> + u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
> + struct mtrr_var_range pci_hole = {
> + .base_lo = tolud | X86_MEMTYPE_UC,
> + .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
> + .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
> + };
> +
This value of tolud may not meet the range size and alignment requirement for
variable MTRR.
Variable MTRR has requirement for range size and alignment:
For ranges greater than 4 KBytes, each range must be of length 2^n and its base
address must be aligned on a 2^n boundary, where n is a value equal to or
greater than 12. The base-address alignment value cannot be less than its length.
In my setup, the value of tolud is 0x7FF7C000, it requires 3 variable MTRRs to
meet the requirement, i.e.,
- 7FF7 C000 ~ 7FF8 0000
- 7FF8 0000 ~ 8000 0000
- 8000 0000 ~ 1 0000 0000
I checks the implementation in EDK2, in order to fit the legacy PCI hole into
one variable MTRR, it has some assumption to truncate the size and round up the
base address in PlatformQemuUc32BaseInitialization():
...
ASSERT (
PlatformInfoHob->HostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID ||
PlatformInfoHob->HostBridgeDevId == INTEL_82441_DEVICE_ID
);
...
//
// Start with the [LowerMemorySize, 4GB) range. Make sure one
// variable MTRR suffices by truncating the size to a whole power of two,
// while keeping the end affixed to 4GB. This will round the base up.
//
PlatformInfoHob->Uc32Size = GetPowerOfTwo32 ((UINT32)(SIZE_4GB - PlatformInfoHob->LowMemory));
PlatformInfoHob->Uc32Base = (UINT32)(SIZE_4GB - PlatformInfoHob->Uc32Size);
//
// Assuming that LowerMemorySize is at least 1 byte, Uc32Size is at most 2GB.
// Therefore Uc32Base is at least 2GB.
//
ASSERT (PlatformInfoHob->Uc32Base >= BASE_2GB);
I am not sure if KVM can do such assumption.
Otherwise, KVM needs to calculate the ranges to meet the requirement. :(
> if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
> unsigned long nr_pages;
> @@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
> kvmclock_init();
> x86_platform.apic_post_init = kvm_apic_init;
>
> - /* Set WB as the default cache mode for SEV-SNP and TDX */
> - guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
> + /*
> + * Set WB as the default cache mode for SEV-SNP and TDX, with a single
> + * UC range for the legacy PCI hole, e.g. so that devices that expect
> + * to get UC/WC mappings don't get surprised with WB.
> + */
> + guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
> }
>
> #if defined(CONFIG_AMD_MEM_ENCRYPT)
Powered by blists - more mailing lists