[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240428062446.gfux36w3j2fy2dnj@yy-desk-7060>
Date: Sun, 28 Apr 2024 14:24:46 +0800
From: Yuan Yao <yuan.yao@...ux.intel.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: isaku.yamahata@...el.com, pbonzini@...hat.com, erdemaktas@...gle.com,
vkuznets@...hat.com, seanjc@...gle.com, vannapurve@...gle.com,
jmattson@...gle.com, mlevitsk@...hat.com, xiaoyao.li@...el.com,
chao.gao@...el.com, rick.p.edgecombe@...el.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH V5 3/4] KVM: x86: Add a capability to configure bus
frequency for APIC timer
On Thu, Apr 25, 2024 at 03:07:01PM -0700, Reinette Chatre wrote:
> From: Isaku Yamahata <isaku.yamahata@...el.com>
>
> Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
> bus clock frequency for APIC timer emulation.
> Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
> frequency in nanoseconds. When using this capability, the user space
> VMM should configure CPUID leaf 0x15 to advertise the frequency.
>
> Vishal reported that the TDX guest kernel expects a 25MHz APIC bus
> frequency but ends up getting interrupts at a significantly higher rate.
>
> The TDX architecture hard-codes the core crystal clock frequency to
> 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
> does not allow the VMM to override the value.
>
> In addition, per Intel SDM:
> "The APIC timer frequency will be the processor’s bus clock or core
> crystal clock frequency (when TSC/core crystal clock ratio is
> enumerated in CPUID leaf 0x15) divided by the value specified in
> the divide configuration register."
>
> The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
> APIC bus frequency of 1GHz.
>
> The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
> space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
> enumerated, the guest kernel uses it as the APIC bus frequency. If not,
> the guest kernel measures the frequency based on other known timers like
> the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest
> kernel expects a 25MHz timer frequency but gets timer interrupt more
> frequently due to the 1GHz frequency used by KVM.
>
> To ensure that the guest doesn't have a conflicting view of the APIC bus
> frequency, allow the userspace to tell KVM to use the same frequency that
> TDX mandates instead of the default 1Ghz.
>
> There are several options to address this:
> 1. Make the KVM able to configure APIC bus frequency (this series).
> Pro: It resembles the existing hardware. The recent Intel CPUs
> adapts 25MHz.
> Con: Require the VMM to emulate the APIC timer at 25MHz.
> 2. Make the TDX architecture enumerate CPUID leaf 0x15 to configurable
> frequency or not enumerate it.
> Pro: Any APIC bus frequency is allowed.
> Con: Deviates from TDX architecture.
> 3. Make the TDX guest kernel use 1GHz when it's running on KVM.
> Con: The kernel ignores CPUID leaf 0x15.
> 4. Change CPUID leaf 0x15 under TDX to report the crystal clock frequency
> as 1 GHz.
> Pro: This has been the virtual APIC frequency for KVM guests for 13
> years.
> Pro: This requires changing only one hard-coded constant in TDX.
> Con: It doesn't work with other VMMs as TDX isn't specific to KVM.
> Con: Core crystal clock frequency is also used to calculate TSC
> frequency.
> Con: If it is configured to value different from hardware, it will
> break the correctness of INTEL-PT Mini Time Count (MTC) packets
> in TDs.
Reviewed-by: Yuan Yao <yuan.yao@...el.com>
>
> Reported-by: Vishal Annapurve <vannapurve@...gle.com>
> Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com/
> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> Co-developed-by: Reinette Chatre <reinette.chatre@...el.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@...el.com>
> ---
> Changes v5:
> - Rename capability KVM_CAP_X86_APIC_BUS_FREQUENCY ->
> KVM_CAP_X86_APIC_BUS_CYCLES_NS. (Xiaoyao Li)
> - Add Rick's Reviewed-by tag.
>
> Changes v4:
> - Rework implementation following Sean's guidance in:
> https://lore.kernel.org/all/ZdjzIgS6EAeCsUue@google.com/
> - Reword con #2 to acknowledge feedback. (Sean)
> - Add the "con" information from Xiaoyao during earlier review of v2.
> - Rework changelog to address comments related to "bus clock" vs
> "core crystal clock" frequency. (Xiaoyao)
> - Drop snippet about impact on TSC deadline timer emulation. (Maxim)
> - Drop Maxim Levitsky's "Reviewed-by" tag due to many changes to patch
> since tag received.
> - Switch "Subject:" to match custom "KVM: X86:" -> "KVM: x86:"
>
> Changes v3:
> - Added reviewed-by Maxim Levitsky.
> - Minor update of the commit message.
>
> Changes v2:
> - Add check if vcpu isn't created.
> - Add check if lapic chip is in-kernel emulation.
> - Fix build error for i386.
> - Add document to api.rst.
> - Typo in the commit message.
>
> Documentation/virt/kvm/api.rst | 17 +++++++++++++++++
> arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 1 +
> 3 files changed, 45 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0b76ff5030d..f014cd9b2217 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8063,6 +8063,23 @@ error/annotated fault.
>
> See KVM_EXIT_MEMORY_FAULT for more information.
>
> +7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS
> +-----------------------------------
> +
> +:Architectures: x86
> +:Target: VM
> +:Parameters: args[0] is the desired APIC bus clock rate, in nanoseconds
> +:Returns: 0 on success, -EINVAL if args[0] contains an invalid value for the
> + frequency or if any vCPUs have been created, -ENXIO if a virtual
> + local APIC has not been created using KVM_CREATE_IRQCHIP.
> +
> +This capability sets VM's APIC bus clock frequency, used by KVM's in-kernel
> +virtual APIC when emulating APIC timers. KVM's default value can be retrieved
> +by KVM_CHECK_EXTENSION.
> +
> +Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
> +core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
> +
> 8. Other capabilities.
> ======================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 10e6315103f4..fa6954c9a9d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4715,6 +4715,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> + case KVM_CAP_X86_APIC_BUS_CYCLES_NS:
> + r = APIC_BUS_CYCLE_NS_DEFAULT;
> + break;
> case KVM_CAP_EXIT_HYPERCALL:
> r = KVM_EXIT_HYPERCALL_VALID_MASK;
> break;
> @@ -6755,6 +6758,30 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> }
> mutex_unlock(&kvm->lock);
> break;
> + case KVM_CAP_X86_APIC_BUS_CYCLES_NS: {
> + u64 bus_cycle_ns = cap->args[0];
> + u64 unused;
> +
> + r = -EINVAL;
> + /*
> + * Guard against overflow in tmict_to_ns(). 128 is the highest
> + * divide value that can be programmed in APIC_TDCR.
> + */
> + if (!bus_cycle_ns ||
> + check_mul_overflow((u64)U32_MAX * 128, bus_cycle_ns, &unused))
> + break;
> +
> + r = 0;
> + mutex_lock(&kvm->lock);
> + if (!irqchip_in_kernel(kvm))
> + r = -ENXIO;
> + else if (kvm->created_vcpus)
> + r = -EINVAL;
> + else
> + kvm->arch.apic_bus_cycle_ns = bus_cycle_ns;
> + mutex_unlock(&kvm->lock);
> + break;
> + }
> default:
> r = -EINVAL;
> break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..6a4d9432ab11 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -917,6 +917,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_MEMORY_ATTRIBUTES 233
> #define KVM_CAP_GUEST_MEMFD 234
> #define KVM_CAP_VM_TYPES 235
> +#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 236
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> --
> 2.34.1
>
>
Powered by blists - more mailing lists