[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c14b3e33-59da-4b8c-8952-caf87126803d@arm.com>
Date: Thu, 1 May 2025 16:09:16 +0100
From: Steven Price <steven.price@....com>
To: Suzuki K Poulose <suzuki.poulose@....com>, kvm@...r.kernel.org,
kvmarm@...ts.linux.dev
Cc: Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>,
Will Deacon <will@...nel.org>, James Morse <james.morse@....com>,
Oliver Upton <oliver.upton@...ux.dev>, Zenghui Yu <yuzenghui@...wei.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Joey Gouly <joey.gouly@....com>, Alexandru Elisei
<alexandru.elisei@....com>, Christoffer Dall <christoffer.dall@....com>,
Fuad Tabba <tabba@...gle.com>, linux-coco@...ts.linux.dev,
Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>,
Gavin Shan <gshan@...hat.com>, Shanker Donthineni <sdonthineni@...dia.com>,
Alper Gun <alpergun@...gle.com>, "Aneesh Kumar K . V"
<aneesh.kumar@...nel.org>
Subject: Re: [PATCH v8 07/43] arm64: RME: ioctls to create and configure
realms
On 29/04/2025 10:45, Suzuki K Poulose wrote:
> On 16/04/2025 14:41, Steven Price wrote:
>> Add the KVM_CAP_ARM_RME_CREATE_RD ioctl to create a realm. This involves
>> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
>> the base level of the Realm Translation Tables (RTT). A VMID also need
>> to be picked, since the RMM has a separate VMID address space a
>> dedicated allocator is added for this purpose.
>>
>> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
>> before it is created. Configuration options can be classified as:
>>
>> 1. Parameters specific to the Realm stage2 (e.g. IPA Size, vmid, stage2
>> entry level, entry level RTTs, number of RTTs in start level, LPA2)
>> Most of these are not measured by RMM and comes from KVM book
>> keeping.
>>
>> 2. Parameters controlling "Arm Architecture features for the VM". (e.g.
>> SVE VL, PMU counters, number of HW BRPs/WPs), configured by the VMM
>> using the "user ID register write" mechanism. These will be
>> supported in the later patches.
>>
>> 3. Parameters are not part of the core Arm architecture but defined
>> by the RMM spec (e.g. Hash algorithm for measurement,
>> Personalisation value). These are programmed via
>> KVM_CAP_ARM_RME_CONFIG_REALM.
>>
>> For the IPA size there is the possibility that the RMM supports a
>> different size to the IPA size supported by KVM for normal guests. At
>> the moment the 'normal limit' is exposed by KVM_CAP_ARM_VM_IPA_SIZE and
>> the IPA size is configured by the bottom bits of vm_type in
>> KVM_CREATE_VM. This means that it isn't easy for the VMM to discover
>> what IPA sizes are supported for Realm guests. Since the IPA is part of
>> the measurement of the realm guest the current expectation is that the
>> VMM will be required to pick the IPA size demanded by attestation and
>> therefore simply failing if this isn't available is fine. An option
>> would be to expose a new capability ioctl to obtain the RMM's maximum
>> IPA size if this is needed in the future.
>>
>> Co-developed-by: Suzuki K Poulose <suzuki.poulose@....com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@....com>
>> Signed-off-by: Steven Price <steven.price@....com>
>> Reviewed-by: Gavin Shan <gshan@...hat.com>
>> ---
>> Changes since v7:
>> * Minor code cleanup following Gavin's review.
>> Changes since v6:
>> * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>> host page size to be larger than 4k while still communicating with an
>> RMM which uses 4k granules.
>> Changes since v5:
>> * Introduce free_delegated_granule() to replace many
>> undelegate/free_page() instances and centralise the comment on
>> leaking when the undelegate fails.
>> * Several other minor improvements suggested by reviews - thanks for
>> the feedback!
>> Changes since v2:
>> * Improved commit description.
>> * Improved return failures for rmi_check_version().
>> * Clear contents of PGD after it has been undelegated in case the RMM
>> left stale data.
>> * Minor changes to reflect changes in previous patches.
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 5 +
>> arch/arm64/include/asm/kvm_rme.h | 19 ++
>> arch/arm64/kvm/arm.c | 16 ++
>> arch/arm64/kvm/mmu.c | 22 +-
>> arch/arm64/kvm/rme.c | 319 +++++++++++++++++++++++++++
>> 5 files changed, 379 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/
>> include/asm/kvm_emulate.h
>> index 1c43a4fc25dd..4ee6c215da82 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -699,6 +699,11 @@ static inline enum realm_state
>> kvm_realm_state(struct kvm *kvm)
>> return READ_ONCE(kvm->arch.realm.state);
>> }
>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>> +{
>> + return kvm_is_realm(kvm) && kvm_realm_state(kvm) !=
>> REALM_STATE_NONE;
>> +}
>> +
>> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>> {
>> return false;
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/
>> asm/kvm_rme.h
>> index 9c8a0b23e0e4..5dc1915de891 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,8 @@
>> #ifndef __ASM_KVM_RME_H
>> #define __ASM_KVM_RME_H
>> +#include <uapi/linux/kvm.h>
>> +
>> /**
>> * enum realm_state - State of a Realm
>> */
>> @@ -46,11 +48,28 @@ enum realm_state {
>> * struct realm - Additional per VM data for a Realm
>> *
>> * @state: The lifetime state machine for the realm
>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>> + * @params: Parameters for the RMI_REALM_CREATE command
>> + * @num_aux: The number of auxiliary pages required by the RMM
>> + * @vmid: VMID to be used by the RMM for the realm
>> + * @ia_bits: Number of valid Input Address bits in the IPA
>> */
>> struct realm {
>> enum realm_state state;
>> +
>> + void *rd;
>> + struct realm_params *params;
>> +
>> + unsigned long num_aux;
>> + unsigned int vmid;
>> + unsigned int ia_bits;
>> };
>> void kvm_init_rme(void);
>> +u32 kvm_realm_ipa_limit(void);
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> +int kvm_init_realm_vm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>> #endif /* __ASM_KVM_RME_H */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 856a721d41ac..0e8482fdc4d3 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -136,6 +136,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> }
>> mutex_unlock(&kvm->lock);
>> break;
>> + case KVM_CAP_ARM_RME:
>> + mutex_lock(&kvm->lock);
>> + r = kvm_realm_enable_cap(kvm, cap);
>> + mutex_unlock(&kvm->lock);
>> + break;
>> default:
>> break;
>> }
>> @@ -198,6 +203,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned
>> long type)
>> bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>> + /* Initialise the realm bits after the generic bits are enabled */
>> + if (kvm_is_realm(kvm)) {
>> + ret = kvm_init_realm_vm(kvm);
>> + if (ret)
>> + goto err_free_cpumask;
>> + }
>> +
>> return 0;
>> err_free_cpumask:
>> @@ -257,6 +269,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>> kvm_unshare_hyp(kvm, kvm + 1);
>> kvm_arm_teardown_hypercalls(kvm);
>> + kvm_destroy_realm(kvm);
>> }
>> static bool kvm_has_full_ptr_auth(void)
>> @@ -405,6 +418,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm,
>> long ext)
>> case KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES:
>> r = BIT(0);
>> break;
>> + case KVM_CAP_ARM_RME:
>> + r = static_key_enabled(&kvm_rme_is_available);
>> + break;
>> default:
>> r = 0;
>> }
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 2feb6c6b63af..5957a07de86d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -876,12 +876,16 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>> .icache_inval_pou = invalidate_icache_guest_page,
>> };
>> -static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long
>> type)
>> +static int kvm_init_ipa_range(struct kvm *kvm,
>> + struct kvm_s2_mmu *mmu, unsigned long type)
>> {
>> u32 kvm_ipa_limit = get_kvm_ipa_limit();
>> u64 mmfr0, mmfr1;
>> u32 phys_shift;
>> + if (kvm_is_realm(kvm))
>> + kvm_ipa_limit = kvm_realm_ipa_limit();
>> +
>> if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>> return -EINVAL;
>> @@ -946,7 +950,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct
>> kvm_s2_mmu *mmu, unsigned long t
>> return -EINVAL;
>> }
>> - err = kvm_init_ipa_range(mmu, type);
>> + err = kvm_init_ipa_range(kvm, mmu, type);
>> if (err)
>> return err;
>> @@ -1072,6 +1076,20 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> struct kvm_pgtable *pgt = NULL;
>> write_lock(&kvm->mmu_lock);
>> + if (kvm_is_realm(kvm) &&
>> + (kvm_realm_state(kvm) != REALM_STATE_DEAD &&
>> + kvm_realm_state(kvm) != REALM_STATE_NONE)) {
>> + /* Tearing down RTTs will be added in a later patch */
>> + write_unlock(&kvm->mmu_lock);
>> +
>> + /*
>> + * The physical PGD pages are delegated to the RMM, so cannot
>> + * be freed at this point. This function will be called again
>> + * from kvm_destroy_realm() after the physical pages have been
>> + * returned at which point the memory can be freed.
>> + */
>> + return;
>> + }
>> pgt = mmu->pgt;
>> if (pgt) {
>> mmu->pgd_phys = 0;
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 67cf2d94cb2d..dbb6521fe380 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -5,9 +5,23 @@
>> #include <linux/kvm_host.h>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>> #include <asm/rmi_cmds.h>
>> #include <asm/virt.h>
>> +#include <asm/kvm_pgtable.h>
>> +
>> +static unsigned long rmm_feat_reg0;
>> +
>> +#define RMM_PAGE_SHIFT 12
>> +#define RMM_PAGE_SIZE BIT(RMM_PAGE_SHIFT)
>> +
>> +static bool rme_has_feature(unsigned long feature)
>> +{
>> + return !!u64_get_bits(rmm_feat_reg0, feature);
>> +}
>> +
>> static int rmi_check_version(void)
>> {
>> struct arm_smccc_res res;
>> @@ -42,6 +56,305 @@ static int rmi_check_version(void)
>> return 0;
>> }
>> +u32 kvm_realm_ipa_limit(void)
>> +{
>> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +static int get_start_level(struct realm *realm)
>> +{
>> + return 4 - ((realm->ia_bits - 8) / (RMM_PAGE_SHIFT - 3));
>
> minor nit: It may be worth adding a comment here, how we got this magic
> number 8. We could say:
>
> Open coded version of ARM64_HW_PGTABLE_LEVELS(ia_bits - 4) (accounting
> for the concatenation of upto 16 tables in entry level) for RMM Stage2.
>
Agreed a comment makes sense. This is actually an open coded version of
"4 - stage2_pgtable_levels()" but using the RMM's page size (indeed
previous versions of the series did exactly that). There's already a
comment for stage2_pgtable_levels() explaining the concatenation.
>> +}
>> +
>> +static void free_delegated_granule(phys_addr_t phys)
>> +{
>> + if (WARN_ON(rmi_granule_undelegate(phys))) {
>> + /* Undelegate failed: leak the page */
>> + return;
>> + }
>> +
>> + kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>> +
>> + free_page((unsigned long)phys_to_virt(phys));
>> +}
>> +
>> +/* Calculate the number of s2 root rtts needed */
>> +static int realm_num_root_rtts(struct realm *realm)
>> +{
>> + unsigned int ipa_bits = realm->ia_bits;
>> + unsigned int levels = 3 - get_start_level(realm);
>
> nit: Why is this 3 - start_level and not 4 ? Though that is compensated
> by the "levels + 1" below, hence the calculation is correct.
I honestly can't remember - will change ;)
>> + unsigned int sl_ipa_bits = (levels + 1) * (RMM_PAGE_SHIFT - 3) +
>> + RMM_PAGE_SHIFT;
>> +
>> + if (sl_ipa_bits >= ipa_bits)
>> + return 1;
>> +
>> + return 1 << (ipa_bits - sl_ipa_bits);
>> +}
>> +
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct realm_params *params = realm->params;
>> + void *rd = NULL;
>> + phys_addr_t rd_phys, params_phys;
>> + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> + int i, r;
>> + int rtt_num_start;
>> +
>> + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> + rtt_num_start = realm_num_root_rtts(realm);
>> +
>> + if (WARN_ON(realm->rd || !realm->params))
>> + return -EEXIST;
>> +
>> + if (pgd_size / RMM_PAGE_SIZE < rtt_num_start)
>> + return -EINVAL;
>> +
>> + rd = (void *)__get_free_page(GFP_KERNEL);
>> + if (!rd)
>> + return -ENOMEM;
>> +
>> + rd_phys = virt_to_phys(rd);
>> + if (rmi_granule_delegate(rd_phys)) {
>> + r = -ENXIO;
>> + goto free_rd;
>> + }
>> +
>> + for (i = 0; i < pgd_size; i += RMM_PAGE_SIZE) {
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> + if (rmi_granule_delegate(pgd_phys)) {
>> + r = -ENXIO;
>> + goto out_undelegate_tables;
>> + }
>> + }
>> +
>> + params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> + params->rtt_level_start = get_start_level(realm);
>> + params->rtt_num_start = rtt_num_start;
>> + params->rtt_base = kvm->arch.mmu.pgd_phys;
>> + params->vmid = realm->vmid;
>> +
>> + params_phys = virt_to_phys(params);
>> +
>> + if (rmi_realm_create(rd_phys, params_phys)) {
>> + r = -ENXIO;
>> + goto out_undelegate_tables;
>> + }
>> +
>> + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
>> + WARN_ON(rmi_realm_destroy(rd_phys));
>> + goto out_undelegate_tables;
>> + }
>> +
>> + realm->rd = rd;
>> +
>> + return 0;
>> +
>> +out_undelegate_tables:
>> + while (i > 0) {
>> + i -= RMM_PAGE_SIZE;
>> +
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i;
>> +
>> + if (WARN_ON(rmi_granule_undelegate(pgd_phys))) {
>> + /* Leak the pages if they cannot be returned */
>> + kvm->arch.mmu.pgt = NULL;
>> + break;
>> + }
>
> minor nit: Do we need to try undelegating the other pages ? We could
> make that WARN_ON_ONCE() too.
We could do, but there's no real point unless we also deal with leaking
only the pages which cannot be undelegated. Considering this is a
"should never happen" case I felt it was better to keep the code simple
and leak more than necessary. To me this feels like a "save everything
and reboot as soon as possible" sort of error condition - the kernel (or
RMM) has lost track of which pages are delegated and so a GPT fault
could be coming at any point. So it's better to play safe and be
pessimistic.
Thanks,
Steve
Powered by blists - more mailing lists