linux-kernel - Re: [PATCH v5 18/43] arm64: RME: Handle realm enter/exit

Message-ID: <791e8c32-83fb-442c-9664-4b5f2f9c09bf@arm.com>
Date: Fri, 29 Nov 2024 13:45:25 +0000
From: Suzuki K Poulose <suzuki.poulose@....com>
To: Steven Price <steven.price@....com>, kvm@...r.kernel.org,
 kvmarm@...ts.linux.dev
Cc: Catalin Marinas <catalin.marinas@....com>, Marc Zyngier <maz@...nel.org>,
 Will Deacon <will@...nel.org>, James Morse <james.morse@....com>,
 Oliver Upton <oliver.upton@...ux.dev>, Zenghui Yu <yuzenghui@...wei.com>,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 Joey Gouly <joey.gouly@....com>, Alexandru Elisei
 <alexandru.elisei@....com>, Christoffer Dall <christoffer.dall@....com>,
 Fuad Tabba <tabba@...gle.com>, linux-coco@...ts.linux.dev,
 Ganapatrao Kulkarni <gankulkarni@...amperecomputing.com>,
 Gavin Shan <gshan@...hat.com>, Shanker Donthineni <sdonthineni@...dia.com>,
 Alper Gun <alpergun@...gle.com>, "Aneesh Kumar K . V"
 <aneesh.kumar@...nel.org>
Subject: Re: [PATCH v5 18/43] arm64: RME: Handle realm enter/exit

Hi Steven

On 29/11/2024 12:18, Steven Price wrote:
> Hi Suzuki,
> 
> Sorry for the very slow response to this. Coming back to this I'm having
> doubts, see below.
> 
> On 17/10/2024 14:00, Suzuki K Poulose wrote:
>> On 04/10/2024 16:27, Steven Price wrote:
>>> Entering a realm is done using a SMC call to the RMM. On exit the
>>> exit-codes need to be handled slightly differently to the normal KVM
>>> path so define our own functions for realm enter/exit and hook them
>>> in if the guest is a realm guest.
>>>
>>> Signed-off-by: Steven Price <steven.price@....com>
> ...
>>> diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c
>>> new file mode 100644
>>> index 000000000000..e96ea308212c
>>> --- /dev/null
>>> +++ b/arch/arm64/kvm/rme-exit.c
> ...
>>> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
>>> +{
>>> +    struct kvm *kvm = vcpu->kvm;
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>> +    unsigned long base = rec->run->exit.ripas_base;
>>> +    unsigned long top = rec->run->exit.ripas_top;
>>> +    unsigned long ripas = rec->run->exit.ripas_value;
>>> +    unsigned long top_ipa;
>>> +    int ret;
>>> +
>>> +    if (!realm_is_addr_protected(realm, base) ||
>>> +        !realm_is_addr_protected(realm, top - 1)) {
>>> +        kvm_err("Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
>>> +            base, top, ripas);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
>>> +                   kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
>>
>> I think we also need to filter the request for RIPAS_RAM, by consulting
>> if the "range" is backed by a memslot or not. If they are not, we should
>> reject the request with a response flag set in run.enter.flags.
> 
> It's an interesting API question. At the moment there is no requirement
> to have an active memslot to set the RIPAS - this is true both during
> the setup by the VMM and at run time.
> 
> In theory a VMM can create/destroy memslots while the guest is running.
> So absence of a memslot doesn't actually imply that the RIPAS change

Agreed. Whether an IPA range may be used as RAM is a decision that the
VMM must make. So, we could give the VMM a chance to respond to this
request before we (KVM) make the RTT changes.

> should be rejected. Obviously with realms this is tricky because when
> destroying a memslot that's in use KVM would rip those pages out from
> the guest and it would require guest cooperation to restore those pages
> (transition to RIPAS_EMPTY and back to RIPAS_RAM). But it's not
> something that has been prohibited so far.

True, and it shouldn't be prohibited. If the Host wants to take away a
memslot it must be able to do that. But if it wants to do that in
good faith with the Realm, there must have been some communication
(e.g., virtio-mem?) between the Host and the Realm. As long as the
Realm knows not to trust the contents of that region, it could be
recovered without a transition to EMPTY,

e.g. from RIPAS_DESTROYED => RIPAS_RAM with RSI_SET_IPA_STATE(...
CHANGE_DESTROYED).
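
To make the DESTROYED => RAM case concrete, here is a rough toy model of
the rule I have in mind. It is not RMM or KVM code: every name below is
made up for illustration, and it simplifies the real behaviour to a
single granule with a boolean standing in for the CHANGE_DESTROYED flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy RIPAS values; illustrative names, not the kernel's or the RMM's. */
enum toy_ripas { TOY_RIPAS_EMPTY, TOY_RIPAS_RAM, TOY_RIPAS_DESTROYED };

/*
 * Apply a guest-requested RIPAS change to one granule.
 * A DESTROYED granule is only changed when the guest opts in via
 * change_destroyed (standing in for the CHANGE_DESTROYED flag).
 * Returns true if the state was updated.
 */
static bool toy_ripas_set(enum toy_ripas *cur, enum toy_ripas req,
			  bool change_destroyed)
{
	assert(cur != NULL);

	/* In this model the guest may only ask for EMPTY or RAM. */
	if (req != TOY_RIPAS_EMPTY && req != TOY_RIPAS_RAM)
		return false;

	/* Without the opt-in, leave a DESTROYED granule untouched. */
	if (*cur == TOY_RIPAS_DESTROYED && !change_destroyed)
		return false;

	*cur = req;
	return true;
}
```

With this model, a granule in TOY_RIPAS_DESTROYED stays put unless the
guest passes change_destroyed, in which case it moves straight to
TOY_RIPAS_RAM with no detour through EMPTY.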


> 
> On the other hand this is a clear way for a (malicious/buggy) guest to
> use a fair bit of RAM by transitioning to RIPAS_RAM (sparse) pages not
> in a memslot and forcing KVM to allocate the RTT pages to delegate to
> the RMM. But we do exit to the VMM, so this is solvable in the VMM (by
> killing a misbehaving guest). The number of pages this would consume per
> exit is also fairly small.

Correct. If the VMM has no intention of providing memory at a given IPA
range, KVM shouldn't report RSI_ACCEPT to the Realm, only for the Realm
to later take a stage 2 fault that KVM cannot service.

> 
> So my instinct is that we shouldn't impose that requirement.

I think we may be able to fix this by letting the VMM ACCEPT or REJECT
a given RIPAS_RAM transition request. That way, KVM isn't playing by
the rules set by the VMM and whether the VMM wants to trick the Realm
or play by the rules is up to it.
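
As a sketch of what such a VMM-side decision could look like (purely an
assumption on my part, not an existing API): the VMM checks the
requested range against the memslots it has registered and answers
accept or reject. The struct, enum and function names are hypothetical,
and for brevity the range must fit inside a single slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a memslot as seen by the VMM: [base, base + size). */
struct toy_slot {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical response the VMM hands back for a RIPAS_RAM request. */
enum toy_response { TOY_ACCEPT, TOY_REJECT };

/*
 * Accept the request only if [base, top) lies entirely within one slot.
 * A real VMM may apply any policy it likes; this is just one example.
 */
static enum toy_response toy_vmm_ripas_ram_policy(const struct toy_slot *slots,
						  size_t nr_slots,
						  uint64_t base, uint64_t top)
{
	assert(slots != NULL || nr_slots == 0);

	/* Reject empty or inverted ranges outright. */
	if (top <= base)
		return TOY_REJECT;

	for (size_t i = 0; i < nr_slots; i++) {
		uint64_t end = slots[i].base + slots[i].size;

		if (base >= slots[i].base && top <= end)
			return TOY_ACCEPT;
	}

	return TOY_REJECT;
}
```

KVM would then only relay the answer (e.g. via a response flag in
run.enter.flags) rather than deciding on its own.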


> 
> Any thoughts?
> 
>> As for EMPTY requests, if the guest wants to explicitly mark any range
>> as EMPTY, it doesn't matter, as long as it is within the protected IPA.
>> (even though they may be EMPTY in the first place).
>>
>>> +    write_lock(&kvm->mmu_lock);
>>> +    ret = realm_set_ipa_state(vcpu, base, top, ripas, &top_ipa);
>>> +    write_unlock(&kvm->mmu_lock);
>>> +
>>> +    WARN(ret && ret != -ENOMEM,
>>> +         "Unable to satisfy RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
>>> +         base, top, ripas);
>>> +
>>> +    /* Exit to VMM to complete the change */
>>> +    kvm_prepare_memory_fault_exit(vcpu, base, top_ipa - base, false, false,
>>> +                      ripas == RMI_RAM);
>>
>> Again this may only be needed if the range is backed by a memslot ?
>> Otherwise the VMM has nothing to do.
> 
> Assuming the above, then the VMM would be the one to kill a misbehaving
> guest, so would need a notification.

Maybe we could reverse the order of operations by delaying the
realm_set_ipa_state() call so that it occurs at the VMM's request,
following the memory_fault_exit.
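
Roughly, the reversed flow could look like the toy model below. This is
only a sketch under my own assumptions: the struct and both functions
are hypothetical, "applied" merely stands in for realm_set_ipa_state()
having run, and "rejected" stands in for a reject response being
reported to the Realm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping for a deferred RIPAS change. */
struct toy_vcpu {
	bool pending;       /* a RIPAS change awaits the VMM's decision */
	uint64_t base, top; /* requested IPA range [base, top) */
	bool applied;       /* stands in for the RTT update having run */
	bool rejected;      /* stands in for a reject response to the Realm */
};

/* On the RIPAS_CHANGE exit: record the request, change nothing yet. */
static int toy_ripas_exit(struct toy_vcpu *v, uint64_t base, uint64_t top)
{
	assert(v != NULL);

	v->pending = true;
	v->base = base;
	v->top = top;
	v->applied = false;
	v->rejected = false;

	return 0; /* 0 meaning "exit to the VMM", as in handle_rec_exit() */
}

/* On the next entry: act on whatever the VMM decided. */
static void toy_ripas_complete(struct toy_vcpu *v, bool vmm_accepts)
{
	assert(v != NULL);

	if (!v->pending)
		return;

	if (vmm_accepts)
		v->applied = true;   /* only now touch the RTTs */
	else
		v->rejected = true;  /* no RTT pages consumed at all */

	v->pending = false;
}
```

The point of the ordering is that a rejected request never causes KVM to
allocate or delegate RTT pages in the first place.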


Suzuki

> 
> Thanks,
> Steve
> 
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
>>> +{
>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>> +
>>> +    __vcpu_sys_reg(vcpu, CNTV_CTL_EL0) = rec->run->exit.cntv_ctl;
>>> +    __vcpu_sys_reg(vcpu, CNTV_CVAL_EL0) = rec->run->exit.cntv_cval;
>>> +    __vcpu_sys_reg(vcpu, CNTP_CTL_EL0) = rec->run->exit.cntp_ctl;
>>> +    __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0) = rec->run->exit.cntp_cval;
>>> +
>>> +    kvm_realm_timers_update(vcpu);
>>> +}
>>> +
>>> +/*
>>> + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>>> + * proper exit to userspace.
>>> + */
>>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
>>> +{
>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>> +    u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
>>> +    unsigned long status, index;
>>> +
>>> +    status = RMI_RETURN_STATUS(rec_run_ret);
>>> +    index = RMI_RETURN_INDEX(rec_run_ret);
>>> +
>>> +    /*
>>> +     * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
>>> +     * see the following status code and index indicating an attempt to run
>>> +     * a REC when the RD state is SYSTEM_OFF.  In this case, we just need to
>>> +     * return to user space which can deal with the system event or will try
>>> +     * to run the KVM VCPU again, at which point we will no longer attempt
>>> +     * to enter the Realm because we will have a sleep request pending on
>>> +     * the VCPU as a result of KVM's PSCI handling.
>>> +     */
>>> +    if (status == RMI_ERROR_REALM && index == 1) {
>>> +        vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (rec_run_ret)
>>> +        return -ENXIO;
>>> +
>>> +    vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
>>> +    vcpu->arch.fault.far_el2 = rec->run->exit.far;
>>> +    vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar;
>>> +
>>> +    update_arch_timer_irq_lines(vcpu);
>>> +
>>> +    /* Reset the emulation flags for the next run of the REC */
>>> +    rec->run->enter.flags = 0;
>>> +
>>> +    switch (rec->run->exit.exit_reason) {
>>> +    case RMI_EXIT_SYNC:
>>> +        return rec_exit_handlers[esr_ec](vcpu);
>>> +    case RMI_EXIT_IRQ:
>>> +    case RMI_EXIT_FIQ:
>>> +        return 1;
>>> +    case RMI_EXIT_PSCI:
>>> +        return rec_exit_psci(vcpu);
>>> +    case RMI_EXIT_RIPAS_CHANGE:
>>> +        return rec_exit_ripas_change(vcpu);
>>> +    }
>>> +
>>> +    kvm_pr_unimpl("Unsupported exit reason: %u\n",
>>> +              rec->run->exit.exit_reason);
>>> +    vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>>> +    return 0;
>>> +}
>>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>>> index 1fa9991d708b..4c0751231810 100644
>>> --- a/arch/arm64/kvm/rme.c
>>> +++ b/arch/arm64/kvm/rme.c
>>> @@ -899,6 +899,25 @@ void kvm_destroy_realm(struct kvm *kvm)
>>>        kvm_free_stage2_pgd(&kvm->arch.mmu);
>>>    }
>>>
>>> +int kvm_rec_enter(struct kvm_vcpu *vcpu)
>>> +{
>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>> +
>>> +    switch (rec->run->exit.exit_reason) {
>>> +    case RMI_EXIT_HOST_CALL:
>>> +    case RMI_EXIT_PSCI:
>>> +        for (int i = 0; i < REC_RUN_GPRS; i++)
>>> +            rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>>> +        break;
>>> +    }
>>
>> As mentioned in the patch following (MMIO emulation support), we may be
>> able to do this unconditionally for all REC entries, to cover ourselves
>> from missing out other cases. The RMM is in charge of taking the
>> appropriate action anyways to copy the results back.
>>
>> Suzuki
>>
>>> +
>>> +    if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
>>> +        return -EINVAL;
>>> +
>>> +    return rmi_rec_enter(virt_to_phys(rec->rec_page),
>>> +                 virt_to_phys(rec->run));
>>> +}
>>> +
>>>    static void free_rec_aux(struct page **aux_pages,
>>>                 unsigned int num_aux)
>>>    {
>