Message-ID: <ZiAw1jd8840jXqok@google.com>
Date: Wed, 17 Apr 2024 13:28:06 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
isaku.yamahata@...el.com, xiaoyao.li@...el.com, binbin.wu@...ux.intel.com,
rick.p.edgecombe@...el.com
Subject: Re: [PATCH 1/7] KVM: Document KVM_MAP_MEMORY ioctl
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_map_memory (in/out)
> +:Returns: 0 on success, < 0 on error
> +
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `base_address` and `size` were invalid (e.g. not
> + page aligned or outside the defined memory slots).
"outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
that can _never_ succeed.
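Roughly the split I have in mind, as a userspace-compilable sketch. The
names (struct sketch_slot, sketch_check_range(), SKETCH_PAGE_SIZE) are made
up for illustration and are not KVM code; the sketch also assumes the range
must fit inside a single slot, which the real thing may not require.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4KiB pages */

/* Hypothetical stand-in for a memslot: a contiguous GPA range. */
struct sketch_slot {
	uint64_t base_gpa;
	uint64_t size;
};

static int sketch_check_range(const struct sketch_slot *slots, int nr_slots,
			      uint64_t gpa, uint64_t size)
{
	int i;

	/* Misaligned or wrapping ranges can never succeed: -EINVAL. */
	if ((gpa | size) & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;
	if (gpa + size < gpa)
		return -EINVAL;

	/* A well-formed range that misses the current slots: -EFAULT. */
	for (i = 0; i < nr_slots; i++) {
		if (gpa >= slots[i].base_gpa &&
		    gpa + size <= slots[i].base_gpa + slots[i].size)
			return 0;
	}
	return -EFAULT;
}
```

The point being that userspace can fix an -EFAULT by changing the memslot
layout, whereas an -EINVAL range is bogus no matter what.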
> + EAGAIN The ioctl should be invoked again and no page was processed.
> + EINTR An unmasked signal is pending and no page was processed.
I'm guessing we'll want to handle large ranges, at which point we'll likely end
up with EAGAIN and/or EINTR after processing at least one page.
> + EFAULT The parameter address was invalid.
> + EOPNOTSUPP The architecture does not support this operation, or the
> + guest state does not allow it.
I would phrase this as something like:
Mapping memory for a GPA is unsupported by the architecture,
and/or for the current vCPU state/mode.
It's not that the guest state doesn't "allow" it, it's that it's explicitly
unsupported because it's nonsensical without a GVA (or L2 GPA).
> + ========== ===============================================================
> +
> +::
> +
> + struct kvm_map_memory {
> + /* in/out */
> + __u64 base_address;
I think we should commit to this being limited to gpa mappings, e.g. go with
"gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
> + __u64 size;
> + /* in */
> + __u64 flags;
> + __u64 padding[5];
> + };
> +
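I.e. with the rename, something like the below. This is only a sketch of
the layout: the _sketch suffix is there to make it obvious this isn't the
real uAPI header, and uint64_t stands in for __u64 so it builds outside the
kernel tree.

```c
#include <stdint.h>

/* Sketch only: same layout as the quoted struct, with "gpa" as the name. */
struct kvm_map_memory_sketch {
	/* in/out */
	uint64_t gpa;
	uint64_t size;
	/* in */
	uint64_t flags;
	uint64_t padding[5];
};

/* Eight 64-bit fields, so the layout is identical for 32-bit and 64-bit userspace. */
_Static_assert(sizeof(struct kvm_map_memory_sketch) == 64,
	       "unexpected struct size");
```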
> +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
I think we should word this very carefully and explicitly so that KVM doesn't
commit to behavior that can't be guaranteed. We might even want to use a name
that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
could be interpreted as the _guest's_ page tables.
Something like:
KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
for the current vCPU state. KVM maps memory as if the vCPU generated a
stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
CoW. However, KVM does not mark any newly created stage-2 PTE as Accessed.
> +When the ioctl returns, the input values are updated to point to the
> +remaining range. If `size` > 0 on return, the caller can just issue
> +the ioctl again with the same `struct kvm_map_memory` argument.
This is likely misleading. Unless KVM explicitly zeros size on *every* failure,
a pedantic reading of this would suggest that userspace can retry and it should
eventually succeed.
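To illustrate, the userspace loop I'd expect looks something like the
sketch below: retry only on EAGAIN/EINTR and treat everything else as
fatal, regardless of what `size` says. All names are hypothetical;
sketch_map_fn stands in for ioctl(vcpu_fd, KVM_MAP_MEMORY, args), and
fake_map() exists purely so the loop can be exercised without KVM.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout, mirroring the struct in the quoted patch. */
struct sketch_map_args {
	uint64_t base_address;
	uint64_t size;
	uint64_t flags;
	uint64_t padding[5];
};

/* Stand-in for the ioctl: returns 0, or -1 with errno set. */
typedef int (*sketch_map_fn)(struct sketch_map_args *args);

static int sketch_map_range(sketch_map_fn do_map, struct sketch_map_args *args)
{
	while (args->size > 0) {
		if (do_map(args) == 0)
			continue;	/* progress was made, go again */
		if (errno == EAGAIN || errno == EINTR)
			continue;	/* transient, retry as-is */
		return -errno;		/* anything else is fatal, size > 0 or not */
	}
	return 0;
}

/* Test double: maps one 4KiB page per call, fails at fake_hole. */
static uint64_t fake_hole = UINT64_MAX;

static int fake_map(struct sketch_map_args *args)
{
	if (args->base_address >= fake_hole) {
		errno = EFAULT;
		return -1;
	}
	args->base_address += 4096;
	args->size -= 4096;
	return 0;
}
```

If the documentation only says "reissue while size > 0", the third branch
is exactly what a reader would leave out.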
> +In some cases, multiple vCPUs might share the page tables. In this
> +case, if this ioctl is called in parallel for multiple vCPUs the
> +ioctl might return with `size` > 0.
Why? If there's already a valid mapping, mission accomplished. I don't see any
reason to return an error. If x86's page fault path returns RET_PF_RETRY, then I
think it makes sense to retry in KVM, not punt this to userspace.
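Conceptually something like the following, written as a standalone sketch
rather than real KVM code. The SKETCH_RET_PF_* values, the callback types,
and the fake_* helpers are all made up for illustration; sig_pending()
stands in for a pending-signal check so that the loop can still be
interrupted.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative values only, not KVM's actual RET_PF_* definitions. */
enum sketch_ret_pf {
	SKETCH_RET_PF_FIXED = 0,
	SKETCH_RET_PF_RETRY = 1,
};

typedef int (*sketch_fault_fn)(uint64_t gpa);	/* RET_PF value or -errno */
typedef int (*sketch_pending_fn)(void);		/* nonzero if a signal is pending */

static int sketch_prefault_one(sketch_fault_fn fault,
			       sketch_pending_fn sig_pending, uint64_t gpa)
{
	int r;

	do {
		/* Bail to userspace only for signals, not for plain retries. */
		if (sig_pending())
			return -EINTR;
		r = fault(gpa);
	} while (r == SKETCH_RET_PF_RETRY);

	return r < 0 ? r : 0;
}

/* Test doubles: the fault path asks for a retry twice, then succeeds. */
static int fake_retries_left = 2;

static int fake_fault(uint64_t gpa)
{
	(void)gpa;
	return fake_retries_left-- > 0 ? SKETCH_RET_PF_RETRY : SKETCH_RET_PF_FIXED;
}

static int fake_no_signal(void) { return 0; }
static int fake_signal(void) { return 1; }
```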
> +The ioctl may not be supported for all VMs, and may just return
> +an `EOPNOTSUPP` error if a VM does not support it. You may use
> +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> +supported.
Why per-VM? I don't think there's any per-VM state that would change the behavior.
The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
problems are per-vCPU, not per-VM.
Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
itself.
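For reference, the userspace side of the check is the same ioctl() no
matter which KVM fd it is issued on, e.g. the sketch below. kvm_has_cap()
is a hypothetical helper, and the capability number is passed in by the
caller because KVM_CAP_MAP_MEMORY is only proposed in this series and so
isn't in existing headers.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: returns 1 if KVM_CHECK_EXTENSION reports the
 * capability on the given fd, 0 if it is unsupported or the ioctl fails.
 */
static int kvm_has_cap(int fd, long cap)
{
	int r = ioctl(fd, KVM_CHECK_EXTENSION, cap);

	return r > 0;
}
```

Usage would be along the lines of kvm_has_cap(kvm_fd, cap), with kvm_fd
obtained from open("/dev/kvm", O_RDWR).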
> +Also, shadow page tables cannot support this ioctl because they
> +are indexed by virtual address or nested guest physical address.
> +Calling this ioctl when the guest is using shadow page tables (for
> +example because it is running a nested guest) will also fail.
Nit: "because it is running a nested guest using TDP".
> +
> +`flags` must currently be zero.
> +
> +
> 5. The kvm_run structure
> ========================
>
> --
> 2.43.0
>
>