linux-kernel - Re: [RFC PATCH 1/8] KVM: Document KVM_MAP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240308021941.GM368614@ls.amr.corp.intel.com>
Date: Thu, 7 Mar 2024 18:19:41 -0800
From: Isaku Yamahata <isaku.yamahata@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: David Matlack <dmatlack@...gle.com>, Kai Huang <kai.huang@...el.com>,
	Isaku Yamahata <isaku.yamahata@...ux.intel.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	Isaku Yamahata <isaku.yamahata@...el.com>,
	"federico.parola@...ito.it" <federico.parola@...ito.it>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
	"michael.roth@....com" <michael.roth@....com>
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 05:28:20PM -0800,
Sean Christopherson <seanjc@...gle.com> wrote:

> On Thu, Mar 07, 2024, David Matlack wrote:
> > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > +:Returns: 0 on success, <0 on error
> > > > > > +
> > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > +  struct kvm_memory_mapping {
> > > > > > +	__u64 base_gfn;
> > > > > > +	__u64 nr_pages;
> > > > > > +	__u64 flags;
> > > > > > +	__u64 source;
> > > > > > +  };
> > > > > > +
> > > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > > 
> > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > 
> > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > as user-fault?
> > > > 
> > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > fault.  Not we call the ioctl as user context.
> > > 
> > > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > > error bit before it calls into the fault handler?
> > > 
> > > My question is, since this is ABI, you have to tell how userspace is
> > > supposed to use this.  Maybe I am missing something, but I don't see how
> > > USER should be used here.
> > 
> > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> 
> +1
> 
> > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > populated with writes (which avoids just faulting in the zero-page for
> > anon or tmpfs backed memslots), while also allowing populating read-only
> > memslots.
> > 
> > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> 
> It would midly be interesting for something like the NX hugepage mitigation.
> 
> For the initial implementation, I don't think the ioctl() should specify
> protections, period.
> 
> VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> And for guest_memfd, finer grained control in general, and long term compatibility
> with other features that are in-flight or proposed, I would rather userspace specify
> RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
> pain too.
> 
> KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
> and add W if all underlying protections allow it.
> 
> That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> for XO memory.
> 
> And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> regions.
> 
> We can always expand the uAPI, but taking away functionality is much harder, if
> not impossible.

Ok, let me drop all the flags.  Here is the updated one.

4.143 KVM_MAP_MEMORY
------------------------

:Capability: KVM_CAP_MAP_MEMORY
:Architectures: none
:Type: vcpu ioctl
:Parameters: struct kvm_memory_mapping(in/out)
:Returns: 0 on success, < 0 on error

Errors:

  ======   =============================================================
  EINVAL   vcpu state is not in TDP MMU mode or is in guest mode.
           Currently, this ioctl is restricted to TDP MMU.
  EAGAIN   The region is only processed partially.  The caller should
           issue the ioctl with the updated parameters.
  EINTR    An unmasked signal is pending.  The region may be processed
           partially.  If `nr_pages` > 0, the caller should issue the
           ioctl with the updated parameters.
  ======   =============================================================

KVM_MAP_MEMORY populates guest memory before the VM starts to run.  Multiple
vcpus can call this ioctl simultaneously.  It may result in the error of EAGAIN
due to race conditions.

::

  struct kvm_memory_mapping {
	__u64 base_gfn;
	__u64 nr_pages;
	__u64 flags;
	__u64 source;
  };

KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer.  If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory.  The backend may encrypt
it.  `flags` must be zero.  It's reserved for future use.

When the ioctl returns, the input values are updated.  If `nr_pages` is large,
it may return EAGAIN or EINTR for pending signal and update the values
(`base_gfn` and `nr_pages`.  `source` if not zero) to point to the remaining
range.

-- 
Isaku Yamahata <isaku.yamahata@...ux.intel.com>