linux-kernel - Re: [PATCH v6 1/2] KVM: guest_memfd: add generic population via write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8a28ddea-35c0-490e-a7d2-7fb612fdd008@amazon.com>
Date: Fri, 24 Oct 2025 15:35:34 +0100
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>, Nikita Kalyazin
	<kalyazin@...zon.co.uk>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "shuah@...nel.org"
	<shuah@...nel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"david@...hat.com" <david@...hat.com>, "jthoughton@...gle.com"
	<jthoughton@...gle.com>, "patrick.roy@...ux.dev" <patrick.roy@...ux.dev>,
	Jack Thomson <jackabt@...zon.co.uk>, Derek Manwaring <derekmn@...zon.com>,
	Marco Cali <xmarcalx@...zon.co.uk>, <ackerleytng@...gle.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>
Subject: Re: [PATCH v6 1/2] KVM: guest_memfd: add generic population via write



On 23/10/2025 17:07, Sean Christopherson wrote:
> On Mon, Oct 20, 2025, Nikita Kalyazin wrote:
>> From: Nikita Kalyazin <kalyazin@...zon.com>

+ Vishal and Ackerley

>>
>> write syscall populates guest_memfd with user-supplied data in a generic
>> way, ie no vendor-specific preparation is performed.  If the request is
>> not page-aligned, the remaining bytes are initialised to 0.
>>
>> write is only supported for non-CoCo setups where guest memory is not
>> hardware-encrypted.
> 
> Please include all of the "why".  The code mostly communicates the "what", but
> it doesn't capture why write() support is at all interesting, nor does it explain
> why read() isn't supported.

Hi Sean,

Thanks for the review.

Do you think including the explanation from the cover letter would be 
sufficient?  Shall I additionally say that read() isn't supported 
because there is no use case for it as of now or would it be obvious?

> 
>> Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
>> ---
>>   virt/kvm/guest_memfd.c | 48 ++++++++++++++++++++++++++++++++++++++++++
> 
> There's a notable lack of uAPI and Documentation chanegs.  I.e. this needs a
> GUEST_MEMFD_FLAG_xxx along with proper documentation.

Would the following be ok in the doc?

When the capability KVM_CAP_GUEST_MEMFD_WRITE is supported, the 'flags' 
field
supports GUEST_MEMFD_FLAG_WRITE. Setting this flag on guest_memfd creation
enables write() syscall operations to populate guest_memfd memory from host
userspace.

When a write() operation is performed on a guest_memfd file descriptor 
with the
GUEST_MEMFD_FLAG_WRITE set, the syscall will populate the guest memory with
user-supplied data in a generic way, without any vendor-specific 
preparation.
The write operation is only supported for non-CoCo (Confidential Computing)
setups where guest memory is not hardware-encrypted. If the write request is
not page-aligned, any remaining bytes within the page are initialized to 
zero.

> 
> And while it's definitely it's a-ok to land .write() in advance of the direct map
> changes, we do need to at least map out how we want the two to interact, e.g. so
> that we don't end up with constraints that are impossible to satisfy.
> 

write() shall not attempt to access a page that is not in the direct 
map, which I believe can be achieved via kvm_kmem_gmem_write_begin() 
consulting the KVM_GMEM_FOLIO_NO_DIRECT_MAP in folio->private 
(introduced in [1]).

Do you think we should mention it in the commit message in some way? 
What particular constraint are you cautious about?

[1] https://lore.kernel.org/kvm/20250924152214.7292-2-roypat@amazon.co.uk/

>>   1 file changed, 48 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 94bafd6c558c..f4e218049afa 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -380,6 +380,8 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>>
>>   static struct file_operations kvm_gmem_fops = {
>>        .mmap           = kvm_gmem_mmap,
>> +     .llseek         = default_llseek,
>> +     .write_iter     = generic_perform_write,
>>        .open           = generic_file_open,
>>        .release        = kvm_gmem_release,
>>        .fallocate      = kvm_gmem_fallocate,
>> @@ -390,6 +392,49 @@ void kvm_gmem_init(struct module *module)
>>        kvm_gmem_fops.owner = module;
>>   }
>>
>> +static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
>> +                                  struct address_space *mapping,
>> +                                  loff_t pos, unsigned int len,
>> +                                  struct folio **foliop,
>> +                                  void **fsdata)
> 
> Over-aggressive wrapping, this can be
> 
> 
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
>                                       struct address_space *mapping, loff_t pos,
>                                       unsigned int len, struct folio **folio,
>                                       void **fsdata)
> 
> or
> 
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
>                                       struct address_space *mapping,
>                                       loff_t pos, unsigned int len,
>                                       struct folio **folio, void **fsdata)
> 
> if we want to bundle pos+len.

Ack.

> 
>> +{
>> +     struct file *file = kiocb->ki_filp;
> 
> ki_filp is already a file, and even if it were a "void *", there's no need for a
> local variable.

Ack.

> 
>> +     struct inode *inode = file_inode(file);
>> +     pgoff_t index = pos >> PAGE_SHIFT;
>> +     struct folio *folio;
>> +
>> +     if (!kvm_gmem_supports_mmap(inode))
> 
> Checking for MMAP is neither sufficient nor strictly necessary.  MMAP doesn't
> imply SHARED, and it's not clear to me that mmap() support should be in any way
> tied to WRITE support.

As in my reply to the comment about doc, I plan to introduce 
KVM_CAP_GUEST_MEMFD_WRITE and GUEST_MEMFD_FLAG_WRITE.  The 
kvm_arch_supports_gmem_write() will be a weak symbol and relying on 
!kvm_arch_has_private_mem() on x86, similar to 
kvm_arch_supports_gmem_mmap().  Does it look right?

> 
>> +             return -ENODEV;
>> +
>> +     if (pos + len > i_size_read(inode))
>> +             return -EINVAL;
>> +
>> +     folio = kvm_gmem_get_folio(inode, index);
> 
> Eh, since "index" is only used once, my vote is to use "pos" and do the shift
> here, so that it's obvious that the input to kvm_gmem_get_folio() is being checked.

Ack.

> 
>> +     if (IS_ERR(folio))
>> +             return -EFAULT;
> 
> Why EFAULT?

Will propagate the error like you suggest below.

> 
>> +
>> +     *foliop = folio;
> 
> There shouldn't be any need for a local "folio".  What about having the "out"
> param be just "folio"?
> 
> E.g.
> 
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
>                                       struct address_space *mapping,
>                                       loff_t pos, unsigned int len,
>                                       struct folio **folio, void **fsdata)
> {
>          struct inode *inode = file_inode(kiocb->ki_filp);
> 
>          if (!kvm_gmem_supports_write(inode))
>                  return -ENODEV;
> 
>          if (pos + len > i_size_read(inode))
>                  return -EINVAL;
> 
>          *folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT);
>          if (IS_ERR(*folio))
>                  return PTR_ERR(*folio);
> 
>          return 0;
> }

Ack.

> 
> 
>> +     return 0;
>> +}
>> +
>> +static int kvm_kmem_gmem_write_end(const struct kiocb *kiocb,
>> +                                struct address_space *mapping,
>> +                                loff_t pos, unsigned int len,
>> +                                unsigned int copied,
>> +                                struct folio *folio, void *fsdata)
>> +{
>> +     if (copied && copied < len) {
> 
> Why check if "copied" is non-zero?  I don't see why KVM should behave differently
> with respect to unwritten bytes if copy_folio_from_iter_atomic() fails on the
> first byte or the Nth byte.

No, I don't think there is a need for this check indeed.  It looks like 
a leftover from my previous changes.

> 
>> +             unsigned int from = pos & ((1UL << folio_order(folio)) - 1);
> 
> Uh, isn't this just offset_in_folio()?
> 
>> +
>> +             folio_zero_range(folio, from + copied, len - copied);
> 
> I'd probably be in favor of omitting "from" entirely, e.g.
> 
>          if (copied < len)
>                  folio_zero_range(folio, offset_in_folio(pos) + copied,
>                                   len - copied);
> 

Ack.

>> +     }
>> +
>> +     folio_unlock(folio);
>> +     folio_put(folio);
>> +
>> +     return copied;
>> +}