[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8a28ddea-35c0-490e-a7d2-7fb612fdd008@amazon.com>
Date: Fri, 24 Oct 2025 15:35:34 +0100
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Sean Christopherson <seanjc@...gle.com>, Nikita Kalyazin
<kalyazin@...zon.co.uk>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "shuah@...nel.org"
<shuah@...nel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"david@...hat.com" <david@...hat.com>, "jthoughton@...gle.com"
<jthoughton@...gle.com>, "patrick.roy@...ux.dev" <patrick.roy@...ux.dev>,
Jack Thomson <jackabt@...zon.co.uk>, Derek Manwaring <derekmn@...zon.com>,
Marco Cali <xmarcalx@...zon.co.uk>, <ackerleytng@...gle.com>, "Vishal
Annapurve" <vannapurve@...gle.com>
Subject: Re: [PATCH v6 1/2] KVM: guest_memfd: add generic population via write
On 23/10/2025 17:07, Sean Christopherson wrote:
> On Mon, Oct 20, 2025, Nikita Kalyazin wrote:
>> From: Nikita Kalyazin <kalyazin@...zon.com>
+ Vishal and Ackerley
>>
>> write syscall populates guest_memfd with user-supplied data in a generic
>> way, ie no vendor-specific preparation is performed. If the request is
>> not page-aligned, the remaining bytes are initialised to 0.
>>
>> write is only supported for non-CoCo setups where guest memory is not
>> hardware-encrypted.
>
> Please include all of the "why". The code mostly communicates the "what", but
> it doesn't capture why write() support is at all interesting, nor does it explain
> why read() isn't supported.
Hi Sean,
Thanks for the review.
Do you think including the explanation from the cover letter would be
sufficient? Shall I additionally say that read() isn't supported
because there is no use case for it as of now or would it be obvious?
>
>> Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
>> ---
>> virt/kvm/guest_memfd.c | 48 ++++++++++++++++++++++++++++++++++++++++++
>
> There's a notable lack of uAPI and Documentation chanegs. I.e. this needs a
> GUEST_MEMFD_FLAG_xxx along with proper documentation.
Would the following be ok in the doc?
When the capability KVM_CAP_GUEST_MEMFD_WRITE is supported, the 'flags'
field
supports GUEST_MEMFD_FLAG_WRITE. Setting this flag on guest_memfd creation
enables write() syscall operations to populate guest_memfd memory from host
userspace.
When a write() operation is performed on a guest_memfd file descriptor
with the
GUEST_MEMFD_FLAG_WRITE set, the syscall will populate the guest memory with
user-supplied data in a generic way, without any vendor-specific
preparation.
The write operation is only supported for non-CoCo (Confidential Computing)
setups where guest memory is not hardware-encrypted. If the write request is
not page-aligned, any remaining bytes within the page are initialized to
zero.
>
> And while it's definitely it's a-ok to land .write() in advance of the direct map
> changes, we do need to at least map out how we want the two to interact, e.g. so
> that we don't end up with constraints that are impossible to satisfy.
>
write() shall not attempt to access a page that is not in the direct
map, which I believe can be achieved via kvm_kmem_gmem_write_begin()
consulting the KVM_GMEM_FOLIO_NO_DIRECT_MAP in folio->private
(introduced in [1]).
Do you think we should mention it in the commit message in some way?
What particular constraint are you cautious about?
[1] https://lore.kernel.org/kvm/20250924152214.7292-2-roypat@amazon.co.uk/
>> 1 file changed, 48 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 94bafd6c558c..f4e218049afa 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -380,6 +380,8 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>>
>> static struct file_operations kvm_gmem_fops = {
>> .mmap = kvm_gmem_mmap,
>> + .llseek = default_llseek,
>> + .write_iter = generic_perform_write,
>> .open = generic_file_open,
>> .release = kvm_gmem_release,
>> .fallocate = kvm_gmem_fallocate,
>> @@ -390,6 +392,49 @@ void kvm_gmem_init(struct module *module)
>> kvm_gmem_fops.owner = module;
>> }
>>
>> +static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
>> + struct address_space *mapping,
>> + loff_t pos, unsigned int len,
>> + struct folio **foliop,
>> + void **fsdata)
>
> Over-aggressive wrapping, this can be
>
>
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
> struct address_space *mapping, loff_t pos,
> unsigned int len, struct folio **folio,
> void **fsdata)
>
> or
>
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
> struct address_space *mapping,
> loff_t pos, unsigned int len,
> struct folio **folio, void **fsdata)
>
> if we want to bundle pos+len.
Ack.
>
>> +{
>> + struct file *file = kiocb->ki_filp;
>
> ki_filp is already a file, and even if it were a "void *", there's no need for a
> local variable.
Ack.
>
>> + struct inode *inode = file_inode(file);
>> + pgoff_t index = pos >> PAGE_SHIFT;
>> + struct folio *folio;
>> +
>> + if (!kvm_gmem_supports_mmap(inode))
>
> Checking for MMAP is neither sufficient nor strictly necessary. MMAP doesn't
> imply SHARED, and it's not clear to me that mmap() support should be in any way
> tied to WRITE support.
As in my reply to the comment about doc, I plan to introduce
KVM_CAP_GUEST_MEMFD_WRITE and GUEST_MEMFD_FLAG_WRITE. The
kvm_arch_supports_gmem_write() will be a weak symbol and relying on
!kvm_arch_has_private_mem() on x86, similar to
kvm_arch_supports_gmem_mmap(). Does it look right?
>
>> + return -ENODEV;
>> +
>> + if (pos + len > i_size_read(inode))
>> + return -EINVAL;
>> +
>> + folio = kvm_gmem_get_folio(inode, index);
>
> Eh, since "index" is only used once, my vote is to use "pos" and do the shift
> here, so that it's obvious that the input to kvm_gmem_get_folio() is being checked.
Ack.
>
>> + if (IS_ERR(folio))
>> + return -EFAULT;
>
> Why EFAULT?
Will propagate the error like you suggest below.
>
>> +
>> + *foliop = folio;
>
> There shouldn't be any need for a local "folio". What about having the "out"
> param be just "folio"?
>
> E.g.
>
> static int kvm_kmem_gmem_write_begin(const struct kiocb *kiocb,
> struct address_space *mapping,
> loff_t pos, unsigned int len,
> struct folio **folio, void **fsdata)
> {
> struct inode *inode = file_inode(kiocb->ki_filp);
>
> if (!kvm_gmem_supports_write(inode))
> return -ENODEV;
>
> if (pos + len > i_size_read(inode))
> return -EINVAL;
>
> *folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT);
> if (IS_ERR(*folio))
> return PTR_ERR(*folio);
>
> return 0;
> }
Ack.
>
>
>> + return 0;
>> +}
>> +
>> +static int kvm_kmem_gmem_write_end(const struct kiocb *kiocb,
>> + struct address_space *mapping,
>> + loff_t pos, unsigned int len,
>> + unsigned int copied,
>> + struct folio *folio, void *fsdata)
>> +{
>> + if (copied && copied < len) {
>
> Why check if "copied" is non-zero? I don't see why KVM should behave differently
> with respect to unwritten bytes if copy_folio_from_iter_atomic() fails on the
> first byte or the Nth byte.
No, I don't think there is a need for this check indeed. It looks like
a leftover from my previous changes.
>
>> + unsigned int from = pos & ((1UL << folio_order(folio)) - 1);
>
> Uh, isn't this just offset_in_folio()?
>
>> +
>> + folio_zero_range(folio, from + copied, len - copied);
>
> I'd probably be in favor of omitting "from" entirely, e.g.
>
> if (copied < len)
> folio_zero_range(folio, offset_in_folio(pos) + copied,
> len - copied);
>
Ack.
>> + }
>> +
>> + folio_unlock(folio);
>> + folio_put(folio);
>> +
>> + return copied;
>> +}
Powered by blists - more mailing lists