linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting


Message-ID: <diqzjz0yfk06.fsf@google.com>
Date: Mon, 13 Oct 2025 16:40:41 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Yan Zhao <yan.y.zhao@...el.com>, Fuad Tabba <tabba@...gle.com>, 
	Binbin Wu <binbin.wu@...ux.intel.com>, Michael Roth <michael.roth@....com>, 
	Ira Weiny <ira.weiny@...el.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, David Hildenbrand <david@...hat.com>, 
	Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

Ackerley Tng <ackerleytng@...gle.com> writes:

> 
> [...snip...]
> 
>>> >> > The kvm_memory_attributes structure is compatible, all that's needed AFAICT is a
>>> >> > union to clarify it's a pgoff instead of an address when used for guest_memfd.
>>> >> >
>>> >> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> >> > index 52f6000ab020..e0d8255ac8d2 100644
>>> >> > --- a/include/uapi/linux/kvm.h
>>> >> > +++ b/include/uapi/linux/kvm.h
>>> >> > @@ -1590,7 +1590,10 @@ struct kvm_stats_desc {
>>> >> >  #define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
>>> >> >  
>>> >> >  struct kvm_memory_attributes {
>>> >> > -       __u64 address;
>>> >> > +       union {
>>> >> > +               __u64 address;
>>> >> > +               __u64 offset;
>>> >> > +       };
>>> >> >         __u64 size;
>>> >> >         __u64 attributes;
>>> >> >         __u64 flags;
>>> >> >
>>> >> 
>>> >> struct kvm_memory_attributes doesn't have room for reporting the offset
>>> >> at which conversion failed (error_offset in the new struct). How do we
>>> >> handle this? Do we reuse the flags field, or do we not report
>>> >> error_offset?
>>> >
>>> > Write back at address/offset
>>> 
>>> I think it might be surprising to the userspace program, when it wants
>>> to check the offset that it had requested and found that it changed due
>>> to an error, or upon decoding the error, be unable to find the original
>>> offset it had requested.
>>
>> It's a somewhat common pattern in the kernel.  Updating the offset+size is most
>> often used with -EAGAIN to say "got this far, try the syscall again from this
>> point".
>>
>
> TIL, thanks!
>
>>> Like,
>>> 
>>>     printf("Error during conversion from offset=%lx with size=%lx, at
>>>            error_offset=%lx", attr.offset, attr.size, attr.error_offset)
>>> 
>>> would be nicer than 
>>> 
>>>     original_offset = attr.offset
>>>     printf("Error during conversion from offset=%lx with size=%lx, at
>>>            error_offset=%lx", original_offset, attr.size, attr.error_offset)
>>>            
>>> > (and update size too, which I probably forgot to do).
>>> 
>>> Why does size need to be updated? I think u64 for size is great, and
>>> size is better than nr_pages since nr_pages differs on different
>>> platforms based on PAGE_SIZE and also nr_pages introduces the question
>>> of "was it hugetlb, or a native page size?".
>>
>> I meant update the number of bytes remaining when updating the offset so that
>> userspace can redo the ioctl without having to update parameters.
>>

I was working through this again, and I think the attr.offset returned
from the conversion ioctl is not the same as in other syscalls, where an
updated offset+size indicates "got this far, try the syscall again from
this point".

For the conversion ioctl, -EAGAIN indicates that an unexpected refcount
was first found at offset error_offset, but does not imply that
everything up to error_offset has been converted.

This arises when we start to have hugepage support. To restructure
hugepage-by-hugepage, we will iterate hugepage-wise and check for
elevated refcounts.

Suppose we're converting ten 1G hugepages and, in the 3rd hugepage, the
page at the 5th offset has an elevated refcount.

error_offset should be set to the 5th offset in the 3rd hugepage, but
userspace should retry beginning at the offset of the 3rd hugepage with
size 8G.

If the offset returned to userspace is that of the 3rd hugepage, then we
lose precision. The refcount at the 3rd hugepage could be fine and
expected - it is the page at the 5th offset in the 3rd hugepage that is
pinned and that userspace should unpin.

So perhaps the interface needs to be defined as

If the error is -EAGAIN:
   + offset: the offset to retry from
   + size: the remaining size to retry
   + error_offset: the offset where an unexpected refcount was found
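
To make the example above concrete, here is a rough sketch of how the
three values could relate to each other. This is only an illustration in
plain C, not kernel code: struct conv_result and eagain_result() are
hypothetical names, the request is assumed to start hugepage-aligned, and
"the 5th offset" is assumed to mean the 5th 4K page within the hugepage.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_1G (1ULL << 30)

/* Hypothetical layout mirroring the three fields proposed above. */
struct conv_result {
	uint64_t offset;       /* offset to retry from */
	uint64_t size;         /* remaining size to retry */
	uint64_t error_offset; /* where the unexpected refcount was found */
};

/*
 * Hypothetical helper: compute what would be reported on -EAGAIN when
 * conversion proceeds hugepage-by-hugepage and the page at pinned_offset
 * has an elevated refcount.
 *
 * Assumes req_offset is aligned to hpage_size and that pinned_offset
 * lies within [req_offset, req_offset + req_size).
 */
static struct conv_result eagain_result(uint64_t req_offset, uint64_t req_size,
					uint64_t hpage_size,
					uint64_t pinned_offset)
{
	struct conv_result r;

	assert(hpage_size != 0);
	assert(req_offset % hpage_size == 0);
	assert(pinned_offset >= req_offset);
	assert(pinned_offset - req_offset < req_size);

	/* Retry from the start of the hugepage containing the pinned page. */
	r.offset = pinned_offset - (pinned_offset % hpage_size);
	/* Everything from that hugepage to the end of the request remains. */
	r.size = req_offset + req_size - r.offset;
	/* Report the precise page so userspace knows what to unpin. */
	r.error_offset = pinned_offset;

	return r;
}
```

With a 10G request at offset 0 and the pinned page at 2G + 4 * 4K, this
gives offset = 2G, size = 8G and error_offset = 2G + 16K, matching the
numbers in the example.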

>>> 
>>> [...snip...]
>>>