linux-kernel - Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <diqzv7q4pg97.fsf@ackerleytng-ctop.c.googlers.com>
Date: Tue, 13 May 2025 10:33:40 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: vannapurve@...gle.com, chenyi.qiang@...el.com, tabba@...gle.com, 
	quic_eberman@...cinc.com, roypat@...zon.co.uk, jgg@...dia.com, 
	peterx@...hat.com, david@...hat.com, rientjes@...gle.com, fvdl@...gle.com, 
	jthoughton@...gle.com, seanjc@...gle.com, pbonzini@...hat.com, 
	zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com, 
	isaku.yamahata@...el.com, muchun.song@...ux.dev, erdemaktas@...gle.com, 
	qperret@...gle.com, jhubbard@...dia.com, willy@...radead.org, 
	shuah@...nel.org, brauner@...nel.org, bfoster@...hat.com, 
	kent.overstreet@...ux.dev, pvorel@...e.cz, rppt@...nel.org, 
	richard.weiyang@...il.com, anup@...infault.org, haibo1.xu@...el.com, 
	ajones@...tanamicro.com, vkuznets@...hat.com, maciej.wieczor-retman@...el.com, 
	pgonda@...gle.com, oliver.upton@...ux.dev, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org, kvm@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct
 HugeTLB page

Yan Zhao <yan.y.zhao@...el.com> writes:

> On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@...el.com> writes:
>> 
>> >> > <snip>
>> >> >
>> >> > What options does userspace have in this scenario?
>> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
>> >> > isn't ideal either.
>> >> >
>> >> > What about something similar as below?
>> >> >
>> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> >> > index d2feacd14786..87c33704a748 100644
>> >> > --- a/virt/kvm/guest_memfd.c
>> >> > +++ b/virt/kvm/guest_memfd.c
>> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>> >> >         }
>> >> >
>> >> >         *pfn = folio_file_pfn(folio, index);
>> >> > -       if (max_order)
>> >> > -               *max_order = folio_order(folio);
>> >> > +       if (max_order) {
>> >> > +               int order;
>> >> > +
>> >> > +               order = folio_order(folio);
>> >> > +
>> >> > +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
>> >> > +                       order--;
>> >> > +
>> >> > +               *max_order = order;
>> >> > +       }
>> >> >
>> >> >         *is_prepared = folio_test_uptodate(folio);
>> >> >         return folio;
>> >> >
>> >> 
>> >> Vishal was wondering how this is working before guest_memfd was
>> >> introduced, for other backing memory like HugeTLB.
>> >> 
>> >> I then poked around and found this [1]. I will be adding a similar check
>> >> for any slot where kvm_slot_can_be_private(slot).
>> >>
>> >> Yan, that should work, right?
>> > No, I don't think the checking of ugfn [1] should work.
>> >
>> > 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory
>> > are allocated from guest_memfd), the slot->userspace_addr does not necessarily
>> > have the same offset as slot->gmem.pgoff. Even if we audit the offset in
>> > kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
>> > slot->userspace_addr to point to a different offset.
>> >
>> > 2. for slots bound to guest_memfd that do not support in-place-conversion,
>> > shared memory is allocated from a different backend. Therefore, checking
>> > "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
>> > currently absent because guest_memfd supports 4K only.
>> >
>> >
>> 
>> Let me clarify, I meant these changes:
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 4b64ab3..d0dccf1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
>>         return 0;
>>  }
>>  
>> +static inline bool kvm_is_level_aligned(u64 value, int level)
>> +{
>> +       return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
>> +}
>> +
>>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>>                                       struct kvm_memory_slot *slot)
>>  {
>> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>>  
>>                 slot->arch.lpage_info[i - 1] = linfo;
>>  
>> -               if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
>> +               if (!kvm_is_level_aligned(slot->base_gfn, level))
>>                         linfo[0].disallow_lpage = 1;
>> -               if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>> +               if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
>>                         linfo[lpages - 1].disallow_lpage = 1;
>>                 ugfn = slot->userspace_addr >> PAGE_SHIFT;
>>                 /*
>> -                * If the gfn and userspace address are not aligned wrt each
>> -                * other, disable large page support for this slot.
>> +                * If the gfn and userspace address are not aligned or if gfn
>> +                * and guest_memfd offset are not aligned wrt each other,
>> +                * disable large page support for this slot.
>>                  */
>> -               if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
>> +               if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
>> +                   (kvm_slot_can_be_private(slot) &&
>> +                    !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
>> +                                          level))) {
>>                         unsigned long j;
>>  
>>                         for (j = 0; j < lpages; ++j)
>> 
>> This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
> In the case of shared memory is not allocated from guest_memfd, (e.g. with the
> current upstream code), the checking of gmem.pgoff here will disallow huge page
> of shared memory even if "slot->base_gfn ^ ugfn" is aligned.
>

Thanks, I get it now. What you mean is that the memslot could have been
set up such that

+ slot->userspace_addr is aligned with slot->base_gfn, to be used for
  shared memory, and 
+ slot->gmem.pgoff is not aligned with slot->base_gfn, to be used for
  private memory

and this check would disallow huge page mappings even though this
memslot was going to only be used for shared memory.

The only way to fix this would indeed be a runtime check, since the
shared/private status can change at runtime.

I think it is okay that this check is stricter than necessary, since it
just results in mapping without huge pages.

What do you think?

>> I think this should take care of case (1.), for guest_memfds going to be
>> used for both shared and private memory. Userspace can't update
>> slot->userspace_addr, since guest_memfd memslots cannot be updated and
>> can only be deleted.
>> 
>> If userspace re-uses slot->userspace_addr for some other memory address
>> without deleting and re-adding a memslot,
>> 
>> + KVM's access to memory should still be fine, since after the recent
>>   discussion at guest_memfd upstream call, KVM's guest faults will
>>   always go via fd+offset and KVM's access won't be disrupted
>>   there. Whatever checking done at memslot binding time will still be
>>   valid.
> Could the offset of shared memory and offset of private memory be different if
> userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?
>

They could be different, yes. I think what you mean is if userspace does
something like

addr = mmap(guest_memfd);
ioctl(KVM_SET_USER_MEMORY_REGION, addr, guest_memfd);
munmap(addr);
addr = mmap(addr, other_fd);
(with no second call to KVM_SET_USER_MEMORY_REGION)

Without guest_memfd, when munmap() happens, KVM should get a
notification via mmu_notifiers. That will unmap the pages from guest
page tables. At the next fault, host page tables will be consulted to
determine max_mapping_level, and at that time the mapping level would be
the new mapping level in host page tables.

> Then though the two offsets are validated as equal in kvm_gmem_bind(), they may
> differ later on.
>

This is true.

Continuing from above, with guest_memfd, no issues if guest_memfd is
only used for private memory, since shared memory uses the same
mechanism as before guest_memfd.

If guest_memfd is used for both private and shared memory, on unmapping,
KVM will also get notified via mmu_notifiers. On the next fault, the
mapping level is determined as follows (I have a patch coming up that
will illustrate this better)

1. guest_memfd will return 4K since this is a shared folio and shared
   folios are always split to 4K. But suppose in future guest_memfd
   supports shared folios at higher levels, say 1G, we continue...
2. lpage info (not updated since userspace swapped out addr) will say
   map at 1G
3. Since this is a shared fault, we check host page tables, which would
   say 4K since there was a munmap() and mmap().

I think it should still work as expected.

>> + Host's access and other accesses (e.g. instruction emulation, which
>>   uses slot->userspace_addr) to guest memory will be broken, but I think
>>   there's nothing protecting against that. The same breakage would
>>   happen for non-guest_memfd memslot.
> Why is host access broken in non-guest_memfd case?
> The HVA is still a valid one in QEMU's mmap-ed address space.
>

I was thinking that if a guest was executing code and the code gets
swapped out from under its feet by replacing the memory pointed to by
addr, the guest would be broken.

Now that I think about it again, it could be a valid use case. You're
right, thanks for pointing this out!

>> p.s. I will be adding the validation as you suggested [1], though that
>> shouldn't make a difference here, since the above check directly
>> validates against gmem.pgoff.
>> 
>> Regarding 2., checking this checks against gmem.pgoff and should handle
>> that as well.
>> 
>> [1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
>> 
>> I prefer checking at binding time because it aligns with the ugfn check
>> that is already there, and avoids having to check at every fault.
>> 
>> >> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
>> >> 
>> >> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> >> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> >> >> issue.
>> >> >> >> 
>> >> >> >> lpage_info will make sure that ranges near the bounds will be
>> >> >> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> >> >> hugepages.
>> >> >> >> 
>> >> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg