Message-ID: <aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com>
Date: Mon, 23 Jun 2025 17:27:34 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
	<david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>,
	"Du, Fan" <fan.du@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Weiny, Ira" <ira.weiny@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"michael.roth@....com" <michael.roth@....com>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
	<jun.miao@...el.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Wed, Jun 18, 2025 at 08:41:38AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-18 at 08:19 +0800, Yan Zhao wrote:
> > > I don't think a potential bug in KVM is a good enough reason. If we are
> > > concerned, can we think about a warning instead?
> > > 
> > > We had talked in the past about enhancing KASAN to know when a page is
> > > mapped into the S-EPT. So rather than design around potential bugs, we
> > > could focus on having a simpler implementation with the infrastructure
> > > to catch and fix the bugs.
> > However, if failing to remove a guest private page would only cause a
> > memory leak, that's fine.
> > If TDX does not hold any refcount, guest_memfd has to know which private
> > pages are still mapped. Otherwise, a page may be re-assigned to other
> > kernel components while it is still mapped in the S-EPT.
> 
> KASAN detects use-after-frees like that. However, the TDX module code is not
> instrumented. It won't check against the KASAN state for its accesses.
> 
> I had a brief chat about this with Dave and Kirill. A couple ideas were
> discussed. One was to use page_ext to keep a flag that says the page is in-use
Thanks!

To use page_ext, should we introduce a new flag PAGE_EXT_FIRMWARE_IN_USE,
similar to PAGE_EXT_YOUNG?

Due to issues similar to those with normal page/folio flags (see the next
comment for details), TDX needs to set PAGE_EXT_FIRMWARE_IN_USE on a
page-by-page basis rather than folio-by-folio.

Additionally, doesn't it seem reasonable for guest_memfd not to copy the
PAGE_EXT_FIRMWARE_IN_USE flag when splitting a huge folio?
(In __folio_split() --> split_folio_to_order(), PAGE_EXT_YOUNG and
PAGE_EXT_IDLE are copied to the new folios, though.)

Furthermore, page_ext uses extra memory. With CONFIG_64BIT, should we instead
introduce a PG_firmware_in_use page flag, similar to PG_young and PG_idle?
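
To make the page_ext option concrete, below is a minimal sketch, modeled on
the PAGE_EXT_YOUNG/page_idle pattern (page_ext_get()/page_ext_put() plus a
bit in page_ext->flags). PAGE_EXT_FIRMWARE_IN_USE and the helper names are
hypothetical; the new value would extend the existing enum page_ext_flags.

/* Hypothetical sketch only; none of these names exist today. */
enum page_ext_flags {
	/* ... existing flags (PAGE_EXT_OWNER, PAGE_EXT_YOUNG, ...) ... */
	PAGE_EXT_FIRMWARE_IN_USE,
};

/*
 * Called page-by-page by the SEAMCALL wrappers when handing a 4KB page to
 * the TDX module.
 */
static inline void page_set_firmware_in_use(struct page *page)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;
	set_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
	page_ext_put(page_ext);
}

/*
 * Called page-by-page after a successful tdh_mem_page_remove() or
 * tdh_phymem_page_reclaim().
 */
static inline void page_clear_firmware_in_use(struct page *page)
{
	struct page_ext *page_ext = page_ext_get(page);

	if (!page_ext)
		return;
	clear_bit(PAGE_EXT_FIRMWARE_IN_USE, &page_ext->flags);
	page_ext_put(page_ext);
}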

> by the TDX module. There was also some discussion of using a normal page flag,
> and that the reserved page flag might prevent some of the MM operations that
> would be needed on guestmemfd pages. I didn't see the problem when I looked.
> 
> For the solution, basically the SEAMCALL wrappers set a flag when they hand a
> page to the TDX module, and clear it when they successfully reclaim it via
> tdh_mem_page_remove() or tdh_phymem_page_reclaim(). Then if the page makes it
> back to the page allocator, a warning is generated.
After some testing: to use a normal page flag, we may need to set it on a
page-by-page basis rather than folio-by-folio (see Scheme 1 for why folio
granularity fails). And guest_memfd may need to selectively copy page flags
when splitting huge folios (see Scheme 2).

Scheme 1: Set/unset the page flag on a folio-by-folio basis, i.e.
        - set the folio reserved at tdh_mem_page_aug(), tdh_mem_page_add(),
        - unset the folio reserved after a successful tdh_mem_page_remove()
          or tdh_phymem_page_reclaim().

        It has a problem in the following scenario (a rough sketch of the
        set/clear sites follows the scenario):
        1. tdh_mem_page_aug() adds a 2MB folio and marks the folio as
           reserved via "folio_set_reserved(page_folio(page))".

        2. Convert a 4KB page of the 2MB folio to shared.
        2.1 tdh_mem_page_demote() is executed first.

        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "folio_clear_reserved(page_folio(page))" clears the reserved
            flag for the whole 2MB folio while the remaining 511 pages are
            still mapped in the S-EPT.

        2.3 guest_memfd splits the 2MB folio into 512 4KB folios.
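
        A rough sketch of Scheme 1's set/clear sites, for illustration only
        (the helper names are made up; folio_set_reserved() /
        folio_clear_reserved() are used as in the scenario above):

        /* Scheme 1 sketch: the flag granularity is the whole folio. */
        static void tdx_mark_folio_in_use(struct page *page)
        {
                /* From tdh_mem_page_aug()/tdh_mem_page_add(). */
                folio_set_reserved(page_folio(page));
        }

        static void tdx_clear_folio_in_use(struct page *page)
        {
                /*
                 * From tdh_mem_page_remove()/tdh_phymem_page_reclaim().
                 * After a demote, removing a single 4KB page clears the
                 * flag for the whole 2MB folio even though the other 511
                 * pages are still mapped in the S-EPT -- the step 2.2
                 * problem above.
                 */
                folio_clear_reserved(page_folio(page));
        }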


Scheme 2: Set/unset the page flag on a page-by-page basis, i.e.
        - set the page flag reserved at tdh_mem_page_aug(),
          tdh_mem_page_add(),
        - unset the page flag reserved after a successful
          tdh_mem_page_remove() or tdh_phymem_page_reclaim().

        It has a problem in the following scenario:
        1. tdh_mem_page_aug() adds a 2MB folio and marks each page as
           reserved by invoking "SetPageReserved()" on it.
           Since folio->flags is page[0]->flags, the folio's reserved flag
           is also set.

        2. Convert a 4KB page of the 2MB folio to shared; say it's page[4].
        2.1 tdh_mem_page_demote() is executed first.

        2.2 tdh_mem_page_remove() then removes the 4KB mapping.
            "ClearPageReserved()" clears the reserved flag of page[4] of
            the 2MB folio.

        2.3 guest_memfd splits the 2MB folio into 512 4KB folios.
            In guestmem_hugetlb_split_folio(), "p->flags = folio->flags"
            marks page[4] as reserved again, since page[0] is still
            reserved.

            (see the code in https://lore.kernel.org/all/2ae41e0d80339da2b57011622ac2288fed65cd01.1747264138.git.ackerleytng@google.com/
            for (i = 1; i < orig_nr_pages; ++i) {
                struct page *p = folio_page(folio, i);

                /* Copy flags from the first page to split pages. */
                p->flags = folio->flags;

                p->mapping = NULL;
                clear_compound_head(p);
            }
            )

I'm not sure if "p->flags = folio->flags" can be removed. Currently, flags
like PG_unevictable are preserved via this step.
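
If the copy has to stay, one hypothetical alternative is to copy flags
through an explicit mask so that page[0]'s PG_reserved is never propagated
to the split pages. GUESTMEM_SPLIT_FLAGS_MASK is a made-up name, and the
exact set of flags to preserve would need auditing:

/* Hypothetical replacement for "p->flags = folio->flags". */
#define GUESTMEM_SPLIT_FLAGS_MASK	(1UL << PG_unevictable)

for (i = 1; i < orig_nr_pages; ++i) {
	struct page *p = folio_page(folio, i);

	/*
	 * Keep all of p's own bits (zone/node/section and its own
	 * PG_reserved) and inherit only the audited flags from page[0].
	 */
	p->flags = (p->flags & ~GUESTMEM_SPLIT_FLAGS_MASK) |
		   (folio->flags & GUESTMEM_SPLIT_FLAGS_MASK);

	p->mapping = NULL;
	clear_compound_head(p);
}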

If we selectively copy flags, we may also need the following changes to
prevent a still-reserved page from becoming available to the page allocator.
Otherwise, the "HugePages_Free" count will not decrease, and the same huge
folio will continue to be recycled (i.e., allocated to and consumed by other
VMs).

diff --git a/mm/swap.c b/mm/swap.c
index 2747230ced89..72d8c53e2321 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -98,8 +98,36 @@ static void page_cache_release(struct folio *folio)
                unlock_page_lruvec_irqrestore(lruvec, flags);
 }

+static inline bool folio_is_reserved(struct folio *folio)
+{
+       long nr_pages = folio_nr_pages(folio);
+       long i;
+
+       for (i = 0; i < nr_pages; i++) {
+               if (PageReserved(folio_page(folio, i)))
+                       return true;
+       }
+       }
+
+       return false;
+}
+
 static void free_typed_folio(struct folio *folio)
 {
@@ -118,6 +146,13 @@ static void free_typed_folio(struct folio *folio)

 void __folio_put(struct folio *folio)
 {
+       if (folio_is_reserved(folio)) {
+               VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+               return;
+       }
+
        if (unlikely(folio_is_zone_device(folio))) {
                free_zone_device_folio(folio);
                return;
@@ -986,6 +1021,12 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                if (!folio_ref_sub_and_test(folio, nr_refs))
                        continue;

+               if (folio_is_reserved(folio)) {
+                       VM_WARN_ON_FOLIO(folio_is_reserved(folio), folio);
+                       continue;
+               }
+
                if (unlikely(folio_has_type(folio))) {
                        /* typed folios have their own memcg, if any */
                        if (lruvec) {


Besides, guest_memfd needs to reject conversion to shared while a page is
still mapped in the S-EPT.

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d71653e7e51e..6449151a3a69 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -553,6 +553,44 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_has_invalid_folio(struct address_space *mapping, pgoff_t start,
+                                       size_t nr_pages)
+{
+       pgoff_t index = start, end = start + nr_pages;
+       bool ret = false;
+
+       while (index < end) {
+               struct folio *f;
+               long i = 0, nr;
+
+               f = filemap_get_folio(mapping, index);
+               if (IS_ERR(f)) {
+                       /* No folio at this index; skip it rather than spin forever. */
+                       index++;
+                       continue;
+               }
+
+               if (f->index < start)
+                       i = start - f->index;
+
+               nr = folio_nr_pages(f);
+               if (f->index + folio_nr_pages(f) > end)
+                       nr -= f->index + folio_nr_pages(f) - end;
+
+               for (; i < nr; i++) {
+                       if (PageReserved(folio_page(f, i))) {
+                               ret = true;
+                               folio_put(f);
+                               goto out;
+                       }
+               }
+               index += folio_nr_pages(f);
+               folio_put(f);
+       }
+out:
+       return ret;
+}
+
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -572,6 +610,11 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                        if (ret)
                                return ret;
                        kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
+
+                       if (kvm_gmem_has_invalid_folio(inode->i_mapping, work->start,
+                                                      work->nr_pages))
+                               ret = -EFAULT;
+
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,



> Also it was mentioned that SGX did have a similar issue to what is being worried
> about here:
> https://lore.kernel.org/linux-sgx/aCYey1W6i7i3yPLL@gmail.com/T/#m86c8c4cf0e6b9a653bf0709a22bb360034a24d95
> 
> > 
> > 
> > > > 
> > > > > > 
> > > > > > This would allow guest_memfd to maintain an internal reference
> > > > > > count for each private GFN. TDX would call
> > > > > > guest_memfd_add_page_ref_count() for mapping and
> > > > > > guest_memfd_dec_page_ref_count() after a successful unmapping.
> > > > > > Before truncating a private page from the filemap, guest_memfd
> > > > > > could increase the real folio reference count based on its
> > > > > > internal reference count for the private GFN.
> > > > > 
> > > > > What does this get us exactly? This is the argument to have less
> > > > > error-prone code that can survive forgetting to refcount on error?
> > > > > I don't see that it is an especially special case.
> > > > Yes, for less error-prone code.
> > > > 
> > > > If this approach is considered too complex for an initial
> > > > implementation, using tdx_hold_page_on_error() is also a viable
> > > > option.
> > > 
> > > I'm saying I don't think it's a good enough reason. Why is it different
> > > than other use-after-free bugs? I feel like I'm missing something.
> > With tdx_hold_page_on_error(), it could be implemented as: on removal
> > failure, invoke a guest_memfd interface to let guest_memfd know the exact
> > ranges still in use by the TDX module due to unmapping failures.
> > Do you think that's ok?
> 
> Either way is ok to me. It seems like we have three ok solutions. But the tone
> of the thread is that we are solving some deep problem. Maybe I'm missing
> something.
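
FWIW, the guest_memfd notification interface mentioned above might have a
shape like the following (purely hypothetical; neither the name nor the
signature exists today):

/*
 * Hypothetical: called by TDX when tdh_mem_page_remove() or
 * tdh_phymem_page_reclaim() fails, so that guest_memfd knows the range is
 * still mapped in the S-EPT and must not be freed or reused.
 */
int guest_memfd_mark_range_in_use(struct inode *inode, pgoff_t start,
				  pgoff_t nr_pages);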
