linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGtprH_ggm8N-R9QbV1f8mo8-cQkqyEta3W=h2jry-NRD7_6OA@mail.gmail.com>
Date: Tue, 29 Apr 2025 06:46:59 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com, 
	ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Mon, Apr 28, 2025 at 05:17:16PM -0700, Vishal Annapurve wrote:
> > On Wed, Apr 23, 2025 at 8:07 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > Increase folio ref count before mapping a private page, and decrease
> > > folio ref count after a mapping failure or successfully removing a private
> > > page.
> > >
> > > The folio ref count to inc/dec corresponds to the mapping/unmapping level,
> > > ensuring the folio ref count remains balanced after entry splitting or
> > > merging.
> > >
> > > Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > > ---
> > >  arch/x86/kvm/vmx/tdx.c | 19 ++++++++++---------
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 355b21fc169f..e23dce59fc72 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1501,9 +1501,9 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> > >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> > >  }
> > >
> > > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > > +static void tdx_unpin(struct kvm *kvm, struct page *page, int level)
> > >  {
> > > -       put_page(page);
> > > +       folio_put_refs(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >  }
> > >
> > >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > > @@ -1517,13 +1517,13 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > >
> > >         err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > >         if (unlikely(tdx_operand_busy(err))) {
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EBUSY;
> > >         }
> > >
> > >         if (KVM_BUG_ON(err, kvm)) {
> > >                 pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return -EIO;
> > >         }
> > >
> > > @@ -1570,10 +1570,11 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >          * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > >          * migration.  Until guest_memfd supports page migration, prevent page
> > >          * migration.
> > > -        * TODO: Once guest_memfd introduces callback on page migration,
> > > -        * implement it and remove get_page/put_page().
> > > +        * TODO: To support in-place-conversion in gmem in futre, remove
> > > +        * folio_ref_add()/folio_put_refs().
> >
> > With necessary infrastructure in guest_memfd [1] to prevent page
> > migration, is it necessary to acquire extra folio refcounts? If not,
> > why not just cleanup this logic now?
> Though the old comment says acquiring the lock is for page migration, the other
> reason is to prevent the folio from being returned to the OS until it has been
> successfully removed from TDX.
>
> If there's an error during the removal or reclaiming of a folio from TDX, such
> as a failure in tdh_mem_page_remove()/tdh_phymem_page_wbinvd_hkid() or
> tdh_phymem_page_reclaim(), it is important to hold the page refcount within TDX.
>
> So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> folio_ref_add() in the event of a removal failure.

In my opinion, the above scheme can be deployed with this series
itself. guest_memfd will not take away memory from TDX VMs without an
invalidation. folio_ref_add() will not work for memory not backed by
page structs, but that problem can be solved in future possibly by
notifying guest_memfd of certain ranges being in use even after
invalidation completes.


>
> > [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/virt/kvm/guest_memfd.c?h=kvm-coco-queue#n441
> >
> > > Only increase the folio ref count
> > > +        * when there're errors during removing private pages.
> > >          */
> > > -       get_page(page);
> > > +       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
> > >
> > >         /*
> > >          * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > @@ -1647,7 +1648,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > >                 return -EIO;
> > >
> > >         tdx_clear_page(page, level);
> > > -       tdx_unpin(kvm, page);
> > > +       tdx_unpin(kvm, page, level);
> > >         return 0;
> > >  }
> > >
> > > @@ -1727,7 +1728,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > >         if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> > >             !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> > >                 atomic64_dec(&kvm_tdx->nr_premapped);
> > > -               tdx_unpin(kvm, page);
> > > +               tdx_unpin(kvm, page, level);
> > >                 return 0;
> > >         }
> > >
> > > --
> > > 2.43.2
> > >