Message-ID: <aRbHtnMcoqM1gmL9@yzhao56-desk.sh.intel.com>
Date: Fri, 14 Nov 2025 14:09:58 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "michael.roth@....com"
<michael.roth@....com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kas@...nel.org" <kas@...nel.org>, "Weiny, Ira"
<ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>, "Yamahata,
Isaku" <isaku.yamahata@...el.com>, "Annapurve, Vishal"
<vannapurve@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce
kvm_split_cross_boundary_leafs()
On Thu, Nov 13, 2025 at 07:02:59PM +0800, Huang, Kai wrote:
> On Thu, 2025-11-13 at 16:54 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 06:42:55PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:43 +0800, Yan Zhao wrote:
> > > > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > struct kvm_mmu_page *root,
> > > > gfn_t start, gfn_t end,
> > > > - int target_level, bool shared)
> > > > + int target_level, bool shared,
> > > > + bool only_cross_bounday, bool *flush)
> > > > {
> > > > struct kvm_mmu_page *sp = NULL;
> > > > struct tdp_iter iter;
> > > > @@ -1589,6 +1596,13 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > > > * level into one lower level. For example, if we encounter a 1GB page
> > > > * we split it into 512 2MB pages.
> > > > *
> > > > + * When only_cross_bounday is true, just split huge pages above the
> > > > + * target level into one lower level if the huge pages cross the start
> > > > + * or end boundary.
> > > > + *
> > > > + * No need to update @flush for !only_cross_bounday cases, which rely
> > > > + * on the callers to do the TLB flush in the end.
> > > > + *
> > >
> > > s/only_cross_bounday/only_cross_boundary
> > >
> > > From tdp_mmu_split_huge_pages_root()'s perspective, it's quite odd to only
> > > update 'flush' when 'only_cross_bounday' is true, because
> > > 'only_cross_bounday' can only results in less splitting.
> > I have to say it's a reasonable point.
> >
> > > I understand this is because splitting S-EPT mapping needs flush (at least
> > > before non-block DEMOTE is implemented?). Would it better to also let the
> > Actually the flush is only required for !TDX cases.
> >
> > For TDX, either the flush has been performed internally within
> > tdx_sept_split_private_spt()
> >
>
> AFAICT tdx_sept_split_private_spt() only does tdh_mem_track(), so KVM should
> still kick all vCPUs out of guest mode so other vCPUs can actually flush the
> TLB?
tdx_sept_split_private_spt() actually invokes tdx_track(), which kicks all
vCPUs out of guest mode by invoking
"kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE)".
> > or the flush is not required for future non-block
> > DEMOTE. So, the flush in KVM core on the mirror root may be skipped as a future
> > optimization for TDX if necessary.
> >
> > > caller to decide whether TLB flush is needed? E.g., we can make
> > > tdp_mmu_split_huge_pages_root() return whether any split has been done or
> > > not. I think this should also work?
> > Do you mean just skipping the changes in the below "Hunk 1"?
> >
> > Since tdp_mmu_split_huge_pages_root() originally did not do flush by itself,
> > which relied on the end callers (i.e.,kvm_mmu_slot_apply_flags(),
> > kvm_clear_dirty_log_protect(), and kvm_get_dirty_log_protect()) to do the flush
> > unconditionally, tdp_mmu_split_huge_pages_root() previously did not return
> > whether any split has been done or not.
>
> Right. But making it return any split has been done doesn't harm.
>
> >
> > So, if we want callers of kvm_split_cross_boundary_leafs() to do flush only
> > after splitting occurs, we have to return whether flush is required.
>
> But assuming we always return whether "split has been done", the caller can also
> effectively know whether the flush is needed.
>
> >
> > Then, in this patch, seems only the changes in "Hunk 1" can be dropped.
>
> I am thinking dropping both "Hunk 1" and "Hunk 3". This at least makes
> kvm_split_cross_boundary_leafs() more reasonable, IMHO.
>
> Something like below:
>
> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct
> tdp_iter *iter,
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> struct kvm_mmu_page *root,
> gfn_t start, gfn_t end,
> - int target_level, bool shared)
> + int target_level, bool shared,
> + bool only_cross_boundary,
> + bool *split)
> {
> struct kvm_mmu_page *sp = NULL;
> struct tdp_iter iter;
> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_large_pte(iter.old_spte))
> continue;
>
> + if (only_cross_boundary && !iter_cross_boundary(&iter, start,
> end))
> + continue;
> +
> if (!sp) {
> rcu_read_unlock();
>
> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> goto retry;
>
> sp = NULL;
> + *split = true;
> }
>
> rcu_read_unlock();
This looks more reasonable for tdp_mmu_split_huge_pages_root().
Given that splitting only adds a new page to the paging structure (unlike page
merging), I can't think of any current use case that would be broken by the
lack of a TLB flush before tdp_mmu_iter_cond_resched() releases the mmu_lock.
This is because:
1) if the split is triggered in a fault path, the hardware shouldn't have cached
   the old huge translation.
2) if the split is triggered in a zap or convert path,
   - there shouldn't be concurrent faults on the range due to the protection of
     mmu_invalidate_range*.
   - for concurrent splits on the same range, though the other vCPUs may
     temporarily see stale huge TLB entries after they believe they have
     performed a split, they will be kicked to flush the TLB soon after
     tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
     This should be acceptable since I don't see any special guest needs that
     rely on pure splits.
So I tend to agree with your suggestion, though the implementation in this patch
is safer.
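
To make the caller-side flow concrete, here is a rough userspace model of the
idea, not kernel code. Everything in it is a hypothetical stand-in: struct leaf,
leaf_cross_boundary() (my assumption of what iter_cross_boundary() checks),
split_huge_leafs() (modeling tdp_mmu_split_huge_pages_root() with the "split"
out-parameter) and flush_count (modeling a remote TLB flush). The only point is
that the split walk reports whether anything was split and the caller alone
decides whether to flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for a huge leaf covering [base, base + npages). */
struct leaf {
	gfn_t base;
	gfn_t npages;
	bool was_split;
};

/* Counts modeled remote TLB flushes; stands in for the real flush call. */
static int flush_count;

static bool leaf_overlaps(const struct leaf *l, gfn_t start, gfn_t end)
{
	return l->base < end && l->base + l->npages > start;
}

/*
 * Assumed semantics: a leaf crosses a boundary when it overlaps
 * [start, end) without being fully contained in it.
 */
static bool leaf_cross_boundary(const struct leaf *l, gfn_t start, gfn_t end)
{
	bool contained = l->base >= start && l->base + l->npages <= end;

	return leaf_overlaps(l, start, end) && !contained;
}

/* Model of the split walk: report via *split whether anything was split. */
static int split_huge_leafs(struct leaf *leafs, size_t nr, gfn_t start,
			    gfn_t end, bool only_cross_boundary, bool *split)
{
	size_t i;

	assert(start < end && split != NULL);

	for (i = 0; i < nr; i++) {
		if (leafs[i].was_split || !leaf_overlaps(&leafs[i], start, end))
			continue;

		if (only_cross_boundary &&
		    !leaf_cross_boundary(&leafs[i], start, end))
			continue;

		leafs[i].was_split = true;
		*split = true;
	}

	return 0;
}

/* Model of the caller: flush only if the walk reported a split. */
static int split_cross_boundary_leafs(struct leaf *leafs, size_t nr,
				      gfn_t start, gfn_t end)
{
	bool split = false;
	int ret = split_huge_leafs(leafs, nr, start, end, true, &split);

	if (split)
		flush_count++;

	return ret;
}
```

With two 2MB leafs at gfn 0 and gfn 512, a request for [256, 1024) splits only
the first leaf and triggers one modeled flush, while a request for [512, 1024)
splits nothing and triggers none.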
> Btw, I have to follow up this next week, since tomorrow is public holiday.
NP.