[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240529015749.GF386318@ls.amr.corp.intel.com>
Date: Tue, 28 May 2024 18:57:49 -0700
From: Isaku Yamahata <isaku.yamahata@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"seanjc@...gle.com" <seanjc@...gle.com>,
"sagis@...gle.com" <sagis@...gle.com>,
"dmatlack@...gle.com" <dmatlack@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>
Subject: Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for
TDP MMU
On Tue, May 28, 2024 at 11:06:45PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > - u64 old_spte, u64 new_spte, int level,
> > - bool shared)
> > + u64 old_spte, u64 new_spte,
> > + union kvm_mmu_page_role role, bool shared)
> > {
> > + bool is_private = kvm_mmu_page_role_is_private(role);
> > + int level = role.level;
> > bool was_present = is_shadow_present_pte(old_spte);
> > bool is_present = is_shadow_present_pte(new_spte);
> > bool was_leaf = was_present && is_last_spte(old_spte, level);
> > bool is_leaf = is_present && is_last_spte(new_spte, level);
> > - bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> > + kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > + kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> > + bool pfn_changed = old_pfn != new_pfn;
> >
> > WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> > WARN_ON_ONCE(level < PG_LEVEL_4K);
> > @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> >
> > if (was_leaf && is_dirty_spte(old_spte) &&
> > (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> > - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > + kvm_set_pfn_dirty(old_pfn);
> >
> > /*
> > * Recursively handle child PTs if the change removed a subtree from
> > @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> > * pages are kernel allocations and should never be migrated.
> > */
> > if (was_present && !was_leaf &&
> > - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> > + KVM_BUG_ON(is_private !=
> > is_private_sptep(spte_to_child_pt(old_spte, level)),
> > + kvm);
> > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> > shared);
> > + }
> > +
> > + if (is_private && !is_present)
> > + handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
> > role.level);
>
> I'm a little bothered by the asymmetry of where the mirrored hooks get called
> between setting and zapping PTEs. Tracing through the code, the relevent
> operations that are needed for TDX are:
> 1. tdp_mmu_iter_set_spte() from tdp_mmu_zap_leafs() and __tdp_mmu_zap_root()
> 2. tdp_mmu_set_spte_atomic() is used for mapping, linking
>
> (1) is a simple case because the mmu_lock is held for writes. It updates the
> mirror root like normal, then has extra logic to call out to update the S-EPT.
>
> (2) on the other hand just has the read lock, so it has to do the whole
> operation in a special way. First set REMOVED_SPTE, then update the private
> copy, then write to the mirror page tables. It can't get stuffed into
> handle_changed_spte() because it has to write REMOVED_SPTE first.
>
> In some ways it makes sense to update the S-EPT. Despite claiming
> "handle_changed_spte() only updates stats.", it does some updating of other PTEs
> based on the current PTE change. Which is pretty similar to what the mirrored
> PTEs are doing. But we can't really do the setting of present PTEs because of
> the REMOVED_SPTE stuff.
>
> So we could only make it more symmetrical by moving the S-EPT ops out of
> handle_changed_spte() and manually call it in the two places relevant for TDX,
> like the below.
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e966986bb9f2..c9ddb1c2a550 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
> pt, bool shared)
> */
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> REMOVED_SPTE, level);
> +
> + if (is_mirror_sp(sp))
> + reflect_removed_spte(kvm, gfn, old_spte,
> REMOVED_SPTE, level);
The callback before handling lower level will result in error.
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> old_spte, REMOVED_SPTE, sp->role, shared);
We should call it here after processing lower level.
> @@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
> gfn_t gfn,
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> shared);
> }
>
> - if (is_mirror && !is_present)
> - reflect_removed_spte(kvm, gfn, old_spte, new_spte, role.level);
> -
> if (was_leaf && is_accessed_spte(old_spte) &&
> (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> @@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> tdp_ptep_t sptep,
> new_spte, level), kvm);
> }
>
> + if (is_mirror_sptep(sptep))
> + reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> +
Ditto.
> role = sptep_to_sp(sptep)->role;
> role.level = level;
> handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
The callback should be here. It should be after handling the lower level.
> Otherwise, we could move the "set present" mirroring operations into
> handle_changed_spte(), and have some earlier conditional logic do the
> REMOVED_SPTE parts. It starts to become more scattered.
> Anyway, it's just a code clarity thing arising from having hard time explaining
> the design in the log. Any opinions?
Originally I tried to consolidate the callbacks by following TDP MMU using
handle_changed_spte(). Anyway we can pick from two outcomes based on which is
easy to understand/maintain.
> A separate but related comment is below.
>
> >
> > if (was_leaf && is_accessed_spte(old_spte) &&
> > (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> > kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > }
> >
> > @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> > {
> > + union kvm_mmu_page_role role;
> > +
> > lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > /*
> > @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> > WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
> >
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > + if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> > + is_shadow_present_pte(new_spte)) {
> > + /* Because write spin lock is held, no race. It should
> > success. */
> > + KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn,
> > old_spte,
> > + new_spte, level), kvm);
> > + }
>
> Based on the above enumeration, I don't see how this hunk gets used.
I should've removed it. This is leftover from the old patches.
--
Isaku Yamahata <isaku.yamahata@...el.com>
Powered by blists - more mailing lists