linux-kernel - Re: [RFC 11/19] KVM: x86/mmu: Factor shadow_zero_check out of make

Message-ID: <YYxvSfUPTXbclpSa@google.com>
Date:   Thu, 11 Nov 2021 01:18:01 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Ben Gardon <bgardon@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, Peter Xu <peterx@...hat.com>,
        Peter Shier <pshier@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Mingwei Zhang <mizhang@...gle.com>,
        Yulei Zhang <yulei.kernel@...il.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Xiao Guangrong <xiaoguangrong.eric@...il.com>,
        Kai Huang <kai.huang@...el.com>,
        Keqian Zhu <zhukeqian1@...wei.com>,
        David Hildenbrand <david@...hat.com>
Subject: Re: [RFC 11/19] KVM: x86/mmu: Factor shadow_zero_check out of
 make_spte

On Wed, Nov 10, 2021, Ben Gardon wrote:
> On Wed, Nov 10, 2021 at 2:45 PM Paolo Bonzini <pbonzini@...hat.com> wrote:
> >
> > On 11/10/21 23:30, Ben Gardon wrote:
> > > -     WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> > > +     WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> > >                 "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> > > -               get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> > > +               get_rsvd_bits(shadow_zero_check, spte, level));
> >
> > Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
> > the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
> > really the same for all vCPUs.  The only thing that varies is those
> > parts that actually depend on the guest's paging mode---the extended
> > role, the reserved bits, etc.  Those are needed by the emulator, but
> > don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
> >
> > I wonder if there's room for splitting kvm_mmu in two parts, such as
> > kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
> > pointer into a pointer to kvm_guest_paging_context.  This way the
> > EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch.  It
> > should simplify this series and also David's work on eager page splitting.
> >
> > I'm not asking you to do this, of course, but perhaps I can trigger
> > Sean's itch to refactor stuff. :)
> >
> > Paolo
> >
> 
> I think that's a great idea. I'm frequently confused as to why the
> struct kvm_mmu is a per-vcpu construct as opposed to being VM-global.
> Moving part of the struct to be a member of struct kvm would also
> open the door to formalizing the MMU interface a little better and
> perhaps even reveal more MMU code that can be consolidated across
> architectures.

But what would you actually move?  Even shadow_zero_check barely squeaks by,
e.g. if NX is ever used for NPT, then maybe it stops being a per-VM setting.

Going through the fields...

These are all related to guest context:

	unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu);
	u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
	int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
	void (*inject_page_fault)(struct kvm_vcpu *vcpu,
				  struct x86_exception *fault);
	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
			    u32 access, struct x86_exception *exception);
	gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
			       struct x86_exception *exception);
	int (*sync_page)(struct kvm_vcpu *vcpu,
			 struct kvm_mmu_page *sp);
	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
	union kvm_mmu_role mmu_role;
	u8 root_level;
	u8 permissions[16];
	u32 pkru_mask;
	struct rsvd_bits_validate guest_rsvd_check;
	u64 pdptrs[4];
	gpa_t root_pgd;
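To make Paolo's proposal concrete, here is a rough, hypothetical model of the split. All sketch_* names are invented for illustration and only a few of the guest-context fields above are carried over. Guest paging state stays per-vCPU behind walk_mmu, while the EPT/NPT root moves to the VM. Note that putting root_hpa in the per-VM struct is precisely the assumption the rest of this mail pushes back on.

```c
#include <stdint.h>

/* Guest-dependent state: stays per-vCPU (a few fields from the list above). */
struct sketch_guest_paging_context {
	uint8_t  root_level;   /* guest paging depth */
	uint32_t pkru_mask;
	uint64_t pdptrs[4];    /* only meaningful for PAE guests */
	uint64_t root_pgd;     /* guest root, e.g. CR3 */
};

/* EPT/NPT state that, per the proposal, could live in kvm->arch. */
struct sketch_tdp_mmu {
	uint64_t root_hpa;
};

struct sketch_kvm {
	struct sketch_tdp_mmu tdp;
};

struct sketch_vcpu {
	struct sketch_kvm *kvm;
	struct sketch_guest_paging_context guest_ctx;
	/* walk_mmu now points at a guest paging context, not a full MMU. */
	struct sketch_guest_paging_context *walk_mmu;
};

void sketch_vcpu_init(struct sketch_vcpu *vcpu, struct sketch_kvm *kvm,
		      uint8_t guest_root_level)
{
	vcpu->kvm = kvm;
	vcpu->guest_ctx = (struct sketch_guest_paging_context){
		.root_level = guest_root_level,
	};
	vcpu->walk_mmu = &vcpu->guest_ctx;
}

/* Every vCPU resolves the same per-VM TDP root. */
uint64_t sketch_tdp_root(const struct sketch_vcpu *vcpu)
{
	return vcpu->kvm->tdp.root_hpa;
}
```

Two vCPUs in different guest paging modes would then share one TDP root while keeping separate walk_mmu contexts for the emulator.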

One field, ept_ad, can be straight deleted as it's redundant with respect to
the above mmu_role.ad_disabled.

	u8 ept_ad;

Ditto for the direct_map flag (mmu_role.direct) and shadow_root_level (mmu_role.level).
I haven't bothered to yank those because they have a lot of touchpoints.

	bool direct_map;
	u8 shadow_root_level;
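A minimal sketch of what dropping these three fields could look like: each becomes an accessor that derives the value from the role. The union below is a hypothetical stand-in; KVM's real role union has many more bits and a different layout, and the sketch_* names are invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the MMU role; NOT the kernel's layout. */
union sketch_page_role {
	uint32_t word;
	struct {
		unsigned level:4;       /* root level of the shadow/TDP tables */
		unsigned direct:1;      /* no guest page tables being shadowed */
		unsigned ad_disabled:1; /* EPT accessed/dirty bits disabled */
	};
};

struct sketch_mmu {
	union sketch_page_role role;
};

/* Replaces ept_ad: A/D bits are enabled exactly when not disabled. */
static inline bool sketch_ept_ad(const struct sketch_mmu *mmu)
{
	return !mmu->role.ad_disabled;
}

/* Replaces direct_map. */
static inline bool sketch_direct_map(const struct sketch_mmu *mmu)
{
	return mmu->role.direct;
}

/* Replaces shadow_root_level. */
static inline uint8_t sketch_shadow_root_level(const struct sketch_mmu *mmu)
{
	return mmu->role.level;
}
```

The trade-off is the one noted above: the accessors themselves are trivial, but every existing user of the cached fields has to be converted.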

The prev_roots could be dropped if TDP roots were tracked per-VM, but we'd still
want an equivalent for !TDP and nTDP MMUs.

	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
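prev_roots lets a vCPU that switches back to a recently used guest root (e.g. a CR3 it loaded a moment ago) reuse the existing shadow root instead of rebuilding it. The sketch below is a simplified, hypothetical model of such a per-vCPU MRU cache; the sketch_* names are invented, and unlike KVM it matches on the guest pgd only, ignoring the page role. Because the key is the vCPU's own guest root, this kind of cache stays inherently per-vCPU for !TDP and nTDP.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_PREV_ROOTS 3              /* cf. KVM_MMU_NUM_PREV_ROOTS */
#define SKETCH_INVALID_HPA    ((uint64_t)-1)

/* Hypothetical stand-in for struct kvm_mmu_root_info. */
struct sketch_root_info {
	uint64_t pgd;   /* guest root (e.g. CR3) that this root shadows */
	uint64_t hpa;   /* host physical address of the root page */
};

struct sketch_vcpu_mmu {
	struct sketch_root_info root;                              /* current */
	struct sketch_root_info prev_roots[SKETCH_NUM_PREV_ROOTS]; /* MRU first */
};

static bool sketch_root_usable(const struct sketch_root_info *r, uint64_t pgd)
{
	/* Real KVM also compares the page role; omitted here. */
	return r->hpa != SKETCH_INVALID_HPA && r->pgd == pgd;
}

/*
 * Switch to new_pgd. The current root is rotated into prev_roots; on a hit
 * the cached root becomes current and true is returned. On a miss the least
 * recently used entry is handed back via *evicted (for freeing), the current
 * root is left invalid for the caller to build, and false is returned.
 */
bool sketch_switch_root(struct sketch_vcpu_mmu *mmu, uint64_t new_pgd,
			struct sketch_root_info *evicted)
{
	struct sketch_root_info root = mmu->root;

	evicted->pgd = 0;
	evicted->hpa = SKETCH_INVALID_HPA;

	if (sketch_root_usable(&root, new_pgd))
		return true;

	for (int i = 0; i < SKETCH_NUM_PREV_ROOTS; i++) {
		/* Swap the candidate in, pushing the previous one down a slot. */
		struct sketch_root_info tmp = mmu->prev_roots[i];

		mmu->prev_roots[i] = root;
		root = tmp;
		if (sketch_root_usable(&root, new_pgd)) {
			mmu->root = root;
			return true;
		}
	}

	*evicted = root;
	mmu->root.pgd = new_pgd;
	mmu->root.hpa = SKETCH_INVALID_HPA;
	return false;
}
```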

shadow_zero_check can be made per-VM if all vCPUs are required to have the same
cpuid.MAXPHYADDR or if we remove the (IMO) pointless 5-level vs. 4-level behavior,
which by-the-by, has my vote since we could make shadow_zero_check _global_, not
just per-VM, and everything I've heard is that the extra level has no measurable
performance overhead.

	struct rsvd_bits_validate shadow_zero_check;
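One reading of the MAXPHYADDR dependency (an assumption on my part, not spelled out above): the TDP root level is chosen from the vCPU's MAXPHYADDR, using 4-level tables when the guest can't address more than 48 bits even if the host supports 5-level, and shadow_zero_check is computed for that root level. The hypothetical model below (sketch_* names invented) shows why dropping the 4-level preference would let every vCPU share one copy.

```c
#include <stdbool.h>

/*
 * Hypothetical model of picking a TDP root level for a vCPU.
 *   max_tdp_level:    deepest TDP paging the host supports (4 or 5)
 *   guest_maxphyaddr: the vCPU's CPUID.MAXPHYADDR
 *   prefer_4level:    true models the "5-level only if needed" policy,
 *                     false models always using max_tdp_level
 */
int sketch_tdp_root_level(int max_tdp_level, int guest_maxphyaddr,
			  bool prefer_4level)
{
	/* 4-level tables map 48 bits of guest physical address space. */
	if (prefer_4level && max_tdp_level == 5 && guest_maxphyaddr <= 48)
		return 4;
	return max_tdp_level;
}

/*
 * If the reserved-bit masks depend on the root level, two vCPUs can share
 * one copy only when they end up with the same level.
 */
bool sketch_can_share_zero_check(int max_tdp_level, int maxphyaddr_a,
				 int maxphyaddr_b, bool prefer_4level)
{
	return sketch_tdp_root_level(max_tdp_level, maxphyaddr_a, prefer_4level) ==
	       sketch_tdp_root_level(max_tdp_level, maxphyaddr_b, prefer_4level);
}
```

With the preference removed, the level depends only on the host, which is what would make a global shadow_zero_check possible.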

And that leaves us with:
	hpa_t root_hpa;

	u64 *pae_root;
	u64 *pml4_root;
	u64 *pml5_root;

Of those, _none_ can be per-VM: they are all nothing more than shadow pages,
and so can only be shared VM-wide if there is exactly one set of TDP page
tables for the guest.  Even if/when we strip the unnecessary role bits from
these for TDP (on my todo list), we still need up to three sets of page tables:

	1. Normal
	2. SMM
	3. Guest (if L1 doesn't use TDP)

So I suppose we could refactor KVM to explicitly track its three possible TDP
roots, but I don't think it buys us anything and would complicate supporting
!TDP as well as nTDP.
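For concreteness, a hypothetical sketch of the "explicitly track three TDP roots" alternative (all sketch_* names invented). It assumes a vCPU is never in SMM and guest mode at the same time. The nested-TDP case still falls outside the per-VM table, which is the complication mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enumeration of the TDP root sets a VM could need. */
enum sketch_tdp_root {
	SKETCH_ROOT_NORMAL,
	SKETCH_ROOT_SMM,
	SKETCH_ROOT_GUEST,   /* L2 running under an L1 that doesn't use TDP */
	SKETCH_NR_TDP_ROOTS,
};

struct sketch_vm_mmu {
	uint64_t roots[SKETCH_NR_TDP_ROOTS];  /* root hpa for each set */
};

struct sketch_vcpu_state {
	bool in_smm;       /* vCPU is in System Management Mode */
	bool guest_mode;   /* vCPU is running L2 */
	bool l1_uses_tdp;  /* L1 enabled EPT/NPT for L2 */
};

/*
 * Index of the per-VM root for this vCPU, or -1 when the vCPU needs a
 * nested TDP (shadow) MMU that can't come from the per-VM table.
 */
int sketch_pick_tdp_root(const struct sketch_vcpu_state *s)
{
	if (s->guest_mode && s->l1_uses_tdp)
		return -1;
	if (s->guest_mode)
		return SKETCH_ROOT_GUEST;
	if (s->in_smm)
		return SKETCH_ROOT_SMM;
	return SKETCH_ROOT_NORMAL;
}

/* Shared root hpa for this vCPU, or 0 for the nested-TDP case. */
uint64_t sketch_vcpu_root_hpa(const struct sketch_vm_mmu *vm,
			      const struct sketch_vcpu_state *s)
{
	int idx = sketch_pick_tdp_root(s);

	return idx < 0 ? 0 : vm->roots[idx];
}
```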