Date:   Tue, 8 Aug 2023 07:26:07 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     Jason Gunthorpe <jgg@...dia.com>
Cc:     Yan Zhao <yan.y.zhao@...el.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        pbonzini@...hat.com, mike.kravetz@...cle.com, apopple@...dia.com,
        rppt@...nel.org, akpm@...ux-foundation.org, kevin.tian@...el.com
Subject: Re: [RFC PATCH 3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for
 NUMA migration

On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >  
> > +		if (skip_pinned) {
> > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +			struct folio *folio;
> > +
> > +			if (!page)
> > +				continue;
> > +
> > +			folio = page_folio(page);
> > +
> > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > +			    folio_maybe_dma_pinned(folio))
> > +				continue;
> > +		}
> > +
> 
> I don't get it..
> 
> The last patch made it so that the NUMA balancing code doesn't change
> page_maybe_dma_pinned() pages to PROT_NONE
> 
> So why doesn't KVM just check if the current and new SPTE are the same
> and refrain from invalidating if nothing changed?

Because KVM doesn't have visibility into the current and new PTEs when the zapping
occurs.  The contract for invalidate_range_start() requires that KVM drop all
references before returning, and so the zapping occurs before change_pte_range()
or change_huge_pmd() have done anything.
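
Roughly, the ordering on the mprotect/NUMA-balancing side looks like this
(simplified sketch of change_pmd_range() in mm/mprotect.c; argument lists and
exact placement vary by kernel version):

	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
				vma->vm_mm, addr, end);
	mmu_notifier_invalidate_range_start(&range);	/* KVM zaps SPTEs here */

	/* only after the notifier returns are the primary PTEs modified */
	change_huge_pmd(...);	/* or */
	change_pte_range(...);

	mmu_notifier_invalidate_range_end(&range);

I.e. at the point KVM is asked to invalidate, there is no "current vs. new PTE"
state for the secondary MMU to compare.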

> Duplicating the checks here seems very frail to me.

Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
can yield different results purely based on refcounts, i.e. KVM could skip pages
that the primary MMU does not, and thus violate the mmu_notifier contract.  And
in general, I am steadfastly against adding any kind of heuristic to KVM's
zapping logic.
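
For reference, folio_maybe_dma_pinned() boils down to roughly this on recent
kernels (simplified sketch; exact fields differ across versions):

	static inline bool folio_maybe_dma_pinned(struct folio *folio)
	{
		/* Large folios track FOLL_PIN references explicitly. */
		if (folio_test_large(folio))
			return atomic_read(&folio->_pincount) > 0;

		/*
		 * Order-0 folios only have the bias heuristic: any pile-up of
		 * transient references can push the refcount past
		 * GUP_PIN_COUNTING_BIAS (1024) and flip the result.
		 */
		return (unsigned int)folio_ref_count(folio) >=
			GUP_PIN_COUNTING_BIAS;
	}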

This really needs to be fixed in the primary MMU and not require any direct
involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
to be skipped.
