linux-kernel - Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZNxbLPG8qbs1FjhM@yzhao56-desk.sh.intel.com>
Date:   Wed, 16 Aug 2023 13:14:20 +0800
From:   Yan Zhao <yan.y.zhao@...el.com>
To:     bibo mao <maobibo@...ngson.cn>
CC:     Sean Christopherson <seanjc@...gle.com>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <pbonzini@...hat.com>, <mike.kravetz@...cle.com>,
        <apopple@...dia.com>, <jgg@...dia.com>, <rppt@...nel.org>,
        <akpm@...ux-foundation.org>, <kevin.tian@...el.com>,
        <david@...hat.com>
Subject: Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed
 protected for NUMA migration

On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
> 
> 
> 在 2023/8/16 10:43, bibo mao 写道:
> > 
> > 
> > 在 2023/8/15 22:50, Sean Christopherson 写道:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> 	if (range.start)
> >>>> 		mmu_notifier_invalidate_range_end(&range);
> >>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was complete untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For numa-balance scenery, can original page still be used by application even if pte
> > is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,

See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
> new page, my meaning that can original page still be used by application even if pte
> is changed with PROT_NONE and before replaced with new page?
It's not .change_pte() notification, which is sent when COW.
The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
protected with PROT_NONE, and if every check passes, a separate
.invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent.

So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
(this approach is implemented in RFC v1
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)

> 
> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and 
> after mmu_notifier_invalidate_range_end notification for secondary mmu.
> Regards
> Bibo Mao

> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flush tlb only in the .invalidate_range_end() in
secondary MMU is a good idea.
Flush must be done before kvm->mmu_lock is unlocked, otherwise,
confusion will be caused when multiple threads trying to update the
secondary MMU.

> >> be cross-range between range mmu_gather and mmu_notifier_range.