Message-ID: <42ff33c7-ec50-1310-3e57-37e8283b9b16@loongson.cn>
Date: Wed, 16 Aug 2023 15:29:22 +0800
From: bibo mao <maobibo@...ngson.cn>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Sean Christopherson <seanjc@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
pbonzini@...hat.com, mike.kravetz@...cle.com, apopple@...dia.com,
jgg@...dia.com, rppt@...nel.org, akpm@...ux-foundation.org,
kevin.tian@...el.com, david@...hat.com
Subject: Re: [RFC PATCH v2 5/5] KVM: Unmap pages only when it's indeed
protected for NUMA migration
在 2023/8/16 13:14, Yan Zhao 写道:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> On 2023/8/16 10:43, bibo mao wrote:
>>>
>>>
>>> On 2023/8/15 22:50, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> if (range.start)
>>>>>> mmu_notifier_invalidate_range_end(&range);
>>>>> No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch. The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was completely untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
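(For reference, here is a rough standalone sketch of how I understand the deferred
start()/single end() idea; all names below are made up for illustration, and this is
not Sean's actual patch:)

	/*
	 * Standalone model of "defer start(), single end()": only send one
	 * start() when the first pte is really changed, keeping the caller's
	 * original @end so a single end() later batches the whole flush.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct fake_range {
		unsigned long start;
		unsigned long end;
	};

	static void fake_invalidate_start(struct fake_range *r)
	{
		printf("start() for [%#lx, %#lx)\n", r->start, r->end);
	}

	static void fake_invalidate_end(struct fake_range *r)
	{
		printf("end()   for [%#lx, %#lx)\n", r->start, r->end);
	}

	/* Pretend only every other page can really be made PROT_NONE. */
	static bool can_protect(unsigned long addr)
	{
		return (addr >> 12) & 1;
	}

	static void change_range(unsigned long start, unsigned long end)
	{
		struct fake_range r = { .start = 0, .end = end };
		unsigned long addr;

		for (addr = start; addr < end; addr += 0x1000) {
			if (!can_protect(addr))
				continue;
			/* Deferred: notify only once something is really changed. */
			if (!r.start) {
				r.start = start;
				fake_invalidate_start(&r);
			}
			printf("make %#lx PROT_NONE\n", addr);
		}

		/* Pairs with the deferred start(), mirroring "if (range.start)". */
		if (r.start)
			fake_invalidate_end(&r);
	}

	int main(void)
	{
		change_range(0x100000, 0x105000);
		return 0;
	}

If nothing in the VMA can be protected, neither start() nor end() is ever sent.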
>>> In the NUMA-balancing scenario, can the original page still be used by the application even if the pte
>>> is changed to PROT_NONE? If it can be used, maybe we can zap the shadow mmu and flush the tlb
> For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes,
>
> See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/
>
>> Since there is a kvm_mmu_notifier_change_pte notification when a NUMA page is replaced with a
>> new page, what I mean is: can the original page still be used by the application after the pte
>> is changed to PROT_NONE and before it is replaced with the new page?
> It's not the .change_pte() notification, which is sent on COW.
> do_numa_page()/do_huge_pmd_numa_page() will try to unmap the old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.
Yes, you are right. The change_pte() notification will be called
from migrate_vma_pages(); I mixed it up with NUMA page migration. However,
invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will also be
sent when the old page is replaced with the new one.
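(A toy sketch of what that event split could look like on the secondary-MMU side;
the FAKE_* names below only mimic MMU_NOTIFY_PROTECTION_VMA / MMU_NOTIFY_CLEAR,
this is not real KVM code:)

	#include <stdio.h>

	enum fake_event {
		FAKE_PROTECTION_VMA,	/* pte made PROT_NONE by NUMA balancing */
		FAKE_CLEAR,		/* old page really unmapped for migration */
	};

	static void secondary_mmu_invalidate(enum fake_event ev,
					     unsigned long start, unsigned long end)
	{
		switch (ev) {
		case FAKE_PROTECTION_VMA:
			/* Old page is still there; keep the (maybe-DMA) mapping. */
			printf("keep mapping  [%#lx, %#lx)\n", start, end);
			break;
		case FAKE_CLEAR:
			/* Migration is really happening; unmap and flush now. */
			printf("unmap + flush [%#lx, %#lx)\n", start, end);
			break;
		}
	}

	int main(void)
	{
		secondary_mmu_invalidate(FAKE_PROTECTION_VMA, 0x100000, 0x200000);
		secondary_mmu_invalidate(FAKE_CLEAR, 0x100000, 0x101000);
		return 0;
	}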
>
> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT) can safely
> keep mapping maybe-DMA pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)
>
>>
>> And for the primary MMU, the TLB is flushed after the pte is changed to PROT_NONE and
>> after the mmu_notifier_invalidate_range_end notification for the secondary MMU.
>> Regards
>> Bibo Mao
>
>>> in the mmu_notifier_invalidate_range_end notification with a precise range; the range can
> But I don't think flushing the TLB for the secondary MMU only in
> .invalidate_range_end() is a good idea.
I have no better idea, and it is beyond my ability to modify the KVM framework now :(
> The flush must be done before kvm->mmu_lock is unlocked; otherwise,
> confusion will be caused when multiple threads are trying to update the
> secondary MMU.
Since the TLB flush is delayed until after all pte entries are cleared, and currently
there is no ranged TLB flush supported for the secondary MMU, I do not see why it
matters whether the flush is done before or after kvm->mmu_lock is released.
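(For reference, the ordering I understand you to mean is roughly the toy sketch
below; the names are made up and this is not the real KVM locking code:)

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t fake_mmu_lock = PTHREAD_MUTEX_INITIALIZER;
	static bool spte_present = true;

	static void fake_flush_remote_tlbs(void)
	{
		printf("flush remote TLBs\n");
	}

	static void zap_range(void)
	{
		pthread_mutex_lock(&fake_mmu_lock);
		spte_present = false;		/* zap the secondary page tables */
		fake_flush_remote_tlbs();	/* flush before dropping the lock */
		pthread_mutex_unlock(&fake_mmu_lock);
	}

	int main(void)
	{
		zap_range();
		/* Nobody who takes fake_mmu_lock afterwards can observe
		 * !spte_present together with a stale translation, because
		 * the flush happened while the lock was still held. */
		return 0;
	}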
Regards
Bibo Mao
>
>>> be cross-range between the mmu_gather range and the mmu_notifier_range.
>
>