Message-ID: <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>
Date: Fri, 11 Aug 2023 19:25:30 +0200
From: David Hildenbrand <david@...hat.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, pbonzini@...hat.com, seanjc@...gle.com,
mike.kravetz@...cle.com, apopple@...dia.com, jgg@...dia.com,
rppt@...nel.org, akpm@...ux-foundation.org, kevin.tian@...el.com,
John Hubbard <jhubbard@...dia.com>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a
VM
On 10.08.23 11:50, Yan Zhao wrote:
> On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
>>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>>> the subscriber (e.g. KVM) of the mmu notifier can know that an invalidation
>>> event is sent specifically for NUMA migration purposes.
>>>
>>> Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
>>> MMU, to avoid NUMA-protection-induced page faults and the restoration of old
>>> huge PMDs/PTEs in the primary MMU.
>>>
>>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>>> will be called in patch 4 once a page is guaranteed to be PROT_NONE protected.
>>>
>>> Then in patch 5, KVM can recognize that an .invalidate_range_start()
>>> notification is specifically for NUMA balancing and defer unmapping the page
>>> in the secondary MMU until .numa_protect() arrives.
>>>
>>
>> Why do we need all that, when we should simply not be applying PROT_NONE to
>> pinned pages?
>>
>> In change_pte_range() we already have:
>>
>> if (is_cow_mapping(vma->vm_flags) &&
>>     page_count(page) != 1)
>>
>> Which includes both shared and pinned pages.
> Ah, right. Currently, on my side, I don't see any pinned pages falling
> outside of this condition.
> But I have a question regarding is_cow_mapping(vma->vm_flags): do we
> need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?
One issue is that folio_maybe_pinned...() is unreliable as soon as
your page is mapped more than 1024 times.
One might argue that we also want to exclude pages that are mapped that
often. That might possibly work.
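
For context, a rough sketch of why that heuristic degrades, loosely based
on the small-folio case of folio_maybe_dma_pinned() in include/linux/mm.h;
this is simplified and the helper name is only illustrative, not the exact
upstream code:

	/*
	 * Pinning via pin_user_pages() adds GUP_PIN_COUNTING_BIAS (1024)
	 * to the folio refcount, so the check boils down to
	 * "refcount >= 1024".  A folio that is mapped 1024+ times therefore
	 * reports "maybe pinned" even if it was never pinned.
	 */
	#define GUP_PIN_COUNTING_BIAS	(1U << 10)

	static inline bool folio_maybe_pinned_sketch(struct folio *folio)
	{
		return (unsigned int)folio_ref_count(folio) >=
			GUP_PIN_COUNTING_BIAS;
	}

So every additional mapping pushes an unpinned folio closer to looking
pinned, and past 1024 mappings the two cases are indistinguishable.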
>
>> Staring at patch #2, are we still missing something similar for THPs?
> Yes.
>
>> Why is that MMU notifier thingy and touching KVM code required?
> Because the NUMA balancing code first sends .invalidate_range_start() with
> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
> unconditionally, before it goes down into change_pte_range() and
> change_huge_pmd() to check each page's refcount and apply PROT_NONE.
Ah, okay I see, thanks. That's indeed unfortunate.
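
For other readers, what Yan describes looks roughly like this in
change_pmd_range() in mm/mprotect.c; simplified from my memory of the
code, so treat it as a sketch rather than the exact upstream source:

	/*
	 * The notifier range is set up once per PMD range, before we know
	 * which individual pages will actually end up PROT_NONE, so the
	 * secondary MMU gets an invalidation even for ranges where nothing
	 * changes in the end.
	 */
	if (!range.start) {
		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
					vma->vm_mm, addr, end);
		mmu_notifier_invalidate_range_start(&range);
	}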
>
> Then current KVM will unmap all notified pages from the secondary MMU
> in .invalidate_range_start(), which could include pages that are ultimately
> not set to PROT_NONE in the primary MMU.
>
> For VMs with pass-through devices, though all guest pages are pinned,
> KVM still periodically unmaps pages in response to the
> .invalidate_range_start() notifications from auto NUMA balancing, which
> is a waste.
Should we instead disable NUMA hinting for such VMAs (for example, from
QEMU/the hypervisor, which knows that any NUMA hinting activity on these
ranges would be a complete waste of time)? I recall that John H.
once mentioned that there are similar issues with GPU memory: NUMA
hinting is actually counter-productive there and they end up disabling it.
--
Cheers,
David / dhildenb