Message-ID: <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>
Date:   Fri, 11 Aug 2023 19:25:30 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Yan Zhao <yan.y.zhao@...el.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, pbonzini@...hat.com, seanjc@...gle.com,
        mike.kravetz@...cle.com, apopple@...dia.com, jgg@...dia.com,
        rppt@...nel.org, akpm@...ux-foundation.org, kevin.tian@...el.com,
        John Hubbard <jhubbard@...dia.com>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a
 VM

On 10.08.23 11:50, Yan Zhao wrote:
> On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote:
>>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>>> the subscriber (e.g. KVM) of the mmu notifier can know that an invalidation
>>> event is sent specifically for NUMA migration purposes.
>>>
>>> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>>> MMU to avoid NUMA protection introduced page faults and restoration of old
>>> huge PMDs/PTEs in primary MMU.
>>>
>>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>>> will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>>>
>>> Then in patch 5, KVM can recognize that a .invalidate_range_start()
>>> notification is specific to NUMA balancing and does not unmap the page in
>>> the secondary MMU until .numa_protect() comes.
>>>
>>
>> Why do we need all that, when we should simply not be applying PROT_NONE to
>> pinned pages?
>>
>> In change_pte_range() we already have:
>>
>> if (is_cow_mapping(vma->vm_flags) &&
>>      page_count(page) != 1)
>>
>> Which includes both shared and pinned pages.
> Ah, right, currently on my side, I don't see any pinned pages that fall
> outside of this condition.
> But I have a question regarding is_cow_mapping(vma->vm_flags): do we
> need to allow pinned pages in !is_cow_mapping(vma->vm_flags)?

One issue is that folio_maybe_pinned...() ... is unreliable as soon as 
your page is mapped more than 1024 times.

One might argue that we also want to exclude pages that are mapped that 
often. That might possibly work.
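
To make the 1024 threshold concrete, here is a minimal userspace sketch of a refcount-only "maybe pinned" heuristic for small pages. It is a model, not the kernel's code: PIN_BIAS is assumed to mirror the kernel's GUP_PIN_COUNTING_BIAS, and struct page_model, model_refcount() and model_maybe_pinned() are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/* Assumption: mirrors the kernel's GUP_PIN_COUNTING_BIAS, i.e. every pin
 * on a small page adds this many references to its refcount. */
#define PIN_BIAS 1024u

/* Hypothetical userspace model of a small (order-0) page. */
struct page_model {
	unsigned int mappings;	/* each mapping holds one reference */
	unsigned int pins;	/* each pin holds PIN_BIAS references */
};

/* Total reference count as seen by the heuristic. */
static unsigned int model_refcount(const struct page_model *p)
{
	return p->mappings + p->pins * PIN_BIAS;
}

/* Refcount-only heuristic: it cannot tell one pin apart from a large
 * number of ordinary references, so it can report false positives. */
static bool model_maybe_pinned(const struct page_model *p)
{
	return model_refcount(p) >= PIN_BIAS;
}
```

In this model a page with one mapping and one pin is reported as pinned, but so is an unpinned page with 1024 mappings, which is the unreliability described above.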

> 
>> Staring at patch #2, are we still missing something similar for THPs?
> Yes.
> 
>> Why is that MMU notifier thingy and touching KVM code required?
> Because NUMA balancing code will firstly send .invalidate_range_start() with
> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
> unconditionally, before it goes down into change_pte_range() and
> change_huge_pmd() to check each page count and apply PROT_NONE.

Ah, okay I see, thanks. That's indeed unfortunate.
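
The ordering problem can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: struct toy_page and toy_numa_scan() are hypothetical, and the real work is done by the mmu notifier and by change_pmd_range()/change_pte_range(), not by code like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for the toy model. */
struct toy_page {
	bool pinned;		/* long-term pinned: PROT_NONE is skipped */
	bool secondary_mapped;	/* currently mapped in the secondary MMU */
	bool prot_none;		/* primary MMU entry was made PROT_NONE */
	bool zapped;		/* unmapped from the secondary MMU by this scan */
};

/* Model one NUMA hinting pass over a range. Returns the number of
 * secondary-MMU unmaps that were wasted, i.e. pages that were unmapped
 * although they were never made PROT_NONE in the primary MMU. */
static size_t toy_numa_scan(struct toy_page *pages, size_t n)
{
	size_t wasted = 0;
	size_t i;

	/* Step 1: range-wide notification, sent before any per-page check.
	 * The subscriber unmaps every page in the range. */
	for (i = 0; i < n; i++) {
		pages[i].zapped = pages[i].secondary_mapped;
		pages[i].secondary_mapped = false;
	}

	/* Step 2: per-page check, which skips pinned pages. */
	for (i = 0; i < n; i++) {
		if (!pages[i].pinned)
			pages[i].prot_none = true;
		else if (pages[i].zapped)
			wasted++;
	}

	return wasted;
}
```

With every page in the range pinned, as for a VM with pass-through devices, every unmap in step 1 counts as wasted in this model.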

> 
> Then current KVM will unmap all notified pages from the secondary MMU
> in .invalidate_range_start(), which could include pages that are
> ultimately not set to PROT_NONE in the primary MMU.
> 
> For VMs with pass-through devices, though all guest pages are pinned,
> KVM still periodically unmaps pages in response to the
> .invalidate_range_start() notification from auto NUMA balancing, which
> is a waste.

Should we instead disable NUMA hinting for such VMAs (for example, from 
QEMU or another hypervisor that knows that any NUMA hinting activity on 
these ranges would be a complete waste of time)? I recall that John H. 
once mentioned that there are similar issues with GPU memory: NUMA 
hinting is actually counter-productive and they end up disabling it.

-- 
Cheers,

David / dhildenb
