Message-ID: <41a893e1-f2e7-23f4-cad2-d5c353a336a3@redhat.com>
Date: Thu, 10 Aug 2023 11:34:07 +0200
From: David Hildenbrand <david@...hat.com>
To: Yan Zhao <yan.y.zhao@...el.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Cc: pbonzini@...hat.com, seanjc@...gle.com, mike.kravetz@...cle.com,
apopple@...dia.com, jgg@...dia.com, rppt@...nel.org,
akpm@...ux-foundation.org, kevin.tian@...el.com
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a
VM
On 10.08.23 10:56, Yan Zhao wrote:
> This is an RFC series trying to fix the issue of unnecessary NUMA
> protection and TLB shootdowns observed in VMs with assigned devices or
> VFIO mediated devices during NUMA balancing.
>
> For VMs with assigned devices or VFIO mediated devices, all or part of
> guest memory is pinned long-term.
>
> Auto NUMA balancing periodically selects VMAs of a process and changes
> their protection to PROT_NONE, even though some or all pages in the
> selected ranges are long-term pinned for DMA, which is the case for VMs
> with assigned devices or VFIO mediated devices.
>
> Though this does not cause a real problem, because NUMA migration will
> ultimately reject migration of such pages and restore the PROT_NONE
> PTEs, it causes KVM's secondary MMU to be zapped periodically, with
> equal SPTEs eventually faulted back in, wasting CPU cycles and
> generating unnecessary TLB shootdowns.
>
> This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in
> patch 1 to work with the mmu notifier event type
> MMU_NOTIFY_PROTECTION_VMA, so that a subscriber (e.g. KVM) of the mmu
> notifier can know that an invalidation event is sent specifically for
> NUMA migration.
>
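To make the intent concrete, a subscriber-side check could look roughly
like the sketch below (range_is_numa_protect() is a hypothetical helper;
it assumes patch 1 passes the new flag in range->flags alongside the
MMU_NOTIFY_PROTECTION_VMA event):

    /*
     * Sketch only: detect a NUMA-balancing invalidation from within a
     * subscriber's .invalidate_range_start() callback.
     */
    static bool range_is_numa_protect(const struct mmu_notifier_range *range)
    {
            return range->event == MMU_NOTIFY_PROTECTION_VMA &&
                   (range->flags & MMU_NOTIFIER_RANGE_NUMA);
    }
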
> Patch 2 skips setting PROT_NONE on long-term pinned pages in the
> primary MMU, avoiding the page faults introduced by NUMA protection and
> the subsequent restoration of the old huge PMDs/PTEs in the primary
> MMU.
>
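A minimal sketch of that skip, assuming it sits in the prot_numa path of
change_pte_range() and reuses the maybe-dma-pinned heuristic (the patch
may place or spell it differently):

    /* In change_pte_range(), under prot_numa: */
    struct folio *folio = vm_normal_folio(vma, addr, oldpte);

    /*
     * Long-term pinned pages cannot be migrated anyway, so don't
     * make them PROT_NONE in the first place.
     */
    if (folio && folio_maybe_dma_pinned(folio))
            continue;
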
> Patch 3 introduces a new mmu notifier callback, .numa_protect(), which
> is called in patch 4 once a page is guaranteed to be PROT_NONE
> protected.
>
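Presumably that means a new member in mmu_notifier_ops; the signature
below is assumed for illustration:

    /*
     * Sketch: notify subscribers that [start, end) in mm has actually
     * been made PROT_NONE for NUMA balancing.
     */
    void (*numa_protect)(struct mmu_notifier *subscription,
                         struct mm_struct *mm,
                         unsigned long start,
                         unsigned long end);
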
> Then in patch 5, KVM can recognize that an .invalidate_range_start()
> notification is specific to NUMA balancing and defer the unmap in the
> secondary MMU until .numa_protect() arrives.
>
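So on the KVM side the flow would roughly be: leave the range mapped in
.invalidate_range_start() when range_is_numa_protect() (see above) says
so, and only zap once .numa_protect() fires. A sketch of that deferred
zap, reusing KVM's existing hva-range helpers (wiring assumed):

    static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end)
    {
            struct kvm *kvm = mmu_notifier_to_kvm(mn);

            /* Only now unmap [start, end) from the secondary MMU. */
            kvm_handle_hva_range(mn, start, end, __pte(0),
                                 kvm_unmap_gfn_range);
    }
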
Why do we need all that, when we should simply not be applying PROT_NONE
to pinned pages?
In change_pte_range() we already have:

    if (is_cow_mapping(vma->vm_flags) &&
        page_count(page) != 1)
            continue;

which catches both shared and pinned pages.
Staring at patch #2, are we still missing something similar for THPs?
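For THPs that would presumably be the prot_numa path in
change_huge_pmd(); a sketch, assuming the same heuristic carries over:

    /* Also skip shared/pinned copy-on-write huge pages */
    if (is_cow_mapping(vma->vm_flags) &&
        page_count(pmd_page(*pmd)) != 1)
            goto unlock;
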
Why are that MMU notifier thingy and the changes to KVM code required?
--
Cheers,
David / dhildenb