Message-ID: <d0ad2642-6d72-489e-91af-a7cb15e75a8a@nvidia.com>
Date:   Fri, 11 Aug 2023 12:35:27 -0700
From:   John Hubbard <jhubbard@...dia.com>
To:     David Hildenbrand <david@...hat.com>,
        Yan Zhao <yan.y.zhao@...el.com>
CC:     <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        <kvm@...r.kernel.org>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
        <mike.kravetz@...cle.com>, <apopple@...dia.com>, <jgg@...dia.com>,
        <rppt@...nel.org>, <akpm@...ux-foundation.org>,
        <kevin.tian@...el.com>, "Mel Gorman" <mgorman@...hsingularity.net>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a
 VM

On 8/11/23 11:39, David Hildenbrand wrote:
...
>>> Should we want to disable NUMA hinting for such VMAs instead (for
>>> example, from QEMU/the hypervisor, which knows that any NUMA hinting
>>> activity on these ranges would be a complete waste of time)? I recall
>>> that John H. once mentioned that there are similar issues with GPU
>>> memory: NUMA hinting is actually counter-productive and they end up
>>> disabling it.
>>>
>>
>> Yes, NUMA balancing is incredibly harmful to performance, for GPUs and
>> accelerators that map memory...and VMs as well, it seems. Basically,
>> anything that has its own processors and page tables needs to be left
>> strictly alone by NUMA balancing. Because the kernel is (still, even
>> today) unaware of what those processors are doing, and so it has no way
>> to do productive NUMA balancing.
> 
> Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> 
> MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might
> be better, but with things like io_uring it would eventually still be
> too restrictive.
> 
> I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> 
> CCing Mel.
> 

Let's distinguish between page pinning situations and HMM-style
situations. Pinning CPU memory is unnecessary when setting it up for
use by modern GPUs or accelerators, because those devices can handle
replayable page faults. So in such cases, the pages are in use by a GPU
or accelerator, but remain unpinned.
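
(For reference, the usual hmm_range_fault() mirroring pattern looks
roughly like the sketch below, following Documentation/mm/hmm.rst.
struct my_bo, my_device_map_pages() and the bo->lock/bo->notifier
fields are made-up driver names, not any real driver's API.)

int my_mirror_range(struct my_bo *bo, struct mm_struct *mm,
		    unsigned long start, unsigned long end,
		    unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= &bo->notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(range.notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* fault pages in; no pin taken */
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	mutex_lock(&bo->lock);
	if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
		/* an invalidation raced with us; start over */
		mutex_unlock(&bo->lock);
		goto again;
	}
	my_device_map_pages(bo, range.hmm_pfns);	/* device PTEs */
	mutex_unlock(&bo->lock);
	return 0;
}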

The performance problem occurs because NUMA balancing unmaps those
pages, which generates callbacks to the device driver, which dutifully
unmaps the pages from the GPU or accelerator, even though the device
may be busy using them. The device promptly takes a page fault, and the
driver re-establishes the device page table mapping, which holds only
until the next round of unmapping from the NUMA balancer.
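
The callback in question is the driver's mmu_interval_notifier
invalidate hook. A rough sketch of what each such invalidation ends up
running (my_bo and my_device_unmap_range() are placeholder names
again):

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_bo *bo = container_of(mni, struct my_bo, notifier);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&bo->lock);
	else if (!mutex_trylock(&bo->lock))
		return false;

	mmu_interval_set_seq(mni, cur_seq);
	/* tear down device PTEs, even if the device is using them */
	my_device_unmap_range(bo, range->start, range->end);
	mutex_unlock(&bo->lock);
	return true;
}

static const struct mmu_interval_notifier_ops my_notifier_ops = {
	.invalidate = my_invalidate,
};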

hmm_range_fault()-based memory management in particular might benefit
from having NUMA balancing disabled entirely for the memremap_pages()
region, come to think of it. That seems relatively easy and clean at
first glance anyway.
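
(For concreteness, such a region is set up along these lines; "res",
my_devmem_ops and bo are placeholders:)

	bo->pgmap.type		= MEMORY_DEVICE_COHERENT; /* or _PRIVATE */
	bo->pgmap.range.start	= res->start;
	bo->pgmap.range.end	= res->end;
	bo->pgmap.nr_range	= 1;
	bo->pgmap.ops		= &my_devmem_ops; /* .page_free, .migrate_to_ram */
	bo->pgmap.owner		= bo;

	addr = memremap_pages(&bo->pgmap, numa_node_id());
	if (IS_ERR(addr))
		return PTR_ERR(addr);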

For other regions (allocated by the device driver), a per-VMA flag
seems about right: VM_NO_NUMA_BALANCING ?
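
Purely as an illustration of how such a flag might be consumed (it does
not exist today, and the surrounding test is only paraphrased from the
VMA-skipping checks that task_numa_work() in kernel/sched/fair.c
already does):

	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
	    is_vm_hugetlb_page(vma) ||
	    (vma->vm_flags & VM_NO_NUMA_BALANCING))	/* hypothetical */
		continue;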

thanks,
-- 
John Hubbard
NVIDIA
