linux-kernel - Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

Message-ID: <846e9117-1f79-a5e0-1b14-3dba91ab8033@redhat.com>
Date:   Fri, 11 Aug 2023 20:39:46 +0200
From:   David Hildenbrand <david@...hat.com>
To:     John Hubbard <jhubbard@...dia.com>, Yan Zhao <yan.y.zhao@...el.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, pbonzini@...hat.com, seanjc@...gle.com,
        mike.kravetz@...cle.com, apopple@...dia.com, jgg@...dia.com,
        rppt@...nel.org, akpm@...ux-foundation.org, kevin.tian@...el.com,
        Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a
 VM

>> Ah, okay I see, thanks. That's indeed unfortunate.
> 
> Sigh. All this difficulty reminds me that this mechanism was created in
> the early days of NUMA. I wonder sometimes lately whether the cost, in
> complexity and CPU time, is still worth it on today's hardware.
> 
> But of course I am deeply biased, so don't take that too seriously.
> See below. :)

:)

>>
>>>
>>> Then current KVM will unmap all notified pages from the secondary MMU
>>> in .invalidate_range_start(), which could include pages that are
>>> ultimately not set to PROT_NONE in the primary MMU.
>>>
>>> For VMs with pass-through devices, though all guest pages are pinned,
>>> KVM still periodically unmaps pages in response to the
>>> .invalidate_range_start() notification from auto NUMA balancing, which
>>> is a waste.
>>
>> Should we want to disable NUMA hinting for such VMAs instead (for example,
>> by QEMU/hypervisor) that knows that any NUMA hinting activity on these
>> ranges would be a complete waste of time? I recall that John H. once
>> mentioned that there are similar issues with GPU memory: NUMA hinting is
>> actually counter-productive and they end up disabling it.
>>
> 
> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> accelerators that map memory...and VMs as well, it seems. Basically,
> anything that has its own processors and page tables needs to be left
> strictly alone by NUMA balancing. Because the kernel is (still, even
> today) unaware of what those processors are doing, and so it has no way
> to do productive NUMA balancing.

Is there any existing way we could handle that better on a per-VMA 
level, or on the process level? Any magic toggles?

MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might 
be better, but with things like io_uring it would eventually still be 
too restrictive.

I recall that setting a mempolicy could prevent auto-numa from getting 
active, but that might be undesired.
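
For illustration only, here is a minimal userspace sketch of the mempolicy
route, assuming the range is given an explicit policy with mbind(2). The
helper names single_node_mask() and bind_range_to_node() are hypothetical and
not from this thread or any patch in it. The raw syscall is used so that
nothing beyond the kernel uapi headers is needed. Whether an explicit VMA
policy really keeps NUMA hinting away from the range should be checked against
the kernel in use.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h> /* MPOL_BIND */

/* Hypothetical helper: one-word nodemask with only `node` set. */
static unsigned long single_node_mask(unsigned int node)
{
    return 1UL << node;
}

/*
 * Hypothetical helper: give [addr, addr + len) an explicit MPOL_BIND policy
 * for a single node. addr must be page aligned.
 * Returns 0 on success, -1 with errno set on failure.
 */
static long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
    unsigned long mask;
    unsigned long maxnode = 8 * sizeof(mask); /* bits available in mask */

    /* Keep the sketch to node IDs that fit safely in one word. */
    if (node >= maxnode - 1) {
        errno = EINVAL;
        return -1;
    }

    mask = single_node_mask(node);
    return syscall(SYS_mbind, addr, len, MPOL_BIND, &mask, maxnode, 0UL);
}
```

A hypervisor could call something like bind_range_to_node(guest_ram, size, 0)
on guest memory it knows to be pinned. The drawback is the one noted above:
the policy also dictates where the memory is placed, which may not be what
the user wants.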

CCing Mel.

-- 
Cheers,

David / dhildenb