Message-ID: <ZN63m5Dej5MBLTqr@yzhao56-desk.sh.intel.com>
Date: Fri, 18 Aug 2023 08:13:15 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: David Hildenbrand <david@...hat.com>
CC: John Hubbard <jhubbard@...dia.com>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
<pbonzini@...hat.com>, <seanjc@...gle.com>,
<mike.kravetz@...cle.com>, <apopple@...dia.com>, <jgg@...dia.com>,
<rppt@...nel.org>, <akpm@...ux-foundation.org>,
<kevin.tian@...el.com>, Mel Gorman <mgorman@...hsingularity.net>,
<alex.williamson@...hat.com>
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in
a VM
On Thu, Aug 17, 2023 at 09:38:37AM +0200, David Hildenbrand wrote:
> On 17.08.23 07:05, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> > > On 8/16/23 02:49, David Hildenbrand wrote:
> > > > But do 32bit architectures even care about NUMA hinting? If not, just
> > > > ignore them ...
> > >
> > > Probably not!
> > >
> > > ...
> > > > > > So, do you mean the kernel would provide a per-VMA allow/disallow
> > > > > > mechanism, and
> > > > > > it's up to user space to choose between the per-VMA, more complex way
> > > > > > and the global, simpler way?
> > > >
> > > > QEMU could do it either way. The question would be whether a per-VMA
> > > > setting makes sense for NUMA hinting.
> > >
> > > From our experience with compute on GPUs, a per-mm setting would suffice.
> > > No need to go all the way to VMA granularity.
> > >
> > After an offline internal discussion, we think a per-mm setting is also
> > enough for device passthrough in VMs.
> >
> > BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
> > think there is any value in providing a flag like VM_MAYDMA?
> > Auto NUMA balancing or other components can decide how to use it by
> > themselves.
>
> Short-lived DMA is not really the problem. The problem is long-term pinning.
>
> There was a discussion about letting user space similarly hint that
> long-term pinning might/will happen.
>
> Because when long-term pinning a page we have to make sure to migrate it off
> of ZONE_MOVABLE / MIGRATE_CMA.
>
> But the kernel prefers to place pages there.
>
> So with vfio in QEMU, we might preallocate memory for the guest and place it
> on ZONE_MOVABLE/MIGRATE_CMA, just so long-term pinning has to migrate all
> these fresh pages out of these areas again.
>
> So letting the kernel know about that in this context might also help.
>
Thanks! Glad to know it :)
But for the GPU case John mentioned, where the memory is not even pinned,
maybe the VM_NO_NUMA_BALANCING flag is still needed?
For VMs, we would hint VM_NO_NUMA_BALANCING for passthrough devices that
support IO page faults (so no pinning is needed), and VM_MAYLONGTERMDMA to
avoid misplacement and migration.
Does that sound good?
Or do you think just a per-mm flag like MMF_NO_NUMA is good enough for
now?
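
To make the two options concrete, here is a minimal userspace sketch in plain C
(not kernel code). The flag names come from this thread, but their values, the
struct types and both helper functions are hypothetical placeholders; none of
them exists in the kernel today. It also assumes the two per-VMA hints are
independent of each other, which is only one possible reading of the proposal.

```c
#include <stdbool.h>

/*
 * Hypothetical flags: the names are proposals from this thread, the bit
 * values are arbitrary. Nothing here is an existing kernel definition.
 */
#define VM_NO_NUMA_BALANCING	(1UL << 0)	/* per-VMA: skip NUMA hinting */
#define VM_MAYLONGTERMDMA	(1UL << 1)	/* per-VMA: long-term pin may follow */
#define MMF_NO_NUMA		(1UL << 0)	/* per-mm: skip NUMA hinting */

/* Simplified stand-ins for mm_struct and vm_area_struct. */
struct mm_model {
	unsigned long flags;
};

struct vma_model {
	unsigned long flags;
	const struct mm_model *mm;
};

/* Would the NUMA balancing scan skip this VMA? */
static bool numa_hinting_skipped(const struct vma_model *vma)
{
	/* The per-mm flag covers every VMA of the process. */
	if (vma->mm->flags & MMF_NO_NUMA)
		return true;
	/* Otherwise fall back to the per-VMA hint. */
	return (vma->flags & VM_NO_NUMA_BALANCING) != 0;
}

/* Would allocation avoid ZONE_MOVABLE / MIGRATE_CMA for this VMA? */
static bool avoid_movable_placement(const struct vma_model *vma)
{
	return (vma->flags & VM_MAYLONGTERMDMA) != 0;
}
```

With only MMF_NO_NUMA, numa_hinting_skipped() is true for every VMA of the
process, but the placement question David raised has nothing to key on; the
per-VMA variant can answer both questions separately.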