[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aPAkxp67-R9aQ8oN@google.com>
Date: Wed, 15 Oct 2025 15:48:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Gregory Price <gourry@...rry.net>
Cc: Shivank Garg <shivankg@....com>, jgowans@...zon.com, mhocko@...e.com, jack@...e.cz,
kvm@...r.kernel.org, david@...hat.com, linux-btrfs@...r.kernel.org,
aik@....com, papaluri@....com, kalyazin@...zon.com, peterx@...hat.com,
linux-mm@...ck.org, clm@...com, ddutile@...hat.com,
linux-kselftest@...r.kernel.org, shdhiman@....com, gshan@...hat.com,
ying.huang@...ux.alibaba.com, shuah@...nel.org, roypat@...zon.co.uk,
matthew.brost@...el.com, linux-coco@...ts.linux.dev, zbestahu@...il.com,
lorenzo.stoakes@...cle.com, linux-bcachefs@...r.kernel.org,
ira.weiny@...el.com, dhavale@...gle.com, jmorris@...ei.org,
willy@...radead.org, hch@...radead.org, chao.gao@...el.com, tabba@...gle.com,
ziy@...dia.com, rientjes@...gle.com, yuzhao@...gle.com, xiang@...nel.org,
nikunj@....com, serge@...lyn.com, amit@...radead.org, thomas.lendacky@....com,
ashish.kalra@....com, chao.p.peng@...el.com, yan.y.zhao@...el.com,
byungchul@...com, michael.day@....com, Neeraj.Upadhyay@....com,
michael.roth@....com, bfoster@...hat.com, bharata@....com,
josef@...icpanda.com, Liam.Howlett@...cle.com, ackerleytng@...gle.com,
dsterba@...e.com, viro@...iv.linux.org.uk, jefflexu@...ux.alibaba.com,
jaegeuk@...nel.org, dan.j.williams@...el.com, surenb@...gle.com,
vbabka@...e.cz, paul@...l-moore.com, joshua.hahnjy@...il.com,
apopple@...dia.com, brauner@...nel.org, quic_eberman@...cinc.com,
rakie.kim@...com, cgzones@...glemail.com, pvorel@...e.cz,
linux-erofs@...ts.ozlabs.org, kent.overstreet@...ux.dev,
linux-kernel@...r.kernel.org, linux-f2fs-devel@...ts.sourceforge.net,
pankaj.gupta@....com, linux-security-module@...r.kernel.org,
lihongbo22@...wei.com, linux-fsdevel@...r.kernel.org, pbonzini@...hat.com,
akpm@...ux-foundation.org, vannapurve@...gle.com, suzuki.poulose@....com,
rppt@...nel.org, jgg@...dia.com
Subject: Re: [f2fs-dev] [PATCH kvm-next V11 6/7] KVM: guest_memfd: Enforce
NUMA mempolicy using shared policy
On Wed, Oct 15, 2025, Gregory Price wrote:
> On Fri, Sep 26, 2025 at 12:36:27PM -0700, Sean Christopherson via Linux-f2fs-devel wrote:
> > >
> > > static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> > > unsigned long addr, pgoff_t *pgoff)
> > > {
> > > *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > >
> > > return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
> >
> > Argh!!!!! This breaks the selftest because do_get_mempolicy() very specifically
> > falls back to the default_policy, NOT to the current task's policy. That is
> > *exactly* the type of subtle detail that needs to be commented, because there's
> > no way some random KVM developer is going to know that returning NULL here is
> > important with respect to get_mempolicy() ABI.
> >
>
> Do_get_mempolicy was designed to be accessed by the syscall, not as an
> in-kernel ABI.
Ya, by "get_mempolicy() ABI" I meant the uABI for the get_mempolicy syscall.
> get_task_policy also returns the default policy if there's nothing
> there, because that's what applies.
>
> I have dangerous questions:
Not dangerous at all, I find them very helpful!
> why is __kvm_gmem_get_policy using
> mpol_shared_policy_lookup()
> instead of
> get_vma_policy()
With the disclaimer that I haven't followed the gory details of this series super
closely, my understanding is...
Because the VMA is a means to an end, and we want the policy to persist even if
the VMA goes away.
With guest_memfd, KVM effectively inverts the standard MMU model. Instead of mm/
being the primary MMU and KVM being a secondary MMU, guest_memfd is the primary
MMU and any VMAs are secondary (mostly; it's probably more like 1a and 1b). This
allows KVM to map guest_memfd memory into a guest without a VMA, or with more
permissions than are granted to host userspace, e.g. guest_memfd memory could be
writable by the guest, but read-only for userspace.
But we still want to support things like mbind() so that userspace can ensure
guest_memfd allocations align with the vNUMA topology presented to the guest,
or are bound to the NUMA node where the VM will run. We considered adding equivalent
file-based syscalls, e.g. fbind(), but IIRC the consensus was that doing so was
unnecessary (and potentially messy?) since we were planning on eventually adding
mmap() support to guest_memfd anyways.
> get_vma_policy does this all for you
I assume that doesn't work if the intent is for new VMAs to pick up the existing
policy from guest_memfd? And more importantly, guest_memfd needs to hook
->set_policy so that changes through e.g. mbind() persist beyond the lifetime of
the VMA.
> struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr, int order, pgoff_t *ilx)
> {
> struct mempolicy *pol;
>
> pol = __get_vma_policy(vma, addr, ilx);
> if (!pol)
> pol = get_task_policy(current);
> if (pol->mode == MPOL_INTERLEAVE ||
> pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> *ilx += vma->vm_pgoff >> order;
> *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> }
> return pol;
> }
>
> Of course you still have the same issue: get_task_policy will return the
> default, because that's what applies.
>
> do_get_mempolicy just seems like the completely incorrect interface to
> be using here.
Powered by blists - more mailing lists