lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c2c92eb-1699-4256-ab67-7c4dcee4e547@amd.com>
Date: Thu, 16 Oct 2025 18:28:26 +0530
From: "Garg, Shivank" <shivankg@....com>
To: Sean Christopherson <seanjc@...gle.com>, Gregory Price <gourry@...rry.net>
Cc: jgowans@...zon.com, mhocko@...e.com, jack@...e.cz, kvm@...r.kernel.org,
 david@...hat.com, linux-btrfs@...r.kernel.org, aik@....com,
 papaluri@....com, kalyazin@...zon.com, peterx@...hat.com,
 linux-mm@...ck.org, clm@...com, ddutile@...hat.com,
 linux-kselftest@...r.kernel.org, shdhiman@....com, gshan@...hat.com,
 ying.huang@...ux.alibaba.com, shuah@...nel.org, roypat@...zon.co.uk,
 matthew.brost@...el.com, linux-coco@...ts.linux.dev, zbestahu@...il.com,
 lorenzo.stoakes@...cle.com, linux-bcachefs@...r.kernel.org,
 ira.weiny@...el.com, dhavale@...gle.com, jmorris@...ei.org,
 willy@...radead.org, hch@...radead.org, chao.gao@...el.com,
 tabba@...gle.com, ziy@...dia.com, rientjes@...gle.com, yuzhao@...gle.com,
 xiang@...nel.org, nikunj@....com, serge@...lyn.com, amit@...radead.org,
 thomas.lendacky@....com, ashish.kalra@....com, chao.p.peng@...el.com,
 yan.y.zhao@...el.com, byungchul@...com, michael.day@....com,
 Neeraj.Upadhyay@....com, michael.roth@....com, bfoster@...hat.com,
 bharata@....com, josef@...icpanda.com, Liam.Howlett@...cle.com,
 ackerleytng@...gle.com, dsterba@...e.com, viro@...iv.linux.org.uk,
 jefflexu@...ux.alibaba.com, jaegeuk@...nel.org, dan.j.williams@...el.com,
 surenb@...gle.com, vbabka@...e.cz, paul@...l-moore.com,
 joshua.hahnjy@...il.com, apopple@...dia.com, brauner@...nel.org,
 quic_eberman@...cinc.com, rakie.kim@...com, cgzones@...glemail.com,
 pvorel@...e.cz, linux-erofs@...ts.ozlabs.org, kent.overstreet@...ux.dev,
 linux-kernel@...r.kernel.org, pankaj.gupta@....com,
 linux-security-module@...r.kernel.org, lihongbo22@...wei.com,
 linux-fsdevel@...r.kernel.org, pbonzini@...hat.com,
 akpm@...ux-foundation.org, vannapurve@...gle.com, suzuki.poulose@....com,
 rppt@...nel.org, jgg@...dia.com, linux-f2fs-devel@...ts.sourceforge.net
Subject: Re: [f2fs-dev] [PATCH kvm-next V11 6/7] KVM: guest_memfd: Enforce
 NUMA mempolicy using shared policy



On 10/16/2025 4:18 AM, Sean Christopherson wrote:
> On Wed, Oct 15, 2025, Gregory Price wrote:
>> On Fri, Sep 26, 2025 at 12:36:27PM -0700, Sean Christopherson via Linux-f2fs-devel wrote:
>>>>
>>>> static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>>>> 					     unsigned long addr, pgoff_t *pgoff)
>>>> {
>>>> 	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>>>>
>>>> 	return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
>>>
>>> Argh!!!!!  This breaks the selftest because do_get_mempolicy() very specifically
>>> falls back to the default_policy, NOT to the current task's policy.  That is
>>> *exactly* the type of subtle detail that needs to be commented, because there's
>>> no way some random KVM developer is going to know that returning NULL here is
>>> important with respect to get_mempolicy() ABI.
>>>
>>
>> Do_get_mempolicy was designed to be accessed by the syscall, not as an
>> in-kernel ABI.
> 
> Ya, by "get_mempolicy() ABI" I meant the uABI for the get_mempolicy syscall.
> 
>> get_task_policy also returns the default policy if there's nothing
>> there, because that's what applies.
>>
>> I have dangerous questions:
> 
> Not dangerous at all, I find them very helpful!
> 
>> why is __kvm_gmem_get_policy using
>> 	mpol_shared_policy_lookup()
>> instead of
>> 	get_vma_policy()
> 
> With the disclaimer that I haven't followed the gory details of this series super
> closely, my understanding is...
> 
> Because the VMA is a means to an end, and we want the policy to persist even if
> the VMA goes away.
> 
> With guest_memfd, KVM effectively inverts the standard MMU model.  Instead of mm/
> being the primary MMU and KVM being a secondary MMU, guest_memfd is the primary
> MMU and any VMAs are secondary (mostly; it's probably more like 1a and 1b).  This
> allows KVM to map guest_memfd memory into a guest without a VMA, or with more
> permissions than are granted to host userspace, e.g. guest_memfd memory could be
> writable by the guest, but read-only for userspace.
> 
> But we still want to support things like mbind() so that userspace can ensure
> guest_memfd allocations align with the vNUMA topology presented to the guest,
> or are bound to the NUMA node where the VM will run.  We considered adding equivalent
> file-based syscalls, e.g. fbind(), but IIRC the consensus was that doing so was
> unnecessary (and potentially messy?) since we were planning on eventually adding
> mmap() support to guest_memfd anyways.
> 
>> get_vma_policy does this all for you
> 
> I assume that doesn't work if the intent is for new VMAs to pick up the existing
> policy from guest_memfd?  And more importantly, guest_memfd needs to hook
> ->set_policy so that changes through e.g. mbind() persist beyond the lifetime of
> the VMA.
> 
Additionally, the shared_policy based design enables range-based policies via its RB-tree
implementation. IIUC, this will not work with VMA-specific policy design.

>> struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>                                  unsigned long addr, int order, pgoff_t *ilx)
>> {
>>         struct mempolicy *pol;
>>
>>         pol = __get_vma_policy(vma, addr, ilx);
>>         if (!pol)
>>                 pol = get_task_policy(current);
>>         if (pol->mode == MPOL_INTERLEAVE ||
>>             pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>>                 *ilx += vma->vm_pgoff >> order;
>>                 *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>>         }
>>         return pol;
>> }
>>
>> Of course you still have the same issue: get_task_policy will return the
>> default, because that's what applies.
>>
>> do_get_mempolicy just seems like the completely incorrect interface to
>> be using here.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ