lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <69dc324f-99fb-44ec-8501-086fe7af9d0d@amazon.com>
Date: Thu, 13 Mar 2025 15:25:16 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Peter Xu <peterx@...hat.com>
CC: James Houghton <jthoughton@...gle.com>, <akpm@...ux-foundation.org>,
	<pbonzini@...hat.com>, <shuah@...nel.org>, <kvm@...r.kernel.org>,
	<linux-kselftest@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-mm@...ck.org>, <lorenzo.stoakes@...cle.com>, <david@...hat.com>,
	<ryan.roberts@....com>, <quic_eberman@...cinc.com>, <graf@...zon.de>,
	<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
	<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing



On 12/03/2025 19:32, Peter Xu wrote:
> On Wed, Mar 12, 2025 at 05:07:25PM +0000, Nikita Kalyazin wrote:
>> However if MISSING is not registered, the kernel will auto-populate with a
>> clear page, ie there is no way to inject custom content from userspace.  To
>> explain my use case a bit more, the population thread will be trying to copy
>> all guest memory proactively, but there will inevitably be cases where a
>> page is accessed through pgtables _before_ it gets populated.  It is not
>> desirable for such access to result in a clear page provided by the kernel.
> 
> IMHO populating with a zero page in the page cache is fine. It needs to
> make sure all accesses will go via the pgtable, as discussed below in my
> previous email [1], then nobody will be able to see the zero page, not
> until someone updates the content then follow up with a CONTINUE to install
> the pgtable entry.
> 
> If there is any way that the page can be accessed without the pgtable
> installation, minor faults won't work indeed.

I think I see what you mean now.  I agree, it isn't the end of the world 
if the kernel clears the page and then userspace overwrites it.

The way I see it is:

@@ -400,20 +401,26 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
         if (WARN_ON_ONCE(folio_test_large(folio))) {
                 ret = VM_FAULT_SIGBUS;
                 goto out_folio;
         }

         if (!folio_test_uptodate(folio)) {
                 clear_highpage(folio_page(folio, 0));
                 kvm_gmem_mark_prepared(folio);
         }

+       if (userfaultfd_minor(vmf->vma)) {
+               folio_unlock(folio);
+               filemap_invalidate_unlock_shared(inode->i_mapping);
+               return handle_userfault(vmf, VM_UFFD_MISSING);
+       }
+
         vmf->page = folio_file_page(folio, vmf->pgoff);

  out_folio:
         if (ret != VM_FAULT_LOCKED) {
                 folio_unlock(folio);
                 folio_put(folio);
         }

On the first fault (cache miss), the kernel will allocate/add/clear the 
page (as there is no MISSING trap now), and once the page is in the 
cache, a MINOR event will be sent for userspace to copy its content. 
Please let me know if this is an acceptable semantics.

Since userspace is getting notified after KVM calls 
kvm_gmem_mark_prepared(), which removes the page from the direct map 
[1], userspace can't use write() to populate the content because write() 
relies on direct map [2].  However userspace can do a plain memcpy that 
would use user pagetables instead.  This forces userspace to respond to 
stage-2 and VMA faults in guest_memfd differently, via write() and 
memcpy respectively.  It doesn't seem like a significant problem though.

I believe, with this approach the original race condition is gone 
because UFFD messages are only sent on cache hit and it is up to 
userspace to serialise writes.  Please correct me if I'm wrong here.

[1] 
https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#mdf41fe2dc33332e9c500febd47e14ae91ad99724
[2] 
https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/#mf5d794aa31d753cbc73e193628f31e418051983d

>>
>>> as long as the content can only be accessed from the pgtable (either via
>>> mmap() or GUP on top of it), then afaiu it could work similarly like
>>> MISSING faults, because anything trying to access it will be trapped.
> 
> [1]
> 
> --
> Peter Xu
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ