Message-ID: <24528be7-8f7a-4928-8bca-5869cf14eace@amazon.com>
Date: Fri, 14 Mar 2025 17:12:35 +0000
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Peter Xu <peterx@...hat.com>
CC: James Houghton <jthoughton@...gle.com>, <akpm@...ux-foundation.org>,
	<pbonzini@...hat.com>, <shuah@...nel.org>, <kvm@...r.kernel.org>,
	<linux-kselftest@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-mm@...ck.org>, <lorenzo.stoakes@...cle.com>, <david@...hat.com>,
	<ryan.roberts@....com>, <quic_eberman@...cinc.com>, <graf@...zon.de>,
	<jgowans@...zon.com>, <roypat@...zon.co.uk>, <derekmn@...zon.com>,
	<nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing



On 13/03/2025 22:38, Peter Xu wrote:
> On Thu, Mar 13, 2025 at 10:13:23PM +0000, Nikita Kalyazin wrote:
>> Yes, that's right, mmap() + memcpy() is functionally sufficient. write() is
>> an optimisation.  Most of the pages in guest_memfd are only ever accessed by
>> the vCPU (not userspace) via TDP (stage-2 pagetables) so they don't need
>> userspace pagetables set up.  By using write() we can avoid VMA faults,
>> installing corresponding PTEs and double page initialisation we discussed
>> earlier.  The optimised path only contains pagecache population via write().
>> Even TDP faults can be avoided if using KVM prefaulting API [1].
>>
>> [1] https://docs.kernel.org/virt/kvm/api.html#kvm-pre-fault-memory
> 
> Could you elaborate why VMA faults matters in perf?

Based on my experiments, I can populate 3GiB of guest_memfd with write() 
in 980 ms, while memcpy takes 2140 ms.  When I was profiling it, I saw 
~63% of memcpy time spent in the exception handler, which made me think 
VMA faults mattered.

> If we're talking about postcopy-like migrations on top of KVM guest-memfd,
> IIUC the VMAs can be pre-faulted too just like the TDP pgtables, e.g. with
> MADV_POPULATE_WRITE.

Yes, I was thinking about MADV_POPULATE_WRITE as well, but AFAIK it 
isn't available for guest_memfd, at least with the direct map removed, 
because [1] is updated in [2] as follows:

diff --git a/mm/gup.c b/mm/gup.c
index 3883b307780e..7ddaf93c5b6a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1283,7 +1283,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
  		return -EOPNOTSUPP;

-	if (vma_is_secretmem(vma))
+	if (vma_is_secretmem(vma) || vma_is_no_direct_map(vma))
  		return -EFAULT;

  	if (write) {

[1] https://elixir.bootlin.com/linux/v6.13.6/source/mm/gup.c#L1286
[2] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/#m05b5c6366be27c98a86baece52b2f408c455e962

> Normally, AFAIU userapp optimizes IOs the other way round.. to change
> write()s into mmap()s, which at least avoids one round of copy.
> 
> For postcopy using minor traps (and since guest-memfd is always shared and
> non-private..), it's also possible to feed the mmap()ed VAs to NIC as
> buffers (e.g. in recvmsg(), for example, as part of iovec[]), and as long
> as the mmap()ed ranges are not registered by KVM memslots, there's no
> concern on non-atomic copy.

Yes, I see what you mean.  It may be faster depending on the setup, if 
it's possible to remove one copy.

Anyway, it looks like the solution we discussed lets userspace choose 
between a memcpy-only and a combined memcpy/write implementation.  I'm 
going to work on the next version of the series, which will include the 
MINOR trap and avoid the KVM dependency in mm by calling 
vm_ops->fault() in UFFDIO_CONTINUE.

> Thanks,
> 
> --
> Peter Xu
> 

