Message-ID: <9286da7a-9923-4a3b-a769-590e8824fa10@redhat.com>
Date: Wed, 20 Nov 2024 17:20:09 +0100
From: David Hildenbrand <david@...hat.com>
To: kalyazin@...zon.com, pbonzini@...hat.com, corbet@....net,
kvm@...r.kernel.org, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: jthoughton@...gle.com, brijesh.singh@....com, michael.roth@....com,
graf@...zon.de, jgowans@...zon.com, roypat@...zon.co.uk, derekmn@...zon.com,
nsaenz@...zon.es, xmarcalx@...zon.com,
Sean Christopherson <seanjc@...gle.com>, linux-mm@...ck.org
Subject: Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
On 20.11.24 16:58, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 15:13, David Hildenbrand wrote:
> > Hi!
>
> Hi! :)
>
> >> Results:
> >> - MAP_PRIVATE: 968 ms
> >> - MAP_SHARED: 1646 ms
> >
> > At least here it is expected to some degree: as soon as the page cache
> > is involved, map/unmap gets slower, because we are effectively
> > maintaining two data structures (page tables + page cache) instead of
> > only a single one (page tables).
> >
> > Can you make sure that THP/large folios don't interfere in your
> > experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using the transparent_hugepage=never command line argument in my testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?
Yes!
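
If you want to be extra careful, you can also opt the test mapping itself
out of THP, independently of the global setting. Something like this
sketch (made-up size, not your actual code):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;	/* made-up size */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Opt this mapping out of THP, independent of the global setting. */
	if (madvise(p, len, MADV_NOHUGEPAGE))
		perror("madvise(MADV_NOHUGEPAGE)");
	munmap(p, len);
	return 0;
}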
>
> >> While this logic is intuitive, its performance effect is more
> >> significant than I would expect.
> >
> > Yes. How much of the performance difference would remain if you hack out
> > the atomic op just to play with it? I suspect there will still be some
> > difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:
That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(
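
Just to double-check that we're measuring the same thing: I'm assuming the
fault-in part of your reproducer is roughly equivalent to the following
(a minimal sketch with made-up sizes, not your actual code), with only the
mmap flags/fd changing between the MAP_PRIVATE and MAP_SHARED runs:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SZ	(1UL << 30)	/* 1 GiB, made-up size */

int main(void)
{
	/* Switch between MAP_PRIVATE|MAP_ANONYMOUS and MAP_SHARED on a
	 * memfd-like fd to compare the two cases. */
	unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	size_t off;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < SZ; off += 4096)
		p[off] = 1;		/* fault in every page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%ld ms\n",
	       (long)((t1.tv_sec - t0.tv_sec) * 1000 +
		      (t1.tv_nsec - t0.tv_nsec) / 1000000));
	munmap(p, SZ);
	return 0;
}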
>
> Orig:
> - 69.62% do_fault
>    + 44.61% __do_fault
>    + 20.26% filemap_map_pages
>    + 3.48% finish_fault
> Hacked:
> - 67.39% do_fault
>    + 32.45% __do_fault
>    + 21.87% filemap_map_pages
>    + 11.97% finish_fault
>
> Orig:
> - 3.48% finish_fault
>    - 1.28% set_pte_range
>         0.96% folio_add_file_rmap_ptes
>    - 0.91% __pte_offset_map_lock
>         0.54% _raw_spin_lock
> Hacked:
> - 11.97% finish_fault
>    - 8.59% __pte_offset_map_lock
>       - 6.27% _raw_spin_lock
>            preempt_count_add
>         1.00% __pte_offset_map
>    - 1.28% set_pte_range
>       - folio_add_file_rmap_ptes
>            __mod_node_page_state
>
> > Note that we might improve allocation times with guest_memfd when
> > allocating larger folios.
>
> I suppose it may not always be an option, depending on the requirements
> on the consistency of the allocation latency. E.g., if a large folio
> isn't available at the time, the performance would degrade to the base
> case (please correct me if I'm missing something).
Yes, there are cons to that.
>
>> Heh, now I spot that your comment was a reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
> >> If your ioctl is supposed to do more than "allocating memory" like
> >> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're moving into the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls, and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.
People are working on it, and we're figuring out some remaining details
(e.g., page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.
>
>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern
> would probably still be the latency of the population stage, as I
> can't see why it would improve compared to what we have now; my
> feeling is that this is linked to the sharedness property of guest_memfd.
If the problem is the "pagecache" overhead, then yes, it will be a
harder nut to crack. But maybe there are some low-hanging fruit to
optimize? Finding the main cause for the added overhead would be
interesting.
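
To spell out what I had in mind, roughly (only a sketch: the mmap part
assumes the WIP shared guest_memfd support in whatever shape it lands,
and I'm reusing the existing memory-attributes UAPI for the conversion;
the actual interface might end up looking different):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Sketch: populate guest_memfd memory while it is shared, then flip the
 * range to private. vm_fd/guest_memfd/gpa/payload/size are placeholders.
 */
static int populate_then_convert(int vm_fd, int guest_memfd, __u64 gpa,
				 const void *payload, size_t size)
{
	/* Assumes mmap of shared guest_memfd memory (WIP, not upstream). */
	void *dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 guest_memfd, 0);
	struct kvm_memory_attributes attrs = {
		.address	= gpa,
		.size		= size,
		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, payload, size);	/* "allocate + copy" via the VMA */
	munmap(dst, size);

	/* Convert the populated range to private. */
	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}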
--
Cheers,
David / dhildenb