Message-ID: <9286da7a-9923-4a3b-a769-590e8824fa10@redhat.com>
Date: Wed, 20 Nov 2024 17:20:09 +0100
From: David Hildenbrand <david@...hat.com>
To: kalyazin@...zon.com, pbonzini@...hat.com, corbet@....net,
kvm@...r.kernel.org, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: jthoughton@...gle.com, brijesh.singh@....com, michael.roth@....com,
graf@...zon.de, jgowans@...zon.com, roypat@...zon.co.uk, derekmn@...zon.com,
nsaenz@...zon.es, xmarcalx@...zon.com,
Sean Christopherson <seanjc@...gle.com>, linux-mm@...ck.org
Subject: Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
On 20.11.24 16:58, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 15:13, David Hildenbrand wrote:
> > Hi!
>
> Hi! :)
>
> >> Results:
> >> - MAP_PRIVATE: 968 ms
> >> - MAP_SHARED: 1646 ms
> >
> > At least here it is expected to some degree: as soon as the page cache
> > is involved, map/unmap gets slower, because we are effectively
> > maintaining two data structures (page tables + page cache) instead of
> > only a single one (page tables).
> >
> > Can you make sure that THP/large folios don't interfere in your
> > experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using the transparent_hugepage=never command line argument in my testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?
Yes!
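
If you want to be extra careful, you can also opt the test mapping itself
out of THP, independently of the global setting. Something like this
sketch (made-up size, not your actual code):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;	/* made-up size */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Opt this mapping out of THP, independent of the global setting. */
	if (madvise(p, len, MADV_NOHUGEPAGE))
		perror("madvise(MADV_NOHUGEPAGE)");
	munmap(p, len);
	return 0;
}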
>
> >> While this logic is intuitive, its performance effect is more
> >> significant than I would expect.
> >
> > Yes. How much of the performance difference would remain if you hack out
> > the atomic op just to play with it? I suspect there will still be some
> > difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:
That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(
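
Just to double-check that we're measuring the same thing: I'm assuming the
fault-in part of your reproducer is roughly equivalent to the following
(a minimal sketch with made-up sizes, not your actual code), with only the
mmap flags/fd changing between the MAP_PRIVATE and MAP_SHARED runs:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SZ	(1UL << 30)	/* 1 GiB, made-up size */

int main(void)
{
	/* Switch between MAP_PRIVATE|MAP_ANONYMOUS and MAP_SHARED on a
	 * memfd-like fd to compare the two cases. */
	unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	size_t off;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < SZ; off += 4096)
		p[off] = 1;		/* fault in every page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%ld ms\n",
	       (long)((t1.tv_sec - t0.tv_sec) * 1000 +
		      (t1.tv_nsec - t0.tv_nsec) / 1000000));
	munmap(p, SZ);
	return 0;
}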
>
> Orig:
> - 69.62% do_fault
>    + 44.61% __do_fault
>    + 20.26% filemap_map_pages
>    + 3.48% finish_fault
> Hacked:
> - 67.39% do_fault
>    + 32.45% __do_fault
>    + 21.87% filemap_map_pages
>    + 11.97% finish_fault
>
> Orig:
> - 3.48% finish_fault
>    - 1.28% set_pte_range
>         0.96% folio_add_file_rmap_ptes
>    - 0.91% __pte_offset_map_lock
>         0.54% _raw_spin_lock
> Hacked:
> - 11.97% finish_fault
>    - 8.59% __pte_offset_map_lock
>       - 6.27% _raw_spin_lock
>            preempt_count_add
>         1.00% __pte_offset_map
>    - 1.28% set_pte_range
>       - folio_add_file_rmap_ptes
>            __mod_node_page_state
>
> > Note that we might improve allocation times with guest_memfd when
> > allocating larger folios.
>
> I suppose it may not always be an option, depending on the requirements
> on the consistency of the allocation latency. E.g., if a large folio
> isn't available at the time, the performance would degrade to the base
> case (please correct me if I'm missing something).
Yes, there are cons to that.
>
>> Heh, now I spot that your comment was a reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
> >> If your ioctl is supposed to do more than "allocating memory" like
> >> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're moving into the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls, and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.
People are working on it, and we're figuring out some remaining details
(e.g., page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.
>
>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern
> would probably still be the latency of the population stage, as I
> can't see why it would improve compared to what we have now; my
> feeling is that this is linked to the sharedness property of guest_memfd.
If the problem is the "pagecache" overhead, then yes, it will be a
harder nut to crack. But maybe there are some low-hanging fruit to
optimize? Finding the main cause for the added overhead would be
interesting.
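
To spell out what I had in mind, roughly (only a sketch: the mmap part
assumes the WIP shared guest_memfd support in whatever shape it lands,
and I'm reusing the existing memory-attributes UAPI for the conversion;
the actual interface might end up looking different):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Sketch: populate guest_memfd memory while it is shared, then flip the
 * range to private. vm_fd/guest_memfd/gpa/payload/size are placeholders.
 */
static int populate_then_convert(int vm_fd, int guest_memfd, __u64 gpa,
				 const void *payload, size_t size)
{
	/* Assumes mmap of shared guest_memfd memory (WIP, not upstream). */
	void *dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 guest_memfd, 0);
	struct kvm_memory_attributes attrs = {
		.address	= gpa,
		.size		= size,
		.attributes	= KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, payload, size);	/* "allocate + copy" via the VMA */
	munmap(dst, size);

	/* Convert the populated range to private. */
	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}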
--
Cheers,
David / dhildenb