Message-ID: <20240712232937.2861788-1-ackerleytng@google.com>
Date: Fri, 12 Jul 2024 23:29:37 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: quic_eberman@...cinc.com
Cc: akpm@...ux-foundation.org, david@...hat.com, kvm@...r.kernel.org,
linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, linux-mm@...ck.org, maz@...nel.org,
pbonzini@...hat.com, shuah@...nel.org, tabba@...gle.com, willy@...radead.org,
vannapurve@...gle.com, hch@...radead.org, jgg@...dia.com, rientjes@...gle.com,
seanjc@...gle.com, jhubbard@...dia.com, qperret@...gle.com,
smostafa@...gle.com, fvdl@...gle.com, hughd@...gle.com
Subject: Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning
Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am
PDT:
The current direction is:
+ Allow mmap() of ranges that cover both shared and private memory, but disallow
  faulting in of private pages
  + On access to private pages, userspace will get some error, perhaps SIGBUS
  + On shared to private conversions, unmap the page and decrease refcounts
+ To support huge pages, guest_memfd will take ownership of the hugepages, and
  provide interested parties (userspace, KVM, iommu) with pages to be used.
  + guest_memfd will track usage of (sub)pages, for both private and shared
    memory
  + Pages will be broken into smaller (probably 4K) chunks at creation time to
    simplify implementation (as opposed to splitting at runtime when private to
    shared conversion is requested by the guest)
    + Core MM infrastructure will still be used to track page table mappings in
      mapcounts and other references (refcounts) per subpage
    + HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to
      be optimized later. Suggestions:
      + Use a tracking data structure other than struct page
      + Remove the memory for struct pages backing private memory from the
        vmemmap, and re-populate the vmemmap on conversion from private to
        shared
  + Implementation pointers for huge page support
    + Consensus was that getting core MM to do tracking seems wrong
    + Maintaining special page refcounts for guest_memfd pages is difficult to
      get working and requires weird special casing in many places. This was
      tried for FS DAX pages and did not work out: [1]
    + Implementation suggestion: use infrastructure similar to what ZONE_DEVICE
      uses, to provide the huge page to interested parties
      + TBD: how to actually get huge pages into guest_memfd
      + TBD: how to provide/convert the huge pages to ZONE_DEVICE
        + Perhaps reserve them at boot time like in HugeTLB
+ Line of sight to compaction/migration:
  + Compaction here means making memory contiguous
  + Compaction/migration scope:
    + In scope for 4K pages
    + Out of scope for 1G pages and anything managed through ZONE_DEVICE
    + Out of scope for an initial implementation
  + Ideas for future implementations
    + Reuse the non-LRU page migration framework as used by memory ballooning
    + Have userspace drive compaction/migration via ioctls
    + Having line of sight to optimizing lost HVO means avoiding being locked
      in to any implementation requiring struct pages
      + Without struct pages, it is hard to reuse core MM’s
        compaction/migration infrastructure
+ Discuss more details at LPC in Sep 2024, such as how to use huge pages,
  shared/private conversion, huge page splitting
The direction above addresses the prerequisites set out by Fuad and Elliott at
the beginning of the session, which were:
1. Non-destructive shared/private conversion
   + Through having guest_memfd manage and track both shared/private memory
2. Huge page support with the option of converting individual subpages
   + Splitting of pages will be managed by guest_memfd
3. Line of sight to compaction/migration of private memory
   + Possibly driven by userspace using guest_memfd ioctls
4. Loading binaries into guest (private) memory before VM starts
   + This was identified as a special case of (1.) above
5. Non-protected guests in pKVM
   + Not discussed during session, but this is a goal of guest_memfd, for all VM
     types [2]
David Hildenbrand summarized this during the meeting at t=47m25s [3].
[1]: https://lore.kernel.org/linux-mm/cover.66009f59a7fe77320d413011386c3ae5c2ee82eb.1719386613.git-series.apopple@nvidia.com/
[2]: https://lore.kernel.org/lkml/ZnRMn1ObU8TFrms3@google.com/
[3]: https://drive.google.com/file/d/17lruFrde2XWs6B1jaTrAy9gjv08FnJ45/view?t=47m25s&resourcekey=0-LiteoxLd5f4fKoPRMjMTOw