Message-ID: <883a0f0d-7342-479e-aa3c-13deb7e99338@redhat.com>
Date: Tue, 6 Aug 2024 15:43:24 +0200
From: David Hildenbrand <david@...hat.com>
To: "Gowans, James" <jgowans@...zon.com>, "jack@...e.cz" <jack@...e.cz>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"rppt@...nel.org" <rppt@...nel.org>, "brauner@...nel.org"
<brauner@...nel.org>, "Graf (AWS), Alexander" <graf@...zon.de>,
"anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>,
"steven.sistare@...cle.com" <steven.sistare@...cle.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Durrant, Paul" <pdurrant@...zon.co.uk>,
"seanjc@...gle.com" <seanjc@...gle.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"Woodhouse, David" <dwmw@...zon.co.uk>,
"Saenz Julienne, Nicolas" <nsaenz@...zon.es>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"nh-open-source@...zon.com" <nh-open-source@...zon.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"jgg@...pe.ca" <jgg@...pe.ca>
Subject: Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory
filesystem
> 1. Secret hiding: with guestmemfs all of the memory is out of the kernel
> direct map as an additional defence mechanism. This means no
> read()/write() syscalls to guestmemfs files, and no IO to it. The only
> way to access it is to mmap the file.
There are people interested in similar things for guest_memfd.
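
To make that access model concrete, my reading of it is roughly the
sketch below -- untested, and the mount point, file name and helper are
made up for illustration. The point is that read()/write() on the fd
are refused because the backing pages have no direct-map entries, so
mapping the file is the only way at the contents:

	#include <stddef.h>
	#include <fcntl.h>
	#include <sys/mman.h>

	/* Hypothetical guestmemfs mount and file; error handling omitted. */
	static void *map_guest_ram(size_t size)
	{
		int fd = open("/mnt/guestmemfs/vm0-ram", O_RDWR);

		/* read(fd, ...)/write(fd, ...) would fail; mmap() is the
		 * only supported access path. */
		return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			    fd, 0);
	}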
>
> 2. No struct page overhead: the intended use case is for systems whose
> sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
> the majority of system RAM would be donated to this fs. We definitely
> don't want 4 KiB struct pages here as it would be a significant
> overhead. That's why guestmemfs carves the memory out in early boot and
> sets memblock flags to avoid struct page allocation. I don't know if
> hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
> for its memory?
Sure, it's called HVO (HugeTLB Vmemmap Optimization) and can optimize
out a significant portion of the vmemmap.
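(For reference: that's CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, enabled via
the hugetlb_free_vmemmap=on boot parameter or the
vm.hugetlb_optimize_vmemmap sysctl -- from memory, so double-check the
exact knob names.)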
>
> 3. guest_memfd interface: For confidential computing use-cases we need
> to provide a guest_memfd style interface so that these FDs can be used
> as a guest_memfd file in KVM memslots. Would there be interest in
> extending hugetlbfs to also support a guest_memfd style interface?
>
"Extending hugetlbfs" sounds wrong; hugetlbfs is a blast from the past
and not something people are particularly keen to extend for such use
cases. :)
Instead, as Jason said, we're looking into letting guest_memfd own and
manage large chunks of contiguous memory.
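
For completeness, the anonymous-fd flow that exists today looks roughly
like the sketch below -- untested, written against the 6.8+ uAPI with
error handling omitted, and the helper name is made up; how a persistent
or physically-contiguous backend would plug in behind
KVM_CREATE_GUEST_MEMFD is exactly the open question:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Create a guest_memfd on the VM and wire it into a memslot. */
	static int add_gmem_slot(int vm_fd, uint64_t gpa, uint64_t size,
				 void *shared_mem)
	{
		struct kvm_create_guest_memfd gmem = { .size = size };
		int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

		struct kvm_userspace_memory_region2 region = {
			.slot = 0,
			.flags = KVM_MEM_GUEST_MEMFD,
			.guest_phys_addr = gpa,
			.memory_size = size,
			/* shared (non-private) accesses go via a normal mapping */
			.userspace_addr = (uint64_t)(uintptr_t)shared_mem,
			.guest_memfd = gmem_fd,
			.guest_memfd_offset = 0,
		};
		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
	}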
> 4. Metadata designed for persistence: guestmemfs will need to keep
> simple internal metadata data structures (limited allocations, limited
> fragmentation) so that pages can easily and efficiently be marked as
> persistent via KHO. Something like slab allocations would probably be a
> no-go as then we'd need to persist and reconstruct the slab allocator. I
> don't know how hugetlbfs structures its fs metadata but I'm guessing it
> uses the slab and does lots of small allocations so trying to retrofit
> persistence via KHO to it may be challenging.
>
> 5. Integration with persistent IOMMU mappings: to keep DMA running
> across kexec, iommufd needs to know that the backing memory for an IOAS
> is persistent too. The idea is to do some DMA pinning of persistent
> files, which would require iommufd/guestmemfs integration - would we
> want to add this to hugetlbfs?
>
> 6. Virtualisation-specific APIs: starting to get a bit esoteric here,
> but use-cases like being able to carve out specific chunks of memory
> from a running VM and turn it into memory for another side car VM, or
> doing post-copy LM via DMA by mapping memory into the IOMMU but taking
> page faults on the CPU. This may require virtualisation-specific ioctls
> on the files which wouldn't be generally applicable to hugetlbfs.
>
> 7. NUMA control: a requirement is to always have correct NUMA affinity.
> While currently not implemented the idea is to extend the guestmemfs
> allocation to support specifying allocation sizes from each NUMA node at
> early boot, and then having multiple mount points, one per NUMA node (or
> something like that...). Unclear if this is something hugetlbfs would
> want.
>
> There are probably more potential issues, but those are the ones that
> come to mind... That being said, if hugetlbfs maintainers are interested
> in going in this direction then we can definitely look at enhancing
> hugetlbfs.
>
> I think there are two types of problems: "Would hugetlbfs want this
> functionality?" - that's the majority. An a few are "This would be hard
> with hugetlbfs!" - persistence probably falls into this category.
I'm rather asking myself whether you should instead teach/extend the
guest_memfd concept with some of what you propose here.
At least "guest_memfd" sounds a lot like the "anonymous fd" based
variant of guestmemfs ;)
Just like we have both hugetlbfs and memfds backed by hugetlb pages.
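
That is, something along the lines of the (untested) snippet below
already gives you hugetlb pages behind an anonymous fd instead of a
hugetlbfs file; the helper name is made up:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	/* hugetlb-backed memfd: the "anonymous fd" flavour of hugetlbfs,
	 * using the default hugepage size. */
	static int hugetlb_memfd(size_t size)
	{
		int fd = memfd_create("guest-ram", MFD_CLOEXEC | MFD_HUGETLB);

		ftruncate(fd, size);
		return fd;
	}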
--
Cheers,
David / dhildenb