Message-ID: <20220818132421.6xmjqduempmxnnu2@box>
Date: Thu, 18 Aug 2022 16:24:21 +0300
From: "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>
To: Hugh Dickins <hughd@...gle.com>
Cc: Chao Peng <chao.p.peng@...ux.intel.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
linux-doc@...r.kernel.org, qemu-devel@...gnu.org,
linux-kselftest@...r.kernel.org,
Paolo Bonzini <pbonzini@...hat.com>,
Jonathan Corbet <corbet@....net>,
Sean Christopherson <seanjc@...gle.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
Jeff Layton <jlayton@...nel.org>,
"J . Bruce Fields" <bfields@...ldses.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
Steven Price <steven.price@....com>,
"Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
Vlastimil Babka <vbabka@...e.cz>,
Vishal Annapurve <vannapurve@...gle.com>,
Yu Zhang <yu.c.zhang@...ux.intel.com>, luto@...nel.org,
jun.nakajima@...el.com, dave.hansen@...el.com, ak@...ux.intel.com,
david@...hat.com, aarcange@...hat.com, ddutile@...hat.com,
dhildenb@...hat.com, Quentin Perret <qperret@...gle.com>,
Michael Roth <michael.roth@....com>, mhocko@...e.com,
Muchun Song <songmuchun@...edance.com>,
"Gupta, Pankaj" <pankaj.gupta@....com>
Subject: Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM
guest private memory
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> On Wed, 6 Jul 2022, Chao Peng wrote:
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory.
>
> Here at last are my reluctant thoughts on this patchset.
>
> fd-based approach for supporting KVM guest private memory: fine.
>
> Use or abuse of memfd and shmem.c: mistaken.
>
> memfd_create() was an excellent way to put together the initial prototype.
>
> But since then, TDX in particular has forced an effort into preventing
> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>
> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>
> What use do you have for a filesystem here? Almost none.
> IIUC, what you want is an fd through which QEMU can allocate kernel
> memory, selectively free that memory, and communicate fd+offset+length
> to KVM. And perhaps an interface to initialize a little of that memory
> from a template (presumably copied from a real file on disk somewhere).
>
> You don't need shmem.c or a filesystem for that!
>
> If your memory could be swapped, that would be enough of a good reason
> to make use of shmem.c: but it cannot be swapped; and although there
> are some references in the mailthreads to it perhaps being swappable
> in future, I get the impression that will not happen soon if ever.
>
> If your memory could be migrated, that would be some reason to use
> filesystem page cache (because page migration happens to understand
> that type of memory): but it cannot be migrated.
Migration support is in the pipeline: it is part of TDX 1.5 [1]. Swapping is
theoretically possible, but I'm not aware of any plans as of now.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> Some of these impressions may come from earlier iterations of the
> patchset (v7 looks better in several ways than v5). I am probably
> underestimating the extent to which you have taken on board other
> usages beyond TDX and SEV private memory, and rightly want to serve
> them all with similar interfaces: perhaps there is enough justification
> for shmem there, but I don't see it. There was mention of userfaultfd
> in one link: does that provide the justification for using shmem?
>
> I'm afraid of the special demands you may make of memory allocation
> later on - surprised that huge pages are not mentioned already;
> gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed: the combination
MFD_INACCESSIBLE | MFD_HUGETLB should route that way. There should be zero
implications for shmem; it is going to be a separate struct memfile_backing_store.

I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE
to be movable if the platform supports it, and secretmem is not migratable by
design (without direct mapping fragmentation).
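
To make the flag routing concrete, here is a minimal userspace sketch of the
dispatch, not kernel code from the series. MFD_HUGETLB is an existing
memfd_create() flag and MFD_INACCESSIBLE is the flag proposed here; the DEMO_
constants, the enum and demo_pick_backing_store() are hypothetical names used
only for illustration.

```c
#include <assert.h>

/*
 * Illustration-only flag values. The DEMO_ values are assumptions made for
 * this sketch and are not taken from any kernel header.
 */
#define DEMO_MFD_HUGETLB      0x0004U
#define DEMO_MFD_INACCESSIBLE 0x0008U

/* Hypothetical stand-in for the backing stores that could be registered. */
enum demo_backing_kind {
	DEMO_BACKING_NONE,       /* ordinary memfd, no backing store involved */
	DEMO_BACKING_SHMEM,      /* MFD_INACCESSIBLE alone */
	DEMO_BACKING_HUGETLBFS,  /* MFD_INACCESSIBLE | MFD_HUGETLB */
};

/* Pick a backing store from the memfd flags, as described above. */
static enum demo_backing_kind demo_pick_backing_store(unsigned int flags)
{
	/* Reject flags this sketch does not model. */
	assert((flags & ~(DEMO_MFD_HUGETLB | DEMO_MFD_INACCESSIBLE)) == 0);

	if (!(flags & DEMO_MFD_INACCESSIBLE))
		return DEMO_BACKING_NONE;
	if (flags & DEMO_MFD_HUGETLB)
		return DEMO_BACKING_HUGETLBFS;
	return DEMO_BACKING_SHMEM;
}
```

The point of the sketch is that the hugetlbfs case is a separate branch, so
the shmem path is untouched when MFD_HUGETLB is added to the flags.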
> Here's what I would prefer, and imagine much easier for you to maintain;
> but I'm no system designer, and may be misunderstanding throughout.
>
> QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> the fallocate syscall interface itself) to allocate and free the memory,
> ioctl for initializing some of it too. KVM in control of whether that
> fd can be read or written or mmap'ed or whatever, no need to prevent it
> in shmem.c, no need for flags, seals, notifications to and fro because
> KVM is already in control and knows the history. If shmem actually has
> value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> mmap, and i915/gem make use of it underneath. If shmem has nothing to
> add, just allocate and free kernel memory directly, recorded in your
> own xarray.
I guess a shim layer on top of shmem *can* work; I don't immediately see why
it would not. But I'm not sure it is the right direction. We risk creating yet
another parallel VM with its own rules/locking/accounting that is opaque to
core-mm.

Note that on machines that run TDX guests such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.
--
Kiryl Shutsemau / Kirill A. Shutemov