Message-ID: <20230118101723.GA2976263@ls.amr.corp.intel.com>
Date: Wed, 18 Jan 2023 02:17:23 -0800
From: Isaku Yamahata <isaku.yamahata@...il.com>
To: Chao Peng <chao.p.peng@...ux.intel.com>
Cc: Sean Christopherson <seanjc@...gle.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-arch@...r.kernel.org,
linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
qemu-devel@...gnu.org, Paolo Bonzini <pbonzini@...hat.com>,
Jonathan Corbet <corbet@....net>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Arnd Bergmann <arnd@...db.de>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Miaohe Lin <linmiaohe@...wei.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, Hugh Dickins <hughd@...gle.com>,
Jeff Layton <jlayton@...nel.org>,
"J . Bruce Fields" <bfields@...ldses.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
Steven Price <steven.price@....com>,
"Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
Vlastimil Babka <vbabka@...e.cz>,
Vishal Annapurve <vannapurve@...gle.com>,
Yu Zhang <yu.c.zhang@...ux.intel.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
ak@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
ddutile@...hat.com, dhildenb@...hat.com,
Quentin Perret <qperret@...gle.com>, tabba@...gle.com,
Michael Roth <michael.roth@....com>, mhocko@...e.com,
wei.w.wang@...el.com, isaku.yamahata@...il.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
create restricted user memory
On Wed, Jan 18, 2023 at 04:16:41PM +0800,
Chao Peng <chao.p.peng@...ux.intel.com> wrote:
> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > + notifier->ops->invalidate_start(notifier, start, end);
> > > >
> > > > Two major design issues that we overlooked long ago:
> > > >
> > > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> > > > VM with a large number of convertible memslots that are all backed by a
> > > > single large restrictedmem instance, then converting a single page will
> > > > result in a linear walk through all memslots. I don't expect anyone to
> > > > actually do something silly like that, but I also never expected there to be
> > > > a legitimate usecase for thousands of memslots.
> > > >
> > > > 2. This approach fails to provide the ability for KVM to ensure a guest has
> > > > exclusive access to a page. As discussed in the past, the kernel can rely
> > > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> > > > e.g. SEV, there is value in ensuring a 1:1 association.
> > > >
> > > > And probably more importantly, relying on hardware for SNP and TDX yields a
> > > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> > > > page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > > from a restrictedmem instance to multiple VMs, then failure will occur only
> > > > when KVM tries to assign the page to the second VM. That will happen deep
> > > > in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > > that KVM's ABI effectively allows plumbing garbage into its memslots.
> > >
> > > It may not be a valid usage, but in my TDX environment I do hit the
> > > issue below.
> > >
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > >
> > > Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in
> > > QEMU, and slot#2 fails due to the exclusive check below.
> > >
> > > For now I have changed the QEMU code to mark these alias slots as shared
> > > instead of private, but I'm not 100% confident this is the correct fix.
> >
> > That's a QEMU bug of sorts. SMM is mutually exclusive with TDX; QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots, for that matter) for TDX guests.
>
> Thanks for the confirmation. As long as we only bind one notifier for
> each address, using an xarray does make things simple.
In the past I had QEMU patches to disable PAM and SMRAM, but they were dropped
for simplicity: SMRAM/PAM are disabled in the reset state anyway, with the
unused memslot still registered, and the TDX guest BIOS (TDVF/EDK2) never
enables them. Now we can revive those patches.
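For reference, the 1:1 rule discussed above (one notifier per restrictedmem
offset range, so handing the same pages to a second VM, or to an aliasing
memslot like QEMU's SMRAM alias, fails up front with -EBUSY instead of deep
inside KVM) can be sketched as a small userspace model. The names here
(bind_range() etc.) are made up for illustration; this is not the actual
kernel xarray code:

```c
#include <errno.h>

#define MAX_BINDINGS 16

/* One binding per restrictedmem offset range: [start, end) -> owner VM. */
struct binding {
	unsigned long start, end;
	int owner;
};

struct binding bindings[MAX_BINDINGS];
int nr_bindings;

/*
 * Bind [start, end) to a single owner.  Any overlap with an existing
 * binding fails with -EBUSY, so a second binding of the same pages is
 * rejected at bind time rather than when KVM later maps the page.
 */
int bind_range(unsigned long start, unsigned long end, int owner)
{
	int i;

	for (i = 0; i < nr_bindings; i++)
		if (start < bindings[i].end && bindings[i].start < end)
			return -EBUSY;
	if (nr_bindings == MAX_BINDINGS)
		return -ENOSPC;
	bindings[nr_bindings++] = (struct binding){ start, end, owner };
	return 0;
}
```

With exclusivity enforced this way, invalidation can also become a direct
range lookup instead of the linear list_for_each_entry() walk quoted above.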
--
Isaku Yamahata <isaku.yamahata@...il.com>