Message-ID: <20230118101723.GA2976263@ls.amr.corp.intel.com>
Date:   Wed, 18 Jan 2023 02:17:23 -0800
From:   Isaku Yamahata <isaku.yamahata@...il.com>
To:     Chao Peng <chao.p.peng@...ux.intel.com>
Cc:     Sean Christopherson <seanjc@...gle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-arch@...r.kernel.org,
        linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
        qemu-devel@...gnu.org, Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Arnd Bergmann <arnd@...db.de>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Miaohe Lin <linmiaohe@...wei.com>, x86@...nel.org,
        "H . Peter Anvin" <hpa@...or.com>, Hugh Dickins <hughd@...gle.com>,
        Jeff Layton <jlayton@...nel.org>,
        "J . Bruce Fields" <bfields@...ldses.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
        Steven Price <steven.price@....com>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
        ak@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
        ddutile@...hat.com, dhildenb@...hat.com,
        Quentin Perret <qperret@...gle.com>, tabba@...gle.com,
        Michael Roth <michael.roth@....com>, mhocko@...e.com,
        wei.w.wang@...el.com, isaku.yamahata@...il.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
 create restricted user memory

On Wed, Jan 18, 2023 at 04:16:41PM +0800,
Chao Peng <chao.p.peng@...ux.intel.com> wrote:

> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > > > 
> > > > Two major design issues that we overlooked long ago:
> > > > 
> > > >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> > > >      VM with a large number of convertible memslots that are all backed by a
> > > >      single large restrictedmem instance, then converting a single page will
> > > >      result in a linear walk through all memslots.  I don't expect anyone to
> > > >      actually do something silly like that, but I also never expected there to be
> > > >      a legitimate usecase for thousands of memslots.
> > > > 
> > > >   2. This approach fails to provide the ability for KVM to ensure a guest has
> > > >      exclusive access to a page.  As discussed in the past, the kernel can rely
> > > >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > >      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
> > > >      e.g. SEV, there is value in ensuring a 1:1 association.
> > > > 
> > > >      And probably more importantly, relying on hardware for SNP and TDX yields a
> > > >      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
> > > >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > >      from a restrictedmem instance to multiple VMs, then failure will occur only
> > > >      when KVM tries to assign the page to the second VM.  That will happen deep
> > > >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> > > 
> > > It may not be a valid usage, but in my TDX environment I do meet below
> > > issue.
> > > 
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > > 
> > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > > and slot#2 fails due to below exclusive check.
> > > 
> > > Currently I changed QEMU code to mark these alias slots as shared
> > > instead of private but I'm not 100% confident this is correct fix.
> > 
> > That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
> 
> Thanks for the confirmation. As long as we only bind one notifier for
> each address, using xarray does make things simple.

In the past I had QEMU patches to disable PAM and SMRAM, but they were dropped
for simplicity: SMRAM/PAM are disabled in the reset state, leaving only an
unused memslot registered, and the TDX guest BIOS (TDVF or EDK2) doesn't enable
them. Now we can revive those patches.
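
For reference, here is a rough userspace sketch of the one-owner-per-page
binding discussed above, and of why an alias slot like Slot#2 gets -EINVAL
(ret=-22). It is not the kernel code from the patch: a plain array stands in
for the xarray, and every name in it (restricted_file, restricted_bind,
restricted_owner, demo_alias_conflict, NPAGES) and every page offset is made up
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NPAGES 1024	/* hypothetical instance size, in pages */

/* Hypothetical stand-in for the xarray: one owner pointer per page offset. */
struct restricted_file {
	const void *owner[NPAGES];
};

/*
 * Bind page offsets [start, end) to a single owner.  Returns -EINVAL and
 * binds nothing if the range is invalid or any page already has an owner.
 */
int restricted_bind(struct restricted_file *f, size_t start, size_t end,
		    const void *owner)
{
	size_t i;

	if (!owner || start >= end || end > NPAGES)
		return -EINVAL;

	/* First pass: refuse the whole range if any page is taken. */
	for (i = start; i < end; i++)
		if (f->owner[i])
			return -EINVAL;

	/* Second pass: record the owner for every page in the range. */
	for (i = start; i < end; i++)
		f->owner[i] = owner;

	return 0;
}

/* Direct lookup of the page's owner, with no walk over all notifiers. */
const void *restricted_owner(const struct restricted_file *f, size_t index)
{
	return index < NPAGES ? f->owner[index] : NULL;
}

/* Mimics the trace above: two disjoint slots, then an alias into the first. */
int demo_alias_conflict(void)
{
	struct restricted_file f = { { NULL } };
	int slot0 = 0, slot1 = 1, slot2 = 2;	/* addresses used as owner tokens */
	int rc;

	rc = restricted_bind(&f, 0, 512, &slot0);
	assert(rc == 0);
	rc = restricted_bind(&f, 512, 576, &slot1);
	assert(rc == 0);

	/* Alias into slot0's range: rejected by the exclusive check. */
	return restricted_bind(&f, 160, 192, &slot2);
}
```

With this model demo_alias_conflict() returns -EINVAL for the third bind, and
an invalidation at a given page offset only needs restricted_owner() to find
the single notifier to call.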
-- 
Isaku Yamahata <isaku.yamahata@...il.com>
