linux-kernel - Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221223082406.GB1829090@chaop.bj.intel.com>
Date:   Fri, 23 Dec 2022 16:24:06 +0800
From:   Chao Peng <chao.p.peng@...ux.intel.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     "Huang, Kai" <kai.huang@...el.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "jmattson@...gle.com" <jmattson@...gle.com>,
        "Lutomirski, Andy" <luto@...nel.org>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "Hocko, Michal" <mhocko@...e.com>,
        "qemu-devel@...gnu.org" <qemu-devel@...gnu.org>,
        "tabba@...gle.com" <tabba@...gle.com>,
        "david@...hat.com" <david@...hat.com>,
        "michael.roth@....com" <michael.roth@....com>,
        "corbet@....net" <corbet@....net>,
        "bfields@...ldses.org" <bfields@...ldses.org>,
        "dhildenb@...hat.com" <dhildenb@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>, "bp@...en8.de" <bp@...en8.de>,
        "linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
        "rppt@...nel.org" <rppt@...nel.org>,
        "shuah@...nel.org" <shuah@...nel.org>,
        "vkuznets@...hat.com" <vkuznets@...hat.com>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>,
        "ddutile@...hat.com" <ddutile@...hat.com>,
        "qperret@...gle.com" <qperret@...gle.com>,
        "arnd@...db.de" <arnd@...db.de>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "vannapurve@...gle.com" <vannapurve@...gle.com>,
        "naoya.horiguchi@....com" <naoya.horiguchi@....com>,
        "wanpengli@...cent.com" <wanpengli@...cent.com>,
        "yu.c.zhang@...ux.intel.com" <yu.c.zhang@...ux.intel.com>,
        "hughd@...gle.com" <hughd@...gle.com>,
        "aarcange@...hat.com" <aarcange@...hat.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "Nakajima, Jun" <jun.nakajima@...el.com>,
        "jlayton@...nel.org" <jlayton@...nel.org>,
        "joro@...tes.org" <joro@...tes.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "Wang, Wei W" <wei.w.wang@...el.com>,
        "steven.price@....com" <steven.price@....com>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "Hansen, Dave" <dave.hansen@...el.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "linmiaohe@...wei.com" <linmiaohe@...wei.com>
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
 create restricted user memory

On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> > 
> > > 
> > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> 
> No, requiring the user (KVM) to guard against lack of support for page migration
> in restricted mem is a terrible API.  It's totally fine for restricted mem to not
> support page migration until there's a use case, but punting the problem to KVM
> is not acceptable.  Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there is
> no notification to invalidate existing mappings.
> 
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> 
> I agree that whether or not migration is supported should be controllable by the
> user, but I strongly disagree on punting refcount management to KVM (or TDX).
> The whole point of restricted mem is to support technologies like TDX and SNP,
> accomodating their special needs for things like page migration should be part of
> the API, not some footnote in the documenation.

I never doubt page migration should be part of restrictedmem API, but
that's not an initial implementing as we all agreed? Then before that
API being introduced, we need find a solution to prevent page migration
for TDX. Other than refcount management, do we have any other workable
solution? 

> 
> It's not difficult to let the user communicate support for page migration, e.g.
> if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
> to signal support (or lack thereof) for page migration.  NULL == no migration,
> non-NULL == migration allowed.

I know.

> 
> We know that supporting page migration in TDX and SNP is possible, and we know
> that page migration will require a dedicated API since the backing store can't
> memcpy() the page.  I don't see any reason to ignore that eventuality.

No, I'm not ignoring it. It's just about the short-term page migration
prevention before that dedicated API being introduced.

> 
> But again, unless I'm missing something, that's a future problem because restricted
> mem doesn't yet support page migration regardless of the downstream user.

It's true a future problem for page migration support itself, but page
migration prevention is not a future problem since TDX pages need to be
pinned before page migration gets supported.

Thanks,
Chao