Message-ID: <ea1ddcfa-f52d-9a7d-cb7b-8502b38a90da@redhat.com>
Date: Fri, 14 May 2021 10:50:55 +0200
From: David Hildenbrand <david@...hat.com>
To: Mike Rapoport <rppt@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Alexander Viro <viro@...iv.linux.org.uk>,
Andy Lutomirski <luto@...nel.org>,
Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
Catalin Marinas <catalin.marinas@....com>,
Christopher Lameter <cl@...ux.com>,
Dan Williams <dan.j.williams@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Elena Reshetova <elena.reshetova@...el.com>,
"H. Peter Anvin" <hpa@...or.com>,
Hagen Paul Pfeifer <hagen@...u.net>,
Ingo Molnar <mingo@...hat.com>,
James Bottomley <jejb@...ux.ibm.com>,
Kees Cook <keescook@...omium.org>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Matthew Wilcox <willy@...radead.org>,
Matthew Garrett <mjg59@...f.ucam.org>,
Mark Rutland <mark.rutland@....com>,
Michal Hocko <mhocko@...e.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Michael Kerrisk <mtk.manpages@...il.com>,
Palmer Dabbelt <palmer@...belt.com>,
Palmer Dabbelt <palmerdabbelt@...gle.com>,
Paul Walmsley <paul.walmsley@...ive.com>,
Peter Zijlstra <peterz@...radead.org>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Rick Edgecombe <rick.p.edgecombe@...el.com>,
Roman Gushchin <guro@...com>,
Shakeel Butt <shakeelb@...gle.com>,
Shuah Khan <shuah@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Tycho Andersen <tycho@...ho.ws>, Will Deacon <will@...nel.org>,
Yury Norov <yury.norov@...il.com>, linux-api@...r.kernel.org,
linux-arch@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-nvdimm@...ts.01.org, linux-riscv@...ts.infradead.org,
x86@...nel.org
Subject: Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create
 "secret" memory areas

On 13.05.21 20:47, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@...ux.ibm.com>
>
> Introduce "memfd_secret" system call with the ability to create
> memory areas visible only in the context of the owning process and
> not mapped not only to other processes but in the kernel page tables
> as well.
>
> The secretmem feature is off by default and the user must explicitly
> enable it at the boot time.
>
> Once secretmem is enabled, the user will be able to create a file
> descriptor using the memfd_secret() system call. The memory areas
> created by mmap() calls from this file descriptor will be unmapped
> from the kernel direct map and they will be only mapped in the page
> table of the processes that have access to the file descriptor.
>
> The file descriptor based memory has several advantages over the
> "traditional" mm interfaces, such as mlock(), mprotect(), madvise().
> File descriptor approach allows explict and controlled sharing of the
> memory
s/explict/explicit/
> areas, it allows to seal the operations. Besides, file descriptor
> based memory paves the way for VMMs to remove the secret memory range
> from the userpace hipervisor process, for instance QEMU. Andy
> Lutomirski says:
s/userpace hipervisor/userspace hypervisor/
>
> "Getting fd-backed memory into a guest will take some possibly major
> work in the kernel, but getting vma-backed memory into a guest
> without mapping it in the host user address space seems much, much
> worse."
>
> memfd_secret() is made a dedicated system call rather than an
> extention to
s/extention/extension/
> memfd_create() because it's purpose is to allow the user to create
> more secure memory mappings rather than to simply allow file based
> access to the memory. Nowadays a new system call cost is negligible
> while it is way simpler for userspace to deal with a clear-cut system
> calls than with a multiplexer or an overloaded syscall. Moreover, the
> initial implementation of memfd_secret() is completely distinct from
> memfd_create() so there is no much sense in overloading
> memfd_create() to begin with. If there will be a need for code
> sharing between these implementation it can be easily achieved
> without a need to adjust user visible APIs.
>
> The secret memory remains accessible in the process context using
> uaccess primitives, but it is not exposed to the kernel otherwise;
> secret memory areas are removed from the direct map and functions in
> the follow_page()/get_user_page() family will refuse to return a page
> that belongs to the secret memory area.
>
> Once there will be a use case that will require exposing secretmem to
> the kernel it will be an opt-in request in the system call flags so
> that user would have to decide what data can be exposed to the
> kernel.
Maybe spell out an example, such as page migration.
>
> Removing of the pages from the direct map may cause its fragmentation
> on architectures that use large pages to map the physical memory
> which affects the system performance. However, the original Kconfig
> text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct
> map "... can improve the kernel's performance a tiny bit ..." (commit
> 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1]
> showed that "... although 1G mappings are a good default choice,
> there is no compelling evidence that it must be the only choice".
> Hence, it is sufficient to have secretmem disabled by default with
> the ability of a system administrator to enable it at boot time.
Maybe add a link to the Intel performance evaluation.
>
> Pages in the secretmem regions are unevictable and unmovable to
> avoid accidental exposure of the sensitive data via swap or during
> page migration.
>
> Since the secretmem mappings are locked in memory they cannot exceed
> RLIMIT_MEMLOCK. Since these mappings are already locked independently
> from mlock(), an attempt to mlock()/munlock() secretmem range would
> fail and mlockall()/munlockall() will ignore secretmem mappings.
Maybe add something like "similar to pages pinned by VFIO".
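
To make the RLIMIT_MEMLOCK point concrete, here is a rough userspace
sketch of how a caller might check a requested secretmem size against
the soft limit before mapping. The helper name fits_memlock_limit() is
made up for illustration and is not part of the patch. The check is
only advisory, since other locked memory of the caller may count
against the same limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of the patch): compare a requested
 * mapping size with the soft RLIMIT_MEMLOCK of the calling process.
 * Returns 1 if the size fits, 0 if it does not, -1 if getrlimit() fails.
 */
int fits_memlock_limit(size_t size)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	/* No limit configured: any size fits. */
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;

	return (rlim_t)size <= rl.rlim_cur ? 1 : 0;
}
```

A caller could use this to fail early with a readable message instead
of relying on the mapping or fault path to report the error.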
>
> However, unlike mlock()ed memory, secretmem currently behaves more
> like long-term GUP: secretmem mappings are unmovable mappings
> directly consumed by user space. With default limits, there is no
> excessive use of secretmem and it poses no real problem in
> combination with ZONE_MOVABLE/CMA, but in the future this should be
> addressed to allow balanced use of large amounts of secretmem along
> with ZONE_MOVABLE/CMA.
>
> A page that was a part of the secret memory area is cleared when it
> is freed to ensure the data is not exposed to the next user of that
> page.
You could skip that when init_on_free (and eventually also
init_on_alloc) is set, to avoid double clearing.
>
> The following example demonstrates creation of a secret mapping
> (error handling is omitted):
>
>     fd = memfd_secret(0);
>     ftruncate(fd, MAP_SIZE);
>     ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
>                MAP_SHARED, fd, 0);
>
> [1]
> https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
[my mail client messed up the remainder of the mail for whatever reason,
will comment in a separate mail if there is anything to comment :) ]
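
As the quoted example omits error handling, here is a slightly fuller
userspace sketch of the same three steps. It assumes there is no libc
wrapper available and goes through syscall(2). The fallback syscall
number 447 and the helper name secret_map() are assumptions for
illustration, not something taken from the patch. A negative return is
expected wherever the syscall is unavailable or secretmem was not
enabled at boot.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: fall back to 447 if the installed headers lack the number. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper: create a secret mapping of 'size' bytes.
 * Returns 0 and stores the mapping in *out on success, or a negative
 * errno value on failure (*out is left untouched in that case).
 */
int secret_map(size_t size, void **out)
{
	void *ptr;
	int fd, err;

	fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0)
		return -errno;

	/* Size the file before mapping it, as in the quoted example. */
	if (ftruncate(fd, (off_t)size) < 0) {
		err = errno;
		close(fd);
		return -err;
	}

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	err = errno;

	/* The mapping keeps its own reference, so the fd can be closed. */
	close(fd);

	if (ptr == MAP_FAILED)
		return -err;

	*out = ptr;
	return 0;
}
```

On success the caller owns the mapping and releases it with
munmap(ptr, size).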
--
Thanks,
David / dhildenb