linux-kernel - Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas

Message-ID: <YKDJ1L7XpJRQgSch@kernel.org>
Date:   Sun, 16 May 2021 10:29:24 +0300
From:   Mike Rapoport <rppt@...nel.org>
To:     David Hildenbrand <david@...hat.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andy Lutomirski <luto@...nel.org>,
        Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
        Catalin Marinas <catalin.marinas@....com>,
        Christopher Lameter <cl@...ux.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Elena Reshetova <elena.reshetova@...el.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Hagen Paul Pfeifer <hagen@...u.net>,
        Ingo Molnar <mingo@...hat.com>,
        James Bottomley <jejb@...ux.ibm.com>,
        Kees Cook <keescook@...omium.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>,
        Matthew Garrett <mjg59@...f.ucam.org>,
        Mark Rutland <mark.rutland@....com>,
        Michal Hocko <mhocko@...e.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Palmer Dabbelt <palmer@...belt.com>,
        Palmer Dabbelt <palmerdabbelt@...gle.com>,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Rick Edgecombe <rick.p.edgecombe@...el.com>,
        Roman Gushchin <guro@...com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Shuah Khan <shuah@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Tycho Andersen <tycho@...ho.ws>, Will Deacon <will@...nel.org>,
        Yury Norov <yury.norov@...il.com>, linux-api@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
        linux-nvdimm@...ts.01.org, linux-riscv@...ts.infradead.org,
        x86@...nel.org
Subject: Re: [PATCH v19 5/8] mm: introduce memfd_secret system call to create
 "secret" memory areas

On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:
> >   #ifdef CONFIG_IA64
> >   # include <linux/efi.h>
> > @@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
> >   #ifdef CONFIG_STRICT_DEVMEM
> >   static inline int page_is_allowed(unsigned long pfn)
> >   {
> > +	if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn)))
> > +		return 0;
> > +
> 
> 1. The memmap might be garbage. You should use pfn_to_online_page() instead.
> 
> page = pfn_to_online_page(pfn);
> if (page && page_is_secretmem(page))
> 	return 0;
> 
> 2. What about !CONFIG_STRICT_DEVMEM?
> 
> 3. Someone could map physical memory before a secretmem page gets allocated
> and read the content after it got allocated and gets used. If someone would
> gain root privileges and would wait for the target application to (re)start,
> that could be problematic.
> 
> 
> I do wonder if enforcing CONFIG_STRICT_DEVMEM would be cleaner.
> devmem_is_allowed() should disallow access to any system ram, and thereby,
> any possible secretmem pages, avoiding this check completely.

I've been thinking a bit more about the /dev/mem case, and it seems I was
too fast on the trigger with adding that test for page_is_secretmem().

When CONFIG_STRICT_DEVMEM=y, access to RAM is forbidden anyway, and if
the user built a kernel with CONFIG_STRICT_DEVMEM=n, all of physical memory
is accessible by root anyway.

We might want to default STRICT_DEVMEM to "y" for all architectures and not
only arm64, ppc and x86, but this is not strictly related to this series.
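
The reasoning above can be sketched as a small userspace model (not kernel
code). model_page_is_allowed() and its two boolean parameters are
hypothetical stand-ins, and the architecture-specific exceptions in the
real devmem_is_allowed() are ignored here:

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the /dev/mem policy discussed above.
 * strict_devmem stands in for CONFIG_STRICT_DEVMEM, is_system_ram for
 * "this pfn is ordinary RAM". Architecture-specific exceptions are ignored.
 */
static bool model_page_is_allowed(bool strict_devmem, bool is_system_ram)
{
	if (!strict_devmem)
		return true;	/* =n: root can map any physical page */

	return !is_system_ram;	/* =y: RAM is refused, secretmem pages included */
}
```

Secretmem pages are always system RAM, so in this model an extra
page_is_secretmem() test would not change the outcome in either
configuration.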
 
> [...]
> 
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > new file mode 100644
> > index 000000000000..1ae50089adf1
> > --- /dev/null
> > +++ b/mm/secretmem.c
> > @@ -0,0 +1,239 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright IBM Corporation, 2021
> > + *
> > + * Author: Mike Rapoport <rppt@...ux.ibm.com>
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/fs.h>
> > +#include <linux/swap.h>
> > +#include <linux/mount.h>
> > +#include <linux/memfd.h>
> > +#include <linux/bitops.h>
> > +#include <linux/printk.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/syscalls.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/secretmem.h>
> > +#include <linux/set_memory.h>
> > +#include <linux/sched/signal.h>
> > +
> > +#include <uapi/linux/magic.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) "secretmem: " fmt
> > +
> > +/*
> > + * Define mode and flag masks to allow validation of the system call
> > + * parameters.
> > + */
> > +#define SECRETMEM_MODE_MASK	(0x0)
> > +#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
> > +
> > +static bool secretmem_enable __ro_after_init;
> > +module_param_named(enable, secretmem_enable, bool, 0400);
> > +MODULE_PARM_DESC(secretmem_enable,
> > +		 "Enable secretmem and memfd_secret(2) system call");
> > +
> > +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > +{
> > +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > +	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	pgoff_t offset = vmf->pgoff;
> > +	gfp_t gfp = vmf->gfp_mask;
> > +	unsigned long addr;
> > +	struct page *page;
> > +	int err;
> > +
> > +	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > +		return vmf_error(-EINVAL);
> > +
> > +retry:
> > +	page = find_lock_page(mapping, offset);
> > +	if (!page) {
> > +		page = alloc_page(gfp | __GFP_ZERO);
> 
> We'll end up here with gfp == GFP_HIGHUSER (via the mapping below), correct?

Yes
 
> > +		if (!page)
> > +			return VM_FAULT_OOM;
> > +
> > +		err = set_direct_map_invalid_noflush(page, 1);
> > +		if (err) {
> > +			put_page(page);
> > +			return vmf_error(err);
> 
> Would we want to translate that to a proper VM_FAULT_..., which would most
> probably be VM_FAULT_OOM when we fail to allocate a pagetable?

That's what vmf_error() does: it translates -ESOMETHING to VM_FAULT_XYZ.
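
For reference, the translation is roughly the following. This is a
userspace sketch: MODEL_VM_FAULT_* and model_vmf_error() are stand-in names,
and the mapping shown (-ENOMEM to OOM, anything else to SIGBUS) is assumed
to match vmf_error() as of this series:

```c
#include <errno.h>

/* Stand-in values for illustration; the kernel defines the real VM_FAULT_* codes. */
#define MODEL_VM_FAULT_OOM	0x1u
#define MODEL_VM_FAULT_SIGBUS	0x2u

/* Map a negative errno to a fault code, in the spirit of vmf_error(). */
static unsigned int model_vmf_error(int err)
{
	if (err == -ENOMEM)
		return MODEL_VM_FAULT_OOM;

	return MODEL_VM_FAULT_SIGBUS;
}
```

Under that assumption, a pagetable allocation failure that surfaces as
-ENOMEM from set_direct_map_invalid_noflush() already ends up as
VM_FAULT_OOM.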

> > +		}
> > +
> > +		__SetPageUptodate(page);
> > +		err = add_to_page_cache_lru(page, mapping, offset, gfp);
> > +		if (unlikely(err)) {
> > +			put_page(page);
> > +			/*
> > +			 * If a split of large page was required, it
> > +			 * already happened when we marked the page invalid
> > +			 * which guarantees that this call won't fail
> > +			 */
> > +			set_direct_map_default_noflush(page, 1);
> > +			if (err == -EEXIST)
> > +				goto retry;
> > +
> > +			return vmf_error(err);
> > +		}
> > +
> > +		addr = (unsigned long)page_address(page);
> > +		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> 
> Hmm, to me it feels like something like that belongs into the
> set_direct_map_invalid_*() calls? Otherwise it's just very easy to mess up
> ...

AFAIU the set_direct_map_*() calls deliberately do not flush the TLB and
leave that to the caller, to allow gathering multiple updates of the direct
map and doing a single TLB flush afterwards.
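
The batching pattern can be illustrated with a userspace model. Everything
here (the model_* names, the bool array standing in for the direct map, the
flush counter) is hypothetical and only shows the "update many, flush once"
shape, not the kernel API:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

static bool model_direct_map_valid[MODEL_NR_PAGES];
static unsigned int model_tlb_flushes;

/* Start from a state where every page is mapped and nothing was flushed. */
static void model_reset(void)
{
	size_t i;

	for (i = 0; i < MODEL_NR_PAGES; i++)
		model_direct_map_valid[i] = true;
	model_tlb_flushes = 0;
}

/* Update one entry, deliberately without flushing. */
static void model_set_invalid_noflush(size_t idx)
{
	model_direct_map_valid[idx] = false;
}

/* One flush covering the whole range the caller changed. */
static void model_flush_range(size_t start, size_t end)
{
	(void)start;
	(void)end;
	model_tlb_flushes++;
}

/*
 * Gather several updates, then flush once. The caller must keep
 * start + nr within MODEL_NR_PAGES. Returns the total flush count.
 */
static unsigned int model_unmap_batch(size_t start, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		model_set_invalid_noflush(start + i);

	model_flush_range(start, start + nr);
	return model_tlb_flushes;
}
```

In secretmem_fault() there is only one page per fault, so the caller-side
flush covers a single PAGE_SIZE range, but the same split lets other
callers amortize the flush over many pages.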

> I'm certainly not a filesystem guy. Nothing else jumped at me.
> 
> 
> To me, the overall approach makes sense and I consider it an improved
> mlock() mechanism for storing secrets, although I'd love to have some more
> information in the log regarding access via root, namely that there are
> still fancy ways to read secretmem memory once root via
> 
> 1. warm reboot attacks especially in VMs (e.g., modifying the cmdline)
> 2. kexec-style reboot attacks (e.g., modifying the cmdline)
> 3. kdump attacks
> 4. kdb most probably
> 5. "letting the process read the memory for us" via Kees if that still
>    applies
> 6. ... most probably something else
> 
> Just to make people aware that there are still some things to be sorted out
> when we fully want to protect against privilege escalations.
> 
> (maybe this information is buried in the cover letter already, where it
> usually gets lost)

I believe that it belongs in the man page rather than in the changelog, so
that the *users* are aware of secretmem limitations.
 
-- 
Sincerely yours,
Mike.