lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221207135049.GA1275553@chaop.bj.intel.com>
Date:   Wed, 7 Dec 2022 21:50:49 +0800
From:   Chao Peng <chao.p.peng@...ux.intel.com>
To:     Fuad Tabba <tabba@...gle.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-api@...r.kernel.org,
        linux-doc@...r.kernel.org, qemu-devel@...gnu.org,
        Paolo Bonzini <pbonzini@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Arnd Bergmann <arnd@...db.de>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Miaohe Lin <linmiaohe@...wei.com>, x86@...nel.org,
        "H . Peter Anvin" <hpa@...or.com>, Hugh Dickins <hughd@...gle.com>,
        Jeff Layton <jlayton@...nel.org>,
        "J . Bruce Fields" <bfields@...ldses.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, Mike Rapoport <rppt@...nel.org>,
        Steven Price <steven.price@....com>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        Vlastimil Babka <vbabka@...e.cz>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yu Zhang <yu.c.zhang@...ux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        luto@...nel.org, jun.nakajima@...el.com, dave.hansen@...el.com,
        ak@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
        ddutile@...hat.com, dhildenb@...hat.com,
        Quentin Perret <qperret@...gle.com>,
        Michael Roth <michael.roth@....com>, mhocko@...e.com,
        wei.w.wang@...el.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
 create restricted user memory

On Tue, Dec 06, 2022 at 02:57:04PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@...ux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through the new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
> 
> nit: memory

Ya!

> 
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When hole-punched, KVM can get notified through
> > invalidate_start/invalidate_end() callbacks, KVM then gets chance to
> > remove any mapped entries of the range in the secondary page tables.
> >
> > Machine check can happen for memory pages in the restricted memfd,
> > instead of routing this directly to userspace, we call the error()
> > callback that KVM registered. KVM then gets chance to handle it
> > correctly.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> 
> Reviewed-by: Fuad Tabba <tabba@...gle.com>
> After wiring the system call for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@...gle.com>

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@...ux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  71 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/memory-failure.c                    |   3 +
> >  mm/restrictedmem.c                     | 318 +++++++++++++++++++++++++
> >  11 files changed, 408 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448    i386    process_mrelease        sys_process_mrelease
> >  449    i386    futex_waitv             sys_futex_waitv
> >  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> > +451    i386    memfd_restricted        sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448    common  process_mrelease        sys_process_mrelease
> >  449    common  futex_waitv             sys_futex_waitv
> >  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451    common  memfd_restricted        sys_memfd_restricted
> >
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +                                pgoff_t start, pgoff_t end);
> > +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +       void (*error)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +       struct list_head list;
> > +       const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping);
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                                        struct page **pagep, int *order)
> > +{
> > +       return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > +                                           struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >                                             unsigned long home_node,
> >                                             unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
> >  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
> >  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 57e1d8c5b505..06b0e1d6b8c1 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >
> > +config RESTRICTEDMEM
> > +       bool
> > +       depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >         bool "Anonymous VMA name support"
> >         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 8e105e5b3e29..bcbb0edf9ba1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 145bb561ddb3..f91b444e471e 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -62,6 +62,7 @@
> >  #include <linux/page-isolation.h>
> >  #include <linux/pagewalk.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/restrictedmem.h>
> >  #include "swap.h"
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> > @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >                 goto out;
> >         }
> >
> > +       restrictedmem_error_page(p, mapping);
> > +
> >         /*
> >          * The shmem page is kept in page cache instead of truncating
> >          * so is expected to have an extra refcount after error-handling.
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +       struct mutex lock;
> > +       struct file *memfd;
> > +       struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > +                                          pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate_start(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
> > +                                        pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->invalidate_end(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static void restrictedmem_notifier_error(struct restrictedmem_data *data,
> > +                                        pgoff_t start, pgoff_t end)
> > +{
> > +       struct restrictedmem_notifier *notifier;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_for_each_entry(notifier, &data->notifiers, list) {
> > +               notifier->ops->error(notifier, start, end);
> > +       }
> > +       mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +       fput(data->memfd);
> > +       kfree(data);
> > +       return 0;
> > +}
> > +
> > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
> > +                                    loff_t offset, loff_t len)
> > +{
> > +       int ret;
> > +       pgoff_t start, end;
> > +       struct file *memfd = data->memfd;
> > +
> > +       if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +               return -EINVAL;
> > +
> > +       start = offset >> PAGE_SHIFT;
> > +       end = (offset + len) >> PAGE_SHIFT;
> > +
> > +       restrictedmem_invalidate_start(data, start, end);
> > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +       restrictedmem_invalidate_end(data, start, end);
> > +
> > +       return ret;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +                                   loff_t offset, loff_t len)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +
> > +       if (mode & FALLOC_FL_PUNCH_HOLE)
> > +               return restrictedmem_punch_hole(data, mode, offset, len);
> > +
> > +       return memfd->f_op->fallocate(memfd, mode, offset, len);
> > +}
> > +
> > +static const struct file_operations restrictedmem_fops = {
> > +       .release = restrictedmem_release,
> > +       .fallocate = restrictedmem_fallocate,
> > +};
> > +
> > +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > +                                const struct path *path, struct kstat *stat,
> > +                                u32 request_mask, unsigned int query_flags)
> > +{
> > +       struct inode *inode = d_inode(path->dentry);
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +
> > +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> > +                                            request_mask, query_flags);
> > +}
> > +
> > +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> > +                                struct dentry *dentry, struct iattr *attr)
> > +{
> > +       struct inode *inode = d_inode(dentry);
> > +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       int ret;
> > +
> > +       if (attr->ia_valid & ATTR_SIZE) {
> > +               if (memfd->f_inode->i_size)
> > +                       return -EPERM;
> > +
> > +               if (!PAGE_ALIGNED(attr->ia_size))
> > +                       return -EINVAL;
> > +       }
> > +
> > +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> > +                                           file_dentry(memfd), attr);
> > +       return ret;
> > +}
> > +
> > +static const struct inode_operations restrictedmem_iops = {
> > +       .getattr = restrictedmem_getattr,
> > +       .setattr = restrictedmem_setattr,
> > +};
> > +
> > +static int restrictedmem_init_fs_context(struct fs_context *fc)
> > +{
> > +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> > +               return -ENOMEM;
> > +
> > +       fc->s_iflags |= SB_I_NOEXEC;
> > +       return 0;
> > +}
> > +
> > +static struct file_system_type restrictedmem_fs = {
> > +       .owner          = THIS_MODULE,
> > +       .name           = "memfd:restrictedmem",
> > +       .init_fs_context = restrictedmem_init_fs_context,
> > +       .kill_sb        = kill_anon_super,
> > +};
> > +
> > +static struct vfsmount *restrictedmem_mnt;
> > +
> > +static __init int restrictedmem_init(void)
> > +{
> > +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> > +       if (IS_ERR(restrictedmem_mnt))
> > +               return PTR_ERR(restrictedmem_mnt);
> > +       return 0;
> > +}
> > +fs_initcall(restrictedmem_init);
> > +
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +       struct restrictedmem_data *data;
> > +       struct address_space *mapping;
> > +       struct inode *inode;
> > +       struct file *file;
> > +
> > +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +       if (!data)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       data->memfd = memfd;
> > +       mutex_init(&data->lock);
> > +       INIT_LIST_HEAD(&data->notifiers);
> > +
> > +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +       if (IS_ERR(inode)) {
> > +               kfree(data);
> > +               return ERR_CAST(inode);
> > +       }
> > +
> > +       inode->i_mode |= S_IFREG;
> > +       inode->i_op = &restrictedmem_iops;
> > +       inode->i_mapping->private_data = data;
> > +
> > +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > +                                "restrictedmem", O_RDWR,
> > +                                &restrictedmem_fops);
> > +       if (IS_ERR(file)) {
> > +               iput(inode);
> > +               kfree(data);
> > +               return ERR_CAST(file);
> > +       }
> > +
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       /*
> > +        * These pages are currently unmovable so don't place them into movable
> > +        * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > +        */
> > +       mapping = memfd->f_mapping;
> > +       mapping_set_unevictable(mapping);
> > +       mapping_set_gfp_mask(mapping,
> > +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > +
> > +       return file;
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> > +       struct file *file, *restricted_file;
> > +       int fd, err;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       fd = get_unused_fd_flags(0);
> > +       if (fd < 0)
> > +               return fd;
> > +
> > +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +       if (IS_ERR(file)) {
> > +               err = PTR_ERR(file);
> > +               goto err_fd;
> > +       }
> > +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       restricted_file = restrictedmem_file_create(file);
> > +       if (IS_ERR(restricted_file)) {
> > +               err = PTR_ERR(restricted_file);
> > +               fput(file);
> > +               goto err_fd;
> > +       }
> > +
> > +       fd_install(fd, restricted_file);
> > +       return fd;
> > +err_fd:
> > +       put_unused_fd(fd);
> > +       return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_del(&notifier->list);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       struct folio *folio;
> > +       struct page *page;
> > +       int ret;
> > +
> > +       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
> > +       if (ret)
> > +               return ret;
> > +
> > +       page = folio_file_page(folio, offset);
> > +       *pagep = page;
> > +       if (order)
> > +               *order = thp_order(compound_head(page));
> > +
> > +       SetPageUptodate(page);
> > +       unlock_page(page);
> > +
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > +
> > +void restrictedmem_error_page(struct page *page, struct address_space *mapping)
> > +{
> > +       struct super_block *sb = restrictedmem_mnt->mnt_sb;
> > +       struct inode *inode, *next;
> > +
> > +       if (!shmem_mapping(mapping))
> > +               return;
> > +
> > +       spin_lock(&sb->s_inode_list_lock);
> > +       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> > +               struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +               struct file *memfd = data->memfd;
> > +
> > +               if (memfd->f_mapping == mapping) {
> > +                       pgoff_t start, end;
> > +
> > +                       spin_unlock(&sb->s_inode_list_lock);
> > +
> > +                       start = page->index;
> > +                       end = start + thp_nr_pages(page);
> > +                       restrictedmem_notifier_error(data, start, end);
> > +                       return;
> > +               }
> > +       }
> > +       spin_unlock(&sb->s_inode_list_lock);
> > +}
> > --
> > 2.25.1
> >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ