Message-ID: <CACSyD1OFjROw26+2ojG37eDBParVg721x1HCROMiF2pW2aHj8A@mail.gmail.com>
Date: Fri, 24 Nov 2023 16:13:41 +0800
From: Zhongkun He <hezhongkun.hzk@...edance.com>
To: Vinicius Petrucci <vpetrucci@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...r.kernel.org,
linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arch@...r.kernel.org, linux-api@...r.kernel.org,
minchan@...nel.org, dave.hansen@...ux.intel.com, x86@...nel.org,
Jonathan.Cameron@...wei.com, aneesh.kumar@...ux.ibm.com,
gregory.price@...verge.com, ying.huang@...el.com,
dan.j.williams@...el.com, fvdl@...gle.com, surenb@...gle.com,
rientjes@...gle.com, hannes@...xchg.org, mhocko@...e.com,
Hasan.Maruf@....com, jgroves@...ron.com, ravis.opensrc@...ron.com,
sthanneeru@...ron.com, emirakhur@...ron.com,
vtavarespetr@...ron.com
Subject: Re: [RFC PATCH] mm/mbind: Introduce process_mbind() syscall for
external memory binding

Hi Vinicius!

On Thu, Nov 23, 2023 at 5:32 AM Vinicius Petrucci <vpetrucci@...il.com> wrote:
>
> From: Vinicius Tavares Petrucci <vtavarespetr@...ron.com>
>
> This patch introduces `process_mbind()` to enable a userspace orchestrator with
> an understanding of another process's memory layout to alter its memory policy.
> As system memory configurations become increasingly complex (e.g., DDR+HBM+CXL memories),
> such a userspace orchestrator can explore more advanced techniques to guide memory placement
> to individual NUMA nodes across memory tiers. This allows for a more efficient allocation of
> memory resources, leading to enhanced application performance.
>
> Alternatively, there are existing methods such as LD_PRELOAD (https://pmem.io/memkind/) or
> syscall_intercept (https://github.com/pmem/syscall_intercept), but these techniques, beyond the
> lack of portability/universality, can lead to system incompatibility issues, inconsistency in
> application behavior, potential risks due to global system-wide settings, and increased
> complexity in implementation.
>
> The concept of an external entity that understands the layout of another process's VM
> is already present with `process_madvise()`. Thus, it seems reasonable to introduce
> the `process_mbind` variant of `mbind`. The implementation framework of `process_mbind()`
> is akin to `process_madvise()`. It uses the pidfd of an external process to direct the memory
> policy and supports a vector of memory address ranges.
>
> The general use case here is similar to the prior RFC `pidfd_set_mempolicy()`
> (https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/),
> but offers a more fine-grained external control by binding specific memory regions
> (say, heap data structures) to specific NUMA nodes. Another concrete use case was described
> by a prior work showing up to 2X runtime improvement (compared to AutoNUMA tiering) using
> memory object/region-based memory placement for workloads with irregular access patterns
> such as graph analytics: https://iiswc.org/iiswc2022/IISWC2022_42.pdf
>
> The proposed API is as follows:
>
> long process_mbind(int pidfd,
>                    const struct iovec *iovec,
>                    unsigned long vlen,
>                    unsigned long mode,
>                    const unsigned long *nmask,
>                    unsigned int flags);
>
> The `pidfd` argument is used to select the process identified by the PID file
> descriptor provided in pidfd. (See pidfd_open(2) for more information.)
>
> The pointer `iovec` points to an array of iovec structures (as described in <sys/uio.h>):
>
> struct iovec {
>     void  *iov_base;    /* starting address of region */
>     size_t iov_len;     /* size of region (in bytes) */
> };
>
> The `iovec` defines memory regions that start at the address (iov_base) and
> have a size measured in bytes (iov_len).
Good idea.
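Just to make the intended usage concrete, here is a minimal userspace
sketch. The syscall number 457 is taken from this patch (x86-64), there
is no libc wrapper yet, and the target pid and addresses are
hypothetical -- a real orchestrator would obtain them from the target
process:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef __NR_process_mbind
    #define __NR_process_mbind 457      /* x86-64 number from this patch */
    #endif

    #define MPOL_BIND       2           /* from <numaif.h> */
    #define MPOL_MF_MOVE    (1 << 1)    /* from <numaif.h> */

    int main(void)
    {
            /* Get a pidfd for the target process (hypothetical pid). */
            int pidfd = (int)syscall(SYS_pidfd_open, 1234, 0);
            if (pidfd < 0) {
                    perror("pidfd_open");
                    return 1;
            }

            /*
             * maxnode is hard-coded to MAX_NUMNODES in this patch, so
             * pass a mask buffer big enough for any common kernel
             * config (16 longs = 1024 bits).
             */
            unsigned long nodemask[16] = { 1UL << 1 }; /* bind to node 1 */

            /* Two 1 MiB regions of the target's address space. */
            struct iovec iov[2] = {
                    { .iov_base = (void *)0x7f0000000000UL, .iov_len = 1UL << 20 },
                    { .iov_base = (void *)0x7f0000200000UL, .iov_len = 1UL << 20 },
            };

            long ret = syscall(__NR_process_mbind, pidfd, iov, 2UL,
                               (unsigned long)MPOL_BIND, nodemask,
                               (unsigned int)MPOL_MF_MOVE);
            if (ret < 0)
                    perror("process_mbind");
            return ret < 0;
    }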
>
> The `vlen` argument specifies the number of elements contained in iovec.
>
> Please note that the `maxnode` parameter from the original `mbind` was
> omitted to keep the API within 6 arguments. Instead, the constant
> MAX_NUMNODES is used.
The original parameters should not be omitted; the patch from
Gregory Price, which puts all of them together, is a good solution.
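To make this concrete: with mbind(2) the caller bounds how many mask
bits the kernel reads, whereas the hard-coded MAX_NUMNODES removes that
bound. A sketch with a hypothetical kernel config of MAX_NUMNODES =
1024:

    unsigned long mask = 1UL << 1;      /* one 64-bit word */

    /* mbind(2): the caller caps the copy at 64 bits -- safe. */
    mbind(addr, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);

    /* As proposed, the kernel reads MAX_NUMNODES bits (1024 bits =
     * 128 bytes here), far past the single word the caller supplied. */
    process_mbind(pidfd, iov, vlen, MPOL_BIND, &mask, 0);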
>
> Please see the mbind(2) man page for more details about the other arguments.
>
> Additionally, it is worth noting the following:
> - Using a vector of address ranges as an argument in `process_mbind` provides more
> flexibility than the original `mbind` system call, even when invoked on the current
> (local) process.
> - In contrast to `move_pages`, which requires an array of fixed-size pages,
> `process_mbind` (with the MPOL_MF_MOVE* flags) offers a more convenient and flexible
> page migration capability on a per-object or per-region basis (see the sketch below).
> - Similar to `process_madvise`, manipulating the memory binding of external processes
> necessitates `CAP_SYS_NICE` and `PTRACE_MODE_READ_FSCREDS` checks (refer to ptrace(2)).
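For comparison, migrating the same two 1 MiB regions as in the sketch
above with move_pages(2) takes one array slot per page; a sketch
assuming 4 KiB pages and a hypothetical base address:

    #include <numaif.h>

    #define NPAGES 512                  /* 2 MiB / 4 KiB */

    void *pages[NPAGES];
    int nodes[NPAGES], status[NPAGES];

    for (int i = 0; i < NPAGES; i++) {
            pages[i] = (char *)base + i * 4096; /* one entry per page */
            nodes[i] = 1;                       /* target node 1 */
    }
    move_pages(pid, NPAGES, pages, nodes, status, MPOL_MF_MOVE);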
>
> Suggested-by: Frank van der Linden <fvdl@...gle.com>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@...ron.com>
> Signed-off-by: Hasan Al Maruf <Hasan.Maruf@....com>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/syscalls.h               |  4 ++
>  include/uapi/asm-generic/unistd.h      |  4 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/mempolicy.c                         | 86 +++++++++++++++++++++++++-
>  5 files changed, 92 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 8cb8bf68721c..9d9db49a3242 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -378,6 +378,7 @@
>  454	common	futex_wake		sys_futex_wake
>  455	common	futex_wait		sys_futex_wait
>  456	common	futex_requeue		sys_futex_requeue
> +457	common	process_mbind		sys_process_mbind
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index fd9d12de7e92..def5250ed625 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -816,6 +816,10 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
>  				const unsigned long __user *nmask,
>  				unsigned long maxnode,
>  				unsigned flags);
> +asmlinkage long sys_process_mbind(int pidfd, const struct iovec __user *vec,
> +				size_t vlen, unsigned long mode,
> +				const unsigned long __user *nmask,
> +				unsigned flags);
>  asmlinkage long sys_get_mempolicy(int __user *policy,
>  				unsigned long __user *nmask,
>  				unsigned long maxnode,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 756b013fb832..9ed2c91940d6 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
> __SYSCALL(__NR_futex_wait, sys_futex_wait)
> #define __NR_futex_requeue 456
> __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
> +#define __NR_process_mbind 457
> +__SYSCALL(__NR_process_mbind, sys_process_mbind)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 457
> +#define __NR_syscalls 458
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index e1a6e3c675c0..cc5cb5ae3ae7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
> COND_SYSCALL(process_mrelease);
> COND_SYSCALL(remap_file_pages);
> COND_SYSCALL(mbind);
> +COND_SYSCALL(process_mbind);
> COND_SYSCALL(get_mempolicy);
> COND_SYSCALL(set_mempolicy);
> COND_SYSCALL(migrate_pages);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..91ee300fa728 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
> }
> #endif
>
> -static long do_mbind(unsigned long start, unsigned long len,
> +static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
>  		     unsigned short mode, unsigned short mode_flags,
>  		     nodemask_t *nmask, unsigned long flags)
>  {
> -	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma, *prev;
>  	struct vma_iterator vmi;
>  	struct migration_mpol mmpol;
> @@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> return 0;
> }
>
> +static long kernel_mbind_process(int pidfd, const struct iovec __user *vec,
> +				 size_t vlen, unsigned long mode,
> +				 const unsigned long __user *nmask, unsigned int flags)
> +{
> +	ssize_t ret;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int f_flags;
> +	unsigned short mode_flags;
> +	int lmode = mode;
> +	unsigned long maxnode = MAX_NUMNODES;
> +	nodemask_t nodes;
> +
> +	/* Note: the error must land in ret, since "out" returns ret. */
> +	ret = sanitize_mpol_flags(&lmode, &mode_flags);
> +	if (ret)
> +		goto out;
> +
> +	ret = get_nodes(&nodes, nmask, maxnode);
> +	if (ret)
> +		goto out;
> +
> +	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
> +			   &iov, &iter);
> +	if (ret < 0)
> +		goto out;
> +
> +	task = pidfd_get_task(pidfd, &f_flags);
> +	if (IS_ERR(task)) {
> +		ret = PTR_ERR(task);
> +		goto free_iov;
> +	}
> +
> +	/*
> +	 * From process_madvise: require PTRACE_MODE_READ to avoid
> +	 * leaking ASLR metadata.
> +	 */
> +	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> +	if (IS_ERR_OR_NULL(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		goto release_task;
> +	}
> +
> +	/*
> +	 * From process_madvise: require CAP_SYS_NICE for influencing
> +	 * process performance.
> +	 */
> +	if (!capable(CAP_SYS_NICE)) {
> +		ret = -EPERM;
> +		goto release_mm;
> +	}
> +
> +	/* Apply the policy to each address range in turn. */
> +	while (iov_iter_count(&iter)) {
> +		unsigned long start = untagged_addr(
> +				(unsigned long)iter_iov_addr(&iter));
> +		unsigned long len = iter_iov_len(&iter);
> +
> +		ret = do_mbind(mm, start, len, lmode, mode_flags,
> +			       &nodes, flags);
> +		if (ret < 0)
> +			break;
> +		iov_iter_advance(&iter, iter_iov_len(&iter));
> +	}
> +
> +release_mm:
> +	mmput(mm);
> +release_task:
> +	put_task_struct(task);
> +free_iov:
> +	kfree(iov);
> +out:
> +	return ret;
> +}
> +
do_mbind() relies on the current task to obtain the nodemask and the
task policy, so the modification above (passing in the target mm) is
not enough on its own.
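One concrete instance (paraphrasing mainline mm/mempolicy.c, not
quoting it verbatim): on the do_mbind() path the new policy's nodemask
is folded against the *calling* task's cpuset, so for a remote target
the mask would be validated against the orchestrator's mems_allowed
rather than the target's:

    /* roughly, do_mbind() -> mpol_set_nodemask(): */
    nodes_and(nsc->mask1, cpuset_current_mems_allowed,
              node_states[N_MEMORY]);   /* "current" is the caller */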
>  static long kernel_mbind(unsigned long start, unsigned long len,
>  			 unsigned long mode, const unsigned long __user *nmask,
>  			 unsigned long maxnode, unsigned int flags)
>  {
> +	struct mm_struct *mm = current->mm;
>  	unsigned short mode_flags;
>  	nodemask_t nodes;
>  	int lmode = mode;
> @@ -1483,7 +1556,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>  	if (err)
>  		return err;
>  
> -	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
> +	return do_mbind(mm, start, len, lmode, mode_flags, &nodes, flags);
>  }
>
> SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
> @@ -1553,6 +1626,13 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
>  	return err;
>  }
>  
> +SYSCALL_DEFINE6(process_mbind, int, pidfd, const struct iovec __user *, vec,
> +		size_t, vlen, unsigned long, mode,
> +		const unsigned long __user *, nmask, unsigned int, flags)
> +{
> +	return kernel_mbind_process(pidfd, vec, vlen, mode, nmask, flags);
> +}
> +
>  SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
>  		unsigned long, mode, const unsigned long __user *, nmask,
>  		unsigned long, maxnode, unsigned int, flags)
> --
> 2.41.0
>
Per my understanding, process_mbind() is implementable without many
difficult challenges, since the VMA policies it modifies are always
protected by mm->mmap_lock. But the task mempolicy is read without
taking any lock in alloc_pages().
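To make the asymmetry concrete (again paraphrasing mainline code, not
quoting it verbatim): the allocation fast path dereferences the task
policy of current without holding any lock, which is only safe because
a task normally updates its own policy:

    /* roughly, in alloc_pages(): */
    struct mempolicy *pol = &default_policy;

    if (!in_interrupt() && !(gfp & __GFP_THISNODE))
            pol = get_task_policy(current);     /* no lock taken */

So per-VMA binding from outside looks tractable, while changing another
task's mempolicy would need a way to synchronize against these
lock-free readers.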