Message-ID: <ZV5zGROLefrsEcHJ@r13-u19.micron.com>
Date:   Wed, 22 Nov 2023 15:31:05 -0600
From:   Vinicius Petrucci <vpetrucci@...il.com>
To:     akpm@...ux-foundation.org
Cc:     linux-mm@...r.kernel.org, linux-cxl@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
        linux-api@...r.kernel.org, minchan@...nel.org,
        dave.hansen@...ux.intel.com, x86@...nel.org,
        Jonathan.Cameron@...wei.com, aneesh.kumar@...ux.ibm.com,
        gregory.price@...verge.com, ying.huang@...el.com,
        dan.j.williams@...el.com, hezhongkun.hzk@...edance.com,
        fvdl@...gle.com, surenb@...gle.com, rientjes@...gle.com,
        hannes@...xchg.org, mhocko@...e.com, Hasan.Maruf@....com,
        jgroves@...ron.com, ravis.opensrc@...ron.com,
        sthanneeru@...ron.com, emirakhur@...ron.com,
        vtavarespetr@...ron.com
Subject: [RFC PATCH] mm/mbind: Introduce process_mbind() syscall for external
 memory binding

From: Vinicius Tavares Petrucci <vtavarespetr@...ron.com>

This patch introduces `process_mbind()`, which lets a userspace orchestrator that understands
another process's memory layout alter that process's memory policy. As system memory
configurations grow more complex (e.g., DDR+HBM+CXL memories), such an orchestrator can apply
more advanced techniques to guide memory placement to individual NUMA nodes across memory tiers,
allowing memory resources to be allocated more efficiently and improving application performance.

Existing alternatives such as LD_PRELOAD (https://pmem.io/memkind/) or
syscall_intercept (https://github.com/pmem/syscall_intercept) could be used instead, but beyond
their lack of portability and universality, these techniques can lead to system incompatibilities,
inconsistent application behavior, potential risks from global system-wide settings, and added
implementation complexity.

The concept of an external entity that understands the layout of another process's VM
already exists with `process_madvise()`. Thus, it seems reasonable to introduce
a `process_mbind` variant of `mbind`. The implementation of `process_mbind()`
mirrors that of `process_madvise()`: it takes the pidfd of an external process to direct that
process's memory policy and supports a vector of memory address ranges.

The general use case here is similar to the prior RFC `pidfd_set_mempolicy()` 
(https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/), 
but offers more fine-grained external control by binding specific memory regions
(say, heap data structures) to specific NUMA nodes. Another concrete use case was described
in prior work showing up to 2X runtime improvement (compared to AutoNUMA tiering) using
memory object/region-based memory placement for workloads with irregular access patterns
such as graph analytics: https://iiswc.org/iiswc2022/IISWC2022_42.pdf

The proposed API is as follows:

long process_mbind(int pidfd, 
                const struct iovec *iovec, 
                unsigned long vlen, 
                unsigned long mode, 
                const unsigned long *nmask,
                unsigned int flags);

The `pidfd` argument selects the target process via the PID file descriptor provided
in pidfd. (See pidfd_open(2) for more information.)

The pointer `iovec` points to an array of iovec structures (as described in <sys/uio.h>):

struct iovec {
    void *iov_base;         /* starting address of region */
    size_t iov_len;         /* size of region (in bytes) */
};

Each `iovec` element describes a memory region starting at the address iov_base and
spanning iov_len bytes.

The `vlen` argument specifies the number of elements in the iovec array.

Please note that the `maxnode` parameter from `mbind` was omitted to keep the API
within 6 arguments. The constant MAX_NUMNODES is used instead.

Please see the mbind(2) man page for more details about the remaining arguments.
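
For illustration, below is a minimal userspace sketch of how an orchestrator could use the
proposed call. This is only a sketch: it assumes the x86-64 syscall number 457 added by this
patch, glibc/kernel headers that define SYS_pidfd_open, and the numactl <numaif.h> header for
the MPOL_* constants; the target PID and region are hypothetical command-line inputs.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <numaif.h>		/* MPOL_BIND, MPOL_MF_MOVE (numactl headers) */

#ifndef __NR_process_mbind
#define __NR_process_mbind 457	/* x86-64 number proposed by this patch */
#endif

/* No glibc wrapper exists for this RFC syscall, so call it directly. */
static long process_mbind(int pidfd, const struct iovec *iov,
			  unsigned long vlen, unsigned long mode,
			  const unsigned long *nmask, unsigned int flags)
{
	return syscall(__NR_process_mbind, pidfd, iov, vlen, mode,
		       nmask, flags);
}

int main(int argc, char *argv[])
{
	if (argc < 4)
		return 1;

	/* Hypothetical inputs: target PID plus a region (start, length)
	 * discovered by the orchestrator, e.g. from /proc/<pid>/maps. */
	pid_t pid = (pid_t)atoi(argv[1]);
	struct iovec region = {
		.iov_base = (void *)strtoul(argv[2], NULL, 0),
		.iov_len  = strtoul(argv[3], NULL, 0),
	};

	/* Since maxnode is implicitly MAX_NUMNODES, size the nodemask
	 * buffer generously; bit 2 selects NUMA node 2 (e.g. a CXL node). */
	unsigned long nodemask[1024 / (8 * sizeof(unsigned long))] = { 0 };
	nodemask[0] = 1UL << 2;

	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* Bind the region to node 2 and migrate already-allocated pages. */
	if (process_mbind(pidfd, &region, 1, MPOL_BIND, nodemask,
			  MPOL_MF_MOVE) < 0) {
		perror("process_mbind");
		return 1;
	}
	return 0;
}

Note that the nodemask buffer is sized for many nodes rather than a single word: since maxnode
is implicitly MAX_NUMNODES, the kernel may read more than one unsigned long from the user buffer.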

Additionally, it is worth noting the following:
- Using a vector of address ranges as an argument to `process_mbind` provides more
flexibility than the original `mbind` system call, even when invoked on the current
(local) process.
- In contrast to `move_pages`, which requires an array of per-page addresses,
`process_mbind` (with the MPOL_MF_MOVE* flags) offers a more convenient and flexible page
migration capability on a per-object or per-region basis (see the sketch after this list).
- Similar to `process_madvise`, manipulating the memory binding of external processes
requires `CAP_SYS_NICE` and passing the `PTRACE_MODE_READ_FSCREDS` check (refer to ptrace(2)).
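
To make the move_pages contrast above concrete, the sketch below migrates a single 2 MiB region
of a target process to node 2 both ways. The values are illustrative only; process_mbind() is
the userspace wrapper from the earlier sketch, and move_pages() is the wrapper declared in
<numaif.h> (libnuma).

#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <numaif.h>	/* move_pages(), MPOL_BIND, MPOL_MF_MOVE */

#define REGION_LEN	(2UL << 20)		/* one 2 MiB region (illustrative) */
#define PAGE_SZ		4096UL
#define NPAGES		(REGION_LEN / PAGE_SZ)	/* 512 per-page entries */

/* move_pages(): one pointer, one node and one status slot per 4 KiB page. */
static long migrate_with_move_pages(pid_t pid, void *buf)
{
	static void *pages[NPAGES];
	static int nodes[NPAGES], status[NPAGES];

	for (size_t i = 0; i < NPAGES; i++) {
		pages[i] = (char *)buf + i * PAGE_SZ;
		nodes[i] = 2;			/* target NUMA node 2 */
	}
	return move_pages(pid, NPAGES, pages, nodes, status, MPOL_MF_MOVE);
}

/* process_mbind(): the whole region is a single iovec entry. */
static long migrate_with_process_mbind(int pidfd, void *buf,
				       const unsigned long *nodemask)
{
	struct iovec region = { .iov_base = buf, .iov_len = REGION_LEN };

	return process_mbind(pidfd, &region, 1, MPOL_BIND, nodemask,
			     MPOL_MF_MOVE);
}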

Suggested-by: Frank van der Linden <fvdl@...gle.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@...ron.com>
Signed-off-by: Hasan Al Maruf <Hasan.Maruf@....com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h               |  4 ++
 include/uapi/asm-generic/unistd.h      |  4 +-
 kernel/sys_ni.c                        |  1 +
 mm/mempolicy.c                         | 86 +++++++++++++++++++++++++-
 5 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..9d9db49a3242 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	process_mbind		sys_process_mbind
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..def5250ed625 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -816,6 +816,10 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_process_mbind(int pidfd, const struct iovec __user *vec,
+				size_t vlen, unsigned long mode,
+				const unsigned long __user *nmask,
+				unsigned flags);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..9ed2c91940d6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_process_mbind 457
+__SYSCALL(__NR_process_mbind, sys_process_mbind)
 
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e1a6e3c675c0..cc5cb5ae3ae7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
 COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
+COND_SYSCALL(process_mbind);
 COND_SYSCALL(get_mempolicy);
 COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..91ee300fa728 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1215,11 +1215,10 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 }
 #endif
 
-static long do_mbind(unsigned long start, unsigned long len,
+static long do_mbind(struct mm_struct *mm, unsigned long start, unsigned long len,
 		     unsigned short mode, unsigned short mode_flags,
 		     nodemask_t *nmask, unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
 	struct vma_iterator vmi;
 	struct migration_mpol mmpol;
@@ -1465,10 +1464,84 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	return 0;
 }
 
+static long kernel_mbind_process(int pidfd, const struct iovec __user *vec,
+		size_t vlen, unsigned long mode,
+		const unsigned long __user *nmask, unsigned int flags)
+{
+	ssize_t ret;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	unsigned int f_flags;
+	unsigned short mode_flags;
+	int lmode = mode;
+	unsigned long maxnode = MAX_NUMNODES;
+	int err;
+	nodemask_t nodes;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
+
+	err = get_nodes(&nodes, nmask, maxnode);
+	if (err)
+		return err;
+
+	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack),
+			&iov, &iter);
+	if (ret < 0)
+		goto out;
+
+	task = pidfd_get_task(pidfd, &f_flags);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_iov;
+	}
+
+	/* From process_madvise: Require PTRACE_MODE_READ
+	 * to avoid leaking ASLR metadata. */
+	mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+	if (IS_ERR_OR_NULL(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		goto release_task;
+	}
+
+	/* From process_madvise: Require CAP_SYS_NICE for
+	 * influencing process performance. */
+	if (!capable(CAP_SYS_NICE)) {
+		ret = -EPERM;
+		goto release_mm;
+	}
+
+	while (iov_iter_count(&iter)) {
+		unsigned long start = untagged_addr(
+				(unsigned long)iter_iov_addr(&iter));
+		unsigned long len = iter_iov_len(&iter);
+
+		ret = do_mbind(mm, start, len, lmode, mode_flags,
+				&nodes, flags);
+		if (ret < 0)
+			break;
+		iov_iter_advance(&iter, iter_iov_len(&iter));
+	}
+
+release_mm:
+	mmput(mm);
+release_task:
+	put_task_struct(task);
+free_iov:
+	kfree(iov);
+out:
+	return ret;
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1483,7 +1556,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
-	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+	return do_mbind(mm, start, len, lmode, mode_flags, &nodes, flags);
 }
 
 SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1553,6 +1626,13 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	return err;
 }
 
+SYSCALL_DEFINE6(process_mbind, int, pidfd, const struct iovec __user *, vec,
+		size_t, vlen, unsigned long, mode,
+		const unsigned long __user *, nmask, unsigned int, flags)
+{
+	return kernel_mbind_process(pidfd, vec, vlen, mode, nmask, flags);
+}
+
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 		unsigned long, mode, const unsigned long __user *, nmask,
 		unsigned long, maxnode, unsigned int, flags)
-- 
2.41.0
