Message-ID: <20260115205407.3050262-2-atomlin@atomlin.com>
Date: Thu, 15 Jan 2026 15:54:07 -0500
From: Aaron Tomlin <atomlin@...mlin.com>
To: oleg@...hat.com,
	akpm@...ux-foundation.org,
	gregkh@...uxfoundation.org,
	david@...nel.org,
	brauner@...nel.org,
	mingo@...nel.org
Cc: neelx@...e.com,
	sean@...e.io,
	linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org
Subject: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status

Introduce two new fields, "Cpus_active_mm" and "Cpus_active_mm_list",
to /proc/[pid]/status. They display the set of CPUs on which the
process's memory context is active, i.e. the mm_cpumask, in mask and
list format respectively. The mm_cpumask is used primarily for TLB
and cache synchronisation.
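
For example, on an 8-CPU system the two entries might look as follows
(the values shown are illustrative only):

  Cpus_active_mm:	2d
  Cpus_active_mm_list:	0,2-3,5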

Exposing this information lets userspace correlate the CPUs on which a
memory descriptor is "active" with the CPUs on which the task is
allowed to execute. The primary intent is to provide visibility into a
process's "memory footprint" across CPUs, which is invaluable when
debugging performance problems caused by IPI storms and TLB shootdowns
on large NUMA systems. The CPU affinity mask sets the boundary; the
mm_cpumask records where the memory context has actually been active;
the two complement each other.
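
As a rough illustration only (not part of this patch; error handling
is minimal), a small userspace sketch along these lines could print
both sets of fields for side-by-side comparison:

  #include <stdio.h>
  #include <string.h>

  /* Print the "Cpus_*" lines from /proc/self/status so the affinity
   * mask and the active-mm mask can be compared side by side. */
  int main(void)
  {
  	FILE *f = fopen("/proc/self/status", "r");
  	char line[256];

  	if (!f) {
  		perror("fopen");
  		return 1;
  	}
  	while (fgets(line, sizeof(line), f))
  		if (strncmp(line, "Cpus_", 5) == 0)
  			fputs(line, stdout);
  	fclose(f);
  	return 0;
  }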

Frequent mm_cpumask changes may indicate instability in placement
policies or excessive task migration overhead.

These fields are exposed only on architectures that explicitly opt-in
via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
mm_cpumask semantics vary significantly across architectures; some
(e.g., x86) actively maintain the mask for coherency, while others may
never clear bits, rendering the data misleading for this specific use
case. x86 is updated to select the new option.
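
An architecture whose mm_cpumask has comparable semantics could opt in
with a one-line Kconfig change along these lines ("ARCH_FOO" is a
placeholder; whether a given architecture actually qualifies must be
evaluated case by case):

  config ARCH_FOO
  	select ARCH_WANT_PROC_CPUS_ACTIVE_MM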

The implementation reads the mask directly, without taking additional
locks or a snapshot. On a rapidly changing system the hex mask and the
list format can therefore observe slightly different states; for
example, the mask may include a CPU that the list, printed a moment
later, no longer contains. This "best-effort" behaviour matches the
usual design philosophy of /proc and avoids imposing locking overhead
on hot memory-management paths.

Signed-off-by: Aaron Tomlin <atomlin@...mlin.com>
---
 Documentation/filesystems/proc.rst |  7 +++++++
 arch/x86/Kconfig                   |  1 +
 fs/proc/Kconfig                    | 14 ++++++++++++++
 fs/proc/array.c                    | 28 +++++++++++++++++++++++++++-
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c6ced84c5c68 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,12 +291,19 @@ It's slow but very precise.
  SpeculationIndirectBranch   indirect branch speculation mode
  Cpus_allowed                mask of CPUs on which this process may run
  Cpus_allowed_list           Same as previous, but in "list format"
+ Cpus_active_mm              mask of CPUs on which this process has an active
+                             memory context
+ Cpus_active_mm_list         Same as previous, but in "list format"
  Mems_allowed                mask of memory nodes allowed to this process
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
  ==========================  ===================================================
 
+Note that "Cpus_active_mm" is currently supported only on x86. Its semantics
+are architecture-dependent; on x86, it represents the set of CPUs that may
+hold stale TLB entries for the process and therefore require IPI-based TLB
+shootdowns to maintain coherency.
 
 .. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 80527299f859..f0997791dbdb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	select ARCH_WANT_PROC_CPUS_ACTIVE_MM
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_WATCHDOG
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 6ae966c561e7..952c40cf3baa 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -127,3 +127,17 @@ config PROC_PID_ARCH_STATUS
 config PROC_CPU_RESCTRL
 	def_bool n
 	depends on PROC_FS
+
+config ARCH_WANT_PROC_CPUS_ACTIVE_MM
+	bool
+	depends on PROC_FS
+	help
+	  Selected by architectures that reliably maintain mm_cpumask for TLB
+	  and cache synchronisation and wish to expose it in
+	  /proc/[pid]/status. Exposing this information lets userspace
+	  correlate the CPUs on which a memory descriptor is "active" with
+	  the CPUs on which the task is allowed to execute. The primary
+	  intent is to provide visibility into a process's "memory footprint"
+	  across CPUs, which is invaluable when debugging performance
+	  problems caused by IPI storms and TLB shootdowns on large NUMA
+	  systems.
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 42932f88141a..c16aad59e0a7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -409,6 +409,29 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
 		   cpumask_pr_args(&task->cpus_mask));
 }
 
+#ifdef CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM
+/**
+ * task_cpus_active_mm - Show the mm_cpumask for a process
+ * @m: The seq_file structure for the /proc/PID/status output
+ * @mm: The memory descriptor of the process
+ *
+ * Prints the set of CPUs on which the process's memory context is
+ * active, i.e. the mm_cpumask, in both mask and list format. The
+ * mask is used primarily for TLB and cache synchronisation.
+ */
+static void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
+{
+	seq_printf(m, "Cpus_active_mm:\t%*pb\n",
+		   cpumask_pr_args(mm_cpumask(mm)));
+	seq_printf(m, "Cpus_active_mm_list:\t%*pbl\n",
+		   cpumask_pr_args(mm_cpumask(mm)));
+}
+#else
+static inline void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
+{
+}
+#endif
+
 static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
 {
 	seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
@@ -450,12 +473,15 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 		task_core_dumping(m, task);
 		task_thp_status(m, mm);
 		task_untag_mask(m, mm);
-		mmput(mm);
 	}
 	task_sig(m, task);
 	task_cap(m, task);
 	task_seccomp(m, task);
 	task_cpus_allowed(m, task);
+	if (mm) {
+		task_cpus_active_mm(m, mm);
+		mmput(mm);
+	}
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
 	arch_proc_pid_thread_features(m, task);
-- 
2.51.0

