Message-ID: <4a1c24ae-29b0-4c3e-a055-789edfed32fc@kernel.org>
Date: Thu, 15 Jan 2026 22:19:08 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Aaron Tomlin <atomlin@...mlin.com>, oleg@...hat.com,
akpm@...ux-foundation.org, gregkh@...uxfoundation.org, brauner@...nel.org,
mingo@...nel.org
Cc: neelx@...e.com, sean@...e.io, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
On 1/15/26 21:54, Aaron Tomlin wrote:
> This patch introduces two new fields to /proc/[pid]/status,
> "Cpus_active_mm" and "Cpus_active_mm_list", which display the set of
> CPUs on which the process's memory context is active, in mask and list
> format respectively. The mm_cpumask is primarily used for TLB and
> cache synchronisation.
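
Side note, for anyone who wants to see the resulting format up front:
with the %*pb/%*pbl seq_printf specifiers used further down, on a
hypothetical 8-CPU machine where the mm has been active on every CPU,
the new lines would look roughly like this (values made up purely for
illustration):

	Cpus_active_mm:	ff
	Cpus_active_mm_list:	0-7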
>
> Exposing this information allows userspace to easily observe the
> relationship between the CPUs where a memory descriptor is "active" and
> the CPUs where the thread is allowed to execute. The primary intent is
> to provide visibility into the "memory footprint" across CPUs, which is
> invaluable for debugging performance issues related to IPI storms and
> TLB shootdowns in large-scale NUMA systems. The CPU affinity sets the
> boundary; the mm_cpumask records where the memory context has actually
> been active; the two complement each other.
>
> Frequent mm_cpumask changes may indicate instability in placement
> policies or excessive task migration overhead.
>
> These fields are exposed only on architectures that explicitly opt-in
> via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
> mm_cpumask semantics vary significantly across architectures; some
> (e.g., x86) actively maintain the mask for coherency, while others may
> never clear bits, rendering the data misleading for this specific use
> case. x86 is updated to select this feature by default.
>
> The implementation reads the mask directly without introducing additional
> locks or snapshots. While this implies that the hex mask and list format
> could theoretically observe slightly different states on a rapidly
> changing system, this "best-effort" approach aligns with the standard
> design philosophy of /proc and avoids imposing locking overhead on
> critical memory management paths.
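
On the consumer side, a quick illustration (my own sketch, untested):
a tool can read the new fields next to Cpus_allowed* straight out of
/proc/[pid]/status, assuming this patch is applied and the kernel was
built with CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM:

	#include <stdio.h>
	#include <string.h>

	/*
	 * Dump all "Cpus_*" lines (Cpus_allowed* plus the new
	 * Cpus_active_mm*) for the PID given as argv[1], defaulting
	 * to "self".
	 */
	int main(int argc, char **argv)
	{
		const char *pid = argc > 1 ? argv[1] : "self";
		char path[64], line[256];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/status", pid);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "Cpus_", 5))
				fputs(line, stdout);	/* print matching lines verbatim */
		}
		fclose(f);
		return 0;
	}

Comparing Cpus_active_mm_list against Cpus_allowed_list over time would
then show how far the "memory footprint" has spread within the allowed
set.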
Yes, restricting to architectures that have the expected semantics is
better.
... but we'd better get the blessing from the x86 folks :)
(CCing the x86 MM folks)
>
> Signed-off-by: Aaron Tomlin <atomlin@...mlin.com>
> ---
> Documentation/filesystems/proc.rst | 7 +++++++
> arch/x86/Kconfig | 1 +
> fs/proc/Kconfig | 14 ++++++++++++++
> fs/proc/array.c | 28 +++++++++++++++++++++++++++-
> 4 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 8256e857e2d7..c6ced84c5c68 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,12 +291,19 @@ It's slow but very precise.
> SpeculationIndirectBranch indirect branch speculation mode
> Cpus_allowed mask of CPUs on which this process may run
> Cpus_allowed_list Same as previous, but in "list format"
> + Cpus_active_mm mask of CPUs on which this process has an active
> + memory context
> + Cpus_active_mm_list Same as previous, but in "list format"
> Mems_allowed mask of memory nodes allowed to this process
> Mems_allowed_list Same as previous, but in "list format"
> voluntary_ctxt_switches number of voluntary context switches
> nonvoluntary_ctxt_switches number of non voluntary context switches
> ========================== ===================================================
>
> +Note "Cpus_active_mm" is currently only supported on x86. Its semantics are
> +architecture-dependent; on x86, it represents the set of CPUs that may hold
> +stale TLB entries for the process and thus require IPI-based TLB shootdowns to
> +maintain coherency.
>
> .. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 80527299f859..f0997791dbdb 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -152,6 +152,7 @@ config X86
> select ARCH_WANTS_THP_SWAP if X86_64
> select ARCH_HAS_PARANOID_L1D_FLUSH
> select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> + select ARCH_WANT_PROC_CPUS_ACTIVE_MM
> select BUILDTIME_TABLE_SORT
> select CLKEVT_I8253
> select CLOCKSOURCE_WATCHDOG
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 6ae966c561e7..952c40cf3baa 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -127,3 +127,17 @@ config PROC_PID_ARCH_STATUS
> config PROC_CPU_RESCTRL
> def_bool n
> depends on PROC_FS
> +
> +config ARCH_WANT_PROC_CPUS_ACTIVE_MM
> + bool
> + depends on PROC_FS
> + help
> + Selected by architectures that reliably maintain mm_cpumask for TLB
> + and cache synchronisation and wish to expose it in
> + /proc/[pid]/status. Exposing this information allows userspace to
> + easily observe the relationship between the CPUs where a memory
> + descriptor is "active" and the CPUs where the thread is allowed to
> + execute. The primary intent is to provide visibility into the
> + "memory footprint" across CPUs, which is invaluable for debugging
> + performance issues related to IPI storms and TLB shootdowns in
> + large-scale NUMA systems.
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 42932f88141a..c16aad59e0a7 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -409,6 +409,29 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
> cpumask_pr_args(&task->cpus_mask));
> }
>
> +#ifdef CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM
> +/**
> + * task_cpus_active_mm - Show the mm_cpumask for a process
> + * @m: The seq_file structure for the /proc/PID/status output
> + * @mm: The memory descriptor of the process
> + *
> + * Prints the set of CPUs on which the process's memory context is
> + * active, in both mask and list format. This mask is primarily used
> + * for TLB and cache synchronisation.
> + */
> +static void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
> +{
> + seq_printf(m, "Cpus_active_mm:\t%*pb\n",
> + cpumask_pr_args(mm_cpumask(mm)));
> + seq_printf(m, "Cpus_active_mm_list:\t%*pbl\n",
> + cpumask_pr_args(mm_cpumask(mm)));
> +}
> +#else
> +static inline void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
> {
> seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
> @@ -450,12 +473,15 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
> task_core_dumping(m, task);
> task_thp_status(m, mm);
> task_untag_mask(m, mm);
> - mmput(mm);
> }
> task_sig(m, task);
> task_cap(m, task);
> task_seccomp(m, task);
> task_cpus_allowed(m, task);
> + if (mm) {
> + task_cpus_active_mm(m, mm);
> + mmput(mm);
> + }
> cpuset_task_status_allowed(m, task);
> task_context_switch_counts(m, task);
> arch_proc_pid_thread_features(m, task);
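
One note for readers skimming the last hunk: the point of moving
mmput() below task_cpus_allowed() is to keep the mm reference alive
for the new helper. Simplified (a sketch, not the literal
proc_pid_status() body), the resulting pattern is:

	struct mm_struct *mm = get_task_mm(task);
	...
	if (mm) {
		task_cpus_active_mm(m, mm);	/* mm pinned, mask safe to read */
		mmput(mm);			/* drop the reference afterwards */
	}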
--
Cheers
David