Message-ID: <4a1c24ae-29b0-4c3e-a055-789edfed32fc@kernel.org>
Date: Thu, 15 Jan 2026 22:19:08 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Aaron Tomlin <atomlin@...mlin.com>, oleg@...hat.com,
akpm@...ux-foundation.org, gregkh@...uxfoundation.org, brauner@...nel.org,
mingo@...nel.org
Cc: neelx@...e.com, sean@...e.io, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
On 1/15/26 21:54, Aaron Tomlin wrote:
> This patch introduces two new fields to /proc/[pid]/status,
> "Cpus_active_mm" and "Cpus_active_mm_list", which display the set of
> CPUs on which the process's memory context is active, in mask and list
> format respectively. The mm_cpumask is primarily used for TLB and
> cache synchronisation.
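
Side note, for anyone who wants to see the resulting format up front:
with the %*pb/%*pbl seq_printf specifiers used further down, on a
hypothetical 8-CPU machine where the mm has been active on every CPU,
the new lines would look roughly like this (values made up purely for
illustration):

	Cpus_active_mm:	ff
	Cpus_active_mm_list:	0-7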
>
> Exposing this information allows userspace to easily observe the
> relationship between the CPUs where a memory descriptor is "active" and
> the CPUs where the thread is allowed to execute. The primary intent is
> to provide visibility into the "memory footprint" across CPUs, which is
> invaluable for debugging performance issues related to IPI storms and
> TLB shootdowns in large-scale NUMA systems. The CPU affinity sets the
> boundary; the mm_cpumask records where the memory context has actually
> been active; the two complement each other.
>
> Frequent mm_cpumask changes may indicate instability in placement
> policies or excessive task migration overhead.
>
> These fields are exposed only on architectures that explicitly opt-in
> via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
> mm_cpumask semantics vary significantly across architectures; some
> (e.g., x86) actively maintain the mask for coherency, while others may
> never clear bits, rendering the data misleading for this specific use
> case. x86 is updated to select this feature by default.
>
> The implementation reads the mask directly without introducing additional
> locks or snapshots. While this implies that the hex mask and list format
> could theoretically observe slightly different states on a rapidly
> changing system, this "best-effort" approach aligns with the standard
> design philosophy of /proc and avoids imposing locking overhead on
> critical memory management paths.
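
On the consumer side, a quick illustration (my own sketch, untested):
a tool can read the new fields next to Cpus_allowed* straight out of
/proc/[pid]/status, assuming this patch is applied and the kernel was
built with CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM:

	#include <stdio.h>
	#include <string.h>

	/*
	 * Dump all "Cpus_*" lines (Cpus_allowed* plus the new
	 * Cpus_active_mm*) for the PID given as argv[1], defaulting
	 * to "self".
	 */
	int main(int argc, char **argv)
	{
		const char *pid = argc > 1 ? argv[1] : "self";
		char path[64], line[256];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/status", pid);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "Cpus_", 5))
				fputs(line, stdout);	/* print matching lines verbatim */
		}
		fclose(f);
		return 0;
	}

Comparing Cpus_active_mm_list against Cpus_allowed_list over time would
then show how far the "memory footprint" has spread within the allowed
set.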
Yes, restricting to architectures that have the expected semantics is
better.
... but we'd better get the blessing from the x86 folks :)
(CCing the x86 MM folks)
>
> Signed-off-by: Aaron Tomlin <atomlin@...mlin.com>
> ---
> Documentation/filesystems/proc.rst | 7 +++++++
> arch/x86/Kconfig | 1 +
> fs/proc/Kconfig | 14 ++++++++++++++
> fs/proc/array.c | 28 +++++++++++++++++++++++++++-
> 4 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 8256e857e2d7..c6ced84c5c68 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,12 +291,19 @@ It's slow but very precise.
> SpeculationIndirectBranch indirect branch speculation mode
> Cpus_allowed mask of CPUs on which this process may run
> Cpus_allowed_list Same as previous, but in "list format"
> + Cpus_active_mm mask of CPUs on which this process has an active
> + memory context
> + Cpus_active_mm_list Same as previous, but in "list format"
> Mems_allowed mask of memory nodes allowed to this process
> Mems_allowed_list Same as previous, but in "list format"
> voluntary_ctxt_switches number of voluntary context switches
> nonvoluntary_ctxt_switches number of non voluntary context switches
> ========================== ===================================================
>
> +Note "Cpus_active_mm" is currently only supported on x86. Its semantics are
> +architecture-dependent; on x86, it represents the set of CPUs that may hold
> +stale TLB entries for the process and thus require IPI-based TLB shootdowns to
> +maintain coherency.
>
> .. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 80527299f859..f0997791dbdb 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -152,6 +152,7 @@ config X86
> select ARCH_WANTS_THP_SWAP if X86_64
> select ARCH_HAS_PARANOID_L1D_FLUSH
> select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
> + select ARCH_WANT_PROC_CPUS_ACTIVE_MM
> select BUILDTIME_TABLE_SORT
> select CLKEVT_I8253
> select CLOCKSOURCE_WATCHDOG
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 6ae966c561e7..952c40cf3baa 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -127,3 +127,17 @@ config PROC_PID_ARCH_STATUS
> config PROC_CPU_RESCTRL
> def_bool n
> depends on PROC_FS
> +
> +config ARCH_WANT_PROC_CPUS_ACTIVE_MM
> + bool
> + depends on PROC_FS
> + help
> + Selected by architectures that reliably maintain mm_cpumask for TLB
> + and cache synchronisation and wish to expose it in
> + /proc/[pid]/status. Exposing this information allows userspace to
> + easily observe the relationship between the CPUs where a memory
> + descriptor is "active" and the CPUs where the thread is allowed to
> + execute. The primary intent is to provide visibility into the
> + "memory footprint" across CPUs, which is invaluable for debugging
> + performance issues related to IPI storms and TLB shootdowns in
> + large-scale NUMA systems.
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 42932f88141a..c16aad59e0a7 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -409,6 +409,29 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
> cpumask_pr_args(&task->cpus_mask));
> }
>
> +#ifdef CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM
> +/**
> + * task_cpus_active_mm - Show the mm_cpumask for a process
> + * @m: The seq_file structure for the /proc/PID/status output
> + * @mm: The memory descriptor of the process
> + *
> + * Prints the set of CPUs on which the process's memory context is
> + * active, in both mask and list format. This mask is primarily used
> + * for TLB and cache synchronisation.
> + */
> +static void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
> +{
> + seq_printf(m, "Cpus_active_mm:\t%*pb\n",
> + cpumask_pr_args(mm_cpumask(mm)));
> + seq_printf(m, "Cpus_active_mm_list:\t%*pbl\n",
> + cpumask_pr_args(mm_cpumask(mm)));
> +}
> +#else
> +static inline void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
> {
> seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
> @@ -450,12 +473,15 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
> task_core_dumping(m, task);
> task_thp_status(m, mm);
> task_untag_mask(m, mm);
> - mmput(mm);
> }
> task_sig(m, task);
> task_cap(m, task);
> task_seccomp(m, task);
> task_cpus_allowed(m, task);
> + if (mm) {
> + task_cpus_active_mm(m, mm);
> + mmput(mm);
> + }
> cpuset_task_status_allowed(m, task);
> task_context_switch_counts(m, task);
> arch_proc_pid_thread_features(m, task);
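
One note for readers skimming the last hunk: the point of moving
mmput() below task_cpus_allowed() is to keep the mm reference alive
for the new helper. Simplified (a sketch, not the literal
proc_pid_status() body), the resulting pattern is:

	struct mm_struct *mm = get_task_mm(task);
	...
	if (mm) {
		task_cpus_active_mm(m, mm);	/* mm pinned, mask safe to read */
		mmput(mm);			/* drop the reference afterwards */
	}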
--
Cheers
David