Message-ID: <hvtuqyu6urtutdjfqvtowgjqf5psyubljqujwtq3exdm33a7sq@3nxcevou3pok>
Date: Tue, 10 Jun 2025 23:41:32 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Casey Chen <cachen@...estorage.com>, surenb@...gle.com, corbet@....net, 
	dennis@...nel.org, tj@...nel.org, cl@...two.org, vbabka@...e.cz, mhocko@...e.com, 
	jackmanb@...gle.com, hannes@...xchg.org, ziy@...dia.com, rientjes@...gle.com, 
	roman.gushchin@...ux.dev, harry.yoo@...cle.com, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, yzhong@...estorage.com
Subject: Re: [PATCH] alloc_tag: add per-NUMA node stats

On Tue, Jun 10, 2025 at 06:21:55PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2025 17:30:53 -0600 Casey Chen <cachen@...estorage.com> wrote:
> 
> > Add support for tracking per-NUMA node statistics in /proc/allocinfo.
> > Previously, each alloc_tag had a single set of counters (bytes and
> > calls), aggregated across all CPUs. With this change, each CPU can
> > maintain separate counters for each NUMA node, allowing finer-grained
> > memory allocation profiling.
> > 
> > This feature is controlled by the new
> > CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS option:
> > 
> > * When enabled (=y), the output includes per-node statistics following
> >   the total bytes/calls:
> > 
> > <size> <calls> <tag info>
> > ...
> > 315456       9858     mm/dmapool.c:338 func:pool_alloc_page
> >         nid0     94912        2966
> >         nid1     220544       6892
> > 7680         60       mm/dmapool.c:254 func:dma_pool_create
> >         nid0     4224         33
> >         nid1     3456         27
> > 
> > * When disabled (=n), the output remains unchanged:
> > <size> <calls> <tag info>
> > ...
> > 315456       9858     mm/dmapool.c:338 func:pool_alloc_page
> > 7680         60       mm/dmapool.c:254 func:dma_pool_create
> > 
> > To minimize memory overhead, per-NUMA stats counters are dynamically
> > allocated using the percpu allocator. PERCPU_DYNAMIC_RESERVE has been
> > increased to ensure sufficient space for in-kernel alloc_tag counters.
> > 
> > For in-kernel alloc_tag instances, pcpu_alloc_noprof() is used to
> > allocate counters. These allocations are excluded from the profiling
> > statistics themselves.
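
For illustration, a minimal sketch of the fast path this layout implies
(the helper name alloc_tag_add_nid is hypothetical, and the field
layout is assumed from the description above, not taken from the patch):

	/*
	 * Sketch only: assuming tag->counters points at a percpu block
	 * holding nr_node_ids counters per CPU, the counter for
	 * (this cpu, nid) is reached by indexing within the CPU's slot.
	 */
	static inline void alloc_tag_add_nid(struct alloc_tag *tag, int nid,
					     size_t bytes)
	{
		this_cpu_add(tag->counters[nid].bytes, bytes);
		this_cpu_add(tag->counters[nid].calls, 1);
	}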
> 
> What is glaringly missing here is "why".
> 
> What is the use case?  Why does Linux want this?  What benefit does
> this bring to our users?  This is the most important part of the
> changelog because it tells Andrew why he is even looking at this patch.

In the last thread, I was pushing to get input from other people
working on userspace profiling - that seems to be where we're going
with this, so we need to make sure people working in the same area are
talking to each other, and if we're building tools to export data, we
want to make sure the data is relevant and useful.

Steven Rostedt sent out some pings, but I don't think we've heard back
from the relevant people, and I do want that to happen before (if) this
goes in.

And more basic than this: now that we've got the ability to map a
kernel address to the code that owns it, we should be checking whether
that is relevant and interesting to the tooling people before we get
fancier.

Casey also mentioned that this was being used for some internal stuff
at Pure, and I'd still like to hear more about that. If they already
have interesting and cool uses for this - great, let's please hear what
they are.

But we definitely need some wider context before merging this.

> Probably related to the above omission: why per-nid?  It would be more
> flexible to present the per-cpu counts and let userspace aggregate that
> into per-node info if that is desirable.

Per-nid makes more sense to me if this is for profiling, and since this
is a text interface meant primarily for human consumption, we don't
want to bloat it excessively.

We can always add multiple interfaces for viewing the same data, and
there's nothing preventing us from adding that later if it turns out
something else does want per-cpu counts.

Per-cpu counts would, I believe, bloat the data structures quite a bit,
since the counters are already percpu internally - would we then end up
with nr_cpus^2 counters? I hope not :)

> > +static inline struct alloc_tag_counters alloc_tag_read_nid(struct alloc_tag *tag, int nid)
> > +{
> > +	struct alloc_tag_counters v = { 0, 0 };
> > +	struct alloc_tag_counters *counters;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
> 
> for_each_possible_cpu() is lame - potentially much more expensive than
> for_each_online_cpu.  Is it impractical to use for_each_online_cpu()?
> 
> Probably doesn't matter for a userspace displaying function but
> userspace can do weird and unexpected things.

for_each_online_cpu() means having to think hard about cpu hotplug and
possibly adding callbacks, no? I think simple and dumb is fine here.
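
Hedged sketch of how the rest of the quoted read side plausibly looks
(a guess at the unquoted body, not the actual patch), which also shows
why no hotplug callbacks are needed - an offlined CPU's counters stick
around and still get summed:

	static inline struct alloc_tag_counters
	alloc_tag_read_nid(struct alloc_tag *tag, int nid)
	{
		struct alloc_tag_counters v = { 0, 0 };
		int cpu;

		for_each_possible_cpu(cpu) {
			/* base of this CPU's slot, then index by node */
			struct alloc_tag_counters *c =
				per_cpu_ptr(tag->counters, cpu) + nid;

			v.bytes += c->bytes;
			v.calls += c->calls;
		}
		return v;
	}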

> > +	for (i = 0; i < kernel_tags.count; i++) {
> > +		/* Each CPU has one alloc_tag_counters per numa node */
> > +		kernel_tags.first_tag[i].counters =
> > +			pcpu_alloc_noprof(pcpu_counters_size,
> > +					  sizeof(struct alloc_tag_counters),
> > +					  false, GFP_KERNEL | __GFP_ZERO);
> > +		if (!kernel_tags.first_tag[i].counters) {
> > +			while (--i >= 0)
> > +				free_percpu(kernel_tags.first_tag[i].counters);
> > +			pr_info("Failed to allocate per-cpu alloc_tag counters\n");
> 
> pr_err(), methinks.
> 
> > +			return;
> 
> And now what happens.  Will the kernel even work?
> 
> This code path is untestable unless the developer jumps through hoops
> and it will never be tested again.
> 
> We assume that __init-time allocations always succeed, so a hearty
> panic() here would be OK.

percpu allocations are more limited than normal allocations, and we did
hit that limit when developing memory allocation profiling, so - maybe
WARN()?
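
Something like this on the failure path (sketch of the quoted hunk with
that change applied; the message text is made up):

	if (!kernel_tags.first_tag[i].counters) {
		while (--i >= 0)
			free_percpu(kernel_tags.first_tag[i].counters);
		/* percpu space is limited and this can genuinely fail;
		 * warn loudly instead of a quiet pr_info() or a panic() */
		WARN(1, "failed to allocate per-NUMA alloc_tag counters\n");
		return;
	}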

> > +				int nid;
> 
> C99 definition.
> 
> > +				for (nid = 0; nid < pcpu_counters_num; nid++) {
> 
> If we're going to use C99 (is OK now) then it's better to go all the
> way and give `i' loop scope.  "for (int i..".

Yes please; variable reuse can lead to some nasties.

Going full C99 declarations has its own problems (you can't declare
variables directly after a label, and only one of gcc or clang
complains about it, I forget which - a perpetual headache).

But C99 for loops are a total no-brainer.
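
To spell out the hazard and the fix (toy example; scan_tag() and
print_node() are made-up helpers):

	void show_counters(int nr_tags, int pcpu_counters_num)
	{
		int i;

		for (i = 0; i < nr_tags; i++)
			scan_tag(i);

		/* i still holds nr_tags here and can be silently reused
		 * with a stale value */

		/* C99 style: the index is scoped to its loop, so reusing
		 * it afterwards is a compile error, not a latent bug */
		for (int nid = 0; nid < pcpu_counters_num; nid++)
			print_node(nid);
	}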
