Message-ID: <aV3bykzqkyUcBJBf@moria.home.lan>
Date: Tue, 6 Jan 2026 23:07:34 -0500
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: David Wang <00107082@....com>
Cc: Suren Baghdasaryan <surenb@...gle.com>, akpm@...ux-foundation.org, 
	hannes@...xchg.org, pasha.tatashin@...een.com, souravpanda@...gle.com, 
	vbabka@...e.cz, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] alloc_tag: add option to pick the first codetag
 along callchain

On Wed, Jan 07, 2026 at 11:38:06AM +0800, David Wang wrote:
> 
> At 2026-01-07 07:26:18, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
> >On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
> >> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
> >> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
> >> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect. 
> >
> >The trouble is you end up in situations where you have an alloc tag on
> >the stack, but then you're doing an internal allocation that definitely
> >should not be accounted to the outer alloc tag.
> >
> >E.g. there's a lot of internal mm allocations like this; object
> >extension vectors was I think the first place where it came up,
> >vmalloc() also has its own internal data structures that require
> >allocations.
> >
> >Just using the outermost tag means these inner allocations will get
> >accounted to other unrelated alloc tags _effectively at random_; meaning
> >if we're burning more memory than we should be in a place like that it
> >will never show up in a way that we'll notice and be able to track it
> >down.
> 
> Kind of feel that the same could be said for drivers: a driver could use more memory
> than the data shows... and this is actually true.
> Different developers may have a different focus concerning the allocation site.
> 
> >
> >> Totally agree.
> >> I used to sum by filepath prefix to aggregate memory usage for drivers.
> >> Take the usb subsystem for example: on my system, the data says my usb drivers use ~200K of memory,
> >> but if I pick the first codetag, the data says ~350K. Which one is lying, or are both lying? I am confused.
> >> 
> >> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
> >
> >So yes, summing by filepath prefix is the way we want things to work.
> >
> >But getting there - with a fully reliable end result - is a process.
> >
> >What you want to do is - preferably on a reasonably idle machine, aside
> >from the code you're looking at - just look at everything in
> >/proc/allocinfo and sort by size. Look at the biggest ones that might be
> >relevant to your subsystem, and look for any that are suspicious and
> >perhaps should be accounted to your code. Yes, that may entail reading
> >code :)
> >
> >This is why accounting to the innermost tag is important - by doing it
> >this way, if an allocation is being accounted at the wrong callsite
> >they'll all be lumped together at the specific callsite that needs to be
> >fixed, which then shows up higher than normal in /proc/allocinfo, so
> >that it gets looked at.
> >
> >> >The fact that you have to be explicit about where the accounting happens
> >> >via _noprof is a feature, not a bug :)
> >> 
> >> But it is tedious... :(
> >
> >That's another way of saying it's easy :)
> >
> >Spot an allocation with insufficiently fine-grained accounting and it's
> >generally a 3-5 line patch to fix it, I've been doing those here and
> >there - e.g. mempools, workqueues, rhashtables.
> >
> >One trick I did with rhashtables that may be relevant to other
> >subsystems: rhashtable does background processing for your hash table,
> >which will do new allocations for your hash table out of a workqueue.
> >
> >So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
> >the pointer to that alloc tag in the rhashtable, and uses it later for
> >all those asynchronous allocations.
> >
> >This means that instead of seeing a ton of memory accounted to the
> >rhashtable code, with no idea of which rhashtable is burning memory -
> >all the rhashtable allocations are accounted to the callsite of the
> >initialization, meaning it's trivial to see which one is burning memory.
> 
> Not that easy... code keeps being refactored, and the _noprof annotations need to change along with it.
> I was trying to split the accounting for __filemap_get_folio among its callers in 6.18;
> it was easy, only ~10 lines of code changes. But 6.19 starts with refactors to
> __filemap_get_folio, adding another level of indirection; the allocation callchain becomes
> longer, and more _noprof needs to be added... quite unpleasant...
> 
> Sometimes I feel that too many _noprof annotations could be an obstacle to future code refactors....
> 
> PS: There are several allocation sites with *huge* memory accounting; __filemap_get_folio is
> one of those. Splitting that accounting among its callers would be more informative.

I'm curious why you need to change __filemap_get_folio()? In filesystem
land we just lump that under "pagecache", but I guess you're doing more
interesting things with it in driver land?
