Message-ID: <5b356cfa.4ab3.19b97196d49.Coremail.00107082@163.com>
Date: Wed, 7 Jan 2026 14:16:24 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Kent Overstreet" <kent.overstreet@...ux.dev>
Cc: "Suren Baghdasaryan" <surenb@...gle.com>, akpm@...ux-foundation.org,
hannes@...xchg.org, pasha.tatashin@...een.com,
souravpanda@...gle.com, vbabka@...e.cz, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] alloc_tag: add option to pick the first codetag
along callchain
At 2026-01-07 12:07:34, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>On Wed, Jan 07, 2026 at 11:38:06AM +0800, David Wang wrote:
>>
>> At 2026-01-07 07:26:18, "Kent Overstreet" <kent.overstreet@...ux.dev> wrote:
>> >On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
>> >> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
>> >> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
>> >> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
>> >
>> >The trouble is you end up in situations where you have an alloc tag on
>> >the stack, but then you're doing an internal allocation that definitely
>> >should not be accounted to the outer alloc tag.
>> >
>> >E.g. there's a lot of internal mm allocations like this; object
>> >extension vectors was I think the first place where it came up,
>> >vmalloc() also has its own internal data structures that require
>> >allocations.
>> >
>> >Just using the outermost tag means these inner allocations will get
>> >accounted to other unrelated alloc tags _effectively at random_; meaning
>> >if we're burning more memory than we should be in a place like that it
>> >will never show up in a way that we'll notice and be able to track it
>> >down.
>>
>> I kind of feel the same thing could be said for drivers: a driver could use more memory
>> than the data says... and this is actually true...
>> Different developers may have different focuses concerning the allocation site.
>>
>> >
>> >> Totally agree.
>> >> I used to sum by filepath prefix to aggregate memory usage for drivers.
>> >> Take the usb subsystem for example: on my system, the data says my usb drivers use up 200K of memory,
>> >> but if I pick the first codetag, the data says ~350K. Which one is lying, or are both of them lying? I am confused.
>> >>
>> >> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
>> >
>> >So yes, summing by filepath prefix is the way we want things to work.
>> >
>> >But getting there - with a fully reliable end result - is a process.
>> >
>> >What you want to do is - preferably on a reasonably idle machine, aside
>> >from the code you're looking at - just look at everything in
>> >/proc/allocinfo and sort by size. Look at the biggest ones that might be
>> >relevant to your subsystem, and look for any that are suspicious and
>> >perhaps should be accounted to your code. Yes, that may entail reading
>> >code :)
>> >
>> >This is why accounting to the innermost tag is important - by doing it
>> >this way, if allocations are being accounted at the wrong callsite
>> >they'll all be lumped together at the specific callsite that needs to be
>> >fixed, which then shows up higher than normal in /proc/allocinfo, so
>> >that it gets looked at.
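
(For what it's worth, the prefix summing I mentioned is basically this
kind of thing - a minimal userspace sketch, not the exact script I use,
assuming /proc/allocinfo data lines look like "<bytes> <calls>
<file>:<line> ..." and skipping anything that does not parse:)

/* sum /proc/allocinfo by filepath prefix, e.g. "drivers/usb/" */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *prefix = argc > 1 ? argv[1] : "drivers/usb/";
	FILE *f = fopen("/proc/allocinfo", "r");
	unsigned long long bytes, calls, total = 0, count = 0;
	char line[1024], path[512];

	if (!f) {
		perror("/proc/allocinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* skip header lines and anything else that does not parse */
		if (sscanf(line, "%llu %llu %511s", &bytes, &calls, path) != 3)
			continue;
		if (!strncmp(path, prefix, strlen(prefix))) {
			total += bytes;
			count += calls;
		}
	}
	fclose(f);
	printf("%s: %llu bytes across %llu allocations\n", prefix, total, count);
	return 0;
}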
>> >
>> >> >The fact that you have to be explicit about where the accounting happens
>> >> >via _noprof is a feature, not a bug :)
>> >>
>> >> But it is tedious... :(
>> >
>> >That's another way of saying it's easy :)
>> >
>> >Spot an allocation with insufficiently fine-grained accounting and it's
>> >generally a 3-5 line patch to fix it; I've been doing those here and
>> >there - e.g. mempools, workqueues, rhashtables.
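
(To spell out the shape of such a patch for anyone following the thread:
foo_alloc() below is made up purely for illustration, while
kmalloc_noprof() and alloc_hooks() are the existing machinery.)

#include <linux/alloc_tag.h>
#include <linux/slab.h>

/*
 * Before, this was a plain foo_alloc() calling kmalloc(), so every
 * user's memory was lumped under the single kmalloc() inside it.
 * The fix: rename it *_noprof, switch the internal allocation to its
 * _noprof variant, and wrap it in alloc_hooks() so each caller gets
 * its own codetag in /proc/allocinfo.
 */
void *foo_alloc_noprof(size_t size)
{
	return kmalloc_noprof(size, GFP_KERNEL);
}
#define foo_alloc(...)	alloc_hooks(foo_alloc_noprof(__VA_ARGS__))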
>> >
>> >One trick I did with rhashtables that may be relevant to other
>> >subsystems: rhashtable does background processing for your hash table,
>> >which will do new allocations for your hash table out of a workqueue.
>> >
>> >So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
>> >the pointer to that alloc tag in the rhashtable, and uses it later for
>> >all those asynchronous allocations.
>> >
>> >This means that instead of seeing a ton of memory accounted to the
>> >rhashtable code, with no idea of which rhashtable is burning memory -
>> >all the rhashtable allocations are accounted to the callsite of the
>> >initialization, meaning it's trivial to see which one is burning memory.
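
(That trick does look reusable elsewhere. If I read it right, the idea
is roughly the following - sketched from your description with made-up
my_table* names, assuming CONFIG_MEM_ALLOC_PROFILING=y; real code would
also need to handle the disabled case. alloc_hooks_tag(),
current->alloc_tag and kvmalloc_node_noprof() are the existing pieces.)

#include <linux/alloc_tag.h>
#include <linux/errno.h>
#include <linux/numa.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct my_table {
	struct alloc_tag *alloc_tag;	/* codetag of the init callsite */
	void **buckets;
	size_t nbuckets;
};

int my_table_init_noprof(struct my_table *tbl)
{
	/* alloc_hooks() at the caller has parked its tag in current */
	tbl->alloc_tag = current->alloc_tag;
	tbl->buckets = NULL;
	tbl->nbuckets = 0;
	return 0;
}
/* every caller of my_table_init() gets its own codetag, which init stashes */
#define my_table_init(...)	alloc_hooks(my_table_init_noprof(__VA_ARGS__))

/*
 * Later, e.g. from a workqueue: charge the deferred allocation to the
 * stashed init callsite instead of to the worker.
 */
static int my_table_grow(struct my_table *tbl, size_t nbuckets)
{
	void **new = alloc_hooks_tag(tbl->alloc_tag,
			kvmalloc_node_noprof(nbuckets * sizeof(*new),
					     GFP_KERNEL | __GFP_ZERO,
					     NUMA_NO_NODE));

	if (!new)
		return -ENOMEM;
	kvfree(tbl->buckets);
	tbl->buckets = new;
	tbl->nbuckets = nbuckets;
	return 0;
}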
>>
>> Not that easy... code keeps being refactored, and _noprof needs to be changed along with it.
>> I was trying to split the accounting for __filemap_get_folio among its callers in 6.18;
>> it was easy, only ~10 lines of code changes. But 6.19 starts with refactors to
>> __filemap_get_folio that add another level of indirection: the allocation callchain becomes
>> longer, and more _noprof has to be added... quite unpleasant...
>>
>> Sometimes I feel that too many _noprof annotations could become an obstacle for future code refactors...
>>
>> PS: There are several allocation sites that have *huge* memory accounting; __filemap_get_folio is
>> one of those. Splitting that accounting among their callers would be more informative.
>
>I'm curious why you need to change __filemap_get_folio()? In filesystem
>land we just lump that under "pagecache", but I guess you're doing more
>interesting things with it in driver land?
Oh, in [1] there is a report about a possible memory leak in cephfs (the issue is still open, tracked in [2]):
a large chunk of memory could not be released even after a dropcache.
Memory allocation profiling shows that memory belongs to __filemap_get_folio,
something like:
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
>> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
>> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
>> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
>> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
>> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
>> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
After splitting the accounting for __filemap_get_folio among its callers, it shows:
># sort -g /proc/allocinfo|tail|numfmt --to=iec
> 10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
> 12M 3001 mm/execmem.c:41 func:execmem_vmalloc
> 12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
> 16M 992 mm/slub.c:3061 func:alloc_slab_page
> 20M 35544 lib/xarray.c:378 func:xas_alloc
> 31M 7704 mm/memory.c:1192 func:folio_prealloc
> 69M 17562 mm/memory.c:1190 func:folio_prealloc
> 104M 8212 mm/slub.c:3059 func:alloc_slab_page
> 124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
> 2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
>
Helpful or not, I am not sure; so far no bug has been spotted in the cephfs write path.
But at least it provides more information and narrows down the scope of suspects.
https://lore.kernel.org/lkml/2a9ba88e.3aa6.19b0b73dd4e.Coremail.00107082@163.com/ [1]
https://tracker.ceph.com/issues/74156 [2]