Message-ID: <CAJuCfpE3pXB5=sZLywPgCk5sU1t-=G00TG-dLaXpYheSPYz1RA@mail.gmail.com>
Date: Sat, 6 Sep 2025 22:16:28 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Yueyang Pan <pyyjason@...il.com>, Kent Overstreet <kent.overstreet@...ux.dev>, 
	Usama Arif <usamaarif642@...il.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Sourav Panda <souravpanda@...gle.com>, Pasha Tatashin <tatashin@...gle.com>, 
	Johannes Weiner <hannes@...xchg.org>
Subject: Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill

On Wed, Aug 27, 2025 at 2:15 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process, if the oom is because of the cgroup
> > > > > > limit, we won't get memory allocation information. In some cases, we
> > > > > > can have a large cgroup workload running which dominates the machine.
> > > > > > The reason for using a cgroup is to leave some resources for the
> > > > > > system. When this cgroup is killed, we would also like to have some
> > > > > > memory allocation information for the whole server as well. This is
> > > > > > the reason behind this mini change. Is it an acceptable thing to do?
> > > > > > Will it be too much information for people? I am happy with any
> > > > > > suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the patch
> > > > > itself; there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > > >
> > > > >
> > > > > What exact information do you want on a memcg oom that would be helpful
> > > > > for users in general? You mentioned memory allocation information;
> > > > > can you please elaborate a bit more?
> > > > >
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > > provided by show_free_areas() and the memory allocation profiling info
> > > > could help us debug a cgroup oom by comparing them with historical data.
> > > > What is your take on this?
> > > >
> > >
> > > I am not really sure about show_free_areas(), more specifically how the
> > > historical data diff would be useful for a memcg oom. If you have a
> > > concrete example, please give one. For memory allocation profiling, is
> > > it possible to filter for a given memcg? Do we save memcg information
> > > in the memory allocation profiling?
> >
> > Actually I was thinking about making memory profiling memcg-aware, but
> > it would be quite costly both from the memory and performance points of
> > view. Currently we have a per-cpu counter for each allocation site in
> > the kernel codebase. To make it work for each memcg we would have to
> > add a memcg dimension to the counters, so each counter becomes per-cpu
> > plus per-memcg. I'll be thinking about possible optimizations, since
> > many of these counters will stay at 0, but any such optimization would
> > come at a performance cost, which we have tried to keep at the absolute
> > minimum.
> >
> > I'm CC'ing Sourav and Pasha since they were also interested in making
> > memory allocation profiling memcg-aware. Would Meta folks (Usama,
> > Shakeel, Johannes) be interested in such an enhancement as well? Would
> > it be preferable to have such accounting for a specific memcg which we
> > pre-select (less memory and performance overhead), or do we need it for
> > all memcgs as a generic feature? We have some options here, but I want
> > to understand what would be sufficient and add as little overhead as
> > possible.
>
> Thanks Suren, yes, as already mentioned by Usama, Meta will be
> interested in memcg-aware allocation profiling. I would say start simple
> and with as little overhead as possible. More functionality can be added
> later when the need arises. Maybe the first useful addition is just
> reporting how many allocations for a specific allocation site are memcg
> charged.

Adding back Sourav, Pasha and Johannes, who got accidentally dropped from
the replies.

I looked a bit into adding memcg-awareness to memory allocation
profiling and it's more complicated than I first thought (as usual).
The main complication is that we need to add a memcg_id or some other
memcg identifier to codetag_ref. That's needed so that we can
unaccount the correct memcg when we free an allocation - that's the
usual function of the codetag_ref. Now, extending codetag_ref is not a
problem by itself, but when we use mem_profiling_compressed mode, we
store the index of the codetag instead of a codetag_ref in the unused
page flag bits. This is a useful optimization to avoid using page_ext
and the overhead associated with it, but it leaves no room for a memcg
identifier. So, full-blown memcg support seems problematic.
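
To make that concrete, something along these lines (purely illustrative,
all names below are made up, not a patch) is roughly what the full-blown
support would need:

/* Illustrative sketch only; structure and field names are invented. */

#include <linux/codetag.h>	/* union codetag_ref */
#include <linux/alloc_tag.h>	/* struct alloc_tag_counters */

/*
 * To unaccount the correct memcg on free, each reference would have to
 * remember which memcg was charged when the allocation was made:
 */
struct codetag_memcg_ref {
	union codetag_ref	ref;		/* existing: points to the codetag */
	unsigned short		memcg_id;	/* new: memcg id recorded at alloc time */
};

/*
 * And each allocation site would need per-cpu *and* per-memcg counters
 * instead of a single per-cpu set; most entries would stay at 0:
 */
struct alloc_tag_memcg_counters {
	/* indexed by memcg id, ideally allocated lazily to limit the overhead */
	struct alloc_tag_counters __percpu	**counters;
	unsigned int				nr_ids;
};

An extra memcg_id like the above clearly doesn't fit into the spare page
flag bits that mem_profiling_compressed uses, which is where this
approach falls apart.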

What I think is easily doable is a filtering interface where we select
a specific memcg to be profiled, IOW we profile only allocations from
the chosen memcg. Filtering could be done via an ioctl interface on
/proc/allocinfo, which could be used for other things as well, like
filtering for non-zero allocations, returning per-NUMA-node
information, etc. I see that DAMON uses similar memcg filtering (see
damos_filter.memcg_id), so I can reuse some of that code to implement
this facility. At a high level, userspace would select one memcg at a
time to be profiled. At some later time the profiling information is
gathered, and another memcg can be selected or the filter can be reset
to profile all allocations from all memcgs. I expect the overhead of
this kind of memcg filtering to be quite low. WDYT folks, would this be
helpful and cover your use cases?
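
For illustration only, the userspace side could look something like the
sketch below (the ioctl name, number and the way the memcg is identified
are made up, nothing like this exists today):

/* Hypothetical UAPI sketch; ALLOCINFO_SET_MEMCG_FILTER is invented. */
#include <fcntl.h>
#include <linux/ioctl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Select one memcg (here by cgroup inode number) to profile; 0 resets
 * the filter so allocations from all memcgs are profiled again. */
#define ALLOCINFO_SET_MEMCG_FILTER	_IOW('A', 0x01, unsigned long)

int main(void)
{
	unsigned long cg_ino = 1234;	/* made-up inode of the target memcg's cgroup dir */
	int fd = open("/proc/allocinfo", O_RDONLY);

	if (fd < 0)
		return 1;

	/* profile only allocations charged to the selected memcg */
	if (ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &cg_ino))
		perror("ioctl");

	/* ... later: read /proc/allocinfo as usual, then reset the filter */
	cg_ino = 0;
	ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &cg_ino);
	close(fd);
	return 0;
}

On the kernel side the filter check would just be a comparison of the
allocation's charged memcg against the selected one before bumping the
counters, similar in spirit to what DAMON does with
damos_filter.memcg_id.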

