Message-ID: <20231129152057.x7fhbcvwtsmkbdpb@CAB-WSD-L081021>
Date: Wed, 29 Nov 2023 18:20:57 +0300
From: Dmitry Rokosov <ddrokosov@...utedevices.com>
To: Michal Hocko <mhocko@...e.com>
CC: <rostedt@...dmis.org>, <mhiramat@...nel.org>, <hannes@...xchg.org>,
<roman.gushchin@...ux.dev>, <shakeelb@...gle.com>,
<muchun.song@...ux.dev>, <akpm@...ux-foundation.org>,
<kernel@...rdevices.ru>, <rockosov@...il.com>,
<cgroups@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <bpf@...r.kernel.org>
Subject: Re: [PATCH v3 2/2] mm: memcg: introduce new event to trace
shrink_memcg
On Tue, Nov 28, 2023 at 10:32:50AM +0100, Michal Hocko wrote:
> On Mon 27-11-23 19:16:37, Dmitry Rokosov wrote:
> > On Mon, Nov 27, 2023 at 01:50:22PM +0100, Michal Hocko wrote:
> > > On Mon 27-11-23 14:36:44, Dmitry Rokosov wrote:
> > > > On Mon, Nov 27, 2023 at 10:33:49AM +0100, Michal Hocko wrote:
> > > > > On Thu 23-11-23 22:39:37, Dmitry Rokosov wrote:
> > > > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > > > Currently, it is not possible to trace this point from non-direct
> > > > > > reclaim paths. However, direct reclaim has its own tracepoint, so there
> > > > > > is no issue there. In certain cases, when debugging memcg pressure,
> > > > > > developers may need to identify all potential requests for memcg
> > > > > > reclamation including kswapd(). The patchset introduces the tracepoints
> > > > > > mm_vmscan_memcg_shrink_{begin|end}() to address this problem.
> > > > > >
> > > > > > Example of output in the kswapd context (non-direct reclaim):
> > > > > > kswapd0-39 [001] ..... 240.356378: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356396: mm_vmscan_memcg_shrink_end: nr_reclaimed=0 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356420: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356454: mm_vmscan_memcg_shrink_end: nr_reclaimed=1 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356479: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356506: mm_vmscan_memcg_shrink_end: nr_reclaimed=4 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356525: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356593: mm_vmscan_memcg_shrink_end: nr_reclaimed=11 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356614: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356738: mm_vmscan_memcg_shrink_end: nr_reclaimed=25 memcg=16
> > > > > > kswapd0-39 [001] ..... 240.356790: mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=16
> > > > > > kswapd0-39 [001] ..... 240.357125: mm_vmscan_memcg_shrink_end: nr_reclaimed=53 memcg=16
> > > > >
> > > > > In the previous version I have asked why do we need this specific
> > > > > tracepoint when we already do have trace_mm_vmscan_lru_shrink_{in}active
> > > > > which already give you a very good insight. That includes the number of
> > > > > reclaimed pages but also more. I do see that we do not include memcg id
> > > > > of the reclaimed LRU, but that shouldn't be a big problem to add, no?
> > > >
> > > > From my point of view, memcg reclaim includes two points: LRU shrink and
> > > > slab shrink, as mentioned in the vmscan.c file.
> > > >
> > > >
> > > > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > > ...
> > > > 		reclaimed = sc->nr_reclaimed;
> > > > 		scanned = sc->nr_scanned;
> > > >
> > > > 		shrink_lruvec(lruvec, sc);
> > > >
> > > > 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > > > 			    sc->priority);
> > > > ...
> > > >
> > > > So, both of these operations are important for understanding whether
> > > > memcg reclaiming was successful or not, as well as its effectiveness. I
> > > > believe it would be beneficial to summarize them, which is why I have
> > > > created new tracepoints.
> > >
> > > This sounds like nice to have rather than must. Put it differently. If
> > > you make existing reclaim trace points memcg aware (print memcg id) then
> > > what prevents you from making analysis you need?
> >
> > You are right, nothing prevents me from making this analysis... but...
> >
> > This approach does have some disadvantages:
> > 1) It requires more changes to vmscan. At the very least, the memcg
> > object should be forwarded to all subfunctions for LRU and SLAB
> > shrinkers.
>
> We should have lruvec or memcg available. lruvec_memcg() could be used
> to get memcg from the lruvec. It might be more places to add the id but
> arguably this would improve them to identify where the memory has been
> scanned/reclaimed from.
>
Oh, thank you, I hadn't seen this conversion function before.
> > 2) With this approach, we will not have the ability to trace a situation
> > where the kernel is requesting reclaim for a specific memcg, but due to
> > limits issues, we are unable to run it.
>
> I do not follow. Could you be more specific please?
>
I'm referring to a situation where kswapd or other kernel mm code
requests reclaim of some pages from a memcg, but the memcg is skipped
because of the protection limit checks. This occurs in the
shrink_node_memcgs() function.
===
mem_cgroup_calculate_protection(target_memcg, memcg);
if (mem_cgroup_below_min(target_memcg, memcg)) {
	/*
	 * Hard protection.
	 * If there is no reclaimable memory, OOM.
	 */
	continue;
} else if (mem_cgroup_below_low(target_memcg, memcg)) {
	/*
	 * Soft protection.
	 * Respect the protection only as long as
	 * there is an unprotected supply
	 * of reclaimable memory from other cgroups.
	 */
	if (!sc->memcg_low_reclaim) {
		sc->memcg_low_skipped = 1;
		continue;
	}
	memcg_memory_event(memcg, MEMCG_LOW);
}
===
With separate shrink begin/end tracepoints, we can detect such a
problem.
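To make the possible outcomes of that check explicit, here is a minimal
userspace sketch. It is not kernel code: struct memcg_model,
decide_reclaim() and the enum names are hypothetical, and the plain
"usage <= protection" comparison is only an assumed simplification of
what mem_cgroup_below_min() and mem_cgroup_below_low() do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a memcg; all fields are in pages. */
struct memcg_model {
	unsigned long usage;	/* current charge */
	unsigned long emin;	/* effective hard protection (assumed) */
	unsigned long elow;	/* effective soft protection (assumed) */
};

/* Hypothetical names for the outcomes of the protection check. */
enum reclaim_decision {
	RECLAIM_SKIP_MIN,	/* hard protected: memcg is skipped */
	RECLAIM_SKIP_LOW,	/* soft protected: skipped, skip is recorded */
	RECLAIM_WITH_LOW_EVENT,	/* soft protected but reclaimed anyway */
	RECLAIM_PROCEED,	/* unprotected: reclaimed normally */
};

/*
 * Mirrors the branch structure of the excerpt above.
 * low_reclaim stands in for sc->memcg_low_reclaim and
 * *low_skipped for sc->memcg_low_skipped.
 */
static enum reclaim_decision decide_reclaim(const struct memcg_model *m,
					    bool low_reclaim,
					    bool *low_skipped)
{
	assert(m && low_skipped);

	if (m->usage <= m->emin)
		return RECLAIM_SKIP_MIN;

	if (m->usage <= m->elow) {
		if (!low_reclaim) {
			*low_skipped = true;
			return RECLAIM_SKIP_LOW;
		}
		/* The excerpt raises MEMCG_LOW here and goes on to reclaim. */
		return RECLAIM_WITH_LOW_EVENT;
	}

	return RECLAIM_PROCEED;
}
```

In the two skip cases the loop continues before shrink_lruvec() and
shrink_slab() are reached for that memcg, so tracepoints located inside
those functions would not fire for it.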
> > 3) LRU and SLAB shrinkers are too common places to handle memcg-related
> > tasks. Additionally, memcg can be disabled in the kernel configuration.
>
> Right. This could be all hidden in the tracing code. You simply do not
> print memcg id when the controller is disabled. Or just simply print 0.
> I do not really see any major problems with that.
>
> I would really prefer to focus on that direction rather than adding
> another begin/end tracepoint which overlaps with existing begin/end
> traces and provides much more limited information because I would bet we
> will have somebody complaining that mere nr_reclaimed is not sufficient.
Okay, I will try to prepare a new patch version with memcg printing from
lruvec and slab tracepoints.
Then Andrew should drop the previous patchsets, I suppose. Please advise
on the correct workflow steps here.
--
Thank you,
Dmitry