Message-ID: <CAH5fLggKrb4LZk6JL5A0jJODA1zJs+94AU5NMmyV9ksraigF7A@mail.gmail.com>
Date: Thu, 28 Nov 2024 13:27:03 +0100
From: Alice Ryhl <aliceryhl@...gle.com>
To: Dave Chinner <david@...morbit.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Andrew Morton <akpm@...ux-foundation.org>,
Nhat Pham <nphamcs@...il.com>, Qi Zheng <zhengqi.arch@...edance.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
Linux Memory Management List <linux-mm@...ck.org>, Michal Hocko <mhocko@...nel.org>,
Shakeel Butt <shakeel.butt@...ux.dev>, cgroups@...r.kernel.org,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [QUESTION] What memcg lifetime is required by list_lru_add?

On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner <david@...morbit.com> wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be
> > required that you pass the same values of nid and memcg to both
> > calls, since list_lru_del() might otherwise try to delete the item
> > from the wrong list, or delete it while holding the wrong spinlock.
> > I'm trying to understand the implications of this requirement for
> > the lifetime of the memcg.
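> >
> > For concreteness, the pairing I have in mind looks something like
> > this (my_obj and its fields are hypothetical; only list_lru_add()
> > and list_lru_del() are the real API):
> >
> > struct my_obj {
> >         struct list_head lru_link;
> >         int nid;                  /* recorded at add time */
> >         struct mem_cgroup *memcg; /* recorded at add time */
> > };
> >
> > /* both calls must name the same per-node, per-memcg list */
> > list_lru_add(&my_lru, &obj->lru_link, obj->nid, obj->memcg);
> > ...
> > list_lru_del(&my_lru, &obj->lru_link, obj->nid, obj->memcg);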
> >
> > Now, looking at list_lru_add_obj(), I noticed that it uses RCU
> > locking to keep the memcg object alive for the duration of
> > list_lru_add(). The fact that RCU locking is used here seems to
> > imply that without it, the memcg could be deallocated during the
> > list_lru_add() call, which is of course bad. But RCU is not enough
> > on its own to keep the memcg alive all the way until the
> > list_lru_del_obj() call, so what ensures that the memcg stays valid
> > for that long?
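> >
> > Roughly, the code I am looking at (paraphrased from mm/list_lru.c;
> > details may differ between kernel versions):
> >
> > bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
> > {
> >         int nid = page_to_nid(virt_to_page(item));
> >         bool ret;
> >
> >         if (list_lru_memcg_aware(lru)) {
> >                 /* RCU keeps the memcg alive across the add only */
> >                 rcu_read_lock();
> >                 ret = list_lru_add(lru, item, nid,
> >                                    mem_cgroup_from_slab_obj(item));
> >                 rcu_read_unlock();
> >         } else {
> >                 ret = list_lru_add(lru, item, nid, NULL);
> >         }
> >         return ret;
> > }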
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only runs once the percpu_ref kill is confirmed to be visible on all
> CPUs (so css_tryget_online() can no longer succeed), as waited on via:
>
> kill_css
>   percpu_ref_kill_and_confirm(css_killed_ref_fn)
>   <wait>
>     css_killed_ref_fn
>       offline_css
>         mem_cgroup_css_offline
>           memcg_offline_kmem
>           {
>             .....
>             memcg_reparent_objcgs(memcg, parent);
>
>             /*
>              * After we have finished memcg_reparent_objcgs(), all list_lrus
>              * corresponding to this cgroup are guaranteed to remain empty.
>              * The ordering is imposed by list_lru_node->lock taken by
>              * memcg_reparent_list_lrus().
>              */
>             memcg_reparent_list_lrus(memcg, parent)
>           }
>
> The cgroup teardown code then schedules the freeing of the memcg
> container via an RCU work callback, once the kill is globally
> visible and the reference count has gone to zero.
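>
> In call-trace form (paraphrased; css_release() and css_free_rwork_fn()
> are named in the comment quoted below, the intermediate steps are
> from current mainline and may differ between versions):
>
> css_release                        /* percpu_ref reached zero */
>   css_release_work_fn              /* punted to process context */
>     queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork)
>       <RCU grace period>
>         css_free_rwork_fn          /* frees the memcg container */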
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of any access to an unreferenced cgroup object. This allows
> subsystems to perform operations on the cgroup object without
> needing to hold a cgroup reference for every access. The complex,
> multi-stage teardown process lets a cgroup release the objects it
> tracks, avoiding the need for every tracked object to hold a
> reference count on the cgroup.
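>
> A minimal sketch of the access pattern this implies for callers
> (obj is hypothetical, not taken from any specific caller):
>
> struct mem_cgroup *memcg;
>
> rcu_read_lock();
> memcg = mem_cgroup_from_slab_obj(obj);
> /*
>  * memcg cannot be freed before rcu_read_unlock(), even if it has
>  * already been offlined. To use it beyond this point, a real
>  * reference must be taken first, e.g. via css_tryget().
>  */
> rcu_read_unlock();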
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
> * css destruction is a four-stage process.
> *
> * 1. Destruction starts. Killing of the percpu_ref is initiated.
> * Implemented in kill_css().
> *
> * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
> * and thus css_tryget_online() is guaranteed to fail, the css can be
> * offlined by invoking offline_css(). After offlining, the base ref is
> * put. Implemented in css_killed_work_fn().
> *
> * 3. When the percpu_ref reaches zero, the only possible remaining
> * accessors are inside RCU read sections. css_release() schedules the
> * RCU callback.
> *
> * 4. After the grace period, the css can be freed. Implemented in
> * css_free_rwork_fn().
> *
> * It is actually hairier because both steps 2 and 4 require process
> * context and thus involve punting to css->destroy_work, adding two
> * additional steps to the already complex sequence.
> */

Thanks a lot Dave, this clears it up for me.

I sent a patch containing some additional docs for list_lru:
https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@google.com/

Alice