linux-kernel - Re: [PATCH v2] workingset: ensure memcg is valid for recency check

Message-ID: <CAJD7tkZsmHLAbmZXFHJA2BqPHYeyHMVYxsMKFZywTHdiNFiTdw@mail.gmail.com>
Date:   Fri, 18 Aug 2023 14:59:56 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Yu Zhao <yuzhao@...gle.com>
Cc:     Shakeel Butt <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Nhat Pham <nphamcs@...il.com>, akpm@...ux-foundation.org,
        kernel-team@...a.com, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, stable@...r.kernel.org
Subject: Re: [PATCH v2] workingset: ensure memcg is valid for recency check

On Fri, Aug 18, 2023 at 2:52 PM Yu Zhao <yuzhao@...gle.com> wrote:
>
> On Fri, Aug 18, 2023 at 3:35 PM Shakeel Butt <shakeelb@...gle.com> wrote:
> >
> > On Fri, Aug 18, 2023 at 11:44:45AM -0700, Yosry Ahmed wrote:
> > > On Fri, Aug 18, 2023 at 11:35 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > > >
> > > > On Fri, Aug 18, 2023 at 10:45:56AM -0700, Yosry Ahmed wrote:
> > > > > On Fri, Aug 18, 2023 at 10:35 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > > > > > On Fri, Aug 18, 2023 at 07:56:37AM -0700, Yosry Ahmed wrote:
> > > > > > > > If so, the following interleaving seems possible:
> > > > > > >
> > > > > > > cpu #1                                  cpu#2
> > > > > > >                                              css_put()
> > > > > > >                                              /* css_free_rwork_fn is queued */
> > > > > > > rcu_read_lock()
> > > > > > > mem_cgroup_from_id()
> > > > > > >                                              mem_cgroup_id_remove()
> > > > > > > /* access memcg */
> > > > > >
> > > > > > > I don't quite see how that'd be possible. IDR uses rcu_assign_pointer()
> > > > > > > during deletion, which inserts the necessary barriers. My
> > > > > > > understanding is that this should always be safe:
> > > > > >
> > > > > >   rcu_read_lock()                 (writer serialization, in this case ref count == 0)
> > > > > >   foo = idr_find(x)               idr_remove(x)
> > > > > >   if (foo)                        kfree_rcu(foo)
> > > > > >     LOAD(foo->bar)
> > > > > >   rcu_read_unlock()
> > > > >
> > > > > How does a barrier inside IDR removal protect against the memcg being
> > > > > freed here though?
> > > > >
> > > > > If css_put() is executed out-of-order before mem_cgroup_id_remove(),
> > > > > the memcg can be freed even before mem_cgroup_id_remove() is called,
> > > > > right?
> > > >
> > > > css_put() can start earlier, but it's not allowed to reorder the rcu
> > > > callback that frees past the rcu_assign_pointer() in idr_remove().
> > > >
> > > > This is what RCU and its access primitives guarantee. It ensures that
> > > > after "unpublishing" the pointer, all concurrent RCU-protected
> > > > accesses to the object have finished, and the memory can be freed.
> > >
> > > I am not sure I understand, this is the scenario I mean:
> > >
> > > cpu#1                      cpu#2                             cpu#3
> > > css_put()
> > > /* schedule free */
> > >                                 rcu_read_lock()
> > > idr_remove()
> > >                                mem_cgroup_from_id()
> > >
> > > /* free memcg */
> > >                                /* use memcg */
> > >
> > > If I understand correctly you are saying that the scheduled free
> > > callback cannot run before idr_remove() due to the barrier in there,
> > > but it can run after the rcu_read_lock() in cpu #2 because it was
> > > scheduled before that RCU critical section started, right?
> >
> > Isn't there a simpler explanation? The memcg whose id is stored in the
> > shadow entry has been freed and there is an ongoing new memcg allocation
> > which by chance has acquired the same id and has not yet been completely
> > initialized. More specifically the new memcg creation is between
> > css_alloc() and init_and_link_css() and there is a refault for the
> > shadow entry holding that id.

That's actually very plausible.

>
> I think so, and this fix would just crash at tryget() instead when
> hitting the problem.

It seems like mem_cgroup_from_id() completely disregards
memcg->id.ref. In the case that Shakeel mentioned, memcg->id.ref
should have a count of 0. Perhaps we should update it to respect the
id refcount? Maybe try to acquire a ref first before looking up the
idr?
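
To make the idea concrete, here is a rough userspace sketch of a lookup that
only succeeds if it can take an ID reference. This is not kernel code and not
a proposed patch: struct fake_memcg, id_table, and all helper names are made
up for illustration, the plain array stands in for the IDR, and the RCU
protection that the real lookup relies on is not modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ID 8

/* Hypothetical stand-in for struct mem_cgroup; only the ID ref is modeled. */
struct fake_memcg {
	atomic_int id_ref;	/* plays the role of memcg->id.ref */
	int id;
};

/* Hypothetical stand-in for the IDR: a flat id -> object table. */
static struct fake_memcg *id_table[MAX_ID];

/* Take an ID reference only if the count has not already dropped to 0. */
static bool id_ref_tryget(struct fake_memcg *memcg)
{
	int old = atomic_load(&memcg->id_ref);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the zero check is redone. */
		if (atomic_compare_exchange_weak(&memcg->id_ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop an ID reference taken with id_ref_tryget(). */
static void id_ref_put(struct fake_memcg *memcg)
{
	int old = atomic_fetch_sub(&memcg->id_ref, 1);

	assert(old > 0);	/* catch underflow in this sketch */
}

/*
 * Look up by ID, but treat an object whose ID refcount is 0 (not fully
 * set up yet, or already on its way out) as if it were not there.
 */
static struct fake_memcg *fake_memcg_from_id_get(int id)
{
	struct fake_memcg *memcg;

	if (id < 0 || id >= MAX_ID)
		return NULL;

	memcg = id_table[id];
	if (!memcg || !id_ref_tryget(memcg))
		return NULL;

	return memcg;
}
```

With something along these lines, a refault that hits an ID belonging to a
half-initialized memcg would get NULL back instead of a pointer it cannot
safely use, and the caller would have to drop the ID ref once it is done.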