Date:   Tue, 22 Nov 2022 16:49:54 -0800
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Roman Gushchin <roman.gushchin@...ux.dev>
Cc:     Shakeel Butt <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>, Yu Zhao <yuzhao@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Vasily Averin <vasily.averin@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>,
        Chris Down <chris@...isdown.name>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH] mm: memcg: fix stale protection of reclaim target memcg

On Tue, Nov 22, 2022 at 4:45 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin <roman.gushchin@...ux.dev> wrote:
> >
> > On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > > During reclaim, mem_cgroup_calculate_protection() is used to determine
> > > the effective protection (emin and elow) values of a memcg. The
> > > > protection of the reclaim target is ignored, but we cannot set its
> > > > effective protection to 0 due to a limitation of the current
> > > > implementation (see comment in mem_cgroup_protection()). Instead,
> > > > we leave its effective protection values unchanged, and later ignore
> > > > them in mem_cgroup_protection().
> > >
> > > However, mem_cgroup_protection() is called later in
> > > shrink_lruvec()->get_scan_count(), which is after the
> > > mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> > > result, the stale effective protection values of the target memcg may
> > > lead us to skip reclaiming from the target memcg entirely, before
> > > calling shrink_lruvec(). This can be even worse with recursive
> > > protection, where the stale target memcg protection can be higher than
> > > its standalone protection.
> > >
> > > An example where this can happen is as follows. Consider the following
> > > hierarchy with memory_recursiveprot:
> > > ROOT
> > >  |
> > >  A (memory.min = 50M)
> > >  |
> > >  B (memory.min = 10M, memory.high = 40M)
> > >
> > > > Consider the following scenario:
> > > - B has memory.current = 35M.
> > > - The system undergoes global reclaim (target memcg is NULL).
> > > - B will have an effective min of 50M (all of A's unclaimed protection).
> > > - B will not be reclaimed from.
> > > > - Now allocate 10M more memory in B, pushing it above its high limit.
> > > > - The system undergoes memcg reclaim from B (target memcg is B).
> > > - In shrink_node_memcgs(), we call mem_cgroup_calculate_protection(),
> > >   which immediately returns for B without doing anything, as B is the
> > >   target memcg, relying on mem_cgroup_protection() to ignore B's stale
> > >   effective min (still 50M).
> > > - Directly after mem_cgroup_calculate_protection(), we will call
> > >   mem_cgroup_below_min(), which will read the stale effective min for B
> > >   and skip it (instead of ignoring its protection as intended). In this
> > >   case, it's really bad because we are not just considering B's
> > >   standalone protection (10M), but we are reading a much higher stale
> > >   protection (50M) which will cause us to not reclaim from B at all.
> > >
> > > This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> > > e{low,min} state mutations from protection checks") which made
> > > mem_cgroup_calculate_protection() only change the state without
> > > returning any value. Before that commit, we used to return
> > > MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> > > mem_cgroup_below_{min/low}() checks. After that commit we do not return
> > > anything and we end up checking the min & low effective protections for
> > > the target memcg, which are stale.
> > >
> > > Add mem_cgroup_ignore_protection() that checks if we are reclaiming from
> > > the target memcg, and call it in mem_cgroup_below_{min/low}() to ignore
> > > the stale protection of the target memcg.
> > >
> > > Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> > > Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>
> >
> > Great catch!
> > The fix looks good to me, only a couple of cosmetic suggestions.
> >
> > > ---
> > >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> > >  mm/vmscan.c                | 11 ++++++-----
> > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index e1644a24009c..22c9c9f9c6b1 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> > >
> > >  }
> > >
> > > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > > +                                             struct mem_cgroup *memcg)
> > >  {
> > > -     if (!mem_cgroup_supports_protection(memcg))
> >
> > How about merging mem_cgroup_supports_protection() and your new helper into
> > something like mem_cgroup_possibly_protected()? It seems they are never used
> > separately and are unlikely ever to be.
>
> Sounds good! I am thinking maybe mem_cgroup_no_protection() which is
> an inlining of !mem_cgroup_supports_protection() ||
> mem_cgroup_ignore_protection().
>
> > Also, I'd swap target and memcg arguments.
>
> Sounds good.

I just remembered, the reason I put "target" first is to match the
ordering of mem_cgroup_calculate_protection(), otherwise the code in
shrink_node_memcgs() may be confusing.
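
As a rough illustration of the ordering point, here is a minimal userspace
sketch of the merged helper with "target" first. It is not the actual patch:
struct mem_cgroup is a simplified stand-in, the is_root field replaces
mem_cgroup_is_root(), the mem_cgroup_disabled() check is omitted, values are
in MB, and mem_cgroup_below_min_stale() is a hypothetical name for the
pre-fix check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel structure (assumption, not the real layout). */
struct mem_cgroup {
	bool is_root;
	unsigned long emin;	/* effective memory.min, in MB */
	unsigned long elow;	/* effective memory.low, in MB */
	unsigned long usage;	/* memory.current, in MB */
};

/*
 * Merged helper as discussed in the thread: true when protection must not
 * be considered for memcg, either because it is the root or because it is
 * the reclaim target (whose effective protection may be stale).
 */
static inline bool mem_cgroup_no_protection(const struct mem_cgroup *target,
					    const struct mem_cgroup *memcg)
{
	return memcg->is_root || memcg == target;
}

/* Pre-fix behaviour: the reclaim target is not taken into account. */
static inline bool mem_cgroup_below_min_stale(const struct mem_cgroup *memcg)
{
	if (memcg->is_root)
		return false;
	return memcg->emin >= memcg->usage;
}

/* Post-fix behaviour, with "target" first as in mem_cgroup_calculate_protection(). */
static inline bool mem_cgroup_below_min(const struct mem_cgroup *target,
					const struct mem_cgroup *memcg)
{
	if (mem_cgroup_no_protection(target, memcg))
		return false;
	return memcg->emin >= memcg->usage;
}

static inline bool mem_cgroup_below_low(const struct mem_cgroup *target,
					const struct mem_cgroup *memcg)
{
	if (mem_cgroup_no_protection(target, memcg))
		return false;
	return memcg->elow >= memcg->usage;
}
```

With B from the example in the commit message (stale emin = 50, usage = 45),
the stale check reports B as below min, so B is skipped. The target-aware
check returns false when B is the target, so B gets reclaimed, and it still
honours the protection during global reclaim (target == NULL).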

>
> >
> > Thank you!
> >
> >
> > PS If it's not too hard, please, consider adding a new kselftest to cover this case.
> > Thank you!
>
> I will try to translate my bash test to something in test_memcontrol.
> I don't plan to spend a lot of time on it though, so I hope it's simple
> enough.
