linux-kernel - Re: [PATCH v2 1/3] mm: memcg: fix stale protection of reclaim target memcg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y369cNnRWkoymF1G@P9FQF9L96D.corp.robot.car>
Date:   Wed, 23 Nov 2022 16:40:16 -0800
From:   Roman Gushchin <roman.gushchin@...ux.dev>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Shakeel Butt <shakeelb@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>, Yu Zhao <yuzhao@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Vasily Averin <vasily.averin@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>,
        Chris Down <chris@...isdown.name>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 1/3] mm: memcg: fix stale protection of reclaim target
 memcg

On Wed, Nov 23, 2022 at 09:21:30AM +0000, Yosry Ahmed wrote:
> During reclaim, mem_cgroup_calculate_protection() is used to determine
> the effective protection (emin and elow) values of a memcg. The
> protection of the reclaim target is ignored, but we cannot set their
> effective protection to 0 due to a limitation of the current
> implementation (see comment in mem_cgroup_protection()). Instead,
> we leave their effective protection values unchaged, and later ignore it
> in mem_cgroup_protection().
> 
> However, mem_cgroup_protection() is called later in
> shrink_lruvec()->get_scan_count(), which is after the
> mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> result, the stale effective protection values of the target memcg may
> lead us to skip reclaiming from the target memcg entirely, before
> calling shrink_lruvec(). This can be even worse with recursive
> protection, where the stale target memcg protection can be higher than
> its standalone protection. See two examples below (a similar version of
> example (a) is added to test_memcontrol in a later patch).
> 
> (a) A simple example with proactive reclaim is as follows. Consider the
> following hierarchy:
> ROOT
>  |
>  A
>  |
>  B (memory.min = 10M)
> 
> Consider the following scenario:
> - B has memory.current = 10M.
> - The system undergoes global reclaim (or memcg reclaim in A).
> - In shrink_node_memcgs():
>   - mem_cgroup_calculate_protection() calculates the effective min (emin)
>     of B as 10M.
>   - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
> - Now if we want to reclaim 5M from B using proactive reclaim
>   (memory.reclaim), we should be able to, as the protection of the
>   target memcg should be ignored.
> - In shrink_node_memcgs():
>   - mem_cgroup_calculate_protection() immediately returns for B without
>     doing anything, as B is the target memcg, relying on
>     mem_cgroup_protection() to ignore B's stale effective min (still 10M).
>   - mem_cgroup_below_min() reads the stale effective min for B and we
>     skip it instead of ignoring its protection as intended, as we never
>     reach mem_cgroup_protection().
> 
> (b) An more complex example with recursive protection is as follows.
> Consider the following hierarchy with memory_recursiveprot:
> ROOT
>  |
>  A (memory.min = 50M)
>  |
>  B (memory.min = 10M, memory.high = 40M)
> 
> Consider the following scenario:
> - B has memory.current = 35M.
> - The system undergoes global reclaim (target memcg is NULL).
> - B will have an effective min of 50M (all of A's unclaimed protection).
> - B will not be reclaimed from.
> - Now allocate 10M more memory in B, pushing it above it's high limit.
> - The system undergoes memcg reclaim from B (target memcg is B).
> - Like example (a), we do nothing in mem_cgroup_calculate_protection(),
>   then call mem_cgroup_below_min(), which will read the stale effective
>   min for B (50M) and skip it. In this case, it's even worse because we
>   are not just considering B's standalone protection (10M), but we are
>   reading a much higher stale protection (50M) which will cause us to not
>   reclaim from B at all.
> 
> This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> e{low,min} state mutations from protection checks") which made
> mem_cgroup_calculate_protection() only change the state without
> returning any value. Before that commit, we used to return
> MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> mem_cgroup_below_{min/low}() checks. After that commit we do not return
> anything and we end up checking the min & low effective protections for
> the target memcg, which are stale.
> 
> Update mem_cgroup_supports_protection() to also check if we are
> reclaiming from the target, and rename it to mem_cgroup_unprotected()
> (now returns true if we should not protect the memcg, much simpler logic).
> 
> Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>

Reviewed-by: Roman Gushchin <roman.gushchin@...ux.dev>

Thank you!