linux-kernel - Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkbnsSbZ2+Rf5NQKgBtH_JdN4AKMCuh8jasbQ-hcOOz-KA@mail.gmail.com>
Date:   Thu, 13 Apr 2023 03:45:42 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Christoph Lameter <cl@...ux.com>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Hyeonggon Yoo <42.hyeyoo@...il.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        David Hildenbrand <david@...hat.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Peter Xu <peterx@...hat.com>, NeilBrown <neilb@...e.de>,
        Shakeel Butt <shakeelb@...gle.com>,
        Michal Hocko <mhocko@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Dave Chinner <david@...morbit.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

On Thu, Apr 13, 2023 at 3:40 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
>
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim.  The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages, there is no simple way to currently link them to the memcg under
> reclaim.
>
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim, and may
> help reduce unnecessary reclaim retries during memcg charging.  Writing to
> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> for this case we want to include any freed pages, so use the
> global_reclaim() check instead of !cgroup_reclaim().
>
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> freed a slab page that was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
>
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages")


Andrew, I removed the CC: stable as you were sceptical about the need
for a backport, but left the Fixes tag so that it's easy to identify
where to backport it if you and/or stable maintainers decide
otherwise.

>
>
> Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>
> ---
>  mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..be657832be48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> + * scan_control->nr_reclaimed.
> + */
> +static void flush_reclaim_state(struct scan_control *sc)
> +{
> +       /*
> +        * Currently, reclaim_state->reclaimed includes three types of pages
> +        * freed outside of vmscan:
> +        * (1) Slab pages.
> +        * (2) Clean file pages from pruned inodes (on highmem systems).
> +        * (3) XFS freed buffer pages.
> +        *
> +        * For all of these cases, we cannot universally link the pages to a
> +        * single memcg. For example, a memcg-aware shrinker can free one object
> +        * charged to the target memcg, causing an entire page to be freed.
> +        * If we count the entire page as reclaimed from the memcg, we end up
> +        * overestimating the reclaimed amount (potentially under-reclaiming).
> +        *
> +        * Only count such pages for global reclaim to prevent under-reclaiming
> +        * from the target memcg; preventing unnecessary retries during memcg
> +        * charging and false positives from proactive reclaim.
> +        *
> +        * For uncommon cases where the freed pages were actually mostly
> +        * charged to the target memcg, we end up underestimating the reclaimed
> +        * amount. This should be fine. The freed pages will be uncharged
> +        * anyway, even if they are not counted here properly, and we will be
> +        * able to make forward progress in charging (which is usually in a
> +        * retry loop).
> +        *
> +        * We can go one step further, and report the uncharged objcg pages in
> +        * memcg reclaim, to make reporting more accurate and reduce
> +        * underestimation, but it's probably not worth the complexity for now.
> +        */
> +       if (current->reclaim_state && global_reclaim(sc)) {
> +               sc->nr_reclaimed += current->reclaim_state->reclaimed;
> +               current->reclaim_state->reclaimed = 0;
> +       }
> +}
> +
>  static long xchg_nr_deferred(struct shrinker *shrinker,
>                              struct shrink_control *sc)
>  {
> @@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>                 vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>                            sc->nr_reclaimed - reclaimed);
>
> -       sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> -       current->reclaim_state->reclaimed_slab = 0;
> +       flush_reclaim_state(sc);
>
>         return success ? MEMCG_LRU_YOUNG : 0;
>  }
> @@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -       struct reclaim_state *reclaim_state = current->reclaim_state;
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> @@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
>         shrink_node_memcgs(pgdat, sc);
>
> -       if (reclaim_state) {
> -               sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -               reclaim_state->reclaimed_slab = 0;
> -       }
> +       flush_reclaim_state(sc);
>
>         /* Record the subtree's reclaim efficiency */
>         if (!sc->proactive)
> --
> 2.40.0.577.gac1e443424-goog
>