Message-ID: <CAJD7tkbqtrJqD9=5f-ZZBcWyX9t-e=fenJdDU5U=GDpbbWrzrw@mail.gmail.com>
Date: Thu, 28 Dec 2023 07:13:23 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Shakeel Butt <shakeelb@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	Yu Zhao <yuzhao@...gle.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: ratelimit stat flush from workingset shrinker

On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@...gle.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since

s/entried/entries

> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.
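
For context, the ratelimited variant being switched to below only
flushes when the periodic flusher has fallen a full cycle behind.
Roughly, paraphrasing mm/memcontrol.c around this kernel version (so
the details may be slightly off):

void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
{
	/* Only flush if the periodic flusher is one full cycle late */
	if (time_after64(jiffies_64,
			 READ_ONCE(flush_last_time) + 2UL * FLUSH_TIME))
		mem_cgroup_flush_stats(memcg);
}

where FLUSH_TIME is the 2 sec period mentioned above, so readers in
this path should see stats that are at most ~4 sec stale.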

Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing")? I think the answer is yes based on
internal discussions, but this really surprises me.

Commit f82e6bf9bb9b removed the percpu loop in
lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
the flush call is only effective if there are pending updates in the
cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
we should only be doing work when actually needed, whereas before we
used to have multiple percpu loops in count_shadow_nodes() regardless
of pending updates.
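
Concretely, after 7d7ef0a4686a the flush is gated on a magnitude
check along these lines (paraphrased, exact names may differ):

static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
	/* Skip the flush unless enough updates accumulated to matter */
	return atomic64_read(&vmstats->stats_updates) >
		MEMCG_CHARGE_BATCH * num_online_cpus();
}

so the mem_cgroup_flush_stats(sc->memcg) call should be nearly free
unless the subtree has actually crossed that error threshold.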

It seems like the cgroup subtree is so active that we continuously
need to flush in count_shadow_nodes()? If that's the case, do we still
think it's okay not to flush when we know there are pending updates? I
don't have enough background about the workingset heuristics to judge
this.
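
For reference, the approximation in question (from mm/workingset.c
just after this patch's hunk, lightly paraphrased) only uses the page
count to derive a coarse cap on shadow nodes:

	/*
	 * Compromise on a worst-case density of 1/8th: allow one
	 * xarray node (up to XA_CHUNK_SIZE shadow entries) per
	 * 8 pages on the lruvec's LRU lists.
	 */
	max_nodes = pages >> (XA_CHUNK_SHIFT - 3);

	if (nodes <= max_nodes)
		return 0;
	return nodes - max_nodes;

Given how coarse that bound is, stats that are a couple of seconds
stale seem unlikely to change the shrinker's decision much.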

I am not objecting to this change; I am just trying to understand
what's happening.

Thanks!

>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@...gle.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 struct lruvec *lruvec;
>                 int i;
>
> -               mem_cgroup_flush_stats(sc->memcg);
> +               mem_cgroup_flush_stats_ratelimited(sc->memcg);
>                 lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>                 for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>                         pages += lruvec_page_state_local(lruvec,
> --
> 2.43.0.472.g3155946c3a-goog
>
