Message-ID: <ntpnm3kdpqexncc4hz4xmfliay3tmbasxl6zatmsauo3sruwf3@zcmgz7oq5huy>
Date: Tue, 25 Jun 2024 14:20:43 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, tj@...nel.org,
cgroups@...r.kernel.org, hannes@...xchg.org, lizefan.x@...edance.com, longman@...hat.com,
kernel-team@...udflare.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd
across NUMA nodes
On Tue, Jun 25, 2024 at 01:45:00PM GMT, Yosry Ahmed wrote:
> On Tue, Jun 25, 2024 at 9:21 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Tue, Jun 25, 2024 at 09:00:03AM GMT, Yosry Ahmed wrote:
> > [...]
> > >
> > > My point is not about accuracy, although I think it's a reasonable
> > > argument on its own (a lot of things could change in a short amount of
> > > time, which is why I prefer magnitude-based ratelimiting).
> > >
> > > My point is about logical ordering. If a userspace program reads the
> > > stats *after* an event occurs, it expects to get a snapshot of the
> > > system state after that event. Two examples are:
> > >
> > > - A proactive reclaimer reading the stats after a reclaim attempt to
> > > check if it needs to reclaim more memory or fallback.
> > > - A userspace OOM killer reading the stats after a usage spike to
> > > decide which workload to kill.
> > >
> > > I listed such examples with more detail in [1], when I removed
> > > stats_flush_ongoing from the memcg code.
> > >
> > > [1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@google.com/
> >
> > You are kind of arbitrarily adding restrictions and rules here. Why not
> > follow the rules of a well established and battle tested stats infra
> > used by everyone i.e. vmstats? There is no sync flush and there are
> > frequent async flushes. I think that is what Jesper wants as well.
>
> That's how the memcg stats worked previously since before rstat and
> until the introduction of stats_flush_ongoing AFAICT. We saw an actual
> behavioral change when we were moving from a pre-rstat kernel to a
> kernel with stats_flush_ongoing. This was the rationale when I removed
> stats_flush_ongoing in [1]. It's not a new argument, I am just
> reiterating what we discussed back then.
In my reply above, I am not arguing to go back to the older
stats_flush_ongoing situation. Rather, I am discussing what the best
eventual solution should be. From the vmstat infra, we can learn that
with frequent async flushes and no sync flush, users are fine with the
'non-determinism'. Of course cgroup stats are different from vmstat,
i.e. they are hierarchical, but I think we can try this approach and
see whether it works or not.
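
To make that concrete, here is a rough sketch of the vmstat-style model
I have in mind: a deferrable work item periodically flushes the whole
hierarchy from the root, and readers never flush synchronously. This is
not a patch; the function/work item names and the 2 second period are
placeholders, and the work would have to be kicked off once at init:

#include <linux/cgroup.h>
#include <linux/workqueue.h>

#define ASYNC_FLUSH_PERIOD	(2UL * HZ)	/* placeholder period */

static void cgroup_stats_async_flush(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(cgroup_stats_flush_dwork,
			       cgroup_stats_async_flush);

static void cgroup_stats_async_flush(struct work_struct *w)
{
	/*
	 * Flush the whole default hierarchy from the root, off any
	 * reader's path, then rearm. Readers only ever see whatever
	 * the last periodic flush produced, like vmstat readers do.
	 */
	cgroup_rstat_flush(&cgrp_dfl_root.cgrp);
	queue_delayed_work(system_unbound_wq, &cgroup_stats_flush_dwork,
			   ASYNC_FLUSH_PERIOD);
}

The open question is whether the hierarchical nature of cgroup stats
makes the resulting staleness worse than in the vmstat case.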
BTW it seems like this topic should be discussed face-to-face over VC
or at LPC. What do you folks think?
Shakeel