linux-kernel - Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <exnxkjyaslel2jlvvwxlmebtav4m7fszn2qouiciwhuxpomhky@ljkycu67efbx>
Date: Mon, 24 Jun 2024 12:29:00 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, tj@...nel.org, 
	cgroups@...r.kernel.org, hannes@...xchg.org, lizefan.x@...edance.com, longman@...hat.com, 
	kernel-team@...udflare.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd
 across NUMA nodes

On Mon, Jun 24, 2024 at 10:40:48AM GMT, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 10:32 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Mon, Jun 24, 2024 at 05:46:05AM GMT, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 4:55 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
> > >
> > [...]
> > > I am assuming this supersedes your other patch titled "[PATCH RFC]
> > > cgroup/rstat: avoid thundering herd problem on root cgrp", so I will
> > > only respond here.
> > >
> > > I have two comments:
> > > - There is no reason why this should be limited to the root cgroup. We
> > > can keep track of the cgroup being flushed, and use
> > > cgroup_is_descendant() to find out if the cgroup we want to flush is a
> > > descendant of it. We can use a pointer and cmpxchg primitives instead
> > > of the atomic here IIUC.
> > >
> > > - More importantly, I am not a fan of skipping the flush if there is
> > > an ongoing one. For all we know, the ongoing flush could have just
> > > started and the stats have not been flushed yet. This is another
> > > example of non deterministic behavior that could be difficult to
> > > debug.
> >
> > Even with the flush, there will almost always per-cpu updates which will
> > be missed. This can not be fixed unless we block the stats updaters as
> > well (which is not going to happen). So, we are already ok with this
> > level of non-determinism. Why skipping flushing would be worse? One may
> > argue 'time window is smaller' but this still does not cap the amount of
> > updates. So, unless there is concrete data that this skipping flushing
> > is detrimental to the users of stats, I don't see an issue in the
> > presense of periodic flusher.
> 
> As you mentioned, the updates that happen during the flush are
> unavoidable anyway, and the window is small. On the other hand, we
> should be able to maintain the current behavior that at least all the
> stat updates that happened *before* the call to cgroup_rstat_flush()
> are flushed after the call.
> 
> The main concern here is that the stats read *after* an event occurs
> should reflect the system state at that time. For example, a proactive
> reclaimer reading the stats after writing to memory.reclaim should
> observe the system state after the reclaim operation happened.

What about the in-kernel users like kswapd? I don't see any before or
after events for the in-kernel users.

> 
> Please see [1] for more details about why this is important, which was
> the rationale for removing stats_flush_ongoing in the first place.
> 
> [1]https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@google.com/
>