Message-ID: <CAJD7tkYV3iwk-ZJcr_==V4e24yH-1NaCYFUL7wDaQEi8ZXqfqQ@mail.gmail.com>
Date: Tue, 16 Jul 2024 17:35:05 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: tj@...nel.org, cgroups@...r.kernel.org, shakeel.butt@...ux.dev,
hannes@...xchg.org, lizefan.x@...edance.com, longman@...hat.com,
kernel-team@...udflare.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by
kswapd across NUMA nodes
[..]
>
>
> This is a clean (meaning no cadvisor interference) example of kswapd
> starting simultaneously on many NUMA nodes, which in 27 out of 98
> cases hit the race (which is handled in V6 and V7).
>
> The BPF "cnt" maps are cleared every second, so this approximates
> per-second numbers. This patch reduces pressure on the lock, but we
> are still seeing full flushes (kfunc:vmlinux:cgroup_rstat_flush_locked)
> approx 37 times per sec (one every 27 ms). On the positive side, the
> ongoing_flusher mitigation stopped 98 flushes per sec.
>
> In this clean kswapd case the patch removes the lock contention issue
> for kswapd. The 27 lock_contended cases all seem to correspond to the
> 27 handled_race cases.
>
> The remaining high flush rate should also be addressed; we should
> work on approaches to limit it, like my earlier proposal[1].

I honestly don't think a high number of flushes is a problem on its
own as long as we are not spending too much time flushing, especially
when we have magnitude-based thresholding so we know there is
something to flush (although it may not be relevant to what we are
doing).
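
To be concrete, the thresholding I am referring to is essentially what
memcg does before flushing; a minimal sketch of the idea, modeled on
memcg_vmstats_needs_flush() in mm/memcontrol.c (the struct here is
simplified to the one field the check needs, so treat this as a
sketch rather than the exact upstream code):

/*
 * Sketch of the magnitude-based flush threshold, modeled on
 * memcg_vmstats_needs_flush() in mm/memcontrol.c.
 */
struct memcg_vmstats {
	/* pending stats updates, aggregated on the update side */
	atomic64_t stats_updates;
};

static bool stats_flush_worthwhile(struct memcg_vmstats *vmstats)
{
	/*
	 * Skip the flush entirely unless the accumulated error could
	 * exceed one charge batch per online CPU; below that, readers
	 * tolerate the inaccuracy.
	 */
	return atomic64_read(&vmstats->stats_updates) >
	       MEMCG_CHARGE_BATCH * num_online_cpus();
}

So by the time a flusher reaches the lock, we already know its subtree
has accumulated enough pending updates to be worth the walk.
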
If we keep observing a lot of lock contention, one thing I thought
about is a variant of spin_lock with a timeout. That would limit the
flushing latency, instead of limiting the number of flushes (which I
believe is the wrong metric to optimize).
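
To illustrate the semantics I have in mind, a minimal sketch built on
spin_trylock() (spin_lock_timeout() is a hypothetical helper, not an
existing kernel API):

/*
 * Hypothetical spin_lock_timeout(): spin trying to acquire @lock
 * until @timeout_ns of wall-clock time has passed.  Returns true if
 * the lock was acquired, false if we gave up.
 */
static bool spin_lock_timeout(spinlock_t *lock, u64 timeout_ns)
{
	ktime_t deadline = ktime_add_ns(ktime_get(), timeout_ns);

	do {
		if (spin_trylock(lock))
			return true;
		cpu_relax();
	} while (ktime_before(ktime_get(), deadline));

	return false;
}

A flusher that times out would skip its flush and rely on a later
flusher (or the ongoing_flusher logic) to pick up the pending stats,
so no single flusher ever spins on the lock for longer than the
timeout.
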
It also seems to me that we are doing a flush every 27ms, and your
proposed threshold was once per 50ms. It doesn't seem like a
fundamental difference.

I am also wondering how many more flushes could be skipped if we
handle the case of multiple ongoing flushers (whether by using a
mutex, or making it a per-cgroup property as I suggested earlier).
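
As a rough sketch of the per-cgroup direction (the ongoing-flusher
field and the helper below are made up for illustration, not code in
this series):

/*
 * Skip (or wait) instead of contending on the global rstat lock when
 * this cgroup's subtree is already covered by an ongoing flush of
 * this cgroup or one of its ancestors.
 */
static bool cgroup_rstat_covered_by_ongoing_flush(struct cgroup *cgrp)
{
	struct cgroup *iter;

	for (iter = cgrp; iter; iter = cgroup_parent(iter))
		if (READ_ONCE(iter->rstat_ongoing_flusher))
			return true;

	return false;
}

That way two flushers of disjoint subtrees would not serialize on each
other at all, and a flusher of a sub-subtree could simply piggyback on
an ancestor's ongoing flush.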