lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <CAAPL-u-Futq5biNhQKTVi15vzihZxoan-dVORPqpov1saJ99=Q@mail.gmail.com> Date: Mon, 4 Dec 2023 15:46:31 -0800 From: Wei Xu <weixugc@...gle.com> To: Shakeel Butt <shakeelb@...gle.com> Cc: Yosry Ahmed <yosryahmed@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, Muchun Song <muchun.song@...ux.dev>, Ivan Babrou <ivan@...udflare.com>, Tejun Heo <tj@...nel.org>, Michal Koutný <mkoutny@...e.com>, Waiman Long <longman@...hat.com>, kernel-team@...udflare.com, Greg Thelen <gthelen@...gle.com>, Domenico Cerasuolo <cerasuolodomenico@...il.com>, linux-mm@...ck.org, cgroups@...r.kernel.org, linux-kernel@...r.kernel.org Subject: Re: [mm-unstable v4 5/5] mm: memcg: restore subtree stats flushing On Mon, Dec 4, 2023 at 3:31 PM Shakeel Butt <shakeelb@...gle.com> wrote: > > On Mon, Dec 4, 2023 at 1:38 PM Yosry Ahmed <yosryahmed@...gle.com> wrote: > > > > On Mon, Dec 4, 2023 at 12:12 PM Yosry Ahmed <yosryahmed@...gle.com> wrote: > > > > > > On Sat, Dec 2, 2023 at 12:31 AM Shakeel Butt <shakeelb@...gle.com> wrote: > > > > > > > > On Wed, Nov 29, 2023 at 03:21:53AM +0000, Yosry Ahmed wrote: > > > > [...] > > > > > +void mem_cgroup_flush_stats(struct mem_cgroup *memcg) > > > > > { > > > > > - if (memcg_should_flush_stats(root_mem_cgroup)) > > > > > - do_flush_stats(); > > > > > + static DEFINE_MUTEX(memcg_stats_flush_mutex); > > > > > + > > > > > + if (mem_cgroup_disabled()) > > > > > + return; > > > > > + > > > > > + if (!memcg) > > > > > + memcg = root_mem_cgroup; > > > > > + > > > > > + if (memcg_should_flush_stats(memcg)) { > > > > > + mutex_lock(&memcg_stats_flush_mutex); > > > > > > > > What's the point of this mutex now? What is it providing? I understand > > > > we can not try_lock here due to targeted flushing. Why not just let the > > > > global rstat serialize the flushes? Actually this mutex can cause > > > > latency hiccups as the mutex owner can get resched during flush and then > > > > no one can flush for a potentially long time. > > > > > > I was hoping this was clear from the commit message and code comments, > > > but apparently I was wrong, sorry. Let me give more context. > > > > > > In previous versions and/or series, the mutex was only used with > > > flushes from userspace to guard in-kernel flushers against high > > > contention from userspace. Later on, I kept the mutex for all memcg > > > flushers for the following reasons: > > > > > > (a) Allow waiters to sleep: > > > Unlike other flushers, the memcg flushing path can see a lot of > > > concurrency. The mutex avoids having a lot of CPUs spinning (e.g. > > > concurrent reclaimers) by allowing waiters to sleep. > > > > > > (b) Check the threshold under lock but before calling cgroup_rstat_flush(): > > > The calls to cgroup_rstat_flush() are not very cheap even if there's > > > nothing to flush, as we still need to iterate all CPUs. If flushers > > > contend directly on the rstat lock, overlapping flushes will > > > unnecessarily do the percpu iteration once they hold the lock. With > > > the mutex, they will check the threshold again once they hold the > > > mutex. > > > > > > (c) Protect non-memcg flushers from contention from memcg flushers. > > > This is not as strong of an argument as protecting in-kernel flushers > > > from userspace flushers. > > > > > > There has been discussions before about changing the rstat lock itself > > > to be a mutex, which would resolve (a), but there are concerns about > > > priority inversions if a low priority task holds the mutex and gets > > > preempted, as well as the amount of time the rstat lock holder keeps > > > the lock for: > > > https://lore.kernel.org/lkml/ZO48h7c9qwQxEPPA@slm.duckdns.org/ > > > > > > I agree about possible hiccups due to the inner lock being dropped > > > while the mutex is held. Running a synthetic test with high > > > concurrency between reclaimers (in-kernel flushers) and stats readers > > > show no material performance difference with or without the mutex. > > > Maybe things cancel out, or don't really matter in practice. > > > > > > I would prefer to keep the current code as I think (a) and (b) could > > > cause problems in the future, and the current form of the code (with > > > the mutex) has already seen mileage with production workloads. > > > > Correction: The priority inversion is possible on the memcg side due > > to the mutex in this patch. Also, for point (a), the spinners will > > eventually sleep once they hold the lock and hit the first CPU > > boundary -- because of the lock dropping and cond_resched(). So > > eventually, all spinners should be able to sleep, although it will be > > a while until they do. With the mutex, they all sleep from the > > beginning. Point (b) still holds though. > > > > I am slightly inclined to keep the mutex but I can send a small fixlet > > to remove it if others think otherwise. > > > > Shakeel, Wei, any preferences? > > My preference is to avoid the issue we know we see in production alot > i.e. priority inversion. > > In future if you see issues with spinning then you can come up with > the lockless flush mechanism at that time. Given that the synthetic high concurrency test doesn't show material performance difference between the mutex and non-mutex versions, I agree that the mutex can be taken out from this patch set (one less global mutex to worry about).
Powered by blists - more mailing lists