Message-ID: <CAJD7tkak0yZNh+ZQ0FRJhmHPmC5YmccV4Cs+ZOk9DCp4s1ECCA@mail.gmail.com>
Date: Wed, 4 Sep 2024 14:10:03 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: tj@...nel.org, cgroups@...r.kernel.org, shakeel.butt@...ux.dev,
hannes@...xchg.org, lizefan.x@...edance.com, longman@...hat.com,
kernel-team@...udflare.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V10] cgroup/rstat: Avoid flushing if there is an ongoing
root flush
On Wed, Sep 4, 2024 at 12:41 PM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept to
> avoid redundant flushes if there is an ongoing flush at the cgroup root level,
> addressing production lock contention issues on the global cgroup rstat lock.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node, we
> noted severe lock contention on the rstat lock, causing 12 CPUs to waste
> cycles spinning every time kswapd runs. Fleet-wide stats (/proc/N/schedstat)
> for kthreads revealed that we are burning an average of 20,000 CPU cores
> on kswapd, primarily due to spinning on the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN threads
> to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering the
> cgroup rstat flush operation as part of its work.
> - balance_pgdat() runs with a NULL target_mem_cgroup, causing
>   mem_cgroup_flush_stats() to flush with root_mem_cgroup (see the paraphrased
>   snippet below).
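
For readers without the tree handy, the call path being described reduces to
something like the following (paraphrased from mm/vmscan.c around the
7d7ef0a4686a era; a sketch, not a verbatim quote):

        /* kswapd -> balance_pgdat() -> shrink_node(), with no target memcg */
        static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
        {
                ...
                /*
                 * sc->target_mem_cgroup is NULL for kswapd, and the flush
                 * treats a NULL memcg as root_mem_cgroup, so every kswapdN
                 * thread funnels into a root-level flush and contends on
                 * the global rstat lock.
                 */
                mem_cgroup_flush_stats(sc->target_mem_cgroup);
                ...
        }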
>
> The kernel previously addressed this with a "stats_flush_ongoing" concept,
> which was removed in commit 7d7ef0a4686a ("mm: memcg: restore subtree stats
> flushing"). This patch reintroduces and generalizes the concept to apply to
> all users of cgroup rstat, not just memcg.
>
> In this patch, only a root cgroup can become the ongoing flusher, as this
> solves the production issue. Letting other levels become the ongoing flusher
> would cause the root cgroup to contend on the lock again.
>
> Some in-kernel users of the cgroup_rstat_flush() API depend on waiting for
> the flush to complete before continuing. This patch introduces
> cgroup_rstat_flush_relaxed() and uses it in the cases that can live with
> slightly inaccurate flushes.
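
(For context: given the naming in this series, the relaxed variant presumably
boils down to calling the trylock helper below in non-strict mode. A minimal
sketch, assuming this series' cgroup_rstat_unlock_flusher() counterpart and
the existing cgroup_rstat_flush_locked(); not the verbatim patch code:

        /* Sketch: flush, unless an ongoing root flusher already covers cgrp. */
        void cgroup_rstat_flush_relaxed(struct cgroup *cgrp)
        {
                might_sleep();

                /* strict=false: skipping is allowed if a root flush is ongoing */
                if (!cgroup_rstat_trylock_flusher(cgrp, false))
                        return;

                cgroup_rstat_flush_locked(cgrp);
                cgroup_rstat_unlock_flusher(cgrp);
        }

The strict variant would pass strict=true, so those callers never return
before a flush has actually completed.)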
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing")
> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
I am honestly not happy with this patch; see below.
> ---
>
Please include a brief delta vs the previous version here to save
reviewers the duplicated effort of figuring it out.
> V9: https://lore.kernel.org/all/172245504313.3147408.12138439169548255896.stgit@firesoul/
> V8: https://lore.kernel.org/all/172139415725.3084888.13770938453137383953.stgit@firesoul
> V7: https://lore.kernel.org/all/172070450139.2992819.13210624094367257881.stgit@firesoul
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/
[..]
> @@ -299,6 +301,67 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> spin_unlock_irq(&cgroup_rstat_lock);
> }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> + return cgroup_parent(cgrp) == NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for an ongoing flusher
> + * @cgrp: target cgroup
> + *
> + * The return value follows trylock semantics: returning true when the lock
> + * is obtained, and false when not locked because flushing can be skipped,
> + * as another ongoing flusher is already taking care of the flush.
> + *
> + * For callers that depend on the flush completing before returning, a strict
> + * option is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool strict)
> +{
> + struct cgroup *ongoing;
> +
> + if (strict)
> + goto lock;
> +
> + /*
> + * Check if an ongoing flusher is already taking care of this. The
> + * descendant check is necessary because cgroup v1 supports multiple roots.
> + */
> + ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> + if (ongoing && cgroup_is_descendant(cgrp, ongoing))
> + return false;
Why did we drop the agreed-upon method of waiting until the ongoing
flusher is done? This is now a much more intrusive patch that makes all
flushers skip if a root is currently flushing. That causes user-visible
problems and is something I worked hard to fix. I thought we got good
results with waiting for the ongoing flusher as long as it is a root.
What changed?
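
For reference, the wait-based variant I am referring to looked roughly like
this in the earlier versions (from memory; the completion field name and
timeout constant are approximate, not the exact V8/V9 code):

        ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
        if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
                /*
                 * Wait for the ongoing root flusher to finish instead of
                 * skipping, so callers still observe up-to-date stats.
                 * The timeout bounds the wait if the flusher stalls.
                 */
                wait_for_completion_interruptible_timeout(
                        &ongoing->flush_done, MAX_FLUSH_WAIT_JIFFIES);
                return false;
        }

(Note that the lifetime concern below applies to this sketch as well, since
it dereferences 'ongoing' without holding a reference.)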
You also never addressed my concern here about 'ongoing' potentially being
freed while we are accessing it (no reference is held on it), and you never
responded to my question in v8 about expanding this to support non-root
cgroups once we shift to a mutex.
I don't appreciate this silent yet drastic change being made in this
version without addressing the concerns raised in previous versions.
Please let me know if I missed something.
> +
> +	/* Grab the right to be the ongoing flusher */
> + if (!ongoing && cgroup_is_root(cgrp)) {
> + struct cgroup *old;
> +
> + old = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> + if (old) {
> + /* Lost race for being ongoing flusher */
> + if (cgroup_is_descendant(cgrp, old))
> + return false;
> + }
> +		/* Due to lock yield combined with strict mode, record ID */
> + WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);
I am not sure I understand why we need this. Do you mind elaborating?