Message-ID: <20230913153758.GB45543@cmpxchg.org>
Date:   Wed, 13 Sep 2023 11:37:58 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Ivan Babrou <ivan@...udflare.com>, Tejun Heo <tj@...nel.org>,
        Michal Koutný <mkoutny@...e.com>,
        Waiman Long <longman@...hat.com>, kernel-team@...udflare.com,
        Wei Xu <weixugc@...gle.com>, Greg Thelen <gthelen@...gle.com>,
        linux-mm@...ck.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and
 accuracy

On Wed, Sep 13, 2023 at 07:38:46AM +0000, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>   concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
> 
> The reason this approach is followed is because all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
> 
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>   be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes stats accuracy is compromised if there is an ongoing
>   flush, and we skip and return before the subtree of interest is
>   actually flushed. This is more visible when reading stats from
>   userspace, but can also affect in-kernel flushers.
> 
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
> - Without contention, there is no need to flush the entire tree. In this
>   case, only flush the subtree of interest. This avoids the latency of a
>   full root flush if unnecessary.
> - With contention, fall back to a coalesced (aka unified) flush of the
>   entire hierarchy: a root flush. In this case, instead of returning
>   immediately if a root flush is ongoing, wait for it to finish
>   *without* attempting to acquire the lock or flush. This is done using
>   a completion. Compared to competing directly on the underlying lock,
>   this approach makes concurrent flushing a synchronization point
>   instead of a serialization point. Once a root flush finishes, *all*
>   waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to the
>   number of online cpus. This keeps the flush latency bounded at the tail
>   (very high concurrency). We fall back to sacrificing stats freshness
>   only in such cases, in favor of performance.
> 
> This was tested in two ways on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers doing allocations and
>   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>   No significant regressions were noticed in the total runtime.
>   Note that if concurrent flushers compete directly on the spinlock
>   instead of waiting for a completion, this test shows 2x-3x slowdowns.
>   Even though subsequent flushers would have nothing to flush, the
>   serialization and lock contention alone are a major problem. Using a
>   completion for synchronization instead seems to overcome this problem.
> 
> - A synthetic stress test for concurrently reading memcg stats provided
>   by Wei Xu.
>   With 10k threads reading the stats every 100ms:
>   - 98.8% of reads take <100us
>   - 1.09% of reads take 100us to 1ms.
>   - 0.11% of reads take 1ms to 10ms.
>   - Almost no reads take more than 10ms.
>   With 10k threads reading the stats every 10ms:
>   - 82.3% of reads take <100us.
>   - 4.2% of reads take 100us to 1ms.
>   - 4.7% of reads take 1ms to 10ms.
>   - 8.8% of reads take 10ms to 100ms.
>   - Almost no reads take more than 100ms.
> 
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> 
> [weixugc@...gle.com: suggested the fallback logic and bounding the
> number of waiters]
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>

>  	/*
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
>  	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>  		return;
> +	}
>  
> -	WRITE_ONCE(flush_last_time, jiffies_64);
> -
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	/* A coalesced root flush is in order. Are we the designated flusher? */
> +	spin_lock(&root_flusher_lock);

I can't say I'm crazy about this.

Let's say you have a fairly sprawling and active cgroup tree with lots
of updates in lots of groups in the system.

Add a periodic memory.stat reader to one of the subgroups, and that
reader will do very fast, localized aggregation.

Now add a periodic memory.stat reader to some other subgroup. They
might still both do very fast, localized aggregation. Or they might
collide; and now - despite having nothing in common, and sharing no
data besides the rstat lock - will have to flush the entire massive
tree. The rate at which this happens is purely dependent on timing of
what should be independent actors. This is really not great for the
p50 vs p99 gap.

I think this whole thing is getting away from us.

When we first switched to rstat, every reader would take the global
rstat lock and then flush its local subtree of interest.

This was changed in the following commit:

commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
Author: Shakeel Butt <shakeelb@...gle.com>
Date:   Fri Nov 5 13:37:34 2021 -0700

    memcg: unify memcg stat flushing
    
    The memcg stats can be flushed in multiple context and potentially in
    parallel too.  For example multiple parallel user space readers for
    memcg stats will contend on the rstat locks with each other.  There is
    no need for that.  We just need one flusher and everyone else can
    benefit.
    
    In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats") the kernel periodically flush the memcg stats from the root, so,
    the other flushers will potentially have much less work to do.
    
    Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@...gle.com>
    Acked-by: Johannes Weiner <hannes@...xchg.org>
    Cc: Michal Hocko <mhocko@...nel.org>
    Cc: "Michal Koutný" <mkoutny@...e.com>
    Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>

The idea was that we can avoid lock contention if somebody is already
doing the flushing. However, you're now reintroducing global
serialization under contention. Clearly that idea didn't work out. What
would be the obstacle to going back to the original way of doing it?

With one reader, this will work the same as in your proposal.

With two readers, just like in your proposal, flushing must be
serialized against the root level. But at least the two flushes only
aggregate the local data they actually care about - not the entire
tree data that doesn't even have readers! This is much better for lock
contention and performance.

One concern is the thresholding code. The cost of flushing on every
read is too high: even when there is no flush work, checking for it is
kind of expensive due to the locks and the for_each_possible_cpu().

Could we do something like the following?

	mem_cgroup_flush(memcg)
	{
		mutex_lock(&memcg_flush_mutex);
		if (atomic_read(&memcg->stat_delta) > THRESHOLD)
			cgroup_rstat_flush(memcg->css.cgroup);
		mutex_unlock(&memcg_flush_mutex);
	}

	mem_cgroup_css_rstat_flush(css, cpu)
	{
		...
		/*
		 * Reset here instead of in mem_cgroup_flush()
		 * so that each flushed subgroup is reset
		 * recursively. A recent parent flush will
		 * allow a child flush to skip.
		 */
		atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
	}

	memcg_rstat_updated(memcg, val)
	{
		atomic_add(val, &memcg->stat_delta);
	}
