Message-ID: <20230914225844.woz7mke6vnmwijh7@google.com>
Date: Thu, 14 Sep 2023 22:58:44 +0000
From: Shakeel Butt <shakeelb@...gle.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Ivan Babrou <ivan@...udflare.com>, Tejun Heo <tj@...nel.org>,
"Michal Koutný" <mkoutny@...e.com>,
Waiman Long <longman@...hat.com>, kernel-team@...udflare.com,
Wei Xu <weixugc@...gle.com>, Greg Thelen <gthelen@...gle.com>,
linux-mm@...ck.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
[...]
> >
> > 1. How much delayed/stale stats have you observed on real world workload?
>
> I am not really sure. We don't have a wide deployment of kernels with
> rstat yet. These are problems observed in testing and/or concerns
> expressed by our userspace team.
>
Why is sleep(2) not good enough for the tests?
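For illustration, something along these lines (a rough sketch, not taken
from the actual selftests; the cgroup path is made up) seems sufficient
for a test that just needs fresh numbers:

/* Hypothetical test sketch: let the periodic flush catch up before
 * asserting on memory.stat contents.
 */
#include <stdio.h>
#include <unistd.h>

static void dump_memory_stat(const char *path)
{
	char buf[4096];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);
}

int main(void)
{
	/* ... test does its allocations/frees here ... */

	/* The periodic flusher runs every ~2 seconds, so sleeping past
	 * it should make the stats read below reasonably fresh.
	 */
	sleep(2);
	dump_memory_stat("/sys/fs/cgroup/test/memory.stat");
	return 0;
}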
> I am trying to solve this now because any problems that result from
> this staleness will be very hard to debug and link back to stale
> stats.
>
I think you first need to show whether this (2 sec stale stats) is really
a problem.
> >
> > 2. What is acceptable staleness in the stats for your use-case?
>
> Again, unfortunately I am not sure, but right now it can be O(seconds)
> which is not acceptable as we have workloads querying the stats every
> 1s (and sometimes more frequently).
>
It is 2 seconds in most cases, and if it is higher, the system is already
in bad shape. Calling it O(seconds) makes it sound more dramatic than it
is. So, why is 2 seconds of staleness not acceptable? Is 1 second
acceptable? Or 500 msec? Let's look at the use-cases below.
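For context, that bound comes from the periodic flush worker. Roughly
the following, paraphrased from mm/memcontrol.c with approximate names,
so treat it as a sketch rather than the exact code:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Stats are force-flushed every FLUSH_TIME, so in the common case a
 * reader of memory.stat sees data that is at most ~2 seconds old.
 */
#define FLUSH_TIME (2UL * HZ)

static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);

static void flush_memcg_stats_dwork(struct work_struct *w)
{
	/* the actual rstat flush done by memcontrol internals */
	__mem_cgroup_flush_stats();
	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}

So the staleness window is bounded by FLUSH_TIME, not open-ended.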
> >
> > 3. What is your use-case?
>
> A few use cases we have that may be affected by this:
> - System overhead: calculations using memory.usage and some stats from
> memory.stat. If one of them is fresh and the other one isn't we have
> an inconsistent view of the system.
> - Userspace OOM killing: We use some stats in memory.stat to gauge the
> amount of memory that will be freed by killing a task as sometimes
> memory.usage includes shared resources that wouldn't be freed anyway.
> - Proactive reclaim: we read memory.stat in a proactive reclaim
> feedback loop, stale stats may cause us to mistakenly think reclaim is
> ineffective and prematurely stop.
>
I don't see why userspace OOM killing and proactive reclaim need
subsecond accuracy. Please explain. Same for system overhead, but I can
see the complication of two different sources for the stats. Can you
provide the formula for system overhead? I am wondering why you need to
read stats from memory.stat files at all. Wouldn't memory.current of the
top-level cgroups plus /proc/meminfo be enough? Something like:

Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
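To make that concrete, here is a rough userspace sketch of that formula.
The cgroup2 mount point, the unit handling, and the helper names
(meminfo_kb, top_cgroups_bytes) are my assumptions for illustration:

/* Hypothetical sketch of:
 *   Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
 * Assumes cgroup2 is mounted at /sys/fs/cgroup.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static long long meminfo_kb(const char *key)
{
	char line[256];
	long long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			sscanf(line + strlen(key), " %lld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

static long long top_cgroups_bytes(void)
{
	char path[512];
	long long total = 0, cur;
	struct dirent *de;
	DIR *d = opendir("/sys/fs/cgroup");

	if (!d)
		return -1;
	while ((de = readdir(d))) {
		FILE *f;

		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path),
			 "/sys/fs/cgroup/%s/memory.current", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* not a top-level cgroup directory */
		if (fscanf(f, "%lld", &cur) == 1)
			total += cur;
		fclose(f);
	}
	closedir(d);
	return total;
}

int main(void)
{
	long long total_kb = meminfo_kb("MemTotal:");
	long long free_kb = meminfo_kb("MemFree:");
	long long cgroups = top_cgroups_bytes();

	if (total_kb < 0 || free_kb < 0 || cgroups < 0)
		return 1;
	printf("overhead: %lld kB\n", total_kb - free_kb - cgroups / 1024);
	return 0;
}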
> >
> > I know I am going back on some of the previous agreements but this
> > whole locking back and forth has made me question the original
> > motivation.
>
> That's okay. Taking a step back, having flushing be indeterministic
I would say at most 2 seconds stale rather than indeterministic.
> in this way is a time bomb in my opinion. Note that this also affects
> in-kernel flushers like reclaim or dirty isolation
Fix the in-kernel flushers separately. Also, the problem Cloudflare is
facing does not need to be tied to this.