Message-ID: <ZV3Ru1BmHaU_uW7b@tiehlicka>
Date: Wed, 22 Nov 2023 11:02:35 +0100
From: Michal Hocko <mhocko@...e.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
jack@...e.cz, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Christoph Hellwig <hch@....de>, shr@...kernel.io, neilb@...e.de
Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty
thresh
On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
> Hello all,
>
> Sorry to bother you, we encountered a problem related to the memcg dirty
> throttle after migrating from cgroup v1 to v2, so we would like to ask
> for some comments or suggestions.
>
> 1. Problem
>
> We have the "containerd" service running under system.slice, with
> its memory.max set to 5GB. It is constantly throttled in
> balance_dirty_pages() because the memcg has more dirty memory than
> the memcg dirty thresh.
>
> We didn't have this problem on cgroup v1, because cgroup v1 has neither
> per-memcg writeback nor a per-memcg dirty thresh; only the global
> dirty thresh is checked in balance_dirty_pages().
Yes, v1 didn't have any sensible IO throttling and so we had to rely on
an ugly hack to wait for writeback to finish from the memcg memory
reclaim path. This is really suboptimal because it makes memcg reclaim
stalls hard to predict, so it is essentially only a poor man's OOM
prevention. V2, on the other hand, has memcg-aware dirty memory
throttling, which is a much better solution because it throttles at the
moment the memory is being dirtied.
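
To illustrate what I mean (a much simplified, standalone sketch, not the
actual mm/page-writeback.c code; the helper names and the 2GB of dirty
data are made up for this example):

#include <stdbool.h>
#include <stdio.h>

struct memcg {
	unsigned long long max;		/* memory.max in bytes */
	unsigned long long dirty;	/* dirty file pages in bytes */
};

/* roughly "memcg max * vm.dirty_ratio / 100" */
static unsigned long long memcg_dirty_thresh(const struct memcg *m,
					      unsigned int dirty_ratio)
{
	return m->max / 100 * dirty_ratio;
}

static bool writer_gets_throttled(const struct memcg *m,
				  unsigned int dirty_ratio)
{
	return m->dirty > memcg_dirty_thresh(m, dirty_ratio);
}

int main(void)
{
	/* your setup: memory.max = 5GB, vm.dirty_ratio = 20 */
	/* the 2GB of dirty page cache is an assumed number */
	struct memcg containerd = { .max = 5ULL << 30, .dirty = 2ULL << 30 };

	printf("thresh ~%lluMB, throttled: %d\n",
	       memcg_dirty_thresh(&containerd, 20) >> 20,
	       writer_gets_throttled(&containerd, 20));
	return 0;
}

So with memory.max = 5GB and dirty_ratio = 20 the writer starts being
throttled at roughly 1GB of dirty data, and that happens as the pages
are dirtied rather than only at reclaim time.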
Why do you consider that to be a problem? The constant throttling you
describe might simply be a result of the limit being too small.
>
> 2. Thinking
>
> So we wonder if we can support the per-memcg dirty thresh interface?
> Now the memcg dirty thresh is just calculated from memcg max * ratio,
> which can be set from /proc/sys/vm/dirty_ratio.
In general I would recommend using dirty_bytes instead, as the ratio
doesn't scale all that well on larger systems.
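
To put some numbers on that (made-up machine sizes and a simplified
calculation, not the real domain_dirty_limits()):

#include <stdio.h>

static unsigned long long thresh_from_ratio(unsigned long long dirtyable,
					    unsigned int ratio)
{
	return dirtyable / 100 * ratio;
}

int main(void)
{
	unsigned long long small = 16ULL << 30;		/* 16GB of dirtyable memory */
	unsigned long long big = 1024ULL << 30;		/* 1TB of dirtyable memory */
	unsigned long long dirty_bytes = 2ULL << 30;	/* an arbitrary fixed 2GB cap */

	printf("dirty_ratio=20: ~%lluGB vs ~%lluGB\n",
	       thresh_from_ratio(small, 20) >> 30,
	       thresh_from_ratio(big, 20) >> 30);
	printf("dirty_bytes:    %lluGB on either machine\n", dirty_bytes >> 30);
	return 0;
}

The same ratio that looks reasonable on a small machine allows hundreds
of GB of dirty data on a large one, while dirty_bytes stays put.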
> We have to set it to 60 instead of the default 20 as a workaround for
> now, but we worry about the potential side effects.
>
> If we can support the per-memcg dirty thresh interface, we can set
> some containers to a much higher dirty_ratio, especially for heavily
> dirtying workloads like "containerd".
But why would you want that? If you allow heavy writers to dirty a lot
of memory then flushing that to the backing store will take more time.
That could starve small writers as well because they could end up queued
behind a huge amount of data to be flushed.
I am no expert on writeback so others can give you better arguments,
but from my POV dirty data flushing and throttling is mostly a global
mechanism to optimize the IO pattern, and it is much more a function of
the storage than of the specific workload. If your heavy writer hits
throttling too often then either the limit is too low or you should
start background flushing earlier.
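
Something along these lines (illustrative numbers only, not the kernel
code) shows how lowering the background threshold, e.g. via
vm.dirty_background_ratio or vm.dirty_background_bytes, gives the
flusher a head start before the writer reaches the throttle point:

#include <stdio.h>

int main(void)
{
	unsigned long long dirtyable = 5ULL << 30;	/* e.g. memory.max = 5GB */
	unsigned int dirty_ratio = 20;			/* where writers get throttled */
	unsigned int bg_ratios[] = { 10, 5, 1 };	/* candidate background ratios */
	unsigned long long thresh = dirtyable / 100 * dirty_ratio;

	for (unsigned int i = 0; i < sizeof(bg_ratios) / sizeof(bg_ratios[0]); i++) {
		unsigned long long bg_thresh = dirtyable / 100 * bg_ratios[i];

		printf("bg_ratio %2u: flusher wakes at ~%4lluMB, throttling at ~%4lluMB\n",
		       bg_ratios[i], bg_thresh >> 20, thresh >> 20);
	}
	return 0;
}

The earlier the flusher starts, the less likely the writer is to pile up
enough dirty data to ever hit the hard threshold.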
--
Michal Hocko
SUSE Labs