Message-ID: <aca14cf3-28d3-4c5a-82ab-a4607173dbee@linux.dev>
Date:   Wed, 22 Nov 2023 22:59:02 +0800
From:   Chengming Zhou <chengming.zhou@...ux.dev>
To:     Michal Hocko <mhocko@...e.com>
Cc:     LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
        jack@...e.cz, Tejun Heo <tj@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Christoph Hellwig <hch@....de>, shr@...kernel.io, neilb@...e.de
Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty
 thresh

On 2023/11/22 18:02, Michal Hocko wrote:
> On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
>> Hello all,
>>
>> Sorry to bother you, but we encountered a problem related to the memcg
>> dirty throttle after migrating from cgroup v1 to v2, so we want to ask
>> for some comments or suggestions.
>>
>> 1. Problem
>>
>> We have the "containerd" service running under system.slice, with
>> its memory.max set to 5GB. It is constantly throttled in
>> balance_dirty_pages() since the memcg has more dirty memory than
>> its memcg dirty thresh.
>>
>> We didn't have this problem on cgroup v1, because cgroup v1 doesn't have
>> per-memcg writeback or a per-memcg dirty thresh. Only the global
>> dirty thresh is checked in balance_dirty_pages().
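
(For context, this is how we observe it; the cgroup path below just
reflects our layout, so treat it as an example:)

  $ cat /sys/fs/cgroup/system.slice/containerd.service/memory.max
  $ grep -E 'file_dirty|file_writeback' \
        /sys/fs/cgroup/system.slice/containerd.service/memory.stat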
> 
> Yes, v1 didn't have any sensible IO throttling and so we had to rely on
> an ugly hack to wait for writeback to finish from the memcg memory reclaim
> path. This is really suboptimal because it makes memcg reclaim stalls
> hard to predict. So it is essentially only a poor man's OOM prevention.
> 
> V2 on the other hand has memcg aware dirty memory throttling which is a
> much better solution as it throttles at the moment when the memory is
> being dirtied.
> 
> Why do you consider that to be a problem? Constant throttling, as you
> describe, might be a result of the limit being too small?

Right, v2 does a better job of limiting the dirty memory in one memcg, which
helps the memcg reclaim path.

The problem we encountered is that the global dirty_ratio (20%) is too small
for some cgroup workloads. For example, when "containerd" is preparing a big
image file, we want most of its memory.max (5GB in our case) to be allowed
to become dirty, to speed up the process.
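
To put rough numbers on it, using the simple "memcg max * ratio"
approximation mentioned below (the real kernel math also factors in the
memcg's currently available memory, so this is only back-of-the-envelope):

  # vm.dirty_ratio=20, memory.max=5G:
  #   ~5G * 20% ~= 1G of dirty page cache before throttling kicks in
  # vm.dirty_ratio=60 (our current workaround):
  #   ~5G * 60% ~= 3G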

And yes, this may back up more dirty pages in that memcg, and the longer
writeback IO may interfere with other memcgs' writeback IO. But we also have
per-blkcg IO throttling, so maybe it's not that bad?
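
For example, something like this (the device number and the 100MB/s value
are made up for illustration, assuming the v2 io controller is enabled for
the slice):

  # cap write bandwidth of the containerd cgroup on one backing device
  $ echo "259:0 wbps=104857600" > \
        /sys/fs/cgroup/system.slice/containerd.service/io.max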

For now we have to raise the global dirty_ratio to achieve this, but that is
not good for every memcg workload, and could hurt the memcg reclaim path, as
you noted.
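
For reference, the only knobs we have today are the global ones (the
dirty_ratio value is what we use now; the dirty_bytes value is just an
illustration):

  # raise the global ratio for everyone
  $ sysctl -w vm.dirty_ratio=60
  # or an absolute limit, as suggested below (dirty_bytes and dirty_ratio
  # are mutually exclusive; setting one clears the other)
  $ sysctl -w vm.dirty_bytes=3221225472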

> 
>>
>> 2. Thinking
>>
>> So we wonder if we could support a per-memcg dirty thresh interface?
>> Currently the memcg dirty thresh is just calculated as memcg max * ratio,
>> where the ratio comes from /proc/sys/vm/dirty_ratio.
> 
> In general I would recommend using dirty_bytes instead, as the ratio
> doesn't scale all that well on larger systems.
>  
>> We have to set it to 60 instead of the default 20 as a workaround for now,
>> but we worry about the potential side effects.
>>
>> If we could support a per-memcg dirty thresh interface, we could set
>> a much higher dirty_ratio for some containers, especially for heavy
>> dirtier workloads like "containerd".
> 
> But why would you want that? If you allow heavy writers to dirty a lot
> of memory then flushing that to the backing store will take more time.
> That could starve small writers as well because they could end up queued
> behind a huge amount of data to be flushed.
> 

Yes, we also need per-blkcg IO throttling to distribute writeback IO bandwidth.

> I am no expert on writeback so others could give you better
> arguments, but from my POV the dirty data flushing and throttling is
> mostly a global mechanism to optimize the IO pattern, and is a function
> of the storage much more than of the specific workload. If your heavy
> writer hits

Maybe the per-bdi ratios are worth trying instead of the global dirty_ratio,
which affects all devices.
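
Something like this, per backing device (the device name is just an
example; both knobs are percentages of the global dirty threshold):

  # reserve a minimum share of the global dirty threshold for this bdi
  $ echo 10 > /sys/class/bdi/259:0/min_ratio
  # and/or cap the share this bdi may consume
  $ echo 60 > /sys/class/bdi/259:0/max_ratio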

> throttling too much then either the limit is too low or you should start
> background flushing earlier.
> 

The global dirty_ratio is too low for "containerd" in this case, so we
want more control over the memcg dirty_ratio.
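
Just to sketch the kind of interface we have in mind (nothing like this
exists today; the file name "memory.dirty_ratio" is purely hypothetical):

  # hypothetical per-memcg knob that would override the global
  # vm.dirty_ratio for this cgroup only
  $ echo 60 > \
        /sys/fs/cgroup/system.slice/containerd.service/memory.dirty_ratio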

Thanks!
