Message-ID: <8785134d-3012-42c1-a67c-b64862d89fc5@redhat.com>
Date: Thu, 30 Jan 2025 12:41:19 -0500
From: Waiman Long <llong@...hat.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>, Waiman Long <llong@...hat.com>
Cc: Roman Gushchin <roman.gushchin@...ux.dev>, Michal Hocko
 <mhocko@...e.com>, Tejun Heo <tj@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, Jonathan Corbet <corbet@....net>,
 Muchun Song <muchun.song@...ux.dev>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 cgroups@...r.kernel.org, linux-mm@...ck.org, linux-doc@...r.kernel.org,
 Peter Hunt <pehunt@...hat.com>
Subject: Re: [RFC PATCH] mm, memcg: introduce memory.high.throttle

On 1/30/25 12:32 PM, Shakeel Butt wrote:
> On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote:
>> On 1/30/25 12:05 PM, Roman Gushchin wrote:
>>> On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
>>>> On 1/30/25 3:15 AM, Michal Hocko wrote:
>>>>> On Wed 29-01-25 14:12:04, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling has
>>>>>> increased substantially. As a result, it can be difficult for a
>>>>>> misbehaving application that consumes an increasing amount of memory
>>>>>> to be OOM-killed if memory.high is set. Instead, the application may
>>>>>> just crawl along, holding close to the allowed memory.high amount for
>>>>>> its memory cgroup for a very long time, especially if it does a lot
>>>>>> of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitate to
>>>>>> use memory.high. Instead, they use only memory.max for memory control
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>> Why is this a problem for them?
>>>> My understanding is that a misbehaving container will hold on to
>>>> memory.high worth of memory for a long time instead of being OOM-killed
>>>> sooner, freeing that memory for more productive use elsewhere.
>>>>>> To allow better control of the amount of throttling, and hence the
>>>>>> speed at which a misbehaving task can be OOM-killed, a new single-value
>>>>>> memory.high.throttle control file is now added. The allowable range
>>>>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>>>>>> like before. Any non-zero positive value represents the corresponding
>>>>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
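To make the quoted semantics concrete, here is a sketch of the arithmetic only (the base delay is a made-up number for illustration, and the sysfs path assumes a cgroup named "mygroup"; none of this is taken from kernel code):

```shell
# Each increment of memory.high.throttle halves the throttling penalty,
# per the RFC text above.  base_delay_ms is purely illustrative.
base_delay_ms=2000
for t in 0 1 5 10; do
    echo "memory.high.throttle=$t -> delay=$(( base_delay_ms >> t )) ms"
done
# Setting the knob on a cgroup would presumably look like:
#   echo 5 > /sys/fs/cgroup/mygroup/memory.high.throttle
```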
>>>>> I do not like the interface to be honest. It exposes an implementation
>>>>> detail and casts it into a user API. If we ever need to change the way
>>>>> the throttling is implemented, this will stand in the way because
>>>>> there will be applications depending on a behavior they were carefully
>>>>> tuned to.
>>>>>
>>>>> It is also not entirely clear how this is supposed to be used in
>>>>> practice. How do people know what kind of value they should use?
>>>> Yes, I agree that a user may need to run some trial runs to find a proper
>>>> value. Perhaps a simpler binary interface of "off" and "on" may be easier to
>>>> understand and use.
>>>>>> System administrators can now use this parameter to determine how
>>>>>> easily they want OOM kills to happen for applications that tend to
>>>>>> consume a lot of memory, without the need to run a special userspace
>>>>>> memory management tool to monitor memory consumption when memory.high
>>>>>> is set.
>>>>> Why cannot they achieve the same with the existing events/metrics we
>>>>> already do provide? Most notably PSI which is properly accounted when
>>>>> a task is throttled due to memory.high throttling.
>>>> That would require a userspace management agent that looks for
>>>> these stalling conditions and makes the kill, if necessary. There are
>>>> certainly users out there who want some of the benefits of memory.high,
>>>> like early memory reclaim, without the trouble of handling these kinds
>>>> of stalling conditions.
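For reference, the PSI metrics Michal mentions are exposed per-cgroup in memory.pressure. A minimal sketch of what such a monitoring agent would parse (the sample line, the 10% threshold, and the cgroup path are all illustrative; the "some avg10=... avg60=... avg300=... total=..." line format is the documented PSI format):

```shell
# In practice the agent would read the live file, e.g.:
#   psi_line=$(grep '^some' /sys/fs/cgroup/mygroup/memory.pressure)
psi_line='some avg10=42.50 avg60=30.10 avg300=12.00 total=123456'

# Extract the 10-second stall average.
avg10=$(printf '%s\n' "$psi_line" | sed -n 's/^some avg10=\([0-9.]*\).*/\1/p')

# Act when stalls exceed a (made-up) threshold of 10%.
if awk -v a="$avg10" 'BEGIN { exit !(a > 10) }'; then
    echo "avg10=$avg10 exceeds threshold: agent would kill the cgroup"
fi
```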
>>> So you basically want to force the workload into some sort of a proactive
>>> reclaim but without an artificial slow down?
> I wouldn't call it proactive reclaim, as reclaim will happen
> synchronously in the allocating thread.
>
>>> It makes some sense to me, but
>>> 1) Idk if it deserves a new API, because it can be relatively easily
>>>     implemented in userspace by a daemon which monitors cgroup usage and
>>>     reclaims the memory if necessary. No kernel changes are needed.
>>> 2) If new API is introduced, I think it's better to introduce a new limit,
>>>     e.g. memory.target, keeping memory.high semantics intact.
>> Yes, you are right about that. Introducing a new "memory.target" without
>> disturbing the existing "memory.high" semantics will work for me too.
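The userspace daemon Roman describes can be sketched with the existing cgroup v2 files memory.current and memory.reclaim (both real interfaces); the cgroup path, the target value, and the polling interval below are assumptions for illustration:

```shell
# Reclaim whatever usage exceeds a target, best-effort: the kernel may
# reclaim less than requested, in which case we simply give up until
# the next poll (matching the "just give up" semantics discussed here).
reclaim_excess() {              # $1 = cgroup dir, $2 = target in bytes
    current=$(cat "$1/memory.current")
    excess=$(( current - $2 ))
    if [ "$excess" -gt 0 ]; then
        echo "$excess" > "$1/memory.reclaim"
    fi
}

# A daemon would call this in a loop, e.g.:
#   while true; do
#       reclaim_excess /sys/fs/cgroup/mygroup $(( 512 * 1024 * 1024 ))
#       sleep 1
#   done
```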
>>
> So, what happens if reclaim can not reduce usage below memory.target?
> Infinite reclaim cycles or just give up?

Just give up in this case. It is used mainly to reduce the chance of
reaching memory.max and causing an OOM kill.

Cheers,
Longman

