Message-ID: <80d76bad-41d8-4108-ad74-f891e5180e47@huawei.com>
Date: Thu, 31 Oct 2024 17:37:12 +0300
From: Stepanov Anatoly <stepanov.anatoly@...wei.com>
To: Michal Hocko <mhocko@...e.com>
CC: Gutierrez Asier <gutierrez.asier@...wei-partners.com>,
<akpm@...ux-foundation.org>, <david@...hat.com>, <ryan.roberts@....com>,
<baohua@...nel.org>, <willy@...radead.org>, <peterx@...hat.com>,
<hannes@...xchg.org>, <hocko@...nel.org>, <roman.gushchin@...ux.dev>,
<shakeel.butt@...ux.dev>, <muchun.song@...ux.dev>, <cgroups@...r.kernel.org>,
<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
<alexander.kozhevnikov@...wei-partners.com>, <guohanjun@...wei.com>,
<weiyongjun1@...wei.com>, <wangkefeng.wang@...wei.com>,
<judy.chenhui@...wei.com>, <yusongping@...wei.com>, <artem.kuzin@...wei.com>,
<kang.sun@...wei.com>, <nikita.panov@...wei-partners.com>
Subject: Re: [RFC PATCH 0/3] Cgroup-based THP control
On 10/31/2024 11:33 AM, Michal Hocko wrote:
> On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
> [...]
>> As prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
>> it needs app. developer participation anyway.
>> In theory, a kind of launcher process can be used, to utilize the inheritance
>> of the corresponding prctl THP setting, but this does not seem transparent
>> for user space.
>
> No, this is not in theory. This is a very common usage pattern to allow
> changing the behavior for the target application transparently.
>
>> And what if we'd like to enable THP for a specific set of unrelated (in terms of parent-child)
>> tasks?
>
> This is what I've had in mind. Currently we only have THP disable
> option. If we really need an override to enforce THP on an application
> then this could be a more viable path.
>
>> IMHO, an alternative approach would be changing the per-process THP mode by PID,
>> thus also avoiding any changes to the user application.
>
> We already have process_madvise. MADV_HUGEPAGE resp. MADV_COLLAPSE are
> not supported but we can discuss that option of course. This interface
> requires much more orchestration of course because it is VMA range
> based.
>
If we consider the inheritance approach (prctl + launcher), it works fine until we need to change
the THP mode for several tasks at once; in that case some batch-change mechanism is needed.
If, for example, process_madvise() supported recursive per-task logic, coupled with something like
MADV_HUGE + *ITERATE_ALL_VMA*, it would be helpful.
In that case, the orchestration would be much easier.
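Just to illustrate what "much more orchestration" means today, below is a rough userspace
sketch driving a set of PIDs, under the assumption that process_madvise() accepted
MADV_HUGEPAGE over whole-VMA ranges (which, as noted above, it currently does not);
filtering of the VMAs and chunking of the iovec batch are omitted:

/* Hypothetical batch tool: "enable THP for these PIDs".
 * Sketch only: process_madvise() does not accept MADV_HUGEPAGE today,
 * so this just shows the orchestration burden left to userspace. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int thp_enable_for_pid(pid_t pid)
{
        char path[64], line[256];
        struct iovec iov[256];
        int nvec = 0, pidfd;
        FILE *maps;

        pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0)
                return -1;

        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
        maps = fopen(path, "r");
        if (!maps) {
                close(pidfd);
                return -1;
        }

        /* Collect the target's VMAs; a real tool would filter them
         * and split the batch into chunks of at most UIO_MAXIOV. */
        while (fgets(line, sizeof(line), maps) && nvec < 256) {
                unsigned long start, end;

                if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                        continue;
                iov[nvec].iov_base = (void *)start;
                iov[nvec].iov_len = end - start;
                nvec++;
        }
        fclose(maps);

        /* This per-VMA iteration from userspace is exactly the part
         * a recursive / "all VMAs" mode could hide in the kernel. */
        if (syscall(SYS_process_madvise, pidfd, iov, nvec,
                    MADV_HUGEPAGE, 0) < 0) {
                close(pidfd);
                return -1;
        }

        close(pidfd);
        return 0;
}

int main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++)
                thp_enable_for_pid((pid_t)atoi(argv[i]));
        return 0;
}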
>>> You have not really answered a more fundamental question though. Why should the
>>> THP behavior be at the cgroup scope? From a practical POV that
>>> would represent containers which are a mixed bag of applications to
>>> support the workload. Why does the same THP policy apply to all of them?
>>
>> For THP there are 3 possible levels of fine-grained control:
>> - global THP
>> - THP per group of processes
>> - THP per process
>>
>> I agree that, in a container, different apps might have different
>> THP requirements.
>> But it also depends on many factors, such as:
>> the container "size" (tiny/huge container) and the diversity of apps/functions inside a container.
>> I mean, for some cases we might not need to go below the "per-group" level in terms of THP control.
>
> I am sorry but I do not really see any argument why this should be
> per-memcg. Quite the contrary: having that per memcg seems more muddy.
>
>>> Doesn't this make the sub-optimal global behavior the same on the cgroup
>>> level when some parts will benefit while others will not?
>>>
>>
>> I think the key idea for the sub-optimal behavior is "predictability",
>> so we know for sure which apps/services would consume THPs.
>
> OK, that seems fair.
>
>> We observed significant THP usage on an almost idle Ubuntu server running a simple test
>> (some random system services consumed a few hundred MB of THPs).
>
> I assume that you are using "always" as the global default configuration,
> right? If that is the case then the high (in fact as high as feasible)
> THP utilization is a real goal. If you want more targeted THP use then
> madvise is what you are looking for. This will not help applications
> which are not THP aware of course, but then we are back to the discussion
> whether the interface should be a) per process, b) per cgroup, or c)
> process_madvise.
>
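For reference, a THP-aware application under the "madvise" global policy opts in
per mapping itself, roughly like this (the 64 MiB region size is just an illustration):

/* A THP-aware application opting in for a single mapping. */
#include <stddef.h>
#include <sys/mman.h>

#define REGION_SZ (64UL << 20)  /* 64 MiB, illustrative */

static void *alloc_thp_backed(void)
{
        void *p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return NULL;

        /* Under the "madvise" global policy only such regions are
         * eligible for THPs; failure here is not fatal, the region
         * simply stays on regular pages. */
        madvise(p, REGION_SZ, MADV_HUGEPAGE);

        return p;
}

int main(void)
{
        return alloc_thp_backed() ? 0 : 1;
}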
>> Of course, on other distros we might have a different situation.
>> But with fine-grained per-group control it's a lot more predictable.
>>
>> Did I get your question right?
>
> Not really, but at least I do understand (hopefully) that you are trying
> to work around THP overuse by changing the global default to be more
> restrictive while allowing some workloads to be less restrictive. The question
> of why pushing that down to the memcg scope makes the situation better is not
> answered AFAICT.
>
Don't get us wrong, we're not trying to push this into memcg specifically.
We're just trying to find a proper/friendly way to control the
THP mode for a group of processes (which can be tasks without a common parent).
Maybe if the process-grouping logic were decoupled from the hierarchical resource-control
logic, it would be possible to gather multiple processes and batch-control some task properties.
But that would require building some kind of task-properties system, where
a given set of properties can be flexibly assigned to one or more tasks.
Anyway, I think we are going to try the alternative
approaches first (prctl, process_madvise).
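For instance, the prctl path (which today only offers the "disable" direction) would be
a small launcher like the one below; the program name and arguments are placeholders, and
as discussed above its limitation is that it only covers task trees started through it:

/* thp-off launcher: disable THP for the launched program and for
 * everything it forks, relying on PR_SET_THP_DISABLE being
 * inherited across fork()/execve(). */
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
                return 1;
        }

        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)) {
                perror("prctl(PR_SET_THP_DISABLE)");
                return 1;
        }

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}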
> [...]
>>> So if the parent decides that none of the children should be using THP,
>>> they can override that, so the tuning at the parent has no imperative
>>> control. This breaks the hierarchical property that is expected from
>>> cgroup control files.
>>
>> Actually, I think we can solve this.
>> As we mostly need just a single child level,
>> the "flat" case (root->child) is enough, interpreting the root-memcg THP mode as the "global THP setting",
>> where sub-children are forbidden to override an inherited THP mode.
>
> This reduced case is not really sufficient to justify the non-hierarchical
> semantics, I am afraid. There must be a _really_ strong case
> to break this property, and even then I am rather skeptical, to be honest.
> We have been burnt by introducing stuff like memcg.swappiness that
> seemed like a good idea initially but backfired with unexpected behavior
> for many users.
>
--
Anatoly Stepanov, Huawei