linux-kernel - Re: [PATCH v2] mm: oom: introduce cpuset oom

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <342c1967-8a68-275c-042e-765d5993157c@redhat.com>
Date:   Tue, 4 Apr 2023 13:24:03 -0400
From:   Waiman Long <longman@...hat.com>
To:     Michal Hocko <mhocko@...e.com>, Gang Li <ligang.bdlg@...edance.com>
Cc:     rientjes@...gle.com, linux-kernel@...r.kernel.org,
        Zefan Li <lizefan.x@...edance.com>, cgroups@...r.kernel.org
Subject: Re: [PATCH v2] mm: oom: introduce cpuset oom

On 4/4/23 10:31, Michal Hocko wrote:
> [CC cpuset people]
>
> On Tue 04-04-23 19:55:09, Gang Li wrote:
>> When a process in cpuset triggers oom, it may kill a completely
>> irrelevant process on another numa node, which will not release any
>> memory for this cpuset.
>>
>> It seems that `CONSTRAINT_CPUSET` is not really doing much these
>> days. Using CONSTRAINT_CPUSET, we can easily achieve node aware oom
>> by selecting victim from all cpusets with the same mems_allowed as
>> the current cpuset.
> This should go into more details about the usecase, testing and ideally
> also spend couple of words about how CONSTRAINT_CPUSET is actually
> implemented because this is not really immediately obvious. An example
> of before/after behavior would have been really nice as well.
>
> You should also go into much more details about how oom victims are
> actually evaluated.
>
>> Suggested-by: Michal Hocko <mhocko@...e.com>
>> Signed-off-by: Gang Li <ligang.bdlg@...edance.com>
>> ---
>> Changes in v2:
>> - Select victim from all cpusets with the same mems_allowed as the current cpuset.
>>    (David Rientjes <rientjes@...gle.com>)
>>
>> v1:
>> - https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
>>
>> ---
>>   include/linux/cpuset.h |  6 ++++++
>>   kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
>>   mm/oom_kill.c          |  4 ++++
>>   3 files changed, 38 insertions(+)
> As this is a userspace visible change it should also be documented
> somewhere  in Documentation.
>
> I am not really familiar with cpusets internals so I cannot really judge
> cpuset_cgroup_scan_tasks implementation.
>
> The oom report should be explicit about this being CPUSET specific oom
> handling so unexpected behavior could be nailed down to this change so I
> do not see a major concern from the oom POV. Nevertheless it would be
> still good to consider whether this should be an opt-in behavior. I
> personally do not see a major problem because most cpuset deployments I
> have seen tend to be well partitioned so the new behavior makes more
> sense.
>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 980b76a1237e..fc244141bd52 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>>   	task_unlock(current);
>>   }
>>   
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
>> +
>>   #else /* !CONFIG_CPUSETS */
>>   
>>   static inline bool cpusets_enabled(void) { return false; }
>> @@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>>   	return false;
>>   }
>>   
>> +static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> +	return 0;
>> +}
>>   #endif /* !CONFIG_CPUSETS */
>>   
>>   #endif /* _LINUX_CPUSET_H */
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index bc4dcfd7bee5..b009c98ca19e 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4013,6 +4013,34 @@ void cpuset_print_current_mems_allowed(void)
>>   	rcu_read_unlock();
>>   }
>>   
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> +	int ret = 0;
>> +	struct css_task_iter it;
>> +	struct task_struct *task;
>> +	struct cpuset *cs;
>> +	struct cgroup_subsys_state *pos_css;
>> +
>> +	/*
>> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
>> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
>> +	 *
>> +	 * But for now, let's make it simple. Just iterate through all cpusets
>> +	 * with the same mems_allowed as the current cpuset.
>> +	 */
>> +	rcu_read_lock();
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}

You will also need to take cpuset_rwsem to make sure that cpusets are 
stable. BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant. I 
will suggest you just name it as cpuset_scan_tasks(). Please also add a 
doctext comment about its purpose and how it should be used.

Cheers,
Longman