linux-kernel - Re: [PATCH] cgroup/cpuset: Optimize update_tasks

Message-ID: <4de8821b-e0c0-bf63-4d76-b0ce208cce3b@shopee.com>
Date:   Thu, 24 Nov 2022 11:33:39 +0800
From:   Haifeng Xu <haifeng.xu@...pee.com>
To:     Waiman Long <longman@...hat.com>
Cc:     lizefan.x@...edance.com, tj@...nel.org, hannes@...xchg.org,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()



On 2022/11/24 04:23, Waiman Long wrote:
> On 11/23/22 03:21, haifeng.xu wrote:
>> Changing 'cpuset.mems' under some cgroup can hang the system
>> for a long time. According to dmesg, many processes or threads are
>> stuck in fork/exit. The reason is shown below.
>>
>> thread A:
>> cpuset_write_resmask /* takes cpuset_rwsem */
>>    ...
>>      update_tasks_nodemask
>>        mpol_rebind_mm /* waits mmap_lock */
>>
>> thread B:
>> worker_thread
>>    ...
>>      cpuset_migrate_mm_workfn
>>        do_migrate_pages /* takes mmap_lock */
>>
>> thread C:
>> cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
>>    ...
>>      cpuset_can_attach
>>        percpu_down_write /* waits cpuset_rwsem */
>>
>> Once the nodemask of the cpuset is updated, thread A wakes up thread B
>> to migrate the mm. But when thread A iterates through all tasks,
>> including child threads and the group leader, it has to wait for the
>> mmap_lock, which has been taken by thread B. Unfortunately, thread C
>> wants to migrate tasks into the cgroup at this moment, so it must wait
>> for thread A to release cpuset_rwsem. If thread B spends a long time
>> migrating the mm, fork/exit, which acquire cgroup_threadgroup_rwsem,
>> also have to wait for a long time.
>>
>> There is no need to migrate the mm of child threads, which is
>> shared with the group leader. Handle the mm only once, through
>> the group leader.
>>
>> Signed-off-by: haifeng.xu <haifeng.xu@...pee.com>
>> ---
>>   kernel/cgroup/cpuset.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 589827ccda8b..43cbd09546d0 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset *cs)
>>           cpuset_change_task_nodemask(task, &newmems);
>>
>> +        if (!thread_group_leader(task))
>> +            continue;
>> +
>>           mm = get_task_mm(task);
>>           if (!mm)
>>               continue;
> 
> Could you try the attached test patch to see if it can fix your problem?
> Something along the line of this patch will be more acceptable.
> 
> Thanks,
> Longman
> 

Hi, Longman.
Thanks for your patch, but there are still some problems.

1)
  (group leader, node: 0,1)
         cgroup0
         /     \
        /       \
    cgroup1   cgroup2
   (threads)  (threads)

If node 0 is set in cgroup1 and node 1 in cgroup2, both of them will
update the mm, and the resulting nodemask of the mm depends on which one
was set last.

2)
   (process, node: 0,1)
         cgroup0
         /     \
        /       \
    cgroup1   cgroup2
   (node: 0)  (node: 1)

If a thread is migrated from cgroup0 to cgroup1 or cgroup2, cpuset_attach
won't update the mm. So the nodemask of the thread, including mems_allowed
and mempolicy (updated in cpuset_change_task_nodemask), differs from
the vm_policy in the vma (updated in mpol_rebind_mm).


In short, if threads of one process are in different cpusets with
different nodemasks, memory behavior will be inconsistent.
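
To make the two cases concrete, here is a small userspace model, not
kernel code. All names (model_nodemask, model_mm, model_task,
model_write_mems, model_attach) are made up for illustration; the model
only assumes what is described above: per-task state is updated for
every thread, while the mm is shared, and on attach the mm is updated
only for the group leader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types, not kernel structures; bit n = NUMA node n. */
typedef uint32_t model_nodemask;

struct model_mm {
	/* Stands in for the vma policies rebound by mpol_rebind_mm. */
	model_nodemask vma_nodes;
};

struct model_task {
	/* Stands in for state set by cpuset_change_task_nodemask. */
	model_nodemask mems_allowed;
	bool is_leader;
	struct model_mm *mm;	/* shared by all threads of the process */
};

/* Case 1: writing cpuset.mems also rebinds the shared mm. */
static void model_write_mems(struct model_task *t, model_nodemask mems)
{
	assert(t && t->mm);
	t->mems_allowed = mems;
	t->mm->vma_nodes = mems;	/* the last writer wins */
}

/* Case 2: attach updates the task; the mm only for the group leader. */
static void model_attach(struct model_task *t, model_nodemask mems)
{
	assert(t && t->mm);
	t->mems_allowed = mems;
	if (t->is_leader)
		t->mm->vma_nodes = mems;
}

/* Returns the mm nodemask after both child cgroups have been written. */
static model_nodemask model_case1(bool cgroup1_first)
{
	struct model_mm mm = { .vma_nodes = 0x3 };
	struct model_task in_cg1 = { 0x3, false, &mm };
	struct model_task in_cg2 = { 0x3, false, &mm };

	if (cgroup1_first) {
		model_write_mems(&in_cg1, 0x1);	/* node 0 */
		model_write_mems(&in_cg2, 0x2);	/* node 1 */
	} else {
		model_write_mems(&in_cg2, 0x2);
		model_write_mems(&in_cg1, 0x1);
	}
	return mm.vma_nodes;
}

/* Returns true if the thread's nodemask still matches the mm. */
static bool model_case2_consistent(void)
{
	struct model_mm mm = { .vma_nodes = 0x3 };
	struct model_task thread = { 0x3, false, &mm };

	model_attach(&thread, 0x1);	/* move the thread to cgroup1 */
	return thread.mems_allowed == mm.vma_nodes;
}
```

In this model, model_case1() returns a different mask depending on the
write order, and model_case2_consistent() returns false because the
thread is restricted to node 0 while the shared mm still covers nodes
0 and 1.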

Thanks,
Haifeng.