Date:   Mon, 17 Jan 2022 22:50:55 +0800
From:   Yihao Wu <wuyihao@...ux.alibaba.com>
To:     Valentin Schneider <valentin.schneider@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Shanpei Chen <shanpeic@...ux.alibaba.com>,
        王贇 <yun.wang@...ux.alibaba.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: Again ignore percpu threads for imbalance
 pulls

Thanks a lot for the help, Valentin and Peter!

On 2021/12/17 2:26am, Valentin Schneider wrote:
> On 11/12/21 17:48, Yihao Wu wrote:
>> commit 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance
>> pulls") was meant to fix a performance issue where load balance tries
>> to migrate pinned kernel threads at the MC domain level, which is
>> destined to fail.
> 
>> After it fails, it further messes up wakeup balance at the NUMA domain
>> level. The most severe case that I noticed, and which occurs frequently:
>>      |sum_nr_running(node1) - sum_nr_running(node2)| > 100
>>
> 
> Wakeup balance (aka find_idlest_cpu()) is different from periodic load
> balance (aka load_balance()) and doesn't use can_migrate_task(), so the
> incriminated commit shouldn't have impacted it (at least not in obvious
> ways...). Do you have any more details on that issue?

The original bugfix concerns only load balance, but I found that wakeup
balance is impacted too, after observing a regression in the lmbench3
test suite. This is how it is impacted:

- Periodic load balance
  - kthread_is_per_cpu? No
  - env->flags |= LBF_SOME_PINNED
  - sd_parent..imbalance is set to 1 because of LBF_SOME_PINNED

So far this is exactly what Chandrasekhar describes in 2f5f4cce496e.
Then this imbalance flag connects periodic and wakeup balance.
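
For reference, the piece of load_balance() that sets the flag looks
roughly like this (paraphrased from kernel/sched/fair.c from memory, so
the exact code may differ by version):

	/* We failed to reach balance because of affinity. */
	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}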

- Wakeup balance (find_idlest_group)
  - update_sg_wakeup_stats() classifies local_sgs as group_imbalanced
  - find_idlest_group() chooses another NUMA node (see the sketch below)
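
In code terms, the wakeup side reads that flag back roughly like this
(again paraphrased from kernel/sched/fair.c, so details may differ):

	/* update_sg_wakeup_stats() -> group_classify() -> sg_imbalanced() */
	static inline int sg_imbalanced(struct sched_group *group)
	{
		return group->sgc->imbalance;
	}

	/*
	 * find_idlest_group(): a local group classified as group_imbalanced
	 * compares as busier than a remote group with spare capacity, so
	 * the remote NUMA node is chosen regardless of the actual load.
	 */
	if (local_sgs.group_type > idlest_sgs.group_type)
		return idlest;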

Wakeup balance keeps doing this until the other NUMA node becomes overly
busy. Then another periodic load balance just shifts the load around,
making the previously overloaded node completely idle.

(Thanks to the great schedviz tool, I observed that the workload as a
whole migrates between the two NUMA nodes in a ping-pong pattern, with a
period of around 3ms.)

The reason wakeup balance suffers more is that in the fork+exit test
case, wakeup balance happens at a much higher frequency. I believe this
exists in real-world applications too.

> 
>> However, the original bugfix failed because it covers only case 1) below.
>>    1) Created by create_kthread
>>    2) Created by kernel_thread
>> No struct kthread is assigned to the task_struct in case 2 (please refer
>> to the comments in free_kthread_struct), so it simply won't work.
>>
>> The easiest way to cover both cases is to check nr_cpus_allowed, just as
>> discussed on the mailing list for v1 of the original fix.
>>
>> * lmbench3.lat_proc -P 104 fork (2 NUMA nodes, 26 cores, 2 threads)
>>
> 
> Reasoning about "proper" pcpu kthreads was simpler since they are static,
> see 3a7956e25e1d ("kthread: Fix PF_KTHREAD vs to_kthread() race")
> 
Got it. Thanks.

>>                           w/out patch                 w/ patch
>> fork+exit latency            1660 ms                  1520 ms (   8.4%)
>>
>> Fixes: 2f5f4cce496e ("sched/fair: Ignore percpu threads for imbalance pulls")
>> Signed-off-by: Yihao Wu <wuyihao@...ux.alibaba.com>
>> ---
>>   kernel/kthread.c | 6 +-----
>>   1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 4a4d7092a2d8..cb05d3ff2de4 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -543,11 +543,7 @@ void kthread_set_per_cpu(struct task_struct *k, int cpu)
>>
>>   bool kthread_is_per_cpu(struct task_struct *p)
>>   {
>> -	struct kthread *kthread = __to_kthread(p);
>> -	if (!kthread)
>> -		return false;
>> -
>> -	return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
>> +	return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
>>   }
> 
> As Peter said, this is going to cause issues. If you look at
> kthread_set_per_cpu(), we also store a CPU value which we expect to be
> valid when kthread_is_per_cpu(), which that change is breaking.
> 
> AIUI what you want to patch is the actual usage in can_migrate_task()
> 

Got it. Some callers may want a consistent view of kthread_is_per_cpu(),
kthread->cpu, and KTHREAD_IS_PER_CPU.

Are you suggesting that we patch only can_migrate_task() to check
nr_cpus_allowed? Wouldn't it be confusing if it used an alternative way
to tell whether p is a per-cpu kthread?
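
To make sure we are talking about the same thing, I imagine something
roughly like this (just a sketch, untested):

	/* in can_migrate_task(), in place of the kthread_is_per_cpu() check: */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1)
		return 0;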

I don't have a better solution though. :(


Thanks,
Yihao Wu

>>
>>   /**
>> --
>> 2.32.0.604.gb1f3e1269
