Date:   Mon, 22 Aug 2022 14:49:15 +0800
From:   "zhangsong (J)" <zhangsong34@...wei.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
CC:     Abel Wu <wuyun.abel@...edance.com>, <mingo@...hat.com>,
        <peterz@...radead.org>, <juri.lelli@...hat.com>,
        <dietmar.eggemann@....com>, <rostedt@...dmis.org>,
        <bsegall@...gle.com>, <mgorman@...e.de>, <bristot@...hat.com>,
        <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
        kernel test robot <lkp@...el.com>
Subject: Re: [PATCH v2] sched/fair: Introduce priority load balance to reduce
 interference from IDLE tasks

Hi, Vincent,

On 2022/8/20 0:04, Vincent Guittot wrote:
> On Fri, 19 Aug 2022 at 14:35, Vincent Guittot
> <vincent.guittot@...aro.org> wrote:
>>
>> Hi Zhang,
>>
>> On Fri, 19 Aug 2022 at 12:54, zhangsong (J) <zhangsong34@...wei.com> wrote:
>>>
>>>
>>> On 2022/8/18 16:31, Vincent Guittot wrote:
>>>> On Thursday 18 August 2022 at 10:46:55 (+0800), Abel Wu wrote:
>>>>> On 8/17/22 8:58 PM, Vincent Guittot wrote:
>>>>>> On Tue, 16 Aug 2022 at 04:53, zhangsong (J) <zhangsong34@...wei.com> wrote:
>>>>>>>
>>>> ...
>>>>
>>>>>>> Yes, this is usually a corner case, but suppose that some non-idle
>>>>>>> tasks are bound to CPU 1-2 and idle tasks are bound to CPU 0-1. CPU 1
>>>>>>> may then have many idle tasks and some non-idle tasks, while the idle
>>>>>>> tasks on CPU 1 cannot be pulled to CPU 2. When load balance triggers
>>>>>>> and CPU 2 should pull some tasks from CPU 1, the bad result is that
>>>>>>> the idle tasks of CPU 1 cannot be migrated, and the non-idle tasks
>>>>>>> cannot be migrated either because of the env->loop_max constraint.
>>>>>> env->loop_max adds a break, but load_balance will continue with the
>>>>>> next tasks, so it also tries to pull your non-idle tasks at the end,
>>>>>> after several breaks.
>>>>> The loop will be terminated without LBF_NEED_BREAK if it exceeds loop_max :)
>>>> Argh yes, my brain is not yet back from vacation.
>>>> I was confused by loop_max and loop_break being set to the same value, 32.
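>>>>
>>>> For reference, both values come from load_balance(); the following is a
>>>> sketch paraphrased from kernel/sched/fair.c around v5.19, so exact names
>>>> and values may differ in other versions:
>>>>
>>>>	struct lb_env env = {
>>>>		...
>>>>		/* sched_nr_migrate_break is a compile-time constant: 32 */
>>>>		.loop_break	= sched_nr_migrate_break,
>>>>		...
>>>>	};
>>>>
>>>>	/* sysctl_sched_nr_migrate also defaults to 32 */
>>>>	env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);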
>>>>
>>>> Zhang Song, could you try the patch below? If it works, I will prepare
>>>> a clean patch with all tags.
>>>>
>>>>
>>>>
>>>> sched/fair: make sure to try to detach at least one movable task
>>>>
>>>> During load balance we try at most env->loop_max times to move a task.
>>>> But it can happen that the LRU tasks (i.e. the tail of the cfs_tasks
>>>> list) can't be moved to dst_cpu because of affinity. In this case, loop
>>>> in the list until we find at least one movable task.
>>>>
>>>> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>>>> ---
>>>>    kernel/sched/fair.c | 12 +++++++++---
>>>>    1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index da388657d5ac..02b7b808e186 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
>>>>                p = list_last_entry(tasks, struct task_struct, se.group_node);
>>>>
>>>>                env->loop++;
>>>> -             /* We've more or less seen every task there is, call it quits */
>>>> -             if (env->loop > env->loop_max)
>>>> +             /*
>>>> +              * We've more or less seen every task there is, call it quits
>>>> +              * unless we haven't found any movable task yet.
>>>> +              */
>>>> +             if (env->loop > env->loop_max &&
>>>> +                 !(env->flags & LBF_ALL_PINNED))
>>>>                        break;
>>>>
>>>>                /* take a breather every nr_migrate tasks */
>>>> @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>
>>>>                if (env.flags & LBF_NEED_BREAK) {
>>>>                        env.flags &= ~LBF_NEED_BREAK;
>>>> -                     goto more_balance;
>>>> +                     /* Stop if we tried all running tasks */
>>>> +                     if (env.loop < busiest->nr_running)
>>>> +                             goto more_balance;
>>>>                }
>>>>
>>>>                /*
>>>> --
>>>> 2.17.1
>>>
>>> Thanks for your reply.
>>> I have tried your patch and run the test for comparison; it seems the
>>> patch you provided makes no difference.
>>> The test results are below (1000 idle tasks bound to CPU 0-1 and 10
>>> normal tasks bound to CPU 1-2):
>>>
>>> =================================================================
>>>
>>> Without patch:
>>>
>>>
>>>             6,777.37 msec cpu-clock                 #    1.355 CPUs utilized
>>>               20,812      context-switches          #    0.003 M/sec
>>>                    0      cpu-migrations            #    0.000 K/sec
>>>                    0      page-faults               #    0.000 K/sec
>>>       13,333,983,148      cycles                    #    1.967 GHz
>>>        6,457,930,305      instructions              #    0.48  insn per cycle
>>>        2,125,644,649      branches                  #  313.639 M/sec
>>>            1,690,587      branch-misses             #    0.08% of all branches
>>>         5.001931983 seconds time elapsed
>>>
>>> With your patch:
>>>
>>>
>>>             6,791.46 msec cpu-clock                 #    1.358 CPUs utilized
>>>               20,996      context-switches          #    0.003 M/sec
>>>                    0      cpu-migrations            #    0.000 K/sec
>>>                    0      page-faults               #    0.000 K/sec
>>>       13,467,573,052      cycles                    #    1.983 GHz
>>>        6,516,989,062      instructions              #    0.48  insn per cycle
>>>        2,145,139,220      branches                  #  315.858 M/sec
>>>            1,751,454      branch-misses             #    0.08% of all branches
>>>
>>>          5.002274267 seconds time elapsed
>>>
>>> With my patch:
>>>
>>>
>>>             7,495.14 msec cpu-clock                 #    1.499 CPUs utilized
>>>               23,176      context-switches          #    0.003 M/sec
>>>                  309      cpu-migrations            #    0.041 K/sec
>>>                    0      page-faults               #    0.000 K/sec
>>>       14,849,083,489      cycles                    #    1.981 GHz
>>>        7,180,832,268      instructions              #    0.48  insn per cycle
>>>        2,363,300,644      branches                  #  315.311 M/sec
>>>            1,964,169      branch-misses             #    0.08% of all branches
>>>
>>>          5.001713352 seconds time elapsed
>>> ===============================================================
>>>
>>> Obviously, when your patch is applied, the cpu-migrations count of the
>>> normal tasks is still 0, and their CPU utilization shows no improvement
>>> over the unpatched kernel.
>>> When my patch is applied, both the cpu-migrations count and the CPU
>>> utilization of the normal tasks improve.
>>> I cannot explain the result with your patch; you can also test it
>>> yourself.
>>
>> Do you have more details about the test that you are running?
>>
>> Do CPUs 0-2 share their cache?
>> What kind of tasks are the normal and idle tasks? Always-running tasks?
>>
>> I'm going to try to reproduce your problem locally.
> 
> Some details of your use case are missing. I have tried to reproduce your
> example above:
>      1000 idle tasks bound to CPU 0-1 and 10 normal tasks bound to CPU 1-2
> 
> Let's assume that, for whatever reason, the 10 normal tasks wake up on
> CPU 1. Then the thousand idle tasks are moved to CPU 0 by load balance
> and only the normal tasks stay on CPU 1. Load balance will then move
> some normal tasks to CPU 2.
> 
> The only way I can reproduce something similar to your example is to pin
> the 1000 idle tasks on CPU 1 so they can't move to CPU 0. Then I can see
> that load balance reaches the loop_max limit and has a hard time moving
> normal tasks to CPU 2. But in this latter case, my patch helps to move
> normal tasks to CPU 2. Something is missing in the description of your
> use case.
> 
> Sidenote: I see the same kind of problem with 1000 low-priority normal
> tasks, so it's not a matter of idle vs normal tasks.
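> (For illustration, such low-priority tasks could be spawned with
> something like "nice -n 19 taskset -c 0 ./test 200 &"; this is a sketch
> of the idea, not the exact command.)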
> 
> Regards,
> Vincent
> 

Sorry for my slow reply.

I have found a test case which illustrates this problem more
accurately. The test case is below.

1. An infinite-loop process, run as either a normal or an idle task:
$ cat test.c
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* sleep interval in microseconds between bursts of busy work */
        int duration = atoi(argv[1]);
        int i = 0;

        while (1) {
                usleep(duration);
                /* busy loop; build with default -O0 so it is not optimized away */
                for (i = 0; i < 100000; i++) {}
        }
}
$ gcc -o test test.c

2. First, spawn 500 idle tasks bound to CPU 0-2.
3. Then, spawn 10 normal tasks, also bound to CPU 0-2.
4. Finally, spawn 500 more idle tasks bound to CPU 0 only.
5. Run perf stat on the normal tasks to get their CPU utilization and cpu-migrations.


Below is the whole test script.
$ cat test.sh
#!/bin/bash

# create normal and idle cgroup path
mkdir /sys/fs/cgroup/cpu/normal/
mkdir /sys/fs/cgroup/cpu/idle/

# spawn 500 idle tasks and bind them to CPU 0-2
for ((i = 0; i < 500; i++))
do
	taskset -c 0-2 ./test 200 &
	pid=$!
	# change to SCHED_IDLE policy
	chrt -i -p 0 $pid
	echo $pid > /sys/fs/cgroup/cpu/idle/tasks
done

# spawn 10 normal tasks and bind them to CPU 0-2
normal_tasks=
for ((i = 0; i < 10; i++))
do
	taskset -c 0-2 ./test 500 &
	pid=$!
	normal_tasks+=$pid,
	echo $pid > /sys/fs/cgroup/cpu/normal/tasks
done

# spawn 500 idle tasks and bind them to CPU 0 only
for ((i = 0; i < 500; i++))
do
	taskset -c 0 ./test 200 &
	pid=$!
	# change to SCHED_IDLE policy
	chrt -i -p 0 $pid
	echo $pid > /sys/fs/cgroup/cpu/idle/tasks
done

# perf stat normal tasks
perf stat -a -p $normal_tasks sleep 5
pkill -f test
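
As a sanity check while the tasks are running (optional; these are just
standard util-linux commands, shown as a sketch), you can verify that the
policy and affinity took effect:

$ chrt -p <pid>      # should report SCHED_IDLE for the idle tasks
$ taskset -cp <pid>  # should report the expected CPU affinity list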


You can try the above case and test it with your patch.

Regards,
Zhang Song
