Message-ID: <CA+KHdyUdYZKc4u5MovSY9Dp=d7uQAt+UEAyeknavz6814VGcXQ@mail.gmail.com>
Date:   Wed, 30 Aug 2017 08:04:42 +0200
From:   Uladzislau Rezki <urezki@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>, Mike Galbraith <efault@....de>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...ymobile.com>,
        Paul Turner <pjt@...gle.com>, Oleg Nesterov <oleg@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Kirill Tkhai <tkhai@...dex.ru>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Nicolas Pitre <nicolas.pitre@...aro.org>
Subject: Re: [RFC v1] sched/fair: search a task from the tail of the queue

On Mon, Aug 28, 2017 at 10:41 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Aug 25, 2017 at 12:11:31AM +0200, Uladzislau Rezki (Sony) wrote:
>> From: Uladzislau Rezki <urezki@...il.com>
>>
>> As a first step, this patch makes the cfs_tasks list an MRU one:
>> when the next task is picked to run on a physical CPU, it is
>> moved to the front of the list.
>>
>> Therefore, the cfs_tasks list is more or less sorted (except for
>> woken tasks), from tasks recently given CPU time toward tasks
>> with the maximum wait time in the run-queue, i.e. an MRU list.
>>
>> Second, as part of the load balance operation, this approach
>> starts detach_tasks()/detach_one_task() from the tail of the
>> queue instead of the head, giving some advantages:
>>
>> - it tends to pick a task with the highest wait time;
>> - tasks located at the tail are less likely to be cache-hot,
>>   so can_migrate_task() is more likely to approve migrating them.
>>
>> hackbench shows slightly better performance. For example,
>> with 1000 samples and 40 groups on an i5-3320M CPU, it gives
>> the figures below:
>>
>> default: 0.644 avg
>> patched: 0.637 avg
>>
>> Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
>> ---
>>  kernel/sched/fair.c | 19 ++++++++++++++-----
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c77e4b1d51c0..cda281c6bb29 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6357,7 +6357,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>       if (hrtick_enabled(rq))
>>               hrtick_start_fair(rq, p);
>>
>> -     return p;
>> +     goto done;
>>  simple:
>>       cfs_rq = &rq->cfs;
>>  #endif
>> @@ -6378,6 +6378,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>       if (hrtick_enabled(rq))
>>               hrtick_start_fair(rq, p);
>>
>> +done: __maybe_unused
>> +     /*
>> +      * Move the next running task to the front of
>> +      * the list, so our cfs_tasks list becomes MRU
>> +      * one.
>> +      */
>> +     list_move(&se->group_node, &rq->cfs_tasks);
>> +
>>       return p;
>>
>>  idle:
>
> Could you also run something like:
>
> $ taskset 1 perf bench sched pipe
>
> to make sure the added list_move() doesn't hurt, I'm not sure group_node
> and cfs_tasks are in cachelines we already touch for that operation.
> And if you can see that list_move() hurt in "perf annotate", try moving
> those members around to lines that we already need anyway.
>
Here are the results and the environment I tested in:

- set performance governor
- echo 0 > /proc/sys/kernel/nmi_watchdog
- intel_pstate=disable
- i5-3320M CPU @ 2.60GHz

i=0; while [ $i -le 9 ]; do $p stat --null --repeat 50 -- taskset 1 $p bench sched pipe | grep "seconds time elapsed"; sleep 10; i=$(($i+1)); done

default (s)   patched (s)
1.843096976   1.848465547
1.846284611   1.851829911
1.854097755   1.851955843
1.854657016   1.854875523
1.85759762    1.855901145
1.86479764    1.856830792
1.866912897   1.861084183
1.870980068   1.868732601
1.87482615    1.871576991
1.882239296   1.872073165

According to the above results (they are sorted), the patched version
does not introduce any degradation or drawbacks. I did three runs, and
the overall results followed the same tendency: the patched version is
never worse.

>
> And if you can see that list_move() hurt in "perf annotate", try moving
> those members around to lines that we already need anyway.
>
group_node is reasonably likely to already be in a cache line we touch,
since we access the sched_entity in the pick_next_task_fair() path, so
I think it is settled in a proper place. As for cfs_tasks, it probably
makes sense to shift it a bit. Below is the same test again, where the
patched version additionally contains this cfs_tasks shift:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6dda2aab731e..fbb8b62b9878 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -652,6 +652,9 @@ struct rq {
        u64 nr_switches;
 
        struct cfs_rq cfs;
+#ifdef CONFIG_SMP
+       struct list_head cfs_tasks;
+#endif
        struct rt_rq rt;
        struct dl_rq dl;
 
@@ -697,8 +700,6 @@ struct rq {
        int cpu;
        int online;
 
-       struct list_head cfs_tasks;
-
        u64 rt_avg;
        u64 age_stamp;
        u64 idle_stamp;

default (s)    patched (s)
1.863221123    1.84654835
1.874336946    1.857324296
1.874486829    1.858931196
1.877274906    1.863019386
1.882917257    1.865189583
1.885295215    1.870852498
1.88598286     1.88001347
1.888191362    1.888558032
1.893517671    1.893189274
1.910161077    1.911743678

The patched case wins in all sets except the last one.
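
For reference, the detach-from-tail part mentioned in the changelog is
not quoted above; it simply walks cfs_tasks from the other end, so the
coldest tasks are considered first. A minimal sketch of the idea (not
the exact hunk; helper names are taken from the usual list API, so
treat it as an illustration only):

static struct task_struct *detach_one_task(struct lb_env *env)
{
	struct task_struct *p;

	lockdep_assert_held(&env->src_rq->lock);

	/*
	 * Walk the MRU list backwards: tasks at the tail have waited
	 * the longest and are the least likely to be cache-hot, so
	 * can_migrate_task() is more likely to let them go.
	 */
	list_for_each_entry_reverse(p, &env->src_rq->cfs_tasks, se.group_node) {
		if (!can_migrate_task(p, env))
			continue;

		detach_task(p, env);
		return p;
	}

	return NULL;
}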

-- 
Uladzislau Rezki
