Message-ID: <65b1691d-8d90-4057-8ad0-da546a0ac8a1@linux.ibm.com>
Date: Fri, 12 Apr 2024 22:29:17 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Ze Gao <zegao2021@...il.com>
Cc: linux-kernel@...r.kernel.org, Ze Gao <zegao@...cent.com>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...e.de>, Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Valentin Schneider <vschneid@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its
affinity changes
Hi Ze Gao,
On 13/03/24 14:28, Ze Gao wrote:
> We observed that select_idle_sibling() is likely to return the *target* cpu
> early, which is often the previous cpu the task was running on, even when
> that cpu is not within the newly set affinity list. From there we can only
> rely on select_fallback_rq() to choose one for us at its will (mostly the
> first valid cpu for now).
>
> However, the cpu chosen by select_fallback_rq() is highly likely not a
> good enough candidate; sometimes the task has to rely on the load balancer
> to kick in and place it on a better cpu, which no doubt adds one or more
> unnecessary migrations. For example, this is what I get when I move task
> 3964 to cpus 23-24, where cpu 23 already has a cpu-bound task pinned:
>
> swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
>
I am able to reproduce this scenario, where an extra migration happens through load balance:
swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
I wrote a simple C program that blocks for a few seconds; while it is blocked, I taskset it
to CPUs 33-34 with a busy task already running on CPU 33.
> The thing is, we can actually do better if we perform the checks early and
> take more advantage of the *target* in select_idle_sibling(). That is, we
> continue the idle cpu selection if *target* fails the test of
> cpumask_test_cpu(*target*, p->cpus_ptr). By doing so, we are likely to pick
> a good candidate, especially when the newly allowed cpu set shares some cpu
> resources with *target*.
>
> And with this change, we clearly see the improvement when I move task 3964
> to cpus 25-26, where cpu 25 already has a cpu-bound task pinned.
>
> swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
But after applying this patch, the extra migration still happens, as CPU 33 is still chosen
by try_to_wake_up(). After placing some perf probes and testing:
migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
It is observed that select_fallback_rq is still taken in this scenario, as the default
target (CPU 4 here) is returned at the end of select_idle_sibling.
In most of my testing, the default target is returned at the end of the function due to the
initial checks. There could be cases where we get an optimal CPU before reaching the end of
select_idle_sibling, but it would be interesting to know whether the extra time spent
finding an optimal CPU has an impact, compared to returning earlier, given that most of the
time we end up returning the default target anyway.
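To make the intended control flow concrete, here is a toy userspace model of the proposed
check (this is my sketch, not the kernel code: cpumasks are plain 64-bit masks, pick_cpu()
stands in for select_idle_sibling(), and -1 stands in for falling through to
select_fallback_rq()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cpumask_test_cpu(): is cpu set in mask? */
static bool cpu_allowed(uint64_t mask, int cpu)
{
	return mask & (1ULL << cpu);
}

/* Model of the proposed logic: return target only if it is in the
 * task's allowed mask; otherwise continue the idle-cpu scan over the
 * allowed cpus instead of returning target early. -1 means no idle
 * candidate was found and the fallback path would be taken. */
static int pick_cpu(int target, uint64_t allowed, uint64_t idle)
{
	if (cpu_allowed(allowed, target))
		return target;		/* old fast path, now gated on affinity */
	for (int cpu = 0; cpu < 64; cpu++)
		if (cpu_allowed(allowed, cpu) && cpu_allowed(idle, cpu))
			return cpu;	/* continue the idle-cpu selection */
	return -1;			/* fall back to select_fallback_rq() */
}
```

In the earlier trace, the allowed set is {23,24}, cpu 23 is busy and cpu 24 idle, and the
stale target is 13: the model returns 24 directly rather than falling back. The open
question above is whether the extra scan is worth it when, as in my runs, the default
target is what usually gets returned anyway.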
Thanks and Regards
Madadi Vineeth Reddy
>
> Note we do the same check for *prev* in select_idle_sibling() as well.
>
> Signed-off-by: Ze Gao <zegao@...cent.com>
> ---