linux-kernel - Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAD8CoPDajt_EciU0x8G_BocajfaJrQ1FKLYkp_nLSfF9+9FJVw@mail.gmail.com>
Date: Mon, 15 Apr 2024 13:40:13 +0800
From: Ze Gao <zegao2021@...il.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, Ze Gao <zegao@...cent.com>, 
	Ben Segall <bsegall@...gle.com>, Daniel Bristot de Oliveira <bristot@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Valentin Schneider <vschneid@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its
 affinity changes

On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
<vineethr@...ux.ibm.com> wrote:
>
> Hi Ze Gao,
>
> On 13/03/24 14:28, Ze Gao wrote:
> > We observered select_idle_sibling() is likely to return the *target* cpu
> > early which is likely to be the previous cpu this task is running on even
> > when it's actually not within the affinity list newly set, from where after
> > we can only rely on select_fallback_rq() to choose one for us at its will
> > (the first valid mostly for now).
> >
> > However, the one chosen by select_fallback_rq() is highly likely not a
> > good enough candidate, sometimes it has to rely on load balancer to kick
> > in to place itself to a better cpu, which adds one or more unnecessary
> > migrations in no doubt. For example, this is what I get when I move task
> > 3964 to cpu 23-24 where cpu 23 has a cpu bound work pinned already:
> >
> >         swapper       0 [013]   959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> > kworker/24:2-mm    1014 [024]   959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
> >
>
> I am able to reproduce this scenario of having an extra migration through load balance
> swapper       0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> ksoftirqd/0  13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
>
> I wrote a simple c program that blocks for few seconds, meanwhile I taskset it to CPUs 33,34 while I already have a
> busy task running on CPU 33.
>
> > The thing is we can actually do better if we do checks early and take more
> > advantages of the *target* in select_idle_sibling(). That is, we continue
> > the idle cpu selection if *target* fails the test of cpumask_test_cpu(
> > *target*, p->cpus_ptr). By doing so, we are likely to pick a good candidate,
> > especially when the newly allowed cpu set shares some cpu resources with
> > *target*.
> >
> > And with this change, we clearly see the improvement when I move task 3964
> > to cpu 25-26 where cpu 25 has a cpu bound work pinned already.
> >
> >         swapper       0 [027]  4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
>
> But after applying this patch, The extra migration is still happening as CPU 33 is still chosen by try_to_wake_up.
>
> On placing some perf probes and testing,
>     migration/57     304 [057] 12216.988491:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
>          swapper       0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
>          swapper       0 [004] 12226.989071:       probe:select_fallback_rq: (c0000000001a2e38) cpu=4
>          swapper       0 [004] 12226.989074:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
>          swapper       0 [000] 12227.007768:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
>
> It is observed that, select_fallback_rq is still taken in this scenario as default target is returned at the end of select_idle_sibling
> which was CPU 4.

After second thoughts, it indeed could happen if CPU 4 shares nothing
with CPU 33(34),
for example, in different numa nodes.

IOW, it cannot benefit from select_idle_siblings() and has to rely on
select_fallback_rq
as the last resort. Just like what I said in the changelog, this patch
aims to improve rq
selection for cases where the newly allowed cpu set shares some cpu
resources with
the old cpu set.

Sorry for not being able to recall all details immediately, since this
thread has been inactive
for a long while and receives no feedback from sched folks.

Best,
Ze

> In most of my testing, default target is returned at the end of the function due to the initial checks. It's possible that there would
> be cases where we can get optimal CPU before we reach end of the select_idle_sibling function but it would be interesting to know if the
> extra time spent in finding an optimal cpu have an impact instead of returning it earlier if in most of the times we are returning the
> default target at the end.
>
> Thanks and Regards
> Madadi Vineeth Reddy
>
> >
> > Note we do the same check for *prev* in select_idle_sibling() as well.
> >
> > Signed-off-by: Ze Gao <zegao@...cent.com>
> > ---
>