lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD8CoPCmBMwo13TEz2V0Bb9fSNBKR3uGWKFTHxQTnZ8BxQj36A@mail.gmail.com>
Date: Mon, 15 Apr 2024 19:04:59 +0800
From: Ze Gao <zegao2021@...il.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, Ze Gao <zegao@...cent.com>, 
	Ben Segall <bsegall@...gle.com>, Daniel Bristot de Oliveira <bristot@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Valentin Schneider <vschneid@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its
 affinity changes

On Mon, Apr 15, 2024 at 6:25 PM Madadi Vineeth Reddy
<vineethr@...ux.ibm.com> wrote:
>
> Hi Ze Gao,
>
> On 15/04/24 07:33, Ze Gao wrote:
> > On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> > <vineethr@...ux.ibm.com> wrote:
> >>
> >> Hi Ze Gao,
> >>
> >> On 13/03/24 14:28, Ze Gao wrote:
> >>> We observered select_idle_sibling() is likely to return the *target* cpu
> >>> early which is likely to be the previous cpu this task is running on even
> >>> when it's actually not within the affinity list newly set, from where after
> >>> we can only rely on select_fallback_rq() to choose one for us at its will
> >>> (the first valid mostly for now).
> >>>
> >>> However, the one chosen by select_fallback_rq() is highly likely not a
> >>> good enough candidate, sometimes it has to rely on load balancer to kick
> >>> in to place itself to a better cpu, which adds one or more unnecessary
> >>> migrations in no doubt. For example, this is what I get when I move task
> >>> 3964 to cpu 23-24 where cpu 23 has a cpu bound work pinned already:
> >>>
> >>>         swapper       0 [013]   959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> >>> kworker/24:2-mm    1014 [024]   959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
> >>>
> >>
> >> I am able to reproduce this scenario of having an extra migration through load balance
> >> swapper       0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> >> ksoftirqd/0  13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> I wrote a simple c program that blocks for few seconds, meanwhile I taskset it to CPUs 33,34 while I already have a
> >> busy task running on CPU 33.
> >>
> >>> The thing is we can actually do better if we do checks early and take more
> >>> advantages of the *target* in select_idle_sibling(). That is, we continue
> >>> the idle cpu selection if *target* fails the test of cpumask_test_cpu(
> >>> *target*, p->cpus_ptr). By doing so, we are likely to pick a good candidate,
> >>> especially when the newly allowed cpu set shares some cpu resources with
> >>> *target*.
> >>>
> >>> And with this change, we clearly see the improvement when I move task 3964
> >>> to cpu 25-26 where cpu 25 has a cpu bound work pinned already.
> >>>
> >>>         swapper       0 [027]  4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26
> >>
> >> But after applying this patch, The extra migration is still happening as CPU 33 is still chosen by try_to_wake_up.
> >>
> >> On placing some perf probes and testing,
> >>     migration/57     304 [057] 12216.988491:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
> >>          swapper       0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
> >>          swapper       0 [004] 12226.989071:       probe:select_fallback_rq: (c0000000001a2e38) cpu=4
> >>          swapper       0 [004] 12226.989074:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
> >>          swapper       0 [000] 12227.007768:       sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
> >>
> >> It is observed that, select_fallback_rq is still taken in this scenario as default target is returned at the end of select_idle_sibling
> >> which was CPU 4.
> >
> > My best guess is that you may have hit the code path for running tasks
> > (taskset happens right after the task is woken up). Should that happen,
> > the picking is done via:
> >
> >         dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
>
> Yes, I verified that cpumask_any_and_distribute is hit.
>
> >
> > and it also makes sense that select_fallback_rq() returns 4 since that happens
> > before you change the affinities.
>
> In this case, 4 is passed as an argument to select_fallback_rq(), it's not the return
> value. It actually returns the first affined CPU which is 33 here.
>
> Also, I see select_fallback_rq() happening after cpumask_any_and_distribute.
>
> >
> > you may need to rule out this case first :)
> >
>
> I am not sure on how to avoid hitting cpumask_any_and_distribute, Can you explain how did you move the tasks in case
> of stress-ng?

Acutally there is no easy way to avoid taskset on the running task
(maybe -l with a low load expectation which increases the chance
the task sleeps). and then it is simple and stupid: let the trace be
as much verbose as I can, and filter what I pay attention to.

Regards,
Ze

> Thanks and Regards
> Madadi Vineeth Reddy
>
> > Regards,
> > Ze
> >
> >> In most of my testing, default target is returned at the end of the function due to the initial checks. It's possible that there would
> >> be cases where we can get optimal CPU before we reach end of the select_idle_sibling function but it would be interesting to know if the
> >> extra time spent in finding an optimal cpu have an impact instead of returning it earlier if in most of the times we are returning the
> >> default target at the end.
> >>
> >> Thanks and Regards
> >> Madadi Vineeth Reddy
> >>
> >>>
> >>> Note we do the same check for *prev* in select_idle_sibling() as well.
> >>>
> >>> Signed-off-by: Ze Gao <zegao@...cent.com>
> >>> ---
> >>
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ