Message-ID: <jhjwnxwt7zh.mognet@arm.com>
Date: Sat, 05 Dec 2020 18:37:06 +0000
From: Valentin Schneider <valentin.schneider@....com>
To: Qian Cai <qcai@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, tglx@...utronix.de,
mingo@...nel.org, linux-kernel@...r.kernel.org,
bigeasy@...utronix.de, qais.yousef@....com, swood@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, bristot@...hat.com, vincent.donnefort@....com,
tj@...nel.org, ouwen210@...mail.com
Subject: Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative
On 04/12/20 21:19, Qian Cai wrote:
> On Tue, 2020-11-17 at 19:28 +0000, Valentin Schneider wrote:
>> We did have some breakage in that area, but all the holes I was aware of
>> have been plugged. What would help here is to see which tasks are still
>> queued on that outgoing CPU, and their recent activity.
>>
>> Something like
>> - ftrace_dump_on_oops on your kernel cmdline
>> - trace-cmd start -e 'sched:*'
>> <start the test here>
>>
>> ought to do it. Then you can paste the (tail of the) ftrace dump.
>>
>> I also had this laying around, which may or may not be of some help:
>
> Okay, your patch did not help, since it can still be reproduced using this,
>
It wasn't meant to fix this, only add some more debug prints :)
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug04.sh
>
> # while :; do cpuhotplug04.sh -l 1; done
>
> The ftrace dump has too much output on this 256-CPU system, so I have not had
> the patience to wait for it to finish after 15-min. But here is the log capturing
> so far (search for "kernel BUG" there).
>
> http://people.redhat.com/qcai/console.log
>
From there I see:
[20798.166987][ T650] CPU127 nr_running=2
[20798.171185][ T650] p=migration/127
[20798.175161][ T650] p=kworker/127:1
so this might be another workqueue hurdle. This should be prevented by:
06249738a41a ("workqueue: Manually break affinity on hotplug")
In any case, I'll give this a try on a TX2 next week and see where it gets
me.
Note that much earlier in your log, you have a softlockup on CPU127:
[ 74.278367][ C127] watchdog: BUG: soft lockup - CPU#127 stuck for 23s! [swapper/0:1]
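Since the full ftrace dump is impractical on that many CPUs, a rough way to pull everything the console log says about one CPU is sketched below. This is only an illustration: cpu_lines is a made-up helper name, the log is assumed to have been saved locally, and the patterns are guessed from the few lines quoted above (the "CPU127" / "CPU#127" spellings, the "[ C127]" context tag, and per-CPU thread names such as "migration/127"), so they may miss other formats.

```shell
# Hypothetical helper: print console-log lines that mention a given CPU.
# Matches "CPU<n>" / "CPU#<n>", per-CPU thread names ending in "/<n>",
# and the "[ C<n>]" context tag. The trailing "non-digit or end of line"
# check keeps CPU12 or CPU1270 from matching when asking for CPU127.
cpu_lines() {
    cpu="$1"
    log="$2"
    grep -nE "(CPU#?${cpu}|/${cpu})([^0-9]|\$)|\[ *C${cpu}\]" "$log"
}

# Usage (assumes the log was saved locally as console.log):
#   cpu_lines 127 console.log
```

The output keeps grep's line numbers, so the early soft lockup and the later nr_running report can be placed relative to the "kernel BUG" entry.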