linux-kernel - Re: Problem with commit 31e77c93e432 "sched/fair: Update blocked load when newly idle"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAKfTPtDigSEOmtxk7daTTBb3o6kVR8AgUVhu4X6YWutK9yWNwg@mail.gmail.com>
Date:   Wed, 28 Mar 2018 16:01:44 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Heiner Kallweit <hkallweit1@...il.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Problem with commit 31e77c93e432 "sched/fair: Update blocked load
 when newly idle"

Hi,

On 28 March 2018 at 12:37, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> Hi,
>
> On 03/24/2018 01:47 PM, Heiner Kallweit wrote:
>>
>> Am 24.03.2018 um 07:46 schrieb Vincent Guittot:
>>>
>>> Hi Heiner,
>>>
>>> Le Friday 23 Mar 2018 à 22:28:09 (+0100), Heiner Kallweit a écrit :
>>>>
>>>> Recently I started to get the following problems with linux-next:
>>>>
>>>> - When working via Putty/SSH on the system the console frequently
>>>> freezes
>>>>    for few seconds. Sometimes only opening a second console makes the
>>>>    first one react again.
>>>>
>>>> - I get "INFO: rcu_sched detected stalls on CPUs/tasks:" warnings as
>>>>    described in [1].
>>>>
>
> I can't catch this issue on my Juno r0 (arm64 big.Little).
>
> root@...o:~# uname -r
> 4.16.0-rc4-00198-g31e77c93e432
>
> I'm using openssh-client and openssh-server though.

I think that I have finally been able to reproduce it on my hikey
(octo cortex-A53) after unplugging 6 cores and waiting for almost 2
hours
This seems to happen only on dual core system as I haven't faced that
before on the hikey which I have used for my tests

[  191.365730] CPU2: shutdown
[  191.368482] psci: CPU2 killed.
[  195.601017] CPU3: shutdown
[  195.603767] psci: CPU3 killed.
[  199.037500] CPU4: shutdown
[  199.040251] psci: CPU4 killed.
[  201.813237] CPU5: shutdown
[  201.815996] psci: CPU5 killed.
[  204.624902] CPU6: shutdown
[  204.627646] psci: CPU6 killed.
[  207.652478] CPU7: shutdown
[  207.655204] psci: CPU7 killed.
[ 6017.160463] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6017.166151] 1-...!: (4 GPs behind) idle=e20/0/0 softirq=10820/10864 fqs=0
[ 6017.173113] (detected by 0, t=20705 jiffies, g=1389, c=1388, q=27)
[ 6017.179386] Task dump for CPU 1:
[ 6017.182612] swapper/1       R  running task        0     0      1 0x00000000
[ 6017.189666] Call trace:
[ 6017.192120]  __switch_to+0x8c/0xd0
[ 6017.195524]  cpuidle_enter_state+0x64/0x360
[ 6017.199706]  cpuidle_enter+0x18/0x20
[ 6017.203282]  call_cpuidle+0x18/0x30
[ 6017.206771]  do_idle+0x1a4/0x1e0
[ 6017.209999]  cpu_startup_entry+0x20/0x28
[ 6017.213923]  secondary_start_kernel+0x188/0x1c8
[ 6017.218457] rcu_preempt kthread starved for 20705 jiffies! g1389
c1388 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
[ 6017.228985] rcu_preempt     I    0     8      2 0x00000000
[ 6017.234474] Call trace:
[ 6017.236918]  __switch_to+0x8c/0xd0
[ 6017.240322]  __schedule+0x1b8/0x730
[ 6017.243810]  schedule+0x38/0xa0
[ 6017.246952]  schedule_timeout+0x194/0x428
[ 6017.250964]  rcu_gp_kthread+0x4d4/0x780
[ 6017.254802]  kthread+0xfc/0x128
[ 6017.257942]  ret_from_fork+0x10/0x18
[ 6066.541736] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6066.547428] 1-...!: (5 GPs behind) idle=e28/0/0 softirq=10820/10864 fqs=0
[ 6066.554392] (detected by 0, t=12345 jiffies, g=1390, c=1389, q=48)
[ 6066.560666] Task dump for CPU 1:
[ 6066.563893] swapper/1       R  running task        0     0      1 0x00000000
[ 6066.570948] Call trace:
[ 6066.573404]  __switch_to+0x8c/0xd0
[ 6066.576809]  cpuidle_enter_state+0x64/0x360
[ 6066.580992]  cpuidle_enter+0x18/0x20
[ 6066.584568]  call_cpuidle+0x18/0x30
[ 6066.588056]  do_idle+0x1a4/0x1e0
[ 6066.591284]  cpu_startup_entry+0x20/0x28
[ 6066.595208]  secondary_start_kernel+0x188/0x1c8
[ 6066.599742] rcu_preempt kthread starved for 12345 jiffies! g1390
c1389 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
[ 6066.610270] rcu_preempt     I    0     8      2 0x00000000
[ 6066.615758] Call trace:
[ 6066.618203]  __switch_to+0x8c/0xd0
[ 6066.621607]  __schedule+0x1b8/0x730
[ 6066.625095]  schedule+0x38/0xa0
[ 6066.628236]  schedule_timeout+0x194/0x428
[ 6066.632249]  rcu_gp_kthread+0x4d4/0x780
[ 6066.636087]  kthread+0xfc/0x128

>
>>>> Bisecting the issue resulted in:
>>>>
>>>> 31e77c93e432dec79c7d90b888bbfc3652592741 is the first bad commit
>>>> commit 31e77c93e432dec79c7d90b888bbfc3652592741
>>>> Author: Vincent Guittot <vincent.guittot@...aro.org>
>>>> Date:   Wed Feb 14 16:26:46 2018 +0100
>>>>
>>>>      sched/fair: Update blocked load when newly idle
>>>>
>>>>      When NEWLY_IDLE load balance is not triggered, we might need to
>>>> update the
>>>>      blocked load anyway. We can kick an ilb so an idle CPU will take
>>>> care of
>>>>      updating blocked load or we can try to update them locally before
>>>> entering
>>>>      idle. In the latter case, we reuse part of the nohz_idle_balance.
>>>>
>>>> After reversing this commit at least the issue with the freezing console
>>>> is gone. The second one appeared only sporadically, I still have to see
>>>> whether it pops up again.
>
>
> [...]
>