linux-kernel - Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 12 Apr 2018 13:15:19 +0200
From:   Niklas Söderlund 
        <niklas.soderlund@...natech.se>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Ingo Molnar <mingo@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-renesas-soc@...r.kernel.org,
        Heiner Kallweit <hkallweit1@...il.com>
Subject: Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update
 blocked load when newly idle")

Hi Vincent,

Thanks for your feedback.

On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
> Hi Niklas,
> 
> On 12 April 2018 at 11:18, Niklas Söderlund
> <niklas.soderlund@...natech.se> wrote:
> > Hi Vincent,
> >
> > I have observed issues running on linus/master from a few days back [1].
> > I'm running on a Renesas Koelsch board (arm32) and I can trigger a issue
> > by X forwarding the v4l2 test application qv4l2 over ssh and moving the
> > courser around in the GUI (best test case description award...). I'm
> > sorry about the really bad way I trigger this but I can't do it in any
> > other way, I'm happy to try other methods if you got some ideas. The
> > symptom of the issue is a complete hang of the system for more then 30
> > seconds and then this information is printed in the console:
> 
> Heiner (edded cc) also reported similar problem with his platform: a
> dual core celeron
> 
> Do you confirm that your platform is a dual cortex-A15 ? At least that
> what I have seen on web
> This would confirm that dual system is a key point.

I can confirm that my platform is a dual core.

> 
> The ssh connection is also common with Heiner's setup

Interesting, I found Heiner's mail and I can confirm that I too 
experience ssh sessions lockups. I ssh into the system and by repeatedly 
hitting the return key I can lockup the board, while locked up starting 
another ssh session unblocks the first. If I don't start another ssh 
session but keep hitting return key sporadically in the first one I can 
get the trace I reported in my first mail to be printed on the serial 
console.

When locked up the symptoms are that both the single ssh session is dead 
and the serial console. But attempting another ssh connection 
immediately unblocks both ssh and serial console. And if I allow enough 
time before starting the second ssh connection I can trigger a trace to 
be printed on the serial console, it's similar but different from the 
first I reported.

[  207.548610] 	1-...!: (0 ticks this GP) idle=79a/1/1073741824 softirq=2146/2146 fqs=0 
[  207.556442] 	(detected by 0, t=12645 jiffies, g=333, c=332, q=20)
[  207.562546] rcu_sched kthread starved for 12645 jiffies! g333 c332 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
[  207.572548] RCU grace-period kthread stack dump:

[  207.577166] rcu_sched       R  running task        0     9      2 0x00000000
[  207.584389] Backtrace: 
[  207.586849] [<c0760450>] (__schedule) from [<c0760ba4>] (schedule+0x94/0xb8)
[  207.593901]  r10:e77813c0 r9:e77813c0 r8:ffffffff r7:e709bed4 r6:ffffaa80 r5:00000000
[  207.601732]  r4:ffffe000
[  207.604269] [<c0760b10>] (schedule) from [<c0764560>] (schedule_timeout+0x380/0x3dc)
[  207.612013]  r5:00000000 r4:00000000
[  207.615596] [<c07641e0>] (schedule_timeout) from [<c017b698>] (rcu_gp_kthread+0x668/0xe2c)
[  207.623863]  r10:c0b79018 r9:0000014d r8:0000014c r7:00000001 r6:00000000 r5:c0b10ad0
[  207.631693]  r4:c0b10980
[  207.634230] [<c017b030>] (rcu_gp_kthread) from [<c013ddd8>] (kthread+0x148/0x160)
[  207.641712]  r7:c0b10980
[  207.644249] [<c013dc90>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
[  207.651472] Exception stack(0xe709bfb0 to 0xe709bff8)
[  207.656527] bfa0:                                     00000000 00000000 00000000 00000000
[  207.664709] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  207.672890] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[  207.679508]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c013dc90
[  207.687340]  r4:e7026f4

Continuing the anecdotal testing, I can't seem to be able to trigger the 
lockup if i have ever had two ssh sessions open to the systems. And 
about half the time I can't trigger it at all but after a reset of the 
system it triggers with just hitting the return key 2-5 times of opening 
a ssh session and just hitting the return key. But please take this part 
with a grain of salt as it's done by the monkey testing method :-)

All tests above have been run base on c18bb396d3d261eb ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

> 
> >
> > [  142.849390] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [  142.854972]  1-...!: (1 GPs behind) idle=7a4/0/0 softirq=3214/3217 fqs=0
> > [  142.861976]  (detected by 0, t=8232 jiffies, g=930, c=929, q=11)
> > [  142.868042] Sending NMI from CPU 0 to CPUs 1:
> > [  142.872443] NMI backtrace for cpu 1
> > [  142.872452] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-05506-g28aba11c1393691a #14
> > [  142.872455] Hardware name: Generic R8A7791 (Flattened Device Tree)
> > [  142.872473] PC is at arch_cpu_idle+0x28/0x44
> > [  142.872484] LR is at trace_hardirqs_on_caller+0x1a4/0x1d4
> > [  142.872488] pc : [<c01088a4>]    lr : [<c01747a8>]    psr: 20070013
> > [  142.872491] sp : eb0b9f90  ip : eb0b9f60  fp : eb0b9f9c
> > [  142.872495] r10: 00000000  r9 : 413fc0f2  r8 : 4000406a
> > [  142.872498] r7 : c0c08478  r6 : c0c0842c  r5 : ffffe000  r4 : 00000002
> > [  142.872502] r3 : eb0b6ec0  r2 : 00000000  r1 : 00000004  r0 : 00000001
> > [  142.872507] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> > [  142.872511] Control: 10c5387d  Table: 6a61406a  DAC: 00000051
> > [  142.872516] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-05506-g28aba11c1393691a #14
> > [  142.872519] Hardware name: Generic R8A7791 (Flattened Device Tree)
> > [  142.872522] Backtrace:
> > [  142.872534] [<c010cbf8>] (dump_backtrace) from [<c010ce78>] (show_stack+0x18/0x1c)
> > [  142.872540]  r7:c0c81388 r6:00000000 r5:60070193 r4:c0c81388
> > [  142.872550] [<c010ce60>] (show_stack) from [<c07dec20>] (dump_stack+0xa4/0xd8)
> > [  142.872557] [<c07deb7c>] (dump_stack) from [<c0108b38>] (show_regs+0x14/0x18)
> > [  142.872563]  r9:00000001 r8:00000000 r7:c0c4f678 r6:eb0b9f40 r5:00000001 r4:c13e1130
> > [  142.872571] [<c0108b24>] (show_regs) from [<c07e4cd0>] (nmi_cpu_backtrace+0xfc/0x118)
> > [  142.872578] [<c07e4bd4>] (nmi_cpu_backtrace) from [<c010fb28>] (handle_IPI+0x2a8/0x320)
> > [  142.872583]  r7:c0c4f678 r6:eb0b9f40 r5:00000007 r4:c0b75b68
> > [  142.872594] [<c010f880>] (handle_IPI) from [<c03cb528>] (gic_handle_irq+0x8c/0x98)
> > [  142.872599]  r10:00000000 r9:eb0b8000 r8:f0803000 r7:c0c4f678 r6:eb0b9f40 r5:c0c08a90
> > [  142.872602]  r4:f0802000
> > [  142.872608] [<c03cb49c>] (gic_handle_irq) from [<c01019f0>] (__irq_svc+0x70/0x98)
> > [  142.872612] Exception stack(0xeb0b9f40 to 0xeb0b9f88)
> > [  142.872618] 9f40: 00000001 00000004 00000000 eb0b6ec0 00000002 ffffe000 c0c0842c c0c08478
> > [  142.872624] 9f60: 4000406a 413fc0f2 00000000 eb0b9f9c eb0b9f60 eb0b9f90 c01747a8 c01088a4
> > [  142.872627] 9f80: 20070013 ffffffff
> > [  142.872632]  r9:eb0b8000 r8:4000406a r7:eb0b9f74 r6:ffffffff r5:20070013 r4:c01088a4
> > [  142.872642] [<c010887c>] (arch_cpu_idle) from [<c07fb1a8>] (default_idle_call+0x34/0x38)
> > [  142.872650] [<c07fb174>] (default_idle_call) from [<c0155fe0>] (do_idle+0xe0/0x134)
> > [  142.872656] [<c0155f00>] (do_idle) from [<c0156398>] (cpu_startup_entry+0x20/0x24)
> > [  142.872660]  r7:c0c8e9d0 r6:10c0387d r5:00000051 r4:00000085
> > [  142.872667] [<c0156378>] (cpu_startup_entry) from [<c010f644>] (secondary_start_kernel+0x114/0x134)
> > [  142.872673] [<c010f530>] (secondary_start_kernel) from [<401026ec>] (0x401026ec)
> > [  142.872676]  r5:00000051 r4:6b0a406a
> > [  142.873456] rcu_sched kthread starved for 8235 jiffies! g930 c929 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
> > [  143.135040] RCU grace-period kthread stack dump:
> > [  143.139695] rcu_sched       I    0     9      2 0x00000000
> > [  143.145234] Backtrace:
> > [  143.147719] [<c07f5120>] (__schedule) from [<c07f5b3c>] (schedule+0x94/0xb8)
> > [  143.154823]  r10:c0b714c0 r9:c0c85f8a r8:ffffffff r7:eb0abec4 r6:ffffa274 r5:00000000
> > [  143.162712]  r4:ffffe000
> > [  143.165273] [<c07f5aa8>] (schedule) from [<c07fa940>] (schedule_timeout+0x440/0x4b0)
> > [  143.173076]  r5:00000000 r4:eb79c4c0
> > [  143.176692] [<c07fa500>] (schedule_timeout) from [<c0196e08>] (rcu_gp_kthread+0x958/0x150c)
> > [  143.185108]  r10:c0c87274 r9:00000000 r8:c0c165b8 r7:00000001 r6:00000000 r5:c0c16590
> > [  143.192997]  r4:c0c16300
> > [  143.195560] [<c01964b0>] (rcu_gp_kthread) from [<c0145eb8>] (kthread+0x148/0x160)
> > [  143.203099]  r7:c0c16300
> > [  143.205660] [<c0145d70>] (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
> > [  143.212938] Exception stack(0xeb0abfb0 to 0xeb0abff8)
> > [  143.218030] bfa0:                                     00000000 00000000 00000000 00000000
> > [  143.226271] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [  143.234511] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [  143.241177]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0145d70
> > [  143.249065]  r4:eb037b00
> >
> > After the freeze the system becomes responsive again and I can sometimes
> > trigger the hang multiple times. I tried to bisect the problem and I
> > found that by reverting [2] I can no longer reproduce the issue. I can
> > also not reproduce the issue on v4.16.  I can't figure out if reverting
> > [2] is just treating a symptom or the root cause of my troubles and
> > would appreciate your input. Also my "test-case" do not trigger every
> > time but I have tested this scenario quiet a lot and the result seems to
> > be constant.
> >
> > My test setup involves a NFS root filesystem, I ssh in and launch the
> > GUI application over X forwarding. From what I know the application is
> > not doing any ioctl calls to the v4l2 framework it's just sitting there
> > idle as I move the courser around showing tool tips and such as I hover
> > over elements and then it freeze up. I have not observed this issue by
> > just booting the system and leaving it idle, movement in the GUI seems
> > to be the key to trigger this.
> >
> > I'm a bit lost on how to progress with this issue and would appreciate
> > any help you can provide to help me figure this out.
> 
> Can you send me your config ?
> 
> I'm going to prepare a debug patch to spy what's happening when entering idle
> 
> Regards,
> Vincent
> >
> > 1. c18bb396d3d261eb ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net"))
> > 2. 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle")
> >
> > --
> > Regards,
> > Niklas Söderlund

-- 
Regards,
Niklas Söderlund