Date:   Thu, 12 Apr 2018 21:43:05 +0200
From:   Heiner Kallweit <hkallweit1@...il.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>,
        Niklas Söderlund <niklas.soderlund@...natech.se>,
        i@...aro.org
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Ingo Molnar <mingo@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-renesas-soc@...r.kernel.org
Subject: Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update
 blocked load when newly idle")

Am 12.04.2018 um 15:30 schrieb Vincent Guittot:
> Heiner, Niklas,
> 
> On Thursday 12 Apr 2018 at 13:15:19 (+0200), Niklas Söderlund wrote:
>> Hi Vincent,
>>
>> Thanks for your feedback.
>>
>> On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
>>> Hi Niklas,
>>>
>>> On 12 April 2018 at 11:18, Niklas Söderlund
>>> <niklas.soderlund@...natech.se> wrote:
>>>> Hi Vincent,
>>>>
>>>> I have observed issues running on linus/master from a few days back [1].
>>>> I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
>>>> by X forwarding the v4l2 test application qv4l2 over ssh and moving the
>>>> cursor around in the GUI (best test case description award...). I'm
>>>> sorry about the really bad way I trigger this, but I can't do it in any
>>>> other way; I'm happy to try other methods if you have some ideas. The
>>>> symptom of the issue is a complete hang of the system for more than 30
>>>> seconds, after which this information is printed on the console:
>>>
>>> Heiner (added to cc) also reported a similar problem on his platform: a
>>> dual-core Celeron.
>>>
>>> Can you confirm that your platform is a dual Cortex-A15? At least that's
>>> what I have seen on the web.
>>> This would confirm that a dual-core system is a key factor.
>>
>> I can confirm that my platform is dual-core.
>>
>>>
>>> The ssh connection is also common to Heiner's setup.
>>
>> Interesting, I found Heiner's mail and I can confirm that I too
>> experience ssh session lockups. I ssh into the system, and by repeatedly
>> hitting the return key I can lock up the board; while it is locked up,
>> starting another ssh session unblocks the first. If I don't start
>> another ssh session but keep hitting the return key sporadically in the
>> first one, I can get the trace I reported in my first mail to be printed
>> on the serial console.
>>
>> When locked up, the symptoms are that both the single ssh session and
>> the serial console are dead. But attempting another ssh connection
>> immediately unblocks both ssh and the serial console. And if I allow
>> enough time before starting the second ssh connection, I can trigger a
>> trace to be printed on the serial console; it's similar to but different
>> from the first one I reported.
>>
>> [  207.548610] 	1-...!: (0 ticks this GP) idle=79a/1/1073741824 softirq=2146/2146 fqs=0 
>> [  207.556442] 	(detected by 0, t=12645 jiffies, g=333, c=332, q=20)
>> [  207.562546] rcu_sched kthread starved for 12645 jiffies! g333 c332 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
>> [  207.572548] RCU grace-period kthread stack dump:
>>
>> [  207.577166] rcu_sched       R  running task        0     9      2 0x00000000
>> [  207.584389] Backtrace: 
>> [  207.586849] [<c0760450>] (__schedule) from [<c0760ba4>] (schedule+0x94/0xb8)
>> [  207.593901]  r10:e77813c0 r9:e77813c0 r8:ffffffff r7:e709bed4 r6:ffffaa80 r5:00000000
>> [  207.601732]  r4:ffffe000
>> [  207.604269] [<c0760b10>] (schedule) from [<c0764560>] (schedule_timeout+0x380/0x3dc)
>> [  207.612013]  r5:00000000 r4:00000000
>> [  207.615596] [<c07641e0>] (schedule_timeout) from [<c017b698>] (rcu_gp_kthread+0x668/0xe2c)
>> [  207.623863]  r10:c0b79018 r9:0000014d r8:0000014c r7:00000001 r6:00000000 r5:c0b10ad0
>> [  207.631693]  r4:c0b10980
>> [  207.634230] [<c017b030>] (rcu_gp_kthread) from [<c013ddd8>] (kthread+0x148/0x160)
>> [  207.641712]  r7:c0b10980
>> [  207.644249] [<c013dc90>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
>> [  207.651472] Exception stack(0xe709bfb0 to 0xe709bff8)
>> [  207.656527] bfa0:                                     00000000 00000000 00000000 00000000
>> [  207.664709] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> [  207.672890] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
>> [  207.679508]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c013dc90
>> [  207.687340]  r4:e7026f4
>>
>> Continuing the anecdotal testing: I can't seem to trigger the lockup if
>> I have ever had two ssh sessions open to the system. And about half the
>> time I can't trigger it at all, but after a reset of the system it
>> triggers after just opening an ssh session and hitting the return key
>> 2-5 times. Please take this part with a grain of salt, though, as it's
>> done by the monkey testing method :-)
>>
>> All tests above have been run based on c18bb396d3d261eb ("Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>>>
>>>>
> 
> [snip]
> 
>>>>
>>>> I'm a bit lost on how to progress with this issue and would appreciate
>>>> any help you can provide to help me figure this out.
>>>
>>> Can you send me your config?
>>>
>>> I'm going to prepare a debug patch to spy on what's happening when entering idle.
> 
> I'd like to narrow the problem down a bit more with the two patches below.
> Can you try them separately on top of c18bb396d3d261eb ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") and check
> whether one of them fixes the problem?
> 
> (They should apply on linux-next as well.)
> 
> The first patch always kicks an ilb instead of doing the ilb on the local
> CPU before entering idle:
> 
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..b21925b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  	 * candidate for ilb instead of waking up another idle CPU.
>  	 * Kick an normal ilb if we failed to do the update.
>  	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }
>  
> 
I tested both patches; with both of them the issue still occurs. However,
on top of linux-next from yesterday I have the impression that it happens
less frequently with the second patch.
On top of the commit you mentioned I don't see a change in system behavior
with either patch.

Regards, Heiner
