lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8f977bbb-d949-4e90-b3d2-b9815189b842@gmail.com>
Date: Thu, 14 Mar 2024 14:53:05 -0700
From: Florian Fainelli <f.fainelli@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>, Boqun Feng <boqun.feng@...il.com>
Cc: Frederic Weisbecker <frederic@...nel.org>,
 "Russell King (Oracle)" <linux@...linux.org.uk>,
 Joel Fernandes <joel@...lfernandes.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 linux-kernel@...r.kernel.org, kernel-team@...a.com, paulmck@...nel.org,
 mingo@...nel.org, rcu@...r.kernel.org, neeraj.upadhyay@....com,
 urezki@...il.com, qiang.zhang1211@...il.com, bigeasy@...utronix.de,
 chenzhongjin@...wei.com, yangjihong1@...wei.com, rostedt@...dmis.org,
 Justin Chen <justin.chen@...adcom.com>
Subject: Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for
 v6.9]

On 3/14/24 14:21, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
>> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>>
>>> this one is does include the tmigr events. Thanks!
>>
>> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
> 
> So from the above trace it's clear where it goes south:
> 
> [  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> 
> [  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
> 
> [  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
> 
> [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> 
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
> 
> This is the next trace entry where CPU0 magically comes back to life.
> 
> [  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84
> 
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but dos not fully
> explain the fail. I haven't yet spotted where this goes into lala land.

It was a lot harder to generate the same issue on cold boot against 
8ca1836769d758e4fbf5851bb81e181c52193f5d, however, just like against 
36e40df35d2c1891fe58241640c7c95de4aa739b, it would happen resuming from 
suspend to DRAM whereby the CPU core(s) lost their power and had to be 
re-initialized. Eventually I got a cold boot log:

https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e

Does the consistent ~159s mean anything?
-- 
Florian


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ