linux-kernel - Re: [PATCH] watchdog: fix for lockup detector breakage on resume

Message-ID: <CANMivWatpcxEern4B6ekxn2K1zvZVsF-WWq7nY5omq=91OYBYQ@mail.gmail.com>
Date:	Mon, 30 Apr 2012 14:10:54 -0700
From:	Sameer Nanda <snanda@...omium.org>
To:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
Cc:	mingo@...hat.com, peterz@...radead.org, len.brown@...el.com,
	pavel@....cz, rjw@...k.pl, akpm@...ux-foundation.org,
	dzickus@...hat.com, msb@...omium.org, linux-kernel@...r.kernel.org,
	linux-pm@...r.kernel.org, olofj@...omium.org
Subject: Re: [PATCH] watchdog: fix for lockup detector breakage on resume

On Sun, Apr 29, 2012 at 11:12 PM, Srivatsa S. Bhat
<srivatsa.bhat@...ux.vnet.ibm.com> wrote:
> On 04/27/2012 11:40 PM, Sameer Nanda wrote:
>
>> On the suspend/resume path the boot CPU does not go through an
>> offline->online transition.  This breaks the NMI detector
>> post-resume since it depends on PMU state that is lost when
>> the system gets suspended.
>>
>> Fix this by forcing a CPU offline->online transition for the
>> lockup detector on the boot CPU during resume.
>>
>> Signed-off-by: Sameer Nanda <snanda@...omium.org>
>> ---
>> To provide more context, we enable NMI watchdog on
>> Chrome OS.  We have seen several reports of systems freezing
>> up completely which indicated that the NMI watchdog was not
>> firing for some reason.
>>
>> Debugging further, we found a simple way of repro'ing system
>> freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
>> after the system has been suspended/resumed one or more times.
>>
>> With this patch in place, the system freezes result in panics,
>> as expected.  These panics provide a nice stack trace for us
>> to debug the actual issue causing the freeze.
>>
>>
>>  include/linux/sched.h  |    4 ++++
>>  kernel/power/suspend.c |    3 +++
>>  kernel/watchdog.c      |   16 ++++++++++++++++
>>  3 files changed, 23 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 81a173c..118cc38 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>                                 size_t *lenp, loff_t *ppos);
>>  extern unsigned int  softlockup_panic;
>>  void lockup_detector_init(void);
>> +void lockup_detector_bootcpu_resume(void);
>>  #else
>>  static inline void touch_softlockup_watchdog(void)
>>  {
>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>  static inline void lockup_detector_init(void)
>>  {
>>  }
>> +static inline void lockup_detector_bootcpu_resume(void)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_DETECT_HUNG_TASK
>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> index 396d262..0d262a8 100644
>> --- a/kernel/power/suspend.c
>> +++ b/kernel/power/suspend.c
>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>       arch_suspend_enable_irqs();
>>       BUG_ON(irqs_disabled());
>>
>> +     /* Kick the lockup detector */
>> +     lockup_detector_bootcpu_resume();
>> +
>>   Enable_cpus:
>>       enable_nonboot_cpus();
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index df30ee0..dd2ac93 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>       .notifier_call = cpu_callback
>>  };
>>
>> +void lockup_detector_bootcpu_resume(void)
>> +{
>> +     void *cpu = (void *)(long)smp_processor_id();
>> +
>> +     /*
>> +      * On the suspend/resume path the boot CPU does not go through the
>> +      * offline->online transition. This breaks the NMI detector post
>> +      * resume. Force an offline->online transition for the boot CPU on
>> +      * resume.
>> +      */
>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>> +
>
>
> I have a couple of comments about this:
>
> 1. Strictly speaking, we should be using the _FROZEN variants here (since the
> tasks are still frozen).
>
> Like, cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
> and   cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);
>
> Right now, since the same action is taken for either variant (i.e., with or without
> _FROZEN), it really doesn't matter. But still, good to be on the safer side, no?

Agreed that the _FROZEN counterparts are a better fit here since the
tasks are still frozen.  Let me make this change.

>
> 2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?

Mainly because hrtimer_init() has already been done at kernel init
time.  But this seems to be a good idea: the non-boot CPUs do
transition through the CPU_UP_PREPARE_FROZEN phase on the way up
during resume, so it makes sense to keep the boot CPU path symmetrical.

Let me make this change also.
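
To make the intended ordering concrete, here is a rough userspace
sketch of the sequence discussed above: CPU_DEAD_FROZEN, then
CPU_UP_PREPARE_FROZEN, then CPU_ONLINE_FROZEN.  This is a mock, not the
revised patch.  The names mock_cpu_callback(), mock_bootcpu_resume(),
record() and the wd_step enum are made up for illustration, and the
action values are assumed to mirror the kernel's notifier constants.
The real cpu_callback() in kernel/watchdog.c takes a notifier block
and a CPU pointer as well.

```c
/* Userspace mock of the resume sequence; NOT kernel code. */
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror the kernel's CPU notifier action values. */
#define CPU_ONLINE            0x0002
#define CPU_UP_PREPARE        0x0003
#define CPU_DEAD              0x0007
#define CPU_TASKS_FROZEN      0x0010
#define CPU_ONLINE_FROZEN     (CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_UP_PREPARE_FROZEN (CPU_UP_PREPARE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN       (CPU_DEAD | CPU_TASKS_FROZEN)

/* Hypothetical stand-ins for what the watchdog does at each stage. */
enum wd_step { WD_PREPARE, WD_ENABLE, WD_DISABLE };

#define MAX_STEPS 8
static enum wd_step steps[MAX_STEPS];
static size_t nsteps;

/* Append one step to the log so the call order can be inspected. */
static void record(enum wd_step s)
{
	assert(nsteps < MAX_STEPS);
	steps[nsteps++] = s;
}

/* Plain and _FROZEN variants share a case, so they act identically. */
static int mock_cpu_callback(unsigned long action)
{
	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_UP_PREPARE_FROZEN:
		record(WD_PREPARE);
		break;
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		record(WD_ENABLE);
		break;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		record(WD_DISABLE);
		break;
	default:
		break;
	}
	return 0;
}

/* Force offline -> prepare -> online for the boot CPU on resume. */
static void mock_bootcpu_resume(void)
{
	mock_cpu_callback(CPU_DEAD_FROZEN);
	mock_cpu_callback(CPU_UP_PREPARE_FROZEN);
	mock_cpu_callback(CPU_ONLINE_FROZEN);
}
```

Calling mock_bootcpu_resume() once logs a disable, a prepare and an
enable step, in that order, which is the same path the non-boot CPUs
take on the way up during resume.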

>
> 3. How about hibernation? We don't hit this problem there?

I am not too familiar with the hibernation path and don't have a setup
to test it either, so I can't really answer this one.

>
>> +     return;
>> +}
>> +
>>  void __init lockup_detector_init(void)
>>  {
>>       void *cpu = (void *)(long)smp_processor_id();
>
>
>
> Regards,
> Srivatsa S. Bhat
>



-- 
Sameer