Date:	Fri, 27 Apr 2012 23:33:12 +0200
From:	"Rafael J. Wysocki" <rjw@...k.pl>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Sameer Nanda <snanda@...omium.org>, mingo@...hat.com,
	peterz@...radead.org, len.brown@...el.com, pavel@....cz,
	dzickus@...hat.com, msb@...omium.org, linux-kernel@...r.kernel.org,
	linux-pm@...r.kernel.org, olofj@...omium.org
Subject: Re: [PATCH] watchdog: fix for lockup detector breakage on resume

On Friday, April 27, 2012, Andrew Morton wrote:
> On Fri, 27 Apr 2012 11:10:40 -0700
> Sameer Nanda <snanda@...omium.org> wrote:
> 
> > On the suspend/resume path the boot CPU does not go through an
> > offline->online transition.  This breaks the NMI detector
> > post-resume since it depends on PMU state that is lost when
> > the system gets suspended.
> > 
> > Fix this by forcing a CPU offline->online transition for the
> > lockup detector on the boot CPU during resume.
> > 
> > Signed-off-by: Sameer Nanda <snanda@...omium.org>
> > ---
> > To provide more context, we enable NMI watchdog on
> > Chrome OS.  We have seen several reports of systems freezing
> > up completely which indicated that the NMI watchdog was not
> > firing for some reason.
> > 
> > Debugging further, we found a simple way of repro'ing system
> > freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
> > after the system has been suspended/resumed one or more times.
> > 
> > With this patch in place, the system freezes result in panics,
> > as expected.  These panics provide a nice stack trace for us
> > to debug the actual issue causing the freeze.
> > 
> > ...
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
> >  				  size_t *lenp, loff_t *ppos);
> >  extern unsigned int  softlockup_panic;
> >  void lockup_detector_init(void);
> > +void lockup_detector_bootcpu_resume(void);
> >  #else
> >  static inline void touch_softlockup_watchdog(void)
> >  {
> > @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
> >  static inline void lockup_detector_init(void)
> >  {
> >  }
> > +static inline void lockup_detector_bootcpu_resume(void)
> > +{
> > +}
> >  #endif
> >  
> >  #ifdef CONFIG_DETECT_HUNG_TASK
> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> > index 396d262..0d262a8 100644
> > --- a/kernel/power/suspend.c
> > +++ b/kernel/power/suspend.c
> > @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
> >  	arch_suspend_enable_irqs();
> >  	BUG_ON(irqs_disabled());
> >  
> > +	/* Kick the lockup detector */
> > +	lockup_detector_bootcpu_resume();
> > +
> >   Enable_cpus:
> >  	enable_nonboot_cpus();
> >  
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index df30ee0..dd2ac93 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
> >  	.notifier_call = cpu_callback
> >  };
> >  
> > +void lockup_detector_bootcpu_resume(void)
> > +{
> > +	void *cpu = (void *)(long)smp_processor_id();
> > +
> > +	/*
> > +	 * On the suspend/resume path the boot CPU does not go though the
> > +	 * offline->online transition. This breaks the NMI detector post
> > +	 * resume. Force an offline->online transition for the boot CPU on
> > +	 * resume.
> > +	 */
> > +	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> > +	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> > +
> > +	return;
> > +}
> 
> I have issues with the comment ;) It describes some old bug which isn't
> there any more and which nobody cares about.  A better comment would
> simply describe the function in the usual fashion.  Something like
> this:
> 
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix
> +++ a/kernel/watchdog.c
> @@ -597,20 +597,17 @@ static struct notifier_block __cpuinitda
>  	.notifier_call = cpu_callback
>  };
>  
> +/*
> + * On entry to suspend we force an offline->online transition on the boot CPU so
> + * that PMU state is available to that CPU when it comes back online after
> + * resume.  This information is required for restarting the NMI watchdog.
> + */
>  void lockup_detector_bootcpu_resume(void)
>  {
>  	void *cpu = (void *)(long)smp_processor_id();
>  
> -	/*
> -	 * On the suspend/resume path the boot CPU does not go though the
> -	 * offline->online transition. This breaks the NMI detector post
> -	 * resume. Force an offline->online transition for the boot CPU on
> -	 * resume.
> -	 */
>  	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> -
> -	return;
>  }
>  
>  void __init lockup_detector_init(void)
> _
> 
> 
> But I'm not sure how accurate it is.  Is it true that the PMU data was
> required for starting the NMI hardware?
> 
> 
> Also, this is all dead code if CONFIG_SUSPEND=n, so how about

FWIW, looks good to me.

Thanks,
Rafael


> --- a/include/linux/sched.h~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/include/linux/sched.h
> @@ -317,7 +317,6 @@ extern int proc_dowatchdog_thresh(struct
>  				  size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> -void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -331,6 +330,11 @@ static inline void touch_all_softlockup_
>  static inline void lockup_detector_init(void)
>  {
>  }
> +#endif
> +
> +#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
> +void lockup_detector_bootcpu_resume(void);
> +#else
>  static inline void lockup_detector_bootcpu_resume(void)
>  {
>  }
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/kernel/watchdog.c
> @@ -597,6 +597,7 @@ static struct notifier_block __cpuinitda
>  	.notifier_call = cpu_callback
>  };
>  
> +#ifdef CONFIG_SUSPEND
>  /*
>   * On entry to suspend we force an offline->online transition on the boot CPU so
>   * that PMU state is available to that CPU when it comes back online after
> @@ -609,6 +610,7 @@ void lockup_detector_bootcpu_resume(void
>  	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>  }
> +#endif
>  
>  void __init lockup_detector_init(void)
>  {
> _
> 
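
The header change quoted above follows a common pattern: declare the real function only when every feature it depends on is configured, and otherwise supply an empty static inline stub so callers need no #ifdef of their own. A minimal user-space sketch of that pattern follows. DEMO_LOCKUP_DETECTOR, DEMO_SUSPEND, demo_bootcpu_resume() and demo_resume_path() are hypothetical names invented for illustration; they are not kernel symbols.

```c
/* Hypothetical feature switches standing in for the two Kconfig options.
 * Remove either define and the empty stub below is compiled instead. */
#define DEMO_LOCKUP_DETECTOR 1
#define DEMO_SUSPEND 1

/* Counts how many times the real (non-stub) function ran. */
static int demo_kick_count;

#if defined(DEMO_LOCKUP_DETECTOR) && defined(DEMO_SUSPEND)
/* Real implementation: only built when both features are enabled. */
static void demo_bootcpu_resume(void)
{
	demo_kick_count++;
}
#else
/* Stub: compiles to nothing, so the caller stays free of #ifdefs. */
static inline void demo_bootcpu_resume(void)
{
}
#endif

/* The caller looks the same in every configuration. */
static int demo_resume_path(void)
{
	demo_bootcpu_resume();
	return demo_kick_count;
}
```

With both switches defined, each call to demo_resume_path() bumps the counter. With either one removed, the call site still compiles and the counter stays at zero.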
> 
> 
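
For readers who want to see why replaying the two notifier events helps, here is a rough user-space model of the behaviour described in the changelog. It assumes, purely for illustration, that the armed state of the watchdog is a single flag per CPU, that suspend clears it everywhere, and that only non-boot CPUs receive an online event on resume. Every sim_* name is hypothetical and none of this is the kernel's actual code.

```c
#include <assert.h>

#define SIM_NR_CPUS 4

/* Hypothetical stand-ins for the CPU hotplug notifier actions. */
enum sim_cpu_action { SIM_CPU_ONLINE, SIM_CPU_DEAD };

/* 1 while the simulated PMU counter behind the NMI watchdog is armed. */
static int sim_pmu_armed[SIM_NR_CPUS];

/* Plays the role of cpu_callback(): arm on online, disarm on dead. */
static void sim_cpu_callback(enum sim_cpu_action action, int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	switch (action) {
	case SIM_CPU_ONLINE:
		sim_pmu_armed[cpu] = 1;
		break;
	case SIM_CPU_DEAD:
		sim_pmu_armed[cpu] = 0;
		break;
	}
}

/* Assumption: suspend loses the PMU state on every CPU. */
static void sim_suspend(void)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_pmu_armed[cpu] = 0;
}

/*
 * Non-boot CPUs (1..N-1) always get an online event on resume.
 * CPU 0, the boot CPU, gets a forced dead->online replay only when
 * kick_boot_cpu is nonzero, mirroring what the patch adds.
 */
static void sim_resume(int kick_boot_cpu)
{
	if (kick_boot_cpu) {
		sim_cpu_callback(SIM_CPU_DEAD, 0);
		sim_cpu_callback(SIM_CPU_ONLINE, 0);
	}
	for (int cpu = 1; cpu < SIM_NR_CPUS; cpu++)
		sim_cpu_callback(SIM_CPU_ONLINE, cpu);
}
```

In this model, sim_suspend() followed by sim_resume(0) leaves CPU 0 disarmed while the other CPUs are re-armed, and sim_resume(1) re-arms CPU 0 as well.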

