linux-kernel - Re: [PATCH] watchdog: Make sure the watchdog thread gets CPU on loaded system


Date:	Thu, 15 Mar 2012 10:04:34 -0700
From:	Mandeep Singh Baines <msb@...omium.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Mandeep Singh Baines <msb@...omium.org>,
	Don Zickus <dzickus@...hat.com>, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Michal Hocko <mhocko@...e.cz>
Subject: Re: [PATCH] watchdog: Make sure the watchdog thread gets CPU on
 loaded system

Peter Zijlstra (peterz@...radead.org) wrote:
> On Thu, 2012-03-15 at 17:11 +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-15 at 17:10 +0100, Peter Zijlstra wrote:
> > > On Thu, 2012-03-15 at 08:39 -0700, Mandeep Singh Baines wrote:
> > > > Its a good tool for catching problems of scale. As we move to more and
> > > > more cores you'll uncover bugs where data structures start to blow up.
> > > > Hash tables get huge, when you have 100000s of processes or millions
> > > > of
> > > > TCP flows, or cgroups or namespace. That critical section (spinlock,
> > > > spinlock_bh, or preempt_disable) that used to be OK might no longer
> > > > be.
> > > 
> > > Or you run with the preempt latency tracer.
> > 
> > Or for that matter run cyclictest...
> 
> Thing is, if you want a latency detector, call it that and stop
> pretending its a useful debug feature. Also, if you want that, set the
> interval in the 0.1-0.5 seconds range and dump stack on every new max.
> 
> 

But the preempt latency tracer has non-negligible overhead, while the
softlockup detector's overhead is negligible. Softlockup is a great tool
for detecting temporary, long-duration lockups that can occur when data
structures blow up. Because of the overhead, you probably wouldn't enable
preempt latency tracing in production. If the problem is happening often
enough, you might temporarily turn it on for a few machines to get a stack
trace. But you might not have the luxury of being able to do that.

I can't predict what my users are going to do. They will do things I never
expected. So I can't test these cases in the lab, which rules out the
preempt latency tracer. With softlockup, I can find out about problems I
never even knew I had.

In addition, softlockup is also a great tool for finding permanent lockups.

One idea for reducing preempt latency tracer overhead would be to use the
same approach that the HW counters use. Instead of examining every preempt
enable/disable, only examine 1 in 1000 (or some configurable number).
That way you could turn it on in production. Maybe a simple per_cpu
counter.
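
A rough userspace sketch of that 1-in-N idea, just to make it concrete.
Nothing here is existing kernel code: NR_CPUS_SKETCH, struct sample_state,
sample_init() and should_sample() are hypothetical names, and a plain
array indexed by CPU stands in for what would presumably be a real per-CPU
variable in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical; stands in for the real CPU count */

/* One countdown per CPU, so the hot path never touches shared state. */
struct sample_state {
	uint32_t countdown;	/* events left until the next sampled one */
	uint32_t interval;	/* sample 1 event in 'interval' */
};

static struct sample_state sample_state[NR_CPUS_SKETCH];

/* Set the same sampling interval on every CPU. */
static void sample_init(uint32_t interval)
{
	assert(interval > 0);
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		sample_state[cpu].interval = interval;
		sample_state[cpu].countdown = interval;
	}
}

/*
 * Called on each preempt-disable event (assumption: the caller already
 * knows which CPU it is on). Returns true on every interval-th call for
 * that CPU; only then would the expensive latency tracking be armed.
 */
static bool should_sample(int cpu)
{
	struct sample_state *s = &sample_state[cpu];

	if (--s->countdown != 0)
		return false;	/* cheap path: one decrement and one compare */

	s->countdown = s->interval;
	return true;
}
```

The common case costs a decrement and a branch. The trade-off is that a
rare long critical section only gets caught if it happens to land on a
sampled event, so this suits problems that recur, not one-offs.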

Regards,
Mandeep
