linux-kernel - Re: Processes spinning forever, apparently in lock_timer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20070809095534.25ae1c42.akpm@linux-foundation.org>
Date:	Thu, 9 Aug 2007 09:55:34 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Matthias Hensler <matthias@...se.de>
Cc:	Chuck Ebbert <cebbert@...hat.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	richard kennedy <richard@....demon.co.uk>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: Processes spinning forever, apparently in lock_timer_base()?

On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <matthias@...se.de> wrote:

> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> > On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> > [...]
> > I am also willing to try the patch posted by Richard.
> 
> I want to give some update here:
> 
> 1. We finally hit the problem on a third system, with a total different
>    setup and hardware. However, again high I/O load caused the problem
>    and the affected filesystems were mounted with noatime.
> 
> 2. I installed a recompiled kernel with just the two line patch from
>    Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>    days uptime now and counting. I believe the patch fixed the problem.
>    However, I will continue running "vmstat 1" and the endless loop of
>    "cat /proc/meminfo", just in case I am wrong.
> 

Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?

If Richard's patch has indeed fixed it then this confirms that we're seeing
contention over the dirty-memory limits.  Richard's patch isn't really the
right one because it allows unlimited dirty-memory windup in some situations
(large number of disks with small writes, or when we perform queue congestion
avoidance).

As you're seeing this happening when multiple disks are being written to it is
possible that the per-device-dirty-threshold patches which recently went into
-mm (and which appear to have a bug) will fix it.

But I worry that the stall appears to persist *forever*.  That would indicate
that we have a dirty-memory accounting leak, or that for some reason the
system has decided to stop doing writeback to one or more queues (might be
caused by an error in a lower-level driver's queue congestion state management).

If it is the latter, then it could be that running "sync" will clear the
problem.  Temporarily, at least.  Because sync will ignore the queue congestion
state.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/