Message-ID: <268870537.725801221819286330.JavaMail.root@mail.vpac.org>
Date: Fri, 19 Sep 2008 20:14:46 +1000 (EST)
From: Brett Pemberton <brett@...c.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org
Subject: Re: BUG: soft lockup in 2.6.25.5
----- "Andrew Morton" <akpm@...ux-foundation.org> wrote:
> On Fri, 19 Sep 2008 13:49:16 +1000 Brett Pemberton <brett@...c.org>
> wrote:
>
> > I'm getting about 3-5 machines in a cluster of 95 hanging with
> >
> > BUG: soft lockup - CPU#7 stuck for 61s! [pdflush:321]
> >
> > per week. Nothing in common each time, different users running
> > different jobs on different nodes.
> >
> > The most recent is at the end of this email, .config is attached.
> >
> > Googling is scary. Many people reporting these, but never any
> response.
> > It's happening on enough separate nodes that I can't believe it's
> > hardware, although they are identical machines:
> >
> > - 2x Quad-Core AMD Opteron(tm) Processor 2356
> > - 32gb ram
> > - 4 x sata drives
> >
> > Running CentOS 5.2 with a kernel.org kernel
> > Has been happening with a variety of kernels from 2.6.25 - present.
>
> Yes, it's a false positive. With a lot of memory and a random-access
> or lot-of-files writing behaviour, it can take tremendous amounts of
> time to get everything stored on the disk.
>
> Not sure what to do about it, really. Perhaps touch the softlockup
> detector somewhere in the writeback code.
>
> > I'd love any advice on where to turn to next and what avenues to
> pursue.
>
> Set /proc/sys/kernel/softlockup_thresh to zero to shut it up :(
>
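(A rough, untested sketch of the "touch the softlockup detector somewhere in
the writeback code" idea above, against a 2.6.25-era tree.  The function name
background_writeout_sketch() and SKETCH_WRITEBACK_PAGES are illustrative
stand-ins, not the real symbols in mm/page-writeback.c; only
touch_softlockup_watchdog() and writeback_inodes() are actual kernel APIs.)

#include <linux/sched.h>        /* touch_softlockup_watchdog() */
#include <linux/writeback.h>    /* struct writeback_control, writeback_inodes() */

#define SKETCH_WRITEBACK_PAGES  1024    /* pages written back per chunk */

static void background_writeout_sketch(unsigned long _min_pages)
{
        long min_pages = _min_pages;
        struct writeback_control wbc = {
                .sync_mode    = WB_SYNC_NONE,
                .range_cyclic = 1,
        };

        while (min_pages > 0) {
                wbc.nr_to_write = SKETCH_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                min_pages -= SKETCH_WRITEBACK_PAGES - wbc.nr_to_write;

                /* Long but legitimate writeback: tell the soft lockup
                 * detector we are still making forward progress. */
                touch_softlockup_watchdog();

                if (wbc.nr_to_write > 0)
                        break;  /* ran out of dirty pages or hit congestion */
        }
}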
Hmm,
I'd love to believe it's a false positive, but I should have mentioned that
once a machine hits one of these, its load climbs steadily until it falls
over a few hours or days later.
When I've noticed in time to log in and run top before the machine falls
over, I can see that one or two cores have their processes stuck in a wait
state while the others carry on as normal.
This is consistent: I've never had a node hit this BUG and not eventually
die within a week (it loses all network connectivity and sits at the console
waiting for a login without registering keystrokes). The node today locked
up around two hours after logging this BUG message.
Surely turning off detection via the proc file will just mean this happens
silently in the future?
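(For reference, an untested C one-off equivalent to that suggested write,
i.e. the same as `echo 0 > /proc/sys/kernel/softlockup_thresh`, in case it's
useful to anyone:)

#include <stdio.h>

int main(void)
{
        /* Write 0 to the threshold to silence the soft lockup detector. */
        FILE *f = fopen("/proc/sys/kernel/softlockup_thresh", "w");

        if (!f) {
                perror("softlockup_thresh");
                return 1;
        }
        fputs("0\n", f);
        return fclose(f) ? 1 : 0;
}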
cheers,
/ Brett