Message-ID: <268870537.725801221819286330.JavaMail.root@mail.vpac.org>
Date: Fri, 19 Sep 2008 20:14:46 +1000 (EST)
From: Brett Pemberton <brett@...c.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org
Subject: Re: BUG: soft lockup in 2.6.25.5
----- "Andrew Morton" <akpm@...ux-foundation.org> wrote:
> On Fri, 19 Sep 2008 13:49:16 +1000 Brett Pemberton <brett@...c.org>
> wrote:
>
> > I'm getting about 3-5 machines in a cluster of 95 hanging with
> >
> > BUG: soft lockup - CPU#7 stuck for 61s! [pdflush:321]
> >
> > per week. Nothing in common each time, different users running
> > different jobs on different nodes.
> >
> > The most recent is at the end of this email, .config is attached.
> >
> > Googling is scary. Many people reporting these, but never any
> response.
> > It's happening on enough separate nodes that I can't believe it's
> > hardware, although they are identical machines:
> >
> > - 2x Quad-Core AMD Opteron(tm) Processor 2356
> > - 32gb ram
> > - 4 x sata drives
> >
> > Running CentOS 5.2 with a kernel.org kernel
> > Has been happening with a variety of kernels from 2.6.25 - present.
>
> Yes, it's a false positive. With a lot of memory and a random-access
> or lot-of-files writing behaviour, it can take tremendous amounts of
> time to get everything stored on the disk.
>
> Not sure what to do about it, really. Perhaps touch the softlockup
> detector somewhere in the writeback code.
>
> > I'd love any advice on where to turn to next and what avenues to
> pursue.
>
> Set /proc/sys/kernel/softlockup_thresh to zero to shut it up :(
>
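(A rough, untested sketch of the "touch the softlockup detector somewhere in
the writeback code" idea above, against a 2.6.25-era tree.  The function name
background_writeout_sketch() and SKETCH_WRITEBACK_PAGES are illustrative
stand-ins, not the real symbols in mm/page-writeback.c; only
touch_softlockup_watchdog() and writeback_inodes() are actual kernel APIs.)

#include <linux/sched.h>        /* touch_softlockup_watchdog() */
#include <linux/writeback.h>    /* struct writeback_control, writeback_inodes() */

#define SKETCH_WRITEBACK_PAGES  1024    /* pages written back per chunk */

static void background_writeout_sketch(unsigned long _min_pages)
{
        long min_pages = _min_pages;
        struct writeback_control wbc = {
                .sync_mode    = WB_SYNC_NONE,
                .range_cyclic = 1,
        };

        while (min_pages > 0) {
                wbc.nr_to_write = SKETCH_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                min_pages -= SKETCH_WRITEBACK_PAGES - wbc.nr_to_write;

                /* Long but legitimate writeback: tell the soft lockup
                 * detector we are still making forward progress. */
                touch_softlockup_watchdog();

                if (wbc.nr_to_write > 0)
                        break;  /* ran out of dirty pages or hit congestion */
        }
}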
Hmm,
I'd love to believe it's a false positive, but I should have mentioned that
once a machine hits one of these, its load climbs steadily until it falls
over a few hours or days later.
When I've noticed in time to log in and run top before the machine falls
over, I can see that one or two cores have their processes stuck in a wait
state while the others carry on as normal.
This is consistent: I've never had a node hit this BUG and not eventually
die within a week (it loses all network connectivity and sits at the console
waiting for a login without registering keystrokes). The node today locked
up around two hours after logging this BUG message.
Surely turning off detection via the proc file will just mean this happens
silently in the future?
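(For reference, an untested C one-off equivalent to that suggested write,
i.e. the same as `echo 0 > /proc/sys/kernel/softlockup_thresh`, in case it's
useful to anyone:)

#include <stdio.h>

int main(void)
{
        /* Write 0 to the threshold to silence the soft lockup detector. */
        FILE *f = fopen("/proc/sys/kernel/softlockup_thresh", "w");

        if (!f) {
                perror("softlockup_thresh");
                return 1;
        }
        fputs("0\n", f);
        return fclose(f) ? 1 : 0;
}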
cheers,
/ Brett