Date:	Thu, 09 Apr 2009 10:27:37 -0500
From:	Eric Sandeen <sandeen@...deen.net>
To:	Vegard Nossum <vegard.nossum@...il.com>
CC:	Fabio Coatti <cova@...rara.linux.it>, linux-kernel@...r.kernel.org,
	xfs@....sgi.com
Subject: Re: [BUG] spinlock lockup on CPU#0

Vegard Nossum wrote:
> 2009/3/30 Fabio Coatti <cova@...rara.linux.it>:
>> Hi all, I've got the following BUG: report on one of our servers running
>> 2.6.28.8; some background:
>> we are seeing several lockups in db (mysql) servers that show up as a sudden
>> load increase and then, very quickly, the server freezes. It happens in a
>> random way, sometimes after weeks, sometimes very quickly after a system
>> reboot. Trying to track down the problem, we installed the latest (at the time of
>> testing) 2.6.28.X kernel and loaded it with some heavy disk I/O operations (find,
>> dd, rsync and so on).
>> We have been able to crash a server with these tests; unfortunately we have
>> only been able to capture a remote screen snapshot, so I copied the data by hand
>> (hopefully without typos), and the result is the following:
> 
> Hi,
> 
> Thanks for the report.
> 
>>  [<ffffffff80213590>] ? default_idle+0x30/0x50
>>  [<ffffffff8021358e>] ? default_idle+0x2e/0x50
>>  [<ffffffff80213793>] ? c1e_idle+0x73/0x120
>>  [<ffffffff80259f11>] ? atomic_notifier_call_chain+0x11/0x20
>>  [<ffffffff8020a31f>] ? cpu_idle+0x3f/0x70
>> BUG: spinlock lockup on CPU#0, find/13114, ffff8801363d2c80
>> Pid: 13114, comm: find Tainted: G      D W  2.6.28.8 #5
>> Call Trace:
>>  [<ffffffff8041a02e>] _raw_spin_lock+0x14e/0x180
>>  [<ffffffff8060b691>] _spin_lock+0x51/0x70
>>  [<ffffffff80231ca4>] ? task_rq_lock+0x54/0xa0
>>  [<ffffffff80231ca4>] task_rq_lock+0x54/0xa0
>>  [<ffffffff80234501>] try_to_wake_up+0x91/0x280
>>  [<ffffffff80234720>] wake_up_process+0x10/0x20
>>  [<ffffffff803bf863>] xfsbufd_wakeup+0x53/0x70
>>  [<ffffffff802871e0>] shrink_slab+0x90/0x180
>>  [<ffffffff80287526>] try_to_free_pages+0x256/0x3a0
>>  [<ffffffff80285280>] ? isolate_pages_global+0x0/0x280
>>  [<ffffffff80281166>] __alloc_pages_internal+0x1b6/0x460
>>  [<ffffffff802a186d>] alloc_page_vma+0x6d/0x110
>>  [<ffffffff8028d3ab>] handle_mm_fault+0x4ab/0x790
>>  [<ffffffff80225293>] do_page_fault+0x463/0x870
>>  [<ffffffff8060b199>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>>  [<ffffffff8060bf52>] error_exit+0x0/0xa9
> 
> Seems like you hit this:

In _xfs_buf_lookup_pages?  That's not on the stack, and we didn't see
the printk below...

> /*
>  * This could deadlock.
>  *
>  * But until all the XFS lowlevel code is revamped to
>  * handle buffer allocation failures we can't do much.
>  */
> if (!(++retries % 100))
>         printk(KERN_ERR
>                         "XFS: possible memory allocation "
>                         "deadlock in %s (mode:0x%x)\n",
>                         __func__, gfp_mask);
> 
...

so I don't think so.  From the trace:

>>  [<ffffffff803bf863>] xfsbufd_wakeup+0x53/0x70
>>  [<ffffffff802871e0>] shrink_slab+0x90/0x180

this is the shrinker kicking off:

static struct shrinker xfs_buf_shake = {
        .shrink = xfsbufd_wakeup,
        .seeks = DEFAULT_SEEKS,
};


> ...so my guess is that you ran out of memory (and XFS simply can't
> handle it -- an error in the XFS code, of course).

Wrong guess, I think.  XFS has been called via the shrinker mechanism
to *free* memory, and we're not able to get the task rq lock in the
wakeup path, but I'm not sure why...
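
To make that path concrete, here is a rough sketch of what the shrinker
callback does (simplified from the 2.6.28-era xfs_buf.c, not a verbatim
copy; the structure and field names are only approximate):

/*
 * Called from shrink_slab() via xfs_buf_shake.shrink during direct
 * reclaim, as in the trace above.
 */
static int
xfsbufd_wakeup(int priority, gfp_t mask)
{
        xfs_buftarg_t   *btp;

        spin_lock(&xfs_buftarg_lock);
        list_for_each_entry(btp, &xfs_buftarg_list, bt_list) {
                if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
                        continue;
                set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
                /* wake the per-target xfsbufd thread */
                wake_up_process(btp->bt_task);
        }
        spin_unlock(&xfs_buftarg_lock);
        return 0;
}

wake_up_process() goes through try_to_wake_up(), which takes
task_rq_lock() on the xfsbufd task's runqueue - and that's the spinlock
the report shows CPU#0 spinning on.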

> My first tip, if you simply want your servers not to crash, is to
> switch to another filesystem. You could at least try it and see if it
> helps your problem -- that's the most straight-forward solution I can
> think of.
> 
>> The machine is a dual 2216HE (2-core) AMD with 4 GB of RAM; below you can find
>> the .config file (from /proc/config.gz).
>>
>> We have been seeing similar lockups (at least similar in their results) across several
>> kernel revisions (starting from 2.6.25.X) and on different hardware. Several
>> machines are hit by this, mostly database servers (maybe because of their specific
>> usage; the other machines are apache servers, I don't know).
>>
>> Could someone give us some hints about this issue, or at least some
>> suggestions on how to dig into it? Of course we can do any sort of testing
>> and trials.

If sysrq-t (show all running tasks) still works post-oops, capturing
that might help to show where the other threads are.  Hook up a serial
console to make capturing the output possible.

-Eric
