Message-ID: <alpine.DEB.2.02.1108151450190.30763@p34.internal.lan>
Date:	Mon, 15 Aug 2011 15:02:52 -0400 (EDT)
From:	Justin Piszcz <jpiszcz@...idpixels.com>
To:	Hugh Dickins <hughd@...gle.com>
cc:	linux-kernel@...r.kernel.org
Subject: Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110



On Mon, 15 Aug 2011, Hugh Dickins wrote:

> On Mon, 15 Aug 2011, Justin Piszcz wrote:
>> Hello,
>>
>> What causes this(?) -- am I out of memory(?) or is this a kernel bug?
>
> It would be a kernel bug to lock up even if you are out of memory.
This machine has 48GB of RAM, and it's just a Linux router with a few gqview
instances running.

>
> It does look like you're under memory pressure, but I don't see any OOM.
>
> Is this something you've noticed just once, or does it happen repeatedly?
This has happened once before. (I e-mailed LKML about it last weekend or
thereabouts, but nobody responded.)

It is here:
http://lkml.org/lkml/2011/8/12/54 (down?)
http://comments.gmane.org/gmane.linux.kernel/1178570

>
> Does it always hit somewhere in find_get_pages(), or does the loop span
> wider than that?
Per http://comments.gmane.org/gmane.linux.kernel/1178570, the earlier trace
(from August 12) was slightly different:

   [330509.718763] Call Trace:
   [330509.718771]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
   [330509.718776]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
   [330509.718784]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
   [330509.718788]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
   [330509.718793]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
   [330509.718798]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
   [330509.718803]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
   [330509.718807]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
   [330509.718811]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
   [330509.718818]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
   [330509.718822]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
   [330509.718825]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb

The first time it happened was while running a lot of I/O
(dumps and streams/backups over SSH).

>
> I'm answering out of interest in find_get_pages(): which does contain
> a number of gotos which could result in endless looping; except that
> they're all supposed to be for very transitory conditions which a
> second glance at the RCU-protected tree should correct.
I am using 'server' for the workload type, not 'low latency' -- which exposes
more bugs/problems.
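
For readers following along, the retry-by-goto pattern Hugh describes can be
illustrated with a small user-space sketch. This is not the kernel's
find_get_pages(); struct slot, read_refcount() and get_slot() are hypothetical
names invented for illustration, and the retry bound is an addition that the
pattern described above does not have. A slot whose refcount reads as zero is
treated as being in a transitory state, so the lookup jumps back and looks
again. If the slot never recovers (as with a corrupted node), only the bound
stops the loop.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page-cache slot; not a kernel structure. */
struct slot {
    int refcount;        /* 0 is treated as "in transition, look again" */
    int transient_left;  /* number of upcoming reads that still see 0 */
};

/* Simulate re-reading a slot that may briefly appear to have refcount 0. */
static int read_refcount(struct slot *s)
{
    if (s->transient_left > 0) {
        s->transient_left--;
        return 0;
    }
    return s->refcount;
}

/*
 * Try to take a reference on the slot, retrying while it looks transitory.
 * Returns true once a reference is taken, false after max_retries retries.
 * The bound is an assumption of this sketch: without it, a slot that
 * always reads 0 would make this loop spin forever.
 */
static bool get_slot(struct slot *s, unsigned int max_retries,
                     unsigned int *retries)
{
    *retries = 0;
repeat:
    if (read_refcount(s) == 0) {
        if (*retries == max_retries)
            return false;
        (*retries)++;
        goto repeat;
    }
    s->refcount++;
    return true;
}
```

Calling get_slot() on a slot initialised with refcount 1 and transient_left 2
succeeds after two retries, while a slot whose refcount is stuck at 0 only
returns because of the bound.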

>
> But if a radix_tree node got corrupted, then yes, it could loop forever.
>
> If it's repeatable, please try again with slab poisoning (and frame
> pointers) enabled?
I will enable frame pointers and wait for the next error/problem and report
back if/when it recurs, thanks!
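
For reference, a minimal sketch of the settings that request implies, assuming
the kernel is built with the SLUB allocator (an assumption; the thread does
not say which allocator is in use). With SLAB, the corresponding option would
be CONFIG_DEBUG_SLAB=y.

```
# .config fragment
CONFIG_FRAME_POINTER=y
CONFIG_SLUB_DEBUG=y

# kernel command line: turn on poisoning for all slab caches
slub_debug=P
```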

Justin.
