linux-kernel - Re: kernel BUG at mm/vmscan.c:1114

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJn8CcGGsdPdaJ7t_RcBmFOGgVLVjAP8Mr40Cv=FknLTNgBUsg@mail.gmail.com>
Date:	Thu, 4 Aug 2011 11:54:24 +0800
From:	Xiaotian Feng <xtfeng@...il.com>
To:	Mel Gorman <mgorman@...e.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: kernel BUG at mm/vmscan.c:1114

On Wed, Aug 3, 2011 at 4:54 PM, Mel Gorman <mgorman@...e.de> wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0800, Xiaotian Feng wrote:
>> On Tue, Aug 2, 2011 at 3:22 PM, Andrew Morton <akpm@...ux-foundation.org> wrote:
>> > On Tue, 2 Aug 2011 15:09:57 +0800 Xiaotian Feng <xtfeng@...il.com> wrote:
>> >
>> >> __ __I'm hitting the kernel BUG at mm/vmscan.c:1114 twice, each time I
>> >> was trying to build my kernel. The photo of crash screen and my config
>> >> is attached.
>> >
>> > hm, now why has that started happening?
>> >
>> > Perhaps you could apply this debug patch, see if we can narrow it down?
>> >
>>
>> I will try it then, but it isn't very reproducible :(
>> But my system hung after some list corruption warnings... I hit the
>> corruption 4 times...
>>
>
> That is very unexpected but if lists are being corrupted, it could
> explain the previously reported bug as that bug looked like an active
> page on an inactive list.
>
> What was the last working kernel? Can you bisect?
>
>>  [ 1220.468089] ------------[ cut here ]------------
>>  [ 1220.468099] WARNING: at lib/list_debug.c:56 __list_del_entry+0x82/0xd0()
>>  [ 1220.468102] Hardware name: 42424XC
>>  [ 1220.468104] list_del corruption. next->prev should be
>> ffffea0000e069a0, but was ffff880100216c78
>>  [ 1220.468106] Modules linked in: ip6table_filter ip6_tables
>> ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>> xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp
>> iptable_filter ip_tables x_tables binfmt_misc bridge stp parport_pc
>> ppdev snd_hda_codec_conexant snd_hda_intel snd_hda_codec thinkpad_acpi
>> snd_hwdep snd_pcm i915 snd_seq_midi snd_rawmidi arc4 cryptd
>> snd_seq_midi_event aes_x86_64 snd_seq drm_kms_helper iwlagn snd_timer
>> aes_generic drm snd_seq_device mac80211 psmouse uvcvideo videodev snd
>> v4l2_compat_ioctl32 soundcore snd_page_alloc serio_raw i2c_algo_bit
>> btusb tpm_tis tpm tpm_bios video cfg80211 bluetooth nvram lp joydev
>> parport usbhid hid ahci libahci firewire_ohci firewire_core e1000e
>> sdhci_pci sdhci crc_itu_t
>>  [ 1220.468185] Pid: 1168, comm: Xorg Tainted: G        W   3.0.0+ #23
>>  [ 1220.468188] Call Trace:
>>  [ 1220.468190]  <IRQ>  [<ffffffff8106db3f>] warn_slowpath_common+0x7f/0xc0
>>  [ 1220.468201]  [<ffffffff8106dc36>] warn_slowpath_fmt+0x46/0x50
>>  [ 1220.468206]  [<ffffffff81332a52>] __list_del_entry+0x82/0xd0
>>  [ 1220.468210]  [<ffffffff81332ab1>] list_del+0x11/0x40
>>  [ 1220.468216]  [<ffffffff8117a212>] __slab_free+0x362/0x3d0
>>  [ 1220.468222]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468226]  [<ffffffff8117b767>] ? kmem_cache_free+0x97/0x220
>>  [ 1220.468230]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468234]  [<ffffffff811c6606>] ? bvec_free_bs+0x26/0x40
>>  [ 1220.468239]  [<ffffffff8117b8df>] kmem_cache_free+0x20f/0x220
>>  [ 1220.468243]  [<ffffffff811c6606>] bvec_free_bs+0x26/0x40
>>  [ 1220.468247]  [<ffffffff811c6654>] bio_free+0x34/0x70
>>  [ 1220.468250]  [<ffffffff811c66a5>] bio_fs_de
>>
>

I'm hitting this again today, when I'm trying to rebuild my kernel....
Looking it a bit

 list_del corruption. next->prev should be ffffea0000e069a0, but was
ffff880100216c78

I find something interesting from my syslog:

 PERCPU: Embedded 28 pages/cpu @ffff880100200000 s83456 r8192 d23040 u262144

> This warning and the page reclaim warning are on paths that are
> commonly used and I would expect to see multiple reports. I wonder
> what is happening on your machine that is so unusual.
>
> Have you run memtest on this machine for a few hours and badblocks
> on the disk to ensure this is not hardware trouble?
>
>> So is it possible that my previous BUG is triggered by slab list corruption?
>
> Not directly, but clearly there is something very wrong.
>
> If slub corruption reports are very common and kernel 3.0 is fine, my
> strongest candidate for the corruption would be the SLUB lockless
> patches. Try
>
> git diff e4a46182e1bcc2ddacff5a35f6b52398b51f1b11..9e577e8b46ab0c38970c0f0cd7eae62e6dffddee | patch -p1 -R
>

I will try it now, thanks.

> They should revert cleanly with offsets.
>
> --
> Mel Gorman
> SUSE Labs
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/