linux-kernel - Re: bio linked list corruption.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20161026225421.GA15247@clm-mbp.thefacebook.com>
Date:   Wed, 26 Oct 2016 18:54:21 -0400
From:   Chris Mason <clm@...com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
CC:     Dave Jones <davej@...emonkey.org.uk>,
        Andy Lutomirski <luto@...capital.net>,
        Andy Lutomirski <luto@...nel.org>, Jens Axboe <axboe@...com>,
        Al Viro <viro@...iv.linux.org.uk>, Josef Bacik <jbacik@...com>,
        David Sterba <dsterba@...e.com>,
        linux-btrfs <linux-btrfs@...r.kernel.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>,
        Dave Chinner <david@...morbit.com>
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote:
>On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@...com> wrote:
>>
>> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
>> ran dbench 2048:
>>
>> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
>> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
>> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
>> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
>> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
>> [ 2759.125077]  ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
>> [ 2759.125077]  ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
>> [ 2759.127444]  ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
>> [ 2759.127444] Call Trace:
>> [ 2759.127444]  [<ffffffff814fe4ff>] dump_stack+0x53/0x74
>> [ 2759.127444]  [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
>> [ 2759.127444]  [<ffffffff81064bbf>] __warn+0xff/0x120
>> [ 2759.127444]  [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
>> [ 2759.127444]  [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
>> [ 2759.127444]  [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
>
>Ok, that's definitely the same one that Dave started out seeing.
>
>The fact that it is that reliable - two different machines, two very
>different loads (dbench looks nothing like trinity) really makes me
>think that maybe the problem really is in the block plugging after
>all.
>
>It very much does not smell like random stack corruption. It's simply
>not random enough.

Agreed.  I'd feel better if I could trigger this outside of btrfs, even 
though Dave Chinner hit something very similar on xfs.  I'll peel off 
another test machine for a long XFS run.

>
>And I just noticed something: I originally thought that this is the
>"list_add_tail()" to the plug - which is the only "list_add()" variant
>in that function.
>
>But that never made sense, because the whole "but was" isn't a stack
>address, and "next" in "list_add_tail()" is basically fixed, and would
>have to be the stack.
>
>But I now notice that there's actually another "list_add()" variant
>there, and it's the one from __blk_mq_insert_request() that gets
>inlined into blk_mq_insert_request(), which then gets inlined into
>blk_mq_make_request().
>
>And that actually makes some sense just looking at the offsets too:
>
>     blk_sq_make_request+0x388/0x580
>
>so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.
>
>And there it makes perfect sense that the "next should be" value is
>*not* on the stack.
>
>Chris, if you built with debug info, you can try
>
>    ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388
>
>to get what line that blk_sq_make_request+0x388 address actually is. I
>think it's the
>
>                list_add_tail(&rq->queuelist, &ctx->rq_list);
>
>in __blk_mq_insert_req_list() (when it's inlined from
>blk_sq_make_request(), "at_head" will be false.
>
>So it smells like "&ctx->rq_list" might be corrupt.

I'm running your current git here, so these line numbers should line up 
for you:

blk_sq_make_request+0x388/0x578:
__blk_mq_insert_request at block/blk-mq.c:1049
 (inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175
  (inlined by) blk_sq_make_request at block/blk-mq.c:1419

The fsync path in the WARN doesn't have any plugs that I can see, so its 
not surprising that we're not in the plugging path.  I'm here:

	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
		/*
		 * For a SYNC request, send it to the hardware immediately. For
		 * an ASYNC request, just ensure that we run it later on. The
		 * latter allows for merging opportunities and more efficient
		 * dispatching.
		 */

I'll try the debugging patch you sent in the other email.

-chris