linux-kernel - Re: bio linked list corruption.

Message-ID: <CA+55aFwY5jNcUAJzDWERP2r9iZHEKzHSwMi5AJCJiaVd3Z0E-g@mail.gmail.com>
Date:   Wed, 26 Oct 2016 15:21:53 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Chris Mason <clm@...com>
Cc:     Dave Jones <davej@...emonkey.org.uk>,
        Andy Lutomirski <luto@...capital.net>,
        Andy Lutomirski <luto@...nel.org>, Jens Axboe <axboe@...com>,
        Al Viro <viro@...iv.linux.org.uk>, Josef Bacik <jbacik@...com>,
        David Sterba <dsterba@...e.com>,
        linux-btrfs <linux-btrfs@...r.kernel.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>,
        Dave Chinner <david@...morbit.com>
Subject: Re: bio linked list corruption.

On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason <clm@...com> wrote:
>
> This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers in < 10 minutes.
> I've done 30 minutes each with XFS and Ext4 without luck.

Ok, see the email I wrote that crossed yours - if it's really some
list corruption on ctx->rq_list due to some locking problem, I really
would expect CONFIG_VMAP_STACK to be entirely irrelevant, except
perhaps from a timing standpoint.

> WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
> Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
> CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
>  ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
>  ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
>  ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
> Call Trace:
>  [<ffffffff814fde0f>] dump_stack+0x53/0x74
>  [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
>  [<ffffffff810648cf>] __warn+0xff/0x120
>  [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
>  [<ffffffff8151c46e>] __list_add+0xbe/0xd0
>  [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
>  [<ffffffff814d5444>] generic_make_request+0x104/0x200

Well, it's very consistent, I have to say. So I really don't think
this is random corruption.

Could you try the attached patch? It adds a couple of sanity tests:

 - a number of tests to verify that 'rq->queuelist' isn't already on
   some queue when it is added to a queue

 - one test to verify that rq->mq_ctx is the same ctx that we have locked.
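
For readers without the attachment, the two kinds of check described above can be sketched roughly as follows. This is a userspace illustration only, not the contents of patch.diff: the types 'struct sw_ctx' and 'struct request' are simplified stand-ins, the helper names (init_list_head, list_is_empty, list_add_tail_raw, request_init, queue_request_checked) are hypothetical, and it assumes a request's queuelist is kept self-linked ("empty") whenever it is not on a queue. The real patch presumably warns rather than refusing the add.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* A node that points to itself is not linked into any other list. */
static bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Unchecked tail insertion, as a plain list add would do it. */
static void list_add_tail_raw(struct list_head *entry, struct list_head *head)
{
    struct list_head *prev = head->prev;

    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
    head->prev = entry;
}

/* Hypothetical, heavily simplified versions of the blk-mq structures. */
struct sw_ctx {
    struct list_head rq_list;   /* per-context request list */
};

struct request {
    struct list_head queuelist; /* links the request into one queue */
    struct sw_ctx *mq_ctx;      /* context the request belongs to */
};

static void request_init(struct request *rq, struct sw_ctx *ctx)
{
    init_list_head(&rq->queuelist);
    rq->mq_ctx = ctx;
}

/*
 * Add rq to locked_ctx->rq_list, but only after two sanity tests:
 *  1. rq->queuelist must not already be on some queue;
 *  2. rq->mq_ctx must be the ctx the caller has locked.
 * Returns true if the request was queued, false if a test failed.
 */
static bool queue_request_checked(struct sw_ctx *locked_ctx, struct request *rq)
{
    if (!list_is_empty(&rq->queuelist)) {
        fprintf(stderr, "sanity: rq->queuelist is already on a queue\n");
        return false;
    }
    if (rq->mq_ctx != locked_ctx) {
        fprintf(stderr, "sanity: rq->mq_ctx is not the locked ctx\n");
        return false;
    }
    list_add_tail_raw(&rq->queuelist, &locked_ctx->rq_list);
    return true;
}
```

In this sketch, adding the same request twice, or adding it under a ctx other than rq->mq_ctx, is reported and refused instead of silently rewriting the next/prev pointers, which is the kind of damage the list_add warning above detects only after the fact.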

I may be completely full of shit, and this patch may be pure garbage
or "obviously will never trigger", but humor me.

          Linus

View attachment "patch.diff" of type "text/plain" (2059 bytes)