Message-ID: <5679E7FB.3080505@fb.com>
Date: Tue, 22 Dec 2015 17:16:59 -0700
From: Jens Axboe <axboe@...com>
To: Steven Rostedt <rostedt@...dmis.org>,
LKML <linux-kernel@...r.kernel.org>
CC: Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Michael Ellerman <mpe@...erman.id.au>,
Mark Salter <msalter@...hat.com>,
Laurent Dufour <ldufour@...ux.vnet.ibm.com>,
Ming Lei <ming.lei@...onical.com>,
<linux-block@...r.kernel.org>
Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On 12/22/2015 05:09 PM, Steven Rostedt wrote:
> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
> started testing. My tests started to fail, and then I noticed they
> failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
> that I was constantly getting messages like this:
>
> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
> ata2.00: irq_stat 0x20000000, host bus error
> ata2: SError: { HostInt }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: configured for UDMA/100
> ata2: EH complete
>
>
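[Editorial note, not part of the original message.] The "cmd" lines in the log above can be decoded to see which requests failed. The following is a minimal Python sketch; the field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) is an assumption based on how libata's error handler prints taskfiles, and the helper name decode_ncq_cmd is hypothetical. For NCQ commands the sector count is carried in the feature registers, and the tag is in the upper bits of nsect.

```python
import re

# Assumed libata report layout (not stated in the thread):
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
CMD_RE = re.compile(
    r"cmd (?P<fields>[0-9a-f]{2}/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}"
    r"/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/[0-9a-f]{2})"
    r" tag (?P<tag>\d+) ncq (?P<nbytes>\d+) (?P<dir>in|out)"
)

SECTOR_SIZE = 512  # assumed logical sector size


def decode_ncq_cmd(line):
    """Decode one libata 'cmd' line for an NCQ command, or return None."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    regs = [int(x, 16) for x in re.split(r"[/:]", m.group("fields"))]
    (command, feat, nsect, lbal, lbam, lbah,
     hob_feat, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = regs
    # NCQ: sector count lives in feature/hob_feature.
    sectors = (hob_feat << 8) | feat
    # 48-bit LBA assembled from the low and high-order-byte registers.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": command,
        "tag": int(m.group("tag")),
        "tag_from_nsect": nsect >> 3,  # NCQ tag is stored in bits 7:3
        "sectors": sectors,
        "bytes_from_sectors": sectors * SECTOR_SIZE,
        "bytes_reported": int(m.group("nbytes")),
        "lba": lba,
        "direction": m.group("dir"),
    }


if __name__ == "__main__":
    sample = [
        "ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out",
        "ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out",
    ]
    for line in sample:
        info = decode_ncq_cmd(line)
        print(hex(info["lba"]), info["sectors"], info["bytes_reported"])
```

Under these assumptions, the four failed commands in the log decode to back-to-back writes whose start LBAs are 2048 sectors apart: three of 2048 sectors (1048576 bytes) and a final one of 2232 sectors (1142784 bytes).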
> The test box has a relatively new mobo and such, but I know the HD was
> old. So I thought that the HD was simply failing. I installed a new HD
> and spent lots of time since last Thursday trying to set it up to work
> with my testing scripts. Unfortunately, I installed a newer Fedora that
> no longer supported the older grub1 and I wasted lots of time trying to
> get grub2 to do what I wanted. I finally gave up and used
> syslinux/extlinux and got it working again. Unfortunately, I still got
> these ata2 errors! I started thinking that the mobo may be bad.
>
> But then I decided to try an older kernel, and the errors never showed
> up. I booted back and forth several times and the errors were very
> reliable. I have multiple OSes on this box so every time I got an
> error, I would boot into one of the other OSes and do fsck on the
> filesystems, because if I kept running my tests with this bug, it would
> eventually start corrupting the ext4 filesystem.
>
> Since it seemed very reliable, I started my bisect. It came down to this
> patch:
>
> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@...onical.com>
> Date: Tue, 24 Nov 2015 10:35:29 +0800
> Subject: [PATCH] block: fix segment split
>
>
> I thought this strange, because I don't see anything wrong with this
> patch. But if I removed it, the problem went away, and when I added it
> back, the problem would show up easily.
>
> I checked out v4.4-rc6 and tested again, thinking something else may be
> wrong and has since been fixed. Nope, the error still showed up. I then
> removed this commit and tried again. Sure enough, the problem went away!

Probably the other way around, I think: it uncovered an issue with the
segment counting in certain cases.

> My guess is that there's another bug lurking around somewhere, and the
> bug that this patch fixed hid the problem. Now that this patch fixed a
> bug that would hide the issue, the issue is showing up.
>
> I'll pass this along to the block experts and see what you can think of
> it. I attached my config, and the test was a script that stress
> trace-cmd filters.
>
> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
> much on x86_64 as my tests start with i386. It originally had issues in
> x86_64 but that may be because the i386 test corrupted the filesystem
> which is shared.
>
> There may be a 32bit vs 64bit issue somewhere?

I'm guessing it's the same issue that was recently diagnosed, which
would make sense if you hit this on 32-bit with highmem. A patch is
pending; if you feel inclined, it'd be great if you could apply it
and retry:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f

--
Jens Axboe