Message-ID: <50876E1D.3040501@redhat.com>
Date: Tue, 23 Oct 2012 23:27:09 -0500
From: Eric Sandeen <sandeen@...hat.com>
To: Nix <nix@...eri.org.uk>
CC: "Ted Ts'o" <tytso@....edu>, linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org,
"J. Bruce Fields" <bfields@...ldses.org>,
Bryan Schumaker <bjschuma@...app.com>,
Peng Tao <bergwolf@...il.com>, Trond.Myklebust@...app.com,
gregkh@...uxfoundation.org, linux-nfs@...r.kernel.org
Subject: Re: Apparent serious progressive ext4 data corruption bug in 3.6.3
(and other stable branches?)

On 10/23/12 11:15 PM, Nix wrote:
> On 24 Oct 2012, Eric Sandeen uttered the following:
>
>> On 10/23/12 3:57 PM, Nix wrote:
>>> The only unusual thing about the filesystems on this machine are that
>>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>>> 'nobarrier':
>>
>> I should have read more. :( More questions follow:
>>
>> * Does the Areca have a battery backed write cache?
>
> Yes (though I'm not powering off, just rebooting). Battery at 100% and
> happy, though the lack of power-off means it's not actually getting
> used, since the cache is obviously mains-backed as well.
>
>> * Are you crashing or rebooting cleanly?
>
> Rebooting cleanly, everything umounted happily including /home and /var.
>
>> * Do you see log recovery messages in the logs for this filesystem?
>
> My memory says yes, but nothing seems to be logged when this happens
> (though with my logs on the first filesystem damaged by this, this is
> rather hard to tell, they're all quite full of NULs by now).
>
> I'll double-reboot tomorrow via the faulty kernel and check, unless I
> get asked not to in the interim. (And then double-reboot again to fsck
> everything...)
>
>>> the full set of options for all my ext4 filesystems are:
>>>
>>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>>
>> ok journal_async_commit is off the reservation a bit; that's really not
>> tested, and Jan had serious reservations about its safety.
>
> OK, well, I've been 'testing' it for years :) No problems until now. (If
> anything, I was more concerned about journal_checksum. I thought that
> had actually been implicated in corruption before now...)

It had, but I fixed it AFAIK; OTOH, we turned it off by default
after that episode.

>> * Can you reproduce this w/o journal_async_commit?
>
> I can try!

Ok, fair enough. If the BBU is working, nobarrier is ok; I don't trust
journal_async_commit, but that doesn't mean this isn't a regression.

Thanks for the answers... onward. :)

-Eric