linux-kernel - Re: FS corruption after I/O errors


Message-ID: <490C54D4.4030603@vlnb.net>
Date:	Sat, 01 Nov 2008 16:08:36 +0300
From:	Vladislav Bolkhovitin <vst@...b.net>
To:	Nick Piggin <nickpiggin@...oo.com.au>
CC:	linux-fsdevel@...r.kernel.org, viro@...iv.linux.org.uk,
	linux-kernel@...r.kernel.org,
	James Bottomley <James.Bottomley@...senPartnership.com>,
	scst-devel <scst-devel@...ts.sourceforge.net>
Subject: Re: FS corruption after I/O errors

Vladislav Bolkhovitin wrote:
> Nick Piggin wrote:
>> On Wednesday 29 October 2008 06:38, Vladislav Bolkhovitin wrote:
>>> Nick Piggin wrote:
>>>> On Saturday 25 October 2008 03:10, Vladislav Bolkhovitin wrote:
>>>>> Hi,
>>>>>
>>>>> During a recent debugging session of my SCSI target SCST
>>>>> (http://scst.sf.net) I noticed many
>>>>>
>>>>> WARNING: at fs/buffer.c:1186 mark_buffer_dirty+0x51/0x66()
>>>>>
>>>>> messages in kernel log on the initiator. I attached the full log of
>>>>> several of them.
>>>>>
>>>>> My target was buggy and I was working on fixing it, but I suppose Linux
>>>>> should handle such failures more gracefully. In all the cases the target
>>>>> had one type of failure: it "ate" a SCSI command and never returned
>>>>> a result for it.
>>>> Right. This is one of the warnings I see in my fault-injection testing.
>>>> It is fixed by my patch to clean up and improve the page and buffer
>>>> error handling in the vm/fs.
>>> Can you specify which patch you are referring to? Is it in 2.6.27?
>> It's just an RFC at the moment which I posted to fsdevel. Not in 2.6.27.
> 
> I see. I'm looking forward to seeing it in 2.6.28 or .29. This is really 
> needed work.
> 
> BTW, have you ever seen in your fault-injection testing that, after 
> receiving a failure from a SCSI device during heavy load, an ext3 file 
> system mounted on it gets corrupted and journal replay on remount 
> doesn't repair it, so that only a manual e2fsck helps? I've seen that 
> many times, including cases when the target remained up and fully 
> functional. See, e.g., the "MOANING MODE ON" part in 
> http://marc.info/?l=linux-scsi&m=121932252324432&w=2. I haven't checked 
> that case since then, although I see such corruptions quite often. But 
> in all of them I can't say so clearly that it isn't the target's failure.

I've just checked it with 2.6.27. The situation has greatly improved, and 
dbench was able to complete several runs under constant TASK_ABORTED 
"bombarding" (TASK RESET task management commands issued with "sg_reset -b" 
every 31 seconds from another "connection" to that device via the qla2xxx 
initiator driver. You can see those resets in the attached log). But 
when I then unmounted the affected partition, e2fsck found errors on it. 
See attachments for details. The target was fine and completely healthy 
the whole time.
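
For anyone who wants to try something similar, here is a rough sketch of the reset loop and the follow-up check described above. It is not the script actually used in this test: the device paths, the function names, and the choice of e2fsck options ("-f" to force a check, "-n" to leave the file system untouched) are assumptions made for illustration. Only "sg_reset -b" and the 31-second interval come from the message itself.

```python
import subprocess
import time


def reset_loop(sg_device, count, interval=31.0,
               run=subprocess.run, sleep=time.sleep):
    """Issue `sg_reset -b` against sg_device `count` times.

    `sg_device` (e.g. "/dev/sg3") is a hypothetical path; use the sg node
    that reaches the device under test through the second initiator.
    `run` and `sleep` are injectable so the loop can be dry-run without
    touching real hardware.
    """
    return_codes = []
    for i in range(count):
        # check=False: a failed reset is recorded, not raised.
        proc = run(["sg_reset", "-b", sg_device], check=False)
        return_codes.append(proc.returncode)
        if i + 1 < count:
            sleep(interval)  # wait between resets, not after the last one
    return return_codes


def fsck_command(block_device):
    """Build a read-only e2fsck invocation for the unmounted partition.

    -f forces a check even if the file system is marked clean;
    -n opens it read-only and answers "no" to every repair prompt.
    `block_device` (e.g. "/dev/sdb1") is a hypothetical path.
    """
    return ["e2fsck", "-f", "-n", block_device]
```

The idea would be to start dbench on the mounted partition, run reset_loop() from the second connection while it works, then unmount and run the command returned by fsck_command() to see whether anything was damaged.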



View attachment "dbench" of type "text/plain" (17885 bytes)

View attachment "kernel" of type "text/plain" (38110 bytes)