Message-ID: <50f93ccb-2b2c-15c5-8b08-facc3a25068a@dupond.be>
Date:   Mon, 9 Mar 2020 14:52:38 +0100
From:   Jean-Louis Dupond <jean-louis@...ond.be>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: Filesystem corruption after unreachable storage

On 28/02/2020 12:06, Jean-Louis Dupond wrote:
> On 25/02/2020 18:23, Theodore Y. Ts'o wrote:
>> This is going to be a long shot, but if you could try testing with
>> 5.6-rc3, or with this commit cherry-picked into a 5.4 or later kernel:
>>
>>     commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6
>>     Author: wangyan <wangyan122@...wei.com>
>>     Date:   Thu Feb 20 21:46:14 2020 +0800
>>
>>         jbd2: fix ocfs2 corrupt when clearing block group bits
>>         I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
>>         The running environment:
>>                 kernel version: 4.19
>>                 A cluster with two nodes, 5 luns mounted on two nodes, and do some
>>                 file operations like dd/fallocate/truncate/rm on every lun with storage
>>                 network disconnection.
>>         The fallocate operation on dm-23-45 caused an null pointer dereference.
>>         ...
>>
>> ... it would be interesting to see if fixes things for you.  I can't
>> guarantee that it will, but the trigger of the failure which wangyan
>> found is very similar indeed.
>>
>> Thanks,
>>
>>                         - Ted
> Unfortunately it was a too long shot :)
>
> Tested with a 5.4 kernel with that patch included, and also with 5.6-rc3.
> But both had the same issue.
>
> - Filesystem goes read-only when the storage comes back
> - Manual fsck needed on bootup to recover from it.
>
> It would be great if we could make it not corrupt the filesystem on 
> storage recovery.
> I'm happy to test some patches if they are available :)
>
> Thanks
> Jean-Louis

Did some more tests today.

Raising the SCSI timeout seems to be the most reliable solution.
When the storage recovers, the VM just recovers and we can continue :)
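
For reference, a minimal sketch of how the per-device SCSI command timeout
can be raised through sysfs. The device name "sda" and the value of 180
seconds are only illustrative assumptions, not the values used in the tests
above; the helper names are hypothetical. Writing to sysfs needs root and
does not persist across reboots.

```python
from pathlib import Path


def scsi_timeout_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Return the sysfs file holding the SCSI command timeout (in seconds)."""
    return Path(sysfs_root) / "block" / device / "device" / "timeout"


def read_scsi_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Read the current timeout for a block device such as 'sda'."""
    return int(scsi_timeout_path(device, sysfs_root).read_text().strip())


def set_scsi_timeout(device: str, seconds: int, sysfs_root: str = "/sys") -> int:
    """Set a new timeout and return the previous value.

    The sysfs_root parameter exists only so the helper can be pointed at a
    scratch directory for testing; on a real system leave it at "/sys".
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    previous = read_scsi_timeout(device, sysfs_root)
    scsi_timeout_path(device, sysfs_root).write_text(f"{seconds}\n")
    return previous


if __name__ == "__main__":
    # Assumed device name and value; adjust for the actual VM disk.
    old = set_scsi_timeout("sda", 180)
    print(f"sda timeout: {old}s -> {read_scsi_timeout('sda')}s")
```

The idea is simply that a timeout longer than the expected storage outage
lets the outstanding I/O wait instead of failing back to the filesystem.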

I also tested the filesystem mount option 'errors=panic'.
When the storage recovers, the VM freezes, so a hard reset is needed.
But on boot a manual fsck is still needed, just like in the default situation.
So it seems it still writes data to the FS before panicking?
You would expect it not to touch the FS anymore.

Would be nice if this situation could be a bit more error-proof :)

Thanks
Jean-Louis

