Message-ID: <3a7bc899-31d9-51f2-1ea9-b3bef2a98913@dupond.be>
Date:   Thu, 20 Feb 2020 10:08:44 +0100
From:   Jean-Louis Dupond <jean-louis@...ond.be>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: Filesystem corruption after unreachable storage

As the mail seems to have been trashed somewhere, I'll retry :)

Thanks
Jean-Louis


On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>> There was a short disruption of the SAN, which caused it to be 
>> unavailable
>> for 20-25 minutes for the ESXi.
> 20-25 minutes is "short"? I guess it depends on your definition / POV. :-)
Well, more downtime was caused by the recovery (due to the manual fsck) than
by the storage outage itself :)
>
>> What worries me is that almost all of the VMs (out of 500) were
>> showing the same error.
> So that's a bit surprising...
Indeed, that's where I thought something went wrong!
I've tried to simulate it, and was able to reproduce the same error when
we let the SAN recover BEFORE the VM is shut down.
If I power off the VM and then recover the SAN, it does an automatic fsck
without problems.
So it really seems it breaks the moment the VM can write to the SAN again.
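
(In case anyone wants to reproduce this without a real SAN: device-mapper
can approximate the outage on a scratch disk. Untested sketch, /dev/sdX
and the mount point are placeholders:)

# SECTORS=$(blockdev --getsz /dev/sdX)
# dmsetup create testdisk --table "0 $SECTORS linear /dev/sdX 0"
# mkfs.ext4 /dev/mapper/testdisk
# mount /dev/mapper/testdisk /mnt/test
    ... start a workload writing into /mnt/test ...
# dmsetup suspend testdisk
# dmsetup reload testdisk --table "0 $SECTORS error"
# dmsetup resume testdisk
    ... every I/O now fails, like the SAN being unreachable; wait a while ...
# dmsetup suspend testdisk
# dmsetup reload testdisk --table "0 $SECTORS linear /dev/sdX 0"
# dmsetup resume testdisk
    ... the storage "comes back" while the filesystem is still mounted ...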
>
>> And even some (+-10) were completely corrupt.
> What do you mean by "completely corrupt"? Can you send an e2fsck
> transcript of file systems that were "completely corrupt"?
Well, it was moving tons of files to lost+found etc., so that was really
broken.
I'll see if I can recover a backup of one in that broken state.
Anyway, this was only a very small percentage, so it worries me less than
the rest :)
>
>> Is there for example a chance that the filesystem gets corrupted the
>> moment the SAN storage was back accessible?
> Hmm... the one possibility I can think of off the top of my head is
> that in order to mark the file system as containing an error, we need
> to write to the superblock. The head of the linked list of orphan
> inodes is also in the superblock. If that had gotten modified in the
> intervening 20-25 minutes, it's possible that this would result in
> orphaned inodes not on the linked list, causing that error.
>
> It doesn't explain the more severe cases of corruption, though.
If fixing that would have left us with only 10 corrupt disks instead of
500, that would be a big win :)
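
(For what it's worth, the orphan list head Ted mentions is stored in the
superblock and can be inspected on an unmounted image; a sketch, using
the device from the fsck run below:)

# dumpe2fs -h /dev/mapper/vg01-root | grep -i orphan
    (dumpe2fs prints a "First orphan inode:" line when the superblock's
     orphan list head is set)
# debugfs -R "stat <165708>" /dev/mapper/vg01-root
    (inspects the inode at the head of the list; 165708 is the inode
     e2fsck flagged below)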
>
>> I also have a snapshot available of a corrupted disk if some
>> additional debugging info is required.
> Before e2fsck was run? Can you send me a copy of the output of
> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
> copy of that snapshot?
Sure:
dumpe2fs -> see attachment

Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 165708 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(863328--863355)
Fix? yes

Free blocks count wrong for group #26 (3485, counted=3513).
Fix? yes

Free blocks count wrong (1151169, counted=1151144).
Fix? yes

Inode bitmap differences:  -4401 -165708
Fix? yes

Free inodes count wrong for group #0 (2489, counted=2490).
Fix? yes

Free inodes count wrong for group #20 (1298, counted=1299).
Fix? yes

Free inodes count wrong (395115, counted=395098).
Fix? yes


/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks

>
>> It would be great to gather some feedback on how to improve the
>> situation (apart from, of course, avoiding SAN outages :)).
> Something that you could consider is setting up your system to trigger
> a panic/reboot on a hung task timeout, or when ext4 detects an error
> (see the man page of tune2fs and mke2fs and the -e option for those
> programs).
>
> There are tradeoffs with this, but if you've lost the SAN for 15-30
> minutes, the file systems are going to need to be checked anyway, and
> the machine will certainly not be serving. So forcing a reboot might
> be the best thing to do.
Going to look into that! Thanks for the info.
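
(For the archives, these seem to be the relevant knobs; sketch only, see
tune2fs(8) and the kernel's sysctl documentation before deploying:)

# tune2fs -e panic /dev/mapper/vg01-root
    (panic instead of continuing when ext4 detects an error)
# sysctl -w kernel.panic=10
    (reboot automatically 10 seconds after a panic)
# sysctl -w kernel.hung_task_panic=1
# sysctl -w kernel.hung_task_timeout_secs=300
    (panic when a task has been blocked for more than 5 minutes)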
>> On KVM for example there is an unlimited timeout (afaik) until the
>> storage is back, and the VM just continues running after storage
>> recovery.
> Well, you can adjust the SCSI timeout, if you want to give that a try....
Does that have any other disadvantages? Or is it quite safe to increase
the SCSI timeout?
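
(In case it helps others reading along: the timeout in question appears
to be the per-device SCSI command timeout in sysfs; sdX is a placeholder
and the value is in seconds:)

# cat /sys/block/sdX/device/timeout
# echo 180 > /sys/block/sdX/device/timeout

To make it persistent, a udev rule along these lines should work:

ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{device/timeout}="180"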
>
> Cheers,
>
> - Ted

