Message-ID: <4A9FB10B.60209@redhat.com>
Date:	Thu, 03 Sep 2009 08:05:31 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Rob Landley <rob@...dley.net>
CC:	Pavel Machek <pavel@....cz>, david@...g.hm,
	Theodore Tso <tytso@....edu>, Florian Weimer <fweimer@....de>,
	Goswin von Brederlow <goswin-v-b@....de>,
	kernel list <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
	rdunlap@...otime.net, linux-doc@...r.kernel.org,
	linux-ext4@...r.kernel.org, corbet@....net
Subject: Re: [PATCH] Update Documentation/md.txt to mention journaling won't
 help dirty+degraded case.

On 09/02/2009 06:49 PM, Rob Landley wrote:
> From: Rob Landley<rob@...dley.net>
>
> Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
> explaining that using a journaling filesystem can't overcome this problem.
>
> Signed-off-by: Rob Landley<rob@...dley.net>
> ---
>
>   Documentation/md.txt |   17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/md.txt b/Documentation/md.txt
> index 4edd39e..52b8450 100644
> --- a/Documentation/md.txt
> +++ b/Documentation/md.txt
> @@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
>
>      md-mod.start_dirty_degraded=1
>
> +Note that Journaling filesystems do not effectively protect data in this
> +case, because the update granularity of the RAID is larger than the journal
> +was designed to expect.  Reconstructing data via partity information involes
> +matching together corresponding stripes, and updating only some of these
> +stripes renders the corresponding data in all the unmatched stripes
> +meaningless.  Thus seemingly unrelated data in other parts of the filesystem
> +(stored in the unmatched stripes) can become unreadable after a partial
> +update, but the journal is only aware of the parts it modified, not the
> +"collateral damage" elsewhere in the filesystem which was affected by those
> +changes.
> +
> +Thus successful journal replay proves nothing in this context, and even a
> +full fsck only shows whether or not the filesystem's metadata was affected.
> +(A proper solution to this problem would involve adding journaling to the RAID
> +itself, at least during degraded writes.  In the meantime, try not to allow
> +a system to shut down uncleanly with its RAID both dirty and degraded, it
> +can handle one but not both.)
>
>   Superblock formats
>   ------------------
>
>

NACK.

Now you have moved the inaccurate documentation about journalling file systems 
into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or 
without journals) create an invalid stripe.

(2) partial writes can be prevented in most cases by running with the write 
cache disabled or with working barriers.

(3) fsck can (for journalling or non-journalling file systems) detect and fix 
your file system. It won't give you back the data in that stripe, but you will 
get the rest of your metadata and data back and usable.
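
As a rough illustration of point (1), here is a minimal Python sketch of XOR 
parity on a toy three-disk RAID5 stripe. The chunk contents and the names 
xor_blocks and simulate_torn_write are made up for this example; it shows only 
the parity arithmetic, not how MD actually lays out or updates data on disk.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_torn_write():
    """Return (d1 rebuilt from a consistent stripe, d1 rebuilt after a torn write)."""
    # Hypothetical stripe: two data chunks plus one parity chunk.
    d0 = b"AAAA"
    d1 = b"BBBB"
    parity = xor_blocks(d0, d1)

    # Degraded array: the disk holding d1 is gone, so d1 must be
    # reconstructed from the surviving data chunk and the parity.
    rebuilt_clean = xor_blocks(d0, parity)

    # Partial write: the new d0 reaches the disk, but power is lost
    # before the matching parity update is written.
    d0_new = b"CCCC"
    stale_parity = parity
    rebuilt_torn = xor_blocks(d0_new, stale_parity)

    return rebuilt_clean, rebuilt_torn


if __name__ == "__main__":
    clean, torn = simulate_torn_write()
    print("consistent stripe rebuilds d1 as:", clean)
    print("torn stripe rebuilds d1 as:      ", torn)
```

Nothing in this sketch involves a file system or a journal: the stripe goes bad 
purely because data and parity no longer match.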

You don't need MD in the picture to test this - take fsfuzzer or just dd and 
zero out a RAID stripe width of data from a file system. If you hit data blocks, 
your fsck (for ext2) or mount (for any journalling fs) will not see an error. If 
you hit metadata, fsck will in both cases try to fix it as best it can.
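
A minimal sketch of the "just dd" variant of that test, written in Python. The 
function name zero_region, the image path, the offset and the 128 KiB stripe 
width are all assumptions picked for the example; use a scratch copy of an 
unmounted file system image and the real stripe width of the array you are 
modelling.

```python
import os


def zero_region(path, offset, length):
    """Overwrite `length` bytes at `offset` in `path` with zeros.

    The file is opened in place ("r+b"), so its size is unchanged.
    """
    with open(path, "r+b") as image:
        image.seek(offset)
        image.write(b"\x00" * length)
        image.flush()
        os.fsync(image.fileno())


if __name__ == "__main__":
    # Hypothetical values: a scratch image and one 128 KiB stripe,
    # starting at the 40th stripe boundary.
    stripe_width = 128 * 1024
    zero_region("scratch-fs.img", 40 * stripe_width, stripe_width)
```

After damaging the image, run fsck against it (or loop-mount it) and compare 
what is reported with what was actually overwritten.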

Also note that partial writes (similar to torn writes) can happen for multiple 
reasons on non-RAID systems and leave the same kind of damage.

Side note: proposing a half-sketched-out "fix" for partial stripe writes in 
documentation is not productive. Much better to submit a fully thought-out 
proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group 
and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck 
repairs were affected.

Try the same with ext2 and ext3.

Regards,

Ric
