linux-kernel - Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Message-ID: <4A9FDC37.5060004@redhat.com>
Date:	Thu, 03 Sep 2009 11:09:43 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Florian Weimer <fweimer@....de>
CC:	Krzysztof Halasa <khc@...waw.pl>,
	Christoph Hellwig <hch@...radead.org>, Mark Lord <lkml@....ca>,
	Michael Tokarev <mjt@....msk.ru>, david@...g.hm,
	Pavel Machek <pavel@....cz>, Theodore Tso <tytso@....edu>,
	NeilBrown <neilb@...e.de>, Rob Landley <rob@...dley.net>,
	Goswin von Brederlow <goswin-v-b@....de>,
	kernel list <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
	rdunlap@...otime.net, linux-doc@...r.kernel.org,
	linux-ext4@...r.kernel.org, corbet@....net
Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe
 width, writes in storage

On 09/03/2009 10:26 AM, Florian Weimer wrote:
> * Ric Wheeler:
>
>> Note that even without MD RAID, the file system issues I/Os in file
>> system block size (normally 4096 bytes), and most commodity storage
>> devices use a 512-byte sector size, which means that we have to update
>> eight 512-byte sectors.
>
> Database software often attempts to deal with this phenomenon
> (sometimes called "torn page writes").  For example, you can make sure
> that the first time you write to a database page, you keep a full copy
> in your transaction log.  If the machine crashes, the log is replayed,
> first completely overwriting the partially-written page.  Only after
> that, you can perform logical/incremental logging.
>
> The log itself has to be protected with a different mechanism, so that
> you don't try to replay bad data.  But you haven't committed to this
> data yet, so it is fine to skip bad records.
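
As a rough illustration of the full-page logging described above (a sketch
only, not how any particular database does it): each log record carries a
complete page image plus a CRC32, replay overwrites the possibly torn page
with that image, and records that fail their CRC are skipped. The record
layout, PAGE_SIZE and the function names are all assumptions made up for
this example.

```python
import zlib

PAGE_SIZE = 8192  # hypothetical database page size


def make_record(page_no, image):
    """Build a log record: page number, full page image, CRC32 over both."""
    body = page_no.to_bytes(4, "big") + image
    return body + zlib.crc32(body).to_bytes(4, "big")


def parse_record(record):
    """Return (page_no, image), or None if the record is damaged."""
    if len(record) != 4 + PAGE_SIZE + 4:
        return None
    body, crc = record[:-4], int.from_bytes(record[-4:], "big")
    if zlib.crc32(body) != crc:
        return None
    return int.from_bytes(body[:4], "big"), body[4:]


def replay(log, pages):
    """Overwrite pages with logged full images, skipping bad records."""
    for record in log:
        parsed = parse_record(record)
        if parsed is None:
            continue  # not committed to this data yet, so skipping is fine
        page_no, image = parsed
        pages[page_no] = image  # replaces a possibly torn page wholesale
    return pages
```

Only after the full image is back in place would incremental records for
that page be applied; that step is left out here.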

Yes - databases worry a lot about this. Another technique that they tend to use 
is to have state bits at the beginning and end of their logical pages. For 
example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as 
you update.

If the bits don't match, that is a quick indication of a torn write.
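
A minimal sketch of that marker scheme, assuming a 4096-byte page written
as eight 512-byte sectors and a one-byte marker at each end. The sizes and
function names are invented for illustration:

```python
PAGE_SIZE = 4096    # assumed logical page size
SECTOR_SIZE = 512   # assumed device sector size


def stamp_page(payload, old_page):
    """Build the next version of a page, toggling both marker bytes."""
    assert len(payload) == PAGE_SIZE - 2
    marker = old_page[0] ^ 1  # toggles 1 -> 0 -> 1 -> 0 on each update
    return bytes([marker]) + payload + bytes([marker])


def looks_torn(page):
    """True if the first and last marker bytes disagree."""
    return page[0] != page[-1]


# Simulate a crash after only the first sector of the new version landed.
v1 = stamp_page(b"a" * (PAGE_SIZE - 2), bytes(PAGE_SIZE))
v2 = stamp_page(b"b" * (PAGE_SIZE - 2), v1)
torn = v2[:SECTOR_SIZE] + v1[SECTOR_SIZE:]
```

Here looks_torn(torn) is true, while both complete versions pass. Note the
check only compares the two ends: if the first and last sectors both land
but a middle one does not, the markers still match, so this is a cheap
first check and not a replacement for a checksum.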

Even with the above scheme, you can still lose data, of course: all it takes
is an IO error in both the log and the DB table that was recently updated.
That is not entirely unlikely, especially if you use storage with its write
cache enabled and don't flush that cache :-)

>
> Therefore, sub-page corruption is a fundamentally different issue from
> super-page corruption.

We have to be careful to keep our terms clear, since DB pages are (usually)
larger than the FS block size, which in turn is larger than the sector size
of non-RAID storage. At the FS level, we send down multiples of FS blocks
(not blocked or aligned to RAID stripe boundaries, etc).

In any case, we can get sub-FS-block "torn writes" even with a local SATA
drive in edge conditions.

>
> BTW, older textbooks will tell you that mirroring requires that you
> read from two copies of the data and compare it (and have some sort of
> tie breaker if you need availability).  And you also have to re-read
> data you've just written to disk, to make sure it's actually there and
> hit the expected sectors.  We can't even do this anymore, thanks to
> disk caches.  And it doesn't seem to be necessary in most cases.
>

We can do something like this with the built-in RAID in btrfs. If you detect
an IO error (or bad checksum) on a read, btrfs knows how to request/grab
another copy.
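
The idea, very roughly, looks like the sketch below. This is not btrfs
code: the function name is made up, a failed read is modelled as None, and
zlib's CRC32 stands in for whatever checksum the file system really stores.

```python
import zlib


def read_with_fallback(copies, expected_crc):
    """Return the first copy whose checksum matches; raise if none does."""
    for data in copies:
        if data is None:
            continue  # models an IO error on this copy
        if zlib.crc32(data) == expected_crc:
            return data  # good copy found
    raise IOError("no copy passed the checksum")


# One mirror returns corrupted data, the other is intact.
good = b"x" * 4096
bad = b"y" + good[1:]
block = read_with_fallback([bad, good], zlib.crc32(good))
```

The checksum is what makes this work without reading and comparing both
copies every time: a single read is enough to tell whether the data is
trustworthy, and the second copy is only touched on failure.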

Also note that SCSI T10 DIF/DIX has baked-in support for applications to
layer on extra data integrity (look for MKP's slide decks). This is really
neat since you can intercept bad I/Os on the way down and prevent them from
overwriting good data.

ric
