Message-ID: <Pine.LNX.4.64.0804100503160.26713@artax.karlin.mff.cuni.cz>
Date:	Thu, 10 Apr 2008 05:07:30 +0200 (CEST)
From:	Mikulas Patocka <mikulas@...ax.karlin.mff.cuni.cz>
To:	Bill Davidsen <davidsen@....com>
cc:	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
	device-mapper development <dm-devel@...hat.com>,
	agk@...hat.com, mingo@...hat.com, neilb@...e.de
Subject: Re: Data corruption on software RAID

> > Possibilities how to fix it:
> >
> > 1. lock the buffers and pages while they are being written --- this would
> > cause performance degradation (the most severe degradation would be in the
> > case when one process repeatedly calls sync() and another unrelated process
> > repeatedly writes to some file).
> >
> > Lock the buffers and pages only for RAID --- would create many special cases
> > and possible bugs.
> >
> > 2. never turn the region dirty bit off until the filesystem is unmounted
> > --- this is the simplest fix. If the computer crashes after a long time, it
> > resynchronizes the whole device. But it won't cause application-visible
> > or filesystem-visible data corruption.
> >
> > 3. turn off the region bit if the region wasn't written in one pdflush
> > period --- requires an interaction with pdflush, rather complex. The problem
> > here is that pdflush makes its best effort to write data in
> > dirty_writeback_centisecs interval, but it is not guaranteed to do it.
> >
> > 4. make more region states: Region has in-memory states CLEAN, DIRTY,
> > MAYBE_DIRTY, CLEAN_CANDIDATE.
> >
> > When you start writing to the region, it is always moved to DIRTY state (and
> > on-disk bit is turned on).
> >
> > When you finish all writes to the region, move it to MAYBE_DIRTY state, but
> > leave the bit on disk on. We now don't know if the region is dirty or not.
> >
> > Run a helper thread that does periodically:
> > Change MAYBE_DIRTY regions to CLEAN_CANDIDATE
> > Issue sync()
> > Change CLEAN_CANDIDATE regions to CLEAN state and clear their on-disk bit.
> >
> > The rationale is that if the above write-while-modify scenario happens, the
> > page is always dirty. Thus, sync() will write the page, kick the region back
> > from CLEAN_CANDIDATE to MAYBE_DIRTY state and we won't mark the region as
> > clean on disk.
> >
> >
> > I'd like to know your ideas on this, before we start coding a solution.
> >   
> 
> I looked at just this problem a while ago, and came to the conclusion that
> what was needed was a COW bit, to show that there was i/o in flight, and that
> before modification it needed to be copied. Since you don't want to let that
> recurse, you don't start writing the copy until the original is written and
> freed. Ideally you wouldn't bother to finish writing the original, but that
> doesn't seem possible. That allows at most two copies of a chunk to take up
> memory space at once, although it's still ugly and can be a bottleneck.

Copying the data would be performance overkill. You really can write
different data to different disks; you just must not forget to resync them
after a crash. The filesystem/application will recover with either old or
new data --- it just won't recover when it's reading old and new data from
the same location.
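
To make the failure mode concrete, here is a toy model of it. This is only
an illustrative sketch, not md or dm-raid1 code: the names mirrored_write and
resync are made up, the mirrors are plain byte arrays, and "resync" is
assumed to copy the first mirror over the others.

```python
def mirrored_write(buffer, mirrors, modify_between=None):
    """Copy `buffer` to every mirror, one after another.

    `modify_between` (hypothetical hook) models a process changing the
    page after the first mirror has read it but before the second has.
    """
    for index, mirror in enumerate(mirrors):
        mirror[:] = buffer
        if index == 0 and modify_between is not None:
            modify_between(buffer)


def resync(mirrors):
    """After a crash, make all mirrors identical to the first one."""
    for mirror in mirrors[1:]:
        mirror[:] = mirrors[0]


def overwrite_with_new(buf):
    buf[:] = b"new!"


page = bytearray(b"old!")
mirrors = [bytearray(4), bytearray(4)]

# The page is modified while the write is in flight.
mirrored_write(page, mirrors, modify_between=overwrite_with_new)
diverged = mirrors[0] != mirrors[1]   # the two disks now hold different data

# If the region is still marked dirty at crash time, resync repairs it.
resync(mirrors)
consistent = mirrors[0] == mirrors[1]
```

Either "old!" or "new!" on both disks is acceptable to the filesystem. The
problem is only the window in which the mirrors differ and the region is
already marked clean, so no resync would be run after a crash.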

From my point of view, the trick with a thread doing sync() and turning off
region bits looks best. I'd like to know whether that solution has any
other flaw.
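
For discussion, the four region states and the helper thread pass from
proposal 4 can be sketched as below. This is a rough model under stated
assumptions, not kernel code: Region, helper_pass and the sync callback are
hypothetical names, on_disk_dirty stands in for the on-disk region bit, and
locking and the ordering of the bitmap write against the data write are
ignored.

```python
from enum import Enum


class State(Enum):
    CLEAN = 0
    DIRTY = 1
    MAYBE_DIRTY = 2
    CLEAN_CANDIDATE = 3


class Region:
    def __init__(self):
        self.state = State.CLEAN
        self.on_disk_dirty = False   # stand-in for the on-disk region bit
        self.writes_in_flight = 0

    def start_write(self):
        # Any write moves the region to DIRTY and turns the on-disk bit on.
        self.writes_in_flight += 1
        self.state = State.DIRTY
        self.on_disk_dirty = True

    def end_write(self):
        # When the last write finishes we no longer know whether the
        # mirrors agree, so the on-disk bit stays on.
        self.writes_in_flight -= 1
        if self.writes_in_flight == 0:
            self.state = State.MAYBE_DIRTY


def helper_pass(regions, sync):
    """One iteration of the periodic helper thread."""
    for region in regions:
        if region.state is State.MAYBE_DIRTY:
            region.state = State.CLEAN_CANDIDATE
    # sync() rewrites any page that was modified during its previous
    # write, which kicks that region out of CLEAN_CANDIDATE.
    sync()
    for region in regions:
        if region.state is State.CLEAN_CANDIDATE:
            region.state = State.CLEAN
            region.on_disk_dirty = False
```

A region whose page was modified during the write gets rewritten by sync(),
so it passes through DIRTY and ends the pass in MAYBE_DIRTY with its on-disk
bit still set. Only a region that stays untouched across a whole pass is
marked clean on disk.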

> For reliable operation I would want all copies (and/or CRCs) to be written on
> an fsync, by the time I bother to fsync I really, really, want the data on the
> disk.

fsync already works this way.

Mikulas