lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0704290006330.12830@artax.karlin.mff.cuni.cz>
Date:	Sun, 29 Apr 2007 00:38:03 +0200 (CEST)
From:	Mikulas Patocka <mikulas@...ax.karlin.mff.cuni.cz>
To:	Bill Huey <billh@...ppy.monkey.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andreas Dilger <adilger@...sterfs.com>,
	Marat Buharov <marat.buharov@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mike Galbraith <efault@....de>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <jens.axboe@...cle.com>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Alex Tomas <alex@...sterfs.com>
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
 is under heavy write load (massive starvation)

> On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
>> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
>> phase tree filesystems (TUX2); it writes inside normal used structures,
>> but it marks each structure with generation tags --- when it updates
>> global table of tags, it atomically makes several structures valid. I
>> don't know about this idea being used elsewhere.
>
> So how is this generation structure organized ? paper ?

Paper is in CITSA 2006 proceedings (but you likely don't have them and I 
signed some statement that I can't post it elsewhere :-( )

Basicly the idea is this:
* you have array containing 65536 32-bit numbers --- crash count table --- 
that array is on disk and in memory (see struct __spadfs->cct in my sources)
* you have 16-bit value --- crash count, that value is on disk and in memory 
too (see struct __spadfs->cc)

* On mount, you load crash count table and crash count from disk to 
memory. You increment carsh count on disk (but leave old in memory). You 
increment one entry in crash count table - cct[cc] in memory, but leave 
old on disk.
* On sync you write all metadata buffers, do write barrier, write one 
sector of crash count table from memory to disk and do write 
barrier again.
* On unmount, you sync and decrement crash count on disk.

--- so crash count counts crashes --- it is increased each time you mount 
and don't unmount.

Consistency of structures:
* Each directory entry has two tags --- 32-bit transaction count (txc) 
and 16-bit crash count(cc).
* You create directory entry with entry->txc = fs->txc[fs->cc] and 
entry->cc = fs->cc
* Directory entry is considered valid if fs->txc[entry->cc] >= entry->txc 
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during directory 
scan, as if it wasn't there
--- so you create a directory entry and its valid. If the system crashes, 
it will load crash count table from disk and there's one-less value than 
entry->txc, so the entry will be invalid. It will also run with increased 
cc, so it will never touch txc at an old index, so the entry will be valid 
forever.
--- if you sync, you write crash count table to disk and directory entry 
will be atomically made valid forever (because values in crash count table 
never decrease)

In my implementation, the top bit of entry->txc is used to mark whether 
the entry is scheduled for adding or delete, so that you can atomically 
add one directory entry and delete other.

Space allocation bitmaps or lists are managed in such a way that there are 
two copies and cc/txc pair determining which one is valid.

Files are extended in such a way that each file has two "size" entries and 
cc/txc pair denoting which one is valid, so that you can atomically 
extend/truncate file and mark its space allocated/freed in bitmaps or 
lists (BTW. this cc/txc pair is the same one that denotes if the directory 
entry is valid and another bit determines one of these two functions --- 
to save space).

Mikulas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ