lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 22 Feb 2007 05:30:03 +0100 (CET)
From:	Juan Piernas Canovas <piernas@...ec.um.es>
To:	Jörn Engel <joern@...ybastard.org>
Cc:	Sorin Faibish <sfaibish@....com>,
	kernel list <linux-kernel@...r.kernel.org>
Subject: Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

Hi Jörn,

I have been thinking about the problem that you describe, and, 
definitively, DualFS does not have that problem. I could be wrong, but, 
I actually believe that the GC implemented by DualFS is deadlock-free. 
The key is the design of the log-structured file system used by DualFS for 
the meta-data device, which is different to the design that you propose.

On Wed, 21 Feb 2007, [utf-8] Jörn Engel wrote:

> On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:
>>
>> I do not understand. Do you mean that if I have 10 segments, 5 busy and 5
>> free, after cleaning I could need 6 segments? How? Where the extra blocks
>> come from?
>
> This is a fairly complicated subject and I have trouble explaining it to
> people - even though I hope that maybe one or two dozen understand it by
> now.  So let me try to give you an example:
>
> In LogFS, inodes are stored in an inode file.  There are no B-Trees yet,
> so the regular unix indirect blocks are used.  My example will be
> writing to a directory, so that should only involve metadata by your
> definition and be a valid example for DualFS as well.  If it is not,
> please tell me where the difference lies.
>
> The directory is large, so appending to it involves writing a datablock
> (D0), and indirect block (D1) and a doubly indirect block (D2).
>
> Before:
> Segment 1: [some data] [   D1  ] [more data]
> Segment 2: [some data] [   D0  ] [more data]
> Segment 3: [some data] [   D2  ] [more data]
> Segment 4: [             empty             ]
> ...

DualFS writes meta-blocks in variable-sized chunks that we call partial 
segments. The meta-data device, however, is divided into segments, which 
have the same size. A partial segment can be as large a a segment, but a 
segment usually has more that one partial segment. Besides, a partial 
segment can not cross a segment boundary.

A partial segment is a transaction unit, and contains "all" the blocks 
modified by a file system operation, including indirect blocks and i-nodes 
(actually, it contains the blocks modified by several file system 
operations, but let us assume that every partial segment only contains the 
blocks modified by a single file system operation).

So, the above figure is as follows in DualFS:

  Before:
  Segment 1: [some data] [ D0 D1 D2 I ] [more data]
  Segment 2: [             some data              ]
  Segment 3: [               empty                ]

If the datablock D0 is modified, what you get is:

  Segment 1: [some data] [  garbage   ] [more data]
  Segment 2: [             some data              ]
  Segment 3: [ D0 D1 D2 I ] [       empty         ]

This is very similar to what the cleaner does. Therefore, moving a direct 
datablock (D0) to a new segment does not require more space than in the 
original segment. That is, cleaning a segment in DualFS requires just one 
free segment, and not more.

The result is that you can use all the free segments in DualFS, and its 
cleaner is simple and deadlock-free. Probably the design is not the most 
space-efficient in the world, but it removes some other serious problems.

And, remember, we are talking about meta-data (which is a small part of 
the file system), and disk space (which is quite inexpensive).

Regards,

 	Juan.

>
> After:
> Segment 1: [some data] [garbage] [more data]
> Segment 2: [some data] [garbage] [more data]
> Segment 3: [some data] [garbage] [more data]
> Segment 4: [D0][D1][D2][      empty        ]
> ...
>
> Ok.  After this, the position of D2 on the medium has changed.  So we
> need to update the inode and write that as well.  If the inode number
> for this directory is high, we will need to write the inode (I0), an
> indirect block (I1) and a doubly indirect block (I2).  The picture
> becomes a bit more complicates.
>
> Before:
> Segment 1: [some data] [   D1  ] [more data]
> Segment 2: [some data] [   D0  ] [more data]
> Segment 3: [some data] [   D2  ] [more data]
> Segment 4: [             empty             ]
> Segment 5: [some data] [   I1  ] [more data]
> Segment 6: [some data] [   I0  ] [more data]
> Segment 7: [some data] [   I2  ] [more data]
> ...
>
> After:
> Segment 1: [some data] [garbage] [more data]
> Segment 2: [some data] [garbage] [more data]
> Segment 3: [some data] [garbage] [more data]
> Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
> Segment 5: [some data] [garbage] [more data]
> Segment 6: [some data] [garbage] [more data]
> Segment 7: [some data] [garbage] [more data]
> ...
>
> So what has just happened?  The user did a single "touch foo" in a large
> directory and has caused six objects to move.  Unless some of those
> objects were in the same segment before, we now have six segments
> containing a tiny amount of garbage.
>
> And there is almost no way how you can squeeze that garbage back out.
> The cleaner will fundamentally do the same thing as a regular write - it
> will move objects.  So if you want to clean a segment containing the
> block of a different directory, you may again have to move five
> additional objects, the indirect blocks, inode and ifile indirect
> blocks.
>
> At this point, your cleaner is becoming a threat.  There is a real
> danger that it will create more garbage in unrelated segments than it
> frees up.  I claim that you cannot keep 50% clean segments, unless you
> move away from the simplistic cleaner I described above.
>
> Jörn
>
>

-- 
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657    Fax: +34968364151
email: piernas@...ec.um.es
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript :-) ***

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ