linux-ext4 - Re: fallocate creating fragmented files

Message-Id: <1359585611.5124.140661184685613.5C09C3AB@webmail.messagingengine.com>
Date:	Thu, 31 Jan 2013 09:40:11 +1100
From:	Bron Gondwana <brong@...tmail.fm>
To:	"Theodore Ts'o" <tytso@....edu>, Robert Mueller <robm@...tmail.fm>
Cc:	Eric Sandeen <sandeen@...hat.com>,
	Linux Ext4 mailing list <linux-ext4@...r.kernel.org>
Subject: Re: fallocate creating fragmented files

On Thu, Jan 31, 2013, at 08:43 AM, Theodore Ts'o wrote:
> On Thu, Jan 31, 2013 at 08:21:50AM +1100, Robert Mueller wrote:

(around now, was dropping the kids at school)

> > For that matter, one big question I have is why each of these results is
> > so different.
> > 
> > [robm@...p14 conf]$ for i in 1 2 3 4 5 6 7 8 9 10; do fallocate -l 20m
> > testfile3; filefrag testfile3; /bin/rm testfile3; done
> 
> The most likely reason is that it depends on transaction boundaries.
> After a block has been released, we can't reuse it until after the
> jbd2 transaction which contains the deletion of the inode has
> committed.  So even after you've deleted the file, we can't reuse the
> blocks right away.  The other thing which will influence the block
> allocation is which block group the last allocation was for that
> particular file.  So if blocks become available after a commit
> completes, if we've started allocating in another block group, we
> won't go back to the initial block group.

The particular directory we're doing this test in is a cyrus imapd "conf"
directory.  It contains mostly symlinks and sub directories (some of them
quite hot) but it also contains mailboxes.db, which is a very active
database file.  In this case it's twoskip, which is a skiplist-based file
format.

When any change is made to a twoskip file, the IO pattern is:

1) rewrite first 64 bytes (marking file dirty) and fdatasync
2) append new change/delete records and update back pointers (involves
   between 1 and 20 random rewrites of between 32 and 200ish bytes per
   change)
3) fsync
4) rewrite first 64 bytes (marking file clean again) and fdatasync

So we get two fdatasyncs, one fsync (to save the metadata about the
file being longer now), a bunch of random updates throughout the file,
and some amount of new data appended to the file.
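
The four steps above can be sketched roughly as follows.  This is only an
illustrative Python sketch, not the actual Cyrus twoskip code: the name
commit_change, its parameters and the header contents are hypothetical,
and it uses pwrite for brevity where the real code uses seek and writev.

```python
import os

HEADER_SIZE = 64  # the first 64 bytes carry the dirty/clean marker


def commit_change(fd, dirty_header, clean_header, rewrites, appended):
    """Apply one change to an open database file descriptor.

    rewrites: iterable of (offset, bytes) in-place pointer updates
    appended: bytes of new records to add at the end of the file
    (All names here are hypothetical, for illustration only.)
    """
    if len(dirty_header) != HEADER_SIZE or len(clean_header) != HEADER_SIZE:
        raise ValueError("header must be exactly 64 bytes")

    # 1) rewrite the first 64 bytes (mark dirty) and fdatasync
    os.pwrite(fd, dirty_header, 0)
    os.fdatasync(fd)

    # 2) append the new records, then update back pointers in place
    end = os.fstat(fd).st_size
    os.pwrite(fd, appended, end)
    for offset, data in rewrites:
        os.pwrite(fd, data, offset)

    # 3) fsync, so the new file length (metadata) is durable too
    os.fsync(fd)

    # 4) rewrite the first 64 bytes (mark clean) and fdatasync
    os.pwrite(fd, clean_header, 0)
    os.fdatasync(fd)
```

Each change therefore costs three synchronous flushes, two of which touch
only the first block of the file.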

Every so often the file contains too many obsolete records, and it gets
repacked.  This involves creating a new database file (mailboxes.db.NEW)
and walking through the original database copying each record to the new
database.  Finally, the new database is renamed over the old.
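
As a rough illustration of that repack step (again a hypothetical Python
sketch, not the real implementation; the function name and the
live_records argument are made up, and it simplifies the syncing to a
single fsync before the rename):

```python
import os


def repack(path, live_records):
    """Copy the still-live records into path + ".NEW", then rename it
    over the original.  live_records is an iterable of bytes objects
    (hypothetical interface, for illustration only)."""
    new_path = path + ".NEW"
    with open(new_path, "wb") as f:
        for record in live_records:
            f.write(record)
        # Push buffered data to the kernel, then to stable storage,
        # before the new file replaces the old one.
        f.flush()
        os.fsync(f.fileno())
    # rename() atomically replaces the old database with the new one
    os.rename(new_path, path)
```

Since the new file is written sequentially from start to finish, this is
the case where a single contiguous extent would be the expected result.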

It uses flock on the entire file for serialisation, so there can only be
a single writer at a time.

Writes are done using seek and writev; reads are done by mmapping the
entire file.
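
For anyone unfamiliar with that combination, here is a small hypothetical
Python sketch of the same calls (flock for the single-writer lock, seek
plus writev for writes, mmap for reads).  The function names are invented
for illustration and this is not the twoskip code itself.

```python
import fcntl
import mmap
import os


def locked_append(fd, buffers):
    """Take an exclusive flock on the whole file, seek to the end and
    write all buffers with one writev call.  Returns bytes written."""
    fcntl.flock(fd, fcntl.LOCK_EX)
    try:
        os.lseek(fd, 0, os.SEEK_END)
        return os.writev(fd, buffers)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)


def read_all(fd):
    """Map the entire (non-empty) file read-only and return a copy of
    its contents."""
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        return m[:]
```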

More detail about twoskip here if anyone cares:

http://opera.brong.fastmail.fm/talks/twoskip/

It's the twoskip files that we're particularly concerned about.  Not so
much that they fragment during use (that's kind of expected), but that a
repack doesn't result in a single contiguous file.  Apart from the header
being written and synced first, I can't see why it doesn't.

I could probably change the repack code to not do the first two fdatasyncs,
and just do a final fsync before renaming, if you think that initial fsync
of just a couple of hundred bytes (header plus initial dummy record) is
likely to mess up page allocation.

Bron.
-- 
  Bron Gondwana
  brong@...tmail.fm
