linux-ext4 - Of block allocation algorithms, fsck times, and file fragmentation

Message-Id: <E1M1fIm-0001Hw-0f@closure.thunk.org>
Date:	Wed, 06 May 2009 07:28:40 -0400
From:	"Theodore Ts'o" <tytso@....edu>
To:	linux-ext4@...r.kernel.org
cc:	Curt Wohlgemuth <curtw@...gle.com>
Subject: Of block allocation algorithms, fsck times, and file fragmentation

With the flexgroups Orlov allocator and the
don't-avoid-BLOCK_UNINIT-block-groups patch in place, I decided it was
time to do a quick check on fsck times.  Using a root filesystem freshly
copied to a laptop hard drive, I got the following results:
       
                      Ext3                             Ext4
          Time (seconds)      Data Read     Time (seconds)    Data Read
          Real   User    Sys    MB   MB/s   Real  User   Sys    MB   MB/s
Pass 1  192.30  20.65  12.45  1324   6.89   9.87  5.32  0.91   203  20.56
Pass 2   11.81   2.31   1.70   260  22.02   6.34  1.98  1.49   261  41.19
Pass 3    0.01   0.01   0.00     1  74.38   0.01  0.01  0.00     1  75.06
Pass 4    0.13   0.13   0.00     0   0.00   0.18  0.18  0.00     0   0.00
Pass 5    6.56   0.75   0.21     3   0.46   2.24  1.66  0.05     2   0.89
------
Total   211.10  23.90  14.38  1588   7.52  18.75  9.19  2.46   466  24.85

The ext4 fsck time is a little over 11 times better than the ext3 time.
This isn't entirely a fair comparison with the 6.7 times improvement
discussed at

     http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/

... since that filesystem had 67% of its blocks used and 9.3% of its
inodes used, whereas this filesystem has 41% of its blocks used and 18%
of its inodes used.  However, the improvement in e2fsck pass 2 is quite
satisfactorily dramatic.
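
For anyone who wants to re-derive the ratios, here is a minimal Python
sketch that recomputes the speedup and read rates from the figures in
the table.  The constant and function names are made up for this
illustration; only the numbers come from the measurements above.

```python
# Totals copied from the table above: (real seconds, MB read).
EXT3_TOTAL = (211.10, 1588)
EXT4_TOTAL = (18.75, 466)
# Per-pass real times in seconds, also copied from the table.
EXT3_REAL = {1: 192.30, 2: 11.81, 3: 0.01, 4: 0.13, 5: 6.56}
EXT4_REAL = {1: 9.87, 2: 6.34, 3: 0.01, 4: 0.18, 5: 2.24}


def throughput_mb_per_s(real_seconds, megabytes):
    """Average read rate over a run; 0.0 when no time was measured."""
    return megabytes / real_seconds if real_seconds else 0.0


def speedup(old_seconds, new_seconds):
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


overall = speedup(EXT3_TOTAL[0], EXT4_TOTAL[0])
per_pass = {p: speedup(EXT3_REAL[p], EXT4_REAL[p]) for p in EXT3_REAL}

print(f"overall speedup: {overall:.2f}x")
for p in sorted(per_pass):
    print(f"pass {p}: {per_pass[p]:.2f}x")
print(f"ext3 read rate: {throughput_mb_per_s(*EXT3_TOTAL):.2f} MB/s")
print(f"ext4 read rate: {throughput_mb_per_s(*EXT4_TOTAL):.2f} MB/s")
```

The per-pass rates in the table were presumably computed from unrounded
byte counts, so recomputing them from the rounded MB column can differ
in the last digit.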

So that's the good news.  However, the block allocation shows that we
are doing something... strange.  According to an e2fsck -E fragcheck
report, the large files seem to be written out in 8 megabyte chunks:

  1313(f): expecting  51200 actual extent phys  53248 log 2048 len 2048
  1313(f): expecting  55296 actual extent phys  59392 log 4096 len 2048
  1313(f): expecting  61440 actual extent phys  63488 log 6144 len 9
  1351(f): expecting  53248 actual extent phys  57344 log 2048 len 2048
  1351(f): expecting  59392 actual extent phys  67584 log 4096 len 4096
  1351(f): expecting  71680 actual extent phys  73728 log 8192 len 2048
  1351(f): expecting  75776 actual extent phys  77824 log 10240 len 2048
  1351(f): expecting  79872 actual extent phys  83968 log 12288 len 642
  1572(f): expecting  63488 actual extent phys  64512 log 1024 len 99
  1573(f): expecting  49152 actual extent phys  64000 log 512 len 412
  1574(f): expecting  67584 actual extent phys  71680 log 2048 len 2048
  1574(f): expecting  73728 actual extent phys  75776 log 4096 len 2048
  1574(f): expecting  77824 actual extent phys  81920 log 6144 len 2048
  1574(f): expecting  83968 actual extent phys  86016 log 8192 len 12288
  1574(f): expecting  98304 actual extent phys 100352 log 20480 len 32768
  1574(f): expecting 149504 actual extent phys 151552 log 69632 len 2048
  1574(f): expecting 153600 actual extent phys 155648 log 71680 len 2048
  1574(f): expecting 157696 actual extent phys 159744 log 73728 len 2048
  1574(f): expecting 161792 actual extent phys 165888 log 75776 len 2048
  1574(f): expecting 167936 actual extent phys 169984 log 77824 len 2048
  1574(f): expecting 172032 actual extent phys 174080 log 79872 len 1959
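
To make the listing easier to slice, here is a rough Python sketch that
parses fragcheck lines of the shape shown above and converts extent
lengths to megabytes.  The regular expression is fitted to these sample
lines only, the 4 KiB block size is an assumption (it is what makes 2048
blocks come out to 8 megabytes), and the names Frag and parse_fragcheck
are invented for the example.

```python
import re
from typing import NamedTuple

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks, so 2048 blocks = 8 MiB


class Frag(NamedTuple):
    inode: int
    expected: int  # physical block that would have kept the file contiguous
    phys: int      # physical block where the extent actually starts
    log: int       # logical block (offset in the file) where the extent starts
    length: int    # extent length in blocks


# Fitted to the sample lines above; not a general parser for e2fsck output.
LINE_RE = re.compile(
    r"(\d+)\(\w\): expecting\s+(\d+) actual extent phys\s+(\d+) "
    r"log\s+(\d+) len\s+(\d+)"
)


def parse_fragcheck(text):
    """Return one Frag per fragcheck line found in text."""
    return [Frag(*map(int, m.groups())) for m in LINE_RE.finditer(text)]


SAMPLE = """\
  1313(f): expecting  51200 actual extent phys  53248 log 2048 len 2048
  1313(f): expecting  55296 actual extent phys  59392 log 4096 len 2048
  1313(f): expecting  61440 actual extent phys  63488 log 6144 len 9
"""

for f in parse_fragcheck(SAMPLE):
    gap = f.phys - f.expected
    mib = f.length * BLOCK_SIZE / 2**20
    print(f"inode {f.inode}: log {f.log} is {gap} blocks past the "
          f"expected spot, {mib:.2f} MiB long")
```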

The ext3 and ext4 filesystems were copied using rsync, which copies
files on a file-by-file basis; that is, one file should have been
written, followed by another file.   Yet there seems to be some kind of
interleaving effect going on.  

  1351(f): expecting  71680 actual extent phys  73728 log 8192 len 2048
  1574(f): expecting  67584 actual extent phys  71680 log 2048 len 2048

Logical block 8192 of inode 1351 *should* have been written at physical
block 71680 in order to keep 1351 contiguous on disk.  Yet logical block
2048 of inode 1574 was written there instead.  Why?

This also happened here:

  1351(f): expecting  75776 actual extent phys  77824 log 10240 len 2048
  1574(f): expecting  73728 actual extent phys  75776 log 4096 len 2048

and here:

  1572(f): expecting  63488 actual extent phys  64512 log 1024 len 99
  1313(f): expecting  61440 actual extent phys  63488 log 6144 len 9
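
The pairing above was done by eye; it can also be done mechanically.
The sketch below (Python, with hypothetical names) takes the six
fragcheck records quoted in these examples and, for each one, looks for
another inode's extent that covers the block the file was expecting.
It only sees the extents that fragcheck reported, so it can miss blocks
taken by files that were not themselves fragmented.

```python
from typing import NamedTuple


class Frag(NamedTuple):
    inode: int
    expected: int  # physical block that would have kept the file contiguous
    phys: int      # physical block where the extent actually starts
    log: int       # logical block where the extent starts
    length: int    # extent length in blocks


# The six records quoted in the examples above.
FRAGS = [
    Frag(1313, 61440, 63488, 6144, 9),
    Frag(1351, 71680, 73728, 8192, 2048),
    Frag(1351, 75776, 77824, 10240, 2048),
    Frag(1572, 63488, 64512, 1024, 99),
    Frag(1574, 67584, 71680, 2048, 2048),
    Frag(1574, 73728, 75776, 4096, 2048),
]


def find_interlopers(frags):
    """Return (inode, log, other_inode, other_log) for each fragment whose
    expected block lies inside an extent belonging to a different inode."""
    hits = []
    for f in frags:
        for g in frags:
            if g.inode != f.inode and g.phys <= f.expected < g.phys + g.length:
                hits.append((f.inode, f.log, g.inode, g.log))
    return hits


for inode, log, other, other_log in find_interlopers(FRAGS):
    print(f"inode {inode} log {log}: expected block is held by "
          f"inode {other} log {other_log}")
```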

The bottom line is this was a freshly mke2fs'ed filesystem, and the
files were getting copied one at a time using rsync, so in theory all of
the files should be written contiguously on the disk.  However, this was
not true:

     535 non-contiguous files (0.1%)

None of the fragmented files were disastrously fragmented; the files
seem to be written in extents whose sizes are multiples of 2048 blocks,
or 8 megabytes, interleaved with files that were written before and
after the file in question.  The questions are why this is happening at
all, and whether we can do better.

This effect looks like the one which Curt Wohlgemuth had noticed and
reported last week.

-----------------

On a lark, I tried copying the filesystem with nodelalloc, and the
results were *really* bad:

   33780 non-contiguous files (4.2%)

Worse yet, the fragments were happening at a boundary of 60k, after the
first 15 blocks of each file:

   288(f): expecting  34777 actual extent phys  37155 log 15 len 1
   288(f): expecting  37156 actual extent phys  37728 log 16 len 3
   338(f): expecting  37912 actual extent phys  36340 log 15 len 1
   338(f): expecting  36341 actual extent phys  37744 log 16 len 5
   400(f): expecting  41714 actual extent phys  37116 log 15 len 1
   400(f): expecting  37117 actual extent phys  40224 log 16 len 3
   430(f): expecting  41741 actual extent phys  37117 log 15 len 1
   438(f): expecting  42063 actual extent phys  37118 log 15 len 1
   438(f): expecting  37119 actual extent phys  40240 log 16 len 112
   438(f): expecting  40352 actual extent phys  42496 log 128 len 723
   440(f): expecting  41770 actual extent phys  37119 log 15 len 1
   440(f): expecting  37120 actual extent phys  40352 log 16 len 5
   441(f): expecting  41785 actual extent phys  37523 log 15 len 1
   441(f): expecting  37524 actual extent phys  40368 log 16 len 7
   443(f): expecting  41808 actual extent phys  37156 log 15 len 1
   443(f): expecting  37157 actual extent phys  43232 log 16 len 468
   446(f): expecting  41825 actual extent phys  37157 log 15 len 1
   446(f): expecting  37158 actual extent phys  40384 log 16 len 7
   447(f): expecting  41840 actual extent phys  37158 log 15 len 1
   447(f): expecting  37159 actual extent phys  40400 log 16 len 48
   447(f): expecting  40448 actual extent phys  43712 log 64 len 55
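
As a quick sanity check on the 60k figure, the following Python sketch
tallies where the fragments in this listing begin.  The (inode, logical
block) pairs are transcribed from the lines above; the 4 KiB block size
is an assumption, and it is what makes 15 blocks equal 60k.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks, which makes 15 blocks = 60 KiB

# (inode, logical start block) of each fragment in the listing above.
FRAGMENT_STARTS = [
    (288, 15), (288, 16), (338, 15), (338, 16), (400, 15), (400, 16),
    (430, 15), (438, 15), (438, 16), (438, 128), (440, 15), (440, 16),
    (441, 15), (441, 16), (443, 15), (443, 16), (446, 15), (446, 16),
    (447, 15), (447, 16), (447, 64),
]

by_offset = Counter(log for _, log in FRAGMENT_STARTS)
inodes = {inode for inode, _ in FRAGMENT_STARTS}

for log, count in sorted(by_offset.items()):
    kib = log * BLOCK_SIZE // 1024
    print(f"logical block {log:3d} ({kib:3d} KiB): "
          f"{count} of {len(inodes)} files break here")
```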

A quick look with debugfs shows the obvious block interleaving:

debugfs:  stat <400>
	  ...
BLOCKS:
(0-14):41699-41713, (15):37116, (16-18):40224-40226

debugfs:  stat <401>
	  ...
BLOCKS:
(0):41714

debugfs:  stat <403>
	  ...
(0-4):41715-41719

debugfs:  stat <404>
	  ...
(0-4):41720-41724

debugfs:  stat <405>
	  ..
(0):41725

debugfs:  stat <406>
	  ..
(0-2):42008-42010

debugfs:  stat <407>
	  ...
(0):42011

debugfs:  stat <408>
	  ...
(0):42012
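
The same check can be scripted.  Below is a small Python sketch that
parses a BLOCKS line of the form shown above and reports where a file
stops being contiguous.  It handles only the plain numeric entries seen
in this message and ignores anything else on the line; parse_blocks and
discontinuities are names invented for the example.

```python
import re

# Matches "(0-14):41699-41713" as well as single-block entries like "(15):37116".
EXTENT_RE = re.compile(r"\((\d+)(?:-(\d+))?\):(\d+)(?:-(\d+))?")


def parse_blocks(line):
    """Return (log_start, log_end, phys_start, phys_end) per numeric entry."""
    extents = []
    for m in EXTENT_RE.finditer(line):
        log_start, log_end, phys_start, phys_end = m.groups()
        log_start, phys_start = int(log_start), int(phys_start)
        extents.append((
            log_start,
            int(log_end) if log_end else log_start,
            phys_start,
            int(phys_end) if phys_end else phys_start,
        ))
    return extents


def discontinuities(extents):
    """List (logical block, expected phys, actual phys) at each break."""
    breaks = []
    for prev, cur in zip(extents, extents[1:]):
        expected = prev[3] + 1
        if cur[2] != expected:
            breaks.append((cur[0], expected, cur[2]))
    return breaks


inode_400 = parse_blocks("(0-14):41699-41713, (15):37116, (16-18):40224-40226")
inode_401 = parse_blocks("(0):41714")

for log, expected, actual in discontinuities(inode_400):
    print(f"inode 400: logical block {log} expected at {expected}, found at {actual}")
print(f"inode 401 starts at physical block {inode_401[0][2]}")
```

With the two lines above, the block that inode 400 was expecting for
logical block 15 is exactly where inode 401 begins.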

Thinking this was perhaps rsync's fault, I tried the experiment where I
copied the files using tar:

       tar -cf - -C /mnt2 . | tar -xpf - -C /mnt .

However, the same pattern was visible.  Tar definitely copies files one
at a time, so this must be an artifact of the page writeback
algorithms.

						- Ted


