Message-ID: <20140129185741.GA8798@birch.djwong.org>
Date:	Wed, 29 Jan 2014 10:57:41 -0800
From:	"Darrick J. Wong" <darrick.wong@...cle.com>
To:	"Theodore Ts'o" <tytso@....edu>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: FAST paper on ffsck

On Mon, Dec 09, 2013 at 01:01:49PM -0500, Theodore Ts'o wrote:
> Andreas brought up on today's conference call Kirk McKusick's recent
> changes[1] to try to improve fsck times for FFS, in response to the
> recent FAST paper covering fsck speed ups for ext3, "ffsck: The Fast
> Filesystem Checker"[2]
> 
> [1] http://www.mckusick.com/publications/faster_fsck.pdf
> [2] https://www.usenix.org/system/files/conference/fast13/fast13-final52_0.pdf
> 
> All of the changes which Kirk outlined are ones which we had done
> several years ago, in the early days of ext4 development.  I talked
> about some of these in some blog entries, "Fast ext4 fsck times"[3], and
> "Fast ext4 fsck times, revisited"[4]
> 
> [3] http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/
> [4] http://thunk.org/tytso/blog/2009/02/26/fast-ext4-fsck-times-revisited/
> 
> (Apologies for the really bad formatting; I recovered my blog from
> backups a few months ago, installed onto a brand-new Wordpress
> installation --- since the old one was security bug ridden and
> horribly obsolete --- and I haven't had a chance to fix up some of the
> older blog entries that had explicit HTML for tables to work with the
> new theme.)
> 
> One further observation from reading the ffsck paper.  Their method of
> introducing heavy file system fragmentation resulted in a file system
> where most of the files had external extent tree blocks; that is, the
> trees had a depth > 1.  I have not observed this in file systems under
> normal load, since most files are written once and not rewritten, and
> those that are rewritten (i.e., database files) are not the common
> case, and even then, generally aren't written in a random append
> workload where there are hundreds of files in the same directory which
> are appended to in random order.  So looking at a couple of file
> systems' fsck -v output, I find results such as this:
> 
>              Extent depth histogram: 1229346/569/3
>              Extent depth histogram: 332256/141
>              Extent depth histogram: 23253/456
> 
> ... where the first number is the number of inodes where all of the
> extent information is stored in the inode, and the second number is the
> number of inodes with a single level of external extent tree blocks,
> and so on.
> 
> As a result, I'm not seeing the fsck time degradation resulting from
> file system aging, because with at least my workloads, the file system
> isn't getting fragmented enough to result in a large number of
> inodes with external extent tree blocks.
> 
> We could implement schemes to optimize fsck performance for heavily
> fragmented file systems; a few which could be done using just e2fsck
> optimizations, and some which would require file system format
> changes.  However, it's not clear to me that it's worth it.
> 
> If folks would like to help run some experiments, it would be useful to
> run a test e2fsck on a partition: "e2fsck -Fnfvtt /dev/sdb1" and look
> at the extent depth histogram and the I/O rates for the various e2fsck
> passes (see below for an example).
> 
> If you have examples where the file system has a very large number of
> inodes with extent tree depths > 1, it would be useful to see these
> numbers, with a description of how old the file system has been, and
> what sort of workload might have contributed to its aging.
> 

I don't know about "very large", but here's what I see on the server that I
share with some friends.  Afaik it's used mostly for VM images and test
kernels... and other parallel-write-once files. ;)  This FS has been running
since Nov. 2012.  That said, I think the VM images were created without
fallocate; some of these files have tens of thousands of tiny extents.

     5386404 inodes used (4.44%, out of 121307136)
       22651 non-contiguous files (0.4%)
        7433 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 5526723/1334/16
   202583901 blocks used (41.75%, out of 485198848)
           0 bad blocks
          34 large files

     5207070 regular files
      313009 directories
         576 character device files
         192 block device files
          11 fifos
     1103023 links
       94363 symbolic links (86370 fast symbolic links)
          73 sockets
------------
     6718317 files

On my main dev box, which is entirely old photos, mp3s, VM images, and kernel
builds, I see:

     2155348 inodes used (2.94%, out of 73211904)
       14923 non-contiguous files (0.7%)
        1528 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 2147966/685/3
    85967035 blocks used (29.36%, out of 292834304)
           0 bad blocks
           6 large files

     1862617 regular files
      284915 directories
         370 character device files
          59 block device files
           6 fifos
      609215 links
        7454 symbolic links (6333 fast symbolic links)
          24 sockets
------------
     2764660 files
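
To put the two histograms above in perspective, here is a minimal Python
sketch that turns an "Extent depth histogram:" line into the share of inodes
that need at least one external extent tree block.  The function names are
made up for illustration; the only input format assumed is the
slash-separated line that e2fsck -v prints in the summaries above.

```python
def parse_depth_histogram(line):
    """Return per-depth inode counts from an 'Extent depth histogram:' line.

    Index 0 is the count of inodes whose extents all live in the inode,
    index 1 is one level of external extent tree blocks, and so on.
    """
    _, _, counts = line.partition(":")
    return [int(n) for n in counts.strip().split("/")]


def external_fraction(counts):
    """Fraction of extent-mapped inodes with at least one external block."""
    total = sum(counts)
    return sum(counts[1:]) / total if total else 0.0


samples = {
    "shared server": "Extent depth histogram: 5526723/1334/16",
    "main dev box": "Extent depth histogram: 2147966/685/3",
}

for label, line in samples.items():
    counts = parse_depth_histogram(line)
    print(f"{label}: {sum(counts[1:])} of {sum(counts)} inodes "
          f"({external_fraction(counts):.4%}) use external extent blocks")
```

On both boxes the inodes with external extent tree blocks come out well
under a tenth of a percent of the total.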

Sadly, since I've left the LTC I no longer have access to tux1, which had a
rather horrifically fragmented ext3.  Its backup server, which created a Time
Machine-like series of "snapshots" with rsync --link-dest, took days to fsck,
despite being ext4.

--D

> Thanks, regards,
> 
> 					- Ted
> 
> e2fsck 1.42.8 (20-Jun-2013)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 1: Memory used: 668k/7692k (575k/94k), time:  0.92/ 0.42/ 0.02
> Pass 1: I/O read: 11MB, write: 0MB, rate: 11.95MB/s
> Pass 2: Checking directory structure
> Pass 2: Memory used: 784k/15196k (466k/319k), time:  0.44/ 0.03/ 0.00
> Pass 2: I/O read: 10MB, write: 0MB, rate: 22.76MB/s
> Pass 3: Checking directory connectivity
> Peak memory: Memory used: 784k/15196k (466k/319k), time:  1.60/ 0.63/ 0.02
> Pass 3: Memory used: 784k/15196k (439k/346k), time:  0.00/ 0.00/ 0.00
> Pass 3: I/O read: 1MB, write: 0MB, rate: 2793.30MB/s
> Pass 4: Checking reference counts
> Pass 4: Memory used: 784k/188k (432k/353k), time:  0.63/ 0.63/ 0.00
> Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
> Pass 5: Checking group summary information
> Pass 5: Memory used: 784k/188k (426k/359k), time:  4.95/ 0.16/ 0.10
> Pass 5: I/O read: 19MB, write: 0MB, rate: 3.84MB/s
> 
>        13825 inodes used (0.03%, out of 47906816)
>         1425 non-contiguous files (10.3%)
>           11 non-contiguous directories (0.1%)
>              # of inodes with ind/dind/tind blocks: 0/0/0
>              Extent depth histogram: 12986/831
>    141525383 blocks used (73.85%, out of 191627264)
>            0 bad blocks
>            4 large files
> 
>        11537 regular files
>         2279 directories
>            0 character device files
>            0 block device files
>            0 fifos
>            0 links
>            0 symbolic links (0 fast symbolic links)
>            0 sockets
> ------------
>        13816 files
> Memory used: 784k/188k (426k/359k), time:  7.19/ 1.42/ 0.12
> I/O read: 39MB, write: 0MB, rate: 5.43MB/s
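
For anyone collecting these numbers from several machines, a rough Python
sketch for pulling the per-pass timing and I/O figures out of "e2fsck -tt"
output.  The regular expressions are written against the sample output
above only, and reading the three "time:" fields as real/user/system
seconds is an assumption; the names PASS_IO, PASS_TIME and summarize are
made up for illustration.

```python
import re

# Patterns match the per-pass lines in the sample output; the "Peak memory"
# line and the final totals do not start with "Pass N:" and are skipped.
PASS_IO = re.compile(
    r"Pass (\d+): I/O read: (\d+)MB, write: (\d+)MB, rate: ([\d.]+)MB/s")
PASS_TIME = re.compile(
    r"Pass (\d+): Memory used: .*time:\s*([\d.]+)/\s*([\d.]+)/\s*([\d.]+)")


def summarize(text):
    """Map pass number -> dict of timing and I/O figures."""
    passes = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate quoted-mail prefixes
        m = PASS_TIME.match(line)
        if m:
            entry = passes.setdefault(int(m.group(1)), {})
            entry["real_s"] = float(m.group(2))  # assumed: real time
            continue
        m = PASS_IO.match(line)
        if m:
            entry = passes.setdefault(int(m.group(1)), {})
            entry["read_mb"] = int(m.group(2))
            entry["write_mb"] = int(m.group(3))
            entry["rate_mb_s"] = float(m.group(4))
    return passes


SAMPLE = """\
Pass 1: Memory used: 668k/7692k (575k/94k), time:  0.92/ 0.42/ 0.02
Pass 1: I/O read: 11MB, write: 0MB, rate: 11.95MB/s
Pass 2: Memory used: 784k/15196k (466k/319k), time:  0.44/ 0.03/ 0.00
Pass 2: I/O read: 10MB, write: 0MB, rate: 22.76MB/s
Peak memory: Memory used: 784k/15196k (466k/319k), time:  1.60/ 0.63/ 0.02
Pass 3: Memory used: 784k/15196k (439k/346k), time:  0.00/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 2793.30MB/s
Pass 4: Memory used: 784k/188k (432k/353k), time:  0.63/ 0.63/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Memory used: 784k/188k (426k/359k), time:  4.95/ 0.16/ 0.10
Pass 5: I/O read: 19MB, write: 0MB, rate: 3.84MB/s
"""

summary = summarize(SAMPLE)
for number in sorted(summary):
    print(number, summary[number])
```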
> 
> Note: the reason why this file system has so many files with large
> extents is because there are some video files which are large enough that
> even when contiguous, they will require an external extent block, e.g:
> 
> File size of 01 Yankee White.m4v is 499375730 (121918 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:   19802112..  19802112:      1:            
>    1:        2..     315:   19802114..  19802427:    314:   19802113:
>    2:      543..   14335:   19802655..  19816447:  13793:   19802428:
>    3:    14336..   47103:   19830784..  19863551:  32768:   19816448:
>    4:    47104..   73727:   19896320..  19922943:  26624:   19863552:
>    5:    73728..   79871:   19955712..  19961855:   6144:   19922944:
>    6:    79872..  112639:   19994624..  20027391:  32768:   19961856:
>    7:   112640..  121917:   20060160..  20069437:   9278:   20027392: eof
> 01 Yankee White.m4v: 8 extents found
> 
> BTW, looking at the output of filefrag -v on large files, it does look
> like there is some work we can do to improve the block allocation
> heuristics.  These files were written w/o the benefit of fallocate,
> but with delayed allocation, and apparently we aren't automatically
> figuring out that we should be in stream mode from the get-go.  This
> pattern is reproduced in most of the files in the directory:
> 
> File size of 02 Hung Out to Dry.m4v is 552382434 (134859 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:   19816448..  19816448:      1:            
>    1:        2..     314:   19816450..  19816762:    313:   19816449:
>    2:      542..   14335:   19816990..  19830783:  13794:   19816763:
>    3:    14336..   47103:   19863552..  19896319:  32768:   19830784:
>    4:    47104..   79871:   19961856..  19994623:  32768:   19896320:
>    5:    79872..  112639:   20027392..  20060159:  32768:   19994624:
>    6:   112640..  134858:   20070400..  20092618:  22219:   20060160: eof
> 02 Hung Out to Dry.m4v: 7 extents found
> 
> File size of 03 Sea Dog.m4v is 553146161 (135046 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:   20092928..  20092928:      1:            
>    1:        2..     159:   20092930..  20093087:    158:   20092929:
>    2:      161..     306:   20093089..  20093234:    146:   20093088:
>    3:      534..   14335:   20093462..  20107263:  13802:   20093235:
>    4:    14336..   47103:   20121600..  20154367:  32768:   20107264:
>    5:    47104..   79871:   20187136..  20219903:  32768:   20154368:
>    6:    79872..  112639:   20252672..  20285439:  32768:   20219904:
>    7:   112640..  135045:   20318208..  20340613:  22406:   20285440: eof
> 03 Sea Dog.m4v: 8 extents found
> 
> File size of 04 The Immortals.m4v is 516091162 (125999 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:   20107264..  20107264:      1:            
>    1:        2..     162:   20107266..  20107426:    161:   20107265:
>    2:      164..     312:   20107428..  20107576:    149:   20107427:
>    3:      540..   14335:   20107804..  20121599:  13796:   20107577:
>    4:    14336..   47103:   20154368..  20187135:  32768:   20121600:
>    5:    47104..   79871:   20219904..  20252671:  32768:   20187136:
>    6:    79872..  112639:   20285440..  20318207:  32768:   20252672:
>    7:   112640..  125998:   20340736..  20354094:  13359:   20318208: eof
> 04 The Immortals.m4v: 8 extents found
> 
> Looking at all of these files, actually, if we had managed to allocate
> them using contiguous 32768 block extents, these 45 minute TV episodes
> would have just fit inside the in-inode's 4 extent slots.
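
The arithmetic behind that closing remark can be sketched in a few lines
of Python.  This assumes ideal allocation with every extent at the
32768-block length seen in the filefrag output, 4 extent slots in the
inode as stated above, and no holes; the block counts are copied from the
four listings, and the helper names are made up for illustration.

```python
MAX_EXTENT_BLOCKS = 32768  # longest extent length in the filefrag listings
IN_INODE_SLOTS = 4         # extent slots in the inode, per the text above


def min_extents(blocks):
    """Fewest extents needed if every extent is as long as possible."""
    return -(-blocks // MAX_EXTENT_BLOCKS)  # integer ceiling division


def fits_in_inode(blocks):
    """True if an ideally allocated file needs no external extent block."""
    return min_extents(blocks) <= IN_INODE_SLOTS


files = {
    "01 Yankee White.m4v": 121918,
    "02 Hung Out to Dry.m4v": 134859,
    "03 Sea Dog.m4v": 135046,
    "04 The Immortals.m4v": 125999,
}

print("ceiling:", MAX_EXTENT_BLOCKS * IN_INODE_SLOTS, "blocks")
for name, blocks in files.items():
    print(f"{name}: {blocks} blocks, at least {min_extents(blocks)} extents, "
          f"fits in inode: {fits_in_inode(blocks)}")
```

Under these assumptions the ceiling is 131072 blocks (512 MiB with
4096-byte blocks), so episodes 01 and 04 fit in the inode while 02 and 03
land a few thousand blocks over it.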
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html