Date:	Tue, 14 Apr 2009 16:53:48 -0700
From:	Andy Isaacson <adi@...apodia.org>
To:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: [ext3 performance regression] internal fragmentation on ftruncate+MAP_SHARED allocations

I'm seeing really dramatic internal fragmentation (the file gets a
sequential set of blocks assigned, but the virtual->physical mapping is
severely nonlinear) on ext3 with a test program that basically does

   open(O_CREAT)
   ftruncate()
   mmap(MAP_SHARED)
   ... write to random offsets in the mmap region eventually filling
   entire file ...
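
Condensed, the pattern the test program exercises looks roughly like
this (a sketch only -- error handling and the timing instrumentation
are stripped, the page size and 1GB file size are hardcoded, and the
attached alloctest.c is the authoritative version):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t len = 1024UL * 1024 * 1024;	/* 1GB */
	size_t pages = len / 4096, i, j, t;
	size_t *order;
	char *p;
	int fd;

	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	ftruncate(fd, len);		/* size the file; no blocks written yet */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* dirty every page exactly once, in random order */
	order = malloc(pages * sizeof *order);
	for (i = 0; i < pages; i++)
		order[i] = i;
	for (i = pages - 1; i > 0; i--) {	/* Fisher-Yates shuffle */
		j = (size_t)rand() % (i + 1);
		t = order[i]; order[i] = order[j]; order[j] = t;
	}
	for (i = 0; i < pages; i++)
		memset(p + order[i] * 4096, 1, 4096);

	msync(p, len, MS_SYNC);		/* flush the dirty pages to disk */
	munmap(p, len);
	close(fd);
	return 0;
}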

filefrag(8) reports:

/home/andy/tmp/big: 50435 extents found, perfection would be 9 extents

and filefrag -v reports:

Filesystem type is: ef53
Filesystem cylinder groups is approximately 5423
Blocksize of file /home/andy/tmp/big is 4096
File size of /home/andy/tmp/big is 1073741824 (262144 blocks)
First block: 59605576
Last block: 68838012
Discontinuity: Block 1 is at 59563360 (was 59605576)
Discontinuity: Block 3 is at 59563365 (was 59563361)
Discontinuity: Block 8 is at 59563362 (was 59563369)
Discontinuity: Block 10 is at 59563370 (was 59563363)
Discontinuity: Block 12 is at 59563372 (was 59563372)
Discontinuity: Block 16 is at 59563381 (was 59563375)
Discontinuity: Block 17 is at 59605568 (was 59563381)
...

The resulting 50,000-fragment file is very painful to read:

% dd if=/home/andy/tmp/big of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 58.4419 s, 18.4 MB/s

(This is on a 7200rpm disk that can do 70MB/s sequential; on a slow
laptop HDD the behavior is vastly worse, with reports of a 1GB file
taking 10 minutes to read.)

I've tested multiple kernels, including 2.6.25, 2.6.28, 2.6.29-pre,
2.6.29.1, and 2.6.30-pre, with similar results on all of them.

On RHEL4's 2.6.9-55.ELsmp, less fragmentation is observed when the file
fits in memory -- I've seen around 9-20 fragments.  Even there,
fragmentation appears once the file exceeds the size of physical
memory.  (On modern kernels fragmentation occurs in both the in-memory
and out-of-core cases.)  Anecdotally the behavior got worse around
2.6.18, but unfortunately I can't easily test kernels that old (the
RHEL4 system is a special case).

I'm using the default mount options and the destination filesystem has
lots of free space:

% grep ' /home' /proc/mounts
/dev/sda9 /home ext3 rw,errors=continue,data=ordered 0 0
% df -h /home
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9             678G  170G  474G  27% /home

There's nothing special about the particular area of the disk being
allocated from; I created nine 1GB files "to fill in the holes" (wildly
speculating that some property of the "head of the free list" might be
causing the fragmentation) and saw no change in behavior.

I'm using cfq; I also tried noop, deadline, and anticipatory, with no
change.

The test system is a quad-core amd64 with 6GB of RAM, running 2.6.29.

Test program is attached.  Sample output:

% gcc -O2 -Wall alloctest.c -o alloctest
% rm -f ~/tmp/big && time ./alloctest ~/tmp/big $((1024*1024*1024))
touched 262144 pages in 14.552441 seconds (55.51 usec/page, 70.37 MB/s)
msync took  23.309054 seconds
munmap took 0.048076 seconds
close took  0.000013 seconds
total throughput 27.011429 MB/sec
./alloctest ~/tmp/big $((1024*1024*1024))  0.10s user 3.00s system 8% cpu 37.911 total
% sudo filefrag ~/tmp/big
/home/andy/tmp/big: 50435 extents found, perfection would be 9 extents


POTENTIAL WORKAROUNDS:

1. Using posix_fallocate(3) is somewhat helpful, but on ext3 it falls
back to doing block IO over the entire region -- which leads to a
significant delay at application startup time.
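
Concretely, that's one extra call between ftruncate() and mmap().  A
minimal sketch (illustrative only -- the helper name and error
reporting are mine, not from alloctest.c):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Call after ftruncate(), before mmap().  Where the filesystem has
 * real fallocate support (ext4, xfs) this can return quickly; on
 * ext3 glibc emulates it by writing into every block, so it does
 * block IO over the whole region before returning -- hence the
 * startup delay. */
static int preallocate(int fd, off_t len)
{
	int err = posix_fallocate(fd, 0, len);	/* returns an errno value, not -1 */

	if (err)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
	return err;
}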

2. Behavior on ext4 and xfs is better, either with posix_fallocate(3) or
with random allocations.  Neither shows the same terrible fragmentation
pattern, and of course posix_fallocate() can simply allocate an extent.

3. Increasing vm.dirty_ratio so that synchronous writeout is never
triggered does improve the behavior:

% sudo sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=90
vm.dirty_background_ratio = 5
vm.dirty_ratio = 90
% rm -f ~/tmp/big && time ./alloctest ~/tmp/big $((1024*1024*1024))
touched 262144 pages in 1.281706 seconds (4.89 usec/page, 798.94 MB/s)
msync took  20.630176 seconds
munmap took 0.044767 seconds
close took  0.000014 seconds
total throughput 46.637147 MB/sec
./alloctest ~/tmp/big $((1024*1024*1024))  0.11s user 2.87s system 13% cpu 21.966 total
% sudo filefrag ~/tmp/big
/home/andy/tmp/big: 483 extents found, perfection would be 9 extents

but I'm concerned (1) that it's setting us up for poor behavior
elsewhere and (2) that ext3 requires this when ext4 does not.

Thanks,
-andy

