[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131121234116.GD6502@dastard>
Date: Fri, 22 Nov 2013 10:41:16 +1100
From: Dave Chinner <david@...morbit.com>
To: Martin Boutin <martboutin@...il.com>
Cc: Eric Sandeen <sandeen@...hat.com>,
"Kernel.org-Linux-RAID" <linux-raid@...r.kernel.org>,
xfs-oss <xfs@....sgi.com>,
"Kernel.org-Linux-EXT4" <linux-ext4@...r.kernel.org>
Subject: Re: Filesystem writes on RAID5 too slow
On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux
Oh, it's 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...
> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
sunit/swidth is 512k/1MB
> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
> Device Boot Start End Blocks Id System
> /dev/sda1 2048 20565247 10281600 83 Linux
Aligned to 1 MB.
> /dev/sda2 20565248 1953525167 966479960 83 Linux
And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.
> $ xfs_info /dev/md0
> meta-data=/dev/md0 isize=256 agcount=32, agsize=15101312 blks
> = sectsz=4096 attr=2
> data = bsize=4096 blocks=483239168, imaxpct=5
> = sunit=12
sunit/swidth of 512k/1MB, so it matches the MD device.
> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..2047999]: 2049056..4097055 0 (2049056..4097055) 2048000 01111
> FLAG Values:
> 010000 Unwritten preallocated extent
> 001000 Doesn't begin on stripe unit
> 000100 Doesn't end on stripe unit
> 000010 Doesn't begin on stripe width
> 000001 Doesn't end on stripe width
> # this does not look good, does it?
Yup, looks broken.
/me digs through git.
Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.
[ Hmmm, that would explain the very occasional failure that
generic/223 throws outi (maybe once a month I see it fail). ]
Which means MD is doing RMW cycles for it's parity calculations, and
that's where performance is going south.
Current code:
$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 1056..2098207 0 (1056..2098207) 2097152 11111
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$
Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
in on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.
With a fixed kernel:
$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 6293504..8390655 0 (6293504..8390655) 2097152 10000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$
It;s clear we have completely stripe swidth aligned allocation and it's 25% faster.
Take fallocate out of the picture so the direct IO does the
allocation:
$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..2097151]: 2099200..4196351 0 (2099200..4196351) 2097152 00000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.
The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in a MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com
xfs: align initial file allocations correctly.
From: Dave Chinner <dchinner@...hat.com>
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.
Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.
Fix it by considering allocation into empty files as requiring
aligned allocation again.
Signed-off-by: Dave Chinner <dchinner@...hat.com>
---
fs/xfs/xfs_bmap.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
* blocks at the end of the file which do not start at the previous data block,
* we will try to align the new blocks at stripe unit boundaries.
*
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
* at, or past the EOF.
*/
STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
bma->aeof = 0;
error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
&is_empty);
- if (error || is_empty)
+ if (error)
return error;
+ if (is_empty) {
+ bma->aeof = 1;
+ return 0;
+ }
+
/*
* Check if we are allocation or past the last extent, or at least into
* the last delayed allocated extent.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists