Message-Id: <341F6DCC-1788-4ACC-A86E-A5D99CC05320@whamcloud.com>
Date: Wed, 18 Apr 2012 08:09:02 -0700
From: Andreas Dilger <adilger@...mcloud.com>
To: Zheng Liu <gnehzuil.liu@...il.com>
Cc: Lukas Czerner <lczerner@...hat.com>,
Eric Sandeen <sandeen@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
Zheng Liu <wenqing.lz@...bao.com>
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate
On 2012-04-18, at 5:48, Zheng Liu <gnehzuil.liu@...il.com> wrote:
> I ran a more detailed benchmark. The environment is as before, and
> the machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4GB of memory,
> and a WDC WD1600AAJS-75M0A0 160GB SATA disk.
>
> I use the 'fallocate' and 'dd' commands to create a 256MB file. I
> compare three cases: fallocate w/o the new flag, fallocate w/ the new
> flag, and dd. Meanwhile, w/ journal and w/o journal are compared. When
> I format the filesystem, I use '-E lazy_itable_init=0' to avoid its
> impact. I use this command to do the comparison:
>
> time for ((i=0; i<2000; i++)); \
> do \
>     dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k \
>        count=1 seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
> done
>
>
> The result:
>
> nojournal:
> fallocate dd fallocate w/ new flag
> real 0m4.196s 0m3.720s 0m3.782s
> user 0m0.167s 0m0.194s 0m0.192s
> sys 0m0.404s 0m0.393s 0m0.390s
>
> data=journal:
> fallocate dd fallocate w/ new flag
> real 1m9.673s 1m10.241s 1m9.773s
> user 0m0.183s 0m0.205s 0m0.192s
> sys 0m0.397s 0m0.407s 0m0.398s
>
> data=ordered:
> fallocate dd fallocate w/ new flag
> real 1m16.006s 0m18.291s 0m18.449s
> user 0m0.193s 0m0.193s 0m0.201s
> sys 0m0.384s 0m0.387s 0m0.381s
>
> data=writeback:
> fallocate dd fallocate w/ new flag
> real 1m16.247s 0m18.133s 0m18.417s
> user 0m0.187s 0m0.193s 0m0.205s
> sys 0m0.401s 0m0.398s 0m0.387s
>
> From these results we can see that when data is set to 'journal', the
> three cases are almost the same. However, when data is set to 'ordered'
> or 'writeback', the slowdown in the w/ conversion case is severe. I then
> ran the same test without 'oflag=sync,direct', and the result didn't
> change. IMHO, I guess that the journal is the *root cause*. I don't
> have a definite conclusion yet, and I will keep tracking this issue.
> Please feel free to comment.
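
(For context: the "fallocate w/ new flag" case above boils down to an fallocate(2) call like the sketch below. FALLOC_FL_NO_HIDE_STALE is the flag proposed by this RFC series, not an upstream constant, so the fallback definition here is illustrative only.)

	#define _GNU_SOURCE
	#include <fcntl.h>

	#ifndef FALLOC_FL_NO_HIDE_STALE
	#define FALLOC_FL_NO_HIDE_STALE 0x4	/* from the RFC patches */
	#endif

	/* Preallocate 256MB; with the new flag the extents are marked
	 * initialized up front, so later writes skip the unwritten->
	 * written conversion (and its journalling). */
	static int prealloc_256m(int fd, int use_new_flag)
	{
		return fallocate(fd, use_new_flag ? FALLOC_FL_NO_HIDE_STALE : 0,
				 0, 256 << 20);
	}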
Looking at these performance numbers again, it would seem better if ext4 _were_ zero-filling the whole file and converting the whole thing to initialized extents, instead of leaving so many uninitialized extents behind.
The file size is 256MB, and the disk would have to be doing only ~3.5MB/s of linear streaming writes (256MB over the ~76s data=ordered result) to match the performance that you report, so a modern disk doing 50MB/s should be able to zero the whole file in about 5s.
It seems the threshold for zeroing uninitialized extents is incorrect. EXT4_EXT_ZERO_LEN is only 7 blocks (28kB with 4kB blocks), but typical disks can write 64kB about as easily as 4kB, so it would be interesting to change EXT4_EXT_ZERO_LEN to 16 and re-run your test.
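
For a quick test that is a one-line change against fs/ext4/extents.c (sketch only, untested):

	-#define EXT4_EXT_ZERO_LEN 7
	+#define EXT4_EXT_ZERO_LEN 16	/* zero out up to 64kB with 4kB blocks */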
If that solves this particular test case it won't necessarily fix the general case, but it would still be a useful fix. If you submit a patch for this, please change this code to compare against a 64kB size instead of a block count, and also take s_raid_stride into account if it is set, like:
	/* zero-out threshold in blocks: at least EXT4_EXT_ZERO_LEN kB
	 * (e.g. 64kB), or a full RAID stride if that is larger;
	 * s_raid_stride is stored little-endian on disk */
	ext_zero_len = max_t(unsigned int,
			     EXT4_EXT_ZERO_LEN * 1024 >> inode->i_blkbits,
			     le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_raid_stride));
This would write up to 64kB, or a full RAID stripe (since it already needs to seek that spindle), whichever is larger. It isn't perfect, since it should really align the zero-out to the RAID stripe to avoid seeking two spindles, but it is a starting point.
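
The alignment might look something like this (a rough sketch only, assuming the zero-out range is carried in a struct ext4_map_blocks and that s_raid_stride is a power of two so round_up()/round_down() apply; untested):

	unsigned int stride =
		le16_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_raid_stride);
	ext4_lblk_t start, end;

	if (stride) {
		/* widen the zero-out window to whole RAID stripes, so the
		 * extra writes stay on the spindle we are already seeking */
		start = round_down(map->m_lblk, stride);
		end = round_up(map->m_lblk + map->m_len, stride);
		map->m_lblk = start;
		map->m_len = end - start;
	}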
Cheers, Andreas