linux-ext4 - Re: [RFC] ext4: block reservation allocation

Message-ID: <20120227150908.GA15097@gmail.com>
Date:	Mon, 27 Feb 2012 23:09:08 +0800
From:	Zheng Liu <gnehzuil.liu@...il.com>
To:	Lukas Czerner <lczerner@...hat.com>
Cc:	linux-ext4@...r.kernel.org, Yongqiang Yang <xiaoqiangnk@...il.com>
Subject: Re: [RFC] ext4: block reservation allocation

On Mon, Feb 27, 2012 at 02:33:28PM +0100, Lukas Czerner wrote:
> On Mon, 27 Feb 2012, Zheng Liu wrote:
> 
> > On Mon, Feb 27, 2012 at 01:00:07PM +0100, Lukas Czerner wrote:
> > > On Mon, 27 Feb 2012, Zheng Liu wrote:
> > > 
> > > > Hi list,
> > > > 
> > > > Now, in ext4, we have multi-block allocation and delayed allocation. They work
> > > > well for most scenarios. However, in some specific scenarios, they cannot help
> > > > us to optimize block allocation. For example, the user may want to indicate some
> > > > file set to be allocated at the beginning of the disk because its speed in this
> > > > position is faster than its speed at the end of disk.
> > > > 
> > > > I have done the following experiment. The experiment is on my own server, which
> > > > has 16 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 48G memory and a 1T sas disk. I
> > > > split this disk into two partitions, one has 900G, and another has 100G. Then I
> > > > use dd to measure the read/write speed. The results are as follows.
> > > > 
> > > > [READ]
> > > > # dd if=/dev/sdk1 of=/dev/null bs=128k count=10000 iflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 9.41151 s, 139 MB/s
> > > > 
> > > > # dd if=/dev/sdk2 of=/dev/null bs=128k count=10000 iflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 17.952 s, 73.0 MB/s
> > > > 
> > > > [WRITE]
> > > > # dd if=/dev/zero of=/dev/sdk1 bs=128k count=10000 oflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 8.46005 s, 155 MB/s
> > > > 
> > > > # dd if=/dev/zero of=/dev/sdk2 bs=128k count=10000 oflag=direct
> > > > 1310720000 bytes (1.3 GB) copied, 15.8493 s, 82.7 MB/s
> > > > 
> > > > So the filesystem could provide a new feature that lets the user specify how
> > > > many blocks to reserve at the beginning of the disk. When the user needs
> > > > to allocate blocks for an important file that must be read/written as
> > > > quickly as possible, the user can use ioctl(2) and/or other means to tell the
> > > > filesystem to allocate these blocks in the reserved area. The user thereby
> > > > obtains higher performance when manipulating this file set.
> > > > 
> > > > This idea is very trivial. So any comments or suggestions are appreciated.
> > > > 
> > > > Regards,
> > > > Zheng
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > > the body of a message to majordomo@...r.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > 
> > > Hi Zheng,
> > > 
> > > I have to admit I do not like it :). I think that this kind of
> > > optimization is useless in the long run. There are several reasons for
> > > this:
> > 
> > Hi Lukas,
> > 
> > Thank you for your opinion. ;-)
> > 
> > > 
> > >  - the test you've done is purely fabricated and does not correspond to a
> > >    real workload at all. Especially because it is done on huge files.
> > >    I can imagine this approach improving boot speed, but you will
> > >    usually have to load just small files, so for a single file it does not
> > >    make much sense. Moreover, with small files more seeks would have to
> > >    be done, hugely reducing the advantage you can see with dd.
> > 
> > I will describe the problem that we encounter. the problem shows that
> > even if files are small, the performance can be improved in some
> > specific scenarios using this block allocation.
> > 
> > >  - HDD might have more platters than just one
> > >  - Your file system might span across several drives
> > >  - On thinly provisioned storage this does not make sense at all
> > >  - SSD's are more and more common and this optimization is useless for
> > >    them.
> > > 
> > > Is there any 'real' problem you would want to solve with this ? Or is it
> > > just something that came to your mind ? I agree that we want to improve
> > > our allocators, but IMHO especially for better scalability, not to cover
> > > this disputable niche.
> > 
> > We encounter a problem in our production system. On a 2TB SATA disk, the
> > files can be divided into two categories. One is index files, and the other
> > is block files. The average size of index files is about 128k and will
> > increase as time goes on. The size of block files is 70M and they are
> > created by fallocate(2). Thus, index files are allocated at the end of the
> > disk. When the application starts up, it needs to load all of the index files
> > into memory, which takes too much time. If we can allocate index files
> > at the beginning of the disk, we will cut down the startup time and
> > increase the service time of this application.
> > 
> > Therefore, I think that it could serve as a generic mechanism for
> > other users who have similar requirements.
> 
> Ok, so this seems like a valid use case. However I think that this is
> exactly something that can be quite easily solved without having to
> modify file system code, right ?
> 
> You can simply use separate drive for the index files, or even raid. Or
> you can actually use an SSD for this, which I believe will give you *a
> lot* better performance improvements and you wont be bothered by the
> size/price ratio for SSD as you would only store indexes there, right ?
> 
> Or, if you really do not want to, or can not, buy new hardware for
> some reason, you can always partition a 2TB disk and put all your index
> files on the smaller, close to the disk center partition. I really do
> not see a reason to modify the code.
> 
> What might be even more interesting is, that you might generally benefit
> from splitting the index/data file systems. The reason is that your data
> file and your index file filesystem might benefit from bigalloc if you
> split them, because you can set different cluster sizes on both file
> systems depending on the file sizes you would actually store there, since
> as I understand the index and data files differ in size significantly.

You are right. I am trying this solution in our test environment. I have
split a 2TB disk into two partitions. One is for index files and is
formatted with bigalloc, and the other is for block files.
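
A minimal sketch of how such a layout might be formatted, under stated
assumptions: the device paths /dev/sdX1 and /dev/sdX2 and the 64 KiB cluster
size are placeholders chosen for illustration, and print_mkfs_plan is a
hypothetical helper. It only prints the mkfs.ext4 commands, so nothing is
formatted by running it.

```shell
#!/bin/sh
# Dry-run sketch: print the mkfs commands for the two-partition layout.
# Device paths and cluster size are placeholders (assumptions), not
# values taken from the thread.

print_mkfs_plan() {
    index_dev=$1      # partition intended for the index files
    block_dev=$2      # partition intended for the block files
    cluster_bytes=$3  # bigalloc cluster size in bytes

    # -O bigalloc enables cluster-based allocation; -C sets the cluster size.
    echo "mkfs.ext4 -O bigalloc -C $cluster_bytes $index_dev"
    # The block-file partition is shown with default ext4 options.
    echo "mkfs.ext4 $block_dev"
}

print_mkfs_plan /dev/sdX1 /dev/sdX2 65536
```

The cluster size is the knob worth experimenting with here; the value above
is only a starting point and should be chosen from the real file sizes.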

> 
> How much of the performance boost do you expect by doing this your way -
> modifying the file system? Note that dd will not tell you that, as I
> explained earlier. It surely would not match using SSD for index files by
> far.
> 
> What do you think?

As Yongqiang said, maybe we can allocate faster blocks for a file that
needs fast reads/writes when the user sets a flag to notify the file
system. Maybe we don't need to implement a new block allocation
algorithm; we only need to modify the current block allocator to
provide this mechanism.
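
To estimate how much startup time a given layout saves on a workload closer
to ours than dd, one option is to time a full read of all the index files.
The following is only a rough sketch: time_index_load is a hypothetical
helper, the directory argument is whatever path holds the index files, and
reading with cat is merely a stand-in for the application's real loading
code.

```shell
#!/bin/sh
# Sketch: measure how many whole seconds it takes to read every regular
# file under a directory, as a rough proxy for the startup load.

time_index_load() {
    dir=$1
    start=$(date +%s)
    # Read each file completely and discard the data.
    find "$dir" -type f -exec cat {} + > /dev/null
    end=$(date +%s)
    # Print the elapsed wall-clock time in seconds.
    echo $((end - start))
}

# Example (the path is a placeholder):
#   time_index_load /mnt/index
```

For a cold-cache number, the page cache should be dropped before each run,
for example as root with "sync; echo 3 > /proc/sys/vm/drop_caches".
Comparing the result across the layouts would show the real gain.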

Regards,
Zheng

> 
> Thanks!
> -Lukas
> 
> 
> 
> > 
> > Regards,
> > Zheng
> > 
> > > 
> > > Anyway, you may try to come up with better experiment. Something which
> > > would actually show how much can we get from the more realistic workload
> > > rather than showing that contiguous serial writes are faster closely to
> > > the center of the disk platter, we know that.
> > > 
> > > Thanks!
> > > -Lukas
> > 
> 
> -- 