Date:	Tue, 20 Nov 2012 13:56:41 -0700
From:	Andreas Dilger <adilger@...ger.ca>
To:	Ivan Zahariev <famzah@...soft.com>
Cc:	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: Ext4 speedup by storing metadata and data on separate devices

On 2012-11-20, at 4:04, Ivan Zahariev <famzah@...soft.com> wrote:
> 
> This suggestion is not about storing the journal on a separate device.
> 
> Many of the tasks on an Ext4 file-system require a full or massive scan of the metadata. A few examples:
> - backup: you need a list of all files whose "mtime" or "size" changed since the last backup
> - reporting: you need a list of all files owned by a particular "group" ID
> - delete: deleting "/home/$user" for someone with lots of data and files
> 
> I know many efforts have been made to keep the (meta)data operations "local" -- this speeds things up a lot on spinning disks, and on SSDs as well. However, having the whole metadata on an SSD (or a RAID-1 of two such disks) could speed up many common tasks considerably. And the hardware price for such a benefit is really affordable now.
> 
> I see two possible implementations:
> 
> 1. Re-work the Ext4 metadata operations (that work with inodes, etc) to read/write on a separate block device.
> 
> or
> 
> 2. Add an option to the "data locality" algorithm to force it to store all metadata only at the beginning of the device (we can pre-allocate enough space). We can then transparently map those blocks in DM to a separate, faster block device, thus keeping the changes to Ext4 minimal.
> 
> Does all this make sense, or am I missing something obvious?

We have implemented the #2 option using LVM: a script maps the first 128MB of the logical volume to SSD (RAID-1) and the next 255 * 128MB to HDD (usually RAID-6), and this pattern repeats as long as there is both HDD and SSD space remaining. This is easily done with lvextend.
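
A rough sketch of what such a script does (the volume group, LV, and device names below are only illustrative, not our actual configuration):

    # First 128MB of the LV comes from the SSD mirror, then 255 * 128MB
    # from the HDD array; repeat until either pool runs out of extents.
    lvcreate -L 128M -n data vg0 /dev/md_ssd
    while :; do
        lvextend -L +32640M vg0/data /dev/md_hdd || break   # 255 * 128MB on HDD
        lvextend -L +128M   vg0/data /dev/md_ssd || break   # next 128MB on SSD
    done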

For mke2fs, specifying a flex_bg factor of "-G 256" and limiting the inode ratio ("-i 69905", for an average file size just over 64kB) allows all of the block bitmaps and inode tables to fit into the first 128MB of the flex group with some space to spare.  This means all of the static metadata is allocated on SSD, and the directory allocations are also biased toward the remaining space in the first flex_bg group.
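
Concretely, with the LV from the sketch above the mke2fs invocation is along these lines (again, the device name is illustrative):

    # 256 block groups per flex group, roughly one inode per 68KB of data,
    # so each flex group's bitmaps and inode tables fit in its first 128MB
    # (with 4KB blocks a block group covers 128MB).
    mke2fs -t ext4 -G 256 -i 69905 /dev/vg0/data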

It isn't elegant, but it works with minimal complexity.

There was also a discussion about implementing the #1 option, to have ext4 access multiple devices for data/metadata, but nobody has actually started to implement this.

Cheers, Andreas