linux-kernel - Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <1B78C9B4-CC2B-43B3-8513-2221925D25A0@dilger.ca>
Date:	Wed, 22 Aug 2012 01:14:42 -0600
From:	Andreas Dilger <adilger@...ger.ca>
To:	NeilBrown <neilb@...e.de>
Cc:	Yuanhan Liu <yuanhan.liu@...ux.intel.com>,
	Fengguang Wu <fengguang.wu@...el.com>,
	Li Shaohua <shli@...ionio.com>, Theodore Ts'o <tytso@....edu>,
	Marti Raudsepp <marti@...fo.org>,
	Kernel hackers <linux-kernel@...r.kernel.org>,
	ext4 hackers <linux-ext4@...r.kernel.org>, maze@...gle.com,
	"Shi, Alex" <alex.shi@...el.com>, linux-fsdevel@...r.kernel.org,
	linux RAID <linux-raid@...r.kernel.org>
Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On 2012-08-22, at 12:00 AM, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <yuanhan.liu@...ux.intel.com>
> wrote:
>> 
>> -#define NR_STRIPES		256
>> +#define NR_STRIPES		1024
> 
> Changing one magic number into another magic number might help your case, but
> it not really a general solution.

We've actually been carrying a patch for a few years in Lustre to
increase the NR_STRIPES to 2048, and made it a configurable module
parameter.  This made a noticeable improvement to the performance
for fast systems.

> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
> 
> I thing the problems with RAID5 are deeper than that.  Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.

The other MD RAID-5/6 patches that we have change the page submission
order to avoid the need to merge pages in the elevator so much, and a
patch to allow zero-copy IO submission if the caller marks the page for
direct IO (indicating it will not be modified until after IO completes).
This avoids a lot of overhead on fast systems.

This isn't really my area of expertise, but patches against RHEL6
could be seen at http://review.whamcloud.com/1142 if you want to
take a look.  I don't know if that code is at all relevant to what
is in 3.x today.

> I don't think the size of the cache is a big part of the solution.  I think
> correct scheduling of IO is the real answer.

My experience is that on fast systems the IO scheduler just gets in the
way.  Submitting larger contiguous IOs to each disk in the first place
is far better than trying to merge small IOs again at the back end.

Cheers, Andreas

Download attachment "PGP.sig" of type "application/pgp-signature" (236 bytes)