Message-ID: <4B961FE1.5040105@msgid.tls.msk.ru>
Date: Tue, 09 Mar 2010 13:16:01 +0300
From: Michael Tokarev <mjt@....msk.ru>
To: Karel Zak <kzak@...hat.com>
CC: Mike Snitzer <snitzer@...hat.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Tejun Heo <tj@...nel.org>,
"linux-ide@...r.kernel.org" <linux-ide@...r.kernel.org>,
lkml <linux-kernel@...r.kernel.org>,
Daniel Taylor <Daniel.Taylor@....com>,
Jeff Garzik <jeff@...zik.org>, Mark Lord <kernel@...savvy.com>,
tytso@....edu, "H. Peter Anvin" <hpa@...or.com>,
hirofumi@...l.parknet.co.jp,
Andrew Morton <akpm@...ux-foundation.org>,
Alan Cox <alan@...rguk.ukuu.org.uk>, irtiger@...il.com,
Matthew Wilcox <matthew@....cx>, aschnell@...e.de,
knikanth@...e.de, jdelvare@...e.de,
Jim Meyering <jim@...ering.net>, Neil Brown <neilb@...e.de>
Subject: Re: ATA 4 KiB sector issues.
Karel Zak wrote:
> On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
[]
>> Think of a raid5 array - with all the mentioned good stuff in place
>> fdisk should figure out to align partitions on the array stripe
>> boundary, and should do that automatically. And this should be
>
> Yes. For userspace there is not a difference between RAID and non-RAID
> device -- the topology support in kernel provides unified API to all
> devices. It means we needn't any extra support for RAIDs in
> fdisk/parted. The userspace tools follow topology data from kernel.
>
> The good thing with 1MiB default alignment is that it is usable for
> usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
> size).
No, it's not that simple. For raid5 (and I especially mentioned raid5
above), raid4 and raid6, the full stripe is the chunk size times the
number of data disks, so 1MiB is only good when the number of devices
is 2^N+1 (for raid[45]) or 2^N+2 (for raid6). For raid5 that means
3, 5, 9, 17, .. disks. In all other cases the alignment (which should
match the stripe size) will not be a power of two. For example, for a
4-disk raid5 array with 1MiB chunk size the partitions should be
aligned at 3MiB boundaries. For 6-disk raid5 with 256KiB chunk size it
is 5x256=1280 KiB. And so on.
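The stripe arithmetic above can be sketched in a few lines of Python.
This is only an illustration: the PARITY_DISKS table and both function
names are hypothetical helpers, not anything md or mdadm provides.

```python
KIB = 1024
MIB = 1024 * KIB

# Parity disks per level. Hypothetical helper table, not an md/mdadm API.
PARITY_DISKS = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def full_stripe_bytes(level, raid_devices, chunk_bytes):
    """Data bytes in one full stripe: chunk size times number of data disks."""
    data_disks = raid_devices - PARITY_DISKS[level]
    if data_disks < 1:
        raise ValueError("not enough devices for " + level)
    return data_disks * chunk_bytes


def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


examples = [("raid5", 4, 1 * MIB), ("raid5", 6, 256 * KIB), ("raid5", 5, 256 * KIB)]
for level, devices, chunk in examples:
    stripe = full_stripe_bytes(level, devices, chunk)
    print(level, devices, "disks:", stripe // KIB, "KiB,",
          "power of two" if is_power_of_two(stripe) else "not a power of two")
```

Only the 5-disk case (2^2+1 devices) gives a power-of-two stripe, so
only there does a 1MiB boundary coincide with a stripe boundary.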
Yes, this has little to do with the $subject (4KiB sectors), but it is
still closely related.
>> most easy to debug/test, since the whole thing is controllable
>> by kernel.
>
> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> It works as expected.
Actually, for raid0, the alignment is questionable. Should it be a
multiple of the chunk size or of the whole stripe size? I'm not sure,
both ways have bad and good sides.. But if it is the latter, the same
issue pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
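To illustrate the two candidate rules for raid0, here is a small Python
sketch that checks a 1MiB partition offset against both. The 64KiB
chunk size is an assumption made for the example, and the function name
is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def raid0_alignment_report(offset_bytes, chunk_bytes, raid_devices):
    """Check an offset against chunk alignment and full-stripe alignment."""
    stripe_bytes = chunk_bytes * raid_devices  # raid0 has no parity disks
    return {
        "chunk_aligned": offset_bytes % chunk_bytes == 0,
        "stripe_aligned": offset_bytes % stripe_bytes == 0,
    }


# Assumed example: 3-disk raid0, 64KiB chunks, partition starting at 1MiB.
report = raid0_alignment_report(1 * MIB, 64 * KIB, 3)
print(report)
```

With these numbers the offset is a multiple of the chunk size but not
of the 192KiB stripe, which is exactly the case where the two rules
disagree.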
[]
> Disk /dev/sdb: 2621 MB, 2621440000 bytes
> 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 32768 bytes
Good.
> # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
That's a stripe of 3 data disks with the default 64KiB chunk size, which
makes 3x64=192KiB - the number to which everything should be aligned.
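As a rough sketch of what aligning to such a stripe means in sectors:
3 data disks times 64KiB is 192KiB, or 384 sectors of 512 bytes. The
Python below rounds a candidate start sector up to the next stripe
boundary. The starting point of sector 2048 (the 1MiB default) is an
assumption for the example, and align_up is a hypothetical helper.

```python
SECTOR = 512
KIB = 1024


def align_up(value, multiple):
    """Round value up to the next multiple; works for non-powers of two."""
    return -(-value // multiple) * multiple


stripe_bytes = 3 * 64 * KIB               # 3 data disks, 64KiB chunks
stripe_sectors = stripe_bytes // SECTOR   # full stripe in 512-byte sectors
start_sector = align_up(2048, stripe_sectors)
print(stripe_sectors, start_sector)
```

Because the stripe is not a power of two, the usual mask trick
(value & ~(multiple - 1)) does not apply, hence the division.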
> # fdisk -lcu /dev/md8
>
> Disk /dev/md8: 1572 MB, 1572667392 bytes
> 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 65536 bytes / 65536 bytes
And here we go: fdisk does not see the right number: neither of the
reported I/O sizes is divisible by 3.
[]
> # cat /sys/block/md8/md8p{1,2}/alignment_offset
> 0
> 0
And that's where the issue is. md does not {sup,re}port all
this stuff yet.
This is what I'm talking about.
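For anyone who wants to look at what the kernel exports for a given
device, here is a minimal Python sketch that reads the topology files
from sysfs. The file names are the standard block-layer ones; the
function name and the sysfs_root parameter (handy for testing against a
fake tree) are hypothetical.

```python
from pathlib import Path


def read_topology(disk, sysfs_root="/sys/block"):
    """Read I/O topology values for a whole disk; missing files give None."""
    base = Path(sysfs_root) / disk
    files = {
        "alignment_offset": base / "alignment_offset",
        "minimum_io_size": base / "queue" / "minimum_io_size",
        "optimal_io_size": base / "queue" / "optimal_io_size",
        "physical_block_size": base / "queue" / "physical_block_size",
    }
    topology = {}
    for name, path in files.items():
        try:
            topology[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File absent or unreadable, e.g. the device does not exist.
            topology[name] = None
    return topology


# "md8" is the array name from the quoted example; all values are None
# on a machine without that device.
print(read_topology("md8"))
```

Comparing optimal_io_size with the expected full-stripe size is a quick
way to see whether a given md array reports its geometry.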
Thanks!
/mjt