linux-kernel - Re: Small writes being split with fdatasync based on non-aligned partition ending


Message-ID: <CALjAwxg3Y2a+ahW6apM2dw5HfZ4+F13cPL92iW7atypFnEMa_w@mail.gmail.com>
Date:	Thu, 11 Feb 2016 03:48:45 +0000
From:	Sitsofe Wheeler <sitsofe@...il.com>
To:	Jens Rosenboom <j.rosenboom@...on.de>
Cc:	Fio <fio@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	parted-devel@...ts.alioth.debian.org, linux-block@...r.kernel.org
Subject: Re: Small writes being split with fdatasync based on non-aligned
 partition ending

Trying to cc the GNU parted and linux-block mailing lists.

On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom@...on.de> wrote:
> While trying to reproduce some performance issues I have been seeing
> with Ceph, I have come across a strange behaviour which is seemingly
> affected only by the end point (and thereby the size) of a partition
> being an odd number of sectors. Since all documentation about
> alignment only refers to the starting point of the partition, this was
> pretty surprising and I would like to know whether this is expected
> behaviour or maybe a kernel issue.
>
> The command I am using is pretty simple:
>
> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k --filename=/dev/sdb2 --runtime=10 --name=test
>
> The difference shows itself when the partition is created either by
> sgdisk or by parted:
>
> sgdisk --new=2:6000M: /dev/sdb
>
> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>
> The difference in the partition table looks like this:
>
> <  2      6291456000B  1600320962559B  1594029506560B               osd-device-1-block
> ---
>>  2      6291456000B  1600321297919B  1594029841920B               osd-device-1-block
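
For reference, the arithmetic behind the two table rows can be checked with a short sketch. The byte values are copied from the diff above; the 512-byte sector size, the 4096-byte block size, and the `describe` helper are assumptions made for illustration, and the labels only name the diff sides since the quoted text does not say which tool produced which row.

```python
# Minimal sketch: check start/size alignment for the two partition-table rows.
# SECTOR and BLOCK are assumed values (512-byte sectors, 4 KiB blocks).
SECTOR = 512
BLOCK = 4096

START = 6291456000  # start offset in bytes, identical in both rows
ENDS = {
    "diff '<' row": 1600320962559,  # last byte of the partition
    "diff '>' row": 1600321297919,
}


def describe(start, end):
    """Return size and alignment facts for a partition given inclusive byte offsets."""
    size = end - start + 1
    sectors = size // SECTOR
    return {
        "size_bytes": size,
        "size_sectors": sectors,
        "start_4k_aligned": start % BLOCK == 0,
        "size_4k_multiple": size % BLOCK == 0,
        "odd_sector_count": sectors % 2 == 1,
    }


if __name__ == "__main__":
    for label, end in ENDS.items():
        print(label, describe(START, end))
```

Under these assumptions both rows start on a 4 KiB boundary, and only the row ending at 1600321297919B has a size that is not a multiple of 4 KiB (it comes out as an odd number of 512-byte sectors).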

Looks like parted took you at your word when you asked for your
partition to end at 100%. Just out of curiosity, if you try to make the
same partition interactively with parted, do you get any warnings after
making it and after running align-check?

> So this is really only the end of the partition that is different.
> However, in the first case, the 4k writes all get broken up into 512b
> writes somewhere in the kernel, as can be seen with btrace:
>
>   8,16   3       36     0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
>   8,16   3       37     0.000102739  8184  Q  WS 12353985 + 1 [fio]
>   8,16   3       38     0.000102875  8184  M  WS 12353985 + 1 [fio]
>   8,16   3       39     0.000103038  8184  A  WS 12353986 + 1 <- (8,18) 65986
>   8,16   3       40     0.000103109  8184  Q  WS 12353986 + 1 [fio]
>   8,16   3       41     0.000103196  8184  M  WS 12353986 + 1 [fio]
>   8,16   3       42     0.000103335  8184  A  WS 12353987 + 1 <- (8,18) 65987
>   8,16   3       43     0.000103403  8184  Q  WS 12353987 + 1 [fio]
>   8,16   3       44     0.000103489  8184  M  WS 12353987 + 1 [fio]
>   8,16   3       45     0.000103609  8184  A  WS 12353988 + 1 <- (8,18) 65988
>   8,16   3       46     0.000103678  8184  Q  WS 12353988 + 1 [fio]
>   8,16   3       47     0.000103767  8184  M  WS 12353988 + 1 [fio]
>   8,16   3       48     0.000103879  8184  A  WS 12353989 + 1 <- (8,18) 65989
>   8,16   3       49     0.000103947  8184  Q  WS 12353989 + 1 [fio]
>   8,16   3       50     0.000104035  8184  M  WS 12353989 + 1 [fio]
>   8,16   3       51     0.000104150  8184  A  WS 12353990 + 1 <- (8,18) 65990
>   8,16   3       52     0.000104219  8184  Q  WS 12353990 + 1 [fio]
>   8,16   3       53     0.000104307  8184  M  WS 12353990 + 1 [fio]
>   8,16   3       54     0.000104452  8184  A  WS 12353991 + 1 <- (8,18) 65991
>   8,16   3       55     0.000104520  8184  Q  WS 12353991 + 1 [fio]
>   8,16   3       56     0.000104609  8184  M  WS 12353991 + 1 [fio]
>   8,16   3       57     0.000104885  8184  I  WS 12353984 + 8 [fio]
>
> whereas in the second case, I'm getting the expected 4k writes:
>
>   8,16   6       42 1266874889.659842036  8409  A  WS 12340232 + 8 <- (8,18) 52232
>   8,16   6       43 1266874889.659842167  8409  Q  WS 12340232 + 8 [fio]
>   8,16   6       44 1266874889.659842393  8409  G  WS 12340232 + 8 [fio]
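
To make the two traces easier to compare, here is a small sketch that pulls the remap ("A") lines out of btrace output and reports the request size and the disk-versus-partition sector offset. The regular expression and the `remaps` helper are hypothetical and written only against the line layout quoted above; the sample lines are copied from the traces, and 512-byte sectors are assumed.

```python
import re

# Matches remap lines such as:
#   8,16   3   36   0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
REMAP = re.compile(
    r"\sA\s+WS\s+(\d+)\s+\+\s+(\d+)\s+<-\s+\((\d+),(\d+)\)\s+(\d+)"
)


def remaps(trace_text):
    """Yield (disk_sector, n_sectors, partition_sector) for each 'A' line."""
    for line in trace_text.splitlines():
        m = REMAP.search(line)
        if m:
            yield int(m.group(1)), int(m.group(2)), int(m.group(5))


# Two lines from the first trace and one from the second, as quoted.
SPLIT = """\
  8,16   3       36     0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
  8,16   3       39     0.000103038  8184  A  WS 12353986 + 1 <- (8,18) 65986
"""
WHOLE = (
    "  8,16   6       42 1266874889.659842036  8409  A  WS 12340232 + 8"
    " <- (8,18) 52232\n"
)

if __name__ == "__main__":
    for name, text in (("first case", SPLIT), ("second case", WHOLE)):
        for disk, count, part in remaps(text):
            # Assumes 512-byte sectors when converting the count to bytes.
            print(name, "bytes:", count * 512, "offset:", disk - part)
```

In both traces the offset works out to 12288000 sectors, which matches the 6291456000B partition start at 512 bytes per sector, so the remapping itself looks identical; only the per-request size differs (1 sector versus 8).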

This is weird because --size=1G should mean that fio is "seeing" an
aligned end. Does direct=1 with a sequential job of iodepth=1 show the
problem too?
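
Something along these lines is what I have in mind, written as a sketch that only builds the command line rather than running it. The helper name is hypothetical, and keeping --size, --bs, --runtime and --name from the original job while dropping --fdatasync is my assumption about the comparison, not something established in this thread.

```python
import shlex


def fio_direct_sequential_argv(filename="/dev/sdb2"):
    """Build (but do not run) a fio command line for a direct, sequential job.

    Which options to carry over from the original job is an assumption.
    """
    return [
        "fio",
        "--name=test",
        f"--filename={filename}",
        "--rw=write",      # sequential writes instead of randwrite
        "--bs=4k",
        "--size=1G",
        "--runtime=10",
        "--direct=1",      # O_DIRECT, bypassing the page cache
        "--iodepth=1",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command; running it overwrites data on the device.
    print(shlex.join(fio_direct_sequential_argv()))
```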

> The above examples are from running with an SSD, where the small
> writes get merged together again before hitting the block device,
> which is still pretty o.k. performance wise. But when I run the same
> test on some NVMe device, the writes do not get merged, instead the
> performance drops to less than 10% of what I get in the second case.

Perhaps the I/O scheduler doesn't get the opportunity to merge them with
the NVMe device...

> If this is indeed expected behaviour from the kernel pov, it might
> need some better documentation and probably sgdisk should also be
> enhanced to align the end of the partition as well. FWIW, this happens
> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.

Do you mean parted?

-- 
Sitsofe | http://sucs.org/~sits/