Message-ID: <i70pn6$otu$1@dough.gmane.org>
Date:	Fri, 17 Sep 2010 18:22:28 -0400
From:	Brett Russ <icycle+lkml@...il.com>
To:	linux-kernel@...r.kernel.org
Subject: Re: O_DIRECT reads appear to be cached on block device partition file?

Dave Chinner wrote:
> On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
>> If I run the above on the monitoring blade, then sync an update to
>> the sector in question from another blade, then re-run the above
>> code on the monitoring blade, believe it or not I appear to be
>> reading stale data.  If I use dd with iflag=direct, reading the same
>> sector offset at the /dev/sdX3 partition file, I see the same stale
>> data as seen from the code above.  If, however, I instead access
>> this sector offset from the /dev/sdX device file using the (offset
>> of partition 3 + offset of the sector) I see the intended data,
>> which makes me believe some caching occurred locally for /dev/sdX3.
>
> What does blktrace tell you?

Thanks Dave for the pointer to blktrace.  I'd not used this before.

The short answer is that I now trust O_DIRECT.  What sent me down this
path in the first place was a stale cache in our application.

The longer answer, explaining how my dd double-check could have gone
wrong, follows:

I've discovered that the start-of-partition LBA does not *always* agree 
between the kernel (reported by blktrace and sysfs) and utilities such 
as {fdisk|sfdisk}.  This means that my experiment of accessing the 
sector within the partition via the parent device may have been invalid, 
since I was trusting fdisk to determine the correct sector offset of the 
partition.

> spu0103# fdisk -l -u /dev/sdbk
...
> Units = sectors of 1 * 512 = 512 bytes
>
>     Device Boot      Start         End      Blocks  Id System
...
> /dev/sdbk3      1197742140  1944780704   373519282+ 83 Linux

> spu0103# sfdisk -uS -l /dev/sdbk
...
> Units = sectors of 512 bytes, counting from 0
>
>    Device Boot    Start       End   #sectors  Id  System
...
> /dev/sdbk3     1197742140 1944780704  747038565  83  Linux

> spu0103# cat /sys/block/sdbk/sdbk3/start
> 1197934920
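
To repeat this comparison on other drives, something like the Python
sketch below might help.  It assumes the sysfs layout shown above
(/sys/block/<disk>/<partition>/start, in 512-byte sectors).  The function
names and the sysfs_root parameter are made up for illustration, and the
fdisk value still has to be copied in by hand.

```python
from pathlib import Path

SECTOR_SIZE = 512  # sysfs reports partition starts in 512-byte sectors


def kernel_partition_start(disk, part, sysfs_root="/sys/block"):
    """Return the partition start sector as the kernel sees it."""
    start_file = Path(sysfs_root) / disk / part / "start"
    return int(start_file.read_text().strip())


def start_mismatch(kernel_start, tool_start):
    """Return (sectors, bytes) by which the kernel start exceeds the tool's."""
    delta = kernel_start - tool_start
    return delta, delta * SECTOR_SIZE


# Values from the output above: sysfs says 1197934920, fdisk says 1197742140.
print(start_mismatch(1197934920, 1197742140))  # (192780, 98703360)
```

On the affected machine, kernel_partition_start("sdbk", "sdbk3") would
take the place of the hard-coded sysfs value.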

The above discrepancy was also shown with blktrace:

> spu0103# blkparse -q 1
> Input file 1.blktrace.5 added
> Input file 1.blktrace.6 added
> Input file 1.blktrace.7 added
>

This command:

> spu0103# dd-7.1  if=/dev/sdbk3 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace:

>  67,224  5        1     0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0

Note that the kernel remapped the access to sdbk3 (offset 0) to sdbk
(offset 1197934920); the major:minor numbers for both devices are listed
further below.  That is quite different from the partition start of
1197742140 shown by fdisk: the two differ by 192780 sectors.

>  67,224  5        2     0.000000564 29726  Q   R 1197934920 + 1 [dd-7.1]
>  67,224  5        3     0.000004032 29726  G   R 1197934920 + 1 [dd-7.1]
>  67,224  5        4     0.000006223 29726  P   N [dd-7.1]
>  67,224  5        5     0.000008152 29726  I   R 1197934920 + 1 [dd-7.1]
>  67,224  5        6     0.000009916 29726  U   N [dd-7.1] 1
>  67,224  5        7     0.000012286 29726  D   R 1197934920 + 1 [dd-7.1]
>  67,224  7        1     0.006802504     0  C   R 1197934920 + 1 [0]
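
The remap ("A") line at the top of this trace is the one that exposes the
partition start the kernel is using.  Below is a rough Python sketch for
pulling that value out of blkparse output.  The regular expression is
written only against the default line format seen in this trace, and
parse_remap is a hypothetical helper, not part of blktrace.

```python
import re

# Matches a blkparse remap ("A") line in the default output format, e.g.
#   67,224  5  1  0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0
REMAP_RE = re.compile(
    r"^\s*(?P<maj>\d+),(?P<min>\d+)\s+"        # device the I/O was mapped to
    r"\d+\s+\d+\s+\d+\.\d+\s+\d+\s+"           # cpu, sequence, time, pid
    r"A\s+\S+\s+"                              # action 'A', then RWBS flags
    r"(?P<sector>\d+)\s+\+\s+(?P<count>\d+)\s+<-\s+"
    r"\((?P<src_maj>\d+),(?P<src_min>\d+)\)\s+(?P<src_sector>\d+)\s*$"
)


def parse_remap(line):
    """Return remap details for an 'A' line, or None for any other line."""
    m = REMAP_RE.match(line)
    if m is None:
        return None
    sector = int(m.group("sector"))
    src_sector = int(m.group("src_sector"))
    return {
        "target_dev": (int(m.group("maj")), int(m.group("min"))),
        "source_dev": (int(m.group("src_maj")), int(m.group("src_min"))),
        "target_sector": sector,
        "source_sector": src_sector,
        # Start of the partition on the parent device, as the kernel sees it.
        "partition_start": sector - src_sector,
    }


line = " 67,224  5        1     0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0"
print(parse_remap(line)["partition_start"])  # 1197934920
```

Any leading "> " quoting would need to be stripped before the line is
passed in.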

And this command (accessing the start of partition using fdisk sector 
offset):

> spu0103# dd-7.1  if=/dev/sdbk skip=1197742140 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace (as expected):

>  67,224  7        2    75.330506824 29924  Q   R 1197742140 + 1 [dd-7.1]
>  67,224  7        3    75.330509804 29924  G   R 1197742140 + 1 [dd-7.1]
>  67,224  7        4    75.330511985 29924  P   N [dd-7.1]
>  67,224  7        5    75.330513836 29924  I   R 1197742140 + 1 [dd-7.1]
>  67,224  7        6    75.330515495 29924  U   N [dd-7.1] 1
>  67,224  7        7    75.330517901 29924  D   R 1197742140 + 1 [dd-7.1]
>  67,224  6        1    75.340722638     0  C   R 1197742140 + 1 [0]

The aforementioned major/minor numbers:

> spu0103# ls -l /dev/|grep 67|grep '22[47]'
> brw-rw-rw-    1 root     root      67, 224 Sep 15 11:59 sdbk
> brw-rw-rw-    1 root     root      67, 227 Sep 15 11:59 sdbk3
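
The same major/minor lookup can be done without grepping ls output.  A
small Python sketch, assuming a Linux system; split_rdev and
block_device_numbers are illustrative names only:

```python
import os
import stat


def split_rdev(rdev):
    """Split a raw device number into (major, minor)."""
    return os.major(rdev), os.minor(rdev)


def block_device_numbers(path):
    """Return (major, minor) for a block device node such as /dev/sdbk3."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return split_rdev(st.st_rdev)


# Round trip using the numbers for sdbk3 shown above.
print(split_rdev(os.makedev(67, 227)))  # (67, 227)
```

The pair returned for the partition node can then be matched against the
"(major,minor)" field on the right-hand side of a blkparse remap line.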

*All* other drives in my system that I tested do show a match between 
the 3 methods above (fdisk, sfdisk, sysfs).

I don't know how this discrepancy with the partition start could have 
been introduced, but it is most likely a byproduct of my testing.

Thanks,
Brett

