Date:	Tue, 7 Apr 2009 00:31:53 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Apollon Oikonomopoulos <ao-lkml@....grnet.gr>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Block device cache issue

On Thu, 2 Apr 2009 17:52:05 +0300 Apollon Oikonomopoulos <ao-lkml@....grnet.gr> wrote:

> Greetings to the list,
> 
> At my company, we have come across something that we think is a design 
> limitation in the way the Linux kernel handles block device caches.  I 
> will first describe the incident we encountered, before speculating on 
> the actual cause.
> 
> As part of our infrastructure, we are running some Linux servers used as 
> Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain 
> normal MBR partition tables. At some point we came across a VM that, 
> due to a misconfiguration of GRUB, failed to boot. We used 
> multipath-tools' kpartx to create a device-mapper device pointing to the 
> first partition of the LUN, mounted the filesystem, changed 
> boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.  
> To our surprise, Xen's pygrub showed the boot menu exactly as it was 
> before the changes we made. We double-checked that the changes we made 
> were indeed there and tried to find out what was actually going on.
> 
> As it turned out, the LUN device's read buffers had not been updated;  
> losetup'ing the LUN device with the proper offset to the first partition 
> and mounting it gave us exactly the image of the filesystem as it was 
> _before_ our changes. We started digging into the kernel's buffer 
> internals and came to the conclusion [1] that every block device has 
> its own pagecache, attached to a hash of (major, minor), which is 
> independent of the caches of its containing or contained devices.  
> 
> Now, in practice one rarely, if ever, accesses the same data through 
> these two different paths (disk + partition), except in scenarios like 
> this one. However, there currently seems to be an implicit assumption 
> that these two paths should not be used in the same "uptime" cycle at 
> all, at least not without dropping the caches.  For the record, I 
> managed to reproduce the whole issue by reading a single block through 
> sda, dd'ing random data to it through sda1 and re-reading it through 
> sda: the re-read still returned the old contents (even hours later); I 
> only saw the new data when using O_DIRECT, and finally after dropping 
> all caches via /proc/sys/vm/drop_caches.
> 
> And now we come to the question part: Can someone please verify that the 
> above statements are correct, or am I missing something?

The above statements are correct ;)

Similarly, the pagecache for /etc/passwd is separate from the
pagecache for the device upon which /etc is mounted.
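
For the record, the reproduction described above boils down to
something like this minimal sketch (the device names and the partition
start offset are assumptions for illustration; it overwrites real
data, so only point it at a scratch disk):

/*
 * Minimal sketch of the repro. Assumptions: /dev/sdX is a scratch
 * disk whose first partition starts at PART_OFF bytes (check
 * "fdisk -lu"); this destroys data, never run it on a live disk.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PART_OFF	(63ULL * 512)	/* assumed start of sdX1 */
#define BLK		512

int main(void)
{
	char before[BLK], junk[BLK], after[BLK];
	int disk = open("/dev/sdX", O_RDONLY);
	int part = open("/dev/sdX1", O_WRONLY);

	if (disk < 0 || part < 0) {
		perror("open");
		return 1;
	}

	/* 1. Read through the whole-disk node: populates its pagecache. */
	pread(disk, before, BLK, PART_OFF);

	/* 2. Overwrite the same sector through the partition node. */
	memset(junk, 0x5a, BLK);
	pwrite(part, junk, BLK, 0);
	fsync(part);

	/* 3. Re-read through the whole-disk node: the read is served
	 *    from that node's own, now stale, pagecache. */
	pread(disk, after, BLK, PART_OFF);
	printf("whole-disk read is %s\n",
	       memcmp(before, after, BLK) ? "up to date" : "stale");
	return 0;
}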

> If they are, 
> should the partition's buffers perhaps be linked with those of the 
> containing device, or even be part of them? I 
> don't even know if this is possible without significant overhead in the 
> page cache (of which my understanding is very shallow), but keep in mind 
> that this behaviour almost led to filesystem corruption (luckily we only 
> changed a single file and hit a single inode).

It would incur overhead.  We could perhaps fix it by having a single
cache for /dev/sda and then just making /dev/sda1 access that cache
with an offset.  But it rarely if ever comes up; I guess the few
applications which do this sort of thing are taking suitable steps to
avoid it: fsync, ioctl(BLKFLSBUF), posix_fadvise(POSIX_FADV_DONTNEED),
O_DIRECT, etc.
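
For completeness, a sketch of what those steps can look like from
userspace (the device name is illustrative; BLKFLSBUF needs
CAP_SYS_ADMIN, and POSIX_FADV_DONTNEED only discards clean pages,
hence the fsync first):

/*
 * Sketch: invalidating a block device's cached pages before
 * re-reading it through another path. Device name is illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKFLSBUF */
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sdX", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Write back anything dirty first; DONTNEED skips dirty pages. */
	fsync(fd);

	/* Option 1: flush and invalidate the device's buffers and
	 * pagecache (requires CAP_SYS_ADMIN). */
	if (ioctl(fd, BLKFLSBUF, 0) < 0)
		perror("ioctl(BLKFLSBUF)");

	/* Option 2: drop cached pages for this fd; len 0 means "to the
	 * end". Note posix_fadvise returns an errno value directly. */
	int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

	/* Reads from here on go to the media, not a stale cache. */
	close(fd);
	return 0;
}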



