Message-ID: <20090402145205.GG30077@apollon.noc.grnet.gr>
Date:	Thu, 2 Apr 2009 17:52:05 +0300
From:	Apollon Oikonomopoulos <ao-lkml@....grnet.gr>
To:	linux-kernel@...r.kernel.org
Subject: Block device cache issue

Greetings to the list,

At my company, we have come across something that we think is a design 
limitation in the way the Linux kernel handles block device caches.  I 
will first describe the incident we encountered, before speculating on 
the actual cause.

As part of our infrastructure, we run some Linux servers as Xen Dom0s, with 
SAN LUNs serving as the VMs' disk images, so these LUNs contain normal MBR 
partition tables. At some point we came across a VM that - due to a 
misconfigured GRUB - failed to come back up after a reboot. We used 
multipath-tools' kpartx to create a device-mapper device pointing to the 
first partition of the LUN, mounted the filesystem, fixed 
boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.  
To our surprise, Xen's pygrub showed the boot menu exactly as it had been 
before our changes. We double-checked that the changes were indeed there 
and tried to find out what was actually going on.

As it turned out, the LUN device's read buffers had not been updated: 
losetup'ing the LUN device with the proper offset to the first partition 
and mounting it gave us the image of the filesystem exactly as it was 
_before_ our changes. We started digging into the kernel's buffer 
internals and came to the conclusion [1] that every block device has 
its own page cache, hashed by (major,minor) and independent of the 
caches of its containing or contained devices.

Now, in practice one rarely - if ever - accesses the same data through 
these two different paths (whole disk and partition), except in scenarios 
like this one. Currently, however, there seems to be an implicit assumption 
that the two paths must not be used within the same "uptime" cycle at all, 
at least not without dropping the caches in between. For the record, I 
managed to reproduce the whole issue by reading a single block through 
sda, dd'ing random data over it through sda1 and re-reading it through 
sda: the contents I got back were unchanged (even hours later), and I only 
saw the up-to-date data when reading with O_DIRECT and, finally, after 
dropping all caches through /proc/sys/vm/drop_caches.
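
In case it helps, the reproduction boils down to something like the rough C 
sketch below. The device names /dev/sdX and /dev/sdX1 and the PART_OFFSET 
value are placeholders for a scratch disk, and step 2 overwrites data, so 
please don't point it at anything you care about:

/*
 * Rough sketch of the reproduction, assuming a scratch disk /dev/sdX whose
 * first partition /dev/sdX1 starts PART_OFFSET bytes into the disk (both
 * placeholders).  Step 2 is destructive.
 */
#define _GNU_SOURCE		/* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PART_OFFSET	(63ULL * 512)	/* assumed start of sdX1 */
#define BLK		4096

static void *abuf(void)			/* O_DIRECT needs aligned buffers */
{
	void *p;
	if (posix_memalign(&p, BLK, BLK)) {
		fprintf(stderr, "posix_memalign failed\n");
		exit(1);
	}
	return p;
}

static void read_block(const char *dev, off_t off, int flags, void *buf)
{
	int fd = open(dev, O_RDONLY | flags);
	if (fd < 0 || pread(fd, buf, BLK, off) != BLK) {
		perror(dev);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	void *before = abuf(), *after = abuf(), *direct = abuf(), *junk = abuf();
	int fd;

	/* 1. Read the partition's first block through the whole-disk node. */
	read_block("/dev/sdX", PART_OFFSET, 0, before);

	/* 2. Overwrite that same block through the partition node. */
	memset(junk, 0xaa, BLK);
	fd = open("/dev/sdX1", O_WRONLY);
	if (fd < 0 || pwrite(fd, junk, BLK, 0) != BLK) {
		perror("/dev/sdX1");
		exit(1);
	}
	fsync(fd);			/* make sure it reaches the disk */
	close(fd);

	/* 3. A buffered re-read through the whole disk is served from sdX's
	 *    own page cache and still shows the old contents.              */
	read_block("/dev/sdX", PART_OFFSET, 0, after);
	printf("buffered re-read still stale: %s\n",
	       memcmp(before, after, BLK) == 0 ? "yes" : "no");

	/* 4. O_DIRECT bypasses that cache and sees the new data. */
	read_block("/dev/sdX", PART_OFFSET, O_DIRECT, direct);
	printf("O_DIRECT sees the new data:  %s\n",
	       memcmp(direct, junk, BLK) == 0 ? "yes" : "no");

	return 0;
}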

And now we come to the question: can someone please verify that the above 
statements are correct, or am I missing something? If they are, should the 
partition's buffers perhaps be linked with those of the containing device, 
or even be part of them? I don't know whether this is possible without 
significant overhead in the page cache (of which my understanding is very 
shallow), but keep in mind that this behaviour almost led to filesystem 
corruption (luckily we had changed only a single file and touched a single 
inode).

Thank you for your time. Cheers,
Apollon

PS: I am not subscribed to the list, so I would appreciate it if you could 
    Cc any replies to my address.


[1] If I'm interpreting the contents of fs/buffer.c and 
include/linux/buffer_head.h correctly. Unfortunately, I'm not a kernel 
hacker, so I apologise if I'm mistaken on this point.

-- 
-----------------------------------------------------------
 Apollon Oikonomopoulos - GRNET Network Operations Centre
 Greek Research & Technology Network - http://www.grnet.gr
----------------------------------------------------------- 
