linux-kernel - Re: ftruncate-mmap: pages are lost after writing to mmaped file.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <604427e00903271335p14f96910y912ccd349faf154c@mail.gmail.com>
Date:	Fri, 27 Mar 2009 13:35:16 -0700
From:	Ying Han <yinghan@...gle.com>
To:	Nick Piggin <nickpiggin@...oo.com.au>
Cc:	Jan Kara <jack@...e.cz>, "Martin J. Bligh" <mbligh@...igh.org>,
	linux-ext4@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>, guichaz@...il.com,
	Alex Khesin <alexk@...gle.com>,
	Mike Waychison <mikew@...gle.com>,
	Rohit Seth <rohitseth@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, Mar 24, 2009 at 12:44 AM, Nick Piggin <nickpiggin@...oo.com.au> wrote:
> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
>> On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>
>> > Holding mapping->private_lock over the __set_page_dirty should
>> > fix it, although I guess you'd want to release it before calling
>> > __mark_inode_dirty so as not to put inode_lock under there. I
>> > have a patch for this if it sounds reasonable.
>>
>>   Yes, that seems to be a bug - the function actually looked suspitious to
>> me yesterday but I somehow convinced myself that it's fine. Probably
>> because fsx-linux is single-threaded.
>
>
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
>
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
I believe i see the same issue on ext2 as well as ext4.
>
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
>
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal asertions were triggering: eg. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.

Do you think there might be a race in the page reclaim path? I did a
hack which commeted out
wakeup_pdflush in try_to_free_pages ( based on 2.6.21, just randomly
picked on has the problem)
It runs for couple of hours and the problem not happened yet. I am not
sure if that is the problem or not,
and i will leave it running.
The reason i tried the hack since i reproduce the "bad pages" easily
everytime i put more memory pressure
on the system.


diff --git a/mm/vmscan.c b/mm/vmscan.c
index db023e2..b4b7e1f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1067,11 +1067,13 @@ unsigned long try_to_free_pages(struct zone **zones, g
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
+/*
 		if (total_scanned > sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
 			sc.may_writepage = 1;
 		}
+*/

 		/* Take a nap, wait for some writeback to complete */
 		if (sc.nr_scanned && priority < DEF_PRIORITY - 2)

>
> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors. Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
>
> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
>
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
>
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this cased the problem, can someone please tell me
> why it got merged in that state?
>
> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/