linux-kernel - Re: 2.6.19 file content corruption on ext3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <45876C65.7010301@yahoo.com.au>
Date:	Tue, 19 Dec 2006 15:36:53 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Linus Torvalds <torvalds@...l.org>
CC:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...l.org>, andrei.popa@...eo.ro,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Hugh Dickins <hugh@...itas.com>,
	Florian Weimer <fw@...eb.enyo.de>,
	Marc Haber <mh+linux-kernel@...schlus.de>,
	Martin Michlmayr <tbm@...ius.com>
Subject: Re: 2.6.19 file content corruption on ext3

Linus Torvalds wrote:
> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> 
>>This should be safe; page_mkclean walks the rmap and flips the pte's
>>under the pte lock and records the dirty state while iterating.
>>Concurrent faults will either do set_page_dirty() before we get around
>>to doing it or vice versa, but dirty state is not lost.
> 
> 
> Ok, I really liked this patch, but the more I thought about it, the more I 
> started to doubt the reasons for liking it.

Well this implements my suggestion to redirty the page if there were dirty
ptes. I think it is a good fix (whether or not it fixes Andrei's bug, it
does fix a bug), though maybe _slightly_ suboptimal.

> I think we have some core fundamental problem here that this patch is 
> needed at all.
> 
> So let's think about this: we apparently have two cases of 
> "clear_page_dirty()":
> 
>  - the one that really wants to clear the bit unconditionally (Andrew 
>    calls this the "must_clean_ptes" case, which I personally find to be a 
>    really confusing name, but whatever)
> 
>  - the other case. The case that doesn't want to really clear the pte 
>    dirty bits.

I don't think this characterises it correctly. Think about how it worked
before the page_mkclean went in there.

We really _never_ want to just clear pte dirty bits, because that would be
a data loss situation[*]. The only reason we clear PG_dirty is because some
filesystem may have cleaned each buffer without realising it has cleaned
the whole page. But if you have a dirty pte, then all bets are off: a
buffer with a clear dirty bit can not be considered clean.

Before the dirty page tracking, it was fine to clear PG_dirty here, because
we would pick up the pte dirty info later on. After the page dirty tracking,
clearing pte dirty is a bug here, and re-accounting the dirty page is
arguably the minimal fix.

[*] except in the truncate case where we are happy to throw out dirty data,
     but in that case there would be no ptes anyway.

The only thing I would suggest is not applying Andrew's patch at all, and
do the special casing in try_to_free_buffers(). I've attached a patch for
comments.

> and I thought your patch made sense, because it saved away the pte state 
> in the page dirty state, and that matches my mental model, but the more I 
> think about it, the less sense that whole "the other case" situation makes 
> AT ALL.
> 
> Why does "the other case" exist at all? If you want to clear the dirty 
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty 
> information? In other words, what possible reason can there ever be for 
> saying "I want this page to be clean", while at the same time saying "but 
> if it was dirty in the page tables, don't forget about that state".

We never want to drop dirty data! (ignoring the truncate case, which is
handled privately by truncate anyway)

This whole exercise is not about cleaning or dirtying or fogetting the actual
*data* in the page. It is about bringing the pagecache's notion of whether
the page is dirty or clean in line with the (more uptodate) filesystem's
notion.

After dirty write accounting, we also threw in "the virtual memory manager's
notion", but got that case slightly wrong.

As unlikely as this race is for SMP systems, I think it is easily possible
for PREEMPT kernels. And they have featured in all bug reports, AFAIKS.

-- 
SUSE Labs, Novell Inc.

View attachment "fs-fix.patch" of type "text/plain" (3905 bytes)