linux-kernel - Re: 2.6.19 file content corruption on ext3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0612281325290.4473@woody.osdl.org>
Date:	Thu, 28 Dec 2006 14:37:37 -0800 (PST)
From:	Linus Torvalds <torvalds@...l.org>
To:	Andrew Morton <akpm@...l.org>
cc:	Guillaume Chazarain <guichaz@...oo.fr>,
	David Miller <davem@...emloft.net>, ranma@...edrich.de,
	gordonfarquharson@...il.com,
	Marc Haber <mh+linux-kernel@...schlus.de>,
	Nick Piggin <nickpiggin@...oo.com.au>, andrei.popa@...eo.ro,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Hugh Dickins <hugh@...itas.com>,
	Florian Weimer <fw@...eb.enyo.de>,
	Martin Michlmayr <tbm@...ius.com>, arjan@...radead.org,
	Chen Kenneth W <kenneth.w.chen@...el.com>
Subject: Re: 2.6.19 file content corruption on ext3


Ok,
 with the ugly trace capture patch, I've actually captured this corruption 
in action, I think.

I did a full trace of all pages involved in one run, and picked one 
corruption at random:

	Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
within that file.

There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c) (1)
	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
	Writing chunk 14464/15800 (32%) (01423a00) (3)
	Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1)  do_no_page: mapping index 00001423 at b7d1f44c (write)
	**      Setting page 00001423 dirty

 - something flushes it out to disk:
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we write the second chunk (which was split over the previous page 
   and the interesting one):
	** (2)  Setting page 00001422 dirty
	** (2)  Setting page 00001423 dirty

 - and here we do a cleaning event
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we write the third chunk:
	** (3)  Setting page 00001423 dirty

 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT

 - and a third flush to disk: 
	**      cpd_for_io: index 00001423
	**      cleaning index 00001423 at b7d1f000

 - here we unmap and flush:
	**      Unmapped index 00001423 at b7d1f000
	**      Removing index 00001423 from page cache

 - here we remap to check:
	**      do_no_page: mapping index 00001423 at b7d1f000 (read)
	**      Unmapped index 00001423 at b7d1f000

 - and finally, here I remove the file after the run:
	**      Removing index 00001423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 00001423 dirty" event 
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the 
   logs, because I can see the dirty events of the other chunk writes 
   around it, so I know exactly where that fourth write happens. And 
   indeed, it _shouldn't_ get a dirty event, because the page is still 
   dirty from the write of chunk #3 to that page, which _did_ get a dirty 
   event.

   I can see that, because the testing app writes the log of the pages it 
   writes, and this is the log around the fourth and final write:

	...
        Writing chunk 5338/15800 (60%) (0076eb48)       PFN: 76e/76f
        Writing chunk 960/15800 (60%) (00156300)        PFN: 156
        Writing chunk 14465/15800 (60%) (01423fb4)  <----
        Writing chunk 8594/15800 (60%) (00bf74a8)       PFN: bf7
        Writing chunk 556/15800 (60%) (000c62f0)        PFN: c6
	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks 
   like this:

        Setting page 0000076e dirty
        Setting page 0000076f dirty
        Setting page 00000156 dirty
        Setting page 000000c6 dirty
	Setting page 00001526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423, 
   and the fpn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that 
   fourth write. Quite a long while after, in fact. So all of this looks 
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
   the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
   The event that clears the dirty bit that write 3 did happens AFTER 
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In 
particular, I don't think it's the dirty page bits that are broken (I 
_see_ that the PageDirty bit was set after write 4 was done to memory in 
the kernel traces). So I think that a real writeback just doesn't happen, 
because somebody has marked the buffer heads clean _after_ it started IO 
on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It 
even has a comment about its crapola behaviour:

        /*
         * Must try to add the page before marking the buffer clean or
         * the confused fail path above (OOM) will be very confused when
         * it finds all bh marked clean (i.e. it will not write anything)
         */

however, I don't think that particular thing explains it, because I don't 
think we use that function for the cases I'm looking at.

Anyway, I'll add tracing for page-writeback setting/cleaning too, in case 
I might see anything new there..

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/