lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0612231004270.3671@woody.osdl.org>
Date:	Sat, 23 Dec 2006 10:30:40 -0800 (PST)
From:	Linus Torvalds <torvalds@...l.org>
To:	Jean Delvare <khali@...ux-fr.org>,
	Ian McDonald <ian.mcdonald@...di.co.nz>
cc:	Thomas Meyer <thomas@...3r.de>, linux-cifs-client@...ts.samba.org,
	Steve French <sfrench@...ba.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...l.org>
Subject: Re: WARNING: "test_clear_page_dirty" [fs/cifs/cifs.ko] undefined!


[ Andrew - I'm cc'ing you, because you caused the requirement that people 
  use "set_page_writeback()" in their writepage() routine that CIFS seems 
  to have been ignoring all these years. That was introduced more than 
two years ago, back in April 11, 2004:

	[PATCH] fdatasync integrity fix

	fdatasync can fail to wait on some pages due to a race.
	...

  and as far as I can see, ever since then, any filesystem that didn't do 
  a "set_page_writeback()" to sync up the TAG_DIRTY bit would have this 
  CPU usage problem. Please double-check whether I'm right or barking up 
  the wrong tree.

  Afaik, the lack of doing the page writeback bit handling properly would 
  seem to not cause any actual visible _semantic_ problems, it would just 
  cause fdatasync to not necessarily be entirely reliable (which I guess 
  is semantic, but very hard to see) and just wasted CPU cycles when we 
  look up pages that are marked dirty in the radix tree, but aren't 
  actually really dirty. 

  Correct? Who else is implicated in all of this? ]

On Fri, 22 Dec 2006, Linus Torvalds wrote:
> 
> CIFS _should_ be using "clear_page_dirty_for_io()" in that place, and that 
> will fix the build. However, the reason I didn't just do that myself is 
> that I can't test the end result, and for the life of me, I can't see 
> where CIFS does the "end_page_writeback()" that it needs to do at IO 
> completion time.

Ok, I spent some more time looking at this.

The reason cifs didn't do an "end_page_writeback()" was that it didn't 
even do the "set_page_writeback()" to mark the page under writeback in the 
first place.

Now, you might think that since it didn't do a set_page_writeback(), it 
doesn't need to do the matching end_page_writeback() at all, and instead 
just continue to use the old (_really_ old) way of just unlocking the page 
when it is done.

However, you'd be wrong. The thing is, a "writepage()" function will be 
called with the dirty bit cleared in the "struct page *", but the mapping 
radix trees will still have the dirty bit set, exactly because the VM 
_requires_ the filesystem to tell it what the h*ll it is doing with the 
page. So a low-level filesystem must always do one of two things in it's 
"writepage()" function. Either: 

 - "set_page_writeback()" (and then an "end_page_writeback()" when 
   finished, of course)

OR

 - "redirty_page_for_writepage()" to tell the VM to move the page to the 
   back of the LRU queues because it can't be cleaned (eg, some temporary 
   problem with write ordering or similar, or something fundamental like 
   "I'm ramfs, and I don't _have_ any backing store").

and if the low-level filesystem doesn't do either of those, then the 
status bits in the radix tree that contains the mapping information will 
never be updated, so the page that got cleaned will continue to be marked 
"dirty" in the radix tree (which admittedly will generally be invisible, 
except for "sync()" and friends spending inordinate amounts of time 
looking at pages that aren't even dirty any more - since they look things 
up by the radix tree tags).

So I think the old code happened to work, but it was definitely incorrect, 
and would leave the dirty tags in the radix tree very confused indeed (it 
so happened that "cifs_writepages()" - with an "s" at the end - because it 
used "test_clear_page_dirty()" - would also clear the dirty tag, but any 
page that went through the generic VM routines and the single-page 
"cifs_writepage()" - without an "s" at the end - would then be forever 
marked dirty in the radix tree even though it was clean.

Somebody should check me, though.

This fairly mindless patch adds the proper "set_page_writeback()" calls 
(and the "clear_page_writeback()" ones I had already added before I looked 
more closely at this). 

I added a comment in "cifs_writepage()" (the single-page case) for why 
this all is the case,

			Linus

PS. To clarify: the old "test_clear_page_dirty()" would actually clear the 
dirty bit in the radix tree too, so in that sense it was the RIGHT thing 
to do for CIFS, since CIFS was mostly unaware of the need to clear the 
radix tree dirty bit (even if cifs_writepages() actually used that bit to 
look up pages).

HOWEVER, since CIFS is called from the generic routines (which _are_ 
radix-tree-aware and need the bit to be cleared explicitly), even the old 
code was actually totally broken. It would clear - largely by mistake - 
the radix tree dirty bit only for one case, not for _all_ the cases. A 
filesystem really does need to know about these things now, although a 
lot of filesystems can ignore them, since if they use all the generic 
routines, they generic routines will handle it all for them.

---
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..8a49b2e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,14 +1245,21 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!clear_page_dirty_for_io(page)) {
 				unlock_page(page);
 				break;
 			}
 
+			/*
+			 * This actually clears the dirty bit in the radix tree.
+			 * See cifs_writepage() for more commentary.
+			 */
+			set_page_writeback(page);
+
 			if (page_offset(page) >= mapping->host->i_size) {
 				done = 1;
 				unlock_page(page);
+				end_page_writeback(page);
 				break;
 			}
 
@@ -1316,6 +1323,7 @@ retry:
 					SetPageError(page);
 				kunmap(page);
 				unlock_page(page);
+				end_page_writeback(page);
 				page_cache_release(page);
 			}
 			if ((wbc->nr_to_write -= n_iov) <= 0)
@@ -1352,11 +1360,23 @@ static int cifs_writepage(struct page* page, struct writeback_control *wbc)
         if (!PageUptodate(page)) {
 		cFYI(1, ("ppw - page not up to date"));
 	}
-	
+
+	/*
+	 * Set the "writeback" flag, and clear "dirty" in the radix tree.
+	 *
+	 * A writepage() implementation always needs to do either this,
+	 * or re-dirty the page with "redirty_page_for_writepage()" in
+	 * the case of a failure.
+	 *
+	 * Just unlocking the page will cause the radix tree tag-bits
+	 * to fail to update with the state of the page correctly.
+	 */
+	set_page_writeback(page);		
 	rc = cifs_partialpagewrite(page, 0, PAGE_CACHE_SIZE);
 	SetPageUptodate(page); /* BB add check for error and Clearuptodate? */
 	unlock_page(page);
-	page_cache_release(page);	
+	end_page_writeback(page);
+	page_cache_release(page);
 	FreeXid(xid);
 	return rc;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ