linux-kernel - Re: [PATCH] writeback: Fix broken sync writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.LFD.2.00.1002120722270.7792@localhost.localdomain>
Date:	Fri, 12 Feb 2010 07:45:04 -0800 (PST)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Jens Axboe <jens.axboe@...cle.com>
cc:	Linux Kernel <linux-kernel@...r.kernel.org>, jack@...e.cz,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Fri, 12 Feb 2010, Jens Axboe wrote:
> 
> This fixes it by using the passed in page writeback count, instead of
> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> finish properly even when new pages are being dirted.

This seems broken.

The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't 
generate a single IO bigger than a couple of MB. And then we have that 
loop around things to do multiple chunks. Your change to use nr_pages 
seems to make the whole looping totally pointless, and breaks that "don't 
do huge hunks" logic.

So I don't think that your patch is correct.

That said, I _do_ believe you when you say it makes a difference, which 
makes me think there is a bug there. I just don't think you fixed the 
right bug, and your change just happens to do what you wanted by pure 
random luck.

The _real_ bug seems to bethe one you mentioned, but then ignored:

> Instead of flushing everything older than the sync run, it will do 
> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and 
> over.

and it would seem that the _logical_ way to fix it would be something like 
the appended...

Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that 
any _future_ data is written back, so the 'oldest_jif' thing would seem to 
be sane regardless of sync mode.

  NOTE NOTE NOTE! I do have to admit that this patch scares me, because 
  there could be some bug in the 'older_than_this' logic that means that 
  somebody sets it even if the inode is already dirty. So this patch makes 
  conceptual sense to me, and I think it's the right thing to do, but I 
  also suspect that we do not actually have a lot of test coverage of the 
  whole 'older_than_this' logic, because it historically has been just a
  small optimization for kupdated.

  So this patch scares me, as it could break 'fdatasync' entirely. So 
  somebody should really double-check the whole 'dirtied_when' logic, just 
  to be safe. If anybody ever sets 'dirtied_when' to the current time even 
  if the inode is already dirty (and has an earlier dirtied_when'), then 
  that would open up 'fdatasync()' and friends up to not writing things 
  back properly at all (because a newer write set 'dirtied_when' so that 
  old writes get ignored and thought to be 'totally new')

Comments? 

		Linus

---
 fs/fs-writeback.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..a0a8424 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
 	long wrote = 0;
 	struct inode *inode;

-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
+	/*
+	 * We never write back data that was dirtied _after_ we
+	 * started writeback. But kupdate doesn't even want to
+	 * write back recently dirtied stuff, only older data.
+	 */
+	oldest_jif = jiffies-1;
+	wbc.older_than_this = &oldest_jif;
+	if (wbc.for_kupdate)
+		oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
+
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/