Message-ID: <20090329023238.GA7825@localhost>
Date: Sun, 29 Mar 2009 10:32:38 +0800
From: Wu Fengguang <fengguang.wu@...el.com>
To: Jos Houtman <jos@...es.nl>
Cc: Nick Piggin <nickpiggin@...oo.com.au>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Jeff Layton <jlayton@...hat.com>,
Dave Chinner <david@...morbit.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"jens.axboe@...cle.com" <jens.axboe@...cle.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"hch@...radead.org" <hch@...radead.org>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>
Subject: Re: Page Cache writeback too slow, SSD/noop scheduler/ext2
On Sat, Mar 28, 2009 at 12:59:43AM +0800, Jos Houtman wrote:
> Hi,
>
> >>
> >> kupdate surely should just continue to keep trying to write back pages
> >> so long as there are more old pages to clean, and the queue isn't
> >> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> >> is just the number to write back in a single call, but you see
> >> nr_to_write is set to the number of dirty pages in the system.
>
> And when it's congested it should just wait a little bit before continuing.
>
> >> On your system, what must be happening is more_io is not being set.
> >> The logic in fs/fs-writeback.c might be busted.
>
> I don't know about more_io, but I agree that the logic seems busted.
>
> >
> > Hi Jos,
> >
> > I prepared a debugging patch for 2.6.28. (I cannot observe writeback
> > problems on my local ext2 mount.)
>
> Thanks for the patch, but for the next time: how should I apply it?
> It seems to be context aware (@@) and broke on all kernel versions I tried:
> 2.6.28/2.6.28.7/2.6.29
Do you mean that the patch applies after removing " @@.*$"?
To be safe, I created the patch with quilt as well as git, for 2.6.29.
> Because I saw the patch only a few hours ago and didn't want to block on your
> reply, I decided to apply it manually and in the process ported it to 2.6.29.
>
> As for the information the patch provided: It is most helpful.
>
> Attached you will find a list of files containing dirty pages and the count
> of their dirty pages; there is also a dmesg output where I trace the
> writeback for 40 seconds.
They helped, thank you!
> I did some testing on my own using printk's, and what I saw is that the
> inodes located on sdb1 (the database) would often pass
> http://lxr.linux.no/linux+v2.6.29/fs/fs-writeback.c#L335
> and then redirty_tail would be called. I haven't had the time to dig deeper,
> but that is my primary suspect for the moment.
You are right. In your case, there are several big dirty files on sdb1,
and the sdb write queue is almost constantly congested. The SSD write
speed is so slow that each round of sdb1 writeback begins with an
uncongested queue, which quickly fills up again after some pages are
written. Hence all the inodes get redirtied, because (nr_to_write > 0)
still holds when writeback bails out on congestion.
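For reference, the branch being hit looks roughly like this in 2.6.29
(paraphrased excerpt from __sync_single_inode() in fs/fs-writeback.c,
near the line you point at; comments mine, details may differ slightly):

	if (!(inode->i_state & I_DIRTY) &&
	    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		/* not all dirty pages were written back */
		inode->i_state |= I_DIRTY_PAGES;
		if (wbc->nr_to_write <= 0) {
			/* slice used up: queue for next turn */
			requeue_io(inode);
		} else {
			/*
			 * nr_to_write > 0: writeback stopped for some
			 * other reason (here: the congested queue), so
			 * the inode is pushed back to s_dirty and not
			 * retried for a while
			 */
			redirty_tail(inode);
		}
	}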
The following quick fix should solve the slow-writeback-on-congested-SSD
problem. However, the writeback sequence is still suboptimal: it
syncs-and-requeues each file until the queue congests (in your case after
about 3~600 pages) instead of until MAX_WRITEBACK_PAGES=1024 pages have
been written.
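With requeue_io() the inode stays on s_more_io, so wbc->more_io gets set
and the kupdate path will wait for the congestion to clear and retry
instead of giving up. The main loop of wb_kupdate() in
mm/page-writeback.c goes roughly like this (paraphrased from 2.6.28;
exact code may differ in your tree):

	nr_to_write = global_page_state(NR_FILE_DIRTY) +
			global_page_state(NR_UNSTABLE_NFS) +
			(inodes_stat.nr_inodes - inodes_stat.nr_unused);

	while (nr_to_write > 0) {
		wbc.more_io = 0;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			if (wbc.encountered_congestion || wbc.more_io)
				congestion_wait(WRITE, HZ/10);
			else
				break;	/* all the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}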
A more complete fix would be turning MAX_WRITEBACK_PAGES into an exact
per-file limit. It has been sitting in my todo list for quite a while...
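Just to sketch the idea (hypothetical code only, not a real patch): each
inode would get its own MAX_WRITEBACK_PAGES quota, and only the pages it
actually wrote would be charged to the caller's budget, e.g.

	/* hypothetical sketch only -- not an actual patch */
	static void writeback_one_inode(struct inode *inode,
					struct writeback_control *wbc)
	{
		long budget = wbc->nr_to_write;

		wbc->nr_to_write = MAX_WRITEBACK_PAGES;	/* per-file limit */
		__writeback_single_inode(inode, wbc);
		/* charge only what this file actually wrote */
		budget -= MAX_WRITEBACK_PAGES - wbc->nr_to_write;
		wbc->nr_to_write = budget;
	}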
Thanks,
Fengguang
---
fs/fs-writeback.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 			 * soon as the queue becomes uncongested.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->nr_to_write <= 0 ||
+			    wbc->encountered_congestion) {
 				/*
 				 * slice used up: queue for next turn
 				 */