linux-kernel - Re: Page Cache writeback too slow, SSD/noop scheduler/ext2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090325052637.GA5912@localhost>
Date:	Wed, 25 Mar 2009 13:26:37 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Nick Piggin <nickpiggin@...oo.com.au>
Cc:	Jos Houtman <jos@...es.nl>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Jeff Layton <jlayton@...hat.com>,
	Dave Chinner <david@...morbit.com>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"jens.axboe@...cle.com" <jens.axboe@...cle.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"hch@...radead.org" <hch@...radead.org>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>
Subject: Re: Page Cache writeback too slow,   SSD/noop scheduler/ext2

On Wed, Mar 25, 2009 at 01:48:53AM +1100, Nick Piggin wrote:
> On Monday 23 March 2009 03:53:29 Jos Houtman wrote:
> > On 3/21/09 11:53 AM, "Andrew Morton" <akpm@...ux-foundation.org> wrote:
> > > On Fri, 20 Mar 2009 19:26:06 +0100 Jos Houtman <jos@...es.nl> wrote:
> > >> Hi,
> > >>
> > >> We have hit a problem where the page-cache writeback algorithm is not
> > >> keeping up.
> > >> When memory gets low this will result in very irregular performance
> > >> drops.
> > >>
> > >> Our setup is as follows:
> > >> 30 x Quad core machine with 64GB ram.
> > >> These are single purpose machines running MySQL.
> > >> Kernel version: 2.6.28.7
> > >> A dedicated SSD drive for the ext2 database partition
> > >> Noop scheduler for the ssd drive.
> > >>
> > >>
> > >> The current hypothesis is as follows:
> > >> The wk_update function does not write enough dirty pages, which allows
> > >> the number of dirty pages to grow to the dirty_background limit.
> > >> When memory is low,  __background_writeout() comes around and
> > >> __forcefully__ writes dirty pages to disk.
> > >> This forced write fills the disk queue and starves read calls that MySQL
> > >> is trying to do: basically killing performance  for a few seconds. This
> > >> pattern repeats as soon as the cleared memory is filled again.
> > >>
> > >> Decreasing the dirty_writeback_centisecs to 100 doesn__t help
> > >>
> > >> I don__t know why this is, but I did some preliminary tracing using
> > >> systemtap and it seems that the majority of times wk_update calls
> > >> decides to do nothing.
> > >>
> > >> Doubling /sys/block/sdb/queue/nr_requests  to 256, seems to help abit: 
> > >> the nr_dirty pages is increasing more slowly.
> > >> But I am unsure of side-effects and am afraid of increasing the
> > >> starvation problem for mysql.
> > >>
> > >>
> > >> I__am very much willing to work on this issue and see it fixed, but
> > >> would like to tap into the knowledge of people here.
> > >> So:
> > >> * Have more people seen this or simular issues?
> > >> * Is the hypothesis above a viable one?
> > >> * Suggestions/pointers for further research and statistics I should
> > >> measure to improve the understanding of this problem.
> > >
> > > I don't think that noop-iosched tries to do anything to prevent
> > > writes-starve-reads.  Do you get better behaviour from any of the other
> > > IO schedulers?
> >
> > I did a quick stress test and cfq does not immediately seem to hurt
> > performance, although some of my colleague's have tested this in the past
> > with the opposite results (which is why we use noop).
> >
> > But despite the scheduler, the real problem is in the writeback algorithm
> > not keeping up.
> > We can grow 600K dirty pages during the day, and only ~300k is flushed to
> > disk during the night hours.
> >
> > While a quick look at the writeback algorithm let me to expect
> > __wk_update()__ to flush ~1024 pages every 5 seconds, which is almost 3GB
> > per hour.  It obviously does not manage to do this in our setup.
> >
> > I don¹t believe the speed of the ssd to be the problem, running sync
> > manually only takes a few minutes to flush 800K dirty pages to disk.
> 
> kupdate surely should just continue to keep trying to write back pages
> so long as there are more old pages to clean, and the queue isn't
> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> is just the number to write back in a single call, but you see
> nr_to_write is set to the number of dirty pages in the system.
> 
> On your system, what must be happening is more_io is not being set.
> The logic in fs/fs-writeback.c might be busted.

Hi Jos,

I prepared a debugging patch for 2.6.28. (I cannot observe writeback
problems on my local ext2 mount.)

You can view the states of all dirty inodes by doing

        modprobe filecache
        echo ls dirty > /proc/filecache
        cat /proc/filecache

The 'age' field shows (jiffies - inode->dirtied_when), which may also be useful
for debugging Jeff and Ian's case(if it keeps growing, then dirtied_when is stuck).

The detailed dirty writeback traces can be retrieved by doing

        echo 1 > /proc/sys/fs/dirty_debug        
        sleep 6s
        echo 0 > /proc/sys/fs/dirty_debug        
        dmesg

The dmesg trace should help identify the bug in periodic writeback.

Thanks,
Fengguang

View attachment "filecache+writeback-debug-2.6.28.patch" of type "text/x-diff" (40786 bytes)