linux-kernel - Re: [PATCH 12/15] writeback: remove writeback_control.more

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LSU.2.00.1107121207540.1752@sister.anvils>
Date:	Tue, 12 Jul 2011 12:50:31 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Wu Fengguang <fengguang.wu@...el.com>
cc:	Andrew Morton <akpm@...ux-foundation.org>, Jan Kara <jack@...e.cz>,
	Mel Gorman <mel@....ul.ie>, Dave Chinner <david@...morbit.com>,
	Christoph Hellwig <hch@...radead.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 12/15] writeback: remove writeback_control.more_io

On Mon, 11 Jul 2011, Wu Fengguang wrote:
> On Tue, Jul 12, 2011 at 05:31:50AM +0800, Hugh Dickins wrote:
> > On Wed, 8 Jun 2011, Wu Fengguang wrote:
> > > When wbc.more_io was first introduced, it indicates whether there are
> > > at least one superblock whose s_more_io contains more IO work. Now with
> > > the per-bdi writeback, it can be replaced with a simple b_more_io test.
> > 
> > This commit, b7a2441f9966fe3e1be960a876ab52e6029ea005 in your branch
> > for linux-next, seems very reasonable to me.
> > 
> > But bisection, confirmed on x86_64 and ppc64 by patching the effective
> > (fs-writeback.c) mods into and out of mmotm with that patch reverted,
> > show it to be responsible for freezes when running my kernel builds
> > on ext2 on loop on tmpfs swapping test.
> > 
> > flush-7:0 (which is doing writeback to the ext2 filesystem on loop0 on
> > a 450MB tmpfs file, though I'm using the ext4 driver to run that ext2fs)
> > seems to get stuck circling around __writeback_inodes_wb(), called from
> > wb_writeback() from wb_do_writeback() from bdi_writeback_thread().
> > 
> > Other tasks then hang trying to get the spinlock in inode_wb_list_del()
> > (memory pressure is trying to evict inodes) or __mark_inode_dirty().
> 
> I created the ext2 on tmpfs loop file and did some simple file copies,
> however cannot reproduce the problem. It would help if you have happen
> to have some usable test scripts.

It takes a while to explain: the sizes need to be tuned to get enough
memory pressure.  I'll forward you the mail in which I described it two
years ago, which should be enough to give you the idea, even if it's not
identical to what I'm using now.

But from what you say below, I think it's pretty much all (the ext2,
the loop, the tmpfs) irrelevant: memory pressure and lots of files
modified, a kernel build in limited memory, that should be enough.

> Or may I ask for your help to follow
> the below analyze and perhaps tracing efforts?
> 
> > I spent a little while trying to understand why,
> > but couldn't work it out: hope you can do better!
> 
> The patch in theory only makes difference in this case in
> writeback_sb_inodes():
> 
>                 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>                         spin_unlock(&inode->i_lock);
>                         requeue_io(inode, wb);
>                         continue;
>                 }
> 
> So if some inode is stuck in the I_NEW, I_FREEING or I_WILL_FREE state,
> the flusher will get stuck busy retrying that inode.

Well, if that's the case (it's not obvious to me how removing more_io
made a difference there) then I think you've already got the answer:
thank you!  As I mentioned, memory pressure is trying to evict inodes, so
tasks are hanging trying to acquire wb->list_lock in inode_wb_list_del().

But inodes being evicted are I_FREEING, and the flush thread is
holding wb->list_lock across all of this (since one of your earlier
patchs), so if one of the inodes being evicted is on the b_io list,
then indeed we'll be stuck.

> 
> It's relatively easy to confirm, by reusing the below trace event to
> show the inode (together with its state) being requeued.
> 
> If this is the root cause, it may equally be fixed by
> 
> -			requeue_io(inode, wb);
> +			redirty_tail(inode, wb);
> 
> which would be useful in case the bug is so deadly that it's no longer
> possible to do tracing.

I checked again this morning that I could reproduce it on two machines,
one went in a few minutes, the other within the hour.  Then I made that
patch changing the requeue_io to redirty_tail, and left home with them
running the test with the new kernel: we'll see at the end of the day
how they fared.

I do have a variant of kdb in (but I think my breakpoints are broken at
present), so I'd be inclined to use that rather than get into tracing:
I can easily add a counter in there and see what happens to it, if more
investigation turns out to be needed.

redirty_tail() instead of requeue_io(): is there any danger that doing
that could delay some updates indefinitely?

Hugh

> 
> Thanks,
> Fengguang
> ---
> 
> echo 1 > /debug/tracing/events/writeback/writeback_single_inode*
> 
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-07-11 23:07:04.000000000 -0700
> +++ linux-next/fs/fs-writeback.c	2011-07-11 23:08:45.000000000 -0700
> @@ -726,6 +726,7 @@ static long writeback_sb_inodes(struct s
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  			spin_unlock(&inode->i_lock);
>  			requeue_io(inode, wb);
> +			trace_writeback_single_inode_requeue(inode, &wbc, 0);
>  			continue;
>  		}
>  		__iget(inode);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/