Message-ID: <alpine.LFD.2.00.1002162052230.4141@localhost.localdomain>
Date: Tue, 16 Feb 2010 21:16:46 -0800 (PST)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: tytso@....edu
cc: Jan Kara <jack@...e.cz>, Jens Axboe <jens.axboe@...cle.com>,
Linux Kernel <linux-kernel@...r.kernel.org>,
jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback
On Tue, 16 Feb 2010, tytso@....edu wrote:
>
> We've had this logic for a long time, and given the increase in disk
> density, and spindle speeds, the 4MB limit, which might have made
> sense 10 years ago, probably doesn't make sense now.
I still don't think that 4MB is enough on its own to suck quite that
much. Even a fast device should be perfectly happy with 4MB IOs, or it
must be sucking really badly.
In order to see the kinds of problems that got quoted in the original
thread, there must be something else going on too, methinks (disk light
was "blinking").
So I would guess that it's also getting stuck on that
        inode_wait_for_writeback(inode);
inside that loop in wb_writeback().
In fact, I'm starting to wonder about that "Nothing written" case. The
code basically decides that "if I wrote zero pages, I didn't write
anything at all, so I must wait for the inode to complete old writes in
order to not busy-loop". Which sounds sensible on the face of it, but the
thing is, inodes can be dirty without actually having any dirty _pages_
associated with them.
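
To make that concrete, here's a toy user-space model of that decision (this
is NOT the kernel code; the chunk size is the 4MB under discussion, but the
mix of metadata-only inodes vs inodes with real dirty data is a made-up
ratio, purely for illustration):

/* Toy model of the writeback pass being discussed -- not the kernel code.
 * Each "inode" either has real dirty pages or is dirty with metadata only
 * (an atime update, say).  Whenever a pass writes zero pages, the real
 * loop calls inode_wait_for_writeback() and blocks synchronously. */
#include <stdio.h>

#define CHUNK_PAGES 1024            /* 4MB in 4kB pages: the chunk size */

struct toy_inode {
    int dirty_pages;                /* pages of real data to write */
    int metadata_only;              /* dirty inode, no dirty pages */
};

int main(void)
{
    /* Assumed workload: one file with real dirty data for every
     * hundred metadata-only inodes (made-up ratio). */
    enum { NINODES = 1010 };
    struct toy_inode inodes[NINODES];
    int i, sync_waits = 0;
    long pages_written = 0;

    for (i = 0; i < NINODES; i++) {
        inodes[i].metadata_only = (i % 101 != 0);
        inodes[i].dirty_pages   = inodes[i].metadata_only ? 0 : CHUNK_PAGES;
    }

    for (i = 0; i < NINODES; i++) {
        /* write at most one 4MB chunk of this inode's data per pass */
        int wrote = inodes[i].dirty_pages > CHUNK_PAGES
                        ? CHUNK_PAGES : inodes[i].dirty_pages;
        pages_written += wrote;
        if (wrote == 0)
            sync_waits++;           /* "Nothing written": block here */
    }

    printf("pages written: %ld (%.1f MB), synchronous waits: %d\n",
           pages_written, pages_written * 4 / 1024.0, sync_waits);
    return 0;
}

With that made-up mix you get ten 4MB data writes but a thousand synchronous
waits per pass, so the waits, not the 4MB chunk size, are what dominate the
elapsed time.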
Are we perhaps ending up in a situation where we essentially wait
synchronously on just the inode itself being written out? That would
explain the "40kB/s" kind of behavior.
If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s
throughput.
But if we do a 4MB chunk (for the one file that had real dirty data in
it), and then do a few hundred trivial "write out the inode data
_synchronously_" (due to access time changes etc) in between until we hit
the file that has real dirty data again - now _that_ would explain 40kB/s
throughput. It's not just seeking around - it's not even trying to push
multiple IO's to get any elevator going or anything like that.
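
Back-of-the-envelope numbers show the difference. All of the figures below
(disk streaming rate, per-wait cost, number of waits per data chunk) are
made-up illustrations, not measurements:

/* Effective sync-writeback rate =
 *   data per round / (time to stream the data
 *                     + number of synchronous inode waits * cost per wait)
 * All numbers here are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double chunk_kb = 4096.0;   /* one 4MB chunk of real file data per round */
    const double disk_kbs = 60000.0;  /* assumed ~60 MB/s streaming disk */
    const double stream_s = chunk_kb / disk_kbs;
    const double wait_s   = 0.015;    /* assumed cost of one synchronous inode-only write */
    const int    waits[]  = { 0, 100, 1000 };
    int i;

    for (i = 0; i < (int)(sizeof(waits) / sizeof(waits[0])); i++) {
        double total_s = stream_s + waits[i] * wait_s;
        printf("%4d sync waits per 4MB chunk -> %8.0f kB/s\n",
               waits[i], chunk_kb / total_s);
    }
    return 0;
}

Getting all the way down to the reported 40kB/s would need each wait to cost
quite a bit more than one seek (plausible, if it ends up blocking on an
already in-flight writeback), but the shape of the collapse is the same
either way: the synchronous waits set the rate, not the 4MB chunk.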
And then the patch that started this discussion makes sense: it improves
performance because in between those synchronous inode updates it now
writes big chunks. But again, it's mostly just hiding the fact that we're
doing insane things.
I dunno. Just a theory. The more I look at that code, the uglier it looks.
And I do get the feeling that the "4MB chunking" is really just making the
more fundamental problems show up.
Linus