lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 17 Feb 2010 12:57:34 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Wed, Feb 17, 2010 at 02:33:37AM +0100, Jan Kara wrote:
> On Tue 16-02-10 15:34:01, Linus Torvalds wrote:
> > On Wed, 17 Feb 2010, Jan Kara wrote:
> > >
> > >   The IO size actually does matter for performance because if you switch
> > > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
> > 
> > No.
> > 
> > Dammit, read the code.
> > 
> > That's my whole _point_. Look at the for-loop.
> > 
> > We DO NOT SWITCH to another inode, because we just continue in the 
> > for-loop.
> > 
> > This is why I think your patch is crap. You clearly haven't even read the 
> > code, the patch makes no sense, and there must be something else going on 
> > than what you _claim_ is going on.
>   I've read the code. Maybe I'm missing something but look:
> writeback_inodes_wb(nr_to_write = 1024)
>   -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
>   ...
>   writeback_single_inode()
>     ...writes 1024 pages.
>     if we haven't written everything in the inode (more than 1024 dirty
>     pages) we end up doing either requeue_io() or redirty_tail(). In the
>     first case the inode is put to b_more_io list, in the second case to
>     the tail of b_dirty list. In either case it will not receive further
>     writeout until we go through all other members of current b_io list.
> 
>   So I claim we currently *do* switch to another inode after 4 MB. That
> is a fact.

That is my understanding of how it works, too.

>   I *think* it is by design - mainly to avoid the situation where someone
> continuously writes a huge file and kupdate or pdflush would never get to
> writing other files with dirty data (at least that's impression I've built
> over the years - heck, even 2.6.16 seems to have this redirty_tail logic
> with a comment about the above livelock).

Right, and there is another condition as well - writing lots of
small files could starve large file writeback as the large file only
got 4MB written back every 30s writeback period because it took that
long to write out all the new files.

IIRC it was in 2.6.16 that both these problems were fixed...

>   I do find this design broken as well as you likely do and think that the
> livelock issue described in the above paragraph should be solved differently
> (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.
> 
>   The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

We've had this sync() writeback behaviour for a long time - my
question is why is it only now a problem?

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ