linux-kernel - Re: [PATCH] writeback: Fix broken sync writeback


Message-ID: <alpine.LFD.2.00.1002161848370.4141@localhost.localdomain>
Date:	Tue, 16 Feb 2010 19:35:35 -0800 (PST)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Jan Kara <jack@...e.cz>
cc:	Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Wed, 17 Feb 2010, Jan Kara wrote:
>
>   I've read the code. Maybe I'm missing something but look:
> writeback_inodes_wb(nr_to_write = 1024)
>   -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
>   ...
>   writeback_single_inode()
>     ...writes 1024 pages.
>     if we haven't written everything in the inode (more than 1024 dirty
>     pages) we end up doing either requeue_io() or redirty_tail(). In the
>     first case the inode is put to b_more_io list, in the second case to
>     the tail of b_dirty list. In either case it will not receive further
>     writeout until we go through all other members of current b_io list.
> 
>   So I claim we currently *do* switch to another inode after 4 MB. That
> is a fact.
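
The behaviour described in the quoted trace can be sketched as a small
user-space model. This is only an illustration under stated assumptions,
not the kernel code: it assumes 4 KB pages (so 1024 pages is 4 MB), it
models only the requeue_io() path onto b_more_io and ignores
redirty_tail(), and the function name, the `limit` parameter and the
inode names are hypothetical.

```python
from collections import deque

PAGE_SIZE = 4096            # bytes; assumption: 4 KB pages
MAX_WRITEBACK_PAGES = 1024  # per-inode chunk size discussed in the thread


def simulate_writeback(dirty_pages, limit=MAX_WRITEBACK_PAGES):
    """Return the order in which (inode, pages_written) chunks are issued.

    dirty_pages maps a hypothetical inode name to its dirty page count.
    limit=None models writing each inode to completion in one go.
    """
    # Models wb->b_io right after queue_io(); clean inodes are skipped.
    b_io = deque((name, n) for name, n in dirty_pages.items() if n > 0)
    b_more_io = deque()  # models wb->b_more_io
    chunks = []
    while b_io or b_more_io:
        if not b_io:
            # The current b_io list is exhausted: only now do the
            # requeued inodes get another turn.
            b_io, b_more_io = b_more_io, deque()
        inode, remaining = b_io.popleft()
        written = remaining if limit is None else min(remaining, limit)
        chunks.append((inode, written))
        remaining -= written
        if remaining:
            # Models requeue_io(): the inode waits behind everything
            # still sitting on b_io.
            b_more_io.append((inode, remaining))
    return chunks


if __name__ == "__main__":
    print(MAX_WRITEBACK_PAGES * PAGE_SIZE // (1024 * 1024), "MB per chunk")
    for chunk in simulate_writeback({"big": 2500, "small": 100}):
        print(chunk)
```

With one inode holding 2500 dirty pages and another holding 100, the
model writes 1024 pages of the first, switches to the second, and only
then returns to the first for the remaining 1024 and 452 pages. Passing
limit=None writes each inode in a single chunk instead.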

Ok, I think that's the bug. I do agree that it may well be intentional, 
but considering the performance impact, I suspect it's been "intentional 
without any performance numbers".

Which just makes me very unhappy to just paper it over for the sync case, 
and leave the now known-broken state alone for the async case. That really 
isn't how we want to do things.

That said, if we've done this forever, I can certainly see the allure to 
just keep doing it, and then handle the sync case separately. 

>   I find this design broken too, as you likely do, and think that the
> livelock issue described in the above paragraph should be solved differently
> (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.

Hmm. The thing is, the new radix tree bit you propose also sounds like 
overdesigning things. 

If we really do switch inodes (which I obviously didn't expect, even if I 
may have been aware of it many years ago), then the max rate limiting is 
just always bad.

If it's bad for synchronous syncs, then it's bad for background syncing 
too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - 
since the whole latency argument goes away if we don't always honor it 
("Oh, we have good latency - _except_ if you do 'sync()' to synchronously 
write something out" - that's just insane).

>   The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation, would you accept some fix along the lines of Jens' original
> fix?

What is affected if we just remove MAX_WRITEBACK_PAGES entirely (as 
opposed to the patch under discussion that effectively removes it for 
WB_SYNC_ALL)?

I see balance_dirty_pages -> bdi_start_writeback, but that if anything 
would be something that I think would be better off with efficient 
writeback, and doesn't seem like it should try to round-robin over inodes 
for latency reasons.

But I guess we can do it in stages, if it's about "minimal changes for 
2.6.32/33".

			Linus