linux-kernel - Re: Bug in kernel 2.6.31, Slow wb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 29 Jul 2009 00:15:48 -0700
From:	Martin Bligh <mbligh@...gle.com>
To:	Chad Talbott <ctalbott@...gle.com>
Cc:	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	wfg@...l.ustc.edu.cn, Michael Rubin <mrubin@...gle.com>,
	Andrew Morton <akpm@...gle.com>, sandeen@...hat.com,
	Michael Davidson <md@...gle.com>
Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@...gle.com> wrote:
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files."  When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed.  Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3]  This is may not be the fix, but it makes this
>> test behave better.
>
> I'm fairly sure this is not fixing the root cause - but putting it at the head
> rather than the tail of the queue causes the error not to starve wb_kupdate
> for nearly so long - as long as we keep the queue full, the bug is hidden.

OK, it seems this is the root cause - I wasn't clear why all the pages weren't
being written back, and thought there was another bug. What happens is
we go into write_cache_pages, and stuff the disk queue with as much as
we can put into it, and then inevitably hit the congestion limit.

Then we back out to __sync_single_inode, who says "huh, you didn't manage
to write your whole slice", and penalizes the poor blameless inode in question
by putting it back into the penalty box for 30s.

This results in very lumpy I/O writeback at 5s intervals, and very
poor throughput.

Patch below is inline and probably text munged, but is for RFC only.
I'll test it
more thoroughly tomorrow. As for the comment about starving other writes,
I believe requeue_io moves it from s_io to s_more_io which should at least
allow some progress of other files.

--- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
@@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
                        /*
                         * We didn't write back all the pages.  nfs_writepages()
                         * sometimes bales out without doing anything. Redirty
-                        * the inode; Move it from s_io onto s_more_io/s_dirty.
+                        * the inode; Move it from s_io onto s_more_io. It
+                        * may well have just encountered congestion
                         */
-                       /*
-                        * akpm: if the caller was the kupdate function we put
-                        * this inode at the head of s_dirty so it gets first
-                        * consideration.  Otherwise, move it to the tail, for
-                        * the reasons described there.  I'm not really sure
-                        * how much sense this makes.  Presumably I had a good
-                        * reasons for doing it this way, and I'd rather not
-                        * muck with it at present.
-                        */
-                       if (wbc->for_kupdate) {
-                               /*
-                                * For the kupdate function we move the inode
-                                * to s_more_io so it will get more writeout as
-                                * soon as the queue becomes uncongested.
-                                */
-                               inode->i_state |= I_DIRTY_PAGES;
-                               if (wbc->nr_to_write <= 0) {
-                                       /*
-                                        * slice used up: queue for next turn
-                                        */
-                                       requeue_io(inode);
-                               } else {
-                                       /*
-                                        * somehow blocked: retry later
-                                        */
-                                       redirty_tail(inode);
-                               }
-                       } else {
-                               /*
-                                * Otherwise fully redirty the inode so that
-                                * other inodes on this superblock will get some
-                                * writeout.  Otherwise heavy writing to one
-                                * file would indefinitely suspend writeout of
-                                * all the other files.
-                                */
-                               inode->i_state |= I_DIRTY_PAGES;
-                               redirty_tail(inode);
-                       }
+                       inode->i_state |= I_DIRTY_PAGES;
+                       requeue_io(inode);
                } else if (inode->i_state & I_DIRTY) {
                        /*
                         * Someone redirtied the inode while were writing back
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/