linux-kernel - Re: regression in page writeback


Date:	Tue, 22 Sep 2009 18:59:41 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	Chris Mason <chris.mason@...cle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"Li, Shaohua" <shaohua.li@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"richard@....demon.co.uk" <richard@....demon.co.uk>,
	"jens.axboe@...cle.com" <jens.axboe@...cle.com>
Subject: Re: regression in page writeback

On Wed, 23 Sep 2009 09:45:00 +0800 Wu Fengguang <fengguang.wu@...el.com> wrote:

> On Wed, Sep 23, 2009 at 09:28:32AM +0800, Andrew Morton wrote:
> > On Wed, 23 Sep 2009 09:17:58 +0800 Wu Fengguang <fengguang.wu@...el.com> wrote:
> > 
> > > On Wed, Sep 23, 2009 at 08:54:52AM +0800, Andrew Morton wrote:
> > > > On Wed, 23 Sep 2009 08:22:20 +0800 Wu Fengguang <fengguang.wu@...el.com> wrote:
> > > > 
> > > > > Jens' per-bdi writeback has another improvement. In 2.6.31, when
> > > > > superblocks A and B both have 100000 dirty pages, it will first
> > > > > exhaust A's 100000 dirty pages before going on to sync B's.
> > > > 
> > > > That would only be true if someone broke 2.6.31.  Did they?
> > > > 
> > > > SYSCALL_DEFINE0(sync)
> > > > {
> > > > 	wakeup_pdflush(0);
> > > > 	sync_filesystems(0);
> > > > 	sync_filesystems(1);
> > > > 	if (unlikely(laptop_mode))
> > > > 		laptop_sync_completion();
> > > > 	return 0;
> > > > }
> > > > 
> > > > the sync_filesystems(0) is supposed to non-blockingly start IO against
> > > > all devices.  It used to do that correctly.  But people mucked with it
> > > > so perhaps it no longer does.
> > > 
> > > I'm referring to writeback_inodes(). Each invocation of which (to sync
> > > 4MB) will do the same iteration over superblocks A => B => C ... So if
> > > A has dirty pages, it will always be served first.
> > > 
> > > So if wbc->bdi == NULL (which is true for kupdate/background sync), it
> > > will have to first exhaust A before going on to B and C.
> > 
> > But that works OK.  We fill the first device's queue, then it gets
> > congested and sync_sb_inodes() does nothing and we advance to the next
> > queue.
> 
> So in common cases "exhaust" is a bit exaggerated, but A does receive
> much more opportunity than B. Computation resources for IO submission
> are unbalanced for A, and there are pointless overheads in rechecking A.

That's unquantified handwaving.  One CPU can do a *lot* of IO.
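
The behaviour under dispute in the quoted exchange (a fixed iteration
order A => B => C, with congested queues skipped) can be sketched as a toy
model. This is a minimal sketch, not kernel code: struct toy_dev,
toy_writeback_pass() and every number used with them are hypothetical, the
per-pass budget stands in for the "4MB" per invocation mentioned above, and
IO completion is not modelled, so queues only ever fill.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_dev {
	long dirty;     /* dirty pages not yet queued */
	long queued;    /* pages sitting in the device queue */
	long queue_max; /* depth at which the device counts as congested */
};

/*
 * One pass in the style described for writeback_inodes(): walk the
 * devices in a fixed order with a budget of nr_to_write pages, and skip
 * any device whose queue is already full.  Returns the number of pages
 * submitted in this pass.
 */
static long toy_writeback_pass(struct toy_dev *devs, size_t n, long nr_to_write)
{
	long submitted = 0;

	assert(nr_to_write >= 0);
	for (size_t i = 0; i < n && nr_to_write > 0; i++) {
		struct toy_dev *d = &devs[i];
		long room = d->queue_max - d->queued;
		long take;

		if (room <= 0)
			continue; /* congested: do nothing, advance to the next */

		/* Submit as much as dirty data, queue room and budget allow. */
		take = d->dirty;
		if (take > room)
			take = room;
		if (take > nr_to_write)
			take = nr_to_write;

		d->dirty -= take;
		d->queued += take;
		nr_to_write -= take;
		submitted += take;
	}
	return submitted;
}
```

With two devices of 100000 dirty pages each, an assumed queue depth of 2048
pages and a budget of 1024 pages per pass, the first two passes go entirely
to A, after which A is skipped as congested and B is served. In this model A
is favoured only until its queue fills; how much that costs on real hardware
is exactly what the thread leaves unquantified.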

> > If a device has more than a queue's worth of dirty data then we'll
> > probably leave some of that dirty memory un-queued, so there's some
> > lack of concurrency in that situation.
> 
> Good insight.

It was wrong.  See the other email.

> That possibly explains one major factor of the
> performance gains of Jens' per-bdi writeback.

I've yet to see any believable and complete explanation for these
gains.  I've asked about these things multiple times and nothing happened.

I suspect that what happened over time was that previously-working code
got broken, then later people noticed the breakage but failed to
analyse and fix it in favour of simply ripping everything out and
starting again.

So for the want of analysing and fixing several possible regressions,
we've tossed away some very sensitive core kernel code which had tens
of millions of machine-years testing.  I find this incredibly rash.