Date:	Wed, 01 Oct 2008 14:40:51 -0400
From:	Chris Mason <chris.mason@...cle.com>
To:	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: [PATCH] Improve buffered streaming write ordering

Hello everyone,

write_cache_pages can use the address_space writeback_index field to
try to pick up where it left off between calls.  pdflush and
balance_dirty_pages both enable this mode in the hope of having
writeback walk evenly down the file instead of just servicing pages at
the start of the address space.

But there is no locking around this field, and concurrent callers of
write_cache_pages on the same inode can get some very strange results.
pdflush uses the writeback_acquire function to make sure that only one
pdflush process is servicing a given backing device, but
balance_dirty_pages does not.

When there are a small number of dirty inodes in the system,
balance_dirty_pages is likely to run in parallel with pdflush on one or
two of them, leading to somewhat random updates of the writeback_index
field in struct address_space.

The end result is very seeky writeback during streaming IO.  A 4 drive
hardware raid0 array here can do 317MB/s streaming O_DIRECT writes on
ext4.  This is creating a new file, so O_DIRECT is really just a way to
bypass write_cache_pages.

If I do buffered writes instead, XFS does 205MB/s, and ext4 clocks in at
81.7MB/s.  Looking at the buffered IO traces for each one, we can see a
lot of seeks.

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-nopatch.png

http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-nopatch.png

The patch below changes write_cache_pages to only use writeback_index
when current_is_pdflush().  The basic idea is that pdflush is the only
one who has concurrency control against the bdi, so it is the only one
who can safely use and update writeback_index.

The performance changes quite a bit:

        patched        unpatched
XFS     247MB/s        205MB/s
Ext4    246MB/s        81.7MB/s

The graphs after the patch:

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-patched.png

http://oss.oracle.com/~mason/bugs/writeback_ordering/xfs-patched.png

The ext4 graph really does look strange.  What's happening there is the
lazy inode table init has dirtied a whole bunch of pages on the block
device inode.  I don't have much of an answer for why my patch makes all
of this writeback happen up front, other than that writeback_index is no
longer bouncing all over the address space.

It is also worth noting that before the patch, filefrag shows ext4 using
about 4000 extents on the file.  After the patch it is around 400.  XFS
uses 2 extents both patched and unpatched.

This is just one benchmark, and I'm not convinced this patch is right.
The ordering of pdflush vs balance_dirty_pages is very tricky, so I
definitely think we need more thought on this one.

Signed-off-by: Chris Mason <chris.mason@...cle.com>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 24de8b6..d799f03 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -884,7 +884,11 @@ int write_cache_pages(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
+		/* start from previous offset done by pdflush */
+		if (current_is_pdflush())
+			index = mapping->writeback_index;
+		else
+			index = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
@@ -958,7 +962,8 @@ retry:
 		index = 0;
 		goto retry;
 	}
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+	if (current_is_pdflush() &&
+	    (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)))
 		mapping->writeback_index = index;
 
 	if (wbc->range_cont)

