Message-Id: <200709072052.48396.bs@q-leap.de>
Date: Fri, 7 Sep 2007 20:52:38 +0200
From: Bernd Schubert <bs@...eap.de>
To: linux-kernel@...r.kernel.org
Subject: [PATCH 1/2][RESEND] improve generic_file_buffered_write()
There has been no further response to our patches yet, so we are sending them
again, re-diffed against 2.6.23-rc5.
Hi,
recently we discovered that writing to an nfs-exported Lustre filesystem is
rather slow (20-40 MB/s writing, but over 200 MB/s reading).
As I already explained on the nfs mailing list, this happens because there is
an offset into the very first page due to the nfs header.
http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs
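For illustration, here is a rough user-space sketch of the iovec layout that
causes the problem (the 148 byte header size is an assumption, real values
vary): the first segment is shortened by the header, so every later segment
boundary is shifted against the page boundaries of the file being written.

#include <stdio.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define HDR_SIZE  148	/* assumed rpc/nfs header size */

int main(void)
{
	static char rpcbuf[9 * PAGE_SIZE];	/* stand-in for the rpc buffer */
	struct iovec iov[9];
	int i;

	/* first segment: the rest of the first receive page after the header */
	iov[0].iov_base = rpcbuf + HDR_SIZE;
	iov[0].iov_len  = PAGE_SIZE - HDR_SIZE;

	/* middle segments: full receive pages */
	for (i = 1; i < 8; i++) {
		iov[i].iov_base = rpcbuf + i * PAGE_SIZE;
		iov[i].iov_len  = PAGE_SIZE;
	}

	/* last segment: the remainder pushed into a ninth page */
	iov[8].iov_base = rpcbuf + 8 * PAGE_SIZE;
	iov[8].iov_len  = HDR_SIZE;

	for (i = 0; i < 9; i++)	/* a 32 KiB write ends up in 9 segments */
		printf("seg %d: %4zu bytes\n", i, iov[i].iov_len);
	return 0;
}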
While this especially affects Lustre, Olaf Kirch also noticed it on another
filesystem before and wrote an nfs patch for it. That patch has two
disadvantages: it requires moving all data within the pages, which is IMHO
rather CPU-time consuming, and furthermore it presently causes data
corruption when more than one nfs thread is running.
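To make the CPU cost of that approach concrete, here is a minimal sketch of
the data movement it needs (assumed shape for illustration only, not the
actual patch): every byte of the payload is copied a second time just to
restore page alignment, and the displaced tail of each page still has to be
pulled in from the following page.

#include <string.h>

#define PAGE_SIZE 4096

/* shift the payload of one receive page so it becomes page-aligned again */
void realign_page(char *page, size_t hdr_len)
{
	memmove(page, page + hdr_len, PAGE_SIZE - hdr_len);
	/* the last hdr_len bytes must then be filled from the next page */
}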
After thinking it over and over again, we (Goswin and I) believe it would be
best to improve generic_file_buffered_write().
If sufficient data is available, as is usual for aio writes,
generic_file_buffered_write() will now fill each page as much as possible and
only then prepare/commit it. Before, generic_file_buffered_write() committed
partially filled pages even though there was still more data to write.
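As a rough illustration of what this changes, here is a toy model (plain
user-space C, not the kernel code; the 148 byte header shift is an
assumption) that counts prepare/commit pairs for both strategies:

#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long min3(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;
	return m < c ? m : c;
}

/* count prepare/commit pairs for a write of 'count' bytes arriving in
 * segments of 'seg' bytes, starting at file position 'pos' */
static unsigned long commits(unsigned long pos, unsigned long count,
			     unsigned long seg, int fill_pages)
{
	unsigned long n = 0, seg_left = seg;

	while (count) {
		unsigned long page_left = PAGE_SIZE - (pos & (PAGE_SIZE - 1));
		unsigned long bytes = min3(page_left, seg_left, count);

		pos += bytes;
		count -= bytes;
		seg_left -= bytes;
		if (!seg_left)
			seg_left = seg;
		/* old: commit every chunk; new: only on a full page
		 * or when the data is exhausted */
		if (!fill_pages || bytes == page_left || !count)
			n++;
	}
	return n;
}

int main(void)
{
	/* a 512 KiB write whose segments are shifted by the header */
	unsigned long count = 512 * 1024, seg = PAGE_SIZE - 148;

	printf("old: %lu prepare/commit pairs\n", commits(0, count, seg, 0));
	printf("new: %lu prepare/commit pairs\n", commits(0, count, seg, 1));
	return 0;
}

With these assumed numbers the model reports 260 pairs for the old behaviour
and 128 (one per page) for the new one.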
Some statistics:
num_writes = 4669440, bytes_total = 20231249633, segs_total = 5738644,
commit_loops = 7697604, commits_total = 6628750
commit_loops is the number of commits without the patch and commits_total is
the number of commits we actually have now. This is a saving of nearly 14%
(1 - 6628750/7697604 = ~13.9%) of the prepare, commit and cond_resched calls.
< 1: Write size = 0, Num segs = 0
< 2: Write size = 20244, Num segs = 4455583
< 4: Write size = 6722, Num segs = 24
< 8: Write size = 19653, Num segs = 213842
< 16: Write size = 31778, Num segs = 0
< 32: Write size = 73395, Num segs = 0
< 64: Write size = 148840, Num segs = 0
< 128: Write size = 310178, Num segs = 0
< 256: Write size = 89027, Num segs = 0
< 512: Write size = 111903, Num segs = 0
< 1024: Write size = 140509, Num segs = 0
< 2048: Write size = 244052, Num segs = 0
< 4096: Write size = 217164, Num segs = 0
< 8192: Write size = 2784875, Num segs = 0
< 16384: Write size = 433506, Num segs = 0
< 32768: Write size = 11742, Num segs = 0
< 65536: Write size = 15783, Num segs = 0
< 131072: Write size = 6851, Num segs = 0
< 262144: Write size = 1562, Num segs = 0
< 524288: Write size = 755, Num segs = 0
< 1048576: Write size = 531, Num segs = 0
< 2097152: Write size = 272, Num segs = 0
< 4194304: Write size = 107, Num segs = 0
< 8388608: Write size = 0, Num segs = 0
Write size shows the number of writes with a total size smaller than the
value in the first column. Num segs shows the number of writes with fewer
segments than the value in the first column. Most writes (~95%) have only one
segment. However, these statistics include no nfs activity, even though that
is actually the case we made the patches for.
size\num 1 2 3 4 5 6 7+
< 1: 0 24 0 0 0 0 0
< 2: 20244 0 0 0 0 641526 0
< 4: 6722 0 0 0 0 0 0
< 8: 19653 0 0 0 0 213842 0
< 16: 31778 0 0 0 0 213856 0
< 32: 73395 0 0 0 0 590 0
< 64: 147730 0 0 0 0 93626 0
< 128: 100888 0 0 0 0 119597 0
< 256: 85588 0 0 0 0 12 0
< 512: 111900 0 0 0 0 3 0
< 1024: 140509 0 0 0 0 0 0
< 2048: 244052 0 0 0 0 0 0
< 4096: 217160 4 0 0 0 0 0
< 8192: 2784855 20 0 0 0 0 0
< 16384: 433506 0 0 0 0 0 0
< 32768: 11742 0 0 0 0 0 0
< 65536: 15783 0 0 0 0 0 0
< 131072: 6851 0 0 0 0 0 0
< 262144: 1562 0 0 0 0 0 0
< 524288: 755 0 0 0 0 0 0
< 1048576: 531 0 0 0 0 0 0
< 2097152: 272 0 0 0 0 0 0
< 4194304: 107 0 0 0 0 0 0
< 8388608: 0 0 0 0 0 0 0
This table shows the number of segments smaller than the value in the first
column, within write vectors of 1 to 7 segments. It shows that without nfs
only 1-, 2- and 6-segment vectors seem to occur. Depending on wsize, nfs will
have a lot more segments (129 for a 512 KiB wsize: 128 page-sized data
segments plus one for the header-shifted remainder).
Btw, what's the cause of the 24 vectors with 2 segments, one of them being
size zero?
Signed-off-by: Bernd Schubert <bs@...eap.de>
Signed-off-by: Goswin von Brederlow <goswin@...brederlow.de>
Cheers,
Goswin and Bernd
mm/filemap.c | 139 +++++++++++++++++++++++++++++++++----------------
1 file changed, 95 insertions(+), 44 deletions(-)
Index: linux-2.6.23-rc5/mm/filemap.c
===================================================================
--- linux-2.6.23-rc5.orig/mm/filemap.c 2007-09-06 18:33:11.000000000 +0200
+++ linux-2.6.23-rc5/mm/filemap.c 2007-09-06 18:33:15.000000000 +0200
@@ -1834,6 +1834,21 @@
}
EXPORT_SYMBOL(generic_file_direct_write);
+/**
+ * generic_file_buffered_write - write data from an iovec to the pagecache
+ * @iocb: the kiocb of this write
+ * @iov: vector of data to write
+ * @nr_segs: number of iov segments
+ * @pos: position in the file
+ * @ppos: position in the file after this function
+ * @count: number of bytes to write
+ * @written: offset in iov->base (data to skip on write)
+ *
+ * This function will do 3 main tasks for each iovec segment:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ */
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1851,6 +1866,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_base = 0; /* offset in the current iovec */
char __user *buf;
+ unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ loff_t wpos = pos; /* the position in the file we will return */
+
+ /* position in file as index of pages */
+ unsigned long index = pos >> PAGE_CACHE_SHIFT;
pagevec_init(&lru_pvec, 0);
@@ -1864,9 +1884,15 @@
buf = cur_iov->iov_base + iov_base;
}
+ page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+
do {
- unsigned long index;
unsigned long offset;
+ unsigned long data_end; /* end of data within the page */
size_t copied;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -1897,11 +1923,6 @@
*/
fault_in_pages_readable(buf, bytes);
}
- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
- if (!page) {
- status = -ENOMEM;
- break;
- }
if (unlikely(bytes == 0)) {
status = 0;
@@ -1909,22 +1930,26 @@
goto zero_length_segment;
}
- status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
+ data_end = offset + bytes;
- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
/*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
+ * Only prepare a write if it's an entire page or if
+ * we don't have more data
*/
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
+ status = a_ops->prepare_write(file, page, data_start, data_end);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ }
}
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
@@ -1932,60 +1957,86 @@
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
- status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
+
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
+ /*
+ * Same as above, always try to fill pages up to
+ * PAGE_CACHE_SIZE if possible (sufficient data)
+ */
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page,
+ data_start, data_end);
+ if (status == AOP_TRUNCATED_PAGE) {
+ continue;
+ }
+ unlock_page(page);
+ mark_page_accessed(page);
page_cache_release(page);
- continue;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ if (bytes < count) { /* still more iov data to write */
+ page = __grab_cache_page(mapping, index + 1,
+ &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+ } else {
+ page = NULL;
+ }
+ written += data_end - data_start;
+ wpos += data_end - data_start;
+ data_start = 0; /* strictly correct would be
+ * data_end % PAGE_CACHE_SIZE, but if that
+ * were != 0 we wouldn't have more data
+ */
}
zero_length_segment:
if (likely(copied >= 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
+ if (!status) {
+ count -= copied;
+ pos += copied;
+ buf += copied;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_base, copied);
if (count)
buf = cur_iov->iov_base +
iov_base;
} else {
- iov_base += status;
+ iov_base += copied;
}
}
}
if (unlikely(copied != bytes))
- if (status >= 0)
+ if (!status)
status = -EFAULT;
- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
- if (status < 0)
+ if (status)
break;
- balance_dirty_pages_ratelimited(mapping);
- cond_resched();
} while (count);
- *ppos = pos;
+
+out:
+ *ppos = wpos;
if (cached_page)
page_cache_release(cached_page);
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
- if (likely(status >= 0)) {
+ if (likely(!status)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
if (!a_ops->writepage || !is_sync_kiocb(iocb))
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
- }
-
+ }
+
/*
* If we get here for O_DIRECT writes then we must have fallen through
* to buffered writes (block instantiation inside i_size). So we sync
--
Bernd Schubert
Q-Leap Networks GmbH