Message-Id: <200709072052.48396.bs@q-leap.de>
Date: Fri, 7 Sep 2007 20:52:38 +0200
From: Bernd Schubert <bs@...eap.de>
To: linux-kernel@...r.kernel.org
Subject: [PATCH 1/2][RESEND] improve generic_file_buffered_write()
There has been no further response to our patches yet, so we are sending them
again, re-diffed against 2.6.23-rc5.
Hi,
recently we discovered that writing to an nfs-exported Lustre filesystem is
rather slow (20-40 MB/s writing, but over 200 MB/s reading).
As I already explained on the nfs mailing list, this happens because there is
an offset into the very first page due to the nfs header.
http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs
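For illustration, here is a rough user-space sketch of the iovec layout that
causes the problem (the 148 byte header size is an assumption, real values
vary): the first segment is shortened by the header, so every later segment
boundary is shifted against the page boundaries of the file being written.

#include <stdio.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define HDR_SIZE  148	/* assumed rpc/nfs header size */

int main(void)
{
	static char rpcbuf[9 * PAGE_SIZE];	/* stand-in for the rpc buffer */
	struct iovec iov[9];
	int i;

	/* first segment: the rest of the first receive page after the header */
	iov[0].iov_base = rpcbuf + HDR_SIZE;
	iov[0].iov_len  = PAGE_SIZE - HDR_SIZE;

	/* middle segments: full receive pages */
	for (i = 1; i < 8; i++) {
		iov[i].iov_base = rpcbuf + i * PAGE_SIZE;
		iov[i].iov_len  = PAGE_SIZE;
	}

	/* last segment: the remainder pushed into a ninth page */
	iov[8].iov_base = rpcbuf + 8 * PAGE_SIZE;
	iov[8].iov_len  = HDR_SIZE;

	for (i = 0; i < 9; i++)	/* a 32 KiB write ends up in 9 segments */
		printf("seg %d: %4zu bytes\n", i, iov[i].iov_len);
	return 0;
}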
While this especially affects Lustre, Olaf Kirch also noticed it on another
filesystem before and wrote an nfs patch for it. That patch has two
disadvantages: it requires moving all data within the pages, which is IMHO
rather CPU-time consuming, and furthermore it presently causes data
corruption when more than one nfs thread is running.
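To make the CPU cost of that approach concrete, here is a minimal sketch of
the data movement it needs (assumed shape for illustration only, not the
actual patch): every byte of the payload is copied a second time just to
restore page alignment, and the displaced tail of each page still has to be
pulled in from the following page.

#include <string.h>

#define PAGE_SIZE 4096

/* shift the payload of one receive page so it becomes page-aligned again */
void realign_page(char *page, size_t hdr_len)
{
	memmove(page, page + hdr_len, PAGE_SIZE - hdr_len);
	/* the last hdr_len bytes must then be filled from the next page */
}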
After thinking it over and over again, we (Goswin and I) believe it would be
best to improve generic_file_buffered_write().
If sufficient data is available, as is usual for aio writes,
generic_file_buffered_write() will now fill each page as much as possible and
only then prepare/commit it. Before, generic_file_buffered_write() committed
partially filled pages even though there was still more data to write.
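As a rough illustration of what this changes, here is a toy model (plain
user-space C, not the kernel code; the 148 byte header shift is an
assumption) that counts prepare/commit pairs for both strategies:

#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long min3(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long m = a < b ? a : b;
	return m < c ? m : c;
}

/* count prepare/commit pairs for a write of 'count' bytes arriving in
 * segments of 'seg' bytes, starting at file position 'pos' */
static unsigned long commits(unsigned long pos, unsigned long count,
			     unsigned long seg, int fill_pages)
{
	unsigned long n = 0, seg_left = seg;

	while (count) {
		unsigned long page_left = PAGE_SIZE - (pos & (PAGE_SIZE - 1));
		unsigned long bytes = min3(page_left, seg_left, count);

		pos += bytes;
		count -= bytes;
		seg_left -= bytes;
		if (!seg_left)
			seg_left = seg;
		/* old: commit every chunk; new: only on a full page
		 * or when the data is exhausted */
		if (!fill_pages || bytes == page_left || !count)
			n++;
	}
	return n;
}

int main(void)
{
	/* a 512 KiB write whose segments are shifted by the header */
	unsigned long count = 512 * 1024, seg = PAGE_SIZE - 148;

	printf("old: %lu prepare/commit pairs\n", commits(0, count, seg, 0));
	printf("new: %lu prepare/commit pairs\n", commits(0, count, seg, 1));
	return 0;
}

With these assumed numbers the model reports 260 pairs for the old behaviour
and 128 (one per page) for the new one.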
Some statistics:
num_writes = 4669440, bytes_total = 20231249633, segs_total = 5738644,
commit_loops = 7697604, commits_total = 6628750
commit_loops is the number of commits without the patch and commits_total is
the number of commits we actually have now. This is a saving of nearly 14%
(1 - 6628750/7697604 = ~13.9%) of the prepare, commit and cond_resched calls.
< 1: Write size = 0, Num segs = 0
< 2: Write size = 20244, Num segs = 4455583
< 4: Write size = 6722, Num segs = 24
< 8: Write size = 19653, Num segs = 213842
< 16: Write size = 31778, Num segs = 0
< 32: Write size = 73395, Num segs = 0
< 64: Write size = 148840, Num segs = 0
< 128: Write size = 310178, Num segs = 0
< 256: Write size = 89027, Num segs = 0
< 512: Write size = 111903, Num segs = 0
< 1024: Write size = 140509, Num segs = 0
< 2048: Write size = 244052, Num segs = 0
< 4096: Write size = 217164, Num segs = 0
< 8192: Write size = 2784875, Num segs = 0
< 16384: Write size = 433506, Num segs = 0
< 32768: Write size = 11742, Num segs = 0
< 65536: Write size = 15783, Num segs = 0
< 131072: Write size = 6851, Num segs = 0
< 262144: Write size = 1562, Num segs = 0
< 524288: Write size = 755, Num segs = 0
< 1048576: Write size = 531, Num segs = 0
< 2097152: Write size = 272, Num segs = 0
< 4194304: Write size = 107, Num segs = 0
< 8388608: Write size = 0, Num segs = 0
Write size shows the number of writes with a total size smaller than the
value in the first column. Num segs shows the number of writes with fewer
segments than the value in the first column. Most writes (~95%) have only one
segment. However, these statistics include no nfs activity, even though that
is actually the case we made the patches for.
size\num 1 2 3 4 5 6 7+
< 1: 0 24 0 0 0 0 0
< 2: 20244 0 0 0 0 641526 0
< 4: 6722 0 0 0 0 0 0
< 8: 19653 0 0 0 0 213842 0
< 16: 31778 0 0 0 0 213856 0
< 32: 73395 0 0 0 0 590 0
< 64: 147730 0 0 0 0 93626 0
< 128: 100888 0 0 0 0 119597 0
< 256: 85588 0 0 0 0 12 0
< 512: 111900 0 0 0 0 3 0
< 1024: 140509 0 0 0 0 0 0
< 2048: 244052 0 0 0 0 0 0
< 4096: 217160 4 0 0 0 0 0
< 8192: 2784855 20 0 0 0 0 0
< 16384: 433506 0 0 0 0 0 0
< 32768: 11742 0 0 0 0 0 0
< 65536: 15783 0 0 0 0 0 0
< 131072: 6851 0 0 0 0 0 0
< 262144: 1562 0 0 0 0 0 0
< 524288: 755 0 0 0 0 0 0
< 1048576: 531 0 0 0 0 0 0
< 2097152: 272 0 0 0 0 0 0
< 4194304: 107 0 0 0 0 0 0
< 8388608: 0 0 0 0 0 0 0
This table shows the number of segments smaller than the value in the first
column, within write vectors of 1 to 7 segments. It shows that without nfs
only 1-, 2- and 6-segment vectors seem to occur. Depending on wsize, nfs will
have a lot more segments (129 for a 512 KiB wsize: 128 page-sized data
segments plus one for the header-shifted remainder).
Btw, what's the cause of the 24 vectors with 2 segments, one of them being
size zero?
Signed-off-by: Bernd Schubert <bs@...eap.de>
Signed-off-by: Goswin von Brederlow <goswin@...brederlow.de>
Cheers,
Goswin and Bernd
mm/filemap.c | 139 +++++++++++++++++++++++++++++++++----------------
1 file changed, 95 insertions(+), 44 deletions(-)
Index: linux-2.6.23-rc5/mm/filemap.c
===================================================================
--- linux-2.6.23-rc5.orig/mm/filemap.c 2007-09-06 18:33:11.000000000 +0200
+++ linux-2.6.23-rc5/mm/filemap.c 2007-09-06 18:33:15.000000000 +0200
@@ -1834,6 +1834,21 @@
}
EXPORT_SYMBOL(generic_file_direct_write);
+/**
+ * generic_file_buffered_write - write data from an iovec to the pagecache
+ * @iocb: the kiocb of this write
+ * @iov: vector of data to write
+ * @nr_segs: number of iov segments
+ * @pos: position in the file
+ * @ppos: position in the file after this function
+ * @count: number of bytes to write
+ * @written: offset in iov->base (data to skip on write)
+ *
+ * This function will do 3 main tasks for each iovec segment:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ */
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1851,6 +1866,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_base = 0; /* offset in the current iovec */
char __user *buf;
+ unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ loff_t wpos = pos; /* the position in the file we will return */
+
+ /* position in file as index of pages */
+ unsigned long index = pos >> PAGE_CACHE_SHIFT;
pagevec_init(&lru_pvec, 0);
@@ -1864,9 +1884,15 @@
buf = cur_iov->iov_base + iov_base;
}
+ page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+
do {
- unsigned long index;
unsigned long offset;
+ unsigned long data_end; /* end of data within the page */
size_t copied;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -1897,11 +1923,6 @@
*/
fault_in_pages_readable(buf, bytes);
}
- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
- if (!page) {
- status = -ENOMEM;
- break;
- }
if (unlikely(bytes == 0)) {
status = 0;
@@ -1909,22 +1930,26 @@
goto zero_length_segment;
}
- status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
+ data_end = offset + bytes;
- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
/*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
+ * Only prepare a write if it's an entire page or if
+ * we don't have more data
*/
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
+ status = a_ops->prepare_write(file, page, data_start, data_end);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ }
}
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
@@ -1932,60 +1957,86 @@
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
- status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
+
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
+ /*
+ * Same as above, always try to fill pages up to
+ * PAGE_CACHE_SIZE if possible (sufficient data)
+ */
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page,
+ data_start, data_end);
+ if (status == AOP_TRUNCATED_PAGE) {
+ continue;
+ }
+ unlock_page(page);
+ mark_page_accessed(page);
page_cache_release(page);
- continue;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ if (bytes < count) { /* still more iov data to write */
+ page = __grab_cache_page(mapping, index + 1,
+ &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+ } else {
+ page = NULL;
+ }
+ written += data_end - data_start;
+ wpos += data_end - data_start;
+ data_start = 0; /* strictly correct would be
+ * data_end % PAGE_CACHE_SIZE, but if that
+ * were != 0 we wouldn't have more data
+ */
}
zero_length_segment:
if (likely(copied >= 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
+ if (!status) {
+ count -= copied;
+ pos += copied;
+ buf += copied;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_base, copied);
if (count)
buf = cur_iov->iov_base +
iov_base;
} else {
- iov_base += status;
+ iov_base += copied;
}
}
}
if (unlikely(copied != bytes))
- if (status >= 0)
+ if (!status)
status = -EFAULT;
- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
- if (status < 0)
+ if (status)
break;
- balance_dirty_pages_ratelimited(mapping);
- cond_resched();
} while (count);
- *ppos = pos;
+
+out:
+ *ppos = wpos;
if (cached_page)
page_cache_release(cached_page);
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
- if (likely(status >= 0)) {
+ if (likely(!status)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
if (!a_ops->writepage || !is_sync_kiocb(iocb))
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
- }
-
+ }
+
/*
* If we get here for O_DIRECT writes then we must have fallen through
* to buffered writes (block instantiation inside i_size). So we sync
--
Bernd Schubert
Q-Leap Networks GmbH