lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <x49ac4a5zkw.fsf@segfault.boston.devel.redhat.com>
Date:	Thu, 05 Oct 2006 15:31:43 -0400
From:	Jeff Moyer <jmoyer@...hat.com>
To:	Andrew Morton <akpm@...l.org>
Cc:	Zach Brown <zach.brown@...cle.com>, linux-kernel@...r.kernel.org
Subject: Re: [patch] call truncate_inode_pages in the DIO fallback to buffered I/O path

==> Regarding Re: [patch] call truncate_inode_pages in the DIO fallback to buffered I/O path; Andrew Morton <akpm@...l.org> adds:

akpm> On Wed, 04 Oct 2006 16:53:53 -0400
akpm> Jeff Moyer <jmoyer@...hat.com> wrote:

>> The man page for open states:
>> 
>> O_DIRECT
>> Try to minimize cache effects of the I/O to and from this file.
>> 
>> I think that invalidating the page cache pages we use when falling back to
>> buffered I/O stays true to the above description.

akpm> What the manpage forgot to mention is "direct-io is synchronous".

akpm> Except it isn't, when we fall back to buffered-IO.  So yup, I think
akpm> we could justify this (sort of) change on those grounds alone:
akpm> preserving the synchronous semantics.

akpm> I'd propose that we do this via

akpm> 	generic_file_buffered_write(...);
akpm> 	do_sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|
akpm> 			SYNC_FILE_RANGE_WRITE|
akpm> 			SYNC_FILE_RANGE_WAIT_AFTER)

akpm> 	invalidate_mapping_pages(...);

OK, patch attached.  It's a bit awkward that some interfaces take a page
index while others take an offset in bytes.  I tested it with the code I
sent earlier in this thread.  I'll get results from oast and some dbcreate
runs comparing kernels with and without this change and post them when
available (probably in the next day or two).  Feel free to hold off
comitting this until that data is presented.

akpm> There is a slight inefficiency here: generic_file_direct_IO() does
akpm> invalidate_inode_pages2_range(), then we go and instantiate some
akpm> pagecache, then we strip it away again with
akpm> invalidate_mapping_pages().  That first
akpm> invalidate_inode_pages2_range() was somewhat of a waste of cycles.

akpm> But we expect that the next call to generic_file_direct_IO() won't
akpm> actually call invalidate_inode_pages2_range(), because
akpm> mapping->nrpages is usually zero.

akpm> Well, it would have been, back in the days when we were invalidating
akpm> the whole file.  Now are more efficient and we only invalidate the
akpm> specific segment of that file.  So if there's a stray pagecache page
akpm> somewhere at the far end ofthe file, we'll pointlessly call
akpm> invalidate_inode_pages2_range() every time.  Oh well.

I don't think it's worth losing sleep over.  I wonder how many applications
actually mix buffered and unbuffered I/O to the same file.

Cheers,

Jeff

--- linux-2.6.18.i686/mm/filemap.c.orig	2006-10-02 12:59:25.000000000 -0400
+++ linux-2.6.18.i686/mm/filemap.c	2006-10-05 15:06:26.000000000 -0400
@@ -2350,13 +2350,13 @@ __generic_file_aio_write_nolock(struct k
 				unsigned long nr_segs, loff_t *ppos)
 {
 	struct file *file = iocb->ki_filp;
-	const struct address_space * mapping = file->f_mapping;
+	struct address_space * mapping = file->f_mapping;
 	size_t ocount;		/* original count */
 	size_t count;		/* after file limit checks */
 	struct inode 	*inode = mapping->host;
 	unsigned long	seg;
 	loff_t		pos;
-	ssize_t		written;
+	ssize_t		written, written_direct;
 	ssize_t		err;
 
 	ocount = 0;
@@ -2386,7 +2386,7 @@ __generic_file_aio_write_nolock(struct k
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
-	written = 0;
+	written = written_direct = 0;
 
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
@@ -2413,10 +2413,32 @@ __generic_file_aio_write_nolock(struct k
 		 */
 		pos += written;
 		count -= written;
+
+		written_direct = written;
 	}
 
 	written = generic_file_buffered_write(iocb, iov, nr_segs,
 			pos, ppos, count, written);
+
+	/*
+	 *  When falling through to buffered I/O, we need to ensure that the
+	 *  page cache pages are written to disk and invalidated to preserve
+	 *  the expected O_DIRECT semantics.
+	 */
+	if (unlikely(file->f_flags & O_DIRECT)) {
+		pgoff_t endbyte = pos + count - 1;
+
+		err = do_sync_file_range(file, pos, endbyte,
+					 SYNC_FILE_RANGE_WAIT_BEFORE|
+					 SYNC_FILE_RANGE_WRITE|
+					 SYNC_FILE_RANGE_WAIT_AFTER);
+		if (err == 0)
+			invalidate_mapping_pages(mapping,
+						 pos >> PAGE_CACHE_SHIFT,
+						 endbyte >> PAGE_CACHE_SHIFT);
+		else
+			written = written_direct;
+	}
 out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ