lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100216181312.GA9700@frostnet.net>
Date:	Tue, 16 Feb 2010 10:13:12 -0800
From:	Chris Frost <chris@...stnet.net>
To:	Andi Kleen <andi@...stfloor.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Heiko Carstens <heiko.carstens@...ibm.com>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Benny Halevy <bhalevy@...asas.com>, Andrew@...stfloor.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Steve VanDeBogart <vandebo-lkml@...dbox.net>,
	linux-fsdevel@...r.kernel.org, Matt Mackall <mpm@...enic.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Add the fincore() system call. fincore() is mincore() for file descriptors.

The functionality of fincore() can be emulated with an mmap(), mincore(),
and munmap(), but this emulation requires more system calls and requires
page table modifications. fincore() can provide a significant performance
improvement for non-sequential in-core queries.

CC: Andi Kleen <andi@...stfloor.org>
CC: Wu Fengguang <fengguang.wu@...el.com>
CC: Andrew Morton <akpm@...ux-foundation.org>
Signed-off-by: Chris Frost <frost@...ucla.edu>
Signed-off-by: Steve VanDeBogart <vandebo@...ucla.edu>
---

Andi, Wu, and Andrew,

Thanks for the feedback. I have incorporated most of your suggestions
into this patch. A man page for fincore(2) is attached.

A few questions about the suggestions:

* Early return when a signal is queued
Andi pointed out that it would be good to return early if a signal is
queued for the calling process. I see two options to do this:
1. Return -EINTR (the patch in this email takes this approach).
2. Return the results discovered so far.

And the tradeoffs:
1. Returning -EINTR abandons good work.
2. fincore needs to inform the caller of the number of valid vec entries.
   Three approaches:
   - Return the number of file bytes with valid page entries.
     Issue: The return type is a long. For a 32 bit process, sizeof(long)
     is significantly less than sizeof(loff_t).
   - Return the number of pages with valid page entries (use type ssize_t).
     Issue: I don't see any significant issues.
   - Add an 'loff_t *result' parameter to the system call (a la _llseek).
     Issue: does this push the number of arguments too high?

Given the above, I feel that returning the number of valid page entries is
the best approach. Feedback?

* Radix gang lookup
Andi and Andrew suggested using radix gang lookups for larger ranges.
My benchmarks show that my change to do this isn't a strict performance
win. Attached is my patch that changes the below code to do gang lookups.
Here are the microbenchmark results for querying a 1 GiB file (results
are times in seconds):
  density       | none in-core | many in-core |
  pages/syscall | 1    | 8192  | 1    | 8192  |
  --------------+------+-------+------+-------|
  fincore       | .055 | .0045 | .078 | .0250 |
  fincore-batch | .059 | .0010 | .123 | .0170 |
"none in-core" means none of the file pages were in ram and "many in-core"
that the majority were in ram. the next row indicates how many pages
are queried in each fincore(2) call by the process (1 or 8192 pages).
These results show that fincore-batch is ~2-5x faster for large count
calls, but that it is slower for small count calls. Both classes of calls
are made by our macrobenchmarks. The macrobenchmarks (a SQLite query and
a GIMP image blur) show the batch version slows performance by 2-12%.

Perhaps my gang patch can be further optimized, or is incorrect? (Of note,
I did not avoid the gets and puts.) If not, fincore_pages() could be
changed to decide choose which of a single vs. gang lookup to do based on
a heuristic tradeoff value. Or it could stay simple and only make single
lookups.

* Kernel output buffer size
Andi pointed out that the optimization to avoid allocating a buffer page
for small queries could be changed to claim more than 64 bytes of on-stack
buffer space. I haven't seen increasing this space to significantly affect
my particular macrobenchmarks. I'm ok with leaving it at 64 bytes or
changing it; this number was picked fairly arbitrarily.

* Upper query size limit
Andi suggested imposing a reasonable upper limit on the number of pages
that can be queried in a single fincore call, because the process would
not be debuggable while the system call is executing. Andi, could you
cite an example system call for me to look at that takes this approach?

* System call argument feasibility across architectures
The fincore(2) parameters are now reordered so that they should fit in six
registers on all platforms. I intend for the libc wrapper to expose the
call as shown in the man page, rather than the system call's ordering.
I think this issue is now resolved, but wanted to mention it just in case
I missed anything; does this look good for all architectures?

 include/linux/syscalls.h           |    2 
 fs/fincore.c                       |  126 +++++++++++++++++++++++++++++++++++++
 fs/Makefile                        |    2 
 include/asm-generic/unistd.h       |    5 +
 arch/x86/ia32/ia32entry.S          |    1 
 arch/x86/include/asm/unistd_32.h   |    3 
 arch/x86/include/asm/unistd_64.h   |    2 
 arch/x86/kernel/syscall_table_32.S |    1 
 8 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..814e4f5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
+asmlinkage long sys_fincore(unsigned int fd, unsigned char __user *vec,
+				loff_t start, loff_t len);
 
 asmlinkage long sys_pivot_root(const char __user *new_root,
 				const char __user *put_old);
diff --git a/fs/fincore.c b/fs/fincore.c
new file mode 100644
index 0000000..f329fe4
--- /dev/null
+++ b/fs/fincore.c
@@ -0,0 +1,126 @@
+/*
+ *	fs/fincore.c
+ *
+ * Copyright (C) 2009, 2010 Chris Frost, UC Regents
+ * Copyright (C) 2008 Steve VanDeBogart, UC Regents
+ */
+
+/*
+ * The fincore() system call.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+
+static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
+{
+	unsigned char present = 0;
+	struct page *page = find_get_page(mapping, pgoff);
+	if (page) {
+		present = PageUptodate(page);
+		page_cache_release(page);
+	}
+
+	return present;
+}
+
+/*
+ * The fincore(2) system call.
+ *
+ * fincore() returns the memory residency status of the pages backing
+ * a file range specified by fd and [start, start + len).
+ * The status is returned in a vector of bytes.  The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information.  Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * Note that the parameter order for this system calls differ from the order
+ * for the libc wrapper. This syscall order allows the parameters to fit
+ * in six registers on all architectures.
+ *
+ * return values:
+ *  zero    - success
+ *  -EBADF  - fd is an illegal file descriptor
+ *  -EFAULT - vec points to an illegal address
+ *  -EINVAL - start is not a multiple of PAGE_CACHE_SIZE
+ *  -EAGAIN - A kernel resource was temporarily unavailable
+ *  -EINTR  - The call was interrupted by a signal
+ */
+SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec,
+		loff_t, start, loff_t, len)
+{
+	long retval;
+	pgoff_t pgoff = start >> PAGE_SHIFT;
+	pgoff_t npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pgoff_t pgend = pgoff + npages;
+	struct file *filp;
+	int fput_needed;
+	loff_t file_nbytes;
+	pgoff_t file_npages;
+	unsigned char *kernel_vec = NULL;
+	unsigned char kernel_vec_small[64];
+	unsigned kernel_vec_count;
+	int i;
+
+	/* Check the start address: needs to be page-aligned.. */
+	if (start & ~PAGE_CACHE_MASK)
+		return -EINVAL;
+
+	filp = fget_light(fd, &fput_needed);
+	if (!filp)
+		return -EBADF;
+
+	file_nbytes = i_size_read(filp->f_mapping->host);
+
+	file_npages = (file_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	/*
+	 * Allocate buffer vector page.
+	 * Optimize allocation for small values of npages because the
+	 * __get_free_page() call doubles fincore(2) runtime when npages == 1.
+	 */
+	if (npages <= sizeof(kernel_vec_small)) {
+		kernel_vec = kernel_vec_small;
+		kernel_vec_count = sizeof(kernel_vec_small);
+	} else {
+		kernel_vec = (void *) __get_free_page(GFP_USER);
+		if (!kernel_vec) {
+			retval = -EAGAIN;
+			goto done;
+		}
+		kernel_vec_count = PAGE_SIZE;
+	}
+
+	while (pgoff < pgend) {
+		/*
+		 * Do at most kernel_vec_count entries per iteration, due to
+		 * the limited buffer size.
+		 */
+		for (i = 0; pgoff < pgend && i < kernel_vec_count; pgoff++, i++)
+			kernel_vec[i] = fincore_page(filp->f_mapping, pgoff);
+
+		if (copy_to_user(vec, kernel_vec, i)) {
+			retval = -EFAULT;
+			break;
+		}
+		vec += i;
+
+		if (signal_pending(current)) {
+			retval = -EINTR;
+			break;
+		}
+		cond_resched();
+	}
+	retval = 0;
+done:
+	if (kernel_vec && kernel_vec != kernel_vec_small)
+		free_page((unsigned long) kernel_vec);
+	fput_light(filp, fput_needed);
+	return retval;
+}
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..a3ccd6b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o
+		stack.o fs_struct.o fincore.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ce76dc8 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open 241
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 
+#define __NR_fincore 242
+__SYSCALL(__NR_fincore, sys_fincore)
+
 #undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243
 
 /*
  * All syscalls below here should go away really,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..cbf96e6 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
 	.quad compat_sys_pwritev
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
+	.quad sys_fincore
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..088b235 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,10 +342,11 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_fincore		337
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 337
+#define NR_syscalls 338
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..ebc04b5 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_fincore				299
+__SYSCALL(__NR_fincore, sys_fincore)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1fdc8bc 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long sys_fincore
-- 
Chris Frost
http://www.frostnet.net/chris/

View attachment "fincore.2" of type "text/plain" (4137 bytes)

View attachment "batch.patch" of type "text/x-diff" (2487 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ