lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140602122322.GB8691@node.dhcp.inet.fi>
Date:	Mon, 2 Jun 2014 15:23:22 +0300
From:	"Kirill A. Shutemov" <kirill@...temov.name>
To:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Konstantin Khlebnikov <koct9i@...il.com>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Borislav Petkov <bp@...en8.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	David Miller <davem@...emloft.net>,
	Andres Freund <andres@...quadrant.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 2/3] mm: introduce fincore()

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
> 
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
> 
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@...jp.nec.com>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP		0x01	/* bytemap mode */
> +#define FINCORE_PFN		0x02
> +#define FINCORE_PAGE_FLAGS	0x04
> +#define FINCORE_PAGECACHE_TAGS	0x08
> +#define FINCORE_SKIP_HOLE	0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

 +-------+-------+-------+-------+-------+-------+-------+-------+
 | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
 +-------+-------+-------+-------+-------+-------+-------+-------+
  <-----------------------------> <------------------------------>
             32 bytes                        32 bytes

> +
> +#define FINCORE_MODE_MASK	0x1f
> +#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ