Message-ID: <20130104042520.GC2617@blaptop>
Date: Fri, 4 Jan 2013 13:25:20 +0900
From: Minchan Kim <minchan@...nel.org>
To: Taras Glek <tglek@...illa.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Michael Kerrisk <mtk.manpages@...il.com>,
Arun Sharma <asharma@...com>, sanjay@...gle.com,
Paul Turner <pjt@...gle.com>,
David Rientjes <rientjes@...gle.com>,
John Stultz <john.stultz@...aro.org>,
Christoph Lameter <cl@...ux.com>,
Android Kernel Team <kernel-team@...roid.com>,
Robert Love <rlove@...gle.com>, Mel Gorman <mel@....ul.ie>,
Hugh Dickins <hughd@...gle.com>,
Dave Hansen <dave@...ux.vnet.ibm.com>,
Rik van Riel <riel@...hat.com>,
Dave Chinner <david@...morbit.com>, Neil Brown <neilb@...e.de>,
Mike Hommey <mh@...ndium.org>,
KOSAKI Motohiro <kosaki.motohiro@...il.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [RFC 1/8] Introduce new system call mvolatile
Hi,
On Thu, Jan 03, 2013 at 10:35:24AM -0800, Taras Glek wrote:
> On 1/2/2013 8:27 PM, Minchan Kim wrote:
> >This patch adds the new system calls m[no]volatile.
> >If someone asks for an is_volatile system call, it could be added, too.
> >
> >The reason I introduced new system calls instead of madvise is that
> >m[no]volatile's vma handling is totally different from madvise's vma
> >handling.
> >
> >1) m[no]volatile should succeed even if the range includes
> >   unmapped or non-volatile areas. It just skips such ranges
> >   instead of stopping and returning an error when it encounters
> >   an invalid range. This saves the user from having to make several
> >   system calls over small ranges - Suggested by John Stultz
> >
> >2) The purged state of a volatile range should be propagated out to the
> >   user even if the range was merged with an adjacent non-volatile range
> >   when the user calls mnovolatile.
> >
> >3) mvolatile's interface could diverge from madvise
> >   in future discussion. For example, I feel the need for
> >   mvolatile(start, len, mode).
> >   'mode' means FULL_VOLATILE or PARTIAL_VOLATILE.
> >   FULL_VOLATILE means that if the VM decides to reclaim the range, it
> >   would reclaim all of the pages in the range, while in the case of
> >   PARTIAL_VOLATILE, the VM could reclaim just a few pages in the range.
> >   In the case of tmpfs-volatile, the user may regenerate all image data
> >   once one of the pages in the range is discarded, so it is pointless
> >   for the VM to discard just one page of the range when memory pressure
> >   is severe.
> >   In the case of anon-volatile, discarding too much causes too many
> >   minor faults for the allocator, so it would be better to discard only
> >   part of the range.
> I don't understand point 3).
> Are you saying that using mvolatile in conjunction with madvise could
> allow mvolatile behavior to be tweaked in the future? Or are you
> suggesting adding an extra parameter in the future (what would that
> have to do with madvise)?
I meant that I might want to expand mvolatile's interface as below
during the discussion:
int mvolatile(start, len, mode);
>
> 4) Having a new system call makes it easier for userspace apps to
> detect kernels without this functionality.
I couldn't understand your claim.
Currently, mvolatile just returns EINVAL on a !CONFIG_VOLATILE_PAGE system.
Why is that easier to detect than calling madvise(VOLATILE) and getting
EINVAL on !CONFIG_VOLATILE_PAGE?
>
> I really like the proposed interface. I like the suggestion of
Thanks.
> having explicit FULL|PARTIAL_VOLATILE. Why not include
> PARTIAL_VOLATILE as a required 3rd param in the first version, with the
> expectation that FULL_VOLATILE will be added later (returning some
> not-supported error in the meantime)?
I just wanted to discuss whether we need it.
The reason I need PARTIAL_VOLATILE is that it avoids many minor faults
for the allocator. Is it useful for tmpfs-volatile, too?
Thanks for the feedback, Taras.
> >
> >4) The mvolatile system call's return value is quite different from
> >   madvise's. Look at the semantic explanation below.
> >
> >So I want to separate mvolatile from madvise.
> >
> >mvolatile(start, len)'s semantics
> >
> >1) It makes the range (start, len) volatile even if the range includes
> >unmapped areas, special mappings, and mlocked areas, which are just skipped.
> >
> >Return -EINVAL if the range doesn't include a right vma at all.
> >Return -ENOMEM, interrupting the range operation, if memory is not
> >enough to merge/split vmas. In this case, some ranges would be
> >volatile and others not, so the user may call mvolatile again after
> >cancelling all ranges with mnovolatile.
> >Return 0 if the range consists of only proper vmas.
> >Return 1 if part of the range includes hole/huge/ksm/mlock/special areas.
> >
> >2) If the user calls mvolatile on a range that is already a volatile VMA,
> >even one in purged state, the VOLATILE attribute remains but the purged
> >state is reset. I expect some users will want to split a volatile vma
> >into smaller ranges. Although they could do it with mnovolatile(whole
> >range) and several calls to mvolatile(smaller range), this behavior
> >avoids the mnovolatile if they don't care about the purged state. I'm
> >not sure we really need this behavior, so I hope to hear opinions.
> >Unfortunately, the current implementation doesn't split the volatile VMA
> >with the new range in this case. I forgot to implement it in this version
> >but decided to send the patch anyway to gather opinions, because
> >implementing it is rather trivial if we decide to.
> >
> >mnovolatile(start, len)'s semantics are as follows.
> >
> >1) It makes the range (start, len) non-volatile even if the range
> >includes unmapped areas, special mappings, and non-volatile ranges,
> >which are just skipped.
> >
> >2) If the range was purged, it will return 1 regardless of including
> >invalid ranges.
> If I understand this correctly:
> mvolatile(0, 10);
> // then range [9,10] is purged by the kernel
> mnovolatile(0,4) will fail?
> That seems counterintuitive.
>
> One of the uses for mnovolatile is to atomically lock the pages (vs a
> racy proposed is_volatile syscall). The above situation would make it
> less effective.
>
>
> >
> >3) It returns -ENOMEM if the system doesn't have enough memory for the vma operation.
> >
> >4) It returns -EINVAL if the range doesn't include a right vma at all.
> >
> >5) If the user tries to access a purged range without an mnovolatile call,
> >it encounters SIGBUS, which shows up in the next patch.
> >
> >Cc: Michael Kerrisk <mtk.manpages@...il.com>
> >Cc: Arun Sharma <asharma@...com>
> >Cc: sanjay@...gle.com
> >Cc: Paul Turner <pjt@...gle.com>
> >CC: David Rientjes <rientjes@...gle.com>
> >Cc: John Stultz <john.stultz@...aro.org>
> >Cc: Andrew Morton <akpm@...ux-foundation.org>
> >Cc: Christoph Lameter <cl@...ux.com>
> >Cc: Android Kernel Team <kernel-team@...roid.com>
> >Cc: Robert Love <rlove@...gle.com>
> >Cc: Mel Gorman <mel@....ul.ie>
> >Cc: Hugh Dickins <hughd@...gle.com>
> >Cc: Dave Hansen <dave@...ux.vnet.ibm.com>
> >Cc: Rik van Riel <riel@...hat.com>
> >Cc: Dave Chinner <david@...morbit.com>
> >Cc: Neil Brown <neilb@...e.de>
> >Cc: Mike Hommey <mh@...ndium.org>
> >Cc: Taras Glek <tglek@...illa.com>
> >Cc: KOSAKI Motohiro <kosaki.motohiro@...il.com>
> >Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> >Signed-off-by: Minchan Kim <minchan@...nel.org>
> >---
> > arch/x86/syscalls/syscall_64.tbl | 2 +
> > fs/exec.c | 4 +-
> > include/linux/mm.h | 6 +-
> > include/linux/mm_types.h | 4 +
> > include/linux/mvolatile.h | 30 ++++
> > include/linux/syscalls.h | 2 +
> > mm/Kconfig | 11 ++
> > mm/Makefile | 2 +-
> > mm/madvise.c | 2 +-
> > mm/mempolicy.c | 2 +-
> > mm/mlock.c | 7 +-
> > mm/mmap.c | 62 ++++++--
> > mm/mprotect.c | 3 +-
> > mm/mremap.c | 2 +-
> > mm/mvolatile.c | 312 ++++++++++++++++++++++++++++++++++++++
> > mm/rmap.c | 2 +
> > 16 files changed, 427 insertions(+), 26 deletions(-)
> > create mode 100644 include/linux/mvolatile.h
> > create mode 100644 mm/mvolatile.c
> >
> >diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> >index a582bfe..568d488 100644
> >--- a/arch/x86/syscalls/syscall_64.tbl
> >+++ b/arch/x86/syscalls/syscall_64.tbl
> >@@ -319,6 +319,8 @@
> > 310 64 process_vm_readv sys_process_vm_readv
> > 311 64 process_vm_writev sys_process_vm_writev
> > 312 common kcmp sys_kcmp
> >+313 common mvolatile sys_mvolatile
> >+314 common mnovolatile sys_mnovolatile
> > #
> > # x32-specific system call numbers start at 512 to avoid cache impact
> >diff --git a/fs/exec.c b/fs/exec.c
> >index 0039055..da677d1 100644
> >--- a/fs/exec.c
> >+++ b/fs/exec.c
> >@@ -594,7 +594,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> > /*
> > * cover the whole range: [new_start, old_end)
> > */
> >- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> >+ if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL, NULL))
> > return -ENOMEM;
> > /*
> >@@ -628,7 +628,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> > /*
> > * Shrink the vma to just the new range. Always succeeds.
> > */
> >- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> >+ vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL, NULL);
> > return 0;
> > }
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index bcaab4e..4bb59f3 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
> > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
> >+#define VM_VOLATILE 0x00001000 /* Pages could be discarded without swapout */
> > #define VM_LOCKED 0x00002000
> > #define VM_IO 0x00004000 /* Memory mapped I/O or similar */
> >@@ -1411,11 +1412,12 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
> > /* mmap.c */
> > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
> > extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >- unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
> >+ unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+ bool *purged);
> > extern struct vm_area_struct *vma_merge(struct mm_struct *,
> > struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> > unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> >- struct mempolicy *);
> >+ struct mempolicy *, bool *purged);
> > extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> > extern int split_vma(struct mm_struct *,
> > struct vm_area_struct *, unsigned long addr, int new_below);
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 31f8a3a..1eaf458 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -275,6 +275,10 @@ struct vm_area_struct {
> > #ifdef CONFIG_NUMA
> > struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> > #endif
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+ /* True if at least one page in this vma has been reclaimed. */
> >+ bool purged; /* Serialized by mmap_sem and anon_vma's mutex */
> >+#endif
> > };
> > struct core_thread {
> >diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> >new file mode 100644
> >index 0000000..cfb12b4
> >--- /dev/null
> >+++ b/include/linux/mvolatile.h
> >@@ -0,0 +1,30 @@
> >+#ifndef __LINUX_MVOLATILE_H
> >+#define __LINUX_MVOLATILE_H
> >+
> >+#include <linux/syscalls.h>
> >+
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+ return vma->purged;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+ struct vm_area_struct *src)
> >+{
> >+ dst->purged = src->purged;
> >+}
> >+#else
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+ return false;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+ struct vm_area_struct *src)
> >+{
> >+
> >+}
> >+#endif
> >+#endif
> >+
> >diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> >index 727f0cd..a8ded1c 100644
> >--- a/include/linux/syscalls.h
> >+++ b/include/linux/syscalls.h
> >@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
> > asmlinkage long sys_mlockall(int flags);
> > asmlinkage long sys_munlockall(void);
> > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> >+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
> >+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
> > asmlinkage long sys_mincore(unsigned long start, size_t len,
> > unsigned char __user * vec);
> >diff --git a/mm/Kconfig b/mm/Kconfig
> >index a3f8ddd..30b24ba 100644
> >--- a/mm/Kconfig
> >+++ b/mm/Kconfig
> >@@ -355,6 +355,17 @@ choice
> > benefit.
> > endchoice
> >+config VOLATILE_PAGE
> >+ bool "Volatile Page Support"
> >+ depends on MMU
> >+ help
> >+ Enabling this option adds the system calls mvolatile and mnovolatile,
> >+ which hand a range of the user's address space to the kernel so the
> >+ VM can discard pages in the range at any time instead of swapping
> >+ them out. This feature can enhance performance for certain
> >+ applications (e.g. memory allocators, a web browser's tmpfs pages)
> >+ by reducing the number of minor faults and swap-outs.
> >+
> > config CROSS_MEMORY_ATTACH
> > bool "Cross Memory Support"
> > depends on MMU
> >diff --git a/mm/Makefile b/mm/Makefile
> >index 6b025f8..1efb735 100644
> >--- a/mm/Makefile
> >+++ b/mm/Makefile
> >@@ -5,7 +5,7 @@
> > mmu-y := nommu.o
> > mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
> > mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> >- vmalloc.o pagewalk.o pgtable-generic.o
> >+ mvolatile.o vmalloc.o pagewalk.o pgtable-generic.o
> > ifdef CONFIG_CROSS_MEMORY_ATTACH
> > mmu-$(CONFIG_MMU) += process_vm_access.o
> >diff --git a/mm/madvise.c b/mm/madvise.c
> >index 03dfa5c..6ffad21 100644
> >--- a/mm/madvise.c
> >+++ b/mm/madvise.c
> >@@ -99,7 +99,7 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > *prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >- vma->vm_file, pgoff, vma_policy(vma));
> >+ vma->vm_file, pgoff, vma_policy(vma), NULL);
> > if (*prev) {
> > vma = *prev;
> > goto success;
> >diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >index 4ea600d..9b1aa2d 100644
> >--- a/mm/mempolicy.c
> >+++ b/mm/mempolicy.c
> >@@ -675,7 +675,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
> > ((vmstart - vma->vm_start) >> PAGE_SHIFT);
> > prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
> > vma->anon_vma, vma->vm_file, pgoff,
> >- new_pol);
> >+ new_pol, NULL);
> > if (prev) {
> > vma = prev;
> > next = vma->vm_next;
> >diff --git a/mm/mlock.c b/mm/mlock.c
> >index f0b9ce5..e03523a 100644
> >--- a/mm/mlock.c
> >+++ b/mm/mlock.c
> >@@ -316,13 +316,14 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > int ret = 0;
> > int lock = !!(newflags & VM_LOCKED);
> >- if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> >- is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
> >+ if (newflags == vma->vm_flags || (vma->vm_flags &
> >+ (VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
> >+ vma == get_gate_vma(current->mm))
> > goto out; /* don't set VM_LOCKED, don't count */
> > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> >- vma->vm_file, pgoff, vma_policy(vma));
> >+ vma->vm_file, pgoff, vma_policy(vma), NULL);
> > if (*prev) {
> > vma = *prev;
> > goto success;
> >diff --git a/mm/mmap.c b/mm/mmap.c
> >index 9a796c4..ba636c3 100644
> >--- a/mm/mmap.c
> >+++ b/mm/mmap.c
> >@@ -31,6 +31,7 @@
> > #include <linux/audit.h>
> > #include <linux/khugepaged.h>
> > #include <linux/uprobes.h>
> >+#include <linux/mvolatile.h>
> > #include <asm/uaccess.h>
> > #include <asm/cacheflush.h>
> >@@ -516,7 +517,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> > * before we drop the necessary locks.
> > */
> > int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >- unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
> >+ unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+ bool *purged)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *next = vma->vm_next;
> >@@ -527,10 +529,9 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > struct file *file = vma->vm_file;
> > long adjust_next = 0;
> > int remove_next = 0;
> >+ struct vm_area_struct *exporter = NULL;
> > if (next && !insert) {
> >- struct vm_area_struct *exporter = NULL;
> >-
> > if (end >= next->vm_end) {
> > /*
> > * vma expands, overlapping all the next, and
> >@@ -621,6 +622,15 @@ again: remove_next = 1 + (end > next->vm_end);
> > if (adjust_next) {
> > next->vm_start += adjust_next << PAGE_SHIFT;
> > next->vm_pgoff += adjust_next;
> >+ /*
> >+ * See mm/mvolatile.c for the terminology.
> >+ * V4. NNPPVV -> NNNPVV
> >+ */
> >+ if (purged) {
> >+ *purged = vma_purged(next);
> >+ if (exporter == vma) /* V5. VVPPNN -> VVPNNN */
> >+ *purged = vma_purged(vma);
> >+ }
> > }
> > if (root) {
> >@@ -651,6 +661,13 @@ again: remove_next = 1 + (end > next->vm_end);
> > anon_vma_interval_tree_post_update_vma(vma);
> > if (adjust_next)
> > anon_vma_interval_tree_post_update_vma(next);
> >+ /*
> >+ * See mm/mvolatile.c for the terminology.
> >+ * V7. VVPPVV -> VVNPVV
> >+ * V8. VVPPVV -> VVPNVV
> >+ */
> >+ if (insert)
> >+ vma_purge_copy(insert, vma);
> > anon_vma_unlock(anon_vma);
> > }
> > if (mapping)
> >@@ -670,6 +687,20 @@ again: remove_next = 1 + (end > next->vm_end);
> > }
> > if (next->anon_vma)
> > anon_vma_merge(vma, next);
> >+
> >+ /*
> >+ * next is detached from the anon vma chain, so purged can't be
> >+ * raced any more.
> >+ * See mm/mvolatile.c for the terminology.
> >+ *
> >+ * V1. NNPPVV -> NNNNVV
> >+ * V2. VVPPNN -> VVNNNN
> >+ * V3. NNPPNN -> NNNNNN
> >+ */
> >+ if (purged) {
> >+ *purged |= vma_purged(vma); /* case V2 */
> >+ *purged |= vma_purged(next); /* case V1,V3 */
> >+ }
> > mm->map_count--;
> > mpol_put(vma_policy(next));
> > kmem_cache_free(vm_area_cachep, next);
> >@@ -798,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > struct vm_area_struct *prev, unsigned long addr,
> > unsigned long end, unsigned long vm_flags,
> > struct anon_vma *anon_vma, struct file *file,
> >- pgoff_t pgoff, struct mempolicy *policy)
> >+ pgoff_t pgoff, struct mempolicy *policy, bool *purged)
> > {
> > pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
> > struct vm_area_struct *area, *next;
> >@@ -808,7 +839,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > * We later require that vma->vm_flags == vm_flags,
> > * so this tests vma->vm_flags & VM_SPECIAL, too.
> > */
> >- if (vm_flags & VM_SPECIAL)
> >+ if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
> > return NULL;
> > if (prev)
> >@@ -837,10 +868,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > next->anon_vma, NULL)) {
> > /* cases 1, 6 */
> > err = vma_adjust(prev, prev->vm_start,
> >- next->vm_end, prev->vm_pgoff, NULL);
> >+ next->vm_end, prev->vm_pgoff, NULL, purged);
> > } else /* cases 2, 5, 7 */
> > err = vma_adjust(prev, prev->vm_start,
> >- end, prev->vm_pgoff, NULL);
> >+ end, prev->vm_pgoff, NULL, purged);
> > if (err)
> > return NULL;
> > khugepaged_enter_vma_merge(prev);
> >@@ -856,10 +887,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > anon_vma, file, pgoff+pglen)) {
> > if (prev && addr < prev->vm_end) /* case 4 */
> > err = vma_adjust(prev, prev->vm_start,
> >- addr, prev->vm_pgoff, NULL);
> >+ addr, prev->vm_pgoff, NULL, purged);
> > else /* cases 3, 8 */
> > err = vma_adjust(area, addr, next->vm_end,
> >- next->vm_pgoff - pglen, NULL);
> >+ next->vm_pgoff - pglen, NULL, purged);
> > if (err)
> > return NULL;
> > khugepaged_enter_vma_merge(area);
> >@@ -1292,7 +1323,8 @@ munmap_back:
> > /*
> > * Can we just expand an old mapping?
> > */
> >- vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> >+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file,
> >+ pgoff, NULL, NULL);
> > if (vma)
> > goto out;
> >@@ -2025,9 +2057,10 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> > if (new_below)
> > err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> >- ((addr - new->vm_start) >> PAGE_SHIFT), new);
> >+ ((addr - new->vm_start) >> PAGE_SHIFT), new, NULL);
> > else
> >- err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
> >+ err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff,
> >+ new, NULL);
> > /* Success. */
> > if (!err)
> >@@ -2240,7 +2273,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
> > /* Can we just expand an old private anonymous mapping? */
> > vma = vma_merge(mm, prev, addr, addr + len, flags,
> >- NULL, NULL, pgoff, NULL);
> >+ NULL, NULL, pgoff, NULL, NULL);
> > if (vma)
> > goto out;
> >@@ -2396,7 +2429,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> > return NULL; /* should never get here */
> > new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> >- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+ NULL);
> > if (new_vma) {
> > /*
> > * Source vma may have been merged into new_vma
> >diff --git a/mm/mprotect.c b/mm/mprotect.c
> >index a409926..f461177 100644
> >--- a/mm/mprotect.c
> >+++ b/mm/mprotect.c
> >@@ -179,7 +179,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> > */
> > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > *pprev = vma_merge(mm, *pprev, start, end, newflags,
> >- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+ vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+ NULL);
> > if (*pprev) {
> > vma = *pprev;
> > goto success;
> >diff --git a/mm/mremap.c b/mm/mremap.c
> >index 1b61c2d..8586c52 100644
> >--- a/mm/mremap.c
> >+++ b/mm/mremap.c
> >@@ -512,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> > int pages = (new_len - old_len) >> PAGE_SHIFT;
> > if (vma_adjust(vma, vma->vm_start, addr + new_len,
> >- vma->vm_pgoff, NULL)) {
> >+ vma->vm_pgoff, NULL, NULL)) {
> > ret = -ENOMEM;
> > goto out;
> > }
> >diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> >new file mode 100644
> >index 0000000..8b812d2
> >--- /dev/null
> >+++ b/mm/mvolatile.c
> >@@ -0,0 +1,312 @@
> >+/*
> >+ * linux/mm/mvolatile.c
> >+ *
> >+ * Copyright 2012 Minchan Kim
> >+ *
> >+ * This work is licensed under the terms of the GNU GPL, version 2. See
> >+ * the COPYING file in the top-level directory.
> >+ */
> >+
> >+#include <linux/mvolatile.h>
> >+#include <linux/mm_types.h>
> >+#include <linux/mm.h>
> >+#include <linux/rmap.h>
> >+#include <linux/mempolicy.h>
> >+
> >+#ifndef CONFIG_VOLATILE_PAGE
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+ return -EINVAL;
> >+}
> >+
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+ return -EINVAL;
> >+}
> >+#else
> >+
> >+#define NO_PURGED 0
> >+#define PURGED 1
> >+
> >+/*
> >+ * N: Normal VMA
> >+ * V: Volatile VMA
> >+ * P: Purged volatile VMA
> >+ *
> >+ * Assume each VMA spans two blocks, so cases 1-8 consist of three VMAs.
> >+ * For example, NNPPVV means VMA1 is a normal VMA, VMA2 a purged volatile
> >+ * VMA, and VMA3 a volatile VMA. As another example, NNPVVV means VMA1 is
> >+ * a normal VMA, VMA2-1 a purged volatile VMA, and VMA2-2 a volatile VMA.
> >+ *
> >+ * Cases 7,8 create a new VMA, which we call VMA4; it can be located
> >+ * before VMA2 or after it.
> >+ *
> >+ * Note: merging between volatile VMAs shouldn't happen.
> >+ * If we call mnovolatile(VMA2):
> >+ *
> >+ * Case 1 NNPPVV -> NNNNVV
> >+ * Case 2 VVPPNN -> VVNNNN
> >+ * Case 3 NNPPNN -> NNNNNN
> >+ * Case 4 NNPPVV -> NNNPVV
> >+ * Case 5 VVPPNN -> VVPNNN
> >+ * Case 6 VVPPVV -> VVNNVV
> >+ * Case 7 VVPPVV -> VVNPVV
> >+ * Case 8 VVPPVV -> VVPNVV
> >+ */
> >+static int do_mnovolatile(struct vm_area_struct *vma,
> >+ struct vm_area_struct **prev, unsigned long start,
> >+ unsigned long end, bool *is_purged)
> >+{
> >+ unsigned long new_flags;
> >+ int error = 0;
> >+ struct mm_struct *mm = vma->vm_mm;
> >+ pgoff_t pgoff;
> >+ bool purged = false;
> >+
> >+ new_flags = vma->vm_flags & ~VM_VOLATILE;
> >+ if (new_flags == vma->vm_flags) {
> >+ *prev = vma;
> >+ goto success;
> >+ }
> >+
> >+ pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >+ *prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >+ vma->vm_file, pgoff, vma_policy(vma), &purged);
> >+ if (*prev) {
> >+ vma = *prev;
> >+ goto success;
> >+ }
> >+
> >+ *prev = vma;
> >+
> >+ if (start != vma->vm_start) {
> >+ error = split_vma(mm, vma, start, 1);
> >+ if (error)
> >+ goto out;
> >+ }
> >+
> >+ if (end != vma->vm_end) {
> >+ error = split_vma(mm, vma, end, 0);
> >+ if (error)
> >+ goto out;
> >+ }
> >+
> >+success:
> >+ /* V6. VVPPVV -> VVNNVV */
> >+ vma_lock_anon_vma(vma);
> >+ *is_purged |= (vma->purged|purged);
> >+ vma_unlock_anon_vma(vma);
> >+
> >+ vma->vm_flags = new_flags;
> >+ vma->purged = false;
> >+ return 0;
> >+out:
> >+ return error;
> >+}
> >+
> >+/* I didn't look into KSM/hugepage, so they are disabled */
> >+#define VM_NO_VOLATILE (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
> >+ VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
> >+
> >+static int do_mvolatile(struct vm_area_struct *vma,
> >+ struct vm_area_struct **prev, unsigned long start, unsigned long end)
> >+{
> >+ int error = -EINVAL;
> >+ vm_flags_t new_flags = vma->vm_flags;
> >+ struct mm_struct *mm = vma->vm_mm;
> >+
> >+ new_flags |= VM_VOLATILE;
> >+
> >+ /* Note: the current version doesn't support volatile on file vmas */
> >+ if (vma->vm_file) {
> >+ *prev = vma;
> >+ goto out;
> >+ }
> >+
> >+ if (vma->vm_flags & VM_NO_VOLATILE ||
> >+ (vma == get_gate_vma(current->mm))) {
> >+ *prev = vma;
> >+ goto out;
> >+ }
> >+ /*
> >+ * If mvolatile is called again on an already-volatile vma,
> >+ * we just reset the purged state.
> >+ */
> >+ if (new_flags == vma->vm_flags) {
> >+ *prev = vma;
> >+ vma_lock_anon_vma(vma);
> >+ vma->purged = false;
> >+ vma_unlock_anon_vma(vma);
> >+ error = 0;
> >+ goto out;
> >+ }
> >+
> >+ *prev = vma;
> >+
> >+ if (start != vma->vm_start) {
> >+ error = split_vma(mm, vma, start, 1);
> >+ if (error)
> >+ goto out;
> >+ }
> >+
> >+ if (end != vma->vm_end) {
> >+ error = split_vma(mm, vma, end, 0);
> >+ if (error)
> >+ goto out;
> >+ }
> >+
> >+ error = 0;
> >+
> >+ vma_lock_anon_vma(vma);
> >+ vma->vm_flags = new_flags;
> >+ vma_unlock_anon_vma(vma);
> >+out:
> >+ return error;
> >+}
> >+
> >+/*
> >+ * Return -EINVAL if the range doesn't include a right vma at all.
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not
> >+ * enough to merge/split vmas.
> >+ * Return 0 if the range consists of only proper vmas.
> >+ * Return 1 if part of the range includes an invalid area (e.g. hole/
> >+ * huge/ksm/mlock/special areas).
> >+ */
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+ unsigned long end, tmp;
> >+ struct vm_area_struct *vma, *prev;
> >+ bool invalid = false;
> >+ int error = -EINVAL;
> >+
> >+ down_write(&current->mm->mmap_sem);
> >+ if (start & ~PAGE_MASK)
> >+ goto out;
> >+
> >+ len &= PAGE_MASK;
> >+ if (!len)
> >+ goto out;
> >+
> >+ end = start + len;
> >+ if (end < start)
> >+ goto out;
> >+
> >+ vma = find_vma_prev(current->mm, start, &prev);
> >+ if (!vma)
> >+ goto out;
> >+
> >+ if (start > vma->vm_start)
> >+ prev = vma;
> >+
> >+ for (;;) {
> >+ /* Here start < (end|vma->vm_end). */
> >+ if (start < vma->vm_start) {
> >+ start = vma->vm_start;
> >+ if (start >= end)
> >+ goto out;
> >+ invalid = true;
> >+ }
> >+
> >+ /* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+ tmp = vma->vm_end;
> >+ if (end < tmp)
> >+ tmp = end;
> >+
> >+ /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+ error = do_mvolatile(vma, &prev, start, tmp);
> >+ if (error == -ENOMEM) {
> >+ up_write(&current->mm->mmap_sem);
> >+ return error;
> >+ }
> >+ if (error == -EINVAL)
> >+ invalid = true;
> >+ else
> >+ error = 0;
> >+ start = tmp;
> >+ if (prev && start < prev->vm_end)
> >+ start = prev->vm_end;
> >+ if (start >= end)
> >+ break;
> >+
> >+ vma = prev->vm_next;
> >+ if (!vma)
> >+ break;
> >+ }
> >+out:
> >+ up_write(&current->mm->mmap_sem);
> >+ return invalid ? 1 : error;
> >+}
> >+/*
> >+ * Return -ENOMEM, interrupting the range operation, if memory is not
> >+ * enough to merge/split vmas.
> >+ * Return 1 if any part of the range was purged, otherwise return 0.
> >+ */
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+ unsigned long end, tmp;
> >+ struct vm_area_struct *vma, *prev;
> >+ int ret, error = -EINVAL;
> >+ bool is_purged = false;
> >+
> >+ down_write(&current->mm->mmap_sem);
> >+ if (start & ~PAGE_MASK)
> >+ goto out;
> >+
> >+ len &= PAGE_MASK;
> >+ if (!len)
> >+ goto out;
> >+
> >+ end = start + len;
> >+ if (end < start)
> >+ goto out;
> >+
> >+ vma = find_vma_prev(current->mm, start, &prev);
> >+ if (!vma)
> >+ goto out;
> >+
> >+ if (start > vma->vm_start)
> >+ prev = vma;
> >+
> >+ for (;;) {
> >+ /* Here start < (end|vma->vm_end). */
> >+ if (start < vma->vm_start) {
> >+ start = vma->vm_start;
> >+ if (start >= end)
> >+ goto out;
> >+ }
> >+
> >+ /* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+ tmp = vma->vm_end;
> >+ if (end < tmp)
> >+ tmp = end;
> >+
> >+ /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+ error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> >+ if (error) {
> >+ WARN_ON(error != -ENOMEM);
> >+ goto out;
> >+ }
> >+ start = tmp;
> >+ if (prev && start < prev->vm_end)
> >+ start = prev->vm_end;
> >+ if (start >= end)
> >+ break;
> >+
> >+ vma = prev->vm_next;
> >+ if (!vma)
> >+ break;
> >+ }
> >+out:
> >+ up_write(&current->mm->mmap_sem);
> >+
> >+ if (error)
> >+ ret = error;
> >+ else if (is_purged)
> >+ ret = PURGED;
> >+ else
> >+ ret = NO_PURGED;
> >+
> >+ return ret;
> >+}
> >+#endif
> >diff --git a/mm/rmap.c b/mm/rmap.c
> >index 2ee1ef0..402d9da 100644
> >--- a/mm/rmap.c
> >+++ b/mm/rmap.c
> >@@ -57,6 +57,7 @@
> > #include <linux/migrate.h>
> > #include <linux/hugetlb.h>
> > #include <linux/backing-dev.h>
> >+#include <linux/mvolatile.h>
> > #include <asm/tlbflush.h>
> >@@ -308,6 +309,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > vma->anon_vma = anon_vma;
> > anon_vma_lock(anon_vma);
> > anon_vma_chain_link(vma, avc, anon_vma);
> >+ vma_purge_copy(vma, pvma);
> > anon_vma_unlock(anon_vma);
> > return 0;
>
--
Kind regards,
Minchan Kim