Message-ID: <4df04b840701222108o6992933bied5fff8a525413@mail.gmail.com>
Date:	Tue, 23 Jan 2007 13:08:25 +0800
From:	"yunfeng zhang" <zyf.zeroos@...il.com>
To:	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

I've re-coded my patch with tab = 8. Sorry!

       Signed-off-by: Yunfeng Zhang <zyf.zeroos@...il.com>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.000000000 +0800
@@ -0,0 +1,236 @@
+                         Pure Private Page System (pps)
+                              zyf.zeroos@...il.com
+                              December 24-26, 2006
+
+// Purpose <([{
+This file documents an idea that was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of
+my OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch described here enhances the performance of the Linux swap subsystem.
+You can find an overview of the idea in section <How to Reclaim Pages more
+Efficiently> and how I patched it into Linux 2.6.19 in section <Pure Private
+Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from an overall view of the design: when you look down
+from a manager's viewpoint, you free yourself from disordered code and spot
+problems immediately.
+
+In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+You might expect Page/PTE to be placed in the 3rd layer but, here, it is
+placed in the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+to deploy and implement the swap subsystem on the 2nd layer.
+
+Undoubtedly, this design has some virtues
+1) The SwapDaemon can collect statistics on how processes access their pages
+   and use them to unmap ptes. SMP especially benefits from this, since we
+   can use flush_tlb_range to unmap ptes in batches rather than sending a TLB
+   IPI per page as the current Linux legacy swap subsystem does.
+2) Page faults can issue better readahead requests, since the history data
+   shows that all related pages have conglomerating affinity. In contrast,
+   Linux reads ahead the pages adjacent to the SwapSpace position of the
+   current page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works from the top down. In fact, the Linux legacy swap
+   subsystem is maybe the only one that works from the bottom up.
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system built on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I mentioned in the previous section, applying my idea perfectly requires
+uprooting the page-centered swap subsystem and migrating it onto the VMA
+layer, but a huge gap has defeated me -- active_list and inactive_list. In
+fact, you can find lru_add_active code almost anywhere ... It's IMPOSSIBLE
+for me to complete that alone. It's also the difference between my design and
+Linux: in my OS, a page is totally the charge of its new owner, while in
+Linux the page management system keeps tracing it through the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycle
+system rooted in the Linux legacy page system -- pps: intercept all private
+pages belonging to PrivateVMAs into pps, then use pps to recycle them. By the
+way, the whole job consists of two parts: here is the first,
+PrivateVMA-oriented; the other is SharedVMA-oriented (it should be called
+SPS) and is scheduled for the future. Of course, once both are done, the
+Linux legacy page system will be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole procedure is divided into six stages -- <Stage Definition>.
+PPS uses init_mm::mm_list to enumerate all swappable UserSpace
+(shrink_private_vma of mm/vmscan.c). The other sections cover the remaining
+aspects of pps
+1) <Data Definition> is the basic data definition.
+2) <Concurrent Racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> shows how private pages enter and leave pps.
+4) <VMA Lifecycle of pps> shows which VMAs belong to pps.
+5) <Others about pps> covers the new daemon thread kppsd, pps statistics, etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb -- which is described
+in section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert the flushing tasks into the timer interrupt routine
+and get TLB flushing free of charge.
+
+The trick is implemented as follows
+1) TLB flushing tasks are added in fill_in_tlb_tasks of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by the other CPUs to
+   execute flushing tasks.
+3) All data are defined in include/linux/mm.h.
+4) dftlb is done in stages 1 and 2 of vmscan.c:shrink_pvma_scan_ptes.
+
+dftlb has restrictions; the following conditions must be met
+1) an atomic cmpxchg instruction.
+2) the access bit is set atomically when a CPU first touches a pte.
+3) On some architectures the vma parameter of flush_tlb_range may matter; if
+   it does, don't use dftlb, since the vma of a TLB flushing task may already
+   be gone when a CPU starts to execute the task in its timer interrupt.
+If a condition can't be met, combine stage 1 with stage 2 and send the IPI
+immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+other CPUs are working on it.
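+
+A minimal sketch of the dftlb flow, using the helpers this patch adds to
+mm/vmscan.c (the fragment is illustrative, not extra patch code; "len" is a
+placeholder for the series length in bytes):
+
+	start_tlb_tasks(mm);	/* claim a slot in delay_tlb_tasks[] */
+	...
+	/* stage 1: after clearing the access bits of a pte series, queue
+	 * the range instead of sending an IPI right away. */
+	fill_in_tlb_tasks(vma, addr, addr + len);
+	...
+	end_tlb_tasks();	/* publish mm->cpu_vm_mask; each CPU then
+				 * clears its own bit from its timer tick
+				 * via timer_flush_tlb_tasks(NULL). */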
+// }])>
+
+// Stage Definition <([{
+The whole procedure of private page page-out is divided into six stages in
+shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar ptes/pages into
+a series.
+1) PTE to untouched PTE (clear the access bit), append flushing tasks to dftlb.
+---) Other CPUs execute the flushing tasks in their timer interrupts.
+2) Resuming from 1, convert untouched PTEs to UnmappedPTEs (cmpxchg).
+3) Link a SwapEntry to the PrivatePage of every UnmappedPTE.
+4) Flush each PrivatePage to its disk SwapPage.
+5) Reclaim the page and shift the UnmappedPTE to a SwappedPTE.
+6) SwappedPTE stage (null operation).
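+
+A minimal sketch of the stage 1->2 transition (cf. shrink_pvma_scan_ptes of
+mm/vmscan.c; the fragment is illustrative): because stage 1 cleared the
+access bit and dftlb flushed the TLBs, a cmpxchg on the pte low word can only
+succeed if nobody touched the page in between
+
+	pte_t pte_unmapped = orig_pte;
+	pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+	pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+	if (cmpxchg(&pte->pte_low, orig_pte.pte_low,
+		    pte_unmapped.pte_low) == orig_pte.pte_low) {
+		/* The page is now invisible to the CPU but still
+		 * referenced by the pte, see <Data Definition>. */
+		page_remove_rmap(page, vma);
+	}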
+// }])>
+
+// Data Definition <([{
+A new VMA flag (VM_PURE_PRIVATE) is appended to the VMA flags in
+include/linux/mm.h.
+
+A new PTE type (UnmappedPTE) is appended to the PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE type has a useful feature: it keeps a link to its PrivatePage
+while preventing the page from being visited by the CPU, so you can use it
+as a middle state in <Stage Definition>.
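+
+For instance, with _PAGE_PRESENT clear, the patch encodes the PTE type in
+bits 5-6 of pte_low (see the include/asm-i386/pgtable.h hunk below), so every
+non-present pte is classified as exactly one of
+
+	_PAGE_UNMAPPED	0x020	/* UnmappedPTE: still holds its page */
+	_PAGE_SWAPPED	0x040	/* SwappedPTE: holds a swap entry    */
+	_PAGE_FILE	0x060	/* nonlinear file mapping            */
+
+e.g. pte_unmapped(pte) tests ((pte).pte_low & 0x60) == _PAGE_UNMAPPED.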
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances; during the scanning and reclamation it read-locks
+mm_struct::mmap_sem, which brings some potential concurrent racers
+1) mm/swapfile.c pps_swapoff    (swapoff API)
+2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
+   do_swap_page (page fault)
+3) mm/memory.c   get_user_pages (sometimes the core needs to share a
+   PrivatePage with us)
+
+pps defines no new lock order, that is, it complies with the Linux lock
+order.
+// }])>
+
+// Others about pps <([{
+A new kernel thread -- kppsd -- is introduced in mm/vmscan.c; its task is to
+execute the stages of pps periodically. Note that an appropriate timeout is
+necessary to give applications a chance to re-map a PrivatePage back from
+UnmappedPTE to PTE, that is, to show their conglomerating affinity.
+
+kppsd is controlled by new fields -- scan_control::may_reclaim/reclaim_node.
+may_reclaim = 1 means reclamation (stage 5) is allowed. reclaim_node = (node
+number) is used when a memory node is low. Callers should set them in
+wakeup_sc, then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd runs
+due to a timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). The other
+legacy fields still in use are gfp_mask, may_writepage and may_swap.
+
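+A short usage sketch (this mirrors what vmscan.c:balance_pgdat does when a
+node is low; all names are from this patch):
+
+	wakeup_sc = sc;
+	wakeup_sc.may_reclaim = 1;		/* allow stage 5 */
+	wakeup_sc.reclaim_node = pgdat->node_id; /* reclaim this node only */
+	wake_up_interruptible(&kppsd_wait);
+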
+PPS statistics are appended to the /proc/meminfo entry; their prototype is
+in include/linux/mm.h.
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PTE or UnmappedPTE. Note, the Linux fork API can potentially make a
+PrivatePage shared by multiple processes, so such pages are excluded from
+pps.
+
+IN (NOTE, when a pure private page enters pps, it's also trimmed from the
+Linux legacy page system by commenting out the lru_cache_add_active clause)
+1) fs/exec.c    install_arg_pages    (argument pages)
+2) mm/memory    do_anonymous_page, do_wp_page, do_swap_page    (page fault)
+3) mm/swap_state.c    read_swap_cache_async    (swap pages)
+
+OUT
+1) mm/vmscan.c  shrink_pvma_scan_ptes   (stage 5, reclaim a private page)
+2) mm/memory    zap_pte_range           (free a page)
+3) kernel/fork.c    dup_mmap            (if someone uses fork, migrate all pps
+   pages back to let the Linux legacy page system manage them)
+
+When a pure private page is in pps, it can be visited simultaneously by
+page-fault and SwapDaemon.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, it's OR-ed with a new flag -- VM_PURE_PRIVATE
+in memory.c:enter_pps, where you can also find which VMAs qualify for pps.
+The flag is used mainly in shrink_private_vma of mm/vmscan.c. The other
+fields are left untouched.
+
+IN.
+1) fs/exec.c    setup_arg_pages         (StackVMA)
+2) mm/mmap.c    do_mmap_pgoff, do_brk   (DataVMA)
+3) mm/mmap.c    split_vma, copy_vma     (in some cases we need to copy a VMA
+   from an existing VMA)
+
+OUT.
+1) kernel/fork.c   dup_mmap               (if someone uses fork, return the
+   vma to the Linux legacy system)
+2) mm/mmap.c       remove_vma, vma_adjust (destroy VMA)
+3) mm/mmap.c       do_mmap_pgoff          (delete VMA when some errors occur)
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap;
+that is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// Postscript <([{
+Note, some circumstances aren't tested due to hardware restrictions, e.g. SMP
+dftlb, so there is no guarantee for my dftlb code or EVEN for the idea.
+
+Here are some possible improvements to pps
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and (PrivatePage, DiskSwapPage) -- which is described in my
+   OS and in the Linux kernel mailing list hyperlink above. The current Linux
+   core supports a trick -- COW on a PrivatePage -- used by the fork API. The
+   API should be used rarely; the POSIX thread library and vfork/execve are
+   enough for applications. As it stands, fork potentially makes a
+   PrivatePage shared, so I think COW is unnecessary for Linux; do
+   copy-on-calling if someone really needs it. If you agree, you will find
+   that UnmappedPTE + PrivatePage IS the swap cache of Linux, and that
+   swap_info_struct::swap_map should be a bitmap rather than a (short int)
+   map. So using the Linux legacy SwapCache in my pps is a compromise; that's
+   why my patch is called pps -- pure private (page) system.
+2) SwapSpace should provide more flexible interfaces. shrink_pvma_scan_ptes
+   needs to allocate swap entries in batch -- exactly, to allocate a batch of
+   fake continual swap entries, see memory.c:pps_swapin_readahead. In fact,
+   the interface should be overloaded so that a swap file has a different
+   strategy from a swap partition.
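+
+A sketch of the bitmap idea from item 1 (hypothetical; NOT part of this
+patch): with a strict one-to-one mapping there is no share count to track,
+so one bit per swap page would do
+
+	/* swap_info_struct::swap_map as a bitmap, one bit per swap page */
+	unsigned long *swap_map;
+	/* allocate: test_and_set_bit(offset, si->swap_map) == 0
+	 * free:     clear_bit(offset, si->swap_map)
+	 * in use:   test_bit(offset, si->swap_map) */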
+
+Even if the Linux kernel group can't schedule a rewrite of their memory
+code, pps is maybe the best solution so far.
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: linux-2.6.19/fs/exec.c
===================================================================
--- linux-2.6.19.orig/fs/exec.c	2007-01-22 13:58:30.000000000 +0800
+++ linux-2.6.19/fs/exec.c	2007-01-23 11:32:30.000000000 +0800
@@ -321,10 +321,11 @@
 		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
+	atomic_inc(&pps_info.total);
+	atomic_inc(&pps_info.pte_count);
 	inc_mm_counter(mm, anon_rss);
-	lru_cache_add_active(page);
-	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
-					page, vma->vm_page_prot))));
+	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(page,
+			    vma->vm_page_prot))));
 	page_add_new_anon_rmap(page, vma, address);
 	pte_unmap_unlock(pte, ptl);

@@ -437,6 +438,7 @@
 			kmem_cache_free(vm_area_cachep, mpnt);
 			return ret;
 		}
+		enter_pps(mm, mpnt);
 		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
 	}

Index: linux-2.6.19/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.19.orig/fs/proc/proc_misc.c	2007-01-22 13:58:31.000000000 +0800
+++ linux-2.6.19/fs/proc/proc_misc.c	2007-01-22 14:00:00.000000000 +0800
@@ -181,7 +181,11 @@
 		"Committed_AS: %8lu kB\n"
 		"VmallocTotal: %8lu kB\n"
 		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"VmallocChunk: %8lu kB\n"
+		"PPS Total:    %8d kB\n"
+		"PPS PTE:      %8d kB\n"
+		"PPS Unmapped: %8d kB\n"
+		"PPS Swapped:  %8d kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -212,7 +216,11 @@
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
-		vmi.largest_chunk >> 10
+		vmi.largest_chunk >> 10,
+		K(pps_info.total.counter),
+		K(pps_info.pte_count.counter),
+		K(pps_info.unmapped_count.counter),
+		K(pps_info.swapped_count.counter)
 		);

 		len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.19/include/asm-i386/mmu_context.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/mmu_context.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/mmu_context.h	2007-01-23 11:43:00.000000000 +0800
@@ -32,6 +32,10 @@
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
 #ifdef CONFIG_SMP
+		// vmscan.c::end_tlb_tasks may have copied cpu_vm_mask before
+		// we leave prev, so flush any trace of prev left in
+		// delay_tlb_tasks.
+		timer_flush_tlb_tasks(NULL);
 		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
 		per_cpu(cpu_tlbstate, cpu).active_mm = next;
 #endif
Index: linux-2.6.19/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable-2level.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable-2level.h	2007-01-23 12:50:09.905950872 +0800
@@ -48,21 +48,22 @@
 }

 /*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset
  * into this range:
  */
-#define PTE_FILE_MAX_BITS	29
+#define PTE_FILE_MAX_BITS	28

 #define pte_to_pgoff(pte) \
-	((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+	((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 ))

 #define pgoff_to_pte(off) \
-	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+	((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE })

 /* Encode and de-code a swap entry */
-#define __swp_type(x)			(((x).val >> 1) & 0x1f)
+#define __swp_type(x)			(((x).val >> 1) & 0xf)
 #define __swp_offset(x)			((x).val >> 8)
-#define __swp_entry(type, offset)	((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
+#define __swp_entry(type, offset)	((swp_entry_t) { ((type & 0xf) << 1) |\
+	((offset) << 8) | _PAGE_SWAPPED })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

Index: linux-2.6.19/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable.h	2007-01-23 11:47:00.775687672 +0800
@@ -121,7 +121,11 @@
 #define _PAGE_UNUSED3	0x800

 /* If _PAGE_PRESENT is clear, we use these: */
-#define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
+#define _PAGE_UNMAPPED	0x020	/* a special PTE type which holds its page
+				   reference even while unmapped; see
+				   Documentation/vm_pps.txt. */
+#define _PAGE_SWAPPED	0x040	/* swapped PTE. */
+#define _PAGE_FILE	0x060	/* nonlinear file mapping, saved PTE */
 #define _PAGE_PROTNONE	0x080	/* if the user mapped it with PROT_NONE;
 				   pte_present gives true */
 #ifdef CONFIG_X86_PAE
@@ -227,7 +231,12 @@
 /*
  * The following only works if pte_present() is not true.
  */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60) == _PAGE_FILE; }

 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
Index: linux-2.6.19/include/linux/mm.h
===================================================================
--- linux-2.6.19.orig/include/linux/mm.h	2007-01-22 13:58:34.000000000 +0800
+++ linux-2.6.19/include/linux/mm.h	2007-01-23 12:27:56.171419760 +0800
@@ -168,6 +168,9 @@
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_PURE_PRIVATE	0x04000000	/* The vma belongs to only one mm;
+					   see Documentation/vm_pps.txt */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1166,5 +1169,33 @@

 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

+struct pps_info {
+	atomic_t total;
+	atomic_t pte_count; // stage 1 and 2.
+	atomic_t unmapped_count; // stage 3 and 4.
+	atomic_t swapped_count; // stage 6.
+};
+extern struct pps_info pps_info;
+
+/* vmscan.c::delay flush TLB */
+struct delay_tlb_task
+{
+	struct mm_struct* mm;
+	cpumask_t cpu_mask;
+	struct vm_area_struct* vma[32];
+	unsigned long start[32];
+	unsigned long end[32];
+};
+extern struct delay_tlb_task delay_tlb_tasks[32];
+
+// The prototype of this function matches the "func" parameter of "int
+// smp_call_function (void (*func) (void *info), void *info, int retry, int
+// wait);" in include/linux/smp.h of 2.6.16.29. Call it with NULL.
+void timer_flush_tlb_tasks(void* data /* = NULL */);
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+
+#define MAX_SERIES_LENGTH 8
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.19/include/linux/swapops.h
===================================================================
--- linux-2.6.19.orig/include/linux/swapops.h	2006-11-30 05:57:37.000000000 +0800
+++ linux-2.6.19/include/linux/swapops.h	2007-01-22 14:00:00.000000000 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;

-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;

 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }

Index: linux-2.6.19/kernel/fork.c
===================================================================
--- linux-2.6.19.orig/kernel/fork.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/fork.c	2007-01-22 14:00:00.000000000 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.19/kernel/timer.c
===================================================================
--- linux-2.6.19.orig/kernel/timer.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/timer.c	2007-01-22 14:00:00.000000000 +0800
@@ -1115,6 +1115,10 @@
 		rcu_check_callbacks(cpu, user_tick);
 	scheduler_tick();
  	run_posix_cpu_timers(p);
+
+#ifdef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
 }

 /*
Index: linux-2.6.19/mm/fremap.c
===================================================================
--- linux-2.6.19.orig/mm/fremap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/fremap.c	2007-01-22 14:00:00.000000000 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.19/mm/memory.c
===================================================================
--- linux-2.6.19.orig/mm/memory.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/memory.c	2007-01-23 12:47:12.000000000 +0800
@@ -435,7 +435,7 @@

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
-		if (!pte_file(pte)) {
+		if (pte_swapped(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			swap_duplicate(entry);
@@ -628,6 +628,9 @@
 	spinlock_t *ptl;
 	int file_rss = 0;
 	int anon_rss = 0;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;

 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -672,6 +675,13 @@
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				if (page != ZERO_PAGE(addr)) {
+					if (PageWriteback(page))
+						lru_cache_add_active(page);
+					pps_pte++;
+				}
+			}
 			if (PageAnon(page))
 				anon_rss--;
 			else {
@@ -691,12 +701,31 @@
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(ptent))
+		if (pte_unmapped(ptent)) {
+			struct page *page;
+			page = pfn_to_page(pte_pfn(ptent));
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (PageWriteback(page))
+				lru_cache_add_active(page);
+			pps_unmapped++;
+			ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+			tlb_remove_page(tlb, page);
+			anon_rss--;
+			continue;
+		}
+		if (pte_swapped(ptent)) {
+			if (vma->vm_flags & VM_PURE_PRIVATE)
+				pps_swapped++;
 			free_swap_and_cache(pte_to_swp_entry(ptent));
+		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

 	add_mm_rss(mm, file_rss, anon_rss);
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);

@@ -955,7 +984,8 @@
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
-		mark_page_accessed(page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1606,7 +1636,12 @@
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(new_page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		page_add_new_anon_rmap(new_page, vma, address);

 		/* Free the old page.. */
@@ -1975,6 +2010,85 @@
 }

 /*
+ * New readahead code, for VM_PURE_PRIVATE vmas only.
+ */
+static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct
+	vm_area_struct *vma, pte_t* pte, pmd_t* pmd)
+{
+	struct page* page;
+	pte_t *prev, *next;
+	swp_entry_t temp;
+	spinlock_t* ptl = pte_lockptr(vma->vm_mm, pmd);
+	int swapType = swp_type(entry);
+	int swapOffset = swp_offset(entry);
+	int readahead = 1, abs;
+
+	if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+		swapin_readahead(entry, addr, vma);
+		return;
+	}
+
+	page = read_swap_cache_async(entry, vma, addr);
+	if (!page)
+		return;
+	page_cache_release(page);
+
+	// read ahead the whole series, first forward then backward.
+	while (readahead < MAX_SERIES_LENGTH) {
+		next = pte++;
+		if (next - (pte_t*) pmd >= PTRS_PER_PTE)
+			break;
+		spin_lock(ptl);
+		if (!(!pte_present(*next) && pte_swapped(*next))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*next);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+
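+	// Then scan backward, re-anchored at the faulting entry's offset.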
+	swapOffset = swp_offset(entry);
+	while (readahead < MAX_SERIES_LENGTH) {
+		prev = pte--;
+		if (prev - (pte_t*) pmd < 0)
+			break;
+		spin_lock(ptl);
+		if (!(!pte_present(*prev) && pte_swapped(*prev))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*prev);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+}
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2001,7 +2115,7 @@
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
- 		swapin_readahead(entry, address, vma);
+		pps_swapin_readahead(entry, address, vma, page_table, pmd);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
@@ -2021,7 +2135,8 @@
 	}

 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
-	mark_page_accessed(page);
+	if (!(vma->vm_flags & VM_PURE_PRIVATE))
+		mark_page_accessed(page);
 	lock_page(page);

 	/*
@@ -2033,6 +2148,10 @@

 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
+		if (vma->vm_flags & VM_PURE_PRIVATE) {
+			lru_cache_add_active(page);
+			mark_page_accessed(page);
+		}
 		goto out_nomap;
 	}

@@ -2053,6 +2172,11 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		atomic_dec(&pps_info.swapped_count);
+		atomic_inc(&pps_info.total);
+		atomic_inc(&pps_info.pte_count);
+	}

 	if (write_access) {
 		if (do_wp_page(mm, vma, address,
@@ -2104,8 +2228,13 @@
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
@@ -2392,6 +2521,22 @@

 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
+		if (pte_unmapped(entry)) {
+			struct page* page = pte_page(entry);
+			pte_t temp_pte = mk_pte(page, vma->vm_page_prot);
+			BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE));
+			atomic_dec(&pps_info.unmapped_count);
+			atomic_inc(&pps_info.pte_count);
+			pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+			if (likely(pte_same(*pte, entry))) {
+				page_add_new_anon_rmap(page, vma, address);
+				set_pte_at(mm, address, pte, temp_pte);
+				update_mmu_cache(vma, address, temp_pte);
+				lazy_mmu_prot_update(temp_pte);
+			}
+			pte_unmap_unlock(pte, ptl);
+			return VM_FAULT_MINOR;
+		}
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (vma->vm_ops->nopage)
@@ -2685,3 +2830,118 @@

 	return buf - old_buf;
 }
+
+static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	struct page* page;
+	pte_t entry;
+	pte_t *pte;
+	spinlock_t* ptl;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		if (!pte_present(*pte) && pte_unmapped(*pte)) {
+			page = pte_page(*pte);
+			entry = mk_pte(page, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			set_pte_at(mm, addr, pte, entry);
+			BUG_ON(page == ZERO_PAGE(addr));
+			page_add_new_anon_rmap(page, vma, addr);
+			lru_cache_add_active(page);
+			pps_unmapped++;
+		} else if (pte_present(*pte)) {
+			page = pte_page(*pte);
+			if (page == ZERO_PAGE(addr))
+				continue;
+			lru_cache_add_active(page);
+			pps_pte++;
+		} else if (!pte_present(*pte) && pte_swapped(*pte))
+			pps_swapped++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	lru_add_drain();
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
+}
+
+static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		migrate_back_pte_range(mm, pmd, vma, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		migrate_back_pmd_range(mm, pud, vma, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+// Migrate all pages of a pure private vma back to Linux legacy memory
+// management.
+static void migrate_back_legacy_linux(struct mm_struct* mm,
+		struct vm_area_struct* vma)
+{
+	pgd_t* pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		migrate_back_pud_range(mm, pgd, vma, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
+{
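+	// Accept only anonymous private mappings: any flag outside this mask
+	// (e.g. VM_SHARED, VM_IO, VM_HUGETLB) disqualifies the vma.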
+	int condition = VM_READ | VM_WRITE | VM_EXEC | \
+		 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
+		 VM_GROWSDOWN | VM_GROWSUP | \
+		 VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | \
+		 VM_ACCOUNT | VM_PURE_PRIVATE;
+	if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
+		vma->vm_flags |= VM_PURE_PRIVATE;
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+	}
+}
+
+void leave_pps(struct vm_area_struct* vma, int migrate_flag)
+{
+	struct mm_struct* mm = vma->vm_mm;
+
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		vma->vm_flags &= ~VM_PURE_PRIVATE;
+		if (migrate_flag)
+			migrate_back_legacy_linux(mm, vma);
+	}
+}
Index: linux-2.6.19/mm/mmap.c
===================================================================
--- linux-2.6.19.orig/mm/mmap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/mmap.c	2007-01-22 14:00:00.000000000 +0800
@@ -229,6 +229,7 @@
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -620,6 +621,7 @@
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
+		leave_pps(next, 0);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1112,6 +1114,8 @@
 	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
 		vma->vm_flags &= ~VM_ACCOUNT;

+	enter_pps(mm, vma);
+
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
@@ -1138,6 +1142,7 @@
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
+		leave_pps(vma, 0);
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1165,6 +1170,7 @@
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
@@ -1742,6 +1748,10 @@

 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
+	if (new->vm_flags & VM_PURE_PRIVATE) {
+		new->vm_flags &= ~VM_PURE_PRIVATE;
+		enter_pps(mm, new);
+	}

 	if (new_below)
 		new->vm_end = addr;
@@ -1950,6 +1960,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	enter_pps(mm, vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2073,6 +2084,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			if (new_vma->vm_flags & VM_PURE_PRIVATE) {
+				new_vma->vm_flags &= ~VM_PURE_PRIVATE;
+				enter_pps(mm, new_vma);
+			}
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		}
 	}
Index: linux-2.6.19/mm/rmap.c
===================================================================
--- linux-2.6.19.orig/mm/rmap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/rmap.c	2007-01-22 14:00:00.000000000 +0800
@@ -618,6 +618,7 @@
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;

+	BUG_ON(vma->vm_flags & VM_PURE_PRIVATE);
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
@@ -676,7 +677,7 @@
 #endif
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-		BUG_ON(pte_file(*pte));
+		BUG_ON(!pte_swapped(*pte));
 	} else
 #ifdef CONFIG_MIGRATION
 	if (migration) {
Index: linux-2.6.19/mm/swap_state.c
===================================================================
--- linux-2.6.19.orig/mm/swap_state.c	2006-11-30 05:57:37.000000000 +0800
+++ linux-2.6.19/mm/swap_state.c	2007-01-22 14:00:00.000000000 +0800
@@ -354,7 +354,8 @@
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			if (vma == NULL || !(vma->vm_flags & VM_PURE_PRIVATE))
+				lru_cache_add_active(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.19/mm/swapfile.c
===================================================================
--- linux-2.6.19.orig/mm/swapfile.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-01-23 12:31:38.000000000 +0800
@@ -501,6 +501,166 @@
 }
 #endif

+static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int
+		type, struct page** ret_page)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	swp_entry_t entry;
+	struct page* page;
+
+	spin_lock(ptl);
+	if (!pte_present(*pte) && pte_swapped(*pte)) {
+		entry = pte_to_swp_entry(*pte);
+		if (swp_type(entry) == type) {
+			*ret_page = NULL;
+			spin_unlock(ptl);
+			return 1;
+		}
+	} else {
+		page = pfn_to_page(pte_pfn(*pte));
+		if (PageSwapCache(page)) {
+			entry.val = page_private(page);
+			if (swp_type(entry) == type) {
+				page_cache_get(page);
+				*ret_page = page;
+				spin_unlock(ptl);
+				return 1;
+			}
+		}
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+
+static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct*
+		vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type)
+{
+	pte_t *pte;
+	struct page* page;
+
+	pte = pte_offset_map(pmd, addr);
+	do {
+		while (pps_test_swap_type(mm, pmd, pte, type, &page)) {
+			if (page == NULL) {
+				switch (__handle_mm_fault(mm, vma, addr, 0)) {
+				case VM_FAULT_SIGBUS:
+				case VM_FAULT_OOM:
+					return -ENOMEM;
+				case VM_FAULT_MINOR:
+				case VM_FAULT_MAJOR:
+					break;
+				default:
+					BUG();
+				}
+			} else {
+				wait_on_page_locked(page);
+				wait_on_page_writeback(page);
+				lock_page(page);
+				if (!PageSwapCache(page)) {
+					unlock_page(page);
+					page_cache_release(page);
+					break;
+				}
+				wait_on_page_writeback(page);
+				delete_from_swap_cache(page);
+				unlock_page(page);
+				page_cache_release(page);
+				break;
+			}
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pud_t* pud, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, int type)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff(int type)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	int ret = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_PURE_PRIVATE))
+				continue;
+			if (vma->vm_flags & VM_LOCKED)
+				continue;
+			ret = pps_swapoff_pgd_range(mm, vma, type);
+			if (ret == -ENOMEM)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return ret;
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -694,6 +854,12 @@
 	int reset_overflow = 0;
 	int shmem;

+	// First, read all pps pages back; note, it's a one-to-one mapping.
+	retval = pps_swapoff(type);
+	if (retval == -ENOMEM) // something went wrong.
+		return -ENOMEM;
+	// Now the remaining pages are shared pages; go ahead!
+
 	/*
 	 * When searching mms for an entry, a good strategy is to
 	 * start at the first mm we freed the previous entry from
@@ -914,16 +1080,20 @@
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	// struct list_head *p, *next;
 	unsigned int i;

 	for (i = 0; i < nr_swapfiles; i++)
 		if (swap_info[i].inuse_pages)
 			return;
+	/*
+	 * Now the init_mm.mmlist list is used not only by SwapDevice but also
+	 * by PPS, see Documentation/vm_pps.txt.
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
+	*/
 }

 /*
Index: linux-2.6.19/mm/vmscan.c
===================================================================
--- linux-2.6.19.orig/mm/vmscan.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-01-23 12:39:48.000000000 +0800
@@ -66,6 +66,10 @@
 	int swappiness;

 	int all_unreclaimable;
+
+	/* pps control command. See Documentation/vm_pps.txt. */
+	int may_reclaim;
+	int reclaim_node;
 };

 /*
@@ -1097,6 +1101,443 @@
 	return ret;
 }

+// pps fields.
+static wait_queue_head_t kppsd_wait;
+static struct scan_control wakeup_sc;
+struct pps_info pps_info = {
+	.total = ATOMIC_INIT(0),
+	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
+	.unmapped_count = ATOMIC_INIT(0), // stage 3 and 4.
+	.swapped_count = ATOMIC_INIT(0) // stage 6.
+};
+// pps end.
+
+struct series_t {
+	pte_t orig_ptes[MAX_SERIES_LENGTH];
+	pte_t* ptes[MAX_SERIES_LENGTH];
+	struct page* pages[MAX_SERIES_LENGTH];
+	int series_length;
+	int series_stage;
+} series;
+
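+// Map a pte to its current stage (cf. <Stage Definition> in
+// Documentation/vm_pps.txt): 1/2 = present pte (young/old), 3/4/5 =
+// UnmappedPTE (no swap entry / dirty or writeback swap page / clean swap
+// page), 6 = SwappedPTE, 7 = ZERO_PAGE, which is excluded from pps.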
+static int get_series_stage(pte_t* pte, int index)
+{
+	series.orig_ptes[index] = *pte;
+	series.ptes[index] = pte;
+	if (pte_present(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (page == ZERO_PAGE(addr)) // the reserved page is excluded from pps.
+			return 7;
+		if (pte_young(series.orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (!PageSwapCache(page))
+			return 3;
+		else {
+			if (PageWriteback(page) || PageDirty(page))
+				return 4;
+			else
+				return 5;
+		}
+	} else // pte_swapped -- SwappedPTE
+		return 6;
+}
+
+static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage((*start)++, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++,
+		*addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(*start, i))
+			break;
+	}
+	series.series_stage = series_stage;
+	series.series_length = i;
+}
+
+struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
+
+void timer_flush_tlb_tasks(void* data)
+{
+	int i;
+#ifdef CONFIG_X86
+	int flag = 0;
+#endif
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL &&
+				cpu_isset(smp_processor_id(),
+				    delay_tlb_tasks[i].mm->cpu_vm_mask) &&
+				cpu_isset(smp_processor_id(),
+				    delay_tlb_tasks[i].cpu_mask)) {
+#ifdef CONFIG_X86
+			flag = 1;
+#else
+			// smp::local_flush_tlb_range(delay_tlb_tasks[i]);
+#endif
+			cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask);
+		}
+	}
+#ifdef CONFIG_X86
+	if (flag)
+		local_flush_tlb();
+#endif
+}
+
+static struct delay_tlb_task* delay_task = NULL;
+static int vma_index = 0;
+
+static struct delay_tlb_task* search_free_tlb_tasks_slot(void)
+{
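+	// A slot is free if its mm is NULL, or if every CPU has already
+	// flushed it (cpu_mask emptied by timer_flush_tlb_tasks).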
+	struct delay_tlb_task* ret = NULL;
+	int i;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+				ret = &delay_tlb_tasks[i];
+			}
+		} else
+			ret = &delay_tlb_tasks[i];
+	}
+	if (!ret) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	return ret;
+}
+
+static void init_delay_task(struct mm_struct* mm)
+{
+	cpus_clear(delay_task->cpu_mask);
+	vma_index = 0;
+	delay_task->mm = mm;
+}
+
+/*
+ * We will be working on the mm, so let's force to flush it if necessary.
+ */
+static void start_tlb_tasks(struct mm_struct* mm)
+{
+	int i, flag = 0;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm == mm) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+			} else
+				flag = 1;
+		}
+	}
+	if (flag) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	BUG_ON(delay_task != NULL);
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+}
+
+static void end_tlb_tasks(void)
+{
+	atomic_inc(&delay_task->mm->mm_users);
+	delay_task->cpu_mask = delay_task->mm->cpu_vm_mask;
+	delay_task = NULL;
+#ifndef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr,
+		unsigned long end)
+{
+	struct mm_struct* mm;
+	// First, try to combine the task with the previous.
+	if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma &&
+			delay_task->end[vma_index - 1] == addr) {
+		delay_task->end[vma_index - 1] = end;
+		return;
+	}
+fill_it:
+	if (vma_index != 32) {
+		delay_task->vma[vma_index] = vma;
+		delay_task->start[vma_index] = addr;
+		delay_task->end[vma_index] = end;
+		vma_index++;
+		return;
+	}
+	mm = delay_task->mm;
+	end_tlb_tasks();
+
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+	goto fill_it;
+}
+
+static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr,
+		unsigned long end)
+{
+	int i, statistic;
+	unsigned long start;	// series base; find_series advances addr.
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t* pte = pte_offset_map(pmd, addr);
+	int anon_rss = 0;
+	struct pagevec freed_pvec;
+	int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO));
+	struct address_space* mapping = &swapper_space;
+
+	pagevec_init(&freed_pvec, 1);
+	do {
+		memset(&series, 0, sizeof(struct series_t));
+		start = addr;	// remember the series base address.
+		find_series(&pte, &addr, end);
+		if (sc->may_reclaim == 0 && series.series_stage == 5)
+			continue;
+		switch (series.series_stage) {
+		case 1: // PTE -- untouched PTE.
+		for (i = 0; i < series.series_length; i++) {
+			struct page* page = series.pages[i];
+			lock_page(page);
+			spin_lock(ptl);
+			if (likely(pte_same(*series.ptes[i],
+					series.orig_ptes[i]))) {
+				if (pte_dirty(*series.ptes[i]))
+				    set_page_dirty(page);
+				set_pte_at(mm, start + i * PAGE_SIZE,
+					series.ptes[i],
+					pte_mkold(pte_mkclean(*series.ptes[i])));
+			}
+			spin_unlock(ptl);
+			unlock_page(page);
+		}
+		fill_in_tlb_tasks(vma, start, start + (PAGE_SIZE *
+			    series.series_length));
+		break;
+		case 2: // untouched PTE -- UnmappedPTE.
+		/*
+		 * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so
+		 * if it's still clear here, we can shift it to Unmapped type.
+		 *
+		 * If some architecture doesn't support atomic cmpxchg
+		 * instruction or can't atomically set the access bit after
+		 * they touch a pte at first, combine stage 1 with stage 2, and
+		 * send IPI immediately in fill_in_tlb_tasks.
+		 */
+		spin_lock(ptl);
+		statistic = 0;
+		for (i = 0; i < series.series_length; i++) {
+			if (likely(pte_same(*series.ptes[i],
+					series.orig_ptes[i]))) {
+				pte_t pte_unmapped = series.orig_ptes[i];
+				pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+				pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+				if (cmpxchg(&series.ptes[i]->pte_low,
+					    series.orig_ptes[i].pte_low,
+					    pte_unmapped.pte_low) !=
+					series.orig_ptes[i].pte_low)
+					continue;
+				page_remove_rmap(series.pages[i], vma);
+				anon_rss--;
+				statistic++;
+			}
+		}
+		atomic_add(statistic, &pps_info.unmapped_count);
+		atomic_sub(statistic, &pps_info.pte_count);
+		spin_unlock(ptl);
+		break;
+		case 3: // Attach SwapPage to PrivatePage.
+		/*
+		 * A better arithmetic should be applied to Linux SwapDevice to
+		 * allocate fake continual SwapPages which are close to each
+		 * other, the offset between two close SwapPages is less than 8.
+		 */
+		if (sc->may_swap) {
+			for (i = 0; i < series.series_length; i++) {
+				lock_page(series.pages[i]);
+				if (!PageSwapCache(series.pages[i])) {
+					if (!add_to_swap(series.pages[i],
+						    GFP_ATOMIC)) {
+						unlock_page(series.pages[i]);
+						break;
+					}
+				}
+				unlock_page(series.pages[i]);
+			}
+		}
+		break;
+		case 4: // SwapPage isn't consistent with PrivatePage.
+		/*
+		 * A mini version pageout().
+		 *
+		 * Current swap space can't commit multiple pages together:(
+		 */
+		if (sc->may_writepage && may_enter_fs) {
+			for (i = 0; i < series.series_length; i++) {
+				struct page* page = series.pages[i];
+				int res;
+
+				if (!may_write_to_queue(mapping->backing_dev_info))
+					break;
+				lock_page(page);
+				if (!PageDirty(page) || PageWriteback(page)) {
+					unlock_page(page);
+					continue;
+				}
+				clear_page_dirty_for_io(page);
+				struct writeback_control wbc = {
+					.sync_mode = WB_SYNC_NONE,
+					.nr_to_write = SWAP_CLUSTER_MAX,
+					.nonblocking = 1,
+					.for_reclaim = 1,
+				};
+				page_cache_get(page);
+				SetPageReclaim(page);
+				res = swap_writepage(page, &wbc);
+				if (res < 0) {
+					handle_write_error(mapping, page, res);
+					ClearPageReclaim(page);
+					page_cache_release(page);
+					break;
+				}
+				if (!PageWriteback(page))
+					ClearPageReclaim(page);
+				page_cache_release(page);
+			}
+		}
+		break;
+		case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
+		statistic = 0;
+		for (i = 0; i < series.series_length; i++) {
+			struct page* page = series.pages[i];
+			if (!(page_to_nid(page) == sc->reclaim_node ||
+				    sc->reclaim_node == -1))
+				continue;
+
+			lock_page(page);
+			spin_lock(ptl);
+			if (!pte_same(*series.ptes[i], series.orig_ptes[i]) ||
+					/* We're racing with get_user_pages. */
+					(PageSwapCache(page) ? page_count(page)
+					> 2 : page_count(page) > 1)) {
+				spin_unlock(ptl);
+				unlock_page(page);
+				continue;
+			}
+			statistic++;
+			swp_entry_t entry = { .val = page_private(page) };
+			swap_duplicate(entry);
+			pte_t pte_swp = swp_entry_to_pte(entry);
+			set_pte_at(mm, start + i * PAGE_SIZE,
+				series.ptes[i], pte_swp);
+			spin_unlock(ptl);
+			if (PageSwapCache(page) && !PageWriteback(page))
+				delete_from_swap_cache(page);
+			unlock_page(page);
+
+			if (!pagevec_add(&freed_pvec, page))
+				__pagevec_release_nonlru(&freed_pvec);
+		}
+		atomic_add(statistic, &pps_info.swapped_count);
+		atomic_sub(statistic, &pps_info.unmapped_count);
+		atomic_sub(statistic, &pps_info.total);
+		break;
+		case 6:
+		// NULL operation!
+		break;
+		}
+	} while (addr < end);
+	add_mm_counter(mm, anon_rss, anon_rss);
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+}
+
+static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+static void shrink_private_vma(struct scan_control* sc)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		start_tlb_tasks(mm);
+		if (down_read_trylock(&mm->mmap_sem)) {
+			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+				if (!(vma->vm_flags & VM_PURE_PRIVATE))
+					continue;
+				if (vma->vm_flags & VM_LOCKED)
+					continue;
+				shrink_pvma_pgd_range(sc, mm, vma);
+			}
+			up_read(&mm->mmap_sem);
+		}
+		end_tlb_tasks();
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1144,6 +1585,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

+	wakeup_sc = sc;
+	wakeup_sc.may_reclaim = 1;
+	wakeup_sc.reclaim_node = pgdat->node_id;
+	wake_up_interruptible(&kppsd_wait);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;

@@ -1723,3 +2169,39 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	int timeout;
+	struct scan_control default_sc;
+	DEFINE_WAIT(wait);
+
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_writepage = 1;
+	default_sc.may_swap = 1;
+	default_sc.may_reclaim = 0;
+	default_sc.reclaim_node = -1;
+
+	while (1) {
+		try_to_freeze();
+		prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
+		timeout = schedule_timeout(2000);
+		finish_wait(&kppsd_wait, &wait);
+
+		if (timeout)
+			shrink_private_vma(&wakeup_sc);
+		else
+			shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)