linux-kernel - Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

Message-ID: <000a01c7311e$ca8c4a00$ec10480a@IBMF0038A435B7>
Date:	Sat, 6 Jan 2007 07:10:28 +0800
From:	"zyf.zeroos" <zyf.zeroos@...il.com>
To:	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2.6.16.29 1/1] memory: enhance Linux swap subsystem

Test mail with my signature. The content is based on the second quilt patch (Linux 2.6.16.29); only two key files are re-sent: 1) Documentation/vm_pps.txt 2) mm/vmscan.c

Index: test.signature/Documentation/vm_pps.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ test.signature/Documentation/vm_pps.txt 2007-01-06 07:00:18.146480584 +0800
@@ -0,0 +1,214 @@
+                         Pure Private Page System (pps)
+                     Copyright by Yunfeng Zhang on GFDL 1.2
+                              zyf.zeroos@...il.com
+                              December 24-26, 2006
+
+// Purpose <([{
+This file documents an idea first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch accompanying this document enhances the performance of the Linux swap
+subsystem. You can find an overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patched it into Linux 2.6.16.29 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability: when you
+look down from a manager's view, you are relieved of disordered code and can
+spot some problems immediately.
+
+In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) PTE, zone/memory inode layer (architecture-dependent).
+Note: you might expect Page to be placed on the 3rd layer, but here it is
+placed on the 2nd layer since it's the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, this has some advantages
+1) SwapDaemon can collect statistics on how processes access pages and unmap
+   ptes accordingly. SMP especially benefits, because we can use
+   flush_tlb_range to unmap ptes in batches rather than sending a TLB IPI per
+   page as the current Linux legacy swap subsystem does.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   reads ahead the pages relative to the SwapSpace position of the current
+   page-fault page.
+3) It conforms to the POSIX madvise API family.
+
+Unfortunately, the Linux 2.6.16.29 swap subsystem is based on the 3rd layer --
+a system on zone::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As mentioned in the previous section, fully applying my idea means uprooting
+the page-centered swap subsystem and migrating it onto VMAs, but a huge gap has
+defeated me -- active_list and inactive_list. In fact, you can find
+lru_add_active code everywhere ... It's IMPOSSIBLE for me to complete it by
+myself. It's also the difference between my design and Linux: in my OS, a page
+is entirely in the charge of its new owner, whereas in Linux, the page
+management system still tracks it by the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycle
+system rooted on the Linux legacy page system -- pps, intercept all private
+pages belonging to PrivateVMA into pps, then use pps to cycle them.  By the
+way, the whole job should consist of two parts; here is the first --
+PrivateVMA-oriented (PPS); the other is SharedVMA-oriented (should be called
+SPS), scheduled for the future. Of course, if both are done, the Linux legacy
+page system will be left empty.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages in SwapDaemon mm/vmscan.c:shrink_private_vma; the whole process is
+divided into six stages -- <Stage Definition>. Other sections cover the
+remaining aspects of pps
+1) <Data Definition> is basic data definition.
+2) <Concurrent Racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter/leave pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+
+PPS uses the init_mm.mmlist list to enumerate all swappable UserSpace
+(shrink_private_vma).
+
+A new kernel thread -- kppsd is introduced in mm/vmscan.c; its task is to
+execute the stages of pps periodically. Note that an appropriate timeout is
+necessary so that an application gets a chance to re-map its PrivatePage
+from UnmappedPTE back to PTE, that is, to show its conglomeration affinity.
+The scan_control::pps_cmd field controls the behavior of kppsd; = 1
+accelerates scanning and reclaiming pages, and is used in balance_pgdat.
+
+PPS statistics are appended to the /proc/meminfo entry; their prototype is in
+include/linux/mm.h.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt routine to
+implement a free-of-charge TLB flush.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_tasks of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are defined in include/linux/mm.h.
+
+The restrictions of dftlb. The following conditions must be met
+1) the architecture has an atomic cmpxchg instruction.
+2) the CPU atomically sets the access bit when it first touches a pte.
+3) On some architectures, the vma parameter of flush_tlb_range may matter;
+   if so, don't use dftlb, since the vma of a TLB flushing task may already
+   be gone when a CPU starts to execute the task in the timer interrupt.
+If these are not met, combine stage 1 with stage 2, and send IPI immediately
+in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+another CPU works on it.
+// }])>
+
+// Stage Definition <([{
+The whole process of private page page-out is divided into six stages, as
+shown in shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar
+pages into a series.
+1) PTE to untouched PTE (access bit is cleared), append flushing tasks to dftlb.
+2) Convert untouched PTE to UnmappedPTE.
+3) Link SwapEntry to every UnmappedPTE.
+4) Flush PrivatePage of UnmappedPTE to its disk SwapPage.
+5) Reclaim the page and shift UnmappedPTE to SwappedPTE.
+6) SwappedPTE stage.
+// }])>
+
+// Data Definition <([{
+New VMA flag (VM_PURE_PRIVATE) is appended into VMA in include/linux/mm.h.
+
+New PTE type (UnmappedPTE) is appended into PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE type keeps a link to its PrivatePage while preventing the
+page from being visited by the CPU, so you can use it in <Stage Definition> as
+an intermediate state.
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances. While scanning and reclaiming, it read-locks every
+mm_struct object, which brings some potential concurrent
+racers
+1) mm/swapfile.c    pps_swapoff (swapoff API).
+2) mm/memory.c  do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page
+   (page-fault).
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap,
+which is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PTE or UnmappedPTE.
+
+IN (NOTE, when a pure private page enters pps, it's also trimmed from the
+Linux legacy page system by commenting out the lru_cache_add_active clause)
+1) fs/exec.c install_arg_pages (argument pages).
+2) mm/memory.c do_anonymous_page, do_wp_page, do_swap_page (page fault).
+3) mm/swap_state.c read_swap_cache_async (swap pages).
+
+OUT
+1) mm/vmscan.c  shrink_pvma_scan_ptes   (stage 6, reclaim a private page).
+2) mm/memory.c  zap_pte_range   (free a page).
+3) kernel/fork.c dup_mmap (if someone uses fork, migrate all pps pages
+   back to let Linux legacy page system manage them).
+
+When a pure private page is in pps, it can be visited simultaneously by
+page-fault and SwapDaemon.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, a new flag -- VM_PURE_PRIVATE is or-ed into it in
+memory.c:enter_pps, where you can also see which VMAs fit pps. The flag
+is used in shrink_private_vma of mm/vmscan.c.  Other fields are left
+untouched.
+
+IN.
+1) fs/exec.c setup_arg_pages (StackVMA).
+2) mm/mmap.c do_mmap_pgoff, do_brk (DataVMA).
+3) mm/mmap.c split_vma, copy_vma (in some cases, we need to copy a VMA from an
+   existing VMA).
+
+OUT.
+1) kernel/fork.c dup_mmap (if someone uses fork, return the vma back to
+   Linux legacy system).
+2) mm/mmap.c remove_vma, vma_adjust (destroy VMA).
+3) mm/mmap.c do_mmap_pgoff (delete VMA when some errors occur).
+// }])>
+
+// Postscript <([{
+Note, some circumstances aren't tested due to hardware restrictions, e.g. SMP
+dftlb.
+
+Here are some possible improvements to pps
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and PrivatePage (SwapPage), which is described in my OS and the
+   above hyperlink to the Linux kernel mailing list. So it's a compromise to
+   use Linux legacy SwapCache in my pps.
+2) SwapSpace should provide more flexible interfaces; shrink_pvma_scan_ptes
+   needs to allocate swap entries in batches, more exactly, a batch of fake
+   contiguous swap entries, see mm/pps_swapin_readahead.
+
+If the Linux kernel group can't schedule a rewrite of their memory code,
+however, pps may be the best solution for now.
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: test.signature/mm/vmscan.c
===================================================================
--- test.signature.orig/mm/vmscan.c 2007-01-06 07:00:11.799445480 +0800
+++ test.signature/mm/vmscan.c 2007-01-06 07:00:23.326693072 +0800
@@ -79,6 +79,9 @@
   * In this context, it doesn't matter that we scan the
   * whole list at once. */
  int swap_cluster_max;
+
+ /* pps control command, 0: do stage 1-4, kppsd only; 1: full stages. */
+ int pps_cmd;
 };
 
 /*
@@ -1514,6 +1517,428 @@
  return ret;
 }
 
+// pps fields.
+static wait_queue_head_t kppsd_wait;
+static struct scan_control wakeup_sc;
+struct pps_info pps_info = {
+ .total = ATOMIC_INIT(0),
+ .pte_count = ATOMIC_INIT(0), // stage 1 and 2.
+ .unmapped_count = ATOMIC_INIT(0), // stage 3 and 4.
+ .swapped_count = ATOMIC_INIT(0) // stage 6.
+};
+// pps end.
+
+struct series_t {
+ pte_t orig_ptes[MAX_SERIES_LENGTH];
+ pte_t* ptes[MAX_SERIES_LENGTH];
+ struct page* pages[MAX_SERIES_LENGTH];
+ int series_length;
+ int series_stage;
+} series;
+
+static int get_series_stage(pte_t* pte, int index)
+{
+ series.orig_ptes[index] = *pte;
+ series.ptes[index] = pte;
+ if (pte_present(series.orig_ptes[index])) {
+  struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+  series.pages[index] = page;
+  if (page == ZERO_PAGE(addr)) // the reserved zero page is excluded from pps.
+   return 7;
+  if (pte_young(series.orig_ptes[index])) {
+   return 1;
+  } else
+   return 2;
+ } else if (pte_unmapped(series.orig_ptes[index])) {
+  struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+  series.pages[index] = page;
+  if (!PageSwapCache(page))
+   return 3;
+  else {
+   if (PageWriteback(page) || PageDirty(page))
+    return 4;
+   else
+    return 5;
+  }
+ } else // pte_swapped -- SwappedPTE
+  return 6;
+}
+
+static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+{
+ int i;
+ int series_stage = get_series_stage((*start)++, 0);
+ *addr += PAGE_SIZE;
+
+ for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++, *addr += PAGE_SIZE) {
+  if (series_stage != get_series_stage(*start, i))
+   break;
+ }
+ series.series_stage = series_stage;
+ series.series_length = i;
+}
+
+struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
+
+void timer_flush_tlb_tasks(void* data)
+{
+ int i;
+#ifdef CONFIG_X86
+ int flag = 0;
+#endif
+ for (i = 0; i < 32; i++) {
+  if (delay_tlb_tasks[i].mm != NULL &&
+    cpu_isset(smp_processor_id(), delay_tlb_tasks[i].mm->cpu_vm_mask) &&
+    cpu_isset(smp_processor_id(), delay_tlb_tasks[i].cpu_mask)) {
+#ifdef CONFIG_X86
+   flag = 1;
+#else
+   // smp::local_flush_tlb_range(delay_tlb_tasks[i]);
+#endif
+   cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask);
+  }
+ }
+#ifdef CONFIG_X86
+ if (flag)
+  local_flush_tlb();
+#endif
+}
+
+static struct delay_tlb_task* delay_task = NULL;
+static int vma_index = 0;
+
+static struct delay_tlb_task* search_free_tlb_tasks_slot(void)
+{
+ struct delay_tlb_task* ret = NULL;
+ int i;
+again:
+ for (i = 0; i < 32; i++) {
+  if (delay_tlb_tasks[i].mm != NULL) {
+   if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+    mmput(delay_tlb_tasks[i].mm);
+    delay_tlb_tasks[i].mm = NULL;
+    ret = &delay_tlb_tasks[i];
+   }
+  } else
+   ret = &delay_tlb_tasks[i];
+ }
+ if (!ret) { // Force flush TLBs.
+  on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+  goto again;
+ }
+ return ret;
+}
+
+static void init_delay_task(struct mm_struct* mm)
+{
+ cpus_clear(delay_task->cpu_mask);
+ vma_index = 0;
+ delay_task->mm = mm;
+}
+
+/*
+ * We will be working on the mm, so force a flush of its pending tasks first.
+ */
+static void start_tlb_tasks(struct mm_struct* mm)
+{
+ int i, flag = 0;
+again:
+ for (i = 0; i < 32; i++) {
+  if (delay_tlb_tasks[i].mm == mm) {
+   if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+    mmput(delay_tlb_tasks[i].mm);
+    delay_tlb_tasks[i].mm = NULL;
+   } else
+    flag = 1;
+  }
+ }
+ if (flag) { // Force flush TLBs.
+  on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+  goto again;
+ }
+ BUG_ON(delay_task != NULL);
+ delay_task = search_free_tlb_tasks_slot();
+ init_delay_task(mm);
+}
+
+static void end_tlb_tasks(void)
+{
+ atomic_inc(&delay_task->mm->mm_users);
+ delay_task->cpu_mask = delay_task->mm->cpu_vm_mask;
+ delay_task = NULL;
+#ifndef CONFIG_SMP
+ timer_flush_tlb_tasks(NULL);
+#endif
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr,
+  unsigned long end)
+{
+ struct mm_struct* mm;
+ // First, try to combine the task with the previous.
+ if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma &&
+   delay_task->end[vma_index - 1] == addr) {
+  delay_task->end[vma_index - 1] = end;
+  return;
+ }
+fill_it:
+ if (vma_index != 32) {
+  delay_task->vma[vma_index] = vma;
+  delay_task->start[vma_index] = addr;
+  delay_task->end[vma_index] = end;
+  vma_index++;
+  return;
+ }
+ mm = delay_task->mm;
+ end_tlb_tasks();
+
+ delay_task = search_free_tlb_tasks_slot();
+ init_delay_task(mm);
+ goto fill_it;
+}
+
+static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct*
+  mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr,
+  unsigned long end)
+{
+ int i, statistic;
+ spinlock_t* ptl = pte_lockptr(mm, pmd);
+ pte_t* pte = pte_offset_map(pmd, addr);
+ int anon_rss = 0;
+ struct pagevec freed_pvec;
+ int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO));
+ struct address_space* mapping = &swapper_space;
+
+ pagevec_init(&freed_pvec, 1);
+ do {
+  memset(&series, 0, sizeof(struct series_t));
+  find_series(&pte, &addr, end);
+  if (sc->pps_cmd == 0 && series.series_stage == 5)
+   continue;
+  switch (series.series_stage) {
+   case 1: // PTE -- untouched PTE.
+    for (i = 0; i < series.series_length; i++) {
+     struct page* page = series.pages[i];
+     lock_page(page);
+     spin_lock(ptl);
+     if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) {
+      if (pte_dirty(*series.ptes[i]))
+       set_page_dirty(page);
+      set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i],
+        pte_mkold(pte_mkclean(*series.ptes[i])));
+     }
+     spin_unlock(ptl);
+     unlock_page(page);
+    }
+    fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE * series.series_length));
+    break;
+   case 2: // untouched PTE -- UnmappedPTE.
+    /*
+     * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so
+     * if it's still clear here, we can shift it to Unmapped type.
+     *
+     * If some architecture doesn't support atomic cmpxchg
+     * instruction or can't atomically set the access bit after
+     * they touch a pte at first, combine stage 1 with stage 2, and
+     * send IPI immediately in fill_in_tlb_tasks.
+     */
+    spin_lock(ptl);
+    statistic = 0;
+    for (i = 0; i < series.series_length; i++) {
+     if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) {
+      pte_t pte_unmapped = series.orig_ptes[i];
+      pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+      pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+      if (cmpxchg(&series.ptes[i]->pte_low,
+         series.orig_ptes[i].pte_low,
+         pte_unmapped.pte_low) !=
+        series.orig_ptes[i].pte_low)
+       continue;
+      page_remove_rmap(series.pages[i]);
+      anon_rss--;
+      statistic++;
+     }
+    }
+    atomic_add(statistic, &pps_info.unmapped_count);
+    atomic_sub(statistic, &pps_info.pte_count);
+    spin_unlock(ptl);
+    break;
+   case 3: // Attach SwapPage to PrivatePage.
+    /*
+     * A better algorithm should be applied to Linux SwapDevice to
+     * allocate fake contiguous SwapPages which are close to each
+     * other, i.e. the offset between two close SwapPages is less than 8.
+     */
+    if (sc->may_swap) {
+     for (i = 0; i < series.series_length; i++) {
+      lock_page(series.pages[i]);
+      if (!PageSwapCache(series.pages[i])) {
+       if (!add_to_swap(series.pages[i], GFP_ATOMIC)) {
+        unlock_page(series.pages[i]);
+        break;
+       }
+      }
+      unlock_page(series.pages[i]);
+     }
+    }
+    break;
+   case 4: // SwapPage isn't consistent with PrivatePage.
+    /*
+     * A mini version pageout().
+     *
+     * Current swap space can't commit multiple pages together:(
+     */
+    if (sc->may_writepage && may_enter_fs) {
+     for (i = 0; i < series.series_length; i++) {
+      struct page* page = series.pages[i];
+      int res;
+
+      if (!may_write_to_queue(mapping->backing_dev_info))
+       break;
+      lock_page(page);
+      if (!PageDirty(page) || PageWriteback(page)) {
+       unlock_page(page);
+       continue;
+      }
+      clear_page_dirty_for_io(page);
+      struct writeback_control wbc = {
+       .sync_mode = WB_SYNC_NONE,
+       .nr_to_write = SWAP_CLUSTER_MAX,
+       .nonblocking = 1,
+       .for_reclaim = 1,
+      };
+      page_cache_get(page);
+      SetPageReclaim(page);
+      res = swap_writepage(page, &wbc);
+      if (res < 0) {
+       handle_write_error(mapping, page, res);
+       ClearPageReclaim(page);
+       page_cache_release(page);
+       break;
+      }
+      if (!PageWriteback(page))
+       ClearPageReclaim(page);
+      page_cache_release(page);
+     }
+    }
+    break;
+   case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
+    statistic = 0;
+    for (i = 0; i < series.series_length; i++) {
+     struct page* page = series.pages[i];
+     lock_page(page);
+     spin_lock(ptl);
+     if (unlikely(!pte_same(*series.ptes[i], series.orig_ptes[i]))) {
+      spin_unlock(ptl);
+      unlock_page(page);
+      continue;
+     }
+     statistic++;
+     swp_entry_t entry = { .val = page_private(page) };
+     swap_duplicate(entry);
+     pte_t pte_swp = swp_entry_to_pte(entry);
+     set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i], pte_swp);
+     spin_unlock(ptl);
+     if (PageSwapCache(page) && !PageWriteback(page))
+      delete_from_swap_cache(page);
+     unlock_page(page);
+
+     if (!pagevec_add(&freed_pvec, page))
+      __pagevec_release_nonlru(&freed_pvec);
+     sc->nr_reclaimed++;
+    }
+    atomic_add(statistic, &pps_info.swapped_count);
+    atomic_sub(statistic, &pps_info.unmapped_count);
+    atomic_sub(statistic, &pps_info.total);
+    break;
+   case 6:
+    // NULL operation!
+    break;
+  }
+ } while (addr < end);
+ add_mm_counter(mm, anon_rss, anon_rss);
+ if (pagevec_count(&freed_pvec))
+  __pagevec_release_nonlru(&freed_pvec);
+}
+
+static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct*
+  mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr,
+  unsigned long end)
+{
+ unsigned long next;
+ pmd_t* pmd = pmd_offset(pud, addr);
+ do {
+  next = pmd_addr_end(addr, end);
+  if (pmd_none_or_clear_bad(pmd))
+   continue;
+  shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+ } while (pmd++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct*
+  mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr,
+  unsigned long end)
+{
+ unsigned long next;
+ pud_t* pud = pud_offset(pgd, addr);
+ do {
+  next = pud_addr_end(addr, end);
+  if (pud_none_or_clear_bad(pud))
+   continue;
+  shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+ } while (pud++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct*
+  mm, struct vm_area_struct* vma)
+{
+ unsigned long next;
+ unsigned long addr = vma->vm_start;
+ unsigned long end = vma->vm_end;
+ pgd_t* pgd = pgd_offset(mm, addr);
+ do {
+  next = pgd_addr_end(addr, end);
+  if (pgd_none_or_clear_bad(pgd))
+   continue;
+  shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
+ } while (pgd++, addr = next, addr != end);
+}
+
+static void shrink_private_vma(struct scan_control* sc)
+{
+ struct vm_area_struct* vma;
+ struct list_head *pos;
+ struct mm_struct *prev, *mm;
+
+ prev = mm = &init_mm;
+ pos = &init_mm.mmlist;
+ atomic_inc(&prev->mm_users);
+ spin_lock(&mmlist_lock);
+ while ((pos = pos->next) != &init_mm.mmlist) {
+  mm = list_entry(pos, struct mm_struct, mmlist);
+  if (!atomic_add_unless(&mm->mm_users, 1, 0))
+   continue;
+  spin_unlock(&mmlist_lock);
+  mmput(prev);
+  prev = mm;
+  start_tlb_tasks(mm);
+  if (down_read_trylock(&mm->mmap_sem)) {
+   for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+    if (!(vma->vm_flags & VM_PURE_PRIVATE))
+     continue;
+    if (vma->vm_flags & VM_LOCKED)
+     continue;
+    shrink_pvma_pgd_range(sc, mm, vma);
+   }
+   up_read(&mm->mmap_sem);
+  }
+  end_tlb_tasks();
+  spin_lock(&mmlist_lock);
+ }
+ spin_unlock(&mmlist_lock);
+ mmput(prev);
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1557,6 +1982,10 @@
  sc.may_swap = 1;
  sc.nr_mapped = read_page_state(nr_mapped);
 
+ wakeup_sc = sc;
+ wakeup_sc.pps_cmd = 1;
+ wake_up_interruptible(&kppsd_wait);
+
  inc_page_state(pageoutrun);
 
  for (i = 0; i < pgdat->nr_zones; i++) {
@@ -1693,6 +2122,33 @@
  return total_reclaimed;
 }
 
+static int kppsd(void* p)
+{
+ struct task_struct *tsk = current;
+ int timeout;
+ DEFINE_WAIT(wait);
+ daemonize("kppsd");
+ tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+ struct scan_control default_sc;
+ default_sc.gfp_mask = GFP_KERNEL;
+ default_sc.may_writepage = 1;
+ default_sc.may_swap = 1;
+ default_sc.pps_cmd = 0;
+
+ while (1) {
+  try_to_freeze();
+  prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
+  timeout = schedule_timeout(2000);
+  finish_wait(&kppsd_wait, &wait);
+
+  if (timeout)
+   shrink_private_vma(&wakeup_sc);
+  else
+   shrink_private_vma(&default_sc);
+ }
+ return 0;
+}
+
 /*
  * The background pageout daemon, started as a kernel thread
  * from the init process. 
@@ -1837,6 +2293,15 @@
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
+static int __init kppsd_init(void)
+{
+ init_waitqueue_head(&kppsd_wait);
+ kernel_thread(kppsd, NULL, CLONE_KERNEL);
+ return 0;
+}
+
+module_init(kppsd_init)
+
 static int __init kswapd_init(void)
 {
  pg_data_t *pgdat;
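
Editorial illustration (not part of the patch): the stage classification that
get_series_stage and find_series perform in mm/vmscan.c can be sketched in
user space as follows. This is only a minimal model under stated assumptions:
struct pte_model and its boolean fields are hypothetical stand-ins for the real
pte bits and page flags, and the ZERO_PAGE special case (stage 7) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one page-table entry and its page. */
struct pte_model {
    bool present;      /* hardware present bit */
    bool young;        /* hardware access bit */
    bool unmapped;     /* UnmappedPTE marker (present is 0) */
    bool in_swapcache; /* a SwapEntry is already linked to the page */
    bool dirty_or_wb;  /* page is dirty or under writeback */
};

/* Return the pps stage (1..6) that should handle this entry next. */
static int pps_stage(const struct pte_model *p)
{
    if (p->present)
        return p->young ? 1 : 2;     /* clear access bit / unmap */
    if (p->unmapped) {
        if (!p->in_swapcache)
            return 3;                /* link a SwapEntry */
        return p->dirty_or_wb ? 4 : 5; /* write out / reclaim */
    }
    return 6;                        /* SwappedPTE, nothing to do */
}

/* Length of the run starting at 'start' whose entries share one stage. */
static int pps_series_length(const struct pte_model *ptes, int n,
                             int start, int max_len)
{
    int stage, len = 1;

    assert(start >= 0 && start < n && max_len >= 1);
    stage = pps_stage(&ptes[start]);
    while (start + len < n && len < max_len &&
           pps_stage(&ptes[start + len]) == stage)
        len++;
    return len;
}
```

Grouping neighbours that sit in the same stage is what lets each stage act on
a whole run of pages at once, for example one flush_tlb_range task per series.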

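Editorial illustration (not part of the patch): the dftlb bookkeeping described
in section <Delay to Flush TLB> can be modelled in user space roughly as below.
It is a single-threaded sketch under stated assumptions: the names dftlb_task,
dftlb_queue and dftlb_timer_tick are hypothetical, a 32-bit word stands in for
cpumask_t, and a slot is released as soon as its mask empties, whereas the
patch releases it lazily in search_free_tlb_tasks_slot.

```c
#include <assert.h>
#include <stdint.h>

#define DFTLB_SLOTS 4

/* One pending flush: which CPUs still have to flush their TLB. */
struct dftlb_task {
    int in_use;
    uint32_t cpu_mask;
};

static struct dftlb_task dftlb_tasks[DFTLB_SLOTS];

/* Queue a flush for the CPUs in 'cpus'. Returns the slot index, or -1
 * if there is nothing to flush or no free slot (the patch forces an
 * immediate flush on all CPUs in the latter case). */
static int dftlb_queue(uint32_t cpus)
{
    int i;

    if (cpus == 0)
        return -1;
    for (i = 0; i < DFTLB_SLOTS; i++) {
        if (!dftlb_tasks[i].in_use) {
            dftlb_tasks[i].in_use = 1;
            dftlb_tasks[i].cpu_mask = cpus;
            return i;
        }
    }
    return -1;
}

/* Simulated timer tick on 'cpu': clear this CPU from every pending
 * task. Returns 1 if the CPU has to flush its local TLB once. */
static int dftlb_timer_tick(unsigned int cpu)
{
    uint32_t bit;
    int i, need_flush = 0;

    assert(cpu < 32);
    bit = UINT32_C(1) << cpu;
    for (i = 0; i < DFTLB_SLOTS; i++) {
        if (dftlb_tasks[i].in_use && (dftlb_tasks[i].cpu_mask & bit)) {
            dftlb_tasks[i].cpu_mask &= ~bit;
            need_flush = 1;
            if (dftlb_tasks[i].cpu_mask == 0)
                dftlb_tasks[i].in_use = 0; /* every CPU has flushed */
        }
    }
    return need_flush;
}
```

The point of the design is that several queued tasks cost a CPU at most one
local flush per tick and no IPI at all, at the price of a delay of up to one
timer period before the unmap becomes visible everywhere.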