linux-kernel - Re: [PATCH v7] mm: support madvise(MADV

Message-ID: <CAHO5Pa2o8XAabF31s1LdPMwMziLYxn6__wBo5bk5Q6zwLm-WyA@mail.gmail.com>
Date:	Thu, 22 May 2014 12:49:47 +0200
From:	Michael Kerrisk <mtk.manpages@...il.com>
To:	Minchan Kim <minchan@...nel.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	Dave Hansen <dave.hansen@...el.com>,
	John Stultz <john.stultz@...aro.org>,
	Zhang Yanfei <zhangyanfei@...fujitsu.com>,
	Hugh Dickins <hughd@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Rik van Riel <riel@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>, Jason Evans <je@...com>,
	Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH v7] mm: support madvise(MADV_FREE)

Hi Minchan,

On Mon, May 19, 2014 at 5:16 AM, Minchan Kim <minchan@...nel.org> wrote:
> Linux has no way to free pages lazily, while other OSes already
> support this through madvise(MADV_FREE).

Since this patch changes the ABI, could you please CC future
iterations to linux-api@...r.kernel.org as per
Documentation/SubmitChecklist.

Thanks,

Michael


> The gain is clear: under memory pressure, the kernel can discard freed
> pages rather than swapping them out or triggering the OOM killer.
>
> Without memory pressure, freed pages are reused by userspace
> without additional overhead (e.g., page fault + allocation
> + zeroing).
>
> It works as follows.
>
> When the madvise syscall is called, the VM clears the dirty bit of the
> ptes in the range. If memory pressure happens, the VM checks the dirty
> bit in the page table, and if it is still "clean", the page is a
> "lazyfree" page, so the VM can discard it instead of swapping it out.
> If there was a store to the page before the VM picks it for reclaim,
> the dirty bit is set, so the VM swaps the page out instead of
> discarding it.
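The dirty-bit rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct model_pte, the model_* functions, and the enum are hypothetical names invented here, not kernel structures or code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one anonymous page's pte; not a kernel structure. */
struct model_pte {
	bool present;	/* page is currently mapped */
	bool dirty;	/* a store has hit the page since the bit was last cleared */
};

enum reclaim_action { RECLAIM_SKIP, RECLAIM_DISCARD, RECLAIM_SWAP_OUT };

/* madvise(MADV_FREE): clear the dirty bit but leave the page mapped. */
void model_madvise_free(struct model_pte *pte)
{
	if (pte->present)
		pte->dirty = false;
}

/* A store from userspace sets the dirty bit again. */
void model_store(struct model_pte *pte)
{
	assert(pte->present);
	pte->dirty = true;
}

/* Memory pressure: a still-clean page is discarded, a dirty one is swapped out. */
enum reclaim_action model_reclaim(struct model_pte *pte)
{
	if (!pte->present)
		return RECLAIM_SKIP;
	pte->present = false;
	return pte->dirty ? RECLAIM_SWAP_OUT : RECLAIM_DISCARD;
}
```

In this model, a page that is lazily freed and then left alone is discarded on reclaim, while a page that is written again after the madvise call is swapped out, which matches the description quoted above.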
>
> The first heavy users would be general-purpose allocators (e.g.,
> jemalloc and tcmalloc, and hopefully glibc will support it too);
> jemalloc and tcmalloc already support the feature on other OSes
> (e.g., FreeBSD).
>
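From the userspace side, an allocator could purge an unused range roughly as sketched below. purge_range() is a hypothetical helper written for illustration and is not taken from jemalloc or tcmalloc. MADV_FREE is used only if <sys/mman.h> defines it; the value 5 proposed in this patch is deliberately not hard-coded.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: hand an unused, page-aligned anonymous range back
 * to the kernel.
 * Returns 1 if MADV_FREE was accepted, 0 if MADV_DONTNEED was used
 * instead, and -1 if both calls failed.
 */
int purge_range(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
#ifdef MADV_FREE
	/* Lazy free: the pages stay mapped until the kernel needs memory. */
	if (madvise(addr, len, MADV_FREE) == 0)
		return 1;
#endif
	/* Fallback: drop the pages right away. */
	return madvise(addr, len, MADV_DONTNEED) == 0 ? 0 : -1;
}
```

After either call the range stays mapped and may be written again. The old contents must be treated as undefined from that point on.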
> barrios@...ptop:~/benchmark/ebizzy$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 42
> Stepping:              7
> CPU MHz:               2801.000
> BogoMIPS:              5581.64
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-3
>
> ebizzy benchmark(./ebizzy -S 10 -n 512)
>
>  vanilla-jemalloc               MADV_free-jemalloc
>
> 1 thread
> records:  10              records:  10
> avg:      7682.10         avg:      15306.10
> std:      62.35(0.81%)    std:      347.99(2.27%)
> max:      7770.00         max:      15622.00
> min:      7598.00         min:      14772.00
>
> 2 thread
> records:  10              records:  10
> avg:      12747.50        avg:      24171.00
> std:      792.06(6.21%)   std:      895.18(3.70%)
> max:      13337.00        max:      26023.00
> min:      10535.00        min:      23152.00
>
> 4 thread
> records:  10              records:  10
> avg:      16474.60        avg:      33717.90
> std:      1496.45(9.08%)  std:      2008.97(5.96%)
> max:      17877.00        max:      35958.00
> min:      12224.00        min:      29565.00
>
> 8 thread
> records:  10              records:  10
> avg:      16778.50        avg:      33308.10
> std:      825.53(4.92%)   std:      1668.30(5.01%)
> max:      17543.00        max:      36010.00
> min:      14576.00        min:      29577.00
>
> 16 thread
> records:  10              records:  10
> avg:      20614.40        avg:      35516.30
> std:      602.95(2.92%)   std:      1283.65(3.61%)
> max:      21753.00        max:      37178.00
> min:      19605.00        min:      33217.00
>
> 32 thread
> records:  10              records:  10
> avg:      22771.70        avg:      36018.50
> std:      598.94(2.63%)   std:      1046.76(2.91%)
> max:      24035.00        max:      37266.00
> min:      22108.00        min:      34149.00
>
> In summary, MADV_FREE is about 2 times faster than MADV_DONTNEED.
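The summary figure can be rechecked from the quoted averages. The sketch below copies the "avg" values from the ebizzy table above; struct ebizzy_row, ebizzy_rows, and speedup() are names made up for this illustration.

```c
#include <assert.h>

/* One row per thread count, using the "avg" values quoted above. */
struct ebizzy_row {
	int threads;
	double vanilla_avg;	/* vanilla-jemalloc */
	double madv_free_avg;	/* MADV_free-jemalloc */
};

const struct ebizzy_row ebizzy_rows[] = {
	{  1,  7682.10, 15306.10 },
	{  2, 12747.50, 24171.00 },
	{  4, 16474.60, 33717.90 },
	{  8, 16778.50, 33308.10 },
	{ 16, 20614.40, 35516.30 },
	{ 32, 22771.70, 36018.50 },
};

/* Throughput ratio of the MADV_FREE build over the vanilla build. */
double speedup(const struct ebizzy_row *row)
{
	assert(row->vanilla_avg > 0.0);
	return row->madv_free_avg / row->vanilla_avg;
}
```

With these numbers the ratio stays between roughly 1.9 and 2.05 up to 8 threads and falls to about 1.7 at 16 threads and about 1.6 at 32 threads.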
>
> * From v6
>  * Remove page from swapcache at syscall time
>  * Move utility functions from memory.c to madvise.c - Johannes
>  * Rename utility functions - Johannes
>  * Remove unnecessary checks from vmscan.c - Johannes
>  * Rebased on v3.15-rc5-mmotm-2014-05-16-16-56
>  * Drop Reviewed-by because there were some changes since then.
>
> * From v5
>  * Fix PPC problem where the TLB was not flushed - Rik
>  * Remove unnecessary lazyfree_range stub function - Rik
>  * Rebased on v3.15-rc5
>
> * From v4
>  * Add Reviewed-by: Zhang Yanfei
>  * Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
>
> * From v3
>  * Add "how to work part" in description - Zhang
>  * Add page_discardable utility function - Zhang
>  * Clean up
>
> * From v2
>  * Remove forceful dirty marking of pages read in from swap - Johannes
>  * Remove deactivation logic of lazyfreed page
>  * Rebased on 3.14
>  * Remove RFC tag
>
> * From v1
>  * Use custom page table walker for madvise_free - Johannes
>  * Remove PG_lazypage flag - Johannes
>  * Do madvise_dontneed instead of madvise_free on a swapless system
>
> Cc: Hugh Dickins <hughd@...gle.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: Rik van Riel <riel@...hat.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Jason Evans <je@...com>
> Cc: Zhang Yanfei <zhangyanfei@...fujitsu.com>
> Signed-off-by: Minchan Kim <minchan@...nel.org>
> ---
>  include/linux/rmap.h                   |   8 +-
>  include/linux/vm_event_item.h          |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 174 +++++++++++++++++++++++++++++++++
>  mm/rmap.c                              |  34 ++++++-
>  mm/vmscan.c                            |  37 +++++--
>  mm/vmstat.c                            |   1 +
>  7 files changed, 245 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 9be55c7617da..1fb2beb351f8 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -182,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
>   * Called from mm/vmscan.c to handle paging out
>   */
>  int page_referenced(struct page *, int is_locked,
> -                       struct mem_cgroup *memcg, unsigned long *vm_flags);
> +                       struct mem_cgroup *memcg, unsigned long *vm_flags,
> +                       int *is_dirty);
>
>  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
> @@ -261,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
>
>  static inline int page_referenced(struct page *page, int is_locked,
>                                   struct mem_cgroup *memcg,
> -                                 unsigned long *vm_flags)
> +                                 unsigned long *vm_flags,
> +                                 int *is_pte_dirty)
>  {
>         *vm_flags = 0;
> +       if (is_pte_dirty)
> +               *is_pte_dirty = 0;
>         return 0;
>  }
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index ced92345c963..e2d3fb1e9814 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 FOR_ALL_ZONES(PGALLOC),
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
> +               PGLAZYFREED,
>                 FOR_ALL_ZONES(PGREFILL),
>                 FOR_ALL_ZONES(PGSTEAL_KSWAPD),
>                 FOR_ALL_ZONES(PGSTEAL_DIRECT),
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index ddc3b36f1046..7a94102b7a02 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -34,6 +34,7 @@
>  #define MADV_SEQUENTIAL        2               /* expect sequential page references */
>  #define MADV_WILLNEED  3               /* will need these pages */
>  #define MADV_DONTNEED  4               /* don't need these pages */
> +#define MADV_FREE      5               /* free pages only if memory pressure */
>
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_REMOVE    9               /* remove these pages & resources */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a402f8fdc68e..3085441c484c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,9 @@
>  #include <linux/blkdev.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/mmu_notifier.h>
> +
> +#include <asm/tlb.h>
>
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
> @@ -31,6 +34,7 @@ static int madvise_need_mmap_write(int behavior)
>         case MADV_REMOVE:
>         case MADV_WILLNEED:
>         case MADV_DONTNEED:
> +       case MADV_FREE:
>                 return 0;
>         default:
>                 /* be safe, default to 1. list exceptions explicitly */
> @@ -251,6 +255,168 @@ static long madvise_willneed(struct vm_area_struct *vma,
>         return 0;
>  }
>
> +static unsigned long madvise_free_pte_range(struct mmu_gather *tlb,
> +                               struct vm_area_struct *vma, pmd_t *pmd,
> +                               unsigned long addr, unsigned long end)
> +{
> +       struct mm_struct *mm = tlb->mm;
> +       spinlock_t *ptl;
> +       pte_t *start_pte;
> +       pte_t *pte;
> +       struct page *page;
> +
> +       start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +       pte = start_pte;
> +       arch_enter_lazy_mmu_mode();
> +       do {
> +               pte_t ptent = *pte;
> +
> +               if (pte_none(ptent))
> +                       continue;
> +
> +               if (!pte_present(ptent))
> +                       continue;
> +
> +               page = vm_normal_page(vma, addr, ptent);
> +               if (page && PageSwapCache(page)) {
> +                       if (trylock_page(page)) {
> +                               if (try_to_free_swap(page))
> +                                       ClearPageDirty(page);
> +                               unlock_page(page);
> +                       } else
> +                               continue;
> +               }
> +
> +               /*
> +                * Some of architecture(ex, PPC) don't update TLB
> +                * with set_pte_at and tlb_remove_tlb_entry so for
> +                * the portability, remap the pte with old|clean
> +                * after pte clearing.
> +                */
> +               ptent = ptep_get_and_clear_full(mm, addr, pte,
> +                                               tlb->fullmm);
> +               ptent = pte_mkold(ptent);
> +               ptent = pte_mkclean(ptent);
> +               set_pte_at(mm, addr, pte, ptent);
> +               tlb_remove_tlb_entry(tlb, pte, addr);
> +       } while (pte++, addr += PAGE_SIZE, addr != end);
> +       arch_leave_lazy_mmu_mode();
> +       pte_unmap_unlock(start_pte, ptl);
> +
> +       return addr;
> +}
> +
> +static inline unsigned long madvise_free_pmd_range(struct mmu_gather *tlb,
> +                               struct vm_area_struct *vma, pud_t *pud,
> +                               unsigned long addr, unsigned long end)
> +{
> +       pmd_t *pmd;
> +       unsigned long next;
> +
> +       pmd = pmd_offset(pud, addr);
> +       do {
> +               /*
> +                * XXX: We can optimize with supporting Hugepage free
> +                * if the range covers.
> +                */
> +               next = pmd_addr_end(addr, end);
> +               if (pmd_trans_huge(*pmd))
> +                       split_huge_page_pmd(vma, addr, pmd);
> +               /*
> +                * Here there can be other concurrent MADV_DONTNEED or
> +                * trans huge page faults running, and if the pmd is
> +                * none or trans huge it can change under us. This is
> +                * because MADV_LAZYFREE holds the mmap_sem in read
> +                * mode.
> +                */
> +               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> +                       goto next;
> +               next = madvise_free_pte_range(tlb, vma, pmd, addr, next);
> +next:
> +               cond_resched();
> +       } while (pmd++, addr = next, addr != end);
> +
> +       return addr;
> +}
> +
> +static inline unsigned long madvise_free_pud_range(struct mmu_gather *tlb,
> +                               struct vm_area_struct *vma, pgd_t *pgd,
> +                               unsigned long addr, unsigned long end)
> +{
> +       pud_t *pud;
> +       unsigned long next;
> +
> +       pud = pud_offset(pgd, addr);
> +       do {
> +               next = pud_addr_end(addr, end);
> +               if (pud_none_or_clear_bad(pud))
> +                       continue;
> +               next = madvise_free_pmd_range(tlb, vma, pud, addr, next);
> +       } while (pud++, addr = next, addr != end);
> +
> +       return addr;
> +}
> +
> +static void madvise_free_page_range(struct mmu_gather *tlb,
> +                            struct vm_area_struct *vma,
> +                            unsigned long addr, unsigned long end)
> +{
> +       pgd_t *pgd;
> +       unsigned long next;
> +
> +       BUG_ON(addr >= end);
> +       tlb_start_vma(tlb, vma);
> +       pgd = pgd_offset(vma->vm_mm, addr);
> +       do {
> +               next = pgd_addr_end(addr, end);
> +               if (pgd_none_or_clear_bad(pgd))
> +                       continue;
> +               next = madvise_free_pud_range(tlb, vma, pgd, addr, next);
> +       } while (pgd++, addr = next, addr != end);
> +       tlb_end_vma(tlb, vma);
> +}
> +
> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> +                       unsigned long start_addr, unsigned long end_addr)
> +{
> +       unsigned long start, end;
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct mmu_gather tlb;
> +
> +       if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +               return -EINVAL;
> +
> +       /* MADV_FREE works for only anon vma at the moment */
> +       if (vma->vm_file)
> +               return -EINVAL;
> +
> +       start = max(vma->vm_start, start_addr);
> +       if (start >= vma->vm_end)
> +               return -EINVAL;
> +       end = min(vma->vm_end, end_addr);
> +       if (end <= vma->vm_start)
> +               return -EINVAL;
> +
> +       lru_add_drain();
> +       tlb_gather_mmu(&tlb, mm, start, end);
> +       update_hiwater_rss(mm);
> +
> +       mmu_notifier_invalidate_range_start(mm, start, end);
> +       madvise_free_page_range(&tlb, vma, start, end);
> +       mmu_notifier_invalidate_range_end(mm, start, end);
> +       tlb_finish_mmu(&tlb, start, end);
> +
> +       return 0;
> +}
> +
> +static long madvise_free(struct vm_area_struct *vma,
> +                            struct vm_area_struct **prev,
> +                            unsigned long start, unsigned long end)
> +{
> +       *prev = vma;
> +       return madvise_free_single_vma(vma, start, end);
> +}
> +
>  /*
>   * Application no longer needs these pages.  If the pages are dirty,
>   * it's OK to just throw them away.  The app will be more careful about
> @@ -384,6 +550,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                 return madvise_remove(vma, prev, start, end);
>         case MADV_WILLNEED:
>                 return madvise_willneed(vma, prev, start, end);
> +       case MADV_FREE:
> +               /*
> +                * XXX: In this implementation, MADV_FREE works like
> +                * MADV_DONTNEED on swapless system or full swap.
> +                */
> +               if (get_nr_swap_pages() > 0)
> +                       return madvise_free(vma, prev, start, end);
>         case MADV_DONTNEED:
>                 return madvise_dontneed(vma, prev, start, end);
>         default:
> @@ -403,6 +576,7 @@ madvise_behavior_valid(int behavior)
>         case MADV_REMOVE:
>         case MADV_WILLNEED:
>         case MADV_DONTNEED:
> +       case MADV_FREE:
>  #ifdef CONFIG_KSM
>         case MADV_MERGEABLE:
>         case MADV_UNMERGEABLE:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3333baab6ece..6f69055311da 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -657,6 +657,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>  }
>
>  struct page_referenced_arg {
> +       int dirtied;
>         int mapcount;
>         int referenced;
>         unsigned long vm_flags;
> @@ -671,6 +672,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>         struct mm_struct *mm = vma->vm_mm;
>         spinlock_t *ptl;
>         int referenced = 0;
> +       int dirty = 0;
>         struct page_referenced_arg *pra = arg;
>
>         if (unlikely(PageTransHuge(page))) {
> @@ -723,6 +725,10 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                         if (likely(!(vma->vm_flags & VM_SEQ_READ)))
>                                 referenced++;
>                 }
> +
> +               if (pte_dirty(*pte))
> +                       dirty++;
> +
>                 pte_unmap_unlock(pte, ptl);
>         }
>
> @@ -731,6 +737,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                 pra->vm_flags |= vma->vm_flags;
>         }
>
> +       if (dirty)
> +               pra->dirtied++;
> +
>         pra->mapcount--;
>         if (!pra->mapcount)
>                 return SWAP_SUCCESS; /* To break the loop */
> @@ -755,6 +764,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>   * @is_locked: caller holds lock on the page
>   * @memcg: target memory cgroup
>   * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
> + * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
>   *
>   * Quick test_and_clear_referenced for all mappings to a page,
>   * returns the number of ptes which referenced the page.
> @@ -762,7 +772,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
>  int page_referenced(struct page *page,
>                     int is_locked,
>                     struct mem_cgroup *memcg,
> -                   unsigned long *vm_flags)
> +                   unsigned long *vm_flags,
> +                   int *is_pte_dirty)
>  {
>         int ret;
>         int we_locked = 0;
> @@ -777,6 +788,9 @@ int page_referenced(struct page *page,
>         };
>
>         *vm_flags = 0;
> +       if (is_pte_dirty)
> +               *is_pte_dirty = 0;
> +
>         if (!page_mapped(page))
>                 return 0;
>
> @@ -804,6 +818,9 @@ int page_referenced(struct page *page,
>         if (we_locked)
>                 unlock_page(page);
>
> +       if (is_pte_dirty)
> +               *is_pte_dirty = pra.dirtied;
> +
>         return pra.referenced;
>  }
>
> @@ -1142,6 +1159,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         spinlock_t *ptl;
>         int ret = SWAP_AGAIN;
>         enum ttu_flags flags = (enum ttu_flags)arg;
> +       int dirty = 0;
>
>         pte = page_check_address(page, mm, address, &ptl, 0);
>         if (!pte)
> @@ -1171,7 +1189,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         pteval = ptep_clear_flush(vma, address, pte);
>
>         /* Move the dirty bit to the physical page now the pte is gone. */
> -       if (pte_dirty(pteval))
> +       dirty = pte_dirty(pteval);
> +       if (dirty)
>                 set_page_dirty(page);
>
>         /* Update high watermark before we lower rss */
> @@ -1218,6 +1237,16 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>                         }
>                         dec_mm_counter(mm, MM_ANONPAGES);
>                         inc_mm_counter(mm, MM_SWAPENTS);
> +               } else if (TTU_ACTION(flags) == TTU_UNMAP) {
> +                       if (dirty || PageDirty(page)) {
> +                               set_pte_at(mm, address, pte, pteval);
> +                               ret = SWAP_FAIL;
> +                               goto out_unmap;
> +                       } else {
> +                               /* It's a freeable page by madvise_free */
> +                               dec_mm_counter(mm, MM_ANONPAGES);
> +                               goto discard;
> +                       }
>                 } else if (IS_ENABLED(CONFIG_MIGRATION)) {
>                         /*
>                          * Store the pfn of the page in a special migration
> @@ -1241,6 +1270,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         } else
>                 dec_mm_counter(mm, MM_FILEPAGES);
>
> +discard:
>         page_remove_rmap(page);
>         page_cache_release(page);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7f8504198d41..db6eda9163d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -707,13 +707,17 @@ enum page_references {
>  };
>
>  static enum page_references page_check_references(struct page *page,
> -                                                 struct scan_control *sc)
> +                                                 struct scan_control *sc,
> +                                                 bool *freeable)
>  {
>         int referenced_ptes, referenced_page;
>         unsigned long vm_flags;
> +       int pte_dirty;
> +
> +       VM_BUG_ON_PAGE(!PageLocked(page), page);
>
>         referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
> -                                         &vm_flags);
> +                                         &vm_flags, &pte_dirty);
>         referenced_page = TestClearPageReferenced(page);
>
>         /*
> @@ -754,6 +758,10 @@ static enum page_references page_check_references(struct page *page,
>                 return PAGEREF_KEEP;
>         }
>
> +       if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> +                       !PageDirty(page))
> +               *freeable = true;
> +
>         /* Reclaim if clean, defer dirty pages to writeback */
>         if (referenced_page && !PageSwapBacked(page))
>                 return PAGEREF_RECLAIM_CLEAN;
> @@ -823,6 +831,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                 int may_enter_fs;
>                 enum page_references references = PAGEREF_RECLAIM_CLEAN;
>                 bool dirty, writeback;
> +               bool freeable = false;
>
>                 cond_resched();
>
> @@ -945,7 +954,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                 }
>
>                 if (!force_reclaim)
> -                       references = page_check_references(page, sc);
> +                       references = page_check_references(page, sc,
> +                                                       &freeable);
>
>                 switch (references) {
>                 case PAGEREF_ACTIVATE:
> @@ -961,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                  * Anonymous process memory has backing store?
>                  * Try to allocate it some swap space here.
>                  */
> -               if (PageAnon(page) && !PageSwapCache(page)) {
> +               if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
>                         if (!(sc->gfp_mask & __GFP_IO))
>                                 goto keep_locked;
>                         if (!add_to_swap(page, page_list))
> @@ -976,7 +986,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                  * The page is mapped into the page tables of one or more
>                  * processes. Try to unmap it here.
>                  */
> -               if (page_mapped(page) && mapping) {
> +               if (page_mapped(page) && (mapping || freeable)) {
>                         switch (try_to_unmap(page, ttu_flags)) {
>                         case SWAP_FAIL:
>                                 goto activate_locked;
> @@ -985,7 +995,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         case SWAP_MLOCK:
>                                 goto cull_mlocked;
>                         case SWAP_SUCCESS:
> -                               ; /* try to free the page below */
> +                               /* try to free the page below */
> +                               if (!freeable)
> +                                       break;
> +                               /*
> +                                * Freeable anon page doesn't have mapping
> +                                * due to skipping of swapcache so we free
> +                                * page in here rather than __remove_mapping.
> +                                */
> +                               VM_BUG_ON_PAGE(PageSwapCache(page), page);
> +                               if (!page_freeze_refs(page, 1))
> +                                       goto keep_locked;
> +                               __clear_page_locked(page);
> +                               count_vm_event(PGLAZYFREED);
> +                               goto free_it;
>                         }
>                 }
>
> @@ -1723,7 +1746,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>                 }
>
>                 if (page_referenced(page, 0, sc->target_mem_cgroup,
> -                                   &vm_flags)) {
> +                                   &vm_flags, NULL)) {
>                         nr_rotated += hpage_nr_pages(page);
>                         /*
>                          * Identify referenced, file-backed active pages and
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index eef6321c8470..da18337c6c66 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -794,6 +794,7 @@ const char * const vmstat_text[] = {
>
>         "pgfault",
>         "pgmajfault",
> +       "pglazyfreed",
>
>         TEXTS_FOR_ZONES("pgrefill")
>         TEXTS_FOR_ZONES("pgsteal_kswapd")
> --
> 1.9.2
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/